|L1 cache||8 KB to 16 KB per core|
|L2 cache||128 KB to 2048 KB; 256 KB to 2048 KB (Xeon)|
|L3 cache||4 MB to 16 MB shared|
|Created||November 20, 2000|
|Transistors||42M 180 nm (B2, C1, D0, E0)|
The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in Intel's x86 family of CPUs. The first CPU to use it was the Willamette-core Pentium 4, released on November 20, 2000; all subsequent Pentium 4 and Pentium D variants were also based on NetBurst. In mid-2001, Intel released the Foster core, also based on NetBurst, which moved the Xeon line to the new architecture as well. Pentium 4-based Celeron CPUs also use the NetBurst architecture.
NetBurst was replaced with the Core microarchitecture, released in July 2006.
The NetBurst microarchitecture introduced several features, including Hyper-Threading, Hyper Pipelined Technology, and the Rapid Execution Engine.
Hyper-Threading is Intel's proprietary simultaneous multithreading (SMT) implementation, used to improve parallelization of computations (doing multiple tasks at once) on x86 microprocessors. Intel introduced it with NetBurst processors in 2002. It was absent from the Core microarchitecture, and Intel reintroduced it with the Nehalem microarchitecture.
Quad-Pumped Front-Side Bus
"Northwood" and "Willamette" feature an external front-side bus that runs at 100 MHz and is 64 bits wide. Because the bus is quad-pumped (four transfers per clock), it delivers a peak of 3.2 GB/s. The Intel i850 chipset with dual-channel RDRAM can provide a matching 3.2 GB/s of memory bandwidth. The "Presler" has an 800 MHz front-side bus (the effective, quad-pumped rate), 64 bits wide, capable of transferring 6.4 GB/s, and is paired with 800 MHz DDR2 memory.
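The bandwidth figures above follow from simple arithmetic: base clock times transfers per clock times bus width in bytes. The sketch below reproduces them. The function name and parameters are illustrative, and it assumes that the 800 MHz figure is the effective rate of a 200 MHz quad-pumped bus and that 1 GB means 10^9 bytes.

```python
def fsb_bandwidth_gb_s(base_clock_mhz, transfers_per_clock=4, bus_width_bits=64):
    """Peak front-side bus bandwidth in GB/s (1 GB = 10**9 bytes)."""
    transfers_per_second = base_clock_mhz * 1_000_000 * transfers_per_clock
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_second * bytes_per_transfer / 1e9


# 100 MHz base clock, quad-pumped, 64 bits wide
willamette = fsb_bandwidth_gb_s(100)

# Assumption: "800 MHz" is the effective rate, i.e. a 200 MHz base clock x 4
presler = fsb_bandwidth_gb_s(200)

print(f"Willamette/Northwood: {willamette:.1f} GB/s")
print(f"Presler: {presler:.1f} GB/s")
```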
Hyper Pipelined Technology
This is the name given to the 20-stage instruction pipeline within the Willamette core, a significant increase over the Pentium III, which had only 10 stages in its pipeline. The Prescott core has a 31-stage pipeline. The greater number of stages allows the CPU to reach higher clock speeds, which was thought to offset any loss in performance. A lower instructions per clock (IPC) figure is an indirect consequence of pipeline depth and a matter of design compromise (a small number of long pipelines has a lower IPC than a greater number of short pipelines). A deeper pipeline also increases the number of stages that must be flushed and refilled when the branch predictor makes a mistake, which raises the penalty paid for a misprediction. To address this issue, Intel devised the Rapid Execution Engine and invested a great deal in its branch prediction technology, which Intel claims reduces mispredictions by 33% over the Pentium III.
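The trade-off between clock speed and misprediction penalty can be illustrated with a textbook-style first-order model. This is a rough sketch, not Intel's analysis. Every number below is a hypothetical placeholder, and the flush penalty is assumed to equal the pipeline depth, which is a simplification.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction, including branch misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


def instructions_per_second(clock_ghz, cpi):
    """Throughput for a given clock frequency and average CPI."""
    return clock_ghz * 1e9 / cpi


# Hypothetical workload: 20% of instructions are branches, 5% are mispredicted.
BRANCH_FRACTION = 0.2
MISPREDICT_RATE = 0.05

# Hypothetical 10-stage design at 1.0 GHz vs. 20-stage design at 1.5 GHz.
shallow_cpi = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 10)
deep_cpi = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, 20)

shallow_ips = instructions_per_second(1.0, shallow_cpi)
deep_ips = instructions_per_second(1.5, deep_cpi)

print(f"shallow: CPI {shallow_cpi:.2f}, {shallow_ips / 1e9:.2f} G instr/s")
print(f"deep:    CPI {deep_cpi:.2f}, {deep_ips / 1e9:.2f} G instr/s")
```

Under these made-up inputs the deeper pipeline loses some IPC but still comes out ahead thanks to its higher clock. A higher misprediction rate or a smaller clock advantage would reverse the result.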
Rapid Execution Engine
With this technology, the two ALUs in the core of the CPU are double-pumped, meaning that they operate at twice the core clock frequency. For example, in a 3.8 GHz processor, the ALUs effectively operate at 7.6 GHz. This is intended to make up for the low IPC, and it considerably enhances the integer performance of the CPU. Intel also replaced the high-speed barrel shifter with a shift/rotate execution unit that operates at the same frequency as the CPU core. The downside is that certain instructions are much slower (relatively and absolutely) than before, which makes optimization for multiple target CPUs difficult. Shift and rotate operations are an example: they suffer from the lack of a barrel shifter, which was present on every x86 CPU beginning with the i386, including the main competing processor, the Athlon.
Execution Trace Cache
Within the L1 cache of the CPU, Intel incorporated its Execution Trace Cache. It stores decoded micro-operations, so that when an instruction is executed again, the CPU does not fetch and decode it a second time but reads the decoded micro-ops directly from the trace cache, saving considerable time. Moreover, the micro-ops are cached along their predicted path of execution, which means that when the CPU fetches them from the cache, they are already in the predicted order of execution.
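The idea can be sketched with a toy software model. It is only an analogy and does not describe the real hardware structure. The program table, the class name, and the three-instruction trace length are all hypothetical. Each trace is built once by following the predicted next address, and later fetches reuse it without decoding again.

```python
# Hypothetical toy program: address -> (decoded micro-ops, predicted next address)
PROGRAM = {
    0: (["load r1"], 1),
    1: (["cmp r1, 0", "jne 4"], 4),  # branch predicted taken, to address 4
    2: (["add r2, 1"], 3),
    3: (["store r2"], None),
    4: (["sub r2, 1"], 3),
}


class ToyTraceCache:
    """Caches micro-op traces laid out along the predicted execution path."""

    def __init__(self, program, max_instructions=3):
        self.program = program
        self.max_instructions = max_instructions
        self.traces = {}
        self.decodes = 0  # number of instructions that went through "decode"

    def _build_trace(self, start):
        trace, address = [], start
        for _ in range(self.max_instructions):
            if address is None:
                break
            uops, predicted_next = self.program[address]
            self.decodes += 1
            trace.extend(uops)
            address = predicted_next
        return trace

    def fetch(self, start):
        # Decode only on a miss; a hit returns the stored micro-ops directly.
        if start not in self.traces:
            self.traces[start] = self._build_trace(start)
        return self.traces[start]


cache = ToyTraceCache(PROGRAM)
first = cache.fetch(0)   # miss: decodes three instructions along the predicted path
second = cache.fetch(0)  # hit: no further decoding
print(first, cache.decodes)
```

Note that the cached trace skips address 2 entirely, because the branch at address 1 is predicted taken. If the prediction turned out to be wrong, a real processor would have to discard the trace's tail, a case this sketch does not model.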
The replay system is a little-known subsystem within the Intel Pentium 4 processor. Its primary function is to catch operations that have been mistakenly sent for execution by the processor's scheduler. Operations caught by the replay system are then re-executed in a loop until the conditions necessary for their proper execution have been fulfilled.
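The looping behaviour described above can be illustrated with a very small cycle-count model. It is a sketch under stated assumptions, not a description of the actual circuit. The function name, the fixed replay-loop length, and all cycle numbers are hypothetical.

```python
def run_with_replay(dispatch_cycle, data_ready_cycle, replay_loop_cycles):
    """Return (completion_cycle, replays) for one speculatively dispatched operation.

    The operation executes correctly only once its input data is ready.
    Each failed attempt sends it around a replay loop of fixed length
    (an assumed simplification) before it is tried again.
    """
    if replay_loop_cycles <= 0:
        raise ValueError("replay_loop_cycles must be positive")
    cycle = dispatch_cycle
    replays = 0
    while cycle < data_ready_cycle:
        cycle += replay_loop_cycles  # one trip around the replay loop
        replays += 1
    return cycle, replays


# Hypothetical case: the scheduler dispatched at cycle 2 expecting a cache hit,
# but the data only arrives at cycle 10, and each replay trip takes 3 cycles.
late = run_with_replay(dispatch_cycle=2, data_ready_cycle=10, replay_loop_cycles=3)

# If the data is ready at dispatch, no replay is needed.
on_time = run_with_replay(dispatch_cycle=10, data_ready_cycle=10, replay_loop_cycles=3)

print(late, on_time)
```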
Despite these enhancements, the NetBurst architecture created obstacles for engineers trying to scale up its performance. With this microarchitecture, Intel looked to attain clock speeds of 10 GHz, but as clock speeds rose, Intel faced increasing problems keeping power dissipation within acceptable limits. Intel reached a speed barrier of 3.8 GHz in November 2004 and encountered problems achieving even that. Intel abandoned NetBurst in 2006 once the heat problems became too severe, and developed the Core microarchitecture, inspired by the P6 core used from the Pentium Pro to the Tualatin Pentium III-S and, most directly, by the Pentium M.
|Revision||Processor Brand(s)||Pipeline stages|
|Willamette (180 nm)||Celeron, Pentium 4||20|
|Northwood (130 nm)||Celeron, Pentium 4, Pentium 4 HT||20|
|Gallatin (130 nm)||Pentium 4 HT Extreme Edition, Xeon||20|
|Prescott (90 nm)||Celeron D, Pentium 4, Pentium 4 HT, Pentium 4 Extreme Edition||31|
|Cedar Mill (65 nm)||Celeron D, Pentium 4||31|
|Smithfield (90 nm)||Pentium D||31|
|Presler (65 nm)||Pentium D||31|
Intel replaced the original Willamette core with a redesigned version of the NetBurst microarchitecture called Northwood in January 2002. The Northwood design combined an increased cache size, a smaller 130 nm fabrication process, and Hyper-Threading Technology (although initially all models but the 3.06 GHz model had this feature disabled) to produce a more modern, higher-performing version of the NetBurst microarchitecture.
In February 2004, Intel introduced Prescott, a more radical revision of the microarchitecture. The Prescott core was produced on a 90 nm process and included several major design changes: a larger cache (1 MB, up from 512 KB in Northwood, and 2 MB in Prescott 2M), a much deeper instruction pipeline (31 stages, compared to 20 in Northwood), a heavily improved branch predictor, the SSE3 instructions, and later Intel 64, Intel's branding for its compatible implementation of x86-64, the 64-bit extension of the x86 architecture. As with Hyper-Threading, all Prescott chips branded Pentium 4 HT have the hardware to support Intel 64, but it was initially enabled only on the high-end Xeon processors before being officially introduced in processors with the Pentium trademark. Power consumption and heat dissipation also became major issues with Prescott, which quickly became the hottest-running and most power-hungry of Intel's single-core x86 and x86-64 processors. Power and heat concerns prevented Intel from releasing a Prescott clocked above 3.8 GHz, or a mobile version of the core clocked above 3.46 GHz.
Intel also released a dual-core processor based on the NetBurst microarchitecture, branded Pentium D. The first Pentium D core, codenamed Smithfield, is two Prescott cores on a single die. It was followed by Presler, which consists of two Cedar Mill cores on two separate dies (Cedar Mill being the 65 nm die shrink of Prescott).
Intel had NetBurst-based successors in development, called Tejas and Jayhawk, with between 40 and 50 pipeline stages, but ultimately decided to replace NetBurst with the Core microarchitecture, released in July 2006, which was more directly derived from 1995's Pentium Pro (P6 microarchitecture). August 8, 2008 marked the end of Intel's NetBurst-based processors. The reason for NetBurst's abandonment was the severe heat problems caused by high clock speeds. While Core- and Nehalem-based processors have higher TDPs, most of them are multi-core, so each core gives off a fraction of the maximum TDP, and the highest-clocked Core-based single-core processors give off a maximum of 27 W of heat. The fastest-clocked desktop Pentium 4 processors (single-core) had TDPs of 115 W, compared to 88 W for the fastest-clocked mobile versions, although TDPs for some models were eventually lowered with the introduction of new steppings.
The Nehalem microarchitecture, the successor to the Core microarchitecture, was supposed to be an evolution of NetBurst according to Intel roadmaps dating back to 2000. Because NetBurst was abandoned, Nehalem became a completely different project, though it has some similarities with NetBurst. Nehalem reimplements the Hyper-Threading Technology first introduced in the 3.06 GHz Northwood core of the Pentium 4. Nehalem-based processors also implement an L3 cache. In a consumer processor, an L3 cache was first used in the Gallatin core of the Pentium 4 Extreme Edition, but it was missing from the Prescott 2M core of the same brand.