Pentium Pro: Difference between revisions

From Wikipedia, the free encyclopedia
Revision as of 07:28, 25 May 2009

Pentium Pro
200 MHz Pentium Pro with 256 KB of L2 cache
General information
Launched: November 1, 1995
Common manufacturer: Intel
Performance
Max. CPU clock rate: 150 MHz to 200 MHz
FSB speeds: 60 MHz to 66 MHz
Architecture and classification
Technology node: 0.35 µm to 0.50 µm
Microarchitecture: P6
Instruction set: x86
Physical specifications
Cores: 1
Socket: Socket 8

The Pentium Pro is a sixth-generation x86-based microprocessor developed and manufactured by Intel and introduced in November 1995. It introduced the P6 microarchitecture (sometimes referred to as i686) and was originally intended to replace the original Pentium in a full range of applications. While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors, respectively, the Pentium Pro contained 5.5 million transistors. Later, it was reduced to a narrower role as a server and high-end desktop chip and was used in supercomputers such as ASCI Red. The Pentium Pro was capable of both dual- and quad-processor configurations. It came in only one form factor, the relatively large rectangular Socket 8. The Pentium Pro was succeeded by the Pentium II Xeon in 1998.

Microarchitecture

200 MHz Pentium Pro with a 512 KB L2 cache in PGA package
200 MHz Pentium Pro with a 1 MB L2 cache in PPGA package
Uncapped Pentium Pro 256 KB
Pentium II Overdrive with heatsink removed. Flip-chip Deschutes core is on the left. 512 KB cache is on the right.[1]

Summary

Belying its name, the Pentium Pro had a completely new microarchitecture, a departure from the Pentium rather than an extension of it. The Pentium Pro (P6) featured many advanced concepts not found in the Pentium, although it was not the first or only x86 processor to do so (see NexGen Nx586 or Cyrix 6x86). The Pentium Pro pipeline employed extra decoding steps to translate IA-32 instructions dynamically into buffered micro-operation sequences, which could then be analysed, reordered, and renamed in order to detect parallelizable operations that could feed more than one execution unit at once. The Pentium Pro thus featured out-of-order execution, including speculative execution via register renaming. It also had a wider 36-bit address bus (usable by PAE).

The Pentium Pro has an 8 KB instruction cache, from which up to 16 bytes are fetched on each cycle and sent to the instruction decoders. There are three instruction decoders, which are not equal in capability: only one can decode any x86 instruction, while the other two can decode only simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit micro-operations (micro-ops). The micro-ops are RISC-like; that is, they encode an operation, two sources and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on memory (for example, adding a register to a memory location) can be processed only by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more than four micro-ops are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles.
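
The effect of the unequal decoders on decode throughput can be illustrated with a toy model. The sketch below is not a description of Intel's actual decode logic: it assumes strictly in-order decoding, ignores the 16-byte fetch limit, and assumes that the sequencer emits up to four micro-ops per cycle while nothing else decodes alongside it. The function name `decode_cycles` and the input format (a list of micro-op counts, one per instruction) are hypothetical.

```python
from math import ceil

GENERAL_MAX_UOPS = 4  # from the text: the general decoder emits up to four micro-ops per cycle
SIMPLE_DECODERS = 2   # from the text: two simple decoders, one micro-op each


def decode_cycles(uops_per_instruction):
    """Count cycles needed to decode an in-order instruction stream (toy model).

    Each entry is the number of micro-ops (>= 1) one x86 instruction expands to.
    """
    cycles = 0
    i = 0
    n = len(uops_per_instruction)
    while i < n:
        uops = uops_per_instruction[i]
        if uops > GENERAL_MAX_UOPS:
            # Assumption: the sequencer emits up to four micro-ops per cycle
            # and no other instruction is decoded in those cycles.
            cycles += ceil(uops / GENERAL_MAX_UOPS)
            i += 1
            continue
        # The first instruction of the group goes to the general decoder.
        i += 1
        # Following single-micro-op instructions fill the two simple decoders.
        filled = 0
        while filled < SIMPLE_DECODERS and i < n and uops_per_instruction[i] == 1:
            i += 1
            filled += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(decode_cycles([3, 1, 1, 3, 1, 1]))  # complex instructions lead each group
    print(decode_cycles([1, 3, 1, 3, 1, 3]))  # complex instructions keep stalling the group
```

Under these assumptions, the same six instructions decode in two cycles when each multi-micro-op instruction is followed by two simple ones, but take four cycles when the ordering forces every multi-micro-op instruction to wait for the general decoder.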

Micro-ops exit the ROB and enter a reservation station, where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has two integer units and one floating-point unit (FPU). One of the integer units shares the same ports as the FPU, so the Pentium Pro can dispatch only two integer micro-ops and one floating-point micro-op per cycle. Of the two integer units, only one has the full complement of functions such as a barrel shifter, multiplier and divider. The second integer unit, which shares paths with the FPU, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses.

The FPU executes floating-point operations. Addition and multiplication are pipelined and have latencies of three and five cycles, respectively. Division and square root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have latencies of 18 to 36 and 29 to 69 cycles, respectively. The smaller figure is for single-precision (32-bit) floating-point numbers and the larger for extended-precision (80-bit) numbers. Division and square root can operate simultaneously with adds and multiplies; the operations conflict only when a result has to be stored in the ROB.
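
The practical difference between pipelined and non-pipelined operations can be shown with simple arithmetic. The following is a rough sketch using two latencies quoted above (three cycles for addition, 18 cycles for single-precision division). It assumes a run of independent operations and, for the pipelined case, that a new operation can start every cycle; that issue interval is an assumption for illustration and is not stated in the text. The function names are hypothetical.

```python
def pipelined_cycles(n_ops, latency, issue_interval=1):
    """Cycles to finish n independent operations on a pipelined unit.

    issue_interval is an assumed value: cycles between starting successive operations.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


def unpipelined_cycles(n_ops, latency):
    """Cycles to finish n operations when each must complete before the next starts."""
    return n_ops * latency


if __name__ == "__main__":
    print(pipelined_cycles(8, latency=3))     # eight independent additions
    print(unpipelined_cycles(8, latency=18))  # eight single-precision divisions
```

With these figures, eight independent additions finish in 10 cycles, whereas eight single-precision divisions take 144 cycles, which is why division-heavy code benefits far less from the FPU's pipelining.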

After the microprocessor was released, a bug was discovered in the floating-point unit, commonly called the "Pentium Pro and Pentium II FPU bug" and referred to by Intel as the "flag erratum". The bug occurs under some circumstances during floating-point to integer conversion, when the floating-point number does not fit into the smaller integer format, causing the FPU to deviate from its documented behaviour. The bug is considered to be minor and occurs under such special circumstances that very few, if any, software programs are affected.

The Pentium Pro microarchitecture was used in one form or another by Intel for more than a decade. The pipeline would scale from its initial 150 MHz all the way up to 1.4 GHz with the "Tualatin" Pentium III. The design's various traits would continue after that in the derivative core called "Banias" in the Pentium M and in Intel Core (Yonah), which itself would evolve into the Core microarchitecture (Core 2 processors) from 2006 onward.[2]

Performance

Performance with 32-bit code was excellent, 25-35% ahead of the older Pentium at the time; however, the Pentium Pro's 16-bit performance was only about 20% faster than that of a Pentium, because register renaming was done on full 32-bit registers only (this was fixed in the Pentium II). It was this, along with the Pentium Pro's high price, that caused the rather lackluster reception among PC enthusiasts, given the dominance at the time of the 16-bit MS-DOS, 16/32-bit Windows 3.1x, and 32/16-bit Windows 95 (parts of Windows 95, such as USER.exe, were still mostly 16-bit). To gain the full advantages of the Pentium Pro's microarchitecture, one needed to run a fully 32-bit OS such as Windows NT 3.51, Unix, Linux or OS/2.

When introduced, the Pentium Pro slightly outperformed the fastest RISC microprocessors on integer performance when running the SPECint95 benchmark.[3] Floating-point performance was significantly lower, half that of some RISC microprocessors.[3] The Pentium Pro's integer performance lead disappeared rapidly, first overtaken by the MIPS Technologies R10000 in January 1996, and then by Digital Equipment Corporation's EV56 variant of the Alpha 21164.[4]

An innovation in cache

Probably the Pentium Pro's most noticeable addition was its on-package L2 cache, which ranged from 256 KB at introduction to 1 MB in 1997. At the time, manufacturing technology did not feasibly allow a large L2 cache to be integrated into the processor core. Intel instead placed the L2 die(s) separately in the package, which still allowed the cache to run at the same clock speed as the CPU core. Additionally, unlike most motherboard-based cache schemes that shared the main system bus with the CPU, the Pentium Pro's cache had its own backside bus (called dual independent bus by Intel). Because of this, the CPU could read main memory and cache concurrently, greatly reducing a traditional bottleneck. The cache was also "non-blocking", meaning that the processor could issue more than one cache request at a time (up to four), reducing cache-miss penalties. (This is an example of MLP, memory-level parallelism.) These properties combined to produce an L2 cache that was far faster than the motherboard-based caches of older processors. This cache alone gave the CPU an advantage in input/output performance over older x86 CPUs. In multiprocessor configurations, the Pentium Pro's integrated cache greatly improved performance in comparison to architectures in which the CPUs shared a central cache.

However, this far faster L2 cache did come with some complications. The Pentium Pro's "on-package cache" arrangement was unique. The processor and the cache were on separate dies in the same package and connected closely by a full-speed bus. The two or three dies had to be bonded together early in the production process, before testing was possible. This meant that a single, tiny flaw in either die made it necessary to discard the entire assembly, which was one of the reasons for the Pentium Pro's relatively low production yield and high cost. All versions of the chip were expensive, those with 1024 KB being particularly so, since they required two 512 KB cache dies as well as the processor die.
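
The yield penalty of bonding untested dies together can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that defects in each die are independent, so that the fraction of good packages is the product of the per-die yields. The yield figures used are purely hypothetical and are not Intel data; the function name `package_yield` is likewise illustrative.

```python
from math import prod


def package_yield(die_yields):
    """Fraction of assembled packages in which every die is good.

    Assumes independent defects and no testing before the dies are bonded.
    """
    return prod(die_yields)


if __name__ == "__main__":
    cpu = 0.80    # hypothetical yield of the processor die
    cache = 0.90  # hypothetical yield of one cache die
    print(package_yield([cpu, cache]))         # two-die package
    print(package_yield([cpu, cache, cache]))  # three-die (1024 KB) package
```

With these made-up figures, a two-die package would yield about 72% and a three-die package about 65%, both lower than the yield of any single die, which illustrates why the three-die 1024 KB parts were the most expensive to produce.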

Available models

Pentium Pro clock speeds were 150, 166, 180 or 200 MHz with a 60 or 66 MHz external bus clock. Some users chose to overclock their Pentium Pro chips, with the 200 MHz version often being run at 233 MHz, and the 150 MHz version often being run at 166 MHz. The chip was popular in symmetric multiprocessing configurations, with dual and quad SMP server and workstation setups being commonplace.

In Intel's "Family/Model/Stepping" scheme, the Pentium Pro is family 6, model 1, and its Intel Product code is 80521.

Evolution in fabrication

As time progressed, the process used to fabricate the Pentium Pro processor die and its separate cache memory die changed, leading to a combination of processes used in the same package:

  • The 133 MHz Pentium Pro prototype processor die was fabricated in a 0.6 µm BiCMOS process.[5]
  • The 150 MHz Pentium Pro processor die was fabricated in a 0.50 µm BiCMOS process.[6]
  • The 166, 180, and 200 MHz Pentium Pro processor dies were fabricated in a 0.35 µm BiCMOS process.[6]
  • The 256 KB L2 cache die was fabricated in a 0.50 µm BiCMOS process.[6]
  • The 512 and 1024 KB L2 cache dies were fabricated in a 0.35 µm BiCMOS process.[6]

Packaging

The Pentium Pro is packaged in a ceramic multi-chip module (MCM). The MCM contains two underside cavities in which the microprocessor die and its companion cache die reside. The dies are bonded to a heat slug, whose exposed top helps transfer heat from the dies more directly to a cooling apparatus such as a heat sink. The dies are connected to the package using conventional wire bonding. The cavities are capped with a ceramic plate. The Pentium Pro with 1 MB of cache uses a plastic MCM. Instead of two cavities, there is only one, in which the three dies reside, bonded to the package instead of a heat slug. The cavity is filled in with epoxy.

The MCM has 387 pins, of which approximately half are arranged in a pin grid array (PGA) and half in an interstitial pin grid array (IPGA). The packaging was designed for Socket 8.

Upgrade paths

In 1998, the 300/333 MHz Pentium II Overdrive processor for Socket 8 was released. Featuring 512 KB of full-speed cache, it was produced by Intel as a drop-in upgrade option for owners of Pentium Pro systems. However, it supported only two-way glueless multiprocessing, not four-way or higher, so it was not a usable upgrade for quad-processor systems.

As Slot 1 motherboards became prevalent, several manufacturers released slocket adapters, such as the Tyan M2020, Asus C-P6S1, Tekram P6SL1, and the Abit KP6. The slockets allowed Pentium Pro processors to be used with Slot 1 motherboards. The Intel 440FX chipset explicitly supported both Pentium Pro and Pentium II processors, but the Intel 440BX and later Slot 1 chipsets did not explicitly support the Pentium Pro, so the Socket 8 slockets did not see wide use. Slockets, in the form of Socket 370 to Slot 1 adapters, saw renewed popularity when Intel introduced Socket 370 Celeron and Pentium III processors.

Core specifications

Pentium Pro

Pentium II Overdrive

  • L1 cache: 16 + 16 KB (Data + Instructions)
  • L2 cache: 512 KB external chip on CPU module clocked at CPU-speed
  • Socket: Socket 8
  • Multiplier: Locked at 5x
  • Front side bus: 60 and 66 MHz
  • VCore: 3.1-3.3 V (Has on-board voltage regulator)
  • Fabrication: 0.25 µm
  • Based on the Deschutes-generation Pentium II
  • First release: 1998
  • Supports MMX technology
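
The locked 5x multiplier and the two front side bus speeds listed above account for the 300/333 MHz ratings of the Pentium II Overdrive. The short sketch below works through that arithmetic. It assumes that the nominal "66 MHz" bus actually runs at 200/3 MHz (about 66.67 MHz), a detail not stated in the list; the function name is hypothetical.

```python
from fractions import Fraction

MULTIPLIER = 5  # from the list above: multiplier locked at 5x


def core_clock_mhz(bus_mhz):
    """Core clock in MHz for a given front side bus clock in MHz."""
    return MULTIPLIER * bus_mhz


if __name__ == "__main__":
    print(core_clock_mhz(60))                       # 60 MHz bus
    # Assumption: the nominal 66 MHz bus is exactly 200/3 MHz.
    print(float(core_clock_mhz(Fraction(200, 3))))  # nominal 66 MHz bus
```

A 60 MHz bus therefore gives a 300 MHz core clock, and the nominal 66 MHz bus gives roughly 333 MHz.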

Pentium Pro / 6th generation competitors

References

  1. ^ "Wayback machine archive of Heise, accessed April 24, 2009".
  2. ^ "Ars Technica review of the Core architecture, accessed April 24, 2009".
  3. ^ a b Slater, Michael (13 November 1995). "Intel Boosts Pentium Pro to 200 MHz". Microprocessor Report.
  4. ^ Gwennap, Linley (8 July 1996). "Digital's 21164 Reaches 500 MHz". Microprocessor Report.
  5. ^ Papworth, David B. (April 1996). "Tuning the Pentium Pro Microarchitecture". IEEE Micro, pp. 14–15.
  6. ^ a b c d Michael Slater. "Intel Boosts Pentium Pro to 200 MHz". Microprocessor Report, Volume 9, Number 15, November 13, 1995. MicroDesign Resources.
  7. ^ sandpile.org - IA-32 implementation - Intel P6

See also