Power Processing Element
The 90 nm Cell BE processor. The PPE is the upper fourth of the processor.
|Produced||From 2005 to Present|
|Marketed by||IBM, Sony, Microsoft|
|Max. CPU clock rate||2.8 GHz to 3.2 GHz|
|Min. feature size||90 nm to 45 nm|
|Instruction set||Power Architecture|
|L1 cache||32 KB instruction + 32 KB data|
|GPU||Xenos, in the XCGPU variant.|
|Application||Gaming Console, HPC|
|Variant||Cell BE, XCPU, XCGPU, PowerXCell 8i|
The Power Processing Unit (PPU) is a 64-bit, dual-threaded, in-order Power Architecture microprocessor core designed by IBM, primarily for the PlayStation 3 and Xbox 360 game consoles. It has also been used in high-performance computing, in supercomputers such as the record-setting IBM Roadrunner.
In most instances the PPU is joined by a 512 KB L2 cache to form what is called the Power Processing Element (PPE).
The PPU is used as a main CPU core in three different processor designs:
- The Cell Broadband Engine (Cell BE), used primarily in Sony's PlayStation 3 game console. It uses the PPE and comes in three versions: 90 nm, 65 nm and 45 nm parts.
- The PowerXCell 8i, a version of the Cell BE with an enhanced FPU and memory subsystem. It was manufactured only as a single 65 nm version.
- The XCPU, used in Microsoft's Xbox 360 in a three-core configuration with a unified 1 MB L2 cache. It comes in three versions: 90 nm and 65 nm parts, and the 45 nm XCGPU with an integrated graphics processor from ATI.
- 64-bit, dual-threaded core
- Typical 3.2 GHz clock rate
- 32 KB L1 Instruction cache
- 32 KB L1 Data cache
- 512 KB Unified L2 cache, 8-way set associative in the PPE variant.
- Compatible with 64-bit PowerPC ISA v.2.02 (POWER4 and PowerPC 970)
- AltiVec SIMD functionality
- Branch Unit (BRU)
- Fixed Point Integer Unit (FXU)
- Load and Store Unit (LSU)
- Floating-Point Unit (FPU)
- Vector Media Extension Unit (VMX)
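As a rough illustration of what the "512 KB, 8-way set associative" L2 figure above implies, the sketch below derives the number of cache sets. The cache line size is not given in this article; the 128-byte value used here is an assumption made for the example, and `cache_sets` is a hypothetical helper, not part of any IBM tool.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Return the number of sets in a set-associative cache."""
    bytes_per_set = ways * line_bytes
    if size_bytes % bytes_per_set != 0:
        raise ValueError("cache size must be a multiple of ways * line size")
    return size_bytes // bytes_per_set


L2_SIZE_BYTES = 512 * 1024   # from the specification list
L2_WAYS = 8                  # from the specification list
ASSUMED_LINE_BYTES = 128     # ASSUMPTION: line size is not stated in the text

l2_sets = cache_sets(L2_SIZE_BYTES, L2_WAYS, ASSUMED_LINE_BYTES)
print(f"L2 sets under the assumed line size: {l2_sets}")
```

With a different line size the set count changes proportionally, so the result should be read as dependent on that assumption.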
The PPU is an in-order processor, but it has some traits that let it achieve some of the benefits of out-of-order execution without expensive reordering hardware. On an L1 cache miss, it can continue executing past the miss, stopping only when an instruction actually depends on the pending load. It can send up to 8 load instructions to the L2 cache out of order. It also has an instruction delay pipe, a side path that lets it execute instructions that would normally cause pipeline stalls without holding up the rest of the pipeline. The delay pipe is what supports the out-of-order loads and stores: cache misses are parked there while execution moves on.
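The effect of executing past a cache miss can be sketched with a toy timing model. This is not a model of the real PPU pipeline: the one-instruction-per-cycle issue rate, the 20-cycle miss latency, and the names `Instr` and `run` are all assumptions chosen for illustration. Only the limit of 8 outstanding loads and the "stall only on a true dependency" rule come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    dest: str                     # register written by the instruction
    srcs: Tuple[str, ...] = ()    # registers read by the instruction
    misses: bool = False          # True for a load that misses the L1 cache


def _in_flight(ready_at: Dict[str, int], cycle: int) -> Dict[str, int]:
    """Keep only the loads whose data has not yet returned."""
    return {reg: t for reg, t in ready_at.items() if t > cycle}


def run(program: Sequence[Instr], miss_latency: int = 20,
        max_outstanding: int = 8, stall_on_miss: bool = False) -> int:
    """Return the cycle count for an in-order, one-issue-per-cycle toy core."""
    cycle = 0
    ready_at: Dict[str, int] = {}  # register -> cycle its missed load returns
    for ins in program:
        ready_at = _in_flight(ready_at, cycle)
        # Stall only if a source register is still waiting on a missed load.
        waits = [ready_at[s] for s in ins.srcs if s in ready_at]
        if waits:
            cycle = max(waits)
            ready_at = _in_flight(ready_at, cycle)
        if ins.misses:
            # Respect the limit on loads outstanding at once.
            if len(ready_at) >= max_outstanding:
                cycle = min(ready_at.values())
                ready_at = _in_flight(ready_at, cycle)
            if stall_on_miss:
                cycle += miss_latency          # blocking core waits here
            else:
                ready_at[ins.dest] = cycle + miss_latency
        cycle += 1
    # Account for loads still in flight when the program ends.
    return max([cycle, *ready_at.values()])


program = [
    Instr("r1", misses=True),     # load, misses L1
    Instr("r2", misses=True),     # load, misses L1
    Instr("r3"),                  # independent work
    Instr("r4"),                  # independent work
    Instr("r5", srcs=("r1",)),    # first use of the first load
    Instr("r6", srcs=("r2",)),    # first use of the second load
]

print("execute past miss:", run(program))
print("stall on each miss:", run(program, stall_on_miss=True))
```

In this sketch the two misses overlap with each other and with the independent instructions, so the run that continues past misses finishes in fewer cycles than the one that stalls on every miss. The model ignores write-after-write hazards, stores, and the real pipeline depth.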
The PPE's Pipeline
The PPE has a 23-stage general pipeline, with an additional 11 stages possible for microcode and an additional 4 stages possible for branch prediction.
The PPU runs two hardware threads simultaneously. The main registers for code execution are duplicated, as are the exception and interrupt-handling registers and several essential arrays and queues. The two threads can generate exceptions simultaneously, and each performs branch prediction on its own branch history. The execution engine and caches are not duplicated, however, so it is still a single-core design.
Floating Point Capacity
The PPU's 64-bit double-precision floating-point unit and 128-bit VMX unit (using the AltiVec instruction set) can together perform a theoretical 12 floating-point operations per cycle, because all Power Architecture floating-point units can perform floating-point multiply-adds and are at least 64 bits wide. At 3.2 GHz, that gives 3.2 billion clock cycles per second × 12 = 38.4 billion floating-point operations per second.
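The peak-throughput arithmetic above can be checked with a few lines of code. The function name `peak_gflops` is a hypothetical helper introduced for this example; the clock rate and operations-per-cycle figures are the ones quoted in the text, and the result is a theoretical ceiling rather than a measured rate.

```python
def peak_gflops(clock_mhz: int, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cycles per second times operations per cycle."""
    mflops = clock_mhz * flops_per_cycle   # integer arithmetic, in MFLOPS
    return mflops / 1000                   # convert MFLOPS to GFLOPS


CLOCK_MHZ = 3200        # 3.2 GHz, from the text
FLOPS_PER_CYCLE = 12    # theoretical figure quoted in the text

print(f"{peak_gflops(CLOCK_MHZ, FLOPS_PER_CYCLE)} GFLOPS")
```

The clock is passed in MHz as an integer so that the multiplication is exact before the final conversion to GFLOPS.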