SSE2 (Streaming SIMD Extensions 2) is one of the Intel SIMD (Single Instruction, Multiple Data) supplementary instruction sets for x86 processors, first introduced by Intel with the initial version of the Pentium 4 in 2000. It extends the earlier SSE instruction set and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of its Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.
Most of the SSE2 instructions implement the integer vector operations also found in MMX. They differ from their MMX equivalents in that they use the XMM registers instead of the MMX registers. The XMM registers are the ones used by SSE instructions, whereas the MMX registers are aliased onto the registers of the scalar floating-point (x87) unit. Because that aliasing forces a mode switch between MMX code and x87 code, replacing MMX with SSE2 allows vector operations and scalar floating-point operations to be mixed without a mode-switch performance penalty. Additionally, while MMX uses the x87 registers as 64-bit registers, SSE2 can use the full 128-bit width of the XMM registers, which can give large performance gains in optimized applications.
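As a rough illustration of the 128-bit integer operations described above, the following C sketch adds two arrays of 32-bit integers four elements at a time using the SSE2 intrinsics from `<emmintrin.h>`. The function name `add_i32` is made up for this example, and the code assumes an x86 compiler with SSE2 enabled (the default on x86-64; 32-bit GCC or Clang builds typically need `-msse2`).

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* dst[i] = a[i] + b[i] for n elements (hypothetical helper). */
void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;

    /* Four 32-bit integers fit in one 128-bit XMM register. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }

    /* Scalar tail for the last 0 to 3 elements; unsigned arithmetic
       wraps on overflow, matching the vector addition above. */
    for (; i < n; ++i)
        dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```

The unaligned load and store intrinsics are used so that the sketch works for any pointer; code that can guarantee 16-byte alignment may use the aligned variants instead.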
Other SSE2 extensions include a set of cache-control instructions intended primarily to minimize cache pollution when processing long streams of data that will not be reused soon, and an extensive set of numeric format conversion instructions.
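The cache-control idea can be sketched with a non-temporal store, which writes data to memory while hinting that it should not displace existing cache contents. The example below is a minimal sketch, not a tuned implementation: `fill_stream_u32` is a hypothetical name, and it assumes an x86 compiler with SSE2 enabled. `_mm_stream_si128` requires a 16-byte-aligned address, so the function writes leading elements with ordinary stores until the pointer is aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */

/* Fill n 32-bit words at dst with value, using non-temporal stores
   for the 16-byte-aligned middle part (hypothetical helper). */
void fill_stream_u32(uint32_t *dst, uint32_t value, size_t n)
{
    size_t i = 0;

    /* Ordinary stores until dst + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 15) != 0)
        dst[i++] = value;

    /* Non-temporal 16-byte stores: four words per iteration. */
    __m128i v = _mm_set1_epi32((int)value);
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);

    /* Ordinary stores for the last 0 to 3 words. */
    for (; i < n; ++i)
        dst[i] = value;

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

Whether this is faster than ordinary stores depends on the buffer size and on whether the data is read again soon; for small buffers that are reused immediately, bypassing the cache can hurt.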
AMD's implementation of SSE2 on the AMD64 (x86-64) platform includes an additional eight registers, doubling the total number to 16 (XMM0 through XMM15). These additional registers are only visible when running in 64-bit mode. Intel adopted these additional registers as part of its support for the x86-64 architecture (or in Intel's parlance, "Intel 64") in 2004.
Differences between x87 FPU and SSE2
FPU (x87) instructions provide higher precision by calculating intermediate results with 80 bits of precision, by default, to minimise roundoff error in numerically unstable algorithms (see IEEE 754 design rationale and references therein). However, the x87 FPU is a scalar unit only whereas SSE2 can process a small vector of operands in parallel.
If code designed for x87 is ported to SSE2, whose double-precision arithmetic is less precise than the 80-bit x87 format, certain combinations of math operations or input datasets can result in measurable numerical deviation. This can be an issue in reproducible scientific computations, e.g. if the calculation results must be compared against results generated on a different machine architecture. A related issue is that, historically, language standards and compilers were inconsistent in their handling of the x87 80-bit registers implementing double extended precision variables, compared with the double and single precision formats implemented in SSE2: the rounding of extended precision intermediate values to double precision variables was not fully defined and depended on implementation details such as when registers were spilled to memory. However, modern language standards such as C99 and Fortran 2003 have incorporated IEEE 754 floating point support and now exactly specify the semantics of double extended ("long double") precision expressions to avoid such reproducibility problems.
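One way to observe the effect of wider intermediates is an expression whose intermediate value overflows double precision but still fits in the 80-bit format. The following C sketch is illustrative only: the function names are hypothetical, and the result depends on the compiler, its flags, and the target. The C99 macro `FLT_EVAL_METHOD` from `<float.h>` reports which evaluation method the compiler claims to use.

```c
#include <float.h>

/* Returns 1 if DBL_MAX * 2 / 2 comes back finite, which requires the
   intermediate product to be held in a format wider than double
   (for example the x87 80-bit format). Returns 0 if the product
   overflowed to infinity, as happens when every operation is rounded
   to double (for example with SSE2 scalar arithmetic). */
int intermediate_overflow_survives(void)
{
    volatile double big = DBL_MAX;  /* volatile blocks constant folding */
    volatile double two = 2.0;
    double y = big * two / two;
    return y <= DBL_MAX;            /* false when y is +infinity */
}

/* C99 evaluation method: 0 = each operation in its own type,
   2 = everything evaluated as long double, -1 = indeterminate. */
int evaluation_method(void)
{
    return (int)FLT_EVAL_METHOD;
}
```

On a typical x86-64 build, where scalar double arithmetic uses SSE2, one would expect `evaluation_method()` to report 0 and the overflow not to survive; a 32-bit build that uses the x87 unit may behave differently.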
Differences between MMX and SSE2
SSE2 extends MMX instructions to operate on XMM registers. Therefore, it is possible to convert all existing MMX code to an SSE2 equivalent. Since an XMM register is twice as wide as an MMX register, loop counters and memory accesses may need to be changed to accommodate this. However, 8-byte loads and stores to XMM registers are available, so this is not strictly required.
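The 8-byte loads and stores mentioned above make a near one-to-one port possible. As a sketch, the following C function performs a saturating add of exactly eight unsigned bytes, the amount of data one MMX instruction would handle, using only the low half of an XMM register. The function name `add_sat_u8x8` is hypothetical, and an x86 compiler with SSE2 enabled is assumed.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* dst[i] = min(a[i] + b[i], 255) for exactly 8 bytes (hypothetical helper). */
void add_sat_u8x8(uint8_t dst[8], const uint8_t a[8], const uint8_t b[8])
{
    /* 8-byte loads into the low half of an XMM register;
       the upper half is zeroed. No alignment is required. */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);

    /* Unsigned saturating byte add, the SSE2 form of MMX paddusb. */
    __m128i sum = _mm_adds_epu8(va, vb);

    /* 8-byte store of the low half only. */
    _mm_storel_epi64((__m128i *)dst, sum);
}
```

Because no MMX register is touched, no `emms` is needed afterwards. Processing 16 bytes per iteration with full-width loads would use the register better, at the cost of adjusting the surrounding loop.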
Although one SSE2 instruction can operate on twice as much data as an MMX instruction, performance might not increase significantly. Two major reasons are that accessing SSE2 data in memory not aligned to a 16-byte boundary can incur a significant penalty, and that the throughput of SSE2 instructions in older x86 implementations was half that of MMX instructions. Intel addressed the first problem by adding an instruction in SSE3 that reduces the overhead of accessing unaligned data and by improving the overall performance of misaligned loads, and the second problem by widening the execution engine in its Core microarchitecture, used in Core 2 Duo and later products.
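A common way to sidestep the alignment penalty using SSE2 alone is to process leading elements one at a time until the pointer reaches a 16-byte boundary, then switch to aligned loads. The C sketch below sums an array of 32-bit integers this way; `sum_i32` is a hypothetical name, the sum wraps modulo 2^32, and an x86 compiler with SSE2 enabled is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Wrapping sum of n 32-bit integers (hypothetical helper). */
int32_t sum_i32(const int32_t *p, size_t n)
{
    size_t i = 0;
    uint32_t total = 0;

    /* Scalar peel loop until p + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(p + i) & 15) != 0)
        total += (uint32_t)p[i++];

    /* Aligned 16-byte loads, four lanes accumulated in parallel. */
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(p + i)));

    /* Combine the four lanes. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total += lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the last 0 to 3 elements. */
    for (; i < n; ++i)
        total += (uint32_t)p[i];

    return (int32_t)total;
}
```

On newer processors the cost of an unaligned load is often small, so the simpler approach of calling `_mm_loadu_si128` throughout may perform just as well; measuring on the target hardware is the only reliable guide.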
Since the MMX and x87 register files alias one another, using MMX prevents x87 instructions from working as desired. Once MMX has been used, the programmer must execute the emms instruction (C: _mm_empty()) to restore operation of the x87 register file. On some operating systems, x87 is not used very much, but may still be used in some critical areas like pow() where the extra precision is needed. In such cases, the corrupt floating-point state caused by failure to emit emms may go undetected for millions of instructions before ultimately causing the floating-point routine to fail, returning NaN. Since the problem is not locally apparent in the MMX code, the bug can be very time-consuming to find and correct. As SSE2 does not have this problem, usually provides much better throughput, and provides more registers in 64-bit code, it should be preferred for nearly all vectorization work.
Compiler usage
When first introduced in 2000, SSE2 was not supported by software development tools. For example, to use SSE2 in a Microsoft Developer Studio project, the programmer had to either write inline assembly by hand or import object code from an external source. Later, the Visual C++ Processor Pack added SSE2 support to Visual C++ and MASM.
The Intel C++ Compiler can automatically generate SSE4, SSSE3, SSE3, SSE2, and SSE code without the use of hand-coded assembly.
The Sun Studio Compiler Suite can also generate SSE2 instructions when the compiler flag -xvector=simd is used.
CPUs supporting SSE2
- AMD K8-based CPUs (Athlon 64, Sempron 64, Turion 64)
- AMD Phenom CPUs
- Intel NetBurst-based CPUs (Pentium 4, Xeon, Celeron, Celeron D, etc.)
- Intel Pentium M and Celeron M
- Intel Core family (including Intel Core 2, Intel Core i5, Intel Core i7)
- Intel Atom
- Transmeta Efficeon
- VIA C7
- VIA Nano
- AMD A4
Notable IA-32 CPUs not supporting SSE2
SSE2 is an extension of the IA-32 architecture. Therefore any architecture that does not support IA-32 does not support SSE2. x86-64 CPUs all implement IA-32. All known x86-64 CPUs also implement SSE2. Since IA-32 predates SSE2, early IA-32 CPUs did not implement it. SSE2 and the other SIMD instruction sets were intended primarily to improve CPU support for realtime graphics, notably gaming. SSE2 is also a requirement for installing Windows 8 or Microsoft Office 2013 "to enhance the reliability of third-party apps and drivers running in Windows 8".
The following CPUs implemented IA-32 after SSE2 was developed, but did not implement SSE2:
- AMD CPUs prior to Athlon 64, including all Socket A-based CPUs
- Intel CPUs prior to Pentium 4
- VIA C3
- Transmeta Crusoe
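Because some IA-32 CPUs lack SSE2, 32-bit software that wants to use it may check for support at run time and fall back to a non-SSE2 code path otherwise. The following C sketch relies on the GCC and Clang builtin `__builtin_cpu_supports`; the wrapper name `cpu_has_sse2` is hypothetical, and other compilers would need a different mechanism, such as querying CPUID directly.

```c
/* Returns 1 if the running CPU reports SSE2, 0 if it does not,
   and -1 if this build cannot tell (hypothetical helper). */
int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* GCC/Clang builtins on x86; the init call is only strictly
       needed before constructors have run, but is harmless here. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return -1;  /* not an x86 GCC/Clang build: unknown */
#endif
}
```

On x86-64 the check is redundant, since all known x86-64 CPUs implement SSE2, but it is harmless there.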
References
- Matz, Michael; Hubicka, Jan; Jaeger, Andreas; Mitchell, Mark (January 2010). "System V Application Binary Interface - AMD64 Architecture Processor Supplement - Draft Version 0.99.4". Retrieved 26 April 2013.
- Fog, Agner. "Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms". Retrieved 26 April 2013.
- Microsoft Corporation. "What is PAE, NX, and SSE2 and why does my PC need to support them to run Windows 8?". Retrieved 19 March 2013.