FMA instruction set
The FMA instruction set is an extension to the 128-bit and 256-bit SIMD instructions in the x86 microprocessor instruction set that performs fused multiply–add (FMA) operations.[1] There are two variants:
- FMA4 is supported in AMD processors starting with the Bulldozer architecture. FMA4 was realized in hardware before FMA3.
- FMA3 is supported in AMD processors starting with the Piledriver architecture and will be supported by Intel in their Haswell processors in 2013 and Broadwell processors in 2014.
New instructions
The FMA3 and FMA4 instruction sets have almost identical functionality but are not mutually compatible. Both contain fused multiply–add (FMA) instructions for floating-point scalar and SIMD operations. Compiler writers will need some time to support both variants and to optimize code for each.
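What makes a multiply–add "fused" is that the product and the sum are computed as one operation with a single rounding at the end, instead of rounding the product first and the sum second. The following Python sketch models that behaviour with exact rational arithmetic. It is an illustration only: the function names are hypothetical, it handles finite inputs only, and it does not execute any FMA instruction.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Model of a fused multiply-add: exact a*b + c, then ONE rounding.

    Assumption: a, b and c are finite floats (Fraction rejects inf/nan).
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step


def unfused_multiply_add(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 when the multiply is done on its own.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

With these inputs the unfused form returns 0.0, because the small term is lost when the product is rounded, while the fused model returns −2^−60.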
Compatibility issue
The difference between FMA3 and FMA4 is whether the instruction can have three or four different operands. The FMA operation has the form:

$$d = \operatorname{round}(a \cdot b + c)$$
The 4-operand form (FMA4) allows a, b, c and d to be four different registers, while the 3-operand form (FMA3) requires that d be the same register as a, b or c. The 3-operand form makes the code shorter and the hardware implementation slightly simpler, while the 4-operand form provides more programming flexibility.
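The practical consequence of the operand count can be sketched with a toy register file. In the Python model below, the dictionary of registers and the helper names `fma4` and `fma3_231` are hypothetical and only illustrative. Rounding is ignored, and the values are chosen so that the arithmetic is exact.

```python
def fma4(regs, dst, src1, src2, src3):
    """4-operand form: dst may be a register distinct from all sources."""
    regs[dst] = regs[src1] * regs[src2] + regs[src3]


def fma3_231(regs, dst, src2, src3):
    """3-operand form (231 ordering): dst doubles as the addend,
    so its previous value is overwritten by the result."""
    regs[dst] = regs[src2] * regs[src3] + regs[dst]


# FMA4: all three inputs survive, the result lands in a fourth register.
regs_a = {"xmm0": 0.0, "xmm1": 2.0, "xmm2": 3.0, "xmm3": 4.0}
fma4(regs_a, "xmm0", "xmm1", "xmm2", "xmm3")

# FMA3: to keep the addend in xmm3 intact, it is first copied into the
# destination (modelling an extra register move), then accumulated into.
regs_b = {"xmm0": 0.0, "xmm1": 2.0, "xmm2": 3.0, "xmm3": 4.0}
regs_b["xmm0"] = regs_b["xmm3"]
fma3_231(regs_b, "xmm0", "xmm1", "xmm2")

# FMA3 without the copy: the addend register itself is destroyed.
regs_c = {"xmm1": 2.0, "xmm2": 3.0, "xmm3": 4.0}
fma3_231(regs_c, "xmm3", "xmm1", "xmm2")

print(regs_a, regs_b, regs_c)
```

When one input is no longer needed after the operation, as in an accumulation loop, the 3-operand form needs no extra copy, which is consistent with the shorter code mentioned above.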
See XOP instruction set for more discussion of compatibility issues between Intel and AMD.
FMA3 instruction set
CPUs with FMA3
- Intel
- Intel will introduce hardware FMA in processors based on the Haswell microarchitecture during 2013.
- AMD
- AMD introduced FMA3 support in processors starting with the Piledriver architecture for compatibility reasons.[2][3] The 2nd generation APU processors based on "Trinity" (32 nm), which support FMA3 instructions, were launched on May 15, 2012. The 2nd generation Bulldozer processors with Piledriver cores, which also support FMA3 instructions, were launched on October 23, 2012.
Excerpt from FMA3
| Mnemonic (AT&T) | Operands | Operation |
|---|---|---|
| VFMADD132PDy | ymm, ymm, ymm/m256 | $0 = $0×$2 + $1 |
| VFMADD132PSy | ||
| VFMADD132PDx | xmm, xmm, xmm/m128 | |
| VFMADD132PSx | ||
| VFMADD132SD | xmm, xmm, xmm/m64 | |
| VFMADD132SS | xmm, xmm, xmm/m32 | |
| VFMADD213PDy | ymm, ymm, ymm/m256 | $0 = $1×$0 + $2 |
| VFMADD213PSy | ||
| VFMADD213PDx | xmm, xmm, xmm/m128 | |
| VFMADD213PSx | ||
| VFMADD213SD | xmm, xmm, xmm/m64 | |
| VFMADD213SS | xmm, xmm, xmm/m32 | |
| VFMADD231PDy | ymm, ymm, ymm/m256 | $0 = $1×$2 + $0 |
| VFMADD231PSy | ||
| VFMADD231PDx | xmm, xmm, xmm/m128 | |
| VFMADD231PSx | ||
| VFMADD231SD | xmm, xmm, xmm/m64 | |
| VFMADD231SS | xmm, xmm, xmm/m32 |
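The three digits in each mnemonic can be read as the 1-based positions of the operands used as multiplicand, multiplier and addend, in that order, with the result always written to the first operand ($0 in the table). The sketch below decodes the digits that way and reproduces the three Operation entries. The function name `fma3` is hypothetical, and the model uses ordinary Python arithmetic rather than a single fused rounding, so it shows operand selection only.

```python
def fma3(order, ops):
    """Apply the operand ordering of VFMADD<order> to three operand values.

    ops[0], ops[1], ops[2] stand for $0, $1, $2 in the table above.
    Returns the operand values after the instruction; only $0 changes.
    """
    if order not in ("132", "213", "231"):
        raise ValueError(f"unsupported operand order: {order}")
    # Each digit is a 1-based operand position.
    multiplicand, multiplier, addend = (ops[int(digit) - 1] for digit in order)
    result = multiplicand * multiplier + addend
    return (result, ops[1], ops[2])


operands = (2.0, 3.0, 5.0)  # $0 = 2, $1 = 3, $2 = 5
for order in ("132", "213", "231"):
    print(order, fma3(order, operands))
```

For these operands the 132 form computes 2×5 + 3, the 213 form computes 3×2 + 5, and the 231 form computes 3×5 + 2.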
FMA4 instruction set
CPUs with FMA4
- AMD
- AMD introduced FMA4 support in processors starting with the Bulldozer architecture.
- Intel
- It is uncertain whether future Intel processors will support FMA4, given Intel's announced change to FMA3.
Excerpt from FMA4
| Mnemonic (AT&T) | Operands | Operation |
|---|---|---|
| VFMADDPDx | xmm, xmm, xmm/m128, xmm/m128 | $0 = $1×$2 + $3 |
| VFMADDPDy | ymm, ymm, ymm/m256, ymm/m256 | $0 = $1×$2 + $3 |
| VFMADDPSx | xmm, xmm, xmm/m128, xmm/m128 | $0 = $1×$2 + $3 |
| VFMADDPSy | ymm, ymm, ymm/m256, ymm/m256 | $0 = $1×$2 + $3 |
| VFMADDSD | xmm, xmm, xmm/m64, xmm/m64 | $0 = $1×$2 + $3 |
| VFMADDSS | xmm, xmm, xmm/m32, xmm/m32 | $0 = $1×$2 + $3 |
History
The incompatibility between Intel's FMA3 and AMD's FMA4 arose because both companies changed plans without coordinating the instruction coding with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 at almost the same time. The history can be summarized as follows:
- August 2007: AMD announces the SSE5 instruction set, which includes 3-operand FMA instructions. A new coding scheme (DREX) is introduced for allowing instructions to have three operands.[5]
- April 2008: Intel announces their AVX and FMA instruction sets, including 4-operand FMA instructions. The coding of these instructions uses the new VEX coding scheme which is more flexible than AMD's DREX scheme.[6]
- December 2008: Intel changes the specification for their FMA instructions from 4-operand to 3-operand instructions. The VEX coding scheme is still used.[7]
- May 2009: AMD changes the specification of their FMA instructions from the 3-operand DREX form to the 4-operand VEX form, compatible with the April 2008 Intel specification rather than the December 2008 Intel specification.[8]
- January 2012: AMD announces FMA3 support in future processors codenamed Trinity and Vishera; they are based on the Piledriver architecture.[9]
It is currently uncertain whether the 3-operand VEX coded form (here called FMA3) or the 4-operand form (FMA4) will be the dominating standard in the future. It is also possible that future processors will support both forms.
Compiler and assembler support
Compilers and assemblers provide different levels of support for FMA3 and FMA4:
- GCC 4.5.0 supports FMA4 with -mfma4.[10]
- GCC 4.7.0 also supports FMA3 with -mfma.
- Microsoft Visual C++ 2010 SP1 supports FMA4 instructions.[11]
- Microsoft Visual C++ 2012 supports FMA3 instructions.
- PathScale supports FMA4 with -mfma.
- Open64 5.0 adds "limited support".
- Intel compilers support only FMA3 instructions.[10]
- NASM has supported FMA3 instructions since version 2.03 and FMA4 instructions since version 2.06.
- YAsm has supported FMA3 and FMA4 instructions since version 1.1.0.
- FASM supports both FMA3 and FMA4 instructions.
References
- [1] Woltmann, George (Prime95). "Intel AVX and GIMPS". http://www.mersenneforum.org/index.php. Great Internet Mersenne Prime Search (GIMPS) project. Retrieved 27 July 2011. Quote: "FMA3 and FMA4 are not instruction sets, they are individual instructions -- fused multiply add. They could be quite useful depending on how Intel and AMD implement them."
- [2] "Striking a balance". Dave Christie, AMD Developer blogs. May 7, 2009. Retrieved 2009-05-08.
- [3] Maffeo, Robin. "AMD and the Visual Studio 11 Beta". AMD. Retrieved 19 April 2012.
- [4] "AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions". AMD. May 1, 2009.
- [5] "128-Bit SSE5 Instruction Set". AMD Developer Central. Archived from the original on 2008-01-15. Retrieved 2008-01-28.
- [6] "Intel Advanced Vector Extensions Programming Reference". Intel. Retrieved 2008-04-05.
- [7] "Intel Advanced Vector Extensions Programming Reference". Intel. Retrieved 2009-05-06.
- [8] "Striking a balance". Dave Christie, AMD Developer blogs. May 7, 2009. Retrieved 2009-05-08.
- [9] "Software Optimization Guide for AMD Family 15h Processors". AMD. Retrieved 19 April 2012.
- [10] Latif, Lawrence (Nov 14 2011). "AMD Bulldozer only FMA4 and XOP instructions are supported by GCC Intel still mute". The Inquirer.
- [11] "FMA4 Intrinsics Added for Visual Studio 2010 SP1".