AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for the x86 instruction set architecture (ISA), proposed by Intel in July 2013 and scheduled to be supported in 2015 with Intel's Knights Landing processor. AVX-512 is the latest ISA based on Intel's Larrabee project; although related to them, it is not compatible with the earlier 512-bit AVX-like vector instruction sets in the Xeon Phi line of processors.
AVX-512 consists of multiple extensions, not all of which are meant to be supported by every processor that implements AVX-512. Only the core extension AVX-512F (AVX-512 Foundation) is required by all implementations.
The instruction set consists of the following extensions.
Introduced for Knights Landing:
- AVX-512 Foundation – expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control
- AVX-512 Conflict Detection Instructions (CDI) – efficient conflict detection to allow more loops to be vectorized, supported by Knights Landing and Skylake
- AVX-512 Exponential and Reciprocal Instructions (ERI) – exponential and reciprocal operations designed to help implement transcendental operations, supported by Knights Landing
- AVX-512 Prefetch Instructions (PFI) – new prefetch capabilities, supported by Knights Landing
Introduced for Skylake:
- AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations
- AVX-512 Doubleword and Quadword Instructions (DQ) – adds new 32-bit and 64-bit AVX-512 instructions
- AVX-512 Vector Length Extensions (VL) – extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers
Named but not announced for any specific new CPU:
- AVX-512 Integer Fused Multiply Add (IFMA)
- AVX-512 Vector Byte Manipulation Instructions (VBMI) – adds vector byte permutation instructions that were not present in AVX-512BW
Encoding and features
Compared to VEX, EVEX adds the following benefits:
- Expanded register encoding allowing 32 512-bit registers.
- Adds 8 new opmask registers (k0–k7), of which k1–k7 can be used for masking most AVX-512 instructions.
- Adds a new scalar memory mode that automatically performs a broadcast.
- Adds room for explicit rounding control in each instruction.
- Adds a new compressed displacement memory addressing mode.
Unlike mixing AVX and SSE instructions, mixing AVX-512 and AVX instructions is designed to incur no performance penalty. AVX and AVX2 were therefore expected to remain the way to do 128-bit and 256-bit SIMD, but because the EVEX encoding introduces both new registers and new instructions, AVX-512VL adds extensions that apply AVX-512 to 128-bit and 256-bit registers. As a result, most SSE and AVX instructions have new AVX-512 versions that give them access to the features above, such as opmasks and the additional addressable registers. Unlike AVX-256, the new instructions do not have new names but share a namespace with AVX, which makes the distinction between the VEX-encoded and EVEX-encoded versions of an instruction ambiguous.
Since AVX-512F only supports 32- and 64-bit values, SSE/AVX2 instructions that operate on bytes or words are only supported by the extension AVX-512BW (Byte & Word support). 
|Name||Extension sets||Registers||Types|
|Legacy SSE||SSE-SSE4.2||xmm0-xmm15||bytes, words, doublewords, quadwords, single float and double float (from SSE2)|
|AVX-128 (VEX)||AVX, AVX2||xmm0-xmm15||single float and double float. From AVX2: bytes, words, doublewords, quadwords|
|AVX-256 (VEX)||AVX, AVX2||ymm0-ymm15||single float and double float. From AVX2: bytes, words, doublewords, quadwords|
|AVX-128 (EVEX)||AVX-512VL||xmm0-xmm31 (k1-k7)||doublewords, quadwords, single float and double float. With AVX-512BW: bytes and words|
|AVX-256 (EVEX)||AVX-512VL||ymm0-ymm31 (k1-k7)||doublewords, quadwords, single float and double float. With AVX-512BW: bytes and words|
|AVX-512 (EVEX)||AVX-512F||zmm0-zmm31 (k1-k7)||doublewords, quadwords, single float and double float. With AVX-512BW: bytes and words|
|Bits 511–256 (ZMM only)||Bits 255–128 (ZMM and YMM)||Bits 127–0 (ZMM, YMM and XMM)|
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers, ZMM0-ZMM31. These registers can be addressed as 256-bit YMM registers from the AVX extensions and as 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using the EVEX-encoded form.
The extended registers, SIMD width, and opmask registers of AVX-512 all require OS support. Each of the three, however, has its own feature bits. In theory this would allow support to be indicated for only some of the features, but all three are required for AVX-512. Only the opmask registers may be used alone to extend traditional AVX-256 without full AVX-512 support.
Although all features are required for AVX-512, that does not mean the extended registers only work in 512-bit mode. All of the new registers 16-31 are also available to the AVX-128 and AVX-256 modes of the EVEX prefix.
Most AVX-512 instructions may specify one of 8 opmask registers (k0–k7). The first one, k0, is however a hardcoded constant used to indicate unmasked operation. In most instructions the opmask controls which values are written to the destination. A flag selects the opmask behavior: "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to that of the blend instructions.
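The two masking behaviors can be illustrated with a scalar model. The sketch below is not AVX-512 code: it uses Python lists in place of vector registers, and the function name `masked_add` is hypothetical, chosen only for this illustration.

```python
def masked_add(dst, a, b, mask, zeroing=False):
    """Scalar model of an element-wise add performed under an opmask."""
    out = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(x + y)   # element selected by the mask: write the result
        elif zeroing:
            out.append(0)       # "zero": unselected elements are cleared
        else:
            out.append(old)     # "merge": unselected elements keep the old value
    return out


dst = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
merged = masked_add(dst, a, b, mask=0b0101)
zeroed = masked_add(dst, a, b, mask=0b0101, zeroing=True)
```

With mask bits 0 and 2 set, only elements 0 and 2 receive the sum; the other two elements either keep the previous destination value (merge) or become zero.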
The opmask registers are normally 16-bit wide, but can be up to 64 bits with the AVX-512BW extension. How many of the bits are actually used, though, depends on the vector type of the instructions masked. For the 32-bit single float or double words, 16 bits are used to mask the 16 elements in a 512-bit register. For double float and quad words, at most 8 mask bits are used.
The opmask register is the reason why several bitwise instructions, which naturally have no element width, had one added in AVX-512. For instance, bitwise AND, OR and 128-bit shuffle now exist in both doubleword and quadword variants, the only difference being in the final masking.
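A small scalar model may make the role of the element width clearer. In this sketch a Python integer stands in for a 512-bit register, zero-masking is assumed, and the helper name `masked_and` is hypothetical.

```python
def masked_and(a, b, mask, elem_bits, total_bits=512):
    """Zero-masked bitwise AND; integers stand in for vector registers."""
    lanes = total_bits // elem_bits        # number of mask bits actually used
    lane_mask = (1 << elem_bits) - 1
    result = 0
    for i in range(lanes):
        if (mask >> i) & 1:                # keep only the lanes the mask selects
            lane = ((a & b) >> (i * elem_bits)) & lane_mask
            result |= lane << (i * elem_bits)
    return result, lanes


all_ones = (1 << 512) - 1
dword_result, dword_lanes = masked_and(all_ones, all_ones, mask=0b1, elem_bits=32)
qword_result, qword_lanes = masked_and(all_ones, all_ones, mask=0b1, elem_bits=64)
```

The same mask value keeps the low 32 bits in the doubleword form and the low 64 bits in the quadword form, and the forms use 16 and 8 mask bits respectively, even though the underlying AND is identical.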
New opmask instructions
The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit (Word) versions. With AVX-512DQ, 8-bit (Byte) versions are added to better match the needs of masking 8 64-bit values, and with AVX-512BW, 32-bit (Double) and 64-bit (Quad) versions are added so that they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the classic x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.
||AVX-512F||Bitwise logical AND Masks|
||AVX-512F||Bitwise logical AND NOT Masks|
||AVX-512F||Move from and to Mask Registers|
||AVX-512F||Unpack for Mask Registers|
||AVX-512F||NOT Mask Register|
||AVX-512F||Bitwise logical OR Masks|
||AVX-512F||OR Masks And Set Flags|
||AVX-512F||Shift Left Mask Registers|
||AVX-512F||Shift Right Mask Registers|
||AVX-512F||Bitwise logical XNOR Masks|
||AVX-512F||Bitwise logical XOR Masks|
||AVX-512BW/DQ||Add Two Masks|
||AVX-512BW/DQ||Bitwise comparison and set flags|
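As an illustration of how a mask can drive ordinary branches, the following sketch models the flag result of KORTEST on 16-bit masks. The flag assignment used here (zero flag when the OR of the two masks is all zeros, carry flag when it is all ones) follows Intel's documented behavior for the instruction rather than anything stated above, and the function name is hypothetical.

```python
def kortest(k1, k2, width=16):
    """Scalar model of KORTEST: OR two masks and derive the ZF and CF flags."""
    all_ones = (1 << width) - 1
    combined = (k1 | k2) & all_ones
    zero_flag = combined == 0           # no element selected by either mask
    carry_flag = combined == all_ones   # every element selected
    return zero_flag, carry_flag


none_set = kortest(0x0000, 0x0000)
all_set = kortest(0x00FF, 0xFF00)
some_set = kortest(0x0001, 0x0000)
```

A loop that processes a vector under a mask could, for example, branch on the zero flag to skip work when no element is selected.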
New instructions in AVX-512 foundation
Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are however several new instructions, and old instructions that have been replaced with new AVX-512 versions. The new or majorly reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW since those extensions merely add new versions of these instructions instead of new instructions.
Blend using mask
There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.
Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.
||AVX-512F||Blend float64 vectors using opmask control|
||AVX-512F||Blend float32 vectors using opmask control|
||AVX-512F||Blend int32 vectors using opmask control|
||AVX-512F||Blend int64 vectors using opmask control|
||AVX-512BW||Blend byte integer vectors using opmask control|
||AVX-512BW||Blend word integer vectors using opmask control|
Compare into mask
AVX-512F has four new compare instructions. Like their XOP counterparts, they use the immediate field to select between 8 different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for these instructions: one to write the result to and one to apply regular masking.
|Immediate||Comparison||Description|
|2||LE||Less than or equal|
|3||FALSE||Set to zero|
|5||NLT||Greater than or equal|
|7||TRUE||Set to one|
||AVX-512F||Compare signed/unsigned doublewords into mask|
||AVX-512F||Compare signed/unsigned quadwords into mask|
||AVX-512BW||Compare signed/unsigned bytes into mask|
||AVX-512BW||Compare signed/unsigned words into mask|
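The combination of a compare into a mask and a mask-controlled blend can be sketched with a scalar model. This is only an illustration using Python lists. Predicates 2, 3, 5 and 7 are the ones listed in the table above; predicates 0, 1, 4 and 6 (EQ, LT, NEQ, NLE) are filled in from Intel's documented encoding and are an addition to this text. The function names are hypothetical.

```python
import operator

# Immediate value -> comparison. Entries 0, 1, 4 and 6 are not in the table
# above; they are taken from Intel's documented predicate encoding.
PREDICATES = {
    0: operator.eq,            # EQ
    1: operator.lt,            # LT
    2: operator.le,            # LE
    3: lambda x, y: False,     # FALSE
    4: operator.ne,            # NEQ
    5: operator.ge,            # NLT (not less than)
    6: operator.gt,            # NLE (not less than or equal)
    7: lambda x, y: True,      # TRUE
}


def compare_into_mask(a, b, imm, write_mask=None):
    """Return a mask with bit i set where the predicate holds for a[i], b[i]."""
    if write_mask is None:
        write_mask = (1 << len(a)) - 1          # unmasked operation
    predicate = PREDICATES[imm & 7]
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if (write_mask >> i) & 1 and predicate(x, y):
            result |= 1 << i
    return result


def blend(a, b, mask):
    """Pick b[i] where the mask bit is set, otherwise a[i]."""
    return [y if (mask >> i) & 1 else x for i, (x, y) in enumerate(zip(a, b))]


a = [1, 8, 3, 7]
b = [5, 2, 3, 9]
greater = compare_into_mask(a, b, 6)     # mask of positions where a > b
maximum = blend(b, a, greater)           # element-wise "a > b ? a : b"
limited = compare_into_mask(a, b, 2, write_mask=0b0011)
```

Here the compare produces the selector and the blend consumes it, which is the ternary-operation pattern described in the blend section. The last call shows the second mask operand restricting which positions are compared at all.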
Logical set mask
The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND, and then set the destination opmask based on whether each result value is zero or non-zero. Note that, like the comparison instructions, these take two opmask registers: one as the destination and one as a regular opmask.
||Logical AND and Set Mask|
||Logical NAND and Set Mask|
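A scalar sketch of the two operations, assuming that the AND form sets a mask bit when the element-wise AND is non-zero and the NAND form sets it when the AND is zero. The function name `logical_set_mask` is hypothetical and Python lists stand in for vector registers.

```python
def logical_set_mask(a, b, negate=False, write_mask=None):
    """Set mask bit i from (a[i] & b[i]) != 0, or == 0 when negate is True."""
    if write_mask is None:
        write_mask = (1 << len(a)) - 1      # unmasked operation
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        nonzero = (x & y) != 0
        if (write_mask >> i) & 1 and (nonzero != negate):
            result |= 1 << i
    return result


a = [0b1100, 0b0011, 0b0000, 0b1010]
b = [0b0100, 0b1100, 0b1111, 0b0101]
and_mask = logical_set_mask(a, b)
nand_mask = logical_set_mask(a, b, negate=True)
nand_limited = logical_set_mask(a, b, negate=True, write_mask=0b0110)
```

Only the first pair of elements shares a set bit, so the AND form selects element 0 and the NAND form selects the remaining three.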
Compress and expand
The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress only saves the values marked in the mask, but saves them compacted, skipping unmarked values without reserving space for them. Expand operates in the opposite way, loading as many values as indicated in the mask and then spreading them to the selected positions.
||Store sparse packed double/single-precision floating-point values into dense memory|
||Store sparse packed doubleword/quadword integer values into dense memory/register|
||Load sparse packed double/single-precision floating-point values from dense memory|
||Load sparse packed doubleword/quadword integer values from dense memory/register|
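The two operations can be modelled on plain lists. This is a sketch of the semantics described above, not of the instructions themselves, and the function names are hypothetical. The expand model assumes merge behavior for unselected positions.

```python
def compress(values, mask):
    """Keep only the elements whose mask bit is set, packed contiguously."""
    return [v for i, v in enumerate(values) if (mask >> i) & 1]


def expand(packed, mask, dst):
    """Spread packed values to the positions selected by the mask.

    Positions not selected keep the value from dst (merge behavior assumed).
    """
    out = list(dst)
    source = iter(packed)
    for i in range(len(out)):
        if (mask >> i) & 1:
            out[i] = next(source)
    return out


values = [10, 11, 12, 13, 14, 15, 16, 17]
mask = 0b10100110                      # selects positions 1, 2, 5 and 7
packed = compress(values, mask)
restored = expand(packed, mask, [0] * 8)
```

Compressing and then expanding with the same mask returns each selected value to its original position, which is why the pair is useful for filtering elements in a vectorized loop.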
A new set of permute instructions has been added for full two-input permutations. They all take three arguments, two source registers and one index, and the result is output by overwriting either the first source register or the index register. AVX-512BW extends the instructions to also include 16-bit (word) versions, but not 8-bit (byte) versions. The byte versions are considered separate instructions and are part of the separate AVX-512VBMI extension.
||AVX-512F||Full 64-bit permute overwriting the index.|
||AVX-512F||Full 32-bit permute overwriting the index.|
||AVX-512BW||Full 16-bit permute overwriting the index.|
||AVX-512F||Full 64-bit permute overwriting first source.|
||AVX-512F||Full 32-bit permute overwriting first source.|
||AVX-512BW||Full 16-bit permute overwriting first source.|
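A scalar sketch of a full two-input permutation follows. It assumes that the low bits of each index select an element and the next bit selects between the two sources, and it returns a new list rather than overwriting an operand, since the two instruction forms differ only in which register receives the result. The function name is hypothetical.

```python
def permute2(src_a, src_b, index):
    """Select each output element from the concatenation of two sources."""
    n = len(src_a)                     # assumed to be a power of two
    table = src_a + src_b              # indices n..2n-1 address src_b
    return [table[i & (2 * n - 1)] for i in index]


src_a = [0, 1, 2, 3]
src_b = [10, 11, 12, 13]
picked = permute2(src_a, src_b, [7, 0, 4, 3])
interleaved = permute2(src_a, src_b, [0, 4, 1, 5])
```

Because any output position can take any element from either source, operations such as interleaving two vectors become a single permutation with a constant index.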
Bitwise ternary logic
Two new instructions can implement every possible bitwise operation on three inputs. They take three registers as input plus an 8-bit immediate field. Each bit of the output is generated by using the three corresponding input bits as an index that selects one of the 8 bit positions in the immediate. Since three bits allow only 8 combinations, the immediate can describe every possible three-input bitwise operation. AVX-512F also provides EVEX versions of the two-source integer bitwise instructions AND, ANDN, OR and XOR in doubleword and quadword forms; EVEX versions of their floating-point counterparts were added in AVX-512DQ.
The difference in the doubleword and quadword versions is only the application of the opmask.
||Bitwise Ternary Logic|
|A0||A1||A2||Double AND (0x80)||Double OR (0xFE)||Bitwise blend (0xCA)|
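The lookup can be written out as a short scalar model. The sketch assumes that the first input supplies the most significant bit of the 3-bit index, which is consistent with the three example immediates in the table above. The function name is hypothetical and Python integers stand in for vector elements.

```python
def ternary_logic(a, b, c, imm8, width=32):
    """For each bit position, use the bits of a, b, c as an index into imm8."""
    result = 0
    for bit in range(width):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        result |= ((imm8 >> index) & 1) << bit
    return result


a, b, c = 0xF0F0F0F0, 0xCCCCCCCC, 0xAAAAAAAA
double_and = ternary_logic(a, b, c, 0x80)
double_or = ternary_logic(a, b, c, 0xFE)
bit_blend = ternary_logic(a, b, c, 0xCA)   # a ? b : c, bit by bit
```

The three inputs chosen here enumerate all 8 index values within every byte, so each result byte simply equals the immediate. The same trick is a convenient way to derive the immediate for a desired function: evaluate the function on 0xF0, 0xCC and 0xAA.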
Floating point decomposition
Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.
||Convert exponents of packed fp values into fp values|
||Convert exponent of scalar fp value into fp value|
||Extract vector of normalized mantissas from float32/float64 vector|
||Extract float32/float64 of normalized mantissa from float32/float64 scalar|
||Fix up special packed float32/float64 values|
||Fix up special scalar float32/float64 value|
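The decomposition into exponent and normalized mantissa can be sketched with Python's `math.frexp`. This model covers only finite, non-zero inputs and assumes the mantissa is normalized to the interval [1, 2) with the sign of the source kept; the real instructions handle special values and offer other options that are not modelled here. The function names are hypothetical.

```python
import math


def get_exp(x):
    """Exponent of x as a float: floor(log2(|x|)) for finite, non-zero x."""
    _, exponent = math.frexp(abs(x))   # frexp gives |x| = m * 2**e, m in [0.5, 1)
    return float(exponent - 1)


def get_mant(x):
    """Mantissa of x normalized to [1, 2), keeping the sign of x."""
    mantissa, _ = math.frexp(x)
    return mantissa * 2.0


samples = [10.0, 0.15625, -6.0]
parts = [(get_mant(x), get_exp(x)) for x in samples]
rebuilt = [m * 2.0 ** e for m, e in parts]
```

Multiplying the mantissa by two raised to the exponent reproduces the original value, which is the property that makes the pair useful when implementing functions such as logarithms.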
Floating point arithmetic
This is the second set of new floating-point methods, which includes new scaling instructions and approximate calculation of the reciprocal and of the reciprocal of the square root. The approximate reciprocal instructions are guaranteed to have a relative error of at most 2^−14.
||Compute approximate reciprocals of packed float32/float64 values|
||Compute approximate reciprocals of scalar float32/float64 value|
||Round packed float32/float64 values to include a given number of fraction bits|
||Round scalar float32/float64 value to include a given number of fraction bits|
||Compute approximate reciprocals of square roots of packed float32/float64 values|
||Compute approximate reciprocal of square root of scalar float32/float64 value|
||Scale packed float32/float64 values with float32/float64 values|
||Scale scalar float32/float64 value with float32/float64 value|
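The scaling and rounding operations can be sketched as follows. The model assumes that scaling multiplies the first operand by two raised to the floor of the second, and that rounding uses round-to-nearest-even; the real rounding instruction selects its rounding mode through the immediate, and special values are not modelled. The function names are hypothetical.

```python
import math


def scalef(x, y):
    """Scale x by 2 raised to floor(y)."""
    return x * 2.0 ** math.floor(y)


def rndscale(x, fraction_bits):
    """Round x so that it keeps the given number of binary fraction bits."""
    scale = 2.0 ** fraction_bits
    return round(x * scale) / scale    # Python's round is round-half-to-even


scaled_up = scalef(3.0, 2.7)           # 3 * 2**2
scaled_down = scalef(3.0, -1.2)        # 3 * 2**-2
quarter_steps = rndscale(2.71828, 2)   # nearest multiple of 1/4
whole = rndscale(2.71828, 0)           # nearest integer
```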
New instructions in AVX-512 conflict detection
The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.
||Detect conflicts within a vector of packed doubleword or quadword values.||Compares each element in the first source to all elements in the same or earlier positions in the second source and forms a bit vector of the results.|
||Count the number of leading zero bits for packed doubleword or quadword values.||Vectorized leading-zero count.|
||Broadcast mask to vector register.||Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector.|
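A scalar sketch of conflict detection and of the leading-zero count follows. The conflict model compares each element with the elements at strictly earlier positions of the same vector, which is how Intel documents the instruction; treat the exact comparison range as an assumption of this sketch. The function names are hypothetical.

```python
def conflict(values):
    """For each element, a bit vector of earlier positions holding the same value."""
    out = []
    for i, v in enumerate(values):
        bits = 0
        for j in range(i):             # only positions before i are compared
            if values[j] == v:
                bits |= 1 << j
        out.append(bits)
    return out


def lzcnt(value, width=32):
    """Number of leading zero bits in an unsigned value of the given width."""
    return width - value.bit_length()


indices = [3, 5, 3, 7, 5, 3]           # e.g. scatter indices used by a loop
conflicts = conflict(indices)
conflict_free = [i for i, bits in enumerate(conflicts) if bits == 0]
```

Elements whose result is zero do not repeat any earlier index, so they form a subset that can be processed together safely; the remaining elements have to wait for a later pass.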
New instructions in AVX-512 exponential and reciprocal
AVX-512 exponential and reciprocal instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; relative error is at most 2^−28. They also contain two new exponential functions that have a relative error of at most 2^−23.
||Compute approximate exponential 2^x of packed single or double-precision floating point values|
||Compute approximate reciprocals of packed single or double-precision floating point values|
||Compute approximate reciprocal of scalar single or double-precision floating point value|
||Compute approximate reciprocals of square roots of packed single or double-precision floating point values|
||Compute approximate reciprocal of square root of scalar single or double-precision floating point value|
New instructions in AVX-512 prefetch
AVX-512 prefetch instructions contain new prefetch operations for the new scatter and gather functionality introduced in AVX2 and AVX-512.
T0 prefetch means prefetching into the level 1 cache, and T1 means prefetching into the level 2 cache.
||Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.|
||Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.|
||Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.|
||Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double precision data using writemask k1 and T1 hint with intent to write.|
New instructions in AVX-512 BW and DQ
AVX-512BW adds byte and word versions of instructions in AVX-512F, and adds AVX-512 versions of several byte and word instructions that previously had none. AVX-512DQ adds new instructions for doublewords and quadwords, and AVX-512BW adds byte and word versions of the same instructions. Two new instructions were added to the mask instruction set; for those, see the earlier section.
Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F, among them the two-input bitwise instructions and the extract/insert integer instructions.
Instructions that are completely new are covered below.
Floating point instructions
Three new floating point operations are introduced. Since they are new not only to AVX-512, they have both packed/SIMD and scalar versions.
The VFPCLASS instructions test whether the floating point value is one of eight special floating-point values; which of the eight values will set a bit in the output mask register is controlled by the immediate field.
The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which can also control whether the operation is done on absolute values and, separately, how the sign is handled.
The VREDUCE instructions operate on a single source, and subtract from it the integer part of the source value plus a number of bits of its fraction specified in the immediate field.
||Test types of packed single and double precision floating point values.|
||Test types of scalar single and double precision floating point values.|
||Range restriction calculation for packed floating point values.|
||Range restriction calculation for scalar floating point values.|
||Perform reduction transformation on packed floating point values.|
||Perform reduction transformation on scalar floating point values.|
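The reduction transformation can be sketched in scalar form. The model below follows the description above by truncating the source to its integer part plus a given number of fraction bits and subtracting that from the source. The real instructions take their rounding mode from the immediate field, so truncation is an assumption of this sketch, as is the function name.

```python
import math


def reduce_fraction(x, fraction_bits):
    """Subtract from x its integer part plus the leading fraction bits."""
    scale = 2.0 ** fraction_bits
    kept = math.trunc(x * scale) / scale   # integer part plus fraction_bits bits
    return x - kept


whole_removed = reduce_fraction(5.8125, 0)     # 5.8125 - 5
two_bits_removed = reduce_fraction(5.8125, 2)  # 5.8125 - 5.75
```

The result is the small remainder left after the coarse part of the value has been removed, a step that is typically used for argument reduction when approximating transcendental functions.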
||AVX-512DQ||Convert mask register to doubleword or quadword vector register.|
||AVX-512BW||Convert mask register to byte or word vector register.|
||AVX-512DQ||Convert doubleword or quadword vector register to mask register.|
||AVX-512BW||Convert byte or word vector register to mask register.|
||AVX-512BW||Down convert word to byte. Unsaturated, saturated and saturated unsigned.|
||AVX-512DQ||Multiply packed quadwords and store the low result. A quadword version of VPMULLD.|
||AVX-512DQ||256-bit versions of VEXTRACTF128 from AVX, with either 32-bit or 64-bit masking.|
||AVX-512DQ||256-bit versions of VEXTRACTI128 from AVX2, with either 32-bit or 64-bit masking.|
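The conversions between mask registers and vector registers listed above can be modelled in a few lines. The sketch assumes that a set mask bit becomes an element of all ones and a clear bit an element of all zeros, and that the reverse conversion takes the most significant bit of each element; this follows Intel's documented behavior rather than anything stated in the table. The function names are hypothetical.

```python
def mask_to_vector(mask, lanes, elem_bits):
    """Expand each mask bit into an all-ones or all-zeros element."""
    all_ones = (1 << elem_bits) - 1
    return [all_ones if (mask >> i) & 1 else 0 for i in range(lanes)]


def vector_to_mask(values, elem_bits):
    """Collect the most significant bit of each element into a mask."""
    mask = 0
    for i, v in enumerate(values):
        if (v >> (elem_bits - 1)) & 1:
            mask |= 1 << i
    return mask


expanded = mask_to_vector(0b0101, lanes=4, elem_bits=32)
collected = vector_to_mask([0x80000000, 0x7FFFFFFF, 0xFFFFFFFF, 0], elem_bits=32)
round_trip = vector_to_mask(expanded, elem_bits=32)
```

Converting a mask to a vector and back returns the original mask, which lets code move between the opmask style and the older all-ones/all-zeros vector masks used by SSE and AVX.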
CPUs with AVX-512