AVX-512

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Carewolf (talk | contribs) at 10:50, 27 April 2016 (CPUs with AVX-512: The current skylake E3 doesn't support AVX-512 unfortunately). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA). It was proposed by Intel in July 2013 and is supported in Intel's Knights Landing processor.[1] AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors. The earlier 512-bit SIMD instructions used in Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.[1]

AVX-512 consists of multiple extensions, and a processor implementing AVX-512 is not expected to support all of them. Only the core extension AVX-512F (AVX-512 Foundation) is required by all implementations.

Instruction set

The AVX-512 instruction set consists of several separate sets, each with its own unique CPUID feature bit; however, they are typically grouped by the processor generation that supports them.

F, CDI, ERI, PFI
Introduced with Xeon Phi Knights Landing and Skylake (future Xeon "Purley" only), with the last two (ERI and PFI) being specific to Knights Landing.
  • AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX instructions with EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control, supported by Knights Landing and Skylake Xeon
  • AVX-512 Conflict Detection Instructions (CDI) – efficient conflict detection to allow more loops to be vectorized, supported by Knights Landing[1] and Skylake Xeon
  • AVX-512 Exponential and Reciprocal Instructions (ERI) – exponential and reciprocal operations designed to help implement transcendental operations, supported by Knights Landing[1]
  • AVX-512 Prefetch Instructions (PFI) – new prefetch capabilities, supported by Knights Landing[1]
BW, DQ, VL
Introduced with Skylake (only Xeon "Purley" expected 2017).
  • AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations[2]
  • AVX-512 Doubleword and Quadword Instructions (DQ) – adds new 32-bit and 64-bit AVX-512 instructions[2]
  • AVX-512 Vector Length Extensions (VL) – extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers[2]
IFMA, VBMI
Future extensions scheduled for Cannonlake.[3]
  • AVX-512 Integer Fused Multiply Add (IFMA) – fused multiply-add of integers using 52-bit precision
  • AVX-512 Vector Byte Manipulation Instructions (VBMI) – adds vector byte permutation instructions which were not present in AVX-512BW
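
As an illustration of how the separate feature bits can be tested, the following Python sketch decodes raw CPUID register values into extension names. The bit positions (CPUID leaf 7, sub-leaf 0, registers EBX and ECX) are taken from Intel's published CPUID documentation rather than from this article and should be checked against the current reference. The function name and the sample register value are hypothetical.

```python
# Bit positions in CPUID leaf 7, sub-leaf 0. These are an assumption based on
# Intel's CPUID documentation; they are not stated in the article itself.
EBX_BITS = {
    "AVX512F": 16,
    "AVX512DQ": 17,
    "AVX512IFMA": 21,
    "AVX512PF": 26,
    "AVX512ER": 27,
    "AVX512CD": 28,
    "AVX512BW": 30,
    "AVX512VL": 31,
}
ECX_BITS = {"AVX512VBMI": 1}


def decode_avx512_features(ebx, ecx=0):
    """Return the set of AVX-512 extensions flagged in raw CPUID register values."""
    found = {name for name, bit in EBX_BITS.items() if (ebx >> bit) & 1}
    found |= {name for name, bit in ECX_BITS.items() if (ecx >> bit) & 1}
    return found


# Hypothetical register value with F, CD, ER and PF set (a Knights Landing-like mix).
knl_like_ebx = (1 << 16) | (1 << 28) | (1 << 27) | (1 << 26)
print(sorted(decode_avx512_features(knl_like_ebx)))
```

Because only AVX-512F is mandatory, code that relies on any other extension would need to test its bit individually in this way.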

Encoding and features

The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This led Intel to define a new prefix called EVEX.

Compared to VEX, EVEX adds the following benefits:[4]

  • Expanded register encoding, allowing 32 512-bit registers.
  • Support for up to 4 operands.
  • 7 new opmask registers for masking most AVX-512 instructions.
  • A new scalar memory mode that automatically performs a broadcast.
  • Room for explicit rounding control in each instruction.
  • A new compressed displacement memory addressing mode.

The extended registers, SIMD width bit, and opmask registers of AVX-512 are mandatory and all require support from the OS.

SIMD modes

The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. In addition, the AVX-512VL extension allows the use of AVX-512 instructions on the 128/256-bit XMM/YMM registers, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix, which give access to new features such as opmasks and the additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, so the source code alone does not show whether the VEX or the EVEX encoded version of an instruction is meant. Since AVX-512F only supports 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are only supported with the AVX-512BW extension (Byte & Word support).[4]

Name Extension sets Registers Types
Legacy SSE SSE-SSE4.2 xmm0-xmm15 bytes, words, doublewords, quadwords, single float and double float (from SSE2)
AVX-128 (VEX) AVX, AVX2 xmm0-xmm15 single float and double float. From AVX2: bytes, words, doublewords, quadwords
AVX-256 (VEX) AVX, AVX2 ymm0-ymm15 single float and double float. From AVX2: bytes, words, doublewords, quadwords
AVX-128 (EVEX) AVX-512VL xmm0-xmm31 (k1-k7) doublewords, quadwords, single float and double float. With AVX512BW: bytes and words
AVX-256 (EVEX) AVX-512VL ymm0-ymm31 (k1-k7) doublewords, quadwords, single float and double float. With AVX512BW: bytes and words
AVX-512 (EVEX) AVX-512F zmm0-zmm31 (k1-k7) doublewords, quadwords, single float and double float. With AVX512BW: bytes and words

Extended registers

AVX-512 register scheme as extension from the AVX (YMM0-YMM15) and SSE (XMM0-XMM15) registers
511–256 255–128 127–0
ZMM0 YMM0 XMM0
ZMM1 YMM1 XMM1
ZMM2 YMM2 XMM2
ZMM3 YMM3 XMM3
ZMM4 YMM4 XMM4
ZMM5 YMM5 XMM5
ZMM6 YMM6 XMM6
ZMM7 YMM7 XMM7
ZMM8 YMM8 XMM8
ZMM9 YMM9 XMM9
ZMM10 YMM10 XMM10
ZMM11 YMM11 XMM11
ZMM12 YMM12 XMM12
ZMM13 YMM13 XMM13
ZMM14 YMM14 XMM14
ZMM15 YMM15 XMM15
ZMM16 YMM16 XMM16
ZMM17 YMM17 XMM17
ZMM18 YMM18 XMM18
ZMM19 YMM19 XMM19
ZMM20 YMM20 XMM20
ZMM21 YMM21 XMM21
ZMM22 YMM22 XMM22
ZMM23 YMM23 XMM23
ZMM24 YMM24 XMM24
ZMM25 YMM25 XMM25
ZMM26 YMM26 XMM26
ZMM27 YMM27 XMM27
ZMM28 YMM28 XMM28
ZMM29 YMM29 XMM29
ZMM30 YMM30 XMM30
ZMM31 YMM31 XMM31

The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256-bit YMM registers from the AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using the EVEX encoded form.

Opmask registers

Most AVX-512 instructions may indicate one of 8 opmask registers (k0–k7). The first one, k0, is special: selecting it as the mask indicates an unmasked operation, so only k1–k7 can act as masks. In most instructions the opmask controls which values are written to the destination. A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to that of the blend instructions.

The opmask registers are normally 16 bits wide, but can be up to 64 bits wide with the AVX-512BW extension.[4] How many of the bits are actually used, though, depends on the vector type of the instructions masked. For 32-bit single floats or doublewords, 16 bits are used to mask the 16 elements in a 512-bit register. For double floats and quadwords, at most 8 mask bits are used.

The opmask register is the reason why several bitwise instructions, which naturally have no element width, gained one in AVX-512. For instance, bitwise AND, OR and the 128-bit shuffle now exist in both doubleword and quadword variants, the only difference being the granularity of the final masking.
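
The masking behavior described in this section can be sketched in a few lines of Python. This is a software model for illustration only, not how the hardware is implemented. The function names are hypothetical, and vectors are represented as plain lists with element 0 corresponding to mask bit 0.

```python
def apply_opmask(result, old_dest, mask, zeroing=False):
    """Model of opmask write control.

    result   -- the unmasked result of the operation, one entry per element
    old_dest -- previous contents of the destination register
    mask     -- integer whose bit i controls element i
    zeroing  -- True for "zero" behavior, False for "merge" behavior
    """
    out = []
    for i, (new, old) in enumerate(zip(result, old_dest)):
        if (mask >> i) & 1:
            out.append(new)   # element selected by the mask: write the result
        elif zeroing:
            out.append(0)     # "zero": unselected elements are cleared
        else:
            out.append(old)   # "merge": unselected elements are left untouched
    return out


def mask_bits_used(vector_bits, element_bits):
    """Number of opmask bits that matter for a given vector and element width."""
    return vector_bits // element_bits


print(apply_opmask([10, 20, 30, 40], [1, 2, 3, 4], 0b0101))
print(apply_opmask([10, 20, 30, 40], [1, 2, 3, 4], 0b0101, zeroing=True))
print(mask_bits_used(512, 32), mask_bits_used(512, 64), mask_bits_used(512, 8))
```

The last line reflects why a bitwise operation needs an element width at all: the same 512 bits are controlled by 16 mask bits when treated as doublewords and by 8 when treated as quadwords.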

New opmask instructions

The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit (Word) versions. With AVX-512DQ, 8-bit (Byte) versions are added to better match the needs of masking 8 64-bit values, and with AVX-512BW, 32-bit (Double) and 64-bit (Quad) versions are added so they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the classic x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.

Instruction Extension set Description
KAND F Bitwise logical AND Masks
KANDN F Bitwise logical AND NOT Masks
KMOV F Move from and to Mask Registers
KUNPCK F Unpack for Mask Registers
KNOT F NOT Mask Register
KOR F Bitwise logical OR Masks
KORTEST F OR Masks And Set Flags
KSHIFTL F Shift Left Mask Registers
KSHIFTR F Shift Right Mask Registers
KXNOR F Bitwise logical XNOR Masks
KXOR F Bitwise logical XOR Masks
KADD BW/DQ Add Two Masks
KTEST BW/DQ Bitwise comparison and set flags
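
The way KORTEST connects mask registers to ordinary branch instructions can be modeled as follows. The flag semantics used here (ZF set when the OR of the two masks is all zeros, CF set when it is all ones) are assumed from Intel's instruction reference rather than stated in this article, and the function name is hypothetical.

```python
def kortest(k1, k2, width=16):
    """Model of KORTEST: OR two masks and derive the ZF and CF flags.

    Assumed semantics: ZF is set if the OR is all zeros, CF if it is all ones.
    width is the mask width in bits (16 for the AVX-512F word version).
    """
    all_ones = (1 << width) - 1
    combined = (k1 | k2) & all_ones
    return {"ZF": combined == 0, "CF": combined == all_ones}


# A loop could branch on ZF to skip work when no element is selected at all.
print(kortest(0x0000, 0x0000))
print(kortest(0xFF00, 0x00FF))
```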

New instructions in AVX-512 foundation

Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, as well as old instructions that have been replaced with new AVX-512 versions. The new or substantially reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW, since those extensions merely add new versions of these instructions rather than new instructions.

Blend using mask

There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.

Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.

Instruction Extension set Description
VBLENDMPD F Blend float64 vectors using opmask control
VBLENDMPS F Blend float32 vectors using opmask control
VPBLENDMD F Blend int32 vectors using opmask control
VPBLENDMQ F Blend int64 vectors using opmask control
VPBLENDMB BW Blend byte integer vectors using opmask control
VPBLENDMW BW Blend word integer vectors using opmask control
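
A minimal Python model of a mask-controlled blend is given below. It assumes that a set mask bit selects the element from the second source and a clear bit selects the first source (or zero in zeroing mode). The function name is hypothetical and the lists stand in for vector registers.

```python
def blend_with_mask(src1, src2, mask, zeroing=False):
    """Model of a VBLENDM-style blend controlled by an opmask.

    For each element i: if mask bit i is set, take src2[i]; otherwise take
    src1[i] (merge mode) or 0 (zeroing mode).
    """
    return [
        b if (mask >> i) & 1 else (0 if zeroing else a)
        for i, (a, b) in enumerate(zip(src1, src2))
    ]


print(blend_with_mask([1, 2, 3, 4], [10, 20, 30, 40], 0b0011))
# In zeroing mode the blend acts as a pure masking operation on src2.
print(blend_with_mask([1, 2, 3, 4], [10, 20, 30, 40], 0b0011, zeroing=True))
```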

Compare into mask

AVX-512F has four new compare instructions. Like their XOP counterparts, they use the immediate field to select between 8 different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for the instructions: one to write to and one to act as a regular write mask.[4]

Immediate Comparison Description
0 EQ Equal
1 LT Less than
2 LE Less than or equal
3 FALSE Set to zero
4 NEQ Not equal
5 NLT Greater than or equal
6 NLE Greater than
7 TRUE Set to one
Instruction Extension set Description
VPCMPD, VPCMPUD F Compare signed/unsigned doublewords into mask
VPCMPQ, VPCMPUQ F Compare signed/unsigned quadwords into mask
VPCMPB, VPCMPUB BW Compare signed/unsigned bytes into mask
VPCMPW, VPCMPUW BW Compare signed/unsigned words into mask
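
The immediate-selected comparison can be modeled in Python as a table of eight predicates. The sketch below follows the immediate encoding listed above. The function name and the optional `write_mask` argument (standing in for the second, regular opmask) are hypothetical.

```python
import operator

# Predicates indexed by the 3-bit immediate, in the order of the table above.
PREDICATES = {
    0: operator.eq,              # EQ
    1: operator.lt,              # LT
    2: operator.le,              # LE
    3: lambda a, b: False,       # FALSE
    4: operator.ne,              # NEQ
    5: operator.ge,              # NLT
    6: operator.gt,              # NLE
    7: lambda a, b: True,        # TRUE
}


def compare_into_mask(a, b, imm, write_mask=None):
    """Compare two vectors element-wise and return the result as a bit mask.

    Bit i of the result is set when the selected predicate holds for element i
    and element i is enabled by write_mask (all elements if write_mask is None).
    """
    predicate = PREDICATES[imm & 7]
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        enabled = write_mask is None or (write_mask >> i) & 1
        if enabled and predicate(x, y):
            mask |= 1 << i
    return mask


print(bin(compare_into_mask([1, 5, 3, 7], [2, 5, 1, 9], imm=1)))  # LT
```

The resulting integer can then serve as the opmask of a following masked instruction, which is how a compare and a masked move together give a per-element conditional select.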

Logical set mask

The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND and then set each destination opmask bit according to whether the corresponding result element is zero or non-zero. Like the comparison instructions, these take two opmask registers: one as the destination and one as a regular opmask.

Instruction Extension set Description
VPTESTMD, VPTESTMQ F Logical AND and set mask for 32 or 64 bit integers.
VPTESTNMD, VPTESTNMQ F Logical NAND and set mask for 32 or 64 bit integers.
VPTESTMB, VPTESTMW BW Logical AND and set mask for 8 or 16 bit integers.
VPTESTNMB, VPTESTNMW BW Logical NAND and set mask for 8 or 16 bit integers.

Compress and expand

The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress stores only the values marked in the mask, and stores them compacted, skipping unmarked values without reserving space for them. Expand operates in the opposite way, loading as many values as are marked in the mask and spreading them to the selected positions.

Instruction Description
VCOMPRESSPD, VCOMPRESSPS Store sparse packed double/single-precision floating-point values into dense memory
VPCOMPRESSD, VPCOMPRESSQ Store sparse packed doubleword/quadword integer values into dense memory/register
VEXPANDPD, VEXPANDPS Load sparse packed double/single-precision floating-point values from dense memory
VPEXPANDD, VPEXPANDQ Load sparse packed doubleword/quadword integer values from dense memory/register
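
The two operations can be modeled in Python as below. This is a sketch of the semantics described above using lists in place of registers and memory, and the function names are hypothetical.

```python
def compress(values, mask):
    """Keep only the elements whose mask bit is set, packed contiguously."""
    return [v for i, v in enumerate(values) if (mask >> i) & 1]


def expand(packed, mask, old_dest, zeroing=False):
    """Spread consecutive packed values to the positions selected by the mask.

    Unselected positions keep the old destination value (merge) or become 0
    (zeroing).
    """
    source = iter(packed)
    out = []
    for i, old in enumerate(old_dest):
        if (mask >> i) & 1:
            out.append(next(source))
        else:
            out.append(0 if zeroing else old)
    return out


data = [10, 11, 12, 13, 14, 15, 16, 17]
mask = 0b10100101                 # selects elements 0, 2, 5 and 7
packed = compress(data, mask)
print(packed)
print(expand(packed, mask, [0] * 8))
```

Expanding the compressed values with the same mask puts each value back in its original position, which shows that the two operations are inverses on the selected elements.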

Permute

A new set of permute instructions has been added for full two-input permutations. They all take three arguments, two source registers and one index, and output the result by overwriting either the first source register or the index register. AVX-512BW extends the instructions to also include 16-bit (word) versions, but not 8-bit (byte) versions. The byte versions are considered separate instructions and are part of the AVX-512VBMI extension.

Instruction Extension set Description
VPERMB VBMI Permute packed bytes elements.
VPERMW BW Permute packed words elements.
VPERMT2B VBMI Full byte permute overwriting first source.
VPERMT2W BW Full word permute overwriting first source.
VPERMI2PD, VPERMI2PS F Full single/double floating point permute overwriting the index.
VPERMI2D, VPERMI2Q F Full doubleword/quadword permute overwriting the index.
VPERMI2B VBMI Full byte permute overwriting the index.
VPERMI2W BW Full word permute overwriting the index.
VPERMT2PS, VPERMT2PD F Full single/double floating point permute overwriting first source.
VPERMT2D, VPERMT2Q F Full doubleword/quadword permute overwriting first source.
VSHUFF32x4, VSHUFF64x2, VSHUFI32x4, VSHUFI64x2 F Shuffle four packed 128-bit lanes.
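
A two-input permute can be thought of as a lookup into the concatenation of both sources. The Python sketch below models only that selection step. It returns a new list instead of overwriting the first source or the index register, and it assumes the element count is a power of two so that surplus index bits can simply be masked off. The function name is hypothetical.

```python
def permute2(src_a, src_b, index):
    """Model of a full two-source permute.

    Each index selects one element from the 2*n-entry table formed by src_a
    followed by src_b. Index bits above those needed to address the table are
    ignored (n is assumed to be a power of two).
    """
    n = len(src_a)
    table = list(src_a) + list(src_b)
    return [table[i & (2 * n - 1)] for i in index]


# Interleave the low halves of two four-element vectors.
print(permute2([0, 1, 2, 3], [10, 11, 12, 13], [0, 4, 1, 5]))
```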

Bitwise ternary logic

Two new instructions can implement every possible bitwise operation on three inputs. They take three registers as input and an 8-bit immediate field. Each bit in the output is generated by using the three corresponding input bits as an index that selects one of the 8 bit positions in the immediate. Since three bits allow only 8 combinations, this allows all possible three-input bitwise operations to be performed.[4] These are the only bitwise vector instructions in AVX-512F; EVEX versions of the two-source SSE and AVX bitwise vector instructions AND, ANDN, OR and XOR were added in AVX-512DQ.

The difference in the doubleword and quadword versions is only the application of the opmask.

Instruction Description
VPTERNLOGD, VPTERNLOGQ Bitwise Ternary Logic

Examples:

A0 A1 A2 Double AND (0x80) Double OR (0xFE) Bitwise blend (0xCA)
0 0 0 0 0 0
0 0 1 0 1 1
0 1 0 0 1 0
0 1 1 0 1 1
1 0 0 0 1 0
1 0 1 0 1 0
1 1 0 0 1 1
1 1 1 1 1 1
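
The lookup described above can be reproduced with a short Python model. It assumes, as in the table, that A0 supplies the most significant bit of the 3-bit index and A2 the least significant. The function name is hypothetical and integers stand in for registers.

```python
def ternlog(a, b, c, imm8, width=32):
    """Model of a bitwise ternary logic operation.

    For every bit position, the bits of a, b and c form a 3-bit index
    (a is the high bit, c the low bit) that selects one bit of imm8.
    """
    result = 0
    for bit in range(width):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        result |= ((imm8 >> index) & 1) << bit
    return result


# With these three inputs every index 0..7 occurs exactly once across the
# eight bit positions, so the result equals the immediate itself.
A, B, C = 0xF0, 0xCC, 0xAA
print(hex(ternlog(A, B, C, 0x80, width=8)))  # three-input AND
print(hex(ternlog(A, B, C, 0xFE, width=8)))  # three-input OR
print(hex(ternlog(A, B, C, 0xCA, width=8)))  # bitwise blend: A ? B : C
```

A convenient consequence is that the immediate for any desired function can be found by evaluating that function on the constants 0xF0, 0xCC and 0xAA.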

Conversions

A number of conversion or move instructions were added, which complete the set of conversion instructions available from SSE2.

Instruction Extension set Description

VPMOVQD, VPMOVSQD, VPMOVUSQD, VPMOVQW, VPMOVSQW, VPMOVUSQW, VPMOVQB, VPMOVSQB, VPMOVUSQB, VPMOVDW, VPMOVSDW, VPMOVUSDW, VPMOVDB, VPMOVSDB, VPMOVUSDB F Down convert quadword or doubleword to doubleword, word or byte; unsaturated, saturated or saturated unsigned. The reverse of the sign/zero extend instructions from SSE4.1.
VPMOVWB, VPMOVSWB, VPMOVUSWB BW Down convert word to byte; unsaturated, saturated or saturated unsigned.
VCVTPS2UDQ, VCVTPD2UDQ, VCVTTPS2UDQ, VCVTTPD2UDQ F Convert with or without truncation, packed single or double-precision floating point to packed unsigned doubleword integers.
VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI F Convert with or without truncation, scalar single or double-precision floating point to unsigned doubleword integer.
VCVTPS2QQ, VCVTPD2QQ, VCVTPS2UQQ, VCVTPD2UQQ, VCVTTPS2QQ, VCVTTPD2QQ, VCVTTPS2UQQ, VCVTTPD2UQQ DQ Convert with or without truncation, packed single or double-precision floating point to packed signed or unsigned quadword integers.
VCVTUDQ2PS, VCVTUDQ2PD F Convert packed unsigned doubleword integers to packed single or double-precision floating point.
VCVTUSI2SS, VCVTUSI2SD F Convert scalar unsigned integers to single or double-precision floating point.
VCVTUQQ2PS, VCVTUQQ2PD DQ Convert packed unsigned quadword integers to packed single or double-precision floating point.
VCVTQQ2PD, VCVTQQ2PS DQ Convert packed quadword integers to packed single or double-precision floating point.
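
The three down-conversion flavors (unsaturated, saturated, saturated unsigned) differ only in how out-of-range values are handled. The following Python sketch models a single element. The helper names are hypothetical and Python integers stand in for the source elements.

```python
def truncate(value, dst_bits):
    """Unsaturated down convert: keep only the low dst_bits bits."""
    return value & ((1 << dst_bits) - 1)


def saturate_signed(value, dst_bits):
    """Signed saturation: clamp a signed value to the signed destination range."""
    lowest = -(1 << (dst_bits - 1))
    highest = (1 << (dst_bits - 1)) - 1
    return max(lowest, min(highest, value))


def saturate_unsigned(value, dst_bits):
    """Unsigned saturation: clamp a non-negative value to the unsigned range."""
    return min(value, (1 << dst_bits) - 1)


big = 0x1_0000_0005               # does not fit in 32 bits
print(hex(truncate(big, 32)))     # quadword to doubleword, unsaturated
print(hex(saturate_signed(big, 32)))
print(hex(saturate_unsigned(big, 32)))
```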

Floating point decomposition

Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these instructions are completely new, they also exist in scalar versions.

Instruction Description
VGETEXPPD, VGETEXPPS Convert exponents of packed fp values into fp values
VGETEXPSD, VGETEXPSS Convert exponent of scalar fp value into fp value
VGETMANTPD, VGETMANTPS Extract vector of normalized mantissas from float32/float64 vector
VGETMANTSD, VGETMANTSS Extract float32/float64 of normalized mantissa from float32/float64 scalar
VFIXUPIMMPD, VFIXUPIMMPS Fix up special packed float32/float64 values
VFIXUPIMMSD, VFIXUPIMMSS Fix up special scalar float32/float64 value
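
The decomposition performed by the exponent and mantissa instructions can be approximated in Python with `math.frexp`. The sketch below covers only finite, non-zero, normal inputs and assumes the mantissa is normalized to the interval [1, 2) with the sign of the source preserved. The real instructions offer other normalization intervals and sign controls and define results for special values, none of which is modeled here. The function names are hypothetical.

```python
import math


def getexp(x):
    """Unbiased exponent of x as a float, i.e. floor(log2(|x|))."""
    _, exponent = math.frexp(abs(x))   # abs(x) == m * 2**exponent, 0.5 <= m < 1
    return float(exponent - 1)


def getmant(x):
    """Mantissa of x normalized to [1, 2), keeping the sign of x."""
    mantissa, _ = math.frexp(abs(x))
    return math.copysign(2.0 * mantissa, x)


for value in (10.0, -0.375):
    exponent, mantissa = getexp(value), getmant(value)
    # The two parts recombine to the original value.
    print(value, exponent, mantissa, mantissa * 2.0 ** exponent)
```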

Floating point arithmetics

This is the second set of new floating-point instructions, which includes new scaling and approximate calculation of the reciprocal and of the reciprocal of the square root. The approximate reciprocal instructions guarantee a relative error of at most 2^−14.[4]

Instruction Description
VRCP14PD, VRCP14PS Compute approximate reciprocals of packed float32/float64 values
VRCP14SD, VRCP14SS Compute approximate reciprocals of scalar float32/float64 value
VRNDSCALEPS, VRNDSCALEPD Round packed float32/float64 values to include a given number of fraction bits
VRNDSCALESS, VRNDSCALESD Round scalar float32/float64 value to include a given number of fraction bits
VRSQRT14PD, VRSQRT14PS Compute approximate reciprocals of square roots of packed float32/float64 values
VRSQRT14SD, VRSQRT14SS Compute approximate reciprocal of square root of scalar float32/float64 value
VSCALEFPS, VSCALEFPD Scale packed float32/float64 values with float32/float64 values
VSCALEFSS, VSCALEFSD Scale scalar float32/float64 value with float32/float64 value
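
Two of the operations in this table have simple arithmetic definitions that can be sketched in Python. The model assumes that scaling multiplies the first operand by two raised to the floor of the second, and that rounding to a given number of fraction bits uses round-to-nearest-even, whereas the real instruction lets the immediate choose the rounding mode. Special values are not handled and the function names are hypothetical.

```python
import math


def scalef(x, y):
    """Scale x by a power of two: x * 2**floor(y)."""
    return math.ldexp(x, math.floor(y))


def rndscale(x, fraction_bits):
    """Round x so that it keeps only fraction_bits binary fraction bits."""
    scale = 2.0 ** fraction_bits
    return round(x * scale) / scale    # Python's round() is round-half-to-even


print(scalef(3.0, 2.7))       # 3.0 * 2**2
print(rndscale(2.71875, 2))   # nearest multiple of 1/4
```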

Miscellaneous

Instruction Extension set Description
VALIGND, VALIGNQ F Align doubleword/quadword vectors
VDBPSADBW VL Double block packed SAD unsigned bytes
VPBROADCASTD, VPBROADCASTQ VL Load with Broadcast integer doubleword/quadword from GPR/memory
VPBROADCASTB, VPBROADCASTW BW Load with Broadcast integer byte/word from GPR/memory
VBROADCASTF32X4, VBROADCASTF64X4 VL 512-bit versions of VBROADCASTF128 from AVX, with either 32bit or 64bit masking.
VBROADCASTI32X4, VBROADCASTI64X4 VL 512-bit versions of VBROADCASTI128 from AVX2, with either 32bit or 64bit masking.
VBROADCASTF32X2, VBROADCASTF64X2, VBROADCASTF32X8 DQ 64/128/256-bit versions of VBROADCASTF128 from AVX, with either 32bit or 64bit masking.
VBROADCASTI32X2, VBROADCASTI64X2, VBROADCASTI32X8 DQ 64/128/256-bit versions of VBROADCASTI128 from AVX2, with either 32bit or 64bit masking.
VEXTRACTF32X4, VEXTRACTF64X4 VL 512-bit versions of VEXTRACTF128 from AVX, with either 32bit or 64bit masking.
VEXTRACTI32X4, VEXTRACTI64X4 VL 512-bit versions of VEXTRACTI128 from AVX2, with either 32bit or 64bit masking.
VEXTRACTF64X2, VEXTRACTF32X8 DQ 128/256-bit versions of VEXTRACTF128 from AVX, with either 32bit or 64bit masking.
VEXTRACTI64X2, VEXTRACTI32X8 DQ 128/256-bit versions of VEXTRACTI128 from AVX2, with either 32bit or 64bit masking.
VINSERTF32x4, VINSERTF64x4 VL 512-bit versions of VINSERTF128 from AVX, with either 32bit or 64bit masking.
VINSERTI32X4, VINSERTI64X4 VL 512-bit versions of VINSERTI128 from AVX2, with either 32bit or 64bit masking.
VINSERTF64X2, VINSERTF32X8 DQ 128/256-bit versions of VINSERTF128 from AVX, with either 32bit or 64bit masking.
VINSERTI64X2, VINSERTI32X8 DQ 128/256-bit versions of VINSERTI128 from AVX2, with either 32bit or 64bit masking.
VPABSQ F Packed absolute value quadword
VPMAXSQ, VPMAXUQ F Maximum of packed signed/unsigned quadword
VPMINSQ, VPMINUQ F Minimum of packed signed/unsigned quadword
VPMULTISHIFTQB VBMI Select packed unaligned bytes from quadword sources
VPROLD, VPROLVD, VPROLQ, VPROLVQ, VPRORD, VPRORVD, VPRORQ, VPRORVQ F Bit rotate left/right
VPSCATTERDD, VPSCATTERDQ, VPSCATTERQD, VPSCATTERQQ F Scatter packed doubleword/quadword with signed doubleword and quadword indices
VSCATTERDPS, VSCATTERDPD, VSCATTERQPS, VSCATTERQPD F Scatter packed float32/float64 with signed doubleword and quadword indices
VMOVDQA32, VMOVDQA64 F/VL Move aligned packed integers
VMOVDQU8, VMOVDQU16, VMOVDQU32, VMOVDQU64 VL/BW Move unaligned packed integers
VPXORD, VPXORQ F/VL Exclusive OR doubleword/quadword

New instructions in AVX-512 conflict detection

The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.[5]

Instruction Name Description
VPCONFLICTD, VPCONFLICTQ Detect conflicts within a vector of packed doubleword or quadword values. Compares each element of the source with all elements at earlier (lower-indexed) positions and forms a bit vector of the matches.
VPLZCNTD, VPLZCNTQ Count the number of leading zero bits for packed doubleword or quadword values. Vectorized LZCNT instruction.
VPBROADCASTMB2Q, VPBROADCASTMW2D Broadcast mask to vector register. Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector.
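
The conflict detection step can be modeled in Python as below. The sketch assumes each element is compared with the elements at lower positions only and that bit j of the result for element i records a match with element j. The function name is hypothetical.

```python
def vpconflict(elements):
    """Model of conflict detection within one vector.

    For element i, bit j (j < i) of the result is set when elements[j]
    equals elements[i]. Element 0 therefore always yields 0.
    """
    out = []
    for i, value in enumerate(elements):
        bits = 0
        for j in range(i):
            if elements[j] == value:
                bits |= 1 << j
        out.append(bits)
    return out


# Indices used by a scatter: positions 0, 2 and 3 all target index 3.
print(vpconflict([3, 5, 3, 3, 5]))
```

Elements whose result is zero have no earlier duplicate, so they form a subset that can be processed together without two lanes updating the same location.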

New instructions in AVX-512 exponential and reciprocal

AVX-512 exponential and reciprocal instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; relative error is at most 2^−28. They also contain two new exponential functions that have a relative error of at most 2^−23.[4]

Instruction Description
VEXP2PD, VEXP2PS Compute approximate exponential 2^x of packed single or double-precision floating point values
VRCP28PD, VRCP28PS Compute approximate reciprocals of packed single or double-precision floating point values
VRCP28SD, VRCP28SS Compute approximate reciprocal of scalar single or double-precision floating point value
VRSQRT28PD, VRSQRT28PS Compute approximate reciprocals of square roots of packed single or double-precision floating point values
VRSQRT28SD, VRSQRT28SS Compute approximate reciprocal of square root of scalar single or double-precision floating point value

New instructions in AVX-512 prefetch

AVX-512 prefetch instructions contain new prefetch operations for the new scatter and gather functionality introduced in AVX2 and AVX-512. T0 prefetch means prefetching into level 1 cache and T1 means prefetching into level 2 cache.

Instruction Description
VGATHERPF0DPS, VGATHERPF0QPS, VGATHERPF0DPD, VGATHERPF0QPD Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
VGATHERPF1DPS, VGATHERPF1QPS, VGATHERPF1DPD, VGATHERPF1QPD Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
VSCATTERPF0DPS, VSCATTERPF0QPS, VSCATTERPF0DPD, VSCATTERPF0QPD Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
VSCATTERPF1DPS, VSCATTERPF1QPS, VSCATTERPF1DPD, VSCATTERPF1QPD Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double precision data using writemask k1 and T1 hint with intent to write.

New instructions in AVX-512 BW and DQ

AVX-512BW adds byte and word versions of instructions in AVX-512F, and adds AVX-512 versions of several byte and word instructions that previously had none. AVX-512DQ adds new instructions for doubleword and quadword values, and AVX-512BW adds byte and word versions of the same instructions. Two new instructions were added to the mask instruction set; for those, see the earlier section.

Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F, including all the two-input bitwise instructions and the extract/insert integer instructions.

Instructions that are completely new are covered below.

Floating point instructions

Three new floating point operations are introduced. Since they are entirely new, rather than AVX-512 versions of existing instructions, they have both packed/SIMD and scalar versions.

The VFPCLASS instructions test whether a floating point value belongs to one of eight special classes of floating-point values; the immediate field controls which of the eight classes will set a bit in the output mask register. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which can also control whether the operation is done on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source, and subtract from it the integer part of the source value plus a number of bits of its fraction specified in the immediate field.

Instruction Extension set Description
VFPCLASSPS, VFPCLASSPD DQ Test types of packed single and double precision floating point values.
VFPCLASSSS, VFPCLASSSD DQ Test types of scalar single and double precision floating point values.
VRANGEPS, VRANGEPD DQ Range restriction calculation for packed floating point values.
VRANGESS, VRANGESD DQ Range restriction calculation for scalar floating point values.
VREDUCEPS, VREDUCEPD DQ Perform reduction transformation on packed floating point values.
VREDUCESS, VREDUCESD DQ Perform reduction transformation on scalar floating point values.

Other instructions

Instruction Extension set Description
VPMOVM2D, VPMOVM2Q DQ Convert mask register to double- or quad-word vector register.
VPMOVM2B, VPMOVM2W BW Convert mask register to byte or word vector register.
VPMOVD2M, VPMOVQ2M DQ Convert double- or quad-word vector register to mask register.
VPMOVB2M, VPMOVW2M BW Convert byte or word vector register to mask register.
VPMULLQ DQ Multiply packed quadword store low result. A quadword version of VPMULLD.

New instructions in AVX-512 IFMA

Instruction Extension set Description
VPMADD52LUQ IFMA Packed multiply of unsigned 52-bit integers and add the low 52-bit products to 64-bit accumulators
VPMADD52HUQ IFMA Packed multiply of unsigned 52-bit integers and add the high 52-bit products to 64-bit accumulators
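
The arithmetic of the two IFMA instructions can be sketched for a single 64-bit lane in Python. The model assumes that only the low 52 bits of each multiplicand are used, that the 104-bit product is split into a low and a high 52-bit half, and that the addition wraps modulo 2^64. The function names simply mirror the mnemonics.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def vpmadd52luq(acc, a, b):
    """Add the low 52 bits of the 52x52-bit product to a 64-bit accumulator."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product & MASK52)) & MASK64


def vpmadd52huq(acc, a, b):
    """Add the high 52 bits of the 52x52-bit product to a 64-bit accumulator."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


a, b = (1 << 51) + 1, 4           # product is 2**53 + 4
print(vpmadd52luq(10, a, b))      # low half of the product is 4
print(vpmadd52huq(10, a, b))      # high half of the product is 2
```

Splitting the product this way leaves 12 spare bits in each 64-bit accumulator, which is what makes the instructions attractive for multi-word (big integer) arithmetic with 52-bit limbs.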

Legacy instructions upgraded with EVEX encoded versions

Legacy encoding Group Instructions AVX-512 extensions
SSE SSE2/SSE4.1 AVX AVX2/FMA
Yes Yes Yes No VADD VADDPD, VADDPS, VADDSD, VADDSS
VAND VANDPD, VANDPS, VANDNPD, VANDNPS
VCMP VCMPPD, VCMPPS, VCMPSD, VCMPSS
VCOM VCOMISD, VCOMISS
VDIV VDIVPD, VDIVPS, VDIVSD, VDIVSS
VCVT VCVTDQ2PD, VCVTDQ2PS, VCVTPD2DQ, VCVTPD2PS, VCVTPH2PS, VCVTPS2PH, VCVTPS2DQ, VCVTPS2PD, VCVTSD2SI, VCVTSD2SS, VCVTSI2SD, VCVTSI2SS, VCVTSS2SD, VCVTSS2SI, VCVTTPD2DQ, VCVTTPS2DQ, VCVTTSD2SI, VCVTTSS2SI
VMAX VMAXPD, VMAXPS, VMAXSD, VMAXSS
VMIN VMINPD, VMINPS, VMINSD, VMINSS
VMOV VMOVAPD, VMOVAPS, VMOVD, VMOVQ, VMOVDDUP, VMOVHLPS, VMOVHPD, VMOVHPS, VMOVLHPS, VMOVLPD, VMOVLPS, VMOVNTDQA, VMOVNTDQ, VMOVNTPD, VMOVNTPS, VMOVSD, VMOVSHDUP, VMOVSLDUP, VMOVSS, VMOVUPD, VMOVUPS
VMUL VMULPD, VMULPS, VMULSD, VMULSS
VOR VORPD, VORPS VL, DQ
VSQRTP VSQRTPD, VSQRTPS, VSQRTSD, VSQRTSS VL, F
VSUB VSUBPD, VSUBPS, VSUBSD, VSUBSS
VUCOMI VUCOMISD, VUCOMISS
VUNPCK VUNPCKHPD, VUNPCKHPS, VUNPCKLPD, VUNPCKLPS
VXOR VXORPD, VXORPS VL, DQ
No Yes Yes No VINSERTPS VINSERTPS
VPALIGNR VPALIGNR
VPEXTR VPEXTRB, VPEXTRW, VPEXTRD, VPEXTRQ
VPINSR VPINSRB, VPINSRW, VPINSRD, VPINSRQ BW, DQ
No Yes Yes Yes VPAB VPABSB, VPABSW, VPABSD BW, VL, F
VPACK VPACKSSWB, VPACKSSDW, VPACKUSDW, VPACKUSWB BW, VL, F
VPADD VPADDB, VPADDW, VPADDD, VPADDQ, VPADDSB, VPADDSW, VPADDUSB, VPADDUSW
VPAND VPAND, VPANDN
VPAVG VPAVGB, VPAVGW
VPCMPEQ VPCMPEQB, VPCMPEQW, VPCMPEQD, VPCMPEQQ
VPCMPGT VPCMPGTB, VPCMPGTW, VPCMPGTD, VPCMPGTQ
VPMADD VPMADDUBSW VPMADDWD
VPMAX VPMAXSB, VPMAXSW, VPMAXSD, VPMAXUB, VPMAXUW, VPMAXUD BW, VL, F
VPMIN VPMINSB, VPMINSW, VPMINSD, VPMINUB, VPMINUW, VPMINUD BW, VL, F
VPMOV VPMOVSXBW, VPMOVSXBD, VPMOVSXBQ, VPMOVSXWD, VPMOVSXWQ, VPMOVSXDQ, VPMOVZXBW, VPMOVZXBD, VPMOVZXBQ, VPMOVZXWD, VPMOVZXWQ, VPMOVZXDQ BW, VL, F
VPMUL VPMULDQ, VPMULUDQ, VPMULHRSW, VPMULHUW, VPMULHW, VPMULLD, VPMULLQ, VPMULLW BW, VL, F
VPOR VPORD, VPORQ
VPSUB VPSUBB, VPSUBW, VPSUBD, VPSUBQ, VPSUBSB, VPSUBSW, VPSUBUSB, VPSUBUSW BW, VL, F
VPUNPCK VPUNPCKHBW, VPUNPCKHWD, VPUNPCKHDQ, VPUNPCKHQDQ, VPUNPCKLBW, VPUNPCKLWD, VPUNPCKLDQ, VPUNPCKLQDQ BW, VL, F
VPXOR VPXORD, VPXORQ VL, F
No No Yes Yes VEXTRACT VEXTRACTF128, VEXTRACTI128, VEXTRACTPS
VPSADBW VPSADBW
VINSERT VINSERTF128, VINSERTI128
VPERM VPERMD, VPERMILPD, VPERMILPS, VPERMPD, VPERMPS, VPERMQ
VBROADCAST VBROADCASTSS, VBROADCASTSD, VBROADCASTF128, VBROADCASTI128
VPBROADCAST VPBROADCASTB, VPBROADCASTW, VPBROADCASTD, VPBROADCASTQ
No No No Yes VFMADD VFMADD132PD, VFMADD213PD, VFMADD231PD, VFMADD132PS, VFMADD213PS, VFMADD231PS, VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMADD132SS, VFMADD213SS, VFMADD231SS
VFMADDSUB VFMADDSUB132PD, VFMADDSUB213PD, VFMADDSUB231PD, VFMADDSUB132PS, VFMADDSUB213PS, VFMADDSUB231PS
VFMSUBADD VFMSUBADD132PD, VFMSUBADD213PD, VFMSUBADD231PD, VFMSUBADD132PS, VFMSUBADD213PS, VFMSUBADD231PS
VFMSUB VFMSUB132PD, VFMSUB213PD, VFMSUB231PD, VFMSUB132PS, VFMSUB213PS, VFMSUB231PS, VFMSUB132SD, VFMSUB213SD, VFMSUB231SD, VFMSUB132SS, VFMSUB213SS, VFMSUB231SS
VFNMADD VFNMADD132PD, VFNMADD213PD, VFNMADD231PD, VFNMADD132PS, VFNMADD213PS, VFNMADD231PS, VFNMADD132SD, VFNMADD213SD, VFNMADD231SD, VFNMADD132SS, VFNMADD213SS, VFNMADD231SS
VFNMSUB VFNMSUB132PD, VFNMSUB213PD, VFNMSUB231PD, VFNMSUB132PS, VFNMSUB213PS, VFNMSUB231PS, VFNMSUB132SD, VFNMSUB213SD, VFNMSUB231SD, VFNMSUB132SS, VFNMSUB213SS, VFNMSUB231SS
VGATHER VGATHERDPS, VGATHERDPD, VGATHERQPS, VGATHERQPD
VPGATHER VPGATHERDD, VPGATHERDQ, VPGATHERQD, VPGATHERQQ
VPSRAV VPSRAVW, VPSRAVD, VPSRAVQ BW, VL, F
Yes Yes Yes Yes VPSHUF VPSHUFB, VPSHUFHW, VPSHUFLW, VPSHUFD, VPSLLDQ, VPSLLW, VPSLLD, VPSLLQ, VPSRAW, VPSRAD, VPSRAQ, VPSRLDQ, VPSRLW, VPSRLD, VPSRLQ, VPSLLVW, VPSLLVD, VPSLLVQ, VPSRLVW, VPSRLVD, VPSRLVQ, VSHUFPD, VSHUFPS BW, VL, F

CPUs with AVX-512

Performance tools for AVX-512 analysis

Intel "Vectorization" Advisor (starting from version 2016 Update 3) supports native AVX-512 performance and vector code quality analysis for the 2nd generation Intel Xeon Phi processor (codenamed Knights Landing). Along with the traditional hotspots profile, Advisor recommendations and integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. scatter, compress/expand and mask utilization.[10][11]

References

  1. ^ a b c d e f James Reinders (23 July 2013). "AVX-512 Instructions". Intel. Retrieved 20 August 2013.
  2. ^ a b c James Reinders (17 July 2014). "Additional AVX-512 instructions". Intel. Retrieved 3 August 2014.
  3. ^ Anton Shilov. "Intel 'Skylake' processors for PCs will not support AVX-512 instructions". Retrieved 2015-03-17.
  4. ^ a b c d e f g "Intel Architecture Instruction Set Extensions Programming Reference" (PDF). Intel. Retrieved 2014-01-29.
  5. ^ "AVX-512 Architecture/Demikhovsky Poster" (PDF). Intel. Retrieved 25 February 2014.
  6. ^ http://vr-zone.com/articles/xeon-phi-knights-series-continues-landing-2015/64112.html
  7. ^ https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=get&target=Cauldron14_AVX-512_Vector_ISA_Kirill_Yukhin_20140711.pdf
  8. ^ http://www.itworld.com/article/2985214/hardware/intels-xeon-roadmap-for-2016-leaks.html
  9. ^ http://www.phoronix.com/scan.php?page=news_item&px=Intel-Cannonlake-Clang
  10. ^ https://software.intel.com/en-us/articles/intel-advisor-xe-2016-update-3-what-s-new
  11. ^ https://software.intel.com/en-us/intel-advisor-xe