Permute instruction: Difference between revisions

Content deleted Content added

Inline

Revision as of 10:39, 10 July 2021

Permute (and Shuffle) instructions, part of bit manipulation as well as vector processing, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array.^[1] The size (bitwidth) of the source elements is not restricted but remains the same as the destination size.

There exists two important permute variants, known as gather and scatter, respectively. The gather variant is as follows:

for i = 0 to length-1
    dest[i] = src[indices[i]]

where the scatter variant is:

for i = 0 to length-1
    dest[indices[[i]] = src[i]

Note that unlike in memory-based gather-scatter all three of dest, src, and indices are registers (or parts of registers in the case of bit-level permute), not memory locations.

The scatter variant can be seen to "scatter" the source elements across (into) to the destination, where the "gather" variant is gathering data from the indexed source elements.

Given that the indices may be repeated in both variants, the resultant output is not a strict mathematical permutation because duplicates can occur in the output.

A special case of permute is also used in GPU "swizzling" (again, not strictly a permutation) which performs on-the-fly reordering of subvector data so as to align or duplicate elements with the appropriate SIMD lane.

Occurrences of permute instructions

Permute instructions occur in both scalar processors as well as vector processing engines as well as GPUs. In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in RISC-V vectors,^[2] and take Vectors as input for both source elements and source array, and output another Vector.

In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked, SIMD style) where the fragments are then used as array sources. The (small, partial) results are then concatenated (packed) back into the scalar register as the result.

Some ISAs, particularly for cryptographic applications, even have bit-level permute operations, such as bdep (bit deposit) in RISC-V bitmanip;^[3] in the Power ISA it is known as bpermd and has been included for several decades, and is still in the Power ISA v.3.0 B spec.^[4]

Also in some non-vector ISAs, due to there sometimes being insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), will include partial reordering instructions. Examples include VSHUFF32x4 from AVX-512.

Permute operations in different forms are surprisingly common, occurring in AltiVec, Power ISA, PowerPC G4, AVX-512, SVE2^[5] and in Vector processors as well as GPUs. They are sufficiently important that LLVM has included a shufflevector^[6] intrinsic and gcc as __builtin_shuffle.^[7] GCC's intrinsic matches the functionality of OpenCL's shuffle intrinsics.^[8] Note that all of these, mathematically, are not permutations because duplicates can occur in the output.

References

^ Intel® 64 and IA-32 architectures software developer’s manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4 (PDF). Intel. June 2021. p. 5-356 Vol. 2C.
^ "RISC-V "V" Vector Extension – 16.4. Vector Register Gather Instructions". GitHub – riscv/riscv-v-spec. Retrieved 2021-07-10.
^ "riscv/riscv-bitmanip". GitHub. Retrieved 2021-07-10.
^ "Power ISA Version 3.0 B". Power.org. 2017-03-27. Retrieved 2019-08-11.
^ ARM HPC, SVE2 Extension summary, p32
^ "LLVM 13 documentation: shufflevector". LLVM Language Reference Manual. Retrieved 2021-07-10.
^ "Vector Extensions (Using the GNU Compiler Collection (GCC))". GCC, the GNU Compiler Collection - GNU Project - Free Software Foundation (FSF). Retrieved 2021-07-10.
^ "OpenCL Specification: shuffle, shuffle2". The Khronos Group Inc. Retrieved 2021-07-10.

External links

This computing article is a stub. You can help Wikipedia by expanding it.

This mathematics-related article is a stub. You can help Wikipedia by expanding it.

[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4 (PDF). Intel. June 2021. p. 5-356 Vol. 2C.

[2] "RISC-V "V" Vector Extension – 16.4. Vector Register Gather Instructions". GitHub – riscv/riscv-v-spec. Retrieved 2021-07-10.

[3] "riscv/riscv-bitmanip". GitHub. Retrieved 2021-07-10.

[isa307b-4] "Power ISA Version 3.0 B". Power.org. 2017-03-27. Retrieved 2019-08-11.

[5] ARM HPC, SVE2 Extension summary, p32

[LLVM_Language_Reference_Manual-6] "LLVM 13 documentation: shufflevector". LLVM Language Reference Manual. Retrieved 2021-07-10.

[7] "Vector Extensions (Using the GNU Compiler Collection (GCC))". GCC, the GNU Compiler Collection - GNU Project - Free Software Foundation (FSF). Retrieved 2021-07-10.

[8] "OpenCL Specification: shuffle, shuffle2". The Khronos Group Inc. Retrieved 2021-07-10.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

@@ Line 28: / Line 28: @@
 == Occurrences of permute instructions ==
-Permute instructions occur in both [[scalar processor]]s as well as [[vector processing]] engines as well as [[Graphics processing unit|GPUs]].  In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in [[RISC-V]] vectors,<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions]</ref> and take Vectors as input for both source elements and source array, and output another Vector.
+Permute instructions occur in both [[scalar processor]]s as well as [[vector processing]] engines as well as [[Graphics processing unit|GPUs]].  In vector instruction sets they are typically named "Register Gather/Scatter" operations such as in [[RISC-V]] vectors,<ref>{{cite web | title=RISC-V "V" Vector Extension – 16.4. Vector Register Gather Instructions | website=GitHub – riscv/riscv-v-spec | url=https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions | access-date=2021-07-10}}</ref> and take Vectors as input for both source elements and source array, and output another Vector.
 In scalar instruction sets the scalar registers are broken down into smaller sections (unpacked, [[SIMD]] style) where the fragments are then used as array sources.  The (small, partial) results are then concatenated (packed) back into the scalar register as the result.
 Some ISAs, particularly for [[cryptographic]] applications, even have ''bit-level'' permute operations, such as
-<code>bdep</code> (bit deposit) in RISC-V bitmanip;<ref>[https://github.com/riscv/riscv-bitmanip]</ref> in the [[Power ISA]] it is known as {{code|bpermd}} and has been included for several decades, and is still in the Power ISA v.3.0 B spec.<ref name="isa307b">{{cite web
+<code>bdep</code> (bit deposit) in RISC-V bitmanip;<ref>{{cite web | title=riscv/riscv-bitmanip | website=GitHub | url=https://github.com/riscv/riscv-bitmanip | access-date=2021-07-10}}</ref> in the [[Power ISA]] it is known as {{code|bpermd}} and has been included for several decades, and is still in the Power ISA v.3.0 B spec.<ref name="isa307b">{{cite web | title=Power ISA Version 3.0 B | publisher=Power.org | date=2017-03-27 | url=https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0 | access-date=2019-08-11}}</ref>
-	|title=Power ISA Version 3.0 B
-	|publisher=Power.org
-	|date= 2017-03-27
-	|url=https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
-	|access-date=2019-08-11
-	}}</ref>
 Also in some non-vector ISAs, due to there sometimes being insufficient space in the one source input register to specify the permutation source array in full (particularly if the operation involves bit-level permutation), will include partial reordering instructions.  Examples include <code>VSHUFF32x4</code> from [[AVX-512#Permute|AVX-512]].
-Permute operations in different forms are surprisingly common, occurring in [[AltiVec]], [[Power ISA]], [[PowerPC G4]], [[AVX-512]], [[Scalable_Vector_Extension|SVE2]]<ref>[https://indico.ph.ed.ac.uk/event/69/contributions/892/attachments/734/910/Arm_Exascale.pdf ARM HPC, SVE2 Extension summary, p32]</ref> and in [[Vector processors]] as well as [[Graphics processing unit|GPUs]]. They are sufficiently important that [[LLVM]] has included a {{code|shufflevector}}<ref>[https://llvm.org/docs/LangRef.html#shufflevector-instruction]</ref> intrinsic and gcc as {{code|__builtin_shuffle}}.<ref>[https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html]</ref> [[GNU Compiler Collection|GCC]]'s intrinsic matches the functionality of [[OpenCL]]'s shuffle intrinsics.<ref>[https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/shuffle.html]</ref> Note that all of these, mathematically, are not permutations because duplicates can occur in the output.
+Permute operations in different forms are surprisingly common, occurring in [[AltiVec]], [[Power ISA]], [[PowerPC G4]], [[AVX-512]], [[Scalable_Vector_Extension|SVE2]]<ref>[https://indico.ph.ed.ac.uk/event/69/contributions/892/attachments/734/910/Arm_Exascale.pdf ARM HPC, SVE2 Extension summary, p32]</ref> and in [[Vector processors]] as well as [[Graphics processing unit|GPUs]]. They are sufficiently important that [[LLVM]] has included a {{code|shufflevector}}<ref name="LLVM Language Reference Manual">{{cite web | title=LLVM 13 documentation: shufflevector | website=LLVM Language Reference Manual | url=https://llvm.org/docs/LangRef.html#shufflevector-instruction | access-date=2021-07-10}}</ref> intrinsic and gcc as {{code|__builtin_shuffle}}.<ref>{{cite web | title=Vector Extensions (Using the GNU Compiler Collection (GCC)) | website=GCC, the GNU Compiler Collection - GNU Project - Free Software Foundation (FSF) | url=https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html | access-date=2021-07-10}}</ref> [[GNU Compiler Collection|GCC]]'s intrinsic matches the functionality of [[OpenCL]]'s shuffle intrinsics.<ref>{{cite web | title=OpenCL Specification: shuffle, shuffle2 | website=The Khronos Group Inc | url=https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/shuffle.html | access-date=2021-07-10}}</ref> Note that all of these, mathematically, are not permutations because duplicates can occur in the output.
 == See also ==

v t e Instruction set extensions
SIMD (RISC)	Alpha MVI ARM NEON SVE MIPS MDMX MIPS-3D MXU MIPS SIMD PA-RISC MAX Power ISA VMX SPARC VIS
SIMD (x86)	MMX (1996) 3DNow! (1998) SSE (1999) SSE2 (2001) SSE3 (2004) SSSE3 (2006) SSE4 (2006) SSE5 ~~(2007)~~ AVX (2008) F16C (2009) XOP (2009) FMA (FMA4: 2011, FMA3: 2012) AVX2 (2013) AVX-512 (2015) AMX (2022) AVX10 (2023)
Bit manipulation	BMI (ABM: 2007, BMI1: 2012, BMI2: 2013, TBM: 2012) ADX (2014)
Compressed instructions	Thumb MIPS16e ASE RVC
Security and cryptography	PadLock (2003) AES-NI (2008); ARMv8 also has AES instructions CLMUL (2010) RDRAND (2012) SHA (2013) MPX (2015) SGX (2015) TDX (2021)
Transactional memory	TSX (2013) ASF
Virtualization	VT-x (2005) AMD-V (2006) VT-d (AMD-Vi)
Suspended extensions' dates are ~~struck through~~.