SSE4

From Wikipedia, the free encyclopedia
Jump to: navigation, search

SSE4 is an instruction set used in the AMD K10 (K8L) and Intel Core microarchitecture. It was announced on September 27, 2006 at the Fall 2006 Intel Developer Forum, with vague details in a white paper;[1] more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum in Beijing, in the presentation.[2] The SSE4 Programming Reference is available from Intel.

Contents

[edit] SSE4 subsets

Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. Additionally, SSE4.2, a second subset consisting of the 7 remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set.

AMD currently only supports 4 instructions from the SSE4 instruction set, but have also added two new SSE instructions that is named SSE4a. These instructions are not found in Intel's processors supporting SSE4.1 and alternatively AMD processors aren't supporting Intel's SSE4.1. Support was added for SSE4a for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).[3]

[edit] Name confusion

What is now known as SSSE3 (Supplemental Streaming SIMD Extension 3), introduced in the Intel Core 2 processor line, was mistakenly referred to as SSE4 by the media during its development.

Intel is using the marketing term HD Boost to refer to SSE4.[4]

[edit] New instructions

Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.

Several of these instructions are enabled by the single-cycle shuffle engine in Penryn. (Shuffle operations are operations that involve the repositioning of bits.)

[edit] SSE4.1

These instructions were introduced with Penryn microarchitecture, the 45nm shrink of Intel's Core microarchitecture.

Instruction Description
MPSADBW Compute eight offset sums of absolute differences (i.e. |x0-y0|+|x1-y1|+|x2-y2|+|x3-y3|, |x0-y1|+|x1-y2|+|x2-y3|+|x3-y4|, ...); this operation is extremely important for modern HDTV codecs, and (see [5]) allows an 8x8 block difference to be computed in fewer than seven cycles. One bit of a three-bit immediate operand indicates whether y0 .. y10 or y4 .. y14 should be used from the destination operand, the other two whether x0..x3, x4..x7, x8..x11 or x12..x15 should be used from the source.
PHMINPOSUW Sets the bottom unsigned 16-bit word of the destination to the smallest unsigned 16-bit word in the source, and the next-from-bottom to the index of that word in the source.
PMULDQ Packed signed multiplication on two sets of 2 out of 4 packed integers, the 1st and 3rd per packed 4, giving 2 packed 64-bit results.
PMULLD Packed signed multiplication, 4 packed sets of 32-bit integers multiplied to give 4 packed 32-bit results.
DPPS, DPPD Dot product for AOS (Array of Structs) data. This takes an immediate operand consisting of four (or two for DPPD) bits to select which of the entries in the input to multiply and accumulate, and another four (or two for DPPD) to select whether to put 0 or the dot-product in the appropriate field of the output.
BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW Conditional copying of elements in one location with another, based (for non-V form) on the bits in an immediate operand, and (for V form) on the bits in register XMM0.
PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD Packed minimum/maximum for different integer operand types
ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD Round values in a floating-point register to integers, using one of four rounding modes specified by an immediate operand
INSERTPS, PINSRB, PINSRD/PINSRQ, EXTRACTPS, PEXTRB, PEXTRW, PEXTRD/PEXTRQ The INSERTPS and PINSR instructions read 8, 16 or 32 bits from an x86 register memory location and insert it into a field in the destination register given by an immediate operand, EXTRACTPS and PEXTR read a field from the source register and insert it into an x86 register or memory location. For example, PEXTRD eax, [xmm0], 1; EXTRACTPS [addr+4*eax], xmm1, 1 stores the first field of xmm1 in the address given by the first field of xmm0.
PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ Packed sign/zero extension to wider types
PTEST This is similar to the TEST instruction, in that it sets the Z flag to the result of an AND between its operators: ZF is set, if DEST AND SRC is equal to 0. Additionally it sets the C flag if (NOT DEST) AND SRC equals zero.

This is equivalent to setting the Z flag if none of the bits masked by SRC are set, and the C flag if all of the bits masked by SRC are set.

PCMPEQQ Quadword (64 bits) compare for equality
PACKUSDW Convert signed DWORDs into unsigned WORDs with saturation.
MOVNTDQA Efficient read from write-combining memory area into SSE register; this is useful for retrieving results from peripherals attached to the memory bus.

[edit] SSE4.2

These instructions were first implemented in the Nehalem-based Intel Core i7 product line and complete the SSE4 instruction set.

Instruction Description
CRC32 Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41).[6]
PCMPESTRI Packed Compare Explicit Length Strings, Return Index
PCMPESTRM Packed Compare Explicit Length Strings, Return Mask
PCMPISTRI Packed Compare Implicit Length Strings, Return Index
PCMPISTRM Packed Compare Implicit Length Strings, Return Mask
PCMPGTQ Compare Packed Signed 64-bit data For Greater Than
POPCNT Population count (count number of bits set to 1). POPCNT instruction may also be implemented in some processors that do not support the other SSE4 instructions and a separate bit can be tested to confirm POPCNT presence.

[edit] SSE4a

The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. With the exception of POPCNT, these instructions are not available in Intel processors.

Instruction Description
LZCNT Leading Zero Count - bit manipulation. LZCNT instruction may also be implemented in some processors that do not support the other SSE4 instructions and a separate bit can be tested to confirm LZCNT presence.
POPCNT Population count (count number of bits set to 1). POPCNT instruction may also be implemented in some processors that do not support the other SSE4 instructions and a separate bit can be tested to confirm POPCNT presence.
EXTRQ/INSERTQ Combined mask-shift instructions.
MOVNTSD/MOVNTSS Scalar streaming store instructions.

[edit] References

Personal tools
Namespaces
Variants
Actions
Navigation
Interaction
Toolbox
Print/export
Languages