Half-precision floating-point format
In computing, half precision is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory.
In IEEE 754-2008 the 16-bit base 2 format is officially referred to as binary16. It is intended for storage (of many floating-point values where higher precision need not be stored), not for performing arithmetic computations.
Half-precision floating point is a relatively new binary floating-point format. It was created concurrently by Nvidia and Industrial Light & Magic (ILM). Nvidia defined the half datatype in the Cg language, released in early 2002, and was the first to implement 16-bit floating point in silicon, with the GeForce FX, released in late 2002.[1] ILM was searching for an image format that could handle a wide dynamic range without the disk and memory cost of the single- and double-precision representations commonly used for floating-point computation.[2]
This format is used in several computer graphics environments including OpenEXR, OpenGL, Cg, and D3DX. The advantage over 8-bit or 16-bit binary integers is that the increased dynamic range allows for more detail to be preserved in highlights and shadows for images. The advantage over 32-bit single-precision binary formats is that it requires half the storage and bandwidth (at the expense of precision).[2]
IEEE 754 half-precision binary floating-point format: binary16
The IEEE 754 standard specifies a binary16 as having:
- Sign bit: 1 bit
- Exponent width: 5 bits
- Significand precision: 11 bits (10 explicitly stored)
The format is assumed to have an implicit leading bit with value 1 unless the exponent field is stored with all zeros. Thus only 10 bits of the significand appear in the memory format, but the total precision is 11 bits (log10(2^11) ≈ 3.311 decimal digits). From the most significant bit to the least significant, the bits are laid out as follows: 1 sign bit (bit 15), 5 exponent bits (bits 14 to 10), and 10 significand bits (bits 9 to 0).
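The field widths listed above can be pulled out of a 16-bit pattern with shifts and masks. The following is a minimal sketch, assuming the conventional ordering (sign in the most significant bit, then the exponent, then the stored significand bits); the function name `split_binary16` is illustrative and not part of any standard.

```python
def split_binary16(bits):
    """Split a 16-bit pattern into (sign, biased exponent, stored significand)."""
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned integer")
    sign = (bits >> 15) & 0x1        # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits, still biased
    fraction = bits & 0x3FF          # 10 explicitly stored significand bits
    return sign, exponent, fraction


print(split_binary16(0x3C00))  # (0, 15, 0)
print(split_binary16(0xC000))  # (1, 16, 0)
```

The returned exponent is the raw stored field; interpreting it requires the bias described in the next section.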
Exponent encoding
The half-precision binary floating-point exponent is encoded using an offset-binary representation with a zero offset of 15; the IEEE 754 standard calls this offset the exponent bias.
- E_min = 01h − 0Fh = −14
- E_max = 1Eh − 0Fh = 15
- Exponent bias = 0Fh = 15
Thus, to obtain the true exponent, the bias of 15 has to be subtracted from the stored exponent.
The stored exponents 00h and 1Fh are interpreted specially.
| Exponent | Significand zero | Significand non-zero | Equation |
|---|---|---|---|
| 00h | 0, −0 | subnormal numbers | (−1)^signbit × 2^−14 × 0.significandbits_2 |
| 01h, ..., 1Eh | normalized value | normalized value | (−1)^signbit × 2^(exponent−15) × 1.significandbits_2 |
| 1Fh | ±infinity | NaN (quiet, signalling) | |
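The three rows of the table translate directly into a three-way branch on the stored exponent. The sketch below is one possible decoder written for illustration; the name `decode_binary16` is hypothetical, and the result is returned as an ordinary Python float (a double), which can hold every binary16 value exactly.

```python
import math


def decode_binary16(bits):
    """Interpret a 16-bit pattern according to the exponent table above."""
    sign = -1.0 if (bits >> 15) & 0x1 else 1.0
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF

    if exponent == 0x00:
        # Zero or subnormal: no implicit leading 1, fixed exponent of -14.
        return sign * 2.0 ** -14 * (fraction / 1024)
    if exponent == 0x1F:
        # All-ones exponent: infinity if the significand is zero, else NaN.
        return sign * math.inf if fraction == 0 else math.nan
    # Normalized value: implicit leading 1, bias of 15 removed.
    return sign * 2.0 ** (exponent - 15) * (1 + fraction / 1024)


print(decode_binary16(0x3C00))  # 1.0
print(decode_binary16(0xC000))  # -2.0
print(decode_binary16(0x7BFF))  # 65504.0
```

This sketch does not distinguish quiet from signalling NaNs; both come back as a generic NaN.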
The minimum strictly positive (subnormal) value is 2^−24 ≈ 5.96 × 10^−8. The minimum positive normal value is 2^−14 ≈ 6.10 × 10^−5. The maximum representable value is (2 − 2^−10) × 2^15 = 65504.
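These three limits can be checked against the corresponding bit patterns. The sketch below assumes Python 3.6 or later, whose standard `struct` module supports the `e` format code for IEEE 754 binary16; the helper name `from_bits` is illustrative.

```python
import struct


def from_bits(bits):
    """Reinterpret a 16-bit unsigned integer as a binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


min_subnormal = 2.0 ** -24               # only the lowest significand bit set
min_normal = 2.0 ** -14                  # stored exponent 01h, significand zero
max_finite = (2 - 2.0 ** -10) * 2.0 ** 15  # stored exponent 1Eh, significand all ones

print(from_bits(0x0001) == min_subnormal)  # True
print(from_bits(0x0400) == min_normal)     # True
print(from_bits(0x7BFF) == max_finite)     # True
print(max_finite)                          # 65504.0
```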
Half precision examples
These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.
- 3C00 = 1
- C000 = −2
- 7BFF = 6.5504 × 10^4 (max half precision)
- 0400 = 2^−14 ≈ 6.10352 × 10^−5 (minimum positive normal)
- 0001 = 2^−24 ≈ 5.96046 × 10^−8 (minimum positive subnormal)
- 0000 = 0
- 8000 = −0
- 7C00 = infinity
- FC00 = −infinity
- 3555 ≈ 0.33325... ≈ 1/3
Under the default round-to-nearest mode, 1/3 rounds down, as it does in double precision, because the significand has an odd number of bits. The binary expansion of 1/3 is 0.010101..., so the bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
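The examples, including the rounding of 1/3, can be reproduced by converting values to their binary16 bit patterns. This is a minimal sketch, assuming Python 3.6 or later (for the `e` format code of the standard `struct` module); `to_bits` and `from_bits` are illustrative helper names.

```python
import struct


def to_bits(value):
    """Round a Python float to binary16 and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def from_bits(bits):
    """Reinterpret a 16-bit unsigned integer as a binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


print(hex(to_bits(1.0)))        # 0x3c00
print(hex(to_bits(-2.0)))       # 0xc000
print(hex(to_bits(1 / 3)))      # 0x3555
print(from_bits(0x3555))        # 0.333251953125
print(from_bits(0x3555) < 1 / 3)  # True: the stored value lies below 1/3
```

The last line confirms that the nearest binary16 value to 1/3 is slightly smaller than 1/3, that is, the conversion rounded down.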
Precision limitations on integer values
Under the default round-to-nearest mode:
- Integers between 0 and 2047 can be exactly represented.
- Integers between 2048 and 4095 round to the nearest multiple of 2 (even number).
- Integers between 4096 and 8191 round to the nearest multiple of 4.
- Integers between 8192 and 16383 round to the nearest multiple of 8.
- Integers between 16384 and 32767 round to the nearest multiple of 16.
- Integers between 32768 and 65519 round to the nearest multiple of 32.
- Integers equal to or above 65520 round to infinity.
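The growing spacing between representable integers can be observed by rounding a few integers to binary16 and back. The sketch below assumes Python 3.6 or later and relies on the `e` format code of the standard `struct` module, which rounds to nearest with ties to even; `round_to_half` is an illustrative name. Values that would overflow the format are left out.

```python
import struct


def round_to_half(value):
    """Round a number to the nearest binary16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", float(value)))[0]


print(round_to_half(2047))   # 2047.0  (exact)
print(round_to_half(2049))   # 2048.0  (tie, resolved to the even significand)
print(round_to_half(2051))   # 2052.0  (tie, resolved to the even significand)
print(round_to_half(4097))   # 4096.0  (spacing of 4)
print(round_to_half(4099))   # 4100.0
print(round_to_half(32785))  # 32800.0 (spacing of 32)
print(round_to_half(65519))  # 65504.0 (largest finite value)
```

Note that 2051 and 4099 round up, so with this rounding mode integers are not simply truncated to the multiple below them.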
See also
- IEEE Standard for Floating-Point Arithmetic (IEEE 754)
- ISO/IEC 10967, Language Independent Arithmetic
- Primitive data type
- RGBE image format
- minifloat
References
External links
- Minifloats (in Survey of Floating-Point Formats)
- OpenEXR site
- Half precision constants from D3DX (defunct page)
- OpenGL treatment of half precision
- Fast Half Float Conversions
- Analog devices variant (four-bit exponent)
- C source code to convert between IEEE double, single, and half precision