Extended precision

From Wikipedia, the free encyclopedia

Extended precision is a name various computer manufacturers have given to floating-point number formats that provide greater precision than the other floating-point formats already supported by their architecture. In contrast, arbitrary-precision arithmetic refers to implementations of much larger numeric types (with a storage size that is usually not a power of two) using special software (or, rarely, hardware).

Extended Precision Implementations

The IBM 1130 offered two floating-point formats: a 32-bit "standard precision" format and a 40-bit "extended precision" format. The standard precision format contained a 24-bit two's complement significand, while extended precision used a 32-bit two's complement significand. The characteristic in both formats was an 8-bit field containing the power of two biased by 128. Floating-point arithmetic operations were performed by software, and double precision was not supported at all. The extended format occupied three 16-bit words, with the extra space simply ignored[1].

The IBM System/360 supported a 32-bit "short" floating-point format and a 64-bit "long" floating-point format[2]. The follow-on System/370 added support for a 128-bit "extended" format[3]. These formats are still supported in the current z/Architecture; however, they are now called the "hexadecimal floating point" formats.

The x86 and x86-64 architectures support an 80-bit extended precision format. The Intel 8087 math coprocessor was designed to support a 32-bit "single precision" format and a 64-bit "double precision" format for encoding and interchanging floating-point numbers. However, certain functions (exponentiation in particular) suffer from significant precision loss when implemented using the algorithms chosen for the 8087. To mitigate this problem, the internal registers in the 8087 were designed to hold intermediate results in an 80-bit "extended precision" format. The 8087 automatically converted numbers to this format when loading floating-point registers from memory, and converted results back to the more conventional formats when storing the registers back into memory. Intel published documentation describing this 80-bit internal format and also provided instructions that transfer values between memory and these internal registers without performing any conversion. The floating-point units on all subsequent x86 architecture processors have supported this format. As a result, software can be developed that takes advantage of the higher precision provided by this format. Prof. W. Kahan, a primary architect of the x87 arithmetic and of the initial IEEE 754 standard proposal, notes on the development of the x87 floating point: "An Extended format as wide as we dared (80 bits) was included to serve the same support role as the 13-decimal internal format serves in Hewlett-Packard’s 10-decimal calculators"[4].

The Motorola 6888x math coprocessors and the Motorola 68040 and 68060 processors support a 96-bit extended precision format that is similar to the Intel format, with 16 unused bits inserted between the exponent and significand fields[5]. The follow-on ColdFire processors do not support this 96-bit extended precision format[6].

The IEEE 754 standard provides for extended precision formats. It specifies the minimum range of values and the minimum precision that an implementation of an extended format must support, but it does not dictate an encoding or set a maximum range or precision. These decisions are the purview of the implementor[7]. The x86 80-bit and Motorola 68881 formats meet the requirements of the IEEE 754 double extended format[8].

x86 Architecture Extended Precision Format

The x86 Architecture Extended Precision Format is an 80-bit format first implemented in the Intel 8087 math coprocessor. It is supported by all processors based on the x86 architecture that incorporate a floating-point unit. This 80-bit format uses one bit for the sign of the significand, 15 bits for the exponent field (i.e. the same range as the 128-bit quadruple precision IEEE 754 format) and 64 bits for the significand. The exponent field is biased by 16383, meaning that 16383 has to be subtracted from the value in the exponent field to compute the actual power of 2[9]. An exponent field value of 32767 (all fifteen bits 1) is reserved to enable the representation of special states such as infinity and Not a Number. If the exponent field is zero, the value is a denormal number (or zero) and the exponent of 2 is -16382[10].
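The field layout just described can be sketched in a few lines of Python. This is only an illustration: the function name `decode_x87` is hypothetical, the input is assumed to be the 10 bytes as stored in memory on a little-endian x86 machine, and the infinity/NaN split (fraction bits all zero versus not) is a detail added here rather than taken from the paragraph above.

```python
from fractions import Fraction

EXP_BIAS = 16383        # bias of the 15-bit exponent field
EXP_SPECIAL = 0x7FFF    # all fifteen exponent bits set

def decode_x87(raw: bytes):
    """Decode a 10-byte little-endian x86 extended value (illustrative helper).

    Returns (sign, exponent_field, significand, value). The value is an exact
    Fraction, or the string "inf" / "nan" for the reserved exponent.
    """
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    bits = int.from_bytes(raw, "little")
    sign = bits >> 79
    exp_field = (bits >> 64) & 0x7FFF
    significand = bits & ((1 << 64) - 1)   # bit 63 is the explicit integer bit

    if exp_field == EXP_SPECIAL:
        # Assumption: fraction bits (62-0) all zero means infinity, else NaN.
        fraction_bits = significand & ((1 << 63) - 1)
        return sign, exp_field, significand, ("inf" if fraction_bits == 0 else "nan")

    # An exponent field of zero uses the same power of two as a field of one.
    power = (exp_field if exp_field else 1) - EXP_BIAS
    # The significand is a fixed-point number with 63 fractional bits.
    magnitude = Fraction(significand, 1 << 63) * Fraction(2) ** power
    return sign, exp_field, significand, (-magnitude if sign else magnitude)


one = ((0x3FFF << 64) | (1 << 63)).to_bytes(10, "little")
print(decode_x87(one)[3])   # 1
```

Using `Fraction` keeps the decoded value exact, which matters because most extended values cannot be represented in a Python `float`.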

[Figure: X86 Extended Floating Point Format.svg, showing the bit layout of the x86 extended floating-point format]

In contrast to the single and double-precision formats, this format does not use an implicit/hidden bit. Rather, bit 63 contains the integer part of the significand and bits 62-0 hold the fractional part. Bit 63 is 1 in all normalized numbers; zero is the exception. There were several advantages to this design when the 8087 was being developed: (1) Calculations can be completed a little faster if all bits of the significand are present in the register. (2) A 64-bit significand provides sufficient precision to avoid loss of precision when results are converted back to double precision format in the vast majority of cases. (3) The format provides a mechanism for indicating precision loss due to underflow that can be carried through further operations. For example, the calculation 2\times 10^{-4930} \times 3\times 10^{-10} \times 4\times 10^{20} generates the intermediate result 6\times 10^{-4940}, which is a denormal and also involves precision loss. The product of all of the terms is 24\times 10^{-4920}, which can be represented as a normalized number. The 80287 could complete this calculation and indicate the loss of precision by returning an "unnormal" result (exponent not 0, bit 63 = 0)[11]. Processors since the 80387 no longer generate unnormals and do not support unnormal inputs to operations. They generate a denormal if an underflow occurs, but generate a normalized result if subsequent operations on the denormal can be normalized[12].
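The role of the explicit integer bit can be illustrated with a small classifier that follows the definitions used in this section. The name `classify_x87` is hypothetical, and the sketch is simplified: any nonzero significand with a zero exponent field is reported as a denormal, and only the fraction bits are used to tell infinity from NaN. The last lines check, with exact rational arithmetic, that the intermediate result in the example really falls in the denormal range.

```python
from fractions import Fraction

def classify_x87(exp_field: int, significand: int) -> str:
    """Classify an extended value from its raw fields (simplified sketch)."""
    integer_bit = (significand >> 63) & 1
    fraction_bits = significand & ((1 << 63) - 1)
    if exp_field == 0x7FFF:
        return "infinity" if fraction_bits == 0 else "nan"
    if exp_field == 0:
        return "zero" if significand == 0 else "denormal"
    # Nonzero exponent: the integer bit separates normals from unnormals.
    return "normal" if integer_bit else "unnormal"

# Range check for the worked example, using exact fractions.
min_normal = Fraction(1, 2 ** 16382)     # smallest normalized magnitude
min_denormal = Fraction(1, 2 ** 16445)   # smallest denormal magnitude
intermediate = Fraction(6, 10 ** 4940)   # 2e-4930 * 3e-10
final = Fraction(24, 10 ** 4920)         # intermediate * 4e20

print(min_denormal < intermediate < min_normal)   # True: denormal range
print(final >= min_normal)                        # True: normalizable
```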

The 80-bit floating-point format was widely available by 1984[13], after the development of C, Fortran and similar computer languages, which initially offered only the common 32- and 64-bit floating-point sizes. On the x86 architecture most C compilers now support 80-bit extended precision via the long double type, which the C99 / C11 standards cover in Annex F (IEC 60559 floating-point arithmetic). Compilers on x86 for other languages often support extended precision as well, sometimes via nonstandard extensions: for example, Turbo Pascal offers an extended type, and several Fortran compilers have a REAL*10 type (analogous to REAL*4 and REAL*8). Such compilers also typically include extended-precision mathematical subroutines, such as square root and trigonometric functions, in their standard libraries.

This format gives 18 to 21 significant decimal digits of precision (note: \log_{10}(2^{64}) \approx 19.266) and has a range (including subnormals) from approximately 3.65\times 10^{-4951} to 1.19\times 10^{4932}. If a decimal string with at most 18 significant digits is converted to 80-bit IEEE 754 double extended precision and then converted back to the same number of significant digits, the final string should match the original. Conversely, if an 80-bit double extended value is converted to a decimal string with at least 21 significant digits and then converted back to double extended, the final number should match the original[14].
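These figures can be reproduced from the field widths alone. The sketch below assumes the usual round-trip formulas for a binary format with p significand bits, floor((p-1) log10 2) digits preserved and ceil(1 + p log10 2) digits required, and uses Python's `decimal` module because the limits lie far outside the range of a `float`.

```python
import math
from decimal import Decimal

P = 64  # significand bits, including the explicit integer bit

# Decimal digits that survive decimal -> binary -> decimal.
digits_preserved = math.floor((P - 1) * math.log10(2))
# Decimal digits needed for binary -> decimal -> binary to be exact.
digits_required = math.ceil(1 + P * math.log10(2))
print(digits_preserved, digits_required)   # 18 21

# Smallest subnormal: only the lowest significand bit set, exponent -16382.
smallest_subnormal = Decimal(2) ** -16445
# Largest finite value: all 64 significand bits set, exponent +16383.
largest_finite = (2 - Decimal(2) ** -63) * Decimal(2) ** 16383

# adjusted() gives the decimal exponent of the leading digit.
print(smallest_subnormal.adjusted(), largest_finite.adjusted())   # -4951 4932
```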

Reason for the 80-bit format

The need for a minimum of 64 bits of precision in the significand of the extended precision format follows from the need to avoid precision loss when performing exponentiation on double precision values[15]. The x86 floating point units do not provide an instruction that directly performs exponentiation. Instead they provide a set of instructions that a program can use in sequence to perform exponentiation using the equation:

x^y = 2^{\,y\,\log_2 x}

In order to avoid precision loss, the intermediate results \log_2 x and y\,\log_2 x must be computed with much higher precision, because effectively both the exponent and the significand fields of x must fit into the significand field of the intermediate result. Subsequently the significand field of the intermediate result is split between the exponent and significand fields of the final result when 2^{\text{intermediate result}} is calculated. The following discussion describes this requirement in more detail.

An IEEE 754 double precision value can be represented as:

2^{(-1)^s\,\times\,E}\,\times\,M

where s is the sign of the exponent (either 0 or 1), E is the magnitude of the unbiased exponent, an integer that ranges from 0 to 1023, and M is the significand, a 53-bit value in the range 1 ≤ M < 2. Negative numbers and zero can be ignored because the logarithm of these values is undefined. For purposes of this discussion M does not have 53 bits of precision, because it is constrained to be greater than or equal to one; that is, the hidden bit does not count towards the precision. (In situations where M is less than 1, the value is actually a denormal and may therefore have already suffered precision loss. This situation is beyond the scope of this article.)

Taking the log of this representation of a double precision number and simplifying results in the following:

\begin{align} \log_2(2^{(-1)^s\,\times\,E}\,\times\,M) & = (-1)^s\,\times\,E\,\times\,\log_2 2\,+\,\log_2 M  \\
& = \pm\,E\,+\,\log_2 M\\ \end{align}

This result demonstrates that when taking the base-2 logarithm of a number, the sign of the exponent of the original value becomes the sign of the logarithm, the exponent of the original value becomes the integer part of the significand of the logarithm, and the significand of the original value is transformed into the fractional part of the significand of the logarithm.

Because E is an integer in the range 0 to 1023, up to 10 bits to the left of the radix point are needed to represent the integer part of the logarithm. Because M falls in the range 1 ≤ M < 2, the value of \log_2 M falls in the range 0 ≤ \log_2 M < 1, so at least 52 bits are needed to the right of the radix point to represent the fractional part of the logarithm. Combining 10 bits to the left of the radix point with 52 bits to the right of the radix point means that the significand part of the logarithm must be computed to at least 62 bits of precision. In practice, values of M less than \sqrt{2} require 53 bits to the right of the radix point and values of M less than \sqrt[4]{2} require 54 bits to the right of the radix point to avoid precision loss. Balancing this requirement for added precision to the right of the radix point, exponents less than 512 require only 9 bits to the left of the radix point and exponents less than 256 require only 8 bits to the left of the radix point.
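The decomposition of the logarithm into an integer part taken from the exponent and a fractional part taken from the significand can be tried directly on Python floats, which are IEEE 754 doubles on common platforms. The helper names `log2_parts` and `simple_bit_budget` are hypothetical, and the bit count is only the simple "integer bits plus 52" estimate from the paragraph above, without the extra bits needed for small M.

```python
import math

def log2_parts(x: float):
    """Split log2(x) for a positive normal double into (E, log2(M))."""
    m, e = math.frexp(x)        # x == m * 2**e with 0.5 <= m < 1
    M, E = 2.0 * m, e - 1       # rescale so that 1 <= M < 2
    return E, math.log2(M)      # log2(x) == E + log2(M), with 0 <= log2(M) < 1

def simple_bit_budget(x: float) -> int:
    """Integer bits for |E| plus 52 fraction bits (the text's simple count)."""
    E, _ = log2_parts(x)
    return abs(E).bit_length() + 52

E, frac = log2_parts(1.5 * 2.0 ** 1000)
print(E, frac)                                 # integer part 1000, fraction log2(1.5)
print(simple_bit_budget(1.5 * 2.0 ** 1023))    # 62
print(simple_bit_budget(1.5 * 2.0 ** 200))     # 60
```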

The final part of the exponentiation calculation is computing 2^{\text{intermediate result}}. The "intermediate result" consists of an integer part "I" added to a fractional part "F". If the intermediate result is negative, then a slight adjustment is needed to get a positive fractional part, because both "I" and "F" are negative numbers.

For positive intermediate results: 2^{intermediate\ result} = 2^{I+F} = 2^I\,2^F

For negative intermediate results: 2^{intermediate\ result} = 2^{I+F} = 2^{I\,+\,(1-1)\,+\,F} = 2^{(I-1)\,+\,(1+F)} = 2^{I-1}\,2^{1+F}

Thus the integer part of the intermediate result ("I" or "I-1") plus a bias becomes the exponent of the final result, and the transformed positive fractional part of the intermediate result (2^F or 2^{1+F}) becomes the significand of the final result. In order to supply 52 bits of precision to the final result, the positive fractional part must be maintained to at least 52 bits.
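The split into an exponent part and a significand part can be sketched as follows. The function names are hypothetical and everything is done in double precision, so this shows the bookkeeping rather than the precision of the x87 instructions.

```python
import math

def split_intermediate(t: float):
    """Return (I, F) with t == I + F, I an integer and 0 <= F < 1."""
    F, I = math.modf(t)          # both parts carry the sign of t
    if F < 0:
        # Negative case: 2^(I+F) = 2^(I-1) * 2^(1+F)
        I, F = I - 1, 1 + F
    return int(I), F

def exp2_by_parts(t: float) -> float:
    """Compute 2**t as 2**I * 2**F, with 2**F in [1, 2) acting as the significand."""
    I, F = split_intermediate(t)
    return math.ldexp(2.0 ** F, I)   # ldexp scales by 2**I exactly

print(split_intermediate(10.25))    # (10, 0.25)
print(split_intermediate(-10.25))   # (-11, 0.75)
print(exp2_by_parts(3.0))           # 8.0
```

A fractional part that is negative but extremely close to zero can round to exactly 1.0 after the adjustment; a production routine would need to handle that case, which this sketch does not.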

In summary, the exact number of bits of precision needed in the significand of the intermediate result is somewhat data dependent but 64 bits is sufficient to avoid precision loss in the vast majority of exponentiation computations involving double precision numbers.
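The effect of keeping the intermediate result in only 53 bits can be observed with ordinary Python floats. The sketch below computes 3^600 through the identity x^y = 2^(y log2 x) with every step rounded to double precision, and compares it with the library `math.pow` against the exact integer result. The helper names are hypothetical, and the exact error figures depend on the platform's math library, so no specific output is claimed; on most platforms the first error is expected to be noticeably larger than the second.

```python
import math
from fractions import Fraction

def pow_via_log2(x: float, y: float) -> float:
    """x**y as 2**(y*log2(x)), with the intermediate held in a double."""
    return 2.0 ** (y * math.log2(x))

def relative_error(approx: float, exact: Fraction) -> float:
    """Relative error of a float against an exact rational value."""
    return float(abs(Fraction(approx) - exact) / exact)

x, y = 3.0, 600.0
exact = Fraction(3) ** 600   # exact integer result, about 1.9e286

print("double intermediate:", relative_error(pow_via_log2(x, y), exact))
print("library pow:        ", relative_error(math.pow(x, y), exact))
```

The integer part of y log2 x (about 951) consumes roughly ten of the 53 available bits, which is the same budget problem the 64-bit significand was designed to avoid.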

The number of bits needed for the exponent of the extended precision format follows from the requirement that the product of two double precision numbers should not overflow when computed using the extended format. The largest possible exponent of a double precision value is 1023 so the exponent of the largest possible product of two double precision numbers is 2047 (an 11-bit value). Adding in a bias to account for negative exponents means that the exponent field must be at least 12 bits wide.

Combining these requirements: 1 bit for the sign, 12 bits for the biased exponent, and 64 bits for the significand means that the extended precision format would need at least 77 bits. Engineering considerations resulted in the final definition of the 80-bit format[15].
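The bit budget above is simple enough to check mechanically. In this sketch the variable names are illustrative, and the "+ 1" used to reach 2047 is an assumption about how the text arrives at that figure (a carry from multiplying two significands that are each just under 2).

```python
max_double_exponent = 1023

# Two exponents of 1023 plus a possible carry from the significand product.
max_product_exponent = 2 * max_double_exponent + 1       # 2047
magnitude_bits = max_product_exponent.bit_length()       # 11

# One more bit so that the biased field also covers negative exponents.
exponent_field_bits = magnitude_bits + 1                 # 12

sign_bits = 1
significand_bits = 64
minimum_width = sign_bits + exponent_field_bits + significand_bits
print(minimum_width)   # 77
```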

References

  1. IBM 1130 Subroutine Library, 9th ed. IBM Corporation. 1974. p. 93. http://media.ibm1130.org/1130-037-ocr.pdf.
  2. IBM System/360 Principles of Operation, 9th ed. IBM Corporation. 1970. p. 41.
  3. IBM System/370 Principles of Operation, 7th ed. IBM Corporation. 1980. pp. 9-2 thru 9-3.
  4. William Kahan (22 November 1983). "MATHEMATICS WRITTEN IN SAND - the hp-15C, Intel 8087, etc.". http://www.cs.berkeley.edu/~wkahan/MathSand.pdf.
  5. Motorola MC68000 Family Programmer's Reference Manual. Freescale Semiconductor. 1992. pp. 1-16. http://www.freescale.com/files/archives/doc/ref_manual/M68000PRM.pdf.
  6. ColdFire Family Programmer’s Reference Manual. Freescale Semiconductor. 2005. pp. 7-7. http://cache.freescale.com/files/dsp/doc/ref_manual/CFPRM.pdf.
  7. Kevin Brewer. "Kevin’s Report". IEEE-754 Reference Material. http://babbage.cs.qc.cuny.edu/IEEE-754.old/References.xhtml#report. Retrieved 2012-02-19.
  8. William Kahan (1 October 1997). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic". http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF.
  9. Intel 80C187 datasheet.
  10. Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 1. Intel Corporation. pp. 4-6 thru 4-9 and 4-18 thru 4-21. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-1-manual.html.
  11. Morse, Stephen P.; Albert, Douglas J. (1986). The 80286 Architecture. Wiley Press. pp. 91-111. ISBN 0-471-83185-9.
  12. Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 1. Intel Corporation. pp. 8-21 thru 8-22. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-1-manual.html.
  13. Charles Severance (20 February 1998). "An Interview with the Old Man of Floating-Point". http://www.eecs.berkeley.edu/~wkahan/ieee754status/754story.html.
  14. William Kahan (1 October 1997). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic". http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF.
  15. Morse, Stephen P.; Albert, Douglas J. (1986). The 80286 Architecture. Wiley Press. pp. 96-98. ISBN 0-471-83185-9.