Quadruple-precision floating-point format: Difference between revisions
Revision as of 06:03, 24 January 2010
In computing, quadruple precision (also commonly shortened to quad precision) is a binary floating-point computer numbering format that occupies 16 bytes (128 bits in modern computers) in computer memory.
In IEEE 754-2008 the 128-bit base 2 format is officially referred to as binary128.
IEEE 754 quadruple precision binary floating-point format: binary128
The IEEE 754 standard specifies a binary128 as having:
- Sign bit: 1
- Exponent width: 15
- Significand precision: 113 (112 explicitly stored)
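The format is written with an implicit leading bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits, log10(2^113) ≈ 34.016). From most significant to least significant, the bits are laid out as follows: 1 sign bit (bit 127), 15 exponent bits (bits 126 to 112), and 112 significand bits (bits 111 to 0).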
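As an illustration of this layout, the following Python sketch splits a 128-bit pattern into its three fields. The function name `split_binary128` is hypothetical and chosen only for this example; the pattern is assumed to be given as an unsigned integer.

```python
def split_binary128(bits):
    """Split a 128-bit pattern into (sign, biased exponent, stored significand).

    `bits` is assumed to be an unsigned integer holding the raw encoding.
    """
    if not 0 <= bits < 1 << 128:
        raise ValueError("expected an unsigned 128-bit integer")
    sign = bits >> 127                     # 1 bit
    exponent = (bits >> 112) & 0x7FFF      # 15 bits, still biased
    significand = bits & ((1 << 112) - 1)  # 112 explicitly stored bits
    return sign, exponent, significand


# Bit pattern c000 0000 ... 0000, which encodes -2.
print(split_binary128(0xC000 << 112))  # (1, 16384, 0)
```

The implicit leading bit is not part of the returned significand, since it is not stored in memory.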
Exponent encoding
The quadruple precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this offset is also known as the exponent bias in the IEEE 754 standard.
- Emin = 0x0001−0x3fff = −16382
- Emax = 0x7ffe−0x3fff = 16383
- Exponent bias = 0x3fff = 16383
Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 16383 has to be subtracted from the stored exponent.
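The subtraction can be sketched in Python as follows. The names `EXPONENT_BIAS` and `true_exponent` are hypothetical and used only for illustration; the function covers normal numbers only.

```python
EXPONENT_BIAS = 0x3FFF  # 16383


def true_exponent(stored):
    """Return the unbiased exponent for the stored exponent of a normal number."""
    if not 0x0001 <= stored <= 0x7FFE:
        raise ValueError("stored exponent is outside the normal range")
    return stored - EXPONENT_BIAS


print(true_exponent(0x0001))  # -16382 (Emin)
print(true_exponent(0x7FFE))  # 16383 (Emax)
print(true_exponent(0x3FFF))  # 0
```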
The stored exponents 0x0000 and 0x7fff are interpreted specially.
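Exponent | Significand zero | Significand non-zero | Equation |
---|---|---|---|
0x0000 | 0, −0 | subnormal numbers | (−1)^sign × 2^−16382 × 0.significand (binary) |
0x0001, ..., 0x7ffe | normalized value | normalized value | (−1)^sign × 2^(exponent − 16383) × 1.significand (binary) |
0x7fff | ±infinity | NaN (quiet, signalling) | |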
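The cases in the table can be combined into a small decoder. The following Python sketch is one possible way to do this, not a reference implementation: the names `decode_binary128` and `from_hex` are hypothetical, finite values are returned as exact `fractions.Fraction` objects, and special values are returned as strings. Because a `Fraction` has no signed zero, −0 decodes to plain 0 here.

```python
from fractions import Fraction

BIAS = 16383  # 0x3fff


def decode_binary128(bits):
    """Decode a 128-bit pattern (unsigned int) to an exact Fraction or a label."""
    sign = -1 if bits >> 127 else 1
    exponent = (bits >> 112) & 0x7FFF
    fraction = bits & ((1 << 112) - 1)
    if exponent == 0x7FFF:
        if fraction:
            return "NaN"
        return "+infinity" if sign > 0 else "-infinity"
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at Emin.
        return sign * Fraction(fraction, 1 << 112) * Fraction(2) ** (1 - BIAS)
    # Normal number: implicit leading 1, exponent with the bias removed.
    return sign * (1 + Fraction(fraction, 1 << 112)) * Fraction(2) ** (exponent - BIAS)


def from_hex(text):
    """Parse a pattern written as space-separated groups of hexadecimal digits."""
    return int(text.replace(" ", ""), 16)


print(decode_binary128(from_hex("3fff 0000 0000 0000 0000 0000 0000 0000")))  # 1
print(decode_binary128(from_hex("c000 0000 0000 0000 0000 0000 0000 0000")))  # -2
print(decode_binary128(from_hex("7fff 0000 0000 0000 0000 0000 0000 0000")))  # +infinity
```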
The maximum representable value is ≈ 1.1897 × 10^4932.
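This figure can be checked with exact integer arithmetic. The sketch below assumes that the largest finite value has stored exponent 0x7ffe and an all-ones significand, that is, (2 − 2^−112) × 2^16383. The variable names are illustrative only.

```python
# Largest finite binary128 value, written as an exact integer:
# (2 - 2**-112) * 2**16383 == (2**113 - 1) * 2**(16383 - 112)
max_binary128 = (2**113 - 1) * 2 ** (16383 - 112)

# Keep only the 34 leading decimal digits. Integer division is used instead
# of str() on the full number, because some recent Python versions limit the
# length of int-to-str conversions by default.
leading_digits = max_binary128 // 10 ** (4932 - 33)
print(leading_digits)  # 1189731495357231765085759326628007
```

Read with a decimal point after the first digit, the result corresponds to about 1.1897 × 10^4932.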
Implementations
Quadruple precision is currently implemented in C/C++ and Fortran as long double and REAL*16[1] respectively. Not all compilers support the quad-precision type. Microsoft Visual C++, for example, makes the long double type synonymous with the double-precision type[2]. The GNU Fortran compiler also lacks support for quad precision, but proprietary compilers such as the Intel Fortran Compiler do support it[3].
Quadruple precision examples
These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.
- 3fff 0000 0000 0000 0000 0000 0000 0000 = 1
- c000 0000 0000 0000 0000 0000 0000 0000 = -2
- 7ffe ffff ffff ffff ffff ffff ffff ffff ≈ 1.189731495357231765085759326628007 × 10^4932 (max quadruple precision)
- 0000 0000 0000 0000 0000 0000 0000 0000 = 0
- 8000 0000 0000 0000 0000 0000 0000 0000 = -0
- 7fff 0000 0000 0000 0000 0000 0000 0000 = infinity
- ffff 0000 0000 0000 0000 0000 0000 0000 = -infinity
- 3ffd 5555 5555 5555 5555 5555 5555 5555 ≈ 1/3
By default, 1/3 rounds down, as it does in double precision, because the significand has an odd number of bits. The bits beyond the rounding point are therefore 0101..., which is less than 1/2 of a unit in the last place.
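The rounding of 1/3 can be reproduced with exact rational arithmetic. The following Python sketch assumes round-to-nearest, ties-to-even (the IEEE 754 default) and handles only this one value. The variable names are illustrative.

```python
from fractions import Fraction

BIAS = 16383
exponent = -2  # 1/3 = (4/3) * 2**-2, and 1 <= 4/3 < 2

# Significand measured in units in the last place (112 fractional bits).
scaled = Fraction(1, 3) / Fraction(2) ** exponent * 2**112
kept = scaled.numerator // scaled.denominator  # truncated 113-bit significand
discarded = scaled - kept  # part beyond the rounding point, in ulps

# Round to nearest, ties to even: round up only above half an ulp,
# or exactly at half an ulp when the kept significand is odd.
if discarded > Fraction(1, 2) or (discarded == Fraction(1, 2) and kept % 2 == 1):
    kept += 1

bits = ((exponent + BIAS) << 112) | (kept - 2**112)
print(discarded)             # 1/3
print(format(bits, "032x"))  # 3ffd followed by 28 fives
```

The discarded part is exactly 1/3 of a unit in the last place, so the significand is left as truncated and the stored pattern matches the 1/3 example above.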
See also
- IEEE Standard for Floating-Point Arithmetic (IEEE 754)
- ISO/IEC 10967, Language Independent Arithmetic
- Primitive data type
- long double
References
- ^ "Single, Double, and Long Double Precision". Sun Microsystems. Retrieved 2010-01-23.
- ^ MSDN homepage, about Visual C++ compiler.
- ^ "Intel Fortran Compiler Product Brief" (PDF). Su. Retrieved 2010-01-23.