Quadruple-precision floating-point format: Difference between revisions

Revision as of 06:03, 24 January 2010

In computing, quadruple precision (commonly shortened to quad precision) is a binary floating-point number format that occupies 16 bytes (128 bits) in computer memory.

In IEEE 754-2008 the 128-bit base 2 format is officially referred to as binary128.

IEEE 754 quadruple precision binary floating-point format: binary128

The IEEE 754 standard specifies a binary128 as having:

Sign bit: 1 bit
Exponent width: 15 bits
Significand precision: 113 bits (112 explicitly stored)

The format is written with an implicit leading bit with value 1 unless the stored exponent is all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits, $\log _{10}(2^{113})\approx 34.016$ ). From most significant to least significant, the bits are laid out as the sign bit (bit 127), the 15 exponent bits (bits 126 to 112), and the 112 stored significand bits (bits 111 to 0).
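As a minimal sketch of the field widths above, the following splits a 128-bit pattern into its three fields. It assumes the pattern is held in a Python integer with the sign in the most significant bit; the function name `split_binary128` is a hypothetical helper, not part of any library.

```python
import math


def split_binary128(bits: int) -> tuple[int, int, int]:
    """Split a 128-bit pattern into (sign, stored exponent, stored significand).

    Hypothetical helper for illustration; assumes the sign occupies bit 127,
    the exponent bits 126..112 and the significand bits 111..0.
    """
    if not 0 <= bits < (1 << 128):
        raise ValueError("expected an unsigned 128-bit integer")
    sign = bits >> 127                      # 1 bit
    exponent = (bits >> 112) & 0x7FFF       # 15 bits, still biased
    significand = bits & ((1 << 112) - 1)   # 112 explicitly stored bits
    return sign, exponent, significand


# Decimal digits carried by a 113-bit significand: log10(2**113)
decimal_digits = 113 * math.log10(2)
```

Applied to the pattern `0x3fff` followed by 28 zero hex digits, the helper returns a zero sign, the stored exponent `0x3fff`, and a zero significand.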

Exponent encoding

The quadruple precision binary floating-point exponent is encoded using an offset binary representation with a zero offset of 16383; the IEEE 754 standard calls this offset the exponent bias.

E_min = 0x0001−0x3fff = −16382
E_max = 0x7ffe−0x3fff = 16383
Exponent bias = 0x3fff = 16383

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 16383 has to be subtracted from the stored exponent.
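The subtraction can be sketched in a few lines. The names `BIAS` and `true_exponent` are illustrative choices, and the helper deliberately rejects the two stored exponents that do not denote normal numbers.

```python
BIAS = 0x3FFF  # 16383, the exponent bias for binary128


def true_exponent(stored: int) -> int:
    """Return the unbiased exponent for a stored exponent of a normal number.

    Hypothetical helper: stored exponents 0x0000 and 0x7fff are rejected
    because they do not encode ordinary normalized values.
    """
    if not 0x0001 <= stored <= 0x7FFE:
        raise ValueError("stored exponent does not denote a normal number")
    return stored - BIAS


e_min = true_exponent(0x0001)  # smallest normal exponent
e_max = true_exponent(0x7FFE)  # largest normal exponent
```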

The stored exponents 0x0000 and 0x7fff are interpreted specially.

Exponent	Significand zero	Significand non-zero	Equation
0x0000	0, −0	subnormal numbers	$(-1)^{\text{signbit}}\times 2^{-16382}\times 0.{\text{significandbits}}_{2}$
0x0001, ..., 0x7ffe	normalized value		$(-1)^{\text{signbit}}\times 2^{{{\text{exponentbits}}_{2}}-16383}\times 1.{\text{significandbits}}_{2}$
0x7fff	±infinity	NaN (quiet, signalling)
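The table can be turned into a small decoder. This is a sketch under stated assumptions: the bit pattern is given as a Python integer, exact rationals from the standard `fractions` module stand in for the real values, and the function name `decode_binary128` is hypothetical. It does not distinguish quiet from signalling NaNs, and it returns a plain zero for both 0 and −0.

```python
from fractions import Fraction


def decode_binary128(bits: int):
    """Decode a 128-bit pattern into an exact Fraction, or a float inf/nan.

    Illustrative only: follows the three rows of the table above.
    """
    sign = -1 if bits >> 127 else 1
    exponent = (bits >> 112) & 0x7FFF
    fraction = Fraction(bits & ((1 << 112) - 1), 1 << 112)  # 0.significandbits

    if exponent == 0x7FFF:
        # All-ones exponent: infinity if the significand is zero, else NaN.
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0x0000:
        # Zero or subnormal: no implicit leading 1, fixed exponent -16382.
        return sign * fraction * Fraction(2) ** -16382
    # Normalized value: implicit leading 1, exponent bias 16383.
    return sign * (1 + fraction) * Fraction(2) ** (exponent - 16383)
```

Using exact rationals avoids the overflow and rounding that would occur if the result were forced into a double-precision `float`.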

The maximum representable value is ≈ 1.1897 × 10⁴⁹³².
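This figure can be checked with exact integer arithmetic. The largest finite value has stored exponent 0x7ffe and an all-ones significand, so it equals (2^113 − 1) × 2^(16383 − 112). The sketch below avoids converting the full integer to a decimal string, since recent Python versions limit such conversions by default; the variable names are illustrative.

```python
import math

# Largest finite binary128 value as an exact integer:
# significand 1.111...1 (113 ones) scaled by 2**16383.
max_significand = (1 << 113) - 1
max_value = max_significand << (16383 - 112)

# Order of magnitude, and the five leading decimal digits.
magnitude = math.floor(math.log10(max_value))
leading_digits = max_value // 10 ** (magnitude - 4)
```

Here `magnitude` comes out as 4932 and `leading_digits` as 11897, matching the approximation 1.1897 × 10⁴⁹³².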

Implementations

Quadruple precision is currently implemented in C/C++ and Fortran as long double and REAL*16^[1] respectively. Not all compilers support the quad-precision type. Microsoft Visual C++, for example, makes the long double type synonymous with the double-precision type^[2]. The GNU Fortran compiler also lacks support for quad precision, but proprietary compilers such as the Intel Fortran Compiler do support it^[3].

Quadruple precision examples

These examples give the bit representation of the floating-point value in hexadecimal. Each pattern includes the sign, the (biased) exponent, and the significand.

3fff 0000 0000 0000 0000 0000 0000 0000   = 1
c000 0000 0000 0000 0000 0000 0000 0000   = -2

7ffe ffff ffff ffff ffff ffff ffff ffff   ≈  1.189731495357231765085759326628007 × 10⁴⁹³² (max quadruple precision)

0000 0000 0000 0000 0000 0000 0000 0000   = 0
8000 0000 0000 0000 0000 0000 0000 0000   = -0

7fff 0000 0000 0000 0000 0000 0000 0000   = infinity
ffff 0000 0000 0000 0000 0000 0000 0000   = -infinity
				
3ffd 5555 5555 5555 5555 5555 5555 5555   ≈  1/3

Under the default rounding mode (round to nearest), 1/3 rounds down, as it does in double precision, because the significand has an odd number of bits (113). The bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
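The rounding of 1/3 can be reproduced with exact rational arithmetic. The following is a sketch, not a full encoder: the hypothetical function `encode_normal` handles only nonzero values that land in the normal range, and it assumes round-to-nearest with ties to even.

```python
from fractions import Fraction


def encode_normal(value: Fraction) -> int:
    """Round a nonzero rational to the nearest normal binary128 bit pattern.

    Illustrative only: no subnormals, zeros, infinities, or NaNs.
    """
    if value == 0:
        raise ValueError("zero is not a normal number")
    sign = 1 if value < 0 else 0
    mag = abs(value)

    # Find e with 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1

    # Scale so the integer part holds all 113 significand bits.
    scaled = mag / Fraction(2) ** e * (1 << 112)
    q, r = divmod(scaled.numerator, scaled.denominator)

    # Round to nearest, ties to even, using the exact remainder.
    if 2 * r > scaled.denominator or (2 * r == scaled.denominator and q & 1):
        q += 1
    if q == 1 << 113:  # rounding carried into the next binade
        q >>= 1
        e += 1
    if not -16382 <= e <= 16383:
        raise ValueError("value is outside the normal range")

    # Drop the implicit leading 1 and add the exponent bias.
    return (sign << 127) | ((e + 16383) << 112) | (q - (1 << 112))


one_third = encode_normal(Fraction(1, 3))
```

For 1/3 the remainder is below half of the divisor, so the quotient is left unchanged and the result is the pattern 3ffd 5555 ... 5555 listed above.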

References

^ "Single, Double, and Long Double Precision". Sun Microsystems. Retrieved 2010-01-23.
^ MSDN homepage, about Visual C++ compiler [1]
^ "Intel Fortran Compiler Product Brief" (PDF). Su. Retrieved 2010-01-23.

External links

High-Precision Software Directory

@@ Line 47: / Line 47: @@
 === Implementations ===
-Quadruple precision is currently implemented in [[C (programming language)|C]]/[[C++]] and [[Fortran]] as [[long double]] and REAL*16<ref>{{cite web|title=Single, Double, and Long Double Precision |url=http://docs.sun.com/app/docs/doc/801-7639/6i1ucu1ul?a=view|work=|publisher=Su|date=|accessdate=2010-01-23}}</ref> respectively. Not all compilers support the quad-precison type. Microsoft [[Visual C++]], for example, makes the [[long double]] type synomymous with the double-precison type<ref>MSDN homepage, about Visual C++ compiler [http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx]</ref>. The [[GNU]] Fortran compiler also lacks support for the quad-precison, but proprietary compilers such as the [[Intel Fortran Compiler]] do support quad-precison<ref>{{cite web|title= Intel Fortran Compiler Product Brief |url=http://h21007.www2.hp.com/portal/download/files/unprot/intel/product_brief_Fortran_Linux.pdf|work=|publisher=Su|date=|accessdate=2010-01-23}}</ref>.
+Quadruple precision is currently implemented in [[C (programming language)|C]]/[[C++]] and [[Fortran]] as [[long double]] and REAL*16<ref>{{cite web|title=Single, Double, and Long Double Precision |url=http://docs.sun.com/app/docs/doc/801-7639/6i1ucu1ul?a=view|work=|publisher=Sun Microsystems|date=|accessdate=2010-01-23}}</ref> respectively. Not all compilers support the quad-precison type. Microsoft [[Visual C++]], for example, makes the [[long double]] type synomymous with the double-precison type<ref>MSDN homepage, about Visual C++ compiler [http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx]</ref>. The [[GNU]] Fortran compiler also lacks support for the quad-precison, but proprietary compilers such as the [[Intel Fortran Compiler]] do support quad-precison<ref>{{cite web|title= Intel Fortran Compiler Product Brief |url=http://h21007.www2.hp.com/portal/download/files/unprot/intel/product_brief_Fortran_Linux.pdf|work=|publisher=Su|date=|accessdate=2010-01-23}}</ref>.
 === Quadruple precision examples ===