Floating-point arithmetic: Difference between revisions

From Wikipedia, the free encyclopedia
Tsuji (talk | contribs)
additional hyperlinked terms, remove one oops.
*An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.


Computer hardware is typically able to raise [[Exception handling|exceptions]] ("traps") when these events occur. How this is done is very system-dependent. Usually all exceptions are ''masked'' (disabled). Sometimes overflow, zerodivide, and operand error are enabled.


== Peculiar properties of floating point arithmetic ==
Because of the inability to represent all of the reals, and the consequent rounding of every arithmetic result, naive use of floating point arithmetic can lead to many problems. A good understanding of [[numerical analysis]] is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.


In addition to careful design of programs, careful handling by the [[compiler]] is essential. Certain "optimizations" that compilers might make (for example, reordering operations) will lead to errors. There is much controversy about the failings of compilers and language designs in this area.


Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/. The "decimal" data type of the C# and Java programming languages, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and make the arithmetic always behave as expected when numbers are printed in decimal.
18 786432 '''3.141592653'''32534345 '''3.14159265'''178665197 '''3.1415926'''4870926902
19 1572864 '''3.14159265'''599338609 '''3.14159265'''130664472 '''3.1415926'''4193316198
20 3145728 '''3.1415926'''4532121550 '''3.1415926'''4645056668 '''3.1415926'''4870926902
21 6291456 '''3.1415926'''4532121550 '''3.1415926'''4419186433 '''3.1415926'''4193316198
22 12582912 '''3.141592'''98683065634 '''3.1415926'''4645055406 '''3.14159'''196569034949
23 25165824 '''3.14159'''503588652176 '''3.14159'''423790206850 '''3.1415926'''4193316198
24 50331648 '''3.14159'''776795893004 '''3.141592'''19110183233 '''3.1415'''8103738763692

Second, form b) for both:
Stage Sides Inscribed (2I + C)/3 Circumscribed
1 6 '''3'''.00000000000000000 '''3.1'''5470053837925153 '''3'''.46410161513775459
2 12 '''3.1'''0582854123024915 '''3.14'''234913054465692 '''3'''.21539030917347248
3 24 '''3.1'''3262861328123820 '''3.141'''63905621999229 '''3.1'''5965994209750048
4 48 '''3.1'''3935020304686721 '''3.14159'''554040838980 '''3.14'''608621513143497
5 96 '''3.141'''03195089050964 '''3.141592'''83380879586 '''3.14'''271459964536830
6 192 '''3.141'''45247228546208 '''3.1415926'''6485024934 '''3.141'''87304997982387
7 384 '''3.1415'''5760791185765 '''3.14159265'''429352127 '''3.141'''66274705684853
8 768 '''3.1415'''8389214831841 '''3.141592653'''63377545 '''3.141'''61017660468954
9 1536 '''3.14159'''046322805010 '''3.1415926535'''9254211 '''3.14159'''703432152615
10 3072 '''3.141592'''10599927155 '''3.141592653589'''96504 '''3.14159'''374877135203
11 6144 '''3.141592'''51669215745 '''3.141592653589'''80398 '''3.141592'''92738509703
12 12288 '''3.1415926'''1936538396 '''3.141592653589793'''91 '''3.141592'''72203861382
13 24576 '''3.1415926'''4503369090 '''3.1415926535897932'''8 '''3.1415926'''7070199805
14 49152 '''3.14159265'''145076765 '''3.1415926535897932'''4 '''3.14159265'''786784442
15 98304 '''3.14159265'''305503684 '''3.1415926535897932'''4 '''3.14159265'''465930603
16 196608 '''3.141592653'''45610414 '''3.1415926535897932'''4 '''3.141592653'''85717144
17 393216 '''3.141592653'''55637096 '''3.1415926535897932'''4 '''3.141592653'''65663779
18 786432 '''3.14159265358'''143767 '''3.1415926535897932'''4 '''3.141592653'''60650438
19 1572864 '''3.14159265358'''770435 '''3.1415926535897932'''4 '''3.1415926535'''9397102
20 3145728 '''3.141592653589'''27102 '''3.1415926535897932'''4 '''3.1415926535'''9083769
21 6291456 '''3.141592653589'''66268 '''3.1415926535897932'''4 '''3.1415926535'''9005435
22 12582912 '''3.1415926535897'''6060 '''3.1415926535897932'''4 '''3.141592653589'''85852
23 25165824 '''3.1415926535897'''8508 '''3.1415926535897932'''4 '''3.141592653589'''80956
24 50331648 '''3.14159265358979'''120 '''3.1415926535897932'''4 '''3.14159265358979'''732

4*atan(1) = '''3.1415926535897932'''4
Good digits: '''3.14159265358979323846264338327950288'''

== See also ==
*[[Significant digits]]
*[[Fixed-point arithmetic]]
*[[Computable number]]
*[[IEEE Floating Point Standard]]
*[[IBM Floating Point Architecture]]
*[[FLOPS]]
*[[−0 (number)]]
*[[half precision]] &ndash; [[single precision]] &ndash; [[double precision]] &ndash; [[quad precision]] &ndash; [[minifloat]]
*[[Scientific notation]]

== References ==
* An edited reprint of the paper ''[http://docs.sun.com/source/806-3568/ncg_goldberg.html What Every Computer Scientist Should Know About Floating-Point Arithmetic]'', by David Goldberg, published in the March, 1991 issue of Computing Surveys.
* David Bindel’s [http://www.cs.berkeley.edu/~dbindel/class/cs279/dsb-bib.pdf Annotated Bibliography] on computer support for scientific computation.
* [[Donald Knuth]]. ''The Art of Computer Programming'', Volume 2: ''Seminumerical Algorithms'', Third Edition. Addison-Wesley, 1997. ISBN 0-201-89684-2. Section 4.2: Floating Point Arithmetic, pp.214&ndash;264.
* Kahan, William and Darcy, Joseph (2001). How Java’s floating-point hurts everyone everywhere. Retrieved Sep. 5, 2003 from [http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf].
* [http://www.geocities.com/SiliconValley/Pines/6639/docs/fp_summary.html Introduction to Floating point calculations and IEEE 754 standard] by Jamil Khatib
* [http://home.earthlink.net/~mrob/pub/math/floatformats.html Survey of Floating-Point Formats] This page gives a very brief summary of floating-point formats that have been used over the years.

[[Category:Data types]]
[[Category:Computer arithmetic]]


Revision as of 23:01, 25 July 2006

Floating-point is a way of representing a subset of the real numbers in terms of an exponent, which is an integer, and a significand, which is a string of some chosen number of digits. Floating-point representation is very common because of computers, but it has its roots in scientific notation.

Basics

A floating-point representation requires, first of all, a choice of base or radix for the significand, and a choice of the number of digits in the significand. In this article, the base will be denoted by b, and the number of digits, or precision, by p. The significand is a number consisting of p digits in radix b, so each digit lies in the range 0 through b-1. A base of 2 (that is, binary representation) is nearly always used in computers, though some computers use b=10 or b=16. A base of 10 (that is, decimal representation) is used in the familiar scientific notation.

As an example, the revolution period of Jupiter's moon Io could be represented in scientific notation as 1.528535047 × 10^5 seconds. The string of digits "1528535047" is the significand, and the exponent is 5.

Now this could be represented as any of

  • 1528.535047 × 10^2
  • 1528535047. × 10^−4
  • .000001528535047 × 10^11

The main benefit of scientific notation is that it makes representations like the last one unnecessary, by allowing the decimal point to be put in a convenient place. True floating-point notation uses a precise specification that the point is always just to the right of the leftmost digit of the significand, so the correct representation is 1.528535047 × 10^5.

This, plus the requirement that the leftmost digit of the significand be nonzero, is called normalization. By doing this, one no longer needs to say where the point is; it is deduced from the exponent. In decimal floating-point notation with precision of 10, the revolution period of Io is simply e=5; s=1528535047.
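
As a toy illustration of this normalization (the helper name `normalize10` is ours, built on Python's standard decimal module, not anything from the article):

```python
from decimal import Decimal

def normalize10(x, p=10):
    # Toy sketch: return (e, s) such that x = s[0].s[1:] x 10**e
    # with a p-digit decimal significand.
    sign, digits, exp = Decimal(x).as_tuple()
    e = exp + len(digits) - 1            # point sits just right of the first digit
    s = (list(digits) + [0] * p)[:p]     # pad or trim to precision p
    return e, ''.join(map(str, s))

# The revolution period of Io, 152853.5047 seconds:
print(normalize10('152853.5047'))   # → (5, '1528535047')
```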

Some people (and some computer representations) may prefer a different convention for the presumed location of the point, such as to the left of the leftmost digit. This doesn't make any actual difference; it simply offsets the exponent.

When a (nonzero) floating-point number is normalized, its leftmost digit is nonzero. The value of the significand obeys 1 ≤ s < b. (Zero can't be represented in normalized floating-point notation, and computers have to deal with it separately.)

The mathematical value of a floating-point number is s.ssssssss...sss × b^e.

In binary radix, the significand is a string of bits (1's and 0's) of length p, of which the leftmost bit is 1. The number π, represented in binary, is

11.0010010000111111011010101000100010000101101000110000100011010011.....

When rounded to a precision of 24, it is

11.0010010000111111011011

In binary floating-point, it is e=1 ; s=110010010000111111011011.

(There is no reason to write the exponent in any form other than a plain decimal number. When floating-point numbers are stored in a computer, a suitable binary encoding is used.)
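
Jumping ahead to the computer encoding described later, this 24-bit rounding can be reproduced in Python with the standard struct module, by storing π in IEEE single precision and decoding the fields of the resulting bit pattern:

```python
import math
import struct

# Round pi to a 24-bit significand by storing it in IEEE single precision,
# then decode the 32-bit pattern into its fields.
bits = struct.unpack('>I', struct.pack('>f', math.pi))[0]
e = ((bits >> 23) & 0xFF) - 127      # undo the excess-127 exponent encoding
s = (bits & 0x7FFFFF) | (1 << 23)    # restore the hidden leading 1 bit

print(e, format(s, 'b'))   # → 1 110010010000111111011011
```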

Accuracy, and misconceptions thereof

For any fixed precision p, the floating-point numbers can represent only a subset of the real numbers. (This is so even if the possible exponents are any integer. In actual computer representations, there is a finite range for the exponent, and the total set of floating-point numbers is actually finite.) This incompleteness is the same as what people are used to when they express measurements to a certain number of decimal places, as in "The atomic weight of hydrogen is 1.008".

However, there is a common belief that floating-point numbers are imprecise because they are only approximations. In practice, it is the floating-point arithmetic operations, not the numbers themselves, that are imprecise. Floating point numbers, carrying roughly 15 significant decimal digits in double precision, hold far more precision on their own than exists in the vast majority of real-world measurements or is relevant to nearly all applications. It can also be noted that floating point numbers are exact values, just as integers are. While this is technically true, it is of little relevance, since making practical use of that exactness is far harder than it is for integers, for reasons which will be given shortly.

Double as well as single precision floating point representations, in fact, carry extreme amounts of precision (even if we accept them as approximations). However, repeated operations of multiplication, division, addition, and, most importantly of all, subtraction, accumulate very large amounts of error. It is these floating point arithmetic operations that are the real source of imprecision, and it is they that make the very large representations necessary and even insufficient.

For example, one million repeated multiplications of numbers which have 7 significant decimal digits (i.e., IEEE single precision) will, on average, produce a final result accurate to just 4 digits. In the worst case, a result may be accurate to just 1 significant decimal digit. Worst of all, it is impossible to tell from the result alone just how accurate it turned out to be. Given that a modern processor can perform on the order of 10 billion floating point operations per second, even double-precision numbers may not be safe with regard to multiplication, division, and addition.

However, the operation that loses precision fastest is subtraction of two numbers of near-equal value. For example, in a 7-digit decimal representation, 1.234567e0 − 1.234888e0 yields −3.210000e−4. Each of those trailing zeroes represents a digit whose value is unknown: in just a single operation, the precision of the result has been halved.
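
The same cancellation is easy to observe in Python (doubles here, so sixteen-odd digits rather than seven; the operands are the ones from the example):

```python
from decimal import Decimal

a, b = 1.234567, 1.234888
d = b - a
print(d)                        # close to 0.000321, but not exactly
print(Decimal(b) - Decimal(a))  # the exact difference of the two stored doubles
```

The printed difference agrees with 0.000321 only in its leading digits; the trailing digits are rounding noise inherited from storing a and b.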

Other operations, such as powers, square roots, trigonometric functions, etc., carry various amounts of error and risk.

Vital in avoiding the pitfalls of floating point numbers is the choice of algorithm. Different algorithms, computing the same problem, may have various degrees of numerical instability.

Lastly, there is a very important caveat in trying to employ floating-point numbers in "infinitely exact" arithmetic (although they are as capable of it as integers). It stems from the fact that floating point numbers are binary fractions, while we commonly write decimals. Thus, 0.5 multiplied by 0.5 will exactly equal 0.25, with no caveats or gotchas, and comparisons for equality may be safely made, just as one is accustomed to do with integers. However, 0.1*0.1 will not equal 0.01, because there is no way of representing 0.1 or 0.01 exactly as a binary fraction. Nor will the 0.1 approximation multiplied by itself equal the 0.01 approximation. This is like adding 1/3 (0.333) to itself and getting 0.666 instead of 0.667 (2/3). Thus, in most cases, if one is trying to compare arithmetic results, even ones which seemingly are safely bound by the precision of the format, one must take the absolute value of the difference between them and compare it against a small tolerance. (E.g., in C, instead of if(a==b), write if(fabs(a-b) < 0.00001), using fabs from <math.h>; the integer abs would discard the fractional difference.)
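
A Python transcription of these examples (math.isclose is the standard-library way to compare with a tolerance):

```python
import math

print(0.5 * 0.5 == 0.25)   # → True: 0.5 and 0.25 are exact binary fractions
print(0.1 * 0.1 == 0.01)   # → False: none of the three is exactly representable

# Compare against a tolerance instead of testing equality:
print(math.isclose(0.1 * 0.1, 0.01))     # → True
print(abs(0.1 * 0.1 - 0.01) < 0.00001)   # → True
```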

Banker's rounding

When the bits after the last available result bit are 1000000000..... exactly (that is, it is known that there are an infinite number of zeros after the initial 1), rounding upward or downward is equally accurate. In this case, the technique of Banker's rounding is usually used. The rounding is done in the direction that makes the significand even.
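
Python's built-in round() uses this round-half-to-even rule for halfway cases:

```python
# Halfway cases round to the nearest even value ("banker's rounding"):
print(round(0.5), round(1.5), round(2.5), round(3.5))   # → 0 2 2 4
```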

Mantissa

The word mantissa is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand. The term "mantissa" is nevertheless widely used to refer to the significand.

Computer representation

(This section describes some general issues, but mostly follows the IEEE standard.)

To represent a floating-point number in a computer datum, the exponent has to be encoded into a bit field. Since the exponent could be negative, one could use two's complement representation. Instead, what is commonly done is something very similar: a fixed constant is added to the exponent, with the intention that the result be a positive number capable of being packed into a fixed bit field. For the common 32 bit "single precision" or "float" format of the IEEE standard, this constant is 127, so the exponent is said to be represented in "excess 127" format. The result of this addition is placed in an 8 bit field.

Since the leftmost significand bit of a (normalized) floating-point number is always 1, that bit is not actually placed in the computer datum. The computer's arithmetic hardware acts as though that "1" had been provided. This is the "implicit bit" or "hidden bit" of the IEEE standard. Because of this, a 24 bit significand is actually placed in a 23 bit field.

Finally, a sign bit is required. This is one to indicate that the entire floating-point number is negative, or zero to indicate that it is positive. (In the past, some computers have used a kind of two's complement encoding for the entire number, rather than simple "sign/magnitude" format, but it didn't work very well.)

The entire floating-point number is packed into a 32 bit word, with the sign bit leftmost, followed by the exponent in excess 127 format in the next 8 bits, followed by the significand (without the hidden bit) in the rightmost 23 bits.

For the approximation of π, we have

  • sign = 0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)
  • e+127 = 128 = 10000000 in (8 bit) binary
  • final 32 bit result = 0 10000000 10010010000111111011011 = 40490FDB in hexadecimal

As noted above, this number is not really π, but is exactly 3.1415927410125732421875.
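
This is easy to check in Python: unpack the bit pattern 40490FDB and print the exact decimal value of the resulting float (constructing a Decimal from a float is exact):

```python
from decimal import Decimal
import struct

# The single-precision approximation to pi, from its hex bit pattern:
x = struct.unpack('>f', bytes.fromhex('40490FDB'))[0]
print(Decimal(x))   # → 3.1415927410125732421875
```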

In the common 64 bit "double precision" or "double" format of the IEEE standard, the offset added to the exponent is 1023, and the result is placed into an 11 bit field. The precision is 53 bits. After removal of the hidden bit, 52 bits remain. The result comprises 1+11+52=64 bits. The approximation to π is

  • sign = 0 ; e=1 ; s=11001001000011111101101010100010001000010110100011000 (including the hidden bit)
  • e+1023 = 1024 = 10000000000 in (11 bit) binary
  • final 64 bit result = 0 10000000000 1001001000011111101101010100010001000010110100011000 = 400921FB54442D18 in hexadecimal

This number is exactly 3.141592653589793115997963468544185161590576171875.

Overflow, underflow, and zero

The necessity to pack the offset exponent into a fixed-size bit field places limits on the exponent. For the standard 32 bit format, e+127 must fit into an 8 bit field, so −127 ≤ e ≤ 128. The values −127 and +128 are reserved for special meanings, so the actual range for normalized floating-point numbers is −126 ≤ e ≤ 127. This means that the smallest normalized number is

e=−126 ; s=100000000000000000000000

which is about 1.18 × 10^−38, and is represented in hexadecimal as 00800000. The largest representable number is

e=+127 ; s=111111111111111111111111

which is about 3.4 × 10^38, and is represented in hexadecimal as 7F7FFFFF. For double precision the range is about 2.2 × 10^−308 to 1.8 × 10^308.
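
Both single-precision limits can be verified from the hex patterns given above (f32 is our own little decoding helper, not a standard function):

```python
import struct

def f32(hexstr):
    # Decode a 32-bit IEEE single-precision pattern into its value.
    return struct.unpack('>f', bytes.fromhex(hexstr))[0]

print(f32('00800000'))   # smallest normalized single: 2**-126, about 1.18e-38
print(f32('7F7FFFFF'))   # largest finite single: (2 - 2**-23) * 2**127, about 3.4e38
```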

Any floating-point computation that gives a result (after rounding to a representable value) higher than the upper limit is said to overflow. Under the IEEE standard, such a result is set to a special value "infinity", which has the appropriate sign bit, the reserved exponent +128, and a bit pattern in the significand (typically zero) to indicate that this is infinity. Such numbers are generally printed as "+INF" or "-INF".

Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as

(+INF) + (+7) = (+INF)
(+INF) * (-2) = (-INF)
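
Python floats, which follow the IEEE rules, behave the same way:

```python
inf = float('inf')
print(inf + 7)    # → inf
print(inf * -2)   # → -inf
```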

A floating-point computation that (after rounding) gives a nonzero result lower than the lower limit is said to underflow. This could happen, for example, if 10^−25 is multiplied by 10^−25 in single precision. Under the IEEE standard, the reserved all-zeros exponent field is used, and the significand is set as follows.

First, if the number is zero, it is represented by an exponent field and a significand field of all zeros. This means that zero is represented in hexadecimal as 00000000.

Otherwise, if normalizing the number would lead to an exponent below −126, it is only normalized until the exponent is −126. That is, instead of shifting the significand bits left until the leftmost bit is 1, they are shifted only until the exponent reaches −126. For example, the smallest non-underflowing number is

e=−126 ; s=1.00000000000000000000000 (about 1.18 × 10^−38)

A number 1/16th as large would be

e=−130 ; s=1.00000000000000000000000 (about 7.3 × 10^−40)

If it is partially normalized, one gets

e=−126 ; s=0.00010000000000000000000

This does not have a leading bit of 1, so using the "hidden bit" mechanism won't work. What is done is to store the significand without removing the leftmost bit, since there is no guarantee that it is 1. This means that the precision is only 23 bits, not 24. The exponent field is stored as all zeros; although that pattern would decode as −127 in excess-127 format, subnormal numbers are interpreted with an exponent of −126 and no hidden bit. The final representation is

0 00000000 00010000000000000000000 = 00080000 in hexadecimal

Whenever the exponent field is all zeros in the datum, the bits are interpreted in this special format. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal".

The smallest possible (subnormal) nonzero number is

0 00000000 00000000000000000000001 = 00000001 in hexadecimal
e=−126 ; s=0.00000000000000000000001

which is 2^−149, or about 1.4 × 10^−45.

The handling of the number zero can be seen to be a completely ordinary case of a subnormal number.
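
The smallest subnormal can likewise be checked from its bit pattern (struct widens the single-precision value exactly to a Python double):

```python
import struct

tiny = struct.unpack('>f', bytes.fromhex('00000001'))[0]
print(tiny == 2 ** -149)   # → True
print(tiny)                # about 1.4e-45
```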

The creation of denorms is often called "gradual underflow". As numbers get extremely small, significand bits are slowly sacrificed. The alternative is "sudden underflow", in which any number that can't be normalized is simply set to zero by the computer hardware. Gradual underflow is difficult for computer hardware to handle, so hardware often uses software to assist it, through interrupts. This creates a serious performance penalty, and sometimes sudden underflow is used instead. This is a source of controversy.

== Behavior of computer arithmetic ==

As noted above, the standard behavior of computer hardware is to round the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In practice, there are other options. IEEE-754-compliant hardware allows one to set the rounding mode to any of

  • round to nearest (the default; by far the most common mode.)
  • round up (toward +∞. This means that negative results will round toward zero.)
  • round down (toward −∞. Negative results round away from zero.)
  • round toward zero (always. This is sometimes called "chop" mode. It is similar to the common behavior of float-to-integer conversions, which convert −3.5 to −3.)
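Python does not expose the IEEE rounding-mode controls, but the truncating ("chop") behavior of float-to-integer conversion mentioned in the last item is easy to observe, along with the contrast with round-to-nearest:

```python
import math

# Conversion to integer truncates toward zero, like "chop" rounding:
print(int(-3.5), int(3.5))      # -3 3
print(math.trunc(-3.5))         # -3

# By contrast, round() uses round-half-to-even ("round to nearest"):
print(round(-3.5), round(2.5))  # -4 2
```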

In the default rounding mode the IEEE 754 standard mandates the round-to-nearest behavior described above for all fundamental algebraic operations, including square root. ("Library" functions such as cosine and log are not mandated.) This means that the result produced by IEEE-compliant hardware is completely determined, down to the last of its 32 or 64 bits.

The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.

The arithmetical distance between two consecutive representable floating point numbers is called an "ULP", for Unit in the Last Place. For example, the numbers represented by 45670123 and 45670124 hexadecimal differ by one ULP. An ULP is about 10<sup>−7</sup> in single precision, and 10<sup>−16</sup> in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of an ULP of the infinitely precise result.
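In Python 3.9 and later, ULP sizes for double precision can be inspected directly with math.ulp and math.nextafter; a brief sketch:

```python
import math

# ULP of 1.0 in double precision is 2**-52, about 2.2e-16:
print(math.ulp(1.0) == 2.0 ** -52)       # True
print(math.nextafter(1.0, 2.0) - 1.0)    # 2**-52: gap to the next double

# The gap between consecutive representable numbers grows with magnitude:
print(math.ulp(1e10) > math.ulp(1.0))    # True
```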

== Exceptional values, and exceptions, under the IEEE standard ==

In addition to the "infinity" value that is produced when an overflow occurs, there is a special value "NaN" (standing for "not a number") that is produced by such things as taking the square root of a negative number, or the inverse sine of 3. (The inverse sine function is typically handled by a library routine, which explicitly produces this value when the argument is out of range.) NaN is encoded with the reserved exponent of 128 (or 1024), and a nonzero significand field, which distinguishes it from infinity (whose significand field is all zeros).

The intention of the INF and NaN values is that, under the most common circumstances, they can just propagate from one operation to the next (any operation with NaN as an operand produces NaN as a result), and they only need to be attended to at a point that the programmer chooses.
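This propagation behavior can be demonstrated in Python, where the IEEE exceptional values are available as float('inf') and float('nan'):

```python
import math

inf = float('inf')
nan = float('nan')

# Infinity and NaN propagate through ordinary arithmetic:
print(inf + 1.0)             # inf
print(nan + 1.0, nan * 0.0)  # nan nan

# NaN compares unequal to everything, including itself,
# so math.isnan is the reliable test:
print(nan == nan)            # False
print(math.isnan(nan))       # True
```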

In addition to the creation of exceptional values, there are "events" that may occur, though some of them are quite benign:

  • An overflow occurs as described previously, producing an infinity.
  • An underflow occurs as described previously, producing a denorm.
  • A zerodivide occurs whenever a divisor is zero, producing an infinity of the appropriate sign. (The sign of zero is meaningful here.) Note that a very small but nonzero divisor can still cause an overflow and produce an infinity.
  • An "operand error" occurs whenever a NaN has to be created. This occurs whenever any operand to an operation is a NaN, or some other obvious thing happens, such as sqrt(−2.0) or log(−1.0).
  • An "inexact" event occurs whenever the rounding of a result changes that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.
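Several of these events can be provoked from Python, where the exceptions are masked and the exceptional values simply appear as results. (Python itself raises an error for float division by zero, so overflow is shown below via multiplication instead.)

```python
import sys

# Overflow: the ideal result exceeds the exponent range, producing infinity.
big = sys.float_info.max
print(big * 2.0)                     # inf

# Underflow: the result is a denorm rather than a normalized value.
print(sys.float_info.min / 4.0 > 0)  # True: a subnormal, not zero

# Inexact: almost every operation rounds; 1/3 is not representable.
print(1.0 / 3.0)                     # 0.3333333333333333 (rounded)
```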

Computer hardware is typically able to raise exceptions ("traps") when these events occur. How this is done is very system-dependent. Usually all exceptions are masked (disabled). Sometimes overflow, zerodivide, and operand error are enabled.

== Peculiar properties of floating point arithmetic ==

The fact that the set of representable floating-point numbers is not the entire set of reals, and that floating-point arithmetic operations have to round each result into a representable item, means that the arithmetic performed by computers does not obey all of the expected properties.

While floating point addition and multiplication are commutative, they are not associative. This means that, in general, for floating point numbers x, y, and z:

(x + y) + z   is not necessarily equal to   x + (y + z)

Floating point arithmetic is also not distributive. This means that, in general:

x × (y + z)   is not necessarily equal to   (x × y) + (x × z)

For example, with most floating-point implementations

1.0 + (1e100 − 1e100) gives the result 1.0
(1.0 + 1e100) − 1e100 gives the result 0.0.

The reason is that, in the first case, the initial subtraction gives zero, since that is the true result and is representable, and 1 can be added to it successfully. In the second case, the floating point approximation to 1e100, when 1 is added, results in the same approximation, so the subtraction yields zero.

(The situation is actually more subtle than that. By "1e100", we mean the floating point number closest to it. The starting point for any floating point operation has to be a representable number.)

In short, the order in which operations are carried out can change the output of a floating point calculation. This is important in numerical analysis since two mathematically equivalent formulas may not produce the same numerical output, and one may be substantially more accurate than the other.
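The example above can be reproduced in any language whose floating point type is an IEEE double; a sketch in Python:

```python
# Floating point addition is not associative: grouping changes the result.
a = 1.0 + (1e100 - 1e100)   # inner subtraction gives exactly 0.0
b = (1.0 + 1e100) - 1e100   # 1.0 is absorbed by the huge operand
print(a)   # 1.0
print(b)   # 0.0
```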

== Problems with floating-point ==

Because of the inability to represent all of the reals, and the consequent rounding of every arithmetic result, naive use of floating point arithmetic can lead to many problems. A good understanding of numerical analysis is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.

In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) will lead to errors. There is much controversy about the failings of compilers and language designs in this area.

Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/. The decimal data type of the C# programming language, the BigDecimal class of the Java programming language, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and to make the arithmetic behave as expected when numbers are printed in decimal.
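Python's standard decimal module, which follows the same decimal-arithmetic design as the IBM work linked above, illustrates the contrast:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly:
print(0.1 + 0.2 == 0.3)                                   # False
print(0.1 + 0.2)                                          # 0.30000000000000004

# Decimal arithmetic behaves as the decimal strings suggest:
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True
```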

The general types of things that can cause trouble are:

  • Non-representable numbers: common numbers, such as 0.1, are compromised right from the beginning.
  • Conversions to integer are unforgiving: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6.
  • Absorption: 10<sup>15</sup> + 1 may be the same as 10<sup>15</sup>.
  • Cancellation: subtraction of nearly equal operands may cause serious loss of accuracy.
  • Limited exponent range: results might overflow, yielding infinity.
  • Testing for safe division is problematical: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
  • Comparison for exact equality of two numbers is problematical. Programmers often perform comparisons within some tolerance, but that doesn't necessarily make the problem go away.

Unlike the fixed-point counterpart, the application of dither in a floating point environment is nearly impossible. See external references for more information about the difficulty of applying dither and the rounding error problems in floating point systems.

== A few nice properties ==

One can sometimes take advantage of a few nice properties:

  • Any integer whose magnitude is strictly less than 2<sup>24</sup>−1 can be exactly represented in the single precision format, and any integer whose magnitude is strictly less than 2<sup>53</sup>−1 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented.
  • The bit representations are monotonic, as long as exceptional values are avoided and the signs are handled properly. Floating point numbers are equal if and only if their integer bit representations are equal. Comparisons for larger or smaller can be done with integer comparisons on the bit patterns, as long as the signs match. However, the actual floating point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.
  • To a rough approximation, the bit representation of a floating point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.
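The second and third properties can be sketched in Python by extracting single-precision bit patterns with the standard struct module (float_bits here is an illustrative helper, not a library function):

```python
import struct

def float_bits(x):
    """Bit pattern of x as a 32-bit IEEE single-precision value."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

# For positive floats, the bit patterns sort in the same order as the values:
print(float_bits(1.0) < float_bits(1.5) < float_bits(2.0))  # True

# The bit pattern roughly encodes the base-2 logarithm:
# (pattern / 2**23) - 127 is near log2(x) for positive x.
print(float_bits(1.0) / 2 ** 23 - 127)   # 0.0  (log2(1) = 0)
print(float_bits(8.0) / 2 ** 23 - 127)   # 3.0  (log2(8) = 3)
```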

== IEEE standard ==

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.

The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit format, with 15 exponent bits and 64 significand bits, with no hidden bit. Software vendors may also provide additional extended formats, such as the H-P "quad" format (1 sign bit, 15 exponent bits, and 113 significand bits, 1 of which is hidden.) There is controversy about the failure of programming languages to make these hardware formats available to programmers.

As of 2000, the IEEE 754 standard is under revision. See IEEE 754r.

== Example ==

Archimedes approximated π by calculating the perimeters of polygons inscribed within and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. This generates two recurrence formulae for the perimeters, in terms of the half-angle formulae for sines and tangents respectively. Both have two forms, mathematically equivalent, but not so computationally. Here are results obtained via Turbo Pascal, using 80-bit floating point, on an IBM-PC clone:

         Inscribed (sine)              Circumscribed (tangent)
a)  s:=sqrt{[1 - sqrt(1 - s*s)]/2}     t:=[sqrt(1 + t*t) - 1]/t
b)  s:= s/sqrt{2*[1 + sqrt(1 - s*s)]}  t:=t/[1 + sqrt(1 + t*t)]
Start:     s = 1/2                           t = 1/sqrt(3)

First, form a) for both:

Stage    Sides       Inscribed           (2I + C)/3       Circumscribed
    1        6 3.00000000000000000 3.15470053837925153 3.46410161513775459
    2       12 3.10582854123024915 3.14234913054465692 3.21539030917347248
    3       24 3.13262861328123820 3.14163905621999229 3.15965994209750048
    4       48 3.13935020304686720 3.14159554040838980 3.14608621513143498
    5       96 3.14103195089050961 3.14159283380879585 3.14271459964536834
    6      192 3.14145247228546194 3.14159266485024924 3.14187304997982385
    7      384 3.14155760791185764 3.14159265429352108 3.14166274705684795
    8      768 3.14158389214831805 3.14159265363377626 3.14161017660469268
    9     1536 3.14159046322805003 3.14159265359253476 3.14159703432150421
   10     3072 3.14159210599928383 3.14159265358995624 3.14159374877130105
   11     6144 3.14159251669204966 3.14159265358971935 3.14159292738505874
   12    12288 3.14159261936512077 3.14159265358966492 3.14159272203875323
   13    24576 3.14159264503460936 3.14159265358907182 3.14159267069799674
   14    49152 3.14159265144937596 3.14159265359313918 3.14159265788066561
   15    98304 3.14159265303352628 3.14159265355780571 3.14159265460636456
   16   196608 3.14159265332534345 3.14159265356534707 3.14159265404535431
   17   393216 3.14159265332534345 3.14159265308533982 3.14159265260533258
   18   786432 3.14159265332534345 3.14159265178665197 3.14159264870926902
   19  1572864 3.14159265599338609 3.14159265130664472 3.14159264193316198
   20  3145728 3.14159264532121550 3.14159264645056668 3.1415926
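The contrast between the two forms of the sine recurrence can be reproduced with a short program. The sketch below uses Python doubles rather than the 80-bit arithmetic of the table, so form a) degrades several stages sooner; archimedes is an illustrative helper name:

```python
import math

def archimedes(stages):
    """Inscribed-polygon pi estimates via forms a) and b) of the sine recurrence."""
    s_a = s_b = 0.5                 # sin(30 degrees): start from a hexagon
    sides = 6
    for _ in range(stages - 1):
        # Form a): subtracts nearly equal quantities, losing accuracy.
        s_a = math.sqrt((1 - math.sqrt(1 - s_a * s_a)) / 2)
        # Form b): algebraically equivalent, but numerically stable.
        s_b = s_b / math.sqrt(2 * (1 + math.sqrt(1 - s_b * s_b)))
        sides *= 2
    return sides * s_a, sides * s_b

approx_a, approx_b = archimedes(30)
print(abs(approx_a - math.pi))   # large: catastrophic cancellation in form a)
print(abs(approx_b - math.pi))   # tiny: form b) stays accurate
```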