Jump to content

Floating-point arithmetic: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
m fixing link
Amoss (talk | contribs)
Moved basics section to top
Line 18: Line 18:
*[[Single precision]], called "float" in the C language family, and "real" or "real*4" in Fortran. It occupies 32 bits (4 bytes) and gives about 7 digits of precision.
*[[Single precision]], called "float" in the C language family, and "real" or "real*4" in Fortran. It occupies 32 bits (4 bytes) and gives about 7 digits of precision.
*[[Double precision]], called "double" in the C language family, and "doubleprecision" or "real*8" in Fortran. It occupies 64 bits (8 bytes) and gives about 16 digits of precision.
*[[Double precision]], called "double" in the C language family, and "doubleprecision" or "real*8" in Fortran. It occupies 64 bits (8 bytes) and gives about 16 digits of precision.

==History==

The floating point system of numbers was used by the [[Kerala School]] of mathematics in [[14th century]] [[India]] to investigate and rationalise about the [[convergence]] of [[series (mathematics)|series]].

In [[1938]], [[Konrad Zuse]] of Berlin, completes the "Z1", the first mechanical binary programmable computer. It is based on Boolean Algebra and has most of the basic ingredients of modern machines, using the binary system and today's standard separation of [[computer storage|storage]] and control. Zuse's [[1936]] patent application (Z23139/GMD Nr. 005/021) also suggests a 'von Neumann' architecture (re-invented in [[1945]]) with program and data modifiable in storage. Originally the machine was called the "V1" but retroactively renamed after the war, to avoid confusion with the V1 missile. It works with floating point numbers having a 7-bit exponent, 16-bit mantissa, and a sign bit. The memory uses sliding metal parts to store 16 such numbers, and works well; but the arithmetic unit is less successful, occasionally suffering from certain mechanical engineering problems. The program is read from punched discarded 35 mm movie film. Data values can be entered from a numeric keyboard, and outputs are displayed on electric lamps. The machine is not a general purpose computer because it lacks looping capabilities. The [[Z3]] was completed in [[1941]] and was program controlled.

Once electronic digital computers became a reality, the need to process data in this way was quickly recognized. The first commercial computer to be able to do this in hardware appears to be the [[Z4 (computer)|Z4]] in [[1950]], followed by the the [[IBM 704]] in [[1954]]. For some time after that, floating-point hardware was an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. All modern general-purpose computers have this ability. The [[PDP-11/44]] was an extension of the 11/34 that included the [[cache memory]] and floating point units as a standard feature.

The [[UNIVAC 1100/2200 series]], introduced in [[1962]], supported two floating point formats: single precision - 36 bits: 1 sign bit, 8 bit characteristic, 27 bit mantissa, and double precision - 72 bits: 1 sign bit, 11 bit characteristic, 60 bit mantissa. The [[IBM 7094]], introduced the same year, also supported single and double precision, with slightly different format.

Prior to the [[IEEE-754]] standard, computers had many different formats (and, indeed, different word sizes), and there were also multiple formats in software emulations.

The IEEE-754 standard was created in the early 1980's, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. Among its innovations are these:
*A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.
*A precisely specified behavior of the arithmetic operations. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.
*The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.


==Basics==
==Basics==
Line 83: Line 66:
===Mantissa===
===Mantissa===
The word ''[[mantissa]]'' is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the ''characteristic'' is the integer part. This terminology comes from the way [[Common logarithm|logarithm]] tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand. <!-- this is already stated: The term "mantissa" is nevertheless widely used to refer to the significand because of the similarity of the pairs (mantissa, characteristic) in logarithms to (significand, exponent) in floating-point numbers. -->
The word ''[[mantissa]]'' is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the ''characteristic'' is the integer part. This terminology comes from the way [[Common logarithm|logarithm]] tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand. <!-- this is already stated: The term "mantissa" is nevertheless widely used to refer to the significand because of the similarity of the pairs (mantissa, characteristic) in logarithms to (significand, exponent) in floating-point numbers. -->

==History==

The floating point system of numbers was used by the [[Kerala School]] of mathematics in [[14th century]] [[India]] to investigate and rationalise about the [[convergence]] of [[series (mathematics)|series]].

In [[1938]], [[Konrad Zuse]] of Berlin, completes the "Z1", the first mechanical binary programmable computer. It is based on Boolean Algebra and has most of the basic ingredients of modern machines, using the binary system and today's standard separation of [[computer storage|storage]] and control. Zuse's [[1936]] patent application (Z23139/GMD Nr. 005/021) also suggests a 'von Neumann' architecture (re-invented in [[1945]]) with program and data modifiable in storage. Originally the machine was called the "V1" but retroactively renamed after the war, to avoid confusion with the V1 missile. It works with floating point numbers having a 7-bit exponent, 16-bit mantissa, and a sign bit. The memory uses sliding metal parts to store 16 such numbers, and works well; but the arithmetic unit is less successful, occasionally suffering from certain mechanical engineering problems. The program is read from punched discarded 35 mm movie film. Data values can be entered from a numeric keyboard, and outputs are displayed on electric lamps. The machine is not a general purpose computer because it lacks looping capabilities. The [[Z3]] was completed in [[1941]] and was program controlled.

Once electronic digital computers became a reality, the need to process data in this way was quickly recognized. The first commercial computer to be able to do this in hardware appears to be the [[Z4 (computer)|Z4]] in [[1950]], followed by the the [[IBM 704]] in [[1954]]. For some time after that, floating-point hardware was an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. All modern general-purpose computers have this ability. The [[PDP-11/44]] was an extension of the 11/34 that included the [[cache memory]] and floating point units as a standard feature.

The [[UNIVAC 1100/2200 series]], introduced in [[1962]], supported two floating point formats: single precision - 36 bits: 1 sign bit, 8 bit characteristic, 27 bit mantissa, and double precision - 72 bits: 1 sign bit, 11 bit characteristic, 60 bit mantissa. The [[IBM 7094]], introduced the same year, also supported single and double precision, with slightly different format.

Prior to the [[IEEE-754]] standard, computers had many different formats (and, indeed, different word sizes), and there were also multiple formats in software emulations.

The IEEE-754 standard was created in the early 1980's, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. Among its innovations are these:
*A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.
*A precisely specified behavior of the arithmetic operations. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.
*The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.



==Floating point arithmetic operations==
==Floating point arithmetic operations==

Revision as of 12:30, 13 November 2006

You must add a |reason= parameter to this Cleanup template – replace it with {{Cleanup|August 2007|reason=<Fill reason here>}}, or remove the Cleanup template.

Floating-point is a system of arithmetic in which the decimal (or other radix) point is allowed to float as computations are performed. Consider a system where we allow 3 digits in representing a number. In a fixed-point system we would set the position of the radix point, perhaps for two integer and one fractional digit(s). In this 2:1 system the multiplication 01.2 × 01.2 gives the result 01.4. In contrast floating point arithmetic allows the position of the point to 'float', so the three digits can be split four different ways in order to give a more accurate or exact answer: 1.2 × 1.2 might give the result 1.44.

'Floating-point' is also used as a generic term to describe number representations in computing that are used to implement the above system of arithmetic. A number representation (called a numeral system in maths) specifies some way of storing a number as a string of bits. The arthmetic is defined as a set of actions on bit-strings that simulate normal arithmetic operations. When the numbers being represented are the Rationals, one immediate issue is that there are an infinite number of Rational Numbers, and only a finite number of bits inside a real computer. The numbers that we represent must be an approximation of the entire set. When we restrict ourselves to binary expansions of numbers (because these are easiest to operate upon in a digital computer) the subset of the Rationals that we operate on is restricted to denominators that are powers of 2. Now any rational with a denominator that has a factor other than 2 will have an infinite binary expansion. If we consider this expansion as an infinite string of bits then there are several methods for approximating this string in memory:

  • When the bit-string stored is allowed to grow and shrink the representation is called Arbitrary-Precision arithmetic (eg "bignum" software). It is also possible to hold the denominator exactly an arbitrary-sized integer improving precision and accuracy as in rational arithmetic.
  • When we store a fixed size window at a constant position in the bit-string the representation is called Fixed Point. The hardware to manipulate these representations is less costly than Floating-Point and is commonly used to perform integer operations. In this case the radix point is always beneath the window.
  • When we store a fixed size window that is allowed to slide up and down the bit-string the representation is called Floating-point.

The fixed size window of bits is called the significand or mantissa, and indicates the magnitude of the number. An integer called the exponent records the position, or offset, of the window into the bit-string. This is also referred to at the characteristic or scale. The window always stores the bits beneath the radix point in the bit-string, so the exponent also indicates where the radix-point is located in the bit-string. Specifically the exponent is a scaling of the integer in the window by 2^e. The floating-point representation is analagous to scientific notation.

Computation with floating-point numbers plays a very important role in an enormous variety of applications in science, engineering, and industry. The ability to perform floating point operations is an important measure of performance for computers intended for such applications. The extent of this ability is measured in "FLOPS" (FLoating-point Operations Per Second).

This article will focus on the common formats used by nearly all modern floating-point computer hardware. These formats have been standardized as the IEEE 754 standard, commonly called "IEEE floating-point". The standard actually provides for many closely-related formats, differing in only a few details. Two of these formats are ubiquitous in computer hardware and languages:

  • Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. It occupies 32 bits (4 bytes) and gives about 7 digits of precision.
  • Double precision, called "double" in the C language family, and "doubleprecision" or "real*8" in Fortran. It occupies 64 bits (8 bytes) and gives about 16 digits of precision.

Basics

A floating-point representation requires, first of all, a choice of base or radix for the significand, and a choice of the number of digits in the significand. In this article, the base will be denoted by b, and the number of digits, or precision, by p. The significand is a number consisting of p digits in radix b, so each digit lies in the range 0 through b-1. A base of 2 (that is, binary representation) is nearly always used in computer hardware, though some computers use b=10 or b=16. A base of 10 (that is, decimal representation) is used in the familiar scientific notation.

As an example, the revolution period of Jupiter's moon Io could be represented in scientific notation as 1.528535047 × 105 seconds. The string of digits "1528535047" is the significand, and the exponent is 5.

Now this could be represented in scientific notation in many different ways. For example the value can be any of

  • 1528.535047 × 102
  • 1528535047. × 10−4
  • 1528535047000000. × 10−10
  • 0.000001528535047 × 1011

All of these are valid representations.

One benefit of scientific notation is that the decimal point can be placed so that long strings of leading or trailing zeros can be avoided. Floating-point notation goes a step further, mandating a specific place for the point—typically just after the leftmost nonzero digit, and lets the exponent handle the rest. So, in this case, the correct representation is 1.528535047 × 105.

This, plus the requirement that the leftmost digit of the significand be nonzero, is called normalization. By doing this, one no longer needs to express the point explicitly; the exponent provides that information. In decimal floating-point notation with precision of 10, the revolution period of Io is simply an exponent e=5 and a significand s=1528535047. The implied decimal point is after the first digit of s (after the '1' and before the first '5').

When a (nonzero) floating-point number is normalized, its leftmost digit is nonzero. The value of the significand obeys 1 ≤ s < b. (Zero needs special treatment; this will be described below.)

Value

The mathematical value of a floating-point number, using the convention given above, is s.ssssssss...sss × be.

In binary radix, the significand is a string of bits (1's and 0's) of length p, of which the leftmost bit is 1. The number π, represented in binary, is

11.0010010000111111011010101000100010000101101000110000100011010011... but is
11.0010010000111111011011 when rounded to a precision of 24 bits.

In binary floating-point, this is e=1 ; s=110010010000111111011011.

The actual real number that a floating-point number represents is

This number with a 24-bit significand has a decimal value of

3.1415927410125732421875 (exactly!), whereas the true value of π is
3.1415926535897932384626433832795...

The result of rounding π to 24-bit binary floating-point differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error.

The problem is that floating-point numbers with a limited number of digits can represent only a subset of the reals, and that π is not in that subset. The act of rounding a real number to a floating-point representation consists of finding the representable real number that is closest to the given real number.

One doesn't need numbers as sophisticated as π to exhibit this phenomenon. The decimal number 0.1 is not representable in binary floating-point of any finite precision. The exact binary representation would have a "1100" sequence continuing endlessly:

e=-4; s=1100110011001100110011001100110011..., but when rounded to 24 bits it becomes
e=-4; s=110011001100110011001101 which is actually 0.100000001490116119384765625 in decimal.

Single precision computer format has a precision of 24 bits. Double precision computer format has a precision of 53 bits.

Conversion and rounding

If a number is represented in some other format (such as a string of digits), it may be converted into floating point format. If the number can be represented in the floating point format then the conversion is exact. If not, then typically the number is rounded to either the floating point number next larger or next smaller in value. There are several different rounding schemes for making this decision which have been used. Originally truncation was the typical approach. Since the introduction of IEEE 754, the default method rounds to the nearest value, or in the case of a tie, so as to make the significand even. That is, when the bits after the last available result bit are 1000000000..... exactly (that is, it is known that there are an infinite number of zeros after the initial 1), rounding upward or downward is equally accurate. In this case, the technique of Banker's rounding is usually used. The rounding is done in the direction that makes the resulting significand even (this could leave a result of zero). Two other rounding modes are sometimes used in support of interval arithmetic.

Mantissa

The word mantissa is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace. Log tables were actually tables of mantissas. Therefore, a mantissa is the logarithm of the significand.

History

The floating point system of numbers was used by the Kerala School of mathematics in 14th century India to investigate and rationalise about the convergence of series.

In 1938, Konrad Zuse of Berlin, completes the "Z1", the first mechanical binary programmable computer. It is based on Boolean Algebra and has most of the basic ingredients of modern machines, using the binary system and today's standard separation of storage and control. Zuse's 1936 patent application (Z23139/GMD Nr. 005/021) also suggests a 'von Neumann' architecture (re-invented in 1945) with program and data modifiable in storage. Originally the machine was called the "V1" but retroactively renamed after the war, to avoid confusion with the V1 missile. It works with floating point numbers having a 7-bit exponent, 16-bit mantissa, and a sign bit. The memory uses sliding metal parts to store 16 such numbers, and works well; but the arithmetic unit is less successful, occasionally suffering from certain mechanical engineering problems. The program is read from punched discarded 35 mm movie film. Data values can be entered from a numeric keyboard, and outputs are displayed on electric lamps. The machine is not a general purpose computer because it lacks looping capabilities. The Z3 was completed in 1941 and was program controlled.

Once electronic digital computers became a reality, the need to process data in this way was quickly recognized. The first commercial computer to be able to do this in hardware appears to be the Z4 in 1950, followed by the the IBM 704 in 1954. For some time after that, floating-point hardware was an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. All modern general-purpose computers have this ability. The PDP-11/44 was an extension of the 11/34 that included the cache memory and floating point units as a standard feature.

The UNIVAC 1100/2200 series, introduced in 1962, supported two floating point formats: single precision - 36 bits: 1 sign bit, 8 bit characteristic, 27 bit mantissa, and double precision - 72 bits: 1 sign bit, 11 bit characteristic, 60 bit mantissa. The IBM 7094, introduced the same year, also supported single and double precision, with slightly different format.

Prior to the IEEE-754 standard, computers had many different formats (and, indeed, different word sizes), and there were also multiple formats in software emulations.

The IEEE-754 standard was created in the early 1980's, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. Among its innovations are these:

  • A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.
  • A precisely specified behavior of the arithmetic operations. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.
  • The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.


Floating point arithmetic operations

The usual rule for performing floating point arithmetic is that the exact mathematical value is calculated[1], and the result is then rounded to the nearest representable value in the specified precision. This is in fact the behavior mandated for IEEE-compliant computer hardware, aside from a few details that will be described below.

For ease of presentation and understanding, decimal radix with 7 digit precision will be used in the examples. The fundamental principles are the same in any radix or precision.

Addition and subtraction

To add or subtract two numbers, they must have their decimal (or binary) points lined up. This is done by comparing the exponent fields and shifting the smaller number to the right. In the example below, the second number is shifted right by three digits because it has the smaller exponent. It is unnormalized at this point. The actual addition is then performed:

  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)

This is the "true" result, relative to the exact meaning of the incoming operands, but it has too many digits. It must be rounded to seven digits (to match the precision) and then normalized if necessary. The final result is e=5; s=1.235585, or 123558.5.

Note that the low 3 digits of the second operand (654) are essentially lost. Except for their possible influence on carrying and rounding, they have no effect. This is round-off error. Whenever numbers of different magnitudes are added or subtracted, the lowest digits (or bits) of the smaller one are lost. In the most serious case of this, the smaller operand can be totally "absorbed" (that is, it has no effect at all):

  e=5;  s=1.234567
+ e=-3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
----------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding/normalization)

Another problem of loss of significance occurs when two nearly equal numbers are subtracted.

  e=5;  s=1.235585
- e=5;  s=1.234567
----------------
  e=5;  s=0.001018 (true difference)
  e=2;  s=1.018000 (after rounding/normalization)

The result of this operation is exact with respect to its inputs. However, if the addend is considered as an approximation to the result of the addition in the earlier example, we can see that the bottom three digits, that we created during normalization do not add any accuracy to the result. That is, they are zero, but this might simply be because the original operands didn't have enough precision. This is cancellation and illustrates the danger in assuming that all of the digits of a computed result are meaningful. It occurs when nearly equal numbers are subtracted, or numbers of opposite sign but nearly equal magnitude are added. Dealing with the consequences of cancellation and other concerns relating to error analysis are topics in numerical analysis.

Multiplication and division

To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)

Division is done similarly, but that is more complicated.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex. (see Booth's multiplication algorithm and digital division)[2]

Computer representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand, from left to right. For the common formats they are apportioned as follows:

         sign   exponent  (exponent bias)  significand   total
single    1        8         (127)             23         32
double    1       11         (1023)            52         64

Since the exponent can be positive or negative, it has a fixed "bias" added to it, and the sum is stored in the actual exponent field as an unsigned number. A value of zero, or all 1's, in that field is reserved for special treatment. Therefore the legal exponent range for normalized numbers is [-126, 127] for single precision or [-1022, 1023] for double.

When a number is normalized, its leftmost significand bit is known to be 1. In the IEEE single and double precision formats that bit is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has 24 bits of significand precision, while double precision format has 53.

For example, it was shown above that π, rounded to 24 bits of precision, has:

  • sign = 0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)

The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as

  • 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB in hexadecimal

Dealing with exceptional cases

Floating-point computation in a computer can run into two kinds of problems:

  • An operation can be mathematically illegal, such as division by zero, or calculating the square root of -1 or the inverse sine of 2.
  • An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large) or underflow (exponent too small.)

Prior to the IEEE standard, such things usually caused the program to terminate, or caused some kind of trap that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable. Modern IEEE-compliant systems have a uniform way of handling these situations. An important part of the mechanism involves error values that result from a failing computation, and that can propagate silently through subsequent computation until they are detected at a point of the programmer's choosing.

The two error values are "infinity" (often denoted "INF"), and "NaN" ("not a number"), which covers all other errors.

"Infinity" does not necessarily mean that the result is actually infinite. It simply means "too large to represent".

Both of these are encoded with the exponent field set to all 1's. (Recall that exponent fields of all 0's or all 1's are reserved for special meanings.) The significand field is set to something that can distinguish them—typically zero for INF and nonzero for NaN. The sign bit is meaningful for INF, that is, floating-point hardware distinguishes between +∞ and −∞.

When a nonzero number is divided by zero (the divisor must be exactly zero), a "zerodivide" event occurs, and the result is set to infinity of the appropriate sign. In other cases in which the result's exponent is too large too represent, an "overflow" event occurs, also producing infinity of the appropriate sign.

Division of an extremely large number by an extremely small number can overflow and produce infinity. This is different from a zerodivide, though both produce a result of infinity, and the distinction is usually unimportant in practice.

Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as

  • (+INF) + (+7) = (+INF)
  • (+INF) × (-2) = (-INF)

When the result of an operation has an exponent too small to represent properly, an "underflow" event occurs. The hardware responds to this by changing to a format in which the significand is not normalized, and there is no "hidden" bit—that is, all significand bits are represented. The exponent field is set to the reserved value of zero. The significand is set to whatever it has to be in order to be consistent with the exponent. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal". Denorms are perfectly legal operands to arithmetic operations.

If no significant bits are able to appear in the significand field, the number is zero. Note that, in this case, the exponent field and significand field are all zeros—floating-point zero is represented by all zeros.

Other errors, such as division of zero by zero, or taking the square root of -1, cause an "operand error" event, and produce a NaN result. NaN's propagate aggressively through arithmetic operations—any NaN operand to any operation causes an operand error and produces a NaN result.

There are five special "events" that may occur, though some of them are quite benign:

  • An overflow occurs as described previously, producing an infinity.
  • An underflow occurs as described previously, producing a denorm or zero.
  • A zerodivide occurs as described previously, producing an infinity of the appropriate sign.
  • An "operand error" occurs as described previously, producing a NaN.
  • An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.

Computer hardware is typically able to raise exceptions when these events occur. How this is done is system-dependent. Usually these exceptions are all masked (disabled), relying only on the propagation of error values. Sometimes overflow, zerodivide, and operand error are enabled.

Implementation in actual computers

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM Mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.

The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit extended precision format, with 15 exponent bits and 64 significand bits, with no hidden bit. There is controversy about the failure of most programming languages to make these extended precision formats available to programmers (although C and related programming languages usually provide these formats via the long double type on such hardware). System vendors may also provide additional extended formats (e.g. 128 bits) emulated in software.

As of 2000, the IEEE 754 standard is currently under revision. See IEEE 754r.

Behavior of computer arithmetic

The standard behavior of computer hardware is to round the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In practice, there are other options. IEEE-754-compliant hardware allows one to set the rounding mode to any of the following:

  • round to nearest (the default; by far the most common mode)
  • round up (toward +∞; negative results round toward zero)
  • round down (toward −∞; negative results round away from zero)
  • round toward zero (sometimes called "chop" mode; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3)

In the default rounding mode the IEEE 754 standard mandates the round-to-nearest behavior described above for all fundamental algebraic operations, including square root. ("Library" functions such as cosine and log are not mandated.) This means that IEEE-compliant hardware's behavior is completely determined in all 32 or 64 bits.

The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.

The arithmetical distance between two consecutive representable floating point numbers is called an "ULP", for Unit in the Last Place. For example, the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. An ULP is about 10−7 in single precision, and 10−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of an ULP.


Accuracy problems

The facts that floating-point numbers cannot faithfully mimic the real numbers, and that floating-point operations cannot faithfully mimic true arithmetic operations, lead to many surprising situations.

For example, the non-representability of 0.1 and 0.01 means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e=-4; s=110011001100110011001101, which is

.100000001490116119384765625 exactly.

Squaring this number gives

.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with single-precision floating-point hardware (with rounding) gives

.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

.009999999776482582092285156250 exactly.

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

  // Enough digits to be sure we get the correct approximation.
  double pi = 3.1415926535897932384626433832795;
  double z = tan(pi/2.0);

Will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be -22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) .1225 × 10-15 in double precision, or -.8742 × 10-7 in single precision. [3]

In fact, while addition and multiplication are both commutative (a+b = b+a and a×b = b×a), they are not associative (a + b) + c = a + (b + c). Using 7-digit decimal arithmetic:

 1234.567 + 45.67844 = 1280.245
                       1280.245 + 0.0004 = 1280.245
 but 
 45.67844 + 0.0004 = 45.67884
                     45.67884 + 1234.567 = 1280.246

They are also not distributive (a + b)×c = a×c + b×c :

 1234.567 × 3.333333 = 4115.223
 1.234567 × 3.333333 = 4.115223
                       4115.223 + 4.115223 = 4119.338
 but 
 1234.567 + 1.234567 = 1235.802
                       1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:

  • Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. This is perhaps the most common and serious accuracy problem.
  • Conversions to integer are unforgiving: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than rounding.
  • Limited exponent range: results might overflow yielding infinity, or underflow yielding a denormal value or zero. If a denormal number results, precision will be lost.
  • Testing for safe division is problematical: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
  • Equality is problematical! Two computational sequences that are mathematically equal may well produce different floating-point values. Programmers often perform comparisons within some tolerance (often a decimal constant, itself not accurately represented), but that doesn't necessarily make the problem go away.

Minimizing the effect of accuracy problems

Because of the problems noted above, naive use of floating point arithmetic can lead to many problems. A good understanding of numerical analysis is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.

In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.

Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/. The "decimal" data type of the C# programming language, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and make the arithmetic always behave as expected when numbers are printed in decimal.

Double precision floating point arithmetic is more accurate than just about any physical measurement one could make. For example, it could indicate the distance from the Earth to the Moon with an accuracy of about 50 nanometers. So, if one were designing an integrated circuit chip with 100 nanometer features, that stretched from the Earth to the Moon, double precision arithmetic would be somewhat problematical, but only somewhat.

What makes floating point arithmetic troublesome is that people write mathematical algorithms that perform operations an enormous number of times, and so small errors grow. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.

People often carry expectations from their mathematics training into the field of floating point computation. For example, it is known that , and that , and that eigenvectors are degenerate if the eigenvalues are equal. These facts can't be counted on when the quantities involved are the result of floating point computation.

While a treatment of the techniques for writing high-quality floating-point software is far beyond the scope of this article, here are a few simple tricks:

The use of the equality test (if (x==y) ...) is usually not a good idea when it is based on expectations from pure mathematics. Such things are sometimes replaced with "fuzzy" tests (if (abs(x-y) < 1.0E-13) ...). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.

An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000. A typical addition would then be something like

3253.671
+  3.141276
--------
3256.812

The low 3 digits of the addends are effectively lost. The Kahan summation algorithm may be used to reduce the errors.

Another thing that can be done is to rearrange the computation in a way that is mathematically equivalent but less prone to error. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:

Here is a computation using IEEE "double" (53 bits of significand precision) arithmetic:

 i   6 × 2i × ti, first form    6 × 2i × ti, second form

 0   3.4641016151377543863      3.4641016151377543863
 1   3.2153903091734710173      3.2153903091734723496
 2   3.1596599420974940120      3.1596599420975006733
 3   3.1460862151314012979      3.1460862151314352708
 4   3.1427145996453136334      3.1427145996453689225
 5   3.1418730499801259536      3.1418730499798241950
 6   3.1416627470548084133      3.1416627470568494473
 7   3.1416101765997805905      3.1416101766046906629
 8   3.1415970343230776862      3.1415970343215275928
 9   3.1415937488171150615      3.1415937487713536668
10   3.1415929278733740748      3.1415929273850979885
11   3.1415927256228504127      3.1415927220386148377
12   3.1415926717412858693      3.1415926707019992125
13   3.1415926189011456060      3.1415926578678454728
14   3.1415926717412858693      3.1415926546593073709
15   3.1415919358822321783      3.1415926538571730119
16   3.1415926717412858693      3.1415926536566394222
17   3.1415810075796233302      3.1415926536065061913
18   3.1415926717412858693      3.1415926535939728836
19   3.1414061547378810956      3.1415926535908393901
20   3.1405434924008406305      3.1415926535900560168
21   3.1400068646912273617      3.1415926535898608396
22   3.1349453756585929919      3.1415926535898122118
23   3.1400068646912273617      3.1415926535897995552
24   3.2245152435345525443      3.1415926535897968907
25                              3.1415926535897962246
26                              3.1415926535897962246
27                              3.1415926535897962246
28                              3.1415926535897962246
              The true value is 3.1415926535897932385...

While the two forms of the recurrence formula are clearly equivalent, the first subtracts 1 from a number extremely close to 1, leading to huge cancellation errors. Note that, as the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.

A few nice properties

One can sometimes take advantage of a few nice properties:

  • Any integer less than or equal to 224 can be exactly represented in the single precision format, and any integer less than or equal to 253 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented. This property is sometimes used in purely integer applications, to get 53-bit integers on machines that have double precision floats but only 32-bit integers.
  • The bit representations are monotonic, as long as exceptional values are avoided and the signs are handled properly. Floating point numbers are equal if and only if their integer bit representations are equal. Comparisons for larger or smaller can be done with integer comparisons on the bit patterns, as long as the signs match. However, the actual floating point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.
  • To a rough approximation, the bit representation of a floating point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.

See also

Notes and references

  1. ^ Computer hardware doesn't necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.
  2. ^ The enormous complexity of modern division algorithms once led to a famous error. An early version of the Intel Pentium chip was shipped with a division instruction that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases. See "Pentium FDIV Bug"
  3. ^ But an attempted computation of cos(π) yields -1 exactly. Since the derivative is nearly zero near π, the effect of the inaccuracy in the argument is far smaller than the spacing of the floating point numbers around -1, and the rounded result is exact.