Talk:Double-precision floating-point format


Pros/Cons of Double Precision

This entry seems to be very technical explaining what Double Precision is, but not the benefits or applications of it.

I miss this too :/ Too bad I can't answer the question. -- Henriok 11:36, 8 January 2007 (UTC)

Um... pros: more digits of precision... cons: slower mathematical operations, takes more memory... ? Not too hard... Sdedeo (tips) 03:12, 10 September 2007 (UTC)

I don't think that was what they were asking. Given a fixed number of bits of precision "n", you could divide the interval over which you want to calculate into 2^n equal steps; this is integer arithmetic. Unfortunately it breaks down if the interval is not bounded, or if both extremely large and extremely small values are expected and relative error is more important than absolute error. This means integers are unsuitable for general-purpose arithmetic (they are "bounded") and for many real-world problems (where the relative error is most important). Now, you could store the logarithm of your values to solve this, but unfortunately adding and subtracting numbers would become problematic. So double precision is essentially a hybrid approach. Shinobu (talk) 23:45, 13 December 2007 (UTC)
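The "relative error" point above can be checked concretely. A small Python sketch (assuming CPython floats are IEEE 754 binary64, as on all common platforms, and Python 3.9+ for math.ulp):

```python
import math

# Floating point keeps the *relative* spacing roughly constant: the gap to the
# next representable double (one "ulp") scales with the magnitude of the value,
# so the ratio gap/value stays near 2**-52 across the whole exponent range.
for x in (1.0, 1e100, 1e-100):
    print(x, math.ulp(x), math.ulp(x) / x)
```

With fixed-point or integer arithmetic the absolute step is constant instead, so the relative error explodes for small values; the output above shows the ulp growing and shrinking with the value while the ratio stays near 2.2e-16.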
I think Gerbrant is referring to the properties of floating point, not double precision in itself. (One can have double-precision integers). mfc (talk) 09:06, 22 December 2007 (UTC)
It's useful because computers are unable to store true real numbers; they can only store values to some fixed, finite precision. So the computer can never truly store some values, for instance 0.1 or values like pi. In application it allows for very small numbers to be represented in the computer. For instance, if you have some program that tracks distance, and let's say you only care about three places after the decimal point, then you could use floating points to store the values of the distance. (Sorry to bring this topic back from the dead; I just wanted to give an example of when floating points would be used.) Deadcellplus (talk) 21:41, 3 October 2008 (UTC)
Again, the question was most likely about double precision as compared to single precision, rather than floating point as compared to integer arithmetic. A lot of your explanation is also incorrect - for every number there is some encoding scheme that can finitely encode that value. The only true statement you can make is that no encoding scheme can encode an uncountable number of different real numbers, since the number of strings is countable. Floating-point arithmetic is useful because it's an encoding scheme designed to handle a wide range of real values frequently encountered in applications. Dcoetzee 00:46, 4 October 2008 (UTC)
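The 0.1 example mentioned above is easy to demonstrate in Python (whose float is IEEE 754 binary64 on all common platforms): the stored value is only the nearest representable double, not 0.1 itself.

```python
from decimal import Decimal

# Decimal(float) shows the exact binary value that was actually stored:
# 0.1 has no finite binary expansion, so the double is only an approximation.
print(Decimal(0.1))

# Each term carries its own rounding error, so the sum misses 0.3 slightly.
print(0.1 + 0.2 == 0.3)   # False
```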
Maybe my information is a bit outdated, but in the book I learned x86 assembly language from, there was a note that x87 FPUs internally always use 80-bit precision; therefore there is no speed gain for smaller-precision values, only memory-storage benefits. Junkstar (talk) 09:44, 18 March 2013 (UTC)


I can see the exponent and the sign, but which digits in the significand are used for the numbers before the decimal point, and which are used for after? —Preceding unsigned comment added by (talk) 20:24, 20 May 2008 (UTC)

This is a good question - the short answer is, usually the position of the decimal point in the significand is fixed, so that it doesn't have to be encoded. In the IEEE encoding, the point is preceded by an implicit 1, also not encoded, and followed by the explicit bits of the significand. Underflow is a special case. Dcoetzee 00:54, 4 October 2008 (UTC)
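The layout described in the reply above (sign, biased exponent, implicit leading 1, explicit fraction bits) can be pulled apart with a few lines of Python; this is a sketch for normal numbers only, assuming IEEE 754 binary64 floats:

```python
import struct

def decode(x):
    # Reinterpret the double's 64 bits as an unsigned integer.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    # For normal numbers the significand is 1.fraction: the leading 1 and the
    # binary point are implicit and never stored, exactly as described above.
    significand = 1 + fraction / 2**52
    return sign, biased_exp, significand

# 6.5 = +1.625 * 2^2, so biased exponent is 2 + 1023 = 1025.
print(decode(6.5))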


If "All bit patterns are valid encoding.", then why is the highest exponent represented by bits [000 0111 1111]? I tested sets of bits for higher values in Microsoft Visual Studio (C++) and got "infinite" as a value. It also turns out that comparing a double value to that "infinite" value throws an exception... But I guess this is more related to the compiler itself. —Preceding unsigned comment added by (talk) 15:01, 19 March 2009 (UTC)

Infinity is a valid value. -- (talk) 18:47, 10 February 2011 (UTC)
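To illustrate: in IEEE 754 binary64, an all-ones exponent field (0x7FF) with a zero fraction encodes infinity, and a nonzero fraction encodes NaN. A quick Python check (assuming binary64 floats, as on typical hardware):

```python
import struct

# Exponent field all ones (0x7FF), fraction zero: this is the encoding of +inf.
bits = 0x7FF0000000000000
value = struct.unpack('<d', struct.pack('<Q', bits))[0]
print(value)             # inf
print(value > 1e308)     # True: infinity compares greater than any finite double
```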

Integer resolution limits

I was looking for a mention of the range within which all integers can be represented exactly. After some experimentation with double precision floats (52 bit stored fraction) it seems 2^53 − 1 and 2^53 are the largest positive integers where a difference of 1 can be represented. 2^53 + 1 would be rounded to 2^53.[1]

  1. ^ Calculations were made with converter at
I added it.--Patrick (talk) 11:35, 8 October 2010 (UTC)
The article seems definitely wrong about this. It's obvious from the formula that the negative and positive resolutions are the same, but the article says that there's an extra bit of positive resolution. Where does it come from? (talk) 18:00, 13 June 2014 (UTC)
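The symmetry the commenter asks about can be checked directly; the sign is a separate bit, so positive and negative resolution are identical. A quick Python sketch (assuming IEEE 754 binary64 floats):

```python
# Integers are exact up to 2**53; beyond that the spacing between consecutive
# doubles exceeds 1, so odd integers can no longer be represented.
n = 2**53
print(float(n) == float(n + 1))    # True: 2**53 + 1 rounds back to 2**53
print(float(n - 1) == float(n))    # False: below 2**53 every integer is distinct
print(float(-n) == float(-n - 1))  # True: the negative range behaves identically
```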

17 digits used in examples

I'm confused, why do they use 17 digits in the examples if the prescribed number of digits is: 15.955 i.e. 1.7976931348623157 x 10^308 Also, an explanation of how you could have 15.955 digits would be nice. I'm assuming that the higher digits can't represent all values from 0-9 hence we can't get to a full 16 digits? — Preceding unsigned comment added by Ctataryn (talkcontribs) 22:45, 31 May 2011 (UTC)

You have 52 binary digits, which happens to be 15.955 decimal digits. Compared to 16 decimal digits, the last digit can't always represent all values from 0-9 (but in some cases it can, thus it represents 9.55 different values on average). Also, while on average you only have ~16 digits of precision, sometimes two different values have the same 16 digits, so you need a 17th digit to distinguish those. This means that for some values, you have 17 digits effective precision (while some others have only 15 digits precision). -- (talk) 20:52, 7 February 2013 (UTC)
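The "need a 17th digit to distinguish" point above can be shown with two adjacent doubles; a Python sketch (assuming binary64 floats and Python 3.9+ for math.nextafter):

```python
import math

x = 0.1
y = math.nextafter(x, 1.0)   # the very next representable double above 0.1

# Printed to 16 significant digits, the two distinct doubles look identical...
print(f"{x:.16g}", f"{y:.16g}")

# ...but 17 digits tells them apart, which is why round-tripping uses up to 17.
print(f"{x:.17g}", f"{y:.17g}")
```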

Execution speed - grammar

"calculations with double precision are 3 to 8 times slower than float."

What exactly does "x times slower" mean? Is it n^-x? Or n/x? How much would the logical conclusion of "1 time slower" be? This unscientific colloquial English is confusing and should be clarified. I would like to, but I cannot make sense of the source given. Thanks. Andersenman (talk) 10:52, 16 July 2014 (UTC)

x times slower is generally understood to mean it takes x times as long. Dicklyon (talk) 03:18, 4 December 2014 (UTC)
THEN FOR GOD'S SAKE WHY NOT WRITE IT AS THAT?! Andersenman (talk) 10:31, 30 October 2016 (UTC)

this article is using an incorrect word

According to IEEE 754, verbatim:

significand. The component of a binary floating-point number that consists of an explicit or implicit leading bit to the left of its implied binary point and a fraction field to the right.

The word "mantissa" has no definition in the standard, nor does it appear anywhere in that text. — Preceding unsigned comment added by (talk) 03:01, 4 December 2014 (UTC)

subnormal representation allows values smaller than 1e−323 ???

The very ending of section "IEEE 754 double-precision binary floating-point format: binary64" says: "By compromising precision, subnormal representation allows values smaller than 1e−323". Shouldn't it be something like "..allows even smaller numbers up to 1e-323"? — Preceding unsigned comment added by Honzik.stransky (talkcontribs) 21:21, 18 April 2015 (UTC)

I think it meant that there were values smaller than 1e−323, but this was very confusing. I've done the correction. Vincent Lefèvre (talk) 10:43, 28 January 2016 (UTC)
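For reference, the subnormal range is easy to probe in Python (assuming binary64 floats): the smallest positive normal double is about 2.2e-308, and subnormals extend the range down to 2**-1074 ≈ 4.9e-324, trading away precision as the values shrink.

```python
import sys

print(sys.float_info.min)   # smallest positive *normal* double, ~2.2e-308

tiny = 5e-324               # parses to 2**-1074, the smallest positive subnormal
print(tiny > 0)             # True
print(tiny / 2)             # 0.0: nothing smaller than 2**-1074 is representable
```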

Lack of Citations in Implementations Section

There do not appear to be citations for any of the claims made in the "Implementations" section. Are these considered common knowledge?

--Irony0 (talk) 16:32, 27 January 2016 (UTC)

I've done some updates. Not everything is common knowledge. There should be citations for the specifications. But the fact that one has double precision in practice is simply due to the IEEE 754 hardware and that many implementations are written in C/C++ and double is the preferred type (matching double precision, even before this was standardized by Annex F of C99). There may also be issues with x86 extended precision ([1]). Vincent Lefèvre (talk) 10:30, 28 January 2016 (UTC)


There are a few statements relating to precision that I think are incorrect (please see [1]):

 "This gives 15–17 significant decimal digits precision. If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision representation and then converted back to a string with the same number of significant digits, then the final string should match the original. If an IEEE 754 double precision is converted to a decimal string with at least 17 significant digits and then converted back to double, then the final number must match the original.[1]"

You don't really get 17 digits of precision; you only get up to 16. 17 digits is useful for "round-tripping" doubles, but it is not precision.

 "with full 15–17 decimal digits precision"

15-16 digits, not 15-17.

 "the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log10(2) ≈ 15.955)"

The simple "take the logarithm" approach does not apply to floating-point, so the precision is not simply ≈ 15.955.

(I made similar comments about the single-precision format: )

--Behindthemath (talk) 12:31, 1 July 2016 (UTC)
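The two round-trip guarantees quoted above (15 digits decimal→binary→decimal, 17 digits binary→decimal→binary) can be verified mechanically; a Python sketch, assuming binary64 floats:

```python
# Decimal → double → decimal: any string with at most 15 significant digits
# survives the round trip through a double.
s = "0.123456789012345"
assert f"{float(s):.15g}" == s

# Double → decimal → double: printing with 17 significant digits always
# recovers the original double exactly on re-parsing.
x = 0.1
assert float(f"{x:.17g}") == x

print("both round trips succeed")
```

Note that neither guarantee says you "have" 17 digits of precision, which is the distinction Behindthemath is drawing: 17 digits is what round-tripping requires, not what the format delivers.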

Advantages & Disadvantages

IEEE 754 double-precision binary floating-point format: binary64 - contains the following as its first sentence:"Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost." [emphasis mine] Aside from the OBVIOUSLY poor usage of the word "over" when "compared to" would be much clearer, I challenge this claim. I have no idea whether it is its wider RANGE or greater PRECISION, or both which motivates its use, but both are obviously reasons to use it.
I have two other issues with this article. First is the discussions of its advantages and disadvantages sprinkled around the article are simply wrong. It may be, on a particular machine, faster than single precision. That is a fact. It may use the SAME storage as single, another fact. It may also use the same bandwidth. In a theoretical sense, double precision mathematics requires more resources but what it uses depends on BOTH the hardware and the software (not to mention firmware) of the machine it is running on. This article doesn't distinguish between what is theoretically true on some abstract machine, and what the various chip architectures, core - fpu - gpu utilizations, as well as OS and compiler/interpreter optimizations and defaults will ACTUALLY do. It needs a revision to qualify the overly broad sentiments.
Second, this article discusses various languages and whether or how they implement double-precision. This is problematic. First (repeating myself) it is chip dependent - and with the proliferation of 'non-Intel' architectures, this is more and more true. Second, it is version dependent. ANY statement of what C or Java does or doesn't do should include version numbers. (obviously). Thirdly, for some of the more sophisticated language compilers, it may be (fully or partially) customizable. (talk) 17:21, 16 July 2016 (UTC)

I agree, the reason seems wrong to me. Moreover, the sentence doesn't say when. Vincent Lefèvre (talk) 23:06, 16 July 2016 (UTC)

Article NEVER fully defines how the number is calculated, and is therefore rather useless if you don't have access to the full standard

The article NEVER fully defines how the number is calculated, and is therefore rather useless if you don't have access to the full standard. It calculates a lot of numbers and all, but the only precise definition of how the number is actually calculated is given here:

   The real value assumed by a given 64-bit double-precision datum with a given biased exponent  
   and a 52-bit fraction is

... and it completely omits on how to properly calculate the exponent. Then it goes on with useless 10^x calculations and other stuff without ever filling this information in. (at least I cannot see where, and if it comes much later it should probably be added to the quoted section instead!)

Can someone who has access to this information maybe add it? (with the actual bits of the exponent filled in, in the proper way they are used to calculate the exponent) (talk) 09:17, 29 March 2017 (UTC)

The biased exponent comes from the representation, as described. There is nothing to calculate. Everything is well-defined in the article. Vincent Lefèvre (talk) 06:53, 30 March 2017 (UTC)
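To make the point above fully explicit: the biased exponent is read directly from bits 62..52 of the representation, and the value follows from the quoted formula. A Python sketch (assuming IEEE 754 binary64; the special cases are included for completeness):

```python
import struct

def double_value(bits):
    # Recompute a double's value from its 64-bit pattern using the formula
    # quoted above: (-1)^sign * 1.fraction * 2^(biased_exponent - 1023).
    sign = bits >> 63
    e = (bits >> 52) & 0x7FF           # biased exponent, taken straight from bits 62..52
    f = bits & ((1 << 52) - 1)         # 52-bit fraction field
    if e == 0x7FF:                     # all ones: infinity or NaN
        return float('nan') if f else (-1)**sign * float('inf')
    if e == 0:                         # all zeros: subnormal, implicit leading 0
        return (-1)**sign * (f / 2**52) * 2**-1022
    return (-1)**sign * (1 + f / 2**52) * 2**(e - 1023)

# Cross-check against the hardware interpretation of an arbitrary value:
x = -123.456
bits = struct.unpack('<Q', struct.pack('<d', x))[0]
print(double_value(bits) == x)   # True
```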