# Talk:Single-precision floating-point format

WikiProject Computing / Software / Hardware (Rated C-class, Low-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Low-importance on the project's importance scale.

## Exponent/mantissa ranges are inconsistent with other sources

The C++ standard library, as an example, reports the min and max exponents as (-125, +128) and assumes that the mantissa is in the range [0.5, 1). This article assumes that the exponent is in the range (-126, +127) and that the mantissa is in the range [1, 2). Although either convention gives the same values, the one used in the article does not match that usage. The article should make this clear. —Preceding unsigned comment added by 71.104.122.200 (talk) 03:12, 13 February 2008 (UTC)

The values in the article are those given in the IEEE 754 standard (1985), which the article calls out in the first paragraph. That is the standard for floating-point arithmetic. What clarification(s) would you like to see? mfc (talk) 20:36, 22 February 2008 (UTC)
Actually, it matters, because some cstdlib functions will return the raw exponent/mantissa. - Richard Cavell (talk) 05:42, 17 December 2010 (UTC)
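
For readers comparing the two conventions in this thread, here is a minimal Python sketch. It assumes that `math.frexp` behaves like the C `frexp` family (mantissa in [0.5, 1)), and the names `FLT_MIN`, `FLT_MAX` and `ieee_form` are made up for illustration. The binary32 limits are built as ordinary Python floats, which hold them exactly.

```python
import math

# Binary32 limits, held exactly in Python's double-precision floats.
FLT_MIN = 2.0 ** -126                       # smallest normal binary32 value
FLT_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127   # largest finite binary32 value

def ieee_form(x):
    """Return (significand, exponent) with 1 <= significand < 2."""
    m, e = math.frexp(x)    # frexp convention: x == m * 2**e, 0.5 <= m < 1
    return m * 2.0, e - 1   # same value, shifted to the IEEE 754 convention

frexp_min = math.frexp(FLT_MIN)   # exponent -125 in the frexp convention
frexp_max = math.frexp(FLT_MAX)   # exponent +128 in the frexp convention
ieee_min = ieee_form(FLT_MIN)     # exponent -126 in the IEEE convention
ieee_max = ieee_form(FLT_MAX)     # exponent +127 in the IEEE convention
```

Both pairs describe the same numbers; only the position of the binary point in the significand differs, which shifts every exponent by one.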

## When the implicit bit doesn't exist

Can someone check this for me:

"The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros."

Is the "exponent is stored with all zeros" part correct? - Richard Cavell (talk) 05:43, 17 December 2010 (UTC)

Yes, that's correct -- if there were an implicit one bit in that case then the number could not have the value zero. mfc (talk) 14:19, 21 December 2010 (UTC)

So it's definitely the case that there is no implicit bit when, and only when, the exponent, and only the exponent, is exactly 0? - Richard Cavell (talk) 07:53, 24 December 2010 (UTC)
The original statement is correct. For IEEE754, there is no implicit bit for zero or any of the denormalized representations, all of which have zero exponents. For pre-754 formats, there might not be an implicit bit at all. DEC floating formats used an explicit most significant bit, as did most mainframes. There are also exceptional cases of no implicit bit for the various types of NaNs, and the two infinities: in those the exponent has all bits set. Here is a good summary. —EncMstr (talk) 18:11, 24 December 2010 (UTC)
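
To make the cases above concrete, here is a small Python sketch that decodes a 32-bit pattern by hand. The function names are hypothetical, and the NaN branch returns a plain `nan` without preserving sign or payload. The implicit leading 1 is applied only when the stored exponent is neither all zeros nor all ones.

```python
import math
import struct

def decode_binary32(bits):
    """Decode a 32-bit pattern by hand, following the binary32 layout."""
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 23) & 0xFF   # 8 stored exponent bits
    fraction = bits & 0x7FFFFF       # 23 stored fraction bits
    if exponent == 0xFF:
        # All exponent bits set: infinity or NaN, no implicit bit.
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:
        # Zero or subnormal: leading bit is 0, exponent fixed at -126.
        return sign * (fraction / 2 ** 23) * 2.0 ** -126
    # Normal number: implicit leading 1, exponent bias 127.
    return sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

def reference_decode(bits):
    """Let the struct module reinterpret the same pattern, for comparison."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, `decode_binary32(0x41460000)` gives 12.375, the value used in the article's conversion example, and `decode_binary32(0x00000001)` gives the smallest subnormal, 2^-149.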

## Adding conversion to the main article?

The section on converting decimal to binary32 is really useful, but hard to find tucked in here on the single-precision page. It would be nice if someone could expand it to the general case and put it on the main floating point page. — Preceding unsigned comment added by Canageek (talk • contribs) 20:38, 15 December 2011 (UTC)

## ${\displaystyle 2^{0}=0}$ or ${\displaystyle 2^{0}=1}$ for bits?

from article:
"consider 0.375, the fractional part of 12.375. To convert it into a binary fraction, multiply the fraction by 2, take the integer part and re-multiply new fraction by 2 until a fraction of zero is found or until the precision limit is reached which is 23 fraction digits for IEEE 754 binary32 format.

0.375 x 2 = 0.750 = 0 + 0.750 => ${\displaystyle b_{-1}=0}$, the integer part represents the binary fraction digit. Re-multiply 0.750 by 2 to proceed

0.750 x 2 = 1.500 = 1 + 0.500 => ${\displaystyle b_{-2}=1}$

0.500 x 2 = 1.000 = 1 + 0.000 => ${\displaystyle b_{-3}=1}$, fraction = 0.000, terminate

We see that ${\displaystyle (0.375)_{10}}$ can be exactly represented in binary as ${\displaystyle (0.011)_{2}}$. Not all decimal fractions can be represented in a finite digit binary fraction. For example decimal 0.1 cannot be represented in binary exactly. So it is only approximated.

Therefore ${\displaystyle (12.375)_{10}=(12)_{10}+(0.375)_{10}=(1100)_{2}+(0.011)_{2}=(1100.011)_{2}}$ "
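
The multiply-by-2 procedure in the quoted passage can be sketched in Python as follows. The function name and the use of `fractions.Fraction` (to keep the arithmetic exact) are choices made for this illustration, not taken from the article.

```python
from fractions import Fraction

def fraction_to_binary(frac, max_bits=23):
    """Convert a fraction in [0, 1) to binary digits by repeated doubling.

    Returns (digits, exact): exact is True if the fraction reached zero
    before the limit of max_bits digits was hit.
    """
    digits = []
    while frac != 0 and len(digits) < max_bits:
        frac *= 2
        digit = int(frac)        # the integer part is the next binary digit
        digits.append(str(digit))
        frac -= digit            # keep only the fractional part
    return "".join(digits), frac == 0

exact_case = fraction_to_binary(Fraction(3, 8))     # 0.375 terminates
inexact_case = fraction_to_binary(Fraction(1, 10))  # 0.1 never terminates
```

With 0.375 the loop stops after three digits (`011`), matching the quoted steps. With 0.1 it runs into the 23-digit limit with a nonzero remainder, which is why 0.1 is only approximated.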

So can 12 be written as ${\displaystyle 2^{3}+2^{2}+2^{0}+2^{0}=12}$ if ${\displaystyle 2^{0}=0}$, or as ${\displaystyle (2^{3})^{1}+(2^{2})^{1}+(2^{1})^{0}+(2^{0})^{0}=12}$ if ${\displaystyle 2^{0}=1}$? — Preceding unsigned comment added by Versatranitsonlywaytofly (talk • contribs) 17:49, 2 March 2012 (UTC)
I think that for the last digit ${\displaystyle (2^{0})^{0}=0}$ and for any other digit ${\displaystyle (2^{x_{n}})^{0}=1.}$ Is that correct? For example, for the last digit ${\displaystyle (2^{0})^{1}=(0)^{1}=0}$ and ${\displaystyle (2^{1})^{0}=(2)^{0}=1}$, and that is how we get 0. — Preceding unsigned comment added by Versatranitsonlywaytofly (talk • contribs) 18:09, 2 March 2012 (UTC)
Or, from the article, ${\displaystyle 130=10000010}$, so ${\displaystyle 2^{8}+2^{0}+2^{0}+2^{0}+2^{0}+2^{0}+2^{1}+2^{0}=128+0+0+0+0+0+2+0}$. But then how do we get 1? We would need another bit (say, at the end) for 1, so that a 0 (binary) adds 0 (decimal) and a 1 (binary) adds 1 (decimal). — Preceding unsigned comment added by Versatranitsonlywaytofly (talk • contribs) 18:26, 2 March 2012 (UTC)
It might be irrelevant, but there is almost no chance that a CPU uses this stupid conversion. If a decimal digit is 4 bits, then there are 82 multiply gates (the multiplication table of the decimal digits and 0) for multiplying one decimal digit by another (4 bits multiplied by 4 bits with 82 possible results, so at least 4*82=328 transistors are needed; this is the minimum, and the real number might be much bigger). The same 82 gates are also needed for division, addition and subtraction, so 4*328=1312 gates for the 4 basic operations (+,-,*,/). The Intel 4004 CPU has 2300 transistors, which looks like not quite enough. If we are talking about single precision (32 bits = 32/4 = 8 decimal digits), then at least 1312*8=10496 transistors are needed. The Intel 8086 has 3500 transistors and seems to have an 80-bit coprocessor, the 8087, which is a separate chip. If 32 bits is 8 decimal digits, then 80 bits is ten decimal digits, so at least 1312*10=13120 transistors are needed. This means that either the 8087 has more than 13120 transistors, or 80 bits does not mean that it has computing units for 10 decimal places, but only that it can calculate to such precision, and the 8087 is a chip of instructions for how to calculate to such (80-bit) precision that does not itself calculate anything. Then 1312<3500, but that is still too small a number (even for the 8087 coprocessor, though maybe the 8087 has more than 3500 transistors, perhaps over 10000); maybe the 4004 also has a coprocessor (which may consist of many chips), or the 4004 is of limited functionality. — Preceding unsigned comment added by Versatranitsonlywaytofly (talk • contribs) 19:11, 2 March 2012 (UTC) Correction: the Intel 8086 has 29000 transistors and the Intel 8008 has 3500 transistors. http://www.intel.com/technology/timeline.pdf

Sorry I didn’t read your entire post but 2^0 = 1 by definition, which is not equal to zero. See Exponentiation#Arbitrary integer exponents (Zeroth power redirects). Is there anywhere that suggests it is zero? Vadmium (talk, contribs) 02:48, 3 March 2012 (UTC).
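
A short Python sketch of the positional rule may help here; the function name is made up for illustration. Each bit is multiplied by 2 raised to its position, counting from 0 at the right, so a 0 bit contributes nothing and the rightmost bit has weight 2^0 = 1.

```python
def binary_to_int(bits):
    """Sum bit * 2**position, with position 0 at the rightmost bit."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position   # a 0 bit adds 0 * 2**position = 0
    return total

twelve = binary_to_int("1100")               # 1*8 + 1*4 + 0*2 + 0*1
exponent_field = binary_to_int("10000010")   # 1*128 + 1*2
all_ones = binary_to_int("1111111111")       # ten 1 bits
```

The zeros in 1100 come from multiplying by a 0 bit, not from any power of 2 being zero. The leftmost bit of the 8-bit string 10000010 has weight 2^7 = 128.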

But this is where binary digit operations come in handy. To address memory (RAM or NAND flash), a code must be generated to get each next bit of a file. This needs billions of combinations, and the processor cannot know the access code of each next bit, so bit addition must be used: 1 is added to the bit string each time. For example
```
 1001010
+0000001
=1001011, then
 1001011
+0000001
=1001100
```
and so on. By adding 0000001 to the string 0000000 each time, you get all 2^7=128 results (only at the end of a section, in case one file was removed and another inserted, must the bit string record in which section the file information continues).
I don't know what is taught at university or what the IEEE standards say about converting decimal digits into binary, but I don't see any way to simply convert binary into decimal for output on screen. You can only emulate such a process, and maybe not even that, because when trying to get the decimal number from, for example, the binary number 1001010, you know it is ${\displaystyle 2^{6}+0+0+2^{3}+0+2^{1}+0}$ (say, you can't get 1 for integers, but only 0.999999 or 1.000002 for real numbers). But if you raise 2 to the power 6 in binary and add 2^3 and 2^1 in binary, then you again get the same binary number. And if you want a table for outputting 2^1, 2^2, 2^3 ... 2^n, then you need billions of bytes of memory to output one number of about 10 decimal digits. The theory of binary computation used in the article doesn't apply in practice. Or am I wrong? — Preceding unsigned comment added by Versatranitsonlywaytofly (talk • contribs) 09:26, 3 March 2012 (UTC)
Actually, it is possible to calculate in binary form. But in the end some tables of numbers and decimal digit adders are still needed. So everything is calculated in binary, and the hard part is converting the result to decimal. Free Pascal calculates in 64-bit precision (15-16 decimal digits). So to get such precision without emulation, 64 bits must be multiplied, divided, added or subtracted with 64 bits. After billions of calculations, these 64 bits must be converted to decimal digits. The 64 bits must be divided into 16 parts of 4 bits each. Say that as the final result we get the 64-bit string:
1010000000000000000000000000000000000000000000000000000001001011.
Then we divide this string into 16 parts:
1010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0100 1011.
The last 4 bits give the number ${\displaystyle 2^{3}+0+2^{1}+(bit=1)=8+0+2+1=11.}$ The last 4 bits can have 16 combinations (numbers from 0 to 15), but only 10 of them are meaningful as a decimal digit (the 4 bits can mean 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9) when 15 is converted into a byte (4 bits for 0 or 1 and 4 bits for the second digit [of 15], from 0 to 9). Two decimal places are needed here, because 16 has two decimal places. Each of the 16 combinations gives a decimal number from 0 to 15. For the first and for the second decimal digit [of 15], 0 has the same 4-bit code, and the same must hold for 1 (the first decimal digit of 15 can only be 0 or 1; we can then add 1 to the next group of 4 bits, second from the end of the 64-bit string, which when decoded becomes a number of 8 or more bits, divisible by 4).
Now take a look at the 4 bits from the 66th bit to the 60th bit, which are 0100. These 4 bits are the number ${\displaystyle 0+2^{6}+0+0=64.}$ But the maximum number these bits can represent is, I am not sure, either 2^7=128 or 128+64+32+16=240. So this time the table has 3 decimal digits of 4 bits each, so 0100, or any other combination of the group of 4 bits second from the end, must be decoded to a number from 0 to 128 (or 240). You have in memory the code of each group of 4 bits, and they mean a number from 0 to 128 (or to 240).
Now take a look at the first 4 bits: 1010. This number is ${\displaystyle 2^{63}+0+2^{61}+0=9223372036854775808+0+2305843009213693952+0=11529215046068469760.}$ This number is huge, but there are only 16 such huge numbers (as in all cases there are only 16 numbers, and they are smaller the closer the 4 bits are to the end of the 64-bit number). So for these first 4 bits (of the 64-bit number) there is a code for each combination (0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111). In our case the code 1010 of the first 4 bits means the number 11529215046068469760. If, say, the code of the first 4 bits is 1100, then this code means ${\displaystyle 2^{63}+2^{62}+0+0=13835058055282163712}$.
So now we can decode this bit string
1010000000000000000000000000000000000000000000000000000001001011
into a decimal number:
${\displaystyle 11529215046068469760+64+11=11529215046068469835.}$ As you see, for the 3 add instructions we must use an addition table of decimal numbers (we need to know how, for example, 0101+1011 outputs a decimal number from 0 to 18, so we need the ten variants of output for 0101 with all the other codes for a decimal digit, then the next coded decimal digit with all ten other decimal digit results; in other words, 4 bits + 4 bits with an 8-bit output, 4 bits for the first digit (0 or 1) and 4 bits for the second digit (from 0 to 9)), which must be integrated into the CPU. But we do not need decimal multiplication, division and subtraction tables (and for this reason far fewer transistors are needed, but more memory for storing the number for each code).
BTW, an interesting fact: this calculation (${\displaystyle 2^{3}+0+2^{1}+(bit=1)=8+0+2+1=11}$) is correct:
${\displaystyle 2^{9}+2^{8}+2^{7}+2^{6}+2^{5}+2^{4}+2^{3}+2^{2}+2+1=512+256+128+64+32+16+8+4+2+1=1023.}$ So 0000000000=0 and 1111111111=1023. And we know that there are ${\displaystyle 2^{10}=1024}$ variants (so from 0 to 1023). — Preceding unsigned comment added by Versatranitsonlywaytofly (talk • contribs) 17:06, 4 March 2012 (UTC)
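
For comparison with the table-based scheme described above, one common textbook way to turn a binary integer into decimal digits is repeated division by 10, which needs no lookup table of powers of two. The Python sketch below uses a hypothetical function name and is not a claim about how any particular CPU or runtime library does it.

```python
def to_decimal_string(n):
    """Convert a non-negative integer to decimal text by dividing by 10."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 10)              # peel off the lowest decimal digit
        digits.append(chr(ord("0") + remainder))
    return "".join(reversed(digits))              # digits were produced lowest first

# The 64-bit string from the discussion: 1010, thirteen 0000 groups, 0100, 1011.
bit_string = "1010" + "0000" * 13 + "0100" + "1011"
value = int(bit_string, 2)
decimal_text = to_decimal_string(value)
```

Here `decimal_text` is `11529215046068469835`, the same total obtained above by adding the three nibble contributions.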

TL;DR 2^0=1 — Preceding unsigned comment added by 82.139.82.82 (talk) 10:45, 16 September 2015 (UTC)

## PASCAL

PASCAL float types are "REAL", or additionally "LONGREAL" if there must be types that differ in precision, just like in C they are (long) float. Single, double etc. are names from the IEEE specs and are not part of PASCAL the language, but machine-specific identifiers.

I didn't correct it in the main article to avoid confusion (especially since in some popular compilers REAL was typically mapped to a soft FPU, and single/double were mapped to the hardware FPU). 88.159.71.34 (talk) 12:04, 30 June 2012 (UTC)

## Number of "bytes"

"Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point."

Does this really mean 4 bytes? Won't 4 octets be less ambiguous? — Preceding unsigned comment added by 112.119.94.80 (talk) 21:20, 7 August 2012 (UTC)

It is a reasonable question. Byte is certainly much better known whereas octet is, as far as I know, mostly used in RFCs. Since IEEE 754 was standardized and adopted well after the ascension of the eight bit byte, it seems to me we serve our readers better by calling it a byte. I would not object to a note which specifies that a byte here means eight bit byte or octet. Maybe it is even clearer to omit bytes altogether, specifying only bits? —EncMstr (talk) 23:26, 7 August 2012 (UTC)
What would such a footnote add? You can already see it's a 32-bit value because it actually literally says so. I think the lead is crystal clear on this matter. — Preceding unsigned comment added by 82.139.82.82 (talk) 10:42, 16 September 2015 (UTC)

## # of Significant Digits

For a single-precision floating-point number in decimal representation, it is said that only 7 digits are significant. But this is not mentioned in the article, let alone why it is so. Jackzhp (talk) 17:50, 17 March 2013 (UTC)

## Precision

There are a few statements relating to precision that I think are incorrect (please see [1]):

``` "This gives from 6 to 9 significant decimal digits precision (if a decimal string with at most 6 significant decimal digits is converted to IEEE 754 single precision and then converted back to the same number of significant decimal digits, then the final string should match the original; and if an IEEE 754 single precision is converted to a decimal string with at least 9 significant decimal digits and then converted back to single, then the final number must match the original[4])."
```

You don't really get 9 digits of precision; you only get up to 8. 9 digits is useful for "round-tripping" floats, but it is not precision.

``` "some integers up to nine significant decimal digits can be converted to an IEEE 754 floating point value without loss of precision, but no more than nine significant decimal digits can be stored. As an example, the 32-bit integer 2,147,483,647 converts to 2,147,483,650 in IEEE 754 form."
```

Again, same comment on 9 digits. Also, why is this discussion limited to integers? And the example converts to 2,147,483,648, not 2,147,483,650.

``` "total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits)"
```

The simple "take the logarithm" approach does not apply to floating-point, so the precision is not simply ≈ 7.225.

--Behindthemath (talk) 22:01, 29 June 2016 (UTC)

7 digits is closer to the truth. Dicklyon (talk) 23:21, 29 June 2016 (UTC)
All 7-digit integers, and 8-digit integers up to 2^24 = 16777216, are exactly represented in IEEE single precision, but 2^24 + 1 = 16777217 falls between two representable values. I'd say that's close enough to 7.2 digits. Dicklyon (talk) 05:04, 30 June 2016 (UTC)
For integers, yes, log10(2^24) makes sense. But not for floating-point values. Depending on the exponent, precision can vary from 6 to 8 digits. Behindthemath (talk) 13:35, 30 June 2016 (UTC)
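
The integer cases mentioned in this section can be checked with a short Python sketch. It assumes that packing with the `struct` format `"f"` rounds to the nearest binary32 value (ties to even), and `to_binary32` is a name made up for illustration.

```python
import math
import struct

def to_binary32(x):
    """Round a number to the nearest binary32 value and return it as a float."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

digits_24_bits = math.log10(2 ** 24)      # about 7.22 decimal digits

at_limit = to_binary32(16777216)          # 2**24 is stored exactly
past_limit = to_binary32(16777217)        # 2**24 + 1 is not representable
int32_max = to_binary32(2147483647)       # the example quoted from the article
```

Under these assumptions `past_limit` comes back as 16777216.0 and `int32_max` as 2147483648.0, which agrees with the correction raised above (2,147,483,648 rather than 2,147,483,650).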
Note that 10001 is less precise than 10101 for its size even though they both have 5 bits. With 5 significant bits, 11111 is possible but not 100001. So a bit of precision is lost after the transition. We average 23 and 24 to get 23.5, and multiply it with log(2) to get 7.07. 7.22 is just what you get with all significant bits on, maximizing precision for size. Also, 2147483647 (I don't like these commas) converts to 2147484000 according to me, but it could be 2147480000, 2147483600, 2147483650 or 2147483648 depending on program(ming language) used. 24691358r (talk) 14:10, 27 June 2017 (UTC)
Please explain why/how 17 is less precise than 21 even though they both have 5 bits. Then, why the programming language matters: For a particular computer, the same hardware is processing the numbers, not a language. —EncMstr (talk) 16:10, 27 June 2017 (UTC)
In base 18 for example, 17 is single digit (h) but 21 is two-digit (13), and both have integer precision. Because 13 in base 18 starts at a higher digit, it has more precision. I can show example meaningful in decimal: 1001 (9, 1 significant) is less precise than 1101 (13, 2 significant). 4*log(2) is 1.20412, but it assumes precision of 4 bits at a size of 2^4 (10000). However, 1001 has a lower size and therefore lower significant precision. So we average 3 and 4 to get 3.5log(2) which is 1.0536. Same thing for single float which gives average precision of 7.07.
One language could show 6 significant digits, while another could show 8. That's the difference. 24691358r (talk) 19:20, 27 June 2017 (UTC)

I agree that the listed precision is wrong. The formula log(2^23), or about 6.92, gives the amount of decimal precision. That's log base ten of two to the power of mantissa digits. For example, if you go above 8192, you lose the ability to differentiate differences of 0.001 and 0.002, etc., because after 8192, the values go in increments of 1/512 rather than 1/1024. That's just under 7 digits of precision. We should edit the article to say 6 to 7 decimal digits, rather than 6 to 9, because single-precision floats simply cannot store 8 or 9 decimal digits of accuracy (if you're curious, you would need 30 bits of mantissa for 9 decimal digits, log(2^30) = 9.03...). Aaronfranke (talk) 09:41, 27 January 2018 (UTC)
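
The step sizes quoted here can be measured directly with a small Python sketch, which finds the gap between a positive binary32 value and the next one up by incrementing its bit pattern. The function names are hypothetical, and the trick is only valid for positive finite inputs.

```python
import struct

def next_binary32_up(x):
    """Return the next binary32 value above a positive finite x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

def spacing(x):
    """Gap between x and the next representable binary32 value."""
    return next_binary32_up(x) - x

step_below_8192 = spacing(4096.0)   # applies throughout [4096, 8192)
step_from_8192 = spacing(8192.0)    # applies throughout [8192, 16384)
```

By this calculation, which counts the implicit bit (24 significant bits), the step is 1/2048 below 8192 and 1/1024 from 8192 upward. The general point about losing 0.001 resolution near there still holds, but the figures are a factor of two finer than those quoted.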

However, it's also worth noting that if you include the implicit bit as precision, log(2^24) is about 7.22, which is still far closer to 7, but it does mean that you need 8 decimal digits to guarantee accurate conversions if you go binary -> decimal -> binary. You cannot, however, have all 8-digit decimal numbers survive decimal -> binary -> decimal. Aaronfranke (talk) 09:52, 27 January 2018 (UTC)
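
As a check on the round-trip digit counts discussed in this section, here is a Python sketch with hypothetical helper names. It takes one binary32 value just above 1000, prints it with a given number of significant digits, and tests whether reading the text back recovers the same value.

```python
import struct

def to_binary32(x):
    """Round a Python float to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_trips(x, digits):
    """True if x survives binary32 -> decimal text -> binary32."""
    text = "%.*g" % (digits, x)            # e.g. digits=8 gives 8 significant digits
    return to_binary32(float(text)) == x

# 1000 + 2**-14 fits in 24 significant bits, so it is an exact binary32 value.
x = 1000.0 + 2.0 ** -14

survives_with_8 = round_trips(x, 8)
survives_with_9 = round_trips(x, 9)
```

For this value 8 digits are not enough: it prints as 1000.0001, which reads back as the neighbouring value 1000 + 2^-13. Nine digits recover it, consistent with the "at least 9" figure in the passage quoted at the top of this section.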