Talk:Single-precision floating-point format

WikiProject Computing / Software / Hardware (Rated C-class, Low-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Low-importance on the project's importance scale.
This article is supported by WikiProject Software (marked as Low-importance).
This article is supported by Computer hardware task force (marked as Low-importance).

Exponent/mantissa ranges are inconsistent with other sources

The C++ standard library, as an example, reports the min and max exponents as -125 and +128 and assumes that the mantissa is in the range [0.5, 1). This article assumes that the exponent is between -126 and +127 and that the mantissa is in the range [1, 2). Although either convention yields the same value, the one given in the article does not match standard library usage. The article should make this clear. —Preceding unsigned comment added by (talk) 03:12, 13 February 2008 (UTC)

The values in the article are those given in the IEEE 754 standard (1985), which the article calls out in the first paragraph. That is the standard for floating-point arithmetic. What clarification(s) would you like to see? mfc (talk) 20:36, 22 February 2008 (UTC)
Actually, it matters, because some cstdlib functions will return the raw exponent/mantissa. - Richard Cavell (talk) 05:42, 17 December 2010 (UTC)
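
The two conventions in this thread can be compared directly. The following is only an illustrative sketch, not something taken from the article: it relies on Python's `math.frexp`, which follows the C library convention of a mantissa in [0.5, 1), and on a hypothetical helper `ieee_form` that rescales the result to the IEEE 754 convention of a significand in [1, 2). The constants `FLT_MIN` and `FLT_MAX` are built by hand here and are assumed to be the smallest normal and largest finite binary32 values.

```python
import math

def ieee_form(x):
    """Return (significand, exponent) with 1 <= significand < 2, for finite x > 0."""
    m, e = math.frexp(x)      # C-library convention: x == m * 2**e with 0.5 <= m < 1
    return m * 2.0, e - 1     # same value, rescaled to the IEEE 754 convention

# Smallest normal and largest finite binary32 values (both exact as Python floats)
FLT_MIN = 2.0 ** -126
FLT_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127

for name, x in [("12.375", 12.375), ("FLT_MIN", FLT_MIN), ("FLT_MAX", FLT_MAX)]:
    print(name, "frexp:", math.frexp(x), "IEEE:", ieee_form(x))
```

Under these assumptions the `frexp` exponents run from -125 to +128 while the rescaled exponents run from -126 to +127, so both pairs of numbers describe the same set of values.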

When the implicit bit doesn't exist

Can someone check this for me:

"The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros."

Is the "exponent is stored with all zeros" part correct? - Richard Cavell (talk) 05:43, 17 December 2010 (UTC)

Yes, that's correct -- if there were an implicit one bit in that case then the number could not have the value zero. mfc (talk) 14:19, 21 December 2010 (UTC)

So it's definitely the case that there is no implicit bit when, and only when, the exponent, and only the exponent, is exactly 0? - Richard Cavell (talk) 07:53, 24 December 2010 (UTC)
The original statement is correct. For IEEE754, there is no implicit bit for zero or any of the denormalized representations, all of which have zero exponents. For pre-754 formats, there might not be an implicit bit at all. DEC floating formats used an explicit most significant bit, as did most mainframes. There are also exceptional cases of no implicit bit for the various types of NaNs, and the two infinities: in those the exponent has all bits set. Here is a good summary. —EncMstr (talk) 18:11, 24 December 2010 (UTC)
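
The cases discussed in this thread can be made concrete by splitting a binary32 bit pattern into its fields. This is a minimal sketch, assuming the usual 1/8/23 field layout; the helper names `decode_binary32` and `value_from_fields` are invented for illustration. It shows that the leading 1 is only implied when the stored exponent is neither all zeros nor all ones.

```python
import struct

def decode_binary32(x):
    """Round x to binary32 and return (sign, stored exponent, fraction, class label)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 stored exponent bits
    fraction = bits & 0x7FFFFF          # 23 stored fraction bits
    if exponent == 0:
        kind = "zero" if fraction == 0 else "subnormal"   # no implicit leading 1
    elif exponent == 0xFF:
        kind = "infinity" if fraction == 0 else "nan"     # exponent bits all set
    else:
        kind = "normal"                                   # implicit leading 1
    return sign, exponent, fraction, kind

def value_from_fields(sign, exponent, fraction):
    """Rebuild the value of a zero, subnormal, or normal number from its fields."""
    if exponent == 0:
        significand = fraction / 2**23          # 0.fraction, no implicit bit
        e = -126
    else:
        significand = 1 + fraction / 2**23      # 1.fraction, implicit bit added
        e = exponent - 127
    return (-1) ** sign * significand * 2.0 ** e

for x in (12.375, 0.0, 2.0 ** -149, float("inf")):
    print(x, decode_binary32(x))
```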

Adding conversion to the main article?

The section on converting decimal to binary32 is really useful, but hard to find tucked in here on the single-precision page. It would be nice if someone could expand it to the general case and put it on the main floating point page. — Preceding unsigned comment added by Canageek (talk · contribs) 20:38, 15 December 2011 (UTC)

or for bits?

from article:
"consider 0.375, the fractional part of 12.375. To convert it into a binary fraction, multiply the fraction by 2, take the integer part and re-multiply new fraction by 2 until a fraction of zero is found or until the precision limit is reached which is 23 fraction digits for IEEE 754 binary32 format.

0.375 x 2 = 0.750 = 0 + 0.750 => b_(-1) = 0, the integer part represents the binary fraction digit. Re-multiply 0.750 by 2 to proceed

0.750 x 2 = 1.500 = 1 + 0.500 => b_(-2) = 1

0.500 x 2 = 1.000 = 1 + 0.000 => b_(-3) = 1, fraction = 0.000, terminate

We see that (0.375)_10 can be exactly represented in binary as (0.011)_2. Not all decimal fractions can be represented in a finite digit binary fraction. For example decimal 0.1 cannot be represented in binary exactly. So it is only approximated.

Therefore (12.375)_10 = (12)_10 + (0.375)_10 = (1100)_2 + (0.011)_2 = (1100.011)_2 "
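
The repeated-doubling procedure quoted above can be sketched in a few lines. This is an illustration only, with invented helper names (`fraction_to_binary`, `to_binary_string`), and it assumes the 23-digit cut-off from the quoted text as the stopping limit.

```python
def fraction_to_binary(frac, max_bits=23):
    """Convert 0 <= frac < 1 to a string of binary fraction digits by repeated doubling."""
    digits = []
    while frac != 0 and len(digits) < max_bits:
        frac *= 2
        bit = int(frac)          # the integer part is the next binary digit
        digits.append(str(bit))
        frac -= bit              # keep only the fractional part for the next step
    return "".join(digits)

def to_binary_string(x, max_bits=23):
    """Convert a non-negative number to 'integer.fraction' in binary."""
    whole = int(x)
    return bin(whole)[2:] + "." + fraction_to_binary(x - whole, max_bits)

print(to_binary_string(12.375))   # 1100.011
print(fraction_to_binary(0.1))    # stops at the 23-digit limit, since 0.1 never terminates
```

Note that 12 = 8 + 4 = 2^3 + 2^2 gives the integer digits 1100, and the last integer digit carries the weight 2^0 = 1.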

So 12 can be as if , or if ? — Preceding unsigned comment added by Versatranitsonlywaytofly (talk · contribs) 17:49, 2 March 2012 (UTC)
I think, for last digit and for any over Is that correct? For example, for last digit and and thats how we get 0. — Preceding unsigned comment added by Versatranitsonlywaytofly (talk · contribs) 18:09, 2 March 2012 (UTC)
Or if from article so . But then how to get 1, need another bit (say, in the end) for 1 like if 0 (digital) then 0 (decimal) and if 1 (digital) then +1 (decimal). — Preceding unsigned comment added by Versatranitsonlywaytofly (talk · contribs) 18:26, 2 March 2012 (UTC)
It might be irrelevant, but there is almost no chance that a CPU uses this stupid conversion. If a decimal digit is 4 bits, there are 82 multiply gates (the multiplication table of the decimal digits, plus 0) for multiplying one decimal digit by another (a 4-bit by 4-bit multiplication with 82 possible results, so at least 4*82=328 transistors are needed; this is the minimum, and the real number might be much bigger). The same 82 gates are also needed for division, addition and subtraction, so 4*328=1312 gates for the 4 basic operations (+,-,*,/). The Intel 4004 CPU has 2300 transistors, which looks like not quite enough. If we are talking about single precision (32 bits = 32/4 = 8 decimal digits), then a minimum of 1312*8=10496 transistors is needed. The Intel 8086 has 3500 transistors and seems to have an 80-bit coprocessor, the 8087, which is a separate chip. If 32 bits is 8 decimal digits, then 80 bits is ten decimal digits, so a minimum of 1312*10=13120 transistors is needed. This means that either the 8087 has more than 13120 transistors, or 80 bits does not mean that it has computing units for 10 decimal places, only that it can calculate to such precision; the 8087 would then be a chip of instructions for how to calculate to such (80-bit) precision and would not calculate anything itself. Then 1312<3500, but this is still too small a number (even for the 8087 coprocessor, but maybe the 8087 has more than 3500 transistors, maybe over 10000); maybe the 4004 also has a coprocessor (which maybe consists of many chips, or the 4004 is of limited functionality). — Preceding unsigned comment added by Versatranitsonlywaytofly (talk · contribs) 19:11, 2 March 2012 (UTC) Correction: the Intel 8086 has 29000 transistors and the Intel 8008 has 3500 transistors.
BTW, by my estimation a single-core 3 GHz CPU alone has enough power to render present-day graphics. The hardest part is not shadow projection but texturing. (In the 1998 game Motocross Madness, the shadow of the bike falls on the curved surface of the hills, and this is on an nVidia Riva TNT or Voodoo 2/3 card and a 300 MHz Pentium II; this is almost the same as self-shadowing, because in present-day games different parts are probably just added together anyway. In Tomb Raider Legend, the difference with and without shadows is 1-5 fps [measured with FRAPS] at an average of about 20-40 fps, and the game lags even on the best cards because of big animated water textures and many small but distant bump-mapped textures, which do not use mipmapping when they are used on many distant objects; but if you come closer to a wall they cannot get any bigger, and the frame rate increases to 40-90 fps with next-gen effects, which means bump mapping, which halves the fps.) About 1000*1000=1000000 pixels need to be output to the screen 30-60 times per second, so 30*1000000=3*10^7 operations are needed. A CPU does one multiplication of a decimal number in one cycle, so a 3 GHz CPU can do only 3*10^9 decimal additions, subtractions, multiplications or divisions per second. Present-day games have about 50000 triangles (or about 100000 vertices) in a scene, so moving the vertices is not the hard part and a GPU is really unnecessary here, because without a GPU there is only a 2-3 times difference in fps in the DirectX 2010 tutorials (and in 3D Studio Max 5 you can get 10 teapots or 10 spheres with more than a million polygons each at more than 25 fps, but without textures, on a single-core 2-3 GHz processor and an nVidia Riva TNT 2). So 30 (fps) * 10^5 (vertices) = 3*10^6 vertices/s. Here rotation matrices are needed, which contain a few sine and cosine functions, but a rough cosine result can be taken from, say, a 1 MB table, or calculated with a Taylor series, which takes about 10 additions and 10-20 multiplications (raising to a power is simple, because b=a*a, c=b*b, d=c*c, so a^5=a*c, a^7=a*b*c, a^8=d).
For any rotation, a precision of 4 decimal digits or even 2 decimal digits (16 or 8 bits) is enough. So calculating a sine or cosine function takes about 10-30 operations; thus we have 3*10^6 (vertices/s) * 30 (operations) * 4 (decimal digits) = 3.6*10^8 operations, and our 3 GHz CPU can do 3*10^9 such operations. Then in 3DMax we use texture wrapping (putting a texture on a geometric object), cylindrical, spherical or projective, whichever you like (I doubt spherical exists), and then for a 3D game this texture is converted to a projective one to make it faster in real time. The rotation about the Ox and Oz axes needs to be calculated for each triangle, and then the texture pixels projected onto the triangle. A triangle has the equation of a [flat] plane. The point of intersection of the projection ray [from the texture pixel] with each triangle of the mesh (3D object) needs to be found. This is not very hard; it needs one square root and about 10-15 additions and multiplications in total; with the square root, say, about 20 operations. So 10^6 (pixels) * 30 (fps) * 20 (operations) = 6*10^8 operations are needed. The total is 6*10^8 + 3.6*10^8 = 9.6*10^8 operations, which is less than the 3*10^9 operations of a 3 GHz CPU (this is where you get 90-100 fps). It is true that there are many textures, but if an object is far away then mipmapped textures are used (smaller versions of the texture with 1/4 of the pixel count of the original, for very distant objects 1/16 of the pixels of the original, and so on). So we are still talking about roughly the same 10^6 texture pixels in the scene. A texture is usually about 512*512, and the same goes for a bump map. Bump mapping alone is about the same as texturing but a little harder, because the dot product of the light vector and the bump-map vector needs to be calculated (this is exactly 3 multiplications and 2 additions for each bump-map texture pixel, so 5 operations in total, which is almost nothing compared with the 20 operations in total for texturing).
So once each pixel of the texture has been projected onto the geometry, these pixels need to be rotated toward the viewer, and this is another 5-20 operations or so for calculating a cosine or sine function; quaternions apparently can make it faster but less precise. There is also still a big chance that all cosine and sine results are taken from a table. So in total we get 2*6*10^8 + 3.6*10^8 = 1.56*10^9 operations, or about 60 fps on a 3.1 GHz CPU (or 30 fps with bump mapping). — Preceding unsigned comment added by Versatranitsonlywaytofly (talk · contribs) 20:17, 2 March 2012 (UTC)

Sorry I didn't read your entire post but 2^0 = 1 by definition, which is not equal to zero. See Exponentiation#Arbitrary integer exponents (Zeroth power redirects). Is there anywhere that suggests it is zero? Vadmium (talk, contribs) 02:48, 3 March 2012 (UTC).

But this is where operations on binary digits come in handy. To address memory (RAM or NAND flash), a code has to be generated to get each next bit of a file. This needs billions of combinations, and the processor cannot know the access code of each next bit, so bit addition has to be used: 1 bit has to be added to the bit string each time. For example
    1001011, then
and so on. By adding 0000001 to the string 0000000 each time, you will get all 2^7=128 results (only at the end of a section (in case one file was removed and another inserted) does the bit string need to record in which section the file information continues).
I don't know what anyone teaches at university or what the IEEE standards for converting decimal digits into binary are, but I don't see any way here to simply convert binary into decimal for output on the screen. You can only emulate such a process, and maybe not even that, because after trying to get, for example, the decimal number from the binary number 1001010, you know it is (say, you can't get 1 for integers, but only 0.999999 for real numbers, or 1.000002). But if you raise 2^6 in binary and add 2^3 and 2^1 in binary, then you will again get the same binary number. And if you want to have a table for outputting 2^1, 2^2, 2^3 ... 2^n, then you need billions of bytes of memory to output one number that has about 10 decimal digits. The theory of binary computation used in the article doesn't apply in practice. Or am I wrong? — Preceding unsigned comment added by Versatranitsonlywaytofly (talk · contribs) 09:26, 3 March 2012 (UTC)
Actually it is possible to calculate in binary form. But in the end some tables of numbers are still needed, and decimal digit adders are needed. So we calculate everything in binary, and the hard part is then converting it to decimal. Free Pascal calculates in 64-bit precision (15-16 decimal digits). So to get such precision without emulation, 64 bits have to be multiplied, divided, added to or subtracted from 64 bits. After billions of calculations these 64 bits then have to be converted to decimal digits. The 64 bits need to be divided into 16 parts of 4 bits each. Say that as the final result we get the 64-bit string:
Then we dividing this string into 16 parts:
1010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0100 1011.
Last 4 bits have number Last 4 bits can have 16 combinations (number form 0 to 15), but only 10 of them are meaningful (this 4 bits can mean or 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9) when 15 converting into byte (4 bits for 0 or 1 and 4 bits for second number [of 15] from 0 to 9). But there need two decimal places, because 16 have two decimal places. Each of 16 combinations gives decimal number from 0 to 15. For first and for second decimal digit [of 15] 0 have the same 4 bits code for first number as for second number and the same must be for 1 (first decimal digit of 15 can be only 0 or 1; then we can add 1 to over second from end of 64 bits string 4 bits number which decoded becoming 8 or more bits number, but divisible by 4).
Now take a look at 4 bits from 66th bit to 60th bit, which is 0100. This 4 bits is number But maximum number of this bits can be, I am not sure, or 2^7=128 or 128+64+32+16=240. So this time table have 3 decimal digits and each of 4 bits, so need decode 0100 or any over combination second from end of 4 bits to number from 0 to 128 (or 240). You have in memory each of 4 bits code and they mean number from 0 to 128 (or to 240).
Now take a look at first 4 bits: 1010. This number is This number is huge, but there is only 16 such huge numbers (like in all cases only 16 numbers and the smaller, the closer 4 bits are to end of 64 bits number). So for this 4 first bits (of 64 bits number) there is each code of 4 first bits (0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111). So in our case code 1010 of 4 first bits means number 11529215046068469760. If, say, code of 4 first bits is 1100, then this code means .
So now we can decode this bits string
into decimal digits number:
As you see for 3 add instructions we must use addition table of decimal numbers (we need to know how, for example 0101+1011 outputs decimal number from 0 to 18, so need ten variants of output 0101 with all over another codes for decimal number, then next coded decimal number with all over ten decimal number results; in over words 4 bits + 4 bits and output 8 bits, 4 bits for first number (0 or 1) and 4 bits for second number (from 0 to 9)), which must be integrated into CPU. But we don't must have decimal numbers multiplication and division and subtraction tables (and for this reason there need much less transistors, but more memory for storing numbers of each code).
BTW, an interesting fact is that this calculation is correct:
2^9 + 2^8 + ... + 2^1 + 2^0 = 1023.
So 0000000000=0 and 1111111111=1023. And we know that there are 2^10 = 1024 variants (so from 0 to 1023). — Preceding unsigned comment added by Versatranitsonlywaytofly (talk · contribs) 17:06, 4 March 2012 (UTC)
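
The binary-to-decimal step discussed in this post needs neither large tables nor decimal adders. The sketch below is a rough illustration with invented helper names: `nibble_contributions` weights each 4-bit group of the 64-bit string above by a power of 16, and `to_decimal_digits` produces decimal digits by repeated division by 10. The `word` string is assumed to be the same 16-group string quoted in the post.

```python
def nibble_contributions(bits):
    """Split a spaced bit string into 4-bit groups and return each group's decimal weight."""
    groups = bits.split()
    n = len(groups)
    # Group i (counted from the left) is weighted by 16 ** (n - 1 - i).
    return [int(g, 2) * 16 ** (n - 1 - i) for i, g in enumerate(groups)]

def to_decimal_digits(n):
    """Produce the decimal digits of a non-negative integer by repeated division by 10."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, 10)     # r is the next least-significant decimal digit
        digits.append(str(r))
    return "".join(reversed(digits))

# The 64-bit example string: 1010, thirteen groups of 0000, then 0100 1011
word = "1010 " + "0000 " * 13 + "0100 1011"
parts = nibble_contributions(word)
total = sum(parts)
print(parts[0], parts[-2], parts[-1])
print(to_decimal_digits(total))
print(to_decimal_digits(int("1111111111", 2)))   # 1023
```

The leading group 1010 contributes 10 * 2^60 = 11529215046068469760, which agrees with the figure given in the post.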

TL;DR 2^0=1 — Preceding unsigned comment added by (talk) 10:45, 16 September 2015 (UTC)


PASCAL float types are "REAL", or additionally "LONGREAL" if there must be types that differ in precision, just as in C they are (long) float. Single, double etc. are names from the IEEE specs, and are not part of PASCAL the language, but machine-specific identifiers.

I didn't correct it in the main article to avoid confusion (especially since in some popular compilers REAL was typically mapped to softfpu, and single/double were mapped to the hardware FPU). (talk) 12:04, 30 June 2012 (UTC)

Number of "bytes"

"Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point."

Does this really mean 4 bytes? Won't 4 octets be less ambiguous? — Preceding unsigned comment added by (talk) 21:20, 7 August 2012 (UTC)

It is a reasonable question. Byte is certainly much better known whereas octet is, as far as I know, mostly used in RFCs. Since IEEE 754 was standardized and adopted well after the ascension of the eight bit byte, it seems to me we serve our readers better by calling it a byte. I would not object to a note which specifies that a byte here means eight bit byte or octet. Maybe it is even clearer to omit bytes altogether, specifying only bits? —EncMstr (talk) 23:26, 7 August 2012 (UTC)
What would such a footnote add? You can already see it's a 32-bit value because it actually literally says so. I think the lead is crystal clear on this matter. — Preceding unsigned comment added by (talk) 10:42, 16 September 2015 (UTC)

# of Significant Digits

For a single-precision floating-point number in decimal representation, it is said that only 7 digits are significant, but this is not mentioned in the article, let alone why it is so. Jackzhp (talk) 17:50, 17 March 2013 (UTC)


There are a few statements relating to precision that I think are incorrect (please see [1]):

 "This gives from 6 to 9 significant decimal digits precision (if a decimal string with at most 6 significant decimal digits is converted to IEEE 754 single precision and then converted back to the same number of significant decimal digits, then the final string should match the original; and if an IEEE 754 single precision is converted to a decimal string with at least 9 significant decimal digits and then converted back to single, then the final number must match the original[4])."

You don't really get 9 digits of precision; you only get up to 8. 9 digits is useful for "round-tripping" floats, but it is not precision.

 "some integers up to nine significant decimal digits can be converted to an IEEE 754 floating point value without loss of precision, but no more than nine significant decimal digits can be stored. As an example, the 32-bit integer 2,147,483,647 converts to 2,147,483,650 in IEEE 754 form."

Again, same comment on 9 digits. Also, why is this discussion limited to integers? And the example converts to 2,147,483,648, not 2,147,483,650.

 "total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits)"

The simple "take the logarithm" approach does not apply to floating-point, so the precision is not simply ≈ 7.225.


--Behindthemath (talk) 22:01, 29 June 2016 (UTC)
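
Two of the points above can be checked with a short script. This is a sketch under stated assumptions: `to_binary32` is a hypothetical helper that rounds through `struct`'s 4-byte float format, and `survives_six_digits` tests the "at most 6 digits" direction only on a handful of hand-picked strings, so it illustrates the claim rather than proving it.

```python
import struct

def to_binary32(x):
    """Round a number to the nearest binary32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

def survives_six_digits(text):
    """True if a decimal string survives decimal -> binary32 -> 6-digit decimal."""
    stored = to_binary32(float(text))
    return float(format(stored, ".6g")) == float(text)

print(int(to_binary32(2147483647)))   # 2147483648

samples = ["0.123456", "9.99999e10", "1.00001e-5", "16777.2", "3.14159"]
print(all(survives_six_digits(s) for s in samples))
```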

7 digits is closer to the truth. Dicklyon (talk) 23:21, 29 June 2016 (UTC)
All 7-digit integers, and 8-digit integers up to 2^24 = 16777216, are exactly represented in IEEE single precision, but 2^24 + 1 = 16777217 falls between two representable values. I'd say that's close enough to 7.2 digits. Dicklyon (talk) 05:04, 30 June 2016 (UTC)
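
The integer boundary at 2^24 is easy to reproduce. The snippet below is a small illustration, again assuming a hypothetical `to_binary32` helper built on `struct`; it rounds a few integers around 2^24 and spot-checks a range of 7-digit integers.

```python
import math
import struct

def to_binary32(x):
    """Round a number to the nearest binary32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

for n in (2**24 - 1, 2**24, 2**24 + 1, 2**24 + 2):
    print(n, "->", int(to_binary32(n)))

# Spot check: the last thousand 7-digit integers are all stored exactly.
print(all(to_binary32(n) == n for n in range(9_999_000, 10_000_000)))
print(round(math.log10(2**24), 3))   # 7.225
```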
For integers, yes, log10(2^24) makes sense. But not for floating-point values. Depending on the exponent, precision can vary from 6-8 digits Behindthemath (talk) 13:35, 30 June 2016 (UTC)
Note that 10001 is less precise than 10101 for its size even though they both have 5 bits. With 5 significant bits, 11111 is possible but not 100001. So a bit of precision is lost after the transition. We average 23 and 24 to get 23.5, and multiply it with log(2) to get 7.07. 7.22 is just what you get with all significant bits on, maximizing precision for size. Also, 2147483647 (I don't like these commas) converts to 2147484000 according to me, but it could be 2147480000, 2147483600, 2147483650 or 2147483648 depending on program(ming language) used. 24691358r (talk) 14:10, 27 June 2017 (UTC)
@24691358r: Please explain why/how 17 is less precise than 21 even though they both have 5 bits. Then, why the programming language matters: For a particular computer, the same hardware is processing the numbers, not a language. —EncMstr (talk) 16:10, 27 June 2017 (UTC)
In base 18 for example, 17 is single digit (h) but 21 is two-digit (13), and both have integer precision. Because 13 in base 18 starts at a higher digit, it has more precision. I can show example meaningful in decimal: 1001 (9, 1 significant) is less precise than 1101 (13, 2 significant). 4*log(2) is 1.20412, but it assumes precision of 4 bits at a size of 2^4 (10000). However, 1001 has a lower size and therefore lower significant precision. So we average 3 and 4 to get 3.5log(2) which is 1.0536. Same thing for single float which gives average precision of 7.07.
One language could show 6 significant digits, while another could show 8. That's the difference. 24691358r (talk) 19:20, 27 June 2017 (UTC)

I agree that the listed precision is wrong. The formula log(2^23), or about 6.92, gives the amount of decimal precision. That's log base ten of two to the power of mantissa digits. For example, if you go above 16384, you lose the ability to differentiate differences of 0.001 and 0.002, etc., because from 16384 upward the values go in increments of 1/512 rather than 1/1024. That's just under 7 digits of precision. We should edit the article to say 6 to 7 decimal digits, rather than 6 to 9, because single-precision floats simply cannot store 8 or 9 decimal digits of accuracy (if you're curious, you would need 30 bits of mantissa for 9 decimal digits, log(2^30) = 9.03...). Aaronfranke (talk) 09:41, 27 January 2018 (UTC)
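
The step between neighbouring binary32 values can be measured rather than estimated. This is a minimal sketch, assuming positive finite inputs that are already exact binary32 values; `next_up_binary32` and `spacing` are invented helper names that step to the next bit pattern and take the difference.

```python
import struct

def next_up_binary32(x):
    """Return the next binary32 value above a positive, finite binary32 value x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

def spacing(x):
    """Gap between x and the next larger binary32 value."""
    return next_up_binary32(x) - x

for x in (4096.0, 8192.0, 16384.0):
    print(x, spacing(x), "= 1/" + str(int(1 / spacing(x))))
```

With the implicit bit counted, this gives steps of 1/2048 from 4096, 1/1024 from 8192, and 1/512 from 16384.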

However, it's also worth noting that if you include the implicit bit as precision, log(2^24) is about 7.22, which is still far closer to 7, but it does mean that you need 8 decimal digits to guarantee accurate conversions if you go binary -> decimal -> binary. You cannot, however, have all 8-digit decimal numbers survive decimal -> binary -> decimal. Aaronfranke (talk) 09:52, 27 January 2018 (UTC)
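
How many decimal digits are needed for binary -> decimal -> binary can be probed empirically. The sketch below is only a sample-based check with invented helper names, and it assumes that converting the decimal text to a Python float first and then to binary32 does not change the outcome for this sample. It walks 1000 consecutive binary32 values starting at 10.0 and counts failures at 8 and at 9 significant digits.

```python
import struct

def to_binary32(x):
    """Round a number to the nearest binary32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_trips(x, digits):
    """True if binary32 value x survives formatting to `digits` significant digits and back."""
    text = format(x, "." + str(digits - 1) + "e")
    return to_binary32(float(text)) == x

def sample_from(start, count):
    """Yield `count` consecutive binary32 values starting at the positive binary32 `start`."""
    bits = struct.unpack("<I", struct.pack("<f", start))[0]
    for i in range(count):
        yield struct.unpack("<f", struct.pack("<I", bits + i))[0]

sample = list(sample_from(10.0, 1000))
fail8 = [x for x in sample if not round_trips(x, 8)]
fail9 = [x for x in sample if not round_trips(x, 9)]
print("failures with 8 digits:", len(fail8))
print("failures with 9 digits:", len(fail9))
```

In this sample some values do not survive with 8 digits while all survive with 9, which is consistent with the "at least 9 significant decimal digits" wording quoted earlier in this section. Between 10 and 16 the binary32 step (2^-20) is slightly smaller than the 8-digit decimal step (10^-6), which is why 8 digits cannot distinguish every value there.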