
Talk:Floating-point arithmetic


Is 9/7 a floating point number?

9/7 is not a floating point number!? -WeißNix 15:48, 5 Aug 2003 (UTC)

Not quite sure what you're saying there WeißNix, but if you're saying 9/7 isn't a floating point number I'd agree. 9/7 would seem to represent nine sevenths perfectly, whereas any floating point representation would have a finite accuracy - this is what I believe, anyway. Tompagenet 15:54, 5 Aug 2003 (UTC)

nit pick - the floating point representation of 9/7 in base 7 is 1.2e+0 8-) AndrewKepert
or in base 49, 9/7 is 1.e+0.... But perhaps the clear statement is "With base 2 or 10 the number 9/7 is not exactly representable as a floating point number". The one hiccup is that you might like to think of real numbers as "floating point numbers with infinitely long significands". But since that doesn't really add anything (other than a representation) over the notion of a real number, it isn't clear that this is a useful way to think about the subject. Jake 20:56, 3 October 2006 (UTC)[reply]

What is Absorption?

I do not understand the sentence "Absorption : 1.10^15 + 1 = 1.10^15" in the section Problems with floating point. I guess that absorption (with a p) means that two different real numbers are represented by the same floating point number, but 1.10^15 is approximately 4.177 so I cannot see how that is the case here. Jitse Niesen 22:42, 8 Aug 2003 (UTC)

The author meant 1.10^15 to be interpreted as 1 × 10^15 (=1,000,000,000,000,000). The point they're making is that adding a relatively small number to a larger stored number can result in the stored number not actually changing. A dot is often used as a symbol for multiplication in (especially pure) maths; the article might be clearer with a different representation, though. Hope that's some help -- Ams80 21:52, 13 Aug 2003 (UTC)

Ah, now I understand, thanks. I changed 1.10^15 to 1·10^15, as this is how multiplication is customarily denoted. Indeed, another representation (10^15 and 1E15 come to mind) might be clearer, but I am not decided. -- Jitse Niesen 23:29, 13 Aug 2003 (UTC)

Seems to me that × should be used, following the precedent in the first line of the Representation section of the article. I'll make that change (and elsewhere in the article, as needed)... Mfc
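
For concreteness, a small sketch of absorption in Python, assuming IEEE double precision (where the effect only sets in near 2**53; the 10^15 example presumes a format with a shorter significand):

  # Adding a small number to a much larger one can leave the stored sum
  # unchanged, because the exact sum rounds back to the larger operand.
  print(1.0e16 + 1.0 == 1.0e16)   # True: the 1.0 is absorbed
  print(1.0e15 + 1.0 == 1.0e15)   # False: 1e15 + 1 is still exactly representable in a double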


Numbers don't represent themselves?

Shouldn't the talk page be used for discussion of the content of the article, and not the article itself? There is a request for improvement at the bottom (which I agree with, but am not in a position to help with) and the recent addition by User:Stevenj, which is really a meta-comment, somewhat indignant, and factually incorrect:

Many a textbook asserts that a floating-point number represents the set of all numbers that differ from it by no more than a fraction of the difference between it and its neighbors with the same floating-point format. This figment of the author’s imagination may influence programmers who read it but cannot otherwise affect computers that do not read minds. A number can represent only itself, and does that perfectly. (Kahan, 2001)

The problem is that numbers don't represent themselves. Numbers are represented by numerals, either present as marks on a bit of paper, glowing phosphor or electrons hanging around on a transistor. In the present case I think the "many a textbook" is actually correct, as the point of doing a floating point calculation on a computer is to get an approximation to the answer. The operands are real numbers, and the putting of these values into computer memory is the "representation" step. It is here that the range of values gets condensed to a single approximating value, as described in the hypothetical text-book. The operations of real arithmetic correspond to the algorithms the chip uses to manipulate the representations, and the output is a representation of the answer. It may not be a very good representation (in that it may say 0.333333333 where the answer is 1/3) but for most people it is good enough. So anyway, I am removing it and if others feel something similar belongs it is their call. --AndrewKepert 07:42, 22 Sep 2003 (UTC)

Try reading Kahan's articles (please tell me you know who he is!). The point is that the computer stores an exact representation of certain rational numbers; the idea that "every floating point number is a little bit imprecise", that all floating point calculations are necessarily an "approximation", indicates muddy thinking on the part of the programmer (or textbook author), not what the computer actually does, and is an old and common source of confusion that Wikipedia should resist. Arguing about the physical storage of the number is beside the point; the issue here is the logical representation (a certain rational value), not the physical one. —Steven G. Johnson
Strictly speaking, a number can possess neither precision nor accuracy. A number possesses only its value. (Kahan)
Hi Steven, No I am not familiar with Kahan. Let me explain my viewpoint on this as a mathematician. As you point out, we are not talking about numbers here, we are talking about representations of numbers - i.e. numerals. They are numerals no matter what matter is used to store and/or display them. Yes, a numeral represents a unique number precisely (in the case of floating point representation, a unique rational number). The twist is that it is also used to represent a whole stack of other numbers imprecisely. The imprecision may come from the input step (e.g. A/D conversion) it may come from the algorithms implemented in hardware or software or it may come from inherent limitations in the set of numbers that are represented precisely. This is not a criticism of any representation - it is a feature. We would be well and truly stuck if we insisted on representing every number precisely in any of these steps (you want to use π? Sorry Dave, I can't give you π precisely, so you can't use it).
My original statement was that the self-righteous quote from Kahan was muddying the waters between numeral and number. While it is a valid viewpoint from a particular point of view (e.g. if you restrict "real number" to mean "the real numbers that I can represent with a 64-bit float") and useful if you are focussing on what is going on inside the computer (which is presumably why Kahan objects to it being the starting point for programmers), it is really out of place in an introductory statement on what floating point representation is about. The content of our current discussion bears this out - there are many subtle issues involved with the simple act of "writing down a number", of which we are scratching the surface. These are not useful to the layperson who is wondering what floating point representation is and how it works. Maybe the issue is worth raising (maybe even in a balanced fashion) further down the article, but not at the top.
...and I agree with your final quote from Kahan. A number possesses only its value. However, a representation of a number has accuracy. A system that uses numerals to represent numbers (aka a numeral system) has precision.
Is my point of view clear from this?
Work calls ... --AndrewKepert 06:32, 23 Sep 2003 (UTC)
First of all, Kahan is one of the world's most respected numerical analysts and the prime mover behind the IEEE 754 floating-point standard; he's not someone whose opinion on floating-point you dismiss lightly. Also, he is an excellent and entertaining writer, so I highly recommend that you go to his web page and read a few of his articles. I think that your final statement gets to the core of your fallacy: you say that a representation of a number has accuracy, but accuracy is meaningless out of context. "3.0" is a 3-ASCII-character representation of the number 3. Is it accurate? If it's being used in place of π it's pretty poor accuracy. If it's being used to represent the (integer) number of wheels on a tricycle, it has perfect accuracy. Floating-point is the same way: it stores rational numbers, and whether these are "accurate" is a statement about the context in which they are used, not about the numbers or the format in which they are stored. Saying that they are used to represent a whole stack of other numbers imprecisely is describing the programmer's confusion, not the computer, nor even a particular problem (in which, at most, the floating-point number is an approximation for a single correct number, the difference between the two being the context-dependent accuracy). —Steven G. Johnson
The relevant context is very simple. Can the binary representation "accurately" reproduce the numeric literal given in the programming language? For example, when you do X=0.5, you will get a 100% accurate floating point number because the conversion between decimal notation and binary notation is exact. However, if you do X=0.3, you can never get decimal 0.3 exactly represented in binary floating point notation. So my interpretation of "accurate" is the preservation of the same value between the programmer specification and machine representation. 67.117.82.5 20:29, 23 Sep 2003 (UTC)
You're not contradicting anyone. The point is not that there is no such thing as accuracy, but that it is a property of the program/context and not just of the floating-point number. (Although determining accuracy is usually more difficult than for the trivial case of literal constants.) It's nonsense, therefore, to say that a floating-point number has a certain inherent inaccuracy, and this is the common fallacy that the original quote lampoons (among other things). It's not just semantics: it leads to all sorts of misconceptions: for example, the idea that the accuracy is the same as the precision (the machine epsilon), when for some algorithms the error may be much larger, and for other algorithms the error may be smaller. —Steven G. Johnson
Now we're getting somewhere. In my point of view (a pure mathematician, not a numerical analyst) there is no such thing as a floating point number. There is only a floating point representation of a number. I know there is confusion about accuracy vs precision. Obviously I am confused by your point of view and you by mine.
Let me pull apart your example from my viewpoint. The number 3 (as in this many: ***) can be represented by the numeral "3.0" with very good accuracy. (error=0 etc). The number π (as in the ratio of circumference to diameter) can also be represented by the numeral "3.0" with bad accuracy (error=0.14159...). The numeration system whereby real numbers are represented by ascii strings of the form m.n has precision 0.05 (or 0.1 depending on how you define precision). If you were looking for the best representation of π in this numeral system, you can use "3.1" which has accuracy less (better) than the precision of the numeration system. All of this follows from my statement about accuracy and precision in my view. Please point out why you consider this point of view fallacious? I want to understand!
I have no doubt about Kahan's credentials, I just am not a numerical analyst, so I haven't read anything on numerical analysis since I was an undergrad. I know very little about lots of things, and a lot about a few things. My point is that hairy-chested statements that are debatable when viewed from a different angle should not be up front in wikipedia.
-- AndrewKepert 01:36, 24 Sep 2003 (UTC)
P.S. broadly speaking I agree with the quotes you gave from Kahan. Textbooks are incorrect to say "floating point number having accuracy" as once the number is only seen as the exact value represented by the numeral stored in the bytes, it has lost its original context. If you take my point of view (very literal, admittedly) then the principal error is to say "floating point number". His last statement in that first quote about numbers only representing themselves is what I consider to be incorrect. Numbers don't represent numbers (well, they can, but not in the context of floating point representation). If you take the (fictional, but practical) viewpoint that the floating point representation is the number then you can't talk about numbers you care about such as π. However when you take this viewpoint to its extreme, then you conclude that there is no such thing as accuracy or precision! Your final quote from Kahan I agree with totally, for reasons I have already stated (accuracy: property of a representation, precision: property of a method of representation). -- AndrewKepert 02:22, 24 Sep 2003 (UTC)

(Restarting paragraph nesting.)

I think the confusion in the language between us is due to the fact that there are two things going on that could be called "representation". First of all, the binary floating-point description, the "numeral" to use your term, has an unambiguous injective mapping to the rationals (plus NaN, Inf, etc.): a floating-point numeral represents a rational number (the "floating-point number"). This representation has a precision, given by the number of digits that are available. Moreover, it is defined by the relevant floating-point standard; it is not context-dependent. Second, in a particular application, this rational number may be used to approximate some real number, such as π, with an accuracy that may be better or (more commonly) worse than the precision. That is, the rational number may represent some real number, but this mapping is highly context dependent and may not even be known.
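
As a concrete illustration of the first mapping (a numeral standing for exactly one rational number), here is a small Python sketch; the fractions module recovers the exact rational value of a stored double:

  from fractions import Fraction

  print(Fraction(0.5))   # 1/2 -- the literal 0.5 is stored exactly
  print(Fraction(0.1))   # 3602879701896397/36028797018963968 -- the rational actually stored for "0.1"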

It is this second mapping that engenders so much confusion in general, and the common fallacy of textbook authors is to conflate it with the first: to say that the finite precision implies that a floating-point numeral intrinsically represents a corresponding range of real numbers. Kahan was being somewhat glib (he wasn't giving a formal definition, and being pedantic about the numeral/number distinction is beside the point), but I think his basic point is valid. The misconceptions he is trying to combat are the ideas that floating-point numbers are intrinsically inaccurate, or that they somehow represent a "range" of real numbers simultaneously (at most, one floating-point number represents one real number, with a certain accuracy, in a context-dependent fashion), or that they can only represent real numbers in the range given by their precision.

Moreover, the computer doesn't "know" about this second putative representation; as far as it's concerned, it only works with rational numbers (which "represent only themselves"), in the sense that its rules are based on rational arithmetic, and it's misleading to imagine otherwise. (There are alternative systems called interval arithmetic that do have rules explicitly based on representing continuous ranges of real numbers, but these behave quite differently from floating-point.)

In any case, upon reflection I agree that the quote is probably too provocative (and, perhaps, easily misunderstood out of context) to put up front. But I still think it is good to actively combat such common mistakes.

In the same PDF file, he gives a list of "prevalent misconceptions". I think it's worthwhile reproducing it here. It was intended to be provocative, I suspect, and the last three points are more understandable in the context of his presentation (a criticism of Java's fp specs), but it is thought-provoking.

Steven G. Johnson

Prevalent Misconceptions about Floating-Point Arithmetic

(by William Kahan, in [1])

Because they are enshrined in textbooks, ancient rules of thumb dating from the era of slide-rules and mechanical desk-top calculators continue to be taught in an era when numbers reside in computers for a billionth as long as it would take for a human mind to notice that those ancient rules don't always work. They never worked reliably.
  • Floating-point numbers are all at least slightly uncertain.
  • In floating-point arithmetic, every number is a "Stand-In" for all numbers that differ from it in digits beyond the last digit stored, so "3" and "3.0E0" and "3.0D0" are all slightly different.
  • Arithmetic much more precise than the data it operates upon is needless, and wasteful.
  • In floating-point arithmetic nothing is ever exactly 0; but if it is, no useful purpose is served by distinguishing +0 from –0.
  • Subtractive cancellation always causes numerical inaccuracy, or is the only cause of it.
  • A singularity always degrades accuracy when data approach it, so "Ill-Conditioned" data or problems deserve inaccurate results.
  • Classical formulas taught in school and found in handbooks and software must have passed the Test of Time, not merely withstood it.
  • Progress is inevitable: When better formulas are found, they supplant the worse.
  • Modern "Backward Error-Analysis" explains all error, or excuses it.
  • Algorithms known to be "Numerically Unstable" should never be used.
  • Bad results are the fault of bad data or bad programmers, never bad programming language design.
  • Most features of IEEE Floating-Point Standard 754 are too arcane to matter to most programmers.
  • "Beauty is truth, truth beauty. — that is all ye know on earth, and all ye need to know." ... from Keats' Ode on a Grecian Urn. (In other words, you needn't sweat over ugly details.)

From AndrewKepert 01:11, 25 Sep 2003 (UTC)

Okay, I think we are getting closer to talking the same language. I think I now understand what you said (thank god - we're both smart, both have a mathematical background ... 8-) ). As I see it, a representation in this context is the mapping (surjection) number -> numeral, whereas your emphasis is on the partial inverse to this mapping numeral -> number. (BTW "numeral" is not my term - it is the standard term for a written (in this case bitted) representation of a number.)

My biggest problem here is your use of the term the mapping as if it were uniquely defined. There is a unique surjective mapping from R to numerals that you can define, of course, which is used when you want to represent a known literal like π. (In this mapping, accuracy is ordinarily at least as good as the precision.) But this mapping does not generally reflect the numbers that can be represented by a given numeral in a program; more on this below. —Steven G. Johnson

I have two reasons for this:

1. Representation is the standard way of downgrading from an abstract entity (such as a group, topological algebra, or whatever) to something more tractable (space of functions, etc -- see for example group representation). So, "representation" means a mapping from the set of things being represented to another set of objects that are used to represent them. "Representation of ..." is also commonly used for elements of the range of this mapping, and is the context in which we have both been using it above.

The difference in the direction of the mapping is superficial; we both agree that the numeral represents a number and not the other way around, and the direction of the mapping just indicates which problem you are solving. The main disagreement is about which number(s) are represented. Or, more to the point, about the meaning of "accuracy" even if we agree that the obvious numeration system is the "representation." See below.

2. As I see it, "numbers" stored in computers are there as representations of quantities. They only have meaning in relation to what they represent. Kahan and his colleagues did not dream up IEEE754 as an empty exercise in self-amusement. They wanted a usable finitely-stored representation of R. The only way R can be represented by finite bit-strings is to accept some imprecision, so this is what is done.

But the imprecision does not tell you the error, or even bound it. See below.

Representations don't have to be faithful (one-to-one), and if we were ambitious, we could also allow them to be multi-valued, as would be required for π to have several acceptable approximate representations with accuracy close to the precision of the numeration system. However, I don't see multiple values as the main issue.

So given this, a numeral system (or I prefer the french phrase "System of Numeration": fr:Systèmes de numération) is a representation - a function from a set of numbers to a set of numerals. Then "accuracy" is a property associated with the representation of a number x (the difference between x and the rational number associated with the numeral for x) and "precision" is a property associated with the representation function itself (an upper bound to the accuracy, possibly over a certain interval in the domain, as is required in floating point repn). These definitions do require the function numeral -> number, as you need to have "the rational number associated with the numeral". So in all, I don't think it is pedantry to insist on the distinction between number and numeral, as it is what the concept of accuracy relies on, which was your initial point.

Precision is neither an upper nor a lower bound to the accuracy compared to the number that is actually desired, i.e. in any useful sense. You're looking at the accuracy of the wrong mapping. See below.
(Note that the IEEE 754 standard speaks of floating point formats as being representations for only a finite set of real numbers. Not that you are wrong, but you know as well as I that there is no universal, unambiguous, formal meaning of the word "representation", even in mathematics; let's not get hung up on finding the "correct" terminology, especially since I think such a focus misses the point here. English requires context and intelligence to interpret.)

So with "representation" in my sense, the hypothetical text-book authors would be correct to say something like a floating-point numeral/representation represents the set of all numbers that differ from it by no more than a fraction of the difference between it and its neighbors with the same floating-point format, as the real numbers in some interval (a × 2^n − ε, a × 2^n + ε) all have the same floating point representation, looking something like +a × 2^n. Having said this, I agree with you and Kahan that this is misleading. From my point of view the correct statement would be to change the viewpoint: a number represented in floating point shares its representation with the set of all numbers ... although it is still clumsy in this form. A better starting point is to ignore the distinction between number and numeral and just get stuck into coding. When accuracy becomes an issue, the idea that we are only using a representation of a number (i.e. a numeral) is the clearest way I see of opening the gap between "number" and "numeral" and thus being able to talk about accuracy and precision.

Your "corrected" version is still wrong; or, at least, it doesn't reflect how floating-point numbers are actually used except for declaring constants. This is the heart of our disagreement. Suppose I have a program F to compute some invertible function f(x), which has a well-defined result y in the limit of arbitrary-precision arithmetic (for exactly specified x). In floating-point arithmetic, I get some rational result Y which is an approximation (a representation of sorts) of the desired result y, but which is not necessarily the same as the rational representation of y that the numeral system by itself would dictate. That is, if the numeral system's representation mapping was N(y), the "representation" mapping of the program answer could be written F(N(f–1(y))). (But I don't think defining the approximation as such a formal representation mapping is always desirable or practical.) Depending upon the accumulation of errors and the cleverness of the programmer, the error |Y-y| may be arbitrarily large; it certainly is not bounded by the machine ε (even if you look at relative error). ε only describes the precision with which a number is specified, not the accuracy; the fallacy is to conflate the two.
I can't read Kahan's mind, nor am I a numerical analyst myself, but as far as I can tell he would say simply that the floating-point numeral represents unambiguously only that single rational number, with some precision. It may be used to approximate some other real number (which may or may not be exactly representable), with some context-dependent accuracy. But I think that the main point is not terminology of "representation", but rather to divorce the concepts of precision and accuracy. —Steven G. Johnson

Lots of interesting issues raised here. I think it would make a good paragraph or three in the article. -- AndrewKepert 01:11, 25 Sep 2003 (UTC)

I really strongly recommend that anyone interested takes a few minutes to read a little bit from Kahan's web page, say pages 25-45 of this presentation. It's quite entertaining, light reading, and illuminates some of these issues (although it's not a paper on formal representation theory). —Steven G. Johnson
Quick response: Thanks for the example -- a sketch of it (as a commutative diagram figure, except for the fact that it doesn't quite commute) is basically how I was thinking of algorithms and accuracy, except from the other side of the square. If you are thinking of the algorithms for manipulating floating point numbers, yours is the natural side of the square to be working on. If you are thinking of the real numbers that are the reason you wrote the code, the algorithms are approximations to the true arithmetic (etc) operations on R. I still think my "correction" is correct -- it was poorly worded, as I was trying to paraphrase. Anyway I will check my reasoning on this when I have the time (not now). I did catch your point about the problem with my referring to "the floating point representation of a number", as in practice a single numeration system can have many usable "approximate representations" of a given number - thus my earlier comment on multi-valued functions.
I intend to continue this conversation later - I am away for the next week. (No phone, no e-mail, no web - bliss!) -- AndrewKepert 08:53, 25 Sep 2003 (UTC)

binary rounding

I'm having a tiny problem with the pi example, the sentence "Hence the floating-point representation would start with ..." irritates me a little. Instead of merely cutting away behind the last digit I'd prefer rounding instead:

11.001001000011111
is the number with an extra digit. Simply cutting away the last digit gives
11.00100100001111
That's about 3.14154053, which is about -5.2E-5 away from pi.
Rounding up the last digit instead gives
11.00100100010000
That's about 3.14160156, which is about 8.9E-6 away from pi.

Is there something I've missed? --Nikai 22:24, 14 Apr 2004 (UTC)

You're quite right, especially as there are non-zero bits 'out there to the right' of the original 17. The smallest fix would be to add 'in a representation which did not allow any rounding...' or something like that. mfc 15:25, 16 Apr 2004 (UTC)
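
For what it's worth, a short Python sketch reproducing the comparison above (keeping 14 fractional bits of pi, once by chopping and once by rounding to nearest):

  import math

  bits = 14
  scaled = math.pi * 2**bits
  chopped = math.floor(scaled) / 2**bits   # 11.00100100001111 -> 3.14154052734375
  rounded = round(scaled) / 2**bits        # 11.00100100010000 -> 3.1416015625
  print(chopped - math.pi)                 # about -5.2e-05
  print(rounded - math.pi)                 # about +8.9e-06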

base 2 to base 10

how to find the exponent

In order to convert -18.75 into 32 bits:
18d = 10010b
0.75d = 2**-1 + 2**-2 = 1/2 + 1/4 = 0.11b
---> 18.75d = 10010.11b =  1.001011 * 2**4
Sign: negative     = 1
Exponent: 4        = 4d + 7Fh = 83h = 1000 0011 b     Why add 7Fh (0111 1111b)?
Mantissa: 001011   = 001 0110 0000 0000 0000 0000
Hence, -18.75 is 1 10000011 001011 000 000 000 000 000 00 (binary) or C1960000 (hex)
bit number      =1 23456789 012345 678 901 234 567 890 12 = 32 bits

Why add 7Fh (0111 1111b) ? [[2]]

This is the "exponent offset" of 127 referred to in the "computer representation" section. The bit pattern stored in the exponent field of the number in the computer is not the same as the exponent that we mean when we say things like "18.75d = 1.001011 * 2**4". It is always higher by 127 (in single precision.) William Ackerman 23:55, 17 August 2006 (UTC)[reply]
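
A quick way to check the worked example, as a Python sketch assuming IEEE single precision:

  import struct

  word = int.from_bytes(struct.pack('>f', -18.75), 'big')
  print(hex(word))                         # 0xc1960000
  print((word >> 31) & 0x1)                # sign bit       = 1
  print((word >> 23) & 0xFF)               # exponent field = 131, i.e. 4 + 127 (the offset)
  print(format(word & 0x7FFFFF, '023b'))   # fraction field = 00101100000000000000000 (the leading 1 of 1.001011 is implicit)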
I don't understand how to calculate the decimal value of 110010010000111111011011:
1 10010010 000111111011011, left to right: a31=1, a30=1, a29=0, a28=0 etc... a31=1 so float is positive; a30-a23 = 10010010 --> ? a22-a0 = mantissa = 000111111011011

So how can you say that 11.00100100010000 is nearly pi ?

Like this: You must be careful to distinguish a number written out as a digit string in some base from the same number packed into a particular computer's format for normalised binary floating-point: the packed bit sequence can be very different from the plain binary digit string of the number, because of additional details such as a possible implicit leading bit, the locations of the sign bit and the exponent field, and any offset applied to the exponent.

To convert from one base to another the hand process is simple though tedious. First, split the number into the integral part and the fractional part. Thus for pi, the parts are 3, and .14159 etc. Convert the integral part by repeated division by the desired base and note the remainders (working in whatever base you wish): 3/2 gives 1, remainder 1. Second step: 1/2 gives 0, remainder 1. No further steps. The integer part of the binary representation is thus 11., being the remainders in reverse order. (For 6, the remainders would be 0, 1, 1 and the representation would thus be 110.) For the fractional part, repeatedly multiply by the chosen base, recording and removing the whole part that appears. So .1415*2 = 0.2830 so 0, .2830*2 = 0.5660 so 0 again, .5660*2 = 1.1320 so 1, .1320*2 = 0.2640 so 0, etc. so the first few binary bits of the fractional part are .0010 and thus the binary representation of pi is 11.00100100001... this being a proper digit string in binary of a number.

This is however not a normalised binary floating-point number, so an exponent has to be introduced. To pack the number into the computer's representation bit string (32 bits, 64 bits, 80 bits...) the mantissa's bit string must be truncated or rounded at 24 bits, or whatever field size is allocated for it (with or without an implicit leading one), and then all the components (sign, mantissa, exponent) packed according to the agreed format. Going the other way, you first unpack the floating-point format into a proper number (if you know that the bit string is not already a proper binary number bit string), especially locating the binary point, and then perform the base conversion. It is quite easy to write a computer programme which does this. Although it will do its internal calculations in binary, the value of "base" can be 2 or 10, or 16, or 60 (Babylonian) and all will be well.
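
As a rough illustration, a Python sketch of the hand procedure just described (repeated division for the integer part, repeated multiplication for the fraction; digits beyond 9 are not handled, so it only suits small bases such as 2 or 10):

  import math

  def to_base(value, base=2, frac_digits=16):
      ipart, fpart = int(value), value - int(value)
      int_digits = []
      while ipart:
          ipart, r = divmod(ipart, base)
          int_digits.append(str(r))        # remainders come out in reverse order
      frac = []
      for _ in range(frac_digits):
          fpart *= base
          frac.append(str(int(fpart)))     # record and remove the whole part
          fpart -= int(fpart)
      return (''.join(reversed(int_digits)) or '0') + '.' + ''.join(frac)

  print(to_base(math.pi))   # 11.0010010000111111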

So 11.00100100001... is indeed the start of the binary representation of pi as a proper digit string in binary, just as 3.1415 is a proper digit string in decimal of an approximate value for pi, which on the IBM1620 (a base ten computer) might appear in storage as 3141501 where the bold type indicates tags associated with the digits: the IBM1620, a discrete transistor design (pre-integrated circuits, that is), offered indefinite precision floating-point arithmetic via hardware (not software); a number was located by the start digit of its address and if that digit was tagged, it was a negative number. The digits of the number continued until the next tagged digit marking the end, thus the tag on 5, and for floating-point numbers this was followed by a two-digit power with the tag marking the end of the power. Because it normalised to the 0.xxxx style (rather than the x.xxx style), 3.1415 is represented as .31415×10^1, thus the exponent field of 01, and if I felt like typing them, I could have added many more than five digits. Reading memory dumps of numbers on the IBM1620 was so easy... No "little-endian" madness. NickyMcLean 21:41, 17 August 2006 (UTC)[reply]

Humm, on review, I think I recall now that the IBM1620 addressed numbers by their rightmost digit (whose tag indicated the sign), and the high-order digit in memory (indefinitely off to the left, at lower addresses) would have a "field mark" tag on it. For floating-point, the address point thus led directly to the exponent field. No matter, the fundamental point is that numbers could be hundreds of digits long, and though represented in decimal, not all digits had the same interpretation as a part of the number because some were exponent digits. NickyMcLean 21:51, 17 August 2006 (UTC)[reply]

0.6/0.2

I was reading this article a while ago and I saw something (I think in the "problems" section) about what happens when a computer tries to find int (0.6/0.2), where int(x) is the largest integer not larger than x. It said something like "since neither 0.6 nor 0.2 can be exactly represented in binary, the computer evaluates 0.6/0.2 to 2.999999998, which the computer will happily convert to 2". I can't find it in the page archive, so could someone put something like that on the page, because I think it's a good example of a problem with floating-point arithmetic. (P.S. I've tried calculating int (0.6/0.2) on a computer, and it does sometimes come out to 2.)
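
For what it's worth, a quick Python check of this on a machine with IEEE doubles:

  import math

  q = 0.6 / 0.2
  print(q)                       # 2.9999999999999996 here, since neither operand is stored exactly
  print(int(q), math.floor(q))   # 2 2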

hyphen or not?

The article is inconsistent about "floating point" versus "floating-point". I think we should decide on one form and use it. Does anyone have a preference for one variant or the other? Neilc 04:15, 12 September 2005 (UTC)[reply]

When it is used as a compound adjective (as in "floating-point arithmetic"), then it should be hyphenated according to good English punctuation style. When it is used as a noun (as in the article title), it should not be hyphenated. —Steven G. Johnson 16:02, September 12, 2005 (UTC)

Literal 0.1 unrepresentable?

In the section "Problems with floating point" it says that the literal 0.1 is unrepresentable, I'm not quite understanding this, can someone please explain? - Xgkkp 18:36, 5 October 2005 (UTC)[reply]

It cannot be represented exactly by a binary (base 2) fraction, as it requires powers of 5 to be exact. For examples, see: ['problems with binary floating-point'] mfc
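
A short Python sketch that makes this visible; Decimal prints the exact value of the double actually stored for the literal 0.1:

  from decimal import Decimal

  print(Decimal(0.1))
  # 0.1000000000000000055511151231257827021181583404541015625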

Shouldn't we mention how operations on floating-point numbers are (usually) carried out?

Most implementations of floating-point numbers do not only specify how such numbers are represented but also specify what results operations on them must produce. More precisely, one usually requires a op_f b:= rnd(a op b) for any two floating-point numbers a,b, and any (here binary) operation op (like +, -, /, *, sqrt, etc.). Here, rnd is the function that rounds its argument to the closest floating-point number, and op_f is the operation op for floating-point numbers.

I wonder whether the article should not at least mention this "fundamental property" because it is this very property that allows us to analyze what a calculation actually does! For instance, the requirement "a op_f b:= rnd(a op b)" implies that (for normalized numbers) a op_f b = (a op b) (1+eps) where |eps| is bounded. Using this, you can perform an analysis of what a formula like ((a +_f b) -_f c) *_f d), for instance, actually computes.

I know that not all implementations of floating-point numbers require a op_f b:= rnd(a op b) (I think GNU MP does not, but there are "patches" that make it do so). Nonetheless, IEEE 754 does respect it (at least for normalized numbers, if I am not mistaken). Adding a section on this "fundamental property" (anyone has a better name?!) would, I think, help a lot to understand that we CAN work and calculate with floating-point numbers (and do not have to HOPE that it is going to work well).

(I am willing to write something, of course, with your help, but would like to first ask what you think of the idea.) hbf 02:14, 25 November 2005 (UTC)[reply]
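
As a sanity check of the property described above, a small Python sketch (assuming IEEE double precision with round-to-nearest, and using exact rational arithmetic to measure the error):

  from fractions import Fraction
  import sys

  eps = Fraction(sys.float_info.epsilon)   # 2**-52 for doubles
  for a, b in [(0.1, 0.3), (1.0/3.0, 2.0/7.0), (1e20, 3.7)]:
      exact = Fraction(a) + Fraction(b)    # exact sum of the two stored values
      computed = Fraction(a + b)           # what the floating-point addition returned
      assert abs(computed - exact) <= (eps / 2) * abs(exact)
  print("a +_f b = (a + b)(1 + e) with |e| <= eps/2 held for these cases")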

It seems like a good idea to include how floating point mathematical instructions are calculated and how to calculate the error in the resulting number. (This is what you're talking about right?) You mention 'eps' - it seems that eps would vary with the function i.e. |eps (+)| = |eps (-)| <> |eps (sqrt)| . (hope my syntax makes sense without explanation)
I assume that eps is pretty much the same accuracy as the mantissa in simple functions like add.
Unfortunately my knowledge of actual fp implementation is not good. I probably wouldn't be able to help much except check your maths.
I think the correct term includes words like 'precision of floating point operations' - is precision the word you were looking for?
I'm not sure if you description 'how operations on floating-point numbers are (usually) carried out' includes computer hardware e.g. structure of logic gates to create for example a floating point 'adder' (if so maybe I can help) or were you thinking in purely mathematical terms.

On a vaguely similar subject I was thinking there should be an explanation of what floating point numbers are actually good for, with a comparison with fixed point arithmetic (in terms of, say, a physical simulation of real world physics: e.g. floating point is good for light intensity at distance, due to the non-linear nature of the variation, but fixed point is good for 'Euclidean space' vectors (given sensible programming) since these are linear quantities). I'm currently looking for the correct page to include it in as my topic spans two pages. HappyVR 16:29, 11 March 2006 (UTC)[reply]

Scientific notation

I added a link to the closely related Scientific notation. I noticed it is already linked in the text but I thought there should be a specific link seeing as the two are so closely related. HappyVR 16:40, 11 March 2006 (UTC)[reply]

Implementation

I removed the following comment left by an editor without account at the article:

This is all very interesting, but how are they stored? Many people wish to know this!

I do hope somebody will write about the "fundamental property", mentioned above, and perhaps give more details about the implementation. -- Jitse Niesen (talk) 03:25, 20 March 2006 (UTC)[reply]

?

I removed this :
"A floating point number can be seen as a large integer, were the most significant bit is stored, then the number of the following identical bits, and then the remaining bits. Additionally, a constant denominator is defined to allow for fractions, and a constant bias to have a zero."
Not saying it's wrong - just questioning whether it helps explain anything? Seems (to me) to be describing a different method of storing numbers - it's definitely not clear to me. HappyVR 19:57, 23 March 2006 (UTC)[reply]

problems - cancellation

"Cancellation - subtraction between nearly equivalent operands"

Subtraction of equivalent operands does what? Does it yield 0? That should be explicit. Fresheneesz 19:43, 25 May 2006 (UTC)[reply]
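
A quick Python check of what actually happens with IEEE doubles (exactly equal operands do give exactly 0; nearly equal operands expose the rounding error already present in them):

  x = 0.1 + 0.2    # stored as 0.30000000000000004...
  print(x - x)     # 0.0 exactly
  print(x - 0.3)   # 5.551115123125783e-17: the subtraction itself is exact,
                   # but it lays bare the earlier rounding errors ("cancellation")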

example - not IEEE standard

Why is the example not in IEEE standard? Is it in any standard? Why would we do it that way on this page when a better standard exists? Fresheneesz 19:44, 25 May 2006 (UTC)[reply]

mantissa - not informal

"which is called the significand or, informally, mantissa"

At Talk:Significand, I assert that mantissa is not the same thing as a significand, nor is it informal. Please discuss there. Fresheneesz 20:01, 25 May 2006 (UTC)[reply]

Base 2 vs base b

I think there should be a clear distinction on this page between base 2 and base b. Floating point numbers are, by far and away, mostly used in binary number representation. We should have a general definition using an arbitrary base, but in computers, the base is 2, and that should be the focus. Fresheneesz 20:05, 25 May 2006 (UTC)[reply]

My recent (and ongoing) edit.

I'm attempting to improve this article. I hope to add material on actual computer representation, and overflow/underflow/subnormal results. I hope that this will address many of the questions raised on this talk page.

By the way, the "fundamental property" noted above (that floating point arithmetic is done by effectively carrying out the operation to infinite precision and then rounding), is true. The IEEE standard mandates that compliant computers get this result. For square root, which is within the domain of the IEEE standard, this is very hard, but is nevertheless mandated. More complicated things like sin/cos/exp/log are (fortunately for library authors) not within the IEEE standard, and are only "almost always" correct.

William Ackerman 02:00, 5 June 2006 (UTC)[reply]

Looks very good; thanks for your efforts. However, you seem to use different definitions of the significand s. The article says:
  • the revolution period of Io is simply e=5 ; s=1528535047
  • the value of the significand obeys 1 ≤ s < b
  • the mathematical value of a floating-point number is .sssssssss...sss × b^e
These statements seem to contradict each other. Additionally, I thought that the IEEE standard does not mandate banker's rounding, but allows different kinds of rounding (or perhaps, it requires that various kinds of rounding be implemented?). In that case, the phrase "the technique of Banker's rounding is used" is too rigid. -- Jitse Niesen (talk) 04:33, 5 June 2006 (UTC)[reply]
You caught the page between my first version and a cleanup edit :-) It happens that my first handwritten draft used a different definition of point location, at the extreme left end, that I happen to prefer. (POV!!!) When I saw that I would be contradicting both common IEEE conventions and explicit IEEE terminology and definitions, I retreated, and changed things to s.sssss. William Ackerman 18:20, 5 June 2006 (UTC)[reply]

I have done more cleanup, and have taken the liberty of removing the "cleanup" flag from this page. William Ackerman 02:09, 8 June 2006 (UTC)[reply]

I have softened the assertion that "mantissa" is wrong, taking into account the sensibilities expressed earlier on this page that its usage is so widespread that calling it wrong is too harsh.

Also, I have put in material describing the "fundamental property" (as it has been called on this talk page, though not the article itself) that the result is always (mandated by IEEE) the nearest representable value. This property is crucial! It is the most important achievement of IEEE 754. The way I have expressed it doesn't do it justice. Can anyone rearrange this to describe this property in the way it deserves?

Also, I have separated the aspects of the rounding mode from the aspects of overflow and infinity creation. While the rounding mode can affect whether an infinity is created, it's a two-stage process -- first get the ideal exponent, and then see whether it fits. William Ackerman 01:17, 19 June 2006 (UTC)[reply]

Exponent overflow/underflow

I think that the conditions of exponent overflow (and underflow) should be distinct from the more pure notions of +-infinity which should be reserved for the cases where no exponent would be big enough. In implementations of floating-point arithmetic, the existing protocol requires messing with odd compiler options and strange routines that set and test various processor states which are dependent on the specific processor. Not being a standard part of a language (even in number-crunching fortran) means that such programmes are non-portable, and I haven't attempted to use these facilities. I'd prefer the collection of values to include +-inf, +-overflow, +-underflow, plus undefined (for non-initialised floating-point numbers, though I of course Always Initialise variables Before Reference) as well as the infamous NaN (resulting from computation). It is particularly annoying that different events cause different bit patterns for a resulting NaN, so that testing if x = NaN fails. Fortran offers IsNaN(x), but this is of course non-standard, since of course not all computer hardware designs recognise such a bit pattern. NickyMcLean 21:01, 16 August 2006 (UTC)[reply]

By "cases where no exponent would be big enough" do you mean numbers that could not be represented no matter how large the exponent field is? If so, that number is truly infinite. Common usage and terminology uses a weaker definition of floating-point infinity -- any number that can't be represented by the existing exponent field. That's about 10^308. You also mention unsatisfactory and non-portable results of tests of the form "x==NaN". Tests for NaN and INF should not be done that way, since the hardware (for complex but good reasons) often considers NaN to be not equal to anything, not even another NaN. You should always use things like IsNaN(x) or ieee_is_nan(x) in Fortran, or _isnan(x) in C. These should not have portability problems -- in each case, the compiler/library is supposed to understand whatever bit representation the current hardware uses, and act accordingly. That is, if a program produces a NaN, _isnan(x) should return true on any hardware, no matter how that hardware encodes those bits. William Ackerman 00:22, 18 August 2006 (UTC)[reply]
No, overflow is not infinity. For any given fp scheme, there is some maximum value for the exponent (positive, and negative) - whatever it is, if x is a number with over half that value for its exponent, then x*x will overflow (or underflow, for negative exponents) but this should not be regarded as being infinity at all.
True, but the hardware has no choice. In floating-point hardware, INF doesn't mean "the thing you learned about in school as a way of thinking about divergent integrals". It means "Help! I overflowed!" William Ackerman 00:40, 31 August 2006 (UTC)[reply]
Well, which is to be the master? I agree that the hardware behaves as you say (because it does), but I want the hardware to (be designed to) follow what I want, not for me to adjust my thoughts to conform to what the hardware does (even though on a particular computer, I do have to, with resentment), and I'm arguing that it should do otherwise to suit me (and of course, I hope others will be suited also); thus I do wish to distinguish proper infinities (cast up by the mathematics, and recognised as such by my algorithm which of course does brilliant things in conformance with the mathematics, which is hopefully in good conformance with the physical situation, but that's something else), from overflows caused by mere arithmetical limitations, which limits are annoyingly different in different number formats on different machines. Thus I desire +-oflo as well as +-inf and agree that more combinations arise but that's no problem to me as it suits me.... NickyMcLean 21:46, 31 August 2006 (UTC)[reply]
This applies for any finite exponent limit, whatever it is. Infinity is a lot further out. Overflow might be the result of a small change: if x is close to the largest fp number that can be represented, 2x will overflow; likewise if x is close to the smallest (non-zero) number, x/2 will underflow. (I'm not persuaded of the utility of "denormalised" numbers close to zero: their existence just exacerbates the difficulty of preparing a comprehensive algorithm that must now deal with two types of numbers rather than one.) Physicists delight in formulae such as e**3/h**2 * m (charge of electron, Planck's constant, mass of electron), all of which are very small, so one hopes that during an expression's evaluation there are extra bits for the exponent not available in the storage form. Otherwise, overflow results even though the result is within fp range.
A proper infinity would for example be generated by a/b where b is zero, though full propriety in general requires that this be the result of an analysis involving (limit, b to zero, of a/b) rather than some blunder in the method producing a divide by zero. Similarly, tangent(pi/2) should spit back infinity (ah, which sign?), though to pick nits, the limited precision fp representation of pi/2 would not be pi/2 exactly so tangent(approx(pi/2)) might merely be a very large number (but not infinity) for which overflow ought to be the result. Likewise, one might argue that arctan(0,0) should return ? (or NaN) because a vector of zero length points in no direction.
You say you'd like different kinds of NaN for atan2(0,0), 0/0, overflow/overflow, etc. As far as I know, the IEEE committee didn't provide for that. If such distinctions existed, the rules for how all the various types of overflows, infinities, and NaNs propagate through all the various operations would become way too complicated. In practice, these things turn into NaN pretty quickly. The main reason for providing INF is to let programmers catch things early, if they so choose. William Ackerman 00:40, 31 August 2006 (UTC)[reply]
I'll see if I can provoke different "values" for NaN in some tests. Amusingly enough, I have wallowed in this slough before (half-hourly measurements of electrical load at hundreds of substations) where data is not available with certainty. A value might be x, and associated with it in my scheme was a character code. Blank for "good", otherwise one of "fubrom" for various failure states (Failure of the meter, Unavailable number, Broken tape, Reset count error, Overflow in count (yay!), Movement failure (the paper tape didn't advance)). Well, obviously in summations there could be combinations. Blank+Blank stayed Blank (yay), Blank+fubrom became fubrom, one fubrom with a different fubrom became ?. All good fun, but it had to be done via associated tests, though careful organisation and phasing meant that only if within a day's data there were non-blank codes did the jiggling start. It was not at all satisfactory to ignore these details, despite the cost in extra processing. (Alas, staff using this system resorted to editing the source data files (!) to remove the fubrom codes so that their summations would appear without the warning codes and there would be no need to fix the errors, for there were no errors...) Now, if I could be certain of a multi-value NaN state (where any NaN value would not be admitted to a summation, nor would a count of good values include them, so something like SUM(a)/CountValid(a) would work) I'd be happier because the one datum would encase both the value and, failing that, the codes. Thus, my proposal would be for a miscegnatory NaN (my usage of ? from mismatched fubroms), NaN codes as a consequence of various established misbehaviours, and an allowance for "user-defined" NaN codes. Currently I am dealing with "missing datum" via an explicit floating-point value that is not expected to appear (exactly) via calculations on normal data. I had wanted -666.666E666 but it is out of the available fp range, so I have chosen -666.666E33 with careful usage. The data are stored as real*4 in a random-access disc file, but all calculations are via real*8 so the BAD value is real*8 but is assigned -666.666e33's real*4 binary value. And it is really tedious to be forever having to check for BAD values, at a cost in crunch speed, when the hardware would do better. But because IsNaN is not standard (and the prog. is to run on a variety of systems in principle) I have resolved for the moment to stick with it. NickyMcLean 21:46, 31 August 2006 (UTC)[reply]
Thus one introduces further shadings. infinity/infinity generates one sort of NaN, but overflow/overflow (and other combinations) should generate another sort. In the first case, the number cannot be other than infinity due to the limit argument so no finite dynamic range for the fp scheme would suffice, but in the overflow case, it might well be that the same calculation and data performed with a greater dynamic range fp. scheme would result in a valid result with no overflows anywhere along the line. Having messed with an IBM1130 where the dynamic range was small enough to be met often (about 10**36 or so I think) one often had to recast formulae. Thus, (e/h)**2*e*m worked.
This trick for programming around the exponent range limitations of machines like the 1130 is similar to the tricks that people use (the Archimedes pi example!) to work around accuracy limitations. In each case it's unfortunate that such things are necessary, but floating-point is a big set of compromises. William Ackerman 00:40, 31 August 2006 (UTC)[reply]
With regard to the IsNaN function, the compaq "visual" fortran manual v6.6 refers to it as a non-standard function, which it is; as with other ad-hoc extensions to a compiler, there is no guarantee that other compilers will have the same extension, still less that the next standardisation will include it. Further, if there are multiple representations that count as a NaN value, I too would like to be able to set and use such code values for the benefit of my processing. (One might be "missing datum", another might be "failed to converge", etc etc) Without a clear usage of equality, surely a very basic concept, this is not helpful. NickyMcLean 05:28, 19 August 2006 (UTC)[reply]
All I can say here is that the industry has not yet standardised things like IsNaN, or ways to instantiate a NaN value. William Ackerman 00:40, 31 August 2006 (UTC)[reply]
Some tests later, I got around to RTFM (associated via "help" with the compiler), which as usual is not specific which is why tests have to be concocted because the accursed manual writers rarely present full details, merely namechecking some issues. Here is an extract from the Compaq "visual" furrytran compiler concerning NaN
Not a Number
Not a Number (NaN) results from an operation involving one or more invalid operands. For instance 0/0 and SQRT(-1) result in NaN. In general, an operation involving a NaN produces another NaN. Because the fraction of a NaN is unspecified, there are many possible NaNs. Visual Fortran treats all NaNs identically, but provide two different types:
Signaling NAN, which has an initial fraction bit of 0 (zero).
Quiet NaN, which has an initial fraction bit of 1.
The output value of NaN is NaN.
My tests could not cause Sqrt(neg) to generate a NaN despite the remark in the damn manual; there might be some option somewhere or other arrangement to enable it to produce a NaN rather than an error message, but I couldn't find it and did not want to embark on the supposed option of writing my own error-handling routines. However, the key phrase is "there are many possible NaNs" even though this compiler system restricts itself to two, itself a number not equal to one. Agreed, the IEEE specification fails to specify multiple values for NaN, merely allows them without inference (grr), so I'd be in favour of a scheme such as I described. In usage, the fp hardware (well, firmware) is ordinarily about to do something complex with two fp numbers; if one or both has a special value, the complexity of dealing with that ought to be less. So, a NaN code for "mangled", then codes for 0/0, etc, then "user-defined" codes should not cause much trouble. Basically, (NaN op x) gives that NaN, (NaNa op NaNb) gives the general NaN code unless perhaps a = b. Seems simple enough. The values could be revealed as NaN0, NaN1, etc. NickyMcLean 22:06, 4 September 2006 (UTC)[reply]
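
For reference, a small Python sketch of the behaviour discussed in this thread (NaN comparing unequal to itself, the IsNaN-style test, and a look at the stored bit pattern):

  import math, struct

  nan = float('nan')
  print(nan == nan)                     # False: a NaN is not equal to anything, even itself
  print(math.isnan(nan))                # True: the portable way to test
  print(struct.pack('>d', nan).hex())   # e.g. 7ff8000000000000 -- a quiet NaN; the payload bits vary by platform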

Misconceptions

I have reverted much of the changes to the "misconceptions" section. This section should be about the "FP numbers are approximations" misconception, and how one can replace it with clear thinking. It should not be about the many accuracy problems that can arise -- that topic is treated extensively later.

I believe that the detailed bit-by-bit analysis of "0.1" and "0.01", while tedious, is important for an understanding of what is actually going on.

William Ackerman 16:36, 31 July 2006 (UTC)[reply]

Storage of Exponent

In computers, it is often helpful to store the exponent as the exponent plus an offset to aid in hardware comparisons. A plain binary encoding is often not preferable.

Yes. I'm not sure whether you are asking why IEEE does it this way (reasoning given below) or suggesting that maybe existing hardware doesn't use offset arithmetic, but ought to. The IEEE representations do in fact use offsets on the exponent. Is that not clear in the "computer representation" section? William Ackerman 23:45, 17 August 2006 (UTC)[reply]
Dealing with floating-point numbers and their representation in bits is close to the hardware, which can do whatever the designer wishes. For instance, the representation of a normalised floating-point number in base two can use an implicit high-order bit: since a normalised number always has it set, there is no need to represent it in the storage. Thus a 20-bit mantissa/significand really stands for a 21-bit number, and the hardware would be designed accordingly. Likewise with the exponent field. Suppose it is eight bits. This could be interpreted as an 8-bit two's complement number (-128 to +127), a "ones-complement" number (-127 to +127, with two bit patterns for zero, +0 and -0), an unsigned integer (0 to 255), or... an unsigned integer with an implicit offset.
In general terms, one would like floating-point arithmetic to follow the rules of real number arithmetic as much as possible. Thus, if x is a fp number, so also should -x be. (Zero cannot be represented as a normalised floating point number.) Accordingly, rather than use a 2's complement form for the significand/mantissa, a sign bit is used, thus avoiding the annoyance found in 16-bit integers where -32768 can be represented, but +32768 cannot.
The DEC PDP-6/PDP-10 used full-word 2's complement arithmetic to handle negative floating point numbers. It had problems. William Ackerman 23:45, 17 August 2006 (UTC)[reply]
Similarly, if x is an fp number, 1/x should also be an fp number. Alas, this desideratum can only be met approximately in many cases. In binary, fp numbers with the same exponent (say zero) range from 0.5 (exactly, as 0.1000...) up to but not reaching 1 (as 0.111111...); their reciprocals thus range from 2 (exactly) down to but not quite reaching 1, and thus the exponent must be changed to renormalise the value. In the case where fp numbers are normalised into 1.x rather than 0.1x, an fp number with exponent 0 ranges from 1 (exactly, being 1.00000...) up to but not quite 2 (being 1.111111...), and so their reciprocals range from 1 (exactly) down to almost 0.5, and thus the exponent must be changed so as to normalise to the range 1.x
So there are asymmetries, and a "plain binary" interpretation of the bit pattern of the exponent field might not be helpful.
Another consideration is that the representation of the value of zero might well be arranged so that all bits are zero; this would make initialising a wad of storage containing integers and floating-point numbers particularly easy. NickyMcLean 21:32, 16 August 2006 (UTC)[reply]
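To make the offset ("bias") and the implicit leading bit concrete, here is a small C sketch that unpacks a single-precision value into its fields; it assumes IEEE 754 binary32 with a bias of 127, which is one particular design choice rather than the only possible one:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Unpack an IEEE 754 binary32 value: 1 sign bit, 8 biased exponent bits
       (bias 127), 23 stored fraction bits (the leading 1 of a normalised
       number is implicit and not stored). */
    int main(void) {
        float x = -6.25f;                 /* -1.5625 * 2^2 */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        unsigned sign     = bits >> 31;
        unsigned stored_e = (bits >> 23) & 0xFF;
        unsigned fraction = bits & 0x7FFFFF;
        printf("sign %u   stored exponent %u   true exponent %d   fraction 0x%06X\n",
               sign, stored_e, (int)stored_e - 127, fraction);
        return 0;
    }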

Too much info?

This article contains a tremendous amount of information! It may be too much information, if there is such a thing. To me, it reads more like a textbook than an encyclopedia article. I'm new here and I'm not quite sure what to do about it. Any ideas? Sns 04:49, 21 August 2006 (UTC)[reply]

Any discussion of floating-point numbers involves many bits. (Ahem) NickyMcLean 21:00, 21 August 2006 (UTC)[reply]

As the article evolves, we may want to carve out sections into their own (or other) articles and have a brief treatment with reference in the main article. Let's do the clean up first and then think about refactoring. Jake 21:16, 3 October 2006 (UTC)[reply]

Reorganization

Clearly something needs to be done. Here is an outline of what I propose. I may do it over the weekend, if no one else reorganizes it or indicates here that they don't agree.

The main thing is that the "common misconceptions" section goes into way too much detail near the top of the article. Now, talking about misconceptions early on is important, because some misconceptions are extremely widespread. But the mention near the top should be very brief. Just say that an FP number represents an actual real number exactly. FP numbers are not "approximations to real numbers", and they do not "represent some small range of real numbers". All the rest, including the details of 0.1 * 0.1, can wait until after people know how FP operations work. That material is way too intimidating and confusing that early in the article.
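By way of illustration, the 0.1 point can be shown in a few lines of C (a sketch assuming IEEE 754 double precision): each stored value is exact, it just isn't the decimal fraction it was written as, so the product of two stored 0.1s need not equal the stored 0.01.

    #include <stdio.h>

    /* The literals are converted to the nearest IEEE 754 doubles; printing
       with extra digits shows the values that are actually stored. */
    int main(void) {
        double a = 0.1;
        printf("0.1        stored as  %.20f\n", a);
        printf("0.1 * 0.1  computed:  %.20f\n", a * a);
        printf("0.01       stored as  %.20f\n", 0.01);
        return 0;
    }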

Then the details of the representation, the operations, rounding, overflow, etc.

Then the "problems". These problems are the consequences of the way the operations work (which will have been explained by that point) and the expectations of real arithmetic. A lot more work is needed here. First, the "peculiar properties" and the "problems" sections need to be combined. Other than that, I think there are a lot of open questions about how to proceed. Perhaps there should be a first part about "what can go wrong" and a second part about "what to do about it". That is, the second part discusses the Kahan Summation Method and the example of Archimedes' calculation of pi.

Perhaps the material on how to code things to get better accuracy should be a completely separate article. It's a very big topic in general, and this is an encyclopedia. William Ackerman 15:24, 23 August 2006 (UTC)[reply]

I'd agree that an fp number does indeed represent its own value exactly! How about starting off with the "scientific notation" style, then introducing the notion of a limited (fixed) number of digits, which means that not all Reals can be represented, even notionally. (Actually, only Rational numbers can be written down as finite digit strings, and not even all of them: 1/3 has no finite representation in base 10.) Introducing an exponent allows a wider range than fixed-point yet maintains the same relative precision without introducing many more digits to a string; without an exponent, a six-digit number might cover only 000.001 to 999.999. There is an article on fixed-point, though it may not relate all that well.

All of this can be done in decimal, then computers are introduced, with definitely fixed sizes for the various fields. The particular packing scheme's choices (sign, mantissa, exponent, where in the storage word) need not be described in detail, and especially, the packing style of IEEE could be a separate article, given that queries about IEEE might be the starting point for someone's search. In other words, this is an article about fp arithmetic in general, and one particular representation should not be heavily discussed even though millions of cpus use versions of it.

Others have raised this complaint. I really think the bit-by-bit explanation of the format belongs here. This is the page of a central concept in computers. There was a time when IEEE representation was just a proposal, but it's now the way nearly every computer operates. It belongs here, even if it is duplicated on the IEEE page. William Ackerman 23:53, 30 August 2006 (UTC)[reply]

After the definitions (where everything seems to be good, because we're still really thinking about Real numbers) come the usages. The consequences of finite precision, and finite dynamic range, and the special "numbers" such as +-infinity, or other states etc. that one adjoins to the real number system and would like available for fp arithmetic also.

But there are still lots of bits... NickyMcLean 04:33, 24 August 2006 (UTC)[reply]

I have made a major move of the hideous "accuracy/misconceptions" material, putting it down near "peculiar properties". I believe the "accuracy", "peculiar properties", "problems", and "nice properties" sections now contain most of the horribly disorganized material of the page. William Ackerman 23:57, 30 August 2006 (UTC)[reply]

Someone left a comment in the text (that is not shown when the article is presented) suggesting that the explanation of IEEE format could be replaced by a reference to the IEEE format description (it being extended as necessary to add omitted points). In other words, the description of floating point numbers could be entirely about floating point numbers as an idea, without detailed description of a particular implementation. This seems reasonable to me... For now though, I'm off to the movies (Breakfast on Pluto) NickyMcLean 03:10, 9 September 2006 (UTC)[reply]

Cutting back on IEEE info

A number of people have suggested, one way or the other, that this article duplicates material from the IEEE floating-point standard and shouldn't do so. I strongly disagree, at least to the level of detail presently given on this page. Of course, the IEEE page goes into enormous detail that doesn't belong on this page. However, I believe that a person should be able to figure out the bit representation of a floating-point number just from this page. It should definitely go into that level of detail. That means rounding, exponent bias, how the fields are packed, etc.

I think it depends on what you mean. I disagree quite strongly on the question of being able to decode any floating point number from the information on this page. This is really best served (if at all, but I agree that somewhere in Wikipedia would be appropriate) on the page for the particular format. This page will have enough detail. That said, it may be valuable to walk through one format in detail in the interest of being explicit. But even then, we might be better off just referring the reader to some particular FP format's page such as single precision that discusses the encoding in detail. Many people who write floating point algorithms don't really need to concern themselves with the encoding details, so I don't think this is essential to the topic of this page. --Jake 00:33, 12 October 2006 (UTC)[reply]
I think I'm coming around a little more to your point of view. The single precision and double precision pages seem to say, much more succinctly than IEEE, the gist of what needs to be said. William Ackerman 01:36, 13 October 2006 (UTC)[reply]

On the other hand, I think that something that does not belong here is discussion of various vendors' implementations and their shortcomings. This will become a never-ending sink-hole. The descriptions on this page should be in terms of what the transformations mean on (perhaps ideal) correctly-performing implementations. I believe that IEEE has standardized things sufficiently well that such descriptions are meaningful.

I don't know whether discussions of problems with this or that vendor's implementation belong in some other WP pages. I would guess that there are other pages out there in cyberspace that address such things, and that they are kept up-to-date by whoever runs them. Perhaps external links to such pages are the way to go. William Ackerman 22:22, 20 September 2006 (UTC)[reply]

Alright, here follows the piece I removed... NickyMcLean 00:25, 28 September 2006 (UTC)[reply]

Talk:Floating point/Removed sections

I think the section on Exceptional values and exceptions under the IEEE standard should be mostly addressed in IEEE floating-point standard. In this article we should really be setting up the questions that need to be addressed, and leave the other page to answer them (for that particular case). If we wanted to have a page on VAX floating point or Cray floating point, they would address these situations differently. --Jake 00:33, 12 October 2006 (UTC)[reply]

Hypothetical design vs. actual floating-point

I very strongly disagree with the recent changes. A theoretical treatise on how one might design floating-point from first principles on a deserted island is not what this WP page should be about. It should be about how floating-point actually works. Given that practically every computer on the planet does it the same way (IEEE-754), and the few that don't follow IEEE precisely use formats that are quite similar, the page should follow those conventions. It's similar to the treatment of calculus: we all follow the notation devised by Leibniz; we don't dwell on possible alternative notations.

For example, discussion of "... the obvious rule would be ..." and "... these would be called ±Overflow and ±Underflow ..." is inappropriate. They are underflow and overflow; computer designers long ago settled on that, and on how they arise and how they are handled. The mechanism of gradual underflow and "denorms" is well established, and should be discussed. Issues of how computer programs can handle missing data, unreadable data, and malfunctioning meters are totally outside the domain of floating point; they are handled by other software mechanisms.
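For the curious, gradual underflow is easy to observe. The following C sketch (assuming IEEE 754 double precision and the constants from <float.h>) keeps halving the smallest normal number; the results shrink through the subnormal range instead of dropping to zero at once:

    #include <stdio.h>
    #include <float.h>

    /* DBL_MIN is the smallest positive normal double; halving it repeatedly
       produces subnormal ("denormal") values with progressively fewer
       significant bits, rather than an immediate underflow to zero. */
    int main(void) {
        double x = DBL_MIN;
        for (int i = 0; i < 5; i++) {
            printf("%.17g\n", x);
            x /= 2.0;
        }
        return 0;
    }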

People will come to this page wondering what hexadecimal 3FFBB67AE8584CAA means. The page should tell them. William Ackerman 18:17, 29 September 2006 (UTC)[reply]
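For what it's worth, decoding such a pattern is only a few lines on any machine whose double really is the IEEE binary64 format; a minimal C sketch, with the hexadecimal constant above taken purely as an example:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Reinterpret a 64-bit pattern as an IEEE 754 double and print it.
       This assumes the platform's double is the IEEE binary64 format. */
    int main(void) {
        uint64_t bits = 0x3FFBB67AE8584CAAULL;
        double x;
        memcpy(&x, &bits, sizeof x);
        printf("0x%016llX  ->  %.17g\n", (unsigned long long)bits, x);
        return 0;
    }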

I doubt that "People will come to this page wondering what hexadecimal 3FFBB67AE8584CAA means", and if they do, the place to decode it would be the specific format's page, which exists and is linked to, the IEEE 754 which is surely the place for describing exact encodements with examples and usages. If their bit sequence is not composed in accordance with the IEEE standard, then they would have to look at the other pages that do (or should) describe their system's specific format, be it IBM, Cray, or whatever. The idea was to follow someone's suggestion of providing a general description of how a floating-point number is encoded, after which the specific descriptionS for specific systemS can be followed as particular exampleS in the appropriate other pageS.
Curiously enough, I do live on an island, but not a deserted one. And curiously enough the "first principles" description ended up with a design equivalent to the IEEE standard (so the purity of its generality is doubtful), though I did not elaborate on the +-Inf, etc. possibilities. Those are indeed handled in that standard but not in other floating-point forms, and so are not general to the notion of a floating-point number representation; they are therefore not so suitable for an article on the general notion of a floating-point number, which should not be tied to a specific implementation, even one used on many computers, because it is not used on all computers, nor are its features well accommodated by computer languages, which are hoped to be workable on all (or many) computer designs. Bad events during a computation can be finessed by being ignored, by using special values (zero, NaN, etc.), or by triggering a signal that is dealt with by an area-wide catcher (On ZeroDivide ...) or a statement-specific catcher. It remains a messy subject.
One could question the merit of removing an entire exponent's set of significands, thereby employing 24 bits to distinguish just a half-dozen or so states. With such a vast amount of storage otherwise being wasted, there is indeed an opportunity for "user-defined" values, rather than using auxiliary state variables in exactly the way that such state variables would have to be used to notify +-Inf, NaN, etc. if a purist position were taken that a storage area representing a floating-point number should represent just and only a number. The potential for convenience has been followed in introducing NaN, and indeed a few special values such as the result of Sqrt(neg), though the availability of these features will depend on your system's software (Compaq Fortran raises an error, though there may be options to play with); this facility could be widened to further serve the user.
Although the gradual underflow scheme does indeed remove the explicit "Underflow" condition (though not entirely: precision is lost) via the employment of un-normalised significands, there are issues with this design choice. I have seen outbursts from the designers of numerical software who snarl about "obscene" algorithms (and yes, I forget where I read this). Before this feature is added to the design there is only one class of numbers, normalised numbers (with implicit 1 bit possible); after, there are two classes of numbers, and they have different precision, and if there had been an implicit 1 there is one no more. Someone struggling to devise an algorithm that will work swiftly, with minimum loss of precision, for all possible arguments will be troubled, and if there are many arguments, the problem worsens rapidly. Consider a procedure for Sqrt(x): halve the exponent (was it odd or even?), use a lookup table on the first few bits of the significand (normalised or not? implicit 1 bit or not?) for the initial approximation to a Newton-Raphson iteration of say 4 cycles; a rough sketch of such a procedure appears just below this comment. (Testing convergence takes about as long as another cycle.) Escalating to, say, finding the roots of a quadratic (-b +- sqrt(b**2 - 4*a*c))/(2*a), with the necessary variants on the formula for certain ranges of values, is what provokes snarls about "obscene" algorithms. Such annoyances would be reduced by abandoning interest in unnormalised values, saying that the user can damn well make the effort to scale their data sensibly, and if there is an underflow, too bad: signal an error (to kill the prog.), treat as zero, employ the "Uflo" value.
Could it not be said that the description of the design is insufficiently general, in that it is linked to base two rather than to any base as would be more appropriate for the general heading "floating point"; what I wanted to avoid was a cloud of generalities without shape, so that's my excuse for that. I also forgot to wonder why the sign bit of the significand is separated from the rest of the significand by the exponent field, an oversight. On the other hand, tying the description to the IEEE standard is surely too specific, given that there is an IEEE page with such specifics (perhaps without enough of them, there) and potentially at least, a horde of pages describing every known format.
Let third parties speak! NickyMcLean 04:36, 30 September 2006 (UTC)[reply]
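A rough C sketch of the square-root scheme described above (split off the exponent, make it even, then run a few Newton-Raphson steps on the significand). It is an illustration only, with no attempt at the careful handling of subnormals, negative arguments, NaNs or correct rounding that a real library routine needs:

    #include <stdio.h>
    #include <math.h>

    /* A minimal sketch (not a library-quality routine) of square root by
       exponent halving plus Newton-Raphson iteration on the significand. */
    double sketch_sqrt(double x) {
        if (x <= 0.0) return 0.0;            /* error/NaN handling ignored here */
        int e;
        double m = frexp(x, &e);             /* x = m * 2^e, 0.5 <= m < 1 */
        if (e & 1) { m *= 2.0; e -= 1; }     /* make the exponent even */
        double y = 0.5 + 0.5 * m;            /* crude first approximation */
        for (int i = 0; i < 5; i++)          /* each step roughly doubles the digits */
            y = 0.5 * (y + m / y);
        return ldexp(y, e / 2);              /* restore half the exponent */
    }

    int main(void) {
        printf("%.17g  %.17g\n", sketch_sqrt(2.0), sqrt(2.0));
        return 0;
    }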

Call to 3rd parties

I think we need input from other experts in the field. This article has a "needs expert attention" flag, but experts in this field are extremely rare, so I'm going to request attention from specific people that I have found contributing to this and related articles in the past. Those experts are:

I will leave notes in their talk pages. If this isn't really your field, my apologies.

how to proceed

Basically, the philosophical difference is one of describing floating-point in very general and theoretical design terms vs. describing the way it works in practice in existing computers. William Ackerman 22:13, 2 October 2006 (UTC)[reply]

Well, I'm flattered to be on the expert list. I do happen to now be familiar with IEEE-754 (just took an exam on it last week...), so I suppose I can input something. Here's my initial impression. The first thing I said to myself as I skimmed over this article was "hmmm, most of this needs to be removed because it's just duplicate info from the IEEE article." What I think should be on this page (written without knowing whether this content exists): the general concept of sign, mantissa, exponent; the history of floating-point representation; other methods of floating-point representation (if they exist); methods of performing floating-point arithmetic (if that isn't specified by IEEE-754). I agree with William in that the article feels way too original-math-researchy, and needs some great rewording. However, perhaps the article can remain abstract (not IEEE) but still be encyclopaedic. The first task should be to focus this article and reconcile it with IEEE-754 by removing/merging content. ~ Booya Bazooka 03:13, 3 October 2006 (UTC)[reply]
I'm happy to help. I've been involved with IEEE 754r (both the page and committee). I think we would look at the related articles and figure out what is best served where. It is a complicated topic for a single page and so may well require pages to hold more of the details. So we should probably structure this review in terms of two questions:
  • what should go in Floating point and what should be moved elsewhere
  • How to best explain the topic at hand
So I seem to be on the same page as Booya Bazooka. Jake 21:25, 3 October 2006 (UTC)[reply]
Me too. mfc 16:35, 4 October 2006 (UTC)[reply]
An article titled "floating point" should first contrast with fixed point, indicate its applicability, and tie in the current ubiquity of IEEE standards. The body of the article should discuss historical variations and indicate design parameters such as radix, range, and so on. However, it should not include all the ridiculous POV recently inserted arguing for/against specific choices. Please consider a broad audience in the intro, including people who are not computer geeks and do not wish to be.
A few technical points worth remembering, to avoid assuming that non-integers must be floats:
And remember, many readers will not know that √2 is not rational. Nor will many readers, even those with some computer experience, appreciate the difference between floats and reals, nor the intricacies of the IEEE design choices (gradual underflow, NaNs, infinities, signed zero, rounding mode, interrupt choices). Be kind. --KSmrqT 23:11, 4 October 2006 (UTC)[reply]

A mass consensus of three so far suggests that the article's title "Floating point" should be the guide, which is fair enough. I'd note that the title is not "IEEE Floating point"; the IEEE standard allows variant choices, and many usages pay no attention to the standard at all, even if they are deprecated as not being the modern wave. Thus, IEEE FP is not equivalent to FP, just as C++ is not equivalent to programming. The sections in disfavour are "Designing a computer representation" to "Additional Special Values" and could be excised entirely. (Ruthlessness rules!) They do make reference to concepts (overflow, cancellation) that are described later (in "Behaviour of computer arithmetic", "Computer handling of floating point", and "Accuracy problems"), all of which apply to floating point whether IEEE or not. Thus, the sections "Designing etc" to "Additional etc", along with "Implementation in actual computers", should have been placed after those parts.

This would give a sequence from generalities about FP through behaviour of FP arithmetic to details about representation with links to IEEE etc. The general hierarchy is the notion of FP, behaviour of FP arithmetic, a variety of theoretical designs such as IEEE, actual designs in real computers (with conformance or not to theoretical desiderata), firmware implementing such designs, and for daft completeness, circuit designs. A full design involves considering all levels and their cross-influences. But this is an encyclopaedia attempt, not a design manual: no firmware or below then.
But there are consequences of design choices. The introduction of non-normalised numbers compels the firmware to follow extra steps even though on the vast majority of occasions they are unneeded, and worse, when they are needed, it is probable that the calculation is of low merit anyway, so un-normalised numbers might as well be discarded without great loss. But there appear to be some advantages, though I haven't had experience of this. Likewise, I forgot to mention that the gradual underflow numbers wreck the idea of choosing exponent ranges so that x and 1/x can both be represented, since the reciprocals of the gradual-underflow values will cause overflow. Floating point numbers and their usage in computers throw up many design choices and quirks.
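The reciprocal point is easy to demonstrate; a short C sketch (assuming IEEE 754 doubles, <float.h>, and the default round-to-nearest behaviour):

    #include <stdio.h>
    #include <float.h>

    /* 1/x stays finite for the smallest normal double, but overflows to
       infinity once x is deep enough in the subnormal range. */
    int main(void) {
        printf("1/DBL_MIN          = %g\n", 1.0 / DBL_MIN);            /* about 4.5e307 */
        printf("1/(DBL_MIN/1024)   = %g\n", 1.0 / (DBL_MIN / 1024.0)); /* inf */
        return 0;
    }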

Rather than presenting an unstructured list of such motiveless random choices and quirks, they were exhibited as considerations of a supposed design process, which, since it had a particular end point in mind, was tendentious rather than pure theory.

So, why the 8/24 split between exponent and significand fields? How is this relevant in other bases? Why might base 2 be preferred to base 16, or vice-versa? Many questions can be asked, quite properly. In the Lahey Fortran discussion BB, one fellow was puzzled as to why tan(pi/2) did not come out as infinite. But my schedule of values for this in 32, 64, and 80 bits was removed. Likewise, mention of truncation. Somewhere I found a very nice diagram of fp numbers and their spacing (wrt. different exponents), and the sum and product diagrams, but I've lost the web page. There is a lot to describe for a proper encyclopaedic coverage; that is, if we are not to produce short smug articles demonstrating superior knowledge via laconic obfuscation and vague generality, whose brilliance is intelligible only to those who already fully understand the issue.

As remarked, the various features should be assessed and assigned to either "Floating point" or "IEEE specification", possibly with duplication. But for now, I'm off for dinner. NickyMcLean 04:18, 7 October 2006 (UTC)[reply]

Hello, I see that the current reorganisation is still in progress (past ~2 months). I've dived in and rewritten the intro as it didn't actually set the scene for the article. I've pulled the other methods into the intro to explain how floating-point differs from other representations. I deliberately haven't used the scientific notation description as it doesn't really aid the reader. The explanation that I think is conceptually easier is that floating-point stores a window that moves over the bit-string. This may need more description than is there at present (i.e. are "window", "bit-string" etc. understandable terms)? I'm considering writing this in more depth, and changing the later parts that use scientific notation. What are other people's thoughts on this? Would it help, or harm the article to change this explanation? Amoss 17:11, 8 November 2006 (UTC)[reply]

Survey paragraph in intro

Following the suggestions of KSmrq, I have added some "survey of other approaches" material near the top. I'm not happy with having so much material (a whole screenful) above the index. Should this be moved to the end? Is it appropriate? Is it too technical?

Yes, I'm aware that rational arithmetic is a redlink. I think such an article should exist. William Ackerman 01:25, 13 October 2006 (UTC)[reply]

sin(π) is not zero

I have a difficult time digesting the naked assertion that sin(π) is not zero. What we mean (though I don't think this is the way to write it in the article) is probably something like (double)sin((double)(π)) is not zero. --Jake 17:01, 17 October 2006 (UTC) How should we distinguish honest to goodness mathematical constants and functions like sin() or π, from their floating point approximations?[reply]

  • We could introduce names like sinf and fpi to use a C-like treatment.
  • We could use some typesetting contrivance, distinguishing the mathematical sin from the floating-point one by typeface.
  • We could stick with mathematical description of what the floating point operations are doing. This is commonly done by introducing a rounding function such as fl to round to floating point. With this we are saying that fl(sin(fl(pi))) is not zero, which is much less shocking.

(I am leaning towards the mathematical treatment for precision, but it might get awkward. Are there other pages that have had to deal with such notational issues? --Jake 17:01, 17 October 2006 (UTC))[reply]

Well, I had presented a table of results (since ejected) of tan(π/2) computed in real*4, real*8 and real*10, which used the phrasing tan(approx(π)/2); that seems less obscure than an eye-baffling proliferation of name variants for the purely Platonic Notion of Sine et al. More typing though, but our sinning has its costs. I have also used π when Platonic, and pi (as it would appear as a name in a source file) when dwelling in computational improprieties. As one does. More generally, one could use the full name (with leading capital) for the proper function of Real (err, Platonic) numbers, and the lower-case abbreviated name for the computational form, but there would have to be introductory words describing this new convention. Endless articles use sin meaning the proper Sine, so these usages being different would have to be signalled. NickyMcLean 20:33, 17 October 2006 (UTC)[reply]

Yes, I guess we have to be careful. My intention was to say something along the lines of "Attempting to compute tan(π/2) with a computer's floating point mechanism will be unsuccessful for reasons of ...." Where, by "tan(π/2)" we mean the eternal mathematical ("platonic") truth, of course. I guess the statement "π/2 doesn't exist" was a clue to the thinness of the ice that I was standing on. The closeness of the notation (which is the whole reason for programming languages being designed the way they are!) makes this distinction awkward.
So we probably ought to say something like "Attempts to compute tan(π/2), and observe that it is infinite, will be unsuccessful, because the argument to the tangent function won't actually be π/2. In fact, the computation won't even overflow; the result will be -22877332.0 blah blah blah." (I consider the fact that the exact result will actually be within exponent range to be interesting, and it illustrates the surprising things that can happen.) I'd prefer to do this without getting into funny typefaces or capitalization or whatever, to distinguish the "mathematical" entities from the "computer" ones. Perhaps something along the lines of "The following attempted computation will ...." or "The following C code will ....", though we should try to avoid filling the article with code snippets.
Similarly with sin(π). Of course we can't say things like "π doesn't exist." That's just sloppiness on my part.
The reason I took out the earlier treatment of tan(approx(π)/2) etc. was that it was about the behavior of a specific system, language, and compiler, including a mention that Turbo Pascal doesn't provide the "tan" function, so one has to divide sin by cos. The problem wasn't with the notation itself. Discussion of the idiosyncrasies of individual systems would be an unending rat-hole. We have to stick to "pure" IEEE. (Some would say we shouldn't even get that specific.) William Ackerman 22:53, 17 October 2006 (UTC)[reply]
Ah well, the reason I was so specific was to assist anyone attempting to reproduce such a calculation, with an example of one variation possibility, as I expect that the slightest change will produce different results. Thus, if they were to have a tan function, their results would surely differ because their tan function surely wouldn't use sin/cos; similarly, their library routines would differ from those used by Turbo Pascal and would, or at least could, vary from language to language, release version, etc. There are also the fsin and ftan (or tanf: I forget) op codes for the 8087 et seq. floating-point cruncher, which would use its own processes. Or, all languages would rely on the floating-point ftan if there was an 8087 or equivalent device present. Would the Weitek have differed? And is the software emulation of a missing 8087 exactly equivalent except for speed?
I too was surprised that the results were within the dynamic range, then on noting that the higher-precision approximations to π gave higher values, decided that this was to be expected (ahem...) [On returning from a late lunch] It is of course obvious. tan(x) approaches infinity near π/2 in the same order as 1/x does near zero. Thus, the result of tan(approx(π)/2) should go as 1/(π - approx(π)), and with 7-digit precision and π being about 3, that gives a result of about 10**7, and so forth for higher precision (see the sketch at the end of this comment). After the observation, it is seen that it could have been predicted...
Messing about with funny typefaces or capitalisations or a prefix/suffix to the normal name would be distracting. Careful phrasing requires a lot of thought though, and even then, other people still can misread it. NickyMcLean 02:08, 18 October 2006 (UTC)[reply]
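The order-of-magnitude argument above can be checked numerically. Here is a small C sketch (it assumes IEEE 754 single and double precision, and the common though non-standard M_PI constant from <math.h>) comparing the -1/((x - π)/2) estimate with what the library's tan returns:

    #include <stdio.h>
    #include <math.h>

    /* Compare the library's tan(x/2), where x is the single-precision
       approximation of pi, against the estimate -1/((x - pi)/2).
       M_PI (the nearest double to pi) is assumed to be available. */
    int main(void) {
        float x = (float)M_PI;                 /* single-precision approx of pi */
        double delta = (double)x - M_PI;       /* its error, to double accuracy */
        printf("estimate  %.9g\n", -1.0 / (delta / 2.0));
        printf("library   %.9g\n", tan((double)x / 2.0));
        return 0;
    }
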
This may be a quibble, but one thing I dislike about using approx is that the reader might be lead to think of it as some approximation rather than a very particular one. It sort of leaves open that maybe approx(2) could be something other than 2. Or that approx(approx(x)) could be different from approx(x). It is also both not very short and yet an abbreviation. If we use an abbreviation, I'd recommend fl(x). Or if we want something more familiar round(x). --Jake 05:03, 18 October 2006 (UTC)[reply]
Ah, quibbles. Aha. "Approx" would be described in the preceding text, one hopes... The difficulty with using Round(x) is that the question arises: round how? Aside from issues such as rounding where (not just to integer values), there are many versions: round left, round right, round in, round out, and round near (with dispute over 0.5). One mention of truncation (round in) was expunged. Rather than "fl", perhaps "fp" or "float"? One wishes to avoid a torrent of terms needing definition, or the sort of symbol soup that quickly results from describing an operation with distinction between + and FP+ and so on, as when arguing the details of precision and error propagation. Alternatively, "Normalise(x)" could be described, though "Norm" is a proper term in algebra, meaning of course something else. NickyMcLean 19:33, 18 October 2006 (UTC)[reply]
Yes, rounding modes are lurking about behind the scenes. Often this is what you want though. If "round" means "round according to the prevailing rounding direction" then a correctly rounded sine function would return round(sin(round(x))). Then you just need to deal with the fact that this really is talking about one function (in the mathematical or platonic sense) for each rounding direction. With fl/fp/float you would need to decide if you want it to always mean round to nearest (ties to even) or you are back to it being the same as "round". Of the last three, my impression is that fl is most often used in the literature, but I am not that widely read in the field. Looking at the pages about numerical analysis that this page links to doesn't shed any light. Maybe we are better off taking this as a hint that we need to use more English and less jargon? --Jake 22:31, 18 October 2006 (UTC)[reply]
I agree: we should explain it in English. Using contrived or invented functions will just lead to the need to define what we mean by those functions, and make things confusing. We should try to avoid filling the article with equations and formulas. Readers who want same can drink their fill from the referenced article "What Every Computer Scientist Should Know...." Accordingly, I have made my attempt at correcting the heresy, using a code snippet. Is this OK? William Ackerman 15:24, 19 October 2006 (UTC)[reply]
Herewith, a table, produced via sin/cos in Turbo Pascal, which offers a real*6 = 48-bit format.

Also, the non-representability of π (and π/2) means that tan(π/2) cannot be computed in floating point arithmetic, because any (finite-precision) floating point number x would be either slightly higher or lower than π/2, so that (x - π/2) would not be zero. The result of tan(x) for x near π/2 is closely related to 1/(x - π/2), which would not yield a result of infinity, nor would it even overflow [because the number of bits in the precision is smaller than the exponent range, so the reciprocal of so small a difference is still representable].

      precision     x = approx(π)                          Tan(x/2)
      32 bit        3.14159274101257324 (High)    -2.28773324288696E+0007
      48 bit        3.14159265358830453 (Low)      1.34344540546582E+0012
      64 bit        3.14159265358979312 (Low)      1.63245522776191E+0016
      80 bit        3.14159265358979324 (High)     Divide by zero.
      π (exact)     3.14159265358979323846264338327950288...

Although "Divide by zero" would be a suitable computational result for tan(π/2), it is incorrect for tan(x/2) because (x/2 - π/2) is not zero! However the routine computing the trigonometrical function itself cannot have a floating point constant with value π/2 and must compute the difference between x/2 and its best representation of π/2, which may have the same value as x/2.

By the same token, a calculation of sin(π) would not yield zero. The result would be (approximately) 0.1225 × 10^-15 in double precision, or -0.8742 × 10^-7 in single precision. Nevertheless, sin(0) = 0, because x = 0 can be represented exactly.
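For anyone who wants to reproduce the double-precision figure, a minimal C sketch (assuming IEEE 754 doubles and the non-standard but widely available M_PI constant from <math.h>); the single-precision value can be had the same way with sinf and a float cast:

    #include <stdio.h>
    #include <math.h>

    /* sin of the nearest double to pi is tiny but nonzero; sin(0) is exactly 0.
       The exact digits printed depend on the platform's math library. */
    int main(void) {
        printf("sin(M_PI)   = %.17g\n", sin(M_PI));
        printf("tan(M_PI/2) = %.17g\n", tan(M_PI / 2.0));   /* large but finite */
        printf("sin(0.0)    = %.17g\n", sin(0.0));
        return 0;
    }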

Similarly, when attempting to solve f(x) = a, there may be no floating point number x such that (f(x) - a) is calculated as zero, or many (not just one), further, close to a solution point the values of (f(x) - a) as computed are likely to change sign erratically as well.

The C prog. offered -22877332.0 for 32 bit, but 16331239353195370.0 for 64 bit, rather different. (These digit strings all have excessive digits; the ".0" part is dubious!) Curiously, the 80-bit result, though as desired, is incorrect, because the parameter was not of course exactly π/2. To investigate this behaviour properly I'd need an even-higher-precision calculator. I have access to MatLab, but I'm not too familiar with its features. I think it happens because (x - π/2) is actually computed as (x - pi/2), where pi/2 is the trig. function's internal representation of π/2, which is the same as x, and thus the difference is (incorrectly) zero. Which is correct. Aha. So the annotation "(High)" is probably wrong (though on the face of it correct), because the decimal representation has rounded ...38 to ...4. One should really present these numbers in binary...
The "Enough digits to be sure we get the correct approximation" touches on a vexing issue. In the case of fortran and algol (where I have detailed experience), the compiler convention for the manipulation of decimal-expressed constants and expressions involving them can variously be: high-precision decimal arithmetic converting only the final result to the precision of the recipient (I dream...), floating point of maximum precision converting the result etc, interpreting all constants as single precision unless otherwise indicated (by using 1.23D0 for example, or only possibly, by the presence of more significant digits than single precision offers), or yet other variants, plus in some cases the setting of compiler options say to interpret every constant as double precision.
For this reason I'm always unhappy to see long digit strings (which could be taken as single-precision, no matter how long) and would always use Double Pi; pi:=1; pi:=4*arctan(pi); replacing "double" by whatever is the maximum precision name. This relies on the generic arctan function being of the precision of its parameter (thus the doubt with "arctan(1)"), and I have no fears over the precision of a constant such as one when assigned to a high-precision variable.
All this just reinforces my point that we shouldn't be going into various system idiosyncrasies. The correct values are:
        precision: single
            tan(approx(pi/2))          = -22877332.4288564598739487467394576951815025309.....
            approx(tan(approx(pi/2))) = -22877332.0000000000  (Yes!  Exactly!)

        precision: 40-bit significand (as in 48-bit Turbo Pascal)
            tan(approx(pi/2))          = 1343445450736.380387593324346464315646243019121.....
            approx(tan(approx(pi/2))) = 1343445450736.00000000  (Exactly)

        precision: double
            tan(approx(pi/2))          = 16331239353195369.75596773704152891653086406810.....
            approx(tan(approx(pi/2))) = 16331239353195370.0000  (Exactly)

        precision: 64-bit significand (as in 80-bit "long double")
            tan(approx(pi/2))          = -39867976298117107067.2324294865336281564533527.....
            approx(tan(approx(pi/2))) = -39867976298117107068.0000  (Exactly)
The correct rounded results really are integers. Comparing these with the values given in the previous box shows that library routines for various systems are not always correct. A correct implementation of 80-bit floating point does not overflow or get a divide by zero.
I don't follow what you are explaining above [presumably for tan of pi/2, not pi], but I have limited time just now. I hope it is clear that rounding does not always occur at the integral level. Why should tan(number close to pi/2) come out as an integer, exactly? The value of pi/2 is somewhere up to ULP/2 from π/2, and 1/diff need not be an integer. But I'd have to go through the working in binary arithmetic to be certain of exactly what might be happening. NickyMcLean 02:16, 21 October 2006 (UTC)[reply]
The table above (for, e.g. double) gives two numbers: tan(approx53(pi/2)) and approx53(tan(approx53(pi/2))). (Let's call these functions "approx53" and so on -- round to 53 bits of precision.) The second number is not rounded to an integer; it is rounded to 53 significand bits. The fact that it is nevertheless an integer is nontrivial (though not hard to see). William Ackerman 16:49, 23 October 2006 (UTC)[reply]
The questions of how various systems (8087, Weitek, software library) compute these functions is a subject that, while fascinating (I've been fascinated by them in the past), we just can't go into on the floating-point page.
I believe that modern computer language systems do in fact parse constants correctly, so that my code snippet with the comment "Enough digits to be sure we get the correct approximation" will in fact get approx(pi). And they do in fact get the correct precision for the type of the variable. Though there may still be some systems out there that get it wrong. Using 4.0*atan(1.0) to materialize approx(pi) really shouldn't be necessary, and can get into various run-time inconveniences.
By the way, I don't think Matlab has higher precision than plain double. William Ackerman 23:11, 19 October 2006 (UTC)[reply]
Possibly it is Mathematica, then. A friend was demonstrating the high-precision (ie, large array holding the digits, possibly in a decimal-related base at that) of some such package, so I asked for a test evaluation of exp(pi*sqrt(163)); a notorious assertion due to Srinivasa Ramanujan is that this is integral. This is not the normal floating-point arithmetic supplied by firmware/hardware and might support thousand-digit precision, or more. NickyMcLean 02:16, 21 October 2006 (UTC)[reply]
That's nonsense. Ramanujan made no such assertion. The value is 262537412640768743.999999999999250072597198185688879353856337336990862707537410378... —uncanny, but not an integer. William Ackerman 16:49, 23 October 2006 (UTC)[reply]
Incidentally, we haven't mentioned that in solving f(x) = a, the result of (f(x) - a) for various x close to the solution value will not necessarily yield the nice mathematical behaviour no matter how loud the proclamations of "continuity" are. There may be no x for which f(x) = a, and, adjacent to the solution, (f(x) - a) as computed will likely bounce about zero in an annoying cloud of dots. NickyMcLean 20:57, 19 October 2006 (UTC)[reply]
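That "cloud of dots" effect is easy to reproduce. The following C sketch (an illustration only) evaluates the expanded form of (x-1)^3 in IEEE 754 double precision near the root at x = 1; the computed values are dominated by cancellation noise and do not change sign cleanly:

    #include <stdio.h>

    /* Near x = 1 the true value of (x-1)^3 is far below the rounding error
       of the expanded polynomial, so the computed sign bounces erratically. */
    int main(void) {
        for (int i = -5; i <= 5; i++) {
            double x = 1.0 + i * 1e-7;
            double f = x * x * x - 3.0 * x * x + 3.0 * x - 1.0;  /* = (x-1)^3 exactly */
            printf("x = %.8f   f(x) = % .3e\n", x, f);
        }
        return 0;
    }
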
Likely a better topic for root finding though introducing it here might be reasonable. --Jake 21:25, 19 October 2006 (UTC)[reply]

Editors have been known to remove external links from this article if they are added with no history comment and with no discussion on the talk page. Please discuss the value of your proposed link here if you want it not to be cleaned out regularly. (You can help!)  EdJohnston 17:59, 24 October 2006 (UTC)[reply]

I would like to propose the following external link to a relevant scientific paper:

Izquierdo, Luis R. and Polhill, J. Gary (2006). Is Your Model Susceptible to Floating-Point Errors?. Journal of Artificial Societies and Social Simulation 9(4) <[3]>. This paper provides a framework that highlights the features of computer models that make them especially vulnerable to floating-point errors, and suggests ways in which the impact of such errors can be mitigated.

The Floating point page is, as you see, mostly devoted to bit formats, hardware, and compiler issues. The paper by Izquierdo and Polhill that you cite is about making numerical calculations robust. That seems to fit with the traditional definition of Numerical analysis. I suggest you consider adding it there, or at least propose it on their Talk page. EdJohnston 19:38, 2 November 2006 (UTC)[reply]