# Talk:Floating-point arithmetic/Archive 3


## Misconceptions

I have reverted many of the changes to the "misconceptions" section. This section should be about the "FP numbers are approximations" misconception, and how one can replace it with clear thinking. It should not be about the many accuracy problems that can arise -- that topic is treated extensively later.

I believe that the detailed bit-by-bit analysis of "0.1" and "0.01", while tedious, is important for an understanding of what is actually going on.

William Ackerman 16:36, 31 July 2006 (UTC)

## Storage of Exponent

In computers, it is often helpful to store the exponent as the exponent plus an offset to aid in hardware comparisons. A plain binary encoding is often not preferable.

Yes. I'm not sure whether you are asking why IEEE does it this way (reasoning given below) or suggesting that maybe existing hardware doesn't use offset arithmetic, but ought to. The IEEE representations do in fact use offsets on the exponent. Is that not clear in the "computer representation" section? William Ackerman 23:45, 17 August 2006 (UTC)
Dealing with floating-point numbers and their representation in bits is close to the hardware, which can do whatever the designer wishes. For instance, the representation of a normalised floating-point number in base two can use an implicit high-order bit: since a normalised number always has it set, there is no need to represent it in the storage. Thus a 20-bit mantissa/significand really stands for a 21-bit number, and the hardware would be designed accordingly. Likewise with the exponent field. Suppose it is eight bits. This could be interpreted as an 8-bit two's complement number (-128 to +127), a "ones-complement" number (-127 to +127, with two bit patterns for zero, +0 and -0) or an unsigned integer (0 to 255), or... an unsigned integer with an implicit offset.
In general terms, one would like floating-point arithmetic to follow the rules of real number arithmetic as much as possible. Thus, if x is a fp number, so also should -x be. (Zero cannot be represented as a normalised floating point number) Accordingly, rather than use a 2's complement form for the significand/mantissa a sign bit is used thus avoiding the annoyance found in 16-bit integers where -32768 can be represented, but +32768 cannot.
The DEC PDP-6/PDP-10 used full-word 2's complement arithmetic to handle negative floating point numbers. It had problems. William Ackerman 23:45, 17 August 2006 (UTC)
Similarly, if x is a fp number, 1/x should also be a fp number. Alas, this desideratum can only be met approximately in many cases. In binary, fp numbers with the same exponent (say zero) range from 0.5 (exactly, as 0.1000...) up to but not reaching 1 (as 0.111111...); their reciprocals thus range from 2 (exactly) down to but not quite reaching 1, and thus the exponent must be changed to renormalise the value. In the case where fp numbers are normalised into 1.x rather than 0.1x a fp number with exponent 0 ranges from 1 (exactly, being 1.00000...) up to but not quite 2 (being 1.111111...) and so their reciprocals range from 1 (exactly) down to almost 0.5, and thus the exponent must be changed so as to normalise to the range 1.x
So there are asymmetries, and a "plain binary" interpretation of the bit pattern of the exponent field might not be helpful.
Another consideration is that the representation of the value of zero might well be arranged so that all bits are zero; this would make initialising a wad of storage containing integers and floating-point numbers particularly easy. NickyMcLean 21:32, 16 August 2006 (UTC)
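To make the offset-exponent point concrete, here is a minimal Python sketch. It assumes IEEE 754 single precision as packed by the standard `struct` module; the helper name `float32_bits` is illustrative and not from the discussion. It shows that, for non-negative values, the raw bit patterns read as unsigned integers sort in the same order as the numbers themselves, and that zero is the all-zero pattern.

```python
import struct

def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

values = [0.0, 1e-10, 0.5, 1.0, 1.5, 2.0, 1e10]   # already in increasing order
patterns = [float32_bits(v) for v in values]

for v, p in zip(values, patterns):
    biased_exponent = (p >> 23) & 0xFF             # 8-bit field, stored with offset 127
    print(f"{v:>8g}  bits=0x{p:08X}  biased exponent={biased_exponent}")

# With a biased exponent, a plain unsigned compare orders non-negative floats.
print(patterns == sorted(patterns))
```

Negative numbers need separate handling because the format uses a sign bit rather than two's complement, which is the trade-off mentioned above.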

## Too much info?

This article contains a tremendous amount of information! It may be too much information, if there is such a thing. To me, it reads more like a textbook than an encyclopedia article. I'm new here and I'm not quite sure what to do about it. Any ideas? Sns 04:49, 21 August 2006 (UTC)

It is better to have too much than not enough information. Michael.Pohoreski 17:00, 16 March 2007 (UTC)

Any discussion of floating-point numbers involves many bits. (Ahem) NickyMcLean 21:00, 21 August 2006 (UTC)

As the article evolves, we may want to carve out sections into their own (or other) articles and have a brief treatment with reference in the main article. Let's do the clean up first and then think about refactoring. Jake 21:16, 3 October 2006 (UTC)

## Reorganization

Clearly something needs to be done. Here is an outline of what I propose. I may do it over the weekend, if no one else reorganizes it or indicates here that they don't agree.

The main thing is that the "common misconceptions" section goes into way too much detail near the top of the article. Now, talking about misconceptions early on is important, because some misconceptions are extremely widespread. But the mention near the top should be very brief. Just say that a FP number represents an actual real number exactly. FP numbers are not "approximations to real numbers", and they do not "represent some small range of real numbers". All the rest, including the details of 0.1 * 0.1, can wait until after people know how FP operations work. That material is way too intimidating and confusing that early in the article.
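The claim that an FP number is an exact real number, and that the surprise in 0.1 * 0.1 comes from rounding in the conversions and operations, can be checked with a short Python sketch. It assumes Python's `float` is an IEEE 754 double, which holds on common platforms; the variable names are illustrative.

```python
from fractions import Fraction

tenth = 0.1
exact_tenth = Fraction(tenth)       # the exact rational value the double holds
print(exact_tenth)                  # a ratio whose denominator is a power of two
print(exact_tenth == Fraction(1, 10))

product = tenth * tenth             # rounded product of two exactly-known values
print(product == 0.01)
print(Fraction(product) - Fraction(0.01))   # exact difference between the two doubles
```

The stored value is a perfectly definite rational number; it is simply not the number 1/10 that the decimal literal named.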

Then the details of the representation, the operations, rounding, overflow, etc.

Then the "problems". These problems are the consequences of the way the operations work (which will have been explained by that point) and the expectations of real arithmetic. A lot more work is needed here. First, the "peculiar properties" and the "problems" sections need to be combined. Other than that, I think there are a lot of open questions about how to proceed. Perhaps there should be a first part about "what can go wrong" and a second part about "what to do about it". That is, the second part discusses the Kahan Summation Method and the example of Archimedes' calculation of pi.
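As an illustration of the "what to do about it" part, here is a minimal sketch of compensated (Kahan) summation in Python. The function name `kahan_sum` and the test data are assumptions made for the example; `math.fsum` is used only as an accurate reference.

```python
import math

def kahan_sum(values):
    """Sum floats while carrying a running correction for lost low-order bits."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # apply the correction from the previous step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost (algebraically zero)
        total = t
    return total

data = [0.1] * 100_000

naive = 0.0
for v in data:                          # explicit loop: plain left-to-right addition
    naive += v

reference = math.fsum(data)             # accurately rounded sum from the standard library
print("naive error:", abs(naive - reference))
print("Kahan error:", abs(kahan_sum(data) - reference))
```

The naive loop is written out by hand because the built-in `sum` may itself use a compensated algorithm in recent Python versions.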

Perhaps the material on how to code things to get better accuracy should be a completely separate article. It's a very big topic in general, and this is an encyclopedia. William Ackerman 15:24, 23 August 2006 (UTC)

I'd agree that a fp number does indeed represent its own value exactly! How about starting off with the "scientific notation" style, then introducing the notion of a limited (fixed) number of digits which means that not all Reals can be represented, even notionally. (Actually, only Rational numbers can be written down as finite digit strings, and not all of them either, such as 1/3 in base 10.) Introducing an exponent allows a wider range than fixed-point yet maintains the same relative precision without introducing many more digits to a string. Thus a six-digit fixed-point number might range from 000.001 to 999.999 only. There is an article on fixed-point, though it may not relate all that well.
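The decimal-first presentation suggested here can be tried directly with Python's `decimal` module, which implements decimal floating point with a configurable number of significant digits. The sketch below assumes six digits, matching the example above; the variable names are illustrative.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 6                          # six significant decimal digits
    large = +Decimal("123456.7")          # unary plus rounds to the context precision
    small = +Decimal("0.001234567")
    third = Decimal(1) / Decimal(3)       # 1/3 has no finite decimal expansion

print(large, small, third)
```

Both `large` and `small` keep six significant digits even though they differ by eight orders of magnitude, which is the "same relative precision" property that a six-digit fixed-point format lacks.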

All of this can be done in decimal, then computers are introduced, with definitely fixed sizes for the various fields. The particular packing scheme's choices (sign, mantissa, exponent, where in the storage word) need not be described in detail, and especially, the packing style of IEEE could be a separate article, given that queries about IEEE might be the starting point for someone's search. In other words, this is an article about fp arithmetic in general, and one particular representation should not be heavily discussed even though millions of cpus use versions of it.

Others have raised this complaint. I really think the bit-by-bit explanation of the format belongs here. This is the page of a central concept in computers. There was a time when IEEE representation was just a proposal, but it's now the way nearly every computer operates. It belongs here, even if it is duplicated on the IEEE page. William Ackerman 23:53, 30 August 2006 (UTC)

After the definitions (where everything seems to be good, because we're still really thinking about Real numbers) come the usages. The consequences of finite precision, and finite dynamic range, and the special "numbers" such as +-infinity, or other states etc. that one adjoins to the real number system and would like available for fp arithmetic also.

But there are still lots of bits... NickyMcLean 04:33, 24 August 2006 (UTC)

I have made a major move of the hideous "accuracy/misconceptions" material, putting it down near "peculiar properties". I believe the "accuracy", "peculiar properties", "problems", and "nice properties" sections now contain most of the horribly disorganized material of the page. William Ackerman 23:57, 30 August 2006 (UTC)

Someone left a comment in the text (that is not shown when the article is presented) suggesting that the explanation of IEEE format could be replaced by a reference to the IEEE format description (it being extended as necessary to add omitted points). In other words, the description of floating point numbers could be entirely about floating point numbers as an idea, without detailed description of a particular implementation. This seems reasonable to me... For now though, I'm off to the movies (Breakfast on Pluto) NickyMcLean 03:10, 9 September 2006 (UTC)

## Cutting back on IEEE info

A number of people have suggested, one way or the other, that this article duplicates material from the IEEE floating-point standard and shouldn't do so. I strongly disagree, at least to the level of detail presently given on this page. Of course, the IEEE page goes into enormous detail that doesn't belong on this page. However, I believe that a person should be able to figure out the bit representation of a floating-point number just from this page. It should definitely go into that level of detail. That means rounding, exponent bias, how the fields are packed, etc.

I think it depends on what you mean. I disagree quite strongly on the question of being able to decode any floating point number from the information on this page. This is really best served (if at all, but I agree that somewhere in wikipedia would be appropriate) on the page for the particular format. This page will have enough detail. That said, it may be valuable to walk through one format in detail in the interest of being explicit. But even then, we might be better off just referring the reader to some particular FP format's page such as single precision that discusses the encoding in detail. Many people who write floating point algorithms don't really need to concern themselves with the encoding details, so I don't think this is essential to the topic of this page. --Jake 00:33, 12 October 2006 (UTC)
I think I'm coming around a little more to your point of view. The single precision and double precision pages seem to say, much more succinctly than IEEE, the gist of what needs to be said. William Ackerman 01:36, 13 October 2006 (UTC)

On the other hand, I think that something that does not belong here is discussion of various vendors' implementations and their shortcomings. This will become a never-ending sink-hole. The descriptions on this page should be in terms of what the transformations mean on (perhaps ideal) correctly-performing implementations. I believe that IEEE has standardized things sufficiently well that such descriptions are meaningful.

I don't know whether discussions of problems with this or that vendor's implementation belong in some other WP pages. I would guess that there are other pages out there in cyberspace that address such things, and that they are kept up-to-date by whoever runs them. Perhaps external links to such pages are the way to go. William Ackerman 22:22, 20 September 2006 (UTC)

Alright, here follows the piece I removed... NickyMcLean 00:25, 28 September 2006 (UTC)

Talk:Floating point/Removed sections

I think the section on Exceptional values and exceptions under the IEEE standard should be mostly addressed in IEEE floating-point standard. In this article we should really be setting up the questions that need to be addressed, and leave the other page to answer them (for that particular case). If we wanted to have a page on VAX floating point or Cray floating point, they would address these situations differently. --Jake 00:33, 12 October 2006 (UTC)

## Hypothetical design vs. actual floating-point

I very strongly disagree with the recent changes. A theoretical treatise on how one might design floating-point from first principles on a deserted island is not what this WP page should be about. It should be about how floating-point actually works. Given that practically every computer on the planet does it the same way (IEEE-754), and the few that don't follow IEEE precisely use formats that are quite similar, the page should follow those conventions. It's similar to the treatment of calculus: we all follow the notation devised by Leibniz; we don't dwell on possible alternative notations.

For example, discussion of "... the obvious rule would be ..." and "... these would be called ±Overflow and ±Underflow ..." are inappropriate. They are underflow and overflow; computer designers long ago settled on that, and on how they arise and how they are handled. The mechanism of gradual underflow and "denorms" is well established, and should be discussed. Issues of how computer programs can handle missing data, unreadable data, and malfunctioning meters are totally outside of the domain of floating point; they are handled by other software mechanisms.

People will come to this page wondering what hexadecimal 3FFBB67AE8584CAA means. The page should tell them. William Ackerman 18:17, 29 September 2006 (UTC)
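For readers who do arrive with such a word in hand, the decoding can be sketched in a few lines of Python. This assumes the word is an IEEE 754 double (binary64: 1 sign bit, 11 exponent bits with offset 1023, 52 fraction bits with an implicit leading 1); the function name `decode_binary64` is illustrative.

```python
import struct

def decode_binary64(hex_word):
    """Split a 16-hex-digit word into IEEE 754 double fields and its value."""
    bits = int(hex_word, 16)
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF        # stored exponent, offset 1023
    fraction = bits & ((1 << 52) - 1)             # 52 explicit significand bits
    value = struct.unpack(">d", bits.to_bytes(8, "big"))[0]
    return sign, biased_exponent, fraction, value

sign, biased_exponent, fraction, value = decode_binary64("3FFBB67AE8584CAA")
print("sign:", sign)
print("unbiased exponent:", biased_exponent - 1023)
print("significand: 1 +", fraction, "/ 2**52")
print("value:", value)
```

For this word the sign is 0 and the stored exponent equals the offset, so the value is just the significand, a number a little above 1.73. Zero, subnormals, infinities and NaNs use the reserved exponent values 0 and 2047 and are not covered by the "1 + fraction" reading.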

I doubt that "People will come to this page wondering what hexadecimal 3FFBB67AE8584CAA means", and if they do, the place to decode it would be the specific format's page, which exists and is linked to, the IEEE 754 which is surely the place for describing exact encodements with examples and usages. If their bit sequence is not composed in accordance with the IEEE standard, then they would have to look at the other pages that do (or should) describe their system's specific format, be it IBM, Cray, or whatever. The idea was to follow someone's suggestion of providing a general description of how a floating-point number is encoded, after which the specific descriptionS for specific systemS can be followed as particular exampleS in the appropriate other pageS.
Curiously enough, I do live on an island, but not deserted. And curiously enough the "first principles" description ended up with a design equivalent to the IEEE standard (so the purity of its generality is doubtful), though I did not elaborate on the +-Inf, etc. possibilities, that are indeed handled in that standard but not in other floating-point forms and are thus not general to the notion of a floating-point number representation, and so are not so suitable for inclusion in an article on the general notion of a floating-point number which should not be tied to a specific implementation even if one that is used on many computers, because it is not used on all computers, nor are its features well-accommodated by computer languages either, because they are hoped to be workable on all (or many) computer designs. Bad events during a computation can be finessed by being ignored, using special values (zero, NaN, etc), or triggering a signal that is dealt with by an area-wide catcher (On ZeroDivide ...) or a statement-specific catcher. It remains a messy subject.
One could question the merit of removing an entire exponent's set of significands, thereby to employ 24 bits to distinguish just a half-dozen or so states. With such a vast storage otherwise being wasted, there is indeed an opportunity for "user-defined" values, rather than use auxiliary state variables in exactly the way that such state variables should be used to notify +-Inf, NaN, etc. if a purist position were to be taken that a storage area representing a floating-point number should represent just and only a number. The potential for convenience has been followed in introducing NaN, and indeed a few special values such as the result of Sqrt(neg) though the availability of these features will depend on your system's software (Compaq Fortran raises an error, though there may be options to play with); this facility could be widened to further serve the user.
Although the gradual underflow scheme does indeed remove the explicit "Underflow" condition (though not entirely, precision is lost) via the employment of un-normalised significands, there are issues with this design choice. I have seen outbursts from the designers of numerical software who snarl about "obscene" algorithms (and yes, I forget where I read this). Before this feature is added to the design, there is only one class of numbers, normalised numbers (with implicit 1 bit possible); after, there are two classes of numbers, and they have different precision, and if there had been an implicit 1 there is one no more. Someone struggling to devise an algorithm that will work swiftly, with minimum loss of precision for all possible arguments, will be troubled, and if there are many arguments, the problem worsens rapidly. Consider a procedure for Sqrt(x): halve the exponent, was it odd or even, lookup table for the first few bits of the significand (normalised or not? 1 bit or not?) for the initial approximation to a Newton-Raphson iteration of say 4 cycles. (Testing convergence takes about as long as another cycle.) A careful examination of the options and consequences is not trivial. Escalating to say finding the roots of a quadratic (-b +- sqrt(b**2 - 4*a*c))/(2*a) with the necessary variants on the formula for certain ranges of values is what provokes snarls about "obscene" algorithms. Such annoyances would be reduced by abandoning interest in unnormalised values, saying that the user can damn well make the effort to scale their data sensibly, and if there is an underflow, too bad: signal an error (to kill the prog.), treat as zero, employ the "Uflo" value.
Could it not be said that the description of the design is insufficiently general, in that it is linked to base two rather than to any base as would be more appropriate for the general heading "floating point"; what I wanted to avoid was a cloud of generalities without shape, so that's my excuse for that. I also forgot to wonder why the sign bit of the significand is separated from the rest of the significand by the exponent field, an oversight. On the other hand, tying the description to the IEEE standard is surely too specific, given that there is an IEEE page with such specifics (perhaps without enough of them, there) and potentially at least, a horde of pages describing every known format.
Let third parties speak! NickyMcLean 04:36, 30 September 2006 (UTC)
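One of the "necessary variants" of the quadratic formula mentioned above can be sketched as follows. This is a common rearrangement that avoids subtracting nearly equal quantities, not a method taken from the discussion; the function names are illustrative, and the sketch assumes real roots, a nonzero `a`, and `b` and `c` not both zero. It does not attempt to handle overflow or underflow in `b*b - 4*a*c`, which is where the further case analysis comes in.

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c, avoiding cancellation in -b +- sqrt(...).

    Assumes a != 0, a non-negative discriminant, and b, c not both zero.
    """
    root_disc = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(root_disc, b))   # same signs: the terms add, never cancel
    return q / a, c / q                            # second root from the product c/a

def textbook_small_root(a, b, c):
    """The formula as usually written; cancels badly when b*b >> 4*a*c and b < 0."""
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

big, small = quadratic_roots(1.0, -1e8, 1.0)       # true roots are close to 1e8 and 1e-8
print("rearranged:", big, small)
print("textbook small root:", textbook_small_root(1.0, -1e8, 1.0))
```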

### Call to 3rd parties

I think we need input from other experts in the field. This article has a "needs expert attention" flag, but experts in this field are extremely rare, so I'm going to request attention from specific people that I have found contributing to this and related articles in the past. Those experts are:

I will leave notes in their talk pages. If this isn't really your field, my apologies.

### how to proceed

Basically, the philosophical difference is one of describing floating-point in very general and theoretical design terms vs. describing the way it works in practice in existing computers. William Ackerman 22:13, 2 October 2006 (UTC)

Well, I'm flattered to be on the expert list. I do happen to now be familiar with IEEE-754 (just took an exam on it last week...), so I suppose I can input something. Here's my initial impression. The first thing I said to myself as I skimmed over this article was "hmmm, most of this needs to be removed because it's just duplicate info from the IEEE article." What I think should be on this page (written without knowing whether this content exists): the general concept of sign, mantissa, exponent; the history of floating-point representation; other methods of floating-point representation (if they exist); methods of performing floating-point arithmetic (if that isn't specified by IEEE-754). I agree with William in that the article feels way too original-math-researchy, and needs some great rewording. However, perhaps the article can remain abstract (not IEEE) but still be encyclopaedic. The first task should be to focus this article and reconcile it with IEEE-754 by removing/merging content. ~ Booya Bazooka 03:13, 3 October 2006 (UTC)
I'm happy to help. I've been involved with IEEE 754r (both the page and committee). I think we should look at the related articles and figure out what is best served where. It is a complicated topic for a single page and so may well require pages to hold more of the details. So we should probably structure this review in terms of two questions:
• what should go in Floating point and what should be moved elsewhere
• How to best explain the topic at hand
So I seem to be on the same page as Booya Bazooka. Jake 21:25, 3 October 2006 (UTC)
Me too. mfc 16:35, 4 October 2006 (UTC)
An article titled "floating point" should first contrast with fixed point, indicate its applicability, and tie in the current ubiquity of IEEE standards. The body of the article should discuss historical variations and indicate design parameters such as radix, range, and so on. However, it should not include all the ridiculous POV recently inserted arguing for/against specific choices. Please consider a broad audience in the intro, including people who are not computer geeks and do not wish to be.
A few technical points worth remembering, to avoid assuming that non-integers must be floats:
And remember, many readers will not know that √2 is not rational. Nor will many readers, even those with some computer experience, appreciate the difference between floats and reals, nor the intricacies of the IEEE design choices (gradual underflow, NaNs, infinities, signed zero, rounding mode, interrupt choices). Be kind. --KSmrqT 23:11, 4 October 2006 (UTC)

A mass consensus of three so far suggests that the article's title "Floating point" should be the guide, which is fair enough. I'd note that the title is not "IEEE Floating point" and, the IEEE standard allows variant choices, and, many usages pay no attention to the standard at all even if they are deprecated as not being the modern wave. Thus, IEEE FP is not equivalent to FP, just as C++ is not equivalent to programming. The sections in disfavour are "Designing a computer representation" to "Additional Special Values" and could be excised entirely. (Ruthlessness rules!) They do make reference to concepts (overflow, cancellation) that are described later (in "Behaviour of computer arithmetic" and "Computer handling of floating point", "Accuracy problems") all of which apply to floating point whether IEEE or not. Thus, the section "Designing etc" to "Additional etc" along with "Implementation in actual computers" should have been placed after those parts.

This would give a sequence from generalities about FP through behaviour of FP arithmetic to details about representation with links to IEEE etc. The general hierarchy is the notion of FP, behaviour of FP arithmetic, a variety of theoretical designs such as IEEE, actual designs in real computers (with conformance or not to theoretical desiderata), firmware implementing such designs, and for daft completeness, circuit designs. A full design involves considering all levels and their cross-influences. But this is an encyclopaedia attempt, not a design manual: no firmware or below then.
But there are consequences of design choices. The introduction of non-normalised numbers compels the firmware to follow extra steps even though in the vast majority of occasions they are unneeded, and worse, when they are needed, it is probable that the calculation is of low merit anyway, so un-normalised numbers might as well be discarded without great loss. But there appear to be some advantages, though I haven't had experience of this. Likewise, I forgot to mention that the gradual underflow numbers wreck the idea of choosing exponent ranges so that x and 1/x can be represented, since their reciprocals will cause overflow. Floating point numbers and their usage in computers throw up many design choices and quirks.
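The remark about reciprocals of gradual-underflow numbers is easy to check. The sketch below assumes Python's `float` is an IEEE 754 double and relies on ordinary float division returning infinity on overflow rather than raising an exception; the variable names are illustrative.

```python
import math
import sys

smallest_normal = sys.float_info.min               # 2**-1022
smallest_subnormal = float.fromhex("0x1p-1074")    # last representable step above zero

print(1.0 / smallest_normal)      # finite: the reciprocal is still in range
print(1.0 / smallest_subnormal)   # overflows: 2**1074 is beyond the largest double

# Going the other way, the reciprocal of the largest double lands in the subnormal range.
print(1.0 / sys.float_info.max < smallest_normal)
```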

Rather than presenting an unstructured list of such motiveless random choices and quirks, they were exhibited as considerations of a supposed design process, which, since it had a particular end point in mind was tendentious rather than pure theory.

So, why the 8/24 split between exponent and significand fields? How is this relevant in other bases? Why might base 2 be preferred to base 16, or vice-versa? Many questions can be asked, quite properly. In the Lahey Fortran discussion BB, one fellow was puzzled as to why tan(pi/2) did not come out as infinite. But my schedule of values for this in 32, 64, and 80 bits was removed. Likewise, mention of truncation. Somewhere I found a very nice diagram of fp numbers and their spacing (wrt. different exponents), and the sum and product diagrams. But I've lost the web page. There is a lot to describe for a proper encyclopaedic coverage. That is, if we are not to produce short smug articles demonstrating superior knowledge via laconic obfuscation and vague generality, whose brilliance is intelligible only to those who already fully understand the issue.

As remarked, the various features should be assessed and assigned to either "Floating point" or "IEEE specification", possibly with duplication. But for now, I'm off for dinner. NickyMcLean 04:18, 7 October 2006 (UTC)

Hello, I see that the current reorganisation is still in progress (past ~2 months). I've dived in and rewritten the intro as it didn't actually set the scene for the article. I've pulled the other methods into the intro to explain how floating-point differed from other representations. I deliberately haven't used the scientific notation description as it doesn't really aid the reader. The explanation that I think is conceptually easier is that floating-point stores a window that moves over the bit-string. This may need more description than is there at present (i.e. are window / bit-string etc understandable terms)? I'm considering writing this in more depth, and changing the later parts that use scientific notation. What are other people's thoughts on this? Would it help, or harm the article to change this explanation? Amoss 17:11, 8 November 2006 (UTC)

See a lot of good text there, but it comes 'straight out of the blue'. I've added a quick introductory paragraph to try and give it some context, hope that helps. But it really is important to point out that the representation (bit pattern in computing) is not what floating-point is about. You can store fixed-point and integer numbers in a floating-point representation, and carry out fixed-point and integer arithmetic on values stored in that way. What makes it useful is floating-point arithmetic on those values. The representation is interesting -- but nothing to do with the basic concept of floating-point. mfc 21:31, 8 November 2006 (UTC) [offline for a week due to travel, back in a week or so]

You're right about this, and I like the way that it gives the article an immediate focus. The example was technically wrong, although easy to fix. I've rewritten the previous paragraph about representations to focus more on the actual arithmetic and merged your intro straight in. One thing to try and get a nice balance on is explaining what the alternative arithmetic systems are, so that floating-point has some context - but without overlapping too much with what is in the other articles. Have a good trip Amoss 11:31, 9 November 2006 (UTC)

Another bout of restructuring that needs a second opinion. I've moved the Basics section to be first as it makes more sense to introduce the subject, rather than the history. I've taken out some inaccuracies about floating point being real numbers - well yes they are but they are also rational numbers, which is a smaller set and hence a more accurate description. This results in the guff about pi being more sophisticated than 1/10 being removed. Rereading the comments above, I'm thinking that KSmrq's description of a structure seems to be a good way to rewrite the article into. Amoss 13:16, 13 November 2006 (UTC)

## Survey paragraph in intro

Following the suggestions of KSmrq, I have added some "survey of other approaches" material near the top. I'm not happy with having so much material (a whole screenful) above the index. Should this be moved to the end? Is it appropriate? Is it too technical?

Yes, I'm aware that rational arithmetic is a redlink. I think such an article should exist. William Ackerman 01:25, 13 October 2006 (UTC)

## sin(π) is not zero

I have a difficult time digesting the naked assertion that sin(π) is not zero. What we mean (though I don't think this is the way to write it in the article) is probably something like (double)sin((double)(π)) is not zero. --Jake 17:01, 17 October 2006 (UTC) How should we distinguish honest to goodness mathematical constants and functions like sin() or π, from their floating point approximations?

• We could introduce names like sinf and fpi to use a C-like treatment.
• We could use some typesetting contrivance: sin vs sin (the same name set in two different typefaces)
• We could stick with mathematical description of what the floating point operations are doing. This is commonly done by introducing a rounding function like $fl(x)$ to round to floating point. With this we are saying that fl(sin(fl(pi))) is not zero, which is much less shocking.

(I am leaning towards the mathematical treatment for precision, but it might get awkward. Are there other pages that have had to deal with such notational issues? --Jake 17:01, 17 October 2006 (UTC))
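The fl() reading of the statement can be demonstrated in a couple of lines of Python. The sketch assumes IEEE 754 doubles and a reasonably accurate library sine; the names `fl_pi` and `result` are illustrative.

```python
import math

fl_pi = math.pi              # fl(pi): the double nearest to pi, not pi itself
result = math.sin(fl_pi)     # fl(sin(fl(pi)))

print(result)                # tiny but nonzero
print(result == 0.0)
```

Since sin(π − δ) ≈ δ for small δ, the printed value is essentially the gap between π and the double that stands in for it, so the nonzero answer says more about the argument than about the sine routine.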

Well, I had presented a table of results (since ejected) of tan(π/2) computed in real*4, real*8 and real*10 which used the phrasing tan(approx(π)/2) which seems less obscure than an eye-baffling proliferation of name variants for the purely Platonic Notion of Sine et al. More typing though, but our sinning has its costs. I have also used π when Platonic, and pi (as it would appear as a name in a source file) when dwelling in computational improprieties. As one does. More generally, one could use the full name (with leading capital) for the proper function of Real (err, Platonic) numbers, and the lower case abbreviated name for the computational form, but there would have to be introductory words describing this new convention. Endless articles use sin meaning the proper Sine, so these usages being different would have to be signalled. NickyMcLean 20:33, 17 October 2006 (UTC)

Yes, I guess we have to be careful. My intention was to say something along the lines of "Attempting to compute tan(π/2) with a computer's floating point mechanism will be unsuccessful for reasons of ...." Where, by "tan(π/2)" we mean the eternal mathematical ("platonic") truth, of course. I guess the statement "π/2 doesn't exist" was a clue to the thinness of the ice that I was standing on. The closeness of the notation (which is the whole reason for programming languages being designed the way they are!) makes this distinction awkward.
So we probably ought to say something like "Attempts to compute tan(π/2), and observe that it is infinite, will be unsuccessful, because the argument to the tangent function won't actually be π. In fact, the computation won't even overflow--the result will be -22877332.0 blah blah blah." (I consider the fact that the exact result will actually be within exponent range to be interesting, and illustrates the surprising things that can happen.) I'd prefer to do this without getting into funny typefaces or capitalization or whatever, to distinguish the "mathematical" entities from the "computer" ones. Perhaps something along the lines of "The following attempted computation will ...." or "The following C code will ....", though we should try to avoid filling the article with code snippets.
Similarly with sin(π). Of course we can't say things like "π doesn't exist." That's just sloppiness on my part.
The reason I took out the earlier treatment of tan(approx(π)/2) etc. was that it was about the behavior of a specific system, language, and compiler, including a mention that Turbo Pascal doesn't provide the "tan" function, so one has to divide sin by cos. The problem wasn't with the notation itself. Discussion of the idiosyncrasies of individual systems would be an unending rat-hole. We have to stick to "pure" IEEE. (Some would say we shouldn't even get that specific.) William Ackerman 22:53, 17 October 2006 (UTC)
Ah well, the reason I was so specific was to assist anyone attempting to reproduce such a calculation, with an example of one variation possibility as I expect that the slightest change will produce different results. Thus, if they were to have a tan function, their results would surely differ because their tan function surely wouldn't use sin/cos; similarly, their library routines would differ from those used by Turbo Pascal and would or at least could vary from language to language, release version, etc. There are also the fsin and ftan (or tanf: I forget) op codes for the 8087 et seq. floating-point cruncher that would use its own processes. Or, all languages would rely on the floating-point ftan if there was an 8087 or equivalent device present. Would the Weitek have differed? And is the software emulation of a missing 8087 exactly equivalent except for speed?
I too was surprised that the results were within the dynamic range, then on noting that the higher-precision approximations to Pi gave higher values, decided that this was to be expected (ahem...) [On returning from a late lunch] It is of course obvious. tan(x) approaches infinity near π/2 in the same order as 1/x does near zero. Thus, the result of tan(approx(π)/2) should go as 1/(π - approx(π)) and with the value of π being about 3, 7-digit precision should give about 1/10**-7 as a result, and so forth for higher precision. After the observation, it is seen that it could have been predicted...
Messing about with funny typefaces or capitalisations or a prefix/suffix to the normal name would be distracting. Careful phrasing requires a lot of thought though, and even then, other people still can misread it. NickyMcLean 02:08, 18 October 2006 (UTC)
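The order-of-magnitude argument above (tan(x) near π/2 behaves like 1/(π/2 - x)) can be checked numerically. What follows is only a sketch in Python, assuming the host's floats are IEEE doubles and using `struct` to round to single precision. The 50-digit value of π and the helper names `to_single` and `predicted_tan` are illustrative choices, not anything from the discussion.

```python
import math
import struct
from fractions import Fraction

# pi to 50 decimal places, used as an exact rational stand-in for the true value.
PI = Fraction("3.14159265358979323846264338327950288419716939937510")

def to_single(x: float) -> float:
    """Round a double to the nearest IEEE single, returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

def predicted_tan(x: float) -> float:
    """Estimate tan(x) for x close to pi/2 as 1 / (pi/2 - x), using exact rationals."""
    return float(1 / (PI / 2 - Fraction(x)))

for name, x in [("single", to_single(math.pi) / 2), ("double", math.pi / 2)]:
    print(f"{name}: predicted {predicted_tan(x):.1f}, library tan {math.tan(x):.1f}")
```

The sign of the result follows from whether the rounded argument lands above or below π/2, and the magnitude grows with the precision of the argument, as described above.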
This may be a quibble, but one thing I dislike about using approx is that the reader might be led to think of it as some approximation rather than a very particular one. It sort of leaves open that maybe approx(2) could be something other than 2. Or that approx(approx(x)) could be different from approx(x). It is also both not very short and yet an abbreviation. If we use an abbreviation, I'd recommend fl(x). Or if we want something more familiar, round(x). --Jake 05:03, 18 October 2006 (UTC)
Ah, quibbles. Aha. "Approx" would be described in the preceding text, one hopes... The difficulty with using Round(x) is that the question arises, round how? Aside from issues such as rounding where (not just to integer values), there are many versions: round left, round right, round in, round out, and round near (with dispute over 0.5). One mention of truncation (round in) was expunged. Rather than "fl", perhaps "fp" or "float"? One wishes to avoid a torrent of terms needing definition. Or the sort of symbol soup that quickly results from describing an operation with distinction between + and FP+ and so on, as when arguing the details of precision and error propagation. Alternatively, "Normalise(x)" could be described, though "Norm" is a proper term in algebra, meaning of course something else. NickyMcLean 19:33, 18 October 2006 (UTC)
Yes, rounding modes are lurking about behind the scenes. Often this is what you want though. If "round" means "round according to the prevailing rounding direction" then a correctly rounded sine function would return round(sin(round(x))). Then you just need to deal with the fact that this really is talking about one function (in the mathematical or platonic sense) for each rounding direction. With fl/fp/float you would need to decide if you want it to always mean round to nearest (ties to even) or you are back to it being the same as "round". Of the last three, my impression is that fl is most often used in the literature, but I am not that widely read in the field. Looking at the pages about numerical analysis that this page links to doesn't shed any light. Maybe we are better off taking this as a hint that we need to use more English and less jargon? --Jake 22:31, 18 October 2006 (UTC)
I agree: we should explain it in English. Using contrived or invented functions will just lead to the need to define what we mean by those functions, and make things confusing. We should try to avoid filling the article with equations and formulas. Readers who want same can drink their fill from the referenced article "What Every Computer Scientist Should Know...." Accordingly, I have made my attempt at correcting the heresy, using a code snippet. Is this OK? William Ackerman 15:24, 19 October 2006 (UTC)
Herewith, a table, produced via sin/cos via Turbo Pascal, which offers a real*6 = 48 bit format.

Also, the non-representability of π (and π/2) means that tan(π/2) cannot be computed in floating point arithmetic because any (finite-precision) floating point number x would be either slightly higher or lower than π/2 so that (x - π/2) would not be zero. The result of tan(x) for x near π/2 is closely related to 1/(x - π/2) which would not yield a result of infinity, nor would it even overflow [because the number of bits in the precision is less than the powers of two for the exponent].

```
         x = π                         Tan(x/2)
32 bit   3.14159274101257324 (High)   -2.28773324288696E+0007
48 bit   3.14159265358830453 (Low)     1.34344540546582E+0012
64 bit   3.14159265358979312 (Low)     1.63245522776191E+0016
80 bit   3.14159265358979324 (High)    Divide by zero.
π        3.14159265358979323846264338327950288...
```

Although "Divide by zero" would be a suitable computational result for tan(π/2), it is incorrect for tan(x/2) because (x/2 - π/2) is not zero! However the routine computing the trigonometrical function itself cannot have a floating point constant with value π/2 and must compute the difference between x/2 and its best representation of π/2, which may have the same value as x/2.

By the same token, a calculation of sin(π) would not yield zero. The result would be (approximately) 0.1225 × 10^-15 in double precision, or -0.8742 × 10^-7 in single precision. Nevertheless, sin(0) = 0, because x = 0 can be represented exactly.
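The sin(π) figures can be reproduced with a short Python sketch. It assumes Python's `float` is an IEEE double and uses `struct` to obtain the single-precision approximation of π; the helper `to_single` is an illustrative name. Note that the sine is evaluated in double precision in both cases, so only the argument differs.

```python
import math
import struct

def to_single(x: float) -> float:
    """Round a double to the nearest IEEE single, returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

pi_double = math.pi              # nearest double to pi
pi_single = to_single(math.pi)   # nearest single to pi

print(math.sin(pi_double))   # about 1.22e-16: pi_double lies just below pi
print(math.sin(pi_single))   # about -8.74e-08: pi_single lies just above pi
print(math.sin(0.0))         # 0.0, since zero is exactly representable
```

Because sin(x) is close to (π - x) near π, each result is essentially the representation error of the argument.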

Similarly, when attempting to solve f(x) = a, there may be no floating point number x such that (f(x) - a) is calculated as zero, or there may be many (not just one). Further, close to a solution point, the values of (f(x) - a) as computed are likely to change sign erratically.

The C prog. offered -22877332.0 for 32 bit, but 16331239353195370.0 for 64 bit, rather different. (These digit strings all have excessive digits; the ".0" part is dubious!) Curiously, the 80 bit result, though as desired, is incorrect, because the parameter was of course not exactly π/2. To investigate this behaviour properly I'd need an even-higher precision calculator. I have access to MatLab, but I'm not too familiar with its features. I think it happens because (x - π/2) is actually computed as (x - pi/2) where pi/2 is the trig. function's representation of π/2, which is the same as x, and thus the difference is (incorrectly) zero. Which is correct. Aha. So the annotation "(High)" is probably wrong (though on the face of it correct), because the decimal representation has rounded ...38 to ...4. One should really present these numbers in binary...
The "Enough digits to be sure we get the correct approximation" touches on a vexing issue. In the case of Fortran and Algol (where I have detailed experience), the compiler convention for the manipulation of decimal-expressed constants and expressions involving them can variously be: high-precision decimal arithmetic converting only the final result to the precision of the recipient (I dream...), floating point of maximum precision converting the result etc, interpreting all constants as single precision unless otherwise indicated (by using 1.23D0 for example, or only possibly, by the presence of more significant digits than single precision offers), or yet other variants, plus in some cases the setting of compiler options, say to interpret every constant as double precision.
For this reason I'm always unhappy to see long digit strings (which could be taken as single-precision, no matter how long) and would always use Double Pi; pi:=1; pi:=4*arctan(pi); replacing "double" by whatever is the maximum precision name. This relies on the generic arctan function being of the precision of its parameter (thus the doubt with "arctan(1)"), and I have no fears over the precision of a constant such as one when assigned to a high-precision variable.
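For comparison, here is a minimal Python sketch of the two ways of materializing approx(π) discussed here: the 4*arctan(1) idiom and a long decimal literal. It assumes Python's `float` is always an IEEE double, so the single-precision hazard described above does not arise in this language; the variable names are illustrative.

```python
import math

# The arctan idiom: relies on the library's atan being accurate at 1.0.
pi_from_atan = 4.0 * math.atan(1.0)

# A long decimal string: relies on the parser rounding correctly to a double.
pi_from_literal = float("3.14159265358979323846264338327950288")

# Hex output shows the exact bit patterns, avoiding decimal-rounding ambiguity.
print(pi_from_atan.hex())
print(pi_from_literal.hex())
print(math.pi.hex())
```

Whether the arctan route matches to the last bit depends on the platform's math library, which is the kind of system dependence raised above.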
All this just reinforces my point that we shouldn't be going into various system idiosyncrasies. The correct values are:
```
precision
    actual value of tan(approx(pi/2))
    approx(tan(approx(pi/2)))

single
    -22877332.4288564598739487467394576951815025309.....
    -22877332.0000000000 (Yes!  Exactly!)

40-bit significand (as in 48-bit Turbo Pascal)
    1343445450736.380387593324346464315646243019121.....
    1343445450736.00000000 (Exactly)

double
    16331239353195369.75596773704152891653086406810.....
    16331239353195370.0000 (Exactly)

64-bit significand (as in 80-bit "long double")
    -39867976298117107067.2324294865336281564533527.....
    -39867976298117107068.0000 (Exactly)
```
The correct rounded results really are integers. Comparing these with the values given in the previous box shows that library routines for various systems are not always correct. A correct implementation of 80-bit floating point does not overflow or get a divide by zero.
I don't follow what you are explaining above [presumably for tan of pi/2, not pi], but I have limited time just now. I hope it is clear that rounding does not always occur at the integral level. Why should tan(number close to pi/2) come out as an integer, exactly? The value of pi/2 is somewhere up to ULP/2 from π/2 and 1/diff need not be an integer. But I'd have to go through the working in binary arithmetic to be certain of exactly what might be happening. NickyMcLean 02:16, 21 October 2006 (UTC)
The table above (for, e.g. double), gives two numbers: tan(approx53(pi/2)) and approx53(tan(approx53(pi/2))). (Let's call these functions "approx53" and so on -- round to 53 bits of precision.) The second number is not round to integer, it is round to 53 significand bits. The fact that it is nevertheless an integer is nontrivial (though not hard to see.) William Ackerman 16:49, 23 October 2006 (UTC)
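One way to see why the rounded results are integers: with a p-bit significand, every representable value of magnitude 2^p or more is a multiple of 2. The sketch below checks this in Python for single (p = 24) and double (p = 53). It assumes IEEE doubles, a reasonably accurate library `tan`, and the illustrative helper `to_single`.

```python
import math
import struct

def to_single(x: float) -> float:
    """Round a double to the nearest IEEE single, returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Double: the result exceeds 2**53, where adjacent doubles are 2 apart.
t_double = math.tan(math.pi / 2)
print(t_double, t_double > 2.0 ** 53, t_double.is_integer())

# Single: tan is evaluated in double here, then rounded to 24 significand bits.
# The magnitude exceeds 2**24, where adjacent singles are 2 apart.
t_single = to_single(math.tan(to_single(math.pi) / 2))
print(t_single, abs(t_single) > 2.0 ** 24, t_single.is_integer())
```

The single-precision value is obtained by rounding a double-precision tangent, which stands in for a correctly rounded single-precision routine.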
The questions of how various systems (8087, Weitek, software library) compute these functions is a subject that, while fascinating (I've been fascinated by them in the past), we just can't go into on the floating-point page.
I believe that modern computer language systems do in fact parse constants correctly, so that my code snippet with the comment "Enough digits to be sure we get the correct approximation" will in fact get approx(pi). And they do in fact get the correct precision for the type of the variable. Though there may still be some systems out there that get it wrong. Using 4.0*atan(1.0) to materialize approx(pi) really shouldn't be necessary, and can get into various run-time inconveniences.
By the way, I don't think Matlab has higher precision than plain double. William Ackerman 23:11, 19 October 2006 (UTC)
Possibly it is Mathematica, then. A friend was demonstrating the high-precision (ie, large array holding the digits, possibly in a decimal-related base at that) of some such package, so I asked for a test evaluation of exp(pi*sqrt(163)); a notorious assertion due to Srinivasa Ramanujan is that this is integral. This is not the normal floating-point arithmetic supplied by firmware/hardware and might support thousand-digit precision, or more. NickyMcLean 02:16, 21 October 2006 (UTC)
That's nonsense. Ramanujan made no such assertion. The value is 262537412640768743.999999999999250072597198185688879353856337336990862707537410378... —uncanny, but not an integer. William Ackerman 16:49, 23 October 2006 (UTC)
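The near-integer can be reproduced with Python's standard `decimal` module, as a sketch of the kind of multi-precision evaluation mentioned above. The 50-digit working precision and the hard-coded digits of π are assumptions made for this illustration; only the leading 30 or so decimal places of the output should be trusted at this precision.

```python
from decimal import Decimal, getcontext

# Working precision in significant digits (an arbitrary choice for this sketch).
getcontext().prec = 50

# pi to 50 decimal places, entered by hand since decimal has no pi constant.
PI = Decimal("3.14159265358979323846264338327950288419716939937510")

value = (PI * Decimal(163).sqrt()).exp()
print(value)
print(value == value.to_integral_value())  # False: close to an integer, but not one
```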
Incidentally, we haven't mentioned that in solving f(x) = a, the result of (f(x) - a) for various x close to the solution value will not necessarily yield the nice mathematical behaviour no matter how loud the proclamations of "continuity" are. There may be no x for which f(x) = a, and, adjacent to the solution, (f(x) - a) as computed will likely bounce about zero in an annoying cloud of dots. NickyMcLean 20:57, 19 October 2006 (UTC)
Likely a better topic for root finding though introducing it here might be reasonable. --Jake 21:25, 19 October 2006 (UTC)
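A small concrete case of the f(x) = a point, sketched in Python under the assumption of IEEE doubles and Python 3.9 or later (for `math.nextafter`): for f(x) = x*x and a = 2, the two doubles that bracket the true root give residuals on opposite sides of zero, and neither is zero. The choice of f and the variable names are illustrative only.

```python
import math

a = 2.0
x_hi = math.sqrt(a)                # nearest double to the true root
x_lo = math.nextafter(x_hi, 0.0)   # the next double below it

# The computed residual jumps across zero between adjacent doubles.
print(x_lo * x_lo - a)   # negative
print(x_hi * x_hi - a)   # positive
```

A root finder that tests for an exactly zero residual would therefore never terminate on this example; a tolerance or a bracketing test is needed instead.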

Editors have been known to remove external links from this article if they are added with no history comment and with no discussion on the talk page. Please discuss the value of your proposed link here if you want it not to be cleaned out regularly. (You can help!)  EdJohnston 17:59, 24 October 2006 (UTC)

I would like to propose the following external link to a relevant scientific paper:

Izquierdo, Luis R. and Polhill, J. Gary (2006). Is Your Model Susceptible to Floating-Point Errors?. Journal of Artificial Societies and Social Simulation 9(4). This paper provides a framework that highlights the features of computer models that make them especially vulnerable to floating-point errors, and suggests ways in which the impact of such errors can be mitigated.

The Floating point page is, as you see, mostly devoted to bit formats, hardware, and compiler issues. The paper by Izquierdo and Polhill that you cite is about making numerical calculations robust. That seems to fit with the traditional definition of Numerical analysis. I suggest you consider adding it there, or at least propose it on their Talk page. EdJohnston 19:38, 2 November 2006 (UTC)

## More restructuring of introductory material

I've thought more about what we are really trying to say that floating-point is, and done another round of editing. If I've stepped on anyone's toes, I really didn't mean to. We are making progress.

The section headings "overview" and "basics" are just wrong. Any ideas on restructuring that? Things are pretty badly ordered at this point, and there's a lot of redundancy.

Amoss -- I like the concept of the "window" that the digits must lie in, and that the radix point can slide anywhere relative to. But it's inconsistently applied at present. Do you think you can work out a way to use that metaphor clearly and consistently throughout the introduction? Or at least throughout parts of it? William Ackerman 01:13, 24 November 2006 (UTC)

I like much of the content in the intro, but I think it is getting a little too detailed for the intro. It reads more like a section on "digit string representations" which would provide background and context for the article. I think the intro should boil down the essence of the whole article to serve either as an executive summary for someone who wants the short story, or as a way to whet the appetite for someone reading the article. So beyond digit strings I think the intro should touch on values, arithmetic, and introduce the notion of exceptions. There should be a reference to IEEE 754, and some reference to the troubles resulting from floating point numbers being different from real numbers that leads to the need for numerical analysis. To keep it introductory, each of these topics should get little more than a sentence, leaving the sections in the article to do the real work. --Jake 19:32, 27 November 2006 (UTC)

## Intro example could use clarification

"Floating-point notation generally refers to a system similar to scientific notation, but without the radix character. The location of the radix point is specified solely by separate "exponent" information. It can be thought of as being equivalent to scientific notation with the requirement that the radix point be effectively in a predetermined place. That place is often chosen as just after the leftmost digit. Under this convention, the orbital period of Io is 1528535047 with an exponent of 5."

Took me a second to catch that example - 1.528535047 x 10^5 oh now I see... perhaps "with an exponent of 5" could be clarified for this intro so newbies could catch on (I admit I did NOT read the entire article, just the intro, so that I could follow another article that used this principle)?
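Under the convention in the quoted passage (radix point just after the leftmost digit), the digit string and exponent decode as in the following Python sketch. The function name `decode` is hypothetical, and `Decimal` is used only to keep the arithmetic exact.

```python
from decimal import Decimal

def decode(digits: str, exponent: int) -> Decimal:
    """Value of a digit string whose radix point sits just after the
    leftmost digit, scaled by 10**exponent."""
    # With n digits, the integer reading is too large by a factor of 10**(n - 1).
    return Decimal(digits).scaleb(exponent - (len(digits) - 1))

print(decode("1528535047", 5))   # 152853.5047
```

So "1528535047 with an exponent of 5" stands for 1.528535047 × 10^5, that is, 152853.5047.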