Talk:Binary search algorithm/Archive 1

From Wikipedia, the free encyclopedia
Jump to: navigation, search
Archive 1

Removing the "Troubleshooting" section

The Examples->Troubleshooting section does not clearly state the relationship between the topics mentioned and the binary search algorithm. The terms "problem" and "solution" are used vaguely and no connection between them and finding an item in a sorted list is mentioned. Additionally, no verifiable source is cited that connects binary search to any form of debugging. I am therefore removing this section. —Preceding unsigned comment added by Psherm85 (talkcontribs) 22:38, 7 September 2008 (UTC)

hey, no checking if its there?

I dont think this code accounts for what might happen if you search for something thats not there. I am not sure though.

— Preceding unsigned comment added by 216.15.122.216 (talk) 15:52, 2 May 2007 (UTC)

Example doesn't work

The guessed Number might be 11 or 12! so better ask an additional question or change the comparision to "greater or equal"

— Preceding unsigned comment added by 91.16.156.30 (talk) 05:47, 26 May 2007 (UTC)

Definition Clarification?

http://www.nist.gov/dads/HTML/suffixarray.html

The reason I bring this up is because the opening definition of binary_search is because it doesn't include a clear definition "that characteristic" used by the "sort algorithm".The charicteristic used would be the order of the data, or the orderability of the data. [E.g. a "recent changes" page can be ordered alphabetically, or chronologically. This type data two inherently sequential properties, alpha and chron., both linear arrays] I'm neither a programmer nor mathematician....=/ Suffix array is something special, too. more on that

Posted in wiki:

In computer science, binary search or binary chop is a search algorithm for finding a particular value in a list of data. A divide and conquer algorithm, binary search requires random access to the data being searched, the ability to quickly get the kth item in the list. In its simplest form, binary search assumes the data is sorted (usually by a sort algorithm) and takes advantage of that characteristic.


http://www.nist.gov

Definition: Search a sorted array by repeatedly dividing the search interval in half. Begin with an interval covering the whole array. If the value of the search key is less than the item in the middle of the interval, narrow the interval to the lower half. Otherwise narrow it to the upper half. Repeatedly check until the value is found or the interval is empty.

-links to- [is an example of a:]

dichotomic search definition: Search by selecting between two distinct alternatives (dichotomies) at each step.


Data Structures, Algorithms
Binary Search Algorithm

This Binary Search Algorithm uses recursive subdivisions of an array to perform a search.

Enter the size of the array in the text field, and click the "Create" button. The array is created with randomly generated numbers. Now, input a search key in the text field, and click the "Search" button. http://www.cosc.canterbury.ac.nz/people/mukundan/dsal/BSearch.html


These two definitions use the term ARRAY.

Answers.com clears this up:

ARRAY:

  1. Mathematics.
1. A rectangular arrangement of quantities in rows and columns, as in a matrix.
2. Numerical data linearly ordered by magnitude.
    • Ieopo 04:03, 28 October 2005 (UTC)
      • An array doesn't have to be ordered. An array is a data structure pretty much like linked-lists and queues are. The array exist before being applied a sorting algorithm on it (how can it otherwise be ordered?). Moreover nothing says you need to apply such a sorting algorithm for it to be an array. So the definition of array shouldn't include order. There exist the term ordered array though. --Agcala 16:51, 1 April 2007 (UTC)


  • To be an encyclopedic definition it is somewhat obscure. I think it hasn't been stressed enough that the array must be sorted on the ascending key values, or the algorithm won't work.--Agcala 16:51, 1 April 2007 (UTC)

Pseudo code alternate version

  // Alternate version:
  BinarySearch(A, value) {
                                            // pre: 0 < N and A is ascending
      i = 0                                 // invariants:
      j = N                                 // 0 <= i < j <= N and f.
      while (i + 1 != j) {                  // 0 <= i + 1 < j <= N
          h = (i + j) / 2                   // 0 <= i < h < j <= N
          if (A[h] <= value)        
              i = h                         // 0 <= i < j <= N  
          else
              j = h                         // 0 <= i < j <= N
      }                                     // 0 <= i < j <= N and i + 1 = j
      return A[i] == value                 
  }

Only one exit point, overflow problems are not a problem for pseudocode. Some might argue there is a downside as it doesn't return the index of the found item (for which the algorithm is easily adapted by returning not_found or the index i), it doesn't work with empty arrays (for which the algorithm is easily adapted by returning not_found), and it is less efficient in that it continues searching the array after an item has been found (and it is MORE efficient in a worstcase scenario, just count the number of times a comparison is made). Question not_found is not a value and usualy -1 but this is only specific for the C language.

overflow might be fixed h = i / 2 + j / 2 + i mod 2 * j mod 2

— Preceding unsigned comment added by 85.144.94.32 (talk) 12:48, 13 April 2007 (UTC)

Pseudo code change

I changed the pseudocode since it was incorrect.dze27


Pseudocode:

function binary-search(L,V)
  set start = 1
  set end = N
  repeat while start <= end
    set middle = (start + end) div 2
    if V = L[middle]
      return success    
    else-if V < L[middle]
      set end = middle - 1 
    else-if (V > L[middle])
      set start = middle + 1
    end-if
  end-repeat
  return failure
end-function

Notes: 
 div is integer division (discard any remainder)

In practice, the non-recursive algorithm would be implemented with some minor changes: Robocoder 15:02, 12 November 2005 (UTC)

  • To avoid overflowing (e.g., left+right > maxint), calculate:
mid := floor(left+(right-left)/2)
  • Searching the top half of the array is implied and does not require the conditional test:
if value > a[mid]
In pseudocode we assume that integers are infinite-precision, for the sake of simplicity, but you are correct. Similarly, the redundant conditional test exists solely for clarity. Pseudocode is designed for people to read. Deco 21:53, 12 November 2005 (UTC)

Bridging ideas

Has anyone else noticed that the part with the example of guessing numbers between 1-16 and the part with the pseudocode etc, aren't very well connected? Someone who don't already know what binary search is, and how it works, might not make the connection?

Anyone have any ideas how to express how the _algorithm_ for 'guessing' the right number in the 'game' can be used to find the position of a certain number in a list?

Good point. I tried to add some transition and tie them in somewhat, but they're still a bit separated. I hope this helps. Deco 07:27, 26 Dec 2004 (UTC)

The pseudocode is pretty intense for me - it seems more complex than it may need to be. I think the number guessing game should include a formula* for finding for finding the smallest number of steps required to resolve any unknown number. [The "50 in 6 steps example" should be accented with a "100 in 7 steps example" to give scope of the power of the algorithm]

  • Formula? I can't find the formula or make one - I keep coming up with more code than math [e.g. if NOT real number then]

Real-world programming language example

I would as well appreciate an implementation form in a real-world programming language -- HJH

Wikipedia isn't a code repository. The pseudocode sufficiently explains the algorithm. Please search and insert an external link. — 131.230.133.185 01:37, 13 July 2005 (UTC)
At first I was quite alarmed by this change, mostly because the reason used to make the edit could be applied to unilateral deletion of code samples from all articles, which would be very bad and highly controversial. But in this particular case, I believe you're right — the samples are almost identical to the pseudocode in form and function. No need for redundancy. Deco 04:12, 13 July 2005 (UTC)
Regarding the comment that pseudocode is sufficient. Balony. If it was sufficient there wouldn't be all these edits and miscommunications regarding the incorrect and stupid use of pseudocode. Stick to providing a simple C example. The example will either work or not because it is STANDARD. And most major languages are C derived or similar anyways so most people will understand it. Pseudocode is stupid, stop using it in examples.66.165.191.253 (talk) 22:52, 20 November 2008 (UTC)

Recursive Change

I changed the code for the recursive version because it wouldn't have worked as written. The "return mid" statement would have returned the value to one of the recursive calls rather than the initial, external caller. 69.171.89.130 21:48, 4 February 2006 (UTC)

Right you are. Good catch. Deco 23:42, 4 February 2006 (UTC)

Pseudocode examples will not work

hi, I think that the algorithm will not work if programmed as described in the pseudocode examples. The values of the right-variable are never checked if they are equal value. Therefore you can never find the value 'b' in the list 'a','b'.

I think you're incorrect. Because we add 1 to mid during each recursive call, it must continue to get larger with each call, so eventually the range will only contain "right". Deco

It's possible to write this algorithm in a way that avoids special-case testing, and confusing indexing with a +1 here and perhaps not there. It would also be good to perform a single comparison at each stage (the actual comparison might be via function evaluation) though few computer languages enable this. Suppose an array A, with elements 1 to N to be searched to find an index i such that A(i) = X. In a pseudocode fragment, converted from a working programme written in accordance with Professor Knuth's version,

        L:=0;                             %Establish outer bounds.
        R:=N + 1;                         %One before, and one after, the first and last.
  Probe:P:=(R - L)/2;                     %Truncate to integer. Beware integer overflow with (L + R)/2.
        if P <= 0 then Return(NotFound);  %Aha! Nowhere!
        P:=P + L;                         %Probe the middle of the span L:R.
        Case Sign(X - A(P))               %Perform the comparison once.
Positive:L:=P, go to Probe;               %Shift the left bound up:    X follows A(P).
Negative:R:=P, go to Probe;               %Shift the right bound down: X precedes A(P).
    Zero:Return(P);                       %So, X is found, here!       X equals A(P).
        Otherwise;                        %No other cases are possible.

Now then, this is not C, it is pseudocode so be brave. What I mean by Case Sign(expression) is to consider the sign of the expression's value, which has three cases. In most languages you are stuck with something like if expression > 0 then ... else if expression < 0 then ... else ...; or some such permutation of the three cases. This means that on average half the time, two comparisons with be performed at each stage, not one. But in fortran, you can write IF (expression) negative,zero,positive meaning go to the appropriate label for the three cases, and obviously, the expression is written (and hopefully, compiled to execute) once only.

Notice that there is no special case testing for annoyances such as N = 0 or N = 1, or X being outside the bounds of A(1) to A(N), which are particularly annoying to check if N = 0. These cases are all encompassed in the one method, and it is very easy to make mistakes with half-remembered details. Further, this code contains usage of the deprecated go to statement, but I invite those who recoil to present a revision that avoids wasted effort or replicated code and blather splattered across the page. NickyMcLean 21:07, 13 December 2006 (UTC)

Regarding NickyMcLean's broken changes from 19:33, 15 February 2007: A is a sorted list with range 1 to N. Consider a single item list (i.e. N=1): low will be initialsed to 0, high to 2 and the first probe, p, to 1. That is the only valid index in the list, and yet if that doesn't match the loop won't end. Your next set of values are either high=1, low still 0, and p=0 (out of bounds fault), or low=1, high still 2, and p still 1 (an infinite loop). The while condition MUST be (low < high-1) for this form of the algorithm to work. j4_james 22:55, 15 February 2007 (UTC)

Yep, you're right, your fix for the C-pseudocode was correct and I misread it. In the above pseudocode the test remains the simple P <= 0 (to stop) rather than low < high - 1 (to continue) in the C-pseudocode formulation. Apologies for the burp in my skull sludge. NickyMcLean 00:00, 16 February 2007 (UTC) PS: Neat timestamp! And after I had lunch (2pm in NZ) I remembered how I coded the binary chopper in assembler.

I see that the C-pseudocode is still unstable, as Mariolj claims an error fixed if the sought item is not in the list. But after my previous mistake, I'm not going to look again! And all the time, the above pseudocode remains compact and correct, and is a slightly-edited copy from an actual working prog. which has worked correctly when confronted by a search for a value not in the list. But the pseudocode is not in the C-style... NickyMcLean 04:08, 21 February 2007 (UTC)

I can assure it's now broken again. There are two common forms of the algorithm - one with the low and high bounds outside the test range (start with low = 0 and high = N+1) and one with the bounds inside (start with low = 1 and high = N). The problem is that people tend to confuse the two which is how you end up with the disastrous hybrid that we have now. It's easy to prove that it's broken with a simple walk through of two elements. I couldn't care less what format the pseudocode is in, but it would have been nice if it worked. j4_james 19:15, 21 February 2007 (UTC)

Integer Overflow

Prompted by MuthuKutty the remark "Beware integer overflow" can be expanded upon without adding further blather to the proffered pseudo-C. When writing pseudocode, it is usually supposed that any variable is capacious enough to encompass the largest value or greatest precision necessary for the method so that messy details will not obscure the technique, but in actual implementations there are limits, and they are close. Suppose 16-bit signed integers are used: clearly the indexing of the array will work only up to it having 32,767 elements. Alas, the calculation of P:=(R + L)/2; can fail long before that, because the sum can easily exceed the 16-bit limit as when the array has more than 16384 elements and the area being probed is close to the high end of the array. This can be fixed (almost) by making the expression slightly larger, and accepting the extra execution time: P:=(R - L)/2 + L; (This order has hope of slightly faster execution than L + (R - L)/2, since no temporary storage is needed to hold the value of L while the (R - L)/2 part is being computed)

But only almost. Suppose the array has indeed 32,767 elements. Now remember that the initial values of L and R are to be one outside the bounds, so R must be able to hold 32,768, and it can't in 16-bit signed integer form. Similarly, suppose that unsigned integers are to be used: all well and good, and the same problem exists with a maximal upper bound. But suppose that the array indexing starts with zero (as in C, by requirement) rather than one for whatever reason. The initial value of L is to be one less than the first index, and -1 is not a possible value for an unsigned integer.

A further possibility exists. Most computer languages do not offer a protocol for a three-way test on the result of a comparison so the code is usually some lame repetition of the comparison such as

  if A(P) < X then L:=P
   else if A(P) > X then R:=P
    else Return(P)

Suppose instead this is avoided via

  diff:=A(P) - X;
  if diff < 0 then L:=P
   else if diff > 0 then R:=P
    else Return(P)

With the hope that the compiler will optimise the repeated access of variable diff using some sort of temporary register. (A compiler is very unlikely to optimise the repeated comparison code, not least because the code is not exactly repeated; one has < while the other has >) Alas, if the comparison is performed on variables with limited capacity, such as integers, this can fail because the difference overflows the integer limit as in (for 16-bit) 30000 - (-15000). The explicit comparison code will (should!) succeed, because the compiler writer will (should!) have taken advantage of such machine indicators as overflow and carry which were set by the subtract operation, which states are not carried along into the value of variable diff. In the absence of a three-way test syntax, only assembler programming will produce good code. NickyMcLean 20:46, 12 February 2007 (UTC)

Nicky, look up bsearch() in the C Standard Libarary. There is no compare function. There is only a function pointer to a compare function. If you want to compare strings then use strcmp() and it will return 1, -1, 0. This exact same set of values must be returned by any comparison function called through the function pointer specified to bsearch() as a function having two void pointers as its arguement.
int CmpInt( int *piKey, int *piArrayMem) // bsearch() might complain about typing here
if *piKey < *piArrayMem return -1; // any negative value is sufficient/legal
if *piKey > *piArrayMem return +1; // any positive value is sufficinet/legal
return 0;
// for searching arrays of structs use a compare function like this
int CmpMyStruct( int *piKey, MyStruct *pArrayMem) // bsearch() might complain about typing here
if *piKey < *pArrayMem->structmem return -1; // any negative value is sufficient/legal
if *piKey > *pArrayMem->structmem return +1; // any positive value is sufficinet/legal
return 0;
As you can see, there is no "annoying non-uniformity" between the comparison of the simple data types, including char arrays(strings), and complex, user-defined data types defined via struct{}s. There are much more clever ways to search struct{}s, such as using offsetof() to move the array's base pointer over so the compare function can dispense with the need to deal with the ->structmem offset - in which case the original simple datatype compare function would work just fine.
The notion of passing a function as a parameter is not troublesome. More interesting would be to supply an expression as a function parameter. But I offer two points: firstly, I wasn't restricting myself top talking only about C (and its variants) so there is not a lot of point in extolling whatever useful attributes some variant of C might offer in this context (which is not restricted to C-languages), and second, look at your remarks "any negative value" (or positive value) is allowed. Well, the whole point of my argument is that the result of a comparison should be usable in a three-way test, as in case(Compare(a,b) where the recognised cases have to be explicitly the values -1, 0, and +1 otherwise the case-statement fails.
You seem to have totally forgotten or confused the context of a compare function where bsearch() is concerned. The return value and function arguments MUST conform to the function prototype specified by bsearch() and the test you seem to be referring to in case(Compare(a,b) is being done INSIDE the bsearch() compiler-supplied function. This is not something you are writing nor have the ability to modify. Furthermore, the reason that -1,0,+1 are NOT specified as the only legal return values for a compare function is precisely so you can write a compare function that returns the element number (or your mother's birthday for that matter) if you wish to do so, provided you managed the sign to also satisfy bsearch()'s requirements of a compare function. I have no idea how you think you are going to embed such interpolation code INSIDE of the compiler-supplied bsearch() call, but your comments seem very confused in general. My guess is you are a student and have little real-world practice to call on. I've tried to be helpful, but really, your comments are largely disconnected, out of context, unrelated and random streams of consciousness.

--Solidpoint 00:43, 3 April 2007 (UTC)


It may be that C (and other languages) might offer range-like conditions for case-matching but this is now a second-level embedment in the vagaries of C alone. I fully agree that a comparison routine should return -1,0,+1 appropriately, however those interested in interpolative searching would be keen on a numeric value for the size of the difference. Thus the usage would be Case(Sign(Compare(a,b))) in the binary search, and Compare would have to be a routine able to compare many (all?) types of data without change of name. NickyMcLean 21:51, 1 April 2007 (UTC)


Nicky, you are sure making my point about why high-level, abstract languages suck. They tend to make you forget what is going on "under the covers". First, if you really knew anyting about C or assembly language you would know to just examine the sign bit in the compare and be done with it. Duh! Second, a binary search, such as bsearch() is not an interpolation search and I don't care what the requirements are for that since that is not the problem being solved here. You seem intent on making some point about a ternery test where none is required. C in fact produces perfect or very near perfect assembly language for compare functions, which it should since they are silly simple, so you are burning a straw man. Finally, bsearch(), using exactly the same syntax, does allow you to search any type of data, but you are so tangled up with C++ and god knows what you keep missing the point. In most cases the calling function simply specifies the appropriate compare function pointer value so that when bsearch() is called in that function's context it already "knows" which one to call. Simple and elegant and the pointer value can be ebedded right in the bsearch() statement so no endlessly distracting C++ lines of code needed to "set up" the call. As for this page, C does have an advantage in being a very sparse syntax so there would not be a lot of language-specific knowledge needed for the reader to get the gist of how a binary search is implemented.
With that caveat, I tend to agree that a definition of a binary search should not be language specific. In theory it could have uses other than those in software, but as a practical matter software is the dominant use and there are already good implementations of this functionality in most languages, including C, so the only thing left to discuss where C and bsearch() is concerned is the compare function. As for the way that C return values are handled in non-C languages, it's a problem that can never be manifest. If you actually sit down an write a C program that can search 5 different data types, including a few struct{}s, you won't need to read this to "get it". In it's defense, the great thing about C is it is a WYSIWYG language. No need to spend hours or days trying to figure out what the heck is going on "under the covers" or why something is taking forever to execute. In C its right there staring you in the face the whole time. Advantage C. --Solidpoint 00:17, 3 April 2007 (UTC)

--Solidpoint 08:33, 31 March 2007 (UTC)

Nicky, Only a real dipshit programmer would deal with data range issues down at this level in the code. Any seasoned programmer will trap those errors at the data entry or file read level and never, ever would let stuff 3-9 levels down like this deal with those issues. I also have to say that I usually read the assembler that the compiler spits out (or library calls) when I write something this low level and the compiler does produce good assembly language - at least in C. I'm just back from the gym and will have to read your compare function, but in general, it is handled by a function pointer in C and the compare function, which can be arbitrarily complex btw, should test for greater, less than and default to equal as it is the least likely case. Using return statements in the compare function are very efficient because you don't have to exit the function in a uniform way, you can just jump back to the caller with the value supplied.

I'm not too clear what you are on about. The reference http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html by J. Bloch clearly describes the problem. Large collections being searched soon exceeded the span of 16-bit integers, but no problem, everyone relaxed into 32-bit integers which are so big that no-one would ever have a collection of more than a thousand million items to be searched. So, in order to save a trivial action, (L + R)/2 was favoured over (R - L)/2 + L in whatever high-level language was used, and this has its effects all the way down to bit flipping and will cause the high-level system to fail. As for the comparison of keys, again the wording of the high-level language has its effects. In this case, if if-statements are to be used, the binary result of what should be a ternary condition forces a wasteful repetition. As I understand it, C does offer a strcmp function that returns a three-state result, excellent, it may be used in a case statement or, via diff:=strcmp(a,b) the result may be inspected via two if-statements without repetition of the comparison process itself. However, strcmp is for comparison of text strings only. In principle, any ordered datum should be comparable, as is offered by syntax involving < and >, but they only offer binary comparison. Thus there is the annoying situation whereby one syntax can be used for many types (integers of various sizes, floating point of various sizes) but another must be used for string variables. One could imagine a compare(a,b) function that deals with whatever types a and b present and returns a ternary state. This is what I was trying to imply with sign(a - b), granting a generalisation of subtraction in this context. In the absence of such a multi-type comparison function (hopefully compiled with attention to context), various different formulations might be tried in a high-level language, with potential problems as described.NickyMcLean 23:01, 27 March 2007 (UTC)

I'm not sure who this should be addressed to, but recursive functions work just fine on Intel processors, which have dedicated hardware for building and destroying stack frames. Also, since the whole point of the binary search is that you don't need very many probes the stack frame will never get very deep. If you are building a database engine you might want to optimize with a loop, but the performance hit for recursive functions is not that great. I would encourage you to write a "do nothing" function that calls itself and benchmark it - assuming the compliler isn't so smart as to optimize it into a loop anyway, you will have a better feel for how much performance is suffering. --Solidpoint 07:18, 10 March 2007 (UTC)

Where does the name come from?

Why is this called a "binary" search? What's binary about it? Is it that binary means "can have two values" and here the key can be in one of two places (below or above the midpoint)? Or something about splitting the array in two, and two == binary? Seems pretty weak, but I can't think of a better reason. --Lkesteloot 04:02, 28 June 2006 (UTC)

It's because it splits the array in two. That's really it. Deco 04:29, 28 June 2006 (UTC)

This page needs some serious work

Would anybody object if I gave this page a thorough working over? There are so many ideas mixed together here, and many which are missing, or left unexplained, that, on the whole it's just pretty awfull. I've been a professional software engineer and programmer for 25 years or so now and wrote a database engine a few years ago where I used this algorithm extensively. I already know what this thing is, so I don't really need a page on Wiki to tell me about it, but I am willing to have a go at fixing it so long as you are all willing. --Solidpoint 07:07, 10 March 2007 (UTC)

Why not? (Presumably you mean the article, not the the discussion) - I've tried to present a simple and clear pseudocode, but its non-C nature offended the eyes of some C-lovers and was promptly converted to (incorrect) C, followed by a series of fixes to the mistakes that others then altered by further fixes that created errors (and I bungled one too, alas), and I've given up trying to check that today's version of the pseudocode actually does perform a binary search of elements 1 to N of an array as declared in the text, that it reports "not found" as appropriate, that it succeeds even when N = 0 or 1 as well as more sensible ranges, and it doesn't fail for large N because it doesn't use (L + R)/2. Good luck. NickyMcLean 22:19, 27 March 2007 (UTC)

Nicky, Having thought about the scale problem of (L+R)/2 I think this is a serious problem that should be addressed - a good catch. Before I tackle this I need to get an old machine plugged in, up, and running and don't have time for that right now. I think it would be great to have a good C code here, and one that works. I could also just cut and paste the commented code from Microsoft's C library, at least if I could get permission. A better approach might be to describe a test data set that will quickly test for known failure modes. This kind of low-level functionality really should be written in C or assembly language so we should step up to the plate and get a good C function going. --Solidpoint 07:54, 31 March 2007 (UTC)

Actually, it was MuthuKutty who triggered my memory. I recall using the L + (R - L)/2 form and being criticised for it by fellow students who advocated (L + R)/2 because it was obviously faster. Since we were all using a system with 39-bit integers the overflow problem was not near and I was encouraged to not worry about something that couldn't happen. On a different system I later got away with using the (L + R)/2 form with 16-bit integers, because the arithmetic was being conducted in 32-bit registers. At the time I was quite conscious of the array size because the compiler at first would not allow arrays to be declared with a bound larger than 32767 (so the maximal array allowed had bounds -32767:32767) even though 32-bit arithmetic was being used. NickyMcLean 21:30, 4 April 2007 (UTC)

Having a go at it

Well, I decided I could at least make a dent in this, so I started at the beginning. If someone can tell me why the page is in Purgatory I will leave it alone. Otherwise this thing is just too ugly to suffer upon the world. :D --Solidpoint 01:52, 3 April 2007 (UTC)

moron's view point

Say I have a list huge, and even if I do know the size of the list, one would generally match patterns in the list.

Take this case: I have 1 million names on a list and I want to find "bob" in it. I just go around looking for "b" randomly, if i see nothing but lots of z---g in front of me, i get to g, go down and so on. In other words would simply hitting random numbered indices far apart be worse off than going down by 1/2 ?

So having a second index helps, does it not?

One could also divide the set N into 3 equal parts and keep dividing that further into 3 equal parts and so on. think of a square and you need to find a point in the square, you can split the square into 2 equal parts and keep splitting till you find the point, you can split the square into 3 equal parts and keep splitting further, so you get N/3 N/9 N/27 and so on, much better than N/2 N/4 ... ?

—The preceding unsigned comment was added by 220.227.207.194 (talk) 11:08, 4 April 2007 (UTC).

Well, better would imply you knew the cost of these additional entanglements. My gut tells me that if you did the performance would be awful. Back in the day we spent an enormous amout of time working these things out so I doubt there are any big payoffs laying around to be scooped up by a noob. It's a lot of fun trying though, if you have the stomach for a lot of dissappointment. If you can write a solid, self-balancing Red-Black tree you will have a much better idea of what motivates my comments here.--Solidpoint 05:06, 13 April 2007 (UTC)


http://en.wikipedia.org/wiki/Ternary_search so ternary exists too and seems to be faster than binary..


If you have a million items in a sorted list, the binary search process will report a result of found here, or not found, in no more than twenty probes because 220 > 1000000. For large lists, the storage may well be of slow access (a disc drive) either as an explicit disc file, or via virtual memory. In this case, it might be worthwhile to keep a small second index table of a dozen elements or so (small enough to be in fast memory) to be searched first to find a good starting range for the binary chop. There are all manner of variations possible in general, that might suit a particular case especially well. But, for an unchanging list, the initial probes of the binary search will always be to the same elements (the first is always at index N/2, then the second probe will be at either N/4 or 3N/4, etc.) and the disc file buffering scheme (or virtual memory controller) will be keeping frequently-accessed data in fast memory. So, let it do so and avoid the bother of replicating its activity with regard to your data.

A random starting point might better be replaced by starting where a previous search succeeded, or you might want to consider the Interpolation search as a possible process. Splitting the span into thirds is good, and that would perhaps be called a ternary chop or ternary search, but, how is your method going to select the right third to proceed with? With a test at the 1/3 element and the 2/3 element? But in two tests, the binary search will reduce the span by a factor of four, which is better than a factor of three. Anyway, this article is about the binary search. NickyMcLean 21:14, 4 April 2007 (UTC)


Well, some thoughful ideas, but I can tell you from real-world experience that the binary search's KISS approach is savagely fast. Most of it's speed is achieved right at the start where it is discarding huge numbers of list entries in a few steps. Imagine, for example, that you had a list with every person on earth's id on it. In the very first split it would rule out over 3 billion list members! Microsoft's lib code for C actually ends the bsearch() by executing an lsearch() as the tail end of a binary search isn't very effective. All kinds of approaches can be construed, but if you really want to evaluate new ideas you have to build a benchmark and find out how many clock cycles it takes a reference processor to execute your idea once distilled down to code. A good example is the Interpolation search. It has a ton of intellectual appeal, but I have never been able to get it to perform well in the real world on even favorable data sets. Hash tables used to rule, but they perform badly when they hit real-world hot spots like a list of names with "Smith" or "Jones" or "Gomez" in them.

Your Smith and John analogy makes no sense whatsoever in this context. Binary search has been mathematically proven to be inferior, try it on a sheet of paper in front of you, if not on a x86 comp. it is the architecture of the comp that limits you today. you never look at half the list etc, you always tend to look at random points. Even when looking up words in a dictionary. It would be crazy to look at "half way in the dictionary" when looking for a word. now if you had 3 registers to do a cmp with, you would tend to look at 3 points, on a window register case life would be fantastic. —The preceding unsigned comment was added by 220.227.207.194 (talk) 06:37, August 23, 2007 (UTC)

Personally, I keep wanting to put a search decision matrix up on this page to tell noobs when to use what kind of search or associative data structure.--Solidpoint 05:06, 13 April 2007 (UTC)

This would belong at search algorithm, not here. Disk-based searches do use more sophisticated algorithms; see database index. Dcoetzee 20:10, 11 May 2007 (UTC)

Don't delete the recursive implementation

It appears that someone decided long ago to delete the recursive implementation, which I consider illustrative, and in languages with tail-recursion elimination is even efficient. I'm going to add this back. This is not mainstream C on mainstream platforms, it's an article explaining how an algorithm works - stop microoptimizing, please. Dcoetzee 20:54, 2 May 2007 (UTC)

I also moved a bunch of details regarding three-way comparisons to their own article, three-way comparison, eliminated irrelevant differences between the implementations like variable naming and (some) syntax, and hope to condense this article more in the future. Dcoetzee 21:36, 2 May 2007 (UTC)

Problem with the code example

I have changed one of the code examples:

 high = mid; /* Should be high=mid and not high=mid-1 */

is not correct; it should be

 high = mid - 1;

The original code could enter an infinite loop - consider a one-element array (low = 0, high = 0) where A[0] equals value.

The new version can be proved to be correct; I am not sure whether it would be appropriate to include the proof here.

ICrann15 08:35, 14 July 2007 (UTC)

We should always test our code

I was so hoping that you were right about high=mid-1 because it would mean it was slightly more efficient. But unfortunately it isn't correct.

The previous code line

   high = mid;

(which I've just restored it to) never did enter an infinite loop and it has been thoroughly tested. With your code (high=mid-1), having an array of [0, 1, 2] and searching for 1 didn't work. (Forgive me if my tests or testing code is wrong -- that would be really bad!)

Here is the correct code (in the form I'm using):

   low = 0;
   high = length - 1;
   while (low < high){
       mid = (low + high)/2;
       if (A[mid] < value)
           low = mid + 1; 
       else high = mid;
   }
   if (A[low] == value)
       return low;
   else return -1;

So a one-element array would make low = 0 and high = 0, so the while-loop would be completely skipped and no infinite loop would occur.

I have added an interesting section (interesting, but not because of me -- I don't intend to sound pompous) called "Correctness and Testing", which explains how difficult it is to correctly code a binary search: most if not all programmers have done it incorrectly at some point. I've added my own testing code that thoroughly tests a binary search. It probably should have been added a long time ago.

I do like the extra advantages that you pointed out, though.

For anyone reading this, let's try to make this a better world (especially the programming world) by using the scientific method and proving things, and not just passing around code we get from somewhere else. It would be great if all of the incorrect binary searches (and such) were tested, forgotten, and annihilated, but incorrect versions have always been around (at least since 1946 when binary search first appeared, according to Robert L. Kruse's book "Data Structures and Program Design in C++", page 280, as cited on the binary search wiki page). I searched several different web sites for "binary search", and almost all of them were copy-and-pastes of what's on wikipedia, and yet they weren't tested. Let's not take everything for face value. It seems we are drowning in information but are somewhat thirsty when it comes to actual facts. I believe in proof and testing! --75.165.249.107 10:38, 15 July 2007 (UTC)

Unfortunately, "proving" the correctness of a method's implementation merely shows a direct connection between the assumptions and the result. Less obvious are the implicit assumptions. I fully agree with the need for tested code, but there seems to be nothing that prevents people from making all manner of fiddles that turn out to fail. Much of the time, I suspect that the algorithm is in error, but I've lost patience with inspecting the latest change. Any corrections are soon enough uncorrected. The tests you mention above are good (it is always good to try the null cases, here zero elements to search, or one), but the trial fails to check for the problem first raised by MuthuKutty, namely the common use of (Left + Right)/2, which in 32-bit arithmetic works fine for all anticipated array sizes, until that is the size increases to 2,000,000,000 elements and integer overflow is possible, though not certain: 64-bit arithmetic might be in use in the computation even though 32-bit variables are involved. Just as 32-bit integer arithmetic can be done even though 16-bit variables are in use. In other words, it is implicitly assumed that overflow doesn't occur. Alas, exhaustive (brute force) testing is rarely possible. NickyMcLean 21:18, 15 July 2007 (UTC)
NickyMcLean, you are right, but I have only been concerned about the correctness of the actual algorithm and not the computer architecture or platform; I think they are kind of out of the scope of the binary search algorithm itself, though it is important to check. Other threads might mess it up, or overflow may occur, but before that is worried about, one should verify that the basic algorithm is correct. By the way, my testing code purposely started with an array of length=1 instead of length=0; I thought the "new" operator didn't work for 0-length arrays, but it looks like it does. It looks like you would have to throw exceptions or make sure your index didn't access a 0-length array.
I added the "Correctness and Testing" section in the hopes that people would carefully consider and test the page's code and their own code before making rash decisions and wrongfully changing the page's code. It is possible that some code may be absolutely correct under certain conditions but incorrect under other conditions (e.g. the addition of threads, or overflow), and some people might change what is actually good code because it doesn't work with their conditions; if that happens, I recommend that the codes be verified given certain assumptions/conditions, then perhaps adding a separate set of code for those certain conditions. Perhaps a new section should be added for threads, overflows, etc., and directions on how to test binary search with those conditions, but that is not my expertise. Someone else do it, please.
I also wish that twiddlers would test, but alas they don't, in part because the supplied code is not exactly in their favoured language, and, even if it were, all that is supplied is a code fragment with no example testing code such as you advocate so preparing a test is inconvenient. In the Fortran world, a long-established practice is to present a subroutine that does something (such as compute cubic splines) along with a commented-out main programme that invokes it with known data and tests for the correct results. Simply by removing the comment markers, a test programme was ready to be compiled and run. This was especially relevant when to use a published prog. it would have to be typed in and idiot mistypes would be overlooked. I well recall entering a fast fourier transform with factorisation (thus not just N = power of two) and in the proofreading with a colleague I deliberately misread some statements and he continued to say "yes", "yes". When presenting pseudocode for the odd wiki article I'm moved to contribute to, I have always taken an actual working programme, and then reshaped it lightly into pseudocode to hide the language-specific drivel and bring out the algorithm. This results in non-C code that others immediately revise into C, and while they're at it, errors are often introduced as variant forms are mixed together.
With regard to the binary chop, the case N = 0 should indeed work, to save special initialisation code when accumulating a list. Thus,
if NotFound then AddNewEntry;        

rather than

if n <= 0 then PlaceFirstEntry          
 else if NotFound then AddNewEntry.

NickyMcLean 21:16, 16 July 2007 (UTC)

In fact, all the wiki programming algorithm pages probably should have "Correctness and Testing" sections for the basic conditions and for the more complex conditions. --75.165.249.107 22:47, 15 July 2007 (UTC)
This is part of why programs on my LiteratePrograms wiki usually can be downloaded and compiled and usually do include testing, because of this sort of problem. Dcoetzee 21:45, 16 July 2007 (UTC)

Well, it took a month for someone to decide to change (R - L)/2 into (R + L)/2 in the fortran example (added 26'th June) and thus wreck the routine. Perhaps fortran discourages fiddling? Certainly the discussion did not discourage that editor. Hopefully, the added annotations will now discourage further ill-thought changes. NickyMcLean (talk) 03:39, 26 July 2008 (UTC)

Because this has been an issue here, I will implement the same preliminary system being tested in quicksort to help prevent breaking changes. Dcoetzee 00:13, 29 July 2008 (UTC)

Code example

Apologies for the previous change. I had misread the loop condition as low<=high, which does require mid-1. (This is how I usually code this algorithm.) With the condition as low<high the code will terminate because mid is always less than (low+high)/2 unless low and high are the same. Sorry about this!

ICrann15 11:35, 16 July 2007 (UTC)

Ah well, I also have made just such a mistake. NickyMcLean 21:18, 16 July 2007 (UTC)

The "equal elements" section is horribly wrong!

The "equal elements" section states that the binary search will stop immediately when it finds an item equal to the one being searched, and that if there is more than one such item in the list, it's basically random which one of them is found, and that if for example the first one of them must be returned, a linear search must be performed for the equal elements.

This is horribly wrong. Sure, you can do it like that, but it's perfectly possible and quite trivial to implement the binary search in such way that it will not stop immediately when it finds an element equal to the one being searched, but continues until it has narrowed the search to exactly one element. This one element could be, for example, the first one of the (possibly) equal elements in the list (the other possibility is that it returns the last equal element in the list). No linear search of any kind is required. The search will still be purely O(log n), but it will always find the first element if more than one element equal to the searched one exists in the list. For example the std::lower_bound() function in the C++ library works like this.

Suggesting a linear search is a horrendously bad idea. The worst-case scenario is that the algorithm will perform n/2 comparisons, making it linear time, completely destroying the speed benefit of binary search. It's perfectly possible to do the O(log n) search even if there's a very large amount of equal elements.

85.194.211.152 13:40, 31 July 2007 (UTC)

Performance note

For in-memory searching, if the interval to be searched is small, a linear search may have superior performance simply because it exhibits better locality of reference.

Is this true? If the array is small, then it fits in one or two cache lines, and once cached, you are not going to miss it again regardless of what order you probe the elements in. I suspect the real reason that the linear search can run faster for a very short array has to do with branch prediction on pipelined archictures: if you sequentially equality-test each element, the equality test will usually be false, and the branch predictor will be wrong only once, when the item is found. But for a properly-implemented binary search, you search the left half or the right half of the remaining elements with equal probability, the branch predictor is of no help, and half the branches invalidate the pipeline. —Preceding unsigned comment added by 71.150.253.46 (talk) 12:43, 8 November 2007 (UTC)

on paper , or when looking up a telephone book i think id do a random search —Preceding unsigned comment added by 220.226.37.155 (talk) 06:09, 6 January 2008 (UTC)


General problem with suggested primary algorithms, both recursive and iterative solutions

The description of the binary search algorithm is correct, but the shown examples are broken. The problem arises from the index range of the examples [0 to N-1] for a N value sorted Array to search and the calculated next indices.

As can be tested by the given sorting test program (which has to be modified a little if standard C is used instead of C++ <sort routine also needes the number of elements>), the start Index for an N element array search using the described algorithm has to be 1 and not 0, also the associated end Index has to be N and not N-1.

This is a VERY UNOBVIOUS but ESSENTIAL difference!

Reason of this problem:

The described and commonly used (and so far correct) algorithm is described in D.E. Knuth
Part 3, 6.2.1 [Algorithm B(Binary search)] with all element indices based to 1.
The elements of the array to be searched are numbered from A[1]..A[N].
It is obviously tempting to 'shift' down the index range from 1..N to 0..N-1 to ease access to the normally zero based search array, so to access it as A[0]..A[N-1].
But this leads to incorrect results/malfunction due to the fact that the array midpoint estimation index function is defined as:
NextIndex = floor((TopIndex + BottomIndex)/2).
The floor function introduces a kind of 'nonlinear' behaviour. The 'correction' of this estimation function is done by the result of the element comparison:
- if current element's value is the searched one, then found element and terminate.
- if current element's value is below searched value then let TopIndex = NextIndex - 1
- else let BottomIndex = NextIndex + 1
if BottomIndex > Topindex, no match has been found.
When 'shifting down' the index values range from 1..N to 0..N-1, the Midpoint estimation index value also shifts down, but only by an amount of 1/2. This leads to the fact, that there will be sometimes an access to an array element at A[-1] (expressed in indices 0..N-1) that may/will not exist (see discussion at D.E. Knuth).
One possible modified solution to keep the algorithm working as desired AND to access the array elements starting at index 0 is:
Initialize:
- Set BottomIndex = 1
- Set TopIndex = N
SearchLoop:
- if (BottomIndex > TopIndex) : all elements tested, no match found, terminate (state: not found)
- Calculate/Compare:
NextIndex = floor[(TopIndex + BottomIndex) / 2] - 1 // now Access index is rel. 0
compare A[NextIndex] to given Value:
. A[NextIndex] == Value: found match, terminate (state: found)
. A[NextIndex] < Value: let TopIndex = NextIndex // Note: -1 already done at step ahead
. A[NextIndex] > Value: let BottomIndex = NextIndex + 2 // Note: +1 -(-1) from step ahead


Gerhard Oed, HOB —Preceding unsigned comment added by 89.51.228.110 (talk) 22:36, 15 April 2008 (UTC)

I think you've misinterpreted something, because floor((max+min)/2) will never, ever, ever give you a negative value (discounting overflow errors). --117.53.136.79 (talk) 14:23, 12 December 2009 (UTC)
To clarify a bit... If max and min start out non-negative, then (max+min) div 2 is non-negative. Hence mid is non-negative. And if max and min are only ever assigned to mid and mid+1 respectively, they will remain non-negative. The only way you'll ever get a negative number to appear in here anywhere is if N=0 (and hence max = N-1 = -1); in this case, all indices are invalid, so the fact that it's negative is irrelevant. --117.53.136.79 (talk) 14:51, 12 December 2009 (UTC)

Relevance of new additions

I'm discussing here the newly-added additions.

In short, I'm not sure the majority of this material really improves the article, for the following reasons:

  • The formalised notation and the "That it works" section are somewhat heavyweight, and IMO, somewhat unnecessary as a proof for what is really quite a trivial (and intuitive) mechanism.
  • The "That is fast" section is redundant, as the algorithmic complexity is already covered in the original material.
  • The "Extensions" are trivial extensions, and therefore (IMO), not really worth mentioning.
  • The "Computer usage" section is completely non-specific; issues of return values, number-range limits, and floating-point special cases apply to almost any algorithm that must be run on a computer. The issues of the three-way comparison and overflow of (high+low)/2 are already covered in the original material.

Incidentally, there was material already present that I'm not sure is really suitable for an encyclopaedia article, such as the "Testing" section. IMO, some serious editing needs to be done.

I welcome comments. Oli Filth(talk) 21:56, 3 June 2008 (UTC)

Well, the method is simple, and its working is indeed intuitively clear for anyone searching a dictionary, under/overshooting a parameter in some process (e.g. an artillery shell overshoots/undershoots the target, change the elevation and charge appropriately) because the human intellect is active. But if you look at the stuff on the computer implementation you'll see that getting the details correct for a computer version is mistake prone and many people (including me, alas) have made mistakes, and the various pseudocode/C code implementations offered in the article are frequently wrong, corrected, miss-corrected and re-corrected so that at all times, a given version is in doubt. (Same with Quicksort, for that matter) and here the petty details of indexing arrays are frequently confused. Thus, I was thinking of developing the formal proof by noting that test programmes to assure the correct functioning of a binary search routine are in fact unreliable, unless they test all possible parameters, and this is of course impractical in time. Thus the need for emphasis on the difference between arithmetic (as used in the proof) and computer arithmetic (as used in a programme) and the resulting attention to the introduced problems. For my part, I'd never have imagined that a "NaN" value was such that x = x would return false, and would never have imagined a need to prepare for such considerations in a test programme. The issue of formal proofs of correctness of actual computer programmes is a large subject. I've spent the last two days in bed with a sniffle, and realised that I'd omitted mention of a detail from the proof. How now to know that a proof is correct?
That it is fast is not redundant in principle, but I ran out of time and patience as it was. The actual behaviour of the method is quite interesting. I agree that the article sprawls, and a regrouping would help. The extensions are trivial to those already familiar with computer programming, but an encyclopaedic article should have encyclopaedic coverage. I'm not too happy with the use of A(i).Key rather than just A(i) as is usual in the toy examples, but wanted to have the code's extension to real usage more easily seen, though in discussions not yet developed far.NickyMcLean (talk) 21:17, 5 June 2008 (UTC)
I think, in general, formal symbolic proofs of correctness are too low-level for inclusion in Wikipedia articles (except where we are explicitly demonstrating such a formal proof). A high-level proof of correctness here is succinct and perfectly rigorous by the standards of mathematicians. Spurious corrections by people with minimal understanding of the algorithm are an eternal issue that will not be discouraged by such additions (these people typically would dismiss such a proof). Dcoetzee 22:20, 5 June 2008 (UTC)

Appalling article

As a long time programmer I was appalled at this article which is ridiculously long for such a simple and somewhat obvious technique. It makes me weep for the lack of abilities of so called professionals.ken (talk) 21:38, 1 July 2008 (UTC)

Though to be fair, the article discusses more than just the algorithm. But a simple image (as opposed to the flowchart) would probably explain more than a thousand words. Shinobu (talk) 05:07, 10 September 2008 (UTC)
A "simple image" (not the flowchart), of what? A picture of an array's elements with some arrows? A sequence of such images? A movie? How would such an exposition indicate and warn of the error-promoting details, especially the "decades-old" error commonly found. And which nonetheless is still in the C-code examples (presumably prompting the recent tag "Contradiction"), that I'm not messing with as I despise C-style syntax. NickyMcLean (talk) 21:01, 11 September 2008 (UTC)

Ewww

What sources use "binary chop" rather than "binary search"? Such an ugly term... Is it really mainstream? 65.183.135.231 (talk) 13:32, 20 October 2008 (UTC)

A quick google search only shows ~9,000 results for "binary chop", but ~600,000 for "binary search"... 65.183.135.231 (talk) 13:33, 20 October 2008 (UTC)

Very important: Binary search algorithm is classical algorithm

There are a lot of algorithms in Wiki, but we could not find too many algorithms which would be so important as the bin. search algorithm! IMHO, any such algorithm should be specially marked - perhaps, special Category: classical algorithm should be made for this. First of all, it would be important for beginners in programming areas, who have to distinguish well-known classical algorithms from others (new original unproved algorithms, new heuristic algorithms and so on). I do not think that "classical algorithm" should be defined specially, this "term" is very obvious for use.--Tim32 (talk) 22:26, 19 January 2009 (UTC)

Possibly so, however this article is resuming flux. The pseudocode examples (of versions that do not correspond to the flowchart) have been moved and their "contradiction" tag omitted without the contradictions being cleared, and the internal links to the different examples no longer function. The only code that does correspond to the flowchart is now further away. NickyMcLean (talk) 00:07, 20 January 2009 (UTC)

The implementation in pseudocode is better than in real well-known language?!

Recently I suggested the following:

Standard Implementation in Standard Pascal< ref>Niklaus Wirth: Algorithms + Data Structures = Programs. Prentice-Hall, 1975, ISBN 0-13-022418-9[1]</ref>

 
  i := 1; 
  j := n; {array size: var A : array [1..n] of integer} 
  repeat 
   k := (i+j) div 2;  
   if x > A[k] then 
    i := k+1 
   else 
    j := k-1; 
  until (A[k]=x) or (i>j);

This is much more better than current example, because:

1) it is real, standard and well-known language

2) it was reproduced from well-known classical book

Perhaps, current pseudocode may be faster, but it has not sense! Our goal to demonstrate the standard version of the algorithm rather than to talk about its possible improvements or original modifications. --Tim32 (talk) 09:47, 24 January 2009 (UTC)

Verbosity

I'm still concerned that this article is way too verbose for an encyclopaedia. An article of this density is going to be difficult to read for the layperson (or indeed anyone! - I'm well versed in this material, and find the article tiring to read because of its verbosity). A good encyclopedia article should be selective in the material it introduces; attempting to cover everything in detail is bound to fail and leads to articles like this.

Yes I agree. Whoever wrote this article believes a scholarly tone and opaque prose is more important than an informative and clearly written article. I don't think it's a bad article, but it's very muddled. —Preceding unsigned comment added by 134.68.132.40 (talk) 18:31, 23 January 2010 (UTC)

Just as a token example, the section That it is fast is ~400 words long, but there's really no need to say anything more than the first paragraph.

As another example, the Design section is introducing needless complexity and abstraction, and could safely be removed. Similarly, the content of the Numerical difficulties section is general to all computer algorithms and routines; to spend 3 paragraphs of article real-estate here seems like needless bulking.

Essentially, I want to reiterate my original concerns above. Since then, the article only seems to have got more and more heavyweight and abstract. Oli Filth(talk|contribs) 13:12, 24 January 2009 (UTC)

I've already removed a couple of sections which I believe are completely outside the scope of this article. At this point, I would normally just start rewording the "worst offenders" to reduce their verbosity. However, I'm aware that some people have worked hard on this article, so I don't want to start treading on people's toes unnecessarily. Any thoughts before I attempt a re-word? Oli Filth(talk|contribs) 13:34, 24 January 2009 (UTC)
Encyclopaedic (or ~pedic, for some) ..."very full and thorough; comprehensive" At the one extreme, just state the algorithm, take it or leave it. Like, just read the first paragraph. But that's not enough, not one of full thorough or comprehensive. And it won't last long, because someone will change it at once, and someones can and have. For example, everyone, and I mean everyone, insists on using the (L + R)/2 variant, as is the case in all the three C-language versions that no-one can be bothered to correct (I've already explained why I won't), and in just about every textbook as one reference mentions. That this usage is wrong and will cause failure, and its justification (of being swifter) is likewise wrong anyway, should be explained at least until the whole world gains satori and everyone already knows everything without lengthy elaboration. Until then, explanations are needed and moreover are linked to in the one example that doesn't use (L + R)/2 to dissuade those who would change it (and have changed it) to (L + R)/2, which link is now dead. It is evident that all readers do not in fact understand the method so either a) they think hard and attain understanding on their own behalf without consulting the article, or b)consult an article that does explain the binary search that moreover, confronts wrong or confused ideas that are commonly held. Rather than remove stuff, it would have been more helpful to correct stuff, such as the three C-examples that prompted one editorbeing to add the "Contradiction" tag that has since vanished along the way. NickyMcLean (talk) 19:52, 25 January 2009 (UTC)
Are you talking about changing (L+R)/2 into L+(R-L)/2? If so, I can happily do that!
However, as I've said before, having the article get overly concerned with platform-specific details such as numerical limits is out of the scope of the article, in my opinion. It's the job of the programmer to correctly convert the abstract algorithm into a working implementation on his particular platform. For instance, I write DSP code for a living, typically on a 16-bit platform. I have to consider the possibility and implications of quantisation, overflow and saturation on almost every line of code I write! But I wouldn't expect an article on, for instance, IIR filters or the FFT to concern itself with such matters.
In this particular case ((L+R)/2), perhaps an explanation is appropriate, as there's been some relatively notable examples of getting it wrong (e.g. the Java library). But this can be clearly and completely explained in, at most, two sentences! What's more, a concise exposition is much more likely to be read and understood. Oli Filth(talk|contribs) 20:40, 25 January 2009 (UTC)
We keep going over the (L+R)/2 thing over and over again. I think it's appropriate to discuss it, but I'm not convinced the "fixed" calculation should be used in all of the pseudocode - after all, it's pseudocode, it conceptually uses arbitrary-precision integers, it doesn't have to worry about matters like overflow that are incidental to how the algorithm actually works. I'd rather see it discussed in one localized part of the article and the simpler, more obvious calculation used elsewhere. It's also rarely noted that the simpler (L+R)/2 calculation is valid for L and R up to 2w-1, where w is the word size in bits, which is more than sufficient in many circumstances. Dcoetzee 02:33, 26 January 2009 (UTC)
There are two points, not one. Firstly that the (L + R)/2 approach will have trouble in actual usage (and leaving this in pseudocode with "arbitrary precision" as in mathematics is a trap for those who don't read the article, no matter how concise, as many do not read the comments attached to the pseudocode either), and secondly, that the usually-offered justification that it is faster is wrong, for reasons that were explained. Writing L + (R - L)/2 means that the following if-statement requires a subtraction of L, a point that was explained. Obviously, Oli didn't notice that in his happiness to replace one form by the other. A third point could be added, that this is yet another case wherein choosing a particular form because its expression is brief, being fewer symbols, or fewer lines of code or has other aesthetics such as apparent obviousness does not therefore lead to the best result whether that be speed of execution or smaller code size or both. Or even reliable operation. Erk! More verbosity. So are we to present merely a sequence of bald assertions with no reasoning? Very concise, but is it helpful? All of mathematics can be developed from the thorough contemplation of a handful of pebbles, so should a mathematics textbook contain just an image of that along with the exhortation BEHOLD! (An Indian mathematician offered a proof of (a + b)2 by showing a diagram) NickyMcLean (talk) 19:47, 26 January 2009 (UTC)
"Requires"?? From a quick perusal of the (extensive!) talk-page discussions above about the midpoint calculation, I can only see mention of this idea in the Talk:Binary search algorithm#Integer Overflow section. If that's what you're referring to, then that isn't a requirement, it's an attempt at optimisation by pre-empting the compiler. If that's not what you're referring to, could you point it out? Oli Filth(talk|contribs) 20:21, 26 January 2009 (UTC)
In languages with fixed-width integers (e.g. most mainstream languages today) some modified form of calculation is required on sufficiently large lists to avoid overflow in the calculation (L+R)/2. I don't contest this, I just think it confuses matters further to introduce this concept at the same time that the fundamentals of the algorithm itself is introduced. Pseudocode is not source code, it is not ready to run, it's for conceptual purposes of understanding the algorithm. That said, I don't care all that much. Dcoetzee 22:38, 26 January 2009 (UTC)
I understand L+(R-L)/2 vs. (L+R)/2, but I don't understand NickyMclean's comment above that "the following if-statement requires a subtraction of L". Incidentally, I agree (on the whole) that this article should very much be grounded in pseudocode, not implementation-specific/platform-specific details. Oli Filth(talk|contribs) 22:41, 26 January 2009 (UTC)
Ahem. Insufficient verbosity. (See, I am capable of it) If you look at the fortran version, see that it first computes P then tests for the span being empty in the following statement, then completes the preparation of P in the third statement. In the three c-language versions, the span is tested then mid is calculated. I should have made that explicit. Note however that the fortran version tests the span via the just computed value of P against zero, whilst the C-versions test the span by having to calculate a subtraction, because the issue is not against the special value zero. All this was made verbosely explicit in the currently vanished section through the use of pseudo machine code... Specifically, notice that the C-versions compute R - L twice per iteration, once explicitly in (high - low)/2 and once implicitly in high < low. Indeed, I think I recall that this arrangement was made clear to me at least in the 70s when I was coding a binary search in IBM1130 assembler, and the adding of L and later the subtracting of L was seen to be avoidable by a rearrangement, and that arrangement could also be expressed in a higher-level language. Now this is itty-bitty detailery, but, 1) compilers are quite unlikely to generate such code for other formulations even if they could in principle, and 2), the administration of an iteration of a binary search is so small, that a single additional operation can make a difference, noticeable over hundreds of millions of executions. NickyMcLean (talk) 04:18, 27 January 2009 (UTC)
Performance concerns are, frankly, absurd. Binary search exhibits horrifically poor cache locality, and as a consequence spends orders of magnitude more time accessing RAM than computing indices on any modern machine. There is only a correctness concern, not an efficiency one; and I believe the correctness concern should be deferred until the basics of the algorithm have already been clearly presented. Dcoetzee 04:48, 27 January 2009 (UTC)
Agreed. Over-emphasising performance optimisations (that may or may not beat the compiler) at the expense of reducing the clarity or intuitiveness of the exposition of the core algorithm is probably not a good idea. We really want the core pseudocode examples to match the algorithm description as closely as possible; introducing transformed variables doesn't help that.
Similarly, introducing inline catches for "perverse" things like N < 0 (c.f. the Fortran example) is inferior to having an explicit check, as it only serves to muddy the clarity of the example. I'd even go so far as to argue that sanity checking for illogical inputs is outside the scope of the article entirely; that's the job of the implementer. Oli Filth(talk|contribs) 15:24, 27 January 2009 (UTC)
Performance issues are central to the choice of the binary search over say the linear search, even though the latter would have 'good cache locality', an absurd concern for searches, since even if only one probe was made in a search (the correct one, as with a hash table) the probed item has to be accessed and if the search key is random and the search space large, "locality" won't help at all. Unless as with a linear search and blocked storage, the sought item was in local memory because the previous failed comparison fetched that block. If a block containing many records has to be loaded to acquire one, its extrema might as well be investigated before going for the next block but this is not a simple binary search (or whatever) method and is not being discussed. If the entire search area is already present in fast memory then locality is no longer an issue and the number of probes (and less so, administration) remains dominant and, given fast access, administration wastage becomes more noticeable; otherwise swamped by the time needed to fetch a record that might be off a disc file on some far-distant computer connected by a slow and noisy link. However, special purpose op-codes may still prevail for suitably small searches, and this is indeed an issue for specific machines, etc. as has been decried and enough of it.
Correctness concerns are indeed deferred until the basis of the algorithm have been presented; note the ordering after the introduction of the flowchart: the proof of correctness, and then performance. Floundering continues. The depiction of the algorithm in the flow chart is not followed by the c-code versions. The flowchart algorithm obtains for free avoidance of the repeated calculation of (high - low) and also the integer overflow problem. But at the price of not conforming to the familiar form endlessly repeated in textbooks and code repositories and other web pages. I agree that the merged computation of the probe position and checking for an exhausted span requires a little more cogitation for comprehension, but this is mostly unfamiliarity: by contrast the c-versions separate these concerns that in the flowchart are merged and explained in the proof. No explanation for these deviations are offered. Why? There's no need, because it is obvious, because it is familiar. It is precisely because of the mass familiarity that someones mass to convert descriptions back to approximate the form familiar to them (even if the resulting chimera will fail), and so explanations for the deviationism were devised to discourage at least those who bothered to read them, which explanations were removed as being unnecessary and verbose. Further, that the justification for the familiar form as being simpler, was in fact disadvantageous for computer execution struck me as interesting but some call it even more superfluous verbiage. The added check that would catch N < 0 were it to be perversely presented was also available for free as was briefly described, and general checking was hinted at as well - those who wish could indeed incorporate additional checking, but exploration of that large subject was obviously (even to me) not important for this article.
In brief, rather than present a series of unexplained assertions, a thorough (etc.) article would explain the avoidance of the usual (L + R)/2, precisely because it is contrary to the usual and that the usual justification for it as being simpler is wrong in terms of computer action as well. Hopefully, the someones would then hold off on their reversions to familiarity. (A thin hope, I know) Those implementations of the algorithm that deviate from the flowchart (repeated calculation of high - low) should have their deviations justified. A more concise description than I managed would be good, but simply excising points, though effortless, is not good. NickyMcLean (talk) 21:21, 27 January 2009 (UTC)
Wow, I think I preferred your insufficient verbosity from earlier! I think we could end up arguing at cross purposes for a while, because I think we fundamentally disagree on what the purpose of an encyclopaedia article is. In short, in my opinion the raisons d'etre for the article are, in decreasing priority:
  1. Describe the algorithm
  2. Explain the algorithm (how it works, why O(logN) is a good thing)
  3. Anything else (starting with the most general problems, like cache locality)
i.e. the classic inverted pyramid.
Therefore, (IMHO), we should start with the absolute simplest and most intuitive version of the algorithm (flowchart or pseudocode). This would involve:
  • Start from the array limits, not +/-1. It's more intuitive, and if it's what Knuth does...!
  • Variables should represent the most intuitive concepts, i.e. mid, low and high
  • All calculations should be performed using the most intuitive method, i.e. mid = (low + high)/2
  • No indirection or complication of syntax by invoking .key, just simple array/list access
  • No cluttering with sanity checks (implicit or explicit)
Most of the sub-sections in the Binary search algorithm#Variations and Binary search algorithm#The algorithm sections could probably be safely reduced to two or three sentences each, because they currently read more like blog articles than encyclopaedia articles (no offence intended...), especially the section you've just added back.
Anything to do with sanity checking should simply go, because sanity checking is a basic premise of good programming practice, and this article is not here to teach people how to program. Similarly, anything to do with micro-optimisation
I'm going to start this piece by piece, and per your request, I will try to minimise simply "excising" material. I probably won't get it right first time, so if I make mistakes, don't be shy to let me know! Oli Filth(talk|contribs) 22:11, 27 January 2009 (UTC)
I'm all in favour of simplicity and brevity, (and apple pie) it is just that you are all out of step... The algorithm I have described is indeed slightly different from the nearly universally discussed version, but, the deviationism is to be preferred for the reasons that I supply, even if the difference is thought small by some. It is precisely because of these differences that they have to be justified, otherwise a simple exposition of the routine choice would suffice. A deviant version will have its algorithm changed to one of ...lesser merit... by the someones, who in the past have often produced non-working versions. Dcoetzee has devised a scheme to hinder the facile fiddling with code sections, but it has still happened and to produce a wrong version. Alas, explaining why something is the better choice doesn't help if the fiddler doesn't read it because it is too long for a glance, and anyway unnecessary because the mistake is obvious.
I would question your usage "more intuitive" as above (seeing as it leads to conclusions opposite from mine, ahem) and it was precisely because highly regarded authors such as Prof. Knuth do not use the "exclusive bounds" version that I explained why a different scheme was better. Even if only slightly so in the view of some. And I do hope that "L" is recognised as short for "Left Bound" and "R" as "Right Bound", which I thought more intuitive than "low" and "high" and did not explain. NickyMcLean (talk) 20:56, 28 January 2009 (UTC)
Don't take this the wrong way, but this really isn't the place to present your personal preference; we're limited to presenting notable and common material. If every textbook in the world presents the "simple" version of the algorithm, then there's really no justification for us doing anything different.
At most, I would favour starting with the simple/intuitive version and then suggest improvements, rather than the other way round. Oli Filth(talk|contribs) 00:03, 29 January 2009 (UTC)
But it is not an issue of 'personal preferences' as I have explained at each stage why a choice is to be preferred (especially when it is not the routine choice), even if the tiebreakers are matters which some regard as trivial. Set against these objective criteria is a subjective criterion of "intuitiveness". To take one "intuitive" statement, to find the middle point between markers L and R, I would argue that the intuitive procedure would be first to find the length of the interval which is (R - L), halve it to find the length of the offset from the start to the midpoint, then add the start position, as in (R - L)/2 + L (which ordering incidentally would/should/could produce faster-running code than L + (R - L)/2; do I have to explain why?), then apply well-known mathematics to derive (R + L)/2 which therefore would not be as intuitive. I suppose one could write an article on "How to convert an algorithm into a computer programme", and use the Binary Search as a worked example...NickyMcLean (talk) 22:58, 29 January 2009 (UTC)

Yet one more point on overflow issues

The "Numerical difficulties" section uses signed values whilst exposing the (L+R)/2 issue. In the case of signed values, the problem is not solved by moving to L+(R-L)/2 (or an equivalent) if first and last are free to be anywhere in the number range. e.g. if first=-30000 and last=+30000, then on the first iteration, the signed value will overflow. Therefore, I'm modifying the description to be in terms of unsigned ints only. Oli Filth(talk|contribs) 22:44, 27 January 2009 (UTC)

Well, I did blather on about the same problem of overflow arising with unsigned integers. In the context of an array running 1 to N (or 0 to N - 1), index attempts with -30000 will not be used and there was mention of other bounds (And, oddly enough, only last week I perpetrated a search of an array from -600:+9000 (with index 0 being special) for reasons that require too much explanation...) Suppose unsigned integers are used, then the maximum usable range is first to last of 1 to 65534 (not 65535), with 0 being used for "not found", because values of first -1 and last + 1 are needed in the operation of the method, whether you use inclusive or exclusive bounds, as I did blabber about too. And overflow with L = 40000 R = 45000 say, still is avoided by avoiding (L + R)/2. When dealing with a pl/i compiler that allowed only 16-bit index expressions, I did indeed toy with an array running -30000:30000 so as to get extra slots, but this was less than pleasing. One could prepare an exhaustive schedule of {inclusive/exclusive bounds} x {signed/unsigned integers} x {(L + R)/2 or not} x {returning -L for "not found" or a simple constant} x {arrays starting at 0 or 1} and even I thought thirty-two combinations too much blabber and unnecessary and indigestible and just too exhausting to think further about. I'll bet that most people do use ordinary signed integers for array indexing, and also there is mention of the choice of unsigned integers vs. the return of -L for "not found". NickyMcLean (talk) 19:54, 28 January 2009 (UTC)
I think this is all a symptom of this article trying to address too many points simultaneously. The concepts of overflow, returning negative values, and having free-form array limits all conflict here (without immense clarification and addressing special cases). All of this could be avoided by sticking to the details of the algorithm itself. Oli Filth(talk|contribs) 00:03, 29 January 2009 (UTC)

The "Design" section

Sorry, despite my aim to avoid simply excising material, I have to baulk at the re-introduction of this section! This is nothing to do with correctness or efficiency or design of the binary search algorithm, it's simply a discussion of how to write an interface to a function, and at a bare minimum, is completely general to all search algorithms. The details of whether one writes their function to return a struct, a pointer, a tagged union, a sign-modulated integer, or to modify an input value via a pointer (which is essentially what this decision boils down to) really has no place in this article.

I'll wait for a reply, but I think this section needs a pretty good reason to stay. Oli Filth(talk|contribs) 22:56, 27 January 2009 (UTC)

Yeah, I agree with you. This article is suffering from a Parkinson's Law of Triviality issue - disproportionate weight and debate given to trivial issues that are not specific to binary search, some of which verge into OR territory. Everything there is to know about binary search can reasonably be covered in 2-3 pages, and I fully agree that we should begin with the simplest, most straightforward example possible. Dcoetzee 23:54, 27 January 2009 (UTC)
Well, Prof Knuth does remark that "the details can be surprisingly tricky and many good programmers have done it wrong the first few times they tried", and Jon Bently writes similarly, so ought not encyclopaedic coverage involve mentioning the deadfalls lurking in the battlefield? I doubt that many people bungle the linear search method. And when was the "Design" section re-introduced anyway? NickyMcLean (talk) 20:56, 28 January 2009 (UTC)
You're right, it wasn't re-introduced; I had forgotten which sections I'd removed originally!
But my point was that this section has nothing to do with the binary search algorithm specifically, and barely anything to do with search algorithms specifically. I think I'm pretty safe in saying that figuring out how to return results was not what Knuth was referring to. This is the kind of detail akin with "should I pass the array by pointer or by reference?". Oli Filth(talk|contribs) 23:55, 28 January 2009 (UTC)
Yes, I believe the "tricky details" to which these sources were alluding were avoiding off-by-one errors caused by inconsistency in how the lower/upper variables are treated as inclusive or exclusive; and other details like that involving the algorithm itself. Dcoetzee 00:06, 29 January 2009 (UTC)
Oddly enough, the Design section is entirely concerned with issues potentially raised by the programming of the (or a) Binary Search method, though as mentioned above, I could imagine such content in an article on how to write computer programmes that used the Binary Search as a case study. I'm sure that Prof. Knuth wasn't worried by the encoding of results, but someone introduced the general notion of "tagged union" to discuss it. As for pointer vs. reference, I've always thought that a pointer was a name for an address, and pass-by-reference amounted to passing an address. But I don't use C which may well have different syntax for those usages. If I was presenting Algolish pseudocode (and remember that a design objective for Algol was to provide a formalism for outlining an algorithm in publications), I'd consider something like
   L:=0; R:=N + 1;
   while (P:=(R - L)/2) > 0 do      %Integer division.
    case sign(A[P:=P + L].Key - X)
     -1:L:=P;          %A[P].Key < X.
      0:Return(P);     %A[P].Key = X.
     +1:R:=P;          %X < A[P].Key.
    esac;
   elihw;
   Return(-L);
Which avoids the appearance of miscellaneous go to statements, through the ability to employ assignments within an expression. Notice that there is just one subtraction of (R - L). However, I'm sure that someone would promptly replace that by C-code. I remain puzzled by the assertion that the recursive form is the "most straightforward", especially when there must be attention to +-1 thanks to the use of inclusive bounds. NickyMcLean (talk) 22:58, 29 January 2009 (UTC)
I too would personally prefer "generic" pseudocode rather than any specific language syntax. As for the "Design" section, surely any search algorithm has the opportunity to present both found/not-found and location? Oli Filth(talk|contribs) 23:07, 29 January 2009 (UTC)
I was the one who wrote that the recursive form is more straightforward, and I still believe it, but that's quite subjective - I think it's a good idea to present both. I agree that it should be generic pseudocode in both cases. The tagged union stuff I introduced was only in order to improve on existing material involving the return value, which didn't represent best practice. I'd be happy to see it removed wholesale. Dcoetzee 23:38, 29 January 2009 (UTC)
Hummm. A recursive version of the above (ignoring declarations) would be something like
 Function BinarySearch(A,first,last);
   L:=first - 1; R:=last + 1;
   while (P:=(R - L)/2) > 0 do              %Integer division.
    case sign(A[P:=P + L].Key - X)
     -1:Return(BinarySearch(A,P + 1,R - 1); %A[P].Key < X.
      0:Return(P);                          %A[P].Key = X.
     +1:Return(BinarySearch(A,L + 1,P - 1); %X < A[P].Key.
    esac;
   elihw;
   Return(-L);
 End Function BinarySearch;
And as you can see, there are all too many +-1. If it were agreed that all bounds specification would be exclusive, right from the start, then there would be no +-1 appearances, but there would have to be very loud documentation, since the caller's invocation would have to be BinarySearch(A,0,N + 1) for an array running 1:N, which is perverse. If the inclusive form is used, then P is stuck with +-1 while the L and R bounds are used as is. It is true that the explicitly recursive form more closely follows the verbal description of the algorithm (but not the flowchart), whereas the iterative form does follow the flowchart. With the use of inclusive bounds, there is still +-1 with P, but not L and R. Of course, the recursive version does involve multiple stack manifestations of the parameters, but this can be held to be ignorable in the face of aesthetics, and certain tricks are possible to the compiler. NickyMcLean (talk) 04:19, 30 January 2009 (UTC)
The reason I prefer the recursive version actually has nothing to do with the +-1, but rather the way it can be read as a more direct formulation of the informal description of the problem. "Inspect the middle element; if the key is less than it, recursively search the first half in the same manner, otherwise recursively search the second half." The iterative version may also be more intuitive for some, though, in the sense that it iteratively refines a range of possible indices, making it smaller and smaller. I think both have conceptual value. Dcoetzee 04:52, 30 January 2009 (UTC)
Yep, agree about the more direct following of the description. I misthought. NickyMcLean (talk) 05:01, 30 January 2009 (UTC)

Oli Filth: "too would personally prefer "generic" pseudocode rather than any specific language syntax." Sorry, but it is bad practice: very often "generic pseudocode" is too abstract to be clear and very often it is incorrect, so Pascal-like pseudocode usage is good and wide distributed tradition in the literature. Also if you want to introduce your original "generic" pseudocode here, then it would be original research! It is forbidden in Wiki.--Tim32 (talk) 09:19, 30 January 2009 (UTC)

This argument is specious. Original presentation is not only allowed but required at Wikipedia. Pseudocode is a matter of presentation. And you yourself are arguing for pseudocode, just with a different syntax. We'll use syntax that best expresses the algorithm without distraction, we can work out the details later. Dcoetzee 11:07, 30 January 2009 (UTC)
case sign(A[P:=P + L].Key - X) -- looks like too original to be understanded by any every reader! Poor, poor beginner who would like to use this code for his scolar program! :-( --Tim32 (talk) 12:28, 1 February 2009 (UTC)
PS. The pseudocode "case sign(A[P:=P + L].Key - X)" is undefined, the Pascal-like pseudocode is defined by Pascal standards!--Tim32 (talk) 04:43, 3 February 2009 (UTC)
...is undefined as Pascal, but the pseudocode is pseudocode. It is there to express the bare bones of an algorithm without being distracted by the limitations of some specific computer language's specific syntax constraints. Actually, an algol compiler would accept all but (I think) the function sign, except for quibbles over ending a while with elihw and case with esac (though I think the word should be otherwise) and suchlike. As for comprehension, one takes a deep breath, and begins with the standard rules for arithmetic expressions, and identifies the innermost () expression, which is simple in itself, and which then (oddly) appears to have the result used as an index to an array, which is a common scheme (so reassurance should be felt), which array element is a part of an expression that is a parameter to some mysterious function "sign", the result of which is to be supplied to a familiar construction case, which is then followed by some cases in a familiar manner, and, given the name of the function, and on connecting concepts, comprehension should follow. If not marks of approval from a pedagogue intent on enforcing conformity to some orthodoxy. So, a brave and perspicacious beginner could in principle transform the pseudocode into code conforming to whatever constraints a particular actual computer language imposes, and thus also deny their teacher an opportunity to accuse them of plagiarism by copying an article rather than performing their own work. So, an unexpected benefit! NickyMcLean (talk) 20:14, 3 February 2009 (UTC)
You wrote: “pseudocode is pseudocode” – yes and no: a good pseudocode must be understandable!:
“Pseudocode is a compact and informal high-level description of a computer programming algorithm that uses the structural conventions of some programming language, but is intended for human reading rather than machine reading. Pseudo-code typically omits details that are not essential for human understanding of the algorithm, such as variable declarations, system-specific code and subroutines. The programming language is augmented with natural language descriptions of the details, where convenient, or with compact mathematical notation. The purpose of using pseudocode is that it is easier for humans to understand than conventional programming language code, and that it is a compact and environment-independent description of the key principles of an algorithm. It is commonly used in textbooks and scientific publications that are documenting various algorithms, and also in planning of computer program development, for sketching out the structure of the program before the actual coding takes place.” (Pseudocode)
For example, I have nothing to say against “End Function BinarySearch;” -- it is obvious for everybody, but if you would write PL/1-like pseudocode:
x=y=z, (1)
then you should explain that this is the same as
y:=z; x := y; (2)
(Sorry, I do not remember, is this correct for PL…  :)
As I recall, the form is x,y = z; and I would argue that the meaning is apparent even if a reader has never used a computer language allowing it, and this construction is certainly a convenience. Pseudocode (1) is indeed less desirable, and could be read as x takes the value of the test (y = z) as I think would be the case in pl/i. NickyMcLean (talk) 20:19, 4 February 2009 (UTC)
No, in PL/1(("Programming Language One") x=y=z means: a) y:=z; x := y; or b)y:=x; z:=y; -- I do not remember, (a) or (b) is correct for PL/1… But, for some pseudocode c) x := (y=z) is possible.--Tim32 (talk) 05:09, 5 February 2009 (UTC)


So, as a rule mathematical journals and books use pseudocode (2) rather than (1).
“A[P:=P + L].Key” should be explained as well, it looks like (1). Also, why you need “Key” here? It is not essential detail and it would be quite suffiсient for the algorithm description “type A: array of integer” rather than “type A: array of record key: integer; … end;”
It is well-known fact, that standard Pascal was designed as educational language, so Pascal-like pseudocode is understandable for everybody, and so books and journals use this pseudocode very often today. C was not designed as educational language, so if an author wants to use C-like pseudocode, he has to do additional efforts to provide good style for his code. Of course sometime C++-like pseudocode or Lisp-like pseudocode may be optimal case, because of limitations of standard Pascal, but in the case of bin.search Pascal-like pseudocode is quite suffiсient. And so this pseudocode would be optimal here. Why do you want to make additional problems for a reader of this article?
Today, Algol-like (Algol-60, Algol-68, Fortran-4 and so on) pseudocodes look like returning to “goto” programming under very restricted memory. Why do you need “esac”, “elihw” etc.? The “end” is better for human reading, “esac”, “elihw” were done for compilers with low memory size. What about Cobol-like pseudocode? Your listing size would be increased drammatically! Donald Knuth introduced his own pseudocode, but he had wasted a lot of pages for his MIX description. Not all readers liked it.
Well, I was grumpy about MIX also, but, it did support detailed attention for the analysis of performance. I have used programming languages in which the end clause (or equivalent, such as } for you C-monoglots) was always just 'end' without any indication of what was being ended, and miscounts could be very difficult to correct. Thus the compiler should recognise and check the likes of 'end while' and 'end case' and especially, 'end function name, though I would prefer for i:=etc ... next i; rather than ending the loop with end for or (better!) end for i; I forget which language uses esac and offer to you modula's usage of fi to end every if, which in my case causes irresistible thoughts of Jack and the Beanstalk. Anyway, detecting such blunders is a simple clerical matter. Computers excel at simple clerical matters; why not have the computer do it? I'd suggest that the absence of such syntax features and checking would be the mark of a language constricted by a small code space. NickyMcLean (talk) 20:19, 4 February 2009 (UTC)'
No problem: yes, it is good programming practice with "indication of what was being ended" - use format and comment for it:
  { -- this line is comment -- } 
  for i := 1 to 10 do
   begin
     .....
     for j :=1 to 20 do
       begin
        case j of
        1: case i of
            1: begin
                ......
               end; {case i of 1}
            2: begin
                .....
               end; {case i of 2}
            ......
           end; {case i}
        2,3: begin
             ......
             end; {case j of 2,3}
        ....
        end; {case j}
       .....
      end; { for j}
   end; {for i}

--Tim32 (talk) 04:48, 5 February 2009 (UTC)

Yes, yes, and I have seen endless code littered by such commentary. The compiler must check for this to be fully useful, otherwise when there is trouble the bland assertions of the commentary will mislead; their intention may be correct, but the source is not due to some blunder somewhere. Indeed, the presence of such annotations shows the desperate need for a suitable syntax, which, if present, would enable the commentary to attend to more important matters of intent rather than syntax. NickyMcLean (talk) 03:46, 7 February 2009 (UTC)
You said “compiler”, so now you are talking about language for real programming (we have talked about pseudocode before). First of all, any real language should be productive for programmer’s time. It is not suitable to type “elihw” for every “while” and “od” for every “do”. In Pascal
For_statement ::= for …do Compound_statement
where
Compound_statement ::= statement | begin statement_list end
And so for compiler “end” is the end of compound statement. In your pseudocode
For_statement ::= for …do statement_list od
So you have to type “od” even if the list of statements has only one statement. Compare:
for z:=1 to maxZ do  // line 1
 for y:=1 to maxY do // 2
   for x:=1 to maxX do // 3
    if a[x,y,z]= 0 then // 4
      a[x,y,z] :=1 // 5
   else  // 6
      a[x,y,z] :=0 // 7
But for “fully useful compiler”:
for z:=1 to maxZ do // 1
 for y:=1 to maxY do // 2
   for x:=1 to maxX do // 3
    if a[x,y,z]= 0 then // 4
      a[x,y,z] :=1; // 5
    fi; // 6
   else // 7
      a[x,y,z] :=0; // 8
   esle; // 9
od; // 10
od; // 11
od; // 12
7 lines vs. 12 lines!--Tim32 (talk) 11:08, 9 February 2009 (UTC)

Also, there are many tools to check program structure, for example, source code formatters. Very often flowchart tools are the most useful for checking.

I'd be quite happy with for i:=... do <statements> next i;, and yes, unconditionally typing next variable every time, knowing that the compiler would check for correct collation. As in
for z:=1 to maxZ do
 for y:=1 to maxY do 
  for x:=1 to maxX do 
   if a[x,y,z]= 0 then a[x,y,z]:=1
    else a[x,y,z]:=0;
  next x;
 next y;
next z;
Though I agree that the nesting of if and else can become troublesome swiftly. But the whole point is that the compiler performs substantial checking in full context, rather that mess about with a collection of auxiliary routines that may or may not match the compiler's logic, even if that does match the language syntax and the programmer's interpretation of its meaning. As it happens, algol's syntax allows if-statements as a part of expressions, as in a[x,y,z]:=if a[x,y,z] = 0 then 1 else 0; however this too will puzzle the linguistically constrained. A friend once gloated over an assignment that began if if if... Algol is also not amongst those known as a "source lang". Fortunately this article's algorithm requires no for-loop encoding nor complex if-statements. NickyMcLean (talk) 21:50, 9 February 2009 (UTC)
It is typical case: the sequence
for x… for y … for z …
is equivalent with sequence
for z… for y … for x …
and nobody wants to remember which sequence he had typed 50 statements above. Such fully syntax (next x...next y...) would be too redundant – only the end of compound statement has to be marked, especially for such short algorithms as this article's algorithm. Pseudocode should not be more redundant than real programming languages.--Tim32 (talk) 10:51, 11 February 2009 (UTC)
"...50 statements above." - this is exactly why there should be redundancy and checking! Or would you say that "x*(a + b" is fine, just supply whatever closing brackets are needed on end-of-text? As for the for loops, the order of nesting can be important for execution (array element adjacency) and computationally. If this is not so, and the order of evaluation is not under discussion, then the explicit and verbose blather should be replaced by pseudocode such as Where A = 0 then a:=1 elsewhere A:=0; and actual computer languages that are serious about array processing offer such syntax, even if other languages do not and force full explication, such as C et al. NickyMcLean (talk) 18:57, 12 February 2009 (UTC)
"Or would you say that "x*(a + b" is fine" -- No: "x*(a + b)" is fine, where "(" is "begin",")" is "end", "a" is "if...then...;", "b" is "if...then...;";
x*(y*(z*(a))) is not fine, where "x" is "for x...do", "y" is "for y...do", "z" is "for z...do", "a" is "if...then...else...;";
x*y*z*a is fine;
z*y*x*a is fine;
x*y*z*(a1+a2+...+a50) is fine;
z*y*x*(a1+a2+...+a50) is fine; —Preceding unsigned comment added by Tim32 (talkcontribs) 06:24, 13 February 2009 (UTC)
x*{y*[z*(a)]} is not fine, where ")" is "next z;", "]" is "next y;", "}" is "next x;";
Also:
x*{y*[z*(a+b)+c]+d} is not fine;
x*(y*(z*(a+b)+c)+d) is fine.
"should be replaced by pseudocode such as Where A = 0 then a:=1 elsewhere A:=0;" -- A is an array, so For all x,y,z do if A[x,y,z]=0 then A[x,y,z]:=1 else A[x,y,z]:=0;--Tim32 (talk) 06:10, 13 February 2009 (UTC)
You wrote: “thus also deny their teacher an opportunity to accuse them of plagiarism by copying an article rather than performing their own work” – a student must know the standard implementation of bin.search rather than to make his own modification! A usage of copy of standard bin.search implementation is not plagiarism not only for a student, but also for professional programmer. Otherwise any standard function call would be plagiarism…--Tim32 (talk) 07:54, 4 February 2009 (UTC)
It is not a matter of a function call, but of an algorithm. My image was of a student being tested on their understanding by reproducing the code from their own understanding, not just copying it. If arbitrary pseudocode was allowed, then the student and the assessor might have a discussion such as this one. If a specified computer language was required, then the student would have to produce syntactically-correct code, and as the above pseudocode conforms to no single computer language's syntax, it couldn't possibly be used and so there could not be plagiarism. Whereas if this article described the algorithm with correct syntax in some specific language (such as C) then an accusation could arise. NickyMcLean (talk) 20:19, 4 February 2009 (UTC)
Let the student reproduced following:
 
  i := 1; 
  j := n; {array size: var A : array [1..n] of integer} 
  repeat 
   k := (i+j) div 2;  
   if x > A[k] then 
    i := k+1 
   else 
    j := k-1; 
  until (A[k]=x) or (i>j);
The assessor must ask him: 1) how does it work? 2) should A be sorted? Why if yes? 3) what is the value of iteration count for worst-case? and so on. These would be another questions that we have discussed here. And obviously, the assessor should avoid a discussion such as this our discussion.--Tim32 (talk) 06:29, 5 February 2009 (UTC)

I would strongly argue that assignments inside control statements (both this "case" and the "while") should be avoided at the best of times, and absolutely shouldn't be present in pseudocode intended to illustrate the algorithm. It may lead to compact notation, but it definitely leads to hard-to-read/misleading notation. There's no shame in putting the assignment on a separate line!

Yes, but then the control structure would have to be different! There would have to be an initial computation of p just prior to the while (that merely tested the value of p) and a fresh computation of p just before the elihw (or whatever) ready for the next test. Such repetitions at widely different locations promote human error, as well as being lame. Further, the compound form has the entirety of the control in one place, in the (admittedly more complex) control statement, rather than being scattered about in multiple statements - such scattering is one of the arguments against the use of go to statements. The alternative is to introduce labels and the dreaded go to statements, that are concealed by the use of fancier control structures even though still present in the actual machine code (but the computer deals with the simple clerical issues), and that also would be of lesser merit. I fully agree that assignments within statements are disconcerting, especially when they are unfamiliar, and I (never having seen Algol) was surprised but then my friends explained, and, once my mind was corrupted to encompass them, I became a supporter... Added commentary could give hints to the puzzled, and, after some study, would not enlightenment be attained? The argument is not about compact notation (else we would all be using APL) but about concise description. Yet still not so concise as to be unintelligible. NickyMcLean (talk) 20:19, 4 February 2009 (UTC)
I think having the same simple statement in two places would be a small price to pay to avoid obfuscated code. Assignments within control statements are more than disconcerting, they're a sure-fire way to introduce bugs and misconception, even for the most experienced of programmers. But I hardly think it's my place to lecture you on this!
As it happens, the problem would be avoided entirely if the loop condition was simply low < high!! Oli Filth(talk|contribs) 20:33, 4 February 2009 (UTC)
Ah, but in that case there are two evaluations of high - low per iteration when only one is needed, and this is wasteful. And if we are attending to the skeleton of the method (speaking as one corrupted by assembler), then the case statement itself contains waste: each case ends with an implicit go to to whatever follows the esac clause which in this case is the wend or end while or elihw clause, which contains an implied go to to the head of the while statement for retesting: two go to op codes executed in a row, which is wasteful, though it is just possible that a compiler would correct that. Curiously enough, only the fortran code provokes no surplus go to codes, but it alas does so by employing actual go to statements in the source, to the annoyance of the uptight and also increasing the total number of statements, and, by using go to (which might be to anywhere) raises the apparent complexity. So, I still think that the above pseudocode does indeed well capture the method (with well-confined control statements), and I feel that all those who are puzzled by assignments within statements should become enlightened. As one who is well-steeped, I bemoan their absence and scowl at all the lesser computer languages that inflict flabby (and error-provoking) repetition on my schemes. NickyMcLean (talk) 03:46, 7 February 2009 (UTC)
As I suggested earlier, surely the structure of this article should ignore details like "two evaluations of high-low" at the outset (especially when one of them is implicit) in order to simplify the comprehension of the algorithm and code (make it read more like the verbal description). At most, such schemes should be introduced later, such as in the "variations" section, or possibly an "optimisations" section. Oli Filth(talk|contribs) 10:10, 7 February 2009 (UTC)
"Assignments within control statements are more than disconcerting, they're a sure-fire way to introduce bugs and misconception, even for the most experienced of programmers" - yes, it is well-known fact! Also, two different ways are possible to understand the statement case A[p:=x]:
1)
k := A[p];
p := x;
case k of
2)
p := x;
k := A[p];
case k of
--Tim32 (talk) 05:33, 5 February 2009 (UTC)
Huh? I don't see why anyone should contemplate the first interpretation. The precedence rules are clear, in that the content of brackets are considered first. There is a conceptual lurch, in that the result of the expression in A[p:=expression] is not just assigned to p but also retained for the indexing implied by A[...], but my induction into the ranks of those corrupted by this syntax was decades ago (not just the previous century, but the previous millennium!), so now it seems natural to me... NickyMcLean (talk) 03:46, 7 February 2009 (UTC)
If one has to invoke precedence rules in order to interpret the pseudocode, then the point has been missed... Oli Filth(talk|contribs) 10:10, 7 February 2009 (UTC)
Aww, come on. Is the interpretation of 1 + 2*3 as distinct from (1 + 2)*3 in doubt? And, why did the designers of C overload * with an auxiliary pointerish meaning (when @ was available, surely), with all to be distinguished by syntax and precedence and spacing? NickyMcLean (talk) 20:15, 8 February 2009 (UTC)
But the code in question isn't simple arithmetic, it's something that even an experienced programmer might not have experienced (if they happen not to be familiar with, say, C). Oli Filth(talk|contribs) 20:22, 8 February 2009 (UTC)

I would also argue for an alternative to "esac" etc., although I'm not sure what that would be. It's just another set of keywords that the reader unfamiliar with say, bash, will have to figure out. Oli Filth(talk|contribs) 09:38, 4 February 2009 (UTC)

I'd suggest that the end of a case should be marked by otherwise <statement>; as this should prompt thoughts of what is to be done if no cases match. However, it is a lot of typing (o'wise?) as well as being unfamiliar. I vaguely recall some language offering an else clause (and just ; as the statement terminator), which is briefer except that this usage of else can be confused with an if statement's tangle of nested tests. If no-one ever made a miscount, then none of this would cause difficulty. NickyMcLean (talk) 20:15, 8 February 2009 (UTC)

Generic search-algorithm stuff

It seems to me that most of the material in the Binary search algorithm#Extensions section is not specific to this particular algorithm, so if it has to live somewhere, it would be better placed in the search algorithm article. Thoughts? Oli Filth(talk|contribs) 00:12, 28 January 2009 (UTC)

Cache locality

The paragraph about forcing elements N/4, N/2 and 3N/4 reads like OR to me, and I'm not sure it's particularly beneficial. Firstly, this only benefits in the first two stages (which may be insignificant in a large search), and secondly, forces more cache misses than just letting the algorithm run naturally, surely? (It will never naturally try to access all three of these values.) Oli Filth(talk|contribs) 00:26, 29 January 2009 (UTC)

It is now clear to me that OR does not stand for Operational Research (the more familiar abbreviation to me) but is Original Research, which is decried. Someone contributed some words about "cache locality" but I've never been clear what is being talked about as the contributor was not verbose... The image I have is of a search space of immense size in a disc file (and this is similar to main memory and on-chip memory), with random choice of key. In this situation it is pointless to have many buffers as their re-reference chance is 1/N for N = 100,000,000 or so. But, granted many searches of a fixed-content search area, element N/2 will be sought every time, and could be worth keeping in a buffer, etc. None of this strikes me as at all novel or at the frontiers of research, but if someone wants to talk about "cache locality" then very well... NickyMcLean (talk) 22:58, 29 January 2009 (UTC)

various comments

  • I think that a clarification regarding the statement p=L+p (after the comparison p>0 (if yes exit) ) is needed.
The p=L+p statement is not necessary for the algorithm to work, but is used for inserting an offset when needed..
I think that this point should be clarified because it causes some misunderstandings.
  • The "exclusive or inclusive bounds" section appears to be original research. It is also hard to read and not convincing. McKay (talk) 01:04, 1 April 2009 (UTC)
The two forms are shown to be equivalent by demonstrating the transformations to change the "exclusive" version into the "inclusive" version. This is not original research, surely? NickyMcLean (talk) 04:13, 1 April 2009 (UTC)
The OR part of it is that it is written as an argument of why exclusive is better than inclusive. We are supposed to cite opinions to reliable sources, not make our own arguments for them.
Prior to the introduction of the flowchart image, the enunciation of the method would be incessantly changed amongst random selections of parts of the variations, often resulting in a non-functioning version which is why Dcoetzee made it more difficult to do so. No-one ever bothered to explain why one variation was to be preferred over another, and, nor do the texts that I have seen. So, appeals to authority will be thin. And the preference is not a matter of opinion, it is based on fact, as described, so that any reader capable of following the argument can verify it themselves. NickyMcLean (talk) 00:00, 2 April 2009 (UTC)
  • What's with "Return(-L)" in the Fortran example? McKay (talk) 01:04, 1 April 2009 (UTC)
Easy. if (p <= 0) Return(-L) becomes
     if (p <= 0) then
       BinarySearch = -L
       Return
     end if

NickyMcLean (talk) 04:13, 1 April 2009 (UTC)

Sorry that is not an answer. Why is this function returning numbers like -17? It should return either the location of the item or some "not present" indicator. McKay (talk) 05:58, 1 April 2009 (UTC)
It is an answer to the question, however I now see that you had a different question in mind. As I recall, this usage had been explained in the section Extensions whereby the not-found result (signified by a value less than one) can also be an indication of where the missing value would be placed if it were to be in the list. So -17 would indicate that the missing value should be inserted into the list following element 17. Without this feature, if the list was to have the missing element added, then it would have to be searched afresh to find the proper place, or sorted, etc. Obviously, someone decided that the article was too long (as per your next remark) and removed the explanations. NickyMcLean (talk) 00:00, 2 April 2009 (UTC)
  • The main problem with the article is that it is way too long and way too rambling and disorganised. Apart from the flowchart, there is no clean description of the algorithm anywhere to be found. The leader is a disgrace that will cause most visitors to stop reading. I can't see how this can be fixed without a complete rewrite.
  • A serious problem throughout the article is the confusion of conceptual issues and implementation issues.
  • Binary search is also used in the real numbers (to solve equations f(x)=0 for real x). Is that in another article somewhere? McKay (talk) 01:04, 1 April 2009 (UTC)
Not that I've noticed, though presumably there are articles on solving f(x) = 0. The binary chop method for this is much the same as for finding an array index, but the termination condition equivalent to "not found" is different since x is no longer integral-valued, thus the search can always (with real "real" numbers) continue to halve the span. Similarly, the condition for an exact match of f(x) to zero should be abs(f(x)) <= eps. A floating-point (i.e. finite precision) computation of f(x) may never generate f(x) = 0 for any value of x and may bounce around zero as well. Adding this usage and explaining these matters is unlikely to reduce the length of the article. NickyMcLean (talk) 04:13, 1 April 2009 (UTC)
I found it at Bisection method, but numerical problems are ignored there. McKay (talk) 06:09, 1 April 2009 (UTC)
Sorry, but again: only Implementations Iterative sub-section has reference, other Implementations sub-sections still look like original research with very sophisticated and unclear pseudo-code: McKay asked about "Return(-L)", not long time ago I also asked about “A[P:=P + L].Key”, etc. May I suggest deleting all original pseudo-code for the first step?--Tim32 (talk) 18:01, 2 April 2009 (UTC)

Three-way comparison

This article really isn't the place to discuss the mechanics and possible optimisations of three-way comparisons in various languages! It's to present the theory, logic and mechanics of a binary search. Start the three-way comparison article if needs be, but this material doesn't belong here. Oli Filth(talk|contribs) 22:39, 29 April 2009 (UTC)

You reverted this code which effectively uses three-way branches: they DO exist in Java (even if this is not reflected by the syntax here).
You have to understand that ALL three-way implementations (including in Fortran!!!) are converted into a sign test using TWO separate successive conditional branches (based on sign/zero internal flags of CPUs or bytecodes). This is exactly what Java does there, when testing signs of integers (and only in this case). The same is true in C and C++ (also C#, J#, and many other languages that generate bytecode or native code, the exception being in fully interpreted languages that do not precompile the source) with almost all compilers. That why a specific syntax for the three-way branch is not necessary.
The code there offers things not discussed in the article, and directly related to the three-way comparison: the reduction of the number of branches (notably avoiding branches that jump immediately directly to another unconditional branch). Because we are speaking in the article about possible optimizations, the reduction of the number of reads/writes for work variables is also significant. Java also compiles the while(true) construct as an unconditional branch, from the bottom (every last executable instruction of each branches within the loop) directly to the top of the loop. Most C/C++ compilers also do the same. If you analyze the code produced here, you'll see that the first two branches for the sign test are containing two conditional branches, that always terminate the current iteration (one is the break condition that does not generate code itself as it is the only statement in the conditional, so this compiles only like the if() itself, the other one is directly an unconditional branch to the top of the loop that will compute the next mid-point before performing the next three-branch compare).
Here is what I wrote (you'll note that I do not even need any goto to optimize this) :
In Java, there's no three-way test, but comparing objects can be expensive if this is done twice. The Comparable interface is used to compare the internal values of objects stored in the array of Comparable Object's, and the return value of the comparison can be stored in a temporary integer variable that can easily compared to zero with virtual no-cost (three way comparison is implemented in this case within the compiled bytecode and the native code generated at runtime by the JIT compiler). An efficient implementation with the minimum number of conditional and unconditional branches (including branch prediction to avoid the unconditional branches as much as possible, and avoiding unnecessary reads after writes) can be:
static int <T>
binarySearch(
             final Comparable<T>[] A,  // the lookup array (with items comparable to value type T)
             int lo, int hi, // inclusive index bounds for search
             final T value) { // the value to search for
    if (lo <= hi) {
        while (true) {
            int mid, cmp;
            if ((cmp = A[mid = (lo + hi) >> 1].compareTo(value)) != 0)
                if (cmp > 0) { // A[mid] > value
                    if ((hi = mid - 1) < lo)
                        break; // not found
                } else {       // A[mid] < value
                    if ((lo = mid + 1) > hi);
                        break; // not found
                }
            else               // A[mid] == value
                return mid;    // found
        }
    }
    return -1; // not found
}
Similar code would be used in C or C++. Note only is this code correct, but this is clearly the most efficient way to program it.

verdy_p (talk) 22:58, 29 April 2009 (UTC)

I agree that any decent compiler ought to deal with this sensibly. I also think that the entire section is unnecessary (i.e. including the stuff about Fortran), not just your addition, as this article should not be cluttered with details about language/platform-specific optimisation tricks. Consequently, in my opinion, the section should be removed. Regards, Oli Filth(talk|contribs) 23:06, 29 April 2009 (UTC)
I agree with Oli. A large fraction of the implementation/platform stuff should be deleted or summarized in a short section. And there is no excuse at all for discussion at the level of machine instructions. That sort of stuff has hardly any importance any more (and I speak as someone who wrote many tens of thousands of lines of assembler code in the past). McKay (talk) 23:54, 29 April 2009 (UTC)
The section which McKay removed on assembler-level details was one that I'd highlighted as out-of-place in the past, and previously removed. I still agree with the premise that an article about what is essentially computer science (i.e. maths) should not be bulked out with highly-specific optimisation advice for a hypothesised instruction set.
Therefore I've reverted the restoration of this section, not because the text needed improving (it was quite well written), but because it's completely out of the scope of this article; this is a complaint that I and several other editors have made in the past on this talk page. Oli Filth(talk|contribs) 20:57, 30 April 2009 (UTC)
Not all the world is encompassed by the design of common computers today. Three-way branching tests in hardware have existed, and could exist in some other computer. There is in any case a difference between performing successive comparisons each with a particular test and branch intended, and, comparing once and making successive tests of the indicators (+) (-) (zero) (overflow) (carry) (odd) that are offered by the hardware after an operation was performed. On a computer offering such an arrangement a three-way branch can be done with one comparison. Languages not offering such syntax, confronted by two if-statements, are not going to employ such code even if available, unless you have a blind faith in the workings of compilers. The ploy offered, of storing diff:=(a - b); and then multiply testing the value of diff has been discussed in the past and that discussion has been removed/replaced/removed, etc. though it lives on fossilised in this massive talk page. It works only if diff can correctly hold the state of (a - b) which is in doubt thanks to overflow. Try a = 30,000 and b = -30,000 in signed 16-bit arithmetic. Possibly, the specific behaviour of Java interpreters for the specific example will avoid that hiccough, but I don't know as I don't play with Java, and anyway, what of text comparisons? And the example code is extraordinarily verbose compared to some of the pseudocode examples above.
Similarly, the issue of calculating the middle in one step or not: the example Java code above will fail in the manner mentioned and it seems, ignored by everyone. Entirely removing the section discussing the two choices will help perpetuate that ignorance. I have been on the receiving end of exhortations to use (L + R)/2 because it is simpler (which it is), that failed to notice the consequence of making another step more complex. This also has had extensive discussion, fossilised above. So, should the article not cover these matters at all, despite the visible differences in implementations of the method? Or just assert "do this" without explanation? That will invite endless revisions by folk with a different idea.
Specifically, if you (plural) don't like the detail explaining exactly why one version is to be preferred, rather than baldly asserting one, feel free to expend some effort rephrasing. Simply removing the portion leaves undiscussed an obvious area of difficulty, though only of interest to those paying attention... Alternatively, the entire article could be condensed to a few assertions. NickyMcLean (talk) 21:33, 30 April 2009 (UTC)
As I've already stated, it's not a matter of "rephrasing"; I (and apparently McKay) believe the material simply has no place here, so from my perspective at least, there is no desire or need to expend effort rephrasing.
To re-iterate my rationale; it's not the purpose of Wikipedia to offer such specific "how-to" advice; in this case, cherry-picked details about what a particular compiler might do or what a particular instruction set might offer, in order to be able to avoid a coding error which might or might not be relevant. As has been discussed before, the "overflow" issue is only an error on particular platforms; the three-way efficiency comparison is only relevant or valid on particular platforms.
We could (theoretically) go on and on about various optimisations on different platforms and in different languages, but it would be totally inappropriate, as it is not our place to teach people how to construct efficient or correct code. If a reader blindly uses the "naive" code from the article without considering their own particular application- or platform-specific issues, then I have no sympathy for them! Oli Filth(talk|contribs) 21:56, 30 April 2009 (UTC)
I'm mostly with Oli here - if you look at any good algorithm textbook, or research paper, you're not going to find a discussion of implementation minutia. That doesn't mean we can't mention them - it just means they should not be given undue weight. I'd say one paragraph would suffice for the lot of them put together. However, this is also an encyclopedia article, which means it would be appropriate to discuss the impact that poor implementations of this algorithm have had on real-world systems, such as serious security exploits. This is practiced for example in buffer overflow. I am not, however, aware of any such exploit, or any instance where a binary-search bug "hit the news." Dcoetzee 23:46, 30 April 2009 (UTC)
This was relatively big news in certain circles a couple of years ago...! Oli Filth(talk|contribs) 00:11, 1 May 2009 (UTC)
Yes, I've read that, but that's written by programmers for programmers - it's just an explanation of a common implementation error that - despite being widespread - has never been exploited, as far as we know. I was thinking more the major media attention that some buffer overflow security problems have gotten. Again, I think it's worthy of a sentence or two, but that's all it takes, really. Dcoetzee 00:44, 1 May 2009 (UTC)
The issue of overflow in computing the mid-point is important and should have its own paragraph. But the question of how that maps into pretend machine language is not important and only makes a too-long article even longer. Besides that, the portion I deleted had no sources and appeared to be original research. It is also broken as regards efficiency since it ignores such details as pipelining, prefetching, and branch prediction that are important to modern processors. McKay (talk) 01:49, 1 May 2009 (UTC)
Contrary to the assertion of verdy_p made below the Java code above it should be clear that the best implementation of the binary search involves a) exclusive bounds so that on each iteration the value of p is assigned to either L or R directly rather than some rigmarole involving p ± 1 and b) does not compute the middle in one step so as to avoid the actual error of overflow for large arrays, and certain other slight gains such as avoiding repeated computation of R - L or similar such as comparing L to R. As has been discussed. And replacing a "go to" by "while true do" results in a "go to - free" version only in a bureaucratic sense. The Algolish pseudocode further above achieves all this quite simply, as does the Fortran example code though both will appear odd to C-style monoglots. The remarks of McKay with regard to "pipelining, prefetching, and branch prediction" are unintelligible though certainly sounding impressive. Both these choices a) and b) are at variance from the almost universal versions seen in all manner of sources. If some article is following the mass consensus, then very little need be said beyond "me too". But if the article deviates from the near-universal usage, then justification is in order. These variations do exist, so should there not be some mention of them? Even if verdy_p demonstrates that the text is not comprehended by choosing the common form. I provided the pseudo-machine code so as to make the explanation completely explicit, because it goes further than the usual statement that (L + R)/2 is simpler, which is true but insufficient. There is this famous publication "The Art of Computer Programming" that uses a pseudo-machine code exactly so that details can be discussed clearly. But perhaps these books are not counted as being amongst "any good algorithm textbook". I am reminded of Ted Nelson's original ideas about hypertext, whereby a puzzled reader could choose to request "more detail" and sentences would become paragraphs, references would be compared and assessed, etc. Thus there could be a "simple" version, with undiscussed assertions, and more comprehensive versions with full explanations. But so far as I can see, the wikipedia system does not support this style of hypertext.
So, why not remove the "Variations" section altogether? Other people presumably have access to Prof. Knuth's description of the "midpoint and width" version but haven't bothered to expound on it. I haven't because I'd have to spend some time trying out the method, and any conclusions I might reach would no doubt be also described as having "no sources" and being "original research". So for consistency, the two graphs and the discussion of them should be removed under that banner also. This also would reduce the article's size, an obvious gain. NickyMcLean (talk) 05:09, 2 May 2009 (UTC)
You're creating philosophy here, that almost compeltely looks like original work. The binary search algorithm is not philosophy, if it's used it's because of its real performance, I don't know why you think you need to make things so ugly by just ignoring the practical problems that this algorithm solves, and why there are variations on them. Three-way comparison is a very common practice (and this is fully dpecified in the Java Comparator or Comparable interfaces, for a good reason: it is exactly there because it is as much usiful to have such comparators that returns 3 signs, which has the same theoretical problem and mportance than the binary search algorithms themselves. Three-way tests have been used since extremely long for implementing the binary search that it's simply stupid to ignore it. Continue with your philosophy if you want, but you are just forgetting the reason why people would need to look at this article: knowing the caveats and knowing how to solve them, with an example code that really works as expected is certainly important to avoid the most common bugs and understand why and where they may occur. Of course there will be common libraries that already solve the problem and optimizes its solution. But experience shows that these libraries are also not always available or behave incorrectly in some cases (these may look as minor problems, but even the most minor problems because severe bugs after some time). verdy_p (talk) 21:28, 3 May 2009 (UTC)
Removed section with assembler was not useful for this article. Ok!--Tim32 (talk) 23:30, 6 May 2009 (UTC)