Talk:Binary search algorithm

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science (Rated GA-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
 GA  This article has been rated as GA-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.

Original Wirth's code restored![edit]

Dear Editors!

Please see original book by N.Wirth, before your editions:

You wrote:

k := (i + j) shr 1;

But N.Wirth wrote:

k := (i + j) div 2;

Did you know that "shr" was not standard operator in standard Pascal by Wirth?

You wrote:

until A[k] = x or i > j;

But N.Wirth wrote:

until (A[k] = x) or (i > j);

Did you know that even in modern OOP Delphi Pascal the expression "A[k] = x or i > j" would be wrong? -- the first step may be "x or i"...

Please, see sources before your original editions!!!--Tim32 (talk) 23:17, 6 May 2009 (UTC)

Yes indeed, pascal compilers require brackets around the terms of compound boolean expressions. Yes, the use of shift operations instead of division by (powers of) two adds unnecessary obscurity in the form of something the compiler should handle anyway. And the referenced version suffers from the overflow error for large arrays, engages in unnecessary extra effort involving L and R (the reference uses the unhelpful names i and j), has to mess about with k ± 1 because it uses inclusive reckoning, and with its single comparison per iteration abandons the opportunity for early discovery of a match so that its average performance is proportional to log(N), rather than log(N) - 1. All of this has been endlessly argued over, but, since no reference also presents such arguments they are Original Research and so should be purged from the article along with the deviant code examples. NickyMcLean (talk) 04:43, 9 May 2009 (UTC)

Single-comparison code is wrong[edit]

// high == low, using high or low depends on taste

      if ((low < N) && (A[low] == value))

The low<N test fails, obviously, when low is N, i.e. when the value you are searching for is in the last element of the array. In fact, there's no reason to do any sort of check on the value of low at this point, unless the precondition of N being >= 0 were untrue, in which case the first test of low<high would have evaluated to false (and the comment "high == low" would be untrue also). But if such a check is necessary in the context of the program, it probably should be done before this procedure is called. —Preceding unsigned comment added by (talk) 21:20, 16 July 2009 (UTC)

Not in the last element but after the last element! You'll see that the search boundaries in all other examples go from 0 to N-1 or from 1 to N. Here it is from 0 to N, making N an invalid index and thus you have to check for that possibility before trying to access A[N]. This is probably intentionally done so to allow for binary inserts. —Preceding unsigned comment added by (talk) 16:02, 13 October 2009 (UTC)

Binary Inserts[edit]

I have not found any references to binary inserts in this article, i.e. at what position new elements should be inserted into a sorted list or array. All shown examples simply return -1 when a given value is not found.

What I need is a binary search function that returns true or false whether a given value was found _AND_ the position it was found at or where it should be inserted. I think this is a commonly used and very minor extension to the standard binary search and really belongs in this article!

In fact, the last example about single comparisons was intended to support this, thus the range from 0 to N (N+1 elements) to allow insertion before the first and after the last element, but someone (the original poster?) castrated it to make it conform to the other examples.

Here is how the lower part should look like:

      // high == low, using high or low depends on taste 
      if ((low < N) && (A[low] == value))
          return true,low // found
          return false,low // not found

I realise this is sort of inconvenient since most languages do not allow returning multiple values. Maybe return -low-1 when not found? (talk) 16:47, 13 October 2009 (UTC)

If you look at the three-way-IF version (in Fortran) while it yet survives you'll see that it returns -L for "Not Found" working on an array indexed 1:N. This is precisely so that one may determine where to insert a missing value without having to search the array afresh to re-find the insertion position and indeed I have used this ability. Others have questioned why -L was returned instead of -1 or 0. The point was carefully discussed as one of the possibilities of the "Extensions" section, but that discussion was removed by someone who had no interest in the matter. Similarly, there had been a discussion of functions returning compound results (such as you describe), and it too was removed by the simplifiers.NickyMcLean (talk) 19:44, 13 October 2009 (UTC)
Thanks for clearing that up. I hope there will be an Extensions section again. (talk) 15:41, 14 October 2009 (UTC)

overuse of OR tag[edit]

Those kind of things are allowed. Go check out almost every single science article with a diagram. Just as examples in the same subjects, check out the diagrams in Dijkstra's algorithm, Johnson's algorithm. They are all uploaded by the editors themselves. In fact, almost every one of these trivial and non-controversial diagrams are made by the editors themselves. Not everything have to be sourced to published sources. Do you think that it is necessary to pick out every diagram on wikipedia that isn't from academic sources and put a giant OR tag on them. Also, if you really have to be ridiculously rigid with the rules, these diagrams and simple examples can qualify as trivial calculation/compilations of data, which is explicitly said to be allowed in the OR policy article. (talk) 23:41, 23 August 2010 (UTC)

By the way, I didn't see it at first, but weren't you the one who uploaded them in the first place. Why are you doing this now? (talk) 23:42, 23 August 2010 (UTC)
Yes it was I, which is why I know that the flowchart is not found in the usual approved sources (it is not the one in Knuth's compendium) and it does not correspond to the algorithm found in many publications. The "exclusive bounds" version is of course not my own invention, but I have not seen it published - indeed, I can't now recall how I came to use it since the 1980s; I had thought I'd found it in Knuth, but was surprised to find on checking in 2004 or so that it is not there, nor in Wirth's famous book either. I must have read something somewhere, or talked with someone who had! Similarly, to produce the graphs of performance, I wasted a day's computer time on my work computer in exhaustively running a particular type of test trial on each of N = 1 to 32767 to gather the statistics for the plots. While others have performed somewhat similar trials, and may have published somewhat similar plots, I have never seen any published article with the same plots nor the specific trial I used. Thus, both the flowchart and the plots are the result of my endeavours unsupported by published sources, and differ from published sources. (Indeed, that was why I prepared them, as the standard forms are less satisfactory, but, the discussion of the differences and why to prefer one over the other has long since been ejected, as being verbose, unnecessary and, unsupported by published sources, naturally enough) I have elsewhere been roundly criticised for just this practice, and it was emphasised that even if the algorithm worked and the results presented and discussed were true, without presence in approved published sources, my screed constituted OR and was against W. policy and was removed. With the suggestion that I could submit a paper to a suitable publication, and if published, I (or anyone) could then re-enter the material. I have yet to make such an attempt. Accordingly, similar effusions in this article must also be retracted, in accordance with W. orthodoxy. Similarly, someone has marked the example source files as unreferenced; the fortran version is taken from one of my source files that I've used for decades and is definitely correct, but, is not published anywhere (though surely equivalent to published versions, except it has been explained to me that this is also inadmissible, being my opinion) and so likewise should be purged if one were resolute in following W. policy. NickyMcLean (talk) 22:20, 24 August 2010 (UTC)
It is pointless to follow Wikipedia's policy so dogmatically. They are nowhere near perfect, so common sense should override them if there is no controversy. If you think about it, it would be near impossible to add any charts and diagrams on wikipedia, if it is absolutely required to have been published in an academic source. First, a science journal would be very concise and would not provide a friendly picture to elucidate the texts, so any useful diagrams have to be taken off of textbooks. Second, it will be a long and complicated process to ask for copyright permission for the publishers and the authors and I think most editors will be driven away if they are required to go through so much trouble just to put up a picture. I think it makes much more sense and it is much more practical to only follow the OR policy so closely if it's some controversial political topic. Otherwise, if no one objects, it is not for the good of anyone to be so rigid. (talk) 01:43, 25 August 2010 (UTC)

Copying a flowchart or code from an approved source would be a copyright violation, so it mystifies me that editors think we should be doing that. —David Eppstein (talk) 01:55, 25 August 2010 (UTC)

These apologia have been raised, and rejected. Talk:Interpolation_search#Brisk, for example. I agree that over-literal application amounts to a demand that published articles be plagiarised, and it would make it impossible for any author to write a new textbook about subjects already covered in other texts. Thus, another description of an established algorithm is going to involve much the same wording, similarly, redrawing the same diagram. Surely it is proper for a W. author (or any author!) to draw upon prior work, and indeed it is W. policy to require this, with acknowledgement, of course. But to repeat, the algorithm I described is not the standard one (as found in published sources), the flowchart is not a redrawing of the standard one (as found in published sources), the fortran example source is from my own (known-correct, but not published) collection, and the statistics on the behaviour of the method are definitely the result of my own experimentations, as has been decried. All of this violates W. orthodoxy, as has been emphasised. NickyMcLean (talk) 02:08, 26 August 2010 (UTC)
"Redrawing the same diagram" as you suggest, using the same graphical elements in the same geometric relation to each other, is just as much of a copyvio as retyping a passage of text using the same characters in the same order. —David Eppstein (talk) 02:39, 26 August 2010 (UTC)
So, we're stuck. Referring to Prof. Knuth's flowchart, redrawing it with trivial variations in layout, shape and names (I'd use L and R, not L and U, for example) would be a violation of copyright as obviously, the entire diagram has to be copied rather than sampled for the purposes of scholarly reference. Actually, I'd expand on the rather sparse descriptive elements that otherwise can only be understood with reference to the text, so it just might not be regarded as plagiarism even if Prof. Knuth wanted to be rigorous over copyright rather than favouring the dissemination of understanding. So, such a diagram would not violate W. orthodoxy with regard to Original Research. But again, the stuff I have supplied in the article does violate OR, and according to W. orthodoxy, must be ejected. NickyMcLean (talk) 22:09, 26 August 2010 (UTC)

Single comparison per iteration[edit]

The premise of this section appears wrong. The notion of counting comparison operations is that the comparison is expensive. Consider a comparison on 64-byte keys. The section is suggesting that testing condition codes is expensive. For many library routines, the comparison routine returns positive, zero, or negative. Once the comparison is done, a three way branch can be done with two conditional branch instructions. The comparison need not be redone for the second branch. The complexity is counting required comparisons - not instructions or redundant calls. Glrx (talk) 18:37, 1 November 2010 (UTC)

When I wrote the prog. in IBM1130 assembler, indeed the three-way branching was effected via conditional branch instructions examining the accumulator's state tags of +, -, zero, overflow (etc.) without recalculating the comparison and there is no difficulty. Such code might be generated by a compiler written for a language that encompasses syntax for a three-way test (as in Fortran, for example, but not in fortrans compiled by translating the source into C or similar that lack three-way testing, then compiling that) but in other languages the two expressions are written out, and they are different expressions. (One has >, the other has <) Hagiographers of "modern optimising compilers" claim superb code for their cherished champions and suppose that they too would recognise this situation. I doubt this, so, examples demonstrating it are called for. There has already been a discussion of the problems with diff:=A(i).Key - A(j).Key and testing diff. Yes, you can always devise a function that compares two keys and returns -1, 0, +1 and the problem goes away somewhat (yet still there are two tests of the result variable), but now you have the overhead of function entry/exit and parameter passing. Proper machine code for the binary search method is tiny, so that any waste effort (or "overhead") has a large effect. So, the problem, discussed in terms of high-level languages, remains a problem. NickyMcLean (talk) 20:26, 1 November 2010 (UTC)
First, the section has been challenged to provide references since January 2009. That has not been done, so the entire section should be deleted. Where are the WP:RS that extol one comparison per iteration?
Second, the above reply focuses on counting instructions rather than key comparisons; its premise is at odds with the thrust of using comparisons to measure performance. Although code examples use integer comparisons to illustrate the concept, typical searches use comparisons that are more involved than simple arithmetic operations. Typical library sorting routines use a function call, and that fits well with the traditional notion is that the cost of the comparison dominates all other operations (such as the instructions that manage the loop and calculate the midpoint). If < stands for comparing words in a dictionary, then the comparisons will have costs in terms of instruction cycles, memory references and cache misses; an extra conditional branch (which neither references data memory nor causes a cache miss) is insignificant. The optimizing compiler comments are a garden path; furthermore, modern compilers have no trouble recognizing simple arithmetic common sub-expressions and testing condition codes. Theoretical comparisons should not and do not hinge on the use of crummy compilers. When counting comparisons, the two tests of the condition codes don't cost anything; neither does the small code that forms the heart of the algorithm.
Glrx (talk) 03:46, 2 November 2010 (UTC)
Offhand, I know of no references extolling one comparison per iteration nor discussing the issue and although the person who supplied the section might, I suspect that this was a piece of knowledge being explained for the presumed benefit of all. For what it's worth, I have already expunged my own offerings as being unreferenced, despite the nice images. Orthodoxy is to be upheld, no matter the consequences.
Second, indeed, library routines for sorting usually provide some generalised approach for specifying the key, being multi-part and possibly with ordering exceptions as well. Or, this may be done by requiring the specification of a function that takes two keys and reports their ordering. As with numerical integration and the like, speed assessment is dominated by the number of such function evaluations, which may be complex. Well and good. Unless the result of such a comparison function is saved so that in the absence of a three-way test it may be inspected twice, the equivalent of the code if a < b then ... else if a > b then ... else ...; contains two comparisons and they will not be converted into one comparison with multiple inspection of condition codes, because the two expressions "a < b" and "a > b" are different, they are not common. And this is quite aside from the added modern confusions due to NaN usage which will damage the usual interpretation of not(a < b). Thus, in the code under discussion, despite comparing only integers, there are explicitly two comparisons per iteration, and they will be executed according to the discussed statistics, because the compiler will not recognise the opportunity to inspect condition codes instead. And yes, I have not performed an exhaustive inspection of all compilers of all computer languages on this issue. I await a counterexample, preferably of an ordinary compiler for an ordinary language, not some special research exercise. I agree that two tests of the condition codes resulting from one comparison are trivial (and standard for code produced by humans with their brain engaged), but I say that there will be two comparisons in code produced by compilers compiling source code with two comparisons, with the expectation of half the time only one being executed.
So, the section describes what will happen, and should stand, aside from issues of provenance. NickyMcLean (talk) 20:39, 2 November 2010 (UTC)

Using a single comparison per iteration only makes the search take one iteration longer on average. I think this gain of one iteration is not worth to invest two comparisons per iterations in most situations. (Maybe there are some cases where this is appropriate, I don't know.) But I think it is better to keep both variants. This way, the reader can decide which version fits best his needs. By the way, I have looked up binary search in the book "Programming in Modula-2" by N. Wirth (page 35, 1983). He does exactly that: he first states the two-comparison version, and then improves it to a "more sophisticated version" that uses only a single comparison per iteration. Wullschj (talk) 03:46, 11 April 2012 (UTC)

I don't have a problem with including a deferred equality version; it has important features. I do have a problem with confusing order stats, comparisons and instruction counts. Typically, theoretical sorting statistics are O() and do not care whether there are one or two comparisons each time through a loop. Using < is fine for examples, but adding a second = won't slow the loop by a factor of two because there is a lot of other instruction overhead. In practice, even the two comparison version will essentially look at condition codes rather than repeating a typically expensive string comparison routine. Glrx (talk) 17:21, 12 April 2012 (UTC)
It may have important features, but where are the references for the uptight who purged the previous manifestation? As for the repeated comparisons, I have checked actual machine code produced by a compiler and condition codes from the first comparison are not checked a second time in place of a second comparison, as explained above. And as explained above, there is not at all a lot of instruction overhead to an iteration (aside from the comparison itself, as for string compares), so small differences have a comparatively large effect. But demonstrations of this involve OR. NickyMcLean (talk) 21:48, 12 April 2012 (UTC)

Midpoint calculation[edit]

The article's discussion about numerical difficulties should be removed. It's origin is not in something fundamental, but rather in some peculiarities of using limited indices. I would characterized it Losedows 16-bit braindamage.

Practical records will have multi-byte records. Consequently, indices would multiplied by at least 2 to get a byte address. That provides a factor of 2 safety margin for the midpoint calculation. The problem therefore occurs when indices and pointers are different sizes. Such a choice appears foolish -- it means, for example, that an implementation would be using 16 bit indices and 32 addresses. That means the sorting routing would not be suitable for tables with more than 64K entries. Why cripple the routine in such a large address space?

Consequently, I believe the numerical difficulty issue should be removed.

Failing that, the code examples should use midpoint(i1, i2) instead of the unreadable closed midpoint expression.

Glrx (talk) 01:24, 15 November 2010 (UTC)

1) Sixteen bit numbers were used in the discussion because the resulting digit strings are much smaller than those for thirty-two bit integers and the point is the same both, and for 24-bit, 48-bit, 64-bit 128-bit and any other size limit.
2) Although I'm using a 32-bit system I have often enough been messing with files exceeding two and even four gigabytes in size, the 32-bit limit. Thus the "factor of 2 safety margin" is irrelevant. NickyMcLean (talk) 19:59, 18 November 2010 (UTC)

bd - April 2012: I'm going to remove the "disputed" tag from the article. It is and will always be a severe logical error to calculate the mid-point as "(L + R) / 2" with indices of a limited size. The fact that this bug even have occurred in widely-used library code like the Java run-time library only adds to the importance and relevance of mentioning this issue in the article. — Preceding unsigned comment added by Briandamgaard (talkcontribs) 09:36, 23 April 2012 (UTC)

I disagree with the removal of the tag. This article confuses the distinction between an algorithm and a program. The main interest should be the algorithm. The expression (L+R)/2 is almost instantly recognized as a midpoint equation; the alternative expression is not. Hence it causes confusion. Issues such as integer overflow are programming issues.
I disagree that index arithmetic needs such protection when an int covers the address space, the items being indexed are greater than 4 bytes (allow sign and doubled value; typically true when one has over a billion entries), and the array index domain is reasonable (e.g., 0 based or 1 based arrays).
If pointer arithmetic is used, then the game is different because the pointers might be in high memory. However, compilers should complain when confronted with a dubious (p1 + p2)/2. Adding pointers doesn't make sense (subtraction does). Even if the sum is typed as an integer, that means the quotient will then be used as a pointer, so there should be a type check error somewhere in code unless the compiler is clueless about integers and pointers.
The above issue about 2 GB and 4 GB files occur when an int does not cover the address space. Furthermore, the value is really a pointer, and the file pointer access routines would be returning 64 bit values; the compiler should issue a truncation warning if that pointer is assigned to an 32 variable. Finally, calculating a midpoint using a pointer (memory or file) can land one in the middle of a record even when the records are fixed size. This page does not address the variable size record issue.
Once again, there is too much focus on programming issues here. WP is not a programming textbook.
I'm willing to have a section that discusses the midpoint calculation error, but the rest of the page should either bottle the midpoint issue up into a midpoint(x,y) function or use the ordinary midpoint formula (as it currently does).
Glrx (talk) 23:48, 26 April 2012 (UTC)

The last paragraph states that if (L + R) exceed the number range then, for signed integers, a more efficient (R + L) >>> 1 can be used. How does this not exceed the number range? This seems to be an entirely separate point that somehow got mixed in here. — Preceding unsigned comment added by (talk) 07:42, 1 January 2013 (UTC)

Language support?[edit]

Why we need this section? Today we have 10,000++ programming languages and any of them is capable to implement the bin search. Do we need the list of 10,000++ refs to every language implementation? --Tim32 (talk) 21:47, 13 March 2011 (UTC)

I would dump most of the entries, but I would mention that many programming language runtime libraries include binary search routines. I'd include C and perhaps a couple others as examples, but no more. This article is not about programming language support. Glrx (talk) 00:28, 14 March 2011 (UTC)
This is not about being capable of implementing it. This is about languages that include binary search in their standard library. Not many languages do that. -- (talk) 19:42, 10 April 2011 (UTC)
This is any encyclopedia, not a catalog. The section should be removed. — Preceding unsigned comment added by (talk) 02:35, 15 May 2014 (UTC)

Cauchy's intermediate value theorem[edit]

It may be worth mentioning here that Cauchy's proof of the intermediate value theorem was a divide and conquer algorithm. Tkuvho (talk) 08:00, 8 April 2011 (UTC)

terrible state of this article[edit]

Really folks, this article is still a disgrace. Just a few of the many problems that still exist:

  • no separation between conceptual and implementation issues
  • some paragraphs are incomprehensible (such as that referring to an "exclusive" version that is no longer present)
  • the pseudocode should precede any "extensions" or advanced discussion and should be correct

McKay (talk) 07:08, 13 June 2011 (UTC)

I propose to replace the iterative pseudocode by this, please comment:

  binarysearch(A, lo, hi, target)
  // Search for target in the sorted array A[lo..hi].
  // If target is present, return some i such that A[i]=value.
  // Otherwise, return the special value "not present".
  down := lo
  up := hi
  while down <= up do
     mid := midpoint(down,up)
     if    A[mid] > target then up := mid-1
     elseif A[mid] < target then down := mid+1
     else return mid
  return "not present"

and to follow it with this explanation:

The expression midpoint(down,up) should calculate an integer mid-way between down and up. If there are two such integers, either is ok. In implementations, care should be taken so that intermediate values in the computation of midpoint(down,up) do not exceed the maximum value of a integer. If lo and hi are non-negative, the expression down+(up-down)/2 is usually better than (down+up)/2 for this reason.

My justification is (1) it is correct even for boundary cases, (2) it doesn't assume anything about whether subscripts are positive or negative and what index is first in an explicitly declared array (even C allows negative subscripts!), (3) implementation niceties are left until later, (4) exactly the same specification can be used for the recursive version. McKay (talk) 05:02, 16 June 2011 (UTC)

Humm. In an Algolish pseudocode,

%To find a value V in N elements of array A, indexed 1 to N.
        L:=0;                %Exclusive bounds: first - 1
        R:=N + 1;            %Last + 1.
  Probe:if (p:=(R - L)/2) <= 0 then Return(-L);   %Search exhausted?
        case(sign(A[p:=L + p]) - V)                %Compare the array element to V
Negative:L:=p; GoTo Probe;                         %A[p] < V
Positive:R:=p; GoTo Probe;                         %A[p] > V
    Zero:Return(p);                                %A[p] = V Found at p.
        esac;                %No other cases. (NaN not allowable)

The exclusive-bounds version in this form remains superior, but the explanations as to why have long since been suppressed as being unnecessary/excessive/boring and anyway, unsourced. Similarly, neither of us have quoted any reference to a proper published source for our effusions. So they are OR and to be excluded. Despite its unsatisfactory nature, the Pascal version from Wirth's book is properly sourced so conforms to the requirement. NickyMcLean (talk) 21:00, 16 June 2011 (UTC)

Having something that is correct and clear is more important than having a source. Anyway, since I have a large number of algorithms books on my shelves it was easy to find it in Kingston, Algorithms and Data Structures, p8, and in Rawlins, Compared to what?, p106. The variation with lo=1 is in lots of other books too. Also, I don't know why you gave this Algolish code, it is (0) less general, (1) unstructured and difficult to read, (2) wrong wrt implementation (N+1 might be too large for an integer). This last problem is a serious handicap of the exclusive method as it doesn't allow the array to start at index 0 when the indexes are unsigned. Incidentally, your "exclusive" version uses the same number of arithmetic operations and one extra assignment per iteration compared to my "inclusive version". In actual timing tests I always get faster times for the inclusive method (C with gcc -O4 on Intel Core 2 Duo). McKay (talk) 04:46, 17 June 2011 (UTC)

You may say that having something that is correct and clear is more important than having a source and I'd be inclined to agree, but it remains W orthodoxy that a proper published source is required and that Original Research is rejected. The problem with overflow is discussed in the article (although some are arguing that the discussion should be ejected) and applies to both inclusive and exclusive versions. Some twit converted 16-bit numbers to 32-bit numbers just to have more imposing digit strings. The exclusive bound version is just as able to deal with arbitrary array bounds as the inclusive. If the array is A(first:last) then the initialisations are L:=first - 1; R:=last + 1; (as indicated in the comments) and for the inclusive bound version are L:=first; R:=last; As it happens, I have a prog. dealing with something like A(-600:33000) and although published, it is not in an approved journal. The only dificulty is the encoding of "not found" which in my case can be zero, as index zero in my application is not used. The Return(-L) is useful for A(1:N) in general as it indicates where to insert a missing element, but this discussion has long since been ejected from the article. In short, both versions can deal with arrays of arbitrary bounds, given attention to the encoding of "Not found", the possibility of signed variables being needed, and the bounds on their range. Your version dodges this detail by not specifying what the "Not found" return is. If the array starts at zero instead of one, then zero is indeed a possible array index and so can't be the value for "not found" and the Return(-L) protocol can't be used - perhaps Return(-32768) or similar? Anyway, here is a less fearsome version.

%To find a value V in N elements of array A, indexed 1 to N.
        L:=0;                %Exclusive bounds: first - 1
        R:=N + 1;            %Last + 1.
  Probe:p:=(R - L)/2         %Offset to the middle.
        if p <= 0 then Return(-L);   %Search exhausted?
        p:=p + L;            %Index of the middle element.
        case(sign(A[p]) - V) %Compare the array element to V
Negative:L:=p; GoTo Probe;   %A[p] < V
Positive:R:=p; GoTo Probe;   %A[p] > V
    Zero:Return(p);          %A[p] = V Found at p.
        esac;                %No other cases. (NaN not allowable)

It's true that there appear the dreaded "go to" decried by the structuralist fanatics, rather than hiding them with "while" and the like. There was a nice flowchart for the method that seemed clear to me and it had a nice structure too, or so I thought, but it was OR. I'm unclear about your claim of arithmetic operations per iteration, as your version does not lay out the arithmetic operations within function midpoint. Conversely, my pseudocode involving "case" does not mean the usual case statement. It is intended to stand for a three-way IF test (or equivalent) that does not repeat the comparison action as in the usual if a < b then ... else if a > b then ... else ...; in the absence of such a syntactic offering. Ah, I see that someone has removed the discussion of this point and the associated versions from the article, so the "Deferred Detection of Equality" dangles.

Counting operations is tricky without details, but I note two differences: the inclusive method's loop contol has while L <= R (as in while down <= up) and discards the result after testing then goes on to determine midpoint (which will surely involve something of the sort), whereas the exclusive version I offer retains the result as a part of indexing the middle element. Secondly, the inclusive form has to compute mid + 1 or mid - 1 to place in the appropriate bound whereas the exclusive form merely places a value without offset. These all involve additional arithmetic operations. Assessing exact counts would require an agreed machine code, such as Prof. Knuth's MIX, and both versions being analysed as he does. My original efforts in this regard were in assembler on an IBM1130, and actual machines have particularities that may mean that registers can be used with cunning and assignments be deferred, etc. Incidentally, (up - down)/2 + down is better than down + (up - down)/2.

But yes, the article is a mess. The removers have ejected stuff without bothering to consider internal references. NickyMcLean (talk) 21:20, 19 June 2011 (UTC)

So, in summary, my version can be cited to at least two reliable sources and has the advantages I listed above as (1)-(4). You did not identify any problems with it. It does not have any overflow problem if midpoint() is implemented carefully (as it says). Your version has other overflow problems: try it for N=32767 when indexes are signed 16-bit integers. The return "not found" is fine; this is an algorithm in pseudo-code, not a program to feed into a compiler, and for the same reason a 3-part if and a 3-branch case are the same (but I have no objection to using a case syntax if it doesn't introduce new overflows like hi-lo). Anyway, returning an integer value for a missing target is bad engineering since every integer value is potentially an array index. Hardware-level concerns don't belong here at all. So I am going to put my version in along with a recursive procedure having the same specification (which is not possible with your version). For checking, here is the planned recursive version (I have several sources):
  binarysearch(A, lo, hi, target)
  // Search for target in the sorted array A[lo..hi].
  // If target is present, return some i such that A[i]=value.
  // Otherwise, return a special value "not present".
    if lo > hi then return "not present" endif
    mid := midpoint(lo,hi)
    if    A[mid] > target then return binarysearch(A,lo,mid-1,target)
    elsif A[mid] < target then return binarysearch(A,mid+1,hi,target)
    else return mid

McKay (talk) 07:23, 20 June 2011 (UTC)

If you stay with pseudocode leaving the "Not found" nature undefined, then there is no difficulty over the details of array bounds and signed/unsigned integers and whether or not an integer value might be usable as an array index. With regard to integer overflow, I repeat again that both versions have this problem as is/was discussed in the article, though in different ways. For N = 32767 as you suggest, try your inclusive bounds version where the sought value follows A(32767) in order. A clue for you: mid + 1 It is for this reason that the article explained a limit of N = 32766 rather than 32767, or the equivalent for other sizes whether signed integers or not. As for recursion, the version you present is tidy though I'd still use L and R. The algorithm may be "basic", but it can be implemented in ways that are surprisingly different (midpoint-and-width, for example), and used in a context can have additional attributes (the Return(-L) feature, for example) but describing these details takes space.
And an afterthought: here is a recursive version of the exclusive search.
Function BinarySearch(A,L,R,V)      %Exclusive bounds version.
%Find some index value p such that A[p] = V, or report "not found"
%For an array span with bounds first to last use BinarySearch(A,first - 1,last + 1,V) 
 p:=(R - L)/2;                                   %Integer division. 3/2 gives 1.
 if p <= 0 then return "Not found";             %Remaining search span is exhausted!
 p:=L + p;                                       %Apply the offset to index the middle element.
 if A[p] < V then return BinarySearch(A,p,R,V); %Search the second half.
 if A[p] > V then return BinarySearch(A,L,p,V); %Search the first half.
 return p;                                       %Found at p!
end function BinarySearch;
This version can enjoy all the praise words employed above, and has no wierd mid + 1 or mid -1 trickery...

NickyMcLean (talk) 22:28, 20 June 2011 (UTC)

You are quite right that my code might compute lo-1 or hi+1. It is not hard to fix, but only by leaving the sources behind and adding complexity. (Incidentally, for someone who likes to put assignments into the middle of arithmetic expressions to describe adding 1 as "trickery" is a bit rich.) Now that I think more about it, it might be best to leave the overflow issue entirely to the comments, so instead of using midpoint(down,up) use like many textbooks do. (I was avoiding the issue of what "/2" means before anyway.) McKay (talk) 03:48, 22 June 2011 (UTC)
Ah, I admit to a taste for trickery. An assembler version would have the same topology but the placement of actions along each path would be rearranged to reduce the number of jumps in the iteration sequences. It is only via assignments within expressions that many assembler trickeries flinging off side effects to later advantage can be accommodated - otherwise high-level languages force lame semi-repetitions. Anyway, in comparing the two recursive versions (and incidentally, your version need not invoke "else" stuff), I think it fair to say that the mid+-1 version requires a bit more thinking to follow when assessing its correctness and only afterwards does it become "obvious". But the exclusive bounds version does have an odd invocation in order to maintain its premise. The (L + R)/2 overflow issue, whether decorated with fancy notation for "floor" and claims of unlimited range for arithmetic in pseudocode, is regulary ignored in texts, and innumerable resulting progs. use it as it stands. Whereas it has no place in the exclusive bounds version. Similarly with the 32766 issue being unmentioned. So, I'd favour an explication of the exclusive bound version, plus a later commentary about the inclusive bound and the (L + R)/2 overflow that it entices, associated with the 32766 issue applying to both versions. However, no reference near my hand describes the exclusive bound version, or indeed mentions overflow problems at all. If details are to be discussed, there are many, and some are traps to avoid. If details are not discussed, then traps will be fallen into if people actually use this article to produce a prog. Unfortunately, W descriptions of famous algorithms are routinely miscorrected. (QuickSort is notorious) This was why I introduced a flowchart depiction (images are more difficult to mess with!), but, I have no reference for it, so it was OR, as were the results on performance despite the pretty graphs with interesting bends. NickyMcLean (talk) 20:50, 23 June 2011 (UTC)

Why include both an iterative and a recursive version? Any loop is trivially equivalent to a tail-recursion. --Macrakis (talk) 14:23, 20 June 2011 (UTC)

A recursive version is conceptually closer to the way binary search is usually described, i.e. a divide and conquer algorithm where we take a binary search problem, and we turn it into a subproblem (smaller binary search problem), recursively. A recursive version represents this in a very apparent way by a recursive call. -- (talk) 01:12, 12 July 2011 (UTC)

It still sucks. — Preceding unsigned comment added by (talk) 23:03, 19 May 2014 (UTC)

Sorted Array[edit]

This article is entirely too long for something as basic as a binary search. I'd suggest creating the sorted array article and splitting / removing content. --Scandum (talk) 12:37, 20 June 2011 (UTC)

It could be a section of a sorted array article, that is not a bad idea and would be an excuse to start afresh (which this article badly needs). Whether as a stand-alone article, or a section of another, I propose subsections as follows:
  • Informal description and example
  • Pseudocode formal description (both iterative and recursive)
  • Running time analysis
  • Brief mention of variations (no code, no details)
To answer Macrakis, the reason to include both iterative and recursive is that elementary sources are split more or less evenly between the two views. Also, the view that people find easiest to understand varies (I've been teaching this since 1978 and didn't yet find an explanation that suits everyone). McKay (talk) 03:48, 22 June 2011 (UTC)
Dear McKay, I am not sure that any consensus is possible for this article. For example, see section "Implementations. Iterative". Somebody wrote "The following incorrect (see notes below) algorithm is slightly modified (to avoid overflow) from Niklaus Wirth's in standard Pascal". I think, if you will add your iterative pseudocode, then somebody will write that it is incorrect because it uses Goto (vs structured programming) and so on. Perhaps the article needs some sort of protection to restrict its editing? --Tim32 (talk) 09:58, 22 June 2011 (UTC)
You misread; my pseudocode is structured and has no gotos. McKay (talk) 03:34, 24 June 2011 (UTC)
Yes. Sorry for my mistake. I think, Pascal-like pseudocode would be better for beginners:
  function binarySearch(A, lo, hi, target)
  { Search for target in the sorted array A[lo..hi].
    If target is present, return some i such that A[i]=value.
    Otherwise, return the special value notPresent. For example,
    if lo and hi are non-negative, then notPresent=-1 }
   down := lo;
   up := hi;
   while down <= up do
      mid := midpoint(down,up);
      if    A[mid] > target then up := mid-1
      else A[mid] < target then down := mid+1
      else Result := mid
   Result := notPresent;
--Tim32 (talk) 07:55, 24 June 2011 (UTC)
Um, no, "Result := mid" needs to return immmediately, not just set the return value. I was thinking of using the style shown at the style page you mentioned. McKay (talk) 08:14, 24 June 2011 (UTC)
Ok. It may be "else begin Result := mid; exit end;"--Tim32 (talk) 08:20, 24 June 2011 (UTC)
That could be taken to mean exiting from the loop (as it does in some computer languages). I don't think it is wise to avoid "return". McKay (talk) 08:35, 24 June 2011 (UTC)
Yes. This is interesting problem of pseudocode understanding. Let it be "else begin Result := mid; exitFromFunction end;" A beginner may have a problem to understand Return as well.--Tim32 (talk) 08:47, 24 June 2011 (UTC)
Err, why the copy of the parameters to internal variables? Ah, because their values are changed and you don't want to damage the originals. Proper pascal messes about with the keyword "Var" and its absence for value parameters. NickyMcLean (talk) 21:38, 26 June 2011 (UTC)
This is Pascal-like pseudocode, not standard Pascal. For realization the 1st line has to be:
function binarySearch(A : intArray; lo, hi, target : integer) : integer;
function binarySearch(A : realArray; lo, hi : integer; target : real) : real;
etc. But if you wish you can write:
function binarySearch(var A : realArray; var lo, hi : integer; var target : real) : real;
Ok?--Tim32 (talk) 23:26, 26 June 2011 (UTC)
Ah, Pascal programming. I've always been annoyed at the amount of babble needed in Pascal parameter lists. Not really OK. If the parameters are not "Var" type so that they are passed by value, there is no need for the internal copies that can be adjusted since the parameters are already, though not the array. If we abandon the Pascal detail for proper pseudocode, then there should be some explanation of why there are copies made. Or alternatively, make no such copies and so don't talk about those details (which add nothing to the understanding of the method), as is desired by the simplifiers though they do not want to see any code at all. Oddly enough, the exclusive-bound iterative version does not raise this problem because it uses internal variables anyway, but everyone and their source texts seems to be glued to the inclusive bounds version.NickyMcLean (talk) 20:48, 27 June 2011 (UTC)
WP:NOT#Wikipedia_is_not_a_manual.2C_guidebook.2C_textbook.2C_or_scientific_journal I strongly suggest removing any code and pseudocode as Wikipedia is not a textbook. Like I mentioned earlier, this article is entirely too long for a relatively simple subject matter. --Scandum (talk) 12:21, 22 June 2011 (UTC)
The fact is that the majority of WikiArticles about algorithms have code and/or pseudocode. "Within this WikiProject, the consensus has generally been that where possible algorithms should be presented in pseudocode." Wikipedia:WikiProject_Computer_science/Manual_of_style_(computer_science)#Algorithms--Tim32 (talk) 13:25, 22 June 2011 (UTC)
Taken in context that is to mean that pseudocode is preferred above a language specific example, not that algorithms in pseudocode should be added 'where possible'. It's in itself not a huge issue to me if the other problems are addressed. --Scandum (talk) 04:15, 23 June 2011 (UTC)
However, in contrast with you they did not write "We strongly suggest removing any code and pseudocode as Wikipedia is not a textbook".--Tim32 (talk) 08:04, 24 June 2011 (UTC)

Perl Implementation Did Not Contain Bug[edit]

The accusation that the Perl code contained a bug is false. The binary search algorithm check was fine.

The problem was that the index must be less than half the size of the variable used to store it (be it an integer, unsigned integer, or other).

This was not an algorithm bug as is purported on this page - and I feel strongly that this is unjust. A simple comment at the start of the algorithm (input maximum index may not be more than half the maximum of the variable type) would suffice as a routine interface contract. — Preceding unsigned comment added by (talk) 12:31, 21 November 2011 (UTC)

Oh? And can you provide even one example of a textbook or documentation or interface specification that from the beginning announced this restriction? NickyMcLean (talk) 20:47, 23 November 2011 (UTC)

How about beginning the Implementations section with a correct implementation?[edit]

The Implementations section start with an implementation which says "The following incorrect (see notes below) algorithm [...]". If it is incorrect, why put it first? Nicolas1981 (talk) 11:19, 15 March 2012 (UTC)

Simply because WP is prowled by righteous purists who will decry as Original Research and excise any entry not backed by an approved published source, and the textbooks etc. uniformly present versions where integer overflow and the like are ignored as not relevant to the explanation of the method. Many potential contributors have well-tested and even provably correct versions of a binary search routine in various computer languages, but, none of them have been published in a proper approved publication. And yes, I have had stuff rejected for this reason too. For instance, there used to be a version that performed one test per iteration (rather than the two forced by most computer language syntax) under the heading "Deferred Detection of Equality", something that had never ocurred to me, and there was an explanation of its behaviour, but, because there was no approved source, out it went. The example binary search routine is from a famous text on programming, and thus is a proper source. Despite all the noted detractions. NickyMcLean (talk) 20:18, 15 March 2012 (UTC)
The section is terrible. Wikipedia policy on OR is no excuse for poor content and poor presentation. The current text is more a critique of Wirth's code than it is an exposition of a binary search algorithm. Just silly. --Macrakis (talk) 23:52, 15 March 2012 (UTC)
There is a statement somewhere to the effect that any rule slavishly followed that leads to poor results is to be disregarded, however the righteous disregard it and hold themselves all the more righteous thereby. NickyMcLean (talk) 02:57, 16 March 2012 (UTC)
The problem is that some people here don't recognize the distinction between algorithm and implementation and they think that an algorithm which has problems when copied unmodified onto a computer is erroneous. This mindset is contrary to how the great majority of algorithms are presented in computer science. Until it is changed, the prospect for this article is not great. McKay (talk) 01:57, 16 March 2012 (UTC)

imid-1 becomes negative[edit]

I see nothing in the recursive or iterative algorithms that avoids the "imid - 1 becomes negative" error. Aasuming the use of unsigned integers, a simple fix would be to return KEY_NOT_FOUND if imid is zero - before decrementing imid. — Preceding unsigned comment added by (talk) 02:13, 15 May 2014 (UTC)

The given algorithms use int rather than unsigned int, so the underlying premise is faulty. If imax ever goes negative, then it is caught at the top of the recursion or in the while condition. Consequently, a separate test for the absolute left edge of the search region is not needed. Glrx (talk) 19:44, 16 May 2014 (UTC)
Nope - these are supposed to be algorithms, not implementations in a specific language/architecture. (By the way, there is nothing to "catch" a -1, even if legal) Also, nothing prevents (imid/right/last) + 1 from exceeding the maximum possible. — Preceding unsigned comment added by (talk) 22:28, 19 May 2014 (UTC)

Algorithm section needs to specify how midpoint is calculated[edit]

Only one way should suffice for all three algorithms, but this is critical, to determine how implementation problems - notably integer range issues - can occur and may be avoided.

Deferred method[edit]

What is supposed to be meant by "assert ..."? Does the algorithm blow up if false? Does it do something to assure its true? Or most likely, is this another piece of unsupported fluff in this article?

C code replaced with pseudocode[edit]

Because C code is inaccessible to many readers and seems to be unsourced (or at least unsuppoerted) and written by an editor, I have replaced it with a rough synthesis of the pseudocode found in The Art of Computer Programming (particularly §6.2, "Algorithm B"). Esquivalience t 01:52, 19 March 2016 (UTC)

GA Review[edit]

This review is transcluded from Talk:Binary search algorithm/GA1. The edit link for this section can be used to add comments to the review.

Reviewer: David Eppstein (talk · contribs) 05:32, 27 March 2016 (UTC)

Reviewing. But this is likely to be a quick fail unless something is addressed relating to criterion 3(a) (all essential aspects of the topic must be covered). Namely: the article as it stands describes only a version of binary search that finds elements when they are already in the array, but it returns nothing useful when the query value is not in the array. But to my understanding, that is not a useful version of the algorithm. If what you want is only equality-based lookups, you would be better off with a hash table, or maybe a bitmap for sets with small ranges and no data associated with the values. What binary search is more useful for is something hash tables can't easily do: predecessor, successor, or nearest neighbor queries. Currently the article mentions nothing of this: nothing about using hashing instead of binary search when you need only exact matches, nothing about predecessor, successor, or nearest neighbor queries, and very little about how to choose between binary search of sorted arrays versus other alternatives (all I see is a uselessly-short warning that the cost of sorting may make a linear search a better choice in some unspecified circumstances). Without a more complete and accurate coverage of the types of abstract data type that binary search can be a good match for, there is no way this can pass 3(a). —David Eppstein (talk) 05:32, 27 March 2016 (UTC)

I've started to expand the article per your feedback. Esquivalience t 20:32, 27 March 2016 (UTC)
Yes check.svg Done hash tables and approximate matches, on to bitmaps and binary search versus other searching algorithms. Esquivalience t 03:33, 28 March 2016 (UTC)
I certainly agree that the article should describe the special value of binary search as opposed to (among others) hash tables. However, what User:Esquivalience has mostly added is more detailed description of hash table algorithms, almost all of which is better covered in that article, and is completely irrelevant to binary search or the comparison of binary search to hash tables. I removed that material, but Esquivalience put it back.... Other editors' comments are welcome. --Macrakis (talk) 19:33, 28 March 2016 (UTC)
What I think the binary search article needs is not a description of hash tables, binary search trees, etc., but a comparison of the relative advantages and disadvantages of these solutions: what do you need to know to decide whether a sorted array with binary search is the right or wrong solution to your programming problem? —David Eppstein (talk) 20:01, 28 March 2016 (UTC)
@David Eppstein and Macrakis: I have removed most of the description of the individual data structures, instead focusing on comparing them to binary search of a sorted array of records. Esquivalience t 01:36, 29 March 2016 (UTC)

Second reading[edit]

Good article criteria
1a. Good prose: some grammatical errors and run-on sentences detailed below.
1b. Complies with MOS: some issues with lead, figure sizing and section layout detailed below.
2a. References well formatted: mostly, but some are missing important data.
2b. The references used are reliable and cover all claims: There are a couple of self-published sources but they don't look problematic. However, there are many claims in the article that are missing sources.
2c. No original research: mostly true, but there are some statements in the article that are just wrong, possibly through editors adding OR.
2d. No copyvio or plagiarism: none found.
3a. Addresses the main aspects of the topic: Now much better, but we're still missing any discussion of how to use binary search to find non-exact matches.
3b. No unnecessary detail: I think the "Exclusive or inclusive bounds" section is problematic wrt this criterion.
4. Neutral: No major problems here.
5. Stable: Hmm. Although much of the editing was in response to my first round of review, there have been some major edit-revert cycles recently, both in response to my review and earlier.
6a. Images are properly licensed: yes.
6b. Images are relevant: yes, but some improvement in legibility is needed. Images have good captions: again, there is room for improvement.
The first and second paragraphs accurately summarize the first and second sections (algorithm and performance), but after that the remaining lead paragraph does a poor job of summarizing the rest of the article.
Expanded the lead to summarize the article. Esquivalience t 23:28, 7 April 2016 (UTC)
"that are stored contiguously": is there any other kind of array?
Removed. Esquivalience t 23:28, 7 April 2016 (UTC)
The image in the infobox has text that is too small to be legible. Why is the image scaled to below the default image size? And the use of an image size specified as an absolute number of pixels rather than as a relative scaling factor violates MOS:IMGSIZE.
Yes check.svg Done. Esquivalience t 23:28, 7 April 2016 (UTC)
The image caption omits some key information (that the search key is 4) that would help make the image more understandable (once it is redrawn or resized to make the search key in the image legible).
The space complexity claim in the infobox is confusing or wrong. The working space for an iterative binary search is indeed constant. The working space for a recursive binary search is logarithmic. And both kinds of search use linear space for the sorted array itself. This should either be stated more clearly or the term omitted from the infobox.
"This method can be described recursively or iteratively": this is now just a throwaway line with no source and no later expansion.
Removed. Esquivalience t 23:28, 7 April 2016 (UTC)
Why are we using 1-based array indexing when all modern programming languages use 0-based indexing?
Converted to 0-based indexing. Esquivalience t 23:28, 7 April 2016 (UTC)
What is a "logical procedure" and how does it differ from some other kind of procedure?
Likely some needless fluff; removed. Esquivalience t 23:28, 7 April 2016 (UTC)
The "inclusive or exclusive" variation subsection is far out of proportion to its small importance (just a coding detail). It also appears to be mostly unsourced.
Removed entirely. Esquivalience t 23:28, 7 April 2016 (UTC)
This algorithm still has the major major flaw that I already complained about: it only works for finding exact matches, a problem for which binary search is the wrong solution. There is still no description of how to use binary search to solve successor, predecessor, or nearest neighbor queries, nor even (except later hidden in the hash table subsection) any mention that it can do so.
Added a new section on approximate matching. Esquivalience t 23:28, 7 April 2016 (UTC)
Re the link on "place": see WP:SUBMARINE.
Removed link. Esquivalience t 23:28, 7 April 2016 (UTC)
Using wiki-formatting for the floor operation is very ugly. If you're going to use that operation, do it in <math>.
Converted all math to TeX. Esquivalience t 01:18, 8 April 2016 (UTC)
How is it possible that comparison tree is a redlink? Maybe we should instead link to Decision tree model, rather than writing here as if this model is unworthy of a link.
The use of comparison trees to analyze binary search needs a source.
cited TAOCP §6.2 as well as Timothy 1997. Esquivalience t 01:33, 10 April 2016 (UTC)
The image caption could use more detail: what is the array being searched? And what is the search value?
The average case analysis needs a statement of assumptions (maybe, we are only searching for things that are in the array, and they are all equally likely?).
Yes check.svg Done. Esquivalience t 01:33, 10 April 2016 (UTC)
The best case claim has no source.
Added ref. Esquivalience t 01:33, 10 April 2016 (UTC)
The sentence "In practice, for all but large n, ..." is too long, and should be split into multiple shorter sentences.
Split. Esquivalience t 01:33, 10 April 2016 (UTC)
Most importantly the average case analysis here is incorrect, because it is based on a different version of binary search, one that does the equality test second in each iteration. The version actually described here does the equality test first, so most iterations (all but the last one) take two full comparisons.
Fixed. Esquivalience t 01:33, 10 April 2016 (UTC)
What does small vs large mean here? What is the cutoff for which this optimization is worthwhile? 10? 100? 100000?
Knuth 1998 gives a value of 266; added. Esquivalience t 01:33, 10 April 2016 (UTC)
Fractional cascading is worth mentioning somewhere in this article, but is this section the right place for it? And it needs a source.
Moved to variations section.
Binary search versus other schemes
Thanks for adding this badly-needed section.
Binary search versus other schemes — Hash tables
Hash tables is plural, not matching the singular verb "is".
Re "the only information given on a failed search is that the key is not present in any record": but that's also true of the pseudocode for binary search given here!
Added approximate matching section to Binary search algorithm#Algorithm.
Some kinds of hash table (in particular cuckoo hashing) have constant worst case search time, not merely constant average case
Changed to "most implementations", mentioned perfect hashing as footnote.
"is retained": awkward passive wording.
Binary search versus other schemes — Binary search trees
"The main advantage of BSTs": here and later I think it would be less WP:TECHNICAL to just spell out binary search trees rather than using an acronym.
Yes check.svg Done
The same sentence has yet another singular-plural mismatch.
The first paragraph needs a source.
Added (Sedgewick & Robert)
No mention of the space penalty for storing the items in a collection of separate tree node objects instead of a single array.
Added mention.
Where does the 1.386 constant factor come from??? Perhaps this is the expected time for binary search trees created by random insertion without rebalancing, but that's a very specific choice of assumptions that, if you're making them, should be stated explicitly. And it needs a source. But it would be better just to stick to generalities. There are many types of balanced binary trees, they generally have logarithmic search time and update times (not just the vague statement here about updates being slow), the constant factor in the search time is generally greater than one, but for weight-balanced trees it can be made arbitrarily close to one.
Replaced with "slightly worse", which conveys the message clearly enough.
The claim that traversing the nodes of a binary search tree takes linearithmic time is flat-out false. An in-order traversal is a linear-time operation.
Fixed. Esquivalience t 02:05, 10 April 2016 (UTC)
Binary search versus other schemes — Linear search
What does "non-trivial" mean?
Likely some chaff; removed.
"amortized" is a technical term that needs an explanation.
Clarified using "spread".
For a single search, sorting + binary search is slower than not sorting + linear search, so I think more explanation is needed here.
The new wording should hopefully be clear enough that the cost of sorting needs to be spread over all searches.
It might also be worth mentioning that linear search works for linked lists and dynamic arrays and that those structures allow significantly simpler addition and deletion of elements (constant time) than sorted arrays or even binary search trees.
Added mention.
The second half of the paragraph needs sources.
Variations — Uniform binary search
The second half of the first sentence has a singular-plural mismatch between a verb and its subject.
The second paragraph is completely opaque to me. Either expand and clarify or remove it.
It was referring to actually storing the widths in another array, but pretty obscure technique; removed.
Variations — Fibonaccian search
I don't see the point of including this section; Knuth may have thought it worthy of mention, but only as a curiosity rather than as a useful algorithm. There is no division cost to be saved, because there is no actual division operation in binary search (the operation that appears to be a division, but division by the constant 2, would be replaced by a shift operation by any reasonably intelligent optimizing compiler).
What should be mentioned here instead is Fibonacci search technique, a related algorithm for finding maxima of unimodal functions.
(Response to both) I removed the claim that it reduces the cost of the midpoint, but kept the rest (as the technique is tied to binary search), but added the original purpose of Fibonaccian search.
Variations — Exponential search
What is "the first element in index 2^j" supposed to mean? How about more clearly "the first element whose index is a power of two", and separating the descriptions of what we're looking for and how we look for it into two separate sentences.
The wording implies that the analysis is valid only for positions near the start of the array. This is false. The analysis is valid always; however, for positions greater than sqrt(n) the number of comparisons is larger than binary search.
Why are we putting floor signs around the log? We don't do that elsewhere, and it looks ugly.
Moved floor signs.
Variations — Interpolation search
"arrays whose elements successively increase by a constant or close to constant amount": this is overly specific. What interpolation search actually assumes is that the array values are distributed nearly uniformly across their target range.
The first paragraph has no footnote. Even if the source is the same as the second paragraph the footnote should be repeated.
The average case number of comparisons is stated both very specifically (without the use of O-notation) and very vaguely ("about", no definition of n, and average case without specifying what it's an average over). The actual bound is at most log_2 log_2 n comparisons, in expectation, for sorted arrays of n real numbers drawn at random from the uniform distribution on an interval.
I believe n should be clear enough, but replaced with asymptotic notation.
There is a significant difference between the assumptions of interpolation search and of binary search that is not mentioned here, and should be: interpolation search requires that the array elements be numbers. Binary search on the other hand requires only that they be drawn from a totally ordered set.
Variations — Equality test at end of program
The start of this subsection makes no sense. First of all, "dichotomic" is not making two choices, it is making a single choice between two options. Second, the algorithm as presented doesn't do that. We are saying "the algorithm chooses between two options, if you ignore the third option that it also chooses from", but why should we ignore part of the algorithm in making this classification. And does saying it's dichotomic really add to reader understanding? This has all the appearance of a wikilink awkwardly wedged into an article because some other article wanted to avoid being an orphan rather than to actually improve the article.
"dichotomic" seems irrelevant (this variation was not about the number of potential choices), so removed and reworded.
How does this save only half a comparison per iteration rather than a whole comparison? Note that in the example code, the equality test is done first, so it happens every iteration.
For all but large n: how large is large?
Specified in the performance section (266
Don't start a paragraph with a pronoun.
How is returning the last index of a match rather than the first index of a match, in case of duplicate array elements, an advantage?
Non-duplicate successor queries may be sped up slightly.
The Inakibit-Anu lists regular numbers (the ones whose reciprocal has a finite segasimal representation), not all numbers. Also, as far as I can tell (e.g. from reading a related source "Ancient Babylonian Algorithms", Knuth 1972) the purpose of listing these numbers in sorted order is unknown and can only be inferred to be for the purposes of making search easier, so saying that the table was "sorted to allow for faster searching" overstates the case. And in any case since the search would be done by hand rather than by machine it would be likely to be some kind of interpolation search rather than binary search.
This would imply that the Babylonians were more clever than the programmers from 1946-60 (where binary search worked onky for a limited number of arrays), so removed altogether.
The first sentence of the second paragraph is written very confusingly as a double negative.
Moot per above.
The formula for 2^n-1 is badly formatted. If you're going to use wikiformatted math instead of <math>, you need to put spaces around the minus sign, use a proper minus sign (&minus;) instead of a hyphen, and prevent it from being wrapped. You might also consider using the {{math}} template to make the formatting more closely resemble <math> formatting.
Replaced with TeX.
Has there really been no history of this subject since 1971?
Added fractional cascading in 1986, but all the other papers I could find were apparently useless variations of binary search, one being binary search in a linked list.
Implementation issues
Don't use pronouns such as "it" at the start of a section, where there is no earlier noun for the pronoun to refer to.
No instance of the above appears in the section after I've addressed this section.
The description of the conditions under which an overflow are in some ways too vague ("wanders toward the upper end"), but in some ways overly technical (why is the bound on last specified as a number divided by two instead of just a number?).
The only error described here is overflow. However, there are other errors (mostly related to handling two-way rather than three-way branching in each iteration, as mentioned earlier in the "Equality test at end of program" subsection) that are more important to avoid, and they should at least be mentioned.
The Ruggieri source is claimed to support the fact that there are two issues with the median calculation: representability of first-1 and last+1, and overflow in the calculation of first+last. But in fact it supports only the second issue, and its alternative solution.
Cited Programming Pearls (which contains a whole chapter on binary search pitfalls)
Language support
Only four of the ten entries here are sourced
Added refs.
Notes, See also
The section ordering violates MOS:LAYOUT
Mostly consistently formatted in CS1.
Many notable authors are missing authorlinks (e.g. Stroustroup, Karlin, Mehlhorn, Fich, Tarjan).
Added some.
Similarly, notable journals such as SIAM Journal on Computing should be linked.
References [17] and [18] appear to be the same thing (Knuth 6.2.2) and should be consolidated into one reference or properly differentiated from each other.
I believe that all the Knuth cites have section names on them.
Some references to books (e.g. Gast, Bentley) are missing specific page numbers; this needs to be fixed, as otherwise the reference fails WP:V.
The Wirth reference is missing any publication data.
Bottenbruch's 1962 paper was enough; removed Wirth ref.
Getting better, good enough to avoid a quick fail for now, but still quite a ways to go.


That was all from April 3. Two weeks later, major components of this review remain unaddressed; the article is being improved, but slowly; at this rate it will take many weeks to converge, if it does at all. The GA review process is not really supposed to be used for long-term revisions like that, but more for things that can be handled quickly to bring the article into compliance. On the other hand, this is an important topic that I really would like to see covered at the GA level. So, to push things along, how about imposing a deadline: if something is done to handle all of the above comments by next weekend, I will proceed to another round of review, but otherwise we can leave this review as a fail, giving you as much time as you want to bring it up to par before nominating it again. —David Eppstein (talk) 18:27, 19 April 2016 (UTC)

@David Eppstein: Sorry for the delay, but I believe that I've addressed all the concerns above as of now (except maybe some missing author links, which I will try to find). Esquivalience t 01:08, 22 April 2016 (UTC)

Third reading[edit]

Left over from second reading
The lead now more or less summarizes the rest of the article but it still has major problems; see below.
All requested changes to the "Algorithm" section have been done
Performance section "The image caption could use more detail": not addressed
Yes check.svg Done
Comparison section, "Replaced with slightly worse": the replacement sentence is very confusingly written. And "Clarified using spread": Not much of a clarification. At least link to an article on amortized analysis.
Reworded and linked to amortized analysis.
Variations: the suggested replacement of Knuth's uninteresting "Fibonaccian search" (a technique for doing the same thing as binary search using Fibonacci numbers instead of midpoint bisection) by "Fibonacci search" (a technique for finding extreme points of unimodal arrays by doing something resembling binary search, again using Fibonacci numbers) has been done only halfway, resulting in some strange hybrid that still uses one name and the Knuth reference but now mostly talks about the other algorithm that has a different name and is not what Knuth describes. And then it says that it's less efficient, which misses the point (it's not less efficient, for the purpose of finding extreme points).
Removed Knuth Fibonacci search so that it is purely about Kiefer's Fibonacci search.
History and implementation issues: all requests done.
Library support: more of these are now sourced, but not all of them. External links within the main article text are neither a valid way to source things nor allowed by MOS for any other reason.
Yes check.svg Done
Duplication of refs 17 and 18 (now 14 and 15) still exists. Both are to Knuth section 6.4, without any more specific subsection or page numbers. They now have different section titles but how is that possible when they have the same section number?
For the {{Sfn}} citations, I meant to place the section number of the entire subchapter containing the specific subsection, then the subsection name itself after the section number. However, this was confusing, so made it clear that the section titles after the section numbers were subsections of a section. Also added subchapter titles.
"Exact searching is enough for many applications, most notably associative arrays. However, it is difficult or impossible to perform approximate search operations—important for some applications such as priority queues—on many of these data structures, such as finding the next-smallest, next-largest, and nearest element relative to an element, while binary search can perform such searches in logarithmic time or faster." This sort of detailed evaluation is out of place in the lead. And the suggestion that binary searches in sorted arrays are a good way of implementing associative arrays is wrong.
Condensed and fixed.
"variations of binary search that may run faster on some arrays or computers": speed is not the only reason for the variations.
Condensed altogether to just "variations of binary search" (as appropriate for lead).
"fractional cascading extends binary search to efficiently solve various complex searching problems in domains such as computational geometry": unnecessarily vague and technical. All fractional searching does is to speed up searches for the same key in multiple sorted lists. How hard is that to say?
History: all requests done.
"Some implementations may place the comparison for equality at the end": this is ambiguous. Does it mean the end of each iteration or the end of the whole algorithm? (Both are possible and both are improvements.)
I changed the procedure so the equality check per iteration was done last altogether. This leaves only one possible variation that fits under this wording, but I reworded so its says "end of an algorithm".
"Particularly, binary search can be used to compute, relative to a element": No! Finding the successor or predecessor of an element can be done trivially without recourse to any kind of search (just add one or subtract one to its index). What is needed is to find the successor or predecessor of a search target, especially when that target is not an element.
Added rank queries (the number of elements below a certain value) and then provided the procedure for predecessor and successor queries.
Range searches: the calculation of the number of elements is off by one, and needs some case analysis to handle the possibilities that the range endpoints are or are not elements of the array.
Fixed and used rank queries instead of binary search for range searches instead.
"and vice versa" too brief a shorthand for something like "and symmetrically for the right subtree".
"this is equivalent to a binary search where the element is found or the search is terminated only after the search has reduced to one element": false. Some reductions to a single element may use fewer than the worst case number of comparisons.
"This is equivalent to a binary search that has reduced to two elements": again false. Are you using a source that made a simplifying assumption that the comparison tree has a complete last level? Because that's not true in general.
"No search algorithm that is based solely on comparisons can exhibit better time performance than binary search." Does this refer to the worst case number of comparisons, or some other calculation of performance?
In terms of the number of iterations; changed.
How compatible are the claims that the binary search algorithm described here is absolutely the best possible (in terms of the number of comparisons, presumably) and that the equality-at-the-end optimization that reduces the number of comparisons by a factor of two (but is slower for realistic input sizes because of the other overhead of an additional loop iteration) is not worth it? They seem to contradict each other to me.
Changed to iterations per above.
As long as we're mentioning the equality-at-the-end-of-the-search variation that saves one comparison per iteration but loses an iteration, shouldn't we mention the equality-after-inequality-within-each-iteration variation that saves half a comparison per iteration (on average) for no cost at all? And when Knuth compared equality at the end to the alternative, did he compare it to the alternative with 2 comparisons per iteration or to the one with (average) 1.5?
Knuth compared it to the 1.5 average variations version.
The fractional cascading sentence could use a footnote (re-used from the later more detailed description).
A section link probably would serve the same purpose, but expanded it by a bit anyway.
Binary search versus other schemes
"the key is not present" should be "the target is not present".
Yes check.svg Done
The paragraph break in the hashing subsection lands at an odd place.
Footnote [b] is written as if these were the only ways to get an unbalanced tree. This is false.
Changed it to clarify that this only applies to perfectly unbalanced binary search trees.
"Balanced binary search trees attempt to balance their own nodes, improving the time performance of searches, but at a cost to insertion and deletion time.": this is wrong. Compared to maintaining a sorted array, insertion and deletion in balanced trees is much faster (logarithmic vs linear). And compared to an unbalanced tree, insertion and deletion in balanced trees is again faster (logarithmic vs whatever the height of the tree is). Also, the words of Yoda are appropriate here: "Do. Or do not. There is no try."
Fixed, and latter advice is against quantum mechanics do.
"a subset of the operations available to sorted arrays": Really? Which operation is not available?
Changed to just "the operations available to sorted arrays".
For exponential search, "element whose index is a power of two that exceeds the target value" is ambiguous. Does the index exceed the target, or does the element?
For interpolation search, in the sentence "It is most successful as a precursor to binary search, eliminating elements while the search space is large and switching to binary search afterwards.", what is the subject? The obvious guess is that it is the interpolation search algorithm, but this doesn't make sense, because that algorithm doesn't switch anything.
Removed altogether.
"attempts to calculate the position of the target value": Yoda again. What it does is to estimate the position.
"search space of the array": is search space some strange new synonym for length?
Changed to "length".
"time difference between interpolation and binary search does not compensate": circular reasoning, since the time difference is what we're trying to reason about. Maybe you mean the smaller number of comparisons made by interpolation search?
Clarified, and I meant the smaller number of iterations.
"It is most successful when": the most recently used noun was "small arrays". Presumably they are not the intended subject of this sentence. So use a noun, not a pronoun.
The sentence was removed altogether (see above).
"Equality test at end of program": This seems redundant with the paragraph about the same thing in the performance section. And it has nothing to do with whether you return the position of the first or last match, which can be done in either variation.
The "not of length 2^n − 1 for any natural number n" wording is still a little awkward.
Implementation issues
"When Jon Bentley assigned it": assigned what? Swapping the pronoun here for the noun in the next clause would be less confusing.
Yes check.svg Done.
You seem to be missing another more subtle overflow bug. When the initial R is the max representable integer, and the target value is larger than anything in the array, then the pseudocode given will eventually reach a state where L=R=maxint, compare the target to the value at that index, and try to set L to maxint+1. This will either cause an overflow exception or set L to something very small and keep going; in either case that's not the desired behavior.
Library support
See leftovers, above.
The Knuth references are all missing page numbers, important since some are clearly to subsets of the specified sections (unless the same section can have multiple titles).
[1], Williams: Formatted as a journal, but actually a conference, and it would be helpful to spell out the conference name.
[18], Beame&Fich: Link Journal of Computer and System Sciences
[25], Kiefer: Link Jack Kiefer (statistician)
[35], Lehmer: Another conference paper formatted as a journal paper.
[36], Bottenbruch: Should list the whole page range of the published paper (161–221), with a note at the end of the reference for the page number actually used as a source (Section 43, "Program for Binary Search", p. 214). Also add the doi, 10.1145/321119.321120.
[40], Pattis: Link Richard E. Pattis
[41], Bloch: Link Joshua Bloch
[42], Ruggieri: Link Information Processing Letters
All done.
The previous summary is still accurate: "Getting better, good enough to avoid a quick fail for now, but still quite a ways to go."
@David Eppstein: I (or at least believe that I) have responded (or acted in response) to all the comments above. Esquivalience t 03:43, 1 May 2016 (UTC)

Fourth reading[edit]

Quite a lot closer this time. I only found the following issues:

  • "Each iteration of the binary search algorithm defined above makes 1.5 comparisons": No, either 1 or 2. But for targets that are equally likely to match any of the keys, the average is 1.5/iteration.
  • Clarified.
  • "binary search trees will most likely be imbalanced, resulting in slightly worse performance than binary search: I think this overstates the issue. "Imperfectly balanced", maybe?
  • Imperfectly balanced sounds just right; replaced.
  • "Binary search is faster than linear search for sorted arrays": unless they are very short.
  • Mentioned such, and added footnote with more details on this.
  • "A variation of the algorithm forgoes the lower and upper pointers or variables, instead storing the index of the middle element and the width of the search; i.e., the number of elements around the middle element that have not been eliminated yet.": This sentence is too long.
  • Shortened sentence.
  • Footnote 9: Harv error: link from #CITEREFGoldmanGoldman2007 doesn't point to any citation.
  • The right year was 2008; fixed.

David Eppstein (talk) 05:24, 9 May 2016 (UTC)

@David Eppstein:: Fixed above issues and added some more detail. Esquivalience t 01:26, 10 May 2016 (UTC)
I made some small copyedits today, after which I can't find anything to complain about. The prose is uniformly clear, the section structure makes sense, and obeys the section structuring criteria of MOS (WP:GACR #1). Everything is appropriately sourced, without original research or copyvios (#2). After all this revision, the article now adequately addresses all the main aspects of the topic, and doesn't go into excessive detail about any aspect (#3). Its evaluative components (e.g. comparisons with other data structures) are neutral, accurate, and based on reliable sources (#4). Although it has changed significantly since the start of the review, the disputes that were apparent at that time seem to have settled down, and there are no major disagreements still evident, so I think it is stable enough (#5). The article is illustrated appropriately (if a little minimally still), the images are appropriately licensed, and the image captions are accurate and informative (#6). I will go ahead and pass this now. —David Eppstein (talk) 22:51, 11 May 2016 (UTC)


Article is much better! Still, I think that too much space is spent on things that aren't binary search (like tree balancing). Perhaps the real problem is that we're missing an article that is more general than binary search algorithm but more specific than search algorithm (which is extremely broad). Not sure what the right term is... I'll have to look at Knuth next time I have it nearby. --Macrakis (talk) 01:03, 8 April 2016 (UTC)

Since the section is titled "Improvements" I might as well just post some here. Many of the other sorting/searching algorithm pages have pseudocodes which I personally find extremely helpful to readers, but this article doesn't have one. I understand the complexities involved with binary search, but adding a pseudocode for binary search would be a huge improvement. Zamaster4536 (talk) 12:53, 17 April 2016 (UTC)

Bloom filters etc.[edit]

Recently, a paragraph was added about Bloom filters. Certainly an interesting and important algorithm. But in what way is it more important or significant than the other algorithms covered in the "Other Data Structures" section? In particular, it is an approximate algorithm, and it is only useful for set membership, not for attaching values to keys, so it really doesn't compete directly with binary search, hashing, etc. The simplest set membership algorithm, the bit array, is also—appropriately—in that section.

We shouldn't try to cover all possible lookup algorithms in the "Binary search versus other schemes" section, and certainly not in detail. What we should do is compare the domains of applicability and the time and space efficiency, and leave the implementation mechanism to the individual articles. --Macrakis (talk) 16:09, 2 August 2016 (UTC)

What I would like to see retained is the clear statement that for problems involving equality testing only (that is, both set membership and dictionary lookups), binary search is usually a bad choice, because other schemes (hash tables, Bloom filters, etc) are clearly better. The details of those other schemes are important here only to set the context, but the comparison of them against binary search is crucial. Otherwise, in my experience, too many students jump from the fact that you can use binary search for equality-based problems to the incorrect conclusion that you should use it. —David Eppstein (talk) 00:19, 5 August 2016 (UTC)
I certainly agree that we should be clear about the domain of applicability of different lookup algorithms, and their relative advantages and disadvantages. I don't think that that requires that we describe each of these other algorithms in detail in this article. --Macrakis (talk) 21:42, 5 August 2016 (UTC)

Trivial implementation observations[edit]

Article currently reads:

The iterative version of binary search only requires three extra variables, taking constant extra space. When compiled with a compiler that does not support tail call elimination, the recursive version takes space proportional to the number of iterations as extra stack space is required to store the variables. If the array is passed by value instead of by reference to the binary search subroutine, the original array will have to be copied, costing {\textstyle O(n)} extra time and space. Since binary search does not modify the elements of the array, it is safe to pass the array by reference.

This is all true, but utterly trivial, as it is true of any iterative algorithm converted to a recursive algorithm, even linear search. I have deleted it.

I have also deleted other references to a recursive form. After all, any iterative algorithm can be written in recursive form, and the efficiency issues are always the same. --Macrakis (talk) 21:38, 6 August 2016 (UTC)

Deferred detection of equality[edit]

There used to be an interesting (imho) paragraph on deferred detection of equality which vanished through user Esquivalience edits.

Why was this removed entirely? Gpakosz (talk) 09:13, 14 September 2016 (UTC)

The second-to-last paragraph of the "performance" section still appears to cover this. —David Eppstein (talk) 16:12, 14 September 2016 (UTC)