Interpolation search: Difference between revisions

Content deleted Content added

Inline

Revision as of 22:47, 17 May 2010

Interpolation search (sometimes referred to as extrapolation search) is an algorithm for searching for a given key value in an indexed array that has been ordered by the values of the key. It parallels how humans search through a telephone book for a particular name, the key value by which the book's entries are ordered. In each search step it calculates where in the remaining search space the sought item might be based on the key values at the bounds of the search space and the value of the sought key, usually via a linear interpolation. The key value actually found at this estimated position is then compared to the key value being sought. If it is not equal, then depending on the comparison, the remaining search space is reduced to the part before or after the estimated position. Only if calculations on the size of differences between key values are sensible will this method work.

By comparison, the binary search chooses always the middle of the remaining search space, discarding one half or the other, again depending on the comparison between the key value found at the estimated position and the key value sought. The remaining search space is reduced to the part before or after the estimated position. The linear search uses equality only as it compares elements one-by-one from the start, ignoring any sorting.

On average the interpolation search makes about log(log(n)) comparisons (if the elements are uniformly distributed), where n is the number of elements to be searched. In the worst case (for instance where the numerical values of the keys increase exponentially) it can make up to O(n) comparisons.

In interpolation-sequential search, interpolation is used to find an item near the one being searched for, then linear search is used to find the exact item.

Performance

Using big-O notation, the performance of the interpolation algorithm can be shown to be O(log log N).^[1]

Practical performance of interpolation search depends on whether the reduced number of probes is outweighed by the more complicated calculations needed for each probe. It can be useful for locating a record in a large sorted file on disk, where each probe involves a disk seek and is much slower than the interpolation arithmetic.

Index structures like B-trees also reduce the number of disk accesses, and are more often used to index on-disk data in part because they can index many types of data and can be updated online. Still, interpolation search may be useful when one is forced to search certain sorted but unindexed on-disk datasets.

Adaptation to different datasets

When sort keys for a dataset are uniformly distributed numbers, linear interpolation is straightforward to implement and will find an index very near the sought value.

On the other hand, for a phone book sorted by name, the straightforward approach to interpolation search doesn't apply. The same high-level principles can still apply, though: one can estimate a name's position in the phone book using the relative frequencies of letters in names and use that as a probe location.

Some interpolation search implementations may not work as expected when a run of equal key values exists. The simplest implementation of interpolation search won't necessarily select the first (or last) element of such a run. The interpolation calculation also must be written to avoid division by zero.

Book-based searching

The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and calling them name #1, name #2, etc.) and further, it is well-known that some names are much more common than others (Smith, Jones,) Similarly with dictionaries, where there are many more words starting with some letters than others. Some publishers go to the effort of preparing marginal annotations or even cutting into the side of the pages to show markers for each letter so that at a glance a segmented interpolation can be performed.

Sample implementation

The following code example for the Java programming language is a simple implementation. At each stage it computes a probe position then as with the binary search, moves either the upper or lower bound in to define a smaller interval containing the sought value. Unlike the binary search which guarantees a halving of the interval's size with each stage, a misled interpolation may reduce/increase the mid index by only one, thus resulting in a worst-case efficiency of O(n).

 public int interpolationSearch(int[] sortedArray, int toFind){
  // Returns index of toFind in sortedArray, or -1 if not found
  int low = 0;
  int high = sortedArray.length - 1;
  int mid;
 
  while (sortedArray[low] <= toFind && sortedArray[high] >= toFind) {
   mid = low + ((toFind - sortedArray[low]) * (high - low)) / (sortedArray[high] - sortedArray[low]);
 
   if (sortedArray[mid] < toFind)
    low = mid + 1;
   else if (sortedArray[mid] > toFind)      // Repetition of the comparison code is forced by syntax limitations.
    high = mid - 1;
   else
    return mid;
  }
 
  if (sortedArray[low] == toFind)
   return low;
  else
   return -1; // Not found
 }

Notice that having probed the list at index mid, for reasons of loop control administration, this code sets either high or low to be not mid but an adjacent index, which location is then probed during the next iteration. Since an adjacent entry's value will not be much different the interpolation calculation is not much improved by this one step adjustment, at the cost of an additional reference to distant memory such as disc.

Each iteration of the above code requires between five and six comparisons (the extra is due to the repetitions needed to distinguish the three states of < > and = via binary comparisons in the absence of a three-way comparison) plus some messy arithmetic, while the binary search algorithm can be written with one comparison per iteration and uses only trivial integer arithmetic. It would thereby search an array of a million elements with no more than twenty comparisons (involving accesses to slow memory where the array elements are stored); to beat that the interpolation search as written above would be allowed no more than three iterations.

Further analysis

Consider the following algol-style pseudocode:

  Given a collection of data symbolised by an array y(1:n), in sorted order, and a value V, find y(i) = V.
if n <= 0 then return(0);              |Evade vapid invocations.
if (dyl:=y(1) - V) > 0 then return(0)  |V Precedes the first element?
 else if dyl = 0 then return(1);       |V Matches it?
if (dyr:=y(n) - V) < 0 then return(-n) |V Follows the last element?
 else if dyr = 0 then return(n);       |V Matches it?
L:=1; R:=n;                            |Exclusive bounds. The search span is elements L + 1 to R - 1.
while (p:=R - L) > 1 do                |(R - L) = (number of elements + 1).
 p:=Round(L - dyl*p/(dyr - dyl));      |Linear interpolation to find x such that f(x) = V.
 if p >= R then p:=R - 1               |Element R has already been inspected.
  else if p <= L then p:=L + 1;        |Likewise L. Ensures p is within the span.
 case(sign(d:=y(p) - V))               |Compare the probed value against V, saving the difference in d.
 -1: L:=p, dyl:=d;                     |V > y(p), shift L up.
  0: return(p);                        |V = y(p), success!
 +1: R:=p, dyr:=d;                     |V < y(p), shift R down.
 otherwise;                            |No other cases.
wend;                                  |Continue searching with the reduced span.
return(-L);                            |If exhausted, the search failed.

This works on the differences between V and the y values as this difference will be required for inspection and the interpolation, and emphasises that a value from the y collection should be extracted only once as it might be available only from slow storage, such as a disc file. The initial stage checks whether V is within the span of values offered by y by explicit coding, rather than attempt to encompass this check by the workings of the main search loop as in the first example. A linear interpolation requires two values from the y array anyway.

Should V be within the span of y, then the initialisation of L and R to 1 and n before the interpolation search begins acknowledges the condition "if V is to be found, it will be amongst elements L + 1 to R - 1 of the array y", which is the style of the exclusive bounds version of the binary search. The while loop tests whether there are any such elements remaining, and with the differences dyl and dyr in hand, a probe position can be calculated by linear interpolation, rather than taking the mid-point of the index span as with the binary search.

Although the probe position will be within the span L to R because y(L) < V < y(R) the requirement is that it select one of the elements L + 1 to R - 1, thus the check, after which follows the reference to the new element y(p) and the saving of the difference y(p) - V for the three-way test that will adjust the bounds of the search, or, discover a match. Each iteration reduces the search span, but unlike the binary search method, the reduction may be only by one element rather than half the span. Nevertheless, it is certain to finish, provided that the selection of the probe position is always correct. The risk here is that there may be equal values in the y collection, and if it should happen that y(L) = y(R) the interpolation calculation will fail ignominiously.

But this cannot happen. The initialisation ensures that y(L) < y(R) so the question becomes could it develop during the search? Evidently, at a new position, y(p) might be a part of a sequence of equal elements. If y(p) = V then the search ends without further ado. Otherwise, supposing that y(p) < V then L will take the value of p so the lower bound is now in the clump of equal values. The new interpolation selects an element (possibly L + 1) and if y(p) > y(L) then there is no further difficulty. It cannot be less than y(L) but it may be equal. If so, it will also be less than V and so the lower bound will be shifted up again: the other bound's value will remain different. The corresponding argument mutatis mutandis applies for the other bound if it were the case that y(p) > V earlier.

Each iteration makes one reference to the y storage collection, but of course shuffles values about and engages in arithmetic much more complex than that needed for an iteration of the binary search. The time required to access elements of y may be so great as to overwhelm all other considerations. Performance assessment is difficult, because the behaviour depends not just on the number of elements to be searched, but also on their numerical value - their spacing and clumping. Comparisons between the two methods could be performed by assessing the running time of actual tests on a computer, but there then arise further questions as to the quality of the code generated by the compiler, etc.

Example runs

Uniform distribution

The example uses a test set of a dozen two-digit numbers (actually being the first twenty-four digits of pi), searching for the value 66 which is not present in that set.

The black dots (with jagged line) show the twelve sorted values of y, and the horizontal line marks the sought value V. The successive colours red, magenta, green (yellow is too faint) and blue (or aqua) depict the interpolation line for the successive iterations, plus a vertical line from the intercept with the V line to the new probe position with a horizontal part for any rounding to an index position. Should however that position not be within the required span it is adjusted, as marked by the coloured dots. Depending on the sign of the comparison, either the upper or lower bound of the search span is replaced by the new position and the next interpolation will be performed on the reduced span.

In this example, although the first interpolation does very well, subsequent probes are all crammed against the left end of the working span while the upper bound of the span is not shifted. A straight line applied to a curve will be consistently wrong within the span of its adjacent inflexion points.

Successive Iterations, uniform distribution of y-values, 100 points.

With a larger number of values (more digits of pi; there are some duplicate pairs but still some pairs in 0-99 are missing) the deviations of the cumulative distribution from the diagonal line are much smaller and there are many more "crossovers". The linear interpolation search does much better, adjusting both bounds early on, and for this example, required just three iterations to search amongst a hundred points, whereas with twelve values to search four iterations were required.

Non-uniform distributions

Instead of a few local deviations of the cumulative line from the diagonal, the deviation can be large and for a large part of the span, indeed the whole span. This can be seen by replacing the y-values of the example by exp(x) for x = 1 to 12. Despite all the attendant arithmetic, the result is no better than the sequential search, for each iteration reduces the search space by one element only! The successive interpolation positions are 1.0043, 2.0036, 3.0025, 4.0006, each of which rounds to the left bound and so is nudged up one. With a greater number of values to search, the interpolated positions are no better, being pushed even closer to the left bound.

Successive Iterations, lop-sided distribution, 12 points.

Although that is a contrived extreme case, a more reasonable example is possible. Suppose for example a set of coordinates depicts the outline of a lake or an island, and it is desired to find the coordinate closest to a given position, or a line of latitude. The straightforward sequential search is discouraging with many points, so, the process would be started by preparing a sorted list of latitude values. Rather than employing an actual data set, for the purposes of an example, consider a circle (topologically equivalent to the coast of an island) with positions at one degree intervals and prepare a set of y values from 180*(1 + sin(d*2*pi/360)) for d = 1 to 360, and sort them.

Successive Iterations, non-uniform distribution, 360 points.

As can be seen, even though the positions are regularly spaced, their y-coordinates (or latitudes) are not, as whenever a sequence of points tends to run East-West (or horizontal) there will be many nearly-equal latitudes resulting. But despite the painfully large deviations between the interpolation lines and the curve, the search determined that 66 was not found in just four iterations, rather better than a binary search would have managed.

Although an equal element was not found (and is not to be expected for non-integer values of different provenance), the search identifies the index of the next lowest value in the y collection (here, element 101) so that the nearest positions can be scanned for locally.

Non-linear interpolation

Especially when confronted by a cumulative curve of a non-linear but apparently simple form, the regular errors of a linear interpolation prompt consideration of more complex methods. In special circumstances there may be a special function that could guide the search as in the idealistic "island" example where the shape is of course that of a sine wave and the arc sine function could be used, but actual coordinate collections will have no such smooth regularity.

For a general method the obvious escalation is to a quadratic curve, fitted to some selection of points, that hopefully will follow the curve nicely. If the whole curve were well-approximated by a quadratic curve, then that would be convenient, but aside from special cases it is unlikely that a quadratic curve will offer a good fit to anything beyond local segments of the whole span, and extrapolation outside the fit would be particularly risky. For the "island" example, a quadratic would fit well only to the first or second half, not at all well to the whole. A cubic might do well, in the form x = f(y).

In the case of the linear interpolation, two points are needed and they can obviously be the y values associated with the L and R bounds; the chosen points will be modified as the search procedure modifies these bounds. For a quadratic a third point is needed, plus additional administrative effort to keep the collection current as the search proceeds. One possible scheme is to start by taking the middle of the span, and if that y element doesn't match V, proceed with the three-point interpolation to give a probe position q. Based on the y(q) value, adjust the L and R bounds as before, but now there is additional work: the collection of points used for the interpolation must also be adjusted. If q is less than p then the point associated with the old R is discarded, otherwise the old L, while if they are equal, no change. Thus the next iteration will go forward with point p, q and either L or R or else retain the same points, but with different values for L or R. This is not of course the only method, but it will do to be going on with. A dotted line joins the first and last point of the three corresponding to the linear interpolation, but the quadratic interpolation is shown by the continuous curves.

Successive Iterations, Quadratic interpolation, 12 points.

Although the usual fitting exercise is in terms of y = f(x), the objective of the interpolation is to find some x-value corresponding to y = V so the fit is in terms of x = f(y); the curves should be viewed as being rotated ninety degrees from the usual. The auxiliary plot for the third loop of the quadratic interpolation search example shows the details.

If the usual orientation were employed (the red curve, y = p(x)) the parabola fits the points nicely but finding the value of x such that p(x) = V involves the classic solution of a quadratic equation, which requires a square root evaluation and produces two values, only one of which will be suitable if indeed there are any real solutions at all since there is no guarantee that the curve will intersect V as might be the case were V to have the value 60 instead. By interpolating with x = p(y) (the green line) finding the required x value is simply a matter of evaluating p(V), and given that the method works not with the y values directly but y - V, the evaluation needed is of p(0) which is particularly easy.

The green line is not at all close to the segmented straight line joining the points, so interpolations based on it will not be particularly good, though at other stages the interpolation involved a good fit. Even so, this search ended with three iterations as compared to the four of the linear interpolation, not that this is guaranteed.

Other rules might be used for the collection of interpolating points, such as maintaining a list of the probe positions in order- that is, starting with positions L, R, p, move on to R p q1 then p q1 q2 and so on. It would also be possible to vary the order of the interpolation polynomial - since one value only, p(0) is required, Aitken's method would be appropriate. Yet another approach might be to have successive iterations alternate between the mid-point of the span and an interpolated position. In all cases, it will be necessary to forestall division by zero (if any two y-values were equal) or confusedly taking the same point more than once.

External links

^ Weiss, Mark Allen (2006). Data structures and problem solving using Java, Pearson Addison Wesley

[1] Weiss, Mark Allen (2006). Data structures and problem solving using Java, Pearson Addison Wesley

[1]

@@ Line 63: / Line 63: @@
 ==Further analysis==
+Consider the following algol-style pseudocode:
-The above code is clearly derived from the binary search method and unfortunately the "inclusive bounds" version, despite being almost universally found in references, text books, and example code collections, is a poor version of the method. Much better can be done working from the "exclusive bounds" version. Consider the following algol-style pseudocode:
    Given a collection of data symbolised by an array y(1:n), in sorted order, and a value V, find y(i) = V.
@@ Line 98: / Line 98: @@
 The example uses a test set of a dozen two-digit numbers (actually being the first twenty-four digits of pi), searching for the value 66 which is not present in that set.
 [[Image:InterpolationSearch12.png|thumb|right|300px|Successive Iterations, uniform distribution of y-values, 12 points.]]
-The black dots (with jagged line) show the twelve values of ''y'' (in sorted order, of course!) and the horizontal line marks the sought value ''V''. The successive colours red, magenta, green (yellow is too faint) and blue (or aqua) depict the interpolation line for the successive iterations, plus a vertical line from the intercept with the ''V'' line to the new probe position with a horizontal part for any rounding to an index position. Should however that position not be within the required span it is adjusted, as marked by the coloured dots. Depending on the sign of the comparison, either the upper or lower bound of the search span is replaced by the new position and the next interpolation will be performed on the reduced span.
+The black dots (with jagged line) show the twelve sorted values of ''y'', and the horizontal line marks the sought value ''V''. The successive colours red, magenta, green (yellow is too faint) and blue (or aqua) depict the interpolation line for the successive iterations, plus a vertical line from the intercept with the ''V'' line to the new probe position with a horizontal part for any rounding to an index position. Should however that position not be within the required span it is adjusted, as marked by the coloured dots. Depending on the sign of the comparison, either the upper or lower bound of the search span is replaced by the new position and the next interpolation will be performed on the reduced span.
-In this example, although the first interpolation does very well, subsequent probes are all crammed against the left end of the working span while the upper bound of the span is not shifted. Unfortunately, a straight line applied to a curve will be consistently wrong within the span of its adjacent inflexion points.
+In this example, although the first interpolation does very well, subsequent probes are all crammed against the left end of the working span while the upper bound of the span is not shifted. A straight line applied to a curve will be consistently wrong within the span of its adjacent inflexion points.
 [[Image:InterpolationSearch100.png|thumb|right|300px|Successive Iterations, uniform distribution of y-values, 100 points.]]
@@ Line 113: / Line 113: @@
 As can be seen, even though the positions are regularly spaced, their y-coordinates (or latitudes) are not, as whenever a sequence of points tends to run East-West (or horizontal) there will be many nearly-equal latitudes resulting. But despite the painfully large deviations between the interpolation lines and the curve, the search determined that 66 was not found in just four iterations, rather better than a binary search would have managed.
-Although an equal element was not found (and is not to be expected for non-integer values of different provenance), the search identifies the index of the next lowest value in the ''y'' collection (here, element 101) so that the nearest positions can be scanned for locally. This is why the routine returns ''-L'' to indicate "Not found" rather than a constant such as zero which provides no hint as to locality.
+Although an equal element was not found (and is not to be expected for non-integer values of different provenance), the search identifies the index of the next lowest value in the ''y'' collection (here, element 101) so that the nearest positions can be scanned for locally.
 ===Non-linear interpolation===
 Especially when confronted by a cumulative curve of a non-linear but apparently simple form, the regular errors of a linear interpolation prompt consideration of more complex methods. In special circumstances there may be a special function that could guide the search as in the idealistic "island" example where the shape is of course that of a sine wave and the ''arc sine'' function could be used, but actual coordinate collections will have no such smooth regularity.
-For a general method the obvious escalation is to a quadratic curve, fitted to some selection of points, that hopefully will follow the curve nicely. Ah, but what selection of points? And as the search progresses, the selection should be changed in some judicious way, but, what way? If the whole curve were well-approximated by a quadratic curve, then that would be convenient, but aside from special cases it is unlikely that a quadratic curve will offer a good fit to anything beyond local segments of the whole span, and extrapolation outside the fit would be particularly risky. For the "island" example, a quadratic would fit well only to the first or second half, not at all well to the whole. A cubic might do well, in the form ''x = f(y)''.
+For a general method the obvious escalation is to a quadratic curve, fitted to some selection of points, that hopefully will follow the curve nicely. If the whole curve were well-approximated by a quadratic curve, then that would be convenient, but aside from special cases it is unlikely that a quadratic curve will offer a good fit to anything beyond local segments of the whole span, and extrapolation outside the fit would be particularly risky. For the "island" example, a quadratic would fit well only to the first or second half, not at all well to the whole. A cubic might do well, in the form ''x = f(y)''.
 In the case of the linear interpolation, two points are needed and they can obviously be the ''y'' values associated with the ''L'' and ''R'' bounds; the chosen points will be modified as the search procedure modifies these bounds. For a quadratic a third point is needed, plus additional administrative effort to keep the collection current as the search proceeds. One possible scheme is to start by taking the middle of the span, and if that ''y'' element doesn't match V, proceed with the three-point interpolation to give a probe position ''q''. Based on the ''y(q)'' value, adjust the ''L'' and ''R'' bounds as before, but now there is additional work: the collection of points used for the interpolation must also be adjusted. If ''q'' is less than ''p'' then the point associated with the old ''R'' is discarded, otherwise the old ''L'', while if they are equal, no change. Thus the next iteration will go forward with point ''p'', ''q'' and either ''L'' or ''R'' or else retain the same points, but with different values for ''L'' or ''R''. This is not of course the only method, but it will do to be going on with. A dotted line joins the first and last point of the three corresponding to the linear interpolation, but the quadratic interpolation is shown by the continuous curves.
 [[Image:InterpolationSearchQuadratic.png|thumb|right|300px|Successive Iterations, Quadratic interpolation, 12 points.]]
-The plot shows some rather surprising curves. Although the usual fitting exercise is in terms of ''y = f(x)'', the objective of the interpolation is to find some ''x''-value corresponding to ''y = V'' so the fit is in terms of ''x = f(y)''; the curves should be viewed as being rotated ninety degrees from the usual. The auxiliary plot for the third loop of the quadratic interpolation search example shows the details.
+Although the usual fitting exercise is in terms of ''y = f(x)'', the objective of the interpolation is to find some ''x''-value corresponding to ''y = V'' so the fit is in terms of ''x = f(y)''; the curves should be viewed as being rotated ninety degrees from the usual. The auxiliary plot for the third loop of the quadratic interpolation search example shows the details.
 [[Image:InterpolationSearchParabolas.png|thumb|right|300px|Two parabolas through three points.]]
 If the usual orientation were employed (the red curve, ''y = p(x)'') the parabola fits the points nicely but finding the value of ''x'' such that ''p(x) = V'' involves the classic solution of a quadratic equation, which requires a square root evaluation and produces two values, only one of which will be suitable if indeed there are any real solutions at all since there is no guarantee that the curve will intersect ''V'' as might be the case were ''V'' to have the value 60 instead. By interpolating with ''x = p(y)'' (the green line) finding the required ''x'' value is simply a matter of evaluating ''p(V)'', and given that the method works not with the ''y'' values directly but ''y - V'', the evaluation needed is of ''p(0)'' which is particularly easy.
-But alas, the green line is not at all close to the segmented straight line joining the points, so interpolations based on it will not be particularly good, though at other stages the interpolation involved a good fit. Even so, this search ended with three iterations as compared to the four of the linear interpolation, not that this is guaranteed.
+The green line is not at all close to the segmented straight line joining the points, so interpolations based on it will not be particularly good, though at other stages the interpolation involved a good fit. Even so, this search ended with three iterations as compared to the four of the linear interpolation, not that this is guaranteed.
-Other rules might be used for the collection of interpolating points, such as maintaining a list of the probe positions in order- that is, starting with positions ''L, R, p'', move on to ''R p q1'' then ''p q1 q2'' and so on. It would also be possible to vary the order of the interpolation polynomial - since one value only, ''p(0)'' is required, Aitken's method would be appropriate. Yet another approach might be to have successive iterations alternate between the mid-point of the span and an interpolated position. In all cases, it will be necessary to forestall division by zero (if any two ''y''-values were equal) or confusedly taking the same point more than once. Enough.
+Other rules might be used for the collection of interpolating points, such as maintaining a list of the probe positions in order- that is, starting with positions ''L, R, p'', move on to ''R p q1'' then ''p q1 q2'' and so on. It would also be possible to vary the order of the interpolation polynomial - since one value only, ''p(0)'' is required, Aitken's method would be appropriate. Yet another approach might be to have successive iterations alternate between the mid-point of the span and an interpolated position. In all cases, it will be necessary to forestall division by zero (if any two ''y''-values were equal) or confusedly taking the same point more than once.
 ==See also==