Jump to content

Talk:Selection algorithm: Difference between revisions

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia
Content deleted Content added
Line 50: Line 50:


:You're not being dense. Median of Medians is an incorrect algorithm. It does not produce the true median. Example:
:You're not being dense. Median of Medians is an incorrect algorithm. It does not produce the true median. Example:
median(3, 1, 4, 1, 5, 9, 2, 6, 5) = median(median(3, 1, 4), median(1, 5, 9), median(2, 6, 5)) = 5. But the actual median of the original set is 4. Therefore, though median of medians may provide an efficient method for approximating the median most of the time, it is far from mathematically accurate.
:median(3, 1, 4, 1, 5, 9, 2, 6, 5) = median(median(3, 1, 4), median(1, 5, 9), median(2, 6, 5)) = 5. But the actual median of the original set is 4. Therefore, though median of medians may provide an efficient method for approximating the median most of the time, it is far from mathematically accurate.


== Benefit of Non-strict Comparisons ==
== Benefit of Non-strict Comparisons ==

Revision as of 19:54, 7 February 2012

WikiProject iconComputer science Unassessed
WikiProject iconThis article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
???This article has not yet received a rating on Wikipedia's content assessment scale.
???This article has not yet received a rating on the project's importance scale.
Things you can help WikiProject Computer science with:

Select does not find the true median!

lols, median might b found in linear time, recursively, assuming we got the solution 4 a smaller group, 4 median, n use this result etc... :P 4 ur delite — Preceding unsigned comment added by 93.118.212.31 (talk) 12:34, 27 July 2011 (UTC)[reply]

Quoting the text:

Select is then called recursively on this sublist of n/5 elements to find their true median.

Since Select does not find the true median of the n elements, it does not find the true median of the n/5 elements in the general case. It would be more appropriate to call this value the "pivot" or "approximate median". Philippe.beaudoin (talk) 20:47, 5 April 2011 (UTC)[reply]

Why is a group size of 5 used?

I know it is not arbitrary chosen, but I cannot any article backing this up. If somebody would know why we can add it to the contents of this page. —Preceding unsigned comment added by 145.120.14.117 (talk) 21:21, 27 January 2011 (UTC)[reply]

  • 5 is just the smallest value that works. It's from the original paper. There ought to be a citation for the fact that any larger (fixed) value would also work. Dcoetzee 04:42, 28 July 2011 (UTC)[reply]

How to re-generate visualization table for proof of 40%/60%

import random

x = range(100)
random.shuffle(x)

tuples = []
while x:
 tuples.append(sorted([x.pop() for _ in range(5)]))

tuples = sorted(tuples, key=lambda tuple:tuple[2])

pivot = tuples[int(100/5 / 2)-1][2]

for i in range(5):
 print '! '
 def colorized(n):
  return 'bgcolor='+('gray' if n<pivot else 'red' if n==pivot else 'white')+'|'+str(n)
 print '| '+'||||'.join(colorized(tuple[i]) for tuple in tuples)
 print '|-'

—Preceding unsigned comment added by Ninjagecko (talkcontribs) 15:09, 3 December 2009

Summary?

If I understand it correctly the point of this article is that we can generalize an O(n) approach for finding (singular) min() and max() to finding k minimum or maximum values by keeping a small sorted side list of k items (perhaps in a b-tree or just using binary search and insertions). The idea is that as we go any values that don't fit inside the range of our current k-list are simply skipped, any that do are inserted (pushing one value off the end of the k-list). This would be O(n) * O(log k) time which, given he pre-condition that k is small, reduces to O(n) time overall. This would be particularly handy, for example, when the original list is too large to fit into memory ... one can still find k smallest or largest items in a single pass over the data (files, for example). (Even for really huge n and k that's also too large to fit in core, the upper and lower bounds of k would still fit and a random access method through the storage for k would still be relatively efficient).

The problem I'm having is understanding the part about median of medians. A diagram would be great. Is it that we're finding the median for a subset (say 5 items), then finding median for successive subsets (say 5 subsets), then finding the median of the medians, and so on? This sounds a little like the old RRDtool "compression" strategy (store data at one minute granularity for the most recent n-hours, then average those to hourly granularity, then to daily, and so on). Am I understanding that correctly? Is that mathematically a valid approach? Are the results of this process really the median for all possible data sets? There are no degenerate data set patterns that would significantly skew the finally results?

If what I'm saying here is true, how can we clarify the article? (Or am I just being dense)? JimD (talk) 20:19, 2 August 2009 (UTC)[reply]

You're not being dense. Median of Medians is an incorrect algorithm. It does not produce the true median. Example:
median(3, 1, 4, 1, 5, 9, 2, 6, 5) = median(median(3, 1, 4), median(1, 5, 9), median(2, 6, 5)) = 5. But the actual median of the original set is 4. Therefore, though median of medians may provide an efficient method for approximating the median most of the time, it is far from mathematically accurate.

Benefit of Non-strict Comparisons

"By using non-strict comparisons (<= and >=) instead, we would find the minimum element with maximum index; this approach also has some small performance benefits." What performance benefits, exactly?

On some machines it can cause less branch instructions, and so generate less icache flushes. The loop also has to have been unrolled or have other unconditional instructions in it. This effect is so small and machine-dependent however that I'll go ahead and remove this claim. Deco 02:08, 24 November 2005 (UTC)[reply]

quicksortOnlyFirstK not recursive?

Shouldn't the quicksortOnlyFirstK pseudocall call itsself instead of quicksort, or am I missing something obvious.--ISee 07:22, 26 April 2006 (UTC)[reply]

Fixed that after verifying that part of the algorithm--ISee 11:10, 26 April 2006 (UTC)[reply]

Minimum or maximum

Is it necessary to say "minimum or maximum" everywhere? It adds unnecessary wordiness, and it makes the times when it's not used that much more confusing. Here's an example:

Note that there may be multiple minimum or maximum elements. Because the comparisons above are strict, these algorithms finds the minimum element with minimum index. By using non-strict comparisons (≤ and ≥) instead, we would find the minimum element with maximum index.

Two of those minimums should be changed to minimum or maximum.

What would be even better, though, would be to just generalize the whole article to be "finding the maximum" and give a note that everything can be flipped for the reverse case. --71.71.238.231 08:07, 15 June 2006 (UTC)[reply]

I'm not sure what you're arguing. It already says "minimum or maximum" in only one place. The others are referring to something more subtle than that - the case where there are equal elements and we have to select which one is chosen. Deco 08:21, 15 June 2006 (UTC)[reply]
It already says "minimum or maximum" in only one place. Search again. That exact text appears four times. Which says nothing about forms like "minimum/maximum." --71.71.238.231 02:53, 17 June 2006 (UTC)[reply]
Oh, I was just referring to the portion you quoted above. You're right, it's a good idea to avoid the verbosity and just mention it once at the beginning. Deco 04:28, 17 June 2006 (UTC)[reply]


Why dont u simply sort the given list L and return the k-th element in L? -- Orz 04:25, 29 August 2006 (UTC)[reply]

Because that takes O(n log n) time, which is dramatically slower for large n. The article discusses this. Deco 05:37, 29 August 2006 (UTC)[reply]
It does indeed. Thanks for clearing that out, Decon

swap in partition function

is there an error in one line from the partition function? in particular, shouldn't the line:

swap a[pivotIndex] and a[right]

be replaced with the line:

swap a[pivotIndex] and a[left]

mwb

correction

or perhaps the swap line was correct, but a the end of partition you need to swap a[right] back to a position between the two subsegments...

mwb

Yes... that line was there. Someone just removed it and introduced several other errors. I've reverted them. Deco 23:21, 11 October 2006 (UTC)[reply]

Linear minimum/maximum algorithms

I don't understand why the index of the minimum/maximum element seen so far is stored in the described algorithm. In the example pseudocode, it is not used for anything. Ahy1 12:05, 29 December 2006 (UTC)[reply]

It's part of the result. Often you want not only the minimum/maximum value, but the index at which it occurred, for later processing. Deco 14:50, 29 December 2006 (UTC)[reply]

Remarks

A selection algorithm is not only for finding the kth smallest (or kth largest) number in a list. In Genetic algorithms they use other shemes for selection than the smallest value for example. Or in real-time/embedded systems there exists selection algorithms, that just take a msg and take an action according to the type of message. So it isn't all about finding the smallest values...

These are just different things with the same name. I'll clarify that the term use here is specialized. Dcoetzee 00:08, 5 May 2007 (UTC)[reply]

Introselect

There is a variation of selection that David Musser refers to. Introselect has an analogous relationship to quickselect that Intrasort has to quicksort - all designed to ensure both average and worse-case performance of O(n). But Musser in his paper does not say what algorithm would be swopped to if quickselect goes bad. I think this page should to refer to Introselect.

There is also another variation of finding the kth of n elements - where the elements remain undisturbed. Naturally the overhead is considerably higher and I dont know what the average big-O performance is like. [User:Stephen Howe|Stephen Howe] 21:50, 4 May 2007 (UTC)

Good idea, introselect ought to be mentioned. Dcoetzee 00:08, 5 May 2007 (UTC)[reply]

Error?

In the selection function, shouldn't we compare pivotNewIndex - left + 1 with k, which means selection of kth smallest number in list(left:right).

I don't think so, you have to think of it as shrinking the target area around k, which remains relative to the whole array.--Sam Brightman (talk) 16:14, 7 January 2009 (UTC)[reply]

primitive algorithm to select k elements - O(n) not O(kn)

Section : Repeated application of simple selection algorithms

O(n) to select k-th element. O(n) to compare all elements to it. —Preceding unsigned comment added by 132.72.98.166 (talk) 13:03, 12 November 2007 (UTC)[reply]

No, it's O(kn), because it can't just magically pick out the kth smallest element on the first try - it has to first identify each smaller element. Dcoetzee 13:26, 12 November 2007 (UTC)[reply]
Read the article you are commenting on, and it'll show you a great "magical" algorithm for choosing the k-th smallest element in linear time. The GP is correct, it requires only O(n) time to choose the k smallest elements. 142.20.202.196 (talk) 16:49, 25 January 2011 (UTC)[reply]

A C# implementation of the primative Algo shows an error. Would any compSci care to comment?

        public double BasicSelection(List<double> myList, int k)
        {
            for (int i = 0; i < k; i++)
            {
                int minIndex = i;
                double minValue = myList[i];
                for (int j = i+1; j < myList.Count; j++)
                {
                    if (myList[j] < minValue)
                    {
                        minIndex = j;
                        minValue = myList[j];
                    }
                    //now swap them around
                    double temp1 = myList[i];
                    double temp2 = myList[minIndex];
                    myList[i] = temp2;
                    myList[minIndex] = temp1;
                }
            }
            return myList[k];
        }

—Preceding unsigned comment added by 88.110.61.196 (talkcontribs) 12:01, 17 April 2008

Question regarding the nonlinear general selection algorithm

One of the advantages is claimed to be "after locating the jth smallest element, it requires only O(j + (k-j)2) time to find the kth smallest element, or only O(k) for kj".

I didn't understand this part. Suppose we've found the minimum (j=1), and we are going to find the second smallest element (k=2). Can we find it in O(1) time? —Preceding unsigned comment added by Roy hu (talkcontribs) 22:50, 22 January 2009 (UTC)[reply]

Advanced algorithms missing

See problem 13 (5.3.3) The Donald Knuth, The art of computer programming, Sorting and Searching.

Median can be found in 3/2 n + O(n^(2/3) log(n)) comparisons, due to R. W. Floyd. —Preceding unsigned comment added by Atulsvasu (talkcontribs) 07:19, 9 December 2009

Bug in linear general selection algorithm

The proof of O(n) running time for the linear general selection algorithm implicitly assumes all elements are different (with the given partition and select function)

Indeed, when running with a series of n equal elements with the given partition and select function, this algorithm runs in O(n**2). I guess the following fix would be ok : partition should put all elements equal to the pivot together, and return two indexes i and j such that list[i], list[i+1], ..., list[j] are equal to the pivot, and all elements before i are smaller (not just smaller or equal) than the pivot and all elements after j are greater (not just greater or equal). Then select has to check whether the k-th value is before i, after j or between them.

(I have no access to Musser's article to check what he says about this)

Judicael (talk) 09:46, 4 January 2010 (UTC)[reply]

Bug in quicksortOnlyFirstK

I am trying to implement this algorithm and find a problem with this line:

            quicksortFirstK(list, pivotNewIndex+1, right, k-pivotNewIndex)

I believe it should be

            quicksortFirstK(list, pivotNewIndex+1, right, k)

Reasoning: Our index which we are sorting to should never change. By doing k - pivotNewIndex we change that index.

I believe the intent of the author who put this algorithm in here was to change it so that the first k indexes AFTER left (which now equals pivotNewIndex+1 and so this k would then make sense) rather than the absolute k. In order for this assumption to be true you would have to change the if statement to if pivotNewIndex - left < k. This change could also fix the algorithm, but would be slower and less clear.

I have it implemented it in java and it will not work correctly until I set that and it works perfect (no excessive sorting above k) with it set how I have it fixed.

You are correct. The "k-pivotNewIndex" would be correct for a function that always took a modified "list" pointer and a "left" boundary being always zero. I've updated the main page. --bcwhite —Preceding unsigned comment added by Bcwhite (talkcontribs) 18:15, 29 June 2010 (UTC)[reply]

Agreed as well - though why has the main page changed back to the error? (23rd July 2010)

BFPRT Worst Case O(n) or Not?

In the beginning of the partition based section it is claimed that BFPRT introduced a worst case linear time algorithm, while the algorithm described is (correctly) explained to be O(n^2) worst case. Was there another algorithm (like the median of medians) in the paper or is the claim wrong? Needs a citation in any case... Otus (talk) 06:55, 19 July 2010 (UTC)[reply]

The writing is just confusing - somebody jumbled it up pretty bad since I originally wrote it. BFPRT (the same thing as the "median-of-median" algorithm) is indeed worst-case linear, as you can easily verify by reading the abstract of the paper cited. But the section "partition-based general selection algorithm" describes quickselect, the algorithm it is based on, which is expected-time linear and worst-case quadratic. Dcoetzee 10:00, 19 July 2010 (UTC)[reply]

smallest K items

      if pivotNewIndex < k // questionable
            findFirstK(list, pivotNewIndex+1, right, k)

This code is listed in the article. If pivotNewIdex is among the first K items, then we need to return the left partition ie list[left..pivotNewIndex-1] because these elements are definitely among the smallest K items. Then, we need to recur on the right partition but with a reduced value for k. If we initially wanted k=1000 items, now we just want 950 items if the left partition already provides the smallest 50 items.

The algorithm as is could crash if k exceeds the range ie K > (right - pivorNewindex) 208.123.162.2 (talk) 07:50, 2 October 2010 (UTC)[reply]