Talk:K-nearest neighbors algorithm

From Wikipedia, the free encyclopedia
WikiProject Robotics (Rated Start-class, Mid-importance)

Comment

I will write a more complete article for this when the exam season has passed.


Are there not any weighted kNN algorithms that take (to some degree) the actual distances into account? If the 5 closest neighbors are 0.1, 4000, 10000, 20000 and 55570 units away, we might say the very close first match has more importance than the others.

  • For an example of a weighted kNN algorithm, see F. Nigsch et al., Journal of Chemical Information and Modeling, 46, 2412-2422 (2006), DOI: 10.1021/ci060149f

So what is the complexity of kNN? I'm guessing O(n^2) but I'm not 100% sure...

A kd-tree can be built in O(n log(n)) time, and queried in O(log(n)) time. A Ball-tree provides better guarantees and can be even faster than a kd-tree for finding neighbors. It's only O(n^2) if you're lazy and you code it up by comparing distances between every pair of points. Also, yes weighted k-NN is very common. If you weight by the inverse of the distance, it is called linear weighting because it is equivalent to using linear interpolation. This weighting is especially good for regression problems. For classification, it is more common to weight by the inverse of the square of the distance because it's faster to compute and tends to classify slightly better. This article also requires some significant discussion of feature-scaling, which can make k-NN significantly more accurate and more robust to irrelevant attributes. I like to use a simple hill-climber with cross-validation on the training set to adjust feature scaling factors. It improves accuracy by a significant amount. --128.187.80.2 (talk) 22:26, 9 December 2008 (UTC)
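The inverse-distance weighting described in the comment above can be sketched in a few lines of numpy. This is a hypothetical helper for illustration, not code from any cited source; weighting by 1/d gives the "linear" weighting mentioned, and 1/d^2 is the variant more common for classification.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, eps=1e-12):
    """Inverse-distance-weighted k-NN regression (illustrative sketch).

    Each of the k nearest neighbors is weighted by 1/distance, so a very
    close match dominates distant ones; eps guards against division by zero.
    """
    d = np.linalg.norm(X_train - x, axis=1)     # distances to all training points
    nn = np.argsort(d)[:k]                      # indices of the k nearest
    w = 1.0 / (d[nn] + eps)                     # inverse-distance weights
    return np.sum(w * y_train[nn]) / np.sum(w)  # weighted average of targets

# Mirroring the distances in the question above: the 0.1-unit neighbor dominates.
X = np.array([[0.1], [4000.0], [10000.0], [20000.0], [55570.0]])
y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
print(weighted_knn_predict(X, y, np.array([0.0]), k=5))  # very close to 1.0
```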
For a single classification result, it's n:
  • Step 1: calculate the distances: O(n)
  • Step 2: find the k smallest: also O(n)--see selection algorithm
  • Step 3: count them: O(k)
If there are multiple classification results, this can be made better (e.g. by first binning or otherwise indexing the training data--this is briefly mentioned in the article), but then we have to introduce another variable: m, say, for the number of classification results (I call them "test points")... Peteymills (talk) 15:17, 15 June 2010 (UTC)
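The three O(n) steps listed above can be sketched as follows (an illustrative helper, not from the article): numpy's argpartition is a linear-time selection, so the whole single-query classification stays O(n).

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Single-query k-NN classification in O(n), following the steps above."""
    d = np.linalg.norm(X_train - x, axis=1)  # Step 1: all distances, O(n)
    nn = np.argpartition(d, k - 1)[:k]       # Step 2: k smallest via selection, O(n)
    votes = Counter(y_train[nn])             # Step 3: count the k labels, O(k)
    return votes.most_common(1)[0][0]

# Two clusters of toy points; the query sits inside the 'b' cluster.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])
y = np.array(['a', 'a', 'b', 'b', 'b'])
print(knn_classify(X, y, np.array([4.9, 5.0]), k=3))  # → b
```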

Reply about kNN complexity

The complexity of kNN is O(n^2), I think (it's a guess). The complexity of NNS (Nearest Neighbor Search) is O(n log n) on a kd-tree, for finding one item.

I'm studying kNN on multiple nodes, so I made a variant of kNN that I call Hk-NN. It reduces processing time effectively. I need to share and discuss it in order to find a better Hk-NN method. I don't know the cost of kNN, and everyone who claims to know the cost has done nothing but guess.

If you know kNN's processing cost accurately, please send me a message by mail.

My e-mail: hanjjon@gmail.com

Maybe this draft version of an information retrieval book will help you. Chapter 14 discusses kNN. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html Hope this helps. --GrandiJoos 07:59, 1 October 2007 (UTC)
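The kd-tree approach discussed in this thread can be sketched in miniature (an illustrative toy, not from the book linked above): build by recursively splitting on the median along alternating axes, then answer a single nearest-neighbor query with the standard backtracking search, which is O(log n) on average. The build below sorts at each level for brevity; a linear-time median selection gives the textbook O(n log n) build.

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursive kd-tree build: median split on cycling axes."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]  # sort along the split axis
    mid = len(points) // 2
    return {
        'point': points[mid],
        'axis': axis,
        'left': build_kdtree(points[:mid], depth + 1),
        'right': build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, x, best=None):
    """Single nearest-neighbor query with pruning; O(log n) on average."""
    if node is None:
        return best
    d = np.linalg.norm(node['point'] - x)
    if best is None or d < best[0]:
        best = (d, node['point'])
    diff = x[node['axis']] - node['point'][node['axis']]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    best = nearest(near, x, best)      # search the half-space containing x first
    if abs(diff) < best[0]:            # other half-space may still hold a closer point
        best = nearest(far, x, best)
    return best

pts = np.array([[2.0, 3.0], [5.0, 4.0], [9.0, 6.0], [4.0, 7.0], [8.0, 1.0], [7.0, 2.0]])
tree = build_kdtree(pts)
dist, pt = nearest(tree, np.array([9.0, 2.0]))
print(pt, dist)  # nearest point is [8. 1.]
```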

Kriging?

How did any of this Kriging stuff get in here (history)? —Preceding unsigned comment added by 66.92.232.105 (talk) 23:15, 16 March 2008 (UTC)

I think the Kriging stuff was given too much prominence. Is there a consensus on its relevance and whether it would be better further down the article?

Jbom1 (talk) 16:45, 17 March 2008 (UTC)

Listen - Kriging has nothing to do with k-nearest neighbor, and shouldn't be in this article. kNN is a classifier: it takes a set of labeled multidimensional points and a user choice of k, then classifies a new point by the majority label among its k nearest neighbors. Kriging is a way to take a set of multidimensional points and, for a new point with one dimension unknown, interpolate to estimate that unknown. Simply put, kNN does not rely on Kriging, and no relationship is apparent from the paragraph on Kriging in this article. —Preceding unsigned comment added by 158.130.14.30 (talk) 16:58, 17 March 2008 (UTC)

Split this topic in two?

I agree with the previous comment. kNN is a statistical clustering algorithm which uses density linkage. Maybe the two subjects should be treated separately? 196.36.217.137 (talk) 17:23, 3 December 2008 (UTC)

This article is all about instance learning, which is perhaps the most common reason to find the k-nearest neighbors, but there are many other uses as well. For example, the k-nearest neighbors are often used to establish neighborhood metrics for manifold learning. I propose that we make another article called "knn_instance_learner" and move all the stuff about regression and classification into it. This article should be about algorithms for finding the k-nearest neighbors (like kd-trees, Ball-trees, etc), no matter what the reason is for doing so. --128.187.80.2 (talk) 22:31, 9 December 2008 (UTC)
No, keep it here IMO. The other article you have in mind is Nearest neighbor search --mcld (talk) 20:42, 20 October 2009 (UTC)

"Furthest" Neighbors?

I removed the reference to "furthest" neighbours from the overview. kNN regression is done on the basis of a (possibly weighted) average of the properties of the nearest neighbours. See the Nigsch et al. article cited earlier in the discussion.

neighbor vs neighbour

I suggest paying attention to which spelling the authors use. Both British and American spellings are currently used in the article.

Wassermann7 (talk) 21:48, 19 July 2010 (UTC)

Schematic image is not clear

To me this looks more like a Parzen window. To make the representation clear, the circles should pass exactly through the center of the k-th sample. Maybe the black circle boundary could also be lightened a bit towards gray. Also, some links from the test sample to the training samples could be drawn, together with counting numbers, to make clear that the radius of the neighbourhood depends only on the distance to the closest training samples. 193.52.240.233 (talk) 02:29, 10 January 2011 (UTC)

Optimal k = 10

I removed the line about the optimal k being 10. Following the paper trail, the best that can be said is that when analyzing some kinds of forestry data, a k between 10 and 15 has been found to work well, nothing more. Helpre (talk) 17:37, 1 April 2011 (UTC)

Plural neighbor

The good faith move to k-nearest neighbors algorithm has been reverted, because both scholarly and non-scholarly searches show that the singular name for the algorithm is a bit more well-used in the vernacular. Remember, the issue here is the name of the algorithm, not how many neighbors are computed. If anyone is in disagreement with this revert, then please don't hesitate to voice your opinion. It is, of course, okay with me if anyone would like to open a move request discussion. – Paine Ellsworth CLIMAX! 20:03, 7 September 2013 (UTC)

For regression

under the section "For regression" it says:

"Find a heuristically optimal number k of nearest neighbors, based on RMSE. This is done using cross validation."

Is there some reference/further details to this hinted at method? I am searching now without much luck, seems to me like it should be referenced.

37.11.59.68 (talk) 20:55, 13 December 2013 (UTC)
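The procedure quoted above can be sketched without any particular reference: for each candidate k, estimate the RMSE of unweighted kNN regression by leave-one-out cross-validation, and keep the k with the smallest error. This is an illustrative sketch on synthetic data, not the article's method verbatim.

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    """Plain (unweighted) k-NN regression: mean of the k nearest targets."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argpartition(d, k - 1)[:k]].mean()

def loo_rmse(X, y, k):
    """Leave-one-out cross-validation RMSE of kNN regression for a given k."""
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i            # hold out point i
        errs.append(knn_regress(X[mask], y[mask], X[i], k) - y[i])
    return np.sqrt(np.mean(np.square(errs)))

# Synthetic 1-D regression problem: noisy sine.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=80)

ks = range(1, 16)
best_k = min(ks, key=lambda k: loo_rmse(X, y, k))  # heuristically optimal k
print(best_k)
```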

Non-parametric? Really?

Maybe k-NN is somehow related to non-parametric statistics, but for sure it's not a non-parametric method. The first sentence is at least misleading, and at best hilarious since the algorithm has its parameter explicitly in its name... :) BartoszKP (talk) 14:40, 1 April 2014 (UTC)