Talk:K-nearest neighbors algorithm
|↓||Skip to table of contents||↓|
|This is the talk page for discussing improvements to the K-nearest neighbors algorithm article.|
|WikiProject Robotics||(Rated Start-class, Mid-importance)|
I will write a more complete article for this when the exam season has passed.
Are there not any weighted kNN algorithms, that will take (to some degree) the actual distances into account? If the 5 closest neighbors are 0.1, 4000, 10000, 20000 and 55570 units away, we might say the very close first match has more importance than the others.
- For an example of a weighted kNN algorithm, see F. Nigsch et al., Journal of Chemical Information and Modeling, 46, 2412-2422 (2006), DOI: 10.1021/ci060149f
So what is the complexity of kNN? I'm guessing O(n^2) but I'm not 100 % sure...
- A kd-tree can be built in O(n log(n)) time, and queried in O(log(n)) time. A Ball-tree provides better guarantees and can be even faster than a kd-tree for finding neighbors. It's only O(n^2) if you're lazy and you code it up by comparing distances between every pair of points. Also, yes weighted k-NN is very common. If you weight by the inverse of the distance, it is called linear weighting because it is equivalent to using linear interpolation. This weighting is especially good for regression problems. For classification, it is more common to weight by the inverse of the square of the distance because it's faster to compute and tends to classify slightly better. This article also requires some significant discussion of feature-scaling, which can make k-NN significantly more accurate and more robust to irrelevant attributes. I like to use a simple hill-climber with cross-validation on the training set to adjust feature scaling factors. It improves accuracy by a significant amount. --188.8.131.52 (talk) 22:26, 9 December 2008 (UTC)
- For a single classification result, it's n:
- Step 1: calculate the distances: O(n)
- Step 2: find the k smallest: also O(n)--see selection algorithm
- Step 3: count them: O(k)
- If there are multiple classification results, this can be made better (e.g. by first binning or otherwise indexing the training data--this is briefly mentioned in the article), but then we have to introduce another variable: m, say, for the number of classifcation results (I call them "test points")... Peteymills (talk) 15:17, 15 June 2010 (UTC)
refly to other something.....
Comoplexly of kNN is O(n^2). I think so "It's guess" Complexly of NNS(Nearest Neighbor Search) is O(nlogn) on the KB-tree, finding one item.
I'm studing a kNN in multiple nodes. so I made Hk-NN is called by kNN. Processing time reduced effectively. I need to share and discussion in order to find out better Hk-NN method. I don't know cost of kNN, and anyone who are known cost did nothing but guessing.
If you known kNN's processing cost with accuracy, please send a message by the mail.
My e-mail: email@example.com
- Maybe this draft version of an information retrieval book will help you. Chapter 14 discusses kNN. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html Hope this helps. --GrandiJoos 07:59, 1 October 2007 (UTC)
I think the Kriging stuff was given too much prominence. Is there a consensus on its relevance and whether it would be better further down the article?
Listen - Kriging has nothing to do with K-nearest neighbor, and shouldn't be in this article. KNN is a classifier - takes a set of multidimensional points and a user choice of K, then splits the data into K different classes. Kriging is a way to take a set of multidimensional points, then for a new point with one dimension unknown, interpolate to generate that unknown. Simply, KNN does not rely on Kriging, and has no relationship that is apparent from the paragraph on Kriging in this article. —Preceding unsigned comment added by 184.108.40.206 (talk) 16:58, 17 March 2008 (UTC)
Split this topic in two?
I agree with previous comment. KNN is a statistical clustering algorith which uses density linkage. Maybe the two subjects should be treated separately? 220.127.116.11 (talk) 17:23, 3 December 2008 (UTC)
- This article is all about instance learning, which is perhaps the most common reason to find the k-nearest-neighbors, but there are many other uses as well. For example, the k-nearest-neighbors are often used to establishing neighborhood metrics for manifold learning. I propose that we make another article called "knn_instance_learner" and move all the stuff about regression and classification into it. This article should be about algorithms for finding the k-nearest neighbors (like kd-trees, Ball-trees, etc), no matter what the reason is for doing so. --18.104.22.168 (talk) 22:31, 9 December 2008 (UTC)
I removed reference to "furthest" neighbours from overview. kNN regression is done on the basis of a (possibly weighted) average of the properties of 'nearest neighbours. See the Nigsch et al. article cited earlier in the discussion.
neighbor vs neighbour
I suggest paying attention which spelling the authors use. There is now both British and American spelling used in the article.
Schematic image is not clear
For me this looks more like a Parzen window. I think to make the representation clear, the circles have go exact though the center of the kth sample. Maybe the black of the circle boundary can be even a bit reduced towards gray. Also some links from the test sample to the training samples could be done, together with some counting numbers, to make clear that the radius of the neighbourhood depends only upon the distance to the closest training samples. 22.214.171.124 (talk) 02:29, 10 January 2011 (UTC)
Optimal k = 10
I removed the line about the optimal k being 10. Following the paper trail, the best that can be said is that when analyzing some kinds of forestry data, a k between 10 and 15 has been found to work well, nothing more. Helpre (talk) 17:37, 1 April 2011 (UTC)
The good faith move to k-nearest neighbors algorithm has been reverted, because both scholarly and non-scholarly searches show that the singular name for the algorithm is a bit more well-used in the vernacular. Remember, the issue here is the name of the algorithm, not how many neighbors are computed. If anyone is in disagreement with this revert, then please don't hesitate to voice your opinion. It is, of course, okay with me if anyone would like to open a move request discussion. – Paine Ellsworth CLIMAX! 20:03, 7 September 2013 (UTC)
under the section "For regression" it says:
"Find a heuristically optimal number k of nearest neighbors, based on RMSE. This is done using cross validation."
Is there some reference/further details to this hinted at method? I am searching now without much luck, seems to me like it should be referenced.
Maybe k-NN is somehow related to non-parametric statistics, but for sure it's not a non-parametric method. The first sentence is at least misleading, and at best hilarious since the algorithm has its parameter explicitly in its name... :) BartoszKP (talk) 14:40, 1 April 2014 (UTC)