Talk:K-means clustering

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Confusion about computational complexity[edit]

The article on the one hand says that the problem is NP-hard, and on the other hand, it states the following: "If k and d (the dimension) are fixed, the problem can be exactly solved in time , where n is the number of entities to be clustered".

I assume that NP-hardness is w.r.t. to the number of entities to be clustered (i.e., n), but the latter statement implies polynomial time.

It would be great if an expert could clarify which statement is true. (I suspect the statement quoted above is false and that the paper from which it is taken actually claims something else...) Cervisiarius (talk) 17:47, 17 March 2016 (UTC)

I don't know if that statement is true. Complexity analysis of such problems is hard. What if you chose e.g. , which is a common practise to reduce data set size? P.S. new discussions should be added to the end. HelpUsStopSpam (talk) 20:47, 17 March 2016 (UTC)

Untitled[edit]

Suggestion: Make title to lowercase (I believe k-means is the consensus and not K-means) --Chrike 03:04, 19 January 2007 (UTC)

Some examples? --Abdull 16:32, 21 February 2006 (UTC)


== K-means+ ==

Why was the previous comment on the talk page criticizing k-means++ deleted? Anyway, it just seems to be another of the zillion approaches there have been at making a previous centroid initilization before the algorithm proper starts. Genetic Algorithms, Particle Swarm, Simulated Annealing... I would want to question the relevance of the section to the article. —Preceding unsigned comment added by 189.135.63.207 (talk) 17:27, 21 December 2008 (UTC)


I also agree with this. If people want there should be a separate section on seeding methods. The wiki page should not be used as an advertisement like this —Preceding unsigned comment added by 76.168.200.176 (talk) 07:28, 26 January 2009 (UTC)


I am David Arthur, one of the two authors of that paper. It was not me who wrote the k-means++ section here. I can, however, address the discuss comments here and in the past:
1. The paper was published at one of the main CS theory conferences, arguably the most prestigious among theory conferences that claim to be seriously interested in practical applications. It is not a seminal work that has been cited tens of thousands of times, but it is not obscure either.
2. The paper deals very specifically with k-means, unlike say simulated annealing, and thus makes more sense to reference on this page than simulated annealing. To my knowledge, the only other serious research that has been done on the question of seeding k-means was done at the same time and independently by Ostrovsky et al (see citation and discussion in my paper). The gist of what's going on is they analyze essentially the same algorithm, and show it has a good approximation ratio in certain cases. My paper shows the algorithm has a slightly less good approximation ratio in all cases and also includes some experiments. I do think it would be appropriate to cite that result as well, although it's much more complicated to state. I am unaware of other research results that make sense to reference here.
3. The paper proves its theoretical result with math, and does so in all cases, not only in some cases like the deleted comment accuses. The experiments on real-life data-sets are from the standard repository of real-life data sets that almost all academic papers use. I assure you they were not chosen in bad faith, and since sample code for k-means++ is available as reference in the paper, you are welcome to judge for yourself.
I am not a Wikipedia writer and of course I'm biased and I like having my paper referenced here, so I do not presume to make any decisions on notability. However, I did want to clear up these points.71.198.184.231 (talk) 20:15, 11 March 2009 (UTC)

Spherical clusters[edit]

I assume that the comment in the end of the article ("Also, the algorithm works well only when spherical clusters are naturally available in data.") is very much dependent on the distance metric used... --Andkaha(talk) 15:52, 20 July 2006 (UTC)

This isn't the k-means algorithm![edit]

Did anyone bother to lookup the original reference?

The algorithm, described by wikipedia, is known as the Lloyd-algorithm. It is described in "An Algorithm for Vector Quantizer Design" by Linde and Buzo and Gray, 1980.

The original k-means algorithm was described by MacQueen ("Some Methods for classification and Analysis of Multivariate Observations", 1967) as follows

Stated informally, the k-means procedure consitsts of simply starting with k groups each of which consists of a single random point, and thereafter adding each new point to the group whose mean the new point is nearest. After a point is added to a group, the mean of that group is adjusted in order to take account of the new point. Thus at each stage the k-means are, in fact, the means of the groups they represent (hence the term k-means).

Every data point is processed only once and only one cluster mean is recomputed everytime a new data point is added. This is hardly the same algorithm as the one described by wikipedia.

This is exactly what the Lloyd clustering does, except that we need to switch to Euclidean space (the algorithm famously works for nutrient grouping etc, but it is difficult to visualize): Centroid = mean; k = number of clusters; cluster = group; k-means is for k groups. The informal procedure you quoted is a different interpretation, although it does the exact same thing: In the mentioned method, you already have the points given. You don't have to add them one by one; instead you add the group centers one by one. This is exactly the same. You still associate each point with the nearest site (or the center to distinguish it from a point). Take k sites. Given n points, these k random sites are placed near the k of the n possible points (usually you select k points and make them the sites). For every point, figure out which of the k sites is the nearest; that point goes to that group. You will have at most k groups now (some groups may be left unassigned if no point lies near to the site). Now recalculate the mean, reposition the k sites; repeat lather rinse. I don't understand how the quoted verbatim differs from this; it describes this exactly. Please clarify. Also provide a link to the PDF or article where this MacQueen paper is, I couldn't find it using Scholar. User_talk:gfxguy

I'm afraid there's very little correlation between the example and the description, making the example all but useless for someone who's trying to understand what's going on. "The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic data." I assume that the value of k was chosen as 4. Where then are the 4 initial clusters?

"It then calculates the mean point, or centroid, of each set." Where is this done?

"It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters," Each frame should correspond with a step of the algorithm


Although MacQueen's original k-means algorithm may not be identical to Lloyd's method, the algorithm described in this article is identical to Lloyd's method. Moreover, both articles reflect the common understanding of the k-means algorithm, at least within the algorithms community. The first sentence of the the Arthur and Vassilviskii paper cited in this article is "The k-means method is a well known geometric clustering algorithm based on work by Lloyd in 1982." I've therefore proposed a merge. (Perhaps the merged article should include a clear explanation of MacQueen's algorithm.) — Jeff Erickson 04:41, 6 April 2007 (UTC)

I agree with the merge. Lloyd's algorithm is the most common algorithm for approaching the k-means problem. Sancho 08:51, 9 April 2007 (UTC)
If K-means is more general I don't see why they articles should be merged. There also seems to be a fair amount of active research on new approaches to K-means (from a survey of Google Scholar articles), so I think it would be a mistake to merge. Imran 05:07, 12 April 2007 (UTC)
Of course, k-means algorithm is used synonymously with Lloyd's algorithm by many. "The most prominent and widely used clustering algorithm is Lloyd's algorithms sometimes also referred to as the k-means algorithm." (From Frahling, G. and Sohler, C. 2006. A fast k-means implementation using coresets. In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry (Sedona, Arizona, USA, June 05 - 07, 2006). SCG '06. ACM Press, New York, NY, 135-143. DOI= http://doi.acm.org/10.1145/1137856.1137879). Sancho 08:55, 9 April 2007 (UTC)

I liked Sancho's comment. I found an article supporting other methods to solve the K-means problem and referenced it and also clarified the difference between K-means and Lloyd's as well as creating a reference to Lloyd's. Weston.pace 21:56, 12 June 2007 (UTC)

Image Replacement[edit]

Originally I started working on replacement images for the examples because they weren't licensed properly and I figured replacement was easier than tracking down the owner. However, they have sinced been properly licensed but I went ahead and replaced them because the new images don't show external software (which I beleive detracts from the quality of the image) and are in .svg format. However, I could be mistaken and do not wish to unfairly promote my own work, so if someone preferred the old images they can revert. -192.25.240.225 19:30, 26 July 2007 (UTC)

Why does the article put so much emphasis on the K-Means++ algorithm? If this algorithm really was so awesome as the text suggest then I would expect the "k-means++ The Advantages of Careful Seeding" article to be often quoted by scientific articles. It is not, and I think the reason is quite simple. It is an obscure article that "prove" its results using very specialized datasets that does not match what K-Means is used for out in the wild. I seriously doubt the results will hold if K-means++ is tried in more general situations. —Preceding unsigned comment added by 81.216.185.173 (talk) 10:32, 29 November 2008 (UTC)

Title[edit]

I propose changing the title to K-means clustering for two reasons:

  • It explains what it does (it clusters around k means)
  • There appear to be a number of algorithms, so referring to it as "K-means algorithm" is a misnomer.

comments? -3mta3 (talk) 09:11, 7 April 2009 (UTC)

I keep seeing a lot of references that state the problem is either NP-hard or NP-complete, but have yet to find a proof. The most common reference is:

  • Brucker, Peter (1978). "On the complexity of clustering problems". Optimization and operations research (Proc. Workshop, Univ. Bonn, Bonn, 1977) (Lecture Notes in Economics and Mathematical Systems 157 ed.). Berlin: Springer. ISBN 3-540-08842-3. MR 0511326. Zbl 0397.68044.  Text "pages 45–54
" ignored (help)

but I have been unable to find a copy. Is anyone else aware of any other sources? -3mta3 (talk) 15:31, 15 April 2009 (UTC)

thanks, i put the first one and the original reference (for anyone who can find it!) in the article. -3mta3 (talk) 08:36, 24 April 2009 (UTC)

Filtering algorithm[edit]

I am not sure how to go about adding this, but I think this article should reference Kanungo et al's kd-tree filtering algorithm (http://www.cs.umd.edu/~mount/Papers/pami02.pdf). It speeds a naive implementation up by a couple orders of magnitude. 71.198.184.231 (talk) 22:32, 23 April 2009 (UTC)

There is a reference to it (see K-means_clustering#Notes) however no description of what it does. Feel free to expand. -3mta3 (talk) 08:15, 24 April 2009 (UTC)
Ok, I expanded the variations section some -- I'm not super-sure my summary of the last one is good since only the abstract is online. I hadn't heard of it before. 71.198.184.231 (talk) 16:43, 24 April 2009 (UTC)

A nitpick[edit]

I think it is somewhat misleading to say using variance and Euclidean distance are drawbacks of k-means, and it's very misleading to say this is what makes k-means fast. These choices are what makes k-means *well defined*. What makes it *fast* is that it does not attempt to find the best clustering even according to its own definitions. Any clustering algorithm has to implicitly define what it means by a "good" clustering, and k-means's choice of Euclidean distance + variance is not obviously better or worse than other choices. So yes, if people use k-means assuming it has a human-like understanding of when two unknown objects are similar, they will be disappointed. But, no algorithm does this, and there is certainly no standard belief that an algorithm ever will be able to do this. Meanwhile, the description implies there is a clear better metric which k-means doesn't use for speed reasons - I do not think this is true and without real evidence, I do not see how the claim belongs on wikipedia in its current form.198.161.30.159 (talk) 06:29, 30 June 2009 (UTC)

I would add that as I understand it k-means clustering as a method is independent of any particular distance metric, e.g. manhattan distance would be a reasonable choice in some circumstances. --The locster (talk) 16:55, 2 September 2009 (UTC)

The demonstration is complete?[edit]

As far as I have understood, the last frame is not the result of the algorithm. I think that the result should be a part of the demonstration, too. Bubla (talk) 11:25, 24 August 2009 (UTC)

Also, the caption to the first image says "initial means are randomly selected from the data set" but the means from the first image are not present as data points in the subsequent images - implying they're not part of the data set. 78.86.200.205 (talk) 10:15, 31 January 2012 (UTC)
I think it would be better to assume the problem is in the subsequent images, not the description (see section titled "Error in demonstration images").

Choice of distance metric[edit]

As I understand it k-means clustering can be used with a variety of distanec metrics, it is common to use Euclidean distance but other metrics can be used, e.g. Manhattan. It would be nice to see an overview of some metrics that are typically used and what their relative pros and cons are.

Following on from distance metrics, some explanation of how to calculate centroids for non-euclidean distance metrics would be required, e.g. manhattan distance centroids are calculated using the median of each coordinate element rather than the mean. Such explanations probably belong elsewhere with a pointer from here but in the absence of explanations elsewhere this page is better than nowhere for now.--The locster (talk) 17:04, 2 September 2009 (UTC)

Empty clusters[edit]

It's my understanding that it is possible for k-means clustering to result in empty clusters after an iteration of the algorithm. Some discussion of this issue is warranted. E.g. this is potentially more of a problem in higher dimensional spaces and for certain distance metrics. Are there algorithm variants that avoid empty clusters altogether? And what are the strategies for dealign with empty clusters when they are encountered, e.g. split the largest cluster into two, or the cluster with the highest variance, etc.--The locster (talk) 17:09, 2 September 2009 (UTC)

Convergence guarantees?[edit]

Is this algorithm guaranteed to converge? 140.180.6.33 (talk) —Preceding undated comment added 07:20, 9 December 2009 (UTC).

Error in demonstration images[edit]

In the "Demonstration of the standard algorithm" images, the initial means are chosen from the data set. These data points disappear in later images. — Preceding unsigned comment added by 71.56.157.70 (talk) 16:47, 27 August 2012 (UTC)

Obviously, the initial means were generated for this example and are not data points. --Chire (talk) 21:02, 27 August 2012 (UTC)
I think I agree with Chire: i.e. the initial criticism of this diagram is mistaken. Nevertheless there is still a problem with the rightmost figure, labelled 4). This rightmost figure, as it stands currently, clearly cannot depict the situation at any stage during the iteration of the algorithm (nor after convergence). If it were a pic of the state of affiars after one of the "Update Steps", then the green circle would be the centroid of the green squares, the red circle the centroid of the red squares, etc -- just as is believably so of figure 3). If instead figure 4) were claimed to be a snapshot after an "Assignment Step" then the border line between any two of the shaded zones (or to be precise its extension to an infinitely long straight line) would be the perpendicular bisector of the straight line segment whose endpoints are the two correspondingly coloured circles, as appears believably to be true of part fig 2) of this diagram. However both conditions are patently false of the rightmost figure 4), even at a glance.
I propose that this sub-figure 4) should be removed, until a corrected version can be substituted. At present its only function is to raise doubt and puzzlement in the mind of someone who has in fact perfectly well understood the description of the standard algorithm. (The first three parts, labelled 1) to 3), of this figure can be retained however as without any change they are a most useful illustration of the algorithm explanation given in the preceding text.) — Preceding unsigned comment added by 83.217.170.175 (talk) 14:36, 6 June 2014 (UTC)

The reported time complexity is controversial[edit]

The reported time complexity of k-means is reported as O(n^(nk+1)logn). This is against other proofs on k-means linearity (see Introduction to Information Retrieval by Manning et al., http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html#p:kmeanscomplexity), and experimental evidence showing that k-means is faster than hierarchical agglomerative clustering (which is reported to be in the time complexity range of O(n^2) ... O(n^4)).

I suggest to remove the Complexity section and the corresponding references until someone with good understanding of k-means time complexity comes and fixes this.

Finally, a few words about the references in the Complexity section. Reference 14 is not a strong one (extended abstract at a symposium), and most importantly, the paper talks about something different than the original K-means algorithm. References 11, 12, 13 talk about special cases of K-means and do not reflect the "standard" K-means algorithm described in this article. Reference 15 is a dead link, and therefore there is no valid support for the following sentence:

Better bounds are proved for simple cases. For example,[15] showed that the running time of k-means algorithm is bounded by O(dn^4M^2) for n points in an integer lattice \{1,\dots, M\}^d.

Nucflash (talk) 11:41, 26 June 2013 (UTC)

From a quick look, the book you reference discusses the Lloyd algorithm for finding an approximative optimum. As the problem of finding the exact solution is proven to be NP-hard (it should be trivial to find references for this), O(n*k*i*m) obviously cannot be correct for the exact k-means solution.
I would not use a wide-coverage book like this as reference. These books tend to be very imprecise on what they are talking about. Many of them call it "k-means", cite MacQueen, but give Lloyds algorithm, and do not discuss that this is just a heuristic. And exactly this happened in this NLP book. The given algorithm is the modern formulation of Lloyds algorithm (did you have a look at his article?); who not even used the term k-means. And it's just the most popular heuristic for finding a local optimum. However, the complexity section clearly states: " the k-means clustering problem", NOT "the lloyd algorithm". Btw, the same thing applies to hierarchical clustering. The naive matrix way can solve arbitrary cluster similarity measures in O(n^3); unless you use a really expensive cluster similarity measure. For special cases such as single-linkage, there exist variants that run in O(n^2). If you are only interested in a single cut through the dendrogram, DBSCAN with minPts=2 should give the desired result in O(n log n) with index support. Whenever talking about the complexity of an algorithm, you really need to be precise of what you are talking about. (Granted, the article could be a bit clearer on what it refers to, and emphasize this common misunderstanding). --160.39.199.43 (talk) 03:54, 27 June 2013 (UTC)

Merge proposal[edit]

Since Lloyd's algorithm is the standard algorithm for k-means these days, and essential to explaining k-means, we might just as well merge the two articles.

In fact, I'd even suggest to make Lloyd's algorithm simply a redirect to the appropriate section of k-means where it is explained.

However, there seems to be some other variant (not so much for clustering?) of Lloyd discussed in that other article, too. --Chire (talk) 14:28, 26 August 2013 (UTC)

Oppose. Lloyd's algorithm is also used for constructing Centroidal Voronoi tessellations and for smoothing finite element meshes. Since it has such varied applications, not all of which have anything to do with clustering, I feel that it deserves to be a separate article. And in addition, there are other algorithms than Lloyd for K-means, so the two subjects are not the same even in that context. —David Eppstein (talk) 16:09, 26 August 2013 (UTC)
Oppose. This was proposed in 2007, see the Lloyd's algorithm talk page for the old discussion. I feel the important distinction is that k-means (at least the k-means article) is an exclusively discrete treatment while Lloyd's algorithm has importance in a continuous setting. The Lloyd's algorithm article probably needs improvement to better reflect this: the third paragraph of that page discusses the discrete variant while the subsequent description is more of the continuous language. RuppertsAlgorithm (talk) 14:03, 27 August 2013 (UTC)
Oppose and disagree with the statement that Lloyd's algorithm is the standard algorithm. The default method in R is Hartigan and Wong. It is completely misleading to confuse a general method with one of several algorithms by which one can implement the method. Mathstat (talk) 15:44, 27 August 2013 (UTC)
For those who do not have R installed: from the R manual for kmeans: "The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart> 1) is often recommended." Mathstat (talk) 12:34, 28 August 2013 (UTC)

K-means++[edit]

I was surprised to find no mention of K-means++ in the article and came to the talk page to suggest it only to find that there's been some previous edit war concerning too much emphasis on K-means++. I believe the eradication of K-means++ from this article has gone a bit too far. K-means++ *is* K-means, just with a more sensible starting point. There ought to be a happy medium that can be found. — Preceding unsigned comment added by 24.18.238.89 (talk) 16:24, 23 October 2013 (UTC)

K-means++ has its own article. There is no need to discuss it again here. It's properly listed in the K-means_clustering#Variations long list of variations. There has also been critizism that k-means++ was overhyped, and in many situations does not improve the results / runtime that much; while the seeding process itself itself is also quite expensive: it requires O(n*k) distance computations. --Chire (talk) 15:27, 25 October 2013 (UTC)