Talk:K-means clustering

This is the talk page for discussing improvements to the K-means clustering article.
This is not a forum for general discussion of the article's subject.

Put new text under old text. Click here to start a new topic.
New to Wikipedia? Welcome! Learn to edit; get help.

Article policies

Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL

Statistics Mid‑importance

	This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.StatisticsWikipedia:WikiProject StatisticsTemplate:WikiProject StatisticsStatistics articles
Mid	This article has been rated as Mid-importance on the importance scale.

Computer science Mid‑importance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

Mid

This article has been rated as Mid-importance on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Databases (inactive)

This article is within the scope of WikiProject Databases, a project which is currently considered to be inactive.DatabasesWikipedia:WikiProject DatabasesTemplate:WikiProject DatabasesDatabases articles

Robotics Start‑class Mid‑importance

	This article is within the scope of WikiProject Robotics, a collaborative effort to improve the coverage of Robotics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.RoboticsWikipedia:WikiProject RoboticsTemplate:WikiProject RoboticsRobotics articles
Start	This article has been given a rating which conflicts with the project-independent quality rating in the banner shell. Please resolve this conflict if possible.
Mid	This article has been rated as Mid-importance on the project's importance scale.

Mathematics Mid‑priority

	Mathematics portal This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.MathematicsWikipedia:WikiProject MathematicsTemplate:WikiProject Mathematicsmathematics articles
Mid	This article has been rated as Mid-priority on the project's priority scale.

Untitled 1: Examples?

Some examples? --Abdull 16:32, 21 February 2006 (UTC)[reply]

Spherical clusters

I assume that the comment in the end of the article ("Also, the algorithm works well only when spherical clusters are naturally available in data.") is very much dependent on the distance metric used... --Andkaha^(talk) 15:52, 20 July 2006 (UTC)[reply]

This isn't the k-means algorithm!

Did anyone bother to lookup the original reference?

The algorithm, described by wikipedia, is known as the Lloyd-algorithm. It is described in "An Algorithm for Vector Quantizer Design" by Linde and Buzo and Gray, 1980.

The original k-means algorithm was described by MacQueen ("Some Methods for classification and Analysis of Multivariate Observations", 1967) as follows

Stated informally, the k-means procedure consitsts of simply starting with k groups each of which consists of a single random point, and thereafter adding each new point to the group whose mean the new point is nearest. After a point is added to a group, the mean of that group is adjusted in order to take account of the new point. Thus at each stage the k-means are, in fact, the means of the groups they represent (hence the term k-means).

Every data point is processed only once and only one cluster mean is recomputed everytime a new data point is added. This is hardly the same algorithm as the one described by wikipedia.

This is exactly what the Lloyd clustering does, except that we need to switch to Euclidean space (the algorithm famously works for nutrient grouping etc, but it is difficult to visualize): Centroid = mean; k = number of clusters; cluster = group; k-means is for k groups. The informal procedure you quoted is a different interpretation, although it does the exact same thing: In the mentioned method, you already have the points given. You don't have to add them one by one; instead you add the group centers one by one. This is exactly the same. You still associate each point with the nearest site (or the center to distinguish it from a point). Take k sites. Given n points, these k random sites are placed near the k of the n possible points (usually you select k points and make them the sites). For every point, figure out which of the k sites is the nearest; that point goes to that group. You will have at most k groups now (some groups may be left unassigned if no point lies near to the site). Now recalculate the mean, reposition the k sites; repeat lather rinse. I don't understand how the quoted verbatim differs from this; it describes this exactly. Please clarify. Also provide a link to the PDF or article where this MacQueen paper is, I couldn't find it using Scholar. User_talk:gfxguy — Preceding unsigned comment added by gfxguy (talk • contribs) 06:17, 1 October 2006 (UTC)[reply]

I'm afraid there's very little correlation between the example and the description, making the example all but useless for someone who's trying to understand what's going on.

"The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic data." I assume that the value of k was chosen as 4. Where then are the 4 initial clusters?

"It then calculates the mean point, or centroid, of each set." Where is this done?

"It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters," Each frame should correspond with a step of the algorithm — Preceding unsigned comment added by Nedunuri (talk • contribs) 23:48, 18 November 2006 (UTC)[reply]

Although MacQueen's original k-means algorithm may not be identical to Lloyd's method, the algorithm described in this article is identical to Lloyd's method. Moreover, both articles reflect the common understanding of the k-means algorithm, at least within the algorithms community. The first sentence of the the Arthur and Vassilviskii paper cited in this article is "The k-means method is a well known geometric clustering algorithm based on work by Lloyd in 1982." I've therefore proposed a merge. (Perhaps the merged article should include a clear explanation of MacQueen's algorithm.) — Jeﬀ Erickson ✍ 04:41, 6 April 2007 (UTC)[reply]

I agree with the merge. Lloyd's algorithm is the most common algorithm for approaching the k-means problem. Sancho 08:51, 9 April 2007 (UTC)[reply]

If K-means is more general I don't see why they articles should be merged. There also seems to be a fair amount of active research on new approaches to K-means (from a survey of Google Scholar articles), so I think it would be a mistake to merge. Imran 05:07, 12 April 2007 (UTC)[reply]

Of course, k-means algorithm is used synonymously with Lloyd's algorithm by many. "The most prominent and widely used clustering algorithm is Lloyd's algorithms sometimes also referred to as the k-means algorithm." (From Frahling, G. and Sohler, C. 2006. A fast k-means implementation using coresets. In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry (Sedona, Arizona, USA, June 05 - 07, 2006). SCG '06. ACM Press, New York, NY, 135-143. DOI= http://doi.acm.org/10.1145/1137856.1137879). Sancho 08:55, 9 April 2007 (UTC)[reply]

I liked Sancho's comment. I found an article supporting other methods to solve the K-means problem and referenced it and also clarified the difference between K-means and Lloyd's as well as creating a reference to Lloyd's. Weston.pace 21:56, 12 June 2007 (UTC)[reply]

Untitled 2: Lowercase

Suggestion: Make title to lowercase (I believe k-means is the consensus and not K-means) --Chrike 03:04, 19 January 2007 (UTC)[reply]

Image Replacement

Originally I started working on replacement images for the examples because they weren't licensed properly and I figured replacement was easier than tracking down the owner. However, they have sinced been properly licensed but I went ahead and replaced them because the new images don't show external software (which I beleive detracts from the quality of the image) and are in .svg format. However, I could be mistaken and do not wish to unfairly promote my own work, so if someone preferred the old images they can revert. -192.25.240.225 19:30, 26 July 2007 (UTC)[reply]

Untitled 3: Emphasis on K-Means++

Why does the article put so much emphasis on the K-Means++ algorithm? If this algorithm really was so awesome as the text suggest then I would expect the "k-means++ The Advantages of Careful Seeding" article to be often quoted by scientific articles. It is not, and I think the reason is quite simple. It is an obscure article that "prove" its results using very specialized datasets that does not match what K-Means is used for out in the wild. I seriously doubt the results will hold if K-means++ is tried in more general situations. —Preceding unsigned comment added by 81.216.185.173 (talk) 10:32, 29 November 2008 (UTC)[reply]

K-means+

Why was the previous comment on the talk page criticizing k-means++ deleted? Anyway, it just seems to be another of the zillion approaches there have been at making a previous centroid initilization before the algorithm proper starts. Genetic Algorithms, Particle Swarm, Simulated Annealing... I would want to question the relevance of the section to the article. —Preceding unsigned comment added by 189.135.63.207 (talk) 17:27, 21 December 2008 (UTC)[reply]

I also agree with this. If people want there should be a separate section on seeding methods. The wiki page should not be used as an advertisement like this —Preceding unsigned comment added by 76.168.200.176 (talk) 07:28, 26 January 2009 (UTC)[reply]

I am David Arthur, one of the two authors of that paper. It was not me who wrote the k-means++ section here. I can, however, address the discuss comments here and in the past:

1. The paper was published at one of the main CS theory conferences, arguably the most prestigious among theory conferences that claim to be seriously interested in practical applications. It is not a seminal work that has been cited tens of thousands of times, but it is not obscure either.

2. The paper deals very specifically with k-means, unlike say simulated annealing, and thus makes more sense to reference on this page than simulated annealing. To my knowledge, the only other serious research that has been done on the question of seeding k-means was done at the same time and independently by Ostrovsky et al (see citation and discussion in my paper). The gist of what's going on is they analyze essentially the same algorithm, and show it has a good approximation ratio in certain cases. My paper shows the algorithm has a slightly less good approximation ratio in all cases and also includes some experiments. I do think it would be appropriate to cite that result as well, although it's much more complicated to state. I am unaware of other research results that make sense to reference here.

3. The paper proves its theoretical result with math, and does so in all cases, not only in some cases like the deleted comment accuses. The experiments on real-life data-sets are from the standard repository of real-life data sets that almost all academic papers use. I assure you they were not chosen in bad faith, and since sample code for k-means++ is available as reference in the paper, you are welcome to judge for yourself.

I am not a Wikipedia writer and of course I'm biased and I like having my paper referenced here, so I do not presume to make any decisions on notability. However, I did want to clear up these points.71.198.184.231 (talk) 20:15, 11 March 2009 (UTC)[reply]

Title

I propose changing the title to K-means clustering for two reasons:

It explains what it does (it clusters around k means)
There appear to be a number of algorithms, so referring to it as "K-means algorithm" is a misnomer.

comments? -3mta3 (talk) 09:11, 7 April 2009 (UTC)[reply]

I keep seeing a lot of references that state the problem is either NP-hard or NP-complete, but have yet to find a proof. The most common reference is:

Brucker, Peter (1978). "On the complexity of clustering problems". Optimization and operations research (Proc. Workshop, Univ. Bonn, Bonn, 1977) (Lecture Notes in Economics and Mathematical Systems 157 ed.). Berlin: Springer. pp. 45–54. ISBN 3-540-08842-3. MR 0511326. Zbl 0397.68044.

but I have been unable to find a copy. Is anyone else aware of any other sources? -3mta3 (talk) 15:31, 15 April 2009 (UTC)[reply]

http://www.cc.gatech.edu/~vempala/papers/dfkvv.pdf (NP-hard even if k=2), http://www.imsc.res.in/~meena/papers/kmeans.pdf (NP-hard even if d=2). I do not think the second paper has been published 71.198.184.231 (talk) 22:23, 23 April 2009 (UTC)[reply]

thanks, i put the first one and the original reference (for anyone who can find it!) in the article. -3mta3 (talk) 08:36, 24 April 2009 (UTC)[reply]

Filtering algorithm

I am not sure how to go about adding this, but I think this article should reference Kanungo et al's kd-tree filtering algorithm (http://www.cs.umd.edu/~mount/Papers/pami02.pdf). It speeds a naive implementation up by a couple orders of magnitude. 71.198.184.231 (talk) 22:32, 23 April 2009 (UTC)[reply]

There is a reference to it (see K-means_clustering#Notes) however no description of what it does. Feel free to expand. -3mta3 (talk) 08:15, 24 April 2009 (UTC)[reply]

Ok, I expanded the variations section some -- I'm not super-sure my summary of the last one is good since only the abstract is online. I hadn't heard of it before. 71.198.184.231 (talk) 16:43, 24 April 2009 (UTC)[reply]

A nitpick

I think it is somewhat misleading to say using variance and Euclidean distance are drawbacks of k-means, and it's very misleading to say this is what makes k-means fast. These choices are what makes k-means *well defined*. What makes it *fast* is that it does not attempt to find the best clustering even according to its own definitions. Any clustering algorithm has to implicitly define what it means by a "good" clustering, and k-means's choice of Euclidean distance + variance is not obviously better or worse than other choices. So yes, if people use k-means assuming it has a human-like understanding of when two unknown objects are similar, they will be disappointed. But, no algorithm does this, and there is certainly no standard belief that an algorithm ever will be able to do this. Meanwhile, the description implies there is a clear better metric which k-means doesn't use for speed reasons - I do not think this is true and without real evidence, I do not see how the claim belongs on wikipedia in its current form.198.161.30.159 (talk) 06:29, 30 June 2009 (UTC)[reply]

I would add that as I understand it k-means clustering as a method is independent of any particular distance metric, e.g. manhattan distance would be a reasonable choice in some circumstances. --The locster (talk) 16:55, 2 September 2009 (UTC)[reply]

The demonstration is complete?

As far as I have understood, the last frame is not the result of the algorithm. I think that the result should be a part of the demonstration, too. Bubla (talk) 11:25, 24 August 2009 (UTC)[reply]

Also, the caption to the first image says "initial means are randomly selected from the data set" but the means from the first image are not present as data points in the subsequent images - implying they're not part of the data set. 78.86.200.205 (talk) 10:15, 31 January 2012 (UTC)[reply]

I think it would be better to assume the problem is in the subsequent images, not the description (see section titled "Error in demonstration images"). — Preceding unsigned comment added by 71.56.157.70 (talk) 16:54, 27 August 2012 (UTC)[reply]

Choice of distance metric

As I understand it k-means clustering can be used with a variety of distanec metrics, it is common to use Euclidean distance but other metrics can be used, e.g. Manhattan. It would be nice to see an overview of some metrics that are typically used and what their relative pros and cons are.

Following on from distance metrics, some explanation of how to calculate centroids for non-euclidean distance metrics would be required, e.g. manhattan distance centroids are calculated using the median of each coordinate element rather than the mean. Such explanations probably belong elsewhere with a pointer from here but in the absence of explanations elsewhere this page is better than nowhere for now.--The locster (talk) 17:04, 2 September 2009 (UTC)[reply]

Empty clusters

It's my understanding that it is possible for k-means clustering to result in empty clusters after an iteration of the algorithm. Some discussion of this issue is warranted. E.g. this is potentially more of a problem in higher dimensional spaces and for certain distance metrics. Are there algorithm variants that avoid empty clusters altogether? And what are the strategies for dealign with empty clusters when they are encountered, e.g. split the largest cluster into two, or the cluster with the highest variance, etc.--The locster (talk) 17:09, 2 September 2009 (UTC)[reply]

Convergence guarantees?

Is this algorithm guaranteed to converge? 140.180.6.33 (talk) —Preceding undated comment added 07:20, 9 December 2009 (UTC).[reply]

Error in demonstration images

In the "Demonstration of the standard algorithm" images, the initial means are chosen from the data set. These data points disappear in later images. — Preceding unsigned comment added by 71.56.157.70 (talk) 16:47, 27 August 2012 (UTC)[reply]

Obviously, the initial means were generated for this example and are not data points. --Chire (talk) 21:02, 27 August 2012 (UTC)[reply]

I think I agree with Chire: i.e. the initial criticism of this diagram is mistaken. Nevertheless there is still a problem with the rightmost figure, labelled 4). This rightmost figure, as it stands currently, clearly cannot depict the situation at any stage during the iteration of the algorithm (nor after convergence). If it were a pic of the state of affiars after one of the "Update Steps", then the green circle would be the centroid of the green squares, the red circle the centroid of the red squares, etc -- just as is believably so of figure 3). If instead figure 4) were claimed to be a snapshot after an "Assignment Step" then the border line between any two of the shaded zones (or to be precise its extension to an infinitely long straight line) would be the perpendicular bisector of the straight line segment whose endpoints are the two correspondingly coloured circles, as appears believably to be true of part fig 2) of this diagram. However both conditions are patently false of the rightmost figure 4), even at a glance.

I propose that this sub-figure 4) should be removed, until a corrected version can be substituted. At present its only function is to raise doubt and puzzlement in the mind of someone who has in fact perfectly well understood the description of the standard algorithm. (The first three parts, labelled 1) to 3), of this figure can be retained however as without any change they are a most useful illustration of the algorithm explanation given in the preceding text.) — Preceding unsigned comment added by 83.217.170.175 (talk) 14:36, 6 June 2014 (UTC)[reply]

The reported time complexity is controversial

The reported time complexity of k-means is reported as O(n^(nk+1)logn). This is against other proofs on k-means linearity (see Introduction to Information Retrieval by Manning et al., http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html#p:kmeanscomplexity), and experimental evidence showing that k-means is faster than hierarchical agglomerative clustering (which is reported to be in the time complexity range of O(n^2) ... O(n^4)).

I suggest to remove the Complexity section and the corresponding references until someone with good understanding of k-means time complexity comes and fixes this.

Finally, a few words about the references in the Complexity section. Reference 14 is not a strong one (extended abstract at a symposium), and most importantly, the paper talks about something different than the original K-means algorithm. References 11, 12, 13 talk about special cases of K-means and do not reflect the "standard" K-means algorithm described in this article. Reference 15 is a dead link, and therefore there is no valid support for the following sentence:

Better bounds are proved for simple cases. For example,[15] showed that the running time of k-means algorithm is bounded by O(dn^4M^2) for n points in an integer lattice \{1,\dots, M\}^d.

Nucflash (talk) 11:41, 26 June 2013 (UTC)[reply]

From a quick look, the book you reference discusses the Lloyd algorithm for finding an approximative optimum. As the problem of finding the exact solution is proven to be NP-hard (it should be trivial to find references for this), O(n*k*i*m) obviously cannot be correct for the exact k-means solution.

I would not use a wide-coverage book like this as reference. These books tend to be very imprecise on what they are talking about. Many of them call it "k-means", cite MacQueen, but give Lloyds algorithm, and do not discuss that this is just a heuristic. And exactly this happened in this NLP book. The given algorithm is the modern formulation of Lloyds algorithm (did you have a look at his article?); who not even used the term k-means. And it's just the most popular heuristic for finding a local optimum. However, the complexity section clearly states: " the k-means clustering problem", NOT "the lloyd algorithm". Btw, the same thing applies to hierarchical clustering. The naive matrix way can solve arbitrary cluster similarity measures in O(n^3); unless you use a really expensive cluster similarity measure. For special cases such as single-linkage, there exist variants that run in O(n^2). If you are only interested in a single cut through the dendrogram, DBSCAN with minPts=2 should give the desired result in O(n log n) with index support. Whenever talking about the complexity of an algorithm, you really need to be precise of what you are talking about. (Granted, the article could be a bit clearer on what it refers to, and emphasize this common misunderstanding). --160.39.199.43 (talk) 03:54, 27 June 2013 (UTC)[reply]

Merge proposal

Since Lloyd's algorithm is the standard algorithm for k-means these days, and essential to explaining k-means, we might just as well merge the two articles.

In fact, I'd even suggest to make Lloyd's algorithm simply a redirect to the appropriate section of k-means where it is explained.

However, there seems to be some other variant (not so much for clustering?) of Lloyd discussed in that other article, too. --Chire (talk) 14:28, 26 August 2013 (UTC)[reply]

Oppose. Lloyd's algorithm is also used for constructing Centroidal Voronoi tessellations and for smoothing finite element meshes. Since it has such varied applications, not all of which have anything to do with clustering, I feel that it deserves to be a separate article. And in addition, there are other algorithms than Lloyd for K-means, so the two subjects are not the same even in that context. —David Eppstein (talk) 16:09, 26 August 2013 (UTC)[reply]

Oppose. This was proposed in 2007, see the Lloyd's algorithm talk page for the old discussion. I feel the important distinction is that k-means (at least the k-means article) is an exclusively discrete treatment while Lloyd's algorithm has importance in a continuous setting. The Lloyd's algorithm article probably needs improvement to better reflect this: the third paragraph of that page discusses the discrete variant while the subsequent description is more of the continuous language. RuppertsAlgorithm (talk) 14:03, 27 August 2013 (UTC)[reply]

Oppose and disagree with the statement that Lloyd's algorithm is the standard algorithm. The default method in R is Hartigan and Wong. It is completely misleading to confuse a general method with one of several algorithms by which one can implement the method. Mathstat (talk) 15:44, 27 August 2013 (UTC)[reply]

For those who do not have R installed: from the R manual for kmeans: "The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan–Wong algorithm generally does a better job than either of those, but trying several random starts (nstart> 1) is often recommended." Mathstat (talk) 12:34, 28 August 2013 (UTC)[reply]

K-means++

I was surprised to find no mention of K-means++ in the article and came to the talk page to suggest it only to find that there's been some previous edit war concerning too much emphasis on K-means++. I believe the eradication of K-means++ from this article has gone a bit too far. K-means++ *is* K-means, just with a more sensible starting point. There ought to be a happy medium that can be found. — Preceding unsigned comment added by 24.18.238.89 (talk) 16:24, 23 October 2013 (UTC)[reply]

K-means++ has its own article. There is no need to discuss it again here. It's properly listed in the K-means_clustering#Variations long list of variations. There has also been critizism that k-means++ was overhyped, and in many situations does not improve the results / runtime that much; while the seeding process itself itself is also quite expensive: it requires O(n*k) distance computations. --Chire (talk) 15:27, 25 October 2013 (UTC)[reply]

Confusion about computational complexity

The article on the one hand says that the problem is NP-hard, and on the other hand, it states the following: "If k and d (the dimension) are fixed, the problem can be exactly solved in time $O(n^{dk+1}\log {n})$ , where n is the number of entities to be clustered".

I assume that NP-hardness is w.r.t. to the number of entities to be clustered (i.e., n), but the latter statement implies polynomial time.

It would be great if an expert could clarify which statement is true. (I suspect the statement quoted above is false and that the paper from which it is taken actually claims something else...) Cervisiarius (talk) 17:47, 17 March 2016 (UTC)[reply]

I don't know if that statement is true. Complexity analysis of such problems is hard. What if you chose e.g.

k={\sqrt {n}}

, which is a common practise to reduce data set size? P.S. new discussions should be added to the end. HelpUsStopSpam (talk) 20:47, 17 March 2016 (UTC)[reply]

Shouldn't these be the other way round?

"The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set."

Seems wrong to me, logically when first adding points randomly to clusters and then calculating the means of those clusters, I'd expect those means to lie close to the center, whereas when picking 'means' randomly from the set they would be spread out.

But as I'm no expert on the algorithm or terminology I don't really dare change it by myself.

145.15.244.14 (talk) 19:33, 2 November 2017 (UTC)[reply]

random partition = every point is randomly assigned partition 1..k => expected value of each partition is the overall data center. Seems to be correct? — Preceding unsigned comment added by 77.177.197.101 (talk) 21:43, 2 November 2017 (UTC)[reply]

Can't understand intro

Out of the blue, the second paragraph of the intro talks about "both algorithms", even though no algorithms have been mentioned before. The third paragraph then continues with "the algorithm", and again it is unclear what algorithm is being talked about. AxelBoldt (talk) 14:20, 22 December 2017 (UTC)[reply]

Relation to other algorithms: Gaussian mixture model

I would like to add a sentence to the paragraph describing the relation between k-means and Gaussian mixture models. I would like to add it as the second sentence of the "Gaussian mixture model" paragraph such the first two sentences read:

"k-means clustering, and its associated expectation-maximization algorithm, is a special case of a Gaussian mixture model, specifically, the limit of taking all covariances as diagonal, equal and small. The requirement for small variances can be dropped^[1] by interpreting k-means as a variational Expectation-Maximization algorithm.".

By dropping the requirement of small (actually limit to zero) variances, the quantization error can directly be related to a lower bound of the GMM log-likelihood (Eqn. 17 of referenced paper). This relation is novel and different from the small-variance relation of text books and of the current version of this Wiki article. I am the first author of the referenced paper, so this is a self-citation. In such cases, Wikipedia regulations recommend to post a talk message prior to actually editing the article in order to prevent excessive self-citations. To assure the k-means article editors that I do not do excessive self-citations, note that my edit will represent my first ever edit of a Wikipedia page. Any expert of the k-means algorithm may confirm (or object to) the relevance of the paper my additional sentence will point out. I am happy to discuss. Thanks! Jörg Lücke (talk) 13:01, 28 November 2019 (UTC)[reply]

A freely accessible link to the referenced paper is: https://arxiv.org/abs/1704.04812 See here for the journal.

Jörg Lücke (talk) 16:52, 25 November 2019 (UTC)[reply]

Compared to the "standard textbook" relationship, this is an unencyclopedic detail that relies on users knowing intricate details of variational Bayes. and even then, what does this actually add besides a footnote in theory? Also I'm not sure that your reading that previous work assumed small variances is even correct. — Preceding unsigned comment added by 78.49.42.107 (talk) 06:47, 28 November 2019 (UTC)[reply]

Thanks for the discussion, I think this point is really important to discuss. The data (log-)likelihood is one of the most fundamental measures to compare unsupervised learning algorithms, and k-means can only be related to a likelihood if one does not take the limit to small variances (of a GMM). The current version of the "Relation to Gaussian mixture models" explicitly states small variances as requirement. So do we want to leave this unedited/uncommented even though it can be shown that this is not a requirement? One does not need to be familiar with variational methods to understand the statement that small variances are not required. Btw. the referred to paper also provides a simple formula (Eqn. 17) which allows anybody (without any knowledge of variational methods) to related the quantization error obtained by k-means to the data log-likelihood. Maybe we can come up with an edit of this point together (also goes to other readers)? It would, for instance, be good to find more than one paper to refer to. So if you find one, go ahead. We did find some papers that provide a relation to a likelihood but they rather provide plausibility arguments (also see Sec. 3.4 here). There is also a lecture on YouTube with some plausibility arguments that one could refer to but not sure if Wiki regulations would allow for YouTube references in a scientific context. Regarding "I'm not sure that your reading that previous work assumed small variances is even correct.", could you elaborate? Standard textbooks like Bishop (2006), page 444, Barber (2012) etc as well as the paper cited in the current 2nd sentence of the paragraph^[2] all use the small variance limit. Thanks. Jörg Lücke (talk) 13:01, 28 November 2019 (UTC)[reply]

I'm arguing that this detail is only relevant to a tiny audience of theoretical scientists, and not to the typical audience of the article. Hence, it goes too far outside the scope of an encyclopedia. We cannot (and should not, because of readability) cover all available literature. Its obvious that the variance-to-zero requirement is necessary in standard GMM to force it to produce a discrete result like k-means. So whatever you do - it is not the standard GMM; but some exotic variational approximation to GMM, or some other construct such as adding an argmax to make GMM produce discrete results. Because of this, simply the statement that small variances are not required is, in my opinion, false. If you drop it, its not standard GMM anymore. Its trivial to build a counterexample - put a Gaussian with a large variance on every k-means center, and GMM will not produce a discrete result, hence not the same as k-means. So its not half as easy as you claim. And in the end, the entire relationship to GMM is purely academic. There is no good reason to force GMM to produce the same result as Lloyd k-means (which standard GMM does not, unless we force the variances to be diagonal, constant, and converge to zero). If we want a k-means result, just use k-means. If we want varying variance and covariances, use GMM. No need to abuse one to get the results of the other. So I believe that by adding this claim, this makes the article just more confusing to readers. My impression is that you mostly care about advertising your research in Wikipedia, not about actually improving the article. — Preceding unsigned comment added by 93.132.157.2 (talk) 23:25, 2 December 2019 (UTC)[reply]

To briefly interrupt your fighting: Murphy (Machine Learning: A Probabilistic Perspective, 2012) does not require variance -> 0. He shows an equivalence of k-means to "hard EM" with arbitrary but fixed variance. See 11.4.2.5. --Chire (talk) 12:00, 3 December 2019 (UTC)[reply]

@Chire: True, and thanks for the constructive contribution. What about this for the relation to Gaussian mixture models?

"Lloyd's algorithm can be obtained as a special case of EM for a Gaussian mixture model, specifically, if the Gaussian mixture model is assumed to have spherical clusters with equal and small variance (e.g., Bishop (2006), Barber (2012)). The requirement for small variances can be dropped by interpreting Lloyd's algorithm as a variational EM approximation (i.e., by using Kronecker deltas as approximations to full posteriors)^[3]^[4]^[5]. This allows for directly relating the k-means objective to the likelihood objective of Gaussian mixture models."

For the relation between Murphy (I think you mean Chapter 11.4.2.5) and our paper, note that care is required when actually proving an equivalence to k-means with Kronecker approximations to GMM posteriors. This is because there are some subtleties for hard zeros and the variational bound, and (maybe more importantly) care is required to distinguish old Thetas from new Thetas (which Murphy is not explicit about). If you are careful, you do get k-means and a result consistent with Murphy (as I wrote before, we did nowhere find a peer reviewed derivation, and Murphy's isn't one either). Our C'=1 case in the paper is essentially a Kronecker approximation but without a pre-specified winning cluster as Murphy assumes (and as is probably required for a full derivation). If interested also see page 26 here. Another difference is, of course, that we were very much interested in k-means versions with two or more winning clusters.

134.106.116.55 (talk) 16:40, 4 December 2019 (UTC)[reply]

@93.132.157.2: Note that a reader of a subsection "Gaussian mixture models" in the section "Relation to other algorithms" is likely very interested in how the underlying objectives of k-means and GMM likelihood are related. Note that the relation is between the objectives not only the algorithms you seem to elaborate on. The objectives can be related indeed only if we drop the variance-to-zero requirement. I tried to express this more clearly in the revised suggested change (see above). For a thoroughly discussed example similar to the one you put forward, see the referred paper. Thanks.

Jörg Lücke (talk) 22:19, 4 December 2019 (UTC)[reply]

@Jörg Lücke: to me, the relationship honestly is not much more interesting than that Lloyd is one of many EM-type algorithms. Maybe a little bit that there is indeed a theoretical reason why k-means works best with Gaussian data; although I'd attribute this more to using the mean and squared deviations. These already don't work well in one dimension when the data is skewed, so its no surprise this also holds for multiple dimensions and mixtures. I do not see where knowing that Lloyd is a limit case of GMM helps me solve any problem. Chire (talk) 08:49, 11 December 2019 (UTC)[reply]

References

^ Lücke and Forster (2019). "k-means as a variational EM approximation of Gaussian mixture models". Pattern Recognition Letters. 125: 349–356 – via Elsevier Science Direct.
^ Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. (2007). "Section 16.1. Gaussian Mixture Models and k-Means Clustering". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York (NY): Cambridge University Press. ISBN 978-0-521-88068-8.
^ Kevin P. Murphy (2012). Machine learning : a probabilistic perspective. Cambridge, Mass.: MIT Press. ISBN 978-0-262-30524-2. OCLC 810414751.
^ Jordan, Ghahramani and Saul (1997). "Hidden Markov decision trees" (PDF). Advances in Neural Information Processing Systems: 501–507.
^ Lücke and Forster (2019). "k-means as a variational EM approximation of Gaussian mixture models". Pattern Recognition Letters. 125: 349–356 – via Elsevier Science Direct.

Using k-meams for dominant colors?

This article makes no mention of usage of k-means for color extraction of images (probably because it is a non-notable concept?) Draheinsunvale (talk) 00:25, 30 December 2021 (UTC)[reply]

Wiki Education Foundation-supported course assignment

This article was the subject of a Wiki Education Foundation-supported course assignment, between 6 September 2020 and 6 December 2020. Further details are available on the course page. Student editor(s): Rc4230.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 01:34, 17 January 2022 (UTC)[reply]

Pseudocode

Ou-jb (talk · contribs) added a not-so-pseudo pseudocode (actually it is closer to real code than to pseudocode), thank you. However, I would suggest the following change: instead of using (more expensive, because of reallocations) lists for each cluster, use a labels array (static allocation). No change in cluster labels is a convergence criterion that does not suffer from numerical precision issues, and does not have the hidden complexity of comparing the lists of centers numerically in the current code, and the label array makes the partitioning property of k-means clear. Maybe also make the code more pseudo code, to avoid that people think they can just copy this into their programs... Pseudocode should be more abstract IMHO, focus on intent instead of syntax. 2A01:C22:BC1E:A700:D040:CD72:CAA9:1C20 (talk) 11:14, 20 April 2024 (UTC)[reply]

[1] Lücke and Forster (2019). "k-means as a variational EM approximation of Gaussian mixture models". Pattern Recognition Letters. 125: 349–356 – via Elsevier Science Direct.

[2] Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. (2007). "Section 16.1. Gaussian Mixture Models and k-Means Clustering". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York (NY): Cambridge University Press. ISBN 978-0-521-88068-8.

[3] Kevin P. Murphy (2012). Machine learning : a probabilistic perspective. Cambridge, Mass.: MIT Press. ISBN 978-0-262-30524-2. OCLC 810414751.

[4] Jordan, Ghahramani and Saul (1997). "Hidden Markov decision trees" (PDF). Advances in Neural Information Processing Systems: 501–507.

[5] Lücke and Forster (2019). "k-means as a variational EM approximation of Gaussian mixture models". Pattern Recognition Letters. 125: 349–356 – via Elsevier Science Direct.

[1]

[2]

[3]

[4]

[5]