Talk:Cluster analysis

This article is of interest to the following WikiProjects:
WikiProject Databases / Computer science (Rated C-class, High-importance)
WikiProject Computer science (Rated C-class, High-importance)
WikiProject Robotics (Rated C-class, Mid-importance)
This article has been marked as needing immediate attention.
WikiProject Statistics (Rated C-class, High-importance)
 

External validation indices

About the pairwise F-measure: "This measure is able to compare clusterings with different numbers of clusters". Actually, all of the reviewed indices are able to do so. There are some well-known drawbacks, especially with the Rand index, which tends to 1 as the number of clusters increases because of the true negative term (pairs separated in both clusterings), but you just have to keep that in mind ;) Moreover, if you want to compare clusters individually, a matching is obviously required, and it is generally done with the Hungarian method for the sake of efficiency. -- Preceding unsigned comment added by 84.98.253.168 (talk) 02:19, 23 October 2012 (UTC)
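
To make the comparison concrete, here is a minimal sketch of the pair-counting behind both indices. The function names and the toy labelings are assumptions for illustration only, not taken from the article. It shows that neither index needs the two clusterings to have the same number of clusters.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Classify every pair of points by whether each clustering groups it."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1  # together in both clusterings
        elif same_b:
            fp += 1  # together only in the second clustering
        elif same_a:
            fn += 1  # together only in the first clustering
        else:
            tn += 1  # apart in both clusterings
    return tp, fp, fn, tn


def rand_index(labels_a, labels_b):
    tp, fp, fn, tn = pair_counts(labels_a, labels_b)
    return (tp + tn) / (tp + fp + fn + tn)


def pairwise_f_measure(labels_a, labels_b):
    tp, fp, fn, _ = pair_counts(labels_a, labels_b)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


reference = [0, 0, 0, 1, 1, 1]  # hypothetical ground truth, two clusters
candidate = [0, 0, 1, 1, 2, 2]  # hypothetical result, three clusters
counts = pair_counts(reference, candidate)
ri = rand_index(reference, candidate)
f1 = pairwise_f_measure(reference, candidate)
```

The Rand index counts the "apart in both" pairs as agreements, while the pairwise F-measure ignores them, which is why the two can diverge when there are many small clusters.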

Infinity-norm

Can someone please make infinity-norm a link: infinity-norm

(The article is currently locked.)

Sabotage

This page appears to have been deliberately vandalised.

Please unlock this page.

Elbow criterion section

Someone should edit out the blatant advertising for the Excel plugin product. It would also be nice to have some additional info on picking the number of clusters.

Sorry, I just had Excel at hand when making the graph. This is not an advertisement and this is not real data, just a crappy hand-made Excel graph to help visualize the EC heuristic. If you look through the archives, you will see I also made an extremely ugly representation for the hierarchical clustering, which was thankfully replaced.
As for how to tweak clustering: either you have a good idea of how many clusters you want (number criterion), or a good idea of how much total variance (or another performance metric) you want to explain, or all you are left with is heuristics. That's from a computer science POV.
Looking in another direction, for example statistics, there are ways to compare models with different numbers of parameters, like the Akaike information criterion and the methods linked from there, or maybe something based on information entropy. These will again help you choose in the tradeoff between having too many clusters and having too low performance because of a low cluster number. I'm sorry, I don't have any relevant article, nor the time at the moment to find one. Bryan
Actually ran into an article about using entropy criterion to stop clustering: Cluster Identification Using Maximum Configuration Entropy, by C.H. Li. —The preceding unsigned comment was added by 86.53.54.179 (talk) 19:02, 1 April 2007 (UTC).

I'm wondering if there isn't an error in this sentence: "Percent of variance explained or percent of total variance is the ratio of within-group variance to total variance." I'm thinking that as the number of clusters increases, the within-group variation decreases, which is not what is shown on the graph. Should this be "... the ratio of the between-group variance to the total variance"? Mhorney (talk) 17:57, 11 January 2008 (UTC)

Addressed by 91.89.16.141 User A1 (talk) 06:38, 17 March 2008 (UTC)
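
For anyone checking the corrected definition, a minimal sketch follows. The function name and the toy data are assumptions for illustration. It computes the between-group sum of squares divided by the total sum of squares and shows that the ratio grows as clusters are added, which matches the shape of an elbow plot.

```python
def explained_variance_ratio(points, labels):
    """Between-group sum of squares divided by total sum of squares."""
    dim = len(points[0])

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)

    # Group the points by their cluster label.
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    between_ss = sum(
        len(rows) * sq_dist(centroid(rows), grand) for rows in groups.values()
    )
    return between_ss / total_ss


# Hypothetical one-dimensional data with two obvious groups.
points = [(1.0,), (2.0,), (3.0,), (11.0,), (12.0,), (13.0,)]
one = explained_variance_ratio(points, [0, 0, 0, 0, 0, 0])
two = explained_variance_ratio(points, [0, 0, 0, 1, 1, 1])
three = explained_variance_ratio(points, [0, 0, 0, 1, 1, 2])
```

Going from one to two clusters explains almost all of the variance here, and the third cluster adds very little; that flattening is the "elbow".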

The section is in violation of WP:NOR. —Preceding unsigned comment added by 71.100.12.147 (talk) 11:09, 11 September 2008 (UTC)

V-means clustering

A Google search for "V-means clustering" only returns this Wikipedia article. Can someone provide a citation for this?

For future reference, this is the V-means paragraph that was removed:

V-means clustering

V-means clustering utilizes cluster analysis and nonparametric statistical tests to key researchers into segments of data that may contain distinct homogeneous sub-sets. The methodology embraced by V-means clustering circumvents many of the problems that traditionally beleaguer standard techniques for categorizing data. First, instead of relying on analyst predictions for the number of distinct sub-sets (k-means clustering), V-means clustering generates a Pareto optimal number of sub-sets. V-means clustering is calibrated to a user-defined confidence level p, whereby the algorithm divides the data and then recombines the resulting groups until the probability that any given group belongs to the same distribution as either of its neighbors is less than p.

Second, V-means clustering makes use of repeated iterations of the nonparametric Kolmogorov-Smirnov test. Standard methods of dividing data into its constituent parts are often entangled in definitions of distances (distance measure clustering) or in assumptions about the normality of the data (expectation maximization clustering), but nonparametric analysis draws inference from the distribution functions of sets.

Third, the method is conceptually simple. Some methods combine multiple techniques in sequence in order to produce more robust results. From a practical standpoint this muddles the meaning of the results and frequently leads to conclusions typical of “data dredging.”
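
Since no citation for V-means itself has turned up, the following is only a sketch of the one well-documented building block the removed paragraph names: the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between two empirical distribution functions. The function name is an assumption, and the sketch does not compute a p-value or implement the divide-and-recombine procedure described above.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (largest gap between ECDFs)."""

    def ecdf(sample, x):
        # Fraction of observations that are <= x.
        return sum(1 for v in sample if v <= x) / len(sample)

    # Both ECDFs are step functions that only jump at observed values,
    # so the largest gap is found at one of those values.
    return max(
        abs(ecdf(sample_a, x) - ecdf(sample_b, x))
        for x in list(sample_a) + list(sample_b)
    )


separated = ks_statistic([1, 2, 3], [4, 5, 6])
identical = ks_statistic([1, 2, 3], [1, 2, 3])
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A statistic near 1 suggests the two groups come from different distributions, and a statistic near 0 suggests they could be merged. How that would be turned into the confidence level p of the removed paragraph is not specified in the text.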

Fuzzy c-means clarification

I believe there is a typo at "typological analysis"; it should be "topological".

The explanation of the fuzzy c-means algorithm seems quite difficult to follow. The order of the bullet points is correct, but it is unclear which steps are to be repeated and when.

"The fuzzy c-means algorithm is greatly similar to the k-means algorithm:

  • Choose a number of clusters
  • Assign randomly to each point coefficients for being in the clusters
  • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
    • Compute the centroid for each cluster, using the formula above
    • For each point, compute its coefficients of being in the clusters, using the formula above"
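
To show which steps sit inside the loop, here is a minimal sketch of the quoted procedure. The article's "formula above" is not reproduced on this talk page, so the sketch assumes the standard fuzzy c-means updates with fuzzifier m: centroids are means weighted by the coefficients raised to the power m, and coefficients are computed from distance ratios raised to the power 2/(m-1). All names, defaults and data are illustrative assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, eps=1e-6, max_iter=200, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])

    # Done once: assign random coefficients, normalised so each row sums to 1.
    weights = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        weights.append([w / total for w in row])

    for _ in range(max_iter):
        # Repeated step 1: centroid of each cluster, weighted by w ** m.
        centroids = []
        for k in range(n_clusters):
            wm = [w[k] ** m for w in weights]
            denom = sum(wm)
            centroids.append(
                [sum(wm[i] * points[i][d] for i in range(len(points))) / denom
                 for d in range(dim)]
            )

        # Repeated step 2: recompute every point's coefficients.
        new_weights = []
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            if any(d == 0.0 for d in dists):
                # A point sitting exactly on a centroid belongs fully to it.
                hits = [d == 0.0 for d in dists]
                row = [1.0 / sum(hits) if h else 0.0 for h in hits]
            else:
                power = 2.0 / (m - 1.0)
                row = [1.0 / sum((dk / dj) ** power for dj in dists)
                       for dk in dists]
            new_weights.append(row)

        # Convergence test: largest coefficient change since last iteration.
        change = max(abs(a - b)
                     for old, new in zip(weights, new_weights)
                     for a, b in zip(old, new))
        weights = new_weights
        if change <= eps:
            break

    return centroids, weights


points = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
centroids, weights = fuzzy_c_means(points, n_clusters=2)
```

Only the random initialisation happens once; the centroid step and the coefficient step alternate until the coefficients stop changing by more than eps.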

Also, aren't c-means and k-means just different names for the same thing? If so, can they be changed to be consistent throughout?

The c-means clustering relates only to the fuzzy logic clustering algorithm. You could say that k-means is the convergence of c-clustering with ordinary logic, rather than fuzzy logic.

On-line versus off-line clustering

Beyond the division of clustering methodologies into hierarchical/partitional and agglomerative/divisive, it is possible to differentiate between:

  • Arrival of data points: on-line/off-line
  • Type of data: stationary/non-stationary

Additionally, it may be helpful to discuss some of the difficulties in clustering data, in particular choosing the correct number of centroids and limits on memory or processing time, and techniques for solving them, such as measuring phase transition (Rose, Gurewitz & Fox).


Another division would be Extensional vs. Conceptual Clustering. --Beau 10:40, 28 June 2006 (UTC)
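
To illustrate what the on-line case could look like, here is a minimal sketch of a sequential k-means update, where each arriving point moves only its nearest centroid and nothing is stored. The class name, the choice to count each seed centroid as one observation, and the data are assumptions for illustration. This is one simple on-line scheme, not the only one.

```python
import math


class OnlineKMeans:
    """Sequential k-means: centroids are updated one point at a time."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Assumption: each seed centroid counts as one observed point.
        self.counts = [1] * len(self.centroids)

    def update(self, point):
        # Find the nearest centroid to the arriving point.
        k = min(range(len(self.centroids)),
                key=lambda i: math.dist(point, self.centroids[i]))
        self.counts[k] += 1
        rate = 1.0 / self.counts[k]
        # Running-mean update: move the centroid a fraction of the way.
        self.centroids[k] = [c + rate * (x - c)
                             for c, x in zip(self.centroids[k], point)]
        return k


model = OnlineKMeans([(0.0,), (10.0,)])
assignments = [model.update(p) for p in [(1.0,), (9.0,), (2.0,)]]
```

With the 1/count learning rate each centroid is the running mean of the points assigned to it, which suits stationary data. For non-stationary data a constant rate is a common alternative, since it lets old points fade.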

Missing reference

I find this in the article:

This is the basic structure of the algorithm (J. MacQueen, 1967):

But when I looked at the bibliography, it was not there. If anyone has the information, could they add it? Michael Hardy 18:36, 21 November 2005 (UTC)

Impossibility Theorem

The clustering impossibility theorem should be mentioned. It is similar to Arrow's impossibility theorem, except for clustering. I don't know who created it though. Briefly, it considers 3 properties that any clustering algorithm should have:

  • Richness - "all labelings are possible"
  • Scale invariance
  • Consistency - "if you increase inter-cluster distance and decrease intra-cluster distance, cluster assignments should not change"

No clustering algorithm can have all 3.

-- BAxelrod 03:53, 16 December 2005 (UTC)
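
As a small illustration of one of the three properties listed above, here is a sketch of single-linkage clustering with a fixed distance cut-off, which fails scale invariance: multiplying all distances by a constant changes the result. The function name, the cut-off value and the points are assumptions for illustration.

```python
def threshold_clusters(points, r):
    """Single-linkage on 1-D points: link any two points at distance <= r."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= r:
                parent[find(i)] = find(j)

    # Collect the indices of each connected component.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


points = [0.0, 1.0, 5.0, 6.0]
original = threshold_clusters(points, r=2.0)
scaled = threshold_clusters([3.0 * x for x in points], r=2.0)
```

The same four points fall into two clusters at the original scale and into four singletons after tripling every distance, so a fixed cut-off cannot be scale-invariant.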

It is a good thing to have, and I guess one should mostly reference: Jon Kleinberg, "An impossibility theorem for clustering", Advances in Neural Information Processing Systems, 2002.

However, if there is not another source, then I'd mention that there is a little problem with this theorem as it is presented in that article.

First, it deals with graphs G(V,E) having |V| >= 2 and distances d(i,j) = 0 iff i == j, for i, j in V. Now take richness and scale invariance (which means that a graph with some fixed weights has the same clustering if all the weights are multiplied by some positive constant) and a graph with |V| = 2, and boom, here you go: for each clustering function we get either scale invariance or richness. If there is richness, then scale invariance does not hold, and the other way round. Sweet, isn't it? Or am I wrong somewhere?

Could you please explain something about the ISODATA algorithm for data clustering?

The last external link on this page has an example on ISODATA clustering. I will try to do a digest when I have time, but feel free to beat me to it. Bryan

Relation to NLP

It seems a large amount of the effort in text mining related to text clustering is left out of this article, although this seems to be the most appropriate place for it. Josh Froelich 20:16, 9 January 2007 (UTC)

No strict definition for the problem itself.

There are lots of details about different methods and metrics used to solve the problem, but the problem itself is defined too loosely.

Maybe that's right

As I understand it, clustering can be used for different purposes, but it is generally used to find classes of data points that aren't easily noticeable otherwise. However, if you are using the term 'problem' in the CS theoretical sense, then each clustering algorithm really has a different problem.

Darthhappyface (talk) 04:32, 25 May 2010 (UTC)

All software links were removed

Here they are. Should we put them back?

Software implementations

Free

  • The flexclust package for R
  • COMPACT - Comparative Package for Clustering Assessment (in Matlab)
  • YALE (Yet Another Learning Environment): freely available open-source software for data pre-processing, knowledge discovery, data mining, machine learning, visualization, etc. also including a plugin for clustering, fully integrating Weka, easily extendible, and featuring a graphical user interface as well as a XML-based scripting language for data mining;
  • mixmod : Model Based Cluster And Discriminant Analysis. Code in C++, interface with Matlab and Scilab
  • LingPipe Clustering Tutorial: tutorial for doing complete- and single-link clustering using LingPipe, a Java text data mining package distributed with source.
  • Weka : Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
  • Tanagra : a free data mining software including several clustering algorithms such as K-MEANS, SOM, Clustering Tree, HAC and more.
  • Cluster : Open source clustering software. The routines are available in the form of a C clustering library, an extension module to Python, a module to Perl.
  • python-cluster: pure Python implementation

Non-free

I removed these about a month ago. Wikipedia's external link guidelines discourage large link directories like this. None of the links appears to meet the guidelines; each is just an instance of data clustering software with nothing to indicate it is especially notable to the topic. Normally I would have replaced them with a DMOZ link, but there doesn't appear to be a category for this. -SpuriousQ (talk) 10:14, 22 March 2007 (UTC)
Actually it did take some time to build up such a list of software. Also, since this is an article about algorithms, any software implementing them is relevant to the topic (if you have an interest in the algorithm, it might very well be because you need a working implementation, or a reference implementation). Bryan
I understand that, but that list was just an invitation for spam and links to data clustering software someone wrote one day. WP:EL is clear that links should be kept to a minimum. I would probably have no problem with links to clearly notable implementations; for example, one that was the subject of a peer-reviewed published paper or one created by noted researchers in the field. -SpuriousQ (talk) 15:12, 28 March 2007 (UTC)
No problem, I accept the policy and see its utility, but I think the links and comments are important; that's why I asked for a DMOZ directory to be created. Also, there is a problem with the fact that you removed some internal Wikipedia links in the process (the YALE, Weka, Peltarion Synapse links). So, if we want to keep the links somewhere, do we try to move them to DMOZ or do we create an intermediary Wikipedia page for each package? Bryan
Hmm. The standard way of linking to internal articles is in the See also section. But I feel it's a bit questionable to give such prominence to the three instances that happen to have Wikipedia articles. Another idea would be to have an article List of data clustering software and link to that in See also. I would prefer that option, but it's also iffy to have a list article with so few examples. Do you know of any other articles we could add to that list? -SpuriousQ (talk) 01:58, 30 March 2007 (UTC)
Data mining, software section. The software links there actually suffer from the same problem as the ones here, although to a lesser extent (more Wikipedia articles, fewer external links). Please see discussion with David Eppstein. Bryan

I think the list should either be kept out of the article altogether, or it should be explicitly limited to a small number (say half a dozen) of particularly notable clustering systems, for which some strong argument can be made for their inclusion beyond "it's a clustering system" or even "it's a free clustering system". By a strong argument, I mean statements such as "it's the most widely used free clustering system for Linux" or the like. Anything less restrictive is just an invitation for an unencyclopedic link farm. —David Eppstein 03:22, 30 March 2007 (UTC)

I agree, it's time for a cleanup. Just to give a little "historical perspective": the list was started when commercial spam appeared, and it kept the spam under very good control because it gave people tempted to spam the page with their software's link a space to express themselves. The list has grown, and it now has many annotations, so I'm quite happy with it and would like to keep some of its info.
Now, to continue the conversation together with SpuriousQ, imagine we move everything to a list of Data mining software. We would lose info about implementations (libraries, scripts) that are not autonomous software, thus failing to highlight "reference implementations" that people may want to examine at a code level, and only linking to "working implementations" that people would like to use. So my proposition would be: if the software is autonomous, we add it to the data mining software list. If the software is just a library, and seems to be from academia, we keep a link to it. In our case, that would keep flexclust, COMPACT, mixmod and Cluster; the python-cluster link would be lost forever, and the rest would be transferred to the more general and richer data mining software list. What do you think? Bryan.
This would have been very useful a few weeks ago... trawling Google for OSS implementations and avoiding all the spam took ages. I am actually quite angry that someone just removed them all without a discussion. --MatthewKarlsen 09:34, 21 June 2007 (UTC)

There are other well-known (within the field) directories for such things... Maybe use one of them? [2] and [3], for example? stoborrobots (talk) 15:23, 22 February 2011 (UTC)

Affinity Propagation (Clustering by Passing Messages Between Data Points)

I believe that this algorithm will change the way people think about clustering. It was developed at the University of Toronto by Brendan Frey, a professor in the Department of Electrical and Computer Engineering, and Delbert Dueck, a member of his research group, and appeared in Science (journal) in Feb 07. See www.psi.toronto.edu/affinitypropagation/ and www.sciencemag.org/cgi/content/full/sci;315/5814/972 . However, I am not capable of writing a full introduction, so I hope someone better equipped for the job will do that. Including the AP breakthrough is a must, in my view, to retain the currency of this article. Bunty.Gill 08:50, 24 April 2007 (UTC)

QT: recursion or iteration?

"Recurse with the reduced set of points." (with link to recursion)

Is this really recursion? I would call it iteration. You repeat the process until you can't go any further, then you stop. Sounds like a while loop to me.

--84.9.95.214 17:28, 1 July 2007 (UTC)

If you take a look at the Recursion (computer science) article, it says: "Any function that can be evaluated by a computer can be expressed in terms of recursive functions without the use of iteration, in continuation-passing style; and conversely any recursive function can be expressed in terms of iteration." That is, you can rewrite anything primitive recursive as an iterative algorithm. Here, recursion is meant in the philosophical sense, since you apply the same analysis to a reduced set of points.
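
To show that the two phrasings describe the same procedure, here is a sketch of a simplified QT-style loop written both ways. It assumes distinct one-dimensional points, builds each candidate greedily, and omits any minimum-cluster-size stopping rule; the function names and data are illustrative, not the article's definition.

```python
def best_candidate(points, max_diameter):
    """Largest candidate cluster whose diameter stays within the limit."""
    best = []
    for seed in points:
        cluster = [seed]
        rest = [p for p in points if p != seed]
        while rest:
            # Pick the point whose farthest cluster member is closest.
            nearest = min(rest, key=lambda p: max(abs(p - q) for q in cluster))
            if max(abs(nearest - q) for q in cluster) > max_diameter:
                break
            cluster.append(nearest)
            rest.remove(nearest)
        if len(cluster) > len(best):
            best = cluster
    return sorted(best)


def qt_recursive(points, max_diameter):
    if not points:
        return []
    cluster = best_candidate(points, max_diameter)
    remaining = [p for p in points if p not in cluster]
    # "Recurse with the reduced set of points."
    return [cluster] + qt_recursive(remaining, max_diameter)


def qt_iterative(points, max_diameter):
    clusters = []
    remaining = list(points)
    while remaining:  # the same process as a while loop
        cluster = best_candidate(remaining, max_diameter)
        clusters.append(cluster)
        remaining = [p for p in remaining if p not in cluster]
    return clusters


data = [1.0, 1.2, 1.4, 5.0, 5.1, 9.0]
recursive_result = qt_recursive(data, max_diameter=0.5)
iterative_result = qt_iterative(data, max_diameter=0.5)
```

Because the recursive call is the last thing the function does, the while loop is a direct rewrite of it, so either word describes the algorithm.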

Requested move: Should be entitled 'Cluster Analysis' and not 'Data Clustering'.

As mentioned in the article there are several different terms used to describe cluster analysis. However, the most frequent term used in books, papers, etc, always appears to be cluster analysis by a fair margin. It would make sense to use the most widely used term as the page name.

Approximately 7-10 days ago I collected statistics on the number of hits on each term from several different sources. The data could be used as an approximate indicator for the prevalence of each term. The numbers in brackets represent papers released since January 2000. On average the ratio of [papers published since January 2000 to number of papers overall] for the ACM and IEEE combined is approximately 0.757 and 0.759 for data clustering and cluster analysis respectively, thereby suggesting a relatively stable prevalence of each term in the literature over time.

Source        Results for "Data Clustering"   Results for "Cluster Analysis"
IEEE          524 (368)                       571 (511)
ACM           682 (555)                       1159 (724)
Alexa         20,000                          130,000
Yahoo         96,900                          715,000
Google        353,000                         1,860,000
Ask           43,600                          307,200
Gigablast     38,219                          316,708
Live Search   27,279                          148,096

Based on these results, and the titles and content of books on the subject, I propose that the page title be changed to "Cluster Analysis".

--MatthewKarlsen 16:07, 15 July 2007 (UTC)

Page moved, per unopposed request. Cheers. -GTBacchus(talk) 01:00, 23 July 2007 (UTC)

Normalized Google Distance

Normalized Google Distance (NGD) -- when I saw this I thought it was a prank. It turns out that someone has actually written a (surprisingly well-informed) paper or two on this [4] -- but that does not mean it is a serious approach (or that it is not just the type of prank that bored computer scientists come up with in their free time).

Is anyone here informed on this? Is NGD a viable norm function (in its domain)? I have been unable to find any peer-reviewed publications regarding the topic (New Scientist hardly counts). --SteelSoul 23:12, 1 November 2007 (UTC)
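
For readers who have not seen the definition, here is a minimal sketch of NGD as usually stated: it is computed from the number of pages containing each term, the number containing both, and the total number of indexed pages. The function name and all counts below are hypothetical, and the sketch says nothing about whether the measure satisfies the axioms of a norm or metric.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts: fx and fy for each term, fxy for both, n pages."""
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Hypothetical counts: the terms always co-occur, so the distance is zero.
always_together = normalized_google_distance(1000, 1000, 1000, 10**6)
# Hypothetical counts: the terms rarely co-occur.
rarely_together = normalized_google_distance(100, 1000, 10, 10**6)
```

The value depends only on ratios of logarithms, so the base of the logarithm does not matter. The sketch requires fxy > 0.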

Limits of Cluster Analysis

I understand that CA will cluster just anything we throw at it: random data, linearly correlated data, etc. Could somebody knowledgeable please point out when NOT to use CA? --Stevemiller (talk) 04:56, 15 February 2008 (UTC)

It is often a good idea to compare the clustering results with results on random data, using confidence intervals. So I would say don't use clustering algorithms unless you are looking for or anticipating clusters (or their absence). Clustering algorithms can be used to estimate the probability of a particular distribution of clusters forming and compare that to a random (or other comparator) case. For example, you might be looking at two sets of star systems, and you want to know if there is a tendency for them to form into clusters in the presence of detectable levels of ficticium (a fictitious element). So you run your clustering algorithms over the data and see if there is a difference between the star systems with low ficticium and high ficticium, how that correlates to clustering, and whether that is statistically significant, most likely using confidence intervals. You also may need to compare it to random data. User A1 (talk) 16:02, 15 February 2008 (UTC)
First of all, many algorithms will only work on vector space data (e.g. k-means needs to be able to compute a mean). Then if you have unnormalized data, it will likely have a bias. And finally, in particular when you choose the various input parameters (e.g. k for k-means, but also distance functions and similar things) inappropriately, you'll often not get any sensible result. So it's not "plug and play"; much of the work is finding the appropriate parameters. --Chire2 (talk) 14:18, 7 May 2010 (UTC)
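
The bias from unnormalized data can be seen in a few lines. The sketch below uses made-up records and z-score standardisation as one common normalisation; both are assumptions for illustration. It shows that the nearest neighbour of a point changes once every attribute is put on the same scale.

```python
import math
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit (population) std deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [tuple((v - mu) / sd for v, mu, sd in zip(row, means, stdevs))
            for row in rows]


# Hypothetical records: (height in metres, income in dollars).
people = [(1.60, 30000.0), (1.95, 31000.0), (1.61, 33000.0), (1.90, 60000.0)]

# Nearest neighbour of the first person on the raw data: income dominates.
raw_nearest = min((1, 2, 3), key=lambda i: math.dist(people[0], people[i]))

# The same query after standardising both columns.
scaled = standardize(people)
scaled_nearest = min((1, 2, 3), key=lambda i: math.dist(scaled[0], scaled[i]))
```

On the raw data the income column swamps the height column simply because its numbers are larger, so any distance-based clustering would in effect cluster on income alone.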


One person's opinion. Please allow me to suggest a criterion for determining if two separate clusters really should be separate: two distinct, well-defined subpopulations constitute distinct clusters if and only if characteristics of interest possess a non-zero difference in their means at a statistically significant level of confidence. Thus: (1) a criterion for identifying distinct clusters is given, utilizing a standard, well-accepted statistical methodology, the difference between two means; (2) whether two clusters are distinct may be ambiguous, depending on the confidence level required. For example, two subpopulations may be distinct clusters with 95% confidence but not 99% confidence; (3) if no distinct clusters are identified under this criterion, cluster analysis fails: the null hypothesis cannot be rejected and the population should therefore be considered homogeneous at the level of statistical confidence used in testing the hypothesis of distinctness. --Preceding unsigned comment added by Davidjcorliss (talk • contribs) 02:36, 24 May 2011 (UTC)
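
A minimal sketch of what such a check could look like for one characteristic of two candidate clusters follows. The comment does not say which test is meant, so Welch's unequal-variance t-test is an assumption here, as are the function name and the data. The sketch returns the t statistic and the degrees of freedom. Turning those into a confidence level needs a t-distribution table or a statistics library (for example `scipy.stats.ttest_ind(a, b, equal_var=False)`).

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for a difference in means."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error contributed by each sample.
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical values of one characteristic in two candidate clusters.
cluster_a = [1.0, 2.0, 3.0, 4.0, 5.0]
cluster_b = [6.0, 7.0, 8.0, 9.0, 10.0]
t_stat, dof = welch_t(cluster_a, cluster_b)
```

The larger the absolute t statistic for a given number of degrees of freedom, the stronger the evidence that the two means differ. Point (2) of the comment corresponds to the same statistic passing the 95% threshold but not the 99% one.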

The mean is only a sensible cluster approximation when you have spherical clusters such as those produced by k-means or EM, and a sensible notion of mean such as that given by a vector space. For the more advanced cluster analysis methods and data models, this may not be appropriate. Even k-medoids already highlights that there may not be a sensible "mean" at all (despite being the simplest modification of k-means that can do without having a computable mean). Maybe you're just using k-means too much. :-) --Chire (talk) 05:29, 24 May 2011 (UTC)
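
To illustrate the point about data without a computable mean, here is a minimal sketch of a medoid: the cluster member with the smallest total distance to all other members. The strings and the choice of Hamming distance are assumptions for illustration. No "average string" exists, yet a representative can still be chosen.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def medoid(items, distance):
    """The item with the smallest summed distance to every item in the set."""
    return min(items, key=lambda c: sum(distance(c, o) for o in items))


names = ["karolin", "kathrin", "kerstin"]
centre = medoid(names, hamming)
```

The medoid is always an actual data point, so this works with any distance function, whereas a mean requires the data to live in a vector space.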

"Elbow criterion" is in violation of WP:NOR

Theoretical and Empirical Separation Curves

Use of the term "cluster" to refer to a subset of a group of attributes which define a bounded class shows an obvious lack of comprehension of the subject matter. Use of the term "cluster" is not valid in this context when referring to the number of attributes as a selected subset of a group of attributes but valid only when referring to a multiset count of the values of an attribute where the count of the set or multiset values equals the number of clusters.

The number of attributes in the subset is not arbitrarily selected or fixed, but is initially set to one for the first separation analysis and thereafter progressively incremented until 100% separation is achieved, or until the target set size exceeds computer capacity or the time allocated for classification is exceeded. The minimum number of attributes (not clusters) is determined mathematically as follows:

 t_{min} = \frac{\log G}{\log V}, where:
  • t_{min} is the minimal number of characteristics to result in theoretical separation,
  • G is the number of elements in the bounded class and
  • V is the highest value of logic in the group.
"Elements, attributes, subsets, multisets and clusters."

71.100.14.204 (talk) 20:47, 11 September 2008 (UTC)
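
For reference, the formula in the comment above can be read as a counting bound: t characteristics that each take one of V values can distinguish at most V^t elements, so separating G elements needs V^t >= G. The sketch below is only that reading, with a hypothetical function name. It finds the smallest whole number t by integer arithmetic, which avoids rounding problems with floating-point logarithms.

```python
def min_characteristics(g, v):
    """Smallest whole t with v ** t >= g, i.e. log(g) / log(v) rounded up."""
    t = 0
    capacity = 1  # number of elements that t characteristics can distinguish
    while capacity < g:
        capacity *= v
        t += 1
    return t


binary_1000 = min_characteristics(1000, 2)  # log(1000) / log(2) is about 9.97
binary_8 = min_characteristics(8, 2)
ternary_27 = min_characteristics(27, 3)
```

This says nothing about whether the characteristics actually available achieve the bound, only how many are needed at minimum.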

Wikipedia:Articles for deletion/Optimal classification. I see you are attacking other clustering algorithms now. How nice. --Jiuguang (talk) 16:40, 11 September 2008 (UTC)
Actually wang I almost informed you of the opportunity here to apply the other skill you know best besides backstabbing, which is nominating articles for deletion, but decided why bother. Having added stalking to your list of skills it would not be long before you showed up without my help. BTW - I know why you are not into this stuff and want to delete it... robots don't need it, they go right from decision table construction to doing their thing.
The only reason you are tagging this as OR is because it directly contradicts your statements on Rypka's method as the only algorithm that can determine the optimal cluster size. Please stop - you've already been blocked for sock puppetry, personal attacks, evading infinite blocks, etc; you don't want to add vandalism to the list. --Jiuguang (talk) 23:38, 11 September 2008 (UTC)
You're delusional wang. Get a life. —Preceding unsigned comment added by 71.100.167.222 (talk) 23:40, 11 September 2008 (UTC)
Well... doesn't this IP look familiar? - Jameson L. Tai talkcontribs 05:33, 12 September 2008 (UTC)
Not quite as familiar as the jail cells with which you and the other members of the robotics cabal will become familiar. --Preceding unsigned comment added by 71.100.10.82 (talk) 09:04, 12 September 2008 (UTC)
Robotics Cabal! That's a new one. Actually, it's quite catchy. Jiuguang, would you like to join the Robotics Cabal? - Jameson L. Tai talkcontribs 15:17, 12 September 2008 (UTC)
He's probably busy with his buddy Chavez. Maybe he'll be free later, or you could join them. —Preceding unsigned comment added by 71.100.3.239 (talk) 17:59, 12 September 2008 (UTC)

A user coming from several different IP numbers 71.100.*.* (DSL Verizon) has a compulsion to add tags such as "original research" around the "elbow criterion". It is not apparent to me why this is so. Looking at the explained variance as a function of the number of clusters is a well-known method. I haven't heard of the term "elbow criterion" before, but looking at Google Scholar [5] there seems to be no doubt that it is used in peer-reviewed communication. -- fnielsen (talk) 12:24, 15 September 2008 (UTC)

Right, I'm adding tags just out of compulsion to add tags and not because the content is bogus. My question is why you guys do not want trash to be replaced with bona fide content? --Preceding unsigned comment added by 71.100.4.227 (talk) 17:16, 15 September 2008 (UTC)

71.100.*.*, let me put it this way - your conduct is in direct violation of the Verizon Internet Acceptable Use Policy (see [6]), and consider this your final warning to stop your range of disruptive activities on Wikipedia, including but not limited to

  1. Vandalism
  2. Personal attacks
  3. Trolling of WP:Reference desk and WP:Village Pump.
  4. Harassment of editors, both on and off wiki

A report to Wikipedia:Abuse reports will take place for any further abuse, which will then result in an official communication to Verizon, possibly leading to the termination of your account (as detailed by the AUP). You don't want this. Please stop. --Jiuguang (talk) 17:50, 15 September 2008 (UTC)

Image for intuition

Hey, it would be nice to have some image that intuitively shows the idea of clustering. Usually in courses on machine learning or tutorials on clustering such images are shown. They are usually two dimensional depicting a number of points dispersed in the coordinate system, circles mark clusters/groups of points. I think for somebody opening an article about clustering and who is new to the topic such an image could be very helpful. Ben T/C 14:03, 10 February 2009 (UTC)

Hierarchical clustering should be its own article

That's a pretty big topic. It shouldn't be just a subsection of this article. I may have to start one if no one else does within the next several months. Makewater (talk) 20:36, 6 April 2009 (UTC)

I agree. Unfortunately I know very little about it. -3mta3 (talk) 08:14, 7 April 2009 (UTC)

Standalone page for "Choosing the number of clusters in a data set"

The number of different ways to choose k seems to warrant more than a subsection on this page, especially since identification of the number of clusters in a data set is a separate issue from ways of actually performing clustering. I've expanded the former subsection on the topic into a standalone page, Determining the number of clusters in a data set. -JohnMeier (talk) 00:42, 8 April 2009 (UTC)

Rewrite

I've tagged this article as a cleanup as it is becoming very confusing and unwieldy. My vague suggestions:

  • Focus on the main types of clustering, with a section for each. My impression is that the most common are:
  • Simplify all the others to bullet points (or mention in the appropriate section if they are a variant of the above). Create sub articles if need be.
  • Prune the reference and external links lists

Any comments? —3mta3 (talk) 17:00, 18 May 2009 (UTC)

Support. Also spectral clustering is worth a separate section in my opinion. Took (talk) 22:20, 7 September 2009 (UTC)
I also agree (with both), spectral clustering is widely being used in graph theory. So, I think it should be also a separate section. --Conjugado (talk) 15:19, 11 January 2010 (UTC)

Merger proposal

Yes, the two should be merged. I don't think Wikipedia is a how-to manual, so the applications, use and basic methodology should be in a single article. --Maven111 (talk) 12:36, 3 March 2010 (UTC)

The article cluster analysis (in marketing) seems to repeat a lot of information. Should we incorporate it somehow? —3mta3 (talk) 08:19, 21 May 2009 (UTC)

I think it's not a good idea, because the cluster analysis is dealing with the methods and the "in marketing" one is simply mentioning the uses of the analysis in economy. Kroolik (talk) 11:45, 25 May 2009 (UTC)

The applications of cluster analysis are numerous, but in all cases it is for grouping and the underlying method is domain independent. Nearly all of the marketing version is on cluster analysis with some added relationship with factor analysis, and some other multidimensional ordination/projection techniques. The merge seems to be a good idea. An application section can be added to this article. Shyamal (talk) 10:45, 29 May 2009 (UTC)

This section should be merged with market segmentation discussions of benefit segmentation analysis (marketing). It is an application of a broader concept, but it is very specific to benefit segmentation. —Preceding unsigned comment added by 74.75.128.221 (talk) 23:43, 25 October 2009 (UTC)


In my opinion, the cluster analysis article is already too long, so the marketing application should not be merged. Instead, more parts could be split out into separate articles (there is such a request for the agglomerative hierarchical methods). The "applications" section gives a lot more applications than the "market analysis" application. --Chire2 (talk) 13:59, 7 May 2010 (UTC)


Yes - the two should be merged. I apply cluster analysis in marketing segmentation as a consultant and also in astrostatistics and in education analysis in my academic research. Yet, the mathematics is, at all points, the same. It is a single discipline with a multiplicity of applications. Davidjcorliss (talk) 02:46, 24 May 2011 (UTC)

Well, describe the mathematical parts here, and the application in various domains on a separate page? This one is overfull and a mess. And yet, it barely touches the advanced methods. There is so much beyond k-means and single-link, but everybody in "application" still uses them despite all their known drawbacks and defects (in particular, preferring clusters of the same spatial size due to the Voronoi partitioning is clearly not sensible in real-world applications). Just recently I met a biology researcher who was surprised to find OPTICS produce much more useful results for him on his retina data ... despite OPTICS having been around for over 10 years already, it has only just started to arrive in applications. Therefore, in my opinion, the article should include much more of these advanced methods, and the application examples should be moved to separate pages. --Chire (talk) 05:41, 24 May 2011 (UTC)

link to cluster sampling

Hi, I found this page when I was looking for information on Cluster Sampling. Perhaps there should be one of those nifty disambiguation links at the top of this page. I don't really understand what this cluster analysis thingy is or how important it is, so I'm not sure whether or not such a link would be justified, but it would have saved me some time. 220.239.204.226 (talk) 05:43, 4 November 2009 (UTC)

Red links are created based on the following.....

--222.64.209.26 (talk) 03:37, 20 November 2009 (UTC)

--222.64.209.26 (talk) 04:01, 20 November 2009 (UTC)

--222.64.209.26 (talk) 04:04, 20 November 2009 (UTC)

--222.64.209.26 (talk) 04:19, 20 November 2009 (UTC)

--222.64.209.26 (talk) 04:28, 20 November 2009 (UTC)

--222.64.209.26 (talk) 04:29, 20 November 2009 (UTC)

--222.64.209.26 (talk) 05:25, 20 November 2009 (UTC)

--222.64.209.26 (talk) 05:29, 20 November 2009 (UTC)

It's a pity that the application of the technique has been limited

--222.64.209.26 (talk) 03:40, 20 November 2009 (UTC)

I'm sure if the CA is used in conjunction with the DNA profiling for Population management, lots of overloads of population can be managed.--222.64.209.26 (talk) 03:46, 20 November 2009 (UTC)

Look at that...

--222.64.209.26 (talk) 03:51, 20 November 2009 (UTC)

--222.64.209.26 (talk) 04:14, 20 November 2009 (UTC)

Addressing WHAT IS NOT

--222.67.208.51 (talk) 06:46, 24 November 2009 (UTC)

clustergram - a method for visualizing cluster analysis results

Hi all,

I wrote an article about a method to visualize cluster analysis, here: http://www.r-statistics.com/2010/06/clustergram-a-graph-for-visualizing-cluster-analyses-r-code/

I was wondering where (and if) to add the above information about this method to the page, but couldn't quite figure out where in the page to do so. Any suggestions?

Talgalili (talk) 16:36, 15 June 2010 (UTC)

I don't think it needs to be included. It's just one out of a dozen visualization methods around. The article needs cleanup, not further bloat. --Chire (talk) 16:44, 15 June 2010 (UTC)
Hi Chire, I'll take your opinion in the matter and won't try to add it.
At the same time, why not include in the article various visualization methods for clustering? (or start another article on the subject). I (personally) find it both interesting and useful.
With much respect, Talgalili (talk) 16:50, 15 June 2010 (UTC)
I do agree that cluster visualization is an interesting subject. But beware of WP:COI. I am also not sure whether it actually suits the "encyclopedia" aspects of Wikipedia. Given that articles these days are quickly deleted, my recommendation would be to start the article in your user namespace (i.e. User:Talgalili/Cluster_visualization), and move it to the main namespace once you have substantial content: multiple visualizations, links to the visualized algorithms (the article you linked seems to be very k-means-centric, and k-means is by far not the most advanced clustering algorithm; its results are quite unstable, and k is not easy to choose right), and all the references to the relevant articles. This will likely save you some headaches and frustration, since new articles often face a "request for deletion" within a few days, unfortunately. --Chire (talk) 17:52, 15 June 2010 (UTC)
P.S. It is also related to Scientific visualization, Data visualization and Scatter plot; these might already contain some clustering visualization information. --Chire (talk) 18:25, 15 June 2010 (UTC)
Hello Chire, thank you for the suggestions :)
It sounds like a big project to take on myself. Maybe at a later stage.
BTW, that technique was only implemented there for k-means, but it is meant to help with assessing any clustering algorithm.
Cheers, Talgalili (talk) 19:51, 15 June 2010 (UTC)

Error in one of the formulae for Spectral clustering?

In the section on spectral clustering, I think the formula P = S*D^(-1) should instead be P = D^(-1)*S, as written in eq. 4 of the paper by Meila and Shi, namely "A random walks view of spectral segmentation", Meila, M., Shi J., AISTATS 2001. In general D^(-1) and S DO NOT commute. One definition leads to the transpose of the other for P: same eigenvalues but different eigenvectors. Unless someone can provide a justification for this formula, I think this could be a bona fide error. TonyMath (talk) 19:47, 19 April 2011 (UTC)
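
The non-commutation point can be checked numerically. Below is a minimal sketch in plain Python, assuming a made-up symmetric 3×3 similarity matrix S (illustration data only, not taken from the article or from Meila and Shi); the helper names are hypothetical. It suggests that D^(-1)*S is row-stochastic with the all-ones vector as a right eigenvector, while S*D^(-1) is its transpose and instead has the degree vector as a right eigenvector for the same eigenvalue 1.

```python
# Sketch only: compare P_row = D^(-1) S with P_col = S D^(-1).
# S is invented illustration data; it must be symmetric with positive row sums.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Multiply a matrix by a column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def close(u, v, tol=1e-12):
    """Element-wise comparison of two vectors up to a small tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(u, v))

S = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 3.0],
     [1.0, 3.0, 0.0]]
n = len(S)
d = [sum(row) for row in S]                      # degrees: row sums of S
D_inv = [[1.0 / d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
ones = [1.0] * n

P_row = matmul(D_inv, S)   # D^(-1) S: each row sums to 1
P_col = matmul(S, D_inv)   # S D^(-1): each column sums to 1

# The two definitions are transposes of each other (S is symmetric).
is_transpose = all(close(r, c) for r, c in zip(P_row, transpose(P_col)))

print(is_transpose)                      # True
print(close(matvec(P_row, ones), ones))  # True: ones is a right eigenvector of D^(-1) S
print(close(matvec(P_col, ones), ones))  # False: but not of S D^(-1)
print(close(matvec(P_col, d), d))        # True: the degree vector is, with eigenvalue 1
```

So the two forms carry the same spectrum, but any step that reads cluster structure off the eigenvectors would depend on which convention is used.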

Something else: is that definition for the Laplacian matrix correct? because the link on the Laplacian matrix gives L = D-A and that agrees with the Mathworld definition but I don't know how to reconcile it with the formula used? The formula used is L = I - D^(-1/2)*S*D^(-1/2) which is eq. (5) of another paper by Meila and Shi entitled "Multi-way cuts and spectral clustering" taken from "Spectral Graph Theory" by Fan R.K. Chung but are the formulae consistent?

Laplacian matrix in Spectral Clustering

This section is not complete. It should give more details about the eigenvector extraction, and explain why it is done. Moreover, the formula for the Laplacian matrix is L=D-S. The formula in the article is actually the one for the normalized Laplacian matrix. This normalization is due to the relaxation of the constraints on the indicator vector. —Preceding unsigned comment added by Guillaumew (talkcontribs) 09:26, 4 May 2011 (UTC)
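
For reference, the two definitions discussed in this thread can be written side by side. This is a sketch assuming S is a symmetric similarity matrix and D is the diagonal degree matrix with strictly positive entries; the subscripted names are labels chosen here for illustration, not notation taken from the article.

```latex
\begin{align}
  L &= D - S
      && \text{(unnormalized Laplacian)} \\
  L_{\mathrm{sym}} &= I - D^{-1/2} S D^{-1/2}
      = D^{-1/2} (D - S) D^{-1/2}
      = D^{-1/2} L \, D^{-1/2}
      && \text{(symmetric normalized Laplacian)} \\
  L_{\mathrm{rw}} &= I - D^{-1} S = D^{-1} L
      && \text{(random-walk form, with } P = D^{-1} S \text{)}
\end{align}
```

Under these assumptions the formulae are consistent: the normalized form is the unnormalized Laplacian rescaled by D^(-1/2) on both sides, so neither definition contradicts the other.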

Software links

I read the earlier discussion "Software links were removed". I can somewhat appreciate the concerns raised but I think there is a problem.

Without information about relevant analysis software, the article seems to me lacking. I came to the article with a general concept of what cluster analysis is about and a desire to find out how to look for clusters in a data set I have.

The problem is that (in my perception) this article goes from the general conceptual level into the advanced level of discussing different theoretic approaches, with very little in between.

The article List of statistical packages has an extensive list. Could this article refer to that article to the extent of indicating packages there that include cluster analysis? Ideally also indicating which packages would be more helpful to someone unused to cluster analysis?

If this sort of information is NOT appropriate to have in the article, can anyone offer me any "private" user talk page advice? Thanks. Wanderer57 (talk) 19:21, 11 May 2011 (UTC)

Link spam is a big issue in this article, therefore I'm opposed to adding a software section. It might however be ok to start an article "List of Cluster Analysis Software" and have it linked. However, understand WP:NOT. If it just boils down to collecting links, it doesn't belong in Wikipedia; instead we should just link to the appropriate DMOZ/ODP category. --Chire (talk) 09:02, 12 May 2011 (UTC)
I think a section here or a separate list is ok if and only if all of the entries in the list are wikilinked articles to notable packages. I'd prefer not to see any external links in such a section or list. —David Eppstein (talk) 15:00, 12 May 2011 (UTC)
This also occurred to me on that List of statistical packages - pretty much all of them have Wikipedia articles, so this must be a list with "notable" packages only. Because there ought to be just thousands more... As for this article, I believe it is a bit too messy and confusing already, I'd prefer splitting such things off to a separate article. Especially since there is little overlap (references etc.) I guess. --Chire (talk) 17:10, 12 May 2011 (UTC)

Boldface & Robotics project attention needed

There seems to be an excessive use of boldface. In particular, there is what appears to be simply a list of applications of clustering, with very little prose in it. Some trimming is suggested as per MOS:BOLD#Boldface, where the specific example shows the list as a list article, not in the middle of a non-list article. Chaosdruid (talk) 14:23, 16 February 2012 (UTC)

Spectral Clustering

No link or mention in this article of spectral clustering, which has its own entire wikipedia article. — Preceding unsigned comment added by 192.249.47.174 (talk) 21:05, 19 June 2012 (UTC)

Adding citations to support the statement that Clustering is a main task of explorative data mining

On 11/3/2012 I added 4 independent citations to support the article's statement that Clustering is a main task of explorative data mining. These citations were removed the same day. I propose that it would be good to have citations for claims like this.

I am a novice wikipedia author, and on 11/3, I did not sign my edit or say anything on the TALK page. I now know that I should do both of these things. I should also point out that I am an author on 2 of the citations I added. However, I do not think that there is a COI in this. It is an area in which I have done research, so it is an area I know. I am simply trying to add citations that I think would help the article. However, if other wikipedia authors evaluate this and feel that my 2 citations should be excluded, I would still encourage the community to retain the other two citations that I added on 11/3.

Thank you for considering this. Karl (talk) 01:43, 6 November 2012 (UTC)

Long sentence with little value.

The sentence "Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion." is very long and adds very little value to the article. It should be rewritten. — Preceding unsigned comment added by 92.229.28.245 (talk) 13:44, 30 January 2013 (UTC)
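
A small worked example may say the same thing more compactly than the quoted sentence. The sketch below is plain Python and assumes the common form of the Davies–Bouldin index (mean distance to the centroid as the within-cluster scatter, Euclidean distance between centroids as the separation); the function names and the toy points are made up for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def davies_bouldin(clusters):
    """Davies-Bouldin index of a clustering given as a list of point lists.

    Assumes at least two clusters with distinct centroids.
    Lower values indicate compact, well-separated clusters.
    """
    cents = [centroid(c) for c in clusters]
    # Within-cluster scatter: mean distance of the points to their centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

# The same four points, grouped in two different ways.
compact = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # tight, far apart
stretched = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]] # wide, close together

db_compact = davies_bouldin(compact)
db_stretched = davies_bouldin(stretched)
print(db_compact < db_stretched)  # True: the compact grouping scores lower
```

In short: low scatter within clusters and large distance between centroids both push the index down, so the grouping with the smallest index is preferred under this criterion.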

Terminology: "classification" is supervised, "clustering" is unsupervised -- Really?

Please see Talk:Statistical classification. Fgnievinski (talk) 23:38, 3 May 2014 (UTC)

Misplaced comment by Ninjarua

The data in the figures should be explained, i.e. what the x-axis and the y-axis represent in the figures above. The same problem occurs in other sections of this article. Moreover, the data in the two figures are different, so how can we compare the two methods?

The preceding comment was placed at the bottom of the Connectivity based clustering section in the article by Ninjarua. — Anita5192 (talk) 17:55, 21 June 2014 (UTC)
This is artificial data. There is nothing to explain on the axes. The data sets are independent and cannot be compared; but it is the same method. See other sections for results by other methods on the same data. But the article is organized by method, not by data set (as the data sets are not of interest). --Chire (talk) 08:38, 23 June 2014 (UTC)

Multi-assignment clustering?

The topic of multi-assignment clustering seems to be missing altogether from wikipedia. --Nicolamr (talk) 22:55, 28 July 2014 (UTC)

I haven't seen that term used a lot. But isn't this the same as this (in the article):
* overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.
It's not covered in much detail, but it is also not covered much in the literature; there are not many "landmark" approaches yet like k-means and DBSCAN. --Chire (talk) 09:14, 31 July 2014 (UTC)
You are right, it's precisely that. Thanks for the answer. --Nicolamr (talk) 19:57, 31 July 2014 (UTC)