Talk:Iris flower data set

From Wikipedia, the free encyclopedia

"Are separable" .. really? without over-fitting?

This article claims that the dataset is separable, but this is hardly obvious. It's separable if you don't mind over-fitting the data, but it's not clear (to me) that any classifier can do this using, say, 3-fold or 5-fold cross-validation. I'd like to see a citation for this. (talk) 17:39, 2 August 2013 (UTC)

BTW, the citation given (A.N. Gorban, N.R. Sumner, and A.Y. Zinovyev, Topological grammars for data approximation, Applied Mathematics Letters, Volume 20, Issue 4 (2007), 382-386) does NOT perform a k-fold cross-validation. They appear to simply over-train on the entire dataset, which is not all that hard to do. (talk) 17:47, 2 August 2013 (UTC)
In the cited article, the entire dataset is projected onto the first principal tree. This tree is built without any hints about the classes (a completely unsupervised task). There is no fitting for a classification problem in this work at all. It happens that the classes are separated in the projection onto this tree. In the projection onto the first classical principal component these classes are not separated, of course. They are also not linearly separable. I hope that this comment answers the question about fitting and over-fitting. Agor153 (talk) 17:58, 10 October 2013 (UTC)
I figure the term "is separable" is vague, and in many cases, people will use it even when some classification error remains. As in: "you can build a model using linear separations that gets 95% right". Language is imprecise. Even mathematical language will contain some ambiguity, unless you include all preliminaries and notations, and then it will be too verbose for an encyclopedia... --Chire (talk) 17:49, 15 October 2013 (UTC)
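
The gap between fitting on the entire dataset and k-fold cross-validation, as raised in this thread, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the points are made-up Gaussian toy data (not the Iris measurements), the classifier is a plain 1-nearest-neighbour rule, and the helper names (`nearest_neighbor_predict`, `k_fold_accuracy`, `make_toy_data`) are hypothetical. It is not the method of the cited paper.

```python
import random


def nearest_neighbor_predict(train, point):
    """Return the label of the training point closest to `point` (1-NN)."""
    best = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )
    return best[1]


def accuracy(train, test):
    """Fraction of `test` rows whose label 1-NN on `train` predicts correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)


def k_fold_accuracy(data, k, seed=0):
    """Mean held-out accuracy over k folds of a shuffled copy of `data`."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(train, held_out))
    return sum(scores) / k


def make_toy_data(n_per_class=50, seed=1):
    """Two overlapping 2-D Gaussian classes; made-up data, NOT Iris."""
    rng = random.Random(seed)
    data = []
    for label, center in (("a", 0.0), ("b", 1.5)):
        for _ in range(n_per_class):
            point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((point, label))
    return data


data = make_toy_data()
resubstitution = accuracy(data, data)          # evaluated on the training set
cross_validated = k_fold_accuracy(data, k=5)   # evaluated on held-out folds
print(f"training-set accuracy: {resubstitution:.2f}")
print(f"5-fold CV accuracy:    {cross_validated:.2f}")
```

The training-set figure is 100% by construction, because every point is its own nearest neighbour; only the held-out figure says anything about how well the classes can actually be told apart.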

k-means image is horrible

The k-means image illustrates a very poor execution of the k-means algorithm. It is unclear what the reader is supposed to get out of a comparison where the I. setosa cluster is inappropriately split and the I. versicolor and I. virginica populations inappropriately merged. Should this be recreated with a better k-means implementation? --Physicsmichael (talk) 01:00, 5 February 2014 (UTC)

The text of the article explains that the data set does not cluster well and is therefore not a good choice for evaluating clustering algorithms. It's not a matter of the k-means implementation: the result shown may actually be the global minimum (unverified, but plausible). The image shouldn't be read as a standalone thing IMHO. --Chire (talk) 16:24, 10 February 2014 (UTC)
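
One hedged way to probe whether a given k-means result is a poor local optimum is to rerun the algorithm from many random starts and compare the objective (within-cluster sum of squared distances). The sketch below uses a plain Lloyd iteration on made-up 2-D toy points (not the Iris measurements); the names `lloyd` and `make_toy_points` are hypothetical. Restarts cannot prove that the lowest value found is the global minimum, but a result far above it is clearly not.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from a random start.

    Returns (objective, centers), where objective is the within-cluster
    sum of squared distances for the final centers.
    """
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[idx].append(p)
        # Move each center to its cluster mean; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    objective = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return objective, centers


def make_toy_points(seed=1):
    """Three blobs, two of them close together; made-up data, NOT Iris."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (4.0, 0.0), (5.0, 1.0)]
    return [
        (rng.gauss(cx, 0.7), rng.gauss(cy, 0.7))
        for cx, cy in blob_centers
        for _ in range(50)
    ]


points = make_toy_points()
rng = random.Random(0)
objectives = sorted(lloyd(points, 3, rng)[0] for _ in range(20))
print(f"best objective over 20 restarts:  {objectives[0]:.2f}")
print(f"worst objective over 20 restarts: {objectives[-1]:.2f}")
```

If the clustering shown in the image scored at or near the best objective found this way, that would support the view that the odd-looking split is a property of the k-means objective on this data rather than of a particular implementation.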

Highest Accuracy Achieved?

What is the highest accuracy that has been achieved on this dataset? — Preceding unsigned comment added by (talk) 18:11, 23 October 2014 (UTC)

With careless overfitting and sloppy evaluation: 100%. --Chire (talk) 08:47, 24 October 2014 (UTC)