Talk:Iris flower data set

Statistics Low‑importance

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Low-importance on the importance scale.

Plants Low‑importance

This article is within the scope of WikiProject Plants, a collaborative effort to improve the coverage of plants and botany on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Low-importance on the project's importance scale.

Computer science Mid‑importance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of computer science-related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Mid-importance on the project's importance scale.

Dataset

"Python tutorial"

The dataset section is written as a Python tutorial, which I think is inappropriate; this page should really be about historical information on Fisher's Iris dataset. Wiki is not code.org. 136.168.148.56 (talk) 23:30, 30 October 2019 (UTC)

Dialog for showing/hiding data set

The table containing the data set can be expanded or hidden. If it is expanded, it has a neatly typeset heading. If it is hidden, however, the short heading runs over three lines (at least for me). Is this intended behaviour? I think it isn't, but I don't know how to change it. It would be nice if someone could fix it. NerdOnTour (talk) 08:36, 22 October 2021 (UTC)

"Are separable" .. really? without over-fitting?

This article makes the claim that the dataset is separable, but this is hardly obvious. It's separable if you don't mind over-fitting the data, but it's not clear (to me) that any classifier can do this under, say, a 3-fold or a 5-fold cross-validation. I'd like to see a citation for this. 99.153.64.179 (talk) 17:39, 2 August 2013 (UTC)

BTW, the citation given (A.N. Gorban, N.R. Sumner, and A.Y. Zinovyev, Topological grammars for data approximation, Applied Mathematics Letters, Volume 20, Issue 4 (2007), 382-386) does NOT perform a k-fold cross-validation. They appear to simply over-train on the entire dataset, which is not all that hard to do. 99.153.64.179 (talk) 17:47, 2 August 2013 (UTC)

In the cited article, the entire dataset is projected onto the first principal tree. This tree is built without any hints about the classes (a completely unsupervised task). There is no fitting for a classification problem in this work at all. It happens that the classes are separated in the projection onto this tree. In the projection onto the first classical principal component these classes are not separated, of course. They are also not linearly separable. I hope that this comment answers the question about fitting and over-fitting. Agor153 (talk) 17:58, 10 October 2013 (UTC)

I figure the term "is separable" is vague, and in many cases, people will use it even when some classification error remains. As in: "you can build a model using linear separations that gets 95% right". Language is imprecise. Even mathematical language will contain some ambiguity, unless you include all preliminaries and notations, and then it will be too verbose for an encyclopedia... --Chire (talk) 17:49, 15 October 2013 (UTC)
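For readers following this thread, the difference between scoring a classifier on its own training data and scoring it with k-fold cross-validation can be sketched as below. This is only an illustration under stated assumptions: the data are made-up two-class toy points (not the Iris measurements), the classifier is a plain 1-nearest-neighbour rule chosen for simplicity, and all function names are hypothetical.

```python
import random

def nearest_neighbor_predict(train, point):
    """Return the label of the training item closest to `point` (1-NN)."""
    best = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)))
    return best[1]

def accuracy(train, test):
    """Fraction of `test` items whose label 1-NN predicts correctly from `train`."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

def k_fold_accuracy(data, k, seed=0):
    """Mean held-out accuracy over k folds; every item is tested exactly once."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        scores.append(accuracy(train, test))
    return sum(scores) / k

# Hypothetical toy data: two overlapping Gaussian blobs, NOT the Iris data.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(60)]
data += [((rng.gauss(1, 1), rng.gauss(1, 1)), "b") for _ in range(60)]

train_acc = accuracy(data, data)     # evaluated on the data it memorised
cv_acc = k_fold_accuracy(data, k=5)  # evaluated on held-out folds

print(f"training-set accuracy: {train_acc:.2f}")
print(f"5-fold CV accuracy:    {cv_acc:.2f}")
```

Because 1-NN memorises its training set, the training-set score is perfect by construction, while the cross-validated score reflects how much the classes actually overlap. A claim of "separable" based only on the first number says little about generalisation.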

k-means image is horrible

The k-means image illustrates a very poor execution of the k-means algorithm. It is unclear what the reader is supposed to get out of a comparison where the I. setosa cluster is inappropriately split and the I. versicolor and I. virginica populations are inappropriately merged. Should this be recreated with a better k-means implementation? --Physicsmichael (talk) 01:00, 5 February 2014 (UTC)

The text of the article explains that the data set does not cluster well, and is therefore not a good choice for evaluating clustering algorithms. It's not a matter of the k-means implementation; the result shown may actually be the global minimum of the k-means objective (unverified, but plausible). The image shouldn't be read as a standalone thing IMHO. --Chire (talk) 16:24, 10 February 2014 (UTC)
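One way to probe the "may be the global minimum" point is to run k-means from many random initialisations and compare the within-cluster sum of squares (WCSS) of each run. The sketch below does this with Lloyd's algorithm on made-up toy points (not the Iris data); the function names and the choice of 20 restarts are assumptions for illustration. Many restarts agreeing on the lowest WCSS is evidence of a global minimum, not a proof.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, k, rng, iterations=100):
    """One k-means run (Lloyd's algorithm) from a random initialisation."""
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return wcss, centers

def best_of_restarts(points, k, restarts, seed=0):
    """Run Lloyd's algorithm `restarts` times; return the best run and all WCSS values."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0]), [run[0] for run in runs]

# Hypothetical toy data: one isolated blob and two overlapping blobs.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)) for _ in range(50)]
points += [(data_rng.gauss(4, 0.6), data_rng.gauss(4, 0.6)) for _ in range(50)]
points += [(data_rng.gauss(5, 0.6), data_rng.gauss(5, 0.6)) for _ in range(50)]

best, all_wcss = best_of_restarts(points, k=3, restarts=20)
print(f"lowest WCSS over 20 restarts: {best[0]:.2f}")
print(f"distinct local optima found:  {len(set(round(w, 6) for w in all_wcss))}")
```

Note that the lowest-WCSS solution is simply the best value of the k-means objective; it need not match the true class labels, which is the point made about the image.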

Highest Accuracy Achieved?

What is the highest accuracy that has been achieved on this dataset? — Preceding unsigned comment added by 129.59.79.147 (talk) 18:11, 23 October 2014 (UTC)

With careless overfitting and sloppy evaluation: 100%. --Chire (talk) 08:47, 24 October 2014 (UTC)

Sources regarding controversy?

I removed the following sentence from the intro as it was unsourced: "Fisher's paper was published in the journal, the Annals of Eugenics, creating controversy about the continued use of the Iris dataset for teaching statistical techniques today." I did try to find sources before removing it (searching Google and Google Scholar for terms like "iris dataset controversy", "iris dataset eugenics", "iris dataset racism"), but all I could find were tweets, a Reddit post, and a couple of posts from personal blogs ([1], [2]). I don't think these are sufficient to support the claim, but if anyone can find better sources, feel free to restore it. Colin M (talk) 18:35, 15 January 2021 (UTC)