Template talk:Machine learning bar

"Models"

This section's title and contents seem pretty much random to me. How are the contents chosen? One regression method, one arbitrary clustering algorithm, and four standard classifiers, but no decision tree, which is probably the grandfather of all classifiers. --Chire (talk) 12:41, 22 October 2013 (UTC)

In general, one may argue that k-means is NOT machine learning, but plain old statistics. And clustering is at most a stepchild of the machine learning world; it belongs to the data mining / knowledge discovery domain, just like outlier detection and frequent itemset mining. If you look at the communities, I would not call data mining part of machine learning either; it lives in parallel (unfortunately). Machine learners don't get or like unsupervised methods, actually. The "theory" section in this template is also pretty random, isn't it? --Chire (talk) 12:45, 22 October 2013 (UTC)

This template is brand new and very incomplete. You're welcome to add to it. k-means clustering is very widely employed in the machine learning community, e.g. by computer vision folks who use it as a feature learning method, by neural net folks for bootstrapping their RBF networks, and by text mining people. New papers employing or improving k-means appear regularly in the ML literature. I can dig up some references if you like.
AFAIC, the sidebar can be renamed something like "Data mining/machine learning/pattern recognition" -- the three overlap to such a degree that they're impossible to demarcate. QVVERTYVS (hm?) 14:59, 22 October 2013 (UTC)
Re: "one regression algorithm": wrong. Logistic regression is in fact a classification algorithm. It is very popular, especially in the natural language processing community, and forms the basis for much recent neural net and structured prediction work. Neural nets, k-NN and SVMs are all used for regression, though, even if this is not reflected in their Wikipedia articles. QVVERTYVS (hm?) 15:04, 22 October 2013 (UTC)
I agree that they are hard to separate and it thus may be a good idea to merge them into one template. I know that k-means is used a lot in machine learning, as it is a statistical optimization problem; not so much actually a structure discovery thing. Maybe instead of the "Models" block, make one for each "Problem" above then? I.e. regression, classification, clustering, anomaly detection, etc.? --Chire (talk) 09:05, 23 October 2013 (UTC)
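For readers following the point above that logistic regression, despite its name, is used as a classifier, here is a minimal sketch. It is an illustration only, not taken from any of the linked articles: the toy data, the function names (`fit_logistic`, `predict_label`), the learning rate and the 0.5 threshold are all assumptions chosen for the example. The model outputs a probability, and thresholding that probability yields a discrete class label.

```python
import math

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log loss.

    Hypothetical helper for illustration; lr and epochs are arbitrary choices.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_label(w, b, x, threshold=0.5):
    # The output is a class label (0 or 1), not a real-valued prediction.
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Toy 1-D data set with binary class labels (made up for this example).
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
labels = [predict_label(w, b, x) for x in xs]
print(labels)
```

The "regression" in the name refers to fitting the parameters of the probability model; the task being solved is still predicting a class.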

Maybe we need to add Markovian models?

Hidden Markov models (HMMs) have been used successfully for NLP, among other machine learning tasks (there are dozens of articles; just do a Google Scholar search). I believe they should be added as one of the models. -- Preceding unsigned comment added by 150.135.223.128 (talk) 19:12, 28 January 2014 (UTC)

I've added CRFs, HMMs and a link to the more general article graphical model. QVVERTYVS (hm?) 22:00, 28 January 2014 (UTC)

Two problems are the same: "classification" and "clustering"

These are simply two general approaches to solving the same problem, supervised and unsupervised, but the problem is one and the same. Fgnievinski (talk) 23:55, 3 May 2014 (UTC)

Applications are also overlapping if not coincident. Fgnievinski (talk) 12:32, 5 May 2014 (UTC)
As discussed in Talk:Statistical classification#Terminology: "classification" is supervised, "clustering" is unsupervised -- Really?, I disagree that they are the same thing.
The objectives are different in the sense that classification tries to minimize the prediction error. Clustering, however, tries to discover some meaningful structure without knowing what to look for (which is also why clustering more often than not returns crap results: too little guidance on what you are looking for). They are related, but clearly not the same thing. IMHO, the applications as well as the methods differ fundamentally, too. You can't easily take one method and transfer it to the other problem; not even naive Bayes or kNN classification. There are some cases where you have similar ideas (k-means also minimizes squared errors), but these occur in many other areas, too. And there are many clustering approaches not based on minimizing some statistical quantity. The big problem with clustering is evaluation: usually you evaluate by some statistical quantity (internal) or by class labels (external), both of which look a lot like classification.
Either way, we are not truth finders. There is plenty of literature that distinguishes these approaches, so we should not merge them. The rule of thumb in the literature is that classification and regression are supervised, and this is well reflected in the ML bar template. --Chire (talk) 13:36, 5 May 2014 (UTC)
I'm sorry, this is not a restatement of the previous talk. The methods are outside the scope of the present discussion. What is inside the scope is that both methodological approaches aim to cluster, group, segment, partition, and classify input variates. All the problems addressed by unsupervised methods could be tackled by supervised ones if additional information is given. Fgnievinski (talk) 13:48, 5 May 2014 (UTC)
I also disagree on that. If you added labels to a data set, it would become a different problem: how to predict the labels of new instances, given the training data set, i.e. it becomes class prediction, whereas it was structure discovery before. That is IMHO a quite different task. --Chire (talk) 13:57, 5 May 2014 (UTC)
The algorithms and performance metrics for clustering are radically different from those for classification, which is enough reason not to conflate them. I also challenge the statement that the applications are "coincident". QVVERTYVS (hm?) 14:50, 5 May 2014 (UTC)
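To make the internal versus external evaluation point from this thread concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on 1-D toy data. Everything in it is an assumption for illustration: the data, the initial centers, the helper names (`kmeans_1d`, `wcss`, `purity`) and the choice of purity as the external measure. The clustering itself never sees the class labels; they are used only afterwards, for the external score.

```python
from collections import Counter

def nearest(point, centers):
    # Index of the center with the smallest squared distance to the point.
    return min(range(len(centers)), key=lambda k: (point - centers[k]) ** 2)

def kmeans_1d(points, initial_centers, iterations=20):
    """Lloyd's algorithm on 1-D data with caller-supplied initial centers."""
    centers = list(initial_centers)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assign = [nearest(p, centers) for p in points]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, [nearest(p, centers) for p in points]

def wcss(points, centers, assign):
    # Internal measure: within-cluster sum of squared errors,
    # the quantity that k-means tries to minimize.
    return sum((p - centers[a]) ** 2 for p, a in zip(points, assign))

def purity(assign, labels):
    # External measure: fraction of points that carry the majority
    # class label of their cluster. Needs labels the clustering never used.
    total = 0
    for k in set(assign):
        in_cluster = [l for a, l in zip(assign, labels) if a == k]
        total += Counter(in_cluster).most_common(1)[0][1]
    return total / len(labels)

# Toy data (made up): two well-separated groups and their class labels.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
true_labels = ["a", "a", "a", "b", "b", "b"]

centers, assign = kmeans_1d(points, initial_centers=[0.0, 6.0])
internal_score = wcss(points, centers, assign)
external_score = purity(assign, true_labels)
print(assign, internal_score, external_score)
```

On data this cleanly separated both measures agree; the disagreement discussed above shows up when the structure that minimizes squared error is not the structure the labels describe.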