Feature learning

From Wikipedia, the free encyclopedia

Feature learning or representation learning[1] is a set of techniques in machine learning that learn a transformation of "raw" inputs to a representation that can be effectively exploited in a supervised learning task such as classification. Feature learning algorithms themselves may be either unsupervised or supervised, and include autoencoders,[2] dictionary learning, matrix factorization,[3] restricted Boltzmann machines[2] and various forms of clustering.[2][4][5]

Multilayer neural networks can also be considered to perform feature learning, since they learn a representation of their input at the hidden layer(s), which is subsequently used for classification or regression at the output layer. Feature learning is an integral part of deep learning, to the point that the two are sometimes considered synonyms.[1] (By contrast, kernel methods such as the support vector machine compute a fixed transformation of their inputs by means of a kernel function, and do not perform feature learning.)

When feature learning can be performed in an unsupervised way, it enables a form of semi-supervised learning: features are first learned from an unlabeled dataset and are then used to improve performance in a supervised setting with labeled data.[6][7]
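The two-stage procedure can be sketched as follows. This is a minimal illustration, not a method taken from the cited papers: the synthetic two-blob data, the plain Lloyd's-algorithm `kmeans` helper, and the lookup-table classifier are all assumptions chosen to keep the example short. Stage 1 learns centroids from unlabeled points only; stage 2 uses the resulting cluster index as the feature for a deliberately trivial supervised step trained on two labeled points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point (the learned feature)."""
    dists = [sq_dist(point, c) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = random.Random(42)

def blob(cx, cy, n):
    return [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)) for _ in range(n)]

# Stage 1: unsupervised feature learning on unlabeled data.
unlabeled = blob(0, 0, 50) + blob(6, 6, 50)
centroids = kmeans(unlabeled, k=2)

# Stage 2: supervised learning on a handful of labeled points,
# using the cluster index as the only feature.
labeled = [((0.2, -0.1), "a"), ((5.8, 6.3), "b")]
cluster_to_label = {nearest(x, centroids): y for x, y in labeled}

def predict(x):
    """Classify a new point through its learned cluster feature."""
    return cluster_to_label[nearest(x, centroids)]

print(predict((0.5, 0.4)), predict((6.2, 5.5)))
```

In practice the supervised step would be a real classifier trained on the learned features, possibly alongside the original inputs; the lookup table here only makes the division of labor between the two stages visible.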

Clustering as feature learning

K-means clustering can be used for feature learning by clustering an unlabeled set to produce k centroids, then using these centroids to produce k additional features for a subsequent supervised learning task. These features can be derived in several ways. The simplest is to add k binary features to each sample, where feature j has value one if and only if the jth centroid learned by k-means is the closest to the sample under consideration.[2] It is also possible to use the distances to the cluster centroids as features, perhaps after transforming them through a radial basis function (a technique that has been used to train RBF networks[8]). Coates and Ng note that certain variants of k-means behave similarly to sparse coding algorithms.[9]
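The two encodings described above can be sketched as follows, assuming the centroids have already been learned. The centroid values, the Gaussian form of the radial basis function, and the `gamma` width parameter are illustrative assumptions rather than details from the cited papers.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_features(sample, centroids):
    """k binary features: 1 for the closest centroid, 0 for all others."""
    dists = [sq_dist(sample, c) for c in centroids]
    closest = dists.index(min(dists))
    return [1 if j == closest else 0 for j in range(len(centroids))]

def rbf_features(sample, centroids, gamma=1.0):
    """k real-valued features: a Gaussian RBF of the distance to each centroid."""
    return [math.exp(-gamma * sq_dist(sample, c)) for c in centroids]

def augment(sample, centroids):
    """Append the k binary features to the original sample."""
    return list(sample) + hard_features(sample, centroids)

# Hypothetical centroids, as if already learned by k-means with k = 3.
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
sample = (4.0, 4.5)

print(hard_features(sample, centroids))
print(rbf_features(sample, centroids, gamma=0.1))
print(augment(sample, centroids))
```

The binary encoding discards how close the sample is to each centroid, while the RBF encoding keeps a graded similarity to every centroid, at the cost of an extra width parameter to choose.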

In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented autoencoders and RBMs on an image classification task.[2] K-means has also been shown to improve performance in natural language processing (NLP), specifically for named-entity recognition;[10] there, it competes with Brown clustering, as well as with distributed word representations (also known as neural word embeddings).[7]

Feature learning vs. dictionary learning

In the study of sparse coding, signals are represented as sparse "activations" of items from a "dictionary". Sometimes the dictionary is fixed, and sometimes it is learnt, in a process referred to as dictionary learning. This is essentially the same concept as feature learning: in fact, one of the best-known algorithms in dictionary learning, K-SVD, can be considered a generalisation of the k-means method described above for feature learning.[9]
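One way to see the connection is to treat the k-means centroids as dictionary items and the cluster assignment as a code with exactly one nonzero activation, fixed at 1. Sparse coding relaxes this by allowing a few nonzero, real-valued activations. The sketch below illustrates only the representation, not the K-SVD learning algorithm itself; the dictionary, the signal, and the hand-picked sparse code are assumptions made for the example.

```python
def reconstruct(dictionary, code):
    """Linear combination of dictionary items weighted by their activations."""
    dim = len(dictionary[0])
    return [sum(c * atom[i] for c, atom in zip(code, dictionary)) for i in range(dim)]

def kmeans_code(x, dictionary):
    """k-means view: a one-hot code selecting the closest dictionary item."""
    dists = [sum((xi - ai) ** 2 for xi, ai in zip(x, atom)) for atom in dictionary]
    j = dists.index(min(dists))
    return [1.0 if i == j else 0.0 for i in range(len(dictionary))]

# Hypothetical dictionary of four items (centroids, in the k-means view).
dictionary = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 1.0)]
x = (2.0, 0.4)

# k-means: one active item with activation 1, so the reconstruction
# is simply the closest centroid.
one_hot = kmeans_code(x, dictionary)
print(one_hot, reconstruct(dictionary, one_hot))

# Sparse coding: a few active items with real-valued activations.
# This code is chosen by hand for illustration, not computed by an algorithm.
sparse_code = [2.0, 0.4, 0.0, 0.0]
print(sparse_code, reconstruct(dictionary, sparse_code))
```

Both codes are sparse, but the one-hot code can only reproduce a centroid, whereas the two-item code reproduces the signal exactly with the same dictionary.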

References

  1. Y. Bengio; A. Courville; P. Vincent (2013). "Representation Learning: A Review and New Perspectives". IEEE Trans. PAMI, special issue Learning Deep Architectures.
  2. Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). "An analysis of single-layer networks in unsupervised feature learning". Int'l Conf. on AI and Statistics (AISTATS).
  3. Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola (2004). "Maximum-Margin Matrix Factorization". NIPS.
  4. Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin; Willamowski, Jutta; Bray, Cédric (2004). "Visual categorization with bags of keypoints". ECCV Workshop on Statistical Learning in Computer Vision.
  5. Daniel Jurafsky; James H. Martin (2009). Speech and Language Processing. Pearson Education International. pp. 145–146.
  6. Percy Liang (2005). Semi-Supervised Learning for Natural Language (M. Eng.). MIT. pp. 44–52.
  7. Joseph Turian; Lev Ratinov; Yoshua Bengio (2010). "Word representations: a simple and general method for semi-supervised learning". Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
  8. Schwenker, Friedhelm; Kestler, Hans A.; Palm, Günther (2001). "Three learning phases for radial-basis-function networks". Neural Networks 14: 439–458. doi:10.1016/s0893-6080(01)00027-2.
  9. Coates, Adam; Ng, Andrew Y. (2012). "Learning feature representations with k-means". In G. Montavon, G. B. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer.
  10. Dekang Lin; Xiaoyun Wu (2009). "Phrase clustering for discriminative learning". Proc. J. Conf. of the ACL and 4th Int'l J. Conf. on Natural Language Processing of the AFNLP. pp. 1030–1038.