Feature learning

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Feature learning or representation learning[1] is a set of techniques that learn a transformation of "raw" inputs to a representation that can be effectively exploited in machine learning tasks.

Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process (e.g., measurable and low-dimensional). However, real-world data such as images, video, and sensor measurement is usually complex, redundant, and highly variable, which jeopardizes the effectiveness of machine learning techniques. Thus, it is necessary to discover useful features or representations from raw data. Traditional hand-crafted features, which are used in areas such as computer vision, require expensive human labor and often rely on expert knowledge. Moreover, hand-crafted features normally do not generalize well. This motivates the design of efficient feature learning techniques that can be carried out by machines without hand-tuning.

Feature learning algorithms can be divided into two categories: supervised and unsupervised feature learning.

  • In supervised feature learning, features are learned with labeled input data. An example is multi-task learning, where multiple related learning tasks are carried out simultaneously. The advantage is to share the intermediate feature representations for multiple tasks.
  • In unsupervised feature learning, features are learned with unlabeled input data. The goal of unsupervised feature learning is to discover low-dimensional features that capture the information underlying the high-dimensional input data that is useful for the ultimate machine learning tasks (e.g., classification) while removing the irrelevant variability in the input. Unsupervised feature learning is favored due to the availability of large amount of unlabeled data and the scarcity of labeled data in realistic scenarios. Examples of unsupervised feature learning include restricted Boltzmann machines,[2][3] autoencoder, dictionary learning, matrix factorization,[4] and various forms of clustering.[3][5][6] When the feature learning is performed in an unsupervised way, it enables a form of semisupervised learning where first, features are learned from an unlabeled dataset, which are then employed to improve performance in a supervised setting with labeled data.[7][8]

Multilayer neural networks can be used to perform feature learning, since they learn a representation of their input at the hidden layer(s) which is subsequently used for classification or regression at the output layer, and feature learning is an integral part of deep learning, to the point that the two are sometimes considered synonyms.[1] (By contrast, kernel methods such as the support vector machine compute a fixed transformation of their inputs by means of a kernel function, and do not perform feature learning.)


Clustering for unsupervised feature learning[edit]

K-means clustering can be used for feature learning, by clustering an unlabeled set to produce k centroids, then using these centroids to produce k additional features for a subsequent supervised learning task. These features can be derived in several ways; the simplest way is to add k binary features to each sample, where each feature j has value one iff the jth centroid learned by k-means is the closest to the sample under consideration.[3] It is also possible to use the distances to the clusters as features, perhaps after transforming them through a radial basis function (a technique that has used to train RBF networks[9]). Coates and Ng note that certain variants of k-means behave similarly to sparse coding algorithms.[10]

In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task.[3] K-means has also been shown to improve performance in the domain of NLP, specifically for named-entity recognition;[11] there, it competes with Brown clustering, as well as with distributed word representations (also known as neural word embeddings).[8]

Restricted Boltzmann machine for unsupervised feature learning[edit]

Restricted Boltzmann machine[edit]

A unsupervised feature learning algorithm can be obtained by training a restricted Boltzmann machine (RBM).[2] An RBM can be represented by an undirected bipartite graph consisting of N_h binary hidden variables, N_v visible variables, and edges connecting the hidden and visible nodes. It is a special case of the more general Boltzmann machines. Let h = [h_1~h_2~\cdots~h_{N_h}]^T and v = [v_1~v_2~\cdots~v_{N_v}]^T be the vectors of hidden and visible variables, respectively. The energy function of a configuration of hidden and visible nodes is defined as

E(v,h;\theta) = -c^Tv - b^Th - v^TW h

where \theta = (W, b, c); W is an N_h \times N_v matrix with the (i,j)th entry w_{ij} being the weight associated with the edge connecting the ith visible variable and jth hidden variable; b and c are offset parameters associated with the hidden and visible variables, respectively. Then the joint distribution of visible and hidden variables is defined based on the energy function as

 P(h,v;\theta) = \frac{1}{Z(\theta)} e^{-E(v,h;\theta)}.

In the context of feature learning, the visible variables correspond to input data, and the hidden variables correspond to feature detectors. According to the graphical model of the RBM, the hidden (visible) variables are independent conditioned on the visible (hidden) variables. Furthermore, based on the expression of the energy E(v,h;\theta) and the distribution P(h,v;\theta), it can be verified that

 P(h_i=1|v) = S\biggl(b_i + \sum_{i=1}^{N_v} w_{ij} v_i \biggr)

where  S(t) = 1/(1+e^{-t}) is the sigmoid function. The posterior distribution of hidden variables can be viewed as a representation for the data.

Training RBM[edit]

The distribution of the visible variables v corresponding to the RBM is given by

P(v;\theta) = \frac{1}{Z(\theta)} \sum_h e^{-E(v,h;\theta)}.

The parameters W, a, and b of the RBM can be trained by maximizing the probability of visible variables, i.e., solving the problem

 \max_{\theta} P(v;\theta) or equivalently ~~\min_{\theta} -\ln P(v;\theta).

The problem can be solved using the contrastive divergence (CD) algorithm by Geoffrey Hinton [2] which is based on gradient descent and Gibbs sampling. In particular, the weight between the ith visible variable and the jth hidden variable is updated at the kth step following

 w_{ij}(k+1) = w_{ij}(k) + \epsilon \left.\frac{\partial\ln P(v;\theta)}{\partial w_{ij}}\right|_{W = W(k)}.

For the RBM, the gradient can be expressed as

 \frac{\partial\ln P(v;\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}

where \langle v_i h_j \rangle_{p} is the expectation of v_i h_j with respect to the distribution p. In the CD algorithm, \langle v_i h_j \rangle_{\text{recon}} is computed via alternate Gibbs sampling.[2]

Sparse RBM[edit]

In general, the training of RBM by solving the above maximization problem tends to result in non-sparse representations. The sparse RBM, [12] a modification of the RBM, was proposed to enable sparse representations. The idea is to add a regularization term in the objective function of data likelihood, which penalizes the deviation of the expected hidden variables from a small constant p.

Deep architectures for unsupervised feature learning[edit]

The hierarchical architecture of the neural system inspires deep learning architectures for unsupervised feature learning by stacking multiple layers of simple learning blocks.[13] These architectures are designed based on the assumption of distributed representation: observed data is generated by the interactions of many different factors on multiple levels. In a deep learning architecture, each level uses the representation uses the representation produced by previous level as input, and produces new representations as output, which is then fed to higher levels. The input of bottom layer is the raw data, and the output of the final layer is the desired low-dimensional feature or representation.

An autoencoder consisting of encoder and decoder is a paradigm for deep learning architectures. An example is provided in,[2] where the encoder uses raw data (e.g., image) as input and produces feature or representation as output, and the decoder uses the extracted feature from the encoder as input and reconstructs the original input raw data as output. The encoder and decoder are constructed by stacking multiple layers of RBMs. The parameters involved in the architecture are trained in a greedy layer-by-layer manner: after one layer of feature detectors is learned, they are fed to upper layers as visible variables for training the corresponding RBM. The process can be repeated until some stopping criteria is satisfied.

The encoder-decoder architecture has several advantages including:

  • the features can be efficiently computed by running the encoder after training;
  • the comparison between the raw input and its reconstruction can help to evaluate the performance of the encoder in capturing the key information underlying the raw data.

Feature learning vs. dictionary learning[edit]

In the study of sparse coding, signals are represented as sparse "activations" of items from a "dictionary". Sometimes the dictionary is fixed, and sometimes it is learnt, in a process referred to as dictionary learning. This is essentially the same concept as feature learning: in fact one of the most well-known algorithms in dictionary learning, K-SVD, can be considered a generalisation of the k-means method described above for feature learning.[10]

See also[edit]

References[edit]

  1. ^ a b Y. Bengio; A. Courville; P. Vincent (2013). "Representation Learning: A Review and New Perspectives". IEEE Trans. PAMI, special issue Learning Deep Architectures. 
  2. ^ a b c d e Hinton, G. E.; Salakhutdinov, R. R. (2006). "Reducing the Dimensionality of Data with Neural Networks". Science 313 (5786): 504–507. doi:10.1126/science.1127647. PMID 16873662.  edit
  3. ^ a b c d Coates, Adam; Lee, Honglak; Ng, Andrew Y. (2011). "An analysis of single-layer networks in unsupervised feature learning". Int'l Conf. on AI and Statistics (AISTATS). 
  4. ^ Nathan Srebro; Jason D. M. Rennie; Tommi S. Jaakkola (2004). "Maximum-Margin Matrix Factorization". NIPS. 
  5. ^ Csurka, Gabriella; Dance, Christopher C.; Fan, Lixin; Willamowski, Jutta; Bray, Cédric (2004). "Visual categorization with bags of keypoints". ECCV Workshop on Statistical Learning in Computer Vision. 
  6. ^ Daniel Jurafsky; James H. Martin (2009). Speech and Language Processing. Pearson Education International. pp. 145–146. 
  7. ^ Percy Liang (2005). Semi-Supervised Learning for Natural Language (M. Eng.). MIT. pp. 44–52. 
  8. ^ a b Joseph Turian; Lev Ratinov; Yoshua Bengio (2010). "Word representations: a simple and general method for semi-supervised learning". Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. 
  9. ^ Schwenker, Friedhelm; Kestler, Hans A.; Palm, Günther (2001). "Three learning phases for radial-basis-function networks". Neural Networks 14: 439–458. doi:10.1016/s0893-6080(01)00027-2. CiteSeerX: 10.1.1.109.312. 
  10. ^ a b Coates, Adam; Ng, Andrew Y. (2012). "Learning feature representations with k-means". In G. Montavon, G. B. Orr and K.-R. Müller. Neural Networks: Tricks of the Trade. Springer. 
  11. ^ Dekang Lin; Xiaoyun Wu (2009). "Phrase clustering for discriminative learning". Proc. J. Conf. of the ACL and 4th Int'l J. Conf. on Natural Language Processing of the AFNLP. pp. 1030–1038. 
  12. ^ Lee, Honglak; Ekanadham, Chaitanya; Andrew, Ng (2008). "Sparse deep belief net model for visual area V2". Advances in neural information processing systems. 
  13. ^ Bengio, Yoshua (2009). "Learning Deep Architectures for AI". Foundations and Trends® in Machine Learning 2 (1): 1–127. doi:10.1561/2200000006.