Feature selection

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction. The central assumption when using a feature selection technique is that the data contains many redundant or irrelevant features. Redundant features are those which provide no more information than the currently selected features, and irrelevant features provide no useful information in any context.

Feature selection techniques are a subset of the more general field of feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). The archetypal case is the use of feature selection in analysing DNA microarrays, where there are many thousands of features, and a few tens to hundreds of samples. Feature selection techniques provide three main benefits when constructing predictive models:

  • improved model interpretability,
  • shorter training times,
  • enhanced generalisation by reducing overfitting.

Feature selection is also useful as part of the data analysis process, as it shows which features are important for prediction, and how these features are related.

Introduction[edit]

A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimises the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.[1]

Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model.

Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, whilst still capturing the usefulness of the feature set. Common measures include the mutual information,[1] the pointwise mutual information,[2] Pearson product-moment correlation coefficient, inter/intra class distance or the scores of significance tests for each class/feature combinations.[2][3] Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems.

Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalises the regression coefficients, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. These approaches tend to be between filters and wrappers in terms of computational complexity.

In statistics, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network.

Subset selection[edit]

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into Wrappers, Filters and Embedded. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to Wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in and specific to a model.

Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc.

Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected.

Search approaches include:

Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information;[6] see here.

Other available filter metrics include:

  • Class separability
    • Error probability
    • Inter-class distance
    • Probabilistic distance
    • Entropy
  • Consistency-based feature selection
  • Correlation-based feature selection

Optimality criteria[edit]

The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common ones incorporate a measure of accuracy, penalised by the number of features selected (e.g. the Bayesian information criterion). The oldest are Mallows's Cp statistic and Akaike information criterion (AIC). These add variables if the t-statistic is bigger than \sqrt{2}.

Other criteria are Bayesian information criterion (BIC) which uses \sqrt{\log{n}}, minimum description length (MDL) which asymptotically uses \sqrt{\log{n}}, Bonnferroni / RIC which use \sqrt{2\log{p}}, maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR) which use something close to \sqrt{2\log{\frac{p}{q}}}.

Structure Learning[edit]

Filter feature selection is a specific case of a more general paradigm called Structure Learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a Bayesian Network, and so the structure is a directed graphical model. The optimal solution to the filter feature selection problem is the Markov blanket of the target node, and in a Bayesian Network, there is a unique Markov Blanket for each node.[7]

Minimum-redundancy-maximum-relevance (mRMR) feature selection[edit]

Peng et al.[8] proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalise a feature's relevancy by its redundancy in the presence of the other selected features. The relevance of a feature set S for the class c is defined by the average value of all mutual information values between the individual feature fi and the class c as follows:

 D(S,c) = \frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) .

The redundancy of all features in the set S is the average value of all mutual information values between the feature fi and the feature fj:

 R(S) = \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})

The mRMR criterion is a combination of two measures given above and is defined as follows:

\mathrm{mRMR}= \max_{S}
\left[\frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) - 
\frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})\right].

Suppose that there are n full-set features. Let xi be the set membership indicator function for feature fi, so that xi=1 indicates presence and xi=0 indicates absence of the feature fi in the globally optimal feature set. Let ci=I(fi;c) and aij=I(fi;fj). The above may then be written as an optimization problem:

\mathrm{mRMR}= \max_{x\in \{0,1\}^{n}} 
\left[\frac{\sum^{n}_{i=1}c_{i}x_{i}}{\sum^{n}_{i=1}x_{i}} -
\frac{\sum^{n}_{i,j=1}a_{ij}x_{i}x_{j}}
{(\sum^{n}_{i=1}x_{i})^{2}}\right].

The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm that maximizes the mutual information between the joint distribution of the selected features and the classification variable. As mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each of which only involves two variables, it thus uses pairwise joint probabilities which are more robust. In certain situations the algorithm may underestimate the usefulness of features as it has no way to measure interactions between features which can increase relevancy. This can lead to poor performance[9] when the features are individually useless, but are useful when combined (a pathological case is found when the class is a parity function of the features). Overall the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal max-dependency selection, yet produces a feature set with little pairwise redundancy.

mRMR is an instance of a large class of filter methods which trade off between relevancy and redundancy in different ways.[9][10]

Global optimization formulations[edit]

mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be deselected at a later stage. While mRMR could be optimized using floating search to reduce some features, it might also be reformulated as a global quadratic programing optimization problem as follows:[11]


\mathrm{QPFS}: \min_\mathbf{x}  \left\{ \alpha \mathbf{x}^T H \mathbf{x} -  \mathbf{x}^T F\right\} \quad \mbox{s.t.} \ \sum_{i=1}^n x_i=1, x_i\geq 0

where F_{n\times1}=[I(f_1;c),\ldots, I(f_n;c)]^T is the vector of feature relevancy assuming there are n features in total, H_{n\times n}=[I(f_i;f_j)]_{i,j=1\ldots n} is the matrix of feature pairwise redundancy, and \mathbf{x}_{n\times 1} represents relative feature weights. QPFS is solved via quadratic programming. It is recently shown that QFPS is biased towards features with smaller entropy,[12] due to its placement of the feature self redundancy term I(f_i;f_i) on the diagonal of H.

Another global formulation for the mutual information based feature selection problem is based on the conditional relevancy:[12]


\mathrm{SPEC_{CMI}}: \max_{\mathbf{x}} \left\{\mathbf{x}^T Q \mathbf{x}\right\} \quad \mbox{s.t.}\ \|\mathbf{x}\|=1, x_i\geq 0

where Q_{ii}=I(f_i;c) and Q_{ij}=I(f_i;c|f_j), i\ne j.

An advantage of \mathrm{SPEC_{CMI}} is that it can be solved simply via finding the dominant eigenvector of Q, thus is very scalable. \mathrm{SPEC_{CMI}} also handles second-order feature interaction.

Correlation feature selection[edit]

The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other".[13][14] The following equation gives the merit of a feature subset S consisting of k features:

 Merit_{S_{k}} = \frac{k\overline{r_{cf}}}{\sqrt{k+k(k-1)\overline{r_{ff}}}}.

Here,  \overline{r_{cf}} is the average value of all feature-classification correlations, and  \overline{r_{ff}} is the average value of all feature-feature correlations. The CFS criterion is defined as follows:

\mathrm{CFS} = \max_{S_k}
\left[\frac{r_{c f_1}+r_{c f_2}+\cdots+r_{c f_k}}
{\sqrt{k+2(r_{f_1 f_2}+\cdots+r_{f_i f_j}+ \cdots
+ r_{f_k f_1 })}}\right].

The r_{cf_{i}} and r_{f_{i}f_{j}} variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient or Spearman's ρ. Dr. Mark Hall's dissertation uses neither of these, but uses three different measures of relatedness, minimum description length (MDL), symmetrical uncertainty, and relief.

Let xi be the set membership indicator function for feature fi; then the above can be rewritten as an optimization problem:

\mathrm{CFS} = \max_{x\in \{0,1\}^{n}} 
\left[\frac{(\sum^{n}_{i=1}a_{i}x_{i})^{2}}
{\sum^{n}_{i=1}x_i + \sum_{i\neq j} 2b_{ij} x_i x_j }\right].

The combinatorial problems above are, in fact, mixed 0–1 linear programming problems that can be solved by using branch-and-bound algorithms.[15]

Regularized trees[edit]

The features from a decision tree or a tree ensemble are shown to be redundant. A recent method called regularized tree[16] can be used for feature subset selection. Regularized trees penalize using a variable similar to the variables selected at previous tree nodes for splitting the current node. Regularized trees only need build one tree model (or one tree ensemble model) and thus are computationally efficient.

Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to outliers, and thus, require little data preprocessing such as normalization. Regularized random forest (RRF) (RRF) is one type of regularized trees. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.

Embedded methods incorporating feature selection[edit]

See also[edit]

References[edit]

  1. ^ a b Guyon, Isabelle; Elisseeff, André (2003). "An Introduction to Variable and Feature Selection". JMLR 3. 
  2. ^ a b Yang, Yiming; Pedersen, Jan O. (1997). "A comparative study on feature selection in text categorization". ICML. 
  3. ^ Forman, George (2003). "An extensive empirical study of feature selection metrics for text classification". Journal of Machine Learning Research 3: 1289–1305. 
  4. ^ F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving feature subset selection problem by a Parallel Scatter Search, European Journal of Operational Research, vol. 169, no. 2, pp. 477–489, 2006.
  5. ^ F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving Feature Subset Selection Problem by a Hybrid Metaheuristic. In First International Workshop on Hybrid Metaheuristics, pp. 59–68, 2004.
  6. ^ Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, and Peter Grassberger, "Hierarchical Clustering Based on Mutual Information", (2003) ArXiv q-bio/0311039
  7. ^ Aliferis, Constantin (2010). "Local causal and markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation". Journal of Machine Learning Research 11: 171–234. 
  8. ^ Peng, H. C.; Long, F.; Ding, C. (2005). "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy". IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8): 1226–1238. doi:10.1109/TPAMI.2005.159. PMID 16119262.  Program
  9. ^ a b Brown, G., Pocock, A., Zhao, M.-J., Lujan, M. (2012). "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection", In the Journal of Machine Learning Research (JMLR). [1]
  10. ^ Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey. [2]
  11. ^ I. Rodriguez-Lujan, R. Huerta, C. Elkan, and C. S. Cruz. Quadratic programming feature selection. J. Mach. Learn. Res., 11:1491–1516, 2010.
  12. ^ a b Nguyen X. Vinh, Jeffrey Chan, Simone Romano and James Bailey, "Effective Global Approaches for Mutual Information based Feature Selection". Proceeedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27, New York City, 2014. "[3]"
  13. ^ M. Hall 1999, Correlation-based Feature Selection for Machine Learning
  14. ^ Senliol, Baris, et al. "Fast Correlation Based Filter (FCBF) with a different search strategy." Computer and Information Sciences, 2008. ISCIS'08. 23rd International Symposium on. IEEE, 2008. [4]
  15. ^ Hai Nguyen, Katrin Franke, and Slobodan Petrovic, Optimizing a class of feature selection measures, Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Vancouver, Canada, December 2009. [5]
  16. ^ H. Deng, G. Runger, "Feature Selection via Regularized Trees", Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012

Further reading[edit]

External links[edit]