Feature selection

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In machine learning and statistics, feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models. When applied in biology domain, the technique is also called discriminative gene selection, which detects influential genes based on DNA microarray experiments. By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by:

  • Alleviating the effect of the curse of dimensionality.
  • Enhancing generalization capability.
  • Speeding up learning process.
  • Improving model interpretability.

Feature selection also helps people to acquire better understanding about their data by telling them which are the important features and how they are related with each other.

Contents

[edit] Introduction

Simple feature selection algorithms are ad hoc, but there are also more methodical approaches. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is for a satisfactory set of features instead of an optimal set.

Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. Subset selection searches the set of possible features for the optimal subset.

In statistics, the most popular form of feature selection is stepwise regression. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network.

[edit] Subset selection

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken into Wrappers, Filters and Embedded. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to Wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in and specific to a model.

Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc.

Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower dimensional space are then selected.

Search approaches include:

Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category.

Other available filter metrics include:

  • Class separability
    • Error probability
    • Inter-class distance
    • Probabilistic distance
    • Entropy
  • Consistency-based feature selection
  • Correlation-based feature selection

[edit] Optimality criteria

There are a variety of optimality criteria that can be used for controlling feature selection. The oldest are Mallows' Cp statistic and Akaike information criterion (AIC). These add variables if the t-statistic is bigger than \sqrt{2}.

Other criteria are Bayesian information criterion (BIC) which uses \sqrt{\log{n}}, minimum description length (MDL) which asymptotically uses \sqrt{\log{n}}, Bonnferroni / RIC which use \sqrt{2\log{p}}, maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR) which use something close to \sqrt{2\log{\frac{p}{q}}}.

[edit] Minimum-redundancy-maximum-relevance (mRMR) feature selection

Peng et al. [3] proposed an mRMR feature-selection method that can use either mutual information, correlation, distance/similarity scores to select features. For example, with mutual information, relevant features and redundant features are considered simultaneously. The relevance of a feature set S for the class c is defined by the average value of all mutual information values between the individual feature fi and the class c as follows:

 D(S,c) = \frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) .

The redundancy of all features in the set S is the average value of all mutual information values between the feature fi and the feature fj:

 R(S) = \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})

The mRMR criterion is a combination of two measures given above and is defined as follows:

 \max_{S}  \big[\frac{1}{|S|}\sum_{f_{i}\in S}I(f_{i};c) - \frac{1}{|S|^{2}}\sum_{f_{i},f_{j}\in S}I(f_{i};f_{j})\big]. (1)

Suppose that there are n full-set features. The binary values of the variable xi are used in order to indicate the appearance (xi = 1) or the absence (xi = 0) of the feature fi in the globally optimal feature set. The mutual information values I(fi;c), I(fi;fj) are denoted by constants ci, aij, respectively. Therefore, the problem (3) can be described as an optimization problem as follows:

 \max_{x\in \{0,1\}^{n}} \big[\frac{\sum^{n}_{i=1}c_{i}x_{i}}{\sum^{n}_{i=1}x_{i}} - \frac{\sum^{n}_{i,j=1}a_{ij}x_{i}x_{j}}{(\sum^{n}_{i=1}x_{i})^{2}}\big]. (2)

Peng et al. shows that mRMR feature selection is an approximation of the theoretically optimal maximum-dependency feature selection that maximizes the mutual information between the joint distribution of the selected features and the classification variable. However, since mRMR turned a combinatorial problem as a series of much smaller scale problems, each of which only involves two variables, the estimation of joint probabilities is much more robust. Overall the algorithm is much more efficient than the theoretically optimal max-dependency selection, while it is more robust to find useful features.

It can be seen that mRMR is also related to the correlation based feature selection below. It may also be seen a special case of some generic feature selectors [4].

[edit] Correlation feature selection

The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other" [5]. The following equation gives the merit of a feature subset S consisting of k features:

 Merit_{S_{k}} = \frac{k\overline{r_{cf}}}{\sqrt{k+k(k-1)\overline{r_{ff}}}}.

Here,  \overline{r_{cf}} is the average value of all feature-classification correlations, and  \overline{r_{ff}} is the average value of all feature-feature correlations. The CFS criterion is defined as follows:

 \max_{S_{k}} \big[\frac{r_{cf_{1}}+r_{cf_{2}}+...+r_{cf_{k}}}{\sqrt{k+2(r_{f_{1}f_{2}}+..+r_{f_{i}f_{j}}+ ..+r_{f_{k}f_{1}})}}\big]. (3)

The r_{cf_{i}} and r_{f_{i}f_{j}} variables are referred to as correlations, but are not necessarily Pearson's correlation coefficient or Spearman's ρ. Dr. Mark Hall's dissertation uses neither of these, but uses three different measures of relatedness, minimum description length (MDL), symmetrical uncertainty, and relief.

By using binary values of the variable xi as in the case of the mRMR measure to indicate the appearance or the absence of the feature fi, the problem (4) can be rewritten as an optimization problem as follows:

 \max_{x\in \{0,1\}^{n}} \big[\frac{(\sum^{n}_{i=1}a_{i}x_{i})^{2}}{\sum^{n}_{i=1}x_{i}+\sum_{i\neq j}2b_{ij}x_{i}x_{j} }\big]. (4)

Combinatorial problems (2) and (4) are, in fact, mixed 0-1 linear programming problems that can be solved by using branch-and-bound algorithms [6].

[edit] Regularized trees

The features from a decision tree or a tree ensemble are shown to be redundant. A recent method called regularized tree can be used for feature subset selection. Regularized trees penalize using a variable similar to the variables selected at previous tree nodes for splitting the current node. Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to outliers, and thus, require little data preprocessing such as normalization. Regularized random forest (RRF) is one type of regularized trees [7].

[edit] Embedded methods incorporating feature selection

[edit] Software for feature selection

Many standard data analysis software systems are often used for feature selection, such as SciLab, NumPy and the R language. Other software systems are tailored specifically to the feature-selection task:

  • Weka – freely available and open-source software in Java.
  • Feature Selection Toolbox 3 – freely available and open-source software in C++.
  • RapidMiner – freely available and open-source software.
  • Orange – freely available and open-source software (module orngFSS).
  • TOOLDIAG Pattern recognition toolbox – freely available C toolbox.
  • minimum redundancy feature selection tool – freely available C/Matlab codes for selecting minimum redundant features.
  • A C# Implementation of greedy forward feature subset selection for various classifiers (e.g., LibLinear, SVM-light).
  • MCFS-ID (Monte Carlo Feature Selection and Interdependency Discovery) is a Monte Carlo method-based tool for feature selection. It also allows for the discovery of interdependencies between the relevant features. MCFS-ID is particularly suitable for the analysis of high-dimensional, ill-defined transactional and biological data.
  • RRF is a relatively new R package for feature selection and can be installed from R. RRF stands for Regularized Random Forest, which is a type of Regularized Trees. By building a regularized random forest, a compact set of non-redundant features can be selected without loss of predictive information. Regularized trees can capture non-linear interactions between variables, and naturally handle different scales, and numerical and categorical variables.

[edit] See also

[edit] References

  1. ^ F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving feature subset selection problem by a Parallel Scatter Search, European Journal of Operational Research, vol. 169, no. 2, pp. 477-489, 2006.
  2. ^ F.C. Garcia-Lopez, M. Garcia-Torres, B. Melian, J.A. Moreno-Perez, J.M. Moreno-Vega. Solving Feature Subset Selection Problem by a Hybrid Metaheuristic. In First International Workshop on Hybrid Metaheuristics, pp. 59-68, 2004.
  3. ^ Peng, H.C., Long, F., and Ding, C., Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226–1238, 2005. Program
  4. ^ Nguyen, H., Franke, K., Petrovic, S. (2010). "Towards a Generic Feature-Selection Measure for Intrusion Detection", In Proc. International Conference on Pattern Recognition (ICPR), Istanbul, Turkey. [1]
  5. ^ M. Hall 1999, Correlation-based Feature Selection for Machine Learning
  6. ^ Hai Nguyen, Katrin Franke, and Slobodan Petrovic, Optimizing a class of feature selection measures, Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Vancouver, Canada, December 2009. [2]
  7. ^ H Deng, G Runger, "Feature Selection via Regularized Trees", arXiv:1201.1587, 2012.

[edit] Further reading

[edit] External links

Personal tools
Namespaces
Variants
Actions
Navigation
Interaction
Toolbox
Print/export
Languages