Hyperparameter (machine learning)

In the context of machine learning, hyperparameter optimization or model selection is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. Often cross-validation is used to estimate this generalization performance.^[1] Hyperparameter optimization contrasts with actual learning problems, which are also often cast as optimization problems, but optimize a loss function on the training set alone. In effect, learning algorithms learn parameters that model/reconstruct their inputs well, while hyperparameter optimization is to ensure the model does not overfit its data by tuning, e.g., regularization.

Algorithms for hyperparameter optimization

Grid search

The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set^[2] or evaluation on a held-out validation set.

Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.

For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter γ. Both parameters are continuous, so to perform grid search, one selects a finite set of "reasonable" values for each, say

C\in \{10,100,1000\}

\gamma \in \{0.1,0.2,0.5,1.0\}

Grid search then trains an SVM with each pair (C, γ) in the Cartesian product of these two sets and evaluates their performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure.

Grid search suffers from the curse of dimensionality, but is often embarrassingly parallel because typically the hyperparameter settings it evaluates are independent of each other.^[1]

Bayesian optimization

Bayesian optimization is a methodology for the global optimization of noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization consists of developing a statistical model of the function from hyperparameter values to the objective evaluated on a validation set. Intuitively, the methodology assumes that there is some smooth but noisy function that acts as a mapping from hyperparameters to the objective. In Bayesian optimization, one aims to gather observations in such a manner as to evaluate the machine learning model the least number of times while revealing as much information as possible about this function and, in particular, the location of the optimum. Bayesian optimization relies on assuming a very general prior over functions which when combined with observed hyperparameter values and corresponding outputs yields a distribution over functions. The methodology proceeds by iteratively picking hyperparameters to observe (experiments to run) in a manner that trades off exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters which are expected to have a good outcome). In practice, Bayesian optimization has been shown^[3]^[4]^[5]^[6] to obtain better results in fewer experiments than grid search and random search, due to the ability to reason about the quality of experiments before they are run.

Random search

Since grid searching is an exhaustive and therefore potentially expensive method, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search.^[1]

Gradient-based optimization

For specific learning algorithms, specialized model selection algorithms can be used. For example, Chapelle et al. present a gradient descent algorithm for minimizing the estimated generalization error of a support vector machine.^[7] More recently, Maclaurin et al. developed a more general gradient-based method using automatic differentiation.^[8]

Software

LIBSVM comes with scripts for performing grid search.
The machine learning toolkit scikit-learn includes grid and random search modules.
Hyperopt is a distributed hyperparameter optimization library in Python.
Auto-WEKA is a hyperparameter optimization layer on top of WEKA.
spearmint Spearmint is a package to perform Bayesian optimization of machine learning algorithms.
SUMO-Toolbox^[9] is a MATLAB toolbox for surrogate modeling supporting a wide collection of hyperparameter optimization algorithm for many model types.
MOE MOE is a Python/C++/CUDA library implementing Bayesian Global Optimization using Gaussian Processes.
SigOpt SigOpt offers Bayesian hyperparameter optimization as a SaaS service.

References

^ ^a ^b ^c Bergstra, James; Bengio, Yoshua (2012). "Random Search for Hyper-Parameter Optimization" (PDF). J. Machine Learning Research. 13: 281–305.
^ Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). A practical guide to support vector classification. Technical Report, National Taiwan University.
^ Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin (2011), "Sequential model-based optimization for general algorithm configuration" (PDF), Learning and Intelligent Optimization
^ Bergstra, James; Bardenet, Remi; Bengio, Yoshua; Kegl, Balazs (2011), "Algorithms for hyper-parameter optimization" (PDF), Advances in Neural Information Processing Systems
^ Snoek, Jasper; Larochelle, Hugo; Adams, Ryan (2012), "Practical Bayesian Optimization of Machine Learning Algorithms" (PDF), Advances in Neural Information Processing Systems
^ Thornton, Chris; Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin (2013), "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms" (PDF), Knowledge discovery and data mining
^ Olivier Chapelle; Vladimir Vapnik; Olivier Bousquet; Sayan Mukherjee (2002). "Choosing multiple parameters for support vector machines" (PDF). Machine Learning. 46: 131–159. doi:10.1023/a:1012450327387.
^ Maclaurin, Douglas; Duvenaud, David; Adams, Ryan P. (2015). "Gradient-based Hyperparameter Optimization through Reversible Learning". arXiv:1502.03492 [stat.ML].
^ Gorissen, Dirk; Crombecq, Karel; Couckuyt, Ivo; Demeester, Piet; Dhaene, Tom (2010). "A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design" (PDF). J. Machine Learning Research. 11: 2051–2055.

[bergstra-1] Bergstra, James; Bengio, Yoshua (2012). "Random Search for Hyper-Parameter Optimization" (PDF). J. Machine Learning Research. 13: 281–305.

[2] Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). A practical guide to support vector classification. Technical Report, National Taiwan University.

[hutter-3] Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin (2011), "Sequential model-based optimization for general algorithm configuration" (PDF), Learning and Intelligent Optimization

[bergstra11-4] Bergstra, James; Bardenet, Remi; Bengio, Yoshua; Kegl, Balazs (2011), "Algorithms for hyper-parameter optimization" (PDF), Advances in Neural Information Processing Systems

[snoek-5] Snoek, Jasper; Larochelle, Hugo; Adams, Ryan (2012), "Practical Bayesian Optimization of Machine Learning Algorithms" (PDF), Advances in Neural Information Processing Systems

[thornton-6] Thornton, Chris; Hutter, Frank; Hoos, Holger; Leyton-Brown, Kevin (2013), "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms" (PDF), Knowledge discovery and data mining

[7] Olivier Chapelle; Vladimir Vapnik; Olivier Bousquet; Sayan Mukherjee (2002). "Choosing multiple parameters for support vector machines" (PDF). Machine Learning. 46: 131–159. doi:10.1023/a:1012450327387.

[8] Maclaurin, Douglas; Duvenaud, David; Adams, Ryan P. (2015). "Gradient-based Hyperparameter Optimization through Reversible Learning". arXiv:1502.03492 [stat.ML].

[gorissen-9] Gorissen, Dirk; Crombecq, Karel; Couckuyt, Ivo; Demeester, Piet; Dhaene, Tom (2010). "A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design" (PDF). J. Machine Learning Research. 11: 2051–2055.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]