(Redirected from Boosted trees)

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function.[1] Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman,[2][3] simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.[4][5] The latter two papers introduced the view of boosting algorithms as iterative functional gradient descent algorithms. That is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.

## Informal introduction

(This section follows the exposition of gradient boosting by Li.[6])

Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model ${\displaystyle F}$ to predict values of the form ${\displaystyle {\hat {y}}=F(x)}$ by minimizing the mean squared error ${\displaystyle {\tfrac {1}{n}}\sum _{i}({\hat {y}}_{i}-y_{i})^{2}}$, where ${\displaystyle i}$ indexes over some training set of size ${\displaystyle n}$ of actual values of the output variable ${\displaystyle y}$.

At each stage ${\displaystyle m}$, ${\displaystyle 1\leq m\leq M}$, of gradient boosting, it may be assumed that there is some imperfect model ${\displaystyle F_{m}}$ (at the outset, a very weak model that just predicts the mean y in the training set could be used). The gradient boosting algorithm improves on ${\displaystyle F_{m}}$ by constructing a new model that adds an estimator h to provide a better model: ${\displaystyle F_{m+1}(x)=F_{m}(x)+h(x)}$. To find ${\displaystyle h}$, the gradient boosting solution starts with the observation that a perfect h would imply

${\displaystyle F_{m+1}(x)=F_{m}(x)+h(x)=y}$

or, equivalently,

${\displaystyle h(x)=y-F_{m}(x)}$.

Therefore, gradient boosting will fit h to the residual ${\displaystyle y-F_{m}(x)}$. As in other boosting variants, each ${\displaystyle F_{m+1}}$ attempts to correct the errors of its predecessor ${\displaystyle F_{m}}$. A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that residuals ${\displaystyle y-F(x)}$ for a given model are the negative gradients (with respect to ${\displaystyle F(x)}$) of the squared error loss function ${\displaystyle {\frac {1}{2}}(y-F(x))^{2}}$. So, gradient boosting is a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.

## Algorithm

In many supervised learning problems one has an output variable y and a vector of input variables x described via a joint probability distribution ${\displaystyle P(x,y)}$. Using a training set ${\displaystyle \{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}}$ of known values of x and corresponding values of y, the goal is to find an approximation ${\displaystyle {\hat {F}}(x)}$ to a function ${\displaystyle F(x)}$ that minimizes the expected value of some specified loss function ${\displaystyle L(y,F(x))}$:

${\displaystyle {\hat {F}}={\underset {F}{\arg \min }}\,\mathbb {E} _{x,y}[L(y,F(x))]}$.

The gradient boosting method assumes a real-valued y and seeks an approximation ${\displaystyle {\hat {F}}(x)}$ in the form of a weighted sum of functions ${\displaystyle h_{i}(x)}$ from some class ${\displaystyle {\mathcal {H}}}$, called base (or weak) learners:

${\displaystyle {\hat {F}}(x)=\sum _{i=1}^{M}\gamma _{i}h_{i}(x)+{\mbox{const}}}$.

In accordance with the empirical risk minimization principle, the method tries to find an approximation ${\displaystyle {\hat {F}}(x)}$ that minimizes the average value of the loss function on the training set, i.e., minimizes the empirical risk. It does so by starting with a model, consisting of a constant function ${\displaystyle F_{0}(x)}$, and incrementally expands it in a greedy fashion:

${\displaystyle F_{0}(x)={\underset {\gamma }{\arg \min }}{\sum _{i=1}^{n}{L(y_{i},\gamma )}}}$,
${\displaystyle F_{m}(x)=F_{m-1}(x)+{\underset {h_{m}\in {\mathcal {H}}}{\operatorname {arg\,min} }}\left[{\sum _{i=1}^{n}{L(y_{i},F_{m-1}(x_{i})+h_{m}(x_{i}))}}\right]}$,

where ${\displaystyle h_{m}\in {\mathcal {H}}}$ is a base learner function.

Unfortunately, choosing the best function h at each step for an arbitrary loss function L is a computationally infeasible optimization problem in general. Therefore, we restrict our approach to a simplified version of the problem.

The idea is to apply a steepest descent step to this minimization problem. If we considered the continuous case, i.e. where ${\displaystyle {\mathcal {H}}}$ is the set of arbitrary differentiable functions on ${\displaystyle \mathbb {R} }$, we would update the model in accordance with the following equations

${\displaystyle F_{m}(x)=F_{m-1}(x)-\gamma _{m}\sum _{i=1}^{n}{\nabla _{F_{m-1}}L(y_{i},F_{m-1}(x_{i}))},}$
${\displaystyle \gamma _{m}={\underset {\gamma }{\arg \min }}{\sum _{i=1}^{n}{L\left(y_{i},F_{m-1}(x_{i})-\gamma \nabla _{F_{m-1}}L(y_{i},F_{m-1}(x_{i}))\right)}},}$

where the derivatives are taken with respect to the functions ${\displaystyle F_{i}}$ for ${\displaystyle i\in \{1,..,m\}}$. In the discrete case however, i.e. when the set ${\displaystyle {\mathcal {H}}}$ is finite, we choose the candidate function h closest to the gradient of L for which the coefficient γ may then be calculated with the aid of line search on the above equations. Note that this approach is a heuristic and therefore doesn't yield an exact solution to the given problem, but rather an approximation. In pseudocode, the generic gradient boosting method is:[2][7]

Input: training set ${\displaystyle \{(x_{i},y_{i})\}_{i=1}^{n},}$ a differentiable loss function ${\displaystyle L(y,F(x)),}$ number of iterations M.

Algorithm:

1. Initialize model with a constant value:
${\displaystyle F_{0}(x)={\underset {\gamma }{\arg \min }}\sum _{i=1}^{n}L(y_{i},\gamma ).}$
2. For m = 1 to M:
1. Compute so-called pseudo-residuals:
${\displaystyle r_{im}=-\left[{\frac {\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}}\right]_{F(x)=F_{m-1}(x)}\quad {\mbox{for }}i=1,\ldots ,n.}$
2. Fit a base learner (e.g. tree) ${\displaystyle h_{m}(x)}$ to pseudo-residuals, i.e. train it using the training set ${\displaystyle \{(x_{i},r_{im})\}_{i=1}^{n}}$.
3. Compute multiplier ${\displaystyle \gamma _{m}}$ by solving the following one-dimensional optimization problem:
${\displaystyle \gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L\left(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})\right).}$
4. Update the model:
${\displaystyle F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x).}$
3. Output ${\displaystyle F_{M}(x).}$

Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to gradient boosting method which improves the quality of fit of each base learner.

Generic gradient boosting at the m-th step would fit a decision tree ${\displaystyle h_{m}(x)}$ to pseudo-residuals. Let ${\displaystyle J_{m}}$ be the number of its leaves. The tree partitions the input space into ${\displaystyle J_{m}}$ disjoint regions ${\displaystyle R_{1m},\ldots ,R_{J_{m}m}}$ and predicts a constant value in each region. Using the indicator notation, the output of ${\displaystyle h_{m}(x)}$ for input x can be written as the sum:

${\displaystyle h_{m}(x)=\sum _{j=1}^{J_{m}}b_{jm}\mathbf {1} _{R_{jm}}(x),}$

where ${\displaystyle b_{jm}}$ is the value predicted in the region ${\displaystyle R_{jm}}$.[8]

Then the coefficients ${\displaystyle b_{jm}}$ are multiplied by some value ${\displaystyle \gamma _{m}}$, chosen using line search so as to minimize the loss function, and the model is updated as follows:

${\displaystyle F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x),\quad \gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})).}$

Friedman proposes to modify this algorithm so that it chooses a separate optimal value ${\displaystyle \gamma _{jm}}$ for each of the tree's regions, instead of a single ${\displaystyle \gamma _{m}}$ for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients ${\displaystyle b_{jm}}$ from the tree-fitting procedure can be then simply discarded and the model update rule becomes:

${\displaystyle F_{m}(x)=F_{m-1}(x)+\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbf {1} _{R_{jm}}(x),\quad \gamma _{jm}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{x_{i}\in R_{jm}}L(y_{i},F_{m-1}(x_{i})+\gamma ).}$

### Size of trees

${\displaystyle J}$, the number of terminal nodes in trees, is the method's parameter which can be adjusted for a data set at hand. It controls the maximum allowed level of interaction between variables in the model. With ${\displaystyle J=2}$ (decision stumps), no interaction between variables is allowed. With ${\displaystyle J=3}$ the model may include effects of the interaction between up to two variables, and so on.

Hastie et al.[7] comment that typically ${\displaystyle 4\leq J\leq 8}$ work well for boosting and results are fairly insensitive to the choice of ${\displaystyle J}$ in this range, ${\displaystyle J=2}$ is insufficient for many applications, and ${\displaystyle J>10}$ is unlikely to be required.

## Regularization

Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.

One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.

Another regularization parameter is the depth of the trees. The higher this value the more likely the model will overfit the training data.

### Shrinkage

An important part of gradient boosting method is regularization by shrinkage which consists in modifying the update rule as follows:

${\displaystyle F_{m}(x)=F_{m-1}(x)+\nu \cdot \gamma _{m}h_{m}(x),\quad 0<\nu \leq 1,}$

where parameter ${\displaystyle \nu }$ is called the "learning rate".

Empirically it has been found that using small learning rates (such as ${\displaystyle \nu <0.1}$) yields dramatic improvements in models' generalization ability over gradient boosting without shrinking (${\displaystyle \nu =1}$).[7] However, it comes at the price of increasing computational time both during training and querying: lower learning rate requires more iterations.

Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bootstrap aggregation ("bagging") method.[3] Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement.[9] Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.

Subsample size is some constant fraction ${\displaystyle f}$ of the size of the training set. When ${\displaystyle f=1}$, the algorithm is deterministic and identical to the one described above. Smaller values of ${\displaystyle f}$ introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Friedman[3] obtained that ${\displaystyle 0.5\leq f\leq 0.8}$ leads to good results for small and moderate sized training sets. Therefore, ${\displaystyle f}$ is typically set to 0.5, meaning that one half of the training set is used to build each base learner.

Also, like in bagging, subsampling allows one to define an out-of-bag error of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.[10][11]

### Number of observations in leaves

Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes (this parameter is called n.minobsinnode in the R gbm package[10]). It is used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances.

Imposing this limit helps to reduce variance in predictions at leaves.

### Penalize Complexity of Tree

Another useful regularization techniques for gradient boosted trees is to penalize model complexity of the learned model.[12] The model complexity can be defined as the proportional number of leaves in the learned trees. The joint optimization of loss and model complexity corresponds to a post-pruning algorithm to remove branches that fail to reduce the loss by a threshold. Other kinds of regularization such as an ${\displaystyle \ell _{2}}$ penalty on the leaf values can also be added to avoid overfitting.

## Usage

Gradient boosting can be used in the field of learning to rank. The commercial web search engines Yahoo[13] and Yandex[14] use variants of gradient boosting in their machine-learned ranking engines.

## Names

The method goes by a variety of names. Friedman introduced his regression technique as a "Gradient Boosting Machine" (GBM).[2] Mason, Baxter et al. described the generalized abstract class of algorithms as "functional gradient boosting".[4][5] Friedman et al. describe an advancement of gradient boosted models as Multiple Additive Regression Trees (MART);[15] Elith et al. describe that approach as "Boosted Regression Trees" (BRT).[16]

A popular open-source implementation for R calls it a "Generalized Boosting Model",[10] however packages expanding this work use BRT.[17] Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.[citation needed]

## References

1. ^ Breiman, L. (June 1997). "Arcing The Edge" (PDF). Technical Report 486. Statistics Department, University of California, Berkeley.
2. ^ a b c Friedman, J. H. (February 1999). "Greedy Function Approximation: A Gradient Boosting Machine" (PDF). Cite journal requires |journal= (help)
3. ^ a b c Friedman, J. H. (March 1999). "Stochastic Gradient Boosting" (PDF). Cite journal requires |journal= (help)
4. ^ a b Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (1999). "Boosting Algorithms as Gradient Descent" (PDF). In S.A. Solla and T.K. Leen and K. Müller (ed.). Advances in Neural Information Processing Systems 12. MIT Press. pp. 512–518.
5. ^ a b Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). "Boosting Algorithms as Gradient Descent in Function Space" (PDF). Cite journal requires |journal= (help)
6. ^ Cheng Li. "A Gentle Introduction to Gradient Boosting" (PDF).
7. ^ a b c Hastie, T.; Tibshirani, R.; Friedman, J. H. (2009). "10. Boosting and Additive Trees". The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 337–384. ISBN 978-0-387-84857-0. Archived from the original on 2009-11-10. Cite uses deprecated parameter |deadurl= (help)
8. ^ Note: in case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient ${\displaystyle b_{jm}}$ for the region ${\displaystyle R_{jm}}$ is equal to just the value of output variable, averaged over all training instances in ${\displaystyle R_{jm}}$.
9. ^ Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.
10. ^ a b c Ridgeway, Greg (2007). Generalized Boosted Models: A guide to the gbm package.
11. ^ Learn Gradient Boosting Algorithm for better predictions (with codes in R)
12. ^ Tianqi Chen. Introduction to Boosted Trees
13. ^ Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking Archived 2010-08-07 at the Wayback Machine, page 14.
14. ^
15. ^ Friedman, Jerome (2003). "Multiple Additive Regression Trees with Application in Epidemiology". Statistics in Medicine. 22 (9): 1365–1381. doi:10.1002/sim.1501. PMID 12704603.
16. ^ Elith, Jane (2008). "A working guide to boosted regression trees". Journal of Animal Ecology. 77 (4): 802–813. doi:10.1111/j.1365-2656.2008.01390.x. PMID 18397250.
17. ^ Elith, Jane. "Boosted Regression Trees for ecological modeling" (PDF). CRAN. CRAN. Retrieved 31 August 2018.