
DeepBoost[1] is a machine learning algorithm for boosting, introduced by Corinna Cortes, Mehryar Mohri and Umar Syed in 2014. It is an ensemble learning algorithm that can achieve high accuracy without overfitting, even when deep decision trees or members of other complex families are used as base classifiers. DeepBoost selects its hypotheses using a capacity-conscious criterion, which is the key to achieving higher accuracy than other ensemble-based learning techniques: rather than minimizing training error alone, it directly minimizes a data-dependent learning bound. DeepBoost has been observed to outperform AdaBoost, logistic regression and their L1-regularized variants in most reported experiments. The framework also yields similar theoretical guarantees for multi-class problems, and these guarantees lead to a family of new multi-class deep boosting algorithms.

The theoretical analysis and the design of the DeepBoost algorithm can be extended to other loss functions and to ranking. The framework also generalizes existing algorithms to the use of a complex hypothesis set H, which is essentially a union of families of different complexities, measured by their Rademacher complexity. The extension can further handle convex surrogate losses that are non-differentiable, such as the hinge loss. The analysis behind DeepBoost can also help answer questions about the theory underlying AdaBoost. AdaBoost's performance is primarily explained by a margin guarantee[2][3]; however, it does not exactly maximize the minimum margin, while algorithms such as arc-gv[4] that are designed to achieve this do not perform better than AdaBoost[5]. Two main reasons could explain this observation: (1) to obtain a better margin, algorithms such as arc-gv may tend to select deeper decision trees, or more complex hypotheses in general, which hurts their generalization; (2) even though these algorithms achieve a better minimum margin, the overall margin distribution may not be improved. The theory behind DeepBoost helps understand and evaluate the effect of factor (1), since its learning bounds depend on the mixture weights and on each hypothesis set H_k's contribution to the definition of the ensemble function. The same guarantees also suggest an improved algorithm, DeepBoost.


Overview

Ensemble-based methods combine several weak predictors or learners to create a stronger, more accurate classifier. The most popular techniques are bagging, boosting, stacking, error-correction techniques, Bayesian averaging and other averaging schemes[4][2][6][7][8]. Ensemble methods often deliver better test accuracy and admit favorable learning guarantees. The performance guarantees of algorithms like AdaBoost and its variants are expressed in terms of the margins achieved on the training data, and the algorithms themselves rest on a detailed theoretical analysis[2][3].


Popular ensemble-based algorithms like AdaBoost use a combination of functions selected from a hypothesis set H of weak classifiers, the base classifiers. In typical applications of AdaBoost, the hypothesis set H is reduced to boosting stumps, that is decision trees of depth one. For complicated tasks such as speech recognition or image processing, simple stumps are not sufficient to reach a high accuracy. One may then want to resort to a more complex hypothesis set, for example decision trees of much larger depth. However, the existing learning guarantees depend not only on the number of training samples and the margin, but also on the complexity of H, measured in terms of its VC-dimension or its Rademacher complexity[2][3]. If a very complex hypothesis set H is used, the learning bounds become looser. These bounds point to a risk of overfitting, which has indeed been observed in experiments with AdaBoost[9][10][11][12].
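The notion of a boosting stump can be made concrete with a small sketch. The following is a minimal NumPy example (illustrative only, not taken from the paper; the function names and interface are assumptions) of fitting a depth-one decision tree that minimizes a weighted classification error, which is the kind of weak learner AdaBoost typically uses.

  import numpy as np

  def fit_stump(X, y, w):
      """Fit a decision stump (depth-one tree) minimizing the weighted error.

      X : (m, d) feature matrix, y : (m,) labels in {-1, +1},
      w : (m,) non-negative sample weights summing to 1.
      Returns (feature index, threshold, polarity, weighted error).
      """
      m, d = X.shape
      best = (0, 0.0, 1, np.inf)
      for j in range(d):
          # Candidate thresholds: midpoints between consecutive sorted feature values.
          values = np.unique(X[:, j])
          thresholds = (values[:-1] + values[1:]) / 2.0 if len(values) > 1 else values
          for theta in thresholds:
              for polarity in (+1, -1):
                  pred = polarity * np.where(X[:, j] <= theta, -1, 1)
                  err = float(np.sum(w[pred != y]))
                  if err < best[3]:
                      best = (j, theta, polarity, err)
      return best

  def stump_predict(X, j, theta, polarity):
      """Predict labels in {-1, +1} with a fitted stump."""
      return polarity * np.where(X[:, j] <= theta, -1, 1)

A decision tree of depth k would split the data k times instead of once, giving a richer, and therefore riskier, family of base classifiers.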

The main goal behind DeepBoost is the design of alternative ensemble-based algorithms using a hypothesis set H that may contain members of rich families such as deep decision trees, thereby achieving higher accuracy. Let the set of base classifiers H be decomposed as the union of p disjoint families H_1, ..., H_p ordered by increasing complexity, where H_k, for k in [1, p], can be, for example, the set of decision trees of depth k, or a set of functions based on monomials of degree k.

The goal behind the design of the DeepBoost algorithm is to achieve higher accuracy by drawing the hypotheses h_t from H = H_1 ∪ ... ∪ H_p in such a manner that more weight is allocated to hypotheses drawn from families H_k of smaller complexity. This goal is related to model selection by Structural Risk Minimization (SRM)[13], but it differs from it in that the base classifier set is not restricted to a single optimal H_k. Deep decision trees with a large depth k can still be used, only less frequently, or with small weights assigned to them. This leaves flexibility for learning with deep hypotheses, as sketched below.
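As an illustration of this capacity-conscious weighting, the short sketch below (an illustrative example, not from the paper) groups the hypotheses of an ensemble by the index k of the family H_k they belong to and reports the total mixture weight placed on each family; in a DeepBoost ensemble most of the mass is expected to sit on the simpler families.

  from collections import defaultdict

  def weight_per_family(alphas, family_indices):
      """Total mixture weight assigned to each family H_k.

      alphas         : list of non-negative mixture weights alpha_t
      family_indices : list of the family index k_t of each hypothesis h_t
      """
      mass = defaultdict(float)
      for a, k in zip(alphas, family_indices):
          mass[k] += a
      return dict(sorted(mass.items()))

  # Example: most weight on depth-1 trees, little weight on a depth-4 tree.
  print(weight_per_family([0.5, 0.25, 0.125, 0.125], [1, 1, 2, 4]))
  # {1: 0.75, 2: 0.125, 4: 0.125}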

Learning Guarantees

Boosting, bagging and other such non-negative linear ensemble methods assume that all the weak learners are chosen from the same hypothesis set H. The first margin-based generalization bounds for ensembles of base functions taking values in {-1, +1} were expressed in terms of the VC-dimension of H[2]. Later, tighter margin bounds with simpler proofs were given in terms of the Rademacher complexity of H, in particular for families of functions taking values in [-1, +1]. The complexity term of these bounds depends directly on the mixture coefficients defining the ensemble. Let the family of ensembles be F = conv(H_1 ∪ ... ∪ H_p), that is the family of functions f of the form f = Σ_{t=1}^T α_t h_t, where α = (α_1, ..., α_T) belongs to the simplex Δ and where, for each t in [1, T], h_t is in H_{k_t} for some k_t in [1, p].
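The following tiny sketch (an illustrative assumption: base classifiers given as Python callables returning values in [-1, +1]) simply evaluates such a convex-combination ensemble f = Σ_t α_t h_t at a point x.

  def ensemble_predict(x, alphas, hypotheses):
      """Evaluate f(x) = sum_t alpha_t * h_t(x) for mixture weights on the simplex."""
      assert abs(sum(alphas) - 1.0) < 1e-9 and all(a >= 0 for a in alphas)
      return sum(a * h(x) for a, h in zip(alphas, hypotheses))

  # Example with two trivial base classifiers.
  def h1(x): return 1.0 if x[0] > 0 else -1.0
  def h2(x): return 1.0 if x[1] > 0 else -1.0
  print(ensemble_predict((0.5, -2.0), [0.75, 0.25], [h1, h2]))   # 0.75 - 0.25 = 0.5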

Let X denote the input space and let H_1, ..., H_p be families of functions mapping from X to [-1, +1]. Points, both training and test, are drawn independently and identically distributed according to some distribution D over X × {-1, +1}. Let S = ((x_1, y_1), ..., (x_m, y_m)) be a training sample of size m drawn according to D^m. For a function f taking values in [-1, +1] and a margin ρ > 0, let R(f) denote the binary classification error of f, R_ρ(f) its ρ-margin error, and R̂_{S,ρ}(f) its empirical margin error:

    R(f) = \Pr_{(x,y) \sim \mathcal{D}}[y f(x) \leq 0], \quad R_\rho(f) = \Pr_{(x,y) \sim \mathcal{D}}[y f(x) \leq \rho], \quad \hat{R}_{S,\rho}(f) = \Pr_{(x,y) \sim S}[y f(x) \leq \rho],

where (x, y) ∼ S indicates that (x, y) is drawn according to the empirical distribution defined by the sample S. The margin-based Rademacher complexity learning bound states that, with probability at least 1 − δ over the draw of S, the following holds for all f = Σ_{t=1}^T α_t h_t in F:

    R(f) \leq \hat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t \, \mathfrak{R}_m(H_{k_t}) + \tilde{O}\!\left(\sqrt{\frac{\log p}{\rho^2 m}}\right),

with ℜ_m(H_k) the Rademacher complexity of the family H_k and α = (α_1, ..., α_T) in the simplex. Thus, the bound depends on the mixture coefficients α_t. This implies that even if the Rademacher complexity of some family H_k is large, it may not hamper generalization as long as the total mixture weight assigned to hypotheses from that H_k, that is the sum of the α_t with h_t ∈ H_k, is relatively small.
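The two data-dependent terms of this bound can be computed directly from a trained ensemble. Below is a small illustrative sketch (not from the paper; the complexity values r_k are assumed to be given, for example as upper bounds on ℜ_m(H_k), and all names are illustrative).

  import numpy as np

  def empirical_margin_error(scores, y, rho):
      """Fraction of training points with margin y_i * f(x_i) <= rho."""
      return float(np.mean(y * scores <= rho))

  def capacity_term(alphas, family_indices, r, rho):
      """(4 / rho) * sum_t alpha_t * r_{k_t}, with r[k] a complexity bound for H_k."""
      return (4.0 / rho) * sum(a * r[k] for a, k in zip(alphas, family_indices))

  # Toy example: ensemble scores f(x_i) on five training points.
  scores = np.array([0.9, 0.4, -0.2, 0.7, 0.05])
  y      = np.array([1,   1,   -1,   1,   -1])
  print(empirical_margin_error(scores, y, rho=0.5))   # 3 of 5 margins <= 0.5, so 0.6
  print(capacity_term([0.6, 0.3, 0.1], [1, 1, 3], {1: 0.02, 3: 0.10}, rho=0.5))

Keeping the capacity term small while keeping the empirical margin error small is exactly the trade-off that the DeepBoost objective formalizes.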

Algorithm

The capacity-conscious DeepBoost algorithm is derived by applying coordinate descent to a convex objective function based on the learning bound above.

Optimization Problem

Let H_1, ..., H_p be p disjoint families of functions taking values in [-1, +1], with increasing Rademacher complexities ℜ_m(H_k), k in [1, p]. The hypothesis sets H_k are assumed to be symmetric, which means that for any h in H_k, −h is also in H_k. For any hypothesis h in H_1 ∪ ... ∪ H_p, d(h) denotes the index of the set it belongs to, that is h ∈ H_{d(h)}. The learning bound above holds uniformly for all ρ > 0 and all functions f in conv(H_1 ∪ ... ∪ H_p). Since the last term of the bound does not depend on α, α must be chosen so as to minimize

    G(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \leq \rho} + \frac{4}{\rho} \sum_{t=1}^{T} \alpha_t r_t,

where r_t = ℜ_m(H_{d(h_t)}) and α belongs to the simplex Δ. Since, for any ρ > 0, f and f/ρ admit the same generalization error, the search can instead be performed over α ≥ 0 with Σ_{t=1}^T α_t ≤ 1/ρ, which gives

    \min_{\alpha \geq 0} \; \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \leq 1} + 4 \sum_{t=1}^{T} \alpha_t r_t

such that Σ_{t=1}^T α_t ≤ 1/ρ. To simplify the minimization of this objective function, which is not convex, a convex upper bound is considered instead. Let Φ be a convex function such that u ↦ Φ(−u) is non-increasing and upper bounds u ↦ 1_{u ≤ 0}, with Φ differentiable over ℝ and Φ'(u) ≠ 0 for all u. Φ can be the exponential function, Φ(u) = exp(u), as in AdaBoost[2], or the logistic function, Φ(u) = log_2(1 + e^u), as in logistic regression. This upper bound gives the following convex optimization problem:

    \min_{\alpha \geq 0} \; \frac{1}{m} \sum_{i=1}^{m} \Phi\!\left(1 - y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right) + \lambda \sum_{t=1}^{T} \alpha_t r_t

such that Σ_{t=1}^T α_t ≤ 1/ρ, where the parameter λ ≥ 0 controls the balance between the second term and the magnitude of the values taken by Φ. Let β ≥ 0 be a Lagrange variable associated with this constraint; the equivalent problem is then

    \min_{\alpha \geq 0} \; \frac{1}{m} \sum_{i=1}^{m} \Phi\!\left(1 - y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right) + \sum_{t=1}^{T} (\lambda r_t + \beta)\, \alpha_t,

where β can be freely chosen by the algorithm, since any choice of β is equivalent to a choice of ρ. Let {h_1, ..., h_N} be the set of distinct base functions in H_1 ∪ ... ∪ H_p. The objective function, denoted F and defined for all α = (α_1, ..., α_N) in ℝ^N, is

    F(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \Phi\!\left(1 - y_i \sum_{j=1}^{N} \alpha_j h_j(x_i)\right) + \sum_{j=1}^{N} \Lambda_j \alpha_j,

with f = Σ_{j=1}^N α_j h_j and Λ_j = λ r_j + β, where r_j = ℜ_m(H_{d(h_j)}). The condition α ≥ 0 can be dropped thanks to the symmetry of the hypothesis sets (with α_j then replaced by |α_j| in the regularization term), and coordinate descent can then be applied to the objective function F.
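Before detailing the coordinate descent step, here is a minimal NumPy sketch of the objective F for the exponential choice Φ(u) = exp(u) (illustrative only; it assumes the base-classifier predictions are precomputed in an m × N matrix P with entries P[i, j] = h_j(x_i), and that the complexity values r_j are given).

  import numpy as np

  def objective_F(alpha, P, y, r, lam, beta):
      """F(alpha) = (1/m) sum_i Phi(1 - y_i f(x_i)) + sum_j Lambda_j |alpha_j|,
      with f = sum_j alpha_j h_j, Lambda_j = lam * r_j + beta and Phi(u) = exp(u)."""
      alpha = np.asarray(alpha, dtype=float)
      f = P @ alpha                                   # f(x_i) for every training point
      margins = y * f                                 # y_i f(x_i)
      Lam = lam * np.asarray(r, dtype=float) + beta   # per-hypothesis penalties Lambda_j
      return float(np.exp(1.0 - margins).mean() + np.dot(Lam, np.abs(alpha)))

The logistic choice would simply replace np.exp(1.0 - margins) with np.log2(1.0 + np.exp(1.0 - margins)).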

Let α_{t−1} = (α_{t−1,1}, ..., α_{t−1,N}) denote the vector obtained after t − 1 iterations, with α_0 = 0. Let e_k denote the k-th unit vector in ℝ^N, k in [1, N]. The direction e_k and the step η selected at the t-th round are those minimizing F(α_{t−1} + η e_k), that is

    F(\alpha_{t-1} + \eta e_k) = \frac{1}{m} \sum_{i=1}^{m} \Phi\!\left(1 - y_i f_{t-1}(x_i) - \eta\, y_i h_k(x_i)\right) + \sum_{j \neq k} \Lambda_j |\alpha_{t-1,j}| + \Lambda_k |\alpha_{t-1,k} + \eta|,

where f_{t−1} = Σ_{j=1}^N α_{t−1,j} h_j. For any t in [1, T], D_t is the distribution over the indices {1, ..., m} defined by

    D_t(i) = \frac{\Phi'\!\left(1 - y_i f_{t-1}(x_i)\right)}{S_t},

where S_t is a normalization factor, S_t = Σ_{i=1}^m Φ'(1 − y_i f_{t−1}(x_i)). For any t in [1, T] and j in [1, N], the weighted error ε_{t,j} of hypothesis h_j for the distribution D_t is given by

    \epsilon_{t,j} = \frac{1}{2}\left(1 - \mathbb{E}_{i \sim D_t}\!\left[y_i h_j(x_i)\right]\right).
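A short NumPy sketch (illustrative; same precomputed prediction matrix P as above, and the exponential loss so that Φ'(u) = exp(u)) of the distribution D_t and of the weighted errors ε_{t,j}:

  import numpy as np

  def distribution_and_errors(alpha, P, y):
      """Return D_t over the sample, the weighted errors eps_{t,j}, and S_t."""
      f = P @ np.asarray(alpha, dtype=float)   # f_{t-1}(x_i)
      phi_prime = np.exp(1.0 - y * f)          # Phi'(1 - y_i f_{t-1}(x_i)) for Phi = exp
      S = float(phi_prime.sum())               # normalization factor S_t
      D = phi_prime / S                        # distribution D_t(i)
      eps = 0.5 * (1.0 - (D * y) @ P)          # eps_{t,j} = (1 - E_{i~D_t}[y_i h_j(x_i)]) / 2
      return D, eps, S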

DeepBoost

The pseudocode below corresponds to the objective obtained with the exponential loss Φ(u) = exp(u): lines 4-9 compute the per-coordinate subgradients d_j that determine the descent direction, and lines 12-16 give the closed-form step size η_t.

DEEPBOOST(S = ((x_1, y_1), ..., (x_m, y_m)))

 1   for i ← 1 to m do
 2        D_1(i) ← 1/m
 3   for t ← 1 to T do
 4        for j ← 1 to N do
 5             if (α_{t-1,j} ≠ 0) then
 6                  d_j ← ε_{t,j} − 1/2 + sgn(α_{t-1,j}) Λ_j m / (2 S_t)
 7             elseif (|ε_{t,j} − 1/2| ≤ Λ_j m / (2 S_t)) then
 8                  d_j ← 0
 9             else d_j ← ε_{t,j} − 1/2 − sgn(ε_{t,j} − 1/2) Λ_j m / (2 S_t)
10        k ← argmax_j |d_j|
11        ε_t ← ε_{t,k}
12        if (|(1 − ε_t) e^{α_{t-1,k}} − ε_t e^{−α_{t-1,k}}| ≤ Λ_k m / S_t) then
13             η_t ← −α_{t-1,k}
14        elseif ((1 − ε_t) e^{α_{t-1,k}} − ε_t e^{−α_{t-1,k}} > Λ_k m / S_t) then
15             η_t ← log[ −Λ_k m / (2 ε_t S_t) + √( (Λ_k m / (2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
16        else η_t ← log[ Λ_k m / (2 ε_t S_t) + √( (Λ_k m / (2 ε_t S_t))² + (1 − ε_t)/ε_t ) ]
17        α_t ← α_{t-1} + η_t e_k
18        S_{t+1} ← Σ_{i=1}^m Φ′(1 − y_i f_t(x_i)),  where f_t = Σ_{j=1}^N α_{t,j} h_j
19        for i ← 1 to m do
20             D_{t+1}(i) ← Φ′(1 − y_i f_t(x_i)) / S_{t+1}
21   f ← Σ_{j=1}^N α_{T,j} h_j
22   return f
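Below is a minimal NumPy sketch of the pseudocode above for the exponential loss (illustrative, not the authors' reference implementation; it assumes the N candidate base classifiers are supplied as a precomputed m × N prediction matrix P with entries h_j(x_i) in {−1, +1}, and that the penalties Λ_j = λ r_j + β are given).

  import numpy as np

  def deepboost_exp(P, y, Lam, T):
      """Capacity-conscious coordinate descent (DeepBoost-style) with Phi(u) = exp(u).

      P   : (m, N) matrix of base-classifier predictions h_j(x_i) in {-1, +1}
      y   : (m,) labels in {-1, +1}
      Lam : (N,) penalties Lambda_j = lambda * r_j + beta, larger for more complex H_k
      T   : number of boosting rounds
      Returns the weight vector alpha = (alpha_1, ..., alpha_N) after T rounds.
      """
      P, y, Lam = (np.asarray(a, dtype=float) for a in (P, y, Lam))
      m, N = P.shape
      alpha = np.zeros(N)
      for t in range(T):
          # Distribution D_t and normalizer S_t induced by the current ensemble f_{t-1}.
          phi_prime = np.exp(1.0 - y * (P @ alpha))   # Phi'(1 - y_i f_{t-1}(x_i))
          S = phi_prime.sum()
          D = phi_prime / S
          eps = 0.5 * (1.0 - (D * y) @ P)             # weighted errors eps_{t,j}
          # Per-coordinate subgradient scores d_j (lines 4-9 of the pseudocode).
          shift = Lam * m / (2.0 * S)
          d = np.where(alpha != 0,
                       eps - 0.5 + np.sign(alpha) * shift,
                       np.where(np.abs(eps - 0.5) <= shift, 0.0,
                                eps - 0.5 - np.sign(eps - 0.5) * shift))
          k = int(np.argmax(np.abs(d)))               # best descent direction e_k
          e_t, a_k = eps[k], alpha[k]                 # assumes 0 < eps_{t,k} < 1
          lhs = (1.0 - e_t) * np.exp(a_k) - e_t * np.exp(-a_k)
          bound = Lam[k] * m / S
          # Closed-form step eta_t (lines 12-16 of the pseudocode).
          if abs(lhs) <= bound:
              eta = -a_k
          else:
              c = Lam[k] * m / (2.0 * e_t * S)
              root = np.sqrt(c * c + (1.0 - e_t) / e_t)
              eta = np.log(-c + root) if lhs > bound else np.log(c + root)
          alpha[k] += eta                             # alpha_t = alpha_{t-1} + eta_t e_k
      return alpha

The learned classifier is x ↦ sgn(Σ_j α_j h_j(x)). With all Λ_j set to zero, the update reduces to the familiar AdaBoost step η_t = (1/2) log((1 − ε_t)/ε_t), which is consistent with DeepBoost containing AdaBoost as a special case.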


References

  1. Cortes, Corinna; Mohri, Mehryar; Syed, Umar (2014). "Deep Boosting". ICML.
  2. Schapire, Robert; Freund, Yoav; Bartlett, Peter; Lee, Wee Sun (1997). "Boosting the margin: A new explanation for the effectiveness of voting methods". ICML: 322–330.
  3. Koltchinskii, Vladimir; Panchenko, Dmitry (2002). "Empirical margin distributions and bounding the generalization error of combined classifiers". Annals of Statistics (30).
  4. Breiman, Leo (1996). "Bagging predictors". Machine Learning. 24 (2): 123–140.
  5. Reyzin, Lev; Schapire, Robert (2006). "How boosting the margin can also boost classifier complexity". ICML: 753–760.
  6. Smyth, Padhraic; Wolpert, David (July 1999). "Linearly combining density estimators via stacking". Machine Learning (36): 59–83.
  7. MacKay, David. "Bayesian methods for adaptive models" (PDF).
  8. Freund, Yoav; Mansour, Yishay; Schapire, Robert (2004). "Generalization bounds for averaged classifiers". The Annals of Statistics (32): 1698–1722.
  9. Grove, Adam; Schuurmans, Dale (1998). "Boosting in the limit: Maximizing the margin of learned ensembles". AAAI/IAAI: 692–699.
  10. Schapire, Robert (1999). "Theoretical views of boosting and applications". Lecture Notes in Computer Science. 1720: 13–25.
  11. Dietterich, Thomas (2000). "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization". Machine Learning. 40 (2): 139–157.
  12. Rätsch, Gunnar; Onoda, Takashi; Müller, Klaus-Robert (2001). "Soft margins for AdaBoost". Machine Learning. 42 (3): 287–320.
  13. Vapnik, Vladimir (1998). Statistical Learning Theory. Wiley-Interscience.

External links

  • Cortes, Corinna; Mohri, Mehryar; Syed, Umar. "Deep Boosting" (PDF). The ICML 2014 paper.
  • Mohri, Mehryar; Syed, Umar. "Deep Boosting" (PDF). Presentation slides.
  • Cortes, Corinna; Syed, Umar. "Deep Boosting" (PDF). Presentation slides.
  • Kuznetsov, Vitaly; Mohri, Mehryar; Syed, Umar. "Multi-Class Deep Boosting" (PDF). Paper.
  • Cortes, Corinna; Mohri, Mehryar; Syed, Umar. "Deep Boosting". Video tech talk.
  • Cortes, Corinna; Mohri, Mehryar; Syed, Umar. "Code for deep boosting". google/deepboost on GitHub.