Boosting Algorithms

Josh Day
February 5, 2014

Outline

  • Motivation: Ensemble Methods and Related Algorithms
  • AdaBoost.M1 Algorithm
  • Gradient Boosting/Functional Gradient Descent
  • Loss Functions and Boosting Algorithms
  • \( L_2 \) Boosting
  • Boosting in R

Ensemble Methods and Related Algorithms

Ensemble Methods

  • Build a prediction model by combining a collection of simpler base models
  • Ensemble learning can be broken down into two tasks:
    • developing base learners from training data
    • combining them to form the final predictor
  • Ideal base learners have high variance, low bias
    • Averaging base learners lowers variance and leaves bias unchanged
  • “A weak classifier is one whose error rate is only slightly better than random guessing”[2]
    • Standard choice is decision tree with few splits

Bagging (Breiman, 1994)

adabag package in R

  1. Bootstrap to generate B training sets.
  2. Build predictor on each set.
  3. Take average/majority vote to get final predictor.

[2]
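A from-scratch sketch of these three steps for a regression problem, using rpart trees as the base learners (the names bag_trees and predict_bagged are illustrative, not part of the adabag API):

library(rpart)

bag_trees <- function(formula, data, B = 100) {
  lapply(seq_len(B), function(b) {
    idx <- sample(nrow(data), replace = TRUE)   # 1. bootstrap training set
    rpart(formula, data = data[idx, ])          # 2. build a tree on it
  })
}

predict_bagged <- function(trees, newdata) {
  preds <- sapply(trees, predict, newdata = newdata)
  rowMeans(preds)                               # 3. average the B predictions
}

For classification, step 3 would instead take the majority vote of the B predicted classes.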

Random Forest (Breiman, 2001)

randomForest package in R

  • Issue with bagging: the bagged trees (base learners) are often highly correlated.
    • The most important variables tend to be used in most base learners
    • Averaging highly correlated predictors gives a smaller reduction in variance.
  • Difference from bagging:
    • At each split, only a random subset of the predictors is considered as split candidates (usage sketch below)
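
A brief usage sketch (mtry is the number of predictors sampled as split candidates at each node; the Boston data, loaded from MASS, is just a convenient example):

library(randomForest)
library(MASS)                # Boston housing data

# each split considers a random subset of mtry predictors
Boston.rf <- randomForest(medv ~ ., data = Boston, ntree = 500, mtry = 4)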

Random Forest

[Figure from [2]]

Adaboost.M1 Algorithm

Adaptive Boosting

AdaBoost.M1

  • First successful boosting algorithm (Freund and Schapire, 1997)
  • Only for binary classification (coded \( Y_i\in\{-1,1\} \))
  • The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers \( G_m(x),\ m = 1,2,\dots,M \), whose predictions are then combined through a weighted majority vote

AdaBoost.M1

[Figure from [2]]

AdaBoost.M1 Algorithm

[Figure from [2]]
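
A minimal from-scratch sketch of AdaBoost.M1 with decision stumps (one-split rpart trees) as the weak classifiers. The helper names are illustrative; X is assumed to be a data frame of predictors and y to be coded in {-1, 1}:

library(rpart)

adaboost_m1 <- function(X, y, M = 100) {
  n <- nrow(X)
  w <- rep(1 / n, n)                                   # initialize weights w_i = 1/n
  fits <- vector("list", M)
  alpha <- numeric(M)
  df <- data.frame(X, .y = factor(y))
  for (m in seq_len(M)) {
    # (a) fit a weak classifier G_m to the weighted training data
    fits[[m]] <- rpart(.y ~ ., data = df, weights = w,
                       control = rpart.control(maxdepth = 1))
    pred <- ifelse(predict(fits[[m]], X, type = "class") == "1", 1, -1)
    err <- sum(w * (pred != y)) / sum(w)               # (b) weighted error rate
    alpha[m] <- log((1 - err) / err)                   # (c) classifier weight
    w <- w * exp(alpha[m] * (pred != y))               # (d) upweight misclassified points
  }
  list(fits = fits, alpha = alpha)
}

predict_adaboost <- function(model, X) {
  G <- sapply(model$fits, function(f)
    ifelse(predict(f, X, type = "class") == "1", 1, -1))
  sign(drop(G %*% model$alpha))                        # weighted majority vote
}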

Gradient Boosting

Gradient Boosting

Functional Gradient Descent

  • Friedman et al. (2000) and Friedman (2001) developed a statistical framework that yields a direct interpretation of boosting as a method for estimating a function.
    • It is a “stagewise, additive modeling” approach [1]
      • additive in the base learners, not necessarily in the covariates

Gradient Boosting

Functional Gradient Descent

Consider estimating a real-valued function

\[ f^0(\cdot) = \text{arg}\,\text{min}_{f(\cdot)} \; \text{E}[\;\rho(f(X), Y)\;] \]
where \( \rho(\cdot) \) is a loss function, typically assumed differentiable and convex with respect to the first argument.

Minimization is over all (measurable) functions \( f(\cdot) \)

Gradient Boosting

\[ f^0(\cdot) = \text{arg}\,\text{min}_{f(\cdot)} \; \text{E}[\;\rho(f(X), Y)\;] \]

Estimation of \( f^0(\cdot) \) with boosting is based on minimizing the empirical risk: \[ \frac{1}{n}\sum_{i=1}^n \rho(f(X_i), Y_i) \]

via iterative steepest descent in function space.

Gradient Boosting

[Figure from [1]]
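
A from-scratch sketch of the steepest descent loop under simple assumptions: X is a data frame of predictors, regression stumps serve as the base learner, the negative gradient of the loss is supplied by the user, and the step length \( \nu \) is fixed. The function names are illustrative; the mboost and gbm packages provide real implementations.

library(rpart)

# negative gradient of the loss at the current fit;
# for squared error loss this is just the residual y - f
neg_grad_L2 <- function(y, f) y - f

fgd_boost <- function(X, y, neg_grad, nu = 0.1, mstop = 100) {
  f <- rep(mean(y), nrow(X))                  # 1. initialize with an offset
  fits <- vector("list", mstop)
  for (m in seq_len(mstop)) {
    u <- neg_grad(y, f)                       # 2. pseudo-residuals (negative gradient)
    fits[[m]] <- rpart(u ~ ., data = data.frame(X, u = u),
                       control = rpart.control(maxdepth = 1))
    f <- f + nu * predict(fits[[m]], X)       # 3.-4. fit base learner, shrunken update
  }
  list(offset = mean(y), fits = fits, nu = nu, fitted = f)
}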

Gradient Boosting

  • Regularization parameters: the step length (shrinkage) \( \nu \) and the number of iterations \( m_{stop} \)
  • \( \nu \) should be “small” (e.g. 0.01)
  • Trade-off between \( \nu \) and \( m_{stop} \) (see the update below)
  • Forward Stagewise Additive Modeling [2]
    • Sequentially adding basis functions, without readjusting those already added
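
As a sketch of that trade-off, in the notation of forward stagewise additive modeling [2] (\( b(x;\gamma_m) \) is the basis function fitted at step \( m \) and \( \beta_m \) its coefficient), each boosting iteration performs the shrunken update

\[ f_m(x) = f_{m-1}(x) + \nu\,\beta_m\, b(x;\gamma_m), \qquad m = 1,\dots,m_{stop}. \]

A smaller \( \nu \) means each basis function contributes less, so a larger \( m_{stop} \) is needed to reach the same training fit; empirically, small \( \nu \) combined with early stopping tends to give better predictions.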

Loss Functions and Boosting Algorithms

Loss Functions and Boosting Algorithms

  • Various boosting algorithms can be defined by specifying a loss function:
    • Exponential loss
    • Negative binomial log-likelihood (a convex upper bound on the misclassification loss \( \rho(f,y)=\text{I}(\tilde{y}f<0) \))
    • Squared error [1]
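
For a binary response coded \( \tilde{y}\in\{-1,1\} \), standard forms of the first two losses (in the parameterization where \( f \) estimates half the log-odds) are

\[ \rho_{\exp}(\tilde{y},f) = \exp(-\tilde{y}f), \qquad \rho_{\text{log-lik}}(\tilde{y},f) = \log_2\!\left(1 + e^{-2\tilde{y}f}\right), \]

both convex upper bounds of the misclassification loss \( \text{I}(\tilde{y}f<0) \); squared error, \( \rho_{L_2}(y,f) = (y-f)^2/2 \), covers the regression case and is the subject of the next section.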

Loss Functions

[Figure from [2]]

L2 Boosting

L2 Boosting

  • FGD algorithm using squared error loss function: \[ \rho_{L_2}(y,f) = \frac{(y-f)^2}{2} \]

  • Very useful for high-dimensional regression

  • Simple, fast

L2 Boosting

[Figure from [1]]
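
A minimal sketch of \( L_2 \) boosting with the componentwise linear base learner often used for high-dimensional regression, assuming a numeric predictor matrix X with centered columns and a numeric response y (the function name is illustrative; mboost's glmboost() provides a real implementation):

l2boost <- function(X, y, nu = 0.1, mstop = 100) {
  X <- as.matrix(X)
  f <- rep(mean(y), length(y))            # initialize with the offset (mean of y)
  beta <- numeric(ncol(X))
  ss <- colSums(X^2)
  for (m in seq_len(mstop)) {
    r <- y - f                            # negative L2 gradient = ordinary residuals
    b <- colSums(X * r) / ss              # least-squares coefficient for each single predictor
    j <- which.max(b^2 * ss)              # predictor giving the largest drop in residual SS
    f <- f + nu * b[j] * X[, j]           # shrunken update of the fit
    beta[j] <- beta[j] + nu * b[j]        # accumulate the coefficient of predictor j
  }
  list(intercept = mean(y), beta = beta, fitted = f)
}

Only one coefficient is updated per iteration, so with early stopping many coefficients stay exactly at zero, which is what makes the procedure attractive in high dimensions.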

In R

gbm package

In R

  • Boston Housing Data
  • Predict median value from 13 attributes:
    • Crime, average number of rooms, etc.
library(gbm)
library(MASS)   # Boston housing data

Boston.boost <- gbm(medv ~ ., data = Boston, distribution = 'gaussian',
                    n.trees = 5000, train.fraction = .7)

In R

The 'gaussian' option uses squared error loss.
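
Since train.fraction = .7 holds out 30% of the data, the error on that held-out set can be used to choose the number of trees (a usage sketch with gbm's gbm.perf and predict methods):

# estimate the best number of boosting iterations from the held-out data
best.iter <- gbm.perf(Boston.boost, method = 'test')

# predictions using only the selected number of trees
preds <- predict(Boston.boost, newdata = Boston, n.trees = best.iter)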

In R

summary(Boston.boost, plot=FALSE)
            var   rel.inf
rm           rm 80.360922
lstat     lstat 16.386085
ptratio ptratio  1.408125
tax         tax  1.376013
dis         dis  0.277621
indus     indus  0.067403
crim       crim  0.051962
age         age  0.037138
black     black  0.025459
nox         nox  0.005013
rad         rad  0.003118
chas       chas  0.001139
zn           zn  0.000000

References

  1. Bühlmann, Peter, and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Heidelberg; New York: Springer, 2011.
  2. Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. 2nd ed. New York: Springer, 2009.
  3. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013.
  4. Friedman, Jerome H. "Greedy Function Approximation: A Gradient Boosting Machine." The Annals of Statistics 29, no. 5 (2001): 1189-1232.