Boosting Algorithms

Josh Day
February 5, 2014

Outline

  • Motivation: Ensemble Methods and Related Algorithms
  • AdaBoost.M1 Algorithm
  • Gradient Boosting/Functional Gradient Descent
  • Loss Functions and Boosting Algorithms
  • \( L_2 \) Boosting
  • Boosting in R

Ensemble Methods and Related Algorithms

Ensemble Methods

  • Build a prediction model by combining a collection of simpler base models
  • Ensemble learning can be broken down into two tasks:
    • developing base learners from training data
    • combining them to form the final predictor
  • Ideal base learners have high variance, low bias
    • Averaging base learners lowers variance and leaves bias unchanged
  • “A weak classifier is one whose error rate is only slightly better than random guessing”[2]
    • Standard choice is decision tree with few splits

Bagging (Breiman, 1994)

adabag package in R

  1. Bootstrap to generate B training sets.
  2. Build predictor on each set.
  3. Take average/majority vote to get final predictor.

[2]
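A from-scratch sketch of these three steps for a regression problem, using rpart trees as the base learners (the names bag_trees and predict_bagged are illustrative, not part of the adabag API):

library(rpart)

bag_trees <- function(formula, data, B = 100) {
  lapply(seq_len(B), function(b) {
    idx <- sample(nrow(data), replace = TRUE)   # 1. bootstrap training set
    rpart(formula, data = data[idx, ])          # 2. build a tree on it
  })
}

predict_bagged <- function(trees, newdata) {
  preds <- sapply(trees, predict, newdata = newdata)
  rowMeans(preds)                               # 3. average the B predictions
}

For classification, step 3 would instead take the majority vote of the B predicted classes.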

Random Forest (Breiman, 2001)

randomForest package in R

  • Issue with bagging: the bagged trees (base learners) are often highly correlated.
    • The most important variables tend to be used in most base learners
    • Averaging highly correlated predictors gives a smaller reduction in variance.
  • Difference from bagging:
    • At each split, only a random subset of the predictors is considered as split candidates (usage sketch below)
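
A brief usage sketch (mtry is the number of predictors sampled as split candidates at each node; the Boston data, loaded from MASS, is just a convenient example):

library(randomForest)
library(MASS)                # Boston housing data

# each split considers a random subset of mtry predictors
Boston.rf <- randomForest(medv ~ ., data = Boston, ntree = 500, mtry = 4)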

Random Forest

[Figure from [2]]

Adaboost.M1 Algorithm

Adaptive Boosting

AdaBoost.M1

  • First successful boosting algorithm (Freund and Schapire, 1997)
  • Only for binary classification (coded \( Y_i\in\{-1,1\} \))
  • The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers \( G_m(x),\ m = 1,2,\dots,M \), whose predictions are then combined through a weighted majority vote

AdaBoost.M1

[Figure from [2]]

AdaBoost.M1 Algorithm

[Figure from [2]]
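
A minimal from-scratch sketch of AdaBoost.M1 with decision stumps (one-split rpart trees) as the weak classifiers. The helper names are illustrative; X is assumed to be a data frame of predictors and y to be coded in {-1, 1}:

library(rpart)

adaboost_m1 <- function(X, y, M = 100) {
  n <- nrow(X)
  w <- rep(1 / n, n)                                   # initialize weights w_i = 1/n
  fits <- vector("list", M)
  alpha <- numeric(M)
  df <- data.frame(X, .y = factor(y))
  for (m in seq_len(M)) {
    # (a) fit a weak classifier G_m to the weighted training data
    fits[[m]] <- rpart(.y ~ ., data = df, weights = w,
                       control = rpart.control(maxdepth = 1))
    pred <- ifelse(predict(fits[[m]], X, type = "class") == "1", 1, -1)
    err <- sum(w * (pred != y)) / sum(w)               # (b) weighted error rate
    alpha[m] <- log((1 - err) / err)                   # (c) classifier weight
    w <- w * exp(alpha[m] * (pred != y))               # (d) upweight misclassified points
  }
  list(fits = fits, alpha = alpha)
}

predict_adaboost <- function(model, X) {
  G <- sapply(model$fits, function(f)
    ifelse(predict(f, X, type = "class") == "1", 1, -1))
  sign(drop(G %*% model$alpha))                        # weighted majority vote
}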

Gradient Boosting

Gradient Boosting

Functional Gradient Descent

  • Friedman et al. (2000) and Friedman (2001) developed a statistical framework that yields a direct interpretation of boosting as a method for estimating a function.
    • It is a “stagewise, additive modeling” approach [1]
      • additive in the base learners, not necessarily in the covariates

Gradient Boosting

Functional Gradient Descent

Consider estimating a real-valued function

\[ f^0(\cdot) = \text{arg}\,\text{min}_{f(\cdot)} \; \text{E}[\;\rho(f(X), Y)\;] \]
where \( \rho(\cdot) \) is a loss function, typically assumed differentiable and convex with respect to the first argument.

Minimization is over all (measurable) functions \( f(\cdot) \)

Gradient Boosting

\[ f^0(\cdot) = \text{arg}\,\text{min}_{f(\cdot)} \; \text{E}[\;\rho(f(X), Y)\;] \]

Estimation of \( f^0(\cdot) \) with boosting is based on minimizing the empirical risk: \[ \frac{1}{n}\sum_{i=1}^n \rho(f(X_i), Y_i) \]

via iterative steepest descent in function space.

Gradient Boosting

[Figure from [1]]
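
A from-scratch sketch of the steepest descent loop under simple assumptions: X is a data frame of predictors, regression stumps serve as the base learner, the negative gradient of the loss is supplied by the user, and the step length \( \nu \) is fixed. The function names are illustrative; the mboost and gbm packages provide real implementations.

library(rpart)

# negative gradient of the loss at the current fit;
# for squared error loss this is just the residual y - f
neg_grad_L2 <- function(y, f) y - f

fgd_boost <- function(X, y, neg_grad, nu = 0.1, mstop = 100) {
  f <- rep(mean(y), nrow(X))                  # 1. initialize with an offset
  fits <- vector("list", mstop)
  for (m in seq_len(mstop)) {
    u <- neg_grad(y, f)                       # 2. pseudo-residuals (negative gradient)
    fits[[m]] <- rpart(u ~ ., data = data.frame(X, u = u),
                       control = rpart.control(maxdepth = 1))
    f <- f + nu * predict(fits[[m]], X)       # 3.-4. fit base learner, shrunken update
  }
  list(offset = mean(y), fits = fits, nu = nu, fitted = f)
}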

Gradient Boosting

  • Regularization parameters: the step length (shrinkage) \( \nu \) and the number of iterations \( m_{stop} \)
  • \( \nu \) should be “small” (e.g. 0.01)
  • Trade-off between \( \nu \) and \( m_{stop} \) (see the update below)
  • Forward Stagewise Additive Modeling [2]
    • Sequentially adding basis functions, without readjusting those already added
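
As a sketch of that trade-off, in the notation of forward stagewise additive modeling [2] (\( b(x;\gamma_m) \) is the basis function fitted at step \( m \) and \( \beta_m \) its coefficient), each boosting iteration performs the shrunken update

\[ f_m(x) = f_{m-1}(x) + \nu\,\beta_m\, b(x;\gamma_m), \qquad m = 1,\dots,m_{stop}. \]

A smaller \( \nu \) means each basis function contributes less, so a larger \( m_{stop} \) is needed to reach the same training fit; empirically, small \( \nu \) combined with early stopping tends to give better predictions.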

Loss Functions and Boosting Algorithms

Loss Functions and Boosting Algorithms

  • Various boosting algorithms can be defined by specifying a loss function:
    • Exponential loss
    • Negative binomial log-likelihood (a convex upper bound on the misclassification loss \( \rho(f,y)=\text{I}(\tilde{y}f<0) \))
    • Squared error [1]
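
For a binary response coded \( \tilde{y}\in\{-1,1\} \), standard forms of the first two losses (in the parameterization where \( f \) estimates half the log-odds) are

\[ \rho_{\exp}(\tilde{y},f) = \exp(-\tilde{y}f), \qquad \rho_{\text{log-lik}}(\tilde{y},f) = \log_2\!\left(1 + e^{-2\tilde{y}f}\right), \]

both convex upper bounds of the misclassification loss \( \text{I}(\tilde{y}f<0) \); squared error, \( \rho_{L_2}(y,f) = (y-f)^2/2 \), covers the regression case and is the subject of the next section.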

Loss Functions

[Figure from [2]]

L2 Boosting

L2 Boosting

  • FGD algorithm using squared error loss function: \[ \rho_{L_2}(y,f) = \frac{(y-f)^2}{2} \]

  • Very useful for high-dimensional regression

  • Simple, fast

L2 Boosting

[Figure from [1]]
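
A minimal sketch of \( L_2 \) boosting with the componentwise linear base learner often used for high-dimensional regression, assuming a numeric predictor matrix X with centered columns and a numeric response y (the function name is illustrative; mboost's glmboost() provides a real implementation):

l2boost <- function(X, y, nu = 0.1, mstop = 100) {
  X <- as.matrix(X)
  f <- rep(mean(y), length(y))            # initialize with the offset (mean of y)
  beta <- numeric(ncol(X))
  ss <- colSums(X^2)
  for (m in seq_len(mstop)) {
    r <- y - f                            # negative L2 gradient = ordinary residuals
    b <- colSums(X * r) / ss              # least-squares coefficient for each single predictor
    j <- which.max(b^2 * ss)              # predictor giving the largest drop in residual SS
    f <- f + nu * b[j] * X[, j]           # shrunken update of the fit
    beta[j] <- beta[j] + nu * b[j]        # accumulate the coefficient of predictor j
  }
  list(intercept = mean(y), beta = beta, fitted = f)
}

Only one coefficient is updated per iteration, so with early stopping many coefficients stay exactly at zero, which is what makes the procedure attractive in high dimensions.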

In R

gbm package

In R

  • Boston Housing Data
  • Predict median value from 13 attributes:
    • Crime, average number of rooms, etc.
library(gbm)
library(MASS)   # Boston housing data

Boston.boost <- gbm(medv ~ ., data = Boston, distribution = 'gaussian',
                    n.trees = 5000, train.fraction = .7)

In R

The 'gaussian' option uses squared error loss.
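
Since train.fraction = .7 holds out 30% of the data, the error on that held-out set can be used to choose the number of trees (a usage sketch with gbm's gbm.perf and predict methods):

# estimate the best number of boosting iterations from the held-out data
best.iter <- gbm.perf(Boston.boost, method = 'test')

# predictions using only the selected number of trees
preds <- predict(Boston.boost, newdata = Boston, n.trees = best.iter)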

In R

summary(Boston.boost, plot=FALSE)
            var   rel.inf
rm           rm 80.360922
lstat     lstat 16.386085
ptratio ptratio  1.408125
tax         tax  1.376013
dis         dis  0.277621
indus     indus  0.067403
crim       crim  0.051962
age         age  0.037138
black     black  0.025459
nox         nox  0.005013
rad         rad  0.003118
chas       chas  0.001139
zn           zn  0.000000

References

  1. Bühlmann, Peter, and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Heidelberg; New York: Springer, 2011.
  2. Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. 2nd ed. New York: Springer, 2009.
  3. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013.
  4. Friedman, Jerome H. "Greedy Function Approximation: A Gradient Boosting Machine." The Annals of Statistics 29, no. 5 (2001): 1189-1232.