
Lecture 23: Boosting

Wenbin Lu

Department of Statistics, North Carolina State University

Fall 2019


Outline

Adaboost

Strong learners vs. weak learners
Motivations
Algorithms

Additive Models

Adaboost and Additive Logistic Regression

L2-Boost

Boosting Trees

XGBoost


Boosting Methods


Motivation: Model Averaging

Boosting methods improve the performance of weak learners.

Strong learner: given a large enough dataset, the classifier can learn the target function arbitrarily accurately with probability 1 − τ (where τ > 0 can be arbitrarily small).

Weak learner: given a large enough dataset, the classifier can barely learn the target function with probability 1/2 + τ; its error rate is only slightly better than random guessing.

Can we construct a strong learner from weak learners and how?


Boosting

Motivation: combine the outputs of many weak classifiers to produce a powerful "committee".

Similar to bagging and other committee-based approaches

Originally designed for classification problems, but can also be extended to regression problems.

Consider the two-class problem

For Y ∈ {−1, 1}, the classifier G(x) has training error

err = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq G(x_i)).

The expected error rate on future predictions is E_{X,Y} I(Y \neq G(X)).


Classification Trees

Classification trees can be simple, but often produce noisy or weak classifiers.

Bagging (Breiman 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by a majority vote.

Boosting (Freund & Schapire 1996): Fit many large or small trees to re-weighted versions of the training data. Classify by a weighted majority vote.

In general, Boosting > Bagging > Single Tree

Breiman’s comment: “AdaBoost ... best off-the-shelf classifier in the world” (1996, NIPS workshop).


AdaBoost

AdaBoost (Discrete Boost)

Adaptively resampling the data (Freund & Schapire 1997; winners of the 2003 Gödel Prize):

1. Sequentially apply the weak classification algorithm to repeatedly modified (re-weighted) versions of the data.

2. This produces a sequence of weak classifiers G_m(x), m = 1, 2, ..., M.

3. The predictions from the G_m's are then combined through a weighted majority vote to produce the final prediction

G(x) = \mathrm{sign}\Big(\sum_{m=1}^{M} \alpha_m G_m(x)\Big).

Here α1, ..., αM ≥ 0 are computed by the boosting algorithm.


AdaBoost Algorithm


1. Initialize the observation weights w_i = 1/n, i = 1, ..., n.
2. For m = 1 to M:
   (a) Fit a classifier G_m(x) to the training data using weights w_i.
   (b) Compute the weighted error
       err_m = \frac{\sum_{i=1}^{n} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i}.
   (c) Compute the importance of G_m as
       \alpha_m = \log\Big(\frac{1 - err_m}{err_m}\Big).
   (d) Update w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq G_m(x_i))], i = 1, ..., n.
3. Output G(x) = \mathrm{sign}\big(\sum_{m=1}^{M} \alpha_m G_m(x)\big).
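A minimal sketch of this algorithm in Python, assuming scikit-learn is available and using depth-one decision trees (stumps) as the weak learners; names such as n_rounds are illustrative, not part of the lecture.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=100):
    """Discrete AdaBoost with decision stumps; y must be coded as +1/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # step 2(a): weighted fit
        miss = (stump.predict(X) != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)       # step 2(b): weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)     # guard against log(0)
        alpha = np.log((1 - err) / err)          # step 2(c): importance
        w *= np.exp(alpha * miss)                # step 2(d): re-weight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 3: weighted majority vote."""
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)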


Weights of Individual Weak Learners

In the final rule, the weight of G_m is

\alpha_m = \log\Big(\frac{1 - err_m}{err_m}\Big),

where err_m is the weighted error of G_m.

The weights αm’s weigh the contribution of each Gm.

The higher (lower) errm, the smaller (larger) αm.

The principle is to give higher influence (larger weights) to more accurate classifiers in the sequence.


Data Re-weighting (Modification) Scheme

At each boosting iteration, we apply the updated weights w_1, ..., w_n to the samples (x_i, y_i), i = 1, ..., n.

Initially, all weights are set to w_i = 1/n, so the first iteration trains the classifier on the data in the usual way.

At step m = 2, ..., M, we modify the weights individually: increasing the weights of observations misclassified by G_{m−1}(x) and decreasing the weights of observations classified correctly by G_{m−1}(x).

Samples that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thereby forced to concentrate on the training observations missed by the previous ones in the sequence.

The classification algorithm is re-applied to the weighted observations


[Figure 10.1, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: schematic of AdaBoost. Classifiers G_1(x), G_2(x), ..., G_M(x) are trained on successively re-weighted versions of the training sample and then combined to produce the final classifier G(x) = sign(\sum_{m=1}^{M} \alpha_m G_m(x)).]


AdaBoost Example

Power of Boosting

Ten features X1, ...,X10 ∼ N(0, 1)

Two classes: Y = 2 \cdot I\big(\sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5)\big) - 1, where \chi^2_{10}(0.5) is the median of the \chi^2_{10} distribution.

Sample size n = 2000, test set size 10,000.

weak classifier: stump (a two-terminal node classification tree)

performance of boosting and comparison with other methods:

a single stump has a 46% misclassification rate
a 400-node tree has a 26% misclassification rate
boosting stumps achieves a 12.2% error rate
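A rough sketch of this simulation in Python, assuming scikit-learn and SciPy; the exact error rates will vary with the random seed, and scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 tree) stands in for the hand-rolled algorithm above, with algorithm="SAMME" selecting the discrete variant in recent versions.

import numpy as np
from scipy.stats import chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.standard_normal((n, 10))
    y = 2 * (np.sum(X**2, axis=1) > chi2.ppf(0.5, df=10)) - 1
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(10000)

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
boost = AdaBoostClassifier(n_estimators=400, algorithm="SAMME").fit(X_train, y_train)

print("stump test error:", np.mean(stump.predict(X_test) != y_test))
print("boosted stumps test error:", np.mean(boost.predict(X_test) != y_test))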


[Figure 10.2, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: test error rate for boosting with stumps as a function of the number of boosting iterations, shown together with the test error rates of a single stump and of a 400-node classification tree.]


Boosting And Additive Models

The success of boosting is not very mysterious. The key lies in

G(x) = \mathrm{sign}\Big(\sum_{m=1}^{M} \alpha_m G_m(x)\Big).

AdaBoost is equivalent to fitting an additive model using the exponential loss function, as shown by Friedman et al. (2000).

AdaBoost was originally motivated from a very different perspective


Additive Models

Introduction on Additive Models

An additive model typically assumes the functional form

f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m),

where the \beta_m's are coefficients and the b(x; \gamma_m)'s are basis functions of x characterized by parameters \gamma_m.

The model f is obtained by minimizing a loss averaged over the training data:

\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{n} L\Big(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\Big). \qquad (1)

While (1) is hard to solve jointly, it is feasible to rapidly solve the sub-problem of fitting just a single basis function.


Forward Stagewise Fitting for Additive Models

Forward stagewise modeling approximates the solution to (1) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added.

At iteration m, one solves for the optimal basis function b(x; \gamma_m) and the corresponding coefficient \beta_m, which are added to the current expansion f_{m−1}(x).

Previously added terms are not modified.

This process is repeated.


Squared-Error Loss Example

Consider the squared-error loss

L(y, f(x)) = [y - f(x)]^2.

At the mth step, given the current fit f_{m−1}(x), we solve

\min_{\beta,\gamma} \sum_i L\big(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)\big)
\iff \min_{\beta,\gamma} \sum_i \big[y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)\big]^2 = \sum_i \big[r_{im} - \beta b(x_i; \gamma)\big]^2,

where r_{im} = y_i - f_{m-1}(x_i) is the current residual.

The term \beta_m b(x; \gamma_m) is therefore the best fit to the current residuals. This produces the updated fit

f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).


Forward Stagewise Additive Modeling

1. Initialize f_0(x) = 0.
2. For m = 1 to M:
   (a) Compute
       (\beta_m, \gamma_m) = \arg\min_{\beta,\gamma} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)\big).
   (b) Set f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).
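A minimal sketch of this generic procedure for squared-error loss, where step (a) reduces to fitting the current residuals (previous slide). Scikit-learn regression stumps are assumed as the basis functions b(x; γ); with tree bases the coefficient β_m is absorbed into the leaf values.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=200):
    """Forward stagewise additive modeling under squared-error loss."""
    fitted = np.zeros(len(y))          # f_0(x) = 0
    basis = []
    for m in range(M):
        r = y - fitted                 # current residuals r_im
        b = DecisionTreeRegressor(max_depth=1).fit(X, r)   # step (a): best single basis fit
        basis.append(b)
        fitted += b.predict(X)         # step (b): f_m = f_{m-1} + beta_m b(.; gamma_m)
    return basis

def predict_additive(basis, X):
    return sum(b.predict(X) for b in basis)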


AdaBoost and Additive Logistic Models

Additive Logistic Models and AdaBoost

Friedman et al. (2000) showed that AdaBoost is equivalent to forward stagewise additive modeling

using the exponential loss function

L(y, f(x)) = \exp\{-y f(x)\},

using the individual classifiers G_m(x) \in \{-1, 1\} as basis functions.

The (population) minimizer of the exponential loss function is

f^*(x) = \arg\min_f E_{Y|x}\big[e^{-Y f(x)}\big] = \frac{1}{2}\log\frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)},

which equals one half of the log-odds. So AdaBoost can be regarded as an additive logistic regression model.
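A short derivation of this population minimizer, which the slide states without proof: write out the conditional expected loss and set its derivative with respect to f(x) to zero.

E_{Y|x}\big[e^{-Yf(x)}\big] = \Pr(Y=1\mid x)\, e^{-f(x)} + \Pr(Y=-1\mid x)\, e^{f(x)},
\frac{\partial}{\partial f(x)} E_{Y|x}\big[e^{-Yf(x)}\big] = -\Pr(Y=1\mid x)\, e^{-f(x)} + \Pr(Y=-1\mid x)\, e^{f(x)} = 0
\;\Longrightarrow\; e^{2f(x)} = \frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)}
\;\Longrightarrow\; f^*(x) = \frac{1}{2}\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)}.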


Forward Stagewise Additive Modeling with Exponential Loss

For exponential loss, the minimization at the mth step of forward stagewise modeling becomes

(\beta_m, \gamma_m) = \arg\min_{\beta,\gamma} \sum_{i=1}^{n} \exp\{-y_i[f_{m-1}(x_i) + \beta b(x_i; \gamma)]\}.

In the context of a weak learner G, this is

(\beta_m, G_m) = \arg\min_{\beta,G} \sum_{i=1}^{n} \exp\{-y_i[f_{m-1}(x_i) + \beta G(x_i)]\},

or equivalently,

(\beta_m, G_m) = \arg\min_{\beta,G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\},

where w_i^{(m)} = \exp\{-y_i f_{m-1}(x_i)\}.


Iterative Optimization

In order to solve

\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\},

we take a two-step (profile) approach:

first, fix \beta > 0 and solve for G, giving \hat{G};

second, solve for \beta with G = \hat{G}.

Recall that both Y and G(x) take only the two values +1 and −1, so

y_i G(x_i) = +1 \iff y_i = G(x_i),
y_i G(x_i) = -1 \iff y_i \neq G(x_i).


Solving for Gm

(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\}
= \arg\min_{\beta, G} \Big[ e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \Big]
= \arg\min_{\beta, G} \Big[ (e^{\beta} - e^{-\beta}) \sum_{y_i \neq G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{i=1}^{n} w_i^{(m)} \Big].

For any fixed \beta > 0, the minimizer G_m is the \{-1, 1\}-valued function

G_m = \arg\min_{G} \sum_{i=1}^{n} w_i^{(m)} I[y_i \neq G(x_i)],

i.e., the classifier that minimizes the training error for the weighted data.


Solving for βm

Define

err_m = \frac{\sum_{i=1}^{n} w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i^{(m)}}.

Plugging G_m into the objective gives

\beta_m = \arg\min_{\beta} \Big[ e^{-\beta} + (e^{\beta} - e^{-\beta})\, err_m \Big] \sum_{i=1}^{n} w_i^{(m)}.

The solution is

\beta_m = \frac{1}{2}\log\Big(\frac{1 - err_m}{err_m}\Big).
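The final step is a short calculus exercise omitted on the slide: differentiate the bracketed objective in \beta and set the derivative to zero.

\frac{d}{d\beta}\Big[e^{-\beta} + (e^{\beta} - e^{-\beta})\, err_m\Big] = -e^{-\beta} + (e^{\beta} + e^{-\beta})\, err_m = 0
\;\Longrightarrow\; e^{2\beta} = \frac{1 - err_m}{err_m}
\;\Longrightarrow\; \beta_m = \frac{1}{2}\log\frac{1 - err_m}{err_m}.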


Connection to Adaboost

Since

-y\, G_m(x) = 2\, I(y \neq G_m(x)) - 1,

we have

w_i^{(m+1)} = w_i^{(m)} \exp\{-\beta_m y_i G_m(x_i)\}
= w_i^{(m)} \exp\{\alpha_m I(y_i \neq G_m(x_i))\}\, \exp\{-\beta_m\},

where \alpha_m = 2\beta_m and the factor \exp\{-\beta_m\} is constant across the data points. Therefore

the weight update is equivalent to line 2(d) of the AdaBoost algorithm;

line 2(a) of the AdaBoost algorithm is equivalent to solving the weighted minimization problem for G_m.


Weights and Their Update

At the mth step, the weight of the ith observation is

w_i^{(m)} = \exp\{-y_i f_{m-1}(x_i)\},

which depends only on f_{m−1}(x_i), not on \beta or G(x).

At the (m + 1)th step, we update the weight using the fact that

f_m(x_i) = f_{m-1}(x_i) + \beta_m G_m(x_i).

This leads to the following update formula:

w_i^{(m+1)} = \exp\{-y_i f_m(x_i)\}
= \exp\{-y_i (f_{m-1}(x_i) + \beta_m G_m(x_i))\}
= w_i^{(m)} \exp\{-\beta_m y_i G_m(x_i)\}.


Why Exponential Loss?

Principal virtue is computational

Exponential loss concentrates much more influence on observations with large negative margins y f(x). It is especially sensitive to misspecification of the class labels.

For more robust losses, the computation is not as easy (use gradient boosting).


Exponential Loss and Cross Entropy

Define Y' = (Y + 1)/2 \in \{0, 1\}. The binomial negative log-likelihood loss function is

-l(Y, p(x)) = -\big[Y' \log p(x) + (1 - Y')\log(1 - p(x))\big] = \log\big(1 + \exp\{-2 Y f(x)\}\big) \equiv -l(Y, f(x)),

where p(x) = [1 + e^{-2f(x)}]^{-1}.

The population minimizers of the deviance E_{Y|x}[-l(Y, f(x))] and of E_{Y|x}[e^{-Y f(x)}] are the same.

The population minimizer is given by

f^*(x) = \frac{1}{2}\log\frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}.


Loss Functions for Classification

The choice of loss function matters for finite data sets.

The margin of f(x) is defined as y f(x).

The classification rule is G(x) = sign[f(x)].

The decision boundary is f(x) = 0.

Observations with positive margin yi f (xi ) > 0 are correctly classified

Observations with negative margin y_i f(x_i) < 0 are incorrectly classified.

The goal of a classification algorithm is to produce positive margins asfrequently as possible

Any loss function should penalize negative margins more heavily than positive margins, since observations with positive margins are already correctly classified.


Various Loss Functions

Monotone decreasing loss functions of the margin

Misclassification loss: I(sign(f(x)) \neq y)

Exponential loss: \exp(-y f) (not robust against mislabeled samples)

Binomial deviance: \log[1 + \exp(-2 y f)] (more robust against influential points)

SVM loss: [1 - y f]_+

Other loss functions

Squared error: (y - f)^2 = (1 - y f)^2 (also penalizes large positive margins heavily)


[Figure 10.4, Elements of Statistical Learning (Hastie, Tibshirani & Friedman, 2001), Chapter 10: loss functions for two-class classification, plotted against the margin y f. The response is y = \pm 1, the prediction is f, with class prediction sign(f). The losses shown are misclassification I(sign(f) \neq y), exponential \exp(-y f), binomial deviance \log(1 + \exp(-2 y f)), squared error (y - f)^2, and the support vector loss (1 - y f)_+ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).]


Brief Summary on AdaBoost

AdaBoost fits an additive model whose basis functions G_m(x) stage-wise optimize the exponential loss.

The population minimizer of the exponential loss is one half of the log-odds.

There are loss functions more robust than squared error loss (forregression problems) or exponential loss (for classification problems)


L2 Boosting


Proposed by Bühlmann & Yu (2003).

Adopts the general framework of the gradient boosting machine developed by Friedman (2001).

Data: (Y_i, X_i), i = 1, ..., n, where Y_i is the response and X_i are p-dimensional features.

Problem: find F: R^p \to R to minimize the expected loss E[L\{Y_i, F(X_i)\}].

Define the empirical loss: L_n(F) = \sum_{i=1}^{n} L\{Y_i, F(X_i)\}.

Estimate F nonparametrically using functional gradient descent.

The procedure includes: initialization, evaluation of the negative gradient of the loss function, projection of the gradient onto a base learner, line search, update, and iteration.

Common choice: component-wise cubic smoothing splines as the base learner.


Algorithm

Step 1. Set m = 0 and Fm(·) = 0.

Step 2. Compute the negative gradient

U_i^{(m)} = -\frac{\partial L(Y_i, F)}{\partial F}\Big|_{F = F_m(X_i)}, \quad i = 1, \dots, n.

Step 3. Projection: regress U_i^{(m)} on X_{i,k} using a univariate cubic smoothing spline as the base learner, and choose the predictor X_{k(m)} such that

k(m) = \arg\min_{1 \le k \le p} \sum_{i=1}^{n} \big(U_i^{(m)} - \hat{g}_k(X_{i,k})\big)^2,

where \hat{g}_k(\cdot) is the conventional univariate cubic smoothing spline fit.


Algorithm

Step 4. Line search: find w_m that minimizes

\sum_{i=1}^{n} L\big\{Y_i,\ F_m(X_i) + w\, \hat{g}_{k(m)}(X_{i,k(m)})\big\}

with respect to w.

Step 5. Update: set

F_{m+1}(X_i) = F_m(X_i) + \rho\, w_m\, \hat{g}_{k(m)}(X_{i,k(m)}),

and increase m by one. Here ρ is a small learning rate, usually chosen as ρ = 0.05 or 0.01.

Step 6. Iteration: repeat Steps 2 to 5.
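A minimal sketch of Steps 1-6 in Python under squared-error loss, using componentwise simple linear regressions as the base learner in place of the cubic smoothing splines named above; for L2 loss the negative gradient is simply the current residual and the line search has a closed form. For brevity the sketch tracks only the in-sample fit.

import numpy as np

def componentwise_linear_fit(x, u):
    """Simple linear regression of u on a single feature x; returns fitted values."""
    X1 = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X1, u, rcond=None)
    return X1 @ coef

def l2_boost(X, y, n_iter=200, rho=0.05):
    """L2 boosting with componentwise linear base learners (squared-error loss)."""
    n, p = X.shape
    F = np.zeros(n)                              # Step 1: F_0 = 0
    selected = []
    for m in range(n_iter):
        U = y - F                                # Step 2: negative gradient = residual for L2 loss
        fits = [componentwise_linear_fit(X[:, k], U) for k in range(p)]
        sse = [np.sum((U - g) ** 2) for g in fits]
        k_m = int(np.argmin(sse))                # Step 3: best-fitting component
        g = fits[k_m]
        w = np.sum(U * g) / (np.sum(g ** 2) + 1e-12)   # Step 4: line search (closed form)
        F = F + rho * w * g                      # Step 5: damped update with learning rate rho
        selected.append(k_m)                     # Step 6: iterate; record selected features
    return F, selected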


Discussions

Tuning parameters: learning rate ρ and total number of iterations.

Both can be tuned via cross-validation.

Empirical findings: a smaller learning rate often leads to better predictive performance of the boosting algorithm; the predictive performance is not overly sensitive to the choice of ρ.

Stopping rule: stop the iterations when the loss function flattens. This gives almost the same performance as cross-validation.

Variable importance measure (Friedman, 2001):

I_j = \left\{ E_X \left[\frac{\partial F(X)}{\partial X_j}\right]^2 \mathrm{Var}(X_j) \right\}^{1/2}, \quad j = 1, \dots, p,

where the expectation is with respect to the marginal distribution of X and can be computed through a sample average.


Boosting Trees


A tree can be expressed as

T(x; \Theta) = \sum_{j=1}^{J} \gamma_j I(x \in R_j),

with parameters \Theta = \{\gamma_j, R_j\}_1^J.

The parameters are found by minimizing the empirical risk

\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j) = \arg\min_{\Theta} \sum_{i=1}^{N} L(y_i, T(x_i; \Theta)).

Finding γ_j given R_j: this can be done trivially (e.g., the mean response or the majority class within R_j).

Finding R_j: this is the difficult part. A typical strategy is a greedy, top-down recursive partitioning algorithm. A smooth function may be used to approximate L.


Boosted Tree Model

The boosted tree model,

f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m),

is induced in a forward stagewise manner.

At each step, we solve

\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\big),

where \Theta_m = \{\gamma_{jm}, R_{jm}\},\ j = 1, \dots, J_m.


Gradient Boosting

Consider the numerical optimization

\arg\min_{\mathbf{f}} L(\mathbf{f}) \equiv \sum_{i=1}^{N} L(y_i, f(x_i)),

where the minimization is over the values of f at the N training points.

The solution is written as a sum of component vectors

\mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m, \quad \mathbf{h}_m \in \mathbb{R}^N,

where \mathbf{f}_M = (f_M(x_1), \dots, f_M(x_N))^T and \mathbf{h}_m is the increment vector at the mth step.

Steepest descent: choose \mathbf{h}_m = -\rho_m \mathbf{g}_m, where \rho_m is a scalar and the ith component of \mathbf{g}_m is

g_{im} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)}.


Gradient Boosting

The step length \rho_m is the solution to

\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) - \rho\, g_m(x_i)\big).

Solution update: f_m(x_i) = f_{m-1}(x_i) - \rho_m g_m(x_i).

Gradient boosting based on least squares:

\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big(-g_{im} - T(x_i; \Theta)\big)^2.


Gradient Tree Boosting Algorithm

1. Initialize f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma).
2. For m = 1 to M:
   (a) For i = 1, ..., N, compute the pseudo-residuals r_{im} = -g_{im}.
   (b) Fit a regression tree to the targets r_{im}, giving terminal regions R_{jm}, j = 1, 2, ..., J_m.
   (c) For j = 1, 2, ..., J_m, compute
       \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma).
   (d) Update f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm}).
3. Output \hat{f}(x) = f_M(x).
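A minimal sketch of this algorithm in Python for absolute-error loss, where every step has a closed form: the negative gradient is the sign of the residual and the optimal leaf value in step 2(c) is a within-leaf median. Scikit-learn regression trees are assumed as the base learner, and a shrinkage factor is added as a common practical refinement (cf. the learning rate ρ in the L2 boosting slides); neither the loss choice nor the shrinkage is prescribed by the slide itself.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost_lad(X, y, M=100, max_depth=3, shrinkage=0.1):
    """Gradient tree boosting for absolute-error loss L(y, f) = |y - f|."""
    f0 = np.median(y)                           # step 1: arg min_gamma sum |y_i - gamma|
    fitted = np.full(len(y), f0, dtype=float)
    trees, leaf_values = [], []
    for m in range(M):
        r = np.sign(y - fitted)                 # step 2(a): negative gradient of |y - f|
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # step 2(b)
        leaves = tree.apply(X)                  # terminal region R_jm of each observation
        gamma = {j: np.median((y - fitted)[leaves == j]) for j in np.unique(leaves)}  # step 2(c)
        fitted += shrinkage * np.array([gamma[j] for j in leaves])    # step 2(d), shrunk
        trees.append(tree)
        leaf_values.append(gamma)
    return f0, trees, leaf_values

def gradient_tree_boost_predict(f0, trees, leaf_values, X, shrinkage=0.1):
    """Step 3: sum the (shrunk) leaf values of all fitted trees."""
    pred = np.full(X.shape[0], f0, dtype=float)
    for tree, gamma in zip(trees, leaf_values):
        leaves = tree.apply(X)
        pred += shrinkage * np.array([gamma[j] for j in leaves])
    return pred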


XGBoost


eXtreme Gradient Boosting (XGBoost): a scalable tree boosting system, capable of handling large data sets.

A variant of gradient tree boosting, also known as the gradient boosting machine (GBM) or gradient boosted regression trees (GBRT).

Builds an ensemble of weak learners in a stage-wise manner, combining gradient descent with CART.

Successfully used in many machine learning and data mining challenges, such as those hosted on Kaggle.

Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost. Of these, eight used XGBoost alone to train the model, while most of the others combined XGBoost with neural nets in ensembles.


Gradient Tree Boosting

Data with n samples and m features: D = \{(x_i, y_i) : i = 1, \dots, n\}, x_i \in \mathbb{R}^m and y_i \in \mathbb{R}.

A tree ensemble model uses K additive functions:

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},

where \mathcal{F} = \{f(x) = w_{q(x)} : q: \mathbb{R}^m \to \{1, \dots, T\},\ w \in \mathbb{R}^T\} is the space of regression trees (CART).

q: the structure of each tree (it maps an observation to a leaf index);
T: the number of leaves of each tree;
w: the leaf weights (the score of each leaf).


Gradient Tree Boosting

Minimize the following objective function:

L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),

where \Omega(f) = \gamma T + (\lambda/2)\|w\|^2.

Consider \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) and

L^{(t)}(\phi) = \sum_i l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t).

Consider the second-order approximation:

L^{(t)}(\phi) \approx \sum_i \big[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\big] + \Omega(f_t)
= \sum_i \big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\big] + \Omega(f_t) + \text{constant},

where g_i and h_i are the first- and second-order derivatives of l with respect to \hat{y}^{(t-1)}.
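To make g_i and h_i concrete, a small sketch (not from the slides) of these derivatives for two common losses, with the prediction taken on the raw-score (margin) scale and, for the logistic case, labels y in {0, 1}:

import numpy as np

def grad_hess_squared(y, y_pred):
    """l(y, yhat) = (y - yhat)^2 / 2: gradient g = yhat - y, Hessian h = 1."""
    return y_pred - y, np.ones_like(y_pred)

def grad_hess_logistic(y, margin):
    """Binary logistic loss with y in {0, 1} and raw score 'margin':
    g = p - y, h = p * (1 - p), where p = sigmoid(margin)."""
    p = 1.0 / (1.0 + np.exp(-margin))
    return p - y, p * (1.0 - p)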


Recap of XGBoost

Algorithm:

1. Initialize \hat{y}^{(0)}(x) = f_0(x) = \arg\min_{f_0} \sum_{i=1}^{n} l(y_i, f_0(x_i)) + \Omega(f_0).
2. For t = 1, ..., K:
   Calculate g_i and h_i, i = 1, ..., n;
   Fit a new tree f_t = \arg\min_{f_t} \sum_i \big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\big] + \Omega(f_t);  (*)
   Update \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i).
3. Output \hat{y}^{(K)}.


Combination of Gradient Boosting with CART

To optimize (*):

Define the instance set of leaf j as I_j = \{i : q(x_i) = j\}. Rewrite L^{(t)}(\phi) as

L^{(t)}(\phi) = \sum_i \big[g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2\big] + \gamma T + \frac{\lambda}{2}\sum_{j=1}^{T} w_j^2
= \sum_{j=1}^{T}\Big[\big(\textstyle\sum_{i \in I_j} g_i\big) w_j + \tfrac{1}{2}\big(\textstyle\sum_{i \in I_j} h_i + \lambda\big) w_j^2\Big] + \gamma T.

For a fixed structure q, the optimal weight w_j^* of leaf j is given by

w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.


Splitting Trees

Define the optimal value of a given tree structure q as

\tilde{L}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T.

This optimal value is like an impurity score for evaluating decision trees. A greedy algorithm can be used to build the tree (a single split is added at each step).

Let I_L and I_R denote the instance sets of the left and right nodes after the split, and let I = I_L \cup I_R. The loss reduction after the split is given by

L_{split} = \frac{1}{2}\left[\frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma.

The above is usually used for evaluating the split candidates.
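A small sketch (illustrative function names, not the actual XGBoost code) of how the optimal leaf weight and the split gain translate into code for scoring candidate splits of a single feature:

import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Loss reduction L_split for splitting instance set I into I_L and I_R."""
    def score(gs, hs):
        return np.sum(gs) ** 2 / (np.sum(hs) + lam)
    gL, hL = g[left_mask], h[left_mask]
    gR, hR = g[~left_mask], h[~left_mask]
    return 0.5 * (score(gL, hL) + score(gR, hR) - score(g, h)) - gamma

def best_split_for_feature(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search over the candidate thresholds of one feature."""
    best = (-np.inf, None)
    for thr in np.unique(x)[:-1]:
        gain = split_gain(g, h, x <= thr, lam, gamma)
        if gain > best[0]:
            best = (gain, thr)
    return best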


Splitting Algorithms

Exact greedy algorithm: potentially time-consuming for continuous variables.

Approximate greedy algorithm:

Proposes candidate splitting points according to percentiles of the feature distribution.
Maps the continuous features into buckets split by these candidate points.
Aggregates the statistics and finds the best split among the proposals based on the aggregated statistics.

Sparsity-aware split finding: used when features are sparse, e.g., because of

presence of missing values in the data;
frequent zero entries in the statistics.


Refine Prediction

Pruning: the gain L_{split} of a split can be negative. Grow a tree to maximum depth, then recursively prune all leaf splits with negative gain.

Shrinkage: \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \epsilon f_t(x_i), where ε is the learning rate, usually set at around 0.1.

Feature subsampling: as used in random forests.

XGBoost can be implemented in R (“xgboost” package), Python, andJulia.
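For reference, a short usage sketch with the Python xgboost package; the data here are toy and purely illustrative, and default settings may differ across package versions.

import numpy as np
import xgboost as xgb

# Toy binary classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = (np.sum(X**2, axis=1) > 10).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",  # second-order boosting on the logistic loss
    "max_depth": 3,
    "eta": 0.1,                      # shrinkage / learning rate
    "lambda": 1.0,                   # L2 penalty on leaf weights
    "gamma": 0.0,                    # minimum loss reduction required to split
    "subsample": 0.8,                # row subsampling
    "colsample_bytree": 0.8,         # feature subsampling, as in random forests
}
booster = xgb.train(params, dtrain, num_boost_round=200)
pred = booster.predict(xgb.DMatrix(X))   # predicted probabilities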
