lecture 23: boosting - stat.ncsu.edu
Lecture 23: Boosting
Wenbin Lu
Department of Statistics, North Carolina State University
Fall 2019
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 1 / 47
Outline
AdaBoost
  Strong learners vs. weak learners
  Motivations
  Algorithms
Additive Models
AdaBoost and Additive Logistic Regression
L2-Boost
Boosting Trees
XGBoost
Boosting Methods
Motivation: Model Averaging
They are methods for improving the performance of weak learners.
Strong learners: given a large enough dataset, the classifier can learn the target function arbitrarily accurately with probability $1 - \tau$ (where $\tau > 0$ can be arbitrarily small).
Weak learners: given a large enough dataset, the classifier can barely learn the target function, with probability $\frac{1}{2} + \tau$.
The error rate is only slightly better than random guessing.
Can we construct a strong learner from weak learners and how?
Boosting
Motivation: combine the outputs of many weak classifiers to produce a powerful “committee”.
Similar to bagging and other committee-based approaches.
Originally designed for classification problems, but can also be extended to regression problems.
Consider the two-class problem with $Y \in \{-1, 1\}$. The classifier $G(x)$ has training error
$$\mathrm{err} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq G(x_i)).$$
The expected error rate on future predictions is $E_{X,Y}\, I(Y \neq G(X))$.
Classification Trees
Classification trees can be simple, but often produce noisy or weak classifiers.
Bagging (Breiman 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by a majority vote.
Boosting (Freund & Schapire 1996): Fit many large or small trees to re-weighted versions of the training data. Classify by a weighted majority vote.
In general, Boosting > Bagging > Single Tree
Breiman’s comment: “AdaBoost .... best off-the-shelf classifier in the world” (1996, NIPS workshop).
AdaBoost
AdaBoost (Discrete Boost)
Adaptively resampling the data (Freund & Schapire 1997; winners of the 2003 Gödel Prize)
1 Sequentially apply the weak classification algorithm to repeatedly modified (re-weighted) versions of the data.
2 This produces a sequence of weak classifiers $G_m(x)$, $m = 1, 2, \ldots, M$.
3 The predictions from the $G_m$'s are then combined through a weighted majority vote to produce the final prediction
$$G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).$$
Here $\alpha_1, \ldots, \alpha_M \geq 0$ are computed by the boosting algorithm.
AdaBoost Algorithm
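1 Initialize the observation weights $w_i = 1/n$, $i = 1, \ldots, n$.
2 For $m = 1$ to $M$:
(a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
(b) Compute the weighted error
$$\mathrm{err}_m = \frac{\sum_{i=1}^{n} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i}.$$
(c) Compute the importance of $G_m$ as
$$\alpha_m = \log\left( \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} \right).$$
(d) Update $w_i \leftarrow w_i \cdot \exp[\alpha_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, \ldots, n$.
3 Output $G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right)$.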
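The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the weak learner is assumed to be a decision stump found by exhaustive search, the helper names (`fit_stump`, `stump_predict`, `adaboost`, `adaboost_predict`) are hypothetical, and the clipping of the error is an added numerical guard that is not part of the algorithm as stated.

```python
import numpy as np

def fit_stump(X, y, w):
    """Step 2(a): weighted stump (feature, threshold, sign) with the smallest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] > thr, s, -s)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, thr, s, err)
    return best[:3]

def stump_predict(stump, X):
    j, thr, s = stump
    return np.where(X[:, j] > thr, s, -s)

def adaboost(X, y, M):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = fit_stump(X, y, w)                # step 2(a)
        miss = stump_predict(stump, X) != y
        err = np.sum(w * miss) / np.sum(w)        # step 2(b)
        err = np.clip(err, 1e-12, 1 - 1e-12)      # guard against log(0); an assumption
        alpha = np.log((1 - err) / err)           # step 2(c)
        w = w * np.exp(alpha * miss)              # step 2(d)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 3: weighted majority vote."""
    score = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.where(score >= 0, 1, -1)
```

On a one-dimensional toy problem with labels `+, +, -, -, +, +`, no single stump separates the classes, but three boosted stumps do, which is a small instance of weak learners combining into a stronger one.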
Weights of Individual Weak Learners
In the final rule, the weight of $G_m$ is
$$\alpha_m = \log\left( \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} \right),$$
where $\mathrm{err}_m$ is the weighted error of $G_m$.
The weights $\alpha_m$ weigh the contribution of each $G_m$.
The higher (lower) $\mathrm{err}_m$, the smaller (larger) $\alpha_m$.
The principle is to give higher influence (larger weights) to more accurate classifiers in the sequence.
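To make the relationship between the weighted error and the vote weight concrete, the following small sketch evaluates the formula at a few error rates. The function name `alpha` and the chosen error values are illustrative assumptions.

```python
import math

def alpha(err):
    """AdaBoost vote weight for a weak learner with weighted error err in (0, 1)."""
    return math.log((1.0 - err) / err)

# The weight shrinks toward zero as the error approaches 1/2 (random guessing).
for err in (0.05, 0.25, 0.45, 0.50):
    print(f"err = {err:.2f}  ->  alpha = {alpha(err):.3f}")
```

A classifier no better than random guessing (error 1/2) gets zero weight, and one with error above 1/2 would get a negative weight, which is why the weak learner is expected to do slightly better than chance.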
Data Re-weighting (Modification) Scheme
At each boosting step, we apply the updated weights $w_1, \ldots, w_n$ to the samples $(x_i, y_i)$, $i = 1, \ldots, n$.
Initially, all weights are set to $w_i = 1/n$, so the first step fits the usual (unweighted) classifier.
At step $m = 2, \ldots, M$, we modify the weights for observations individually: increasing the weights for observations misclassified by $G_{m-1}(x)$ and decreasing the weights for observations classified correctly by $G_{m-1}(x)$.
Samples that are difficult to classify correctly receive ever-increasing influence.
Each successive classifier is forced to concentrate on those training observations that are missed by previous ones in the sequence.
The classification algorithm is re-applied to the weighted observations.
[Figure 10.1 from Hastie, Tibshirani & Friedman (2001), Elements of Statistical Learning, Chapter 10. Schematic of AdaBoost: classifiers $G_1(x), G_2(x), G_3(x), \ldots, G_M(x)$ are trained on weighted versions of the dataset, and then combined to produce the final classifier $G(x) = \mathrm{sign}\left[ \sum_{m=1}^{M} \alpha_m G_m(x) \right]$.]
AdaBoost Example
Power of Boosting
Ten features $X_1, \ldots, X_{10} \sim N(0, 1)$
Two classes: $Y = 2 \cdot I\left( \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5) \right) - 1$
Sample size $n = 2000$, test size $10{,}000$
Weak classifier: stump (a two-terminal-node classification tree)
Performance of boosting and comparison with other methods:
a stump has a 46% misclassification rate
a 400-node tree has a 26% misclassification rate
boosting has a 12.2% error rate
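The simulated data set described above can be generated along the following lines. This is a sketch under stated assumptions: the threshold $\chi^2_{10}(0.5)$ is hard-coded as the approximate value 9.342 rather than computed, the features are taken to be independent, and the function name `simulate` and the random seed are arbitrary choices.

```python
import numpy as np

CHI2_10_MEDIAN = 9.342  # approximate median of the chi-squared(10) distribution (hard-coded)

def simulate(n, rng):
    """Draw n observations of ten independent N(0, 1) features and the +/-1 label."""
    X = rng.standard_normal((n, 10))
    y = np.where((X ** 2).sum(axis=1) > CHI2_10_MEDIAN, 1, -1)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = simulate(2000, rng)
X_test, y_test = simulate(10_000, rng)
```

Because the threshold is the median of $\sum_j X_j^2$, the two classes are roughly balanced, so a classifier that ignores the features has an error rate near 50%.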
[Figure 10.2 from Hastie, Tibshirani & Friedman (2001), Elements of Statistical Learning, Chapter 10. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and a 400-node classification tree. Horizontal axis: boosting iterations (0 to 400); vertical axis: test error (0.0 to 0.5).]
Boosting And Additive Models
The success of boosting is not very mysterious. The key lies in
$$G(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m G_m(x) \right).$$
AdaBoost is equivalent to fitting an additive model using the exponential loss function (a very recent discovery by Friedman et al. (2000)).
AdaBoost was originally motivated from a very different perspective.
Additive Models
Introduction on Additive Models
An additive model typically assumes the functional form
$$f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m),$$
where the $\beta_m$'s are coefficients and the $b(x; \gamma_m)$ are basis functions of $x$ characterized by $\gamma_m$.
The model $f$ is obtained by minimizing a loss averaged over the training data:
$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{n} L\left( y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m) \right). \tag{1}$$
It is feasible to rapidly solve the sub-problem of fitting just a single basis function.
Forward Stagewise Fitting for Additive Models
Forward stagewise modeling approximates the solution to (1) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added.
At iteration $m$, one solves for the optimal basis function $b(x; \gamma_m)$ and corresponding coefficient $\beta_m$, which is added to the current expansion $f_{m-1}(x)$.
Previously added terms are not modified.
This process is repeated.
Squared-Error Loss Example
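Consider the squared-error loss
$$L(y, f(x)) = [y - f(x)]^2.$$
At the $m$th step, given the current fit $f_{m-1}(x)$, we solve
$$\min_{\beta, \gamma} \sum_i L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)) \iff \min_{\beta, \gamma} \sum_i [y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma)]^2 = \min_{\beta, \gamma} \sum_i [r_{im} - \beta b(x_i; \gamma)]^2,$$
where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual of the current fit on the $i$th observation.
The term $\beta_m b(x; \gamma_m)$ is the best fit to the current residuals. This produces the updated fit
$$f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m).$$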
Forward Stagewise Additive Modeling
1 Initialize $f_0(x) = 0$.
2 For $m = 1$ to $M$:
(a) Compute
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)).$$
(b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.
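For squared-error loss, the stagewise procedure reduces to repeatedly fitting the current residuals. The sketch below assumes a single feature and a two-leaf regression stump as the basis function, with the coefficient $\beta_m$ absorbed into the leaf values; the function names are hypothetical.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Least-squares stump on a 1-D feature: (threshold, left mean, right mean)."""
    best, best_sse = None, np.inf
    for thr in np.unique(x)[:-1]:                 # keep both sides non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best, best_sse = (thr, left.mean(), right.mean()), sse
    return best

def stump_predict(stump, x):
    thr, c_left, c_right = stump
    return np.where(x <= thr, c_left, c_right)

def forward_stagewise(x, y, M):
    f = np.zeros_like(y, dtype=float)             # step 1: f_0(x) = 0
    stumps = []
    for _ in range(M):
        r = y - f                                 # residuals r_im of the current fit
        stump = fit_regression_stump(x, r)        # step 2(a) for squared-error loss
        f = f + stump_predict(stump, x)           # step 2(b); earlier terms are untouched
        stumps.append(stump)
    return stumps, f
```

Each stage can only lower the training squared error, since adding nothing (a zero stump) is always among the candidates that the least-squares fit improves upon.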
AdaBoost and Additive Logistic Models
Additive Logistic Models and AdaBoost
Friedman et al. (2000) showed that AdaBoost is equivalent to forward stagewise additive modeling
using the exponential loss function
$$L(y, f(x)) = \exp\{-y f(x)\},$$
using individual classifiers $G_m(x) \in \{-1, 1\}$ as basis functions.
The (population) minimizer of the exponential loss function is
$$f^*(x) = \arg\min_f E_{Y|x}\left[ e^{-Y f(x)} \right] = \frac{1}{2} \log \frac{\Pr(Y = 1 | x)}{\Pr(Y = -1 | x)},$$
which is equal to one half of the log-odds. So, AdaBoost can be regarded as an additive logistic regression model.
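The half-log-odds form of the population minimizer can be checked numerically at a single point $x$. The sketch below assumes $\Pr(Y = 1 | x) = 0.8$ (an arbitrary illustrative value) and minimizes the conditional expected exponential loss over a fine grid.

```python
import numpy as np

def expected_exp_loss(f, p):
    """E[exp(-Y f)] when Pr(Y = 1 | x) = p and Pr(Y = -1 | x) = 1 - p."""
    return p * np.exp(-f) + (1.0 - p) * np.exp(f)

p = 0.8
grid = np.linspace(-3.0, 3.0, 60001)               # grid step of 1e-4
f_numeric = grid[np.argmin(expected_exp_loss(grid, p))]
f_formula = 0.5 * np.log(p / (1.0 - p))            # one half of the log-odds
```

The grid minimizer and the closed form agree up to the grid resolution.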
Forward Stagewise Additive Modeling with Exponential Loss
For exponential loss, the minimization at the $m$th step in forward stagewise modeling becomes
$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} \exp\{-y_i [f_{m-1}(x_i) + \beta b(x_i; \gamma)]\}.$$
In the context of a weak learner $G$, this is
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} \exp\{-y_i [f_{m-1}(x_i) + \beta G(x_i)]\},$$
or equivalently,
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\},$$
where $w_i^{(m)} = \exp\{-y_i f_{m-1}(x_i)\}$.
Iterative Optimization
In order to solve
$$\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\},$$
we take the two-step (profile) approach:
first, fix $\beta > 0$ and solve for $\hat{G}$;
second, solve for $\beta$ with $G = \hat{G}$.
Recall that both $Y$ and $G(x)$ take only two values, $+1$ and $-1$. So
$$y_i G(x_i) = +1 \iff y_i = G(x_i),$$
$$y_i G(x_i) = -1 \iff y_i \neq G(x_i).$$
Solving for Gm
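$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{n} w_i^{(m)} \exp\{-y_i \beta G(x_i)\}$$
$$= \arg\min_{\beta, G} \left[ e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \right]$$
$$= \arg\min_{\beta, G} \left[ (e^{\beta} - e^{-\beta}) \sum_{y_i \neq G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{i=1}^{n} w_i^{(m)} \right].$$
For any $\beta > 0$, the minimizer $G_m$ is a $\{-1, 1\}$-valued function
$$G_m = \arg\min_{G} \sum_{i=1}^{n} w_i^{(m)} I[y_i \neq G(x_i)],$$
the classifier that minimizes training error for the weighted data.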
Solving for βm
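Define
$$\mathrm{err}_m = \frac{\sum_{i=1}^{n} w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{n} w_i^{(m)}}.$$
Plugging $G_m$ into the objective gives
$$\beta_m = \arg\min_{\beta} \left[ e^{-\beta} + (e^{\beta} - e^{-\beta})\, \mathrm{err}_m \right] \sum_{i=1}^{n} w_i^{(m)}.$$
The solution is
$$\beta_m = \frac{1}{2} \log\left( \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} \right).$$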
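The closed form for $\beta_m$ follows from a first-order condition. A short derivation, writing $\varepsilon$ for $\mathrm{err}_m$ (a shorthand introduced here) and dropping the positive factor $\sum_i w_i^{(m)}$, which does not affect the minimizer:

```latex
\begin{align*}
\frac{d}{d\beta}\left[ e^{-\beta} + (e^{\beta} - e^{-\beta})\,\varepsilon \right]
  &= -(1 - \varepsilon)\, e^{-\beta} + \varepsilon\, e^{\beta} = 0 \\
\Longrightarrow\quad e^{2\beta} &= \frac{1 - \varepsilon}{\varepsilon} \\
\Longrightarrow\quad \beta_m &= \frac{1}{2} \log\left( \frac{1 - \varepsilon}{\varepsilon} \right).
\end{align*}
```

The second derivative, $(1 - \varepsilon) e^{-\beta} + \varepsilon e^{\beta}$, is positive for $0 < \varepsilon < 1$, so this stationary point is a minimum, and $\beta_m > 0$ exactly when $\varepsilon < 1/2$.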
Connection to Adaboost
Since
$$-y G_m(x) = 2 I(y \neq G_m(x)) - 1,$$
we have
$$w_i^{(m+1)} = w_i^{(m)} \exp\{-\beta_m y_i G_m(x_i)\} = w_i^{(m)} \exp\{\alpha_m I(y_i \neq G_m(x_i))\} \exp\{-\beta_m\},$$
where $\alpha_m = 2\beta_m$ and $\exp\{-\beta_m\}$ is constant across the data points. Therefore
the weight update is equivalent to step 2(d) of the AdaBoost algorithm;
step 2(a) of the AdaBoost algorithm is equivalent to solving the minimization problem for $G_m$.
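The claim that the two weight updates differ only by a constant factor can be verified numerically. The labels, predictions, weights, and the value of `beta` below are arbitrary stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
y = rng.choice([-1, 1], size=n)
G = rng.choice([-1, 1], size=n)            # stand-in for the predictions G_m(x_i)
w = rng.uniform(0.1, 1.0, size=n)          # stand-in for the current weights w_i^(m)
beta = 0.4
alpha = 2 * beta

w_stagewise = w * np.exp(-beta * y * G)    # update from the exponential-loss derivation
w_adaboost = w * np.exp(alpha * (y != G))  # update used by AdaBoost
ratio = w_stagewise / w_adaboost           # should equal exp(-beta) for every observation
```

Since the weighted error and the fitted classifier depend only on the relative weights, the common factor $\exp\{-\beta_m\}$ has no effect on the algorithm.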
Weights and Their Update
At the $m$th step, the weight for the $i$th observation is
$$w_i^{(m)} = \exp\{-y_i f_{m-1}(x_i)\},$$
which depends only on $f_{m-1}(x_i)$ and not on $\beta$ or $G(x)$.
At the $(m+1)$th step, we update the weight using the fact that
$$f_m(x_i) = f_{m-1}(x_i) + \beta_m G_m(x_i).$$
This leads to the following update formula:
$$w_i^{(m+1)} = \exp\{-y_i f_m(x_i)\} = \exp\{-y_i (f_{m-1}(x_i) + \beta_m G_m(x_i))\} = w_i^{(m)} \exp\{-\beta_m y_i G_m(x_i)\}.$$
Why Exponential Loss?
The principal virtue is computational.
Exponential loss concentrates much more influence on observations with large negative margins $y f(x)$. It is especially sensitive to misspecification of class labels.
More robust losses: computation is not as easy (use gradient boosting).
Exponential Loss and Cross Entropy
Define $Y' = (Y + 1)/2 \in \{0, 1\}$.
The binomial negative log-likelihood loss function is
$$-l(Y, p(x)) = -\left[ Y' \log p(x) + (1 - Y') \log(1 - p(x)) \right] = \log\left( 1 + \exp\{-2 Y f(x)\} \right) \equiv -l(Y, f(x)),$$
where $p(x) = [1 + e^{-2 f(x)}]^{-1}$.
The population minimizers of the deviance $E_{Y|x}[-l(Y, f(x))]$ and of $E_{Y|x}[e^{-Y f(x)}]$ are the same.
The population minimizer is given by
$$f^*(x) = \frac{1}{2} \log \frac{\Pr(Y = 1 | x)}{\Pr(Y = -1 | x)}.$$
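The identity between the cross-entropy written in terms of $Y'$ and $p(x)$ and the deviance written in terms of the margin can be checked numerically. The labels and scores below are arbitrary illustrative values, and the function names are hypothetical.

```python
import numpy as np

def cross_entropy(y, f):
    """Binomial negative log-likelihood with Y' = (Y + 1) / 2 and p = 1 / (1 + exp(-2 f))."""
    y01 = (y + 1) / 2
    p = 1.0 / (1.0 + np.exp(-2.0 * f))
    return -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

def deviance(y, f):
    """The same loss written as a function of the margin y * f."""
    return np.log1p(np.exp(-2.0 * y * f))

y = np.array([1, 1, -1, -1])
f = np.array([0.3, -1.2, 0.7, -2.0])
```

Both functions return the same values, which confirms that the deviance depends on $(y, f)$ only through the margin $y f$.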
Loss Functions for Classification
The choice of loss function matters for finite data sets.
The margin of $f(x)$ is defined as $y f(x)$.
The classification rule is $G(x) = \mathrm{sign}[f(x)]$.
The decision boundary is $f(x) = 0$.
Observations with positive margin $y_i f(x_i) > 0$ are correctly classified.
Observations with negative margin $y_i f(x_i) < 0$ are incorrectly classified.
The goal of a classification algorithm is to produce positive margins as frequently as possible.
Any loss function should penalize negative margins more heavily than positive margins, since observations with positive margins are already correctly classified.
Various Loss Functions
Monotone decreasing loss functions of the margin:
Misclassification loss: $I(\mathrm{sign}(f(x)) \neq y)$
Exponential loss: $\exp(-yf)$ (not robust against mislabeled samples)
Binomial deviance: $\log\{1 + \exp(-2yf)\}$ (more robust against influential points)
SVM loss: $[1 - yf]_+$
Other loss functions:
Squared error: $(y - f)^2 = (1 - yf)^2$ (penalizes large positive margins, $yf > 1$, heavily)
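The losses listed above can be compared directly as functions of the margin $yf$. The following sketch evaluates them unscaled at a few margins; the function name `losses` and the dictionary keys are illustrative, and a margin of exactly zero is counted as a misclassification here.

```python
import numpy as np

def losses(margin):
    """Evaluate each loss at the given margins y * f (unscaled)."""
    m = np.asarray(margin, dtype=float)
    return {
        "misclassification": (m <= 0).astype(float),   # sign(0) matches neither label
        "exponential": np.exp(-m),
        "binomial_deviance": np.log1p(np.exp(-2.0 * m)),
        "svm_hinge": np.maximum(0.0, 1.0 - m),
        "squared_error": (1.0 - m) ** 2,
    }

table = losses([-2.0, 0.0, 1.0, 2.0])
for name, values in table.items():
    print(f"{name:>18}: {np.round(values, 3)}")
```

The squared-error row rises again for margins above 1, while the other four keep decreasing or stay at zero, which is the sense in which squared error penalizes confidently correct predictions.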
[Figure 10.4 from Hastie, Tibshirani & Friedman (2001), Elements of Statistical Learning, Chapter 10. Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is $f$, with class prediction $\mathrm{sign}(f)$. The losses are misclassification: $I(\mathrm{sign}(f) \neq y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point $(0, 1)$. Horizontal axis: margin $yf$ (from $-2$ to $2$); vertical axis: loss (from 0 to 3).]
Brief Summary on AdaBoost
AdaBoost fits an additive model, where the basis functions $G_m(x)$ stage-wise optimize exponential loss.
The population minimizer of exponential loss is one half of the log-odds.
There are loss functions more robust than squared error loss (for regression problems) or exponential loss (for classification problems).
L2 Boosting
Proposed by Bühlmann & Yu (2003).
Adopts the general framework of the gradient boosting machine developed by Friedman (2001).
Data: $(Y_i, X_i)$, $i = 1, \ldots, n$, where $Y_i$ is the response and $X_i$ is a $p$-dimensional feature vector.
Problem: find $F : \mathbb{R}^p \to \mathbb{R}$ to minimize the expected loss $E[L\{Y_i, F(X_i)\}]$.
Define the empirical loss: $L_n(F) = \sum_{i=1}^{n} L\{Y_i, F(X_i)\}$.
Estimate $F$ nonparametrically using functional gradient descent.
The procedure includes: initialization, evaluation of the negative gradient of the loss function, projection of the gradient onto a base learner, line search, update, and iteration.
Common choice: component-wise cubic smoothing splines as the base learner.
Algorithm
Step 1. Set $m = 0$ and $F_m(\cdot) = 0$.
Step 2. Compute the negative gradient as
$$U_i^{(m)} = -\left. \frac{\partial L(Y_i, F)}{\partial F} \right|_{F = F_m(X_i)}, \quad i = 1, \ldots, n.$$
Step 3. Projection: regress $U_i^{(m)}$ on $X_{i,k}$ using a univariate cubic smoothing spline as the base learner, and choose the predictor $X_{k(m)}$ such that
$$k(m) = \arg\min_{1 \leq k \leq p} \sum_{i=1}^{n} \left( U_i^{(m)} - \hat{g}_k(X_{i,k}) \right)^2,$$
where $\hat{g}_k(\cdot)$ is the conventional univariate cubic smoothing spline fit.
Step 4. Line search: find $\hat{w}^{(m)}$ that minimizes
$$\sum_{i=1}^{n} L\{Y_i, F_m(X_i) + w\, \hat{g}_{k(m)}(X_{i,k(m)})\}$$
with respect to $w$.
Step 5. Update: set
$$F_{m+1}(X_i) = F_m(X_i) + \rho\, \hat{w}^{(m)} \hat{g}_{k(m)}(X_{i,k(m)}),$$
and increase $m$ by one. Here, $\rho$ is a small learning rate, usually chosen as $\rho = 0.05$ or $0.01$.
Step 6. Iteration: repeat Steps 2 to 5.
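Steps 1 to 6 can be sketched as follows. To keep the example self-contained, the cubic smoothing spline is replaced by a univariate linear least-squares fit with intercept, which is a deliberate simplification and not the base learner named in the lecture. The loss is assumed to be $L(Y, F) = (Y - F)^2 / 2$, so the negative gradient is the residual, and the function name `l2_boost` is hypothetical.

```python
import numpy as np

def l2_boost(X, y, n_iter=500, rho=0.05):
    """Component-wise L2 boosting with a linear base learner (stand-in for the spline)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    sxx = ((X - xbar) ** 2).sum(axis=0)               # assumes no constant column
    F = np.zeros(n)                                   # Step 1: F_0 = 0
    intercept, coef = 0.0, np.zeros(p)
    for _ in range(n_iter):
        U = y - F                                     # Step 2: negative gradient (residuals)
        b = (X - xbar).T @ (U - U.mean()) / sxx       # least-squares slope of U on each X_k
        a = U.mean() - b * xbar                       # matching intercepts
        fits = a + X * b                              # column k holds g_k(X_ik)
        k = int(np.argmin(((U[:, None] - fits) ** 2).sum(axis=0)))   # Step 3
        g = fits[:, k]
        gg = g @ g
        if gg == 0.0:                                 # nothing left to fit
            break
        w = (U @ g) / gg                              # Step 4: closed-form line search
        F = F + rho * w * g                           # Step 5: shrunken update
        intercept += rho * w * a[k]
        coef[k] += rho * w * b[k]
    return intercept, coef
```

Because only one predictor is updated per iteration, features that never win the projection step keep a coefficient of exactly zero, so early stopping acts as a form of variable selection.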
Discussions
Tuning parameters: the learning rate $\rho$ and the total number of iterations.
Both can be tuned via cross-validation.
Empirical findings: a smaller learning rate often leads to better predictive performance of the boosting algorithm; the predictive performance is not overly sensitive to the choice of $\rho$.
Stopping rule: stop the iteration when the loss function flattens. This gives almost the same performance as cross-validation.
Variable importance measure (Friedman, 2001):
$$I_j = \left\{ E_X \left[ \frac{\partial F(X)}{\partial X_j} \right]^2 \cdot \mathrm{Var}(X_j) \right\}^{1/2}, \quad j = 1, \ldots, p,$$
where the expectation is taken with respect to the marginal distribution of $X$ and can be computed through a sample average.
Boosting Trees
A tree can be expressed as
$$T(x; \Theta) = \sum_{j=1}^{J} \gamma_j I(x \in R_j),$$
with parameters $\Theta = \{\gamma_j, R_j\}_1^J$.
The parameters are found by minimizing the empirical risk
$$\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j) = \arg\min_{\Theta} \sum_{i=1}^{N} L(y_i, T(x_i; \Theta)).$$
Finding $\gamma_j$ given $R_j$: this can be done trivially.
Finding $R_j$: this is the difficult part. A typical strategy is to use a greedy, top-down recursive partitioning algorithm. A smooth function may be used to approximate $L$.
Boosted Tree Model
The boosted tree model is
$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m),$$
induced in a forward stagewise manner.
At each step, we solve
$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)),$$
where $\Theta_m = \{\gamma_{jm}, R_{jm}\}$, $j = 1, \ldots, J_m$.
Gradient Boosting
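Consider the numerical optimization
$$\arg\min_{\mathbf{f}} L(\mathbf{f}), \qquad L(\mathbf{f}) \equiv \sum_{i=1}^{N} L(y_i, f(x_i)),$$
where $\mathbf{f} = (f(x_1), \ldots, f(x_N))^T \in \mathbb{R}^N$ collects the fitted values at the training points.
The solution is written as a sum of component vectors
$$\mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m, \quad \mathbf{h}_m \in \mathbb{R}^N,$$
where $\mathbf{f}_M = (f_M(x_1), \ldots, f_M(x_N))^T$ and $\mathbf{h}_m$ is the increment vector at the $m$th step.
Steepest descent: choose $\mathbf{h}_m = -\rho_m \mathbf{g}_m$, where $\rho_m$ is a scalar and the $i$th component of $\mathbf{g}_m$ is given by
$$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}.$$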
The step length $\rho_m$ is the solution to
$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) - \rho g_{im}).$$
Solution update: $f_m(x_i) = f_{m-1}(x_i) - \rho_m g_{im}$.
Gradient boosting based on least squares:
$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} (-g_{im} - T(x_i; \Theta))^2.$$
Gradient Tree Boosting Algorithm
1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
a. For $i = 1, \ldots, N$, compute $r_{im} = -g_{im}$.
b. Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, 2, \ldots, J_m$.
c. For $j = 1, 2, \ldots, J_m$, compute
$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma).$$
d. Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.
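The algorithm above is stated for a general loss. As one concrete instance, the sketch below assumes absolute-error loss $L(y, f) = |y - f|$, a single feature, and two-leaf trees ($J_m = 2$); all three are assumptions made for illustration, and the function names are hypothetical. With this loss, the initial fit and the leaf values in step c are medians, while the tree in step b is still fitted by least squares to the negative gradient.

```python
import numpy as np

def fit_stump_split(x, r):
    """Step b with J_m = 2: least-squares split point of a 1-D feature on targets r."""
    best_thr, best_sse = None, np.inf
    for thr in np.unique(x)[:-1]:                      # keep both regions non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_thr, best_sse = thr, sse
    return best_thr

def gradient_tree_boost_lad(x, y, M=50):
    f0 = float(np.median(y))                           # step 1: argmin_gamma sum |y_i - gamma|
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        r = np.sign(y - f)                             # step a: negative gradient of |y - f|
        thr = fit_stump_split(x, r)                    # step b: regions R_1m, R_2m
        left = x <= thr
        g_left = float(np.median(y[left] - f[left]))   # step c: optimal constant per region
        g_right = float(np.median(y[~left] - f[~left]))
        f = np.where(left, f + g_left, f + g_right)    # step d
        trees.append((thr, g_left, g_right))
    return f0, trees

def predict(f0, trees, x):
    f = np.full(len(x), f0, dtype=float)
    for thr, g_left, g_right in trees:
        f += np.where(x <= thr, g_left, g_right)
    return f
```

Step c is what distinguishes this from simply adding the fitted tree: the regions come from the least-squares fit to the gradient, but the values placed in them are re-optimized under the actual loss.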
XGBoost
eXtreme Gradient Boosting (XGBoost): a scalable tree boosting system, capable of handling large data sets.
A variant of gradient tree boosting, also known as gradient boosting machine (GBM) or gradient boosted regression tree (GBRT).
Builds an ensemble of weak learners in a stage-wise manner.
Combines gradient descent with CART.
Successfully used in many machine learning and data mining challenges, such as those hosted on Kaggle.
Among the 29 challenge-winning solutions published on Kaggle’s blog during 2015, 17 used XGBoost. Of these, eight used XGBoost alone to train the model, while most of the others combined XGBoost with neural nets in ensembles.
Gradient Tree Boosting
Data with $n$ samples and $m$ features: $D = \{(x_i, y_i) : i = 1, \ldots, n\}$, $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$.
A tree ensemble model uses $K$ additive functions:
$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$
where $\mathcal{F} = \{f(x) = w_{q(x)} : q : \mathbb{R}^m \to T, w \in \mathbb{R}^T\}$ is the space of regression trees (CART).
$q$: represents the structure of each tree;
$T$: number of leaves of each tree;
$w$: leaf weights (the score of each leaf).
Minimize the following objective function:
$$L(\phi) = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k),$$
where $\Omega(f) = \gamma T + (\lambda / 2) \|w\|^2$.
Consider $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$ and
$$L^{(t)}(\phi) = \sum_i l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t).$$
Consider the second-order approximation:
$$L^{(t)}(\phi) \approx \sum_i \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
$$= \sum_i \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant},$$
where $g_i$ and $h_i$ are the first- and second-order derivatives of $l$ with respect to $\hat{y}^{(t-1)}$, evaluated at $\hat{y}_i^{(t-1)}$.
Recap of XGBoost
Algorithm:
1. Initialize $\hat{y}^{(0)}(x) = f_0(x) = \arg\min_{f_0} \sum_{i=1}^{n} l(y_i, f_0(x_i)) + \Omega(f_0)$.
2. For $t = 1, \ldots, K$, do:
Calculate $g_i$ and $h_i$, $i = 1, \ldots, n$;
Fit a new tree $f_t = \arg\min_{f_t} \sum_i \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$; (*)
Update $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$.
3. Output $\hat{y}^{(K)}$.
Combination of Gradient Boosting with CART
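To optimize (*):
Define the instance set in leaf $j$ as $I_j = \{i : q(x_i) = j\}$.
Rewrite $L^{(t)}(\phi)$ as
$$L^{(t)}(\phi) = \sum_i \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{\lambda}{2} \sum_{j=1}^{T} w_j^2$$
$$= \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T.$$
For a fixed structure $q$, the optimal weight $w_j^*$ of leaf $j$ is given by
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$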
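A small numeric illustration of the optimal leaf weight, assuming the squared-error loss $l(y, \hat{y}) = (y - \hat{y})^2 / 2$ (a choice made here for illustration), for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$. The function name `leaf_weight` and the data values are hypothetical.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal weight w_j* = -sum(g) / (sum(h) + lambda) for the instances in one leaf."""
    return -np.sum(g) / (np.sum(h) + lam)

# Four instances falling in the same leaf, with current predictions of zero.
y = np.array([1.0, 2.0, 3.0, 6.0])
yhat = np.zeros_like(y)
g, h = yhat - y, np.ones_like(y)

w_unreg = leaf_weight(g, h, lam=0.0)   # 12 / 4 = 3.0, the mean residual
w_reg = leaf_weight(g, h, lam=4.0)     # 12 / 8 = 1.5, shrunk toward zero
```

With $\lambda = 0$ the leaf weight is the mean residual, as in ordinary gradient tree boosting with squared error; a positive $\lambda$ shrinks it, most strongly for leaves with few instances.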
Splitting Trees
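Define the optimal value of a given tree structure $q$:
$$\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T.$$
The optimal value acts like an impurity score for evaluating decision trees. A greedy algorithm can be used to build the tree (a single node is added at each step).
Let $I_L$ and $I_R$ denote the instance sets of the left and right nodes after the split, and let $I = I_L \cup I_R$. The loss reduction after the split is given by
$$L_{split} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma.$$
The above is usually used for evaluating the split candidates.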
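The loss-reduction formula translates directly into an exact greedy search over one feature. The sketch below is an illustration under assumptions, not XGBoost's implementation: it scans every distinct value of a single feature, and the function names `split_gain` and `best_split` are hypothetical.

```python
import numpy as np

def split_gain(g, h, left, lam=1.0, gamma=0.0):
    """L_split for a boolean mask `left` selecting the instances sent to the left node."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    total = score(g[left], h[left]) + score(g[~left], h[~left]) - score(g, h)
    return 0.5 * total - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search: try every distinct value of x (except the largest) as a threshold."""
    best_thr, best_gain = None, -np.inf
    for thr in np.unique(x)[:-1]:
        gain = split_gain(g, h, x <= thr, lam, gamma)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain
```

For a step-shaped response the search recovers the location of the step, and a sufficiently large $\gamma$ makes the gain of even the best split negative, which is what the pruning rule relies on.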
Splitting Algorithms
Exact greedy algorithm: potentially time-consuming for continuous variables.
Approximate greedy algorithm:
Proposes candidate splitting points according to percentiles of the feature distribution.
Maps the continuous features into buckets split by these candidate points.
Aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.
Sparsity-aware split finding: handles sparse features, which arise from
the presence of missing values in the data;
frequent zero entries in the statistics.
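The three steps of the approximate greedy algorithm can be sketched for a single feature as follows. This is a simplified illustration: candidates are plain, unweighted percentiles computed with `np.quantile`, which is an assumption of this sketch and not a description of how XGBoost proposes candidates internally; the function name and the number of candidates are likewise hypothetical.

```python
import numpy as np

def approx_best_split(x, g, h, n_candidates=8, lam=1.0, gamma=0.0):
    # 1. Propose candidate thresholds at evenly spaced percentiles of the feature.
    qs = np.linspace(0.0, 1.0, n_candidates + 2)[1:-1]
    cands = np.unique(np.quantile(x, qs))
    # 2. Map each value to a bucket: bucket b holds cands[b-1] < x <= cands[b],
    #    and the last bucket holds x > cands[-1].
    bucket = np.searchsorted(cands, x, side="left")
    # 3. Aggregate gradient statistics per bucket, then scan the candidates.
    G = np.bincount(bucket, weights=g, minlength=len(cands) + 1)
    H = np.bincount(bucket, weights=h, minlength=len(cands) + 1)
    GL, HL = np.cumsum(G)[:-1], np.cumsum(H)[:-1]      # statistics for x <= cands[b]
    Gt, Ht = G.sum(), H.sum()
    gain = 0.5 * (GL ** 2 / (HL + lam)
                  + (Gt - GL) ** 2 / (Ht - HL + lam)
                  - Gt ** 2 / (Ht + lam)) - gamma
    b = int(np.argmax(gain))
    return cands[b], gain[b]
```

The cost of the scan depends on the number of candidates, not on the number of distinct feature values, at the price of possibly missing the exact best threshold.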
Refine Prediction
Pruning
The gain of the split $L_{split}$ can be negative.
Grow a tree to maximum depth, then recursively prune all the leaf splits with negative gain.
Shrinkage: $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \epsilon f_t(x_i)$, where $\epsilon$ is the learning rate, usually set at around 0.1.
Feature subsampling: as used in random forests.
XGBoost is available in R (“xgboost” package), Python, and Julia.