
Statistics 202: Data Mining

Support Vector Machines, Random Forests, Boosting

Jonathan Taylor
December 2, 2012

Based in part on slides from the textbook and on slides of Susan Holmes.


Neural networks

Neural network

Another classifier (or regression technique) for two-class problems.

Like logistic regression, it tries to predict labels y from x.

Tries to find
\[
\hat{f} = \underset{f \in \mathcal{C}}{\operatorname{argmin}} \; \sum_{i=1}^n \left(Y_i - f(X_i)\right)^2,
\]
where \(\mathcal{C} = \{\text{``neural net functions''}\}\).


Neural networks: single layer

[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining: an artificial neural network (ANN).]


Neural networks

Single layer neural nets

With p inputs, there are p + 1 weights:
\[
f(x_1, \dots, x_p) = S(\alpha + x^T \beta).
\]

Here, S is a sigmoid function like the logistic curve,
\[
S(\eta) = \frac{e^\eta}{1 + e^\eta}.
\]

The algorithm must estimate (α, β) from the data.

The classifier assigns +1 to anything with S(α + xᵀβ) > t, for some threshold t such as 0.5.
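A minimal R sketch of this rule (the values of alpha and beta here are made up for illustration, not estimated):

    # Logistic sigmoid S(eta) = e^eta / (1 + e^eta)
    S <- function(eta) exp(eta) / (1 + exp(eta))

    # Single-layer classifier: assign +1 when S(alpha + x'beta) > t, else -1
    predict_single_layer <- function(X, alpha, beta, t = 0.5) {
      ifelse(S(alpha + X %*% beta) > t, +1, -1)
    }

    # Made-up weights on two features
    X <- matrix(rnorm(20), ncol = 2)
    predict_single_layer(X, alpha = 0.1, beta = c(1, -2))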


Neural networks: double layer

[Figure: a network with 5 inputs, one hidden layer of 3 units, and a single output.]

Neural networks

Double layer neural nets

In the previous figure, there are 15 weights and 3 intercept terms (α's) for the input → hidden layer.

There are 3 weights and one intercept term for the hidden → output layer.

The outputs from the hidden layer have the form
\[
S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right), \quad 1 \le j \le 3.
\]

So, the final output is of the form
\[
S\left(\alpha_0 + \sum_{j=1}^{3} v_j \, S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right)\right).
\]
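A sketch of this forward pass in R, with the dimensions from the figure (5 inputs, 3 hidden units); in practice the parameter values would come from the fitting step described next:

    S <- function(eta) exp(eta) / (1 + exp(eta))

    # x: length-5 input; W: 5 x 3 matrix of the w_ij; a: the 3 hidden
    # intercepts alpha_j; v: the 3 hidden-to-output weights; a0: the
    # output intercept alpha_0.
    two_layer_output <- function(x, W, a, v, a0) {
      hidden <- S(a + as.vector(t(W) %*% x))  # S(alpha_j + sum_i w_ij x_i)
      S(a0 + sum(v * hidden))                 # outer sigmoid of the weighted sum
    }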


Neural networks

Single layer neural nets

Fitting this model is done via nonlinear least squares.

This is computationally difficult due to the highly nonlinear form of the regression function.
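One rarely codes this optimization by hand; for instance, the nnet package in R fits single-hidden-layer networks of this kind by BFGS. A sketch on simulated data:

    library(nnet)

    # Simulated two-class data
    set.seed(1)
    sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    sim$y <- factor(ifelse(sim$x1 + sim$x2 + rnorm(100) > 0, "+1", "-1"))

    # One hidden layer with 3 units, fit by numerical optimization
    fit <- nnet(y ~ x1 + x2, data = sim, size = 3, maxit = 500, trace = FALSE)
    table(predict(fit, newdata = sim, type = "class"), sim$y)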


Support vector machines

Support vector machine

Another classifier (or regression technique) for two-class problems.

Like logistic regression, it tries to predict labels y from x using a linear function.

A linear function with an intercept α determines an affine function f_{α,β}(x) = xᵀβ + α and a hyperplane
\[
H_{\alpha,\beta} = \left\{x : x^T \beta + \alpha = 0\right\}.
\]

A good classifier will "separate the classes" into the sides where f_{α,β}(x) > 0 and f_{α,β}(x) < 0.

The SVM seeks the hyperplane that "separates the classes the most."


Support vector machine

[A sequence of figures from Tan, Steinbach & Kumar, Introduction to Data Mining:
- Find a linear hyperplane (decision boundary) that separates the data.
- One possible solution.
- Another possible solution.
- Other possible solutions.
- Which one is better, B1 or B2? How do you define "better"?
- Find the hyperplane that maximizes the margin: B1 is better than B2.]


Support vector machines

Support vector machine

The reciprocal of the maximum margin can be written as
\[
\inf_{\beta, \alpha} \|\beta\|_2 \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 \text{ for all } i.
\]

Therefore, the support vector machine problem is
\[
\underset{\beta, \alpha}{\text{minimize}} \; \|\beta\|_2 \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 \text{ for all } i.
\]
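In R, this can be fit with the svm function from the e1071 package (the same function mentioned later for kernels); a very large cost approximately enforces the hard-margin constraints. A sketch on simulated, separable data:

    library(e1071)

    # Simulated, linearly separable two-class data
    set.seed(1)
    d <- data.frame(x1 = c(rnorm(20, 2), rnorm(20, -2)),
                    x2 = c(rnorm(20, 2), rnorm(20, -2)),
                    y  = factor(rep(c("+1", "-1"), each = 20)))

    # Linear kernel; large cost approximates the hard-margin problem
    fit <- svm(y ~ x1 + x2, data = d, kernel = "linear", cost = 1e5, scale = FALSE)
    fit$SV  # the support vectors: points achieving y_i (x_i' beta + alpha) = 1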


Support vector machine

[Figure from Tan, Steinbach & Kumar: what if the problem is not linearly separable?]


Support vector machines

Non-separable problems

If we cannot completely separate the classes, it is common to modify the problem to
\[
\underset{\beta, \alpha, \xi}{\text{minimize}} \; \|\beta\|_2 \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \sum_{i=1}^n \xi_i \le C,
\]
or the Lagrange version
\[
\underset{\beta, \alpha, \xi}{\text{minimize}} \; \|\beta\|_2^2 + \gamma \sum_{i=1}^n \xi_i \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 - \xi_i, \quad \xi_i \ge 0.
\]
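In e1071::svm, the cost argument plays the role of γ in the Lagrange version: larger cost penalizes slack more heavily. Reusing the simulated data d from the previous sketch:

    # Softer vs. stiffer penalties on the slack variables
    fit_soft  <- svm(y ~ x1 + x2, data = d, kernel = "linear", cost = 0.1)
    fit_stiff <- svm(y ~ x1 + x2, data = d, kernel = "linear", cost = 100)
    c(soft = sum(fit_soft$nSV), stiff = sum(fit_stiff$nSV))  # support vector counts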


Support vector machines

Non-separable problems

The ξᵢ's can be removed from this problem, yielding
\[
\underset{\beta, \alpha}{\text{minimize}} \; \|\beta\|_2^2 + \gamma \sum_{i=1}^n \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+,
\]
where (z)₊ = max(z, 0) is the positive-part function. Or, equivalently,
\[
\underset{\beta, \alpha}{\text{minimize}} \; \sum_{i=1}^n \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+ + \lambda \|\beta\|_2^2.
\]
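This unconstrained form is easy to transcribe directly. A naive sketch that minimizes the penalized hinge loss with optim; this is not how svm actually solves the problem, but it makes the objective concrete (reusing d from above):

    # Penalized hinge loss: sum (1 - y f(x))_+ + lambda ||beta||^2
    hinge_obj <- function(par, X, y, lambda) {
      alpha <- par[1]; beta <- par[-1]
      f <- alpha + X %*% beta
      sum(pmax(1 - y * f, 0)) + lambda * sum(beta^2)
    }

    # y coded as +1/-1; X an n x p matrix
    X <- as.matrix(d[, c("x1", "x2")])
    y <- ifelse(d$y == "+1", 1, -1)
    opt <- optim(rep(0, 3), hinge_obj, X = X, y = y, lambda = 1)
    opt$par  # (alpha, beta_1, beta_2)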


Logistic vs. SVM

[Figure: the logistic loss and the SVM hinge loss plotted on the same axes.]


Iris data: sepal.length, sepal.width

[Figure: the iris data, sepal.length vs. sepal.width.]

[Two further figures of the iris data, sepal.length vs. sepal.width.]

Support vector machines

Kernel trick

The term ‖β‖₂² can be thought of as a complexity penalty on the function f_{α,β}, so we could write
\[
\underset{f_\beta, \alpha}{\text{minimize}} \; \sum_{i=1}^n \left(1 - y_i (\alpha + f_\beta(x_i))\right)_+ + \lambda \|f_\beta\|_2^2.
\]

We could also add such a complexity penalty to logistic regression:
\[
\underset{f_\beta, \alpha}{\text{minimize}} \; \sum_{i=1}^n \mathrm{DEV}(\alpha + f_\beta(x_i), y_i) + \lambda \|f_\beta\|_2^2.
\]

So, in both the SVM and penalized logistic regression, we are looking for the "best" linear function, where "best" is determined by either the logistic loss (deviance) or the hinge loss (SVM).


Support vector machines

Kernel trick

If the decision boundary is nonlinear, one might look for functions other than linear ones.

There are classes of functions called Reproducing Kernel Hilbert Spaces (RKHS) for which it is theoretically (and computationally) possible to solve
\[
\underset{f \in \mathcal{H}}{\text{minimize}} \; \sum_{i=1}^n \left(1 - y_i (\alpha + f(x_i))\right)_+ + \lambda \|f\|_\mathcal{H}^2
\]
or
\[
\underset{f \in \mathcal{H}}{\text{minimize}} \; \sum_{i=1}^n \mathrm{DEV}(\alpha + f(x_i), y_i) + \lambda \|f\|_\mathcal{H}^2.
\]

The svm function in R allows some such "kernels" for the SVM.
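For example, a radial (Gaussian) kernel fit on the simulated data d from earlier; note that svm's gamma argument is the kernel width, unrelated to the slack penalty γ above:

    # Nonlinear decision boundary via the radial basis kernel
    fit_rbf <- svm(y ~ x1 + x2, data = d, kernel = "radial", gamma = 1, cost = 1)
    table(predict(fit_rbf), d$y)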


Iris data: sepal.length, sepal.width

[A sequence of figures: the iris data, sepal.length vs. sepal.width, with fitted decision boundaries.]


Ensemble methods

[Figure from Tan, Steinbach & Kumar: the general idea of ensemble methods.]


Ensemble methods

Why use ensemble methods?

Basic idea: weak learners / classifiers are noisy learners.

Aggregating over a large number of weak learners should improve the output.

This assumes some "independence" across the weak learners; otherwise aggregation does nothing. That is, if every learner is the same, then any sensible aggregation technique should simply return that classifier.

Ensemble methods are often some of the best classifiers in terms of performance.

Downside: they lose interpretability in the aggregation stage.


Ensemble methods

Bagging

In this method, one takes several bootstrap samples (samples with replacement) of the data.

For each bootstrap sample S_b, 1 ≤ b ≤ B, fit a model, retaining the classifier f^{*,b}.

After all the models have been fit, classify by majority vote:
\[
\hat{f}(x) = \text{majority vote of } \left(f^{*,b}(x)\right)_{1 \le b \le B}.
\]
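A sketch of bagged classification trees in R, with rpart as the base learner (B = 100 is an arbitrary choice):

    library(rpart)

    set.seed(1)
    B <- 100
    n <- nrow(iris)

    # One tree per bootstrap sample
    trees <- lapply(1:B, function(b) {
      idx <- sample(n, n, replace = TRUE)
      rpart(Species ~ Sepal.Length + Sepal.Width, data = iris[idx, ])
    })

    # Majority vote across the B trees
    votes  <- sapply(trees, function(tr)
      as.character(predict(tr, newdata = iris, type = "class")))
    bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
    mean(bagged != iris$Species)  # training (not OOB) error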


Ensemble methods

Bagging

Approximately 1/3 of the data is not selected in each bootstrap sample (recall the 0.632 rule).

Using this observation, we can define the out-of-bag (OOB) estimate of accuracy, based on the OOB prediction
\[
\hat{f}_{\mathrm{OOB}}(x_i) = \text{majority vote of } \left(f^b(x_i)\right)_{b \in B(i)},
\]
where B(i) = {all trees not built with x_i}. The total OOB error is
\[
\mathrm{OOB} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\left(\hat{f}_{\mathrm{OOB}}(x_i) \ne y_i\right).
\]
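The randomForest package computes this automatically: predict called on the fitted forest without new data returns the OOB predictions. On the iris data:

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = iris)

    oob_pred <- predict(rf)           # OOB predictions (no newdata argument)
    mean(oob_pred != iris$Species)    # total OOB error, as defined above
    rf$err.rate[rf$ntree, "OOB"]      # the same quantity tracked during fitting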


Ensemble methods

Random forests: bagging trees

If the weak learners are trees, this method is called a Random Forest.

Some variations on random forests inject more randomness than just bagging. For example:
- taking a random subset of the features;
- taking random linear combinations of the features.


Iris data: sepal.length, sepal.width using randomForest

[Figure: random forest fit to the iris data.]


Ensemble methods

Boosting

Unlike bagging, boosting does not insert randomization into the algorithm.

Boosting works by iteratively reweighting each observation and refitting a given classifier.

To begin, suppose we have a classification algorithm that accepts case weights w = (w₁, ..., wₙ),
\[
\hat{f}_w = \underset{f}{\operatorname{argmin}} \; \sum_{i=1}^n w_i \, L(y_i, f(x_i)),
\]
and returns labels in {+1, −1}.


Ensemble methods

Boosting algorithm (AdaBoost.M1)

Step 0: Initialize the weights w_i^{(0)} = 1/n, 1 ≤ i ≤ n.

Step 1: For 1 ≤ t ≤ T, fit a classifier with weights w^{(t−1)}, yielding
\[
g^{(t)} = \hat{f}_{w^{(t-1)}} = \underset{f}{\operatorname{argmin}} \; \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)).
\]

Step 2: Compute the total error rate and the classifier weight
\[
\mathrm{err}^{(t)} = \frac{\sum_{i=1}^n w_i^{(t-1)} \mathbf{1}\left(y_i \ne g^{(t)}(x_i)\right)}{\sum_{i=1}^n w_i^{(t-1)}},
\qquad
\alpha^{(t)} = \log\left(\frac{1 - \mathrm{err}^{(t)}}{\mathrm{err}^{(t)}}\right).
\]


Step 3: Produce the new weights
\[
w_i^{(t)} = w_i^{(t-1)} \cdot \exp\left(\alpha^{(t)} \cdot \mathbf{1}\left(y_i \ne g^{(t)}(x_i)\right)\right).
\]

Step 4: Repeat steps 1-3 until termination.

Step 5: Output the classifier
\[
\hat{f}(x) = \operatorname{sign}\left(\sum_{t=1}^T \alpha^{(t)} g^{(t)}(x)\right).
\]
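A compact sketch of AdaBoost.M1 in R, using depth-one rpart trees (stumps) as the weighted base classifier; the helper names and the choice of stumps are ours, not from the slides:

    library(rpart)

    # y must be coded +1/-1; n_iter plays the role of T.
    # Assumes 0 < err < 1 at every step (otherwise alpha is infinite).
    adaboost_m1 <- function(X, y, n_iter = 50) {
      n <- length(y)
      w <- rep(1 / n, n)                      # Step 0: uniform weights
      d <- data.frame(X, y = factor(y))
      fits <- vector("list", n_iter)
      alpha <- numeric(n_iter)
      for (t in 1:n_iter) {
        # Step 1: fit the weighted base classifier (a stump)
        fits[[t]] <- rpart(y ~ ., data = d, weights = w,
                           control = rpart.control(maxdepth = 1))
        g <- as.numeric(as.character(predict(fits[[t]], d, type = "class")))
        # Step 2: weighted error rate and classifier weight
        err      <- sum(w * (g != y)) / sum(w)
        alpha[t] <- log((1 - err) / err)
        # Step 3: upweight the misclassified observations
        w <- w * exp(alpha[t] * (g != y))
      }
      list(fits = fits, alpha = alpha)
    }

    # Step 5: sign of the alpha-weighted vote
    predict_adaboost <- function(model, newdata) {
      G <- sapply(seq_along(model$fits), function(t)
        as.numeric(as.character(predict(model$fits[[t]], newdata, type = "class"))))
      sign(G %*% model$alpha)
    }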


Ensemble methods

[Figure from Tan, Steinbach & Kumar: illustrating AdaBoost; the training data points and the initial weights for each data point.]


Ensemble methods

[Figure from Tan, Steinbach & Kumar: illustrating AdaBoost, continued.]


Ensemble methods

Boosting as gradient descent

It turns out that boosting can be thought of as something like gradient descent.

Above, we assumed we were solving
\[
\underset{f}{\operatorname{argmin}} \; \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)),
\]
but we didn't say what type of f we considered.


Ensemble methods

Boosting as gradient descent

Call the space of all possible classifiers above F₀, and the set of all linear combinations of such classifiers
\[
\mathcal{F} = \left\{ \sum_j \alpha_j f_j : f_j \in \mathcal{F}_0 \right\}.
\]

In some sense, the boosting algorithm is a "steepest descent" algorithm to find
\[
\underset{f \in \mathcal{F}}{\operatorname{argmin}} \; \sum_{i=1}^n L(y_i, f(x_i)).
\]

For more details, see ESL, but we will discuss some of these ideas. In R, this idea is implemented in the gbm package.
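A sketch with gbm, using the binomial deviance as the loss, on the iris problem that appears in the next few slides ({Versicolor} vs. {Setosa, Virginica}); the tuning values are illustrative:

    library(gbm)

    # 0/1-coded response: versicolor vs. the rest
    d_iris <- data.frame(y  = as.numeric(iris$Species == "versicolor"),
                         x1 = iris$Sepal.Length,
                         x2 = iris$Sepal.Width)

    fit <- gbm(y ~ x1 + x2, data = d_iris, distribution = "bernoulli",
               n.trees = 1000, interaction.depth = 1, shrinkage = 0.01)
    head(predict(fit, newdata = d_iris, n.trees = 1000, type = "response"))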


Ensemble methods

Boosting as gradient descent

Consider the problem
\[
\underset{f \in \mathcal{F}}{\text{minimize}} \; \sum_{i=1}^n L(y_i, f(x_i)) = \mathcal{L}(f).
\]

This can be solved by "gradient descent," where the "gradient" ∇L(f) is an element of F₀ giving the direction of steepest descent for some stepsize α:
\[
\underset{g \in \mathcal{F}_0}{\operatorname{argmin}} \; \mathcal{L}(f + \alpha \cdot g).
\]


Ensemble methods

Boosting as gradient descent

We might also optimize the stepsize:
\[
\underset{g \in \mathcal{F}_0, \, \alpha}{\operatorname{argmin}} \; \mathcal{L}(f + \alpha \cdot g).
\]


Ensemble methods

Boosting as gradient descent

In this context, the forward stagewise algorithm for steepest descent is:

1. Initialize: f^{(0)}(x) = 0.
2. For t = 1, ..., T, solve
\[
(\alpha^{(t)}, g^{(t)}) = \underset{\alpha, \, g \in \mathcal{F}_0}{\operatorname{argmin}} \; \mathcal{L}(f^{(t-1)} + \alpha \cdot g)
\]
and update f^{(t)} = f^{(t−1)} + α^{(t)} · g^{(t)}.
3. Return the final classifier f^{(T)}.

When L(y, f(x)) = e^{−y·f(x)}, this corresponds to AdaBoost.M1.

Changing the loss function yields new gradient boosting algorithms.
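In gbm, the loss is chosen through the distribution argument; in particular, distribution = "adaboost" uses the exponential loss above. Reusing d_iris from the earlier sketch:

    # Exponential loss: the AdaBoost.M1 case of gradient boosting
    fit_ada <- gbm(y ~ x1 + x2, data = d_iris, distribution = "adaboost",
                   n.trees = 1000, interaction.depth = 1, shrinkage = 0.01)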


Iris data: sepal.length, sepal.width, 1000 trees

[Figure: {Versicolor} vs. {Setosa, Virginica}.]

Iris data: sepal.length, sepal.width, 5000 trees

[Figure: {Versicolor} vs. {Setosa, Virginica}.]

Iris data: sepal.length, sepal.width, 10000 trees

[Figure: {Versicolor} vs. {Setosa, Virginica}.]


Boosting fits an additive model (by default)

[Two figures illustrating the additive fit.]


Out-of-bag error improvement

[Figure.]