
Statistics 202: Data Mining

Support Vector Machines, Random Forests, Boosting

Jonathan Taylor
December 2, 2012

Based in part on slides from the textbook and on slides of Susan Holmes.


Neural networks

Neural network

Another classifier (or regression technique) for two-class problems.

Like logistic regression, it tries to predict labels y from x.

Tries to find
\[
\hat{f} = \underset{f \in \mathcal{C}}{\operatorname{argmin}} \; \sum_{i=1}^n \left(Y_i - f(X_i)\right)^2,
\]
where \(\mathcal{C} = \{\text{``neural net functions''}\}\).


Neural networks: single layer

[Figure from Tan, Steinbach & Kumar, Introduction to Data Mining: an artificial neural network (ANN).]


Neural networks

Single layer neural nets

With p inputs, there are p + 1 weights:
\[
f(x_1, \dots, x_p) = S(\alpha + x^T \beta).
\]

Here, S is a sigmoid function like the logistic curve,
\[
S(\eta) = \frac{e^\eta}{1 + e^\eta}.
\]

The algorithm must estimate (α, β) from the data.

The classifier assigns +1 to anything with S(α + xᵀβ) > t, for some threshold t such as 0.5.
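A minimal R sketch of this rule (the values of alpha and beta here are made up for illustration, not estimated):

    # Logistic sigmoid S(eta) = e^eta / (1 + e^eta)
    S <- function(eta) exp(eta) / (1 + exp(eta))

    # Single-layer classifier: assign +1 when S(alpha + x'beta) > t, else -1
    predict_single_layer <- function(X, alpha, beta, t = 0.5) {
      ifelse(S(alpha + X %*% beta) > t, +1, -1)
    }

    # Made-up weights on two features
    X <- matrix(rnorm(20), ncol = 2)
    predict_single_layer(X, alpha = 0.1, beta = c(1, -2))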


Neural networks: double layer

[Figure: a network with 5 inputs, one hidden layer of 3 units, and a single output.]

Neural networks

Double layer neural nets

In the previous figure, there are 15 weights and 3 intercept terms (α's) for the input → hidden layer.

There are 3 weights and one intercept term for the hidden → output layer.

The outputs from the hidden layer have the form
\[
S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right), \quad 1 \le j \le 3.
\]

So, the final output is of the form
\[
S\left(\alpha_0 + \sum_{j=1}^{3} v_j \, S\left(\alpha_j + \sum_{i=1}^{5} w_{ij} x_i\right)\right).
\]
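A sketch of this forward pass in R, with the dimensions from the figure (5 inputs, 3 hidden units); in practice the parameter values would come from the fitting step described next:

    S <- function(eta) exp(eta) / (1 + exp(eta))

    # x: length-5 input; W: 5 x 3 matrix of the w_ij; a: the 3 hidden
    # intercepts alpha_j; v: the 3 hidden-to-output weights; a0: the
    # output intercept alpha_0.
    two_layer_output <- function(x, W, a, v, a0) {
      hidden <- S(a + as.vector(t(W) %*% x))  # S(alpha_j + sum_i w_ij x_i)
      S(a0 + sum(v * hidden))                 # outer sigmoid of the weighted sum
    }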


Neural networks

Single layer neural nets

Fitting this model is done via nonlinear least squares.

This is computationally difficult due to the highly nonlinear form of the regression function.
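One rarely codes this optimization by hand; for instance, the nnet package in R fits single-hidden-layer networks of this kind by BFGS. A sketch on simulated data:

    library(nnet)

    # Simulated two-class data
    set.seed(1)
    sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    sim$y <- factor(ifelse(sim$x1 + sim$x2 + rnorm(100) > 0, "+1", "-1"))

    # One hidden layer with 3 units, fit by numerical optimization
    fit <- nnet(y ~ x1 + x2, data = sim, size = 3, maxit = 500, trace = FALSE)
    table(predict(fit, newdata = sim, type = "class"), sim$y)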


Support vector machines

Support vector machine

Another classifier (or regression technique) for two-class problems.

Like logistic regression, it tries to predict labels y from x using a linear function.

A linear function with an intercept α determines an affine function f_{α,β}(x) = xᵀβ + α and a hyperplane
\[
H_{\alpha,\beta} = \left\{x : x^T \beta + \alpha = 0\right\}.
\]

A good classifier will "separate the classes" into the sides where f_{α,β}(x) > 0 and f_{α,β}(x) < 0.

The SVM seeks the hyperplane that "separates the classes the most."


Support vector machine

[A sequence of figures from Tan, Steinbach & Kumar, Introduction to Data Mining:
- Find a linear hyperplane (decision boundary) that separates the data.
- One possible solution.
- Another possible solution.
- Other possible solutions.
- Which one is better, B1 or B2? How do you define "better"?
- Find the hyperplane that maximizes the margin: B1 is better than B2.]


Support vector machines

Support vector machine

The reciprocal of the maximum margin can be written as
\[
\inf_{\beta, \alpha} \|\beta\|_2 \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 \text{ for all } i.
\]

Therefore, the support vector machine problem is
\[
\underset{\beta, \alpha}{\text{minimize}} \; \|\beta\|_2 \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 \text{ for all } i.
\]
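In R, this can be fit with the svm function from the e1071 package (the same function mentioned later for kernels); a very large cost approximately enforces the hard-margin constraints. A sketch on simulated, separable data:

    library(e1071)

    # Simulated, linearly separable two-class data
    set.seed(1)
    d <- data.frame(x1 = c(rnorm(20, 2), rnorm(20, -2)),
                    x2 = c(rnorm(20, 2), rnorm(20, -2)),
                    y  = factor(rep(c("+1", "-1"), each = 20)))

    # Linear kernel; large cost approximates the hard-margin problem
    fit <- svm(y ~ x1 + x2, data = d, kernel = "linear", cost = 1e5, scale = FALSE)
    fit$SV  # the support vectors: points achieving y_i (x_i' beta + alpha) = 1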


Support vector machine

[Figure from Tan, Steinbach & Kumar: what if the problem is not linearly separable?]


Support vector machines

Non-separable problems

If we cannot completely separate the classes, it is common to modify the problem to
\[
\underset{\beta, \alpha, \xi}{\text{minimize}} \; \|\beta\|_2 \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \sum_{i=1}^n \xi_i \le C,
\]
or the Lagrange version
\[
\underset{\beta, \alpha, \xi}{\text{minimize}} \; \|\beta\|_2^2 + \gamma \sum_{i=1}^n \xi_i \quad \text{subject to } y_i (x_i^T \beta + \alpha) \ge 1 - \xi_i, \quad \xi_i \ge 0.
\]
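In e1071::svm, the cost argument plays the role of γ in the Lagrange version: larger cost penalizes slack more heavily. Reusing the simulated data d from the previous sketch:

    # Softer vs. stiffer penalties on the slack variables
    fit_soft  <- svm(y ~ x1 + x2, data = d, kernel = "linear", cost = 0.1)
    fit_stiff <- svm(y ~ x1 + x2, data = d, kernel = "linear", cost = 100)
    c(soft = sum(fit_soft$nSV), stiff = sum(fit_stiff$nSV))  # support vector counts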


Support vector machines

Non-separable problems

The ξᵢ's can be removed from this problem, yielding
\[
\underset{\beta, \alpha}{\text{minimize}} \; \|\beta\|_2^2 + \gamma \sum_{i=1}^n \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+,
\]
where (z)₊ = max(z, 0) is the positive-part function. Or, equivalently,
\[
\underset{\beta, \alpha}{\text{minimize}} \; \sum_{i=1}^n \left(1 - y_i f_{\alpha,\beta}(x_i)\right)_+ + \lambda \|\beta\|_2^2.
\]
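This unconstrained form is easy to transcribe directly. A naive sketch that minimizes the penalized hinge loss with optim; this is not how svm actually solves the problem, but it makes the objective concrete (reusing d from above):

    # Penalized hinge loss: sum (1 - y f(x))_+ + lambda ||beta||^2
    hinge_obj <- function(par, X, y, lambda) {
      alpha <- par[1]; beta <- par[-1]
      f <- alpha + X %*% beta
      sum(pmax(1 - y * f, 0)) + lambda * sum(beta^2)
    }

    # y coded as +1/-1; X an n x p matrix
    X <- as.matrix(d[, c("x1", "x2")])
    y <- ifelse(d$y == "+1", 1, -1)
    opt <- optim(rep(0, 3), hinge_obj, X = X, y = y, lambda = 1)
    opt$par  # (alpha, beta_1, beta_2)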


Logistic vs. SVM

[Figure: the logistic loss and the SVM hinge loss plotted on the same axes.]


Iris data: sepal.length, sepal.width

[Figure: the iris data, sepal.length vs. sepal.width.]

[Two further figures of the iris data, sepal.length vs. sepal.width.]

Support vector machines

Kernel trick

The term ‖β‖₂² can be thought of as a complexity penalty on the function f_{α,β}, so we could write
\[
\underset{f_\beta, \alpha}{\text{minimize}} \; \sum_{i=1}^n \left(1 - y_i (\alpha + f_\beta(x_i))\right)_+ + \lambda \|f_\beta\|_2^2.
\]

We could also add such a complexity penalty to logistic regression:
\[
\underset{f_\beta, \alpha}{\text{minimize}} \; \sum_{i=1}^n \mathrm{DEV}(\alpha + f_\beta(x_i), y_i) + \lambda \|f_\beta\|_2^2.
\]

So, in both the SVM and penalized logistic regression, we are looking for the "best" linear function, where "best" is determined by either the logistic loss (deviance) or the hinge loss (SVM).


Support vector machines

Kernel trick

If the decision boundary is nonlinear, one might look for functions other than linear ones.

There are classes of functions called Reproducing Kernel Hilbert Spaces (RKHS) for which it is theoretically (and computationally) possible to solve
\[
\underset{f \in \mathcal{H}}{\text{minimize}} \; \sum_{i=1}^n \left(1 - y_i (\alpha + f(x_i))\right)_+ + \lambda \|f\|_\mathcal{H}^2
\]
or
\[
\underset{f \in \mathcal{H}}{\text{minimize}} \; \sum_{i=1}^n \mathrm{DEV}(\alpha + f(x_i), y_i) + \lambda \|f\|_\mathcal{H}^2.
\]

The svm function in R allows some such "kernels" for the SVM.
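For example, a radial (Gaussian) kernel fit on the simulated data d from earlier; note that svm's gamma argument is the kernel width, unrelated to the slack penalty γ above:

    # Nonlinear decision boundary via the radial basis kernel
    fit_rbf <- svm(y ~ x1 + x2, data = d, kernel = "radial", gamma = 1, cost = 1)
    table(predict(fit_rbf), d$y)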


Iris data: sepal.length, sepal.width

[A sequence of figures: the iris data, sepal.length vs. sepal.width, with fitted decision boundaries.]


Ensemble methods

[Figure from Tan, Steinbach & Kumar: the general idea of ensemble methods.]


Ensemble methods

Why use ensemble methods?

Basic idea: weak learners / classifiers are noisy learners.

Aggregating over a large number of weak learners should improve the output.

This assumes some "independence" across the weak learners; otherwise aggregation does nothing. That is, if every learner is the same, then any sensible aggregation technique should simply return that classifier.

Ensemble methods are often some of the best classifiers in terms of performance.

Downside: they lose interpretability in the aggregation stage.


Ensemble methods

Bagging

In this method, one takes several bootstrap samples (samples with replacement) of the data.

For each bootstrap sample S_b, 1 ≤ b ≤ B, fit a model, retaining the classifier f^{*,b}.

After all the models have been fit, classify by majority vote:
\[
\hat{f}(x) = \text{majority vote of } \left(f^{*,b}(x)\right)_{1 \le b \le B}.
\]
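A sketch of bagged classification trees in R, with rpart as the base learner (B = 100 is an arbitrary choice):

    library(rpart)

    set.seed(1)
    B <- 100
    n <- nrow(iris)

    # One tree per bootstrap sample
    trees <- lapply(1:B, function(b) {
      idx <- sample(n, n, replace = TRUE)
      rpart(Species ~ Sepal.Length + Sepal.Width, data = iris[idx, ])
    })

    # Majority vote across the B trees
    votes  <- sapply(trees, function(tr)
      as.character(predict(tr, newdata = iris, type = "class")))
    bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
    mean(bagged != iris$Species)  # training (not OOB) error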


Ensemble methods

Bagging

Approximately 1/3 of the data is not selected in each bootstrap sample (recall the 0.632 rule).

Using this observation, we can define the out-of-bag (OOB) estimate of accuracy, based on the OOB prediction
\[
\hat{f}_{\mathrm{OOB}}(x_i) = \text{majority vote of } \left(f^b(x_i)\right)_{b \in B(i)},
\]
where B(i) = {all trees not built with x_i}. The total OOB error is
\[
\mathrm{OOB} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\left(\hat{f}_{\mathrm{OOB}}(x_i) \ne y_i\right).
\]
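The randomForest package computes this automatically: predict called on the fitted forest without new data returns the OOB predictions. On the iris data:

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = iris)

    oob_pred <- predict(rf)           # OOB predictions (no newdata argument)
    mean(oob_pred != iris$Species)    # total OOB error, as defined above
    rf$err.rate[rf$ntree, "OOB"]      # the same quantity tracked during fitting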


Ensemble methods

Random forests: bagging trees

If the weak learners are trees, this method is called a Random Forest.

Some variations on random forests inject more randomness than just bagging. For example:
- taking a random subset of the features;
- taking random linear combinations of the features.


Iris data: sepal.length, sepal.width using randomForest

[Figure: random forest fit to the iris data.]


Ensemble methods

Boosting

Unlike bagging, boosting does not insert randomization into the algorithm.

Boosting works by iteratively reweighting each observation and refitting a given classifier.

To begin, suppose we have a classification algorithm that accepts case weights w = (w₁, ..., wₙ),
\[
\hat{f}_w = \underset{f}{\operatorname{argmin}} \; \sum_{i=1}^n w_i \, L(y_i, f(x_i)),
\]
and returns labels in {+1, −1}.


Ensemble methods

Boosting algorithm (AdaBoost.M1)

Step 0: Initialize the weights w_i^{(0)} = 1/n, 1 ≤ i ≤ n.

Step 1: For 1 ≤ t ≤ T, fit a classifier with weights w^{(t−1)}, yielding
\[
g^{(t)} = \hat{f}_{w^{(t-1)}} = \underset{f}{\operatorname{argmin}} \; \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)).
\]

Step 2: Compute the total error rate and the classifier weight
\[
\mathrm{err}^{(t)} = \frac{\sum_{i=1}^n w_i^{(t-1)} \mathbf{1}\left(y_i \ne g^{(t)}(x_i)\right)}{\sum_{i=1}^n w_i^{(t-1)}},
\qquad
\alpha^{(t)} = \log\left(\frac{1 - \mathrm{err}^{(t)}}{\mathrm{err}^{(t)}}\right).
\]


Step 3: Produce the new weights
\[
w_i^{(t)} = w_i^{(t-1)} \cdot \exp\left(\alpha^{(t)} \cdot \mathbf{1}\left(y_i \ne g^{(t)}(x_i)\right)\right).
\]

Step 4: Repeat steps 1-3 until termination.

Step 5: Output the classifier
\[
\hat{f}(x) = \operatorname{sign}\left(\sum_{t=1}^T \alpha^{(t)} g^{(t)}(x)\right).
\]
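A compact sketch of AdaBoost.M1 in R, using depth-one rpart trees (stumps) as the weighted base classifier; the helper names and the choice of stumps are ours, not from the slides:

    library(rpart)

    # y must be coded +1/-1; n_iter plays the role of T.
    # Assumes 0 < err < 1 at every step (otherwise alpha is infinite).
    adaboost_m1 <- function(X, y, n_iter = 50) {
      n <- length(y)
      w <- rep(1 / n, n)                      # Step 0: uniform weights
      d <- data.frame(X, y = factor(y))
      fits <- vector("list", n_iter)
      alpha <- numeric(n_iter)
      for (t in 1:n_iter) {
        # Step 1: fit the weighted base classifier (a stump)
        fits[[t]] <- rpart(y ~ ., data = d, weights = w,
                           control = rpart.control(maxdepth = 1))
        g <- as.numeric(as.character(predict(fits[[t]], d, type = "class")))
        # Step 2: weighted error rate and classifier weight
        err      <- sum(w * (g != y)) / sum(w)
        alpha[t] <- log((1 - err) / err)
        # Step 3: upweight the misclassified observations
        w <- w * exp(alpha[t] * (g != y))
      }
      list(fits = fits, alpha = alpha)
    }

    # Step 5: sign of the alpha-weighted vote
    predict_adaboost <- function(model, newdata) {
      G <- sapply(seq_along(model$fits), function(t)
        as.numeric(as.character(predict(model$fits[[t]], newdata, type = "class"))))
      sign(G %*% model$alpha)
    }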


Ensemble methods

[Figure from Tan, Steinbach & Kumar: illustrating AdaBoost; the training data points and the initial weights for each data point.]


Ensemble methods

[Figure from Tan, Steinbach & Kumar: illustrating AdaBoost, continued.]


Ensemble methods

Boosting as gradient descent

It turns out that boosting can be thought of as something like gradient descent.

Above, we assumed we were solving
\[
\underset{f}{\operatorname{argmin}} \; \sum_{i=1}^n w_i^{(t-1)} L(y_i, f(x_i)),
\]
but we didn't say what type of f we considered.


Ensemble methods

Boosting as gradient descent

Call the space of all possible classifiers above F₀, and the set of all linear combinations of such classifiers
\[
\mathcal{F} = \left\{ \sum_j \alpha_j f_j : f_j \in \mathcal{F}_0 \right\}.
\]

In some sense, the boosting algorithm is a "steepest descent" algorithm to find
\[
\underset{f \in \mathcal{F}}{\operatorname{argmin}} \; \sum_{i=1}^n L(y_i, f(x_i)).
\]

For more details, see ESL, but we will discuss some of these ideas. In R, this idea is implemented in the gbm package.
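A sketch with gbm, using the binomial deviance as the loss, on the iris problem that appears in the next few slides ({Versicolor} vs. {Setosa, Virginica}); the tuning values are illustrative:

    library(gbm)

    # 0/1-coded response: versicolor vs. the rest
    d_iris <- data.frame(y  = as.numeric(iris$Species == "versicolor"),
                         x1 = iris$Sepal.Length,
                         x2 = iris$Sepal.Width)

    fit <- gbm(y ~ x1 + x2, data = d_iris, distribution = "bernoulli",
               n.trees = 1000, interaction.depth = 1, shrinkage = 0.01)
    head(predict(fit, newdata = d_iris, n.trees = 1000, type = "response"))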


Ensemble methods

Boosting as gradient descent

Consider the problem
\[
\underset{f \in \mathcal{F}}{\text{minimize}} \; \sum_{i=1}^n L(y_i, f(x_i)) = \mathcal{L}(f).
\]

This can be solved by "gradient descent," where the "gradient" ∇L(f) is an element of F₀ giving the direction of steepest descent for some stepsize α:
\[
\underset{g \in \mathcal{F}_0}{\operatorname{argmin}} \; \mathcal{L}(f + \alpha \cdot g).
\]


Ensemble methods

Boosting as gradient descent

We might also optimize the stepsize:
\[
\underset{g \in \mathcal{F}_0, \, \alpha}{\operatorname{argmin}} \; \mathcal{L}(f + \alpha \cdot g).
\]


Ensemble methods

Boosting as gradient descent

In this context, the forward stagewise algorithm for steepest descent is:

1. Initialize: f^{(0)}(x) = 0.
2. For t = 1, ..., T, solve
\[
(\alpha^{(t)}, g^{(t)}) = \underset{\alpha, \, g \in \mathcal{F}_0}{\operatorname{argmin}} \; \mathcal{L}(f^{(t-1)} + \alpha \cdot g)
\]
and update f^{(t)} = f^{(t−1)} + α^{(t)} · g^{(t)}.
3. Return the final classifier f^{(T)}.

When L(y, f(x)) = e^{−y·f(x)}, this corresponds to AdaBoost.M1.

Changing the loss function yields new gradient boosting algorithms.
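In gbm, the loss is chosen through the distribution argument; in particular, distribution = "adaboost" uses the exponential loss above. Reusing d_iris from the earlier sketch:

    # Exponential loss: the AdaBoost.M1 case of gradient boosting
    fit_ada <- gbm(y ~ x1 + x2, data = d_iris, distribution = "adaboost",
                   n.trees = 1000, interaction.depth = 1, shrinkage = 0.01)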


Iris data: sepal.length, sepal.width, 1000 trees

[Figure: {Versicolor} vs. {Setosa, Virginica}.]

Iris data: sepal.length, sepal.width, 5000 trees

[Figure: {Versicolor} vs. {Setosa, Virginica}.]

Iris data: sepal.length, sepal.width, 10000 trees

[Figure: {Versicolor} vs. {Setosa, Virginica}.]


Boosting fits an additive model (by default)

[Two figures illustrating the additive fit.]


Out-of-bag error improvement

[Figure.]