
Artificial Intelligence

Machine Learning

Ensemble Methods

Outline

1. Majority voting

2. Boosting

3. Bagging

4. Random Forests

5. Conclusion

Majority Voting

• A randomly chosen hyperplane has an expected error of 0.5.

• Many random hyperplanes combined by majority vote will still be random.

• Suppose we have m classifiers, each performing slightly better than random, that is, with error = 0.5 − ε.

• Combine: make a decision based on majority vote?

• What if we combined these m slightly-better-than-random classifiers? Would majority vote be a good choice?
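
As a quick sanity check, here is a minimal simulation sketch (not from the slides; m = 101 and ε = 0.1 are illustrative choices): when the classifiers really are independent and each is correct with probability 0.5 + ε, the majority-vote error falls far below the individual error.

    import numpy as np

    rng = np.random.default_rng(0)
    m, eps, trials = 101, 0.1, 10_000

    # Each of the m classifiers is independently correct with probability 0.5 + eps.
    correct = rng.random((trials, m)) < (0.5 + eps)

    # The majority vote is correct whenever more than half of the classifiers are.
    majority_correct = correct.sum(axis=1) > m / 2

    print("individual error:", 0.5 - eps)
    print("majority-vote error:", 1 - majority_correct.mean())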

Condorcet’s Jury Theorem

Marquis de Condorcet, Application of Analysis to the Probability of Majority Decisions, 1785.

Assumptions:

1. Each individual makes the right choice with a probability p.

2. The votes are independent.

If p > 0.5, then adding more voters increases the probability that the majority decision is correct. If p < 0.5, then adding more voters makes things worse.
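
The theorem is easy to check numerically with the binomial distribution. Below is a minimal sketch (standard library only; the values of p and the jury sizes are illustrative) computing the probability that a strict majority of N independent voters is correct.

    from math import comb

    def majority_correct_prob(N, p):
        """P(a strict majority of N independent voters is correct), for odd N."""
        return sum(comb(N, k) * p**k * (1 - p)**(N - k)
                   for k in range(N // 2 + 1, N + 1))

    for p in (0.4, 0.6):
        print(p, [round(majority_correct_prob(N, p), 3) for N in (1, 11, 101)])

With p = 0.6 the probability rises from roughly 0.6 to 0.75 to 0.98 as the jury grows from 1 to 11 to 101 voters; with p = 0.4 it falls from roughly 0.4 to 0.25 to 0.02.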

Ensemble Methods

• An Ensemble Method combines the predictions of many individual classifiers by majority voting.

• Such individual classifiers, called weak learners, are required to perform slightly better than random.

How do we produce independent weak learners using the same training data?

• Use a strategy to obtain relatively independent weak learners!

• Different methods:

1. Boosting

2. Bagging

3. Random Forests

Boosting

• First ensemble method.

• One of the most powerful Machine Learning methods.

• Popular algorithm: AdaBoost.M1 (Freund and Schapire, 1997).

• “Best off-the-shelf classifier in the world”, Breiman (CART’s inventor), 1998.

• Simple algorithm.

• Weak learners can be trees, perceptrons, decision stumps, etc.

• Idea: train the weak learners on weighted training examples.

Boosting

• The predictions from all of the G_m, m ∈ {1, · · · , M}, are combined with a weighted majority vote.

• α_m is the contribution of each weak learner G_m.

• It is computed by the boosting algorithm to give a weighted importance to the classifiers in the sequence.

• The decision of a highly-performing classifier in the sequence should weigh more than that of less accurate classifiers in the sequence.

• This is captured in:

G(x) = \mathrm{sign}\left[ \sum_{m=1}^{M} \alpha_m G_m(x) \right]

Boosting

The error rate on the training sample:

\mathrm{err} := \frac{1}{n} \sum_{i=1}^{n} 1\{ y_i \neq G(x_i) \}

The error rate of each weak learner:

\mathrm{err}_m := \frac{\sum_{i=1}^{n} w_i \, 1\{ y_i \neq G_m(x_i) \}}{\sum_{i=1}^{n} w_i}

Intuition:

- Give large weights to hard examples.

- Give small weights to easy examples.

Boosting

For each weak learner m, we associate an error \mathrm{err}_m and a weight:

\alpha_m = \log\left( \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} \right)
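
To make the role of this weight concrete, here is a tiny sketch (the error values are illustrative only): weak learners barely better than chance receive a weight near zero, while accurate ones receive a large weight.

    import math

    # alpha_m = log((1 - err_m) / err_m), evaluated at a few illustrative error rates.
    for err_m in (0.45, 0.30, 0.10, 0.01):
        alpha_m = math.log((1 - err_m) / err_m)
        print(f"err_m = {err_m:.2f}  ->  alpha_m = {alpha_m:.2f}")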

AdaBoost

1. Initialize the example weights w_i = 1/n, for i = 1, · · · , n.

2. For m = 1 to M (number of weak learners):

(a) Fit a classifier G_m(x) to the training data using the weights w_i.

(b) Compute

\mathrm{err}_m := \frac{\sum_{i=1}^{n} w_i \, 1\{ y_i \neq G_m(x_i) \}}{\sum_{i=1}^{n} w_i}

(c) Compute

\alpha_m = \log\left( \frac{1 - \mathrm{err}_m}{\mathrm{err}_m} \right)

(d) Update w_i \leftarrow w_i \cdot \exp[\, \alpha_m \cdot 1( y_i \neq G_m(x_i) ) \,], for i = 1, · · · , n.

3. Output

G(x) = \mathrm{sign}\left[ \sum_{m=1}^{M} \alpha_m G_m(x) \right]
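
The steps above translate almost line for line into code. The following is a minimal sketch, not a reference implementation; using scikit-learn decision stumps as the weak learners G_m, and labels in {−1, +1}, are assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=100):
        """AdaBoost.M1 sketch; y is assumed to take values in {-1, +1}."""
        n = len(y)
        w = np.full(n, 1.0 / n)                       # 1. uniform example weights
        learners, alphas = [], []
        for m in range(M):                            # 2. loop over weak learners
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)          # (a) fit using the weights w_i
            miss = (stump.predict(X) != y).astype(float)
            err_m = np.dot(w, miss) / w.sum()         # (b) weighted error rate
            err_m = np.clip(err_m, 1e-10, 1 - 1e-10)  # guard against log(0)
            alpha_m = np.log((1 - err_m) / err_m)     # (c) weight of this learner
            w = w * np.exp(alpha_m * miss)            # (d) up-weight the hard examples
            learners.append(stump)
            alphas.append(alpha_m)
        return learners, alphas

    def adaboost_predict(X, learners, alphas):
        """3. Output: sign of the alpha-weighted vote."""
        votes = sum(a * g.predict(X) for a, g in zip(alphas, learners))
        return np.sign(votes)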

Digression: Decision Stumps

This is an example of a very weak classifier: a simple 2-terminal-node decision tree for binary classification.

f(x_i \mid j, t) =
\begin{cases}
+1 & \text{if } x_{ij} > t \\
-1 & \text{otherwise}
\end{cases}

where j ∈ {1, · · · , d}.

Example: a dataset with 10 features, 2,000 training examples and 10,000 test examples.
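
A stump is just a threshold test on a single feature, so it fits in a few lines. This is a minimal sketch; the feature index j and threshold t in the usage lines are arbitrary illustrative values.

    import numpy as np

    def stump_predict(X, j, t):
        """f(x | j, t): +1 if feature j exceeds threshold t, -1 otherwise."""
        return np.where(X[:, j] > t, 1, -1)

    # Usage on random data with d = 10 features: split on feature 0 at t = 0.0.
    X = np.random.randn(5, 10)
    print(stump_predict(X, j=0, t=0.0))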

AdaBoost Performance

[Figure 10.2, “Boosting and Additive Trees”: test error rate for boosting with stumps on simulated data, as a function of the number of boosting iterations; also shown are the test error rates for a single stump and a 244-node classification tree.]

Error rates:

- Random: 50%.

- Stump: 45.8%.

- Large classification tree: 24.7%.

- AdaBoost with stumps: 5.8% after 400 iterations!
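
The same qualitative effect can be reproduced with scikit-learn's AdaBoostClassifier. The synthetic dataset below is only a stand-in for the simulated data behind the figure: the 10 features, the 2,000/10,000 split and the 400 stump iterations come from the slide, everything else is an assumption.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in: 10 features, 2,000 training and 10,000 test examples.
    X, y = make_classification(n_samples=12000, n_features=10, n_informative=10,
                               n_redundant=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=2000,
                                              test_size=10000, random_state=0)

    # 400 boosting iterations over decision stumps.
    # (In older scikit-learn versions the parameter is named base_estimator.)
    ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=400)
    ada.fit(X_tr, y_tr)
    print("test error:", 1 - ada.score(X_te, y_te))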


AdaBoost with decision stumps leads to a form of feature selection.

AdaBoost with Decision Stumps

[Figure 10.6, “Boosting and Additive Trees”: predictor variable importance spectrum for the spam data; the variable names are written on the vertical axis, with relative importance from 0 to 100 on the horizontal axis.]

Bagging & Bootstrapping

• Bootstrap is a re-sampling technique ≡ sampling from the empirical distribution.

• It aims to improve the quality of estimators.

• Bagging and Boosting are based on bootstrapping.

• Both use re-sampling to generate weak learners for classification.

• Strategy: randomly distort the data by re-sampling.

• Train weak learners on the re-sampled training sets.

• Bootstrap aggregation ≡ Bagging.
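
A bootstrap sample is simply n draws with replacement from the n training examples. A minimal NumPy sketch follows; drawing a sample of the same size as the training set is a common convention and an assumption here.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_sample(X, y):
        """Sample len(y) indices with replacement from the empirical distribution."""
        idx = rng.integers(0, len(y), size=len(y))
        return X[idx], y[idx]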

Bagging

Training: for b = 1, · · · , B

1. Draw a bootstrap sample B_b of size ℓ from the training data.

2. Train a classifier f_b on B_b.

Classification: classify by majority vote among the B trees:

f_{\mathrm{avg}} := \frac{1}{B} \sum_{b=1}^{B} f_b(x)
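
A minimal bagging sketch over decision trees is shown below; the bootstrap sample size ℓ is taken equal to the training-set size, and labels in {−1, +1} are assumed. scikit-learn's BaggingClassifier packages the same idea.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, B=50, seed=0):
        """Train B trees, each on its own bootstrap sample of the training data."""
        rng = np.random.default_rng(seed)
        n = len(y)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)                 # bootstrap sample B_b
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagging_predict(X, trees):
        """Majority vote among the B trees, via the sign of the average prediction."""
        f_avg = np.mean([t.predict(X) for t in trees], axis=0)
        return np.sign(f_avg)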

Bagging

Bagging works well for trees:

[Figure 8.9, “Model Inference and Averaging”: bagging trees on a simulated dataset. The top-left panel shows the original tree; the remaining panels show eleven trees grown on bootstrap samples (b = 1, · · · , 11), with the top split annotated for each tree.]

Random Forests

1. Random forests modify bagging with trees to reduce the correlation between the trees.

2. Ordinary tree training optimizes each split over all dimensions.

3. In a random forest, a different subset of dimensions is chosen at each split. The number of dimensions chosen is m.

4. The optimal split is chosen within that subset.

5. The subset is chosen at random out of all dimensions 1, . . . , d.

6. Recommended: m = √d or smaller.
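
For reference, scikit-learn's RandomForestClassifier implements this scheme; max_features="sqrt" corresponds to m = √d. The synthetic data below is purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Illustrative synthetic data with d = 10 features.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # max_features="sqrt" considers m = sqrt(d) candidate dimensions at each split.
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
    rf.fit(X_tr, y_tr)
    print("test accuracy:", rf.score(X_te, y_te))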

Credit

• “An Introduction to Statistical Learning, with Applications in R” (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
