
Page 1: Machine Learning Lecture 8: Ensemble Methods

1

Machine Learning Lecture 8: Ensemble Methods

Moshe Koppel

Slides adapted from Raymond J. Mooney and others

Page 2: Machine Learning Lecture 8: Ensemble Methods

2

General Idea

(Diagram) From the original training data D:
Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt.
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct.
Step 3: Combine the classifiers into a single classifier C*.

Page 3: Machine Learning Lecture 8: Ensemble Methods

3

Value of Ensembles

• When combining multiple independent and diverse decisions, each of which is more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.

• Human ensembles are demonstrably better:
– How many jelly beans are in the jar? Individual estimates vs. the group average.
– Who Wants to Be a Millionaire: expert friend vs. audience vote.

Page 4: Machine Learning Lecture 8: Ensemble Methods

4

Why does it work?

• Suppose there are 25 base classifiers:
– Each classifier has error rate ε = 0.35.
– Assume the classifiers are independent.
– The majority vote is wrong only if at least 13 of the 25 classifiers are wrong, so the probability that the ensemble classifier makes a wrong prediction is

$$P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}(1-\varepsilon)^{25-i} \approx 0.06$$

This value approaches 0 as the number of classifiers increases. This is just the law of large numbers but in this context is sometimes called “Condorcet’s Jury Theorem”.
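A quick numerical check of this figure (a minimal sketch; n = 25 and ε = 0.35 are the values from the slide):

    from math import comb

    n, eps = 25, 0.35
    # The majority vote is wrong only if at least 13 of the 25 classifiers are wrong.
    p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
    print(round(p_wrong, 3))   # ~0.06, far below the 0.35 error of any single classifier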

Page 5: Machine Learning Lecture 8: Ensemble Methods

5

Generating Multiple Models

• Use a single, arbitrary learning algorithm, but manipulate the training data so that it learns multiple models:
– Data1, Data2, …, Data m
– Learner1 = Learner2 = … = Learner m

• Different methods for changing the training data:
– Bagging: resample the training data
– Boosting: reweight the training data
– DECORATE: add additional artificial training data

• In WEKA, these are called meta-learners: they take a learning algorithm as an argument (the base learner); see the scikit-learn sketch after this list for the analogous idea.

• Other approaches to generating multiple models:
– Vary some random parameter in the learner, e.g., the order of examples
– Vary the choice of feature set: Random Forests
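For concreteness, here is the same meta-learner idea in scikit-learn (a hedged sketch, not part of the original slides; WEKA's meta-learners are used analogously). Note that scikit-learn releases before 1.2 name the keyword base_estimator rather than estimator.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Both meta-learners take a base learning algorithm as an argument.
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=25)
    boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                  n_estimators=50)

    bagging.fit(X, y)
    boosting.fit(X, y)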

Page 6: Machine Learning Lecture 8: Ensemble Methods

6

Bagging

• Create ensembles by repeatedly and randomly resampling the training data (Breiman, 1996).

• Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement.
– Each bootstrap sample will on average contain about 63.2% of the unique training examples; the rest are replicates.

• Combine the m resulting models using a simple majority vote.

• Decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.
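A minimal sketch of this procedure, assuming a generic sklearn-style base learner with fit/predict methods and labels in {-1, +1} (the function names here are illustrative, not from the slides):

    import numpy as np

    def bagging_fit(X, y, base_learner, m=25, rng=None):
        """Train m models, each on a bootstrap sample of size n drawn with replacement."""
        if rng is None:
            rng = np.random.default_rng(0)
        n = len(X)
        models = []
        for _ in range(m):
            idx = rng.integers(0, n, size=n)                  # bootstrap sample
            models.append(base_learner().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Combine the m models by simple (unweighted) majority vote."""
        votes = np.stack([model.predict(X) for model in models])   # shape (m, n_samples)
        return np.sign(votes.sum(axis=0))                     # assumes labels in {-1, +1}

The 63.2% figure is the expected fraction of unique examples in a bootstrap sample, 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows.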

Page 7: Machine Learning Lecture 8: Ensemble Methods

7

Bagging

• Base learner = single-level binary tree (a “stump”)
• True rule: x < 0.35 or x > 0.75
• No single stump is perfect
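To make the setting concrete, a decision stump here is a single threshold test on x. The sketch below (the rule x < 0.35 or x > 0.75 is from the slide; the helper functions and the choice of positive class are illustrative assumptions) confirms that no single stump classifies the data perfectly:

    import numpy as np

    def true_label(x):
        # The rule from the slide, assumed to define the positive class.
        return np.where((x < 0.35) | (x > 0.75), 1, -1)

    def best_stump(x, y):
        """Try every threshold/direction and return the stump with the lowest training error."""
        best = (None, None, 1.0)                  # (threshold, direction, error)
        for t in np.unique(x):
            for direction in (1, -1):             # which side of t is predicted +1
                pred = np.where(x <= t, direction, -direction)
                err = float(np.mean(pred != y))
                if err < best[2]:
                    best = (t, direction, err)
        return best

    x = np.linspace(0.0, 1.0, 100)
    print(best_stump(x, true_label(x)))   # error stays well above 0: one threshold
                                          # cannot carve out both positive intervals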

Page 8: Machine Learning Lecture 8: Ensemble Methods

8

Bagging

Accuracy of ensemble classifier: 100%

Page 9: Machine Learning Lecture 8: Ensemble Methods

9

Bagging: Final Points

• Works well if the base classifiers are unstable

• Increased accuracy because it reduces variance as compared to any individual classifier

• Multiple models can be learned in parallel

Page 10: Machine Learning Lecture 8: Ensemble Methods

10

Boosting

• Originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990).

• Revised into a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).

• Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.

Page 11: Machine Learning Lecture 8: Ensemble Methods

11

Boosting: Basic Algorithm

• General loop:
Set all examples to have equal (uniform) weights.
For t from 1 to T:
Learn a hypothesis ht from the weighted examples.
Decrease the weights of the examples that ht classifies correctly.

• The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting.

• During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.

Page 12: Machine Learning Lecture 8: Ensemble Methods

12

AdaBoost Pseudocode

TrainAdaBoost(D, BaseLearn):
  For each example di in D, let its weight wi = 1/|D|.
  For t from 1 to T:
    Learn a hypothesis ht from the weighted examples: ht = BaseLearn(D).
    Calculate the error εt of ht as the total weight of the examples it classifies incorrectly.
    If εt > 0.5 then exit the loop, else continue.
    Let βt = εt / (1 - εt).
    Multiply the weights of the examples that ht classifies correctly by βt.
    Rescale the weights of all examples so that the total weight remains 1.
  Return H = {h1, …, hT}.

TestAdaBoost(ex, H):
  Let each hypothesis ht in H vote for ex's classification with weight log(1/βt).
  Return the class with the highest weighted vote total.
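A runnable sketch of this pseudocode, assuming a sklearn-style base learner whose fit accepts sample_weight and labels in {-1, +1} (names are illustrative; the eps == 0 guard is added to avoid an infinite vote weight and is not in the slide's pseudocode):

    import numpy as np

    def train_adaboost(X, y, base_learn, T=10):
        """AdaBoost as in the pseudocode above; y must be in {-1, +1}."""
        n = len(X)
        w = np.full(n, 1.0 / n)                   # wi = 1/|D|
        hypotheses, betas = [], []
        for _ in range(T):
            h = base_learn().fit(X, y, sample_weight=w)
            wrong = h.predict(X) != y
            eps = w[wrong].sum()                  # total weight of misclassified examples
            if eps > 0.5 or eps == 0:             # exit if no better than chance (or perfect)
                break
            beta = eps / (1 - eps)
            w = np.where(wrong, w, w * beta)      # down-weight correctly classified examples
            w /= w.sum()                          # rescale so the total weight remains 1
            hypotheses.append(h)
            betas.append(beta)
        return hypotheses, betas

    def test_adaboost(X, hypotheses, betas):
        """Each ht votes with weight log(1/beta_t); return the sign of the weighted vote."""
        vote = sum(np.log(1.0 / b) * h.predict(X) for h, b in zip(hypotheses, betas))
        return np.sign(vote)

A decision stump is the usual choice of weak base learner here.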

Page 13: Machine Learning Lecture 8: Ensemble Methods

13

AdaBoost Pseudocode

(Same pseudocode as the previous slide, with a note on the reweighting step.)

Before reweighting, the total weight of the correctly classified examples is 1 - εt, so after multiplying their weights by βt = εt / (1 - εt) their total weight becomes εt, the same as the total weight of the misclassified examples.
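As a quick worked instance of this bookkeeping (εt = 0.2 is an arbitrary illustrative value, not from the slides):

    eps_t = 0.2
    beta_t = eps_t / (1 - eps_t)              # 0.25
    correct_before = 1 - eps_t                # total weight of correct examples: 0.8
    correct_after = correct_before * beta_t   # 0.8 * 0.25 = 0.2, i.e. eps_t
    # Both groups now total 0.2; rescaling by 1 / (0.2 + 0.2) gives each group
    # half of the total weight, so the next hypothesis must do well on the
    # previously misclassified examples.
    print(correct_after)                      # 0.2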


Page 15: Machine Learning Lecture 8: Ensemble Methods

15

AdaBoost Pseudocode

(Same pseudocode as the previous slides, with a note on the voting weights.)

As εt → 1/2, the voting weight log(1/βt) → 0; as εt → 0, the voting weight → ∞. More accurate hypotheses therefore receive larger votes.

Page 16: Machine Learning Lecture 8: Ensemble Methods

16

Learning with Weighted Examples

• The generic approach is to replicate examples in the training set in proportion to their weights (e.g., 10 replicates of an example with weight 0.01 and 100 replicates of one with weight 0.1).

• Most algorithms can be enhanced to incorporate weights directly and efficiently in the learning algorithm, so that the effect is the same (e.g., learners that implement the WeightedInstancesHandler interface in WEKA).

• For decision trees, when calculating information gain, count example i by incrementing the corresponding count by wi rather than by 1 (see the sketch below).
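A sketch of the weighted counting for the entropy part of information gain (a minimal illustration, not WEKA's actual implementation):

    import numpy as np

    def weighted_entropy(labels, weights):
        """Entropy where each example contributes its weight instead of a count of 1."""
        totals = {}
        for y, w in zip(labels, weights):
            totals[y] = totals.get(y, 0.0) + w     # increment the class count by wi, not 1
        total = sum(totals.values())
        probs = np.array([c / total for c in totals.values()])
        return -(probs * np.log2(probs)).sum()

    # Unweighted counting is the special case where every weight is 1.
    print(weighted_entropy([1, 1, -1, -1], [0.4, 0.1, 0.3, 0.2]))

Information gain is then computed exactly as usual, but with these weighted class totals (and weighted split sizes) in place of raw counts.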

Page 17: Machine Learning Lecture 8: Ensemble Methods

17

Toy Example

Page 18: Machine Learning Lecture 8: Ensemble Methods

18

Final Classifier

Page 19: Machine Learning Lecture 8: Ensemble Methods

19

Training Error

Page 20: Machine Learning Lecture 8: Ensemble Methods

20

Experimental Results on Ensembles (Freund & Schapire, 1996; Quinlan, 1996)

• Ensembles have been used to improve generalization accuracy on a wide variety of problems.

• On average, Boosting provides a larger increase in accuracy than Bagging.

• Boosting on rare occasions can degrade accuracy.

• Bagging more consistently provides a modest improvement.

• Boosting is particularly subject to over-fitting when there is significant noise in the training data.

Page 21: Machine Learning Lecture 8: Ensemble Methods

21

DECORATE (Melville & Mooney, 2003)

• What if the training set is small, so that resampling and reweighting the training set have limited ability to generate diverse alternative hypotheses?

• Can generate multiple models by using all the training data and adding fake training examples with made-up labels in order to force variation in the learned models.

• Miraculously, this works.
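A much-simplified sketch of the idea on this slide (this is not the full DECORATE algorithm; the Gaussian feature sampling and the {-1, +1} labels are assumptions made for the illustration):

    import numpy as np

    def make_artificial_examples(X, ensemble_predict, n_artificial, rng=None):
        """Generate diversity-forcing training data: plausible feature vectors whose
        labels disagree with what the current ensemble predicts."""
        if rng is None:
            rng = np.random.default_rng(0)
        # Sample each (numeric) feature from a Gaussian fit to the training data.
        X_art = rng.normal(loc=X.mean(axis=0),
                           scale=X.std(axis=0) + 1e-9,
                           size=(n_artificial, X.shape[1]))
        # Label each artificial example with the class the ensemble would NOT predict,
        # so the next base model is pushed to differ from the current ensemble.
        y_art = -ensemble_predict(X_art)             # assumes labels in {-1, +1}
        return X_art, y_art

The next base model is then trained on the real training examples plus these artificial examples, as in the overview diagrams on the following slides.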

Page 22: Machine Learning Lecture 8: Ensemble Methods

22

Overview of DECORATE

(Diagram) The base learner is trained on the training examples plus artificial examples (with assigned labels), producing classifier C1, which is added to the current ensemble.

Page 23: Machine Learning Lecture 8: Ensemble Methods

23

Overview of DECORATE

(Diagram, continued) With C1 in the current ensemble, new artificial examples are generated; the base learner, trained on the training examples plus these artificial examples, produces C2, which is added to the ensemble.

Page 24: Machine Learning Lecture 8: Ensemble Methods

24

Overview of DECORATE

(Diagram, continued) With C1 and C2 in the current ensemble, the process repeats to produce C3.

Page 25: Machine Learning Lecture 8: Ensemble Methods

25

Another Approach: Random Forests

• Random forests (RF) are a combination of tree predictors

• Generate multiple trees by repeatedly choosing some random subset of all features

• Optionally, also use a random subset of the examples, as in bagging

Page 26: Machine Learning Lecture 8: Ensemble Methods

26

Random Forests Algorithm

Given a training set S:
For i = 1 to k:
  Build subset Si by sampling with replacement from S.
  Learn tree Ti from Si; at each node, choose the best split from a random subset of F features.
Make predictions according to the majority vote of the set of k trees.
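In practice this is rarely implemented by hand; a minimal sketch using scikit-learn's implementation (parameter names are scikit-learn's, not from the slide): max_features controls the size of the random feature subset tried at each node, and bootstrap controls the with-replacement sampling of examples.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=100,      # k trees
        max_features="sqrt",   # size of the random feature subset tried at each node
        bootstrap=True,        # each tree sees a with-replacement sample of the data
        n_jobs=-1,             # trees are independent, so they can be built in parallel
        random_state=0,
    ).fit(X, y)

    print(forest.predict(X[:5]))   # combined prediction of the 100 trees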

Page 27: Machine Learning Lecture 8: Ensemble Methods

27

Random Forests

• Like bagging, multiple models can be learned in parallel

• No need to get fancy; don’t bother pruning

• Designed for using decision trees as base learner, but can be generalized to other learning methods

Page 28: Machine Learning Lecture 8: Ensemble Methods

28

Issues in Ensembles

• Can generate multiple classifiers by:
– Resampling training data
– Reweighting training data
– Varying learning parameters
– Varying the feature set

• Boosting can’t be run in parallel and is very sensitive to noise, but (in the absence of noise) seems remarkably resistant to overfitting.

• Ability of ensemble methods to reduce variance depends on independence of learned classifiers