Ensemble Methods for Machine Learning


Page 1:

Ensemble Methods for Machine Learning

Page 2:

COMBINING CLASSIFIERS: ENSEMBLE APPROACHES

Page 3:

Common Ensemble classifiers

• Bagging/Random Forests
• “Bucket of models”
• Stacking
• Boosting

Page 4:

Ensemble classifiers we’ve studied so far

• Bagging
  – Build many bootstrap replicates of the data
  – Train many classifiers
  – Vote the results

Page 5:

Bagged trees to random forests

• Bagged decision trees:
  – Build many bootstrap replicates of the data
  – Train many tree classifiers
  – Vote the results
• Random forest:
  – Build many bootstrap replicates of the data
  – Train many tree classifiers (no pruning)
    • For each split, restrict to a random subset of the original feature set (usually a random subset of size sqrt(d), where there are d original features)
  – Vote the results
  – Like bagged decision trees, fast and easy to use (see the sketch below)
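A minimal sketch of the two recipes above, assuming scikit-learn is available; the synthetic dataset and hyperparameters are placeholders rather than anything from the slides. BaggingClassifier's default base estimator is a decision tree, and max_features="sqrt" gives the usual sqrt(d) rule.

```python
# Bagged trees vs. a random forest on a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# Bagging: bootstrap replicates of the data, one tree per replicate, vote the results.
bagged = BaggingClassifier(n_estimators=100, random_state=0)  # default base estimator is a tree

# Random forest: same recipe, but each split considers only a random subset of features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

for name, model in [("bagged trees", bagged), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=10).mean())
```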

Page 6:

A “bucket of models”

• Scenario:
  – I have three learners: Logistic Regression, Naïve Bayes, Backprop
  – I'm going to embed a learning machine in some system (e.g., a “Genius”-like music recommender)
    • Each system has a different user → a different dataset D
    • In pilot studies there's no single best system
    • Which learner do I embed to get the best result?
    • Can I combine all of them somehow and do better?

Page 7:

Simple learner combination: A “bucket of models”

• Input:
  – your top T favorite learners (or tunings): L1, …, LT
  – a dataset D
• Learning algorithm:
  – Use 10-CV to estimate the error of L1, …, LT
  – Pick the best (lowest 10-CV error) learner L*
  – Train L* on D and return its hypothesis h*
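A sketch of this recipe in Python, assuming scikit-learn; the three learners loosely mirror the earlier scenario (logistic regression, naïve Bayes, and an MLP standing in for backprop), and the dataset is a placeholder.

```python
# "Bucket of models": estimate 10-CV error for each learner, pick the best, retrain on all of D.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # the dataset D

learners = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "backprop (MLP)": MLPClassifier(max_iter=2000, random_state=0),
}

# 10-CV error estimate for each of L1, ..., LT.
cv_error = {name: 1 - cross_val_score(L, X, y, cv=10).mean() for name, L in learners.items()}

best = min(cv_error, key=cv_error.get)   # L*: the learner with the lowest 10-CV error
h_star = learners[best].fit(X, y)        # train L* on all of D and return its hypothesis h*
print(best, cv_error)
```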

Page 8:

Pros and cons of a “bucket of models”

• Pros:
  – Simple
  – Will give results not much worse than the best of the “base learners”
• Cons:
  – What if there's not a single best learner?
• Other approaches:
  – Vote the hypotheses (how would you weight them?)
  – Combine them some other way?
  – How about learning to combine the hypotheses?

Page 9:

Common Ensemble classifiers

• Bagging/Random Forests
• “Bucket of models”
• Stacking
• Boosting

Page 10:

Stacked learners: first attempt

• Input:
  – your top T favorite learners (or tunings): L1, …, LT
  – a dataset D containing (x, y), …
• Learning algorithm:
  – Train L1, …, LT on D to get h1, …, hT
  – Create a new dataset D' containing (x', y'), …
    • x' is the vector of the T predictions (h1(x), …, hT(x))
    • y' is the label y for x
  – Train YFCL (your favorite classifier learner) on D' to get h', which combines the predictions!
• To predict on a new x:
  – Construct x' as before and predict h'(x')

Problem: if Li overfits the data D, then hi(x) could be almost always the same as y in D'. But that won't be the case on an out-of-sample test example x.

The fix: make an x’ in D’ look more like the out-of-sample test cases.
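A minimal sketch of this first attempt (scikit-learn assumed; learners and data are placeholders). The deep, unpruned tree will fit D almost perfectly, so its column of in-sample predictions in D' is nearly a copy of y, which is exactly the problem described above.

```python
# Naive stacking: the combiner is trained on *in-sample* predictions of h1..hT.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base_learners = [DecisionTreeClassifier(random_state=0),   # deep tree: will overfit D
                 GaussianNB(),
                 LogisticRegression(max_iter=1000)]
hypotheses = [L.fit(X, y) for L in base_learners]          # h1, ..., hT

# D': x' = (h1(x), ..., hT(x)), y' = y.  The tree's column is ~identical to y here.
X_prime = np.column_stack([h.predict(X) for h in hypotheses])
combiner = LogisticRegression().fit(X_prime, y)            # "YFCL" trained on D'

def predict(X_new):
    return combiner.predict(np.column_stack([h.predict(X_new) for h in hypotheses]))
```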

Page 11:

Stacked learners: the right way

• Input:
  – your top T favorite learners (or tunings): L1, …, LT
  – a dataset D containing (x, y), …
• Learning algorithm:
  – Train L1, …, LT on D to get h1, …, hT
  – Also run 10-CV and collect the CV test predictions (x, Li(D−x, x)) for each example x and each learner Li
  – Create a new dataset D' containing (x', y'), …
    • x' is the vector of CV test predictions (L1(D−x, x), L2(D−x, x), …)
    • y' is the label y for x
  – Train YFCL on D' to get h', which combines the predictions!
• To predict on a new x:
  – Construct x' using h1, …, hT (as before) and predict h'(x')

Notation: D−x is the CV training set for x, and Li(D*, x) is the label predicted for x by the hypothesis Li learns from D*. Note that we don't do any CV at prediction time (see the sketch below).

Page 12:

Pros and cons of stacking

• Pros:
  – Fairly simple
  – Slow, but easy to parallelize
• Cons:
  – What if there's not a single best combination scheme?
  – E.g.: for movie recommendation, sometimes L1 is best for users with many ratings and L2 is best for users with few ratings.

Page 13:

Multi-level stacking/blending

• Learning algorithm:
  – Train L1, …, LT on D to get h1, …, hT
  – Also run 10-CV and collect the CV test predictions (x, Li(D−x, x)) for each example x and each learner Li
  – Create a new dataset D' containing (x', y'), …
    • x' is the CV test predictions (L1(D−x, x), L2(D−x, x), …) combined with additional “meta-features” of x (e.g., numRatings, userAge, …)
    • y' is the label y for x
  – Train YFCL on D' to get h', which combines the predictions!
• To predict on a new x:
  – Construct x' using h1, …, hT (as before) and predict h'(x')

As before, D−x is the CV training set for x and Li(D*, x) is a predicted label. The combiner might create features like “L1(D−x, x) = y* and numRatings > 20”, so the choice of which classifier to rely on depends on meta-features of x. A small sketch follows.
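A brief sketch of blending with meta-features (scikit-learn assumed; the numRatings feature and the >20 threshold are hypothetical, echoing the slide's example): the extra columns let the combiner learn which base learner to trust for which kind of x.

```python
# Blending: x' = (CV predictions of h1..hT) plus meta-features of x.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
num_ratings = np.random.RandomState(0).randint(0, 100, size=len(y))  # stand-in meta-feature

base_learners = [DecisionTreeClassifier(random_state=0), GaussianNB(),
                 LogisticRegression(max_iter=1000)]
cv_preds = np.column_stack([cross_val_predict(L, X, y, cv=10) for L in base_learners])

# Meta-features (and simple thresholds) let the combiner express rules like
# "rely on L1 when numRatings > 20".
X_prime = np.column_stack([cv_preds, num_ratings, num_ratings > 20])
combiner = LogisticRegression(max_iter=1000).fit(X_prime, y)
```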

Page 14:

Comments

• Ensembles based on blending/stacking were key approaches used in the Netflix competition
  – Winning entries blended many types of classifiers
• Ensembles based on stacking are the main architecture used in Watson
  – Not all of the base classifiers/rankers are learned, however; some are hand-programmed.

Page 15:

Common Ensemble classifiers

• Bagging/Random Forests
• “Bucket of models”
• Stacking
• Boosting

Page 16:

Boosting

Thanks to “A Short Introduction to Boosting,” Yoav Freund and Robert E. Schapire, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.

[Timeline: 1950 Turing · 1984 Valiant · 1988 Kearns & Valiant · 1989 Schapire · 1993 Drucker, Schapire & Simard · 1995 Freund & Schapire]

Page 17:


•  Valiant CACM 1984 and PAC-learning: partly inspired by Turing

             Formal                  Informal
AI           Valiant (1984)          Turing test (1950)
Complexity   Turing machine (1936)

Question: what sort of AI questions can we formalize and study with formal methods?

Page 18:


“Weak” PAC-learning (Kearns & Valiant, 1988)

[Figure: the PAC-learning guarantee, weakened so that the required error bound is only slightly better than chance, say ε = 0.49]

Page 19:


“Weak” PAC-learning is equivalent to “strong” PAC-learning (!) (Schapire 89)

[Figure: weak PAC-learning (say, ε = 0.49) = strong PAC-learning]

Page 20:


“Weak” PAC-learning is equivalent to “strong” PAC-learning (!) (Schapire 89)

• The basic idea exploits the fact that you can learn a little on every distribution:
  – Learn h1 from D0 with error < 49%
  – Modify D0 so that h1 has error 50% (call this D1)
    • Flip a coin: if heads, wait for an example x where h1(x) = f(x); otherwise wait for an example where h1(x) ≠ f(x).
  – Learn h2 from D1 with error < 49%
  – Modify D1 so that h1 and h2 always disagree (call this D2)
  – Learn h3 from D2 with error < 49%
  – Now vote h1, h2, and h3. This has error better than any of the “weak” hypotheses h1, h2, or h3.
  – Repeat this as needed to lower the error rate more…

Page 21:


Boosting can actually help experimentally…but… (Drucker, Schapire, Simard)

• The basic idea exploits the fact that you can learn a little on every distribution:
  – Learn h1 from D0 with error < 49%
  – Modify D0 so that h1 has error 50% (call this D1)
    • Flip a coin: if heads, wait for an example x where h1(x) = f(x); otherwise wait for an example where h1(x) ≠ f(x).
  – Learn h2 from D1 with error < 49%
  – Modify D1 so that h1 and h2 always disagree (call this D2)
  – Learn h3 from D2 with error < 49%
  – Now vote h1, h2, and h3. This has error better than any of the “weak” hypotheses h1, h2, or h3.
  – Repeat this as needed to lower the error rate more…

Very wasteful of examples

wait for examples where they disagree !?

Page 22:


AdaBoost (Freund and Schapire)

Page 23:

• Increase the weight of xi if ht is wrong; decrease its weight if ht is right.
• Linear combination of base hypotheses: the best weight αt depends on the error of ht.
• Sample with replacement according to the current weights to train each weak learner.
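A minimal AdaBoost sketch matching these annotations, for labels y in {−1, +1}. Using depth-1 scikit-learn trees as the weak learner, and passing the weights via sample_weight rather than literally resampling with replacement, are choices made here, not taken from the slides.

```python
# AdaBoost: reweight examples each round; weight each hypothesis by alpha_t from its error.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)                                   # y must be in {-1, +1}
    D = np.full(n, 1.0 / n)                      # initial distribution over examples
    hypotheses, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                 # weighted error eps_t of h_t
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * y * pred)        # increase weight of mistakes, decrease correct
        D = D / D.sum()
        hypotheses.append(h)
        alphas.append(alpha)

    def H(X_new):                                # final hypothesis: sign of the weighted vote
        return np.sign(sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses)))
    return H
```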

Page 24:


AdaBoost: Adaptive Boosting (Freund & Schapire, 1995)

Page 25:

Theoretically, one can upper bound the training error of boosting.

Page 26:

Boosting: A toy example (thanks, Rob Schapire)

Pages 27-30: Boosting: A toy example, continued (thanks, Rob Schapire)

Page 31:


Boosting improved decision trees…

Page 32:

Boosting: Analysis

Theorem: if the error at round t of the base classifier is ε_t = 1/2 − γ_t, then the training error of the boosted classifier is bounded by

$$\prod_t 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \;=\; \prod_t \sqrt{1-4\gamma_t^2} \;\le\; \exp\Big(-2\sum_t \gamma_t^2\Big).$$

The algorithm doesn't need to know any bound on γ_t in advance, though: it adapts to the actual sequence of errors.

Page 33:

BOOSTING AS OPTIMIZATION

Page 34:

Even boosting single features worked well…


Reuters newswire corpus

Page 35:


Some background facts

Coordinate descent optimization to minimize f(w):
• For t = 1, …, T (or until convergence):
  • For i = 1, …, N, where w = <w1, …, wN>:
    • Pick w* to minimize f(<w1, …, wi-1, w*, wi+1, …, wN>)
    • Set wi = w*
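A direct transcription of this pseudocode into Python; using scipy's one-dimensional minimizer for the inner "pick w* to minimize f(...)" step is an assumption, and any line search would do.

```python
# Coordinate descent: repeatedly minimize f along one coordinate at a time.
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, w0, T=100):
    w = np.array(w0, dtype=float)
    for t in range(T):                        # for t = 1..T (or until convergence)
        for i in range(len(w)):               # for each coordinate i of w = <w1,...,wN>
            def f_i(wi, i=i):                 # f as a function of coordinate i alone
                w_try = w.copy()
                w_try[i] = wi
                return f(w_try)
            w[i] = minimize_scalar(f_i).x     # pick w* minimizing f(<..., w*, ...>); set wi = w*
    return w

# Example: minimize a simple separable quadratic.
print(coordinate_descent(lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2, [0.0, 0.0], T=5))
```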

Page 36:

Boosting as optimization using coordinate descent

With a small number of possible h's, you can think of boosting as finding a linear combination of these (so boosting is sort of like stacking):

$$H(x) = \mathrm{sign}\Big(\sum_i w_i\, h_i(x)\Big)$$

h(x) ≡ (h1(x), …, hN(x))  is the (stacked) instance vector

w ≡ (α1, …, αN)  is the weight vector

Boosting uses coordinate descent to minimize an upper bound on error rate:

$$\sum_{(x,y)\in D} \exp\Big(-y \sum_i w_i\, h_i(x)\Big)$$
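A quick numeric check, with made-up scores rather than anything from the slides, that the averaged exponential loss really does upper-bound the 0/1 training error: whenever sign(f(x)) ≠ y, the term exp(−y f(x)) is at least 1.

```python
# Exponential loss as an upper bound on training error rate.
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000)              # labels
f = rng.normal(size=1000) + 0.5 * y             # made-up ensemble scores sum_i w_i h_i(x)

training_error = np.mean(np.sign(f) != y)       # 0/1 error of sign(f)
exp_bound = np.mean(np.exp(-y * f))             # (1/m) sum_j exp(-y_j f(x_j))
print(training_error, exp_bound)                # the bound is always >= the error
```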

Page 37:


Boosting and optimization


Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000.

Compared using AdaBoost to set feature weights vs direct optimization of feature weights to minimize log-likelihood, squared error, …

Page 38:

BOOSTING AS MARGIN LEARNING

Page 39:


Boosting didn’t seem to overfit…(!)

[Figure: training and test error vs. number of boosting rounds]

Page 40:


…because it turned out to be increasing the margin of the classifier

[Figure: margin distributions after 100 rounds and after 1000 rounds]

Page 41:

Boosting movie

Page 42:


Some background facts

Coordinate descent optimization to minimize f(w):
• For t = 1, …, T (or until convergence):
  • For i = 1, …, N, where w = <w1, …, wN>:
    • Pick w* to minimize f(<w1, …, wi-1, w*, wi+1, …, wN>)
    • Set wi = w*

$$\|w\|_2 \equiv \sqrt{w_1^2 + \cdots + w_T^2} \qquad\qquad \|w\|_k \equiv \big(w_1^k + \cdots + w_T^k\big)^{1/k}$$

$$\|w\|_1 \equiv |w_1| + \cdots + |w_T| \qquad\qquad \|w\|_\infty \equiv \max\big(|w_1|, \ldots, |w_T|\big)$$

Page 43:


Boosting is closely related to margin classifiers like SVM, voted perceptron, … (!)

h(x) ≡ (h1(x), …, hT(x))  is the (stacked) instance vector

w ≡ (α1, …, αT)  is the weight vector

Boosting:
$$\max_w \min_x \frac{y \,\big(w \cdot h(x)\big)}{\|w\|_1 \,\|h(x)\|_\infty}$$
optimized by coordinate descent; here $\|w\|_1 = \sum_t \alpha_t$ and $\|h(x)\|_\infty = \max_t h_t(x)$.

The “coordinates” are extended by one in each round of boosting (usually, unless you happen to generate the same tree twice).

Page 44:


Boosting is closely related to margin classifiers like SVM, voted perceptron, … (!)

h(x) ≡ (h1(x), …, hT(x))  is the (stacked) instance vector

w ≡ (α1, …, αT)  is the weight vector

Boosting:
$$\max_w \min_x \frac{y \,\big(w \cdot h(x)\big)}{\|w\|_1 \,\|h(x)\|_\infty}$$
optimized by coordinate descent; here $\|w\|_1 = \sum_t \alpha_t$ and $\|h(x)\|_\infty = \max_t h_t(x)$.

Linear SVMs:
$$\max_w \min_x \frac{y \,\big(w \cdot h(x)\big)}{\|w\|_2 \,\|h(x)\|_2}$$
optimized by QP, …; where $\|w\|_2 = \sqrt{\sum_t \alpha_t^2}$ and $\|h(x)\|_2 = \sqrt{\sum_t h_t(x)^2}$.

Page 45:

BOOSTING VS BAGGING

Page 46:

Boosting vs Bagging

•  Both build weighted combinations of classifiers, each classifier being learned on a sample of the original data

•  Boosting reweights the distribution in each iteration

•  Bagging doesn’t

Page 47:

Boosting vs Bagging

• Boosting finds a linear combination of weak hypotheses
  – Theory says it reduces bias
  – In practice it also reduces variance, especially in later iterations

[Figures: results on the Splice and Audiology datasets]

Page 48:

Boosting vs Bagging

Page 49:

WRAPUP ON BOOSTING

Page 50:

Boosting in the real world

• William's wrap-up:
  – Boosting is not discussed much in the ML research community any more
    • It's much too well understood
  – It's really useful in practice as a meta-learning method
    • E.g., boosted Naïve Bayes usually beats Naïve Bayes
  – Boosted decision trees are
    • almost always competitive with respect to accuracy
    • very robust against rescaling numeric features, extra features, non-linearities, …
    • somewhat slower to learn and use than many linear classifiers
    • but getting probabilities out of them is a little less reliable (see the calibration sketch below)
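One common remedy for the probability issue, added here as a suggestion rather than something from the slides, is post-hoc calibration, e.g. scikit-learn's CalibratedClassifierCV wrapped around a boosted model:

```python
# Calibrating the probability outputs of a boosted classifier (illustrative sketch).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boosted = AdaBoostClassifier(n_estimators=200, random_state=0)
calibrated = CalibratedClassifierCV(boosted, method="sigmoid", cv=5).fit(X_tr, y_tr)

print(calibrated.predict_proba(X_te)[:5])   # calibrated class probabilities
```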
