TRANSCRIPT · 2017. 5. 13.

Ensemble Methods for Machine Learning
COMBINING CLASSIFIERS: ENSEMBLE APPROACHES
Common Ensemble classifiers
• Bagging/Random Forests • “Bucket of models” • Stacking • Boosting
Ensemble classifiers we’ve studied so far
• Bagging – Build many bootstrap replicates of the data – Train many classifiers – Vote the results
Bagged trees to random forests • Bagged decision trees:
– Build many bootstrap replicates of the data – Train many tree classifiers – Vote the results
• Random forest: – Build many bootstrap replicates of the data – Train many tree classifiers (no pruning)
• For each split, restrict to a random subset of the original feature set. (Usually a random subset of size sqrt(d) where there are d original features.)
– Vote the results
– Like bagged decision trees, fast and easy to use
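As a concrete illustration of the bagging recipe (a sketch, not code from the slides): the base learner below is a simple 1-D decision stump, the data is a made-up toy set with one noisy label, and the ensemble votes the bootstrap-trained stumps.

```python
import random

def train_stump(data):
    """Base learner: 1-D decision stump minimizing training error."""
    best = None
    for thresh in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(1 for x, y in data
                      if sign * (1 if x >= thresh else -1) != y)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    _, thresh, sign = best
    return lambda x: sign * (1 if x >= thresh else -1)

def bag(data, n_models=25, seed=0):
    """Bagging: bootstrap replicates -> one classifier each -> majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        replicate = [rng.choice(data) for _ in data]   # sample with replacement
        models.append(train_stump(replicate))
    return lambda x: 1 if sum(m(x) for m in models) >= 0 else -1

# Toy 1-D data: true label is +1 iff x > 5, plus one noisy point.
data = [(x, 1 if x > 5 else -1) for x in range(10)]
data[3] = (3, 1)   # label noise
h = bag(data)
print([h(x) for x in range(10)])
```

A random forest additionally restricts each split to a random feature subset; with one feature there is nothing to subsample, so this sketch stops at bagging.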
A “bucket of models”
• Scenario:
– I have three learners: Logistic Regression, Naïve Bayes, Backprop
– I'm going to embed a learning machine in some system (e.g., a "Genius"-like music recommender)
• Each system has a different user → a different dataset D
• In pilot studies there's no single best system
• Which learner do I embed to get the best result?
• Can I combine all of them somehow and do better?
Simple learner combination: A “bucket of models”
• Input:
– Your top T favorite learners (or tunings): L1,…,LT
– A dataset D
• Learning algorithm:
– Use 10-fold cross-validation (10-CV) to estimate the error of L1,…,LT
– Pick the best (lowest 10-CV error) learner L*
– Train L* on D and return its hypothesis h*
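The procedure above is easy to sketch. The learners and the toy 1-D dataset below are hypothetical stand-ins; any learner that maps a training set to a hypothesis would do.

```python
import random
import statistics

def cross_val_error(learner, data, k=10, seed=0):
    """Estimate a learner's error with k-fold cross-validation."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learner(train)
        errors.append(sum(1 for x, y in test if h(x) != y) / len(test))
    return statistics.mean(errors)

def bucket_of_models(learners, data):
    """Pick the learner with the lowest 10-CV error, then train it on all of D."""
    best = min(learners, key=lambda L: cross_val_error(L, data))
    return best(data)

def threshold_learner(train):
    """A (deliberately simplistic) learner that ignores its training data."""
    return lambda x: 1 if x > 5 else -1

def majority_learner(train):
    """Baseline learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(x, 1 if x > 5 else -1) for x in range(20)]
h_star = bucket_of_models([threshold_learner, majority_learner], data)
print(h_star(10), h_star(0))
```

On this data the threshold rule has zero CV error and the majority baseline does not, so the bucket returns the threshold learner's hypothesis.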
Pros and cons of a “bucket of models”
• Pros:
– Simple
– Will give results not much worse than the best of the "base learners"
• Cons:
– What if there's not a single best learner?
• Other approaches: – Vote the hypotheses (how would you weight them?) – Combine them some other way? – How about learning to combine the hypotheses?
Common Ensemble classifiers
• Bagging/Random Forests • “Bucket of models” • Stacking • Boosting
Stacked learners: first attempt
• Input:
– Your top T favorite learners (or tunings): L1,…,LT
– A dataset D containing examples (x,y), …
• Learning algorithm:
– Train L1,…,LT on D to get h1,…,hT
– Create a new dataset D' containing examples (x',y'), …
  • x' is a vector of the T predictions h1(x),…,hT(x)
  • y' is the label y for x
– Train YFCL (your favorite classifier/learner) on D' to get h' --- which combines the predictions!
• To predict on a new x:
– Construct x' as before and predict h'(x')
Problem: if Li overfits the data D, then hi(x) could be almost always the same as y in D'. But that won't be the case on an out-of-sample test example x.
The fix: make an x’ in D’ look more like the out-of-sample test cases.
Stacked learners: the right way
• Input:
– Your top T favorite learners (or tunings): L1,…,LT
– A dataset D containing examples (x,y), …
• Learning algorithm:
– Train L1,…,LT on D to get h1,…,hT
– Also run 10-CV and collect the CV test predictions (x, Li(D-x, x)) for each example x and each learner Li
– Create a new dataset D' containing examples (x',y'), …
  • x' is the vector of CV test predictions <L1(D-x, x), L2(D-x, x), …>
  • y' is the label y for x
– Train YFCL on D' to get h' --- which combines the predictions!
• To predict on a new x:
– Construct x' using h1,…,hT (as before) and predict h'(x')
D-x is the CV training set for x; Li(D*, x) is the label predicted by the hypothesis that Li learns from D*. Note that we don't do any CV at prediction time.
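A minimal sketch of the corrected procedure, assuming each learner is a function from a training set to a hypothesis; the fold count (5 rather than the slide's 10) and the toy learners are my choices for brevity.

```python
import random

def cv_predictions(learners, data, k=5, seed=0):
    """For every example x, collect each learner's prediction Li(D - x, x),
    i.e., the prediction of a model trained WITHOUT x's fold."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(data)
    for i in range(k):
        train = [data[j] for f in range(k) if f != i for j in folds[f]]
        models = [L(train) for L in learners]
        for j in folds[i]:
            preds[j] = [m(data[j][0]) for m in models]
    return preds

def stack(learners, combiner, data):
    """Stacking 'the right way': the combiner is trained on CV predictions."""
    d_prime = [(xp, y) for xp, (_, y) in zip(cv_predictions(learners, data), data)]
    h_prime = combiner(d_prime)           # YFCL trained on D'
    models = [L(data) for L in learners]  # h1..hT trained on ALL of D
    return lambda x: h_prime([m(x) for m in models])

# Toy setup: two threshold learners and a voting combiner (which here
# ignores D' -- a real combiner would be trained on it).
L_a = lambda train: (lambda x: 1 if x > 5 else -1)
L_b = lambda train: (lambda x: 1 if x > 7 else -1)
vote = lambda d_prime: (lambda p: 1 if sum(p) >= 0 else -1)
data = [(x, 1 if x > 5 else -1) for x in range(12)]
h = stack([L_a, L_b], vote, data)
print(h(9), h(0))
```

The key point is in `cv_predictions`: the x' used for training the combiner comes from models that never saw x, matching what the combiner will see at prediction time.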
Pros and cons of stacking
• Pros:
– Fairly simple
– Slow, but easy to parallelize
• Cons:
– What if there's not a single best combination scheme?
– E.g., for movie recommendation, sometimes L1 is best for users with many ratings and L2 is best for users with few ratings.
Multi-level stacking/blending
• Learning algorithm:
– Train L1,…,LT on D to get h1,…,hT
– Also run 10-CV and collect the CV test predictions (x, Li(D-x, x)) for each example x and each learner Li
– Create a new dataset D' containing examples (x',y'), …
  • x' is the vector of CV test predictions <L1(D-x, x), L2(D-x, x), …> combined with additional "meta-features" from x (e.g., numRatings, userAge, …)
  • y' is the label y for x
– Train YFCL on D' to get h' --- which combines the predictions!
• To predict on a new x:
– Construct x' using h1,…,hT (as before) and predict h'(x')

D-x is the CV training set for x and Li(D*, x) is a predicted label. The combiner might create features like "L1(D-x, x)=y* and numRatings>20", so the choice of classifier to rely on depends on meta-features of x.
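The blended feature construction itself is one line; here is a sketch (the meta-feature names numRatings and userAge come from the slide, the values are invented):

```python
def blended_features(cv_preds, meta):
    """x' = the base learners' CV predictions + meta-features of x."""
    return list(cv_preds) + [meta["numRatings"], meta["userAge"]]

x_prime = blended_features([1, -1], {"numRatings": 35, "userAge": 27})
print(x_prime)  # [1, -1, 35, 27]
```

A combiner trained on such x' can then learn rules like "trust L1's prediction when numRatings > 20".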
Comments
• Ensembles based on blending/stacking were key approaches used in the Netflix competition
– Winning entries blended many types of classifiers
• Ensembles based on stacking are the main architecture used in Watson
– Not all of the base classifiers/rankers are learned, however; some are hand-programmed.
Common Ensemble classifiers
• Bagging/Random Forests • “Bucket of models” • Stacking • Boosting
Boosting
Thanks to: A Short Introduction to Boosting. Yoav Freund and Robert E. Schapire, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Timeline: 1950 Turing (T) · 1984 Valiant (V) · 1988 Kearns & Valiant (KV) · 1989 Schapire (S) · 1993 Drucker, Schapire & Simard (DSS) · 1995 Freund & Schapire (FS) · …
• Valiant (CACM 1984) and PAC-learning: partly inspired by Turing

             Formal                   Informal
  AI         Valiant (1984)           Turing test (1950)
  Complexity Turing machine (1936)

Question: what sort of AI questions can we formalize and study with formal methods?
"Weak" PAC-learning (Kearns & Valiant, 1988): like PAC learning, but the learner only has to beat chance by a little --- an error bound of, say, ε = 0.49.
"Weak" PAC-learning is equivalent to "strong" PAC-learning (!) (Schapire, 1989): any learner achieving, say, ε = 0.49 can be boosted into one with arbitrarily small error.
“Weak” PAC-learning is equivalent to “strong” PAC-learning (!) (Schapire 89)
• The basic idea exploits the fact that you can learn a little on every distribution:
– Learn h1 from D0 with error < 49%
– Modify D0 so that h1 has error exactly 50% (call this D1)
  • Flip a coin: if heads, wait for an example x where h1(x)=f(x); otherwise wait for an example where h1(x)!=f(x).
– Learn h2 from D1 with error < 49%
– Modify D1 so that h1 and h2 always disagree (call this D2)
– Learn h3 from D2 with error < 49%
– Now take the majority vote of h1, h2, and h3. This has lower error than any of the "weak" hypotheses h1, h2, or h3.
– Repeat this as needed to lower the error rate more….
Boosting can actually help experimentally…but… (Drucker, Schapire, Simard)
• Same construction as on the previous slide, but note the practical problems:
– Very wasteful of examples
– "Wait for examples where they disagree"!?
AdaBoost (Freund and Schapire)
• Sample with replacement from the current distribution to train each round's base classifier.
• Increase the weight of xi if ht is wrong; decrease the weight if ht is right.
• Output a linear combination of the base hypotheses; the best weight αt depends on the error of ht.
AdaBoost: Adaptive Boosting (Freund & Schapire, 1995)
Theoretically, one can upper bound the training error of boosting.
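The algorithm described above can be sketched from scratch. This is a minimal illustration, not the paper's code: the weak learner is a 1-D decision stump, and the toy interval dataset is made up (it needs three rounds because no single stump can represent an interval).

```python
import math

def weighted_stump(data, w):
    """Weak learner: 1-D decision stump minimizing WEIGHTED training error."""
    best = None
    for thresh in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if sign * (1 if x >= thresh else -1) != y)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    err, thresh, sign = best
    return (lambda x: sign * (1 if x >= thresh else -1)), err

def adaboost(data, rounds):
    """AdaBoost: reweight the examples each round; output sign(sum_t alpha_t h_t)."""
    n = len(data)
    w = [1.0 / n] * n                        # start from the uniform distribution
    hs, alphas = [], []
    for _ in range(rounds):
        h, err = weighted_stump(data, w)
        err = max(err, 1e-12)                # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        hs.append(h)
        alphas.append(alpha)
        # Increase the weight of x_i if h_t is wrong, decrease it if right:
        w = [wi * math.exp(-alpha * y * h(x)) for (x, y), wi in zip(data, w)]
        z = sum(w)
        w = [wi / z for wi in w]             # renormalize to a distribution
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1

# Toy target: +1 on the interval [3, 6]; three rounds of stumps fit it exactly.
data = [(x, 1 if 3 <= x <= 6 else -1) for x in range(10)]
H = adaboost(data, rounds=3)
print([H(x) for x in range(10)])
```

Here each round learns a one-sided rule (x≥3, not x≥7, and a constant), and the weighted vote of the three carves out the interval.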
Boosting: A toy example (sequence of figure slides omitted) --- thanks, Rob Schapire
Boosting improved decision trees…
Boosting: Analysis
Theorem: if the error at round t of the base classifier is εt = 1/2 - γt, then the training error of the boosted classifier is bounded by

  ∏t 2·sqrt( εt(1-εt) ) = ∏t sqrt( 1 - 4γt² ) ≤ exp( -2 Σt γt² )

The algorithm doesn't need to know any bound on γt in advance, though --- it adapts to the actual sequence of errors.
BOOSTING AS OPTIMIZATION
Even boosting single features worked well…
Reuters newswire corpus
Some background facts
Coordinate descent optimization to minimize f(w):
• For t = 1,…,T, or until convergence:
  • For i = 1,…,N (where w = <w1,…,wN>):
    – Pick w* to minimize f(<w1,…,wi-1, w*, wi+1,…,wN>)
    – Set wi = w*
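The loop above can be sketched directly. The "pick w* to minimize" step is done here with a crude grid line search over a few candidate steps (my simplification); the toy objective is a separable quadratic whose minimum we know.

```python
def coordinate_descent(f, w, passes=100):
    """Minimize f(w) by optimizing one coordinate at a time."""
    w = list(w)
    for _ in range(passes):
        for i in range(len(w)):
            # Pick w* to (approximately) minimize f along coordinate i.
            candidates = [w[i] + step for step in (-1.0, -0.1, 0.0, 0.1, 1.0)]
            w[i] = min(candidates, key=lambda c: f(w[:i] + [c] + w[i + 1:]))
    return w

# Toy objective: f(w) = (w1 - 2)^2 + (w2 + 1)^2, minimized at w = (2, -1).
f = lambda w: (w[0] - 2) ** 2 + (w[1] + 1) ** 2
w_opt = coordinate_descent(f, [0.0, 0.0])
print(w_opt)  # [2.0, -1.0]
```

Because the candidate list always contains the current value (step 0.0), each update can only decrease f.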
Boosting as optimization using coordinate descent

• With a small number of possible h's, you can think of boosting as finding a linear combination of these:

  H(x) = sign( Σi wi hi(x) )

• So boosting is sort of like stacking:

  h(x) ≡ <h1(x),…,hN(x)>   (stacked) instance vector
  w ≡ <α1,…,αN>   weight vector

• Boosting uses coordinate descent to minimize an upper bound on the error rate:

  Σi exp( -yi Σt wt ht(xi) )
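This view can be made concrete: fix a small pool of base hypotheses and run coordinate descent on the exponential loss. The crude numeric line search and the toy hypotheses below are my choices for a self-contained sketch (for ±1 base hypotheses the exact step has the AdaBoost closed form).

```python
import math

def exp_loss(w, hs, data):
    """The bound being minimized: sum_i exp( -y_i * sum_t w_t h_t(x_i) )."""
    return sum(math.exp(-y * sum(wt * h(x) for wt, h in zip(w, hs)))
               for x, y in data)

def boost_by_coordinate_descent(hs, data, passes=30):
    """Coordinate descent on the exponential loss: repeatedly improve one
    base-hypothesis weight at a time by a small line search."""
    w = [0.0] * len(hs)
    for _ in range(passes):
        for i in range(len(w)):
            candidates = [w[i] + d for d in (-0.5, -0.1, 0.0, 0.1, 0.5)]
            w[i] = min(candidates,
                       key=lambda c: exp_loss(w[:i] + [c] + w[i + 1:], hs, data))
    return w

# Toy 1-D task (+1 on the interval [3, 6]) and three fixed base hypotheses
# whose weighted vote can represent it.
h1 = lambda x: 1 if x >= 3 else -1
h2 = lambda x: -1 if x >= 7 else 1
h3 = lambda x: -1
hs = [h1, h2, h3]
data = [(x, 1 if 3 <= x <= 6 else -1) for x in range(10)]
w = boost_by_coordinate_descent(hs, data)
H = lambda x: 1 if sum(wt * h(x) for wt, h in zip(w, hs)) >= 0 else -1
print(w, [H(x) for x in range(10)])
```

Driving the exponential loss down forces every training margin up, so the resulting weighted vote classifies the toy set correctly.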
Boosting and optimization
1999 - FHT
Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000.
Compared using AdaBoost to set feature weights vs direct optimization of feature weights to minimize log-likelihood, squared error, …
BOOSTING AS MARGIN LEARNING
Boosting didn’t seem to overfit…(!)
(figure omitted: training and test error vs. number of boosting rounds)
…because it turned out to be increasing the margin of the classifier
(figures omitted: margin distributions after 100 rounds and 1000 rounds)
Boosting movie (animation omitted)
Some background facts

• Coordinate descent optimization (as before), and the norms of a vector w = <w1,…,wT>:

  ||w||1 ≡ |w1| + … + |wT|
  ||w||2 ≡ sqrt( w1² + … + wT² )
  ||w||k ≡ ( |w1|^k + … + |wT|^k )^(1/k)
  ||w||∞ ≡ max( |w1|,…,|wT| )
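A quick numeric sanity check of these definitions (the vector is arbitrary):

```python
import math

def norm(w, p):
    """||w||_p; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(wi) for wi in w)
    return sum(abs(wi) ** p for wi in w) ** (1.0 / p)

w = [3.0, -4.0]
print(norm(w, 1), norm(w, 2), norm(w, math.inf))  # 7.0 5.0 4.0
```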
Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!)

  h(x) ≡ <h1(x),…,hT(x)>   (stacked) instance vector
  w ≡ <α1,…,αT>   weight vector

Boosting:

  maxw minx  y · w·h(x) / ( ||w||1 ||h(x)||∞ )

optimized by coordinate descent, where ||w||1 = Σt αt and ||h(x)||∞ = maxt |ht(x)|
The “coordinates” are being extended by one in each round of boosting --- usually, unless you happen to generate the same tree twice
Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!)

  h(x) ≡ <h1(x),…,hT(x)>   (stacked) instance vector
  w ≡ <α1,…,αT>   weight vector

Boosting:

  maxw minx  y · w·h(x) / ( ||w||1 ||h(x)||∞ )

optimized by coordinate descent, where ||w||1 = Σt αt and ||h(x)||∞ = maxt |ht(x)|

Linear SVMs:

  maxw minx  y · w·h(x) / ( ||w||2 ||h(x)||2 )

optimized by QP, …, where ||w||2 = sqrt( Σt αt² ) and ||h(x)||2 = sqrt( Σt ht(x)² )
BOOSTING VS BAGGING
Boosting vs Bagging
• Both build weighted combinations of classifiers, each classifier being learned on a sample of the original data
• Boosting reweights the distribution in each iteration
• Bagging doesn’t
Boosting vs Bagging
• Boosting finds a linear combination of weak hypotheses
– Theory says it reduces bias
– In practice it also reduces variance, especially in later iterations
Boosting vs Bagging
(chart omitted: results on the Splice and Audiology datasets)
WRAPUP ON BOOSTING
Boosting in the real world
• William's wrap-up:
– Boosting is not discussed much in the ML research community any more
  • It's much too well understood
– It's really useful in practice as a meta-learning method
  • E.g., boosted Naïve Bayes usually beats Naïve Bayes
– Boosted decision trees are:
  • almost always competitive with respect to accuracy
  • very robust against rescaling numeric features, extra features, non-linearities, …
  • somewhat slower to learn and use than many linear classifiers
  • a little less reliable for getting probabilities out of