TRANSCRIPT · 2017. 5. 13.

Ensemble Methods for Machine Learning
COMBINING CLASSIFIERS: ENSEMBLE APPROACHES
Common Ensemble classifiers
• Bagging/Random Forests • “Bucket of models” • Stacking • Boosting
Ensemble classifiers we’ve studied so far
• Bagging – Build many bootstrap replicates of the data – Train many classifiers – Vote the results
Bagged trees to random forests • Bagged decision trees:
– Build many bootstrap replicates of the data – Train many tree classifiers – Vote the results
• Random forest: – Build many bootstrap replicates of the data – Train many tree classifiers (no pruning)
• For each split, restrict to a random subset of the original feature set. (Usually a random subset of size sqrt(d) where there are d original features.)
– Vote the results
– Like bagged decision trees, fast and easy to use
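As a concrete illustration of the bagging recipe (a sketch, not code from the slides): the base learner below is a simple 1-D decision stump, the data is a made-up toy set with one noisy label, and the ensemble votes the bootstrap-trained stumps.

```python
import random

def train_stump(data):
    """Base learner: 1-D decision stump minimizing training error."""
    best = None
    for thresh in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(1 for x, y in data
                      if sign * (1 if x >= thresh else -1) != y)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    _, thresh, sign = best
    return lambda x: sign * (1 if x >= thresh else -1)

def bag(data, n_models=25, seed=0):
    """Bagging: bootstrap replicates -> one classifier each -> majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        replicate = [rng.choice(data) for _ in data]   # sample with replacement
        models.append(train_stump(replicate))
    return lambda x: 1 if sum(m(x) for m in models) >= 0 else -1

# Toy 1-D data: true label is +1 iff x > 5, plus one noisy point.
data = [(x, 1 if x > 5 else -1) for x in range(10)]
data[3] = (3, 1)   # label noise
h = bag(data)
print([h(x) for x in range(10)])
```

A random forest additionally restricts each split to a random feature subset; with one feature there is nothing to subsample, so this sketch stops at bagging.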
A “bucket of models”
• Scenario:
– I have three learners: Logistic Regression, Naïve Bayes, Backprop
– I'm going to embed a learning machine in some system (e.g., a "Genius"-like music recommender)
• Each system has a different user → a different dataset D
• In pilot studies there's no single best system
• Which learner do I embed to get the best result?
• Can I combine all of them somehow and do better?
Simple learner combination: A “bucket of models”
• Input:
– Your top T favorite learners (or tunings): L1,…,LT
– A dataset D
• Learning algorithm:
– Use 10-fold cross-validation (10-CV) to estimate the error of L1,…,LT
– Pick the best (lowest 10-CV error) learner L*
– Train L* on D and return its hypothesis h*
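The procedure above is easy to sketch. The learners and the toy 1-D dataset below are hypothetical stand-ins; any learner that maps a training set to a hypothesis would do.

```python
import random
import statistics

def cross_val_error(learner, data, k=10, seed=0):
    """Estimate a learner's error with k-fold cross-validation."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learner(train)
        errors.append(sum(1 for x, y in test if h(x) != y) / len(test))
    return statistics.mean(errors)

def bucket_of_models(learners, data):
    """Pick the learner with the lowest 10-CV error, then train it on all of D."""
    best = min(learners, key=lambda L: cross_val_error(L, data))
    return best(data)

def threshold_learner(train):
    """A (deliberately simplistic) learner that ignores its training data."""
    return lambda x: 1 if x > 5 else -1

def majority_learner(train):
    """Baseline learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(x, 1 if x > 5 else -1) for x in range(20)]
h_star = bucket_of_models([threshold_learner, majority_learner], data)
print(h_star(10), h_star(0))
```

On this data the threshold rule has zero CV error and the majority baseline does not, so the bucket returns the threshold learner's hypothesis.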
Pros and cons of a “bucket of models”
• Pros:
– Simple
– Will give results not much worse than the best of the "base learners"
• Cons:
– What if there's not a single best learner?
• Other approaches: – Vote the hypotheses (how would you weight them?) – Combine them some other way? – How about learning to combine the hypotheses?
Common Ensemble classifiers
• Bagging/Random Forests • “Bucket of models” • Stacking • Boosting
Stacked learners: first attempt
• Input:
– Your top T favorite learners (or tunings): L1,…,LT
– A dataset D containing examples (x,y), …
• Learning algorithm:
– Train L1,…,LT on D to get h1,…,hT
– Create a new dataset D' containing examples (x',y'), …
  • x' is a vector of the T predictions h1(x),…,hT(x)
  • y' is the label y for x
– Train YFCL (your favorite classifier/learner) on D' to get h' --- which combines the predictions!
• To predict on a new x:
– Construct x' as before and predict h'(x')
Problem: if Li overfits the data D, then hi(x) could be almost always the same as y in D'. But that won't be the case on an out-of-sample test example x.
The fix: make an x’ in D’ look more like the out-of-sample test cases.
Stacked learners: the right way
• Input:
– Your top T favorite learners (or tunings): L1,…,LT
– A dataset D containing examples (x,y), …
• Learning algorithm:
– Train L1,…,LT on D to get h1,…,hT
– Also run 10-CV and collect the CV test predictions (x, Li(D-x, x)) for each example x and each learner Li
– Create a new dataset D' containing examples (x',y'), …
  • x' is the vector of CV test predictions <L1(D-x, x), L2(D-x, x), …>
  • y' is the label y for x
– Train YFCL on D' to get h' --- which combines the predictions!
• To predict on a new x:
– Construct x' using h1,…,hT (as before) and predict h'(x')
D-x is the CV training set for x; Li(D*, x) is the label predicted by the hypothesis that Li learns from D*. Note that we don't do any CV at prediction time.
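A minimal sketch of the corrected procedure, assuming each learner is a function from a training set to a hypothesis; the fold count (5 rather than the slide's 10) and the toy learners are my choices for brevity.

```python
import random

def cv_predictions(learners, data, k=5, seed=0):
    """For every example x, collect each learner's prediction Li(D - x, x),
    i.e., the prediction of a model trained WITHOUT x's fold."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [None] * len(data)
    for i in range(k):
        train = [data[j] for f in range(k) if f != i for j in folds[f]]
        models = [L(train) for L in learners]
        for j in folds[i]:
            preds[j] = [m(data[j][0]) for m in models]
    return preds

def stack(learners, combiner, data):
    """Stacking 'the right way': the combiner is trained on CV predictions."""
    d_prime = [(xp, y) for xp, (_, y) in zip(cv_predictions(learners, data), data)]
    h_prime = combiner(d_prime)           # YFCL trained on D'
    models = [L(data) for L in learners]  # h1..hT trained on ALL of D
    return lambda x: h_prime([m(x) for m in models])

# Toy setup: two threshold learners and a voting combiner (which here
# ignores D' -- a real combiner would be trained on it).
L_a = lambda train: (lambda x: 1 if x > 5 else -1)
L_b = lambda train: (lambda x: 1 if x > 7 else -1)
vote = lambda d_prime: (lambda p: 1 if sum(p) >= 0 else -1)
data = [(x, 1 if x > 5 else -1) for x in range(12)]
h = stack([L_a, L_b], vote, data)
print(h(9), h(0))
```

The key point is in `cv_predictions`: the x' used for training the combiner comes from models that never saw x, matching what the combiner will see at prediction time.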
Pros and cons of stacking
• Pros:
– Fairly simple
– Slow, but easy to parallelize
• Cons:
– What if there's not a single best combination scheme?
– E.g., for movie recommendation, sometimes L1 is best for users with many ratings and L2 is best for users with few ratings.
Multi-level stacking/blending
• Learning algorithm:
– Train L1,…,LT on D to get h1,…,hT
– Also run 10-CV and collect the CV test predictions (x, Li(D-x, x)) for each example x and each learner Li
– Create a new dataset D' containing examples (x',y'), …
  • x' is the vector of CV test predictions <L1(D-x, x), L2(D-x, x), …> combined with additional "meta-features" from x (e.g., numRatings, userAge, …)
  • y' is the label y for x
– Train YFCL on D' to get h' --- which combines the predictions!
• To predict on a new x:
– Construct x' using h1,…,hT (as before) and predict h'(x')

D-x is the CV training set for x and Li(D*, x) is a predicted label. The combiner might create features like "L1(D-x, x)=y* and numRatings>20", so the choice of classifier to rely on depends on meta-features of x.
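The blended feature construction itself is one line; here is a sketch (the meta-feature names numRatings and userAge come from the slide, the values are invented):

```python
def blended_features(cv_preds, meta):
    """x' = the base learners' CV predictions + meta-features of x."""
    return list(cv_preds) + [meta["numRatings"], meta["userAge"]]

x_prime = blended_features([1, -1], {"numRatings": 35, "userAge": 27})
print(x_prime)  # [1, -1, 35, 27]
```

A combiner trained on such x' can then learn rules like "trust L1's prediction when numRatings > 20".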
Comments
• Ensembles based on blending/stacking were key approaches used in the Netflix competition
– Winning entries blended many types of classifiers
• Ensembles based on stacking are the main architecture used in Watson
– Not all of the base classifiers/rankers are learned, however; some are hand-programmed.
Common Ensemble classifiers
• Bagging/Random Forests • “Bucket of models” • Stacking • Boosting
Boosting
Thanks to: A Short Introduction to Boosting. Yoav Freund and Robert E. Schapire, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Timeline: 1950 Turing (T) · 1984 Valiant (V) · 1988 Kearns & Valiant (KV) · 1989 Schapire (S) · 1993 Drucker, Schapire & Simard (DSS) · 1995 Freund & Schapire (FS) · …
• Valiant (CACM 1984) and PAC-learning: partly inspired by Turing

             Formal                   Informal
  AI         Valiant (1984)           Turing test (1950)
  Complexity Turing machine (1936)

Question: what sort of AI questions can we formalize and study with formal methods?
"Weak" PAC-learning (Kearns & Valiant, 1988): like PAC learning, but the learner only has to beat chance by a little --- an error bound of, say, ε = 0.49.
"Weak" PAC-learning is equivalent to "strong" PAC-learning (!) (Schapire, 1989): any learner achieving, say, ε = 0.49 can be boosted into one with arbitrarily small error.
“Weak” PAC-learning is equivalent to “strong” PAC-learning (!) (Schapire 89)
• The basic idea exploits the fact that you can learn a little on every distribution:
– Learn h1 from D0 with error < 49%
– Modify D0 so that h1 has error exactly 50% (call this D1)
  • Flip a coin: if heads, wait for an example x where h1(x)=f(x); otherwise wait for an example where h1(x)!=f(x).
– Learn h2 from D1 with error < 49%
– Modify D1 so that h1 and h2 always disagree (call this D2)
– Learn h3 from D2 with error < 49%
– Now take the majority vote of h1, h2, and h3. This has lower error than any of the "weak" hypotheses h1, h2, or h3.
– Repeat this as needed to lower the error rate more….
Boosting can actually help experimentally…but… (Drucker, Schapire, Simard)
• Same construction as on the previous slide, but note the practical problems:
– Very wasteful of examples
– "Wait for examples where they disagree"!?
AdaBoost (Freund and Schapire)
• Sample with replacement from the current distribution to train each round's base classifier.
• Increase the weight of xi if ht is wrong; decrease the weight if ht is right.
• Output a linear combination of the base hypotheses; the best weight αt depends on the error of ht.
AdaBoost: Adaptive Boosting (Freund & Schapire, 1995)
Theoretically, one can upper bound the training error of boosting.
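The algorithm described above can be sketched from scratch. This is a minimal illustration, not the paper's code: the weak learner is a 1-D decision stump, and the toy interval dataset is made up (it needs three rounds because no single stump can represent an interval).

```python
import math

def weighted_stump(data, w):
    """Weak learner: 1-D decision stump minimizing WEIGHTED training error."""
    best = None
    for thresh in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if sign * (1 if x >= thresh else -1) != y)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    err, thresh, sign = best
    return (lambda x: sign * (1 if x >= thresh else -1)), err

def adaboost(data, rounds):
    """AdaBoost: reweight the examples each round; output sign(sum_t alpha_t h_t)."""
    n = len(data)
    w = [1.0 / n] * n                        # start from the uniform distribution
    hs, alphas = [], []
    for _ in range(rounds):
        h, err = weighted_stump(data, w)
        err = max(err, 1e-12)                # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        hs.append(h)
        alphas.append(alpha)
        # Increase the weight of x_i if h_t is wrong, decrease it if right:
        w = [wi * math.exp(-alpha * y * h(x)) for (x, y), wi in zip(data, w)]
        z = sum(w)
        w = [wi / z for wi in w]             # renormalize to a distribution
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1

# Toy target: +1 on the interval [3, 6]; three rounds of stumps fit it exactly.
data = [(x, 1 if 3 <= x <= 6 else -1) for x in range(10)]
H = adaboost(data, rounds=3)
print([H(x) for x in range(10)])
```

Here each round learns a one-sided rule (x≥3, not x≥7, and a constant), and the weighted vote of the three carves out the interval.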
Boosting: A toy example (sequence of figure slides omitted) --- thanks, Rob Schapire
Boosting improved decision trees…
Boosting: Analysis
Theorem: if the error at round t of the base classifier is εt = 1/2 - γt, then the training error of the boosted classifier is bounded by

  ∏t 2·sqrt( εt(1-εt) ) = ∏t sqrt( 1 - 4γt² ) ≤ exp( -2 Σt γt² )

The algorithm doesn't need to know any bound on γt in advance, though --- it adapts to the actual sequence of errors.
BOOSTING AS OPTIMIZATION
Even boosting single features worked well…
Reuters newswire corpus
Some background facts
Coordinate descent optimization to minimize f(w):
• For t = 1,…,T, or until convergence:
  • For i = 1,…,N (where w = <w1,…,wN>):
    – Pick w* to minimize f(<w1,…,wi-1, w*, wi+1,…,wN>)
    – Set wi = w*
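The loop above can be sketched directly. The "pick w* to minimize" step is done here with a crude grid line search over a few candidate steps (my simplification); the toy objective is a separable quadratic whose minimum we know.

```python
def coordinate_descent(f, w, passes=100):
    """Minimize f(w) by optimizing one coordinate at a time."""
    w = list(w)
    for _ in range(passes):
        for i in range(len(w)):
            # Pick w* to (approximately) minimize f along coordinate i.
            candidates = [w[i] + step for step in (-1.0, -0.1, 0.0, 0.1, 1.0)]
            w[i] = min(candidates, key=lambda c: f(w[:i] + [c] + w[i + 1:]))
    return w

# Toy objective: f(w) = (w1 - 2)^2 + (w2 + 1)^2, minimized at w = (2, -1).
f = lambda w: (w[0] - 2) ** 2 + (w[1] + 1) ** 2
w_opt = coordinate_descent(f, [0.0, 0.0])
print(w_opt)  # [2.0, -1.0]
```

Because the candidate list always contains the current value (step 0.0), each update can only decrease f.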
Boosting as optimization using coordinate descent

• With a small number of possible h's, you can think of boosting as finding a linear combination of these:

  H(x) = sign( Σi wi hi(x) )

• So boosting is sort of like stacking:

  h(x) ≡ <h1(x),…,hN(x)>   (stacked) instance vector
  w ≡ <α1,…,αN>   weight vector

• Boosting uses coordinate descent to minimize an upper bound on the error rate:

  Σi exp( -yi Σt wt ht(xi) )
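This view can be made concrete: fix a small pool of base hypotheses and run coordinate descent on the exponential loss. The crude numeric line search and the toy hypotheses below are my choices for a self-contained sketch (for ±1 base hypotheses the exact step has the AdaBoost closed form).

```python
import math

def exp_loss(w, hs, data):
    """The bound being minimized: sum_i exp( -y_i * sum_t w_t h_t(x_i) )."""
    return sum(math.exp(-y * sum(wt * h(x) for wt, h in zip(w, hs)))
               for x, y in data)

def boost_by_coordinate_descent(hs, data, passes=30):
    """Coordinate descent on the exponential loss: repeatedly improve one
    base-hypothesis weight at a time by a small line search."""
    w = [0.0] * len(hs)
    for _ in range(passes):
        for i in range(len(w)):
            candidates = [w[i] + d for d in (-0.5, -0.1, 0.0, 0.1, 0.5)]
            w[i] = min(candidates,
                       key=lambda c: exp_loss(w[:i] + [c] + w[i + 1:], hs, data))
    return w

# Toy 1-D task (+1 on the interval [3, 6]) and three fixed base hypotheses
# whose weighted vote can represent it.
h1 = lambda x: 1 if x >= 3 else -1
h2 = lambda x: -1 if x >= 7 else 1
h3 = lambda x: -1
hs = [h1, h2, h3]
data = [(x, 1 if 3 <= x <= 6 else -1) for x in range(10)]
w = boost_by_coordinate_descent(hs, data)
H = lambda x: 1 if sum(wt * h(x) for wt, h in zip(w, hs)) >= 0 else -1
print(w, [H(x) for x in range(10)])
```

Driving the exponential loss down forces every training margin up, so the resulting weighted vote classifies the toy set correctly.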
Boosting and optimization
1999 - FHT
Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000.
Compared using AdaBoost to set feature weights vs direct optimization of feature weights to minimize log-likelihood, squared error, …
BOOSTING AS MARGIN LEARNING
Boosting didn’t seem to overfit…(!)
(figure omitted: training and test error vs. number of boosting rounds)
…because it turned out to be increasing the margin of the classifier
(figures omitted: margin distributions after 100 rounds and 1000 rounds)
Boosting movie (animation omitted)
Some background facts

• Coordinate descent optimization (as before), and the norms of a vector w = <w1,…,wT>:

  ||w||1 ≡ |w1| + … + |wT|
  ||w||2 ≡ sqrt( w1² + … + wT² )
  ||w||k ≡ ( |w1|^k + … + |wT|^k )^(1/k)
  ||w||∞ ≡ max( |w1|,…,|wT| )
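A quick numeric sanity check of these definitions (the vector is arbitrary):

```python
import math

def norm(w, p):
    """||w||_p; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(wi) for wi in w)
    return sum(abs(wi) ** p for wi in w) ** (1.0 / p)

w = [3.0, -4.0]
print(norm(w, 1), norm(w, 2), norm(w, math.inf))  # 7.0 5.0 4.0
```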
Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!)

  h(x) ≡ <h1(x),…,hT(x)>   (stacked) instance vector
  w ≡ <α1,…,αT>   weight vector

Boosting:

  maxw minx  y · w·h(x) / ( ||w||1 ||h(x)||∞ )

optimized by coordinate descent, where ||w||1 = Σt αt and ||h(x)||∞ = maxt |ht(x)|
The “coordinates” are being extended by one in each round of boosting --- usually, unless you happen to generate the same tree twice
Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!)

  h(x) ≡ <h1(x),…,hT(x)>   (stacked) instance vector
  w ≡ <α1,…,αT>   weight vector

Boosting:

  maxw minx  y · w·h(x) / ( ||w||1 ||h(x)||∞ )

optimized by coordinate descent, where ||w||1 = Σt αt and ||h(x)||∞ = maxt |ht(x)|

Linear SVMs:

  maxw minx  y · w·h(x) / ( ||w||2 ||h(x)||2 )

optimized by QP, …, where ||w||2 = sqrt( Σt αt² ) and ||h(x)||2 = sqrt( Σt ht(x)² )
BOOSTING VS BAGGING
Boosting vs Bagging
• Both build weighted combinations of classifiers, each classifier being learned on a sample of the original data
• Boosting reweights the distribution in each iteration
• Bagging doesn’t
Boosting vs Bagging
• Boosting finds a linear combination of weak hypotheses
– Theory says it reduces bias
– In practice it also reduces variance, especially in later iterations
Boosting vs Bagging
(chart omitted: results on the Splice and Audiology datasets)
WRAPUP ON BOOSTING
Boosting in the real world
• William's wrap-up:
– Boosting is not discussed much in the ML research community any more
  • It's much too well understood
– It's really useful in practice as a meta-learning method
  • E.g., boosted Naïve Bayes usually beats Naïve Bayes
– Boosted decision trees are:
  • almost always competitive with respect to accuracy
  • very robust against rescaling numeric features, extra features, non-linearities, …
  • somewhat slower to learn and use than many linear classifiers
  • a little less reliable for getting probabilities out of