
Page 1

Learning Ensembles

293S T. Yang. UCSB, 2017.

Page 2

Outline

• Learning Ensembles
• Random Forests
• AdaBoost

Page 3

Training data: Restaurant example

• Examples described by attribute values (Boolean, discrete, continuous)

• E.g., situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)

Page 4

A decision tree to decide whether to wait

• Imagine someone making a sequence of decisions.

Page 5


Learning Ensembles

• Learn multiple classifiers separately
• Combine their decisions (e.g., using weighted voting)
• When combining multiple decisions, random errors cancel each other out and correct decisions are reinforced.

Diagram: Training Data → Data1, Data2, …, Data m → Learner1, Learner2, …, Learner m → Model1, Model2, …, Model m → Model Combiner → Final Model
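For example, the Model Combiner above can be a simple weighted vote over the models' outputs. A minimal sketch, assuming ±1 predictions, NumPy, and hypothetical helper names (not from the slides):

import numpy as np

def weighted_vote(predictions, weights):
    # predictions: (m, n) array, one row of {-1, +1} outputs per model
    # weights:     (m,) array of non-negative model weights
    scores = np.asarray(weights) @ np.asarray(predictions)   # weighted sum per example
    return np.where(scores >= 0, 1, -1)                      # sign, ties broken as +1

preds = [[ 1,  1, -1, -1],
         [ 1, -1, -1,  1],
         [-1,  1, -1,  1]]
print(weighted_vote(preds, [0.5, 0.3, 0.2]))   # -> [ 1  1 -1  1]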

Page 6

Homogeneous Ensembles

• Use a single, arbitrary learning algorithm but manipulate training data to make it learn multiple models.
– Data1 ≠ Data2 ≠ … ≠ Data m
– Learner1 = Learner2 = … = Learner m

• Methods for changing training data:
– Bagging: resample the training data
– Boosting: reweight the training data
– DECORATE: add additional artificial training data

Diagram: Training Data → Data1, Data2, …, Data m → Learner1, Learner2, …, Learner m

Page 7


Bagging

• Create ensembles by repeatedly and randomly resampling the training data (Breiman, 1996).

• Given a training set of size n, create m sample sets by bootstrap sampling (drawing n examples with replacement).
– Each bootstrap sample set will on average contain 63.2% of the unique training examples; the rest are replicates.
• Combine the m resulting models using majority vote.

• Decreases error by decreasing the variance that comes from unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is changed slightly.
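A minimal sketch of bagging with decision trees, assuming NumPy arrays X, y (with y in {-1, +1}) and that scikit-learn is available; the helper names are illustrative, not from the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    # Train n_models trees, each on a bootstrap sample of the n training examples.
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample n indices with replacement;
        # on average ~63.2% of the unique examples appear, the rest are replicates
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Combine the m models by majority vote (labels assumed to be -1/+1).
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)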

Page 8

Random Forests

• Introduce two sources of randomness: bagging and random input vectors
– Each tree is grown using a bootstrap sample of the training data
– At each node, the best split is chosen from a random sample of m variables instead of all M variables
• m is held constant while the forest is grown
• Each tree is grown to the largest extent possible
• Bagging with decision trees is the special case of random forests where m = M
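With scikit-learn, the two sources of randomness correspond directly to constructor arguments. A sketch, assuming scikit-learn is available (parameter names are scikit-learn's, not the slides'):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True grows each tree on a bootstrap sample ("bagging");
# max_features plays the role of m above: the number of candidate variables
# tried at each split, out of all M features. max_features=None (m = M)
# reduces to plain bagging of trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
# forest.fit(X_train, y_train); forest.predict(X_test)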

Page 9

Random Forests

Page 10

Random Forest Algorithm

• Good accuracy without over-fitting
• Fast (can be faster than growing and pruning a single tree); easily parallelized
• Handles high-dimensional data without much problem

Page 11

Boosting: AdaBoost

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

– Simple, with a theoretical foundation

Page 12


AdaBoost: Adaptive Boosting

• Use training-set re-weighting
– Each training sample has a weight that determines its probability of being selected for a training set.

• AdaBoost is an algorithm for constructing a “strong” classifier as a linear combination of simple “weak” classifiers

• The final classification is based on a weighted sum of the weak classifiers' outputs

Page 13

AdaBoost: An Easy Flow

Diagram: Original training set → Data set 1, Data set 2, …, Data set T → Learner1, Learner2, …, LearnerT → weighted combination

Training instances that are wrongly predicted by Learner1 will be weighted more for Learner2.

Page 14


AdaBoost Terminology

• ht(x): a “weak” or basis classifier
• H(x): the “strong” or final classifier

• Weak Classifier: < 50% error over any distribution

• Strong Classifier: thresholded linear combination of weak classifier outputs
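Written out, these are the same objects that appear in the AdaBoost.M1 pseudocode on Page 16:

h_t : X \to \{-1, +1\}, \qquad
\varepsilon_t = \Pr_{i \sim D_t}\bigl[h_t(x_i) \neq y_i\bigr] < \tfrac{1}{2}, \qquad
H(x) = \operatorname{sign}\!\Bigl(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Bigr)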

Page 15

And in a Picture

Figure annotations: a training case correctly classified; a training case that has a large weight in this round; a decision tree (DT) that has a strong vote.

Page 16

AdaBoost.M1

• Given a training set X = {(x1, y1), …, (xm, ym)}
• yi ∈ {−1, +1} is the correct label of instance xi ∈ X
• Initialize the distribution D1(i) = 1/m (the weights of the training cases)
• For t = 1, …, T:
– Find a weak classifier (“rule of thumb”) ht : X → {−1, +1} with small error εt on Dt
– Update the distribution Dt on {1, …, m}, using αt = ½ ln((1 − εt)/εt)
• Output the final hypothesis:

H(x) = \operatorname{sign}\!\Bigl(\sum_{t=1}^{T} \alpha_t\, h_t(x)\Bigr)

Note that yi ht(xi) > 0 if ht classifies xi correctly, and yi ht(xi) < 0 if it is wrong.
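A minimal Python sketch of this pseudocode, assuming NumPy arrays X, y with y in {-1, +1}; the decision-stump weak learner and all helper names are illustrative choices, not something specified on the slide:

import numpy as np

def stump_fit(X, y, D):
    # Weak learner used for this sketch: the best single-feature threshold rule
    # under the current weights D. Any learner with weighted error < 0.5 would do.
    best = (0, 0.0, 1, np.inf)                 # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] < thr, s, -s)
                err = D[pred != y].sum()       # weighted error on D_t
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

def adaboost_fit(X, y, T=10):
    # AdaBoost.M1 with labels in {-1, +1}, following the pseudocode above.
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        j, thr, s, err = stump_fit(X, y, D)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        pred = np.where(X[:, j] < thr, s, -s)
        D = D * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight correct
        D = D / D.sum()                        # normalize by Z_t
        ensemble.append((alpha, j, thr, s))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * np.where(X[:, j] < thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(score)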

Page 17


Reweighting

The weight of a training case is multiplied by exp(−αt y h(x)): it increases when y h(x) = −1 (misclassified) and decreases when y h(x) = +1 (correctly classified).
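Written out, this is the standard AdaBoost reweighting rule, consistent with the training-error analysis on Page 25:

D_{t+1}(i) \;=\; \frac{D_t(i)\,\exp\!\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t},
\qquad
Z_t \;=\; \sum_{i=1}^{m} D_t(i)\,\exp\!\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)

Since αt > 0, a misclassified case (yi ht(xi) = −1) is multiplied by e^{αt} > 1 and a correctly classified case (yi ht(xi) = +1) by e^{−αt} < 1, before normalization by Zt.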

Page 18

Toy Example

Page 19

Round 1

Weak classifier: if h1 < 0.2 → +1, else −1

Page 20

Round 2

Weak classifier: if h2 < 0.8 → +1, else −1

Page 21

Round 3

Weak classifier: if h3 > 0.7 → +1, else −1

Page 22

Final Combination
• if h1 < 0.2 → +1, else −1
• if h2 < 0.8 → +1, else −1
• if h3 > 0.7 → +1, else −1
(combined by a weighted vote, as in the sketch below)
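As an illustration only: reading the three rules as thresholds on three input values, and using hypothetical round weights α1, α2, α3 (the real values depend on each round's error, which is shown only in the slide figures), the final classifier is the weighted vote:

alphas = [0.42, 0.65, 0.92]                      # hypothetical alpha_1..alpha_3

h1 = lambda v1, v2, v3: 1 if v1 < 0.2 else -1    # round 1: if h1 < 0.2 -> +1, else -1
h2 = lambda v1, v2, v3: 1 if v2 < 0.8 else -1    # round 2: if h2 < 0.8 -> +1, else -1
h3 = lambda v1, v2, v3: 1 if v3 > 0.7 else -1    # round 3: if h3 > 0.7 -> +1, else -1

def final_H(v1, v2, v3):
    # sign of the alpha-weighted sum of the three weak classifiers
    score = sum(a * h(v1, v2, v3) for a, h in zip(alphas, (h1, h2, h3)))
    return 1 if score >= 0 else -1

print(final_H(0.1, 0.5, 0.9))   # all three rules vote +1 -> prints 1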

Page 23


Pros and cons of AdaBoost

Advantages:
– Very simple to implement
– Performs feature selection, resulting in a relatively simple classifier
– Fairly good generalization

Disadvantages:
– Suboptimal solution
– Sensitive to noisy data and outliers

Page 24


References

• Duda, Hart, et al. – Pattern Classification

• Freund – “An Adaptive Version of the Boost by Majority Algorithm”

• Freund – “Experiments with a New Boosting Algorithm”

• Freund, Schapire – “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”

• Friedman, Hastie, et al. – “Additive Logistic Regression: A Statistical View of Boosting”

• Jin, Liu, et al. (CMU) – “A New Boosting Algorithm Using Input-Dependent Regularizer”

• Li, Zhang, et al. – “FloatBoost Learning for Classification”

• Opitz, Maclin – “Popular Ensemble Methods: An Empirical Study”

• Ratsch, Warmuth – “Efficient Margin Maximization with Boosting”

• Schapire, Freund, et al. – “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods”

• Schapire, Singer – “Improved Boosting Algorithms Using Confidence-Rated Predictions”

• Schapire – “The Boosting Approach to Machine Learning: An Overview”

• Zhang, Li, et al. – “Multi-view Face Detection with FloatBoost”

Page 25

AdaBoost: Training Error Analysis

• Suppose the combined score is f(x) = \sum_{t=1}^{T} \alpha_t h_t(x), so that H(x) = \operatorname{sign}(f(x)).

• Unrolling the weight updates gives

D_{T+1}(i) \;=\; \frac{1}{m}\,\frac{\exp\!\bigl(-y_i f(x_i)\bigr)}{\prod_{t=1}^{T} Z_t}.

• As \sum_i D_{T+1}(i) = 1, this is equivalent to

\frac{1}{m}\sum_{i=1}^{m} \exp\!\bigl(-y_i f(x_i)\bigr) \;=\; \prod_{t=1}^{T} Z_t.

• Considering that H(x_i) \neq y_i implies y_i f(x_i) \le 0, the indicator satisfies [H(x_i) \neq y_i] \le \exp\!\bigl(-y_i f(x_i)\bigr).

• Finally, the training error is bounded:

\frac{1}{m}\,\bigl|\{\, i : H(x_i) \neq y_i \,\}\bigr| \;\le\; \frac{1}{m}\sum_{i=1}^{m} \exp\!\bigl(-y_i f(x_i)\bigr) \;=\; \prod_{t=1}^{T} Z_t.

Here \{i : H(x_i) \neq y_i\} is the set of misclassified training examples, and |\{i : H(x_i) \neq y_i\}| is its size, i.e., the sum of the indicators [H(x_i) \neq y_i] over all i.

Page 26

AdaBoost: How to choose αt

• According to the bound \frac{1}{m}\bigl|\{i : H(x_i) \neq y_i\}\bigr| \le \prod_{t=1}^{T} Z_t, minimizing the error bound can be done greedily by minimizing Z_t in each round.

• Therefore, we choose

\alpha_t^{*} \;=\; \arg\min_{\alpha_t} Z_t \;=\; \arg\min_{\alpha_t} \sum_{i} D_t(i)\, e^{-\alpha_t\, y_i h_t(x_i)}.

• Let u_i = y_i h_t(x_i), so u_i = +1 if h_t(x_i) = y_i and u_i = -1 if h_t(x_i) \neq y_i. Treating u_i as this binary-valued variable, and using \sum_i D_t(i) = 1 with \varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i):

Z_t \;=\; e^{-\alpha_t}\!\!\sum_{i:\, h_t(x_i) = y_i}\!\! D_t(i) \;+\; e^{\alpha_t}\!\!\sum_{i:\, h_t(x_i) \neq y_i}\!\! D_t(i) \;=\; (1-\varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t}.

• The right-hand term is minimized by setting dZ_t/d\alpha_t = 0, which gives

\alpha_t \;=\; \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}.

• In this sense, AdaBoost greedily minimizes (an upper bound on) the training error.
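A quick numerical check of the last step, using a hypothetical εt = 0.3 (not a value from the slides):

import numpy as np

eps = 0.3                                                 # hypothetical weighted error eps_t
alphas = np.linspace(0.01, 2.0, 2000)
Z = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)    # Z_t as a function of alpha_t

print(alphas[np.argmin(Z)])                               # ~0.4236, the numerical minimizer
print(0.5 * np.log((1 - eps) / eps))                      # 0.4236..., the closed form above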

å å