
Model Combination

Tong Zhang

Rutgers University


Standard model for supervised learning

Goal: want to predict Y based on X using a prediction function f(x)

Data: (X, Y) are randomly drawn from an underlying distribution D
Training data: iid samples from D:

S_n = {(X_1, Y_1), ..., (X_n, Y_n)}

Learning: construct a prediction function f from the training data
want to minimize the expected future loss over D:

E_{(X,Y) \sim D} L(f(X), Y)

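The expected loss above cannot be computed exactly; a minimal sketch (using scikit-learn on a synthetic dataset, which is an assumption and not from the slides) approximates it by the average loss on held-out samples:

```python
# Minimal sketch: approximate E L(f(X), Y) by an average over held-out samples.
# The dataset, model, and squared-error loss are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

f = Ridge(alpha=1.0).fit(X_train, y_train)               # prediction function f learned from S_n
test_loss = np.mean((f.predict(X_test) - y_test) ** 2)   # empirical estimate of E (f(X) - Y)^2
print(f"estimated squared-error risk: {test_loss:.3f}")
```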

Learning Algorithm

Learning algorithm A
given training data S_n = {(X_i, Y_i)}_{i=1,...,n}
learn a prediction rule f = A(S_n) from S_n

Examples: linear logistic regression, SVM, ridge regression, etc.
learning a linear prediction function f(x) = β^T x from the training data

Model selection: many algorithms, which one to choose?
goal: good interpretability and/or better performance

Model averaging/combination: many algorithms, how to combine?
goal: better performance

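As a concrete illustration (a sketch with scikit-learn; the synthetic dataset and the particular algorithms are assumptions), several learning algorithms A can be applied to the same training set S_n, each producing its own prediction rule f = A(S_n):

```python
# Sketch: several learning algorithms A applied to the same training data S_n.
# Dataset and algorithm choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

algorithms = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": SVC(kernel="linear"),
    "5-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}
for name, A in algorithms.items():
    f = A.fit(X_train, y_train)                # f = A(S_n)
    print(name, accuracy_score(y_val, f.predict(X_val)))
```

Model selection would keep the single best of these rules; the combination methods on the following slides reuse all of them.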


Model Combination

Also called ensemble, blending, or averaging
Equally weighted averaging (voting)
Exponentially weighted model averaging
Weight optimization using stacking
Bagging
Additive models and boosting


Equally weighted averaging (voting)

Models {M_j}_{j=1,...,L}; prediction functions f_j(x) from training data. The voting ensemble is

f(x) = \frac{1}{L} \sum_{j=1}^{L} f_j(x)

For convex loss functions: reduces model estimation variance. Least squares:

\underbrace{E_{X,Y} \Big( \frac{1}{L} \sum_{j=1}^{L} f_j(X) - Y \Big)^2}_{\text{ensemble performance}}
= \underbrace{\frac{1}{L} \sum_{j=1}^{L} E_{X,Y} \big( f_j(X) - Y \big)^2}_{\text{average single model performance}}
- \underbrace{\frac{1}{L} \sum_{j=1}^{L} E_{X,Y} \big( f_j(X) - f(X) \big)^2}_{\text{model diversity}}

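This identity can be checked numerically; the sketch below uses synthetic noisy predictors as stand-in models (purely an assumption for illustration) and compares the two sides:

```python
# Sketch: numerical check of the voting decomposition for squared loss.
# The "models" are synthetic noisy predictors, assumed purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, L = 100000, 5
y = rng.normal(size=n)                                 # targets Y
# each f_j is the truth plus its own noise, so the models are diverse
preds = np.stack([y + rng.normal(scale=0.5, size=n) for _ in range(L)])
f = preds.mean(axis=0)                                 # voting ensemble

ensemble = np.mean((f - y) ** 2)                       # ensemble performance
avg_single = np.mean([(fj - y) ** 2 for fj in preds])  # average single model performance
diversity = np.mean([(fj - f) ** 2 for fj in preds])   # model diversity

print(ensemble, avg_single - diversity)                # the two sides agree up to float error
```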

Model Selection versus Averaging

Model selection: works better if one model is significantly more accurate than the other models
no ambiguity about which single model is better

Equally weighted averaging: works better if all models have similar prediction accuracy but are different
some ambiguity about which single model is better

Key to the success of model ensembles (averaging):
all models have reasonably high performance
models are diverse (they make different predictions)



Ideas to Generate Diverse Models

In competitions, to improve performance, one often creates a diverse set of models with similar performance.

How to do so?
different learning algorithms
tuning parameter variation
different variations of features
randomization (e.g. bagging)

Averaging diverse models nearly always helps a little
related reason: when models are ambiguous, model combination is more stable than model selection

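A minimal sketch of these ideas (scikit-learn on synthetic data; the specific algorithms and parameter values are assumptions): vary the algorithm and its tuning parameters, then average the predicted probabilities equally.

```python
# Sketch: diverse models from different algorithms, tuning parameters, and
# randomization, combined by equally weighted averaging of predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = [
    LogisticRegression(C=0.1, max_iter=1000),                   # one algorithm / regularization
    LogisticRegression(C=10.0, max_iter=1000),                  # tuning parameter variation
    KNeighborsClassifier(n_neighbors=15),                       # different algorithm
    RandomForestClassifier(n_estimators=200, random_state=1),   # randomization
]
probs = np.mean([m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models], axis=0)
print("ensemble accuracy:", accuracy_score(y_test, probs > 0.5))
```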


Example

Winning strategy in a recent predictive modeling competition with more than 200 teams
Performance is measured by a certain error metric: the smaller the better; test data size: 60,000
Ensemble: generate 62 models using randomization and variations of ensemble trees

Test error:
single best: 0.688 (this performance already wins the competition)
combination of five models: 0.685 (0.4% error reduction)
combination of thirty models: 0.682 (0.9% error reduction)
combination of sixty-two models: 0.680 (1.2% error reduction)

Larger gains on some other data: create a more diverse ensemble



Exponentially Weighted Model Averaging

Observe data (X_i, Y_i) and want to find a combination that minimizes a certain error ERROR(f)

Consider model averaging with model outputs f_1(x), ..., f_L(x), and

f(x) = w_1 f_1(x) + ... + w_L f_L(x)

Exponentially weighted model averaging:

w_j = \frac{e^{-\lambda \, \mathrm{ERROR}(f_j)}}{\sum_{i=1}^{L} e^{-\lambda \, \mathrm{ERROR}(f_i)}}

Gives larger weight to smaller error (the error can be measured on either training or validation data)
λ = ∞: model selection (pick the model with the smallest error)
λ = 0: equally weighted model averaging

Good theoretical properties; related to Bayesian model averaging

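A minimal sketch of the weight formula (numpy only; the error values below are made-up placeholders, not results from the slides):

```python
# Sketch: exponentially weighted model averaging weights from per-model errors.
# The error values are placeholder numbers chosen for illustration.
import numpy as np

def exp_weights(errors, lam):
    """w_j proportional to exp(-lambda * ERROR(f_j)), normalized to sum to 1."""
    errors = np.asarray(errors, dtype=float)
    logits = -lam * errors
    logits -= logits.max()             # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

errors = [0.30, 0.31, 0.35, 0.50]
print(exp_weights(errors, lam=0.0))    # equal weights: plain averaging
print(exp_weights(errors, lam=50.0))   # strongly favors the smallest error
print(exp_weights(errors, lam=1e6))    # effectively model selection
```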

Stacking

Split the data into training + validation
L models f_j(x) obtained from the training data
On the validation data define VALIDATION ERROR(f_j), and find a combination

f(x) = \sum_{j=1}^{L} w_j f_j(x), \quad w_j \ge 0, \quad \sum_j w_j = 1

that minimizes the validation error:

\min_w \; \mathrm{VALIDATION\ ERROR}\Big( \sum_{j=1}^{L} w_j f_j \Big)

more complex than exponentially weighted model averaging

More generally: treat {f_j(x)} as features and use any learning algorithm to train a combination model with features {f_j(x)} on the validation data.

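One way to sketch the constrained weight optimization is with scipy's SLSQP solver on synthetic validation predictions (the data, noise levels, and names here are all assumptions):

```python
# Sketch: stacking weights fit on validation data, constrained to the simplex
# (w_j >= 0, sum_j w_j = 1), minimizing validation squared error.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_val, L = 500, 4
y_val = rng.normal(size=n_val)                           # validation targets
# synthetic model outputs f_j(x) on the validation set, with varying quality
F = np.stack([y_val + rng.normal(scale=s, size=n_val) for s in (0.3, 0.5, 0.8, 1.2)], axis=1)

def validation_error(w):
    return np.mean((F @ w - y_val) ** 2)

w0 = np.full(L, 1.0 / L)
res = minimize(validation_error, w0, method="SLSQP",
               bounds=[(0.0, 1.0)] * L,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
print("stacking weights:", np.round(res.x, 3))
print("stacked validation error:", validation_error(res.x))
```

As the slide notes, the combiner need not be a constrained linear model: any learning algorithm can be trained on the L model outputs as features.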


Example

Winning strategy of an old competition: CoNLL 2003 English named entity recognition
Performance measured by F-measure (similar to accuracy): the higher the better
Train four classifiers using different algorithms
Best single: 89.94
Equally weighted averaging: 91.56
Stacking (training a linear combination of the classifiers): 91.63


Generating Diverse Ensemble: Bagging

How to generate ensemble candidates from a single learning algorithm A?

Answer: bagging
build f_i from bootstrapped training samples:
draw n samples S_n^i from the training data S_n, with replacement
f_i = A(S_n^i)
form the voted classifier: f = \frac{1}{L} \sum_{i=1}^{L} f_i

When does it help?
unstable classifiers: the f_i are very different from each other
the performances of the f_i are similar

How does it help?
variance reduction: the ensemble is more stable although each classifier uses less data

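A minimal bagging sketch (scikit-learn decision trees on a synthetic dataset; the base learner and data are assumptions):

```python
# Sketch: bagging — bootstrap the training set, fit one model per resample, vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

L = 25
members = []
for i in range(L):
    Xb, yb = resample(X_train, y_train, replace=True, random_state=i)   # S_n^i: n samples with replacement
    members.append(DecisionTreeClassifier(random_state=i).fit(Xb, yb))  # f_i = A(S_n^i)

votes = np.mean([m.predict(X_test) for m in members], axis=0)           # f = (1/L) sum_i f_i
print("single tree:", accuracy_score(y_test, members[0].predict(X_test)))
print("bagged vote:", accuracy_score(y_test, votes > 0.5))
```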


Other Ideas to Generate Ensemble Candidates

Use boosting: can be regarded either as a single model or as an ensemble method
Use different learning algorithms (different loss functions, regularizations)
Use different variations of data processing

different transformations of features
different subsets (selection) of features – feature selection is unstable, so combining models built from multiple subsets helps

Use different tuning parameter configurations:
e.g. varying λ for ridge regression or K for nearest neighbors

Use randomization
randomly select training data (bagging)
randomly select features or randomly select basis functions
randomize the algorithm, especially if it contains combinatorial elements (e.g. random forests for decision tree building)

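As one concrete sketch of the feature-randomization idea (numpy + scikit-learn; the subset size, base learner, and dataset are assumptions), each ensemble member is trained on a random subset of the columns and the predictions are averaged:

```python
# Sketch: diversity from random feature subsets — each member sees a random
# subset of the features; predicted probabilities are averaged over members.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

rng = np.random.default_rng(3)
L, subset_size = 20, 12
probs = np.zeros(len(y_test))
for _ in range(L):
    cols = rng.choice(X.shape[1], size=subset_size, replace=False)   # random feature subset
    m = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    probs += m.predict_proba(X_test[:, cols])[:, 1] / L
print("feature-subset ensemble accuracy:", accuracy_score(y_test, probs > 0.5))
```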


Decision Tree

[Decision tree figure: internal nodes test thresholds on the features (A < 2, B < 2, B < 4) with true/false branches; leaves predict the class labels X or Y]

A and B: features
X and Y: predicted class labels
Training: greedily test features to split at each node

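Greedy splitting of this kind can be illustrated with a small scikit-learn sketch (the library's CART-style trees also pick one feature threshold per node greedily; the synthetic dataset is an assumption):

```python
# Sketch: fit a shallow decision tree and print its greedy feature/threshold splits.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=4)
tree = DecisionTreeClassifier(max_depth=3, random_state=4).fit(X, y)
# each internal node tests one feature against a threshold, like "A < 2" on the slide
print(export_text(tree, feature_names=[f"x{i}" for i in range(5)]))
```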

Remarks on Decision Tree

Advantages:
interpretable
handles non-homogeneous features easily
finds non-linear interactions

Disadvantage:
usually not the most accurate classifier by itself


Improving Decision Tree Performance

Improve accuracy through tree ensemble learning:
bagging
generate bootstrap samples
train one tree per bootstrap sample
take an equally weighted average of the trees

random forest
bagging with additional randomization

boosting


Bagging and Random Forest

Bagging
build f_i from bootstrapped training samples
form the voted classifier: f = \frac{1}{m} \sum_{i=1}^{m} f_i

Random forest
generate bootstrap samples
build one tree per bootstrap sample
increase diversity via additional randomization: randomly pick a subset of features to consider for splitting at each node
take an equally weighted average of the trees

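A sketch comparing the two with scikit-learn (the dataset and settings are assumptions): BaggingClassifier only bootstraps the data, while RandomForestClassifier additionally restricts each split to a random feature subset via max_features.

```python
# Sketch: bagged trees vs. random forest on the same data.
# BaggingClassifier bootstraps the training set; RandomForestClassifier also
# restricts each split to a random subset of features (max_features).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=5)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=5)

for name, model in [("bagged trees", bagging), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```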


Why Bagging and RF work

Trees are unstable, so different random trees are very different
May generate deep trees, each of which overfits the data
Equally weighted averaging reduces the variance
Related to Bayesian model averaging over all possible trees


Summary

Model selection: find the best single model
better interpretability, but may be unstable under ambiguity

Model averaging/combination: find the best combination of models
various methods

Model averaging is often more effective than model selection
an ensemble can improve stability: the ensemble is stable even if the individual models are not

Key issue in model averaging: generate diverse models with good performance

