
Model Combination

Tong Zhang

Rutgers University


Standard model for supervised learning

Goal: want to predict Y based on X using a prediction function f(x)

Data: (X, Y) are randomly drawn from an underlying distribution D
Training data: iid samples from D:

S_n = {(X_1, Y_1), ..., (X_n, Y_n)}

Learning: construct a prediction function f from the training data
want to minimize the expected future loss over D:

E_{(X,Y) \sim D} L(f(X), Y)

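The expected loss above cannot be computed exactly; a minimal sketch (using scikit-learn on a synthetic dataset, which is an assumption and not from the slides) approximates it by the average loss on held-out samples:

```python
# Minimal sketch: approximate E L(f(X), Y) by an average over held-out samples.
# The dataset, model, and squared-error loss are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

f = Ridge(alpha=1.0).fit(X_train, y_train)               # prediction function f learned from S_n
test_loss = np.mean((f.predict(X_test) - y_test) ** 2)   # empirical estimate of E (f(X) - Y)^2
print(f"estimated squared-error risk: {test_loss:.3f}")
```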

Learning Algorithm

Learning algorithm A
given training data S_n = {(X_i, Y_i)}_{i=1,...,n}
learn a prediction rule f = A(S_n) from S_n

Examples: linear logistic regression, SVM, ridge regression, etc.
learning a linear prediction function f(x) = β^T x from the training data

Model selection: many algorithms, which one to choose?
goal: good interpretability and/or better performance

Model averaging/combination: many algorithms, how to combine?
goal: better performance

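As a concrete illustration (a sketch with scikit-learn; the synthetic dataset and the particular algorithms are assumptions), several learning algorithms A can be applied to the same training set S_n, each producing its own prediction rule f = A(S_n):

```python
# Sketch: several learning algorithms A applied to the same training data S_n.
# Dataset and algorithm choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

algorithms = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": SVC(kernel="linear"),
    "5-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}
for name, A in algorithms.items():
    f = A.fit(X_train, y_train)                # f = A(S_n)
    print(name, accuracy_score(y_val, f.predict(X_val)))
```

Model selection would keep the single best of these rules; the combination methods on the following slides reuse all of them.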


Model Combination

Also called ensemble, blending, or averaging
Equally weighted averaging (voting)
Exponentially weighted model averaging
Weight optimization using stacking
Bagging
Additive models and boosting


Equally weighted averaging (voting)

Models {M_j}_{j=1,...,L}; prediction functions f_j(x) from training data. The voting ensemble is

f(x) = \frac{1}{L} \sum_{j=1}^{L} f_j(x)

For convex loss functions: reduces model estimation variance. Least squares:

\underbrace{E_{X,Y} \Big( \frac{1}{L} \sum_{j=1}^{L} f_j(X) - Y \Big)^2}_{\text{ensemble performance}}
= \underbrace{\frac{1}{L} \sum_{j=1}^{L} E_{X,Y} \big( f_j(X) - Y \big)^2}_{\text{average single model performance}}
- \underbrace{\frac{1}{L} \sum_{j=1}^{L} E_{X,Y} \big( f_j(X) - f(X) \big)^2}_{\text{model diversity}}

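This identity can be checked numerically; the sketch below uses synthetic noisy predictors as stand-in models (purely an assumption for illustration) and compares the two sides:

```python
# Sketch: numerical check of the voting decomposition for squared loss.
# The "models" are synthetic noisy predictors, assumed purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, L = 100000, 5
y = rng.normal(size=n)                                 # targets Y
# each f_j is the truth plus its own noise, so the models are diverse
preds = np.stack([y + rng.normal(scale=0.5, size=n) for _ in range(L)])
f = preds.mean(axis=0)                                 # voting ensemble

ensemble = np.mean((f - y) ** 2)                       # ensemble performance
avg_single = np.mean([(fj - y) ** 2 for fj in preds])  # average single model performance
diversity = np.mean([(fj - f) ** 2 for fj in preds])   # model diversity

print(ensemble, avg_single - diversity)                # the two sides agree up to float error
```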

Model Selection versus Averaging

Model selection: works better if one model is significantly more accurate than the other models
no ambiguity about which single model is better

Equally weighted averaging: works better if all models have similar prediction accuracy but are different
some ambiguity about which single model is better

Key to the success of model ensembles (averaging):
all models have reasonably high performance
models are diverse (they make different predictions)



Ideas to Generate Diverse Models

In competitions, to improve performance, one often creates a diverse set of models with similar performance.

How to do so?
different learning algorithms
tuning parameter variation
different variations of features
randomization (e.g. bagging)

Averaging diverse models nearly always helps a little
related reason: when models are ambiguous, model combination is more stable than model selection

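A minimal sketch of these ideas (scikit-learn on synthetic data; the specific algorithms and parameter values are assumptions): vary the algorithm and its tuning parameters, then average the predicted probabilities equally.

```python
# Sketch: diverse models from different algorithms, tuning parameters, and
# randomization, combined by equally weighted averaging of predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = [
    LogisticRegression(C=0.1, max_iter=1000),                   # one algorithm / regularization
    LogisticRegression(C=10.0, max_iter=1000),                  # tuning parameter variation
    KNeighborsClassifier(n_neighbors=15),                       # different algorithm
    RandomForestClassifier(n_estimators=200, random_state=1),   # randomization
]
probs = np.mean([m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models], axis=0)
print("ensemble accuracy:", accuracy_score(y_test, probs > 0.5))
```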


Example

Winning strategy in a recent predictive modeling competition with more than 200 teams
Performance is measured by a certain error metric: the smaller the better; test data size: 60,000
Ensemble: generate 62 models using randomization and variations of ensemble trees

Test error:
single best: 0.688 (this performance already wins the competition)
combination of five models: 0.685 (0.4% error reduction)
combination of thirty models: 0.682 (0.9% error reduction)
combination of sixty-two models: 0.680 (1.2% error reduction)

Larger gains on some other data: create a more diverse ensemble



Exponentially Weighted Model Averaging

Observe data (X_i, Y_i) and want to find a combination that minimizes a certain error ERROR(f)

Consider model averaging with model outputs f_1(x), ..., f_L(x), and

f(x) = w_1 f_1(x) + ... + w_L f_L(x)

Exponentially weighted model averaging:

w_j = \frac{e^{-\lambda \, \mathrm{ERROR}(f_j)}}{\sum_{i=1}^{L} e^{-\lambda \, \mathrm{ERROR}(f_i)}}

Gives larger weight to smaller error (the error can be measured on either training or validation data)
λ = ∞: model selection (pick the model with the smallest error)
λ = 0: equally weighted model averaging

Good theoretical properties; related to Bayesian model averaging

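A minimal sketch of the weight formula (numpy only; the error values below are made-up placeholders, not results from the slides):

```python
# Sketch: exponentially weighted model averaging weights from per-model errors.
# The error values are placeholder numbers chosen for illustration.
import numpy as np

def exp_weights(errors, lam):
    """w_j proportional to exp(-lambda * ERROR(f_j)), normalized to sum to 1."""
    errors = np.asarray(errors, dtype=float)
    logits = -lam * errors
    logits -= logits.max()             # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

errors = [0.30, 0.31, 0.35, 0.50]
print(exp_weights(errors, lam=0.0))    # equal weights: plain averaging
print(exp_weights(errors, lam=50.0))   # strongly favors the smallest error
print(exp_weights(errors, lam=1e6))    # effectively model selection
```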

Stacking

Split the data into training + validation
L models f_j(x) obtained from the training data
On the validation data define VALIDATION ERROR(f_j), and find a combination

f(x) = \sum_{j=1}^{L} w_j f_j(x), \quad w_j \ge 0, \quad \sum_j w_j = 1

that minimizes the validation error:

\min_w \; \mathrm{VALIDATION\ ERROR}\Big( \sum_{j=1}^{L} w_j f_j \Big)

more complex than exponentially weighted model averaging

More generally: treat {f_j(x)} as features and use any learning algorithm to train a combination model with features {f_j(x)} on the validation data.

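One way to sketch the constrained weight optimization is with scipy's SLSQP solver on synthetic validation predictions (the data, noise levels, and names here are all assumptions):

```python
# Sketch: stacking weights fit on validation data, constrained to the simplex
# (w_j >= 0, sum_j w_j = 1), minimizing validation squared error.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_val, L = 500, 4
y_val = rng.normal(size=n_val)                           # validation targets
# synthetic model outputs f_j(x) on the validation set, with varying quality
F = np.stack([y_val + rng.normal(scale=s, size=n_val) for s in (0.3, 0.5, 0.8, 1.2)], axis=1)

def validation_error(w):
    return np.mean((F @ w - y_val) ** 2)

w0 = np.full(L, 1.0 / L)
res = minimize(validation_error, w0, method="SLSQP",
               bounds=[(0.0, 1.0)] * L,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
print("stacking weights:", np.round(res.x, 3))
print("stacked validation error:", validation_error(res.x))
```

As the slide notes, the combiner need not be a constrained linear model: any learning algorithm can be trained on the L model outputs as features.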


Example

Winning strategy of an old competition: CoNLL 2003 English named entity recognition
Performance measured by F-measure (similar to accuracy): the higher the better
Train four classifiers using different algorithms
Best single: 89.94
Equally weighted averaging: 91.56
Stacking (training a linear combination of the classifiers): 91.63


Generating Diverse Ensemble: Bagging

How to generate ensemble candidates from a single learning algorithm A?

Answer: bagging
build f_i from bootstrapped training samples:
draw n samples S_n^i from the training data S_n, with replacement
f_i = A(S_n^i)
form the voted classifier: f = \frac{1}{L} \sum_{i=1}^{L} f_i

When does it help?
unstable classifiers: the f_i are very different from each other
the performances of the f_i are similar

How does it help?
variance reduction: the ensemble is more stable although each classifier uses less data

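A minimal bagging sketch (scikit-learn decision trees on a synthetic dataset; the base learner and data are assumptions):

```python
# Sketch: bagging — bootstrap the training set, fit one model per resample, vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

L = 25
members = []
for i in range(L):
    Xb, yb = resample(X_train, y_train, replace=True, random_state=i)   # S_n^i: n samples with replacement
    members.append(DecisionTreeClassifier(random_state=i).fit(Xb, yb))  # f_i = A(S_n^i)

votes = np.mean([m.predict(X_test) for m in members], axis=0)           # f = (1/L) sum_i f_i
print("single tree:", accuracy_score(y_test, members[0].predict(X_test)))
print("bagged vote:", accuracy_score(y_test, votes > 0.5))
```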


Other Ideas to Generate Ensemble Candidates

Use boosting: can be regarded either as a single model or as an ensemble method
Use different learning algorithms (different loss functions, regularizations)
Use different variations of data processing

different transformations of features
different subsets (selection) of features – feature selection is unstable, so combining models built from multiple subsets helps

Use different tuning parameter configurations:
e.g. varying λ for ridge regression or K for nearest neighbors

Use randomization
randomly select training data (bagging)
randomly select features or randomly select basis functions
randomize the algorithm, especially if it contains combinatorial elements (e.g. random forests for decision tree building)

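As one concrete sketch of the feature-randomization idea (numpy + scikit-learn; the subset size, base learner, and dataset are assumptions), each ensemble member is trained on a random subset of the columns and the predictions are averaged:

```python
# Sketch: diversity from random feature subsets — each member sees a random
# subset of the features; predicted probabilities are averaged over members.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

rng = np.random.default_rng(3)
L, subset_size = 20, 12
probs = np.zeros(len(y_test))
for _ in range(L):
    cols = rng.choice(X.shape[1], size=subset_size, replace=False)   # random feature subset
    m = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    probs += m.predict_proba(X_test[:, cols])[:, 1] / L
print("feature-subset ensemble accuracy:", accuracy_score(y_test, probs > 0.5))
```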


Decision Tree

[Decision tree figure: internal nodes test thresholds on the features (A < 2, B < 2, B < 4) with true/false branches; leaves predict the class labels X or Y]

A and B: features
X and Y: predicted class labels
Training: greedily test features to split at each node

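Greedy splitting of this kind can be illustrated with a small scikit-learn sketch (the library's CART-style trees also pick one feature threshold per node greedily; the synthetic dataset is an assumption):

```python
# Sketch: fit a shallow decision tree and print its greedy feature/threshold splits.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=4)
tree = DecisionTreeClassifier(max_depth=3, random_state=4).fit(X, y)
# each internal node tests one feature against a threshold, like "A < 2" on the slide
print(export_text(tree, feature_names=[f"x{i}" for i in range(5)]))
```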

Remarks on Decision Tree

Advantages:
interpretable
handles non-homogeneous features easily
finds non-linear interactions

Disadvantage:
usually not the most accurate classifier by itself


Improving Decision Tree Performance

Improve accuracy through tree ensemble learning:
bagging
generate bootstrap samples
train one tree per bootstrap sample
take an equally weighted average of the trees

random forest
bagging with additional randomization

boosting


Bagging and Random Forest

Bagging
build f_i from bootstrapped training samples
form the voted classifier: f = \frac{1}{m} \sum_{i=1}^{m} f_i

Random forest
generate bootstrap samples
build one tree per bootstrap sample
increase diversity via additional randomization: randomly pick a subset of features to consider for splitting at each node
take an equally weighted average of the trees

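A sketch comparing the two with scikit-learn (the dataset and settings are assumptions): BaggingClassifier only bootstraps the data, while RandomForestClassifier additionally restricts each split to a random feature subset via max_features.

```python
# Sketch: bagged trees vs. random forest on the same data.
# BaggingClassifier bootstraps the training set; RandomForestClassifier also
# restricts each split to a random subset of features (max_features).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=5)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=5)

for name, model in [("bagged trees", bagging), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```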


Why Bagging and RF work

Trees are unstable, so different random trees are very different
May generate deep trees, each of which overfits the data
Equally weighted averaging reduces the variance
Related to Bayesian model averaging over all possible trees


Summary

Model selection: find the best single model
better interpretability, but may be unstable under ambiguity

Model averaging/combination: find the best combination of models
various methods

Model averaging is often more effective than model selection
an ensemble can improve stability: the ensemble is stable even if the individual models are not

Key issue in model averaging: generate diverse models with good performance

