
Introduction to Ensembles of Experts and Hybrid Methods

Nathan Intrator
Tel-Aviv University and Brown University
www.physics.brown.edu/users/faculty/intrator

Collaboration: Yair Shimshoni, Yuval Raviv, Inna Stainvas, Shimon Cohen

Outline
• Basic definitions
• Averaging formulation
• Combining models of different nature
• Assessing the goodness of an expert
• Predicting large ensemble performance from a small ensemble
• Regularization revisited

netarch

General Setup
• Problem: Small training set, large number of dependent variables
• Best solution: Detailed modeling of the data with very few free parameters to estimate
• Second best: Use a more flexible model, estimate many parameters

General Setup
Tradeoff between:
• number of free parameters
• data complexity
• reliability of the estimation

What are Ensembles and Hybrid architectures
Definition: Combining different models where each is capable of modeling the observations separately

Reasons for Ensembles and Hybrid Methods
• Uncertainty about the desired model
  – Uncertainty about model parameters
  – Uncertainty about model capacity
  – Uncertainty about model complexity
  – Uncertainty about model type and architecture

Uncertainty about model parameters
• When the optimization solution is unique, uncertainty results from the choice of the training set
• When the solution is non-unique, additional uncertainty results from the initial choice of model parameters


(continued)
Usually addressed by:
• Imposing a prior $\pi(W)$ on the distribution of parameters
• Integration over the distribution:
$$\int \pi(W)\, W(x)\, dx.$$
This may be problematic due to multiple local minima.


(continued)
A better approach: Integrate (average) over the prediction $M_W$ of all these models
$$\int \pi(W)\, M_W\, dW.$$
Leads to ensemble of experts as an approximation to a model posterior
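
As an illustration of averaging over model predictions, the integral over models can be approximated by a finite average over Q trained experts. The sketch below is a minimal one under stated assumptions: the helper name `ensemble_predict`, the toy linear experts, and the Gaussian spread of their slopes are all hypothetical and are not taken from the slides.

```python
import numpy as np

def ensemble_predict(models, x, weights=None):
    """Approximate the integral of pi(W) * M_W dW by a finite (weighted) average.

    models  : list of callables, each mapping an input array to predictions
    weights : optional non-negative importance of each model (normalized here);
              None means a plain average, i.e. equal weight for every expert.
    """
    preds = np.stack([m(x) for m in models])      # shape (Q, n_points)
    if weights is None:
        return preds.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, preds, axes=1)         # weighted sum over the Q experts

# Toy experts (assumption): linear models y = a * x whose slopes are scattered
# around 2.0, standing in for parameters obtained from different trainings.
rng = np.random.default_rng(0)
slopes = rng.normal(loc=2.0, scale=0.5, size=200)
experts = [lambda x, a=a: a * x for a in slopes]

x = np.linspace(-1.0, 1.0, 5)
y_bar = ensemble_predict(experts, x)
```

Passing `weights` turns the plain average into a weighted one, which is one way to plug in an estimated importance for each expert.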


Sequential methods of a Hybrid flavor
• Additive and Generalized additive models (GAM)
• Projection pursuit regression
• Matching pursuit: Choose from a (nonorthogonal) collection of basis functions


Reasons for combination
• Efficiency difference between models, training methodology
• Sequential modeling
• Divide and conquer
• Model interpretation

Actual combination
Similar to the combination of models with different parameter values:
• Construct (or empirically estimate) a posterior to the models $\pi(M_i(W))$, where $i$ represents the different models
• Integrate over the prior
$$\int \pi(M_i(W))\, M_i(W)$$

Pros and cons of combining models using a posterior distribution
Pros
• Appears to model the data better, fit the more appropriate models
• Removes naturally very unrelated models
• Smaller ensemble size works fine



Cons
• Regularization is simpler
• Sensitivity to wrong models is reduced
• Training for optimal ensemble performance is simpler

Main caveat for "smart" averaging: Construct useful Model Assessment
• A "good" model assessment could be useful for model averaging.
• When two models have similar predictions, should we give them the same importance?
• Simply put, if a 40 hidden unit architecture performs as well as a 5 hidden unit architecture, which one should we prefer?
Information theory may surprise us here.

Model Assessment (Hinton & van Camp, 1993)
Basic idea
• The performance of an expert is a function of its error (residual) and a function of its complexity.
• The complexity of a model is a function of the number of parameters and the required accuracy for the parameters

Model Assessment (Hinton & van Camp, 1993)
• To use the same scale, we measure the code-length of the residual and of the model parameters
• The code-length of a model is obtained using the posterior probability of the parameters
• Model assessment is thus inversely proportional to the sum of the code-lengths
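
The code-length idea can be sketched numerically. The example below is a simplified illustration, not the procedure of the cited paper: it assumes zero-mean Gaussian coding for both residuals and parameters with fixed standard deviations, and the function names and toy numbers are hypothetical. In particular it codes every parameter at the same fixed precision, so it does not capture the required-accuracy term mentioned above.

```python
import numpy as np

def gaussian_code_length(values, sigma):
    """Code-length in nats: negative log-density under N(0, sigma^2).

    Assumption: a fixed Gaussian stands in for the distribution used to code
    the values; with sigma = 1 every coded value contributes a positive cost.
    """
    values = np.asarray(values, dtype=float)
    return float(np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                        + values**2 / (2 * sigma**2)))

def assessment_weights(residuals_per_model, params_per_model,
                       sigma_resid=1.0, sigma_param=1.0):
    """Score each model by 1 / (residual code-length + parameter code-length)."""
    totals = np.array([
        gaussian_code_length(r, sigma_resid) + gaussian_code_length(p, sigma_param)
        for r, p in zip(residuals_per_model, params_per_model)
    ])
    inverse = 1.0 / totals
    return totals, inverse / inverse.sum()    # normalized to sum to one

# Two hypothetical models with identical residuals but 5 versus 40 parameters.
residuals = [np.full(50, 0.1), np.full(50, 0.1)]
parameters = [np.full(5, 0.5), np.full(40, 0.5)]
totals, weights = assessment_weights(residuals, parameters)
```

Under these fixed-precision assumptions the larger model always pays a longer parameter code, so it receives the smaller weight. Allowing each parameter its own coding precision would change that accounting.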



Cons
• Regularization is simpler and sensitivity reduced
• Variance of the ensemble can be reduced
• Training for optimal ensemble performance
• Predict large ensemble performance from a small set

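Variance/Bias Decomposition for Ensembles
$$\bar f(x) = \frac{1}{Q}\sum_{i=1}^{Q} f_i(x).$$
$$E\big[(\bar f - E[\bar f])^2\big] = E\Big[\Big(\frac{1}{Q}\sum f_i - E\Big[\frac{1}{Q}\sum f_i\Big]\Big)^2\Big] = E\Big[\Big(\frac{1}{Q}\sum f_i\Big)^2\Big] - \Big(E\Big[\frac{1}{Q}\sum f_i\Big]\Big)^2. \tag{1}$$
The first RHS term can be rewritten as
$$E\Big[\Big(\frac{1}{Q}\sum f_i\Big)^2\Big] = \frac{1}{Q^2}\sum E[f_i^2] + \frac{2}{Q^2}\sum_{i<j} E[f_i f_j],$$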

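Variance/Bias Decomposition for Ensembles
and the second term gives
$$\Big(E\Big[\frac{1}{Q}\sum f_i\Big]\Big)^2 = \frac{1}{Q^2}\sum (E[f_i])^2 + \frac{2}{Q^2}\sum_{i<j} E[f_i]\,E[f_j].$$
Plugging these equalities into (1) gives
$$E\big[(\bar f - E[\bar f])^2\big] = \frac{1}{Q^2}\sum \big\{E[f_i^2] - (E[f_i])^2\big\} + \frac{2}{Q^2}\sum_{i<j} \big\{E[f_i f_j] - E[f_i]\,E[f_j]\big\}.$$
Set
$$\gamma = \mathrm{Var}(f_i) + (Q-1)\max_{i,j}\big(E[f_i f_j] - E[f_i]\,E[f_j]\big).$$
It follows [$ab \le \frac{a^2+b^2}{2} \Rightarrow E[f_i f_j] - E[f_i]\,E[f_j] \le \max_i \mathrm{Var}(f_i)$] that
$$\mathrm{Var}(\bar f) \le \frac{1}{Q}\,\gamma \le \max_i \mathrm{Var}(f_i). \tag{2}$$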
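
The bound on the ensemble variance can be checked numerically. The sketch below is an illustration under assumptions: the simulated predictions, the function name `ensemble_variance_terms`, and the use of max_i Var(f_i) with the maximum covariance taken over pairs i != j (so that the bound also covers predictors with unequal variances) are choices made for this example, not details from the slides.

```python
import numpy as np

def ensemble_variance_terms(preds):
    """Return (Var(f_bar), gamma / Q, max_i Var(f_i)) from sampled predictions.

    preds : array of shape (n_trials, Q); column i holds the prediction f_i at
            one fixed input over repeated trainings. Requires Q >= 2.
    """
    Q = preds.shape[1]
    cov = np.cov(preds, rowvar=False, bias=True)    # (Q, Q) covariance matrix
    var_bar = preds.mean(axis=1).var()              # variance of the average
    max_var = np.max(np.diag(cov))
    max_cov = np.max(cov[~np.eye(Q, dtype=bool)])   # largest off-diagonal entry
    gamma = max_var + (Q - 1) * max_cov
    return var_bar, gamma / Q, max_var

# Simulated experts (assumption): a shared component makes them correlated.
rng = np.random.default_rng(0)
Q = 10
shared = rng.normal(size=(5000, 1))
preds = 0.3 * shared + rng.normal(size=(5000, Q))

var_bar, bound, max_var = ensemble_variance_terms(preds)
print(var_bar, bound, max_var)
```

When the shared component is removed the experts are close to uncorrelated and the middle quantity approaches max_var / Q, which is the 1/Q behaviour of the variance.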

Error as a function of ensemble size and training time
(Horn et al., 98)
[Figure: two panels plot ARV against training time t (0 to 140) for Q=1 and Q=2; a third panel plots ARV against 1/Q (0 to 1).]
Different ensembles of two predictors as a function of training time. The variance goes down as 1/Q.

Regularization revisited
• Consider a highly non-natural problem for NN
• Low dimensional (highly nonlinear)
• Study the ability to control model properties: Capacity, Variance, Bias/Smoothness
• Easy visualization of Generalization Properties

The Two-Spiral Problem

[Figure: the two interleaved spirals, one class marked "o" and the other "x"; both axes run from -6 to 6.]
• 194 X,Y values. Half produce a 1 output, and half produce 0
• Lang and Witbrock (1988) proposed a 2-5-5-5-1 net (138 weights)
• Fahlman and Lebiere (1990): Cascade Correlation architecture
• Baum and Lang (1991): a 2-50-1 net could be consistent with the training set, but could not be found from random initial weights
• Deffuant (1995) suggested the "Perceptron Membrane": piecewise linear discriminating surfaces using 29 perceptrons. Non-smooth solution
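
For readers who want to reproduce the data set, the 194 points can be generated as follows. This is a sketch that assumes the commonly used benchmark parameterization (97 points per spiral, three turns, maximum radius 6.5), which the slides do not state. The function name and the optional Gaussian noise argument are hypothetical.

```python
import numpy as np

def two_spirals(noise_sd=0.0, seed=0):
    """Return (X, labels) for the two-spiral problem: 194 points in the plane.

    Assumption: constants follow the widely used benchmark generator
    (angle step pi/16, radius shrinking linearly from 6.5).
    """
    n = 97                                    # points per spiral
    i = np.arange(n)
    angle = i * np.pi / 16.0                  # 96 steps of pi/16 = three turns
    radius = 6.5 * (104 - i) / 104.0
    x = radius * np.sin(angle)
    y = radius * np.cos(angle)
    first = np.column_stack([x, y])           # class 1
    second = -first                           # class 0: point reflection
    X = np.vstack([first, second])
    labels = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    if noise_sd > 0:
        rng = np.random.default_rng(seed)
        X = X + rng.normal(scale=noise_sd, size=X.shape)
    return X, labels

X, labels = two_spirals()
X_noisy, _ = two_spirals(noise_sd=0.3)        # noisy variant of the same points
```

The second spiral is the first one reflected through the origin, which is why half of the points carry each label.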

The noisy spirals
[Figure: the two-spiral data with noise added, one class marked "o" and the other "x"; both axes run from -6 to 6.]
Additional Gaussian noise (SD=0.3)

Different noise levels
Results of training with different noise levels, $\sigma = 0, \ldots, 0.8$

Different noise levels and optimal weight decay

Summary: 40 Net Ensemble
Top left: No constraints. Top right: Optimal WD, no noise. Bottom left: Optimal noise, no WD. Bottom right: Optimal noise & WD.
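
The combination of noise injection, weight decay, and averaging can be sketched on a toy problem. The example below is an assumption-laden stand-in, not the networks used for the spirals: it replaces each neural network by a ridge-regularized polynomial fit (the L2 penalty plays the role of weight decay), and all function names, the toy data, and the parameter values are hypothetical.

```python
import numpy as np

def fit_ridge(features, targets, weight_decay):
    """Least squares with an L2 penalty, the linear analogue of weight decay."""
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + weight_decay * np.eye(d),
                           features.T @ targets)

def poly_features(x, degree=5):
    """Polynomial basis 1, x, ..., x^degree (stand-in for a flexible model)."""
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

def noisy_bootstrap_ensemble(x, y, n_models=40, noise_sd=0.05,
                             weight_decay=1e-2, seed=0):
    """Train each member on a bootstrap resample with Gaussian input noise."""
    rng = np.random.default_rng(seed)
    n = len(x)
    coefs = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                     # resample with replacement
        x_b = x[idx] + rng.normal(scale=noise_sd, size=n)    # perturb the inputs
        coefs.append(fit_ridge(poly_features(x_b), y[idx], weight_decay))
    coefs = np.array(coefs)                                  # (n_models, degree + 1)
    # Ensemble prediction: plain average over the members.
    return lambda x_new: (poly_features(x_new) @ coefs.T).mean(axis=1)

rng = np.random.default_rng(1)
x_train = np.linspace(-1.0, 1.0, 30)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.1, size=30)
predict = noisy_bootstrap_ensemble(x_train, y_train)
y_hat = predict(np.linspace(-1.0, 1.0, 7))
```

Setting `noise_sd=0` or `weight_decay=0` gives the analogues of the "no noise" and "no WD" panels, so the four configurations can be compared with the same code.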

Local GAM
• Local generalized additive model (Hastie and Tibshirani, 1986)
• Uses a polynomial fit of degree 1 (optimal)
• The span parameter determines the degree of locality of the estimation
• Ideal model for the problem
  – Local with control on locality
  – No ridge constraints
  – Provides a unique model (less variability)
  – Smoothness controlled by locality and degree of the polynomial
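
The role of the span parameter can be illustrated with a one-dimensional local fit of degree 1. This is a minimal sketch of a local linear smoother of the kind an additive model could apply to one input coordinate; it is not the full local GAM, and the function name, the tricube weighting, and the default span are assumptions made for the example.

```python
import numpy as np

def local_linear_fit(x_train, y_train, x_query, span=0.3):
    """Degree-1 local regression with a nearest-neighbour span.

    span : fraction of the training points used for each local fit
           (assumed to lie in (0, 1]); smaller values mean a more local,
           less smooth estimate.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    n = len(x_train)
    k = min(n, max(2, int(np.ceil(span * n))))     # neighbours per local fit
    out = np.empty(len(x_query))
    for j, x0 in enumerate(x_query):
        dist = np.abs(x_train - x0)
        idx = np.argsort(dist)[:k]                 # the k nearest points
        h = dist[idx].max()
        # Tricube weights (assumption): nearer points count more.
        w = (1 - (dist[idx] / h) ** 3) ** 3 if h > 0 else np.ones(k)
        A = np.column_stack([np.ones(k), x_train[idx] - x0])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y_train[idx] * sw, rcond=None)
        out[j] = beta[0]                           # intercept = fitted value at x0
    return out

x = np.linspace(0.0, 1.0, 50)
q = np.array([0.1, 0.5, 0.9])
fitted = local_linear_fit(x, np.sin(2 * np.pi * x), q, span=0.2)
```

Because each local fit is a straight line, data that are exactly linear are reproduced exactly for any span, which makes a convenient sanity check.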

Noisy GAM
[Figure: six panels labelled No bootstrap, sd=0, sd=0.025, sd=0.05, sd=0.1, sd=0.3.]
Average of 20 GAMs with varying degrees of noise

Take home from the Spirals
• NN are easy to regularize
  – Weight decay (smoothing)
  – Ensemble average
• Bootstrap with noise is useful for other models' regularization and is not equivalent to smoothing

Challenge: show similar performance using Stacking, Bagging, Boosting, Arcing, Randomization, etc.

Problems in Interpretability of NN
• The model is not identifiable since there is no unique solution to a fixed ANN architecture and learning rule.
• Estimation with gradient descent increases variability of the model due to local minima
• There is no clear optimal network architecture (number of hidden layers, number of hidden units, recurrent, second order, etc.)
• Nonlinear model: all effects should only be calculated locally (per input observation)
• How to devise summary statistics for ranking between variables?

Summary
• While most activity is geared towards same-architecture ensembles, different-architecture ensembles are promising
• Model assessment was presented for same or different architecture ensembles
• Variance control is possible with simple averaging
• Large ensemble performance can be predicted from a small set
