of - unige.it · neta rch. regula rization revisited consider a highly non-natural p roblem fo r nn...
TRANSCRIPT
Introducetion
to
Ensem
blesofExperts
and
Hybrid
M
ethods
NathanIntrator
Tel-AvivUniversityandBrownUniversity
www.physics.brown.edu/users/faculty/intrator
Collaboration:
YairShimshoni,YuvalRaviv,InnaStainvas,ShimonCohen
Outline
�
Basicde�nitions
�
Averagingformulation
�
Combiningmodelsofdi�erentnature
�
Assessingthegoodnessofanexpert
�
Predicting
large
ensemble
performance
from
a
smallensemble
�
Regularizationrevisited
netarch
GeneralSetup
�
Problem:
Smalltrainingset,largenumberof
dependentvariables
�
BestSolution:
Detailedmodelingofthedata
withveryfew
freeparameterstoestimate
�
Second
best:
Useamore exiblemodel,esti-
matemanyparameters
netarch
GeneralSetup
Tradeo�between:
�
numberoffreeparameters
�
datacomplexity
�
reliabilityoftheestimation
netarch
W
hatareEnsemblesandHybridarchitectures
netarch
W
hatareEnsemblesandHybridarchitectures
De�nition:
Combiningdi�erentmodelswhereeachiscapableof
modelingtheobservationsseparately
netarch
ReasonsforEnsemblesandHybridMethods
netarch
ReasonsforEnsemblesandHybridMethods
�
Uncertaintyaboutthedesiredmodel
netarch
ReasonsforEnsemblesandHybridMethods
�
Uncertaintyaboutthedesiredmodel
{Uncertaintyaboutmodelparameters
{Uncertaintyaboutmodelcapacity
{Uncertaintyaboutmodelcomplexity
{Uncertaintyaboutmodeltypeandarchitecturen
etarch
Uncertaintyaboutmodelparameters
�
Whentheoptimizationsolutionisunique,uncer-
taintyresultsfrom
thechoiceofthetrainingset
�
When
thesolution
isnon-unique,additionalun-
certaintyresultsfrom
theinitialchoiceofmodel
parameters
netarch
Uncertaintyaboutmodelparameters
(continued)
Usuallyaddressedby:
�
Imposingaprior�(W
)onthedistributionofpa-
rameters
�
Integrationoverthedistribution:
Z�(W
)W
(x)dx:
Thismaybeproblematicduetomultiplelocalminima.
netarch
Uncertaintyaboutmodelparameters
(continued)
A
betterapproach:Integrate(average)overthepre-
dictionMW
ofallthesemodels
Z�(W
)MW
dW:
Leadstoensembleofexpertsasanapproximationto
amodelposterior
netarch
Combiningmodelsofdi�erentnature
SequentialmethodsofaHybrid avor
�
AdditiveandGeneralizedadditivemodels(GAM)
�
Projectionpursuitregression
�
Matchingpursuit:Choosefrom
a(nonorthogonal)
collectionofbasisfunctions
netarch
Combiningmodelsofdi�erentnature
Reasonsforcombination
�
EÆciency
di�erence
between
models,
training
methodology
�
Sequentialmodeling
�
Divideandconquer
�
Modelinterpretation
netarch
Actualcombination
Similartothecombinationofmodelswithdi�erent
parametervalues:
�
Construct(orempiricallyestimate)aposteriorto
themodels�(Mi (W
)),whereirepresentsthedif-
ferentmodels
�
Integrateovertheprior
Z�(Mi (W
))Mi (W
)
netarch
Prosandconsofcombiningmodelsusinga
posteriordistribution
Pros
�
Appearsto
modelthedatabetter,�tthemore
appropriatemodels
�
Removesnaturallyveryunrelatedmodels
�
Smallerensemblesizeworks�ne
netarch
Prosandconsofcombiningmodelsusinga
posteriordistribution
Cons
�
Regularizationissimpler
�
Sensitivitytowrongmodelsisreduced
�
Trainingforoptimalensembleperformanceissim-
pler
netarch
Maincaveatfor"smart"averaging:Construct
usefulModelAssessment
�
A
"good"modelassessmentcouldbeusefulfor
modelaveraging.
�
Whentwomodelshavesimilarpredictionsshould
wegivethem
sameimportance?
netarch
Maincaveatfor"smart"averaging:Construct
usefulModelAssessment
�
Simplyput,ifa40hiddenunitarchitectureper-
formsaswellasa5hiddenunitarchitecture,which
oneshouldweprefer?
netarch
Maincaveatfor"smart"averaging:Construct
usefulModelAssessment
�
Simplyput,ifa40hiddenunitarchitectureper-
formsaswellasa5hiddenunitarchitecture,which
oneshouldweprefer?
Informationtheorymaysurpriseushere..
netarch
ModelAssessment(Hinton&
vanCamp,1993)
Basicidea
�
Theperformanceofanexpertisafunctionofits
error(residual)andafunctionofitscomplexity.
�
Thecomplexityofamodelisafunction
ofthe
numberofparametersandtherequiredaccuracy
fortheparameters
netarch
ModelAssessment(Hinton&
vanCamp,1993)
�
Tousethesamescale,wemeasurethecode-length
oftheresidualandofthemodelparameters
�
Thecode-lengthofamodelisobtainedusingthe
posteriorprobabilityoftheparameters
�
Modelassessmentisthusinverselyproportionalto
thesum
ofthecode-lengths
netarch
Prosandconsofcombiningmodelsusinga
posteriordistribution
Cons
�
Regularizationissimplerandsensitivityreduced
�
Varianceoftheensemblecanbereduced
�
Trainingforoptimalensembleperformance
�
Predictlargeensembleperformancefrom
asmall
set
netarch
Variance/BiasDecompositionforEnsembles
�f(x)=
1Q
QXi=1
fi (x):
E[(�f�E[�f])2]=
E[(1QX
fi�E[1QX
fi ])2]
=
E[(1QX
fi )2]�(E[1QX
fi ])2:
(1)
The�rstRHSterm
canwerewrittenas
E[(1QX
fi )2]=
1Q2 XE[f2i]+
2Q2 Xi<
jE[fi fj ];
netarch
Variance/BiasDecompositionforEnsembles
andthesecondterm
gives,
(E[1QX
fi ])2=
1Q2 X(E[f2i])2+
2Q2 Xi<
jE[fi ]E[fj ]:
Pluggingtheseequalitiesinto(1)gives
E[(�f�E[�f])2]=
1Q2 XfE[f2i]�(E[fi ])2g+
2Q2 Xi<
j fE[fi fj ]�E[fi ]E[fj ]g:
Set
=
Var(fi )+(Q�1)maxi;j (E[fi fj ]�E[fi ]E[fj ]):
Itfollows[ab�a2+b2
2
)
E[fi fj ]�E[fi ]E[fj ]�maxiVar(fi )]that
Var(�f)�
1Q �max
i
Var(fi ):
(2)
netarch
Error as a function of ensemble size and training time
(Horn et al., 98)
0 20 40 60 80 100 120 140
t
0.070
0.075
0.080
0.085
0.090
0.095
0.100
ARV Q=1
Q=2
0 20 40 60 80 100 120 140
t
0.075
0.080
0.085
0.090
0.095
0.100
ARV
Q=1
Q=2
0 0.2 0.4 0.6 0.8 1
1/Q
0.074
0.076
0.078
0.080
0.082
0.084
ARV
Di�erent ensembles of two predictors as a function of training time. The
variance goes down as 1/Q.
netarch
Regularizationrevisited
�
Considerahighlynon-naturalproblem
forNN
�
Low
dimensional(highlynonlinear)
�
StudytheabilitytocontrolmodelpropertiesCa-
pacity,Variance,Bias/Smoothness
�
EasyvisualizationofGeneralizationProperties
netarch
The Two-Spiral Problem
c(-6, 6)
c(-6
, 6)
-6 -4 -2 0 2 4 6-6
-4-2
02
46
o o
x
o
x
o
x
o
x
o
x
o
x
o
x
ox
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xo xo
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xoxo
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xo xo
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
ox ox ox
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xo
xo xo xo xo
xo
xo
x
o
x
o
x
o
xox
� 194 X,Y values. Half produce a 1 output, and half produce 0
� Lang and Witbrock (1988) proposed a 2-5-5-5-1 net (138 weights)
� Fahlman Lebiere (1990) Cascade Correlation architecture
� Baum and Lang (1991) Net of 2�50�1 could be consistent with training
set, but could not be found from random initial weights
� De�uant (1995) suggested the "Perceptron Membrane": piecewise linear
discriminating surfaces using 29 perceptrons. Non smooth solution
netarch
The noisy spiralsc(-6, 6)
c(-6
, 6)
-6 -4 -2 0 2 4 6
-6-4
-20
24
6
o oo
x
o
x
o
x
o
x
o
x
o
xox
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xo x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xo
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xo xo
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
ox ox
o
xo
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
x
o
xo
xo
xo xo xo
xo
x
ox
o
x
o
xox
o
x
Additional Gaussian noise (SD=0.3)
netarch
Di�erentnoiselevels
Resultsoftrainingwithdi�erentnoiselevels.�=
0;:::;0:8
netarch
Di�erentnoiselevelsandoptimalweightdecayn
etarch
Summary:40NetEnsemble
Top
left:
Noconstrains.Top
right:
OptimalWD,nonoise.Bottom
left:
Optimalnoise,noWD.Bottom
right:Optimalnoise&
WD.
netarch
LocalGAM
�
Localgeneralizedadditivemodel(HastieTibshirani,1986)
�
Usesapolynomial�tofdegree1(optimal)
�
Thespanparameterdeterminesthedegreeoflocalityoftheestimation
�
Idealmodelfortheproblem
{
Localwithcontrolonlocality
{
Noridgeconstraints
{
Providesauniquemodel(lessvariability)
{
Smoothnesscontrolledbylocalityanddegreeofthepolynomial
netarch
Noisy GAM
No bootstrap sd=0 sd=0.025
sd=0.05 sd=0.1 sd=0.3
Average of 20 GAMs with varying degrees of noisenetarch
Takehomefrom
theSpirals
�
NN
areeasytoregularize
{Weightdecay(smoothing)
{Ensembleaverage
�
Bootstrapwith
noiseisusefulforothermodels'
regularizationandisnotequivalenttosmoothingn
etarch
Challenge:
show
similarperformanceusing
Stacking,Bagging,
Boosting,Arcing,Randomization,etc.
netarch
ProblemsinInterpretabilityofNN
�Themodelisnotidenti�ablesincethereisnouniquesolution
toa�xedANN
architectureandlearningrule.
�Estimationwithgradientdescentincreasesvariabilityofthe
modelduetolocalminima
�ThereisnoclearOptimalnetworkarchitecture(numberof
hiddenlayers,numberofhiddenunits,recurrent,secondor-
der,etc.)
�Nonlinearmodel:alle�ectsshouldonlybecalculatedlocally
(perinputobservation)
�How
todevisesummarystatisticsforrankingbetweenvari-
ables?
netarch
Summary
�
Whilemostactivityisgearedtowardssamearchi-
tectureensembles,Di�erentarchitectureensemble
ispromising
�
Modelassessmentwaspresentedforsameordif-
ferentarchitectureensembles
�
Variancecontrolispossiblewithsimpleaveraging
�
Largeensembleperformancecanbepredictedfrom
smallset
netarch