TRANSCRIPT of slides (statweb.stanford.edu/~jhf/talks/isletalk.pdf, 2004-03-21)
Importance Sampling:
An Alternative View of Ensemble Learning
Jerome H. Friedman
Bogdan Popescu
Stanford University
PREDICTIVE LEARNING
Given data: $\{z_i\}_1^N = \{y_i, x_i\}_1^N \sim q(z)$
$y$ = "output" or "response" attribute (variable)
$x = \{x_1, \dots, x_n\}$ = "inputs" or "predictors"
and loss function $L(y, F)$:
estimate $F^*(x) = \arg\min_{F(x)} E_{q(z)} L(y, F(x))$
WHY?
$F^*(x)$ is the best predictor of $y \mid x$ under $L$.
Examples:
Regression: $y, F \in \mathbb{R}$
$L(y, F) = |y - F|$ or $(y - F)^2$
Classification: $y, F \in \{c_1, \dots, c_K\}$
$L(y, F) = L_{y,F}$ (a $K \times K$ matrix)
$F^*(x)$ = "target" function (regression), concept (classification)
Estimate: $\hat F(x)$ = learning procedure applied to $\{z_i\}_1^N$
(CART, MARS, logistic regression)
Here: procedure = "LEARNING ENSEMBLES"
TreeNet (MART)
Random Forests
BASIC LINEAR MODEL
$F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp$
$f(x; p)$ = "base" learner (basis function)
parameters: $p = (p_1, p_2, \dots)$
$p \in \mathcal{P}$ indexes a particular function of $x$ from $\{f(x; p)\}_{p \in \mathcal{P}}$
$a(p)$ = coefficient of $f(x; p)$
Examples:
$f(x; p) = [1 + \exp(-p^T x)]^{-1}$ (neural nets)
$f(x; p)$ = multivariate splines (MARS)
$f(x; p)$ = decision trees (MART, RF)
NUMERICAL QUADRATURE
$\int_{\mathcal{P}} I(p)\, dp \approx \sum_{m=1}^M w_m I(p_m)$
here: $I(p) = a(p)\, f(x; p)$
Quadrature rule defined by:
$\{p_m\}_1^M$ = evaluation points $\in \mathcal{P}$
$\{w_m\}_1^M$ = weights
$F(x) \approx \sum_{m=1}^M w_m\, a(p_m)\, f(x; p_m) = \sum_{m=1}^M c_m f(x; p_m)$
Averaging over $x$:
$\{c_m^*\}_1^M$ = linear regression of $y$ on $\{f(x; p_m)\}_1^M$ (population)
Problem: find good $\{p_m\}_1^M$.
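To make the quadrature view concrete, here is a minimal sketch (a hypothetical one-dimensional example, not from the talk) of the simple rule $r(p)$ = constant: uniform evaluation points with equal weights $w_m = 1/M$.

```python
import random

def mc_quadrature(integrand, num_points, rng):
    """Approximate the integral of `integrand` over [0, 1] by a
    Monte Carlo sum with uniform evaluation points p_m and equal
    weights w_m = 1/M -- the r(p) = constant rule."""
    points = [rng.random() for _ in range(num_points)]
    return sum(integrand(p) for p in points) / num_points

rng = random.Random(0)
# Integrand I(p) = p^2; the exact integral over [0, 1] is 1/3.
estimate = mc_quadrature(lambda p: p * p, 10_000, rng)
print(round(estimate, 3))
```

Uniform $r(p)$ ignores where $I(p)$ is large, and its error shrinks only like $M^{-1/2}$; the following slides motivate concentrating the $p_m$ in important regions instead.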
MONTE CARLO METHODS
$r(p)$ = sampling pdf of $p \in \mathcal{P}$
$\{p_m \sim r(p)\}_1^M$
Simple Monte Carlo: $r(p)$ = constant
Usually not very good.
IMPORTANCE SAMPLING
Customize $r(p)$ for each particular problem ($F^*(x)$):
$r(p_m)$ large $\Longrightarrow$ $p_m$ important to high accuracy
when used with $\{p_{m'}\}_{m' \neq m}$
MONTE CARLO METHODS
(1) "Random" Monte Carlo:
ignore other points: $p_m \sim r(p)$ iid
(2) "Quasi" Monte Carlo:
$\{p_m\}_1^M$ deterministic
account for other points
importance $\to$ groups of points
RANDOM MONTE CARLO
(Lack of) importance $J(p)$ depends only on $p$.
One measure: "partial importance"
$J(p) = E_{q(z)} L(y, f(x; p))$
$p^* = \arg\min_p J(p)$ = best single-point ($M = 1$) rule
$f(x; p^*)$ = optimal single base learner
Usually not very good, especially if
$F^*(x) \notin \{f(x; p)\}_{p \in \mathcal{P}}$.
BUT, often used: single logistic regression or tree.
Note: $J(p_m)$ ignores $\{p_{m'}\}_{m' \neq m}$.
Hope: better than $r(p)$ = constant.
PARTIAL IMPORTANCE SAMPLING
$r(p) = g(J(p))$
$g(\cdot)$ = monotone decreasing function
$r(p^*)$ = max $\Longrightarrow$ $p^*$ is the center (location)
$p \neq p^* \Longrightarrow r(p) < r(p^*)$
$d(p, p^*) = J(p) - J(p^*)$
Besides location, the critical parameter for importance sampling is the scale (width) of $r(p)$:
$\sigma = \int_{\mathcal{P}} d(p, p^*)\, r(p)\, dp$
Controlled by the choice of $g(\cdot)$:
$\sigma$ too large $\to$ $r(p)$ = constant
$\sigma$ too small $\to$ best single-point rule $p^*$
Questions:
(1) how to choose $g(\cdot)$, i.e. $\sigma$
(2) how to sample from $r(p) = g(J(p))$
TRICK
Perturbation sampling $\Rightarrow$ repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal $f(x; p_m)$ for the perturbed problem:
$p_m = \mathcal{R}_m\big[\arg\min_p E_{q(z)} L(y, f(x; p))\big]$ ($\mathcal{R}_m$ = the $m$th random perturbation)
Control width $\sigma$ of $r(p)$ by the degree of perturbation.
Perturb: $L(y, F)$, $q(z)$, the algorithm, or a hybrid.
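A tiny simulation (hypothetical, not from the talk) of perturbation sampling: the base learner is the simplest possible one, a constant fit under squared loss, and the perturbation is random reweighting of the data. The spread of the sampled solutions stands in for the width $\sigma$ of $r(p)$ and grows with the degree of perturbation $\beta$.

```python
import random

def perturbed_solution(y, beta, rng):
    """One draw of perturbation sampling: randomly reweight the data
    (a perturbed problem) and return the optimal base learner for it.
    Base learner: a constant c; under squared loss the optimum is
    the weighted mean.  beta = degree of perturbation."""
    w = [rng.expovariate(1.0) ** beta for _ in y]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def width(y, beta, draws, rng):
    """Spread of the sampled parameters p_m: a stand-in for the
    width sigma of r(p)."""
    ps = [perturbed_solution(y, beta, rng) for _ in range(draws)]
    mean = sum(ps) / len(ps)
    return (sum((p - mean) ** 2 for p in ps) / len(ps)) ** 0.5

rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Stronger perturbation -> wider r(p):
print(width(y, 0.0, 500, rng) < width(y, 1.0, 500, rng) < width(y, 3.0, 500, rng))
```

With $\beta = 0$ every draw solves the same problem (the single best rule $p^*$); increasing $\beta$ widens $r(p)$ toward the $r(p)$ = constant extreme, matching the trade-off on the previous slides.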
EXAMPLES
Perturb the loss function:
$L_m(y, f) = L(y, f) + \beta \cdot l_m(y, f)$
$l_m(y, f)$ = random function
$L_m(y, f) = L(y, f + \beta \cdot h_m(x))$
$h_m(x)$ = random function of $x$
$p_m = \arg\min_p E_{q(z)} L_m(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\sim$ value of $\beta$
Perturb the data distribution:
Random reweighting:
$q_m(z) = [w_m(z)]^\beta\, q(z)$
$w_m(z)$ = random function of $z$
$p_m = \arg\min_p E_{q_m(z)} L(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\sim$ value of $\beta$
Perturb the algorithm:
$p_m = \mathrm{rand}[\arg\min_p]\, E_{q(z)} L(y, f(x; p))$
Control width $\sigma$ of $r(p)$ by the degree of randomization:
repeated partial optimizations
perturb partial solutions
Examples (trees):
Dietterich: random trees
Breiman: random forests
GOAL
Produce a good $\{p_m\}_1^M$ so that
$\sum_{m=1}^M c_m^* f(x; p_m) \approx F^*(x)$
where
$\{c_m^*\}_1^M$ = population linear regression (under $L$) of $y$ on $\{f(x; p_m)\}_1^M$.
Note: both depend on knowing the population $q(z)$.
FINITE DATA
$\{z_i\}_1^N \sim q(z)$
$\hat q(z) = \frac{1}{N} \sum_{i=1}^N \delta(z - z_i)$
Apply perturbation sampling based on $\hat q(z)$:
Loss function / algorithm:
$q(z) \to \hat q(z)$
width $\sigma$ of $r(p)$ controlled as before
Empirical data distribution: random reweighting
$\hat q_m(z) = \sum_{i=1}^N w_{im}\, \delta(z - z_i)$
$w_{im} \sim \Pr(w)$ with $E\, w_{im} = 1/N$
width $\sigma$ of $r(p)$ controlled by $\mathrm{std}(w_{im})$
Fastest computation: $w_{im} \in \{0, 1/K\}$
$\Rightarrow$ draw $K$ of the $N$ observations without replacement
$\sigma \sim \mathrm{std}(w) = (N/K - 1)^{1/2}/N$
computation $\sim K/N$
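The width formula for subsampling can be checked numerically; the sketch below (hypothetical code, not from the talk) draws $K$ of $N$ observations without replacement and compares the empirical $\mathrm{std}(w_{im})$ with $(N/K - 1)^{1/2}/N$.

```python
import math
import random

def theoretical_std(n, k):
    """std(w) for w in {0, 1/K}: (N/K - 1)^(1/2) / N."""
    return math.sqrt(n / k - 1) / n

def empirical_std(n, k, draws, rng):
    """Draw K of N without replacement many times; track the weight
    of one fixed observation and return its standard deviation."""
    weights = []
    for _ in range(draws):
        chosen = set(rng.sample(range(n), k))
        weights.append(1.0 / k if 0 in chosen else 0.0)
    mean = sum(weights) / draws
    return math.sqrt(sum((w - mean) ** 2 for w in weights) / draws)

n, k = 100, 25
rng = random.Random(0)
print(theoretical_std(n, k), empirical_std(n, k, 20_000, rng))
```

Note $E\,w_{im} = (K/N)(1/K) = 1/N$ as required; shrinking $K$ widens $r(p)$ and simultaneously cuts the cost of fitting each ensemble member, which is the speed lever used later in the talk.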
Quadrature Coe�cients
Population:
Linear regression of $y$ on $\{f(x; p_m)\}_1^M$:
$\{c_m^*\}_1^M = \arg\min_{\{c_m\}} E_{q(z)} L\big(y, \sum_{m=1}^M c_m f(x; p_m)\big)$
Finite data: regularized linear regression
$\{\hat c_m\}_1^M = \arg\min_{\{c_m\}} E_{\hat q(z)} L\big(y, \sum_{m=1}^M c_m f(x; p_m)\big) + \lambda \cdot \sum_{m=1}^M |c_m - c_m^{(0)}|$ (lasso)
Regularization $\Rightarrow$ reduced variance
$\{c_m^{(0)}\}_1^M$ = prior guess (usually $= 0$)
$\lambda > 0$ chosen by cross-validation
Fast algorithm: solutions for all $\lambda$
(see Friedman & Popescu 2004)
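A minimal coordinate-descent lasso in the spirit of this slide (a generic sketch, not the fast path algorithm of Friedman & Popescu 2004; prior guess $c^{(0)} = 0$, squared-error loss, and hypothetical data): the columns play the role of the ensemble predictions $f(x_i; p_m)$, and the penalty zeroes out members that do not earn their keep.

```python
import random

def soft_threshold(rho, lam):
    """The lasso shrinkage operator S(rho, lam)."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coefficients(columns, y, lam, passes=100):
    """Minimize (1/(2N)) sum_i (y_i - sum_m c_m columns[m][i])^2
    + lam * sum_m |c_m| by cyclic coordinate descent."""
    n, m = len(y), len(columns)
    c = [0.0] * m
    resid = list(y)                      # residual at the current c
    for _ in range(passes):
        for j in range(m):
            col = columns[j]
            # add coordinate j back into the residual
            full = [r + c[j] * col[i] for i, r in enumerate(resid)]
            rho = sum(col[i] * full[i] for i in range(n)) / n
            z = sum(v * v for v in col) / n
            c[j] = soft_threshold(rho, lam) / z
            resid = [full[i] - c[j] * col[i] for i in range(n)]
    return c

rng = random.Random(0)
n = 50
f1 = [rng.gauss(0, 1) for _ in range(n)]     # a useful ensemble member
f2 = [rng.gauss(0, 1) for _ in range(n)]     # a useless one
y = [2.0 * a + rng.gauss(0, 0.1) for a in f1]
c = lasso_coefficients([f1, f2], y, lam=0.1)
print([round(v, 2) for v in c])
```

With $\lambda$ large enough every coefficient is exactly zero; cross-validation over a grid of $\lambda$ values picks the operating point, as the slide notes.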
Importance Sampled Learning Ensembles (ISLE)
Numerical integration:
$F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp \approx \sum_{m=1}^M c_m f(x; p_m)$
$\{p_m\}_1^M \sim r(p)$: importance sampling,
via perturbation sampling on $\hat q(z)$
$\{c_m\}_1^M$: regularized linear regression
of $y$ on $\{f(x; p_m)\}_1^M$
BAGGING (Breiman 1996)
Perturb the data distribution $\hat q(z)$:
$\hat q_m(z)$ = bootstrap sample $= \sum_{i=1}^N w_{im}\, \delta(z - z_i)$
$w_{im} \in \{0, 1/N, 2/N, \dots, 1\}$, $\sim$ multinomial$(1/N)$
$p_m = \arg\min_p E_{\hat q_m(z)} L(y, f(x; p))$
$F(x) = \sum_{m=1}^M \frac{1}{M} f(x; p_m)$ (average)
Width $\sigma$ of $r(p)$:
$E(\mathrm{std}(w_{im})) = (1 - 1/N)^{1/2}/N \approx 1/N$
Fixed $\Rightarrow$ no control.
No joint fitting of coefficients:
$\lambda = \infty$ and $c_m^{(0)} = 1/M$
Potential improvements:
different $\sigma$ (sampling strategy)
$\lambda < \infty$ $\Rightarrow$ jointly fit coefficients to the data
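The bagging recipe can be seen end to end in a small sketch (hypothetical data and a stump base learner, chosen for brevity): bootstrap resamples of $\hat q(z)$, the optimal base learner per resample, fixed weights $1/M$.

```python
import random

def fit_stump(xs, ys):
    """Optimal single-split regression stump under squared loss:
    the best base learner f(x; p) for this (perturbed) sample."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(1, len(order)):
        left = [ys[i] for i in order[:cut]]
        right = [ys[i] for i in order[cut:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - lm) ** 2 for v in left)
               + sum((v - rm) ** 2 for v in right))
        if best is None or sse < best[0]:
            split = (xs[order[cut - 1]] + xs[order[cut]]) / 2
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x < split else rm

def bagged(xs, ys, m, rng):
    """Bagging: fit the optimal stump to each bootstrap resample
    (a perturbed q_m(z)) and average with fixed weights 1/M."""
    n = len(xs)
    members = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        members.append(fit_stump([xs[i] for i in idx],
                                 [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in members) / len(members)

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [(1.0 if x > 0 else 0.0) + rng.gauss(0, 0.1) for x in xs]
F = bagged(xs, ys, 25, rng)
print(round(F(-0.5), 2), round(F(0.5), 2))
```

This is the slide's point in code: the bootstrap fixes $\mathrm{std}(w_{im}) \approx 1/N$ (no width control) and the $1/M$ weights are never fit to the data; replacing the plain average with a lasso fit of the member predictions is the ISLE improvement.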
RANDOM FORESTS (Breiman 1998)
$f(x; p) = T(x)$ = largest possible decision tree
Hybrid sampling strategy:
(1) $\hat q_m(z)$ = bootstrap sample (bagging)
(2) random algorithm modification:
select the variable for each split from
among a randomly chosen subset of size $n_s$
Breiman: $n_s = \lfloor \log_2 n + 1 \rfloor$
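The split-variable randomization is essentially one line of code; a sketch (hypothetical helper, not Breiman's implementation) of choosing $n_s = \lfloor \log_2 n + 1 \rfloor$ candidate variables at a split:

```python
import math
import random

def split_candidates(n_vars, rng):
    """Restrict the split search to n_s = floor(log2(n) + 1)
    randomly chosen variables -- the random-forest algorithm
    perturbation; smaller n_s means a wider r(p)."""
    n_s = math.floor(math.log2(n_vars) + 1)
    return rng.sample(range(n_vars), n_s)

rng = random.Random(0)
print(sorted(split_candidates(40, rng)))   # 6 of the 40 variables
```

The tree-growing algorithm then scores only these candidates at the current node; repeating the draw at every split is what makes this an algorithm perturbation rather than a data perturbation.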
$F(x) = \sum_{m=1}^M \frac{1}{M} T_m(x)$ (average)
As an ISLE: $\sigma(\mathrm{RF}) > \sigma(\mathrm{Bag})$ (increases as $n_s$ decreases)
Potential improvements: same as bagging:
different $\sigma$ (sampling strategy)
$\lambda < \infty$ $\Rightarrow$ jointly fit coefficients to the data
(more later)
SEQUENTIAL SAMPLING
Random Monte Carlo: $\{p_m \sim r(p)\}_1^M$ iid
Quasi-Monte Carlo: $\{p_m\}_1^M$ deterministic
$J(\{p_m\}_1^M) = \min_{\{\alpha_m\}} E_{q(z)} L\big(y, \sum_{m=1}^M \alpha_m f(x; p_m)\big)$
Joint regression of $y$ on $\{f(x; p_m)\}_1^M$ (population)
Approximation: sequential sampling (forward stagewise)
$J_m(p \mid \{p_l\}_1^{m-1}) = \min_\alpha E_{q(z)} L(y, \alpha\, f(x; p) + h_m(x))$
$h_m(x) = \sum_{l=1}^{m-1} \alpha_l f(x; p_l)$, $\alpha_l$ = solution for $p_l$
$p_m = \arg\min_p J_m(p \mid \{p_l\}_1^{m-1})$
Repeatedly modifies the loss function:
similar to $L_m(y, f) = L(y, f + \beta \cdot h_m(x))$,
but here $\beta = 1$ and $h_m(x)$ is deterministic.
Connection to Boosting
AdaBoost (Freund & Schapire 1996):
$L(y, f) = \exp(-y \cdot f)$, $y \in \{-1, 1\}$
$F(x) = \mathrm{sign}\big(\sum_{m=1}^M \alpha_m f(x; p_m)\big)$
$\{\alpha_m\}_1^M$ = sequential partial regression coefficients
Gradient Boosting (MART; Friedman 2001):
general $y$ and $L(y, f)$, $\alpha_m$ shrunk ($\nu \ll 1$)
$F(x) = \sum_{m=1}^M \alpha_m f(x; p_m)$
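A compact sketch of sequential (forward-stagewise) sampling with MART-style shrinkage, under squared loss. The base-learner family is the simplest possible one, single coordinates $f(x; p) = x_p$ (a hypothetical choice for illustration; the talk uses trees):

```python
import random

def stagewise(X, y, steps, shrink):
    """Sequential sampling under squared loss: at step m, choose the
    base learner f(x; p) = x_p that most reduces the residual error,
    take its optimal coefficient alpha, and (as in gradient boosting
    / MART) shrink it by nu << 1 before updating h_m(x)."""
    n, d = len(X), len(X[0])
    coef = [0.0] * d
    resid = list(y)
    for _ in range(steps):
        best_p, best_alpha, best_gain = None, 0.0, -1.0
        for p in range(d):
            col = [row[p] for row in X]
            num = sum(c * r for c, r in zip(col, resid))
            den = sum(c * c for c in col)
            alpha = num / den
            gain = alpha * num           # SSE reduction at full alpha
            if gain > best_gain:
                best_p, best_alpha, best_gain = p, alpha, gain
        step = shrink * best_alpha       # shrinkage nu = `shrink`
        coef[best_p] += step
        col = [row[best_p] for row in X]
        resid = [r - step * c for c, r in zip(col, resid)]
    return coef

rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.5) for row in X]
print([round(c, 1) for c in stagewise(X, y, steps=200, shrink=0.1)])
```

Setting `shrink=1.0` gives plain forward stagewise (the $\beta = 1$ case of the previous slide); small values trade more steps for a wider, better-behaved search, which is the MART recipe.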
Potential improvements (ISLE):
(1) $F(x) = \sum_{m=1}^M c_m f(x; p_m)$
$\{p_m\}_1^M \sim$ sequential sampling on $\hat q(z)$
$\{c_m\}_1^M \sim$ regularized linear regression
(2) and/or hybrid with random $\hat q_m(z)$ (speed)
(sample $K$ of $N$ without replacement)
ISLE Paradigm
Wide variety of ISLE methods, varying:
(1) base learner $f(x; p)$; (2) loss criterion $L(y, f)$;
(3) perturbation method;
(4) degree of perturbation: $\sigma$ of $r(p)$;
(5) iid vs. sequential;
(6) hybrids.
Examine several options.
Monte Carlo Study
100 data sets: each $N = 10000$, $n = 40$
$\{y_{il} = F_l(x_i) + \varepsilon_{il}\}_{i=1}^{10000}$, $l = 1, \dots, 100$
$\{F_l(x)\}_1^{100}$ = different (random) target functions
$x_i \sim N(0, I_{40})$, $\varepsilon_{il} \sim N(0, \mathrm{Var}_x(F_l(x)))$
$\Rightarrow$ signal/noise = 1/1
Evaluation Criteria
Relative RMS error:
$\mathrm{rmse}(\hat F_{jl}) = [1 - R^2(F_l, \hat F_{jl})]^{1/2}$
Comparative RMS error:
$\mathrm{cmse} = \mathrm{rmse}(\hat F_{jl}) / \min_k \{\mathrm{rmse}(\hat F_{kl})\}$
(adjusts for problem difficulty)
$j, k \in$ {respective methods}
evaluated on 10000 independent observations
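The two criteria in code (a direct transcription under squared error; the method labels and toy values are hypothetical):

```python
import math

def relative_rmse(target, preds):
    """rmse(F_hat) = [1 - R^2(F, F_hat)]^(1/2): the RMS error of the
    prediction relative to the RMS spread of the target itself."""
    n = len(target)
    mean = sum(target) / n
    sse = sum((t - p) ** 2 for t, p in zip(target, preds))
    sst = sum((t - mean) ** 2 for t in target)
    return math.sqrt(sse / sst)

def comparative_rmse(errors):
    """cmse for each method: its rmse divided by the best method's
    rmse on the same problem (adjusts for problem difficulty)."""
    best = min(errors.values())
    return {k: v / best for k, v in errors.items()}

target = [1.0, 2.0, 3.0, 4.0]
print(relative_rmse(target, target))        # 0.0 (perfect prediction)
print(relative_rmse(target, [2.5] * 4))     # 1.0 (mean-only prediction)
print(comparative_rmse({"A": 0.2, "B": 0.3}))
```

So rmse = 0 means a perfect fit and rmse = 1 means no better than predicting the mean, while cmse = 1 marks the best method on each problem.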
Properties of Fl(x)
(1) 30 "noise" variables
(2) wide variety of functions (difficulty)
(3) emphasize lower-order interactions
(4) not in the span of the base learners
Decision Trees
CLASSIFICATION
(classcomp.ps)
CENSUS DATA (http://www.ips.umn.edu/usa)
$N = 46937$ (36000/10937, 5 times)
$y$ = individual personal income
$x_1, \dots, x_{70}$ = demographic variables (many missing)
categorical: occupation, industry, etc.
numeric: education grade level, family size, etc.
(censusdat.ps, censusrf.ps, censusmart.ps)
SPAM DATA (http://www.data-mining-cup.com)
$N = 19177$ emails (15177/4000, 5 times)
$y$ = spam/not spam; $x_1, \dots, x_{833}$ = binary features
presence/absence of:
selected text strings
characteristics of the header
URL features
(spam.ps, spamlite.ps)
SUMMARY (theory): unify
(1) bagging, (2) random forests,
(3) Bayesian model averaging,
(4) boosting
into a single paradigm $\sim$ Monte Carlo integration:
(1)-(3): iid Monte Carlo, $p \sim r(p)$
(1), (2): perturbation sampling; (3): MCMC
(4): quasi-Monte Carlo, approximate sequential sampling
Practice:
coefficients fit by lasso linear regression:
(1) improves accuracy of RF and bagging (and is faster)
(2) combined with aggressive subsampling
and weaker base learners, improves speed:
bagging & RF by a factor $> 10^2$, MART by $\sim 5$,
allowing much bigger data sets.
Also, prediction is many times faster.
FUTURE DIRECTIONS
(1) More thorough understanding (theory)
$\to$ specific recommendations
(2) Multiple learning ensembles (MISLEs)
$F(x) = \sum_{k=1}^K \int_{\mathcal{P}_k} a_k(p_k)\, f_k(x; p_k)\, dp_k$
$\{f_k(x; p_k)\}_1^K$ = different (comp.) base learners
$\tilde F(x) = \sum_{k,m} c_{km}\, f_k(x; p_{km})$
$\{c_{km}\}$ fit by a combined lasso regression
Example: $f_1$ = decision trees
$f_2 = \{x_j\}_1^n$ (no sampling)
SLIDES
http://www-stat.stanford.edu/~jhf/talks/isletalk.pdf
REFERENCES
Friedman & Popescu 2003:
http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
Friedman & Popescu 2004:
http://www-stat.stanford.edu/~jhf/ftp/path.pdf