Importance Sampling: An Alternative View of Ensemble Learning
Jerome H. Friedman and Bogdan Popescu, Stanford University

Page 1: Importance Sampling: An Alternative View of Ensemble Learning (statweb.stanford.edu/~jhf/talks/isletalk.pdf, 2004-03-21)

Importance Sampling:

An Alternative View of Ensemble Learning

Jerome H. Friedman

Bogdan Popescu

Stanford University

Page 2:

PREDICTIVE LEARNING

Given data: {z_i}_1^N = {y_i, x_i}_1^N ~ q(z)

y = "output" or "response" attribute (variable)

x = {x_1, ..., x_n} = "inputs" or "predictors"

and loss function L(y, F):

estimate F*(x) = argmin_{F(x)} E_{q(z)} L(y, F(x))

Page 3:

WHY?

F*(x) is the best predictor of y | x under L.

Examples:

Regression: y, F ∈ R

L(y, F) = |y − F|, (y − F)^2

Classification: y, F ∈ {c_1, ..., c_K}

L(y, F) = L_{y,F} (K × K matrix)

Page 4:

F*(x) = "target" function (regression)

concept (classification)

Estimate: F̂(x) ← learning procedure({z_i}_1^N)

(CART, MARS, logistic regression)

Here: procedure = "LEARNING ENSEMBLES"

TreeNet (MART)

Random Forests

Page 5:

BASIC LINEAR MODEL

F(x) = ∫_P a(p) f(x; p) dp

f(x; p) = "base" learner (basis function)

parameters: p = (p_1, p_2, ...)

p ∈ P indexes a particular function of x

from {f(x; p)}_{p ∈ P}

a(p) = coefficient of f(x; p)

Page 6:

Examples:

f(x; p) = [1 + exp(−p^T x)]^{−1} (neural nets)

= multivariate splines (MARS)

= decision trees (MART, RF)

Page 7:

NUMERICAL QUADRATURE

∫_P I(p) dp ≈ Σ_{m=1}^M w_m I(p_m)

here: I(p) = a(p) f(x; p)

Quadrature rule defined by:

{p_m}_1^M = evaluation points ∈ P

{w_m}_1^M = weights

Page 8:

F(x) ≈ Σ_{m=1}^M w_m a(p_m) f(x; p_m)

≈ Σ_{m=1}^M c_m f(x; p_m)

Averaging over x:

{c*_m}_1^M = linear regression of y on {f(x; p_m)}_1^M (pop.)

Problem: find good {p_m}_1^M.

Page 9:

MONTE CARLO METHODS

r(p) = sampling pdf of p ∈ P

{p_m ~ r(p)}_1^M

Simple Monte Carlo: r(p) = constant

Usually not very good

Page 10:

IMPORTANCE SAMPLING

Customize r(p) for each particular problem (F*(x))

r(p_m) big ⟹ p_m important to high accuracy

when used with {p_m'}_{m' ≠ m}
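As a toy illustration of the idea (not from the talk), the sketch below estimates ∫ I(p) dp two ways: with a constant r(p), and with an r(p) concentrated where I(p) is large. The integrand and both sampling densities are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy integrand, concentrated near p = 0; its integral over R is sqrt(2*pi).
def I(p):
    return np.exp(-0.5 * p ** 2)

M = 100_000
true_value = np.sqrt(2.0 * np.pi)

# Simple Monte Carlo: r(p) = constant on [-10, 10].
p_uni = rng.uniform(-10.0, 10.0, M)
est_uniform = 20.0 * I(p_uni).mean()

# Importance sampling: r(p) = N(0, 2^2) density, large where I(p) is large.
p_imp = rng.normal(0.0, 2.0, M)
r = np.exp(-0.5 * (p_imp / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
est_importance = (I(p_imp) / r).mean()

print(est_uniform, est_importance, true_value)
```

Both estimators are unbiased, but the per-sample variance of I(p)/r(p) under the customized r(p) is much smaller than under the constant r(p), which is the whole point of choosing r(p) nonconstant.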

Page 11:

MONTE CARLO METHODS

(1) "Random" Monte Carlo:

ignore other points: p_m ~ r(p) iid

(2) "Quasi" Monte Carlo:

{p_m}_1^M = deterministic

account for other points

importance → groups of points

Page 12:

RANDOM MONTE CARLO

(Lack of) importance J(p) depends only on p

One measure: "partial importance"

J(p) = E_{q(z)} L(y, f(x; p))

p* = argmin_p J(p)

= best single-point (M = 1) rule

f(x; p*) = optimal single base learner

Page 13:

Usually not very good, especially if

F*(x) ∉ {f(x; p)}_{p ∈ P}

BUT, often used:

single logistic regression or tree

Note: J(p_m) ignores {p_m'}_{m' ≠ m}

Hope: better than r(p) = constant.

Page 14:

PARTIAL IMPORTANCE SAMPLING

r(p) = g(J(p))

g(·) = monotone decreasing function

r(p*) = max ≈ center (location)

p ≠ p* ⟹ r(p) < r(p*)

d(p, p*) = J(p) − J(p*)

Page 15:

Besides location,

critical parameter for importance sampling:

scale (width) of r(p):

σ = ∫_P d(p, p*) r(p) dp

Controlled by choice of g(·):

σ too large → r(p) = constant.

σ too small → best single-point rule p*

Page 16: [figure]

Page 17:

Questions:

(1) how to choose g(·) ~ σ

(2) sample from r(p) = g(J(p))

Page 18:

TRICK

Perturbation sampling ⟹ repeatedly:

(1) randomly modify (perturb) the problem

(2) find optimal f(x; p_m) for the perturbed problem

p_m = R_m{ argmin_p E_{q(z)} L(y, f(x; p)) }

control width σ of r(p) by degree of perturbation

Perturb: L(y, F), q(z), algorithm, hybrid.

Page 19:

EXAMPLES

Perturb loss function:

L_m(y, f) = L(y, f) + δ · l_m(y, f)

l_m(y, f) = random function

L_m(y, f) = L(y, f + δ · h_m(x))

h_m(x) = random function of x

p_m = argmin_p E_{q(z)} L_m(y, f(x; p))

Width σ of r(p) ~ value of δ

Page 20:

Perturb data distribution:

Random reweighting:

q_m(z) = [w_m(z)]^δ q(z)

w_m(z) = random function of z

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))

Width σ of r(p) ~ value of δ

Page 21:

Perturb algorithm:

p_m = rand[argmin_p] E_{q(z)} L(y, f(x; p))

control width σ of r(p) by degree of:

repeated partial optimizations

perturbed partial solutions

Examples (trees):

Dietterich: random trees

Breiman: random forests

Page 22:

GOAL

Produce a good {p_m}_1^M so that

Σ_{m=1}^M c*_m f(x; p_m) ≈ F*(x)

where

{c*_m}_1^M = pop. linear regression (under L)

of y on {f(x; p_m)}_1^M

Note: both depend on knowing the population q(z).

Page 23:

FINITE DATA

{z_i}_1^N ~ q(z)

q̂(z) = Σ_{i=1}^N (1/N) δ(z − z_i)

Apply perturbation sampling based on q̂(z):

Loss function / algorithm:

q(z) → q̂(z)

width σ of r(p) controlled as before

Page 24:

Empirical data distribution: random reweighting

q_m(z) = Σ_{i=1}^N w_{im} δ(z − z_i)

w_{im} ~ Pr(w): E w_{im} = 1/N

width σ of r(p) controlled by std(w_{im})

Fastest computation: w_{im} ∈ {0, 1/K}

⟹ draw K from N without replacement

σ ~ std(w) = (N/K − 1)^{1/2} / N

computation ~ K/N
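The width formula for the {0, 1/K} weights can be checked numerically; a small sketch (the values of N, K, and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

N, K, T = 1000, 50, 2000

# T independent draws of the weight vector: K of N points get weight 1/K.
draws = np.zeros((T, N))
for t in range(T):
    idx = rng.choice(N, size=K, replace=False)
    draws[t, idx] = 1.0 / K

empirical_std = draws.std(axis=0).mean()   # std of w_im, averaged over i
predicted_std = np.sqrt(N / K - 1) / N     # the slide's (N/K - 1)^{1/2} / N
print(empirical_std, predicted_std)
```

Each weight is 1/K with probability K/N and 0 otherwise, so its variance is (1 − K/N)/(KN), which rearranges to the slide's formula; smaller K means wider r(p) and cheaper fits.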

Page 25:

Quadrature Coefficients

Population:

Linear regression of y on {f(x; p_m)}_1^M:

{c*_m}_1^M = argmin_{{c_m}} E_{q(z)} L(y, Σ_{m=1}^M c_m f(x; p_m))

Page 26:

Finite data: regularized linear regression

{ĉ_m}_1^M = argmin_{{c_m}} E_{q̂(z)} L(y, Σ_{m=1}^M c_m f(x; p_m))

+ λ · Σ_{m=1}^M |c_m − c_m^(0)| (lasso)

Regularization ⟹ reduced variance

{c_m^(0)}_1^M = prior guess (usually = 0)

λ > 0 chosen by cross-validation

Fast algorithm: solutions for all λ

(see Friedman & Popescu 2004)
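A minimal sketch of this post-processing step: coordinate-descent lasso with a prior-guess offset, under squared-error loss. The function name and toy data are illustrative, and the talk's fast all-λ path algorithm is not reproduced here.

```python
import numpy as np

def lasso_post_fit(F, y, lam, c0=None, n_iter=200):
    """Coordinate descent for  min_c  mean((y - F c)^2)/2 + lam * sum_m |c_m - c0_m|.

    F is the N x M matrix of base-learner predictions f(x_i; p_m)."""
    N, M = F.shape
    c0 = np.zeros(M) if c0 is None else c0
    c = c0.copy()
    r = y - F @ c                       # current residual
    col_sq = (F ** 2).mean(axis=0)
    for _ in range(n_iter):
        for m in range(M):
            r += F[:, m] * c[m]         # drop the m-th term
            d = (F[:, m] * r).mean() - col_sq[m] * c0[m]
            # soft-threshold around the prior guess c0[m]
            c[m] = c0[m] + np.sign(d) * max(abs(d) - lam, 0.0) / col_sq[m]
            r -= F[:, m] * c[m]
    return c

# Toy check: two useful "base learners", eight pure-noise ones.
rng = np.random.default_rng(3)
N, M = 200, 10
F = rng.normal(size=(N, M))
y = 2.0 * F[:, 0] - 1.0 * F[:, 1] + rng.normal(0, 0.1, N)
c = lasso_post_fit(F, y, lam=0.05)
print(np.round(c, 2))
```

The penalty zeroes the coefficients of the noise learners while keeping the useful ones; with λ large enough, every ĉ_m collapses onto its prior guess c_m^(0).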

Page 27:

Importance Sampled Learning Ensembles (ISLE)

Numerical Integration:

F(x) = ∫_P a(p) f(x; p) dp

≈ Σ_{m=1}^M ĉ_m f(x; p_m)

{p_m}_1^M ~ r(p): importance sampling

~ perturbation sampling on q̂(z)

{ĉ_m}_1^M: regularized linear regression

of y on {f(x; p_m)}_1^M

Page 28:

BAGGING (Breiman 1996)

Perturb data distribution q̂(z):

q_m(z) = bootstrap sample = Σ_{i=1}^N w_{im} δ(z − z_i)

w_{im} ∈ {0, 1/N, 2/N, ..., 1}

~ multinomial (mean 1/N)

p_m = argmin_p E_{q_m(z)} L(y, f(x; p))

F(x) = Σ_{m=1}^M (1/M) f(x; p_m) (average)
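A minimal sketch of bagging under these definitions, with a deliberately trivial base learner (a stump with a fixed split at x = 0, so "fitting" is just computing two leaf means); the data and learner are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy regression data.
N = 300
x = rng.uniform(-1, 1, N)
y = np.sin(2 * x) + rng.normal(0, 0.3, N)

M = 50
leaf_means = np.empty((M, 2))
for m in range(M):
    idx = rng.integers(0, N, N)        # bootstrap: multinomial weights w_im
    xb, yb = x[idx], y[idx]
    leaf_means[m] = yb[xb <= 0].mean(), yb[xb > 0].mean()

def F_bag(x_new):
    """Bagged prediction: plain average over the ensemble (c_m = 1/M)."""
    leaves = leaf_means[:, 0] if x_new <= 0 else leaf_means[:, 1]
    return leaves.mean()

print(F_bag(-0.5), F_bag(0.5))
```

Note the two fixed choices the slide flags: the bootstrap fixes the perturbation width, and the simple average fixes c_m = 1/M instead of fitting the coefficients jointly.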

Page 29:

Width σ of r(p):

E(std(w_{im})) = (1 − 1/N)^{1/2} / N ≈ 1/N

Fixed ⟹ no control

No joint fitting of coefficients:

λ = ∞ & c_m^(0) = 1/M

Potential improvements:

Different σ (sampling strategy)

λ < ∞ ⟹ jointly fit coefficients to data

Page 30:

RANDOM FORESTS (Breiman 1998)

f(x; p) = T(x) = largest possible decision tree

Hybrid sampling strategy:

(1) q_m(z) = bootstrap sample (bagging)

(2) random algorithm modification:

select the variable for each split from

among a randomly chosen subset

Breiman: n_s = ⌊log2(n) + 1⌋

Page 31:

F(x) = Σ_{m=1}^M (1/M) T_m(x) (average)

As an ISLE: σ(RF) > σ(Bag) (↑ as n_s ↓)

Potential improvements: same as bagging

Different σ (sampling strategy)

λ < ∞ ⟹ jointly fit coefficients to data

(more later)

Page 32:

SEQUENTIAL SAMPLING

Random Monte Carlo: {p_m ~ r(p)}_1^M iid

Quasi-Monte Carlo: {p_m}_1^M = deterministic

J({p_m}_1^M) = min_{{α_m}} E_{q(z)} L(y, Σ_{m=1}^M α_m f(x; p_m))

Joint regression of y on {f(x; p_m)}_1^M (pop.)

Page 33:

Approximation: sequential sampling

(forward stagewise)

J_m(p | {p_l}_1^{m−1}) = min_α E_{q(z)} L(y, α f(x; p) + h_m(x))

h_m(x) = Σ_{l=1}^{m−1} α_l f(x; p_l), α_l = solution for p_l

p_m = argmin_p J_m(p | {p_l}_1^{m−1})

Repeatedly modify loss function:

similar to L_m(y, f) = L(y, f + δ · h_m(x))

but here δ = 1 & h_m(x) = deterministic
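A sketch of the forward-stagewise approximation over a fixed dictionary; sinusoidal basis functions stand in for the base learners purely for illustration (the talk's learners are trees), and the target is built from two dictionary elements so the selection order is easy to check:

```python
import numpy as np

rng = np.random.default_rng(6)

# Target built from dictionary elements at frequencies 3 and 7, plus noise.
N = 300
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * 3 * x) + 0.5 * np.sin(2 * np.pi * 7 * x) \
    + rng.normal(0, 0.1, N)

freqs = np.arange(1, 21)
basis = np.sin(2 * np.pi * np.outer(x, freqs))   # N x 20 dictionary {f(x; p)}

# Forward stagewise: p_m minimizes the loss given h_m = sum of earlier terms.
h = np.zeros(N)
chosen = []
for m in range(4):
    best = None
    for j in range(basis.shape[1]):
        f = basis[:, j]
        alpha = f @ (y - h) / (f @ f)            # min_alpha ||y - h - alpha f||^2
        loss = np.mean((y - h - alpha * f) ** 2)
        if best is None or loss < best[0]:
            best = (loss, j, alpha)
    _, j, alpha = best
    h += alpha * basis[:, j]
    chosen.append(int(freqs[j]))

print(chosen)
```

Each round conditions on the deterministic partial fit h_m(x), so the points p_m are chosen as a group rather than iid, which is the quasi-Monte Carlo flavor of the slide.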

Page 34:

Connection to Boosting

AdaBoost (Freund & Schapire 1996):

L(y, f) = exp(−y · f), y ∈ {−1, 1}

F(x) = sign(Σ_{m=1}^M α_m f(x; p_m))

{α_m}_1^M = sequential partial regression coefficients

Gradient Boosting (MART; Friedman 2001):

general y & L(y, f), α_m = shrunk (ν << 1)

F(x) = Σ_{m=1}^M α_m f(x; p_m)
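A minimal gradient-boosting sketch for squared-error loss with shrinkage ν, using least-squares stumps; the toy data and the stump's split grid are illustrative, not MART itself:

```python
import numpy as np

rng = np.random.default_rng(5)

N = 400
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + rng.normal(0, 0.2, N)

def best_stump(x, r):
    """Least-squares stump fit to the current residual r."""
    best = None
    for s in np.linspace(-0.95, 0.95, 39):
        L = x <= s
        if not L.any() or L.all():
            continue
        cl, cr = r[L].mean(), r[~L].mean()
        loss = np.mean((r - np.where(L, cl, cr)) ** 2)
        if best is None or loss < best[0]:
            best = (loss, s, cl, cr)
    return best[1:]

nu, M = 0.1, 200                         # shrinkage nu << 1
F = np.zeros(N)
losses = []
for m in range(M):
    s, cl, cr = best_stump(x, y - F)     # squared loss: gradient = residual
    F += nu * np.where(x <= s, cl, cr)   # alpha_m = nu times the stump fit
    losses.append(np.mean((y - F) ** 2))

print(losses[0], losses[-1])
```

Because each stump is the least-squares fit to the residual, every shrunk step decreases the training loss; small ν keeps each sequential coefficient α_m small, as on the slide.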

Page 35:

Potential improvements (ISLE):

(1) F(x) = Σ_{m=1}^M ĉ_m f(x; p_m)

{p_m}_1^M ~ sequential sampling on q̂(z)

{ĉ_m}_1^M ~ regularized linear regression

(2) and/or hybrid with random q_m(z) (speed)

(sample K from N without replacement)

Page 36:

ISLE Paradigm

Wide variety of ISLE methods:

(1) base learner f(x; p); (2) loss criterion L(y, f)

(3) perturbation method

(4) degree of perturbation: σ of r(p)

(5) iid vs. sequential

(6) hybrids

Examine several options.

Page 37:

Monte Carlo Study

100 data sets: each N = 10000, n = 40

{y_il = F_l(x_i) + ε_il}_{i=1}^{10000}, l = 1, ..., 100

{F_l(x)}_1^100 = different (random) target functions

x_i ~ N(0, I_40); ε_il ~ N(0, Var_x(F_l(x)))

⟹ signal/noise = 1/1

Page 38:

Evaluation Criteria

Relative RMS error:

rmse(F̂_jl) = [1 − R²(F_l, F̂_jl)]^{1/2}

Comparative RMS error:

cmse = rmse(F̂_jl) / min_k {rmse(F̂_kl)}

(adjusts for problem difficulty)

j, k ∈ {respective methods}

10000 indep. obs.
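The two criteria can be written down directly; here R² is read as 1 − MSE/Var(F_l), which is an assumption about the slide's notation:

```python
import numpy as np

def rmse(F_true, F_hat):
    """Relative RMS error, reading R^2 as 1 - MSE/Var (an assumption)."""
    mse = np.mean((F_true - F_hat) ** 2)
    return np.sqrt(mse / np.var(F_true))

def cmse(rmse_by_method):
    """Comparative RMS error: each rmse divided by the best method's."""
    best = min(rmse_by_method.values())
    return {name: val / best for name, val in rmse_by_method.items()}

# A predictor equal to the target's mean scores rmse = 1; a perfect one, 0.
F = np.sin(np.linspace(-3, 3, 1000))
print(rmse(F, np.full_like(F, F.mean())))
print(cmse({"bagging": 0.30, "RF": 0.25, "MART": 0.20}))
```

Under this reading, rmse is scale-free across the 100 random targets, and cmse = 1 marks the best method on each problem, which is how it adjusts for problem difficulty.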

Page 39:

Properties of F_l(x)

(1) 30 "noise" variables

(2) wide variety of functions (difficulty)

(3) emphasize lower-order interactions

(4) not in the span of the base learners

Decision Trees

Pages 40-45: [figures]

Page 46:

CLASSIFICATION

(classcomp.ps)

Page 47:

CENSUS DATA (http://www.ips.umn.edu/usa)

N = 46937 (36000/10937, 5 times)

y = individual personal income

x_1, ..., x_70 = demographic variables (many missing)

categorical: occupation, industry, etc.

numeric: education grade level, family size, etc.

(censusdat.ps, censusrf.ps, censusmart.ps)

Page 48:

SPAM DATA (http://www.data-mining-cup.com)

N = 19177 emails (15177/4000, 5 times)

y = spam/not spam, x_1, ..., x_833 = binary features

presence/absence of:

selected text strings

characteristics of the header

URL features

(spam.ps, spamlite.ps)

Page 49:

SUMMARY

Theory: unify

(1) bagging, (2) random forests,

(3) Bayesian model averaging,

(4) boosting

in a single paradigm: Monte Carlo integration

(1)-(3): iid Monte Carlo, p ~ r(p)

(1), (2): perturbation sampling; (3): MCMC

(4): quasi-Monte Carlo: approximate sequential sampling

Page 50:

Practice:

{ĉ_m}_1^M by lasso linear regression:

(1) improves accuracy of RF and bagging (faster)

(2) combined with aggressive subsampling

& weaker base learners, improves speed

(bagging & RF by > 10^2, MART by ~ 5)

allowing much bigger data sets.

Also, prediction many times faster.

Page 51:

FUTURE DIRECTIONS

(1) More thorough understanding (theory)

→ specific recommendations

Page 52:

(2) Multiple learning ensembles (MISLEs)

F(x) = Σ_{k=1}^K ∫_{P_k} a_k(p_k) f_k(x; p_k) dp_k

{f_k(x; p_k)}_1^K = different (comp.) base learners

F̃(x) = Σ ĉ_km f_k(x; p_km)

{ĉ_km} by combined lasso regression

Example: f_1 = decision trees

f_2 = {x_j}_1^n (no sampling)

Page 53:

SLIDES

http://www-stat.stanford.edu/~jhf/talks/isletalk.pdf

REFERENCES

Friedman & Popescu 2003:

http://www-stat.stanford.edu/~jhf/ftp/isle.pdf

Friedman & Popescu 2004:

http://www-stat.stanford.edu/~jhf/ftp/path.pdf