
Page 1:

Regression trees and regression graphs:
Efficient estimators for Generalized Additive Models

Adam Tauman Kalai
TTI-Chicago

Page 2:

Outline

• Generalized Additive Models (GAM)
• Computationally efficient regression: the model [Valiant] [Kearns & Schapire]
• Thm: the regression graph algorithm efficiently learns GAMs (new)
• Regression tree algorithm
• Regression graph algorithm [Mansour & McAllester]
• Correlation boosting (new)

Page 3:

Generalized Additive Models [Hastie & Tibshirani]

• Distribution over X × Y = R^d × R
• f(x) = E[y|x] = u(f_1(x^{(1)}) + f_2(x^{(2)}) + … + f_d(x^{(d)}))
  – monotonic u: R → R, arbitrary f_i: R → R
• e.g., generalized linear models: u(w·x) with monotonic u (linear/logistic models)
• e.g., f(x) = e^{-||x||^2} = e^{-(x^{(1)})^2 - (x^{(2)})^2 - … - (x^{(d)})^2}
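To make the definition concrete, here is a minimal Python sketch of evaluating a GAM; the logistic link and the weights are illustrative choices of mine, while the second example mirrors the slide's f(x) = e^{-||x||^2}.

```python
import numpy as np

# Minimal sketch of a GAM f(x) = u(f_1(x^(1)) + ... + f_d(x^(d))).
# The logistic link and weights below are illustrative, not from the talk.

def gam(x, u, components):
    """Evaluate u(sum_i f_i(x^(i)))."""
    return u(sum(f_i(x_i) for f_i, x_i in zip(components, x)))

# Generalized linear model: u(w.x) with a monotonic logistic link.
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
w = [0.5, -1.0, 2.0]
glm_parts = [lambda z, wi=wi: wi * z for wi in w]

# The slide's example f(x) = e^{-||x||^2}: f_i(z) = -z^2, monotonic u(z) = e^z.
gauss_parts = [lambda z: -z * z] * 3

x = np.array([0.2, 0.4, 0.1])
print(gam(x, logistic, glm_parts))   # logistic model value
print(gam(x, np.exp, gauss_parts))   # equals np.exp(-np.dot(x, x))
```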

Page 4:

Non-Hodgkin's Lymphoma International Prognostic Index [NEJM '93]

# Risk factors | Relapse < 5 years | Relapse < 2 years | Death < 5 years | Death < 2 years
0,1            | 30%               | 21%               | 30%             | 16%
2              | 50%               | 34%               | 50%             | 34%
3              | 51%               | 41%               | 51%             | 46%
4,5            | 60%               | 42%               | 60%             | 66%

Risk factors: age > 60, # sites > 1, performance status > 1, LDH > normal, stage > 2

Page 5:

[Figure: a training sample of points, each labeled with its observed y value: many 0s and 1s plus intermediate values such as .8, .5, .4, .2, .1.]

Setup

• X × Y, with X = R^d and Y = [0,1]
• training sample: (x_1,y_1),…,(x_n,y_n)
• regression algorithm outputs h: X → [0,1]

[Figure: the learned hypothesis h assigns predicted values (e.g., .02, .1, .2, .3, .4, .7) to the same points.]

• "true error": ε(h) = E[(h(x) - y)^2]
• "training error": ε(h, train) = (1/n) Σ_i (h(x_i) - y_i)^2
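In code, the two quantities look as follows; the distribution and the hypothesis are made-up placeholders, and the true error is approximated by Monte Carlo on a fresh sample.

```python
import numpy as np

# Sketch of the two error measures: training error on the sample, and
# true error E[(h(x)-y)^2] estimated on fresh draws. Data and h are made up.

def squared_error(h, xs, ys):
    """(1/n) * sum_i (h(x_i) - y_i)^2."""
    return float(np.mean((np.array([h(x) for x in xs]) - ys) ** 2))

rng = np.random.default_rng(0)

def draw(n):                      # toy distribution over X x Y, X = R^2
    xs = rng.random((n, 2))
    ys = (xs.sum(axis=1) / 2 + rng.normal(0, 0.05, n)).clip(0, 1)
    return xs, ys

h = lambda x: x.mean()            # some hypothesis h: X -> [0,1]
train = draw(100)
fresh = draw(100_000)
print(squared_error(h, *train))   # training error eps(h, train)
print(squared_error(h, *fresh))   # Monte Carlo estimate of true error eps(h)
```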

Page 6:

Computationally-efficient regression [Kearns & Schapire]

[Figure: learning algorithm A receives n examples from a distribution over X × [0,1] with f(x) = E[y|x] ∈ F, a family of target functions, and outputs h: X → [0,1].]

Definition: A efficiently learns F if, for every such distribution and every δ > 0, with probability 1-δ the true error satisfies
  ε(h) = E[(h(x) - y)^2] ≤ E[(f(x) - y)^2] + poly(|f|, 1/δ)/n^c,
and A's runtime is poly(n, |f|).

Page 7:

Outline

• Generalized Additive Models (GAM)
• Computationally efficient regression: the model [Valiant] [Kearns & Schapire]
• Thm: the regression graph algorithm efficiently learns GAMs (new)
• Regression tree algorithm
• Regression graph algorithm [Mansour & McAllester]
• Correlation boosting (new)

Page 8:

Results for GAMs (new)

[Figure: n samples ∈ X × [0,1], X ⊆ R^d, enter the regression graph learner, which outputs h: R^d → [0,1]; h labels the sample points with values such as .1, .2, .4, .6, .7, .8.]

Thm: the regression graph learner efficiently learns GAMs.
• ∀ distributions over X × Y with E[y|x] = f(x) ∈ GAM, with probability 1-δ:
  – E[(h(x) - y)^2] ≤ E[(f(x) - y)^2] + O(LV · log(dn/δ)/n^{1/7})
  – runtime = poly(n, d)

Page 9:

Results for GAMs (new)

• f(x) = u(Σ_i f_i(x^{(i)}))
  – u: R → R monotonic and L-Lipschitz (L = max |u'(z)|)
  – f_i: R → R of bounded total variation, V = Σ_i ∫ |f_i'(z)| dz
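To make L and V concrete, here is a small numerical sketch (my own, not from the talk) that estimates both on a grid for the slide's example f(x) = e^{-||x||^2} restricted to [-1, 1]^d, where u(z) = e^z and f_i(z) = -z^2.

```python
import numpy as np

# Grid estimates of the GAM smoothness parameters:
# L = max |u'(z)| (Lipschitz constant of the link),
# V = sum_i integral |f_i'(z)| dz (total variation of the components).

zs = np.linspace(-1.0, 1.0, 10001)

def max_abs_derivative(g, grid):
    """Grid estimate of max |g'(z)|."""
    return np.max(np.abs(np.gradient(g(grid), grid)))

def total_variation(g, grid):
    """Grid estimate of integral |g'(z)| dz = sum of |increments|."""
    return np.sum(np.abs(np.diff(g(grid))))

d = 3
f_i = lambda z: -z * z
# u need only be Lipschitz on the range of sum_i f_i, here [-d, 0].
us = np.linspace(-d, 0.0, 10001)
L = max_abs_derivative(np.exp, us)   # max of e^z on [-d, 0] is 1
V = d * total_variation(f_i, zs)     # each f_i has total variation 2 on [-1, 1]
print(L, V)                          # ~1.0 and ~6.0
```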

Thm: the regression graph learner efficiently learns GAMs.
• ∀ distributions over X × Y with E[y|x] = f(x) ∈ GAM, with probability 1-δ:
  – E[(h(x) - y)^2] ≤ E[(f(x) - y)^2] + O(LV · log(dn/δ)/n^{1/7})
  – runtime = poly(n, d)

Page 10:

Results for GAMs (new)

Thm: the regression tree learner inefficiently learns GAMs.
• ∀ distributions over X × Y with E[y|x] = f(x) ∈ GAM:
  – E[(h(x) - y)^2] ≤ E[(f(x) - y)^2] + O(LV · (log(d)/log(n))^{1/4})
  – runtime = poly(n, d)
• "Inefficiently" because, under this bound, the excess error shrinks only like a power of 1/log(n), so very large samples are needed.

[Figure: n samples ∈ X × [0,1], X ⊆ R^d, enter the regression tree learner, which outputs h: R^d → [0,1].]
Page 11:

Regression Tree Algorithm

• Regression tree RT: R^d → [0,1]
• Training sample (x_1,y_1),(x_2,y_2),…,(x_n,y_n) ∈ R^d × [0,1]

[Figure: a single leaf holding all of (x_1,y_1),(x_2,y_2),… and predicting avg(y_1,y_2,…,y_n).]

Page 12:

Regression Tree Algorithm

• Regression tree RT: R^d → [0,1]
• Training sample (x_1,y_1),(x_2,y_2),…,(x_n,y_n) ∈ R^d × [0,1]

[Figure: a single test x^{(j)} ≥ θ; the leaf for {(x_i,y_i): x_i^{(j)} < θ} predicts avg(y_i: x_i^{(j)} < θ), and the leaf for {(x_i,y_i): x_i^{(j)} ≥ θ} predicts avg(y_i: x_i^{(j)} ≥ θ).]

Page 13:

Regression Tree Algorithm

• Regression tree RT: R^d → [0,1]
• Training sample (x_1,y_1),(x_2,y_2),…,(x_n,y_n) ∈ R^d × [0,1]

[Figure: the tree after a second split. Root test x^{(j)} ≥ θ; the left leaf predicts avg(y_i: x_i^{(j)} < θ); the right child tests x^{(j')} ≥ θ', with leaves predicting avg(y_i: x^{(j)} ≥ θ ∧ x^{(j')} < θ') and avg(y_i: x^{(j)} ≥ θ ∧ x^{(j')} ≥ θ').]

Page 14:

Regression Tree Algorithm

• n = amount of training data
• Put all the data into one leaf
• Repeat until size(RT) = n/log^2(n):
  – Greedily choose a leaf and a split x^{(j)} ≤ θ that minimize ε(RT, train) = Σ_i (RT(x_i) - y_i)^2 / n
  – Divide the data in the split node between the two new leaves
• Equivalent to the "Gini" splitting criterion (a runnable sketch follows)
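The talk leaves the learner at this pseudocode level; below is a minimal runnable sketch under my own representation choices (leaves kept as a flat partition of the sample, candidate thresholds taken from observed feature values).

```python
import numpy as np

# Sketch of the greedy regression tree learner of slide 14. Each leaf
# predicts the mean y of its data; each round we apply the single
# (leaf, feature, threshold) split that most decreases training squared
# error, stopping at size(RT) = n / log^2(n).

def sse(y):
    """Sum of squared errors of predicting mean(y)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Best (feature j, threshold theta, SSE decrease) for one leaf."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j])[1:]:       # both sides stay nonempty
            lo = X[:, j] < theta
            gain = sse(y) - sse(y[lo]) - sse(y[~lo])
            if gain > best[2]:
                best = (j, theta, gain)
    return best

def grow_tree(X, y):
    leaves = [(X, y)]                              # one leaf with all data
    max_leaves = max(2, int(len(y) / np.log2(len(y)) ** 2))
    while len(leaves) < max_leaves:
        scored = [(best_split(Xl, yl), i) for i, (Xl, yl) in enumerate(leaves)]
        (j, theta, _), i = max(scored, key=lambda s: s[0][2])
        if j is None:                              # no error-reducing split left
            break
        Xl, yl = leaves.pop(i)
        lo = Xl[:, j] < theta
        leaves += [(Xl[lo], yl[lo]), (Xl[~lo], yl[~lo])]
    return leaves                                  # each leaf predicts mean y

# Usage on toy additive data.
rng = np.random.default_rng(0)
X = rng.random((400, 4))
leaves = grow_tree(X, X.mean(axis=1))
print(len(leaves), [round(float(yl.mean()), 2) for _, yl in leaves])
```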

Page 15:

Regression Graph Algorithm [Mansour & McAllester]

• Regression graph RG: R^d → [0,1]
• Training sample (x_1,y_1),(x_2,y_2),…,(x_n,y_n) ∈ R^d × [0,1]

[Figure: a root test x^{(j)} ≥ θ. On the x^{(j)} ≥ θ branch, a test x^{(j')} ≥ θ' leads to leaves avg(y_i: x^{(j)} ≥ θ ∧ x^{(j')} ≥ θ') and avg(y_i: x^{(j)} ≥ θ ∧ x^{(j')} < θ'); on the x^{(j)} < θ branch, a test x^{(j'')} ≥ θ'' leads to leaves avg(y_i: x^{(j)} < θ ∧ x^{(j'')} ≥ θ'') and avg(y_i: x^{(j)} < θ ∧ x^{(j'')} < θ'').]

Page 16:

Regression Graph Algorithm [Mansour & McAllester]

• Regression graph RG: R^d → [0,1]
• Training sample (x_1,y_1),(x_2,y_2),…,(x_n,y_n) ∈ R^d × [0,1]

[Figure: the same structure after a merge: the two inner leaves are fused into a single node holding {(x_i,y_i): (x^{(j)} < θ ∧ x^{(j'')} ≥ θ'') ∨ (x^{(j)} ≥ θ ∧ x^{(j')} < θ')} and predicting avg(y_i) over that union. Such shared nodes are what make RG a graph rather than a tree.]

Page 17:

Regression Graph Algorithm [Mansour & McAllester]

• Put all n training data into one leaf
• Repeat until size(RG) = n^{3/7}:
  – Split: greedily choose a leaf and a split x^{(j)} ≤ θ to minimize ε(RG, train) = Σ_i (RG(x_i) - y_i)^2 / n
    • Divide the data in the split node into two new leaves
    • Let Δ be the decrease in ε(RG, train) from this split
  – Merge(s): greedily choose the two leaves whose merger increases ε(RG, train) as little as possible; repeat merging while the total increase in ε(RG, train) from merges is ≤ Δ/2
(a runnable sketch follows)
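A minimal runnable sketch of the split-then-merge loop, under the same assumptions and helpers as the tree sketch above; a real regression graph would also share merged nodes structurally across paths, which this flat-partition sketch omits.

```python
import numpy as np
from itertools import combinations

# Sketch of the regression graph learner of slide 17 (trees + merging,
# after Mansour & McAllester). sse/best_split are as in the tree sketch.

def sse(y):
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j])[1:]:
            lo = X[:, j] < theta
            gain = sse(y) - sse(y[lo]) - sse(y[~lo])
            if gain > best[2]:
                best = (j, theta, gain)
    return best

def grow_graph(X, y):
    leaves = [(X, y)]
    max_size = max(2, int(len(y) ** (3 / 7)))      # stop at size n^{3/7}
    while len(leaves) < max_size:
        # Split: greedy best split; delta = its decrease in training SSE.
        scored = [(best_split(Xl, yl), i) for i, (Xl, yl) in enumerate(leaves)]
        (j, theta, delta), i = max(scored, key=lambda s: s[0][2])
        if j is None:
            break
        Xl, yl = leaves.pop(i)
        lo = Xl[:, j] < theta
        leaves += [(Xl[lo], yl[lo]), (Xl[~lo], yl[~lo])]
        # Merge: repeatedly fuse the cheapest pair of leaves while the
        # total SSE increase from merges stays at most delta / 2.
        budget = delta / 2
        while len(leaves) > 2:
            cost, a, b = min(
                (sse(np.concatenate([ya, yb])) - sse(ya) - sse(yb), a, b)
                for (a, (_, ya)), (b, (_, yb))
                in combinations(enumerate(leaves), 2))
            if cost > budget:
                break
            budget -= cost
            Xb, yb = leaves.pop(b)                 # pop larger index first
            Xa, ya = leaves.pop(a)
            leaves.append((np.vstack([Xa, Xb]), np.concatenate([ya, yb])))
    return leaves

rng = np.random.default_rng(0)
X = rng.random((500, 4))
print(len(grow_graph(X, X.mean(axis=1))))          # roughly n^{3/7} leaves
```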

Page 18:

Two useful lemmas

• Uniform generalization bound: for any n, with probability 1-δ over training sets (x_1,y_1),…,(x_n,y_n), the training error of every regression graph R is simultaneously close to its true error.
• Existence of a correlated split: there always exists a split I(x^{(i)} ≤ θ) that is noticeably correlated with the target.
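The displayed inequality did not survive the transcript; a standard uniform bound of this shape (my reconstruction of the general form, not the talk's exact statement) would read: with probability 1-δ over the training set, simultaneously for every regression graph R over the d features and n points,

  |ε(R) - ε(R, train)| ≤ O( √( size(R) · log(dn/δ) / n ) ).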

Page 19:

Motivating natural example

• X = {0,1}^d, f(x) = (x^{(1)} + x^{(2)} + … + x^{(d)})/d, uniform distribution
• Size(RT) ≈ exp(Size(RG)^c), e.g. d = 4:

[Figure: for d = 4, the full regression tree tests x^{(1)} > ½, x^{(2)} > ½, x^{(3)} > ½, x^{(4)} > ½ along every path, with leaf values in {0, .25, .5, .75, 1}; the regression graph merges all nodes that agree on the count of ones seen so far, so each level stays small.]
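A quick way to see the separation (my own illustration, not code from the talk): f depends only on the number of ones in x, so a layered graph can merge all nodes that agree on the running count, while a tree keeps every path distinct.

```python
# Tree vs. graph size for f(x) = (x^(1)+...+x^(d))/d on X = {0,1}^d.
# Illustrative counting only.

def tree_leaves(d):
    """A tree splitting on every coordinate keeps one leaf per input."""
    return 2 ** d

def graph_nodes(d):
    """Layer k of the merged graph needs one node per count 0..k."""
    return sum(k + 1 for k in range(d + 1))

for d in (4, 10, 20):
    print(d, tree_leaves(d), graph_nodes(d))
# d=20: 1048576 tree leaves vs. 231 graph nodes, matching
# Size(RT) ≈ exp(Size(RG)^c).
```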

Page 20:

Regression boosting

• Incremental learning:
  – If you find anything with positive correlation with y, regression graphs can use it to make progress
  – "Weak regression" implies strong regression, i.e., small correlations can efficiently be combined to get correlation near 1 (error near 0)
  – This generalizes binary classification boosting [Kearns & Valiant, Schapire, Mansour & McAllester, …]

Page 21:

Conclusions

• Generalized additive models are very general
• Regression graphs, i.e., regression trees with merging, provably estimate GAMs using polynomial data and runtime
• Regression boosting generalizes binary classification boosting
• Future work:
  – Improve the algorithm and its analysis
  – Room for interesting work at the intersection of statistics and computational learning theory