1
Regression and the Bias-Variance Decomposition
William Cohen
10-601 April 2008
Readings: Bishop 3.1,3.2
2
Regression
• Technically: learning a function f(x)=y where y is real-valued, rather than discrete.
– Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
– Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
– …
3
Example: univariate linear regression
• Example: predict age from number of publications
[Scatter plot: Age in Years vs. Number of Publications]
4
Linear regression
• Model: yi = axi + b + εi where εi ~ N(0,σ)
• Training Data: (x1,y1),….(xn,yn)
• Goal: estimate a, b with ŵ = (â, b̂)
$$
\begin{aligned}
\hat{\mathbf{w}} &= \arg\max_{\mathbf{w}} \Pr(\mathbf{w} \mid D) \\
&= \arg\max_{\mathbf{w}} \Pr(D \mid \mathbf{w})\,\Pr(\mathbf{w}) \\
&= \arg\max_{\mathbf{w}} \log \Pr(D \mid \mathbf{w}) \qquad \text{(assume MLE)} \\
&= \arg\max_{\mathbf{w}} \sum_i \log \Pr(y_i \mid x_i, \mathbf{w}) \\
&= \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2, \qquad \text{where } \hat{y}_i(\mathbf{w}) = \hat{a}x_i + \hat{b}
\end{aligned}
$$
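A minimal numerical check of this derivation (not from the slides; the data, σ, and candidate lines below are made up): under Gaussian noise with known σ, ranking candidate (a, b) pairs by sum of squared errors gives the same ordering as ranking them by log-likelihood.

```python
# Sketch: minimizing squared error = maximizing Gaussian log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 2.0
x = rng.uniform(0, 10, n)
y = 1.5 * x + 3.0 + rng.normal(0, sigma, n)   # true a=1.5, b=3.0 (made up)

def sse(a, b):
    """Sum of squared errors for the line y-hat = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

def log_likelihood(a, b):
    """Gaussian log-likelihood under y_i = a*x_i + b + eps, eps ~ N(0, sigma)."""
    return -n / 2 * np.log(2 * np.pi * sigma ** 2) - sse(a, b) / (2 * sigma ** 2)

for a, b in [(1.0, 0.0), (1.5, 3.0), (2.0, 5.0)]:
    print(f"a={a}, b={b}: SSE={sse(a, b):9.1f}  logL={log_likelihood(a, b):9.1f}")
# The candidate minimizing SSE is exactly the one maximizing logL.
```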
5
Linear regression
• Model: yi = axi + b + εi where εi ~ N(0,σ)
• Training Data: (x1,y1),….(xn,yn)
• Goal: estimate a, b with ŵ = (â, b̂)
• Ways to estimate parameters
– Find derivative wrt parameters a, b
– Set to zero and solve
• Or use gradient ascent to solve
• Or …
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \hat{\varepsilon}_i^{\,2}$$
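As an illustration of the gradient-based option, here is a minimal gradient-descent sketch on synthetic data (the learning rate, iteration count, and data are arbitrary assumptions, not from the slides):

```python
# Sketch: gradient descent on the mean squared error for y-hat = a*x + b.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 1.5 * x + 3.0 + rng.normal(0, 1.0, 100)

a, b = 0.0, 0.0          # initial guess
lr = 0.01                # learning rate (assumed)
for _ in range(5000):
    err = (a * x + b) - y            # residuals under the current (a, b)
    a -= lr * 2 * np.mean(err * x)   # gradient of the MSE wrt a
    b -= lr * 2 * np.mean(err)       # gradient of the MSE wrt b

print(a, b)   # approaches the least-squares estimates
```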
6
Linear regression
[Figure: data points (x₁, y₁), (x₂, y₂) with residuals d₁, d₂, d₃ around the fitted line]
How to estimate the slope? Each point suggests a slope of about $\frac{y_i - \bar{y}}{x_i - \bar{x}}$; combining these suggestions, weighting each by how far $x_i$ is from $\bar{x}$, gives
$$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{n\cdot\mathrm{cov}(X,Y)}{n\cdot\mathrm{var}(X)}$$
7
Linear regression
How to estimate the intercept?
The fitted line is $\hat{y} = \hat{a}x + \hat{b}$.
[Figure: same data points and residuals d₁, d₂, d₃ as on the previous slide]
$$\hat{b} = \bar{y} - \hat{a}\bar{x}$$
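Putting the last two slides together, a minimal sketch of the closed-form estimates on synthetic data (the data-generating line is made up):

```python
# Sketch: a-hat = cov(X,Y)/var(X), b-hat = y-bar - a-hat * x-bar.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.5 * x + 3.0 + rng.normal(0, 1.0, 200)

x_bar, y_bar = x.mean(), y.mean()
a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b_hat = y_bar - a_hat * x_bar
print(a_hat, b_hat)          # roughly 1.5 and 3.0
print(np.polyfit(x, y, 1))   # numpy's least-squares fit agrees
```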
8
Bias/Variance Decomposition of Error
9
Bias – Variance decomposition of error
• Return to the simple regression problem f: X → Y
$$y = f(x) + \varepsilon$$
where f is deterministic and the noise ε ~ N(0, σ).
• What is the expected error for a learned h?
10
Bias – Variance decomposition of error
$$E_D\left[\int \big(f(x) - h_D(x)\big)^2 \Pr(x)\, dx\right]$$
where f is the true function, D is the dataset, and h_D is learned from D.
Experiment (the error of which I’d like to predict):
1. Draw size n sample D=(x1,y1),….(xn,yn)
2. Train linear function hD using D
3. Draw a test example (x,f(x)+ε)
4. Measure squared error of hD on that example
11
Bias – Variance decomposition of error (2)
$$E_{D,\varepsilon}\left[\big(f(x) + \varepsilon - h_D(x)\big)^2\right]$$
where f is the true function, D is the dataset, and h_D is learned from D.
Fix x, then do this experiment:
1. Draw size n sample D=(x1,y1),….(xn,yn)
2. Train linear function hD using D
3. Draw the test example (x,f(x)+ε)
4. Measure squared error of hD on that example
12
Bias – Variance decomposition of error
Write $t = f(x) + \varepsilon$ for the (noisy) target and $\hat{y}$ for the learned prediction; really $\hat{y} = \hat{y}_D = h_D(x)$, so the error we want is
$$E_{D,\varepsilon}\big[(t - \hat{y})^2\big] = E_{D,\varepsilon}\Big[\big(f(x) + \varepsilon - h_D(x)\big)^2\Big]$$
(why not work with $f(x) - h_D(x)$ alone? because the observed target is noisy). Expanding:
$$
\begin{aligned}
E\big[(t - \hat{y})^2\big]
&= E\Big[\big((t - f) + (f - \hat{y})\big)^2\Big] \\
&= E\big[(t - f)^2 + (f - \hat{y})^2 + 2(t - f)(f - \hat{y})\big] \\
&= E\big[(t - f)^2\big] + E\big[(f - \hat{y})^2\big] + 2\,E\big[(t - f)(f - \hat{y})\big]
\end{aligned}
$$
13
Bias – Variance decomposition of error
$$
\begin{aligned}
E_{D,\varepsilon}\big[(t - \hat{y})^2\big]
&= E\big[(t - f)^2\big] + E\big[(f - \hat{y})^2\big] + 2\big(E[tf] - E[t\hat{y}] - E[f^2] + E[f\hat{y}]\big) \\
&= E\big[(t - f)^2\big] + E\big[(f - \hat{y})^2\big]
\end{aligned}
$$
The cross term vanishes because $t - f = \varepsilon$ has mean zero and is independent of $\hat{y}$.
• $E\big[(t - f)^2\big]$: intrinsic noise
• $E\big[(f - \hat{y})^2\big]$: depends on how well the learner approximates f
14
Bias – Variance decomposition of error
Now decompose the second term. Let $\bar{h} = E_D\{h_D(x)\}$ and $\hat{y} = \hat{y}_D = h_D(x)$:
$$
\begin{aligned}
E\big[(f - \hat{y})^2\big]
&= E\Big[\big((f - \bar{h}) + (\bar{h} - \hat{y})\big)^2\Big] \\
&= E\big[(f - \bar{h})^2\big] + E\big[(\bar{h} - \hat{y})^2\big] + 2\,E\big[(f - \bar{h})(\bar{h} - \hat{y})\big] \\
&= \underbrace{\big(f - \bar{h}\big)^2}_{\text{BIAS}^2} + \underbrace{E_D\big[(\bar{h} - \hat{y})^2\big]}_{\text{VARIANCE}}
\end{aligned}
$$
since $E_D[\bar{h} - \hat{y}] = 0$, the cross term again vanishes.
• BIAS²: squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, $E_D[h_D(x)]$.
• VARIANCE: squared difference between our long-term expectation for the learner's performance, $E_D[h_D(x)]$, and what we expect in a representative run on a dataset D ($\hat{y}$).
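The decomposition can be checked numerically. A minimal simulation sketch of the experiment from the earlier slides (the true function, noise level, sample size, and test point are all assumptions for illustration):

```python
# Sketch: fix x, repeatedly draw D, fit a line h_D, and compare
# noise + bias^2 + variance to the measured expected squared error.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(x)            # true function (assumed)
sigma, n, trials = 0.3, 20, 5000
x_test = 1.0

preds, sq_errs = [], []
for _ in range(trials):
    xs = rng.uniform(0, 3, n)                      # draw D of size n
    ys = f(xs) + rng.normal(0, sigma, n)
    a, b = np.polyfit(xs, ys, 1)                   # train linear h_D
    y_hat = a * x_test + b
    t = f(x_test) + rng.normal(0, sigma)           # noisy test label
    preds.append(y_hat)
    sq_errs.append((t - y_hat) ** 2)

preds = np.array(preds)
noise = sigma ** 2
bias2 = (f(x_test) - preds.mean()) ** 2
variance = preds.var()
print("noise + bias^2 + variance =", noise + bias2 + variance)
print("measured expected squared error =", np.mean(sq_errs))   # should roughly match
```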
15
Bias-variance decomposition
• How can you reduce bias of a learner? Make the long-term average better approximate the true function f(x).
• How can you reduce variance of a learner? Make the learner less sensitive to variations in the data.
16
A generalization of bias-variance decomposition to other loss functions
• "Arbitrary" real-valued loss L(t,y), with L(y,y') = L(y',y), L(y,y) = 0, and L(y,y') ≠ 0 if y ≠ y'
• Define "optimal prediction": y* = argmin_{y'} L(t,y')
• Define "main prediction of learner": y_m = y_{m,D} = argmin_{y'} E_D{L(t,y')}
• Define "bias of learner": B(x) = L(y*, y_m)
• Define "variance of learner": V(x) = E_D[L(y_m, y)]
• Define "noise for x": N(x) = E_t[L(t, y*)]
Claim:
$$E_{D,t}[L(t,y)] = c_1 N(x) + B(x) + c_2 V(x)$$
where c₁ = 2 Pr_D[y = y*] − 1; c₂ = 1 if y_m = y*, −1 else; m = n = |D|.
17
Other regression methods
18
Example: univariate linear regression
• Example: predict age from number of publications
[Scatter plot: Age in Years vs. Number of Publications, with the fitted regression line]
Fitted line: $\hat{y} \approx \tfrac{1}{7}x + 26$
Paul Erdős, Hungarian mathematician, 1913-1996: with x ≈ 1500 publications, the predicted age is about 240.
19
Linear regression
[Figure: data points with residuals d₁, d₂, d₃ around the fitted line]
Summary:
$$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\bar{x}$$
To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x₁,…,xₙ) and y = (y₁,…,yₙ)
• then
$$\hat{a} = (\mathbf{x}^T\mathbf{y})(\mathbf{x}^T\mathbf{x})^{-1}, \qquad \hat{b} = 0$$
20
Onward: multivariate linear regression
Univariate (zero-centered): $\hat{w} = (\mathbf{x}^T\mathbf{y})(\mathbf{x}^T\mathbf{x})^{-1}$, with $\mathbf{x} = (x_1,\ldots,x_n)$ and $\mathbf{y} = (y_1,\ldots,y_n)$.
Multivariate: $\hat{y} = \hat{w}_1 x_1 + \cdots + \hat{w}_k x_k = \hat{\mathbf{w}}^T\mathbf{x}$, and
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2, \qquad \hat{y}_i(\mathbf{w}) = \mathbf{w}^T\mathbf{x}_i$$
$$\hat{\mathbf{w}} = (X^TX)^{-1}X^T\mathbf{y}$$
where
$$X = \begin{pmatrix} x_{11} & \ldots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \ldots & x_{nk} \end{pmatrix} \ \text{(each row is an example, each column a feature)}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
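A minimal sketch of this closed form on synthetic data (the dimensions and true weights below are made up for illustration):

```python
# Sketch: w-hat = (X^T X)^{-1} X^T y via a linear solve.
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = rng.normal(size=(n, k))            # rows are examples, columns are features
w_true = np.array([2.0, -1.0, 0.5])    # made-up true weights
y = X @ w_true + rng.normal(0, 0.1, n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) w = X^T y
print(w_hat)                                # close to w_true
```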
21
Onward: multivariate linear regression
As above:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2 = (X^TX)^{-1}X^T\mathbf{y}, \qquad \hat{y}_i(\mathbf{w}) = \mathbf{w}^T\mathbf{x}_i$$
Regularized:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (\lambda I + X^TX)^{-1}X^T\mathbf{y}$$
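A minimal sketch of the regularized closed form, again on made-up synthetic data; the λ values are arbitrary and already hint at the question on the next slide:

```python
# Sketch: regularized least squares, w-hat = (lambda*I + X^T X)^{-1} X^T y.
import numpy as np

def regularized_fit(X, y, lam):
    """Closed-form regularized least squares."""
    k = X.shape[1]
    return np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)
for lam in [0.0, 1.0, 100.0]:
    print(lam, regularized_fit(X, y, lam))   # larger lambda shrinks the weights toward 0
```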
23
Onward: multivariate linear regression
Regularized, as above:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (\lambda I + X^TX)^{-1}X^T\mathbf{y}$$
What does increasing λ do?
24
Onward: multivariate linear regression
Regularized, as above, but with X containing a constant feature:
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \hat{\mathbf{w}} = (\lambda I + X^TX)^{-1}X^T\mathbf{y}$$
w=(w1,w2) What does fixing w2=0 do (if λ=0)?
25
Regression trees - summary
Regression trees [Quinlan's M5] adapt the standard decision-tree recipe:
• Growing tree: split to optimize information gain
• At each leaf node: instead of predicting the majority class, build a linear model, then greedily remove features
• Pruning tree: instead of pruning to reduce error on a holdout set, use the estimated error on training data; estimates are adjusted by (n+k)/(n−k), where n = #cases and k = #features
• Prediction: trace the path to a leaf, but predict using a linear interpolation of every prediction made by every node on the path (rather than a single leaf value)
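Quinlan's M5 builds linear models at the leaves; as a rough illustration of a plain regression tree instead (constant prediction at each leaf), here is a minimal scikit-learn sketch on made-up data. DecisionTreeRegressor is a standard regression tree, not M5:

```python
# Sketch: a standard regression tree (constant leaf predictions), not M5.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # depth limit acts like pre-pruning
print(tree.predict([[1.0], [2.5]]))                   # piecewise-constant predictions
```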
26
Regression trees: example - 1
27
Regression trees – example 2
What does pruning do to bias and variance?
28
Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
29
Kernel regression
• aka locally weighted regression, locally linear regression, …
• Close approximation to kernel regression:
– Pick a few values z1,…,zk up front
– Preprocess: for each example (x,y), replace x with x = <K(x,z1),…,K(x,zk)>, where K(x,z) = exp(−(x−z)² / 2σ²)
– Use multivariate regression on the (x,y) pairs
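A minimal sketch of this approximation (the centers z1,…,zk, the width σ, and the data are all assumptions for illustration):

```python
# Sketch: map each x to <K(x,z1),...,K(x,zk)>, then run ordinary least squares.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 3, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

centers = np.linspace(0, 3, 10)   # the z_1,...,z_k picked up front (assumed)
sigma = 0.5                       # kernel width (assumed)

def rbf_features(x):
    """Replace each scalar x with the vector <K(x,z1),...,K(x,zk)>."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi = rbf_features(x)
# Multivariate regression on the transformed features (tiny ridge term for stability).
w_hat = np.linalg.solve(Phi.T @ Phi + 1e-8 * np.eye(len(centers)), Phi.T @ y)
print(rbf_features(np.array([0.5, 1.5])) @ w_hat)   # smooth predictions near sin(x)
```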
30
Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
What does making the kernel wider do to bias and variance?
31
Additional readings
• P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
• J. R. Quinlan, Learning with Continuous Classes, 5th Australian Joint Conference on Artificial Intelligence, 1992.
• Y. Wang & I. Witten, Inducing model trees for continuous classes, 9th European Conference on Machine Learning, 1997
• D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models, JAIR, 1996.