1
Regression and the Bias-Variance Decomposition
William Cohen
10-601 April 2008
Readings: Bishop 3.1,3.2
2
Regression
• Technically: learning a function f(x)=y where y is real-valued, rather than discrete.
– Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
– Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
– …
3
Example: univariate linear regression
• Example: predict age from number of publications
[Scatter plot: Age in Years vs. Number of Publications]
4
Linear regression
• Model: yi = axi + b + εi where εi ~ N(0,σ)
• Training Data: (x1,y1),….(xn,yn)
• Goal: estimate a, b with ŵ = (â, b̂)
$$
\begin{aligned}
\hat{\mathbf{w}} &= \arg\max_{\mathbf{w}} \Pr(\mathbf{w} \mid D) \\
&= \arg\max_{\mathbf{w}} \Pr(D \mid \mathbf{w})\,\Pr(\mathbf{w}) \\
&= \arg\max_{\mathbf{w}} \log \Pr(D \mid \mathbf{w}) \qquad \text{(assume MLE)} \\
&= \arg\max_{\mathbf{w}} \sum_i \log \Pr(y_i \mid x_i, \mathbf{w}) \\
&= \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2, \qquad \text{where } \hat{y}_i(\mathbf{w}) = \hat{a}x_i + \hat{b}
\end{aligned}
$$
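A minimal numerical check of this derivation (not from the slides; the data, σ, and candidate lines below are made up): under Gaussian noise with known σ, ranking candidate (a, b) pairs by sum of squared errors gives the same ordering as ranking them by log-likelihood.

```python
# Sketch: minimizing squared error = maximizing Gaussian log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 2.0
x = rng.uniform(0, 10, n)
y = 1.5 * x + 3.0 + rng.normal(0, sigma, n)   # true a=1.5, b=3.0 (made up)

def sse(a, b):
    """Sum of squared errors for the line y-hat = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

def log_likelihood(a, b):
    """Gaussian log-likelihood under y_i = a*x_i + b + eps, eps ~ N(0, sigma)."""
    return -n / 2 * np.log(2 * np.pi * sigma ** 2) - sse(a, b) / (2 * sigma ** 2)

for a, b in [(1.0, 0.0), (1.5, 3.0), (2.0, 5.0)]:
    print(f"a={a}, b={b}: SSE={sse(a, b):9.1f}  logL={log_likelihood(a, b):9.1f}")
# The candidate minimizing SSE is exactly the one maximizing logL.
```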
5
Linear regression
• Model: yi = axi + b + εi where εi ~ N(0,σ)
• Training Data: (x1,y1),….(xn,yn)
• Goal: estimate a, b with ŵ = (â, b̂)
• Ways to estimate parameters
– Find derivative wrt parameters a, b
– Set to zero and solve
• Or use gradient ascent to solve
• Or …
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \hat{\varepsilon}_i^{\,2}$$
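As an illustration of the gradient-based option, here is a minimal gradient-descent sketch on synthetic data (the learning rate, iteration count, and data are arbitrary assumptions, not from the slides):

```python
# Sketch: gradient descent on the mean squared error for y-hat = a*x + b.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 1.5 * x + 3.0 + rng.normal(0, 1.0, 100)

a, b = 0.0, 0.0          # initial guess
lr = 0.01                # learning rate (assumed)
for _ in range(5000):
    err = (a * x + b) - y            # residuals under the current (a, b)
    a -= lr * 2 * np.mean(err * x)   # gradient of the MSE wrt a
    b -= lr * 2 * np.mean(err)       # gradient of the MSE wrt b

print(a, b)   # approaches the least-squares estimates
```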
6
Linear regression
[Figure: data points (x₁, y₁), (x₂, y₂) with residuals d₁, d₂, d₃ around the fitted line]
How to estimate the slope? Each point suggests a slope of about $\frac{y_i - \bar{y}}{x_i - \bar{x}}$; combining these suggestions, weighting each by how far $x_i$ is from $\bar{x}$, gives
$$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{n\cdot\mathrm{cov}(X,Y)}{n\cdot\mathrm{var}(X)}$$
7
Linear regression
How to estimate the intercept?
The fitted line is $\hat{y} = \hat{a}x + \hat{b}$.
[Figure: same data points and residuals d₁, d₂, d₃ as on the previous slide]
$$\hat{b} = \bar{y} - \hat{a}\bar{x}$$
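Putting the last two slides together, a minimal sketch of the closed-form estimates on synthetic data (the data-generating line is made up):

```python
# Sketch: a-hat = cov(X,Y)/var(X), b-hat = y-bar - a-hat * x-bar.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.5 * x + 3.0 + rng.normal(0, 1.0, 200)

x_bar, y_bar = x.mean(), y.mean()
a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b_hat = y_bar - a_hat * x_bar
print(a_hat, b_hat)          # roughly 1.5 and 3.0
print(np.polyfit(x, y, 1))   # numpy's least-squares fit agrees
```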
8
Bias/Variance Decomposition of Error
9
Bias – Variance decomposition of error
• Return to the simple regression problem f: X → Y
$$y = f(x) + \varepsilon$$
where f is deterministic and the noise ε ~ N(0, σ).
• What is the expected error for a learned h?
10
Bias – Variance decomposition of error
$$E_D\left[\int \big(f(x) - h_D(x)\big)^2 \Pr(x)\, dx\right]$$
where f is the true function, D is the dataset, and h_D is learned from D.
Experiment (the error of which I’d like to predict):
1. Draw size n sample D=(x1,y1),….(xn,yn)
2. Train linear function hD using D
3. Draw a test example (x,f(x)+ε)
4. Measure squared error of hD on that example
11
Bias – Variance decomposition of error (2)
$$E_{D,\varepsilon}\left[\big(f(x) + \varepsilon - h_D(x)\big)^2\right]$$
where f is the true function, D is the dataset, and h_D is learned from D.
Fix x, then do this experiment:
1. Draw size n sample D=(x1,y1),….(xn,yn)
2. Train linear function hD using D
3. Draw the test example (x,f(x)+ε)
4. Measure squared error of hD on that example
12
Bias – Variance decomposition of error
Write $t = f(x) + \varepsilon$ for the (noisy) target and $\hat{y}$ for the learned prediction; really $\hat{y} = \hat{y}_D = h_D(x)$, so the error we want is
$$E_{D,\varepsilon}\big[(t - \hat{y})^2\big] = E_{D,\varepsilon}\Big[\big(f(x) + \varepsilon - h_D(x)\big)^2\Big]$$
(why not work with $f(x) - h_D(x)$ alone? because the observed target is noisy). Expanding:
$$
\begin{aligned}
E\big[(t - \hat{y})^2\big]
&= E\Big[\big((t - f) + (f - \hat{y})\big)^2\Big] \\
&= E\big[(t - f)^2 + (f - \hat{y})^2 + 2(t - f)(f - \hat{y})\big] \\
&= E\big[(t - f)^2\big] + E\big[(f - \hat{y})^2\big] + 2\,E\big[(t - f)(f - \hat{y})\big]
\end{aligned}
$$
13
Bias – Variance decomposition of error
$$
\begin{aligned}
E_{D,\varepsilon}\big[(t - \hat{y})^2\big]
&= E\big[(t - f)^2\big] + E\big[(f - \hat{y})^2\big] + 2\big(E[tf] - E[t\hat{y}] - E[f^2] + E[f\hat{y}]\big) \\
&= E\big[(t - f)^2\big] + E\big[(f - \hat{y})^2\big]
\end{aligned}
$$
The cross term vanishes because $t - f = \varepsilon$ has mean zero and is independent of $\hat{y}$.
• $E\big[(t - f)^2\big]$: intrinsic noise
• $E\big[(f - \hat{y})^2\big]$: depends on how well the learner approximates f
14
Bias – Variance decomposition of error
Now decompose the second term. Let $\bar{h} = E_D\{h_D(x)\}$ and $\hat{y} = \hat{y}_D = h_D(x)$:
$$
\begin{aligned}
E\big[(f - \hat{y})^2\big]
&= E\Big[\big((f - \bar{h}) + (\bar{h} - \hat{y})\big)^2\Big] \\
&= E\big[(f - \bar{h})^2\big] + E\big[(\bar{h} - \hat{y})^2\big] + 2\,E\big[(f - \bar{h})(\bar{h} - \hat{y})\big] \\
&= \underbrace{\big(f - \bar{h}\big)^2}_{\text{BIAS}^2} + \underbrace{E_D\big[(\bar{h} - \hat{y})^2\big]}_{\text{VARIANCE}}
\end{aligned}
$$
since $E_D[\bar{h} - \hat{y}] = 0$, the cross term again vanishes.
• BIAS²: squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, $E_D[h_D(x)]$.
• VARIANCE: squared difference between our long-term expectation for the learner's performance, $E_D[h_D(x)]$, and what we expect in a representative run on a dataset D ($\hat{y}$).
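The decomposition can be checked numerically. A minimal simulation sketch of the experiment from the earlier slides (the true function, noise level, sample size, and test point are all assumptions for illustration):

```python
# Sketch: fix x, repeatedly draw D, fit a line h_D, and compare
# noise + bias^2 + variance to the measured expected squared error.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(x)            # true function (assumed)
sigma, n, trials = 0.3, 20, 5000
x_test = 1.0

preds, sq_errs = [], []
for _ in range(trials):
    xs = rng.uniform(0, 3, n)                      # draw D of size n
    ys = f(xs) + rng.normal(0, sigma, n)
    a, b = np.polyfit(xs, ys, 1)                   # train linear h_D
    y_hat = a * x_test + b
    t = f(x_test) + rng.normal(0, sigma)           # noisy test label
    preds.append(y_hat)
    sq_errs.append((t - y_hat) ** 2)

preds = np.array(preds)
noise = sigma ** 2
bias2 = (f(x_test) - preds.mean()) ** 2
variance = preds.var()
print("noise + bias^2 + variance =", noise + bias2 + variance)
print("measured expected squared error =", np.mean(sq_errs))   # should roughly match
```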
15
Bias-variance decomposition
• How can you reduce bias of a learner? Make the long-term average better approximate the true function f(x).
• How can you reduce variance of a learner? Make the learner less sensitive to variations in the data.
16
A generalization of bias-variance decomposition to other loss functions
• "Arbitrary" real-valued loss L(t,y), with L(y,y') = L(y',y), L(y,y) = 0, and L(y,y') ≠ 0 if y ≠ y'
• Define "optimal prediction": y* = argmin_{y'} L(t,y')
• Define "main prediction of learner": y_m = y_{m,D} = argmin_{y'} E_D{L(t,y')}
• Define "bias of learner": B(x) = L(y*, y_m)
• Define "variance of learner": V(x) = E_D[L(y_m, y)]
• Define "noise for x": N(x) = E_t[L(t, y*)]
Claim:
$$E_{D,t}[L(t,y)] = c_1 N(x) + B(x) + c_2 V(x)$$
where c₁ = 2 Pr_D[y = y*] − 1; c₂ = 1 if y_m = y*, −1 else; m = n = |D|.
17
Other regression methods
18
Example: univariate linear regression
• Example: predict age from number of publications
[Scatter plot: Age in Years vs. Number of Publications, with the fitted regression line]
Fitted line: $\hat{y} \approx \tfrac{1}{7}x + 26$
Paul Erdős, Hungarian mathematician, 1913-1996: with x ≈ 1500 publications, the predicted age is about 240.
19
Linear regression
[Figure: data points with residuals d₁, d₂, d₃ around the fitted line]
Summary:
$$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\bar{x}$$
To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x₁,…,xₙ) and y = (y₁,…,yₙ)
• then
$$\hat{a} = (\mathbf{x}^T\mathbf{y})(\mathbf{x}^T\mathbf{x})^{-1}, \qquad \hat{b} = 0$$
20
Onward: multivariate linear regression
Univariate (zero-centered): $\hat{w} = (\mathbf{x}^T\mathbf{y})(\mathbf{x}^T\mathbf{x})^{-1}$, with $\mathbf{x} = (x_1,\ldots,x_n)$ and $\mathbf{y} = (y_1,\ldots,y_n)$.
Multivariate: $\hat{y} = \hat{w}_1 x_1 + \cdots + \hat{w}_k x_k = \hat{\mathbf{w}}^T\mathbf{x}$, and
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2, \qquad \hat{y}_i(\mathbf{w}) = \mathbf{w}^T\mathbf{x}_i$$
$$\hat{\mathbf{w}} = (X^TX)^{-1}X^T\mathbf{y}$$
where
$$X = \begin{pmatrix} x_{11} & \ldots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \ldots & x_{nk} \end{pmatrix} \ \text{(each row is an example, each column a feature)}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
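A minimal sketch of this closed form on synthetic data (the dimensions and true weights below are made up for illustration):

```python
# Sketch: w-hat = (X^T X)^{-1} X^T y via a linear solve.
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = rng.normal(size=(n, k))            # rows are examples, columns are features
w_true = np.array([2.0, -1.0, 0.5])    # made-up true weights
y = X @ w_true + rng.normal(0, 0.1, n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) w = X^T y
print(w_hat)                                # close to w_true
```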
21
Onward: multivariate linear regression
As above:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2 = (X^TX)^{-1}X^T\mathbf{y}, \qquad \hat{y}_i(\mathbf{w}) = \mathbf{w}^T\mathbf{x}_i$$
Regularized:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (\lambda I + X^TX)^{-1}X^T\mathbf{y}$$
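A minimal sketch of the regularized closed form, again on made-up synthetic data; the λ values are arbitrary and already hint at the question on the next slide:

```python
# Sketch: regularized least squares, w-hat = (lambda*I + X^T X)^{-1} X^T y.
import numpy as np

def regularized_fit(X, y, lam):
    """Closed-form regularized least squares."""
    k = X.shape[1]
    return np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)
for lam in [0.0, 1.0, 100.0]:
    print(lam, regularized_fit(X, y, lam))   # larger lambda shrinks the weights toward 0
```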
23
Onward: multivariate linear regression
Regularized, as above:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat{y}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w} = (\lambda I + X^TX)^{-1}X^T\mathbf{y}$$
What does increasing λ do?
24
Onward: multivariate linear regression
Regularized, as above, but with X containing a constant feature:
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \hat{\mathbf{w}} = (\lambda I + X^TX)^{-1}X^T\mathbf{y}$$
w=(w1,w2) What does fixing w2=0 do (if λ=0)?
25
Regression trees - summary
Regression trees [Quinlan's M5] adapt the standard decision-tree recipe:
• Growing tree: split to optimize information gain
• At each leaf node: instead of predicting the majority class, build a linear model, then greedily remove features
• Pruning tree: instead of pruning to reduce error on a holdout set, use the estimated error on training data; estimates are adjusted by (n+k)/(n−k), where n = #cases and k = #features
• Prediction: trace the path to a leaf, but predict using a linear interpolation of every prediction made by every node on the path (rather than a single leaf value)
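Quinlan's M5 builds linear models at the leaves; as a rough illustration of a plain regression tree instead (constant prediction at each leaf), here is a minimal scikit-learn sketch on made-up data. DecisionTreeRegressor is a standard regression tree, not M5:

```python
# Sketch: a standard regression tree (constant leaf predictions), not M5.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # depth limit acts like pre-pruning
print(tree.predict([[1.0], [2.5]]))                   # piecewise-constant predictions
```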
26
Regression trees: example - 1
27
Regression trees – example 2
What does pruning do to bias and variance?
28
Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
29
Kernel regression
• aka locally weighted regression, locally linear regression, …
• Close approximation to kernel regression:
– Pick a few values z1,…,zk up front
– Preprocess: for each example (x,y), replace x with x = <K(x,z1),…,K(x,zk)>, where K(x,z) = exp(−(x−z)² / 2σ²)
– Use multivariate regression on the (x,y) pairs
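A minimal sketch of this approximation (the centers z1,…,zk, the width σ, and the data are all assumptions for illustration):

```python
# Sketch: map each x to <K(x,z1),...,K(x,zk)>, then run ordinary least squares.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 3, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

centers = np.linspace(0, 3, 10)   # the z_1,...,z_k picked up front (assumed)
sigma = 0.5                       # kernel width (assumed)

def rbf_features(x):
    """Replace each scalar x with the vector <K(x,z1),...,K(x,zk)>."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi = rbf_features(x)
# Multivariate regression on the transformed features (tiny ridge term for stability).
w_hat = np.linalg.solve(Phi.T @ Phi + 1e-8 * np.eye(len(centers)), Phi.T @ y)
print(rbf_features(np.array([0.5, 1.5])) @ w_hat)   # smooth predictions near sin(x)
```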
30
Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
What does making the kernel wider do to bias and variance?
31
Additional readings
• P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
• J. R. Quinlan, Learning with Continuous Classes, 5th Australian Joint Conference on Artificial Intelligence, 1992.
• Y. Wang & I. Witten, Inducing model trees for continuous classes, 9th European Conference on Machine Learning, 1997
• D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models, JAIR, 1996.