Regression: Variance-Bias Trade-off
Regression
• We need a regression function h(x)
• We need a loss function L(h(x),y)
• We have a true distribution p(x,y)
• Assume a quadratic loss, then:
(Notation note: in other texts, y is written t and h(x) is written y(x).)
\[
\mathbb{E}[L] = \iint L(h(x),y)\, p(x,y)\, dy\, dx = \iint \big(h(x) - y\big)^2 p(x,y)\, dy\, dx
\]
\[
= \iint \big(h(x) - \mathbb{E}_{p(y|x)}[y] + \mathbb{E}_{p(y|x)}[y] - y\big)^2 p(x,y)\, dy\, dx
\]
\[
= \iint \Big[\big(h(x) - \mathbb{E}_{p(y|x)}[y]\big)^2 + \big(\mathbb{E}_{p(y|x)}[y] - y\big)^2 + 2\big(h(x) - \mathbb{E}_{p(y|x)}[y]\big)\big(\mathbb{E}_{p(y|x)}[y] - y\big)\Big] p(y \mid x)\, p(x)\, dy\, dx
\]
\[
= \underbrace{\int \big(h(x) - \mathbb{E}_{p(y|x)}[y]\big)^2 p(x)\, dx}_{\text{estimation error}} \;+\; \underbrace{\iint \big(\mathbb{E}_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx}_{\text{noise error}}
\]
(The cross term integrates to zero, since \(\mathbb{E}_{p(y|x)}\big[\mathbb{E}_{p(y|x)}[y] - y\big] = 0\) for every x.)
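As a quick numerical sanity check, the decomposition can be verified by simulation. In the sketch below the joint p(x,y), the conditional mean sin(2πx), the 0.3 noise level, and the particular h are all illustrative assumptions, not part of the slides:

```python
import numpy as np

# Assumed p(x,y): x ~ Uniform(0,1), y = sin(2*pi*x) + Gaussian noise,
# so E[y|x] = sin(2*pi*x) exactly.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

h = 0.8 * np.sin(2 * np.pi * x)    # some fixed regression function h(x)
cond_mean = np.sin(2 * np.pi * x)  # E[y|x] under this simulation

total = np.mean((h - y) ** 2)               # E[L]
estimation = np.mean((h - cond_mean) ** 2)  # estimation error
noise = np.mean((cond_mean - y) ** 2)       # noise error (about 0.3**2)

print(total, estimation + noise)  # equal up to Monte Carlo error
```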
Regression: Learning
• Assume h(x) is a parametric curve, e.g. h(x)=af(x)+b.
• Minimize loss over the parameters (e.g. a,b), where p(x,y) is replaced with a sum over data-cases (called a “Monte Carlo sum”):
• That is: we solve:
• The same result follows from posing a Gaussian model q(y|x) for p(y|x) with mean h(x) and maximizing the probability of the data over the parameters. (This approach is taken in 274; probabilistic learning.)
\[
\iint g(x,y)\, p(x,y)\, dx\, dy \;\approx\; \frac{1}{N} \sum_{i=1}^{N} g(x_i, y_i)
\]
\[
\theta^* = \arg\min_{\theta} \sum_i \big(h_{\theta}(x_i) - y_i\big)^2
\]
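For h(x) = a·f(x) + b this minimization is linear least squares in (a, b). The sketch below is a minimal illustration; the basis f(x) = sin(x), the true parameters, and the noise level are assumptions made for the example:

```python
import numpy as np

# Hypothetical data: y = 2*sin(x) + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0 * np.pi, size=100)
y = 2.0 * np.sin(x) + 1.0 + rng.normal(scale=0.3, size=x.shape)

# h(x) = a*f(x) + b with assumed basis f(x) = sin(x).
# Minimizing sum_i (h(x_i) - y_i)^2 is linear least squares in (a, b).
F = np.column_stack([np.sin(x), np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(F, y, rcond=None)

print(f"a = {a:.3f}, b = {b:.3f}")  # close to the true (2, 1)
```

The same (a, b) would be obtained by maximizing the Gaussian likelihood q(y|x) with mean h(x), since the log-likelihood equals the negative sum of squared errors up to constants.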
Back to Overfitting
• More parameters lead to more flexible functions, which may lead to over-fitting.
• Formalize this by imagining very many datasets D, all of size N. Call h(x,D) the regression function estimated from a dataset D of size N, i.e. a(D)f(x)+b(D); then:
\[
\mathbb{E}[L \mid D] = \int \big(h(x,D) - \mathbb{E}_{p(y|x)}[y]\big)^2 p(x)\, dx + \iint \big(\mathbb{E}_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx
\]
• Next, average over \(p(D) = p(x_1)\,p(x_2)\cdots p(x_N)\). Only the first term depends on D:
\[
\mathbb{E}_D\big[\mathbb{E}_{x,y}[L \mid D]\big] = \int \mathbb{E}_D\Big[\big(h(x,D) - \mathbb{E}_{p(y|x)}[y]\big)^2\Big] p(x)\, dx + \iint \big(\mathbb{E}_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx
\]
• Decompose the first term by adding and subtracting \(\mathbb{E}_D[h(x,D)]\):
\[
\iint \big(h(x,D) - \mathbb{E}_{p(y|x)}[y]\big)^2 p(x,D)\, dx\, dD
= \iint \big(h(x,D) - \mathbb{E}_D[h(x,D)] + \mathbb{E}_D[h(x,D)] - \mathbb{E}_{p(y|x)}[y]\big)^2 p(x,D)\, dx\, dD
\]
\[
= \underbrace{\iint \big(h(x,D) - \mathbb{E}_D[h(x,D)]\big)^2 p(x,D)\, dx\, dD + \int \big(\mathbb{E}_D[h(x,D)] - \mathbb{E}_{p(y|x)}[y]\big)^2 p(x)\, dx}_{\text{variance}\,+\,\text{bias}^2}
\]
\[
\quad + \underbrace{2 \iint \big(h(x,D) - \mathbb{E}_D[h(x,D)]\big)\big(\mathbb{E}_D[h(x,D)] - \mathbb{E}_{p(y|x)}[y]\big) p(x,D)\, dx\, dD}_{=\,0}
\]
The cross term vanishes because \(\mathbb{E}_D\big[h(x,D) - \mathbb{E}_D[h(x,D)]\big] = 0\) for every x.
Bias/Variance Tradeoff
\[
\mathbb{E}_D\big[\mathbb{E}_{x,y}[L \mid D]\big] =
\underbrace{\iint \big(\mathbb{E}_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx}_{A}
+ \underbrace{\iint \big(h(x,D) - \mathbb{E}_D[h(x,D)]\big)^2 p(x,D)\, dx\, dD}_{B}
+ \underbrace{\int \big(\mathbb{E}_D[h(x,D)] - \mathbb{E}_{p(y|x)}[y]\big)^2 p(x)\, dx}_{C}
\]
A: The label y fluctuates (label variance).
B: The estimate of h fluctuates across different datasets (estimation variance).
C: The average estimate of h does not fit well to the true curve (squared estimation bias).
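Terms B and C can be estimated by simulation, mirroring the many-datasets construction above. Everything concrete in this sketch — the true curve, the noise level, the dataset size N, and the degree-3 polynomial used for h — is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)  # plays the role of E[y|x]

N, n_datasets, degree = 25, 500, 3
x_grid = np.linspace(0.0, 1.0, 200)  # where we evaluate h(x, D)
fits = np.empty((n_datasets, x_grid.size))

for d in range(n_datasets):
    # Draw a fresh dataset D of size N and fit h(x, D).
    x = rng.uniform(0.0, 1.0, size=N)
    y = true_f(x) + rng.normal(scale=0.3, size=N)
    coeffs = np.polyfit(x, y, deg=degree)
    fits[d] = np.polyval(coeffs, x_grid)

avg_fit = fits.mean(axis=0)                         # E_D[h(x, D)]
variance = ((fits - avg_fit) ** 2).mean()           # term B
bias_sq = ((avg_fit - true_f(x_grid)) ** 2).mean()  # term C

print(f"variance = {variance:.4f}, bias^2 = {bias_sq:.4f}")
```

Averaging over the uniform x_grid stands in for the integral against p(x); raising the polynomial degree increases the variance term and shrinks the squared bias.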
Bias/Variance Illustration
[Figure: bias/variance illustration. Individual fits \(g_i(x) = h(x, D_i)\) from different datasets \(D_i\) scatter around their average \(g(x) = \mathbb{E}_D[h(x,D)]\); the spread of the \(g_i\) around \(g\) is the variance, and the gap between \(g\) and the true curve is the bias.]
Relation to Over-fitting
[Figure: test and training error vs. model flexibility; regularization increases to the left (less flexible models) and decreases to the right (more flexible models).]
Training error is measuring bias, but ignoring variance.
Testing error / X-validation error is measuring both bias and variance.
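A small simulation shows the same picture; the data-generating process and the use of polynomial degree as the flexibility knob are assumptions for illustration:

```python
import numpy as np

# Assumed data: y = sin(2*pi*x) + Gaussian noise.
rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_tr, y_tr = make_data(30)    # small training set
x_te, y_te = make_data(1000)  # large held-out set

for degree in (1, 3, 5, 9):   # more flexible models to the right
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree}: train {train_err:.3f}, test {test_err:.3f}")
```

Training error shrinks monotonically with flexibility (it tracks bias only), while test error eventually turns back up as the variance term takes over.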