Regression Variance-Bias Trade-off


Page 1:

Regression Variance-Bias Trade-off

Page 2:

Regression

• We need a regression function h(x)

• We need a loss function L(h(x),y)

• We have a true distribution p(x,y)

• Assume a quadratic loss, then:

[Figure: data points y_t and the regression curve h(x).]

E[L] = \int\!\int L(h(x),y)\, p(x,y)\, dy\, dx = \int\!\int \big(h(x) - y\big)^2 p(x,y)\, dy\, dx

Inserting and subtracting the conditional mean E_{p(y|x)}[y]:

E[L] = \int\!\int \big(h(x) - E_{p(y|x)}[y] + E_{p(y|x)}[y] - y\big)^2 p(x,y)\, dy\, dx

= \int\!\int \Big[\big(h(x) - E_{p(y|x)}[y]\big)^2 + \big(E_{p(y|x)}[y] - y\big)^2 + 2\big(h(x) - E_{p(y|x)}[y]\big)\big(E_{p(y|x)}[y] - y\big)\Big]\, p(y|x)\, p(x)\, dy\, dx

= \int \big(h(x) - E_{p(y|x)}[y]\big)^2 p(x)\, dx + \int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx.

The cross term vanishes because E_{p(y|x)}[y] - y has zero mean under p(y|x). The first remaining term is the estimation error; the second is the noise error.
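As a sanity check, this decomposition can be verified with a quick Monte Carlo experiment. The sketch below assumes a toy p(x,y) (x uniform on [0, 2π], y | x Gaussian around sin(x)) and a deliberately imperfect h(x); these choices are illustrative, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy p(x,y): x ~ Uniform(0, 2*pi), y | x ~ Normal(sin(x), 0.3^2),
# so the conditional mean E[y | x] is sin(x).
N = 200_000
x = rng.uniform(0.0, 2.0 * np.pi, N)
y = np.sin(x) + rng.normal(0.0, 0.3, N)

def h(x):
    # Some fixed (deliberately imperfect) regression function.
    return 0.8 * np.sin(x) + 0.1

cond_mean = np.sin(x)                                  # E[y | x]

total      = np.mean((h(x) - y) ** 2)                  # E[L]
estimation = np.mean((h(x) - cond_mean) ** 2)          # estimation error
noise      = np.mean((cond_mean - y) ** 2)             # noise error

# The two numbers agree up to Monte Carlo error.
print(total, estimation + noise)
```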

Page 3:

Regression: Learning

• Assume h(x) is a parametric curve, e.g. h(x)=af(x)+b.

• Minimize loss over the parameters (e.g. a,b), where p(x,y) is replaced with a sum over data-cases (called a “Monte Carlo sum”):

• That is: we solve:

• The same result follows from posing a Gaussian model q(y|x) for p(y|x) with mean h(x) and maximizing the probability of the data over the parameters. (This approach is taken in 274: probabilistic learning.)

\int\!\int g(x,y)\, p(x,y)\, dx\, dy \;\approx\; \frac{1}{N} \sum_{i=1}^{N} g(x_i, y_i)

\theta^* = \arg\min_\theta \sum_i \big(h_\theta(x_i) - y_i\big)^2
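A minimal sketch of this learning step, assuming f(x) = sin(x) and synthetic training data (both illustrative): since h(x) = a f(x) + b is linear in the parameters, the minimization reduces to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# A training set of N data-cases (a Monte Carlo sample from some p(x,y)).
N = 50
x = rng.uniform(0.0, 2.0 * np.pi, N)
y = np.sin(x) + rng.normal(0.0, 0.3, N)

# Parametric curve h(x) = a*f(x) + b with f(x) = sin(x).
# Minimizing sum_i (h(x_i) - y_i)^2 over (a, b) is linear least squares.
F = np.column_stack([np.sin(x), np.ones_like(x)])   # design matrix [f(x_i), 1]
(a, b), *_ = np.linalg.lstsq(F, y, rcond=None)

print(f"a = {a:.3f}, b = {b:.3f}")   # should land near a = 1, b = 0 here
```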

Page 4:

Back to overfitting

• More parameters lead to more flexible functions, which may lead to over-fitting.

• Formalize this by imagining very many datasets D, all of size N. Call h(x,D) the regression function estimated from a dataset D of size N, i.e. a(D)f(x)+b(D), then:

• Next, average over p(D) = p(x_1)p(x_2)⋯p(x_N). Only the first term depends on D:

E_{x,y}[L \mid D] = \int \big(h(x,D) - E_{p(y|x)}[y]\big)^2 p(x)\, dx + \int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx

E_D\big[E_{x,y}[L \mid D]\big] = \int E_D\big[\big(h(x,D) - E_{p(y|x)}[y]\big)^2\big] p(x)\, dx + \int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx

Inserting and subtracting E_D[h(x,D)] in the first term:

\int\!\int \big(h(x,D) - E_{p(y|x)}[y]\big)^2 p(x,D)\, dx\, dD = \int\!\int \big(h(x,D) - E_D[h(x,D)] + E_D[h(x,D)] - E_{p(y|x)}[y]\big)^2 p(x,D)\, dx\, dD

= \int\!\int \big(h(x,D) - E_D[h(x,D)]\big)^2 p(x,D)\, dx\, dD + \int \big(E_D[h(x,D)] - E_{p(y|x)}[y]\big)^2 p(x)\, dx

\quad + 2 \int\!\int \big(h(x,D) - E_D[h(x,D)]\big)\big(E_D[h(x,D)] - E_{p(y|x)}[y]\big)\, p(x,D)\, dx\, dD

The cross term is zero, since h(x,D) - E_D[h(x,D)] averages to zero over p(D); what remains is variance + bias².

Page 5:

Bias/Variance Tradeoff

E_D\big[E_{x,y}[L \mid D]\big] =

\underbrace{\int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx}_{A} \;+\; \underbrace{\int\!\int \big(h(x,D) - E_D[h(x,D)]\big)^2 p(x,D)\, dx\, dD}_{B} \;+\; \underbrace{\int \big(E_D[h(x,D)] - E_{p(y|x)}[y]\big)^2 p(x)\, dx}_{C}

A: The label y fluctuates around its conditional mean (label variance).

B: The estimate of h fluctuates across different datasets (estimation variance).

C: The average estimate of h does not fit well to the true curve (squared estimation bias).
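Terms B and C can be estimated by simulation: draw many datasets D of size N, fit h(x,D) on each, and average. The sketch below uses polynomial fits of increasing degree as the flexible model family; the true curve, noise level, and degrees are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

true_f   = np.sin       # plays the role of E[y | x]
noise_sd = 0.3
N        = 25           # size of each dataset D
n_sets   = 2000         # number of datasets D

x_grid = np.linspace(0.0, 2.0 * np.pi, 200)   # evaluation points for the x-average

for degree in (1, 3, 7):
    preds = np.empty((n_sets, x_grid.size))
    for s in range(n_sets):
        x = rng.uniform(0.0, 2.0 * np.pi, N)
        y = true_f(x) + rng.normal(0.0, noise_sd, N)
        coeffs = np.polyfit(x, y, degree)          # h(x, D) fit by least squares
        preds[s] = np.polyval(coeffs, x_grid)

    avg_h    = preds.mean(axis=0)                          # E_D[h(x, D)]
    variance = np.mean((preds - avg_h) ** 2)               # term B
    bias_sq  = np.mean((avg_h - true_f(x_grid)) ** 2)      # term C
    noise    = noise_sd ** 2                               # term A

    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, "
          f"variance = {variance:.4f}, noise = {noise:.4f}")
```

More flexible fits (higher degree) drive the bias term down while the variance term grows, which is the trade-off of the title.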

Page 6:

Bias/Variance Illustration

[Figure: bias/variance illustration. Each dataset D_i gives a fit g_i(x) = h(x, D_i); their average is g(x) = E_D[h(x,D)]. The spread of the g_i(x) around g(x) is the variance; the gap between g(x) and the true curve is the bias.]

Page 7:

Relation to Over-fitting

• Increasing regularization (less flexible models) and decreasing regularization (more flexible models) move the model along the bias/variance trade-off.

• Training error is measuring bias, but ignoring variance.

• Testing error / cross-validation error is measuring both bias and variance.
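One way to see this numerically: fit models of varying flexibility and compare the error on the training set with the error on held-out data. The sketch below uses ridge regression with a shrinking penalty λ as an illustrative stand-in for decreasing regularization; training error keeps falling as λ shrinks, while held-out error typically bottoms out and rises again.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    # Illustrative p(x,y): y = sin(x) + Gaussian noise.
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    return x, np.sin(x) + rng.normal(0.0, 0.3, n)

x_tr, y_tr = sample(30)       # small training set
x_te, y_te = sample(1000)     # large held-out test set

def design(x, degree=9):
    # Polynomial features; flexibility is controlled by the ridge penalty.
    return np.vander(x / (2.0 * np.pi), degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (1e2, 1e0, 1e-2, 1e-4, 1e-6):
    w = ridge_fit(design(x_tr), y_tr, lam)
    train_mse = np.mean((design(x_tr) @ w - y_tr) ** 2)
    test_mse  = np.mean((design(x_te) @ w - y_te) ** 2)
    print(f"lam = {lam:g}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```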