Regression Variance-Bias Trade-off


Page 1:

Regression Variance-Bias Trade-off

Page 2:

Regression

• We need a regression function h(x)

• We need a loss function L(h(x),y)

• We have a true distribution p(x,y)

• Assume a quadratic loss, then:

[Figure: data points y_t and the regression curve h(x).]

E[L] = \int\!\int L(h(x),y)\, p(x,y)\, dy\, dx = \int\!\int \big(h(x) - y\big)^2 p(x,y)\, dy\, dx

Inserting and subtracting the conditional mean E_{p(y|x)}[y]:

E[L] = \int\!\int \big(h(x) - E_{p(y|x)}[y] + E_{p(y|x)}[y] - y\big)^2 p(x,y)\, dy\, dx

= \int\!\int \Big[\big(h(x) - E_{p(y|x)}[y]\big)^2 + \big(E_{p(y|x)}[y] - y\big)^2 + 2\big(h(x) - E_{p(y|x)}[y]\big)\big(E_{p(y|x)}[y] - y\big)\Big]\, p(y|x)\, p(x)\, dy\, dx

= \int \big(h(x) - E_{p(y|x)}[y]\big)^2 p(x)\, dx + \int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx.

The cross term vanishes because E_{p(y|x)}[y] - y has zero mean under p(y|x). The first remaining term is the estimation error; the second is the noise error.
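As a sanity check, this decomposition can be verified with a quick Monte Carlo experiment. The sketch below assumes a toy p(x,y) (x uniform on [0, 2π], y | x Gaussian around sin(x)) and a deliberately imperfect h(x); these choices are illustrative, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy p(x,y): x ~ Uniform(0, 2*pi), y | x ~ Normal(sin(x), 0.3^2),
# so the conditional mean E[y | x] is sin(x).
N = 200_000
x = rng.uniform(0.0, 2.0 * np.pi, N)
y = np.sin(x) + rng.normal(0.0, 0.3, N)

def h(x):
    # Some fixed (deliberately imperfect) regression function.
    return 0.8 * np.sin(x) + 0.1

cond_mean = np.sin(x)                                  # E[y | x]

total      = np.mean((h(x) - y) ** 2)                  # E[L]
estimation = np.mean((h(x) - cond_mean) ** 2)          # estimation error
noise      = np.mean((cond_mean - y) ** 2)             # noise error

# The two numbers agree up to Monte Carlo error.
print(total, estimation + noise)
```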

Page 3:

Regression: Learning

• Assume h(x) is a parametric curve, e.g. h(x)=af(x)+b.

• Minimize loss over the parameters (e.g. a,b), where p(x,y) is replaced with a sum over data-cases (called a “Monte Carlo sum”):

• That is: we solve:

• The same result follows from posing a Gaussian model q(y|x) for p(y|x) with mean h(x) and maximizing the probability of the data over the parameters. (This approach is taken in 274: probabilistic learning.)

\int\!\int g(x,y)\, p(x,y)\, dx\, dy \;\approx\; \frac{1}{N} \sum_{i=1}^{N} g(x_i, y_i)

\theta^* = \arg\min_\theta \sum_i \big(h_\theta(x_i) - y_i\big)^2
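A minimal sketch of this learning step, assuming f(x) = sin(x) and synthetic training data (both illustrative): since h(x) = a f(x) + b is linear in the parameters, the minimization reduces to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# A training set of N data-cases (a Monte Carlo sample from some p(x,y)).
N = 50
x = rng.uniform(0.0, 2.0 * np.pi, N)
y = np.sin(x) + rng.normal(0.0, 0.3, N)

# Parametric curve h(x) = a*f(x) + b with f(x) = sin(x).
# Minimizing sum_i (h(x_i) - y_i)^2 over (a, b) is linear least squares.
F = np.column_stack([np.sin(x), np.ones_like(x)])   # design matrix [f(x_i), 1]
(a, b), *_ = np.linalg.lstsq(F, y, rcond=None)

print(f"a = {a:.3f}, b = {b:.3f}")   # should land near a = 1, b = 0 here
```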

Page 4:

Back to overfitting

• More parameters lead to more flexible functions, which may lead to over-fitting.

• Formalize this by imagining very many datasets D, all of size N. Call h(x,D) the regression function estimated from a dataset D of size N, i.e. a(D)f(x)+b(D), then:

• Next, average over p(D) = p(x_1)p(x_2)⋯p(x_N). Only the first term depends on D:

E_{x,y}[L \mid D] = \int \big(h(x,D) - E_{p(y|x)}[y]\big)^2 p(x)\, dx + \int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx

E_D\big[E_{x,y}[L \mid D]\big] = \int E_D\big[\big(h(x,D) - E_{p(y|x)}[y]\big)^2\big] p(x)\, dx + \int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx

Inserting and subtracting E_D[h(x,D)] in the first term:

\int\!\int \big(h(x,D) - E_{p(y|x)}[y]\big)^2 p(x,D)\, dx\, dD = \int\!\int \big(h(x,D) - E_D[h(x,D)] + E_D[h(x,D)] - E_{p(y|x)}[y]\big)^2 p(x,D)\, dx\, dD

= \int\!\int \big(h(x,D) - E_D[h(x,D)]\big)^2 p(x,D)\, dx\, dD + \int \big(E_D[h(x,D)] - E_{p(y|x)}[y]\big)^2 p(x)\, dx

\quad + 2 \int\!\int \big(h(x,D) - E_D[h(x,D)]\big)\big(E_D[h(x,D)] - E_{p(y|x)}[y]\big)\, p(x,D)\, dx\, dD

The cross term is zero, since h(x,D) - E_D[h(x,D)] averages to zero over p(D); what remains is variance + bias².

Page 5:

Bias/Variance Tradeoff

E_D\big[E_{x,y}[L \mid D]\big] =

\underbrace{\int\!\int \big(E_{p(y|x)}[y] - y\big)^2 p(y,x)\, dy\, dx}_{A} \;+\; \underbrace{\int\!\int \big(h(x,D) - E_D[h(x,D)]\big)^2 p(x,D)\, dx\, dD}_{B} \;+\; \underbrace{\int \big(E_D[h(x,D)] - E_{p(y|x)}[y]\big)^2 p(x)\, dx}_{C}

A: The label y fluctuates around its conditional mean (label variance).

B: The estimate of h fluctuates across different datasets (estimation variance).

C: The average estimate of h does not fit well to the true curve (squared estimation bias).
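Terms B and C can be estimated by simulation: draw many datasets D of size N, fit h(x,D) on each, and average. The sketch below uses polynomial fits of increasing degree as the flexible model family; the true curve, noise level, and degrees are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

true_f   = np.sin       # plays the role of E[y | x]
noise_sd = 0.3
N        = 25           # size of each dataset D
n_sets   = 2000         # number of datasets D

x_grid = np.linspace(0.0, 2.0 * np.pi, 200)   # evaluation points for the x-average

for degree in (1, 3, 7):
    preds = np.empty((n_sets, x_grid.size))
    for s in range(n_sets):
        x = rng.uniform(0.0, 2.0 * np.pi, N)
        y = true_f(x) + rng.normal(0.0, noise_sd, N)
        coeffs = np.polyfit(x, y, degree)          # h(x, D) fit by least squares
        preds[s] = np.polyval(coeffs, x_grid)

    avg_h    = preds.mean(axis=0)                          # E_D[h(x, D)]
    variance = np.mean((preds - avg_h) ** 2)               # term B
    bias_sq  = np.mean((avg_h - true_f(x_grid)) ** 2)      # term C
    noise    = noise_sd ** 2                               # term A

    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, "
          f"variance = {variance:.4f}, noise = {noise:.4f}")
```

More flexible fits (higher degree) drive the bias term down while the variance term grows, which is the trade-off of the title.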

Page 6:

Bias/Variance Illustration

[Figure: bias/variance illustration. Each dataset D_i gives a fit g_i(x) = h(x, D_i); their average is g(x) = E_D[h(x,D)]. The spread of the g_i(x) around g(x) is the variance; the gap between g(x) and the true curve is the bias.]

Page 7:

Relation to Over-fitting

• Increasing regularization (less flexible models) and decreasing regularization (more flexible models) move the model along the bias/variance trade-off.

• Training error is measuring bias, but ignoring variance.

• Testing error / cross-validation error is measuring both bias and variance.
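One way to see this numerically: fit models of varying flexibility and compare the error on the training set with the error on held-out data. The sketch below uses ridge regression with a shrinking penalty λ as an illustrative stand-in for decreasing regularization; training error keeps falling as λ shrinks, while held-out error typically bottoms out and rises again.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    # Illustrative p(x,y): y = sin(x) + Gaussian noise.
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    return x, np.sin(x) + rng.normal(0.0, 0.3, n)

x_tr, y_tr = sample(30)       # small training set
x_te, y_te = sample(1000)     # large held-out test set

def design(x, degree=9):
    # Polynomial features; flexibility is controlled by the ridge penalty.
    return np.vander(x / (2.0 * np.pi), degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (1e2, 1e0, 1e-2, 1e-4, 1e-6):
    w = ridge_fit(design(x_tr), y_tr, lam)
    train_mse = np.mean((design(x_tr) @ w - y_tr) ** 2)
    test_mse  = np.mean((design(x_te) @ w - y_te) ** 2)
    print(f"lam = {lam:g}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```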