Regression Models Notes

Ramesh Kadambi

June 26, 2014

Work in progress: notes from the Coursera Regression Models course.


Contents

1 Some Statistical Terms
  1.1 Data and the Mean
  1.2 Data and the Variance
  1.3 Normalization
  1.4 Covariance
  1.5 Correlation
    1.5.1 Facts About Correlation

2 Linear Regression
  2.1 Regression Through Origin (RTO)
  2.2 General Regression: Fitting the Best Line
  2.3 Consequences of Linear Regression
  2.4 Regression to the mean
  2.5 Regression Model with Additive Gaussian Noise
  2.6 Interpretation of Regression Coefficients
    2.6.1 Interpreting the Intercept
    2.6.2 Interpreting the Slope
  2.7 Example: Working with Diamond Prices
  2.8 Residuals and Residual Variation
    2.8.1 Properties of Residuals
  2.9 Nonlinear Data and Linear Regression
  2.10 Data with changing variance


Chapter 1

Some Statistical Terms

1.1 Data and the Mean

Given a set of random data points X = {xi : i = 1 to n}, one would like to find a number corresponding to the middle of the data points. The middle, as intuition would tell us, is the point that is at the shortest distance from all the points in our data set, i.e. a point x that minimizes

$$\sum_{i=1}^{n} (x_i - x)^2.$$

Minimizing the error function we have,

$$\frac{d}{dx}\sum_{i=1}^{n}(x_i - x)^2 = 0$$

$$-2\sum_{i=1}^{n}(x_i - x) = 0$$

$$n x = \sum_{i=1}^{n} x_i$$

$$x = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
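As a quick numerical check of this result, here is a minimal R sketch (illustrative data and names, not from the notes): the sample mean does minimize the sum of squared deviations.

# Minimal sketch (illustrative data): numerically verify that the
# sample mean minimizes the sum of squared deviations sum((x_i - c)^2).
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
sse <- function(c) sum((x - c)^2)
opt <- optimize(sse, interval = range(x))
c(minimizer = opt$minimum, mean = mean(x))   # the two agree (up to tolerance)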

1.2 Data and the Variance

Once we know where the center of the data set lies, we would like to know how the points are distributed around the center. The distribution (dispersion) of the data around the mean is given by the variance, or by its square root, the standard deviation (σ). Note that the mean is the value that minimizes the variance. The unbiased estimate of the variance is given by,

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$\sigma = S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
 = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right)}
 = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2 - 2\bar{x}\sum_{i=1}^{n}x_i + n\bar{x}^2\right)}$$

$$ = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2 - 2n\bar{x}^2 + n\bar{x}^2\right)}
 = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right)}$$
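The last expression is a convenient computational shortcut. A small R sketch (illustrative data, not from the notes) checks it against the built-in var():

# Minimal sketch (illustrative data): check the shortcut
# S^2 = (sum(x_i^2) - n * xbar^2) / (n - 1) against the built-in var().
set.seed(2)
x <- rnorm(30)
n <- length(x)
shortcut <- (sum(x^2) - n * mean(x)^2) / (n - 1)
c(shortcut = shortcut, var = var(x))         # identical up to rounding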

1.3 Normalization

Transforming the given data by subtracting the mean and dividing by the standard deviation is called normalization. Normalizing leaves the data with a mean of 0 and a standard deviation of 1. Normalized data are in units of standard deviations from the mean: the value of a data point is the number of standard deviations from the mean at which the point occurred.
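A short R sketch of this transformation (illustrative data, not from the notes), checked against the built-in scale():

# Minimal sketch (illustrative data): normalize by hand and with scale();
# the normalized data have mean 0 and standard deviation 1.
set.seed(3)
x <- rnorm(40, mean = 5, sd = 3)
z <- (x - mean(x)) / sd(x)
c(mean = mean(z), sd = sd(z))                # approximately 0 and exactly 1
all.equal(z, as.vector(scale(x)))            # TRUE: same transformation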

1.4 Covariance

Given two sets of random data, X = {xi : i = 1 to n} and Y = {yi : i = 1 to n}, the covariance is defined to measure the relationship between the two random variables.

$$Cov(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_iy_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}\right)$$

$$ = \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i - 2n\bar{x}\bar{y} + n\bar{x}\bar{y}\right)
 = \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}\right)$$
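Again the last line is a computational shortcut; a small R sketch (illustrative data, not from the notes) checks it against cov():

# Minimal sketch (illustrative data): check the shortcut
# Cov(X,Y) = (sum(x_i * y_i) - n * xbar * ybar) / (n - 1) against cov().
set.seed(4)
x <- rnorm(25); y <- 2 * x + rnorm(25)
n <- length(x)
shortcut <- (sum(x * y) - n * mean(x) * mean(y)) / (n - 1)
c(shortcut = shortcut, cov = cov(x, y))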

1.5 Correlation

Correlation is a dimensionless quantity defined as the ratio of the covariance to the product of the standard deviations of the two variables.

$$Cor(X,Y) = \frac{Cov(X,Y)}{\sigma_x\sigma_y}$$
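A one-line R check of this definition against the built-in cor() (illustrative sketch, not from the notes):

# Minimal sketch (illustrative data): correlation is the covariance
# scaled by the two standard deviations.
set.seed(5)
x <- rnorm(25); y <- 2 * x + rnorm(25)
c(manual = cov(x, y) / (sd(x) * sd(y)), cor = cor(x, y))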

1.5.1 Facts About Correlation

1. Cor(X,Y ) = Cor(Y,X), this follows from the factthat Cov(X,Y ) = Cov(Y,X).

2. −1 ≤ Cor(X,Y ) ≤ 1

3. Cor(X,Y ) = 1 or Cor(X,Y ) = −1, only when thedata X and Y fall perfectly on a positively sloped ora negatively sloped line.

4. Cor(X,Y ) measures the strength of the linear relationship between X and Y . The relationship is stronger as Cor(X,Y ) approaches 1 or −1.

5. Cor(X,Y ) = 0 implies no linear relationship.


Chapter 2

Linear Regression

2.1 Regression Through Origin (RTO)

Given two sets of random data X = {xi : i = 1 to n} and Y = {yi : i = 1 to n}, we would like to use xi to predict the value of yi. The idea is to find out whether there is a relationship between the given random variables. We will see that the relationship depends on the correlation of the two random variates. Our objective is to find the β that minimizes

$$\sum_{i=1}^{n}(y_i - \beta x_i)^2.$$

Minimizing the error function we have,

$$\frac{d}{d\beta}\sum_{i=1}^{n}(y_i - \beta x_i)^2 = 0$$

$$-2\sum_{i=1}^{n}(y_i - \beta x_i)x_i = 0$$

$$\beta\sum_{i=1}^{n}x_i^2 = \sum_{i=1}^{n}x_iy_i$$

$$\beta = \frac{\sum_{i=1}^{n}x_iy_i}{\sum_{i=1}^{n}x_i^2}$$
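A small R sketch (illustrative data, not from the notes) checks this closed form against lm() with the intercept suppressed:

# Minimal sketch (illustrative data): the closed form
# beta = sum(x_i * y_i) / sum(x_i^2) matches lm() with no intercept.
set.seed(6)
x <- runif(50, 1, 3)
y <- 2.5 * x + rnorm(50, sd = 0.3)
beta_rto <- sum(x * y) / sum(x^2)
c(closed_form = beta_rto, lm = coef(lm(y ~ x - 1)))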

2.2 General Regression: Fitting the Best Line

Given two sets of random variates X = {xi : i = 1 to n} and Y = {yi : i = 1 to n}, we would like to use xi to predict the value of yi as before. Unlike RTO, here we fit a straight line with an intercept. Our model function is of the form

$$y = \beta_0 + \beta_1 x.$$

As before we would like to minimize the error,

$$\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2.$$

Solving the multivariate minimization problem we have,

$$\frac{\partial}{\partial\beta_0}\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 = 0$$

$$\frac{\partial}{\partial\beta_1}\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 = 0$$

Carrying out the minimization in each dimension we have,

$$\frac{\partial}{\partial\beta_0}\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 = 0$$

$$-2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$$

$$n\beta_0 = \sum_{i=1}^{n}y_i - \beta_1\sum_{i=1}^{n}x_i$$

$$\beta_0 = \frac{1}{n}\left(\sum_{i=1}^{n}y_i - \beta_1\sum_{i=1}^{n}x_i\right) = \bar{y} - \beta_1\bar{x}$$

Similarly solving for β1 we have,

$$\frac{\partial}{\partial\beta_1}\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 = 0$$

$$-2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)x_i = 0$$

$$\sum_{i=1}^{n}x_iy_i - \beta_0\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0$$

Multiplying by 1/n and substituting for β0 we have,

$$\frac{1}{n}\left(\sum_{i=1}^{n}x_iy_i - (\bar{y} - \beta_1\bar{x})\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2\right) = 0$$

$$\frac{1}{n}\left(\sum_{i=1}^{n}x_iy_i - (\bar{y} - \beta_1\bar{x})\,n\bar{x} - \beta_1\sum_{i=1}^{n}x_i^2\right) = 0$$

$$\frac{1}{n}\left(\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}\right) - \beta_1\,\frac{1}{n}\left(\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right) = 0$$

$$\beta_1 = \frac{Cov(X,Y)}{\sigma_x^2} = \frac{Cov(X,Y)}{\sigma_x\sigma_y}\,\frac{\sigma_y}{\sigma_x} = \rho_{xy}\,\frac{\sigma_y}{\sigma_x}$$

Table 1: Summary of Regression

  Regression through origin   Model: y = βx,        β = Σ xiyi / Σ xi²
  Regression with intercept   Model: y = β0 + β1x,  β0 = ȳ − β1x̄,  β1 = (σy/σx) ρxy
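A small R sketch (illustrative data, not from the notes) checks the closed-form intercept and slope in the table above against lm():

# Minimal sketch (illustrative data): the closed-form coefficients from
# the table above match lm(y ~ x).
set.seed(7)
x <- runif(60, 0, 10)
y <- 1 + 0.5 * x + rnorm(60)
beta1 <- cor(x, y) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(closed_form = c(beta0, beta1), lm = coef(lm(y ~ x)))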


2.3 Consequences of Linear Regression

Given the linear model y = β0 + β1x, fitted by minimizing ∑ (yi − (β0 + β1xi))², we have:

1. β1 has the units of Y/X; β0 has the units of Y .

2. The regression line passes through (x̄, ȳ), which is clear from the fact that, if x = x̄, then ŷ = ȳ − β1x̄ + β1x̄ = ȳ.

3. The slope is the same as that obtained by fitting the line through the origin using centered (demeaned) data, i.e. using xi − x̄ as the predictor (see the sketch after this list).

4. Flipping the roles of predictor and outcome changes the slope by switching the ratio of the standard deviations, i.e. σy/σx becomes σx/σy.

5. If the data are normalized, {(xi − x̄)/σx, (yi − ȳ)/σy}, the slope is the correlation between the random variates.
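A small R sketch (illustrative data, not from the notes) illustrating items 3 and 5:

# Minimal sketch (illustrative data) for items 3 and 5: centering the
# predictor leaves the slope unchanged, and regressing fully normalized
# data gives the correlation as the slope.
set.seed(8)
x <- runif(50, 0, 5)
y <- 2 + 3 * x + rnorm(50)
coef(lm(y ~ x))[2]                                  # slope with raw x
coef(lm(y ~ I(x - mean(x))))[2]                     # same slope with centered x
zx <- (x - mean(x)) / sd(x); zy <- (y - mean(y)) / sd(y)
c(slope = unname(coef(lm(zy ~ zx))[2]), cor = cor(x, y))   # equal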

2.4 Regression to the mean [1]

Regression to the mean is a phenomenon observed in linear regression, and it can be generalized to any data with a bivariate distribution. Given the model y = β0 + β1x, we have

$$\hat{y} = \beta_0 + \beta_1 x$$

$$\hat{y} = \bar{y} - \beta_1\bar{x} + \beta_1 x$$

$$\hat{y} - \bar{y} = \beta_1(x - \bar{x})$$

$$\frac{\hat{y} - \bar{y}}{\sigma_y} = \rho_{xy}\,\frac{x - \bar{x}}{\sigma_x} \qquad (2.4.1)$$

In (2.4.1) we see that if −1 < ρxy < 1, then |(ŷ − ȳ)/σy| < |(x − x̄)/σx|: the predicted (fitted) standardized value of y is closer to its mean than the standardized value of x is to its mean. In terms of probability, P(Y < y | X = x) gets bigger as x heads to large values; similarly, P(Y > y | X = x) gets bigger as x heads to very small values.
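A small simulation sketch of this shrinkage effect (illustrative data and names, not from the notes): on standardized data the fitted slope is the sample correlation, so fitted values are pulled toward the mean.

# Minimal sketch (illustrative data): on standardized data the fitted
# slope is the sample correlation, so the fitted standardized y is
# rho * (standardized x) -- closer to its mean (0) than x is whenever
# |rho| < 1.
set.seed(9)
x <- rnorm(1000)
y <- 0.6 * x + rnorm(1000, sd = 0.8)
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
c(slope = unname(coef(lm(zy ~ zx))[2]), rho = cor(x, y))   # the two agree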

2.5 Regression Model with Additive Gaussian Noise

Given two random variates X, Y , linear regression builds a model of the form y = β0 + β1x. The model above computes the coefficients of the linear model by formulating a mathematical problem and minimizing an error function. A statistical formulation of linear regression instead includes a random error term, and the coefficients are estimated by maximum likelihood. Our estimated model would be:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

here,

1. ε are assumed to be iid N(0, σ²).

2. E[Y | X = xi] = µi = β0 + β1xi.

3. Var(Y | X = xi) = σ².

It is worth dwelling on this for a moment. We are given two random variates {X, Y}. It is our belief that Y and X are related and that there exists a linear function f : X → Y. For a given value of X = xi, the values yi are random, with an expected value of β0 + β1xi and a variance of σ². Graphically it would look as below [2].

[Figure: Plot of yi | x = .17 — observed prices (in dollars) for diamonds of 0.17 carat.]

Our objective is to estimate the parameter set θ = {β0, β1}, given that the values yi have an additive noise ε ~ N(0, σ²). From the relation yi = β0 + β1xi + εi, we see that the yi are distributed with the density function N(µi, σ²) where µi = E[yi] = β0 + β1xi. Since we assume the εi are iid, the yi are independent though not identically distributed (due to µi). The joint density of the yi is given by,

$$L(\beta_0, \beta_1) = \prod_{i=1}^{n} N(\mu_i, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \mu_i)^2}$$

[1] Note: the hat (ˆ) is used to indicate estimated values. Though we know that the observed values are true values, our model coefficients are estimated values.
[2] This is from the package UsingR, data(diamond). We have plotted prices for diamonds with a carat value of .17.


Simplifying the log likelihood function we have,

$$\log L(\beta_0,\beta_1) = \sum_{i=1}^{n} \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i-\mu_i)^2}\right)$$

$$= \sum_{i=1}^{n}\left[-\log\!\left(\sqrt{2\pi\sigma^2}\right) - \frac{1}{2\sigma^2}(y_i-\mu_i)^2\right]$$

$$= -\sum_{i=1}^{n}\left[\frac{1}{2}\left(\log 2 + \log \pi + 2\log\sigma\right) + \frac{1}{2\sigma^2}(y_i-\mu_i)^2\right]$$

$$= c - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2$$

The parameters are arrived at by maximizing the likelihood function L(β0, β1). For fixed σ, maximizing this function is the same as minimizing the function we minimized for least squares regression, i.e. ∑ (yi − µi)². The estimate under the assumption of Gaussian errors is therefore the same as the linear least squares estimate.
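A small R sketch (illustrative data, not from the notes) confirms this numerically: maximizing the Gaussian log likelihood with a general-purpose optimizer recovers essentially the same coefficients as lm().

# Minimal sketch (illustrative data): maximizing the Gaussian log
# likelihood with a generic optimizer gives essentially the same
# coefficients as least squares / lm().
set.seed(10)
x <- runif(80, 0, 2)
y <- 1 + 2 * x + rnorm(80, sd = 0.5)
negloglik <- function(theta) {
  mu <- theta[1] + theta[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(theta[3]), log = TRUE))  # theta[3] = log(sigma)
}
mle <- optim(c(0, 0, 0), negloglik, method = "BFGS")
rbind(mle = mle$par[1:2], lm = unname(coef(lm(y ~ x))))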

2.6 Interpretation of Regression Coefficients

The coefficients of a linear regression have an intuitive interpretation.

2.6.1 Interpreting the Intercept

β0 is the expected value of the response when the predictor has the value 0. This follows from the fact that,

$$y = \beta_0 + \beta_1 x + \varepsilon$$

$$E[y \mid x = 0] = \beta_0$$

The above interpretation may not be meaningful in cases where x = 0 is not a valid value for the predictor. In such cases we can shift or center the predictor to get a proper interpretation,

$$y = \beta_0 + \beta_1 a + \beta_1(x - a) + \varepsilon = \beta_s + \beta_1(x - a) + \varepsilon$$

Shifting the data points xi only shifts the intercept and does not affect the slope. Setting a = x̄, the new intercept β_s is the predicted response for x = x̄.

2.6.2 Interpreting the Slope

The slope can be interpreted in a couple of different ways.

1. The slope is the expected change in the response for a unit change in the predictor:

$$E[y \mid x = x_i + 1] - E[y \mid x = x_i] = \beta_0 + \beta_1(x_i + 1) - \beta_0 - \beta_1 x_i = \beta_1$$

2. Scaling the units of x divides the slope by the same scaling factor:

$$y = \beta_0 + \beta_1 x + \varepsilon = \beta_0 + \frac{\beta_1}{a}(a x) + \varepsilon$$

so if ax is used as the predictor, the fitted slope is β1/a.

The interpretation of the slope is intuitive in the sense that the units of the slope β1 are (units of y)/(units of x). If we scale the units of x and use the scaled value as the predictor, the slope of the resulting regression model will be scaled accordingly when estimated.

2.7 Example: Working with Diamond Prices [3]
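The helper lrwithintercept used in this section is not listed in these notes; a minimal sketch consistent with the printed output (it returns the inputs together with the least-squares intercept and slope) might look like this:

# Hypothetical reconstruction (not from the notes) of the helper used
# in this section: it returns the inputs along with the least-squares
# intercept and slope.
lrwithintercept <- function(x, y) {
  slope     <- cor(x, y) * sd(y) / sd(x)
  intercept <- mean(y) - slope * mean(x)
  list(x = x, y = y, intercept = intercept, slope = slope)
}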

The regression results are given below,

> lrwithintercept(diamond$carat, diamond$price)

$x

[1] 0.17 0.16 0.17 0.18 0.25 0.16 0.15

[8] 0.19 0.21 0.15 0.18 0.28 0.16 0.20

[15] 0.23 0.29 0.12 0.26 0.25 0.27 0.18

[22] 0.16 0.17 0.16 0.17 0.18 0.17 0.18

[29] 0.17 0.15 0.17 0.32 0.32 0.15 0.16

[36] 0.16 0.23 0.23 0.17 0.33 0.25 0.35

[43] 0.18 0.25 0.25 0.15 0.26 0.15

$y

[1] 355 328 350 325 642 342 322

[8] 485 483 323 462 823 336 498

[15] 595 860 223 663 750 720 468

[22] 345 352 332 353 438 318 419

[29] 346 315 350 918 919 298 339

[36] 338 595 553 345 945 655 1086

[43] 443 678 675 287 693 316

$intercept

[1] -259.6259

$slope

[1] 3721.025

[3] Using data(diamond) from the package UsingR.


[Figure 2: Regression of diamond weight vs. price (diamond$carat on the x-axis, diamond$price on the y-axis).]

According to the intercept, a zero-carat diamond is worth −$259, which does not make sense, since there are no such diamonds. Centering the diamond weights at the mean we have the following [4]:

> res <- lrwithintercept(diamond$carat

- mean(diamond$carat), diamond$price)

> res

$x

[1] -0.034166667 -0.044166667

[3] -0.034166667 -0.024166667

[5] 0.045833333 -0.044166667

[7] -0.054166667 -0.014166667

[9] 0.005833333 -0.054166667

[11] -0.024166667 0.075833333

[13] -0.044166667 -0.004166667

[15] 0.025833333 0.085833333

[17] -0.084166667 0.055833333

[19] 0.045833333 0.065833333

[21] -0.024166667 -0.044166667

[23] -0.034166667 -0.044166667

[25] -0.034166667 -0.024166667

[27] -0.034166667 -0.024166667

[29] -0.034166667 -0.054166667

[31] -0.034166667 0.115833333

[33] 0.115833333 -0.054166667

[35] -0.044166667 -0.044166667

[37] 0.025833333 0.025833333

[39] -0.034166667 0.125833333

[41] 0.045833333 0.145833333

[43] -0.024166667 0.045833333

[45] 0.045833333 -0.054166667

[47] 0.055833333 -0.054166667

$y

[1] 355 328 350 325 642 342 322

[8] 485 483 323 462 823 336 498

[15] 595 860 223 663 750 720 468

[22] 345 352 332 353 438 318 419

[29] 346 315 350 918 919 298 339

[36] 338 595 553 345 945 655 1086

[43] 443 678 675 287 693 316

$intercept

[1] 500.0833

$slope

[1] 3721.025

> mean(diamond$carat)

[1] 0.2041667

The intercept of the centered data is 500.0833; according to our interpretation it is the price of a diamond of the mean weight, 0.2041667 carat. The intercept is indicated by the red point in Figure 3.

[Figure 3: Centered regression and intercept interpretation (res$x vs. res$y, with the intercept marked in red).]

The slope of the regression line represents the change in price for a unit change in the size of the diamond. Our slope indicates that a 1 carat increase in the size of the diamond costs $3721. Scaling the size of the diamonds to units of 1/10th of a carat, we have the following result,

> res <- lrwithintercept(diamond$carat * 10,

diamond$price)

> res

$x

[1] 1.7 1.6 1.7 1.8 2.5 1.6 1.5 1.9 2.1

[10] 1.5 1.8 2.8 1.6 2.0 2.3 2.9 1.2 2.6

[19] 2.5 2.7 1.8 1.6 1.7 1.6 1.7 1.8 1.7

[28] 1.8 1.7 1.5 1.7 3.2 3.2 1.5 1.6 1.6

[37] 2.3 2.3 1.7 3.3 2.5 3.5 1.8 2.5 2.5

[46] 1.5 2.6 1.5

$y

[1] 355 328 350 325 642 342 322

[8] 485 483 323 462 823 336 498

[15] 595 860 223 663 750 720 468

[22] 345 352 332 353 438 318 419

[4] This is as expected: translating the line changes the intercept but not the slope. We have just moved the line along the x-axis so that the origin corresponds to the mean size of a diamond.


[29] 346 315 350 918 919 298 339

[36] 338 595 553 345 945 655 1086

[43] 443 678 675 287 693 316

$intercept

[1] -259.6259

$slope

[1] 372.1025

As expected, scaling the size of the diamonds by a factor of 10 (the same as changing the unit of measurement to 1/10th of a carat) reduces the slope to about $372 per 1/10th carat. To predict the value of a diamond of a given size, we just use the results of the regression and value it using the slope and the intercept.

> predictlm(res, 10 * c(.16,.27,.34))

[1] 335.7381 745.0508 1005.5225
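The helper predictlm is likewise not listed in the notes; a plausible sketch that reproduces the predictions above from the returned intercept and slope:

# Hypothetical reconstruction (not from the notes): predict responses
# from the intercept and slope returned by lrwithintercept().
predictlm <- function(res, newx) res$intercept + res$slope * newx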

[Figure 4: Scaled regression with predictions (res$x vs. res$y).]

2.8 Residuals and Residual Variation

Our model y = β0 + β1x + ε is expected to represent the observed values yi = β0 + β1xi + εi, where εi ~ N(0, σ²). However, we do not expect to predict the values exactly. What we hope to predict is some kind of average value of the observed variable at a given value of the predictor variable. Our regression estimate ŷi = β̂0 + β̂1xi provides an estimate of the observed value yi. The errors ei = yi − ŷi are called the residuals; note that the ei are not the same as the εi. Least squares regression minimizes ∑ ei².

2.8.1 Properties of Residuals

1. The expected value of the residuals is zero, i.e. E[ei] = 0.

2. If an intercept is included, then ∑ ei = 0 (verified numerically in the sketch after this list). This indicates that the residuals are not independent. More generally, if we fit p parameters in the linear model, then knowledge of n − p residuals is sufficient to compute the remaining p residuals.

3. If a regressor variable xi is included in the model, then ∑ xi ei = 0.

4. Residuals are useful for investigating poor model fit.

5. Positive residuals are above the regression line and negative residuals are below it.

6. Residuals can be thought of as the outcome with the linear association of the predictor removed.

7. One differentiates residual variation (variation after the predictor is removed) from systematic variation (variation explained by the regression model).

8. Residual plots highlight poor model fit.
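A small R sketch (illustrative data, not from the notes) verifies properties 2 and 3 numerically:

# Minimal sketch (illustrative data): with an intercept and a regressor
# in the model, the residuals sum to zero and are orthogonal to the
# regressor (properties 2 and 3).
set.seed(14)
x <- runif(50)
y <- 3 + 2 * x + rnorm(50, sd = 0.2)
e <- resid(lm(y ~ x))
c(sum_e = sum(e), sum_xe = sum(x * e))       # both zero up to rounding error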

[Figure 5: Plot of residuals for the scaled diamond regression (res$x vs. res$e).]

2.9 Nonlinear Data and Linear Regression

We will generate random non-linear data as,

x <- runif(100, -3, 3);

y <- x + sin(x) + rnorm(100, sd = .2)

The plots of the regression and residuals are shown below. One can clearly see a pattern in the residuals. In such situations, the data can be transformed to obtain a linear relationship.
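The residual object res_sinsim used in Figure 7 is not defined in these notes; an equivalent fit and residual extraction with lm(), assuming the x and y generated above, might be:

# Sketch (assuming the x and y generated above): fit a straight line
# and extract the residuals; plotting them against x shows the leftover
# sin(x) pattern.
fit <- lm(y ~ x)
e <- resid(fit)
plot(x, e)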


[Figure 6: Plot of the nonlinear data (x vs. y).]

[Figure 7: Plot of the residuals for the nonlinear data (x vs. the residuals).]

2.10 Data with changing variance

We will generate random data that is heteroskedastic using the following code,

> x <- c(seq(-3, 3, 1), seq(-3, 3, 1),
         seq(-3, 3, 1), seq(-3, 3, 1), seq(-3, 3, 1))
> y <- x + rnorm(length(x), sd = abs(.2 + x))   # noise sd grows with |x|

Below are the plots of the regression as well as the residuals. As seen from the residual plot, the residuals do not have a constant variance.

[Figure 8: Plot of the heteroscedastic data (x vs. y).]

[Figure 9: Plot of the heteroscedastic residuals (x vs. the residuals).]

A better illustration is to plot a sample of yi against xi at a couple of different x values, together with the corresponding mean and variance. The plot is shown below; the points in green and red are the mean and the variance. As seen, the yi do not have a constant variance across xi.
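The quantities sdy and meany printed below are not computed in the listed code; a plausible sketch, under the assumption that the two x values chosen were 0 and 3 (each x value appears five times in the data generated above):

# Sketch (assumption: the two x values used were 0 and 3; each x value
# appears five times in the data generated above): sample mean and
# standard deviation of y at each chosen x.
xl <- c(0, 3)
meany <- sapply(xl, function(v) mean(y[x == v]))
sdy   <- sapply(xl, function(v) sd(y[x == v]))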

> sdy

[1] 0.2887582 3.4365632

> meany

[1] -0.002637026 2.713918444


[Figure 10: Illustration of heteroskedasticity — samples of yi at two values of xi (xl vs. yl), with their means and variances marked.]