Simple Linear Regression & Correlation
Instructor: Prof. Wei Zhu
11/21/2013
AMS 572 Group Project
1. Motivation & Introduction – Lizhou Nie
2. A Probabilistic Model for Simple Linear Regression – Long Wang
3. Fitting the Simple Linear Regression Model – Zexi Han
4. Statistical Inference for Simple Linear Regression – Lichao Su
5. Regression Diagnostics – Jue Huang
6. Correlation Analysis – Ting Sun
7. Implementation in SAS – Qianyi Chen
8. Application and Summary – Jie Shuai
Outline
1. Motivation
http://popperfont.net/2012/11/13/the-ultimate-solar-system-animated-gif/
Fig. 1.1 Simplified Model for Solar System
Fig. 1.2 Obama & Romney during Presidential Election Campaign
http://outfront.blogs.cnn.com/2012/08/14/the-most-negative-in-campaign-history/
• Regression Analysis
  Linear Regression:
    Simple Linear Regression: {y; x}
    Multiple Linear Regression: {y; x1, … , xp}
    Multivariate Linear Regression: {y1, … , yn; x1, … , xp}
• Correlation Analysis
  Pearson Product-Moment Correlation Coefficient: Measurement of the Linear Relationship between Two Variables
Introduction
• Adrien-Marie Legendre: Earliest Form of Regression: the Least Squares Method
• Carl Friedrich Gauss: Further Development of Least Squares Theory, including the Gauss-Markov Theorem
• Sir Francis Galton: Coining the Term "Regression"
• George Udny Yule & Karl Pearson: Extension to a More Generalized Statistical Context
History
Sources: http://en.wikipedia.org/wiki/Regression_analysis, http://en.wikipedia.org/wiki/Adrien_Marie_Legendre, http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss, http://en.wikipedia.org/wiki/Francis_Galton, http://www.york.ac.uk/depts/maths/histstat/people/yule.gif, http://en.wikipedia.org/wiki/Karl_Pearson
Simple Linear Regression
- Special Case of Linear Regression
- One Response Variable to One Explanatory Variable
General Setting
- We Denote the Explanatory Variable as x_i and the Response Variable as y_i
- n Pairs of Observations {(x_i, y_i)}, i = 1, …, n
2. A Probabilistic Model
Sketch the Graph
2. A Probabilistic Model
[Scatter plot of the example data; the point (29, 5.5) is marked on the graph]

  i |      X |     Y
  1 |  37.70 |  9.82
  2 |  16.31 |  5.00
  3 |  28.37 |  9.27
  4 | -12.13 |  2.98
  … |        |
 98 |   9.06 |  7.34
 99 |  28.54 | 10.37
100 | -17.19 |  2.33
In simple linear regression, the data are described by the model:
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$ independently
The fitted model:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
where $\hat{\beta}_0$ is the intercept and $\hat{\beta}_1$ is the slope of the regression line.
2. A Probabilistic Model
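To make the model concrete, here is a minimal MATLAB simulation sketch (ours, not from the slides); the values beta0 = 5, beta1 = 0.15, and sigma = 1 are illustrative assumptions, not estimates from the example data.

% Simulate n = 100 observations from y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)
n = 100;
beta0 = 5; beta1 = 0.15; sigma = 1;     % assumed values, for illustration only
x = 15 + 20*randn(n,1);                 % explanatory variable (arbitrary spread)
y = beta0 + beta1*x + sigma*randn(n,1); % response generated by the model
plot(x, y, '.'), xlabel('x_i'), ylabel('y_i')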
3. Fitting the Simple Linear Regression Model
[Fig 3.1: scatter plot of groove depth (in mils) vs. mileage (in 1000 miles)]
Fig 3.1. Scatter plot of tire tread wear vs. mileage. From: Statistics and Data Analysis; Tamhane and Dunlop; Prentice Hall.

Table 3.1.
Mileage (in 1000 miles) | Groove Depth (in mils)
 0                      | 394.33
 4                      | 329.50
 8                      | 291.00
12                      | 255.17
16                      | 229.33
20                      | 204.83
24                      | 179.00
28                      | 163.83
32                      | 150.33
The difference between the fitted line and the real data is measured by
$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$
Our goal: minimize this sum of squares.
3. Fitting the Simple Linear Regression Model
[Fig 3.2: the scatter plot of Fig 3.1 with the fitted line added; the vertical distances $e_i$ are marked]
Fig 3.2.
$e_i$ is the vertical distance between the fitted line and the real data:
$e_i = y_i - \hat{y}_i$
3. Fitting the Simple Linear Regression Model
Setting the partial derivatives of $Q$ with respect to $\beta_0$ and $\beta_1$ equal to zero gives the normal equations:
$\sum_{i=1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i$
$\sum_{i=1}^{n} x_i y_i = \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2$
Least Square Method
Solving the normal equations yields the least squares estimates:
$\hat{\beta}_0 = \dfrac{(\sum_{i=1}^{n} x_i^2)(\sum_{i=1}^{n} y_i) - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} x_i y_i)}{n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}$
$\hat{\beta}_1 = \dfrac{n\sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2}$
3. Fitting the Simple Linear Regression Model
To simplify, we denote:
$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}(\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)$
$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}(\sum_{i=1}^{n} x_i)^2$
$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}(\sum_{i=1}^{n} y_i)^2$
With this notation, $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$.
3. Fitting the Simple Linear Regression Model
Back to the example: for the tire data, $\bar{x} = 16$, $\bar{y} = 244.15$, $S_{xx} = 960$, and $S_{xy} = -6989.4$, so $\hat{\beta}_1 \approx -7.281$ and $\hat{\beta}_0 \approx 360.64$.
3. Fitting the Simple Linear Regression Model
Therefore, the equation of the fitted line is $\hat{y} = 360.64 - 7.281x$.
But a fitted line alone is not enough!
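As a quick check, a short MATLAB sketch (ours, not from the slides) reproduces these estimates from the Table 3.1 data:

x = (0:4:32)';
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33]';
Sxx = sum((x - mean(x)).^2);             % = 960
Sxy = sum((x - mean(x)).*(y - mean(y))); % = -6989.4
b1 = Sxy/Sxx                             % slope: about -7.281
b0 = mean(y) - b1*mean(x)                % intercept: about 360.64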
3. Fitting the Simple Linear Regression Model
We define:
$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$ (SST: total sum of squares)
$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ (SSR: regression sum of squares)
$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ (SSE: error sum of squares)
One can prove: $SST = SSR + SSE$.
The ratio $r^2 = SSR/SST$ is called the coefficient of determination.
3. Fitting the Simple Linear Regression Model — Check the goodness of fit of the LS line
Back to the example: for the tire data, $r^2 = 0.953$. Since 95.3% of the variation in tread wear is accounted for by linear regression on mileage, the relationship between the two is strongly linear with a negative slope.
$r$ is the sample correlation coefficient between X and Y:
$r = \dfrac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$
For simple linear regression, $r = \pm\sqrt{r^2}$, where the sign of $r$ follows from the sign of $\hat{\beta}_1$; here $r = -0.976$.
3. Fitting the Simple Linear Regression Model
Estimation of $\sigma^2$
The variance $\sigma^2$ measures the scatter of the $y_i$ around their means. An unbiased estimate of $\sigma^2$ is given by
$s^2 = \dfrac{SSE}{n-2} = \dfrac{\sum_{i=1}^{n} e_i^2}{n-2}$
3. Fitting the Simple Linear Regression Model
From the example, we have SSE = 2351.3 and n − 2 = 7; therefore
$s^2 = 2351.3/7 = 335.9$,
which has 7 d.f. The estimate of $\sigma$ is $s = \sqrt{335.9} = 18.33$.
3. Fitting the Simple Linear Regression Model
4. Statistical Inference For SLR
Under the normal error assumption:
* Point estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased: $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.
* Sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$:
$\hat{\beta}_0 \sim N\left(\beta_0,\ \sigma^2\dfrac{\sum_{i=1}^{n} x_i^2}{n S_{xx}}\right)$, with estimated standard error $SE(\hat{\beta}_0) = s\sqrt{\dfrac{\sum_{i=1}^{n} x_i^2}{n S_{xx}}}$
$\hat{\beta}_1 \sim N\left(\beta_1,\ \dfrac{\sigma^2}{S_{xx}}\right)$, with estimated standard error $SE(\hat{\beta}_1) = \dfrac{s}{\sqrt{S_{xx}}}$
Derivation
Write $\hat{\beta}_1 = \dfrac{S_{xY}}{S_{xx}} = \sum_{i=1}^{n} \dfrac{(x_i - \bar{x})}{S_{xx}} Y_i$, a linear combination of the independent $Y_i$'s. Then
$E(\hat{\beta}_1) = \sum_{i=1}^{n} \dfrac{(x_i - \bar{x})E(Y_i)}{S_{xx}} = \sum_{i=1}^{n} \dfrac{(x_i - \bar{x})(\beta_0 + \beta_1 x_i)}{S_{xx}} = \beta_1\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})x_i}{S_{xx}} = \beta_1$
$Var(\hat{\beta}_1) = \sum_{i=1}^{n} \dfrac{(x_i - \bar{x})^2}{S_{xx}^2}\,Var(Y_i) = \dfrac{\sigma^2 S_{xx}}{S_{xx}^2} = \dfrac{\sigma^2}{S_{xx}}$
Similarly, for $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x}$:
$E(\hat{\beta}_0) = E(\bar{Y}) - \bar{x}\,E(\hat{\beta}_1) = (\beta_0 + \beta_1\bar{x}) - \beta_1\bar{x} = \beta_0$
$Var(\hat{\beta}_0) = Var(\bar{Y}) + \bar{x}^2\,Var(\hat{\beta}_1) = \dfrac{\sigma^2}{n} + \dfrac{\sigma^2\bar{x}^2}{S_{xx}} = \sigma^2\dfrac{\sum_{i=1}^{n} x_i^2}{n S_{xx}}$
(here $Cov(\bar{Y}, \hat{\beta}_1) = 0$ because $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$)
For mathematical derivations, please refer to the Tamhane and Dunlop text book, P331.
* Pivotal Quantities (P.Q.'s):
$\dfrac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} \sim t_{n-2}$ and $\dfrac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$
* Confidence Intervals (C.I.'s): the $100(1-\alpha)\%$ C.I.'s for $\beta_0$ and $\beta_1$ are
$\hat{\beta}_0 \pm t_{n-2,\alpha/2}\,SE(\hat{\beta}_0)$ and $\hat{\beta}_1 \pm t_{n-2,\alpha/2}\,SE(\hat{\beta}_1)$
Statistical Inference on β0 and β1
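As an illustration, a short MATLAB sketch (ours) of the 95% C.I. for β1 on the tire data, using s = 18.33 from the slides (tinv requires the Statistics Toolbox):

n = 9; x = (0:4:32)';
s = sqrt(2351.3/7);                  % estimate of sigma from the slides
Sxx = sum((x - mean(x)).^2);         % = 960
se_b1 = s/sqrt(Sxx);                 % SE(beta1-hat), about 0.59
tcrit = tinv(0.975, n-2);            % t_{7,.025} = 2.365
ci_b1 = -7.281 + [-1 1]*tcrit*se_b1  % roughly (-8.68, -5.88)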
A useful application is to show whether there is a linear relationship between x and y
Hypothesis tests:
$H_0: \beta_1 = \beta_1^0$ vs. $H_a: \beta_1 \neq \beta_1^0$
Reject $H_0$ at level $\alpha$ if $|t| = \dfrac{|\hat{\beta}_1 - \beta_1^0|}{SE(\hat{\beta}_1)} \geq t_{n-2,\alpha/2}$
In particular, for $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$:
Reject $H_0$ at level $\alpha$ if $|t| = \dfrac{|\hat{\beta}_1|}{SE(\hat{\beta}_1)} \geq t_{n-2,\alpha/2}$
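For the tire data this test is decisive; a small MATLAB sketch (ours) using quantities computed earlier:

s = sqrt(2351.3/7);                  % estimate of sigma from the slides
se_b1 = s/sqrt(960);                 % Sxx = 960 for the tire data
t0 = -7.281/se_b1                    % about -12.3
% |t0| = 12.3 > t_{7,.025} = 2.365, so reject H0: beta1 = 0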
Mean Square: A sum of squares divided by its degrees of freedom.
Analysis of Variance (ANOVA)
$MSR = \dfrac{SSR}{1}$ and $MSE = \dfrac{SSE}{n-2}$
For testing $H_0: \beta_1 = 0$,
$F = \dfrac{MSR}{MSE} = \dfrac{SSR/1}{s^2} = \dfrac{\hat{\beta}_1^2 S_{xx}}{s^2} = \left(\dfrac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\right)^2 = t^2$
so the F-test is equivalent to the two-sided t-test: $f_{1,n-2,\alpha} = t_{n-2,\alpha/2}^2$
Analysis of Variance (ANOVA)
ANOVA Table

Source of Variation (Source) | Sum of Squares (SS) | Degrees of Freedom (d.f.) | Mean Square (MS)  | F
Regression                   | SSR                 | 1                         | MSR = SSR/1       | F = MSR/MSE
Error                        | SSE                 | n - 2                     | MSE = SSE/(n - 2) |
Total                        | SST                 | n - 1                     |                   |
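A MATLAB sketch (ours) that assembles these ANOVA quantities directly from the tire data and verifies that F equals the square of the t-statistic:

x = (0:4:32)';
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33]';
p = polyfit(x, y, 1);                % least squares fit
e = y - polyval(p, x);               % residuals
SSE = sum(e.^2);
SST = sum((y - mean(y)).^2);
SSR = SST - SSE;
MSR = SSR/1; MSE = SSE/(length(x) - 2);
F = MSR/MSE                          % compare with f_{1,7,alpha}; F = t^2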
5.1 Checking the Model Assumptions
5.1.1 Checking for Linearity
5.1.2 Checking for Constant Variance
5.1.3 Checking for Normality
Primary tool: residual plots
5.2 Checking for Outliers and Influential Observations
5.2.1 Checking for Outliers
5.2.2 Checking for Influential Observations
5.2.3 How to Deal with Outliers and Influential Observations
5. Regression Diagnostics
5.1.1 Checking for Linearity
 i | x_i | y_i    | ŷ_i    | e_i
 1 |  0  | 394.33 | 360.64 |  33.69
 2 |  4  | 329.50 | 331.51 |  -2.01
 3 |  8  | 291.00 | 302.39 | -11.39
 4 | 12  | 255.17 | 273.27 | -18.10
 5 | 16  | 229.33 | 244.15 | -14.82
 6 | 20  | 204.83 | 215.02 | -10.19
 7 | 24  | 179.00 | 185.90 |  -6.90
 8 | 28  | 163.83 | 156.78 |   7.05
 9 | 32  | 150.33 | 127.66 |  22.67
Table 5.1 The x_i, y_i, ŷ_i, and e_i for the Tire Wear Data
Figure 5.1 Plot of the fitted line and residuals for the Tire Wear Data
5. Regression Diagnostics
5.1.1 Checking for Linearity (Data transformation)
[Scatter plot shapes with linearizing transformations: depending on the curvature, transform x to x², x³, log x, or −1/x, and/or y to y², y³, log y, or −1/y]
Figure 5.2 Typical Scatter Plot Shapes and Corresponding Linearizing Transformations
5. Regression Diagnostics
5.1.1 Checking for Linearity (Data transformation)
 i | x_i | y_i    | fitted log y | ŷ_i = e^(fitted log y) | e_i
 1 |  0  | 394.33 | 5.926        | 374.64                 | 19.69
 2 |  4  | 329.50 | 5.807        | 332.58                 | -3.08
 3 |  8  | 291.00 | 5.688        | 295.24                 | -4.24
 4 | 12  | 255.17 | 5.569        | 262.09                 | -6.92
 5 | 16  | 229.33 | 5.450        | 232.67                 | -3.34
 6 | 20  | 204.83 | 5.331        | 206.54                 | -1.71
 7 | 24  | 179.00 | 5.211        | 183.36                 | -4.36
 8 | 28  | 163.83 | 5.092        | 162.77                 |  1.06
 9 | 32  | 150.33 | 4.973        | 144.50                 |  5.83
Table 5.2 The x_i, y_i, fitted values, and residuals for the Tire Wear Data after the log transformation
Figure 5.2 Plot of the transformed fit for the Tire Wear Data
5. Regression Diagnostics
5.1.2 Checking for Constant Variance
Plot the residuals $e_i$ against the fitted values $\hat{y}_i$. If the constant variance assumption is correct, the dispersion of the $e_i$'s is approximately constant with respect to the $\hat{y}_i$'s.
Figure 5.3 Plots of Residuals
Figure 5.4 Plots of Residuals
5. Regression Diagnostics
5.1.3 Checking for normality
Make a normal plot of the residuals. The residuals have a zero mean and an approximately constant variance (assuming the other assumptions about the model are correct), so if the errors are normal, the plot should be close to a straight line.
Figure 5.5 Normal plot of the residuals
5. Regression Diagnostics
Outlier:
an observation that does not follow the general pattern of the relationship between y and x. A large residual indicates an outlier.
Standardized residuals are given by
$e_i^* = \dfrac{e_i}{SE(e_i)} = \dfrac{e_i}{s\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}}, \quad i = 1, 2, \ldots, n.$
If $|e_i^*| > 2$, then the corresponding observation may be regarded as an outlier.
Influential Observation:
an influential observation has an extreme x-value, an extreme y-value, or both.
We can express the fitted value of y as a linear combination of all the $y_j$:
$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j$, where the leverage $h_{ii} = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{xx}}$
If $h_{ii} > 2(k+1)/n$ (where k is the number of predictors; k = 1 here), then the corresponding observation may be regarded as an influential observation.
5. Regression Diagnostics
5.2 Checking for Outliers and Influential Observations
 i | e_i*    | h_ii
 1 |  2.8653 | 0.3778
 2 | -0.4113 | 0.2611
 3 | -0.5367 | 0.1778
 4 | -0.8505 | 0.1278
 5 | -0.4067 | 0.1111
 6 | -0.2102 | 0.1278
 7 | -0.5519 | 0.1778
 8 |  0.1416 | 0.2611
 9 |  0.8484 | 0.3778
Table 5.3 Standardized residuals & leverages for the transformed data
With n = 9 and k = 1, the cutoffs are $|e_i^*| > 2$ and $h_{ii} > 2(k+1)/n = 0.44$. Observation 1 has $e_1^* = 2.8653 > 2$, so it may be regarded as an outlier; no $h_{ii}$ exceeds 0.44.
5. Regression Diagnostics
clear; clc;
x = [0 4 8 12 16 20 24 28 32];
y = [394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33];
y1 = log(y);                              % data transformation
p = polyfit(x,y,1)                        % linear regression of y on x
% p = polyfit(x,log(y),1)                 % alternative: fit the log-transformed data
yfit = polyval(p,x);                      % use p to predict y
yresid = y - yfit;                        % compute the residuals
% yresid = y - exp(yfit);                 % residuals (original scale) for the transformed fit
ssresid = sum(yresid.^2);                 % residual sum of squares (SSE)
sstotal = (length(y)-1)*var(y);           % total sum of squares (SST)
rsq = 1 - ssresid/sstotal;                % R-square
normplot(yresid)                          % normal plot of the residuals
[h0,pval,jbstat,critval] = jbtest(yresid) % Jarque-Bera test of normality
figure
scatter(x,yresid,500,'r','.')             % residual plot
axis([-5 35 -10 25])
xlabel('x_i'); ylabel('e_i'); title('Plot of residuals vs. x')
n = length(x);
s = sqrt(ssresid/(n-2));                  % estimate of sigma
Sxx = sum((x-mean(x)).^2);                % = 960 for these data
estd = yresid./(s*sqrt(1 - 1/n - (x-mean(x)).^2/Sxx)) % standardized residuals; check |estd| > 2
lev = 1/n + (x-mean(x)).^2/Sxx            % leverages; check lev > 2*(1+1)/n
MATLAB Code for Regression Diagnostics
Why do we need correlation analysis? Regression analysis is used to model the relationship between two variables when one of them (the response) is to be explained or predicted by the other.
But when there is no such distinction and both variables are random, correlation analysis is used to study the strength of the relationship.
6.1 Correlation Analysis
6.1 Correlation Analysis- Example
[Figure 6.1 shows pairs of related variables, e.g.: flu cases reported, people who get a flu shot, life expectancy, economy level, temperature, economic growth]
Figure 6.1
Example
Because we need to investigate the correlation between X and Y.
Source:http://wiki.stat.ucla.edu/socr/index.php/File:SOCR_BivariateNormal_JS_Activity_Fig7.png
6.2 Bivariate Normal Distribution
Figure 6.2
6.2 Why introduce Bivariate Normal Distribution?
First, we need to do some computation. If $(X, Y)$ has a bivariate normal distribution, then the conditional distribution of $Y$ given $X = x$ is
$Y \mid X = x \sim N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X),\ \sigma_Y^2(1 - \rho^2)\right)$
Compare with the simple linear regression model $Y = \beta_0 + \beta_1 x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$: the two match, with
$\beta_1 = \rho\dfrac{\sigma_Y}{\sigma_X}, \quad \beta_0 = \mu_Y - \beta_1\mu_X, \quad \sigma^2 = \sigma_Y^2(1 - \rho^2)$
So, if $(X, Y)$ has a bivariate normal distribution, then the regression model is true.
Define the r.v. $R$ corresponding to $r$. But the distribution of $R$ is quite complicated.
6.3 Statistical Inference on ρ
[Figure 6.3: density curves $f(r)$ of $R$ for several values of ρ, e.g. −0.7, −0.3, 0, and 0.5]
6.3 Exact test when ρ = 0
Test: $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$
Test statistic: $T_0 = \dfrac{R\sqrt{n-2}}{\sqrt{1-R^2}} \sim t_{n-2}$ under $H_0$
Reject $H_0$ iff $|t_0| \geq t_{n-2,\alpha/2}$
Example: A researcher wants to determine if two test instruments give similar results. The two test instruments are administered to a sample of 15 students. The correlation coefficient between the two sets of scores is found to be 0.7. Is this correlation statistically significant at the .01 level?
$H_0: \rho = 0$ vs. $H_a: \rho \neq 0$
$t_0 = \dfrac{0.7\sqrt{15-2}}{\sqrt{1-0.7^2}} = 3.534 > t_{13,.005} = 3.012$
So, we reject $H_0$.
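The arithmetic can be verified with a short MATLAB computation (a sketch; tinv requires the Statistics Toolbox):

r = 0.7; n = 15;
t0 = r*sqrt(n-2)/sqrt(1-r^2)         % = 3.534
tcrit = tinv(1-0.005, n-2)           % t_{13,.005} = 3.012; t0 > tcrit, so reject H0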
6.3 Note: They are the same!
Because $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$ and $r = \dfrac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$,
so $\hat{\beta}_1 = r\sqrt{S_{yy}/S_{xx}}$, and the t-statistic for $H_0: \beta_1 = 0$ reduces to $\dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$.
We can say the test of $H_0: \beta_1 = 0$ is equivalent to the test of $H_0: \rho = 0$.
Because the exact distribution of $R$ is not very useful for making inferences on ρ, R. A. Fisher showed that we can apply the following transformation to obtain an approximately normal distribution. That is,
$\hat{Z} = \tanh^{-1} R = \dfrac{1}{2}\ln\dfrac{1+R}{1-R} \approx N\left(\dfrac{1}{2}\ln\dfrac{1+\rho}{1-\rho},\ \dfrac{1}{n-3}\right)$
6.3 Approximate test when ρ ≠ 0
6.3 Steps to do the approximate test on ρ
1. $H_0: \rho = \rho_0$ vs. $H_1: \rho \neq \rho_0$
2. Point estimator: $\hat{z} = \tanh^{-1} r = \frac{1}{2}\ln\frac{1+r}{1-r}$
3. T.S.: $z_0 = (\hat{z} - \zeta_0)\sqrt{n-3}$, where $\zeta_0 = \tanh^{-1}\rho_0$; reject $H_0$ iff $|z_0| \geq z_{\alpha/2}$
4. C.I. for ρ: $\tanh\left(\hat{z} \mp \dfrac{z_{\alpha/2}}{\sqrt{n-3}}\right)$
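A MATLAB sketch of these steps (ours; the null value rho0 = 0.5 is chosen purely for illustration, reusing r = 0.7 and n = 15 from the example):

r = 0.7; n = 15; rho0 = 0.5;         % rho0 assumed for illustration
zhat = atanh(r);                     % (1/2)*log((1+r)/(1-r))
z0 = (zhat - atanh(rho0))*sqrt(n-3)  % approx. N(0,1) under H0; about 1.10
ci = tanh(zhat + [-1 1]*1.96/sqrt(n-3)) % 95% C.I. for rho, about (0.29, 0.89)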
• Lurking Variables
• Over-extrapolation
6.4 The pitfalls of correlation analysis
7. Implementation in SAS
obs | state | district | democA | voteA | expendA | expendB | prtystrA | lexpendA | lexpendB | shareA
  1 | "AL"  | 7        | 1      | 68    | 328.30  | 8.74    | 41       | 5.793916 | 2.167567 | 97.41
  2 | "AK"  | 1        | 0      | 62    | 626.38  | 402.48  | 60       | 6.439952 | 5.997638 | 60.88
  3 | "AZ"  | 2        | 1      | 73    | 99.61   | 3.07    | 55       | 4.601233 | 1.120048 | 97.01
  … |       |          |        |       |         |         |          |          |          |
173 | "WI"  | 8        | 1      | 30    | 14.42   | 227.82  | 47       | 2.668685 | 5.428569 | 5.95
Table 7.1 vote example data
SAS code of the vote example
proc corr data=vote1;
  var F4 F10;
run;

Pearson Correlation Coefficients, N = 173
Prob > |r| under H0: Rho=0

     | F4      | F10
F4   | 1.00000 | 0.92528
Table 7.2 Correlation coefficients
7. Implementation in SAS
proc reg data=vote1;
  model F4=F10;
  label F4=voteA;
  label F10=shareA;
  output out=fitvote residual=R;
run;
SAS output
Analysis of Variance

Source          | DF  | Sum of Squares | Mean Square | F Value | Pr > F
Model           | 1   | 41486          | 41486       | 1017.70 | <.0001
Error           | 171 | 6970.77364     | 40.76476    |         |
Corrected Total | 172 | 48457          |             |         |

Root MSE       | 6.38473  | R-Square | 0.8561
Dependent Mean | 50.50289 | Adj R-Sq | 0.8553
Coeff Var      | 12.64230 |          |

Parameter Estimates

Variable  | Label     | DF | Parameter Estimate | Standard Error | t Value
Intercept | Intercept | 1  | 26.81254           | 0.88719        | 30.22
F10       | F10       | 1  | 0.46382            | 0.01454        | 31.90

Table 7.3 SAS output for vote example
7. Implementation in SAS
Figure 7.1 Plot of Residual vs. ShareA for the vote example
7. Implementation in SAS
Figure 7.2 Plot of voteA vs. shareA for the vote example
7. Implementation in SAS
SAS-Check Homoscedasticity
Figure 7.3 Plots of SAS output for the vote example
7. Implementation in SAS
SAS-Check Normality of Residuals
SAS code:

proc univariate data=fitvote normal;
  var R;
  qqplot R / normal (Mu=est Sigma=est);
run;

Tests for Location: Mu0=0

Test        | Statistic | p Value
Student's t | t 0       | Pr > |t| 1.0000
Sign        | M -0.5    | Pr >= |M| 1.0000
Signed Rank | S -170.5  | Pr >= |S| 0.7969

Tests for Normality

Test               | Statistic     | p Value
Shapiro-Wilk       | W 0.952811    | Pr < W 0.7395
Kolmogorov-Smirnov | D 0.209773    | Pr > D >0.1500
Cramer-von Mises   | W-Sq 0.056218 | Pr > W-Sq >0.2500
Anderson-Darling   | A-Sq 0.30325  | Pr > A-Sq >0.2500

Table 7.4 SAS output for checking normality
7. Implementation in SAS
SAS-Check Normality of Residuals
Figure 7.4 Plot of Residual vs. Normal Quantiles for the vote example
7. Implementation in SAS
• Linear regression is widely used to describe possible relationships between variables, and it ranks as one of the most important tools in disciplines such as:
  - Marketing/business analytics
  - Healthcare
  - Finance
  - Economics
  - Ecology/environmental science
8. Application
• Prediction, forecasting, or deduction
Linear regression can be used to fit a predictive model to an observed data set of Y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of Y, the fitted model can be used to make a prediction of the value of Y.
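For instance, with the tire model fitted in Section 3, a one-line MATLAB sketch predicts groove depth at a new mileage (x = 25 is an arbitrary illustrative value):

yhat = polyval([-7.281 360.64], 25)  % predicted groove depth: about 178.6 mils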
8. Application
• Quantifying the strength of relationships
Given a variable Y and a number of variables X1, ..., Xp that may be related to Y, linear regression analysis can be applied to assess which Xj may have no relationship with Y at all, and to identify which subsets of the Xj contain redundant information about Y.
8. Application
Example 1. Trend line
8. Application
A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. Trend lines are sometimes used in business analytics to show changes in data over time.
Figure 8.1 Refrigerator sales over a 13-year period
http://www.likeoffice.com/28057/Excel-2007-Formatting-charts
Example 2. Clinical drug trials
8. Application
Regression analysis is widely utilized in healthcare. The graph shows an example in which we investigate the relationship between protein concentration and absorbance employing linear regression analysis.
Figure 8.2 BSA Protein Concentration vs. Absorbance
http://openwetware.org/wiki/User:Laura_Flynn/Notebook/Experimental_Biological_Chemistry/2011/09/13
Summary

Linear Regression Analysis:
• Probabilistic Models
• Least Squares Estimate
• Statistical Inference
• Model Assumptions: Linearity, Constant Variance & Normality; Data Transformation
• Outliers & Influential Observations

Correlation Analysis:
• Correlation Coefficient: Bivariate Normal Distribution, Exact t-test, Approximate z-test
Acknowledgement
• Sincere thanks go to Prof. Wei Zhu.
References
• Statistics and Data Analysis, Ajit Tamhane & Dorothy Dunlop.
• Introductory Econometrics: A Modern Approach, Jeffrey M. Wooldridge, 5th ed.
• http://en.wikipedia.org/wiki/Regression_analysis
• http://en.wikipedia.org/wiki/Adrien_Marie_Legendre
• etc. (web links have already been included in the slides)
Acknowledgement & References