
STAT3010: Lecture 11

CORRELATION AND REGRESSION

Correlation Analysis (Section 10.1, Page 466)

The goal of correlation analysis is to understand the nature and strength of the relationship between x and y (bivariate data). We begin by examining the relationship between the two variables with a scatter plot of the (x, y) pairs.

The following scatter plots display different types of relationships between the x and y values:

[Scatter plots illustrating different relationships; plot (b) shows a direct (positive) linear relationship.]


To make precise statements about a data set, we must go beyond the scatter plot. For example, we know that plot (b) above shows a direct (positive) linear relationship between x and y, but the question is: how strong is it? This is where the population correlation coefficient comes in. It tells us not only the direction of the correlation between x and y, but also its strength.

The population correlation coefficient, denoted by ρ (rho), only takes on values in the range −1 ≤ ρ ≤ 1.

The sign of the correlation coefficient indicates the nature (direction) of the relationship between x and y, and the magnitude of the correlation coefficient indicates the strength of the linear association between the two variables. Recall from STAT 2010/2020: ρ = −1 corresponds to a perfect negative linear relationship, ρ = 0 to no linear relationship, and ρ = +1 to a perfect positive linear relationship.
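For reference, the population correlation coefficient is defined analogously to the sample version developed below (a standard definition, not reproduced in the transcript):

$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}$$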


THE SAMPLE CORRELATION COEFFICIENT r

Definition: The sample correlation coefficient is given by

$$r = \frac{\operatorname{Cov}(x,y)}{\sqrt{\operatorname{Var}(x)\,\operatorname{Var}(y)}} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

where Var(x) and Var(y) are the sample variances of x and y, respectively. Recall:

$$\operatorname{Var}(x) = \frac{S_{xx}}{n-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \qquad \text{and} \qquad \operatorname{Var}(y) = \frac{S_{yy}}{n-1} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}$$

And Cov(x, y) is the covariance of x and y, defined by:

$$\operatorname{Cov}(x,y) = \frac{S_{xy}}{n-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

Computing formulas for the three summation quantities are:

$$S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}, \qquad S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}, \qquad S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$


Standard deviation and variance operate on only one dimension, so you can only calculate the standard deviation for each dimension of a data set independently of the other dimensions. However, it is useful to have a similar measure of how much the dimensions vary from their means with respect to each other.

Covariance is such a measure. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.
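As a compact illustration of this idea (a sketch, not part of the original notes), the pairwise covariances of a three-dimensional data set (x, y, z) are often arranged in a covariance matrix, with the variances on the diagonal:

$$C = \begin{pmatrix} \operatorname{Var}(x) & \operatorname{Cov}(x,y) & \operatorname{Cov}(x,z) \\ \operatorname{Cov}(y,x) & \operatorname{Var}(y) & \operatorname{Cov}(y,z) \\ \operatorname{Cov}(z,x) & \operatorname{Cov}(z,y) & \operatorname{Var}(z) \end{pmatrix}$$

Since Cov(a, b) = Cov(b, a), this matrix is symmetric.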

Example 10.1: Correlation Between Body Mass Index and Systolic Blood Pressure

Suppose we are interested in the relationship between body mass index and systolic blood pressure in males 50 years old. A random sample of 10 males 50 years of age is selected, and their body mass index scores and systolic blood pressures are recorded in the following table:


X = Body Mass Index    Y = Systolic Blood Pressure
       18.4                    120
       20.1                    110
       22.4                    120
       25.9                    135
       26.5                    140
       28.9                    115
       30.1                    150
       32.9                    165
       33.0                    160
       34.7                    180

First, view the scatter diagram:

[Scatter plot: systolic blood pressure (vertical axis) versus body mass index (horizontal axis)]

Calculate the sample correlation coefficient and explain:
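Using the summary quantities reported in the SAS output below (Cov(bmi, sbp) = 115.2167, Var(bmi) = 31.8521, Var(sbp) = 563.6111), the calculation works out as:

$$r = \frac{\operatorname{Cov}(x,y)}{\sqrt{\operatorname{Var}(x)\,\operatorname{Var}(y)}} = \frac{115.2167}{\sqrt{(31.8521)(563.6111)}} \approx 0.860$$

That is, there is a strong, positive linear association between body mass index and systolic blood pressure in this sample.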


Now that was the sample correlation; what about the population correlation? The sample correlation coefficient, r, is a point estimate of the population correlation coefficient, ρ. Tests of hypothesis concerning ρ address whether there is a linear association in the population.

To test the null hypothesis of NO linear relationship (H0: ρ = 0), we use the test statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

with df = n − 2 (using Table B.3 to get a critical value).

Example 10.1.2: Statistical Inference Concerning ρ

Hypothesis: H0: ρ = 0 versus H1: ρ ≠ 0 (two-sided, taking α = 0.05)

Test Statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.85992\sqrt{8}}{\sqrt{1-(0.85992)^2}} \approx 4.76, \qquad df = 8$$

Decision: Reject H0, since 4.76 exceeds the Table B.3 critical value t(8, 0.025) = 2.306 (equivalently, the SAS output below reports a two-sided p-value of 0.0014).

Conclusion: There is significant evidence of a positive linear association between body mass index and systolic blood pressure in the population.

SAS CODE:

options ps=62 ls=80;
data correlation;
input bmi sbp;
cards;
18.4 120
20.1 110
22.4 120
25.9 135
26.5 140
28.9 115
30.1 150
32.9 165
33.0 160
34.7 180
run;
proc plot;
plot sbp*bmi;
run;
proc corr cov;
var bmi sbp;
run;

SAS OUTPUT:

The SAS System

[PROC PLOT output: plot of sbp*bmi, a scatter of sbp (vertical axis) versus bmi (horizontal axis, 17.5 to 35.0); Legend: A = 1 obs, B = 2 obs, etc.]


The SAS System

The CORR Procedure

2 Variables: bmi sbp

Covariance Matrix, DF = 9

               bmi            sbp
bmi    31.8521111    115.2166667
sbp   115.2166667    563.6111111

Simple Statistics

Variable    N        Mean     Std Dev          Sum     Minimum     Maximum
bmi        10    27.29000     5.64377    272.90000    18.40000    34.70000
sbp        10   139.50000    23.74050         1395   110.00000   180.00000

Pearson Correlation Coefficients, N = 10
Prob > |r| under H0: Rho=0

              bmi          sbp
bmi       1.00000      0.85992
                        0.0014
sbp       0.85992      1.00000
           0.0014
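Note that r = 0.85992 agrees with the hand calculation above, and the value 0.0014 printed beneath it is the two-sided p-value for H0: ρ = 0, matching the test in Example 10.1.2.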

Simple Linear Regression (Section 10.2, Page 477)

Regression analysis is used to develop the mathematical equation that best describes the relationship between two variables, x and y. In correlation analysis it is not necessary to specify which of the two variables is independent and which is dependent; in regression analysis it is necessary: they must be specified.


Remember our scatter plot from Example 10.1 (above). We now want to find the equation of the line that best fits these data. This equation of the line relating y to x is called the simple linear regression equation and is given by:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where

Y is the dependent variable,
X is the independent variable,
$\beta_0$ is the Y-intercept (the value of Y when X = 0),
$\beta_1$ is the slope (the expected change in Y relative to a one-unit change in X), and
$\varepsilon$ is the error.

The parameters $\beta_0$ and $\beta_1$ in the least squares regression line are estimated in such a way that the sum of squared errors,

$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2,$$

is minimized. Let the estimates of $\beta_0$ and $\beta_1$ be denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively. These estimators are the solutions of the following (normal) equations:

$$\sum y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum x_i \qquad \text{and} \qquad \sum x_i y_i = \hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2$$

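As a brief sketch of the algebra (worked out by hand in lecture): dividing the first normal equation by n gives $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$, so $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$; substituting this into the second equation and solving for $\hat{\beta}_1$ gives $\hat{\beta}_1 = S_{xy}/S_{xx}$.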

We have now obtained:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

These estimates are called the least squares estimates of the slope and intercept. The estimate of the simple linear regression equation is given by substituting the least squares estimates into the simple linear regression equation:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{y}$ is the expected value of Y for a given value of X.

Back to Example 10.1: compute the least squares regression equation for the data given in Example 10.1.
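Since $S_{xy}/S_{xx} = \operatorname{Cov}(x,y)/\operatorname{Var}(x)$, we can use the quantities from the SAS covariance output above (Cov(bmi, sbp) = 115.2167, Var(bmi) = 31.8521, $\bar{x}$ = 27.29, $\bar{y}$ = 139.5):

$$\hat{\beta}_1 = \frac{115.2167}{31.8521} \approx 3.617, \qquad \hat{\beta}_0 = 139.5 - (3.617)(27.29) \approx 40.786$$

so the fitted line is $\hat{y} \approx 40.786 + 3.617x$, which matches the PROC REG output below.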


To compute the regression estimates ($\hat{\beta}_0$ and $\hat{\beta}_1$) within SAS, place the following code after the correlation code introduced above:

proc reg;
model sbp=bmi;
run;


The SAS System

The REG Procedure
Model: MODEL1
Dependent Variable: sbp

Number of Observations Read    10
Number of Observations Used    10

Analysis of Variance

                                Sum of         Mean
Source             DF          Squares       Square    F Value    Pr > F
Model               1       3750.89494   3750.89494      22.71    0.0014
Error               8       1321.60506    165.20063
Corrected Total     9       5072.50000

Root MSE          12.85304    R-Square    0.7395
Dependent Mean   139.50000    Adj R-Sq    0.7069
Coeff Var          9.21365

Parameter Estimates

                      Parameter      Standard
Variable     DF        Estimate         Error    t Value    Pr > |t|
Intercept     1        40.78558      21.11158       1.93      0.0895
bmi           1         3.61724       0.75913       4.76      0.0014
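Reading off the Parameter Estimates table, the fitted equation is $\hat{y} = 40.78558 + 3.61724(\text{bmi})$: each one-unit increase in body mass index is associated with an expected increase of about 3.62 units in systolic blood pressure. Note also that R-Square = 0.7395 = (0.85992)² = r², and the p-value for the slope (0.0014) equals the p-value from the correlation test, as it must in simple linear regression.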