
Correlation and Simple Linear Regression

PSY440

June 10, 2008

A few points of clarification

• For the chi-squared test, the results are unreliable if the expected frequency in too many of your cells is too low.

• A rule of thumb is that the minimum expected frequency should be 5 (i.e., no cells with expected counts less than 5). A more conservative rule, recommended by some, is a minimum expected frequency of 10. If your minimum is too low, you need a larger sample! The more categories you have, the larger your sample must be.

• SPSS will warn you if you have any cells with expected frequency less than 5.

Regarding threats to internal validity

• One of the strengths of well-designed single-subject research is the use of repeated observations during each phase.

• Repeated observations during baseline and intervention (e.g., during an AB study) help rule out testing, instrumentation (somewhat), and regression. These effects would be unlikely to produce a marked change between experimental phases that is not apparent during repeated observations before and after the phase change.

Regarding histograms

The difference between a histogram and a bar graph is that the variable on the x axis (which represents the score on the variable being graphed, as opposed to the frequency of observations) is conceptualized as being continuous in a histogram, whereas a bar graph represents discrete categories along the x axis.

About the exam….

Exam on Thursday will cover material from the first three weeks of class (lectures 1-6, or everything through Chi-Squared tests).

Emphasis of exam will be on generating results with computers (calculations by hand will not be emphasized), and interpreting the results.

Exam questions will be based mainly on lecture material and modeled on previous active learning experiences (homework and in-class demonstrations and exercises).

Knowledge of material on qualitative methods and experimental & single-subject design is expected.

Before we move on…..

Any questions?

Today’s lecture and next homework

Today’s lecture will cover correlation and simple (bivariate) regression.

Homework based on today’s lecture will be distributed on Thursday and due on Tuesday (June 17).

Correlation

• A correlation is the association between scores on two variables
– Age and coordination skills in children: as kids get older, their motor coordination tends to improve
– Price and quality: generally, the more expensive something is, the higher its quality

Correlation and Causality

Correlational research
– Correlation as a statistical procedure is generally used to measure the association between two (or more) continuous variables
– Correlation as a kind of research design refers to observational studies in which there is no experimental manipulation.

Correlation and Causality

Correlational research

– Not all “correlational” (i.e., observational) research designs use correlation as the statistical procedure for analyzing the data (example: comparison of verbal abilities between boys and girls - observational study - don’t manipulate gender - but probably analyze mean differences with t-tests).

– But: Virtually all of the inferential statistical methods (including t-tests, ANOVA, ANCOVA) covered in 440 can be represented in terms of correlational/regression models (the general linear model - we'll talk more about this later).

– Bottom line: Don’t confuse design with analytic strategy.

Correlation and Causality

• Correlations (like other linear statistical models) describe relationships between variables, but DO NOT explain why the variables are related

Suppose that Dr. Steward finds that rates of spilled coffee and severity of plane turbulence are strongly positively correlated.

One might argue that turbulence causes coffee spills

One might argue that spilling coffee causes turbulence

Correlation and Causation

Suppose that Dr. Cranium finds a positive correlation between head size and digit span (roughly the number of digits you can remember).

One might argue that the bigger your head, the larger your digit span

[Slide graphic: illustration of the head size–digit span example]

One might argue that head size and digit span both increase with age (but head size and digit span aren't directly related)

Correlation and Causation

Observational research and correlational statistical methods (including regression and path analysis) can be used to compare competing models of causation, to see which model fits the data best.


Relationships between variables

• Properties of a statistical correlation
– Form (linear or non-linear)
– Direction (positive or negative)
– Strength (none, weak, strong, perfect)

• To examine this relationship you should:
– Make a scatterplot - a picture of the relationship
– Compute the correlation coefficient - a numerical description of the relationship

Graphing Correlations

• Steps for making a scatterplot (scatter diagram):
1. Draw axes and assign variables to them
2. Determine the range of values for each variable and mark it on the axes
3. Mark a dot for each person's pair of scores

Scatterplot

[Slide series: a scatterplot is built up one point at a time, with both axes running from 1 to 6]

     X  Y
A    6  6
B    1  2
C    5  6
D    3  4
E    3  2

• Plots one variable against the other
• Each point corresponds to a different individual
• Imagine a line through the data points
• Useful for "seeing" the relationship – Form, Direction, and Strength

Scatterplots with Excel and SPSS

In SPSS: Graphs menu => Legacy Dialogs => Scatter/Dot => Simple Scatter. Click Define, and select which variable you want on the x axis and which on the y axis.

In Excel: Insert menu => Chart => XY (Scatter). Specify whether the variables are arranged in rows or columns and select the cells with the relevant data.
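For those who prefer working outside SPSS and Excel, here is a minimal Python (matplotlib) sketch of the same steps; the height and shoe-size values are made-up placeholder data, not the class dataset.

```python
# A minimal scatterplot sketch, mirroring the SPSS/Excel steps above;
# the variable names and values are just placeholders.
import matplotlib.pyplot as plt

height = [62, 65, 68, 70, 71, 74]      # X: predictor on the x axis
shoe_size = [7, 8, 8.5, 10, 10.5, 12]  # Y: on the y axis

plt.scatter(height, shoe_size)          # one dot per person's pair of scores
plt.xlabel("Height")
plt.ylabel("Shoe size")
plt.title("Scatterplot")
plt.show()
```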

Form

[Slide graphic: two example scatterplots, one labeled Linear and one labeled Non-linear]

Direction

[Slide graphic: two example scatterplots, one with a positive slope and one with a negative slope]

Positive:
• X & Y vary in the same direction
• As X goes up, Y goes up
• Positive Pearson's r

Negative:
• X & Y vary in opposite directions
• As X goes up, Y goes down
• Negative Pearson's r

Strength

• The strength of the relationship
– Spread around the line (note the axis scales)
– The correlation coefficient will range from -1 to +1
• Zero means "no relationship"
• The farther r is from zero, the stronger the relationship
– In general, when we talk about correlation coefficients: correlation coefficient = Pearson's product moment coefficient = Pearson's r = r.

Strength

[Slide graphic: number line from -1.0 through 0.0 to +1.0]

r = -1.0: "perfect negative corr.", r² = 100%
r = 0.0: "no relationship", r² = 0%
r = +1.0: "perfect positive corr.", r² = 100%

The farther from zero, the stronger the relationship.

The Correlation Coefficient

• Formulas for the correlation coefficient:

Conceptual formula: $r = \frac{\sum Z_X Z_Y}{N}$

Common alternative: $r = \frac{SP}{\sqrt{SS_X SS_Y}}$, where $SP = \sum (X - \bar{X})(Y - \bar{Y})$

Computing Pearson's r (using SP)

• Step 1: SP (Sum of the Products), $SP = \sum (X - \bar{X})(Y - \bar{Y})$

      X    Y    (X − X̄)  (Y − Ȳ)  (X − X̄)(Y − Ȳ)
      6    6     2.4      2.0      4.8
      1    2    -2.6     -2.0      5.2
      5    6     1.4      2.0      2.8
      3    4    -0.6      0.0      0.0
      3    2    -0.6     -2.0      1.2
mean  3.6  4.0   0.0      0.0     SP = 14.0

Quick check: each deviation column sums to 0.0.

Computing Pearson's r (using SP)

• Step 2: SS_X & SS_Y

      X    Y    (X − X̄)²  (Y − Ȳ)²
      6    6     5.76      4.0
      1    2     6.76      4.0
      5    6     1.96      4.0
      3    4     0.36      0.0
      3    2     0.36      4.0
                SS_X = 15.20   SS_Y = 16.0

Computing Pearson's r (using SP)

• Step 3: compute r

$r = \frac{SP}{\sqrt{SS_X SS_Y}} = \frac{14}{\sqrt{15.2 \times 16}} = 0.89$

[Slide graphic: scatterplot of the five data points]

• Appears linear
• Positive relationship
• Fairly strong relationship: .89 is far from 0, near +1
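The three SP-based steps above translate directly into a few lines of Python. This is a sketch of the same computation on the example data (not how SPSS does it internally):

```python
# Sketch of the SP-based formula: r = SP / sqrt(SS_X * SS_Y),
# using the example data from the slides.
from math import sqrt

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
N = len(X)

mean_x = sum(X) / N                                            # 3.6
mean_y = sum(Y) / N                                            # 4.0

SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))   # 14.0
SSX = sum((x - mean_x) ** 2 for x in X)                        # 15.2
SSY = sum((y - mean_y) ** 2 for y in Y)                        # 16.0

r = SP / sqrt(SSX * SSY)
print(r)  # 0.8977..., which the slides round to 0.89
```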

The Correlation Coefficient

• Formulas for the correlation coefficient (repeated for reference):

Conceptual formula: $r = \frac{\sum Z_X Z_Y}{N}$; common alternative: $r = \frac{SP}{\sqrt{SS_X SS_Y}}$

• We just used the SP formula; now we compute the same r using the conceptual (z-score) formula.

Computing Pearson's r (using z-scores)

• Step 1: compute the standard deviation for X and Y (note: keep track of whether your data are a sample or a population)

• For this example we will assume the data are from a population:

$\sigma_X = \sqrt{SS_X / N} = \sqrt{15.2 / 5} = 1.74$

$\sigma_Y = \sqrt{SS_Y / N} = \sqrt{16.0 / 5} = 1.79$

Computing Pearson's r (using z-scores)

• Step 2: compute z-scores, $Z_X = \frac{X - \bar{X}}{\sigma_X}$ and $Z_Y = \frac{Y - \bar{Y}}{\sigma_Y}$

      X    Y    Z_X     Z_Y
      6    6    1.38    1.1      (e.g., 2.4 / 1.74 = 1.38 and 2.0 / 1.79 = 1.1)
      1    2   -1.49   -1.1
      5    6    0.80    1.1
      3    4   -0.34    0.0
      3    2   -0.34   -1.1

Quick check: each z-score column sums to 0.0.

Computing Pearson's r (using z-scores)

• Step 3: compute r

      X    Y    Z_X     Z_Y     Z_X · Z_Y
      6    6    1.38    1.1     1.52
      1    2   -1.49   -1.1     1.64
      5    6    0.80    1.1     0.88
      3    4   -0.34    0.0     0.00
      3    2   -0.34   -1.1     0.37
                               Σ = 4.41

$r = \frac{\sum Z_X Z_Y}{N} = \frac{4.41}{5} = 0.88$

Computing Pearson's r (using z-scores)

$r = \frac{\sum Z_X Z_Y}{N} = 0.88$ (the slight difference from the 0.89 obtained with the SP formula is due to rounding the z-scores)

[Slide graphic: scatterplot of the five data points]

• Appears linear
• Positive relationship
• Fairly strong relationship: .88 is far from 0, near +1
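And here is a sketch of the conceptual (z-score) route in Python, again assuming the data are a population:

```python
# Sketch of the conceptual formula r = sum(Zx * Zy) / N, treating the
# data as a population (divide SS by N when computing the SDs).
from math import sqrt

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
N = len(X)

mx, my = sum(X) / N, sum(Y) / N
sx = sqrt(sum((x - mx) ** 2 for x in X) / N)   # 1.74...
sy = sqrt(sum((y - my) ** 2 for y in Y) / N)   # 1.78...

zx = [(x - mx) / sx for x in X]
zy = [(y - my) / sy for y in Y]

r = sum(a * b for a, b in zip(zx, zy)) / N
print(r)  # 0.8977...: identical to the SP route at full precision;
          # the slides get 0.88 because they round the z-scores first
```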

Correlation in Research Articles

• Correlation matrix – a display of the correlations between more than two variables

[Slide graphic: example correlation matrix from an acculturation study]

• Why have a "-"?
• Why is only half the table filled with numbers?

Correlations with SPSS & Excel

SPSS: Analyze => Correlate => Bivariate. Then select the variables you want correlation(s) for (you can select just one pair, or many variables to get a correlation matrix).

Try this with height and shoe size in our data. Now try it with height, shoe size, mother's height, and number of shoes owned.

Excel: Arrange the data for two variables in two columns or rows and use the formula bar to request a correlation: =CORREL(array1, array2)
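As a cross-check outside SPSS and Excel, a correlation matrix can be sketched in Python with numpy; the four variables below are made-up stand-ins for the class dataset:

```python
# Sketch of a correlation matrix; the variable names and values are
# placeholders for height, shoe size, mother's height, and shoes owned.
import numpy as np

height        = [62, 65, 68, 70, 71, 74]
shoe_size     = [7, 8, 8.5, 10, 10.5, 12]
mother_height = [60, 64, 63, 66, 65, 68]
shoes_owned   = [12, 5, 8, 4, 6, 3]

data = np.array([height, shoe_size, mother_height, shoes_owned])
print(np.corrcoef(data))  # 4x4 matrix; entry [i, j] is r between row i and row j
```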

SPSS correlation output

[Slide graphic: screenshot of SPSS correlation output]

Invalid inferences from correlations

Why you should always look at the scatterplot before computing (and certainly before interpreting) Pearson's r:

• Correlations are greatly affected by the range of scores in the data
– Consider the height and age relationship
– Restricted range example from the text (SAT and GPA)

• Extreme scores can have dramatic effects on correlations
– A single extreme score can radically change r, especially when your sample is small

• Relations between variables may differ for subgroups, resulting in misleading r values for aggregate data

• Curvilinear relations are not captured by Pearson's r

What to do about a curvilinear pattern

• If the pattern is monotonically increasing or decreasing, convert the scores to ranks and compute r (using the same formula) on the rank scores. The result is called Spearman's rank correlation coefficient, or Spearman's rho, and can be requested in your SPSS output by checking the appropriate box when you select the variables for which you want correlations (a minimal sketch follows this list).

• If the pattern is more complicated (u-shaped or s-shaped, for example), consult more advanced statistics resources.
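Here is the minimal sketch promised above: Spearman's rho computed exactly as described, by ranking the scores and applying the ordinary Pearson formula to the ranks (numpy and scipy assumed available; the data are invented to be monotonic but curved):

```python
# Sketch of Spearman's rho: rank each variable, then apply the ordinary
# Pearson formula to the ranks (this is what the SPSS checkbox does).
import numpy as np
from scipy.stats import rankdata

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 4, 9, 16, 25, 36])   # monotonic but curvilinear (y = x**2)

rho = np.corrcoef(rankdata(x), rankdata(y))[0, 1]
print(rho)  # 1.0 (up to floating point): the ranks line up perfectly
            # even though the y-vs-x relationship is curved
```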

Coefficient of determination

• When considering "how good" a relationship is, we really should consider r² (the coefficient of determination), not just r.

• This coefficient tells you the percent of the variance in one variable that is explained, or accounted for, by the other variable.
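For example, the r = .89 computed earlier gives r² = .89² ≈ .79, so roughly 79% of the variance in one variable is accounted for by the other.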

From Correlation to Regression

• With correlation, we can examine whether variables X & Y are related

• With regression, we try to predict the value of one variable given what we know about the other variable and the relationship between the two.

Regression

• Last time: "it doesn't matter which variable goes on the X-axis or the Y-axis"

• For regression this is NOT the case

• The variable that you are predicting goes on the Y-axis (the criterion or "dependent" variable) – the predicted variable

• The variable that you are making the prediction based on goes on the X-axis (the predictor or "independent" variable) – the predicting variable

[Slide graphic: scatterplot with quiz performance on the Y-axis and hours of study on the X-axis]

Regression

• Correlation: "imagine a line through the points"

• But there are lots of possible lines

• One line is the "best fitting line"

• Regression: compute the equation corresponding to this "best fitting line"

[Slide graphic: scatterplot of quiz performance against hours of study, with a candidate line through the points]

The equation for a line

• A brief review of geometry:

Y = (X)(slope) + (intercept)

Y = intercept when X = 0

[Slide graphic: a line crossing the Y-axis at 2.0, plotted on axes from 0 to 6]

The equation for a line

• A brief review of geometry:

Y = (X)(slope) + (intercept)

slope = (change in Y) / (change in X)

[Slide graphic: the same line, rising 1 unit in Y for every 2 units in X, so slope = 0.5]

The equation for a line

• A brief review of geometry:

Y = (X)(0.5) + 2.0

[Slide graphic: the line Y = 0.5X + 2.0 plotted on axes from 0 to 6]

Regression

• A brief review of geometry; consider a perfect correlation

Y = (X)(0.5) + 2.0

• We can make specific predictions about Y based on X:

X = 5, Y = ?  →  Y = (5)(0.5) + 2.0 = 2.5 + 2 = 4.5

Regression

• Consider a less than perfect correlation

• The line still represents the predicted values of Y given X:

Y = (X)(0.5) + 2.0

X = 5, Y = ?  →  Y = (5)(0.5) + 2.0 = 2.5 + 2 = 4.5

Regression

• The "best fitting line" is the one that minimizes the error (differences) between the predicted scores (the line) and the actual scores (the points)

• Rather than compare the errors from different lines and pick the best, we will directly compute the equation for the best fitting line

Regression

• The linear model:

Y = intercept + slope (X) + error

$\mu_Y = \beta_0 + \beta_1 X + \varepsilon$

Betas (β) are sometimes called parameters. They come in two types:

• unstandardized: $\mu_Y = \beta_0 + \beta_1 X + \varepsilon$

• standardized: $\hat{Z}_Y = (\beta)(Z_X) + \varepsilon$

Now let’s go through an example computing these things

Scatterplot

• Using the dataset from our correlation example:

X: 6, 1, 5, 3, 3
Y: 6, 2, 6, 4, 2

[Slide graphic: scatterplot of these five points]

From when we computed Pearson's r:

      X    Y    (X − X̄)  (Y − Ȳ)  (X − X̄)(Y − Ȳ)  (X − X̄)²  (Y − Ȳ)²
      6    6     2.4      2.0      4.8             5.76      4.0
      1    2    -2.6     -2.0      5.2             6.76      4.0
      5    6     1.4      2.0      2.8             1.96      4.0
      3    4    -0.6      0.0      0.0             0.36      0.0
      3    2    -0.6     -2.0      1.2             0.36      4.0
mean  3.6  4.0                    SP = 14.0       SS_X = 15.20  SS_Y = 16.0

Computing the regression line (with raw scores)

With SP = 14.0, SS_X = 15.20, X̄ = 3.6, and Ȳ = 4.0:

slope: $b = \frac{SP}{SS_X} = \frac{14}{15.2} = 0.92$

intercept: $a = \bar{Y} - b\bar{X} = 4.0 - (0.92)(3.6) = 0.688$
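A sketch of the same two formulas in Python, using the example data (note that the slides' intercept of 0.688 comes from rounding b to 0.92 before computing a):

```python
# Sketch of the raw-score regression formulas b = SP / SS_X and
# a = mean(Y) - b * mean(X), using the example data.
X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]
N = len(X)

mx, my = sum(X) / N, sum(Y) / N
SP  = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 14.0
SSX = sum((x - mx) ** 2 for x in X)                     # 15.2

b = SP / SSX       # slope: 0.9210...
a = my - b * mx    # intercept: 0.6842... (0.688 if b is rounded to 0.92 first)
print(b, a)
```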

Computing the regression line (with raw scores)

slope = b = 0.92, intercept = a = 0.688:

Y = 0.92X + 0.688

[Slide graphic: the regression line drawn through the scatterplot]

Computing the regression line (with raw scores)

Y = 0.92X + 0.688

The two means will be on the line.

[Slide graphic: the regression line passing through the point (X̄, Ȳ) = (3.6, 4.0)]

Computing the regression line (standardized, using z-scores)

• Sometimes the regression equation is standardized
– Computed based on z-scores rather than raw scores

Z_X: 1.38, -1.49, 0.80, -0.34, -0.34
Z_Y: 1.1, -1.1, 1.1, 0.0, -1.1

Computing the regression line (standardized, using z-scores)

• Prediction model
– Predicted Z score (on the criterion variable) = standardized regression coefficient multiplied by the Z score on the predictor variable
– Formula: $\hat{Z}_Y = (\beta)(Z_X)$
– The standardized regression coefficient (β): in bivariate prediction, β = r

Computing the regression line (with z-scores)

slope = β = r = 0.89, intercept = 0.0

$\hat{Z}_Y = (0.89)(Z_X)$

[Slide graphic: the standardized regression line through the z-scored data, with both axes running from -2 to +2]

Regression

• We also need a measure of error

Y = X(0.5) + 2.0 + error

[Slide graphic: two scatterplots with the same line but different spread around it]

• Same line, but different relationships (a difference in strength)

• The linear equation isn't the whole thing:

Y = intercept + slope (X) + error

Regression

• Error
– Actual score minus the predicted score

• Measures of error
– r² (r-squared)
– Proportionate reduction in error = $\frac{SS_{total} - SS_{error}}{SS_{total}}$

• Note: the total squared error when predicting from the mean = SS_Total = SS_Y

• The squared error using the prediction model = sum of the squared residuals = SS_residual = SS_error

R-squared

• r² represents the percent variance in Y accounted for by X

[Slide graphic: two scatterplots, one tightly and one loosely clustered around the line]

r = 0.8, r² = 0.64: 64% variance explained
r = 0.5, r² = 0.25: 25% variance explained

Computing error around the line

• Compute the difference between the predicted values and the observed values (the "residuals")

• Square the differences

• Add up the squared differences

• Sum of the squared residuals = SS_residual = SS_error

[Slide graphic: scatterplot showing the vertical distance from each point to the line]

Computing error around the line

• Predicted values of Y (points on the line): Ŷ = 0.92X + 0.688

      X    Y    Ŷ
      6    6    6.2    = (0.92)(6) + 0.688
      1    2    1.6    = (0.92)(1) + 0.688
      5    6    5.3    = (0.92)(5) + 0.688
      3    4    3.45   = (0.92)(3) + 0.688
      3    2    3.45   = (0.92)(3) + 0.688

• Sum of the squared residuals = SS_residual = SS_error

Computing error around the line

[Slide graphic: the predicted values 6.2, 1.6, 5.3, 3.45, 3.45 marked as points on the line Ŷ = 0.92X + 0.688]

Computing error around the line

• Residuals: Y − Ŷ

      X    Y    Ŷ      Y − Ŷ
      6    6    6.2    -0.20   = 6 - 6.2
      1    2    1.6     0.40   = 2 - 1.6
      5    6    5.3     0.70   = 6 - 5.3
      3    4    3.45    0.55   = 4 - 3.45
      3    2    3.45   -1.45   = 2 - 3.45

Quick check: the residuals sum to 0.00.

Computing error around the line

      X    Y    Ŷ      Y − Ŷ    (Y − Ŷ)²
      6    6    6.2    -0.20     0.04
      1    2    1.6     0.40     0.16
      5    6    5.3     0.70     0.49
      3    4    3.45    0.55     0.30
      3    2    3.45   -1.45     2.10
                        0.00    SS_error = 3.09

• Sum of the squared residuals = SS_residual = SS_error
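A sketch of this error computation in Python, using the rounded coefficients from the slides:

```python
# Sketch: predicted values, residuals, and SS_error for the fitted line
# Y-hat = 0.92 X + 0.688 (the rounded coefficients from the slides).
X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]

y_hat     = [0.92 * x + 0.688 for x in X]          # 6.208, 1.608, 5.288, 3.448, 3.448
residuals = [y - yh for y, yh in zip(Y, y_hat)]    # should sum to about 0
ss_error  = sum(e ** 2 for e in residuals)

print(round(sum(residuals), 2))  # ~0.0 (within rounding of the coefficients)
print(round(ss_error, 2))        # ~3.11 here; the slides get 3.09 because they
                                 # round the predicted values before squaring
```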

Computing error around the line

• For comparison, recall SS_Y = 16.0 (the squared deviations of Y from its mean: 4.0, 4.0, 4.0, 0.0, 4.0).

Computing error around the line

• Sum of the squared residuals = SS_residual = SS_error

• The standard error of estimate (from the textbook) is analogous to a standard deviation. It is the square root of the average squared error: $s_{Y.X} = \sqrt{SS_{error} / df}$

• The standard error of estimate is also related to r² and to the standard deviation of Y: $s_{Y.X} = s_Y \sqrt{1 - r^2}$
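A quick numerical check of the two formulas in Python. Note that they agree exactly only when a single convention is used throughout; this sketch divides by N everywhere (the population convention), whereas the textbook's version divides SS_error by df:

```python
# Sketch comparing the two standard-error-of-estimate formulas under
# the population convention (divide by N throughout).
from math import sqrt

SS_error, SS_Y, N = 3.09, 16.0, 5
r_squared = (SS_Y - SS_error) / SS_Y   # 0.807 (proportionate reduction in error)

s_y   = sqrt(SS_Y / N)                 # population SD of Y: 1.789
est_1 = sqrt(SS_error / N)             # sqrt of average squared error: 0.786
est_2 = s_y * sqrt(1 - r_squared)      # 0.786 as well: the identity holds exactly
print(est_1, est_2)
```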

Computing error around the line

• With SS_error = 3.09 and SS_Y (= SS_total) = 16.0:

Proportionate reduction in error = $\frac{SS_{total} - SS_{error}}{SS_{total}} = \frac{16.0 - 3.09}{16.0} = 0.81$

• Like r², this represents the percent variance in Y accounted for by X

• In fact, it is mathematically identical to r² (it differs slightly from .89² ≈ .79 here only because of rounding in the computations above)

Seeing patterns in the error

• Residual plots

• The sum of the residuals should always equal 0 (as should their mean)
– The least squares regression line splits the data in half: half of the error is above the line and half is below the line

• In addition to summing to zero, we also want the residuals to be randomly distributed
– That is, there should be no pattern to the residuals
– If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables

• Residual plots are very useful tools for examining the relationship even further
– These are basically scatterplots of the residuals (Y_obs − Y_pred) against the explanatory (X) variable
– (Note: the examples that follow actually plot residuals that have been transformed into z-scores.)
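A residual plot of this kind takes only a few lines; this sketch plots the raw residuals (not z-scored ones) for our example data against X:

```python
# Sketch of a residual plot: residuals (observed - predicted) against X.
# A patternless cloud around zero is what we hope to see.
import matplotlib.pyplot as plt

X = [6, 1, 5, 3, 3]
Y = [6, 2, 6, 4, 2]

residuals = [y - (0.92 * x + 0.688) for x, y in zip(X, Y)]

plt.scatter(X, residuals)
plt.axhline(0)                      # reference line at residual = 0
plt.xlabel("X (explanatory variable)")
plt.ylabel("Residual (Y - Y-hat)")
plt.show()
```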

Seeing patterns in the error

[Slide graphic: scatterplot and residual plot side by side]

• The scatterplot shows a nice linear relationship.

• The residual plot shows that the residuals fall randomly above and below the line. Critically, there doesn't seem to be a discernible pattern to the residuals.

Seeing patterns in the error

[Slide graphic: scatterplot and residual plot side by side]

• The scatterplot also shows a nice linear relationship.

• The residual plot, however, shows that the residuals get larger as X increases.

• This suggests that the variability around the line is not constant across values of X.

• This is referred to as a violation of homogeneity of variance.

Seeing patterns in the error

[Slide graphic: scatterplot and residual plot side by side]

• The scatterplot shows what may be a linear relationship.

• The residual plot suggests that a non-linear relationship may be more appropriate (see how a curved pattern appears in the residual plot).

Regression in SPSS

• Running the analysis in SPSS is pretty easy
– Analyze => Regression => Linear
– The X or predictor variable(s) go into the 'independent variable' field
– The Y or predicted variable goes into the 'dependent variable' field
– You can save the residuals as a new variable to plot against X, as shown in the previous slide

• You get a lot of output

Regression in SPSS

• The output includes:
– The variables in the model
– r and r²
– Unstandardized coefficients: the slope (labeled with the independent variable's name) and the intercept (labeled "constant")
– Standardized coefficients

• We'll get back to these numbers in a few weeks

In Excel

• With the Data Analysis "ToolPak" add-in you can perform regression analysis

• With the standard software package, you can get the bivariate correlation (which is the same as the standardized regression coefficient), you can create a scatterplot, and you can request a trend line (as we did when plotting data for single-subject research), which is a regression line (what is y and what is x in that case?)

Considerations:

• The slope depends on the variance of x and y

• The standardized slope = r (weaker associations between x and y result in flatter slopes)

• This means that as the association becomes weaker, your prediction of y is more influenced by the mean of y than by changes in x

• Regression to the mean is a special case of this…

Regression to the mean

Sometimes reliability is represented as r values (test-retest, split-half). If you have a test with low test-retest reliability, your score on the first administration is only weakly related to your score on the second administration. It is influenced by a considerable amount of error variance.

Score(observed) = Score(true) ± Error

Any time you take a measurement, the observed score reflects your true score plus error. The further your observed score gets from the mean score for the test, the more likely it is that the distance from the mean is due at least in part to error. If error is randomly distributed, then your next observed score is more likely to be closer to the mean than farther from it.

Regression to the mean

If x = obs1 and y = obs2, and the test-retest reliability of your measure is relatively low (say, r = .5), then your first score only helps predict your second score somewhat. The standardized regression equation is

y = .5x + error

On a standardized test with mean = 0 and sd = 1, if you get a score above the mean, say 1.2, the first time you take the test (obs1 = x = 1.2), and the test-retest reliability is only .5, your predicted score the next time you take the test is .5 × 1.2 = .6. You are more likely to score closer to the mean. This doesn't mean that you will definitely score closer to the mean; it just means that on average, people who score 1.2 sd above the mean the first time tend to have scores closer to .6 the next time they are tested. This is because the test isn't that reliable, and the original observation of 1.2 includes error. For the average person with that score (but not for everyone), the error is part of what accounts for the difference between the score and the mean.

If your test has higher reliability, then the regression to the mean effect is reduced.
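A small simulation makes the point concrete. This Python sketch builds a hypothetical test with a test-retest reliability of about .5 (equal true-score and error variance) and shows that people far above the mean at time 1 score, on average, much closer to the mean at time 2:

```python
# Sketch: simulate regression to the mean for a test with reliability ~.5.
# Observed = true + error; error is fresh and independent at each testing.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true = rng.normal(0, 1, n)           # true scores
obs1 = true + rng.normal(0, 1, n)    # time-1 observed score
obs2 = true + rng.normal(0, 1, n)    # time-2 observed score, new error

print(np.corrcoef(obs1, obs2)[0, 1])         # ~0.5: the test-retest reliability
high = obs1 > 2.0                             # people far above the mean at time 1
print(obs1[high].mean(), obs2[high].mean())   # time-2 mean is roughly half as far
                                              # from 0: regression to the mean
```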

Multiple Regression

• Multiple regression prediction models:

$\mu_Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$

The systematic part ($\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$) is the "fit"; ε is the "residual".
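As a preview, here is a sketch of fitting such a model by least squares in Python with numpy; the three predictors and their coefficients are invented for illustration:

```python
# Sketch: fit the multiple-regression model above by least squares.
# The data are simulated with known coefficients so the fit can be checked.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))   # three predictors: X1, X2, X3
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + 0.25 * X[:, 2] + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])          # column of 1s for beta_0
betas, *_ = np.linalg.lstsq(design, y, rcond=None)
print(betas)  # approximately [2.0, 1.5, -0.5, 0.25]
```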

Prediction in Research Articles

• Bivariate prediction models rarely reported

• Multiple regression results commonly reported
