correlation 2 computations, and the best fitting line


Post on 20-Dec-2015


TRANSCRIPT

Page 1: Correlation 2 Computations, and the best fitting line

Correlation 2

Computations, and the best fitting line.

Page 2: Correlation 2 Computations, and the best fitting line

Computing r from a more realistic set of data

• A study was performed to investigate whether the quality of an image affects reading time.

• The experimental hypothesis was that reduced quality would slow down reading time.

• Quality was measured on a scale of 1 to 10. Reading time was in seconds.

Page 3: Correlation 2 Computations, and the best fitting line

Quality vs Reading Time data: Compute the correlation

Quality (scale 1-10)    Reading time (seconds)
4.30                    8.1
4.55                    8.5
5.55                    7.8
5.65                    7.3
6.30                    7.5
6.45                    7.3
6.45                    6.0

Is there a relationship? Check for linearity. Compute r.

Page 4: Correlation 2 Computations, and the best fitting line

Calculate t scores for X

X       X - X-bar   (X - X-bar)²   tX = (X - X-bar) / sX
4.30    -1.31       1.71           -1.48
4.55    -1.06       1.12           -1.19
5.55    -0.06       0.00           -0.07
5.65     0.04       0.00            0.05
6.30     0.69       0.48            0.78
6.45     0.84       0.71            0.95
6.45     0.84       0.71            0.95

ΣX = 39.25, n = 7, X-bar = 5.61

SSW = Σ(X - X-bar)² = 4.73

MSW = 4.73/(7-1) = 0.79

sX = 0.89

Page 5: Correlation 2 Computations, and the best fitting line

Calculate t scores for Y

Y      Y - Y-bar   (Y - Y-bar)²   tY = (Y - Y-bar) / sY
8.1     0.60       0.36            0.76
8.5     1.00       1.00            1.26
7.8     0.30       0.09            0.38
7.3    -0.20       0.04           -0.25
7.5     0.00       0.00            0.00
7.3    -0.20       0.04           -0.25
6.0    -1.50       2.25           -1.89

ΣY = 52.5, n = 7, Y-bar = 7.50

SSW = Σ(Y - Y-bar)² = 3.78

MSW = 3.78/(7-1) = 0.63

sY = 0.79
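The t score steps on the last two slides (mean, SSW, MSW, s, then (score - mean) / s) can be sketched in Python as follows. This is only an illustrative sketch: the function name `t_scores` and the variable names are assumptions, not part of the slides. Because the code does not round intermediate values, a few results can differ from the hand-computed tables in the last decimal place.

```python
from math import sqrt

def t_scores(values):
    """Convert raw scores to t scores: (score - mean) / s, with s based on n - 1."""
    n = len(values)
    mean = sum(values) / n
    ss_w = sum((v - mean) ** 2 for v in values)  # SSW: sum of squared deviations
    ms_w = ss_w / (n - 1)                        # MSW: SSW divided by n - 1
    s = sqrt(ms_w)                               # estimated standard deviation
    return [(v - mean) / s for v in values]

# Data from the Quality vs Reading Time slide.
quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

t_x = t_scores(quality)
t_y = t_scores(reading_time)

print([round(t, 2) for t in t_x])
print([round(t, 2) for t in t_y])
```

A quick sanity check on any set of t scores computed this way: they sum to zero, and their squares sum to n - 1.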

Page 6: Correlation 2 Computations, and the best fitting line

Plot t scores

tX      tY
-1.48    0.76
-1.19    1.28
-0.07    0.39
 0.05   -0.25
 0.78    0.00
 0.95   -0.25
 0.95   -1.89

Page 7: Correlation 2 Computations, and the best fitting line

t score plot with best fitting line: linear? YES!

[Figure: scatterplot of Reading Time (t score) against Image quality (t score), both axes running from -2.00 to 2.00, with the best fitting line drawn in.]

Page 8: Correlation 2 Computations, and the best fitting line

Calculate r

tX      tY      tX - tY   (tX - tY)²
-1.48    0.76   -2.24     5.02
-1.19    1.28   -2.47     6.10
-0.07    0.39   -0.46     0.21
 0.05   -0.25    0.30     0.09
 0.78    0.00    0.78     0.61
 0.95   -0.25    1.20     1.44
 0.95   -1.88    2.83     8.01

Σ(tX - tY)² = 21.48

Σ(tX - tY)² / (nP - 1) = 21.48 / (7 - 1) = 3.580

r = 1 - (1/2 * 3.580) = 1 - 1.79 = -0.790
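The same computation of r can be sketched in Python. The function name `correlation_from_t_scores` is an assumption made for illustration; `mean` and `stdev` come from Python's standard `statistics` module, and `stdev` divides by n - 1 as the slides do. Without two-decimal rounding of the t scores, the result comes out near -0.78 rather than the -0.790 obtained by hand above.

```python
from statistics import mean, stdev

def correlation_from_t_scores(xs, ys):
    """r = 1 - 1/2 * [sum of (tX - tY)^2 / (nP - 1)], as on the slide."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)          # sample standard deviations (n - 1)
    t_x = [(x - x_bar) / s_x for x in xs]
    t_y = [(y - y_bar) / s_y for y in ys]
    n_pairs = len(xs)                        # nP: number of X, Y pairs
    avg_sq_diff = sum((a - b) ** 2 for a, b in zip(t_x, t_y)) / (n_pairs - 1)
    return 1 - 0.5 * avg_sq_diff

quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

r = correlation_from_t_scores(quality, reading_time)
print(round(r, 3))
```

The sign is negative, which matches the experimental hypothesis: lower image quality goes with longer reading time.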

Page 9: Correlation 2 Computations, and the best fitting line

Best fitting line

Page 10: Correlation 2 Computations, and the best fitting line

The definition of the best fitting line plotted on t axes

• A “best fitting line” minimizes the average squared vertical distance of Y scores in the sample (expressed as tY scores) from the line.

• The best fitting line is a least squares, unbiased estimate of values of Y in the sample.

• The generic formula for a line is Y=mx+b where m is the slope and b is the Y intercept.

• Thus, any specific line, such as the best fitting line, can be defined by its slope and its intercept.

Page 11: Correlation 2 Computations, and the best fitting line

The intercept of the best fitting line plotted on t axes

The origin is the point where both tX and tY=0.000

• So the origin represents the mean of both the X and Y variable

• When plotted on t axes all best fitting lines go through the origin.

• Thus, the tY intercept of the best fitting line = 0.000

Page 12: Correlation 2 Computations, and the best fitting line

The slope of and formula for the best fitting line

• When plotted on t axes the slope of the best fitting line = r, the correlation coefficient.

• To define a line we need its slope and Y intercept

• r = the slope and tY intercept = 0.00

• The formula for the best fitting line is therefore tY = r(tX) + 0.00, or tY = r(tX)
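As a rough check of the claim that, on t axes, the least squares line has slope r and intercept 0, the sketch below fits an ordinary least squares line to the t scores of the image quality data. The helper names `t_scores` and `least_squares_line` are hypothetical, chosen only for this example.

```python
from statistics import mean, stdev

def t_scores(values):
    """(score - mean) / s for every score, with s based on n - 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def least_squares_line(xs, ys):
    """Slope and intercept minimizing the squared vertical distances from the line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

slope, intercept = least_squares_line(t_scores(quality), t_scores(reading_time))
print(round(slope, 3), round(abs(intercept), 3))
```

The fitted slope agrees with the unrounded r for these data (about -0.78), and the intercept is zero up to floating point error, so the line passes through the origin.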

Page 13: Correlation 2 Computations, and the best fitting line

Here’s how a visual representation of the best fitting line (slope = r, Y intercept = 0.000) and the dots representing tX and tY scores might be described. (Whether the correlation is positive or negative doesn’t matter.)

• Perfect - scores fall exactly on a straight line.

• Strong - most scores fall near the line.

• Moderate - some are near the line, some not.

• Weak - the scores are only mildly linear.

• Independent - the scores are not linear at all.

Page 14: Correlation 2 Computations, and the best fitting line

Strength of a relationship: Perfect

[Figure: scatterplot on t axes, each running from -1.5 to 1.5.]

Page 15: Correlation 2 Computations, and the best fitting line

Strength of a relationship: Strong, r about .800

[Figure: scatterplot on t axes, each running from -1.5 to 1.5.]

Page 16: Correlation 2 Computations, and the best fitting line

Strength of a relationship: Moderate, r about .500

[Figure: scatterplot on t axes, each running from -1.5 to 1.5.]

Page 17: Correlation 2 Computations, and the best fitting line

Strength of a relationship: Independent, r about 0.000

[Figure: scatterplot on t axes, each running from -1.5 to 1.5.]

Page 18: Correlation 2 Computations, and the best fitting line

r = .800, the formula for the best fitting line = ???

[Figure: scatterplot on t axes, each running from -1.5 to 1.5.]

Page 19: Correlation 2 Computations, and the best fitting line

r = -.800, the formula for the best fitting line = ???

[Figure: scatterplot on t axes, each running from -1.5 to 1.5.]

Page 20: Correlation 2 Computations, and the best fitting line

r = 0.000, the formula for the best fitting line is:

[Figure: scatterplot on t axes, each running from -1.5 to 1.5.]

Page 21: Correlation 2 Computations, and the best fitting line

Notice what that formula for independent variables says

• tY = rtX = 0.000 (tX) = 0.000

• When tY = 0.000, you are at the mean of Y

• So, when variables are independent, the best fitting line says that the best estimate of Y scores in the sample is back to the mean of Y regardless of your score on X

• Thus, when variables are independent we go back to saying everyone will score right at the mean
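The point can be illustrated with a short sketch that applies tY = r(tX) and then converts the estimated tY back to raw units by inverting the t score definition (Y = Y-bar + tY * sY). The function name `estimate_y` is an assumption for illustration; the means and standard deviations are the rounded values from the image quality example.

```python
def estimate_y(x, r, x_bar, s_x, y_bar, s_y):
    """Estimate Y from X with the best fitting line tY = r * tX."""
    t_x = (x - x_bar) / s_x      # express X as a t score
    t_y = r * t_x                # best fitting line on t axes
    return y_bar + t_y * s_y     # convert the estimated tY back to raw units

# Rounded summary values from the slides: X-bar, sX, Y-bar, sY.
X_BAR, S_X, Y_BAR, S_Y = 5.61, 0.89, 7.50, 0.79

# With r = 0.000 every estimate is the mean of Y, whatever the X score.
for x in (4.30, 5.65, 6.45):
    print(x, estimate_y(x, 0.0, X_BAR, S_X, Y_BAR, S_Y))

# With a nonzero r the estimates move away from the mean as X does.
print(estimate_y(4.30, -0.79, X_BAR, S_X, Y_BAR, S_Y))
```

With r = 0.000 the term r * tX vanishes, so the estimate is 7.50 seconds for every reader; only a nonzero r lets the X score shift the estimate.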

Page 22: Correlation 2 Computations, and the best fitting line

A note of caution: Watch out for the plot for which the best fitting line is a curve.

1.5

-1.5

1.0

0.5

0

-0.5

-1.0

1.5 -1.5 1.0 0.5 0 -0.5 -1.0

Page 23: Correlation 2 Computations, and the best fitting line

Confidence intervals around rhoT – relation to Chapter 6

• In Chapter 6 we learned to create confidence intervals around muT that allowed us to test a theory.

• To test our theory about mu we took a random sample, computed the sample mean and standard deviation, and determined whether the sample mean fell into that interval.

• If it did not, we had shown the theory that led us to predict muT was false.

• We then discarded the theory and muT and used the sample mean as our best estimate of the true population mean.

Page 24: Correlation 2 Computations, and the best fitting line

If we discard muT, what do we use as our best estimate of mu?

• Generally, our best estimate of a population parameter is the sample statistic that estimates it.

• Our best estimate of mu has been and is the sample mean, X-bar.

• Since we have discarded our theory, we go back to using X-bar as our best (least squares, unbiased, consistent) estimate of mu.

Page 25: Correlation 2 Computations, and the best fitting line

More generally, we can test a theory (hypothesis) about any population parameter using a similar confidence interval.

• We theorize about what the value of the population parameter is.

• We get an estimate of the variability of the parameter.

• We construct a confidence interval (usually a 95% confidence interval) in which our hypothesis says that the sample statistic should fall.

• We obtain a random sample and determine whether the sample statistic falls inside or outside our confidence interval.

Page 26: Correlation 2 Computations, and the best fitting line

The sample statistic will fall inside or outside of the CI.95

• If the sample statistic falls inside the confidence interval, our theory has received some support and we hold on to it.

• But the more interesting case is when the sample statistic falls outside the confidence interval.

• Then we must discard the theory and the theory based estimate of the population parameter.

• In that case, our best estimate of the population parameter is the sample statistic

• Remember, the sample statistic is a least squares, unbiased, consistent estimate of its population parameter.

Page 27: Correlation 2 Computations, and the best fitting line

We are going to do the same thing with a theory about rho

• rho is the correlation coefficient for the population.

• If we have a theory about rho, we can create a 95% confidence interval into which we expect r will fall.

• An r computed from a random sample will then fall inside or outside the confidence interval.

Page 28: Correlation 2 Computations, and the best fitting line

When r falls inside or outside of the CI.95 around rhoT

• If r falls inside the confidence interval, our theory about rho has received some support and we hold on to it.

• But the more interesting case is when r falls outside the confidence interval.

• Then we must discard the theory and the theory based estimate of the population parameter.

• In that case, our best estimate of rho is the r we found in our random sample

• Thus, when r falls outside the CI.95 we can go back to using it as a least squares unbiased estimate of rho.

Page 29: Correlation 2 Computations, and the best fitting line

Chapter 7 slides end here

Rest of slides are for other chapters and should not be reviewed here.

RK – 10/24

Page 30: Correlation 2 Computations, and the best fitting line

Why is it so important to determine whether r fits a theory?

• In Chapter 8 we go on to predict values of Y from values of X and r.

• The formula we use is called the regression equation; it is very much like the formula for the best fitting line.

• The only difference is that the best fitting line describes the relationship among the Y scores in the sample.

• But in Chapter 8 we move to predicting scores for people who are in the population from which the sample was drawn, but not in the sample.

Page 31: Correlation 2 Computations, and the best fitting line

That’s dangerous.

Let me give you an example.

Page 32: Correlation 2 Computations, and the best fitting line

Assume you are the personnel officer for a mid-size company.

• You need to hire a typist.

• There are 2 applicants for the job.

• You give the applicants a typing test.

• Which would you hire: someone who types 6 words a minute with 12 mistakes or someone who types 100 words a minute with 1 mistake?

Page 33: Correlation 2 Computations, and the best fitting line

Who would you hire?

• Of course, you would predict that the second person will be a better typist and hire that person.

• Notice that we never gave the person with 6 words/minute a chance to be a typist in our firm.

• We prejudged her on the basis of the typing test.

• That is probably valid in this case: a typing test probably predicts fairly well how good a typist someone will be.

Page 34: Correlation 2 Computations, and the best fitting line

But say the situation is a little more complicated!

• You have several applicants for a leadership position in your firm.

• But it is not 2002, it is 1957, when we knew that only white males were capable of leadership in corporate America.

• That is, we all “know” that leadership ability is correlated with both gender and skin color: white and male are associated with high leadership ability, and darker skin color and female gender with lower leadership ability.

• We now know this is absurd, but lots of people were never

Page 35: Correlation 2 Computations, and the best fitting line

Confidence intervals around muT

Page 36: Correlation 2 Computations, and the best fitting line

Confidence intervals and hypothetical means

• We frequently have a theory about what the mean of a distribution should be.

• To be scientific, that theory about mu must be able to be proved wrong (falsified).

• One way to test a theory about a mean is to state a range where sample means should fall if the theory is correct.

• We usually state that range as a 95% confidence interval.

Page 37: Correlation 2 Computations, and the best fitting line

• To test our theory, we take a random sample from the appropriate population and see if the sample mean falls where the theory says it should, inside the confidence interval.

• If the sample mean falls outside the 95% confidence interval established by the theory, the evidence suggests that our theoretical population mean and the theory that led to its prediction are wrong.

• When that happens our theory has been falsified. We must discard it and look for an alternative explanation of our data.

Page 38: Correlation 2 Computations, and the best fitting line

For example:

• Let’s say that we had a new antidepressant drug we wanted to peddle. Before we can do that we must show that the drug is safe.

• Drugs like ours can cause problems with body temperature. People can get chills or fever.

• We want to show that body temperature is not affected by our new drug.

Page 39: Correlation 2 Computations, and the best fitting line

Testing a theory

• “Everyone” knows that normal body temperature for healthy adults is 98.6°F.

• Therefore, it would be nice if we could show that after taking our drug, healthy adults still had an average body temperature of 98.6°F.

• So we might test a sample of 16 healthy adults, first giving them a standard dose of our drug and, when enough time had passed, taking their temperature to see whether it was 98.6°F on the average.

Page 40: Correlation 2 Computations, and the best fitting line

Testing a theory - 2

• Of course, even if we are right and our drug has no effect on body temperature, we wouldn’t expect a sample mean to be precisely 98.600000…

• We would expect some sampling fluctuation around a population mean of 98.6°F.

• So, if our drug does not cause change in body temperature, the sample mean should be close to 98.6. It should, in fact, be within the 95% confidence interval around muT, 98.6.

• SO WE MUST CONSTRUCT A 95% CONFIDENCE INTERVAL AROUND 98.6° AND SEE WHETHER OUR SAMPLE MEAN FALLS INSIDE OR OUTSIDE THE CI.

Page 41: Correlation 2 Computations, and the best fitting line

To create a confidence interval around muT, we must estimate sigma from a sample.

• We randomly select a group of 16 healthy individuals from the population.

• We administer a standard clinical dose of our new drug for 3 days.

• We carefully measure body temperature.

• RESULTS: We find that the average body temperature in our sample is 99.5°F with an estimated standard deviation of 1.40° (s = 1.40).

• IS 99.5°F IN THE 95% CI AROUND MUT???

Page 42: Correlation 2 Computations, and the best fitting line

Knowing s and n we can easily compute the estimated standard error of the mean.

• Let’s say that s = 1.40° and n = 16:

• sX-bar = s / √n = 1.40 / √16 = 1.40/4.00 = 0.35

• Using this estimated standard error we can construct a 95% confidence interval for the body temperature of a sample of 16 healthy adults.
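The estimated standard error of the mean is a one-line computation. A minimal sketch, with the function name `estimated_standard_error` assumed for illustration:

```python
from math import sqrt

def estimated_standard_error(s, n):
    """Estimated standard error of the mean: s divided by the square root of n."""
    return s / sqrt(n)

# Values from the body temperature example: s = 1.40, n = 16.
se = estimated_standard_error(1.40, 16)
print(round(se, 2))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.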

Page 43: Correlation 2 Computations, and the best fitting line

We learned how to create confidence intervals with the Z distribution in Chapter 4.

95% of sample means will fall in a symmetrical interval around mu that goes from 1.960 standard errors below mu to 1.960 standard errors above mu.

• A way to write that fact in statistical language is:

CI.95: mu ± ZCRIT * sigmaX-bar or

CI.95: mu - ZCRIT * sigmaX-bar < X-bar < mu + ZCRIT * sigmaX-bar

For a 95% CI, ZCRIT = 1.960

Page 44: Correlation 2 Computations, and the best fitting line

But when we must estimate sigma with s, we must use the t distribution to define critical intervals around mu or muT.

Here is how we would write the formulae substituting t for Z and s for sigma

CI.95: muT ± tCRIT * sX-bar or

CI.95: muT - tCRIT * sX-bar < X-bar < muT + tCRIT * sX-bar

Notice that the critical value of t that includes 95% of the sample means changes with the number of degrees of freedom for s, our estimate of sigma, and must be taken from the t table.

If n = 16 in a single sample (k = 1 group), dfW = n - k = 16 - 1 = 15.

Page 45: Correlation 2 Computations, and the best fitting line

df    1       2       3       4       5       6       7       8
.05   12.706  4.303   3.182   2.776   2.571   2.447   2.365   2.306
.01   63.657  9.925   5.841   4.604   4.032   3.707   3.499   3.355

df    9       10      11      12      13      14      15      16
.05   2.262   2.228   2.201   2.179   2.160   2.145   2.131   2.120
.01   3.250   3.169   3.106   3.055   3.012   2.977   2.947   2.921

df    17      18      19      20      21      22      23      24
.05   2.110   2.101   2.093   2.086   2.080   2.074   2.069   2.064
.01   2.898   2.878   2.861   2.845   2.831   2.819   2.807   2.797

df    25      26      27      28      29      30      40      60
.05   2.060   2.056   2.052   2.048   2.045   2.042   2.021   2.000
.01   2.787   2.779   2.771   2.763   2.756   2.750   2.704   2.660

df    100     200     500     1000    2000    10000
.05   1.984   1.972   1.965   1.962   1.961   1.960
.01   2.626   2.601   2.586   2.581   2.578   2.576

Page 46: Correlation 2 Computations, and the best fitting line

So, muT=98.6, tCRIT=2.131, s=1.40, n=16

Here is the confidence interval:

CI.95: muT ± tCRIT * sX-bar = 98.6 ± (2.131)*(1.40/√16) = 98.6 ± (2.131)*(1.40/4)

= 98.6 ± (2.131)(0.35) = 98.60 ± 0.75

CI.95: 97.85 < X-bar < 99.35

Our sample mean fell outside the CI.95 and falsifies the theory that our drug has no effect on body temperature. Our drug may cause a slight fever.
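The whole test can be sketched in a few lines of Python. The function name `confidence_interval` is an assumption for illustration, and the critical value 2.131 is simply copied from the t table for 15 degrees of freedom rather than computed.

```python
from math import sqrt

def confidence_interval(mu_t, t_crit, s, n):
    """CI around a theoretical mean: muT plus or minus tCRIT * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return mu_t - margin, mu_t + margin

# Body temperature example: muT = 98.6, tCRIT = 2.131 (df = 15), s = 1.40, n = 16.
lower, upper = confidence_interval(98.6, 2.131, 1.40, 16)

sample_mean = 99.5
inside = lower <= sample_mean <= upper

print(round(lower, 2), round(upper, 2), inside)
```

Here the sample mean of 99.5 lies above the upper limit, so `inside` is false and the theory that the drug leaves body temperature unchanged is discarded.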

Page 47: Correlation 2 Computations, and the best fitting line