major points

Post on 05-Jan-2016



Major Points

• Scatterplots

• The correlation coefficient

  – Correlations on ranks

• Factors affecting correlations

• Testing for significance

• Intercorrelation matrices

• Other kinds of correlations

The Problem

• Are two variables related?

  – Does one increase as the other increases? e.g. skills and income

  – Does one decrease as the other increases? e.g. health problems and nutrition

• How can we get a graphical representation of the degree of relationship?

Relation between fathers' and sons' heights: Pearson (1896)

Reliability


Another dataset: Heart Disease and Cigarettes

• Landwehr & Watkins report data on heart disease and cigarette smoking in 21 developed countries

• Data have been rounded for computational convenience. The results were not affected.

Scatterplot of Heart Disease

• CHD Mortality goes on y axis

• Cigarette consumption on x axis

• What does each dot represent?

• Best fitting line included for clarity

[Figure: scatterplot of CHD Mortality per 10,000 (y axis, 0 to 30) against Cigarette Consumption per Adult per Day (x axis, 2 to 12), with the point {X = 6, Y = 11} marked]

What Does the Scatterplot Show?

• As smoking increases, so does coronary heart disease mortality.

• Relationship looks strong

• Not all data points on line.

  – This gives us “residuals” or “errors of prediction”
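To make "errors of prediction" concrete, here is a minimal sketch that fits a line to the 21 cigarette/CHD pairs tabulated in these slides and computes the residuals. It assumes the "best fitting line" is the ordinary least-squares line, which the slides do not state; `fit_line` and the variable names are illustrative, not taken from the deck.

```python
# Cigarette consumption (per adult per day) and CHD mortality (per 10,000)
# for the 21 countries, copied from the data slide of this deck.
cig = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x (assumed method)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_squares_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum_products / sum_squares_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(cig, chd)

# A residual is the observed value minus the value predicted by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(cig, chd)]

# The point marked on the scatterplot: X = 6, Y = 11.
residual_at_marked_point = 11 - (intercept + slope * 6)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"residual for the point X = 6, Y = 11: {residual_at_marked_point:.2f}")
```

A negative residual means the country's mortality is lower than the line predicts for its consumption level.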

Example Scatterplots

[Figure: two example scatterplots of y against x, one labeled "High correlation" and one labeled "Low correlation"]

Scatter plots:

[Figures: a series of example scatterplots with r = .00, r = .15, r = .40, r = .81, r = .99, and r = -.79]

Guessing correlations: from Rice University


Another way to visualize a correlation

[Figure: two diagrams, each showing the variance in A, the variance in B, and their covariance]

What is a Correlation Coefficient?

• A measure of degree of relationship.

• Sign refers to direction.

• Based on covariance

• Measure of degree to which large scores go with large scores, and small scores with small scores

• Pearson’s correlation coefficient is most often used

The Data

Cigarette Consumption and Coronary Heart Disease Mortality for 21 Countries

Cig.  11   9   9   9   8   8   8   6   6   5   5
CHD   26  21  24  21  19  13  19  11  23  15  13

Cig.   5   5   5   5   4   4   4   3   3   3
CHD    4  18  12   3  11  15   6  13   4  14

Cig. = Cigarettes per adult per day

CHD = Coronary Heart Disease Mortality per 10,000 population

Surprisingly, the U.S. is the first country on the list: the country with the highest consumption and highest mortality.


Covariance

• The formula

$$\mathrm{cov}_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$

• Index of degree to which both lists of numbers covary

• When would covXY be large and positive?

• When would covXY be large and negative?
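As a rough illustration of the covariance formula, the sketch below applies it to the 21 cigarette/CHD pairs from the data slide. The function name `covariance` is illustrative only. Reversing one variable's sign shows when the covariance turns negative: large scores on one list then go with small scores on the other.

```python
# Data copied from the data slide: cigarettes per adult per day and
# CHD mortality per 10,000 for 21 countries.
cig = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations, divided by N - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross_products / (n - 1)


cov_xy = covariance(cig, chd)

# Flipping the direction of one variable flips the sign of the covariance.
cov_reversed = covariance(cig, [-value for value in chd])

print(f"cov_XY = {cov_xy:.2f}")
print(f"cov with CHD reversed = {cov_reversed:.2f}")
```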

Calculation

• CovXY = 11.13

• sX = 2.33

• sY = 6.69

$$r = \frac{\mathrm{cov}_{XY}}{SD_X \, SD_Y} = \frac{11.13}{(2.33)(6.69)} = \frac{11.13}{15.59} = .71$$
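The calculation can be checked with a short sketch using the data from the data slide. It uses `statistics.stdev` from the Python standard library, which computes the sample standard deviation with N - 1 in the denominator, matching the covariance formula; the other names are illustrative.

```python
from statistics import mean, stdev

# Data copied from the data slide.
cig = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cig)
mean_x, mean_y = mean(cig), mean(chd)

# Covariance with N - 1 in the denominator.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cig, chd)) / (n - 1)

# Sample standard deviations (also N - 1).
sd_x = stdev(cig)
sd_y = stdev(chd)

# Pearson r is the covariance divided by the product of the standard deviations.
r = cov_xy / (sd_x * sd_y)

print(f"cov_XY = {cov_xy:.2f}, s_X = {sd_x:.2f}, s_Y = {sd_y:.2f}, r = {r:.2f}")
```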

Correlation Coefficient

• Symbolized by r

• Covariance ÷ (product of st. dev.)

$$r = \frac{\mathrm{cov}_{XY}}{SD_X \, SD_Y}$$

Correlation in a random sample

Generated 6 sets of random numbers (100 each)

The correlation matrix

        var1    var2    var3    var4    var5    var6
var1    1.00
var2   -0.02    1.00
var3    0.08   -0.03    1.00
var4   -0.10   -0.15    0.02    1.00
var5    0.03    0.01   -0.11    0.22    1.00
var6    0.00    0.19   -0.03    0.13   -0.15    1.00
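A similar experiment can be sketched in a few lines. This is not the procedure used to produce the matrix above: the uniform generator and the seed are assumptions, so the individual values will differ. The point it illustrates is the same, namely that unrelated variables still show small nonzero sample correlations.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / sqrt(ss_x * ss_y)


rng = random.Random(0)  # fixed seed (arbitrary choice) so the run is repeatable

# Six independent sets of 100 uniform random numbers.
variables = [[rng.random() for _ in range(100)] for _ in range(6)]

# Full 6 x 6 correlation matrix.
matrix = [[pearson_r(a, b) for b in variables] for a in variables]

# Print the lower triangle, laid out like the matrix in the slide.
for i, row in enumerate(matrix):
    cells = " ".join(f"{value:6.2f}" for value in row[: i + 1])
    print(f"var{i + 1} {cells}")
```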

Factors Affecting r

• Range restrictions

• Outliers

• Nonlinearity

  – e.g. anxiety and performance

• Heterogeneous subsamples

  – Everyday examples

The effect of outliers on correlations

Dataset: 20 cases selected from darts and pros

[Figure: scatterplot of Pros (y axis, -40 to 80) against DARTS (x axis, -40 to 80); r = .80]

Dataset: one case altered to give more extreme values

[Figure: scatterplot of Pros (y axis, -40 to 80) against DARTS (x axis, -40 to 80) with one case altered; r = .58]

Summary of effect of outliers

• A few extreme values can have extreme effects

• Especially when sample size is small

• You cannot randomly toss out data! You need to have a theoretical or statistical justification
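The darts and pros data are not reproduced in these slides, so the sketch below shows the same effect on the cigarette/CHD data from the data slide instead. The altered case is hypothetical: the first country's CHD value is changed from 26 to 2 purely for illustration.

```python
from math import sqrt

# Data copied from the data slide.
cig = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / sqrt(ss_x * ss_y)


r_original = pearson_r(cig, chd)

# HYPOTHETICAL alteration: one case out of 21 is given an extreme value.
chd_altered = chd[:]
chd_altered[0] = 2  # was 26; invented value, not real data

r_altered = pearson_r(cig, chd_altered)

print(f"r with original data: {r_original:.2f}")
print(f"r with one case altered: {r_altered:.2f}")
```

Changing a single case out of 21 is enough to cut the correlation substantially, which is why extreme values deserve inspection before interpretation.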

Restriction of range: Countries With Low Consumptions

[Figure: "Data With Restricted Range, Truncated at 5 Cigarettes Per Day": scatterplot of CHD Mortality per 10,000 (y axis, 2 to 20) against Cigarette Consumption per Adult per Day (x axis, 2.5 to 5.5)]
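The effect of truncation can be sketched with the data from the data slide by keeping only the countries at or below 5 cigarettes per adult per day. Treating "truncated at 5" as "less than or equal to 5" is an assumption based on the figure's x axis.

```python
from math import sqrt

# Data copied from the data slide.
cig = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / sqrt(ss_x * ss_y)


# Keep only low-consumption countries (assumed cutoff: cig <= 5).
restricted = [(x, y) for x, y in zip(cig, chd) if x <= 5]
cig_low = [x for x, _ in restricted]
chd_low = [y for _, y in restricted]

r_full = pearson_r(cig, chd)
r_restricted = pearson_r(cig_low, chd_low)

print(f"all {len(cig)} countries: r = {r_full:.2f}")
print(f"{len(cig_low)} low-consumption countries: r = {r_restricted:.2f}")
```

With the range of X cut down, the strong relationship seen across all countries nearly vanishes in the subsample.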

r between grades in high school and grades in college.

Scatter plot for 250 students who vary on High School GPA

Scatter plot for students who have GPA equal to or greater than 3.5

Effect of linear transformations of data

• Linear transformations of the data have no effect on Pearson's correlation coefficient.

• Example: r between height and weight is the same regardless of whether height is measured in inches, feet, centimeters or even miles.

• This is a very desirable property since choice of measurement scales that are linear transformations of each other is often arbitrary.

An example:

• Scores on the Scholastic Aptitude Test (SAT) range from 200-800.

•200 to 800 is an arbitrary range.

•You could subtract 100 points from each score and multiply each score by 3. Scores on the SAT would then range from 300-2100. Test would remain the same.

•r between SAT and some other variable (such as college grade point average) would not be affected by this linear transformation.
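The invariance is easy to check numerically. No SAT data appear in these slides, so this sketch applies the same kind of transformation (subtract 100, then multiply by 3) to the cigarette variable from the data slide; the transformed scale has no meaning and is used only to show that r is unchanged.

```python
from math import sqrt

# Data copied from the data slide.
cig = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / sqrt(ss_x * ss_y)


# Linear transformation modeled on the SAT example: subtract 100, multiply by 3.
cig_transformed = [(x - 100) * 3 for x in cig]

r_original = pearson_r(cig, chd)
r_transformed = pearson_r(cig_transformed, chd)

# Multiplying by a negative constant reverses the scale, so only the sign flips.
r_reversed = pearson_r([-x for x in cig], chd)

print(f"original: {r_original:.4f}")
print(f"transformed: {r_transformed:.4f}")
print(f"reversed scale: {r_reversed:.4f}")
```

The last line is a small caveat to the slide's claim: the size of r is always preserved, but a negative multiplier reverses its sign.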

Non linear relationships

Example: Anxiety and Performance

[Figure: scatterplot for the anxiety and performance example (x axis from -2.5 to 2, y axis from 0 to 16); r = .07]
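A small sketch shows why r can be near zero even when two variables are strongly related. The numbers below are invented for illustration (an exact inverted-U, y = 9 - x^2) and are not the anxiety and performance data from the figure.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / sqrt(ss_x * ss_y)


# HYPOTHETICAL data: y is completely determined by x, but the relationship
# is an inverted U, rising up to x = 0 and falling afterwards.
x_values = [-3, -2, -1, 0, 1, 2, 3]
y_values = [9 - x ** 2 for x in x_values]

r_curved = pearson_r(x_values, y_values)

print(f"r for a perfect inverted-U relationship: {r_curved:.2f}")
```

Pearson's r measures only the linear part of a relationship, so a scatterplot is needed to catch cases like this one.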

The interpretation of a correlation coefficient

• Ranges from –1 to 1

• No correlation in the data means you will get an r of 0 or near it

• Suffers from sampling error (like everything else!). So you need to estimate the true population correlation from the sample correlation.

• Correlations in the sample differ from the correlations in the population by some amount (sampling error)

• Sometimes the sample correlation is higher than the population correlation, sometimes it is lower; it is rarely exactly on target.

• How do you know when to accept and when to reject correlation?
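Sampling error can be made visible with a small simulation. Everything in this sketch is an assumption chosen for illustration: a population correlation of .50, normally distributed scores, samples of 21 cases, and 200 repeated samples.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / sqrt(ss_x * ss_y)


def sample_r(rho, n, rng):
    """Draw n (x, y) pairs from a population with correlation rho; return the sample r."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        # Mixing x with independent noise gives y a correlation of rho with x.
        ys.append(rho * x + sqrt(1 - rho ** 2) * noise)
    return pearson_r(xs, ys)


RHO = 0.5  # assumed population correlation, for illustration only
rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable

sample_rs = [sample_r(RHO, 21, rng) for _ in range(200)]
mean_r = sum(sample_rs) / len(sample_rs)

print(f"population correlation: {RHO}")
print(f"lowest sample r: {min(sample_rs):.2f}")
print(f"highest sample r: {max(sample_rs):.2f}")
print(f"mean sample r: {mean_r:.2f}")
```

Individual samples land on both sides of the population value, which is why a single sample r needs a significance test before it is interpreted.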

Possible ways to decide

• Accept it if it fits your hypothesis, reject it otherwise!

• Toss a coin

• Democratically: Ask your officemates to vote.

Fisherian Statistics: Null and Alternative Hypothesis

• Sampling error implies that sometimes the results we obtain will be due to chance (since not every sample will accurately resemble the population)

• The null hypothesis expresses the idea that an observed difference is due to chance.

• For example: There is no difference between the norms regarding the use of email and voice mail

• The alternative hypothesis (the experimental hypothesis) is often the one that you formulate: there is a correlation between people’s perception of a website’s reliability and the probability of their buying something on the site

• Why bother to have a null hypothesis?

  – Can you reject the null hypothesis?

The alternative hypothesis

An Example

• Relationship between browsing and buying on an electronic commerce site

• Data gathered from server logs

• Hypothesis: Those who browse longer also tend to purchase

• Hypothesis can be framed in another way: There is no relationship between time spent browsing and likelihood of purchase (Null Hypothesis)

Testing the significance of r

• Population parameter = ρ

• Null hypothesis H0: ρ = 0

What would a true null mean here? What would a false null mean here?

• Alternative hypothesis (H1)
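One way to see what a true null hypothesis would look like is a permutation test, sketched below on the cigarette/CHD data from the data slide. This is a swapped-in technique, not the table-based test these slides use: shuffling the CHD values destroys any real pairing, so the shuffled correlations show what chance alone produces. The number of shuffles and the seed are arbitrary choices.

```python
import random
from math import sqrt

# Data copied from the data slide.
cig = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / sqrt(ss_x * ss_y)


observed_r = pearson_r(cig, chd)

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
n_shuffles = 5000
shuffled = chd[:]
at_least_as_extreme = 0

for _ in range(n_shuffles):
    rng.shuffle(shuffled)  # break the pairing between cig and chd
    # Two-tailed: count shuffles whose |r| matches or exceeds the observed |r|.
    if abs(pearson_r(cig, shuffled)) >= abs(observed_r):
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_shuffles

print(f"observed r = {observed_r:.2f}")
print(f"proportion of shuffles at least as extreme: {p_value:.4f}")
```

If the null hypothesis were true, correlations as large as the observed one would turn up often among the shuffles; a very small proportion is evidence against it.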

Tables of Significance

• Table in Appendix E.2

• For N - 2 = 19 df, rcrit = .433

• Our correlation (r = .71) > .433

• Reject H0

  – Correlation is significant.

  – More cigarette consumption associated with more CHD mortality.
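The tabled critical value can be reproduced from the t distribution, since testing r against zero is equivalent to a t test on N - 2 degrees of freedom. The sketch assumes the table is for a two-tailed test at the .05 level, which the slide does not state; the critical t of 2.093 for 19 df is taken from a standard t table.

```python
from math import sqrt

df = 19  # N - 2, with N = 21 countries

# ASSUMPTION: two-tailed test at alpha = .05; 2.093 is the standard
# tabled critical t for 19 degrees of freedom at that level.
t_crit = 2.093

# Convert the critical t into a critical r.
r_crit = t_crit / sqrt(t_crit ** 2 + df)

# Convert the observed r (from the Calculation slide) into a t statistic.
r_obs = 0.71
t_obs = r_obs * sqrt(df) / sqrt(1 - r_obs ** 2)

reject_null = abs(t_obs) > t_crit

print(f"critical r for {df} df: {r_crit:.3f}")
print(f"observed t: {t_obs:.2f}")
print(f"reject H0: {reject_null}")
```

Comparing r with the critical r and comparing t with the critical t are the same decision expressed on two scales.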

SPSS Printout

• SPSS Printout gives test of significance.

  – Double asterisks with footnote indicate p < .01.

SPSS Printout

Correlations

                                  LIFEEXP    EXPEND
LIFEEXP   Pearson Correlation      1.000      .138
          Sig. (2-tailed)           .         .530
          N                         23         23
EXPEND    Pearson Correlation      .138      1.000
          Sig. (2-tailed)          .530        .
          N                         23         23

[Figure: scatterplot of EXPEND (y axis, 200 to 1600) against LIFEEXP (x axis, 66 to 74), with points labeled by country: Sweden, United_S, France, Norway, Canada, Denmark, Australi, Netherla, Germany, Switzerl, Finland, Italy, Belgium, New_Zeal, Austria, Eng-Wale, Ireland, Luxembou, Japan, Spain, Iceland, Portugal, Greece]

SPSS printout for scatterplot

[Figure: a matrix of scatterplots for OPTIM, RELINFL, RELINV, and RELHOPE]

           OPTIM     RELINFL   RELINV    RELHOPE
OPTIM      1.000     .272**    .167**    .266**
RELINFL    .272**    1.000     .449**    .419**
RELINV     .167**    .449**    1.000     .544**
RELHOPE    .266**    .419**    .544**    1.000

**. Correlation is significant at the 0.01 level (2-tailed).

A review of Scatterplots

Next three slides:

• Infant mortality and number of physicians

• Life expectancy and health care expenditures

• Cancer rate and solar radiation

Figure 9.1: Infant Mortality and Number of Physicians

[Figure: scatterplot of Infant Mortality (y axis, -6 to 10) against Physicians per 100,000 Population (x axis, 10 to 20)]

Figure 9.2: Life Expectancy and Health Care Costs

[Figure: scatterplot of Life Expectancy (Males) (y axis, 66 to 74) against Health Care Expenditures (x axis, 200 to 1600)]

Figure 9.3: Cancer Rate and Solar Radiation

[Figure: scatterplot of Breast Cancer Rate (y axis, 20 to 34) against Solar Radiation (x axis, 200 to 600)]
