Unit 4: Correlation and Causation

Upload: joleen-blair

Post on 18-Jan-2016


• Now, a single datum is a pair of values (one for each variable)

• Are variables related (associated)? – i.e., if one changes, is the other likely to change?


Statistical cliché: Association does not imply causation

• Ex: Sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headaches.
  – (Or, more likely, a common cause is drunkenness.)

• Ex: Study at U of PA Med Ctr, 1999: Young children who sleep with the light on are much more likely to develop myopia in later life.
  – (Later researchers found a common cause: parents’ myopia.)


• From Smithsonian Magazine, Aug ’98:

• The Vermont Back Research Institute at the Univ of Vermont uses the “Vermont Disability Prediction Questionnaire” to predict whether a back problem will become disabling. Items include:
  – How many times have you visited a medical doctor in the past for back problems?
  – How many times have you been married?
  – How well do you get along with your coworkers?

• Why do they ask these questions? Dr. Roland Hazard shrugs: “We don’t know.” It’s just that answers to such questions have proved predictive of whether back problems will become disabling.

• I.e., they are related, but we don’t know how.


Which kinds of variables?

• Both categorical: compare percentages
  – Ex: gender vs. physical activity (S ’06)

• Input variable categorical: compare avgs
  – Ex: digital ratios

• Both numerical
  – scatterplot (“correlation” and “regression”)
  – Ex: babyboom


Are these associations positive or negative? weak or strong?


Correlation (coefficient) r

• Gives a measure of how closely points follow a straight line

• Always between -1 and 1
  – r = 1: all pts on a line with + slope
  – r = -1: all pts on a line with – slope
  – r near 0: blob

• [Formula: turn x- and y-values into z-scores, multiply for each point, find avg product]

• History: Invented by Karl Pearson (1857-1936)
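The z-score recipe in the bracketed formula can be sketched in a few lines of Python. This is only an illustration: the function name `correlation` and the five data points are made up for the example, and the SDs are taken as population SDs (divide by n), so that the plain average of the z-score products comes out as r.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Correlation r as the average product of z-scores (population SDs)."""
    avg_x, avg_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    # Step 1: turn each x- and y-value into a z-score.
    zx = [(x - avg_x) / sd_x for x in xs]
    zy = [(y - avg_y) / sd_y for y in ys]
    # Step 2: multiply the z-scores point by point.
    products = [a * b for a, b in zip(zx, zy)]
    # Step 3: average the products.
    return mean(products)

# Made-up data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]
r = correlation(xs, ys)
print(round(r, 3))

# Points exactly on a line give r = 1 (+ slope) or r = -1 (- slope).
r_up = correlation(xs, [2 * x + 1 for x in xs])
r_down = correlation(xs, [10 - 3 * x for x in xs])
```

For these made-up points the products sum to 4 and their average is 0.8, so the points follow a rising line fairly closely but not perfectly.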


Estimate the correlations:


“SD-line” [FPP only]

• Okay, r measures how closely data follows a line. Which line?
  – through “point of averages” $(\bar{x}, \bar{y})$
  – slope: $\pm\,\sigma_y / \sigma_x$, where sign is + if r > 0, - if r < 0

• Ex: Baldness study: number of hairs (in 10,000s) avg 40, σ = 15; ages avg 36, σ = 20; r = -.3. If hair is on the vertical axis, SD-line?

• Ex: Scores on first exam avg 75, σ = 15; on final exam avg 110, σ = 35; r = .5. SD-line?
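One way to check answers to the two SD-line exercises is the short Python sketch below. The helper `sd_line` is a name invented here, and the second exercise assumes the first exam goes on the horizontal axis, which the slide does not state. The case r = 0 is not covered by the slide's sign rule, so the helper rejects it.

```python
def sd_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the SD-line.

    The line passes through the point of averages (avg_x, avg_y) and has
    slope +sd_y/sd_x if r > 0, -sd_y/sd_x if r < 0.
    """
    if r == 0:
        raise ValueError("sign of the SD-line slope is undefined for r = 0")
    sign = 1 if r > 0 else -1
    slope = sign * sd_y / sd_x
    intercept = avg_y - slope * avg_x
    return slope, intercept

# Baldness study: age on the horizontal axis, hair (in 10,000s) on the vertical.
hair_slope, hair_intercept = sd_line(avg_x=36, sd_x=20, avg_y=40, sd_y=15, r=-0.3)

# Exams: first exam horizontal (an assumption), final exam vertical.
exam_slope, exam_intercept = sd_line(avg_x=75, sd_x=15, avg_y=110, sd_y=35, r=0.5)

print(f"hair  = {hair_intercept:.1f} + ({hair_slope:.2f}) * age")
print(f"final = {exam_intercept:.1f} + ({exam_slope:.2f}) * first")
```

Note that the size of r never enters the slope, only its sign: the baldness line falls 15 (in 10,000s of hairs) for every 20 years of age, and the exam line rises 35 final-exam points for every 15 first-exam points.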


Sketching in the SD-lines


“Covariance formula” for r (FPP p.134)

$$ r = \frac{(\text{avg of } xy) - \bar{x}\,\bar{y}}{\sigma_x \sigma_y} $$

Numerator is the “covariance of x and y”
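As a check that the covariance form (average of the products xy, minus the product of the averages, divided by the product of the SDs) agrees with the z-score recipe, here is a small Python sketch. The function names and the data are invented for illustration, and population SDs (divide by n) are assumed in both versions.

```python
from statistics import mean, pstdev

def r_from_z_scores(xs, ys):
    """r as the average product of z-scores."""
    avg_x, avg_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return mean([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                 for x, y in zip(xs, ys)])

def r_from_covariance(xs, ys):
    """r as covariance divided by the product of the SDs."""
    avg_of_xy = mean([x * y for x, y in zip(xs, ys)])
    covariance = avg_of_xy - mean(xs) * mean(ys)
    return covariance / (pstdev(xs) * pstdev(ys))

# Made-up data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]
r_z = r_from_z_scores(xs, ys)
r_cov = r_from_covariance(xs, ys)
print(round(r_z, 3), round(r_cov, 3))
```

Here the average of xy is 10.6, the product of the averages is 9, so the covariance is 1.6. Both SDs are the square root of 2, giving r = 1.6 / 2 = 0.8 either way.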


Remarks on r

• r is
  – a pure number (no units)
  – not affected by
    • reversing variables
    • linear changes of variables [changes of units, like ft to m]
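These two invariance properties can be tried out numerically. The sketch below uses made-up heights and weights and an illustrative `correlation` helper (population SDs assumed). It also shows one caveat that follows from the formula but is not on the slide: a linear change with a negative multiplier flips the sign of r, while changes of units (positive multipliers) leave r alone.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r as the average product of z-scores (population SDs)."""
    avg_x, avg_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return mean([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                 for x, y in zip(xs, ys)])

# Made-up data for illustration only.
heights_ft = [5.0, 5.4, 5.8, 6.0, 6.3]
weights_lb = [120, 150, 145, 180, 175]

r_original = correlation(heights_ft, weights_lb)

# Reversing the variables: swap which one is x and which is y.
r_reversed = correlation(weights_lb, heights_ft)

# Change of units: feet to meters (multiply by 0.3048).
heights_m = [0.3048 * h for h in heights_ft]
r_meters = correlation(heights_m, weights_lb)

# General linear change with positive multiplier: new = 2.5 * old + 10.
r_shifted = correlation([2.5 * h + 10 for h in heights_ft], weights_lb)

# Negative multiplier: same strength, opposite sign.
r_flipped = correlation([-h for h in heights_ft], weights_lb)

print(round(r_original, 3), round(r_meters, 3), round(r_flipped, 3))
```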


r is affected by …

• nonlinear association

• outliers

• combining different groups, with different centers (Simpson’s Paradox II)

• “ecological correlations”, i.e., correlations of averaged data points

• [examples shortly]
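A quick numerical illustration of the first two items, using invented data and an illustrative `correlation` helper (population SDs assumed): a perfect but curved relationship can have r = 0, and a single far-away point can make a near-shapeless blob look strongly correlated.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r as the average product of z-scores (population SDs)."""
    avg_x, avg_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return mean([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                 for x, y in zip(xs, ys)])

# Nonlinear association: y is completely determined by x (y = x^2),
# yet the points follow no straight line, so r comes out as 0.
xs_curve = [-2, -1, 0, 1, 2]
ys_curve = [x * x for x in xs_curve]
r_curve = correlation(xs_curve, ys_curve)

# Outlier: five points with little linear trend...
xs_blob = [1, 2, 3, 4, 5]
ys_blob = [3, 1, 4, 2, 3]
r_blob = correlation(xs_blob, ys_blob)

# ...plus one extreme point at (20, 20).
r_with_outlier = correlation(xs_blob + [20], ys_blob + [20])

print(round(r_curve, 3), round(r_blob, 3), round(r_with_outlier, 3))
```

In both cases r alone would mislead, which is why a scatterplot should be looked at before r is trusted.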


SAT scores: Average scores from school districts in Cayuga, Madison, and Oswego counties for the 1998-99 school year

(County: C = Cayuga, M = Madison, O = Oswego)

District             County  Overall  Verbal  Math
Auburn               C       1068     525     543
Cato-Meridian        C       1057     528     529
Moravia              C       1064     523     541
Otselic Valley       C       1010     506.4   503.6
Port Byron           C       999      498     501
Southern Cayuga      C       1111     541     570
Brookfield           M       1002     509     493
Canastota            M       1001     494     507
Cazenovia            M       1046     518     528
Chittenango          M       998      491     507
DeRuyter             M       987.7    491.1   496.6
Hamilton             M       1082     534     548
Madison              M       978      496     482
Morrisville-Eaton    M       1035     503     532
Oneida               M       1034     508     526
Stockbridge Valley   M       1000     490     510
A-P-W                O       1039     512.3   526.3
Central Square       O       1023     507     518
Fulton               O       1044     522     522
Hannibal             O       1017     504     513
Mexico               O       1039     519     520
Oswego               O       1059     522     537
Phoenix              O       998      494     504
Pulaski              O       1028     522     502
Sandy Creek          O       1000     490     510

Verbal / Math r = 0.770


County           Overall  Verbal  Math
Cayuga County    1063.9   524.3   539.6
Madison County   1015.8   503.7   512.1
Oswego County    1027.4   510.2   516.9

Ecol r = 0.989
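The ecological correlation quoted for the county averages can be recomputed from the three rows of the county table. The sketch below is illustrative: the `correlation` helper is a name chosen here and population SDs are assumed. Only the three county averages from the slide are used, and the district-level r = 0.770 is simply quoted from the earlier slide for comparison, not recomputed.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r as the average product of z-scores (population SDs)."""
    avg_x, avg_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return mean([((x - avg_x) / sd_x) * ((y - avg_y) / sd_y)
                 for x, y in zip(xs, ys)])

# County averages from the slide: Cayuga, Madison, Oswego.
county_verbal = [524.3, 503.7, 510.2]
county_math = [539.6, 512.1, 516.9]

ecological_r = correlation(county_verbal, county_math)
district_r_from_slide = 0.770  # quoted from the district-level slide

print(f"ecological r (3 county averages): {ecological_r:.3f}")
print(f"district-level r (from slide):    {district_r_from_slide:.3f}")
```

Averaging within counties smooths away the district-to-district scatter, so the three averaged points sit much closer to a line than the 25 district points do. The same effect would be even more pronounced relative to individual students.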
