Unit 4: Correlation and Causation
• Now, a single datum is a pair of values, one for each variable
• Are variables related (associated)? – i.e., if one changes, is the other likely to change?
Statistical cliché: Association does not imply causation
• Ex: Sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headaches. – (Or more likely, a common cause is drunkenness)
• Ex: Study at U of PA Med Ctr, 1999: Young children who sleep with the light on are much more likely to develop myopia in later life. – (Later researchers found another common cause: parents’ myopia.)
• From Smithsonian Magazine, Aug ’98: The Vermont Back Research Institute at the Univ of Vermont uses the “Vermont Disability Prediction Questionnaire” to predict whether a back problem will become disabling. Items include:
– How many times have you visited a medical doctor in the past for back problems?
– How many times have you been married?
– How well do you get along with your coworkers?
• Why do they ask these questions? Dr. Roland Hazard shrugs: “We don’t know.” It’s just that answers to such questions have proved predictive of whether back problems will become disabling.
• I.e., they are related, but we don’t know how.
Which kinds of variables?
• Both categorical: compare percentages
– Ex: gender vs. physical activity (S ’06)
• Input variable categorical: compare avgs
– Ex: digital ratios
• Both numerical
– scatterplot (“correlation” and “regression”)
– Ex: babyboom
Are these associations positive or negative? weak or strong?
Correlation (coefficient) r
• Gives a measure of how closely points follow a straight line
• Always between -1 and 1
– r = 1: all pts on a line with + slope
– r = -1: all pts on a line with – slope
– r near 0: blob
• [Formula: turn x- and y-values into z-scores, multiply for each point, find avg product]
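The bracketed recipe can be sketched in a few lines of Python. This is a minimal sketch, not code from the course: the function name `correlation_from_z_scores` and the sample data are hypothetical, and it assumes each SD is computed with divisor n (population SD).

```python
from statistics import mean, pstdev


def correlation_from_z_scores(xs, ys):
    """Average product of z-scores (SDs computed with divisor n)."""
    avg_x, avg_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    # Step 1: turn x- and y-values into z-scores.
    z_x = [(x - avg_x) / sd_x for x in xs]
    z_y = [(y - avg_y) / sd_y for y in ys]
    # Steps 2 and 3: multiply for each point, then average the products.
    return mean([zx * zy for zx, zy in zip(z_x, z_y)])


# Hypothetical data: points exactly on a line, first with + slope, then with - slope.
xs = [1, 2, 3, 4, 5]
r_up = correlation_from_z_scores(xs, [2, 4, 6, 8, 10])
r_down = correlation_from_z_scores(xs, [10, 8, 6, 4, 2])
print(round(r_up, 6), round(r_down, 6))
```

Up to floating-point rounding, the two results are 1 and -1, matching the "all pts on a line" cases.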
• History: Invented by Karl Pearson (1857-1911)
Estimate the correlations:
“SD-line” [FPP only]
• Okay, r measures how closely data follows a line. Which line?
– through “point of averages” (avg of x, avg of y)
– slope: ± σy / σx, where the sign is + if r > 0, - if r < 0
• Ex: Baldness study: # hair (in 10K’s) avg 40, σ = 15; ages avg 36, σ = 20; r = -.3. If hair is on vertical axis, SD-line?
• Ex: Scores on first exam avg 75, σ = 15; on final exam avg 110, σ = 35; r = .5. SD-line?
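The two exercises can be worked with a small helper. This is a sketch under stated assumptions: the function name `sd_line` is hypothetical, age is taken as the horizontal variable in the baldness study (the slide puts hair on the vertical axis), and the first exam is assumed to be on the horizontal axis, which the slide does not specify.

```python
def sd_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the SD-line through the point of averages."""
    if r == 0:
        raise ValueError("the sign rule for the SD-line needs r != 0")
    sign = 1 if r > 0 else -1
    slope = sign * sd_y / sd_x
    # The line passes through (avg_x, avg_y), so solve for the intercept.
    intercept = avg_y - slope * avg_x
    return slope, intercept


# Baldness study: x = age (avg 36, SD 20), y = hair in 10K's (avg 40, SD 15), r = -0.3.
hair_slope, hair_intercept = sd_line(36, 20, 40, 15, -0.3)

# Exams: x = first exam (avg 75, SD 15), y = final exam (avg 110, SD 35), r = 0.5.
# Assumption: first exam on the horizontal axis.
exam_slope, exam_intercept = sd_line(75, 15, 110, 35, 0.5)

print(hair_slope, hair_intercept)
print(exam_slope, exam_intercept)
```

Under these assumptions the baldness SD-line is hair = 67 - 0.75 × age, and the exam SD-line is final = -65 + (35/15) × first. Only the sign of r enters the SD-line, not its size.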
Sketching in the SD-lines
“Covariance formula” for r (FPP p.134)
r = ((avg of xy) - (avg of x)(avg of y)) / (σx σy)
Numerator is the “covariance of x and y”
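A small worked instance of the covariance formula, as a sketch: the function name `correlation_from_covariance` and the five data points are made up for illustration, and the SDs are assumed to use divisor n.

```python
from statistics import mean, pstdev


def correlation_from_covariance(xs, ys):
    """r = ((avg of xy) - (avg of x)(avg of y)) / (SD of x * SD of y)."""
    avg_xy = mean([x * y for x, y in zip(xs, ys)])
    covariance = avg_xy - mean(xs) * mean(ys)
    return covariance / (pstdev(xs) * pstdev(ys))


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = correlation_from_covariance(xs, ys)
print(round(r, 6))
```

By hand: avg of xy = 53/5 = 10.6, (avg of x)(avg of y) = 3 × 3 = 9, so the covariance is 1.6. Both SDs are √2, giving r = 1.6 / 2 = 0.8.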
Remarks on r
• r is
– a pure number (no units)
– not affected by
   • reversing variables
   • linear changes of variables with positive slope [changes of units, like ft to m]
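Both invariance claims can be checked numerically. A minimal sketch: the helper `pearson_r` and the height/weight values are hypothetical, chosen only to illustrate swapping the variables and converting ft to m.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Correlation via the covariance formula (SDs with divisor n)."""
    cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
    return cov / (pstdev(xs) * pstdev(ys))


# Hypothetical data: heights in feet, weights in pounds.
heights_ft = [5.0, 5.5, 5.8, 6.0, 6.2]
weights_lb = [120, 150, 155, 180, 175]

r_original = pearson_r(heights_ft, weights_lb)
r_reversed = pearson_r(weights_lb, heights_ft)   # swap x and y
heights_m = [h * 0.3048 for h in heights_ft]     # change of units, ft to m
r_metric = pearson_r(heights_m, weights_lb)

print(r_original, r_reversed, r_metric)
```

The three values agree up to floating-point rounding, because z-scores do not change when a variable is rescaled by a positive factor or shifted by a constant.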
r is affected by …
• nonlinear association
• outliers
• combining different groups, with different centers (Simpson’s Paradox II)
• “ecological correlations”, i.e., correlations of averaged data points
• [examples shortly]
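Two of the effects listed above, nonlinear association and outliers, show up even on tiny data sets. The sketch below uses made-up numbers and a hypothetical helper `pearson_r`; it is an illustration, not one of the course's own examples.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Correlation via the covariance formula (SDs with divisor n)."""
    cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
    return cov / (pstdev(xs) * pstdev(ys))


# Nonlinear association: y is completely determined by x (y = x squared),
# yet the points do not follow a line, so r comes out as 0.
xs_curve = [-2, -1, 0, 1, 2]
ys_curve = [x * x for x in xs_curve]
r_curve = pearson_r(xs_curve, ys_curve)

# Outlier: five points exactly on a line, then one added far-off point.
xs_line = [1, 2, 3, 4, 5]
ys_line = [1, 2, 3, 4, 5]
r_line = pearson_r(xs_line, ys_line)
r_with_outlier = pearson_r(xs_line + [10], ys_line + [0])

print(r_curve, r_line, r_with_outlier)
```

In this made-up case a single outlier is enough to take r from 1 (a perfect line) to a negative value.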
SAT scores
Average scores from school districts in Cayuga (C), Madison (M), and Oswego (O) counties for the 1998-99 school year

District            County  Overall  Verbal  Math
Auburn              C       1068     525     543
Cato-Meridian       C       1057     528     529
Moravia             C       1064     523     541
Otselic Valley      C       1010     506.4   503.6
Port Byron          C       999      498     501
Southern Cayuga     C       1111     541     570
Brookfield          M       1002     509     493
Canastota           M       1001     494     507
Cazenovia           M       1046     518     528
Chittenango         M       998      491     507
DeRuyter            M       987.7    491.1   496.6
Hamilton            M       1082     534     548
Madison             M       978      496     482
Morrisville-Eaton   M       1035     503     532
Oneida              M       1034     508     526
Stockbridge Valley  M       1000     490     510
A-P-W               O       1039     512.3   526.3
Central Square      O       1023     507     518
Fulton              O       1044     522     522
Hannibal            O       1017     504     513
Mexico              O       1039     519     520
Oswego              O       1059     522     537
Phoenix             O       998      494     504
Pulaski             O       1028     522     502
Sandy Creek         O       1000     490     510

Verbal / Math r = 0.770

County averages     Overall  Verbal  Math
Cayuga County       1063.9   524.3   539.6
Madison County      1015.8   503.7   512.1
Oswego County       1027.4   510.2   516.9

Ecol r = 0.989
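The two correlations quoted with the SAT data can be reproduced from the numbers on the slide. This is a sketch, not the original computation: the helper `pearson_r` and the variable names are hypothetical, the district scores are copied from the table, and the county averages are used as printed rather than recomputed from the district rows.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Correlation via the covariance formula (SDs with divisor n)."""
    cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
    return cov / (pstdev(xs) * pstdev(ys))


# (district, county, verbal, math), copied from the slide.
districts = [
    ("Auburn", "C", 525, 543), ("Cato-Meridian", "C", 528, 529),
    ("Moravia", "C", 523, 541), ("Otselic Valley", "C", 506.4, 503.6),
    ("Port Byron", "C", 498, 501), ("Southern Cayuga", "C", 541, 570),
    ("Brookfield", "M", 509, 493), ("Canastota", "M", 494, 507),
    ("Cazenovia", "M", 518, 528), ("Chittenango", "M", 491, 507),
    ("DeRuyter", "M", 491.1, 496.6), ("Hamilton", "M", 534, 548),
    ("Madison", "M", 496, 482), ("Morrisville-Eaton", "M", 503, 532),
    ("Oneida", "M", 508, 526), ("Stockbridge Valley", "M", 490, 510),
    ("A-P-W", "O", 512.3, 526.3), ("Central Square", "O", 507, 518),
    ("Fulton", "O", 522, 522), ("Hannibal", "O", 504, 513),
    ("Mexico", "O", 519, 520), ("Oswego", "O", 522, 537),
    ("Phoenix", "O", 494, 504), ("Pulaski", "O", 522, 502),
    ("Sandy Creek", "O", 490, 510),
]

# County averages (verbal, math) as printed on the slide.
county_averages = {
    "Cayuga": (524.3, 539.6),
    "Madison": (503.7, 512.1),
    "Oswego": (510.2, 516.9),
}

# District-level correlation: one point per district (25 points).
r_district = pearson_r([d[2] for d in districts], [d[3] for d in districts])

# Ecological correlation: one point per county (3 points).
r_ecological = pearson_r(
    [v for v, _ in county_averages.values()],
    [m for _, m in county_averages.values()],
)

print(round(r_district, 3), round(r_ecological, 3))
```

Averaging within counties removes most of the district-to-district scatter, which is why the ecological r (about 0.989) is so much closer to 1 than the district-level r (about 0.770).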