10/03/06 Lecture 10 1
STAT 155 Introductory Statistics
Lecture 10: Cautions about Regression and Correlation, Causation
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Review
• Least-Squares Regression Lines
• Equation and interpretation of the line
• Prediction using the line
• Correlation and Regression
• Coefficient of Determination
Regression Diagnostics
• Look at residuals (errors):
– A residual is the difference between an observed value of the response variable and the value predicted by the regression line, i.e., $\text{residual} = y - \hat{y}$.
– The sum of the least-squares residuals is always zero. Why?
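One way to see why: the least-squares intercept is a = ȳ − b·x̄, so the fitted line passes through (x̄, ȳ), and the residuals y − (a + b·x) add up to n(ȳ − a − b·x̄) = 0. The sketch below checks this numerically. The data values are made up for illustration and are not from the lecture.

```python
# Illustrative data only (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope b and intercept a for the line y_hat = a + b * x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

# residual = observed y minus predicted y_hat
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
residual_sum = sum(residuals)

print(f"slope = {b:.4f}, intercept = {a:.4f}")
print(f"sum of residuals = {residual_sum:.2e}")  # zero up to floating-point rounding
```

The printed sum is not exactly zero only because of floating-point rounding.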
Residual Plots
• A residual plot is a scatterplot of the regression residuals against the explanatory variable.
• Residual plots help us assess the fit of a regression line.
Residual Plot
• If the regression line catches the overall pattern of the data, there should be no pattern in the residuals; the residual plot should look totally random.
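As a rough illustration of what a pattern in the residuals looks like, the sketch below fits a straight line to data that actually follow a curve (y = x², chosen here only as an example) and prints the sign of each residual in order of x. A good linear fit would give signs with no obvious order; here they come in runs.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Made-up curved data: y = x^2 for x = 0, 1, ..., 8.
x = list(range(9))
y = [xi ** 2 for xi in x]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# A text stand-in for a residual plot: the sign of each residual, ordered by x.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # prints "++-----++"
```

The residuals are positive at both ends and negative in the middle, a U-shaped pattern indicating that a straight line misses the overall shape of the data.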
Diabetes Patient: FPG vs. HbA
• FPG: fasting plasma glucose.
• HbA: percent of red blood cells that have a glucose molecule attached.
• Both are measuring blood glucose.
• We expect a positive association.
• 18 subjects, r = 0.4819.
• See the scatterplot on the next page.
Outliers and Influential Observations
• An outlier is a point that lies outside the overall pattern of the other points.
– Outliers in the y direction have large residuals, but other outliers may not.
• An influential observation is a point whose removal would substantially change the regression line.
– Outliers in the x direction are often influential points.
– But not always…
Outliers & Influential Obs.
• Outliers in the y direction can be spotted from the residual plot.
• Influential points can be identified by fitting regression lines with and without those points. They are the more serious problem.
– They cannot be identified via the residual plot.
– The scatterplot gives us some hint.
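The with/without comparison can be sketched in a few lines of Python. The data are made up for illustration: six points lying exactly on y = x + 1, plus one point far out in the x direction.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Made-up data: the last point (20, 5) is an outlier in the x direction.
x = [1, 2, 3, 4, 5, 6, 20]
y = [2, 3, 4, 5, 6, 7, 5]

a_with, b_with = fit_line(x, y)                  # all seven points
a_without, b_without = fit_line(x[:-1], y[:-1])  # outlying point removed

# Residuals from the fit that includes the outlying point.
residuals = [yi - (a_with + b_with * xi) for xi, yi in zip(x, y)]
largest_other = max(abs(r) for r in residuals[:-1])

print(f"slope with the point:    {b_with:.3f}")
print(f"slope without the point: {b_without:.3f}")
print(f"|residual| of the point: {abs(residuals[-1]):.3f}")
print(f"largest |residual| among the other points: {largest_other:.3f}")
```

In this example the single point pulls the line toward itself, so the slope changes sharply while the point's own residual stays smaller than the residuals of some ordinary points. This is one reason a residual plot alone can miss an influential observation.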
Cautions about correlation and regression
• Linear only
• DO NOT extrapolate
• Not resistant
• Beware lurking variables
• Beware correlations based on averaged data
• The restricted-range problem
Lurking Variable
• A lurking (hidden) variable is a variable that has an important effect on the relationship among the variables in a study, but is not included among the variables being studied.
• Examples:
– SAT scores and college grades
  • Lurking variable: IQ
Lurking variables can create nonsense correlations.
• For the world’s nations, let x be the number of TVs per person and y be the average life expectancy.
• A high positive correlation: nations with more TV sets have higher life expectancies.
– Could we lengthen the lives of people in Rwanda by shipping them more TVs?
• Lurking variable: wealth of the nation
– Rich nations: more TV sets.
– Rich nations: longer life expectancies because of better nutrition, clean water, and better health care.
• There is no cause-and-effect tie between TV sets and length of life.
• Association vs. causation.
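A small sketch of how a lurking variable can produce such a correlation. All numbers are hypothetical: `wealth` is an invented three-level index, and `tvs` and `life` both depend on it but not on each other. Across all nations the correlation is strong, while among nations at the same wealth level it disappears.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: within each wealth level, the TV and life-expectancy
# deviations are arranged so that they are unrelated to each other.
offsets = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
wealth, tvs, life = [], [], []
for z in (1, 2, 3):
    for dx, dy in offsets:
        wealth.append(z)
        tvs.append(10 * z + dx)        # TV ownership rises with wealth
        life.append(50 + 10 * z + dy)  # life expectancy rises with wealth

r_all = pearson_r(tvs, life)

# Correlation between TVs and life expectancy at each fixed wealth level.
r_within = []
for z in (1, 2, 3):
    idx = [i for i, w in enumerate(wealth) if w == z]
    r_within.append(pearson_r([tvs[i] for i in idx], [life[i] for i in idx]))

print(f"r across all nations: {r_all:.3f}")
print("r within each wealth level:", [round(r, 3) for r in r_within])
```

The data are constructed so that the within-level correlations are exactly zero; real data would rarely be this clean, but the direction of the effect is the point.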
Beware correlations based on averaged data
• A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals.
• Age vs. Height
• (Basketball) score % vs. practice time
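The age vs. height example can be sketched with made-up numbers. Below, four hypothetical children at each age differ considerably in height, but the average height at each age rises perfectly steadily, so the correlation based on averages is higher than the one based on individuals.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: ages 8-12, four children per age, heights in cm.
ages = [8, 9, 10, 11, 12]
deviations = [-9, -3, 3, 9]  # individual variation around the age average

ind_age, ind_height = [], []
for age in ages:
    for d in deviations:
        ind_age.append(age)
        ind_height.append(120 + 6 * (age - 8) + d)

# Average height at each age.
avg_height = []
for age in ages:
    heights = [h for a, h in zip(ind_age, ind_height) if a == age]
    avg_height.append(sum(heights) / len(heights))

r_individuals = pearson_r(ind_age, ind_height)
r_averages = pearson_r(ages, avg_height)

print(f"r based on individuals: {r_individuals:.3f}")
print(f"r based on age averages: {r_averages:.3f}")
```

Averaging removes the individual-to-individual variation, which is exactly the scatter that keeps the individual-level correlation below 1.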
The restricted-range problem
• A restricted-range problem occurs when one does not get to observe the full range of the variables.
• When data suffer from restricted range, r and r² are lower than they would be if the full range could be observed.
• SAT scores vs. College GPA
– Princeton vs. Generic State College (Ex 2.22)
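A minimal sketch of the restricted-range effect, using made-up data on an arbitrary scale (not the textbook example): y follows x with some scatter, and the correlation is computed once over the full range of x and once using only the top few x values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data: x = 1..20, y = x plus an alternating +/-3 scatter.
x = list(range(1, 21))
y = [xi + 3 * (-1) ** xi for xi in x]

# Restricted range: keep only observations with x >= 16
# (think of a school that admits only high scorers).
restricted = [(xi, yi) for xi, yi in zip(x, y) if xi >= 16]
x_r = [p[0] for p in restricted]
y_r = [p[1] for p in restricted]

r_full = pearson_r(x, y)
r_restricted = pearson_r(x_r, y_r)

print(f"r over the full range:       {r_full:.3f}, r^2 = {r_full ** 2:.3f}")
print(f"r over the restricted range: {r_restricted:.3f}, r^2 = {r_restricted ** 2:.3f}")
```

The scatter around the line is the same size in both cases, but over a narrow range of x it is large relative to the spread in x, so r and r² drop.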
Causation vs. Association
• Some studies aim to establish causation.
• Example of causation:
– Increased drinking of alcohol causes a decrease in coordination.
– Smoking and lung cancer.
• Example of association:
– The above two examples.
– SAT scores and freshman-year GPA.
Association does not imply causation.
• An association between two variables x and y can reflect many types of relationship among x, y, and one or more lurking variables.
• An association between a predictor x and a response y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
Explaining Association: Causation
• Cause-and-effect
• Examples:
– Amount of fertilizer and yield of corn
– Weight of a car and its MPG
– Dosage of a drug and the survival rate of the mice
Explaining Association: Common Response
• Lurking variables
• Both x and y change in response to changes in z, the lurking variable.
• There may be no direct causal link between x and y.
• Examples:
– SAT scores vs. College GPA (IQ, Attitude)
– Monthly flow of money into stock mutual funds vs. rate of return for the stock market (Market Condition, Investor Attitude)
Explaining Association: Confounding
• Two variables are confounded when their effects on a response variable are mixed together.
• One explanatory variable may be confounded with other explanatory variables or lurking variables.
• Examples:
– More education leads to higher income.
  • Family background…
– Religious people live longer.
  • Life style…
Establishing causation
• The only compelling method: Designed experiment (More in Chapter 3)
• Hot disputes:
– Does gun control reduce violent crime?
– Does meat consumption in your diet cause heart disease?
– Does smoking cause lung cancer?
Does smoking CAUSE lung cancer?
• causation: smoking causes lung cancer.
• common response: people who have a genetic predisposition to lung cancer also have a genetic predisposition to smoking.
• confounding: people who drink too much, don't exercise, eat unhealthy foods, etc. are more likely to get lung cancer as a result of their lifestyle. Such people may be more likely to be smokers as well.
Some guidelines when a designed experiment is impossible:
• strong association
• association consistent across various studies
• higher dose associated with stronger responses
• the cause precedes the effect in time
• plausibility