Shonda Kuiper
Grinnell College
Comparing the two-sample t-test, ANOVA and regression
Comparing Statistical Tests
Statistical techniques taught in introductory statistics courses typically have one response variable and one explanatory variable.
A response variable measures the outcome of a study.
An explanatory variable explains changes in the response variable.
Comparing Statistical Tests
Each variable can be classified as either categorical or quantitative.
                     Explanatory Variable
Response Variable    Categorical                             Quantitative
Categorical          Chi-Square test, Two proportion test    Logistic Regression
Quantitative         Two-sample t-test, ANOVA                Regression
Categorical data place individuals into one of several groups (such as red/blue/white, male/female or yes/no).
Quantitative data consists of numerical values for which most arithmetic operations make sense.
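Model for a Two-sample t-test
Statistical models have the following form:
observed value = mean response + random error
$Y_{ij} = \bar{Y}_i + \hat{\varepsilon}_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4

Group         Observed     Group mean     Residual
Generic          70     =     80       +    -10
Generic          82     =     80       +      2
Generic          90     =     80       +     10
Generic          78     =     80       +     -2
Brand Name       75     =     85       +    -10
Brand Name       85     =     85       +      0
Brand Name       95     =     85       +     10
Brand Name       85     =     85       +      0

Generic Group: $\hat{\mu}_1 = \bar{Y}_1 = (70 + 82 + 90 + 78)/4 = 80$
Brand Name Group: $\hat{\mu}_2 = \bar{Y}_2 = (75 + 85 + 95 + 85)/4 = 85$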
Null Hypothesis: the two groups of batteries last the same amount of time
The theoretical model used in the two-sample t-test is designed to account for these two group means (µ1 and µ2) and random error.
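Null Hypothesis: $H_0: \mu_1 = \mu_2$
Alternative Hypothesis: $H_a: \mu_1 \neq \mu_2$
observed value = mean response + random error
Theoretical model: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4
Fitted model: $Y_{ij} = \bar{Y}_i + \hat{\varepsilon}_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4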
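The fitted two-sample model can be checked numerically. The following is a minimal Python sketch using the battery data from the slides; the variable names are illustrative, and the pooled (equal-variance) form of the t statistic is an assumption, since the transcript does not show which version was used.

```python
from math import sqrt

generic = [70, 82, 90, 78]      # observed values, Generic group
brand_name = [75, 85, 95, 85]   # observed values, Brand Name group

def mean(values):
    return sum(values) / len(values)

def residuals(values):
    """Observed value minus the group mean, for each observation."""
    m = mean(values)
    return [y - m for y in values]

mean_generic, mean_brand = mean(generic), mean(brand_name)
res_generic, res_brand = residuals(generic), residuals(brand_name)

# Pooled variance (assumed): squared residuals over n1 + n2 - 2 degrees of freedom.
sse = sum(e ** 2 for e in res_generic + res_brand)
df = len(generic) + len(brand_name) - 2
pooled_var = sse / df

# Pooled two-sample t statistic for H0: mu_1 = mu_2.
std_err = sqrt(pooled_var * (1 / len(generic) + 1 / len(brand_name)))
t_stat = (mean_brand - mean_generic) / std_err

print(mean_generic, mean_brand)   # 80.0 85.0
print(res_generic, res_brand)     # the residual column of the model
print(round(t_stat, 4))           # 0.8575
```

Each observed value equals its group mean plus its residual, which is the "observed value = mean response + random error" form stated above.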
Model for ANOVA
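$Y_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\varepsilon}_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4

Group         Observed     Grand mean     Group effect     Residual
Generic          70     =    82.5      +     -2.5       +    -10
Generic          82     =    82.5      +     -2.5       +      2
Generic          90     =    82.5      +     -2.5       +     10
Generic          78     =    82.5      +     -2.5       +     -2
Brand Name       75     =    82.5      +      2.5       +    -10
Brand Name       85     =    82.5      +      2.5       +      0
Brand Name       95     =    82.5      +      2.5       +     10
Brand Name       85     =    82.5      +      2.5       +      0

Grand mean: $\hat{\mu} = \bar{Y}_{..} = (70 + 82 + 90 + 78 + 75 + 85 + 95 + 85)/8 = 82.5$
Generic effect: $\hat{\alpha}_1 = \bar{Y}_1 - \bar{Y}_{..} = 80 - 82.5 = -2.5$
Brand Name effect: $\hat{\alpha}_2 = \bar{Y}_2 - \bar{Y}_{..} = 85 - 82.5 = 2.5$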
ANOVA: Instead of using two group means, we break the mean response into a grand mean, $\mu$, and two group effects ($\alpha_1$ and $\alpha_2$).
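Null Hypothesis: $H_0: \alpha_1 = \alpha_2 = 0$, which is equivalent to $H_0: \mu_1 = \mu_2$
Alternative Hypothesis: $H_a$: at least one $\alpha_i \neq 0$, which is equivalent to $H_a: \mu_1 \neq \mu_2$
observed value = mean response + random error
Two-sample t-test model: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4
ANOVA model: $Y_{ij} = \{\mu + \alpha_i\} + \varepsilon_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4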
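The ANOVA decomposition can be reproduced in a few lines. This is a sketch in Python on the same battery data; the dictionary layout and variable names are illustrative choices, not part of the slides.

```python
groups = {"Generic": [70, 82, 90, 78], "Brand Name": [75, 85, 95, 85]}

all_values = [y for values in groups.values() for y in values]
grand_mean = sum(all_values) / len(all_values)

# Group effect = group mean - grand mean.
effects = {name: sum(v) / len(v) - grand_mean for name, v in groups.items()}

# Sums of squares: between groups, within groups (residual), and total.
ss_between = sum(len(v) * effects[name] ** 2 for name, v in groups.items())
ss_within = sum((y - (grand_mean + effects[name])) ** 2
                for name, v in groups.items() for y in v)
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

# F statistic = mean square between / mean square within.
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(grand_mean, effects)                # 82.5 and effects of -2.5 and 2.5
print(ss_between, ss_within, ss_total)    # 50.0 408.0 458.0
print(round(f_stat, 4))                   # 0.7353
```

The between-group and within-group sums of squares add up to the total sum of squares, which mirrors the way each observation splits into grand mean, group effect and residual.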
Model for Regression
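$X_i$ is either 0 or 1 (0 for the Generic group, 1 for the Brand Name group).
Regression: Instead of using two group means, we create a model for a straight line (using $\beta_0$ and $\beta_1$).
observed value = mean response + random error
$Y_i = \{\beta_0 + \beta_1 X_i\} + \varepsilon_i$, where i = 1, 2, ..., 8
When $X_i = 0$, the mean response is $\beta_0 = \mu_1$. When $X_i = 1$, the mean response is $\beta_0 + \beta_1 = \mu_2$.
Null Hypothesis: $H_0: \beta_1 = 0$, which is equivalent to $H_0: \mu_2 - \mu_1 = 0$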
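$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$, where i = 1, 2, ..., 8

Observed     Intercept     Slope x X     Residual
   70     =     80      +      0      +    -10
   82     =     80      +      0      +      2
   90     =     80      +      0      +     10
   78     =     80      +      0      +     -2
   75     =     80      +      5      +    -10
   85     =     80      +      5      +      0
   95     =     80      +      5      +     10
   85     =     80      +      5      +      0

$\hat{\beta}_0 = \bar{Y}_1 = 80$
$\hat{\beta}_1 = \bar{Y}_2 - \bar{Y}_1 = 85 - 80 = 5$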
Model for Regression
Fitted values: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, where i = 1, 2, ..., 8

Fitted value     Intercept     Slope x X
    80        =     80      +      0
    80        =     80      +      0
    80        =     80      +      0
    80        =     80      +      0
    85        =     80      +      5
    85        =     80      +      5
    85        =     80      +      5
    85        =     80      +      5

The equation for the line is often written as: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i = 80 + 5X_i$
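A least-squares fit with a 0/1 explanatory variable recovers the same numbers. The sketch below uses the standard simple-regression formulas in plain Python; coding Generic as 0 and Brand Name as 1 is inferred from the fitted values on the slide.

```python
y = [70, 82, 90, 78, 75, 85, 95, 85]
x = [0, 0, 0, 0, 1, 1, 1, 1]   # assumed coding: 0 = Generic, 1 = Brand Name

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a straight line.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

print(b0, b1)    # 80.0 5.0
print(fitted)    # 80.0 for the first four observations, 85.0 for the last four
print(resid)     # same residuals as in the two-sample t-test model
```

The intercept is the Generic group mean and the slope is the difference between the two group means, so the fitted line passes through both group means.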
Comparing the Two-sample t-test, Regression and ANOVA
When there are only two groups (and we have the same assumptions), all three models are algebraically equivalent.
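Two-sample t-test: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4, with $H_0: \mu_1 = \mu_2$
ANOVA: $Y_{ij} = \{\mu + \alpha_i\} + \varepsilon_{ij}$, where i = 1, 2 and j = 1, 2, 3, 4
Regression: $Y_i = \{\beta_0 + \beta_1 X_i\} + \varepsilon_i$, where i = 1, 2, ..., 8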
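One way to see the equivalence is to compute all three test statistics on the battery data. The sketch below is plain Python and assumes the pooled (equal-variance) t-test; it relies on the fact that the least-squares slope for a 0/1 predictor equals the difference in group means.

```python
from math import sqrt, isclose

generic = [70, 82, 90, 78]
brand_name = [75, 85, 95, 85]
n1, n2 = len(generic), len(brand_name)
m1, m2 = sum(generic) / n1, sum(brand_name) / n2

# Shared ingredient: mean squared error around the two group means.
sse = sum((y - m1) ** 2 for y in generic) + sum((y - m2) ** 2 for y in brand_name)
mse = sse / (n1 + n2 - 2)

# 1. Pooled two-sample t statistic.
t_two_sample = (m2 - m1) / sqrt(mse * (1 / n1 + 1 / n2))

# 2. One-way ANOVA F statistic (1 numerator degree of freedom).
grand = (sum(generic) + sum(brand_name)) / (n1 + n2)
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
f_anova = ss_between / mse

# 3. t statistic for the regression slope with X = 0 or 1.
x = [0] * n1 + [1] * n2
x_bar = sum(x) / len(x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = m2 - m1                      # least-squares slope for a 0/1 predictor
t_slope = slope / sqrt(mse / s_xx)

print(isclose(t_two_sample, t_slope))        # True
print(isclose(t_two_sample ** 2, f_anova))   # True
```

The slope t statistic matches the two-sample t statistic, and squaring either one gives the ANOVA F statistic, so all three tests lead to the same p-value for these data.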
Introduction to Multiple Regression: Hypothesis Tests and R2
Goals of Multiple Regression
• Multiple regression analysis can be used to serve different goals. The goals will influence the type of analysis that is conducted. The most common goals of multiple regression are to:
• Describe: A model may be developed to describe the relationship between multiple explanatory variables and the response variable.
• Predict: A regression model may be used to generalize to observations outside the sample.
• Confirm: Theories are often developed about which variables or combination of variables should be included in a model. Hypothesis tests can be used to evaluate the relationship between the explanatory variables and the response.
Introduction to Multiple Regression
• Build a multiple regression model to predict retail price of cars
• Price = 35738 - 0.22 Mileage    R-Sq: 4.1%
• Slope coefficient (b1): t = -2.95 (p-value = 0.004)
Questions:
• What happens to Price as Mileage increases?
• Since b1 = -0.22 is small, can we conclude it is unimportant?
• Does mileage help you predict price? What does the p-value tell you?
• Does mileage help you predict price? What does the R-Sq value tell you?
• Are there outliers or influential observations?
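One way to approach the question about the size of b1 is to evaluate the fitted equation at a few mileages. The sketch below uses the coefficients from the slide; the function name and the 10,000-mile comparison are illustrative assumptions, and price is assumed to be in dollars.

```python
def predicted_price(mileage):
    """Fitted line from the slide: Price = 35738 - 0.22 * Mileage."""
    return 35738 - 0.22 * mileage

# Predicted change in price over 10,000 additional miles (hypothetical comparison).
drop_per_10k_miles = predicted_price(0) - predicted_price(10_000)

print(round(predicted_price(20_000), 2))
print(round(drop_per_10k_miles, 2))
```

Because mileage is measured in single miles, a slope of -0.22 per mile corresponds to roughly 2,200 per 10,000 miles, so the size of a coefficient has to be read together with the units of its variable.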
What is R2?
What happens when all the points fall on the regression line?
What happens when the regression line does not help us estimate Y?
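The two questions above can be explored numerically. The sketch below assumes the usual definition, R2 = 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares around the mean of Y; the transcript does not show the formula used on the slides, and the data here are hypothetical.

```python
def r_squared(y, fitted):
    """R^2 = 1 - SSE / SST (assumed definition)."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

y = [2.0, 4.0, 6.0, 8.0]             # hypothetical responses

perfect_fit = [2.0, 4.0, 6.0, 8.0]   # every point falls on the line
flat_fit = [5.0, 5.0, 5.0, 5.0]      # line is just the mean of y

print(r_squared(y, perfect_fit))     # 1.0
print(r_squared(y, flat_fit))        # 0.0
```

When every point falls on the line the residuals are all zero and R2 is 1. When the line does no better than the mean of Y, SSE equals SST and R2 is 0.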
Adjusted R2
• R2adj includes a penalty when more terms are included in the model:
$R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p}$
• n is the sample size and p is the number of coefficients (including the constant term: β0, β1, β2, β3, …, βp-1)
• When many terms are in the model: p is larger, so (n - 1)/(n - p) is larger, so for a given R2, R2adj is smaller.
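The penalty can be illustrated with a short calculation. The sketch below assumes the common form R2adj = 1 - (1 - R2)(n - 1)/(n - p), which matches the (n - 1)/(n - p) factor mentioned above; the sample size n = 100 is hypothetical and is not taken from the car data.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2, with p counting every coefficient including the constant."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

n = 100   # hypothetical sample size

# Same R^2, but one model uses one more coefficient than the other.
smaller_model = adjusted_r_squared(0.446, n, p=7)
larger_model = adjusted_r_squared(0.446, n, p=8)

print(smaller_model > larger_model)   # True
```

With R2 held fixed, the model with more coefficients gets the lower adjusted value, which is the penalty described in the bullets.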
Price = 35738 – 0.22 Mileage R-Sq: 4.1%
Slope coefficient (b1): t = -2.95 (p-value = 0.004)
Introduction to Multiple Regression: Variable Selection
Variable Selection Techniques
• Build a multiple regression model to predict retail price of cars
[Figure: Scatterplot of Price vs Mileage, R2 = 2%. Horizontal axis: Mileage (0 to 50000). Vertical axis: Price (0 to 70000).]
Explanatory variables: Mileage, Cylinder, Liter, Leather, Cruise, Doors, Sound
Price = 6759 + 6289Cruise + 3792Cyl - 1543Doors + 3349Leather - 787Liter - 0.17Mileage - 1994Sound
R2 = 44.6%
Introduction to Multiple Regression
Step Forward Regression (Forward Selection):
Which single explanatory variable best predicts Price?
Price = 13921.9 + 9862.3Cruise R2 = 18.56%
Price = -17.06 + 4054.2Cyl R2 = 32.39%
Price = 24764.6 – 0.17Mileage R2 = 2.04%
Price = 6185.8.6 + 4990.4Liter R2 = 31.15%
Price = 23130.1 – 2631.4Sound R2 = 1.55%
Price = 18828.8 + 3473.46Leather R2 = 2.47%
Price = 27033.6 -1613.2Doors R2 = 1.93%
Introduction to Multiple Regression
Step Forward Regression:
Which combination of two terms best predicts Price?
Price = -17.06 + 4054.2Cyl    R2 = 32.39%
Price = -1046.4 + 3392.6Cyl + 6000.4Cruise    R2 = 38.4% (38.2%)
Price = 3145.8 + 4027.6Cyl - 0.152Mileage    R2 = 34% (33.8%)
Price = 1372.4 + 2976.4Cyl + 1412.2Liter    R2 = 32.6% (32.4%)
The value in parentheses appears to be the adjusted R2.
Introduction to Multiple Regression
Step Forward Regression:
Which combination of terms best predicts Price?
Price = -17.06 + 4054.2Cyl    R2 = 32.39%
Price = -1046.4 + 3393Cyl + 6000.4Cruise    R2 = 38.4% (38.2%)
Price = -2978.4 + 3276Cyl + 6362Cruise + 3139Leather    R2 = 40.4% (40.2%)
Price = 412.6 + 3233Cyl + 6492Cruise + 3162Leather - 0.17Mileage    R2 = 42.3% (42%)
Price = 5530.3 + 3258Cyl + 6320Cruise + 2979Leather - 0.17Mileage - 1402Doors    R2 = 43.7% (43.3%)
Price = 7323.2 + 3200Cyl + 6206Cruise + 3327Leather - 0.17Mileage - 1463Doors - 2024Sound    R2 = 44.6% (44.15%)
Price = 6759 + 3792Cyl + 6289Cruise + 3349Leather - 787Liter - 0.17Mileage - 1543Doors - 1994Sound    R2 = 44.6% (44.14%)
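The forward selection steps above can be sketched in code. This is a minimal pure-Python illustration, not the software used for the slides: the data set is a small hypothetical one (the car data are not included in the transcript), R2 is used as the selection criterion, and the helper names are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

def forward_selection(data, y):
    """At each step, add the variable that gives the largest R^2."""
    chosen, path = [], []
    remaining = list(data)
    while remaining:
        best = max(remaining,
                   key=lambda name: r_squared([data[v] for v in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), r_squared([data[v] for v in chosen], y)))
    return path

# Hypothetical data: 8 cars, price and mileage in thousands.
data = {
    "Cyl": [4, 4, 6, 6, 8, 8, 4, 6],
    "Cruise": [0, 1, 0, 1, 1, 1, 0, 0],
    "Mileage": [30, 12, 25, 8, 20, 15, 40, 22],
}
price = [14, 20, 19, 26, 30, 31, 12, 20]

path = forward_selection(data, price)
for variables, r2 in path:
    print(variables, round(r2, 3))
```

R2 never decreases as terms are added, which is why the slides also report an adjusted value. A practical version would stop adding terms once the improvement becomes negligible.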
Introduction to Multiple Regression
Step Backward Regression (Backward Elimination):
Price = 7323.2 + 3200Cyl + 6206Cruise + 3327Leather - 0.17Mileage - 1463Doors - 2024Sound    R2 = 44.6% (44.15%)
Price = 6759 + 3792Cyl + 6289Cruise + 3349Leather - 787Liter - 0.17Mileage - 1543Doors - 1994Sound    R2 = 44.6% (44.14%)
Other criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and Mallows' Cp, are often used to find the best model.
Bidirectional stepwise procedures
Introduction to Multiple Regression
Best Subsets Regression:
Here we see that Liter is the second best single predictor of price.
Introduction to Multiple Regression
Important Cautions:
• Stepwise regression techniques can often ignore very important explanatory variables. Best subsets is often preferable.
• Both best subsets and stepwise regression methods only consider linear relationships between the response and explanatory variables.
• Residual graphs are still essential in validating whether the model is appropriate.
• Transformations, interactions and quadratic terms can often improve the model.
• Whenever these iterative variable selection techniques are used, the p-values corresponding to the significance of each individual coefficient are not reliable.