Chapter 8: Linear Regression—Part A
A.P. Statistics
Linear Model
• Making a scatterplot allows you to describe the relationship between two quantitative variables.
• Often, however, it is more useful to use that linear relationship to predict or estimate new values from the observed data.
• We use the Linear Model to make those predictions and estimations.
Linear Model
Normal Model: Allows us to make predictions and estimations about the population and future events. It is a model of real data, as long as that data has a nearly symmetric distribution.

Linear Model: Allows us to make predictions and estimations about the population and future events. It is a model of real data, as long as there is a linear relationship between two quantitative variables.
Linear Model and the Least Squares Regression Line
• To make this model, we need to find a line of best fit.
• This line of best fit is the “predictor line”: it is how we predict or estimate the response variable, given the explanatory variable.
• What makes a line the “best” fit is how well it minimizes the residuals.
Residuals and the Least Squares Regression Line
• The residual is the difference between the observed value and the predicted value: residual = observed − predicted.
• It tells us how far off the model’s prediction is at that point.
• Negative residual: the predicted value is too big (an overestimate).
• Positive residual: the predicted value is too small (an underestimate).
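As a minimal sketch (the data values here are made up for illustration), a residual is just observed minus predicted, and its sign tells you the direction of the error:

```python
# Residual = observed - predicted; the sign tells over- vs. under-estimation.
observed = [68, 75, 81]   # hypothetical observed y-values
predicted = [70, 73, 81]  # hypothetical predictions from a model

residuals = [obs - pred for obs, pred in zip(observed, predicted)]
print(residuals)  # [-2, 2, 0]

for r in residuals:
    if r < 0:
        print("negative residual: prediction too big (overestimate)")
    elif r > 0:
        print("positive residual: prediction too small (underestimate)")
```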
Residuals
Least Squares Regression Line
• The LSRL is the line for which the sum of the squared residuals is the smallest.
• Why not just find a line where the sum of the residuals is the smallest?
– The sum of the residuals will always be zero.
– By squaring the residuals, we get all positive values, which can be meaningfully added.
– Squaring emphasizes the large residuals, which have a big impact on the correlation and the regression line.
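The two claims above can be demonstrated numerically. In this sketch (the data are made up), the least-squares line is fit with the usual formulas; its residuals sum to essentially zero, while any other line, here one with a slightly nudged slope, has a strictly larger sum of squared residuals:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]  # hypothetical, roughly linear data

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept from the standard formulas
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residuals)

# The plain sum of residuals is essentially zero -- it carries no information
print(sum(residuals))

# A different line (slope nudged up by 0.1) has a larger sum of squares
worse = sum((y - (b0 + (b1 + 0.1) * x)) ** 2 for x, y in zip(xs, ys))
print(sse < worse)  # True
```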
Scatterplot of Math and Verbal SAT scores
[Scatterplot: Verbal_SAT (480–680) vs. Math_SAT (500–660); Collection 1]
Scatterplot of Math and Verbal SAT scores with incorrect LSRL
Verbal_SAT = 1.232 Math_SAT − 144; sum of squares = 2350

[Scatterplot of Verbal_SAT vs. Math_SAT with this (incorrect) line; Collection 1]
Scatterplot of Math and Verbal SAT scores with correct LSRL
Verbal_SAT = 1.11 Math_SAT − 75.4; sum of squares = 2076; r² = 0.91

[Scatterplot of Verbal_SAT vs. Math_SAT with the least-squares line; Collection 1]
Model of Collection 1: Simple Regression
Response attribute (numeric): Verbal_SAT
Predictor attribute (numeric): Math_SAT
Sample count: 6

Equation of least-squares regression line: Verbal_SAT = 1.11024 Math_SAT − 75.424
Correlation coefficient, r = 0.954082
r-squared = 0.91027, indicating that 91.027% of the variation in Verbal_SAT is accounted for by Math_SAT.

The best estimate for the slope is 1.11024 ± 0.4839 at a 95% confidence level. (The standard error of the slope is 0.174288.)

When Math_SAT = 0, the predicted value for a future observation of Verbal_SAT is −75.4244 ± 288.073.
Correlation and the Line (Standardized Data)
• In standardized form, the LSRL passes through (z̄_x, z̄_y) = (0, 0).
• The LSRL equation is: ẑ_y = r·z_x
• “Moving one standard deviation from the mean in x, we can expect to move about r standard deviations from the mean in y.”
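A quick numerical check of ẑ_y = r·z_x, using made-up data: once both variables are converted to z-scores, the least-squares slope works out to be exactly r, and the intercept is 0.

```python
import statistics

# Hypothetical paired data
xs = [3.0, 5.0, 7.0, 9.0, 11.0]
ys = [60.0, 66.0, 63.0, 75.0, 80.0]

def zscores(vals):
    """Standardize a list: subtract the mean, divide by the (sample) SD."""
    m, s = statistics.mean(vals), statistics.stdev(vals)
    return [(v - m) / s for v in vals]

zx, zy = zscores(xs), zscores(ys)
n = len(xs)

# Correlation computed from z-scores: r = sum(zx * zy) / (n - 1)
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Least-squares slope of zy on zx; since sum(zx^2) = n - 1 for z-scores,
# this slope equals r
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
print(abs(slope - r))  # essentially zero
```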
Interpreting the Standardized Slope of the LSRL
LSRL of the scatterplot: ẑ_fat = 0.83·z_protein
For every standard deviation above (below) the mean a sandwich is in protein, we predict that its fat content is 0.83 standard deviations above (below) the mean.
LSRL that models data in real units
Protein: x̄ = 17.2 g, s_x = 14.0 g
Fat: ȳ = 23.5 g, s_y = 16.4 g
r = 0.83

LSRL Equation: ŷ = b₀ + b₁x
b₁ (slope) = r·(s_y / s_x)
b₀ (intercept) = ȳ − b₁·x̄
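Plugging the protein/fat summary statistics into these formulas, a minimal sketch; note the slope and intercept come out to the 0.97 and 6.8 used on the next slide:

```python
# Slope and intercept from summary statistics alone:
#   b1 = r * (s_y / s_x),   b0 = y_bar - b1 * x_bar
x_bar, s_x = 17.2, 14.0   # protein (g)
y_bar, s_y = 23.5, 16.4   # fat (g)
r = 0.83

b1 = r * (s_y / s_x)
b0 = y_bar - b1 * x_bar

print(round(b1, 2))  # 0.97
print(round(b0, 1))  # 6.8
```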
Interpreting the LSRL
f̂at = 6.8 + 0.97·protein
• Slope: one additional gram of protein is associated with a predicted additional 0.97 grams of fat.
• y-intercept: an item with zero grams of protein is predicted to have 6.8 grams of fat.
ALWAYS CHECK TO SEE IF THE y-INTERCEPT MAKES SENSE IN THE CONTEXT OF THE PROBLEM AND DATA.
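Using the fitted equation for a prediction (the 30 g protein value here is hypothetical, just to show the arithmetic):

```python
def predicted_fat(protein_g):
    """Predicted fat (g) from the slide's LSRL: fat-hat = 6.8 + 0.97 * protein."""
    return 6.8 + 0.97 * protein_g

# A sandwich with 30 g of protein is predicted to have about 35.9 g of fat
print(round(predicted_fat(30), 1))  # 35.9
```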
Properties of the LSRL
The fact that the Sum of Squared Errors (SSE, the quantity least squares minimizes) is as small as possible means that for this line:
• The sum, and hence the mean, of the residuals is 0.
• The variation in the residuals is as small as possible.
• The line contains the point of averages (x̄, ȳ).
Assumptions and Conditions for using LSRL
• Quantitative Variables Condition
• Straight Enough Condition (if not satisfied, re-express the data)
• Outlier Condition (fit the line with and without the outlier?)
Residuals and LSRL
• Residuals should be used to check whether a linear model is appropriate and, beyond that, how well the calculated LSRL fits.
• Residuals are the part of the data that has not been modeled by our linear model.
Residuals and the LSRL
What to look for in a residual plot to satisfy the Straight Enough Condition: no patterns and no interesting features (like direction or shape); the plot should stretch horizontally with about the same scatter throughout, with no bends or outliers.
The distribution of the residuals should be symmetric if the original data are straight enough.
Looking at a scatterplot of the residuals vs. the x-values is a good way to check the Straight Enough Condition, which determines whether a linear model is appropriate.
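One numerical counterpart to the residual-plot check (a sketch on made-up data): for the least-squares line, the residuals are uncorrelated with x by construction, so any structure left in the residual plot shows up as curvature or changing spread, never as a leftover linear trend.

```python
# Residuals of the least-squares line are uncorrelated with x by construction;
# a healthy residual plot therefore shows only unstructured scatter around 0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 2.9, 4.1, 4.8, 6.2, 6.5]  # hypothetical data

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Covariance of the residuals with x is (numerically) zero
cov_rx = sum((x - x_bar) * r for x, r in zip(xs, residuals))
print(abs(cov_rx))  # essentially zero
```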
Residuals, again
When analyzing the relationship between two variables (thus far), ALWAYS:
• Plot the data and describe the relationship*
• Check the three regression assumptions/conditions: Quantitative Data, Straight Enough, Outlier
• Compute the correlation coefficient
• Compute the Least Squares Regression Line
• Check the residual plot (again)
• Interpret the relationship (intercept, slope, correlation, and general conclusion)
*Calculate the mean and standard deviation for each variable, if possible.
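The checklist above can be sketched end to end in a few lines. This is a minimal run-through on hypothetical exam-style data (the plotting and condition-checking steps are left as comments, since they are visual):

```python
import statistics

# Hypothetical paired data (e.g., two exam scores)
xs = [62.0, 68.0, 71.0, 75.0, 80.0, 84.0, 90.0]
ys = [65.0, 70.0, 69.0, 78.0, 79.0, 86.0, 91.0]

# 1. (Plot the data and check the three conditions -- done visually.)
#    Summary statistics for each variable:
x_bar, s_x = statistics.mean(xs), statistics.stdev(xs)
y_bar, s_y = statistics.mean(ys), statistics.stdev(ys)

# 2. Correlation coefficient (average product of z-scores)
n = len(xs)
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

# 3. Least-squares regression line
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# 4. Residuals for the residual-plot check (should sum to ~0, no pattern)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"r = {r:.3f}")
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
print(f"sum of residuals = {sum(residuals):.6f}")
```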
Exam_2 = 0.788 Exam_1 + 21.3; r² = 0.84

[Scatterplot of Exam_2 (40–100) vs. Exam_1 (60–100) with fitted line, plus residual plot (residuals roughly −4 to 8); Collection 1]
[Scatterplot of Exam_2 (40–100) vs. Exam_1 (60–100); Collection 1]
Exam_2 = 1.692 Exam_1 − 75.5; r² = 0.65

[Scatterplot of Exam_2 (40–100) vs. Exam_1 (60–90) with fitted line, plus residual plot (residuals roughly −20 to 20); Collection 1]
Final = 0.752 Midterm + 22.2; r² = 0.73

[Scatterplot of Final (70–100) vs. Midterm (65–100) with fitted line; histograms of Midterm and Final; Collection 1]

Midterm: mean = 83.466667, s = 7.4437574
Final: mean = 84.93333, s = 6.540715
Final = 0.752 Midterm + 22.2; r² = 0.73

[Scatterplot of Final vs. Midterm with fitted line, plus residual plot (residuals roughly −8 to 4); Collection 1]

Model of Collection 1: Simple Regression
Response attribute (numeric): Final
Predictor attribute (numeric): Midterm
Sample count: 15

Equation of least-squares regression line: Final = 0.752149 Midterm + 22.154
Correlation coefficient, r = 0.855994
r-squared = 0.73273, indicating that 73.273% of the variation in Final is accounted for by Midterm.

The best estimate for the slope is 0.752149 ± 0.272187 at a 95% confidence level. (The standard error of the slope is 0.125991.)

When Midterm = 0, the predicted value for a future observation of Final is 22.154 ± 24.0299.
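The ±0.272187 margin in this output is t*·SE, where t* is the 95% critical value of the t distribution with n − 2 = 13 degrees of freedom. A sketch reproducing it (SciPy is assumed available for the t critical value; the numbers are copied from the output above):

```python
from scipy import stats  # assumes SciPy is installed

n = 15                 # sample count from the regression output
se_slope = 0.125991    # standard error of the slope, from the output
b1 = 0.752149          # estimated slope, from the output

# 95% critical value from the t distribution with n - 2 degrees of freedom
t_star = stats.t.ppf(0.975, df=n - 2)
margin = t_star * se_slope

print(round(margin, 3))  # ~0.272, matching the +/- in the software output
print(f"95% CI for slope: ({b1 - margin:.3f}, {b1 + margin:.3f})")
```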