simple linear regression. g. baker, department of statistics university of south carolina; slide 2...
TRANSCRIPT
G. Baker, Department of Statistics, University of South Carolina; Slide 2
Relationship Between Two Quantitative Variables
If we can model the relationship between two quantitative variables, we can use one variable, X, to predict another variable, Y.
– Use height to predict weight.
– Use percentage of hardwood in pulp to predict the tensile strength of paper.
– Use square feet of warehouse space to predict monthly rental cost.
We use data to create the model.
– Observational study: height and weight example; square footage of warehouse space and cost example.
– Designed experiment: percentage of hardwood and tensile strength of paper example.
Simple Linear Regression
Simple: only one predictor variable.
Linear: straight-line relationship.
Regression: fit data to a (straight-line) model.
y: response or dependent variable.
x: predictor, regressor, or independent variable.
Use Scatter Plot to See Relationship
[Figure: scatter plot with axes labeled Predictor and Response.]
Absorbed Liquid Data
In a chemical process, batches of liquid are passed through a bed containing an ingredient that is absorbed by the liquid.
We will attempt to relate the absorbed percentage of the ingredient (y) to the amount of liquid in the batch (x).
Exercise 6.11 in text.
Abs% = -1822 + 435(Amt)
The regression line or model is deterministic.
We are going to use a probabilistic model, which accounts for the variation around the line.
Probabilistic Model
Probabilistic model: deterministic model plus an error component for unexplained variation.
Abs% = -1822 + 435(Amt) + random error
Regression Equation
y = deterministic model + random error
$$y = \beta_0 + \beta_1 x + \varepsilon$$
β0 = y-intercept
β1 = slope
ε = random error
Deterministic component:
$$E(y) = \beta_0 + \beta_1 x$$
Regression line is estimate of the mean value of y at a given value of x.
Estimating β0 and β1
Once we determine that a straight-line model is reasonable, we want to establish the best line by estimating β0 and β1.
µ = E(y) = β0 + β1x
β1 is the slope. It is the amount by which y will change with a unit increase in x.
β0 is the y-intercept. It is the expected (mean) value of y when x = 0. (This may or may not be meaningful.)
Abs% = -1822 + 435(Amt)
If Amount goes up by 1 unit, then the Absorb% is expected to go up by 435%.
If Amount = 0, the expected Absorb% = -1822%.
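The slope and intercept readings above can be checked with a minimal Python sketch of the fitted line. The function name `predict_abs_pct` is an illustrative choice, not something from the slides.

```python
def predict_abs_pct(amt):
    """Fitted line from the slide: Abs% = -1822 + 435(Amt)."""
    return -1822 + 435 * amt

# A one-unit increase in Amt changes the predicted Abs% by the slope, 435.
slope_effect = predict_abs_pct(6.0) - predict_abs_pct(5.0)

# At Amt = 0 the line returns the intercept, -1822.
intercept_value = predict_abs_pct(0.0)

print(slope_effect, intercept_value)
```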
Absorbed Liquid Data
Do not use the model at x values outside the range of the data. Here Amount = 0 lies outside the observed range, which is why the intercept of -1822% is not meaningful.
Errors of Prediction = Vertical Distance Between Points and Line
Method of Least Squares
Sum of prediction errors = 0.
Sum of the squared errors = Sum of Squares Error = SSE.
Many lines for which the sum of errors = 0.
Only one line for which SSE is minimized.
Least squares line = regression line = line for which SSE is minimized.
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \quad \text{or} \quad \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
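A short Python sketch can illustrate why a zero sum of errors is not enough to pick a line. The data below are hypothetical and chosen only for illustration. Every candidate line is forced through the point of means, so its errors sum to zero, yet only one of the candidates has the smallest SSE.

```python
# Hypothetical data, for illustration only (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

def errors(b0, b1):
    """Prediction errors y_i - (b0 + b1 * x_i) for a candidate line."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def sse(b0, b1):
    """Sum of squared prediction errors for a candidate line."""
    return sum(e ** 2 for e in errors(b0, b1))

# Any line through (x_bar, y_bar) has errors that sum to zero.
candidate_slopes = [0.0, 0.3, 0.6, 0.9, 1.2]
results = []
for b1 in candidate_slopes:
    b0 = y_bar - b1 * x_bar  # forces the line through the point of means
    results.append((b1, sum(errors(b0, b1)), sse(b0, b1)))

for b1, err_sum, sse_value in results:
    print(f"slope={b1:.1f}  sum of errors={err_sum:+.2e}  SSE={sse_value:.2f}")

# Among these candidates, the slope with the smallest SSE.
best_slope = min(results, key=lambda r: r[2])[0]
```

The candidate grid is coarse on purpose; the formulas for the exact minimizer do not need a search.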
Least Squares Estimates
Deviation of the ith point from its estimated value:
$$y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$
The sum of the squared deviations for all n points:
$$SSE = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$$
Values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSE are called the least squares estimates. They are minimum variance unbiased estimates.
Formulas for Least Squares Estimates
$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where
$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
$$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
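As a sketch of how these formulas are applied, the Python below computes the sums of squares in both their definitional and shortcut forms and then the two estimates. The data are hypothetical, not the absorbed liquid data.

```python
# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Definitional forms.
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

# Shortcut forms; these should agree with the definitional forms.
ss_xy_short = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
ss_xx_short = sum(xi ** 2 for xi in x) - n * x_bar ** 2

# Least squares estimates of the slope and intercept.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

print(f"SSxy={ss_xy}, SSxx={ss_xx}, slope={b1:.3f}, intercept={b0:.3f}")
```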
Estimate of Variance at Each x, $\sigma^2$
$$\hat{\sigma}^2 = s^2 = \frac{SSE}{df} = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} = MSE$$
s is the estimated standard error of the regression model.
$$\hat{\sigma} = s = \sqrt{MSE} = \text{Root MSE}$$
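A minimal Python sketch of this calculation, on hypothetical data: fit the line, sum the squared residuals, and divide by n - 2.

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit.
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# SSE, then MSE with n - 2 degrees of freedom (two parameters estimated).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
root_mse = math.sqrt(mse)  # s, the estimated standard error of the model

print(f"SSE={sse:.3f}, MSE={mse:.3f}, Root MSE={root_mse:.3f}")
```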
MSE and Root MSE
$$MSE = s^2, \qquad \text{Root MSE} = \sqrt{MSE} = s$$
Sampling Distribution of $\hat{\beta}_1$
[Figure: sampling distribution of $\hat{\beta}_1$, centered at β1.]
Standard error for $\hat{\beta}_1$:
$$s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}} = \sqrt{\frac{MSE}{SS_{xx}}}$$
Test of Model Utility
H0: β1 = 0
Ha: β1 ≠ 0
Test Statistic:
$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1 - 0}{s / \sqrt{SS_{xx}}}, \qquad df = n - 2$$
Confidence Interval:
$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \, \frac{s}{\sqrt{SS_{xx}}}$$
Amt and Absorb%
Predictor    Coef.    SE Coef.    T         P-value
Intercept    -1822    366         -4.978    <0.0001
Slope        435      60          7.25      <0.0001
H0: β1 = 0
Ha: β1 ≠ 0
Abs% = -1822 + 435(Amt)
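The T column in this output is each coefficient divided by its standard error. The Python sketch below reproduces it from the Coef. and SE Coef. values shown; the dictionary names are illustrative. The p-values come from the software and are not recomputed here.

```python
# Values copied from the regression output above.
coef = {"Intercept": -1822, "Slope": 435}
se_coef = {"Intercept": 366, "Slope": 60}

# t = (estimate - 0) / (standard error of the estimate)
t_stats = {name: coef[name] / se_coef[name] for name in coef}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
```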
Coefficient of Correlation
Correlation measures the linear relationship between two quantitative variables.
To get a visual picture, use a scatter plot.
To assign a numeric value: Pearson product moment coefficient of correlation, r.
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$$
r is unitless (scale-free) and varies from -1 to +1.
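A minimal Python sketch of the Pearson coefficient computed from the sums of squares, using hypothetical data:

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)

# r = SSxy / sqrt(SSxx * SSyy); the units of x and y cancel.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"r = {r:.4f}")
```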
[Figure: example scatter plots with r = +1, r = -1, r = .95, r = 0, and r = -.80.]
Coefficient of Determination
Coefficient of determination, $r^2$, measures the contribution of x in predicting y.
Recall:
$$SS_{yy} = \sum (y_i - \bar{y})^2, \qquad SSE = \sum (y_i - \hat{y}_i)^2$$
If x makes no contribution to prediction of y, then $\hat{y}_i = \bar{y}$, and $SSE = SS_{yy}$.
If x contributes to prediction of y, then we expect $SSE \ll SS_{yy}$.
– $SS_{yy}$ is total sample variation around $\bar{y}$.
– SSE is unexplained sample variability after fitting regression line.
Proportion of total sample variation explained by linear relationship:
$$\frac{\text{Explained Variability}}{\text{Total Variability}} = \frac{SS_{yy} - SSE}{SS_{yy}}$$
$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$$
= proportion of total sample variability around $\bar{y}$ that is explained by the linear relationship between y and x.
$r^2$ varies from 0 to 1.
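As a sketch on hypothetical data, the Python below computes the coefficient of determination from SSE and SSyy and confirms that it matches the square of the correlation coefficient.

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares fit and its unexplained variation.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Proportion of total variation explained by the line.
r2_from_sse = 1 - sse / ss_yy

# Square of the correlation coefficient, for comparison.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
r2_from_r = r ** 2

print(f"1 - SSE/SSyy = {r2_from_sse:.4f}, r squared = {r2_from_r:.4f}")
```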
Using Model for Estimation
Use model to estimate mean value of y, E[y], for specific value of x.
Solving regression equation for particular value of x gives point estimate for y at that value of x.
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
(1-α)100% Confidence Interval for E(y) at x = x_p
$\hat{y}$ is a statistic. It has a sampling distribution. Since we are operating under the normal assumption, the Confidence Interval = Pt. Est. ± $t_{\alpha/2}$ (Std Error of $\hat{y}$).
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$
where tα/2 has n-2 degrees of freedom.
Predict a New Individual y Value for a Given x.
Individual values have more variation than means. (1-α)100% Prediction Interval for Individual Value of y at x = x_p:
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$
where tα/2 has n-2 degrees of freedom.
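The two intervals differ only by the extra 1 under the square root. The Python sketch below computes both on hypothetical data. The critical value 3.182 is an assumption read from a t table for a 95% interval with n - 2 = 3 degrees of freedom, and the function name `intervals` is illustrative.

```python
import math

# Hypothetical data, for illustration only (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Assumed critical value: t_{0.025} with 3 df, taken from a t table.
T_CRIT = 3.182

def intervals(x_p, t_crit=T_CRIT):
    """Return (y_hat, CI for the mean of y, PI for an individual y) at x_p."""
    y_hat = b0 + b1 * x_p
    dist_term = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    half_ci = t_crit * s * math.sqrt(dist_term)      # interval for E(y)
    half_pi = t_crit * s * math.sqrt(1 + dist_term)  # interval for a new y
    ci = (y_hat - half_ci, y_hat + half_ci)
    pi = (y_hat - half_pi, y_hat + half_pi)
    return y_hat, ci, pi

y_hat_mid, ci_mid, pi_mid = intervals(3.0)     # at x_bar
y_hat_edge, ci_edge, pi_edge = intervals(5.0)  # at the edge of the data

print("at x = 3:", ci_mid, pi_mid)
print("at x = 5:", ci_edge, pi_edge)
```

Both intervals are narrowest at x_p equal to the mean of x and widen as x_p moves away from it, which gives the bands their curved shape.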
Confidence and Prediction Bands
[Figure: fitted line plot, Absorbed% = -1822 + 434.7 Amt, showing the regression line with 95% CI and 95% PI bands. Axes: Amt (4.5 to 7.5) and Absorbed% (0 to 2000). S = 204.898, R-Sq = 70.3%, R-Sq(adj) = 68.9%.]
Assumptions of a Regression Analysis
Assumptions involve distribution of errors.
– Actual errors: $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$
– Estimated errors (residuals): $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$
Use plots of residuals to check the assumptions.
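A minimal Python sketch of computing residuals from a least squares fit, on hypothetical data. These are the values that go on the vertical axis of the residual plots.

```python
# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates.
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Residuals: observed y minus fitted y.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for xi, fi, ei in zip(x, fitted, residuals):
    print(f"x={xi}  fitted={fi:.2f}  residual={ei:+.2f}")

# For a least squares line the residuals sum to zero (up to rounding).
residual_sum = sum(residuals)
```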
There are Four Assumptions
(1) The mean of the errors is 0 at each value of x.
[Figure: two plots of residuals (-4 to 4) against fitted values or x values (0 to 25), one labeled YES (assumption met) and one labeled NO (assumption violated).]
StatCrunch Plot of Residuals vs X Values
(2) Variance of errors is constant across all values of x.
[Figure: two plots of residuals (-4 to 4) against fitted values or x values (0 to 25), one labeled YES (assumption met) and one labeled NO (assumption violated).]
StatCrunch Plot of Residuals vs X Values
(3) Errors have normal distribution at each x.
[Figure: two plots of residuals against z-values (-4 to 4), one labeled YES (assumption met) and one labeled NO (assumption violated).]
StatCrunch QQ Plot of Residuals
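A QQ plot pairs the sorted residuals with the z-values expected under a normal distribution. The Python sketch below builds those pairs for hypothetical residuals. The plotting positions (i - 0.5)/n are one common convention and an assumption here; the slides do not say which convention StatCrunch uses.

```python
from statistics import NormalDist

# Hypothetical residuals, for illustration only.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
n = len(residuals)

# Sort the residuals from smallest to largest.
sorted_residuals = sorted(residuals)

# Theoretical z-values at plotting positions (i - 0.5) / n (assumed convention).
standard_normal = NormalDist()
z_values = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each pair is one point on the QQ plot: (z-value, residual).
qq_points = list(zip(z_values, sorted_residuals))

for z, e in qq_points:
    print(f"z={z:+.3f}  residual={e:+.2f}")
```

If the errors are normal, these points should fall close to a straight line.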
(4) Errors are independent (must know how the data were gathered).
[Figure: two plots of residuals (-4 to 4) against fitted values (0 to 25), annotated "By time, person, etc.", one labeled YES (assumption met) and one labeled NO (assumption violated).]