l11 regression
TRANSCRIPT
-
7/24/2019 L11 Regression
1/62
David ChowNov 2014
Chapter 13
Simple Linear Regression
-
7/24/2019 L11 Regression
2/62
Chap 13-22
Learning Objectives
Understand the simple linear regression model
Model assumptions
Meaning of the coefficients b0and b1
Predict the value of a dependent variable usingregression results
Make inferences about the coefficients
Estimate mean values and predict individual values
-
7/24/2019 L11 Regression
3/62
Overview
This PPT
Linear regression model
Estimate regression line
Measures of variation
Residual analysis
Statistical inference Testing & interval estimate
Excel file
Eg: House price Scatter plot
Reg: standard output Reg: residuals
Data for review question
Answer
3
-
7/24/2019 L11 Regression
4/62
Chap 13-4
What We Know
1. Equation of a straight line
2. Scatter plot
Linear / non-linear relationship
Strong / weak relationship
3. Correlation:____ relationshipbetween two variables
Correlation doesnt imply casual relationship
Eg: No. of firefighters and damage (in $)
4. Hypothesis testing
-
7/24/2019 L11 Regression
5/62
Chap 13-5Chap 13-5
Regression Analysis
Dependent variable (y): the variable we wish topredict or explain
Independent variable (x): the variable used to
predict or explain y
Regression analysisis used to:
Explainthe impact of changes in x on the dependent variable y
Predictthe value of y based on the value of x(s)
Linear Regression:
Only one x
Linear relationship between x and y
-
7/24/2019 L11 Regression
6/62
Y=___, X=___
Useful tip:A visual check BEFORErunningregression is always helpful!
Examples
-
7/24/2019 L11 Regression
7/62
Linear Regression Model
-
7/24/2019 L11 Regression
8/62Chap 13-8Chap 13-8
ii10i XY
Linear component
Linear Regression Model
PopulationY-intercept
PopulationSlopeCoefficient Random
Error termDependent
Variable
IndependentVariable
Random error component
Assumptions1. X and Y are linearly related2. 0 and 1are population parameters3. Given an X-value (say, Xi), Y= Yiis random, because of
4. Assume E()=0, so that E(Y)=____
-
7/24/2019 L11 Regression
9/62Chap 13-9Chap 13-9
Random Errorfor this Xivalue
Y
X
Observed Value
of Y for Xi
Predicted Valueof Y for Xi
ii10i XY
Xi
Slope = 1
Intercept = 0
i
Linear Regression Model
Population Regressionline: E(Y) = 0 +1 X
-
7/24/2019 L11 Regression
10/62Chap 13-10Chap 13-10
Model Assumptions: LINE
Linearity The relationship between X and Y is linear
Independence of errors
Error values are statistically independent Normality of error
Error values are normally distributed for any given value of X
Equal variance (also called homoscedasticity)
The probability distribution of the errors has constant variance
Assumption on error terms:~ N (0, 2)
See the appendix for a graphical presentation
-
7/24/2019 L11 Regression
11/62
Estimate the RegressionLine and Predict Y Values
-
7/24/2019 L11 Regression
12/62Chap 13-12Chap 13-12
i10i XbbY
The regression equation provides anestimateof the population regression line
Est. Regression Equation(Prediction Line)
Estimate ofintercept
Estimate of slopeEstimated (orpredicted) Y valuefor observation i
Value of X forobservation i
-
7/24/2019 L11 Regression
13/62Chap 13-13Chap 13-13
Least Squares Method(The Best Fitted Line)
b0and b1are obtained by finding the values of
that minimize the sum of the squared
differences between Y and :
2
i10i
2
ii ))Xb(b(Ymin)Y(Ymin
Y
The computations (to find b0and b1) are done by Excel
Formulae for b0and b1are in the appendix
-
7/24/2019 L11 Regression
14/62Chap 13-14Chap 13-14
b0is the estimated mean value of
Y when the value of X is zero
b1is the estimated changein the
mean value of Y as a result of a
one-unit increase in X
Interpreting b0and b1
-
7/24/2019 L11 Regression
15/62Chap 13-15Chap 13-15
Eg: House Price(in Excel File)
A real estate agent wishes to examine the relationshipbetween the selling price of a home and its size(measured in square feet)
A random sample of 10 houses is selected
Dependent var (Y) = house price (in $1000s)
Independent var (X) = square feet
-
7/24/2019 L11 Regression
16/62Chap 13-16Chap 13-16
Eg: House PriceData & Scatter Plot
House Price in$1000s (Y)
Square Feet(X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550405 2350
324 2450
319 1425
255 1700
0
50
100
150
200
250
300
350
400
450
0 500 1000 1500 2000 2500 3000
HousePrice($1000s)
Square Feet
-
7/24/2019 L11 Regression
17/62Chap 13-17Chap 13-17
Using Excel
1. Choose Data 2. Choose Data Analysis
3. Choose Regression
4. Input data range and output options
-
7/24/2019 L11 Regression
18/62Chap 13-18Chap 13-18
Using Excel: Regression Output
Regress ion Statis t ics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10
ANOVAd f SS MS F Signi f icance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Coeff ic ients Standard Error t Stat P-value Low er 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
The regression equation is:
feet)(square0.1097798.24833pricehouse
-
7/24/2019 L11 Regression
19/62Chap 13-19Chap 13-19
0
50
100
150
200
250
300
350
400
450
0 500 1000 1500 2000 2500 3000
Square Feet
House
Price
($1000s
)
Eg: House Price
House price model: Scatter Plot and Prediction Line
feet)(square0.1097798.24833pricehouse
Slope= 0.10977
Intercept= 98.248
-
7/24/2019 L11 Regression
20/62Chap 13-20Chap 13-20
Eg: Interpreting bo
bois the estimated mean value of Y when the value of Xis zero (if X = 0 is in the range of observed X values)
Because a house cannot have a square footage of 0, bo
has no practical application
Generally speaking, intercept
bois not our focus in regression
feet)(square0.1097798.24833pricehouse
-
7/24/2019 L11 Regression
21/62Chap 13-21Chap 13-21
Eg: Interpreting b1
b1estimates the change in the mean value of Yas a result of a one-unit increase in X
Here, b1= 0.10977 tells us that the mean value of a house
increases by 0.10977($1000) = $109.77, on average, for each
additional square foot of size
feet)(square0.1097798.24833pricehouse
-
7/24/2019 L11 Regression
22/62Chap 13-22Chap 13-22
317.78
00)0.10977(2098.24833(sq.ft.)0.1097798.24833pricehouse
Predict the price for a house with 2000 sq feet:
The predicted price for a house with 2000square feet is 317.78($1,000s) = $317,780
Eg: Making Predictions
-
7/24/2019 L11 Regression
23/62
Chap 13-23Chap 13-23
0
50
100150
200
250
300
350
400
450
0 500 1000 1500 2000 2500 3000
Square Feet
House
Price
($1000s)
Making Predictions
General rule:Predict only within the relevant range of Xs
Relevant range forinterpolation
Do not try to extrapolatebeyond the range of
observed Xs
-
7/24/2019 L11 Regression
24/62
Measures of Variation
and r2
-
7/24/2019 L11 Regression
25/62
Chap 13-25Chap 13-25
Measures of Variation
Total variation is made up of two parts:
SSESSRSST
Total Sum ofSquares
Regression Sumof Squares
Error Sum ofSquares
2
i )YY(SST
2
ii
)YY(SSE 2
i )YY(SSR
where:
= Mean value of the dependent variable
Yi= Observed value of the dependent variable
= Predicted value of Y for the given XivalueiY
Y
-
7/24/2019 L11 Regression
26/62
Chap 13-26Chap 13-26
SST = total sum of squares (Total Variation)
Measures the variation of the Yivalues around theirmean Y
SSR = regression sum of squares (Explained Variation)
Variation attributable to the relationship between Xand Y
SSE = error sum of squares (Unexplained Variation) Variation in Y attributable to factors other than X
Measures of Variation
-
7/24/2019 L11 Regression
27/62
Chap 13-27Chap 13-27
Xi
Y
X
Yi
SST=(Yi-Y)2
SSE=
(Yi-Yi )2
SSR =
(Yi-Y)2
_
_
_
Y
Y
Y
_Y
Measures of Variation
-
7/24/2019 L11 Regression
28/62
Chap 13-28Chap 13-28
The coefficient of determinationis the portion of thetotal variation in Y that is explained by variation in X
The coefficient of determination is also called r-squaredand is denoted as r2
Coefficient of Determination, r2
1r0 2
squaresofsumtotal
squaresofsumregression2
SST
SSRr
-
7/24/2019 L11 Regression
29/62
Chap 13-29Chap 13-29
Examples of r2
Y
X
Y
X
0 < r2< 1
Weaker linear relationshipsbetween X and Y:
Some but not all of the variation inY is explained by variation in X
Q: Graph for r2 = 1?
-
7/24/2019 L11 Regression
30/62
Chap 13-30Chap 13-30
Measures of Variation and r2in ExcelHouse Price Again
Regress ion Statis t ics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10
ANOVAd f SS MS F Signi f icance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Coeff ic ients Standard Error t Stat P-value Low er 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
58.08% of the variation in house pricesis explained by variation in sq feet
0.580832600.5000
18934.9348
SST
SSRr2
-
7/24/2019 L11 Regression
31/62
Chap 13-31Chap 13-31
Standard Error of Estimate
The standard deviation of the variation of observationsaround the regression line is estimated by
2
)(
2
1
2
n
YY
n
SSES
n
i
ii
YX
whereSSE = error sum of squares
n = sample size
-
7/24/2019 L11 Regression
32/62
Chap 13-32Chap 13-32
Standard Error of Estimate in ExcelHouse Price Again
Regress ion Statis t ics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10
ANOVAd f SS MS F Signi f icance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Coeff ic ients Standard Error t Stat P-value Low er 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
41.33032SYX
-
7/24/2019 L11 Regression
33/62
Chap 13-33Chap 13-33
Comparing Standard Errors
YY
X X
YXSsmall
YXSlarge
SYXis a measure of the variation of observedY values from the regression line
SYXcarries the same unit as Y
It should be judged relative to the size of the Y values in thesample data
Eg: SYX= $41.33K ismoderately small relative to house
prices in the $200K - $400K range
-
7/24/2019 L11 Regression
34/62
Years of employment and salary ($1000) in 7 subjects:
Years Salary6.6 32
7.4 42
8.8 52
9.7 61
10.5 62
10.7 66
11.8 65
Example: Salary Data
-
7/24/2019 L11 Regression
35/62
Residual Analysis
(Autocorrelation is NOT covered)
-
7/24/2019 L11 Regression
36/62
Chap 13-36Chap 13-36
Residual Analysis
The residual for observation i, ei, is the differencebetween its observed and predicted value
Recall the regression assumptions
L: X and Y linearly related?
I: Errors statistically independent?
N: Errors normally distributed? E: Errors have constant variance (homoscedasticity)?
A residual plot (residuals vs X)is very useful in
checking the assumptions
iii YYe
-
7/24/2019 L11 Regression
37/62
Chap 13-37Chap 13-37
L: Linearity
Not Linear Linear
x
residua
ls
x
Y
x
Y
x
residua
ls
-
7/24/2019 L11 Regression
38/62
Chap 13-38Chap 13-38
I: Independence
Not Independent
Independent
X
Xresiduals
residuals
X
residuals
-
7/24/2019 L11 Regression
39/62
Chap 13-39Chap 13-39
N: Normality
N: Are the errors normally distributed?
Many ways to check for normality
Stem-and-Leaf
Histogram
Normal Probability Plot, etc.
My recommendation is always ___
-
7/24/2019 L11 Regression
40/62
Chap 13-40Chap 13-40
E: Equal Variance
Non-constant variance Constant variance
x x
Y
x x
Y
residua
ls
residua
ls
A R id l Pl t b E l
-
7/24/2019 L11 Regression
41/62
Chap 13-41Chap 13-41
House Price Model Residual Plot
-60
-40
-20
0
20
40
60
80
0 1000 2000 3000
Square Feet
Residuals
A Residual Plot by ExcelHouse Price Again
RESIDUAL OUTPUT
Predicted
House Price Residuals
1 251.92316 -6.923162
2 273.87671 38.123293 284.85348 -5.853484
4 304.06284 3.937162
5 218.99284 -19.99284
6 268.38832 -49.38832
7 356.20251 48.79749
8 367.17929 -43.17929
9 254.6674 64.33264
10 284.85348 -29.85348
Key: Any patternin the residual plot?
NOTE: Autocorrelation is NOTcoveredin this course
-
7/24/2019 L11 Regression
42/62
Testing for Significance
and Interval Estimate
-
7/24/2019 L11 Regression
43/62
Chap 13-43Chap 13-43
Inferences About the Slope
The standard error of the regression slopecoefficient (b1) is estimated by
2
i
YXYXb
)X(X
S
SSX
SS1
where:
= Estimate of the standard error of the slope
= Standard error of the estimate
1bS
2n
SSES
YX
-
7/24/2019 L11 Regression
44/62
Chap 13-44Chap 13-44
t Test for the Slope
Typical question: X and Y linearly related? H0: 1= 0 (no linear relationship)
H1: 10 (linear relationship does exist)
Test statistic
1b
11STAT
S
bt
2nd.f.
where:
b1= regression slope
coefficient1= hypothesized slope
Sb1= standarderror of the slope
I f Ab t th Sl i E l
-
7/24/2019 L11 Regression
45/62
Chap 13-45Chap 13-45
Inferences About the Slope in ExcelHouse Price Again
Want to know if ____
H0: 1= 0; H1: 1 0
Next, how to test if slopeequals a particular value?
Coeff ic ients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892Square Feet 0.10977 0.03297 3.32938 0.01039
1bSb1
329383032970
0109770
S
bt
1b
11
STAT ..
.
-
7/24/2019 L11 Regression
46/62
Chap 13-46Chap 13-46
Inferences About the Slope in ExcelHouse Price Again
Test Statistic: tSTAT= 3.329
Critical values?
a 0.05
d.f. 10 2 8
Decision: Reject H0
Reject H0Reject H0
a/2=.025
-t/2Do not reject H0
0t/2
a/2=.025
-2.3060 2.3060 3.329
-
7/24/2019 L11 Regression
47/62
Chap 13-47Chap 13-47
Coeff ic ients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039
p-value
Decision: Reject H0, since p-value <
Inferences About the Slope in ExcelHouse Price Again
The p-value Approach
f S f
-
7/24/2019 L11 Regression
48/62
Chap 13-48Chap 13-48
F Test for Model Significance
F Test statistic:
where
MSE
MSRFSTAT
1kn
SSEMSE
k
SSRMSR
FSTATfollows an F distribution Numerator d.f. = k Denominator d.f. = n-k-1 k = the number of independent variables in the regression model
H0: 1= 0; H1: 1 0
F T t f M d l Si ifi
-
7/24/2019 L11 Regression
49/62
Chap 13-49Chap 13-49
F-Test for Model SignificanceHouse Price Again
Regress ion Statis t ics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10
ANOVAdf SS MS F Signi f icance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
11.08481708.1957
18934.9348
MSE
MSRFSTAT
With 1 and 8 degreesof freedom
p-value forthe F-Test
-
7/24/2019 L11 Regression
50/62
Chap 13-50Chap 13-50
H0: 1= 0 H1: 1 0
a= .05
df1= 1 df2= 8
Test Statistic:
Decision:
Reject H0at = 0.05
0
a= .05
F.05 = 5.32Reject H0Do not
reject H0
11.08FSTAT
MSE
MSR
Critical Value:
F
= 5.32
F Test for Significance
F
-
7/24/2019 L11 Regression
51/62
Chap 13-51Chap 13-51
Interval Estimate for the SlopeHouse Price Again
Excel Printout
At 95% level of confidence, the confidence interval for
the slope is (0.0337, 0.1858) we are 95% confident that the average impact on sales
price is between $33.74 and $185.80 per sq foot of size
Same conclusion as t-test?
1b2/1
Sb
t
Coeff ic ients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
d.f. = n - 2
-
7/24/2019 L11 Regression
52/62
Chap 13-52Chap 13-52
t Test for a Correlation Coefficient
Hypotheses
H0: = 0 (no correlation between X and Y)
H1: 0 (correlation exists)
Test statistic
(with n2 degrees of freedom)
2n
r1
-rt
2STAT
0bifrr
0bifrr
where
1
2
1
2
-
7/24/2019 L11 Regression
53/62
Chap 13-53Chap 13-53
t-test For A Correlation Coefficient
Is there evidence of a linear relationship between squarefeet and house price at the .05 level of significance?
H0: = 0 (No correlation)
H1: 0 (correlation exists)a=.05 , df=10 - 2 = 8
3.329
210.7621
0.762
2nr1
rt
22STAT
Critical Value =Decision:
E ti ti M V l d
-
7/24/2019 L11 Regression
54/62
Chap 13-54Chap 13-54
Estimating Mean Values andPredicting Individual Values
Y
XXi
Y = b0+b1Xi
ConfidenceInterval for
the meanofY, given Xi
Prediction Interval
for an individualY,given X
i
Goal: Form intervals around Y to expressuncertainty about the value of Y for a given Xi
Y
S t d P d f
-
7/24/2019 L11 Regression
55/62
Chap 13-55Chap 13-55
Suggested Procedures for aRegression Analysis
1. Start with a scatter plot to observe possiblerelationship
2. Run regression
3. Perform residual analysis Plot the residuals vs. X to check for assumptions such as
linearity & homoscedasticity
Use a histogram to see if the residuals are normallydistributed
S ggested Proced res for a
-
7/24/2019 L11 Regression
56/62
Chap 13-56Chap 13-56
Suggested Procedures for aRegression Analysis
4. If there is violation of any assumption, use alternativemethods or models
5. If there is no evidence of assumption violation, then
test for the significance of the regression coefficientsand construct confidence intervals and predictionintervals
6. Avoid making predictions or forecasts outside therelevant range
-
7/24/2019 L11 Regression
57/62
1. In a regression analysis, the error term is a randomvariable with an expected value of ____
2. In a regression analysis, if r2= 1, SST = ____
3. A regression analysis between sales (in $1000) andadvertising (in $100) resulted in the following equation:
Y-hat = 500 + 4X
a) How to interpret b1?b) Based on the above equation, if advertising is $10,000,
the point estimate for sales (in dollars) is ____
Review Questions 1-3
-
7/24/2019 L11 Regression
58/62
4. Shown below is a portion of an Excel printout for a regressionanalysis relating Y (quantity demanded) and X (unit price).
a) Perform a t test to determine if Y and X are related. Let = 0.05
b) Compute R2and interpret its meaning
Review Question 4
ANOVA
df
SS
Regression 1 5048.818
Residual 46 3132.661
Total 47 8181.479
Coefficients Standard Er ror
Intercept 80.390 3.102
X -2.137 0.248
-
7/24/2019 L11 Regression
59/62
A regression model is developed to relate the price per personand the sum of the ratings for food, decor, and service
1. Find r 2
2. Calculate and Interpret the regression coefficients
3. At the 0.05 level of significance, is there evidence of a linearrelationship between the price per person and the summated
rating?
4. Construct a 95% confidence interval for the slope
5. Evaluate the usefulness of this model
Review Question 5:
Restaurant Ratings
-
7/24/2019 L11 Regression
60/62
Appendix
Figure: Regression Model Assumptions
-
7/24/2019 L11 Regression
61/62
Figure: Regression Model Assumptions
S h d
-
7/24/2019 L11 Regression
62/62
1 2
( )( )
( )
i i
i
x x y yb
x x
Least Squares Method(Formulae for b0and b1)
0 1b y b x