Regression Analysis Chapter 10

Upload: joelle-sidell

Post on 29-Mar-2015


Page 1: Regression Analysis Chapter 10. 2 Regression and Correlation Techniques that are used to establish whether there is a mathematical relationship between

Regression Analysis

Chapter 10


Regression and Correlation

Techniques that are used to establish whether there is a mathematical relationship between two or more variables, so that the behavior of one variable can be used to predict the behavior of others. Applicable to “Variables” data only.

• “Regression” provides a functional relationship (Y=f(x)) between the variables; the function represents the “average” relationship.

• “Correlation” tells us the direction and the strength of the relationship.

The analysis starts with a Scatter Plot of Y vs X.


Simple Linear Regression

What is it? Determines whether Y depends on X and provides a mathematical equation for the relationship (continuous data).

Examples:

• Process conditions and product properties

• Sales and advertising budget

[Figure: points plotted on x and y axes with candidate lines. Does Y depend on X? Which line is correct?]


Simple Linear Regression

A simple linear relationship can be described mathematically by

Y = mX + b

m = slope = rise / run

b = Y intercept = the Y value at the point where the line intersects the Y axis.

[Figure: straight line on X and Y axes, with the intercept b, the rise and the run marked.]


Simple Linear Regression

[Figure: straight line on X and Y axes (ticks at 0, 5 and 10), with the rise and the run marked.]

slope = rise / run = (6 - 3) / (10 - 4) = 1/2

intercept = 1

Y = 0.5X + 1
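The slope and intercept arithmetic above can be checked with a few lines of code. This is a minimal sketch in Python (an assumption; the slides themselves use EXCEL). The two points (4, 3) and (10, 6) are read off the rise and run arithmetic, and the function name line_through is hypothetical.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    slope = (y2 - y1) / (x2 - x1)   # rise / run
    intercept = y1 - slope * x1     # value of Y where the line crosses X = 0
    return slope, intercept


# Points inferred from the slide arithmetic (6 - 3) / (10 - 4).
m, b = line_through((4, 3), (10, 6))
print(f"Y = {m}X + {b}")
```

The result agrees with the slope of 1/2 and the intercept of 1 worked out by hand.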


Simple regression example

An agent for a residential real estate company in a large city would like to predict the monthly rental cost for apartments based on the size of the apartment, as defined by square footage. A sample of 25 apartments in a particular residential neighborhood was selected to gather the information.


Size Rent

850 950

1450 1600

1085 1200

1232 1500

718 950

1485 1700

1136 1650

726 935

700 875

956 1150

1100 1400

1285 1650

1985 2300

1369 1800

1175 1400

1225 1450

1245 1100

1259 1700

1150 1200

896 1150

1361 1600

1040 1650

755 1200

1000 800

1200 1750

The data on size and rent for the 25 apartments will be analyzed in EXCEL.


Scatter plot

[Figure: scatter plot of Rent (vertical axis, 500 to 2500) against Size (horizontal axis, 500 to 2100).]

The scatter plot suggests that there is a ‘linear’ relationship between Rent and Size.


Interpreting EXCEL output

Regression Equation

Rent = 177.121+1.065*Size

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.85
R Square             0.72
Adjusted R Square    0.71
Standard Error       194.60
Observations         25

ANOVA
              df    SS             MS             F              Significance F
Regression    1     2268776.545    2268776.545    59.91376452    7.51833E-08
Residual      23    870949.4547    37867.3676
Total         24    3139726

              Coefficients    Standard Error    t Stat    P-value        Lower 95%    Upper 95%
Intercept     177.121         161.004           1.100     0.282669853    -155.942     510.184
Size          1.065           0.138             7.740     7.51833E-08    0.780        1.350
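The Intercept and Size coefficients in the output can be reproduced from the 25 (Size, Rent) pairs with the usual least squares formulas. The following is a sketch in plain Python rather than EXCEL (an assumption); the function name fit_line is hypothetical.

```python
# The 25 (Size, Rent) observations from the data table.
sizes = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285, 1985,
         1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755, 1000, 1200]
rents = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650, 2300,
         1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200, 800, 1750]


def fit_line(xs, ys):
    """Least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross products and of squares around the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(sizes, rents)
print(f"Rent = {intercept:.3f} + {slope:.3f} * Size")
```

The least squares line always passes through the point of means, which is why the intercept is computed from the two averages once the slope is known.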


Interpretation of the regression coefficient

What does the coefficient of Size mean?

For every additional square foot, predicted Rent goes up by $1.065 on average.


Using regression for prediction

Predict monthly rent when apartment size is 1000 square feet:

Regression Equation: Rent = 177.121+1.065*Size

Thus, when Size=1000

Rent=177.121+1.065*1000=$1242 (rounded)


Using regression for prediction: Caution!

The regression equation is valid only over the range over which it was estimated. We should interpolate only.

Do not use the equation to predict Y when the X value is not within the range of the data used to develop the equation. Extrapolation can be risky.

Thus, we should not use the equation to predict rent for an apartment whose size is 500 square feet, since this value is outside the range of Size values (700 to 1985 square feet) used to create the regression equation.
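One way to respect this caution in practice is to build the range check into the prediction step. The sketch below is in Python (an assumption, since the slides use EXCEL). It uses the fitted equation and the smallest and largest Size values in the data table; the name predict_rent_checked is hypothetical.

```python
INTERCEPT = 177.121   # from the regression equation
SLOPE = 1.065         # dollars of rent per square foot
SIZE_MIN = 700        # smallest Size in the sample of 25 apartments
SIZE_MAX = 1985       # largest Size in the sample


def predict_rent_checked(size_sqft):
    """Predict rent, refusing to extrapolate outside the sampled sizes."""
    if not SIZE_MIN <= size_sqft <= SIZE_MAX:
        raise ValueError(
            f"Size {size_sqft} is outside the fitted range "
            f"{SIZE_MIN} to {SIZE_MAX}; prediction would be extrapolation"
        )
    return INTERCEPT + SLOPE * size_sqft


for size in (1000, 500):
    try:
        print(size, round(predict_rent_checked(size)))
    except ValueError as err:
        print(size, "refused:", err)
```

Raising an error is only one possible design; returning the prediction together with a warning flag would serve the same purpose.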


Why extrapolation is risky

[Figure: sample data, the true relationship and the extrapolated relationship, with X values 2.5 and 4.0 marked.]

In this figure, we fit our regression model using sample data, but the linear relation implicit in our regression model does not hold outside our sample. By extrapolating, we make erroneous estimates.


Correlation (r)

The “correlation coefficient”, r, is a measure of the strength and the direction of the linear relationship between two variables. Values of r range from +1 (perfect direct relationship), through 0 (no linear relationship), to –1 (perfect inverse relationship). It reflects the degree of scatter of the points around the “Least Squares” regression line.


Coefficient of correlation from EXCEL

The sign of r is the same as that of the coefficient of X (Size) in the regression equation (in our case the sign is positive). Also, if you look at the scatter plot, you will note that the sign should be positive.

R=0.85 suggests a fairly ‘strong’ correlation between size and rent.
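The value of r can also be computed directly from the 25 (Size, Rent) pairs. This is a sketch in plain Python (an assumption; the slides read the value from EXCEL) using the standard Pearson formula. The function name pearson_r is hypothetical.

```python
import math

# The 25 (Size, Rent) observations from the data table.
sizes = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285, 1985,
         1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755, 1000, 1200]
rents = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650, 2300,
         1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200, 800, 1750]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(sizes, rents)
print(f"r = {r:.2f}")
```

The numerator sxy is the same quantity that determines the sign of the regression slope, which is why r and the coefficient of Size share a sign.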


Coefficient of determination (r2)

The “Coefficient of Determination”, r-squared (sometimes written R-squared), is the proportion of the variation in Y that is attributable to variation in X.


Getting r2 from EXCEL

It is important to remember that r-squared is never negative: it is the square of the coefficient of correlation r, so it lies between 0 and 1. In our case, r2=0.72 suggests that 72% of the variation in Rent is explained by the variation in Size. The higher the value of r2, the better the simple regression model fits the data.
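The same r2 can be recovered from the sums of squares in the ANOVA part of the EXCEL output, as the share of the total variation that the regression accounts for. A small sketch in Python (an assumption); the variable names are illustrative only.

```python
import math

# Sums of squares copied from the ANOVA table in the EXCEL output.
ss_regression = 2268776.545   # variation in Rent explained by Size
ss_residual = 870949.4547     # variation left unexplained
ss_total = ss_regression + ss_residual

r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)   # matches "Multiple R" for simple regression

print(f"R Square = {r_squared:.2f}, Multiple R = {multiple_r:.2f}")
```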


Standard error (SE)

Standard error measures the variability or scatter of the observed values around the regression line.

[Figure: plot of Rent ($) against Size (square feet).]


Getting the standard error (SE) from EXCEL

In our example, the standard error associated with estimating rent is $194.60.
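The $194.60 figure can be traced back to the Residual row of the ANOVA table: it is the square root of the residual sum of squares divided by its degrees of freedom. A minimal sketch in Python (an assumption); the variable names are illustrative only.

```python
import math

ss_residual = 870949.4547   # Residual SS from the ANOVA table
n_obs = 25                  # Observations
df_residual = n_obs - 2     # two estimated parameters: intercept and slope

ms_residual = ss_residual / df_residual   # the Residual MS, 37867.3676
standard_error = math.sqrt(ms_residual)

print(f"Standard Error = {standard_error:.2f}")
```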


Is the simple regression model statistically valid?

It is important to test whether the regression model developed from sample data is statistically valid.

For simple regression, we can use 2 approaches to test whether the coefficient of X is equal to zero:

1. using t-test

2. using ANOVA


Is the coefficient of X equal to zero?

In both cases, the hypothesis we test is:

H0: Slope = 0

H1: Slope ≠ 0

What could we say about the linear relationship between X and Y if the slope were zero?


Using coefficient information for testing if slope=0

t-stat=7.740 and P-value=7.52E-08. The P-value is very small. If it is smaller than our α level, we reject the null hypothesis; otherwise we do not. If α=0.05, we reject the null and conclude that the slope is not zero. The same result holds at α=0.01 because the P-value is smaller than 0.01. Thus, at the 0.05 (or 0.01) level, we conclude that the slope is NOT zero, implying that our model is statistically valid.

P-value = 7.52E-08 = 7.52*10^-8 = 0.0000000752
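The t statistic can be recomputed from the 25 (Size, Rent) pairs as the slope divided by its standard error. The sketch below is in plain Python (an assumption). Because the standard library has no Student's t distribution, it does not compute the P-value; it compares the statistic with the two-sided 5% critical value for 23 degrees of freedom, 2.069, taken from a t table. The variable names are illustrative only.

```python
import math

# The 25 (Size, Rent) observations from the data table.
sizes = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285, 1985,
         1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755, 1000, 1200]
rents = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650, 2300,
         1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200, 800, 1750]

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(rents) / n
sxx = sum((x - x_bar) ** 2 for x in sizes)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, rents))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual sum of squares and the standard error of the regression.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(sizes, rents))
std_err = math.sqrt(sse / (n - 2))

# Standard error of the slope, then the t statistic for H0: Slope = 0.
se_slope = std_err / math.sqrt(sxx)
t_stat = slope / se_slope

T_CRITICAL = 2.069   # table value: two-sided alpha = 0.05, 23 degrees of freedom
print(f"t = {t_stat:.3f}, reject H0 at 0.05: {abs(t_stat) > T_CRITICAL}")
```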


Using ANOVA for testing if slope=0 in EXCEL

F=59.91376 and P-value=7.51833E-08. The P-value is again very small. If it is smaller than our α level, we reject the null hypothesis; otherwise we do not. Thus, at the 0.05 (or 0.01) level, the slope is NOT zero, implying that our model is statistically valid. This is the same conclusion we reached using the t-test.
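The F statistic is the ratio of the Regression and Residual mean squares, and in simple regression it equals the square of the slope's t statistic, which is why the two tests share one P-value. A short sketch in Python (an assumption) using the values printed in the EXCEL output; the variable names are illustrative only.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ms_regression = 2268776.545 / 1    # Regression SS / df
ms_residual = 870949.4547 / 23     # Residual SS / df

f_stat = ms_regression / ms_residual

t_stat = 7.740   # t Stat for Size, as printed (rounded) in the output
print(f"F = {f_stat:.2f}, t squared = {t_stat ** 2:.2f}")
```

The small difference between F and t squared in the unrounded values comes only from the rounding of the printed t statistic.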


Confidence interval for the slope of Size

The 95% CI tells us that we can be 95% confident that, for every 1 square foot increase in apartment Size, average Rent increases by between $0.78 and $1.35.
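The interval is the slope estimate plus or minus a t multiple of its standard error. The sketch below is in Python (an assumption) and uses the rounded coefficient and standard error from the EXCEL output, with 2.069 as the two-sided 5% critical value for 23 degrees of freedom, taken from a t table. The variable names are illustrative only.

```python
slope = 1.065        # coefficient of Size, as printed
se_slope = 0.138     # its standard error, as printed
T_CRITICAL = 2.069   # table value: two-sided alpha = 0.05, 23 degrees of freedom

margin = T_CRITICAL * se_slope
lower = slope - margin
upper = slope + margin

print(f"95% CI for the slope: {lower:.2f} to {upper:.2f}")
```

Because the inputs are rounded to three decimals, the limits agree with the Lower 95% and Upper 95% columns only to about two decimal places.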


Summary

• Simple regression is a statistical tool that attempts to fit a straight line relationship between X (independent variable) and Y (dependent variable).

• The scatter plot gives us a visual clue about the nature of the relationship between X and Y.

• EXCEL or other statistical software is used to ‘fit’ the model; a good model will be statistically valid and will have a reasonably high R-squared value.

• A good model is then used to make predictions; when making predictions, be sure to confine them within the domain of X’s used to fit the model (i.e. interpolate); we should avoid extrapolation.