
Upload: donna-dennis

Post on 16-Dec-2015


Simple Linear Regression

G. Baker, Department of Statistics, University of South Carolina

Relationship Between Two Quantitative Variables

If we can model the relationship between two quantitative variables, we can use one variable, X, to predict another variable, Y.
– Use height to predict weight.
– Use percentage of hardwood in pulp to predict the tensile strength of paper.
– Use square feet of warehouse space to predict monthly rental cost.

Relationship Between Two Quantitative Variables

We use data to create the model.
– Observational study
  Height and weight example.
  Square footage of warehouse space and cost example.
– Designed experiment
  Percentage of hardwood and tensile strength of paper example.

Simple Linear Regression

Simple: only one predictor variable.
Linear: straight-line relationship.
Regression: fit data to a (straight-line) model.

[Figure: axes labeled y (response or dependent variable) and x (predictor, regressor, or independent variable).]

Use Scatter Plot to See Relationship

[Figure: scatter plot with axes labeled Predictor and Response.]

Absorbed Liquid Data

In a chemical process, batches of liquid are passed through a bed containing an ingredient that is absorbed by the liquid.

We will attempt to relate the absorbed percentage of the ingredient (y) to the amount of liquid in the batch (x).

Exercise 6.11 in text.


Abs% = -1822 + 435(Amt)

The regression line or model is deterministic.

We are going to use a probabilistic model, which accounts for the variation around the line.

Probabilistic Model

Probabilistic model: deterministic model plus an error component for unexplained variation.

Abs% = -1822 + 435(Amt) + random error

Regression Equation

y = deterministic model + random error

$$y = \beta_0 + \beta_1 x + \varepsilon$$

β0 = y-intercept
β1 = slope
ε = random error

The deterministic component is E(y):

$$E(y) = \beta_0 + \beta_1 x$$

The regression line is an estimate of the mean value of y at a given value of x.
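As a minimal sketch of the probabilistic model, the Python snippet below simulates many y values at a single x and compares their average with E(y). The parameter values (β0 = 2, β1 = 0.5, σ = 1), the normal error distribution, and the function names are assumptions chosen for illustration; none of them come from the slides' data.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0


def mean_response(x):
    """Deterministic component: E(y) = beta0 + beta1 * x."""
    return BETA0 + BETA1 * x


def simulate_y(x, rng):
    """One draw from y = beta0 + beta1 * x + eps, with eps ~ Normal(0, sigma)."""
    return mean_response(x) + rng.gauss(0.0, SIGMA)


rng = random.Random(0)  # fixed seed so the run is repeatable
x = 4.0
draws = [simulate_y(x, rng) for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)

print(f"E(y) at x = {x}: {mean_response(x):.3f}")
print(f"average of simulated y values: {sample_mean:.3f}")
```

Individual draws scatter around the line, but their average settles near the deterministic component, which is what the regression line estimates.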


Estimating β0 and β1

Once we determine that a straight-line model is reasonable, we want to establish the best line by estimating β0 and β1.

µ = E(y) = β0 + β1x

β1 is the slope. It is the amount by which y will change with a unit increase in x.

β0 is the y-intercept. It is the expected (mean) value of y when x = 0. (This may or may not be meaningful.)

Abs% = -1822 + 435(Amt)

If Amount goes up by 1 unit, then the Absorb% is expected to go up by 435%.

If Amount = 0, the expected Absorb% = -1822%.

Absorbed Liquid Data

Do not consider x values outside the range of the data.
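The interpretation of the slope and the warning about extrapolation can be sketched in a few lines of Python. The coefficients are the ones on the slides; the range limits are an assumption, because the observed Amt values are not listed in this transcript (4.5 to 7.5 is only the span of the axis ticks on the fitted line plot later in the deck). The function name is hypothetical.

```python
# Fitted line from the slides: Abs% = -1822 + 435 (Amt).
B0, B1 = -1822.0, 435.0

# ASSUMPTION: the observed Amt range is not stated in the slides;
# 4.5 to 7.5 is only the span of the axis ticks on the fitted line plot.
AMT_MIN, AMT_MAX = 4.5, 7.5


def predict_abs_pct(amt):
    """Return the fitted Abs% for amt, refusing to extrapolate."""
    if not AMT_MIN <= amt <= AMT_MAX:
        raise ValueError(f"Amt = {amt} is outside the assumed data range")
    return B0 + B1 * amt


# A one-unit increase in Amt changes the fitted value by the slope.
slope_effect = predict_abs_pct(6.0) - predict_abs_pct(5.0)
print(slope_effect)

# Amt = 0 is far outside the range, so the intercept is not used as a prediction.
try:
    predict_abs_pct(0.0)
except ValueError as err:
    print(err)
```

Guarding the range makes the point of the slide explicit: the intercept of -1822% is a property of the fitted line, not a sensible prediction.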

Errors of Prediction = Vertical Distance Between Points and Line

Method of Least Squares

Sum of prediction errors = 0.
Sum of the squared errors = Sum of Squares Error = SSE.
There are many lines for which the sum of errors = 0.
There is only one line for which SSE is minimized.
Least squares line = regression line = line for which SSE is minimized.

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \quad \text{or} \quad \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
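The claim that many lines have errors summing to zero, while only one minimizes SSE, can be checked numerically. The sketch below uses a small made-up data set (not the absorbed liquid data) and hypothetical function names. Every line through the point of means has errors that sum to zero, and 0.6 is the least squares slope for these particular numbers.

```python
# Made-up data for illustration only (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n


def errors(slope):
    """Prediction errors for the line with this slope through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def sse(slope):
    """Sum of squared prediction errors for that line."""
    return sum(e ** 2 for e in errors(slope))


for slope in (0.0, 0.3, 0.6, 0.9):
    total = abs(sum(errors(slope)))
    print(f"slope = {slope:.1f}  |sum of errors| = {total:.6f}  SSE = {sse(slope):.2f}")
```

All four candidate lines have errors that sum to zero up to rounding, but SSE is smallest at the least squares slope.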

Least Squares Estimates

Deviation of the ith point from its estimated value:

$$(y_i - \hat{y}_i) = [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]$$

The sum of the squares of the deviations for all n points:

$$SSE = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$$

The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSE are called the least squares estimates. They are minimum variance unbiased estimates.

Formulas for Least Squares Estimates

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

$$SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
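A minimal Python sketch of these formulas follows, using the computational (sum-based) forms of SSxy and SSxx. The data are made up for illustration and the function name is hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1, ss_xy, ss_xx) from the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    b1 = ss_xy / ss_xx                  # slope estimate
    b0 = sum_y / n - b1 * sum_x / n     # intercept: y_bar - b1 * x_bar
    return b0, b1, ss_xy, ss_xx


# Made-up data for illustration only (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, ss_xy, ss_xx = least_squares(x, y)
print(f"SSxy = {ss_xy:.1f}, SSxx = {ss_xx:.1f}, b1 = {b1:.2f}, b0 = {b0:.2f}")
```

The sum-based forms avoid computing the means first, which is why they are convenient for hand calculation.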

Estimate of Variance at Each x, σ²

$$\hat{\sigma}^2 = s^2 = \frac{SSE}{df} = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} = MSE$$

s is the estimated standard error of the regression model.

$$\hat{\sigma} = s = \sqrt{MSE} = \text{Root MSE}$$
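The variance estimate can be sketched in Python as follows. The data are made up, the coefficients 2.2 and 0.6 are the least squares fit to those made-up numbers, and the function name is hypothetical.

```python
import math


def root_mse(x, y, b0, b1):
    """Return (sse, mse, s) for the fitted line y_hat = b0 + b1 * x."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    mse = sse / (len(x) - 2)  # df = n - 2 because two coefficients are estimated
    return sse, mse, math.sqrt(mse)


# Made-up data for illustration only (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sse, mse, s = root_mse(x, y, b0=2.2, b1=0.6)
print(f"SSE = {sse:.2f}, MSE = {mse:.2f}, Root MSE = {s:.4f}")
```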

MSE and Root MSE

[Figure: software output annotated to show $s^2 = MSE$ and $s = \sqrt{MSE}$.]

Sampling Distribution of $\hat{\beta}_1$

[Figure: sampling distribution of $\hat{\beta}_1$, centered at β1.]

Standard error for $\hat{\beta}_1$:

$$s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}} = \frac{\sqrt{MSE}}{\sqrt{SS_{xx}}}$$

Test of Model Utility

H0: β1 = 0
Ha: β1 ≠ 0

Test statistic:

$$t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}} = \frac{\hat{\beta}_1 - 0}{s / \sqrt{SS_{xx}}}, \qquad df = n - 2$$

Confidence interval:

$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \, \frac{s}{\sqrt{SS_{xx}}}$$
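The test statistic and confidence interval for the slope can be sketched as below. The summary values come from a made-up data set with n = 5, not from the absorbed liquid data, and the function name is hypothetical. The critical value 3.182 is an assumption read from a t table for α/2 = 0.025 with 3 degrees of freedom; it is passed in rather than computed so the example needs only the standard library.

```python
import math


def slope_inference(b1, s, ss_xx, t_crit):
    """Return (se_b1, t_stat, ci_low, ci_high) for H0: beta1 = 0."""
    se_b1 = s / math.sqrt(ss_xx)       # standard error of the slope estimate
    t_stat = (b1 - 0.0) / se_b1        # test statistic with df = n - 2
    half_width = t_crit * se_b1
    return se_b1, t_stat, b1 - half_width, b1 + half_width


# Made-up summary values: b1 = 0.6, MSE = 0.8, SSxx = 10, n = 5.
# ASSUMPTION: 3.182 is the t-table value for alpha/2 = 0.025 and df = 3.
T_CRIT = 3.182
se_b1, t_stat, ci_low, ci_high = slope_inference(0.6, math.sqrt(0.8), 10.0, T_CRIT)
reject_h0 = abs(t_stat) > T_CRIT

print(f"SE = {se_b1:.4f}, t = {t_stat:.4f}")
print(f"95% CI for the slope: ({ci_low:.3f}, {ci_high:.3f}), reject H0: {reject_h0}")
```

The two-sided test and the interval agree: H0 is rejected at level α exactly when the (1-α)100% interval excludes 0.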

Amt and Absorb%

Predictor   Coef.   SE Coef.   T        P-value
Intercept   -1822   366        -4.978   <0.0001
Slope       435     60         7.25     <0.0001

H0: β1 = 0
Ha: β1 ≠ 0

Abs% = -1822 + 435(Amt)
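As a quick check on the table, and assuming each T value is the coefficient divided by its standard error (the same form as the test statistic for the slope), the arithmetic works out as:

```latex
t_{\text{slope}} = \frac{435}{60} = 7.25,
\qquad
t_{\text{intercept}} = \frac{-1822}{366} \approx -4.978
```

Both match the T column, and the small p-value for the slope leads to rejecting H0: β1 = 0.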

Coefficient of Correlation

Correlation measures the strength of the linear relationship between two quantitative variables.

To get a visual picture, use a scatter plot.

To assign a numeric value: Pearson product moment coefficient of correlation, r.

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}}$$

r is scaleless and varies from -1 to +1.
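A short Python sketch of the correlation formula follows, with made-up data and a hypothetical function name. It also rescales x by a factor of 10 to show that r does not depend on the units of measurement.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Made-up data for illustration only (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
r_rescaled = pearson_r([10 * xi for xi in x], y)  # same data, x in different units

print(f"r = {r:.4f}, r after rescaling x = {r_rescaled:.4f}")
```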

Coefficient of Correlation

[Figure: example scatter plots with r = +1, r = -1, r = .95, r = 0, and r = -.80.]

Coefficient of Determination

The coefficient of determination, r², measures the contribution of x in predicting y.

Recall:

$$SS_{yy} = \sum (y_i - \bar{y})^2, \qquad SSE = \sum (y_i - \hat{y}_i)^2$$

If x makes no contribution to the prediction of y, then $\hat{y}_i = \bar{y}$, and SSE = SSyy.

If x contributes to the prediction of y, then we expect SSE << SSyy.

Coefficient of Determination

Recall:
– SSyy is the total sample variation around $\bar{y}$.
– SSE is the unexplained sample variability after fitting the regression line.

Proportion of total sample variation explained by the linear relationship:

$$\frac{\text{Explained Variability}}{\text{Total Variability}} = \frac{SS_{yy} - SSE}{SS_{yy}}$$

Coefficient of Determination

$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$$

= proportion of total sample variability around $\bar{y}$ that is explained by the linear relationship between y and x.

r² varies from 0 to 1.
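The coefficient of determination can be computed directly from SSE and SSyy, as in this sketch with made-up data and a hypothetical function name. For simple linear regression the result equals the square of the Pearson correlation, which the example also computes.

```python
def r_squared(x, y):
    """Return (r2, r2_from_ss): 1 - SSE/SSyy and SSxy^2 / (SSxx * SSyy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - sse / ss_yy, ss_xy ** 2 / (ss_xx * ss_yy)


# Made-up data for illustration only (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2, r2_from_ss = r_squared(x, y)
print(f"r^2 = {r2:.3f} (from SSE), {r2_from_ss:.3f} (as squared correlation)")
```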

Using Model for Estimation

Use the model to estimate the mean value of y, E[y], for a specific value of x.

Solving the regression equation for a particular value of x gives a point estimate for y at that value of x.

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

(1-α)100% Confidence Interval for Mean y at x = x_p

$\hat{y}$ is a statistic. It has a sampling distribution. Since we are operating under the normal assumption, the confidence interval = point estimate ± t_{α/2} (standard error of $\hat{y}$).

$$\hat{y} \pm t_{\alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$

where t_{α/2} has n-2 degrees of freedom.

Predict a New Individual y Value for a Given x

Individual values have more variation than means. (1-α)100% prediction interval for an individual value of y at x = x_p:

$$\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$

where t_{α/2} has n-2 degrees of freedom.
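The confidence interval for the mean of y and the prediction interval for an individual y differ only by the extra 1 under the square root. The sketch below computes both half-widths from made-up summary values (n = 5, not the absorbed liquid data); the function name is hypothetical, and the critical value 3.182 is an assumption taken from a t table for α/2 = 0.025 with 3 degrees of freedom.

```python
import math


def interval_half_widths(x_p, n, x_bar, ss_xx, s, t_crit):
    """Return (ci_half, pi_half) at x = x_p."""
    core = 1.0 / n + (x_p - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s * math.sqrt(core)        # for the mean of y at x_p
    pi_half = t_crit * s * math.sqrt(1.0 + core)  # for one new y at x_p
    return ci_half, pi_half


# Made-up summary values: n = 5, x_bar = 3, SSxx = 10, MSE = 0.8,
# fitted line y_hat = 2.2 + 0.6 x.
# ASSUMPTION: 3.182 is the t-table value for alpha/2 = 0.025 and df = 3.
s = math.sqrt(0.8)
x_p = 4.0
y_hat = 2.2 + 0.6 * x_p
ci_half, pi_half = interval_half_widths(x_p, 5, 3.0, 10.0, s, 3.182)
ci_half_center, _ = interval_half_widths(3.0, 5, 3.0, 10.0, s, 3.182)

print(f"95% CI for mean y at x = {x_p}: {y_hat:.2f} +/- {ci_half:.3f}")
print(f"95% PI for a new y at x = {x_p}: {y_hat:.2f} +/- {pi_half:.3f}")
```

The prediction interval is always the wider of the two, and both intervals are narrowest at x_p equal to the mean of x, which is why the bands in a fitted line plot bow outward.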

Confidence and Prediction Bands

[Figure: fitted line plot, Absorbed% = -1822 + 434.7 Amt, with Absorbed% (axis 0 to 2000) against Amt (axis 4.5 to 7.5), showing the regression line, 95% CI band, and 95% PI band. S = 204.898, R-Sq = 70.3%, R-Sq(adj) = 68.9%.]

Assumptions of a Regression Analysis

Assumptions involve the distribution of the errors.
– Actual errors: $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$
– Estimated errors (residuals): $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$

Use plots of residuals to check the assumptions.
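Residuals are straightforward to compute once the line is fitted. In this sketch the data are made up, the coefficients 2.2 and 0.6 are the least squares fit to those numbers, and the function name is hypothetical. Each residual is paired with its fitted value, which is the form a residual plot needs.

```python
def fitted_and_residuals(x, y, b0, b1):
    """Return a list of (fitted value, residual) pairs, one per observation."""
    pairs = []
    for xi, yi in zip(x, y):
        fitted = b0 + b1 * xi
        pairs.append((fitted, yi - fitted))  # residual e_i = y_i - y_hat_i
    return pairs


# Made-up data for illustration only (not the absorbed liquid data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
pairs = fitted_and_residuals(x, y, b0=2.2, b1=0.6)
residuals = [e for _, e in pairs]

for fitted, e in pairs:
    print(f"fitted = {fitted:.1f}  residual = {e:+.1f}")
```

For a least squares fit the residuals sum to zero, so a residual plot is always centered on zero overall; the assumptions concern how the residuals behave at each x.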

There are Four Assumptions

(1) The mean of the errors is 0 at each value of x.

[Figure: two plots of residuals against x values (fitted values), one labeled YES and one labeled NO.]

StatCrunch Plot of Residuals vs X Values

There are Four Assumptions

(2) Variance of errors is constant across all values of x.

[Figure: two plots of residuals against x values (fitted values), one labeled YES and one labeled NO.]

StatCrunch Plot of Residuals vs X Values

There are Four Assumptions

(3) Errors have normal distribution at each x.

[Figure: two plots of residuals against z-values, one labeled YES and one labeled NO.]

StatCrunch QQ Plot of Residuals
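The points of a QQ plot can be sketched with the Python standard library. This is not StatCrunch's procedure; the plotting positions (i - 0.5)/n are an assumption (other conventions exist), the residuals are from a made-up example, and the function name is hypothetical.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with a standard normal quantile (z-value)."""
    n = len(residuals)
    ordered = sorted(residuals)
    # ASSUMPTION: plotting positions (i - 0.5) / n; other conventions exist.
    z_values = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z_values, ordered))


# Residuals from a made-up example (not the absorbed liquid data).
points = qq_points([-0.8, 0.6, 1.0, -0.6, -0.2])
for z, e in points:
    print(f"z = {z:+.3f}  residual = {e:+.1f}")
```

If the errors are normal, these points should fall close to a straight line when residuals are plotted against z-values.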

There are Four Assumptions

(4) Errors are independent – must know how data was gathered.

[Figure: two plots of residuals ordered by time, person, etc., one labeled YES and one labeled NO.]