least squares regression fitting a line to bivariate data

Post on 31-Mar-2015

Least Squares Regression

Fitting a Line to Bivariate Data

Linear Relationships

Avg. occupants per car

1980: 6/car
1990: 3/car
2000: 1.5/car

By the year 2010 every fourth car will have nobody in it!

Food for Thought

What kind of mathematical relationship is there between year and avg. no. of occupants per car?

Why might the relationship break down by 2010?

Basic Terminology

Scatterplots, correlation: interested in association between 2 variables (assign x and y arbitrarily)

Least squares regression: does one quantitative variable explain or cause changes in another variable?

Basic Terminology (cont.)

Explanatory variable: explains or causes changes in the other variable; the x-variable (independent variable).

Response variable: the y-variable; it responds to changes in the x-variable (dependent variable).

Examples

Fertilizer (x), corn yield (y)
Advertising $ (x), store income (y)
Drug dose (x), blood pressure (y)
Daily temperature (x), natural gas demand (y)
Change in min wage (x), unemployment rate (y)

Simplest Relationship

Simplest equation that describes the dependence of variable y on variable x:

\[ y = b_0 + b_1 x \]

This is a linear equation; its graph is a line with slope \( b_1 \) and y-intercept \( b_0 \).

[Figure: graph of the line \( y = b_0 + b_1 x \), crossing the y-axis at \( b_0 \); slope \( b_1 \) = rise/run]

Notation

Observations: \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \)

Draw the line \( y = b_0 + b_1 x \) through the scatterplot; the point on the line corresponding to \( x_i \) is

\[ \hat{y}_i = b_0 + b_1 x_i \]

\( \hat{y}_i \) is the value of y predicted by the line \( y = b_0 + b_1 x \) when \( x = x_i \);
\( y_i \) is the observed value of y when \( x = x_i \).

Observed y, Predicted y

[Figure: an observed y and the predicted y on the line; the predicted y when x = 2.7 is \( \hat{y} = b_0 + b_1 (2.7) \)]

Scatterplot: Fuel Consumption vs Car Weight

[Figure: scatterplot of fuel consumption (gal/100 miles) vs. car weight (1000 lbs)]

“Best” line?

Scatterplot with least squares prediction line

[Figure: FUEL CONSUMPTION (gal/100 miles) vs CAR WEIGHT (1000 lbs), with the fitted line y = 1.639x - 0.3631, r^2 = 0.9538]

How do we draw the line? Residuals

The \( i \)th residual is the vertical deviation of the \( i \)th data point from the line:

\[ i\text{th residual} = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i) \]

residual = observed - predicted

Residuals: graphically

[Figure: graphical display of residuals, \( e_i = Y_i - \hat{Y}_i \) at \( X_i \); points above the line have a positive residual, points below the line a negative residual]

Criterion for choosing what line to draw: method of least squares

The method of least squares chooses the line that makes the sum of squares of the residuals as small as possible.

This line has the slope \( b_1 \) and intercept \( b_0 \) that minimize

\[ \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2 \]

for the given observations \( (x_i, y_i) \).
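A short derivation may help show where the minimizing slope and intercept come from. The sketch below uses standard calculus (set both partial derivatives to zero). The shorthand is introduced here for illustration: \( S \) is the sum of squared residuals, \( \bar{x} \) and \( \bar{y} \) are the sample means, \( s_x \) and \( s_y \) the sample standard deviations, and \( r \) the correlation.

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2 \\
\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)] = 0
  &\;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i [y_i - (b_0 + b_1 x_i)] = 0
  &\;\Rightarrow\; b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = r \, \frac{s_y}{s_x}
\end{align*}
```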

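Least Squares Line \( y = b_0 + b_1 x \): Slope \( b_1 \) and Intercept \( b_0 \)

For the observations \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \):

slope: \( b_1 = r \dfrac{s_y}{s_x} \)

y-intercept: \( b_0 = \bar{y} - b_1 \bar{x} \)

where

\( s_x = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \) is the standard deviation of \( x_1, x_2, \ldots, x_n \)

\( s_y = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \) is the standard deviation of \( y_1, y_2, \ldots, y_n \)

\( r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) s_x s_y} \) is the correlation between x and y

\[ SSE = \sum_{i=1}^{n} y_i^2 - b_0 \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i \]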
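The slope and intercept formulas can be turned into a few lines of code. The following is a minimal sketch in plain Python, not part of the original slides; the function name `least_squares_line` and the toy data are hypothetical and chosen only for illustration.

```python
from math import sqrt

def least_squares_line(xs, ys):
    """Return (b0, b1, r) for the least squares line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation between x and y.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x         # slope
    b0 = y_bar - b1 * x_bar    # y-intercept
    return b0, b1, r

# Hypothetical toy data (not from the slides): points lying exactly on y = 1 + 2x.
toy_b0, toy_b1, toy_r = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(toy_b0, toy_b1, toy_r)
```

Because the toy points lie exactly on a line, the fit should recover intercept 1, slope 2 and correlation 1, up to floating-point rounding.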

Example: Income vs Consumption Expenditure

Income (x), $1,000's   Consumption Expenditure (y), $1,000's
                   1                                       7
                   5                                       6
                   9                                       9
                  13                                       8
                  17                                      10

Questions

Construct scatterplot; determine if linear model is appropriate. If so …

… find the least squares prediction line

Estimate consumption expenditure in a household with an income of (i) $6,000 (ii) $25,000. Comfortable with estimates?

Compute the residuals

Scatterplot

[Figure: scatterplot of consumption expenditure ($1,000's) vs. household income ($1,000's)]

Solution

Inc. x  Exp. y  xi-xbar  (xi-xbar)^2  yi-ybar  (yi-ybar)^2  (xi-xbar)(yi-ybar)
     1       7       -8           64       -1            1                   8
     5       6       -4           16       -2            4                   8
     9       9        0            0        1            1                   0
    13       8        4           16        0            0                   0
    17      10        8           64        2            4                  16
    45      40        0          160        0           10                  32   (column sums)

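\[ \bar{x} = \frac{45}{5} = 9; \quad \bar{y} = \frac{40}{5} = 8 \]

\[ s_x = \sqrt{\frac{160}{4}} = \sqrt{40} = 6.325; \quad s_y = \sqrt{\frac{10}{4}} = \sqrt{2.5} = 1.581 \]

\[ r = \frac{32}{4(6.325)(1.581)} = .8 \]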

Calculations

\[ b_1 = r \frac{s_y}{s_x} = .8 \left( \frac{1.581}{6.325} \right) = .2 \]

\[ b_0 = \bar{y} - b_1 \bar{x} = 8 - .2(9) = 8 - 1.8 = 6.2 \]

least squares prediction line: \( \hat{y} = 6.2 + .2x \)
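The hand calculation for the income example can be checked with a short script. This is a sketch that follows the solution table's column sums; the variable names (`income`, `spending`, `s_xx` and so on) are illustrative choices, not notation from the slides.

```python
from math import sqrt

income = [1, 5, 9, 13, 17]     # x, household income in $1,000's
spending = [7, 6, 9, 8, 10]    # y, consumption expenditure in $1,000's

n = len(income)
x_bar = sum(income) / n        # 45 / 5 = 9
y_bar = sum(spending) / n      # 40 / 5 = 8

# Column sums from the solution table.
s_xx = sum((x - x_bar) ** 2 for x in income)                              # 160
s_yy = sum((y - y_bar) ** 2 for y in spending)                            # 10
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, spending))   # 32

s_x = sqrt(s_xx / (n - 1))
s_y = sqrt(s_yy / (n - 1))
r = s_xy / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x             # slope, as on the slides
b0 = y_bar - b1 * x_bar        # intercept

# Algebraically, r * s_y / s_x simplifies to s_xy / s_xx.
b1_from_sums = s_xy / s_xx

print(f"s_x = {s_x:.3f}, s_y = {s_y:.3f}, r = {r:.2f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}, b1 from column sums = {b1_from_sums:.2f}")
```

The last line illustrates that the slope can also be read straight off the table as 32/160, without computing the standard deviations first.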

least squares prediction line

\[ \hat{y} = b_0 + b_1 x = 6.2 + .2x \]

income = $6,000, x = 6: \( \hat{y} = 6.2 + .2(6) = 7.4 \) ($7,400)

income = $25,000, x = 25: \( \hat{y} = 6.2 + .2(25) = 11.2 \) ($11,200)

Least Squares Prediction Line

[Figure: scatterplot of consumption expenditure ($1,000's) vs. household income ($1,000's) with the line y = 6.2 + 0.2x]

Consumption Expenditure Prediction When x=$6,000

[Figure: the line y = 6.2 + 0.2x with the prediction at x = 6 marked, y = 7.4]

Consumption Expenditure Prediction When x=$25,000

[Figure: the line y = 6.2 + 0.2x extended to x = 25, with the prediction y = 11.2 marked]

The least squares line always goes through the point with coordinates \( (\bar{x}, \bar{y}) \)

\( (\bar{x}, \bar{y}) = (9, 8) \)

C. Compute the Residuals

Inc. x  ConE y  yhat = 6.2 + .2x  y - yhat  (y - yhat)^2
     1       7               6.4        .6           .36
     5       6               7.2      -1.2          1.44
     9       9               8           1          1
    13       8               8.8       -.8           .64
    17      10               9.6        .4           .16

Sum of residuals = 0; sum of (residuals)^2 = 3.6
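The residual table can be reproduced row by row. The snippet below is a small sketch that takes the fitted line from the slides as given (6.2 + 0.2x); the variable names are illustrative.

```python
income = [1, 5, 9, 13, 17]     # x, in $1,000's
spending = [7, 6, 9, 8, 10]    # y, in $1,000's
b0, b1 = 6.2, 0.2              # least squares line taken from the slides

predicted = [b0 + b1 * x for x in income]
residuals = [y - y_hat for y, y_hat in zip(spending, predicted)]
sse = sum(e ** 2 for e in residuals)

# One line per table row: x, y, predicted y, residual, squared residual.
for x, y, y_hat, e in zip(income, spending, predicted, residuals):
    print(f"{x:>3} {y:>3} {y_hat:5.1f} {e:5.1f} {e * e:5.2f}")

print(f"sum of residuals = {sum(residuals):.2f}, sum of squares = {sse:.2f}")
```

The residual sum is zero only up to floating-point rounding, which is why the script formats it to two decimals instead of comparing it with 0 exactly.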

Residuals

[Figure: consumption expenditure ($1,000's) vs. household income ($1,000's) with the line y = 6.2 + 0.2x and the residuals shown]

Income Residual Plot

[Figure: residuals plotted against income]

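\( \sum \text{residuals} \), \( \sum (\text{residuals})^2 \)

Note that

* \( \sum \text{residuals} = 0 \) and \( \sum (\text{residuals})^2 = 3.6 \)

* From the SSE formula given earlier:

\[ SSE = \sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i = 330 - 6.2(40) - .2(392) = 330 - 248 - 78.4 = 3.6 \]

Any other line drawn through the scatterplot will have \( \sum (\text{residuals})^2 > 3.6 \).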
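The claim that no other line does better can be spot-checked numerically. The sketch below compares the least squares line with a few competing lines; the competitors are hypothetical and were picked only for illustration, so this is a demonstration and not a proof.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared residuals for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

income = [1, 5, 9, 13, 17]     # x, in $1,000's
spending = [7, 6, 9, 8, 10]    # y, in $1,000's

best = sse(6.2, 0.2, income, spending)

# Shortcut formula from the slides: SSE = sum(y^2) - b0*sum(y) - b1*sum(x*y)
shortcut = (sum(y * y for y in spending)
            - 6.2 * sum(spending)
            - 0.2 * sum(x * y for x, y in zip(income, spending)))

# Hypothetical competing lines (b0, b1), chosen only for illustration.
competitors = [(6.0, 0.25), (6.2, 0.3), (7.0, 0.1), (8.0, 0.0)]
for b0, b1 in competitors:
    print(f"b0 = {b0}, b1 = {b1}: SSE = {sse(b0, b1, income, spending):.4f}")

print(f"least squares line: SSE = {best:.4f} (shortcut formula: {shortcut:.4f})")
```

Every competitor should report a sum of squares above 3.6, and the direct sum and the shortcut formula should agree for the least squares line.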

Car Weight, Fuel Consumption Example, cont.

[Figure: scatterplot of FUEL CONSUMPTION (gal/100 miles) vs CAR WEIGHT (1000 lbs)]

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

Wt (x)  Fuel (y)  xi-xbar  (xi-xbar)^2  yi-ybar  (yi-ybar)^2  (xi-xbar)(yi-ybar)
   3.4       5.5       .5          .25     1.11       1.2321                .555
   3.8       5.9       .9          .81     1.51       2.2801               1.359
   4.1       6.5      1.2         1.44     2.11       4.4521               2.532
   2.2       3.3      -.7          .49    -1.09       1.1881                .763
   2.6       3.6      -.3          .09     -.79        .6241                .237
   2.9       4.6        0            0      .21        .0441                   0
   2.0       2.9      -.9          .81    -1.49       2.2201               1.341
   2.7       3.6      -.2          .04     -.79        .6241                .158
   1.9       3.1     -1.0            1    -1.29       1.6641                1.29
   3.4       4.9       .5          .25      .51        .2601                .255
  29        43.9        0         5.18        0       14.589                8.49   (col. sum)

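Calculations

\[ \bar{x} = 2.9; \quad \bar{y} = 4.39; \quad s_x = \sqrt{\frac{5.18}{9}} = .7587; \quad s_y = \sqrt{\frac{14.589}{9}} = 1.2732 \]

\[ r = \frac{8.49}{9(.7587)(1.2732)} = .9766 \]

slope: \( b_1 = r \dfrac{s_y}{s_x} = .9766 \left( \dfrac{1.2732}{.7587} \right) = 1.639 \)

intercept: \( b_0 = \bar{y} - b_1 \bar{x} = 4.39 - 1.639(2.9) = -.3631 \)

least squares prediction line: \( \hat{y} = b_0 + b_1 x = -.3631 + 1.639x \)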

Scatterplot with least squares prediction line

[Figure: FUEL CONSUMPTION (gal/100 miles) vs CAR WEIGHT (1000 lbs), with the fitted line y = 1.639x - 0.3631, r^2 = 0.9538]

The Least Squares Line Always Goes Through \( (\bar{x}, \bar{y}) \)

\( (\bar{x}, \bar{y}) = (2.9, 4.39) \)

Using the least squares line for prediction. Fuel consumption of 3,000 lb car? (x=3)

\[ \hat{y} = -.3631 + 1.639(3) = 4.5539 \]

[Figure: Fuel Consumption vs Car Weight, scatterplot and least squares line y = -0.3631 + 1.639x, with the point (3.0, 4.5539) marked]

Be Careful!

Fuel consumption of 500 lb car? (x = .5)

\[ \hat{y} = -.3631 + 1.639(.5) = .4564 \]

(.4564 gal/100 miles, i.e. 100/.4564 = 219 mpg)

[Figure: FUEL CONSUMPTION (gal/100 miles) vs CAR WEIGHT (1000 lbs), with the fitted line y = 1.639x - 0.3631, r^2 = 0.9538]

x = .5 is outside the range of the x-data that we used to determine the least squares line
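One way to guard against this kind of extrapolation is to have the prediction code report whether x lies inside the observed range. The sketch below is an illustration only: the helper name `predict_fuel` is hypothetical, and the limits 1.9 and 4.1 are the smallest and largest car weights in the ten data pairs.

```python
def predict_fuel(weight, x_min=1.9, x_max=4.1):
    """Predict fuel consumption (gal/100 miles) from car weight (1000 lbs).

    Returns (prediction, in_range); in_range is False when the weight lies
    outside the range of x-values used to fit the line.
    """
    b0, b1 = -0.3631, 1.639          # least squares line from the slides
    y_hat = b0 + b1 * weight
    in_range = x_min <= weight <= x_max
    return y_hat, in_range

for w in (3.0, 0.5):
    y_hat, ok = predict_fuel(w)
    note = "within data range" if ok else "EXTRAPOLATION: outside 1.9 to 4.1"
    # 100 / (gal per 100 miles) converts the prediction to miles per gallon.
    print(f"x = {w}: y-hat = {y_hat:.4f} gal/100 mi ({100 / y_hat:.0f} mpg), {note}")
```

Returning a flag instead of raising an error is a design choice made here for brevity; a stricter version could refuse to predict outside the range.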

Avoid GIGO! Evaluating the least squares line

1. Create scatterplot. Approximately linear?

2. Calculate \( r^2 \), the square of the correlation coefficient

3. Examine residual plot

\( r^2 \): The Variation Accounted For

The square of the correlation coefficient r gives important information about the usefulness of the least squares line.

The square of the correlation coefficient, \( r^2 \), is the fraction of the variation in y that is explained by the least squares regression of y on x (that is, by the variation in x).

\( -1 \le r \le 1 \) implies \( 0 \le r^2 \le 1 \)

Example: car weight, fuel consumption

x=car weight, y=fuel consumption

\( r^2 = (.9766)^2 \approx .95 \)

About 95% of the variation in fuel consumption (y) is explained by the linear relationship between car weight (x) and fuel consumption (y).

What else affects fuel consumption?

– Driver, size of engine, tires, road, etc.
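The "fraction of variation explained" reading can be made concrete by comparing the variation left in the residuals with the total variation in y. The sketch below does this for the car data. The names `sst` and `sse`, and the identity r^2 = 1 - SSE/SST, are standard but are not notation from the slides. The slope is computed from the table's column sums, which is algebraically the same as r times s_y / s_x.

```python
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]   # x, 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]     # y, gal/100 miles

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(fuel) / n

s_xx = sum((x - x_bar) ** 2 for x in weight)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, fuel))
b1 = s_xy / s_xx                 # slope
b0 = y_bar - b1 * x_bar          # intercept

sst = sum((y - y_bar) ** 2 for y in fuel)                            # total variation in y
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(weight, fuel))    # variation left unexplained

r_squared = 1 - sse / sst
print(f"SST = {sst:.3f}, SSE = {sse:.3f}, r^2 = {r_squared:.4f}")
```

Only about 5% of the total variation in fuel consumption is left in the residuals, which matches the "about 95% explained" statement.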

Example: SAT scores

[Figure: SAT Mean per State vs % Seniors Taking Test, with least squares line y = -2.2375x + 1023.4, R^2 = 0.7542]

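SAT scores: calculations

\[ \bar{x} = 33.882, \quad s_x = 24.103, \quad \bar{y} = 947.549, \quad s_y = 62.1, \quad r = -.868 \]

slope: \( b_1 = r \dfrac{s_y}{s_x} = -.868 \left( \dfrac{62.1}{24.103} \right) = -2.23635 \)

intercept: \( b_0 = \bar{y} - b_1 \bar{x} = 947.549 - (-2.236)(33.882) = 1023.309 \)

least squares prediction line: \( \hat{y} = 1023.309 - 2.236x \)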

SAT scores: result

[Figure: SAT Mean per State vs % Seniors Taking Test, with least squares line y = -2.2375x + 1023.4, R^2 = 0.7542]

\( r^2 = (-.868)^2 = .7534 \)

If 57% of NC seniors take the SAT, the predicted mean score is

\[ \hat{y} = 1023.309 - 2.23635(57) = 895.84 \]
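The SAT line is built from summary statistics alone, so it can be reproduced without the raw state-by-state data. A minimal sketch follows, with illustrative variable names. It carries the slope at full precision, so the intercept and prediction differ from the slide values in the second decimal place (the slides round the slope to -2.236 when computing the intercept).

```python
# Summary statistics quoted on the slides.
x_bar, s_x = 33.882, 24.103      # % of seniors taking the test
y_bar, s_y = 947.549, 62.1       # state mean SAT score
r = -0.868                       # correlation (negative: the fitted line slopes down)

b1 = r * s_y / s_x               # slope
b0 = y_bar - b1 * x_bar          # intercept
pred_57 = b0 + b1 * 57           # predicted mean score when 57% take the test

print(f"slope = {b1:.5f}, intercept = {b0:.3f}")
print(f"r^2 = {r * r:.4f}, prediction at 57% = {pred_57:.2f}")
```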

Avoid GIGO! Evaluating the least squares line

1. Create scatterplot. Approximately linear?

2. Calculate \( r^2 \), the square of the correlation coefficient

3. Examine residual plot

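Residuals

residual = observed y - predicted y = \( y - \hat{y} \)

Properties of residuals

1. The residuals always sum to 0 (therefore the mean of the residuals is 0)

2. The least squares line always goes through the point \( (\bar{x}, \bar{y}) \)

[Figure: graphical display of a residual, \( e_i = y_i - \hat{y}_i \): the vertical distance at \( x_i \) between the observed \( y_i \) and the predicted \( \hat{y}_i \) on the line]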
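Both properties of the residuals follow from the intercept formula \( b_0 = \bar{y} - b_1 \bar{x} \). A short check, written with \( \hat{y}_i = b_0 + b_1 x_i \) for the predicted values:

```latex
\begin{align*}
\sum_{i=1}^{n} (y_i - \hat{y}_i)
  &= \sum_{i=1}^{n} y_i - n b_0 - b_1 \sum_{i=1}^{n} x_i
   = n \bar{y} - n (\bar{y} - b_1 \bar{x}) - n b_1 \bar{x} = 0 \\
\hat{y} \text{ at } x = \bar{x}: \quad b_0 + b_1 \bar{x}
  &= (\bar{y} - b_1 \bar{x}) + b_1 \bar{x} = \bar{y}
\end{align*}
```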

Residual Plot

Residuals help us determine if fitting a least squares line to the data makes sense

When a least squares line is appropriate, it should model the underlying relationship; nothing interesting should be left behind

We make a scatterplot of the residuals in the hope of finding…

NOTHING!

Car Wt/ Fuel Consump: Residuals

CAR WT.  FUEL CONSUMP.  Pred FUEL CONSUMP.  Residuals
    3.4            5.5         5.209498069   0.290501931
    3.8            5.9         5.865096525   0.034903475
    4.1            6.5         6.356795367   0.143204633
    2.2            3.3         3.242702703   0.057297297
    2.6            3.6         3.898301158  -0.29830115
    2.9            4.6         4.39          0.21
    2              2.9         2.914903475  -0.01490347
    2.7            3.6         4.062200772  -0.46220077
    1.9            3.1         2.751003861   0.348996139
    3.4            4.9         5.209498069  -0.309498069

Example: Car wt/fuel consump. residual plot

[Figure: RESIDUALS vs WT(X); the residuals lie between about -0.6 and 0.4]

SAT Residuals

[Figure: %TAKE Residual Plot, residuals plotted against % of seniors taking the test]

Linear Relationship?

[Figure: scatterplot of Y vs X, titled "Linear(?)"]

Garbage In Garbage Out

[Figure: "GIGO", the same scatterplot of Y vs X with the fitted line y = 4x + 11]

Residual Plot – Clue to GIGO

[Figure: residual plot, residuals (-20 to 20) plotted against the X variable]
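The way a residual plot exposes a poor linear fit can be imitated with made-up data. The sketch below uses hypothetical points on the curve y = x^2 (this is not the data behind the slide's plot) and fits a least squares line to them; the variable names are illustrative.

```python
# Hypothetical curved data, NOT the data from the slide: y = x^2 for x = -3, ..., 7.
xs = list(range(-3, 8))
ys = [x * x for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx                 # least squares slope
b0 = y_bar - b1 * x_bar          # least squares intercept

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"fitted line: y = {b0:.1f} + {b1:.1f}x")
print("residual signs, left to right:", signs)
```

The residuals still sum to zero, but their signs run positive, then negative, then positive again. A systematic pattern like that in a residual plot is the clue that a straight line is the wrong model.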

GIGO

[Figure: the scatterplot with the fitted line y = 4x + 11 shown together with its residual plot]
