Agresti/Franklin Statistics, 1 of 88 Section 11.4 What Do We Learn from How the Data Vary Around the Regression Line?

Upload: dale-roberts

Post on 14-Dec-2015


Page 1: Agresti/Franklin Statistics, 1 of 88  Section 11.4 What Do We Learn from How the Data Vary Around the Regression Line?

Agresti/Franklin Statistics, 1 of 88

Section 11.4

What Do We Learn from How the Data Vary Around the Regression Line?

Agresti/Franklin Statistics, 2 of 88

Residuals and Standardized Residuals

A residual is a prediction error: the difference between an observed outcome and its predicted value

• The magnitude of these residuals depends on the units of measurement for y

A standardized version of the residual does not depend on the units

Agresti/Franklin Statistics, 3 of 88

Standardized Residuals

Standardized residual: $\dfrac{y - \hat{y}}{se(y - \hat{y})}$

(Corrected version; there is a typo on p. 553 of the text.)

The se formula is complex, so we rely on software to find it

A standardized residual indicates how many standard errors a residual falls from 0

Often, observations with standardized residuals larger than 3 in absolute value represent outliers
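A minimal Python sketch of what the software does when it reports standardized residuals (y − ŷ)/se(y − ŷ). It assumes the usual straight-line least-squares formula se(y − ŷ) = s·sqrt(1 − h), where h = 1/n + (x − x̄)²/Σ(x − x̄)²; the slides do not state this formula. The function name and the data are hypothetical, not taken from the text.

```python
import math

def standardized_residuals(x, y):
    """Return (y - y_hat) / se(y - y_hat) for each point of a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx  # slope
    a = y_bar - b * x_bar                                               # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # residual standard deviation
    result = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx  # assumed formula, see lead-in
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result

# Hypothetical data (not from the slides): the last student falls far below the trend.
hs_gpa = [2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0, 3.8, 3.2]
col_gpa = [2.4, 2.8, 3.1, 3.2, 3.4, 3.5, 3.8, 3.7, 3.6, 1.5]
z = standardized_residuals(hs_gpa, col_gpa)

# Flag observations the way the MINITAB slide describes: absolute value larger than 2.
flagged = [i for i, zi in enumerate(z) if abs(zi) > 2]
print(flagged)
```

Because each residual is divided by its own standard error, the flagged list does not change if y is rescaled to different units.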

Agresti/Franklin Statistics, 4 of 88

Example: Detecting an Underachieving College Student

Data was collected on a sample of 59 students at the University of Georgia

Two of the variables were:

• CGPA: College Grade Point Average

• HSGPA: High School Grade Point Average

Example 13 in Text

Agresti/Franklin Statistics, 5 of 88

Example: Detecting an Underachieving College Student

A regression equation was created from the data:

• x: HSGPA

• y: CGPA

Equation: $\hat{y} = 1.19 + 0.64x$

Agresti/Franklin Statistics, 6 of 88

Example: Detecting an Underachieving College Student

MINITAB highlights observations that have standardized residuals with absolute value larger than 2:

Agresti/Franklin Statistics, 7 of 88

Example: Detecting an Underachieving College Student

Consider the reported standardized residual of -3.14

• This indicates that the residual is 3.14 standard errors below 0

• This student’s actual college GPA is quite far below what the regression line predicts

Agresti/Franklin Statistics, 8 of 88

Analyzing Large Standardized Residuals

Does it fall well away from the linear trend that the other points follow?

Does it have too much influence on the results?

Note: Some large standardized residuals may occur just because of ordinary random variability

Agresti/Franklin Statistics, 9 of 88

Histogram of Residuals

A histogram of residuals or standardized residuals is a good way of detecting unusual observations

A histogram is also a good way of checking the assumption that the conditional distribution of y at each x value is normal

• Look for a bell-shaped histogram

Agresti/Franklin Statistics, 10 of 88

Histogram of Residuals

Suppose the histogram is not bell-shaped:

• The distribution of the residuals is not normal

However...

• Two-sided inferences about the slope parameter still work quite well

• The t inferences are robust

Agresti/Franklin Statistics, 11 of 88

The Residual Standard Deviation

For statistical inference, the regression model assumes that the conditional distribution of y at a fixed value of x is normal, with the same standard deviation at each x

This standard deviation, denoted by σ, refers to the variability of y values for all subjects with the same x value

Agresti/Franklin Statistics, 12 of 88

The Residual Standard Deviation

The estimate of σ, obtained from the data, is: $s = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n - 2}}$

Agresti/Franklin Statistics, 13 of 88

Example: How Variable are the Athletes’ Strengths?

From MINITAB output, we obtain s, the residual standard deviation of y: $s = \sqrt{\dfrac{3522.8}{55}} = 8.0$

For any given x value, we estimate the mean y value using the regression equation and we estimate the standard deviation using s: s = 8.0
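The arithmetic behind s can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the inputs are the residual sum of squares (3522.8) and degrees of freedom (55) that appear in the slide's formula.

```python
import math

def residual_sd(residual_ss, df):
    """Residual standard deviation s = sqrt(residual SS / df); df = n - 2 for a straight line."""
    return math.sqrt(residual_ss / df)

s = residual_sd(3522.8, 55)
print(round(s, 1))  # 8.0
```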

Agresti/Franklin Statistics, 14 of 88

Confidence Interval for µy

We estimate µy, the population mean of y at a given value of x, by: $\hat{y} = a + bx$

We can construct a 95% confidence interval for µy using: $\hat{y} \pm t_{.025}\,(se \text{ of } \hat{y})$

Agresti/Franklin Statistics, 15 of 88

Prediction Interval for y

The estimate $\hat{y} = a + bx$ for the mean of y at a fixed value of x is also a prediction for an individual outcome y at the fixed value of x

Most regression software will form an interval within which an outcome y is likely to fall

• This is called a prediction interval for y

(See Figure 11.10)

Agresti/Franklin Statistics, 16 of 88

The Residual Standard Deviation

Difference in limit of CI and “s”

$s = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n - 2}}$

Agresti/Franklin Statistics, 17 of 88

Prediction Interval for y vs Confidence Interval for µy

The prediction interval for y is an inference about where individual observations fall

• Use a prediction interval for y if you want to predict where a single observation on y will fall for a particular x value

Agresti/Franklin Statistics, 18 of 88

Prediction Interval for y vs Confidence Interval for µy

The confidence interval for µy is an inference about where a population mean falls

• Use a confidence interval for µy if you want to estimate the mean of y for all individuals having a particular x value
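The contrast between the two intervals can be sketched in Python. The slides leave the standard error formulas to software; the sketch below assumes the usual straight-line formulas, s·sqrt(1/n + (x0 − x̄)²/Σ(x − x̄)²) for the mean and s·sqrt(1 + 1/n + (x0 − x̄)²/Σ(x − x̄)²) for a single new observation. The function name, the data, and the t multiplier (2.447, the t.025 value for df = 6, looked up from a t table) are illustrative assumptions.

```python
import math

def intervals_at(x, y, x0, t_mult):
    """Return (ci, pi) at x0: interval for the mean of y, and for one new y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    y_hat = a + b * x0
    se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)      # se for the mean of y
    se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)  # se for a single new y
    ci = (y_hat - t_mult * se_mean, y_hat + t_mult * se_mean)
    pi = (y_hat - t_mult * se_pred, y_hat + t_mult * se_pred)
    return ci, pi

# Hypothetical data (not from the slides); n = 8, so df = n - 2 = 6.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [62, 66, 71, 70, 78, 80, 85, 91]
ci, pi = intervals_at(x, y, x0=5, t_mult=2.447)
print(ci)
print(pi)
```

Both intervals are centered at ŷ, but the prediction interval is always wider because it also has to allow for the variability of an individual y around the mean.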

Agresti/Franklin Statistics, 19 of 88

Example: Predicting Maximum Bench Press and Estimating its Mean

Agresti/Franklin Statistics, 20 of 88

Example: Predicting Maximum Bench Press and Estimating its Mean

Use the MINITAB output to find and interpret a 95% CI for the population mean of the maximum bench press values for all female high school athletes who can do x = 11 sixty-pound bench presses

For all female high school athletes who can do 11 sixty-pound bench presses, we estimate the mean of their maximum bench press values falls between 78 and 82 pounds

Agresti/Franklin Statistics, 21 of 88

Example: Predicting Maximum Bench Press and Estimating its Mean

Use the MINITAB output to find and interpret a 95% Prediction Interval for a single new observation on the maximum bench press for a randomly chosen female high school athlete who can do x = 11 sixty-pound bench presses

For all female high school athletes who can do 11 sixty-pound bench presses, we predict that 95% of them have maximum bench press values between 64 and 96 pounds
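A quick arithmetic comparison of the two reported intervals, as a sketch. The endpoints are the rounded values quoted on the slides; using s = 8.0 assumes that the residual standard deviation from the earlier athlete example belongs to this same regression.

```python
ci = (78, 82)  # 95% confidence interval for the mean, from the slide
pi = (64, 96)  # 95% prediction interval for one athlete, from the slide
s = 8.0        # residual standard deviation (assumed to be from the same fit)

ci_center = (ci[0] + ci[1]) / 2
pi_center = (pi[0] + pi[1]) / 2
ci_half_width = (ci[1] - ci[0]) / 2
pi_half_width = (pi[1] - pi[0]) / 2

print(ci_center, pi_center)          # 80.0 80.0
print(ci_half_width, pi_half_width)  # 2.0 16.0
print(pi_half_width / s)             # 2.0
```

Both intervals share the same center, but the prediction interval is eight times as wide, and its half-width is about 2s.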

Agresti/Franklin Statistics, 22 of 88

Decomposing the Error

OR

Regression SS + Residual SS = Total SS

Regression SS := $\sum (\hat{y}_i - \bar{y})^2 = \sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2$

F = (MS Reg)/(MSE). The F test is more general than the t test (in the cases studied in this class, F is effectively t squared)

However, in more complicated models (more explanatory variables), the difference between the two tests and the usefulness of F become apparent

In software (e.g. ANOVA output), a sum of squares divided by its df is called a mean square. For example, MSE stands for mean square error := $\sum (y_i - \hat{y}_i)^2 / df$
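The decomposition and the relation between F and t can be checked numerically. The following Python sketch uses hypothetical data and a hypothetical function name. It assumes one explanatory variable, so the regression df is 1, the residual df is n − 2, and the standard error of the slope is sqrt(MSE/Σ(x − x̄)²).

```python
import math

def anova_simple(x, y):
    """Sums of squares, F, and t for a straight-line least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    fitted = [a + b * xi for xi in x]
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    regression_ss = sum((fi - y_bar) ** 2 for fi in fitted)
    mse = residual_ss / (n - 2)         # residual SS divided by its df
    f_stat = (regression_ss / 1) / mse  # MS Reg / MSE, regression df = 1
    t_stat = b / math.sqrt(mse / sxx)   # slope divided by its standard error
    return {"total": total_ss, "regression": regression_ss,
            "residual": residual_ss, "F": f_stat, "t": t_stat}

# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [62, 66, 71, 70, 78, 80, 85, 91]
out = anova_simple(x, y)
print(out["regression"] + out["residual"], out["total"])
print(out["F"], out["t"] ** 2)
```

Up to floating-point rounding, each printed pair agrees: the two sums of squares add to the total, and F equals t squared.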

Agresti/Franklin Statistics, 23 of 88

Section 11.5

Exponential Regression: A Model for Nonlinearity

Agresti/Franklin Statistics, 24 of 88

Nonlinear Regression Models

If a scatterplot indicates substantial curvature in a relationship, then equations that provide curvature are needed

• Occasionally a scatterplot has a parabolic appearance: as x increases, y increases then it goes back down

• More often, y tends to continually increase or continually decrease but the trend shows curvature

Agresti/Franklin Statistics, 25 of 88

Example: Exponential Growth in Population Size

Since 2000, the population of the U.S. has been growing at a rate of 2% a year

• The population size in 2000 was 280 million

• The population size in 2001 was $280 \times 1.02$ million

• The population size in 2002 was $280 \times (1.02)^2$ million

• …

• The population size in 2010 is estimated to be $280 \times (1.02)^{10}$ million

• This is called exponential growth
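The same year-by-year calculation as a short Python sketch. The function name is hypothetical; the starting size of 280 million and the 2% growth rate are the figures quoted in the example.

```python
def population(years_since_2000, start=280.0, growth=1.02):
    """Population in millions after repeated growth by a fixed multiplicative factor."""
    return start * growth ** years_since_2000

for year in (2000, 2001, 2002, 2010):
    print(year, round(population(year - 2000), 1))
```

Each extra year multiplies the previous value by 1.02, so the yearly increase itself grows over time rather than staying constant as it would on a straight line.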

Agresti/Franklin Statistics, 26 of 88

Exponential Regression Model

An exponential regression model has the formula: $\mu_y = \alpha\beta^x$

for the mean µy of y at a given value of x, where α and β are parameters

Agresti/Franklin Statistics, 27 of 88

Exponential Regression Model

In the exponential regression equation, the explanatory variable x appears as the exponent of a parameter

The mean µy and the parameter β can take only positive values

As x increases, the mean µy increases when β>1

It continually decreases when 0 < β<1

Agresti/Franklin Statistics, 28 of 88

Exponential Regression Model

For exponential regression, the logarithm of the mean is a linear function of x

When the exponential regression model holds, a plot of the log of the y values versus x should show an approximate straight-line relation with x
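To see why, take logarithms of both sides of the exponential regression model, with α and β the model's two parameters (any base of logarithm works):

$$\log(\mu_y) = \log(\alpha\beta^x) = \log(\alpha) + (\log\beta)\,x$$

This is a straight line in x with intercept log(α) and slope log(β).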

Agresti/Franklin Statistics, 29 of 88

Example: Explosion in Number of People Using the Internet

Agresti/Franklin Statistics, 30 of 88

Example: Explosion in Number of People Using the Internet

Agresti/Franklin Statistics, 31 of 88

Example: Explosion in Number of People Using the Internet

Agresti/Franklin Statistics, 32 of 88

Example: Explosion in Number of People Using the Internet

Using regression software, we can create the exponential regression equation:

x: the number of years since 1995. Start with x = 0 for 1995, then x=1 for 1996, etc

y: number of Internet users (in millions)

Equation: $\hat{y} = 20.38\,(1.7708)^x$
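The slides do not say how the regression software arrives at this equation. One common approach, sketched here as an assumption, is to fit a straight line to log(y) versus x by least squares and then transform the intercept and slope back. The function name is hypothetical, and the data are synthetic values generated from the reported equation, not the actual Internet user counts.

```python
import math

def fit_exponential(x, y):
    """Fit y_hat = alpha * beta**x by least squares on log(y) versus x."""
    n = len(x)
    log_y = [math.log(yi) for yi in y]
    x_bar = sum(x) / n
    log_y_bar = sum(log_y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (li - log_y_bar) for xi, li in zip(x, log_y)) / sxx
    intercept = log_y_bar - slope * x_bar
    return math.exp(intercept), math.exp(slope)  # alpha, beta

# Synthetic data generated exactly from the reported equation (an illustration only).
years_since_1995 = [0, 1, 2, 3, 4, 5, 6]
users = [20.38 * 1.7708 ** t for t in years_since_1995]
alpha, beta = fit_exponential(years_since_1995, users)
print(round(alpha, 2), round(beta, 4))
```

With real data the points would scatter around the line on the log scale, and other fitting methods could give slightly different estimates.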

Agresti/Franklin Statistics, 33 of 88

Interpreting Exponential Regression Models

In the exponential regression model $\mu_y = \alpha\beta^x$,

the parameter α represents the mean value of y when x = 0;

the parameter β represents the multiplicative effect on the mean of y for a one-unit increase in x

Agresti/Franklin Statistics, 34 of 88

Example: Explosion in Number of People Using the Internet

In this model:

The predicted number of Internet users in 1995 (for which x = 0) is 20.38 million

The predicted number of Internet users in 1996 is 20.38 times 1.7708

$\hat{y} = 20.38\,(1.7708)^x$
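These two predictions can be reproduced with a short Python sketch. The function name is hypothetical; the coefficients 20.38 and 1.7708 come from the fitted equation, with x counted in years since 1995 and predictions in millions of users.

```python
def predicted_users(year, alpha=20.38, beta=1.7708):
    """Predicted Internet users (millions), with x = years since 1995."""
    return alpha * beta ** (year - 1995)

print(round(predicted_users(1995), 2))  # 20.38
print(round(predicted_users(1996), 2))  # 36.09
```

Each additional year multiplies the prediction by 1.7708, that is, a predicted increase of about 77% per year.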