
22s:152 Applied Linear Regression

Chapter 16: Bootstrapping

An option for dealing with violation of assumptions.

————————————————————

• What do we do when our distributional assumptions (like normality) are not met?

• The assumptions get us valid p-values...

– valid standard errors

– valid confidence intervals

– valid hypothesis tests

• Consider inference on a mean µ where we have the point estimate X̄.

– Define the statistic: T = (X̄ − µ) / (s/√n)


– With normality, we have T ∼ t_{n−1}, which we use to...

∗ form valid 100(1 − α)% CIs

∗ perform α-level hypothesis tests

– Without normality, we cannot assume T has this distribution.

– Without the known distributional ‘behavior’ of the random variable T, we can't make valid confidence intervals for µ, nor run hypothesis tests on µ with a known error rate, using our usual methods.
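(To make the role of the t distribution concrete, here is a minimal sketch in R of the classical t-based inference; the sample x below is made up purely for illustration.)

## Sketch of the classical t-based inference
## (x is a made-up sample with true mu = 10):
> x = rnorm(15, mean=10, sd=2)
> T = (mean(x) - 10)/(sd(x)/sqrt(length(x)))   ## the statistic above
> qt(c(0.025, 0.975), df=length(x)-1)          ## t_{n-1} cutoffs
> t.test(x)$conf.int                           ## the usual 95% CI for mu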

• If we do not know the theoretical ‘sampling distribution’ of a relevant statistic (like X̄), we can instead use the bootstrapping approach to statistical inference.


Bootstrapping, what is it...

• A nonparametric approach to statistical inference that gives us...

– valid standard errors

– valid confidence intervals

– valid hypothesis tests

without the normality assumption.

• Some dislike the term nonparametric and prefer the term distribution-free.

• Assumption we do need:

– The sampled data provide a reasonable representation of the population from which they came.

• Bootstrapping is more computationally intensive than traditional inference because it is a ‘resampling’ method.


• Recall, a sampling distribution is...

the probability distribution of a statistic.

Examples (true under certain conditions):

1. X̄ ∼ N(µ, σ²/n)

2. β̂ ∼ N(β, V(β̂))

These sampling distributions allow us to perform hypothesis tests and form confidence intervals on parameters of interest, like µ and β.
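(A quick simulation makes the first example concrete; the population N(µ=5, sd=2) and n = 25 here are made up for illustration.)

## Simulating the sampling distribution of X-bar:
## 10,000 samples of size n=25 from a N(mu=5, sd=2) population.
> xbars = replicate(10000, mean(rnorm(25, mean=5, sd=2)))
> mean(xbars)   ## close to mu = 5
> var(xbars)    ## close to sigma^2/n = 4/25
> hist(xbars)   ## approximately normal, as the theory says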

But what do we do if our needed ‘conditions’ are not met? One option is bootstrapping.


Bootstrapping: Example for inference on ρ (population correlation)

• Average values for GPA and LSAT scores for students admitted to n = 15 law schools in 1973 (a random sample of law schools).

School  LSAT   GPA
  1     576   3.39
  2     635   3.30
  3     558   2.81
  4     579   3.03
  5     666   3.44
  6     580   3.07
  7     555   3.00
  8     661   3.43
  9     651   3.36
 10     605   3.13
 11     653   3.12
 12     575   2.74
 13     545   2.76
 14     572   2.88
 15     594   2.96

– Can we make a confidence interval for ρ?
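(The R code that follows assumes these data are in a data frame named law with columns LSAT and GPA. One way to set that up is to enter the table above by hand; the same data also ship as ‘law’ in the bootstrap package.)

## Enter the law-school data as a data frame called ‘law’:
> law = data.frame(
    LSAT = c(576, 635, 558, 579, 666, 580, 555, 661,
             651, 605, 653, 575, 545, 572, 594),
    GPA  = c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43,
             3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96))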


– Point estimate for ρ is the sample correlation r:

r = 0.7766
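(In R, this is just cor() applied to the two columns:)

> cor(law$LSAT, law$GPA)   ## r = 0.7766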

[Scatterplot of GPA (vertical axis) versus LSAT (horizontal axis) for the 15 schools]

– Classical inference on ρ depends on X and Y having a bivariate normal distribution.

– In the above sample, there are a few outliers suggesting this assumption may be violated.

– We'll use the bootstrap approach to do statistical inference instead.


– In the bootstrap method, we generate many possible sample data sets (based on the single data set we actually observed), and from each one, we calculate an estimate r. So, we'll have an r*_b calculated from each hypothetical (or bootstrap) sample b.

– This provides an empirical ‘sampling distribution’ for the estimator (by empirical we mean based on the observed data):

r*_1, r*_2, r*_3, . . . , r*_B

This distribution of the above values will give us an idea of the variability (and sampling distribution) of our estimator r.

– We will assume these n = 15 observations are a representative sample from the population of all law schools (the assumption we make).


– Repeating steps (i) & (ii) below B times generates an empirical ‘sampling distribution’ for r:

(i) Resample from the original sample with replacement to create a bootstrap sample of the same size n = 15. One observation looks like x_i = (x_{1i}, x_{2i}).

(ii) From this bth bootstrap sample, calculate the estimate r*_b.

## Create a function to get 1 new bootstrapped r.
## Input arguments: n, data
## Output: r

> get.1.bootstrapped.r = function(n=15, data=law){
  ## Get indices of resampled observations:
  chosen = sample(1:n, replace=T)
  ## Get the new bootstrap sample:
  bootstrap.sample = data[chosen,]
  ## Calculate r:
  r = cor(bootstrap.sample$LSAT, bootstrap.sample$GPA)
  return(r)
  }


## Call the function:
> r.bootstrap = get.1.bootstrapped.r(n=15, data=law)
> r.bootstrap
[1] 0.953635

This particular bootstrap sample gave a fairly high correlation, r*_1 = 0.9536.

The bootstrap sample looked like:

School  LSAT   GPA
  9     651   3.36
  4     579   3.03
  6     580   3.07
  3     558   2.81
  6     580   3.07
  8     661   3.43
  9     651   3.36
  4     579   3.03
 10     605   3.13
  8     661   3.43
  7     555   3.00
 13     545   2.76
  3     558   2.81
 15     594   2.96
  6     580   3.07


## Call the function again (get a new bootstrap sample):
> r.bootstrap = get.1.bootstrapped.r(n=15, data=law)
> r.bootstrap
[1] 0.7605992

This particular sample gave a lower correlation, r*_2 = 0.7606.

The bootstrap sample looked like:

School  LSAT   GPA
 13     545   2.76
  7     555   3.00
 11     653   3.12
  4     579   3.03
  5     666   3.44
  5     666   3.44
 12     575   2.74
 15     594   2.96
 11     653   3.12
 12     575   2.74
  3     558   2.81
 13     545   2.76
  4     579   3.03
  9     651   3.36
  1     576   3.39


– Repeat the procedure B times to get the bootstrap estimates: r*_1, r*_2, . . . , r*_B.

> B = 1000
## Allocate space to save the B estimates:
> bootstrapped.r.values = rep(0, B)
> for (i in 1:B){bootstrapped.r.values[i] =
    get.1.bootstrapped.r(n=15, data=law)}
> hist(bootstrapped.r.values, col="grey80", n=16)
> abline(v=0.7766, col="red", lwd=2)
> box()
> legend(.3, 70, "Observed r", lwd=2, col="red")

[Histogram of the 1000 bootstrapped r values, with a red vertical line marking the observed r = 0.7766]

– This distribution tells us the empirical sampling distribution of our estimator r.


– We can use it to make a 90% empirical confidence interval for ρ.

– Order the r*_b values, and use the 5% quantile and the 95% quantile as the lower and upper end points...

> quantile(bootstrapped.r.values, 0.05)
       5%
0.5373854
> quantile(bootstrapped.r.values, 0.95)
      95%
0.9506562

The 90% empirical CI for ρ is: [0.5374, 0.9507]

[Histogram of the bootstrapped r values with the lower 5% tail (50 observations) and the upper 5% tail marked]


– We were able to create a CI without any assumptions on distribution, i.e. nonparametrically (very useful in many situations).

– This only works if the original sample is representative of the original population.

– Recall what sampling variability of an estimator is... BEFORE we collect our data, the estimator is a random variable because its value depends on the sample chosen.

– The bootstrap method uses resampling to get a handle on this variability (since we can't get at it theoretically because we don't have normality).

– We should resample from the n observations in the same manner as the original data were sampled (here, we had a simple random sample).


• There is an R package that will do bootstrapping for us, called boot...

R contributed packages: http://cran.r-project.org/web/packages/

Some of these packages may be available from the pull-down menu under Install Packages... (‘boot’ is already pre-installed in SH 41; use > library(boot)).

Installing a package from a local ‘mirror’ (i.e. a package provider):
> install.packages("boot")
Then choose a ‘mirror’; there is an Iowa one... IA.

To see the list of available packages from a ‘mirror’...
> available.packages()
Then choose a ‘mirror’.

You can also download a package to your local drive and load it from the pull-down menu under the local drive.


• After installing the boot package...

> library(boot)

## The main function is called boot():
> ?boot

## The form is ‘boot(data, compute.statistic, R)’,
## where the first argument is the data, and the
## second argument is a function computing the
## statistic of interest. ‘compute.statistic’ must
## take two inputs, (data, indices), where the data
## will be put in order by ‘indices’. For the original
## data, indices = 1:n. R is the number of bootstrap
## samples requested.

## We define the ‘compute.statistic’ function:
> get.r = function(data, indices){
  ## order the rows of the data by ‘indices’:
  data = data[indices,]
  ## Calculate r:
  r = cor(data$LSAT, data$GPA)
  return(r)
  }

## Get a bootstrap confidence interval from
## 1000 bootstrap samples using boot():
> boot.out = boot(law, get.r, R=1000)
> boot.ci(boot.out, conf=0.90, type="perc")


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = boot.out, conf = 0.9, type = "perc")

Intervals :
Level     Percentile
90%   ( 0.5399,  0.9538 )
Calculations and Intervals on Original Scale

This is quite close to the one we made earlier as [0.5374, 0.9507].
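(Aside: the object returned by boot() stores the observed statistic in $t0 and the bootstrap replicates in $t, so the ‘by hand’ percentile interval can be recomputed from it.)

> boot.out$t0                           ## r from the original sample
> quantile(boot.out$t, c(0.05, 0.95))   ## the percentile CI by hand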


Bootstrapping Regression Models

• You can use this same procedure for inference on βj in a regression model.

• Example: Anscombe data set: U.S. State Public-School Expenditures in 1970

VARIABLES

education -- Per-capita education expenditures, $

income -- Per-capita income, $

> attach(Anscombe)
> plot(income, education)

[Scatterplot of education (vertical axis) versus income (horizontal axis)]


> lm.out = lm(education ~ income)
> plot(lm.out$fitted.values, lm.out$residuals, pch=16)
> qqnorm(lm.out$residuals, pch=16)

[Plot of residuals versus fitted values, and normal Q-Q plot of the residuals]

> lm.out$coefficients
(Intercept)      income
17.71003077  0.05537594

——————————————————–

Use the bootstrap method to get a confidence interval on both regression coefficients.

Define a function for ‘boot’ to get the coefficients:

> get.coeffic = function(data, indices){
  data = data[indices,]
  lm.out = lm(education ~ income, data=data)
  return(lm.out$coefficients)
  }


Call the function once for the original data:

> n = nrow(Anscombe)
> get.coeffic(Anscombe, 1:n)
(Intercept)      income
17.71003077  0.05537594

These estimates are from the original data.

——————————————————–

Using ‘boot’ to get 1000 bootstrap estimates...

> boot.out = boot(Anscombe, get.coeffic, R=1000)

## The 1000 bootstrap regression estimates are in...
> boot.out$t
                [,1]       [,2]
   [1,]  17.52784692 0.05423758
   [2,]  16.55435429 0.05580714
   [3,] -51.16559342 0.07918406
   [4,]  37.51375190 0.04516923
   [5,]  -6.07968045 0.06205171
    .         .          .
    .         .          .
 [999,]  22.27582056 0.05379981
[1000,] 110.27523283 0.02707580


Empirical 95% CI for β1:

> boot.ci(boot.out, index=2, type="perc", conf=0.95)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = boot.out, type = "perc", index = 2)

Intervals :
Level     Percentile
95%   ( 0.0359,  0.0776 )
Calculations and Intervals on Original Scale

Empirical 95% CI for β0:

> boot.ci(boot.out, index=1, type="perc", conf=0.95)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = boot.out, type = "perc", index = 1)

Intervals :
Level     Percentile
95%   (-51.41,  78.34 )
Calculations and Intervals on Original Scale


Joint sampling distribution of (β̂0, β̂1)

> dataEllipse(boot.out$t[,2], boot.out$t[,1],
    xlab="slope", ylab="intercept", levels=c(.5,.95,.99))

[Scatterplot of the 1000 bootstrapped (slope, intercept) pairs with 50%, 95%, and 99% probability ellipses]

{dataEllipse() is in the car library. It superimposes normal-probability contours over a scatterplot of the data.}

Using the 95% CI for β1 to test the hypothesis H0 : β1 = 0 at the α = 0.05 level...

Intervals :
Level     Percentile
95%   ( 0.0359,  0.0776 )

CI does not contain 0. We reject H0.

Using the 95% CI for β0 to test the hypothesis H0 : β0 = 0 at the α = 0.05 level...

Intervals :
Level     Percentile
95%   (-51.41,  78.34 )

CI contains 0. We fail to reject H0.

——————————————————

We can use the average of the 1000 bootstrap estimates as our estimated regression coefficients:

> apply(boot.out$t, 2, mean)
[1] 17.33916141  0.05551061

——————————————————

Comparing to the parametric results from the original sample:

> summary(lm.out)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.710031  28.873840   0.613    0.542
income       0.055376   0.008823   6.276 8.76e-08 ***

(Recall the original wasn't terribly non-normal.)


Comments

• We can use bootstrapping to do statistical inference when the assumptions of normality and/or constant variance are violated.

• The sample must be a representative sample from the population.

• The bootstrap approximation becomes more accurate for larger samples (larger n).

• There are some bias-correction elements for bootstrapping, which we leave for further study (one example from the boot package is shown below).
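(For instance, boot.ci() can also produce a bias-corrected and accelerated (BCa) interval; shown here for the slope, reusing boot.out from the regression example above.)

## A BCa interval for the slope (one common bias correction):
> boot.ci(boot.out, conf=0.95, type="bca", index=2)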

• You can also use the distribution of bootstrap estimates to calculate a bootstrap estimate of the standard error of θ̂:

SE*(θ̂) = √[ Σ_{b=1}^B (θ̂*_b − θ̄*)² / (B − 1) ],  with  θ̄* = Σ_{b=1}^B θ̂*_b / B
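(In R this is just the sample standard deviation of the stored replicates, since sd() divides by B − 1; e.g., for the slope in the regression example above:)

> sd(boot.out$t[,2])   ## bootstrap SE of the slope estimate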


• The type of bootstrapping of regression models described here (random resampling of the observations) considers the regressors (the X's) to be random, not fixed.

• If you have fixed X-values (as in an experimental design), you can use a parametric bootstrap and/or residual resampling, which takes this into account (a sketch of residual resampling follows).
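(A minimal sketch of residual resampling, assuming the lm.out fit to the Anscombe data from above; the name get.coeffic.fixedX is made up here. The X's and fitted values stay fixed; only the residuals are resampled to build new responses.)

> fit = lm.out$fitted.values
> res = lm.out$residuals
> get.coeffic.fixedX = function(data, indices){
    ## new responses from resampled residuals, X held fixed:
    y.star = fit + res[indices]
    lm(y.star ~ data$income)$coefficients
    }
> boot.out.fixedX = boot(Anscombe, get.coeffic.fixedX, R=1000)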


Bootstrapping, the general process...

1. We're interested in doing inference on the population parameter θ using the estimator θ̂.

2. We have a representative sample of n observations from the population.

3. We resample with replacement from the original sample of size n to create a bootstrap sample of the same size n.

4. From the bth bootstrap sample, calculate the estimate θ̂*_b.

5. Repeat steps 3 & 4 B times to generate an empirical sampling distribution for θ̂:
{θ̂*_1, θ̂*_2, . . . , θ̂*_B}

6. Use the distribution of the θ̂*_b's to estimate properties of the sampling distribution of θ̂.
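(The whole recipe can be written as one generic R template; bootstrap.distn, theta.hat.fn, and my.data are made-up names for illustration.)

## Generic bootstrap template: theta.hat.fn(d) returns the
## estimate computed from a data frame d.
> bootstrap.distn = function(my.data, theta.hat.fn, B){
    n = nrow(my.data)
    replicate(B, theta.hat.fn(my.data[sample(1:n, replace=TRUE),]))
    }
## e.g., the law-school correlation again:
> r.values = bootstrap.distn(law, function(d) cor(d$LSAT, d$GPA), B=1000)
> quantile(r.values, c(0.05, 0.95))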
