Statistical Inference (University of Toronto course notes: butler/c32/notes/inference.pdf)
Statistical Inference
R packages used in this chapter
library(ggplot2)
library(dplyr)
##
## Attaching package: ’dplyr’
## The following objects are masked from ’package:stats’:
##
## filter, lag
## The following objects are masked from ’package:base’:
##
## intersect, setdiff, setequal, union
2 / 166
Statistical Inference and Science
I Previously: descriptive statistics. “Here are data; what do they say?”.
I May need to take some action based on information in data.
I Or want to generalize beyond data (sample) to larger world (population).
I Science: first guess about how world works.
I Then collect data, by sampling.
I Is guess correct (based on data) for whole world, or not?
3 / 166
Sample data are imperfect
I Sample data never entirely represent what you’re observing.
I There is always random error present.
I Thus you can never be entirely certain about your conclusions.
I The Toronto Blue Jays’ average home attendance in part of the 2015 season was 25,070 (up to May 27 2015, from baseball-reference.com).
I Does that mean the attendance at every game was exactly 25,070? Certainly not. Actual attendance depends on many things, eg.:
I how well the Jays are playing
I the opposition
I day of week
I weather
I random chance
4 / 166
Summarizing the attendances
jays=read.csv("jays15-home.csv",header=T)
summary(jays$attendance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14180 16400 21200 25070 33090 48410
Attendances varied a lot, from a low of 14180 to a high of 48410 (opening day).
Boxplot over. I used “base graphics” (not ggplot) for this one.
5 / 166
Attendance boxplot
boxplot(jays$attendance)
[Boxplot of attendance; vertical axis runs from about 15,000 to 45,000]
6 / 166
Comments
I Distribution somewhat skewed to right (but no outliers).
I Attendances have substantial variability.
I These are a sample of “all possible games” (or maybe “all possible games played in April and May”). What can we say about mean attendance in all possible games based on this evidence?
I Think about:
I Confidence interval
I Hypothesis test.
7 / 166
Confidence interval for mean
I Should account for:
I we only observed 25 out of all possible games (sample size)
I attendances we did observe were highly variable (sample SD)
I we observed mean attendance of 25,070 (sample mean).
I Confidence level: how likely we want the interval to be to contain the mean attendance at “all possible games”. Typically 95%.
I Formula, with sample mean x̄, sample SD s and sample size n:
x̄ ± t∗ s/√n.
t∗ is number from t-table that accounts for confidence level (eg. 95%) and sample size through degrees of freedom n − 1.
I For 25 − 1 = 24 df and 95% confidence, t∗ = 2.064.
8 / 166
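The formula on this slide can be checked numerically. A sketch in Python (an assumption on my part: the course uses R and SAS, but scipy’s t distribution gives the same t∗), using the attendance summaries that appear a couple of slides later (n = 25, mean 25070.16, SD 11006.69):

```python
import math
from scipy import stats

# Attendance summaries quoted later in the chapter (not new data)
n, xbar, s = 25, 25070.16, 11006.69
tstar = stats.t.ppf(0.975, df=n - 1)  # t* for 95% confidence, n - 1 = 24 df
m = tstar * s / math.sqrt(n)          # margin of error: t* s / sqrt(n)
print(xbar - m, xbar + m)             # matches t.test's 20526.82 to 29613.50
```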
How do I get t∗ or z∗ for CI?
I z∗ first, for 95% interval:
I Want middle 95%, z-values that enclose that.
I 2.5% less than −z∗, 97.5% less than z∗.
I In R, qnorm gives value of z with input probability less than it:
qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964
I So, for 95% z-confidence interval, z∗ = 1.96.
9 / 166
t∗ for 95% confidence interval
I “Half the leftovers” again.
I Use qt rather than qnorm.
I Two things in qt: prob. of less, plus degrees of freedom.
I 95% t∗ with 24 df:
qt(0.975,24)
## [1] 2.063899
I or same with 10 df:
qt(0.975,10)
## [1] 2.228139
bigger, because fewer df.
I or t∗ for 90% with 10 df:
qt(0.95,10)
## [1] 1.812461
smaller, as you’d expect.
10 / 166
CI for mean attendance
I First, use R as calculator.
I Sample mean and SD:
att=jays$attendance
mean(att)
## [1] 25070.16
sd(att)
## [1] 11006.69
I Work out “margin of error” (plus/minus part) first:
m=2.064*11007/sqrt(25)
m
## [1] 4543.69
I then get limits of CI:
xbar=25070
xbar-m
## [1] 20526.31
xbar+m
## [1] 29613.69
11 / 166
Comments
I Interval from about 20,000 to about 30,000.
I Not estimating mean attendance well at all!
I Generally want confidence interval to be shorter, which happens if:
I SD smaller
I sample size bigger
I confidence level smaller
I Last one is a cheat, really, since reducing confidence level increases chance that interval won’t contain pop. mean at all!
12 / 166
Using t.test
I Foregoing illustrated how you would do it by hand.
I In practice, t.test function does it all:
t.test(att)
##
## One Sample t-test
##
## data: att
## t = 11.389, df = 24, p-value = 3.661e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 20526.82 29613.50
## sample estimates:
## mean of x
## 25070.16
13 / 166
Or, 90% CI
I by including a value for conf.level:
t.test(att,conf.level=0.90)
##
## One Sample t-test
##
## data: att
## t = 11.389, df = 24, p-value = 3.661e-11
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 21303.93 28836.39
## sample estimates:
## mean of x
## 25070.16
14 / 166
Hypothesis test
I CI answers question “what is the mean?”
I Might have a value µ in mind for the mean, and question “Is the mean equal to µ, or not?”
I For example, 2014 average attendance was 29,327.
I Second question answered by hypothesis test.
I Value being assessed goes in null hypothesis: here, H0 : µ = 29,327.
I Alternative hypothesis says how null might be wrong, eg. Ha : µ ≠ 29,327.
I Assess evidence against null. If that evidence strong enough, reject null hypothesis; if not, fail to reject null hypothesis (sometimes retain null).
I Note asymmetry between null and alternative, and utter absence of word “accept”.
15 / 166
α and errors
I Hypothesis test ends with decision:
I reject null hypothesis
I do not reject null hypothesis.
I but decision may be wrong:

                 Decision
Truth            Do not reject    Reject null
Null true        Correct          Type I error
Null false       Type II error    Correct

I Either type of error is bad, but for now focus on controlling Type I error: write α = P(type I error), and devise test so that α small, typically 0.05.
I That is, if null hypothesis true, have only small chance to reject it (which would be a mistake).
I Worry about type II errors later (when we consider power of test).
16 / 166
Why 0.05? This man.
Responsible for:
I analysis of variance
I Fisher information
I Linear discriminant analysis
I Fisher’s z-transformation
I Fisher-Yates shuffle
I Behrens-Fisher problem
Sir Ronald A. Fisher, 1890–1962.
17 / 166
Why 0.05? (2)
I From The Arrangement of Field Experiments (1926):
[two quoted excerpts shown as images]
18 / 166
Test statistic: going from data to decision
I “How far away from null hypothesis is data?”
I In testing for mean, statistic is
t = (x̄ − µ)/(s/√n)
I and for our baseball attendance data
t.stat=(25070-29327)/(11000/sqrt(25))
t.stat
## [1] -1.935
19 / 166
P-value
I The probability of observing a test statistic value as extreme or more extreme than that observed, if the null hypothesis is true.
I “More extreme” depends on Ha: count both sides if two-sided, count only the proper side if one-sided.
I Our Ha was Ha : µ ≠ 29,327, two-sided.
20 / 166
t-table
I Look up 1.935 not −1.935.
I 1.71 < 1.935 < 2.06
I P-value between 0.05 and 0.10
I Or, exactly (×2: 2-sided):
2*pt(-1.93,24)
## [1] 0.06550769
I P-value not less than 0.05: do not reject null.
I No evidence of change in mean attendance.
21 / 166
Or, again, using t.test:
t.test(att, mu=29327)
##
## One Sample t-test
##
## data: att
## t = -1.9338, df = 24, p-value = 0.06502
## alternative hypothesis: true mean is not equal to 29327
## 95 percent confidence interval:
## 20526.82 29613.50
## sample estimates:
## mean of x
## 25070.16
I See test statistic −1.93, P-value 0.065.
I Again, do not reject null.
22 / 166
And in SAS
I created a spreadsheet with only attendances (since there were otherwise a lot of variables). The first line is a header, so start reading data on second.
data jays;
infile '/folders/myfolders/jays15-home-att.csv'
firstobs=2;
input attendance;
proc ttest h0=29327;
var attendance;
23 / 166
Output
N Mean Std Dev Std Err Minimum Maximum
25 25070.2 11006.7 2201.3 14184.0 48414.0
Mean 95% CL Mean Std Dev 95% CL Std Dev
25070.2 20526.8 29613.5 11006.7 8594.3 15312.0
DF t Value Pr > |t|
24 -1.93 0.0650
Same CI (20527 to 29614) as R, also same P-value 0.0650.
24 / 166
SAS, the long way
I To work from the original spreadsheet, we need to read in all the variables:
data jays;
infile '/folders/myfolders/jays15-home.csv'
dlm=',' dsd firstobs=2;
input row game date $ box $ team $ venue $ opp $
result $ runs oppruns innings wl $ position gb $
winner $ loser $ save $ gametime $ daynight $
attendance streak $;
proc print;
I Yes, that is one long input line, split.
I dlm=’,’ means data values separated by commas.
I dsd means treat two consecutive delimiters as a missing value (usually what you want for delimited data).
25 / 166
Results (some, tiny)
Obs row game date box team venue opp result runs oppruns innings wl
1 82 7 Monday, boxscore TOR TBR L 1 2 . 4-3
2 83 8 Tuesday, boxscore TOR TBR L 2 3 . 4-4
3 84 9 Wednesda boxscore TOR TBR W 12 7 . 5-4
4 85 10 Thursday boxscore TOR TBR L 2 4 . 5-5
5 86 11 Friday, boxscore TOR ATL L 7 8 . 5-6
6 87 12 Saturday boxscore TOR ATL W-wo 6 5 10 6-6
7 88 13 Sunday, boxscore TOR ATL L 2 5 . 6-7
8 89 14 Tuesday, boxscore TOR BAL W 13 6 . 7-7
9 90 15 Wednesda boxscore TOR BAL W 4 2 . 8-7
10 91 16 Thursday boxscore TOR BAL W 7 6 . 9-7
11 92 27 Monday, boxscore TOR NYY W 3 1 . 13-14
12 93 28 Tuesday, boxscore TOR NYY L 3 6 . 13-15
13 94 29 Wednesda boxscore TOR NYY W 5 1 . 14-15
14 95 30 Friday, boxscore TOR BOS W 7 0 . 15-15
15 96 31 Saturday boxscore TOR BOS W 7 1 . 16-15
16 97 32 Sunday, boxscore TOR BOS L 3 6 . 16-16
17 98 40 Monday, boxscore TOR LAA W 10 6 . 18-22
18 99 41 Tuesday, boxscore TOR LAA L 2 3 . 18-23
19 100 42 Wednesda boxscore TOR LAA L 3 4 . 18-24
20 101 43 Thursday boxscore TOR LAA W 8 4 . 19-24
21 102 44 Friday, boxscore TOR SEA L 3 4 . 19-25
22 103 45 Saturday boxscore TOR SEA L 2 3 . 19-26
23 104 46 Sunday, boxscore TOR SEA W 8 2 . 20-26
24 105 47 Monday, boxscore TOR CHW W 6 0 . 21-26
25 106 48 Tuesday, boxscore TOR CHW W-wo 10 9 . 22-26
Obs position gb winner loser save gametime daynight attendance streak
1 2 1 Odorizzi Dickey Boxberge 2:30 N 48414 -
2 3 2 Geltz Castro Jepsen 3:06 N 17264 --
3 2 1 Buehrle Ramirez 3:02 N 15086 +
4 4 1.5 Archer Sanchez Boxberge 3:00 N 14433 -
5 4 2.5 Martin Cecil Grilli 3:09 N 21397 --
6 3 1.5 Cecil Marimon 2:41 D 34743 +
7 4 1.5 Miller Norris Grilli 2:41 D 44794 -
8 2 2 Buehrle Norris 2:53 N 14184 +
9 2 1 Sanchez Jimenez Castro 2:36 N 15606 ++
10 1 Tied Hutchiso Tillman Castro 2:36 N 18581 +++
11 4 3.5 Dickey Martin Cecil 2:18 N 19217 +
12 5 4.5 Pineda Estrada Miller 2:54 N 21519 -
13 3 3.5 Buehrle Sabathia 2:30 N 21312 +
14 3 4 Sanchez Miley 2:41 N 30430 ++
15 3 3 Hutchiso Kelly 3:12 D 42917 +++
16 3 4 Buchholz Dickey Uehara 2:38 D 42419 -
17 5 4.5 Osuna Morin 3:28 D 29306 +
18 5 4.5 Santiago Sanchez Street 2:32 N 15062 -
19 5 4.5 Weaver Hutchiso Street 2:36 N 16402 --
20 5 4.5 Dickey Shoemake 2:22 N 19014 +
21 5 5.5 Hernande Estrada Rodney 2:30 N 21195 -
22 5 5.5 Paxton Buehrle Rodney 2:25 D 33086 --
23 5 4.5 Sanchez Walker 2:50 D 37929 +
24 5 3.5 Hutchiso Noesi 2:10 N 15168 ++
25 4 3 Delabar Robertso 3:12 N 17276 +++
26 / 166
Results (a few, but less tiny)
proc print;
var date attendance daynight;
Obs date attendance daynight
1 Monday, 48414 N
2 Tuesday, 17264 N
3 Wednesda 15086 N
4 Thursday 14433 N
5 Friday, 21397 N
6 Saturday 34743 D
7 Sunday, 44794 D
8 Tuesday, 14184 N
9 Wednesda 15606 N
10 Thursday 18581 N
11 Monday, 19217 N
12 Tuesday, 21519 N
13 Wednesda 21312 N
14 Friday, 30430 N
15 Saturday 42917 D
16 Sunday, 42419 D
17 Monday, 29306 D
18 Tuesday, 15062 N
19 Wednesda 16402 N
20 Thursday 19014 N
21 Friday, 21195 N
22 Saturday 33086 D
23 Sunday, 37929 D
24 Monday, 15168 N
25 Tuesday, 17276 N
27 / 166
Comments
I Need to know names and types for all variables to read in data (vs. R, just read in data frame).
I For text variables, just first 8 characters read in. (To change, see informat later.)
I daynight is D for a day game and N for a night game. How do attendances compare for these?
proc sgplot;
vbox attendance / category=daynight;
28 / 166
The boxplot
[Boxplot of attendance by daynight]
29 / 166
Comments
I Attendances on average much higher for day games than night ones.
I Why?
I What is that upper outlier in the night games?
30 / 166
Another example: learning to read
I You devised new method for teaching children to read.
I Guess it will be more effective than current methods.
I To support this guess, collect data.
I Want to generalize to “all children in Canada”.
I So take random sample of all children in Canada.
I Or, argue that sample you actually have is “typical” of all children in Canada.
I Randomization: whether or not a child is in the sample has nothing to do with anything else about that child.
I Aside: if your new method good for teaching struggling children to read, then “all kids” is “all kids having trouble learning to read”, and you take a sample of those.
31 / 166
The data (some), in SAS
I Data in file drp.txt with header line, group then reading test score.
data kids;
infile '/folders/myfolders/drp.txt' firstobs=2;
input group $ score;
proc print;
Obs group score
1 t 24
2 t 61
3 t 59
4 t 46
5 t 43
6 t 44
7 t 52
8 t 43
9 t 58
10 t 67
11 t 62
12 t 57
13 t 71
14 t 49
15 t 54
16 t 43
17 t 53
18 t 57
19 t 49
20 t 56
21 t 33
22 c 42
23 c 33
24 c 46
25 c 37
26 c 43
27 c 41
28 c 10
29 c 42
30 c 55
31 c 19
32 c 17
33 c 55
34 c 26
35 c 54
36 c 60
37 c 28
38 c 62
39 c 20
40 c 53
41 c 48
42 c 37
43 c 85
44 c 42
32 / 166
Analysis
I Groups labelled c for “control” and t for “treatment”.
I Start with summaries (group means) and plot (boxplot).
I No pairing, matching: might compare means with two-sample t-test.
I For test, need approx. normality, but don’t need equal variability.
I Use summaries to decide if test reasonable.
33 / 166
Comparing means
proc means;
class group;
var score;
The MEANS Procedure
Analysis Variable : score
N
group Obs N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------
c 23 23 41.5217391 17.1487332 10.0000000 85.0000000
t 21 21 51.4761905 11.0073568 24.0000000 71.0000000
-------------------------------------------------------------------------------
34 / 166
Boxplots
proc sgplot;
vbox score / category=group;
[Boxplot of score by group]
35 / 166
Comments
I Groups not actually same size (maybe 2 kids had to drop out).
I Means a fair bit different, treatment mean higher.
I But a lot of variability, so groups do overlap.
I Standard deviations somewhat different too.
I Biggest threat to normality is outliers, none here.
I Both distributions not far off symmetric.
I t-test should be good enough.
36 / 166
The t-test
proc ttest side=l;
var score;
class group;
The TTEST Procedure
Variable: score
Method Variances DF t Value Pr < t
Pooled Equal 42 -2.27 0.0143
Satterthwaite Unequal 37.855 -2.31 0.0132
plus a lot more output.
37 / 166
Comments and Conclusions
I One-sided test (looking for improvement). side can be l (lower), u (upper) or 2 (two-sided, can be omitted). This is l because control group first in alphabetical order.
I Right t-test is Satterthwaite (does not assume equal variability)
I P-value 0.0132 < 0.05: there is increase in reading scores.
I Should not use pooled test, because SDs not close; even so, result very similar (P-value 0.0143).
I One-sided test doesn’t give (regular) CI for difference in means. To get that, repeat analysis without side=l.
38 / 166
Doing it with R
I I skip the boxplots (you do it).
I c (control) before t (treatment) alphabetically, so proper alternative is “less”.
I R does Satterthwaite test by default (as before: new reading program really helps):
kids=read.table("drp.txt",header=T)
t.test(score~group,data=kids,alternative="less")
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.01319
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -2.691293
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
39 / 166
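As a cross-check (not part of the original slides), the same Welch test can be reproduced in Python with scipy, typing in the scores from the drp.txt listing a few slides back; with treatment first and alternative “greater”, the statistic flips sign but the P-value matches:

```python
import numpy as np
from scipy import stats

# Scores transcribed from the drp.txt listing earlier in the chapter
treatment = np.array([24, 61, 59, 46, 43, 44, 52, 43, 58, 67, 62,
                      57, 71, 49, 54, 43, 53, 57, 49, 56, 33])
control = np.array([42, 33, 46, 37, 43, 41, 10, 42, 55, 19, 17, 55,
                    26, 54, 60, 28, 62, 20, 53, 48, 37, 85, 42])

# Welch (unequal-variance) t-test; one-sided, treatment mean larger
res = stats.ttest_ind(treatment, control, equal_var=False,
                      alternative="greater")
print(res.statistic, res.pvalue)  # t = 2.3109, p = 0.0132, as in R
```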
The pooled t-test
I Pooled (equal variances) test this way:
t.test(score~group,data=kids,alternative="less",
var.equal=T)
##
## Two Sample t-test
##
## data: score by group
## t = -2.2666, df = 42, p-value = 0.01431
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -2.567497
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
I Similar P-value to Satterthwaite test (and same results as SAS).
40 / 166
Two-sided test; CI
I To do 2-sided test, leave out alternative:
t.test(score~group,data=kids)
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.67588 -1.23302
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
I Also gives CI: new reading program increases average scores by somewhere between about 1 and 19 points.
41 / 166
Logic of testing
I Work out what would happen if null hypothesis were true.
I Compare to what actually did happen.
I If these are too far apart, conclude that null hypothesis is not true after all. (Be guided by P-value.)
As applied to our reading programs:
I If reading programs equally good, expect to see a difference in means close to 0.
I Closeness quantified eg. by t.
I Mean reading score was 10 higher for new program.
I Difference of 10 was unusually big (P-value small from t-test and randomization test). So conclude that new reading program is effective.
Nothing here about what happens if null hypothesis is false. This is power and type II error probability.
42 / 166
Errors in testing
What can happen:
                 Decision
Truth            Do not reject    Reject null
Null true        Correct          Type I error
Null false       Type II error    Correct
Tension between truth and decision about truth (imperfect).
I Prob. of type I error denoted α. Usually fix α, eg. α = 0.05.
I Prob. of type II error denoted β. Determined by the planned experiment. Low β good.
I Prob. of not making type II error called power (= 1 − β). High power good.
43 / 166
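The claim that α is the probability of a type I error can be illustrated by simulation. A sketch in Python rather than R, so it can be checked mechanically; the population mean 10 and SD 3 below are arbitrary choices for illustration, not from the slides. Draw many samples with the null hypothesis true and count how often the test (wrongly) rejects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials, rejections = 2000, 0
for _ in range(trials):
    # Null hypothesis H0: mu = 10 is TRUE here: population mean really is 10
    x = rng.normal(loc=10, scale=3, size=25)
    if stats.ttest_1samp(x, popmean=10).pvalue < 0.05:
        rejections += 1  # a type I error
print(rejections / trials)  # close to alpha = 0.05
```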
Power
I Suppose H0 : θ = 10, Ha : θ 6= 10 for some parameter θ.
I Suppose H0 wrong. What does that say about θ?
I Not much. Could have θ = 11 or θ = 8 or θ = 496. In each case, H0 wrong.
I How likely a type II error is depends on what θ is:
I If θ = 496, should be able to reject H0 : θ = 10 even for small sample, so β should be small (power large).
I If θ = 11, might have hard time rejecting H0 even with large sample, so β would be larger (power smaller).
I Power depends on true parameter value, and on sample size.
I So we play “what if”: “if θ were 11 (or 8 or 496), what would power be?”.
44 / 166
Figuring out power
I Time to figure out power is before you collect any data, as part of planning process.
I Need to have idea of what kind of departure from null hypothesis of interest to you, eg. average improvement of 5 points on reading test scores. (Subject-matter decision, not statistical one.)
I Then, either:
I “I have this big a sample and this big a departure I want to detect. What is my power for detecting it?”
I “I want to detect this big a departure with this much power. How big a sample size do I need?”
I R or SAS.
45 / 166
Calculating power for a t-test
Think about reading programs again. Need to specify:
I What kind of t-test (here 2-sample, not 1-sample or paired)
I What kind of Ha (here one-sided)
I What the population SD is (usually have to guess this). Here the sample SDs were 11 and 17, so 15 seems a fair guess (same for each group).
I What size departure from null (here 5 points).
46 / 166
22 kids/group, difference 5
power.t.test(n=22,delta=5,sd=15,type="two.sample",
alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 22
## delta = 5
## sd = 15
## sig.level = 0.05
## power = 0.2887273
## alternative = one.sided
##
## NOTE: n is number in *each* group
This power (29%) way too small.
47 / 166
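The 29% figure can also be approximated by simulation (a Python sketch, not from the slides; scipy’s one-sided pooled two-sample t stands in for R’s): generate many pairs of samples with a true difference of 5 and count how often the test rejects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, delta, sd = 22, 5, 15
trials, hits = 4000, 0
for _ in range(trials):
    a = rng.normal(0, sd, n)       # "control", mean 0
    b = rng.normal(delta, sd, n)   # "treatment", true mean 5 higher
    # pooled two-sample t (scipy's default), matching power.t.test
    if stats.ttest_ind(b, a, alternative="greater").pvalue < 0.05:
        hits += 1  # correctly detected the difference
print(hits / trials)  # near the 0.289 from power.t.test
```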
Detecting a larger difference
Larger difference, say 10 points under same conditions, should be easier to detect. Change delta=5 to delta=10:
power.t.test(n=22,delta=10,sd=15,type="two.sample",
alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 22
## delta = 10
## sd = 15
## sig.level = 0.05
## power = 0.7020779
## alternative = one.sided
##
## NOTE: n is number in *each* group
Much better, power around 70%. Much likelier to detect this bigger difference.
48 / 166
Increasing the sample size
Making n bigger should improve power. For detecting difference of 5, increase sample size from 22 in each group to 40:
power.t.test(n=40,delta=5,sd=15,type="two.sample",alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 40
## delta = 5
## sd = 15
## sig.level = 0.05
## power = 0.4336523
## alternative = one.sided
##
## NOTE: n is number in *each* group
Power went up from 29% to 43%. Often need much bigger sample size to make much difference.
49 / 166
Making a graph of power
I “Power curve”.
I Two ways:
I Keep sample size fixed, change true difference between means
I Keep difference fixed, change sample size.
I Feed power.t.test a vector of true differences or sample sizes. Get back vector of power values:
my.del=seq(0,20,2)
mypower=power.t.test(n=22,delta=my.del,sd=15,
type="two.sample",alternative="one.sided")
mypower$power
## [1] 0.0500000 0.1131924 0.2192775 0.3670836 0.5380092 0.7020779 0.8328057
## [8] 0.9192728 0.9667502 0.9883915 0.9965808
50 / 166
Fixing up “list”; making scatterplot
I To plot the power and my.del values, we need to make a data frame of them:
power.df=data.frame(diff=my.del,power=mypower$power)
I I called the difference in means diff to remind me what it was.
I Our plot will be a scatterplot. Set up ggplot in the usual way, then:
I geom_point() will plot the x-points against the y-points.
I geom_line() joins points by lines (helpful here).
51 / 166
The power curve
ggplot(power.df,aes(x=diff,y=power))+geom_point()+geom_line()
[Power curve: power rises from 0.05 at diff = 0 to nearly 1 at diff = 20]
52 / 166
Embellishing the power curve
I The largest the power can be is 1 (the test correctly rejects the null every time).
I If two means actually equal (diff is 0), the null hypothesis actually correct, and then the prob of rejecting it is 0.05. So power should be 0.05 at diff 0, bigger elsewhere.
I Add these as horizontal lines to graph.
I Uses geom_hline(yintercept=y).
I Construct graph first, then print (over):
gr=ggplot(power.df,aes(x=diff,y=power))+
geom_point()+geom_line()+
geom_hline(yintercept=0.05,linetype="dashed")+
geom_hline(yintercept=1,linetype="dashed")
53 / 166
The improved power curve
gr
[Same power curve, with dashed horizontal lines added at power 0.05 and 1]
54 / 166
Power curve for sample size, fixed difference of 5
Sample sizes, 10 to 100 in steps of 5:
my.n=seq(10,100,5)
mypower=power.t.test(n=my.n,delta=5,sd=15,
type="two.sample",alternative="one.sided")
mypower$power
## [1] 0.1768868 0.2254334 0.2710950 0.3145661 0.3560927 0.3957733 0.4336523
## [8] 0.4697559 0.5041065 0.5367294 0.5676549 0.5969194 0.6245648 0.6506381
## [15] 0.6751905 0.6982766 0.7199535 0.7402799 0.7593159
55 / 166
Setting up the power curve
my.df=data.frame(n=my.n,power=mypower$power)
gr=ggplot(my.df,aes(x=n,y=power))+
geom_point()+geom_line()+
geom_hline(yintercept=0.05,linetype="dashed")+
geom_hline(yintercept=1,linetype="dashed")
56 / 166
The power curve
gr
[Power curve against sample size: power rises from about 0.18 at n = 10 to about 0.76 at n = 100]
57 / 166
Finding sample size
I Instead of being given a sample size and finding a power, we can be given a power and asked for the sample size needed to achieve it.
I Eg. to detect a difference of 5 between means, how big a sample do I need to get 80% power?
I Leave out n= and put in power=0.80 instead; solves for n:
power.t.test(power=0.80,delta=5,sd=15,type="two.sample",
alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 111.9686
## delta = 5
## sd = 15
## sig.level = 0.05
## power = 0.8
## alternative = one.sided
##
## NOTE: n is number in *each* group
58 / 166
Comments
I To have 80% chance of correctly detecting difference of 5 in means, need 112 children in each group.
I Sample size calculations apt to give depressing results!
59 / 166
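The n = 112 answer can be verified with the noncentral t distribution that power.t.test uses internally (a Python sketch with scipy; the noncentrality formula δ/(s√(2/n)) is the standard one for a pooled two-sample test, supplied here as an assumption rather than quoted from the slides):

```python
import math
from scipy import stats

# Power of a one-sided pooled two-sample t-test with n = 112 per group
n, delta, sd, alpha = 112, 5, 15, 0.05
df = 2 * n - 2
ncp = delta / (sd * math.sqrt(2 / n))   # noncentrality parameter
tstar = stats.t.ppf(1 - alpha, df)      # one-sided critical value
power = 1 - stats.nct.cdf(tstar, df, ncp)
print(power)  # just over 0.80, consistent with n = 111.97 for 80% power
```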
SAS
SAS has proc power which does all the same stuff (a little more flexibly):
I Two-sample t for means: twosamplemeans.
I Satterthwaite test: diff_satt
I One-sided.
I Desired difference in means 5.
I SD in each group 15
I Total sample size 44 (22 in each group)
I Want to find power for this: power=..
60 / 166
SAS code
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
stddev=15
ntotal=44
power=.;
Note punctuation: twosamplemeans to power really all one line, so semicolon only after power=..
61 / 166
SAS output
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Standard Deviation 15
Total Sample Size 44
Null Difference 0
Nominal Alpha 0.05
Group 1 Weight 1
Group 2 Weight 1
Computed Power
Actual
Alpha Power
0.0498 0.288
Power is 0.29, as with R. Note that we didn’t need any data.
62 / 166
Power for several alternatives and sample sizes
If you put several values on a line, SAS does all combinations of them, eg. mean differences 5 and 10, sample sizes 22 and 40 per group:
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5 10
stddev=15
ntotal=44 80
power=.;
63 / 166
Output
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Standard Deviation 15
Null Difference 0
Nominal Alpha 0.05
Group 1 Weight 1
Group 2 Weight 1
Computed Power
Mean N Actual
Index Diff Total Alpha Power
1 5 44 0.0498 0.288
2 5 80 0.0499 0.433
3 10 44 0.0498 0.701
4 10 80 0.0499 0.905
64 / 166
Comments
I Consistent with R results before.
I Increasing sample size increases power a bit.
I Increasing true mean difference increases power dramatically.
65 / 166
Sample size calculation
To calculate sample size required, leave ntotal missing (put in a dot) and fill in power. Here, mean difference 5, desired power 0.80:
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
stddev=15
ntotal=.
power=0.80;
66 / 166
Results
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Standard Deviation 15
Nominal Power 0.8
Null Difference 0
Nominal Alpha 0.05
Group 1 Weight 1
Group 2 Weight 1
Computed N Total
Actual Actual N
Alpha Power Total
0.05 0.800 224
Need 224 children total, or 112 in each group. As R.
67 / 166
Or, more accurately. . .
I Didn’t actually have same-size groups or equal population SDs. Assumed them for R.
I SAS will allow different values per group.
I We had sample sizes 21 and 23, sample SDs 11 and 17 (use as population SDs).
I Unequal sample sizes usually decrease power, but smaller sample size with smaller SD actually better. Overall effect unclear.
I SAS: use groupstddevs and groupns, and vertical bars.
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
groupstddevs=11|17
groupns=21|23
power=.;
68 / 166
Results
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Group 1 Standard Deviation 11
Group 2 Standard Deviation 17
Group 1 Sample Size 21
Group 2 Sample Size 23
Null Difference 0
Nominal Alpha 0.05
Computed Power
Actual
Alpha Power
0.0499 0.309
Power actually went up a tiny bit.
69 / 166
Unequal sample sizes
To show effect of unequal sample sizes, go back to SDs both being 15, buthave very unequal sample sizes. What effect does that have on power?
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
stddev=15
groupns=10|34
power=.;
70 / 166
Results
Power for 22 in each group was 29%:
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Standard Deviation 15
Group 1 Sample Size 10
Group 2 Sample Size 34
Null Difference 0
Nominal Alpha 0.05
Computed Power
Actual
Alpha Power
0.0505 0.225
Unequal sample sizes bring power down to 22.5%.
71 / 166
Return to actual data: CI in SAS. . .
For difference between mean reading scores:
proc ttest;
var score;
class group;
The TTEST Procedure
Variable: score
group Method Mean 95% CL Mean Std Dev
c 41.5217 34.1061 48.9374 17.1487
t 51.4762 46.4657 56.4867 11.0074
Diff (1-2) Pooled -9.9545 -18.8176 -1.0913 14.5512
Diff (1-2) Satterthwaite -9.9545 -18.6759 -1.2330
group Method 95% CL Std Dev
c 13.2627 24.2715
t 8.4213 15.8954
Diff (1-2) Pooled 11.9981 18.4947
Diff (1-2) Satterthwaite
72 / 166
Comments
I Same code as for test, look at different part of output.
I Look under 95% CL Mean and Satterthwaite line, interval from −18.68 to −1.23.
I Control minus treatment, so new reading method better on average by between 1 and 19 points.
I Would a 90% CI be longer or shorter than a 95% one? Code next page. Put appropriate alpha on proc ttest line.
73 / 166
90% CI
proc ttest alpha=0.10;
var score;
class group;
The TTEST Procedure
Variable: score
group Method Mean 90% CL Mean Std Dev
c 41.5217 35.3816 47.6618 17.1487
t 51.4762 47.3334 55.6190 11.0074
Diff (1-2) Pooled -9.9545 -17.3414 -2.5675 14.5512
Diff (1-2) Satterthwaite -9.9545 -17.2176 -2.6913
group Method 90% CL Std Dev
c 13.8098 22.8992
t 8.7834 14.9440
Diff (1-2) Pooled 12.3693 17.7758
Diff (1-2) Satterthwaite
74 / 166
Comparison
I 95%: −18.68 to −1.23.
I 90%: −17.22 to −2.69.
I 90% CI shorter, but has more chance to be wrong.
I Both CIs wide: we don’t know much about effect that new reading method has. (Solution: more data.)
75 / 166
Same thing in R
Default is 95% CI, get via t.test as before:
t.test(score~group,data=kids)
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.67588 -1.23302
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
Same as SAS.
76 / 166
90% CI in R
Add conf.level, which is the actual confidence level (unlike SAS’s alpha):
t.test(score~group,data=kids,conf.level=0.90)
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## -17.217609 -2.691293
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
Again, same as SAS.
77 / 166
Duality between confidence intervals and hypothesis tests
I Tests and CIs really do the same thing, if you look at them the right way. They are both telling you something about a parameter, and they use the same information from the data.
I Illustrate with R and SAS, R first.
78 / 166
95% CI (default)
group1=c(10,11,11,13,13,14,14,15,16)
group2=c(13,13,14,17,18,19)
t.test(group1,group2)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -2.0937, df = 8.7104, p-value = 0.0668
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.5625675 0.2292342
## sample estimates:
## mean of x mean of y
## 13.00000 15.66667
79 / 166
90% CI
t.test(group1,group2,conf.level=0.90)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -2.0937, df = 8.7104, p-value = 0.0668
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## -5.010308 -0.323025
## sample estimates:
## mean of x mean of y
## 13.00000 15.66667
80 / 166
Comparing results
Recall null here is H0 : µ1 − µ2 = 0. P-value 0.0668.
I 95% CI from −5.6 to 0.2, includes 0.
I 90% CI from −5.0 to −0.3, does not include 0.
I At α = 0.05, would not reject H0 since P-value > 0.05.
I At α = 0.10, would reject H0 since P-value < 0.10.
Not just coincidence. Following always true.
I Reject H0 at level α if and only if H0 value outside corresponding CI.
I Fail to reject H0 at level α if and only if H0 value inside corresponding CI.
I Idea: “plausible” parameter value inside CI, not rejected; “implausible” parameter value outside CI, rejected.
81 / 166
In SAS
I Some trickery involved because of data layout.
I Put all the data values in one column, with a second column labelling which group they are from:
data dual;
infile '/folders/myfolders/duality.txt';
input y group;
proc print;
Output over.
82 / 166
The data, small
Obs y group
1 10 1
2 11 1
3 11 1
4 13 1
5 13 1
6 14 1
7 14 1
8 15 1
9 16 1
10 13 2
11 13 2
12 14 2
13 17 2
14 18 2
15 19 2
83 / 166
Test and CI at default α = 0.05
proc ttest;
var y;
class group;
The TTEST Procedure
Variable: y
group Method Mean 95% CL Mean Std Dev
1 13.0000 11.4627 14.5373 2.0000
2 15.6667 12.8769 18.4564 2.6583
Diff (1-2) Pooled -2.6667 -5.2580 -0.0754 2.2758
Diff (1-2) Satterthwaite -2.6667 -5.5626 0.2292
group Method 95% CL Std Dev
1 1.3509 3.8315
2 1.6593 6.5198
Diff (1-2) Pooled 1.6499 3.6665
Diff (1-2) Satterthwaite
Method Variances DF t Value Pr > |t|
Pooled Equal 13 -2.22 0.0446
Satterthwaite Unequal 8.7104 -2.09 0.0668
84 / 166
Getting 90% CI
proc ttest alpha=0.10;
var y;
class group;
The TTEST Procedure
Variable: y
group Method Mean 90% CL Mean Std Dev
1 13.0000 11.7603 14.2397 2.0000
2 15.6667 13.4798 17.8535 2.6583
Diff (1-2) Pooled -2.6667 -4.7909 -0.5425 2.2758
Diff (1-2) Satterthwaite -2.6667 -5.0103 -0.3230
group Method 90% CL Std Dev
1 1.4365 3.4220
2 1.7865 5.5539
Diff (1-2) Pooled 1.7352 3.3806
Diff (1-2) Satterthwaite
Method Variances DF t Value Pr > |t|
Pooled Equal 13 -2.22 0.0446
Satterthwaite Unequal 8.7104 -2.09 0.0668
85 / 166
The value of this
I If you have a test procedure but no corresponding CI:
I you make a CI by including all the parameter values that would not be rejected by your test.
I Use:
I α = 0.01 for a 99% CI,
I α = 0.05 for a 95% CI,
I α = 0.10 for a 90% CI,
and so on.
86 / 166
The sign test
I Test for population median.
I Data:
x=c(3, 4, 5, 5, 6, 8, 11, 14, 19, 37)
I Let M denote pop. median. Test H0 : M = 12 vs. Ha : M ≠ 12.
I If H0 true, each data value either above or below 12, prob. 0.5 of each (ignore exact 12’s).
I Test statistic: # values above 12.
I If H0 true, test statistic has binomial distribution, n = 10, p = 0.5.
I Observed 3 values above 12 (and 7 below).
I P-value comes from prob. of 3 successes or less in that binomial distribution.
87 / 166
Probability distribution calculator
I R knows about many distributions, eg. norm, binom.
I All have prefixes:
I d: probability (or density) of exactly given value
I p: probability of given value or less (CDF)
I q: value that has that much probability at or below it (inverse CDF).
pnorm(1.96)
## [1] 0.9750021
qnorm(0.95)
## [1] 1.644854
dbinom(1,4,0.3)
## [1] 0.4116
pbinom(1,4,0.3)
## [1] 0.6517
qbinom(0.9,4,0.3)
## [1] 2
Standard normal: pnorm(1.96) is prob of less than 1.96; qnorm(0.95) is value with prob 0.95 less than it.
Binomial, n = 4, p = 0.3: dbinom(1,4,0.3) is prob of exactly 1 success; pbinom(1,4,0.3) is prob of 1 success or less; qbinom(0.9,4,0.3) is value with prob (at least) 0.9 less than it.
88 / 166
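The d/p/q pattern is not special to R. As a cross-check, here is a sketch using only the Python standard library (statistics.NormalDist plus math.comb for the binomial); the function names dbinom, pbinom, qbinom are just mnemonics borrowed from R, not library calls.

```python
from math import comb
from statistics import NormalDist

# p-prefix: CDF. pnorm(1.96) in R.
print(round(NormalDist().cdf(1.96), 7))      # prob of standard normal below 1.96

# q-prefix: inverse CDF. qnorm(0.95) in R.
print(round(NormalDist().inv_cdf(0.95), 6))  # value with prob 0.95 below it

def dbinom(k, n, p):
    # d-prefix: probability of exactly k successes
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pbinom(k, n, p):
    # p-prefix: probability of k successes or less
    return sum(dbinom(j, n, p) for j in range(k + 1))

def qbinom(q, n, p):
    # q-prefix: smallest k with CDF at least q
    return next(k for k in range(n + 1) if pbinom(k, n, p) >= q)

print(round(dbinom(1, 4, 0.3), 4))   # 0.4116
print(round(pbinom(1, 4, 0.3), 4))   # 0.6517
print(qbinom(0.9, 4, 0.3))           # 2
```

The five printed values match the R output on this slide.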
Sign test
I Data:
x
## [1] 3 4 5 5 6 8 11 14 19 37
I Hypotheses about median M: H0 : M = 12, Ha : M 6= 12.
I Test statistic: number of valuesabove 12.
table(x>12)
##
## FALSE TRUE
## 7 3
3 values above.
I P-value: prob. of 3 successes or less in binomial n = 10, p = 0.5, times 2 (two-sided test):
2*pbinom(3,10,0.5)
## [1] 0.34375
I No evidence that the median differs from 12.
89 / 166
Sign test in SAS
Lives in proc univariate. Note how null median specified:
data x;
input x @@;
cards;
3 4 5 5 6 8 11 14 19 37
;
proc univariate mu0=12;
90 / 166
SAS sign test results
The UNIVARIATE Procedure
Variable: x
Tests for Location: Mu0=12
Test -Statistic- -----p Value------
Student's t t -0.24399 Pr > |t| 0.8127
Sign M -2 Pr >= |M| 0.3438
Signed Rank S -9.5 Pr >= |S| 0.3730
P-value 0.3438, same as R.
91 / 166
Making R sign test into function
I Get CI for median as “what values of M would sign test not reject for?”
I Trying lots of values of M, do sign test for each.
I Efficiency: write R function to do sign test and return P-value:
sign.test=function(med,mydata) {
  tbl=table(mydata<med)
  stat=min(tbl)
  n=length(mydata)
  pval=2*pbinom(stat,n,0.5)
  return(pval)
}
I Function takes null median and data as input.
I Test statistic is # values greater or less than M, whichever is smaller.
I Returns two-sided P-value.
92 / 166
Using function to get CI
Idea: call function repeatedly on same data but different null M. Values of M with P-value > 0.05 inside 95% CI, others outside.
First test on actual problem:
sign.test(12,x)
## [1] 0.34375
All right. Try some values, making sure not to use any data values:
sign.test(18.5,x)
## [1] 0.109375
sign.test(19.5,x)
## [1] 0.02148438
18.5 inside, 19.5 outside.
93 / 166
Completing the interval
Lower end:
sign.test(4.5,x)
## [1] 0.109375
sign.test(3.5,x)
## [1] 0.02148438
4.5 inside, 3.5 outside.
So 95% CI for pop. median goes from 4.5 to 18.5.
Not a very good CI, for two reasons:
I Sample size small.
I Sign test not very powerful.
But it is a confidence interval.
94 / 166
Sign-testing a whole bunch of medians at once
options(width=60)
my.meds=seq(3.5,20.5,1)
pvals=sapply(my.meds,sign.test,x)
rbind(my.meds,pvals)
## [,1] [,2] [,3] [,4] [,5]
## my.meds 3.50000000 4.500000 5.5000000 6.500000 7.500000
## pvals 0.02148438 0.109375 0.7539063 1.246094 1.246094
## [,6] [,7] [,8] [,9] [,10]
## my.meds 8.5000000 9.5000000 10.5000000 11.50000 12.50000
## pvals 0.7539063 0.7539063 0.7539063 0.34375 0.34375
## [,11] [,12] [,13] [,14] [,15]
## my.meds 13.50000 14.500000 15.500000 16.500000 17.500000
## pvals 0.34375 0.109375 0.109375 0.109375 0.109375
## [,16] [,17] [,18]
## my.meds 18.500000 19.50000000 20.50000000
## pvals 0.109375 0.02148438 0.02148438
95 / 166
Comments
I sapply says: for each of the values in first thing (my.meds), run the function named in the second thing (sign.test), feeding anything else (x) into the function.
I Runs sign.test(3.5,x), sign.test(4.5,x), . . . , sign.test(20.5,x) and glues results together.
I P-value “1.25” is where 5 values above H0 median and 5 below, no evidence at all for rejecting null. (Capping at 1 would be a better value.)
I For 95% CI, look at where P-values cross 0.05 (twice).
96 / 166
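The scan over candidate medians works in any language with a binomial CDF. Here is a Python sketch of the same idea (standard library only, and capping the doubled P-value at 1, as the comment above suggests); sign_test_pval is a hypothetical helper, not a library function. It reproduces the 4.5-to-18.5 interval.

```python
from math import comb

def sign_test_pval(med, data):
    """Two-sided sign-test P-value for null median med."""
    kept = [v for v in data if v != med]   # ignore values exactly equal to med
    below = sum(v < med for v in kept)
    stat = min(below, len(kept) - below)   # smaller of counts below/above
    n = len(kept)
    pval = 2 * sum(comb(n, k) for k in range(stat + 1)) / 2**n
    return min(pval, 1)                    # cap the doubled P-value at 1

x = [3, 4, 5, 5, 6, 8, 11, 14, 19, 37]

# scan candidate medians 3.5, 4.5, ..., 20.5 (avoiding the data values);
# those with P-value > 0.05 are inside the 95% CI
inside = [m + 0.5 for m in range(3, 21) if sign_test_pval(m + 0.5, x) > 0.05]
print(min(inside), max(inside))   # 4.5 18.5
```

The P-value for the original null M = 12 comes out as 0.34375, matching the R and SAS results.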
Some more data
Take a look at these data (12 rows of 3 columns):
Case Drug A Drug B
1 2.0 3.5
2 3.6 5.7
3 2.6 2.9
4 2.6 2.4
5 7.3 9.9
6 3.4 3.3
7 14.9 16.7
8 6.6 6.0
9 2.3 3.8
10 2.0 4.0
11 6.8 9.1
12 8.5 20.9
97 / 166
Matched pairs
I Data are comparison of 2 drugs for effectiveness at reducing pain.
I 12 subjects (cases) were arthritis sufferers.
I Response is # hours of pain relief from each drug.
I In reading example, each child tried only one reading method.
I But here, each subject tried out both drugs, giving us two measurements.
I Possible because, if you wait long enough, one drug has no influence over effect of other.
I Advantage: focused comparison of drugs. Comparing one drug with another on same person removes a lot of random variability.
I Matched pairs, requires different analysis.
I Design: randomly choose 6 of 12 subjects to get drug A first, other 6 get drug B first.
98 / 166
Analysis, in SAS
Uses a t-test. Note paired line.
data analgesic;
infile "/folders/myfolders/analgesic.txt";
input subject druga drugb;
proc ttest;
paired druga*drugb;
99 / 166
Output of matched pairs t-test
N Mean Std Dev Std Err Minimum Maximum
12 -2.1333 3.4092 0.9841 -12.4000 0.6000
Mean 95% CL Mean Std Dev 95% CL Std Dev
-2.1333 -4.2994 0.0327 3.4092 2.4150 5.7884
DF t Value Pr > |t|
11 -2.17 0.0530
100 / 166
Comments
I P-value 0.0530.
I At α = 0.05, cannot quite reject null of no difference, though result is very close to significance.
I “Hand-calculation” way of doing this is to find the 12 differences, one for each subject, and do 1-sample t-test on those differences. Shown on next page.
101 / 166
Alternative way to do matched pairs
Define differences in data step:
data analgesic;
infile "/folders/myfolders/analgesic.txt";
input subject druga drugb;
diff=druga-drugb;
proc ttest h0=0;
var diff;
t-test is an ordinary 1-sample test on diff. Note that the null-hypothesis mean has to be given when there is only one sample.
102 / 166
Results
N Mean Std Dev Std Err Minimum Maximum
12 -2.1333 3.4092 0.9841 -12.4000 0.6000
Mean 95% CL Mean Std Dev 95% CL Std Dev
-2.1333 -4.2994 0.0327 3.4092 2.4150 5.7884
DF t Value Pr > |t|
11 -2.17 0.0530
103 / 166
Paired test in R (1)
I forgot to put variable names in the data file. Use what we have.
pain=read.table("analgesic.txt",header=F)
pain
## V1 V2 V3
## 1 1 2.0 3.5
## 2 2 3.6 5.7
## 3 3 2.6 2.9
## 4 4 2.6 2.4
## 5 5 7.3 9.9
## 6 6 3.4 3.3
## 7 7 14.9 16.7
## 8 8 6.6 6.0
## 9 9 2.3 3.8
## 10 10 2.0 4.0
## 11 11 6.8 9.1
## 12 12 8.5 20.9
104 / 166
Paired test in R (2)
attach(pain)
t.test(V2,V3,paired=T)
##
## Paired t-test
##
## data: V2 and V3
## t = -2.1677, df = 11, p-value = 0.05299
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.29941513 0.03274847
## sample estimates:
## mean of the differences
## -2.133333
P-value as before. Likewise, you can calculate the differences yourself and do a 1-sample t-test on them.
105 / 166
Assessing normality
I Analyses assume (theoretically) that differences normally distributed.
I Though we know that t-tests generally behave well even without normality.
I How to assess normality? A normal quantile plot.
I Idea: scatter of points should follow the straight line, without curving.
I Outliers show up at bottom left or top right of plot as points off the line.
106 / 166
The normal quantile plot
diff=V2-V3 ; qqnorm(diff) ; qqline(diff)
[Normal Q-Q plot of diff: theoretical quantiles vs. sample quantiles]
Points should follow the straight line. Bottom left one way off, so normality questionable here.
107 / 166
And using “ggplot”, but without line
ggplot(pain,aes(sample=diff))+stat_qq()
[Normal Q-Q plot of diff, ggplot version]
108 / 166
And in SAS. . .
proc univariate noprint;
qqplot diff;
109 / 166
Getting a line on SAS normal quantile plot
I SAS doesn’t automatically provide a line, but even without one, you see that these data are not normal because of the outlier bottom left.
I SAS can draw lines, but requires you to give a mean and SD to make the line with.
I Simplest way is to have SAS estimate them from the data, but the line is usually not very good.
I Or we can estimate them another way from IQR.
110 / 166
Having SAS estimate them
proc univariate noprint;
qqplot diff / normal(mu=est sigma=est);
111 / 166
Another way to estimate µ and σ
I Problem above is that SD was grossly inflated by outlier.
I On standard normal, quartiles about ±0.675:
qnorm(0.25); qnorm(0.75)
## [1] -0.6744898
## [1] 0.6744898
I So IQR of standard normal about 2(0.675) = 1.35.
I Thus IQR of any normal about 1.35σ.
I Idea: estimate σ by taking sample IQR and dividing by 1.35. Not affected by outliers.
I Here, IQR is 1.95, so estimate of σ is 1.455.
I In similar spirit, estimate µ by median, −1.65.
112 / 166
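The robust estimates are easy to verify. A Python sketch with the twelve differences (standard library only; method="inclusive" makes statistics.quantiles match R's default quantile rule). Note that dividing the IQR of 1.95 by 1.35 gives about 1.44 here; the slide's 1.455 presumably reflects a slightly different divisor or rounding along the way.

```python
from statistics import median, quantiles

diff = [-1.5, -2.1, -0.3, 0.2, -2.6, 0.1, -1.8,
        0.6, -1.5, -2.0, -2.3, -12.4]

# inclusive method = R's default (type 7) quantiles
q1, q2, q3 = quantiles(diff, n=4, method="inclusive")
iqr = q3 - q1                 # 1.95
sigma_hat = iqr / 1.35        # IQR of a normal is about 1.35 sigma
mu_hat = median(diff)         # robust estimate of mu: -1.65

print(round(iqr, 2), round(mu_hat, 2), round(sigma_hat, 2))
```

Unlike the sample SD, neither estimate is dragged around by the -12.4 outlier.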
Using improved µ and σ
proc univariate noprint;
qqplot diff / normal(mu=-1.65 sigma=1.455);
113 / 166
Dealing with non-normality
I One approach: do nothing, since the t-test takes care of non-normality.
I Another: do a sign test, testing whether the median difference is zero. In SAS, just ask for it.
I Another: do a randomization test.
114 / 166
Sign test in SAS
proc univariate;
var diff;
P-value 0.1460:
The UNIVARIATE Procedure
Variable: diff
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student's t t -2.16771 Pr > |t| 0.0530
Sign M -3 Pr >= |M| 0.1460
Signed Rank S -32 Pr >= |S| 0.0088
115 / 166
Look back at the data
I Data (differences drug A minus drug B):
diff
## [1] -1.5 -2.1 -0.3 0.2 -2.6 0.1 -1.8
## [8] 0.6 -1.5 -2.0 -2.3 -12.4
I See the big outlier (at end of list).
116 / 166
Sign test in R
I In R, either use the sign.test function that we wrote before:
sign.test(0,diff)
## [1] 0.1459961
I Or, count the number of positive and negative differences:
table(diff>0)
##
## FALSE TRUE
## 9 3
I and calculate the P-value as
pbinom(3,12,0.5)*2
## [1] 0.1459961
117 / 166
Randomization test idea
I Another way of doing a test when you’re worried about assumptions is to use the “randomization” idea.
I Ask “what could be switched around if H0 true”?
I In matched pairs, each of the matched observations could have come from either treatment (eg. drug).
I In two independent samples, each observation could have come from either sample.
I In ANOVA (multiple groups), each observation could have come from any group.
I Randomization test: randomly shuffle treatments/samples/groups according to experimental design.
118 / 166
Randomization test for matched pairs
I Randomly allocate results for each person to drug A or B.
I Observed sample mean difference:
mean(diff)
## [1] -2.133333
I Switching drugs (from observed data) means switching sign of difference A minus B. So generate random sample of −1 and 1 of size 12 (with replacement), like this:
pm=c(1,-1)
random.pm=sample(pm,12,replace=T)
random.pm
## [1] -1 1 1 -1 1 -1 1 1 -1 -1 -1 1
119 / 166
Generating a randomization sample and mean
I Then take observed differences and generate randomized ones by multiplying diff by these random signs:
random.diff=diff*random.pm
random.diff
## [1] 1.5 -2.1 -0.3 -0.2 -2.6 -0.1 -1.8
## [8] 0.6 1.5 2.0 2.3 -12.4
mean(random.diff)
## [1] -0.9666667
I This randomization gave a sample mean difference close to −1, while our observed value was more negative, −2.13.
120 / 166
One randomization sample
I Write a function to do randomization and return statistic, once.
rand.diff=function(x) {
  pm=c(-1,1)
  random.pm=sample(pm,length(x),replace=T)
  random.diff=x*random.pm
  return(mean(random.diff))
}
I Try it a few times:
rand.diff(diff)
## [1] -0.5833333
rand.diff(diff)
## [1] -1.083333
rand.diff(diff)
## [1] 0.7
rand.diff(diff)
## [1] 1.133333
I Each answer different, because of randomization. 121 / 166
Repeating many times
I Function replicate repeats something as many times as you like.
I In this case, repeat rand.diff(diff) many times:
replicate(5,rand.diff(diff))
## [1] 0.2666667 -0.7833333 1.1000000 -0.7166667
## [5] 1.1000000
rand.mean=replicate(1000,rand.diff(diff))
I Collection of replicated values called randomization distribution.
I Use this instead of normal, t, . . . to get P-value from.
122 / 166
Histogram of randomization distribution
r=data.frame(mean=rand.mean)
ggplot(r,aes(x=mean))+geom_histogram(binwidth=0.5)
[Histogram of the 1000 randomized mean differences]
123 / 166
The randomization distribution
I If t-test were plausible, this would look normal. Does it?
I Why not? The outlier difference, −12.4. If counted as positive, mean probably positive; if counted as negative, mean probably negative.
I This true almost regardless of other values (whether plus or minus).
I Randomization test can be trusted, though.
I P-value: how many of randomized means came out less than observed −2.13, doubled:
tab=table(rand.mean<=mean(diff))
tab
##
## FALSE TRUE
## 994 6
2*tab[2]/1000
## TRUE
## 0.012
I Randomization P-value small; no doubt now that mean pain relief for two drugs different.
124 / 166
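The sign-flipping scheme is easy to mirror outside R. As a cross-check, here is a Python sketch of the same matched-pairs randomization test (standard library only, seeded for reproducibility); one_flip is a hypothetical helper mirroring rand.diff.

```python
import random
from statistics import mean

diff = [-1.5, -2.1, -0.3, 0.2, -2.6, 0.1, -1.8,
        0.6, -1.5, -2.0, -2.3, -12.4]
obs = mean(diff)    # observed mean difference, about -2.13

random.seed(1)
nsim = 10000

def one_flip():
    # randomly reassign each pair's result to drug A or B: flip the sign
    return mean(d * random.choice((-1, 1)) for d in diff)

rand_means = [one_flip() for _ in range(nsim)]

# two-sided P-value: doubled proportion of flips at least as extreme as observed
pval = 2 * sum(m <= obs for m in rand_means) / nsim
print(round(obs, 2), pval)
```

With 10,000 flips the P-value lands well below 0.05, in the neighbourhood of the slides' 0.012.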
Randomization test for median difference
I With outlier, should we be using mean?
I Randomization test flexible; change mean to median:
rand.diff.med=function(x) {
  pm=c(-1,1)
  random.pm=sample(pm,length(x),replace=T)
  random.diff=x*random.pm
  return(median(random.diff))
}
replicate(5,rand.diff.med(diff))
## [1] 1.65 -0.15 0.15 0.35 0.05
rand.median=replicate(1000,rand.diff.med(diff))
I Change mean to median everywhere.
125 / 166
Histogram of randomization distribution for median
r=data.frame(med=rand.median)
ggplot(r,aes(x=med))+
geom_histogram(binwidth=0.5)
[Histogram of the 1000 randomized median differences: a single peak]
126 / 166
P-value for randomization test for median
I Randomization distribution has only one peak now.
I Sample median difference was −1.65 (median(diff)).
I P-value is proportion of these at least as extreme:
tab=table(rand.median<=median(diff))
tab
##
## FALSE TRUE
## 989 11
2*tab[2]/1000
## TRUE
## 0.022
I Again, reject hypothesis that two drugs equally good.
127 / 166
Comparison
Test results seem to be very different:
Test            Statistic  P-value
t-test          mean       0.053
randomization   mean       0.012
sign test       median     0.146
randomization   median     0.022
I Not that much agreement, though 2 randomization tests have similar results.
I Sign test P-value much larger than others (lack of power?)
I t-test probably not trustworthy (very non-normal dist. of differences).
I Using mean probably not wise (outlier difference).
I Randomization test for median probably best.
128 / 166
The kids’ reading data, again
kids[1:15,]
## group score
## 1 t 24
## 2 t 61
## 3 t 59
## 4 t 46
## 5 t 43
## 6 t 44
## 7 t 52
## 8 t 43
## 9 t 58
## 10 t 67
## 11 t 62
## 12 t 57
## 13 t 71
## 14 t 49
## 15 t 54
kids[16:30,]
## group score
## 16 t 43
## 17 t 53
## 18 t 57
## 19 t 49
## 20 t 56
## 21 t 33
## 22 c 42
## 23 c 33
## 24 c 46
## 25 c 37
## 26 c 43
## 27 c 41
## 28 c 10
## 29 c 42
## 30 c 55
kids[31:44,]
## group score
## 31 c 19
## 32 c 17
## 33 c 55
## 34 c 26
## 35 c 54
## 36 c 60
## 37 c 28
## 38 c 62
## 39 c 20
## 40 c 53
## 41 c 48
## 42 c 37
## 43 c 85
## 44 c 42
I 21 kids in “treatment”, new reading method; 23 in “control”, standard reading method.
129 / 166
Randomization test for two-sample data
I Each observation might, under H0, have come from either treatmentor control.
I Means for data:
aggregate(score~group,kids,mean)
## group score
## 1 c 41.52174
## 2 t 51.47619
Differ by 9.95.
I If we assign Treatment and Control to the observations at random,how likely would we be to see a difference in means of 9.95 or bigger?
130 / 166
Randomization test
I Ingredients:
I sample to shuffle treatment groups
I aggregate to calculate means for shuffled groups
I little bit of calculation to get difference in sample means.
attach(kids)
myshuffle=sample(group)
myshuffle
## [1] c t t c t c c t c c c t t t c t t c c c t t t
## [24] c t t c c t t c c t c t t c t t c c c c c
## Levels: c t
I This makes shuffled groups.
131 / 166
Randomization difference in means
I Now find group means for shuffled groups:
themeans=aggregate(score~myshuffle,data=kids,mean)
themeans
## myshuffle score
## 1 c 44.39130
## 2 t 48.33333
I Difference in means is second thing in second column minus firstthing in second column:
meandiff=themeans[2,2]-themeans[1,2]
meandiff
## [1] 3.942029
I Simulated treatment > simulated control.
132 / 166
Making into function
I We will repeat shuffling process many times, so put it in function.
I Input: data frame.
I Output: difference in means of shuffled groups.
shuffle.means=function(x) {
  myshuffle=sample(x$group)
  themeans=aggregate(score~myshuffle,data=x,mean)
  meandiff=themeans[2,2]-themeans[1,2]
  return(meandiff)
}
133 / 166
Again, using function!
shuffle.means(kids)
## [1] 0.8447205
shuffle.means(kids)
## [1] 4.124224
shuffle.means(kids)
## [1] 3.304348
replicate(5,shuffle.means(kids))
## [1] -7.2629400 1.0269151 7.3126294 -2.6169772
## [5] 0.4803313
I Each run does different shuffle, so gives different result.
I Last line does whole thing five times, collects results.
I Looks like a difference in means of 10 is improbably large, by chance.
134 / 166
1000 times
nsim=1000
ans=replicate(nsim,shuffle.means(kids))
I Investigate randomization by looking at ans.
135 / 166
Histogram of randomized mean differences
ggplot(data.frame(ans=ans),aes(x=ans))+
geom_histogram(binwidth=2)+geom_vline(xintercept=10)
[Histogram of the 1000 shuffled mean differences, with vertical line at the observed 10]
136 / 166
How many randomizations gave mean difference ≥ 10?
I By chance, mean difference of 10 or bigger appears unlikely.
I Use a logical vector to pick out the elements of ans that we want, then make a table of them:
isbig=(ans>=10);
table(isbig)
## isbig
## FALSE TRUE
## 983 17
I 17 out of 1000.
I Our randomization test gives a P-value of 17/1000 = 0.017, so would reject null hypothesis “new reading program has no effect” in favour of alternative “new reading program helps”.
I P-value similar to 0.013 from two-sample t.
137 / 166
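The same shuffle test can be written as a Python sketch (standard library only; scores transcribed from the kids data shown earlier, seeded for reproducibility). The helper one_shuffle is hypothetical, mirroring shuffle.means.

```python
import random
from statistics import mean

t = [24, 61, 59, 46, 43, 44, 52, 43, 58, 67, 62,
     57, 71, 49, 54, 43, 53, 57, 49, 56, 33]          # treatment, n = 21
c = [42, 33, 46, 37, 43, 41, 10, 42, 55, 19, 17, 55,
     26, 54, 60, 28, 62, 20, 53, 48, 37, 85, 42]      # control, n = 23

obs = mean(t) - mean(c)    # observed difference, about 9.95
scores = t + c

random.seed(1)

def one_shuffle():
    # randomly relabel which 21 scores count as "treatment"
    random.shuffle(scores)
    return mean(scores[:21]) - mean(scores[21:])

diffs = [one_shuffle() for _ in range(10000)]
pval = sum(d >= obs for d in diffs) / 10000   # one-sided, as on the slide
print(round(obs, 2), pval)
```

With 10,000 shuffles the P-value comes out in the same small range as the slides' 0.017.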
Why 1000 randomizations?
I No uniquely good answer.
I More randomizations is better.
I But didn’t want to wait too long for it to finish!
I If you think 10,000 (or a million) better, replace nsim in my code by desired value and run again.
138 / 166
Jumping rats
I Recall jumping rats.
I See whether larger amount of exercise (jumping) went with higher bone density.
I 30 rats randomly assigned to one of 3 treatment groups:
I Control (no jumping)
I Low jumping
I High jumping
I Random assignment: rats in each group similar in all important ways.
I So entitled to draw conclusions about cause and effect.
I Recall boxplots, over:
139 / 166
Boxplots
rats=read.table("jumping.txt",header=T)
ggplot(rats,aes(y=density,x=group))+geom_boxplot()
[Boxplots of bone density by group: Control, Highjump, Lowjump]
140 / 166
Testing: ANOVA (comparing several means)
rats.aov=aov(density~group,data=rats)
summary(rats.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 7434 3717 7.978 0.0019 **
## Residuals 27 12579 466
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I Usual ANOVA table, small P-value: significant result.
I Null hypothesis: all 3 groups have same mean.
I Alternative: “not all 3 means the same”, at least one is different from others.
I Reject null, but not very useful finding.
141 / 166
Same thing in SAS
Read in data and do ANOVA:
data rats;
infile '/folders/myfolders/jumping.txt' firstobs=2;
input group $ density;
proc anova;
class group;
model density=group;
142 / 166
Results (some)
The ANOVA Procedure
Dependent Variable: density
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 2 7433.86667 3716.93333 7.98 0.0019
Error 27 12579.50000 465.90741
Corrected Total 29 20013.36667
Source DF Anova SS Mean Square F Value Pr > F
group 2 7433.866667 3716.933333 7.98 0.0019
143 / 166
Which groups are different from which?
I ANOVA really only answers half our questions: it says “there are differences”, but doesn’t tell us which groups differ from which.
I One possibility (not the best): compare all possible pairs of groups, via two-sample t.
I First pick out each group:
detach(kids)
controls=filter(rats,group=="Control")
lows=filter(rats,group=="Lowjump")
highs=filter(rats,group=="Highjump")
144 / 166
Control vs. low
t.test(controls$density,lows$density)
##
## Welch Two Sample t-test
##
## data: controls$density and lows$density
## t = -1.0761, df = 16.191, p-value = 0.2977
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -33.83725 11.03725
## sample estimates:
## mean of x mean of y
## 601.1 612.5
No sig. difference here.
145 / 166
Control vs. high
t.test(controls$density,highs$density)
##
## Welch Two Sample t-test
##
## data: controls$density and highs$density
## t = -3.7155, df = 14.831, p-value = 0.002109
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -59.19139 -16.00861
## sample estimates:
## mean of x mean of y
## 601.1 638.7
These are different.
146 / 166
Low vs. high
t.test(lows$density,highs$density)
##
## Welch Two Sample t-test
##
## data: lows$density and highs$density
## t = -3.2523, df = 17.597, p-value = 0.004525
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -43.15242 -9.24758
## sample estimates:
## mean of x mean of y
## 612.5 638.7
These are different too.
147 / 166
But. . .
I We just did 3 tests instead of 1.
I So we have given ourselves 3 chances to reject H0 : all means equal, instead of 1.
I Thus α for this combined test is not 0.05.
148 / 166
John W. Tukey
I American statistician, 1915–2000
I Big fan of exploratory data analysis
I Invented boxplot
I Invented “honestly significant differences”
I Invented jackknife estimation
I Coined computing term “bit”
I Co-inventor of Fast Fourier Transform
149 / 166
Honestly Significant Differences
I Compare several groups with one test, telling you which groups differ from which.
I Idea: if all population means equal, find distribution of highest sample mean minus lowest sample mean.
I Any means unusually different compared to that declared significantly different.
150 / 166
Tukey on rat data
rats.aov=aov(density~group,data=rats)
TukeyHSD(rats.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = density ~ group, data = rats)
##
## $group
## diff lwr upr p adj
## Highjump-Control 37.6 13.66604 61.533957 0.0016388
## Lowjump-Control 11.4 -12.53396 35.333957 0.4744032
## Lowjump-Highjump -26.2 -50.13396 -2.266043 0.0297843
Again conclude that bone density for highjump group significantly higher than for other two groups.
151 / 166
Why Tukey’s procedure better than all t-tests
Look at P-values for the two tests:
Comparison Tukey t-tests
----------------------------------
Highjump-Control 0.0016 0.0021
Lowjump-Control 0.4744 0.2977
Lowjump-Highjump 0.0298 0.0045
I Tukey P-values (mostly) higher.
I Proper adjustment for doing three t-tests at once, not just one in isolation.
I Lowjump-Highjump comparison no longer significant at α = 0.01.
152 / 166
Same stuff in SAS
proc anova;
class group;
model density=group;
means group / tukey;
153 / 166
Tukey output (some)
The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for density
Means with the same letter are not significantly different.
Tukey Grouping Mean N group
A 638.700 10 Highjump
B 612.500 10 Lowjump
B
B 601.100 10 Control
154 / 166
Checking assumptionsggplot(rats,aes(y=density,x=group))+geom_boxplot()
●
●
550
575
600
625
650
675
Control Highjump Lowjumpgroup
dens
ity
Assumptions:
I Normally distributed data within each group
I with equal group SDs.155 / 166
The assumptions
I Normally-distributed data within each group
I Equal group SDs.
These are shaky here because:
I control group has outliers
I highjump group appears to have less spread than others.
Possible remedies:
I Transformation of response (usually works best when SD increases with mean)
I Different test, eg. randomization test.
I Can also try Mood’s Median Test (if you know about chi-squared tests). Based on medians, like a generalization of sign test. Not always very powerful. See over.
156 / 166
Mood’s median test
I Find median of all bone densities, regardless of group:
attach(rats)
m=median(density) ; m
## [1] 621.5
I Count up how many observations in each group above or below overall median:
tab=table(group,density>m)
tab
##
## group FALSE TRUE
## Control 9 1
## Highjump 0 10
## Lowjump 6 4
I All Highjump obs above overall median.
I Most Control obs below overall median.
I Suggests medians differ by group.
157 / 166
Mood’s median test 2/2
I Test whether association between group and being above/below overall median significant, using chi-squared test for association:
chisq.test(tab)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 16.8, df = 2, p-value = 0.0002249
I Very small P-value says that being above/below overall median depends on group.
I That is, groups do not all have same median.
158 / 166
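The chi-squared statistic behind Mood's test can be checked directly from the 3×2 table of counts on the previous slide; nothing beyond those counts is needed. A Python sketch of the usual observed-minus-expected computation:

```python
# counts of (below, above) overall median per group, from the table above
observed = {"Control": (9, 1), "Highjump": (0, 10), "Lowjump": (6, 4)}

rows = list(observed.values())
n = sum(sum(r) for r in rows)                            # 30 rats in total
col_totals = [sum(r[j] for r in rows) for j in (0, 1)]   # 15 below, 15 above

# expected count in each cell: row total * column total / grand total
chisq = 0.0
for r in rows:
    for j, o in enumerate(r):
        e = sum(r) * col_totals[j] / n
        chisq += (o - e) ** 2 / e

df = (len(rows) - 1) * (2 - 1)    # (3-1)*(2-1) = 2
print(round(chisq, 1), df)        # 16.8 2
```

Every expected count here is 10 × 15/30 = 5, and the statistic 16.8 on 2 df matches the chisq.test output.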
Randomization test
I Like two-sample randomization test: randomly shuffle groupmemberships.
I Choices about test statistic, eg.
I usual ANOVA F -statistic
I Tukey-like highest sample mean minus lowest
I I go with highest minus lowest.
159 / 166
Observed means and test statistic
I Calculate observed value of test statistic:
group.means=aggregate(density~group,data=rats,mean)
group.means
## group density
## 1 Control 601.1
## 2 Highjump 638.7
## 3 Lowjump 612.5
Actual means in second column of group.means:
z=group.means[,2]
obs=max(z)-min(z)
obs
## [1] 37.6
160 / 166
Randomly shuffling groups
Actual groups are:
rats$group
## [1] Control Control Control Control Control Control
## [7] Control Control Control Control Lowjump Lowjump
## [13] Lowjump Lowjump Lowjump Lowjump Lowjump Lowjump
## [19] Lowjump Lowjump Highjump Highjump Highjump Highjump
## [25] Highjump Highjump Highjump Highjump Highjump Highjump
## Levels: Control Highjump Lowjump
Shuffle like this:
shuffled.groups=sample(rats$group) ; shuffled.groups
## [1] Highjump Control Control Lowjump Lowjump Lowjump
## [7] Control Control Lowjump Lowjump Highjump Control
## [13] Control Highjump Highjump Highjump Lowjump Highjump
## [19] Lowjump Highjump Control Highjump Control Lowjump
## [25] Highjump Highjump Lowjump Control Lowjump Control
## Levels: Control Highjump Lowjump
161 / 166
Calculate means and highest minus lowest
Same idea as for actual data, but using shuffled.groups instead of group:
my.group.means=aggregate(density~shuffled.groups,
data=rats,mean)
my.group.means
## shuffled.groups density
## 1 Control 623.2
## 2 Highjump 613.0
## 3 Lowjump 616.1
z=my.group.means[,2]
max(z)-min(z)
## [1] 10.2
Need to do this a bunch of times. Idea: write a function to do it once, then run function many times.
162 / 166
Function to do it once
I Input: original data.
I Obtain shuffled groups
I Calculate mean for each group
I Return largest mean minus smallest mean.
I Function:
shuffle1=function(mydata=rats) {
  shuffled.groups=sample(mydata$group)
  means=aggregate(density~shuffled.groups,
                  data=mydata,mean)
  z=means[,2]
  max(z)-min(z)
}
163 / 166
Obtaining randomization distribution
I Test (randomizes, so look for “sane” answer):
shuffle1(rats)
## [1] 19.4
I Use replicate to do a few times:
replicate(6,shuffle1(rats))
## [1] 9.4 12.8 25.0 15.8 6.6 1.9
I Or many times, like 1000:
test.stat=replicate(1000,shuffle1(rats))
164 / 166
Randomization distribution of test.stat
vals=data.frame(ts=test.stat)
ggplot(vals,aes(x=ts))+geom_histogram(binwidth=5)+
geom_vline(xintercept=obs,colour="red")
[Histogram of the 1000 shuffled test statistics, red line at the observed 37.6]
#hist(test.stat)
#abline(v=obs,col="red")
165 / 166
Comments and P-value
I Distribution shape skewed to right (lower limit of 0).
I Red line marks observed highest minus lowest: seems unusually high.
I If group means really different (anyhow), highest minus lowest will be large, so one-tailed test.
I (Same logic as one-sidedness of F -test in ANOVA.)
I P-value the usual way:
table(test.stat>=obs)
##
## FALSE TRUE
## 998 2
I P-value 2/1000 = 0.0020. Despite outliers, still reject “all means equal”, with P-value very similar to ANOVA (0.0019).
166 / 166