Statistical Inference (University of Toronto course notes: butler/c32/notes/inference.pdf)
Statistical Inference
R packages used in this chapter
library(ggplot2)
library(dplyr)
##
## Attaching package: ’dplyr’
## The following objects are masked from ’package:stats’:
##
## filter, lag
## The following objects are masked from ’package:base’:
##
## intersect, setdiff, setequal, union
2 / 166
Statistical Inference and Science
I Previously: descriptive statistics. “Here are data; what do they say?”.
I May need to take some action based on information in data.
I Or want to generalize beyond data (sample) to larger world (population).
I Science: first guess about how world works.
I Then collect data, by sampling.
I Is guess correct (based on data) for whole world, or not?
3 / 166
Sample data are imperfect
I Sample data never entirely represent what you’re observing.
I There is always random error present.
I Thus you can never be entirely certain about your conclusions.
I The Toronto Blue Jays’ average home attendance in part of the 2015 season was 25,070 (up to May 27 2015, from baseball-reference.com).
I Does that mean the attendance at every game was exactly 25,070? Certainly not. Actual attendance depends on many things, eg.:
I how well the Jays are playing
I the opposition
I day of week
I weather
I random chance
4 / 166
Summarizing the attendances
jays=read.csv("jays15-home.csv",header=T)
summary(jays$attendance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14180 16400 21200 25070 33090 48410
Attendances varied a lot, from a low of 14180 to a high of 48410 (opening day).
Boxplot over. I used “base graphics” (not ggplot) for this one.
5 / 166
Attendance boxplot
boxplot(jays$attendance)
[Boxplot of attendance; vertical axis runs from about 15,000 to 45,000]
6 / 166
Comments
I Distribution somewhat skewed to right (but no outliers).
I Attendances have substantial variability.
I These are a sample of “all possible games” (or maybe “all possible games played in April and May”). What can we say about mean attendance in all possible games based on this evidence?
I Think about:
I Confidence interval
I Hypothesis test.
7 / 166
Confidence interval for mean
I Should account for:
I we only observed 25 out of all possible games (sample size)
I attendances we did observe were highly variable (sample SD)
I we observed mean attendance of 25,070 (sample mean).
I Confidence level: how likely we want the interval to be to contain the mean attendance at “all possible games”. Typically 95%.
I Formula, with sample mean x̄, sample SD s and sample size n:
x̄ ± t∗ s/√n.
t∗ is number from t-table that accounts for confidence level (eg. 95%) and sample size through degrees of freedom n − 1.
I For 25 − 1 = 24 df and 95% confidence, t∗ = 2.064.
8 / 166
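The formula on this slide can be checked numerically. A sketch in Python (an assumption on my part: the course uses R and SAS, but scipy’s t distribution gives the same t∗), using the attendance summaries that appear a couple of slides later (n = 25, mean 25070.16, SD 11006.69):

```python
import math
from scipy import stats

# Attendance summaries quoted later in the chapter (not new data)
n, xbar, s = 25, 25070.16, 11006.69
tstar = stats.t.ppf(0.975, df=n - 1)  # t* for 95% confidence, n - 1 = 24 df
m = tstar * s / math.sqrt(n)          # margin of error: t* s / sqrt(n)
print(xbar - m, xbar + m)             # matches t.test's 20526.82 to 29613.50
```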
How do I get t∗ or z∗ for CI?
I z∗ first, for 95% interval:
I Want middle 95%, z-values that enclose that.
I 2.5% less than −z∗, 97.5% less than z∗.
I In R, qnorm gives value of z with input probability less than it:
qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964
I So, for 95% z-confidence interval, z∗ = 1.96.
9 / 166
t∗ for 95% confidence interval
I “Half the leftovers” again.
I Use qt rather than qnorm.
I Two things in qt: prob. of less, plus degrees of freedom.
I 95% t∗ with 24 df:
qt(0.975,24)
## [1] 2.063899
I or same with 10 df:
qt(0.975,10)
## [1] 2.228139
bigger, because fewer df.
I or t∗ for 90% with 10 df:
qt(0.95,10)
## [1] 1.812461
smaller, as you’d expect.
10 / 166
CI for mean attendance
I First, use R as calculator.
I Sample mean and SD:
att=jays$attendance
mean(att)
## [1] 25070.16
sd(att)
## [1] 11006.69
I Work out “margin of error” (plus/minus part) first:
m=2.064*11007/sqrt(25)
m
## [1] 4543.69
I then get limits of CI:
xbar=25070
xbar-m
## [1] 20526.31
xbar+m
## [1] 29613.69
11 / 166
Comments
I Interval from about 20,000 to about 30,000.
I Not estimating mean attendance well at all!
I Generally want confidence interval to be shorter, which happens if:
I SD smaller
I sample size bigger
I confidence level smaller
I Last one is a cheat, really, since reducing confidence level increases chance that interval won’t contain pop. mean at all!
12 / 166
Using t.test
I Foregoing illustrated how you would do it by hand.
I In practice, t.test function does it all:
t.test(att)
##
## One Sample t-test
##
## data: att
## t = 11.389, df = 24, p-value = 3.661e-11
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 20526.82 29613.50
## sample estimates:
## mean of x
## 25070.16
13 / 166
Or, 90% CI
I by including a value for conf.level:
t.test(att,conf.level=0.90)
##
## One Sample t-test
##
## data: att
## t = 11.389, df = 24, p-value = 3.661e-11
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 21303.93 28836.39
## sample estimates:
## mean of x
## 25070.16
14 / 166
Hypothesis test
I CI answers question “what is the mean?”
I Might have a value µ in mind for the mean, and question “Is the mean equal to µ, or not?”
I For example, 2014 average attendance was 29,327.
I Second question answered by hypothesis test.
I Value being assessed goes in null hypothesis: here, H0 : µ = 29,327.
I Alternative hypothesis says how null might be wrong, eg. Ha : µ ≠ 29,327.
I Assess evidence against null. If that evidence strong enough, reject null hypothesis; if not, fail to reject null hypothesis (sometimes retain null).
I Note asymmetry between null and alternative, and utter absence of word “accept”.
15 / 166
α and errors
I Hypothesis test ends with decision:
I reject null hypothesis
I do not reject null hypothesis.
I but decision may be wrong:

                 Decision
Truth            Do not reject    Reject null
Null true        Correct          Type I error
Null false       Type II error    Correct

I Either type of error is bad, but for now focus on controlling Type I error: write α = P(type I error), and devise test so that α small, typically 0.05.
I That is, if null hypothesis true, have only small chance to reject it (which would be a mistake).
I Worry about type II errors later (when we consider power of test).
16 / 166
Why 0.05? This man.
Responsible for:
I analysis of variance
I Fisher information
I Linear discriminant analysis
I Fisher’s z-transformation
I Fisher-Yates shuffle
I Behrens-Fisher problem
Sir Ronald A. Fisher, 1890–1962.
17 / 166
Why 0.05? (2)
I From The Arrangement of Field Experiments (1926):
[two quoted excerpts shown as images]
18 / 166
Test statistic: going from data to decision
I “How far away from null hypothesis is data?”
I In testing for mean, statistic is
t = (x̄ − µ)/(s/√n)
I and for our baseball attendance data
t.stat=(25070-29327)/(11000/sqrt(25))
t.stat
## [1] -1.935
19 / 166
P-value
I The probability of observing a test statistic value as extreme or more extreme than that observed, if the null hypothesis is true.
I “More extreme” depends on Ha: count both sides if two-sided, count only the proper side if one-sided.
I Our Ha was Ha : µ ≠ 29,327, two-sided.
20 / 166
t-table
I Look up 1.935 not −1.935.
I 1.71 < 1.935 < 2.06
I P-value between 0.05 and 0.10
I Or, exactly (×2: 2-sided):
2*pt(-1.93,24)
## [1] 0.06550769
I P-value not less than 0.05: do not reject null.
I No evidence of change in mean attendance.
21 / 166
Or, again, using t.test:
t.test(att, mu=29327)
##
## One Sample t-test
##
## data: att
## t = -1.9338, df = 24, p-value = 0.06502
## alternative hypothesis: true mean is not equal to 29327
## 95 percent confidence interval:
## 20526.82 29613.50
## sample estimates:
## mean of x
## 25070.16
I See test statistic −1.93, P-value 0.065.
I Again, do not reject null.
22 / 166
And in SAS
I created a spreadsheet with only attendances (since there were otherwise a lot of variables). The first line is a header, so start reading data on second.
data jays;
infile '/folders/myfolders/jays15-home-att.csv'
firstobs=2;
input attendance;
proc ttest h0=29327;
var attendance;
23 / 166
Output
N Mean Std Dev Std Err Minimum Maximum
25 25070.2 11006.7 2201.3 14184.0 48414.0
Mean 95% CL Mean Std Dev 95% CL Std Dev
25070.2 20526.8 29613.5 11006.7 8594.3 15312.0
DF t Value Pr > |t|
24 -1.93 0.0650
Same CI (20527 to 29614) as R, also same P-value 0.0650.
24 / 166
SAS, the long way
I To work from the original spreadsheet, we need to read in all the variables:
data jays;
infile '/folders/myfolders/jays15-home.csv'
dlm=',' dsd firstobs=2;
input row game date $ box $ team $ venue $ opp $
result $ runs oppruns innings wl $ position gb $
winner $ loser $ save $ gametime $ daynight $
attendance streak $;
proc print;
I Yes, that is one long input line, split.
I dlm=’,’ means data values separated by commas.
I dsd means treat two consecutive delimiters as a missing value (usually what you want for delimited data).
25 / 166
Results (some, tiny)
Obs row game date box team venue opp result runs oppruns innings wl
1 82 7 Monday, boxscore TOR TBR L 1 2 . 4-3
2 83 8 Tuesday, boxscore TOR TBR L 2 3 . 4-4
3 84 9 Wednesda boxscore TOR TBR W 12 7 . 5-4
4 85 10 Thursday boxscore TOR TBR L 2 4 . 5-5
5 86 11 Friday, boxscore TOR ATL L 7 8 . 5-6
6 87 12 Saturday boxscore TOR ATL W-wo 6 5 10 6-6
7 88 13 Sunday, boxscore TOR ATL L 2 5 . 6-7
8 89 14 Tuesday, boxscore TOR BAL W 13 6 . 7-7
9 90 15 Wednesda boxscore TOR BAL W 4 2 . 8-7
10 91 16 Thursday boxscore TOR BAL W 7 6 . 9-7
11 92 27 Monday, boxscore TOR NYY W 3 1 . 13-14
12 93 28 Tuesday, boxscore TOR NYY L 3 6 . 13-15
13 94 29 Wednesda boxscore TOR NYY W 5 1 . 14-15
14 95 30 Friday, boxscore TOR BOS W 7 0 . 15-15
15 96 31 Saturday boxscore TOR BOS W 7 1 . 16-15
16 97 32 Sunday, boxscore TOR BOS L 3 6 . 16-16
17 98 40 Monday, boxscore TOR LAA W 10 6 . 18-22
18 99 41 Tuesday, boxscore TOR LAA L 2 3 . 18-23
19 100 42 Wednesda boxscore TOR LAA L 3 4 . 18-24
20 101 43 Thursday boxscore TOR LAA W 8 4 . 19-24
21 102 44 Friday, boxscore TOR SEA L 3 4 . 19-25
22 103 45 Saturday boxscore TOR SEA L 2 3 . 19-26
23 104 46 Sunday, boxscore TOR SEA W 8 2 . 20-26
24 105 47 Monday, boxscore TOR CHW W 6 0 . 21-26
25 106 48 Tuesday, boxscore TOR CHW W-wo 10 9 . 22-26
Obs position gb winner loser save gametime daynight attendance streak
1 2 1 Odorizzi Dickey Boxberge 2:30 N 48414 -
2 3 2 Geltz Castro Jepsen 3:06 N 17264 --
3 2 1 Buehrle Ramirez 3:02 N 15086 +
4 4 1.5 Archer Sanchez Boxberge 3:00 N 14433 -
5 4 2.5 Martin Cecil Grilli 3:09 N 21397 --
6 3 1.5 Cecil Marimon 2:41 D 34743 +
7 4 1.5 Miller Norris Grilli 2:41 D 44794 -
8 2 2 Buehrle Norris 2:53 N 14184 +
9 2 1 Sanchez Jimenez Castro 2:36 N 15606 ++
10 1 Tied Hutchiso Tillman Castro 2:36 N 18581 +++
11 4 3.5 Dickey Martin Cecil 2:18 N 19217 +
12 5 4.5 Pineda Estrada Miller 2:54 N 21519 -
13 3 3.5 Buehrle Sabathia 2:30 N 21312 +
14 3 4 Sanchez Miley 2:41 N 30430 ++
15 3 3 Hutchiso Kelly 3:12 D 42917 +++
16 3 4 Buchholz Dickey Uehara 2:38 D 42419 -
17 5 4.5 Osuna Morin 3:28 D 29306 +
18 5 4.5 Santiago Sanchez Street 2:32 N 15062 -
19 5 4.5 Weaver Hutchiso Street 2:36 N 16402 --
20 5 4.5 Dickey Shoemake 2:22 N 19014 +
21 5 5.5 Hernande Estrada Rodney 2:30 N 21195 -
22 5 5.5 Paxton Buehrle Rodney 2:25 D 33086 --
23 5 4.5 Sanchez Walker 2:50 D 37929 +
24 5 3.5 Hutchiso Noesi 2:10 N 15168 ++
25 4 3 Delabar Robertso 3:12 N 17276 +++
26 / 166
Results (a few, but less tiny)
proc print;
var date attendance daynight;
Obs date attendance daynight
1 Monday, 48414 N
2 Tuesday, 17264 N
3 Wednesda 15086 N
4 Thursday 14433 N
5 Friday, 21397 N
6 Saturday 34743 D
7 Sunday, 44794 D
8 Tuesday, 14184 N
9 Wednesda 15606 N
10 Thursday 18581 N
11 Monday, 19217 N
12 Tuesday, 21519 N
13 Wednesda 21312 N
14 Friday, 30430 N
15 Saturday 42917 D
16 Sunday, 42419 D
17 Monday, 29306 D
18 Tuesday, 15062 N
19 Wednesda 16402 N
20 Thursday 19014 N
21 Friday, 21195 N
22 Saturday 33086 D
23 Sunday, 37929 D
24 Monday, 15168 N
25 Tuesday, 17276 N
27 / 166
Comments
I Need to know names and types for all variables to read in data (vs. R, just read in data frame).
I For text variables, just first 8 characters read in. (To change, see informat later.)
I daynight is D for a day game and N for a night game. How do attendances compare for these?
proc sgplot;
vbox attendance / category=daynight;
28 / 166
The boxplot
[Boxplot of attendance by daynight]
29 / 166
Comments
I Attendances on average much higher for day games than night ones.
I Why?
I What is that upper outlier in the night games?
30 / 166
Another example: learning to read
I You devised new method for teaching children to read.
I Guess it will be more effective than current methods.
I To support this guess, collect data.
I Want to generalize to “all children in Canada”.
I So take random sample of all children in Canada.
I Or, argue that sample you actually have is “typical” of all children in Canada.
I Randomization: whether or not a child is in the sample has nothing to do with anything else about that child.
I Aside: if your new method good for teaching struggling children to read, then “all kids” is “all kids having trouble learning to read”, and you take a sample of those.
31 / 166
The data (some), in SAS
I Data in file drp.txt with header line, group then reading test score.
data kids;
infile '/folders/myfolders/drp.txt' firstobs=2;
input group $ score;
proc print;
Obs group score
1 t 24
2 t 61
3 t 59
4 t 46
5 t 43
6 t 44
7 t 52
8 t 43
9 t 58
10 t 67
11 t 62
12 t 57
13 t 71
14 t 49
15 t 54
16 t 43
17 t 53
18 t 57
19 t 49
20 t 56
21 t 33
22 c 42
23 c 33
24 c 46
25 c 37
26 c 43
27 c 41
28 c 10
29 c 42
30 c 55
31 c 19
32 c 17
33 c 55
34 c 26
35 c 54
36 c 60
37 c 28
38 c 62
39 c 20
40 c 53
41 c 48
42 c 37
43 c 85
44 c 42
32 / 166
Analysis
I Groups labelled c for “control” and t for “treatment”.
I Start with summaries (group means) and plot (boxplot).
I No pairing, matching: might compare means with two-sample t-test.
I For test, need approx. normality, but don’t need equal variability.
I Use summaries to decide if test reasonable.
33 / 166
Comparing means
proc means;
class group;
var score;
The MEANS Procedure
Analysis Variable : score
N
group Obs N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------
c 23 23 41.5217391 17.1487332 10.0000000 85.0000000
t 21 21 51.4761905 11.0073568 24.0000000 71.0000000
-------------------------------------------------------------------------------
34 / 166
Boxplots
proc sgplot;
vbox score / category=group;
[Boxplot of score by group]
35 / 166
Comments
I Groups not actually same size (maybe 2 kids had to drop out).
I Means a fair bit different, treatment mean higher.
I But a lot of variability, so groups do overlap.
I Standard deviations somewhat different too.
I Biggest threat to normality is outliers, none here.
I Both distributions not far off symmetric.
I t-test should be good enough.
36 / 166
The t-test
proc ttest side=l;
var score;
class group;
The TTEST Procedure
Variable: score
Method Variances DF t Value Pr < t
Pooled Equal 42 -2.27 0.0143
Satterthwaite Unequal 37.855 -2.31 0.0132
plus a lot more output.
37 / 166
Comments and Conclusions
I One-sided test (looking for improvement). side can be l (lower), u (upper) or 2 (two-sided, can be omitted). This is l because control group first in alphabetical order.
I Right t-test is Satterthwaite (does not assume equal variability)
I P-value 0.0132 < 0.05: there is increase in reading scores.
I Should not use pooled test, because SDs not close; even so, result very similar (P-value 0.0143).
I One-sided test doesn’t give (regular) CI for difference in means. To get that, repeat analysis without side=l.
38 / 166
Doing it with R
I I skip the boxplots (you do it).
I c (control) before t (treatment) alphabetically, so proper alternative is “less”.
I R does Satterthwaite test by default (as before: new reading program really helps):
kids=read.table("drp.txt",header=T)
t.test(score~group,data=kids,alternative="less")
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.01319
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -2.691293
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
39 / 166
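As a cross-check (not part of the original slides), the same Welch test can be reproduced in Python with scipy, typing in the scores from the drp.txt listing a few slides back; with treatment first and alternative “greater”, the statistic flips sign but the P-value matches:

```python
import numpy as np
from scipy import stats

# Scores transcribed from the drp.txt listing earlier in the chapter
treatment = np.array([24, 61, 59, 46, 43, 44, 52, 43, 58, 67, 62,
                      57, 71, 49, 54, 43, 53, 57, 49, 56, 33])
control = np.array([42, 33, 46, 37, 43, 41, 10, 42, 55, 19, 17, 55,
                    26, 54, 60, 28, 62, 20, 53, 48, 37, 85, 42])

# Welch (unequal-variance) t-test; one-sided, treatment mean larger
res = stats.ttest_ind(treatment, control, equal_var=False,
                      alternative="greater")
print(res.statistic, res.pvalue)  # t = 2.3109, p = 0.0132, as in R
```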
The pooled t-test
I Pooled (equal variances) test this way:
t.test(score~group,data=kids,alternative="less",
var.equal=T)
##
## Two Sample t-test
##
## data: score by group
## t = -2.2666, df = 42, p-value = 0.01431
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -2.567497
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
I Similar P-value to Satterthwaite test (and same results as SAS).
40 / 166
Two-sided test; CI
I To do 2-sided test, leave out alternative:
t.test(score~group,data=kids)
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.67588 -1.23302
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
I Also gives CI: new reading program increases average scores by somewhere between about 1 and 19 points.
41 / 166
Logic of testing
I Work out what would happen if null hypothesis were true.
I Compare to what actually did happen.
I If these are too far apart, conclude that null hypothesis is not true after all. (Be guided by P-value.)
As applied to our reading programs:
I If reading programs equally good, expect to see a difference in means close to 0.
I Closeness quantified eg. by t.
I Mean reading score was 10 higher for new program.
I Difference of 10 was unusually big (P-value small from t-test and randomization test). So conclude that new reading program is effective.
Nothing here about what happens if null hypothesis is false. This is power and type II error probability.
42 / 166
Errors in testing
What can happen:
                 Decision
Truth            Do not reject    Reject null
Null true        Correct          Type I error
Null false       Type II error    Correct
Tension between truth and decision about truth (imperfect).
I Prob. of type I error denoted α. Usually fix α, eg. α = 0.05.
I Prob. of type II error denoted β. Determined by the planned experiment. Low β good.
I Prob. of not making type II error called power (= 1 − β). High power good.
43 / 166
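The claim that α is the probability of a type I error can be illustrated by simulation. A sketch in Python rather than R, so it can be checked mechanically; the population mean 10 and SD 3 below are arbitrary choices for illustration, not from the slides. Draw many samples with the null hypothesis true and count how often the test (wrongly) rejects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials, rejections = 2000, 0
for _ in range(trials):
    # Null hypothesis H0: mu = 10 is TRUE here: population mean really is 10
    x = rng.normal(loc=10, scale=3, size=25)
    if stats.ttest_1samp(x, popmean=10).pvalue < 0.05:
        rejections += 1  # a type I error
print(rejections / trials)  # close to alpha = 0.05
```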
Power
I Suppose H0 : θ = 10, Ha : θ 6= 10 for some parameter θ.
I Suppose H0 wrong. What does that say about θ?
I Not much. Could have θ = 11 or θ = 8 or θ = 496. In each case, H0 wrong.
I How likely a type II error is depends on what θ is:
I If θ = 496, should be able to reject H0 : θ = 10 even for small sample, so β should be small (power large).
I If θ = 11, might have hard time rejecting H0 even with large sample, so β would be larger (power smaller).
I Power depends on true parameter value, and on sample size.
I So we play “what if”: “if θ were 11 (or 8 or 496), what would power be?”.
44 / 166
Figuring out power
I Time to figure out power is before you collect any data, as part of planning process.
I Need to have idea of what kind of departure from null hypothesis of interest to you, eg. average improvement of 5 points on reading test scores. (Subject-matter decision, not statistical one.)
I Then, either:
I “I have this big a sample and this big a departure I want to detect. What is my power for detecting it?”
I “I want to detect this big a departure with this much power. How big a sample size do I need?”
I R or SAS.
45 / 166
Calculating power for a t-test
Think about reading programs again. Need to specify:
I What kind of t-test (here 2-sample, not 1-sample or paired)
I What kind of Ha (here one-sided)
I What the population SD is (usually have to guess this). Here the sample SDs were 11 and 17, so 15 seems a fair guess (same for each group).
I What size departure from null (here 5 points).
46 / 166
22 kids/group, difference 5
power.t.test(n=22,delta=5,sd=15,type="two.sample",
alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 22
## delta = 5
## sd = 15
## sig.level = 0.05
## power = 0.2887273
## alternative = one.sided
##
## NOTE: n is number in *each* group
This power (29%) way too small.
47 / 166
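The 29% figure can also be approximated by simulation (a Python sketch, not from the slides; scipy’s one-sided pooled two-sample t stands in for R’s): generate many pairs of samples with a true difference of 5 and count how often the test rejects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, delta, sd = 22, 5, 15
trials, hits = 4000, 0
for _ in range(trials):
    a = rng.normal(0, sd, n)       # "control", mean 0
    b = rng.normal(delta, sd, n)   # "treatment", true mean 5 higher
    # pooled two-sample t (scipy's default), matching power.t.test
    if stats.ttest_ind(b, a, alternative="greater").pvalue < 0.05:
        hits += 1  # correctly detected the difference
print(hits / trials)  # near the 0.289 from power.t.test
```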
Detecting a larger difference
Larger difference, say 10 points under same conditions, should be easier to detect. Change delta=5 to delta=10:
power.t.test(n=22,delta=10,sd=15,type="two.sample",
alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 22
## delta = 10
## sd = 15
## sig.level = 0.05
## power = 0.7020779
## alternative = one.sided
##
## NOTE: n is number in *each* group
Much better, power around 70%. Much likelier to detect this bigger difference.
48 / 166
Increasing the sample size
Making n bigger should improve power. For detecting difference of 5, increase sample size from 22 in each group to 40:
power.t.test(n=40,delta=5,sd=15,type="two.sample",alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 40
## delta = 5
## sd = 15
## sig.level = 0.05
## power = 0.4336523
## alternative = one.sided
##
## NOTE: n is number in *each* group
Power went up from 29% to 43%. Often need much bigger sample size to make much difference.
49 / 166
Making a graph of power
I “Power curve”.
I Two ways:
I Keep sample size fixed, change true difference between means
I Keep difference fixed, change sample size.
I Feed power.t.test a vector of true differences or sample sizes. Get back vector of power values:
my.del=seq(0,20,2)
mypower=power.t.test(n=22,delta=my.del,sd=15,
type="two.sample",alternative="one.sided")
mypower$power
## [1] 0.0500000 0.1131924 0.2192775 0.3670836 0.5380092 0.7020779 0.8328057
## [8] 0.9192728 0.9667502 0.9883915 0.9965808
50 / 166
Fixing up “list”; making scatterplot
I To plot the power and my.del values, we need to make a data frame of them:
power.df=data.frame(diff=my.del,power=mypower$power)
I I called the difference in means diff to remind me what it was.
I Our plot will be a scatterplot. Set up ggplot in the usual way, then:
I geom_point() will plot the x-points against the y-points.
I geom_line() joins points by lines (helpful here).
51 / 166
The power curve
ggplot(power.df,aes(x=diff,y=power))+geom_point()+geom_line()
[Power curve: power rises from 0.05 at diff = 0 to nearly 1 at diff = 20]
52 / 166
Embellishing the power curve
I The largest the power can be is 1 (the test correctly rejects the null every time).
I If two means actually equal (diff is 0), the null hypothesis actually correct, and then the prob of rejecting it is 0.05. So power should be 0.05 at diff 0, bigger elsewhere.
I Add these as horizontal lines to graph.
I Uses geom_hline(yintercept=y).
I Construct graph first, then print (over):
gr=ggplot(power.df,aes(x=diff,y=power))+
geom_point()+geom_line()+
geom_hline(yintercept=0.05,linetype="dashed")+
geom_hline(yintercept=1,linetype="dashed")
53 / 166
The improved power curve
gr
[Same power curve, with dashed horizontal lines added at power 0.05 and 1]
54 / 166
Power curve for sample size, fixed difference of 5
Sample sizes, 10 to 100 in steps of 5:
my.n=seq(10,100,5)
mypower=power.t.test(n=my.n,delta=5,sd=15,
type="two.sample",alternative="one.sided")
mypower$power
## [1] 0.1768868 0.2254334 0.2710950 0.3145661 0.3560927 0.3957733 0.4336523
## [8] 0.4697559 0.5041065 0.5367294 0.5676549 0.5969194 0.6245648 0.6506381
## [15] 0.6751905 0.6982766 0.7199535 0.7402799 0.7593159
55 / 166
Setting up the power curve
my.df=data.frame(n=my.n,power=mypower$power)
gr=ggplot(my.df,aes(x=n,y=power))+
geom_point()+geom_line()+
geom_hline(yintercept=0.05,linetype="dashed")+
geom_hline(yintercept=1,linetype="dashed")
56 / 166
The power curve
gr
[Power curve against sample size: power rises from about 0.18 at n = 10 to about 0.76 at n = 100]
57 / 166
Finding sample size
I Instead of being given a sample size and finding a power, we can be given a power and asked for the sample size needed to achieve it.
I Eg. to detect a difference of 5 between means, how big a sample do I need to get 80% power?
I Leave out n= and put in power=0.80 instead; solves for n:
power.t.test(power=0.80,delta=5,sd=15,type="two.sample",
alternative="one.sided")
##
## Two-sample t test power calculation
##
## n = 111.9686
## delta = 5
## sd = 15
## sig.level = 0.05
## power = 0.8
## alternative = one.sided
##
## NOTE: n is number in *each* group
58 / 166
Comments
I To have 80% chance of correctly detecting difference of 5 in means, need 112 children in each group.
I Sample size calculations apt to give depressing results!
59 / 166
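The n = 112 answer can be verified with the noncentral t distribution that power.t.test uses internally (a Python sketch with scipy; the noncentrality formula δ/(s√(2/n)) is the standard one for a pooled two-sample test, supplied here as an assumption rather than quoted from the slides):

```python
import math
from scipy import stats

# Power of a one-sided pooled two-sample t-test with n = 112 per group
n, delta, sd, alpha = 112, 5, 15, 0.05
df = 2 * n - 2
ncp = delta / (sd * math.sqrt(2 / n))   # noncentrality parameter
tstar = stats.t.ppf(1 - alpha, df)      # one-sided critical value
power = 1 - stats.nct.cdf(tstar, df, ncp)
print(power)  # just over 0.80, consistent with n = 111.97 for 80% power
```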
SAS
SAS has proc power which does all the same stuff (a little more flexibly):
I Two-sample t for means: twosamplemeans.
I Satterthwaite test: diff_satt
I One-sided.
I Desired difference in means 5.
I SD in each group 15
I Total sample size 44 (22 in each group)
I Want to find power for this: power=..
60 / 166
SAS code
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
stddev=15
ntotal=44
power=.;
Note punctuation: twosamplemeans to power really all one line, so semicolon only after power=..
61 / 166
SAS output
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Standard Deviation 15
Total Sample Size 44
Null Difference 0
Nominal Alpha 0.05
Group 1 Weight 1
Group 2 Weight 1
Computed Power
Actual
Alpha Power
0.0498 0.288
Power is 0.29, as with R. Note that we didn’t need any data.
62 / 166
Power for several alternatives and sample sizes
If you put several values on a line, SAS does all combinations of them, eg. mean differences 5 and 10, sample sizes 22 and 40 per group:
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5 10
stddev=15
ntotal=44 80
power=.;
63 / 166
Output
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Standard Deviation 15
Null Difference 0
Nominal Alpha 0.05
Group 1 Weight 1
Group 2 Weight 1
Computed Power
Mean N Actual
Index Diff Total Alpha Power
1 5 44 0.0498 0.288
2 5 80 0.0499 0.433
3 10 44 0.0498 0.701
4 10 80 0.0499 0.905
64 / 166
Comments
I Consistent with R results before.
I Increasing sample size increases power a bit.
I Increasing true mean difference increases power dramatically.
65 / 166
Sample size calculation
To calculate sample size required, leave ntotal missing (put in a dot) and fill in power. Here, mean difference 5, desired power 0.80:
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
stddev=15
ntotal=.
power=0.80;
66 / 166
Results
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Standard Deviation 15
Nominal Power 0.8
Null Difference 0
Nominal Alpha 0.05
Group 1 Weight 1
Group 2 Weight 1
Computed N Total
Actual Actual N
Alpha Power Total
0.05 0.800 224
Need 224 children total, or 112 in each group. As R.
67 / 166
Or, more accurately. . .
I Didn’t actually have same-size groups or equal population SDs. Assumed them for R.
I SAS will allow different values per group.
I We had sample sizes 21 and 23, sample SDs 11 and 17 (use as population SDs).
I Unequal sample sizes usually decrease power, but smaller sample size with smaller SD actually better. Overall effect unclear.
I SAS: use groupstddevs and groupns, and vertical bars.
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
groupstddevs=11|17
groupns=21|23
power=.;
68 / 166
Results
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Group 1 Standard Deviation 11
Group 2 Standard Deviation 17
Group 1 Sample Size 21
Group 2 Sample Size 23
Null Difference 0
Nominal Alpha 0.05
Computed Power
Actual
Alpha Power
0.0499 0.309
Power actually went up a tiny bit.
69 / 166
Unequal sample sizes
To show effect of unequal sample sizes, go back to SDs both being 15, buthave very unequal sample sizes. What effect does that have on power?
proc power;
twosamplemeans
test=diff_satt
sides=1
meandiff=5
stddev=15
groupns=10|34
power=.;
70 / 166
Results
Power for 22 in each group was 29%:
The POWER Procedure
Two-Sample t Test for Mean Difference with Unequal Variances
Fixed Scenario Elements
Distribution Normal
Method Exact
Number of Sides 1
Mean Difference 5
Standard Deviation 15
Group 1 Sample Size 10
Group 2 Sample Size 34
Null Difference 0
Nominal Alpha 0.05
Computed Power
Actual
Alpha Power
0.0505 0.225
Unequal sample sizes bring power down to 22.5%.
71 / 166
Return to actual data: CI in SAS. . .
For difference between mean reading scores:
proc ttest;
var score;
class group;
The TTEST Procedure
Variable: score
group Method Mean 95% CL Mean Std Dev
c 41.5217 34.1061 48.9374 17.1487
t 51.4762 46.4657 56.4867 11.0074
Diff (1-2) Pooled -9.9545 -18.8176 -1.0913 14.5512
Diff (1-2) Satterthwaite -9.9545 -18.6759 -1.2330
group Method 95% CL Std Dev
c 13.2627 24.2715
t 8.4213 15.8954
Diff (1-2) Pooled 11.9981 18.4947
Diff (1-2) Satterthwaite
72 / 166
Comments
I Same code as for test, look at different part of output.
I Look under 95% CL Mean and Satterthwaite line, interval from −18.68 to −1.23.
I Control minus treatment, so new reading method better on average by between 1 and 19 points.
I Would a 90% CI be longer or shorter than a 95% one? Code next page. Put appropriate alpha on proc ttest line.
73 / 166
90% CI
proc ttest alpha=0.10;
var score;
class group;
The TTEST Procedure
Variable: score
group Method Mean 90% CL Mean Std Dev
c 41.5217 35.3816 47.6618 17.1487
t 51.4762 47.3334 55.6190 11.0074
Diff (1-2) Pooled -9.9545 -17.3414 -2.5675 14.5512
Diff (1-2) Satterthwaite -9.9545 -17.2176 -2.6913
group Method 90% CL Std Dev
c 13.8098 22.8992
t 8.7834 14.9440
Diff (1-2) Pooled 12.3693 17.7758
Diff (1-2) Satterthwaite
74 / 166
Comparison
I 95%: −18.68 to −1.23.
I 90%: −17.22 to −2.69.
I 90% CI shorter, but has more chance to be wrong.
I Both CIs wide: we don’t know much about effect that new reading method has. (Solution: more data.)
75 / 166
Same thing in R
Default is 95% CI, get via t.test as before:
t.test(score~group,data=kids)
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.67588 -1.23302
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
Same as SAS.
76 / 166
90% CI in R
Add conf.level, which is the actual confidence level (unlike SAS’s alpha):
t.test(score~group,data=kids,conf.level=0.90)
##
## Welch Two Sample t-test
##
## data: score by group
## t = -2.3109, df = 37.855, p-value = 0.02638
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## -17.217609 -2.691293
## sample estimates:
## mean in group c mean in group t
## 41.52174 51.47619
Again, same as SAS.
77 / 166
Duality between confidence intervals and hypothesis tests
I Tests and CIs really do the same thing, if you look at them the right way. They are both telling you something about a parameter, and they use the same information from the data.
I Illustrate with R and SAS, R first.
78 / 166
95% CI (default)
group1=c(10,11,11,13,13,14,14,15,16)
group2=c(13,13,14,17,18,19)
t.test(group1,group2)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -2.0937, df = 8.7104, p-value = 0.0668
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.5625675 0.2292342
## sample estimates:
## mean of x mean of y
## 13.00000 15.66667
79 / 166
90% CI
t.test(group1,group2,conf.level=0.90)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -2.0937, df = 8.7104, p-value = 0.0668
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## -5.010308 -0.323025
## sample estimates:
## mean of x mean of y
## 13.00000 15.66667
80 / 166
Comparing results
Recall null here is H0 : µ1 − µ2 = 0. P-value 0.0668.
I 95% CI from −5.6 to 0.2, includes 0.
I 90% CI from −5.0 to −0.3, does not include 0.
I At α = 0.05, would not reject H0 since P-value > 0.05.
I At α = 0.10, would reject H0 since P-value < 0.10.
Not just coincidence. Following always true.
I Reject H0 at level α if and only if H0 value outside corresponding CI.
I Fail to reject H0 at level α if and only if H0 value inside corresponding CI.
I Idea: “plausible” parameter value inside CI, not rejected; “implausible” parameter value outside CI, rejected.
81 / 166
In SAS
I Some trickery involved because of data layout.
I Put all the data values in one column, with a second column labelling which group they are from:
data dual;
infile '/folders/myfolders/duality.txt';
input y group;
proc print;
Output over.
82 / 166
The data, small
Obs y group
1 10 1
2 11 1
3 11 1
4 13 1
5 13 1
6 14 1
7 14 1
8 15 1
9 16 1
10 13 2
11 13 2
12 14 2
13 17 2
14 18 2
15 19 2
83 / 166
Test and CI at default α = 0.05
proc ttest;
var y;
class group;
The TTEST Procedure
Variable: y
group Method Mean 95% CL Mean Std Dev
1 13.0000 11.4627 14.5373 2.0000
2 15.6667 12.8769 18.4564 2.6583
Diff (1-2) Pooled -2.6667 -5.2580 -0.0754 2.2758
Diff (1-2) Satterthwaite -2.6667 -5.5626 0.2292
group Method 95% CL Std Dev
1 1.3509 3.8315
2 1.6593 6.5198
Diff (1-2) Pooled 1.6499 3.6665
Diff (1-2) Satterthwaite
Method Variances DF t Value Pr > |t|
Pooled Equal 13 -2.22 0.0446
Satterthwaite Unequal 8.7104 -2.09 0.0668
84 / 166
Getting 90% CI
proc ttest alpha=0.10;
var y;
class group;
The TTEST Procedure
Variable: y
group Method Mean 90% CL Mean Std Dev
1 13.0000 11.7603 14.2397 2.0000
2 15.6667 13.4798 17.8535 2.6583
Diff (1-2) Pooled -2.6667 -4.7909 -0.5425 2.2758
Diff (1-2) Satterthwaite -2.6667 -5.0103 -0.3230
group Method 90% CL Std Dev
1 1.4365 3.4220
2 1.7865 5.5539
Diff (1-2) Pooled 1.7352 3.3806
Diff (1-2) Satterthwaite
Method Variances DF t Value Pr > |t|
Pooled Equal 13 -2.22 0.0446
Satterthwaite Unequal 8.7104 -2.09 0.0668
85 / 166
The value of this
I If you have a test procedure but no corresponding CI:
I you make a CI by including all the parameter values that would not be rejected by your test.
I Use:
I α = 0.01 for a 99% CI,
I α = 0.05 for a 95% CI,
I α = 0.10 for a 90% CI,
and so on.
86 / 166
The sign test
I Test for population median.
I Data:
x=c(3, 4, 5, 5, 6, 8, 11, 14, 19, 37)
I Let M denote pop. median. Test H0 : M = 12 vs. Ha : M ≠ 12.
I If H0 true, each data value either above or below 12, prob. 0.5 of each (ignore exact 12’s).
I Test statistic: # values above 12.
I If H0 true, test statistic has binomial distribution, n = 10, p = 0.5.
I Observed 3 values above 12 (and 7 below).
I P-value comes from prob. of 3 successes or less in that binomial distribution.
87 / 166
Probability distribution calculator
I R knows about many distributions, eg. norm, binom.
I All have prefixes:
I d: probability (or density) of exactly given value
I p: probability of given value or less (CDF)
I q: value that has that much probability at or below it (inverse CDF).
pnorm(1.96)
## [1] 0.9750021
qnorm(0.95)
## [1] 1.644854
dbinom(1,4,0.3)
## [1] 0.4116
pbinom(1,4,0.3)
## [1] 0.6517
qbinom(0.9,4,0.3)
## [1] 2
Standard normal: pnorm(1.96) is prob of less than 1.96; qnorm(0.95) is value with prob 0.95 less than it.
Binomial, n = 4, p = 0.3: dbinom(1,4,0.3) is prob of exactly 1 success; pbinom(1,4,0.3) is prob of 1 success or less; qbinom(0.9,4,0.3) is value with prob (at least) 0.9 less than it.
88 / 166
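The d/p/q pattern is not special to R. As a cross-check, here is a sketch using only the Python standard library (statistics.NormalDist plus math.comb for the binomial); the function names dbinom, pbinom, qbinom are just mnemonics borrowed from R, not library calls.

```python
from math import comb
from statistics import NormalDist

# p-prefix: CDF. pnorm(1.96) in R.
print(round(NormalDist().cdf(1.96), 7))      # prob of standard normal below 1.96

# q-prefix: inverse CDF. qnorm(0.95) in R.
print(round(NormalDist().inv_cdf(0.95), 6))  # value with prob 0.95 below it

def dbinom(k, n, p):
    # d-prefix: probability of exactly k successes
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pbinom(k, n, p):
    # p-prefix: probability of k successes or less
    return sum(dbinom(j, n, p) for j in range(k + 1))

def qbinom(q, n, p):
    # q-prefix: smallest k with CDF at least q
    return next(k for k in range(n + 1) if pbinom(k, n, p) >= q)

print(round(dbinom(1, 4, 0.3), 4))   # 0.4116
print(round(pbinom(1, 4, 0.3), 4))   # 0.6517
print(qbinom(0.9, 4, 0.3))           # 2
```

The five printed values match the R output on this slide.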
Sign test
I Data:
x
## [1] 3 4 5 5 6 8 11 14 19 37
I Hypotheses about median M: H0 : M = 12, Ha : M 6= 12.
I Test statistic: number of valuesabove 12.
table(x>12)
##
## FALSE TRUE
## 7 3
3 values above.
I P-value: prob. of 3 successes or less in binomial n = 10, p = 0.5, times 2 (two-sided test):
2*pbinom(3,10,0.5)
## [1] 0.34375
I No evidence that the median differs from 12.
89 / 166
Sign test in SAS
Lives in proc univariate. Note how null median specified:
data x;
input x @@;
cards;
3 4 5 5 6 8 11 14 19 37
;
proc univariate mu0=12;
90 / 166
SAS sign test results
The UNIVARIATE Procedure
Variable: x
Tests for Location: Mu0=12
Test -Statistic- -----p Value------
Student's t t -0.24399 Pr > |t| 0.8127
Sign M -2 Pr >= |M| 0.3438
Signed Rank S -9.5 Pr >= |S| 0.3730
P-value 0.3438, same as R.
91 / 166
Making R sign test into function
I Get CI for median as “what values of M would sign test not reject for?”
I Trying lots of values of M, do sign test for each.
I Efficiency: write R function to do sign test and return P-value:
sign.test=function(med,mydata) {
  tbl=table(mydata<med)
  stat=min(tbl)
  n=length(mydata)
  pval=2*pbinom(stat,n,0.5)
  return(pval)
}
I Function takes null median and data as input.
I Test statistic is # values greater or less than M, whichever is smaller.
I Returns two-sided P-value.
92 / 166
Using function to get CI
Idea: call function repeatedly on same data but different null M. Values of M with P-value > 0.05 inside 95% CI, others outside.
First test on actual problem:
sign.test(12,x)
## [1] 0.34375
All right. Try some values, making sure not to use any data values:
sign.test(18.5,x)
## [1] 0.109375
sign.test(19.5,x)
## [1] 0.02148438
18.5 inside, 19.5 outside.
93 / 166
Completing the interval
Lower end:
sign.test(4.5,x)
## [1] 0.109375
sign.test(3.5,x)
## [1] 0.02148438
4.5 inside, 3.5 outside.
So 95% CI for pop. median goes from 4.5 to 18.5.
Not a very good CI, for two reasons:
I Sample size small.
I Sign test not very powerful.
But it is a confidence interval.
94 / 166
Sign-testing a whole bunch of medians at once
options(width=60)
my.meds=seq(3.5,20.5,1)
pvals=sapply(my.meds,sign.test,x)
rbind(my.meds,pvals)
## [,1] [,2] [,3] [,4] [,5]
## my.meds 3.50000000 4.500000 5.5000000 6.500000 7.500000
## pvals 0.02148438 0.109375 0.7539063 1.246094 1.246094
## [,6] [,7] [,8] [,9] [,10]
## my.meds 8.5000000 9.5000000 10.5000000 11.50000 12.50000
## pvals 0.7539063 0.7539063 0.7539063 0.34375 0.34375
## [,11] [,12] [,13] [,14] [,15]
## my.meds 13.50000 14.500000 15.500000 16.500000 17.500000
## pvals 0.34375 0.109375 0.109375 0.109375 0.109375
## [,16] [,17] [,18]
## my.meds 18.500000 19.50000000 20.50000000
## pvals 0.109375 0.02148438 0.02148438
95 / 166
Comments
I sapply says: for each of the values in first thing (my.meds), run the function named in the second thing (sign.test), feeding anything else (x) into the function.
I Runs sign.test(3.5,x), sign.test(4.5,x), . . . , sign.test(20.5,x) and glues results together.
I P-value “1.25” is where 5 values above H0 median and 5 below, no evidence at all for rejecting null. (Capping at 1 would be a better value.)
I For 95% CI, look at where P-values cross 0.05 (twice).
96 / 166
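The scan over candidate medians works in any language with a binomial CDF. Here is a Python sketch of the same idea (standard library only, and capping the doubled P-value at 1, as the comment above suggests); sign_test_pval is a hypothetical helper, not a library function. It reproduces the 4.5-to-18.5 interval.

```python
from math import comb

def sign_test_pval(med, data):
    """Two-sided sign-test P-value for null median med."""
    kept = [v for v in data if v != med]   # ignore values exactly equal to med
    below = sum(v < med for v in kept)
    stat = min(below, len(kept) - below)   # smaller of counts below/above
    n = len(kept)
    pval = 2 * sum(comb(n, k) for k in range(stat + 1)) / 2**n
    return min(pval, 1)                    # cap the doubled P-value at 1

x = [3, 4, 5, 5, 6, 8, 11, 14, 19, 37]

# scan candidate medians 3.5, 4.5, ..., 20.5 (avoiding the data values);
# those with P-value > 0.05 are inside the 95% CI
inside = [m + 0.5 for m in range(3, 21) if sign_test_pval(m + 0.5, x) > 0.05]
print(min(inside), max(inside))   # 4.5 18.5
```

The P-value for the original null M = 12 comes out as 0.34375, matching the R and SAS results.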
Some more data
Take a look at these data (12 rows of 3 columns):
Case Drug A Drug B
1 2.0 3.5
2 3.6 5.7
3 2.6 2.9
4 2.6 2.4
5 7.3 9.9
6 3.4 3.3
7 14.9 16.7
8 6.6 6.0
9 2.3 3.8
10 2.0 4.0
11 6.8 9.1
12 8.5 20.9
97 / 166
Matched pairs
I Data are comparison of 2 drugs for effectiveness at reducing pain.
I 12 subjects (cases) were arthritis sufferers.
I Response is # hours of pain relief from each drug.
I In reading example, each child tried only one reading method.
I But here, each subject tried out both drugs, giving us two measurements.
I Possible because, if you wait long enough, one drug has no influence over effect of other.
I Advantage: focused comparison of drugs. Comparing one drug with another on same person removes a lot of random variability.
I Matched pairs, requires different analysis.
I Design: randomly choose 6 of 12 subjects to get drug A first, other 6 get drug B first.
98 / 166
Analysis, in SAS
Uses a t-test. Note paired line.
data analgesic;
infile "/folders/myfolders/analgesic.txt";
input subject druga drugb;
proc ttest;
paired druga*drugb;
99 / 166
Output of matched pairs t-test
N Mean Std Dev Std Err Minimum Maximum
12 -2.1333 3.4092 0.9841 -12.4000 0.6000
Mean 95% CL Mean Std Dev 95% CL Std Dev
-2.1333 -4.2994 0.0327 3.4092 2.4150 5.7884
DF t Value Pr > |t|
11 -2.17 0.0530
100 / 166
Comments
I P-value 0.0530.
I At α = 0.05, cannot quite reject null of no difference, though result is very close to significance.
I “Hand-calculation” way of doing this is to find the 12 differences, one for each subject, and do 1-sample t-test on those differences. Shown on next page.
101 / 166
Alternative way to do matched pairs
Define differences in data step:
data analgesic;
infile "/folders/myfolders/analgesic.txt";
input subject druga drugb;
diff=druga-drugb;
proc ttest h0=0;
var diff;
t-test is an ordinary 1-sample test on diff. Note that the null-hypothesis mean has to be given when there is only one sample.
102 / 166
Results
N Mean Std Dev Std Err Minimum Maximum
12 -2.1333 3.4092 0.9841 -12.4000 0.6000
Mean 95% CL Mean Std Dev 95% CL Std Dev
-2.1333 -4.2994 0.0327 3.4092 2.4150 5.7884
DF t Value Pr > |t|
11 -2.17 0.0530
103 / 166
Paired test in R (1)
I forgot to put variable names in the data file. Use what we have.
pain=read.table("analgesic.txt",header=F)
pain
## V1 V2 V3
## 1 1 2.0 3.5
## 2 2 3.6 5.7
## 3 3 2.6 2.9
## 4 4 2.6 2.4
## 5 5 7.3 9.9
## 6 6 3.4 3.3
## 7 7 14.9 16.7
## 8 8 6.6 6.0
## 9 9 2.3 3.8
## 10 10 2.0 4.0
## 11 11 6.8 9.1
## 12 12 8.5 20.9
104 / 166
Paired test in R (2)
attach(pain)
t.test(V2,V3,paired=T)
##
## Paired t-test
##
## data: V2 and V3
## t = -2.1677, df = 11, p-value = 0.05299
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.29941513 0.03274847
## sample estimates:
## mean of the differences
## -2.133333
P-value as before. Likewise, you can calculate the differences yourself and do a 1-sample t-test on them.
105 / 166
Assessing normality
I Analyses assume (theoretically) that differences normally distributed.
I Though we know that t-tests generally behave well even without normality.
I How to assess normality? A normal quantile plot.
I Idea: scatter of points should follow the straight line, without curving.
I Outliers show up at bottom left or top right of plot as points off the line.
106 / 166
The normal quantile plot
diff=V2-V3 ; qqnorm(diff) ; qqline(diff)
[Normal Q-Q plot of diff: theoretical quantiles vs. sample quantiles]
Points should follow the straight line. Bottom left one way off, so normality questionable here.
107 / 166
And using “ggplot”, but without line
ggplot(pain,aes(sample=diff))+stat_qq()
[Normal Q-Q plot of diff, ggplot version]
108 / 166
And in SAS. . .
proc univariate noprint;
qqplot diff;
109 / 166
Getting a line on SAS normal quantile plot
I SAS doesn’t automatically provide a line, but even without one, you see that these data are not normal because of the outlier bottom left.
I SAS can draw lines, but requires you to give a mean and SD to make the line with.
I Simplest way is to have SAS estimate them from the data, but the line is usually not very good.
I Or we can estimate them another way from IQR.
110 / 166
Having SAS estimate them
proc univariate noprint;
qqplot diff / normal(mu=est sigma=est);
111 / 166
Another way to estimate µ and σ
I Problem above is that SD was grossly inflated by outlier.
I On standard normal, quartiles about ±0.675:
qnorm(0.25); qnorm(0.75)
## [1] -0.6744898
## [1] 0.6744898
I So IQR of standard normal about 2(0.675) = 1.35.
I Thus IQR of any normal about 1.35σ.
I Idea: estimate σ by taking sample IQR and dividing by 1.35. Not affected by outliers.
I Here, IQR is 1.95, so estimate of σ is 1.455.
I In similar spirit, estimate µ by median, −1.65.
112 / 166
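The robust estimates are easy to verify. A Python sketch with the twelve differences (standard library only; method="inclusive" makes statistics.quantiles match R's default quantile rule). Note that dividing the IQR of 1.95 by 1.35 gives about 1.44 here; the slide's 1.455 presumably reflects a slightly different divisor or rounding along the way.

```python
from statistics import median, quantiles

diff = [-1.5, -2.1, -0.3, 0.2, -2.6, 0.1, -1.8,
        0.6, -1.5, -2.0, -2.3, -12.4]

# inclusive method = R's default (type 7) quantiles
q1, q2, q3 = quantiles(diff, n=4, method="inclusive")
iqr = q3 - q1                 # 1.95
sigma_hat = iqr / 1.35        # IQR of a normal is about 1.35 sigma
mu_hat = median(diff)         # robust estimate of mu: -1.65

print(round(iqr, 2), round(mu_hat, 2), round(sigma_hat, 2))
```

Unlike the sample SD, neither estimate is dragged around by the -12.4 outlier.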
Using improved µ and σ
proc univariate noprint;
qqplot diff / normal(mu=-1.65 sigma=1.455);
113 / 166
Dealing with non-normality
I One approach: do nothing, since the t-test takes care of non-normality.
I Another: do a sign test, testing whether the median difference is zero. In SAS, just ask for it.
I Another: do a randomization test.
114 / 166
Sign test in SAS
proc univariate;
var diff;
P-value 0.1460:
The UNIVARIATE Procedure
Variable: diff
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student's t t -2.16771 Pr > |t| 0.0530
Sign M -3 Pr >= |M| 0.1460
Signed Rank S -32 Pr >= |S| 0.0088
115 / 166
Look back at the data
I Data (differences drug A minus drug B):
diff
## [1] -1.5 -2.1 -0.3 0.2 -2.6 0.1 -1.8
## [8] 0.6 -1.5 -2.0 -2.3 -12.4
I See the big outlier (at end of list).
116 / 166
Sign test in R
I In R, either use the sign.test function that we wrote before:
sign.test(0,diff)
## [1] 0.1459961
I Or, count the number of positive and negative differences:
table(diff>0)
##
## FALSE TRUE
## 9 3
I and calculate the P-value as
pbinom(3,12,0.5)*2
## [1] 0.1459961
117 / 166
Randomization test idea
I Another way of doing a test when you’re worried about assumptions is to use the “randomization” idea.
I Ask “what could be switched around if H0 true”?
I In matched pairs, each of the matched observations could have come from either treatment (eg. drug).
I In two independent samples, each observation could have come from either sample.
I In ANOVA (multiple groups), each observation could have come from any group.
I Randomization test: randomly shuffle treatments/samples/groups according to experimental design.
118 / 166
Randomization test for matched pairs
I Randomly allocate results for each person to drug A or B.
I Observed sample mean difference:
mean(diff)
## [1] -2.133333
I Switching drugs (from observed data) means switching sign of difference A minus B. So generate random sample of −1 and 1 of size 12 (with replacement), like this:
pm=c(1,-1)
random.pm=sample(pm,12,replace=T)
random.pm
## [1] -1 1 1 -1 1 -1 1 1 -1 -1 -1 1
119 / 166
Generating a randomization sample and mean
I Then take observed differences and generate randomized ones by multiplying diff by these random signs:
random.diff=diff*random.pm
random.diff
## [1] 1.5 -2.1 -0.3 -0.2 -2.6 -0.1 -1.8
## [8] 0.6 1.5 2.0 2.3 -12.4
mean(random.diff)
## [1] -0.9666667
I This randomization gave a sample mean difference close to −1, while our observed value was more negative, −2.13.
120 / 166
One randomization sample
I Write a function to do randomization and return statistic, once.
rand.diff=function(x) {
  pm=c(-1,1)
  random.pm=sample(pm,length(x),replace=T)
  random.diff=x*random.pm
  return(mean(random.diff))
}
I Try it a few times:
rand.diff(diff)
## [1] -0.5833333
rand.diff(diff)
## [1] -1.083333
rand.diff(diff)
## [1] 0.7
rand.diff(diff)
## [1] 1.133333
I Each answer different, because of randomization. 121 / 166
Repeating many times
I Function replicate repeats something as many times as you like.
I In this case, repeat rand.diff(diff) many times:
replicate(5,rand.diff(diff))
## [1] 0.2666667 -0.7833333 1.1000000 -0.7166667
## [5] 1.1000000
rand.mean=replicate(1000,rand.diff(diff))
I Collection of replicated values called randomization distribution.
I Use this instead of normal, t, . . . to get P-value from.
122 / 166
Histogram of randomization distribution
r=data.frame(mean=rand.mean)
ggplot(r,aes(x=mean))+geom_histogram(binwidth=0.5)
[Histogram of the 1000 randomized mean differences]
123 / 166
The randomization distribution
I If t-test were plausible, this would look normal. Does it?
I Why not? The outlier difference, −12.4. If counted as positive, mean probably positive; if counted as negative, mean probably negative.
I This true almost regardless of other values (whether plus or minus).
I Randomization test can be trusted, though.
I P-value: how many of randomized means came out less than observed −2.13, doubled:
tab=table(rand.mean<=mean(diff))
tab
##
## FALSE TRUE
## 994 6
2*tab[2]/1000
## TRUE
## 0.012
I Randomization P-value small; no doubt now that mean pain relief for two drugs different.
124 / 166
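The sign-flipping scheme is easy to mirror outside R. As a cross-check, here is a Python sketch of the same matched-pairs randomization test (standard library only, seeded for reproducibility); one_flip is a hypothetical helper mirroring rand.diff.

```python
import random
from statistics import mean

diff = [-1.5, -2.1, -0.3, 0.2, -2.6, 0.1, -1.8,
        0.6, -1.5, -2.0, -2.3, -12.4]
obs = mean(diff)    # observed mean difference, about -2.13

random.seed(1)
nsim = 10000

def one_flip():
    # randomly reassign each pair's result to drug A or B: flip the sign
    return mean(d * random.choice((-1, 1)) for d in diff)

rand_means = [one_flip() for _ in range(nsim)]

# two-sided P-value: doubled proportion of flips at least as extreme as observed
pval = 2 * sum(m <= obs for m in rand_means) / nsim
print(round(obs, 2), pval)
```

With 10,000 flips the P-value lands well below 0.05, in the neighbourhood of the slides' 0.012.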
Randomization test for median difference
I With outlier, should we be using mean?
I Randomization test flexible; change mean to median:
rand.diff.med=function(x) {
  pm=c(-1,1)
  random.pm=sample(pm,length(x),replace=T)
  random.diff=x*random.pm
  return(median(random.diff))
}
replicate(5,rand.diff.med(diff))
## [1] 1.65 -0.15 0.15 0.35 0.05
rand.median=replicate(1000,rand.diff.med(diff))
I Change mean to median everywhere.
125 / 166
Histogram of randomization distribution for median
r=data.frame(med=rand.median)
ggplot(r,aes(x=med))+
geom_histogram(binwidth=0.5)
[Histogram of the 1000 randomized median differences: a single peak]
126 / 166
P-value for randomization test for median
I Randomization distribution has only one peak now.
I Sample median difference was −1.65 (median(diff)).
I P-value is proportion of these at least as extreme:
tab=table(rand.median<=median(diff))
tab
##
## FALSE TRUE
## 989 11
2*tab[2]/1000
## TRUE
## 0.022
I Again, reject hypothesis that two drugs equally good.
127 / 166
Comparison
Test results seem to be very different:
Test            Statistic  P-value
t-test          mean       0.053
randomization   mean       0.012
sign test       median     0.146
randomization   median     0.022
I Not that much agreement, though 2 randomization tests have similar results.
I Sign test P-value much larger than others (lack of power?)
I t-test probably not trustworthy (very non-normal dist. of differences).
I Using mean probably not wise (outlier difference).
I Randomization test for median probably best.
128 / 166
The kids’ reading data, again
kids[1:15,]
## group score
## 1 t 24
## 2 t 61
## 3 t 59
## 4 t 46
## 5 t 43
## 6 t 44
## 7 t 52
## 8 t 43
## 9 t 58
## 10 t 67
## 11 t 62
## 12 t 57
## 13 t 71
## 14 t 49
## 15 t 54
kids[16:30,]
## group score
## 16 t 43
## 17 t 53
## 18 t 57
## 19 t 49
## 20 t 56
## 21 t 33
## 22 c 42
## 23 c 33
## 24 c 46
## 25 c 37
## 26 c 43
## 27 c 41
## 28 c 10
## 29 c 42
## 30 c 55
kids[31:44,]
## group score
## 31 c 19
## 32 c 17
## 33 c 55
## 34 c 26
## 35 c 54
## 36 c 60
## 37 c 28
## 38 c 62
## 39 c 20
## 40 c 53
## 41 c 48
## 42 c 37
## 43 c 85
## 44 c 42
I 21 kids in “treatment”, new reading method; 23 in “control”, standard reading method.
129 / 166
Randomization test for two-sample data
I Each observation might, under H0, have come from either treatmentor control.
I Means for data:
aggregate(score~group,kids,mean)
## group score
## 1 c 41.52174
## 2 t 51.47619
Differ by 9.95.
I If we assign Treatment and Control to the observations at random,how likely would we be to see a difference in means of 9.95 or bigger?
130 / 166
Randomization test
I Ingredients:
I sample to shuffle treatment groups
I aggregate to calculate means for shuffled groups
I little bit of calculation to get difference in sample means.
attach(kids)
myshuffle=sample(group)
myshuffle
## [1] c t t c t c c t c c c t t t c t t c c c t t t
## [24] c t t c c t t c c t c t t c t t c c c c c
## Levels: c t
I This makes shuffled groups.
131 / 166
Randomization difference in means
I Now find group means for shuffled groups:
themeans=aggregate(score~myshuffle,data=kids,mean)
themeans
## myshuffle score
## 1 c 44.39130
## 2 t 48.33333
I Difference in means is second thing in second column minus firstthing in second column:
meandiff=themeans[2,2]-themeans[1,2]
meandiff
## [1] 3.942029
I Simulated treatment > simulated control.
132 / 166
Making into function
I We will repeat shuffling process many times, so put it in function.
I Input: data frame.
I Output: difference in means of shuffled groups.
shuffle.means=function(x) {
  myshuffle=sample(x$group)
  themeans=aggregate(score~myshuffle,data=x,mean)
  meandiff=themeans[2,2]-themeans[1,2]
  return(meandiff)
}
133 / 166
Again, using function!
shuffle.means(kids)
## [1] 0.8447205
shuffle.means(kids)
## [1] 4.124224
shuffle.means(kids)
## [1] 3.304348
replicate(5,shuffle.means(kids))
## [1] -7.2629400 1.0269151 7.3126294 -2.6169772
## [5] 0.4803313
I Each run does different shuffle, so gives different result.
I Last line does whole thing five times, collects results.
I Looks like a difference in means of 10 is improbably large, by chance.
134 / 166
1000 times
nsim=1000
ans=replicate(nsim,shuffle.means(kids))
I Investigate randomization by looking at ans.
135 / 166
Histogram of randomized mean differences
ggplot(data.frame(ans=ans),aes(x=ans))+
geom_histogram(binwidth=2)+geom_vline(xintercept=10)
[Histogram of the 1000 shuffled mean differences, with vertical line at the observed 10]
136 / 166
How many randomizations gave mean difference ≥ 10?
I By chance, mean difference of 10 or bigger appears unlikely.
I Use a logical vector to pick out the elements of ans that we want, then make a table of them:
isbig=(ans>=10);
table(isbig)
## isbig
## FALSE TRUE
## 983 17
I 17 out of 1000.
I Our randomization test gives a P-value of 17/1000 = 0.017, so would reject null hypothesis “new reading program has no effect” in favour of alternative “new reading program helps”.
I P-value similar to 0.013 from two-sample t.
137 / 166
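The same shuffle test can be written as a Python sketch (standard library only; scores transcribed from the kids data shown earlier, seeded for reproducibility). The helper one_shuffle is hypothetical, mirroring shuffle.means.

```python
import random
from statistics import mean

t = [24, 61, 59, 46, 43, 44, 52, 43, 58, 67, 62,
     57, 71, 49, 54, 43, 53, 57, 49, 56, 33]          # treatment, n = 21
c = [42, 33, 46, 37, 43, 41, 10, 42, 55, 19, 17, 55,
     26, 54, 60, 28, 62, 20, 53, 48, 37, 85, 42]      # control, n = 23

obs = mean(t) - mean(c)    # observed difference, about 9.95
scores = t + c

random.seed(1)

def one_shuffle():
    # randomly relabel which 21 scores count as "treatment"
    random.shuffle(scores)
    return mean(scores[:21]) - mean(scores[21:])

diffs = [one_shuffle() for _ in range(10000)]
pval = sum(d >= obs for d in diffs) / 10000   # one-sided, as on the slide
print(round(obs, 2), pval)
```

With 10,000 shuffles the P-value comes out in the same small range as the slides' 0.017.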
Why 1000 randomizations?
I No uniquely good answer.
I More randomizations is better.
I But didn’t want to wait too long for it to finish!
I If you think 10,000 (or a million) better, replace nsim in my code by desired value and run again.
138 / 166
Jumping rats
I Recall jumping rats.
I See whether larger amount of exercise (jumping) went with higher bone density.
I 30 rats randomly assigned to one of 3 treatment groups:
I Control (no jumping)
I Low jumping
I High jumping
I Random assignment: rats in each group similar in all important ways.
I So entitled to draw conclusions about cause and effect.
I Recall boxplots, over:
139 / 166
Boxplots
rats=read.table("jumping.txt",header=T)
ggplot(rats,aes(y=density,x=group))+geom_boxplot()
[Boxplots of bone density by group: Control, Highjump, Lowjump]
140 / 166
Testing: ANOVA (comparing several means)
rats.aov=aov(density~group,data=rats)
summary(rats.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 7434 3717 7.978 0.0019 **
## Residuals 27 12579 466
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I Usual ANOVA table, small P-value: significant result.
I Null hypothesis: all 3 groups have same mean.
I Alternative: “not all 3 means the same”, at least one is different from others.
I Reject null, but not very useful finding.
141 / 166
Same thing in SAS
Read in data and do ANOVA:
data rats;
infile '/folders/myfolders/jumping.txt' firstobs=2;
input group $ density;
proc anova;
class group;
model density=group;
142 / 166
Results (some)
The ANOVA Procedure
Dependent Variable: density
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 2 7433.86667 3716.93333 7.98 0.0019
Error 27 12579.50000 465.90741
Corrected Total 29 20013.36667
Source DF Anova SS Mean Square F Value Pr > F
group 2 7433.866667 3716.933333 7.98 0.0019
143 / 166
Which groups are different from which?
I ANOVA really only answers half our questions: it says “there are differences”, but doesn’t tell us which groups differ from which.
I One possibility (not the best): compare all possible pairs of groups, via two-sample t.
I First pick out each group:
detach(kids)
controls=filter(rats,group=="Control")
lows=filter(rats,group=="Lowjump")
highs=filter(rats,group=="Highjump")
144 / 166
Control vs. low
t.test(controls$density,lows$density)
##
## Welch Two Sample t-test
##
## data: controls$density and lows$density
## t = -1.0761, df = 16.191, p-value = 0.2977
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -33.83725 11.03725
## sample estimates:
## mean of x mean of y
## 601.1 612.5
No sig. difference here.
145 / 166
Control vs. high
t.test(controls$density,highs$density)
##
## Welch Two Sample t-test
##
## data: controls$density and highs$density
## t = -3.7155, df = 14.831, p-value = 0.002109
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -59.19139 -16.00861
## sample estimates:
## mean of x mean of y
## 601.1 638.7
These are different.
146 / 166
Low vs. high
t.test(lows$density,highs$density)
##
## Welch Two Sample t-test
##
## data: lows$density and highs$density
## t = -3.2523, df = 17.597, p-value = 0.004525
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -43.15242 -9.24758
## sample estimates:
## mean of x mean of y
## 612.5 638.7
These are different too.
147 / 166
But. . .
I We just did 3 tests instead of 1.
I So we have given ourselves 3 chances to reject H0 : all means equal, instead of 1.
I Thus α for this combined test is not 0.05.
148 / 166
John W. Tukey
I American statistician, 1915–2000
I Big fan of exploratory data analysis
I Invented boxplot
I Invented “honestly significant differences”
I Invented jackknife estimation
I Coined computing term “bit”
I Co-inventor of Fast Fourier Transform
149 / 166
Honestly Significant Differences
I Compare several groups with one test, telling you which groups differ from which.
I Idea: if all population means equal, find distribution of highest sample mean minus lowest sample mean.
I Any means unusually different compared to that declared significantly different.
150 / 166
Tukey on rat data
rats.aov=aov(density~group,data=rats)
TukeyHSD(rats.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = density ~ group, data = rats)
##
## $group
## diff lwr upr p adj
## Highjump-Control 37.6 13.66604 61.533957 0.0016388
## Lowjump-Control 11.4 -12.53396 35.333957 0.4744032
## Lowjump-Highjump -26.2 -50.13396 -2.266043 0.0297843
Again conclude that bone density for highjump group significantly higher than for other two groups.
151 / 166
Why Tukey’s procedure better than all t-tests
Look at P-values for the two tests:
Comparison Tukey t-tests
----------------------------------
Highjump-Control 0.0016 0.0021
Lowjump-Control 0.4744 0.2977
Lowjump-Highjump 0.0298 0.0045
I Tukey P-values (mostly) higher.
I Proper adjustment for doing three t-tests at once, not just one in isolation.
I Lowjump-Highjump comparison no longer significant at α = 0.01.
152 / 166
Same stuff in SAS
proc anova;
class group;
model density=group;
means group / tukey;
153 / 166
Tukey output (some)
The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for density
Means with the same letter are not significantly different.
Tukey Grouping Mean N group
A 638.700 10 Highjump
B 612.500 10 Lowjump
B
B 601.100 10 Control
154 / 166
Checking assumptionsggplot(rats,aes(y=density,x=group))+geom_boxplot()
●
●
550
575
600
625
650
675
Control Highjump Lowjumpgroup
dens
ity
Assumptions:
I Normally distributed data within each group
I with equal group SDs.155 / 166
The assumptions
I Normally-distributed data within each group
I Equal group SDs.
These are shaky here because:
I control group has outliers
I highjump group appears to have less spread than others.
Possible remedies:
I Transformation of response (usually works best when SD increases with mean)
I Different test, eg. randomization test.
I Can also try Mood’s Median Test (if you know about chi-squared tests). Based on medians, like a generalization of sign test. Not always very powerful. See over.
156 / 166
Mood’s median test
I Find median of all bone densities, regardless of group:
attach(rats)
m=median(density) ; m
## [1] 621.5
I Count up how many observations in each group above or below overall median:
tab=table(group,density>m)
tab
##
## group FALSE TRUE
## Control 9 1
## Highjump 0 10
## Lowjump 6 4
I All Highjump obs above overall median.
I Most Control obs below overall median.
I Suggests medians differ by group.
157 / 166
Mood’s median test 2/2
I Test whether association between group and being above/below overall median significant, using chi-squared test for association:
chisq.test(tab)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 16.8, df = 2, p-value = 0.0002249
I Very small P-value says that being above/below overall median depends on group.
I That is, groups do not all have same median.
158 / 166
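The chi-squared statistic behind Mood's test can be checked directly from the 3×2 table of counts on the previous slide; nothing beyond those counts is needed. A Python sketch of the usual observed-minus-expected computation:

```python
# counts of (below, above) overall median per group, from the table above
observed = {"Control": (9, 1), "Highjump": (0, 10), "Lowjump": (6, 4)}

rows = list(observed.values())
n = sum(sum(r) for r in rows)                            # 30 rats in total
col_totals = [sum(r[j] for r in rows) for j in (0, 1)]   # 15 below, 15 above

# expected count in each cell: row total * column total / grand total
chisq = 0.0
for r in rows:
    for j, o in enumerate(r):
        e = sum(r) * col_totals[j] / n
        chisq += (o - e) ** 2 / e

df = (len(rows) - 1) * (2 - 1)    # (3-1)*(2-1) = 2
print(round(chisq, 1), df)        # 16.8 2
```

Every expected count here is 10 × 15/30 = 5, and the statistic 16.8 on 2 df matches the chisq.test output.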
Randomization test
I Like two-sample randomization test: randomly shuffle groupmemberships.
I Choices about test statistic, eg.
I usual ANOVA F -statistic
I Tukey-like highest sample mean minus lowest
I I go with highest minus lowest.
159 / 166
Observed means and test statistic
I Calculate observed value of test statistic:
group.means=aggregate(density~group,data=rats,mean)
group.means
## group density
## 1 Control 601.1
## 2 Highjump 638.7
## 3 Lowjump 612.5
Actual means in second column of group.means:
z=group.means[,2]
obs=max(z)-min(z)
obs
## [1] 37.6
160 / 166
Randomly shuffling groups
Actual groups are:
rats$group
## [1] Control Control Control Control Control Control
## [7] Control Control Control Control Lowjump Lowjump
## [13] Lowjump Lowjump Lowjump Lowjump Lowjump Lowjump
## [19] Lowjump Lowjump Highjump Highjump Highjump Highjump
## [25] Highjump Highjump Highjump Highjump Highjump Highjump
## Levels: Control Highjump Lowjump
Shuffle like this:
shuffled.groups=sample(rats$group) ; shuffled.groups
## [1] Highjump Control Control Lowjump Lowjump Lowjump
## [7] Control Control Lowjump Lowjump Highjump Control
## [13] Control Highjump Highjump Highjump Lowjump Highjump
## [19] Lowjump Highjump Control Highjump Control Lowjump
## [25] Highjump Highjump Lowjump Control Lowjump Control
## Levels: Control Highjump Lowjump
161 / 166
Calculate means and highest minus lowest
Same idea as for actual data, but using shuffled.groups instead of group:
my.group.means=aggregate(density~shuffled.groups,
data=rats,mean)
my.group.means
## shuffled.groups density
## 1 Control 623.2
## 2 Highjump 613.0
## 3 Lowjump 616.1
z=my.group.means[,2]
max(z)-min(z)
## [1] 10.2
Need to do this a bunch of times. Idea: write a function to do it once, then run function many times.
162 / 166
Function to do it once
I Input: original data.
I Obtain shuffled groups
I Calculate mean for each group
I Return largest mean minus smallest mean.
I Function:
shuffle1=function(mydata=rats) {
  shuffled.groups=sample(mydata$group)
  means=aggregate(density~shuffled.groups,
                  data=mydata,mean)
  z=means[,2]
  max(z)-min(z)
}
163 / 166
Obtaining randomization distribution
I Test (randomizes, so look for “sane” answer):
shuffle1(rats)
## [1] 19.4
I Use replicate to do a few times:
replicate(6,shuffle1(rats))
## [1] 9.4 12.8 25.0 15.8 6.6 1.9
I Or many times, like 1000:
test.stat=replicate(1000,shuffle1(rats))
164 / 166
Randomization distribution of test.stat
vals=data.frame(ts=test.stat)
ggplot(vals,aes(x=ts))+geom_histogram(binwidth=5)+
geom_vline(xintercept=obs,colour="red")
[Histogram of the 1000 shuffled test statistics, red line at the observed 37.6]
#hist(test.stat)
#abline(v=obs,col="red")
165 / 166
Comments and P-value
I Distribution shape skewed to right (lower limit of 0).
I Red line marks observed highest minus lowest: seems unusually high.
I If group means really different (anyhow), highest minus lowest will be large, so one-tailed test.
I (Same logic as one-sidedness of F -test in ANOVA.)
I P-value the usual way:
table(test.stat>=obs)
##
## FALSE TRUE
## 998 2
I P-value 2/1000 = 0.0020. Despite outliers, still reject “all means equal”, with P-value very similar to ANOVA (0.0019).
166 / 166