TRANSCRIPT
1
A Course in Multiple Comparisons and Multiple Tests
Peter H. Westfall, Ph.D.
Professor of Statistics, Department of Inf. Systems and Quant. Sci.
Texas Tech University
2
Learning Outcomes
Elucidate reasons that multiple comparisons procedures (MCPs) are used, as well as their controversial nature
Know when and how to use classical interval-based MCPs including Tukey, Dunnett, and Bonferroni
Understand how MCPs affect power
Elucidate the definition of closed testing procedures (CTPs)
Understand specific types of CTPs, benefits and drawbacks
Distinguish false discovery rate (FDR) from familywise error rate (FWE)
Understand general issues regarding Bayesian MCPs
3
Introduction. Overview of Problems, Issues, and Solutions, Regulatory and Ethical Perspectives, Families of Tests, Familywise Error Rate, Bonferroni. (pp. 5-21)
Interval-Based Multiple Inferences in the standard linear models framework. One-way ANOVA and ANCOVA, Tukey, Dunnett, and Monte Carlo Methods, Adjusted p-values, general contrasts, Multivariate T distribution, Tight Confidence Bands, TreatmentxCovariate Interaction, Subgroup Analysis (pp. 22-55)
Power and Sample Size Determinations for multiple comparisons. (pp. 56-65)
Stepwise and Closed Testing Procedures I: P-value-Based Methods. Closure Method, Global Tests; Holm, Hommel, Hochberg and Fisher combined methods for p-Values; (pp. 66-90)
Stepwise and Closed Testing Procedures II: Fixed Sequences, Gatekeepers and I-U tests: Fixed Sequence tests, Gatekeeper procedures, Multiple hypotheses in a gate, Intersection-union tests; with application to dose response, primary and secondary endpoints, bioequivalence and combination therapies (pp. 91-101)
Outline of Material
4
Stepwise and Closed Testing Procedures III: Methods that use logical constraints and correlations. Lehmacher et al. Method for Multiple endpoints; Range-Based and F-based ANOVA Tests, Fisher’s protected LSD, Free and Restricted Combinations, Shaffer-Type Methods for dose comparisons and subgroup analysis (pp. 102-118)
Multiple nonparametric and semiparametric tests: Bootstrap and Permutation-based Closed testing. PROC MULTTEST, examples with multiple endpoints, genetic associations, gene expression, binary data and adverse events (pp. 119-139)
More complex models and FWE control: Heteroscedasticity, Repeated measures, and large sample methods. Applications: multiple treatment comparisons, crossover designs, logistic regression of cure rates (pp. 140-152)
False Discovery Rate: Benjamini and Hochberg’s method, comparison with FWE – controlling methods (153-158)
Bayesian methods: Simultaneous credible intervals, ranking probabilities and loss functions, PROC MIXED posterior sampling, Bayesian testing of multiple endpoints (pp. 159-178)
Conclusion, discussion, references (179-184)
Outline (Continued)
5
Sources of Multiplicity
Multiple variables (endpoints)
Multiple timepoints
Subgroup analysis
Multiple comparisons
Multiple tests of the same hypothesis
Variable and Model selection
Interim analysis
Hidden Multiplicity: File Drawers, Outliers
6
The Problem:
“Significant” results may fail to replicate.
Documented cases: Ioannidis (JAMA 2005)
7
An Example
Phase III clinical trial
Three arms – Placebo, AC, Drug
Endpoints: Signs and symptoms
Measured at weekly visits
Baseline covariates
8
Example-Continued
‘Features’ displayed at trial conclusion:
Trends
Baseline adjusted comparisons of raw data
Baseline adjusted % changes
Nonparametric and parametric tests
Specific endpoints and combinations of endpoints
Particular week results
AC and Placebo comparisons
Fact: The features that “look the best” are biased.
9
Example Continued – Feature Selection
‘Effect Size’ is a feature
Effect size = (mean difference)/sd
Dimensionless
.2=‘small’, .5=‘medium’, .8=‘large’
Estimated effect sizes : F1, F2,…,Fk
What if you select (max{F1,F2,…,Fk}) and publish it?
10
The Scientific Concern: Effect Sizes in 1000 Replicated Studies
[Figure: scatter plot. Horizontal axis: Effect Size of Selected Effect in First Study (0 to 1.5). Vertical axis: Replicated Effect Size (0 to 1.5).]
11
Feature Selection Model
Clinical Trials Simulation
Real data used
Conservative!
If you must know more:
Fj = δj + εj, j=1,…,20.
Error terms εj are N(0, .2²)
True effect sizes δj are N(.3, .1²)
Features Fj are highly correlated.
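The selection effect described by this model can be sketched with a small simulation. This is only an illustration in Python (the course itself uses SAS), and it assumes independent error terms, whereas the slide states that the features are highly correlated; the function name and defaults are hypothetical.

```python
import random
import statistics

def simulate_selection(n_studies=1000, k=20, seed=1):
    """Pick the best-looking feature in a first study, then replicate it.

    Assumption (not from the slides): error terms are independent.
    """
    rng = random.Random(seed)
    selected, replicated = [], []
    for _ in range(n_studies):
        # True effect sizes: N(.3, .1^2)
        true_effects = [rng.gauss(0.3, 0.1) for _ in range(k)]
        # Observed features in the first study: true effect + N(0, .2^2) error
        first = [d + rng.gauss(0.0, 0.2) for d in true_effects]
        best = max(range(k), key=lambda j: first[j])
        selected.append(first[best])
        # Replication of the same feature with a fresh error term
        replicated.append(true_effects[best] + rng.gauss(0.0, 0.2))
    return statistics.mean(selected), statistics.mean(replicated)

mean_selected, mean_replicated = simulate_selection()
```

Under these assumptions the average selected effect size should sit well above the average replicated one, which is the bias the preceding scatter plot is meant to convey.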
12
Key Points: (i) Multiplicity invites Selection (ii) Selection has an EFFECT
Just like effects due to
• Treatment
• Confounding
• Learning
• Nonresponse
• Placebo
13
Published Guidelines
ICH Guidelines
CPMP Points to consider
CDRH Statistical Guidance
ASA Ethical Guidelines
14
Regulatory/Journal/Ethical/Professional Concerns
Replicability (good science)
Fairness
Regulatory report: The drug company reported efficacy at p=.047.
We repeated the analysis in several different ways that the company might have done. In 20 re-analyses of the data, 18 produced p-values greater than .05. Only one of the 20 re-analyses produced a p-value smaller than .047.
15
Multiple Inferences: Notation
There is a “family” of k inferences
Parameters are θ1,…, θk
Null hypotheses are
H01: θ1=0, …, H0k: θk=0
16
Comparisonwise Error Rate (CER)
Intervals:
CERj = P(Intervalj incorrect)
Tests:
CERj = P(Reject H0j | H0j is true)
Usually CER = α = .05
17
Familywise Error Rate (FWE)
Intervals:
FWE = 1 - P(all intervals are correct)
Tests:
FWE = P(reject at least one true null)
18
False Discovery Rate
• FDR = E(proportion of rejections that are incorrect)
• Let R = total # of rejections• Let V = # of erroneous rejections
• FDR = E(V/R) (0/0 defined as 0).
• FWE = P(V>0)
19
Bonferroni Method
Identify Family of inferences
Identify number of elements (k) in the Family
Use α/k for all inferences.
Ex: With k=36, p-values must be less than 0.05/36 = 0.0014 to be “significant”
20
FWE Control for Bonferroni
FWE = P(p0j1 ≤ .05/36 or … or p0jm ≤ .05/36 | H0j1,..., H0jm true)
≤ P(p0j1 ≤ .05/36) + … + P(p0jm ≤ .05/36)
= (.05)m/36 ≤ .05
[Figure: Venn diagram of events A and B]
P(A∪B) ≤ P(A) + P(B)
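As a numerical companion to the bound above, the following Python sketch evaluates the FWE for k = 36 tests when all nulls are true. It assumes independent tests, which the Bonferroni inequality itself does not require; the function name is hypothetical.

```python
def fwe_independent(alpha_per_test, k):
    """FWE when all k nulls are true and the k tests are independent (assumption)."""
    return 1.0 - (1.0 - alpha_per_test) ** k

k = 36
fwe_unadjusted = fwe_independent(0.05, k)      # each test at .05
fwe_bonferroni = fwe_independent(0.05 / k, k)  # each test at .05/36
```

With unadjusted tests the familywise error rate is far above .05, while testing each hypothesis at .05/36 keeps it just under .05.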
21
“Families” in clinical trials¹
Efficacy
Main Interest - Primary & Secondary: Approval and Labeling depend on these. Tight FWE control needed.
Lesser Interest - Depending on goals and reviewers, FWE controlling methods might be needed.
Supportive Tests - mostly descriptive. FWE control not needed.
Exploratory Tests - investigate new indications - future trials needed to confirm - do what makes sense.
Safety
Serious and known treatment-related AEs: FWE control not needed.
All other AEs: Reasonable to control FWE (or FDR).
¹Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel Dekker Inc., New York, pp. 666-673.
22
Classical Single-Step Testing and Interval Methods to Control FWE
Simultaneous confidence intervals; Adjusted p-values
Dunnett method
Tukey’s method
Simulation-based methods for general comparisons
23
“Specificity” and “Sensitivity”
If you want ...                              …then use
Estimates of effect sizes & error margins    Simultaneous Confidence Intervals
Confident inequalities                       Stepwise or closed tests
Overall Test                                 F-test, O’Brien, etc.
The Model
Y = Xβ + ε, where ε ~ N(0, σ²I)
Includes ANOVA, ANCOVA, regression
For group comparisons, covariate adjustment
Not valid for survival analysis, binary data, multivariate data
25
Table 3.2: Means and Standard Deviations of Mouse Growth Data
Group      Control  1      2      3      4      5      6
Mean       105.38   95.90  80.48  72.14  91.88  84.68  74.24
Std. Dev.  13.44    23.89  12.68  8.41   9.44   18.35  7.81
Example: Pairwise Comparisons against Control
Goal: Estimate all mean differences from control and provide simultaneous 95% error margins:
ybar_i − ybar_0 ± c·s·sqrt(2/n), where s² is the error mean square and n is the per-group sample size
What c to use?
26
Comparison of Critical Values
Mouse Growth Data: ni = 4 mice per group (6 doses and one control), df = 21. There are 6 comparisons.
Method         α      α'      c
Bonferroni     0.05   0.0083  2.912
Sidak          0.05   0.0085  2.903
Dunnett*       0.05   0.0110  2.790
Unadjusted**   0.05   0.0500  2.080
* α' calculated from c
** Does not control the FWE
SAS file:
data;
  c_alpha = probmc("DUNNETT2",.,.95,21,6);
run;
proc print; run;
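The Bonferroni and Sidak per-comparison levels (α') in the table follow from simple formulas. The Python sketch below reproduces them for 6 comparisons; it is an illustration only, with hypothetical function names, and it does not compute the critical values c, which need t quantiles.

```python
def bonferroni_cer(fwe, k):
    """Per-comparison level for k comparisons: alpha / k."""
    return fwe / k

def sidak_cer(fwe, k):
    """Per-comparison level for k comparisons: 1 - (1 - alpha)^(1/k)."""
    return 1.0 - (1.0 - fwe) ** (1.0 / k)

bonf_6 = bonferroni_cer(0.05, 6)
sidak_6 = sidak_cer(0.05, 6)
```

The same two functions with k = 21 give the per-comparison levels for all pairwise comparisons among seven groups.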
27
Results - Dunnett
The GLM Procedure
Dunnett's t Tests for gain
NOTE: This test controls the Type I experimentwise error for comparisons of all treatments against a control.
Alpha                             0.05
Error Degrees of Freedom          21
Error Mean Square                 210.0048
Critical Value of Dunnett's t     2.78972
Minimum Significant Difference    28.586
Comparisons significant at the 0.05 level are indicated by ***.
Comparison   Difference Between Means   Simultaneous 95% Confidence Limits
1 - 0        -9.48                      -38.07   19.11
4 - 0        -13.50                     -42.09   15.09
5 - 0        -20.70                     -49.29    7.89
2 - 0        -24.90                     -53.49    3.69
6 - 0        -31.14                     -59.73   -2.55  ***
3 - 0        -33.24                     -61.83   -4.65  ***
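The minimum significant difference and the simultaneous limits in this output can be checked by hand from the reported critical value, error mean square, and group size. The Python sketch below does that arithmetic; the variable names are illustrative, and the critical value is taken from the output rather than computed.

```python
import math

mse = 210.0048   # Error Mean Square from the output
n = 4            # mice per group
c = 2.78972      # Critical Value of Dunnett's t from the output

# Minimum significant difference: c * sqrt(2 * MSE / n)
msd = c * math.sqrt(2.0 * mse / n)

# Differences from control, in the order shown in the output
diffs = {"1 - 0": -9.48, "4 - 0": -13.50, "5 - 0": -20.70,
         "2 - 0": -24.90, "6 - 0": -31.14, "3 - 0": -33.24}

intervals = {name: (d - msd, d + msd) for name, d in diffs.items()}
# A comparison is significant when its interval excludes zero
significant = [name for name, (lo, hi) in intervals.items() if lo > 0 or hi < 0]
```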
28
c is the 1−α quantile of the distribution of max_i |Zi − Z0| / (2χ²/df)^(1/2), called Dunnett’s two-sided range distribution.
29
Adjusted p-Values
Definition: Adjusted p-value = smallest FWE at which the hypothesis is rejected.
or
The FWE for which the confidence interval has “0” as a boundary.
30
Adjusted p-values for Dunnett
The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Dunnett
G   GAIN LSMEAN   Pr > |T| H0: LSMEAN=CONTROL
0   105.380000
1   95.900000     0.8608 (0.3654)
2   80.480000     0.1034 (0.0242)
3   72.140000     0.0188 (0.0039)
4   91.880000     0.6090 (0.2019)
5   84.680000     0.2196 (0.0563)
6   74.240000     0.0294 (0.0062)
proc glm data=tox;
  class g;
  model gain=g;
  lsmeans g/adjust=dunnett pdiff;
run;
31
Table 3.2: Means and Standard Deviations of Mouse Growth Data
Group      Control  1      2      3      4      5      6
Mean       105.38   95.90  80.48  72.14  91.88  84.68  74.24
Std. Dev.  13.44    23.89  12.68  8.41   9.44   18.35  7.81
Example: All Pairwise Comparisons
Goal: Estimate all mean differences and provide simultaneous 95% error margins:
ybar_i − ybar_i' ± c·s·sqrt(2/n), where s² is the error mean square and n is the per-group sample size
What c to use?
32
Comparison of Critical Values
Mouse Growth Data: ni = 4 mice per group (6 doses and one control), df = 21. There are 21 pairwise comparisons.
Method         α      α'      c
Bonferroni     0.05   0.0024  3.453
Sidak          0.05   0.0024  3.443
Tukey*         0.05   0.0038  3.251
Unadjusted**   0.05   0.0500  2.080
* α' calculated from c
** Does not control the FWE
SAS file:
data;
  qval = probmc("RANGE",.,.95,21,7);
  c_alpha = qval/sqrt(2);
run;
proc print; run;
33
Tukey Comparisons
Alpha= 0.05  df= 21  MSE= 210.0048
Critical Value of Studentized Range= 4.597
Minimum Significant Difference= 33.311
Means with the same letter are not significantly different.
Tukey Grouping   Mean     N   G
A                105.38   4   0
A                95.90    4   1
A                91.88    4   4
A                84.68    4   5
A                80.48    4   2
A                74.24    4   6
A                72.14    4   3
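The relation between the Studentized range critical value and the critical value c used in the intervals can be checked numerically. The Python sketch below uses the rounded value 4.597 printed in the output, so the minimum significant difference agrees with the reported 33.311 only to about two decimals; variable names are illustrative.

```python
import math

q = 4.597        # Critical Value of Studentized Range (rounded, from the output)
mse = 210.0048   # error mean square
n = 4            # observations per group

c = q / math.sqrt(2.0)               # critical value on the t scale
msd = c * math.sqrt(2.0 * mse / n)   # same as q * sqrt(MSE / n)

largest_gap = 105.38 - 72.14         # control versus group 3
any_significant = largest_gap > msd
```

The largest observed gap falls just short of the minimum significant difference, which is why every group receives the same Tukey letter.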
34
Tukey Adjusted p-Values
General Linear Models Procedure
Least Squares Means
Adjustment for multiple comparisons: Tukey
Pr > |T| H0: LSMEAN(i)=LSMEAN(j)
G   GAIN LSMEAN   i/j   1        2        3        4        5        6        7
0   105.380       1     .        0.9641   0.2351   0.0507   0.8364   0.4319   0.0769
1   95.900        2     0.9641   .        0.7391   0.2810   0.9996   0.9227   0.3806
2   80.480        3     0.2351   0.7391   .        0.9808   0.9172   0.9995   0.9958
3   72.140        4     0.0507   0.2810   0.9808   .        0.4860   0.8771   1.0000
4   91.880        5     0.8364   0.9996   0.9172   0.4860   .        0.9910   0.6102
5   84.680        6     0.4319   0.9227   0.9995   0.8771   0.9910   .        0.9438
6   74.240        7     0.0769   0.3806   0.9958   1.0000   0.6102   0.9438   .
35
Tukey Simultaneous Intervals
i   j   Simultaneous Lower Confidence Limit   Difference Between Means   Simultaneous Upper Confidence Limit
1   2   -23.831013     9.480000    42.791013
1   3    -8.411013    24.900000    58.211013
1   4    -0.071013    33.240000    66.551013
1   5   -19.811013    13.500000    46.811013
1   6   -12.611013    20.700000    54.011013
1   7    -2.171013    31.140000    64.451013
2   3   -17.891013    15.420000    48.731013
2   4    -9.551013    23.760000    57.071013
2   5   -29.291013     4.020000    37.331013
2   6   -22.091013    11.220000    44.531013
2   7   -11.651013    21.660000    54.971013
3   4   -24.971013     8.340000    41.651013
3   5   -44.711013   -11.400000    21.911013
3   6   -37.511013    -4.200000    29.111013
3   7   -27.071013     6.240000    39.551013
4   5   -53.051013   -19.740000    13.571013
4   6   -45.851013   -12.540000    20.771013
4   7   -35.411013    -2.100000    31.211013
5   6   -26.111013     7.200000    40.511013
5   7   -15.671013    17.640000    50.951013
6   7   -22.871013    10.440000    43.751013
36
c is (1/√2) × {the 1−α quantile of the distribution of max_{i,i'} |Zi − Zi'| / (χ²/df)^(1/2)}, which is called the Studentized range distribution.
37
Unbalanced Designs and/or Covariates
Tukey method is conservative when the design is unbalanced and/or there are covariates; otherwise exact
Dunnett method is conservative when there are covariates; otherwise exact
“Conservative” means
{True FWE} < {Nominal FWE} ;
also means “less powerful”
38
Tukey-Kramer Method for all pairwise comparisons
Let c be the critical value for the balanced case using Tukey’s method and the correct df.
Intervals are ybar_i − ybar_i' ± c·s·sqrt(1/n_i + 1/n_i')
Conservative (Hayter, 1984 Annals)
39
Exact Method for General Comparisons of Means
40
Multivariate T-Distribution Details
40
41
Calculation of “Exact” c
• Edwards and Berry: Simple simulation
• Hsu and Nelson: Factor analytic control variate (better)
• Genz and Bretz: Integration using lattice methods (best)
Even with simple simulation, the value c can be obtained with reasonable precision.
Edwards, D., and Berry, J. (1987) The efficiency of simulation-based multiple comparisons. Biometrics, 43, 913-928.
Hsu, J.C. and Nelson, B.L. (1998) Multiple comparisons in the general linear model. Journal of Computational and Graphical Statistics, 7, 23-41.
Genz, A. and Bretz, F. (1999), Numerical Computation of Multivariate t Probabilities with Application to Power Calculation of Multiple Contrasts, J. Stat. Comp. Simul. 63, pp. 361-378.
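To give a feel for the "simple simulation" idea, the Python sketch below estimates the two-sided Dunnett critical value for the balanced mouse example (6 doses, one control, df = 21) by direct Monte Carlo. It is a rough illustration under those assumptions, not the Edwards and Berry algorithm as implemented in SAS; the function name and sample size are hypothetical.

```python
import random

def simulate_dunnett_c(groups=6, df=21, alpha=0.05, nsamp=100000, seed=12345):
    """Monte Carlo estimate of the 1-alpha quantile of
    max_i |Z_i - Z_0| / sqrt(2 * chi2_df / df) for a balanced design."""
    rng = random.Random(seed)
    stats = []
    for _ in range(nsamp):
        z0 = rng.gauss(0.0, 1.0)
        # chi-square with df degrees of freedom = Gamma(shape=df/2, scale=2)
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        scale = (2.0 * chi2 / df) ** 0.5
        max_abs = max(abs(rng.gauss(0.0, 1.0) - z0) for _ in range(groups))
        stats.append(max_abs / scale)
    stats.sort()
    return stats[int((1.0 - alpha) * nsamp)]

c_hat = simulate_dunnett_c()
```

With 100,000 samples the estimate should land close to the tabulated Dunnett value of 2.790; more samples tighten it further.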
42
Example: ANCOVA with two covariates
Y = Diastolic BP
Group = Therapy (Control, D1, D2, D3)
X1 = Baseline Diastolic BP
X2 = Baseline Systolic BP
Goal: Compare all therapies, controlling for baseline
proc glm data=research.bpr;
  class therapy;
  model dbp10 = therapy dbp7 sbp7;
  lsmeans therapy/pdiff cl adjust=simulate(nsamp=10000000 cvadjust seed=121011 report);
run; quit;
43
Results From ANCOVA
Source DF Type III SS Mean Square F Value Pr > F
THERAPY 3 677.429172 225.809724 6.05 0.0006 DBP7 1 6832.878653 6832.878653 183.06 <.0001 SBP7 1 51.123459 51.123459 1.37 0.2435
Least Squares Means for Effect THERAPY
i   j   Difference Between Means   Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1   2    2.832658                  -0.424816    6.09013
1   3    1.328481                  -2.099566    4.756527
1   4   -2.536262                  -5.981471    0.908947
2   3   -1.504178                  -4.846403    1.838047
2   4   -5.368920                  -8.734744   -2.003097
3   4   -3.864743                  -7.398994   -0.330491
Note: “4” is control
44
Details for Quantile Simulation
Random number seed          121011
Comparison type             All
Sample size                 9999938
Target alpha                0.05
Accuracy radius (target)    0.0002
Accuracy radius (actual)    437E-7
Accuracy confidence         99%
Simulation Results
Method         95% Quantile   Estimated Alpha   99% Confidence Limits
Simulated      2.594159       0.0500            0.0500   0.0500
Tukey-Kramer   2.594637       0.0499            0.0499   0.0500
Bonferroni     2.669484       0.0411            0.0410   0.0411
Sidak          2.662029       0.0419            0.0418   0.0419
GT-2           2.660647       0.0420            0.0420   0.0421
Scheffe        2.823701       0.0270            0.0270   0.0270
T              1.974017       0.2017            0.2016   0.2018
NOTE: PROCEDURE GLM used: real time 21.23 seconds
45
Results from ANCOVA-Dunnett
THERAPY   DBP10 LSMEAN   H0:LSMean=Control Pr > |t|
Dose 1    88.8171113     0.1407
Dose 2    85.9844529     0.0002
Dose 3    87.4886307     0.0140
Placebo   91.3533732
Least Squares Means for Effect THERAPY
i   j   Difference Between Means   Simultaneous 95% Confidence Limits for LSMean(i)-LSMean(j)
1   4   -2.536262                  -5.675847    0.603323
2   4   -5.368920                  -8.436161   -2.301679
3   4   -3.864743                  -7.085470   -0.644015
46
Details for Quantile Simulation-Dunnett
Random number seed          121011
Comparison type             Control, two-sided
Sample size                 9999938
Target alpha                0.05
Accuracy radius (target)    0.0002
Accuracy radius (actual)    139E-7
Accuracy confidence         99%
Simulation Results
Method                   95% Quantile   Estimated Alpha   99% Confidence Limits
Simulated                2.364031       0.0500            0.0500   0.0500
Dunnett-Hsu, two-sided   2.364084       0.0500            0.0500   0.0500
Bonferroni               2.417902       0.0437            0.0437   0.0437
Sidak                    2.411491       0.0444            0.0444   0.0444
GT-2                     2.410664       0.0445            0.0445   0.0445
Scheffe                  2.823701       0.0145            0.0145   0.0145
T                        1.974017       0.1229            0.1229   0.1230
NOTE: PROCEDURE GLM used: real time 19.00 seconds
47
More General Inferences
Question: For what values of the covariate is treatment A better than treatment B?
48
Discussion of (Treatment × Covariate) Interaction Example
49
The GLIMMIX Procedure
Computes MC-exact simultaneous confidence intervals and adjusted p-values for any set of linear functions in a linear model
50
GLIMMIX syntax
proc glimmix data=research.tire;
  class make;
  model cost = make mph make*mph;
  estimate "10" make 1 -1 make*mph 10 -10,
           "15" make 1 -1 make*mph 15 -15,
           "20" make 1 -1 make*mph 20 -20,
           "25" make 1 -1 make*mph 25 -25,
           "30" make 1 -1 make*mph 30 -30,
           "35" make 1 -1 make*mph 35 -35,
           "40" make 1 -1 make*mph 40 -40,
           "45" make 1 -1 make*mph 45 -45,
           "50" make 1 -1 make*mph 50 -50,
           "55" make 1 -1 make*mph 55 -55,
           "60" make 1 -1 make*mph 60 -60,
           "65" make 1 -1 make*mph 65 -65,
           "70" make 1 -1 make*mph 70 -70
           /adjust=simulate(nsamp=10000000 report) cl;
run;
51
Output from PROC GLIMMIX
Simultaneous intervals are Estimate +- 2.648 * StdErr
Label   Estimate   StdErr   tValue   AdjLower   AdjUpper
10      -4.1067    0.9143   -4.49    -6.5279    -1.6854
15      -3.4539    0.8084   -4.27    -5.5947    -1.3131
20      -2.8011    0.7101   -3.94    -4.6815    -0.9207
25      -2.1483    0.6230   -3.45    -3.7981    -0.4985
30      -1.4956    0.5524   -2.71    -2.9585    -0.03260
35      -0.8428    0.5054   -1.67    -2.1812     0.4956
40      -0.1900    0.4887   -0.39    -1.4842     1.1042
45       0.4628    0.5054    0.92    -0.8756     1.8012
50       1.1156    0.5524    2.02    -0.3474     2.5785
55       1.7683    0.6230    2.84     0.1185     3.4181
60       2.4211    0.7101    3.41     0.5407     4.3015
65       3.0739    0.8084    3.80     0.9331     5.2147
70       3.7267    0.9143    4.08     1.3054     6.1479
Bonferroni – critical value is t_{16, .05/(2*13)} = 3.377.
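The stated rule, Estimate ± 2.648 × StdErr, can be applied directly to the printed estimates to see at which speeds the two makes differ. The Python sketch below does this with the rounded values from the output; the variable names are illustrative.

```python
C_SIM = 2.648  # simulation-based critical value reported on the slide

# (mph, estimate, standard error) copied from the output table
rows = [(10, -4.1067, 0.9143), (15, -3.4539, 0.8084), (20, -2.8011, 0.7101),
        (25, -2.1483, 0.6230), (30, -1.4956, 0.5524), (35, -0.8428, 0.5054),
        (40, -0.1900, 0.4887), (45, 0.4628, 0.5054), (50, 1.1156, 0.5524),
        (55, 1.7683, 0.6230), (60, 2.4211, 0.7101), (65, 3.0739, 0.8084),
        (70, 3.7267, 0.9143)]

def simultaneous_interval(estimate, stderr, c=C_SIM):
    """Simultaneous confidence limits: estimate +/- c * stderr."""
    return estimate - c * stderr, estimate + c * stderr

# Speeds at which the interval excludes zero
significant_mph = [mph for mph, est, se in rows
                   if simultaneous_interval(est, se)[1] < 0
                   or simultaneous_interval(est, se)[0] > 0]
```

Intervals below zero at low speeds and above zero at high speeds answer the "for what values of the covariate" question with simultaneous confidence.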
52
Other Applications of Linear Combinations
Multiple Trend Tests
(0,1,2,3), (0,1,2,4), (0,4,6,7) (carcinogenicity)
(0,0,1), (0,1,1), (0,1,2) (recessive/dominant/ordinal genotype effects)
Subgroup Analysis
Subgroups define linear combinations (more on next slide)
53
Subgroup Analysis Example
Data: Yijkl, where i=Trt,Cntrl; j=Old,Yng; k=GoodInit,PoorInit.
Model: Yijkl = μijk + εijkl, where μijk = μ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk
Subgroup Contrasts (columns index the cell means μijk):
            111   112   121   122   211   212   221   222
Overall     ¼     ¼     ¼     ¼     -¼    -¼    -¼    -¼
Older       ½     ½     0     0     -½    -½    0     0
Younger     0     0     ½     ½     0     0     -½    -½
GoodInit    ½     0     ½     0     -½    0     -½    0
PoorInit    0     ½     0     ½     0     -½    0     -½
OldGood     1     0     0     0     -1    0     0     0
OldPoor     0     1     0     0     0     -1    0     0
YoungGood   0     0     1     0     0     0     -1    0
YoungPoor   0     0     0     1     0     0     0     -1
54
Subgroup Analysis Results
Label Estimate StdErr tValue Probt Adjp AdjLower AdjUpper
Overall 0.7075 0.1956 3.62 0.0002 0.0015 0.2460 I
Older 0.9952 0.2673 3.72 0.0002 0.0011 0.3646 I
Younger 0.4197 0.2847 1.47 0.0717 0.2605 -0.2521 I
GoodInitHealth 0.5871 0.2878 2.04 0.0219 0.0984 -0.09197 I
PoorInitHealth 0.8279 0.2644 3.13 0.0011 0.0068 0.2039 I
OldGood 0.8748 0.3387 2.58 0.0056 0.0295 0.07566 I
OldPoor 1.1157 0.3231 3.45 0.0004 0.0026 0.3532 I
YoungGood 0.2993 0.3562 0.84 0.2014 0.5494 -0.5413 I
YoungPoor 0.5401 0.3338 1.62 0.0544 0.2091 -0.2476 I
(SAS code available upon request)
55
Summary
Include only comparisons of interest.
Utilize correlations to be less conservative.
The critical values can be computed exactly only in balanced ANOVA for all pairwise comparisons, or in unbalanced ANOVA for comparisons with control.
Simulation-based methods are “exact” if you let the computer run for a while. This is my general recommendation.
56
Power Analysis
Sample size - Design of study
Power is less when you use multiple comparisons ⇒ larger sample sizes
Many power definitions
Bonferroni & independence are convenient (but conservative) starting points
57
Power Definitions
“Complete Power” = P(Reject all H0i that are false)
“Minimal Power” = P(Reject at least one H0i that is false)
“Individual Power” = P(Reject a particular H0i that is false)
“Proportional Power” = Average proportion of false H0i that are rejected
58
Power Calculations.
Example: H1 and H2 powered individually at 50%; H3 and H4 powered individually at 80%, all tests independent.
Complete Power = P(reject H1 and H2 and H3 and H4) = .5 × .5 × .8 × .8 = 0.16.
Minimal Power = P(reject H1 or H2 or H3 or H4) = 1 − P(“accept” H1 and H2 and H3 and H4) = 1 − (1−.5)(1−.5)(1−.8)(1−.8) = 0.99.
Individual Power = P(reject H3 (say)) = 0.80. (depends on the test)
Proportional Power = (.5 + .5 + .8 + .8)/4 = 0.65
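The three aggregate power definitions in this example reduce to a few lines of arithmetic under the stated independence assumption. A minimal Python sketch, with illustrative variable names:

```python
from math import prod

individual = [0.5, 0.5, 0.8, 0.8]   # individual powers of H1..H4 (independent tests)

complete_power = prod(individual)                       # reject all false nulls
minimal_power = 1.0 - prod(1.0 - p for p in individual) # reject at least one
proportional_power = sum(individual) / len(individual)  # average proportion rejected
```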
59
Sample Size for Adequate Individual Power - Conservative Estimate
60
Individual power of two-tail two-sample Bonferroni t-tests
%let MuDiff = 5;      /* Smallest meaningful difference MUx-MUy that you want to detect */
%let Sigma  = 10.0;   /* A guess of the population std. dev. */
%let alpha  = .05;    /* Familywise Type I error probability of the test */
%let k      = 4;      /* Number of tests */
options ls=76;
data power;
  cer = &alpha/&k;
  do n = 2 to 100 by 2;                       *n=sample size for each group*;
    df = n + n - 2;
    ncp = (&Mudiff)/(&Sigma*sqrt(2/n));       * The noncentrality parameter *;
    tcrit = tinv(1-cer/2, df);                * The critical t value *;
    power = 1 - probt(tcrit, df, ncp) + probt(-tcrit,df,ncp);
    output;
  end;
proc print data=power;
run;
proc plot data=power;
  plot power*n/vpos=30;
run;
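For readers without SAS, the same calculation can be approximated in Python. The sketch below swaps the noncentral t distribution used in the SAS code for a normal approximation (an assumption made so that only the standard library is needed), so the resulting sample size comes out slightly smaller than the exact t-based answer; the function names are hypothetical.

```python
from statistics import NormalDist

def approx_power(n, mu_diff=5.0, sigma=10.0, alpha=0.05, k=4):
    """Normal approximation to the individual power of a two-sided,
    two-sample Bonferroni test with n observations per group."""
    z = NormalDist()
    cer = alpha / k                              # Bonferroni comparisonwise level
    ncp = mu_diff / (sigma * (2.0 / n) ** 0.5)   # noncentrality parameter
    zcrit = z.inv_cdf(1.0 - cer / 2.0)           # normal stand-in for tinv
    return 1.0 - z.cdf(zcrit - ncp) + z.cdf(-zcrit - ncp)

def smallest_n(target=0.80, step=2):
    """Smallest per-group n (searched in steps of 2) reaching the target power."""
    n = 2
    while approx_power(n) < target:
        n += step
    return n

n_needed = smallest_n()
```

The gap between this approximate value and the t-based one reflects the heavier tails of the t distribution at moderate degrees of freedom.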
61
Graph of Power Function
[Figure: line-printer plot of power (vertical axis, 0.0 to 1.0) against n (horizontal axis, 0 to 100). Power increases with n; annotation: n=92 for 80% power.]
62
%IndividualPower macro*
• Uses PROBMC and PROBT (noncentral)
• Assumes that you want to use the single-step (confidence interval based) Dunnett (one- or two-sided) or Range (two-sided) test
•Less conservative than Bonferroni
•Conservative compared to stepwise procedures
•%IndividualPower(MCP=DUNNETT2,g=4,d=5,s=10);
*Westfall et al (1999), Multiple Comparisons and Multiple Tests Using SAS
63
%IndividualPower Output
64
More general Power- Simulate!
Invocation:
%SimPower(method = dunnett ,
          TrueMeans = (10, 10, 13, 15, 15) ,
          s = 10 ,
          n = 87 ,
          seed=12345 );
Output:
Method=DUNNETT, Nominal FWE=0.05, nrep=1000
True means = (10, 10, 13, 15, 15), n=87, s=10
Quantity             Estimate   ---95% CI----
Complete Power       0.28800    (0.260,0.316)
Minimal Power        0.92900    (0.913,0.945)
Proportional Power   0.65133    (0.633,0.669)
True FWE             0.01900    (0.011,0.027)
Directional FWE      0.01900    (0.011,0.027)
65
Concluding Remarks - Power
Need a bigger n
Like to avoid bigger n (see sequential, gatekeepers methods later)
Which definition?
Bonferroni and independence useful
Simulation useful – especially for the more complex methods that follow
66
Closed and Stepwise Testing Methods I: Standard P-Value Based Methods
If you want ...                              …then use
Estimates of effect sizes & error margins    Simultaneous Confidence Intervals
Confident inequalities                       Stepwise or closed tests: Holm’s Method, Hommel’s Method, Hochberg’s Method, Fisher Combination Method
Overall Test                                 F-test, O’Brien, etc.
67
Closed Testing Method(s)
Form the closure of the family by including all intersection hypotheses.
Test every member of the closed family by a (suitable) α-level test. (Here, α refers to comparison-wise error rate.)
A hypothesis can be rejected provided that its corresponding test is significant at level α and every other hypothesis in the family that implies it is rejected by its α-level test.
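The closure principle can be sketched in a few lines of Python by enumerating every intersection hypothesis. The sketch below assumes that every subset of the k hypotheses is a distinct node (no logical constraints) and uses the Bonferroni MinP test as the local test; both function names are hypothetical. The raw p-values are those of the four-endpoint example used throughout this part of the course.

```python
from itertools import combinations

def bonferroni_minp(pvals):
    """Local test for an intersection of m hypotheses: p* = m * min(p)."""
    return min(1.0, len(pvals) * min(pvals))

def closed_adjusted_pvalues(pvals, local_test=bonferroni_minp):
    """Adjusted p-value of H_j = largest local p-value over all
    intersection hypotheses that include H_j."""
    k = len(pvals)
    adjusted = [0.0] * k
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            p_star = local_test([pvals[j] for j in subset])
            for j in subset:
                adjusted[j] = max(adjusted[j], p_star)
    return adjusted

raw = [0.0121, 0.0142, 0.1986, 0.0191]   # endpoints 1 to 4
adjusted = closed_adjusted_pvalues(raw)
```

Passing a different `local_test` (for instance a Simes-type function) changes the closed procedure without changing the enumeration. The loop visits 2^k − 1 subsets, so this brute-force form is only practical for small k.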
68
Closed Testing – Multiple Endpoints
H0: δ1=δ2=δ3=δ4=0
H0: δ1=δ2=δ3=0   H0: δ1=δ2=δ4=0   H0: δ1=δ3=δ4=0   H0: δ2=δ3=δ4=0
H0: δ1=δ2=0   H0: δ1=δ3=0   H0: δ1=δ4=0   H0: δ2=δ3=0   H0: δ2=δ4=0   H0: δ3=δ4=0
H0: δ1=0 (p = 0.0121)   H0: δ2=0 (p = 0.0142)   H0: δ3=0 (p = 0.1986)   H0: δ4=0 (p = 0.0191)
Where δj = mean difference, treatment - control, endpoint j.
69
Closed Testing – Multiple Comparisons
μ1=μ2=μ3=μ4
μ1=μ2=μ3   μ1=μ2=μ4   μ1=μ3=μ4   μ2=μ3=μ4
μ1=μ2, μ3=μ4   μ1=μ3, μ2=μ4   μ1=μ4, μ2=μ3
μ1=μ2   μ1=μ3   μ1=μ4   μ2=μ3   μ2=μ4   μ3=μ4
Note: Logical implications imply that there are only 14 nodes, not 2^6 − 1 = 63 nodes.
70
Control of FWE with Closed Tests
Suppose H0j1,..., H0jm all are true (unknown to you which ones).
{Reject at least one of H0j1,..., H0jm using CTP} ⊆ {Reject H0j1 ∩ ... ∩ H0jm}
Thus, P(reject at least one of H0j1,..., H0jm | H0j1,..., H0jm all are true)
≤ P(reject H0j1 ∩ ... ∩ H0jm | H0j1,..., H0jm all are true) = α
71
Examples of Closed Testing Methods
When the Composite Test is…     Then the Closed Method is …
Bonferroni MinP                 Holm’s Method
Resampling-Based MinP           Westfall-Young method
Simes                           Hommel’s method
O’Brien                         Lehmacher’s method
Simple or weighted test         Fixed sequence test (a-priori ordered)
…                               …
72
P-value Based Methods
Test global hypotheses using p-value combination tests
Benefit – Fewer model assumptions: only need to say that the p-values are valid
Allows for models other than homoscedastic normal linear models (like survival analysis).
73
Holm’s Method is Closed Testing Using the Bonferroni MinP Test
Reject H0j1 ∩ H0j2 ∩ ... ∩ H0jm if
Min(p0j1, p0j2, ..., p0jm) ≤ α/m.
Or, Reject H0j1 ∩ H0j2 ∩ ... ∩ H0jm if
p* = m × Min(p0j1, p0j2, ..., p0jm) ≤ α
(Note that p* is a valid p-value for the joint null, comparable to p-value for Hotelling’s T² test.)
74
Holm’s Stepdown Method
H0: δ1=δ2=δ3=δ4=0   minp=0.0121   p*=0.0484
H0: δ1=δ2=δ3=0   minp=0.0121   p*=0.0363
H0: δ1=δ2=δ4=0   minp=0.0121   p*=0.0363
H0: δ1=δ3=δ4=0   minp=0.0121   p*=0.0363
H0: δ2=δ3=δ4=0   minp=0.0142   p*=0.0426
H0: δ1=δ2=0   minp=0.0121   p*=0.0242
H0: δ1=δ3=0   minp=0.0121   p*=0.0242
H0: δ1=δ4=0   minp=0.0121   p*=0.0242
H0: δ2=δ3=0   minp=0.0142   p*=0.0284
H0: δ2=δ4=0   minp=0.0142   p*=0.0284
H0: δ3=δ4=0   minp=0.0191   p*=0.0382
H0: δ1=0   p = 0.0121
H0: δ2=0   p = 0.0142
H0: δ3=0   p = 0.1986
H0: δ4=0   p = 0.0191
Where δj = mean difference, treatment - control, endpoint j.
75
Shortcut For Holm’s Method
Let H(1),…,H(k) be the hypotheses corresponding to p(1) ≤ … ≤ p(k)
– If p(1) ≤ α/k, reject H(1) and continue, else stop and retain all H(1),…,H(k).
– If p(2) ≤ α/(k-1), reject H(2) and continue, else stop and retain all H(2),…,H(k).
– …
– If p(k) ≤ α, reject H(k)
Adjusted p-values for Closed Tests
The adjusted p-value for H0j is the maximum of all p-values over all relevant nodes
In the previous example,
pA(1)=0.0484,pA(2)=0.0484, pA(3)=0.0484, pA(4)=0.1986.
General formula for Holm: pA(j) = max_{i≤j} (k−i+1)p(i).
77
Worksheet For Holm’s Method
Holm-Bonferroni Worksheet
Ordered test number   Unadjusted p-value   Critical Value   (k-i+1) * p-value   Adjusted p-value
1                     0.0121               0.0125           0.0484              0.0484
2                     0.0142               0.0166667        0.0426              0.0484
3                     0.0191               0.025            0.0382              0.0484
4                     0.1986               0.05             0.1986              0.1986
78
Simes’ Test for Global Hypotheses
Uses all p-values p1, p2, …, pm, not just the MinP
Simes’ test rejects H01 ∩ H02 ∩ ... ∩ H0m if p(j) ≤ jα/m for at least one j.
p-value for the joint test is p* = min {(m/j)p(j)}
p* is never larger than m × MinP
Type I error at most α under independence or positive dependence of p-values
Rejection Regions
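[Figure: rejection regions of the Simes and Bonferroni tests in the unit square, with p1 on the horizontal axis and p2 on the vertical axis, each running from 0 to 1.]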
P(Simes Reject) = 1 – (1 P(Bonferroni Reject ) = 1 – (1
80
Hommel’s Method (Closed Simes)
H0: δ1=δ2=δ3=δ4=0   p*=0.0255
H0: δ1=δ2=δ3=0   p*=0.0213
H0: δ1=δ2=δ4=0   p*=0.0191
H0: δ1=δ3=δ4=0   p*=0.0287
H0: δ2=δ3=δ4=0   p*=0.0287
H0: δ1=δ2=0   p*=0.0142
H0: δ1=δ3=0   p*=0.0242
H0: δ1=δ4=0   p*=0.0191
H0: δ2=δ3=0   p*=0.0284
H0: δ2=δ4=0   p*=0.0191
H0: δ3=δ4=0   p*=0.0382
H0: δ1=0   p = 0.0121
H0: δ2=0   p = 0.0142
H0: δ3=0   p = 0.1986
H0: δ4=0   p = 0.0191
Where δj = mean difference, treatment - control, endpoint j.
81
Adjusted P-values for Hommel’s Method
Again, take the maximum p-value over all hypotheses that imply the given one.
In the previous example, the Hommel adjusted p-values are pA(1)=0.0287, pA(2)=0.0287, pA(3)=0.0382, pA(4)=0.1986.
These adjusted p-values are never larger than the Holm step-down adjusted p-values.
82
Adjusted P-values for Hommel’s Method
They are maxima over relevant nodes
In example, Hommel adjusted p-values are pA(1)=0.0287, pA(2)=0.0287, pA(3)=0.0382, pA(4)=0.1986.
{Hommel adjusted p-value} ≤ {Holm adjusted p-value}
83
Hochberg’s Method
A conservative but simpler approximation to Hommel’s method
{Hommel adjusted p-value}
≤ {Hochberg adjusted p-value}
≤ {Holm adjusted p-value}
84
Hochberg’s Shortcut Method
Let H(1),…,H(k) be the hypotheses corresponding to p(1) ≤ … ≤ p(k)
– If p(k) ≤ α, reject all H(j) and stop, else retain H(k) and continue.
– If p(k-1) ≤ α/2, reject H(1) … H(k-1) and stop, else retain H(k-1) and continue.
– …
– If p(1) ≤ α/k, reject H(1)
Adjusted p-values are pA(j) = min_{i≥j} (k−i+1)p(i).
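The step-up rule and its adjusted p-value formula can be sketched in Python as a running minimum taken from the largest p-value downward. The function name is hypothetical; the input is again the four-endpoint example.

```python
def hochberg_adjusted(pvals):
    """Hochberg step-up adjusted p-values, returned in the original order."""
    k = len(pvals)
    order = sorted(range(k), key=lambda j: pvals[j], reverse=True)  # largest first
    adjusted = [0.0] * k
    running_min = 1.0
    for step, j in enumerate(order):
        # The i-th smallest p-value gets multiplier k - i + 1, so the largest
        # p-value has multiplier 1, the next largest 2, and so on.
        running_min = min(running_min, (step + 1) * pvals[j])
        adjusted[j] = running_min
    return adjusted

raw = [0.0121, 0.0142, 0.1986, 0.0191]
hochberg = hochberg_adjusted(raw)
```

Compared with the Holm step-down values for the same data, the only change is that the running maximum from the small end is replaced by a running minimum from the large end.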
85
Worksheet for Hochberg’s Method
Hochberg Adjusted P-Value Worksheet
Ordered test number   Unadjusted p-value   Critical Value   (k-i+1) * p-value   Adjusted p-value
1                     0.0121               0.0125           0.0484              0.0382
2                     0.0142               0.0166667        0.0426              0.0382
3                     0.0191               0.025            0.0382              0.0382
4                     0.1986               0.05             0.1986              0.1986
86
Comparison of Adjusted P-Values
p-Values
Test   Raw      Stepdown Bonferroni   Hochberg   Hommel
1      0.0121   0.0484                0.0382     0.0286
2      0.0142   0.0484                0.0382     0.0286
3      0.1986   0.1986                0.1986     0.1986
4      0.0191   0.0484                0.0382     0.0382
87
Fisher Combination Test for Independent p-Values
Reject H01 ∩ H02 ∩ ... ∩ H0m if
−2 Σ ln(pi) > χ²(1−α, 2m)
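Because the reference distribution has an even number of degrees of freedom (2m), its upper tail probability has a closed form, and the combined p-value can be computed without a statistics library. The Python sketch below uses that identity; the function name is hypothetical, and independence of the p-values is assumed, as the slide title requires.

```python
import math

def fisher_combination_pvalue(pvals):
    """p-value of Fisher's combination test for m independent p-values.

    The statistic X = -2 * sum(ln p_i) is chi-square with 2m df under the
    joint null; for even df the survival function is
    exp(-X/2) * sum_{i=0}^{m-1} (X/2)^i / i!.
    """
    m = len(pvals)
    half_x = -sum(math.log(p) for p in pvals)   # X / 2
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total

p_two = fisher_combination_pvalue([0.5, 0.5])
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the formula.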
88
Example: Non-Overlapping Subgroup* p-values
The Multtest Procedure
p-Values
Test   Raw      Stepdown Bonferroni   Hochberg   Hommel   Fisher Combination
1      0.0784   0.3918                0.1550     0.1550   0.0784
2      0.0480   0.2883                0.1550     0.1441   0.0480
3      0.0041   0.0325                0.0305     0.0285   0.0053
4      0.0794   0.3918                0.1550     0.1550   0.0794
5      0.0044   0.0325                0.0305     0.0305   0.0056
6      0.0873   0.3918                0.1550     0.1550   0.0873
7      0.1007   0.3918                0.1550     0.1550   0.1007
8      0.1550   0.3918                0.1550     0.1550   0.1550
*Non-overlapping is required by the independence assumption.
89
Power Comparison
Liptak test stat: T = Σ Φ⁻¹(pi) = Σ Zi
90
Concluding Notes
Closed testing more powerful than single-step (α/m rather than α/k).
P-value based methods can be used whenever p-values are valid
Dependence issues:
MinP (Holm) conservative
Simes (Hommel, Hochberg) less conservative, rarely anti-conservative
Fisher combination, Liptak require independence
91
Closed and Stepwise Testing Methods II: Fixed Sequences and Gatekeepers
Methods Covered:
• Fixed Sequences (hierarchical endpoints, dose response, non-inferiority/superiority)
• Gatekeepers (primary and secondary analyses)
• Multiple Gatekeepers (multiple endpoints & multiple doses)
• Intersection-Union tests*
* Doesn’t really belong in this section
92
Fixed Sequence Tests
Pre-specify H1, H2, …, Hk, and test in this sequence, stopping as soon as you fail to reject.
No α-adjustment is necessary for individual tests.
Applications:
Dose response: High vs. Control, then Mid vs. Control, then Low vs. Control
Primary endpoint, then Secondary endpoint
93
Fixed Sequence as a Closed Procedure
H123: Rej if p1 ≤ .05
H12: Rej if p1 ≤ .05   H13: Rej if p1 ≤ .05   H23: Rej if p2 ≤ .05
H1: Rej if p1 ≤ .05   H2: Rej if p2 ≤ .05   H3: Rej if p3 ≤ .05
• Rej H1 if p1 ≤ .05
• Rej H2 if p1 ≤ .05 and p2 ≤ .05
• Rej H3 if p1 ≤ .05 and p2 ≤ .05 and p3 ≤ .05
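The resulting decision rule is short enough to state as code. The Python sketch below tests the hypotheses in their pre-specified order at the unadjusted level and stops at the first non-rejection; the function name and the example p-values are hypothetical.

```python
def fixed_sequence(pvals_in_order, alpha=0.05):
    """Fixed sequence test: reject H1, H2, ... in the pre-specified order,
    each at level alpha, until the first p-value exceeds alpha."""
    decisions = []
    still_testing = True
    for p in pvals_in_order:
        still_testing = still_testing and p <= alpha
        decisions.append(still_testing)
    return decisions

# Hypothetical p-values for High, Mid, Low dose versus control
decisions = fixed_sequence([0.010, 0.200, 0.001])
```

Note that the third hypothesis is retained despite its small p-value, because testing stopped at the second one; that is the price of using no α-adjustment.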
94
A Seemingly Reasonable But Incorrect Protocol
1. Test Dose 2 vs Pbo, and Dose 3 vs Pbo using the Bonferroni method (0.025 level).
2. Test Dose 1 vs Pbo at the unadjusted 0.05 level only if at least one of the first two tests is significant at the 0.025 level.
95
The problem: FWE 0.075
[Figure: Lower, Mean, and Upper values plotted for the Pbo, Low, Mid, and High groups.]
Moral: Caution needed when there are multiple hypotheses at some point in the sequence.
96
Correcting the Incorrect Protocol: Use Closure
H0LMH
P23 < .05
H0LM P12<.05
H0LH P13 < .05
H0MH P23 < .05
H0H P3 < .05
H0M P2 < .05
H0L P1 < .05
Where pij = 2min(pi,pj)
97
References – Fixed Sequence and Gatekeeper Tests
1. Bauer, P (1991) Multiple Testing in Clinical Trials, Statistics in Medicine, 10, 871-890.
2. O’Neill RT. (1997) Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Controlled Clinical Trials; 18:550–556.
3. D’Agostino RB. (2000) Controlling alpha in clinical trials: the case for secondary endpoints. Statistics in Medicine; 19:763–766.
4. Chi GYH. (1998) Multiple testings: multiple comparisons and multiple endpoints. Drug Information Journal 32:1347S–1362S.
5. Bauer P, Röhmel J, Maurer W, Hothorn L. (1998) Testing strategies in multi-dose experiments including active control. Statistics in Medicine; 17:2133 –2146.
6. Westfall, P.H. and Krishen, A. (2001). Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures, Journal of Statistical Planning and Inference 99, 25-40.
7. Chi, G. “Clinical Benefits, Decision Rules, and Multiple Inferences,” http://www.fda.gov/cder/Offices/Biostatistics/chi_1/sld001.htm
8. Dmitrienko, A, Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all effects to be significant. Stat Med. 22: 2387-2400.
9. Chen X, Luo X, Capizzi T. (2005) The application of enhanced parallel gatekeeping strategies. Stat Med. 24:1385-97.
10. Alex Dmitrienko, Geert Molenberghs, Christy Chuang-Stein, and Walter Offen (2005), Analysis of Clinical Trials Using SAS: A Practical Guide, SAS Press.
11. Wiens, B, and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family of hypotheses. J Biopharm Stat.15(6):929-42.
12. Dmitrienko, A., Wiens, B. and Westfall, P. (2006). Fallback Tests in Dose Response Clinical Trials, J Biopharm Stat, 16, 745-755.
98
Intersection-Union (IU) Tests
Union-Intersection (UI): Nulls are intersections, alternatives are unions.
H0: {θ1=0 and θ2=0} vs. H1: {θ1≠0 or θ2≠0}
Intersection-Union (IU): Nulls are unions, alternatives are intersections.
H0: {θ1=0 or θ2=0} vs. H1: {θ1≠0 and θ2≠0}
IU is NOT a closed procedure. It is just a single test of a different kind of null hypothesis.
99
Applications of I-U
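Bioequivalence: The “TOST” test (θ = treatment difference, θ0 = equivalence margin):
Test 1. H01: θ ≤ −θ0 vs. HA1: θ > −θ0
Test 2. H02: θ ≥ θ0 vs. HA2: θ < θ0
Can test both at α=.05, but must reject both.
Combination Therapy (μ12 = mean for the combination; μ1, μ2 = means for the components):
Test 1. H01: μ12 ≤ μ1 vs. HA1: μ12 > μ1
Test 2. H02: μ12 ≤ μ2 vs. HA2: μ12 > μ2
Can test both at α=.05, but must reject both.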
100
Control of Type I Error for IU tests
Suppose H0 is true because θ1=0 (θ2 may or may not be 0). Then
P(Type I error) = P(Reject H0) (1)
= P(p1 ≤ .05 and p2 ≤ .05) (2)
≤ min{P(p1 ≤ .05), P(p2 ≤ .05)} (3)
≤ .05. (4)
Note: The inequality at (3) becomes an approximate equality when p2 is extremely noncentral.
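Because both component tests must reject, the IU decision depends only on the larger of the two p-values. A minimal Python sketch, with a hypothetical function name and made-up p-values:

```python
def intersection_union_test(p_values, alpha=0.05):
    """Reject the union null only if every component test rejects at level alpha.

    This is the same as comparing the largest component p-value with alpha,
    so max(p_values) serves as the p-value of the IU test.
    """
    iu_p = max(p_values)
    return iu_p <= alpha, iu_p


# Hypothetical component p-values (e.g. the two one-sided TOST tests)
reject_both, p_iu = intersection_union_test([0.012, 0.038])
reject_one, _ = intersection_union_test([0.012, 0.210])
```

No multiplicity adjustment appears anywhere: each component is tested at the full α, which is what steps (1) to (4) justify.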
101
Concluding Notes: Fixed Sequences and Gatekeepers
• Many times, no adjustment is necessary at all!
• Other times you can gain power by specifying gatekeeping sequences
• However, you must clearly state the method and follow the rules
• There are many “incorrect” no adjustment methods - use caution
102
Closed and Stepwise Testing Methods III: Methods that Use Logical Constraints and Correlations
Method: Application
Lehmacher et al.: Multiple endpoints
Westfall-Tobias-Shaffer-Royen: General contrasts
103
Lehmacher et al. Method
• Use O’Brien test at each node (incorporates correlations)
• Do closed testing
Note: Possibly no adjustment whatsoever; possibly big adjustment
104
Calculations for Lehmacher’s Method
/* Standardize each endpoint to mean 0 and standard deviation 1 */
proc standard data=research.multend1 mean=0 std=1 out=stdzd;
   var Endpoint1-Endpoint4;
run;

/* Sum of standardized endpoints for every node of the closure tree */
data combine;
   set stdzd;
   H1234 = Endpoint1+Endpoint2+Endpoint3+Endpoint4;
   H123  = Endpoint1+Endpoint2+Endpoint3;
   H124  = Endpoint1+Endpoint2+Endpoint4;
   H134  = Endpoint1+Endpoint3+Endpoint4;
   H234  = Endpoint2+Endpoint3+Endpoint4;
   H12   = Endpoint1+Endpoint2;
   H13   = Endpoint1+Endpoint3;
   H14   = Endpoint1+Endpoint4;
   H23   = Endpoint2+Endpoint3;
   H24   = Endpoint2+Endpoint4;
   H34   = Endpoint3+Endpoint4;
   H1    = Endpoint1;
   H2    = Endpoint2;
   H3    = Endpoint3;
   H4    = Endpoint4;
run;

/* Two-sample t test at each node */
proc ttest;
   class treatment;
   var H1234 H123 H124 H134 H234 H12 H13 H14 H23 H24 H34 H1 H2 H3 H4;
   ods output ttests=ttests;
run;
105
Output For Lehmacher’s Method
Obs  Variable  Method  Variances  tValue   DF   Probt
  1  H1234     Pooled  Equal        2.69  109  0.0082
  3  H123      Pooled  Equal        2.59  109  0.0108
  5  H124      Pooled  Equal        3.03  109  0.0031
  7  H134      Pooled  Equal        2.36  109  0.0201
  9  H234      Pooled  Equal        2.51  109  0.0136
 11  H12       Pooled  Equal        3.03  109  0.0030
 13  H13       Pooled  Equal        2.12  109  0.0365
 15  H14       Pooled  Equal        2.68  109  0.0085
 17  H23       Pooled  Equal        2.22  109  0.0287
 19  H24       Pooled  Equal        2.88  109  0.0047
 21  H34       Pooled  Equal        2.03  109  0.0450
 23  H1        Pooled  Equal        2.55  109  0.0121
 25  H2        Pooled  Equal        2.49  109  0.0142
 27  H3        Pooled  Equal        1.29  109  0.1986
 29  H4        Pooled  Equal        2.38  109  0.0191
pA1 = max(0.0121, 0.0085, 0.0365, 0.0030, 0.0201, 0.0031, 0.0108, 0.0082) = 0.0365
pA2 = max(0.0142, 0.0047, 0.0287, 0.0030, 0.0136, 0.0031, 0.0108, 0.0082) = 0.0287
pA3 = max(0.1986, 0.0450, 0.0287, 0.0365, 0.0136, 0.0201, 0.0108, 0.0082) = 0.1986
pA4 = max(0.0191, 0.0450, 0.0047, 0.0085, 0.0136, 0.0201, 0.0031, 0.0082) = 0.0450
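The four maxima above follow one rule: the adjusted p-value for endpoint i is the largest node p-value among all nodes whose label contains i. A short Python sketch of that bookkeeping, using the Probt values from the output (the dictionary layout and function name are illustrative choices):

```python
# Node p-values (Probt) from the PROC TTEST output; key "124" means H124, etc.
node_p = {
    "1234": 0.0082,
    "123": 0.0108, "124": 0.0031, "134": 0.0201, "234": 0.0136,
    "12": 0.0030, "13": 0.0365, "14": 0.0085,
    "23": 0.0287, "24": 0.0047, "34": 0.0450,
    "1": 0.0121, "2": 0.0142, "3": 0.1986, "4": 0.0191,
}


def closed_adjusted_p(node_p):
    """Adjusted p-value for endpoint i = max p-value over all nodes containing i."""
    endpoints = sorted({e for name in node_p for e in name})
    return {e: max(p for name, p in node_p.items() if e in name) for e in endpoints}


adjusted = closed_adjusted_p(node_p)
```

Endpoints 1, 2, and 4 remain significant at 0.05 after adjustment; endpoint 3 does not.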
106
Free and Restricted Combinations
If truth of some null hypotheses logically forces other nulls to be true, the hypotheses are restricted.
Examples
• Multiple Endpoints, one test per endpoint - free
• All Pairwise Comparisons - restricted
107
Pairwise Comparisons, 3 Groups
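H0: μ1=μ2=μ3
H0: μ1=μ2 and μ1=μ3   H0: μ1=μ2 and μ2=μ3   H0: μ1=μ3 and μ2=μ3
H0: μ1=μ2   H0: μ1=μ3   H0: μ2=μ3
Note: The entire middle layer is not needed, because each of its hypotheses is logically the same as μ1=μ2=μ3. Fisher's protected LSD is valid!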
108
Pairwise Comparisons, 4 Groups
μ1=μ2=μ3=μ4
μ1=μ2=μ3   μ1=μ2=μ4   μ1=μ3=μ4   μ2=μ3=μ4   μ1=μ2, μ3=μ4   μ1=μ3, μ2=μ4   μ1=μ4, μ2=μ3
μ1=μ2   μ1=μ3   μ1=μ4   μ2=μ3   μ2=μ4   μ3=μ4
Note: Logical implications imply that there are only 14 nodes, not 2^6 − 1 = 63 nodes. Also, Fisher's protected LSD is not valid.
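The node count can be verified by brute force: form every non-empty subset of the pairwise equalities, apply transitivity, and count how many distinct hypotheses remain. The Python sketch below does this; the function name and the partition representation are illustrative choices, not part of the slides.

```python
from itertools import combinations


def distinct_closure_nodes(n_groups):
    """Count distinct intersection hypotheses formed from all pairwise
    equalities mu_i = mu_j among n_groups means."""
    pairs = list(combinations(range(n_groups), 2))
    nodes = set()
    for size in range(1, len(pairs) + 1):
        for subset in combinations(pairs, size):
            # Merge groups linked by an equality (transitivity), union-find style
            parent = list(range(n_groups))

            def find(x):
                while parent[x] != x:
                    x = parent[x]
                return x

            for a, b in subset:
                parent[find(a)] = find(b)
            # Subsets that induce the same partition state the same hypothesis
            blocks = {}
            for g in range(n_groups):
                blocks.setdefault(find(g), []).append(g)
            nodes.add(frozenset(frozenset(b) for b in blocks.values()))
    return len(nodes)


nodes_3 = distinct_closure_nodes(3)
nodes_4 = distinct_closure_nodes(4)
```

For three groups only 4 distinct nodes remain (the global null plus the three pairs), which is why the middle layer disappears; for four groups the 63 subsets collapse to 14.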
109
Restricted Combinations Multipliers (Shaffer* Method 1; Modified Holm)
*Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. JASA 81, 826—831.
110
Shaffer’s (1) Adjusted p-values
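Multipliers at step j, by number of treatments t

  j   t=3  t=4  t=5  t=6  t=7  t=8  t=9
  1     3    6   10   15   21   28   36
  2     1    3    6   10   15   21   28
  3     1    3    6   10   15   21   28
  4          3    6   10   15   21   28
  5          2    6   10   15   21   28
  6          1    4   10   15   21   28
  7               4    7   15   21   28
  8               3    7   11   21   28
  9               2    7   11   16   28
 10               1    6   11   16   22
 11                    4   11   16   22
 12                    4   10   16   22
 13                    3    9   16   22
 14                    2    7   15   22
 15                    1    7   13   22
 16                         6   13   21
 17                         5   12   18
 18                         4   11   18
 19                         3   10   18
 20                         2    9   16
 21                         1    8   16
 22                              7   15
 23                              6   13
 24                              5   13
 25                              4   12
 26                              3   11
 27                              2   10
 28                              1    9
 29                                   8
 30                                   7
 31                                   6
 32                                   5
 33                                   4
 34                                   3
 35                                   2
 36                                   1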
                      Critical   Shaffer
RawP     Multiplier   Value      Adjusted p Values
0.3021   1            0.05       0.3998*
0.0435   3            0.016667   0.1305
0.0002   6            0.008333   0.0012
0.1999   2            0.025      0.3998
0.0109   3            0.016667   0.0327
0.0088   3            0.016667   0.0264
* Monotonicity enforced
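The adjusted p-values in this table can be reproduced by sorting the raw p-values, multiplying the j-th smallest by the j-th multiplier, and taking a running maximum. A Python sketch follows; the multipliers 6, 3, 3, 3, 2, 1 are the ones shown in the Multiplier column (four treatments), while the function name and the cap at 1 are illustrative assumptions.

```python
def shaffer_adjusted(raw_p, multipliers):
    """Step-down adjusted p-values: the j-th smallest raw p-value is multiplied
    by the j-th multiplier, then monotonicity is enforced with a running maximum."""
    order = sorted(range(len(raw_p)), key=lambda i: raw_p[i])
    adjusted = [0.0] * len(raw_p)
    running_max = 0.0
    for step, idx in enumerate(order):
        running_max = max(running_max, min(1.0, multipliers[step] * raw_p[idx]))
        adjusted[idx] = running_max
    return adjusted


raw = [0.3021, 0.0435, 0.0002, 0.1999, 0.0109, 0.0088]   # RawP column, in table order
s1_four_groups = [6, 3, 3, 3, 2, 1]                      # multipliers for steps 1..6
adj = [round(p, 4) for p in shaffer_adjusted(raw, s1_four_groups)]
```

An ordinary Holm adjustment would use multipliers 6, 5, 4, 3, 2, 1; the smaller values at steps 2 and 3 are where the logical restrictions buy power.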
111
Westfall/Tobias/Shaffer/Royen* Method
• Uses actual distribution of MinP instead of conservative Bonferroni approximation
• Closed testing incorporating logical constraints
• Hard-coded in PROC GLIMMIX
• Allows arbitrary linear functions
*Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method, Journal of the American Statistical Association 102: 487-494.
112
Application of Truncated Closed MinP to Subgroup Analysis
Compare Treatment with control as follows:
• Overall
• In the Older Patients subgroup
• In the Younger Patients subgroup
• In patients with better initial health subgroup
• In patients with poorer initial health subgroup
• In each of the four (old/young)x(better/poorer) subgroups
• 9 tests overall (but better viewed as 1 gatekeeper + 8 follow-up tests)
113
Analysis File
ods output estimates=estimates_logicaltests;
proc glimmix data=research.respiratory;
   class Treatment AgeGroup InitHealth;
   model score = Treatment AgeGroup InitHealth
                 Treatment*AgeGroup Treatment*InitHealth AgeGroup*InitHealth;
   estimate
      "Overall"        treatment 4 -4 treatment*Agegroup 2 2 -2 -2 treatment*InitHealth 2 2 -2 -2 (divisor=4),
      "Older"          treatment 2 -2 treatment*Agegroup 2 0 -2 0  treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "Younger"        treatment 2 -2 treatment*Agegroup 0 2 0 -2  treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "GoodInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1 treatment*InitHealth 2 0 -2 0  (divisor=2),
      "PoorInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1 treatment*InitHealth 0 2 0 -2  (divisor=2),
      "OldGood"        treatment 1 -1 treatment*Agegroup 1 0 -1 0  treatment*InitHealth 1 0 -1 0,
      "OldPoor"        treatment 1 -1 treatment*Agegroup 1 0 -1 0  treatment*InitHealth 0 1 0 -1,
      "YoungGood"      treatment 1 -1 treatment*Agegroup 0 1 0 -1  treatment*InitHealth 1 0 -1 0,
      "YoungPoor"      treatment 1 -1 treatment*Agegroup 0 1 0 -1  treatment*InitHealth 0 1 0 -1
      / adjust=simulate(nsamp=10000000 report seed=12321) upper stepdown(type=logical report);
run;

proc print data=estimates_logicaltests noobs;
   title "Subgroup Analysis Results – Truncated Closure";
   var label estimate Stderr tvalue probt Adjp;
run;
114
Results – Truncated Closure Subgroup Analysis Results
Label            Estimate  StdErr  tValue   Probt  adjp_logical  adjp_interval
Overall            0.7075  0.1956    3.62  0.0002        0.0011         0.0015
Older              0.9952  0.2673    3.72  0.0002        0.0011         0.0011
Younger            0.4197  0.2847    1.47  0.0717        0.1049         0.2605
GoodInitHealth     0.5871  0.2878    2.04  0.0219        0.0432         0.0984
PoorInitHealth     0.8279  0.2644    3.13  0.0011        0.0023         0.0068
OldGood            0.8748  0.3387    2.58  0.0056        0.0124         0.0295
OldPoor            1.1157  0.3231    3.45  0.0004        0.0011         0.0026
YoungGood          0.2993  0.3562    0.84  0.2014        0.2014         0.5494
YoungPoor          0.5401  0.3338    1.62  0.0544        0.1049         0.2091
The adjusted p-values for the stepdown tests are mathematically no larger than those of the simultaneous interval-based tests.
115
Example: Stepwise Pairwise vs. Control Testing
Teratology data set
• Observations are litters
• Response variable = litter weight
• Treatments: 0, 5, 50, 500
• Covariates: Litter size, Gestation time
116
Analysis File
proc glimmix data=research.litter;
   class dose;
   model weight = dose gesttime number;
   estimate "5 vs 0"   dose -1 1 0 0,
            "50 vs 0"  dose -1 0 1 0,
            "500 vs 0" dose -1 0 0 1
            / adjust=simulate(nsample=10000000 report) stepdown(type=logical);
run; quit;
117
Results
Estimates with Simulated Adjustment
                      Standard
Label      Estimate   Error     DF   t Value   Pr > |t|   Adj P
5 vs 0      -3.3524   1.2908    68     -2.60     0.0115   0.0316
50 vs 0     -2.2909   1.3384    68     -1.71     0.0915   0.0915
500 vs 0    -2.6752   1.3343    68     -2.00     0.0490   0.0907
Note: 50-0 and 500-0 not significant at .10 with regular Dunnett
118
Concluding Notes:
• More power is available when combinations are restricted.
• Power of closed tests can be improved using correlation and other distributional characteristics.
119
Nonparametric Multiple Testing Methods
Overview: Use nonparametric tests at each node of the closure tree
• Bootstrap tests
• Rank-based tests
• Tests for binary data
120
Bootstrap MinP Test (Semi-Parametric Test)
The composite hypothesis H1∩H2∩…∩Hk may be tested using the p-value
p* = P(MinP ≤ minp | H1∩H2∩…∩Hk)
Westfall and Young (1993) show
• how to obtain p* by bootstrapping the residuals in a multivariate regression model.
• how to obtain all p*’s in the closure tree efficiently.
121
Multivariate Regression Model (Next Five slides are from Westfall and Young, 1993)
122
Hypotheses and Test Statistics
123
Joint Distribution of the Test Statistics
124
Testing Subset Intersection Hypotheses Using the Extreme Pivotals
125
Exact Calculation of pK
Bootstrap Approximation:
126
Bootstrap Tests (PROC MULTTEST)
H0: θ1=θ2=θ3=θ4=0: min p = .0121, p* = .0379
H0: θ1=θ2=θ3=0: min p = .0121, p* < .0379
H0: θ1=θ2=θ4=0: min p = .0121, p* < .0379
H0: θ1=θ3=θ4=0: min p = .0121, p* < .0379
H0: θ2=θ3=θ4=0: min p = .0142, p* = .0351
H0: θ1=θ2=0: min p = .0121, p* < .0379
H0: θ1=θ3=0: min p = .0121, p* < .0379
H0: θ1=θ4=0: min p = .0121, p* < .0379
H0: θ2=θ3=0: min p = .0142, p* < .0351
H0: θ2=θ4=0: min p = .0142, p* < .0351
H0: θ3=θ4=0: min p = .0191, p* = .0355
H0: θ1=0: p = 0.0121, p* < .0379
H0: θ2=0: p = 0.0142, p* < .0351
H0: θ3=0: p = 0.1986, p* = .1991
H0: θ4=0: p = 0.0191, p* < .0355
p* = P(MinP ≤ min p | H0) (computed using bootstrap resampling)
(Recall, for Bonferroni, p* = k(MinP))
127
Permutation Tests for Composite Hypotheses H0K
Joint p-value = proportion of the n!/(nT!nC!) permutations for which min_{i∈K} Pi* ≤ min_{i∈K} pi.
128
Problem; Simplification
Problem: There are 2^k − 1 subsets K to be tested.
This might take a while...
Simplification: You need only test k of the 2^k − 1 subsets!
Why? Because
P(min_{i∈K} Pi* ≤ c) ≤ P(min_{i∈K’} Pi* ≤ c) when K ⊆ K’.
Significance for most lower order subsets is determined by significance of higher order subsets.
129
MULTTEST PROCEDURE
Tests only the needed subsets (k, not 2^k − 1).
Samples from the permutation distribution.
Only one sample is needed, not k distinct samples, if the joint distribution of minP is identical under HK and HS.
(Called the “subset pivotality” condition by Westfall and Young, 1993; valid under location shift and other models)
130
Great Savings are Possible with Exact Permutation Tests!
Why?
Suppose you test H12…k using MinP. The joint p-value is
p* = P(MinP ≤ minp) ≤ P(P1 ≤ minp) + P(P2 ≤ minp) + … + P(Pk ≤ minp)
Many summands can be zero, others much less than minp.
131
Multiple Binary Adverse Events
                                Stepdown     Stepdown
Variable   Contrast   Raw       Bonferroni   Permutation
ae1        t vs c     0.0008    0.0025       0.0020
ae2        t vs c     0.6955    1.0000       1.0000
ae3        t vs c     0.5000    1.0000       1.0000
ae4        t vs c     0.7525    1.0000       1.0000
ae5        t vs c     0.2213    1.0000       0.6274
ae6        t vs c     0.0601    0.3321       0.2608
ae7        t vs c     0.8165    1.0000       1.0000
ae8        t vs c     0.0293    0.1587       0.1328
ae9        t vs c     0.9399    1.0000       1.0000
ae10       t vs c     0.2484    1.0000       0.9273
ae11       t vs c     1.0000    1.0000       1.0000
ae12       t vs c     1.0000    1.0000       1.0000
ae13       t vs c     1.0000    1.0000       1.0000
ae14       t vs c     1.0000    1.0000       1.0000
ae15       t vs c     0.2484    1.0000       0.9273
ae16       t vs c     0.7516    1.0000       1.0000
ae17       t vs c     1.0000    1.0000       1.0000
ae18       t vs c     1.0000    1.0000       1.0000
ae19       t vs c     1.0000    1.0000       1.0000
ae20       t vs c     0.5000    1.0000       1.0000
ae21       t vs c     0.7516    1.0000       1.0000
ae22       t vs c     1.0000    1.0000       1.0000
ae23       t vs c     0.5000    1.0000       1.0000
ae24       t vs c     1.0000    1.0000       1.0000
ae25       t vs c     1.0000    1.0000       1.0000
ae26       t vs c     1.0000    1.0000       1.0000
ae27       t vs c     1.0000    1.0000       1.0000
ae28       t vs c     0.4344    1.0000       0.9400
132
Example: Genetic Associations
Phenotype: 0/1 (diseased or not).
Sample n1 from diseased, n2 from not diseased.
Compare 100’s of genotype frequencies (using dominant and recessive codings) for diseased and non-diseased using multiple Fisher exact tests.
133
PROC MULTTEST Code
proc multtest data=research.gen stepperm n=20000 out=pval hommel fdr;
   class y;
   test fisher(d1-d100 r1-r100);
   contrast "dis v nondis" -1 1;
run;

proc sort data=pval;
   by raw_p;
run;

/* fdr_p added to the VAR list so the listing matches the printed results */
proc print data=pval;
   var _var_ raw_p stppermp hom_p fdr_p;
   where raw_p < .05;
run;
134
Results from PROC MULTTEST
Obs   _var_   raw_p      stppermp   hom_p     fdr_p
1     r100    0.000000   0.0000     0.00000   0.00000
2     r30     0.000000   0.0000     0.00004   0.00002
3     d78     0.016130   0.7955     1.00000   0.57465
4     r55     0.018157   0.8220     1.00000   0.57465
5     d62     0.019220   0.8480     1.00000   0.57465
6     r64     0.019220   0.8480     1.00000   0.57465
7     r37     0.020113   0.8520     1.00000   0.57465
8     r33     0.040043   0.9860     1.00000   1.00000
135
Application - Gene Expression
Group 1: Acute Myeloid Leukemia (AML), n1=11
Group 2: Acute Lymphoblastic Leukemia (ALL), n2=27
Data:
OBS   TYPE   G1   G2   G3   …   G7000
1     AML    (Gene expression levels)
2     AML
…     …
11    AML
12    ALL
…     …
38    ALL
136
PROC MULTTEST code for exact* closed testing
proc multtest data=research.leuk noprint out=adjp holm fdr stepperm n=1000;
   class type;                  /* AML or ALL */
   test mean (gene1-gene7123);
   contrast 'AML vs ALL' -1 1;
run;

proc sort data=adjp(where=(raw_p le .0005));
   by raw_p;

proc print;
   var _var_ raw_p stpbon_p fdr_p stppermp;
run;
* modulo Monte Carlo error
137
PROC MULTTEST Output
(1 hour on 2.8 GhZ Xeon for 200,000 samples)
           Raw       Bonferroni  permutation  |             Raw       Bonferroni  permutation
Variable   p-value   p-value     p-value      |  Variable   p-value   p-value     p-value
GENE3320   1.4E-10   0.0000      0.0001       |  GENE2043   1.3E-06   0.0090      0.0095
GENE4847   2.4E-10   0.0000      0.0001       |  GENE2759   1.3E-06   0.0093      0.0097
GENE2020   6.6E-10   0.0000      0.0001       |  GENE6803   1.4E-06   0.0102      0.0104
GENE1745   1.0E-08   0.0001      0.0002       |  GENE1674   1.5E-06   0.0105      0.0106
GENE5039   1.0E-08   0.0001      0.0002       |  GENE2402   1.5E-06   0.0108      0.0109
GENE1834   1.5E-08   0.0001      0.0003       |  GENE2186   1.7E-06   0.0118      0.0118
GENE461    3.6E-08   0.0003      0.0005       |  GENE6376   2.1E-06   0.0149      0.0142
GENE4196   6.2E-08   0.0004      0.0007       |  GENE3605   2.6E-06   0.0181      0.0169
GENE3847   7.2E-08   0.0005      0.0008       |  GENE6806   2.6E-06   0.0184      0.0170
GENE2288   8.9E-08   0.0006      0.0010       |  GENE1829   2.7E-06   0.0194      0.0177
GENE1249   1.7E-07   0.0012      0.0017       |  GENE6797   3.0E-06   0.0214      0.0194
GENE6201   1.8E-07   0.0013      0.0017       |  GENE6677   3.4E-06   0.0244      0.0216
GENE2242   2.0E-07   0.0014      0.0019       |  GENE4052   3.7E-06   0.0263      0.0231
GENE3258   2.1E-07   0.0015      0.0020       |  GENE1394   4.9E-06   0.0350      0.0290
GENE1882   3.2E-07   0.0023      0.0029       |  GENE6405   5.4E-06   0.0380      0.0311
GENE2111   3.7E-07   0.0026      0.0033       |  GENE248    6.4E-06   0.0453      0.0359
GENE2121   5.8E-07   0.0041      0.0048       |  GENE2267   6.5E-06   0.0460      0.0364
GENE6200   6.2E-07   0.0044      0.0051       |  GENE6041   7.8E-06   0.0553      0.0429
GENE6373   8.2E-07   0.0058      0.0065       |  GENE6005   8.0E-06   0.0569      0.0439
GENE6539   1.1E-06   0.0080      0.0086       |  GENE5772   9.0E-06   0.0638      0.0480
138
Subset Pivotality, PROC MULTTEST
MULTTEST requires the “subset pivotality” condition, which identifies the cases where resampling under the global null is valid.
Valid cases:
• Multivariate Regression Model (location-shift).
• Multivariate permutation multiple comparisons, one test per variable, assuming model with exchangeable subsets.
Not valid with:
• Permutation multiple comparisons, within a variable, with three or more groups.
• Heteroscedasticity.
Closed testing “by hand” works regardless.
139
Summary: Nonparametric Closed Tests
• Nonparametric closed tests are simple, in principle.
• Robustness gains and power advantages are possible.
140
Further Topics: More Complex Situations for FWE Control
• Heteroscedasticity
• Repeated Measures
• Large Sample Methods
141
Heteroscedasticity in MCPs
Extreme Example:
data het;
   do g = 1 to 5;
      do rep = 1 to 10;
         input y @@;
         output;
      end;
   end;
datalines;
.01 .02 .08 .03 .04 .01 .08 .01 .01 .02
.11 .12 .13 .08 .03 .09 .11 .11 .13 .14
.21 .22 .23 .28 .23 .29 .21 .21 .23 .25
 42  76   .   .   .   .   .   .   .   .
 45  23  56   .   .   .   .   .   .   .
;
proc glm;
   class g;
   model y = g;
   lsmeans g / adjust=tukey pdiff;
run; quit;
142
Bad Results from Heteroscedastic Data
Least Squares Means for effect g
Pr > |t| for H0: LSMean(i)=LSMean(j)
Adjustment for Multiple Comparisons: Tukey-Kramer

i/j        1        2        3        4        5
1                1.0000   1.0000   <.0001   <.0001
2       1.0000            1.0000   <.0001   <.0001
3       1.0000   1.0000            <.0001   <.0001
4       <.0001   <.0001   <.0001            0.0290
5       <.0001   <.0001   <.0001   0.0290

Level of         --------------y--------------
g         N      Mean           Std Dev
1        10       0.0310000      0.0276687
2        10       0.1050000      0.0320590
3        10       0.2360000      0.0287518
4         2      59.0000000     24.0416306
5         3      41.3333333     16.8027775

RMSE = 6.17
143
Approximate Solution for Heteroscedasticity Problem
proc glimmix data=het;
   if (g > 3) then y2=y/20; else y2=y;   /* overcomes scaling problem */
   class g;
   model y2 = g / noint ddfm=satterth;
   random _residual_ / group=g;          /* separate residual variance per group */
   estimate '1 -2' g 1 -1  0   0   0,
            '1 -3' g 1  0 -1   0   0,
            '1 -4' g 1  0  0 -20   0,
            /* as in the original run: includes -1 on group 2;
               a pure 1 vs 5 contrast would be 1 0 0 0 -20 */
            '1 -5' g 1 -1  0   0 -20,
            '2 -3' g 0  1 -1   0   0,
            '2 -4' g 0  1  0 -20   0,
            '2 -5' g 0  1  0   0 -20,
            '3 -4' g 0  0  1 -20   0,
            '3 -5' g 0  0  1   0 -20,
            '4 -5' g 0  0  0  20 -20
            / adjust=simulate(nsamp=1000000) stepdown(type=logical) adjdfe=row;
run;
144
Heteroscedastic Results
Estimates with Simulated Adjustment

                    Standard
Label   Estimate    Error      DF      t Value   Pr > |t|   Adj P
1 -2     -0.07400    0.01339   17.62     -5.53     <.0001   0.0242
1 -3     -0.2050     0.01262   17.97    -16.25     <.0001   0.0011
1 -4    -58.9690    17.0000     1        -3.47     0.1787   0.2474
1 -5    -41.4073     9.7011     2        -4.27     0.0507   0.1374
2 -3     -0.1310     0.01362   17.79     -9.62     <.0001   0.0198
2 -4    -58.8950    17.0000     1        -3.46     0.1789   0.2474
2 -5    -41.2283     9.7011     2        -4.25     0.0512   0.1374
3 -4    -58.7640    17.0000     1        -3.46     0.1793   0.2474
3 -5    -41.0973     9.7011     2        -4.24     0.0515   0.1374
4 -5     17.6667    19.5732     1.669     0.90     0.4778   0.4778

Notes:
• Approximation 1: df’s
• Approximation 2: Covariance matrix involving all comparisons is approximate
• 1, 2, 3 different, 4-5 not. (sensible)
145
Repeated Measures and Multiple Comparisons
• Usually considered quite complicated (wave hands, use Bonferroni)
• PROC GLIMMIX provides a viable solution
• The method is approximate because of its df approximation, and because it treats estimated variance ratios as known.
146
Multiple Comparisons with Mixed Model
Crossover study: Dog heart rates
data Halothane;
   do Dog = 1 to 19;
      do Treatment = 'HA','LA','HP','LP';
         input Rate @@;
         output;
      end;
   end;
datalines;
426 609 556 600 253 236 392 395 359 433 349 357
432 431 522 600 405 426 513 513 324 438 507 539
310 312 410 456 326 326 350 504 375 447 547 548
286 286 403 422 349 382 473 497 429 410 488 547
348 377 447 514 412 473 472 446 347 326 455 468
434 458 637 524 364 367 432 469 420 395 508 531
397 556 645 625
;
H, L = CO2 High/Low
A, P = Halothane absent/present
Source: Johnson and Wichern, Applied Multivariate Statistical Analysis, 5th ed, Prentice Hall
147
GLIMMIX code for analyzing all pairwise comparisons, main effects, and interactions simultaneously
proc glimmix data=halothane order=data;
   class treatment dog;
   model rate = treatment / ddfm=satterth;
   random treatment / subject=dog type=chol v=1 vcorr=1;
   estimate 'HA - LA'     treatment 1 -1  0  0,
            'HA - HP'     treatment 1  0 -1  0,
            'HA - LP'     treatment 1  0  0 -1,
            'LA - HP'     treatment 0  1 -1  0,
            'LA - LP'     treatment 0  1  0 -1,
            'HP - LP'     treatment 0  0  1 -1,
            'Co2 '        treatment 1 -1  1 -1 (divisor=2),
            'Halothane'   treatment 1  1 -1 -1 (divisor=2),
            'Interaction' treatment 1 -1 -1  1
            / adjust=simulate(nsamp=1000000) stepdown(type=logical) adjdfe=row;
run;
148
Results
Estimates with Simulated Adjustment

                           Standard
Label         Estimate     Error      DF      t Value   Pr > |t|   Adj P
HA - LA        -36.4211    13.8522    17.63     -2.63     0.0172   0.0172
HA - HP       -111.05      14.1127    14.72     -7.87     <.0001   <.0001
HA - LP       -134.68      12.7899    14.66    -10.53     <.0001   <.0001
LA - HP        -74.6316    14.8794    17.94     -5.02     <.0001   0.0002
LA - LP        -98.2632    15.7467    17.99     -6.24     <.0001   <.0001
HP - LP        -23.6316    11.9884    16.22     -1.97     0.0660   0.0660
Co2            -30.0263     8.2683    17.82     -3.63     0.0019   0.0059
Halothane     -104.66      11.1412    15.27     -9.39     <.0001   <.0001
Interaction    -12.7895    19.9438    17.98     -0.64     0.5294   0.5294
149
Cure Rates Example: Multiple Comparisons of Odds

                  Diagnosis
Treatment   Complicated   Uncomplicated
A           73.6%         88.9%
B           90.2%         91.5%
C           59.6%         85.0%

Questions:
(1) Multiple comparisons of cure rates for the Treatments (3 comparisons)
(2) Comparison of cure rates for Complicated vs Uncomplicated Diagnosis.
150
Method
• Use the estimated parameter vector and associated estimate of covariance matrix from PROC GLIMMIX
• Treat the estimated (asymptotic) covariance matrix as known
• Simulate critical values and p-values (MinP-based) from the multivariate normal distribution instead of the Multivariate T distribution
• Controls FWE asymptotically under correct logit model
151
Results
Estimates with Simulated Adjustment
                          Standard
Label         Estimate    Error     DF      t Value   Pr > |t|   Adj P
A-B            -0.9760    0.3311    Infty     -2.95     0.0032   0.0032
A-C             0.5847    0.2641    Infty      2.21     0.0268   0.0268
B-C             1.5608    0.3160    Infty      4.94     <.0001   <.0001
Comp-Uncomp    -0.9616    0.2998    Infty     -3.21     0.0013   0.0027
152
Summary
• Classic, FWE-controlling MCPs that incorporate alternative covariance structures and non-normal distributions are easy using PROC GLIMMIX.
• However, be aware of approximations:
  – Plug-in variance/covariance estimates
  – df
153
Further Topics: False Discovery Rate
• FDR = E(proportion of rejections that are incorrect)
• Let R = total # of rejections
• Let V = # of erroneous rejections
• FDR = E(V/R) (0/0 defined as 0).
• FWE = P(V>0)
154
Example
30 independent tests:
• 20 null hypotheses are true, with pj ~ U(0,1)
• 10 extremely alternative, with pj = 0
Decision rule: Reject H0j if pj ≤ 0.05
Then:
CERj = P(reject H0j | H0j true) = 0.05.
FWE = P(reject one or more of the 20) = 1 − (.95)^20 = 0.64
FDR = E{V/(V+10)} where V ~ Bin(20, .05), so FDR = 0.084.
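Both numbers can be checked exactly rather than by simulation, since V is binomial and the 10 alternatives are always rejected. A short Python sketch of the arithmetic (variable names are illustrative):

```python
from math import comb

n_null, n_alt, alpha = 20, 10, 0.05

# FWE: probability of at least one false rejection among the 20 true nulls
fwe = 1 - (1 - alpha) ** n_null

# FDR: E[V / (V + 10)] with V ~ Binomial(20, 0.05)
fdr = sum(
    comb(n_null, v) * alpha**v * (1 - alpha) ** (n_null - v) * v / (v + n_alt)
    for v in range(n_null + 1)
)
```

The same unadjusted rule therefore has a comparison-wise error rate of 0.05, an FDR near 0.08, and an FWE near 0.64, which is the point of the example: the three error rates answer different questions.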
155
Benjamini and Hochberg’s FDR-Controlling Method
Let H(1), …, H(k) be the hypotheses corresponding to p(1) ≤ … ≤ p(k)
– If p(k) ≤ α, reject all H(j) and stop, else continue.
– If p(k-1) ≤ (k-1)α/k, reject H(1) … H(k-1) and stop, else continue.
– …
– If p(1) ≤ α/k, reject H(1)
Adjusted p-values: pA(j) = min_{i≥j} (k/i) p(i).
156
Comparison with Hochberg’s Method
• A step-up procedure, like Hochberg’s method
• Adjusted p-values are pA(j) = min_{i≥j} (k/i) p(i).
• Recall for Hochberg’s method, pA(j) = min_{i≥j} (k-i+1) p(i).
• FDR adjusted p-values are uniformly smaller since k/i ≤ k-i+1
• B-H FDR method uses Simes’ critical points.
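The two adjusted p-value formulas differ only in the multiplier applied to the i-th ordered p-value, so one step-up routine covers both. The Python sketch below is an illustration; the function names and the example p-values are hypothetical.

```python
def step_up_adjusted(p_values, multiplier):
    """Generic step-up adjustment: pA(j) = min over i >= j of multiplier(i, k) * p(i),
    where p(1) <= ... <= p(k) are the ordered p-values."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_min = 1.0
    for rank in range(k, 0, -1):                 # start from the largest p-value
        idx = order[rank - 1]
        running_min = min(running_min, multiplier(rank, k) * p_values[idx])
        adjusted[idx] = running_min
    return adjusted


def bh_adjusted(p_values):
    """Benjamini-Hochberg (FDR) adjusted p-values: multiplier k/i."""
    return step_up_adjusted(p_values, lambda i, k: k / i)


def hochberg_adjusted(p_values):
    """Hochberg (FWE) adjusted p-values: multiplier k-i+1."""
    return step_up_adjusted(p_values, lambda i, k: k - i + 1)


p = [0.001, 0.012, 0.021, 0.04, 0.30]            # hypothetical raw p-values
bh = [round(x, 4) for x in bh_adjusted(p)]
hoc = [round(x, 4) for x in hochberg_adjusted(p)]
```

At the 0.05 level B-H rejects four of these five hypotheses while Hochberg rejects two, illustrating that k/i never exceeds k-i+1.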
157
Critical Values – FDR vs FWE
[Figure: two panels plotting the critical value (vertical axis) against the ordered test (horizontal axis) for the Hochberg and B-H methods. Panel “Alpha Levels - 4000 Tests”: series Hoc4000 and BH4000. Panel “Alpha Levels” (30 tests): series Hoc and BH.]
158
Comments on FDR Control
• Considered better for large numbers of tests since FWE is inconsistent
• Is adaptive
• Has a loose Bayesian correspondence
• Easy to misinterpret the results: Given 10 FDR<.10 rejections in a given study, it is tempting to claim that only one can be in error (in an “average” sense).
However, this is incorrect, as E(V/R | R>0) > α.
159
Further Topics: Bayesian Methods
• Simultaneous Credible Intervals
• Probabilities of ranking
• Loss function approach
• Posterior probabilities of null hypotheses
160
Bayes/Frequentist Comparisons
161
Simultaneous Credible Intervals
• Create intervals Ii for θi, so that P(θi ∈ Ii, all i | Data) = .95
• Implementation in Westfall et al (1999) assumes
  – Variance components model (includes regular GLM and heteroscedastic GLM as special case)
  – Jeffreys’ priors on variances (vague)
  – Flat prior on means (also vague)
• Uses PROC MIXED to obtain sample (assume i.i.d) from posterior distribution
• Uses %BayesIntervals to obtain simultaneous credible intervals
162
Bayesian Simultaneous Conf. Band
Obs   _NAME_    Lower      Upper
  1   diff10   -6.51586   -1.68863
  2   diff15   -5.58315   -1.30912
  3   diff20   -4.67692   -0.91457
  4   diff25   -3.79373   -0.49875
  5   diff30   -2.94835   -0.02991
  6   diff35   -2.17939    0.49931
  7   diff40   -1.48987    1.09893
  8   diff45   -0.87996    1.79373
  9   diff50   -0.35969    2.57208
 10   diff55    0.10077    3.41950
 11   diff60    0.52638    4.30526
 12   diff65    0.91615    5.22872
 13   diff70    1.29532    6.16164
163
Bayes/Frequentist Correspondence
From Westfall, P.H. (2005). Comment on Benjamini and Yekutieli, ‘False Discovery Rate Adjusted Confidence Intervals for Selected Parameters,’ Journal of the American Statistical Association 100, 85-89.
164
Bayesian Probabilities for Rankings
Suppose you observe Ave1 > Ave2 > … > Avek.
What is the probability that μ1 > μ2 > … > μk ?
Bayesian Solution: Calculate proportion of posterior samples for which the ranking holds.
165
Results: Comparing Formulations
Solution for Fixed Effects

                                        Standard
Effect        formulation   Estimate    Error
formulation   A             12.0500     0.3152
formulation   B             11.0200     0.3152
formulation   C             10.2700     0.3152
formulation   D              9.2700     0.3152
formulation   E             12.1700     0.3152

The MEANS Procedure

Variable              N        Mean
rank_observed_means   500000   0.5554860
Mean5_best            500000   0.6051220
Mean1_best            500000   0.3936600
Mean2_best            500000   0.0012160
Mean3_best            500000   2E-6
Mean4_best            500000   0
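The "proportion of posterior samples" calculation can be imitated without SAS. The Python sketch below is a rough stand-in, not the method used for the slide: it assumes independent normal posteriors centered at the five estimates with the reported standard error 0.3152, whereas the slide's figures come from 500,000 posterior samples of the fitted model. The function name, seed, and number of draws are arbitrary.

```python
import random

means = {"A": 12.05, "B": 11.02, "C": 10.27, "D": 9.27, "E": 12.17}
std_err = 0.3152


def ranking_probabilities(means, std_err, n_draws=100_000, seed=12321):
    """Approximate posterior probability that each formulation has the largest
    mean, and that the observed ranking of all means is the true ranking."""
    rng = random.Random(seed)
    labels = list(means)
    observed_order = sorted(labels, key=lambda g: -means[g])
    best_counts = dict.fromkeys(labels, 0)
    same_rank = 0
    for _ in range(n_draws):
        # Assumed posterior: independent normals around the estimates
        draw = {g: rng.gauss(means[g], std_err) for g in labels}
        order = sorted(labels, key=lambda g: -draw[g])
        best_counts[order[0]] += 1
        same_rank += (order == observed_order)
    prob_best = {g: best_counts[g] / n_draws for g in labels}
    return prob_best, same_rank / n_draws


prob_best, prob_observed_ranking = ranking_probabilities(means, std_err)
```

Under this approximation formulation E is best with probability near 0.6 and A with probability near 0.4, in line with Mean5_best and Mean1_best in the listing.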
166
Waller-Duncan Loss Function Approach
Let δij = μi − μj.
Let Li<j(δij) denote the loss of declaring μi < μj.
Let Li>j(δij) denote the loss of declaring μi > μj.
Let Li~j(δij) denote the loss of declaring μi n.s. different from μj.
W-D Loss functions*
Li~j(δij) = |δij|
Li<j(δij) = 0 if δij < 0, = kδij otherwise
Li>j(δij) = −kδij if δij < 0, = 0 otherwise
* Equivalent form; See Hochberg and Tamhane (1987, 320-330)
See Pennello, G. 1997. The k-ratio multiple comparisons Bayes rule for the balanced two-way design. Journal of the American Statistical Association 92: 675-684.
167
168
Implementation
Waller-Duncan in PROC GLM
More general: Simulate from posterior pdf of the ij, calculate all three losses, average, and choose decision with smallest average loss.
169
Sample Output
The MEANS Procedure

Variable   N        Mean          Std Error
Loss1LT3   100000   177.9339488   0.1445577
Loss1NS3   100000     1.7793558   0.0014454
Loss1GT3   100000     0.0016329   0.000596411

Decision: 1 > 3

The MEANS Procedure

Variable   N        Mean         Std Error
Loss1LT5   100000   12.7456041   0.0717432
Loss1NS5   100000    0.3746864   0.000905742
Loss1GT5   100000   24.7230371   0.0967412

Decision: 1 ~ 5
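The simulation described under Implementation can be sketched in a few lines of Python. This is an approximation under assumptions that are not stated in the slides: the posterior of δij is taken to be normal with independent means (standard error 0.3152 each, from the formulation estimates), and k = 100, a value chosen because it matches the ratio of Loss1LT3 to Loss1NS3 in the output. Function and key names are hypothetical.

```python
import random


def waller_duncan_decision(mean_i, mean_j, std_err, k=100, n_draws=100_000, seed=12321):
    """Average the three W-D losses over draws of delta_ij = mu_i - mu_j and
    return the decision with the smallest average loss."""
    rng = random.Random(seed)
    sd_diff = std_err * 2 ** 0.5           # assumes independent posteriors for mu_i, mu_j
    loss = {"i<j": 0.0, "i~j": 0.0, "i>j": 0.0}
    for _ in range(n_draws):
        delta = rng.gauss(mean_i - mean_j, sd_diff)
        loss["i~j"] += abs(delta)                # cost of declaring "not different"
        loss["i<j"] += k * max(delta, 0.0)       # wrong direction when delta > 0
        loss["i>j"] += k * max(-delta, 0.0)      # wrong direction when delta < 0
    avg = {d: total / n_draws for d, total in loss.items()}
    return min(avg, key=avg.get), avg


# Formulations 1 (A), 3 (C), and 5 (E) from the fixed-effects estimates
decision_13, loss_13 = waller_duncan_decision(12.05, 10.27, 0.3152)
decision_15, loss_15 = waller_duncan_decision(12.05, 12.17, 0.3152)
```

The decisions agree with the sample output (1 > 3, and 1 not significantly different from 5); the average losses are close to, but not identical with, the listed means because the normal posterior is only an approximation.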
170
Bayesian Multiple Testing
Frequentist univariate testing: Calculate p-value = P(data more extreme | H0)
Bayesian univariate testing: Calculate P(H0 is true | Data)
Frequentist multiple testing: If H01, H02, … , H0k are all true (or if many are true), then we get a small p-value by chance alone, so use a more conservative rule.
Bayesian multiple testing: Express the doubt about many or all H0i being true using a prior distribution; use this to calculate posterior probabilities P(H0i is true | Data).
171
Bayesian Multiple Testing: Methodology
Find posterior probability for each of the 2^k models where θi is either =0 or ≠0. Then
P(θi = 0 | Z) = (Sum of posterior probs for all 2^(k-1) models where θi = 0) / (Sum of posterior probs for all 2^k models)
Gopalan, R., and Berry, D.A. (1998), Bayesian Multiple Comparisons Using Dirichlet Process Priors, Journal of the American Statistical Association 93, 1130-1139.
Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). Bayesian multiple testing for two-sample multivariate endpoints," Biometrics 59, 76-82.
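The ratio of sums on this slide is a marginalization over models. A minimal Python sketch of that step alone (it does not compute the model posteriors themselves); the function name and the four example weights for k = 2 are hypothetical.

```python
def marginal_null_probabilities(model_weights):
    """model_weights maps a tuple of k flags (True = theta_i is 0) to that model's
    posterior weight, possibly unnormalized. Returns P(theta_i = 0 | data) for each i."""
    k = len(next(iter(model_weights)))
    total = sum(model_weights.values())          # sum over all 2^k models
    return [
        sum(w for flags, w in model_weights.items() if flags[i]) / total
        for i in range(k)
    ]


# Hypothetical posterior weights for the 2^2 = 4 models with k = 2
weights = {
    (True, True): 0.10,    # theta_1 = 0 and theta_2 = 0
    (True, False): 0.20,   # theta_1 = 0 only
    (False, True): 0.30,   # theta_2 = 0 only
    (False, False): 0.40,  # neither is 0
}
post_null = marginal_null_probabilities(weights)
```

Each numerator adds the 2^(k-1) models in which the given θi is zero; here that gives 0.3 for θ1 and 0.4 for θ2.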
172
The %BayesTests Macro: Priors
• You can specify your level of prior doubt about individual hypotheses. You can specify either (i) P(H0i is true) or (ii) P(H0i is true, all i), or both.
• You can specify (iii) the degree of prior correlation among the individual hypotheses.
• Specify two out of three of (i), (ii), and (iii). The third is determined by the other two.
• Specify prior expected effect sizes and prior variances of effect sizes (default: mean effect size is 2.5, variance= 2.)
173
The %BayesTests Macro: Data Assumptions, Inputs, and Outputs
• Assume: tests are free combinations (e.g., multiple endpoints); MANOVA; Large Samples.
• Inputs: t-statistics and their (conditional) large-sample correlation matrix (this is the partial correlation matrix in the case of multiple endpoints); priors.
• Outputs: Posterior probabilities P(H0i is true | Data).
174
%BayesTests Example: Multiple Endpoints in Panic Disorder Study
proc glm data=research.panic;
   class TX;
   model AASEVO PANTOTO PASEVO PHCGIMPO = TX;
   estimate "Treatment vs Control" TX 1 -1;
   manova h=TX / printe;
   ods output Estimates=Estimates PartialCorr=PartialCorr;
run;

%macro Estimates;
   use Estimates;
   read all var {tValue} into EstPar;
   use PartialCorr;
   read all var {AASEVO, PANTOTO, PASEVO, PHCGIMPO} into cov;
%mend;

%BayesTests(rho=.5,Pi0 =.5);
175
Output from %BayesTests
176
The Effect of Prior Correlation: Borrowing Strength
[Figure: “Posterior Null Probability as a Function of Prior Correlation”. Vertical axis p (0 to 0.6), horizontal axis r (0 to 1); one curve for each of H1, H2, H3, and H4.]
177
The Bayesian Multiplicity Effect
If the multiple comparisons concern, “What if many or all nulls are true?”, is valid, the Bayesian must attach a higher probability to P(H0i is true, all i).
Here is the result of setting P(H0i is true, all i) = .5.
“Right” answers: see Westfall, P.H., Krishen, A. and Young, S.S. (1998). "Using Prior Information to Allocate Significance Levels for Multiple Endpoints," Statistics in Medicine 17, 2107-2119.
178
Summary: Bayesian Methods
Several Bayesian MCPs are available!
• Intervals
• Tests
• Rankings
• Decision theory
Other current research:
• FDR – Bayesian connection (genetics)
• Mixture models and Bayesian MCPs (variable selection)
179
Discussion
Good methods and software are available
You can’t use the excuse “I don’t have to use MCPs because there is no good method available”
This brings us back to the $100,000,000 question: “When should we use MCPs/MTPs”?
180
When Should You Adjust? A Scientific View
When there is substantial doubt concerning the collection of hypotheses tested
When you data snoop
When you play “pick the winner”
When conclusions require joint validity
181
But What “Family” Should I Use?
• The set over which you play “pick the winner”
• The set of conclusions requiring joint validity
• Not always well-defined
• Better to decide at design stage or simply to “frame the discussion”
182
Multiplicity Invites Selection; Selection has an Effect
Variability, probability theory, VERY relevant.
183
Final Words:
/k
184
References: Books
Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley, New York.
Hsu, J.C. (1996). Multiple Comparisons: Theory and Methods, Chapman and Hall, London.
Westfall, P.H., and Young, S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley, New York.
Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D., and Hochberg, Y. (1999). Multiple Comparisons and Multiple Tests Using the SAS® System, Cary, NC: SAS Institute Inc.
Westfall, P.H. and Tobias, R. (2000). Exercises to Accompany "Multiple Comparisons and Multiple Tests Using the SAS® System" , Cary, NC: SAS Institute Inc.
185
References: Journal Articles
• Bauer, P.; George Chi; Nancy Geller; A. Lawrence Gould; David Jordan; Surya Mohanty; Robert O'Neill; Peter H. Westfall (2003). Industry, Government, and Academic Panel Discussion on Multiple Comparisons in a “Real” Phase Three Clinical Trial. Journal of Biopharmaceutical Statistics, 13(4), 691-701.
• Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A new and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289-300.
• Berger, J. O. and Delampady, M. (1987), Testing precise hypothesis. Statistical Science 2, 317-352.
• Cook, R.J. and Farewell, V.T. (1996). Multiplicity considerations in the design and analysis of clinical trials. JRSS-A 159, 93-110.
• Dmitrienko, A, Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine 22, 2387-2400.
• Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). "Bayesian multiple testing for two-sample multivariate endpoints," Biometrics 59, 76-82.
• Hellmich M, Lehmacher W. Closure procedures for monotone bi-factorial dose-response designs. Biometrics 2005;61:269-276.
• Koyama, T., and Westfall, P.H. (2005). Decision-Theoretic Views on Simultaneous Testing of Superiority and Noninferiority, Journal of Biopharmaceutical Statistics 15, 943-955.
• Lehmacher W., Wassmer G., Reitmeir P.: Procedures for Two-Sample Comparisons with Multiple Endpoints Controlling the Experimentwise Error Rate. Biometrics, 1991, 47: 511-521.
• Marcus, R., Peritz, E. and Gabriel, K.R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655-660.
• Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association 81, 826-831.
• Westfall, P.H. (1997). "Multiple Testing of General Contrasts Using Logical Constraints and Correlations," Journal of the American Statistical Association 92, 299-306.
• Westfall, P.H. and Wolfinger, R.D. (1997). "Multiple Tests with Discrete Distributions," The American Statistician 51, 3-8.
• Westfall, P.H., Johnson, W.O., and Utts, J.M. (1997). A Bayesian perspective on the Bonferroni adjustment. Biometrika 84, 419-427.
• Westfall, P.H. and Wolfinger, R.D. (2000). "Closed Multiple Testing Procedures and PROC MULTTEST." SAS Observations, July, 2000.
• Westfall, P.H., Ho, S.-Y., and Prillaman, B.A. (2001). "Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials," Journal of Biopharmaceutical Statistics 11, 125-138.
• Westfall, P.H. and Krishen, A. (2001). "Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures," Journal of Statistical Planning and Inference 99, 25-40.
• Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel Dekker Inc., New York, pp. 666-673.
• Westfall, P.H., Zaykin, D.V., and Young, S.S. (2001). Multiple tests for genetic effects in association studies. Methods in Molecular Biology, vol. 184: Biostatistical Methods, pp. 143-168. Stephen Looney, Ed., Humana Press, Totowa, NJ.
• Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method, Journal of the American Statistical Association 102: 487-494.