
1

A Course in Multiple Comparisons and Multiple Tests

Peter H. Westfall, Ph.D.

Professor of Statistics, Department of Inf. Systems and Quant. Sci.

Texas Tech University

2

Learning Outcomes

Elucidate reasons that multiple comparisons procedures (MCPs) are used, as well as their controversial nature

Know when and how to use classical interval-based MCPs including Tukey, Dunnett, and Bonferroni

Understand how MCPs affect power

Elucidate the definition of closed testing procedures (CTPs)

Understand specific types of CTPs, benefits and drawbacks

Distinguish false discovery rate (FDR) from familywise error rate (FWE)

Understand general issues regarding Bayesian MCPs

3

Introduction. Overview of Problems, Issues, and Solutions, Regulatory and Ethical Perspectives, Families of Tests, Familywise Error Rate, Bonferroni. (pp. 5-21)

Interval-Based Multiple Inferences in the standard linear models framework. One-way ANOVA and ANCOVA, Tukey, Dunnett, and Monte Carlo Methods, Adjusted p-values, general contrasts, Multivariate T distribution, Tight Confidence Bands, Treatment × Covariate Interaction, Subgroup Analysis (pp. 22-55)

Power and Sample Size Determinations for multiple comparisons. (pp. 56-65)

Stepwise and Closed Testing Procedures I: P-value-Based Methods. Closure Method, Global Tests; Holm, Hommel, Hochberg and Fisher combined methods for p-Values; (pp. 66-90)

Stepwise and Closed Testing Procedures II: Fixed Sequences, Gatekeepers and I-U tests: Fixed Sequence tests, Gatekeeper procedures, Multiple hypotheses in a gate, Intersection-union tests; with application to dose response, primary and secondary endpoints, bioequivalence and combination therapies (pp. 91-101)

Outline of Material

4

Stepwise and Closed Testing Procedures III: Methods that use logical constraints and correlations. Lehmacher et al. Method for Multiple endpoints; Range-Based and F-based ANOVA Tests, Fisher’s protected LSD, Free and Restricted Combinations, Shaffer-Type Methods for dose comparisons and subgroup analysis (pp. 102-118)

Multiple nonparametric and semiparametric tests: Bootstrap and Permutation-based Closed testing. PROC MULTTEST, examples with multiple endpoints, genetic associations, gene expression, binary data and adverse events (pp. 119-139)

More complex models and FWE control: Heteroscedasticity, Repeated measures, and large sample methods. Applications: multiple treatment comparisons, crossover designs, logistic regression of cure rates (pp. 140-152)

False Discovery Rate: Benjamini and Hochberg’s method, comparison with FWE – controlling methods (153-158)

Bayesian methods: Simultaneous credible intervals, ranking probabilities and loss functions, PROC MIXED posterior sampling, Bayesian testing of multiple endpoints (pp. 159-178)

Conclusion, discussion, references (179-184)

Outline (Continued)

5

Sources of Multiplicity
• Multiple variables (endpoints)
• Multiple timepoints
• Subgroup analysis
• Multiple comparisons
• Multiple tests of the same hypothesis
• Variable and Model selection
• Interim analysis
• Hidden Multiplicity: File Drawers, Outliers

6

The Problem:

“Significant” results may fail to replicate.

Documented cases: Ioannidis (JAMA 2005)

7

An Example

• Phase III clinical trial
• Three arms – Placebo, AC, Drug
• Endpoints: Signs and symptoms
• Measured at weekly visits
• Baseline covariates

8

Example-Continued

‘Features’ displayed at trial conclusion:
• Trends
• Baseline adjusted comparisons of raw data
• Baseline adjusted % changes
• Nonparametric and parametric tests
• Specific endpoints and combinations of endpoints
• Particular week results
• AC and Placebo comparisons

Fact: The features that “look the best” are biased.

9

Example Continued –Feature Selection

• ‘Effect Size’ is a feature
• Effect size = (mean difference)/sd
• Dimensionless
• .2=‘small’, .5=‘medium’, .8=‘large’

Estimated effect sizes : F1, F2,…,Fk

What if you select (max{F1,F2,…,Fk}) and publish it?

10

The Scientific Concern: Effect Sizes in 1000 Replicated Studies

[Figure: Replicated Effect Size (vertical axis, 0 to 1.5) plotted against Effect Size of Selected Effect in First Study (horizontal axis, 0 to 1.5).]

11

Feature Selection Model

• Clinical Trials Simulation
• Real data used
• Conservative!

If you must know more:
• $F_j = \delta_j + \varepsilon_j$, $j = 1, \dots, 20$.
• Error terms $\varepsilon_j$ are $N(0, .2^2)$.
• True effect sizes $\delta_j$ are $N(.3, .1^2)$.
• Features $F_j$ are highly correlated.
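The selection effect described by this model can be illustrated with a small simulation. This is only a sketch: it assumes independent error terms (the slide says the real features are highly correlated), and the function name `simulate_selection` is hypothetical, not part of the course materials.

```python
import random
import statistics

def simulate_selection(n_studies=1000, k=20, seed=1):
    """Mean selected effect in the first study vs. mean effect on replication."""
    rng = random.Random(seed)
    selected, replicated = [], []
    for _ in range(n_studies):
        # True effect sizes delta_j ~ N(0.3, 0.1^2)
        true_effects = [rng.gauss(0.3, 0.1) for _ in range(k)]
        # First-study estimates F_j = delta_j + eps_j, eps_j ~ N(0, 0.2^2)
        # (errors are independent here, which is an assumption of this sketch)
        first = [d + rng.gauss(0.0, 0.2) for d in true_effects]
        # Publish the feature that "looks the best"
        best = max(range(k), key=lambda j: first[j])
        selected.append(first[best])
        # Replication: same feature, fresh error term
        replicated.append(true_effects[best] + rng.gauss(0.0, 0.2))
    return statistics.mean(selected), statistics.mean(replicated)

sel_mean, rep_mean = simulate_selection()
print(f"selected in first study: {sel_mean:.2f}, on replication: {rep_mean:.2f}")
```

Under these assumptions the selected effect is noticeably larger on average than the same effect on replication, which is the bias the preceding slides describe.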

12

Key Points:
(i) Multiplicity invites Selection
(ii) Selection has an EFFECT

Just like effects due to
• Treatment
• Confounding
• Learning
• Nonresponse
• Placebo

13

Published Guidelines

• ICH Guidelines
• CPMP Points to consider
• CDRH Statistical Guidance
• ASA Ethical Guidelines

14

Regulatory/Journal/Ethical/Professional Concerns

• Replicability (good science)
• Fairness

Regulatory report: The drug company reported efficacy at p=.047. We repeated the analysis in several different ways that the company might have done. In 20 re-analyses of the data, 18 produced p-values greater than .05. Only one of the 20 re-analyses produced a p-value smaller than .047.

15

Multiple Inferences: Notation

There is a “family” of k inferences

Parameters are θ1,…, θk

Null hypotheses are

H01: θ1=0, …, H0k: θk=0

16

Comparisonwise Error Rate (CER)

Intervals:

CERj = P(Intervalj incorrect)

Tests:

CERj = P(Reject H0j | H0j is true)

Usually CER = α = .05

17

Familywise Error Rate (FWE)

Intervals:

FWE = 1 - P(all intervals are correct)

Tests:

FWE = P(reject at least one true null)

18

False Discovery Rate

• FDR = E(proportion of rejections that are incorrect)

• Let R = total # of rejections

• Let V = # of erroneous rejections

• FDR = E(V/R) (0/0 defined as 0).

• FWE = P(V>0)
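The two error rates can be compared with a short simulation. This is a sketch under stated assumptions: independent one-sided z-tests, unadjusted testing at α = .05, 8 true nulls and 2 false nulls with a mean shift of 3. The function name and all parameter values are illustrative only.

```python
import random
from statistics import NormalDist

def simulate_error_rates(n_true=8, n_false=2, shift=3.0, alpha=0.05,
                         n_rep=20000, seed=2):
    """Estimate FWE = P(V > 0) and FDR = E(V/R) for unadjusted z-tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    any_false_rejection = 0
    fdp_total = 0.0
    for _ in range(n_rep):
        # V: erroneous rejections, among true nulls (z ~ N(0, 1))
        v = sum(rng.gauss(0.0, 1.0) > z_crit for _ in range(n_true))
        # Correct rejections, among false nulls (z ~ N(shift, 1))
        s = sum(rng.gauss(shift, 1.0) > z_crit for _ in range(n_false))
        r = v + s                      # R: total number of rejections
        any_false_rejection += (v > 0)
        fdp_total += (v / r) if r > 0 else 0.0   # 0/0 defined as 0
    return any_false_rejection / n_rep, fdp_total / n_rep

fwe, fdr = simulate_error_rates()
print(f"FWE = {fwe:.3f}, FDR = {fdr:.3f}")
```

Because V/R can never exceed the indicator of V > 0, the FDR is never larger than the FWE; the simulation shows how far apart they can be when some nulls are false.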

19

Bonferroni Method

• Identify Family of inferences
• Identify number of elements (k) in the Family
• Use α/k for all inferences.
• Ex: With k=36, p-values must be less than 0.05/36 = 0.0014 to be “significant”

20

FWE Control for Bonferroni

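$$
\begin{aligned}
\mathrm{FWE} &= P\left(p_{0j_1} \le .05/36 \ \text{or} \ \dots \ \text{or} \ p_{0j_m} \le .05/36 \mid H_{0j_1},\dots,H_{0j_m}\ \text{true}\right) \\
&\le P(p_{0j_1} \le .05/36) + \dots + P(p_{0j_m} \le .05/36) \\
&= (.05)\,m/36 \le .05
\end{aligned}
$$

The inequality uses the Bonferroni bound for events A and B: $P(A \cup B) \le P(A) + P(B)$.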
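The bound can be checked numerically. The sketch below assumes independent tests for the "exact" values, which is an added assumption; the Bonferroni bound itself needs no such assumption. The function name is hypothetical.

```python
def bonferroni_fwe(alpha=0.05, k=36, m=36):
    """Per-test level, Bonferroni bound, and FWE under independence.

    k = number of inferences in the family, m = number of true nulls (m <= k).
    """
    per_test = alpha / k                          # level used for each test
    bound = m * per_test                          # Bonferroni upper bound on FWE
    exact_indep = 1 - (1 - per_test) ** m         # FWE if the m tests are independent
    unadjusted_indep = 1 - (1 - alpha) ** m       # FWE with no adjustment, independent tests
    return per_test, bound, exact_indep, unadjusted_indep

per_test, bound, exact_indep, unadjusted_indep = bonferroni_fwe()
print(f"per-test level {per_test:.4f}, bound {bound:.3f}, "
      f"independent FWE {exact_indep:.4f}, unadjusted FWE {unadjusted_indep:.3f}")
```

With k = m = 36 the per-test level is 0.0014 as on the Bonferroni slide, and without adjustment the familywise error rate under independence exceeds 0.8.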

21

“Families” in clinical trials [1]

Efficacy
• Main Interest (Primary & Secondary): Approval and Labeling depend on these. Tight FWE control needed.
• Lesser Interest: Depending on goals and reviewers, FWE controlling methods might be needed.
• Supportive Tests: mostly descriptive. FWE control not needed.
• Exploratory Tests: investigate new indications; future trials needed to confirm; do what makes sense.

Safety
• Serious and known treatment-related AEs: FWE control not needed.
• All other AEs: Reasonable to control FWE (or FDR).

[1] Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel Dekker Inc., New York, pp. 666-673.

22

Classical Single-Step Testing and Interval Methods to Control FWE
• Simultaneous confidence intervals; Adjusted p-values
• Dunnett method
• Tukey’s method
• Simulation-based methods for general comparisons

23

“Specificity” and “Sensitivity”

If you want ...                               …then use
Estimates of effect sizes & error margins     Simultaneous Confidence Intervals
Confident inequalities                        Stepwise or closed tests
Overall Test                                  F-test, O’Brien, etc.

24

The Model

$Y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2 I)$

Includes ANOVA, ANCOVA, regression

For group comparisons, covariate adjustment

Not valid for survival analysis, binary data, multivariate data

25


Table 3.2: Means and Standard Deviations of Mouse Growth Data

             Control   1       2       3       4       5       6
Mean         105.38    95.90   80.48   72.14   91.88   84.68   74.24
Std. Dev.    13.44     23.89   12.68   8.41    9.44    18.35   7.81

Example: Pairwise Comparisons against Control

Goal: Estimate all mean differences from control and provide simultaneous 95% error margins:

$\bar{y}_i - \bar{y}_0 \pm c\,\hat{\sigma}\sqrt{2/n}$

What c to use?

26

Comparison of Critical Values

Mouse Growth Data: ni = 4 mice per group (6 doses and one control), df = 21. There are 6 comparisons.

Method          α      α'       c
Bonferroni      0.05   0.0083   2.912
Sidak           0.05   0.0085   2.903
Dunnett*        0.05   0.0110   2.790
Unadjusted**    0.05   0.0500   2.080

* α' calculated from c

** Does not control the FWE

SAS file:

    data;
      c_alpha = probmc("DUNNETT2",.,.95,21,6);
    run;
    proc print; run;
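The α' column for the Bonferroni and Sidak rows can be reproduced directly. This sketch covers only the per-comparison levels; the critical values c require t quantiles and PROBMC, which are not reproduced here. The function names are illustrative.

```python
def bonferroni_alpha(alpha, k):
    """Per-comparison level alpha' for the Bonferroni method."""
    return alpha / k

def sidak_alpha(alpha, k):
    """Per-comparison level alpha' for the Sidak method."""
    return 1 - (1 - alpha) ** (1 / k)

# k = 6 comparisons with control, k = 21 pairwise comparisons among 7 groups
for k in (6, 21):
    print(k, round(bonferroni_alpha(0.05, k), 4), round(sidak_alpha(0.05, k), 4))
```

For k = 6 this gives 0.0083 and 0.0085, and for k = 21 both round to 0.0024, in line with the critical-value tables for the mouse growth data.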

27

Results - Dunnett

The GLM Procedure

Dunnett's t Tests for gain

NOTE: This test controls the Type I experimentwise error for comparisons of all treatments against a control.

Alpha                               0.05
Error Degrees of Freedom            21
Error Mean Square                   210.0048
Critical Value of Dunnett's t       2.78972
Minimum Significant Difference      28.586

Comparisons significant at the 0.05 level are indicated by ***.

g             Difference         Simultaneous 95%
Comparison    Between Means      Confidence Limits
1 - 0           -9.48            -38.07    19.11
4 - 0          -13.50            -42.09    15.09
5 - 0          -20.70            -49.29     7.89
2 - 0          -24.90            -53.49     3.69
6 - 0          -31.14            -59.73    -2.55   ***
3 - 0          -33.24            -61.83    -4.65   ***

28

c is the 1−α quantile of the distribution of $\max_i |Z_i - Z_0| / (2\chi^2_{df}/df)^{1/2}$, called Dunnett’s two-sided range distribution.

29

Adjusted p-Values

Definition: Adjusted p-value =smallest FWE at which the hypothesis is rejected.

or

The FWE for which the confidence interval has “0” as a boundary.

30

Adjusted p-values for Dunnett

The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Dunnett

                      H0: LSMEAN=CONTROL
G    GAIN LSMEAN      Pr > |T|
0    105.380000
1     95.900000       0.8608 (0.3654)
2     80.480000       0.1034 (0.0242)
3     72.140000       0.0188 (0.0039)
4     91.880000       0.6090 (0.2019)
5     84.680000       0.2196 (0.0563)
6     74.240000       0.0294 (0.0062)

(Unadjusted p-values are shown in parentheses.)

    proc glm data=tox;
      class g;
      model gain=g;
      lsmeans g/adjust=dunnett pdiff;
    run;

31


Table 3.2: Means and Standard Deviations of Mouse Growth Data

             Control   1       2       3       4       5       6
Mean         105.38    95.90   80.48   72.14   91.88   84.68   74.24
Std. Dev.    13.44     23.89   12.68   8.41    9.44    18.35   7.81

Example: All Pairwise Comparisons

Goal: Estimate all mean differences and provide simultaneous 95% error margins:

$\bar{y}_i - \bar{y}_{i'} \pm c\,\hat{\sigma}\sqrt{2/n}$

What c to use?

32

Comparison of Critical Values

Mouse Growth Data: ni = 4 mice per group (6 doses and one control), df = 21. There are 21 pairwise comparisons.

Method          α      α'       c
Bonferroni      0.05   0.0024   3.453
Sidak           0.05   0.0024   3.443
Tukey*          0.05   0.0038   3.251
Unadjusted**    0.05   0.0500   2.080

* α' calculated from c

** Does not control the FWE

SAS file:

    data;
      qval = probmc("RANGE",.,.95,21,7);
      c_alpha = qval/sqrt(2);
    run;
    proc print; run;

33

Tukey Comparisons

Alpha = 0.05   df = 21   MSE = 210.0048
Critical Value of Studentized Range = 4.597
Minimum Significant Difference = 33.311
Means with the same letter are not significantly different.

Tukey Grouping    Mean      N    G
A                 105.38    4    0
A                  95.90    4    1
A                  91.88    4    4
A                  84.68    4    5
A                  80.48    4    2
A                  74.24    4    6
A                  72.14    4    3

34

Tukey Adjusted p-Values

General Linear Models Procedure
Least Squares Means
Adjustment for multiple comparisons: Tukey

Pr > |T| H0: LSMEAN(i)=LSMEAN(j)

G   GAIN      LSMEAN   i/j
    LSMEAN    Number   1        2        3        4        5        6        7
0   105.380   1        .        0.9641   0.2351   0.0507   0.8364   0.4319   0.0769
1    95.900   2        0.9641   .        0.7391   0.2810   0.9996   0.9227   0.3806
2    80.480   3        0.2351   0.7391   .        0.9808   0.9172   0.9995   0.9958
3    72.140   4        0.0507   0.2810   0.9808   .        0.4860   0.8771   1.0000
4    91.880   5        0.8364   0.9996   0.9172   0.4860   .        0.9910   0.6102
5    84.680   6        0.4319   0.9227   0.9995   0.8771   0.9910   .        0.9438
6    74.240   7        0.0769   0.3806   0.9958   1.0000   0.6102   0.9438   .

35

Tukey Simultaneous Intervals

          Simultaneous        Difference       Simultaneous
          Lower Confidence    Between          Upper Confidence
i    j    Limit               Means            Limit
1    2    -23.831013            9.480000       42.791013
1    3     -8.411013           24.900000       58.211013
1    4     -0.071013           33.240000       66.551013
1    5    -19.811013           13.500000       46.811013
1    6    -12.611013           20.700000       54.011013
1    7     -2.171013           31.140000       64.451013
2    3    -17.891013           15.420000       48.731013
2    4     -9.551013           23.760000       57.071013
2    5    -29.291013            4.020000       37.331013
2    6    -22.091013           11.220000       44.531013
2    7    -11.651013           21.660000       54.971013
3    4    -24.971013            8.340000       41.651013
3    5    -44.711013          -11.400000       21.911013
3    6    -37.511013           -4.200000       29.111013
3    7    -27.071013            6.240000       39.551013
4    5    -53.051013          -19.740000       13.571013
4    6    -45.851013          -12.540000       20.771013
4    7    -35.411013           -2.100000       31.211013
5    6    -26.111013            7.200000       40.511013
5    7    -15.671013           17.640000       50.951013
6    7    -22.871013           10.440000       43.751013

36

c is $(1/\sqrt{2}) \times$ {the 1−α quantile of the distribution of $\max_{i,i'} |Z_i - Z_{i'}| / (\chi^2_{df}/df)^{1/2}$}, which is called the Studentized range distribution.

37

Unbalanced Designs and/or Covariates

Tukey method is conservative when the design is unbalanced and/or there are covariates; otherwise exact

Dunnett method is conservative when there are covariates; otherwise exact

“Conservative” means

{True FWE} < {Nominal FWE} ;

also means “less powerful”

38

Tukey-Kramer Method for all pairwise comparisons

Let c be the critical value for the balanced case using Tukey’s method and the correct df.

Intervals are $\bar{y}_i - \bar{y}_{i'} \pm c\,\hat{\sigma}\sqrt{1/n_i + 1/n_{i'}}$

Conservative (Hayter, 1984 Annals)

39

Exact Method for General Comparisons of Means

40

Multivariate T-Distribution Details

40

41

Calculation of “Exact” c

• Edwards and Berry: Simple simulation
• Hsu and Nelson: Factor analytic control variate (better)
• Genz and Bretz: Integration using lattice methods (best)

Even with simple simulation, the value c can be obtained with reasonable precision.

Edwards, D., and Berry, J. (1987) The efficiency of simulation-based multiple comparisons. Biometrics, 43, 913-928.

Hsu, J.C. and Nelson, B.L. (1998) Multiple comparisons in the general linear model. Journal of Computational and Graphical Statistics, 7, 23-41.

Genz, A. and Bretz, F. (1999), Numerical Computation of Multivariate t Probabilities with Application to Power Calculation of Multiple Contrasts, J. Stat. Comp. Simul. 63, pp. 361-378.

42

Example: ANCOVA with two covariates

Y = Diastolic BP
Group = Therapy (Control, D1, D2, D3)
X1 = Baseline Diastolic BP
X2 = Baseline Systolic BP

Goal: Compare all therapies, controlling for baseline

    proc glm data=research.bpr;
      class therapy;
      model dbp10 = therapy dbp7 sbp7;
      lsmeans therapy/pdiff cl adjust=simulate(nsamp=10000000 cvadjust seed=121011 report);
    run; quit;

43

Results From ANCOVA

Source     DF   Type III SS     Mean Square     F Value   Pr > F
THERAPY    3     677.429172      225.809724       6.05    0.0006
DBP7       1    6832.878653     6832.878653     183.06    <.0001
SBP7       1      51.123459       51.123459       1.37    0.2435

Least Squares Means for Effect THERAPY

          Difference      Simultaneous 95%
          Between         Confidence Limits for
i    j    Means           LSMean(i)-LSMean(j)
1    2     2.832658       -0.424816     6.09013
1    3     1.328481       -2.099566     4.756527
1    4    -2.536262       -5.981471     0.908947
2    3    -1.504178       -4.846403     1.838047
2    4    -5.368920       -8.734744    -2.003097
3    4    -3.864743       -7.398994    -0.330491

Note: “4” is control

44

Details for Quantile Simulation

Random number seed            121011
Comparison type               All
Sample size                   9999938
Target alpha                  0.05
Accuracy radius (target)      0.0002
Accuracy radius (actual)      437E-7
Accuracy confidence           99%

Simulation Results

                 95%         Estimated    99% Confidence
Method           Quantile    Alpha        Limits
Simulated        2.594159    0.0500       0.0500   0.0500
Tukey-Kramer     2.594637    0.0499       0.0499   0.0500
Bonferroni       2.669484    0.0411       0.0410   0.0411
Sidak            2.662029    0.0419       0.0418   0.0419
GT-2             2.660647    0.0420       0.0420   0.0421
Scheffe          2.823701    0.0270       0.0270   0.0270
T                1.974017    0.2017       0.2016   0.2018

NOTE: PROCEDURE GLM used: real time 21.23 seconds

45

Results from ANCOVA-Dunnett

                             H0: LSMean=Control
THERAPY    DBP10 LSMEAN      Pr > |t|
Dose 1     88.8171113        0.1407
Dose 2     85.9844529        0.0002
Dose 3     87.4886307        0.0140
Placebo    91.3533732

Least Squares Means for Effect THERAPY

          Difference      Simultaneous 95%
          Between         Confidence Limits for
i    j    Means           LSMean(i)-LSMean(j)
1    4    -2.536262       -5.675847     0.603323
2    4    -5.368920       -8.436161    -2.301679
3    4    -3.864743       -7.085470    -0.644015

46

Details for Quantile Simulation-Dunnett

Random number seed            121011
Comparison type               Control, two-sided
Sample size                   9999938
Target alpha                  0.05
Accuracy radius (target)      0.0002
Accuracy radius (actual)      139E-7
Accuracy confidence           99%

Simulation Results

                           95%         Estimated    99% Confidence
Method                     Quantile    Alpha        Limits
Simulated                  2.364031    0.0500       0.0500   0.0500
Dunnett-Hsu, two-sided     2.364084    0.0500       0.0500   0.0500
Bonferroni                 2.417902    0.0437       0.0437   0.0437
Sidak                      2.411491    0.0444       0.0444   0.0444
GT-2                       2.410664    0.0445       0.0445   0.0445
Scheffe                    2.823701    0.0145       0.0145   0.0145
T                          1.974017    0.1229       0.1229   0.1230

NOTE: PROCEDURE GLM used: real time 19.00 seconds

47

More General Inferences

Question: For what values of the covariate is treatment A better than treatment B?

48

Discussion of (Treatment × Covariate) Interaction Example

49

The GLIMMIX Procedure

Computes MC-exact simultaneous confidence intervals and adjusted p-values for any set of linear functions in a linear model

50

GLIMMIX syntax

    proc glimmix data=research.tire;
      class make;
      model cost = make mph make*mph;
      estimate "10" make 1 -1 make*mph 10 -10,
               "15" make 1 -1 make*mph 15 -15,
               "20" make 1 -1 make*mph 20 -20,
               "25" make 1 -1 make*mph 25 -25,
               "30" make 1 -1 make*mph 30 -30,
               "35" make 1 -1 make*mph 35 -35,
               "40" make 1 -1 make*mph 40 -40,
               "45" make 1 -1 make*mph 45 -45,
               "50" make 1 -1 make*mph 50 -50,
               "55" make 1 -1 make*mph 55 -55,
               "60" make 1 -1 make*mph 60 -60,
               "65" make 1 -1 make*mph 65 -65,
               "70" make 1 -1 make*mph 70 -70
               /adjust=simulate(nsamp=10000000 report) cl;
    run;

51

Output from PROC GLIMMIX

Simultaneous intervals are Estimate ± 2.648 × StdErr

Label    Estimate    StdErr    tValue    AdjLower    AdjUpper
10       -4.1067     0.9143    -4.49     -6.5279     -1.6854
15       -3.4539     0.8084    -4.27     -5.5947     -1.3131
20       -2.8011     0.7101    -3.94     -4.6815     -0.9207
25       -2.1483     0.6230    -3.45     -3.7981     -0.4985
30       -1.4956     0.5524    -2.71     -2.9585     -0.03260
35       -0.8428     0.5054    -1.67     -2.1812      0.4956
40       -0.1900     0.4887    -0.39     -1.4842      1.1042
45        0.4628     0.5054     0.92     -0.8756      1.8012
50        1.1156     0.5524     2.02     -0.3474      2.5785
55        1.7683     0.6230     2.84      0.1185      3.4181
60        2.4211     0.7101     3.41      0.5407      4.3015
65        3.0739     0.8084     3.80      0.9331      5.2147
70        3.7267     0.9143     4.08      1.3054      6.1479

Bonferroni – critical value is t_{16, .05/(2×13)} = 3.377.

52

Other Applications of Linear Combinations

• Multiple Trend Tests
   – (0,1,2,3), (0,1,2,4), (0,4,6,7) (carcinogenicity)
   – (0,0,1), (0,1,1), (0,1,2) (recessive/dominant/ordinal genotype effects)
• Subgroup Analysis
   – Subgroups define linear combinations (more on next slide)

53

Subgroup Analysis Example

Data: $Y_{ijkl}$, where i = Trt, Cntrl; j = Old, Yng; k = GoodInit, PoorInit.

Model: $Y_{ijkl} = \mu_{ijk} + \varepsilon_{ijkl}$, where $\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}$

Subgroup Contrasts:

             111   112   121   122   211   212   221   222
Overall       ¼     ¼     ¼     ¼    -¼    -¼    -¼    -¼
Older         ½     ½     0     0    -½    -½     0     0
Younger       0     0     ½     ½     0     0    -½    -½
GoodInit      ½     0     ½     0    -½     0    -½     0
PoorInit      0     ½     0     ½     0    -½     0    -½
OldGood       1     0     0     0    -1     0     0     0
OldPoor       0     1     0     0     0    -1     0     0
YoungGood     0     0     1     0     0     0    -1     0
YoungPoor     0     0     0     1     0     0     0    -1

54

Subgroup Analysis Results

Label             Estimate   StdErr   tValue   Probt    Adjp     AdjLower    AdjUpper
Overall           0.7075     0.1956   3.62     0.0002   0.0015    0.2460     I
Older             0.9952     0.2673   3.72     0.0002   0.0011    0.3646     I
Younger           0.4197     0.2847   1.47     0.0717   0.2605   -0.2521     I
GoodInitHealth    0.5871     0.2878   2.04     0.0219   0.0984   -0.09197    I
PoorInitHealth    0.8279     0.2644   3.13     0.0011   0.0068    0.2039     I
OldGood           0.8748     0.3387   2.58     0.0056   0.0295    0.07566    I
OldPoor           1.1157     0.3231   3.45     0.0004   0.0026    0.3532     I
YoungGood         0.2993     0.3562   0.84     0.2014   0.5494   -0.5413     I
YoungPoor         0.5401     0.3338   1.62     0.0544   0.2091   -0.2476     I

(SAS code available upon request)

55

Summary
• Include only comparisons of interest.
• Utilize correlations to be less conservative.
• The critical values can be computed exactly only in balanced ANOVA for all pairwise comparisons, or in unbalanced ANOVA for comparisons with control.
• Simulation-based methods are “exact” if you let the computer run for a while. This is my general recommendation.

56

Power Analysis

• Sample size - Design of study
• Power is less when you use multiple comparisons ⇒ larger sample sizes
• Many power definitions
• Bonferroni & independence are convenient (but conservative) starting points

57

Power Definitions

“Complete Power” = P(Reject all H0i that are false)

“Minimal Power” = P(Reject at least one H0i that is false)

“Individual Power” = P(Reject a particular H0i that is false)

“Proportional Power” = Average proportion of false H0i that are rejected

58

Power Calculations

Example: H1 and H2 powered individually at 50%; H3 and H4 powered individually at 80%, all tests independent.

Complete Power = P(reject H1 and H2 and H3 and H4) = .5 × .5 × .8 × .8 = 0.16.

Minimal Power = P(reject H1 or H2 or H3 or H4) = 1 − P(“accept” H1 and H2 and H3 and H4) = 1 − (1−.5)(1−.5)(1−.8)(1−.8) = 0.99.

Individual Power = P(reject H3 (say)) = 0.80. (depends on the test)

Proportional Power = (.5 + .5 + .8 + .8)/4 = 0.65
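The same arithmetic can be written as a small helper. This sketch assumes independent tests and that every hypothesis in the list is false, as in the example; the function name is illustrative.

```python
from math import prod

def power_summaries(individual_powers):
    """Complete, minimal and proportional power for independent tests."""
    complete = prod(individual_powers)                      # reject all false nulls
    minimal = 1 - prod(1 - p for p in individual_powers)    # reject at least one
    proportional = sum(individual_powers) / len(individual_powers)
    return complete, minimal, proportional

complete, minimal, proportional = power_summaries([0.5, 0.5, 0.8, 0.8])
print(round(complete, 2), round(minimal, 2), round(proportional, 2))
```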

59

Sample Size for Adequate Individual Power - Conservative Estimate

60

Individual power of two-tail two-sample Bonferroni t-tests

    %let MuDiff = 5;    /* Smallest meaningful difference MUx-MUy that you want to detect */
    %let Sigma = 10.0;  /* A guess of the population std. dev. */
    %let alpha = .05;   /* Familywise Type I error probability of the test */
    %let k = 4;         /* Number of tests */

    options ls=76;
    data power;
      cer = &alpha/&k;
      do n = 2 to 100 by 2;                      * n=sample size for each group *;
        df = n + n - 2;
        ncp = (&Mudiff)/(&Sigma*sqrt(2/n));      * The noncentrality parameter *;
        tcrit = tinv(1-cer/2, df);               * The critical t value *;
        power = 1 - probt(tcrit, df, ncp) + probt(-tcrit, df, ncp);
        output;
      end;

    proc print data=power; run;

    proc plot data=power;
      plot power*n/vpos=30;
    run;
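For a quick cross-check without SAS, the per-group sample size can be approximated with normal quantiles. This is a sketch, not the noncentral t calculation used in the SAS program above: it replaces the t distribution by the normal, so it will tend to give a slightly smaller n. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def bonferroni_n_per_group(mu_diff, sigma, alpha=0.05, k=1, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample test at the Bonferroni level alpha/k."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / (2 * k))   # two-sided critical value at alpha/k
    z_beta = z.inv_cdf(power)                  # quantile for the target power
    return ceil(2 * (sigma * (z_alpha + z_beta) / mu_diff) ** 2)

# Same inputs as the SAS macro variables: MuDiff=5, Sigma=10, alpha=.05, k=4
n_approx = bonferroni_n_per_group(mu_diff=5, sigma=10, alpha=0.05, k=4)
print(n_approx)
```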

61

Graph of Power Function

[Figure: line-printer plot of power (vertical axis, 0.0 to 1.0) against n (horizontal axis, 0 to 100), annotated “n=92 for 80% power”.]

62

%IndividualPower macro*

• Uses PROBMC and PROBT (noncentral)

• Assumes that you want to use the single-step (confidence interval based) Dunnett (one- or two-sided) or Range (two-sided) test

•Less conservative than Bonferroni

•Conservative compared to stepwise procedures

•%IndividualPower(MCP=DUNNETT2,g=4,d=5,s=10);

*Westfall et al (1999), Multiple Comparisons and Multiple Tests Using SAS

63

%IndividualPower Output

64

More general Power- Simulate!

Invocation:

    %SimPower(method = dunnett,
              TrueMeans = (10, 10, 13, 15, 15),
              s = 10,
              n = 87,
              seed=12345);

Output:

Method=DUNNETT, Nominal FWE=0.05, nrep=1000
True means = (10, 10, 13, 15, 15), n=87, s=10

Quantity               Estimate    ---95% CI----
Complete Power         0.28800     (0.260,0.316)
Minimal Power          0.92900     (0.913,0.945)
Proportional Power     0.65133     (0.633,0.669)
True FWE               0.01900     (0.011,0.027)
Directional FWE        0.01900     (0.011,0.027)

65

Concluding Remarks - Power

• Need a bigger n
• Like to avoid bigger n (see sequential, gatekeepers methods later)
• Which definition?
• Bonferroni and independence useful
• Simulation useful – especially for the more complex methods that follow

66

Closed and Stepwise Testing Methods I: Standard P-Value Based Methods

If you want ...                               …then use
Estimates of effect sizes & error margins     Simultaneous Confidence Intervals
Confident inequalities                        Stepwise or closed tests: Holm’s Method, Hommel’s Method, Hochberg’s Method, Fisher Combination Method
Overall Test                                  F-test, O’Brien, etc.

67

Closed Testing Method(s)
• Form the closure of the family by including all intersection hypotheses.
• Test every member of the closed family by a (suitable) α-level test. (Here, α refers to comparison-wise error rate.)
• A hypothesis can be rejected provided that its corresponding test is significant at level α and every other hypothesis in the family that implies it is rejected by its α-level test.

68

Closed Testing – Multiple Endpoints

H0: δ1=δ2=δ3=δ4=0

H0: δ1=δ2=δ3=0    H0: δ1=δ2=δ4=0    H0: δ1=δ3=δ4=0    H0: δ2=δ3=δ4=0

H0: δ1=δ2=0    H0: δ1=δ3=0    H0: δ1=δ4=0    H0: δ2=δ3=0    H0: δ2=δ4=0    H0: δ3=δ4=0

H0: δ1=0, p = 0.0121
H0: δ2=0, p = 0.0142
H0: δ3=0, p = 0.1986
H0: δ4=0, p = 0.0191

Where δj = mean difference, treatment − control, endpoint j.

69

Closed Testing – Multiple Comparisons

μ1=μ2=μ3=μ4

μ1=μ2=μ3    μ1=μ2=μ4    μ1=μ3=μ4    μ2=μ3=μ4    {μ1=μ2, μ3=μ4}    {μ1=μ3, μ2=μ4}    {μ1=μ4, μ2=μ3}

μ1=μ2    μ1=μ3    μ1=μ4    μ2=μ3    μ2=μ4    μ3=μ4

Note: Logical implications imply that there are only 14 nodes, not 2^6 − 1 = 63 nodes.

70

Control of FWE with Closed Tests

Suppose H0j1,..., H0jm all are true (unknown to you which ones).

{Reject at least one of H0j1,..., H0jm using CTP} ⊆ {Reject H0j1 ∩ ... ∩ H0jm}

Thus,

P(reject at least one of H0j1,..., H0jm | H0j1,..., H0jm all are true) ≤ P(reject H0j1 ∩ ... ∩ H0jm | H0j1,..., H0jm all are true) = α

71

Examples of Closed Testing Methods

When the Composite Test is…       Then the Closed Method is …
Bonferroni MinP                    Holm’s Method
Resampling-Based MinP              Westfall-Young method
Simes                              Hommel’s method
O’Brien                            Lehmacher’s method
Simple or weighted test            Fixed sequence test (a-priori ordered)
…                                  …

72

P-value Based Methods

Test global hypotheses using p-value combination tests

Benefit – Fewer model assumptions: only need to say that the p-values are valid

Allows for models other than homoscedastic normal linear models (like survival analysis).

73

Holm’s Method is Closed Testing Using the Bonferroni MinP Test

Reject H0j1 ∩ H0j2 ∩ ... ∩ H0jm if

Min(p0j1, p0j2, ..., p0jm) ≤ α/m.

Or, Reject H0j1 ∩ H0j2 ∩ ... ∩ H0jm if

p* = m × Min(p0j1, p0j2, ..., p0jm) ≤ α.

(Note that p* is a valid p-value for the joint null, comparable to the p-value for Hotelling’s T² test.)

74

Holm’s Stepdown Method

H0: δ1=δ2=δ3=δ4=0    minp=0.0121    p*=0.0484

H0: δ1=δ2=δ3=0    minp=0.0121    p*=0.0363
H0: δ1=δ2=δ4=0    minp=0.0121    p*=0.0363
H0: δ1=δ3=δ4=0    minp=0.0121    p*=0.0363
H0: δ2=δ3=δ4=0    minp=0.0142    p*=0.0426

H0: δ1=δ2=0    minp=0.0121    p*=0.0242
H0: δ1=δ3=0    minp=0.0121    p*=0.0242
H0: δ1=δ4=0    minp=0.0121    p*=0.0242
H0: δ2=δ3=0    minp=0.0142    p*=0.0284
H0: δ2=δ4=0    minp=0.0142    p*=0.0284
H0: δ3=δ4=0    minp=0.0191    p*=0.0382

H0: δ1=0    p = 0.0121
H0: δ2=0    p = 0.0142
H0: δ3=0    p = 0.1986
H0: δ4=0    p = 0.0191


75

Shortcut For Holm’s Method

Let H(1),…,H(k) be the hypotheses corresponding to p(1) ≤ … ≤ p(k)

– If p(1) ≤ α/k, reject H(1) and continue, else stop and retain all H(1),…,H(k).

– If p(2) ≤ α/(k-1), reject H(2) and continue, else stop and retain all H(2),…,H(k).

– …

– If p(k) ≤ α, reject H(k)

76

Adjusted p-values for Closed Tests

The adjusted p-value for H0j is the maximum of all p-values over all relevant nodes

In the previous example,

pA(1)=0.0484,pA(2)=0.0484, pA(3)=0.0484, pA(4)=0.1986.

General formula for Holm: $p_{A(j)} = \max_{i \le j}\,(k-i+1)\,p_{(i)}$.

77

Worksheet For Holm’s Method

Holm-Bonferroni Worksheet

Ordered         Unadjusted    Critical     (k-i+1) *    Adjusted
test number     P-value       Value        p-value      p-value
1               0.0121        0.0125       0.0484       0.0484
2               0.0142        0.0166667    0.0426       0.0484
3               0.0191        0.025        0.0382       0.0484
4               0.1986        0.05         0.1986       0.1986
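The worksheet can be reproduced two ways: by brute-force closed testing with the Bonferroni MinP test at every node, and by the Holm shortcut. The sketch below does both for the four endpoint p-values used in the example; the function names are illustrative, and the brute-force closure assumes no logical constraints among the hypotheses (all 2^k − 1 intersections are distinct).

```python
from itertools import combinations

def closed_adjusted_pvalues(pvals, local_test):
    """Adjusted p-value for each hypothesis = max of the local-test p-values
    over all intersection hypotheses (subsets) that contain it."""
    k = len(pvals)
    adjusted = [0.0] * k
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            p_star = local_test([pvals[i] for i in subset])
            for i in subset:
                adjusted[i] = max(adjusted[i], p_star)
    return adjusted

def bonferroni_minp(ps):
    """Bonferroni MinP p-value for an intersection of len(ps) hypotheses."""
    return min(1.0, len(ps) * min(ps))

def holm_shortcut(pvals):
    """Holm adjusted p-values: running max of (k - i + 1) * p_(i)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    adjusted = [0.0] * k
    running_max = 0.0
    for rank, i in enumerate(order):          # rank = 0 for the smallest p-value
        running_max = max(running_max, min(1.0, (k - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

raw = [0.0121, 0.0142, 0.1986, 0.0191]        # endpoints 1..4
via_closure = closed_adjusted_pvalues(raw, bonferroni_minp)
via_shortcut = holm_shortcut(raw)
print([round(p, 4) for p in via_closure])
print([round(p, 4) for p in via_shortcut])
```

Both routes give the same adjusted p-values, listed here in endpoint order rather than in the sorted order used by the worksheet.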

78

Simes’ Test for Global Hypotheses

• Uses all p-values p1, p2, …, pm, not just the MinP
• Simes’ test rejects H01 ∩ H02 ∩ ... ∩ H0m if p(j) ≤ jα/m for at least one j.
• p-value for the joint test is p* = min over j of {(m/j) p(j)}
• p* is never larger than m × MinP
• Type I error at most α under independence or positive dependence of p-values
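A minimal sketch of the Simes joint p-value, compared with the Bonferroni MinP p-value on the four endpoint p-values from the closed testing example. The function name is illustrative.

```python
def simes_pvalue(pvals):
    """Simes p-value for the joint null: min over j of (m / j) * p_(j)."""
    m = len(pvals)
    ordered = sorted(pvals)
    return min(m / j * p for j, p in enumerate(ordered, start=1))

raw = [0.0121, 0.0142, 0.1986, 0.0191]
p_simes = simes_pvalue(raw)
p_bonf = len(raw) * min(raw)       # Bonferroni MinP p-value for the same null
print(round(p_simes, 4), round(p_bonf, 4))
```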

79

Rejection Regions

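[Figure: rejection regions for the Simes and Bonferroni tests in the unit square, with p1 on the horizontal axis and p2 on the vertical axis, each running from 0 to 1.]

P(Simes Reject) = 1 – (1

P(Bonferroni Reject ) = 1 – (1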

80

Hommel’s Method (Closed Simes)

H0: δ1=δ2=δ3=δ4=0    p*=0.0255

H0: δ1=δ2=δ3=0    p*=0.0213
H0: δ1=δ2=δ4=0    p*=0.0191
H0: δ1=δ3=δ4=0    p*=0.0287
H0: δ2=δ3=δ4=0    p*=0.0287

H0: δ1=δ2=0    p*=0.0142
H0: δ1=δ3=0    p*=0.0242
H0: δ1=δ4=0    p*=0.0191
H0: δ2=δ3=0    p*=0.0284
H0: δ2=δ4=0    p*=0.0191
H0: δ3=δ4=0    p*=0.0382

H0: δ1=0    p = 0.0121
H0: δ2=0    p = 0.0142
H0: δ3=0    p = 0.1986
H0: δ4=0    p = 0.0191


81

Adjusted P-values for Hommel’s Method

Again, take the maximum p-value over all hypotheses that imply the given one.

In the previous example, the Hommel adjusted p-values are pA(1)=0.0287, pA(2)=0.0287, pA(3)=0.0382, pA(4)=0.1986.

These adjusted p-values are never larger than the Holm step-down adjusted p-values.

82

Adjusted P-values for Hommel’s Method

They are maxima over relevant nodes

In example, Hommel adjusted p-values are pA(1)=0.0287, pA(2)=0.0287, pA(3)=0.0382, pA(4)=0.1986.

{Hommel adjusted p-value} ≤ {Holm adjusted p-value}
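The Hommel adjusted p-values can be obtained by brute-force closure with the Simes test at every node and compared with Holm (closure with Bonferroni MinP). This is a sketch for small k only, since it enumerates all 2^k − 1 intersections; it is not Hommel's efficient algorithm, and the function names are illustrative.

```python
from itertools import combinations

def closed_adjusted_pvalues(pvals, local_test):
    """Max of the local-test p-values over all subsets containing each hypothesis."""
    k = len(pvals)
    adjusted = [0.0] * k
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            p_star = local_test([pvals[i] for i in subset])
            for i in subset:
                adjusted[i] = max(adjusted[i], p_star)
    return adjusted

def simes(ps):
    """Simes p-value: min over j of (m / j) * p_(j)."""
    m = len(ps)
    return min(m / j * p for j, p in enumerate(sorted(ps), start=1))

def bonferroni_minp(ps):
    """Bonferroni MinP p-value: m * min(p), capped at 1."""
    return min(1.0, len(ps) * min(ps))

raw = [0.0121, 0.0142, 0.1986, 0.0191]          # endpoints 1..4
hommel = closed_adjusted_pvalues(raw, simes)
holm = closed_adjusted_pvalues(raw, bonferroni_minp)
for j, (a, b) in enumerate(zip(hommel, holm), start=1):
    print(f"endpoint {j}: Hommel {a:.5f}  Holm {b:.5f}")
```

The results are in endpoint order, so the endpoint with raw p = 0.1986 appears third here, whereas the slide lists adjusted p-values in sorted order.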

83

Hochberg’s Method

A conservative but simpler approximation to Hommel’s method

{Hommel adjusted p-value}

≤ {Hochberg adjusted p-value}

≤ {Holm adjusted p-value}

84

Hochberg’s Shortcut Method

– If p(k) ≤ α, reject all H(j) and stop, else retain H(k) and continue.

– If p(k-1) ≤ α/2, reject H(1),…,H(k-1) and stop, else retain H(k-1) and continue.

– …

– If p(1) ≤ α/k, reject H(1)

Adjusted p-values are $p_{A(j)} = \min_{i \ge j}\,(k-i+1)\,p_{(i)}$.

85

Worksheet for Hochberg’s Method

Hochberg Adjusted P-Value Worksheet

Ordered         Unadjusted    Critical     (k-i+1) *    Adjusted
test number     P-value       Value        p-value      p-value
1               0.0121        0.0125       0.0484       0.0382
2               0.0142        0.0166667    0.0426       0.0382
3               0.0191        0.025        0.0382       0.0382
4               0.1986        0.05         0.1986       0.1986
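The worksheet's last column is a running minimum taken from the largest p-value downward. A minimal sketch, with an illustrative function name:

```python
def hochberg_adjusted(pvals):
    """Hochberg adjusted p-values: min over i >= j of (k - i + 1) * p_(i)."""
    k = len(pvals)
    # Visit p-values from largest to smallest (step-up direction)
    order = sorted(range(k), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * k
    running_min = 1.0
    for step, i in enumerate(order, start=1):
        # The step-th largest p-value has multiplier k - i + 1 = step
        running_min = min(running_min, step * pvals[i])
        adjusted[i] = running_min
    return adjusted

raw = [0.0121, 0.0142, 0.1986, 0.0191]      # endpoints 1..4
hochberg = hochberg_adjusted(raw)
print([round(p, 4) for p in hochberg])
```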

86

Comparison of Adjusted P-Values

p-Values

                   Stepdown
Test    Raw        Bonferroni    Hochberg    Hommel
1       0.0121     0.0484        0.0382      0.0286
2       0.0142     0.0484        0.0382      0.0286
3       0.1986     0.1986        0.1986      0.1986
4       0.0191     0.0484        0.0382      0.0382

87

Fisher Combination Test for Independent p-Values

Reject H01 ∩ H02 ∩ ... ∩ H0m if

$-2 \sum_{i=1}^{m} \ln(p_i) > \chi^2(1-\alpha,\ 2m)$
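The Fisher combination p-value can be computed without a statistics library, because the chi-square survival function has a closed form when the degrees of freedom are even (here 2m). This is a sketch that assumes independent p-values, as the slide title requires; the function names are illustrative.

```python
from math import exp, log

def chi2_sf_even_df(x, df):
    """P(chi-square with even df > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total

def fisher_combination_pvalue(pvals):
    """p-value of Fisher's test of the joint null for independent p-values."""
    stat = -2.0 * sum(log(p) for p in pvals)
    return chi2_sf_even_df(stat, 2 * len(pvals))

# Two independent p-values of 0.05 combine to a joint p-value below 0.05
p_joint = fisher_combination_pvalue([0.05, 0.05])
print(p_joint)
```

With a single p-value the combination returns that p-value unchanged, which is a convenient sanity check.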

88

Example: Non-Overlapping Subgroup* p-values

The Multtest Procedure

p-Values

                   Stepdown                              Fisher
Test    Raw        Bonferroni    Hochberg    Hommel      Combination
1       0.0784     0.3918        0.1550      0.1550      0.0784
2       0.0480     0.2883        0.1550      0.1441      0.0480
3       0.0041     0.0325        0.0305      0.0285      0.0053
4       0.0794     0.3918        0.1550      0.1550      0.0794
5       0.0044     0.0325        0.0305      0.0305      0.0056
6       0.0873     0.3918        0.1550      0.1550      0.0873
7       0.1007     0.3918        0.1550      0.1550      0.1007
8       0.1550     0.3918        0.1550      0.1550      0.1550

*Non-overlapping is required by the independence assumption.

89

Power Comparison

Liptak test stat: $T = \sum_i \Phi^{-1}(p_i) = \sum_i Z_i$

90

Concluding Notes

• Closed testing more powerful than single-step (α/m rather than α/k).
• P-value based methods can be used whenever p-values are valid
• Dependence issues:
   – MinP (Holm) conservative
   – Simes (Hommel, Hochberg) less conservative, rarely anti-conservative
   – Fisher combination, Liptak require independence

91

Closed and Stepwise Testing Methods II: Fixed Sequences and Gatekeepers

Methods Covered:
• Fixed Sequences (hierarchical endpoints, dose response, non-inferiority → superiority)
• Gatekeepers (primary and secondary analyses)
• Multiple Gatekeepers (multiple endpoints & multiple doses)
• Intersection-Union tests*

* Doesn’t really belong in this section

92

Fixed Sequence Tests

• Pre-specify H1, H2, …, Hk, and test in this sequence, stopping as soon as you fail to reject.
• No α-adjustment is necessary for individual tests.
• Applications:
   – Dose response: High vs. Control, then Mid vs. Control, then Low vs. Control
   – Primary endpoint, then Secondary endpoint
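The fixed sequence rule is short enough to state as code. A minimal sketch; the function name and the example p-values are illustrative, not taken from the course data.

```python
def fixed_sequence(pvals_in_order, alpha=0.05):
    """Test hypotheses in the pre-specified order at the unadjusted level alpha,
    stopping at the first failure to reject. Returns one decision per hypothesis."""
    decisions = []
    still_testing = True
    for p in pvals_in_order:
        reject = still_testing and p <= alpha
        decisions.append(reject)
        if not reject:
            still_testing = False   # everything later in the sequence is retained
    return decisions

# Hypothetical p-values for High, Mid, Low vs. Control, then a secondary endpoint
decisions = fixed_sequence([0.010, 0.030, 0.200, 0.001])
print(decisions)
```

Note that the last hypothesis is retained despite its small p-value, because the sequence stopped at the third test.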

93

Fixed Sequence as a Closed Procedure

H123: Rej if p1 ≤ .05

H12: Rej if p1 ≤ .05    H13: Rej if p1 ≤ .05    H23: Rej if p2 ≤ .05

H1: Rej if p1 ≤ .05    H2: Rej if p2 ≤ .05    H3: Rej if p3 ≤ .05

• Rej H1 if p1 ≤ .05
• Rej H2 if p1 ≤ .05 and p2 ≤ .05
• Rej H3 if p1 ≤ .05 and p2 ≤ .05 and p3 ≤ .05

94

A Seemingly Reasonable But Incorrect Protocol

1. Test Dose 2 vs Pbo, and Dose 3 vs Pbo using the Bonferroni method (0.025 level).

2. Test Dose 1 vs Pbo at the unadjusted 0.05 level only if at least one of the first two tests is significant at the 0.025 level.

95

The problem: FWE can be as large as about 0.075

[Figure: Lower, Mean and Upper values plotted for Pbo, Low, Mid and High.]

Moral: Caution needed when there are multiple hypotheses at some point in the sequence.
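The inflation can be illustrated with a small calculation and simulation. This sketch assumes one particular configuration, which is not spelled out on the slide: the High dose (Dose 3) is so effective that it is always significant, while Mid (Dose 2) and Low (Dose 1) have no effect. It also treats the two null p-values as independent uniforms, which ignores the correlation induced by the shared placebo group.

```python
import random

def incorrect_protocol_fwe(n_rep=200000, seed=3):
    """Simulated FWE of the two-step protocol under the assumed configuration."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_rep):
        p_high = 0.0            # assumed: High dose always significant
        p_mid = rng.random()    # Mid dose: true null, p ~ Uniform(0, 1)
        p_low = rng.random()    # Low dose: true null, p ~ Uniform(0, 1)
        # Step 1: Mid and High vs. Pbo by Bonferroni (0.025 each)
        rej_mid = p_mid <= 0.025
        rej_high = p_high <= 0.025
        # Step 2: Low vs. Pbo at 0.05 only if step 1 produced a rejection
        rej_low = (rej_mid or rej_high) and p_low <= 0.05
        errors += (rej_mid or rej_low)      # any rejection of a true null
    return errors / n_rep

analytic = 1 - (1 - 0.025) * (1 - 0.05)    # exact value under these assumptions
simulated = incorrect_protocol_fwe()
print(round(analytic, 5), round(simulated, 4))
```

Under these assumptions the familywise error rate is well above the nominal 0.05, because the gate to the second step is always open while a true null is still being tested at 0.025 in the first step.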

96

Correcting the Incorrect Protocol: Use Closure

H0LMH: P23 < .05

H0LM: P12 < .05    H0LH: P13 < .05    H0MH: P23 < .05

H0L: P1 < .05    H0M: P2 < .05    H0H: P3 < .05

Where pij = 2min(pi,pj)

97

References – Fixed Sequence and Gatekeeper Tests

1. Bauer, P (1991) Multiple Testing in Clinical Trials, Statistics in Medicine, 10, 871-890.

2. O’Neill RT. (1997) Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Controlled Clinical Trials; 18:550–556.

3. D’Agostino RB. (2000) Controlling alpha in clinical trials: the case for secondary endpoints. Statistics in Medicine; 19:763–766.

4. Chi GYH. (1998) Multiple testings: multiple comparisons and multiple endpoints. Drug Information Journal 32:1347S–1362S.

5. Bauer P, Röhmel J, Maurer W, Hothorn L. (1998) Testing strategies in multi-dose experiments including active control. Statistics in Medicine; 17:2133 –2146.

6. Westfall, P.H. and Krishen, A. (2001). Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures, Journal of Statistical Planning and Inference 99, 25-40.

7. Chi, G. “Clinical Benefits, Decision Rules, and Multiple Inferences,” http://www.fda.gov/cder/Offices/Biostatistics/chi_1/sld001.htm

8. Dmitrienko, A, Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all effects to be significant. Stat Med. 22: 2387-2400.

9. Chen X, Luo X, Capizzi T. (2005) The application of enhanced parallel gatekeeping strategies. Stat Med. 24:1385-97.

10. Alex Dmitrienko, Geert Molenberghs, Christy Chuang-Stein, and Walter Offen (2005), Analysis of Clinical Trials Using SAS: A Practical Guide, SAS Press.

11. Wiens, B, and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family of hypotheses. J Biopharm Stat.15(6):929-42.

12. Dmitrienko, A., Wiens, B. and Westfall, P. (2006). Fallback Tests in Dose Response Clinical Trials, J Biopharm Stat, 16, 745-755.


98

Intersection-Union (IU) Tests

Union-Intersection (UI): Nulls are intersections, alternatives are unions.

$H_0: \{\delta_1 = 0 \text{ and } \delta_2 = 0\}$ vs. $H_1: \{\delta_1 \ne 0 \text{ or } \delta_2 \ne 0\}$

Intersection-Union (IU): Nulls are unions, alternatives are intersections.

$H_0: \{\delta_1 = 0 \text{ or } \delta_2 = 0\}$ vs. $H_1: \{\delta_1 \ne 0 \text{ and } \delta_2 \ne 0\}$

IU is NOT a closed procedure. It is just a single test of a different kind of null hypothesis.

99

Applications of I-U

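Bioequivalence: The “TOST” test (two one-sided tests, with equivalence margin $\delta_0$):

Test 1. $H_{01}: \delta \le -\delta_0$ vs. $H_{A1}: \delta > -\delta_0$

Test 2. $H_{02}: \delta \ge \delta_0$ vs. $H_{A2}: \delta < \delta_0$

Can test both at $\alpha = .05$, but must reject both.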

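Combination Therapy (the combination, with mean $\mu_{12}$, must beat each component, with means $\mu_1$ and $\mu_2$):

Test 1. $H_{01}: \mu_{12} \le \mu_1$ vs. $H_{A1}: \mu_{12} > \mu_1$

Test 2. $H_{02}: \mu_{12} \le \mu_2$ vs. $H_{A2}: \mu_{12} > \mu_2$

Can test both at $\alpha = .05$, but must reject both.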

100

Control of Type I Error for IU tests

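Suppose the null holds because $\delta_1 = 0$ (with $\delta_2 \ne 0$ allowed). Then

$$
\begin{aligned}
P(\text{Type I error}) &= P(\text{Reject } H_0) && (1)\\
&= P(p_1 \le .05 \text{ and } p_2 \le .05) && (2)\\
&\le \min\{P(p_1 \le .05),\, P(p_2 \le .05)\} && (3)\\
&= .05. && (4)
\end{aligned}
$$

Note: The inequality at (3) becomes an approximate equality when the test giving $p_2$ is extremely noncentral.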
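In practice the IU decision reduces to comparing the largest component p-value with the unadjusted level. A minimal Python sketch follows; the function name and the example p-values are illustrative.

```python
def intersection_union_test(p_values, alpha=0.05):
    """Reject the union null only if every component test rejects at alpha."""
    p_iu = max(p_values)          # the IU p-value is the largest component p-value
    return p_iu, p_iu <= alpha


# Hypothetical one-sided p-values from the two component tests
print(intersection_union_test([0.03, 0.04]))   # (0.04, True)
print(intersection_union_test([0.001, 0.20]))  # (0.2, False)
```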

101

Concluding Notes: Fixed Sequences and Gatekeepers

• Many times, no adjustment is necessary at all!

• Other times you can gain power by specifying gatekeeping sequences

• However, you must clearly state the method and follow the rules

• There are many “incorrect” no adjustment methods - use caution

102

Closed and Stepwise Testing Methods III: Methods that Use Logical Constraints and Correlations

Methods                          Application
Lehmacher et al.                 Multiple endpoints
Westfall-Tobias-Shaffer-Royen    General contrasts

103

Lehmacher et al. Method

• Use O’Brien test at each node (incorporates correlations)

• Do closed testing

Note: Possibly no adjustment whatsoever; possibly big adjustment

104

Calculations for Lehmacher’s Method

proc standard data=research.multend1 mean=0 std=1 out=stdzd;
   var Endpoint1-Endpoint4;
run;

data combine;
   set stdzd;
   H1234 = Endpoint1+Endpoint2+Endpoint3+Endpoint4;
   H123  = Endpoint1+Endpoint2+Endpoint3;
   H124  = Endpoint1+Endpoint2+Endpoint4;
   H134  = Endpoint1+Endpoint3+Endpoint4;
   H234  = Endpoint2+Endpoint3+Endpoint4;
   H12   = Endpoint1+Endpoint2;
   H13   = Endpoint1+Endpoint3;
   H14   = Endpoint1+Endpoint4;
   H23   = Endpoint2+Endpoint3;
   H24   = Endpoint2+Endpoint4;
   H34   = Endpoint3+Endpoint4;
   H1    = Endpoint1;
   H2    = Endpoint2;
   H3    = Endpoint3;
   H4    = Endpoint4;
run;

proc ttest;
   class treatment;
   var H1234 H123 H124 H134 H234 H12 H13 H14 H23 H24 H34 H1 H2 H3 H4;
   ods output ttests=ttests;
run;

105

Output For Lehmacher’s Method

Obs  Variable  Method  Variances  tValue  DF   Probt
  1  H1234     Pooled  Equal      2.69    109  0.0082
  3  H123      Pooled  Equal      2.59    109  0.0108
  5  H124      Pooled  Equal      3.03    109  0.0031
  7  H134      Pooled  Equal      2.36    109  0.0201
  9  H234      Pooled  Equal      2.51    109  0.0136
 11  H12       Pooled  Equal      3.03    109  0.0030
 13  H13       Pooled  Equal      2.12    109  0.0365
 15  H14       Pooled  Equal      2.68    109  0.0085
 17  H23       Pooled  Equal      2.22    109  0.0287
 19  H24       Pooled  Equal      2.88    109  0.0047
 21  H34       Pooled  Equal      2.03    109  0.0450
 23  H1        Pooled  Equal      2.55    109  0.0121
 25  H2        Pooled  Equal      2.49    109  0.0142
 27  H3        Pooled  Equal      1.29    109  0.1986
 29  H4        Pooled  Equal      2.38    109  0.0191

$p_{A1}$ = max(0.0121, 0.0085, 0.0365, 0.0030, 0.0201, 0.0031, 0.0108, 0.0082) = 0.0365
$p_{A2}$ = max(0.0142, 0.0047, 0.0287, 0.0030, 0.0136, 0.0031, 0.0108, 0.0082) = 0.0287
$p_{A3}$ = max(0.1986, 0.0450, 0.0287, 0.0365, 0.0136, 0.0201, 0.0108, 0.0082) = 0.1986
$p_{A4}$ = max(0.0191, 0.0450, 0.0047, 0.0085, 0.0136, 0.0201, 0.0031, 0.0082) = 0.0450
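The four adjusted p-values can be reproduced from the PROC TTEST output by taking, for each endpoint, the maximum p-value over all nodes whose label contains that endpoint. The Python sketch below hard-codes the Probt column from the output above; the variable and function names are illustrative.

```python
# Node p-values keyed by the endpoints in each intersection (Probt column above).
node_p = {
    (1, 2, 3, 4): 0.0082,
    (1, 2, 3): 0.0108, (1, 2, 4): 0.0031, (1, 3, 4): 0.0201, (2, 3, 4): 0.0136,
    (1, 2): 0.0030, (1, 3): 0.0365, (1, 4): 0.0085,
    (2, 3): 0.0287, (2, 4): 0.0047, (3, 4): 0.0450,
    (1,): 0.0121, (2,): 0.0142, (3,): 0.1986, (4,): 0.0191,
}


def closed_adjusted_p(node_p, endpoint):
    """Largest p-value among the closure nodes that include this endpoint."""
    return max(p for node, p in node_p.items() if endpoint in node)


adjusted = {i: closed_adjusted_p(node_p, i) for i in (1, 2, 3, 4)}
print(adjusted)  # {1: 0.0365, 2: 0.0287, 3: 0.1986, 4: 0.045}
```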

106

Free and Restricted Combinations

If truth of some null hypotheses logically forces other nulls to be true, the hypotheses are restricted.

Examples
• Multiple Endpoints, one test per endpoint - free
• All Pairwise Comparisons - restricted

107

Pairwise Comparisons, 3 Groups

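$H_0: \mu_1 = \mu_2 = \mu_3$

$H_0: \mu_1 = \mu_2, \mu_1 = \mu_3$; $H_0: \mu_1 = \mu_2, \mu_2 = \mu_3$; $H_0: \mu_1 = \mu_3, \mu_2 = \mu_3$ (each identical to $\mu_1 = \mu_2 = \mu_3$)

$H_0: \mu_1 = \mu_2$; $H_0: \mu_1 = \mu_3$; $H_0: \mu_2 = \mu_3$

Note: The entire middle layer is not needed! Fisher protected LSD is valid!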

108

Pairwise Comparisons, 4 Groups

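$\mu_1=\mu_2=\mu_3=\mu_4$

$\mu_1=\mu_2=\mu_3$; $\mu_1=\mu_2=\mu_4$; $\mu_1=\mu_3=\mu_4$; $\mu_2=\mu_3=\mu_4$; $\{\mu_1=\mu_2,\ \mu_3=\mu_4\}$; $\{\mu_1=\mu_3,\ \mu_2=\mu_4\}$; $\{\mu_1=\mu_4,\ \mu_2=\mu_3\}$

$\mu_1=\mu_2$; $\mu_1=\mu_3$; $\mu_1=\mu_4$; $\mu_2=\mu_3$; $\mu_2=\mu_4$; $\mu_3=\mu_4$

Note: Logical implications mean that there are only 14 nodes, not $2^6 - 1 = 63$ nodes. Also, Fisher protected LSD is not valid.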
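The node counts can be verified by brute force: every non-empty set of pairwise equalities implies a partition of the groups, and distinct partitions are distinct nodes. The Python sketch below is an illustrative check; the function names are not from the course materials.

```python
from itertools import combinations


def implied_partition(equalities, n_groups):
    """Partition of the groups forced by a set of pairwise equalities."""
    parent = list(range(n_groups))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in equalities:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a      # merge the two equality classes
    blocks = {}
    for g in range(n_groups):
        blocks.setdefault(find(g), []).append(g)
    return frozenset(frozenset(b) for b in blocks.values())


def count_distinct_nodes(n_groups):
    """Number of distinct intersection hypotheses among all pairwise nulls."""
    pairs = list(combinations(range(n_groups), 2))
    nodes = set()
    for size in range(1, len(pairs) + 1):
        for subset in combinations(pairs, size):
            nodes.add(implied_partition(subset, n_groups))
    return len(nodes)


print(count_distinct_nodes(3))  # 4
print(count_distinct_nodes(4))  # 14
```

With 3 groups only the global null and the three pairwise nulls remain, which is why the middle layer disappears.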

109

Restricted Combinations Multipliers (Shaffer* Method 1; Modified Holm)

*Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. JASA 81, 826—831.

110

Shaffer’s (1) Adjusted p-values

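Number of Treatments, t (rows: j; "-" = no entry)

 j    3   4   5   6   7   8   9
 1    3   6  10  15  21  28  36
 2    1   3   6  10  15  21  28
 3    1   3   6  10  15  21  28
 4    -   3   6  10  15  21  28
 5    -   2   6  10  15  21  28
 6    -   1   4  10  15  21  28
 7    -   -   4   7  15  21  28
 8    -   -   3   7  11  21  28
 9    -   -   2   7  11  16  28
10    -   -   1   6  11  16  22
11    -   -   -   4  11  16  22
12    -   -   -   4  10  16  22
13    -   -   -   3   9  16  22
14    -   -   -   2   7  15  22
15    -   -   -   1   7  13  22
16    -   -   -   -   6  13  21
17    -   -   -   -   5  12  18
18    -   -   -   -   4  11  18
19    -   -   -   -   3  10  18
20    -   -   -   -   2   9  16
21    -   -   -   -   1   8  16
22    -   -   -   -   -   7  15
23    -   -   -   -   -   6  13
24    -   -   -   -   -   5  13
25    -   -   -   -   -   4  12
26    -   -   -   -   -   3  11
27    -   -   -   -   -   2  10
28    -   -   -   -   -   1   9
29    -   -   -   -   -   -   8
30    -   -   -   -   -   -   7
31    -   -   -   -   -   -   6
32    -   -   -   -   -   -   5
33    -   -   -   -   -   -   4
34    -   -   -   -   -   -   3
35    -   -   -   -   -   -   2
36    -   -   -   -   -   -   1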

Raw P     Shaffer Multiplier   Critical Value   Adjusted p-Value
0.3021    1                    0.05             0.3998*
0.0435    3                    0.016667         0.1305
0.0002    6                    0.008333         0.0012
0.1999    2                    0.025            0.3998
0.0109    3                    0.016667         0.0327
0.0088    3                    0.016667         0.0264

* Monotonicity enforced
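The adjusted p-values in this example follow from sorting the raw p-values, multiplying the j-th smallest by the j-th Shaffer multiplier, and enforcing monotonicity with a running maximum. The Python sketch below uses the raw p-values shown above and the t = 4 multiplier sequence (6, 3, 3, 3, 2, 1); the function name and the cap at 1 are illustrative additions.

```python
def shaffer_adjusted(raw_p, multipliers):
    """Step-down adjusted p-values with a given multiplier sequence."""
    order = sorted(range(len(raw_p)), key=lambda i: raw_p[i])
    adjusted = [0.0] * len(raw_p)
    running = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, multipliers[rank] * raw_p[idx])
        running = max(running, candidate)   # enforce monotonicity
        adjusted[idx] = running
    return adjusted


raw_p = [0.3021, 0.0435, 0.0002, 0.1999, 0.0109, 0.0088]
s1_four_groups = [6, 3, 3, 3, 2, 1]   # multipliers for t = 4, j = 1..6
print([round(p, 4) for p in shaffer_adjusted(raw_p, s1_four_groups)])
# [0.3998, 0.1305, 0.0012, 0.3998, 0.0327, 0.0264]
```

The first entry shows the monotonicity step: 1 × 0.3021 is raised to 0.3998, the adjusted value of the next-smaller p-value.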

111

Westfall/Tobias/Shaffer/Royen* Method

• Uses actual distribution of MinP instead of conservative Bonferroni approximation

• Closed testing incorporating logical constraints

• Hard-coded in PROC GLIMMIX

• Allows arbitrary linear functions

*Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method, Journal of the American Statistical Association 102: 487-494.

112

Application of Truncated Closed MinP to Subgroup Analysis

Compare Treatment with control as follows:

• Overall
• In the Older Patients subgroup
• In the Younger Patients subgroup
• In patients with better initial health subgroup
• In patients with poorer initial health subgroup
• In each of the four (old/young)x(better/poorer) subgroups

• 9 tests overall (but better: 1 gatekeeper + 8 follow-up)

113

Analysis File

ods output estimates=estimates_logicaltests;
proc glimmix data=research.respiratory;
   class Treatment AgeGroup InitHealth;
   model score = Treatment AgeGroup InitHealth
                 Treatment*AgeGroup Treatment*InitHealth AgeGroup*InitHealth;
   estimate
      "Overall"        treatment 4 -4 treatment*Agegroup 2 2 -2 -2 treatment*InitHealth 2 2 -2 -2 (divisor=4),
      "Older"          treatment 2 -2 treatment*Agegroup 2 0 -2 0  treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "Younger"        treatment 2 -2 treatment*Agegroup 0 2 0 -2  treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "GoodInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1 treatment*InitHealth 2 0 -2 0  (divisor=2),
      "PoorInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1 treatment*InitHealth 0 2 0 -2  (divisor=2),
      "OldGood"        treatment 1 -1 treatment*Agegroup 1 0 -1 0  treatment*InitHealth 1 0 -1 0,
      "OldPoor"        treatment 1 -1 treatment*Agegroup 1 0 -1 0  treatment*InitHealth 0 1 0 -1,
      "YoungGood"      treatment 1 -1 treatment*Agegroup 0 1 0 -1  treatment*InitHealth 1 0 -1 0,
      "YoungPoor"      treatment 1 -1 treatment*Agegroup 0 1 0 -1  treatment*InitHealth 0 1 0 -1
      / adjust=simulate(nsamp=10000000 report seed=12321) upper stepdown(type=logical report);
run;

proc print data=estimates_logicaltests noobs;
   title "Subgroup Analysis Results – Truncated Closure";
   var label estimate Stderr tvalue probt Adjp;
run;

114

Results – Truncated Closure Subgroup Analysis Results

Label            Estimate  StdErr  tValue  Probt   adjp_logical  adjp_interval
Overall          0.7075    0.1956  3.62    0.0002  0.0011        0.0015
Older            0.9952    0.2673  3.72    0.0002  0.0011        0.0011
Younger          0.4197    0.2847  1.47    0.0717  0.1049        0.2605
GoodInitHealth   0.5871    0.2878  2.04    0.0219  0.0432        0.0984
PoorInitHealth   0.8279    0.2644  3.13    0.0011  0.0023        0.0068
OldGood          0.8748    0.3387  2.58    0.0056  0.0124        0.0295
OldPoor          1.1157    0.3231  3.45    0.0004  0.0011        0.0026
YoungGood        0.2993    0.3562  0.84    0.2014  0.2014        0.5494
YoungPoor        0.5401    0.3338  1.62    0.0544  0.1049        0.2091

The adjusted p-values for the stepdown tests are mathematically never larger than those of the simultaneous interval-based tests.

115

Example: Stepwise Pairwise vs. Control Testing

Teratology data set

• Observations are litters
• Response variable = litter weight
• Treatments: 0, 5, 50, 500.
• Covariates: Litter size, Gestation time

116

Analysis File

proc glimmix data=research.litter;
   class dose;
   model weight = dose gesttime number;
   estimate "5 vs 0"   dose -1 1 0 0,
            "50 vs 0"  dose -1 0 1 0,
            "500 vs 0" dose -1 0 0 1
            / adjust=simulate(nsamp=10000000 report) stepdown(type=logical);
run; quit;

117

Results

Estimates with Simulated Adjustment

Label      Estimate   Standard Error   DF   t Value   Pr > |t|   Adj P
5 vs 0     -3.3524    1.2908           68   -2.60     0.0115     0.0316
50 vs 0    -2.2909    1.3384           68   -1.71     0.0915     0.0915
500 vs 0   -2.6752    1.3343           68   -2.00     0.0490     0.0907

Note: 50-0 and 500-0 not significant at .10 with regular Dunnett

118

Concluding Notes: More power is available when combinations

are restricted. Power of closed tests can be improved using

correlation and other distributional characteristics

119

Nonparametric Multiple Testing Methods

Overview: Use nonparametric tests at each node of the closure tree

• Bootstrap tests
• Rank-based tests
• Tests for binary data

120

Bootstrap MinP Test (Semi-Parametric Test)

The composite hypothesis $H_1 \cap H_2 \cap \dots \cap H_k$ may be tested using the p-value

$p^* = P(\mathrm{MinP} \le \mathrm{minp} \mid H_1 \cap H_2 \cap \dots \cap H_k)$

Westfall and Young (1993) show
• how to obtain $p^*$ by bootstrapping the residuals in a multivariate regression model;
• how to obtain all $p^*$’s in the closure tree efficiently.

121

Multivariate Regression Model (Next Five slides are from Westfall and Young, 1993)

122

Hypotheses and Test Statistics

123

Joint Distribution of the Test Statistics

124

Testing Subset Intersection Hypotheses Using the Extreme Pivotals

125

Exact Calculation of pK

Bootstrap Approximation:

126

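Bootstrap Tests (PROC MULTTEST)

$H_0: \delta_1=\delta_2=\delta_3=\delta_4=0$: min p = .0121, p* = .0379

$H_0: \delta_1=\delta_2=\delta_3=0$: min p = .0121, p* < .0379
$H_0: \delta_1=\delta_2=\delta_4=0$: min p = .0121, p* < .0379
$H_0: \delta_1=\delta_3=\delta_4=0$: min p = .0121, p* < .0379
$H_0: \delta_2=\delta_3=\delta_4=0$: min p = .0142, p* = .0351

$H_0: \delta_1=\delta_2=0$: min p = .0121, p* < .0379
$H_0: \delta_1=\delta_3=0$: min p = .0121, p* < .0379
$H_0: \delta_1=\delta_4=0$: min p = .0121, p* < .0379
$H_0: \delta_2=\delta_3=0$: min p = .0142, p* < .0351
$H_0: \delta_2=\delta_4=0$: min p = .0142, p* < .0351
$H_0: \delta_3=\delta_4=0$: min p = .0191, p* = .0355

$H_0: \delta_1=0$: p = 0.0121, p* < .0379
$H_0: \delta_2=0$: p = 0.0142, p* < .0351
$H_0: \delta_3=0$: p = 0.1986, p* = .1991
$H_0: \delta_4=0$: p = 0.0191, p* < .0355

$p^* = P(\mathrm{MinP} \le \min p \mid H_0)$ (computed using bootstrap resampling)
(Recall, for Bonferroni, $p^* = k(\mathrm{MinP})$)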

127

Permutation Tests for Composite Hypotheses H0K

Joint p-value = proportion of the $n!/(n_T!\,n_C!)$ permutations for which $\min_{i \in K} P_i^* \le \min_{i \in K} p_i$.

128

Problem; Simplification

Problem: There are $2^k - 1$ subsets $K$ to be tested

This might take a while...

Simplification: You need only test $k$ of the $2^k - 1$ subsets!

Why? Because

$P(\min_{i \in K} P_i^* \le c) \le P(\min_{i \in K'} P_i^* \le c)$ when $K \subset K'$.

Significance for most lower order subsets is determined by significance of higher order subsets.

129

MULTTEST PROCEDURE

Tests only the needed subsets ($k$, not $2^k - 1$).

Samples from the permutation distribution.

Only one sample is needed, not $k$ distinct samples, if the joint distribution of minP is identical under $H_K$ and $H_S$.

(Called the “subset pivotality” condition by Westfall and Young, 1993, valid under location shift and other models)

130

Great Savings are Possible with Exact Permutation Tests!

Why?

Suppose you test $H_{12 \ldots k}$ using MinP. The joint p-value is

$p^* = P(\mathrm{MinP} \le \mathrm{minp}) \le P(P_1 \le \mathrm{minp}) + P(P_2 \le \mathrm{minp}) + \dots + P(P_k \le \mathrm{minp})$

Many summands can be zero, others much less than minp.

131

Variable  Contrast  Raw      Stepdown Bonferroni  Stepdown Permutation
ae1       t vs c    0.0008   0.0025               0.0020
ae2       t vs c    0.6955   1.0000               1.0000
ae3       t vs c    0.5000   1.0000               1.0000
ae4       t vs c    0.7525   1.0000               1.0000
ae5       t vs c    0.2213   1.0000               0.6274
ae6       t vs c    0.0601   0.3321               0.2608
ae7       t vs c    0.8165   1.0000               1.0000
ae8       t vs c    0.0293   0.1587               0.1328
ae9       t vs c    0.9399   1.0000               1.0000
ae10      t vs c    0.2484   1.0000               0.9273
ae11      t vs c    1.0000   1.0000               1.0000
ae12      t vs c    1.0000   1.0000               1.0000
ae13      t vs c    1.0000   1.0000               1.0000
ae14      t vs c    1.0000   1.0000               1.0000
ae15      t vs c    0.2484   1.0000               0.9273
ae16      t vs c    0.7516   1.0000               1.0000
ae17      t vs c    1.0000   1.0000               1.0000
ae18      t vs c    1.0000   1.0000               1.0000
ae19      t vs c    1.0000   1.0000               1.0000
ae20      t vs c    0.5000   1.0000               1.0000
ae21      t vs c    0.7516   1.0000               1.0000
ae22      t vs c    1.0000   1.0000               1.0000
ae23      t vs c    0.5000   1.0000               1.0000
ae24      t vs c    1.0000   1.0000               1.0000
ae25      t vs c    1.0000   1.0000               1.0000
ae26      t vs c    1.0000   1.0000               1.0000
ae27      t vs c    1.0000   1.0000               1.0000
ae28      t vs c    0.4344   1.0000               0.9400

Multiple Binary Adverse Events

132

Example: Genetic Associations

Phenotype: 0/1 (diseased or not).

Sample n1 from diseased, n2 from not diseased.

Compare 100’s of genotype frequencies (using dominant and recessive codings) for diseased and non-diseased using multiple Fisher exact tests.

133

PROC MULTTEST Code

proc multtest data=research.gen stepperm n=20000 out=pval hommel fdr;
   class y;
   test fisher(d1-d100 r1-r100);
   contrast "dis v nondis" -1 1;
run;

proc sort data=pval;
   by raw_p;
run;

proc print data=pval;
   var _var_ raw_p stppermp hom_p fdr_p;
   where raw_p < .05;
run;

134

Results from PROC MULTTEST

Obs  _var_  raw_p     stppermp  hom_p    fdr_p
1    r100   0.000000  0.0000    0.00000  0.00000
2    r30    0.000000  0.0000    0.00004  0.00002
3    d78    0.016130  0.7955    1.00000  0.57465
4    r55    0.018157  0.8220    1.00000  0.57465
5    d62    0.019220  0.8480    1.00000  0.57465
6    r64    0.019220  0.8480    1.00000  0.57465
7    r37    0.020113  0.8520    1.00000  0.57465
8    r33    0.040043  0.9860    1.00000  1.00000

135

Application - Gene Expression

Group 1: Acute Myeloid Leukemia (AML), n1=11
Group 2: Acute Lymphoblastic Leukemia (ALL), n2=27

Data:

OBS  TYPE  G1  G2  G3  …  G7000
1    AML   (Gene expression levels)
2    AML
…    …
11   AML
12   ALL
…    …
38   ALL

136

PROC MULTTEST code for exact* closed testing

proc multtest data=research.leuk noprint out=adjp holm fdr stepperm n=1000;
   class type;   /* AML or ALL */
   test mean(gene1-gene7123);
   contrast 'AML vs ALL' -1 1;
run;

proc sort data=adjp(where=(raw_p le .0005));
   by raw_p;
run;

proc print;
   var _var_ raw_p stpbon_p fdr_p stppermp;
run;

* modulo Monte Carlo error

137

PROC MULTTEST Output

(1 hour on 2.8 GHz Xeon for 200,000 samples)

Variable   Raw p    Bonferroni p  Permutation p  |  Variable   Raw p    Bonferroni p  Permutation p
GENE3320   1.4E-10  0.0000        0.0001         |  GENE2043   1.3E-06  0.0090        0.0095
GENE4847   2.4E-10  0.0000        0.0001         |  GENE2759   1.3E-06  0.0093        0.0097
GENE2020   6.6E-10  0.0000        0.0001         |  GENE6803   1.4E-06  0.0102        0.0104
GENE1745   1.0E-08  0.0001        0.0002         |  GENE1674   1.5E-06  0.0105        0.0106
GENE5039   1.0E-08  0.0001        0.0002         |  GENE2402   1.5E-06  0.0108        0.0109
GENE1834   1.5E-08  0.0001        0.0003         |  GENE2186   1.7E-06  0.0118        0.0118
GENE461    3.6E-08  0.0003        0.0005         |  GENE6376   2.1E-06  0.0149        0.0142
GENE4196   6.2E-08  0.0004        0.0007         |  GENE3605   2.6E-06  0.0181        0.0169
GENE3847   7.2E-08  0.0005        0.0008         |  GENE6806   2.6E-06  0.0184        0.0170
GENE2288   8.9E-08  0.0006        0.0010         |  GENE1829   2.7E-06  0.0194        0.0177
GENE1249   1.7E-07  0.0012        0.0017         |  GENE6797   3.0E-06  0.0214        0.0194
GENE6201   1.8E-07  0.0013        0.0017         |  GENE6677   3.4E-06  0.0244        0.0216
GENE2242   2.0E-07  0.0014        0.0019         |  GENE4052   3.7E-06  0.0263        0.0231
GENE3258   2.1E-07  0.0015        0.0020         |  GENE1394   4.9E-06  0.0350        0.0290
GENE1882   3.2E-07  0.0023        0.0029         |  GENE6405   5.4E-06  0.0380        0.0311
GENE2111   3.7E-07  0.0026        0.0033         |  GENE248    6.4E-06  0.0453        0.0359
GENE2121   5.8E-07  0.0041        0.0048         |  GENE2267   6.5E-06  0.0460        0.0364
GENE6200   6.2E-07  0.0044        0.0051         |  GENE6041   7.8E-06  0.0553        0.0429
GENE6373   8.2E-07  0.0058        0.0065         |  GENE6005   8.0E-06  0.0569        0.0439
GENE6539   1.1E-06  0.0080        0.0086         |  GENE5772   9.0E-06  0.0638        0.0480

138

Subset Pivotality, PROC MULTTEST

MULTTEST requires the “subset pivotality” condition, which identifies the cases where resampling under the global null is valid.

Valid cases:
• Multivariate Regression Model (location-shift).
• Multivariate permutation multiple comparisons, one test per variable, assuming model with exchangeable subsets.

Not valid with:
• Permutation multiple comparisons, within a variable, with three or more groups.
• Heteroscedasticity.

Closed testing “by hand” works regardless.

139

Summary: Nonparametric Closed Tests

• Nonparametric closed tests are simple, in principle.

• Robustness gains and power advantages are possible.

140

Further Topics: More Complex Situations for FWE Control

• Heteroscedasticity
• Repeated Measures
• Large Sample Methods

141

Heteroscedasticity in MCPs

Extreme Example:

data het;
   do g = 1 to 5;
      do rep = 1 to 10;
         input y @@;
         output;
      end;
   end;
datalines;
.01 .02 .08 .03 .04 .01 .08 .01 .01 .02
.11 .12 .13 .08 .03 .09 .11 .11 .13 .14
.21 .22 .23 .28 .23 .29 .21 .21 .23 .25
42 76 . . . . . . . .
45 23 56 . . . . . . .
;

proc glm;
   class g;
   model y = g;
   lsmeans g / adjust=tukey pdiff;
run; quit;

142

Least Squares Means for effect g
Pr > |t| for H0: LSMean(i)=LSMean(j)
Adjustment for Multiple Comparisons: Tukey-Kramer

i/j   1        2        3        4        5
1     -        1.0000   1.0000   <.0001   <.0001
2     1.0000   -        1.0000   <.0001   <.0001
3     1.0000   1.0000   -        <.0001   <.0001
4     <.0001   <.0001   <.0001   -        0.0290
5     <.0001   <.0001   <.0001   0.0290   -

Level of g   N    Mean (y)      Std Dev (y)
1            10   0.0310000     0.0276687
2            10   0.1050000     0.0320590
3            10   0.2360000     0.0287518
4            2    59.0000000    24.0416306
5            3    41.3333333    16.8027775

RMSE = 6.17

Bad Results from Heteroscedastic Data

143

proc glimmix data=het;
   if (g > 3) then y2 = y/20; else y2 = y;   /* overcomes scaling problem */
   class g;
   model y2 = g / noint ddfm=satterth;
   random _residual_ / group=g;
   estimate '1 -2' g 1 -1  0   0   0,
            '1 -3' g 1  0 -1   0   0,
            '1 -4' g 1  0  0 -20   0,
            '1 -5' g 1  0  0   0 -20,   /* the slide had 1 -1 0 0 -20; the printed '1 -5' estimate of -41.4073 reflects that version */
            '2 -3' g 0  1 -1   0   0,
            '2 -4' g 0  1  0 -20   0,
            '2 -5' g 0  1  0   0 -20,
            '3 -4' g 0  0  1 -20   0,
            '3 -5' g 0  0  1   0 -20,
            '4 -5' g 0  0  0  20 -20
            / adjust=simulate(nsamp=1000000) stepdown(type=logical) adjdfe=row;
run;

Approximate Solution for Heteroscedasticity Problem

144


Label   Estimate   Standard Error   DF      t Value   Pr > |t|   Adj P
1 -2    -0.07400   0.01339          17.62   -5.53     <.0001     0.0242
1 -3    -0.2050    0.01262          17.97   -16.25    <.0001     0.0011
1 -4    -58.9690   17.0000          1       -3.47     0.1787     0.2474
1 -5    -41.4073   9.7011           2       -4.27     0.0507     0.1374
2 -3    -0.1310    0.01362          17.79   -9.62     <.0001     0.0198
2 -4    -58.8950   17.0000          1       -3.46     0.1789     0.2474
2 -5    -41.2283   9.7011           2       -4.25     0.0512     0.1374
3 -4    -58.7640   17.0000          1       -3.46     0.1793     0.2474
3 -5    -41.0973   9.7011           2       -4.24     0.0515     0.1374
4 -5    17.6667    19.5732          1.669   0.90      0.4778     0.4778

Heteroscedastic Results

Notes:
• Approximation 1: df’s
• Approximation 2: Covariance matrix involving all comparisons is approximate
• 1, 2, 3 different, 4-5 not. (sensible)

145

Repeated Measures and Multiple Comparisons

• Usually considered quite complicated (wave hands, use Bonferroni)

• PROC GLIMMIX provides a viable solution

• The method is approximate because of its df approximation, and because it treats estimated variance ratios as known.

146

Multiple Comparisons with Mixed Model

data Halothane;
   do Dog = 1 to 19;
      do Treatment = 'HA','LA','HP','LP';
         input Rate @@;
         output;
      end;
   end;
datalines;
426 609 556 600 253 236 392 395 359 433 349 357
432 431 522 600 405 426 513 513 324 438 507 539
310 312 410 456 326 326 350 504 375 447 547 548
286 286 403 422 349 382 473 497 429 410 488 547
348 377 447 514 412 473 472 446 347 326 455 468
434 458 637 524 364 367 432 469 420 395 508 531
397 556 645 625
;

Crossover study: Dog heart rates

H, L = CO2 High/Low
A, P = Halothane absent/present
Source: Johnson and Wichern, Applied Multivariate Statistical Analysis, 5th ed, Prentice Hall

147

GLIMMIX code for analyzing all pairwise comparisons, main effects, and interactions simultaneously

proc glimmix data=halothane order=data;
   class treatment dog;
   model rate = treatment / ddfm=satterth;
   random treatment / subject=dog type=chol v=1 vcorr=1;
   estimate 'HA - LA'     treatment 1 -1  0  0,
            'HA - HP'     treatment 1  0 -1  0,
            'HA - LP'     treatment 1  0  0 -1,
            'LA - HP'     treatment 0  1 -1  0,
            'LA - LP'     treatment 0  1  0 -1,
            'HP - LP'     treatment 0  0  1 -1,
            'Co2 '        treatment 1 -1  1 -1 (divisor=2),
            'Halothane'   treatment 1  1 -1 -1 (divisor=2),
            'Interaction' treatment 1 -1 -1  1
            / adjust=simulate(nsamp=1000000) stepdown(type=logical) adjdfe=row;
run;

148



Label         Estimate   Standard Error   DF      t Value   Pr > |t|   Adj P
HA - LA       -36.4211   13.8522          17.63   -2.63     0.0172     0.0172
HA - HP       -111.05    14.1127          14.72   -7.87     <.0001     <.0001
HA - LP       -134.68    12.7899          14.66   -10.53    <.0001     <.0001
LA - HP       -74.6316   14.8794          17.94   -5.02     <.0001     0.0002
LA - LP       -98.2632   15.7467          17.99   -6.24     <.0001     <.0001
HP - LP       -23.6316   11.9884          16.22   -1.97     0.0660     0.0660
Co2           -30.0263   8.2683           17.82   -3.63     0.0019     0.0059
Halothane     -104.66    11.1412          15.27   -9.39     <.0001     <.0001
Interaction   -12.7895   19.9438          17.98   -0.64     0.5294     0.5294

Results

149

Cure Rates Example: Multiple Comparisons of Odds

            Diagnosis
Treatment   Complicated   Uncomplicated
A           73.6%         88.9%
B           90.2%         91.5%
C           59.6%         85.0%

Questions: (1) Multiple comparisons of cure rates for the Treatments (3 comparisons)(2) Comparison of cure rates for Complicated vs Uncomplicated Diagnosis.

150

Method

• Use the estimated parameter vector and associated estimate of covariance matrix from PROC GLIMMIX

• Treat the estimated (asymptotic) covariance matrix as known

• Simulate critical values and p-values (MinP-based) from the multivariate normal distribution instead of the Multivariate T distribution

• Controls FWE asymptotically under correct logit model

151

Results



Label         Estimate   Standard Error   DF      t Value   Pr > |t|   Adj P
A-B           -0.9760    0.3311           Infty   -2.95     0.0032     0.0032
A-C           0.5847     0.2641           Infty   2.21      0.0268     0.0268
B-C           1.5608     0.3160           Infty   4.94      <.0001     <.0001
Comp-Uncomp   -0.9616    0.2998           Infty   -3.21     0.0013     0.0027

152

Summary

• Classic, FWE-controlling MCPs that incorporate alternative covariance structures and non-normal distributions are easy using PROC GLIMMIX.

• However, be aware of approximations:
   – Plug-in variance/covariance estimates
   – df

153

Further Topics: False Discovery Rate

• FDR = E(proportion of rejections that are incorrect)

• Let R = total # of rejections
• Let V = # of erroneous rejections

• FDR = E(V/R) (0/0 defined as 0).

• FWE = P(V>0)

154

Example

30 independent tests:
• 20 null hypotheses are true, with $p_j \sim U(0,1)$
• 10 extremely alternative, with $p_j = 0$.

Decision rule: Reject $H_{0j}$ if $p_j \le 0.05$

Then:

$\mathrm{CER}_j = P(\text{reject } H_{0j} \mid H_{0j} \text{ true}) = 0.05$.
$\mathrm{FWE} = P(\text{reject one or more of the 20}) = 1 - (.95)^{20} = 0.64$.
$\mathrm{FDR} = E\{V/(V+10)\}$ where $V \sim \mathrm{Bin}(20, .05)$, so FDR = 0.084.
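The three error rates in this example can be checked numerically. The Python sketch below evaluates the binomial expectation exactly; the variable names are illustrative.

```python
from math import comb


def binom_pmf(v, n, p):
    """P(V = v) for V ~ Bin(n, p)."""
    return comb(n, v) * p**v * (1 - p) ** (n - v)


n_null, n_alt, alpha = 20, 10, 0.05

cer = alpha                                   # per-comparison error rate
fwe = 1 - (1 - alpha) ** n_null               # P(at least one false rejection)
# The 10 alternatives are always rejected, so R = V + 10.
fdr = sum(binom_pmf(v, n_null, alpha) * v / (v + n_alt)
          for v in range(n_null + 1))

print(round(fwe, 2))   # 0.64
print(round(fdr, 3))   # 0.084
```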

155

Benjamini and Hochberg’s FDR-Controlling Method


– If $p_{(k)} \le \alpha$, reject all $H_{(j)}$ and stop, else continue.

– If $p_{(k-1)} \le (k-1)\alpha/k$, reject $H_{(1)}, \dots, H_{(k-1)}$ and stop, else continue.

– …

– If $p_{(1)} \le \alpha/k$, reject $H_{(1)}$.

Adjusted p-values: $p_{A(j)} = \min_{i \ge j} (k/i)\, p_{(i)}$.
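The adjusted p-value formula translates directly into a running minimum taken from the largest p-value downward. A minimal Python sketch follows; the function name and the example inputs are illustrative, and the input need not be sorted.

```python
def bh_adjusted(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])   # ranks 1..k
    adjusted = [0.0] * k
    running_min = 1.0
    for rank in range(k, 0, -1):                          # step up from the largest
        idx = order[rank - 1]
        running_min = min(running_min, k / rank * p_values[idx])
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values
print(bh_adjusted([0.5, 0.001]))   # [0.5, 0.002]
```

Rejecting every hypothesis whose adjusted p-value is at most $\alpha$ reproduces the step-up rule above.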

156

Comparison with Hochberg’s Method

• A step-up procedure, like Hochberg’s method
• Adjusted p-values are $p_{A(j)} = \min_{i \ge j} (k/i)\, p_{(i)}$.

• Recall for Hochberg’s method, $p_{A(j)} = \min_{i \ge j} (k-i+1)\, p_{(i)}$.

• FDR adjusted p-values are never larger, since $k/i \le k-i+1$
• B-H FDR method uses Simes’ critical points.
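The same comparison can be made on the critical-value scale: the i-th ordered p-value is compared with $i\alpha/k$ under B-H and with $\alpha/(k-i+1)$ under Hochberg. The Python sketch below tabulates both for an illustrative k = 30; the function names are hypothetical.

```python
def hochberg_critical(i, k, alpha=0.05):
    """Hochberg critical value for the i-th smallest of k p-values."""
    return alpha / (k - i + 1)


def bh_critical(i, k, alpha=0.05):
    """Benjamini-Hochberg (Simes) critical value for the i-th smallest p-value."""
    return i * alpha / k


k = 30
for i in (1, 10, 20, 30):
    print(i, round(hochberg_critical(i, k), 5), round(bh_critical(i, k), 5))
```

The two sets of critical values agree at i = 1 and i = k, and B-H is more liberal everywhere in between.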

157

Critical Values – FDR vs FWE

[Figure, two panels, each plotting critical values (alpha levels, 0 to 0.05) against ordered test. Panel "Alpha Levels - 4000 Tests": series Hoc4000 and BH4000. Panel "Alpha Levels": series Hoc and BH, ordered tests 0 to 30.]

158

Comments on FDR Control

• Considered better for large numbers of tests since FWE is inconsistent
• Is adaptive
• Has a loose Bayesian correspondence

• Easy to misinterpret the results: Given 10 FDR<.10 rejections in a given study, it is tempting to claim that only one can be in error (in an “average” sense).

However, this is incorrect, as $E(V/R \mid R>0)$ can exceed $\alpha$.

159

Further Topics: Bayesian Methods

• Simultaneous Credible Intervals
• Probabilities of ranking
• Loss function approach
• Posterior probabilities of null hypotheses

160

Bayes/Frequentist Comparisons

161

Simultaneous Credible Intervals

• Create intervals $I_i$ for $\theta_i$, so that $P(\theta_i \in I_i, \text{all } i \mid \text{Data}) = .95$

• Implementation in Westfall et al (1999) assumes
   – Variance components model (includes regular GLM and heteroscedastic GLM as special case)
   – Jeffreys’ priors on variances (vague)
   – Flat prior on means (also vague)

• Uses PROC MIXED to obtain sample (assume i.i.d) from posterior distribution

• Uses %BayesIntervals to obtain simultaneous credible intervals

162

Bayesian Simultaneous Conf. Band

Obs  _NAME_   Lower      Upper
1    diff10   -6.51586   -1.68863
2    diff15   -5.58315   -1.30912
3    diff20   -4.67692   -0.91457
4    diff25   -3.79373   -0.49875
5    diff30   -2.94835   -0.02991
6    diff35   -2.17939   0.49931
7    diff40   -1.48987   1.09893
8    diff45   -0.87996   1.79373
9    diff50   -0.35969   2.57208
10   diff55   0.10077    3.41950
11   diff60   0.52638    4.30526
12   diff65   0.91615    5.22872
13   diff70   1.29532    6.16164

163

Bayes/Frequentist Correspondence

From Westfall, P.H. (2005). Comment on Benjamini and Yekutieli, ‘False Discovery Rate Adjusted Confidence Intervals for Selected Parameters,’ Journal of the American Statistical Association 100, 85-89.

164

Bayesian Probabilities for Rankings

Suppose you observe $\mathrm{Ave}_1 > \mathrm{Ave}_2 > \dots > \mathrm{Ave}_k$.

What is the probability that $\mu_1 > \mu_2 > \dots > \mu_k$?

Bayesian Solution: Calculate proportion of posterior samples for which the ranking holds.
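Given a posterior sample, the calculation is just a proportion. The Python sketch below assumes each draw is a tuple of means already ordered by the observed ranking; the function name and the four draws are made up for illustration, whereas a real analysis would use many thousands of draws.

```python
def ranking_probability(posterior_draws):
    """Proportion of draws in which mu_1 > mu_2 > ... > mu_k holds."""
    hits = sum(
        all(a > b for a, b in zip(draw, draw[1:]))   # strict ordering in this draw
        for draw in posterior_draws
    )
    return hits / len(posterior_draws)


# Hypothetical posterior draws of (mu_1, mu_2, mu_3)
draws = [
    (3.1, 2.0, 1.2),
    (2.5, 2.7, 1.0),
    (3.0, 2.2, 0.9),
    (2.8, 1.9, 2.1),
]
print(ranking_probability(draws))   # 0.5
```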

165

Results: Comparing Formulations

Solution for Fixed Effects

Effect        formulation   Estimate   Standard Error
formulation   A             12.0500    0.3152
formulation   B             11.0200    0.3152
formulation   C             10.2700    0.3152
formulation   D             9.2700     0.3152
formulation   E             12.1700    0.3152

The MEANS Procedure

Variable              N        Mean
rank_observed_means   500000   0.5554860
Mean5_best            500000   0.6051220
Mean1_best            500000   0.3936600
Mean2_best            500000   0.0012160
Mean3_best            500000   2E-6
Mean4_best            500000   0

166

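Waller-Duncan Loss Function Approach

Let $\delta_{ij} = \mu_i - \mu_j$.
Let $L_{i<j}(\delta_{ij})$ denote the loss of declaring $\mu_i < \mu_j$.
Let $L_{i>j}(\delta_{ij})$ denote the loss of declaring $\mu_i > \mu_j$.
Let $L_{i\sim j}(\delta_{ij})$ denote the loss of declaring $\mu_i$ n.s. different from $\mu_j$.

W-D Loss functions*
$L_{i\sim j}(\delta_{ij}) = |\delta_{ij}|$
$L_{i<j}(\delta_{ij}) = 0$ if $\delta_{ij} < 0$, $= k\delta_{ij}$ otherwise
$L_{i>j}(\delta_{ij}) = -k\delta_{ij}$ if $\delta_{ij} < 0$, $= 0$ otherwise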

* Equivalent form; See Hochberg and Tamhane (1987, 320-330)

See Pennello, G. 1997. The k-ratio multiple comparisons Bayes rule for the balanced two-way design. Journal of the American Statistical Association 92: 675-684.

168

Implementation

Waller-Duncan in PROC GLM

More general: Simulate from posterior pdf of the ij, calculate all three losses, average, and choose decision with smallest average loss.
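The simulation-based rule can be sketched as follows in Python, using the three W-D loss functions with $\delta_{ij} = \mu_i - \mu_j$. The function name, the decision labels, the default k = 100, and the example draws are assumptions for illustration; a real analysis would use a large posterior sample.

```python
def waller_duncan_decision(delta_draws, k=100.0):
    """Average the three losses over posterior draws; pick the smallest."""
    n = len(delta_draws)
    # Declaring mu_i < mu_j is costly when delta is actually positive.
    loss_lt = sum(k * d if d > 0 else 0.0 for d in delta_draws) / n
    # Declaring "not significantly different" costs |delta|.
    loss_ns = sum(abs(d) for d in delta_draws) / n
    # Declaring mu_i > mu_j is costly when delta is actually negative.
    loss_gt = sum(-k * d if d < 0 else 0.0 for d in delta_draws) / n
    losses = {"i < j": loss_lt, "i ~ j": loss_ns, "i > j": loss_gt}
    return min(losses, key=losses.get), losses


# Hypothetical posterior draws of delta_ij
print(waller_duncan_decision([1.5, 2.0, 1.8, 2.2])[0])           # i > j
print(waller_duncan_decision([-0.01, 0.01, 0.005, -0.005])[0])   # i ~ j
```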

169

Sample Output

The MEANS Procedure

Variable   N        Mean          Std Error
Loss1LT3   100000   177.9339488   0.1445577
Loss1NS3   100000   1.7793558     0.0014454
Loss1GT3   100000   0.0016329     0.000596411

Decision: 1 > 3

The MEANS Procedure

Variable   N        Mean         Std Error
Loss1LT5   100000   12.7456041   0.0717432
Loss1NS5   100000   0.3746864    0.000905742
Loss1GT5   100000   24.7230371   0.0967412

Decision: 1 ~ 5

170

Bayesian Multiple Testing

Frequentist univariate testing: Calculate p-value = P(data more extreme | H0)

Bayesian univariate testing: Calculate P(H0 is true | Data)

Frequentist multiple testing: if H01, H02, … , H0k are all true (or if many are true) then we get a small p-value by chance alone, so use a more conservative rule.

Bayesian multiple testing: Express the doubt about many or all H0i being true using a prior distribution; use this to calculate posterior probabilities P(H0i is true | Data).

171

Bayesian Multiple Testing: Methodology

Find the posterior probability for each of the $2^k$ models where $\delta_i$ is either $= 0$ or $\ne 0$. Then

$P(\delta_i = 0 \mid Z) = \dfrac{\text{Sum of posterior probs for all } 2^{k-1} \text{ models where } \delta_i = 0}{\text{Sum of posterior probs for all } 2^k \text{ models}}$
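The ratio above amounts to marginalizing over models. The Python sketch below assumes each model is coded as a tuple of 0/1 flags (0 meaning that effect is null) with an unnormalized posterior weight; the function name and the weights are made up for illustration and do not come from any analysis in the course.

```python
def posterior_null_probabilities(model_weights, k):
    """P(effect i is null | data) from posterior weights on the 2^k models."""
    total = sum(model_weights.values())
    return [
        sum(w for model, w in model_weights.items() if model[i] == 0) / total
        for i in range(k)
    ]


# Hypothetical unnormalized posterior weights for k = 2 (4 models)
weights = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
print(posterior_null_probabilities(weights, 2))   # roughly [0.3, 0.4]
```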

Gopalan, R., and Berry, D.A. (1998), Bayesian Multiple Comparisons Using Dirichlet Process Priors, Journal of the American Statistical Association 93, 1130-1139.

Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). Bayesian multiple testing for two-sample multivariate endpoints," Biometrics 59, 76-82.

172

The %BayesTests Macro: Priors• You can specify your level of prior doubt about individual hypotheses. You can specify either (i) P(H0i is true) or (ii) P(H0i is true, all i) , or both.

• You can specify (iii) the degree of prior correlation among the individual hypotheses.

• Specify two out of three of (i), (ii), and (iii). The third is determined by the other two.

• Specify prior expected effect sizes and prior variances of effect sizes (default: mean effect size is 2.5, variance= 2.)

173

The %BayesTests Macro: Data Assumptions, Inputs, and Outputs

• Assume: tests are free combinations (e.g.,multiple endpoints); MANOVA; Large Samples.

• Inputs: t-statistics and their (conditional) large-sample correlation matrix (this is the partial correlation matrix in the case of multiple endpoints); priors.

• Outputs: Posterior probabilities P(H0i is true | Data).

174

%BayesTests Example: Multiple Endpoints in Panic Disorder Study

proc glm data=research.panic;
   class TX;
   model AASEVO PANTOTO PASEVO PHCGIMPO = TX;
   estimate "Treatment vs Control" TX 1 -1;
   manova h=TX / printe;
   ods output Estimates=Estimates PartialCorr=PartialCorr;
run;

%macro Estimates;
   use Estimates;
   read all var {tValue} into EstPar;
   use PartialCorr;
   read all var {AASEVO, PANTOTO, PASEVO, PHCGIMPO} into cov;
%mend;

%BayesTests(rho=.5, Pi0=.5);

175

Output from %BayesTests

176

The Effect of Prior Correlation: Borrowing Strength

Posterior Null Probability as a Function of Prior Correlation

[Figure: posterior null probability p (axis 0 to 0.6) versus prior correlation r (axis 0 to 1), one curve each for H1, H2, H3, and H4.]

177

The Bayesian Multiplicity Effect

If the multiple comparisons concern, “What if many or all nulls are true,” is valid, the Bayesian must attach a higher probability to P(H0i is true, all i). Here is the result of setting P(H0i is true, all i) = .5.

“Right” answers, See Westfall, P.H., Krishen,A. and Young, S.S.(1998). "Using Prior Information to Allocate Significance Levels for Multiple Endpoints," Statistics in Medicine 17, 2107-2119.

178

Summary: Bayesian Methods

Several Bayesian MCPs are available!
• Intervals
• Tests
• Rankings
• Decision theory

Other current research:

• FDR – Bayesian connection (genetics)
• Mixture models and Bayesian MCPs (variable selection)

179

Discussion

Good methods and software are available

You can’t use the excuse “I don’t have to use MCPs because there is no good method available”

This brings us back to the $100,000,000 question: “When should we use MCPs/MTPs”?

180

When Should You Adjust? A Scientific View

When there is substantial doubt concerning the collection of hypotheses tested

When you data snoop

When you play “pick the winner”

When conclusions require joint validity

181

But What “Family” Should I Use?

• The set over which you play “pick the winner”

• The set of conclusions requiring joint validity

• Not always well-defined

• Better to decide at design stage or simply to “frame the discussion”

182

Multiplicity Invites Selection; Selection has an Effect

Variability, probability theory, VERY relevant.

183

Final Words:

$\alpha/k$

184

References: Books

Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley, New York.

Hsu, J.C. (1996). Multiple Comparisons: Theory and Methods, Chapman and Hall, London.

Westfall, P.H., and Young, S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley, New York.

Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D., and Hochberg, Y. (1999). Multiple Comparisons and Multiple Tests Using the SAS® System, Cary, NC: SAS Institute Inc.

Westfall, P.H. and Tobias, R. (2000). Exercises to Accompany "Multiple Comparisons and Multiple Tests Using the SAS® System" , Cary, NC: SAS Institute Inc.

185

References: Journal Articles

• Bauer, P.; George Chi; Nancy Geller; A. Lawrence Gould; David Jordan; Surya Mohanty; Robert O'Neill; Peter H. Westfall (2003). Industry, Government, and Academic Panel Discussion on Multiple Comparisons in a “Real” Phase Three Clinical Trial. Journal of Biopharmaceutical Statistics, 13(4), 691-701.
• Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A new and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289-300.
• Berger, J. O. and Delampady, M. (1987). Testing precise hypotheses. Statistical Science 2, 317-352.
• Cook, R.J. and Farewell, V.T. (1996). Multiplicity considerations in the design and analysis of clinical trials. JRSS-A 159, 93-110.
• Dmitrienko, A, Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine 22, 2387-2400.
• Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). "Bayesian multiple testing for two-sample multivariate endpoints," Biometrics 59, 76-82.
• Hellmich M, Lehmacher W. (2005). Closure procedures for monotone bi-factorial dose-response designs. Biometrics 61, 269-276.
• Koyama, T., and Westfall, P.H. (2005). Decision-Theoretic Views on Simultaneous Testing of Superiority and Noninferiority, Journal of Biopharmaceutical Statistics 15, 943-955.
• Lehmacher W., Wassmer G., Reitmeir P. (1991). Procedures for Two-Sample Comparisons with Multiple Endpoints Controlling the Experimentwise Error Rate. Biometrics 47, 511-521.
• Marcus, R., Peritz, E. and Gabriel, K.R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655-660.
• Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association 81, 826-831.
• Westfall, P.H. (1997). "Multiple Testing of General Contrasts Using Logical Constraints and Correlations," Journal of the American Statistical Association 92, 299-306.
• Westfall, P.H. and Wolfinger, R.D. (1997). "Multiple Tests with Discrete Distributions," The American Statistician 51, 3-8.
• Westfall, P.H., Johnson, W.O., and Utts, J.M. (1997). A Bayesian perspective on the Bonferroni adjustment. Biometrika 84, 419-427.
• Westfall, P.H. and Wolfinger, R.D. (2000). "Closed Multiple Testing Procedures and PROC MULTTEST." SAS Observations, July, 2000.
• Westfall, P.H., Ho, S.-Y., and Prillaman, B.A. (2001). "Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials," Journal of Biopharmaceutical Statistics 11, 125-138.
• Westfall, P.H. and Krishen, A. (2001). "Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures," Journal of Statistical Planning and Inference 99, 25-40.
• Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel Dekker Inc., New York, pp. 666-673.
• Westfall, P.H., Zaykin, D.V., and Young, S.S. (2001). Multiple tests for genetic effects in association studies. Methods in Molecular Biology, vol. 184: Biostatistical Methods, pp. 143-168. Stephen Looney, Ed., Humana Press, Totowa, NJ.
• Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method, Journal of the American Statistical Association 102: 487-494.
