1
The changing landscape of interim analyses for
efficacy / futility
Marc Buyse, ScD
IDDI, Louvain-la-Neuve, Belgium
Massachusetts Biotechnology Council, Cambridge, Mass.
June 2, 2009
2
Reasons for Interim Analyses
• Early stopping for
• safety
• extreme efficacy
• futility
• Adaptation of design based on observed data to
• play the winner / drop the loser
• maintain power
• make any adaptation, for whatever reason and whether or not data-derived, whilst controlling for α
3
Methods for Interim Analyses
• Multi-stage designs / seamless transition designs
• Group-sequential designs
• Stochastic curtailment
• Sample size adjustments
• Adaptive (« flexible ») designs
4
Early Stopping
Helsinki Declaration:
“Physician should cease any investigation if the hazards are found to outweigh the potential benefits.” (« Primum non nocere »)
Trials with serious, irreversible endpoints should be stopped if one treatment is “proven” to be superior, and such potential stopping should be formally pre-specified in the trial design.
5
The Cost of Delay
[Bar chart: average daily sales ($ in millions, axis from $0 to $12) for Prilosec, Zocor, Norvasc, Paxil and Claritin]
« Blockbusters » reach sales > 500 M$ a year (> 1 M$ a day)
6
Fixed Sample Size Trials…
1 – the sample size is calculated to detect a given difference at given significance and power
2 – the required number of patients is accrued
3 – patient outcomes are analyzed at the end of the trial, after observation of the pre-specified number of events
7
…vs (Group) Sequential Trials…
1 – the sample size is calculated to detect a given difference at given significance and power
2 – patients are accrued until a pre-planned interim analysis of patient outcomes takes place
3a – the trial is terminated early, or
3b – the trial continues unchanged
4 – patient outcomes are analyzed at the end of the trial, after observation of the pre-specified number of events
8
…vs Adaptive Trials
1 – the sample size is calculated to detect a given difference at given significance and power
2 – patients are accrued until a pre-planned interim analysis of patient outcomes takes place
3a – the trial is terminated early, or
3b – the trial continues unchanged, or
3c – the trial continues with adaptations
4 – patient outcomes are analyzed at the end of the trial, after observation of the pre-specified or modified number of events
Randomized phase II trial with continuation as phase III trial
Simultaneous screening of several treatment groups with continuation as phase III trial :
[Diagram: Arms 1, 2 and 3 start in PHASE II; early stopping of one or more arms; the remaining arms continue into PHASE III for the comparison of the arms]
Phase III trial with interim analysis
Phase III trial with interim look at data:
[Diagram: Arms 1, 2 and 3 run in PHASE III; an INTERIM comparison of the arms takes place partway through, then PHASE III continues to the final comparison of the arms]
11
Seamless transition designs (e.g. for dose selection)
Designs can be operationally or inferentially seamless:
12
Group Sequential Trials
If several analyses are carried out, the Type I error is inflated if each analysis is carried out at the target level of significance.
So, the interim analyses must use an adjusted level of significance so as to preserve the overall type I error.
13
Inflation of α with multiple analyses
With 5 analyses performed at level 0.05, the overall level is about 0.14
Adjusting for multiple analyses
The 5 analyses must be performed at level 0.0159 in order to preserve an overall level of 0.05
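The two statements above can be checked by simulation. The sketch below is not from the slides: it assumes 5 equally spaced two-sided looks, a known variance, and a true null, and it uses only the Python standard library. The function names and the seed are illustrative. The unadjusted overall level should land near 0.14, and the level obtained with 0.0159 per look should land near 0.05.

```python
import random
from statistics import NormalDist


def max_abs_z(num_looks, rng):
    """Largest |Z_k| over equally spaced looks of one trial simulated under H0."""
    running_sum = 0.0
    largest = 0.0
    for k in range(1, num_looks + 1):
        # Each block of new patients adds one independent N(0, 1) increment.
        running_sum += rng.gauss(0.0, 1.0)
        # Z_k is the cumulative sum standardized to unit variance.
        largest = max(largest, abs(running_sum) / k ** 0.5)
    return largest


def overall_type_one_error(nominal_level, maxima):
    """Share of null trials that cross the two-sided critical value at any look."""
    critical = NormalDist().inv_cdf(1 - nominal_level / 2)
    return sum(m >= critical for m in maxima) / len(maxima)


rng = random.Random(2009)  # arbitrary seed, for reproducibility only
maxima = [max_abs_z(5, rng) for _ in range(200_000)]

unadjusted = overall_type_one_error(0.05, maxima)
adjusted = overall_type_one_error(0.0159, maxima)
print(f"each look at 0.05   -> overall level ~ {unadjusted:.3f}")
print(f"each look at 0.0159 -> overall level ~ {adjusted:.3f}")
```

Reusing the same simulated trials for both nominal levels keeps the comparison free of extra Monte Carlo noise.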
15
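Group sequential designs
Test H0: Δ = 0 vs. HA: Δ ≠ 0
m pts. accrued to each arm between analyses
Use standardized test statistic Zk, k=1,...,K
$$Z_k = \frac{1}{\sqrt{2mk\sigma^2}} \left( \sum_{i=1}^{mk} X_{Ei} - \sum_{i=1}^{mk} X_{Ci} \right) = \frac{\bar{X}_{Ek} - \bar{X}_{Ck}}{\sqrt{2\sigma^2/(mk)}}$$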
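A minimal sketch of the standardized statistic at one analysis, assuming a known common standard deviation sigma and equal cumulative numbers of patients in the experimental (E) and control (C) arms. The function name and the toy data are illustrative, not from the slides.

```python
from math import sqrt
from statistics import fmean


def z_statistic(outcomes_e, outcomes_c, sigma):
    """Z_k from the cumulative outcomes of both arms at analysis k.

    Both lists hold all m*k outcomes observed so far in each arm;
    sigma is the (assumed known) standard deviation of one outcome.
    """
    if len(outcomes_e) != len(outcomes_c):
        raise ValueError("this sketch assumes equal numbers per arm")
    n_per_arm = len(outcomes_e)  # equals m * k
    difference = fmean(outcomes_e) - fmean(outcomes_c)
    return difference / sqrt(2 * sigma ** 2 / n_per_arm)


# Toy data: 4 patients per arm so far, sigma assumed equal to 2.
z_k = z_statistic([3.1, 2.4, 4.0, 2.9], [1.8, 2.2, 2.7, 1.5], sigma=2.0)
print(f"Z_k = {z_k:.3f}")
```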
16
Group-Sequential Designs – Type I Error
Probability of wrongly stopping/rejecting H0 at
analysis k
$P_{H_0}(|Z_1| < c_1, \ldots, |Z_{k-1}| < c_{k-1}, |Z_k| \ge c_k) = \pi_k$
• “Type I error spent at stage k”
$P(\text{Type I error}) = \sum_{k=1}^{K} \pi_k$
Choose the $c_k$’s so that $\sum_{k=1}^{K} \pi_k = \alpha$
17
Group-Sequential Designs – Type II Error
Probability of Type II error is
$1 - P_{H_A}\left( \bigcup_{k=1}^{K} \{ |Z_1| < c_1, \ldots, |Z_{k-1}| < c_{k-1}, |Z_k| \ge c_k \} \right)$
Depends on K, α, β, ck’s.
Given these values, the required sample size can be computed
• it can be expressed as R × (fixed sample size)
18
Pocock Boundaries
Reject H0 if | Zk | > cP(K,α)
• cP(K,α) chosen so that P(Type I error) = α
All analyses are carried out at the same adjusted significance level
The probability of early rejection is high but the power at the final analysis may be compromised
19
Pocock Boundaries
p-values for Zk (two-sided) per interim analysis (K=5)
20
O’Brien-Fleming Boundaries
Reject H0 if | Zk | > cOBF(K,α)√(K / k)
• for k=K we get | ZK | > cOBF(K,α)
• cOBF(K,α) chosen so that P(Type I error) = α
Early analyses are carried out at extreme adjusted significance levels
The probability of early rejection is low but the power at the final analysis is almost unaffected
21
O’Brien-Fleming Boundaries
p-values for Zk (two-sided) per interim analysis (K=5)
22
Wang & Tsiatis Boundaries
Wang & Tsiatis (1987):
Reject H0 if | Zk | > cWT(K,α,θ) (k / K)^(θ − ½)
• θ = 0.5 gives Pocock’s test; θ = 0, O’Brien-Fleming
• implemented in some software (e.g. EaSt)
Can accommodate any intermediate choice between Pocock and O’Brien-Fleming
23
Wang & Tsiatis Boundaries
p-values for Zk (two-sided) per interim analysis (K=5) with θ = 0.2
24
Haybittle & Peto Boundaries
Haybittle & Peto (1976):
Reject H0 if | Zk | > 3 for k = 1,...,K-1
Reject H0 if | Zk | > cHP(K,α) for k = K
• | Zk | > 3 corresponds to using p < 0.0027
Early analyses are carried out at extreme, yet reasonable adjusted significance levels
Intuitive and easily implemented if correction to final significance level is ignored (pragmatic approach)
25
Haybittle & Peto Boundaries
p-values for Zk (two-sided) per interim analysis (K=5)
26
Boundaries compared
p-values for Zk (two-sided) per interim analysis (K=5)
27
Boundaries compared
Zk per interim analysis (K=5)
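The comparison can be sketched numerically for K=5 and two-sided α=0.05. The constants 2.413 (Pocock) and 2.040 (O’Brien-Fleming) are not given on the slides; they are assumed here from standard published tables. The Haybittle & Peto final-analysis constant is left out because the slides do not state it.

```python
from math import sqrt
from statistics import NormalDist

K = 5
C_POCOCK = 2.413  # assumed tabulated constant, K=5, two-sided alpha=0.05
C_OBF = 2.040     # assumed tabulated constant, K=5, two-sided alpha=0.05


def two_sided_p(z):
    """Nominal two-sided p-value that corresponds to a critical value z."""
    return 2 * (1 - NormalDist().cdf(z))


# Pocock: the same critical value at every analysis.
pocock_z = [C_POCOCK for _ in range(K)]
# O'Brien-Fleming: c * sqrt(K / k), very strict early, close to 1.96 at the end.
obf_z = [C_OBF * sqrt(K / k) for k in range(1, K + 1)]
# Haybittle & Peto: |Z| > 3 at each of the K-1 interim analyses.
haybittle_peto_interim_z = [3.0] * (K - 1)

print(" k   Pocock z (p)        O'Brien-Fleming z (p)")
for k in range(1, K + 1):
    zp, zo = pocock_z[k - 1], obf_z[k - 1]
    print(f"{k:2d}   {zp:.3f} ({two_sided_p(zp):.4f})      {zo:.3f} ({two_sided_p(zo):.6f})")
print(f"Haybittle & Peto interim: z = 3.000 (p = {two_sided_p(3.0):.4f})")
```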
Potential savings / costs in using group sequential designs

μA − μB    Fixed sample    Pocock    O’Brien-Fleming
0.0        170             205       179
0.5        170             182       168
1.0        170             117       130
1.5        170             70        94

Expected sample sizes for different designs (K=5):
• outcomes normally distributed with σ = 2
• α = 0.05
• β = 0.1 for μA − μB = 1
29
Error-Spending Approach
Removing the requirement of a fixed number of equally-spaced analyses
Lan & DeMets (1983): two-sided tests “spending” Type I error.
Maximum information design:
• Error spending function → defines boundaries
• Accept H0 if Imax attained without rejecting the null
30
Error-Spending Approach
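$f(t) = \min\left(2 - 2\Phi\left(z_{1-\alpha/2}/\sqrt{t}\right), \alpha\right)$ yields ≈ O’B-F boundaries
$f(t) = \min\left(\alpha \ln(1 + (e - 1)t), \alpha\right)$ yields ≈ Pocock boundaries
$f(t) = \min\left(\alpha t^{\theta}, \alpha\right)$:
• θ = 1 or 3 corresponds approximately to Pocock and O’B-F, respectively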
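A minimal sketch of these three spending functions, with t the information fraction between 0 and 1. The function names and the look times are illustrative. The code shows only how much Type I error is released by each look; turning the increments into boundaries needs a further numerical step that is not shown.

```python
from math import e, log, sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def spend_obf_like(t, alpha=0.05):
    """O'Brien-Fleming-type spending: almost nothing is spent early."""
    if t <= 0:
        return 0.0
    z = NORMAL.inv_cdf(1 - alpha / 2)
    return min(2 - 2 * NORMAL.cdf(z / sqrt(t)), alpha)


def spend_pocock_like(t, alpha=0.05):
    """Pocock-type spending: error is released fairly evenly."""
    return min(alpha * log(1 + (e - 1) * t), alpha)


def spend_power(t, theta, alpha=0.05):
    """Power family alpha * t**theta."""
    return min(alpha * t ** theta, alpha)


# Cumulative error spent at five equally spaced looks (an illustrative choice).
for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(
        f"t={t:.1f}  OBF-like={spend_obf_like(t):.5f}  "
        f"Pocock-like={spend_pocock_like(t):.5f}  "
        f"power(theta=3)={spend_power(t, 3):.5f}"
    )
```

All three functions reach α at t = 1, so the full Type I error is available by the time the maximum information is attained.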
31
How Many Interim Analyses?
One or two interim analyses give most benefit in terms of a reduction of the expected sample size
Not much gain from going beyond 5 analyses
32
When to Conduct Interim Analyses?
With error-spending, full flexibility as to number and timing of analyses
• First analysis should not be “too early” (often at 50% of information time)
• Equally-spaced analyses advisable
In principle, strategy/timing should not be chosen based on the observed results
33
Who conducts interim analyses?
Independent Data Monitoring Committee
Experts from different disciplines (clinicians, statisticians, ethicists, patient advocates, …)
Reviews trial conduct, safety and efficacy data
Recommends
• Stopping the trial
• Continuing the trial unchanged
• Amending the trial
34
Sample Size Re-Estimation
Assume normally distributed endpoints
$$n_I = \frac{2\sigma^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta^2}$$
Sample size depends on σ2
If misspecified, nI can be too small
Idea: internal pilot study
• estimate σ2 based on early observed data
• compute new sample size, nA
• if necessary, accrue extra patients above nI
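The internal pilot idea can be sketched as follows, assuming the usual two-arm formula for a normally distributed endpoint, with the result read as patients per arm. The planning values σ = 2 and Δ = 1 echo the expected-sample-size slide; the interim estimate of 2.5 for σ is hypothetical, chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(sigma, delta, alpha=0.05, beta=0.1):
    """Patients per arm for a two-sided level-alpha test with power 1 - beta."""
    z = NormalDist().inv_cdf
    return 2 * sigma ** 2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / delta ** 2


# Planning stage: sigma assumed to be 2, difference to detect 1.
n_initial = per_arm_sample_size(sigma=2.0, delta=1.0)

# Internal pilot: sigma re-estimated from early data (hypothetical value).
n_adjusted = per_arm_sample_size(sigma=2.5, delta=1.0)

# Only ever add patients; the planned size is never reduced in this sketch.
extra_per_arm = max(0, ceil(n_adjusted) - ceil(n_initial))

print(f"n_I = {n_initial:.1f} per arm, n_A = {n_adjusted:.1f} per arm")
print(f"extra patients to accrue per arm: {extra_per_arm}")
```

With the planning values, twice the per-arm figure is close to the fixed sample of 170 quoted earlier.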
35
Early Stopping for Futility
Stopping to reject H0 of no treatment difference
• Avoids exposing further patients to the inferior treatment
• Appropriate if no further checks are needed on, e.g., treatment safety or long-term effects.
Stopping to accept H0 of no treatment difference
• Stopping “for futility” or “abandoning a lost cause”
• Saves time and effort when a study is unlikely to lead to a positive conclusion.
36
Two-Sided Test
37
Stochastic Curtailment
Idea:
Terminate the trial for efficacy if there is high probability of rejecting the null, given the current data and assuming the null is true among future patients
Conversely, terminate the trial for futility if there is low probability of rejecting the null, given the current data and assuming the alternative is true among future patients
38
Conditional Power
At the interim analysis k, define
pk(Δ) = PΔ(Test T will reject H0 | current data)
A high value of pk(0) suggests T will reject H0
• terminate the trial & reject H0 if pk(0) > ξ
• terminate the trial & accept H0 if 1-pk(Δ) > ξ’ (1-sided)
• probabilities of error: type I ≤ α / ξ, type II ≤ β / ξ’
Note: ξ and ξ’ ≥ 0.8
39
Conditional Power
Unconditional power for α=0.05 and β=0.1 at Δ=0.2
Conditional power for a mid-trial analysis with an estimate of Δ of 0.1
• probability of rejecting the null at the end of the trial has been reduced from 0.9 to 0.1
40
Conditional Power
B(t) = Z(t) t^(1/2), E[B(t)] = θt
41
Conditional Power
Slope = assumed treatment effect in
future patients
42
Conditional Power
Crosshatched area = conditional power
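Conditional power can be sketched with the B-value B(t) = Z(t)√t, where t is the information fraction. The sketch assumes a one-sided rejection region at the final analysis and a drift parameter, here called `drift`, equal to the expected value of Z at t = 1. The three scenarios for future patients and the interim value are illustrative choices and are not meant to reproduce the figures on the slides.

```python
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def conditional_power(z_t, t, drift, alpha_one_sided=0.025):
    """P(final Z exceeds the critical value | interim Z(t), assumed future drift)."""
    z_crit = NORMAL.inv_cdf(1 - alpha_one_sided)
    b_t = z_t * sqrt(t)                     # B-value observed so far
    expected_final = b_t + drift * (1 - t)  # the slope is the assumed effect
    return 1 - NORMAL.cdf((z_crit - expected_final) / sqrt(1 - t))


# Drift of a design with one-sided alpha 0.025 and power 0.9.
design_drift = NORMAL.inv_cdf(0.975) + NORMAL.inv_cdf(0.9)

# Mid-trial look where the estimated effect is half the design effect.
t = 0.5
z_interim = 0.5 * design_drift * sqrt(t)
trend_drift = z_interim / sqrt(t)  # B(t) / t, the current trend

cp_design = conditional_power(z_interim, t, design_drift)
cp_trend = conditional_power(z_interim, t, trend_drift)
cp_null = conditional_power(z_interim, t, 0.0)
print(f"design effect: {cp_design:.3f}  current trend: {cp_trend:.3f}  null: {cp_null:.3f}")
```

Setting t = 0 and Z = 0 recovers the unconditional power of the design, which is a convenient sanity check.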
43
Predictive Power
Problem with the conditional power approach: it is computed assuming a value of Δ that may not be supported by the current data.
A solution: average across the values of Δ
“Predictive power”
$$P_k = \int p_k(\Delta)\, \pi(\Delta \mid \text{data})\, d\Delta$$
π(Δ | data) is the posterior density
Termination against H0 if Pk > ξ etc.
What prior?
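One possible answer, shown only as a sketch: with a flat prior on the drift (an assumption, since the slides leave the prior open), the posterior given the interim B-value is normal with mean B(t)/t and variance 1/t, and the average of the conditional power has a closed form. The code integrates numerically and compares with that closed form. All names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def conditional_power(b_t, t, drift, z_crit):
    """P(B(1) > z_crit | B(t) = b_t) when future data follow the given drift."""
    return 1 - NORMAL.cdf((z_crit - b_t - drift * (1 - t)) / sqrt(1 - t))


def predictive_power(z_t, t, z_crit, grid=4001, width=8.0):
    """Conditional power averaged over the flat-prior posterior of the drift."""
    b_t = z_t * sqrt(t)
    posterior = NormalDist(mu=b_t / t, sigma=1 / sqrt(t))
    low = posterior.mean - width * posterior.stdev
    step = 2 * width * posterior.stdev / (grid - 1)
    total = 0.0
    for i in range(grid):  # trapezoidal rule on a wide grid
        drift = low + i * step
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * conditional_power(b_t, t, drift, z_crit) * posterior.pdf(drift)
    return total * step


def predictive_power_closed_form(z_t, t, z_crit):
    """Closed form under the same flat prior."""
    return NORMAL.cdf((z_t - z_crit * sqrt(t)) / sqrt(1 - t))


z_crit = NORMAL.inv_cdf(0.975)
numeric = predictive_power(1.0, 0.5, z_crit)
closed = predictive_power_closed_form(1.0, 0.5, z_crit)
print(f"numeric = {numeric:.4f}, closed form = {closed:.4f}")
```

An informative prior would change the posterior and therefore the answer, which is the point of the question on the slide.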
Futility guidelines
Less indicated:
• Controversial intervention requiring large randomized evidence (e.g. drug-eluting stents)
• Time-to-event endpoints with rapid enrollment (e.g. cholesterol-lowering drugs)
• Intervention in current use
• Learning curve by investigators (e.g. mechanical heart valves)
• Late effects suspected
• Safety expected to be an issue (e.g. cox-2 inhibitors)
More indicated:
• Approved competitive products (e.g. drugs for allergic rhinitis)
• Long pipeline of alternative drugs (e.g. oncology)
• Short-term outcomes (e.g. 30 day mortality in sepsis)
Overruling futility boundaries
No stopping when boundary crossed:
• Time trends
• Baseline imbalances
• Major problems with quality of data
• Considerable imputation of missing data
• Important secondary endpoints showing benefit
• External information on benefit of similar therapies
Stopping when boundary not crossed:
• Benefit/risk ratio unlikely to be good enough to adopt experimental treatment
• All endpoints showing consistent trends against experimental treatment
• External information on lack of effect of similar therapies
46
Adaptive Designs
Based on combining p-values from different analyses
Allow for flexible designs
• sample size re-calculation
• any changes to the design (including endpoint, test, etc!)
47
Adaptive Designs
Lehmacher and Wassmer (1999):
At stage k, combine one-sided p-values p1,...,pk
$$L_k = k^{-1/2} \sum_{i=1}^{k} \Phi^{-1}(1 - p_i)$$
Use any group sequential design for L
Slight power loss as compared to a group-sequential plan
Flexibility as to design modifications: OK for control of type I error, BUT…
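A minimal sketch of the inverse normal combination, assuming equal weights for all stages as in the formula above and one-sided p-values computed separately from the data of each stage. The p-values in the example are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def inverse_normal_combination(stage_p_values):
    """Combine stage-wise one-sided p-values into one standard-normal statistic."""
    normal = NormalDist()
    k = len(stage_p_values)
    return sum(normal.inv_cdf(1 - p) for p in stage_p_values) / sqrt(k)


# Hypothetical stage-wise p-values, each from its own stage's patients only.
stage_p_values = [0.10, 0.02]
for k in range(1, len(stage_p_values) + 1):
    l_k = inverse_normal_combination(stage_p_values[:k])
    print(f"stage {k}: L = {l_k:.3f}")
```

Under H0 each combined statistic is standard normal whatever design change was made between stages, which is why any group sequential boundary can then be applied to it.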
48
Potential concerns with adaptive designs
Major changes between cohorts make clinical interpretation difficult
If eligibility / endpoint is changed, what is an adequate label?
Temporal trends
Operational bias
Less efficient than group sequential for sample size adjustments
Modest gains (in general), high risks