P-Values, Hypothesis Testing and Reproducibility
An FDA Perspective
Robert T. O’Neill Ph.D.
Senior Statistical Advisor
Office of Translational Sciences
CDER, FDA
The opinions expressed in this talk are my own and do not represent FDA policy
Outline of talk
Regulatory standards for evidence of efficacy and safety
Substantial evidence –
Adequate and well controlled studies: prospective plans
The review process: assessing bias and uncertainty
Evidence confirmation / replication
Single study planning; exploratory vs. confirmatory study
The type 1 error – its choice and guidances
Safety vs efficacy assessment: observational studies
Need for flexibility – some future issues
The 1962 Kefauver-Harris Amendments: The Foundation for Experimental Evidence
As the Basis for Drug Approvals
“substantial evidence” means evidence consisting of adequate and well controlled
investigations, including clinical investigations, by experts qualified by scientific
training and experience to evaluate the effectiveness of the drug involved, on the basis of
which it could fairly and responsibly be concluded by such experts that the drug will
have the effect it purports or is represented to have under the condition of use
prescribed, recommended, or suggested in the labeling or proposed labeling thereof.
The Act sets the standards for approval of a new drug application by making it clear that:
(1) adequate and well controlled investigations are the basis for evidence and
(2) the purpose those investigations are intended to serve, namely to allow a
conclusion that the drug will have ‘the effect’ it purports or is represented to have
under conditions of use prescribed or recommended in the labeling.
The 1970 Definition of ‘Adequate and Well Controlled Investigations’:
The Foundation for Statistical Principles: The Concepts of Randomization, Hypothesis Testing
and Estimation
Two features of that definition are important to the application of statistical
principles and methodology employed in the design, analysis and
interpretation of clinical investigations:
(1) the study uses a design that permits a valid comparison with a control to
provide a quantitative assessment of drug effect; and
(2) there is an analysis of the results of the study adequate to assess the effects
of the drug. The report of the study should describe the results and the
analytic methods used to evaluate them, including any appropriate
statistical methods.
Characteristics of an adequate and well-controlled study
(1) Clear statement of objectives, summary of proposed or actual methods of analysis in the protocol or in the report of its results.
(2) Uses a design that permits a valid comparison with a control to provide a quantitative assessment of drug effect.
Placebo concurrent control, dose-comparison concurrent control, no treatment concurrent control, active treatment concurrent control, historical control.
(3) The method of selection of subjects provides adequate assurance they have the disease or condition being studied.
(4) The method of assigning patients to treatment and control groups minimizes bias and is intended to assure comparability of the groups with respect to pertinent variables… Ordinarily, in a concurrently controlled study, assignment is by randomization.
(5) Adequate measures are taken to minimize bias on the part of the subjects, observers, and analysts of the data [blinding].
(6) The methods of assessment of the subjects’ response are well-defined and reliable.
(7) There is an analysis of the results of the study adequate to assess the effects of the drug. The analysis should assess … the effects of any interim data analyses performed
Note that the regulations did not dictate how the operational aspects of assessing evidence should occur
Level of statistical evidence for a single adequate and well controlled trial (Type 1 error 0.05)
"Studies" interpreted as at least two studies demonstrating statistically significant effects (scientific principle of replication/confirmation)
Discussions and controversy on statistical vs. clinical significance
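A small worked calculation may help show why replication is statistically persuasive. The sketch below assumes two independent, unbiased studies, each tested at a two-sided 0.05 level with only the favorable direction counting toward a claim (one-sided 0.025 per study). The function name and these assumptions are illustrative and are not taken from the regulations.

```python
def joint_false_positive(alpha_one_sided: float, n_studies: int) -> float:
    """Chance that every one of n independent studies is falsely
    significant in the favorable direction when the drug has no effect."""
    return alpha_one_sided ** n_studies


two_sided_alpha = 0.05
one_sided_alpha = two_sided_alpha / 2  # only a favorable result supports a claim

# Two independent studies, both required to be significant.
joint = joint_false_positive(one_sided_alpha, 2)
print(f"Joint false positive chance for two studies: {joint:.6f}")
```

Under these assumptions the joint chance is 0.025 squared, far below the level of either study alone. The calculation covers chance only; bias shared by both studies is not reduced by it.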
The regulations set the framework for hypothesis testing, Type 1 error control and
making a decision using the statistical certainty of results
P-values are the product of the test of hypothesis and reflect uncertainty under assumptions of no treatment effect
P-values are necessary but not sufficient to interpret the results of a study
Bias is an equally important statistical concept that is incorporated into the evaluation of substantial evidence
Bias in a clinical study (impacts the p-value and the weight it carries)
The study design itself (e.g., crossovers)
The randomization strategy
The conduct
Blinding
Informative censoring, missing data
The analysis – changing pre-specified goals
The reporting and interpretation (what was intended by the protocol and all amendments)
Adaptive Designs: Selection Bias, Type 1 error control, estimation
Important issues in assessing evidence
How certain or uncertain are study results? Bias and chance
What are the estimates of treatment effects and how consistent or variable are they?
Confirmation of treatment effects
Interpreting inconsistent evidence
Collective evidence - putting it all together
Multiplicity control (strong and weak)
Study designs and data analysis
Safety vs. efficacy
Concepts of statistical evidence – not discussed in regulation
Frequentist: the probability of the data given the assumed hypothesis (e.g., effect size is zero); the range of parameters consistent with the data
P-values, confidence intervals, study power, type 1 error, number of hypotheses - reliance on null hypothesis
Frequency in the long run - over many studies
Bayesian: the probability of the hypothesis, given the data
Posterior probability, Bayes factor
Requires a prior assumption of the distribution of the hypothesis (assumed known vs estimated from data [empirical Bayes])
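The frequentist "long run" idea can be illustrated with a simulation. The sketch below is a minimal example under stated assumptions: a two-arm trial with normally distributed outcomes, a known standard deviation of 1, no true treatment effect, and a two-sided z-test. The function name, sample sizes, and seed are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, fmean


def simulate_type1_error(n_trials=2000, n_per_arm=50, alpha=0.05, seed=1):
    """Fraction of simulated null trials declared significant.

    Assumes normal outcomes with known SD 1 and no treatment effect,
    so every significant result is a false positive.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = (2 / n_per_arm) ** 0.5  # standard error of the difference in means
    false_positives = 0
    for _ in range(n_trials):
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        z = (fmean(treated) - fmean(control)) / se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_trials


rate = simulate_type1_error()
print(f"Observed false positive rate over many null trials: {rate:.3f}")
```

Over many such trials the observed rate should sit near the nominal 0.05. That is a statement about the procedure across studies, not about the probability that any single hypothesis is true.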
The interpretation of a single study
Characterizing Evidence
controlling the long run chance of a bad conclusion
for all drugs
for all studies within an application
The prior probability of effective therapies
The vote counting approaches – the number of studies that show a statistically significant finding
The collective evidence approach; integrated efficacy
Replication / Confirmation / Repeatability / Level of Uncertainty
When might a single RCT provide sufficient evidence; statistically more persuasive evidence
The concepts of Type 1 error and p-values are prominent in
‘Statistical Principles for Clinical Trials’ (ICH E9, 1998)
International guidance agreed to by the United States, European Union and Japan
Distinguishes exploratory from confirmatory studies
What is planned in the protocol
What does ICH E9 say about the Type 1 error and P-values
Evidence from a single study - what is a statistically persuasive finding
Lower type 1 error
Consider posterior probability of a true positive finding conditional on a statistically significant finding
Consider posterior probability of a true negative finding conditional on no statistically significant finding
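The two posterior probabilities above follow from Bayes' rule once a prior probability that the drug is effective, a type 1 error, and a study power are assumed. The sketch below uses hypothetical values for all three; the function names and numbers are illustrative only.

```python
def posterior_true_positive(prior: float, alpha: float, power: float) -> float:
    """P(drug effective | statistically significant finding)."""
    true_pos = prior * power          # effective and detected
    false_pos = (1 - prior) * alpha   # ineffective but significant by chance
    return true_pos / (true_pos + false_pos)


def posterior_true_negative(prior: float, alpha: float, power: float) -> float:
    """P(drug ineffective | no statistically significant finding)."""
    true_neg = (1 - prior) * (1 - alpha)  # ineffective and not significant
    false_neg = prior * (1 - power)       # effective but missed
    return true_neg / (true_neg + false_neg)


# Hypothetical scenario: 10% of candidate drugs truly work, 90% power.
for alpha in (0.025, 0.000625):
    ppv = posterior_true_positive(prior=0.10, alpha=alpha, power=0.90)
    print(f"alpha={alpha}: P(true positive | significant) = {ppv:.3f}")
```

With these assumed inputs, lowering the type 1 error from 0.025 to 0.000625 raises the posterior probability of a true positive from 0.80 to about 0.99, which is one way to read "lower type 1 error" as making a single study more persuasive.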
Hypothesis testing is the core framework for design and evaluation of randomized clinical trials – the basis for substantial evidence
This guidance concerns how to deal with multiple primary and secondary endpoints and type 1 error control
Testing hypotheses for each endpoint, Type 1 error control, both weak and strong – making claims on individual endpoints
Understanding these issues requires professional statisticians in the evaluation process
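As an illustration of strong type 1 error control across several endpoints, the sketch below compares the Bonferroni and Holm procedures, two standard methods that control the familywise error rate in the strong sense. These methods are chosen for illustration and are not named in the slides; the endpoint p-values are made up.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: test p-values from smallest to largest
    against alpha / (m - rank) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also not rejected
    return reject


# Hypothetical p-values for three endpoints (illustrative only).
endpoint_p = [0.012, 0.030, 0.004]
print("Bonferroni:", bonferroni_reject(endpoint_p))
print("Holm:      ", holm_reject(endpoint_p))
```

For these made-up values Bonferroni supports claims on two endpoints while Holm supports all three, at the same familywise level of 0.05. The choice of procedure belongs in the protocol before the data are seen.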
Need for flexibility
Rare and orphan diseases where study size is small and may need to borrow strength from empirical sources (natural disease progression registries)
Pediatric extrapolation studies using adult data – concept of different criteria to control false positive conclusions
Personalized medicine, small subsets, multiplicity, drawing strength – when, consequence of false conclusion
Real world evidence – trading data set bias for relevance (electronic medical records)
Meta-analysis of RCTs for safety assessment
Extrapolation from adults to pediatrics
Interpretation of p-values for RCTs vs. observational studies – role of bias
Observational studies are subject to bias because there is no randomization, and adjustments cannot overcome a design flaw
P values, statistical significance in safety assessment
Concluding remarks
The role of p-values and type 1 error control is well understood and not misinterpreted in the statistical regulatory framework – clinical colleagues may differ
Regulatory review of protocols and of completed clinical trials analyzed by sponsors generally identifies problems with overinterpretation – confirmation is an important principle that is followed
There is flexibility in the application of statistical methodology appropriate to the problem – as sources and types of data expand, there is an increasing need for statistical talent to navigate the interplay of study design, bias, and statistical uncertainty in evidence generation