
P-Values, Hypothesis Testing and Reproducibility

An FDA Perspective

Robert T. O’Neill Ph.D.

Senior Statistical Advisor

Office of Translational Sciences

CDER, FDA

The opinions expressed in this talk are my own and do not represent FDA policy

Outline of talk

Regulatory standards for evidence of efficacy and safety

Substantial evidence –

Adequate and well controlled studies: prospective plans

The review process: assessing bias and uncertainty

Evidence confirmation / replication

Single study planning; exploratory vs. confirmatory studies

The type 1 error – its choice and guidances

Safety vs efficacy assessment: observational studies

Need for flexibility – some future issues

The 1962 Kefauver-Harris Amendments: The Foundation for Experimental Evidence as the Basis for Drug Approvals

“substantial evidence” means evidence consisting of adequate and well controlled investigations, including clinical investigations, by experts qualified by scientific training and experience to evaluate the effectiveness of the drug involved, on the basis of which it could fairly and responsibly be concluded by such experts that the drug will have the effect it purports or is represented to have under the conditions of use prescribed, recommended, or suggested in the labeling or proposed labeling thereof.

The Act sets the standards for approval of a new drug application by making clear:

(1) that adequate and well controlled investigations are the basis for evidence; and

(2) the role those investigations are intended to play, namely to allow a conclusion that the drug will have ‘the effect’ it purports or is represented to have under the conditions of use prescribed or recommended in the labeling.

The 1970 Definition of ‘Adequate and Well Controlled Investigations’:

The Foundation for Statistical Principles: The Concepts of Randomization, Hypothesis Testing and Estimation

Two features of that definition are important to the application of statistical principles and methodology employed in the design, analysis and interpretation of clinical investigations:

(1) the study uses a design that permits a valid comparison with a control to provide a quantitative assessment of drug effect; and

(2) there is an analysis of the results of the study adequate to assess the effects of the drug. The report of the study should describe the results and the analytic methods used to evaluate them, including any appropriate statistical methods.

Characteristics of an adequate and well-controlled study

(1) Clear statement of objectives, summary of proposed or actual methods of analysis in the protocol or in report of its results.

(2) Uses a design that permits a valid comparison with a control to provide a quantitative assessment of drug effect.

Placebo concurrent control, dose-comparison concurrent control, no treatment concurrent control, active treatment concurrent control, historical control.

(3) The method of selection of subjects provides adequate assurance that they have the disease or condition being studied.

(4) The method of assigning patients to treatment and control groups minimizes bias and is intended to assure comparability of the groups with respect to pertinent variables… Ordinarily, in a concurrently controlled study, assignment is by randomization.

Characteristics of an adequate and well-controlled study

(5) Adequate measures are taken to minimize bias on the part of the subjects, observers, and analysts of the data [blinding].

(6) The methods of assessment of the subjects’ response are well-defined and reliable.

(7) There is an analysis of the results of the study adequate to assess the effects of the drug. The analysis should assess … the effects of any interim data analyses performed.

Note that the regulations did not dictate how the operational aspects of assessing evidence should occur

Level of statistical evidence for a single adequate and well controlled trial (Type 1 error 0.05)

Study(ies) interpreted as at least two studies demonstrating statistically significant effects (scientific principle of replication/confirmation)

Discussions and controversy on statistical vs. clinical significance
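The replication standard above can be put in rough numbers. The sketch below assumes two independent trials, each tested two-sided at 0.05, so that a false positive in the favorable direction has probability 0.025 per trial. Those assumptions and the function name are illustrative; they are not stated on the slide.

```python
def joint_false_positive_rate(alpha_two_sided: float, n_trials: int) -> float:
    """Chance that every one of n independent null trials is 'significant'
    in the favorable direction (assumption: independence, one-sided alpha/2)."""
    one_sided = alpha_two_sided / 2
    return one_sided ** n_trials


single_trial = joint_false_positive_rate(0.05, 1)  # 0.025
two_trials = joint_false_positive_rate(0.05, 2)    # about 1 in 1,600

print(f"one trial:  {single_trial:.6f}")
print(f"two trials: {two_trials:.6f}")
```

Under these assumptions, requiring two significant trials is far more stringent than a single trial at 0.05, which is one way to read "statistically more persuasive evidence" from a single study.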

The regulations set the framework for hypothesis testing, Type 1 error control and making a decision using the statistical certainty of results

P-values are the product of the hypothesis test and reflect uncertainty under the assumption of no treatment effect

P-values are necessary but not sufficient to interpret the results of a study

Bias is an equally important statistical concept that is incorporated into the evaluation of substantial evidence
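To make the p-value statement above concrete, here is a minimal sketch of a two-sided p-value for a two-arm comparison of response rates. It assumes a pooled two-proportion z-test with a normal approximation, and the counts are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x_trt: int, n_trt: int, x_ctl: int, n_ctl: int) -> float:
    """Two-sided p-value for H0: equal response rates (pooled z-test).

    The p-value is computed assuming no treatment effect; it says nothing
    about bias in how the data were generated.
    """
    p_trt = x_trt / n_trt
    p_ctl = x_ctl / n_ctl
    pooled = (x_trt + x_ctl) / (n_trt + n_ctl)
    se = sqrt(pooled * (1 - pooled) * (1 / n_trt + 1 / n_ctl))
    z = (p_trt - p_ctl) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trial: 60/100 responders on drug vs. 45/100 on placebo.
p = two_proportion_p_value(60, 100, 45, 100)
print(f"two-sided p-value: {p:.4f}")
```

The calculation takes the data at face value. A biased design or biased conduct changes what the number means without changing how it is computed.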

Bias in a clinical study (impacts the p-value and the weight it carries)

The study design itself (e.g., crossovers)

The randomization strategy

The conduct

Blinding

Informative censoring, missing data

The analysis – changing pre-specified goals

The reporting and interpretation (what was intended by the protocol and all amendments)

Adaptive Designs: Selection Bias, Type 1 error control, estimation
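One item from the list above, Type 1 error control when data are examined more than once, can be illustrated by simulation. The sketch assumes a one-arm z-test with known unit variance, no true effect, and unadjusted testing at 0.05 at every look. The look schedule and simulation size are arbitrary choices, not taken from the talk.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_type1_error(looks, n_sims=4000, alpha=0.05, seed=1):
    """Share of null trials declared significant at ANY look, with no
    adjustment of the critical value for repeated testing."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for look in looks:
            while n < look:              # accrue data up to this look
                total += rng.gauss(0.0, 1.0)
                n += 1
            z = total / sqrt(n)          # z-statistic under known variance
            if abs(z) > z_crit:
                rejections += 1
                break                    # trial stops at first "significant" look
    return rejections / n_sims


single = simulate_type1_error([200])                  # one final analysis
repeated = simulate_type1_error([50, 100, 150, 200])  # three interim looks + final
print(f"single look:    {single:.3f}")
print(f"four looks:     {repeated:.3f}")
```

With a single analysis the simulated rate stays near the nominal 0.05. With repeated unadjusted looks it is clearly higher, which is why the analysis is expected to account for any interim analyses performed.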

Important issues in assessing evidence

How certain or uncertain are study results? Bias and chance

What are the estimates of treatment effects and how consistent or variable are they?

Confirmation of treatment effects

Interpreting inconsistent evidence

Collective evidence - putting it all together

Multiplicity control (strong and weak)

Study designs and data analysis

Safety vs. efficacy

Concepts of statistical evidence – not discussed in regulation

Frequentist: the probability of the data given the assumed hypothesis (e.g., effect size is zero); the range of parameters consistent with the data

P-values, confidence intervals, study power, type 1 error, number of hypotheses - reliance on null hypothesis

Frequency in the long run - over many studies

Bayesian: the probability of the hypothesis, given the data

Posterior probability, Bayes factor

Requires a prior assumption of the distribution of the hypothesis (assumed known vs estimated from data [empirical Bayes])

The interpretation of a single study

Characterizing Evidence

Controlling the long-run chance of a bad conclusion

for all drugs

for all studies within an application

The prior probability of effective therapies

The vote-counting approach: the number of studies that show a statistically significant finding

The collective evidence approach; integrated efficacy

Replication / Confirmation / Repeatability / Level of Uncertainty

When might a single RCT provide sufficient evidence; statistically more persuasive evidence

The concepts of Type 1 error and p-values are prominent in ‘Statistical Principles for Clinical Trials’ (ICH E9, 1998)

International guidance agreed to by the United States, European Union and Japan

Distinguishes exploratory from confirmatory studies

What is planned in the protocol

What does ICH E9 say about the Type 1 error and p-values?

Evidence from a single study: what is a statistically persuasive finding?

Lower type 1 error

Consider posterior probability of a true positive finding conditional on a statistically significant finding

Consider posterior probability of a true negative finding conditional on no statistically significant finding
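The two posterior probabilities listed above follow from Bayes' rule once a prior probability that the drug is effective, the study power, and the Type 1 error are fixed. The sketch below is a minimal illustration. The prior of 0.10, power of 0.80, and alpha values are assumptions chosen for the example, not figures from the talk.

```python
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """P(drug truly effective | statistically significant finding)."""
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(prior: float, power: float, alpha: float) -> float:
    """P(drug truly ineffective | no statistically significant finding)."""
    true_neg = (1 - prior) * (1 - alpha)
    false_neg = prior * (1 - power)
    return true_neg / (true_neg + false_neg)


# Assumed inputs: 10% of candidate drugs work, 80% power.
ppv_usual = positive_predictive_value(prior=0.10, power=0.80, alpha=0.025)
ppv_strict = positive_predictive_value(prior=0.10, power=0.80, alpha=0.001)
npv_usual = negative_predictive_value(prior=0.10, power=0.80, alpha=0.025)

print(f"PPV at alpha=0.025: {ppv_usual:.3f}")
print(f"PPV at alpha=0.001: {ppv_strict:.3f}")
print(f"NPV at alpha=0.025: {npv_usual:.3f}")
```

Holding the prior and power fixed, lowering the Type 1 error raises the probability that a significant single-study finding is a true positive, which ties the "lower type 1 error" item to the posterior-probability items.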

Hypothesis testing is the core framework for design and evaluation of randomized clinical trials – the basis for substantial evidence

This guidance concerns how to deal with multiple primary and secondary endpoints and type 1 error control

Testing hypotheses for each endpoint; Type 1 error control, both weak and strong; making claims on individual endpoints

Understanding these issues requires professional statisticians in the evaluation process
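As a small illustration of strong Type 1 error control across several endpoints, the sketch below applies the Holm step-down procedure. Holm is used here only as a familiar example of a method with strong familywise control. The slides do not name a specific procedure, and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns one reject/keep flag per endpoint,
    controlling the familywise error rate in the strong sense."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject


# Hypothetical p-values for three endpoints in one trial.
endpoint_p = [0.010, 0.040, 0.030]
unadjusted = [p <= 0.05 for p in endpoint_p]
adjusted = holm_reject(endpoint_p, alpha=0.05)

print("unadjusted claims:", unadjusted)
print("Holm-adjusted claims:", adjusted)
```

Tested separately at 0.05, all three endpoints would support a claim. After adjustment only the first does, which is the distinction that matters when claims are made on individual endpoints.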

Need for flexibility

Rare and orphan diseases where study size is small and may need to borrow strength from empirical sources (natural disease progression registries)

Pediatric extrapolation studies using adult data: the concept of different criteria to control false positive conclusions

Personalized medicine, small subsets, multiplicity, drawing strength: when, and the consequence of a false conclusion

Real world evidence – trading data set bias for relevance (electronic medical records)

Meta-analysis of RCTs for safety assessment

Extrapolation from adults to pediatrics

Interpretation of p-values for RCTs vs. observational studies: the role of bias

Observational studies are subject to bias because there is no randomization, and adjustments cannot overcome a design flaw

P values, statistical significance in safety assessment
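The point about p-values in observational studies can be shown with a toy simulation. It assumes a drug with no effect at all, and an unmeasured severity score that drives the outcome and, in the observational setting, also the chance of being treated. The data-generating model and all names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def naive_comparison(randomized: bool, n: int = 2000, seed: int = 7):
    """Unadjusted treated-vs-control comparison when the drug has NO effect.

    Returns (difference in mean outcome, two-sided p-value)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    treated, control = [], []
    for _ in range(n):
        severity = rng.gauss(0.0, 1.0)            # unmeasured confounder
        if randomized:
            gets_drug = rng.random() < 0.5        # coin-flip assignment
        else:
            # Sicker patients are more likely to receive the drug.
            gets_drug = rng.random() < std_normal.cdf(severity)
        outcome = severity + rng.gauss(0.0, 1.0)  # no drug term in the outcome
        (treated if gets_drug else control).append(outcome)
    diff = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    p_value = 2 * (1 - std_normal.cdf(abs(diff / se)))
    return diff, p_value


diff_obs, p_obs = naive_comparison(randomized=False)
diff_rct, p_rct = naive_comparison(randomized=True)
print(f"observational: diff={diff_obs:.2f}, p={p_obs:.4g}")
print(f"randomized:    diff={diff_rct:.2f}, p={p_rct:.4g}")
```

In the observational setting the naive comparison produces a large difference and a very small p-value even though the drug does nothing. The small p-value there reflects bias rather than a treatment effect.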

Concluding remarks

The role of p-values and type 1 error control is well understood and not misinterpreted in the statistical regulatory framework; clinical colleagues may differ

The regulatory review of protocols, and of completed clinical trials analyzed by sponsors, generally identifies problems with over-interpretation; confirmation is an important principle that is followed

There is flexibility in the application of statistical methodology appropriate to the problem – as sources and types of data expand, there is an increasing need for statistical talent to navigate the interplay of study design, bias, and statistical uncertainty in evidence generation
