Post on 04-Jan-2016

Statistical Principles for Clinical Research

Sponsored by:

NIH General Clinical Research Center

Los Angeles Biomedical Research Institute

at Harbor-UCLA Medical Center

November 1, 2007

Peter D. Christenson

Conducting Clinical Trials 2007

Speaker Disclosure Statement

The speaker has no financial relationships relevant to this presentation.

Recommended Textbook: Making Inference

Design issues

Biases

How to read papers

Meta-analyses

Dropouts

Non-mathematical

Many examples

Example: Harbor Study Protocol

18 Pages of Background and Significance, Preliminary Studies, and Research Design and Methods. Then:

“Pearson correlation, repeated measure of the general linear model, ANOVA analyses and student t tests will be used where appropriate. …

The [two] main parameters of interest will be … [A and B. For A, using a t-test] 40 subjects provide 80% assurance that a XX reduction … will be detected, with p<0.05.

Similar comparisons as for … [A and B] will be carried out …”

Example: Harbor Study Protocol

The good ….

“The [two] main parameters of interest will be … [A and B. For A, using a t-test,] 40 subjects provide 80% assurance that a XX reduction … will be detected, with p<0.05.”

Because:

• Explicit: Specifies primary outcome of interest.

• Explicit: Justification for # of subjects.

Example: Harbor Study Protocol

… the Bad …

“Pearson correlation, repeated measure of the general linear model, ANOVA analyses and student t tests will be used where appropriate. …”

Because:

• Boilerplate.

• These methods are almost always used.

• “Where appropriate”?

• Tries to satisfy reviewer, not science.

Example: Harbor Study Protocol

… and the Ugly.

“Similar comparisons as for … [A and B] will be carried out …”

Because:

• Primary (1°) analysis OK: difference between 2 visits for 2 measures, A & B.

• But, 15 measures taken at each of 19 visits.

• Torture the data long enough, and it will confess to something.
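To put a rough number on the "ugly" case, here is a minimal sketch of how false positives accumulate. It assumes, purely for illustration, one comparison per measure per visit (15 × 19 = 285 tests) and, for the second function only, that the tests are independent; the protocol does not say either. The function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of p < alpha results when no real effect exists."""
    return n_tests * alpha


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability of at least one false positive.

    Assumption: the tests are independent, which is rarely exactly true
    for repeated measures on the same subjects.
    """
    return 1 - (1 - alpha) ** n_tests


# Assumption: one test for each of 15 measures at each of 19 visits.
n_tests = 15 * 19
print(n_tests)
print(round(expected_false_positives(n_tests), 2))
print(round(chance_of_any_false_positive(n_tests), 4))
```

With that many comparisons, some "significant" findings are expected even if the treatment does nothing.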

Goals of this Presentation

More good.

Less bad.

Less ugly.

Biostatistical Involvement in Studies

Off-site statistical design and analysis

Multicenter studies; data coordinating center.

In-house drug company statisticians.

CRO through NIH or drug company.

Local study contracted elsewhere

e.g. UCLA, USC, CRO.

Local protocol, and statistical design and analysis

Occasionally multicenter.

Studies with Off-Site Biostatistics

Not responsible for statistical design and analysis.

Are responsible for study conduct that may:

• … impact analysis, believability of results.

• … reduce sensitivity (power) of the study to be able to detect effects.

Review of Basic Method of Inference

from Clinical Studies

Typical Study Data Analysis

Large enough “signal-to-noise ratio” → Proves an effect beyond a reasonable doubt. Often:

Ratio = Signal / Noise = Observed Effect / (Natural Variation/√N)

For a t-test comparing two groups:

t Ratio = Difference in Means / (SD/√N)

Degree of allowable doubt → How large t needs to be.

5% (p<0.05) → |t| > ~2
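The t ratio can be sketched in a few lines. This is an illustration only: it uses the standard pooled two-sample form, in which the slide's shorthand "SD/√N" becomes pooled SD × √(1/n₁ + 1/n₂). The data values and the function name are hypothetical.

```python
import math
import statistics


def t_ratio(group_a, group_b):
    """Pooled two-sample t ratio: difference in means over its standard error."""
    n_a, n_b = len(group_a), len(group_b)
    signal = statistics.mean(group_a) - statistics.mean(group_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    noise = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return signal / noise


# Hypothetical measurements, for illustration only.
treated = [12.1, 9.8, 11.4, 10.9, 13.0, 10.2]
control = [9.9, 8.7, 10.1, 9.2, 10.8, 8.9]
t = t_ratio(treated, control)
print(round(t, 2), "exceeds ~2" if abs(t) > 2 else "does not exceed ~2")
```

The "~2" threshold is approximate; the exact cutoff depends on the degrees of freedom and is larger for small studies.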

Meaning of p-value

p-value: Probability of a test statistic (ratio) that is at least as deviant as was observed, if there is really no effect.

Smaller p-values ↔ more evidence of effect.

Validity of p-value interpretation typically requires:

• Proper data generation, e.g., randomness.

• Subjects provide independent information.

• Data is not used in other statistical tests.

or: an accounting for not satisfying these criteria.

→ p-values are earned by satisfying these criteria appropriately.

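Analogy with Diagnostic Testing

Truth vs. Study Claims:

• Truth: No Effect; Study Claims: No Effect → Correct.

• Truth: No Effect; Study Claims: Effect → Error.

• Truth: Effect; Study Claims: No Effect → Error.

• Truth: Effect; Study Claims: Effect → Correct.

Specificity (correct claim when truth is No Effect): Set p≤0.05 → Specificity=95%.

Sensitivity (correct claim when truth is Effect) = Power: Maximize. Choose N for 80%.

Typical: Specificity=95%, Power=80%.

Analogy: True Effect ↔ Disease; Study Claim ↔ Diagnosis.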

Study Conduct Impacting Analysis

↓ effect detectability (and ↓ ratio) results from:

• Non-adherence of study personnel to the protocol in general. [Increases variation.]

• Enrolling subjects who do not satisfy inclusion or exclusion criteria. [E.g., no effect in 10% wrongly included & real effect=50% → ~0.9(50%) = 45% observed effect. Can decrease observed effect.]

• Subjects not completing entire study. [May decrease N, or give potentially conflicting results.]
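The dilution arithmetic in the wrongly-included-subjects example (10% wrongly included, real effect 50%, observed effect about 45%) can be written as a weighted average. This is a sketch under the slide's simplifying assumption that wrongly included subjects show no effect; the function name is hypothetical.

```python
def observed_effect(true_effect, fraction_wrongly_included, effect_in_wrong=0.0):
    """Average effect across correctly and wrongly included subjects."""
    correctly_included = 1.0 - fraction_wrongly_included
    return (correctly_included * true_effect
            + fraction_wrongly_included * effect_in_wrong)


# Values from the slide: 10% wrongly included, real effect = 50%.
print(round(observed_effect(0.50, 0.10), 2))
```

A smaller observed effect lowers the signal in the signal-to-noise ratio, so the same N detects it less often.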

Potentially Conflicting Results

Example: Subjects not completing the entire study.

Tigabine Study Results: How Believable?

[Figure: study results, with items labeled 1, 2, and 3.]

Conclusions differ depending on how non-completing subjects (24%) are handled in the analysis.

Primary analysis here is specified, but we would prefer robustness to the method of analysis (agreement), which is more likely with more completing subjects.

Study Conduct Impacting Analysis

Intention-to-Treat (ITT)

Continued …

ITT typically specifies that all subjects are included in analysis, regardless of treatment compliance or whether lost to follow-up.

Purposes: Avoid bias from subjective exclusions or differential exclusion between treatment groups; sometimes argued to mimic non-compliance in real world setting.

More emphasis on policy implications of societal effectiveness than on scientific efficacy.

Not appropriate for many studies.

Study Conduct Impacting Analysis

Intention-to-Treat (ITT)

Lost to follow-up:

Always minimize; no “real world” analogy as for treatment compliance.

Need to define outcomes for non-completing subjects.

Current Harbor study:

N≈1200 would need N≈3000 if ITT used, 20% lost, and lost counted as treatment failures.

ITT: Need to Impute Unknown Values

[Figure: two panels of change from baseline (vertical axis, 0 marked) at Baseline, Intermediate Visit, and Final Visit for individual subjects.]

LOCF (observations): Ignore presumed progression.

LRCF (ranks): Maintain expected relative progression.
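As an illustration of imputation for non-completers, here is a minimal sketch of LOCF, assuming the abbreviation has its usual meaning of "last observation carried forward" (the slide does not spell it out). The visit values are hypothetical, and the rank-based LRCF variant is not sketched here.

```python
def locf(visits):
    """Replace each missing value (None) with the most recent observed value.

    Values missing before any observation stay None.
    """
    filled = []
    last_seen = None
    for value in visits:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical change from baseline at baseline, intermediate, and final visits;
# this subject missed the final visit.
subject = [0.0, -2.5, None]
print(locf(subject))  # [0.0, -2.5, -2.5]
```

Carrying the intermediate value forward assumes no further change after the last visit attended, which is what "ignore presumed progression" refers to.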

Study Conduct Impacting Feasibility

Potential Effects of Slow Enrollment

• Needed N may be impossible → Study stopped.

• Competitive site enrollment → Local financial loss.

• Insufficient person-years (PY) of observation for some studies, even if N is attained:

[Figure: # of subjects (up to N) vs. year (0 to 2) for three enrollment rates; area = PY. Planned: detects effect=Δ. Slower: detects effect=1.1Δ. Yet slower: detects effect=1.7Δ.]

Local Protocols and Data Analysis

1. Develop protocol and data analysis plan.

2. Have randomization and blinding strategy, if study requires.

3. Data management.

4. Perform data analyses.

Local Data Analysis Resources

Biostatistician:

Peter Christenson, PChristenson@labiomed.org.

Develop study design, analysis plan.

Advise throughout for any study.

Perform all non-basic analyses.

Full responsibility for studies with funded %FTE.

Review some protocols for committees.

Data Management:

Database development for GCRC studies by database manager.

Statistical Components of Protocols

• Target population / source of subjects.

• Quantification of aims, hypotheses.

• Case definitions, endpoints quantified.

• Randomization plan, if any.

• Masking, if used.

• Study size: screen, enroll, complete.

• Use of data from non-completers.

• Justification of study size (power, precision, other).

• Methods of analysis.

• Mid-study analyses.

Selected Statistical Components and Issues

Case Definitions and Endpoints

• Primary case definitions and endpoints need careful thought.

• Will need to report results based on these.

Example: Study at Harbor

Definition of cure very strict.

Analyzed data with this definition.

Cure rates too low - would not be taken seriously.

Scientific method → need to report them; otherwise cherry-picking.

Publication: Use primary definition; explain; also report with secondary definition. Less credible.

Randomization

• Helps assure attributability of treatment effects.

• Blocked randomization assures approximate chronologic equality of numbers of subjects in each treatment group.

• Recruiters must not have access to randomization list.

• List can be created with a random number generator in software, printed tables in stat texts, or even shuffled slips of paper.
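A blocked randomization list of the kind described above can be generated with any random number generator. The sketch below uses Python's standard `random` module; the arm labels, block size, and seed are assumptions chosen for illustration, not values from the talk.

```python
import random


def blocked_randomization(n_blocks, arms=("A", "B"), per_arm_in_block=2, seed=None):
    """Build an assignment list in which every block has equal numbers per arm."""
    rng = random.Random(seed)  # a fixed seed makes the list reproducible
    assignments = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm_in_block  # e.g. A, B, A, B
        rng.shuffle(block)                     # random order within the block
        assignments.extend(block)
    return assignments


# Hypothetical example: 5 blocks of 4 subjects, two treatment arms.
schedule = blocked_randomization(n_blocks=5, seed=2007)
print(schedule)
```

Because each block is balanced, the group sizes never differ by more than half a block at any point in enrollment. The list should be generated and held by someone other than the recruiters.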

Non-completing Subjects

• Enrolled subjects are never “dropouts”.

• Protocol should specify:

– Primary analysis set (e.g., ITT or per-protocol).

– How final values will be assigned to non-completers.

• Time-to-event (survival analysis) studies may not need final assignments; use time followed.

• Study size estimates should incorporate the number of expected non-completers.

Study Size: Power

Power = Probability of detecting real effects of a specified minimal (clinically relevant) magnitude

• Power will be different for each outcome.

• Power depends on the statistical method.

• Five factors including power are inter-related.

Fixing four of these specifies the fifth:

– Study size

– Heterogeneity among subjects (SD)

– Magnitude of treatment effect to be detected

– Power to detect this magnitude of effect

– Acceptable chance of false positive conclusion, usually 0.05

Free Study Size Software

www.stat.uiowa.edu/~rlenth/Power

Free Study Size Software: Example

Pilot data: SD=8.19 in 36 subjects.

We propose N=40 subjects/group in order to provide 80% power to detect (p<0.05) an effect Δ of 5.2:
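The relationship among the five factors can be sketched with the usual normal-approximation formula for comparing two group means. This is not what the software above computes exactly (it works with the t distribution), so the result here, about 39 per group, comes out slightly below the N=40 quoted. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(sd, delta, power=0.80, alpha=0.05):
    """Approximate subjects per group for a two-sided, two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


def detectable_effect(sd, n, power=0.80, alpha=0.05):
    """Same formula solved for the effect size, given n per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sd * math.sqrt(2 / n)


# Values from the slide: SD = 8.19, effect of 5.2, 80% power, p < 0.05.
print(n_per_group(sd=8.19, delta=5.2))
print(round(detectable_effect(sd=8.19, n=40), 2))
```

Fixing any four of study size, SD, effect magnitude, power, and false positive rate determines the fifth, which is why the same formula can be solved in either direction.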

Study Size : May Not be Based on Power

Precision refers to how well a measure is estimated.

Margin of error = the ± value (half-width) of the 95% confidence interval.

Smaller margin of error ←→ greater precision.

To achieve a specified margin of error, solve the CI formula for N.

Polls: N ≈ 1000 → margin of error on % ≈ 1/√N ≈ 3%.
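The poll rule of thumb follows from the 95% confidence interval for a proportion, whose half-width is about 1.96 × √(p(1−p)/N); at p = 0.5 this is roughly 1/√N. A minimal sketch, with hypothetical function names and the worst case p = 0.5 as the default assumption:

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def margin_of_error(n, p=0.5):
    """Half-width of the 95% confidence interval for a proportion."""
    return Z95 * math.sqrt(p * (1 - p) / n)


def n_for_margin(margin, p=0.5):
    """CI formula solved for N: subjects needed for a given margin of error."""
    return math.ceil(p * (1 - p) * (Z95 / margin) ** 2)


print(round(margin_of_error(1000), 3))  # close to 1/sqrt(1000), about 3%
print(n_for_margin(0.03))
```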

Pilot Studies, Phase I, Some Phase II: Power not relevant; may have a goal of obtaining an SD for future studies.

Mid-Study Analyses

• Mid-study comparisons should not be made before study completion unless planned for (interim analyses). Early comparisons are unstable, and can invalidate final comparisons.

• Interim analyses are planned comparisons at specific times, usually by an unmasked advisory board. They allow stopping the study early due to very dramatic effects, and final comparisons, if study continues, are adjusted to validly account for “peeking”.

Continued …

Mid-Study Analyses

[Figure: observed effect (vertical axis, 0 marked) vs. number of subjects enrolled / time →, annotated “Too many analyses” and “Wrong early conclusion”.]

Need to monitor, but also account for many analyses.
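Why unplanned peeking matters can be seen in a small simulation. The sketch below assumes two groups with no true difference, known SD of 1, and an unadjusted |z| > 1.96 test at every look; the trial counts, look schedule, and function name are arbitrary choices for illustration.

```python
import math
import random


def false_positive_rates(n_trials=2000, n_per_group=100, look_every=10, seed=1):
    """Return (rate with repeated looks, rate with one final look) under no effect.

    Assumes n_per_group is a multiple of look_every, so the last look
    coincides with the planned end of the study.
    """
    rng = random.Random(seed)
    z_crit = 1.96
    any_look = 0
    final_only = 0
    for _ in range(n_trials):
        sum_a = sum_b = 0.0
        flagged = False
        z = 0.0
        for n in range(1, n_per_group + 1):
            sum_a += rng.gauss(0, 1)  # both groups drawn from the same distribution
            sum_b += rng.gauss(0, 1)
            if n % look_every == 0:
                z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
                if abs(z) > z_crit:
                    flagged = True    # "significant" at some look
        any_look += flagged
        final_only += abs(z) > z_crit  # z from the final look only
    return any_look / n_trials, final_only / n_trials


peeking, single = false_positive_rates()
print(peeking, single)
```

The single final analysis stays near the nominal 5%, while declaring an effect at any of the ten looks produces false positives far more often. Planned interim analyses avoid this by using stricter thresholds at each look.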

Mid-Study Analyses

• Mid-study reassessment of study size is advised for long studies. Only standard deviations to date, not effects themselves, are used to assess original design assumptions.

• Feasibility analysis:

– may use the assessment noted above to decide whether to continue the study.

– may measure effects, like interim analyses, by unmasked advisors, to project ahead on the likelihood of finding effects at the planned end of study.

Continued …

Mid-Study Analyses

Examples: Studies at Harbor

Randomized; not masked; data available to PI.

Compared treatment groups repeatedly, as more subjects were enrolled.

Study 1: Groups do not differ; plan to add more subjects.

Consequence → final p-value not valid; probability requires no prior knowledge of effect.

Study 2: Groups differ significantly; plan to stop study.

Consequence → use of this p-value not valid; the probability requires incorporating later comparison.

Multiple Analyses at Study End

Lagakos NEJM 354(16):1667-1669.

Replacing “Subgroup” with “Analysis” gives a similar problem.

Torturing data: false positive conclusions.

Multiple Analyses at Study End

• There are formal methods to incorporate the number of multiple analyses.

• Bonferroni

• Tukey

• Dunnett

• Transparency of what was done is most important.

• Should be aware of number of analyses and report it with any conclusions.
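Of the formal methods listed, Bonferroni is the simplest to show: each p-value is multiplied by the number of analyses performed (capped at 1) before comparing with 0.05. A minimal sketch with hypothetical p-values and a hypothetical function name:

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and whether each remains significant."""
    m = len(p_values)  # number of analyses actually performed
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant


# Hypothetical p-values from three analyses of the same study.
adjusted, significant = bonferroni([0.004, 0.03, 0.20])
print([round(p, 3) for p in adjusted])
print(significant)
```

The adjustment only works if the count of analyses is honest, which is why reporting that number matters.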

Summary: Bad Science That May Seem So Good

1. Re-examining data, or using many outcomes, seeming to be performing due diligence.

2. Adding subjects to a study that is showing marginal effects; or, stopping early due to strong results.

3. Examining effects in subgroups. See NEJM 2006 354(16):1667-1669.

Actually bad? Could be negligent NOT to do these, but need to account for doing them.

Statistical Software

Professional Statistics Software Package

Output

Enter code; syntax.

Stored data; accessible.

Microsoft Excel for Statistics

• Primarily for descriptive statistics.

• Limited output.

Almost Free On-Line Statistics Software

Run from browser; not local.

$5/ 6 months usage.

Potential HIPAA concerns

www.statcrunch.com

Supported by NSF

Typical Statistics Software Package

Select Methods from Menus

Output after menu selection

Data in spreadsheet

www.ncss.com

www.minitab.com

www.stata.com

$100 - $500

This and other biostat talks posted: http://gcrc.labiomed.org/biostat

Conclusions

Don’t put off slow enrollment; find the cause; solve it.

I am available.

Do put off analyses of efficacy, not of design assumptions.

I am available.

P-values are earned, by following methods which are needed for them to be valid.

I am available.

You may have to pay for lack of attention to protocol decisions, to satisfy the scientific method.

I am available.

Software always takes more time than expected.

Thank You

Nils Simonson, in Furberg & Furberg, Evaluating Clinical Research
