
Validation of Predictive Classifiers

Richard Simon, D.Sc., Chief, Biometric Research Branch

National Cancer Institute
http://linus.nci.nih.gov/brb

Biomarker = Biological Measurement

• Surrogate endpoint – a measurement made on a patient before, during, and after treatment to determine whether the treatment is working

• Prognostic factor – a measurement made before treatment that correlates with outcome, often for a heterogeneous set of patients

• Predictive factor – a measurement made before treatment to predict whether a particular treatment is likely to be beneficial

Prognostic Factors

• Most prognostic factors are not used because they are not therapeutically relevant

• Many prognostic factor studies use a convenience sample of patients for whom tissue is available. Generally the patients are too heterogeneous to support therapeutically relevant conclusions

Pusztai et al. The Oncologist 8:252-8, 2003

• 939 articles on “prognostic markers” or “prognostic factors” in breast cancer in past 20 years

• ASCO guidelines only recommend routine testing for ER, PR and HER-2 in breast cancer

• “With the exception of ER or progesterone receptor expression and HER-2 gene amplification, there are no clinically useful molecular predictors of response to any form of anticancer therapy.”

Predictive Biomarkers

• Most cancer treatments benefit only a minority of patients to whom they are administered

• Being able to predict which patients are likely to benefit would
– save patients from unnecessary toxicity, and enhance their chance of receiving a drug that helps them
– improve the efficiency of clinical development
– help control medical costs

• In new drug development, the role of a classifier is to select a target population for treatment
– The focus should be on evaluating the new drug, not on validating the classifier

• Adoption of a classifier to restrict the use of a treatment in wide use should be based on demonstrating that use of the classifier leads to better clinical outcome

• Targeted clinical trials can be much more efficient than untargeted clinical trials, if we know who to target

Developmental Strategy (I)

• Develop a diagnostic classifier that identifies the patients likely to benefit from the new drug

• Develop a reproducible assay for the classifier

• Use the diagnostic to restrict eligibility to a prospectively planned evaluation of the new drug

• Demonstrate that the new drug is effective in the prospectively defined set of patients determined by the diagnostic

Develop Predictor of Response to New Drug

Using phase II data, develop predictor of response to new drug

• Patient predicted responsive: randomize to New Drug vs Control

• Patient predicted non-responsive: off study

Evaluating the Efficiency of Strategy (I)

• Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004.

• Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005.

• reprints at http://linus.nci.nih.gov/brb

• For Herceptin, even a relatively poor assay enabled conduct of a targeted phase III trial which was crucial for establishing effectiveness

• Recent results with Herceptin in early stage breast cancer show dramatic benefits for patients selected to express Her-2
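The efficiency argument can be illustrated with a back-of-the-envelope sample-size calculation. This is a sketch using the standard two-sample normal approximation, not the exact formulas of the papers cited above; the fraction of sensitive patients and the effect size are assumed values chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sigma=1.0, alpha=0.05, power=0.9):
    """Patients per arm for a two-sample comparison of means
    (two-sided test, normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)

gamma = 0.25   # assumed fraction of patients who are treatment-sensitive
delta = 0.5    # assumed effect size (in sigma units) among sensitive patients

n_targeted = n_per_arm(delta)            # enroll only classifier-positive patients
n_untargeted = n_per_arm(gamma * delta)  # effect diluted to gamma * delta overall

# If the drug helps only the sensitive subset, the untargeted trial needs
# roughly 1/gamma**2 times as many randomized patients as the targeted one.
print(n_targeted, n_untargeted)
```

The targeted design still has to screen about n_targeted / gamma patients to find its enrollees, but screening is usually far cheaper than randomizing and following patients.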

Developmental Strategy (II)

Develop Predictor of Response to New Rx

• Predicted responsive to New Rx: randomize to New Rx vs Control

• Predicted non-responsive to New Rx: randomize to New Rx vs Control

Developmental Strategy II

• Do not use the diagnostic to restrict eligibility, but to structure a prospective analysis plan.

• Compare the new drug to the control for classifier-positive patients
– If p+ > 0.05, make no claim of effectiveness
– If p+ ≤ 0.05, claim effectiveness for the classifier-positive patients, and test the treatment effect for classifier-negative patients at the 0.05 level
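The prospective analysis plan above can be sketched as a small decision function. This is an illustrative paraphrase of the slide's logic; the function name and return strings are mine, not from the source.

```python
def design_ii_claim(p_pos, p_neg, alpha=0.05):
    """Sketch of the Design II analysis plan.

    p_pos: p-value for new drug vs control among classifier-positive patients
    p_neg: p-value among classifier-negative patients (tested only after a
           positive result in the classifier-positive subset)
    """
    if p_pos > alpha:
        return "no claim of effectiveness"
    if p_neg <= alpha:
        return "effective for classifier-positive and classifier-negative patients"
    return "effective for classifier-positive patients"
```

Because the negative subset is tested only after the positive subset succeeds, the procedure is a closed sequential test and preserves the overall error rate.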

Key Features of Design (II)

• The purpose of the RCT is to evaluate treatment T vs C for the two pre-defined subsets defined by the binary classifier; not to re-evaluate the components of the classifier, or to modify, refine or re-develop the classifier

Guiding Principle

• The data used to develop the classifier must be distinct from the data used to test hypotheses about treatment effect in subsets determined by the classifier
– Developmental studies are exploratory
– Studies on which treatment effectiveness claims are to be based should be definitive studies that test a treatment hypothesis in a patient population completely pre-specified by the classifier

Adaptive Signature Design

An adaptive design for generating and prospectively testing a gene expression signature for sensitive patients

Boris Freidlin and Richard Simon, Clinical Cancer Research 11:7872-8, 2005

Adaptive Signature Design: End-of-Trial Analysis

• Compare E to C for all patients at significance level 0.04
– If the overall H0 is rejected, claim effectiveness of E for all eligible patients

• Otherwise:
– Using only the first half of patients accrued during the trial, develop a binary classifier that predicts the subset of patients most likely to benefit from the new treatment E compared to control C
– Compare E to C for patients accrued in the second stage who are predicted responsive to E based on the classifier

• Perform this test at significance level 0.01

• If H0 is rejected, claim effectiveness of E for the subset defined by the classifier
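The end-of-trial analysis above amounts to splitting the overall 0.05 alpha between the two tests. A minimal sketch of the decision logic (names and return strings are mine, not from the source):

```python
def adaptive_signature_claim(p_overall, p_subset,
                             alpha_overall=0.04, alpha_subset=0.01):
    """End-of-trial analysis sketch for the adaptive signature design.

    The 0.05 alpha is split: 0.04 for the all-patients comparison and 0.01
    for the classifier-positive second-stage patients (tested only if the
    overall test fails).
    """
    if p_overall <= alpha_overall:
        return "E effective for all eligible patients"
    if p_subset <= alpha_subset:
        return "E effective for classifier-defined sensitive subset"
    return "no claim of effectiveness"
```

Note that the subset test uses only second-stage patients, who played no role in developing the classifier, so the guiding principle above (developmental data distinct from test data) is respected within a single trial.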

Biomarker Adaptive Threshold Design

Wenyu Jiang, Boris Freidlin & Richard Simon

JNCI 99:1036-43, 2007
http://linus.nci.nih.gov/brb

Biomarker Adaptive Threshold Design

• Randomized pivotal trial comparing new treatment E to control C

• Survival or DFS endpoint

• Have identified a univariate biomarker index B thought to be predictive of patients likely to benefit from E relative to C

• Eligibility not restricted by biomarker

• No threshold for biomarker determined

• Biomarker value scaled to range (0,1)

Evaluating a Classifier

• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data.

• When the number of candidate predictors (p) exceeds the number of cases (n), perfect prediction on the same data used to create the predictor is always possible
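The p > n claim is easy to demonstrate: a minimum-norm least-squares fit to pure noise classifies its own training data perfectly. A minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                    # more candidate predictors than cases
X = rng.standard_normal((n, p))   # pure noise, unrelated to the labels
y = rng.integers(0, 2, size=n)    # random binary "outcomes"

# With p > n, X has full row rank almost surely, so the minimum-norm
# least-squares solution reproduces y exactly on the training data.
w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
train_acc = ((X @ w > 0.5).astype(int) == y).mean()
print(train_acc)  # 1.0 despite there being no real signal at all
```

This is why resubstitution accuracy is worthless as evidence for a high-dimensional classifier: perfect separation of the training data is guaranteed by geometry alone.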

Evaluating a Classifier

• Validation does not mean that repeating classifier process results in similar gene sets

• Validation means predictions for independent cases are accurate

Internal Validation of a Predictive Classifier

• Split-sample validation
– Often applied with too small a validation set
– Don’t combine the training and validation set
– Don’t validate multiple models and select the “best”

• Cross-validation
– Often misused by pre-selection of genes

Split Sample Approach

• Separate training set of patients from test set

• Patients should represent those eligible for a clinical trial that asks a therapeutically relevant question

• Do not access information about patients in test set until a single completely specified classifier is agreed upon based on the training set data

Re-Sampling Approach

• Partition data into training set and test set

• Develop a single fully specified classifier of outcome on training set

• Use the classifier to predict outcome for patients in the test set and estimate the error rate

• Repeat the process for many random training-test partitions

• Re-sampling is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates the process

• With proper re-sampling, the model must be developed from scratch for each training set. This means that gene selection must be repeated for each training set.

• Re-sampling, e.g. leave-one-out cross-validation, is widely misunderstood even by statisticians and widely misused in the published clinical literature

• It is only applicable when there is a completely pre-defined algorithm for gene selection and classifier development that can be applied blindly to each training set
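The warning above can be demonstrated on simulated null data: when gene selection uses all samples, leave-one-out cross-validation reports high accuracy even though no gene carries any signal. This is a sketch; the nearest-centroid classifier and the dimensions are illustrative choices, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 2000, 10                 # samples, candidate genes, genes kept
X = rng.standard_normal((n, p))        # pure noise: no gene is informative
y = np.tile([0, 1], n // 2)            # arbitrary class labels

def select_genes(X, y, k):
    # pre-defined rule: keep the k genes with the largest mean difference
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def nearest_centroid(Xtr, ytr, x):
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

def loocv_accuracy(X, y, fixed_genes=None):
    hits = 0
    for i in range(len(y)):
        tr = np.delete(np.arange(len(y)), i)
        # proper LOOCV re-runs gene selection on each fold's training set
        genes = fixed_genes if fixed_genes is not None else select_genes(X[tr], y[tr], k)
        hits += nearest_centroid(X[tr][:, genes], y[tr], X[i, genes]) == y[i]
    return hits / len(y)

# WRONG: genes selected once using ALL samples, then "cross-validated"
biased = loocv_accuracy(X, y, fixed_genes=select_genes(X, y, k))
# RIGHT: gene selection repeated from scratch inside every training fold
proper = loocv_accuracy(X, y)
print(biased, proper)  # biased is far above the 0.5 chance level
```

The only difference between the two estimates is whether the held-out sample influenced gene selection, which is exactly the pre-selection misuse the slide describes.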

Myth

• Split sample validation is superior to LOOCV or 10-fold CV for estimating prediction error

Types of Clinical Outcome

• Survival or disease-free survival

• Response to therapy

• 90 publications identified that met criteria
– Abstracted information for all 90

• Performed detailed review of statistical analysis for the 42 papers published in 2004

Major Flaws Found in Studies Published in 2004

• Inadequate control of multiple comparisons in gene finding
– 9/23 studies had unclear or inadequate methods to deal with false positives
– 10,000 genes x .05 significance level = 500 false positives

• Misleading report of prediction accuracy
– 12/28 reports based on incomplete cross-validation

• Misleading use of cluster analysis
– 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes

• 50% of studies contained one or more major flaws
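The 10,000 x .05 arithmetic can be checked by simulation. This is a sketch using a normal approximation for the two-sample p-values; Bonferroni is shown as one standard correction for comparison, not a method the slides prescribe.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n_genes, m = 10_000, 20
# expression for two classes with NO truly differential genes
a = rng.standard_normal((m, n_genes))
b = rng.standard_normal((m, n_genes))

# two-sample statistics and two-sided normal-approximation p-values
se = np.sqrt(a.var(axis=0, ddof=1) / m + b.var(axis=0, ddof=1) / m)
z = (a.mean(axis=0) - b.mean(axis=0)) / se
pvals = np.array([2 * (1 - 0.5 * (1 + erf(abs(v) / sqrt(2)))) for v in z])

naive = int((pvals < 0.05).sum())                 # on the order of 500 hits
bonferroni = int((pvals < 0.05 / n_genes).sum())  # stringent correction: ~0
print(naive, bonferroni)
```

Every one of the "naive" hits is a false positive by construction, which is why gene-finding studies need explicit multiple-comparison control.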

Validation of Predictive Classifiers for Use with Available Treatments

• Should establish that the classifier is reproducibly measurable and has clinical utility
– Better patient outcome or equivalent outcome with less morbidity
– Improvement relative to available staging tools

Developmental vs Validation Studies

• Developmental studies should select patients sufficiently homogeneous for addressing a therapeutically relevant question

• Developmental studies should develop a completely specified classifier

• Developmental studies should provide an unbiased estimate of predictive accuracy
– Statistical significance of association between prediction and outcome is not the same as predictive accuracy

Limitations to Developmental Studies

• Sample handling and assay conduct are performed under controlled conditions that do not incorporate real world sources of variability

• Poor analysis may result in biased estimates of prediction accuracy

• Small study size limits precision of estimates of predictive accuracy
– Cases may be unrepresentative of patients at other sites

• Developmental studies may not estimate to what extent predictive accuracy is greater than that achievable with standard prognostic factors

• Predictive accuracy alone does not establish clinical utility

Independent Validation Studies

• Predictive classifier completely pre-specified

• Patients from different clinical centers

• Specimen handling and assay simulates real world conditions

• Study addresses medical utility of new classifier relative to practice standards

Types of Clinical Utility

• Identify patients whose prognosis is sufficiently good without cytotoxic chemotherapy

• Identify patients who are likely to benefit from a specific therapy or patients who are unlikely to benefit from it

Establishing Clinical Utility

• Develop prognostic classifier for patients not receiving cytotoxic chemotherapy

• Identify patients for whom
– current practice standards imply chemotherapy
– the classifier indicates very good prognosis without chemotherapy

• Withhold chemotherapy to test predictions

Prospectively Planned Validation Using Archived Materials

Oncotype-Dx

• Fully specified classifier developed using data from NSABP B20, applied prospectively to frozen specimens from NSABP B14 patients who received Tamoxifen for 5 years

• Long term follow-up available

• Good risk patients had very good relapse-free survival

Prospective Validation Design

• Randomize patients with node negative ER+ breast cancer receiving TAM to chemotherapy vs classifier determined therapy

• Determine whether the classifier-determined arm has outcome equivalent to the arm in which all patients receive chemotherapy
– Therapeutic equivalence trial

• Gold standard but rarely performed
– Very inefficient because most patients get the same treatment in both arms, so the trial must be sized to detect a minuscule difference in outcome

• Measure the classifier for all patients and randomize only those for whom classifier-determined therapy differs from standard of care

M-rx vs SOC

• SOC involves chemotherapy; M-rx does not
– Validation by withholding chemotherapy and observing outcome of cases in a single-arm study

• SOC does not involve chemotherapy; M-rx does
– Validation by withholding chemotherapy and observing outcome in a single-arm study?
– Validation by randomizing chemo vs no chemo

US Intergroup Study

• OncotypeDx risk score <15: Tam alone

• OncotypeDx risk score >30: Tam + Chemo

• OncotypeDx risk score 15-30: randomize to Tam vs Tam + Chemo
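The three-way triage above can be written as a tiny assignment function. This is a sketch; the handling of scores exactly at the 15 and 30 boundaries is my assumption, not specified on the slide.

```python
def intergroup_arm(recurrence_score):
    """Treatment assignment sketch for the US Intergroup study design
    described above (thresholds 15 and 30 from the slide; boundary
    handling at exactly 15 and 30 is assumed)."""
    if recurrence_score < 15:
        return "Tam alone"
    if recurrence_score > 30:
        return "Tam + Chemo"
    return "randomize: Tam vs Tam + Chemo"
```

Only the intermediate-risk group is randomized, so the trial concentrates its resources on the patients for whom the classifier leaves the treatment decision genuinely uncertain.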

Key Steps in Development and Validation of Therapeutically Relevant Genomic Classifiers

• Develop classifier for addressing a specific important therapeutic decision
– Patients sufficiently homogeneous and receiving uniform treatment so that results are therapeutically relevant
– Treatment options and costs of mis-classification such that a classifier is likely to be used

• Perform internal validation of the classifier to assess whether it appears sufficiently accurate relative to standard prognostic factors that it is worth further development

• Translate the classifier to a platform that would be used for broad clinical application

• Demonstrate that the classifier is reproducible

• Independent validation of the completely specified classifier in a prospectively planned study

Types of Clinical Utility

• Identify patients whose prognosis is sufficiently good without cytotoxic chemotherapy
– Identify patients whose prognosis is so good on standard therapy S that they do not need additional treatment T

• Identify patients who are likely to benefit from a specific systemic therapy and/or patients who are unlikely to benefit from it

Validation Study for Identifying Patients Who Benefit from a Specific Regimen

• Standard treatment S

• Test treatment T (e.g. S+X)

• Classifier based on previous data for identifying patients who benefit from T relative to S

• Randomized study of S vs T

• Endpoint is an accepted measure of patient benefit

• Measure the classifier on all patients

• Compare T vs S separately within classifier-positive and classifier-negative patients
– Establish that T is better than S for classifier-positive patients but not for classifier-negative patients

• Approach is not feasible when T is curative for some patients and S is not

• Under these circumstances, a classifier with less than perfect negative predictive value is probably not acceptable

• These studies are best accomplished prior to approval of the new drug T or using archived specimens from the pivotal trials leading to approval of T

Acknowledgements

• Kevin Dobbin

• Boris Freidlin

• Aboubakar Maitournam

• Annette Molinaro

• Michael Radmacher

• Yingdong Zhao
