Statistical Review: Recursive Partitioning Identifies Patients at High and Low
Risk for Ipsilateral Tumor Recurrence After Breast-Conserving Surgery and Radiation
Freedman, Hanlon, Fowble, Anderson, and Nicolaou. JCO, October 2002
• Analytic Method: Recursive Partitioning Analysis (RPA)
• A "supervised classification" method
• General ideas of RPA
– Build a "tree" for diagnostic profiling that can distinguish among groups of patients
– Examples:
• Useful for diagnosing based on symptom profiles rather than a more invasive approach
• Useful for predicting survival based on symptom profile
– Variables are selected based on their ability to "differentiate" types of patients.
• In some cases, you might want to differentiate sub-types (e.g., build molecular profiles to differentiate squamous versus adenocarcinoma of the lung)
• In this case, differentiation is based on length of time to IBTR (a survival outcome).
How is the tree built?
• The root node contains the whole sample.
• From there, the tree is "grown."
• The root node is partitioned into two nodes in the next layer using the predictor variable that makes the best separation based on the log-rank statistic. This may cause a continuous variable to be dichotomized (e.g., age < 55 versus age ≥ 55).
• For each branch, the algorithm then looks for the next variable that creates the broadest separation.
• The aim is to make the "terminal nodes" (i.e., the nodes that have no offspring) as homogeneous as possible.
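The growing step above can be sketched in code. The sketch below is not the authors' implementation; it hand-rolls the two-sample log-rank chi-square statistic and scans candidate cutpoints of a continuous predictor (age, as in the paper's age < 55 split) to find the dichotomization that best separates time-to-IBTR. A full RPA would apply this search to every predictor at every node.

```python
def logrank_chi2(times1, events1, times2, events2):
    """Two-sample log-rank chi-square statistic (1 d.f.)."""
    data = [(t, e, 1) for t, e in zip(times1, events1)] + \
           [(t, e, 2) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in data if e})
    o_minus_e, variance = 0.0, 0.0
    for t in event_times:
        # subjects still at risk just before time t, by group
        n1 = sum(1 for tt, _, g in data if tt >= t and g == 1)
        n2 = sum(1 for tt, _, g in data if tt >= t and g == 2)
        n = n1 + n2
        d1 = sum(1 for tt, e, g in data if tt == t and e and g == 1)
        d = d1 + sum(1 for tt, e, g in data if tt == t and e and g == 2)
        if n < 2:
            continue
        o_minus_e += d1 - d * n1 / n                      # observed - expected
        variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return (o_minus_e ** 2) / variance if variance > 0 else 0.0

def best_split(ages, times, events):
    """Scan dichotomizations (age < t vs. age >= t); return (threshold, chi2)."""
    best = (None, 0.0)
    for t in sorted(set(ages))[1:]:          # skip the minimum: empty left group
        left = [i for i, a in enumerate(ages) if a < t]
        right = [i for i, a in enumerate(ages) if a >= t]
        chi2 = logrank_chi2([times[i] for i in left], [events[i] for i in left],
                            [times[i] for i in right], [events[i] for i in right])
        if chi2 > best[1]:
            best = (t, chi2)
    return best
```

On toy data where patients aged 55 and over relapse much earlier, the scan recovers 55 as the best cutpoint.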
When does it stop?
• It MUST stop if
– All predictors have the same values for all subjects within a node
– There is only one observation in each node
– All subjects in a node have the same outcome
• "Backward pruning"
– Test statistics can be used to assess which splits are statistically significant. For example, the log-rank statistic can be used to assess whether a split should be "pruned."
– Zhang et al. (Statistics in Medicine, 1995) examine each tree to see
• Which splits are superficial?
• Which splits are scientifically unreasonable?
• Which splits might require more data?
– The pruning procedure is NOT completely automatic!
– It is unclear if any pruning was done in the Freedman article. If it was done, it was not explained and no guidelines for pruning were provided.
“A too large tree is usually useless - too small nodes do not allow us to make sensible statistical inference and the result is rarely scientifically interpretable. Therefore, we first compute the whole tree (maybe too fine) and afterwards we employ a technique called pruning. It goes from the bottom up and finds a subtree that is most "predictive" of the outcome and least vulnerable to the noise in the data.”
(P. Cizek, K. Komorad, http://adela.karlin.mff.cuni.cz/~xplore/help/stree.html)
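The bottom-up pruning idea in the quote can be sketched with a deliberately simple automated rule (an assumption for illustration, not the procedure used by Zhang et al. or Freedman et al., which also involves scientific judgment): store the tree as nested dicts, prune children first, then collapse any bottom-level split whose log-rank chi-square falls below the 1-d.f. 0.05 cutoff of 3.84.

```python
CHI2_CUTOFF = 3.84  # 1-d.f. chi-square critical value at alpha = 0.05

def prune(node):
    """Bottom-up pruning: prune children first, then collapse weak splits."""
    if "left" not in node:                   # leaf node: nothing to prune
        return node
    node["left"] = prune(node["left"])
    node["right"] = prune(node["right"])
    # collapse this split if both children are leaves and the statistic is weak
    if "left" not in node["left"] and "left" not in node["right"] \
            and node["chi2"] < CHI2_CUTOFF:
        return {"label": node["label"]}
    return node

# hypothetical tree: a strong root split and a weak split below it
tree = {"label": "root", "chi2": 12.0,
        "left": {"label": "young", "chi2": 1.2,
                 "left": {"label": "A"}, "right": {"label": "B"}},
        "right": {"label": "old"}}
pruned = prune(tree)
```

The weak split under "young" is collapsed into a terminal node, while the strong root split survives.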
Prognostic indicators of IBTR:
– age (as a continuous variable)
– menopausal status
– race
– family history
– method of detection
– presence of EIC
– margin status
– ER status
– number of positive lymph nodes
– histology
– lobular carcinoma-in-situ (LCIS)
– use of chemotherapy
– use of tamoxifen
[Figure: RPA tree with 10-year IBTR rates in blue (ranging from 2% to 34%) and 95% confidence intervals in parentheses; several intervals include negative values, e.g., (-3, 9) and (-8, 76).]
Problems with this approach
• Many of the candidate predictors (age as a continuous variable, menopausal status, race, family history, margin status, ER status, number of positive lymph nodes, histology, LCIS) are known risk factors for IBTR.
• These factors are strongly predictive of whether or not a patient receives tamoxifen and/or chemotherapy.
• Why? Oncologists tend to give adjuvant treatment to patients at high risk of recurrence.
• As a result:
– Low-risk women do not receive adjuvant therapy
– High-risk women do receive adjuvant therapy
Example

                        IBTR rate
High risk, no therapy      25%
High risk, therapy         15%
Low risk, no therapy        5%
Low risk, therapy           4%

In the confounded comparison, we end up comparing the treated (mostly high-risk) women with the untreated (mostly low-risk) women and concluding that the difference is due to therapy.
Adjuvant therapy is confounded with risk (i.e., those at high risk are more likely to get adjuvant therapy).
High-risk women may still tend to have IBTR even with tamoxifen or chemotherapy, and their rates may remain higher than those of low-risk women. This could make it appear that adjuvant therapy is related to poor IBTR outcomes!
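The confounding can be shown with arithmetic. The per-stratum IBTR rates come from the example table above; the strata sizes and therapy-assignment probabilities are made-up assumptions chosen so that, as in the paper's setting, high-risk women are far more likely to be treated.

```python
# Hypothetical strata sizes and therapy-assignment rates (assumptions);
# IBTR rates per stratum are taken from the example table above.
n_high, n_low = 1000, 1000
p_tx_high, p_tx_low = 0.8, 0.1            # oncologists treat high-risk more
ibtr = {("high", True): 0.15, ("high", False): 0.25,
        ("low", True): 0.04, ("low", False): 0.05}

def marginal_rates():
    """IBTR rate among all treated women vs. all untreated women."""
    counts = {("high", True): n_high * p_tx_high,
              ("high", False): n_high * (1 - p_tx_high),
              ("low", True): n_low * p_tx_low,
              ("low", False): n_low * (1 - p_tx_low)}
    tx_n = sum(counts[k] for k in counts if k[1])
    tx_events = sum(counts[k] * ibtr[k] for k in counts if k[1])
    no_n = sum(counts[k] for k in counts if not k[1])
    no_events = sum(counts[k] * ibtr[k] for k in counts if not k[1])
    return tx_events / tx_n, no_events / no_n
```

Within each risk stratum therapy lowers the IBTR rate, yet the marginal rate among treated women (about 13.8%) exceeds that among untreated women (about 8.6%): therapy looks harmful purely because of who receives it.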
As a result…
• The authors conclude that only a modest effect is seen from tamoxifen.
• Chemotherapy does not appear in the tree (it is not predictive of outcomes based on the model).
• For women less than 35, the model suggests that chemotherapy and/or tamoxifen do not affect outcomes.
Other Issues
• Statistically different "nodes"?
– Nowhere is it stated how "different" the risk groups are.
• Confounding
– Age < 55 is very similar to including menopause as a predictor. Which is more relevant?
– Are other predictors highly associated?
• Varying times of treatment: 1979 to 1995. Are these women comparable in terms of outcomes?
• Two sites. Are these sites similar enough to not adjust for site?
• No validation! For prediction models such as this, validation is a crucial part of the analytic process. Validation wasn't performed, so the classifications may be very unstable.
IBTR rates
• The tree was determined and then the 8 risk groups were explored using "Kaplan-Meier analysis"
• Rates of 10 year IBTR are not so different, and precision is rather poor (see tree).
• There doesn’t appear to be much reason to declare that these risk groups are different.
• Negative rates in confidence intervals!?
How could we do the “right” thing in this case?
• Hire a statistician!
• Stratify by adjuvant therapy
• Two or three separate trees
– Adjuvant chemotherapy*
– Adjuvant tamoxifen*
– No adjuvant therapy
• Does it seem unfair?
– But the trial isn't randomized!
– Adjuvant therapy assignment is not at random.
• Model validation is needed!!!
Prediction models
• Association and prediction are two different concepts in medical research.
• Association implies that two things are related, yet one might not be precisely predicted from the other.
• Prediction models require “tight” correlation or association.
[Figure: four scatter plots of HbA1c versus blood glucose (glucose 100-200) with progressively weaker correlation: r = 0.98 (p = 8.1e-148), r = 0.93 (p = 5.7e-86), r = 0.68 (p = 2.2e-28), r = 0.39 (p = 1.7e-08). All four associations are highly significant, but only the tightest correlations would support predicting one measure from the other.]
Example in this study…
(I made up this data, but it is somewhat consistent with that presented)

             EIC+   EIC-   Total
Relapse        20     43      63
No relapse     36    801     837
Total          56    844     900

Odds ratio = 10.3
95% confidence interval (5.2, 20.1)
p < 0.0001

VERY STRONG ASSOCIATION!
Example in this study…
(I made up this data, but it is somewhat consistent with that presented)

             EIC+   EIC-   Total
Relapse        20     43      63
No relapse     36    801     837
Total          56    844     900

Sensitivity = 20/63 = 32%
Specificity = 801/837 = 96%
PPV = 20/56 = 36%
NPV = 801/844 = 95%

Not a very good prognostic factor.
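The two readings of the same 2x2 table can be computed side by side. The confidence interval below uses the standard Woolf (log odds ratio) method, which is an assumption; the slide's interval (5.2, 20.1) may come from a slightly different method, so the bounds agree only approximately.

```python
import math

# 2x2 table from the (made-up) example: rows = relapse status, cols = EIC
a, b = 20, 43      # relapse:    EIC+, EIC-
c, d = 36, 801     # no relapse: EIC+, EIC-

# association view: odds ratio with a Woolf (log-scale) 95% CI
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

# prediction view: operating characteristics of EIC as a classifier
sensitivity = a / (a + b)        # 20/63
specificity = d / (c + d)        # 801/837
ppv = a / (a + c)                # 20/56
npv = d / (b + d)                # 801/844
```

The same numbers support "very strong association" (OR about 10) and "not a very good prognostic factor" (sensitivity 32%, PPV 36%) at once.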
• You need much larger sample sizes to build predictive models than to prove that associations exist.
• But the key is that there must be little variability due to factors outside the model.
• For measuring association, increasing your sample size always helps because you are dealing with population averages: you can always shrink standard errors for parameters (std error = s/√N).
• For prediction models, increasing sample size will not improve your predictive ability.
• The key is to control for all factors influencing the outcome.
• For linear regression: the difference in interpretation between a p-value and the R².
• For a 2x2 table: the difference in interpretation between an odds ratio and sensitivity or specificity.
Validation
• A crucial aspect of prediction models is VALIDATION.
• When you apply a model to the data from which it was derived, it ALWAYS looks better than it really is.
• "OVERFITTING"
• Questions to consider
– Is the model too specific to the sample on which it was derived?
– How well will the model perform in predicting results in a different sample?
• From the same population?
• From a slightly different population?
Approaches to Validation
• Get another dataset and try your model
• Perform validation methods using the dataset at hand
– Leave-k-out validation:
• Iteratively remove k individuals and refit the model
• Compare all the resulting models
– Splitting the dataset…
Splitting the data
• Could have been used by Freedman et al. for a more convincing story
• Divide the data into two parts, 1/3 and 2/3, at random.
• Fit the model using the 2/3 portion of the data.
• Apply the model to the 1/3 portion and check for predictability and differences from the 2/3 portion.
• Repeat this process.
• Results from the repetitions will give a sense of the stability of the model.
• "Consensus tree":
– Look across all the trees that were created and pick the one that agrees with most of them. It will usually be smaller than the individual trees, but more representative and less driven by noise.
• The data-splitting approach is now commonly seen in gene expression profiling models (Alizadeh, Nature, 2000)
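The repeated 2/3-1/3 splitting procedure can be sketched generically. The "model" here is a deliberately trivial threshold rule on a single predictor (an assumption for illustration, not the paper's tree); the point is the scaffolding: random splits, refit on the 2/3 portion, score on the held-out 1/3, and examine the spread of scores for stability.

```python
import random

def split_validate(data, fit, score, n_rep=20, test_frac=1 / 3, seed=0):
    """Repeated random 2/3-1/3 split validation; returns held-out scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rep):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = fit(train)                    # refit on the 2/3 portion
        scores.append(score(model, test))     # evaluate on the 1/3 portion
    return scores

def fit(train):
    """Toy 'model': threshold at the midpoint of the two class means."""
    m0 = sum(x for x, y in train if y == 0) / max(1, sum(1 for _, y in train if y == 0))
    m1 = sum(x for x, y in train if y == 1) / max(1, sum(1 for _, y in train if y == 1))
    return (m0 + m1) / 2

def score(thresh, test):
    """Held-out accuracy of the threshold rule."""
    return sum((x > thresh) == (y == 1) for x, y in test) / len(test)

# toy data with overlapping classes, so accuracy is imperfect and varies
data = [(x, 0) for x in range(100)] + [(x, 1) for x in range(50, 150)]
scores = split_validate(data, fit, score)
```

The spread of `scores` across repetitions is exactly the stability check the slide calls for; a model whose held-out accuracy jumps around between splits is too unstable to trust.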