Using CART to Unravel Clusters for the Testing of Interactions in Asthma Databases

Ben Trzaskoma

Sr Statistical Scientist

Genentech, Inc

Introduction

• Ultimate Purpose: Determine subgroups with greatest “Drug advantage” (versus Placebo)

• Variable clustering

− Used to identify groups of variables

− Created composite scores summarizing groups of variables

− Determined “Drug benefit” for each composite score

• Patient (case) clustering

− Used to identify subsets of patients

− Determined “Drug benefit” for each subset of patients

Clinical Trial Dataset

• Phase IIIb multicenter, randomized, double-blind, placebo-controlled study

• Original purpose of the study:

− Evaluate efficacy and safety of Asthma Drug

− Subjects have moderate to severe asthma

− Asthma is inadequately controlled with ICS and LABA

• 850 patients age 12-75 from about 150 sites were enrolled and followed for 48 weeks

− Half given Drug + ICS + LABA

− Half given placebo + ICS + LABA

Methods

• Two approaches:

• Variable Clustering (Proc Varclus in SAS)

• Related to factor analysis

• Looks for relationship among variables

• Patient (Case) Clustering (Proc Cluster in SAS)

• Looks for groups of “similar” cases

• Patient (case) clustering with SAS CLUSTER procedure

− Identify groups of similar patients

− Variables with most variability are most important

− The resulting clusters allow for the analysis of patient groups with different characteristics

Methods: Patient Clustering

• Patient clustering details

− The SAS CLUSTER procedure is an agglomerative clustering technique

− Starts with one cluster per patient and iteratively merges the two nearest clusters until a single cluster contains all patients

− Based on a set of key variables

− Ward’s method selected. This method uses the minimum-variance criterion to decide which two clusters should be merged next, so it tends to maximize the between-cluster separation in the corresponding ANOVA

− Selected stopping point with a moderate number of clusters (selection is somewhat arbitrary; see dendrogram)

− The squared multiple correlation, R-squared, is the proportion of variance accounted for by the clusters and is used to assess the goodness of a particular cluster solution
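The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS CLUSTER procedure itself: the data are synthetic two-variable points (not trial data), and the function names `ward_cost`, `ward_clusters`, and `r_squared` are hypothetical. It assumes the "nearest" pair is the one whose merge gives the smallest increase in within-cluster sum of squares.

```python
import random

def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ward_cost(ci, cj):
    """Increase in within-cluster sum of squares if ci and cj are merged."""
    ni, nj = len(ci), len(cj)
    return ni * nj / (ni + nj) * sq_dist(centroid(ci), centroid(cj))

def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering: start with one cluster per point and
    repeatedly merge the pair with the smallest Ward cost."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = ward_cost(clusters[a], clusters[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

def r_squared(clusters):
    """Proportion of total variance accounted for by the clusters."""
    all_points = [p for c in clusters for p in c]
    grand = centroid(all_points)
    total_ss = sum(sq_dist(p, grand) for p in all_points)
    within_ss = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    return 1.0 - within_ss / total_ss

# Synthetic stand-in for standardized patient variables (NOT trial data).
random.seed(0)
centers = [(0.0, 0.0), (6.0, 0.0), (0.0, 6.0)]
points = [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5))
          for cx, cy in centers for _ in range(10)]

clusters = ward_clusters(points, 3)
rsq = r_squared(clusters)
print(len(clusters), round(rsq, 3))
```

Stopping at a different number of clusters and comparing the resulting R-squared values is one way to see the trade-off behind the stopping-point choice.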



Ward’s Method

Ward proposes that at any stage of an analysis the loss of information which results from the grouping of individuals into clusters can be measured by the total sum of squared deviations of every point from the mean of the cluster to which it belongs.

The distance between a group k and a group (ij) formed by the fusion of i and j:

$$d_{k(ij)} = \alpha_i d_{ki} + \alpha_j d_{kj} + \beta d_{ij}$$

where $d_{ij}$ is the distance between groups i and j, $n_i$ is the number of cases in group i, and

$$\alpha_i = \frac{n_k + n_i}{n_k + n_i + n_j}, \qquad \alpha_j = \frac{n_k + n_j}{n_k + n_i + n_j}, \qquad \beta = \frac{-n_k}{n_k + n_i + n_j}$$

Everitt B, Cluster Analysis, 1977
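As a quick numerical check of the update formula, the sketch below compares it with a direct calculation on three small made-up groups. It assumes the "distance" between two groups is Ward's merge cost, that is, the increase in the total within-cluster sum of squares caused by fusing them; the slide does not state this explicitly. The function names are hypothetical.

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def ward_distance(a, b):
    """Assumed distance: increase in within-cluster sum of squares on merging a and b."""
    na, nb = len(a), len(b)
    ca, cb = centroid(a), centroid(b)
    return na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))

def updated_distance(d_ki, d_kj, d_ij, n_k, n_i, n_j):
    """Distance from group k to the fused group (ij) via the update formula."""
    total = n_k + n_i + n_j
    alpha_i = (n_k + n_i) / total
    alpha_j = (n_k + n_j) / total
    beta = -n_k / total
    return alpha_i * d_ki + alpha_j * d_kj + beta * d_ij

# Made-up groups of points, for illustration only.
group_i = [(0.0, 0.0), (1.0, 0.0)]
group_j = [(4.0, 0.0)]
group_k = [(0.0, 3.0), (0.0, 5.0), (2.0, 4.0)]

via_formula = updated_distance(
    ward_distance(group_k, group_i),
    ward_distance(group_k, group_j),
    ward_distance(group_i, group_j),
    len(group_k), len(group_i), len(group_j),
)
direct = ward_distance(group_k, group_i + group_j)
print(abs(via_formula - direct) < 1e-9)
```

The practical point of the formula is that distances to a newly fused group can be updated from distances already computed, without revisiting the raw data.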

Patient Clusters from Clinical Trial Dataset

Data Variables used to Cluster:

• Demographics Data File: Age, Sex, Race, BMI

• Spirometry Data File: FEV1, FVC variables

• Allergic History Data File: Duration, Onset, Skin tests

• AQLQ Data File: Symptoms, Activity, Smoking

Patient Clusters

• Used CART to better understand patient clusters

− Used to uncover hidden structure in complex data to predict our 7 clusters

− 10-fold cross validation used to build the CART model

− We set the variable indicating the 7 clusters as the target

− Allowed CART to help us describe and ultimately name the 7 clusters via the nodes in the final tree

Patient Clusters

• CART Method Details

− The target is the Cluster assignment variable; 30 predictors were included

− Cluster was considered categorical, so a classification tree was used

− The single-variable splitting criterion used was Gini

− Each predictor was given equal weight

− No priors, constraints, or penalties were defined

− Default of 10-fold cross validation was used
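To make the Gini criterion concrete, here is a minimal sketch of how a single-variable split can be scored. It is not the CART software's implementation: the ages and labels are toy values (not trial data), the function names are hypothetical, and placing candidate thresholds midway between adjacent observed values is an assumption of this sketch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    Returns (None, parent_gini) if no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for idx in range(1, n):
        lo, hi = pairs[idx - 1][0], pairs[idx][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        threshold = (lo + hi) / 2  # midpoint convention (assumed)
        left = [lab for _, lab in pairs[:idx]]
        right = [lab for _, lab in pairs[idx:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Toy data: one predictor and a two-class target, for illustration only.
ages = [30, 35, 41, 44, 45, 52, 60, 67]
groups = ["younger"] * 4 + ["older"] * 4

threshold, impurity = best_split(ages, groups)
print(threshold, impurity)
```

A full tree repeats this search over every predictor at every node and uses cross validation to choose the tree size; the sketch covers only the scoring of one split.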

Patient Clusters

[Figure: CART Tree, Page 1]

Patient Clusters

[Figure: CART Tree, Page 2]

Patient Clusters

CART Tree Node Descriptions

1. ICS Use = Low/Medium

2. ICS Use High and ICU/Intubated

3. ICS Use High and Not ICU/Intubated and On Women’s Hormone Therapy

4. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Black

5. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age < 44.5 and Post-Bronchodilator % Predicted FVC <= .72

6. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age < 44.5 and Post-Bronchodilator % Predicted FVC > .72 and Activity Score <= 3.59

7. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age < 44.5 and Post-Bronchodilator % Predicted FVC > .72 and Activity Score > 3.59

8. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age > 44.5 and Post-Bronchodilator % Predicted FEV1 <= 61.88

9. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age > 44.5 and Post-Bronchodilator % Predicted FEV1 > 61.88
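The nine node descriptions can be read as a chain of nested rules. The sketch below encodes them as a small Python function, purely as an illustration of the tree's logic. The dictionary keys (`ics_use`, `icu_or_intubated`, and so on) are hypothetical field names, the example patient is made up, and sending an age of exactly 44.5 to the older branch is an assumption, since the descriptions leave that boundary open.

```python
def cart_node(p):
    """Map a patient record (dict) to terminal node 1-9 of the described tree."""
    if p["ics_use"] != "high":            # Low/Medium ICS use
        return 1
    if p["icu_or_intubated"]:
        return 2
    if p["womens_hormone_therapy"]:
        return 3
    if p["race_black"]:
        return 4
    if p["age"] < 44.5:
        # FVC is expressed as a fraction here, matching the .72 cut point.
        if p["post_bd_pct_pred_fvc"] <= 0.72:
            return 5
        return 6 if p["activity_score"] <= 3.59 else 7
    # Age above 44.5 (an age of exactly 44.5 is assumed to fall here).
    # FEV1 is expressed as a percentage, matching the 61.88 cut point.
    return 8 if p["post_bd_pct_pred_fev1"] <= 61.88 else 9

# Made-up patient record, for illustration only.
example = {
    "ics_use": "high",
    "icu_or_intubated": False,
    "womens_hormone_therapy": False,
    "race_black": False,
    "age": 38,
    "post_bd_pct_pred_fvc": 0.85,
    "activity_score": 4.2,
    "post_bd_pct_pred_fev1": 70.0,
}
node = cart_node(example)
print(node)
```

Writing the rules this way makes it easy to see that each patient lands in exactly one node, and that the FVC and FEV1 cut points are on different scales.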

Patient Clusters

• CART output and node descriptions helped to name the 7 clusters

− (1) Older/Poor Lung Function

− (2) Younger/Good Lung Function/Good Activity

− (3) Older/Moderate Lung Function

− (4) High Women’s Hormone Therapy

− (5) Race - Black

− (6) High ICS Use

− (7) High ICU/Intubation

Results

FEV1* and Exacerbation Advantage for the Seven Cluster Solution

Limitations

• Potential Issues with Case Clustering

− Additional variables could have been included

− CART is one way to describe the cluster splits, but not the only way

Conclusions

• Conclusions

− CART helped us describe the clusters in a clinically meaningful way

− Some groups of patients respond better to Active Drug over placebo than other groups

− Patients in cluster 4 (High Women’s Hormone Therapy) and cluster 2 (Younger/Good Lung Function/Good Activity) responded better than other patient clusters
