Using CART to Unravel Clusters for the Testing of Interactions in Asthma Databases
Ben Trzaskoma
Sr Statistical Scientist
Genentech, Inc
Introduction
• Ultimate Purpose: Determine subgroups with greatest
“Drug advantage” (versus Placebo)
• Variable clustering
− Used to identify groups of variables
− Created composite scores summarizing groups of variables
− Determined “Drug benefit” for each composite score
• Patient (case) clustering
− Used to identify subsets of patients
− Determined “Drug benefit” for each subset of patients
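The slides do not say how "Drug benefit" was computed. A minimal sketch, assuming it is the difference in mean outcome between the Drug and Placebo arms within each subgroup, could look like the following. The function name, record layout, and toy values are illustrative only and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean

def drug_advantage_by_group(records):
    """Return {group: mean(outcome | Drug) - mean(outcome | Placebo)}.

    Assumption: "advantage" is a simple difference in arm means.
    """
    by_arm = defaultdict(lambda: {"Drug": [], "Placebo": []})
    for group, arm, outcome in records:
        by_arm[group][arm].append(outcome)
    return {
        group: mean(arms["Drug"]) - mean(arms["Placebo"])
        for group, arms in by_arm.items()
        if arms["Drug"] and arms["Placebo"]  # skip groups missing an arm
    }

# Synthetic toy records: (subgroup label, treatment arm, change in outcome).
toy = [
    ("A", "Drug", 0.30), ("A", "Drug", 0.20),
    ("A", "Placebo", 0.10), ("A", "Placebo", 0.10),
    ("B", "Drug", 0.10), ("B", "Placebo", 0.10),
]
advantage = drug_advantage_by_group(toy)
```

A subgroup whose advantage differs clearly from the others is a candidate for a treatment-by-subgroup interaction, which would still need a formal test.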
Clinical Trial Dataset
• Phase IIIb multicenter, randomized, double-blind,
placebo-controlled study
• Original purpose of the study:
− Evaluate efficacy and safety of Asthma Drug
− Subjects have moderate to severe asthma
− Asthma is inadequately controlled with ICS and LABA
• 850 patients aged 12-75 from about 150 sites were
enrolled and followed for 48 weeks
− Half given Drug + ICS + LABA
− Half given placebo + ICS + LABA
Methods
• Two approaches:
• Variable Clustering (Proc Varclus in SAS)
− Related to factor analysis
− Looks for relationships among variables
• Patient (Case) Clustering (Proc Cluster in SAS)
− Looks for groups of “similar” cases
• Patient (case) clustering with SAS CLUSTER procedure
− Identify groups of similar patients
− Variables with most variability are most important
− The resulting clusters allow for the analysis of patient groups with different characteristics
Methods: Patient Clustering
• Patient clustering details
− The SAS CLUSTER procedure is an agglomerative clustering technique
− Starts with one cluster per patient and iteratively groups the two nearest clusters until there is only one cluster with all patients in it
− Based on a set of key variables
− Ward’s method selected. At each step this method merges the two clusters whose fusion gives the smallest increase in within-cluster variance, so it tends to maximize the between-cluster sum of squares in the corresponding ANOVA.
− Selected stopping point with moderate number of clusters (selection is somewhat arbitrary; see dendrogram)
− The squared multiple correlation, R-squared, is the proportion of variance accounted for by the clusters and is used to assess the goodness of a particular cluster solution.
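The agglomerative procedure and the R-squared measure described above can be sketched in a few lines. This is a plain-Python illustration of the general technique, not the SAS CLUSTER procedure and not the analysis code used for the study. The function names and the six toy points are made up for the example.

```python
def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sse(points):
    """Sum of squared deviations of every point from its cluster mean."""
    c = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, c))

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_clustering(points, n_clusters):
    clusters = [[p] for p in points]  # start with one cluster per case
    while len(clusters) > n_clusters:
        # Pick the pair whose merge adds the least within-cluster variance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

def r_squared(points, clusters):
    """Proportion of total variance accounted for by the clusters."""
    return 1 - sum(sse(c) for c in clusters) / sse(points)

# Synthetic toy data: two well-separated groups of three cases each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
two = ward_clustering(points, 2)
r2_two = r_squared(points, two)
r2_one = r_squared(points, ward_clustering(points, 1))
```

R-squared falls from 1 (every case its own cluster) to 0 (one cluster), so plotting it against the number of clusters is one way to choose a stopping point.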
Methods: Patient Clustering
Ward’s Method
Ward proposes that at any stage of an analysis the loss of information which results from the grouping of individuals into clusters can be measured by the total sum of squared deviations of every point from the mean of the cluster to which it belongs.
The distance between a group k and a group (ij) formed by the fusion of i and j:
$$d_{k(ij)} = \alpha_i d_{ki} + \alpha_j d_{kj} + \beta d_{ij}$$
$$\alpha_i = \frac{n_k + n_i}{n_k + n_i + n_j}, \qquad \alpha_j = \frac{n_k + n_j}{n_k + n_i + n_j}, \qquad \beta = \frac{-n_k}{n_k + n_i + n_j}$$
where $d_{ij}$ is the distance between groups i and j, and $n_i$ is the number of cases in group i.
Everitt B, Cluster Analysis, 1977
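As a quick check of the update formula, the sketch below compares it with a direct calculation on three one-dimensional groups. It assumes the distances are on the squared scale (the increase in the sum of squares caused by a merge), which is the setting in which the Ward update is exact. The function names and data points are illustrative.

```python
def ward_update(d_ki, d_kj, d_ij, n_i, n_j, n_k):
    """Distance from group k to the fused group (ij), via the update formula."""
    total = n_k + n_i + n_j
    alpha_i = (n_k + n_i) / total
    alpha_j = (n_k + n_j) / total
    beta = -n_k / total
    return alpha_i * d_ki + alpha_j * d_kj + beta * d_ij

def merge_cost(a, b):
    """Ward dissimilarity for 1-D groups: sum-of-squares increase on merging."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return len(a) * len(b) / (len(a) + len(b)) * (mean_a - mean_b) ** 2

# Synthetic toy groups with one case each.
i, j, k = [0.0], [2.0], [5.0]

via_formula = ward_update(
    merge_cost(k, i), merge_cost(k, j), merge_cost(i, j),
    len(i), len(j), len(k),
)
direct = merge_cost(k, i + j)  # recompute from the fused group's cases
```

Both routes give the same value, so an agglomerative algorithm can update its distance matrix after each fusion without revisiting the raw cases.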
Patient Clusters from Clinical Trial Dataset
Data variables used to cluster, by source data file:
• Demographics Data File: Age, Sex, Race, BMI
• Spirometry Data File: FEV1, FVC variables
• Allergic History Data File: Duration, Onset, Skin tests
• AQLQ Data File: Symptoms, Activity, Smoking
Patient Clusters
• Used CART to better understand patient clusters
– Used to uncover hidden structure in complex data to predict our 7 clusters
– 10-fold cross validation used to build the CART model
– We set the variable indicating the 7 clusters as the target
– Allowed CART to help us describe and ultimately name the 7 clusters via the nodes in the final tree
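For readers unfamiliar with 10-fold cross validation, the sketch below shows the splitting step only: cases are shuffled, dealt into 10 folds, and each fold is held out once while a tree is grown on the other nine. It is a generic illustration, not the CART software's internal procedure. The function name and seed are assumptions, and 850 is the enrolment figure from the trial description.

```python
import random

def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]  # k nearly equal-sized folds
    for f in range(k):
        test = folds[f]
        train = [idx for g, fold in enumerate(folds) if g != f for idx in fold]
        yield train, test

splits = list(k_fold_indices(850, k=10))
```

Each case appears in exactly one held-out fold, so the pooled held-out predictions give an error estimate that does not reuse the cases a given tree was grown on.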
Patient Clusters
• CART Method Details
– The target is the Cluster assignment variable, 30 predictors were included
– Cluster was considered categorical and the classification tree was used
– CART single variable splitting criteria method used was Gini
– Each predictor was given equal weight
– No priors, constraints, or penalties were defined
– Default of 10-fold cross validation was used
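The Gini criterion named above can be illustrated with a small single-variable split search. This is a sketch of the general idea, assuming candidate thresholds are midpoints between adjacent distinct values. It is not the CART software's implementation, and the ages and cluster labels are synthetic.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split unless impurity improves
    for pos in range(1, len(pairs)):
        lo, hi = pairs[pos - 1][0], pairs[pos][0]
        if lo == hi:
            continue  # no threshold separates equal values
        left = [lab for _, lab in pairs[:pos]]
        right = [lab for _, lab in pairs[pos:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

# Synthetic toy data: one predictor and a two-class cluster label.
ages = [30, 35, 40, 50, 55, 60]
clusters = [2, 2, 2, 1, 1, 1]
threshold, impurity = best_split(ages, clusters)
```

A full tree repeats this search over every predictor at every node and keeps the split with the lowest weighted impurity.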
Patient Clusters
CART Tree Page 1
Patient Clusters
CART Tree Page 2
Patient Clusters
CART Tree Node Descriptions
1. ICS Use = Low/Medium
2. ICS Use High and ICU/Intubated
3. ICS Use High and Not ICU/Intubated and On Women’s Hormone Therapy
4. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Black
5. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age < 44.5 and Post-Bronchodilator % Predicted FVC <= .72
6. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age < 44.5 and Post-Bronchodilator % Predicted FVC > .72 and Activity Score <= 3.59
7. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age < 44.5 and Post-Bronchodilator % Predicted FVC > .72 and Activity Score > 3.59
8. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age > 44.5 and Post-Bronchodilator % Predicted FEV1 <= 61.88
9. ICS Use High and Not ICU/Intubated and Not On Women’s Hormone Therapy and Not Black and Age > 44.5 and Post-Bronchodilator % Predicted FEV1 > 61.88
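Read as a decision path, the nine node descriptions amount to a chain of nested rules. The sketch below transcribes them directly. The dictionary keys are hypothetical field names, not the study's variable names. Thresholds are copied as listed, including the different scales shown for FVC (.72) and FEV1 (61.88). Treating an age of exactly 44.5 as the older branch is an assumption, since the descriptions do not cover it.

```python
def cart_node(p):
    """Map one patient record (a dict) to terminal node 1-9."""
    if p["ics_use"] in ("Low", "Medium"):
        return 1
    # From here on, ICS use is High.
    if p["icu_or_intubated"]:
        return 2
    if p["womens_hormone_therapy"]:
        return 3
    if p["race_black"]:
        return 4
    if p["age"] < 44.5:
        if p["post_bd_pct_pred_fvc"] <= 0.72:
            return 5
        return 6 if p["activity_score"] <= 3.59 else 7
    # Age above 44.5 (exactly 44.5 lands here by assumption).
    return 8 if p["post_bd_pct_pred_fev1"] <= 61.88 else 9

# Synthetic example patient, not a study record.
patient = {
    "ics_use": "High", "icu_or_intubated": False,
    "womens_hormone_therapy": False, "race_black": False,
    "age": 50, "post_bd_pct_pred_fvc": 0.80,
    "activity_score": 4.0, "post_bd_pct_pred_fev1": 55.0,
}
node = cart_node(patient)
```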
Patient Clusters
• CART output and node descriptions helped to name the 7 clusters
– (1) Older/Poor Lung Function
– (2) Younger/Good Lung Function/Good Activity
– (3) Older/Moderate Lung Function
– (4) High Women’s Hormone Therapy
– (5) Race - Black
– (6) High ICS Use
– (7) High ICU/Intubation
Results
FEV1* and Exacerbation Advantage for the Seven Cluster Solution
Limitations
• Potential Issues with Case Clustering
– Additional variables could have been included
– CART is one way to describe the cluster splits, but not the only way
Conclusions
• Conclusions
– CART helped us describe the clusters in a clinically meaningful way
– There are groups of patients that respond to Active Drug over placebo better than other groups
– Patients in cluster 4 (High Women’s Hormone Therapy) and cluster 2 (Younger/Good Lung Function/Good Activity) responded better than other patient clusters