predictive and causal modeling in biomedicine · 2015-09-08 · predictive and causal modeling in...
Post on 29-Aug-2018
Predictive and Causal Modeling in the Health Sciences
Sisi Ma MS, MS, PhD. New York University,
Center for Health Informatics and Bioinformatics
1
Exponentially Rapid Data Accumulation
• 1975: Rapid DNA Sequencing
• 1982: GenBank Formed
• 1986: Protein Sequencing via MS
• 1990: Human Genome Project Initiated
• 2003: Completion of Human Genome Sequencing
• 2005: First GWAS Study Published; NGS
• 2006: TCGA Initiated
• 2010: Human Connectome Project
• 2012: Single Cell Sequencing
• 2016: TCGA Completed (>10,000 Tumors)
• Also on the timeline, without a legible year: PDB Initiated; 1,000 Genomes Initiated
2
From Data to Discoveries
• Predictive Model → Predictive Knowledge → Screening, Diagnostics, Prognostics
• Causal Model → Causal Knowledge → Intervention, Therapeutics
Advanced data preparation, analysis, and modeling methods are needed for knowledge discovery in high-volume, high-variety data. Two key types: predictive modeling and computational causal discovery.
3
Talk Outline
• Predictive Modeling
  o Brief Introduction to Predictive Modeling
  o Indicative Case Studies
• Causal Modeling
  o Causal Modeling using Observational Data
  o Indicative Case Studies
  o Causal Model-Guided Experimental Minimization and Adaptive Data Collection
4
Predictive Models: the Goal
6
Example of Predictive Modeling: Support Vector Machines (SVMs)
Support Vector Machine
7
Key Characteristics of SVM
• Maximum gap (margin) to prevent overfitting
• QP problems can be solved with standard methods
• Soft margins to tolerate noise
• Kernel trick for data that are not linearly separable
Boser et al., 1992; Statnikov et al., 2011
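The soft-margin idea listed above can be illustrated with a minimal sketch. It assumes a linear SVM trained by plain full-batch subgradient descent on the hinge loss, which is a teaching shortcut and not the QP solver the slide refers to; the kernel trick is not shown. The function names, the toy data, and the parameters C, lr, and epochs are illustrative assumptions.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimise 0.5*||w||^2 + C * sum(hinge losses) by subgradient descent.

    Labels must be +1 or -1. This is a sketch, not a production QP solver.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)   # gradient of the margin term 0.5*||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the soft margin: hinge loss is active
                for j in range(n_features):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify x by the side of the separating hyperplane it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (hypothetical values).
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
     [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

A larger C penalises margin violations more heavily (less tolerance to noise), while a smaller C favours a wider margin.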
Predictive Models: the Goal
8
Predictive Modeling: a Simplified General Framework
9
Predictive Modeling: Cross validation for performance estimation and model selection
10
Ma et al., 2015 (in preparation)
Predictive Modeling for Post-traumatic Stress
Post-traumatic Stress Response:
• Almost everyone experiences at least one traumatic event in their life.
• Most people display acute stress responses.
• Acute stress responses diminish over time in most individuals, but about 10% to 20% of people experience non-remitting stress responses long after the trauma.
• Persistent stress is detrimental to the physiological and psychological well-being of individuals.
12
Galatzer-Levy et al., 2015; Ma et al., 2015; Galatzer-Levy et al., 2015 (submitted)
Predictive Modeling for Post-traumatic Stress
Discovery Goals/Questions:
• Can we identify the people who will suffer from non-remitting stress responses? If so, can they be identified early enough?
• What types of data need to be collected to identify people who will suffer from non-remitting stress responses?
13
Predictive Modeling for Post-traumatic Stress
• 166 trauma survivors admitted to the ER were followed for up to 4 months after the trauma.
• Patient history, clinical data, stress hormones, and psychiatric measurements were collected in the ER and at 1 week, 1 month, and 4 months after the trauma. A total of 135 variables were collected.
Data: [figure: numeric data matrix]
14
Predictive Modeling for Post-traumatic Stress
Remitting and Non-remitting Post-traumatic Stress Responses (Identified via Latent Growth Mixture Modeling)
15
Predictive Model for Post-traumatic Stress
Study Design:
• Five predictive models were built using data incorporating increasing amounts of information:
(1) Background data
(2) Data collected through the ER
(3) Data collected through 1 week
(4) Data collected through 1 month
(5) Data collected through 4 months
• SVM with feature selection was employed, with 10-split 5-fold cross-validation.
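The resampling scheme in the study design can be sketched as follows, assuming that "10-split 5-fold" means 5-fold cross-validation repeated over 10 random partitions of the 166 patients. The function name and the seed are illustrative assumptions. To keep performance estimates unbiased, feature selection and model fitting would normally be redone inside each training fold, never on the held-out fold.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Each repeat reshuffles the samples and partitions them into
    n_splits disjoint test folds.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin into n_splits folds.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k
                         for i in fold]
            yield repeat, k, train_idx, test_idx


# 166 patients, as in the study described above.
splits = list(repeated_kfold(166))
```

Averaging the held-out performance over all 50 train/test pairs gives a more stable estimate than a single 5-fold run.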
17
Predictive Modeling for Post-traumatic Stress
• Prediction accuracy increases progressively as data collected at later time points are added to the predictive models.
• The predictivity of the model built with patient background information alone is statistically significant.
• The model built with patient background information and data collected in the ER has strong enough predictive performance to be clinically useful.
18
Predictive Modeling for Post-traumatic Stress
Discovery Goals/Questions:
• Can we identify the people who will suffer from non-remitting stress responses? If so, can they be identified early enough?
• What types of data need to be collected to identify people who will suffer from non-remitting stress responses? Specifically, can neuroendocrine levels predict non-remitting post-traumatic stress?
19
Predictive Modeling for Post-traumatic Stress
• The neuroendocrine data studied contain limited information about non-remitting stress responses.
• Except at the time of the ER visit, combining neuroendocrine data with other data (clinical information, psychiatric surveys) does not significantly increase the predictivity of the models.
20
Other Case Studies for Predictive Modeling
• Predicting Cancer Patient Outcome
• Predicting Neural Activity in the Dorsolateral Striatum
• Predicting Transposon Insertion
21
Other Case Studies for Predictive Modeling
Predicting Cancer Patient Outcome
• Problem: Determine the most informative data modality for predicting cancer patient outcome
• Data: 47 datasets/predictive tasks spanning 9 data modalities in total: copy number, gene expression, protein expression, microRNA expression, imaging, GWAS, somatic mutation, methylation, and clinical information.
• Conclusion: Gene expression is generally the most informative data modality. Combining different data modalities does not increase predictive performance.
22
Ray MS, Henaff MS, Aliferis PhD, Statnikov PhD @NYU
Ray et al., 2014
Other Case Studies for Predictive Modeling
Predicting Neural Activity in the Dorsolateral Striatum (DLS)
• Problem: Predict neural activity from movement data
• Data: Single-neuron activity in the DLS; head movement tracking data
• Model: Linear-Nonlinear-Poisson model to predict neural activity from the head movement profile of the animal and the spike history of the neuron.
• Result: Neural activity was reconstructed in a subpopulation of the neurons.
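The structure of a Linear-Nonlinear-Poisson model can be sketched in a few lines. The sketch assumes an exponential nonlinearity and one expected spike count per time bin; the slide does not state which nonlinearity or bin size was used, so these choices and all names (lnp_rate, k_move, h_hist, bias) are illustrative assumptions.

```python
import math


def lnp_rate(movement, spike_history, k_move, h_hist, bias):
    """Expected spike count in one time bin.

    Linear stage: weighted sum of movement features and past spike counts.
    Nonlinear stage: exponential (an assumption made for this sketch).
    """
    drive = bias
    drive += sum(k * m for k, m in zip(k_move, movement))
    drive += sum(h * s for h, s in zip(h_hist, spike_history))
    return math.exp(drive)


def poisson_log_likelihood(count, rate):
    """Log-probability of observing `count` spikes given the expected rate."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)
```

Fitting the filters k_move and h_hist then amounts to maximising the summed log-likelihood over all time bins.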
23
David Barker PhD @ NIDA
Ma and Barker, 2014
Other Case Studies for Predictive Modeling
Predicting Transposon Insertion
• Problem: Identify transposon insertion locations in the genome.
• Data: Targeted sequencing data.
• Model: Train a logistic regression model on a set of annotated transposon insertion sites and apply the model for de novo insertion identification.
• More than 95% of the de novo insertions identified by the model were validated by experiments.
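The train-then-apply pattern described above can be sketched with a small logistic regression fitted by gradient descent. The single numeric feature and the toy labels are hypothetical stand-ins; the actual features derived from the sequencing data are not given in the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit weights by gradient descent on the mean log-loss (labels 0/1)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that candidate site x is a true insertion (class 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical annotated sites: one feature each, label 1 = insertion.
X_train = [[-2.0], [-1.0], [1.0], [2.0]]
y_train = [0, 0, 1, 1]
w, b = train_logistic(X_train, y_train)
```

New candidate sites would then be scored with predict_proba and thresholded to call de novo insertions.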
Zuojian Tang MS, David Fenyo PhD, Jeff Boeke PhD @NYU Langone, Kathleen Burns @ JHU
24
Causal Modeling: the Goal
26
Causal Modeling: the Goal
27
Causal Modeling: Causal graphs Capture Direct, Indirect Relationships
28
Causal Modeling: V-structures, a Common Technique for Orienting Causal Relationships
29
Causal Modeling: PC Algorithm, a prototypical causal discovery algorithm
30
PC algorithm: Skeleton Discovery
Spirtes et al., 1993
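The skeleton-discovery phase of PC can be sketched as below. The sketch assumes a caller-supplied conditional independence function indep(x, y, S); in practice this would be a statistical test on data, while here a hand-written oracle for the chain A → B → C stands in for it. All names are illustrative assumptions.

```python
from itertools import combinations


def pc_skeleton(variables, indep):
    """Start from a complete undirected graph and remove edge x-y whenever
    x and y are independent given some subset S of x's other neighbours.
    Conditioning-set size grows from 0 upward."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    size = 0
    while any(len(adj[x]) - 1 >= size for x in variables):
        for x in variables:
            for y in sorted(adj[x]):
                others = adj[x] - {y}
                if len(others) < size:
                    continue
                for S in combinations(sorted(others), size):
                    if indep(x, y, set(S)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(S)
                        break
        size += 1
    edges = {frozenset((x, y)) for x in variables for y in adj[x]}
    return edges, sepset


def chain_indep(x, y, S):
    """Oracle for A -> B -> C: only A and C are independent, given B."""
    return {x, y} == {"A", "C"} and "B" in S


edges, sepset = pc_skeleton(["A", "B", "C"], chain_indep)
```

The recorded separating sets are what the orientation phase later uses to detect v-structures.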
31
Causal Modeling: PC Algorithm
PC algorithm: Skeleton Discovery, Trace
Causal Modeling: PC Algorithm
32
PC algorithm: Orientation
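The v-structure rule used for orientation can be sketched as follows: for every unshielded triple X - Z - Y (X and Y not adjacent), orient X → Z ← Y when Z is not in the separating set of X and Y. The sketch covers only this first orientation step, not the further propagation rules of PC, and it assumes every non-adjacent pair has a recorded separating set (a missing one is treated as empty). Names are illustrative.

```python
from itertools import combinations


def orient_v_structures(edges, sepset):
    """Return the set of directed edges (cause, effect) implied by colliders.

    edges:  set of frozenset({x, y}) undirected skeleton edges
    sepset: dict frozenset({x, y}) -> separating set for non-adjacent pairs
    """
    adj = {}
    for edge in edges:
        x, y = tuple(edge)
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    arrows = set()
    for z in adj:
        for x, y in combinations(sorted(adj[z]), 2):
            if y in adj[x]:
                continue  # shielded triple: x and y are adjacent
            if z not in sepset.get(frozenset((x, y)), set()):
                arrows.add((x, z))
                arrows.add((y, z))
    return arrows


# Collider A -> C <- B: A and B are marginally independent (empty sepset).
collider = orient_v_structures(
    {frozenset(("A", "C")), frozenset(("B", "C"))},
    {frozenset(("A", "B")): set()},
)
# Chain A - B - C: A and C are separated by B, so nothing is oriented.
chain = orient_v_structures(
    {frozenset(("A", "B")), frozenset(("B", "C"))},
    {frozenset(("A", "C")): {"B"}},
)
```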
Causal Modeling: HITON-PC Algorithm
[Figure: example causal graph with target T and variables A, B, C, D, E]
33
• Local causal discovery method
• Easily extended for global causal discovery with the LGL framework
Aliferis et al., 2010
Causal Modeling: HITON-PC Algorithm
[Figure: trace of HITON-PC on the example graph (target T; variables A, B, C, D, E)]
34
Causal Modeling: Semi-Interleaved HITON-PC, a more efficient implementation
35
• Efficient and robust.
• Scalable to very big data.
• Easily extended for global causal discovery with the LGL framework.
• An instantiation of the GLL framework.
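A simplified sketch of semi-interleaved HITON-PC follows. Candidates enter a tentative parents-and-children set in order of decreasing association with the target; after each inclusion only the newly added variable is tested for elimination, and a full elimination pass runs once the candidate list is exhausted. The assoc and indep callables, the max_k cap on conditioning-set size, and the oracle for the toy graph B → A → T → C (with D unrelated) are assumptions made for illustration, not the published implementation.

```python
from itertools import combinations


def _separable(x, target, others, indep, max_k):
    """True if some non-empty subset of `others` renders x independent of target."""
    for k in range(1, min(max_k, len(others)) + 1):
        for S in combinations(others, k):
            if indep(x, target, set(S)):
                return True
    return False


def hiton_pc(target, variables, assoc, indep, max_k=3):
    # Keep only variables marginally associated with the target, strongest first.
    open_list = [v for v in variables
                 if v != target and not indep(v, target, set())]
    open_list.sort(key=lambda v: assoc(v, target), reverse=True)
    tpc = []
    for cand in open_list:
        tpc.append(cand)
        # Semi-interleaved elimination: test only the newest member.
        if _separable(cand, target, [v for v in tpc if v != cand], indep, max_k):
            tpc.remove(cand)
    # Final pass: re-test every member against subsets of the others.
    for v in list(tpc):
        if _separable(v, target, [u for u in tpc if u != v], indep, max_k):
            tpc.remove(v)
    return set(tpc)


# Toy oracle for B -> A -> T -> C, with D unrelated to T.
strength = {"A": 0.9, "C": 0.8, "B": 0.5, "D": 0.0}


def toy_assoc(v, target):
    return strength[v]


def toy_indep(x, y, S):
    other = x if y == "T" else y
    if other == "D":
        return True          # D is never associated with T
    if other == "B":
        return "A" in S      # B is screened off from T by A
    return False             # A and C are directly connected to T


pc_of_t = hiton_pc("T", ["A", "B", "C", "D", "T"], toy_assoc, toy_indep)
```

Testing only the newest member during inclusion is what saves independence tests relative to re-checking the whole set each time.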
Causal Modeling for Post-traumatic Stress Study
• 166 trauma survivors admitted to the ER were followed for up to 4 months after the trauma.
• Patient history, clinical data, stress hormones, and psychiatric measurements were collected in the ER and at 1 week, 1 month, and 4 months after the trauma. A total of 135 variables were collected.
Data: [figure: numeric data matrix]
37
Galatzer-Levy et al., 2015; Ma et al., 2015; Galatzer-Levy et al., 2015 (submitted)
Causal Model for Post-traumatic Stress
Causal Discovery Question:
• What are the factors determining non-remitting stress responses?
Analysis Design:
• Apply a local causal discovery algorithm (HITON-PC) to find the parents-and-children set of every measured variable.
• A global causal graph depicting the relationships among all measured variables was constructed using the local-to-global framework LGL.
• Edges were oriented according to the time at which individual variables were measured.
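The analysis design above can be sketched as a two-step assembly: combine the per-variable parents-and-children sets into one undirected graph, then orient each edge from the earlier-measured variable to the later one. The sketch assumes an AND symmetry rule (an edge is kept only if each endpoint lists the other), which is one possible choice and may differ from the study's; variable names and time codes are hypothetical.

```python
def build_time_oriented_graph(pc_sets, measured_at):
    """pc_sets: dict variable -> set of variables in its local PC set.
    measured_at: dict variable -> measurement time (any comparable value).
    Returns (directed edges as (earlier, later), undirected same-time edges)."""
    directed, undirected = set(), set()
    for x, neighbours in pc_sets.items():
        for y in neighbours:
            if x not in pc_sets.get(y, set()):
                continue  # AND rule: drop edges that are not symmetric
            if measured_at[x] < measured_at[y]:
                directed.add((x, y))
            elif measured_at[x] > measured_at[y]:
                directed.add((y, x))
            else:
                undirected.add(frozenset((x, y)))  # time cannot orient this edge
    return directed, undirected


# Hypothetical variables measured in the ER (0), at 1 week (1), at 4 months (4).
pc_sets = {
    "V_er": {"V_week1"},
    "V_week1": {"V_er", "V_month4"},
    "V_month4": {"V_week1"},
    "V_extra": {"V_month4"},   # not reciprocated, so dropped by the AND rule
}
measured_at = {"V_er": 0, "V_week1": 1, "V_month4": 4, "V_extra": 0}
directed, undirected = build_time_oriented_graph(pc_sets, measured_at)
```

Orienting by measurement time relies on the assumption that a later measurement cannot cause an earlier one.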
38
Causal Modeling for Post-traumatic Stress
The Global Causal Graph
A very complicated model!
39
Causal Modeling for Post-traumatic Stress
Example Causal Path Leading to non-remitting Stress Responses
40
Causal Modeling for Post-traumatic Stress
Potential intervention for non-remitting Stress Responses
41
Causal Modeling for Post-traumatic Stress
Potential Intervention for non-remitting Stress Responses
42
Causal Model Guided Experimental Minimization and Adaptive Data Collection
Goals:
• Reduce the number of experiments that experimentalists need to do in order to fully resolve a biological pathway (or other complex set of causal interactions among variables of interest).
• Reduce time to discovery
• Reduce costs
44
Causal Model Guided Experimental Minimization and Adaptive Data Collection
Special Importance in Health Sciences with both omics data and clinical data:
• One variable can be univariately associated with hundreds to thousands of variables:
  – Drivers: direct and indirect
  – Passengers
  – Effects
• High degree of multiplicity.
• Classical statistical techniques exhibit both increased false positives and increased false negatives.
45
Causal Model-Guided Experimental Minimization and Adaptive Data Collection
Simplified view of the Framework:
46
Causal Model Guided Experimental Minimization and Adaptive Data Collection
The ODLP Algorithm:
Output:
• Local causal pathway (parents and children) of the variable of interest.
Two Phases:
• Identify local causal pathway consistent with the data and information equivalent clusters.
• Adaptively recommend experiments to perform, integrate experimental results to refine and orient the local causal pathway.
47
Statnikov et al., 2015 (Accepted)
Causal Model Guided Experimental Minimization and Adaptive Data Collection
48
ODLP: Pseudo Code:
Causal Model Guided Experimental Minimization and Adaptive Data Collection
The ODLP Algorithm Phase I:
• Identify local causal pathway consistent with the data and information equivalent clusters (TIE*, iTIE* algorithms).
49
Causal Model Guided Experimental Minimization and Adaptive Data Collection
The ODLP Algorithm Phase I: iTIE*
50
Causal Model Guided Experimental Minimization and Adaptive Data Collection
The ODLP Algorithm Phase II:
• Adaptively recommend experiments to perform, integrate experimental results to refine and orient the local causal pathway. (i.e. Identify Causes, Effects, and Passengers).
51
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP: Identifying effects
• Manipulate T and obtain experimental data DE.
• Mark all variables in V that change in DE due to manipulation of T as effects.
52
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP: direct and indirect effects
• Select an effect variable X that has neither been marked as indirect effect nor as direct effect.
• Manipulate X and obtain experimental data DE.
• Mark all effect variables that change in DE due to manipulation of X and belong to the same equivalence cluster as indirect effects.
• The last effect variable in an equivalence cluster that is not marked as an indirect effect is a direct effect.
53
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP: Identifying Passengers
• Select an unmarked variable X from an equivalence cluster.
• Manipulate X and obtain experimental data DE.
• If T does not change in DE due to manipulation of X, mark X as a passenger and mark all other non-effect variables that change in DE due to manipulation of X as passengers; otherwise mark X as a cause.
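The marking rules for effects, passengers, and causes can be sketched with a simulated experimenter. The sketch is a simplification: it ignores equivalence clusters and visits every candidate in turn, and the manipulate(x) oracle simply returns the descendants of x in a known toy DAG, standing in for real experimental data DE. All names and the toy graph are illustrative assumptions.

```python
def make_experiment_oracle(children):
    """Simulated experiment: manipulating x changes all descendants of x."""
    def manipulate(x):
        changed, stack = set(), [x]
        while stack:
            node = stack.pop()
            for child in children.get(node, ()):
                if child not in changed:
                    changed.add(child)
                    stack.append(child)
        return changed
    return manipulate


def classify_local_pathway(target, candidates, manipulate):
    labels = {}
    # Manipulate T: every candidate that changes is an effect.
    changed_by_target = manipulate(target)
    for v in candidates:
        if v in changed_by_target:
            labels[v] = "effect"
    # Manipulate each still-unmarked candidate X in turn.
    for x in candidates:
        if x in labels:
            continue
        changed = manipulate(x)
        if target in changed:
            labels[x] = "cause"
        else:
            labels[x] = "passenger"
            # Unmarked variables that X changes are passengers too.
            for v in candidates:
                if v in changed and v not in labels:
                    labels[v] = "passenger"
    return labels


# Toy DAG: C1 -> C2 -> T -> E1 -> E2, plus C1 -> P1 -> P2.
toy_children = {"C1": ["C2", "P1"], "C2": ["T"], "T": ["E1"],
                "E1": ["E2"], "P1": ["P2"]}
labels = classify_local_pathway(
    "T", ["C1", "C2", "P1", "P2", "E1", "E2"],
    make_experiment_oracle(toy_children),
)
```

In this toy run P2 is labeled without a dedicated experiment, which is the kind of saving the adaptive scheme aims for.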
54
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP: Identifying Causes
• For every cause X, mark X as a direct cause if there exists no other cause in the same equivalence cluster that changes due to manipulation of X; otherwise mark X as an indirect cause.
• If there is an equivalence cluster that contains a single unmarked variable X and all marked variables in this cluster (if any) are only passengers and/or effects, then mark X as a direct cause.
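The first rule above (direct versus indirect causes) can be sketched as a lookup over results already gathered. The sketch assumes that the equivalence clusters and the set of variables changed by each manipulation are given as inputs; it does not implement the second rule about a single unmarked variable in a cluster. Names and the toy inputs are illustrative.

```python
def split_direct_indirect_causes(causes, clusters, changed_by):
    """causes: set of variables already marked as causes.
    clusters: list of sets, the equivalence clusters.
    changed_by: dict X -> set of variables that changed when X was manipulated.
    """
    result = {}
    for x in causes:
        # A variable that belongs to no listed cluster forms its own cluster.
        cluster = next((c for c in clusters if x in c), {x})
        downstream = [y for y in cluster
                      if y != x and y in causes and y in changed_by[x]]
        # X is indirect if manipulating it moves another cause in its cluster.
        result[x] = "indirect cause" if downstream else "direct cause"
    return result


# Toy inputs for C1 -> C2 -> T, with C1 and C2 in one equivalence cluster.
cause_types = split_direct_indirect_causes(
    causes={"C1", "C2"},
    clusters=[{"C1", "C2"}],
    changed_by={"C1": {"C2", "T"}, "C2": {"T"}},
)
```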
55
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP vs Other Algorithms: Performance on Simulated Data
• Benchmark study
• 58 algorithms/variants from 4 algorithm families.
• 11 networks of different sizes.
56
Statnikov et al., 2015 (Accepted)
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP vs Other Algorithms: Network Reconstruction Quality
57
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP vs Other Algorithms: Reconstruction Quality & Efficiency
58
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP vs Other Algorithms: Scalability
59
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP vs Other Algorithms: Performance on Real Biological Data
60
Ma et al., 2015 (submitted)
Causal Model Guided Experimental Minimization and Adaptive Data Collection
ODLP vs Other Algorithms: Performance on Real Biological Data
61
Summary
• Predictive Modeling
  o Brief Introduction to Predictive Modeling
  o Indicative Case Studies
• Causal Modeling
  o Causal Modeling using Observational Data
  o Indicative Case Studies
  o Causal Model-Guided Experimental Minimization and Adaptive Data Collection
62
Future directions
• Improve existing algorithms (e.g., relax some application assumptions).
• Design and implement analysis pipelines that can be used by non-experts.
• Disseminate Software and Analytics Packages.
• Apply these techniques broadly in different domains.
• Educate researchers about the capabilities (and limitations) as well as proper use of these and related methods.
63