data science, big data and you
DESCRIPTION
Presentation at George Mason University, April 2013
TRANSCRIPT
Joel Saltz MD, PhD
Emory University
February 2013
Data Science, Big Data and You
Center for Comprehensive Informatics
Big Data
• Social media: analysis of tweets and Facebook posts to observe trends in real time
• Local Walgreens stores stock their shelves according to local tweets about cold symptoms
• Credit card fraud: among lots of transactions, you still get flagged when you shop in a store that does not fit your profile, and within minutes your card is blocked.
Big Data in Commerce - Fraud Detection
• Seek unexpected data – outliers
• Lots of data – all Amex, Visa or Mastercard transactions
• Look for individual outliers – e.g. a credit transaction involving a large amount of money purchasing an unusual product
• Look for sequence data with temporal or spatial relationship – find unusual sequences, e.g., intrusion detection and cyber security
• Define the “typical” regions in a data set – may be difficult
• “Typical” behavior may change with time. What is typical today may be considered anomalous in future and vice versa.
• (Smart) crooks will try to “keep under the radar” to stay undetected
Approaches
• Sometimes build a model from the training data and apply the model to detect outliers
• Sometimes use the existing data directly to detect outliers
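The two approaches can be contrasted with a minimal sketch. Everything here is an illustrative assumption rather than a method from the talk: the transaction amounts are made up, the "model" is just a mean and standard deviation with a z-score threshold, and the "direct" approach flags points whose distance to their k-th nearest neighbor is far above the median such distance.

```python
import statistics

def fit_zscore_model(training_amounts):
    """Approach 1: build a (very simple) model from training data."""
    return statistics.mean(training_amounts), statistics.stdev(training_amounts)

def model_outliers(model, amounts, threshold=3.0):
    """Apply the fitted model to new data; flag large z-scores."""
    mean, std = model
    return [a for a in amounts if abs(a - mean) / std > threshold]

def knn_outliers(amounts, k=2, factor=5.0):
    """Approach 2: use the existing data directly, with no separate training step.

    A point is flagged when the distance to its k-th nearest neighbor
    exceeds `factor` times the median of those distances.
    """
    kth_distances = []
    for i, a in enumerate(amounts):
        distances = sorted(abs(a - b) for j, b in enumerate(amounts) if j != i)
        kth_distances.append(distances[k - 1])
    typical = statistics.median(kth_distances)
    return [a for a, d in zip(amounts, kth_distances) if d > factor * typical]

# Hypothetical purchase amounts for one cardholder.
history = [20, 25, 22, 30, 27, 24, 26, 23, 28, 25]
model = fit_zscore_model(history)
flagged_by_model = model_outliers(model, [24, 31, 950])
flagged_directly = knn_outliers(history + [950])
```

The model-based version depends on the training data being representative of "typical" behavior; the direct version needs no training set but compares every point against every other, which becomes expensive at the scale of all card transactions.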
Big Data Ecosystem
Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
Science and Engineering Applications
Sloan Sky Survey
Early Big Data, 1922: Lewis Richardson, Weather Forecasting
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Scientific Big Data Targets
• Multi-dimensional spatial-temporal datasets
– Biomedicine
– Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation
– Biomass monitoring and disaster surveillance
– Weather prediction
– Analysis of Results from Large Scale Simulations
• Correlative and cooperative analysis of data from multiple sensor modalities and sources
• What-if scenarios and multiple design choices or initial conditions
Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)
Integrative Cancer Research with Digital Pathology
histology, neuroimaging, clinical/pathology, molecular
Integrated Analysis
High-resolution whole-slide microscopy
Multiplex IHC
Integrative Analysis: OSU BISTI NBIB Center
Big Data (2005)
Associate genotype with phenotype
Big science experiments on cancer, heart disease, pathogen host response
Tissue specimen – 1 cm^3 at 0.3 μ resolution – roughly 10^13 bytes
Molecular data (spatial location) can add additional significant factor; e.g. 10^2
Multispectral imaging, laser captured microdissection, Imaging Mass Spec, Multiplex QD
Multiple tissue specimens; another factor of 10^3
Total: 10^18 bytes – exabyte per big science experiment
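The estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch; the one-byte-per-voxel figure is an assumption (the slide does not state it), and the two multipliers are the ones given on the slide.

```python
import math

# 1 cm cube sampled at 0.3 micron spacing along each axis.
side_microns = 10_000          # 1 cm expressed in microns
spacing_microns = 0.3
voxels_per_axis = side_microns / spacing_microns
voxels = voxels_per_axis ** 3

bytes_per_voxel = 1            # assumption: one byte per voxel
specimen_bytes = voxels * bytes_per_voxel

molecular_factor = 10 ** 2     # spatially resolved molecular data
specimen_count_factor = 10 ** 3  # multiple tissue specimens

total_bytes = specimen_bytes * molecular_factor * specimen_count_factor

specimen_magnitude = math.floor(math.log10(specimen_bytes))
total_magnitude = math.floor(math.log10(total_bytes))
```

With these inputs a single specimen lands in the 10^13 byte range and the full experiment in the 10^18 byte (exabyte) range, matching the orders of magnitude quoted.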
A Data Intense Challenge: The Instrumented Oil Field of the Future
The Tyranny of Scale (Tinsley Oden - U Texas)
[Figure: pore scale, process scale, simulation scale, and field scale, spanning mm, cm, and km]
Why Applications Get Big
• Physical world or simulation results
• Detailed description of two, three (or more) dimensional space
• High resolution in each dimension, lots of timesteps
• e.g. oil reservoir code – simulate 100 km by 100 km region to 1 km depth at resolution of 10 cm:
– 10^6*10^6*10^4 mesh points, 10^2 bytes per mesh point, 10^6 timesteps – 10^24 bytes (Yottabyte) of data!!!
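The yottabyte figure follows directly from multiplying the stated quantities. The sketch below takes the mesh counts as given on the slide (10^6 by 10^6 by 10^4, which corresponds to a 10 cm spacing over a 100 km by 100 km by 1 km region); the function and variable names are illustrative only.

```python
def points_along(extent_cm, spacing_cm):
    """Number of mesh points along one axis for a given extent and spacing."""
    return extent_cm // spacing_cm

KM_IN_CM = 100_000
spacing_cm = 10  # spacing implied by the slide's mesh counts

nx = points_along(100 * KM_IN_CM, spacing_cm)  # 100 km
ny = points_along(100 * KM_IN_CM, spacing_cm)  # 100 km
nz = points_along(1 * KM_IN_CM, spacing_cm)    # 1 km depth

bytes_per_mesh_point = 10 ** 2
timesteps = 10 ** 6

mesh_points = nx * ny * nz
total_bytes = mesh_points * bytes_per_mesh_point * timesteps
```

Each factor of ten in resolution multiplies the spatial mesh by a thousand, which is why high-resolution three-dimensional simulations outgrow storage so quickly.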
Detect and track changes in data during production
Invert data for reservoir properties
Detect and track reservoir changes
Assimilate data & reservoir properties into the evolving reservoir model
Use simulation and optimization to guide future production
Oil Field Management – Joint ITR with Mary Wheeler, Paul Stoffa
Coupled Ground Water and Surface Water Simulations
Multiple codes – e.g. fluid code, contaminant transport code
Different space and time scales
Data from a given fluid code run is used in different contaminant transport code scenarios
Bioremediation Simulation
Microbe colonies (magenta)
Dissolved NAPL (blue)
Mineral oxidation products (green)
Abiotic reactions compete with microbes, reduce extent of biodegradation
National Science Foundation Grand Challenge in Land Cover Dynamics - 1994
• Remote sensing analysis of high resolution satellite images.
• Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling
• Maps of the world's tropical rain forest during the past three decades.
Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, John Townshend
Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results
Dimitri Mavriplis, Raja Das, Joel Saltz -- 1990’s
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Whole Slide Imaging: Scale
Data per slide: 500MB to 100GB
Roughly 250-500M Slides/Year in USA
Total: 0.1-10 Exabytes/year
Using TCGA Data to Study Glioblastoma
Diagnostic Improvement
Molecular Classification
Predictors of Progression
Digital Pathology
Neuroimaging
TCGA Network
Morphological Tissue Classification
Nuclei Segmentation
Cellular Features
Lee Cooper, Jun Kong
Whole Slide Imaging
Oligodendroglioma Astrocytoma
Nuclear Qualities
Can we use image analysis of TCGA GBMs to inform diagnostic criteria based on molecular or clinical endpoints?
Application: Oligodendroglioma Component in GBM
Millions of Nuclei Defined by n Features
• Top-down analysis: use the features with existing diagnostic constructs
• Bottom-up analysis: let features define and drive the analysis
TCGA Whole Slide Images
Jun Kong
Nuclear Analysis Workflow
Step 1: Nuclei Segmentation
• Identify individual nuclei and their boundaries
Step 2: Feature Extraction
• Describe individual nuclei in terms of size, shape, and texture
Step 3: Nuclei Classification
• Nuclear Qualities: Oligodendroglioma / Astrocytoma (scale 1 to 10)
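The feature extraction step can be illustrated with two common shape descriptors. This is a minimal sketch, not the pipeline used in the study: it assumes a segmented nucleus is available as a list of boundary points, defines circularity as 4πA/P², and estimates eccentricity from the covariance of the boundary points.

```python
import math

def polygon_area(points):
    """Area enclosed by a closed boundary (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def polygon_perimeter(points):
    """Length of the closed boundary."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:] + points[:1]))

def circularity(points):
    """4*pi*area / perimeter^2: 1 for a perfect circle, smaller otherwise."""
    return 4 * math.pi * polygon_area(points) / polygon_perimeter(points) ** 2

def eccentricity(points):
    """Eccentricity of the ellipse matching the boundary's second moments."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    largest, smallest = half_trace + spread, half_trace - spread
    return math.sqrt(max(1 - smallest / largest, 0.0))

def ellipse_boundary(a, b, n=360):
    """Synthetic nucleus outline: an ellipse with semi-axes a and b."""
    return [(a * math.cos(2 * math.pi * k / n), b * math.sin(2 * math.pi * k / n))
            for k in range(n)]

round_nucleus = ellipse_boundary(1.0, 1.0)
elongated_nucleus = ellipse_boundary(2.0, 1.0)
```

On these synthetic outlines the round shape scores near 1 for circularity and near 0 for eccentricity, while the elongated one scores lower circularity and higher eccentricity. Size and texture features would be computed alongside these in a full feature vector.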
Survival Analysis
Human Machine
Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification
Oligo Related Genes
Myelin Basic Protein
Proteolipoprotein
HoxD1
Nuclear features most associated with Oligo Signature Genes:
Circularity (high)
Eccentricity (low)
Millions of Nuclei Defined by n Features
• Top-down analysis: analyze features in context of existing diagnostic constructs
• Bottom-up analysis: let nuclear features define and drive the analysis
Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information
Lee Cooper, Carlos Moreno
Consensus clustering of morphological signatures
Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
Nuclear Features Used to Classify GBMs
[Figure: consensus clustering matrix; silhouette area versus number of clusters; silhouette value by cluster]
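The idea of quantifying co-clustering over repeated K-means runs can be sketched as follows. This is a small illustration, not the study's implementation: the subsampling fraction, run count, seed, and toy two-dimensional "signatures" are all assumptions, and the K-means here is a plain Lloyd's iteration.

```python
import random

def kmeans(points, k, rng, max_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def consensus_matrix(points, k, runs=100, subsample=0.8, seed=0):
    """Fraction of runs in which each pair of items lands in the same cluster,
    counted only over runs where both items were sampled."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(subsample * n))
    for _ in range(runs):
        idx = rng.sample(range(n), size)
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Two well-separated toy groups of hypothetical morphological signatures.
group_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
group_b = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0), (10.5, 10.5)]
consensus = consensus_matrix(group_a + group_b, k=2, runs=50)
```

Entries near 1 mean a pair co-clusters in almost every run and entries near 0 mean it almost never does. Repeating this for each candidate number of clusters and comparing how crisp the matrix is gives one way to choose among the possibilities.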
Clustering identifies three morphological groups
• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)
• Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)
• Prognostically-significant (logrank p=4.5e-4)
[Figure: feature indices by cluster (CC, CM, PB); survival versus days for the CC, CM, and PB groups]
Molecular Correlates of MR Features Using TCGA Data
MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools
MR Features compared to TCGA Transcriptional Classes and Genetic Alterations
David Gutman
VASARI Feature Set
Emory CTD2 Center: High throughput protein-protein interaction interrogation in cancer
Principal Investigator and Director: Haian Fu
Co-Directors: Fadlo R. Khuri, Joel Saltz
Project Manager: Margaret Johns
Aim 1 Leader: Yuhong Du
Aim 2 Leader: Carlos Moreno
Cancer genomics-based HT PPI network discovery & validation
Genomics informatics and data integration
Winship Cancer Institute
Center for Comprehensive Informatics
Emory Chemical Biology Discovery Center
Emory Molecular Interaction Center for Functional Genomics (MicFG)
a.k.a “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!
HPC Segmentation and Feature Extraction Pipeline
Tony Pan and George Teodoro
Large Scale Data Management
Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
Highly optimized spatial query and analyses
Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2
Spatial Centric – Pathology Imaging “GIS”
Point query: human marked point inside a nucleus
Window query: return markups contained in a rectangle
Spatial join query: algorithm validation/comparison
Containment query: nuclear feature aggregation in tumor regions
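Three of the query types listed above can be sketched in plain Python. This is only an illustration of what each query asks; the markup layout (an id mapped to a polygon and a feature value), the function names, and the brute-force scans are assumptions and say nothing about the optimized CPU/GPU, Hadoop, or DB2 implementations.

```python
def point_in_polygon(x, y, polygon):
    """Point query: ray casting test for a point against a closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def window_query(markups, xmin, ymin, xmax, ymax):
    """Window query: ids of markups lying entirely inside a rectangle."""
    return [
        markup_id for markup_id, (polygon, _) in markups.items()
        if all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in polygon)
    ]

def centroid(polygon):
    """Mean of the polygon's vertices, used as a stand-in for its location."""
    return (sum(x for x, _ in polygon) / len(polygon),
            sum(y for _, y in polygon) / len(polygon))

def containment_mean(markups, region):
    """Containment query: mean feature value of nuclei whose centroid
    falls inside a region polygon."""
    values = [
        feature for polygon, feature in markups.values()
        if point_in_polygon(*centroid(polygon), region)
    ]
    return sum(values) / len(values) if values else None

def square(x, y, size=1.0):
    return [(x, y), (x + size, y), (x + size, y + size), (x, y + size)]

# Hypothetical nuclei: id -> (boundary polygon, one feature value such as area).
nuclei = {
    "n1": (square(1, 1), 10.0),
    "n2": (square(3, 3), 20.0),
    "n3": (square(8, 8), 90.0),
}
tumor_region = square(0, 0, size=5.0)
```

A spatial join for algorithm comparison would pair each polygon from one segmentation result with the overlapping polygons from another; at whole-slide scale all of these queries need spatial indexing rather than the linear scans used here.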
a.k.a “Big Data”
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
• Example Project: Find hot spots in readmissions within 30 days
– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?
– What fraction of patients with a given set of diseases will be readmitted within 30 days?
– How do severity and time course of co-morbidities affect readmissions?
– Geographic analyses
• Compare and contrast with UHC Clinical Data Base
– Repeat analyses across all UHC hospitals
– Are we performing the same?
– How are UHC-curated groupings of patients (e.g., product lines) useful?
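The first question, the fraction of encounters with a given principal diagnosis that are followed by a readmission within 30 days, can be sketched as below. The record layout, the diagnosis codes, and the dates are made up for illustration, and the sketch treats every encounter as a potential index admission, which a real analysis would refine.

```python
from collections import defaultdict
from datetime import date

def readmission_rates(encounters, window_days=30):
    """Per-diagnosis fraction of encounters followed by another admission
    of the same patient within `window_days` of discharge."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient"]].append(enc)

    counts = defaultdict(lambda: [0, 0])  # diagnosis -> [readmitted, total]
    for visits in by_patient.values():
        visits.sort(key=lambda e: e["admit"])
        for i, enc in enumerate(visits):
            readmitted = any(
                0 <= (later["admit"] - enc["discharge"]).days <= window_days
                for later in visits[i + 1:]
            )
            counts[enc["dx"]][0] += int(readmitted)
            counts[enc["dx"]][1] += 1
    return {dx: hits / total for dx, (hits, total) in counts.items()}

# Hypothetical encounters: patient id, principal diagnosis code, dates.
encounters = [
    {"patient": "p1", "dx": "428", "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5)},
    {"patient": "p1", "dx": "428", "admit": date(2012, 1, 20), "discharge": date(2012, 1, 25)},
    {"patient": "p2", "dx": "428", "admit": date(2012, 2, 1), "discharge": date(2012, 2, 3)},
    {"patient": "p2", "dx": "486", "admit": date(2012, 4, 1), "discharge": date(2012, 4, 4)},
    {"patient": "p3", "dx": "486", "admit": date(2012, 3, 1), "discharge": date(2012, 3, 2)},
]
rates = readmission_rates(encounters)
```

Grouping by a set of diseases, by unit, or by geography changes only the key used for `counts`; the pairing of each discharge with later admissions stays the same.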
Clinical Phenotype Characterization and the Emory Analytic Information Warehouse
Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
Overall System
[Architecture diagram. Components: source data (multiple sources), Database Mapper, Data Processing, Metadata Manager, Metadata Repository, Query Specification, i2b2 Database, i2b2 Web Server, query tools, study-specific database. Roles: Investigator, Data Analyst, Data Modeler.]
5-year Datasets from Emory and University Healthcare Consortium
• EUH, EUHM and WW (inpatient encounters)
• Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)
• Encounter location (down to unit for Emory)
• Providers (Emory only)
• Discharge disposition
• Primary and secondary ICD9 codes
• Procedure codes
• DRGs
• Medication orders (Emory only)
• Labs (Emory only)
• Vitals (Emory only)
• Geographic information (CDW only + US Census and American Community Survey)
Analytic Information Warehouse
Using Emory & UHC Data to Find Associations With 30-day Readmits
• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining
– Too many diagnosis codes, procedure codes
– Continuous variables (e.g., labs) require interpretation
– Temporal relationships between variables are implicit
• Solution: Transform the data into a much smaller set of variables using heuristic knowledge
– Categorize diagnosis and procedure codes using code hierarchies
– Classify continuous variables using standard interpretations (e.g., high, normal, low)
– Identify temporal patterns (e.g., frequency, duration, sequence)
– Apply standard data mining techniques
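The three transformations can each be sketched in a few lines. These are illustrative stand-ins, not the Analytic Information Warehouse's own rules: the code hierarchy is reduced to the ICD-9 three-character category, the lab reference range is supplied by the caller, and the only temporal pattern shown is a frequency count over a trailing window.

```python
from datetime import date, timedelta

def icd9_category(code):
    """Roll a detailed ICD-9 code up to its three-character category,
    e.g. '428.0' becomes '428'."""
    return code.split(".")[0]

def classify_lab(value, low, high):
    """Turn a continuous lab value into low / normal / high
    relative to a caller-supplied reference range."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def count_recent(event_dates, as_of, window_days):
    """Temporal pattern (frequency): number of events in the
    `window_days` leading up to and including `as_of`."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in event_dates if start <= d <= as_of)

# Hypothetical patient record turned into a handful of derived variables.
derived = {
    "dx_category": icd9_category("428.0"),
    "sodium": classify_lab(128, low=135, high=145),  # illustrative range
    "admissions_last_60d": count_recent(
        [date(2012, 1, 10), date(2012, 3, 1), date(2012, 3, 20)],
        as_of=date(2012, 3, 31),
        window_days=60,
    ),
}
```

After this step each encounter is described by a small number of categorical variables, which is the form that association mining and tree-based classifiers handle most easily.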
Analytic Information Warehouse
30-Day Readmission Rates for Derived Variables
Emory Health Care
Geographic Analyses
UHC Medicine General Product Line (#15)
Analytic Information Warehouse
Predictive Modeling for Readmission
• Random forests (ensemble of decision trees)
– Create a decision tree using a random subset of the variables in the dataset
– Generate a large number of such trees
– All trees vote to classify each test example in a training dataset
– Generate a patient-specific readmission risk for each encounter
• Rank the encounters by risk for a subsequent 30-day readmission
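The steps above can be sketched with a small pure-Python forest. This is a toy under stated assumptions, not the model from the talk: trees are shallow and split on Gini impurity, each tree sees a random subset of variables plus a bootstrap sample of rows (the bootstrap is the usual random forest ingredient, not something the slide states), and the three input variables and all data are invented.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(rows, labels, features, max_depth):
    """Grow a small decision tree restricted to the given variable indices."""
    if max_depth == 0 or len(set(labels)) == 1:
        return sum(labels) / len(labels)  # leaf: fraction readmitted
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return sum(labels) / len(labels)
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       features, max_depth - 1),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       features, max_depth - 1))

def tree_vote(node, row):
    """Walk one tree to a leaf and return its 0/1 vote."""
    while isinstance(node, tuple):
        f, t, below, above = node
        node = below if row[f] <= t else above
    return 1 if node >= 0.5 else 0

def train_forest(rows, labels, n_trees=50, n_features=2, max_depth=3, seed=0):
    """Each tree gets a random subset of variables and a bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        features = rng.sample(range(len(rows[0])), n_features)
        boot = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in boot],
                                 [labels[i] for i in boot], features, max_depth))
    return forest

def readmission_risk(forest, row):
    """Patient-specific risk: fraction of trees voting 'readmitted'."""
    return sum(tree_vote(tree, row) for tree in forest) / len(forest)

# Invented variables: (prior admissions, comorbidity count, unrelated code).
rows = [(i % 6, (2 * i) % 5, (3 * i) % 4) for i in range(60)]
labels = [1 if r[0] >= 3 else 0 for r in rows]
forest = train_forest(rows, labels)

encounters = {"enc_high": (5, 2, 1), "enc_low": (0, 2, 1)}
risks = {name: readmission_risk(forest, row) for name, row in encounters.items()}
ranked = sorted(risks, key=risks.get, reverse=True)
```

The vote fraction doubles as the risk score, so ranking encounters is just a sort on that score. Trees that never see the informative variable vote the same way for both example encounters, which is why the ensemble still separates them.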
Sharath Cholleti
Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest
Predictive Modeling for 180 UHC Hospitals, 35 Million Patients
Identify High Risk Patients!
Readmission fraction of top 10% high risk patients
[Figure: readmission fraction for the All Hospital Model versus the Individual Hospital Model]
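The metric named on this slide, the readmission fraction among the top 10% highest-risk patients, can be computed as in the sketch below. The function name and the toy risk scores and outcomes are assumptions for illustration.

```python
def top_fraction_readmission_rate(risks, outcomes, fraction=0.10):
    """Observed readmission fraction among the highest-risk `fraction`
    of patients, given model risk scores and 0/1 outcomes."""
    ranked = sorted(zip(risks, outcomes), key=lambda pair: pair[0], reverse=True)
    top_count = max(1, int(len(ranked) * fraction))
    top = ranked[:top_count]
    return sum(outcome for _, outcome in top) / top_count

# Twenty hypothetical patients with increasing model risk scores.
example_risks = [0.05 * i for i in range(20)]
example_outcomes = [0] * 15 + [1, 0, 1, 0, 1]
top_decile_rate = top_fraction_readmission_rate(example_risks, example_outcomes)
overall_rate = sum(example_outcomes) / len(example_outcomes)
```

Comparing the top-decile rate with the overall rate shows how much a model concentrates readmissions in the group it flags, and the same number can be computed per hospital for a pooled model and for a hospital-specific one.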
Quasi-real-time display and analysis of physiologic data from Emory University Hospital SICU
Numerics and Waveforms (240 Hz)
~ 2 sec latency
Burst of tachycardia, no desaturation
Two episodes of desaturation, no change in heart rate
HR
SpO2
This slide is for orientation. Red data are the newest, green intermediate, blue oldest. Frequency every 2 seconds.
We have started to construct alerts around desaturation behaviors
(this image courtesy IBM)
• Integrative Spatio-Temporal Analytics
• Deep Integrative Biomedical Research
• High End Computing/“Big Data” Computers, Systems Software
• Analysis of Patient Populations
Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
• Digital Pathology R01(s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
• ORNL HPC collaboration: Scott Klasky, David Pugmire (ORNL)
Thanks to
• National Cancer Institute
• National Library of Medicine
• National Science Foundation
• Cardiovascular Research Grid (NHLBI)
• Minority Health Grid (ARRA)
• Emory Health Care
• Kaiser Health Care
• Winship Cancer Institute
• Oak Ridge National Laboratory
• Woodruff Health Sciences
Thanks!