Data Science, Big Data and You

Joel Saltz MD, PhD, Emory University, February 2013

Upload: joel-saltz

Post on 15-Jan-2015

DESCRIPTION

Presentation at George Mason University, April 2013

TRANSCRIPT

Page 1: Data Science, Big Data and You

Joel Saltz MD, PhD, Emory University, February 2013

Data Science, Big Data and You

Page 2: Data Science, Big Data and You

Center for Comprehensive Informatics

Big Data

• Social media: analysis of tweets and Facebook to observe trends in real time

• Local Walgreens stock their shelves according to local tweets about cold symptoms

• Credit card fraud: lots of transactions, yet you get a flag that you shopped in a store that does not fit your profile, and within minutes your card is blocked.

Page 3: Data Science, Big Data and You

Big Data in Commerce - Fraud Detection

• Seek unexpected data – outliers

• Lots of data – all Amex, Visa or Mastercard transactions

• Look for individual outliers – e.g. a credit transaction involving a large amount of money purchasing an unusual product

• Look for sequence data with temporal or spatial relationship -- find unusual sequences, e.g., intrusion detection and cyber security

Page 4: Data Science, Big Data and You

• Define the “typical” regions in a data set – may be difficult

• “Typical” behavior may change with time. What is typical today may be considered anomalous in future and vice versa.

• (Smart) crooks will try to “keep under the radar” to stay undetected

Page 5: Data Science, Big Data and You

Approaches

• Sometimes build a model from the training data and apply the model to detect outliers

• Sometimes use the existing data directly to detect outliers
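
The two approaches can be sketched side by side. This is a minimal illustration, not the method used by any card network: the transaction amounts, the z-score threshold, the neighbor count `k`, and the distance cutoff are all assumed values chosen for the example.

```python
from statistics import mean, stdev

def fit_baseline(training_amounts):
    """Model-based approach: summarize training data as (mean, std dev)."""
    return mean(training_amounts), stdev(training_amounts)

def is_outlier_by_model(amount, model, z_threshold=3.0):
    """Flag an amount whose z-score against the fitted model is too large."""
    mu, sigma = model
    return abs(amount - mu) / sigma > z_threshold

def is_outlier_by_neighbors(amount, history, k=3, max_distance=50.0):
    """Data-driven approach: no fitted model. Flag an amount whose k-th
    nearest historical neighbor is farther away than max_distance."""
    distances = sorted(abs(amount - h) for h in history)
    return distances[k - 1] > max_distance

# Hypothetical purchase amounts for one cardholder.
history = [12.5, 40.0, 23.9, 18.2, 35.0, 27.4, 15.8, 31.1, 22.0, 29.6]
model = fit_baseline(history)

for amount in (25.0, 950.0):
    print(amount,
          is_outlier_by_model(amount, model),
          is_outlier_by_neighbors(amount, history))
```

The model-based version is cheap to apply once fitted but must be refit as "typical" behavior drifts; the neighbor-based version adapts as soon as new history is appended, at the cost of scanning the data on every check.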

Page 6: Data Science, Big Data and You

Big Data Ecosystem

Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/

Page 7: Data Science, Big Data and You
Page 8: Data Science, Big Data and You

Science and Engineering Applications

Sloan Sky Survey

Page 9: Data Science, Big Data and You

Early Big Data: 1922 - Lewis Richardson, Weather Forecasting

Page 10: Data Science, Big Data and You

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Page 11: Data Science, Big Data and You

Page 12: Data Science, Big Data and You

Scientific Big Data Targets

• Multi-dimensional spatial-temporal datasets

– Biomedicine

– Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation

– Biomass monitoring and disaster surveillance

– Weather prediction

– Analysis of Results from Large Scale Simulations

• Correlative and cooperative analysis of data from multiple sensor modalities and sources

• What-if scenarios and multiple design choices or initial conditions

Page 13: Data Science, Big Data and You

Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)

Page 14: Data Science, Big Data and You

Integrative Cancer Research with Digital Pathology

Integrated Analysis of: histology, neuroimaging, clinical/pathology, molecular

High-resolution whole-slide microscopy

Multiplex IHC

Page 15: Data Science, Big Data and You

Integrative Analysis: OSU BISTI NBIB Center

Big Data (2005)

Associate genotype with phenotype

Big science experiments on cancer, heart disease, pathogen host response

Tissue specimen -- 1 cm^3, 0.3 μ resolution – roughly 10^13 bytes

Molecular data (spatial location) can add additional significant factor; e.g. 10^2

Multispectral imaging, laser captured microdissection, Imaging Mass Spec, Multiplex QD

Multiple tissue specimens; another factor of 10^3

Total: 10^18 bytes – exabyte per big science experiment
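
The exabyte figure follows from multiplying the factors on this slide. The sketch below checks the order of magnitude; it assumes roughly one byte per voxel for the base image, which the slide does not state, and the variable names are illustrative.

```python
import math

def order_of_magnitude(x):
    """Return the power of ten just at or below x."""
    return math.floor(math.log10(x))

# 1 cm^3 specimen sampled at 0.3 micron in each dimension.
specimen_side_m = 1e-2
resolution_m = 0.3e-6
voxels = (specimen_side_m / resolution_m) ** 3   # about 3.7e13 voxels

# Assumption: about one byte per voxel, so the specimen is ~10^13 bytes.
bytes_per_specimen = 10 ** order_of_magnitude(voxels)

molecular_factor = 10 ** 2   # molecular data per spatial location
specimen_factor = 10 ** 3    # multiple tissue specimens

total_bytes = bytes_per_specimen * molecular_factor * specimen_factor
print(f"voxels per specimen: {voxels:.2e}")
print(f"total: 10^{order_of_magnitude(total_bytes)} bytes")
```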

Page 16: Data Science, Big Data and You

A Data Intense Challenge: The Instrumented Oil Field of the Future

Page 17: Data Science, Big Data and You
Page 18: Data Science, Big Data and You

The Tyranny of Scale (Tinsley Oden - U Texas)

[Figure labels: field scale, simulation scale, process scale, pore scale; length units shown: km, cm, mm]

Page 19: Data Science, Big Data and You

Why Applications Get Big

• Physical world or simulation results

• Detailed description of two, three (or more) dimensional space

• High resolution in each dimension, lots of timesteps

• e.g. oil reservoir code -- simulate 100 km by 100 km region to 1 km depth at resolution of 10 cm:

– 10^6*10^6*10^4 mesh points, 10^2 bytes per mesh point, 10^6 timesteps --- 10^24 bytes (Yottabyte) of data!!!
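
The yottabyte estimate is a straight product of the factors listed. The sketch below takes the slide's mesh counts, bytes per mesh point, and timestep count as given; the grid spacing is derived from those counts rather than assumed, and the variable names are illustrative.

```python
# Mesh counts as given on the slide.
nx, ny, nz = 10**6, 10**6, 10**4
bytes_per_point = 10**2
timesteps = 10**6

mesh_points = nx * ny * nz
total_bytes = mesh_points * bytes_per_point * timesteps

# Grid spacing implied by spreading those counts over the stated region.
region_m = 100_000      # 100 km on each horizontal side
depth_m = 1_000         # 1 km depth
horizontal_spacing_m = region_m / nx
vertical_spacing_m = depth_m / nz

print(f"mesh points: {mesh_points:.0e}")
print(f"total bytes: {total_bytes:.0e}")
print(f"implied spacing: {horizontal_spacing_m} m horizontal, "
      f"{vertical_spacing_m} m vertical")
```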

Page 20: Data Science, Big Data and You

Detect and track changes in data during production

Invert data for reservoir properties

Detect and track reservoir changes

Assimilate data & reservoir properties into the evolving reservoir model

Use simulation and optimization to guide future production

Oil Field Management – Joint ITR with Mary Wheeler, Paul Stoffa

Page 21: Data Science, Big Data and You

Coupled Ground Water and Surface Water Simulations

Multiple codes -- e.g. fluid code, contaminant transport code

Different space and time scales

Data from a given fluid code run is used in different contaminant transport code scenarios

Page 22: Data Science, Big Data and You

Bioremediation Simulation

Microbe colonies (magenta)

Dissolved NAPL (blue)

Mineral oxidation products (green)

Abiotic reactions compete with microbes, reduce extent of biodegradation

Page 23: Data Science, Big Data and You

National Science Foundation Grand Challenge in Land Cover Dynamics - 1994

• Remote sensing analysis of high resolution satellite images.

• Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling

• Maps of the world's tropical rain forest during the past three decades.

Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, John Townshend

Page 24: Data Science, Big Data and You
Page 25: Data Science, Big Data and You

Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results

Dimitri Mavriplis, Raja Das, Joel Saltz -- 1990’s

Page 26: Data Science, Big Data and You

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Page 27: Data Science, Big Data and You

Whole Slide Imaging: Scale

Data per slide: 500MB to 100GB

Roughly 250-500M Slides/Year in USA

Total: 0.1-10 Exabytes/year

Page 28: Data Science, Big Data and You

Page 29: Data Science, Big Data and You

Using TCGA Data to Study Glioblastoma

Diagnostic Improvement

Molecular Classification

Predictors of Progression

Page 30: Data Science, Big Data and You

Digital Pathology

Neuroimaging

TCGA Network

Page 31: Data Science, Big Data and You

Morphological Tissue Classification

Nuclei Segmentation

Cellular Features

Lee Cooper, Jun Kong

Whole Slide Imaging

Page 32: Data Science, Big Data and You

Oligodendroglioma Astrocytoma

Nuclear Qualities

Can we use image analysis of TCGA GBMs to inform diagnostic criteria based on molecular or clinical endpoints?

Application: Oligodendroglioma Component in GBM

Page 33: Data Science, Big Data and You

Millions of Nuclei Defined by n Features

• Top-down analysis: use the features with existing diagnostic constructs

• Bottom-up analysis: let features define and drive the analysis

Page 34: Data Science, Big Data and You

TCGA Whole Slide Images

Jun Kong

Step 1: Nuclei Segmentation

• Identify individual nuclei and their boundaries

Page 35: Data Science, Big Data and You

Nuclear Analysis Workflow

• Describe individual nuclei in terms of size, shape, and texture

Step 2: Feature Extraction

Step 1: Nuclei Segmentation
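
As an illustration of the feature extraction step, the sketch below computes two shape descriptors from a segmented boundary. The deck does not define its features; the formulas here are common textbook definitions (circularity as 4πA/P², eccentricity from the second moments of the boundary points), and the synthetic ellipse boundaries stand in for real segmentation output.

```python
import math

def polygon_area_perimeter(pts):
    """Shoelace area and perimeter of a closed polygon given as (x, y) pairs."""
    twice_area = 0.0
    perimeter = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2, perimeter

def circularity(pts):
    """4*pi*A / P^2: close to 1 for a circle, smaller for elongated shapes."""
    area, perimeter = polygon_area_perimeter(pts)
    return 4 * math.pi * area / (perimeter * perimeter)

def eccentricity(pts):
    """Eccentricity from the eigenvalues of the boundary-point covariance."""
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    sxx = sum((x - cx) ** 2 for x, _ in pts) / n
    syy = sum((y - cy) ** 2 for _, y in pts) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / n
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    major, minor = half_trace + spread, half_trace - spread
    return math.sqrt(max(0.0, 1 - minor / major))

def ellipse_boundary(a, b, n=360):
    """Synthetic boundary: n points on an ellipse with semi-axes a and b."""
    return [(a * math.cos(2 * math.pi * i / n), b * math.sin(2 * math.pi * i / n))
            for i in range(n)]

round_nucleus = ellipse_boundary(5, 5)       # hypothetical round nucleus
elongated_nucleus = ellipse_boundary(10, 5)  # hypothetical elongated nucleus

for name, boundary in (("round", round_nucleus), ("elongated", elongated_nucleus)):
    print(name, round(circularity(boundary), 3), round(eccentricity(boundary), 3))
```

Texture features would need the pixel intensities inside each boundary and are left out of this sketch.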

Page 36: Data Science, Big Data and You

Oligodendroglioma Astrocytoma

Nuclear Qualities

Scale: 1 to 10

Step 3: Nuclei Classification

Page 37: Data Science, Big Data and You

Survival Analysis

Human Machine

Page 38: Data Science, Big Data and You

Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification

Oligo Related Genes

Myelin Basic Protein

Proteolipoprotein

HoxD1

Nuclear features most associated with Oligo Signature Genes:

Circularity (high)

Eccentricity (low)

Page 39: Data Science, Big Data and You

Millions of Nuclei Defined by n Features

• Top-down analysis: analyze features in context of existing diagnostic constructs

• Bottom-up analysis: let nuclear features define and drive the analysis

Page 40: Data Science, Big Data and You

Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information

Lee Cooper, Carlos Moreno

Page 41: Data Science, Big Data and You

Consensus clustering of morphological signatures

Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients

Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering

Nuclear Features Used to Classify GBMs

[Figures: co-clustering matrix over patients (axes 20 to 160) with clusters 1 to 3 marked; silhouette area versus number of clusters (2 to 7); silhouette value (0 to 1) by cluster (1 to 3)]

Page 42: Data Science, Big Data and You

Clustering identifies three morphological groups

• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)

• Named for functions of associated genes: Cell Cycle (CC), Chromatin Modification (CM), Protein Biosynthesis (PB)

• Prognostically-significant (logrank p=4.5e-4)

[Figures: feature indices by cluster (CC, CM, PB); survival (0 to 1) versus days (0 to 3000) for CC, CM, PB]

Page 43: Data Science, Big Data and You

Molecular Correlates of MR Features Using TCGA Data

MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools

MR Features compared to TCGA Transcriptional Classes and Genetic Alterations

David Gutman

Page 44: Data Science, Big Data and You

VASARI Feature Set

Page 45: Data Science, Big Data and You

Page 46: Data Science, Big Data and You

Emory CTD2 Center: High throughput protein-protein interaction interrogation in cancer

Principal Investigator and Director: Haian Fu

Co-Directors: Fadlo R. Khuri, Joel Saltz

Project Manager: Margaret Johns

Aim 1 Leader: Yuhong Du

Aim 2 Leader: Carlos Moreno

Cancer genomics-based HT PPI network discovery & validation

Genomics informatics and data integration

Winship Cancer Institute

Center for Comprehensive Informatics

Emory Chemical Biology Discovery Center

Emory Molecular Interaction Center for Functional Genomics (MicFG)

Page 47: Data Science, Big Data and You

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Page 48: Data Science, Big Data and You

Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!

Page 49: Data Science, Big Data and You

Page 50: Data Science, Big Data and You

HPC Segmentation and Feature Extraction Pipeline

Tony Pan and George Teodoro

Page 51: Data Science, Big Data and You

Large Scale Data Management

Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.

Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships

Highly optimized spatial query and analyses

Implemented in a variety of ways including optimized CPU/GPU, Hadoop/HDFS and IBM DB2

Page 52: Data Science, Big Data and You

Spatial Centric – Pathology Imaging “GIS”

Point query: human marked point inside a nucleus

Window query: return markups contained in a rectangle

Spatial join query: algorithm validation/comparison

Containment query: nuclear feature aggregation in tumor regions
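
Three of these query types can be illustrated on plain polygons. This is a toy in-memory sketch, not the optimized CPU/GPU, Hadoop or DB2 implementations mentioned earlier; the markup names, coordinates, and the `area` feature are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Point query: ray-casting test for a point inside a closed polygon."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):                       # edge spans the ray's y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def window_query(markups, xmin, ymin, xmax, ymax):
    """Window query: names of markups lying entirely inside a rectangle."""
    return [name for name, polygon in markups.items()
            if all(xmin <= x <= xmax and ymin <= y <= ymax for x, y in polygon)]

def mean_feature_in_region(nuclei, region, feature):
    """Containment query: average a feature over nuclei whose centroid
    falls inside the region; returns None if no nucleus qualifies."""
    values = [n[feature] for n in nuclei if point_in_polygon(n["centroid"], region)]
    return sum(values) / len(values) if values else None

# Hypothetical markups (nucleus boundaries) and a hypothetical tumor region.
markups = {
    "nucleus_1": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "nucleus_2": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
tumor_region = [(0, 0), (20, 0), (20, 20), (0, 20)]
nuclei = [
    {"centroid": (2, 2), "area": 16.0},
    {"centroid": (11, 11), "area": 4.0},
    {"centroid": (30, 30), "area": 100.0},   # outside the tumor region
]

print(point_in_polygon((2, 2), markups["nucleus_1"]))
print(window_query(markups, -1, -1, 5, 5))
print(mean_feature_in_region(nuclei, tumor_region, "area"))
```

A spatial join for algorithm comparison would pair each markup from one segmentation with overlapping markups from another; at scale that is where spatial indexing matters most, and it is omitted here.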

Page 53: Data Science, Big Data and You

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Page 54: Data Science, Big Data and You

• Example Project: Find hot spots in readmissions within 30 days

– What fraction of patients with a given principal diagnosis will be readmitted within 30 days?

– What fraction of patients with a given set of diseases will be readmitted within 30 days?

– How do severity and time course of co-morbidities affect readmissions?

– Geographic analyses

• Compare and contrast with UHC Clinical Data Base

– Repeat analyses across all UHC hospitals

– Are we performing the same?

– How are UHC-curated groupings of patients (e.g., product lines) useful?

Clinical Phenotype Characterization and the Emory Analytic Information Warehouse

Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod
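
The first question, the fraction of encounters with a given principal diagnosis that are followed by a readmission within 30 days, can be sketched on a flat list of encounters. The record layout, field names, and sample encounters below are hypothetical and do not reflect the warehouse schema.

```python
from collections import defaultdict
from datetime import date

def readmission_fraction_by_diagnosis(encounters, window_days=30):
    """For each principal diagnosis, the fraction of encounters followed by
    another admission of the same patient within window_days of discharge."""
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient"]].append(enc)

    totals = defaultdict(int)
    readmits = defaultdict(int)
    for visits in by_patient.values():
        visits.sort(key=lambda enc: enc["admit"])
        for i, current in enumerate(visits):
            totals[current["diagnosis"]] += 1
            if i + 1 < len(visits):
                gap = (visits[i + 1]["admit"] - current["discharge"]).days
                if 0 <= gap <= window_days:
                    readmits[current["diagnosis"]] += 1
    return {dx: readmits[dx] / totals[dx] for dx in totals}

# Hypothetical encounters.
encounters = [
    {"patient": "A", "diagnosis": "heart failure",
     "admit": date(2012, 1, 1), "discharge": date(2012, 1, 5)},
    {"patient": "A", "diagnosis": "heart failure",
     "admit": date(2012, 1, 20), "discharge": date(2012, 1, 22)},
    {"patient": "B", "diagnosis": "heart failure",
     "admit": date(2012, 2, 1), "discharge": date(2012, 2, 3)},
    {"patient": "C", "diagnosis": "pneumonia",
     "admit": date(2012, 3, 1), "discharge": date(2012, 3, 4)},
    {"patient": "C", "diagnosis": "pneumonia",
     "admit": date(2012, 5, 1), "discharge": date(2012, 5, 3)},
]

fractions = readmission_fraction_by_diagnosis(encounters)
print(fractions)
```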

Page 55: Data Science, Big Data and You

Overall System

[Diagram components: Source data (three sources), Database Mapper, Data Processing, Metadata Manager, Metadata Repository, Query Specification, Query tools, I2b2 Database, I2b2 Web Server, Study-specific Database; roles: Investigator, Data Analyst, Data Modeler]

Page 56: Data Science, Big Data and You

5-year Datasets from Emory and University Healthcare Consortium

• EUH, EUHM and WW (inpatient encounters)

• Removed encounter pairs with chemotherapy and radiation therapy readmit encounters (CDW data)

• Encounter location (down to unit for Emory)

• Providers (Emory only)

• Discharge disposition

• Primary and secondary ICD9 codes

• Procedure codes

• DRGs

• Medication orders (Emory only)

• Labs (Emory only)

• Vitals (Emory only)

• Geographic information (CDW only + US Census and American Community Survey)

Analytic Information Warehouse

Page 57: Data Science, Big Data and You

Using Emory & UHC Data to Find Associations With 30-day Readmits

• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining

– Too many diagnosis codes, procedure codes

– Continuous variables (e.g., labs) require interpretation

– Temporal relationships between variables are implicit

• Solution: Transform the data into a much smaller set of variables using heuristic knowledge

– Categorize diagnosis and procedure codes using code hierarchies

– Classify continuous variables using standard interpretations (e.g., high, normal, low)

– Identify temporal patterns (e.g., frequency, duration, sequence)

– Apply standard data mining techniques
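
The three transformations can be sketched on a single encounter. Everything specific here is an assumption for illustration: the record layout, the choice of the three-character ICD-9 category as the hierarchy level, the potassium reference range of 3.5 to 5.0, and the 180-day look-back window.

```python
from datetime import date

def icd9_category(code):
    """Roll an ICD-9 code up to its three-character category (428.0 -> 428)."""
    return code.split(".")[0]

def classify_lab(value, low, high):
    """Map a continuous lab value to low / normal / high for a given range."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def count_recent(prior_dates, as_of, window_days):
    """Temporal pattern: number of prior admissions in the look-back window."""
    return sum(1 for d in prior_dates if 0 < (as_of - d).days <= window_days)

def derive_variables(encounter):
    """Turn raw encounter fields into a small set of categorical variables."""
    return {
        "dx_categories": sorted({icd9_category(c) for c in encounter["icd9_codes"]}),
        # Illustrative reference range; real ranges depend on the laboratory.
        "potassium": classify_lab(encounter["potassium"], 3.5, 5.0),
        "prior_admits_180d": count_recent(
            encounter["prior_admit_dates"], encounter["admit_date"], 180),
    }

# Hypothetical encounter record.
example = {
    "icd9_codes": ["428.0", "428.22", "250.00"],
    "potassium": 5.9,
    "admit_date": date(2012, 6, 1),
    "prior_admit_dates": [date(2012, 5, 10), date(2012, 1, 15), date(2011, 3, 2)],
}
derived = derive_variables(example)
print(derived)
```

The derived dictionary is the kind of compact, categorical input that standard association mining or classification tools can consume directly.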

Analytic Information Warehouse

Page 58: Data Science, Big Data and You

30-Day Readmission Rates for Derived Variables

Emory Health Care

Page 59: Data Science, Big Data and You

Geographic Analyses

UHC Medicine General Product Line (#15)

Analytic Information Warehouse

Page 60: Data Science, Big Data and You

Predictive Modeling for Readmission

• Random forests (ensemble of decision trees)

– Create a decision tree using a random subset of the variables in the dataset

– Generate a large number of such trees

– All trees vote to classify each test example in a training dataset

– Generate a patient-specific readmission risk for each encounter

• Rank the encounters by risk for a subsequent 30-day readmission

Sharath Cholleti
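
The voting scheme can be sketched with a deliberately small forest. This is a simplification, not the model used in the study: each "tree" is a depth-one stump trained on a bootstrap sample with a random subset of variables, the risk is the fraction of trees voting for readmission, and the three features (prior admissions, length of stay, age) and all values are made up.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def train_stump(rows, labels, features):
    """Best single split (feature, threshold) among the given features.
    Returns (feature, threshold, left_rate, right_rate)."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:   # no usable split: fall back to the sample's base rate
        base = sum(labels) / len(labels)
        return (features[0], float("inf"), base, base)
    return best[1:]

def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Each tree sees a bootstrap sample and a random subset of variables."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], features))
    return forest

def readmission_risk(forest, row):
    """Fraction of trees that vote 'readmitted' for this encounter."""
    votes = 0
    for f, t, left_rate, right_rate in forest:
        rate = left_rate if row[f] <= t else right_rate
        votes += rate > 0.5
    return votes / len(forest)

# Hypothetical encounters: (prior admissions, length of stay, age).
rows = [(0, 2, 35), (1, 3, 40), (0, 1, 30), (1, 2, 45), (2, 3, 50), (1, 1, 38),
        (4, 7, 70), (5, 9, 75), (3, 6, 65), (6, 8, 80), (4, 10, 72), (5, 7, 68)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # 1 = readmitted within 30 days

forest = train_forest(rows, labels)
new_encounters = {"low": (0, 1, 30), "high": (6, 10, 80)}
ranked = sorted(new_encounters,
                key=lambda name: readmission_risk(forest, new_encounters[name]),
                reverse=True)
print(ranked)
```

Real forests grow much deeper trees on many more variables; the point of the sketch is only the structure of the steps: random variable subsets, many trees, a vote, and a ranking by the resulting risk.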

Page 61: Data Science, Big Data and You

Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest

Page 62: Data Science, Big Data and You

Predictive Modeling for 180 UHC Hospitals, 35 Million Patients

Identify High Risk Patients! Readmission fraction of top 10% high risk patients

[Figure: readmission fraction (0 to 0.9) by hospital (1 to 183), comparing All Hospital Model and Individual Hospital Model]

Page 63: Data Science, Big Data and You

Quasi-real-time display and analysis of physiologic data from Emory University Hospital SICU

Numerics and Waveforms (240 Hz)

~ 2 sec latency

Page 64: Data Science, Big Data and You

Burst of tachycardia, no desaturation

Two episodes of desaturation, no change in heart rate

HR

SpO2

This slide is for orientation. Red data are the newest, green intermediate, blue oldest. Frequency every 2 seconds.

Page 65: Data Science, Big Data and You

We have started to construct alerts around desaturation behaviors

(this image courtesy IBM)
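
One simple form such an alert could take is a sustained-threshold rule over the SpO2 stream. This is only a sketch under assumed parameters: the 90% threshold, the three-sample minimum, and the sample readings are hypothetical and are not the rules used in the SICU system.

```python
def desaturation_episodes(spo2_samples, threshold=90, min_samples=3):
    """Return (start_index, end_index) for every run of at least
    min_samples consecutive readings below threshold."""
    episodes = []
    start = None
    for i, value in enumerate(spo2_samples):
        if value < threshold:
            if start is None:
                start = i                      # a new run begins
        else:
            if start is not None and i - start >= min_samples:
                episodes.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the stream.
    if start is not None and len(spo2_samples) - start >= min_samples:
        episodes.append((start, len(spo2_samples) - 1))
    return episodes

# Hypothetical SpO2 readings, one every 2 seconds.
readings = [98, 97, 89, 88, 87, 96, 95, 89, 97, 86, 85, 84, 83]
episodes = desaturation_episodes(readings)
print(episodes)
```

Requiring several consecutive low readings keeps a single noisy sample from raising an alert, at the cost of a few seconds of added delay.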

Page 66: Data Science, Big Data and You

• Integrative Spatio-Temporal Analytics

• Deep Integrative Biomedical Research

• High End Computing/“Big Data” Computers, Systems Software

• Analysis of Patient Populations

Page 67: Data Science, Big Data and You

Thanks to:

• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)

• Digital Pathology R01 (s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)

• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod

• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz

• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe

• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos

• ORNL HPC collaboration: Scott Klasky, David Pugmire ORNL

Page 68: Data Science, Big Data and You

Thanks to

• National Cancer Institute

• National Library of Medicine

• National Science Foundation

• Cardiovascular Research Grid (NHLBI)

• Minority Health Grid (ARRA)

• Emory Health Care

• Kaiser Health Care

• Winship Cancer Institute

• Oak Ridge National Laboratory

• Woodruff Health Sciences

Page 69: Data Science, Big Data and You

Thanks!