ExpoDB: An Exploratory Data Science Platform

© 2016 Mohammad Sadoghi (Purdue University). ExpoDB: An Exploratory Data Science Platform (A New Frontier: From Data Processing to Knowledge Exploration). Mohammad Sadoghi, Assistant Professor, Department of Computer Science, Purdue University. IBM Cognitive Systems Institute Speaker Series, September 29, 2016.


TRANSCRIPT

Page 1: "ExpoDB: An Exploratory Data Science Platform"

© 2016 Mohammad Sadoghi (Purdue University)

ExpoDB: An Exploratory Data Science Platform (A New Frontier: From Data Processing to Knowledge Exploration)

Mohammad Sadoghi, Assistant Professor, Department of Computer Science, Purdue University

IBM Cognitive Systems Institute Speaker Series, September 29, 2016

Page 2: "ExpoDB: An Exploratory Data Science Platform"

Insight is Lost in Islands of Data

http://www.cpsresearch.eu/clinical-trials/

http://news.mit.edu/2015/mnookin-vaccination-public-health-0227

http://www.healthcarepackaging.com/trends-and-issues/clinical-trials

http://stormercellularloo.gq/evolve-ii-clinical-trial.html

https://www.geneticliteracyproject.org

Data is spread across many islands of disconnected sources (a lack of a holistic view)

Page 3: "ExpoDB: An Exploratory Data Science Platform"

Insight is Lost in Islands of Data

Sadly, adverse drug reactions (ADRs) are the 4th leading cause of death in the United States, resulting in 100,000 deaths annually

Page 4: "ExpoDB: An Exploratory Data Science Platform"

Insight is Lost in Islands of Data

Adverse drug reactions cost over $136 billion in the US annually

Page 5: "ExpoDB: An Exploratory Data Science Platform"

Real-time Fusion and Exploration of Data

Page 6: "ExpoDB: An Exploratory Data Science Platform"

Real-time Fusion and Exploration of Enriched Data

Page 7: "ExpoDB: An Exploratory Data Science Platform"

Real-time Fusion and Exploration of Enriched Data at Web Scale

Page 8: "ExpoDB: An Exploratory Data Science Platform"

Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data

[Figure: graph connecting drugs, genes, and diseases drawn from open data sources.
Drugs: Naproxen (Aleve), Methotrexate, Warfarin, Nicotine.
Genes and enzymes: PTGS2 (Gene), TP53 (Gene), DHFR (Gene), VKORC1 (Gene), CYP2C9 (Enzyme).
Diseases: Rheumatoid Arthritis, Osteosarcoma (Bone Cancer), Arthritis, Embolism (Blood Clot).
Disease categories: Disease, Immune System, Autoimmune, Joint Diseases, Sarcoma, Neoplasms.
Chemical categories: Chemical, Carboxylic Acids, Heterocyclic, Aminopterin, Phenylpropionates, Approved Drugs.
Edge labels: inhibits, increased degradation.
Annotations: limits cell growth, tumor suppressor.]

Why capture the semantics/context? Semantics is essential to connect the dots.

Page 13: "ExpoDB: An Exploratory Data Science Platform"

Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data

How to capture the context?

(1) Instance Layer: Capturing raw data instances, including both structured & semi-structured data

[Figure: the drug, gene, and disease graph from the preceding slides, repeated.]

Page 14: "ExpoDB: An Exploratory Data Science Platform"

Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data

How to capture the context?

(2) Relation Layer: Capturing the interconnectedness of data instances across data sources

[Figure: the drug, gene, and disease graph from the preceding slides, repeated.]

Page 15: "ExpoDB: An Exploratory Data Science Platform"

Drug Safety: Challenges of Real-time Fusion & Exploration of Open Data

How to capture the context?

(3) Semantic Layer: Capturing conceptual relationships among data instances and their types

[Figure: the drug, gene, and disease graph from the preceding slides, repeated.]

Page 16: "ExpoDB: An Exploratory Data Science Platform"

Enriched Data Model: Semantics is essential to connect the dots

Layer labels: Semantic layer, Relation layer, Instance layer
Level labels: Knowledge, Information, Data

Disease categories: Immune System, Autoimmune, Joint Diseases, Sarcoma, Neoplasms

Graph nodes: PTGS2 (Gene), TP53 (Gene), DHFR (Gene), Acetaminophen (Tylenol), Ibuprofen (Advil), Methotrexate, Warfarin, Rheumatoid Arthritis, Osteosarcoma (Bone Cancer), Relief Fever, Arthritis, Embolism (Blood Clot)

DrugName        DrugTargets (Genes)   Symptomatic Treatment
Ibuprofen       PTGS2                 Rheumatoid Arthritis
Acetaminophen   PTGS2                 Relief Fever
Methotrexate    DHFR                  Antineoplastic Anti-metabolite
Warfarin        TP53                  Embolism (Blood Clot)

Gene    Interaction
PTGS2   TP53 (Gene)

Gene    Function
TP53    Tumor Suppressor
DHFR    Limits Cell Growth

Gene    Disease
TP53    Osteosarcoma

Data sources:
DrugBank: Bioinformatics & Cheminformatics Resource
CTD: Comparative Toxicogenomics Database
UniProt: Universal Protein Resource

Warfarin has a narrow therapeutic range (fatal outcomes)

Dosage for Asian population: 3.4 mg

Dosage for White population: 5.1 mg

Dosage for African-American population: 6.1 mg
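The three layers on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ExpoDB's actual data model: the instance records are copied from the slide's tables, while the predicate names (`targets`, `interacts_with`, `has_function`, `associated_with`), the `IS_A` taxonomy edges, and all function names are hypothetical. The traversal shows what "connecting the dots" means once records from separate sources are linked; any link it finds is only what the slide's tables imply when joined.

```python
from collections import defaultdict

# Instance layer: raw records, copied from the tables on the slide.
DRUG_TARGETS = [
    ("Ibuprofen", "PTGS2", "Rheumatoid Arthritis"),
    ("Acetaminophen", "PTGS2", "Relief Fever"),
    ("Methotrexate", "DHFR", "Antineoplastic Anti-metabolite"),
    ("Warfarin", "TP53", "Embolism (Blood Clot)"),
]
GENE_INTERACTIONS = [("PTGS2", "TP53")]
GENE_FUNCTIONS = [("TP53", "Tumor Suppressor"), ("DHFR", "Limits Cell Growth")]
GENE_DISEASES = [("TP53", "Osteosarcoma")]

# Semantic layer: a tiny is-a taxonomy. These edges are an assumption made
# for illustration; the slide only lists the category names.
IS_A = {"Osteosarcoma": "Sarcoma", "Sarcoma": "Neoplasms"}


def build_relation_layer():
    """Relation layer: link instances across sources as (subject, predicate, object)."""
    edges = set()
    for drug, gene, _treatment in DRUG_TARGETS:
        edges.add((drug, "targets", gene))
    for gene_a, gene_b in GENE_INTERACTIONS:
        edges.add((gene_a, "interacts_with", gene_b))
    for gene, function in GENE_FUNCTIONS:
        edges.add((gene, "has_function", function))
    for gene, disease in GENE_DISEASES:
        edges.add((gene, "associated_with", disease))
    return edges


def reachable_diseases(drug, edges):
    """Follow edges outward from a drug and collect diseases reached via gene links."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in edges:
        adjacency[subject].append((predicate, obj))
    found, seen, stack = set(), {drug}, [drug]
    while stack:
        node = stack.pop()
        for predicate, obj in adjacency[node]:
            if predicate == "associated_with":
                found.add(obj)
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return found


def generalize(term):
    """Walk the is-a taxonomy upward and return the chain of broader categories."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain
```

For example, `reachable_diseases("Ibuprofen", build_relation_layer())` follows Ibuprofen to PTGS2, PTGS2 to TP53, and TP53 to Osteosarcoma, a chain that no single source table contains on its own, and `generalize("Osteosarcoma")` then lifts the result to its broader categories.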

Page 22: "ExpoDB: An Exploratory Data Science Platform"

Context-aware Query Model

Ranking components: Query Representation Ranking, Query Refinement Ranking, Data Source Discovery Ranking, Query Composition Ranking, Query Answer Ranking, Evidence Ranking, Answer Representation Ranking

“Is 5.0 mg an effective dosage of Warfarin for preventing blood clots?” (Yes/No)

“Is Warfarin sensitive to ethnic background?”

“Does Warfarin have a narrow therapeutic range?”

“What are the disjoint classes of population with respect to Warfarin?”

“What are the adverse reactions of Warfarin?”

“What is an effective dosage of Warfarin for preventing blood clots?”

Page 26: "ExpoDB: An Exploratory Data Science Platform"

Context-aware Query Model

“Is 5.0 mg an effective dosage of Warfarin for preventing blood clots?”

“What are the disjoint classes of population with respect to Warfarin?”

“What is an effective dosage of Warfarin for preventing blood clots?”

“Does Warfarin have a narrow therapeutic range?”

Dosage for African-American population: 6.1 mg

Dosage for White population: 5.1 mg

Dosage for Asian population: 3.4 mg

Querying different sources returns 6.1 mg, 5.1 mg, & 3.4 mg, so is the data inconsistent? (revisiting the consistent answers formalism & possible world semantics)

Given the known narrow therapeutic range, is 5.1 mg close enough to 5.0 mg? (fuzzy answers formalism in the presence of enriched data)
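The two questions on this slide can be made concrete with a small Python sketch. It is a simplified illustration, not the consistent-answers or fuzzy-answers formalism itself: the three dosage facts come from the slide, while the `DosageFact` type, the function names, the triangular membership function, and the 0.5 mg tolerance are hypothetical choices made here. The first part shows that the three values only look contradictory when the population context is ignored; the second grades how close a candidate dose is to a reference dose instead of answering a strict yes or no.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DosageFact:
    drug: str
    population: str
    dose_mg: float


# Facts as stated on the slide.
FACTS = [
    DosageFact("Warfarin", "African-American", 6.1),
    DosageFact("Warfarin", "White", 5.1),
    DosageFact("Warfarin", "Asian", 3.4),
]

# Hypothetical tolerance standing in for a "narrow therapeutic range".
# The slides do not give a number; 0.5 mg is an assumption for illustration.
TOLERANCE_MG = 0.5


def doses(drug, population=None):
    """Distinct doses reported for a drug, optionally restricted to one population."""
    return {
        f.dose_mg
        for f in FACTS
        if f.drug == drug and (population is None or f.population == population)
    }


def is_consistent(drug, population=None):
    """The answer is consistent when at most one distinct dose is reported."""
    return len(doses(drug, population)) <= 1


def closeness(candidate_mg, reference_mg, tolerance_mg=TOLERANCE_MG):
    """Triangular membership: 1.0 at an exact match, falling linearly to 0.0 at the tolerance."""
    if tolerance_mg <= 0:
        raise ValueError("tolerance_mg must be positive")
    return max(0.0, 1.0 - abs(candidate_mg - reference_mg) / tolerance_mg)
```

Without context, `is_consistent("Warfarin")` is false because three doses are returned; conditioned on a population, for instance `is_consistent("Warfarin", "White")`, it is true. Under the assumed tolerance, `closeness(5.0, 5.1)` gives a high but not perfect degree of match, whereas `closeness(5.0, 3.4)` is zero.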

Page 27: "ExpoDB: An Exploratory Data Science Platform"

Spark Architecture: Knowledge Oblivious

Applications: Personalized Medicine (Drug Discovery/Safety), Computational Finance, Compliance Informatics

APIs/Services (Access/Interfaces): Spark Streaming, SparkSQL, BlinkDB, GraphX, SparkR, MLlib

Processing Engine: Apache Spark (General Data Processing on Distributed Memory)

Data Model (Immutable Collection of Objects): Spark Data Model (Resilient Distributed Datasets, RDDs)

Storage: Distributed File Systems (e.g., HDFS, S3, Ceph), Distributed Memory (Tachyon), Compression (Succinct)

Resource Virtualization: Resource Abstractions (Apache Mesos), Resource Management (Hadoop YARN)

Page 32: "ExpoDB: An Exploratory Data Science Platform"

ExpoDB Architecture: From Data to Knowledge

Applications: Personalized Medicine (Drug Discovery/Safety), Computational Finance, Compliance Informatics

APIs/Services (Access/Interfaces): Spark Streaming, SparkSQL, BlinkDB, GraphX, SparkR, MLlib

Processing Engine: Apache Spark (General Data Processing on Distributed Memory)

Data Model (Enriching Raw Data Towards Knowledge):
Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
Spark Data Model (RDDs), Generic Data Model (Key-Value Store)

Storage: Distributed File Systems (e.g., HDFS, S3, Ceph), Distributed Memory (Tachyon), Compression (Succinct)

Resource Virtualization: Resource Abstractions (Apache Mesos), Resource Management (Hadoop YARN)

Page 33: "ExpoDB: An Exploratory Data Science Platform"

ExpoDB Architecture: From Data to Knowledge

Applications: Personalized Medicine (Drug Discovery/Safety), Computational Finance, Compliance Informatics

APIs/Services (Access/Interfaces): Spark Streaming, SparkSQL, BlinkDB, GraphX, SparkR, MLlib

Processing Engine: Reasoning, Refinement, Curation, Fusion, Discovery; Online Transactional Processing (OLTP) + Online Analytical Processing (OLAP)

Data Model (Enriching Raw Data Towards Knowledge):
Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
Spark Data Model (RDDs), Generic Data Model (Key-Value Store)

Storage: Distributed File Systems (e.g., HDFS, S3, Ceph), Distributed Memory (Tachyon), Compression (Succinct)

Resource Virtualization: Resource Abstractions (Apache Mesos), Resource Management (Hadoop YARN)
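The slide places a generic key-value data model beneath an instance layer that holds relational, graph/RDF, matrix, and JSON data, but it does not say how those formats are mapped onto keys and values. The Python sketch below shows one plausible encoding, purely as an assumption: the path-style key scheme and all function names are hypothetical and are not taken from ExpoDB.

```python
def relational_to_kv(table, key_column, row):
    """One pair per non-key column, keyed by table, row identifier, and column name."""
    row_id = row[key_column]
    return [
        (f"{table}/{row_id}/{column}", value)
        for column, value in row.items()
        if column != key_column
    ]


def triple_to_kv(subject, predicate, obj):
    """An RDF-style triple becomes a single key; the object is part of the key
    so that several objects per predicate do not overwrite each other."""
    return [(f"graph/{subject}/{predicate}/{obj}", True)]


def sparse_to_kv(name, matrix):
    """Store only the non-zero cells of a matrix given as a list of rows."""
    return [
        (f"{name}/{i}/{j}", value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


def json_to_kv(prefix, document):
    """Flatten a nested dictionary into path-style keys."""
    pairs = []
    for key, value in document.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            pairs.extend(json_to_kv(path, value))
        else:
            pairs.append((path, value))
    return pairs
```

With an encoding along these lines, all four instance formats end up in a single flat key space, so one storage and processing path can serve them all. Whether ExpoDB makes this particular choice is not stated in the slides.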

Page 34: "ExpoDB: An Exploratory Data Science Platform"

ExpoDB Architecture: Active Data Path

Applications: Personalized Medicine (Drug Discovery/Safety), Computational Finance, Compliance Informatics

APIs/Services (Access/Interfaces): Spark Streaming, SparkSQL, BlinkDB, GraphX, SparkR, MLlib

Processing Engine: Reasoning, Refinement, Curation, Fusion, Discovery; Online Transactional Processing (OLTP) + Online Analytical Processing (OLAP)

Data Model (Enriching Raw Data Towards Knowledge):
Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
Spark Data Model (RDDs), Generic Data Model (Key-Value Store)

Storage: Distributed File Systems (e.g., HDFS, S3, Ceph), Distributed Memory (Tachyon), Compression (Succinct)

Resource Virtualization: Resource Abstractions (Apache Mesos), Resource Management (Hadoop YARN)

Virtualized Hardware Acceleration (GPU & FPGA)

Page 35: "ExpoDB: An Exploratory Data Science Platform"

The First Step!

Applications: Personalized Medicine (Drug Discovery/Safety), Computational Finance, Compliance Informatics

APIs/Services (Access/Interfaces): Spark Streaming, SparkSQL, BlinkDB, GraphX, SparkR, MLlib

Processing Engine: Reasoning, Refinement, Curation, Fusion, Discovery; Online Transactional Processing (OLTP) + Online Analytical Processing (OLAP)

Data Model (Enriching Raw Data Towards Knowledge):
Semantic Layer: Ontology, Rules, Stochastic Models, Tensor Embedding
Relation Layer: Intra- & Inter-domain Linkage (fine-grained & instance-level)
Instance Layer: Relational, Graph/RDF, Dense/Sparse Matrices, JSON
Spark Data Model (RDDs), Generic Data Model (Key-Value Store)

Storage: Distributed File Systems (e.g., HDFS, S3, Ceph), Distributed Memory (Tachyon), Compression (Succinct)

Resource Virtualization: Resource Abstractions (Apache Mesos), Resource Management (Hadoop YARN)

Virtualized Hardware Acceleration (GPU & FPGA)

Systems shown on the slide:
L-Store (Real-time OLTP+OLAP)
FQP (Flexible Query Processor)
EmbedS (Ontology)
Phenomenological Features (Deep-Learning-as-Oracle)
PADRES (Event Processing)
IBM DB2 BLU (Column Store)
SPIDER (Declarative Data Cleansing)
Vraph (Vectorized Graph Processing)
Tiresias (Predicting Adverse Drug Reaction)
fpga-ToPSS (Algorithmic Trading)

Page 36: "ExpoDB: An Exploratory Data Science Platform"

Thank You

Q&A

Exploratory Systems Lab (ExpoLab) website: https://msadoghi.github.io/

Page 37: "ExpoDB: An Exploratory Data Science Platform"

References:

Data/Knowledge Exploration:

• Mohammad Sadoghi, Kavitha Srinivas, Oktie Hassanzadeh, Yuan-Chi Chang, Mustafa Canim, Achille Fokoue, Yishai A. Feldman: Self-Curating Databases. EDBT 2016

• Amit Chandel, Oktie Hassanzadeh, Nick Koudas, Mohammad Sadoghi, Divesh Srivastava: Benchmarking declarative approximate selection predicates. SIGMOD Conference 2007: 353-364

• Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller: Accuracy of Approximate String Joins Using Grams. QDB 2007

Drug Safety:

• Achille Fokoue, Mohammad Sadoghi, Oktie Hassanzadeh, Ping Zhang: Predicting Drug-Drug Interactions Through Large-Scale Similarity-Based Link Prediction. ESWC 2016

• Achille Fokoue, Oktie Hassanzadeh, Mohammad Sadoghi, Ping Zhang: Predicting Drug-Drug Interactions Through Similarity-Based Link Prediction Over Web Data. WWW 2016

OLTP & OLAP:

• Mohammad Sadoghi, Souvik Bhattacherjee, Bishwaranjan Bhattacharjee, Mustafa Canim: L-Store: A Real-time OLTP and OLAP System. CoRR abs/1601.04084 (2016)

• Kaiwen Zhang, Mohammad Sadoghi, Hans-Arno Jacobsen: DL-Store: A Distributed Hybrid OLTP and OLAP Data Processing Engine. ICDCS 2016

• Mohammad Sadoghi, Kenneth A. Ross, Mustafa Canim, Bishwaranjan Bhattacharjee: Exploiting SSDs in operational multiversion databases. VLDB J. 25(5): 651-672 (2016)

• Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, Kenneth A. Ross: Reducing Database Locking Contention Through Multi-version Concurrency. PVLDB 7(13): 1331-1342 (2014)

• Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen: CaSSanDra: An SSD boosted key-value store. ICDE 2014: 1162-1167

• Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen: Optimizing key-value stores for hybrid storage architectures. CASCON 2014: 355-358

• Mohammad Sadoghi, Kenneth A. Ross, Mustafa Canim, Bishwaranjan Bhattacharjee: Making Updates Disk-I/O Friendly Using SSDs. PVLDB 6(11): 997-1008 (2013)

Hardware Acceleration:

• Rajesh R. Bordawekar, Mohammad Sadoghi: Accelerating database workloads by software-hardware-system co-design. ICDE 2016

• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: SplitJoin: A Scalable, Low-latency Stream Join Architecture with Adjustable Ordering Precision. USENIX Annual Technical Conference 2016

• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: The FQP Vision: Flexible Query Processing on a Reconfigurable Computing Fabric. SIGMOD Record 44(2): 5-10 (2015)

• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: Configurable hardware-based streaming architecture using Online Programmable-Blocks. ICDE 2015

• Mohammadreza Najafi, Mohammad Sadoghi, Hans-Arno Jacobsen: Flexible Query Processor on FPGAs. PVLDB 6(12): 1310-1313 (2013)

• Mohammad Sadoghi, Rija Javed, Naif Tarafdar, Harsh Singh, Rohan Palaniappan, Hans-Arno Jacobsen: Multi-query Stream Processing on FPGAs. ICDE 2012: 1229-1232

• Mohammad Sadoghi, Harsh Singh, Hans-Arno Jacobsen: Towards highly parallel event processing through reconfigurable hardware. DaMoN 2011: 27-32

• Mohammad Sadoghi, Harsh Singh, Hans-Arno Jacobsen: fpga-ToPSS: line-speed event processing on fpgas. DEBS 2011: 373-374

• Mohammad Sadoghi, Hans-Arno Jacobsen, Martin Labrecque, Warren Shum, Harsh Singh: Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading. PVLDB 3(2): 1525-1528 (2010)