preproposal talk

Query Federation over Biomedical Linked Open Data: Applications in Pharmacovigilance and Question-Answering. Maulik R. Kamdar, Musen Lab. Pre-proposal Talk, August 9, 2017.

Upload: maulik-kamdar

Post on 23-Jan-2018


TRANSCRIPT

Query Federation over Biomedical Linked Open Data

Applications in Pharmacovigilance and Question-Answering

Maulik R. Kamdar, Musen Lab

Pre-proposal Talk, August 9, 2017

The data and knowledge discovery bottleneck

Biomedical Queries

What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?

List molecular characteristics of the antineoplastic drugs that target EGFR and have Mol. Wt < 300 g/mol.

Desirable Drugs, Molecular characteristics, Protein Targets, Downstream Genes, …

Biomedical Informatics Research Methods

Open PHACTS. Williams, et al. Drug Discovery Today, 2012

Post-marketing surveillance for detecting drug-drug interactions and the adverse reactions

Jane P.F. Bai and Darrell R. Abernethy. Annual Review of Pharmacology and Toxicology 53 (2013)

Mechanism-based prediction

Jane P.F. Bai and Darrell R. Abernethy. Annual Review of Pharmacology and Toxicology 53 (2013)

Isolated databases and knowledge bases

DISTRIBUTED DATA and KNOWLEDGE

•  Formats (XML, CSV, MySQL Database, etc.)
•  Entity Notations (Ensembl, Entrez, HGNC, etc.)
•  Schemas (Small Compound, Compound, etc.)

A novel data integration method to tackle the challenges of inconsistencies, incompleteness and heterogeneity across data and knowledge sources

Semantic Web Technologies

Berners-Lee, Scientific American 2001. Tim Berners-Lee: The next Web of open, linked data (TED Talk 2009)

RDF: Publishing data as a graph

Gleevec (Mol. Wt.: 589.25 g/mol, Half-Life: 18 hours) inhibits PDGFR, involved in signal transduction.

[Figure: Resource Description Framework (RDF) graph for Gleevec. Node labels: Gleevec (DrugBank:DB00619), KEGG:D01441, PDGFR, GO:0007165 (Signal Transduction). Edge labels: mol_weight (589.25), half-life (“18 hours”), x-ref, target, name, type (Inhibits), process. Nodes are named by Uniform Resource Identifiers, e.g. http://bio2rdf.org/drugbank:DB00619 and http://bio2rdf.org/kegg:D01441.]
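A minimal Python sketch of the Gleevec graph, assuming plain (subject, predicate, object) tuples rather than an RDF library. Which node carries which edge is read off the slide labels and is an assumption, as are the intermediate node name `_:t1` and the helper `objects`.

```python
# Sketch only: an RDF-style graph as a set of (subject, predicate, object) tuples.
DRUGBANK_GLEEVEC = "http://bio2rdf.org/drugbank:DB00619"
KEGG_GLEEVEC = "http://bio2rdf.org/kegg:D01441"

triples = {
    (DRUGBANK_GLEEVEC, "half-life", "18 hours"),
    (DRUGBANK_GLEEVEC, "x-ref", KEGG_GLEEVEC),
    (KEGG_GLEEVEC, "mol_weight", 589.25),
    (KEGG_GLEEVEC, "target", "_:t1"),   # hypothetical intermediate node
    ("_:t1", "name", "PDGFR"),
    ("_:t1", "type", "Inhibits"),
    ("_:t1", "process", "GO:0007165"),
}


def objects(graph, subject, predicate):
    """Return all objects of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in graph if s == subject and p == predicate]


# Follow the x-ref edge from the DrugBank node to reach the KEGG attributes.
for kegg_uri in objects(triples, DRUGBANK_GLEEVEC, "x-ref"):
    print(kegg_uri, objects(triples, kegg_uri, "mol_weight"))
```

Because every node is a URI, a second source can attach new edges to the same node without any schema change.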

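SPARQL: Querying the graph

What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?

[Figure: the graph expressed as a SPARQL Query Language pattern. The drug and target nodes become variables (?), the mol_weight value is constrained to < 1000, ?half-life is the requested value, and the process node stays fixed at GO:0007165 (Signal Transduction). Edge labels: mol_weight, half-life, x-ref, target, name, type (Inhibits), process.]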
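The pattern-matching idea behind the query can be sketched in pure Python. This is not a SPARQL engine: it assumes all data already sits in one list of triples, and the node names, the `match` and `query` helpers, and the two non-Gleevec drugs are hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            # Variable: bind it, or check it agrees with an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(graph, patterns):
    """Join the triple patterns one by one, like a basic graph pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in graph
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


graph = [
    ("drug:Gleevec", "mol_weight", 589.25),
    ("drug:Gleevec", "half-life", "18 hours"),
    ("drug:Gleevec", "target", "protein:PDGFR"),
    ("protein:PDGFR", "process", "GO:0007165"),
    ("drug:Heavy", "mol_weight", 1200.0),        # hypothetical, fails the filter
    ("drug:Heavy", "half-life", "2 hours"),
    ("drug:Heavy", "target", "protein:PDGFR"),
    ("drug:Other", "mol_weight", 300.0),         # hypothetical, wrong process
    ("drug:Other", "half-life", "5 hours"),
    ("drug:Other", "target", "protein:X"),
    ("protein:X", "process", "GO:0000000"),
]

patterns = [
    ("?drug", "mol_weight", "?mw"),
    ("?drug", "half-life", "?hl"),
    ("?drug", "target", "?protein"),
    ("?protein", "process", "GO:0007165"),
]

# The numeric constraint plays the role of a SPARQL FILTER.
results = [(b["?drug"], b["?hl"]) for b in query(graph, patterns) if b["?mw"] < 1000]
print(results)
```

Variables shared between patterns (`?drug`, `?protein`) are what join the individual triples into one answer.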

Linked Open Data (LOD) Cloud

Cyganiak, Richard et al. 2014

Life Sciences Linked Open Data (LSLOD) Cloud

Saleem, M., Kamdar, MR. et al., Journal of Web Semantics 2014.

Callahan, A., et al., ISWC 2013.

Not an exclusive club anymore…

Ebola Virus knowledge base queries the LSLOD cloud

Maulik R. Kamdar and Michel Dumontier. An Ebola Virus-centered Knowledge Base. Database (2015)

What my project is all about…

Challenges mining the LSLOD cloud

Automated querying across the LSLOD cloud

Querying Statistics:
•  40+ Linked Data Sources
•  10,000+ classes, object and data properties
•  30,000+ edges

Initial insights:
•  Minimal sharing of common vocabularies and ontologies
•  Hub nodes:
   •  rdfs:label
   •  dc:title
   •  bio2rdf:identifier

Maulik R. Kamdar, et al. (manuscript in preparation)

Mining LSLOD cloud is not as easy as it seems…

•  Isolated SPARQL endpoints or RDF Dumps (with varying support for SPARQL operators)
•  Different URI notations, with no explicit x-ref links
   •  http://bio2rdf.org/uniprot:P45059
   •  http://purl.uniprot.org/uniprot/P45059
•  Malformed URIs, unavailable SPARQL endpoints, etc.
   •  http://bio2rdf.org/kegg:map00010, http://bio2rdf.org/kegg:00010
   •  http://bio2rdf.org/go:0030307\”
•  Heterogeneity in the LSLOD cloud datasets
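One way to cope with differing URI notations is to reduce each URI to a compact `namespace:identifier` form before comparing. The sketch below assumes only the two URI layouts listed on the slide; the regular expressions and the helper name `to_curie` are illustrative, not part of any source's tooling.

```python
import re

# Assumed layouts: bio2rdf uses "<namespace>:<id>", UniProt PURLs use "/uniprot/<id>".
_PATTERNS = [
    re.compile(r"^https?://bio2rdf\.org/(?P<ns>[a-z0-9_]+):(?P<id>[A-Za-z0-9_.-]+)"),
    re.compile(r"^https?://purl\.uniprot\.org/(?P<ns>uniprot)/(?P<id>[A-Za-z0-9]+)"),
]


def to_curie(uri):
    """Return 'namespace:id' for a recognised URI, dropping trailing junk."""
    for pattern in _PATTERNS:
        m = pattern.match(uri.strip())
        if m:
            return f"{m['ns']}:{m['id']}"
    return None  # unknown layout: leave the decision to the caller


same_protein = (to_curie("http://bio2rdf.org/uniprot:P45059")
                == to_curie("http://purl.uniprot.org/uniprot/P45059"))
print(same_protein, to_curie(r'http://bio2rdf.org/go:0030307\"'))
```

This does not reconcile identifier variants such as `kegg:map00010` and `kegg:00010`; those would need source-specific rules.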

Heterogeneity in the LSLOD Cloud

•  Inconsistent attribute values for entities

Inconsistent Attribute Values: Different actual values and datatypes

Gleevec, molecular-weight: 493.61 (clinical features)
Gleevec, mol_weight: 589.25 (biological features)

Name | Mol. Formula (KEGG) | Mol. Formula (DrugBank)
Lepirudin | C287H440N80O111S6 | C287H440N80O110S6
Pyridoxal Phosphate | C8H10NO6P.H2O | C8H10NO6P
Cevimeline | (C10H17NOS)2.2HCl.H2O | C10H17NOS
Cisplatin | PtCl2.2NH3 | Cl2H4N2Pt
Sodium bicarbonate | NaHCO3 | CHNaO3
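Some of these formula pairs differ only in notation and others in content. A small sketch that counts atoms makes the difference visible; it assumes formulas without parentheses (so the Cevimeline row is out of scope), and the helper names are hypothetical.

```python
import re
from collections import Counter


def element_counts(formula):
    """Count atoms in a formula such as 'PtCl2.2NH3' (no parentheses)."""
    total = Counter()
    for part in formula.split("."):
        # A leading integer multiplies the whole dot-separated part.
        multiplier, body = re.fullmatch(r"(\d*)(.*)", part).groups()
        factor = int(multiplier or 1)
        for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", body):
            total[symbol] += factor * int(count or 1)
    return total


def formula_diff(a, b):
    """Map each element whose count differs to its (count in a, count in b)."""
    ca, cb = element_counts(a), element_counts(b)
    return {el: (ca[el], cb[el]) for el in sorted(set(ca) | set(cb)) if ca[el] != cb[el]}


print(formula_diff("NaHCO3", "CHNaO3"))                        # notation only
print(formula_diff("C287H440N80O111S6", "C287H440N80O110S6"))  # content differs
print(formula_diff("PtCl2.2NH3", "Cl2H4N2Pt"))
```

An empty result means the two sources agree up to element ordering; a non-empty one flags a real inconsistency that query federation cannot resolve by itself.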

Heterogeneity in the LSLOD Cloud

•  Incomplete entities

Incompleteness: Completely unique entities across sources

E1: Drug

Findings similar to “Will the correct drugs please stand up?” - Southan et al. GCC 2016

Heterogeneity in the LSLOD Cloud

•  Incomplete relations between entities

Incompleteness: Completely unique relations across sources

R1: Drug hasTarget Protein

Heterogeneity in the LSLOD Cloud

•  Inconsistent URI labels for classes, relations and attributes

Label Mismatch: Different labels for classes, relations and attributes

Gleevec, molecular-weight: 493.61 (clinical features)
Gleevec, mol_weight: 589.25 (biological features)

Uniform Resource Identifier (URI) | Parsed Label
http://bio2rdf.org/drugbank_vocabulary:Molecular-Weight | MolecularWeight
http://www.biopax.org/release/biopax-level3.owl#molecularWeight | MolecularWeight
http://semanticscience.org/resource/CHEMINF_000198 | molecular weight calculated by pipeline pilot
http://mo-ld.org/mine_vocabulary:hasMolecularWeight | HasMolecularWeight
http://bio2rdf.org/kegg_vocabulary:mol_weight | MolWeight
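The parsed labels in the table can be reproduced with a few lines of Python. This is a guess at the parsing rule (take the local name after the last `#`, `/` or `:`, then join its tokens in upper camel case); it does not cover opaque identifiers such as CHEMINF_000198, whose label presumably comes from the source's own annotation.

```python
import re


def parsed_label(uri):
    """Derive a camel-case label from the local name of a URI (assumed rule)."""
    local = re.split(r"[#/:]", uri)[-1]
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", local) if t]
    # Upper-case only the first letter so inner capitals survive.
    return "".join(t[0].upper() + t[1:] for t in tokens)


uris = [
    "http://bio2rdf.org/drugbank_vocabulary:Molecular-Weight",
    "http://www.biopax.org/release/biopax-level3.owl#molecularWeight",
    "http://mo-ld.org/mine_vocabulary:hasMolecularWeight",
    "http://bio2rdf.org/kegg_vocabulary:mol_weight",
]
labels = [parsed_label(u) for u in uris]
print(labels)
```

Four URIs for the same attribute still produce three distinct labels, which is why label matching alone is not enough to align the sources.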

Heterogeneity in the LSLOD Cloud

•  Inconsistent graph patterns for SPARQL queries

Model Mismatch: Different graph patterns to capture granularity

[Figure: two encodings of the same fact. (clinical features) Gleevec --drug-target--> PDGFR. (biological features) Gleevec --target--> intermediate node carrying name (PDGFR), type (Inhibits) and source (PubMed:21152856).]

Graph Pattern
E1 <--drug-- gene-drug-Association --gene--> E2
E1 <--chemical-- Chemical-Gene-Association --gene--> E2

Heterogeneity in the LSLOD Cloud

•  Inconsistent attribute values for entities
•  Incomplete entities
•  Incomplete relations between entities
•  Inconsistent URI labels for classes, relations and attributes
•  Inconsistent graph patterns for SPARQL queries

And many other problems… Maulik R. Kamdar, et al. (manuscript in preparation)

Query federation over the LSLOD cloud

Data Warehousing: Transforming data under one uniform schema and uniform notations

WAREHOUSING

Open PHACTS. Williams, 2012. Data Graphs

✓  Efficient query execution
✓  Complete results
✗  Data copies
✗  Inflexible, not scalable

Query Federation: Rewriting and executing queries across different sources

QUERY FEDERATION

What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?

Drug
•  molecular-weight < 1000
•  target
•  process = “GO:0007165”
•  half-life

Drug
•  molecular-weight < 1000
•  target
•  half-life

Drug
•  molecular-weight < 1000
•  target
•  process = “GO:0007165”

Schwarte, et al. ISWC 2012

Label mismatch makes simplistic query federation difficult…

Gleevec, molecular_weight: 493.61 (clinical features)
Gleevec, mol_weight: 589.25 (biological features)

Mapping source schemas to an ontology

Callahan, et al. Journal of Biomedical Semantics 2013

Q: Chemicals that participate in the same processes as Saccharomyces Proteins

[Figure: two sources mapped to the Semantic Science Integrated Ontology, in which Chemical Entity, Protein and Process are linked by isParticipantIn. Saccharomyces Genome Database: U2AF1 (Protein) --hasAnnotation--> RNA Splicing (GO_Code). Comparative Toxicogenomics Database: Vinclozolin (Chemical) --participates--> RNA Splicing (Biological Process).]

Using ontology mapping rules for query rewriting

What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?

Query:
?s a <Drug>
?s <hasMolWt> ?mw
?s <hasTarget> ?protein
?s <hasHalfLife> ?hl
?mw < 1000
?protein <hasGO> <GO:0007165>

Rewrite Query (Query Rewriting):
?s a <Drug> {?s <molecular-weight> ?mw} ?s <drug-target> ?protein {?s <half-life> ?hl} ?mw < 1000
?s a <Drug> ?s <mol_wt> ?mw {?s <target> ?protein} ?protein <hasGO> <GO:0007165>

Mapping Rules:
?Drug hasTarget ?Protein
   ?Drug DrugBank:drug-target ?Protein
   ?Drug KEGG:target ?Protein
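The rewriting step can be sketched as a dictionary lookup per source. The predicate names follow the slide; the data layout (`MODEL_QUERY`, `MAPPING_RULES`) and the `rewrite` helper are assumptions made for illustration, and filters are left out.

```python
# Model-level query as (subject, predicate, object) patterns.
MODEL_QUERY = [
    ("?s", "hasMolWt", "?mw"),
    ("?s", "hasTarget", "?protein"),
    ("?s", "hasHalfLife", "?hl"),
    ("?protein", "hasGO", "GO:0007165"),
]

# One-to-one mapping rules: model predicate -> source predicate.
MAPPING_RULES = {
    "DrugBank": {
        "hasMolWt": "molecular-weight",
        "hasTarget": "drug-target",
        "hasHalfLife": "half-life",
    },
    "KEGG": {
        "hasMolWt": "mol_wt",
        "hasTarget": "target",
        "hasGO": "hasGO",
    },
}


def rewrite(query, rules):
    """Keep only the patterns a source can answer, in its own vocabulary."""
    return [(s, rules[p], o) for s, p, o in query if p in rules]


for source, rules in MAPPING_RULES.items():
    print(source, rewrite(MODEL_QUERY, rules))
```

Each source receives only the sub-query it can answer; results are then joined on the shared variables such as `?s` and `?protein`.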

This does not solve the model mismatch problem…

[Figure: (clinical features) Gleevec --drug-target--> PDGFR. (biological features) Gleevec --target--> intermediate node carrying name (PDGFR), type (Inhibits) and source (PubMed:21152856). Panel label: Query Results.]

Mapping Rules:
?Drug hasTarget ?Protein
   ?Drug DrugBank:drug-target ?Protein
   ?Drug KEGG:target ?Protein

Proposal: Mapping graph patterns to a model

[Figure: LSLOD cloud source schemas (edge labels: target, name, drug-target, mol_weight, molecularWeight, value) are grouped into Graph Patterns, which are mapped to a Model.]

Proposal: Using graph patterns for query rewriting

What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?

Query:
?s a <Drug>
?s <hasMolWt> ?mw
?s <hasTarget> ?protein
?s <hasHalfLife> ?hl
?mw < 1000 g/mol
?protein <hasGO> <GO:0007165>

Rewrite Query (Query Rewriting):
?s a <Drug> {?s <molecular-weight> ?mw} ?s <drug-target> ?protein {?s <half-life> ?hl} ?mw < 1000 g/mol
?s a <Drug> ?s <mol_wt> ?mw {?s <target> ?protein_blank ?protein_blank <link> ?protein} ?protein <hasGO> <GO:0007165>

Mapping Rules:
?Drug hasTarget ?Protein
   ?Drug DrugBank:drug-target ?Protein
   ?Drug KEGG:target ?blank KEGG:link ?Protein
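With graph patterns, a single model predicate may expand into a chain of source predicates joined by fresh intermediate variables. The sketch below handles forward paths only; the rule layout, the `?_b` variable naming and the helpers are assumptions, not the talk's implementation.

```python
from itertools import count

MODEL_QUERY = [
    ("?s", "hasMolWt", "?mw"),
    ("?s", "hasTarget", "?protein"),
    ("?s", "hasHalfLife", "?hl"),
    ("?protein", "hasGO", "GO:0007165"),
]

# Path rules: model predicate -> list of source predicates to traverse.
KEGG_RULES = {
    "hasMolWt": ["mol_wt"],
    "hasTarget": ["target", "link"],   # goes through an intermediate node
    "hasGO": ["hasGO"],
}


def expand(pattern, path, fresh):
    """Turn one model pattern into a chain of source patterns."""
    s, _, o = pattern
    nodes = [s] + [f"?_b{next(fresh)}" for _ in path[1:]] + [o]
    return [(nodes[i], pred, nodes[i + 1]) for i, pred in enumerate(path)]


def rewrite(query, rules):
    """Rewrite every pattern that has a rule; drop the ones that do not."""
    fresh = count(1)
    out = []
    for pattern in query:
        path = rules.get(pattern[1])
        if path:
            out.extend(expand(pattern, path, fresh))
    return out


print(rewrite(MODEL_QUERY, KEGG_RULES))
```

The two-step `target` then `link` chain mirrors the `?protein_blank` variable in the rewritten KEGG query above.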

Application: Mechanism-based pharmacovigilance

[Figure: pipeline. Queries, a Data Model and Mapping Rules feed Query Federation over the Life Sciences Linked Open Data Cloud.]

PhLeGrA – Linked Graph Analytics in Pharmacology

Phlegra is a spider genus of the Salticidae family, commonly termed jumping spiders.

Input data model: Underlying mechanisms behind drug-adverse reaction associations

[Figure in two steps. Step 1 labels: Drug 1, Enzyme, Drug 1 (Inactive State). Step 2 labels: Drug 2 (Targets Enzyme), Drug 1 (Inactive State), Drug 1 (Increased Toxicity).]

J. Jia et al. Nature Reviews Drug Discovery, 2009.

Input data model

Concept
E1 | Drug
E2 | Protein
E3 | Pathway
E4 | Adverse Drug Reaction

Relation
R1 | Drug hasTarget Protein
R2 | Drug hasEnzyme Protein
R3 | Drug hasTransporter Protein
R4 | Protein isPresentIn Pathway
R5 | Pathway isImplicatedIn ADR

Input mapping rules: Graph patterns mapped to Drug hasTarget Protein

Graph Pattern
E1 <--drug-- Target-Relation --target--> E2
E1 <--drug-- gene-drug-Association --gene--> E2
E1 --target--> _:blank --link--> E2
E1 <--chemical-- Chemical-Gene-Association --gene--> E2
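Two of these pattern shapes can be evaluated over plain triples as follows. The example triples, node names and both helper functions are hypothetical; the point is only that differently shaped source data yields the same Drug hasTarget Protein pair.

```python
from collections import defaultdict


def via_association(triples, e1_pred, e2_pred):
    """Pattern: E1 <--e1_pred-- association node --e2_pred--> E2."""
    e1_of, e2_of = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == e1_pred:
            e1_of[s].add(o)
        if p == e2_pred:
            e2_of[s].add(o)
    return {(e1, e2) for a in e1_of for e1 in e1_of[a] for e2 in e2_of.get(a, ())}


def via_blank_node(triples, first_pred, second_pred):
    """Pattern: E1 --first_pred--> intermediate --second_pred--> E2."""
    ends = defaultdict(set)
    for s, p, o in triples:
        if p == second_pred:
            ends[s].add(o)
    return {(s, e2) for s, p, o in triples if p == first_pred for e2 in ends.get(o, ())}


# Hypothetical data from two sources with different shapes.
source_a = [
    ("assoc:1", "drug", "drug:Gleevec"),
    ("assoc:1", "target", "protein:PDGFR"),
]
source_b = [
    ("drug:Gleevec", "target", "_:b1"),
    ("_:b1", "link", "protein:PDGFR"),
    ("_:b1", "type", "Inhibits"),
]

has_target = via_association(source_a, "drug", "target") | via_blank_node(source_b, "target", "link")
print(has_target)
```

Taking the union of the pairs from all patterns gives the model-level relation R1 regardless of how each source chose to encode it.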

[Figure: example of the blank-node pattern. Gleevec --target--> intermediate node --link--> PDGFR; the intermediate node carries type (Inhibits) and source (PubMed:21152856).]

[Figure: pipeline. Queries, a Data Model (Drug, Protein, Pathway, Adverse Reaction) and Mapping Rules feed Query Federation over the Life Sciences Linked Open Data Cloud.]

k-partite network can be generated as output

Maulik R. Kamdar and Mark A. Musen. PhLeGrA: Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data Cloud. International Conference on World Wide Web (WWW) (2017)

Entities and relations from 4 different sources are retrieved to create the k-partite network

This k-partite network is generated in < 1 day
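A toy version of the k-partite network, with layers Drug, Protein, Pathway and Adverse Reaction, can be held in an adjacency dictionary. All node names are invented, and the `mechanisms` helper is only a sketch of how candidate mechanism paths might be enumerated.

```python
# Layered network: edges only go from one layer to the next.
adj = {
    "drug:D1": ["protein:P1", "protein:P2"],
    "protein:P1": ["pathway:W1"],
    "protein:P2": ["pathway:W1", "pathway:W2"],
    "pathway:W1": ["adr:A1"],
    "pathway:W2": [],            # pathway with no linked adverse reaction
}


def mechanisms(network, drug):
    """List every path from `drug` to an adverse-reaction node."""
    found, stack = [], [[drug]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node.startswith("adr:"):
            found.append(path)
            continue
        for nxt in network.get(node, []):
            stack.append(path + [nxt])
    return sorted(found)


for path in mechanisms(adj, "drug:D1"):
    print(" -> ".join(path))
```

Each returned path is one candidate mechanism linking the drug to the adverse reaction.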

Visualization of the network for selected drugs and ADRs

http://onto-apps.stanford.edu/phlegra

[Figure: pipeline extended with a Graph Analytics Module. Queries, a Data Model (Drug, Protein, Pathway, Adverse Reaction) and Mapping Rules feed Query Federation over the Life Sciences Linked Open Data Cloud.]

A graph analytics module to rank the mechanisms

Implementing network-based apriori algorithm

•  Inputs – Outcomes database:
   –  US FDA Adverse Event Reporting System (FAERS): 2013-2015
   –  3 million case reports with Drugs, Adverse Reactions, Indications, Doses etc.
•  Association: {Drug}n --> ADR
   –  Filtering nodes and paths based on the Support statistic.
   –  Predicting if an association exists based on the Network-based Relative Reporting Ratio statistic.
   –  Ranking underlying mechanisms based on the Confidence statistic.

Harpaz, et al. 2010, Inokuchi, et al. 2000
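The three statistics can be illustrated on a handful of made-up case reports. The sketch uses the textbook definitions (support as a report count, confidence as a conditional frequency, relative reporting ratio as observed over expected co-reports); the network-based variant used in the talk is not specified here, so this is an approximation of the idea rather than the method itself.

```python
# Each report: (set of drugs, set of adverse reactions). Data is invented.
reports = [
    ({"A", "B"}, {"x"}),
    ({"A", "B"}, {"x"}),
    ({"A"}, {"y"}),
    ({"C"}, {"x"}),
    ({"C"}, {"y"}),
    ({"C"}, {"y"}),
    ({"A", "B"}, {"y"}),
    ({"B"}, {"y"}),
]


def support(reports, drugs=(), adr=None):
    """Number of reports containing all `drugs` and, if given, `adr`."""
    drugs = set(drugs)
    return sum(1 for d, a in reports if drugs <= d and (adr is None or adr in a))


def confidence(reports, drugs, adr):
    """Fraction of reports with `drugs` that also mention `adr`."""
    return support(reports, drugs, adr) / support(reports, drugs)


def rrr(reports, drugs, adr):
    """Observed co-reports divided by the count expected under independence."""
    expected = support(reports, drugs) * support(reports, adr=adr) / len(reports)
    return support(reports, drugs, adr) / expected


print(support(reports, {"A", "B"}, "x"),
      confidence(reports, {"A", "B"}, "x"),
      rrr(reports, {"A", "B"}, "x"))
```

A ratio above 1 means the drug combination and the reaction are reported together more often than chance would suggest. The sketch does not guard against drug sets that never occur, which would divide by zero.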

Validation of the approach

•  “Silver” standard validation sets:
   –  Observational Medical Outcomes Partnership (OMOP) dataset
   –  Exploring and Understanding Adverse Drug Reactions (EU-ADR) dataset
   –  Drugs.com and MediSpan Drug-drug interactions dataset (Iyer, et al. 2014)
•  Baseline Methods:
   –  Bayesian Confidence Propagation Neural Network (BCPNN)
   –  Gamma Poisson Shrinkage (GPS)

Dataset | Unique Drugs | Unique ADRs | Positive Assoc. | Negative Assoc.
OMOP | 155 | 4 | 137 | 158
EU-ADR | 59 | 9 | 44 | 39
Iyer, et al. | 252 | 9 | 315 | 288

Preliminary results show comparable performance

Dataset | BCPNN | GPS | Network-based RRR
OMOP | 0.70 | 0.70 | 0.72
EU-ADR | 0.75 | 0.76 | 0.78
Iyer, et al. | 0.81 | 0.83 | 0.82

Maulik R. Kamdar and Mark A. Musen. Mechanism-based Pharmacovigilance over the Life Sciences Linked Open Data Cloud. American Medical Informatics Association (AMIA) Annual Symposium (2017)

Event-specific thresholds can generate better AUROC statistics for some adverse reactions
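For reference, an AUROC statistic can be computed directly from scores and labels as the fraction of positive-negative pairs that are ranked correctly. The scores and labels below are invented and the `auroc` helper is a generic sketch, not the evaluation code behind the table.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical signal scores for four drug-ADR pairs; 1 marks a known association.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auroc(scores, labels))
```

Since the statistic depends only on ranking, it can be evaluated separately per adverse event.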

The story so far…

•  Comparable performance with existing baseline methods used to detect signals in US FAERS datasets for pharmacovigilance.
•  Event-specific thresholds can lead to an AUROC statistic > 0.75 for more than 146 adverse reactions.
•  Mechanism-based pharmacovigilance with confidence statistics for underlying mechanisms.

Plans for dissertation – automation and evaluation

•  I will perform a comparative evaluation of my pattern-based federation method with existing methods (FedX, SPLENDID)
   –  Query complexity, query execution time, completeness.
•  I will combine my past research on question-answering over the LSLOD cloud with the updated query federation method.

ReVeaLD: Real-time Visual Explorer and Aggregator of Linked Data

List molecular characteristics of the antineoplastic drugs that target EGFR and have Mol. Wt < 300 g/mol.

Maulik R. Kamdar, et al. ReVeaLD: A user-driven, domain-specific interactive search platform for biomedical research. Journal of Biomedical Informatics (2014)

https://www.youtube.com/watch?v=6HHK4ASIkJM

Plans for dissertation

•  I will evaluate an approach to automate the generation of the mapping rules used by the query federation method.

Semi-automating generation of mapping rules

Approach to map graph patterns:
•  Word Embeddings
•  Graph Levenshtein Distance
•  Instance-level Similarities

What I queried so far:
•  40+ Linked Data Sources
•  10,000+ classes, object and data properties
•  30,000+ edges
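Of the three signals, instance-level similarity is the easiest to sketch. The slide does not name a measure, so Jaccard overlap is assumed here, and the entity pairs are invented: two graph patterns that reach largely the same (drug, protein) pairs are candidates for the same mapping rule.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical (drug, protein) pairs reached by two candidate graph patterns.
pattern_a = {("D1", "P1"), ("D1", "P2"), ("D2", "P3")}
pattern_b = {("D1", "P2"), ("D2", "P3"), ("D3", "P4")}

score = jaccard(pattern_a, pattern_b)  # 2 shared pairs out of 4 distinct pairs
print(score)
```

In practice the pairs would first need a shared identifier scheme, since the sources use different URI notations.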

Plans for dissertation

•  I plan to get domain-specific feedback from the PharmGKB team after the updated applications are deployed online.

Acknowledgments

Musen Lab
-  Tania Tudorache
-  Csongor Nyulas
-  Matthew Horridge
-  Simon Walk
-  Rafael Gonçalves
-  Josef Hardi
-  Marcos Martinez
-  Martin O’Connor
-  John Graybeal
-  Alex Screnchuk
And others…

BMI Students

Mark Musen, Russ Altman, Jure Leskovec, Michel Dumontier, Teri Klein, Rainer Winnenberg, Juan Banda, Erik Van Mulligen, Amrapali Zaveri, Stefan Decker, Mary Jeanne Oliva, Joan Menees, Ayla Akgul, Steve Bagley

[email protected]

www.onto-apps.stanford.edu
www.stanford.edu/~maulikrk