linked data as a starting point for knowledge discovery 1 michel dumontier, ph.d. associate...

Download Linked Data as a Starting Point for Knowledge Discovery 1 Michel Dumontier, Ph.D. Associate Professor of Medicine (Biomedical Informatics) Stanford University

If you can't read please download the document

Upload: tara-buddin

Post on 14-Dec-2015

215 views

Category:

Documents


2 download

TRANSCRIPT

  • Slide 1

Linked Data as a Starting Point for Knowledge Discovery 1 Michel Dumontier, Ph.D. Associate Professor of Medicine (Biomedical Informatics) Stanford University @micheldumontier::CSWR:Jan-2015 Slide 2 Scientific knowledge is growing at an incredible rate (but hard to use to directly answer questions) 2 @micheldumontier::CSWR:Jan-2015 Slide 3 Thousands of biological databases curate the literature into consumable facts (problems: access, format, identifiers & linking) 3 @micheldumontier::CSWR:Jan-2015 Slide 4 Specialized software is also needed to analyze, predict and evaluate (problems: OS, versioning, input/output formats) 4 @micheldumontier::CSWR:Jan-2015 Slide 5 We develop fairly sophisticated programs/workflows to uncover knowledge and test hypotheses 5 @micheldumontier::CSWR:Jan-2015 This currently requires substantial knowledge of the domain, and of tools and services available. Slide 6 Wouldnt it be great if we could just find the evidence required to support or dispute a scientific hypothesis using the most up-to-date and relevant data, tools and scientific knowledge? 6 @micheldumontier::CSWR:Jan-2015 Slide 7 So what do we need to achieve this? 1.Standards to construct and interrogate a massive, decentralized network of interconnected data and software 2.Methods and Tools To prepare, interlink, and query data To mine and discover associations To identify novel, promising, or well-supported associations 3.Uptake Publications, journals, funding agencies, institutions, societies, conferences, and workshops @micheldumontier::CSWR:Jan-2015 7 Slide 8 The Semantic Web is the new global web of knowledge 8 @micheldumontier::CSWR:Jan-2015 standards for publishing, sharing and querying facts, expert knowledge and services scalable approach for the discovery of independently formulated and distributed knowledge Slide 9 We are building a massive network of linked data 9 @micheldumontier::CSWR:Jan-2015 Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/" Slide 10 @micheldumontier::CSWR:Jan-2015 Linked Data for the Life Sciences 10 Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF. chemicals/drugs/formulations, genomes/genes/proteins, domains Interactions, complexes & pathways animal models and phenotypes Disease, genetic markers, treatments Terminologies & publications Release 3 (June 2014): 11B+ interlinked statements from 35 biomedical datasets dataset description, provenance & statistics Partnerships with EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers Slide 11 Resource Description Framework A knowledge representation language that is good for Describing knowledge in terms of types, attributes, relations Integrating data from different sources Answering questions using a standard query language (SPARQL) Publishing and linking to other data on the web Key is to reuse what is available, develop what you need, and contribute your data to the network. @micheldumontier::CSWR:Jan-2015 11 Slide 12 @micheldumontier::CSWR:Jan-2015 12 The URI is a name for a concept, relation or individual Bio2RDF Basics the assertion It contains a base URI, a namespace, delimiter, and a resource identifier The following is the URI for Diclofenac, a drug in the DrugBank dataset For convenience, we will use the CURIE as a shorthand notation for the base URI + namespace drugbank:DB00586 PREFIX drugbank: The namespace is drawn from a registry of 2100 datasets. We also create two additional supporting namespaces: vocabulary and resource The vocabulary namespace is used for types and relations The resource namespace is used for generated data identifiers Slide 13 @micheldumontier::CSWR:Jan-2015 dct:title "Diclofenac" 13 Bio2RDF Basics the assertion We can annotate the URI with a human readable label using the title annotation property from Dublin Core Metadata Initiatives terminology (prefix - dct) drugbank:DB00586 We have our first statement! Slide 14 @micheldumontier::CSWR:Jan-2015 14 Our first assertion can be serialized into a variety of formats: RDF Basics serialization Diclofenac@en. N-Triples Turtle RDF/XML PREFIX drugbank:. PREFIX dct:. drugbank:DB00586 dct:title "Diclofenac" @en. diclofenac Slide 15 @micheldumontier::CSWR:Jan-2015 drugbank:DB00586 drugbank_vocabulary:Drug rdf:type dct:title "diclofenac" 15 Bio2RDF Basics typing and labeling dct:title "Drug" All data items should be typed to a more general class of objects, and labeled for human consumption. PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:. Slide 16 @micheldumontier::CSWR:Jan-2015 drugbank:290 drugbank_vocabulary:targets "Prostaglandin G/H synthase 2" dct:title 16 Bio2RDF Basics use object properties to relate data items drugbank:DB00586 dct:title "diclofenac" PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:. Slide 17 @micheldumontier::CSWR:Jan-2015 drugbank_resource:DB00586_DB00091 17 Bio2RDF Basics convert n-ary relations into objects drugbank:DB00586 Diclofenac is involved in a drug-drug interaction (monitor for nephrotoxicity) with Cyclosporine, but there is no identifier for the interaction. drugbank_vocabulary:ddi-interactor-in rdf:type PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:. drugbank:DB00091 Drugbank_vocabulary:Drug-Drug-Interaction Slide 18 @micheldumontier::CSWR:Jan-2015 drugbank_resource:14523e2498086b9f99333997452e7119 "Patheon Inc." dct:title 18 Bio2RDF Basics transform names into labeled resources drugbank:DB00586 Patheon Inc. is a packager for diclofenac drugbank_vocabulary:packager rdf:type drugbank_vocabulary:Packager "Packager" dct:title PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:. Slide 19 @micheldumontier::CSWR:Jan-2015 rdf:type drugbank_vocabulary:Target 19 Bio2RDF Basics Building the knowledge graph drugbank:DB00586 drugbank_vocabulary:Drug rdf:type dct:title "diclofenac" dct:title "Drug" drugbank:290 drugbank_vocabulary :targets "Prostaglandin G/H synthase 2" dct:title "Target" drugbank_resource: 14523e2498086b9f99333997452e7119 "Patheon Inc." dct:title rdf:type drugbank_vocabulary:Packager drugbank_vocabulary :packager dct:title Packager" Slide 20 Linking Data case 1: using source provided links @micheldumontier::CSWR:Jan-2015 drugbank:DB00586 pharmgkb_vocabulary:Drug rdf:type dct:title diclofenac pharmgkb:PA449293 drugbank_vocabulary:Drug pharmgkb_vocabulary:x-drugbank diclofenac dct:title DrugBank PharmGKB 20 Slide 21 Linking Data case 2: lexical matching @micheldumontier::CSWR:Jan-2015 21 LIMES (Soren Auer) LInk discovery framework for MEtric Spaces http://aksw.org/Projects/limes SILK (Chris Bizer) Link Discovery Framework for the Web of Data http://www4.wiwiss.fu-berlin.de/bizer/silk/ Slide 22 We are building a highly connected network of data @micheldumontier::CSWR:Jan-2015 22 Slide 23 @micheldumontier::CSWR:Jan-2015 drugbank_vocabulary:Target 23 Data-driven schema generation drugbank_vocabulary:Drug drugbank_vocabulary :drug drugbank_vocabulary :packager drugbank_vocabulary:Drug-Drug-Interaction drugbank_vocabulary :targets drugbank_vocabulary:Packager Slide 24 @micheldumontier::CSWR:Jan-2015 24 Slide 25 @micheldumontier::CSWR:Jan-2015 25 Slide 26 @micheldumontier::CSWR:Jan-2015 26 Slide 27 Graph summarization for query formulation @micheldumontier::CSWR:Jan-2015 PREFIX drugbank_vocabulary: PREFIX rdfs: SELECT ?ddi ?d1name ?d2name WHERE { ?ddi a drugbank_vocabulary:Drug-Drug-Interaction. ?d1 drugbank_vocabulary:ddi-interactor-in ?ddi. ?d1 rdfs:label ?d1name. ?d2 drugbank_vocabulary:ddi-interactor-in ?ddi. ?d2 rdfs:label ?d2name. FILTER (?d1 != ?d2) } 27 Slide 28 You can use query assistants @micheldumontier::CSWR:Jan-2015 http://sindicetech.com/sindice-suite/sparqled/ graph: http://sindicetech.com/analytics 28 Slide 29 Federated Queries over Independent SPARQL EndPoints Get all protein catabolic processes (and more specific) in biomodels query against SELECT ?go ?label count(distinct ?x) WHERE { ?go rdfs:label ?label. ?go rdfs:subClassOf ?tgo ?tgo rdfs:label ?tlabel. FILTER regex(?tlabel, "^protein catabolic process") service { ?x ?go. ?x a. } @micheldumontier::CSWR:Jan-2015 29 Slide 30 Despite all the data, its still hard to find answers to questions Because there are many ways to represent the same data and each dataset represents it differently @micheldumontier::CSWR:Jan-2015 30 Slide 31 Massive Proliferation of Ontologies / Vocabularies @micheldumontier::CSWR:Jan-2015 31 Slide 32 Multi-Stakeholder Efforts to Standardize Representations are Reasonable, Long Term Strategies for Data Integration @micheldumontier::CSWR:Jan-2015 32 Slide 33 @micheldumontier::CSWR:Jan-2015 33 http://tiny.cc/hcls-datadesc-ed Slide 34 Three Component Metadata Model: description version - distribution @micheldumontier::CSWR:Jan-2015 34 Slide 35 61 metadata elements Core Identifiers Title Description Attribution Homepage License Language Keywords Concepts and vocabularies used Standards Publication Provenance and Change Version number Source Provenance: retrieved from, derived from, created with Frequency of change Availability Format Download URL Landing page SPARQL endpoint 13 Content Statistics With SPARQL queries @micheldumontier::CSWR:Jan-2015 35 Slide 36 multiple formalizations of the same kind of data do emerge, each with their own merit @micheldumontier::CSWR:Jan-2015 36 Slide 37 uniprot:P05067 uniprot:Protein is a sio:gene is a Semantic data integration, consistency checking and query answering over Bio2RDF with the Semanticscience Integrated Ontology (SIO) dataset ontology Knowledge Base @micheldumontier::CSWR:Jan-2015 pharmgkb:PA30917 refseq:Protein is a omim:189931 omim:Genepharmgkb:Gene Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, Jos Cruz-Toledo and Michel Dumontier. Bio-ontologies 2012. 37 Slide 38 @micheldumontier::CSWR:Jan-2015 38 SRIQ(D) 10700+ axioms 1300+ classes 201 object properties (inc. inverses) 1 datatype property Slide 39 Bio2RDF and SIO powered SPARQL 1.1 federated query: Find chemicals (from CTD) and proteins (from SGD) that participate in the same process (from GOA) SELECT ?chem, ?prot, ?proc FROM WHERE { SERVICE { ?chemical a sio:chemical-entity. ?chemical rdfs:label ?chem. ?chemical sio:is-participant-in ?process. ?process rdfs:label ?proc. FILTER regex (?process, "http://bio2rdf.org/go:") } SERVICE { ?protein a sio:protein. ?protein sio:is-participant-in ?process. ?protein rdfs:label ?prot. } @micheldumontier::CSWR:Jan-2015 39 Slide 40 tactical formalization @micheldumontier::CSWR:Jan-2015 40 Take what you need and represent it in a way that directly serves your objective STANDARDS USER DRIVEN REPRESENTATION identifying aberrant and pharmacological pathways predicting drug targets using organism phenotypes Biopax-pathway exploration FALDO-powered genome navigation Slide 41 aberrant and pharmacological pathways Q1. Can we identify pathways that are associated with a particular disease or class of diseases? Q2. Can we identify pathways are associated with a particular drug or class of drugs? drug pathway disease @micheldumontier::CSWR:Jan-2015 41 gene Slide 42 Identification of drug and disease enriched pathways Approach Integrate 3 datasets DrugBank, PharmGKB and CTD Integrate 7 terminologies MeSH, ATC, ChEBI, UMLS, SNOMED, ICD, DO Formalize data of interest using a pre-defined pattern Identify significant associations using enrichment analysis over the fully inferred knowledge base Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012. @micheldumontier::CSWR:Jan-2015 42 Slide 43 Formal knowledge representation as a strategy for data integration @micheldumontier::CSWR:Jan-2015 43 Slide 44 Have you heard of OWL? @micheldumontier::CSWR:Jan-2015 44 Slide 45 mercaptopurine [pharmgkb:PA450379] mercaptopurine [drugbank:DB01033] purine-6-thiol [CHEBI:2208] Class Equivalence mercaptopurine [ATC:L01BB02] Top Level Classes (disjointness) drugdiseasegenepathway property chains Class subsumption mercaptopurine [mesh:D015122] Reciprocal Existentials drug disease pathway gene Formalized as an OWL-EL ontology 650,000+ classes, 3.2M subClassOf axioms, 75,000 equivalentClass axioms @micheldumontier::CSWR:Jan-2015 45 Slide 46 Benefits: Enhanced Query Capability Use any mapped terminology to query a target resource. Use knowledge in target ontologies to formulate more precise questions ask for drugs that are associated with diseases of the joint: Chikungunya (do:0050012) is defined as a viral infectious disease located in the joint (fma:7490) and caused by a Chikungunya virus (taxon:37124). Learn relationships that are inferred by automated reasoning. alcohol (ChEBI:30879) is associated with alcoholism (PA443309) since alcoholism is directly associated with ethanol (CHEBI:16236) parasitic infectious disease (do:0001398) associated with 129 drugs, 15 more than are directly linked. @micheldumontier::CSWR:Jan-2015 46 Slide 47 Knowledge Discovery through Data Integration and Enrichment Analysis OntoFunc: Tool to discover significant associations between sets of objects and ontology categories. Enrichment of attribute among a selected set of input items as compared to a reference set. hypergeometric or the binomial distribution, Fisher's exact test, or a chi-square test. We found 22,653 disease-pathway associations, where for each pathway we find genes that are linked to disease. Mood disorder (do:3324) associated with Zidovudine Pathway (pharmgkb:PA165859361). Zidovudine is for treating HIV/AIDS. Side effects include fatigue, headache, myalgia, malaise and anorexia We found 13,826 pathway-chemical associations Clopidogrel (chebi:37941) associated with Endothelin signaling pathway (pharmgkb:PA164728163). Endothelins are proteins that constrict blood vessels and raise blood pressure. Clopidogrel inhibits platelet aggregation and prolongs bleeding time. @micheldumontier::CSWR:Jan-2015 47 Slide 48 PhenomeDrug A computational approach to predict drug targets, drug effects, and drug indications using phenotypes @micheldumontier::CSWR:Jan-2015 48 Mouse model phenotypes provide information about human drug targets. Hoehndorf R, Hiebert T, Hardy NW, Schofield PN, Gkoutos GV, Dumontier M. Bioinformatics. 2013. Slide 49 animal models provide insight for on target effects In the majority of 100 best selling drugs ($148B in US alone), there is a direct correlation between knockout phenotype and drug effect Immunological Indications Anti-histamines (Claritin, Allegra, Zyrtec) KO of histamine H 1 receptor leads to decreased responsiveness of immune system Predicts on target effects : drowsiness, reduced anxiety @micheldumontier::CSWR:Jan-2015 49 Zambrowicz and Sands. Nat Rev Drug Disc. 2003. Slide 50 Identifying drug targets from mouse knock-out phenotypes @micheldumontier::CSWR:Jan-2015 50 drug gene phenotypes effects human gene non-functional gene model ortholog similar inhibits Main idea: if a drugs phenotypes matches the phenotypes of a null model, this suggests that the drug is an inhibitor of the gene Slide 51 @micheldumontier::CSWR:Jan-2015 51 B6.Cg-Alms1 foz/fox /J increased weight, adipose tissue volume, glucose homeostasis altered ALSM1(NM_015120.4) [c.10775delC] + [-] GENOTYPE PHENOTYPE obesity, diabetes mellitus, insulin resistance increased food intake, hyperglycemia, insulin resistance kcnj11 c14/c14 ; insr t143/+ (AB) ? Slide 52 Terminological Interoperability (we must compare apples with apples) Mouse Phenotypes Drug effects (mappings from UMLS to DO, NBO, MP) Mammalian Phenotype Ontology PhenomeNet PhenomeDrug @micheldumontier::CSWR:Jan-2015 Slide 53 Semantic Similarity @micheldumontier::CSWR:Jan-2015 53 Given a drug effect profile D and a mouse model M, we compute the semantic similarity as an information weighted Jaccard metric. The similarity measure used is non-symmetrical and determines the amount of information about a drug effect profile D that is covered by a set of mouse model phenotypes M. Slide 54 Loss of function models predict targets of inhibitor drugs 14,682 drugs; 7,255 mouse genotypes Validation against known and predicted inhibitor-target pairs 0.76 ROC AUC for human targets (DrugBank) 0.81 ROC AUC for mouse targets (STITCH) diclofenac (STITCH:000003032) NSAID used to treat pain, osteoarthritis and rheumatoid arthritis Drug effects include liver inflammation (hepatitis), swelling of liver (hepatomegaly), redness of skin (erythema) 49% explained by PPARg knockout peroxisome proliferator activated receptor gamma (PPARg) regulates metabolism, proliferation, inflammation and differentiation, Diclofenac is a known inhibitor 46% explained by COX-2 knockout Diclofenac is a known inhibitor @micheldumontier::CSWR:Jan-2015 Slide 55 Using the Semantic Web to Gather Evidence for Scientific Hypotheses What evidence supports or disputes that TKIs are cardiotoxic? @micheldumontier::CSWR:Jan-2015 55 Slide 56 Tyrosine Kinase Inhibitors (TKI) Imatinib, Sorafenib, Sunitinib, Dasatinib, Nilotinib, Lapatinib Used to treat cancer Linked to cardiotoxicity. FDA launched drug safety program to detect toxicity Need to integrate data and ontologies (Abernethy, CPT 2011) Abernethy (2013) suggest using public data in genetics, pharmacology, toxicology, systems biology, to predict/validate adverse events What evidence could we gather to give credence that TKIs causes non-QT cardiotoxicity? @micheldumontier::CSWR:Jan-2015 56 FDA Use Case: TKI non-QT Cardiotoxicity Slide 57 @micheldumontier::CSWR:Jan-2015 57 Jane P.F. Bai and Darrell R. Abernethy. Systems Pharmacology to Predict Drug Toxicity: Integration Across Levels of Biological Organization. Annu. Rev. Pharmacol. Toxicol. 2013.53:451-473 Slide 58 The goal of HyQue is retrieve and evaluate evidence that supports/disputes a hypothesis hypotheses are described as a set of events e.g. binding, inhibition, phenotypic effect events are associated with types of evidence a query is written to retrieve data a weight is assigned to provide significance Hypotheses are written by people who seek answers data retrieval rules are written by people who know the data and how it should be interpreted @micheldumontier::CSWR:Jan-2015 58 HyQue 1. HyQue: Evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3. 2. Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012. Slide 59 HyQue: A Semantic Web Application @micheldumontier::CSWR:Jan- 2015 59 Software OntologiesData HypothesisEvaluation Slide 60 What evidence might we gather? clinical: Are there cardiotoxic effects associated with the drug? Literature (studies) [curated db] Product labels (studies) [r3:sider] Clinical trials (studies) [r3:clinicaltrials] Adverse event reports [r2:pharmgkb/onesides] Electronic health records (observations) pre-clinical associations: genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase] in vitro assays (IC50) [r3:chembl] drug targets [r2:drugbank; r2:ctd; r3:stitch] drug-gene expression [r3:gxa] pathways [r2:kegg; r3:reactome] Drug-pathway, disease-pathway enrichments [aberrant pathways] Chemical properties [r2:pubchem; r2.drugbank] Toxicology [r1.toxkb/cebs] @micheldumontier::CSWR:Jan-2015 60 Slide 61 Data retrieval is done with SPARQL @micheldumontier::CSWR:Jan-2015 61 Slide 62 Data Evaluation is done with SPIN rules @micheldumontier::CSWR:Jan-2015 62 Slide 63 @micheldumontier::CSWR:Jan-2015 63 Slide 64 @micheldumontier::CSWR:Jan-2015 64 http://bio2rdf.org/drugbank:DB01268 Slide 65 @micheldumontier::CSWR:Jan-2015 65 Slide 66 @micheldumontier::CSWR:Jan-2015 66 Slide 67 @micheldumontier::CSWR:Jan-2015 67 Slide 68 In Summary This talk was about making sense out of the the structured data we already have RDF-based Linked Open Data acts as a substrate for query answering and task-based formalization in OWL Discovery through the generation of testable hypotheses in the target domain. Using Linked Data to evaluate scientific hypotheses @micheldumontier::CSWR:Jan-2015 68 Slide 69 Looking to the Future Community guidelines for RDF-based data and dataset descriptions (e.g. CEDAR) Alignment and consolidation of OWL ontologies (e.g. UMLS) Identifying and filling gaps in our knowledge (e.g. Adam the Robot scientist) Improving our coverage of available evidence (e.g. HyQue) More sophisticated data mining (e.g. you!) @micheldumontier::CSWR:Jan-2015 69 Slide 70 Acknowledgements Bio2RDF: Allison Callahan, Jose Cruz-Toledo, Peter Ansell W3C HCLS Dataset Descriptions: Alasdair Gray, M. Scott Marshall, Joachim Baran, and many others! Aberrant Pathways: Robert Hoehndorf, Georgios Gkoutos PhenomeDrug: Tanya Hiebert, Robert Hoehndorf, Georgios Gkoutos, Paul Schofield TKI Cardiotoxicity: Alison Callahan, Tania Hiebert, Beatriz Lujan, Sira Sarntivijai (FDA) @micheldumontier::CSWR:Jan-2015 70 Slide 71 dumontierlab.com [email protected] Website: http://dumontierlab.comhttp://dumontierlab.com Presentations: http://slideshare.com/micheldumontierhttp://slideshare.com/micheldumontier 71 @micheldumontier::CSWR:Jan-2015