Page 1: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Kincho H. Law, Siddharth Taduri, Gloria T. Lau
Engineering Informatics Lab at Stanford University

Page 2: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Motivation

PMID: 12897095

Regional variability in the incidence of end-stage renal disease: an epidemiological approach. … Regional variability in the incidence of end-stage renal disease (ESRD) in Austria is reported. Our aim was … low rates in the state of Tyrol. … ESRD incidence data were obtained from … Between 1995 and 1999, 4811 new cases of ESRD were recorded; the state of Tyrol (T) … incidence of ESRD patients with type 2 diabetes mellitus … the difference in the overall ESRD incidence … prevalence of DM, a highly significant correlation was found between ESRD incidence and DM. … variability in the ESRD incidence in Austria is explained mainly by regional differences in DM-2. Data from similar studies … allocation for ESRD …

Synonyms for ESRD

End Stage Kidney Disease
Renal Disease, End Stage
Renal Failure, End Stage
Kidney Disease, Chronic
Renal Failure, Chronic
End-Stage Kidney Disease
ESRD
Renal Disease, End-Stage
Renal Failure, End-Stage
Chronic Kidney Failure
Chronic Renal Failure

05/01/2012 Engineering Informatics Lab at Stanford University 2

Page 3: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Data Set and Knowledge

TREC 2007 Genomics Data Set
• Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine
• Metadata available through MEDLINE
• Tasks involve passage, document, and feature retrieval
• Methodologies are evaluated on their response to 36 topics (‘queries’)
• The topics are categorized based on 13 entity types (Proteins, Genes, etc.)

Domain Knowledge
• Over 250 biomedical ontologies from BioPortal

Page 4: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

XML Representation of Scientific Publications in PubMed

<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID>10022466</PMID>
    <DateCreated>
      <Year>1999</Year>
      <Month>02</Month>
      <Day>25</Day>
    </DateCreated>
    ….
    <Article PubModel="Print">
      <Journal>
        ….
        <JournalIssue CitedMedium="Print">
          <Volume>84</Volume>
          <Issue>2</Issue>
          ….
        </JournalIssue>
        <Title>The Journal of clinical endocrinology and metabolism</Title>
        <ISOAbbreviation>J. Clin. Endocrinol. Metab.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>About the use … of an ACTH 1-39 ….</ArticleTitle>
      ….
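As a rough illustration of how such a record could be read before indexing, the sketch below parses a closed-up copy of the excerpt with Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is an assumption: the elided parts of the slide are dropped and closing tags are added so that it is well-formed. The function name `extract_metadata` and the output field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed reduction of the excerpt on the slide
# (elided content removed, closing tags added).
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID>10022466</PMID>
    <DateCreated><Year>1999</Year><Month>02</Month><Day>25</Day></DateCreated>
    <Article PubModel="Print">
      <Journal>
        <JournalIssue CitedMedium="Print"><Volume>84</Volume><Issue>2</Issue></JournalIssue>
        <Title>The Journal of clinical endocrinology and metabolism</Title>
        <ISOAbbreviation>J. Clin. Endocrinol. Metab.</ISOAbbreviation>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def extract_metadata(xml_text):
    """Pull a few indexable fields out of one PubmedArticle record."""
    root = ET.fromstring(xml_text)
    citation = root.find("MedlineCitation")
    created = citation.find("DateCreated")
    journal = citation.find("Article/Journal")
    return {
        "pmid": citation.findtext("PMID"),
        "created": "-".join(created.findtext(tag) for tag in ("Year", "Month", "Day")),
        "journal": journal.findtext("Title"),
        "volume": journal.findtext("JournalIssue/Volume"),
        "issue": journal.findtext("JournalIssue/Issue"),
    }


record = extract_metadata(SAMPLE)
print(record["pmid"], record["created"], record["journal"])
```

A real MEDLINE file holds many such records and more fields than shown here, so a production reader would likely iterate over records rather than parse one string.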

Page 5: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Domain Knowledge Integration

(1) Annotating Documents prior to indexing
– Response time is fast
– Not flexible: the entire index has to be updated if a new ontology needs to be added
– Indexes can grow very large

(2) Query Expansion
– Response time is slower
– Very flexible: ontologies can be dynamically chosen
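The trade-off between the two options can be made concrete with a toy inverted index. This is only a sketch under stated assumptions: `SYNONYMS` and `DOCS` are invented miniature data, and `build_index`, `search_annotated`, and `search_expanded` are hypothetical helper names, not part of the system described on the slides.

```python
from collections import defaultdict

# Assumption: a toy one-concept "ontology" and three toy documents.
SYNONYMS = {"tumor": {"tumor", "neoplasm", "carcinoma"}}
DOCS = {
    1: "neoplasm observed in zebrafish",
    2: "zebrafish fin regeneration",
    3: "carcinoma in mice",
}


def build_index(docs, synonyms=None):
    """Build an inverted index. If synonyms are given, annotate at index
    time: each synonym occurrence is also posted under its concept."""
    term_to_concept = {}
    if synonyms:
        for concept, terms in synonyms.items():
            for term in terms:
                term_to_concept[term] = concept
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            if token in term_to_concept:
                index[term_to_concept[token]].add(doc_id)
    return index


def search_annotated(index, concept):
    """Option (1): a single lookup, but the index depends on the ontology."""
    return index.get(concept, set())


def search_expanded(index, concept, synonyms):
    """Option (2): one lookup per expansion term against an unannotated index."""
    hits = set()
    for term in synonyms.get(concept, {concept}):
        hits |= index.get(term, set())
    return hits


annotated = build_index(DOCS, SYNONYMS)
plain = build_index(DOCS)
print(search_annotated(annotated, "tumor"))
print(search_expanded(plain, "tumor", SYNONYMS))
```

Both searches return the same documents here. The difference is where the ontology is applied: switching to another ontology means rebuilding `annotated`, while `plain` can be reused with any synonym table at the cost of more lookups per query.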

Page 6: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Query Expansion

• The pre-processed query is automatically expanded using BioPortal’s API:
[Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia …}

Figure (MeSH excerpt):
Tumor (Synonyms: Cancer, Neoplasm, …)
– Leukemia (Synonyms: Leucocythaemias, Leucocythemia)
– Melanoma
– Adenocarcinoma
– Nerve Sheath Neo
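A minimal sketch of this kind of expansion is given below. It does not call BioPortal: `TOY_MESH` is a hypothetical in-memory stand-in populated only with the terms visible on the slide, and `expand_term` is an illustrative helper that collects a concept's synonyms and, recursively, its child concepts.

```python
# Assumption: a hand-written stand-in for an ontology lookup, limited to
# terms shown on the slide. It is not BioPortal's API or data model.
TOY_MESH = {
    "Tumor": {
        "synonyms": ["Cancer", "Neoplasm"],
        "children": ["Leukemia", "Melanoma", "Adenocarcinoma"],
    },
    "Leukemia": {
        "synonyms": ["Leucocythaemias", "Leucocythemia"],
        "children": [],
    },
}


def expand_term(term, ontology):
    """Return the term, its synonyms, and the expansions of its children.

    Terms missing from the ontology expand to themselves only.
    Assumes the hierarchy is acyclic.
    """
    entry = ontology.get(term)
    if entry is None:
        return [term]
    expanded = [term]
    for synonym in entry["synonyms"]:
        if synonym not in expanded:
            expanded.append(synonym)
    for child in entry["children"]:
        for item in expand_term(child, ontology):
            if item not in expanded:
                expanded.append(item)
    return expanded


print(expand_term("Tumor", TOY_MESH))
```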

Page 7: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Choosing Domain Knowledge

• The use of synonymy results in inconsistent performance (2007 TREC genomics track)

• Common reasons include:
– Relevant terms may not be classified as expected
– Some relevant terms may not be classified in a particular ontology
– Incomplete information (such as synonyms)

• Selection of the appropriate domain ontology is important

Page 8: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Enriching Existing Ontologies

• Existing ontologies can be enriched to complete some missing information

• Multiple ontologies can be used to provide different classifications

MeSH

NCI

Ontology: NDF
Concept: Pamidronate
Synonyms from NDF: APD, Amidronate, ...
Synonyms from MeSH: pamidronate calcium, pamidronate monosodium, aredia
Synonyms from NCI: Pamidronic acid, pamidronate disodium, …
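One way to picture the enrichment step is as a union of synonym lists across ontologies. The sketch below assumes each ontology is available as a plain dictionary from concept to synonyms. The `SOURCES` data contains only the synonyms visible on the slide (the slide's lists are themselves truncated), and `enrich` is a hypothetical helper name.

```python
# Assumption: each ontology is reduced to {concept: [synonyms]}, holding
# only the entries that appear on the slide.
SOURCES = {
    "NDF": {"Pamidronate": ["APD", "Amidronate"]},
    "MeSH": {"Pamidronate": ["pamidronate calcium", "pamidronate monosodium", "aredia"]},
    "NCI": {"Pamidronate": ["Pamidronic acid", "pamidronate disodium"]},
}


def enrich(concept, sources):
    """Merge the synonyms of one concept across several ontologies.

    Duplicates are dropped case-insensitively; the first spelling seen wins.
    """
    merged = []
    seen = set()
    for entries in sources.values():
        for synonym in entries.get(concept, []):
            key = synonym.casefold()
            if key not in seen:
                seen.add(key)
                merged.append(synonym)
    return merged


print(enrich("Pamidronate", SOURCES))
```

A union like this only adds spellings for a concept that already exists in every source under the same name, which matches the limitation noted later in the deck: terms that are missing or classified differently are not picked up.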

Page 9: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Evaluations

• Baseline
• With Query Expansion (Suggested Sources)
• Using Enriched Ontologies
• Multiple Query Expansions per query

Summary of Document MAP scores in 2007 TREC genomics track

Max: 0.3286
Min: 0.0329
Mean: 0.1862
Median: 0.1897
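For readers unfamiliar with the metric, the sketch below computes mean average precision (MAP) in its standard textbook form: average precision per topic, then the mean over topics. The function names and example rankings are illustrative, and the official TREC evaluation scripts may differ in detail.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant
    document appears, divided by the total number of relevant documents.
    Assumes ranked_ids contains no duplicates."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Illustrative data only: two topics with made-up rankings.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
print(round(mean_average_precision(runs), 4))
```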

Page 10: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Queries

Topic Number | Query | Suggested Sources for Terms (TREC) | Selected Domain Knowledge (Our Methodology)
205 | What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease? | Wikipedia | Symptom Ontology
206 | What [TOXICITIES] are associated with zoledronic acid? | Wikipedia + Aaron | NCI Thesaurus
207 | What [TOXICITIES] are associated with etidronate? | Wikipedia + Aaron | NCI Thesaurus
211 | What [ANTIBODIES] have been used to detect protein PSD-95? | MeSH | MeSH
229 | What [SIGNS OR SYMPTOMS] are caused by human parvovirus infection? | Wikipedia | Symptom Ontology
231 | What [TUMOR TYPES] are found in zebrafish? | Aaron | MeSH

Page 11: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Baseline

• Queries are used without modification, e.g.,
– “What [ANTIBODIES] have been used to detect protein PSD-95?”
– “What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?”

• Document MAP: 0.277

Page 12: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Query Expansion

• Original Query: What [TUMOR TYPES] are found in zebrafish?

• Queries are formulated in ‘AND’ clauses:
“[Tumor][MeSH] AND zebrafish”
=> (Tumor, Neoplasm, Carcinoma, Leukemia …) AND zebrafish

• Document MAP: 0.347
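The AND-of-alternatives form of the expanded query can be sketched as a simple Boolean filter: a document matches when every clause contributes at least one term. This is an illustration only. The slides do not say how the actual search engine matches or scores documents, `matches` and `retrieve` are hypothetical names, the documents are made up, and plain substring matching stands in for real tokenization.

```python
def matches(text, clauses):
    """True if every clause (a set of alternative terms) has at least one
    term occurring in the text. Uses case-insensitive substring matching."""
    lowered = text.lower()
    return all(
        any(term.lower() in lowered for term in clause)
        for clause in clauses
    )


def retrieve(docs, clauses):
    """Return the ids of all documents that satisfy every clause."""
    return [doc_id for doc_id, text in docs.items() if matches(text, clauses)]


# Illustrative documents; the clauses mirror the expanded query on the slide.
docs = {
    "a": "Neoplasm formation in zebrafish liver",
    "b": "Zebrafish heart development",
    "c": "Carcinoma in human tissue",
}
clauses = [{"Tumor", "Neoplasm", "Carcinoma", "Leukemia"}, {"zebrafish"}]
print(retrieve(docs, clauses))
```

Because each clause is just a set, the same function also covers queries in which several terms are expanded: each expanded term becomes one more set of alternatives.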

Page 13: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Multiple Query Expansion Terms

• Expansion can be performed on multiple terms in the query

• Example: Coronary Artery Disease => {Coronary heart disease, coronary disease, CAD, …}

[Tumor][MeSH] AND [zebrafish][MeSH] =>
(tumor, neoplasm, …) AND (zebrafish, danio rerio, …)

• Document MAP: 0.352

Page 14: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Enriched Ontology – Current Status

• Marginal improvement over basic enhanced models

• Document MAP: 0.352 (Marginal improvement from 0.347)

• Issues:
– Framework for enrichment based on synonymy is rigid, i.e., relevant terms that are entirely missing in the ontology are still not included
– Relevant terms that are classified differently are never included in the search

Page 15: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

IR Tool

• Expert knowledge is valuable

• Developed a search tool which automatically integrates with knowledge sources and searches documents

• We extend MINOE, a co-occurrence based visualization tool, originally designed for exploring marine ecosystems

• Users can browse (or search) documents through ontologies and visualize interactions between concepts

Page 16: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Snapshots of the Tool

Page 17: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

I. Enter Query Terms

II. Domain Knowledge Integration

III. Shows Expanded Query, and other filters that are added to the search

Page 18: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

TREC Topic 220

• Query: What [PROTEINS] are involved in the activation or recognition mechanism for PmrD?

• Domain Knowledge: MeSH

Depth of Hierarchical Expansion to Child Nodes | Level 1 | Level 2 | Level 3
Document MAP | 0.0 | 0.2 | 0.8
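Expanding a concept to a fixed depth of child nodes can be sketched as a level-by-level traversal. The hierarchy below is entirely hypothetical (placeholder names, not the real MeSH subtree for proteins), and reading "Level 1" as "direct children only" is an assumption, since the slide does not define the levels.

```python
# Assumption: a made-up parent -> children map with placeholder names.
TOY_CHILDREN = {
    "Proteins": ["Group A", "Group B"],
    "Group A": ["A1", "A2"],
    "A1": ["A1x"],
}


def expand_to_depth(root, children, max_depth):
    """Return the root plus every descendant at most max_depth levels
    below it, in breadth-first order."""
    found = [root]
    seen = {root}
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for concept in frontier:
            for child in children.get(concept, []):
                if child not in seen:
                    seen.add(child)
                    found.append(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return found


for depth in (1, 2, 3):
    print(depth, expand_to_depth("Proteins", TOY_CHILDREN, depth))
```

In this toy tree the term "A1x" only enters the expansion at depth 3, which illustrates how a relevant term deep in the hierarchy could be missed by a shallower expansion.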

Page 25: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Changed

Page 27: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

MeSH Descriptors

Page 31: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

(>1500 Documents)

(>1500 Documents)

Shows Association Between Concepts

Page 32: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

CHILD CONCEPTS

Stronger Association: ~270 Documents

Weaker Association: ~57 Documents
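The document counts attached to each association suggest a co-occurrence measure: two concepts are more strongly associated when more documents mention both. The sketch below counts such pairs. It is a guess at the general idea only. The slides do not describe how MINOE computes associations, and the data and the name `cooccurrence_counts` are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_concepts):
    """doc_concepts maps doc_id -> set of concepts found in that document.
    Returns a Counter mapping each sorted concept pair to the number of
    documents in which both concepts occur."""
    counts = Counter()
    for concepts in doc_concepts.values():
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts


# Illustrative data only.
doc_concepts = {
    1: {"A", "B"},
    2: {"A", "B", "C"},
    3: {"A", "C"},
    4: {"A", "B"},
}
counts = cooccurrence_counts(doc_concepts)
for pair, n in counts.most_common():
    print(pair, n)
```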

Page 33: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Retrieving Information Across Multiple Diverse Information Sources

• Issued Patents and Applications
• Court Cases
• File Wrappers
• Technical Publications
• Regulations and Laws

Patent System

Technology Firms’ Concerns
• Can I get patent protection for my innovation?
• Do I build or do I buy related technologies?
• What are my competitors doing?
• How strong are their patents?
• Am I perhaps infringing on someone else’s patents?
• If so, are those patents valid?
• Have they been enforced in court?
• Has their validity been challenged in court?

Page 34: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

PATENT

United States Patent 5,955,422
September 21, 1999

Production of erythropoietin

Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") …

Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Appl. No.: 08/100,197
Filed: August 2, 1993.

COURT CASE

314 F.3d 1313 (2003)
AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants.

… Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), … alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.

FILE WRAPPER

U.S. Patent 5,955,422

Claims 61-63 are rejected under 35 U.S.C. § 103 as being unpatentable over any one of Miyake et al., 1977 (R)

…In accordance with the provisions of 37 C.F.R. §1.607, the present continuation is being filed for the purpose of

Publication Database

REGULATIONS: U.S. Code Title 35, C.F.R. Title 37, M.P.E.P. …

BIOPORTAL: DOMAIN KNOWLEDGE

Cross-Referencing between Information Sources

Solution: Patent System Ontology

Page 35: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Patent System Ontology

I. Facilitate information integration across multiple diverse information sources
• This requires a standardized representation (a formal semantic model): the Patent System Ontology

II. Integrate Domain Semantics into existing Information Retrieval and Text mining methodologies to improve retrieval of information

Page 36: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Patent System Ontology

Information Retrieval Framework

Page 37: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Future Work

• Using multiple enriched ontologies may provide the necessary terms

• MeSH Descriptors are provided for every publication during indexing and can potentially improve results

• Implement Okapi model for scoring documents
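Assuming the "Okapi model" refers to the widely used Okapi BM25 ranking function, a minimal version is sketched below. The parameter defaults `k1=1.2` and `b=0.75` are common conventions rather than values taken from the slides, the IDF uses the non-negative `log(1 + …)` variant, and the toy documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25.

    docs maps doc_id -> list of tokens and must not be empty.
    Returns a dict doc_id -> score (0.0 when no query term occurs).
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


docs = {
    "d1": ["tumor", "zebrafish", "tumor"],
    "d2": ["zebrafish", "heart"],
    "d3": ["mouse", "liver"],
}
scores = bm25_scores(["tumor", "zebrafish"], docs)
print(sorted(scores, key=scores.get, reverse=True))
```

Unlike a pure Boolean filter, a scoring function of this kind ranks documents, which is what a MAP-based evaluation rewards.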

Page 38: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Thank You

Page 39: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Backup Slides

Page 40: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Motivation

• Scientific literature is an important source of information

• Retrieving relevant information from scientific publications is challenging

• Domain terminology is used inconsistently in scientific publications

• Increasing amounts of information amplify the problem

• Improved methodologies based on semantics are required

Page 41: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Background

• Text REtrieval Conference (TREC) organized by NIST has showcased many successful methods

• The Genomics track focused on full-text scientific publications from 49 prominent journals

• Methodologies involved:
– Use of Synonymy from ontologies
– Language based models
– Query expansion and annotations
– Okapi scoring model

Page 42: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Goals

• Understand how domain ontologies can be leveraged

• Understand which domain ontologies can be leveraged

• Develop a knowledge-based approach to integrate domain knowledge with search mechanism

Page 43: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Query Expansion

• TREC Queries are first manually pre-processed

“What [TUMOR TYPES] are found in zebrafish?”
=>
“[Tumor][MeSH] AND zebrafish”

• [Tumor] indicates the term that has to be expanded
• [MeSH] indicates the ontology that should be used
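The pre-processed notation is regular enough to be parsed mechanically. The sketch below splits a query on `AND` and recognizes the `[term][ontology]` pattern. The exact grammar used by the authors is not given on the slides, so the regular expression and the `parse_query` helper are assumptions that cover only the examples shown.

```python
import re

# Assumption: a tagged clause is exactly "[term][ontology]".
TAGGED = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def parse_query(query):
    """Split a pre-processed query into clauses.

    Each clause is a dict with the term and, if the clause was tagged,
    the ontology to use for expansion (otherwise None).
    """
    clauses = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        match = TAGGED.fullmatch(part)
        if match:
            clauses.append({"term": match.group(1), "ontology": match.group(2)})
        else:
            clauses.append({"term": part, "ontology": None})
    return clauses


print(parse_query("[Tumor][MeSH] AND zebrafish"))
```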

Page 44: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Summary

• Search methodologies must be based on semantics in order to tackle terminology inconsistency

• Domain ontologies provide these semantics

• Domain ontologies need to be modified (or enriched) in order to fulfill information needs

• User interaction is important

Page 45: Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

BioPortal

• BioPortal is an integrated resource for biomedical ontologies

• Currently indexes over 300 ontologies including Medical Subject Headings and Gene Ontology

• Provides a comprehensive web service, abstracting the formats and APIs of all underlying ontologies
