
Advanced Library Services
Developing a Biomedical Knowledge Repository to Support Advanced Information Management Applications

Olivier Bodenreider, M.D., Ph.D.
Thomas C. Rindflesch, Ph.D.
Lister Hill National Center for Biomedical Communications

NCICB Operations meeting, April 13, 2007

Context

Provide biomedical information to health care professionals and consumers
Exploit NLM resources
Maintain NLM’s cutting edge

Proposal overview
  Advanced Library Services
  Biomedical Knowledge Repository
Pilot projects

Why additional services?

Biomedical information is growing at an increasingly fast pace
  High-throughput approach to knowledge processing
Information retrieval is the starting point, not the end of the journey for the researcher
  Towards “computable” knowledge
Integration between literature and other resources is insufficient
  Adequate for navigation purposes
  Insufficient for knowledge processing

What additional services?

Refined information retrieval
  Indexing on relations in addition to concepts
  Find articles asserting that IL-13 inhibits COX-2
Multi-document summarization
  Extract and visualize facts from the literature
  Summarize the top 300 papers on panic disorder
Question answering
  Clinical and biological questions
  What drugs interact with imipramine?
Knowledge discovery
  Reasoning with facts from heterogeneous resources
  From MEDLINE and UMLS together

Normalized and integrated knowledge

Normalized knowledge
  Common format
  Common identification mechanism
Integrated knowledge
  Single repository
  Seamless environment
  Phenotype and genotype information together

Biomedical Knowledge Repository

Sources of knowledge

Biomedical literature
  Predications extracted from MEDLINE abstracts and publicly available full-text articles using text mining techniques
  Other corpora (e.g., ClinicalTrials.gov)
Terminological knowledge
  UMLS
Structured knowledge bases
  NCBI resources (e.g., Entrez Gene)
  Functional annotations from model organism databases
  …
Contributed knowledge
  The repository is open to collaborators outside NLM

Formalism: Triples

Facts
Assertions
Relations
Semantic predications
RDF triples

General form: concept1 --relationship--> concept2
  Imipramine --treats--> Panic Disorder
  APP --has_associated_disease--> Alzheimer disease
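The concept1 / relationship / concept2 pattern can be sketched in Python as follows. This is only an illustration of the triple formalism, not the repository's actual data model; the `Triple` class and the `objects_of` helper are hypothetical names introduced here.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact of the form concept1 --relationship--> concept2."""
    subject: str
    predicate: str
    obj: str


# The two example facts shown on the slide.
facts = [
    Triple("Imipramine", "treats", "Panic Disorder"),
    Triple("APP", "has_associated_disease", "Alzheimer disease"),
]


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of(facts, "Imipramine", "treats"))
```

Because every fact has the same three-part shape, one lookup function can serve facts from any source.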

Annotated knowledge

Provenance information
  Source (e.g., PMID)
  Extraction mechanism
  Timestamp
Frequency information
  Redundancy
Collaborative annotation
  “Was this information useful?”
  Context of use/usefulness
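One possible way to attach provenance to a fact and derive frequency from redundant assertions is sketched below. The field names, the placeholder PMIDs, the extraction-mechanism labels, and the timestamps are all assumptions made for illustration; the slides do not specify a schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AnnotatedFact:
    subject: str
    predicate: str
    obj: str
    source: str        # e.g., a PMID (placeholder values below)
    extracted_by: str  # extraction mechanism (labels are assumptions)
    timestamp: date


assertions = [
    AnnotatedFact("Imipramine", "treats", "Panic Disorder",
                  "PMID:0000001", "text mining", date(2007, 4, 13)),
    AnnotatedFact("Imipramine", "treats", "Panic Disorder",
                  "PMID:0000002", "text mining", date(2007, 4, 13)),
    AnnotatedFact("APP", "has_associated_disease", "Alzheimer disease",
                  "Entrez Gene", "format conversion", date(2007, 4, 13)),
]


def frequency(facts):
    """Count how often each (subject, predicate, object) is asserted."""
    return Counter((f.subject, f.predicate, f.obj) for f in facts)


def sources(facts, triple):
    """List the sources that assert a given triple."""
    return sorted(f.source for f in facts
                  if (f.subject, f.predicate, f.obj) == triple)


key = ("Imipramine", "treats", "Panic Disorder")
print(frequency(assertions)[key], sources(assertions, key))
```

Keeping each assertion as its own record means redundancy is not lost when the same fact arrives from several sources.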

Semantic Web perspective

Common format for knowledge
  Resource Description Framework (RDF)
Common identification scheme
  Uniform Resource Identifier (URI)
Standard tools
  RDF browsers
  RDF “reasoners”
High level of interest in biomedicine within the Semantic Web community
  Health Care and Life Sciences Interest Group

Advanced Library Services: Summary

[Diagram: the Biomedical Knowledge Repository draws on biomedical literature (MEDLINE, ClinicalTrials.gov), terminological knowledge (UMLS), structured knowledge bases (Entrez Gene, GO), and contributed knowledge. It supports source selection (PubMed, annotations), information retrieval, question answering, knowledge discovery, and document summarization.]

Advanced Library Services: Pilot projects

[Diagram: populating the repository (Entrez Gene via XSLT; MEDLINE and ClinicalTrials.gov via SemRep) and exploiting the repository (source selection with PubMed, document summarization).]

Pilot #1
Populating and exploiting the Biomedical Knowledge Repository

Converting Entrez Gene into RDF

With Satya Sahoo (U. Georgia) and Kelly Zeng (LHC)

Overview

[Diagram: XML (file) is transformed with an XSLT stylesheet into RDF (file) and loaded into RDF (Oracle). Tools named in the diagram: JAPX, Jena.]

Input: 124 element tags, 2M genes
Output: 106 properties, 410M triples
Example: Names → has_name
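The idea of mapping XML element tags to RDF properties can be sketched as follows. The pilot itself used an XSLT stylesheet; this sketch substitutes Python's standard `xml.etree.ElementTree` purely for illustration. The XML record, its element names, and the `has_symbol` property are hypothetical and far simpler than the real Entrez Gene schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified gene record (not the real Entrez Gene XML).
GENE_XML = """
<Gene id="351">
  <Symbol>APP</Symbol>
  <ProteinName>amyloid beta A4 protein</ProteinName>
</Gene>
"""

# Hypothetical mapping from XML element tags to property names.
TAG_TO_PROPERTY = {
    "Symbol": "has_symbol",
    "ProteinName": "has_protein_name",
}


def gene_to_triples(xml_text):
    """Turn one gene record into (subject, predicate, object) triples."""
    gene = ET.fromstring(xml_text.strip())
    subject = f"geneid-{gene.get('id')}"
    triples = []
    for child in gene:
        prop = TAG_TO_PROPERTY.get(child.tag)
        if prop and child.text:
            triples.append((subject, prop, child.text.strip()))
    return triples


for triple in gene_to_triples(GENE_XML):
    print(triple)
```

A declarative tag-to-property table plays the role the stylesheet plays in the pilot: the mapping can grow without changing the conversion logic.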

APP (GeneID: 351) --has_protein_name--> amyloid beta A4 protein

RDF triple: Gene property

subject: APP (geneid-351)
predicate: eg:has_protein_reference_name_E
object: amyloid beta A4 protein

RDF graph: Connecting several genes

PARK1 --has_associated_disease--> Parkinson disease
MAPT --has_associated_disease--> Parkinson disease
MAPT --has_associated_disease--> Pick disease
TBP --has_associated_disease--> Parkinson disease
TBP --has_associated_disease--> Spinocerebellar ataxia
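A small sketch of how individual gene-disease triples become a connected graph: indexing the edges by disease shows which genes are linked through a shared node. The helper names are hypothetical, and the edges are the gene-disease pairs shown on the slide.

```python
from collections import defaultdict

# (gene, disease) pairs, each standing for a has_associated_disease triple.
edges = [
    ("PARK1", "Parkinson disease"),
    ("MAPT", "Parkinson disease"),
    ("MAPT", "Pick disease"),
    ("TBP", "Parkinson disease"),
    ("TBP", "Spinocerebellar ataxia"),
]


def genes_by_disease(pairs):
    """Group genes under each disease they are associated with."""
    index = defaultdict(set)
    for gene, disease in pairs:
        index[disease].add(gene)
    return index


def shared_diseases(pairs):
    """Diseases that connect two or more genes in the graph."""
    return {disease: sorted(genes)
            for disease, genes in genes_by_disease(pairs).items()
            if len(genes) > 1}


print(shared_diseases(edges))
```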

Future work

Transform additional resources into RDF
  UMLS Metathesaurus
  Other NCBI databases
  Drug knowledge bases
  …
Integrate resources
  Query across resources

APP --has_associated_disease--> Alzheimer disease
PARK1 --has_associated_disease--> Parkinson disease
Alzheimer disease --isa--> Neurodegenerative diseases
Parkinson disease --isa--> Neurodegenerative diseases
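Querying across resources amounts to joining triples that come from different sources. The sketch below joins gene-disease facts with isa facts from a terminology, using the example values from the slide; the function name and the in-memory list standing in for the repository are assumptions.

```python
# Facts as they might come from a gene resource.
gene_facts = [
    ("APP", "has_associated_disease", "Alzheimer disease"),
    ("PARK1", "has_associated_disease", "Parkinson disease"),
]

# Facts as they might come from a terminology.
terminology_facts = [
    ("Alzheimer disease", "isa", "Neurodegenerative diseases"),
    ("Parkinson disease", "isa", "Neurodegenerative diseases"),
]


def genes_for_disease_class(disease_class, triples):
    """Join has_associated_disease with isa to find genes for a disease class."""
    members = {s for s, p, o in triples if p == "isa" and o == disease_class}
    return sorted(s for s, p, o in triples
                  if p == "has_associated_disease" and o in members)


# A single list stands in for the integrated repository.
repository = gene_facts + terminology_facts
print(genes_for_disease_class("Neurodegenerative diseases", repository))
```

Neither source alone can answer the question; the join only works because both use the same disease names as identifiers.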

From glycosyltransferase to congenital muscular dystrophy

EG:9215 (LARGE) --has_molecular_function--> GO:0008375 (acetylglucosaminyltransferase)
EG:9215 (LARGE) --has_associated_phenotype--> MIM:608840 (Muscular dystrophy, congenital, type 1D)
GO:0008375 (acetylglucosaminyltransferase) is linked by isa relations, through GO:0008194 and GO:0016758, to GO:0016757 (glycosyltransferase)
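Going from a general molecular function to a phenotype requires following isa links transitively. A minimal sketch is given below. The identifiers come from the slide, but the order of the two intermediate GO terms in the chain is an assumption, and the function names are hypothetical.

```python
# Assumed isa chain: the slide names these GO terms, but the order of the
# two intermediate terms is an assumption made for this sketch.
ISA = {
    "GO:0008375": ["GO:0008194"],
    "GO:0008194": ["GO:0016758"],
    "GO:0016758": ["GO:0016757"],
}

FACTS = [
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
]


def ancestors(term, isa):
    """Return all terms reachable from `term` by following isa links."""
    seen, stack = set(), [term]
    while stack:
        for parent in isa.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def phenotypes_for_function(go_term, facts, isa):
    """Phenotypes of genes whose function is `go_term` or one of its descendants."""
    genes = {s for s, p, o in facts
             if p == "has_molecular_function"
             and (o == go_term or go_term in ancestors(o, isa))}
    return sorted(o for s, p, o in facts
                  if p == "has_associated_phenotype" and s in genes)


print(phenotypes_for_function("GO:0016757", FACTS, ISA))
```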

Pilot #2
Populating and exploiting the Biomedical Knowledge Repository

Semantic Medline: Multi-document summarization and visualization

With Marcelo Fiszman, M.D., Ph.D. and Halil Kilicoglu, M.S.

Advanced Library Services: Pilot projects

[Diagram: populating the repository (MEDLINE and ClinicalTrials.gov processed with SemRep) and exploiting the repository (source selection with PubMed, document summarization).]

Managing retrieval results

[Diagram: an information retrieval query for “breast cancer” returns 500 citations; Semantic Medline summarizes them into a network of relations.]

Guiding principles

Visualization
  Overview first
  Details on demand
Integration of knowledge content
  Automated management of knowledge from text
  Seamless application interfaces

[Shneiderman 1996]
[BoSC, April, 2006]

Seamless integration of technologies

Information retrieval
  PubMed - MEDLINE
  Essie - ClinicalTrials.gov
Natural language processing: SemRep
  Represent content of text with semantic predications
Abstraction summarization
  Informative: Overview of most salient information
Visualization
  Indicative: Links to source text and additional information

Semantic Medline: Overview

[Diagram: Query → PubMed / Essie → Text (MEDLINE, ClinicalTrials.gov) → SemRep → Semantic predications → Summarize → Salient semantic predications → Visualize → Informative graph. UMLS and structured biomedical data appear as supporting resources.]

Document selection

PubMed query: “breast cancer”

MEDLINE citations

… aromatase inhibitor provides mortality benefit in early breast carcinoma …
… determined the spectrum and frequency of ATM missense variants in 443 breast cancer patients …

Semantic interpretation

… aromatase inhibitor provides mortality benefit in early breast carcinoma …
  Aromatase Inhibitors --treats--> Breast Carcinoma
… determined the spectrum and frequency of ATM missense variants in 443 breast cancer patients …
  ATM gene --associated_with--> Breast Carcinoma

Semantic predications

Aromatase Inhibitors --treats--> Breast Carcinoma
ATM gene --associated_with--> Breast Carcinoma
Tamoxifen --treats--> Breast Carcinoma
Tamoxifen --treats--> Patients
Breast Carcinoma --process_of--> Individual
…

Summarization

[Diagram: the Summarize step turns semantic predications into salient semantic predications.]

Abstraction summarization

All predications → Reduction → Salient predications
  Specify a topic
  Retain predications on the topic
  Eliminate uninformative predications
  Retain most frequent predications
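The four reduction steps can be sketched as a simple filter-and-count pipeline. This is a toy illustration under stated assumptions, not the actual Semantic Medline summarizer: the `UNINFORMATIVE` set, the duplicated predication (standing in for a fact found in two citations), and the `top_n` cutoff are all invented for the example.

```python
from collections import Counter

predications = [
    ("Aromatase Inhibitors", "treats", "Breast Carcinoma"),
    ("ATM gene", "associated_with", "Breast Carcinoma"),
    ("Tamoxifen", "treats", "Breast Carcinoma"),
    ("Tamoxifen", "treats", "Breast Carcinoma"),  # assumed repeat from a second citation
    ("Tamoxifen", "treats", "Patients"),
    ("Breast Carcinoma", "process_of", "Individual"),
]

# Assumption for this sketch: such generic arguments carry little information.
UNINFORMATIVE = {"Patients", "Individual"}


def summarize(preds, topic, top_n=3):
    """Reduce all predications to the salient ones for `topic`."""
    # Retain predications on the topic.
    on_topic = [p for p in preds if topic in (p[0], p[2])]
    # Eliminate uninformative predications.
    informative = [p for p in on_topic
                   if p[0] not in UNINFORMATIVE and p[2] not in UNINFORMATIVE]
    # Retain the most frequent predications.
    counts = Counter(informative)
    return [p for p, _ in counts.most_common(top_n)]


for predication in summarize(predications, "Breast Carcinoma"):
    print(predication)
```

Counting identical triples is what makes redundancy across citations useful: a fact asserted more often ranks higher in the summary.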

Salient semantic predications

Aromatase Inhibitors --treats--> Breast Carcinoma
ATM gene --associated_with--> Breast Carcinoma
Tamoxifen --treats--> Breast Carcinoma
Tamoxifen --treats--> Patients
Breast Carcinoma --process_of--> Individual
…

Visualization

Informative graph

Aromatase Inhibitors --treats--> Breast Carcinoma
Tamoxifen --treats--> Breast Carcinoma
ATM gene --associated_with--> Breast Carcinoma

Semantic Medline: Live

Related research: Visualizing relations

Maps of linked concepts among documents [Fuller et al. 2004]
Literature network of co-occurring genes [Jensen et al. 2001]
Associative concept space for discovery [van der Eijk et al. 2004]
Genomic information across structured and textual databases [Tao et al. 2005]

Future work

Process all of MEDLINE/PubMed
  With SemRep
Incrementally integrate structured knowledge sources
  Entrez databases
  UMLS
  Genetics Home Reference
Implementation
  Efficiency
  Large amount of data

Summary

Deliver health information
  Biomedical Knowledge Repository
  Advanced Library Services
Exploit
  Current Library resources
  Advanced information technology
Support timely translation
  Of biomedical research
  Into improvements in patient care and public health

Acknowledgments

Caroline Ahlers
Mariana Dimitrov
Marcelo Fiszman
Halil Kilicoglu
François-Michel Lang
Lee Peters
Anna Ripple
Graciela Rosemblat
Satya Sahoo
Kelly Zeng

References

Bodenreider O, Rindflesch TC. Advanced library services: Developing a biomedical knowledge repository to support advanced information management applications. Technical report. Bethesda, Maryland: Lister Hill National Center for Biomedical Communications, National Library of Medicine; September 14, 2006. http://lhncbc.nlm.nih.gov/lhc/docs/reports/2006/tr2006001.pdf

Advanced Library Services

Olivier Bodenreider [email protected]
Thomas C. Rindflesch [email protected]

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA