issues and perspectives for medical text indexing
Post on 03-Feb-2022
2 Views
Preview:
TRANSCRIPT
Olivier BodenreiderOlivier Bodenreider
Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland Bethesda, Maryland -- USAUSA
Issues and perspectivesfor medical text indexing
Medical Informatics Europe 2003St-Malo, France
May 6, 2003 - Workshop W20
The Indexing Initiative
3
Motivation at NLMMotivation at NLM
◆◆ Increasing volume of biomedical literatureIncreasing volume of biomedical literature●● MEDLINE has grown from about 7 million citations in MEDLINE has grown from about 7 million citations in
1996 to over 12 million now1996 to over 12 million now
●● The number of journals indexed has grown from about The number of journals indexed has grown from about 3,750 in 1996 to 4,600 now3,750 in 1996 to 4,600 now
◆◆ Increasing availability of full textIncreasing availability of full text
◆◆ Limited resourcesLimited resources●● Especially qualified indexersEspecially qualified indexers
4
The IND ProjectThe IND Project
◆◆ ObjectivesObjectives●● Investigate automatic and semiautomatic indexing Investigate automatic and semiautomatic indexing
methodsmethods
●● Producing equal or better retrievalProducing equal or better retrieval
◆◆ Initially, an independent collection of projects Initially, an independent collection of projects addressingaddressing●● Indexing methodsIndexing methods
●● EvaluationEvaluation
●● PolicyPolicy
http://ii.nlm.nih.gov
[Aronson & al., AMIA, 2000]
5
Current statusCurrent status
◆◆ SemiSemi--automatic indexingautomatic indexing●● New citations are indexed every nightNew citations are indexed every night
●● Suggested descriptors integrated in the environment Suggested descriptors integrated in the environment used by the indexersused by the indexers
●● Ongoing evaluationOngoing evaluation
◆◆ Automatic indexingAutomatic indexing●● Collections not otherwise indexedCollections not otherwise indexed
●● Descriptors not displayedDescriptors not displayed
6
OverviewOverview Title + Abstract
Ordered list of MeSH Main Headings
MeSH Main Headings
UMLS concepts
Clustering
Restrict to MeSH
TrigramPhrase
Matching
Rel. Citations
PubMedRelated
Citations
ExtractMeSHdescr.
Phrasex
MetaMap
Noun Phrases
Three issues
8
Three issuesThree issues Title + Abstract
Ordered list of MeSH Main Headings
MeSH Main Headings
UMLS concepts
Clustering
Restrict to MeSH
TrigramPhrase
Matching
Rel. Citations
PubMedRelated
Citations
ExtractMeSHdescr.
Phrasex
MetaMap
Noun Phrases
Word-senseambiguity
Terminologyvs. ontology
Evaluation
9
Word sense ambiguityWord sense ambiguity
◆◆ Inherent to natural language processing (NLP)Inherent to natural language processing (NLP)
◆◆ Active research fieldActive research field
◆◆ Compounded in the biomedical domainCompounded in the biomedical domain●● Acronyms / abbreviationsAcronyms / abbreviations
●● Gene / gene product namesGene / gene product names
●● Terms not fully specifiedTerms not fully specified
10
Terminology vs. ontologyTerminology vs. ontology
◆◆ Hierarchies often taskHierarchies often task--drivendrivenrather than based on principlesrather than based on principles
◆◆ Usually suitable for information retrievalUsually suitable for information retrieval●● Better recallBetter recall
●● Precision may not be crucialPrecision may not be crucial
◆◆ Not necessarily suitable for reasoningNot necessarily suitable for reasoning
11
EvaluationEvaluation
◆◆ IndexIndex--basedbased●● Gold standardGold standard
■■ But no ground truthBut no ground truth
●● Similarity measuresSimilarity measures■■ But multiple perspectives possibleBut multiple perspectives possible
◆◆ RetrievalRetrieval--basedbased●● Requires test collectionsRequires test collections
◆◆ SystemSystem--vs. uservs. user--centeredcentered
Perspectives
13
PerspectivesPerspectives
◆◆ RequirementsRequirements●● Better ontologiesBetter ontologies
●● Better identification of specialized entitiesBetter identification of specialized entities(e.g., gene names)(e.g., gene names)
●● Better wordBetter word--sense disambiguation techniquessense disambiguation techniques
◆◆ Tremendous interestTremendous interest(through data mining and knowledge discovery)(through data mining and knowledge discovery)●● In the medical informatics communityIn the medical informatics community
●● And beyond (KDD cup 02, genomic track at TREC 03)And beyond (KDD cup 02, genomic track at TREC 03)
Contact:Contact: olivierolivier@@nlmnlm..nihnih..govgovWeb:Web: etbsun2.etbsun2.nlmnlm..nihnih..govgov:8000:8000
Olivier BodenreiderOlivier Bodenreider
Lister Hill National CenterLister Hill National Centerfor Biomedical Communicationsfor Biomedical CommunicationsBethesda, Maryland Bethesda, Maryland -- USAUSA
MedicalOntologyResearch
top related