NLP for Biomedical Applications
Information integration through terminology integration
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
AMIA Symposium, Washington, DC
November 12, 2003
Introduction
◆ NLP and text mining require
  ● Terminology
  ● Domain knowledge
◆ Biomedical terminologies
  ● Usually provide vocabulary
  ● May provide some domain knowledge
  ● Enable semantic integration
◆ Semantic integration may benefit NLP by enabling links to external resources
Terminology integration
The Unified Medical Language System
Unified Medical Language System
◆ Started in 1986
◆ National Library of Medicine
◆ Terminology integration
  ● 60 families of biomedical vocabularies
«[…] the UMLS project is an effort to overcome two significant barriers to
effective retrieval of machine-readable information.
• The first is the variety of ways the same concepts are expressed in
different machine-readable sources and by different people.
• The second is the distribution of useful information among many
disparate databases and systems.»
Integrating subdomains
[Figure: subdomains, their vocabularies, and the UMLS]
● Biomedical literature: MeSH
● Genome annotations: GO
● Model organisms: NCBI Taxonomy
● Genetic knowledge bases: OMIM
● Clinical repositories: SNOMED
● Anatomy: UWDA
● Other subdomains: …
Integrating subdomains
[Figure: Biomedical literature, Genome annotations, Model organisms, Genetic knowledge bases, Clinical repositories, Anatomy, Other subdomains]
Information integration
Genetics as an example
NF2 gene, protein, and disease
Neurofibromatosis 2 is an autosomal dominant disease characterized by tumors called schwannomas involving the acoustic nerve, as well as other features. The disorder is caused by mutations of the NF2 gene resulting in absence or inactivation of the protein product. The protein product of NF2 is commonly called merlin (but also neurofibromin 2 and schwannomin) and functions as a tumor suppressor.
Schwannoma (acoustic neuroma)
http://www.mayoclinic.com
NF2 gene
http://staff.washington.edu/timk/cyto/human/
http://www.ncbi.nlm.nih.gov/mapview/
Merlin
◆ Synonyms
  ● Neurofibromin 2
  ● Schwannomin
  ● Schwannomerlin
  ● Neurofibromatosis-2
◆ 10 isoforms
◆ Annotations
  ● Negative regulation of cell proliferation
  ● Cytoskeleton
  ● Plasma membrane
[Figure: NF2 in the UMLS and in external resources]
UMLS Metathesaurus (Concepts and relations)
  ● Neurofibromatosis 2 (Type II neurofibromatosis, Bilateral acoustic neurofibromatosis): C0027832
  ● NF2 (Neurofibromin 2 gene): C0085114
  ● Merlin (Schwannomin, Neurofibromin 2): C0254123
  ● Related concepts: Merlin, Drosophila; Tumor suppressor genes; Tumor suppressor proteins; Benign neoplasms of cranial nerves; Neurofibromatoses
UMLS Semantic Network (Semantic Types)
  ● Amino Acid, Peptide, or Protein
  ● Biologically Active Substance
  ● Neoplastic Process
  ● Gene or Genome
External resources
  ● OMIM: NEUROFIBROMATOSIS, TYPE II; NF2 (#101000)
  ● GenBank: U49724
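The idea that several names resolve to one Metathesaurus concept can be sketched in a few lines of Python. This is only an illustration, not how the UMLS is implemented or distributed: the names and concept identifiers are copied from the slide, while the `normalize`, `build_index`, and `lookup` helpers and the in-memory dictionary are assumptions made for the example.

```python
import re
from collections import defaultdict

# Names and concept identifiers as shown on the slide.
# The dictionary layout itself is an assumption made for this sketch.
CONCEPT_NAMES = {
    "C0027832": [
        "Neurofibromatosis 2",
        "Type II neurofibromatosis",
        "Bilateral acoustic neurofibromatosis",
    ],
    "C0085114": ["NF2", "Neurofibromin 2 gene"],
    "C0254123": ["Merlin", "Schwannomin", "Neurofibromin 2"],
}


def normalize(term):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", term.lower()).split())


def build_index(concept_names):
    """Map each normalized name to the set of concepts that carry it."""
    index = defaultdict(set)
    for cui, names in concept_names.items():
        for name in names:
            index[normalize(name)].add(cui)
    return index


INDEX = build_index(CONCEPT_NAMES)


def lookup(term):
    """Return the sorted concept identifiers whose names match the term."""
    return sorted(INDEX.get(normalize(term), set()))


if __name__ == "__main__":
    for term in ["schwannomin", "Type-II Neurofibromatosis", "NF2"]:
        print(term, "->", lookup(term))
```

The lookup returns a list rather than a single identifier because one surface form can belong to more than one concept, which is the naming problem raised for genes later in the talk.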
Limitations
◆ Genes not systematically represented
  ● Most gene products and diseases are
◆ Gene/Gene product-Disease relations
  ● Not systematically represented
  ● Not explicitly represented (e.g., co-occurrence)
◆ Cross-references not systematically represented
◆ Naming conventions (genes)
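To make the co-occurrence point concrete, here is a minimal sketch of how an implicit relation between concepts can be derived by counting how often they are annotated on the same document. The concept identifiers come from the earlier slide, but the three annotated "documents" are hypothetical, and `cooccurrence_counts` is a name chosen for this example, not part of any UMLS tool.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotated_docs):
    """Count how many documents mention each unordered pair of concepts."""
    counts = Counter()
    for concepts in annotated_docs:
        # Deduplicate and sort so each pair is counted once, in a stable order.
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts


# Hypothetical annotations: each inner list is the set of concepts
# found in one document. These are made-up data for illustration only.
DOCS = [
    ["C0085114", "C0027832"],
    ["C0085114", "C0254123", "C0027832"],
    ["C0254123", "C0027832"],
]

COUNTS = cooccurrence_counts(DOCS)

if __name__ == "__main__":
    for pair, n in COUNTS.most_common():
        print(pair, n)
```

A count like this says only that two concepts appear together; it does not say that a gene causes a disease or that a protein is the product of a gene, which is why such relations are described as not explicitly represented.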
Medical Ontology Research
Olivier Bodenreider
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA
Contact: [email protected]
Web: mor.nlm.nih.gov