Literature Mining and Ontology
BMI/IBGP 730, Autumn 2011
Yang Xiang, Ph.D. in Computer Science
Department of Biomedical Informatics
The Ohio State University
Outline
• What is Literature Mining?
  – Popular Tools for Literature Mining
  – Basic Techniques
  – Information Retrieval (Indexing): Expediting searching
  – Linguistic Processing
  – Other Processing
• What is Ontology?
  – Simple ontology examples
  – Gene ontology
  – Unified Medical Language System
  – Use and index ontology
• Applications of Literature Mining and Ontology
What is Literature (Text) Mining?
• The purposes of Literature Mining
  – Find relevant documents
  – Discover knowledge (what is knowledge?)
    • e.g. opinion mining (sentiment analysis)
    • e.g. document similarity
• The advantages of computer-based Literature Mining
  – Simply put, computers can search many more documents!
  – Computers can ‘think’ and discover knowledge.
• The rest of this lecture focuses on biomedical literature mining.
Why Is Literature Mining So Popular in Biomedical Science?
• Biomedical science studies natural subjects.
  – Species
  – Genes
  – Phenotypes
  – Diseases
  – …
Popular Tools for Biomedical Literature Mining – Document search
• Google
  – Google Scholar: http://scholar.google.com
• ISI Web of Knowledge
  – www.isiknowledge.com
• PubMed
  – www.ncbi.nlm.nih.gov/pubmed
• Scopus
  – www.scopus.com
Tools for Biomedical Literature Mining – Knowledge discovery
• The Gene Ontology
  – http://www.geneontology.org/
• Gene answer
  – www.geneanswers.com
Techniques Behind Literature Mining
• Interdisciplinary
  – Computer Science
    • Information retrieval
    • Data mining
    • Natural Language Processing
    • Machine learning
  – Library Science
  – Biomedical Science
  – Linguistics
    • Computational linguistics
  – Statistics
  – And more!
• Two main research areas (some overlaps)
  – Information Retrieval
  – Natural Language Processing
Basic Text Search Algorithm
• Assume the text size is n.
• Assume the search string size is m.
• How can we design an efficient algorithm to find all matches in the text?
  – Brute-force algorithm: O(mn).
  – Boyer-Moore heuristics: O(mn) in the worst case, but fast in most cases for English text.
  – KMP (Knuth-Morris-Pratt) algorithm: O(m+n).
[Figure: the string to match, "world", aligned against the text "Hello, world ……".]
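The KMP idea can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation; the function names build_failure and kmp_search are chosen here for the example. The failure table records how far the pattern can be shifted after a mismatch, so no character of the text is ever re-read, which is where the O(m+n) bound comes from.

```python
def build_failure(pattern):
    """failure[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # fall back without moving backwards in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("Hello, world", "world"))  # [7]
```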
Outline
• What is Literature Mining?– Popular Tools for Literature Mining– Basic Techniques– Information Retrieval (Indexing): Expediting searching– Linguistic Processing– Other Processing
• What is Ontology?– Simple ontology examples– Gene ontology– Unified Medical Language System– Use and index ontology
• Applications of Literature Mining and Ontology
Information Retrieval (Indexing)
• Archiving (preprocessing) documents for fast search
  – Preprocessing time
  – Query time
  – Index size
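One common indexing structure is the inverted index, which maps each word to the set of documents that contain it. The sketch below is only an illustration of the trade-off listed above: the index is built once (preprocessing time, index size), and each query then becomes a few set intersections (query time). The three documents and the function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical toy collection, not real abstracts.
docs = {
    1: "Gene1 interacts with Gene2",
    2: "Gene2 regulates Gene3",
    3: "coffee and health",
}
index = build_inverted_index(docs)
print(sorted(search(index, "gene2")))             # [1, 2]
print(sorted(search(index, "gene2 interacts")))   # [1]
```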
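Programming language processing (C++, Java, etc.)
• Lexical analysis of the statement y=x+10;

  lexeme   Token type
  y        identifier
  =        assignment operator
  x        identifier
  +        addition operator
  10       number
  ;        end of statement

• Syntax analysis
[Figure: syntax tree for y=x+10; the assignment operator "=" is the root, with the identifier y on one side and an expression on the other; that expression is "+" applied to two sub-expressions, the identifier x and the number 10.]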
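The lexical-analysis step can be sketched as a small regular-expression tokenizer. This is a toy that only knows the five token types from the table above; the TOKEN_SPEC list and the tokenize function are names chosen for this example, not part of any compiler.

```python
import re

# Token types from the slide, tried in this order at each position.
TOKEN_SPEC = [
    ("number", re.compile(r"\d+")),
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
    ("assignment operator", re.compile(r"=")),
    ("addition operator", re.compile(r"\+")),
    ("end of statement", re.compile(r";")),
]


def tokenize(source):
    """Split source into (lexeme, token type) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for token_type, pattern in TOKEN_SPEC:
            match = pattern.match(source, pos)
            if match:
                tokens.append((match.group(), token_type))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
    return tokens


for lexeme, token_type in tokenize("y=x+10;"):
    print(f"{lexeme:<4}{token_type}")
```

Running it on y=x+10; yields the same six (lexeme, token type) pairs as the table.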
Natural Language Processing
• Lexical level
  – Stemming (including lemmatizing): find the root of a word
    swimming, swam, swim, swimmer → swim
  – Stemming rules may vary (balance between overstemming and understemming)
  – Typical algorithm: the Porter stemming algorithm
  – Aliases, synonyms
• Grammatical level
  – Parsing
    “…We find Gene1 interacts with Gene2…”
[Figure: parse tree. Sentence → Noun phrase (Gene1) + Verb phrase; Verb phrase → Verb (interact) + Noun phrase (Gene2).]
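To make the stemming bullet concrete, here is a deliberately naive suffix-stripping stemmer. It is NOT the Porter algorithm: the suffix list, the irregular-form table, and the doubled-consonant rule are assumptions made up for this illustration. It handles the swim example and also shows overstemming.

```python
# Hand-written lookup for irregular forms (the "lemmatizing" part).
IRREGULAR = {"swam": "swim", "swum": "swim"}
# Suffixes tried in order; a real stemmer uses a much richer rule set.
SUFFIXES = ("ing", "er", "s")


def stem(word):
    """Return a crude root form of word."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            root = word[: -len(suffix)]
            # Collapse a doubled final letter: "swimm" becomes "swim".
            if root[-1] == root[-2]:
                root = root[:-1]
            return root
    return word


print([stem(w) for w in ["swimming", "swam", "swim", "swimmer"]])
print(stem("butter"))  # overstemming: the rules wrongly reduce it to "but"
```

The "butter" case is the kind of error that motivates tuning the rules: stripping more aggressively merges unrelated words (overstemming), while stripping less leaves related words apart (understemming).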
Statistical and Data Mining Processing
• Statistical
  – Count the word frequency
  – Count the expression frequency
• Data Mining
  – Mining the set of frequent words
  – Association Rule Mining
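The two counting steps can be sketched with the standard library. Here an "expression" is approximated by a pair of adjacent words (a bigram), which is an assumption made for this example; the sample sentence is made up.

```python
import re
from collections import Counter


def words_of(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_counts(text):
    """Frequency of each word."""
    return Counter(words_of(text))


def bigram_counts(text):
    """Frequency of each pair of adjacent words."""
    words = words_of(text)
    return Counter(zip(words, words[1:]))


text = "Gene1 interacts with Gene2 and Gene3 interacts with Gene2"
print(word_counts(text).most_common(3))
print(bigram_counts(text)[("interacts", "with")])  # 2
```

Frequent-word mining and association rule mining build on exactly these counts, keeping only the words or word sets whose frequency exceeds a chosen threshold.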
Document Classification
• E.g., classify all documents related to coffee and health
• Various machine learning algorithms can be applied here.
[Figure: documents related to coffee and health are split into documents that show benefits and documents that show risk, with topics such as Cardioprotective, Laxative, …, Cholesterol, …, Anxiety underneath.]
Accuracy vs. Relevancy in Pattern Recognition/Machine Learning
• Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
• Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}|
• Fall-out = |{nonrelevant docs} ∩ {retrieved docs}| / |{nonrelevant docs}|
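The three measures translate directly into set operations. The sketch below uses a made-up collection of ten document ids; the function name retrieval_metrics and the convention of returning 0.0 for an empty denominator are choices made for this example.

```python
def retrieval_metrics(relevant, retrieved, all_docs):
    """Return (precision, recall, fall_out) for one query."""
    relevant, retrieved, all_docs = set(relevant), set(retrieved), set(all_docs)
    nonrelevant = all_docs - relevant
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = len(nonrelevant & retrieved) / len(nonrelevant) if nonrelevant else 0.0
    return precision, recall, fall_out


# Hypothetical example: 10 documents, 4 relevant, 5 retrieved, 2 in common.
all_docs = range(1, 11)
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5, 6, 7}
print(retrieval_metrics(relevant, retrieved, all_docs))  # (0.4, 0.5, 0.5)
```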
Ontology
• According to philosophy, ontology is a systematic account of Existence
• In information science, ontology is a representation of concepts and their relationships, often by directed graphs
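Ontology Example (Informal)
[Figure: an informal fish ontology. The root "fish" branches into "fresh water" and "salt water"; below that come regions (North American, Asian, Europe, ……) and fish such as Common Carp, mirror Carp, and Crappie, with edges labeled "native" and "invasive".]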
Ontology Example: Scientific classification

  Kingdom     Animalia
  Phylum      Chordata, Hemichordata, …
  Class       Actinopterygii, Sarcopterygii, …
  Subclass    Neopterygii, Chondrostei, …
  Infraclass  Teleostei, …
  Order       Cypriniformes, …
  Family      Cyprinidae, …

(The first taxon in each row is a child of the first taxon in the row above.)
Gene Ontology (GO) Consortium
[Figure: fragment of the GO "Molecular function" graph with the terms nucleic acid binding, enzyme, helicase, DNA binding, DNA helicase, and ATP-dependent DNA helicase; other fragments ("DNA metabolis", "cell", …) are cut off in the figure.]
Reference: Gene Ontology: tool for the unification of biology. Nature Genetics, 2000. http://dx.doi.org/10.1038/75556
Unified Medical Language System (UMLS)
• A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains:
  – Metathesaurus
  – Semantic Network
  – SPECIALIST Lexicon
• UMLS contains more data than just ontologies
• Maintained by the US National Library of Medicine
• Website: http://www.nlm.nih.gov/research/umls/
UMLS - Metathesaurus
• Number of biomedical concepts > 1 million
• Concepts stem from over 100 incorporated controlled source vocabularies:
  – ICD (International Statistical Classification of Diseases and Related Health Problems)
  – MeSH (Medical Subject Headings)
  – SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms)
  – LOINC (Logical Observation Identifiers Names and Codes)
  – Gene Ontology
  – OMIM (Online Mendelian Inheritance in Man)
  – …
http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
UMLS - Semantic Network
• Semantic types (categories)
  – Entity
    • Physical Object
      – Organism
        …
      …
  – Event
    • Activity
      – Behavior
        …
      …
• Semantic relationships (connecting two concepts)
  – isa
  – associated_with
    • physically_related_to
      – part_of
        …
    • spatially_related_to
      – location_of
        …
    …
http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html
http://www.clres.com/semrels/umls_relation_list.html
[Figure: example relationships among Drug A, Disease B, and Gene A, with edges labeled "treats", "treated_by", and "disease_is_marked_by_gene".]
Use of ontology systems
• Statistical
  – Gene ontology enrichment test
• Indexing
  – Reachability
  – Distance
  – Path
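One common way to formulate a GO enrichment test is as a hypergeometric tail probability; the slides do not say which test is meant, so treat this as an assumption. With N annotated genes in total, K of them annotated with a given GO term, and a study set of n genes of which k carry the term, the p-value is the chance of seeing k or more by random sampling. The function name and the numbers below are made up for the example.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: 10 genes, 5 annotated with the term,
# and all 5 genes in the study set carry the term.
print(enrichment_p_value(N=10, K=5, n=5, k=5))  # 1/252, about 0.004
```

A small p-value suggests the term is over-represented in the study set; in practice many terms are tested at once, so a multiple-testing correction is usually applied as well.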
Represent Ontology by Graphs
• Directed Graph
• Directed Acyclic Graph (DAG): Most ontologies fall into this type.
• Directed Tree
[Figure: three example graphs, labeled Directed Graph, DAG, and Tree.]
Reachability
[Figure: example directed graph on vertices 1 to 15.]
• Query(1,11)? Yes
• Query(3,9)? No
The problem: Given two vertices u and v in a directed graph G, is there a path from u to v?
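Without any index, a reachability query can be answered by a breadth-first search from u, which costs time proportional to the size of the graph per query. The sketch below shows this baseline; the small graph is hypothetical, because the edges of the example figure are not given in the text.

```python
from collections import deque


def reachable(graph, u, v):
    """Return True if there is a directed path from u to v.

    graph maps each vertex to a list of its out-neighbors.
    """
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in graph.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False


# Hypothetical directed graph (not the one in the figure).
graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 6: [3]}
print(reachable(graph, 1, 5))  # True
print(reachable(graph, 5, 1))  # False
```

Indexing schemes aim to answer the same question much faster than this search, at the cost of preprocessing time and index size.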
Distance
[Figure: example directed graph on vertices 1 to 15.]
• Query: dG(1, 11) = 3
The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v?
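For an unweighted graph, the distance is the number of edges on a shortest path, and breadth-first search finds it directly. The sketch below is the unindexed baseline; the graph is hypothetical, since the figure's edges are not recoverable from the text.

```python
from collections import deque


def distance(graph, u, v):
    """Return the number of edges on a shortest path from u to v,
    or None if v is not reachable from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in graph.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None


# Hypothetical directed graph (not the one in the figure).
graph = {1: [2, 3], 2: [4], 3: [5], 4: [5], 5: [6]}
print(distance(graph, 1, 6))  # 3, via 1 -> 3 -> 5 -> 6
print(distance(graph, 6, 1))  # None
```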
Path
[Figure: example directed graph on vertices 1 to 15.]
The problem: Given two vertices u and v in a (directed) graph G, what is a path (are paths) connecting u to v?
• Example query: find a path from 1 to 11
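Both variants of the path query can be sketched without an index: a breadth-first search with parent pointers returns one shortest path, and a depth-first enumeration lists all simple paths. The graph below is hypothetical (the figure's edges are not given in the text), and the function names are chosen for this example.

```python
from collections import deque


def find_path(graph, u, v):
    """Return one shortest path from u to v as a list of vertices, or None."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            path = []
            while x is not None:  # walk the parent pointers back to u
                path.append(x)
                x = parent[x]
            return path[::-1]
        for y in graph.get(x, ()):
            if y not in parent:
                parent[y] = x
                queue.append(y)
    return None


def all_paths(graph, u, v, prefix=()):
    """Yield every simple path from u to v (can be exponentially many)."""
    prefix = prefix + (u,)
    if u == v:
        yield list(prefix)
        return
    for y in graph.get(u, ()):
        if y not in prefix:
            yield from all_paths(graph, y, v, prefix)


# Hypothetical directed graph (not the one in the figure).
graph = {1: [2, 3], 2: [4], 3: [5], 4: [5], 5: [6]}
print(find_path(graph, 1, 6))        # [1, 3, 5, 6]
print(list(all_paths(graph, 1, 6)))  # [[1, 2, 4, 5, 6], [1, 3, 5, 6]]
```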
The estimated difficulty of building very efficient graph-database indexing schemes
(based on current research)

                          Reachability   Distance   Path
  Directed Tree           easy           easy       easy
  Directed Acyclic Graph  medium         hard       hard
  Directed Graph          medium         hard       hard

Reference:
• R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs", Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608.
• R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query", Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.
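To see why reachability on a directed tree counts as "easy", consider interval labeling, a well-known technique that is not taken from the papers cited above. One depth-first traversal gives every vertex an interval that contains the intervals of all its descendants, so each query afterwards is two comparisons. The tree and function names below are made up for the example.

```python
def label_tree(children, root):
    """Assign each vertex an interval (start, end) by depth-first traversal.

    The interval of a vertex encloses the intervals of all its descendants.
    The recursion is fine for a sketch; very deep trees would need an
    explicit stack.
    """
    intervals = {}
    clock = 0

    def visit(node):
        nonlocal clock
        start = clock
        clock += 1
        for child in children.get(node, ()):
            visit(child)
        intervals[node] = (start, clock)

    visit(root)
    return intervals


def tree_reachable(intervals, u, v):
    """True if v lies in the subtree rooted at u (constant-time check)."""
    u_start, u_end = intervals[u]
    v_start, v_end = intervals[v]
    return u_start <= v_start and v_end <= u_end


# Hypothetical directed tree: edges point from parent to child.
children = {1: [2, 3], 2: [4, 5], 3: [6]}
intervals = label_tree(children, root=1)
print(tree_reachable(intervals, 1, 6))  # True
print(tree_reachable(intervals, 2, 6))  # False
```

The index stores one interval per vertex. In a DAG or a general directed graph a vertex can have several parents, so a single interval per vertex no longer suffices, which is one reason those rows of the table are harder.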
Applications of Literature Mining and Ontology - I
• Build confirmed gene-phenotype relations
  – Human Phenotype Ontology (HPO)
  – Built from the Online Mendelian Inheritance in Man (OMIM) database.
  – http://human-phenotype-ontology.org/
Reference: Robinson PN, Mundlos S. The Human Phenotype Ontology. Clinical Genetics 77(6) 2010: 525–534. http://dx.doi.org/10.1111/j.1399-0004.2010.01436.x
Applications of Literature Mining and Ontology - II
• MetaMap program and CKC Mining
  – MetaMap: Mapping biomedical text to the UMLS Metathesaurus.
  – CKC (Conceptual Knowledge Constructs) represents a path connecting several concepts in the UMLS.
  – Knowledge Discovery using MetaMap and CKC mining.
Reference:
• Aronson, A.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA Symposium, p. 17 (2001)
• Payne, P., Borlawsky, T., Kwok, A., Greaves, A.: Supporting the design of translational clinical studies through the generation and verification of conceptual knowledge-anchored hypotheses. In: AMIA Annual Symposium Proceedings, p. 566 (2008)
[Figure: pipeline in which literature is processed by MetaMap and the resulting concepts are mined for CKCs that link phenotypes and bio-molecular concepts.]
Thanks!
Questions?