Biomedical Text Mining: Inferring Hidden Relationships
from Biological Literature
Biomedical Text Mining (BTM)
- Why biomedicine?
  - Consider just MEDLINE: more than 20,000,000 references, with 40,000 added per month
  - Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) are constantly created
  - Impossible to manage such an information overload
From Text to Knowledge: tackling the data deluge through text mining
[Diagram: unstructured text (implicit knowledge) and structured content (explicit knowledge), connected through information retrieval, information extraction, semantic metadata, advanced information retrieval, and knowledge discovery]
Information Deluge
- Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of the information
- Linking text to databases and ontologies
- Curators are struggling to process the scientific literature
- Discovery of facts and events is crucial for gaining insights in the biosciences: hence the need for text mining
Aims of Biomedical Text Mining
- Text mining: discover and extract unstructured knowledge hidden in text (Hearst, 1999)
- Text mining helps construct hypotheses from associations derived from text:
  - protein-protein interactions
  - associations between genes and phenotypes
  - functional relationships among genes
Impact of biomedical text mining
- Extraction of named entities (genes, proteins, metabolites, etc.)
- Discovery of concepts allows semantic annotation of documents
  - Improves information access by going beyond index terms, enabling semantic querying
- Construction of concept networks from text
  - Allows clustering and classification of documents
  - Visualization of concept maps
Impact of BTM
- Extraction of relationships (events and facts) for knowledge discovery
  - Information extraction, more sophisticated annotation of texts (event annotation)
  - Beyond named entities: facts, events
  - Enables even more advanced semantic querying
Literature Based Discovery (LBD)
- Swanson’s experiments (1986) influenced conceptual biology
- Rapid ‘mining’ of candidate hypotheses from the literature:
  - migraine and magnesium deficiency (Swanson, 1988)
  - indomethacin and Alzheimer’s disease (Swanson and Smalheiser, 1994)
  - Curcuma longa and retinal diseases, Crohn's disease, and disorders related to the spinal cord (Srinivasan and Libbus, 2004)
  - thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C (Weeber et al., 2003)
Literature Based Discovery (LBD)
- Conceptual Biology?
- Swanson’s ABC model
- Drug repositioning
[Figure: network with nodes Alzheimer, Insulin, PKC1, CATS and SOS2, and numeric labels 3, 5, 2, 8, 9 and 4]
Literature-based discovery (LBD): the very idea
1. It means deriving new solutions to scientific problems from the public record of science.
2. The possibility arises, for example, when two articles considered together for the first time suggest new information of scientific interest that is not apparent from either article alone.
Venn Diagram: ABC Model
[Venn diagram: three sets A, B and C; the AB overlap holds articles about an AB relationship, the BC overlap holds articles about a BC relationship]
AB and BC are complementary but disjoint: they can reveal an implicit relationship between A and C in the absence of any explicit relation.
An ABC example based on title words in Medline
- Magnesium-deficient rat as a model of epilepsy. Lab Animal Sci 28:680-5, 1978
- The relation of migraine and epilepsy. Brain 92:285-300, 1969
[Venn diagram: sets of Medline records; A and C are disjoint. A: magnesium (88204), B: epilepsy, C: migraine (26923); the overlaps with B are labeled 1018 and 1710. B forms an unintended link between A and C.]
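The ABC idea above can be sketched in a few lines of Python. This is only an illustrative toy, not the system described in these slides: the function name `abc_candidates` and the three-record input are hypothetical, and real Medline title words would need tokenization and stop-word filtering first.

```python
from collections import defaultdict

def abc_candidates(records, a_term, c_term):
    """Return B terms that co-occur with A and with C in separate records.

    records: dict mapping a record id to the set of title words of that
             record (hypothetical toy input, not real Medline data).
    """
    a_ids = {rid for rid, words in records.items() if a_term in words}
    c_ids = {rid for rid, words in records.items() if c_term in words}
    if a_ids & c_ids:
        return {}  # A and C are already linked explicitly, nothing hidden
    ab = defaultdict(int)  # how many A records mention each other term
    bc = defaultdict(int)  # how many C records mention each other term
    for rid in a_ids:
        for word in records[rid] - {a_term}:
            ab[word] += 1
    for rid in c_ids:
        for word in records[rid] - {c_term}:
            bc[word] += 1
    # a B term must appear on both sides
    return {b: (ab[b], bc[b]) for b in ab if b in bc}

records = {
    1: {"magnesium", "rat", "epilepsy"},
    2: {"migraine", "epilepsy"},
    3: {"magnesium", "diet"},
}
links = abc_candidates(records, "magnesium", "migraine")
print(links)  # {'epilepsy': (1, 1)}
```

The returned pair gives the number of AB and BC records for each linking term, which is the kind of frequency evidence a frequency-based ABC ranking would start from.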
Research problems
- Information model
  - Biological information
  - Multi-level
- Automation
  - Gigantic amount of data
  - Swanson’s ABC model: semi-automatic
- How to discover novelty
  - Find novel information
  - Novel hypothesis generation
[Figure: A1 (Fish Oil), B1 (Blood Viscosity), C1 (Raynaud Disease); edge labels "Reduce" and "Aggregate"]
Information Model
Information model: category-based interaction model
- Interactor node
  - Connects the whole relation
  - Represents the action by a verb

Interactor | Type
Induce | Increase
Contribute | Increase
Reduce | Reduction
Increase | Increase
Resistant | Reduction
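The interactor/type pairs listed above amount to a lookup table. A minimal sketch follows, assuming a crude suffix-stripping step in place of real lemmatization; the function name `interaction_type` and that stripping rule are assumptions of this example, not part of the described system.

```python
# Interactor verb -> interaction type, copied from the pairs listed above.
INTERACTION_TYPE = {
    "induce": "Increase",
    "contribute": "Increase",
    "reduce": "Reduction",
    "increase": "Increase",
    "resistant": "Reduction",
}

def interaction_type(verb):
    """Look up the interaction type of a connector verb, or None if unknown.

    The suffix stripping is a stand-in for a real lemmatizer (an assumption
    of this sketch).
    """
    v = verb.lower()
    candidates = [v]
    for suffix in ("es", "ed", "s", "d"):
        if v.endswith(suffix):
            candidates.append(v[: -len(suffix)])
    for candidate in candidates:
        if candidate in INTERACTION_TYPE:
            return INTERACTION_TYPE[candidate]
    return None

print(interaction_type("induces"))  # Increase
print(interaction_type("reduced"))  # Reduction
```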
Information Model
Each node is represented by mapping a semantic type of the node to its corresponding UMLS top category.
Methods
Data Flow of BioDiscovery
[Flow diagram components: MEDLINE / PubMed Abstracts, Sentence Splitter, Sentence Parser, Entity Extractor, Relation Extractor, Similar Entity Detector, UPK, UMLS, Extracted Entities/Relations, Graph Builder, Visualizer]
Methods
Data Flow of BioDiscovery: Sentence Parser
- Input: split sentence
- Output: sentence tree by Link Grammar Parser
[Diagram, Sentence Parsing Phase: Split Sentences, Tagger, Parsed Tree]
Sentence Parsing - Example
Original Sentence: After the DF1 cells had been cultured for 9 d, the ALV p27 antigen in the supernatants of the two sets was detected by ELISA
Entity Extractor
- An NER technique is used to detect entities
  - LingPipe NER and the GENIA corpus are used for detection
  - The accuracy of entity extraction by LingPipe is low
- Validation of the entity type of extracted entities: by looking up the UMLS Semantic Network
- Assignment of the category tag for each entity: by utilizing UMLS top categories such as Anatomical Structure, Substance, and Phenomenon or Process
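The validation and category-tagging steps can be pictured as a table lookup. The sketch below assumes the NER step already yields (name, semantic type) pairs. The `TOP_CATEGORY` table is a small hand-written subset for illustration only and should be checked against the UMLS Semantic Network rather than trusted; the function name `tag_entities` is likewise hypothetical.

```python
# Hand-written illustrative subset: semantic type -> top category.
# An assumption of this sketch, not the mapping used by the described system.
TOP_CATEGORY = {
    "Organic Chemical": "Substance",
    "Pharmacologic Substance": "Substance",
    "Cell": "Anatomical Structure",
    "Gene or Genome": "Anatomical Structure",
    "Cell Function": "Phenomenon or Process",
    "Disease or Syndrome": "Phenomenon or Process",
}

def tag_entities(entities):
    """Validate entities and attach a category tag.

    entities: list of (name, semantic_type) pairs from an NER step.
    Entities whose semantic type is not in the table fail validation
    and are dropped.
    """
    tagged = []
    for name, semantic_type in entities:
        category = TOP_CATEGORY.get(semantic_type)
        if category is None:
            continue  # unknown type: treat as a likely NER error
        tagged.append((name, semantic_type, category))
    return tagged

tagged = tag_entities([
    ("wogonin", "Organic Chemical"),
    ("apoptosis", "Cell Function"),
    ("the two sets", "Unknown"),
])
print(tagged)
```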
Relation Extractor
- Selection of the key connector term (i.e., the verb)
  - A difficult decision when complex sentences contain many verbs
  - Utilize Link Grammar link types such as V and MV to determine the key connector
- Entities that appear before the key connector are set as interactor entities
- Entities that appear after the key connector are set as interactee entities
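Once the key connector is known, the interactor/interactee split is a matter of position in the sentence. A minimal sketch, assuming single-token entities and a connector that has already been chosen (the Link Grammar step that picks it is not reproduced here); `split_by_connector` is a hypothetical name.

```python
def split_by_connector(tokens, entities, connector):
    """Split entities into interactors (before) and interactees (after).

    tokens:    list of sentence tokens
    entities:  set of tokens recognized as entities
    connector: the key connector verb, assumed to be chosen beforehand
    """
    position = tokens.index(connector)
    interactors = [t for t in tokens[:position] if t in entities]
    interactees = [t for t in tokens[position + 1:] if t in entities]
    return interactors, interactees

tokens = "wogonin increases apoptosis in HCT-116 cells".split()
entities = {"wogonin", "apoptosis", "HCT-116"}
interactors, interactees = split_by_connector(tokens, entities, "increases")
print(interactors)  # ['wogonin']
print(interactees)  # ['apoptosis', 'HCT-116']
```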
Interaction Graph Builder
- The maximal connected graph that can be built by our interaction model has a bow-tie shape.
- Each node represents an entity.
- Edges between entities are determined by proximity in the sentence.
- The first two nodes to be connected are the interactor entity and the interactee entity located closest to the connector.
- Entities that belong to the same category are interconnected.
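One possible reading of these rules is sketched below. It assumes each entity arrives as a (name, category, distance to connector) triple, and that "same category" links are only drawn within the interactor side and within the interactee side, which is what would give the bow-tie shape. Both points are assumptions of this example, as is the name `build_interaction_graph`.

```python
from itertools import combinations

def build_interaction_graph(interactors, interactees):
    """Return a set of undirected edges between entity names.

    Each argument is a list of (name, category, distance_to_connector).
    """
    edges = set()
    if interactors and interactees:
        # centre of the bow tie: the closest entity on each side
        left = min(interactors, key=lambda e: e[2])
        right = min(interactees, key=lambda e: e[2])
        edges.add((left[0], right[0]))
    for side in (interactors, interactees):
        # wings of the bow tie: same-category entities on one side
        for a, b in combinations(side, 2):
            if a[1] == b[1]:
                edges.add(tuple(sorted((a[0], b[0]))))
    return edges

interactors = [("wogonin", "Substance", 1), ("TRAIL", "Substance", 3)]
interactees = [("apoptosis", "Phenomenon or Process", 1),
               ("HCT-116 cells", "Anatomical Structure", 3)]
edges = build_interaction_graph(interactors, interactees)
print(sorted(edges))
```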
Methods
Data Flow of MKEM
* See Appendix B for description
[Flow diagram components: MEDLINE / PubMed Abstracts, Sentence Selector, Information Element Recognizer, Entity Extractor, Relation Extractor, Similar Entity Detector, Similarity Measure, UPK, UMLS, Ranking scores, Graph Builder, Visualizer]
UPK Inference
- Input: extracted entity/relation
- Output: UPK
Similarity Measure*: MetaMap Type, Structural, Atomic Count, Semantic Similarity
Score legends shown in the diagram, in order:
- 0: not similar; 1: similar
- 0: not similar; 0.5: substructure; 1: similar
- 0: not similar; 1: similar
Similarity Measures
- Semantic Type: UMLS Semantic Type
- Structural Similarity: calculated using the SMSD (Small Molecule Subgraph Detector) system
- Atomic Count: taken from the chemDB database; the atomic count enumerates the constituent atoms of the chemical of interest
- Semantic Similarity: relative importance-based graph similarity
- Topological Similarity (not implemented yet): graph topology-based similarity
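To make the atomic count concrete, the sketch below parses a molecular formula into per-element counts. It is an illustration only: the slides take the counts from a database instead of parsing them, the rule that turns two counts into a 0/1 score is not stated, and this simple parser ignores parentheses and charges. The formulas are the two that appear later in the UPK example.

```python
import re

def atomic_count(formula):
    """Parse a simple molecular formula such as 'C16H12O5' into a dict."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

wogonin = atomic_count("C16H12O5")
fisetin = atomic_count("C15H10O6")
print(wogonin)  # {'C': 16, 'H': 12, 'O': 5}
print(fisetin)  # {'C': 15, 'H': 10, 'O': 6}
```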
Semantic Similarity
- Build the dependency tree of a sentence
- Create semantic distributional models (based on feature vectors) by tensor Singular Value Decomposition (SVD)
  - The edge statistics form a 3-dimensional tensor with the shape Head-Relation-Dependency
  - Dependency edges are also added in the reverse direction
- Calculate term weights by pointwise mutual information (PMI)
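The PMI weighting step can be illustrated without the tensor decomposition. The sketch below counts (word, feature) pairs from dependency triples, adds each edge in the reverse direction as described, and computes PMI(w, f) = log(p(w, f) / (p(w) p(f))). The relation names `nsubj` and `dobj`, the `_inv` suffix for reversed edges, and the toy triples are assumptions of this example.

```python
import math
from collections import Counter

def pmi_weights(triples):
    """Compute PMI for every observed (word, feature) pair.

    triples: list of (head, relation, dependent) edges.
    A feature is a (relation, other_word) pair.
    """
    edges = []
    for head, relation, dependent in triples:
        edges.append((head, (relation, dependent)))
        # the same edge in the reverse direction
        edges.append((dependent, (relation + "_inv", head)))
    pair_counts = Counter(edges)
    word_counts = Counter(word for word, _ in edges)
    feature_counts = Counter(feature for _, feature in edges)
    total = len(edges)
    return {
        (word, feature): math.log(
            (n / total)
            / ((word_counts[word] / total) * (feature_counts[feature] / total))
        )
        for (word, feature), n in pair_counts.items()
    }

triples = [
    ("increase", "nsubj", "wogonin"),
    ("increase", "dobj", "apoptosis"),
    ("increase", "nsubj", "fisetin"),
    ("increase", "dobj", "apoptosis"),
]
weights = pmi_weights(triples)
# wogonin and fisetin share the feature ("nsubj_inv", "increase"),
# which is what makes their feature vectors comparable.
print(weights[("wogonin", ("nsubj_inv", "increase"))])  # log(4)
```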
Tensor Example
Tensors are useful for 3 or more modes
Tensor SVD Decomposition
2D Analog of Tensor SVD Decomposition
Methods
UPK Inference Example
Substance | Effect Type | Process | Disease | Body Part
Wogonin | Increase | Apoptosis | N/A | Malignant T-Cells
Fisetin | Increase | Apoptosis | N/A | HCT-116 Cells
Methods
UPK Inference Example: similarity measure between Wogonin and Fisetin
Measure | Value | Similarity
UMLS Semantic Type | Organic Chemical | 1
Structural Similarity | 0.75 | 1
Atomic Count | C16H12O5 (Wogonin), C15H10O6 (Fisetin) | 1
Semantic Similarity | | 0.265
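The inference step suggested by this example can be sketched as relation transfer between similar substances. How the individual similarity scores are combined into a decision is not stated in the slides, so the sketch simply takes the list of similar pairs as given; the name `infer_relations` and the tuple layout are assumptions.

```python
def infer_relations(known, similar_pairs):
    """Propose new relations by copying them between similar substances.

    known:         list of (substance, effect, process, body_part) tuples
    similar_pairs: list of (substance_a, substance_b) judged similar
                   (how that judgement is made is outside this sketch)
    """
    known_set = set(known)
    inferred = set()
    for a, b in similar_pairs:
        for source, target in ((a, b), (b, a)):
            for substance, effect, process, body_part in known:
                if substance != source:
                    continue
                candidate = (target, effect, process, body_part)
                if candidate not in known_set:
                    inferred.add(candidate)
    return inferred

known = [
    ("Wogonin", "Increase", "Apoptosis", "Malignant T-Cells"),
    ("Fisetin", "Increase", "Apoptosis", "HCT-116 Cells"),
]
inferred = infer_relations(known, [("Wogonin", "Fisetin")])
for relation in sorted(inferred):
    print(relation)
```

On this toy input the candidates are Wogonin with HCT-116 cells and Fisetin with malignant T-cells, each still a hypothesis that needs support from the literature.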
Results & Discussion
Input data: 500 PubMed abstracts related to ‘apoptosis’
Extraction result:
Entity Type | # of extracted entities
Substances | 410
Processes | 357
Diseases | 44
Body Parts | 82
Results & Discussion: Semantic Similarity with Wogonin
Similarity between wogonin_NN1 and docetaxel_NN1 : 0.0555
Similarity between wogonin_NN1 and serotonin_NN1 : 0.0558
Similarity between wogonin_NN1 and amisulpride_NN1 : 0.0
Similarity between wogonin_NN1 and ranolazine_NN1 : 0.0
Similarity between wogonin_NN1 and genistein_NN1 : 0.0429
Similarity between wogonin_NN1 and brivaracetam_NN1 : 0.0
Similarity between wogonin_NN1 and carisbamate_NN1 : 0.0
Similarity between wogonin_NN1 and riboflavin_NN1 : 0.0
Similarity between wogonin_NN1 and fisetin_NN1 : 0.0532
Similarity between wogonin_NN1 and daidzein_NN1 : 0.0
Similarity between wogonin_NN1 and caffeine_NN1 : 0.0
Similarity between wogonin_NN1 and enzyme_NN1 : -1.530258524063656E-4
Similarity between wogonin_NN1 and topiramate_NN1 : 0.0
Similarity between wogonin_NN1 and melatonin_NN1 : 0.084
Similarity between wogonin_NN1 and nimodipine_NN1 : 0.086
Results & Discussion: PageRank Score
Substance Name | Semantic Type | PageRank Similarity
NAG-1 | Gene or Genome | 0.007264810642471977
apoptosis | Cell Function | 0.0072537088024985635
Flou-3 AM | Pharmacologic Substance | 0.007244320944332344
wogonin | Organic Chemical | 0.007134948358585843
Docetaxel | Organic Chemical | 0.0070126880085477124
Jarisch-Herxheimer reaction | Functional Concept | 0.0070126880085477124
apoptotic cells | Cell | 0.0067827690545702755
Genistein | Organic Chemical | 0.006781234384834052
p53 | Gene or Genome | 0.006771667214596861
docetaxel+SN | Organic Compound | 0.006762759924385635
adverse reactions | Finding | 0.006762759924385635
atRA | Organic Chemical | 0.006762759924385635
HCT-116 | Cell Line | 0.006762759924385635
HCT-116 cells | Cell Line | 0.006762759924385635
SN-38 | Organic Chemical | 0.006521739130434784
mesenchyme | Embryonic Structure | 0.006521739130434784
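For readers unfamiliar with how such scores arise, here is a minimal power-iteration sketch of PageRank over an entity graph. The damping factor 0.85 and the fixed iteration count are conventional defaults rather than values reported in these slides, and the four-node graph is a toy, not the extracted network.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph: dict mapping each node to a list of nodes it links to;
           every linked node must also be a key of the dict.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # dangling node: spread its rank evenly over all nodes
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

toy_graph = {
    "wogonin": ["apoptosis"],
    "fisetin": ["apoptosis"],
    "apoptosis": ["wogonin", "fisetin", "HCT-116 cells"],
    "HCT-116 cells": ["apoptosis"],
}
scores = pagerank(toy_graph)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 4))
```

Because the scores sum to 1 over all nodes, values in a graph with many entities are small and close together, as in the table above.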
Results & Discussion: Semantic Similarity for Magnesium and Migraine
[Figure: A1 (Magnesium; semantic type: Element, Ion, or Isotope), B1 (Epilepsy; semantic type: Disease or Syndrome), C1 (Migraine; semantic type: Disease or Syndrome); edge labels "Positive impact on" and "Is related to"]
Semantic similarity with Magnesium:
- Epilepsy: 0.033
- Malaria: 0.011
- Sarcoidosis: 0.015
- Diabetes: 0.017
- Asthma: 0.021
- Hyperoxaluria: 0.026
- Hepatitis: 0.018
Semantic similarity with Epilepsy:
- Migraine: 0.158
- Malaria: 0.004
- Sarcoidosis: 0.041
- Diabetes: 0.049
- Asthma: 0.058
- Hyperoxaluria: 0.002
- Hepatitis: 0.009
Results & Discussion
Sample of new relationships
Substance | Effect Type | Process | Disease | Body Part
Wogonin | Increase | Apoptosis | N/A | HCT-116 Cells
Fisetin | Increase | Apoptosis | N/A | Malignant T Cells
Docetaxel | Increase | mRNA expression of IL-1 | N/A | N/A
Genistein | Increase | Apoptosis | N/A | HCT-116 Cells
Fisetin | Increase | Apoptosis | N/A | Tumor Cells
Supporting papers
- Wogonin increases apoptosis in HCT-116 cells: “Reactive oxygen species up-regulate p53 and Puma; a possible mechanism for apoptosis during combined treatment with TRAIL and wogonin”, Dae-Hee Lee et al.
- Genistein can induce apoptosis in HCT-116 cells: “Genistein, a Dietary Isoflavone, down-regulates the MDM2 Oncogene at Both Transcriptional and Posttranslational Levels”, Mao Li et al.
Summary & Future Work
- This is an ongoing project.
- The system was applied to the entity relations identified by our information model.
- We proposed a new system that extracts relationships from biomedical text and infers new information.
- Future work
  - Other techniques for NER
  - Anaphoric relationship extraction
  - Further enhancement of the Link Grammar lexicon
  - Rule generalization to provide better coverage
Demo
- Retrieve stored entities and relations: http://informatics.yonsei.ac.kr/relex/SelectDatabase.jsp
- Download PubMed record and extract entities and relations: http://informatics.yonsei.ac.kr/relex/DownloadPubMedRecord.jsp
Conclusion
- We proposed context vectors to infer unknown relationships based on biologically meaningful terms.
- We constructed a multi-level entity dictionary to recognize multi-level entities in the literature.
- We used our context vectors to discover putative drug-disease relationships.
- We evaluated the results against drug-disease relations curated from the literature (PharmGKB, CTD).
- On 77,711 papers about Alzheimer’s disease, we found that our context-vector-based hybrid approach achieves better precision than the previous frequency-based ABC model.
Thank you!
Questions?
Appendix: Future Study: Different Approach to Context Terms
Based on interaction words (verb terms), define the possible direct interactions among entities, and assume that the interactions among the remaining entities are context.
[Figure: Sentences 1 to 3, each showing one interaction verb (I-verb), two interacting entities (I-Ent1, I-Ent2) and three context entities (C-Ent) in varying order]