Biomedical Text Mining: Inferring Hidden Relationships from Biological Literature

Upload: christian-holmes

Post on 26-Dec-2015

Page 1: Biomedical Text Mining: Inferring Hidden Relationships from Biological Literature

Biomedical Text Mining (BTM)

Why biomedicine?
- Consider just MEDLINE: more than 20,000,000 references, with 40,000 added per month
- Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) are constantly created
- Impossible to manage such an information overload

From Text to Knowledge: tackling the data deluge through text mining

[Diagram: Unstructured Text (implicit knowledge) and Structured Content (explicit knowledge), connected through Information Extraction, Semantic Metadata, Knowledge Discovery, Information Retrieval, and Advanced Information Retrieval]

Information Deluge

- Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of the information
- Linking text to databases and ontologies
- Curators are struggling to process the scientific literature
- Discovery of facts and events is crucial for gaining insights in the biosciences: hence the need for text mining

Aims of Biomedical Text Mining

- Text mining: discover and extract unstructured knowledge hidden in text (Hearst, 1999)
- Text mining helps to construct hypotheses from associations derived from text
  - protein-protein interactions
  - associations between genes and phenotypes
  - functional relationships among genes

Impact of biomedical text mining

- Extraction of named entities (genes, proteins, metabolites, etc.)
- Discovery of concepts allows semantic annotation of documents
  - Improves information access by going beyond index terms, enabling semantic querying
- Construction of concept networks from text
  - Allows clustering and classification of documents
  - Visualization of concept maps

Impact of BTM

- Extraction of relationships (events and facts) for knowledge discovery
  - Information extraction, more sophisticated annotation of texts (event annotation)
  - Beyond named entities: facts, events
  - Enables even more advanced semantic querying

Literature Based Discovery (LBD)

- Swanson's experiments (1986) influenced conceptual biology: rapid 'mining' of candidate hypotheses from the literature
  - migraine and magnesium deficiency (Swanson, 1988)
  - indomethacin and Alzheimer's disease (Swanson and Smalheiser, 1994)
  - Curcuma longa and retinal diseases, Crohn's disease and disorders related to the spinal cord (Srinivasan and Libbus, 2004)
  - thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C (Weeber et al., 2003)

Literature Based Discovery (LBD)

Conceptual Biology?

Swanson’s ABC model

Drug repositioning

[Figure: network with nodes Alzheimer, Insulin, PKC1, CATS, and SOS2; numeric labels 3, 5, 2, 8, 9, 4]

Literature-based discovery (LBD)? The very idea.

1. It means deriving, from the public record of science, new solutions to scientific problems.

2. The possibility arises, for example, when two articles considered together for the first time suggest new information of scientific interest that is not apparent from either article alone.

Venn Diagram: ABC Model

[Venn diagram: sets A, B, and C. The AB overlap holds articles about an AB relationship; the BC overlap holds articles about a BC relationship.]

AB and BC are complementary but disjoint: they can reveal an implicit relationship between A and C in the absence of any explicit relation.
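The ABC idea above can be sketched in a few lines of Python. This is only an illustration, assuming each article is reduced to a set of title terms; the toy corpus and the function name `abc_candidates` are hypothetical and not taken from the presented system.

```python
from collections import defaultdict

def abc_candidates(articles, a_term):
    """Return {c_term: set of linking b_terms} for C terms that never
    co-occur with a_term directly but share at least one B term with it."""
    cooc = defaultdict(set)
    for terms in articles:
        for t in terms:
            cooc[t] |= terms - {t}
    b_terms = cooc[a_term]
    candidates = defaultdict(set)
    for b in b_terms:
        for c in cooc[b]:
            # Skip A itself and anything already directly linked to A.
            if c != a_term and c not in b_terms:
                candidates[c].add(b)
    return dict(candidates)

# Toy corpus (hypothetical): each article is a set of title terms.
articles = [
    {"magnesium", "epilepsy"},
    {"epilepsy", "migraine"},
    {"magnesium", "stress"},
    {"stress", "migraine"},
]
hidden = abc_candidates(articles, "magnesium")
print(hidden)
```

Here "migraine" never appears with "magnesium" in the same article, yet it is reachable through two intermediate B terms, which is the kind of implicit A-C link the model looks for.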

An ABC example based on title words in Medline

- A: magnesium (88,204 records)
- B: epilepsy
- C: migraine (26,923 records)
- AB example: "Magnesium-deficient rat as a model of epilepsy." Lab Animal Sci 28:680-5, 1978
- BC example: "The relation of migraine and epilepsy." Brain 92:285-300, 1969
- Overlap counts shown in the diagram: 1018 and 1710
- The A-C connection is "an unintended link"

Venn diagram: sets of Medline records; A and C are disjoint.

Research problems

- Information model: biological information, multi-level
- Automation: gigantic amount of data; Swanson's ABC model: semi-automatic
- How to discover novelty: find novel information, novel hypothesis generation

[Figure: A1 (Fish Oil), B1 (Blood Viscosity), C1 (Raynaud Disease); edge labels: Reduce, Aggregate]

Information Model

Information model: category-based interaction model

Interactor node
- Connects the whole relation
- Represents the action by a verb

Interactor    Type
Induce        Increase
Contribute    Increase
Reduce        Reduction
Increase      Increase
Resistant     Reduction
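The interactor table above amounts to a lookup from connector verb to interaction type. A minimal sketch, assuming the verb has already been reduced to its base form; the helper name `interaction_type` is hypothetical, and only the five rows shown on the slide are included.

```python
# Verb-to-type rows copied from the slide; anything else is unknown.
INTERACTOR_TYPE = {
    "induce": "Increase",
    "contribute": "Increase",
    "reduce": "Reduction",
    "increase": "Increase",
    "resistant": "Reduction",
}

def interaction_type(verb):
    """Map a connector verb (base form) to its interaction type, or None."""
    return INTERACTOR_TYPE.get(verb.lower())

print(interaction_type("Induce"))
print(interaction_type("bind"))
```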

Information Model

Each node is represented by mapping a semantic type of the node to its corresponding UMLS top category.

Methods

Data Flow of BioDiscovery

[Figure: BioDiscovery data flow. Components: MEDLINE / PubMed Abstracts, Sentence Splitter, Sentence Parser, Entity Extractor, Relation Extractor, Similar Entity Detector, UPK, UMLS, Extracted Entities/Relations, Graph Builder, Visualizer]

Methods

Data Flow of BioDiscovery

=Sentence Parser=
- Input : Split sentence
- Output : Sentence tree produced by the Link Grammar Parser

[Figure: Sentence Parsing Phase. Split Sentences, Tagger, Parsed Tree]

Sentence Parsing - Example

Original Sentence: After the DF1 cells had been cultured for 9 d, the ALV p27 antigen in the supernatants of the two sets was detected by ELISA

Entity Extractor

- An NER technique is used to detect entities
  - LingPipe NER and the GENIA corpus are used to detect entities
  - The accuracy of entity extraction by LingPipe is low
- Validation of the entity type of extracted entities: by looking up the UMLS Semantic Network
- Assignment of the category tag for each entity: by utilizing UMLS top categories such as Anatomical Structure, Substance, and Phenomenon or Process
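The category-tag assignment can be pictured as a lookup from a UMLS semantic type to its top category. The sketch below is an assumption-laden illustration: the table is a small hand-written excerpt, the names `TOP_CATEGORY` and `category_tag` are hypothetical, and a real system would read the hierarchy from the UMLS Semantic Network rather than hard-code it.

```python
# Hand-written excerpt for illustration only (not loaded from UMLS).
TOP_CATEGORY = {
    "Organic Chemical": "Substance",
    "Pharmacologic Substance": "Substance",
    "Cell": "Anatomical Structure",
    "Cell Function": "Phenomenon or Process",
    "Disease or Syndrome": "Phenomenon or Process",
}

def category_tag(semantic_type):
    """Return the top category for a semantic type, or 'Unknown'."""
    return TOP_CATEGORY.get(semantic_type, "Unknown")

print(category_tag("Organic Chemical"))
print(category_tag("Cell Function"))
```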

Relation Extractor

- Selection of the key connector term (i.e., verb)
  - A difficult decision where complex sentences contain many verbs
  - Utilize Link Grammar link types such as V and MV to determine the key connector
- Entities that appear before the key connector are set to Interactor entities
- Entities that appear after the key connector are set to Interactee entities
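The interactor/interactee rule above can be sketched as a split on token position. This assumes the key connector has already been chosen (the Link Grammar step is not shown) and that each entity carries the index of its first token; `assign_roles` and the example sentence are hypothetical.

```python
def assign_roles(entities, connector_index):
    """entities: list of (name, token_index) pairs.
    Entities before the connector become interactors, those after it interactees."""
    interactors = [name for name, pos in entities if pos < connector_index]
    interactees = [name for name, pos in entities if pos > connector_index]
    return interactors, interactees

# Tokens: wogonin(0) increases(1) apoptosis(2) in(3) HCT-116 cells(4)
entities = [("wogonin", 0), ("apoptosis", 2), ("HCT-116 cells", 4)]
interactors, interactees = assign_roles(entities, connector_index=1)
print(interactors, interactees)
```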

Interaction Graph Builder

- A maximum connected graph that can be built by our interaction model has a bow-tie shape.
- Each node represents an entity.
- An edge between entities is determined by proximity in a sentence.
- The first two nodes to be connected are the interactor entity and the interactee entity that are located closest to the connector.
- Entities that belong to the same category are inter-connected to each other.
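A rough sketch of the two edge rules above, under stated assumptions: entities are given as (name, category, token position) tuples, the first edge joins the interactor and interactee nearest to the connector, and same-category entities are then linked pairwise. The function name and the example sentence are hypothetical, and the slide does not say whether same-category links cross the connector, so this version links them wherever they occur.

```python
from itertools import combinations

def build_interaction_graph(interactors, interactees, connector_pos):
    """Each entity is (name, category, position). Returns a set of undirected edges."""
    edges = set()
    # First edge: the interactor and interactee closest to the connector.
    if interactors and interactees:
        left = min(interactors, key=lambda e: abs(e[2] - connector_pos))
        right = min(interactees, key=lambda e: abs(e[2] - connector_pos))
        edges.add(frozenset((left[0], right[0])))
    # Entities that share a category are connected to each other.
    for a, b in combinations(interactors + interactees, 2):
        if a[1] == b[1]:
            edges.add(frozenset((a[0], b[0])))
    return edges

# Tokens: TRAIL(0) and(1) wogonin(2) increase(3) apoptosis(4) in(5) HCT-116 cells(6)
interactors = [("TRAIL", "Substance", 0), ("wogonin", "Substance", 2)]
interactees = [("apoptosis", "Process", 4), ("HCT-116 cells", "Anatomical Structure", 6)]
edges = build_interaction_graph(interactors, interactees, connector_pos=3)
print(edges)
```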

Methods

Data Flow of MKEM

* See Appendix B for description

[Figure: MKEM data flow. Components: MEDLINE / PubMed Abstracts, Sentence Selector, Information Element Recognizer, Entity Extractor, Relation Extractor, Similar Entity Detector, Similarity Measure, UPK, Ranking scores, Graph Builder, Visualizer, UMLS]

=UPK Inference=
- Input : Extracted entity/relation
- Output : UPK

Similarity Measure*
- MetaMap Type: 0 = Not Similar, 1 = Similar
- Structural: 0 = Not Similar, 0.5 = Substructure, 1 = Similar
- Atomic Count: 0 = Not Similar, 1 = Similar
- Semantic Similarity

Similarity Measures

- Semantic Type: UMLS Semantic Type
- Structural Similarity: calculated using the SMSD (Small Molecule Subgraph Detector) system
- Atomic Count: taken from the chemDB database. Atomic count is the enumeration of the constituent atoms of the chemical of interest.
- Semantic Similarity: relative importance-based graph similarity
- Topological Similarity (not implemented yet): graph topology-based similarity
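To make the atomic count concrete, the sketch below enumerates constituent atoms from a molecular formula string. It is an assumption for illustration: the presented system takes the count from the chemDB database, whereas this hypothetical `atomic_count` helper simply parses flat formulas (no parentheses or charges).

```python
import re
from collections import Counter

def atomic_count(formula):
    """Count atoms per element in a flat molecular formula such as 'C16H12O5'."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return dict(counts)

wogonin = atomic_count("C16H12O5")
fisetin = atomic_count("C15H10O6")
print(wogonin, fisetin)
```

How two such counts are turned into a 0/1 similarity is not stated on the slides, so that step is left out here.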

Semantic Similarity

- Build the dependency tree of a sentence
- Create semantic distributional models (based on feature vectors) by tensor Singular Value Decomposition (SVD)
  - The edge statistics form a 3-dimensional tensor with the shape Head-Relation-Dependency
  - Dependency edges are also added in the reverse direction
- Calculate term weights by Point-wise Mutual Information (PMI)
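The PMI weighting step can be illustrated on a handful of dependency edges. This is a sketch under assumptions: edges are (head, relation, dependent) triples, each word's feature is a (relation, other word) pair, reverse edges get an "-inv" suffix, and PMI is log(P(w, f) / (P(w) P(f))). The function name, the suffix, and the toy triples are hypothetical, and the tensor SVD step is not shown.

```python
import math
from collections import Counter

def pmi_weights(triples):
    """triples: iterable of (head, relation, dependent) dependency edges.
    Returns {(word, feature): PMI} over forward and reverse edges."""
    edges = []
    for h, r, d in triples:
        edges.append((h, (r, d)))
        edges.append((d, (r + "-inv", h)))  # reverse-direction edge
    pair = Counter(edges)
    word = Counter(w for w, _ in edges)
    feat = Counter(f for _, f in edges)
    total = len(edges)
    return {
        (w, f): math.log((n * total) / (word[w] * feat[f]))
        for (w, f), n in pair.items()
    }

triples = [
    ("increase", "nsubj", "wogonin"),
    ("increase", "dobj", "apoptosis"),
    ("increase", "nsubj", "fisetin"),
    ("increase", "dobj", "apoptosis"),
]
weights = pmi_weights(triples)
print(weights[("wogonin", ("nsubj-inv", "increase"))])
```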

Tensor Example

Tensors are useful for 3 or more modes

Tensor SVD Decomposition

2D Analog of Tensor SVD Decomposition

Methods

UPK Inference Example

Substance    Effect      Process      Disease    Body Part
Wogonin      Increase    Apoptosis    N/A        Malignant T-Cells
Fisetin      Increase    Apoptosis    N/A        HCT-116 Cells

Methods

UPK Inference Example

Similarity measure (Wogonin vs. Fisetin)
- UMLS Semantic Type: Organic Chemical; similarity 1
- Structural Similarity: 0.75; similarity 1
- Atomic Count: C16H12O5 (Wogonin), C15H10O6 (Fisetin); similarity 1
- Semantic Similarity: 0.265

Results & Discussion

Input data: 500 PubMed abstracts related to 'apoptosis'

Extraction result

Entity Type    # of extracted entities
Substances     410
Processes      357
Diseases       44
Body Parts     82

Results & Discussion: Semantic Similarity with Wogonin

Term compared with wogonin_NN1    Similarity
docetaxel_NN1                     0.0555
serotonin_NN1                     0.0558
amisulpride_NN1                   0.0
ranolazine_NN1                    0.0
genistein_NN1                     0.0429
brivaracetam_NN1                  0.0
carisbamate_NN1                   0.0
riboflavin_NN1                    0.0
fisetin_NN1                       0.0532
daidzein_NN1                      0.0
caffeine_NN1                      0.0
enzyme_NN1                        -1.530258524063656E-4
topiramate_NN1                    0.0
melatonin_NN1                     0.084
nimodipine_NN1                    0.086

Results & Discussion: PageRank Score

Substance Name                 Semantic Type              PageRank Similarity
NAG-1                          Gene or Genome             0.007264810642471977
apoptosis                      Cell Function              0.0072537088024985635
Flou-3 AM                      Pharmacologic Substance    0.007244320944332344
wogonin                        Organic Chemical           0.007134948358585843
Docetaxel                      Organic Chemical           0.0070126880085477124
Jarisch-Herxheimer reaction    Functional Concept         0.0070126880085477124
apoptotic cells                Cell                       0.0067827690545702755
Genistein                      Organic Chemical           0.006781234384834052
p53                            Gene or Genome             0.006771667214596861
docetaxel+SN                   Organic Compound           0.006762759924385635
adverse reactions              Finding                    0.006762759924385635
atRA                           Organic Chemical           0.006762759924385635
HCT-116                        Cell Line                  0.006762759924385635
HCT-116 cells                  Cell Line                  0.006762759924385635
SN-38                          Organic Chemical           0.006521739130434784
mesenchyme                     Embryonic Structure        0.006521739130434784
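For readers unfamiliar with how such scores arise, here is a small power-iteration PageRank in plain Python. It is a generic sketch, not the presented system's implementation: the damping factor 0.85, the iteration count, and the toy graph are assumptions, and the slides do not state which settings or graph were used.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {node: [out-neighbours]}; every neighbour must also be a key."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

# Toy graph (hypothetical): three substances each linked to apoptosis, and back.
graph = {
    "wogonin": ["apoptosis"],
    "fisetin": ["apoptosis"],
    "genistein": ["apoptosis"],
    "apoptosis": ["wogonin", "fisetin", "genistein"],
}
scores = pagerank(graph)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In this toy graph the hub node "apoptosis" receives the highest score, which mirrors why heavily connected entities rank near the top of the table.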

Results & Discussion: Semantic Similarity for Magnesium and Migraine

[Figure: A1 (Magnesium; semantic type: Element, Ion, or Isotope), B1 (Epilepsy; semantic type: Disease or Syndrome), C1 (Migraine; semantic type: Disease or Syndrome); edge labels: "Positive impact on", "Is related to"]

Pair                         Similarity
Magnesium - Epilepsy         0.033
Magnesium - Malaria          0.011
Magnesium - Sarcoidosis      0.015
Magnesium - Diabetes         0.017
Magnesium - Asthma           0.021
Magnesium - Hyperoxaluria    0.026
Magnesium - Hepatitis        0.018

Pair                         Similarity
Epilepsy - Migraine          0.158
Epilepsy - Malaria           0.004
Epilepsy - Sarcoidosis       0.041
Epilepsy - Diabetes          0.049
Epilepsy - Asthma            0.058
Epilepsy - Hyperoxaluria     0.002
Epilepsy - Hepatitis         0.009
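One way to read these scores is to pick the highest-scoring term at each step of the A-B-C chain. The snippet below does exactly that with the numbers from the slide; choosing the maximum is an assumption made for illustration, since the slides do not spell out the selection rule.

```python
# Similarity scores copied from the slide.
magnesium_to_b = {
    "Epilepsy": 0.033, "Malaria": 0.011, "Sarcoidosis": 0.015, "Diabetes": 0.017,
    "Asthma": 0.021, "Hyperoxaluria": 0.026, "Hepatitis": 0.018,
}
epilepsy_to_c = {
    "Migraine": 0.158, "Malaria": 0.004, "Sarcoidosis": 0.041, "Diabetes": 0.049,
    "Asthma": 0.058, "Hyperoxaluria": 0.002, "Hepatitis": 0.009,
}

# Assumed rule: take the best-scoring candidate at each hop.
best_b = max(magnesium_to_b, key=magnesium_to_b.get)
best_c = max(epilepsy_to_c, key=epilepsy_to_c.get)
print(best_b, best_c)
```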

Results & Discussion

Sample of new relationships

Substance    Effect Type    Process                    Disease    Body Part
Wogonin      Increase       Apoptosis                  N/A        HCT-116 Cells
Fisetin      Increase       Apoptosis                  N/A        Malignant T Cells
Docetaxel    Increase       mRNA expression of IL-1    N/A        N/A
Genistein    Increase       Apoptosis                  N/A        HCT-116 Cells
Fisetin      Increase       Apoptosis                  N/A        Tumor Cells

Supporting Papers
- Wogonin increases apoptosis in HCT-116 cells: "Reactive oxygen species up-regulate p53 and Puma; a possible mechanism for apoptosis during combined treatment with TRAIL and wogonin", Dae-Hee Lee et al.
- Genistein can induce apoptosis in HCT-116 cells: "Genistein, a Dietary Isoflavone, down-regulates the MDM2 Oncogene at Both Transcriptional and Posttranslational Levels", Mao Li et al.

Summary & Future Work

- This is an ongoing project.
- The system was applied to the entity relations identified by our information model.
- We proposed a new system that extracts relationships from biomedical text and infers new information.
- Future work
  - Other techniques for NER
  - Anaphoric relationship extraction
  - Further enhancing the Link Grammar lexicon
  - Rule generalization to provide better coverage

Demo

- Retrieve stored entities and relations: http://informatics.yonsei.ac.kr/relex/SelectDatabase.jsp
- Download a PubMed record and extract entities and relations: http://informatics.yonsei.ac.kr/relex/DownloadPubMedRecord.jsp

Conclusion

- We suggested context vectors to infer unknown relationships based on biologically meaningful terms.
- We constructed a multi-level entity dictionary to recognize multi-level entities from the literature.
- We utilized our context vectors to discover putative drug-disease relationships.
- We evaluated the results against drug-disease relations curated from the literature (PharmGKB, CTD).
- Across 77,711 papers on Alzheimer's disease, we found that our context-vector-based hybrid approach has better precision than the previous frequency-based ABC model.

Thank you!

Questions?

Appendix: Future Study: Different Approach to Context Terms

Based on interaction words (verb terms), define the possible direct interactions among entities, and assume that the interactions among the rest of the entities are context.

[Figure: three example sentences (Sentence 1, Sentence 2, Sentence 3), each showing an interaction verb (I-verb), two interacting entities (I-Ent1, I-Ent2), and three context entities (C-Ent) in different positions]