Post on 21-Dec-2015
Natural Language Processing in Bioinformatics:
Uncovering Semantic Relations
Barbara Rosario, joint work with Marti Hearst
SIMS, UC Berkeley
2
Outline of Talk
Goal: Extract semantics from text
Information and relation extraction
Protein-protein interactions
Noun compounds
3
Text Mining
Text Mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text
4
Text Mining
Text:
Stress is associated with migraines.
Stress can lead to loss of magnesium.
Calcium channel blockers prevent some migraines.
Magnesium is a natural calcium channel blocker.
1: Extract semantic entities from text
5
Text Mining
Text:
Stress is associated with migraines.
Stress can lead to loss of magnesium.
Calcium channel blockers prevent some migraines.
Magnesium is a natural calcium channel blocker.
Stress Migraine
Magnesium Calcium channel blockers
1: Extract semantic entities from text
6
Text Mining (cont.)
Text:
Stress is associated with migraines.
Stress can lead to loss of magnesium.
Calcium channel blockers prevent some migraines.
Magnesium is a natural calcium channel blocker.
Stress Migraine
Magnesium Calcium channel blockers
2: Classify relations between entities
Associated with
Lead to loss
Prevent
Subtype-of (is a)
7
Text Mining (cont.)
Text:
Stress is associated with migraines.
Stress can lead to loss of magnesium.
Calcium channel blockers prevent some migraines.
Magnesium is a natural calcium channel blocker.
Stress Migraine
Magnesium Calcium channel blockers
3: Do reasoning: find new correlations
Associated with
Lead to loss
Prevent
Subtype-of (is a)
8
Text Mining (cont.)
Text:
Stress is associated with migraines.
Stress can lead to loss of magnesium.
Calcium channel blockers prevent some migraines.
Magnesium is a natural calcium channel blocker.
Stress Migraine
Magnesium Calcium channel blockers
4: Do reasoning: infer causality
Associated with
Lead to loss
Prevent
Subtype-of (is a)
No prevention
Deficiency of magnesium → migraine
9
My research
Stress Migraine
Magnesium Calcium channel blockers
Information Extraction
Stress is associated with migraines.
Stress can lead to loss of magnesium.
Calcium channel blockers prevent some migraines.
Magnesium is a natural calcium channel blocker.
10
My research
Relation extraction
Stress Migraine
Magnesium Calcium channel blockers
Associated with
Lead to loss
Prevent
Subtype-of (is a)
11
Information and relation extraction
Problems: Given biomedical text:
Find all the treatments and all the diseases
Find the relations that hold between them
Treatment Disease
Cure?
Prevent?
Side Effect?
12
Hepatitis Examples
Cure: These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
Prevent: A two-dose combined hepatitis A and B vaccine would facilitate immunization programs
Vague: Effect of interferon on hepatitis B
13
Two tasks
Relationship extraction: identify the several semantic relations that can occur between the entities disease and treatment in bioscience text
Information extraction (IE): a related problem: identify such entities
14
Outline of IE
Data and semantic relations
Quick intro to graphical models
Models and results
Features
Conclusions
15
Data and Relations
MEDLINE abstracts and titles
3662 sentences labeled
Relevant: 1724; Irrelevant: 1771
e.g., "Patients were followed up for 6 months"
2 types of entities: treatment and disease
7 relationships between these entities
The labeled data are available at http://biotext.berkeley.edu
16
Semantic Relationships
Cure (810): Intravenous immune globulin for recurrent spontaneous abortion
Only Disease (616): Social ties and susceptibility to the common cold
Only Treatment (166): Flucticasone propionate is safe in recommended doses
Prevent (63): Statins for prevention of stroke
17
Semantic Relationships
Vague (36): Phenylbutazone and leukemia
Side Effect (29): Malignant mesodermal mixed tumor of the uterus following irradiation
Does NOT cure (4): Evidence for double resistance to permethrin and malathion in head lice
18
Outline of IE
Data and semantic relations
Quick intro to graphical models
Models and results
Features
Conclusions
19
Graphical Models
Unifying framework for developing machine learning algorithms
Graph theory plus probability theory
Widely used:
Error-correcting codes
Systems diagnosis
Computer vision
Filtering (Kalman filters)
Bioinformatics
20
(Quick intro to) Graphical Models
Nodes are random variables
Edges are annotated with conditional probabilities
Absence of an edge between nodes implies conditional independence
“Probabilistic database”
[Diagram: example graph with nodes A, B, C, D]
21
Graphical Models
[Diagram: nodes A, B, C, D; edges A→B, A→C, D→C]
Define a joint probability distribution:
P(X1, …, XN) = ∏i P(Xi | Par(Xi))
P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
Learning: given data, estimate P(A), P(B|A), P(D), P(C|A,D)
22
Graphical Models
[Diagram: nodes A, B, C, D; edges A→B, A→C, D→C]
Define a joint probability distribution:
P(X1, …, XN) = ∏i P(Xi | Par(Xi))
P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)
Learning: given data, estimate P(A), P(B|A), P(D), P(C|A,D)
Inference: compute conditional probabilities, e.g., P(A | B, D)
Inference = probabilistic queries; general inference algorithms (Junction Tree)
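To make the factorization and query concrete, here is a minimal sketch with made-up CPT values (the numbers are hypothetical, not from the talk); inference is done by brute-force enumeration rather than the Junction Tree algorithm:

```python
# Hypothetical CPTs for binary variables A, B, C, D (values 0/1),
# factored as on the slide: P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D).
P_A = {0: 0.6, 1: 0.4}
P_D = {0: 0.7, 1: 0.3}
P_B_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key: (b, a)
P_C_given_AD = {  # key: (c, a, d)
    (0, 0, 0): 0.95, (1, 0, 0): 0.05,
    (0, 0, 1): 0.5,  (1, 0, 1): 0.5,
    (0, 1, 0): 0.3,  (1, 1, 0): 0.7,
    (0, 1, 1): 0.1,  (1, 1, 1): 0.9,
}

def joint(a, b, c, d):
    # The "probabilistic database": product of local conditionals.
    return P_A[a] * P_D[d] * P_B_given_A[(b, a)] * P_C_given_AD[(c, a, d)]

def query_A_given(b, d):
    # P(A | B=b, D=d) by summing out the hidden variable C, then normalizing.
    scores = {a: sum(joint(a, b, c, d) for c in (0, 1)) for a in (0, 1)}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

probs = query_A_given(b=1, d=0)
```

Enumeration is exponential in the number of hidden variables; the Junction Tree algorithm mentioned on the slide exploits the graph structure to avoid this.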
23
Naïve Bayes models
Simple graphical model
The Xi depend on Y
Naïve Bayes assumption: all Xi are independent given Y
Currently used for text classification and spam detection
[Diagram: class node Y with observed children x1, x2, x3]
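A toy illustration of the Naïve Bayes text classifier described above, with a hypothetical two-class training set and add-one smoothing (a common choice; the talk itself uses absolute discounting):

```python
import math
from collections import Counter

# Toy training data (hypothetical): label -> list of documents.
train = {
    "cure": ["drug cures disease", "treatment cures patients"],
    "prevent": ["vaccine prevents disease", "drug prevents infection"],
}

# Estimate P(Y) and word counts per class.
class_counts = {y: Counter(w for doc in docs for w in doc.split())
                for y, docs in train.items()}
vocab = {w for c in class_counts.values() for w in c}
priors = {y: len(docs) / sum(len(d) for d in train.values())
          for y, docs in train.items()}

def classify(text):
    # Naive Bayes assumption: words are independent given the class,
    # so the log-posterior is a sum of per-word log-likelihoods.
    best, best_lp = None, float("-inf")
    for y, counts in class_counts.items():
        total = sum(counts.values())
        lp = math.log(priors[y])
        for w in text.split():
            lp += math.log((counts[w] + 1) / (total + len(vocab)))  # add-one smoothing
        if lp > best_lp:
            best, best_lp = y, lp
    return best

label = classify("vaccine prevents disease")
```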
24
Dynamic Graphical Models
Graphical model composed of repeated segments
HMMs (Hidden Markov Models): POS tagging, speech recognition, IE
[Diagram: tag sequence t1, t2, …, tN emitting word sequence w1, w2, …, wN]
25
HMMs
Joint probability distribution:
P(t1, …, tN, w1, …, wN) = P(t1) ∏i P(ti | ti-1) P(wi | ti)
Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data
26
HMMs
Joint probability distribution:
P(t1, …, tN, w1, …, wN) = P(t1) ∏i P(ti | ti-1) P(wi | ti)
Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data
Inference: P(ti | w1, w2, …, wN)
27
Graphical Models for IE
Different dependencies between the feature nodes and the relation node
[Diagram: static models S1, S2 and dynamic models D1, D2, D3]
28
Graphical Model
Relation node: the semantic relation (cure, prevent, none, …) expressed in the sentence
The relation generates the state sequence and the observations
29
Graphical Model
Markov sequence of states (roles)
Role nodes: Role_t ∈ {treatment, disease, none}
[Diagram: Role_t-1 → Role_t → Role_t+1]
30
Graphical Model
Roles generate multiple observations
Feature nodes (observed): word, POS, MeSH, …
31
Graphical Model
Inference: find the Relation and the Roles given the observed features
32
Features
Word
Part of speech
Phrase constituent
Orthographic features: 'is number', 'all letters are capitalized', 'first letter is capitalized', …
Semantic features (MeSH)
33
MeSH
MeSH Tree Structures:
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]
34
MeSH (cont.)
1. Anatomy [A]
  Body Regions [A01] +
  Musculoskeletal System [A02]
  Digestive System [A03] +
  Respiratory System [A04] +
  Urogenital System [A05] +
  Endocrine System [A06] +
  Cardiovascular System [A07] +
  Nervous System [A08] +
  Sense Organs [A09] +
  Tissues [A10] +
  Cells [A11] +
  Fluids and Secretions [A12] +
  Animal Structures [A13] +
  Stomatognathic System [A14] (…)
Body Regions [A01]
  Abdomen [A01.047]
    Groin [A01.047.365]
    Inguinal Canal [A01.047.412]
    Peritoneum [A01.047.596] +
    Umbilicus [A01.047.849]
  Axilla [A01.133]
  Back [A01.176] +
  Breast [A01.236] +
  Buttocks [A01.258]
  Extremities [A01.378] +
  Head [A01.456] +
  Neck [A01.598] (…)
35
Use of lexical hierarchies in NLP
Big problem in NLP: a few words occur very often, while most occur very rarely (Zipf's law)
This makes it difficult to compute reliable statistics
One solution: use lexical hierarchies (another example: WordNet)
Compute statistics on classes of words instead of individual words
36
Mapping Words to MeSH Concepts
headache pain: C23.888.592.612.441, G11.561.796.444
  generalizing: C23.888 [Neurologic Manifestations], G11.561 [Nervous System Physiology]
  C23 [Pathological Conditions, Signs and Symptoms], G11 [Musculoskeletal, Neural, and Ocular Physiology]
headache recurrence: C23.888.592.612.441, C23.550.291.937
breast cancer cells: A01.236, C04, A11
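A sketch of the mapping and generalization idea, using a tiny hypothetical lexicon (a real system would look terms up in the full MeSH tree):

```python
# Hypothetical mini MeSH lexicon: term -> list of tree codes.
MESH = {
    "headache": ["C23.888.592.612.441", "G11.561.796.444"],
    "breast": ["A01.236"],
    "cancer": ["C04"],
    "cells": ["A11"],
}

def generalize(code, level):
    # Keep the first `level` dot-separated components of a tree code,
    # e.g. C23.888.592 -> C23.888 at level 2.
    return ".".join(code.split(".")[:level])

def mesh_pair(word1, word2, level=1):
    # Characterize a two-word noun compound by the pair of (truncated)
    # MeSH codes of its components; first sense only, for simplicity.
    return (generalize(MESH[word1][0], level), generalize(MESH[word2][0], level))

pair = mesh_pair("breast", "cancer")
```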
37
Graphical Model
Joint probability distribution over the relation, role, and feature nodes
Parameters estimated with maximum likelihood and absolute discounting smoothing
P(Rela, Role_0, …, Role_T, f_0, …, f_nT) = P(Rela) P(Role_0 | Rela) ∏t=1..T P(Role_t | Role_t-1, Rela) ∏t=0..T ∏j=1..n P(f_jt | Role_t, Rela)
38
Graphical Model
Inference: find the Relation and the Roles given the observed features
39
Relation extraction
Results in terms of classification accuracy (with and without irrelevant sentences)
2 cases: roles given; roles hidden (only features)
Rela^ = argmax_k P(Rela_k, Role_0, …, Role_T, f_0, …, f_nT)
40
Relation classification: Results
Good results for a difficult task
One of the few systems to tackle several DIFFERENT relations between the same types of entities; thus differs from the problem statement of other work on relations

Accuracy:
Sentences      Input        Baseline  GM (D2)
Only rel.      only feat.   46.7      72.6
Only rel.      roles given            76.6
Rel. + irrel.  only feat.   50.6      74.9
Rel. + irrel.  roles given            82.0
41
Role Extraction: Results (Junction Tree algorithm)
F-measure = (2 * Prec * Recall) / (Prec + Recall)
(Related work extracting "diseases" and "genes" reports an F-measure of 0.50)

Sentences      F-measure
Only rel.      0.73
Rel. + irrel.  0.71
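The F-measure formula above can be computed directly; plugging in the precision and recall reported later for the protein extraction task (0.85 and 0.74) reproduces the 0.79 figure:

```python
def f_measure(precision, recall):
    # F1: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

f = f_measure(0.85, 0.74)
```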
42
Features impact: Role extraction
Most important features: 1) Word, 2) MeSH

              Rel. + irrel.   Only rel.
All features  0.71            0.73
No word       0.61 (-14.1%)   0.66 (-9.6%)
No MeSH       0.65 (-8.4%)    0.69 (-5.5%)
43
Features impact: Relation classification (rel. + irrel.)
Most important feature: the roles

Accuracy:
All feat. + roles           82.0
All feat. - roles           74.9 (-8.7%)
All feat. + roles - Word    79.8 (-2.8%)
All feat. + roles - MeSH    84.6 (+3.1%)
44
Features impact: Relation classification (rel. + irrel.)
Most realistic case: roles not known
Most important features: 1) Word, 2) MeSH

Accuracy:
All feat. - roles           74.9
All feat. - roles - Word    66.1 (-11.8%)
All feat. - roles - MeSH    72.5 (-3.2%)
45
Conclusions
Classification of subtle semantic relations in bioscience text
Graphical models for the simultaneous extraction of entities and relationships
Importance of MeSH, a lexical hierarchy
46
Outline of Talk
Goal: Extract semantics from text
Information and relation extraction
Protein-protein interactions: using an existing database to gather labeled data
47
Protein-Protein interactions
One of the most important challenges in modern genomics, with many applications throughout biology
There are several protein-protein interaction databases (BIND, MINT,..), all manually curated
48
Protein-Protein interactions
Supervised systems require manually labeled data, while purely unsupervised methods have yet to prove effective for these tasks.
Other approaches: semi-supervised learning, active learning, co-training.
We propose using resources developed in the biomedical domain to gather labeled data for the task of classifying interactions between proteins.
49
HIV-1, Protein Interaction Database
Documents interactions between HIV-1 proteins and host cell proteins, other HIV-1 proteins, and diseases associated with HIV/AIDS
2224 pairs of interacting proteins, 65 interaction types
http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions
50
HIV-1, Protein Interaction Database
Protein 1   Protein 2   Paper ID               Interaction Type
Tat, p14    AKT3        11156964, 11994280, …  activates
AIP1        Gag, Pr55   14519844, …            binds
Tat, p14    CDK2        9223324                induces
Tat, p14    CDK2        7716549                enhances
Tat, p14    CDK2        9525916                downregulates
…
51
Most common interactions
52
Protein-Protein interactions
Idea: use this database to label data
Protein 1   Protein 2   Interaction   Paper ID
Tat, p14    AKT3        activates     11156964
Extract from the paper all the sentences containing Protein 1 and Protein 2
…
53
Protein-Protein interactions
Idea: use this database to label data
Protein 1   Protein 2   Interaction   Paper ID
Tat, p14    AKT3        activates     11156964
Extract from the paper all the sentences containing Protein 1 and Protein 2
…
Label them with the interaction given in the database: "activates"
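The distant-labeling step can be sketched as follows; the database row and sentences are invented for illustration:

```python
# Hypothetical database row and paper sentences: any sentence mentioning
# both proteins inherits the interaction label from the database.
row = {"protein1": "Tat", "protein2": "AKT3",
       "interaction": "activates", "paper_id": "11156964"}
paper_sentences = [
    "Tat was expressed in all cell lines.",
    "We found that Tat strongly activates AKT3 in vivo.",
    "These results are consistent with earlier reports.",
]

def label_sentences(row, sentences):
    # Keep only sentences containing both protein names, labeled with the
    # database interaction (simple substring match; real protein-name
    # matching needs to handle synonyms and variants).
    return [(s, row["interaction"]) for s in sentences
            if row["protein1"] in s and row["protein2"] in s]

labeled = label_sentences(row, paper_sentences)
```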
54
Protein-Protein interactions
Use citations: find all the papers that cite the papers in the database
Protein 1   Protein 2   Interaction   Paper ID
Tat, p14    AKT3        activates     11156964
Citing papers: ID 9918876, ID 9971769
55
Protein-Protein interactions
From the citing papers, extract the citation sentences; from these, extract the sentences containing Protein 1 and Protein 2, and label them
Protein 1   Protein 2   Interaction   Paper ID
Tat, p14    AKT3        activates     11156964
Citing papers: ID 9918876, ID 9971769, labeled "activates"
56
Examples of sentences
Papers: "The interpretation of these results was slightly complicated by the fact that AIP-1/ALIX depletion by using siRNA likely had deleterious effects on cell viability, because a Western blot analysis showed slightly reduced Gag expression at later time points (Fig. 5C)."
Citations: "They also demonstrate that the GAG protein from membrane-containing viruses, such as HIV, binds to Alix/AIP1, thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface (TARGET_CITATION; CITATION)."
57
10 Interaction types
58
Protein-Protein interactions
Tasks: given sentences from paper ID and/or citation sentences to ID:
Predict the interaction type given in the HIV database for paper ID
Extract the proteins involved
A 10-way classification problem
59
Protein-Protein interactions
Models:
Dynamic graphical model
Naïve Bayes
60
Graphical Models
61
Evaluation
Evaluation at document level:
All (sentences from papers + citations)
Papers (only sentences from papers)
Citations (only citation sentences)
"Trigger word" approach:
List of keywords (e.g., for inhibits: "inhibitor", "inhibition", "inhibit", etc.)
If a keyword is present, assign the corresponding interaction
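A sketch of the trigger-word baseline, with hypothetical keyword lists (the actual lists used in the talk are not shown):

```python
# Hypothetical keyword lists: if a keyword for an interaction appears in
# the document, assign that interaction; otherwise back off to the most
# generic interaction.
TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],
    "activates": ["activates", "activation", "activate"],
    "binds": ["binds", "binding"],
}

def trigger_word_baseline(text, default="interacts with"):
    text = text.lower()
    for interaction, keywords in TRIGGERS.items():
        if any(k in text for k in keywords):
            return interaction
    return default  # back-off for documents with no trigger word

label = trigger_word_baseline("Tat binds to AKT3 in infected cells")
```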
62
Results
Accuracies on interaction classification (roles hidden)

Model              All    Papers  Citations
Markov Model       60.5   57.8    53.4
Naïve Bayes        58.1   57.8    55.7
Baselines:
Most freq. inter.  21.8   11.1    26.1
TriggerW           20.1   24.4    20.4
TriggerW + BO      25.8   40.0    26.1
63
Results: confusion matrix
For All. Overall accuracy: 60.5%
64
Hiding the protein names
Replaced protein names with the token PROT_NAME:
"Selective CXCR4 antagonism by Tat" becomes "Selective PROT_NAME antagonism by PROT_NAME"
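The masking step can be sketched with a simple regex substitution (the list of protein names is assumed to be known, e.g. from the database):

```python
import re

def hide_protein_names(sentence, protein_names):
    # Replace each known protein name with a generic token so the
    # classifier cannot rely on memorizing specific proteins.
    # Longest names first, so substrings of longer names don't match early.
    for name in sorted(protein_names, key=len, reverse=True):
        sentence = re.sub(re.escape(name), "PROT_NAME", sentence)
    return sentence

masked = hide_protein_names("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"])
```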
65
Results with no protein names
Model         Papers          Citations
Markov Model  44.4 (-23.1%)   52.3 (-2.0%)
Naïve Bayes   46.7 (-19.2%)   53.4 (-4.1%)
66
Protein extraction
(Protein name tagging, role extraction)
Identify all the proteins in the sentence that are involved in the interaction:
"These results suggest that Tat-induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the +36 position, at which time CDK7 has been released from the complex."
"Tat might regulate the phosphorylation of the RNA polymerase II carboxyl-terminal domain in pre-initiation complexes by activating CDK7."
67
Protein extraction: results
           Recall  Precision  F-measure
All        0.74    0.85       0.79
Papers     0.56    0.83       0.67
Citations  0.75    0.84       0.79
No dictionary used
68
Conclusions of protein-protein interaction project
Encouraging results for the automatic classification of protein-protein interactions
Use of an existing database for gathering labeled data
Use of citations
69
Noun compounds (NCs)
Any sequence of nouns that itself functions as a noun:
asthma hospitalizations, asthma hospitalization rates, health care personnel hand wash
Technical text is rich with NCs:
"Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment."
70
NCs: 3 computational tasks
Identification
Syntactic analysis (attachments):
[Baseline [headache frequency]]
[[Tension headache] patient]
Semantic analysis:
headache treatment = treatment for headache
corticosteroid treatment = treatment that uses corticosteroid
71
Two approaches
Treat it as a classification problem (and use a machine learning algorithm)
Linguistically motivated: consider the "semantics" of the nouns, which determines the relations between them
72
Second approach
Linguistic motivation: the head noun has argument structure
The meaning of the head noun determines what kinds of things can be done to it, what it is made of, what it is a part of, …
73
Linguistic Motivation
Material + Cutlery → Made of: steel knife, plastic fork, wooden spoon
Food + Cutlery → Used on: meat knife, dessert spoon, salad fork
Profession + Cutlery → Used by: chef's knife, butcher's knife
74
Linguistic Motivation
Hypothesis: a particular semantic relation holds between all 2-word NCs that can be categorized by a MeSH pair
Use the classes of MeSH to identify semantic relations
75
Grouping the NCs
A02 C04 (Musculoskeletal System, Neoplasms): skull tumors, bone cysts, bone metastases, skull osteosarcoma, …
B06 B06 (Plants, Plants): eucalyptus trees, apple fruits, rice grains, potato plants
A01 M01 (Body Regions, Persons): shoulder patient, eye physician, eye donor
Too different: need to be more specific, go down the hierarchy:
A01 M01.643 (Body Regions, Patients): shoulder patient
A01 M01.526 (Body Regions, Occupational Groups): eye physician, chest physicians
76
Classification Decisions + Relations
A02 C04: Location of Disease
B06 B06: Kind of Plants
C04 M01, split further:
  C04 M01.643: Person afflicted by Disease
  C04 M01.526: Person who treats Disease
A01 H01, split further: A01 H01.770, A01 H01.671, A01 H01.671.538, A01 H01.671.868
A01 M01, split further:
  A01 M01.643: Person afflicted by Disease
  A01 M01.526: Specialist of
  A01 M01.898: Donor of
77
Evaluation
Accuracy by category: Anatomy 91%, Natural Science 79%, Neoplasm 100%
Total accuracy: 90.8%
78
Conclusion of NCs
Problem of assigning semantic relations to two-word technical NCs
An important problem: NCs are frequent in technical text
Especially difficult given the lack of syntactic clues
State-of-the-art results
One of very few working systems to tackle this task for NCs
79
Conclusion
Machine learning methods for NLP tasks
Three lines of research in this area, with state-of-the-art results:
Information and relation extraction for "treatments" and "diseases"
Protein-protein interactions
Noun compounds
81
Future work
Unsupervised and semi-supervised methods
Reasoning (knowledge representation and inference procedures)
Huge amounts of textual data (the Web)
Connecting several databases and/or text collections to link different pieces of information
System architecture to support multiple layers of annotation on text
Development of an effective interface
Additional slides on IE
83
Related work
Several DIFFERENT relations between the same types of entities; thus differs from the problem statement of other work on relations
Many find a single relation that holds between two entities (many based on ACE)
84
Related work (cont.)
Agichtein and Gravano (2000): lexical patterns for location-of
Zelenko et al. (2002): SVM for person-affiliation and organization-location
Hasegawa et al. (ACL 2004): Person-Organization yields a "President" relation
Craven (1999, 2001): HMM for subcellular-location and disorder-association; doesn't identify the actual relation
85
Related work: Bioscience
Many hand-built rules: Feldman et al. (2002), Friedman et al. (2001), Pustejovsky et al. (2002), Saric et al. (2004)
86
Our D1
Thompson et al. (2003): frame classification and role labeling for FrameNet sentences; the target word must be observed
More relations and roles
87
Smoothing: absolute discounting
Lower the probability of seen events by subtracting a constant δ from their count
The remaining probability mass is evenly divided among the unseen events

P_ML(e) = c(e) / Σ_e' c(e')
P_ad(e) = (c(e) − δ) / Σ_e' c(e')   if c(e) > 0
P_ad(e) = δ · |seen events| / (U · Σ_e' c(e'))   if c(e) = 0
where U is the number of unseen events
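A sketch of absolute discounting as described above, with a hypothetical discount δ = 0.5 (δ must be smaller than the smallest observed count):

```python
from collections import Counter

def absolute_discounting(counts, vocab, delta=0.5):
    # Subtract delta from each seen count, then spread the freed
    # probability mass evenly over the unseen events in `vocab`.
    n = sum(counts.values())
    seen = set(counts)
    unseen = [e for e in vocab if e not in seen]
    probs = {e: (counts[e] - delta) / n for e in seen}
    if unseen:
        freed = delta * len(seen) / n  # total mass taken from seen events
        for e in unseen:
            probs[e] = freed / len(unseen)
    return probs

p = absolute_discounting(Counter({"a": 3, "b": 1}), vocab={"a", "b", "c", "d"})
```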
88
F-measures for role extraction as a function of the smoothing factor
89
Relation accuracies as a function of the smoothing factor
90
Relation classification: Confusion Matrix
Computed for “rel + irrel.”, “only features”
91
Proteins: sentence-level evaluation
Total accuracy: 38.9% (49.4% without the generic interacts with)
92
Learning with the hand-labeled sentences

Model              All    Papers  Citations
Markov Model       48.9   28.9    47.9
Naïve Bayes        47.1   33.3    53.4
Baselines:
Most freq. inter.  36.3   34.4    37.6
TriggerW           30.5   18.9    38.3
TriggerW + BO      46.2   36.6    52.6
93
Learning with the hand-labeled sentences
94
Q & A system
Q: What are the treatments for cervical carcinoma?
A: Stage Ib and IIa cervical carcinoma can be cured by radical surgery or radiotherapy
95
Q & A system
Q: What are the methods of administration of headache treatment?
A: intranasal migraine treatment
96
Evaluation
Mj: For each triple, for each sentence of the triple, find the interaction that maximizes the posterior probability of the interaction given the features; then assign to all sentences of the triple the most frequent interaction among those predicted for the individual sentences.
Mj*: Same as Mj, except that if the predicted interaction is the generic interacts with, choose instead the next most frequent interaction (retain interacts with only if it is the only interaction predicted).
97
Evaluation
Cf: Retain all the conditional probabilities (i.e., don't first choose an interaction per sentence), then for each triple choose the interaction that maximizes the sum over all the sentences of the triple.
Cf*: Same as Cf, substituting interacts with with the next most confident interaction.
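The Mj and Cf aggregation strategies can be sketched as follows; the per-sentence predictions and posteriors below are invented for illustration:

```python
from collections import Counter

def aggregate_mj(sentence_predictions):
    # Mj: take the hard prediction for each sentence, then assign the
    # triple the most frequent per-sentence prediction.
    return Counter(sentence_predictions).most_common(1)[0][0]

def aggregate_cf(sentence_posteriors):
    # Cf: keep the full posterior per sentence, sum each interaction's
    # probability over all sentences of the triple, and pick the
    # interaction with the largest total.
    totals = Counter()
    for posterior in sentence_posteriors:
        totals.update(posterior)
    return totals.most_common(1)[0][0]

mj = aggregate_mj(["binds", "activates", "binds"])
cf = aggregate_cf([{"binds": 0.6, "activates": 0.4},
                   {"binds": 0.2, "activates": 0.8}])
```

Note how the two strategies can disagree: Mj discards per-sentence confidence, while Cf lets one high-confidence sentence outweigh several uncertain ones.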