Recognition of Multi-sentence n-ary Subcellular Localization
Mentions in Biomedical Abstracts
G. Melli, M. Ester, A. Sarkar
Dec. 6, 2007
http://www.gabormelli.com/2007/2007_MultiNaryBio_Melli_Presentation.ppt
Introduction
• We propose a method for detecting n-ary relations that may span multiple sentences
• Motivation is to support the semi-automated population of subcellular localizations in db.psort.org.
  – Organism / Protein / Location
• We cast each document as a text graph and use machine learning to detect patterns in the graph.
Is there an SCL in this text?
• Yes: (V. cholerae, TcpC, outer membrane)
• Current algorithms are restricted to the detection of binary relations within one sentence: (TcpC, outer membrane).
Here is the relevant passage
“The pilus(location) of V. cholerae(organism) is essential for intestinal colonization.
The pilus(location) biogenesis apparatus is composed of nine proteins.
TcpC(protein) is an outer membrane(location) lipoprotein required for pilus(location) biogenesis.”
Challenge #1
• A significant number of the relation cases (~40%) span multiple sentences.
• Proposed solution:
  – Create a text graph for the entire document.
  – The graph can contain a superset of the information used by the current binary-relation, single-sentence approaches (Jiang and Zhai, 2007; Zhou et al., 2007).
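The document-level graph idea can be illustrated with a small Python sketch. It is not the authors' implementation: word-adjacency edges and exact-string matching stand in for the parse-tree and coreference edges, and the names `build_text_graph`, `add_coreference_edges`, and `shortest_path_length` are hypothetical. It only shows how cross-sentence links shorten the path between mentions in different sentences.

```python
from collections import deque

# Example passage, tokenized by hand; multi-word entity mentions are pre-joined.
SENTENCES = [
    ["The", "pilus", "of", "V._cholerae", "is", "essential", "for",
     "intestinal", "colonization", "."],
    ["The", "pilus", "biogenesis", "apparatus", "is", "composed", "of",
     "nine", "proteins", "."],
    ["TcpC", "is", "an", "outer_membrane", "lipoprotein", "required", "for",
     "pilus", "biogenesis", "."],
]

def _link(edges, a, b):
    """Add an undirected edge between nodes a and b."""
    edges.setdefault(a, set()).add(b)
    edges.setdefault(b, set()).add(a)

def build_text_graph(sentences):
    """Nodes are (sentence, token) positions; edges join adjacent tokens
    and the last token of each sentence to the first token of the next."""
    nodes, edges = {}, {}
    for s, sent in enumerate(sentences):
        for t, tok in enumerate(sent):
            nodes[(s, t)] = tok
            edges.setdefault((s, t), set())
            if t > 0:
                _link(edges, (s, t - 1), (s, t))
        if s > 0:
            _link(edges, (s - 1, len(sentences[s - 1]) - 1), (s, 0))
    return nodes, edges

def add_coreference_edges(nodes, edges, mention):
    """Assumption: identical strings corefer. Links consecutive mentions."""
    positions = sorted(n for n, tok in nodes.items() if tok == mention)
    for a, b in zip(positions, positions[1:]):
        _link(edges, a, b)

def shortest_path_length(edges, start, goal):
    """Breadth-first search; returns the number of edges on the shortest path."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Calling `build_text_graph(SENTENCES)` and then `add_coreference_edges(nodes, edges, "pilus")` adds links between the three pilus mentions, so the path from TcpC at `(2, 0)` to V. cholerae at `(0, 3)` no longer has to walk through every intervening token.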
Automated Markup
[Figure: the example passage with its entity mentions tagged ORG, PROT, and LOC (e.g. pilus tagged LOC).]
Syntactic analysis
1. End of sentence
2. Part-of-speech
3. Parse tree
Semantic analysis
1. Named-entity recognition
2. Coreference resolution
A Single Relation Case
[Figure: text graph for a single relation case. Parse-tree nodes (S, NP, VP, PP) and part-of-speech tags (NNP, JJ, AUX, DT, IN, VBN) connect the entity mentions TcpC (PROTEIN), outer membrane (LOCATION), V. cholerae (ORGANISM), and pilus (LOCATION).]
Challenge #2
• An n-ary Relation
  – The task involves three entity mentions: Organism, Protein, Subcellular Loc.
  – Current approaches designed for detecting mentions with two entities.
• Proposed solution
  – Create a feature vector that contains the information for three pairings
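One way to read the proposed solution is sketched below in Python: compute the same pairwise features for each of the three pairings and concatenate them into one vector per relation case. The two pairwise features (sentence distance, token distance), the ordering of the pairings, and the names `pair_features` and `relation_case_vector` are assumptions for illustration, not the feature set used in the talk.

```python
def pair_features(m1, m2):
    """Hypothetical pairwise features for two entity mentions.

    Each mention is a dict with a sentence index ("sent") and a
    document-level token offset ("tok")."""
    return [abs(m1["sent"] - m2["sent"]), abs(m1["tok"] - m2["tok"])]

def relation_case_vector(org, prot, loc):
    """Concatenate pairwise features for the three pairings of a 3-ary case.

    Assumed order: Organism-Protein, Protein-Location, Organism-Location."""
    vector = []
    for a, b in [(org, prot), (prot, loc), (org, loc)]:
        vector.extend(pair_features(a, b))
    return vector

# Mentions from the example passage, with hand-assigned positions.
v_cholerae = {"text": "V. cholerae", "sent": 0, "tok": 3}
tcpc = {"text": "TcpC", "sent": 2, "tok": 20}
outer_membrane = {"text": "outer membrane", "sent": 2, "tok": 23}

vector = relation_case_vector(v_cholerae, tcpc, outer_membrane)
```

Because every pairing contributes the same fixed block of features, an existing binary-relation feature extractor could in principle be reused three times per case.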
3-ary Relation Feature Vector
[Table: each row is one relation case in a document, identified by its Organism (O), Protein (P), and Location (L) entity mentions. The feature space holds intra-entity features and subtree features for the three pairings (Organism - Protein, Protein - Location, Organism - Location). The final column is the label: T, F, or ? for an unlabeled case.]
“The pilus(location) of V. cholerae(organism) is essential for intestinal colonization.
The pilus(location) biogenesis apparatus is composed of nine proteins.
TcpC(protein) is an outer membrane(location) lipoprotein required for pilus(location) biogenesis.”
PPLRE v1.4 Data Set
• 540 true and 4,769 false curated relation cases drawn from 843 research paper abstracts.
• 267 of the 540 true relation cases (~49%) span multiple sentences.
• Data available at koch.pathogenomics.ca/pplre/
Performance Results
• Tested against two baselines that were tuned to this task: YSRL and Zparser.
• TeGRR achieved the highest F-score (by significantly increasing the Recall).
| | P | R | F |
|---|---|---|---|
| TeGRR | 18.0% | 47.5% | 26.1% |
| YSRL | 29.1% | 13.3% | 18.3% |
| Zparser | 63.5% | 9.3% | 16.1% |
| All True | 8.3% | 75.6% | 14.9% |

5-fold cross-validated
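As a quick consistency check, the reported F-scores can be recomputed from the reported Precision and Recall, assuming F is the balanced F1 (harmonic mean). The sketch below uses only the numbers in the results table; the function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F), all in percent, copied from the results table.
REPORTED = {
    "TeGRR": (18.0, 47.5, 26.1),
    "YSRL": (29.1, 13.3, 18.3),
    "Zparser": (63.5, 9.3, 16.1),
    "All True": (8.3, 75.6, 14.9),
}

for name, (p, r, f) in REPORTED.items():
    print(f"{name}: recomputed F = {f_score(p, r):.1f}%, reported F = {f}%")
```

The recomputed values agree with the reported ones to within about 0.1 percentage points, which is what rounding of the P and R inputs would produce.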
Research Directions
1. Actively grow the PSORTdb curated set
2. Qualifying the Certainty of a Case
   E.g. label cases with: “experiment”, “hypothesized”, “assumed”, and “False”.
3. Ontology constrained predictions
   E.g. Gram-positive bacteria do not have a periplasm, therefore do not predict periplasm.
4. Application to other tasks
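The ontology-constrained direction (item 3) could look roughly like the Python sketch below: a post-processing step that overrides any positive prediction whose location is impossible for the organism. The lookup tables, the protein name `ProtA`, and the function name `apply_ontology_constraints` are hypothetical; a real system would draw this knowledge from a curated ontology.

```python
# Illustrative background knowledge (assumed, not taken from a real ontology file).
GRAM_STAIN = {
    "M. tuberculosis": "gram-positive",  # classification as used in this talk
    "V. cholerae": "gram-negative",
}
IMPOSSIBLE_LOCATIONS = {
    "gram-positive": {"periplasm"},
}

def apply_ontology_constraints(predictions):
    """Set the prediction to False when the location cannot occur in the organism.

    predictions: list of (organism, protein, location, predicted_label)."""
    constrained = []
    for organism, protein, location, predicted in predictions:
        gram = GRAM_STAIN.get(organism)
        if location in IMPOSSIBLE_LOCATIONS.get(gram, set()):
            predicted = False
        constrained.append((organism, protein, location, predicted))
    return constrained

raw = [
    ("M. tuberculosis", "ProtA", "periplasm", True),   # ProtA is a made-up name
    ("V. cholerae", "TcpC", "outer membrane", True),
]
filtered = apply_ontology_constraints(raw)
```

Applying the constraint after classification keeps the classifier unchanged; the same knowledge could alternatively be exposed to the classifier as a feature.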
Extra Slides for Questions
Shortened Reference List
M. Craven and J. Kumlien. (1999). Constructing Biological Knowledge-bases by Extracting Information from Text Sources. In Proc. of the International Conference on Intelligent Systems for Molec. Bio.
K. Fundel, R. Kuffner, and R. Zimmer. (2007). RelEx--Relation Extraction Using Dependency Parse Trees. Bioinformatics. 23(3).
J. Jiang and C. Zhai. (2007). A Systematic Exploration of the Feature Space for Relation Extraction. In Proc. of NAACL/HLT-2007.
Y. Liu, Z. Shi, and A. Sarkar. (2007). Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. In Proc. of NAACL/HLT-2007.
Z. Shi, A. Sarkar, and F. Popowich. (2007). Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. In Proc. of NAACL/HLT-2007.
M. Skounakis, M. Craven, and S. Ray. (2003). Hierarchical Hidden Markov Models for Information Extraction. In Proc. of IJCAI-2003.
M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction using a Convolution Tree Kernel. In Proc. of NAACL/HLT-2006.
Pipelined Process Framework
Training Phase: Training Documents → Natural Language Processing → Relation Case Generation → Unlabeled Rel. Cases → Case Labeling → Labeled RCs → Feature Generation → Model Induction → Classifier
Testing Phase: Testing Documents → Natural Language Processing → Relation Case Generation → Feature Generation → Prediction Generation → Predictions
Relation Case Generation
• Input: (D, R): A text document D and a set of semantic relations R with a arguments.
• Output: (C): A set of unlabelled semantic relation cases.
• Method:
  • Identify all e entity mentions Ei in D.
  • Create every combination of a entity mentions from the e mentions in the document (without replacement).
    – For intrasentential semantic relation detection and classification tasks, limit the entity mentions to be from the same sentence.
    – For typed semantic relation detection and classification tasks, limit the combinations to those where there is a match between the semantic classes of each of the entity mentions Ei and the semantic class of their corresponding relation argument Ai.
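The method above can be sketched in Python for the typed case. This is an illustrative reading of the steps, not the authors' code: the mention dictionaries, the `same_sentence_only` flag, and the function name `generate_relation_cases` are assumptions.

```python
from itertools import product

def generate_relation_cases(mentions, arg_types, same_sentence_only=False):
    """Return every type-compatible combination of entity mentions.

    mentions: list of dicts with "text", "type", and "sent" keys.
    arg_types: semantic class of each relation argument, e.g. ("ORG", "PROT", "LOC").
    """
    # Typed restriction: argument i may only be filled by mentions of type arg_types[i].
    candidates = [[m for m in mentions if m["type"] == t] for t in arg_types]
    cases = []
    for combo in product(*candidates):
        # Without replacement: one mention cannot fill two arguments.
        if len({id(m) for m in combo}) < len(combo):
            continue
        # Intrasentential restriction: all mentions must share a sentence.
        if same_sentence_only and len({m["sent"] for m in combo}) > 1:
            continue
        cases.append(combo)
    return cases

# Entity mentions from the example passage.
MENTIONS = [
    {"text": "pilus", "type": "LOC", "sent": 0},
    {"text": "V. cholerae", "type": "ORG", "sent": 0},
    {"text": "pilus", "type": "LOC", "sent": 1},
    {"text": "TcpC", "type": "PROT", "sent": 2},
    {"text": "outer membrane", "type": "LOC", "sent": 2},
    {"text": "pilus", "type": "LOC", "sent": 2},
]

all_cases = generate_relation_cases(MENTIONS, ("ORG", "PROT", "LOC"))
single_sentence = generate_relation_cases(MENTIONS, ("ORG", "PROT", "LOC"),
                                          same_sentence_only=True)
```

On the example passage this yields four candidate cases (one organism, one protein, four location mentions), and none when the same-sentence restriction is switched on, since the organism and protein mentions sit in different sentences.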
Relation Case Labeling
Naïve Baseline Algorithms
• Predict True: Always predicts “True” regardless of the contents of the relation case.
  – Attains the maximum Recall by any algorithm on the task.
  – Attains the maximum F1 by any naïve algorithm.
  – Most commonly used naïve baseline.
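A minimal Python sketch of this baseline on a toy label set is given below. The labels are invented for illustration and the function names are hypothetical; the sketch assumes at least one gold label is True so that the divisions are defined.

```python
def predict_true(relation_cases):
    """Naive baseline: label every relation case True."""
    return [True for _ in relation_cases]

def baseline_scores(gold_labels):
    """Precision, recall, and F1 of the Predict True baseline.

    Assumes gold_labels contains at least one True."""
    predictions = predict_true(gold_labels)
    pairs = list(zip(predictions, gold_labels))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy gold labels: 2 true cases out of 10 (invented numbers).
toy_gold = [True, False, False, False, True, False, False, False, False, False]
precision, recall, f1 = baseline_scores(toy_gold)
```

On a single labeled set the baseline never produces a false negative, so its recall is 1 and its precision equals the share of true cases.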
Prediction Outcome Labels
• true positive (tp)
  – predicted to have the label True and whose label is indeed True.
• false positive (fp)
  – predicted to have the label True but whose label is instead False.
• true negative (tn)
  – predicted to have the label False and whose label is indeed False.
• false negative (fn)
  – predicted to have the label False and whose label is instead True.
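The four outcome labels translate directly into a small tallying routine. The Python sketch below uses invented predictions and labels; `outcome` and `tally_outcomes` are illustrative names.

```python
from collections import Counter

def outcome(predicted, actual):
    """Map one (predicted, actual) label pair to tp, fp, tn, or fn."""
    if predicted and actual:
        return "tp"
    if predicted and not actual:
        return "fp"
    if not predicted and not actual:
        return "tn"
    return "fn"

def tally_outcomes(predictions, gold_labels):
    """Count how often each outcome label occurs."""
    return Counter(outcome(p, g) for p, g in zip(predictions, gold_labels))

# Invented example: five relation cases.
counts = tally_outcomes(
    [True, True, False, False, True],   # predicted labels
    [True, False, False, True, True],   # curated labels
)
```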
Performance Metrics
• Precision (P): probability that a test case that is predicted to have label True is tp.
• Recall (R): probability that a True test case will be tp.
• F-measure (F1): Harmonic mean of the Precision and Recall estimates.

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$$
Token-based Features
Example: “Protein1 is a Location1 ...”
• Token Distance
  – 2 intervening tokens
  tok. dist.: 2
• Token Sequence(s)
  – Unigrams
    the | of | . | , | and | in | a | … | pyelonephritis
    0 | 0 | 0 | 0 | 0 | 0 | 1 | … | 0
  – Bigrams
    of the | and the | is the | is a | … | causes pyelonephritis
    0 | 0 | 0 | 1 | … | 0
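The token-based features for the example “Protein1 is a Location1” can be reproduced with a short Python sketch. The vocabularies are truncated to the entries visible on the slide, and the function names are illustrative rather than taken from the system.

```python
def tokens_between(tokens, i, j):
    """Tokens strictly between positions i and j."""
    lo, hi = sorted((i, j))
    return tokens[lo + 1:hi]

def token_features(tokens, i, j, unigram_vocab, bigram_vocab):
    """Token distance plus binary unigram and bigram indicator vectors."""
    middle = tokens_between(tokens, i, j)
    unigrams = set(middle)
    bigrams = {" ".join(pair) for pair in zip(middle, middle[1:])}
    return {
        "tok_dist": len(middle),
        "unigrams": [int(u in unigrams) for u in unigram_vocab],
        "bigrams": [int(b in bigrams) for b in bigram_vocab],
    }

# Vocabulary entries shown on the slide (the full vocabularies are elided there).
UNIGRAM_VOCAB = ["the", "of", ".", ",", "and", "in", "a"]
BIGRAM_VOCAB = ["of the", "and the", "is the", "is a"]

features = token_features(
    ["Protein1", "is", "a", "Location1"], 0, 3, UNIGRAM_VOCAB, BIGRAM_VOCAB
)
```

With these vocabularies the sketch gives a token distance of 2, a unigram vector with a single 1 under “a”, and a bigram vector with a single 1 under “is a”. The token “is” also occurs but has no column among the visible entries.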
Token-based Features (cont.)
• Token Part-of-Speech Role Sequences
  – Single tags
    NN | IN | JJ | DT | COMMA | … | WP
    0 | 0 | 0 | 1 | 0 | … | 0
  – Tag pairs
    DT IN | IN NN | JJ DT | NN JJ | AUX DT | … | AUX RBS
    0 0 0 1 … 0
Additional Features/Knowledge
• Expose additional features that can identify the more esoteric ways of expressing a relation.
• Features from outside of the “shortest-path”.
  – Challenge: past open-ended attempts have reduced performance (Jiang and Zhai, 2007).
  – (Zhou et al., 2007) add heuristics for five common situations.
• Use domain-specific background knowledge.
  – E.g. Gram-positive bacteria (such as M. tuberculosis) do not have a periplasm, therefore do not predict periplasm.
Challenge: Qualifying the Certainty of a Relation Case
• It would be useful to qualify the certainty that can be assigned to a relation mention.
• E.g. In the news domain, distinguish relation mentions based on first hand information versus those based on hearsay.
• Idea: Add an additional label to each relation case that qualifies the certainty of the statement. E.g. in the PPLRE task label cases with: “directly validated”, “indirectly validated”, “hypothesized”, and “assumed”.