A Probabilistic Term Variant Generator for Biomedical Terms
Yoshimasa Tsuruoka and Jun’ichi Tsujii
CREST, JST The University of Tokyo
Outline
Probabilistic Term Variant Generator
Generation Algorithm
Application: Dictionary expansion
Background
Information extraction from biomedical documents
Recognizing technical terms (e.g. DNA, protein names)
We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated…
Technical Term Recognition
Machine learning based: identifying the regions of terms ⇒ No ID information
Dictionary-based: comparing the strings with each entry in the dictionary ⇒ ID information
Problems of Dictionary-based approaches
Spelling variation degrades recall ⇒ Approximate string searching
False positives degrade precision ⇒ Filtering by machine learning
Exact String Searching
Example text: Phorbol myristate acetate induced Egr-1 mRNA…
Dictionary: EGP, EGR-1, EGR-1 binding protein, …
⇒ None of the entries matches exactly
Edit Distance
Defines the distance between two strings as the cost of the sequence of operations that turns one into the other. Three kinds of operations are allowed: substitution, insertion, deletion.
Ex.) board → abord: cost = 2 (delete ‘a’ and insert ‘a’)
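The definition above can be sketched as the standard dynamic-programming (Levenshtein) computation. This is a minimal illustration, assuming a unit cost for each of the three operations, which is consistent with the cost of 2 in the example; the function name `edit_distance` is only illustrative.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit cost per substitution, insertion, deletion."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters turns source[:i] into ""
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # 2
```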
Automatic Generation of Spelling Variants
NF-Kappa B ⇒ Variant Generator ⇒
NF-Kappa B (1.0)
NF Kappa B (0.9)
NF kappa B (0.6)
NF kappaB (0.5)
NFkappaB (0.3)
:
Each generated variant is associated with its generation probability
Generation Algorithm
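T cell (1.0) ⇒ T-cell (0.5), operation probability 0.5
T cell (1.0) ⇒ T cells (0.2), operation probability 0.2
T-cell (0.5) ⇒ T-cells (0.1), operation probability 0.2
Recursive generation: $P = P' \times P_{\mathrm{op}}$, where $P'$ is the generation probability of the variant being rewritten and $P_{\mathrm{op}}$ is the probability of the applied operation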
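The recursion on this slide can be sketched as a best-first search over variants. This is an illustration under stated assumptions, not the authors' implementation: the `Rule` layout, the `^`/`$` boundary markers, the two toy rules, and the default `threshold` and `max_variants` values are all hypothetical, chosen so that the output reproduces the T cell example.

```python
import heapq
from typing import Iterator, List, NamedTuple, Tuple


class Rule(NamedTuple):
    left: str         # required left context ("" = any)
    target: str       # substring to rewrite ("" = pure insertion)
    right: str        # required right context ("" = any, "$" = end of string)
    replacement: str  # text that replaces the target
    prob: float       # operation probability P_op


def apply_rule(term: str, rule: Rule) -> Iterator[str]:
    """Yield each string obtained by applying `rule` once to `term`."""
    padded = "^" + term + "$"  # hypothetical start/end-of-string markers
    pattern = rule.left + rule.target + rule.right  # assumed non-empty
    start = padded.find(pattern)
    while start != -1:
        i = start + len(rule.left)
        j = i + len(rule.target)
        yield (padded[:i] + rule.replacement + padded[j:])[1:-1]
        start = padded.find(pattern, start + 1)


def generate_variants(term: str, rules: List[Rule], threshold: float = 0.05,
                      max_variants: int = 20) -> List[Tuple[str, float]]:
    """Best-first expansion: each child gets P = P' * P_op."""
    heap = [(-1.0, term)]  # max-heap via negated probabilities
    seen = {}
    while heap and len(seen) < max_variants:
        neg_p, current = heapq.heappop(heap)
        if current in seen:
            continue  # already reached with an equal or higher probability
        seen[current] = -neg_p
        for rule in rules:
            p = -neg_p * rule.prob
            if p < threshold:
                continue
            for variant in apply_rule(current, rule):
                if variant not in seen:
                    heapq.heappush(heap, (-p, variant))
    return list(seen.items())


# Toy rules (assumed): space -> hyphen, and insert 's' after a final "ll".
RULES = [Rule("", " ", "", "-", 0.5), Rule("ll", "", "$", "s", 0.2)]

for variant, p in generate_variants("T cell", RULES):
    print(f"{variant} ({p:.1f})")
```

Because operation probabilities never exceed 1, the first time a variant is popped from the heap it carries its highest probability, so later duplicates can be skipped.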
Collecting Examples of Spelling Variation
Abbreviation extraction (Schwartz 2003): extracts short form and long form pairs
Short form | Long form
AA         | Alcoholic Anonymous
           | American
           | Americans
           | Arachidonic acid
           | arachidonic acid
           | anaemia
           | anemia
:          | :
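The slide does not say how spelling-variant pairs are picked out of the long forms that share a short form. One plausible way, shown here purely as an assumption, is to pair long forms that are nearly identical as strings. The sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the function name and the `min_ratio` cutoff of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def candidate_variant_pairs(long_forms: List[str],
                            min_ratio: float = 0.8) -> List[Tuple[str, str]]:
    """Pair long forms whose similarity ratio is at least `min_ratio` (assumed cutoff)."""
    pairs = []
    for a, b in combinations(long_forms, 2):
        if SequenceMatcher(None, a, b).ratio() >= min_ratio:
            pairs.append((a, b))
    return pairs


# Long forms listed for the short form "AA" on the slide.
long_forms = [
    "Alcoholic Anonymous", "American", "Americans",
    "Arachidonic acid", "arachidonic acid", "anaemia", "anemia",
]
for a, b in candidate_variant_pairs(long_forms):
    print(a, "<->", b)
```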
Learning Operation Rules
Operations for generating variants: substitution, deletion, insertion
Context: character-level context, the preceding (following) two characters
Operation probability:
$$P(\mathrm{operation} \mid \mathrm{context}) = \frac{f(\mathrm{context}, \mathrm{operation})}{f(\mathrm{context})}$$
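The operation probability is a ratio of two counts, which can be sketched as below. This assumes that f denotes a frequency in the collected examples and that an observation with no operation is recorded as `None`; the data layout and the toy observations are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional, Tuple

Observation = Tuple[Hashable, Optional[str]]


def estimate_operation_probabilities(
        observations: Iterable[Observation]) -> Dict[Tuple[Hashable, str], float]:
    """Return P(operation | context) = f(context, operation) / f(context)."""
    context_counts = Counter()  # f(context)
    pair_counts = Counter()     # f(context, operation)
    for context, operation in observations:
        context_counts[context] += 1
        if operation is not None:  # None = context seen, nothing applied
            pair_counts[(context, operation)] += 1
    return {
        (context, operation): count / context_counts[context]
        for (context, operation), count in pair_counts.items()
    }


# Toy data (assumed): the context "ph" + end of string is seen four times,
# and 'y' is inserted in three of those cases.
context = ("ph", "", "END")
observations = [(context, "insert y")] * 3 + [(context, None)]
print(estimate_operation_probabilities(observations))
```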
Probabilistic Rules
Probability | Left-context    | Target | Right-context | Operation
0.96        | *               | ‘      | End of String | Delete
0.96        | Start of String | I      | m             | Replace ‘I’ with ‘i’
0.95        | *               | H      | yd            | Replace ‘H’ with ‘h’
:           | :               | :      | :             | :
0.75        | ph              |        | End of String | Insert ‘y’
:           | :               | :      | :             | :
Example (1)
Generation probability | Generated variant | Frequency
1.0 (input)            | NF-kappa B        | 857
0.417                  | NF-kappaB         | 692
0.417                  | nF-kappa B        | 0
0.337                  | Nf-kappa B        | 0
0.275                  | NF kappa B        | 25
0.226                  | NF-kappa b        | 0
:                      | :                 | :
Example (2)
Generation probability | Generated variant         | Frequency
1.0 (input)            | antiinflammatory effect   | 7
0.462                  | anti-inflammatory effect  | 33
0.393                  | antiinflammatory effects  | 6
0.356                  | Antiinflammatory effect   | 0
0.286                  | antiinflammatory-effect   | 0
0.181                  | anti-inflammatory effects | 23
:                      | :                         | :
Example (3)
Generation probability | Generated variant            | Frequency
1.0 (input)            | tumour necrosis factor alpha | 15
0.492                  | tumor necrosis factor alpha  | 126
0.356                  | tumour necrosis factor-alpha | 30
0.235                  | Tumour necrosis factor alpha | 2
0.175                  | tumor necrosis factor-alpha  | 182
0.115                  | Tumor necrosis factor alpha  | 8
:                      | :                            | :
Application: Dictionary Expansion
Expanding each entry in the dictionary
Threshold of generation probability: 0.1
Max number of variants for each entry: 20
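The expansion step can be sketched as follows, using the threshold (0.1) and the per-entry cap (20) given on this slide. Everything else is an assumption for illustration: the entry ID, the `toy_generate` stand-in for the variant generator, and the choice to count the original term toward the cap.

```python
from typing import Callable, Dict, Iterable, List, Tuple

VariantGenerator = Callable[[str], Iterable[Tuple[str, float]]]


def expand_dictionary(entries: Dict[str, str], generate: VariantGenerator,
                      threshold: float = 0.1,
                      max_variants: int = 20) -> Dict[str, List[Tuple[str, float]]]:
    """Map each entry ID to its highest-probability variants."""
    expanded = {}
    for entry_id, term in entries.items():
        variants = [(v, p) for v, p in generate(term) if p >= threshold]
        variants.sort(key=lambda vp: vp[1], reverse=True)
        expanded[entry_id] = variants[:max_variants]  # cap includes the input term
    return expanded


def toy_generate(term: str) -> Iterable[Tuple[str, float]]:
    """Hypothetical stand-in for the probabilistic generator."""
    yield term, 1.0
    if " " in term:
        yield term.replace(" ", "-"), 0.5
        yield term.replace(" ", ""), 0.05  # falls below the 0.1 threshold


print(expand_dictionary({"ID-1": "NF kappa B"}, toy_generate))
```

Keeping the entry ID alongside every variant preserves the ID information that motivates the dictionary-based approach.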
Protein Name Recognition
Information Extraction
Longest match
GENIA corpus
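Longest-match lookup against an expanded dictionary can be sketched as below. The slide gives no implementation details, so whitespace tokenization, the greedy left-to-right scan, the `max_len` window, and the dictionary IDs are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def longest_match(tokens: List[str], dictionary: Dict[str, str],
                  max_len: int = 5) -> List[Tuple[int, int, str]]:
    """Return (start, end, entry_id) spans, preferring the longest entry at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                matches.append((i, i + n, dictionary[candidate]))
                i += n  # continue after the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Toy expanded dictionary (assumed): variants map back to their entry IDs.
dictionary = {
    "EGR-1": "ID-1", "Egr-1": "ID-1",
    "EGR-1 binding protein": "ID-2", "Egr-1 binding protein": "ID-2",
}
tokens = "the Egr-1 binding protein was induced".split()
print(longest_match(tokens, dictionary))  # [(1, 4, 'ID-2')]
```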
Results of Dictionary Expansion
Conclusion
Probabilistic Variant Generator
Learning from actual examples
Dictionary expansion by the generator improves recall without loss of precision.