a probabilistic term variant generator for biomedical terms

A Probabilistic Term Variant Generator for Biomedical Terms Yoshimasa Tsuruoka and Jun’ichi Tsujii CREST, JST The University of Tokyo

Upload: orsin

Post on 05-Jan-2016



TRANSCRIPT

Page 1: A Probabilistic Term Variant Generator for Biomedical Terms

A Probabilistic Term Variant Generator for Biomedical Terms

Yoshimasa Tsuruoka and Jun’ichi Tsujii

CREST, JST The University of Tokyo

Page 2: A Probabilistic Term Variant Generator for Biomedical Terms

Outline

Probabilistic Term Variant Generator

Generation Algorithm

Application: Dictionary expansion

Page 3: A Probabilistic Term Variant Generator for Biomedical Terms

Background

Information extraction from biomedical documents

Recognizing technical terms (e.g. DNA, protein names)

We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated…

Page 4: A Probabilistic Term Variant Generator for Biomedical Terms

Technical Term Recognition

Machine learning based: identifying the regions of terms ⇒ No ID information

Dictionary-based: comparing the strings with each entry in the dictionary ⇒ ID information

Page 5: A Probabilistic Term Variant Generator for Biomedical Terms

Problems of Dictionary-based approaches

Spelling variation degrades recall ⇒ Approximate string searching

False positives degrade precision ⇒ Filtering by machine learning

Page 6: A Probabilistic Term Variant Generator for Biomedical Terms

Exact String Searching

Example

Text: Phorbol myristate acetate induced Egr-1 mRNA…

Dictionary: EGP, EGR-1, EGR-1 binding protein, …

⇒ None of the entries matches
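The failure on this slide can be reproduced with a minimal sketch. The function name `exact_matches` is an assumption made for illustration; the text and dictionary entries are the ones shown on the slide.

```python
def exact_matches(text, dictionary):
    """Return the dictionary entries that occur verbatim (case-sensitively) in text."""
    return [entry for entry in sorted(dictionary) if entry in text]


text = "Phorbol myristate acetate induced Egr-1 mRNA"
dictionary = {"EGP", "EGR-1", "EGR-1 binding protein"}

# "Egr-1" differs from "EGR-1" only in letter case, so nothing is found.
print(exact_matches(text, dictionary))  # → []
```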

Page 7: A Probabilistic Term Variant Generator for Biomedical Terms

Edit Distance

Defines the distance between two strings by the sequence of three kinds of operations needed to turn one into the other:

Substitution

Insertion

Deletion

Ex.) board → abord: Cost = 2 (delete 'a' and insert 'a')
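The distance above can be computed with the standard dynamic-programming formulation. The sketch below assumes unit cost for every operation, which is consistent with the cost of 2 in the example; the slides do not say whether weighted costs were used.

```python
def edit_distance(source, target):
    """Minimum number of substitutions, insertions and deletions (unit cost each)."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # → 2
print(edit_distance("Egr-1", "EGR-1"))  # → 2
```

With unit costs, "Egr-1" and "EGR-1" are two substitutions apart, the same distance as many unrelated pairs of short strings.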

Page 8: A Probabilistic Term Variant Generator for Biomedical Terms

Automatic Generation of Spelling Variants

Input: NF-Kappa B

⇒ Variant Generator ⇒

NF-Kappa B (1.0)
NF Kappa B (0.9)
NF kappa B (0.6)
NF kappaB (0.5)
NFkappaB (0.3)
:

Each generated variant is associated with its generation probability

Page 9: A Probabilistic Term Variant Generator for Biomedical Terms

Generation Algorithm

T cell (1.0)
  → T-cell (0.5), operation probability 0.5
  → T cells (0.2), operation probability 0.2

T-cell (0.5)
  → T-cells (0.1), operation probability 0.2

Recursive generation: P = P' × P_op
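The recursion can be sketched as a best-first search in which each new variant inherits its parent's probability multiplied by the operation probability. This is a minimal sketch and not the authors' implementation: `toy_operations`, `generate_variants`, the two hard-coded rules, and the default `threshold` and `max_variants` values are all assumptions chosen to reproduce the T cell example.

```python
import heapq


def toy_operations(term):
    """Hypothetical rule set: yield (variant, p_op) for one operation applied to term."""
    if " " in term:
        yield term.replace(" ", "-", 1), 0.5  # replace the first space with a hyphen
    if not term.endswith("s"):
        yield term + "s", 0.2                 # insert a plural 's' at the end


def generate_variants(term, operations, threshold=0.05, max_variants=20):
    """Expand variants most-probable first, applying P = P' * P_op at each step."""
    best = {}
    heap = [(-1.0, term)]  # max-heap via negated probabilities
    while heap and len(best) < max_variants:
        neg_p, current = heapq.heappop(heap)
        if current in best:
            continue  # already reached through a path at least as probable
        p = -neg_p
        best[current] = p
        for variant, p_op in operations(current):
            p_new = p * p_op
            if p_new >= threshold and variant not in best:
                heapq.heappush(heap, (-p_new, variant))
    return sorted(best.items(), key=lambda item: -item[1])


for variant, probability in generate_variants("T cell", toy_operations):
    print(f"{variant} ({probability:.1f})")
```

Because operation probabilities never exceed 1, the first time a variant is popped from the heap it carries its highest probability over all generation paths.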

Page 10: A Probabilistic Term Variant Generator for Biomedical Terms

Collecting Examples of Spelling Variation

Abbreviation Extraction (Schwartz 2003): extracts short and long form pairs

Short form | Long form
AA | Alcoholic Anonymous
   | American
   | Americans
   | Arachidonic acid
   | arachidonic acid
   | anaemia
   | anemia
:  | :

Page 11: A Probabilistic Term Variant Generator for Biomedical Terms

Learning Operation Rules

Operations for generating variants: Substitution, Deletion, Insertion

Context: character-level context, the preceding (following) two characters

Operation Probability:

$$P(\text{operation} \mid \text{context}) = \frac{f(\text{context}, \text{operation})}{f(\text{context})}$$
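Reading f as a frequency count over the collected variation examples (an assumption; the slide does not define f), the operation probability is a relative frequency. The sketch below assumes the examples have already been reduced to (context, operation) observations, including a hypothetical "keep" operation for occurrences of a context where nothing changed; how the observations are extracted from variant pairs is not described in the slides.

```python
from collections import Counter


def estimate_rule_probabilities(observations):
    """Return {(context, operation): f(context, operation) / f(context)}."""
    context_counts = Counter()
    pair_counts = Counter()
    for context, operation in observations:
        context_counts[context] += 1
        pair_counts[(context, operation)] += 1
    return {
        (context, operation): count / context_counts[context]
        for (context, operation), count in pair_counts.items()
    }


# Hypothetical toy data: context = (left context, target, right context).
context = ("^", "I", "m")
observations = [(context, "replace I->i")] * 3 + [(context, "keep")]
probabilities = estimate_rule_probabilities(observations)
print(probabilities[(context, "replace I->i")])  # → 0.75
```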

Page 12: A Probabilistic Term Variant Generator for Biomedical Terms

Probabilistic Rules

Probability | Left-context | Target | Right-context | Operation
0.96 | * | ‘ | End of String | Delete
0.96 | Start of String | I | m | Replace ‘I’ with ‘i’
0.95 | * | H | yd | Replace ‘H’ with ‘h’
: | : | : | : | :
0.75 | ph | (empty) | End of String | Insert ‘y’
: | : | : | : | :
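One way to read a row of this table is as a context-sensitive rewrite rule. The sketch below is an illustrative encoding, not the authors' data structure: the `Rule` class, the markers `"^"` (start of string), `"$"` (end of string) and `"*"` (any context), and the input strings are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    left: str         # left context: "*" = any, "^" = start of string
    target: str       # characters to rewrite; "" for an insertion
    right: str        # right context: "*" = any, "$" = end of string
    replacement: str  # "" for a deletion
    probability: float


def _left_ok(rule, s, i):
    if rule.left == "*":
        return True
    if rule.left == "^":
        return i == 0
    return s[:i].endswith(rule.left)


def _right_ok(rule, s, j):
    if rule.right == "*":
        return True
    if rule.right == "$":
        return j == len(s)
    return s.startswith(rule.right, j)


def apply_rule(rule, s):
    """Return (variant, probability) for every position where the rule matches s."""
    results = []
    n = len(rule.target)
    for i in range(len(s) - n + 1):
        j = i + n
        if s[i:j] == rule.target and _left_ok(rule, s, i) and _right_ok(rule, s, j):
            results.append((s[:i] + rule.replacement + s[j:], rule.probability))
    return results


lowercase_h = Rule(left="*", target="H", right="yd", replacement="h", probability=0.95)
insert_y = Rule(left="ph", target="", right="$", replacement="y", probability=0.75)
print(apply_rule(lowercase_h, "Hydrocortisone"))  # → [('hydrocortisone', 0.95)]
print(apply_rule(insert_y, "radiograph"))         # → [('radiography', 0.75)]
```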

Page 13: A Probabilistic Term Variant Generator for Biomedical Terms

Example (1)

Generation Probability | Generated Variants | Frequency
1.0 (Input) | NF-kappa B | 857
0.417 | NF-kappaB | 692
0.417 | nF-kappa B | 0
0.337 | Nf-kappa B | 0
0.275 | NF kappa B | 25
0.226 | NF-kappa b | 0
: | : | :

Page 14: A Probabilistic Term Variant Generator for Biomedical Terms

Example (2)

Generation Probability | Generated Variants | Frequency
1.0 (Input) | antiinflammatory effect | 7
0.462 | anti-inflammatory effect | 33
0.393 | antiinflammatory effects | 6
0.356 | Antiinflammatory effect | 0
0.286 | antiinflammatory-effect | 0
0.181 | anti-inflammatory effects | 23
: | : | :
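As a worked check, the values in this example appear consistent with the recursion P = P' × P_op from the Generation Algorithm slide. The calculation assumes that "anti-inflammatory effects" is reached by applying the hyphen insertion (0.462) and the plural insertion (0.393) one after the other; the slides do not state which generation path was taken.

$$P(\text{anti-inflammatory effects}) \approx 0.462 \times 0.393 \approx 0.1816$$

This agrees with the listed 0.181 to within the rounding of the displayed factors.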

Page 15: A Probabilistic Term Variant Generator for Biomedical Terms

Example (3)

Generation Probability | Generated Variants | Frequency
1.0 (Input) | tumour necrosis factor alpha | 15
0.492 | tumor necrosis factor alpha | 126
0.356 | tumour necrosis factor-alpha | 30
0.235 | Tumour necrosis factor alpha | 2
0.175 | tumor necrosis factor-alpha | 182
0.115 | Tumor necrosis factor alpha | 8
: | : | :

Page 16: A Probabilistic Term Variant Generator for Biomedical Terms

Application:Dictionary Expansion

Expanding each entry in the dictionary

Threshold of Generation Probability: 0.1

Max number of variants for each entry: 20
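The expansion step can be sketched as follows, using the two settings from this slide as defaults. The function names, the entry ID `"ID-0001"`, and the decision to always keep the original term and to let the first owner of a variant win are assumptions; `toy_generate` simply replays the probabilities listed in Example (1) instead of running a real generator.

```python
def expand_dictionary(entries, generate, threshold=0.1, max_variants=20):
    """Map each sufficiently probable variant of every entry to that entry's ID.

    entries:  dict mapping a term to its ID
    generate: callable returning (variant, probability) pairs for a term
    """
    expanded = {}
    for term, entry_id in entries.items():
        expanded.setdefault(term, entry_id)  # assumption: the original is always kept
        candidates = [(v, p) for v, p in generate(term) if p >= threshold]
        candidates.sort(key=lambda pair: -pair[1])
        for variant, _ in candidates[:max_variants]:
            expanded.setdefault(variant, entry_id)  # assumption: first owner wins
    return expanded


def toy_generate(term):
    """Stand-in generator that replays the values from Example (1)."""
    table = {
        "NF-kappa B": [
            ("NF-kappa B", 1.0), ("NF-kappaB", 0.417), ("nF-kappa B", 0.417),
            ("Nf-kappa B", 0.337), ("NF kappa B", 0.275), ("NF-kappa b", 0.226),
        ]
    }
    return table.get(term, [(term, 1.0)])


expanded = expand_dictionary({"NF-kappa B": "ID-0001"}, toy_generate)
print(sorted(expanded))
```

Every variant keeps the ID of the entry it was generated from, so a match on a variant still yields the ID information that motivates the dictionary-based approach.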

Page 17: A Probabilistic Term Variant Generator for Biomedical Terms

Protein Name Recognition

Information Extraction

Longest match

GENIA corpus
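The slide names longest match only as the lookup strategy. A minimal sketch of greedy longest-match lookup over whitespace tokens follows; the tokenization, the `max_len` window, the example sentence, and the IDs are assumptions, and the dictionary entries are borrowed from the Exact String Searching slide.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Greedy left-to-right lookup that prefers the longest dictionary entry.

    Returns (start, end, term, entry_id) tuples with token offsets, end exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                matches.append((i, i + n, candidate, dictionary[candidate]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


dictionary = {"EGP": "D1", "EGR-1": "D2", "EGR-1 binding protein": "D3"}
tokens = "the EGR-1 binding protein and EGR-1".split()
print(longest_match(tokens, dictionary))
# → [(1, 4, 'EGR-1 binding protein', 'D3'), (5, 6, 'EGR-1', 'D2')]
```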

Page 18: A Probabilistic Term Variant Generator for Biomedical Terms

Results of Dictionary Expansion


Page 19: A Probabilistic Term Variant Generator for Biomedical Terms

Conclusion

Probabilistic Variant Generator

Learning from actual examples

Dictionary expansion by the generator improves recall without the loss of precision.