NLP Techniques (Machine Learning)
NER in Biomedical Domain
Tsujii Laboratory
Hong-Woo CHUN (D1)
February 10th, 2005
Univ. of Tokyo
Introduction
As research in the biomedical domain has grown rapidly in recent years, a huge amount of natural language resources has been developed and has become a rich knowledge base.
NER (Named Entity Recognition) is in strong demand in the biomedical domain.
In this project, NER identifies names of genes, gene products and diseases in biomedical text. From here on, genes and gene products are both referred to as 'gene'.
NER in the biomedical domain has not yet reached the performance achieved in the newswire domain.
Introduction::Problems in NER
Modifiers often precede basic NEs: activated B cell lines
Biomedical NEs are sometimes very long: 47 kDa sterol regulatory element binding factor
Two or more NEs share one head noun through a conjunction or disjunction construction: 91 and 84 kDa proteins
An entity may be found with various spelling forms
NEs may be cascaded
One NE may be embedded in another NE
Abbreviations are frequently used
Therefore, it is necessary to explore more evidential features and more effective methods to cope with such difficulties.
NER without NLP tech.
Dictionary-based longest matching
Number of words in the dictionaries
Gene: 44,463
Disease: 159,477
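The slides do not show how the longest matching was implemented. As a minimal sketch, assuming a greedy left-to-right scan over whitespace tokens and a case-insensitive lookup, it could look like the following. The function name `longest_match`, the `max_len` limit, the toy dictionary entries and the example sentence are all illustrative assumptions, not taken from the project.

```python
def longest_match(tokens, dictionary, max_len=8):
    """Greedy left-to-right longest matching against a dictionary.

    `dictionary` maps a tuple of lowercased tokens to an entity type.
    Returns (start, end, type) spans, with `end` exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, then shrink until something matches.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                match = (i, i + n, dictionary[key])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched entity
        else:
            i += 1
    return spans


# Toy dictionary (hypothetical entries, not the 44,463 / 159,477 word lists).
toy_dictionary = {
    ("sterol", "regulatory", "element", "binding", "factor"): "Gene",
    ("binding", "factor"): "Gene",
    ("breast", "cancer"): "Disease",
    ("cancer",): "Disease",
}

# Made-up example sentence, used only to exercise the matcher.
tokens = "We examined sterol regulatory element binding factor in breast cancer samples".split()
spans = longest_match(tokens, toy_dictionary)
print(spans)
```

Because the longest candidate is tried first, the five-token gene name wins over the shorter entry "binding factor", and "breast cancer" wins over "cancer".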
Corpus: 1,000 biomedical sentences tagged by biologists
Annotations: gene and disease names and their associations
Measure     Gene (Hishiki)  Gene (Nagata)  Disease (Hishiki)  Disease (Nagata)
Precision   57.7%           65.0%          78.0%              82.1%
Recall      100%            100%           100%               100%
F-score     73.2%           78.8%          87.6%              90.2%
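The F-scores above are consistent with the balanced F-score, the harmonic mean of precision and recall. The slides do not state the formula, so the helper below (a hypothetical name, `f_score`) is a sketch under that assumption. It recomputes the F-score row from the precision row with recall fixed at 100%.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean) of precision and recall, in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision values from the table; recall is 100% in every column.
precisions = [57.7, 65.0, 78.0, 82.1]
recomputed = [round(f_score(p, 100.0), 1) for p in precisions]
print(recomputed)  # [73.2, 78.8, 87.6, 90.2]
```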
Experimental results(1)
Maximum Entropy based model
Features
Local context (Name itself, Unigrams and Bigrams)
POS (Name itself, Unigrams and Bigrams)
Capitalization (All capital, Mixed capital, No capital)
Digitalization (All digit, Mixed digit, No digit)
24 Greek Letters (alpha, beta, gamma, …)
12 suffixes
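The slides list the feature types but not how they were computed. The sketch below covers the surface features (local context, capitalization, digits, Greek letters) for one candidate NE. It assumes a two-token context window on each side, whitespace tokens, and class labels such as `MIXED_CAPS`. These names and the exact definitions are assumptions made for illustration. POS features are left out because they need a tagger, and suffix features are not covered here.

```python
import re

# The 24 Greek letter names.
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
         "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
         "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega"}


def caps_class(text):
    """All capital / mixed capital / no capital, judged on letters only."""
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "ALL_CAPS"
    if any(c.isupper() for c in letters):
        return "MIXED_CAPS"
    return "NO_CAPS"


def digit_class(text):
    """All digit / mixed digit / no digit."""
    if text.isdigit():
        return "ALL_DIGIT"
    if any(c.isdigit() for c in text):
        return "MIXED_DIGIT"
    return "NO_DIGIT"


def extract_features(tokens, start, end):
    """Surface features for the candidate NE tokens[start:end]."""
    def ctx(i):
        # Context token at position i, or a padding symbol at sentence edges.
        return tokens[i].lower() if 0 <= i < len(tokens) else "<PAD>"

    name = tokens[start:end]
    joined = "".join(name)
    parts = [p.lower() for tok in name for p in re.split(r"[^A-Za-z]+", tok) if p]
    return {
        "name": " ".join(name).lower(),
        # Unigrams around the name.
        "L2": ctx(start - 2), "L1": ctx(start - 1),
        "R1": ctx(end), "R2": ctx(end + 1),
        # Bigrams around the name.
        "L2_L1": ctx(start - 2) + " " + ctx(start - 1),
        "R1_R2": ctx(end) + " " + ctx(end + 1),
        "caps": caps_class(joined),
        "digit": digit_class(joined),
        "greek": any(p in GREEK for p in parts),
    }


# Made-up example: the candidate NE is "NF-kappa B".
feats = extract_features("Activation of NF-kappa B in T cells".split(), 2, 4)
print(feats)
```

Each feature dictionary would then be turned into binary indicator features for the maximum entropy model.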
Corpus: 1,000 biomedical sentences tagged by biologists
Annotations: gene and disease names and their associations
Evaluation: 10-fold cross validation
Context window: L2 L1 NE R1 R2
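For the 10-fold cross validation over the 1,000-sentence corpus, the slides do not say how the folds were drawn. A minimal sketch, assuming a shuffled split at sentence level with a fixed seed (the function name `k_fold_splits` and the seed are hypothetical):

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for k-fold cross validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Sentence ids stand in for the 1,000 annotated sentences.
splits = list(k_fold_splits(range(1000), k=10))
for train, test in splits:
    # A model would be trained on `train` and scored on `test` here.
    pass
```

With 1,000 sentences each round trains on 900 and tests on 100, and the reported precision, recall and F-score would be aggregated over the ten rounds.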
Experimental results(2)
Example of Corpus
Experimental results(3)::Useful features
Gene Disease
Local context
Capitalization
Digitalization
Greek Letters
Affix
POS (NE)
POS (NE, Uni)
POS (NE, Uni, Bi)
Experimental results(4)
Agreement for annotations between Hishiki-san and Nagata-san
Comparison
Features
Gene: Local context, Capitalization, POS of NE
Disease: Local context, Capitalization, POS of NE and Unigram
Evaluation: 10-fold cross validation
Gene 90.3%
Disease 89.3%
Test data: Nagata (Gene: 650, Disease: 821)
Training data   Gene P  Gene R  Gene F  Disease P  Disease R  Disease F
Hishiki         88.6    81.4    84.8    90.4       92.8       91.6
Nagata          86.8    90.9    88.8    89.6       95.7       92.6
Intersection    90.6    80.0    85.0    91.1       89.9       90.5
Union           85.4    91.7    88.4    88.8       97.4       92.9
Test data: Hishiki (Gene: 577, Disease: 780)
Training data   Gene P  Gene R  Gene F  Disease P  Disease R  Disease F
Hishiki         80.2    83.0    81.6    88.7       95.9       92.2
Nagata          77.5    91.5    83.9    86.8       97.6       91.9
Intersection    81.7    81.3    81.5    89.9       93.3       91.6
Union           76.5    92.5    83.8    85.5       98.7       91.6
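The Intersection and Union training sets combine the two annotators' annotations, but the slides do not say how spans were compared. The sketch below assumes exact matching on (sentence id, start, end, type) tuples. The function name `combine_annotations` and the toy annotations are hypothetical.

```python
def combine_annotations(annotator_a, annotator_b):
    """Build intersection and union sets from two annotators' entity spans.

    Each annotation is a (sentence_id, start, end, entity_type) tuple,
    and two annotations agree only if all four fields are equal.
    """
    a, b = set(annotator_a), set(annotator_b)
    return {"intersection": a & b, "union": a | b}


# Toy annotations (made up for illustration).
hishiki = [(1, 0, 2, "Gene"), (1, 5, 7, "Disease"), (2, 3, 4, "Gene")]
nagata = [(1, 0, 2, "Gene"), (1, 5, 7, "Disease"), (2, 3, 5, "Gene")]

combined = combine_annotations(hishiki, nagata)
print(len(combined["intersection"]), len(combined["union"]))
```

Under this assumption the intersection keeps only the spans both annotators marked identically, which matches the pattern in the table: intersection training tends to give higher precision and lower recall, and union training the reverse.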
Experimental results(5)::Gene
Experimental results(6)::Disease
Conclusions
Through the experiments, we found that NLP techniques (the ML approach) play an important role in improving performance.
We can expect performance to increase further by considering more evidential features.
It is necessary to explore more evidential features and more effective methods to cope with NER difficulties.
We found that performance improved as the size of the training corpus increased.
Thank you!!!
Gaussian Prior (Hishiki)
Gaussian Prior  Gene P  Gene R  Gene F  Disease P  Disease R  Disease F
20 73.8 78.5 76.0 85.6 96.5 90.7
50 75.2 79.0 77.1 87.0 95.5 91.0
80 75.4 79.0 77.1 87.3 95.4 91.2
100 75.4 79.9 77.2 87.5 95.4 91.2
200 75.6 79.0 77.2 87.6 95.3 91.3
300 75.6 79.0 77.2 87.7 95.2 91.3
400 75.4 79.0 77.1 87.8 95.2 91.3
500 75.4 78.9 77.1 87.8 95.1 91.3
800 75.3 78.2 76.7 87.8 94.9 91.2
1000 75.3 78.1 76.7 87.8 94.7 91.1
1500 75.5 77.4 76.4 87.7 94.7 91.1
2000 75.5 76.7 76.1 87.7 94.7 91.1
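The slides do not give the formula behind the Gaussian prior. In maximum entropy modeling it is commonly implemented as a quadratic penalty on the feature weights, so the objective below is a sketch under that assumption, with feature functions $f_j$, weights $\lambda_j$ and a single shared variance $\sigma^2$. Whether the values 20 to 2000 in the table correspond to $\sigma$, $\sigma^2$ or another parameterization is not stated.

```latex
p_\lambda(y \mid x) =
  \frac{\exp\!\left(\sum_j \lambda_j f_j(x, y)\right)}
       {\sum_{y'} \exp\!\left(\sum_j \lambda_j f_j(x, y')\right)},
\qquad
L(\lambda) = \sum_i \log p_\lambda(y_i \mid x_i)
           - \sum_j \frac{\lambda_j^2}{2\sigma^2}
```

Under this reading a small $\sigma^2$ pulls the weights strongly toward zero and a large $\sigma^2$ approaches unregularized training, which would fit the table: the F-scores vary by only about one point across the whole range.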
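Experimental results (Hishiki)
Abbreviations: W = name and context, C = Caps Info., D = Digit Info., G = Greek, A = Affix
Features  Gene P  Gene R  Gene F  Disease P  Disease R  Disease F
Name, context (W) 76.6 83.4 79.8 89.1 95.6 92.3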
Caps Info 73.5 68.1 70.7 78.0 99.4 87.4
Digit Info. 63.7 86.8 73.5 77.9 99.5 87.4
Greek 63.2 84.4 72.3 77.9 99.5 87.4
Affix 62.9 83.7 71.8 78.0 99.5 87.4
POS 64.4 78.9 70.9 78.1 99.2 87.4
W+Caps Info. 80.7 84.6 82.6 87.8 98.2 92.7
W+Digit Info. 79.0 83.9 81.3 87.7 98.2 92.7
W+Greek 75.2 84.1 79.4 87.6 98.3 92.6
W+Affix 75.0 84.9 79.7 87.7 98.2 92.7
W+D+G 79.7 84.2 81.9 87.7 98.2 92.7
W+C+D 80.7 84.6 82.6 87.7 98.2 92.7
W+C+G 80.4 84.1 82.2 87.7 98.2 92.7
W+A+C 80.6 84.2 82.4 87.8 98.2 92.7
W+A+D 78.9 84.1 81.4 87.8 98.2 92.7
W+A+G 75.0 83.9 79.2 87.6 98.2 92.6
W+C+D+G 80.5 83.7 82.1 87.8 98.2 92.7
W+A+C+D 80.5 84.2 82.3 88.0 98.3 92.9
W+A+C+G 80.3 84.1 82.1 87.8 98.2 92.7
W+A+D+G 79.5 83.9 81.6 87.9 98.3 92.8
W+A+C+D+G 80.5 83.9 82.2 87.9 98.2 92.8
Experimental results (Hishiki)
Features  Gene P  Gene R  Gene F  Disease P  Disease R  Disease F
Name, context (W) 76.6 83.4 79.8 89.1 95.6 92.3
W+POS of NE 76.3 84.2 80.1 87.7 97.6 92.0
W+POS(NE,uni) 75.9 82.3 79.0 88.6 95.8 92.1
W+POS(NE,uni,bi) 76.0 79.4 77.6 87.8 94.9 91.2
W+Caps Info. 80.7 84.6 82.6 87.8 98.2 92.7
W+C+POS 81.0 83.5 82.3 87.8 97.6 92.4
W+C+POS1 80.0 82.5 81.2 88.6 95.6 92.0
W+C+POS2 77.2 78.9 78.0 88.4 95.1 91.7
W+C+D 80.7 84.6 82.6 87.7 98.2 92.7
W+C+D+POS 80.8 83.0 81.9 87.6 97.6 92.3
W+C+D+POS1 79.9 82.5 81.2 88.7 95.9 92.2
W+C+D+POS2 77.2 79.2 78.2 88.4 95.1 91.7
W+A+C+D 80.5 84.2 82.3 88.0 98.3 92.9
W+A+C+D+POS 81.0 83.4 82.2 87.8 97.6 92.4
W+A+C+D+POS1 79.8 82.3 81.1 88.8 95.8 92.2
W+A+C+D+POS2 77.0 79.0 78.0 88.1 94.5 91.2
Experimental results (Nagata)
Features  Gene P  Gene R  Gene F  Disease P  Disease R  Disease F
Name, context (W) 82.7 88.3 85.4 89.7 95.0 92.3
Caps Info 73.4 88.8 80.4 82.1 99.4 89.9
Digit Info. 72.2 89.7 80.0 82.1 99.5 90.0
Greek 71.7 86.2 78.3 82.1 99.5 90.0
Affix 71.6 85.1 77.8 82.1 99.5 90.0
POS 72.8 86.3 79.0 82.2 99.3 89.9
W+Caps Info. 86.4 90.2 88.3 88.5 97.8 92.9
W+Digit Info. 82.2 91.2 86.5 88.5 97.9 93.0
W+Greek 80.9 92.0 86.1 88.6 98.1 93.1
W+Affix 80.4 92.0 85.8 88.6 98.1 93.1
W+D+G 82.7 91.4 86.8 88.6 98.2 93.1
W+C+D 85.9 90.2 88.0 88.5 97.7 92.9
W+C+G 86.2 90.6 88.4 88.5 97.7 92.9
W+A+C 86.0 90.2 88.1 88.5 97.8 92.9
W+A+D 82.3 91.4 86.6 88.5 98.1 93.0
W+A+G 80.7 91.5 85.8 88.6 98.2 93.1
W+C+D+G 86.1 90.8 88.4 88.6 98.1 93.1
W+A+C+D 85.9 90.2 88.0 88.5 97.8 92.9
W+A+C+G 86.2 90.5 88.3 88.7 98.1 93.1
W+A+D+G 82.6 91.4 86.8 88.7 98.1 93.1
W+A+C+D+G 85.7 90.6 88.1 88.6 97.8 93.0
Experimental results (Nagata)
Features  Gene P  Gene R  Gene F  Disease P  Disease R  Disease F
Name, context (W) 82.7 88.3 85.4 89.7 95.0 92.3
W+POS 81.5 90.6 85.8 88.5 96.0 92.1
W+POS1 81.7 90.6 85.9 89.8 95.5 92.6
W+POS2 81.8 86.3 84.0 89.3 95.4 92.2
W+Caps Info. 86.4 90.2 88.3 88.5 97.8 92.9
W+C+POS 86.3 89.4 87.8 88.6 97.0 92.6
W+C+POS1 85.9 90.2 88.0 90.0 96.1 92.9
W+C+POS2 85.7 87.5 86.6 89.4 95.2 92.2
W+C+D+G 86.1 90.8 88.4 88.6 98.1 93.1
W+C+D+G+POS 86.5 89.1 87.8 88.6 97.1 92.6
W+C+D+G+POS1 85.5 89.8 87.6 89.9 96.1 92.9
W+C+D+G+POS2 85.3 87.5 86.4 89.5 95.1 92.2
W+C+G+POS 86.7 89.1 87.9 88.5 97.0 92.6
W+C+G+POS1 85.6 89.7 87.6 89.8 96.3 92.9
W+C+G+POS2 85.2 87.5 86.3 89.2 95.0 92.0
Prefix and suffix
Important cue for terminology identification
~cin: actinomycin
~mide: cycloheximide
~zole: sulphamethoxazole
~lipid: phospholipids
~rogen: estrogen
~vitamin: dihydroxyvitamin
etc.
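A suffix cue like the ones above can be turned into a feature with a simple string check. The sketch below uses only the six suffixes shown on this slide; the full set of 12 is not listed. It also tolerates a plural "s" so that "phospholipids" matches ~lipid, which is an assumption, as is the function name `suffix_feature`.

```python
# Only the six suffixes shown on the slide; the complete list is not given.
SUFFIXES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin")


def suffix_feature(token, suffixes=SUFFIXES):
    """Return the matching suffix cue for a token, or None if there is none."""
    word = token.lower()
    # Assumption: also try the word without a trailing plural "s".
    candidates = (word, word[:-1]) if word.endswith("s") else (word,)
    # Prefer the longest suffix when more than one could match.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if any(c.endswith(suffix) for c in candidates):
            return suffix
    return None


examples = ["actinomycin", "cycloheximide", "sulphamethoxazole",
            "phospholipids", "estrogen", "dihydroxyvitamin", "protein"]
matched = {word: suffix_feature(word) for word in examples}
print(matched)
```

The returned suffix, or its absence, would be added to the feature set as one more indicator feature for the candidate NE.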