Page 1: NLP Techniques (Machine Learning) NER in Biomedical Domain Tsujii Laboratory Hong-Woo CHUN (D1) February 10th, 2005

NLP Techniques (Machine Learning)

NER in Biomedical Domain

Tsujii Laboratory

Hong-Woo CHUN (D1)

February 10th, 2005

Univ. of Tokyo

Introduction

As research in the biomedical domain has grown rapidly in recent years, a huge amount of natural language resources has been developed and has become a rich knowledge base.

NER (Named Entity Recognition) is in strong demand in the biomedical domain.

In this project, NER identifies the names of genes, gene products and diseases in biomedical text. From now on, genes and gene products are both referred to as 'gene'.

NER in the biomedical domain has not yet reached high performance compared with NER in the newswire domain.

Introduction::Problems in NER

Modifiers often precede basic NEs: activated B cell lines

Biomedical NEs are sometimes very long: 47 kDa sterol regulatory element binding factor

Two or more NEs may share one head noun through a conjunction or disjunction construction: 91 and 84 kDa proteins

An entity may be found with various spelling forms

NEs may be cascaded

One NE may be embedded in another NE

Abbreviations are frequently used

Therefore, it is necessary to explore more evidential features and more effective methods to cope with such difficulties.

NER without NLP tech.

Dictionary-based longest matching

Number of words in the dictionaries

Gene: 44,463

Disease: 159,477

Corpus: 1,000 biomedical sentences tagged by biologists (gene and disease names and their associations)

            Gene               Disease
            Hishiki   Nagata   Hishiki   Nagata
Precision   57.7%     65.0%    78.0%     82.1%
Recall      100%      100%     100%      100%
F-score     73.2%     78.8%    87.6%     90.2%
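Dictionary-based longest matching can be sketched as follows. This is a minimal illustration, not the implementation behind the numbers above: the function name `longest_match`, the `max_len` cap and the toy dictionary are all assumptions, and matching here is case-insensitive on whitespace tokens.

```python
def longest_match(tokens, dictionary, max_len=8):
    """Tag token spans by greedy longest match against a set of names.

    dictionary: set of lowercase names, each a space-joined token sequence.
    Returns a list of (start, end, text) tuples with end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first and shrink until a name is found.
        for j in range(min(len(tokens), i + max_len), i, -1):
            candidate = " ".join(tokens[i:j]).lower()
            if candidate in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1  # no name starts here, move on by one token
        else:
            spans.append((i, match_end, " ".join(tokens[i:match_end])))
            i = match_end  # continue after the matched name


    return spans


# Hypothetical toy dictionary (the gene dictionary on the slide has 44,463 entries).
gene_dict = {"sterol regulatory element binding factor", "binding factor", "p53"}
tokens = "The sterol regulatory element binding factor interacts with p53".split()
print(longest_match(tokens, gene_dict))
```

Because the longest candidate wins, the shorter entry "binding factor" is not reported separately once the full five-token name has matched.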

Experimental results(1)

Maximum Entropy based model

Features

Local context (Name itself, Unigrams and Bigrams)

POS (Name itself, Unigrams and Bigrams)

Capitalization (All capital, Mixed capital, No capital)

Digitalization (All digit, Mixed digit, No digit)

24 Greek letters (alpha, beta, gamma, …)

12 suffixes

Corpus: 1,000 biomedical sentences tagged by biologists (gene and disease names and their associations)

Evaluation: 10-fold cross validation

Context window: L2 L1 NE R1 R2
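The surface features listed on this slide could be computed along the following lines. This is only a sketch under stated assumptions: the feature names, the padding symbols, the reading of "L2 L1 NE R1 R2" as a two-token window on each side, and the short `GREEK` list are illustrative choices, not taken from the slides. POS and suffix features are left out because they need a tagger and the actual suffix list.

```python
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}  # illustrative subset of the 24 letters


def shape(text, is_member):
    """Return 'all', 'mixed' or 'none' for the characters that satisfy is_member."""
    hits = sum(1 for ch in text if is_member(ch))
    if hits == 0:
        return "none"
    return "all" if hits == len(text) else "mixed"


def extract_features(tokens, start, end):
    """Features for the candidate NE tokens[start:end] with a two-token window."""
    name = " ".join(tokens[start:end])
    pad = ["<s>", "<s>"] + tokens + ["</s>", "</s>"]
    s, e = start + 2, end + 2  # candidate position inside the padded list
    l2, l1, r1, r2 = pad[s - 2], pad[s - 1], pad[e], pad[e + 1]
    letters = "".join(ch for ch in name if ch.isalpha())
    compact = name.replace(" ", "")
    return {
        "name": name.lower(),
        "L1": l1.lower(),                     # unigram to the left
        "R1": r1.lower(),                     # unigram to the right
        "L2_L1": f"{l2} {l1}".lower(),        # bigram to the left
        "R1_R2": f"{r1} {r2}".lower(),        # bigram to the right
        "caps": shape(letters, str.isupper),  # all / mixed / none
        "digit": shape(compact, str.isdigit), # all / mixed / none
        "greek": any(t.lower() in GREEK for t in tokens[start:end]),
    }


tokens = "Expression of IL-2 receptor alpha was reduced".split()
print(extract_features(tokens, 2, 5))
```

Each dictionary entry would become one indicator feature for the maximum entropy classifier, which then decides whether the candidate span is a gene or disease name.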

Experimental results(2)

Example of Corpus

Experimental results(3)::Useful features

Features compared (columns: Gene, Disease)

Local context

Capitalization

Digitalization

Greek Letters

Affix

POS: NE

POS: NE, Uni

POS: NE, Uni, Bi

Experimental results(4)

Agreement for Annotations between Hishiki san and Nagata san

Comparison

Features

Gene: Local context, Capitalization, POS of NE

Disease: Local context, Capitalization, POS of NE and Unigram

Evaluation: 10-fold cross validation

Gene 90.3%

Disease 89.3%

Test data: Nagata (Gene: 650, Disease: 821)

Training data   Gene                 Disease
                P      R      F      P      R      F
Hishiki         88.6   81.4   84.8   90.4   92.8   91.6
Nagata          86.8   90.9   88.8   89.6   95.7   92.6
Intersection    90.6   80.0   85.0   91.1   89.9   90.5
Union           85.4   91.7   88.4   88.8   97.4   92.9

Test data: Hishiki (Gene: 577, Disease: 780)

Training data   Gene                 Disease
                P      R      F      P      R      F
Hishiki         80.2   83.0   81.6   88.7   95.9   92.2
Nagata          77.5   91.5   83.9   86.8   97.6   91.9
Intersection    81.7   81.3   81.5   89.9   93.3   91.6
Union           76.5   92.5   83.8   85.5   98.7   91.6
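The intersection and union training sets, and the P/R/F columns, can be illustrated with plain set operations. The slides do not say how spans were matched, so exact-span matching, the tuple layout and the toy annotations below are assumptions made for this sketch.

```python
def f_score(p, r):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def prf(predicted, gold):
    """Precision, recall and F-score (in percent) for two sets of NE spans."""
    tp = len(predicted & gold)  # spans that match exactly
    p = 100.0 * tp / len(predicted) if predicted else 0.0
    r = 100.0 * tp / len(gold) if gold else 0.0
    return round(p, 1), round(r, 1), round(f_score(p, r), 1)


# Hypothetical annotations: (sentence_id, start_token, end_token, type)
hishiki = {(1, 3, 5, "gene"), (2, 0, 1, "gene"), (3, 4, 6, "disease")}
nagata = {(1, 3, 5, "gene"), (2, 0, 2, "gene"), (3, 4, 6, "disease"), (4, 1, 2, "gene")}

intersection = hishiki & nagata  # spans both annotators marked
union = hishiki | nagata         # spans either annotator marked

print(prf(hishiki, nagata))
print(round(f_score(88.6, 81.4), 1))  # first Gene row of the table: P 88.6, R 81.4
```

Training on the intersection keeps only names both annotators agree on, which tends to favour precision, while the union adds every name either annotator marked, which tends to favour recall. The table above shows the same tendency.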

Experimental results(5)::Gene

Experimental results(6)::Disease

Conclusions

Through the experiments, we found that NLP techniques (the ML approach) play an important role in improving performance.

We can expect performance to increase further when more evidential features are considered.

It is necessary to explore more evidential features and more effective methods to cope with the difficulties of NER.

We found that performance improved as the size of the training corpus increased.

Thank you!!!

Gaussian Prior (Hishiki)

Gaussian Prior   Gene: P R F   Disease: P R F

20 73.8 78.5 76.0 85.6 96.5 90.7

50 75.2 79.0 77.1 87.0 95.5 91.0

80 75.4 79.0 77.1 87.3 95.4 91.2

100 75.4 79.9 77.2 87.5 95.4 91.2

200 75.6 79.0 77.2 87.6 95.3 91.3

300 75.6 79.0 77.2 87.7 95.2 91.3

400 75.4 79.0 77.1 87.8 95.2 91.3

500 75.4 78.9 77.1 87.8 95.1 91.3

800 75.3 78.2 76.7 87.8 94.9 91.2

1000 75.3 78.1 76.7 87.8 94.7 91.1

1500 75.5 77.4 76.4 87.7 94.7 91.1

2000 75.5 76.7 76.1 87.7 94.7 91.1
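For reference, a maximum entropy model trained with a Gaussian prior is commonly written as below. The notation (feature functions f_i, weights λ_i, training set D, variance σ²) is standard textbook notation rather than something given on the slides, and the slides do not say whether the value swept from 20 to 2000 is σ, σ², or a toolkit-specific parameter.

```latex
p_\lambda(y \mid x) =
  \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}
       {\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)}

L(\lambda) = \sum_{(x, y) \in D} \log p_\lambda(y \mid x)
             - \sum_i \frac{\lambda_i^2}{2\sigma^2}
```

Under this formulation a smaller variance pulls the weights more strongly toward zero, and a larger one approaches unregularized training.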

Experimental results (Hishiki)

Features   Gene: P R F   Disease: P R F

Name, context (W) 76.6 83.4 79.8 89.1 95.6 92.3

Caps Info 73.5 68.1 70.7 78.0 99.4 87.4

Digit Info. 63.7 86.8 73.5 77.9 99.5 87.4

Greek 63.2 84.4 72.3 77.9 99.5 87.4

Affix 62.9 83.7 71.8 78.0 99.5 87.4

POS 64.4 78.9 70.9 78.1 99.2 87.4

W+Caps Info. 80.7 84.6 82.6 87.8 98.2 92.7

W+Digit Info. 79.0 83.9 81.3 87.7 98.2 92.7

W+Greek 75.2 84.1 79.4 87.6 98.3 92.6

W+Affix 75.0 84.9 79.7 87.7 98.2 92.7

W+D+G 79.7 84.2 81.9 87.7 98.2 92.7

W+C+D 80.7 84.6 82.6 87.7 98.2 92.7

W+C+G 80.4 84.1 82.2 87.7 98.2 92.7

W+A+C 80.6 84.2 82.4 87.8 98.2 92.7

W+A+D 78.9 84.1 81.4 87.8 98.2 92.7

W+A+G 75.0 83.9 79.2 87.6 98.2 92.6

W+C+D+G 80.5 83.7 82.1 87.8 98.2 92.7

W+A+C+D 80.5 84.2 82.3 88.0 98.3 92.9

W+A+C+G 80.3 84.1 82.1 87.8 98.2 92.7

W+A+D+G 79.5 83.9 81.6 87.9 98.3 92.8

W+A+C+D+G 80.5 83.9 82.2 87.9 98.2 92.8

Experimental results (Hishiki)

Features   Gene: P R F   Disease: P R F

Name, context (W) 76.6 83.4 79.8 89.1 95.6 92.3

W+POS of NE 76.3 84.2 80.1 87.7 97.6 92.0

W+POS(NE,uni) 75.9 82.3 79.0 88.6 95.8 92.1

W+POS(NE,uni,bi) 76.0 79.4 77.6 87.8 94.9 91.2

W+Caps Info. 80.7 84.6 82.6 87.8 98.2 92.7

W+C+POS 81.0 83.5 82.3 87.8 97.6 92.4

W+C+POS1 80.0 82.5 81.2 88.6 95.6 92.0

W+C+POS2 77.2 78.9 78.0 88.4 95.1 91.7

W+C+D 80.7 84.6 82.6 87.7 98.2 92.7

W+C+D+POS 80.8 83.0 81.9 87.6 97.6 92.3

W+C+D+POS1 79.9 82.5 81.2 88.7 95.9 92.2

W+C+D+POS2 77.2 79.2 78.2 88.4 95.1 91.7

W+A+C+D 80.5 84.2 82.3 88.0 98.3 92.9

W+A+C+D+POS 81.0 83.4 82.2 87.8 97.6 92.4

W+A+C+D+POS1 79.8 82.3 81.1 88.8 95.8 92.2

W+A+C+D+POS2 77.0 79.0 78.0 88.1 94.5 91.2

Experimental results (Nagata)

Features   Gene: P R F   Disease: P R F

Name, context (W) 82.7 88.3 85.4 89.7 95.0 92.3

Caps Info. 73.4 88.8 80.4 82.1 99.4 89.9

Digit Info. 72.2 89.7 80.0 82.1 99.5 90.0

Greek 71.7 86.2 78.3 82.1 99.5 90.0

Affix 71.6 85.1 77.8 82.1 99.5 90.0

POS 72.8 86.3 79.0 82.2 99.3 89.9

W+Caps Info. 86.4 90.2 88.3 88.5 97.8 92.9

W+Digit Info. 82.2 91.2 86.5 88.5 97.9 93.0

W+Greek 80.9 92.0 86.1 88.6 98.1 93.1

W+Affix 80.4 92.0 85.8 88.6 98.1 93.1

W+D+G 82.7 91.4 86.8 88.6 98.2 93.1

W+C+D 85.9 90.2 88.0 88.5 97.7 92.9

W+C+G 86.2 90.6 88.4 88.5 97.7 92.9

W+A+C 86.0 90.2 88.1 88.5 97.8 92.9

W+A+D 82.3 91.4 86.6 88.5 98.1 93.0

W+A+G 80.7 91.5 85.8 88.6 98.2 93.1

W+C+D+G 86.1 90.8 88.4 88.6 98.1 93.1

W+A+C+D 85.9 90.2 88.0 88.5 97.8 92.9

W+A+C+G 86.2 90.5 88.3 88.7 98.1 93.1

W+A+D+G 82.6 91.4 86.8 88.7 98.1 93.1

W+A+C+D+G 85.7 90.6 88.1 88.6 97.8 93.0

Experimental results (Nagata)

Features   Gene: P R F   Disease: P R F

Name, context (W) 82.7 88.3 85.4 89.7 95.0 92.3

W+POS 81.5 90.6 85.8 88.5 96.0 92.1

W+POS1 81.7 90.6 85.9 89.8 95.5 92.6

W+POS2 81.8 86.3 84.0 89.3 95.4 92.2

W+Caps Info. 86.4 90.2 88.3 88.5 97.8 92.9

W+C+POS 86.3 89.4 87.8 88.6 97.0 92.6

W+C+POS1 85.9 90.2 88.0 90.0 96.1 92.9

W+C+POS2 85.7 87.5 86.6 89.4 95.2 92.2

W+C+D+G 86.1 90.8 88.4 88.6 98.1 93.1

W+C+D+G+POS 86.5 89.1 87.8 88.6 97.1 92.6

W+C+D+G+POS1 85.5 89.8 87.6 89.9 96.1 92.9

W+C+D+G+POS2 85.3 87.5 86.4 89.5 95.1 92.2

W+C+G+POS 86.7 89.1 87.9 88.5 97.0 92.6

W+C+G+POS1 85.6 89.7 87.6 89.8 96.3 92.9

W+C+G+POS2 85.2 87.5 86.3 89.2 95.0 92.0

Prefix and suffix

Important cue for terminology identification

~cin: actinomycin

~mide: cycloheximide

~zole: sulphamethoxazole

~lipid: phospholipids

~rogen: estrogen

~vitamin: dihydroxyvitamin

etc …
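A suffix cue of this kind can be turned into a feature with a simple string test. The sketch below uses only the six endings shown on this slide (the full list is not given), and the function name `suffix_cue` and the plural-stripping step are assumptions added for illustration.

```python
SUFFIX_CUES = ("cin", "mide", "zole", "lipid", "rogen", "vitamin")


def suffix_cue(token, cues=SUFFIX_CUES):
    """Return the longest cue that ends the token, or None if no cue matches."""
    word = token.lower()
    # Assumption: drop a plural "s" so that "phospholipids" still matches "lipid".
    stem = word[:-1] if word.endswith("s") and len(word) > 3 else word
    for cue in sorted(cues, key=len, reverse=True):
        if word.endswith(cue) or stem.endswith(cue):
            return cue
    return None


for example in ["actinomycin", "cycloheximide", "phospholipids", "protein"]:
    print(example, suffix_cue(example))
```

The returned cue (or its absence) can then be used as one more indicator feature alongside the local context and capitalization features.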