
Extracting Chemical-Protein Interactions using Long Short-Term Memory Networks

Sérgio [email protected]
bioinformatics.ua.pt

BioCreative VI Workshop, Bethesda, 18-20 October 2017

Problem

• Detect relations between chemical compounds/drugs and genes/proteins in PubMed abstracts

• Gold-standard entities provided

• Five relation classes

Methods: Overview

• Multi-class classification

• Each possible chemical-gene pair in each sentence considered as an instance

• Labeled as one of five classes or negative

• Dependency features + linear sentence features
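The instance generation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the entity dictionaries, the `GENE`/`CHEMICAL` type names, the `NEGATIVE` label, and the helper names are all hypothetical.

```python
from itertools import product

def candidate_pairs(sentence_entities):
    """Return every (chemical, gene) pair found in one sentence."""
    chemicals = [e for e in sentence_entities if e["type"] == "CHEMICAL"]
    genes = [e for e in sentence_entities if e["type"] == "GENE"]
    return list(product(chemicals, genes))

def label_pair(chem, gene, gold_relations):
    """Look up the gold class for a pair; unannotated pairs become negatives."""
    return gold_relations.get((chem["id"], gene["id"]), "NEGATIVE")

# Hypothetical entities of a single sentence and one gold relation.
entities = [
    {"id": "T1", "type": "CHEMICAL", "text": "diclofenac"},
    {"id": "T2", "type": "GENE", "text": "COX-2"},
    {"id": "T3", "type": "GENE", "text": "COX-1"},
]
gold = {("T1", "T2"): "CPR:4"}

instances = [
    (c["id"], g["id"], label_pair(c, g, gold))
    for c, g in candidate_pairs(entities)
]
print(instances)
```

Every chemical is paired with every gene in the same sentence, so one sentence with one chemical and two genes yields two instances, only one of which carries a relation class.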

Methods: Preprocessing

• Extract each chemical-protein pair
• Extract shortest path
– TEES (https://github.com/jbjorne/TEES) - BLLIP parser
– Datasets preprocessed using command line tool: XML output
– Create sentence graph and extract SP (NetworkX)
– Gold-standard entity annotations mapped to tokens
– Use head word for multi-word entity mentions
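The shortest-path step can be sketched as below. The slides use NetworkX for this; the sketch swaps in a plain breadth-first search so that it runs without extra packages. The edge list is hypothetical: it is arranged to reproduce the path of the example on the features slide, the off-path edges are made up for illustration, and token strings stand in for the token indices a real pipeline would use.

```python
from collections import deque

def shortest_path(edges, source, target):
    """BFS over an undirected graph given as (head, dependent, label) triples.

    Returns the words on the path and the dependency labels between them.
    """
    adjacency = {}
    for head, dep, label in edges:
        adjacency.setdefault(head, []).append((dep, label))
        adjacency.setdefault(dep, []).append((head, label))

    previous = {source: None}  # node -> (predecessor, edge label)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour, label in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = (node, label)
                queue.append(neighbour)

    if target not in previous:
        return [], []

    # Walk back from the target to the source, then reverse.
    words, labels = [target], []
    node = target
    while previous[node] is not None:
        node, label = previous[node]
        words.append(node)
        labels.append(label)
    return words[::-1], labels[::-1]

# Hypothetical collapsed dependencies; entity mentions are already blinded.
edges = [
    ("effects", "the", "det"),
    ("effects", "chemical", "prep_of"),
    ("compared", "effects", "nsubjpass"),
    ("compared", "were", "auxpass"),
    ("compared", "those", "prep_with"),
    ("those", "diclofenac", "prep_of"),
    ("diclofenac", "inhibitor", "appos"),
    ("inhibitor", "a", "det"),
    ("inhibitor", "nonselective", "amod"),
    ("inhibitor", "gene", "nn"),
]
path_words, path_deps = shortest_path(edges, "chemical", "gene")
print(path_words)
print(path_deps)
```

The graph is treated as undirected because the path between two entities usually climbs to a common ancestor and descends again.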

Methods: Features

1) words in shortest path – entities blinded
2) POS tags of words in shortest path
3) shortest path dependencies
4) up to 30 words before the first entity
5) the words between the entities
6) up to 30 words after the second entity

Methods: Features (example)

1) chemical effects compared those diclofenac inhibitor gene
2) NN NNS VBN DT NN NN NN
3) prep_of nsubjpass prep_with prep_of appos nn
4) the effects of
5) were compared with those of diclofenac a nonselective
6) inhibitor
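The three linear sentence features (4 to 6) amount to slicing the token sequence around the two entity positions. A minimal sketch follows; the function name and the `CHEMICAL`/`GENE` placeholder tokens are hypothetical, and the token list is reassembled from the example features on the slide rather than taken from the corpus.

```python
def sentence_features(tokens, first_idx, second_idx, window=30):
    """Split a sentence into left, middle and right context of an entity pair.

    first_idx and second_idx are the token positions of the two entities,
    with first_idx < second_idx.
    """
    left = tokens[max(0, first_idx - window):first_idx]
    middle = tokens[first_idx + 1:second_idx]
    right = tokens[second_idx + 1:second_idx + 1 + window]
    return left, middle, right

# Token sequence reassembled from the slide example (entities blinded).
tokens = [
    "the", "effects", "of", "CHEMICAL", "were", "compared", "with",
    "those", "of", "diclofenac", "a", "nonselective", "GENE", "inhibitor",
]
left, middle, right = sentence_features(
    tokens, tokens.index("CHEMICAL"), tokens.index("GENE")
)
print(left)
print(middle)
print(right)
```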

Data

            Training  Development  Test
Documents   1020      612          3334
Sentences   10309     6175         33854
Chemical    13017     8004         44066
Gene        12735     7563         41072

Instances                  Training  Development  Test
Total                      11953     7653         40887
upregulation/activation    595       432
downregulation/inhibition  1827      941
agonist                    140       104
antagonist                 206       182
substrate                  624       401

Deep learning classifier

[Figure: network architecture. Three parallel branches take the shortest-path words, the shortest-path POS tags, and the shortest-path dependencies as input. Each branch consists of an embedding layer (word, POS, or dependency), a Bi-LSTM (64 units, dropout = 0.1), and a dropout layer (0.2). The branches feed a fully connected layer followed by the output layer.]
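The architecture on this slide could be expressed in Keras roughly as follows. This is a sketch under several assumptions, not the authors' implementation: the vocabulary sizes, `max_len`, the size and activation of the fully connected layer, and the reading of "dropout=0.1" as the LSTM input dropout are all guesses. The POS and dependency embedding sizes (200 and 300) and the six output classes (five relation classes plus negative) come from the slides. Only the three dependency branches shown in the figure are modelled.

```python
def build_model(word_vocab, pos_vocab, dep_vocab,
                word_dim=100, max_len=20, hidden_units=64, n_classes=6):
    """Build a three-branch Bi-LSTM classifier over shortest-path features."""
    # Imported lazily so this file can be loaded without TensorFlow installed.
    from tensorflow.keras import layers, Model

    def branch(name, vocab_size, emb_dim):
        inp = layers.Input(shape=(max_len,), name=name)
        x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(inp)
        x = layers.Bidirectional(layers.LSTM(64, dropout=0.1))(x)
        x = layers.Dropout(0.2)(x)
        return inp, x

    branches = [
        branch("sp_words", word_vocab, word_dim),  # pre-trained in the slides
        branch("sp_pos", pos_vocab, 200),          # random init
        branch("sp_deps", dep_vocab, 300),         # random init
    ]
    merged = layers.concatenate([out for _, out in branches])
    hidden = layers.Dense(hidden_units, activation="relu")(merged)
    output = layers.Dense(n_classes, activation="softmax")(hidden)
    return Model(inputs=[inp for inp, _ in branches], outputs=output)
```

A call such as `build_model(word_vocab=1000, pos_vocab=50, dep_vocab=60)` would return a model with three integer-sequence inputs and a six-way softmax output. Runs that also use the left, middle, and right sentence features would presumably add further branches of the same shape.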

Embeddings

• Word embeddings
– Word2vec (Gensim)
– 15 million MEDLINE abstracts, simple tokenization
– ~775k words
– 6 model parameter combinations
  • window = 5/20/50
  • vector size = 100/300

• POS embeddings (size 200, random init)

• Dependency embeddings (300, random init)
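The six word-embedding models correspond to the grid of three window sizes and two vector sizes. A small sketch enumerating that grid is shown below; the constant and function names are hypothetical. The dictionary keys match the `window` and `vector_size` arguments of Gensim's `Word2Vec` (the latter was called `size` before Gensim 4.0), so each dictionary could be passed on as keyword arguments when training.

```python
from itertools import product

WINDOWS = (5, 20, 50)
VECTOR_SIZES = (100, 300)

def embedding_configs():
    """Enumerate every window / vector-size combination."""
    return [
        {"window": w, "vector_size": s}
        for w, s in product(WINDOWS, VECTOR_SIZES)
    ]

configs = embedding_configs()
for cfg in configs:
    print(cfg)
```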

Run configurations

Columns: Run; dependency features (Word, POS, Dep); sentence features (Left, Middle, Right); class weights

1 x x x x

2 x x x x x

3 x x x

4 x x x x

5 x x x x x x x
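The slides do not say how the class weights were computed. One common choice, shown here purely as an assumption, is inverse-frequency ("balanced") weighting, where each class receives the weight N / (K * n_c) for N instances, K classes, and n_c instances of class c. The sketch uses the training counts from the data slide and further assumes that the total includes the negative instances.

```python
# Training-set counts from the data slide.
TRAIN_COUNTS = {
    "upregulation/activation": 595,
    "downregulation/inhibition": 1827,
    "agonist": 140,
    "antagonist": 206,
    "substrate": 624,
}
TOTAL_TRAIN_INSTANCES = 11953

def balanced_class_weights(counts, total):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    all_counts = dict(counts)
    # Assumption: every instance outside the five classes is a negative.
    all_counts["negative"] = total - sum(counts.values())
    n_classes = len(all_counts)
    return {label: total / (n_classes * n) for label, n in all_counts.items()}

weights = balanced_class_weights(TRAIN_COUNTS, TOTAL_TRAIN_INSTANCES)
for label, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{label:28s} {weight:.2f}")
```

Under this scheme the rarest class (agonist) gets the largest weight and the negative class the smallest, which pushes the classifier towards higher recall on the relation classes.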

Results

Run  Development                    Test
     Precision  Recall   F-score    Precision  Recall   F-score
1    0.6547     0.5403   0.5919     0.6419     0.2577   0.3677
2    0.4856     0.6221   0.5449     0.5156     0.4670   0.4901
3    0.6334     0.5126   0.5664     0.5919     0.2403   0.3418
4    0.4310     0.6092   0.5047     0.4024     0.4193   0.4107
5    0.4999     0.6074   0.5470     0.5738     0.4722   0.5181
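The reported F-scores are consistent with the balanced F-measure, the harmonic mean of precision and recall. As a quick check, assuming that definition, the test-set values of run 5 can be recomputed:

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Run 5, test set: precision and recall as reported in the table.
run5_test = f_score(0.5738, 0.4722)
print(round(run5_test, 4))  # 0.5181
```

Because the table lists precision and recall rounded to four decimals, recomputed values for other runs can differ from the reported F-score in the last digit.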

Linear SVM: ~0.25 F-score using 1+2 grams of same features

Results

        CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0          292    433    46     77     234
CPR:3   147    258    30     0      1      1
CPR:4   271    19     661    1      4      5
CPR:5   40     1      0      70     0      0
CPR:6   24     0      2      2      122    0
CPR:9   158    3      5      0      0      223

Results


        CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0          292    433    46     77     234
CPR:3   147    64%    30     0      1      1
CPR:4   271    19     71%    1      4      5
CPR:5   40     1      0      64%    0      0
CPR:6   24     0      2      2      84%    0
CPR:9   158    3      5      0      0      59%

Precision: 59%-84%

Results


        CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0          53%    40%    40%    39%    51%
CPR:3   147    258    30     0      1      1
CPR:4   271    19     661    1      4      5
CPR:5   40     1      0      70     0      0
CPR:6   24     0      2      2      122    0
CPR:9   158    3      5      0      0      223

Recall: 47%-61%

Conclusions / Future

• Error analysis

• Different kernels

• Network topology

• Hyper-parameters

Thank you!
