extracting chemical- protein interactions using long short-term memory networks · 2018-11-26 ·...
TRANSCRIPT
Extracting Chemical-Protein Interactions using Long Short-Term Memory Networks
Sérgio [email protected]
bioinformatics.ua.pt
BioCreative VI Workshop
Bethesda, 18-20 October 2017
Problem
• Detect relations between chemical compounds/drugs and genes/proteins in PubMed abstracts
• Gold-standard entities provided
• Five relation classes
Methods: Overview
• Multi-class classification
• Each possible chemical-gene pair in each sentence considered as an instance
• Labeled as one of five classes or negative
• Dependency features + linear sentence features
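As a rough illustration of the instance generation described above, the sketch below pairs every chemical with every gene mention in one sentence and assigns either a relation class or a negative label. The entity dictionaries, the `NEGATIVE` label and the function name are hypothetical and are not taken from the slides.

```python
from itertools import product

NEGATIVE = "NEGATIVE"  # hypothetical label for pairs without an annotated relation


def build_instances(entities, gold_relations):
    """Pair every chemical with every gene in one sentence and label each pair.

    entities: list of dicts with hypothetical keys "id" and "type".
    gold_relations: dict mapping (chemical_id, gene_id) to a relation class.
    """
    chemicals = [e["id"] for e in entities if e["type"] == "CHEMICAL"]
    genes = [e["id"] for e in entities if e["type"] == "GENE"]
    return [
        (chem, gene, gold_relations.get((chem, gene), NEGATIVE))
        for chem, gene in product(chemicals, genes)
    ]


# Toy sentence with one chemical and two genes (illustrative data only).
sentence_entities = [
    {"id": "T1", "type": "CHEMICAL"},
    {"id": "T2", "type": "GENE"},
    {"id": "T3", "type": "GENE"},
]
gold = {("T1", "T2"): "CPR:4"}

instances = build_instances(sentence_entities, gold)
for instance in instances:
    print(instance)
```

Each resulting triple is one training instance for the multi-class classifier; pairs with no gold relation become negatives.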
Methods: Preprocessing
• Extract each chemical-protein pair
• Extract shortest path
– TEES (https://github.com/jbjorne/TEES) - BLLIP parser
– Datasets preprocessed using command line tool: XML output
– Create sentence graph and extract SP (NetworkX)
– Gold-standard entity annotations mapped to tokens
– Use head word for multi-word entity mentions
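The shortest-path step can be sketched as follows. The slides mention NetworkX for this; to keep the example dependency-free it uses a plain breadth-first search instead, which finds the same path on an unweighted graph. The edge list is an assumption: it is built to reproduce the blinded example sentence from the features slide, with a few extra edges (`det`, `auxpass`, `amod`) added for illustration. Words stand in for tokens here, whereas a real pipeline would likely use token indices so that repeated words stay distinct.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Return (words, dependency labels) on the shortest undirected path."""
    adjacency = {}
    for head, dependent, label in edges:
        adjacency.setdefault(head, []).append((dependent, label))
        adjacency.setdefault(dependent, []).append((head, label))

    # Breadth-first search, remembering how each node was reached.
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbor, label in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = (node, label)
                queue.append(neighbor)

    if target not in previous:
        return [], []  # the two entities are not connected

    # Walk back from the target to the source.
    words, labels = [target], []
    node = target
    while previous[node] is not None:
        node, label = previous[node]
        words.append(node)
        labels.append(label)
    words.reverse()
    labels.reverse()
    return words, labels


# (head, dependent, label) triples; illustrative, not parser output.
edges = [
    ("effects", "chemical", "prep_of"),
    ("effects", "the", "det"),
    ("compared", "effects", "nsubjpass"),
    ("compared", "were", "auxpass"),
    ("compared", "those", "prep_with"),
    ("those", "diclofenac", "prep_of"),
    ("diclofenac", "inhibitor", "appos"),
    ("inhibitor", "a", "det"),
    ("inhibitor", "nonselective", "amod"),
    ("inhibitor", "gene", "nn"),
]

path_words, path_labels = shortest_path(edges, "chemical", "gene")
print(" ".join(path_words))
print(" ".join(path_labels))
```

The path is searched on the undirected graph, so edge direction does not restrict which tokens can be reached.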
Methods: Features
1) words in shortest path – entities blinded
2) POS tags of words in shortest path
3) shortest path dependencies
4) up to 30 words before the first entity
5) the words between the entities
6) up to 30 words after the second entity
Methods: Features (example)
1) chemical effects compared those diclofenac inhibitor gene
2) NN NNS VBN DT NN NN NN
3) prep_of nsubjpass prep_with prep_of appos nn
4) the effects of
5) were compared with those of diclofenac a nonselective
6) inhibitor
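A minimal sketch of entity blinding and the three linear sentence features (left, middle, right) follows. It assumes each entity has already been reduced to a single head token and that the token indices of both entities are known. The raw mention names `drugX` and `geneY` are hypothetical placeholders; the remaining tokens follow the example above.

```python
def blind(tokens, first_idx, second_idx, first_type="chemical", second_type="gene"):
    """Replace the two entity tokens with their entity type."""
    blinded = list(tokens)
    blinded[first_idx] = first_type
    blinded[second_idx] = second_type
    return blinded


def linear_features(tokens, first_idx, second_idx, window=30):
    """Return (left, middle, right) word lists around an entity pair."""
    left = tokens[max(0, first_idx - window):first_idx]
    middle = tokens[first_idx + 1:second_idx]
    right = tokens[second_idx + 1:second_idx + 1 + window]
    return left, middle, right


tokens = ["the", "effects", "of", "drugX", "were", "compared", "with", "those",
          "of", "diclofenac", "a", "nonselective", "geneY", "inhibitor"]
first_idx, second_idx = 3, 12  # positions of the chemical and the gene

blinded = blind(tokens, first_idx, second_idx)
left, middle, right = linear_features(blinded, first_idx, second_idx)
print(left)
print(middle)
print(right)
```

The `window` argument caps the left and right context at 30 words, matching features 4 and 6; the middle context is not truncated.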
Data
            Training  Development   Test
Documents       1020          612   3334
Sentences      10309         6175  33854
Chemical       13017         8004  44066
Gene           12735         7563  41072
Instances                  Training  Development   Test
Total                         11953         7653  40887
upregulation/activation         595          432
downregulation/inhibition      1827          941
agonist                         140          104
antagonist                      206          182
substrate                       624          401
Deep learning classifier
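[Architecture diagram: three parallel branches, one per shortest-path input]
• shortest path words: Word embedding, Dropout, Bi-LSTM
• shortest path POS tags: POS embedding, Dropout, Bi-LSTM
• shortest path dependencies: Dependency embedding, Dropout, Bi-LSTM
• Each branch is annotated "64 units, dropout=0.1" and "0.2"
• The three branches feed a Fully connected layer, followed by the Output layer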
Embeddings
• Word embeddings
– Word2vec (Gensim)
– 15 million MEDLINE abstracts – simple tokenization
– ~775k words
– 6 model parameter combinations
  • window = 5/20/50
  • vector size = 100/300
• POS embeddings (size 200, random init)
• Dependency embeddings (size 300, random init)
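The six word-embedding settings correspond to the cross product of the three window sizes and two vector sizes listed above. The sketch below only enumerates that grid; it does not train anything. The dictionary keys mirror what I understand to be the Gensim 4 `Word2Vec` keyword names (`window`, `vector_size`), which is an assumption about the library version rather than something stated in the slides.

```python
from itertools import product

windows = [5, 20, 50]
vector_sizes = [100, 300]

# One configuration per (window, vector size) combination.
configs = [
    {"window": window, "vector_size": size}
    for window, size in product(windows, vector_sizes)
]

for config in configs:
    print(config)
print(len(configs), "configurations")
```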
Run configurations
Run  Dependency features   Sentence features     Class weights
     Word  POS  Dep        Left  Middle  Right
1 x x x x
2 x x x x x
3 x x x
4 x x x x
5 x x x x x x x
Results
     Development                   Test
Run  Precision  Recall  F-Score    Precision  Recall  F-Score
1    0.6547     0.5403  0.5919     0.6419     0.2577  0.3677
2    0.4856     0.6221  0.5449     0.5156     0.4670  0.4901
3    0.6334     0.5126  0.5664     0.5919     0.2403  0.3418
4    0.4310     0.6092  0.5047     0.4024     0.4193  0.4107
5    0.4999     0.6074  0.5470     0.5738     0.4722  0.5181
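The F-Score columns are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it from the Test precision and recall of each run; the small differences from the reported values are presumably due to rounding of the inputs.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# run: (precision, recall, reported F-score) on the test set
test_results = {
    1: (0.6419, 0.2577, 0.3677),
    2: (0.5156, 0.4670, 0.4901),
    3: (0.5919, 0.2403, 0.3418),
    4: (0.4024, 0.4193, 0.4107),
    5: (0.5738, 0.4722, 0.5181),
}

for run, (precision, recall, reported) in test_results.items():
    computed = f_score(precision, recall)
    print(f"run {run}: computed {computed:.4f}, reported {reported:.4f}")
```

Runs 1 and 3 keep high precision on the test set but lose more than half of their development recall, which is what drives their low test F-scores.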
Linear SVM: ~0.25 F-score using 1+2-grams of the same features
Results
       CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0           292    433     46     77    234
CPR:3    147    258     30      0      1      1
CPR:4    271     19    661      1      4      5
CPR:5     40      1      0     70      0      0
CPR:6     24      0      2      2    122      0
CPR:9    158      3      5      0      0    223
Results
       CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0           292    433     46     77    234
CPR:3    147    64%     30      0      1      1
CPR:4    271     19    71%      1      4      5
CPR:5     40      1      0    64%      0      0
CPR:6     24      0      2      2    84%      0
CPR:9    158      3      5      0      0    59%
Precision: 59%-84%
Results
       CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0           53%    40%    40%    39%    51%
CPR:3    147    258     30      0      1      1
CPR:4    271     19    661      1      4      5
CPR:5     40      1      0     70      0      0
CPR:6     24      0      2      2    122      0
CPR:9    158      3      5      0      0    223
Recall: 47%-61%
Conclusions / Future
• Error analysis
• Different kernels
• Network topology
• Hyper-parameters
Thank you!