Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature

Deyu Zhou, Yulan He and Chee Keong Kwoh

School of Computer Engineering

Nanyang Technological University, Singapore

30 August 2007

Outline

• Protein-protein interactions (PPIs) extraction

• Hidden Vector State (HVS) model for PPIs extraction

• Reranking approaches

• Experimental results

• Conclusions

Protein-Protein Interactions Extraction

[Diagram: Protein, Interact, Protein]

Spc97p interacts with Spc98 and Tub4 in the two-hybrid system

Spc97p interact Spc98

Spc97p interact Tub4

Existing Approaches

Ordered from simple to complicated:

• Statistics Methods

• Pattern Matching

• Parsing-Based

An example

However, unlike another tumor suppressor protein, p53, Rb did not have any significant effect on basal levels of transcription, suggesting that Rb specifically interacts with IE2 rather ...

Part-of-speech tagging

However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, p53/NN ,/, Rb/NN did/VBD not/RB have/VB any/DT significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, suggesting/VBG that/IN Rb/NN specifically/RB interacts/VBZ with/IN IE2/NN rather/RB ...

Protein name identification

However/RB ,/, unlike/IN another/DT tumor/NN suppressor/NN protein/NN ,/, PROTEIN(p53/NN) ,/, PROTEIN(Rb/NN) did/VBD not/RB have/VB any/DT significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN transcription/NN ,/, suggesting/VBG that/IN PROTEIN(Rb/NN) specifically/RB interacts/VBZ with/IN PROTEIN(IE2/NN) rather/RB ...
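The protein name identification step can be illustrated with a dictionary lookup over the POS-tagged tokens. This is only a minimal sketch: the lexicon `PROTEIN_NAMES` and the function `mark_proteins` are hypothetical, and the slides do not say which name recognizer was actually used.

```python
# Hypothetical lexicon for illustration; the slides do not specify the
# actual protein name recognizer.
PROTEIN_NAMES = {"p53", "Rb", "IE2"}


def mark_proteins(tagged_sentence: str) -> str:
    """Wrap every word/TAG token whose word is a known protein in PROTEIN(...)."""
    out = []
    for token in tagged_sentence.split():
        # Split on the last "/" so that tokens such as ",/," are handled too.
        word, _, _tag = token.rpartition("/")
        out.append(f"PROTEIN({token})" if word in PROTEIN_NAMES else token)
    return " ".join(out)


tagged = ("suggesting/VBG that/IN Rb/NN specifically/RB "
          "interacts/VBZ with/IN IE2/NN rather/RB")
marked = mark_proteins(tagged)
print(marked)
```

A real system would need a recognizer that copes with unseen and multi-word protein names; the lookup above only reproduces the bracketing shown in the example.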

Statistics-Based Approaches

Sentence level statistic

Relation      Occurrence
(p53, Rb)     +1
(p53, IE2)    +1
(Rb, IE2)     +1

Corpus level statistic

Relation      Occurrence
(p53, IE2)    81
...           6

Relation      Confidence
(p53, IE2)    75%
...           ...

Predefined threshold a = 7
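The counting idea behind the statistics-based approach can be sketched as follows. The assumptions are loud ones: each unordered protein pair that shares a sentence adds one to a corpus-level count, and a pair is kept when its count reaches the threshold (whether the slides mean "at least" or "more than" is not stated). The toy corpus is made up, and the confidence column is not modelled because its definition is not given.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentences):
    """Corpus-level counts: +1 for every unordered protein pair sharing a sentence."""
    counts = Counter()
    for proteins in sentences:
        # sorted(set(...)) counts each pair once per sentence, in a fixed order.
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return counts


def select_relations(counts, threshold):
    """Keep pairs whose count reaches the threshold (assumed: count >= threshold)."""
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Made-up toy corpus: the protein names found in each sentence.
toy_corpus = [["p53", "Rb", "IE2"], ["Rb", "IE2"], ["Rb", "IE2"]]
counts = count_cooccurrences(toy_corpus)
print(select_relations(counts, threshold=2))
```

Counting alone cannot tell that p53 and IE2 merely co-occur in the example sentence without interacting, which is the weakness the later approaches address.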

Pattern Matching Approaches

Pattern matching

Pattern 1: Protein [*] interact[s] with protein

Rb interact IE2

p53 interact IE2

Pattern 2: protein RB VBZ WITH protein

Rb interact IE2
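The two patterns can be approximated with regular expressions over the tagged sentence. This is a sketch under assumptions: `[*]` is read as an arbitrary gap, `WITH` as the literal word "with", and the regexes and helper `extract` are illustrative rather than the patterns of any particular system.

```python
import re

# One protein mention as produced by the name identification step.
PROT = r"PROTEIN\((\w+)/\w+\)"

# Pattern 1: Protein [*] interact[s] with protein (lexical, arbitrary gap).
# The lookahead lets every earlier protein pair up with the same object.
PATTERN_1 = re.compile(PROT + r"(?=.*?\binteracts?/\w+ with/\w+ " + PROT + ")")

# Pattern 2: protein RB VBZ WITH protein (POS-tag based, no gap allowed).
PATTERN_2 = re.compile(PROT + r" \S+/RB \S+/VBZ with/IN " + PROT)


def extract(pattern, sentence):
    """Return the distinct (protein, protein) pairs matched, in order of appearance."""
    pairs = []
    for a, b in pattern.findall(sentence):
        if (a, b) not in pairs:
            pairs.append((a, b))
    return pairs


sentence = ("PROTEIN(p53/NN) ,/, PROTEIN(Rb/NN) did/VBD not/RB have/VB any/DT "
            "significant/JJ effect/NN on/IN basal/JJ levels/NNS of/IN "
            "transcription/NN ,/, suggesting/VBG that/IN PROTEIN(Rb/NN) "
            "specifically/RB interacts/VBZ with/IN PROTEIN(IE2/NN) rather/RB")

print(extract(PATTERN_1, sentence))
print(extract(PATTERN_2, sentence))
```

On this sentence the loose pattern also pairs p53 with IE2, while the stricter tag-based pattern returns only the Rb and IE2 pair.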

Parsing-Based Approaches

Syntactic processing

...Rb specifically interacts with IE2...

N ADV V P N

[Parse tree with phrase nodes NP, PP, VP, VP]

Semantic processing

(<INTERACT><THE Rb PROTEIN><THE IE2 PROTEIN>)

Rb interact IE2

Semantic Parser

For each candidate word string Wn, we need to compute the most likely set of embedded concepts:

$$\hat{C} = \arg\max_{C} P(C \mid W_n) = \arg\max_{C} P(C)\, P(W_n \mid C)$$

where P(C) is the semantic model and P(Wn|C) is the lexical model.

We could use a simple finite state tagger … it can be robustly trained using EM, but the model is too weak to represent embeddings in natural language.

Word(s)                      Tag
<s>                          SS
Spc97p                       PROTEIN
interacts                    INTERACT
with                         DUMMY
Spc98                        PROTEIN
and                          DUMMY
Tub4                         PROTEIN
in the two-hybrid system     DUMMY
</s>                         SE

Perhaps use some form of hierarchical HMM in which each state is a terminal or a nested HMM … but when using EM, models rarely converge on good solutions and, in practice, direct maximum-likelihood estimation from “tree-bank” data is needed to train the models.

Spc97p interacts with Spc98 and Tub4 in the two-hybrid system

[Parse tree over the sentence; node labels: S, INTERACTION, SUBJECT, OBJECT, OBJECT, PROTEIN, INTERACT, PREP, PROTEIN, AND, PROTEIN, DUMMY]

Hidden Vector State Model

<s> Spc97p interacts with Spc98 and Tub4 in the two-hybrid system </s>

Word(s)                      Vector state (top of stack first)
<s>                          SS
Spc97p                       PROTEIN, SS
interacts                    INTERACT, PROTEIN, SS
with                         DUMMY, INTERACT, PROTEIN, SS
Spc98                        PROTEIN, INTERACT, PROTEIN, SS
and                          DUMMY, INTERACT, PROTEIN, SS
Tub4                         PROTEIN, INTERACT, PROTEIN, SS
in the two-hybrid system     DUMMY, SS
</s>                         SE, SS

The HVS model is an HMM in which the states correspond to the stack of a push-down automaton with a bounded stack size …

… this is a very convenient framework for applying constraints


HVS model transition constraints:

• finite stack depth D

• push only one non-terminal semantic concept onto the stack at each step

… model defined by three simple probability tables

$$\hat{C} = \arg\max_{C,N} \prod_{t} P(n_t \mid C_{t-1})\, P(C_t[1] \mid C_t[2..D_t])\, P(W_t \mid C_t)$$

Parsing with the HVS model

Example: … with Spc98 and Tub4 … (stacks are written top first)

1) P(nt|Ct-1): pop nt elements from the previous stack state, here nt = 1

PROTEIN, INTERACT, PROTEIN, SS → INTERACT, PROTEIN, SS

2) P(Ct[1]|Ct[2..Dt]): push one pre-terminal semantic concept onto the stack

INTERACT, PROTEIN, SS → DUMMY, INTERACT, PROTEIN, SS

3) P(Wt|Ct): generate the next word
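The pop-then-push transition can be sketched in a few lines. The pop counts in `MOVES` are inferred from the stacks shown for the example sentence, the depth bound of 4 is an assumption, and the trailing words "in the two-hybrid system" are treated as one DUMMY state as in the figure; the probability tables themselves are not modelled.

```python
from typing import List, Tuple

Stack = Tuple[str, ...]  # index 0 is the top of the stack


def hvs_step(prev: Stack, n_pop: int, concept: str, max_depth: int) -> Stack:
    """One HVS transition: pop n_pop concepts, then push exactly one new concept."""
    if not 0 <= n_pop <= len(prev):
        raise ValueError("cannot pop more elements than the stack holds")
    new_stack = (concept,) + prev[n_pop:]
    if len(new_stack) > max_depth:
        raise ValueError("stack depth bound D exceeded")
    return new_stack


def vector_states(moves: List[Tuple[int, str]], max_depth: int = 4) -> List[Stack]:
    """Apply (n_pop, concept) moves starting from the sentence-start state SS."""
    states: List[Stack] = [("SS",)]
    for n_pop, concept in moves:
        states.append(hvs_step(states[-1], n_pop, concept, max_depth))
    return states


# Moves inferred from the example (an assumption, not given explicitly):
# Spc97p, interacts, with, Spc98, and, Tub4, "in the two-hybrid system", </s>
MOVES = [(0, "PROTEIN"), (0, "INTERACT"), (0, "DUMMY"), (1, "PROTEIN"),
         (1, "DUMMY"), (1, "PROTEIN"), (3, "DUMMY"), (1, "SE")]

states = vector_states(MOVES)
for state in states:
    print(", ".join(state))
```

Because every state is a bounded stack, the set of possible states is finite, which is what lets the model be treated as an ordinary HMM.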

Train using EM and apply constraints

Training text: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, and SKR-7 in yeast two-hybrid system

Abstract semantic annotation: PROTEIN ( INTERACT ( PROTEIN ) )

[Diagram components: Training text, Data Constraints, EM Parameter Estimation, HVS Model Parameters, Parse Statistics]

Limit forward-backward search to only include states which are consistent with the constraints

Reranking Methodology

• Reranking approaches attempt to improve upon an existing probabilistic parser by reranking the output of the parser.

• Reranking has benefited applications such as named-entity extraction, semantic parsing and semantic labeling.

• Here it is used to rerank the parses generated by the HVS model for protein-protein interaction extraction.

Architecture

[Diagram components: Annotated Corpus E, Test Data, Training (twice), HVS model, Semantic Parsing, Parse results, Reranking Model, Reranking, Ranked 1st parse, Extracted protein-protein interactions]

Features: Parsing Information IP, Structure Information IS, Complexity Information IC, ...

Reranking approaches

• Features for Reranking

Suppose sentence S_i has its corresponding parse set C_i = {C_ij, j = 1, ..., N}

– Parsing Information

– Structure Information

– Complexity Information

Reranking approaches

Score is defined as [equation not preserved in the transcript]

• Log-linear regression model

• Neural Network

• Support Vector Machines
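As an illustration of reranking with a linear score over features, the sketch below scores each candidate parse and sorts the candidates. The feature names, weights and candidate values are all hypothetical; the actual score definition and the features used in the work are not recoverable from the slides.

```python
from typing import Dict, List

# Hypothetical feature names and weights, for illustration only.
WEIGHTS: Dict[str, float] = {
    "parse_logprob": 1.0,   # stands in for the parsing information
    "structure": 0.5,       # stands in for the structure information
    "complexity": -0.2,     # stands in for the complexity information
}


def score(features: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of feature values, the log-score form of a log-linear model."""
    return sum(weights[name] * value for name, value in features.items())


def rerank(candidates: List[dict]) -> List[dict]:
    """Return the candidate parses sorted from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)


# Made-up candidates for one sentence: c1 is the parser's own first choice.
candidates = [
    {"id": "c1", "features": {"parse_logprob": -10.0, "structure": 1.0, "complexity": 3.0}},
    {"id": "c2", "features": {"parse_logprob": -10.5, "structure": 4.0, "complexity": 2.0}},
]
best = rerank(candidates)[0]
print(best["id"])
```

In this toy case the candidate with the lower parser probability wins once the other features are taken into account, which is the effect reranking aims for. In practice the weights would be learned from the annotated training data rather than set by hand.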

Experiments

• Setup

– Corpus I

• comprises 300 abstracts randomly retrieved from the GENIA corpus

• GENIA is a collection of research abstracts selected from the search results of the MEDLINE database with the keywords (MeSH terms) “human, blood cells and transcription factors”

• split into two parts:

– Part I contains 1500 sentences (training data)

– Part II consists of 1000 sentences (test data)

Experimental Results

Figure 1: F-measure vs number of candidate parses.

Experimental Results (cont’d)

Experiments   Recall (%)   Precision (%)   F-Score (%)
Baseline      55.8         55.6            55.7
SVM           59.1         60.2            59.7
NN            57.9         61.8            59.8
LLR           58.5         61.2            59.8

Table 3: Results based on the interaction category.
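The reported F-scores can be cross-checked against recall and precision. The sketch assumes the F-score is the balanced harmonic mean of the two (the slides do not state the formula) and that the flattened figures pair up per system as listed in `RESULTS`, which is an inference from the transcript.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score: harmonic mean of recall and precision (assumed formula)."""
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F-score) per system, in percent.
# The pairing of values to systems is inferred from the flattened table.
RESULTS = {
    "Baseline": (55.8, 55.6, 55.7),
    "SVM": (59.1, 60.2, 59.7),
    "NN": (57.9, 61.8, 59.8),
    "LLR": (58.5, 61.2, 59.8),
}

for name, (recall, precision, reported) in RESULTS.items():
    computed = f_score(recall, precision)
    print(f"{name}: computed F = {computed:.2f}, reported F = {reported}")
```

Small differences in the last digit are expected, since the recall and precision values are themselves rounded.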

Conclusions

• Three reranking methods were presented for the HVS model in the application of extracting protein-protein interactions from biomedical literature.

• Experimental results show that a 4% relative improvement in F-measure can be obtained by reranking the semantic parse results.

• Incorporating other semantic or syntactic information might give further gains.
