University of Texas at Austin Machine Learning Group Learning to Extract Proteins and their Interactions from Medline Abstracts Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Yuk Wah Wong Edward M. Marcotte, Arun Ramani Department of Computer Sciences Institute for Cellular and Molecular Biology University of Texas at Austin Raymond J. Mooney Department of Computer Sciences

Post on 19-Dec-2015


University of Texas at Austin

Machine Learning Group

Learning to Extract Proteins and their Interactions from Medline Abstracts

Razvan Bunescu, Ruifang Ge,

Rohit J. Kate, Yuk Wah Wong

Edward M. Marcotte,

Arun Ramani

Department of Computer Sciences

Institute for Cellular and Molecular Biology

University of Texas at Austin

Raymond J. Mooney

Department of Computer Sciences


Biological Motivation

• Human Genome Project has produced huge amounts of genetic data.

• Next step is analyzing and interpreting this data.


1 taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta accctaaccc 61 taaccctaac cctaacccta accctaaccc taaccctaac cctaacccaa ccctaaccct 121 aaccctaacc ctaaccctaa ccctaacccc taaccctaac cctaacccta accctaacct 181 aaccctaacc ctaaccctaa ccctaaccct aaccctaacc ctaaccctaa cccctaaccc 241 taaccctaaa ccctaaaccc taaccctaac cctaacccta accctaaccc caaccccaac 301 cccaacccca accccaaccc caaccctaac ccctaaccct aaccctaacc ctaccctaac 361 cctaacccta accctaaccc taaccctaac ccctaacccc taaccctaac cctaacccta 421 accctaaccc taaccctaac ccctaaccct aaccctaacc ctaaccctcg cggtaccctc 481 agccggcccg cccgcccggg tctgacctga ggagaactgt gctccgcctt cagagtacca 541 ccgaaatctg tgcagaggac aacgcagctc cgccctcgcg gtgctctccg ggtctgtgct 601 gaggagaacg caactccgcc ggcgcaggcg cagagaggcg cgccgcgccg gcgcaggcgc 661 agacacatgc tagcgcgtcg gggtggaggc gtggcgcagg cgcagagagg cgcgccgcgc 721 cggcgcaggc gcagagacac atgctaccgc gtccaggggt ggaggcgtgg cgcaggcgca 781 gagaggcgca ccgcgccggc gcaggcgcag agacacatgc tagcgcgtcc aggggtggag 841 gcgtggcgca ggcgcagaga cgcaagccta cgggcggggg ttgggggggc gtgtgttgca 901 ggagcaaagt cgcacggcgc cgggctgggg cggggggagg gtggcgccgt gcacgcgcag 961 aaactcacgt cacggtggcg cggcgcagag acgggtagaa cctcagtaat ccgaaaagcc 1021 gggatcgacc gccccttgct tgcagccggg cactacagga cccgcttgct cacggtgctg 1081 tgccagggcg ccccctgctg gcgactaggg caactgcagg gctctcttgc ttagagtggt

... 5641 gctccagggc ccgctcacct tgctcctgct ccttctgctg ctgcttctcc agctttcgct 5701 ccttcatgct gcgcagcttg gccttgccga tgcccccagc ttggcggatg gactctagca 5761 gagtggccag ccaccggagg ggtcaaccac ttccctggga gctccctgga ctggagccgg 5821 gaggtgggga acagggcaag gaggaaaggc tgctcaggca gggctgggga agcttactgt 5881 gtccaagagc ctgctgggag ggaagtcacc tcccctcaaa cgaggagccc tgcgctgggg 5941 aggccggacc tttggagact gtgtgtgggg gcctgggcac tgacttctgc aaccacctga 6001 gcgcgggcat cctgtgtgca gatactccct gcttcctctc tagcccccac cctgcagagc 6061 tggacccctg agctagccat gctctgacag tctcagttgc acacacgagc cagcagaggg 6121 gttttgtgcc acttctggat gctagggtta cactgggaga cacagcagtg aagctgaaat 6181 gaaaaatgtg ttgctgtagt ttgttattag accccttctt tccattggtt taattaggaa 6241 tggggaaccc agagcctcac ttgttcaggc tccctctgcc ctagaagtga gaagtccaga 6301 gctctacagt ttgaaaacca ctattttatg aaccaagtag aacaagatat ttgaaatgga 6361 aactattcaa aaaattgaga atttctgacc acttaacaaa cccacagaaa atccacccga 6421 gtgcactgag cacgccagaa atcaggtggc ctcaaagagc tgctcccacc tgaaggagac 6481 gcgctgctgc tgctgtcgtc ctgcctggcg ccttggccta caggggccgc ggttgagggt 6541 gggagtgggg gtgcactggc cagcacctca ggagctgggg gtggtggtgg gggcggtggg 6601 ggtggtgtta gtaccccatc ttgtaggtct gaaacacaaa gtgtggggtg tctagggaag... and 3 × 10^9 more...

Starting at the tip of chromosome 1...


Proteomics 101

• Genes code for proteins.

• Proteins are the basic components of biological machinery.

• Proteins accomplish their functions by interacting with other proteins.

• Knowledge of protein interactions is fundamental to understanding gene function.

• Chains of interactions compose large, complex gene networks.


Sample Gene Network

Yeast Gene Network

Yeast: ~5,800 genes

~5,800 proteins × 2-10 interactions/protein = ~12,000-60,000 interactions

~10,000-20,000 known ==> ~1/3 of the way to a complete map!

Human Gene Network

~40,000 genes

>>40,000 proteins × 2-10 interactions/protein

>>80,000-400,000 interactions

<5,000 known ==> approx. 1% of the complete map!

==> We’re a long way from the complete map.

Relevant Sources of Data

• Biological literature: ~14 million documents

• DNA sequence data: ~10^10 nucleotides

• Gene expression data: ~10^8 measurements, but...

• DNA polymorphisms: ~10^7 known

• Gene inactivation (knockout) studies: ~10^5

• Protein structure data: ~10^4 structures

• Protein interaction data: ~10^4 interactions, but...

• Protein expression data: ~10^4 measurements, but...

• Protein location data: ~10^4 measurements


Extraction from Biomedical Literature

• An ever-increasing wealth of biological information is present in millions of published articles, but retrieving it in structured form is difficult.

• Much of this literature is available through the NIH-NLM’s Medline repository.

• 11 million abstracts in electronic form are available through Medline.

• Excellent source of information on protein interactions.

• Need automated information extraction to easily locate and structure this information.


TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein

AB - Originally identified as a ‘mitotic cyclin’, cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor (SPF) as well as a candidate proto-oncogene.

Other recent studies have identified human cyclin D1 (PRAD1) as a putative G1 cyclin and candidate proto-oncogene.

However, the specific enzymatic activities and, hence, the precise biochemical mechanisms through which cyclins function to govern cell cycle progression remain unresolved.

In the present study we have investigated the coordinate interactions between these two potentially oncogenic cyclins, cyclin-dependent protein kinase subunits (cdks) and the Rb tumor-suppressor protein.

The distribution of cyclin D isoforms was modulated by serum factors in primary fetal rat lung epithelial cells.

Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily phosphorylated by pp60c-src in vitro.

In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.

Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity.

Immobilized, recombinant cyclins A and D1 were found to associate with cellular proteins in complexes that contain the p105Rb protein.

This study identifies several common aspects of cyclin biochemistry, including tyrosine phosphorylation and the potential to interact directly or indirectly with the Rb protein, that may ultimately relate membrane-mediated signaling events to the regulation of gene expression.


Manually Developed IE Systems for Medline

• A number of projects have focused on the manual development of information extraction (IE) systems for biomedical literature.

• KeX for extracting protein names (Fukuda et al., 1998):

Extract words with special symbols excluding those with more than half of the characters being special symbols, hence eliminating strings such as “+/−”.

• Suiseki for extracting protein interactions (Blaschke et al., 2001):

PROT (0-2) PROT (0-2) complex

NOUN between (0-3) PROT (0-3) and (0-3) PROT
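The KeX rule quoted on this slide can be sketched in a few lines. This is only an illustration of that single heuristic, not the KeX system itself: it assumes that a "special symbol" is any character that is neither a letter nor a digit, and the function names are hypothetical.

```python
def special_count(word: str) -> int:
    """Count characters that are neither letters nor digits.

    Assumption: this is what the slide means by 'special symbols'.
    """
    return sum(1 for ch in word if not ch.isalnum())


def kex_like_candidate(word: str) -> bool:
    """Keep words containing special symbols, unless more than half
    of their characters are special symbols."""
    n_special = special_count(word)
    return n_special > 0 and 2 * n_special <= len(word)


words = ["pp60c-src", "NF-kappa-B", "+/-", "cyclin"]
candidates = [w for w in words if kex_like_candidate(w)]
```

Here "+/-" is rejected because all of its characters are special, and "cyclin" is not proposed by this rule because it contains none.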


Learning Information Extractors

• Manually developing IE systems is tedious and time-consuming, and the resulting systems do not capture all possible formats and contexts for the desired information.

• Machine learning from supervised corpora is becoming the standard approach to building information extractors.

• Recently, several learning approaches have been applied to Medline extraction (Craven & Kumlien, 1999; Tanabe & Wilbur, 2002; Raychaudhuri et al., 2002).

• We have explored the use of a variety of machine learning techniques to develop IE systems for extracting human protein names and interactions, presenting uniform results on a single, reasonably large, human-annotated corpus.

Non-Learning Protein Extractors

• Dictionary-based extraction

• KeX (Fukuda et al., 1998)

Learning Methods for Protein Extraction

• Rule-based pattern induction

– Rapier (Califf & Mooney, 1999)

– BWI (Freitag & Kushmerick, 2000)

• Token classification (chunking approach):

– K-nearest neighbor

– Transformation-Based Learning: ABGene (Tanabe & Wilbur, 2002)

– Support Vector Machine

– Maximum entropy

• Hidden Markov Models

• Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001)

• Relational Markov Networks (Taskar, Abbeel, and Koller, 2002)


Our Biomedical Corpora

• 750 abstracts that contain the word “human” were randomly chosen from Medline for testing protein name extraction. They contain a total of 5,206 protein references.

• 200 abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins. They contain 1,101 interactions and 4,141 protein names.

• As negative examples for interaction extraction are rare, an extra set of 30 abstracts containing sentences with non-interacting proteins is included.

• The resulting 230 abstracts are used for testing protein interaction extraction.

The Yapex Corpus

• 200 abstracts from Medline, manually tagged for protein names.

• 147 randomly chosen such that they contain the MeSH terms “protein binding”, “interaction”, “molecular”.

• 53 randomly chosen from the GENIA corpus.

http://www.sics.se/humle/projects/prothalt/


Evaluation Metrics for Information Extraction

• Precision is the percentage of extracted items that are correct.

• Recall is the percentage of correct items that are extracted.

• Extracted protein names are considered correct if the same character sequences have been human-tagged as protein names in the exact positions.

• Extracted protein interactions from an abstract are considered correct if both proteins have been human-tagged as interacting in that abstract. Positions are not taken into account.
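A minimal sketch of the two matching criteria above. Protein names are compared as (start, end, text) triples so that position matters; interactions are compared as position-free pairs within one abstract. The offsets and predictions are invented for illustration, and treating an interaction as an unordered pair is an assumption.

```python
def precision_recall(extracted: set, gold: set) -> tuple:
    """Precision = correct / extracted, recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Protein names: exact character sequence at the exact position.
# (start, end, text) offsets are made up for this example.
gold_names = {(35, 43, "cyclin A"), (48, 57, "cyclin D1"), (170, 172, "Rb")}
pred_names = {(35, 43, "cyclin A"), (48, 54, "cyclin"),
              (170, 172, "Rb"), (10, 19, "oncogenic")}
name_p, name_r = precision_recall(pred_names, gold_names)

# Interactions: pairs of protein names per abstract, positions ignored.
gold_pairs = {frozenset(("cyclin D1", "p34cdc2")),
              frozenset(("cyclin D1", "p33cdk2"))}
pred_pairs = {frozenset(("p34cdc2", "cyclin D1"))}
pair_p, pair_r = precision_recall(pred_pairs, gold_pairs)
```

The partial match "cyclin" at (48, 54) counts as an error under the exact-position rule, which is why name precision drops to 2 of 4.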


Dictionary as Source of Domain Knowledge

• Before applying machine learning, abstracts are tagged by matching n-grams against entries from a dictionary. Tagged abstracts are used as input for subsequent methods.

• A dictionary of 42,000 protein names is used (synonyms included).

• Generalization of protein names leads to increased coverage:

Original protein name => generalized name:

Interleukin-1 beta => Interleukin num greek

Interferon alpha-D => Interferon greek roman

NF-IL6-beta => NF IL num greek

TR2 => TR num
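The slide does not give the generalization rules, so the following is only a guess that reproduces the four examples: split a name into letter runs and digit runs, then replace digits by "num", spelled-out Greek letters by "greek", and uppercase Roman numerals by "roman". The Greek list and the function name are assumptions.

```python
import re

# Assumed, incomplete list of spelled-out Greek letters.
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon",
         "kappa", "lambda", "sigma", "omega"}
# Strict uppercase Roman numeral pattern ("D" matches, "IL" does not).
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def generalize(name: str) -> str:
    """Map a protein name to a generalized token sequence."""
    tokens = re.findall(r"[A-Za-z]+|\d+", name)  # drops hyphens and spaces
    out = []
    for tok in tokens:
        if tok.isdigit():
            out.append("num")
        elif tok.lower() in GREEK:
            out.append("greek")
        elif ROMAN.fullmatch(tok):
            out.append("roman")
        else:
            out.append(tok)
    return " ".join(out)
```

Matching generalized dictionary entries against generalized text n-grams lets "Interleukin-2 alpha" hit an entry that was only listed as "Interleukin-1 beta", which is where the extra coverage comes from.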


Rule-based Learning Algorithms: Rapier and BWI

• Rule-based learning algorithms are used for inducing patterns for extracting protein names.

• For Rapier (Califf & Mooney, 1999), each rule consists of a pre-filler pattern, a filler pattern and a post-filler pattern.

[ human ] [ (2) transcriptase ] [ ( ]

• For BWI (Freitag & Kushmerick, 2000), rules are composed of contextual patterns called wrappers, recognizing the start or end of a protein name.

[ human ] [] [ transcriptase ] [ ( ]

• High precision (> 70%) but low recall (< 25%).

Hidden Markov Models

• We use part-of-speech information in HMMs as described in (Ray & Craven, 2001).

• We train a positive model that generates sentences containing proteins, and a null model that generates sentences containing no proteins.

• Select the model which gives the highest likelihood of generating a particular sentence, and tag the sentence using the Viterbi path in that model.

• Moderate precision (~60%) and moderate recall (~40%).

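[Figure: state diagrams of the two HMMs. The positive model runs from START to END through states that include NN:PROT and NN; the null model runs from START to END through states such as NN.]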
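The model-selection step can be sketched with two toy HMMs. Everything numeric here is invented, and the models are much simpler than those of Ray & Craven: states emit words only (no part-of-speech tags), the positive model has a PROT and an OTHER state, and the null model has a single OTHER state.

```python
import math


def forward_loglik(obs, model):
    """Log-likelihood of obs under an HMM, via the forward algorithm."""
    states, start, trans, emit = model
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states)
                 * emit[s].get(o, 0.0) for s in states}
    total = sum(alpha.values())
    return math.log(total) if total > 0 else float("-inf")


def viterbi(obs, model):
    """Most likely state sequence for obs under the given HMM."""
    states, start, trans, emit = model
    best = {s: (start[s] * emit[s].get(obs[0], 0.0), [s]) for s in states}
    for o in obs[1:]:
        step = {}
        for s in states:
            prob, path = max(((best[r][0] * trans[r][s], best[r][1])
                              for r in states), key=lambda item: item[0])
            step[s] = (prob * emit[s].get(o, 0.0), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Toy parameters (assumptions, not estimated from any corpus).
POSITIVE = (
    ["PROT", "OTHER"],
    {"PROT": 0.5, "OTHER": 0.5},
    {"PROT": {"PROT": 0.3, "OTHER": 0.7},
     "OTHER": {"PROT": 0.5, "OTHER": 0.5}},
    {"PROT": {"cyclin": 0.5, "Rb": 0.5},
     "OTHER": {"binds": 0.6, "cells": 0.2, "grow": 0.2}},
)
NULL = (
    ["OTHER"],
    {"OTHER": 1.0},
    {"OTHER": {"OTHER": 1.0}},
    {"OTHER": {"cells": 0.4, "grow": 0.4, "binds": 0.1,
               "cyclin": 0.05, "Rb": 0.05}},
)


def tag_sentence(words):
    """Pick the model with the higher likelihood, then tag with its Viterbi path."""
    scores = {"positive": forward_loglik(words, POSITIVE),
              "null": forward_loglik(words, NULL)}
    name = max(scores, key=scores.get)
    model = POSITIVE if name == "positive" else NULL
    return name, viterbi(words, model)
```

With these toy numbers, `tag_sentence(["cyclin", "binds", "Rb"])` selects the positive model, while `tag_sentence(["cells", "grow"])` falls back to the null model and tags nothing as a protein.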

Name Extraction by Token Classification (“Chunking” Approach)

• Since in our data no protein names directly abut each other, we can reduce the extraction problem to classification of individual words as being part of a protein name or not.

• Protein names are extracted by identifying the longest sequences of words classified as being part of a protein name.

Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share common properties of subunit configuration , tyrosine phosphorylation and physical association with the Rb protein
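Once every token has a binary label, extraction reduces to collecting maximal runs of positive tokens. A small sketch on the start of the example sentence; the labels are written by hand here, standing in for the output of a token classifier.

```python
def extract_names(tokens, is_protein):
    """Return the longest contiguous runs of tokens labeled as protein."""
    names, current = [], []
    for tok, flag in zip(tokens, is_protein):
        if flag:
            current.append(tok)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the sentence
        names.append(" ".join(current))
    return names


tokens = "Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share".split()
labels = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0]  # hand-written for illustration
names = extract_names(tokens, labels)
```

This only works because no two protein names are directly adjacent in the data; otherwise two neighboring names would be merged into one run.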

Constructing Feature Vectors for Classification

Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share common properties of subunit configuration , tyrosine phosphorylation and physical association with the Rb protein

• For each token, we take the following as features:

– Current token

– Last 2 tokens and next 2 tokens

– Output of dictionary-based tagger for these 5 tokens

– Suffix for each of the 5 tokens (last 1, 2, and 3 characters)

– Class labels for last 2 tokens
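The feature list above can be turned into a feature dictionary roughly as follows. The feature names, the "<PAD>" marker for positions outside the sentence, and the "PROT"/"O" dictionary tags are all assumptions made for this sketch.

```python
def token_features(tokens, dict_tags, prev_labels, i):
    """Features for token i: a 5-token window, dictionary tags,
    suffixes of length 1-3, and the labels of the 2 previous tokens."""
    feats = {}
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < len(tokens)
        tok = tokens[j] if inside else "<PAD>"
        feats[f"tok[{offset}]"] = tok
        feats[f"dict[{offset}]"] = dict_tags[j] if inside else "<PAD>"
        for k in (1, 2, 3):
            feats[f"suf{k}[{offset}]"] = tok[-k:] if inside else "<PAD>"
    # Labels already assigned to the two preceding tokens.
    feats["label[-1]"] = prev_labels[i - 1] if i >= 1 else "<PAD>"
    feats["label[-2]"] = prev_labels[i - 2] if i >= 2 else "<PAD>"
    return feats


tokens = ["with", "the", "Rb", "protein"]
dict_tags = ["O", "O", "PROT", "O"]   # output of a dictionary tagger (made up)
prev_labels = ["O", "O"]              # classifier decisions so far
feats = token_features(tokens, dict_tags, prev_labels, i=2)
```

Because the previous labels are features, tokens have to be classified left to right, each decision feeding into the next one.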

Maximum-Entropy Token Classifier

• Distinguish among 5 types of tags: S(-tart), C(-ontinue), E(-nd), U(-nique), O(-ther)

• Feature templates:

– current, previous, next word, and previous tag

– part-of-speech for current, previous, next word

– word class (full) ex: FGF1 => AAA0

– word class (brief) ex: FGF1 => A0 (Collins, ACL02)

• An extraction’s confidence is the minimum of its transition probabilities.

• $\alpha_t(y)$ is the forward probability of reaching state y at time step t.

[Figure: example with 4 tokens]
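The two word-class templates can be reproduced from the FGF1 example: map uppercase letters to "A", lowercase letters to "a" and digits to "0", and for the brief form collapse runs of identical symbols. Leaving other characters unchanged is an assumption.

```python
from itertools import groupby


def word_class_full(word: str) -> str:
    """FGF1 => AAA0: one class symbol per character."""
    def cls(ch):
        if ch.isupper():
            return "A"
        if ch.islower():
            return "a"
        if ch.isdigit():
            return "0"
        return ch  # assumption: punctuation is kept as-is
    return "".join(cls(ch) for ch in word)


def word_class_brief(word: str) -> str:
    """FGF1 => A0: collapse consecutive repeats of the same class symbol."""
    return "".join(key for key, _ in groupby(word_class_full(word)))
```

The brief form makes "FGF1", "SOD1" and "NOS3" share one feature value, which helps with protein names never seen in training.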


MaxEnt: Greedy Extraction

• Use a Viterbi-like algorithm to find the most likely complete sequence of tags.

• Drawback: many low confidence extractions are missed.

• We want to be able to increase recall beyond the Viterbi results, to control the precision-recall trade-off.

• Solution: use a greedy extraction algorithm on all token sequences between any two consecutive Viterbi extractions.


Experimental Method

• 10-fold cross-validation: Average results over 10 trials with different training and (independent) test data.

• For methods which produce confidence in extractions, vary threshold for extraction in order to explore recall-precision trade-off.

• Use standard methods from information-retrieval to generate a complete precision-recall curve.

• Maximizing F-measure assumes a particular cost-benefit trade-off between incorrect and missed extractions.
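A sketch of the threshold sweep: rank extractions by confidence, and after each cut-off record precision, recall and the balanced F-measure (F1, one particular choice of trade-off). The confidences and correctness flags below are invented, and ties in confidence are not treated specially.

```python
def pr_curve(scored, n_gold):
    """scored: list of (confidence, is_correct) pairs.

    Returns one (threshold, precision, recall, f1) point per cut-off,
    going from the most to the least confident extraction.
    """
    points = []
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    correct = 0
    for k, (conf, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        p = correct / k
        r = correct / n_gold
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        points.append((conf, p, r, f1))
    return points


scored = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
curve = pr_curve(scored, n_gold=4)
best = max(curve, key=lambda pt: pt[3])  # cut-off with the highest F1
```

Reporting the whole curve rather than only `best` avoids committing to the single cost-benefit trade-off that F1 implies.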

Protein Name Extraction Results

(Bunescu et al., 2004)

Graphical Models

An intuitive representation of conditional independence between domain variables.

• Directed models => well suited to represent temporal and causal relationships (Bayesian networks, HMMs).

• Undirected models => appropriate for representing statistical correlation between variables (Markov networks).

• Generative models => define a joint probability over observations and labels (HMMs).

• Discriminative models => specify a probability over labels given a set of observations (Conditional Random Fields [Lafferty et al. 2001]). They allow for arbitrary, overlapping features over the observation sequence.


Discriminative Markov Networks

$G = (V, E)$ – an undirected graph

$V = X \cup Y$ – a set of discrete random variables

$X$ – observed variables

$Y$ – hidden variables (labels)

$C(G)$ – the cliques of $G$

$V_c = X_c \cup Y_c$ – the set of vertices in a clique $c \in C(G)$

$\Phi = \{\phi_c : V_c \to \mathbb{R}^+ \mid c \in C(G)\}$ – the set of clique potentials

$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{c \in C(G)} \phi_c(X_c, Y_c)$$

A clique potential $\phi_c$ specifies the compatibility of any possible assignment of values over the nodes in the associated clique $c$.
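For a handful of hidden variables, the conditional distribution can be computed by brute force, which makes the role of the normalizer Z(X) concrete: it is the sum of the product of clique potentials over all label assignments. In this sketch the observations X are fixed and folded into the potential functions; the potentials themselves are toy values.

```python
from itertools import product


def conditional_distribution(labels, n_hidden, potentials):
    """Return P(Y | X) for every assignment y of the hidden variables.

    potentials: functions mapping a full label tuple y to a
    non-negative score, one per clique (X is fixed inside them).
    """
    scores = {}
    for y in product(labels, repeat=n_hidden):
        score = 1.0
        for phi in potentials:
            score *= phi(y)
        scores[y] = score
    z = sum(scores.values())  # the normalizer Z(X)
    return {y: s / z for y, s in scores.items()}


# Two binary labels: one local potential and one pairwise potential.
local = lambda y: 2.0 if y[0] == 1 else 1.0       # evidence favours y0 = 1
agree = lambda y: 3.0 if y[0] == y[1] else 1.0    # correlated labels
dist = conditional_distribution((0, 1), 2, [local, agree])
```

Enumeration is exponential in the number of hidden variables, so this is only useful for checking small cases; real models need dedicated inference.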

Conditional Random Fields [Lafferty et al. 2001]

CRFs are a type of discriminative Markov network used for tagging sequences.

CRFs have shown superior or competitive performance in various tasks, such as:

• Shallow Parsing [Sha & Pereira 2003]

• Entity Recognition [McCallum & Li 2003]

• Table Extraction [Pinto et al. 2003]

Conditional Random Fields (CRFs) [Lafferty, McCallum & Pereira 2001]

• Undirected graphical model for sequence segmentation.

• Log-linear model, different from the MaxEnt model because of “global normalization”.

[Figure: linear-chain CRF. A chain of tag nodes Start, T1.tag, T2.tag, T3.tag, …, Tn.tag, End, with each position j also connected to its observed feature nodes Tj.w and Tj.cap.]

• Tj.tag – the tag (one of S, C, E, U, O) at position j

• Tj.w – true if word w occurs at position j

• Tj.cap – true if the word at position j begins with a capital letter, …


Protein Name Extraction Results (Yapex)

Collective Classification of Web Pages [Taskar, Abbeel & Koller 2002]


Collective Information Extraction

Task: Extracting protein/gene names from Medline abstracts.

Approach: Collectively classify all candidate phrases from the same abstract.

Binary classification:

– e.label = 0 => e is not a protein name

– e.label = 1 => e is a protein name

Use two types of label correlations:

– Acronyms and their long forms.

– Repetitions of the same phrase.


Collective Information Extraction

The control of human ribosomal protein L22 ( rpL22 ) to enter into the nucleolus and its ability to be assembled into the ribosome is regulated by its sequence . The nuclear import of rpL22 depends on a classical nuclear localization signal of four lysines at positions 13 – 16 … Once it reaches the nucleolus , the question of whether rpL22 is assembled into the ribosome depends upon the presence of the N - domain .

[Figure: candidate entities e1 and e2 in “ribosomal protein L22 ( rpL22 )”, e3 in “of rpL22 depends”, e4 in “whether rpL22 is”, and e5 “L22”, linked by edges labeled acronym, repetition (three edges) and overlap.]


Relational Markov Networks

Discriminative Markov Networks, augmented with clique templates [Taskar, Abbeel & Koller 2002]:

• Acronym Template (AT)

• Repeat Template (RT)

• Overlap Template (OT)

[Figure: the candidate entities e1 and e2 in “ribosomal protein L22 ( rpL22 )”, e3 in “of rpL22 depends”, e4 in “whether rpL22 is” and e5 “L22”, connected by one AT clique, three RT cliques and one OT clique.]


Candidate Entities: Definition

Candidate Entities: The set of candidate entities usually depends on the type of named entity. In general, one could consider as candidates all phrases of length < L, where L may be task dependent.

Two examples:

• [Genes, Proteins] Most entity names are base noun phrases or parts of them. Thus a candidate extraction is any contiguous sequence of tokens whose POS tags are from {“JJ”, “VBN”, “VBG”, “POS”, “NN”, “NNS”, “NNP”, “NNPS”, “CD”, “”}, and whose head is either a noun or a number.

• [People, Organizations, Locations] Most entity names are sequences of proper names potentially interspersed with definite articles and prepositions.
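The gene/protein rule can be sketched as a span enumerator over POS tags. Several details are assumptions: the head of a phrase is taken to be its last token, the length bound is a hypothetical `max_len` parameter, and the last entry of the tag set is left out because it is unreadable in the source.

```python
# Tags allowed inside a candidate (last, unreadable entry of the slide omitted).
ALLOWED = {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP", "NNPS", "CD"}
# Tags accepted for the head: a noun or a number.
HEAD_OK = {"NN", "NNS", "NNP", "NNPS", "CD"}


def candidate_spans(pos_tags, max_len=5):
    """Return (start, end) token spans, end exclusive, that qualify
    as candidate entities under the rule above."""
    spans = []
    n = len(pos_tags)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            if pos_tags[end - 1] not in ALLOWED:
                break  # any longer span would contain this token too
            if pos_tags[end - 1] in HEAD_OK:  # assumed head = last token
                spans.append((start, end))
    return spans


tokens = ["the", "antioxidant", "superoxide", "dismutase", "1", "("]
tags = ["DT", "JJ", "NN", "NN", "CD", "("]
spans = candidate_spans(tags)
```

For this fragment the enumerator proposes nine spans, including (2, 5) for "superoxide dismutase 1"; many of them overlap, which is what the overlap template later has to resolve.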


Candidate Entities: Local Features

Entity features, based on features introduced in [Collins ’02]:

• head word, with generic placeholder for numbers => “HD = 0”

• entity text => “TXT = superoxide dismutase – 1”

• entity type, e.g. concatenation of its word types => “TYPE = a a – 0”

• bigrams / trigrams at entity left / right boundaries, based on combinations of lexical tokens and word types:

– Bigrams left => “BL = antioxidant superoxide”, “BL = antioxidant a”, …

– Bigrams right => “BR = 0 (”, …

– Trigrams left => “TL = the antioxidant superoxide”, “TL = the antioxidant a”, …

– Trigrams right => “TR = 0 ( SOD1”, “TR = 0 ( A0”, …

• suffix / prefix lists of words and word types:

– Prefixes => “PF = superoxide”, “PF = superoxide dismutase”, …

– Suffixes => “SF = 0”, “SF = – 0”, “SF = dismutase – 0”, …

“… to the antioxidant superoxide dismutase 1 ( SOD1 ) enzyme and …”
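A rough sketch of how such feature strings could be generated is given below. The helper names (`word_type`, `lex`, `local_features`) are hypothetical, the word-type mapping (runs of capitals to "A", lowercase to "a", digits to "0") is inferred from the examples on this slide, and prefix/suffix features are produced for lexical tokens only.

```python
import re
from itertools import product


def word_type(token):
    """Collapse character classes: 'SOD1' -> 'A0', 'superoxide' -> 'a'."""
    token = re.sub(r"[A-Z]+", "A", token)
    token = re.sub(r"[a-z]+", "a", token)
    return re.sub(r"[0-9]+", "0", token)


def lex(token):
    """Lexical form, with a generic placeholder for numbers."""
    return "0" if token.isdigit() else token


def ngram_variants(tokens):
    """All combinations of lexical form and word type, one choice per token."""
    choices = [(lex(t), word_type(t)) for t in tokens]
    return {" ".join(combo) for combo in product(*choices)}


def local_features(tokens, start, end):
    """Feature strings for the candidate tokens[start:end]."""
    entity = tokens[start:end]
    feats = {
        "HD = " + lex(entity[-1]),  # assumption: the head is the last token
        "TXT = " + " ".join(entity),
        "TYPE = " + " ".join(word_type(t) for t in entity),
    }
    for n, name in ((2, "B"), (3, "T")):
        left = tokens[max(0, start - n + 1):start + 1]
        right = tokens[end - 1:end - 1 + n]
        feats |= {f"{name}L = {g}" for g in ngram_variants(left)}
        feats |= {f"{name}R = {g}" for g in ngram_variants(right)}
    for k in range(1, len(entity) + 1):
        feats.add("PF = " + " ".join(lex(t) for t in entity[:k]))
        feats.add("SF = " + " ".join(lex(t) for t in entity[-k:]))
    return feats


tokens = "to the antioxidant superoxide dismutase – 1 ( SOD1 ) enzyme and".split()
for feature in sorted(local_features(tokens, 3, 7)):
    print(feature)
```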

Overlap Template

Entity names should not overlap => hardwired overlap potential OT.

“… to the antioxidant superoxide dismutase 1 ( SOD1 ) enzyme and …”

$d.OT = \{(e_1, e_2), \ldots\}$

OT(e1, e2)      e1.label=0   e1.label=1
e2.label=0      1            1
e2.label=1      1            0

[Figure: two candidate entities e1 and e2 in the example sentence, connected by the potential OT.]
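The hardwired potential can be written down directly as a lookup table. The sketch below is illustrative only: the span representation (start, end) with an exclusive end and the function names are assumptions, and the product over overlapping pairs shows only the effect of OT, not the full model.

```python
# Overlap potential from the table: 0 only when both overlapping
# candidates are labeled as entities, 1 otherwise.
OT = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def overlaps(a, b):
    """True when two (start, end) spans, end exclusive, share a token."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_score(candidates, labels):
    """Product of OT over all overlapping candidate pairs.

    labels maps each candidate span to 0 or 1. The result is 0 as soon
    as two overlapping candidates are both labeled 1.
    """
    score = 1
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if overlaps(a, b):
                score *= OT[(labels[a], labels[b])]
    return score


spans = [(3, 6), (4, 6), (7, 8)]  # hypothetical candidate spans
print(overlap_score(spans, {(3, 6): 1, (4, 6): 0, (7, 8): 1}))
print(overlap_score(spans, {(3, 6): 1, (4, 6): 1, (7, 8): 1}))
```

Because the potential is zero for the forbidden configuration, any labeling with two overlapping entities gets zero score regardless of the learned local features.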

Repeat Template

Production of nitric oxide ( NO ) in endothelial cells is regulated by direct interactions of endothelial nitric oxide synthase ( eNOS ) …Here we have used the yeast two - hybrid system and identified a novel 34 kDa protein , termed NOSIP ( eNOS interaction protein ) , which avidly binds to the carboxyl – terminal region of the eNOS oxygenase domain .

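$d.RT = \{(u, v), \ldots\}$

[Figure: Repeat Template. The potential RT relates the repeated phrases u = “eNOS” and v = “eNOS” through the OR nodes u_OR and v_OR. u_OR is an OR node over the candidates u1, u2, …, um, and v_OR is an OR node over the candidates v1, v2, …, vn; in the example, v1 = “eNOS interaction” and v2 = “eNOS interaction protein”.]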
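One possible reading of the OR node is sketched below: v_OR is on when v itself, or any candidate that contains v, is labeled as an entity. Whether v's own label takes part in the OR is an assumption here, and the spans and labels are hypothetical.

```python
def containing_candidates(v, candidates):
    """Candidate spans (start, end) that contain span v, excluding v itself."""
    v_start, v_end = v
    return [c for c in candidates
            if c[0] <= v_start and v_end <= c[1] and c != v]


def or_node(v, candidates, labels):
    """Value of v_OR: 1 if v or any candidate containing v has label 1.

    labels maps spans to 0 or 1; missing spans count as 0.
    """
    group = [v] + containing_candidates(v, candidates)
    return int(any(labels.get(c, 0) == 1 for c in group))


# Token positions in "termed NOSIP ( eNOS interaction protein )"
tokens = "termed NOSIP ( eNOS interaction protein )".split()
v = (3, 4)                                   # "eNOS"
candidates = [(1, 2), (3, 4), (3, 5), (3, 6)]
labels = {(3, 6): 1}                         # hypothetical labeling
print(containing_candidates(v, candidates), or_node(v, candidates, labels))
```

Tying the repeat potential to the OR node rather than to v alone lets a repeated phrase support a longer name that contains it.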

Acronym Template

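$d.AT = \{v, \ldots\}$

“to the antioxidant superoxide dismutase 1 ( SOD1 ) enzyme and ”

[Figure: Acronym Template. The potential AT connects the node v to the OR node v_OR, which is an OR over the candidates v1, v2, …, vn (v1, v2, v3 in the example sentence).]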

Experimental Results

Datasets:
• Yapex – a dataset of 200 Medline abstracts, manually tagged for protein names.
• Aimed – a dataset of 225 Medline abstracts, of which 200 are known to mention protein interactions.
• CoNLL – the CoNLL 2003 English dataset.

Compared three approaches:
• LT–RMN: RMN extraction using local templates + Overlap Template.
• GLT–RMN: RMN extraction using both local and global templates.
• CRF: extraction as token classification using Conditional Random Fields [Lafferty et al 2001], with features based on the current word, previous/next words, short/long word types and POS tags [Bunescu et al 2004].

Experimental Results – Yapex

[Bar chart: Precision, Recall, and F-measure on Yapex for LT-RMN, GLT-RMN, and CRF; vertical axis from 50 to 75.]

Experimental Results – Aimed

[Bar chart: Precision, Recall, and F-measure on Aimed for LT-RMN, GLT-RMN, and CRF; vertical axis from 60 to 90.]

Experimental Results – CoNLL

[Bar chart: Precision, Recall, and F-measure on CoNLL 2003 for LT-RMN, GLT-RMN, and CRF; vertical axis from 60 to 85.]

Protein Interaction Extraction

• Most IE methods focus on extracting individual entities.

• Protein interaction extraction requires extracting relations between entities.

• Our current results on relation extraction have focused on rule-based learning approaches.

Rapier and BWI Revisited: the Inter-filler Approach

• Existing rule-based learning algorithms are used for inducing patterns for identifying protein interactions.

• Rules are learned for extracting inter-fillers.

SHPTPW interacts with another signaling protein, Grb7.

• Inter-fillers are sometimes very long (~9 tokens on average; 215 tokens maximum!). For some rule-based learning algorithms (e.g. Rapier), the time complexity can grow exponentially in the length of inter-fillers.
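As a small illustration, and assuming that an inter-filler is the text lying between two interacting protein mentions (the slide does not spell this out), the inter-filler of the example sentence can be read off from the protein spans. The function name and span format are hypothetical.

```python
def inter_filler(tokens, first, second):
    """Tokens strictly between two protein spans given as (start, end)."""
    left, right = sorted([first, second])
    return tokens[left[1]:right[0]]


tokens = "SHPTPW interacts with another signaling protein , Grb7 .".split()
filler = inter_filler(tokens, (0, 1), (7, 8))
print(" ".join(filler), len(filler))
```

Here the filler has 6 tokens; the averages quoted above suggest that much longer fillers are common, which is where the cost of rule learning grows.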

Rapier and BWI Revisited: the Role-filler Approach

• In the role-filler approach, we extract two interacting proteins into different slots, which we call the interactor and the interactee.

• A sentence is divided into segments. Interactors are associated with interactees in the same segment using simple heuristics.

• Moderately high precision (> 60%) but low recall (< 40%).

We show that the S252W mutation allows the mesenchymal splice form of FGFR2 (FGFR2c) to bind and to be activated by the mesenchymally expressed ligands FGF7 or FGF10 and the epithelial splice form of FGFR2 (FGFR2b) to be activated by FGF2, FGF6, and FGF9.

ELCS (Extraction using Longest Common Subsequences)

• A new method for inducing rules that extract interactions between previously tagged proteins.

• Each rule consists of a sequence of words with allowable word gaps between them (similar to Blaschke & Valencia, 2001, 2002).

- (7) interactions (0) between (5) PROT (9) PROT (17) .

• Any pair of proteins in a sentence forms a positive example if it is tagged as interacting; otherwise it forms a negative example.

• Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.
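To make the rule format concrete, here is a minimal matcher sketch. It reads each parenthesized number as the maximum number of tokens allowed between two consecutive rule words, which is an interpretation of the example rule above; the function names are hypothetical, and a full system would also bind the two PROT slots to a specific candidate pair.

```python
import re


def parse_rule(rule):
    """Split a rule string into its words and the maximum gaps between them."""
    words, gaps = [], []
    for item in rule.split():
        match = re.fullmatch(r"\((\d+)\)", item)
        if match:
            gaps.append(int(match.group(1)))
        else:
            words.append(item)
    return words, gaps


def rule_matches(rule, tokens):
    """True if the rule words occur in order within the allowed gaps."""
    words, gaps = parse_rule(rule)
    # Positions where the part of the rule matched so far can end.
    ends = {i for i, tok in enumerate(tokens) if tok == words[0]}
    for word, gap in zip(words[1:], gaps):
        ends = {j
                for i in ends
                for j in range(i + 1, min(len(tokens), i + gap + 2))
                if tokens[j] == word}
        if not ends:
            return False
    return bool(ends)


rule = "- (7) interactions (0) between (5) PROT (9) PROT (17) ."
sentence = ("Title - Physical and functional interactions between "
            "the transcriptional inhibitors PROT and PROT .").split()
print(rule_matches(rule, sentence))
```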

Generalizing Rules using Longest Common Subsequence

- (7) interactions (0) between (5) PROT (9) PROT (17) .

The self - association site appears to be formed by interactions between helices 1 and 2 of beta spectrin repeat 17 of one dimer with helix 3 of alpha spectrin repeat 1 of the other dimer to form two combined alpha - beta triple - helical segments .

Title – Physical and functional interactions between the transcriptional inhibitors Id3 and ITF-2b .
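The generalization step can be sketched with a textbook longest-common-subsequence alignment. This is an assumption-laden illustration, not the authors' implementation: protein names are taken to be already replaced by PROT, each gap is set to the larger of the two gaps seen in the aligned examples, and ties between equally long subsequences are broken arbitrarily by the backtracking order.

```python
def lcs_alignment(a, b):
    """Index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def generalize(a, b):
    """Common words plus, for each gap, the larger gap of the two examples."""
    pairs = lcs_alignment(a, b)
    words = [a[i] for i, _ in pairs]
    gaps = [max(i2 - i1 - 1, j2 - j1 - 1)
            for (i1, j1), (i2, j2) in zip(pairs, pairs[1:])]
    return words, gaps


def format_rule(words, gaps):
    """Render a rule in the 'word (gap) word' notation used on the slides."""
    if not words:
        return ""
    parts = [words[0]]
    for gap, word in zip(gaps, words[1:]):
        parts += [f"({gap})", word]
    return " ".join(parts)


print(format_rule(*generalize("a b c PROT".split(), "a x y c PROT".split())))
```

Applied to the two sentences on this slide, with the tagged protein names replaced by PROT and the dash after "Title" written as "-", this sketch should yield the rule shown at the top of the slide.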

The ELCS Framework

• A greedy-covering, bottom-up rule induction method is used to cover all the positive examples without covering many negative examples.

• We use an algorithm similar to beam search that considers only the n = 25 best rules for generalization at any time.

• The confidence level of a rule is based on the number of positive and negative examples the rule covers while allowing some margin for noise (Cestnik, 1990).
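One way to read the confidence and beam bullets together is sketched below using the m-estimate, the smoothed probability estimate associated with Cestnik (1990). The prior of 0.5, the value m = 2, and the (name, positives, negatives) rule representation are assumptions chosen for illustration; the slides do not give the exact form used.

```python
import heapq


def m_estimate(pos, neg, prior=0.5, m=2.0):
    """Smoothed precision of a rule covering pos positives and neg negatives.

    prior and m are assumed values; a larger m pulls low-coverage rules
    more strongly toward the prior.
    """
    return (pos + m * prior) / (pos + neg + m)


def prune_beam(rules, beam_width=25):
    """Keep the beam_width rules with the highest confidence.

    rules is a list of (name, pos, neg) tuples (a hypothetical format).
    """
    return heapq.nlargest(beam_width, rules,
                          key=lambda rule: m_estimate(rule[1], rule[2]))


rules = [("r1", 1, 0), ("r2", 8, 1), ("r3", 2, 5)]
for name, pos, neg in prune_beam(rules, beam_width=2):
    print(name, round(m_estimate(pos, neg), 3))
```

The smoothing means a rule covering one positive and no negatives scores below a rule covering eight positives and one negative, which is the margin for noise the bullet refers to.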

Protein Interaction Extraction Results

Protein Interaction Extraction Results (full)

Ongoing and Future Work

• Extracted proteins and their interactions from 753,459 Medline abstracts on human biology. Evaluation of results in progress.

• Improve RMN approach with better local and global templates, better candidate entity generation, and better algorithms for probabilistic inference.

• Extend RMN approach to handle extracting relations between entities.

• Evaluate RMN approach on other biological entities and relations and on other non-biological corpora.

• Reduce human effort by actively selecting the best training examples for human labeling.

• Combine evidence from text with other biological data sources to derive accurate, comprehensive gene networks.

Conclusions

• We have compared a wide variety of existing machine-learning methods for extracting human protein names and interactions.

• The CRF approach performs best among existing methods.

• We developed a new, more general approach based on RMNs that allows collective extraction, integrating information across all potential extractions.

• For extracting protein interactions, we found that several methods for learning extraction rules outperform hand-written rules with respect to precision and noisy protein tags.

The End