International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), © IAEME
142
GENERIC APPROACH FOR PREDICTING UNANNOTATED
PROTEIN PAIR FUNCTION USING PROTEIN
Anjan Kumar Payra1, Sovan Saha1
1Dept. of Computer Science & Engg
Dr. Sudhir Chandra Sur Degree Engineering College, DumDum
Kolkata, India
ABSTRACT
Proteins are the most versatile macromolecules in living systems and serve crucial
functions in essentially all biological processes. With the successful sequencing of several
genomes, the challenging problem now is to determine the functions of proteins in the
post-genomic era. Determining protein functions experimentally is a laborious and
time-consuming task involving many resources. Research is therefore under way to predict
protein functions using various computational methods. This matters because there are many
diseases for which drugs are still unknown or yet to be discovered, and the drug discovery
process starts with protein identification, since proteins are responsible for many functions
required for the maintenance of life. Protein identification in turn requires determination of
protein function. The computational methods are based on sequence and structure, gene
neighborhood, gene fusions, cellular localization, protein-protein interactions, etc. In this
work, we present an approach to predict the functions of unannotated protein pairs based on
their protein interaction network. The success rate obtained in our work is 94.4%.
Keywords: Protein interaction network, Unannotated protein pair function prediction,
Functional groups, success rate.
I. INTRODUCTION
Proteins are the building blocks of life. The human body needs protein to repair and
maintain itself, so proteins have versatile functions to perform. However, the concept of
protein function is highly context-sensitive and not very well-defined. In fact, this concept
typically acts as an umbrella term for all types of activities that a protein is involved in, be it
cellular, molecular or physiological. One such categorization of the types of functions a
protein can perform has been suggested by Bork et al. [1998]:
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING
& TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), pp. 142-157 © IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com
o Molecular function: The biochemical functions performed by a protein, such as ligand
binding, catalysis of biochemical reactions and conformational changes.
o Cellular function: Many proteins come together to perform complex physiological
functions, such as operation of metabolic pathways and signal transduction, to keep the
various components of the organism working well.
o Phenotypic function: The integration of the physiological subsystems, consisting of
various proteins performing their cellular functions, and the interaction of this integrated
system with environmental stimuli determines the phenotypic properties and behavior of the
organism.
In order to predict protein function we have to study the existing data types which can be
broadly classified under 8 sections:
• Amino acid sequences
• Protein structure
• Genome sequences
• Phylogenetic data
• Micro array expression data
• Protein interaction networks and protein complexes
• Biomedical literature
• Combination of multiple data types
• Amino acid sequences: An amino acid sequence is the order in which amino acids join
together to form peptide chains, or polypeptides. If the peptide chain is a protein, this
sequence is often called the primary structure of the protein. Due to the structure of amino
acids and how they bond together, the order of the amino acids is only read in one direction
and is specific to the peptide being formed. It can be used to identify a protein or
homologous proteins through searches in databases and also to obtain information about
post-translational cleavage points. In addition, the sequence results provide information
about the purity of a preparation. The limits of detectable contamination depend on the
sequences of the analyzed proteins. The central dogma of molecular biology describes the
conversion of a gene to protein via the transcription and translation phases as shown in
Fig. 1. The result of this process is a sequence constructed from twenty amino acids, known
as the protein’s primary structure. This sequence is the most fundamental form of information
available about the protein since it determines different characteristics of the protein such
as its sub-cellular localization, structure and function.
Fig. 1 Central dogma of molecular biology
The most popular experimental method for the identification of protein sequences is mass
spectrometry [Sickmann et al. 2003], which, in combination with algorithms such as
ProFound [Zhang and Chait 2000], comes in various flavors, such as peptide mass
fingerprinting, peptide fragmentation and other comparative methods. However, these methods are
low-throughput, and thus, with the exponential generation of genome sequences, the focus
has shifted to computational approaches that can identify genes from these genomes.
Specifically, techniques that predict protein function from sequence can be categorized into
three classes, namely, sequence homology-based approaches, subsequence-based approaches
and feature-based approaches, which are explained below:
Homology-based approaches: Homologous traits of organisms are due to descent
from a common ancestor. The homology-based search process can be made more sensitive by
multiple means, such as making the search probabilistic and adding evidence from other
sources of data to obtain more accurate and confident annotations for the query proteins.
Subsequence-based approaches: It has been reflected in several studies that often not the
whole sequence, but only some segments of it are important for determining the function of a
given protein. Consequently, the approaches in this category treat these segments or
subsequences as features of a protein sequence and construct models for the mapping of these
features to protein function. These models are then used to predict the function of a query
protein.
Feature-based approaches: The final category of approaches attempts to exploit the
perspective that the amino acid sequence is a unique characterization of a protein, and
determines several of its physical and functional features. These features are used to construct
a predictive model which can map the feature-value vector of a query protein to its function.
• Protein Structure: A protein is an organic biopolymer that is composed of a set of amino
acids, and assumes a configuration in three-dimensional space due to interactions between
these constituents as shown in Fig. 2. Protein structures may be specified at multiple levels.
Usually, it is specified at three levels, with a fourth level being specified for some cases
[Schulz and Schirmer 1996]. Following is a brief description of these levels:
Primary structure: The primary structure of a protein is simply a sequence of amino acids.
Secondary structure: The sequence of a protein influences its conformation in
three-dimensional space via the formation of bonds between spatially close amino acids in the
sequence. This process is popularly known as protein folding, and leads to the creation of
substructures such as α-helices, β-sheets, turns and random coils, of which the first two are
the most common, while the last two are formed very rarely. The collection of these
substructures forms the secondary structure of a protein.
Tertiary structure: The attractive and repulsive forces among the substructures caused by
the folding balance each other and provide the protein with a relatively stable, though
complicated, three-dimensional structure. This structure is known as the tertiary structure of
the protein.
Quaternary structure: Some proteins, such as the spectrin protein [Fuller et al. 1974], consist
of multiple amino acid sequences, also known as protein subunits. Each of these sequences
folds to form its own tertiary structure, and these come together to produce the quaternary
structure of the protein.
The existing approaches in predicting protein functions from protein structure are:
Similarity-based approaches: Given the structure of a protein, these approaches identify the
protein with the most similar structure using structural alignment techniques, and transfer its
functional annotations to the query protein.
Fig. 2 Structure of protein
Motif-based approaches: The approaches in this category attempt to identify
three-dimensional motifs, which are substructures conserved in a set of functionally related proteins,
and estimate a mapping between the function of a protein and the structural motifs it contains.
This mapping is then used to predict the functions of unannotated proteins.
Surface-based approaches: It is sometimes necessary to analyze the structure of a protein at
a higher resolution than that of distances between consecutive amino acids. This corresponds
to the modeling of a continuous surface for the structure and identifying features such as
voids or holes in these surfaces. The approaches in this category utilize these features to infer
a protein’s function.
Learning-based approaches: This category of recent approaches employs effective
classification methods, such as SVM and k-nearest neighbor, to identify the most appropriate
functional class for a protein from its most relevant structural features.
• Genomic sequences: Genome sequencing is a laboratory process that determines the
complete DNA sequence of an organism's genome at a single time. This entails sequencing
all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and,
for plants, in the chloroplast. Almost any biological sample containing a full copy of the
DNA, even a very small amount of DNA or ancient DNA, can provide the genetic material
necessary for full genome sequencing. DNA itself is typically a double-stranded molecule,
where one of the strands is a sequence over four characters, namely A, T, C and G, which
denote the four nucleotides adenine, thymine, cytosine and guanine, and the other strand is
complementary to the first, owing to the complementarity of the A−T and C−G nucleotide
pairs as shown in Fig. 3. Several approaches have been proposed to accomplish the target of
deriving functional associations from genomic data, and possible function prediction
subsequently.
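As a small illustration of strand complementarity, the sketch below derives the complementary strand of a DNA sequence under the standard base pairing (A with T, C with G). The function name and the example sequence are hypothetical and serve only as an illustration.

```python
# Standard base pairing: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand.

    Raises ValueError if the strand contains a character other than A, T, C or G.
    """
    try:
        return "".join(COMPLEMENT[base] for base in strand.upper())
    except KeyError as exc:
        raise ValueError(f"unexpected nucleotide: {exc.args[0]!r}") from None


# Hypothetical example sequence.
example = "ATCGGA"
paired = complement_strand(example)
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.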
Fig. 3 DNA molecules
These approaches largely fall into one of the following three categories [Marcotte 2000]:
Genome-wide homology-based annotation transfer: This category consists simply of the
use of larger databases for searching proteins homologous to the query proteins, and the
transfer of functional annotation from the closest results.
Gene neighborhood- or gene order-based approaches: These approaches are based on the
hypothesis that proteins, whose corresponding genes are located “close” to each other in
multiple genomes, are expected to interact functionally. This hypothesis is supported by the
concept of an operon, and its relevance to protein function [Salgado et al. 2000].
Gene fusion-based approaches: These approaches attempt to discover pairs or sets of genes
in one genome that are merged to form a single gene in another genome. The underlying
hypothesis here is that these sets of genes are functionally related, and is supported by
biochemical and structural evidence [Marcotte et al. 1999].
• Phylogenetic data: A phylogenetic tree or evolutionary tree is a branching diagram or
"tree" showing the inferred evolutionary relationships among various biological species or
other entities based upon similarities and differences in their physical and/or genetic
characteristics. The organisms that are joined together in the tree are implied to have
descended from a common ancestor. In a rooted phylogenetic tree, each node with descendants
represents the inferred most recent common ancestor of the descendants and the edge lengths
in some trees may be interpreted as time estimates. Each node is called a taxonomic unit.
Internal nodes are generally called hypothetical taxonomic units, as they cannot be directly
observed.
Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint
absence of two traits across large numbers of species is used to infer a meaningful biological
connection, such as involvement of two different proteins in the same biological pathway. It
is essential to include the evolutionary perspective in any complete understanding of protein
function. As a result, several approaches for predicting protein function using
evolution-based data have recently been proposed. The field of biology that deals with the
evolutionary relationships among living organisms is also known as phylogenetics [Bittar and
Sonderegger 2004]. The phylogenetic profile of a protein is (generally) a binary vector whose
length is
the number of available genomes. The vector contains a 1 in the ith position if the ith genome
contains a homologue of the corresponding gene, else a 0. In several other studies, a more
extensive representation of evolutionary knowledge is used [Bittar and Sonderegger 2004].
This representation is known as a phylogenetic tree [Baldauf 2003], which is a standard tree
with respect to the graph theoretical definition, but whose nodes and branches carry special
meaning as shown in Fig. 4.
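To make the binary-vector form of a phylogenetic profile concrete, the following minimal sketch builds profiles for a few genes and compares them. The genome names, gene names and the use of Hamming distance as the comparison are assumptions made for illustration; they are not taken from the cited studies.

```python
from typing import Dict, List, Set


def phylogenetic_profile(gene: str, genomes: List[str],
                         homologs: Dict[str, Set[str]]) -> List[int]:
    """Return a 0/1 vector with one entry per genome.

    Position i is 1 if genome i contains a homologue of `gene`, else 0.
    `homologs` maps a genome name to the set of genes with a homologue there.
    """
    return [1 if gene in homologs.get(g, set()) else 0 for g in genomes]


def hamming_distance(p: List[int], q: List[int]) -> int:
    """Number of genomes in which two profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must have equal length")
    return sum(a != b for a, b in zip(p, q))


# Toy data (hypothetical genome and gene names).
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D"]
HOMOLOGS = {
    "genome_A": {"geneX", "geneY"},
    "genome_B": {"geneZ"},
    "genome_C": {"geneX", "geneY", "geneZ"},
    "genome_D": set(),
}

profile_x = phylogenetic_profile("geneX", GENOMES, HOMOLOGS)
profile_y = phylogenetic_profile("geneY", GENOMES, HOMOLOGS)
profile_z = phylogenetic_profile("geneZ", GENOMES, HOMOLOGS)
```

In this toy data geneX and geneY are present and absent in exactly the same genomes, so their profiles are identical, which is the kind of joint presence or absence that phylogenetic profiling treats as a hint of a functional link.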
• Micro array expression data: Protein synthesis from genes occurs in prokaryotic
organisms in two phases [Weaver 2002]. In the transcription phase, an mRNA is created from
the original gene by converting the latter to the corresponding RNA code. The protein is then
synthesized from mRNA by translating the RNA code to the corresponding amino acid
sequence according to the codon translation rules. Gene expression experiments are a method
to quantitatively measure the transcription phase of protein synthesis [Nguyen et al. 2002].
The most common category of these experiments uses square-shaped glass chips measuring
as little as 1 inch on either side, also known as cDNA micro arrays. An experiment using a
micro array is shown in Fig. 5. The experiment is carried out in the following stages.
Fig. 4 Constructing a simple phylogenetic tree
In the first stage, the chip is laid out with a matrix of dots of cDNAs, usually several
thousand in number, one corresponding to each of the genes being measured. In parallel,
mRNA is extracted from both normal cells and cells of the organism that have been
exposed to the condition being studied. These mRNAs are reverse transcribed to cDNA and
colored green and red respectively. These colored cDNAs are then spread on the
micro array chip, leading to a hybridization of the cDNA already on the chip with those
produced by the genes in the two types of cells. This generates a spot of a certain color on the
chip for each gene which denotes its expression level. In the final stage of the experiment, the
intensity of this spot is measured by a laser scanner connected to a computer, which
generates a real-valued measurement of the expression of each gene as the ratio of the log
intensities of red and green colors in the region. The result of the experiment thus is a
measurement of the transcription activity of the genes under the specified condition.
Fig. 5 Micro array procedure
Existing approaches in gene expression data are:
Clustering-based approaches: An underlying hypothesis of gene expression analysis is that
functionally similar genes have similar expression profiles, since they are expected to be
activated and repressed under the same conditions. Because clustering is a natural approach
for grouping similar data points, approaches in this category cluster genes on the basis of
their gene expression profiles, and assign functions to the unannotated proteins using the
most dominant function for the respective clusters containing them.
Classification-based approaches: A more direct solution to the problem of predicting
protein function from gene expression profiles is the data mining approach of classification.
Thus, approaches in this category build various types of models for the expression function
mapping using classifiers, such as neural networks, SVMs and the naive Bayes classifier, and
use these models to annotate novel proteins.
Temporal analysis-based approaches: Temporal gene expression experiments measure the
activity of genes at different instances of time, for instance, during a disease. This behavior
can also be used to predict protein function. Thus, approaches in this category derive features
from this temporal data and use classification.
• Protein interaction networks and protein complexes: A protein almost never performs
its function in isolation. Rather, it usually interacts with other proteins in order to accomplish
a certain function. However, in keeping with the complexity of the biological machinery,
these interactions are of various kinds. At the highest level, they can be categorized into
genetic and physical interactions. Genetic interactions occur when the mutations in one gene
cause modifications in the behavior of another gene, which implies that these interactions are
only conceptual and do not occur physically in a genome. In our project we consider the
physical interactions between proteins, since they are more directly related to the process
through which a protein accomplishes its functions. Since a protein generally interacts with
more than one other protein, these interactions can be structured to form a network, and
hence the name protein interaction networks which is shown in Fig. 6 and Fig. 7.
Fig. 6 Organic View (Cytoscape) of our data set
Existing approaches that attempt to predict function of proteins from a protein interaction
network can be broadly categorized into the following four categories:
Neighborhood-based approaches: These approaches utilize the neighborhood of the query
protein in the interaction network and the most “dominant” annotations among these
neighbors to predict its function.
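A minimal sketch of the neighborhood-based idea is given below: the functions most frequently annotated among the direct neighbors of a query protein are taken as its predicted functions. The protein identifiers, the toy annotations and the alphabetical tie-breaking rule are assumptions introduced for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set


def neighborhood_counting(protein: str,
                          neighbors: Dict[str, Set[str]],
                          annotations: Dict[str, Set[str]],
                          k: int = 1) -> List[str]:
    """Return the k most frequent functional labels among direct neighbors."""
    counts: Counter = Counter()
    for nb in neighbors.get(protein, set()):
        # Unannotated neighbors contribute nothing.
        counts.update(annotations.get(nb, set()))
    # Sort by descending count; break ties alphabetically (an assumed rule).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:k]]


# Toy network: query protein "P" with five neighbors (hypothetical names).
NEIGHBORS = {"P": {"A", "B", "C", "D", "E"}}
ANNOTATIONS = {
    "A": {"cell cycle control"},
    "B": {"cell cycle control", "DNA repair"},
    "C": {"DNA repair"},
    "D": {"cell cycle control"},
    # "E" is unannotated and therefore has no entry.
}

top1 = neighborhood_counting("P", NEIGHBORS, ANNOTATIONS, k=1)
top2 = neighborhood_counting("P", NEIGHBORS, ANNOTATIONS, k=2)
```

Note that this sketch looks only at direct neighbors and attaches no confidence score to a prediction.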
Fig. 7 Circle View (Cytoscape) of our data set
Global optimization-based approaches: In many cases, the neighborhood of the query
protein may not contain enough information, such as annotated proteins, for determining the
function of the query protein robustly. Under these conditions, it may be advantageous to
consider the structure of the entire network and use the annotations of the proteins indirectly
connected to the query protein also. The approaches in this category are based on this idea,
and in most cases, are based on the optimization of an objective function based on the
annotations of the proteins in the network.
Clustering-based approaches: The approaches in this category are based on the
hypothesis that dense regions in the interaction network represent functional modules,
which are natural units in which proteins perform their function. Thus, these approaches
apply graph clustering algorithms to these networks and then determine the functions of
unannotated proteins in the extracted modules using measures such as majority.
Association-based approaches: Recently, several computationally efficient algorithms have
been proposed for finding frequently occurring patterns in data, in the field of association
analysis in data mining [Tan et al. 2005]. The approaches in this category use these
algorithms to detect frequently occurring sets of interactions in interaction networks of
protein complexes, and hypothesize that these subgraphs denote functional modules. Function
prediction from these modules is performed as in the clustering-based approaches.
• Biomedical literature: As in all other research communities, researchers in the fields of
biology and medicine publish the results of their research in various journals and conferences.
As a result, over the years, a huge repository of knowledge has been created in the form of
papers, books, reports, theses and other such texts. Clearly, these repositories contain a huge
amount of information about important biological concepts such as protein structure and
function, cancer-causing genes and several others. Thus, there is great utility in the mining of
these repositories and retrieval of useful information as shown in Fig. 8.
Multiple data types: With a plethora of data being generated by a wide spectrum of
proteomics experiments, it may be hypothesized that sometimes what can’t be discovered
from one source of information may become obvious when multiple sources are analyzed
simultaneously. This intuition has been concretized by Kemmeren and Holstege [2003], who
have suggested the following distinct advantages achieved by integrating functional genomics
data:
Fig. 8 Biomedical literature
o Usually, individual biological data sets provide information about complementary
biological processes, such as gene expression and protein interaction networks. Thus,
combining them provides a global picture of the biological phenomena a set of genes is
involved in.
o Often, data quality varies between different types of data, as well as within different
sources of data of the same type. For instance, studies have shown significant variations
between the qualities of different protein interaction data sets [Deng et al. 2003]. Thus, the
combination of several data sources/types improves the quality of the overall data set, since
the errors in one data set may be corrected in another.
o The most important advantage of the integrative approach is that since only conclusions
valid over a set of data types are accepted, the predictions made by this approach are usually
more confident than those made on the basis of individual data sets.
The different existing data types have now been outlined, so we turn to
our own work. Our objective is to assign un-annotated “protein pairs” to
different functional groups. We therefore focus on the existing computational
techniques that use protein-protein interaction data to predict protein function. Protein
functionality can be predicted by the neighborhood property, which suggests that, in the PPI
network, neighbors of a particular protein have similar functions. In the work of Schwikowski [1] a
neighborhood-counting method is proposed to assign k functions to a protein by identifying
the k most frequent functional labels among its interacting partners. It is simple and effective,
but the full topology is not considered and no confidence scores are assigned for the
annotations. In the chi-square method, Hishigaki et al. [2] assign k functions to a protein
with the k largest chi-square scores. For a protein P, each function f is assigned a score
$(n_f - e_f)^2 / e_f$, where $n_f$ is the number of proteins in the n-neighborhood of P that
have the function f, and $e_f$ is the expectation of this number based on the frequency of f
among all proteins in the network. Chen et al. [3] extend this neighborhood property to higher
levels in the network. They infer the functional similarity between a protein and its
level-1 and level-2 neighbors. Their algorithm assigns a weight to each
level-1 and level-2 neighbor by estimating its functional similarity. Many graph algorithms
have been applied to the functional analysis of PPI networks. Vazquez et al. [4] assign
proteins to functions so as to maximize the connectivity among proteins assigned the same
function. They map this problem into an optimization problem, solved using simulated
annealing, in which they maximize the number of edges that connect proteins (un-annotated or
previously annotated) assigned the same function. Karaoz et al. [5] apply a similar approach
to a collection of PPI data and gene expression data. They construct a distinct network for
each function in GO. For a particular function f, the state of each annotated protein v equals
+1 if v has function f and -1 if v has a different function. Nabieva et al. [6] propose a
flow-based approach to predict protein function from the protein interaction network.
Considering both the local and global properties of the graph, this approach assigns functions
to un-annotated proteins based on the amount of flow they receive during simulation, where
each annotated protein is a source of functional flow. Deng et al. [7] propose an approach
employing the theory of Markov random fields in which they estimate the posterior probability
of a protein of interest. Letovsky and Kasif [8] use loopy belief propagation with the
assumption of a binomial model for local neighbors of proteins annotated with a given term.
Similarly, Wu et al. [9] propose a related probabilistic model to annotate functions of
unknown proteins and PPI networks based on the structure of the PPI network. Joshi et al. [10]
develop a new integrated probabilistic method for
cellular function by combining information from protein-protein interactions, protein
complexes, micro array gene expression profiles and annotations of known proteins through an
integrative statistical model. In the work of Samanta et al. [11], a network-based statistical
algorithm is proposed, which assumes that if two proteins share a significantly large number of
common interacting partners, they share a common functionality. Another application is
UVCLUSTER, proposed by Arnau et al. [12], which is based on bi-clustering and iteratively
explores distance datasets. Apart from graph clustering, in the early stage, Bader and Hogue [13]
proposed Molecular Complex Detection (MCODE), in which dense regions are detected
according to some parameters. Altaf-ul-Amin et al. [14] also use a clustering approach. It starts
from a single node in a graph, and clusters are gradually grown until the similarity of every
added node within a cluster and the density of clusters reach a certain limit. Spirin and Mirny
[15] use a graph clustering approach in which they detect modules that are densely connected
within themselves and sparsely connected with the rest of the network, based on
superparamagnetic clustering and a Monte Carlo algorithm. Pruzli et al. [16] use a graph-theoretic
approach in which clusters are identified using Leda’s routine components and those clusters are
analyzed by the Highly Connected Subgraphs (HCS) algorithm. Later, King et al. [17] partition
networks into clusters using a cost function, applying the Restricted Neighborhood Search
Clustering (RNSC) algorithm. Clusters are filtered according to their size, density and
functional homogeneity. Krogan et al. [18] use the Markov clustering algorithm to predict
protein function.
II. PRESENT WORK
o Motivation: Many approaches over the protein-protein interaction (PPI) network have been
discussed in the previous section. A study of the literature shows that very few assessments
have been pursued on PPI networks considering protein pairs and the interconnections within
their PPI networks. This observation has encouraged us to work on the PPI network and to
predict the function of an unannotated protein pair using a generic approach,
which is discussed in the following sections.
o Dataset: In this work, the protein-protein interaction data of yeast (Saccharomyces
cerevisiae) is collected from ftp://ftpmips.gsf.de/yeast/PPI/; it contains 15613 genetic
and physical interactions. Self-interactions are discarded. A set of 12487 unique binary
interactions involving 4648 proteins is taken as data. In our proposed method 15 functional
groups are considered: cell cycle control (O1), cell polarity (O2), cell wall
organization and biogenesis (O3), chromatin chromosome structure (O4), co-immuno-precipitation
(O5), co-purification (O6), DNA repair (O7), lipid metabolism (O8), nuclear-cytoplasmic
transport (O9), pol II transcription (O10), protein folding (O11), protein
modification (O12), protein synthesis (O13), small molecule transport (O14) and vesicular
transport (O15). For each functional group, 90% of the protein pairs are taken as training samples
and the rest (2-8%) among them are considered as test samples.
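The filtering described above (discarding self-interactions and keeping unique binary interactions) might look like the following minimal sketch. The input format is an assumption: the layout of the downloaded file is not described here, so the interactions are taken to be already parsed into (protein A, protein B) tuples. The function names are hypothetical, and treating A-B and B-A as the same interaction is also an assumption.

```python
def unique_binary_interactions(raw_pairs):
    """Drop self-interactions and duplicate pairs (direction is ignored)."""
    seen = set()
    unique = []
    for a, b in raw_pairs:
        if a == b:
            continue  # self-interaction: discarded
        key = frozenset((a, b))  # unordered pair, so A-B equals B-A
        if key in seen:
            continue  # duplicate interaction
        seen.add(key)
        unique.append((a, b))
    return unique


def proteins_in(pairs):
    """Set of all proteins that take part in at least one interaction."""
    return {protein for pair in pairs for protein in pair}


# Toy input, not taken from the real dataset
raw = [
    ("YAL011w", "YDL080c"),
    ("YDL080c", "YAL011w"),  # same interaction, reversed
    ("YPL078c", "YPL078c"),  # self-interaction
    ("YDL181w", "YPL078c"),
]
clean = unique_binary_interactions(raw)
print(len(clean), "unique interactions involving", len(proteins_in(clean)), "proteins")
```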
o Basic terminologies:
Protein interaction network: Protein–protein interactions occur when two or
more proteins bind together, often to carry out their biological function. Many of the most
important molecular processes in the cell such as DNA replication are carried out by large
molecular machines that are built from a large number of protein components organized by
their protein–protein interactions. These protein interactions form a network-like structure
which is known as a protein interaction network. Here the protein interaction network is
represented as a graph GP which consists of a set of vertices (nodes) V connected by edges
(links) E. Thus GP = (V, E). Here each protein is represented as a node and their
interconnections are represented by edges.
Sub graph: A graph G´P is a sub graph of a graph GP if the vertex set of G´P is a subset of the
vertex set of GP and if the edge set of G´P is a subset of the edge set of GP. That is, if G´P =
(V′, E′) and GP = (V, E), then G´P is called a sub graph of GP if V′ ⊆ V and E′ ⊆ E. G´P may
be defined as a set {K ∪ U}, where K represents the set of annotated protein pairs while
U represents the set of un-annotated protein pairs.
Level-1 neighbors: In G´P, the directly connected neighbors of a particular vertex are called
level-1 neighbors.
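The graph GP = (V, E), a sub graph and the level-1 neighbors of a vertex can be sketched with an adjacency map. This is only an illustrative sketch: the helper names `build_graph`, `level1_neighbors` and `level1_subgraph`, the toy protein labels P1 to P4, and the choice to keep every edge of GP between the selected vertices are assumptions, not details given in the text.

```python
from collections import defaultdict


def build_graph(edges):
    """Adjacency map of an undirected graph GP = (V, E)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def level1_neighbors(adj, protein):
    """Directly connected neighbors of a vertex."""
    return set(adj.get(protein, ()))


def level1_subgraph(adj, protein):
    """Sub graph (V', E') made of a protein and its level-1 neighbors.

    Assumption: E' keeps every edge of GP whose two endpoints lie in V',
    so V' is a subset of V and E' is a subset of E by construction.
    """
    vertices = {protein} | level1_neighbors(adj, protein)
    edges = {
        frozenset((u, v))
        for u in vertices
        for v in adj.get(u, ())
        if v in vertices
    }
    return vertices, edges


# Toy network with hypothetical protein names
gp = build_graph([("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")])
sub_vertices, sub_edges = level1_subgraph(gp, "P1")
print(sorted(level1_neighbors(gp, "P1")), sorted(sub_vertices), len(sub_edges))
```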
o Proposed Work: The proposed work deduces the PPI network of each individual protein
belonging to an unannotated protein pair chosen from the original data set mentioned earlier.
The common interactions between those deduced PPI networks are then identified, and the
success rate is estimated by using a Generic Approach for
predicting the function of the unannotated protein pair.
o Method: In this method, given G´P, a sub graph of the protein interaction network consisting
of protein pairs as nodes associated with any element of the set O = {O1, O2, O3, ..., O15}, where Oi
represents a particular functional group, the method maps the elements of the set of un-annotated
protein pairs U to an element of set O. The steps associated with this method are
described as follows:
Step 1: Take any protein pair as an element from set U.
Step 2: Deduce the PPI network for each protein belonging to the protein pair selected in Step 1.
Step 3: Find the common interacting pairs between the PPI networks deduced in Step 2.
Step 4: Count the number of occurrences Si (i = 1, ..., 15) of each element Oi of set O = {O1, O2, O3, ..., O15} among the common interacting pairs found in Step 3.
Step 5: Assign the Oi of set O = {O1, O2, O3, ..., O15} corresponding to Max(Si) (i = 1, ..., 15) to the unannotated protein pair considered in Step 1.
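The five steps can be sketched as follows. This is a hedged reading rather than the authors' implementation: a "common interacting pair" is interpreted here as an annotated interacting pair that links a level-1 neighbor of one protein to a level-1 neighbor of the other, and ties between functional groups are broken by the order in which groups are first counted, which the steps do not specify. The names `predict_pair_function`, `interactions` and `annotation`, and the toy proteins, are hypothetical.

```python
from collections import Counter, defaultdict


def build_adjacency(pairs):
    """Undirected adjacency map built from (a, b) interaction tuples."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def predict_pair_function(pair, interactions, annotation):
    """Predict the functional group of an unannotated pair (Steps 1-5).

    pair:         (a, b), the unannotated protein pair taken from U
    interactions: iterable of (x, y) binary interactions
    annotation:   dict mapping frozenset({x, y}) to a group label such as "O7"
    """
    a, b = pair
    adj = build_adjacency(interactions)
    # Step 2: level-1 neighbors of each protein of the pair
    neighbors_a = adj[a] - {a, b}
    neighbors_b = adj[b] - {a, b}
    # Steps 3 and 4: count groups over annotated pairs linking the two neighborhoods
    counts = Counter()
    for edge, group in annotation.items():
        x, y = tuple(edge)
        if (x in neighbors_a and y in neighbors_b) or (x in neighbors_b and y in neighbors_a):
            counts[group] += 1
    # Step 5: the most frequent group wins; None when nothing is shared
    if not counts:
        return None, counts
    return counts.most_common(1)[0][0], counts


# Toy data with hypothetical names
interactions = [
    ("A", "N1"), ("A", "N2"), ("B", "M1"), ("B", "M2"),
    ("N1", "M1"), ("N2", "M2"), ("N2", "M1"),
]
annotation = {
    frozenset(("N1", "M1")): "O7",
    frozenset(("N2", "M2")): "O7",
    frozenset(("N2", "M1")): "O2",
    frozenset(("A", "N1")): "O2",  # not between the two neighborhoods, so ignored
}
predicted, counts = predict_pair_function(("A", "B"), interactions, annotation)
print(predicted, dict(counts))
```

Returning the counts alongside the prediction makes it easy to see how close the vote was, which matters when only one or two common interacting pairs exist.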
o Illustration of Method-I with an example:
An un-annotated protein pair YAL011w-YDL181w is taken from our test dataset U, which is
shown in yellow color in Fig 9. From GP, the sub graph G´YAL011w is taken, where its level-1 neighbors are
YDR146c, YCR033w, YDR181c, YDL080c and YDR269w. Similarly, level-1 neighbors are taken
for G´YDL181w, which are YPL078c, YPL240c, YBR118w and YER148w respectively. Two
functional groups (i.e., DNA repair and cell polarity) are involved in level-1, as shown
in Fig 9.
Fig. 9 Sub-graph G´P of Protein pair YAL011w-YDL181w and its level-1 neighbor
Then the common interacting pairs between G´YAL011w and G´YDL181w are considered. In Fig 9,
it is seen that there exists only one common interacting pair, YDL080c-YPL078c,
which is marked in green color. From our dataset, it is derived that the protein
pair YDL080c-YPL078c belongs to the functional group DNA repair (O7). Now the number of
occurrences of each functional group among the common interacting pairs is counted, and the
functional group with the highest number of occurrences is assigned as the functional
group of the unannotated protein pair. Since, as in Fig 9, there exists one interacting pair of O7, we
assign O7 to the unannotated protein pair YAL011w-YDL181w.
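The example can be traced with a few lines of code. The level-1 neighbor lists and the O7 annotation of YDL080c-YPL078c are taken from the text; representing the annotated dataset as a single-entry dictionary is a simplification, since the functional groups of the other pairs in the network are not listed here.

```python
from collections import Counter

# Level-1 neighbors of the two proteins of the pair YAL011w-YDL181w
neighbors_yal011w = {"YDR146c", "YCR033w", "YDR181c", "YDL080c", "YDR269w"}
neighbors_ydl181w = {"YPL078c", "YPL240c", "YBR118w", "YER148w"}

# Only the annotated pair named in the text; the rest of the dataset is omitted
annotated_pairs = {("YDL080c", "YPL078c"): "O7"}

# Common interacting pairs: one endpoint in each level-1 neighborhood
common = [
    (x, y)
    for (x, y) in annotated_pairs
    if (x in neighbors_yal011w and y in neighbors_ydl181w)
    or (x in neighbors_ydl181w and y in neighbors_yal011w)
]

# Count functional groups and take the most frequent one
counts = Counter(annotated_pairs[pair] for pair in common)
predicted = counts.most_common(1)[0][0]
print(common, dict(counts), predicted)
```

With a single common interacting pair the vote is trivially decided, so the prediction for YAL011w-YDL181w is O7.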
Fig. 10 Sub-graph G´P of Protein pair YMR236w-YHR099w and its level-1 neighbor
Another example of a sub graph obtained in our work is highlighted above in Fig. 10; the
method for predicting the function of YMR236w-YHR099w is the same as described earlier. In our work, we select
unannotated protein pairs and predict their functional group using the Generic Approach, as shown in TABLE - I.
By counting the matched and unmatched predictions, we obtain the success rate, or
probability of success, as shown in TABLE - II.
TABLE - I
No.  Unannotated protein pair  Original function              Predicted function                     Result
1    YNL250w|YKL101w           Cell cycle control             Cell cycle control                     Matched
2    YBR023c|YER111c           Cell cycle control             Cell cycle control                     Matched
3    YPL174c|YLR210w           Mitosis                        Mitosis                                Matched
4    YLR229c|YPL161c           Two hybrid                     Two hybrid                             Matched
5    YBR023c|YLR370c           Cell polarity                  Cell polarity                          Matched
6    YNL233w|YCR009c           Cell polarity                  Cell wall organization and biogenesis  Unmatched
7    YBL061c|YLR342w           Cell polarity                  Cell polarity                          Matched
8    YFR036w|YLR127c           Coimmunoprecipitation          Coimmunoprecipitation                  Matched
9    YDR108w|YML077w           Coimmunoprecipitation          Coimmunoprecipitation                  Matched
10   YFR002w|YGR119c           Two hybrid                     Two hybrid                             Matched
11   YBL014c|YML043c           Coimmunoprecipitation          Affinity purification                  Unmatched
12   YBR193c|YOL135c           Coimmunoprecipitation          Coimmunoprecipitation                  Matched
13   YBL084c|YDR118w           Coimmunoprecipitation          Coimmunoprecipitation                  Matched
14   YDR145w|YGR252w           Copurification                 Copurification                         Matched
15   YHR099w|YOL148c           Copurification                 Copurification                         Matched
16   YHR099w|YMR236w           Copurification                 Copurification                         Matched
17   YGL112c|YHR099w           Copurification                 Copurification                         Matched
18   YBR081c|YDR392w           Copurification                 Copurification                         Matched
19   YGL097w|YIL063c           Copurification                 Copurification                         Matched
20   YGL097w|YIL063c           Synthetic lethal               Synthetic lethal                       Matched
21   YDR145w|YDR176w           Copurification                 Copurification                         Matched
22   YDR145w|YLR055c           Copurification                 Copurification                         Matched
23   YNL273w|YGL163c           DNA repair                     DNA repair                             Matched
24   YCL061c|YMR190c           DNA repair                     DNA repair                             Matched
25   YKL113c|YDR369c           DNA repair                     DNA repair                             Matched
26   YGR078c|YFR019w           Lipid metabolism               Lipid metabolism                       Matched
27   YBR023c|YFR019w           Lipid metabolism               Lipid metabolism                       Matched
28   YCL061c|YAR002w           Nuclear-cytoplasmic transport  Nuclear-cytoplasmic transport          Matched
29   YLR418c|YLR384c           Pol II transcription           Pol II transcription                   Matched
30   YLR418c|YJR140c           Pol II transcription           Pol II transcription                   Matched
31   YPR135w|YGL244w           Pol II transcription           Pol II transcription                   Matched
32   YPR135w|YHR200w           Pol II transcription           Pol II transcription                   Matched
33   YOR070c|YJR032w           Protein folding                Protein folding                        Matched
34   YDR420w|YDR245w           Protein modification           Protein modification                   Matched
35   YLR418c|YDR363w-a         Vesicular transport            Vesicular transport                    Matched
36   YLR039c|YLR360w           Vesicular transport            Vesicular transport                    Matched
TABLE - II
Total no. of unannotated protein pairs  Matched  Unmatched  Success rate (%)
36                                      34       2          94.4
III. RESULTS & DISCUSSION
The above method is evaluated by the success rate, which is defined as
\[
\text{Success rate} = \frac{\text{number of protein pair functions predicted correctly}}{\text{total number of unannotated protein pairs}}
\]
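As a quick check of this definition, the success rate can be computed from (original function, predicted function) records. The function name and the toy records below are illustrative assumptions; only the totals 34 and 36 come from TABLE - II.

```python
def success_rate(records):
    """Fraction of protein pairs whose predicted function equals the original one.

    records: iterable of (original_function, predicted_function) tuples.
    """
    records = list(records)
    if not records:
        raise ValueError("no protein pairs to evaluate")
    matched = sum(1 for original, predicted in records if original == predicted)
    return matched / len(records)


# Toy records (hypothetical selection, 3 of 4 matched)
toy_records = [
    ("DNA repair", "DNA repair"),
    ("Cell polarity", "Cell wall organization and biogenesis"),
    ("Copurification", "Copurification"),
    ("Protein folding", "Protein folding"),
]
toy_rate = success_rate(toy_records)

# Totals reported in TABLE - II: 34 matched out of 36 unannotated pairs
reported_percent = round(100 * 34 / 36, 1)
print(toy_rate, reported_percent)
```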
In our work, we predict the functions of protein pairs using the Generic Approach algorithm and
estimate the success rate for the 15 considered functional groups, out of which the probability of
success for six functional groups (co-purification (O6), co-immuno-precipitation (O5), pol
II transcription (O10), vesicular transport (O15), DNA repair (O7), cell polarity (O2)) is
shown in tabular and pictorial form in TABLE - III and Fig. 12
respectively.
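The per-group probabilities in TABLE - III follow from the same ratio applied group by group. The sketch below only re-derives them from the counts listed in that table; the dictionary layout and variable names are assumptions made for illustration.

```python
# (number of unannotated pairs, number of matched pairs) per group, from TABLE - III
table_iii = {
    "O6": (8, 8),
    "O5": (5, 4),
    "O10": (4, 4),
    "O15": (2, 2),
    "O2": (3, 2),
    "O7": (3, 3),
}

# Probability of success per functional group, rounded to two decimals
per_group = {
    group: round(matched / total, 2)
    for group, (total, matched) in table_iii.items()
}

# Pooled counts over the six listed groups only (not all 36 pairs of TABLE - I)
total_pairs = sum(total for total, _ in table_iii.values())
matched_pairs = sum(matched for _, matched in table_iii.values())
print(per_group, matched_pairs, total_pairs)
```

Rounding 2/3 gives 0.67, whereas the table lists 0.66 for O2, which suggests the reported value was truncated rather than rounded.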
TABLE - III
Functional group  Number of unannotated protein pairs  Number of matched protein pairs  Probability of success
O6                8                                    8                                1
O5                5                                    4                                0.8
O10               4                                    4                                1
O15               2                                    2                                1
O2                3                                    2                                0.66
O7                3                                    3                                1
Fig. 12 Pictorial representation of success rate for six functional groups.
(Chart series: number of unannotated protein pairs, number of matched protein pairs, probability of success; vertical axis from 0 to 9.)
Our proposed work adds an extra dimension to existing graph-theoretic methods, as it
computes the function of an unannotated protein pair instead of a single protein, considering level-1
neighbors. We expect the performance of the generic approach to improve if we consider a
larger interaction network and level-2 neighbors. In future, our aim is to work with more
functional groups and with different organisms.
REFERENCES
[1] B. Schwikowski, P. Uetz and S. Fields, A network of protein-protein interactions in yeast.
Nature Biotech. 18, 1257-1261, 2000.
[2] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami, and T. Takagi, Assessment of prediction
accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531,
2001.
[3] J. Chen, W. Hsu, M. L. Lee, and S. K. Ng. Labeling network motifs in protein
interactomes for protein function prediction. Proc 23rd International Conference on Data
Engineering (ICDE). 546-555, 2007.
[4] Vazquez, “Global Protein Function Prediction from Protein-Protein Interaction
Networks,” Nature Biotechnology, vol. 21, pp. 697-700, June 2003.
[5] U. Karaoz, T. M. Murali, S. Letovsky, Y. Zheng, C. Ding, C. R. Cantor, and S. Kasif.
Whole-genome annotation by using evidence Integration in functional-linkage.
[6] E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, M. Singh. Whole Proteome prediction of
protein functions via graph-theoretic analysis of interaction maps. Bioinformatics 21
(Suppl 1): i302– i310, 2005.
[7] M. Deng, Inferring domain-domain interactions from protein protein interactions.
Genome Res. 12(10):1540-8, 2002.
[8] S. Letovsky, S. Kasif. Predicting protein function from protein protein interaction data: a
probabilistic approach. Bioinformatics.19 (Suppl 1): i197–i204, 2003.
[9] D. D. Wu, X. Hu, An efficient approach to detect a protein community from a seed. 2005
IEEE Symposium on Computational Intelligence in Bioinformatics and Computational
Biology (CIBCB2005). La Jolla CA, USA: IEEE, pp. 135–141, 2005.
[10] Vazquez, “Global Protein Function Prediction from Protein-Protein Interaction
Networks,” Nature Biotechnology, vol. 21, pp. 697-700, June 2003.
[11] M. P. Samanta, S. Liang, Predicting protein functions from
redundancies in large scale protein interaction networks. Proc Natl Acad Sci USA 100:
12579–12583, 2003.
[12] V. Arnau, S. Mars, I. Marin, Iterative cluster analysis of protein interaction data.
Bioinformatics 21: 364–378, 2005.
[13] G. D. Bader, C. W. Hogue, An automated method for finding molecular complexes in
large protein interaction networks. BMC Bioinformatics 4: 2, 2003.
[14] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, S. Kanaya, Development and
implementation of an algorithm for detection of protein complexes in large interaction
networks. BMC Bioinformatics 7: 207, 2006.
[15] V. Spirin, L. A. Mirny, Protein complexes and functional modules in molecular
networks. Proc Natl Acad Sci USA 100: 12123–12128, 2003.
[16] A. D. King, N. Przulj, I. Jurisica, Protein complex prediction via cost-based clustering.
Bioinformatics 20: 3013–3020, 2004.
[17] S. Asthana, O. D. King, F. D. Gibbons, F. P. Roth, Predicting protein complex
membership using probabilistic network reliability. Genome Res 14: 1170–1175, 2004.
[18] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, Global
landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637–
643, 2006.
[19] Deepalakshmi. R and Jothi Venkateswaran C, “A Survey on Mining Methods for
Protein Sequence Analysis: An Aerial View”, International journal of Computer
Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 28 - 34, ISSN Print:
0976 – 6367, ISSN Online: 0976 – 6375.