Performance Enhancement Algorithms For Data Reduction in Hadoop Environment
Mr. A. Antony Prakash1, Dr. A. Aloysius2
1 Asst. Professor in Information Tech, St Joseph's College, Tiruchirappalli-2
[email protected]
2 Asst. Professor in Computer Science, St Joseph's College, Tiruchirappalli-2
Abstract: Current research in interactomics focuses mainly on designing intelligent computational systems, which produce large volumes of biological interaction data. Predicting protein complexes (dense sub-networks) from these voluminous host-viral interaction networks is one of the open research challenges. The complexes take the form of cliques and non-cliques in single-species interactions, and of bicliques and non-bicliques in host-viral (dual-species) interactions. Existing graph-theoretic approaches to this problem concentrate on mining cliques from interaction networks based on their topological properties, and ignore dense non-cliques. The protein complexes in host-viral interactions are dense sub-networks that may be either bicliques or non-bicliques. Score based Co-Clustering with MapReduce (MR-CoC) is a sub-network mining algorithm based on a score measure that extracts both cliques and non-cliques. This paper uses it to mine protein complexes (bicliques and non-bicliques) from the HIV-Human protein interaction network. The extracted HIV-Human sub-graphs are mapped to the existing HIV-Human complexes, and almost 95 percent of the complexes are covered. The unknown protein sub-graphs that are extracted can be provided to biologists for new complex discovery. A Gene Ontology and pathway based analysis is also carried out. It shows that the viral interactions target human proteins of the immune system, which is consistent with the known functionality of HIV.
Keywords: Big Data, Biclique, Bipartite graph, Complete graph, Clustering, Co-Clustering, Sub-graph mining, Sub-network Mining, MR-CoC
1. Introduction
The protein complexes of host-viral interactions are bio-products, like normal protein complexes, that are used to understand viral dynamics, the central hubs of viral infection, disease diagnosis, and the biological characteristics of biological systems [1]. Throughout this paper, the protein complexes in host-viral interactions are referred to simply as protein complexes. Their prediction is disease specific, and very few complexes have been predicted so far. Protein complexes are sub-networks of a protein interaction network that are responsible for specific biological functions such as signal transduction, cell replication, cellular immunization and catalytic activity in various parts of the cell [2] [3] [4]. Infections of host organisms by disease pathogens are diagnosed based on these bio-products. The available sub-network mining algorithms include CMC, COACH, MCODE [23], MCL [6], Cfinder [7], RNSC [8] and STM [9]. These algorithms mainly mine cliques and do not attempt biclique mining. They also scale poorly to voluminous data. Parallelizing the existing computational approaches is one way to increase scalability.
In this scenario, the MapReduce programming model is one of the de facto standards for handling big data [10] [11] [12] [13]. The model can be used for purposes such as parallelization, reducing redundant computation and addressing scalability issues. Here it is chosen mainly to parallelize the computation and to cope with scalability. In conventional parallel computation, much of the time is spent on context switching and event synchronization, which the MapReduce model avoids. Some MapReduce based approaches [14] [15] [16] exist, but they still suffer from time complexity issues. In this work, the score based MR-CoC is used to detect bicliques and non-bicliques in interaction data. The complexity of this approach is O(Es + log Ns), which is lower than that of existing approaches [17]. The limitation of the score based MR-CoC is that the sub-networks identified depend on the initial seeds fed to the model. The number of seeds can be increased to widen the random coverage of the approach and mine more sub-networks. To improve the results, seeds are generated 'n' times and the results of all seed configurations are combined. The approach is slightly modified to mine bipartite sub-graphs and is applied to the HIV-Human interaction dataset. The sub-networks are then mapped to the available HIV-Human protein complexes to analyze the coverage of protein complexes.
International Journal of Pure and Applied MathematicsVolume 119 No. 18 2018, 1803-1811ISSN: 1314-3395 (on-line version)url: http://www.acadpubl.eu/hub/Special Issue http://www.acadpubl.eu/hub/
The results show that the proposed approach mines more than 30,000 sub-networks in addition to the existing complexes. The biological significance of these unknown sub-networks is annotated. The following section covers the literature study. The proposed approach and the discussion of the experimental results follow in the subsequent sections. Finally, the summary of this research and its future enhancements are discussed in Section 6.
Figure 1. (a) A Sample Protein Interaction Network with Sub-Networks highlighted, (b) A Complete Bipartite Graph and its adjacency matrix
2. Related Work
The literature shows that many supervised and unsupervised models [3] [8] [18] [19] have been devised to mine protein complexes (sub-networks) and protein patterns based on their topological properties. They use various graph-theoretic techniques, topological properties and classification measures to confine their results. They concentrate on different criteria, each with drawbacks: mining cliques of fixed size ignores non-cliques and has high complexity, and mining dense sub-graphs of fixed size also has high complexity. Not all existing protein complexes are complete sub-graphs; in some cases they are only dense sub-graphs. In some related works, the computational overhead and scalability issues are avoided by parallelizing the computation in a distributed environment using MapReduce. Time complexity issues remain, however, because graph algorithms are hybridized with the MapReduce model. The related works reviewed are listed in Table 1.
The current work parallelizes the mining process using MapReduce and uses a score measure that captures the cohesiveness of the nodes. Some researchers have suggested that mining the adjacency matrix does not produce fruitful results. The proposed approach nevertheless uses the adjacency matrix to produce efficient results and outperforms existing benchmark approaches such as MCODE [23], whose complexity is O(n^3), where n is the number of nodes; this is higher than that of the proposed MR-CoC.
Table 1. Related Research Works
Title | MapReduce | Algorithm | Input | Output
Detection of Functional modules from protein Interaction Networks [20] | - | Clustering | Weighted Score | Network Clusters
Identifying Functional Modules in Protein-Protein interaction Networks: an integrated exact approach [21] | - | Mathematical optimization | Weighted Vertex | Network Modules
Weighted Consensus Clustering for Identifying Functional Modules In Protein-Protein Interaction Networks [22] | - | Clustering (Combines 4 Clustering algorithms) | V, E | Network clusters
An automated method for finding molecular complexes in large protein interaction networks [23] | - | MCODE | Vertex, degree | Clique (Connected Sub-graph)
A Faster Algorithm for Detecting Network Motifs [5] | - | Enumerating Sub-graphs | V, E, neighbor list | k-size Sub-graphs (Network Motifs)
Scalable Sub-graph Enumeration in MapReduce [14] | MapReduce | Enumerating Sub-graphs | V, E, Pattern | Sub Graphs
Efficient sampling algorithm for estimating sub-graph concentrations and detecting network motifs [24] | - | Edge Sampling | Edge list and neighbor list | Sub Graphs
A novel MapReduce-based approach for distributed frequent sub-graph mining [16] | MapReduce | MR based Graph Partitioning | V, E | Frequent Pattern
An Iterative MapReduce Approach to Frequent Sub-graph Mining in Biological Datasets [15] | MapReduce | Clustering | | Frequent Sub-graphs
3. Bipartite Graph Mining using Score based MR-CoC
Network Representation and Terminology
A protein interaction network (PIN) is represented as an undirected graph G(V, E) [25] [18] [26] [27]. Interactions between different species, such as disease-host interactions, are represented as a bipartite graph GB. Here the proteins of the two species form the vertex (node) sets Vd and Vh, and the interactions between these proteins are the edges or links (E). An edge is represented as Ei = <Pa, Pb>, where Pa ∈ Vd and Pb ∈ Vh, with one protein from each species. The PIN is an undirected bipartite graph (<Pa, Pb> = <Pb, Pa>). The connectivity among the proteins is represented by an adjacency matrix of size |Vd| × |Vh|, whose entries give the membership value of each edge as defined in Equation (1). The adjacency matrix (Am,n) of the undirected bipartite graph has disease proteins in rows and host (human) proteins in columns.
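$$A_{m,n}(a,b) = \begin{cases} 1, & \text{if } \langle P_a, P_b \rangle \in E \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$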
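The adjacency matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_bipartite_adjacency` and the toy protein names are hypothetical and do not come from the paper, whose implementation is in Matlab.

```python
def build_bipartite_adjacency(disease_proteins, host_proteins, interactions):
    """Return an m x n 0/1 matrix: rows are disease proteins, columns are host proteins."""
    row_index = {p: i for i, p in enumerate(disease_proteins)}
    col_index = {p: j for j, p in enumerate(host_proteins)}
    matrix = [[0] * len(host_proteins) for _ in disease_proteins]
    for a, b in interactions:
        # The graph is undirected, so <Pa, Pb> and <Pb, Pa> denote the same edge.
        if a in row_index and b in col_index:
            matrix[row_index[a]][col_index[b]] = 1
        elif b in row_index and a in col_index:
            matrix[row_index[b]][col_index[a]] = 1
    return matrix


# Toy data: these protein names are placeholders, not taken from the dataset.
disease = ["d1", "d2"]
host = ["h1", "h2", "h3"]
edges = [("d1", "h1"), ("h2", "d1"), ("d2", "h3")]
adjacency = build_bipartite_adjacency(disease, host, edges)
```

Edges whose endpoints are not one disease protein and one host protein are silently skipped in this sketch; a real pipeline would probably want to report them.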
Complete Bipartite Graph (Gm,n): every vertex of the first vertex set (|Vd| = m) has an edge to every vertex of the second vertex set (|Vh| = n) [26] [28]. All the entries of its adjacency matrix are 1, as in Figure 1(b).
Biclique: a biclique is a complete bipartite sub-graph present in a given bipartite graph [26]. It has all the properties of a complete bipartite graph, but it is a sub-graph.
Non-Biclique: a bipartite sub-graph that is not complete.
Initial Seed Vectors: a seed vector is a set of proteins chosen at random from which a sub-network is mined, as shown in Figure 2. Multiple random seeds are generated in the same way to mine sub-networks in a random sub-space.
Figure 2. A Sample Seed Vector – the numeric value
represents the protein index
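Seed generation as described above could look roughly like the following. The function name `generate_seeds`, the zero-based protein indices and the fixed random state are assumptions made for the example; the paper does not specify how its seeds are drawn beyond being random.

```python
import random


def generate_seeds(num_proteins, seed_length, num_seeds, rng=None):
    """Draw num_seeds seed vectors, each holding seed_length distinct protein indices."""
    rng = rng or random.Random()
    # random.Random.sample draws without replacement, so no index repeats in a seed.
    return [sorted(rng.sample(range(num_proteins), seed_length))
            for _ in range(num_seeds)]


# Sizes follow the HIV-Human setup reported in the paper (4481 human proteins,
# seed length 40); the number of seeds and the random state are illustrative only.
seeds = generate_seeds(num_proteins=4481, seed_length=40, num_seeds=5,
                       rng=random.Random(0))
```

Different seeds may overlap or even coincide, which is one reason the mined sub-graphs can be redundant.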
Sub-Network Mining using Score based MR-CoC
The score measure is used for co-clustering the PIN. The score is defined using the frequency of '1's in the adjacency matrix. The score of the adjacency matrix of a graph or sub-graph can be calculated to assess the nature of the local pattern derived from the given PIN. It is the ratio of the frequency of '1's to the number of elements [17]; for bipartite graphs the score is redefined as in Equation (2).
$$\mathrm{Score}(A_{m,n}) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{m,n}(i,j)}{m \times n} \qquad (2)$$
where Am,n is the adjacency matrix of order 'm x n' holding the membership value (either '0' or '1') of each edge, and the numerator is the frequency of '1's in the adjacency matrix. The score of a biclique or a complete bipartite graph is always 1. Thresholds below 1 can be used to mine non-bicliques, that is, dense sub-networks. The proposed score based co-clustering approach for bipartite graphs is depicted in Figure 3. The scoreCoC(Am,n) function is slightly modified for mining bipartite sub-networks, as shown below.
Function scoreCoC(Am,n)
    while (score(Am,n) < thres)
        Remove the protein (row or column) with the lowest frequency of 1's, min(min(row_freq), min(col_freq))
        Evaluate score(Am,n)
    end
    return (protein ids in Am,n)
end
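The pseudocode above can be turned into a small runnable sketch. It is not the authors' Matlab implementation: the names `score` and `score_coc` are hypothetical, and the rule for ties (remove the row when the smallest row and column frequencies are equal) is an assumption, since the paper does not state one.

```python
def score(matrix):
    """Fraction of entries equal to 1; 0.0 for an empty matrix."""
    cells = sum(len(row) for row in matrix)
    return sum(map(sum, matrix)) / cells if cells else 0.0


def score_coc(matrix, row_ids, col_ids, thres):
    """Drop the sparsest row or column until the remaining sub-matrix reaches thres."""
    rows = list(range(len(row_ids)))
    cols = list(range(len(col_ids)))

    def sub():
        return [[matrix[i][j] for j in cols] for i in rows]

    while rows and cols and score(sub()) < thres:
        row_freq = {i: sum(matrix[i][j] for j in cols) for i in rows}
        col_freq = {j: sum(matrix[i][j] for i in rows) for j in cols}
        worst_row = min(rows, key=row_freq.get)
        worst_col = min(cols, key=col_freq.get)
        # Assumed tie-break: on equal frequencies the row is removed.
        if row_freq[worst_row] <= col_freq[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return [row_ids[i] for i in rows], [col_ids[j] for j in cols]


# Toy 3 x 3 matrix with a 2 x 2 biclique in its top-left corner.
toy = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
result = score_coc(toy, ["d1", "d2", "d3"], ["h1", "h2", "h3"], thres=1.0)
```

With `thres=1.0` only bicliques survive; a lower threshold such as the 0.8 used in the experiments would also keep dense non-bicliques.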
MR-CoC has two main phases, the Map phase and the Reduce phase. The generated seeds are written to text files, which are fed as input to the Map phase. For each seed, a sub-matrix is generated by extracting the columns of Am,n that correspond to the seed proteins, and the score based co-clustering process given in the algorithm is then applied to it. Here row_freq is the frequency of 1's in each row, and col_freq is the same for each column.
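The two phases can be imitated in plain Python to show the data flow. This is an in-process, sequential sketch and not the Matlab MapReduce setup used in the paper; `map_phase`, `reduce_phase`, `run` and the stand-in miner `toy_mine` are all hypothetical names. Using the mined sub-network itself as the key is one assumed way to obtain distinct sub-graphs in the Reduce phase.

```python
from collections import defaultdict


def map_phase(seed_id, seed, mine):
    """Emit (sub-network, seed_id); identical sub-networks share the same key."""
    members = frozenset(mine(seed))
    if members:
        yield members, seed_id


def reduce_phase(pairs, min_size):
    """Group by sub-network so each distinct one is kept once, then filter by size."""
    grouped = defaultdict(list)
    for members, seed_id in pairs:
        grouped[members].append(seed_id)
    return {m: ids for m, ids in grouped.items() if len(m) >= min_size}


def run(seeds, mine, min_size):
    pairs = [pair
             for seed_id, seed in enumerate(seeds)
             for pair in map_phase(seed_id, seed, mine)]
    return reduce_phase(pairs, min_size)


def toy_mine(seed):
    # Stand-in for the score based co-clustering step: keeps even protein indices.
    return [p for p in seed if p % 2 == 0]


distinct = run([[0, 1, 2, 4], [2, 4, 0, 3], [1, 3, 5], [6, 7]], toy_mine, min_size=2)
```

In the toy run the first two seeds yield the same sub-network, so it is reported once together with both seed ids, while the empty and undersized results are dropped.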
Figure 3. Workflow of the proposed MR-CoC model: (a) Seed Generation (PIN to Initial Random Seed Matrix), (b) Mining Sub-Network using Score based MR-CoC (Adjacency matrices), (c) Biological Significance of Sub-Networks
The complexity of the proposed work is O(Es + log Ns), where Es is the number of edges in a seed and Ns is the number of nodes in a seed. It is lower than that of the MCODE algorithm, O(n^3) [23], which is widely used for mining sub-graphs.
4. Experimental Setup
The Homo sapiens dataset from the STRING database [29], shown in Figure 5, was used previously for mining cliques and non-cliques. The sub-networks obtained depend entirely on the initial seeds generated [17]. In this work, the proposed approach is run 50 times for each seed configuration, with the same number of random seeds generated afresh for each run. The cliques and non-cliques were mined in the previous work; here, in addition, the protein complex coverage is evaluated to show the performance of the proposed approach. Secondly, the HIV-1 Human Interaction Database (a bipartite graph) is chosen for mining sub-networks (bicliques and non-bicliques) with the proposed approach, as shown in Figure 4. The protein interaction networks are taken from NCBI [30] and are described in Table 2. The seeds are initial sets of proteins used to generate each sub-graph, and the resulting sub-graphs may be redundant. The score measure is used to find the sub-graph from each seed, and the distinct sub-graphs are extracted in the Reduce phase. The proposed approach is implemented using the MapReduce model in Matlab. The environmental setup is given in Table 2. The workers represent the number of parallel threads; 10 workers are used for this implementation. A score threshold of 0.8 is chosen for non-cliques and non-bicliques. The experiments are carried out on a system with an Intel i7 processor and 12 GB RAM.
Table 2. Experimental Setup
Parameter | Homo Sapiens PPI | HIV-Human PPI
Environment | Matlab 2016b (MapReduce Model) | Matlab 2016b (MapReduce Model)
Number of Interactions | 85,48,003 | 17,104
Number of proteins | 19,427 | HIV: 11, Human: 4481
Seed length | 50 proteins | 40 proteins
Minimum size of sub-network | 3, 4, 5 proteins | 4 proteins
The protein complexes of Homo sapiens are taken from the CORUM database, a comprehensive resource of mammalian protein complexes [31]. It has 2358 human protein complexes. The results are mapped to these existing protein complexes to analyze the performance of the proposed approach.
Figure 4. Heat Map of adjacency matrix of HIV-Human PIN
(17,104 Interactions)
Figure 5. Heat Map of adjacency matrix of Homo Sapiens PIN
(85,48,003 Interactions)
5. Result and Discussion
The sub-networks are mined from the Homo sapiens PPI using MR-CoC. The numbers of cliques and non-cliques obtained with the proposed methodology for different initial seed setups and minimum sub-network sizes are listed in Table 3. The cohesiveness of a sub-network is captured by the score measure. The biological significance of the obtained cliques and non-cliques can be studied further by comparing them with the existing protein complexes taken from the CORUM database [31]. In the experimental results, some sub-networks are exactly the same as existing protein complexes, some are partially the same, and some remain unmapped. This protein complex coverage of the computational results is discussed in Tables 4, 5 and 6.
The protein complex coverage is evaluated in terms of fully mapped, partially mapped and un-mapped sub-networks.
Fully mapped sub-networks (FM) are the same as an existing protein complex.
Partially mapped sub-networks (PM) contain at least 90 percent of the participants of an existing protein complex.
Un-mapped sub-networks (UM) contain less than 90 percent of the participants of the existing protein complexes.
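The three coverage categories above can be expressed as a short classification routine. The sketch assumes that "90 percent" means at least 90 percent of a complex's members appear in the sub-network; the function name `classify_subnetwork` and the single-letter protein names are hypothetical.

```python
def classify_subnetwork(subnetwork, complexes, partial_threshold=0.9):
    """Return 'FM', 'PM' or 'UM' for one mined sub-network."""
    sub = set(subnetwork)
    best = 0.0
    for complex_members in complexes:
        members = set(complex_members)
        if not members:
            continue
        if sub == members:
            return "FM"  # identical to an existing complex
        # Share of this complex's participants that the sub-network contains.
        best = max(best, len(sub & members) / len(members))
    return "PM" if best >= partial_threshold else "UM"


# Toy reference complex with ten members (placeholder names).
known = [list("abcdefghij")]
labels = [
    classify_subnetwork(list("abcdefghij"), known),
    classify_subnetwork(list("abcdefghi") + ["x"], known),
    classify_subnetwork(list("abcx"), known),
]
```

Each sub-network is compared against every known complex and labeled by its best match.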
The coverage of the human protein complexes by the resultant sub-networks, for different seed configurations on the Homo sapiens dataset, is visualized in Figures 6, 7 and 8. Similarly, the coverage of the HIV-Human protein complexes by the resultant sub-networks is visualized in Fig 9. The X-axis represents the number of seed vectors in millions (M). The Y-axis represents the number of sub-networks that map the protein complexes fully, partially and so on. Of the 2358 existing human protein complexes, 2237 are mapped by the resultant sub-networks. There are 26575 unmapped sub-networks, which are provided to biologists for further study of their biological significance and for predicting new complexes.
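As a quick check of the coverage figure, using only the counts reported above:

```latex
\[
\text{coverage}_{\text{human}} = \frac{2237}{2358} \approx 0.9487 \quad (94.87\%)
\]
```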
Table 3. Number of Cliques and Non-Cliques extracted on Homo Sapiens Dataset with different parameter values
Number of Seeds | Min. size 3, threshold 0.9: Cliques | Non-Cliques | Min. size 4, threshold 0.85: Cliques | Non-Cliques | Min. size 5, threshold 0.8: Cliques | Non-Cliques
100000 | 2910 | 3181 | 821 | 2766 | 683 | 2194
500000 | 2174 | 4522 | 1139 | 4172 | 812 | 3178
1000000 | 4372 | 7188 | 2897 | 3921 | 2381 | 2190
5000000 | 5917 | 8620 | 3200 | 5193 | 1782 | 4822
10000000 | 8539 | 11861 | 3176 | 7885 | 2910 | 8821
50000000 | 6433 | 10865 | 5192 | 7231 | 4987 | 6975
100000000 | 18862 | 37297 | 11862 | 9021 | 10192 | 19021
In the proposed bipartite sub-network mining approach, 38 of the 40 existing HIV-Human protein complexes are predicted. Some of the bicliques and non-bicliques among the un-mapped sub-networks are annotated for their biological function, molecular process and cellular component using the STRING database [29] query services, and are listed in Table 4.
Figure 6. Protein Complex Coverage with minimum sub-network
size 3
Figure 7. Protein Complex Coverage with minimum sub-network
size 4
Figure 8. Protein Complex Coverage with minimum sub-network
size 5
Some of the bicliques and non-bicliques, along with their biological function, molecular process and cellular component extracted using the STRING database [29] query service, are listed in Table 4. The biological significance of the sub-networks is studied further to understand their functions. The study shows that HIV-1 interaction with the host organism affects its immune system. Most of the components contain proteins responsible for immune system maintenance. The HIV infection acts on the extracellular space of the host organism. The common functions of the components' subunits are wound healing, cell binding, defense response, immune system regulation, cell growth, regulation of metabolic activities and so on; the biological processes of some of the components extracted using the proposed methodology are listed in Table 4. The KEGG pathways found in some of the components are given in Table 5. The pathway analysis shows that the components carry traces of viral infection pathways, respiration regulatory signals, tuberculosis pathways, natural killer cell mediated cytotoxicity pathways and so on. Thus the proposed approach can extract more components from the randomized space (random seeds) than other graph-theoretic methods.
Table 4. Biological Significances of the Components in HIV-Human PIN using Score based MR-CoC
Component | GO Term | Biological Process
Component 1 (C3, CD46, CFH, CR1, IFNA8, IGF1, IGF2, ITIH4, MASTL) | GO.0002682 | regulation of immune system process
 | GO.0006952 | defense response
 | GO.0032269 | negative regulation of cellular protein metabolic process
 | GO.0002252 | immune effector process
 | GO.0042060 | wound healing
 | GO.0048584 | positive regulation of response to stimulus
 | GO.0048583 | regulation of response to stimulus
 | GO.0006950 | response to stress
Component 2 (APOBEC3D, CD46, CFH, CR1, HFE, HGF, SEC62, SEC63, C3, TFRC) | GO.0006952 | defense response
 | GO.0006950 | response to stress
 | GO.0002252 | immune effector process
 | GO.0002376 | immune system process
 | GO.0045087 | innate immune response
Component 3 (C3, CD46, CFH, CR1, IGF1, IGF2, IGFBP1, MASTL, NUPL2, SH3RF1, VPRBP, ARPP19, SH3RF1) | GO.0006959 | humoral immune response
 | GO.0006956 | complement activation
 | GO.0010827 | regulation of glucose transport
 | GO.0019538 | protein metabolic process
 | GO.0002455 | humoral immune response mediated by circulating immunoglobulin
 | GO.0002250 | adaptive immune response
 | GO.0002252 | immune effector process
 | GO.0002684 | positive regulation of immune system process
 | GO.0043086 | negative regulation of catalytic activity
 | GO.0048583 | regulation of response to stimulus
 | GO.0048584 | positive regulation of response to stimulus
 | GO.0031324 | negative regulation of cellular metabolic process
 | GO.0045087 | innate immune response
Component 4 (CDK1, CFH, FLNA, GRM1, HFE, HGF, IFI16, IFI27, ITIH4) | GO.0006103 | 2-oxoglutarate metabolic process
 | GO.0030162 | regulation of proteolysis
 | GO.0006102 | isocitrate metabolic process
 | GO.0051246 | regulation of protein metabolic process
 | GO.0009060 | aerobic respiration
Component 5 (IFI35, IFIT1, IFIT2, IFIT3, IFNA1, IFNA2, IFNA4, IFNA7, IFNGR2, ITIH4, ITK, NEDD4, SP110, VPRBP) | GO.0009615 | response to virus
 | GO.0051607 | defense response to virus
 | GO.0006955 | immune response
 | GO.0034097 | response to cytokine
 | GO.0045087 | innate immune response
 | GO.0006952 | defense response
 | GO.0043330 | response to exogenous dsRNA
 | GO.0002376 | immune system process
 | GO.0002250 | adaptive immune response
 | GO.0009615 | response to virus
 | GO.0071345 | cellular response to cytokine stimulus
 | GO.0002323 | natural killer cell activation involved in immune response
 | GO.0006950 | response to stress
 | GO.0042110 | T cell activation
 | GO.0002286 | T cell activation involved in immune response
 | GO.0002520 | immune system development
 | GO.0006959 | humoral immune response
 | GO.0050794 | regulation of cellular process
 | GO.0050896 | response to stimulus
Table 5. KEGG Pathway Significances of the Components in HIV-Human Interactions using Score based MR-CoC
Component | Pathway | KEGG Pathways
Component 1 (C3, CD46, CFH, CR1, IFNA8) | 4610 | Complement and coagulation cascades
 | 5152 | Tuberculosis
 | 5150 | Staphylococcus aureus infection
Component 2 (C3, CD46, CFH, CR1) | 4610 | Complement and coagulation cascades
 | 5144 | Malaria
Component 3 (C3, CD46, CFH, CR1, IGF1, IGF2, IGFBP1, NUPL2, SH3RF1, VPRBP) | 4610 | Complement and coagulation cascades
 | 5152 | Tuberculosis
 | 5134 | Legionellosis
 | 5150 | Staphylococcus aureus infection
Component 4 (CDK1, CFH, FLNA, GRM1, HFE, HGF, IFI16, IFI27, ITIH4) | 1210 | 2-Oxocarboxylic acid metabolism
 | 1230 | Biosynthesis of amino acids
 | 20 | Citrate cycle (TCA cycle)
 | 1200 | Carbon metabolism
Component 5 (IFIT1, IFNA1, IFNA2, IFNA4, IFNA7, IFNGR2) | 5168 | Herpes simplex infection
 | 4140 | Regulation of autophagy
 | 4650 | Natural killer cell mediated cytotoxicity
 | 5160 | Hepatitis C
 | 5162 | Measles
 | 5320 | Autoimmune thyroid disease
 | 4630 | Jak-STAT signaling pathway
 | 5164 | Influenza A
 | 5152 | Tuberculosis
 | 4622 | RIG-I-like receptor signaling pathway
 | 5164 | Influenza A
 | 4060 | Cytokine-cytokine receptor interaction
 | 5161 | Hepatitis B
Analyzing all the components for their biological significance will further help biologists study the characteristics of the disease in the host organism. These components can also be used to assist drug discovery, drug target identification and similar tasks. The proposed methodology is useful in a distributed environment for mining sub-networks from complex interaction networks. The computational time can be reduced considerably if the computation is carried out in a distributed setup, which will help to overcome the big data issues in interactomics.
6. Conclusion
Protein complex mining is an emerging research area. The proposed methodology, the Score based Co-Clustering algorithm with the MapReduce model, is devised to mine all kinds of dense sub-graphs: cliques, bicliques, non-cliques and non-bicliques. The approach was previously applied to mine sub-networks from large networks such as PINs. Its performance is studied here based on the complex coverage of the results. More than 94.86 percent of the existing complexes are mapped by the resultant sub-networks, and the approach also discovers 26575 unmapped protein sub-networks. It is further applied to extract bicliques and non-bicliques from HIV-Human interactions, where it extracts 6824 sub-networks and maps 38 of the 40 existing HIV-Human protein complexes.
The unmapped sub-networks of the HIV-Human dataset are annotated for their biological significance. The results show that the viral pathogen targets human proteins that monitor the immune system, which is evidence of HIV functionality. It targets proteins in the extracellular spaces of blood particles. The pathway analysis reveals that HIV infection affects respiration regulatory proteins and shows traces of other viral infections that have similar pathways. The extracted components (protein sub-networks) can be analyzed further to understand the dynamics of HIV in the host system.
REFERENCES
[1] “Structures of Life”, 2007.
[2] E. M. Hanna, N. Zaki and A. Amin, "Detecting Protein Complexes in Protein Interaction Networks Modeled as Gene Expression Biclusters," pp. 1-19, 2015.
[3] F. Y. Yu, Z. H. Yang, X. H. Hu, Y. Y. Sun, H. F. Lin and J. Wang, "Protein complex detection in PPI networks based on data integration and supervised learning method," BMC Bioinformatics, vol. 16, no. 12, pp. 1-9, 2015.
[4] L. Ou-Yang, X.-F. Zhang, D.-Q. Dai, M.-Y. Wu, Y. Zhu, Z. Liu and H. Yan, "Protein complex detection based on partially shared multi-view clustering," BMC Bioinformatics, 2016.
[5] S. Wernicke, "A Faster Algorithm for Detecting Network Motifs," in 5th WABI-05, 2005.
[6] A. Enright, S. Dongen and C. Ouzounis, "An efficient algorithm for largescale detection of protein families," Nucleic Acids Research, vol. 30, no. 7, pp. 1575-1584, 2002.
[7] B. Adamcsek, G. Palla, I. J. Farkas, I. Dernyi and T. Vicsek, "Cfinder: locating cliques and overlapping modules in biological networks," Bioinformatics, vol. 22, no. 8, p. 1021–1023, 2006.
[8] A. D. King, N. Przulj and I. Jurisica, "Protein complex prediction via cost-based clustering," Bioinformatics, vol. 20, no. 17, pp. 3013-3020, 2004.
[9] W. Hwang, Y. R. Cho, A. Zhang and M. Ramanathan, "A novel functional module detection algorithm for protein-protein interaction networks," Algorithms for Molecular Biology, vol. 1, no. 24, 2006.
[10] J. Ekanayake, S. Pallickara and G. Fox, "MapReduce for Data Intensive Scientific Analyses," in Proceeding ESCIENCE '08 Proceedings of the 2008 Fourth IEEE International Conference on eScience, 2008.
[11] Schlosser, S. Chen and S. W., "Map-reduce meets wider varieties of applications," 2008.
[12] J. Rosen, N. Polyzotis, V. Borkar, Y. Bu, M. J. Carey, M. Weimer, T. Condie and R. Ramakrishnan, "Iterative mapreduce for large scale machine learning".
[13] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun ACM, vol. 51, no. 1, p. 107–113, 2008.
[14] L. Lai, L. Qin, X. Lin and L. Chang, "Scalable Subgraph Enumeration in MapReduce," in Proceedings of the VLDB Endowment.
[15] B. R. Steven Hill, "An Iterative MapReduce Approach to Frequent Subgraph Mining in Biological Datasets," ACM-BCB'12, pp. 7-10, 2012.
[16] S. Aridhi, L. D'Orazio, M. Maddouri and E. Mephu, "A Novel MapReduce-based Approach for Distributed Frequent Subgraph Mining," RFIA, 2014.
[17] R.Gowri and R.Rathipriya, "Cohesive Sub-Network Mining in Protein Interaction Networks using Score based Co-Clustering with MapReduce Model (MR-CoC)," 2017.
[18] S. E. Schaeffer, "Graph clustering," Computer Science Review, pp. 27-64, 2007.
[19] H. S. M. Mosaddek, Z. Mahboob, R. Chowdhury, A. Sohel and S. Ray, "Protein Complex Detection in PPI Network by Identifying Mutually Exclusive Protein-protein Interactions," Procedia Computer Science, vol. 93, pp. 1054-1060, 2016.
[20] J. B. Pereira-Leal, A. J. Enright and C. A. Ouzounis, "Detection of Functional modules from protein Interaction Networks," PROTEINS: Structure, Function, and Bioinformatics, vol. 54, p. 49–57, 2004.
[21] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar and T. Müller, "Identifying Functional Modules in Protein-Protein interaction Networks: an integrated exact approach," ISMB, vol. 24, p. 223–231, 2008.
[22] Y. Zhang, E. Zeng, T. Li and G. Narasimhan, "Weighted Consensus Clustering for Identifying Functional Modules In Protein-Protein Interaction Networks".
[23] G. D. Bader and C. W. Hogue, "An automated method for finding molecular complexes in large protein interaction networks," BMC Bioinformatics, vol. 4, no. 2, 2003.
[24] Jyoti Rao and S. M. Ms., "Efficient Method for Finding Conserved Regions in Protein Interactions Network," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 7, pp. 756-761, 2013.
[25] G. A, K. O and N. R, "Topological properties of protein interaction networks from a structural perspective," Biochemical Society Transactions, pp. 1398-1403, 2008.
[26] R. Diestel, Graph Theory, 5 ed., Springer, 2016.
[27] Ray and S. Saha, "Subgraphs, Paths and Connected Graphs," in Graph Theory with Algorithms and its Applications, 2013, pp. 11-24.
[28] R. B. Bapat, Graphs and Matrices, Springer, Hindustan Book Agency, 2010.
[29] D. Szklarczyk, A. Franceschini, S. Wyder, K. Forslund, D. Heller, J. Huerta-Cepas, M. Simonovic, A. Roth, A. Santos, K. P. Tsafou, M. Kuhn, P. Bork, L. J. Jensen and C. v. Mering, "STRING v10: protein–protein interaction networks, integrated over the tree of life," Nucleic Acids Research, vol. 43, p. 447–452, 2015.
[30] D. Ako-Adjei, W. Fu, C. Wallin, K. S. Katz, G. Song, D. Darji, J. R. Brister, R. G. Ptak and K. D. Pruitt, "HIV-1, human interaction database: current status and new features," Nucleic Acids Research, vol. 43, pp. 566-570, 2015.
[31] A. Ruepp, B. Brauner, I. Dunger-Kaltenbach, G. Frishman, C. Montrone, M. Stransky, B. Waegele, T. Schmidt, O. Doudieu, V. Stümpflen and H. Mewes, "CORUM: the Comprehensive Resource of Mammalian Protein Complexes," Nucleic Acids Res, pp. 449-454, 2008.
[32] Emig-Agius, Dorothea, K. Olivieri, L. Pache, H. L. Shih and O. Pustovalova, "An Integrated Map of HIV-Human Protein Complexes that Facilitate Viral Infection," PLoS ONE, vol. 9, no. 5, 2014.
[33] J. S. J. Stefan Pinkert, "Protein Interaction Networks- More than mere Modules," PLoS Computational Biology, vol. 6, no. 1, 2010.
[34] M. S. a. S. Liang, "Predicting protein functions from redundancies in large-scale protein interaction networks," Proc. of the National Academy of Science, vol. 100, no. 22, p. 12579–12583, 2003.
[35] G. B. a. H. Hogue, "An automated method for finding molecular complexes in large protein-protein interaction networks," BMC Bioinformatics, vol. 4, no. 2, 2003.
[36] R. Gowri and R. Rathipriya, "A Study on Clustering the Protein Interaction Networks using Bio-Inspired Optimization," International Journal Computational Intelligence and Informatics, vol. 3, no. 2, pp. 89-95, 2013.
[37] R. Gowri and R. Rathipriya, "Extraction of Protein Sequence Motif Information using PSO K-Means," Journal of Network and Information Security, 2014.
[38] R. Gowri, S. Sivabalan and R. Rathipriya, "Biclustering using Venus Flytrap Optimization Algorithm," in Computational Intelligence in Data Mining, Proceedings of International Conference on CIDM, Advances in Intelligent Systems and Computing series, vol. 410, 2015, pp. 199- 207.
[39] H. Ke, P. Li, S. Guo and M. Guo, "On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications," IEEE Transactions on Parallel and Distributed Systems, 2015.
[40] R. Gowri and R.Rathipriya, "Protein motif comparator using PSO k-means," International Journal of Applied Metaheuristic Computing (IJAMC), vol. 7, no. 3, 2016.