gplexus: enabling genome-scale gene association network reconstruction and analysis … ·...

GPLEXUS enabling genome-scale gene associationnetwork reconstruction and analysis for verylarge-scale expression dataJun Li1 Hairong Wei2 Tingsong Liu1 and Patrick Xuechun Zhao1

1Plant Biology Division the Samuel Roberts Noble Foundation 2510 Sam Noble Parkway Ardmore OK 73401USA and 2School of Forest Resources and Environmental Science Michigan Technological University 1400Townsend Drive Houghton MI 49931 USA

Received March 15 2013 Revised September 30 2013 Accepted October 2 2013

ABSTRACT

The accurate construction and interpretation ofgene association networks (GANs) is challengingbut crucial to the understanding of gene functioninteraction and cellular behavior at the genomelevel Most current state-of-the-art computationalmethods for genome-wide GAN reconstructionrequire high-performance computational resourcesHowever even high-performance computing cannotfully address the complexity involved with con-structing GANs from very large-scale expressionprofile datasets especially for the organismswith medium to large size of genomes such asthose of most plant species Here we present anew approach GPLEXUS (httpplantgrnnobleorgGPLEXUS) which integrates a series of novelalgorithms in a parallel-computing environment toconstruct and analyze genome-wide GANsGPLEXUS adopts an ultra-fast estimation forpairwise mutual information computing that issimilar in accuracy and sensitivity to the Algorithmfor the Reconstruction of Accurate CellularNetworks (ARACNE) method and runs 1000 timesfaster GPLEXUS integrates Markov ClusteringAlgorithm to effectively identify functional subnet-works Furthermore GPLEXUS includes a novellsquocondition-removingrsquo method to identify the majorexperimental conditions in which each subnetworkoperates from very large-scale gene expressiondatasets across several experimental conditionswhich allows users to annotate the varioussubnetworks with experiment-specific conditionsWe demonstrate GPLEXUSrsquos capabilities byconstruing global GANs and analyzing subnetworksrelated to defense against biotic and abiotic stress

cell cycle growth and division in Arabidopsisthaliana

INTRODUCTION

The availability of terabyte- and petabyte-sized gene ex-pression datasets in public repositories (12) has inspiredscientists to use genome-wide reverse genetic approachesto reconstruct gene networks and decipher the interactionbetween genes Compared with forward genetics whichusually focuses on individual genes the reverse-engineer-ing that focuses on genome-wide transcriptome profilescan provide a holistic and comprehensive view of the inter-actions of an entire network which may lead to the dis-covery of novel regulatory relationships that underpineach biological process or trait (3ndash5) Although thegenome-wide reverse-engineering approach has many ad-vantages several computational challenges exist particu-larly for organisms with large genome datasets such asthose of plants The computational complexity grows ex-ponentially as the number of genes increases and anincrease in the number of genes in turn demands an ac-cordingly increased sample size to achieve the desiredaccuracy for network reconstructionAmong the available gene association network (GAN)

algorithms for the reverse engineering of large-scale genenetworks (6ndash9) the gene co-expression network method(6) is a commonly used method to infer potential geneticinteractions from gene expression datasets One problemthat is inherent in this co-expression network method is itshigh false-positive prediction rate which is due to its in-ability to distinguish direct gene interactions from largenumber of indirect interactions Other methods such asthe Bayesian Network (7) and Gaussian Graphics Model(GGM) (9) can infer the local network structure with highprecision (10) but cannot handle genome-wide networkconstruction due to the increased computational complex-ity that arises from the large number of gene variables (10)

To whom correspondence should be addressed Tel +1 580 224 6725 Fax +1 580 224 4743 Email pzhaonobleorg

Published online 30 October 2013 Nucleic Acids Research 2014 Vol 42 No 5 e32doi101093nargkt983

The Author(s) 2013 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby30) whichpermits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

and the large sample requirement (11) To date construct-ing GANs from very large-scale expression profiledatasets especially for the organisms with medium tolarge size of genomes such as those of most plantspecies is still a challenging task which necessitates thedevelopment of more effective computational solutionsSeveral methods have been developed to infer GANs

using mutual information (MI)-based approaches(1213) including the Algorithm for the Reconstructionof Accurate Cellular Networks (ARACNE) method (12)which can identify both linear and non-linear dependencerelationships from large samples Furthermore a largenumber of potential indirect genendashgene interactions canbe eliminated using the ARACNE method through theapplication of data processing inequality (DPI) which isbased on information theory (14) The ARACNE method(4) which is now widely used in GAN reconstruction(1516) achieves better performance than the BayesianNetwork method (7) However this method remains com-putationally demanding due to the high computationalcomplexity of both the MI estimation which usesGaussian kernel-based methods and the DPI processingwhich is applied to a large number of edges Most GANsthat are deciphered by this method have therefore adoptedcompromise strategies that either reduce the number ofgenes included in the networks or use a relatively smallnumber of gene expression profiles (417) Even with thesecompromises ARACNE requires gt100 h to calculate theMI of all gene pairs for the human B-cell expressiondataset (18) which only contains 10 000 genes across336 samples on a typical modern server with 32 CPUcores and 128-GB RAM For this reason fast MI estima-tion methods (1819) have been developed However thesemethods usually shorten the computational time at thecost of memory consumption which often reduces theability to process massive datasets Meanwhile the appli-cation of a DPI filter to remove potentially false inter-active edges is also a time-consuming process thatcannot be easily reduced Therefore the development ofultra-fast methods to enable the construction of genome-wide association networks particularly for plant specieswith large genomes is imperative for the field of genomicsAnother open question in genome-wide GAN analysis

in plants is how to effectively identify functional subnet-works that control specific cellular processes such as de-velopment or response to environmental cues Sessileplants are subjected to constant biotic and abiotic stres-sors and the study of the interaction of plants with theenvironment can lead to adaptation-ameliorated plantvarieties For example well-adapted plants have beenselected during crop domestication to meet the increasedneeds of human beings Evidence has shown that themajor evolutionary changes in plant adaptation (2021)can be linked to the modification of higher hierarchicalregulatory relationships (2223) Therefore the linkbetween genendashgene associated pairs within a subnetworkand the experimental condition under which the pair wasobserved is an important subject in modern plantgenomics and systems biologyHere we present a new integrated approach for

high-performance GAN construction and analysis from

large-scale data using an ultra-fast MI estimation thatuses a Spearman correlation-based transformation forpairwise MI computing and removes potentially falseinteraction edges on the basis of DPI We further reducethe computational time by implementing integrated GANconstruction and network analysis algorithms on acustom-built parallel-computing Linux cluster platformBioGrid which accelerates the total computationalprocess in almost linear proportion to the number ofCPU cores that are used We demonstrate GPLEXUSrsquoscapability by analyzing large-scale Arabidopsis thalianagene expression datasets that have been pooled frommultiple experimental conditions to first build a genome-wide GAN and then decompose this GAN into subnet-works Moreover we have developed a novel function toidentify major experimental conditions that contribute tothe MI of genendashgene interactions in the constructednetworks which allows users to link each genendashgene asso-ciation or subnetwork to a specific experimental conditionto learn under which condition these genendashgene associ-ations may operate

To promote and facilitate the use of this platform toperform GAN analyses for organisms with large genomesand a large number of genes we have provided a user-friendly online platform (httpplantgrnnobleorgGPLEXUS) that allows users to upload their expressiondatasets and perform GAN and gene set enrichmentanalysis To the best of our knowledge this is the firstweb-based platform that is able to construct and analyzegenome-scale GANs from massive genomic datasets

MATERIALS AND METHODS

Datasets used for method evaluation

Four compendium datasets were downloaded from publicdomains and compiled to evaluate the performance ofGPLEXUS and other methods (Table 1) The first threedatasets were downloaded from ArrayExpress (1) DatasetI comprises gene expression profiles of 313 microarrayhybridizations for Escherichia coli Dataset II comprisesgene expression profiles of 1848 microarray hybridizationsfor A thaliana Dataset III comprises 768 microarray hy-bridizations from multiple tissues for Glycine maxDataset IV comprises human glioblastoma gene expres-sion profiles which were downloaded from the TCGAData Portal (Level 1 Affymetrix HT Human GenomeU133 Array) (24) All of these datasets were normalizedusing the Robust Multiarray Averaging (RMA) method(25) and were used to evaluate and compare the perfor-mance of GPLEXUS with other methods All four datasetsare available for download on the GPLEXUS Web site(httpplantgrnnobleorgGPLEXUSdatasetjsp)

Fast MI estimation

The ARACNE method had successfully applied theproperties of MI and DPI which are based on informa-tion theory (14) to GAN reconstruction (4) Mountingevidence has also demonstrated the effectiveness ofARACNE to build valid and explainable gene networks(26ndash28) The network construction from microarrays of

e32 Nucleic Acids Research 2014 Vol 42 No 5 PAGE 2 OF 11

plant species which usually contain a very large numberof genes is often time-consuming and sometimes compu-tationally infeasible To reduce the computational com-plexity of the MI estimation we have developed a fastMI estimation method that is based on the Spearman cor-relation-based transformation formula for MI computingOne report (29) has proven that the Pearsonrsquos correlationand MI are related when the sample data follow a normaldistribution For this case the value of the MI can becalculated as

I XiXj

frac14

1

2log

iijjCj j

frac14

1

2log 1 2

eth1THORN

where C is the determinant of the covariance matrix isthe standard deviation of the gene expression dataset and is the Pearsonrsquos correlation The Spearman correlationcoefficient is a special case of the Pearsonrsquos correlation inwhich the data are converted to ranks before estimatingthe coefficient The MI computed by Spearman correl-ation-based transformation can capture both linear andnon-linear relationships Furthermore the use of theSpearman correlation-based transformation also enablesthe application of DPI to remove potential indirectgenetic relationships from the constructed network Boththe Spearman correlation-based transformation-based MIestimation method and the Gaussian kernel-based MIestimation method used by the original ARACENalgorithm have been integrated into GPLEXUS TheSpearman correlation-based transformation method ismuch faster than the ARACNE method the MI compu-tational complexity of GPLEXUS is Oethn2 mTHORN whereasthe MI computational complexity of ARACNE isOethn2 m2THORN where n is the number of microarray probe-setsgenes and m is the number of microarrayhybridizationssamples

Ultrafast MI computing and DPI processing viaparallel computing

We implemented the integrated algorithms with parallelprogramming techniques in an efficient C++ and Javacomputing languages and deployed the GPLEXUSanalysis pipelines on an in-house Linux cluster calledBioGrid which currently consists of gt700 CPU cores toachieve a high-performance computing capacity

When a user submits an analysis job through theGPLEXUS online web server the master node of theBioGrid system first transfers the datasets to slavecomputing nodes in the Linux cluster Next the masternode remotely calls to execute the analysis pipelines andmonitors the analysis progress in these computing nodesThe master node collects the analysis outputs when all ofthe distributed jobs have been completed This procedureis iterated twice to first complete the MI estimation andthen to remove indirect edges by DPI analysis The initialnetwork construction can be further refined by iterativelyre-running the analysis pipelines with more stringent par-ameters By default GPLEXUS estimates and chooses theMI of the 10th percentile of N-ordered values (arrangedfrom the largest to the smallest) as the default MI thresh-old and a P-value is assigned to this MI threshold byrandom shuffling of the expression of these genendashgenepairs across the various microarray profiles The compu-tational complexity of the Spearman correlation-basedtransformation and the DPI processing is furtherreduced through the parallel-computing implementationto O n2 m=k

and O n3=k

respectively where n is the

number of microarray probe-setsgenes m is the numberof microarray hybridizationssamples and k is the numberof CPU cores in the BioGrid system

A lsquocondition-removingrsquo approach to identify experiment-specific conditions for gene-gene associations

To infer the potential experimental conditions underwhich genendashgene interactionsregulations may occur wehave developed a lsquocondition-removingrsquo approach to inferthe experimental conditions of the microarray The prin-ciple of the approach supposes that if a regulated relation-ship occurs under a specific experimental condition thenthe MI value for the gene pair would be reduced if thisexperimental condition was removed from all of themicroarrays A larger decrease in the MI value indicatesa higher likelihood that the genendashgene pair is regulated orinteracts under this condition Therefore a series of MIvalues can be estimated for every genendashgene pair byremoving experimental conditions one-by-one Becauseall of the MI values for one gene pair would follow anormal distribution (in general the MI values could beestimated accurately from gt100 microarray chips) a

Table 1 Comparison of the performance of several integrated methods on four compendium datasets

Speciescell line Number of arrays Number of probe sets Runtime (minutes)

M1 M2 M3 M4 M5

Escherichia coli 313 (Dataset I) 15 552 12 640 1034 40 12 7Arabidopsis thaliana 1848 (Dataset II) 22 810 a a 5800 51 34Glycine max 738 (Dataset III) 66 190 a a 12 000 88 42Human glioblastoma 547 (Dataset IV) 22 277 30 640 12 260 180 18 12

M1 Original ARACNE method M2 Spearman-based MI estimation with integrated DPI analysis M3 Parallel implementation of the originalARACNE method deployed on our BioGrid system M4 Parallel implementation of the B-Spline-based MI estimation with integrated DPI analysisdeployed on our BioGrid system M5 Parallel implementation of the Spearman-based MI estimation with integrated DPI analysis deployed on ourBioGrid systemaComputationally infeasible

PAGE 3 OF 11 Nucleic Acids Research 2014 Vol 42 No 5 e32

one-side z-test could be applied to identify the experimen-tal conditions that lead to significant reductions in theMI value (default P-value 104) All of these identifiedconditions would cover the range of biological conditionsin which each possible genendashgene interaction exists Thecomputational complexity of this method is Oethn mTHORN Thisanalysis cannot be performed easily using a single serveron a dataset that contains 10 000 identified genendashgenelinks and gt1000 chips However the ultra-fastSpearman correlation-based MI estimation on theGPLEXUS high-performance computing platform canperform this analysis in less than a few minutes

Subnetwork discovery by Markov clustering analysis

In general biological networks are lsquoscale-freersquo (30)Therefore a network clustering method that simulates arandom walk such as the Markov Clustering Algorithm(MCL) (31) can identify actual or potential functionalsubnetworks with a high degree of accuracy Weintegrated this method into our GPLEXUS platform topartition the networks into functional subnetworks

GPLEXUS online

We provide a user-friendly version of GPLEXUScalled GPLEXUS Online (httpplantgrnnobleorgGPLEXUS) as a public and freely available web serverthat allows users to upload their expression datasetsperform GAN analysis and download the analysisresults The analysis flow of GPLEXUS Online isillustrated in Supplementary Figure S1

Uploading expression dataThe input data should consist of normalized expressiondata in a tab-delimited text format The first column ofthe data is the gene identifier such as the GeneChipProbe-set ID or gene name and the other columns arethe expression values The first row refers to the chipname so that GPLEXUS can identify chip names duringthe experimental condition identification analysis

MI estimation algorithm selectionTwo methods are provided in GPLEXUS to estimate theMI the Spearman correlation-based transformation MIestimation method (this is the default method) and theGaussian-kernel-based MI estimation method that wasused in the original ARACNE method

Iterative GAN construction and refinementWhen the data are submitted for analysis GPLEXUSautomatically performs a parameter estimation and con-structs lsquocoarsersquo networks with less stringent parametersThe MI threshold is set as a value that includes the top10 of genendashgene pairs and a DPI tolerance of 10 Thecoarse networks can be viewed as original relevancenetworks (13) that have no indirect edges removed Theuser may further adjust and filter the initial constructednetworks by applying more stringent parameter settingsie by adjusting both the MI threshold and DPI tolerancevalues A high-precision network can be reached via alsquocoarse-to-finersquo refining process as described

Functional subnetwork discoveryGPLEXUS automatically performs an MCL clusteringanalysis for functional subnetwork discovery after eachGAN network is constructed

OutputGPLEXUS summarizes the features of the constructedGANs such as the subnetwork structures the numberof hub genes and functional subnetworks and the hyper-links to constructed networks and subnetworks which canbe downloaded as delimited text files on the results pageThe downloaded networkssubnetworks can be importedinto the open-source Cytoscape (32) software for visual-ization and downstream analysis such as Gene OntologySet Enrichment Analysis The user can also perform theexperiment-specific condition identification analysis fromthe results page

Auxiliary tools

Auxiliary tools such as the Gene Ontology SetEnrichment Analysis tool and the RMA-based microarraydata normalization tool have also been integrated intoGPLEXUS to facilitate the use and annotation of con-structed networks and subnetworks

RESULTS

Comparisons between GPLEXUS and ARACNE

We compared the runtimes between the Spearman correl-ation-based transformation methods and ARACNE aswell as the B-spline-based MI estimation method (8)using the four datasets outlined in Table 1 Our resultsshow that the Spearman correlation-based transformationmethod that is implemented in GPLEXUS has a signifi-cantly reduced runtime compared with the originalARACNE method and B-spline-based MI estimationmethod It was computationally infeasible to constructglobal GANs using large-scale genomic datasets fromplant species with small genomes such as A thalianaand G max with the original ARACNE method on atypical server (DELL PowerEdge R815 Server equippedwith four 8-core CPUs and 128-GB RAM) without anyoptimization However the Spearman correlation-basedtransformation method could compute a GAN analysison these datasets gt1000 times faster as shown in Table 1

Our next step was to evaluate the prediction accuracy forour methods against other methods There were no avail-able experimental expression datasets with known inter-actions in the constructed network that could be appliedto test the methods Therefore we generated a series ofsynthetic gene expression datasets using the gene networkmodeling software SynTReN (33) The generated expres-sion datasets were based on experimentally validated geneinteractions through the SynTRen software The validatedgene interactions for the yeast dataset have already beenintegrated into the SynTRen software A thaliana inter-action data were downloaded from the Arabidopsis GeneRegulatory Information Server (34) and added to theSynTRen database Different synthetic gene expressiondatasets with 200 400 600 800 and 1000 samples for 400


yeast genes and 600 A thaliana genes respectively weregenerated We evaluated the prediction accuracy of fivemethods including the original ARACNE Spearman cor-relation-based transformation MIDPI B-Spline MIDPIco-expression network based on the Spearman correlationand GeneNet (35) on these benchmark datasets The pre-diction accuracy was measured by the area under thereceiver operating characteristic (ROC) curve (AUROC)We calculated the true positives (TP) true negatives(TN) false positives (FP) and false negatives (FN) foreach method for each dataset The ROC curve wasplotted as the sensitivity [TP=ethTP+FNTHORN] versus the 1-specificity [TN=ethTN+FPTHORN] A high AUC score indicatesbetter performance Our results suggest that the averageAUROC score is similar across methods for the twomodel species (Table 2) The Spearman-based MIDPIachieved the highest score among all of the methods thatwere tested We also applied an average F-score[2 Pr ecision Recall=ethPr ecision+RecallTHORN] across allsamples to measure the accuracy among differentmethods by plotting the (Precision versus Recall) PRcurve as Precision TP=ethTP+FPTHORN versus Recall[TP=ethTP+FNTHORN] The average F-score of the Spearman-based MIDPI method is higher than the average F-scores of the original ARACNE method the B-SplineMIDPI method and GeneNet (Table 2) The average F-score of the SpearmanMIDPI was also significantly higherthan the average F-score of the co-expression network TheROC curves and PR curves for the predictions of eachmethod for the 1000-sample expression dataset are shownin Supplementary Figures S2 and S3 respectively TheROC and PR curves both indicate that the method usedin GPLEXUS achieves an accuracy at least as high as theMI approximation methods that use a Gaussian-kernel-based method (ARACNE) or B-spline-based method andall three methods achieve a better accuracy than the co-expression network method or GeneNet

From these results we concluded that the Spearman-based MIDPI method can achieve a similar or betterperformance as the original ARACNE method andsignificantly accelerate the analysis runtime ThereforeGPLEXUS enables and empowers GAN analysisfor very large-scale gene expression datasets from organ-isms with large genomes particularly from plantsFurthermore this acceleration in runtime without apenalty on performance enables the powerful downstreamanalysis of experiment-specific condition identification

Construction and analysis of A thaliana global GANswith GPLEXUS

We demonstrated the ability of GPLEXUS to decipherglobal GANs from 1848 A thaliana microarray samples(Dataset II) that were measured by the Affymetrix ATH1GeneChip which consists of probe-sets for 22 810 genesThe network is available for download at httpplantgrnnobleorgGPLEXUSResultjspsessionid=ArabidopsisThe global gene network includes 10 174 nodes and161 712 edges and was constructed in lt20min usingGPLEXUS with a DPI tolerance of 0155 and MI thresh-old of 05281 Each node in the network has an average of

16 links The constructed network could be partitionedinto 154 functional subnetworksMany studies have indicated that the node degree dis-

tribution in biological networks follows a power-law dis-tribution Such networks are also known as scale-freenetworks (36) In a scale-free gene network a few genesare highly connected with other genes but the majority ofgenes have a low number of connected genes The networkthat was constructed in GPLEXUS was a typical scale-free network (Figure 1) that follows the properties ofnatural biological networks most of the nodes have afew links and a few nodes have a large number of linksThe degree distribution of the network followed a power-law distribution with a degree exponent of r=167In a previous report (9) a global A thaliana GAN with

6760 nodes and 18 625 edges was constructed using theGGM method We focused our comparison on our con-structed network with this reported study (9) to under-stand the properties of global genendashgene interactionnetworks rather than perform an exhaustive comparisonbetween the networks because the samples used to con-struct each network were significantly different Howeverthe networks that were derived from the GGM-basedmethod did not exhibit the typical scale-free networkproperty (30) [eg Figure 3 in (9)]

The identification of functional subnetworks involvedin diverse biotic and abiotic stress and their operatingconditions in GPLEXUS

Sessile plants have evolved highly sophisticated mechan-isms to survive various harsh environmental conditionsincluding life-threatening pathogens and pests Forexample Botrytis cinerea is considered the most aggressiveplant fungus and causes diseases in many plant species dueto its broad host range and exceptionally strong patho-genicity (3738) The WRKY gene family encodes tran-scription factors that regulate the responses of plants topathogen attacks and various abiotic stressors (3839)However little is known about the functionally associatedand coordinated genes that are targets and upstream regu-lators of these transcription factorsWRKY33 (AT2G38470) plays a critical role in the

plantrsquos response to pathogen invasion and abiotic stressand is known to regulate the antagonistic relationshipsbetween the defense pathways that mediate the responseto pathogens (40) A total of 41 genes were associated withthe WRKY33 gene in the A thaliana global GAN that weconstructed (Figure 2) Of these 41 genes 9 belong to

Table 2 The average performance of several integrated methods on

yeast and A thaliana benchmark datasets

Method AUROC Average F-score

Yeast Arabidopsis Yeast Arabidopsis

Spearman-based MIDPI 0896 0856 0512 0512ARACNE 089 0831 0443 0508B-Spline-based MIDPI 0879 0834 0438 0509Co-expression 0872 0791 0124 010GeneNet 0875 0813 0388 0477


functionally unknown genes and the remaining genes canbe classified into five categories defense pathway ethylene(ET) pathway jasmonate (JA) signaling pathway abscisicacid signaling pathway and calcium ion binding Most ofthese genes are involved in the defense of either biotic orabiotic stress Plants under a pathogen attack oftenproduce an increase in ethylene which plays an importantrole in plant immunity (41) ACS6 (AT4G11280) which isknown to control the rate-liming step in ethylene biosyn-thesis is involved in B cinerea-induced ethylene biosyn-thesis and WRKY acts on the pathway that induces ACS6expression (4243) ERF5 (AT5G47230) plays a positiverole in the JAET-mediated defense against B cinerea inA thaliana MKS1 (AT3G18690) which encodes a nuclearsubstrate is essential for basal immunity and participatesin the regulation of WRKY33 via MAP kinase 4 (MPK4)(4445) WRKY33 in turn controls the production of anti-microbial phytoalexins (46) WRKY33 can bind toupstream sequences of genes that are involved in defensepathways that include JA signaling and ET signaling aswell as camalexin biosynthesis (404748) For examplethe direct binding of WRKY33 to a JA signaling gene(AT3G10930) was demonstrated in chromatinimmunoprecipitation (Chip)-polymerase chain reactionexperiments (47)To identify the environmental condition(s) under which

WRKY33 may interact with the 41 genes that wereidentified in the constructed network we uploaded the41 interactions to GPLEXUS and performed an experi-ment-specific condition analysis The top 10 experimentalconditions that were related to each genendashgene interactionwere identified with Plt 106 We checked the specific ex-perimental conditions for some of the interactions andalmost all of the identified experimental conditions were

correlated with the reported genendashgene associations Forexample WRKY33 may induce the expression of ACS6under different chemical treatments The analysis resultssuggest that ACS6 may interact with WRKY33 under saltstress (ArrayExpress accession number E-GEOD-5623chip names GSM131321CEL GSM131318CELGSM131322CEL GSM131325CEL GSM131329CELGSM131330CEL and GSM131317CEL) cadmium treat-ment (ArrayExpress accession number E-GEOD-22114chip names GSM549993CEL and GSM549996cel) andchitin treatment (ArrayExpress accession number E-GEOD-2169 chip name GSM39204CEL) Almost all ofthe 41 interactions (37 of 41) are related to experimentsthat involve salt stress (ArrayExpress accession number E-GEOD-5623 chip name GSM131321CEL) Thereforethese interactions suggest that the WRKY gene familyplays an important role in biotic and abiotic stress Areview of the literature as well as of a public database(TAIR) confirms that WRKY33 is involved in salt stress(49) and the regulation of defense pathways to variouspathogens (5051) The identified experimental conditionsfor all 41 interactions are described in the lsquoSupplementaryMaterialsrsquo This analysis confirmed that GPLEXUS canaccurately capture the overall genendashgene association infor-mation and identify experiment-specific conditions from alarge-scale microarray dataset

The identification of functional subnetworks involved incell growth and division and their operating condition inGPLEXUS

Cell division is one of the most important and conservedbiological processes related to the growth andor develop-ment of an organism By applying the Gene Ontology SetEnrichment Analysis to each module in the example

Figure 1 The topological properties of the constructed Arabidopsis thaliana gene networks (A) A plot of the node degree distribution (B) A plot ofthe node degree distribution after a log-transformation


network construction we identified that the 22nd modulesubnetwork is implicitly associated with cell division Inparticular this module is associated with the function ofthe G2M transition of the mitotic cell cycle microtubule-based movement process This module is shown inSupplementary Figure S4a and is similar to the pathwaysuggested in (52) that is shown in Supplementary FigureS4b Most of the genes in this module were annotated withfunctions related to mitosis (34 genes) and cell division (56genes) as well as microtubule cytoskeleton organizationcytoskeleton organization organelle organization andchromosome organization We identified an enrichedmitosis-specific gene family that contains five mitosis-specific kinesin genes (AT3G20150 AT1G72250AT2G28620 AT2G22610 and AT3G44050) and twomicrogametogenesis genes Kinesin-12A and Kinesin-12B(AT4G14150 and AT3G23670 respectively) based onthe Gene Ontology Set Enrichment Analysis auxiliarytool in GPLEXUS Kinesins are a class of microtubule-associated proteins that possess a motor domain for

binding to microtubules and allow movement alongmicrotubules (53) In addition this gene family isinvolved in spindle formation and chromosomemovement (54)The other gene family that was present in this module

contained cyclins which are assembled with cyclin-de-pendent kinases (CDKs) in the same complex to triggerthe G2M transition through phosphorylation (55)CCS52 a cell-cycle switch protein works with theanaphase-promoting complex to initiate the destructionof cyclin subunits which leads to a cell exit frommitosis The link between the cyclin-dependent proteinkinase regulators CYCA11 (AT1G44110) and CYCB11(AT4G37490) was reported in (56) The link between theendocycle activator CCS52B (AT5G13840) (57) and theanaphase-promoting complex (APCC) activating subunitCDC20 (AT4G33260) gene was reported in (58) and theinteraction between CCS52B and CDKB22 (AT1G20930)was reported in (59) Another cyclin gene that was foundin the 22nd subnetwork CYCD31 (AT4G34160) is

Figure 2 A visualization of the genes that were predicted to interact directly with the WRKY33 gene Nodes with five-star labels are genes validatedby literature


involved in the switch from cell proliferation to the finalstage of differentiation which is a key regulatory point inthe cell cycle of plants The overexpression of CYCD31increases the length of the G2-phase and delays the acti-vation of mitotic genes (60) CDC25 encodes an enzymewith both arsenate reductase and phosphatase activities(52) Evidence suggests that CDC25 together withWEE1 controls the cyclin-dependent kinases that arethe key regulators of cell cycle progression via phosphor-ylation (61)We also performed an analysis to identify the experi-

ment-specific conditions for all of the interactions in the22nd module (781 genendashgene interactions) More than 200genendashgene interactions in this module were associated withthe developmental series (ie shoots and stems) experi-ments (ArrayExpress accession number E-GEOD-5633)based on the GPLEXUS analysis results Among the781 genendashgene interactions 346 interactions were relatedto the experimental condition with chip nameGSM131653CEL which is an experiment that focuseson the floral transition of the shoot apex before boltingThese experimental conditions are consistent with thefunctional annotations of this module that are related tothe G2M transition of the mitotic cell cycle

Performance comparisons between the networksconstructed by GPLEXUS and ARACNE

We have demonstrated that GPLEXUS achieves a similaror better accuracy than the ARACNE method using thebenchmark datasets that were generated by the SynTRensoftware To further explore the efficiency of GPLEXUScompared with ARACNE we constructed a network fromDataset II that included 10 176 genes and 169 186 edges

using the ARACNE method The constructed networkcan be downloaded from httpplantgrnnobleorgGPLEXUSResultjspsessionid=Arabidopsis_gaussian_new Even using our high-performance BioGrid systemwhich is used with 700 CPU cores this network construc-tion required gt5 days with the ARACNE method

We compared the overlap ratio of the networks thatwere constructed from ARACNE and GPLEXUS Thetwo networks shared 9764 genes and 117 664 edges Thestatistical tests demonstrated that the overlap ratio wasstatistically significant with a P-value of 1e-30 which isfar less than 005 The P-value was estimated as followswe first randomly generated 1 million networks with thesame edges and then a t-test was applied to estimatethe statistical significance of overlap ratio We furtherestimate the overlap ratio between paired networks byselecting different MI cutoff thresholds Higher MIthresholds indicate higher probability networks withsmaller number of nodes and edges The overlap ratiosof edges between paired networks under differentnetwork size are shown in Figure 3A The overlap ratioranged from 073 to 10 We then further compare thedistribution of MI for all overlapped edges of the pairednetworks inferred by Gaussian kernel-based method(adopted in ARACNE) and Spearman transformation-based method (adopted in GPLEXUS)which is shownin Figure 3B In this case these overlapped edges withhigher Gaussian kernel-based MI value in ARACNEmethod also have higher Spearman transformation-based MI value in GPLEXUS method So these analysesindicate that most of the genendashgene interactions that wereinferred by the ARACNE method could also be inferredby GPLEXUS

Figure 3 Comparisons of network edges recovered by the GPLEXUS and the ARACNE methods (A) The overlap ratio of edges of the pairednetworks (B) The distribution of the Gaussian-kernel-based MI values versus the Spearman transformation-based MI values for all overlappededges


We then compared the subnetworks that were identifiedby both the ARACNE method and GPLEXUS Thisanalysis revealed that of the top 150 subnetworks 99were identified by both the ARACNE method andGPLEXUS We further checked the functional subnet-works that were inferred by GPLEXUS including thefunctional subnetwork related to biotic and abiotic stressand the functional subnetwork involved in cell growth anddivision The subnetwork involved in the defense againstbiotic and abiotic stress was identified by both methodsGPLEXUS identified 41 genes that interacted withWRKY33 and ARACNE identified 47 genes that inter-acted with WRKY33 A total of 39 genes that wereidentified by GPLEXUS were also identified by theARACNE method however an experimentally validatedinteraction between WRKY33 and the At3g10930 gene(47) was not identified by the ARACNE method Thesubnetwork involved in cell cycle division and growththat was identified by GPLEXUS was not identified bythe ARACNE method

From these results we conclude that the performance ofboth GPLEXUS and the ARACNE method is similarThere exist highly overlapped genes and genendashgene inter-actions in the networks that were constructed by each ofthese methods However our analysis suggests thatGPLEXUS is slightly more sensitive than the ARACNEmethod

DISCUSSION

Tools and methods that can be used to process andanalyze large-scale genomic datasets are urgently neededas the amount of microarray and RNA-seq data that isavailable in public databases accumulates TheGPLEXUS Online platform is a tool for gene networkconstruction and analysis that significantly reduces thecomputational runtime compared with other methodssuch as ARACNE We demonstrated the ability ofGPLEXUS to construct and analyze genome-wideGANs from both simulated and experimental data witha focus on the model plant A thaliana We also integratedseveral auxiliary tools in GPLEXUS such as the GeneOntology Set Enrichment Analysis and RMA-basedmicroarray data normalization tools to facilitate the useand annotation of constructed networks and subnetworksOur future goals are to integrate more auxiliary toolssuch as network validation based on the KyotoEncyclopedia of Genes and Genomes (KEGG) Pathwayto further improve the power of the GPLEXUS platform

CONCLUSION

We have developed a high-performance web-basedplatform for GAN construction and analysis Thissystem is capable of processing thousands of microarrayor RNA-seq gene expression datasets from organisms withlarge genomes andor a large number of genes such asfrom plants through a combination of improvements inthe MI estimation method and the high-performancecomputing implementation and deployment The

GPLEXUS platform is as accurate and sensitive as theoriginal ARACNE method but produces results 1000times faster GPLEXUS also uses a condition-removingmethod to identify experiment-specific microarray chipsfrom large-scale microarray datasets to gain insightfulunderstanding of genendashgene associations FurthermoreGPLEXUS is able to identify new functional subnetworksthrough the integration of the MCL GPLEXUS Onlineprovides interactive user interfaces to facilitate large scalegene expression data analysis and network discovery

AVAILABILITY

GPLEXUS is public and freely available at httpplantgrnnobleorgGPLEXUS The presented caseanalysis input data and results are also freely availablefrom the Web siteThe source code for the Spearman-based MIDPI

method is freely available on the GPLEXUS web ser-ver (httpplantgrnnobleorgGPLEXUSdatasetjsp) Inaddition auxiliary tools including tools for microarraydata normalization and Gene Ontology Set EnrichmentAnalysis are available under the lsquoAuxiliary Toolsrsquo menuof the GPLEXUS web server

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

ACKNOWLEDGEMENTS

The authors thank Dr Xinbin Dai for his assistance in thedesign and deployment of the GPLEXUS system Theyappreciate the constructive comments and suggestionsfrom the anonymous reviewers on an earlier version ofthis manuscript

FUNDING

National Science Foundation [DBI 0960897 to PXZ]Samuel Roberts Noble Foundation Funding for openaccess charge National Science Foundation [DBI0960897 to PXZ] Samuel Roberts Noble Foundation

Conflict of interest statement None declared

REFERENCES

1 ParkinsonH SarkansU ShojatalabM AbeygunawardenaNContrinoS CoulsonR FarneA LaraGG HollowayEKapusheskyM et al (2005) ArrayExpressmdasha public repositoryfor microarray gene expression data at the EBI Nucleic AcidsRes 33 D553ndashD555

2 BarrettT TroupDB WilhiteSE LedouxP RudnevDEvangelistaC KimIF SobolevaA TomashevskyM andEdgarR (2007) NCBI GEO mining tens of millions ofexpression profilesmdashdatabase and tools update Nucleic AcidsRes 35 D760ndashD765

3 LiuM LiberzonA KongSW LaiWR ParkPJKohaneIS and KasifS (2007) Network-based analysis ofaffected biological processes in type 2 diabetes models PLoSGenet 3 e96


4 BassoK MargolinAA StolovitzkyG KleinU Dalla-FaveraR and CalifanoA (2005) Reverse engineering ofregulatory networks in human B cells Nat Genet 37 382ndash390

5 NieJ StewartR ZhangH ThomsonJA RuanF CuiX andWeiH (2011) TF-Cluster a pipeline for identifying functionallycoordinated transcription factors via network decomposition ofthe shared coexpression connectivity matrix (SCCM) BMC SystBiol 5 53

6 StuartJM SegalE KollerD and KimSK (2003) A gene-coexpression network for global discovery of conserved geneticmodules Science 302 249ndash255

7 FriedmanN (2004) Inferring cellular networks using probabilisticgraphical models Science 303 799ndash805

8 FaithJJ HayeteB ThadenJT MognoI WierzbowskiJCottarelG KasifS CollinsJJ and GardnerTS (2007) Large-scale mapping and validation of Escherichia coli transcriptionalregulation from a compendium of expression profiles PLoS Biol5 e8

9 MaS GongQ and BohnertHJ (2007) An Arabidopsis genenetwork based on the graphical Gaussian model Genome Res17 1614ndash1625

10 FriedmanN LinialM NachmanI and PersquoerD (2002) UsingBayesian Networks to Analyze Expression Data J Comput Biol7 601ndash620

11 SchaferJ and StrimmerK (2005) An empirical Bayes approachto inferring large-scale gene association networks Bioinformatics21 754ndash764

12 MargolinAA NemenmanI BassoK WigginsCStolovitzkyG Dalla FaveraR and CalifanoA (2006)ARACNE an algorithm for the reconstruction of gene regulatorynetworks in a mammalian cellular context BMC Bioinformatics7(Suppl 1) S7

13 ButteAJ and KohaneIS (2000) Mutual information relevancenetworks functional genomic clustering using pairwise entropymeasurements Pac Symp Biocomput 2000 418ndash429

14 CoverTM and ThomasJA (1991) Elements of InformationTheory Wiley New York

15 CarroMS LimWK AlvarezMJ BolloRJ ZhaoXSnyderEY SulmanEP AnneSL DoetschF ColmanHet al (2010) The transcriptional network for mesenchymaltransformation of brain tumours Nature 463 318ndash325

16 CaiL YuanMQ LiZJ ChenJC and ChenGQ (2011)Genetic engineering of Ketogulonigenium vulgare for enhancedproduction of 2-keto-l-gulonic acid J Biotechnol 157 320ndash325

17 TorkamaniA and SchorkNJ (2009) Identification of rare cancerdriver mutations by network reconstruction Genome Res 191570ndash1578

18 QiuP GentlesAJ and PlevritisSK (2009) Fast calculation ofpairwise mutual information for gene regulatory networkreconstruction Comput Methods Programs Biomed 94 177ndash180

19 SalesG and RomualdiC (2011) parmigenemdasha parallel Rpackage for mutual information estimation and gene networkreconstruction Bioinformatics 27 1876ndash1877

20 CrombachA and HogewegP (2008) Evolution of evolvability ingene regulatory networks PLoS Comput Biol 4 e1000112

21 VercruysseM FauvartM JansA BeullensS BraekenKClootsL EngelenK MarchalK and MichielsJ (2011) Stressresponse regulators identified through genome-wide transcriptomeanalysis of the (p)ppGpp-dependent response in Rhizobium etliGenome Biol 12 R17

22 PuruggananMD and FullerDQ (2009) The nature of selectionduring plant domestication Nature 457 843ndash848

23 BortiriE ChuckG VollbrechtE RochefordT MartienssenRand HakeS (2006) ramosa2 encodes a LATERAL ORGANBOUNDARY domain protein that determines the fate of stemcells in branch meristems of maize Plant Cell 18 574ndash585

24 Cancer Genome Atlas Research Network (2008) Comprehensivegenomic characterization defines human glioblastoma genes andcore pathways Nature 455 1061ndash1068

25 IrizarryRA BolstadBM CollinF CopeLM HobbsB andSpeedTP (2003) Summaries of Affymetrix GeneChip probe leveldata Nucleic Acids Res 31 e15

26 GregorettiF BelcastroV di BernardoD and OlivaG (2010)A parallel implementation of the network identification by

multiple regression (NIR) algorithm to reverse-engineer regulatorygene networks PloS One 5 e10179

27 CrombachA WottonKR Cicin-SainD AshyraliyevM andJaegerJ (2012) Efficient reverse-engineering of a developmentalgene regulatory network PLoS Computat Biol 8 e1002589

28 AllenJD XieY ChenM GirardL and XiaoG (2012)Comparing statistical methods for constructing large scale genenetworks PloS One 7 e29348

29 OlsenC MeyerPE and BontempiG (2009) On the impact ofentropy estimation on transcriptional regulatory network inferencebased on mutual information EURASIP J Bioinform Syst Biol2009 308959

30 BarabasiAL (2009) Scale-free networks a decade and beyondScience 325 412ndash413

31 EnrightAJ Van DongenS and OuzounisCA (2002) Anefficient algorithm for large-scale detection of protein familiesNucleic Acids Res 30 1575ndash1584

32 ShannonP MarkielA OzierO BaligaNS WangJTRamageD AminN SchwikowskiB and IdekerT (2003)Cytoscape a software environment for integrated models ofbiomolecular interaction networks Genome Res 13 2498ndash2504

33 Van den BulckeT Van LeemputK NaudtsB van RemortelPMaH VerschorenA De MoorB and MarchalK (2006)SynTReN a generator of synthetic gene expression data fordesign and analysis of structure learning algorithms BMCBioinformatics 7 43

34 YilmazA Mejia-GuerraMK KurzK LiangX WelchL andGrotewoldE (2011) AGRIS the Arabidopsis Gene RegulatoryInformation Server an update Nucleic Acids Res 39D1118ndashD1122

35 Opgen-RheinR and StrimmerK (2007) From correlation tocausation networks a simple approximate learning algorithm andits application to high-dimensional plant gene expression dataBMC Syst Biol 1 37

36 JeongH TomborB AlbertR OltvaiZN and BarabasiAL(2000) The large-scale organization of metabolic networksNature 407 651ndash654

37 DeanR Van KanJA PretoriusZA Hammond-KosackKEDi PietroA SpanuPD RuddJJ DickmanM KahmannREllisJ et al (2012) The Top 10 fungal pathogens in molecularplant pathology Mol Plant Pathol 13 414ndash430

38 AbuQamarS ChenX DhawanR BluhmB SalmeronJLamS DietrichRA and MengisteT (2006) Expression profilingand mutant analysis reveals complex regulatory networks involvedin Arabidopsis response to Botrytis infection Plant J 48 28ndash44

39 ChenL ZhangL and YuD (2010) Wounding-induced WRKY8is involved in basal defense in Arabidopsis Mol Plant MicrobeInteract 23 558ndash565

40 ZhengZ QamarSA ChenZ and MengisteT (2006)Arabidopsis WRKY33 transcription factor is required forresistance to necrotrophic fungal pathogens Plant J 48 592ndash605

41 BroekaertWF DelaureSL De BolleMF and CammueBP(2006) The role of ethylene in host-pathogen interactions AnnRev Phytopathol 44 393ndash416

42 LiG MengX WangR MaoG HanL LiuY and ZhangS(2012) Dual-level regulation of ACC synthase activity by MPK3MPK6 cascade and its downstream WRKY transcription factorduring ethylene induction in Arabidopsis PLoS Genet 8e1002767

43 HanL LiGJ YangKY MaoG WangR LiuY andZhangS (2010) Mitogen-activated protein kinase 3 and 6 regulateBotrytis cinerea-induced ethylene production in arabidopsis PlantJ 64 114ndash127

44 AndreassonE JenkinsT BrodersenP ThorgrimsenSPetersenNH ZhuS QiuJL MicheelsenP RocherAPetersenM et al (2005) The MAP kinase substrate MKS1 is aregulator of plant defense responses EMBO J 24 2579ndash2589

45 QiuJL FiilBK PetersenK NielsenHB BotangaCJThorgrimsenS PalmaK Suarez-RodriguezMC Sandbech-ClausenS LichotaJ et al (2008) Arabidopsis MAP kinase 4regulates gene expression through transcription factor release inthe nucleus EMBO J 27 2214ndash2221

46 PetersenK QiuJL LutjeJ FiilBK HansenS MundyJ andPetersenM (2010) Arabidopsis MKS1 is involved in basal


immunity and requires an intact N-terminal domain for properfunction PloS One 5 e14364

47 BirkenbihlRP DiezelC and SomssichIE (2012) ArabidopsisWRKY33 is a key transcriptional regulator of hormonal andmetabolic responses toward Botrytis cinerea infection PlantPhysiol 159 266ndash285

48 LiS FuQ ChenL HuangW and YuD (2011) Arabidopsisthaliana WRKY25 WRKY26 and WRKY33 coordinateinduction of plant thermotolerance Planta 233 1237ndash1252

49 GolldackD LukingI and YangO (2011) Plant tolerance todrought and salinity stress regulating transcription factors andtheir functional significance in the cellular transcriptionalnetwork Plant Cell Rep 30 1383ndash1391

50 LaiZ WangF ZhengZ FanB and ChenZ (2011) A criticalrole of autophagy in plant resistance to necrotrophic fungalpathogens Plant J 66 953ndash968

51 LaiZ LiY WangF ChengY FanB YuJQ and ChenZ(2011) Arabidopsis sigma factor binding proteins are activators ofthe WRKY33 transcription factor in plant defense Plant Cell 233824ndash3841

52 InzeD and De VeylderL (2006) Cell cycle regulation in plantdevelopment Annu Rev Genet 40 77ndash105

53 VanstraelenM InzeD and GeelenD (2006) Mitosis-specifickinesins in Arabidopsis Trends Plant Sci 11 167ndash175

54 OhSA AllenT KimGJ SidorovaA BorgM ParkSK andTwellD (2012) Arabidopsis Fused kinase and the Kinesin-12subfamily constitute a signalling module required forphragmoplast expansion Plant J 72 308ndash319


56 GutierrezC (2009) The Arabidopsis cell division cycleArabidopsis Book 7 e0120

57 de Almeida EnglerJ KyndtT VieiraP Van CappelleEBoudolfV SanchezV EscobarC De VeylderL EnglerGAbadP et al (2012) CCS52 and DEL1 genes are keycomponents of the endocycle in nematode-induced feeding sitesPlant J 72 185ndash198

58 KeveiZ BalobanM Da InesO TiriczH KrollARegulskiK MergaertP and KondorosiE (2011) ConservedCDC20 cell cycle functions are carried out by two of the fiveisoforms in Arabidopsis thaliana PLoS One 6 e20618

59 Van LeeneJ HollunderJ EeckhoutD PersiauG Van DeSlijkeE StalsH Van IsterdaelG VerkestA NeirynckSBuffelY et al (2010) Targeted interactomics reveals a complexcore cell cycle machinery in Arabidopsis thaliana Mol SystBiol 6 397

60 MengesM SamlandAK PlanchaisS and MurrayJA (2006)The D-type cyclin CYCD31 is limiting for the G1-to-S-phasetransition in Arabidopsis Plant Cell 18 893ndash906

61 SpadaforaND DoonanJH HerbertRJ BitontiMBWallaceE RogersHJ and FrancisD (2011) ArabidopsisT-DNA insertional lines for CDC25 are hypersensitive tohydroxyurea but not to zeocin or salt stress Ann Bot 1071183ndash1192


and the large sample requirement (11) To date construct-ing GANs from very large-scale expression profiledatasets especially for the organisms with medium tolarge size of genomes such as those of most plantspecies is still a challenging task which necessitates thedevelopment of more effective computational solutionsSeveral methods have been developed to infer GANs

using mutual information (MI)-based approaches(1213) including the Algorithm for the Reconstructionof Accurate Cellular Networks (ARACNE) method (12)which can identify both linear and non-linear dependencerelationships from large samples Furthermore a largenumber of potential indirect genendashgene interactions canbe eliminated using the ARACNE method through theapplication of data processing inequality (DPI) which isbased on information theory (14) The ARACNE method(4) which is now widely used in GAN reconstruction(1516) achieves better performance than the BayesianNetwork method (7) However this method remains com-putationally demanding due to the high computationalcomplexity of both the MI estimation which usesGaussian kernel-based methods and the DPI processingwhich is applied to a large number of edges Most GANsthat are deciphered by this method have therefore adoptedcompromise strategies that either reduce the number ofgenes included in the networks or use a relatively smallnumber of gene expression profiles (417) Even with thesecompromises ARACNE requires gt100 h to calculate theMI of all gene pairs for the human B-cell expressiondataset (18) which only contains 10 000 genes across336 samples on a typical modern server with 32 CPUcores and 128-GB RAM For this reason fast MI estima-tion methods (1819) have been developed However thesemethods usually shorten the computational time at thecost of memory consumption which often reduces theability to process massive datasets Meanwhile the appli-cation of a DPI filter to remove potentially false inter-active edges is also a time-consuming process thatcannot be easily reduced Therefore the development ofultra-fast methods to enable the construction of genome-wide association networks particularly for plant specieswith large genomes is imperative for the field of genomicsAnother open question in genome-wide GAN analysis

in plants is how to effectively identify functional subnet-works that control specific cellular processes such as de-velopment or response to environmental cues Sessileplants are subjected to constant biotic and abiotic stres-sors and the study of the interaction of plants with theenvironment can lead to adaptation-ameliorated plantvarieties For example well-adapted plants have beenselected during crop domestication to meet the increasedneeds of human beings Evidence has shown that themajor evolutionary changes in plant adaptation (2021)can be linked to the modification of higher hierarchicalregulatory relationships (2223) Therefore the linkbetween genendashgene associated pairs within a subnetworkand the experimental condition under which the pair wasobserved is an important subject in modern plantgenomics and systems biologyHere we present a new integrated approach for

high-performance GAN construction and analysis from

large-scale data using an ultra-fast MI estimation thatuses a Spearman correlation-based transformation forpairwise MI computing and removes potentially falseinteraction edges on the basis of DPI We further reducethe computational time by implementing integrated GANconstruction and network analysis algorithms on acustom-built parallel-computing Linux cluster platformBioGrid which accelerates the total computationalprocess in almost linear proportion to the number ofCPU cores that are used We demonstrate GPLEXUSrsquoscapability by analyzing large-scale Arabidopsis thalianagene expression datasets that have been pooled frommultiple experimental conditions to first build a genome-wide GAN and then decompose this GAN into subnet-works Moreover we have developed a novel function toidentify major experimental conditions that contribute tothe MI of genendashgene interactions in the constructednetworks which allows users to link each genendashgene asso-ciation or subnetwork to a specific experimental conditionto learn under which condition these genendashgene associ-ations may operate

To promote and facilitate the use of this platform toperform GAN analyses for organisms with large genomesand a large number of genes we have provided a user-friendly online platform (httpplantgrnnobleorgGPLEXUS) that allows users to upload their expressiondatasets and perform GAN and gene set enrichmentanalysis To the best of our knowledge this is the firstweb-based platform that is able to construct and analyzegenome-scale GANs from massive genomic datasets

MATERIALS AND METHODS

Datasets used for method evaluation

Four compendium datasets were downloaded from publicdomains and compiled to evaluate the performance ofGPLEXUS and other methods (Table 1) The first threedatasets were downloaded from ArrayExpress (1) DatasetI comprises gene expression profiles of 313 microarrayhybridizations for Escherichia coli Dataset II comprisesgene expression profiles of 1848 microarray hybridizationsfor A thaliana Dataset III comprises 768 microarray hy-bridizations from multiple tissues for Glycine maxDataset IV comprises human glioblastoma gene expres-sion profiles which were downloaded from the TCGAData Portal (Level 1 Affymetrix HT Human GenomeU133 Array) (24) All of these datasets were normalizedusing the Robust Multiarray Averaging (RMA) method(25) and were used to evaluate and compare the perfor-mance of GPLEXUS with other methods All four datasetsare available for download on the GPLEXUS Web site(httpplantgrnnobleorgGPLEXUSdatasetjsp)

Fast MI estimation

The ARACNE method had successfully applied theproperties of MI and DPI which are based on informa-tion theory (14) to GAN reconstruction (4) Mountingevidence has also demonstrated the effectiveness ofARACNE to build valid and explainable gene networks(26ndash28) The network construction from microarrays of



I XiXj

frac14

1

2log

iijjCj j

frac14

1

2log 1 2

eth1THORN





and O n3=k







M1 M2 M3 M4 M5







GPLEXUS online







Auxiliary tools


RESULTS











































DISCUSSION


CONCLUSION



AVAILABILITY



SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



REFERENCES




































































I XiXj

frac14

1

2log

iijjCj j

frac14

1

2log 1 2

eth1THORN





and O n3=k







M1 M2 M3 M4 M5







GPLEXUS online







Auxiliary tools


RESULTS











































DISCUSSION


CONCLUSION



AVAILABILITY



SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



REFERENCES






































































GPLEXUS online







Auxiliary tools


RESULTS











































DISCUSSION


CONCLUSION



AVAILABILITY



SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



REFERENCES









































































































DISCUSSION


CONCLUSION



AVAILABILITY



SUPPLEMENTARY DATA


ACKNOWLEDGEMENTS


FUNDING



REFERENCES



































































gplexus: enabling genome-scale gene association network reconstruction and analysis … ·...

Documents