BioSystems 105 (2011) 109–121

Contents lists available at ScienceDirect

BioSystems

journal homepage: www.elsevier.com/locate/biosystems

Prediction of metabolic pathways from genome-scale metabolic networks

Karoline Faust a,*, Didier Croes b, Jacques van Helden b

a Research Group of Bioinformatics and (Eco-)Systems Biology (BSB), VIB – Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium

b Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles, Boulevard du Triomphe, B-1050 Brussels, Belgium

Article info

Article history:
Received 22 December 2010
Received in revised form 23 March 2011
Accepted 5 May 2011

Keywords:
Metabolic pathway definition
Metabolic pathway prediction
Metabolic network representation
Subgraph extraction

Abstract

The analysis of a variety of data sets (transcriptome arrays, phylogenetic profiles, etc.) yields groups of functionally related genes. In order to determine their biological function, associated gene groups are often projected onto known pathways or tested for enrichment of known functions. However, these approaches are not flexible enough to deal with variations or novel pathways. During the last decade, we developed and refined an approach that predicts metabolic pathways from a global metabolic network encompassing all known reactions and their substrates/products, by extracting a subgraph that best connects a set of seed nodes (compounds, reactions, enzymes or enzyme-coding genes). In this review, we summarize this work, while discussing the problems and pitfalls but also the advantages and applications of network-based metabolic pathway prediction.

© 2011 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

1.1. Interpretation of functionally associated genes

A variety of experimental and in silico approaches identify groups of genes likely to participate in a common biological process: transcriptome arrays, prediction of operons and regulons, groups of synteny, gene fusions, phylogenetic profiles, etc. Researchers are then faced with the challenge of understanding the biological function of a group of associated genes. Usually, this task is tackled by gene set enrichment analysis (GSEA) (e.g. Subramanian et al., 2005; Backes et al., 2007), which detects overrepresented functional categories in the gene set of interest, or by pathway projection (also known as pathway mapping) (e.g. Dahlquist et al., 2002; Paley and Karp, 2006; Adler et al., 2008), which maps genes of interest to a set of known pathways.

Our group spent a decade developing an alternative approach to the interpretation of associated enzyme-coding genes. This review summarizes this approach and discusses its drawbacks as well as its benefits.

1.2. Pathway-based interpretation of functionally associated genes

Both GSEA and pathway mapping approaches rely on pre-defined functional groups (e.g. Gene Ontology classes) or pathways (e.g. KEGG reference maps) and are therefore unable to deal with variants or combinations of the reference pathways or to detect novel pathways.

* Corresponding author. E-mail addresses: [email protected] (K. Faust), [email protected] (D. Croes), [email protected] (J. van Helden).

GSEA returns a list of functional classes showing a significant intersection with the query gene set. Such associations provide a first clue to the possible functions of the gene set of interest, but processes are considered as “bags” of genes, without any indication about internal relationships and interconnections of their individual activities.

Pathway mapping provides a more interpretable result, by highlighting the relationships between the mapped enzymes, but fails to identify “transversal” pathways, i.e. pathways that combine fragments of several reference pathways (e.g. in yeast, methionine biosynthesis combines the synthesis of the carbon skeleton from aspartate and the incorporation of sulfide resulting from the sulfur assimilation pathway). Biological processes are highly interconnected and are therefore better described as a network rather than as a set of separated pathways. Taking a network perspective can overcome the restrictions of GSEA and pathway projection.

For these reasons, ab initio pathway discovery methods are required to annotate the metabolism of the thousands of organisms for which a fully sequenced genome is available but whose pathways have never been characterized experimentally.

In addition, ab initio pathway discovery can be employed to integrate data from high-throughput experiments with knowledge on biological networks. The scores from these experiments are converted into network weights, which allow favoring pathways that incorporate a maximal number of high-scoring (i.e. highly differentially expressed) genes. We will discuss this application of pathway discovery in more detail in Section 10.3.

0303-2647/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2011.05.004

1.3. Network-based interpretation of functionally associated genes

Zien et al. (2000) were among the first to analyze co-expressed genes by extracting pathways from a network instead of mapping the genes to pre-defined pathways. They first constructed a metabolic network (reactions and compounds) collected from various databases, and enumerated all possible pathways between glucose and pyruvate (two-end path finding). These pathways were subsequently scored according to the expression ratios of their associated genes. The expression data (DeRisi et al., 1997) monitored the difference in gene expression between a start time (reference time point) and several time points during the transition of yeast cells from fermentation to respiration. The canonical glycolysis pathway ranked at the top of the list of expression-based scored pathways. This two-end path finding approach, however, requires the start and the end of the pathway as input, information that is usually missing when dealing with groups of functionally related genes.

In 2000, one of us proposed an approach based on subgraph extraction, to assemble metabolic pathways from a set of “seed” reactions (van Helden et al., 2000). As a proof of concept, this approach was applied to a set of 20 genes co-expressed during the yeast cell cycle, resulting in a linear pathway from sulfate to L-methionine that combined the annotated pathways for methionine biosynthesis and sulfur assimilation (van Helden et al., 2000).

Fig. 1 summarizes this procedure. Starting from a set of co-expressed genes, a first step is to identify enzyme-coding genes and the reactions associated with them. In some cases, these seed reactions are directly connected by their substrates and products, so that they can be immediately interlinked to form a pathway. However, the main idea of the approach is to fill gaps between the seed reactions by extracting intermediate compounds and reactions from a metabolic network.

In the following, we will present various metabolic pathway definitions and introduce a more functionally oriented definition of a metabolic pathway (Sections 2–4). Next, we will enumerate different ways to represent metabolic networks as graphs and highlight the problems of graph traversal in metabolic networks (Section 5). The main problem is the presence of highly connected (hub) compounds, which leads to biochemically infeasible pathways when traversing the graph naively. The impact of this problem on the measurement of network properties such as the average path length (“small-worldness”) will be discussed in Section 7. We will expound our solutions to the hub compound problem in the context of two-end path finding (between two compounds or reactions) in Section 6 and of multiple-end pathway prediction (between a set of compounds or reactions) in Section 8. In addition, we will demonstrate multiple-end pathway prediction on an example (Section 9). Finally, we will discuss the strengths and weaknesses of topology-based pathway prediction and its applications in microarray analysis and metabolic network reconstruction.

2. From networks to pathways

A major weakness of GSEA and pathway projection is their reliance on fixed pathway boundaries. In many cases, textbooks and databases agree on the boundary of a pathway (e.g. glycolysis starts with glucose and ends with pyruvate), but there are also some “fuzzy” pathways with less well defined boundaries (for instance, KEGG module M00176 sulfur reduction starts from sulfate (Kanehisa et al., 2008), whereas the MetaCyc sulfur reduction pathways all start from elemental sulfur (Caspi et al., 2008)). In addition, pathway databases deal differently with pathway variants (for example, KEGG maps merge all known variants of a pathway, whereas MetaCyc stores them as separate entities). Thus, pathway projection results depend on the specific reference pathway source selected (KEGG maps, KEGG modules, SEED subsystems (Overbeek et al., 2004), MetaCyc pathways, etc.).

One may define pathway boundaries by partitioning the metabolic network with different algorithms (Gagneur et al., 2003; Gerlee et al., 2009; Guimerà and Amaral, 2005). However, the performance of these algorithms is assessed by comparing their output to a reference pathway set. Thus, metabolic network partitioning does not solve the underlying problem: What is a metabolic pathway? What distinguishes current reference pathways from random reaction sequences? This question is not easily answered, as the variety of existing metabolic pathway definitions underlines (e.g. Lacroix et al., 2008).

3. What is a metabolic pathway?

Since the metabolic pathway definition of choice determines the prediction procedure, we will briefly summarize some current definitions.

In the words of the classical biochemistry textbook Nelson and Cox (2005), a metabolic pathway is a “sequence of enzyme-catalyzed reactions by which a living organism transforms an initial source compound into a final target compound”. This definition has some shortcomings: (1) it excludes branched pathways (e.g. aromatic amino acid biosynthesis) and cycles (e.g. TCA cycle), (2) it does not take into account spontaneous reactions, i.e. reactions that occur without being catalyzed by an enzyme (e.g. the conversion from L-glutamate gamma-semialdehyde into water and (S)-1-pyrroline-5-carboxylate).

A generic definition that also covers branched and cyclic pathways considers a pathway to be a sub-network of the metabolic network (e.g. Forst and Schulten, 1999). Since this definition can be expressed entirely in a graph theoretical form, without the need of additional concepts, it could also be termed the topological definition of metabolic pathways. However, this definition suffers from a severe problem: it does not differentiate between biochemically valid and invalid pathways. To exemplify this problem, Fig. 2 shows a biochemically irrelevant pathway that perfectly conforms to the topological definition.

Since the topological definition does not distinguish between biochemically relevant and irrelevant pathways, alternatives have been proposed in the literature, which reduce the number of permissible pathways by imposing constraints.

Fig. 1. Flow chart of pathway prediction by subgraph extraction. The approach takes two inputs: (1) a set of seed reactions, which are obtained from a set of associated enzyme-coding genes and (2) a metabolic network constructed from a metabolic database, such as KEGG (Kanehisa et al., 2008) or MetaCyc (Caspi et al., 2008). A metabolic pathway is computed by connecting the seed reactions (in blue) with intermediate reactions and compounds (in yellow) from the metabolic network. Image sources: the global KEGG map image and KEGG symbol were obtained from the KEGG database homepage (Kanehisa et al., 2008), the MetaCyc symbol from the MetaCyc home page (Caspi et al., 2008). The operon image was taken from RegulonDB (Gama-Castro et al., 2008) and the genome image from the Comprehensive Microbial Resources (Davidsen et al., 2010).

Schuster et al. (1999) introduced the elementary modes, which are defined as a “minimal set of enzymes that could operate at steady state with all irreversible reactions proceeding in the appropriate direction”. To qualify as an elementary mode, a pathway has to stoichiometrically balance all its compounds and in addition satisfy a non-decomposability constraint, which roughly says that it cannot be further divided into stoichiometrically balanced sub-pathways. How elementary modes can be applied to discover novel pathways is briefly reviewed in Schuster et al. (2010). Interestingly, the elementary modes correspond to the minimal T-invariants known from Petri net theory (Voss et al., 2003). A related concept is that of the extreme pathways, introduced by Schilling et al. (2000). In elementary mode analysis as well as in flux balance analysis, the metabolic system is assumed to be at steady state. Thus, the left-hand side of the equation dx/dt = S · v, where x is the vector of compound concentrations, S is the stoichiometric matrix and v is the flux vector, can be set to zero. The set of solutions satisfying this simplified equation plus a number of additional constraints (e.g. a non-negativity constraint) forms the flux cone. The extreme pathways are the edges of the flux cone, that is, they are the basis from which all other fluxes can be obtained by linear combination (Schilling et al., 2000). The concept of extreme currents is closely related to that of extreme pathways (Clarke, 1988).
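As a hedged illustration of the steady-state condition S · v = 0 with non-negative fluxes, the following minimal Python sketch tests whether a candidate flux vector leaves every internal compound balanced. The three-reaction toy network, the function names `net_production` and `is_steady_state`, and the flux values are assumptions made for illustration only; they are not taken from the cited works.

```python
# Toy linear network (hypothetical): R1: -> A, R2: A -> B, R3: B ->
# Rows are internal compounds (A, B); columns are reactions (R1, R2, R3).
S = [
    [1, -1, 0],   # A is produced by R1 and consumed by R2
    [0, 1, -1],   # B is produced by R2 and consumed by R3
]

def net_production(S, v):
    """Return S . v, i.e. dx/dt for every internal compound."""
    return [sum(s * f for s, f in zip(row, v)) for row in S]

def is_steady_state(S, v):
    """True if all fluxes are non-negative and no compound accumulates."""
    return all(f >= 0 for f in v) and all(x == 0 for x in net_production(S, v))

balanced = [2, 2, 2]     # same flux through every step
unbalanced = [2, 1, 1]   # compound A accumulates

print(is_steady_state(S, balanced))
print(is_steady_state(S, unbalanced))
```

In this sketch, a flux vector that passes the test is a point of the flux cone of the toy network; enumerating elementary modes or extreme pathways requires dedicated algorithms that are not shown here.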

Another definition was advanced by Pitkänen et al. (2005), which states that pathways should be minimal feasible (that is, self-maintaining) systems. More precisely, the reactions contained in a pathway should be necessary and sufficient to synthesize all its compounds. Chemical organization theory (e.g. Centler et al., 2005) refines this definition. A chemical organization has two properties: self-maintenance (for each compound of the pathway there exists a non-negative flux vector for its synthesis) and closure (each compound that can be generated by the reactions of the system is a part of the system). Both properties guarantee that chemical organizations are stable reaction systems.

These constraint-based definitions divide the pathway’s compounds into two sets: internal compounds that have to satisfy the constraints imposed by the definition (i.e. stoichiometric balance or self-maintenance) and external compounds that are excluded from the constraint. It is not always easy to know which compound belongs to which set, a problem that will be further discussed in Section 5.2.2.



Fig. 2. Example of a metabolic pathway that conforms to the topological pathway definition but is biochemically invalid. This pathway suggests that ADP can be synthesized from D-glucose within one step, which would require an unrealistically large number of atomic re-arrangements, additions and eliminations. Such unrealistic pathways often result when a naive path finding algorithm traverses metabolic networks without distinguishing between main and side compounds.

Other definitions emphasize the matter flow through the pathway. For instance, Arita (2004) defines a pathway as follows: “A metabolic pathway (pathway for short) from metabolite X to Y is defined as a sequence of biochemical reactions through which at least one carbon atom in X reaches Y”. This definition allows distinguishing main compounds that carry matter through the pathway from side compounds that are energy or electron donors/acceptors (e.g. ATP/ADP or NADPH/NADP+) or small inorganic molecules (e.g. CO2 or water). However, Arita’s definition has the weakness of excluding pathways involving the transfer of other atom types, e.g. the sulfur reduction pathway. Rather than associating the “main” and “side” property to compounds as such, Karp defines those qualifiers in the context of a pathway: “The main compounds lie along the backbone of the pathway – these compounds are shared between consecutive steps of a pathway” (Karp and Paley, 1994). However, this distinction between main and side compounds pre-supposes a pathway definition and is thus not helpful in a context of pathway discovery.

The definitions listed so far emphasize different aspects of pathways, such as their stoichiometric balance, self-maintenance or atom flow. Not all of these aspects are equally relevant for the interpretation of sets of functionally related genes. To interpret such a gene set, we need to obtain the part (or sub-network) of the metabolic network in which their products are involved. We can enforce this sub-network to be balanced and/or self-maintained, but we may also assume that the pathway catalyzed by those genes does not have to be balanced.

For instance, changes in external conditions (availability or depletion of some metabolites) will generally activate the expression of a handful of specific enzymes catalyzing the degradation or the biosynthesis of the concerned metabolites. In many cases, those pathways are not balanced, and their activation will decrease internal concentrations of several other metabolites. For example, in Escherichia coli, a lack of methionine inactivates the MetJ repressor, thereby triggering the transcription of the enzymes ensuring a linear path from L-aspartate to L-methionine. This pathway, however, consumes L-cysteine, whose sulfur is transferred to cystathionine. The resulting depletion in L-cysteine may eventually lead to the activation of the cysteine biosynthesis pathway.

This case is far from exceptional: a representative fraction of the canonical metabolic pathways are not balanced (Planes and Beasley, 2008). When inferring pathways from sets of functionally related enzymes, we can thus relax the self-maintenance and stoichiometric balance constraints.

Nevertheless, it is important to differentiate between biochemically relevant and irrelevant connections of the seed reactions. In the absence of further constraints, an atom flow-based definition is most helpful for this purpose.

4. Functional definition of a metabolic pathway

At this point, we would like to introduce a definition that emphasizes the functional aspect of a pathway.

We mentioned the problem of arbitrary boundaries between pathways in the metabolic network. We note that such boundaries are naturally introduced if we consider gene regulation. Genes are switched on and off together in response to a particular condition, up- or down-regulating the pathway that is needed/unnecessary in that condition. Genes involved in a common function tend to be co-expressed, conserved in phylogenetic profiles or transferred together in horizontal gene transfer events. Different organisms may respond to the same metabolic requirement by expressing different sets of enzymes and transporters. The “boundaries” of a metabolic pathway should thus not be defined in terms of absolute rules, such as key compounds or stoichiometry, but be considered as organism- and even context-dependent.

Thus, we can define a metabolic pathway as a set of interconnected reactions that can be activated coordinately to ensure a particular cellular function.

This definition encompasses the topological constraint whilst ensuring biochemical relevance, but presents the weakness of requiring a definition of the cellular function. However, the lack of a rigorous definition of cellular function gives the freedom to define multi-scale functional modules (processes) ranging from very specific pathways to “super-pathways” (e.g. biosynthesis of methionine from homoserine, biosynthesis of methionine from aspartate, biosynthesis of aspartate-derivative amino acids, etc.).

5. Graph representation and traversal in metabolic networks

5.1. Representation of metabolic networks

In order to connect query reactions (seeds) in the metabolic network, we need to represent the metabolic network as a graph. A graph is a mathematical abstraction of connected objects. It consists of nodes (also called vertices) representing the objects and of edges, which connect the nodes and represent links between objects. The link between two objects may be directional and is then represented by a directed edge called arc, which points from the source node to the target node. A graph containing arcs is also called directed graph or digraph. It is not trivial to map metabolism onto a graph in a meaningful way. Some network representations suffer from important drawbacks, and are therefore less suited for the prediction of metabolic pathways than others (van Helden et al., 2002).

Fig. 3. Alternative network representations of metabolism. (A and B) In the compound (A) and reaction (B) graph, nodes represent compounds and reactions, respectively, whereas edges represent reactions and compounds, respectively. These graphs represent the same reactions (e.g. 2.7.1.2 in A) or compounds (e.g. D-glucose in B) multiple times. (C and D) Bipartite graphs (C) and hypergraphs (D) avoid this shortcoming by representing reactions and compounds as two separate node types (bipartite graph) or by representing reactions with hyper-edges, which connect multiple compounds (compound-centric hypergraph). (E) Undirected graphs enable graph algorithms to go from one substrate to another substrate or from one product to another product. To prevent this, metabolic networks should be represented by directed graphs. (F) In bipartite graphs, forward and reverse direction of a reaction can then be represented by two separate nodes, which have to be mutually exclusive to prevent a graph algorithm from going from one substrate to another substrate (or from one product to another product) via the forward and reverse reaction direction. (G) The atom mapping graph traces atoms through the metabolic network by matching the structures of substrates and products. Corresponding atom groups are encircled in matching colors. The compound structures were downloaded from KEGG and drawn with Jmol (Herréz, 2006).

A commonly employed network representation is the compound graph (e.g. in Chou et al., 2009; Goesmann et al., 2002; Ma and Zeng, 2003), where nodes stand for compounds and edges/arcs for reactions (see Fig. 3A). Two compound nodes are connected if they participate as substrate and product in a common reaction. In the reaction graph (Fig. 3B), the counterpart of the compound graph, nodes represent reactions and edges/arcs compounds (e.g. in Forst and Schulten, 2001; Wagner and Fell, 2001). Two reaction nodes are connected if they share a compound that acts as product of the first and as substrate of the second reaction. In both network representations, a compound (reaction graph) or a reaction (compound graph) can occur multiple times. For instance, in the reaction graph, a compound edge/arc occurs as often as there are reactions which produce or consume it. This problem has recently been noted by Klamt et al. (2009), but was already recognized by van Helden et al. (2002). An algorithm that connects seed reactions in a metabolic network represented as compound or reaction graph may predict a pathway containing the same reaction or compound multiple times. This may be desirable in some cases (fatty acid biosynthesis, cyclic pathways), but will in most cases result in problematic pathways containing futile cycles or invalid shortcuts. For instance, in the compound graph, one could reach one substrate of a reaction from another in case the reaction is reversible. To avoid this behavior, the algorithm would need to keep track of the labels of arcs that were already traversed. In addition, both the compound and the reaction graph make it more difficult to deal with mixed input comprising compounds as well as reactions. For these reasons, these graph representations are unsuitable for our pathway prediction approach.
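The multiple-occurrence problem of the compound graph can be made concrete with a small sketch. The Python code below derives compound-graph arcs from a single reaction, EC 2.7.1.2 (D-glucose + ATP → D-glucose 6-phosphate + ADP). The data layout and the function name `compound_graph` are illustrative assumptions, not the representation used in the cited tools.

```python
from itertools import product

# One reaction written as (substrates, products).
reactions = {
    "2.7.1.2": (["D-glucose", "ATP"], ["D-glucose 6-phosphate", "ADP"]),
}

def compound_graph(reactions):
    """Return arcs (substrate, product, reaction_label) of the compound graph."""
    arcs = []
    for label, (substrates, products) in reactions.items():
        # Every substrate is linked to every product of the same reaction.
        for s, p in product(substrates, products):
            arcs.append((s, p, label))
    return arcs

arcs = compound_graph(reactions)
for arc in arcs:
    print(arc)
```

The single reaction labels four separate arcs, one of which links D-glucose directly to ADP, so an algorithm working on this representation has to track arc labels to recognize that all four arcs belong to the same reaction.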

The bipartite graph (Fig. 3C) is a representation that avoids the multiple occurrence problem of the compound and reaction graphs. Bipartite graphs are made of two node types, where edges/arcs never connect two nodes of the same type. For metabolic networks, the two node types represent compounds and reactions, respectively (e.g. in Sirava et al., 2002; van Helden et al., 2001), and arcs connect either a compound to a reaction (substrate link) or vice versa (product link). Petri nets, which are sometimes employed to describe metabolic networks (e.g. Küffner et al., 2000), are a special case of bipartite graphs, where compounds are place nodes and reactions transition nodes. Bipartite graphs representing metabolic networks have also been described as AND–OR graphs, where compound nodes have the role of OR nodes and reaction nodes the role of AND nodes (Pitkänen et al., 2005).
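A minimal sketch of a directed bipartite representation follows, assuming a simple dictionary of successor sets. The function name `bipartite_graph` and the data layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

reactions = {
    "2.7.1.2": (["D-glucose", "ATP"], ["D-glucose 6-phosphate", "ADP"]),
}

def bipartite_graph(reactions):
    """Return successor sets of a directed bipartite graph.

    Arcs go compound -> reaction (substrate link)
    or reaction -> compound (product link).
    """
    successors = defaultdict(set)
    for rxn, (substrates, products) in reactions.items():
        for s in substrates:
            successors[s].add(rxn)
        for p in products:
            successors[rxn].add(p)
    return successors

graph = bipartite_graph(reactions)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Here the reaction appears exactly once, as a node, and no arc connects two compounds or two reactions directly.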

Hypergraphs generalize the concept of a graph. Whereas in a graph, each arc/edge connects only two nodes, an arc/edge in a hypergraph (termed hyper-arc/hyper-edge) may connect more than two nodes and can thus easily represent a reaction that involves more than two compounds. In principle, hypergraphs can be compound-centered (nodes represent compounds) or reaction-centered (nodes represent reactions), but so far, only the compound-centered hypergraph has been mentioned in the literature (e.g. Mithani et al., 2009) (see Fig. 3D). The stoichiometric matrix employed in flux balance analysis is mathematically equivalent to a (compound-centered) directed hypergraph (excluding catalytic compounds) (Klamt et al., 2009).

Despite their recent recommendation in Klamt et al. (2009), hypergraphs have a disadvantage for pathway prediction: it is not as straightforward as in bipartite graphs to predict pathways when the “seeds” combine compounds and reactions. Furthermore, hypergraphs can be mapped onto bipartite graphs by transforming each hyper-arc/hyper-edge into a node. Thus, bipartite graphs and hypergraphs are mathematically equivalent graph representations.
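The mapping between the two representations can be sketched as a round trip: each hyper-arc becomes a reaction node, and the hyper-arcs can be recovered from the arcs around each reaction node. The toy hypergraph and both function names below are assumptions made for illustration; the sketch checks the round trip on this example only and is not a general proof.

```python
# Directed hypergraph: each hyper-arc links a set of substrates to a set of products.
hyperarcs = {
    "R1": (frozenset({"A", "B"}), frozenset({"C"})),
    "R2": (frozenset({"C"}), frozenset({"D", "E"})),
}

def hypergraph_to_bipartite(hyperarcs):
    """Turn every hyper-arc into a reaction node with substrate and product arcs."""
    arcs = set()
    for name, (tails, heads) in hyperarcs.items():
        arcs |= {(compound, name) for compound in tails}
        arcs |= {(name, compound) for compound in heads}
    return arcs

def bipartite_to_hypergraph(arcs, reaction_nodes):
    """Rebuild the hyper-arcs from the arcs entering and leaving each reaction node."""
    rebuilt = {}
    for rxn in reaction_nodes:
        tails = frozenset(src for src, dst in arcs if dst == rxn)
        heads = frozenset(dst for src, dst in arcs if src == rxn)
        rebuilt[rxn] = (tails, heads)
    return rebuilt

arcs = hypergraph_to_bipartite(hyperarcs)
print(sorted(arcs))
print(bipartite_to_hypergraph(arcs, hyperarcs.keys()) == hyperarcs)
```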

Arita introduced the atom mapping graph (Arita, 2000), where nodes represent atoms and arcs mappings between atoms in the substrate and product of a reaction. Atom mappings are computed by representing the chemical structure of a compound as a graph and then using a graph matching algorithm to find the most similar substrate-product structures (see Fig. 3G). The atom mapping graph has been recently employed in Boyer and Viari (2003), Heath et al. (2010) and Pitkänen et al. (2009).

5.2. Graph traversal in metabolic networks

A simple approach towards the prediction of a pathway is to find the shortest path(s) between a given start and end node in the network. In this context, we would like to clarify the difference between a path and a (metabolic) pathway. In graph theory, a path is defined as a linear sequence of nodes connected by edges such that each node pair is connected by only one edge. In addition, each path node, including the start and the end node, can be found at most once in a path. In contrast, according to most definitions, a metabolic pathway (as defined above) may contain branches, cycles, and multiple instances of the same compound.

K-shortest path algorithms (e.g. Jimenez and Marzal, 1999; Eppstein, 1999; Yen, 1971) enumerate the shortest paths between two nodes in a network in the order of their length and are commonly applied to predict pathways in metabolic networks (Sirava et al., 2002; Arita, 2003; Blum and Kohlbacher, 2008; Faust et al., 2009b; Pitkänen et al., 2009; Heath et al., 2010). However, when using them to predict metabolic pathways, two issues specific to metabolic networks need to be considered.
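To illustrate what "enumerating paths in the order of their length" means, the sketch below lists all simple paths of a tiny directed graph by exhaustive depth-first search and sorts them by length. This brute-force stand-in is not any of the cited K-shortest path algorithms, which avoid exhaustive enumeration; the graph and the function names are hypothetical.

```python
def simple_paths(graph, start, end, path=None):
    """Yield every path from start to end that visits each node at most once."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in sorted(graph.get(start, ())):
        if nxt not in path:  # a path may not contain the same node twice
            yield from simple_paths(graph, nxt, end, path)

def k_shortest_paths(graph, start, end, k):
    """Return the k paths with the fewest nodes (naive: enumerate, then sort)."""
    return sorted(simple_paths(graph, start, end), key=len)[:k]

graph = {"A": ["B", "C"], "B": ["D"], "C": ["B", "D"], "D": []}

for rank, path in enumerate(k_shortest_paths(graph, "A", "D", 3), start=1):
    print(rank, " -> ".join(path))
```

Exhaustive enumeration becomes impractical on genome-scale networks, which is one reason dedicated algorithms are used in practice.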

5.2.1. Reaction directionality

A specific problem of metabolic networks is reaction reversibility: the chemical reactions in these networks must be considered as reversible, unless specific information is given to the contrary. Indeed, even if a chemical reaction has a strong preference for one direction in one organism, its direction may be the opposite in another organism, as reaction directionality depends not only on the standard Gibbs free energy change of the reaction but also on reactant concentrations and the temperature. One is faced with several choices for representing reaction reversibility in the metabolic graph. One possibility is to link reversible reactions and their reactants with an edge that can be traversed in both directions during path finding. This, however, would make it cumbersome to distinguish between substrates and products. Indeed, straightforward navigation through the graph would result in connecting two substrates (or two products) of the same chemical reaction to each other. In the context of path finding, this would mean that two substrates of the same reaction can be interconverted in one step, thereby violating the laws of chemistry (see Fig. 3E). Another solution is to represent a reversible reaction as two separate nodes in the network, one for each direction, as illustrated in Fig. 3F. However, inclusion of both direct and reverse reactions in the same path would lead to the chemically meaningless situation where two substrates of a reversible reaction could be transformed into each other in two steps. For instance, a reversible reaction (A + B ↔ C) would be converted to a direct reaction (A + B → C) and a reverse reaction (C → A + B), which would open the possibility for an irrelevant two-step path converting one substrate into the other one (A → C → B).

Path finding algorithms should thus be adapted to prevent including both the direct and the reverse reactions in the same path. This constraint of mutual exclusion increases the complexity of path finding algorithms. Despite this difficulty, we opted for this solution, as it makes it easy to represent both reversible and physiologically irreversible reactions. For the latter, the reverse reaction can simply be omitted.
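The mutual exclusion constraint can be sketched with a breadth-first search over partial paths that refuses to extend a path with a reaction node whose opposite direction is already part of it. The node names `R1_fwd` and `R1_rev`, the `reverse_of` mapping and the function `shortest_path` are illustrative assumptions and do not reproduce the authors' implementation.

```python
from collections import deque

# Reversible reaction A + B <-> C split into two direction nodes (hypothetical names).
graph = {
    "A": ["R1_fwd"], "B": ["R1_fwd"],
    "R1_fwd": ["C"],
    "C": ["R1_rev"],
    "R1_rev": ["A", "B"],
}
# Each direction node is mapped to the node of the opposite direction.
reverse_of = {"R1_fwd": "R1_rev", "R1_rev": "R1_fwd"}

def shortest_path(graph, start, end, exclusive=None):
    """Breadth-first search over partial paths, optionally with mutual exclusion."""
    exclusive = exclusive or {}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for nxt in graph.get(node, ()):
            if nxt in path:
                continue  # a path visits each node at most once
            if exclusive.get(nxt) in path:
                continue  # the opposite direction is already used in this path
            queue.append(path + [nxt])
    return None

print(shortest_path(graph, "A", "B"))              # naive search finds A -> C -> B
print(shortest_path(graph, "A", "B", reverse_of))  # constrained search finds no path
```

The search keeps whole paths in the queue rather than a global set of visited nodes, because whether a reaction node is allowed depends on the particular path leading to it.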

5.2.2. The hub compound problemAs illustrated in Fig. 2, not all reaction sequences are bio-

chemically relevant. The pathway shown in Fig. 2 suggests thatADP could be produced from d-glucose within one step, whichis biochemically impossible. ADP is indeed a product of reaction2.7.1.2 and a substrate of reaction 2.7.1.40, but is in both casesa side-compound and cannot be considered as “produced” fromd-glucose, or “producing” pyruvate. It should thus never be usedas intermediate node in a path connecting d-glucose and pyru-vate. This problem has a tremendous effect on path finding: sidecompounds may be involved in hundreds of reactions, thus act-ing as hubs in the network. If a path finding algorithm connectstwo nodes via such a short-cut, this will result in an invalid path-way in most cases. The problem is illustrated in Fig. 4B: in theraw metabolic network, all the shortest paths from l-arginine tosuccinate use irrelevant shortcut through metabolites that serveas side-compounds in the reactions included in the paths (H2O,NADP+, NAD+, ADP).

A first attempt to circumvent the shortcut effect of “metabolichubs” was to search paths in a “filtered” network, from which a sub-set of the most connected compounds (“hubs”) had been removed.Our early results showed that this filtering improved the relevanceof the subgraph extraction, but was only able to infer short paths(typically, 2 reaction steps) between seed reactions (van Heldenet al., 2001), because of the remaining highly connected compounds(which do not act as side compounds) such as alanine or aspartate.As shown in Fig. 4C, the shortest paths from l-arginine to succinatein the filtered graph are much shorter than the reference argininecatabolic pathway and none of the intermediate steps correspondsto the annotations (Fig. 4A).

The problem is to distinguish main compounds from side compounds. For instance, ADP is a side compound in the glycolysis pathway, but a perfectly valid intermediate in nucleotide biosynthesis. In addition, it is unclear what defines a side compound. Compounds such as water, ADP/ATP and NADPH/NADP+ act as side compounds in most cases, but what about acetyl-CoA? Besides, some other highly connected compounds such as pyruvate act as main substrate/product in many reactions, and will have the same shortcut effect in a basic path finding approach. For these reasons, simply removing a list of “hub” compounds from the network does not solve the problem.
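For concreteness, the filtering strategy discussed above can be sketched as follows. This is a minimal illustration on a toy network, not the original implementation: the reaction identifiers are hypothetical, and the cutoff `n` is exactly the kind of arbitrary choice that makes the approach problematic.

```python
from collections import Counter


def compound_degrees(reactions):
    """Count the number of reactions in which each compound takes part."""
    degree = Counter()
    for substrates, products in reactions.values():
        for compound in set(substrates) | set(products):
            degree[compound] += 1
    return degree


def remove_top_hubs(reactions, n):
    """Remove the n most connected compounds from every reaction."""
    hubs = {c for c, _ in compound_degrees(reactions).most_common(n)}
    filtered = {
        rid: ([s for s in subs if s not in hubs],
              [p for p in prods if p not in hubs])
        for rid, (subs, prods) in reactions.items()
    }
    return filtered, hubs


# Toy network (hypothetical reaction identifiers): ATP and ADP take part
# in all three reactions, every other compound in only one.
reactions = {
    "R1": (["glucose", "ATP"], ["G6P", "ADP"]),
    "R2": (["PEP", "ADP"], ["pyruvate", "ATP"]),
    "R3": (["F6P", "ATP"], ["FBP", "ADP"]),
}
filtered, hubs = remove_top_hubs(reactions, n=2)
```

In this toy case the two removed compounds are indeed side compounds, but in a genome-scale network the same cutoff would also remove or retain compounds such as pyruvate regardless of their role in a given reaction.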

6. Path finding in weighted networks

As discussed in Section 5.2.2, the removal of hub compounds is problematic because there is no clear-cut distinction between main and side compounds. However, we observe that typical side compounds tend to be involved in more reactions than the average compound. Motivated by this observation, we devised a path finding strategy that takes the number of connections of a compound into account by weighting the metabolic network. In a node-weighted graph, each node is assigned a real number, called its weight or cost. We employed a weight policy that assigns to each compound node a weight corresponding to its degree (i.e. the number of edges connecting this node) and to each reaction node a weight of 1. We adapted the path finding algorithm to enumerate the lightest paths (i.e. the paths minimizing the sum of node weights) instead of the shortest paths (i.e. the paths with the minimal number of nodes) (Croes et al., 2005, 2006). Since there may be more than one lightest path, we employ a K-shortest path finding algorithm (backtracking) to collect all paths of equivalent weight. By penalizing the highly connected compounds, we reduce the probability of selecting meaningless shortcuts when navigating in the graph, without having to make somewhat arbitrary choices of the compounds to exclude.

Fig. 4. Evaluation of paths computed for the E. coli arginine degradation II (AST) pathway. (A) The AST pathway as annotated in MetaCyc. Ellipses represent compounds and rectangles reactions. Nodes are labeled with their KEGG identifier, compounds in addition with their name and reactions with their enzyme classification (EC) number. The seed nodes are colored in blue, whereas intermediate nodes, which should be found by path finding, have a green border. (B-D) Paths computed by path finding in a metabolic network constructed from KEGG data, where (B) shows the raw graph, (C) the filtered graph and (D) the weighted graph. (E) Paths computed by path finding in the RPairs network constructed from KEGG data. Correctly predicted intermediate nodes have a green border. The paths are labeled with their rank, where the lightest path(s) has rank one, the second-lightest path(s) rank two and so on.

Fig. 4D displays the five lightest paths from L-arginine to succinate in the compound-weighted metabolic network. The third-ranking path perfectly matches the annotated pathway. Since in general we cannot know which among the top-ranking paths to select, we base our evaluation on a comparison of the first-ranking paths (i.e. the lightest paths) to the annotated pathway.
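The difference between shortest and lightest paths can be reproduced on a toy network. The sketch below is an illustration under stated assumptions rather than the published algorithm: it uses a plain Dijkstra search on node weights in an undirected bipartite graph, returns a single lightest path instead of enumerating all paths of equal weight, and relies on a hypothetical naming convention in which reaction nodes start with `r`.

```python
import heapq
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets (directionality is ignored in this toy)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def lightest_path(adj, weight, source, target):
    """Dijkstra on node weights; the cost sums the weights of all path nodes."""
    best = {source: weight(source)}
    prev = {}
    heap = [(best[source], source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best[node]:
            continue  # stale heap entry
        for neighbour in adj[node]:
            new_cost = cost + weight(neighbour)
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                prev[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]


# Toy network: A and C are linked by a shortcut through HUB (degree 6)
# and by a longer route through the weakly connected compounds B and X.
edges = [("A", "r1"), ("r1", "HUB"), ("HUB", "r2"), ("r2", "C"),
         ("A", "r3"), ("r3", "B"), ("B", "r4"), ("r4", "X"),
         ("X", "r5"), ("r5", "C")]
edges += [("HUB", f"r{i}") for i in range(6, 10)]
adj = build_graph(edges)


def degree_weight(node):
    """Reactions weigh 1, compounds weigh their degree."""
    return 1 if node.startswith("r") else len(adj[node])


def unit_weight(node):
    """Every node weighs 1, so the lightest path is the shortest path."""
    return 1


unit_path, unit_cost = lightest_path(adj, unit_weight, "A", "C")
weighted_path, weighted_cost = lightest_path(adj, degree_weight, "A", "C")
```

With unit weights the search takes the shortcut through `HUB` (five nodes), whereas with degree-based weights the route through `B` and `X` is lighter (weight 11 against 12 for the shortcut).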

We systematically evaluated the performance of this approach by comparing the intermediate reactions in the computed pathways¹ to those in linearized annotated pathways. In addition, we compared the performance of the weighted graph with that of the un-weighted raw (all compounds and reactions) and filtered (top 30 highly connected metabolites removed) graphs and found the correspondence between the computed and annotated pathways to be very poor (<30%) in the raw graph, increasing to approximately 65% in the filtered graph and reaching approximately 85% in the weighted graph. Considering the best-matching path among the five lightest paths increases the correspondence to 92%.

Fig. 5. The decomposition of reaction R02724 into five reactant pairs (RPairs) is shown. Each RPair links a substrate to a product with similar compound structure. In addition, each RPair is assigned to a class that describes its role in the reaction. For instance, “main” refers to “main changes on substrates” (Kotera et al., 2004a) (e.g. cholesterol and pregnenolone) whereas “leave” describes addition or elimination of inorganic compounds (e.g. oxygen and 4-methylpentanal). Importantly, the rare side compound ferredoxin is not involved in any RPair. The reason is that ferredoxin does not contribute atoms, but electrons to the reaction. A path finding algorithm that is unaware of the RPair annotation may falsely connect cholesterol with oxidized adrenal ferredoxin. Thus, RPairs improve the accuracy of path finding by preventing the traversal of a reaction via its side products.

7. The illusion of small-worldness

In a weighted metabolic network, we can measure a metabolic distance between two enzymes s and t, which we define as $D_{s,t} = W_p - (W_s + W_t)/2$, where $W_p$ is the sum of node weights in the lightest path between any reaction associated to enzyme s and any reaction associated to enzyme t, $W_s$ is the weight of the start node and $W_t$ the weight of the end node. In the same way, we can also compute the metabolic distance between two compounds.
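A small numerical sketch of this distance follows, assuming it is computed as the weight of the lightest path minus the mean of the two end-node weights. The function name and the example values are hypothetical.

```python
def metabolic_distance(path_weight, start_weight, end_weight):
    """Lightest-path weight minus the mean weight of the two end nodes."""
    return path_weight - (start_weight + end_weight) / 2


# Two seed reactions (weight 1 each) linked through one compound of degree 4:
# the path weight is 1 + 4 + 1 = 6.
example_distance = metabolic_distance(path_weight=6, start_weight=1, end_weight=1)
```

Here the distance evaluates to 5: subtracting the mean end-node weight removes one unit of seed weight, so the distance reflects mainly the intermediate nodes.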

It was previously claimed that metabolic networks are “small world” networks (Jeong et al., 2000; Fell and Wagner, 2000). Small world networks are characterized by a small average path length compared to random networks (Watts and Strogatz, 1998). The average path length (AL) is defined as the average length of shortest paths computed between all node pairs. In metabolic networks, a small AL means that any compound can be synthesized from any other compound within a few enzymatic steps (around three as stated in Jeong et al., 2000). We computed the distribution of metabolic distances between random pairs of reactions in the raw, the filtered and the weighted network and found that in the raw network, the most frequent metabolic distance is two, whereas the most frequent metabolic distances in the weighted network comprise five to eight reaction steps. This result is supported by two other studies, which measured the average shortest path length between compounds in the E. coli network and treated hub compounds either by excluding them (Ma and Zeng, 2003) or by tracing atoms (Arita, 2004). Both concluded that the average path length is around eight. Thus, the small world property claimed for metabolic networks is mainly due to biochemically irrelevant paths traversing side compounds and disappears when measuring metabolic distance in a more realistic way (Croes et al., 2006; Lima-Mendez and van Helden, 2009).

¹ In cases where several paths have the same weight, we merge all of them, thus we can predict branched pathways by enumerating linear paths.

8. Extending pathway prediction with RPairs and multiple seed nodes

8.1. The benefits of RPairs

In Section 6, we have shown that penalizing hub compounds by weighting them substantially improves metabolic path finding. However, a degree-dependent weight policy does not penalize rare side compounds. For example, consider KEGG reaction R02724 depicted in Fig. 5. In this reaction, reduced adrenal ferredoxin does not contribute any atoms to the products, but acts as an electron donor. However, since it has a low degree (it is involved only in a few reactions), the path finding algorithm may traverse it even in a weighted graph, thus predicting an invalid pathway. One solution to this problem is to trace the atoms of the compound(s) of interest through the metabolic network, an approach first introduced by Arita (2000, 2003) and then applied in various path finding tools (Rahman et al., 2004; Blum and Kohlbacher, 2008; Pitkänen et al., 2009). In our case, there is an important drawback to this approach: it is not suited for path finding between reactions. Indeed, all tools based on atom tracing predict pathways between compounds only. Atoms are usually traced by first computing substrate-product mappings and then enumerating paths through a graph constructed from these mappings (Arita, 2004). However, in the case of a reaction with multiple products, it is not clear which of the multiple substrate-product mappings to select. If we decide to trace the atoms of all products of such a reaction, we will frequently include highly connected side compounds such as ADP, water or orthophosphate, thus predicting irrelevant pathways.

The evaluation of weighted path finding (see Section 6) has shown that pathway prediction improves when we force the path finding algorithm at a branching point (that is, when reaching a reaction with multiple products) to continue with the least connected compound. Likewise, atom tracing applied to reactions with multiple products may benefit when tracing the least connected compounds only. To test this idea, we employed the substrate-product mappings (called RPairs) stored in the KEGG RPAIR database (Kotera et al., 2004a,b). The RPAIR database has the additional benefit of assigning a role to the RPairs, which include main (for major carbon atom transfer), cofac (for cofactors of oxidoreductases), trans (for functional groups transferred by transferases), ligase (for triphosphates involved in ligase reactions) and leave (for addition or elimination of small inorganic compounds). These roles are reaction-specific, thus allowing us to select for each reaction its relevant substrate-product mappings and to neglect irrelevant cofac, ligase or leave mappings. For instance, KEGG reaction R02724 (Fig. 5) is divided into 5 reactant pairs. None of these reactant pairs involves reduced or oxidized adrenal ferredoxin, as it neither contributes nor receives any atoms during the reaction (see Fig. 5). An RPairs-aware path finding algorithm will therefore avoid ferredoxin as an intermediate compound.
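The effect of role-based RPair selection can be sketched as follows. The record layout and helper names are hypothetical and do not reflect the KEGG RPAIR file format; only the two reactant pairs of R02724 that are named in the text are listed, so the toy list is deliberately incomplete.

```python
from collections import namedtuple

# Hypothetical record layout for one reactant pair.
RPair = namedtuple("RPair", "reaction substrate product role")

# The two reactant pairs of R02724 mentioned in the text (of five in total).
rpairs = [
    RPair("R02724", "cholesterol", "pregnenolone", "main"),
    RPair("R02724", "oxygen", "4-methylpentanal", "leave"),
]


def successors(compound, rpairs, allowed_roles=("main", "trans")):
    """Compounds reachable from `compound` through RPairs of an allowed role."""
    return {p.product for p in rpairs
            if p.substrate == compound and p.role in allowed_roles}


from_cholesterol = successors("cholesterol", rpairs)
from_oxygen = successors("oxygen", rpairs)
```

Because ferredoxin appears in no reactant pair, it can never be reached from cholesterol, and the `leave` pair starting at oxygen is ignored when only `main` and `trans` roles are allowed.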

To test the impact of RPair annotations on path finding accuracy, we compared the performance of the RPair network (with RPairs instead of reactions) with the default metabolic network. We measured the path finding accuracy in both networks for different values of various parameters, among them the directionality of the network, the weight policy, removal of hub compounds and RPair class filtering (e.g. by keeping only main RPairs). The most important finding was that the weighted RPair network yields higher pathway prediction accuracies than either the unweighted RPair network or the weighted reaction network (Faust et al., 2009b). The main reason for the better performance of the weighted compared to the unweighted RPair network is that the RPairs do not always allow us to avoid side compounds (e.g. in KEGG reaction R00299, ADP and ATP form a main RPair). However, they allow us to do so in many cases, which is why the unweighted RPair network performs better than the unweighted network without RPairs. Thus, the combination of a hub-node penalizing weight policy with RPairs performs better than either strategy alone, a result in agreement with the findings by Blum and Kohlbacher (2008). Another interesting observation is that the removal of all non-main RPairs decreases the path finding performance. This counter-intuitive result is due to the fact that we measured path finding accuracy with reference pathways containing some substrate-product pairs not classified as main pairs in the RPAIR database. The presence of trans RPairs in reference pathways reflects the ambiguity of some RPair class assignments. Tools such as KEGG spider (Antonov et al., 2008) that only rely on main RPairs may therefore miss some pathways of interest.

8.2. From two-end to multi-end pathway prediction

So far, we have focussed on predicting pathways between two seed nodes (which could be compounds or reactions). In order to interpret associated enzyme-coding gene sets, we need first to map genes to reactions and second to extend pathway prediction to a set of seed nodes.

8.2.1. Gene-to-reaction mapping and seed node sets

There is a many-to-many relationship between enzyme-coding genes, their EC numbers and associated reactions. For example, genes coding for different sub-units of the same enzyme share the same enzyme classification (EC) number whereas genes coding for multifunctional enzymes (e.g. the pentafunctional ARO1 gene in Saccharomyces cerevisiae) are associated with more than one EC number. In turn, an EC number can be linked to more than one reaction. For example, EC number 1.1.1.1 (conversion of an alcohol into an aldehyde or ketone) is associated with 18 reactions in KEGG, out of which only one may be relevant for the pathway to be predicted. To deal with this complex relationship, we devised a strategy to cope with groups of seed nodes (by introducing pseudo-nodes, see Faust et al., 2010). Only one of the members of a seed node group needs to be included into the predicted pathway, but possibly more may be added. Thus, an (inclusive) OR relationship between seed nodes can be expressed by grouping them into the same group, whereas an AND relationship can be expressed by instantiating a different group for each seed. Seed reactions can be grouped on several levels: gene-wise (all reactions associated with one gene form one seed node group), EC number-wise (all reactions associated with an EC number form one seed node group) or reaction-wise (each reaction forms a seed node group of its own). In our experience, EC number-wise groups are most suited for pathway prediction: they consider all catalytic functions of an enzyme while avoiding the need to include all reactions associated with an EC number. Thus, EC number groups are robust with respect to imprecise EC number-reaction mappings.
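The EC number-wise grouping can be sketched with plain dictionaries. All identifiers below (`g1`, `ec1`, `Ra`, ...) are hypothetical placeholders, and the two helper functions are illustrative rather than part of the published tool.

```python
def ec_wise_seed_groups(genes, gene_to_ecs, ec_to_reactions):
    """Build one seed node group (a set of reactions) per EC number."""
    groups = {}
    for gene in genes:
        for ec in gene_to_ecs.get(gene, []):
            reactions = ec_to_reactions.get(ec, [])
            if reactions:
                groups.setdefault(ec, set()).update(reactions)
    return groups


def covers_all_groups(pathway_nodes, groups):
    """AND between groups, inclusive OR among the members of each group."""
    return all(group & set(pathway_nodes) for group in groups.values())


# Hypothetical mapping: g1 is multifunctional, g2 and g3 share one EC number.
gene_to_ecs = {"g1": ["ec1", "ec2"], "g2": ["ec3"], "g3": ["ec3"]}
ec_to_reactions = {"ec1": ["Ra"], "ec2": ["Rb", "Rc"], "ec3": ["Rd"]}

groups = ec_wise_seed_groups(["g1", "g2", "g3"], gene_to_ecs, ec_to_reactions)
```

Three genes yield three groups here, because `g2` and `g3` collapse into the group of their shared EC number; a predicted pathway containing `Ra`, `Rb` and `Rd` satisfies all groups without including `Rc`.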

8.2.2. Multi-end pathway prediction

Pathway prediction given multiple seeds in a weighted network can be regarded as an instance of the Steiner tree problem, where a set of seed nodes has to be connected within a weighted graph such that the resulting subgraph is of minimal weight. Because of the minimal weight constraint, the solution subgraph will always be a tree, called Steiner tree. The Steiner tree problem is known to be NP-hard (Karp, 1972). We tested three different heuristics that are all based on repeatedly executed path finding (for details, see Faust et al., 2010). In addition, we also evaluated a random-walk based algorithm called kWalks (Dupont et al., 2006). kWalks computes the relevance of each edge and node in the network with respect to the seed node set. The relevance of an edge or node is defined as the expected number of times it appears in random walks between the seed nodes, where a random walk starts from each seed node and ends as soon as it hits another seed node. The network is then built by adding edges in the order of their relevance to a sub-network initially consisting of the seed nodes only, until either all seed nodes are connected or all edges have been added. The random walks are computed efficiently using absorbing Markov chain theory. Thus, kWalks is designed to extract rapidly the part of the input network most relevant for connecting the given seed nodes and is indeed much faster than the three Steiner tree heuristics. However, our evaluation showed that MetaCyc reference pathways are more accurately predicted with a Steiner tree heuristic than with kWalks. So we decided to combine kWalks with the best-performing Steiner tree heuristic in a hybrid approach that unites the strengths of both approaches: kWalks is launched first to reduce the input network size and thus the run-time of the subsequently executed Steiner tree heuristic, which predicts the pathway within the sub-network extracted by kWalks. This hybrid approach yielded the highest prediction accuracy in our evaluation.

An interesting feature of the kWalks algorithm is its good performance in the absence of a weight policy. Indeed, in the hybrid approach, a subgraph with weights computed by kWalks yields a higher accuracy than an unweighted subgraph. This suggests that kWalks can be applied to discover weights in unweighted networks.
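To illustrate what a Steiner tree heuristic based on repeated path finding may look like, the sketch below grows a tree from one seed and repeatedly attaches the closest remaining seed through its lightest path. This is the generic shortest-path heuristic for Steiner trees, shown under the assumption of positive node weights; it is not necessarily one of the three heuristics evaluated in Faust et al. (2010), and the toy network and node names are hypothetical.

```python
import heapq


def lightest_path_to_any(adj, weight, sources, targets):
    """Multi-source Dijkstra on node weights; nodes in `sources` cost 0."""
    best = {s: 0 for s in sources}
    prev = {}
    heap = [(0, s) for s in sorted(sources)]
    heapq.heapify(heap)
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node]:
            continue  # stale heap entry
        if node in targets:
            path = [node]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        for neighbour in adj.get(node, ()):
            new_cost = cost + weight[neighbour]
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                prev[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))
    return None


def steiner_heuristic(adj, weight, seeds):
    """Connect the seeds by repeatedly attaching the closest remaining seed."""
    seeds = list(seeds)
    tree = {seeds[0]}
    remaining = set(seeds[1:])
    while remaining:
        path = lightest_path_to_any(adj, weight, tree, remaining)
        if path is None:
            break  # the remaining seeds cannot be reached
        tree.update(path)
        remaining -= tree
    return tree


# Toy network: three seeds, each linked to a light node M and a heavy HUB.
adj = {
    "S1": ["M", "HUB"], "S2": ["M", "HUB"], "S3": ["M", "HUB"],
    "M": ["S1", "S2", "S3"], "HUB": ["S1", "S2", "S3"],
}
weight = {"S1": 1, "S2": 1, "S3": 1, "M": 1, "HUB": 10}
tree_nodes = steiner_heuristic(adj, weight, ["S1", "S2", "S3"])
```

In this toy case the three seeds are connected through `M` and the heavy `HUB` node is left out. In the hybrid approach described above, such a heuristic would be run on the sub-network pre-selected by kWalks rather than on the full network.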

9. Pathway prediction example

In the following, we will demonstrate pathway prediction on an operon from Pseudomonas aeruginosa PAO1. The aruCFGDB operon contains five genes: PA0895 (aruC), PA0896 (aruF), PA0897 (aruG), PA0898 (aruD) and PA0899 (aruB). When entering these gene identifiers into the pathway extraction web server (at http://rsat.bigre.ulb.ac.be/neat/; Brohée et al., 2008) and selecting P. aeruginosa (pae) as the organism of interest, the genes are mapped to their respective EC numbers, reactions and main RPairs (see Fig. 6). Notably, genes PA0896 and PA0897 share the same EC number, whereas PA0895 is associated with two EC numbers, thus illustrating the aforementioned many-to-many relationship between enzyme-coding genes and EC numbers. When grouping seed reactions EC number-wise, the pathway shown in Fig. 7 results. This pathway has been predicted from the weighted RPair network (KEGG version 55.0), thus benefiting from the weights as well as the RPair annotation. The network includes all reactant pairs and small molecule compounds in KEGG. We could have selected an organism-specific network of P. aeruginosa for higher specificity, but at the cost of possibly decreasing the sensitivity (i.e. the capability of predicting pathways that occur in P. aeruginosa, but have not yet been included in the organism-specific network).

Fig. 6. Reaction mapping output of the pathway extraction tool for the five genes in the aruCFGDB operon of P. aeruginosa PAO1. The genes are mapped to their EC numbers, reactions and main RPairs using the KEGG database (version 55.0). The many-to-many relationship between enzyme-coding genes and EC numbers is exemplified by PA0895 (associated to two EC numbers) and PA0896 and PA0897 (associated to the same EC number).

Interestingly, the predicted pathway proposes several intermediate reactions for which P. aeruginosa genes are known in KEGG (PA0901 and PA1162) but also contains a reaction (R04217 with EC number 2.6.1.81) that is not associated to any gene in P. aeruginosa. Such a gap may be interpreted in several ways: (1) the reaction is spontaneous, (2) the reaction is a wrong prediction, i.e. P. aeruginosa has no enzyme that catalyses this reaction, (3) the reaction is carried out by an enzyme in P. aeruginosa, but this enzyme is not known or not annotated in the source database or the enzyme is present in the database but not linked to this reaction. In general, the substrate range of many enzymes is unknown, so we may expect many missing links between reactions and their catalyzing enzymes in current metabolic databases. However, in MetaCyc the EC number of the gap reaction (2.6.1.81) is linked to a P. aeruginosa gene (aruC). This gene is contained in KEGG (version 55.0), but not linked to EC number 2.6.1.81. Indeed, the P. aeruginosa-specific KEGG map of arginine and proline metabolism, which contains the predicted pathway, does not link EC number 2.6.1.81 to any P. aeruginosa gene. Interestingly, the predicted pathway reproduces the AST (arginine succinyl-transferase) pathway annotated in MetaCyc for E. coli and P. aeruginosa.

Thus, without knowledge of either the metabolic network or the pathways of P. aeruginosa, we could correctly predict the arginine degradation pathway (AST pathway) from the operon encoding it. Clearly, a complete pathway will not always be encoded in a single operon. Enzyme-coding genes involved in the same pathways may belong to several operons or even be scattered over the whole genome (e.g. methionine biosynthesis in E. coli). However, if these genes are co-regulated, and thus detectable as a group of associated genes (i.e. co-expressed in a micro-array experiment or sharing common transcription factor binding sites), our pathway prediction procedure can still assemble their pathway.

Fig. 7. Pathway prediction result for the enzyme-coding genes in the aruCFGDB operon of P. aeruginosa PAO1. Seed reactant pairs (RPairs) obtained from the genes were connected by a subgraph extraction algorithm (combining kWalks and a Steiner tree heuristic) in a network built from KEGG RPAIR (version 55.0). RPairs are represented as rectangles and compounds as ellipses. Seed RPairs have blue borders, whereas the borders of intermediate RPairs and compounds are colored according to their membership to different KEGG maps, where beige stands for arginine and proline metabolism and violet for lysine biosynthesis. Intermediates that are not part of any KEGG map have orange borders. If a compound of a seed RPair did not already form a part of the pathway, it was added after prediction and colored in magenta. The predicted pathway covers the complete arginine succinyl-transferase (AST) pathway, which is annotated in MetaCyc for P. aeruginosa.

10. Discussion

10.1. Strengths of the prediction approach

Pathway prediction by subgraph extraction is a generic pathway prediction approach that can be applied to any biological network and that can handle networks of realistic sizes (with thousands of nodes and tens of thousands of edges). It does not require any other input apart from the network and the seed nodes, although additional information in the form of RPairs or a well defined weight policy increases prediction accuracy. In addition, weights can be fine-tuned to favor organism-specific reactions or to integrate scores from a high-throughput experiment. To our knowledge, the prediction approach summarized here is the only one that can predict metabolic pathways from a mixture of reactions and compounds.

10.2. Weaknesses of the prediction approach

We decided not to impose constraints on the stoichiometry or feasibility of the extracted pathway, i.e. compounds in the pathway are expected to be available in sufficient numbers. This may in some cases lead to wrong predictions (de Figueiredo et al., 2009). A more in-depth discussion of non-stoichiometric versus stoichiometric approaches in path finding can also be found in Faust et al. (2009a) and Planes and Beasley (2008). Another disadvantage of our metabolic pathway prediction approach is its restriction to small molecule metabolism, which excludes polymers (e.g. starch, DNA, RNA). In addition, our evaluations have shown that the prediction approach does not perform well for the highly connected central part of the metabolic network, e.g. glycolysis. Distinguishing these pathways from alternatives may be possible when Gibbs free energy changes are taken into account. However, the computation of these energy changes requires data on compound concentrations in the cell. Furthermore, subgraph extraction with Steiner tree heuristics cannot predict spiral-shaped (e.g. fatty acid biosynthesis, which uses the same enzymes in several elongation rounds) or cyclic pathways, as the solution to the Steiner tree problem is a tree. It may however find parts of these pathways. In general, the extraction of the lightest sub-network assumes that the pathway to be extracted is as short as biochemically possible. This parsimony assumption makes sense in many cases, as the synthesis of an enzyme is costly. However, not all pathways have evolved to synthesize or degrade a compound with the minimal number of enzymatic steps. For example, the TCA cycle has not been optimized to synthesize oxaloacetate with the smallest possible number of enzymes but to produce energy and precursors for anabolic metabolic pathways such as amino acid biosynthesis.

10.3. Applications

10.3.1. Interpretation of associated genes from high-throughput data

The main application of pathway prediction by subgraph extraction is the interpretation of co-expressed gene sets. Several network-based tools to interpret gene (or protein) sets obtained from high-throughput data have been developed. They all rely on the same principle: First, gene scores are obtained from the high-throughput data. For instance in the case of micro-arrays, the scores could consist of the log-ratio of measured gene expression levels. Next, the scores are converted into node weights and the sub-network is extracted from the weighted graph. The weights cause the extraction algorithm to favor sub-networks containing high-scoring genes over others. Roughly speaking, we may distinguish “global” and “local” subgraph extraction strategies.

Tools following the global strategy incorporate nodes in such a way that the weight of the extracted sub-network is optimized, without the need of specifying seed nodes. To our knowledge, Ideker et al. (2002) were the first to propose this strategy, which they implemented using a simulated annealing based algorithm. A recently published tool, MetaPath, is based on the same principle (Liu and Pop, 2010). An interesting study by Dittrich et al. (2008) identifies sub-networks in protein–protein interaction networks by applying an algorithm that solves the prize-collecting Steiner tree problem exactly (Ljubic et al., 2006). These authors also present a statistic to aggregate multiple p-values on nodes derived from several experiments into one node weight and in addition report sub-optimal solutions in the order of their weight. The prize-collecting Steiner tree problem is a variant of the Steiner tree problem that does not consider seed nodes but instead searches for the maximum weight sub-network in a network where nodes have positive weights (prizes) and edges negative weights (costs). A prize-collecting Steiner tree solving algorithm has also been applied to detect novel pathways in integrated protein–protein and protein–DNA networks (Huang and Fraenkel, 2009).

In contrast to the global strategy, the local strategy connects specific nodes of interest (i.e. seed nodes) in the network. This strategy has been applied to protein–protein interaction networks (Scott et al., 2005) as well as to metabolic networks (Antonov et al., 2008, 2009; Noirel et al., 2008). In the former case, Scott and coworkers employ a Steiner tree exact solution (small seed node sets) and a heuristic (large seed node sets), whereas in the latter case, the authors rely on custom algorithms to solve the same problem.

The distinction between global and local strategies is somewhat artificial: In fact, the global strategy may be tuned by assigning weights such that specific nodes of interest are favored. In turn, the local strategy may be “globalized” by repeatedly running the extraction algorithm and then reporting the optimal solution among a set of solutions. Whether the global or local strategy is more appropriate depends on the size and the expected number of false positives in the gene/enzyme group of interest.

In comparison to the metabolic subgraph extraction approaches mentioned above, our approach offers the advantages of a well designed network representation, of an appropriate treatment of hub compounds (RPairs and weights), of being able to handle sets of seed nodes as well as being extensively evaluated on reference pathways. A major weakness of our approach is its lack of a method to compute network weights from high-throughput experiment scores. However, this weakness can be circumvented by first computing network weights independently with a statistically sound procedure (such as the one suggested in Dittrich et al. (2008)) and then applying our prediction approach using these weights.

All approaches discussed so far rely only on the network topology. Recently, a metabolic subgraph extraction approach was published that is based on the stoichiometric definition of a metabolic pathway (i.e. the idea that a valid metabolic pathway should stoichiometrically balance all its internal compounds). This approach combines stoichiometric and reaction directionality constraints (as in flux balance analysis) with a constraint with respect to expression data, such that the agreement between the activity of a reaction (its flux) and the expression values of its associated enzyme(s) is maximized (Shlomi et al., 2008; Zur et al., 2010). Since it does not consider seed nodes, this approach can be compared to the “global” topology-based subgraph extraction approaches. Whether this stoichiometry-based approach performs better than topological approaches remains to be evaluated.



10.3.2. Metabolic network reconstruction

Another application of metabolic pathway prediction is metabolic network reconstruction, whose goal is to decipher the metabolic network of an organism from its genome (see e.g. Duarte et al., 2007 as an example and Thiele and Palsson, 2010 for a reconstruction protocol). Currently, manual metabolic network reconstruction (in some cases combined with automated procedures as in Tier 2 BioCyc databases; Caspi et al., 2008) still outperforms entirely automated procedures (Ginsburg, 2009), making automated high-quality metabolic reconstruction an object of intensive research. Automated metabolic reconstruction approaches may be roughly divided into two categories: Pathway-based approaches rely on known metabolic pathways as template (Moriya et al., 2007; Karp et al., 2009), which may be further assembled into a network (DeJongh et al., 2007), whereas constraint-based approaches start from the genome without taking knowledge on pathways into account (Becker et al., 2007; Henry et al., 2010). All of these approaches may result in a reconstructed network with gaps, i.e. missing reactions that are expected to be present because of phenotype data or because they synthesize or degrade important “house-keeping” (primary) compounds. Various approaches have been suggested to fill these gaps, either on the level of the individual reaction (Green and Karp, 2004) or the entire network (Christian et al., 2009). Pathway prediction can fill gaps on the pathway level, by proposing alternative pathways that can carry out the missing function. In addition, pathway prediction may serve as an alternative starting point for pathway-based metabolic network reconstruction. Instead of starting with a set of “template” pathways, pathways can be predicted from known gene associations in the organism (operons/regulons, synteny, etc.), thereby integrating knowledge on gene regulation into the reconstruction procedure.

10.3.3. Other applications

Subgraph extraction could also be applied to predict biodegradation pathways. In contrast to other biodegradation prediction approaches (Jaworska et al., 2002; Pazos et al., 2005; Ellis et al., 2008), our pathway prediction approach accepts both compounds and reactions as input. Therefore, it can incorporate knowledge on intermediates as well as on participating enzymes.

Pathway prediction by subgraph extraction could also be useful in comparative metagenomics, where it could identify differentially abundant pathways from over- or under-represented orthologous gene groups.

10.4. Outlook

In the future, we plan to apply pathway prediction to other biological networks such as protein–protein and protein–gene interaction networks and ultimately to a network combining the former with metabolism. In addition, metabolic networks could be refined to include compartments and transporters. Another improvement concerns the usage of a compound hierarchy that would allow stereoisomers and generic compounds to be treated appropriately. For instance, if a generic compound (e.g. an amino acid) is provided, subgraph extraction could treat all its children (i.e. specific amino acids) as a seed node set. Furthermore, the integration of atom tracing (as in Pitkänen et al., 2009; Heath et al., 2010) with RPair annotation may increase the accuracy of pathways predicted for a set of seed reactions.

Acknowledgements

KF was supported by Actions de Recherches Concertées de la Communauté Française de Belgique (ARC grant number 04/09-307). The BiGRe Laboratory is supported by the Belgian Program on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office, project P6/25 (BioMaGNet). DC is funded by the MICROME Collaborative Project funded by the European Commission within its FP7 Programme, under the thematic area “BIO-INFORMATICS – Microbial genomics and bio-informatics” (contract number 222886-2).

References

Adler, P., Reimand, J., Jänes, J., Kolde, R., Peterson, H., Vilo, J., 2008. KEGGanim: pathway animations for high-throughput data. Bioinformatics 24, 588–590.

Antonov, A., Dietmann, S., Mewes, H., 2008. KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biology 9.

Antonov, A., Dietmann, S., Wong, P., Mewes, H., 2009. TICL—a web tool for network-based interpretation of compound lists inferred by high-throughput metabolomics. FEBS Journal 276, 2084–2094.

Arita, M., 2000. Metabolic reconstruction using shortest paths. Simulation Practice and Theory 8, 109–125.

Arita, M., 2003. In silico atomic tracing by substrate–product relationships in Escherichia coli intermediary metabolism. Genome Research 13, 2455–2466.

Arita, M., 2004. The metabolic world of Escherichia coli is not small. Proceedings of the National Academy of Sciences of the United States of America 101, 1543–1547.

Backes, C., Keller, A., Kuentzer, J., Kneissl, B., Comtesse, N., Elnakady, Y.A., Müller, R., Meese, E., Lenhof, H.-P., 2007. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Research 102, W186–W192.

Becker, S., Feist, A., Mo, M., Hannum, G., Palsson, B., Herrgard, M., 2007. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nature Protocol 2, 727–738.

Blum, T., Kohlbacher, O., 2008. Using atom mapping rules for an improved detection of relevant routes in weighted metabolic networks. Journal of Computational Biology 15, 565–576.

Boyer, F., Viari, A., 2003. Ab initio reconstruction of metabolic pathways. Bioinformatics 19, ii26–ii34.

Brohée, S., Faust, K., Lima-Mendez, G., Sand, O., Janky, R., Vanderstocken, G., Deville, Y., van Helden, J., 2008. NeAT: a toolbox for the analysis of biological networks, clusters, classes and pathways. Nucleic Acids Research 36, W444–W451.

Caspi, R., Foerster, H., Fulcher, C., Kaipa, P., Krummenacker, M., Latendresse, M., Paley, S., Rhee, S., Shearer, A., Tissier, C., Walk, T., Zhang, P., Karp, P., 2008. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research 36, D623–D631.

Centler, F., di Fenizio, P.S., Matsumaru, N., Dittrich, P., 2005. Chemical organizations in the central sugar metabolism of Escherichia coli. Modeling and Simulation in Science Engineering and Technology, Post-proceedings of ECMTB 2005.

Chou, C.-H., Chang, W.-C., Chiu, C.-M., Huang, C.-C., Huang, H.-D., 2009. FMM: a web server for metabolic pathway reconstruction and comparative analysis. Nucleic Acids Research 37, W129–W134.

Christian, N., May, P., Kempa, S., Handorf, T., Ebenhöh, O., 2009. An integrative approach towards completing genome-scale metabolic networks. Molecular BioSystems 5, 1889–1903.

Clarke, B., 1988. Stoichiometric network analysis. Cell Biophysics 12, 237–253.

Croes, D., Couche, F., Wodak, S., van Helden, J., 2005. Metabolic pathfinding: inferring relevant pathways in biochemical networks. Nucleic Acids Research 33, W326–W330.

Croes, D., Couche, F., Wodak, S., van Helden, J., 2006. Inferring meaningful pathways in weighted metabolic networks. Journal of Molecular Biology 356, 222–236.

Dahlquist, K.D., Salomonis, N., Vranizan, K., Lawlor, S., Conklin, B., 2002. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genetics 31, 19–20.

Davidsen, T., Beck, E., Ganapathy, A., Montgomery, R., Zafar, N., Yang, Q., Madupu, R., Goetz, P., Galinsky, K., White, O., Sutton, G., 2010. The comprehensive microbial resource. Nucleic Acids Research 38, D340–D345.

de Figueiredo, L., Podhorski, A., Rubio, A., Kaleta, C., Beasley, J., Schuster, S., Planes, F., 2009. Computing the shortest elementary flux modes in genome-scale metabolic networks. Bioinformatics 25, 3158–3165.

DeJongh, M., Formsma, K., Boillot, P., Gould, J., Rycenga, M., Best, A., 2007. Toward the automated generation of genome-scale metabolic networks in the SEED. BMC Bioinformatics 8, 139.

DeRisi, J., Iyer, V., Brown, P., 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686.

Dittrich, M., Klau, G., Rosenwald, A., Dandekar, T., Müller, T., 2008. Identifying functional modules in protein–protein interaction networks: an integrated exact approach. Bioinformatics 24, i223–i231.

Duarte, N., Becker, S., Jamshidi, N., Thiele, I., Mo, M., Vo, T., Srivas, R., Palsson, B.Ø., 2007. Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences of the United States of America 104, 1777–1782.

Dupont, P., Callut, J., Dooms, G., Monette, J.-N., Deville, Y., 2006. Relevant subgraph extraction from random walks in a graph. Research Report UCL/FSA/INGI RR 2006–07.

Ellis, L., Gao, J., Fenner, K., Wackett, L., 2008. The University of Minnesota pathway prediction system: predicting metabolic logic. Nucleic Acids Research 36, W427–W432.



Eppstein, D., 1999. Finding the k shortest paths. SIAM Journal on Computing 28, 652–673.

Faust, K., Croes, D., van Helden, J., 2009a. In response to “Can sugars be produced from fatty acids? A test case for pathway analysis tools”. Bioinformatics 25, 3202–3205.

Faust, K., Croes, D., van Helden, J., 2009b. Metabolic pathfinding using RPAIR annotation. Journal of Molecular Biology 388, 390–414.

Faust, K., Dupont, P., Callut, J., van Helden, J., 2010. Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics 26, 1211–1218.

Fell, D., Wagner, A., 2000. The small world of metabolism. Nature Metabolic Engineering 18, 1121–1122.

Forst, C., Schulten, K., 1999. Evolution of metabolisms: a new method for the comparison of metabolic pathways using genomics information. Journal of Computational Biology 6, 343–360.

Forst, C., Schulten, K., 2001. Phylogenetic analysis of metabolic pathways. Journal of Molecular Biology 52, 471–489.

Gagneur, J., Jackson, D., Casari, G., 2003. Hierarchical analysis of dependency in metabolic networks. Bioinformatics 19, 1027–1034.

Gama-Castro, S., Jiménez-Jacinto, V., Peralta-Gil, M., Santos-Zavaleta, A., Peñaloza-Spinola, M.I., Contreras-Moreira, B., Segura-Salazar, J., Muñiz-Rascado, L., Martínez-Flores, I., Salgado, H., Bonavides-Martínez, C., Abreu-Goodger, C., Rodríguez-Penagos, C., Miranda-Ríos, J., Morett, E., Merino, E., Huerta, A., Treviño-Quintanilla, L., Collado-Vides, J., 2008. RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Research 36.

Gerlee, P., Lizana, L., Sneppen, K., 2009. Pathway identification by network pruning in the metabolic network of Escherichia coli. Bioinformatics 25, 3282–3288.

Ginsburg, H., 2009. Caveat emptor: limitations of the automated reconstruction of metabolic pathways in Plasmodium. Trends in Parasitology 25, 37–43.

Goesmann, A., Haubrock, M., Meyer, F., Kalinowski, J., Giegerich, R., 2002. PathFinder: reconstruction and dynamic visualization of metabolic pathways. Bioinformatics 18, 124–129.

Green, M., Karp, P., 2004. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5, 76.

Guimerà, R., Amaral, L., 2005. Functional cartography of complex metabolic networks. Nature 433, 895–900.

Heath, A., Bennett, G., Kavraki, L., 2010. Finding Metabolic Pathways Using Atom Tracking. Bioinformatics 26, 1548–1555.

Henry, C., DeJongh, M., Best, A., Frybarger, P., Linsay, B., Stevens, R., 2010. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nature Biotechnology 2, 977–982.

Herráez, A., 2006. Biomolecules in the computer: Jmol to the rescue. Biochemistry and Molecular Biology Education 34, 255–261.

Huang, S., Fraenkel, E., 2009. Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Science Signaling 6053, 101–112.

Ideker, T., Ozier, O., Schwikowski, B., Siegel, A., 2002. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18, S233–S240.

Jaworska, J., Dimitrov, S., Nikolova, N., Mekenyan, O., 2002. Probabilistic assessment of biodegradability based on metabolic pathways: CATABOL system. SAR and QSAR in Environmental Research 13, 307–323.

Jeong, H., Tombor, B., Albert, R., Oltvai, Z., Barabási, A.-L., 2000. The large-scale organization of metabolic networks. Nature 407, 651–654.

Jimenez, V., Marzal, A., 1999. Computing the k shortest paths: a new algorithm and an experimental comparison. In: Lecture Notes in Computer Science—Proceedings of the 3rd International Workshop on Algorithm Engineering 1668, pp. 15–29.

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., 2008. KEGG for linking genomes to life and the environment. Nucleic Acids Research 36, D480–D484.

Karp, P., Paley, S., 1994. Representations of metabolic knowledge: pathways. Proceedings International Conference on Intelligent Systems for Molecular Biology 2, 203–211.

Karp, P., Paley, S., Krummenacker, M., Latendresse, M., Dale, J., Lee, T., Kaipa, P., Gilham, F., Spaulding, A., Popescu, L., Altman, T., Paulsen, I., Keseler, I., Caspi, R., 2009. Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology. Briefings in Bioinformatics 2, 40–79.

Karp, R., 1972. Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (Eds.), Complexity of Computer Computations. Plenum Press, pp. 85–103.

Klamt, S., Haus, U.-U., Theis, F., 2009. Hypergraphs and cellular networks. PLoS Computational Biology 5, 5.

Kotera, M., Hattori, M., Oh, M.-A., Yamamoto, R., Komeno, T., Yabuzaki, J., Tonomura, K., Goto, S., Kanehisa, M., 2004a. RPAIR: a reactant-pair database representing chemical changes in enzymatic reactions. Genome Informatics 15, P062.

Kotera, M., Okuno, Y., Hattori, M., Goto, S., Kanehisa, M., 2004b. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. Journal of the American Chemical Society 126, 16487–16498.

Küffner, R., Zimmer, R., Lengauer, T., 2000. Pathway analysis in metabolic databases via differential metabolic display. Bioinformatics 16, 825–836.

Lacroix, V., Cottret, L., Thébault, P., Sagot, M., 2008. An introduction to metabolic networks and their structural analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 5, 594–617.

Lima-Mendez, G., van Helden, J., 2009. The powerful law of the power law and other myths in network biology. Molecular BioSystems 5, 1482–1493.

Liu, B., Pop, M., 2010. Identifying differentially abundant metabolic pathways in metagenomic datasets. Lecture Notes in Computer Science: Bioinformatics Research and Applications 6053, 101–112.

Ljubic, I., Weiskircher, R., Pferschy, U., Klau, G.W., Mutzel, P., Fischetti, M., 2006. An algorithmic framework for the exact solution of the prize-collecting Steiner Tree problem. Mathematical Programming Series B 105, 427–449.

Ma, H., Zeng, A.-P., 2003. Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 19, 270–277.

Mithani, A., Preston, G.M., Hein, J., 2009. Hypergraph based tool for metabolic pathway prediction and network comparison. Bioinformatics 25, 1831–1832.

Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A., Kanehisa, M., 2007. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Research 35, W182–W185.

Nelson, D., Cox, M., 2005. Lehninger Principles of Biochemistry, fourth edition.

Noirel, J., Ow, S.Y., Sanguinetti, G., Jaramillo, A., Wright, P.C., 2008. Automated extraction of meaningful pathways from quantitative proteomics data. Briefings in Functional Genomics and Proteomics 7, 136–146.

Overbeek, R., Disz, T., Stevens, R., 2004. The seed: a peer-to-peer environment for genome annotation. Communications of the ACM 47, 47–51.

Paley, S.M., Karp, P.D., 2006. The pathway tools cellular overview diagram and Omics viewer. Nucleic Acids Research 34, 3771–3778.

Pazos, F., Guijas, D., Valencia, A., Lorenzo, V.D., 2005. Metarouter: bioinformatics for bioremediation. Nucleic Acids Research 35, D588–D592.

Pitkänen, E., Jouhten, P., Rousu, J., 2009. Inferring branching pathways in genome-scale metabolic networks. BMC Systems Biology 3, 103.

Pitkänen, E., Rantanen, A., Rousu, J., Ukkonen, E., 2005. Finding feasible pathways in metabolic networks. In: Proceedings of the 10th Panhellenic Conference on Informatics (PCI’2005), Lecture Notes in Computer Science, Springer, pp. 123–133.

Planes, F., Beasley, J., 2008. A critical examination of stoichiometric and path-finding approaches to metabolic pathways. Briefings in Bioinformatics 9, 422–436.

Rahman, S., Advani, P., Schunk, R., Schrader, R., Schomburg, D., 2004. Metabolic pathway analysis web service (pathway hunter tool at cubic). Bioinformatics 21, 1189–1193.

Schilling, C., Letscher, D., Palsson, B., 2000. Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective. Journal of Theoretical Biology 203, 229–248.

Schuster, S., Dandekar, T., Fell, D., 1999. Detection of elementary flux modes in biochemical networks: a promising tool for pathway analysis and metabolic engineering. TIBTECH 17, 53–60.

Schuster, S., de Figueiredo, L.F., Kaleta, C., 2010. Predicting novel pathways in genome-scale metabolic networks. Biochemical Society Transactions 38, 1202–1205.

Scott, M., Perkins, T., Bunnell, S., Pepin, F., Thomas, D., Hallett, M., 2005. Identifying regulatory subnetworks for a set of genes. Molecular and Cellular Proteomics 4 (5), 683–692.

Shlomi, T., Cabili, M., Herrgard, M., Palsson, B., Ruppin, E., 2008. Network-based prediction of human tissue-specific metabolism. Nature Biotechnology 26, 1003–1010.

Sirava, M., Schaefer, T., Eiglsperger, M., Kaufmann, M., Kohlbacher, O., Bornberg-Bauer, E., Lenhof, H., 2002. BioMiner—modeling, analyzing, and visualizing biochemical pathways and networks. Bioinformatics 18 (2), S219–S230.

Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., Mesirov, J., 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550.

Thiele, I., Palsson, B., 2010. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nature Protocol 5, 93–121.

van Helden, J., Gilbert, D., Wernisch, L., Schroeder, M., Wodak, S., 2001. Application of regulatory sequence analysis and metabolic network analysis to the interpretation of gene expression data. Lecture Notes in Computer Science 2066, 147–165.

van Helden, J., Naim, A., Mancuso, R., Eldridge, M., Wernisch, L., Gilbert, D., Wodak, S., 2000. Representing and analysing molecular and cellular function in the computer. Biological Chemistry 381, 921–935.

van Helden, J., Wernisch, L., Gilbert, D., Wodak, S., 2002. Graph-based analysis of metabolic networks. In: Ernst Schering Research Foundation Workshop, vol. 38, Springer-Verlag, pp. 245–274.

Voss, K., Heiner, M., Koch, I., 2003. Steady state analysis of metabolic pathways using Petri nets. In Silico Biology 3, 367–387.

Wagner, A., Fell, D., 2001. The small world inside large metabolic networks. Proceedings of the Royal Society of London Series B 268, 1803–1810.

Watts, D., Strogatz, S., 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442.

Yen, J., 1971. Finding the K shortest loopless paths in a network. Management Science 17, 712–716.

Zien, A., Küffner, R., Zimmer, R., Lengauer, T., 2000. Analysis of gene expression data with pathway scores. In: Proceedings of the International Conference of Intelligent Systems Molecular Biology, pp. 407–417.

Zur, H., Ruppin, E., Shlomi, T., 2010. iMAT: an integrative metabolic analysis tool. Bioinformatics 26, 3140–3142.
