studies in mining conserved subgraph among metabolic pathways

1
Greedy algorithm for obtaining Minimum Feedback vertex set MFVS delete degree 1/0 vertices from V and set remaining vertices to V’ MFVS← while V’ do pick up the set S of maximal degree vertices MFVS ← MFVS U S Delete degree 1/0 vertices from V’ Random degree-conserved graph generation by : Reshuffle nodes Reshuffle edges Randomized P-Value computation (P-Value cutoff : 0.01 for 100 randomized graphs) Qiong Cheng*([email protected]), Robert Harrison, Alexander Zelikovsky ([email protected]) Department of Computer Science, Georgia State University, Atlanta, GA 30303 * Partially supported by GSU Molecular Basis of Disease (MBD) and Brains & Behavior (B&B) Studies in mining conserved subgraph among metabolic pathways References : Q. Cheng, D. Kaur, R. Harrison, and A. Zelikovsky,"Mapping and Filling Metabolic Pathways ", RECOMB Satellite Conference on Systems Biology 2007 Q. Cheng, R. Harrison, and A. Zelikovsky,"Homomorphisms of Multisource Trees into Networks with Applications to Metabolic Pathways", Proc. of IEEE 7-th International Symposium on BioInformatics and BioEngineering (BIBE'07) Ron Y Pinter, Oleg Rokhlenko, Esti Yeger-Lotem, Michal Ziv- Ukelson: Alignment of metabolic pathways. Bioinformatics. LNCS 3109. Springer-Verlag.(Aug 2005)21(16): 3401-8 ….. Abstract Graph is a popular theoretic formalism for modeling multidimensional data such as metabolic pathways. Mining conserved subgraph frequently includes comparison of unknown networks with well-understood networks accounting for the similarity of the vertices as well as their corresponding topological structures. Traditional graph comparisons are classified as two types of problems: subgraph isomophism and subgraph homeomorphism. Both problems are known to be NP-complete. However different application domains on graph comparison require different problem formulations and the corresponding solutions. Based on the specific application in metabolic pathways, we focused on mapping of multi-source tree onto multi-source tree, onto DAG(directed acyclic graph) and further mapping among arbitrary graphs. Arising from innate biochemical reaction observation property that several enzymes can catalyze the same reaction and a single enzyme can catalyze several reactions, we formulated the problem to be homeomorphism embedding and allowed the occurrence of different vertices in query graph to be mapped to a vertex in well-known target graph. Our experimental study finds significant conserved pairs in pairwise mapping of all pathways for four organisms (E.coli, S.cerevisiae, B. subtilis and T. thermophilus species). Δ(v i , u j ) if v i is a leaf in T Δ(v i , u j ) + l=1 to Adj(vi) Min j’=1 to |V T | C(i l , j’) if v i is a leaf in T DT[i, j] = i<| V P’ | j<| V T | C[i l , j l ] = DT[i l , j l ] + (h(j, j l ) - 1) is penalty for gaps h(j, j l ) = #(hops between u j and u j l in T) Homeomorphism embedding of metabolic pathways Fig 1. A portion of pentose phosphate pathway 1.1.1.49 1.1.1.34 2.7.1.13 3.1.1.31 1.1.1.44 Metabolic pathway model a directed graph in which vertices correspond to enzymes and there is a directed edge between two enzymes if the product of the reaction catalyzed by the first enzyme is a substrate of the reaction catalyzed by the second. P T f Solution find minimum feedback vertex set MFVS of P construct a multi source tree P’ = <V p - MFVS, E(V p - MFVS)> for every possible fixed mapping f’ v : F(P) V T do obtain min cost homomorphism of P’ to T under mapping f’ v choose min cost homomorphism for all possible fixed mappings Optimal homeomorphism embedding algorithm Best homeomorphism embedding of multisource tree to arbitrary graph Preprocessing of text graph T by transitive closure Ordering of constructed multisource tree P’ : 1. DFS traversal of P’ 2. Processing order in opposite way with the DFS traversal Key: Each edge e i in P’ is the unique edge connecting v i with the previous vertices in the order Dynamic programming Build DP table - DP = V P’ × V T* Each row and column corresponds to a vertex of V P’ and V T* respectively; The columns u 1 , …, u |V T * | are sorted in arbitrary order; The rows v 1 , …, v |V P’ | are sorted in the special order Every item is the min cost homomorphism from P’ subgraph induced by previous vertices in the order into T * Filling DP table by recursive function Experiments results Computing P-Value of homomorphism mapping a b c d a b c d Reshuffl e edge All-against-all mappings among 4 species : Identifying conserved pathways 24 pathways that are conserved across all 4 species 18 more pathways that are conserved across at least three of these species Resolving ambiguity (see figure 2) Discovering pathways holes (see figure 3) 2.6.1.1 2.6.1.1 1.2.4.- 1.2.4.2 2.3.1.- 2.3.1.61 6.2.1.5 6.2.1.5 1.3.99.1 1.3.99.1 4.2.1.2 4.2.1.2 1. 1.1.1.82 1.1.1.82 Fig 2. Resolving ambiguity example: Mapping of glutamate degradation VII pathways from B.subtilis to T. thermophilus (p<0.01). The enlighted node reflects enzyme homology. 1.5.1.5 1.5.1.5 3.5.4.9 3.5.4.9 3.5.1. 10 6.3.4.3 6.3.4.3 Fig 3. Pathway holes’ example: Mapping of formaldehyde oxidation V pathway in B. subtilis to formy1THF biosynthesis pathway in E. coli (p<0.01) (only vertices in the image of the pattern in the text are shown. Georgia Tech Poster/Reception session at SIAM Data Mining (2008) Mapping metabolic pathways - should capture the similarities of enzymes represented by proteins as well as topological properties. We allow different enzymes to be mapped to the same enzyme. Graph Homeomorphism embedding Let pattern P =<V P , E P > and text T =<V T , E T > be directed graphs, a homomorphism mapping f : PG satisfies f v : V P V T : Different pattern vertices can be mapped to a single text vertex; f e : E P paths of T : An edge in the pattern can be mapped to a path in the text; Homeomorphism embedding cost = vertex-to-vertex cost + edge-to-path cost Edge-to-path cost = l (|f e (e)|-1) , which proportionally increase with the number of extra hops the images of edge Homomorphism cost = v in V P Δ(v, f v (v)) + e in E P (| f e (e)|-1) Problem Formulation Given: an arbitrary graph pattern P =<V P , E P > and an arbitrary graph text T =<V T ,E T > Find: min cost homomorphism f : P T minimize v in V P Δ(v, f v (v)) + e in E P (|f e (e)|-1) s.t. v in V P , f v (v) in V T e in E p , f e (e) in Path T Runtime Transitive closure takes O(|V T ||E T |) Pattern graph ordering takes O(|V P | + |E P |) Dynamic programming : - Calculate min contribution of all child pairs of node pair (v i ∈P,u j ∈T*) takes t ij = deg P (v i )deg G* (u j ) - Filling DT takes j=1 to |V G | i=1 to |V T | t ij = j=1 to |V G | deg G* (u j ) i=1 to |V T | deg T (v i ) = 2|E G* ||E T |The total runtime is O(|V G ||E G |+|V G* ||V T |). Runtime for obtaining optimal homeomorphism embedding O(|V T | (1+|MFVS|) (|V G ||E G |+|V G* ||V T |)) Let t(v) denote the number of “reasonable” text images of v, total runtime = O(vMFVS t(v) (|VG||EG|+|VG*||VT|)) Given: an undirected graph G=(V,E) and a nonnegative weight function w on V Find: a minimum weight subset of V whose removal leaves an acyclic graph. Minimum Feedback vertex set problem Previous work Mapping : Linear patternGraph (Kelly et al 2004) ( oV P V T ) ) Mapping : Tree Tree (Pinter [2005] o(|V T | 2 |V P |/log|V T |+|V T ||V P |log|V P |) ) Exhaustively search : Sharan et al 2005 ( V P V T ) Yang et al 2007 ( V T V T )

Upload: hermione-mendez

Post on 31-Dec-2015

16 views

Category:

Documents


1 download

DESCRIPTION

SIAM Data Mining. 2008. a. b. a. b. 1.3.99.1. 6.2.1.5. 1.1.1.82. 1. 4.2.1.2. 2.6.1.1. 1.5.1.5. 6.3.4.3. 1.1.1.49. 2.7.1.13. 2.6.1.1. 6.3.4.3. 1.3.99.1. 6.2.1.5. 1.1.1.82. 1.5.1.5. 4.2.1.2. DT [ i, j ] =. c. d. c. d. i

TRANSCRIPT

Page 1: Studies in mining conserved subgraph among metabolic pathways

Greedy algorithm for obtaining Minimum Feedback vertex set MFVS delete degree 1/0 vertices from V and set remaining vertices to V’ MFVS← while V’ do pick up the set S of maximal degree vertices MFVS ← MFVS U S Delete degree 1/0 vertices from V’

Random degree-conserved graph generation by :

Reshuffle nodes

Reshuffle edges

Randomized P-Value computation (P-Value cutoff : 0.01 for 100 randomized graphs)

Qiong Cheng*([email protected]), Robert Harrison, Alexander Zelikovsky ([email protected])

Department of Computer Science, Georgia State University, Atlanta, GA 30303

* Partially supported by GSU Molecular Basis of Disease (MBD) and Brains & Behavior (B&B)

Studies in mining conserved subgraph among metabolic pathways

References :

Q. Cheng, D. Kaur, R. Harrison, and A. Zelikovsky,"Mapping and Filling Metabolic Pathways ", RECOMB Satellite Conference on Systems Biology 2007  

Q. Cheng, R. Harrison, and A. Zelikovsky,"Homomorphisms of Multisource Trees into Networks with Applications to Metabolic Pathways", Proc. of IEEE 7-th International Symposium on BioInformatics and BioEngineering (BIBE'07)

Ron Y Pinter, Oleg Rokhlenko, Esti Yeger-Lotem, Michal Ziv-Ukelson: Alignment of metabolic pathways. Bioinformatics. LNCS 3109. Springer-Verlag.(Aug 2005)21(16): 3401-8

…..

AbstractGraph is a popular theoretic formalism for modeling multidimensional datasuch as metabolic pathways. Mining conserved subgraph frequently includescomparison of unknown networks with well-understood networks accountingfor the similarity of the vertices as well as their correspondingtopological structures.

Traditional graph comparisons are classified as two types of problems:subgraph isomophism and subgraph homeomorphism. Both problems are known tobe NP-complete. However different application domains on graph comparisonrequire different problem formulations and the corresponding solutions.

Based on the specific application in metabolic pathways, we focused onmapping of multi-source tree onto multi-source tree, onto DAG(directedacyclic graph) and further mapping among arbitrary graphs. Arising frominnate biochemical reaction observation property that several enzymes cancatalyze the same reaction and a single enzyme can catalyze severalreactions, we formulated the problem to be homeomorphism embedding andallowed the occurrence of different vertices in query graph to be mappedto a vertex in well-known target graph.

Our experimental study finds significant conserved pairs in pairwisemapping of all pathways for four organisms (E.coli, S.cerevisiae, B.subtilis and T. thermophilus species).

Δ(vi , uj) if vi is a leaf in T

Δ(vi, uj) + ∑l=1 to Adj(vi)Minj’=1 to |VT|C(il, j’) if vi is a leaf in T

DT[i, j] = i<|VP’|j<|VT|C[il, jl] = DT[il, jl] + (h(j, jl) - 1)is penalty for gapsh(j, jl) = #(hops between uj and ujl

in T)

Homeomorphism embedding of metabolic pathways

Fig 1. A portion of pentose phosphate pathway

1.1.1.49

1.1.1.34

2.7.1.13

3.1.1.31 1.1.1.44

Metabolic pathway model – a directed graph in which vertices correspond to enzymes and there is a directed edge between two enzymes if the product of the reaction catalyzed by the first enzyme is a substrate of the reaction catalyzed by the second.

PT

f

Solution

find minimum feedback vertex set MFVS of P construct a multi source tree P’ = <Vp- MFVS, E(Vp- MFVS)> for every possible fixed mapping f’v: F(P) VT do obtain min cost homomorphism of P’ to T under mapping f’v choose min cost homomorphism for all possible fixed mappings

Optimal homeomorphism embedding algorithm

Best homeomorphism embedding of multisource tree to arbitrary graph

Preprocessing of text graph T by transitive closure

Ordering of constructed multisource tree P’ :

1. DFS traversal of P’

2. Processing order in opposite way with the DFS traversal

Key: Each edge ei in P’ is the unique edge connecting vi with the previous vertices in the order

Dynamic programming

Build DP table - DP = VP’ × VT*

Each row and column corresponds to a vertex of VP’ and VT* respectively;

The columns u1, …, u|VT*| are sorted in arbitrary order;

The rows v1, …, v|VP’| are sorted in the special order

Every item is the min cost homomorphism from P’ subgraph induced by previous vertices in the order into T*

Filling DP table by recursive function

Experiments results

Computing P-Value of homomorphism mapping

a b

c d

a b

c d

Reshuffle edge

All-against-all mappings among 4 species :

Identifying conserved pathways

24 pathways that are conserved across all 4 species

18 more pathways that are conserved across at least three of these species

Resolving ambiguity (see figure 2)

Discovering pathways holes (see figure 3)

2.6.1.1

2.6.1.1

1.2.4.-

1.2.4.2

2.3.1.-

2.3.1.61

6.2.1.5

6.2.1.5

1.3.99.1

1.3.99.1

4.2.1.2

4.2.1.2

1.

1.1.1.82

1.1.1.82

Fig 2. Resolving ambiguity example: Mapping of glutamate degradation VII pathways from B.subtilis to T. thermophilus (p<0.01). The enlighted node reflects enzyme homology.

1.5.1.5

1.5.1.5

3.5.4.9

3.5.4.93.5.1.10 6.3.4.3

6.3.4.3

Fig 3. Pathway holes’ example: Mapping of formaldehyde oxidation V pathway in B. subtilis to formy1THF biosynthesis pathway in E. coli (p<0.01) (only vertices in the image of the pattern in the text are shown.

Georgia Tech Poster/Reception session at SIAM Data Mining (2008)

Mapping metabolic pathways - should capture the similarities of enzymes represented by proteins as well as topological properties.

We allow different enzymes to be mapped to the same enzyme.

Graph Homeomorphism embedding

Let pattern P =<VP, EP> and text T =<VT, ET> be directed graphs, a homomorphism mapping f : PG satisfies

fv : VPVT : Different pattern vertices can be mapped to a single text vertex;

fe : EPpaths of T : An edge in the pattern can be mapped to a path in the text;

Homeomorphism embedding cost = vertex-to-vertex cost + edge-to-path cost

Edge-to-path cost = l (|fe(e)|-1) , which proportionally increase with the number of extra hops the images of edge Homomorphism cost = v in VP

Δ(v, fv(v)) + e in EP (|fe(e)|-1)

Problem FormulationGiven: an arbitrary graph pattern P =<VP, EP> and an arbitrary graph text T =<VT,ET>Find: min cost homomorphism f : P T

minimizev in VP Δ(v, fv(v)) +e in EP (|fe(e)|-1)

s.t. v in VP, fv(v) in VTe in Ep ,fe(e) in PathT

Runtime

Transitive closure takes O(|VT||ET|)

Pattern graph ordering takes O(|VP| + |EP|)

Dynamic programming :

- Calculate min contribution of all child pairs of node pair (v i∈P,uj∈T*) takes tij = degP (vi)degG*(uj)

- Filling DT takes j=1 to |VG| i=1 to |VT|tij = j=1 to |VG| degG*(uj)i=1 to |VT|degT(vi) =

2|EG*||ET|The total runtime is O(|VG||EG|+|VG*||VT|).

Runtime for obtaining optimal homeomorphism embedding

O(|VT | (1+|MFVS|) (|VG||EG|+|VG*||VT|))

Let t(v) denote the number of “reasonable” text images of v, total runtime = O(∏vMFVSt(v) (|VG||EG|+|VG*||VT|))

Given: an undirected graph G=(V,E) and a nonnegative weight function w on VFind: a minimum weight subset of V whose removal leaves an acyclic graph.

Minimum Feedback vertex set problem

Previous work Mapping : Linear patternGraph (Kelly et al 2004) ( oVPVT) ) Mapping : Tree Tree (Pinter [2005] o(|VT|2|VP|/log|VT|+|VT||VP|log|VP|) ) Exhaustively search : Sharan et al 2005 ( VPVT) Yang et al 2007 ( VTVT)