a linear-time algorithm for computing inversion distance between signed permutations with an...

A Linear-Time Algorithm for Computing Inversion Distance between signed

Permutations with an experimental Study

David Bader, Bernard Moret, Mi Yan

Presented by Anat Heilper

Motivation

Evolutionary process on many single-chromosome organisms, consists mostly of inversions.

Due to that, phylogenies are reconstructed based on gene order, using inversion distance as a measure of evolutionary distance between 2 genomes.

Model assumptions

Each genome is an ordering (circular/linear) of a fixed set of genes {g1,g2,…gn}.

Each gene is given with an orientation: positive(gi), or negative(-gi).

Inversion

Let G be the genome with the signed ordering(linear or circular) g1, g2,…gn.

Inversion(i, j), i j: g1, g2,…, gi-1, gi, gi+1,…,gj, gj+1,…, gn

g1, g2,…, gi-1, -gj, -gj-1,…, -gi , gj+1,…,gn

What happens if j<i?

Circular case: rotate the circular ordering until the proper relationship between the indices is received.

Linear case: not applicable.

Inversion

Inversion Distance

Minimum number of inversions needed to transform one permutation into the other.

Computing Inversion Distance between unsigned permutations is NP-hard.

Inversion Distance Sorting

Computing the shortest sequence of inversions can be also treated as a sorting problem.

Also Known as “sorting by reversals”.

Previous work:Hannenhalli and Pevsner, 1995 – first polynomial-time algorithm.

Berman and Hannenhalli, 1996 – runs in O(n(n)), where (n) is the inverse Ackerman function

Bader, Moret, and Yan, 2000 – runs in linear-time algorithm.

What we shall see today?

Present a simple and practical, linear time algorithm to compute the connected components of the overlap graph.

Present experimental evidence that this algorithm is really efficient.

Transformation

Given a signed permutation of {1…n}, we transform it into an unsigned permutation of {0… 2n+1}:

Substitute each positive element x by the ordered pair (2x-1, 2x).

Substitute each negative element –x by the ordered pair (2x,2x-1).

(0) = 0, (2n+1) = 2n+1.

Example

+3 +9 -7 +5 -10 +8 +4 -6 +11 +2 +1

5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

X (2x-1, 2x)-X(2x,2x-1)(0) = 0, (2n+1) = 2n+1

AssumptionsBoth permutations have been turned into unsigned permutations.

Both unsigned permutations are transformed so that the first permutation becomes the identity permutation (0,1,..2n,2n+1).

This is why this problem can be viewed as sorting: find the number of inversions needed to transform the given permutation into the identity permutation.

Cycle graph We represent the unsigned permutation by an edge-colored graph called the “cycle graph”.

Properties of the graph:2n+2 vertices.

For each i, 0 i n, there’s a gray edge between vertices (2i) and (2i+1).

There is a black edge between vertices 2i and 2i+1.

Cycle graph

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

i

(i)

The resulting graph consists of disjoint cycles in which edges alternate colors.

Overlapping edges2 gray edges ((i) , (j)) and ((k), (t)) overlap if the 2 intervals [i,j] and [k,t] overlap but neither one contains the other.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

i

(i)

Overlapping cycles2 cycles C1 and C2 overlap if there exist overlapping gray edges e1 C1 and e2C2.

Extent of a cycle C: is the interval [C.B, C.E] where C.B = min {i| i C}, B stands for the beginning of the cycle,

and C.E = max{i| i C}, E stands for the ending of the cycle.

Extent of a set of cycles {c1…ck} is [B,E] where B = min {Ci.B| 0 i k} and E = max{Ci.E| 0 i k}.

Overlap Graph of permutation One vertex for each cycle in the cycle graph.

An edge between any 2 vertices that correspond to overlapping cycles.

Overlap graph

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

i

(i)

a

b

c

d

e f

One vertex for each cycle in the cycle graph.

Overlap grapha

b

c

d

e f

a fb d c

e

An edge between any 2 vertices that correspond to overlapping cycles.

By now we know how to:Transform a signed permutation to an unsigned permutation.

Create an overlap graph from the cycle graph.

Compute the minimum number of inversion and inversion sequence from the overlap graph.

The bottleneck of the algorithm is building the overlap graph(o(n^2)).

How can we improve the algorithm?Build a much smaller graph which captures the same information as the overlap graph.

Building the overlap forest

Creating the overlap forest in two scans of the permutation:

First scan: Build a trivial forest F0 in which each node is its own forest.

Second scan: Iterative refinement of the forest, at each iteration we expand the domain where we search for overlapping cycles with one node. (2n+2 iterations overall).

Active tree: a tree rooted at f is active at stage j whenever j lies properly within the extent of f; the extent of the active trees is stores in a stack

Let Fj-1 be the forest constructed from processing the elements 0 through j-1. Let f be the cycle containing element j of the permutation. Build Fj from Fj-1 as follows:

If j is the beginning of its own cycle f then it is the root of a single node tree.

Otherwise if cycle f overlaps with cycle g, add an edge (g,f) and compute the combined extent of g and of the tree rooted at f.

Building the overlap forest – refinement stage

Building the overlap graph1) Scan the permutation, label each position i with

C[i].B, and set up [C[i].B, C[i].E].2) Initialize empty stack.3) For i 0 to 2n+1

a) If (i == C[i].B) then push C[i].b) Extent C[i].

While(top.B > C[i].B){Extent.B min{extent.B, top.B}Extent.E max{ extent.E, top.E}parent[top.B] C[i].BPop top

End while}Top.B min{extent.B, top.B}Top.E max{extent.E, top.E}

c) if (i == top.E) then pop top.

i is the beginning of its own cycle.

Top is the top

element of the stack

The cycle i belongs to intersection with

another cycle

The tree at the top of the stack isn’t active anymore

Example of building an overlap forest

i

(i)

C[i].B

C[i].E

C[i]

First stage: setup a

b

c

d

e f

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23

A A B B C C D D E E C C B B D D E E F F A A F F

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23


i

(i)

C[i].B

C[i].E

C[i]

i = 0:

a)i = C[0].B push A

b)extent A

while(0=top.B>C[i].B=0)?NO

C[0].B 0

C[0].E21

c) if(0==21)?NO

A

A

extent

topFor i 0 to 2n+1

a) If (i == C[i].B) then push C[i].

b) Extent C[i].While (top.B > C[i].B){

Extent.B min{extent.B, top.B}

Extent.E max{ extent.E, top.E}

parent[top.B] C[i].B

Pop top

End while }

Top.B min{extent.B, top.B}

Top.E max{extent.E, top.E}

c) If(i==top.E) then pop top.

A

i

(i)

C[i].B

C[i].E

C[i]

A

extent

topi = 1:

a)1=i C[1].B=0

b) extent A

while(0=top.B>C[i].B=0)

C[1].B 0

C[1].E21

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23


For i 0 to 2n+1






Pop top

End while }




B

A

i

(i)

C[i].B

C[i].E

C[i]

B

extent

i = 2:

a)i = C[2].B push B

b) extent B


C[2].B 2

C[2].E13

i = 3:

very similar to i = 2.

top

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23


For i 0 to 2n+1






Pop top

End while }




C

B

A

For i 0 to 2n+1






Pop top

End while }




C

extent

i = 4:

a)i = C[4].B push C

b) extent C


C[4].B 4

C[4].E11

i = 5:

very similar to the case of i = 4.

i

(i)

C[i].B

C[i].E

C[i]

top

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23


D

C

B

A

For i 0 to 2n+1






Pop top

End while }




D

extent

i = 6:

a)i = C[6].B push D

b) extent D


C[6].B 6

C[6].E15

i = 7:


i

(i)

C[i].B

C[i].E

C[i]

top

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23


E

D

C

B

A

For i 0 to 2n+1






Pop top

End while }




E

extent

i = 8:

a)i = C[8].B push E

b) extent E


C[8].B 8

C[8].E17

i = 9:


i

(i)

C[i].B

C[i].E

C[i]

top

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23


E

D

C

B

A

For i 0 to 2n+1






Pop top

End while }




C

extent

i = 10:

a)i C[10].B

b) extent C


extent.B min{4,8}

extent.Emax{11, 17}

parent[ 8] 4

pop top

while(top.B=6>C[i].B =4)

extent.B min{4,6}

extent.Emax{17,15}

parent[ 6] 4

pop top

i

(i)

C[i].B

C[i].E

C[i]

top

D

C

B

A

top

c

e

c

e d

[4,17]

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 11 11 15 15 17 17 11 11 13 13 15 15 17 17 23 23 21 21 23 23


C

B

A

For i 0 to 2n+1






Pop top

End while }




E

extent

i = 10 continue:

while(top.B = 4>C[i].B =4)?NO

top.B min{4,4}

top.Emax{17,11}

c) )if (10 == 17)? no

i

(i)

C[i].B

C[i].E

C[i]

top

c

e d

[4,17]

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 17 17 15 15 17 17 17 17 13 13 15 15 17 17 23 23 21 21 23 23


C

B

A

For i 0 to 2n+1






Pop top

End while }




C

extent

i = 11 :

a)11=i C[11].B=4

b)extent C

while(4=top.B>C[i].B= 4)

top.bmin{4,4}

top.Emax{17,17}

c) )if (11 == 17)? NO

i

(i)

C[i].B

C[i].E

C[i]

top

c

e d

[4,17]

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 13 13 17 17 15 15 17 17 17 17 13 13 15 15 17 17 23 23 21 21 23 23


C

B

A

For i 0 to 2n+1






Pop top

End while }




B

extent

i

(i)

C[i].B

C[i].E

C[i]

top

c

e d[2,17]

i = 12 :

a)12=i C[12].B=2

b)extent B


Extent.B min{2, 4}

extent.E max{13, 17}

Parent[4]2

Pop top


top.B min{2,2}

top.E max{17,13}

c) if(12==17)?NO

b

B

A

top

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 17 17 17 17 15 15 17 17 17 17 17 17 15 15 17 17 23 23 21 21 23 23


For i 0 to 2n+1






Pop top

End while }




B

extent

i

(i)

C[i].B

C[i].E

C[i]

c

e d

i = 13 :

a)13=i C[13].B=2?NO

b)extent B


top.B min{2,2}

top.E max{17,17}

c) if(13==17)?NO

bB

A

top

[2,17]

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 17 17 17 17 15 15 17 17 17 17 17 17 15 15 17 17 23 23 21 21 23 23


For i 0 to 2n+1






Pop top

End while }




D

extent

i

(i)

C[i].B

C[i].E

C[i]

c

e d

i = 14 :

a)14=i C[14].B=6?NO

b)extent D

while(6=top.B>C[i].B= 6)?NO

top.B min{6,2}

top.E max{15,17}

c) if(14==17)?NO

i = 15..16:

similar to i=14

i=17: pop top

bB

A

top

[2,17]

Atop

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 17 17 17 17 15 15 17 17 17 17 17 17 15 15 17 17 23 23 21 21 23 23


For i 0 to 2n+1






Pop top

End while }




i

(i)

C[i].B

C[i].E

C[i]

i = 18:

a)i = C[18].B push F

b)extent F


C[18].B 18

C[18].E23

c) if(18==23)?NO

i=19:

very similar to i=18

F

extent

c

e d

bF

A

top

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 17 17 17 17 15 15 17 17 17 17 17 17 15 15 17 17 23 23 21 21 23 23


For i 0 to 2n+1






Pop top

End while }




i

(i)

C[i].B

C[i].E

C[i]

i = 20:

a)i C[20].B

b)extent A


extent.B min{0,18}

extent.Emax{21, 23}

parent[18] 0

Pop top

i=21:

i==top.Epop top

A

extent

c

e d

bF

A

top

F

A

Atop

top

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0 5 6 17 18 14 13 9 10 20 19 15 16 7 8 12 11 21 22 3 4 1 2 23

0 0 2 2 4 4 6 6 8 8 4 4 2 2 6 6 8 8 18 18 0 0 18 18

21 21 17 17 17 17 15 15 17 17 17 17 17 17 15 15 17 17 23 23 21 21 23 23


Lemma 1At iteration i of step (3) of the algorithm, if the tree rooted at top is active and i lies on cycle f and we have f.B < top.B, h In the tree rooted at top such that h overlaps with f.Proof: top is active it must have been pushed onto the stack before the current iteration (top.B <i) and we didn’t reach the end of top’s extent yet (i < top.E). i must be contained in top’s extent (top.B<i<top.E). Since i lies on the cycle f that begins before top (f.B<top.B), there must be an edge from cycle f that overlaps with top

Theorem

The algorithm produces a forest

in which each tree is composed of

exactly those nodes that form

a connected component.

Proof

after each iteration, the trees in the forest correspond exactly to the connected components determined by the permutation values scanned up to that point (induction on the number applications of step 3 of the algorithm:Base: each tree of F0 has a single node and no two nodes belong to the same connected component.Step: assume invariant after (i-1) iterations and let i lie on cycle f. We prove that the nodes of the tree containing i form the same set as the nodes of the connected component containing i (other connected components are unaffected and so still obey the invariant).

Proof A node in the tree containing i must be in the same connected component as I.

i = f.B: then, nothing changes in the overlap graph ( and thus in the connected components);

from step(3): the forest remains unchanged, so that the invariant is preserved.

i>f.B: ( top,f) will be added to the forest whenever f.B<top.B holds. (top,f) will join the sub tree rooted at f with that rooted at top into a single sub tree.

f.B<top.B h in the tree rooted at top such that h and f overlap (lemma 1) (h,f) must belong to the overlap graph (h,f) connecting the connected component containing f with that containing top merging them into a single connected component the invariant is preserved.

Whenever (j,i) and (k,l) with j<k<i<l, are gray edges on cycles f and h, respectively, then edge (f,h) must belong to the overlap graph. In such a case, our algorithm ensures that edge (h,f) belong to the overlap forest.

Proof A node in the same connected component as I must be in the tree containing i.

j k i l

h

f

Complexity:1) Setup and Initialize empty stack.2) For i 0 to 2n+1

a) If (i == C[i].B) then push C[i].b) Extent C[i].

While(top.B > C[i].B){Extent.B min{extent.B, top.B}Extent.E max{ extent.E, top.E}parent[top.B] C[i].BPop top

End while} Top.B min{extent.B, top.B} Top.E max{extent.E, top.E}

c) if (i == top.E) then pop top.

Linear time

Each cycle is inserted exactly once to the

stack, and is poped after adding an edge to the graph, or after passing the end of it. Thus also

linear time

Every step of the algorithm takes

linear time,so the entire algorithm,

runs in worst case linear time.

Experimental setupSigned permutations of length 10, 20, 40, 60, 80, 160, 320, and 640:

For each length, 10 groups of 3 signed permutations were generated from the identity permutation using Nadeau and Taylor’s model

5 evolutionary rates were used: 4, 16, 64, 236, and 1024 inversions per edge.

For each length, 10 groups of 3 permutations were also generated and used as an extreme test case.

For each of these test suites, the 3 distances among the 3 genomes in each group were computed 20,000 times in tight loop. An average and standard deviation were computed over the 10 groups.

Computed inversion distance are expected to be at most twice the evolutionary rate, since there are 2 edges between each pair of genomes.

Experimental setup

Run time of connected components computation as a function of the permutation size

Total run time as a function of the permutation size

Inversion distance as function of the permutation size

Speed comparison of Bader linear time algorithm and that of the UF approach(only connected components)

Speed comparison of Bader’s linear-time algorithm and that of the UF algorithm

summaryPresented a simple, practical, linear time algorithm for computing inversion distance between two signed permutations, with a detailed experimental study.

a linear-time algorithm for computing inversion distance between signed permutations with an...

Documents