tree edit distance analysisstelo/cpm/cpm03/touzet.pdf · 2003. 8. 26. · tree edit distance...

Tree edit distance analysis

Serge Dulucq (LABRI)

Hélène Touzet (LIFL)

CPM 2003

Why compare trees?

¤ : base pair (stem)

• : unpaired base (loop)

Tree representation for the secondary structure of tRNA


Tree Edit Distance Problem

. A pair of ordered rooted trees (A,B)

. Costs for edit operations

• sub: substituting a node

• ins: inserting a node

• del: deleting a node

. Distance: minimal cost to transform A into B

String edit distance I

Leftmost decomposition: comparison of all pairs of suffixes of ALGORITHM and FLOWER

. substitution: A against F, then LGORITHM against LOWER

. insertion: - against F, then ALGORITHM against LOWER

. deletion: A against -, then LGORITHM against FLOWER

String edit distance II

Rightmost decomposition: comparison of all pairs of prefixes of ALGORITHM and FLOWER

. substitution: M against R, then ALGORITH against FLOWE

. insertion: - against R, then ALGORITHM against FLOWE

. deletion: M against -, then ALGORITH against FLOWER
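As a minimal sketch of the two string decompositions, the Python function below computes an edit distance by peeling one character from the left (comparing suffixes) or from the right (comparing prefixes). The name `edit_distance`, the `side` parameter and the unit costs are assumptions made for illustration; the slides do not fix a cost model.

```python
from functools import lru_cache

def edit_distance(a, b, side="left"):
    """Unit-cost string edit distance, decomposing on one end of the strings."""
    @lru_cache(maxsize=None)
    def d(x, y):
        if not x or not y:
            return len(x) + len(y)        # delete or insert whatever remains
        if side == "left":                # leftmost: subproblems are suffix pairs
            cx, rx, cy, ry = x[0], x[1:], y[0], y[1:]
        else:                             # rightmost: subproblems are prefix pairs
            cx, rx, cy, ry = x[-1], x[:-1], y[-1], y[:-1]
        return min(
            d(rx, ry) + (cx != cy),       # substitution (free when equal)
            d(x, ry) + 1,                 # insertion of cy
            d(rx, y) + 1,                 # deletion of cx
        )
    return d(a, b)

left = edit_distance("ALGORITHM", "FLOWER", "left")
right = edit_distance("ALGORITHM", "FLOWER", "right")
```

Both calls return the same distance; only the family of subproblems that gets memoized differs.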

Tree edit distance

$$
\mathrm{Distance}(l(f),\, l'(f')) = \min
\begin{cases}
\mathrm{del}(l) + \mathrm{Distance}(f,\, l'(f')) & \text{deletion of } l \\
\mathrm{ins}(l') + \mathrm{Distance}(l(f),\, f') & \text{insertion of } l' \\
\mathrm{sub}(l, l') + \mathrm{Distance}(f,\, f') & \text{substitution of } l \text{ into } l'
\end{cases}
$$

Forest edit distance I

Leftmost decomposition

$$
\mathrm{Distance}(l(f) \circ t,\; l'(f') \circ t') = \min
\begin{cases}
\mathrm{Distance}(f \circ t,\; l'(f') \circ t') + \mathrm{del}(l) & \text{deletion of } l \\
\mathrm{Distance}(l(f) \circ t,\; f' \circ t') + \mathrm{ins}(l') & \text{insertion of } l' \\
\mathrm{Distance}(l(f),\; l'(f')) + \mathrm{Distance}(t,\; t') & \text{substitution of } l \text{ into } l'
\end{cases}
$$

Forest edit distance II

Rightmost decomposition

$$
\mathrm{Distance}(t \circ l(f),\; t' \circ l'(f')) = \min
\begin{cases}
\mathrm{Distance}(t \circ f,\; t' \circ l'(f')) + \mathrm{del}(l) & \text{deletion of } l \\
\mathrm{Distance}(t \circ l(f),\; t' \circ f') + \mathrm{ins}(l') & \text{insertion of } l' \\
\mathrm{Distance}(l(f),\; l'(f')) + \mathrm{Distance}(t,\; t') & \text{substitution of } l \text{ into } l'
\end{cases}
$$
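The tree and forest recurrences can be sketched together as one memoized function. The encoding is an assumption made for this example (a tree is a pair `(label, children)` and a forest is a tuple of trees), as are the names `size` and `distance`, the `side` parameter and the unit costs.

```python
from functools import lru_cache

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def distance(F, G, side="left"):
    """Unit-cost edit distance between forests F and G."""
    if not F or not G:
        return size(F) + size(G)          # delete or insert whatever remains
    k = 0 if side == "left" else -1       # index of the root being decomposed
    (l, f), (m, g) = F[k], G[k]
    t = F[1:] if side == "left" else F[:-1]
    u = G[1:] if side == "left" else G[:-1]
    F_open = f + t if side == "left" else t + f   # F after deleting l
    G_open = g + u if side == "left" else u + g   # G after deleting l'
    if not t and not u:
        # Two single trees: sub(l, l') + Distance(f, f')
        match = (l != m) + distance(f, g, side)
    else:
        # Distance(l(f), l'(f')) + Distance(t, t')
        match = distance((F[k],), (G[k],), side) + distance(t, u, side)
    return min(1 + distance(F_open, G, side),     # deletion of l
               1 + distance(F, G_open, side),     # insertion of l'
               match)

# Two small trees chosen for illustration: 1(2, 3, 4) and 1(2, 3(4)).
A = ("1", (("2", ()), ("3", ()), ("4", ())))
B = ("1", (("2", ()), ("3", (("4", ()),))))
d_left = distance((A,), (B,), "left")
d_right = distance((A,), (B,), "right")
```

The two directions agree on the distance. What changes is which pairs of subforests are visited, and that is the quantity the rest of the talk counts.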

Left/right for trees?

[Figure: rightmost decomposition of an example tree into subforests]

[Figure: leftmost decomposition of the example tree into subforests]

Decomposition strategies

. Succession of choices left or right

. S : forest × forest → {left, right}

. Zhang & Shasha (1989): (f, g) → left

. Klein (1998): (f, g) → right when the first node of f is the heaviest child, left otherwise
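A strategy in this sense can be passed around as an ordinary function. The sketch below runs the forest recursion under a given strategy and counts the distinct pairs of non-empty subforests it visits. The tree encoding (`(label, children)` pairs with unique labels), the names `count_pairs`, `always_left` and `always_right`, and the two example trees are assumptions made for illustration. A heaviest-child rule in the style of Klein would be one more callable with the same signature.

```python
def count_pairs(A, B, strategy):
    """Count distinct pairs of non-empty forests visited by the recursion."""
    seen = set()
    stack = [((A,), (B,))]
    while stack:
        F, G = stack.pop()
        if not F or not G or (F, G) in seen:
            continue                      # pairs with an empty forest are trivial
        seen.add((F, G))
        k = 0 if strategy(F, G) == "left" else -1
        (_, f), (_, g) = F[k], G[k]
        if k == 0:
            t, u, F_open, G_open = F[1:], G[1:], f + F[1:], g + G[1:]
        else:
            t, u, F_open, G_open = F[:-1], G[:-1], F[:-1] + f, G[:-1] + g
        stack.append((F_open, G))         # deletion of l
        stack.append((F, G_open))         # insertion of l'
        if t or u:                        # substitution, forest case
            stack.append(((F[k],), (G[k],)))
            stack.append((t, u))
        else:                             # substitution, tree case
            stack.append((f, g))
    return len(seen)

def always_left(F, G):                    # the Zhang & Shasha choice
    return "left"

def always_right(F, G):
    return "right"

A = ("a1", (("a2", ()), ("a3", ()), ("a4", ())))
B = ("b1", (("b2", ()), ("b3", (("b4", ()),))))
pairs_left = count_pairs(A, B, always_left)
pairs_right = count_pairs(A, B, always_right)
```

Labels have to be unique because forests are compared as plain nested tuples; two different subforests with the same shape and labels would otherwise be merged in `seen`.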

Klein strategy

[Figure: subforests of an example tree produced by the Klein strategy]

How to build an economical strategy?

[Figure: the forest l(f) ∘ t and its leftmost decomposition; labels: "free of constraints", "leftmost decompositions"]

$$
\#\mathrm{subforest}(l(f) \circ t) = \#\mathrm{subforest}(l(f)) + |l(f)| + \#\mathrm{subforest}(t)
$$


Cover strategies are economical strategies

For a tree A, define a cover φ for A as

. φ(i) ∈ {left, right} if the degree of i is 0 or 1: direction

. φ(i) is a child of i otherwise: favorite child

[Figure labels: leftmost decompositions, rightmost decompositions]

Zhang & Shasha and Klein are cover strategies.

Number of subforests for one tree

A = l(A1 ∘ . . . ∘ An)

. Lower bound (no assumption on the strategy), where A_j is the heaviest child: O(n log(n))

$$
\#\mathrm{subforest}(A) \ge |A| - |A_j| + \#\mathrm{subforest}(A_1) + \cdots + \#\mathrm{subforest}(A_n)
$$

. Upper bound (no assumption on the strategy)

$$
\#\mathrm{subforest}(A) \le \frac{n(n+3)}{2} - \sum_{i \in A} |A(i)|
$$

$$
\frac{1}{2} n^2 + \frac{\sqrt{\pi}}{2} n^{3/2} + O(n) \quad \text{on average}
$$

. Exact number for a cover strategy, where A_j is the favorite child

$$
\#\mathrm{subforest}(A) = |A| - |A_j| + \#\mathrm{subforest}(A_1) + \cdots + \#\mathrm{subforest}(A_n)
$$
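The counting formulas translate directly into short recursive functions. In the sketch below the tree encoding (`(label, children)` pairs), the function names and the two example trees are assumptions. The favorite child is supplied as a function returning a child index, which is one possible way to represent that part of a cover.

```python
def size(tree):
    """Number of nodes |A| of a tree."""
    return 1 + sum(size(child) for child in tree[1])

def count_cover(tree, favorite):
    """|A| - |A_j| + sum of #subforest(A_i), with A_j the favorite child."""
    children = tree[1]
    if not children:
        return 1                          # a single node is its only subforest
    j = favorite(tree)                    # index of the favorite child
    return (size(tree) - size(children[j])
            + sum(count_cover(child, favorite) for child in children))

def upper_bound(tree):
    """n(n+3)/2 minus the sum of all subtree sizes |A(i)|."""
    def subtree_sizes(t):
        yield size(t)
        for child in t[1]:
            yield from subtree_sizes(child)
    n = size(tree)
    return n * (n + 3) // 2 - sum(subtree_sizes(tree))

def heaviest(tree):
    """Index of the largest child: gives the lower bound."""
    children = tree[1]
    return max(range(len(children)), key=lambda i: size(children[i]))

def first(tree):
    return 0

def last(tree):
    return len(tree[1]) - 1

star = ("1", (("2", ()), ("3", ()), ("4", ())))       # root with three leaves
bent = ("1", (("2", ()), ("3", (("4", ()),))))       # root with a leaf and a 2-node child
```

Choosing the heaviest child at every node gives the lower bound, while `first` and `last` correspond to decomposing always on one side.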

Example

[Figure: example trees A and B on nodes 1 to 4, with the subforests of A]

#left(B) = 4 − 1 + 1 + 2 = 6

#right(B) = 4 − 2 + 1 + 2 = 5

#special(B) = 4 ∗ 7/2 − 4 − 1 − 2 − 1 = 6

4 − 1 + 1 + 1 + 1 = 6 subforests for A

What happens to the other tree B?

There are only three possibilities for the subforests of the cover tree A:

. being compared with all leftmost forests of B

. being compared with all rightmost forests of B

. being compared with all forests of B

#left(B): number of leftmost subforests of B

#right(B): number of rightmost subforests of B

#special(B): number of subforests of B

#left(B), #right(B), #special(B) are known
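The three counts can also be checked by brute force, by enumerating the subforests reached when roots are removed on the left only, on the right only, or on either side. Two things in this sketch are assumptions inferred from the arithmetic of the example rather than stated in the transcript: the shape B = 1(2, 3(4)), and the reading that the leftmost subforests are those left over by right-hand removals (and symmetrically for the rightmost ones). The name `reachable_forests` is hypothetical.

```python
def reachable_forests(tree, sides):
    """Non-empty subforests reached by removing roots on the given sides."""
    seen, stack = set(), [(tree,)]
    while stack:
        F = stack.pop()
        if not F or F in seen:
            continue
        seen.add(F)
        if "left" in sides:
            (_, f), t = F[0], F[1:]
            stack += [f + t, F[:1], t]    # open the leftmost tree, or split it off
        if "right" in sides:
            (_, f), t = F[-1], F[:-1]
            stack += [t + f, F[-1:], t]   # open the rightmost tree, or split it off
    return seen

# Assumed shape of B: root 1 with a leaf 2 and a child 3 that has a leaf 4.
B = ("1", (("2", ()), ("3", (("4", ()),))))
n_right = len(reachable_forests(B, {"left"}))             # roots removed on the left
n_left = len(reachable_forests(B, {"right"}))             # roots removed on the right
n_special = len(reachable_forests(B, {"left", "right"}))  # either side
```

Under these assumptions the three values agree with the example's #right(B) = 5, #left(B) = 6 and #special(B) = 6.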


The favorite children inherit subforests from their parent.

4 kinds of nodes :

. Free: nodes that do not receive anything

. Left : nodes that inherit leftmost forests of B

. Right : nodes that inherit rightmost forests of B

. All: nodes that inherit all subforests of B

The status of a node depends on its direction and on what it inherits.

1. A is reduced to a node with direction right

$$
\mathrm{Free}(A) = \mathrm{Left}(A) = \#\mathrm{left}(B)
\qquad
\mathrm{All}(A) = \mathrm{Right}(A) = \#\mathrm{special}(B)
$$

2. A is reduced to a node with direction left

$$
\mathrm{Free}(A) = \mathrm{Right}(A) = \#\mathrm{right}(B)
\qquad
\mathrm{All}(A) = \mathrm{Left}(A) = \#\mathrm{special}(B)
$$

3. A = l(A′) and the direction of l is right

$$
\mathrm{Free}(A) = \mathrm{Left}(A) = \#\mathrm{left}(B) + \mathrm{Right}(A')
\qquad
\mathrm{All}(A) = \mathrm{Right}(A) = \#\mathrm{special}(B) + \mathrm{All}(A')
$$

4. A = l(A′) and the direction of l is left

$$
\mathrm{Free}(A) = \mathrm{Right}(A) = \#\mathrm{right}(B) + \mathrm{Left}(A')
\qquad
\mathrm{All}(A) = \mathrm{Left}(A) = \#\mathrm{special}(B) + \mathrm{All}(A')
$$

5. A = l(A1 ∘ · · · ∘ An) and the favorite child is A1

$$
\mathrm{Free}(A) = \mathrm{Left}(A) = \sum_{i>1} \mathrm{Free}(A_i) + \mathrm{Left}(A_1) + \#\mathrm{left}(B)\,(|A| - |A_1|)
$$

$$
\mathrm{All}(A) = \mathrm{Right}(A) = \sum_{i>1} \mathrm{Free}(A_i) + \mathrm{All}(A_1) + \#\mathrm{special}(B)\,(|A| - |A_1|)
$$

6. A = l(A1 ∘ · · · ∘ An) and the favorite child is An

$$
\mathrm{Free}(A) = \mathrm{Right}(A) = \sum_{i<n} \mathrm{Free}(A_i) + \mathrm{Right}(A_n) + \#\mathrm{right}(B)\,(|A| - |A_n|)
$$

$$
\mathrm{All}(A) = \mathrm{Left}(A) = \sum_{i<n} \mathrm{Free}(A_i) + \mathrm{All}(A_n) + \#\mathrm{special}(B)\,(|A| - |A_n|)
$$

7. otherwise: let Aj (1 < j < n) be the favorite child

$$
\mathrm{Free}(A) = \sum_{i \neq j} \mathrm{Free}(A_i) + \mathrm{All}(A_j) + \#\mathrm{right}(B)\,(1 + |A_1 \circ \cdots \circ A_{j-1}|) + \#\mathrm{special}(B)\,|A_j \circ \cdots \circ A_n|
$$

$$
\mathrm{Right}(A) = \mathrm{Free}(A)
$$

$$
\mathrm{All}(A) = \mathrm{Left}(A) = \sum_{i \neq j} \mathrm{Free}(A_i) + \mathrm{All}(A_j) + \#\mathrm{special}(B)\,(|A| - |A_j|)
$$
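The case analysis can be sketched as one recursive function that returns the four quantities for a tree under a given cover. Several things here are assumptions rather than content of the slides: the tree encoding, the dictionary `phi` (leaf label to direction, internal label to favorite-child index), the function name `tables`, and the decision to leave out unary nodes (cases 3 and 4). In case 7 the sketch multiplies #special(B) by |A_{j+1} ∘ ... ∘ A_n|, not by |A_j ∘ ... ∘ A_n| as transcribed, so that the root and its spine account for |A| − |A_j| forests in total; treat that reading as an assumption too.

```python
def size(tree):
    return 1 + sum(size(child) for child in tree[1])

def tables(tree, phi, nl, nr, ns):
    """Return (Free, All, Left, Right) for `tree` under the cover `phi`.

    nl, nr, ns stand for #left(B), #right(B) and #special(B).
    """
    label, kids = tree
    if not kids:
        if phi[label] == "right":                 # case 1
            return nl, ns, nl, ns
        return nr, ns, ns, nr                     # case 2
    if len(kids) == 1:
        raise NotImplementedError("cases 3 and 4 are not covered by this sketch")
    sub = [tables(kid, phi, nl, nr, ns) for kid in kids]
    j = phi[label]                                # index of the favorite child
    others = sum(s[0] for i, s in enumerate(sub) if i != j)   # Free(A_i), i != j
    spine = size(tree) - size(kids[j])            # |A| - |A_j|
    all_ = others + sub[j][1] + ns * spine
    if j == 0:                                    # case 5
        free = others + sub[j][2] + nl * spine
        return free, all_, free, all_
    if j == len(kids) - 1:                        # case 6
        free = others + sub[j][3] + nr * spine
        return free, all_, all_, free
    before = 1 + sum(size(kid) for kid in kids[:j])       # 1 + |A_1 ... A_{j-1}|
    after = sum(size(kid) for kid in kids[j + 1:])        # |A_{j+1} ... A_n| (assumed)
    free = others + sub[j][1] + nr * before + ns * after  # case 7
    return free, all_, all_, free

# Example tree A: a root with three leaves; counts for B taken from the example.
A = ("1", (("2", ()), ("3", ()), ("4", ())))
cover_last = {"1": 2, "2": "right", "3": "left", "4": "right"}
cover_left = {"1": 2, "2": "left", "3": "left", "4": "left"}
free_last = tables(A, cover_last, 6, 5, 6)[0]
free_left = tables(A, cover_left, 6, 5, 6)[0]
```

With the counts 6, 5 and 6 from the example, `cover_last` evaluates to 32 pairs, the figure reported for the example. Whether this is the cover drawn on the original slide cannot be recovered from the transcript.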


Example (end)

[Figure: the subforests of A paired with the rightmost subforests, the leftmost subforests or all subforests of B]

32 pairs of subforests

How to construct an optimal cover?

. Dynamic programming

. Four tables: Free, All, Left, Right

$$
\mathrm{Free}(A) = \sum_{i \ge 1} \mathrm{Free}(A_i) + \min
\begin{cases}
\mathrm{Left}(A_1) - \mathrm{Free}(A_1) + \#\mathrm{left}(B)\,(|A| - |A_1|) \\
\mathrm{All}(A_j) - \mathrm{Free}(A_j) + \#\mathrm{special}(B)\,|A_j \circ \cdots \circ A_n| + \#\mathrm{right}(B)\,(1 + |A_1 \circ \cdots \circ A_{j-1}|), & 1 < j < n \\
\mathrm{Right}(A_n) - \mathrm{Free}(A_n) + \#\mathrm{right}(B)\,(|A| - |A_n|)
\end{cases}
$$

. Preprocessing:

$$
O\Big(\sum_i \mathrm{degree}(A(i))\Big) + O(|B|) = O(|A|) + O(|B|)
$$
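A minimal sketch of the dynamic programme follows: at every node it tries each child as the favorite and keeps the smallest value of each of the four tables. The slide only spells out the recurrence for Free; taking the same minimum for All, Left and Right is an assumption of this sketch. So are the tree encoding, the name `best_tables`, the omission of unary nodes, and the use of |A_{j+1} ∘ ... ∘ A_n| for the middle case. The function returns the optimal counts only, not the cover that achieves them.

```python
def size(tree):
    return 1 + sum(size(child) for child in tree[1])

def best_tables(tree, nl, nr, ns):
    """Minimal (Free, All, Left, Right) over all covers of `tree`.

    nl, nr, ns stand for #left(B), #right(B) and #special(B).
    """
    kids = tree[1]
    if not kids:
        # direction right gives (nl, ns, nl, ns), direction left gives (nr, ns, ns, nr)
        return min(nl, nr), ns, min(nl, ns), min(ns, nr)
    if len(kids) == 1:
        raise NotImplementedError("unary nodes are not covered by this sketch")
    sub = [best_tables(kid, nl, nr, ns) for kid in kids]
    total_free = sum(s[0] for s in sub)
    n, last = size(tree), len(kids) - 1
    candidates = []
    for j, kid in enumerate(kids):                # try each child as favorite
        rest = total_free - sub[j][0]             # sum of Free(A_i), i != j
        spine = n - size(kid)                     # |A| - |A_j|
        all_ = rest + sub[j][1] + ns * spine
        if j == 0:
            free = rest + sub[j][2] + nl * spine
            candidates.append((free, all_, free, all_))
        elif j == last:
            free = rest + sub[j][3] + nr * spine
            candidates.append((free, all_, all_, free))
        else:
            before = 1 + sum(size(c) for c in kids[:j])
            after = sum(size(c) for c in kids[j + 1:])
            free = rest + sub[j][1] + nr * before + ns * after
            candidates.append((free, all_, all_, free))
    return tuple(min(c[i] for c in candidates) for i in range(4))

# A root with three leaves, with counts 6, 5, 6 for the other tree.
A = ("1", (("2", ()), ("3", ()), ("4", ())))
optimal_pairs = best_tables(A, 6, 5, 6)[0]
```

Each table is minimized independently because the parent's choice decides which table of the favorite child is read, so a single consistent cover can be traced back from the root's Free entry.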

Number of pairs of subforests

[Figure: trees A and B with the optimal covering of A; legend: direction, favorite child]

. optimal: 340

. right: 405

. left: 350

. Klein: 391


[Figure: number of subforests (y-axis, 0 to 450000) against size of trees (x-axis, 0 to 100) for Klein, Zhang & Shasha and the optimal cover]