Source: stelo/cpm/cpm03/touzet.pdf · 2003. 8. 26.
Tree edit distance analysis
Serge Dulucq (LABRI)
Helene Touzet (LIFL)
CPM 2003
Why compare trees?
¤ : base pair (stem)
• : unpaired base (loop)
Tree representation for the secondary structure of tRNA
Tree Edit Distance Problem
. A pair of ordered rooted trees (A,B)
. Costs for edit operations
• sub: substituting a node
• ins: inserting a node
• del: deleting a node
. Distance: minimal cost to transform A into B
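As a minimal sketch of the objects in this problem statement, an ordered rooted tree can be written as a nested tuple and the three edit costs as small functions. The helper names (`tree`, `size`, `sub`, `ins`, `delete`) and the unit cost model are assumptions made for illustration, not something the slides prescribe.

```python
# A tree is (label, children), where children is a tuple of trees in
# left-to-right order. A forest is a tuple of trees.
def tree(label, *children):
    return (label, tuple(children))

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

# Unit cost model (an assumption): relabelling is free when labels agree.
def sub(a, b):
    return 0 if a == b else 1

def ins(b):
    return 1

def delete(a):
    return 1

A = tree("a", tree("b"), tree("c", tree("d")))
print(size((A,)))  # 4
```

Tuples are hashable, so forests written this way can be used directly as keys when memoising distances.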
String edit distance I
. substitution: A | LGORITHM
                F | LOWER
. insertion:    - | ALGORITHM
                F | LOWER
. deletion:     A | LGORITHM
                - | FLOWER
Leftmost decomposition
Comparison of all pairs of suffixes
A L G O R I T H M
F L O W E R
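The leftmost decomposition for strings can be sketched as a memoised recursion over pairs of suffixes. The function name `edit_distance_leftmost` and the unit costs are assumptions for illustration.

```python
from functools import lru_cache

def edit_distance_leftmost(a: str, b: str) -> int:
    """Edit distance by always decomposing at the leftmost characters."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) compares the suffixes a[i:] and b[j:].
        if i == len(a):
            return len(b) - j          # insert what is left of b
        if j == len(b):
            return len(a) - i          # delete what is left of a
        return min(
            d(i + 1, j + 1) + (a[i] != b[j]),  # substitution
            d(i, j + 1) + 1,                   # insertion of b[j]
            d(i + 1, j) + 1,                   # deletion of a[i]
        )
    return d(0, 0)

print(edit_distance_leftmost("ALGORITHM", "FLOWER"))
```

Every subproblem is a pair of suffixes, so at most (|a| + 1)(|b| + 1) pairs are ever evaluated.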
String edit distance II
Comparison of all pairs of prefixes
A L G O R I T H M
F L O W E R
Rightmost decomposition
. substitution: ALGORITH | M
                FLOWE    | R
. insertion:    ALGORITHM | -
                FLOWE     | R
. deletion:     ALGORITH | M
                FLOWER   | -
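The rightmost decomposition leads to the familiar table over pairs of prefixes. The sketch below assumes unit costs, and the name `edit_distance_rightmost` is hypothetical.

```python
def edit_distance_rightmost(a: str, b: str) -> int:
    """Edit distance by always decomposing at the rightmost characters."""
    # table[i][j] is the distance between the prefixes a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i                # delete the whole prefix of a
    for j in range(len(b) + 1):
        table[0][j] = j                # insert the whole prefix of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            table[i][j] = min(
                table[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution
                table[i][j - 1] + 1,                           # insertion
                table[i - 1][j] + 1,                           # deletion
            )
    return table[len(a)][len(b)]

print(edit_distance_rightmost("ALGORITHM", "FLOWER"))
```

For strings the two decompositions visit the same number of subproblems. The later slides show that this symmetry is lost for trees.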
Tree edit distance
Distance(l(f), l′(f′)) is the minimum of:
. Deletion of l: del(l) + Distance(f, l′(f′))
. Insertion of l′: ins(l′) + Distance(l(f), f′)
. Substitution of l into l′: sub(l, l′) + Distance(f, f′)
Forest edit distance I
Leftmost decomposition: Distance(l(f) ◦ t, l′(f′) ◦ t′) is the minimum of:
. Substitution of l into l′: Distance(l(f), l′(f′)) + Distance(t, t′)
. Insertion of l′: Distance(l(f) ◦ t, f′ ◦ t′) + ins(l′)
. Deletion of l: Distance(f ◦ t, l′(f′) ◦ t′) + del(l)
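The leftmost recurrence for forests can be sketched as a memoised function, falling back to the tree recurrence (substitution of the two roots) when both forests are single trees. The nested-tuple representation, the unit costs and the names `distance`, `sub`, `INS` and `DEL` are assumptions for illustration.

```python
from functools import lru_cache

# A tree is (label, children); a forest is a tuple of trees.
def sub(a, b):
    return 0 if a == b else 1

INS = DEL = 1  # unit costs (an assumption)

@lru_cache(maxsize=None)
def distance(F, G):
    """Forest edit distance using only leftmost decompositions."""
    if not F and not G:
        return 0
    if not G:
        (l, f), t = F[0], F[1:]
        return distance(f + t, G) + DEL
    if not F:
        (l2, f2), t2 = G[0], G[1:]
        return distance(F, f2 + t2) + INS
    (l, f), t = F[0], F[1:]
    (l2, f2), t2 = G[0], G[1:]
    if not t and not t2:
        # Two single trees: substitute l into l' and compare the children.
        match = sub(l, l2) + distance(f, f2)
    else:
        # Leftmost trees are matched together, the remainders together.
        match = distance(F[:1], G[:1]) + distance(t, t2)
    return min(
        distance(f + t, G) + DEL,      # deletion of l
        distance(F, f2 + t2) + INS,    # insertion of l'
        match,
    )

A = ("a", (("b", ()), ("c", ())))
B = ("a", (("b", ()),))
print(distance((A,), (B,)))  # 1: delete the leaf c
```

Each recursive call works on a strictly smaller pair of forests, so the recursion terminates, and `lru_cache` makes sure every pair of subforests is evaluated once.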
Forest edit distance II
Rightmost decomposition: Distance(t ◦ l(f), t′ ◦ l′(f′)) is the minimum of:
. Substitution of l into l′: Distance(l(f), l′(f′)) + Distance(t, t′)
. Insertion of l′: Distance(t ◦ l(f), t′ ◦ f′) + ins(l′)
. Deletion of l: Distance(t ◦ f, t′ ◦ l′(f′)) + del(l)
Left/right for trees?
[Figure: an example forest under a rightmost decomposition and under a leftmost decomposition]
Decomposition strategies
. A strategy is a succession of choices, left or right
. S : forest × forest → {left, right}
. Zhang & Shasha (1989): (f, g) → left
. Klein (1998): (f, g) → right when the first node of f is the heaviest child, left otherwise
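A strategy in this sense can be sketched as a function passed to the recursion, which then records every pair of non-empty forests it evaluates. The names `make_distance`, `always_left`, `always_right` and `alternating`, the nested-tuple trees and the unit costs are assumptions. Klein's rule is not implemented here because it needs the position of each node in the original tree.

```python
# A tree is (label, children); a forest is a tuple of trees.
def size(forest):
    return sum(1 + size(children) for _, children in forest)

def make_distance(strategy):
    """Return (distance function, memo of evaluated forest pairs)."""
    memo = {}

    def d(F, G):
        if not F or not G:
            return size(F) + size(G)   # unit costs: delete or insert everything
        if (F, G) in memo:
            return memo[(F, G)]
        left = strategy(F, G) == "left"
        k = 0 if left else -1
        (l, f), (l2, f2) = F[k], G[k]
        t, t2 = (F[1:], G[1:]) if left else (F[:-1], G[:-1])
        F_del = f + t if left else t + f        # forest after deleting l
        G_ins = f2 + t2 if left else t2 + f2    # forest after inserting l'
        if not t and not t2:
            match = (l != l2) + d(f, f2)        # two single trees
        else:
            match = d((F[k],), (G[k],)) + d(t, t2)
        memo[(F, G)] = result = min(d(F_del, G) + 1, d(F, G_ins) + 1, match)
        return result

    return d, memo

always_left = lambda F, G: "left"      # the Zhang & Shasha choice on the slide
always_right = lambda F, G: "right"
alternating = lambda F, G: "left" if (size(F) + size(G)) % 2 else "right"

A = (1, ((2, ()), (3, ()), (4, ())))
B = (1, ((2, ()), (3, ((4, ()),))))
for name, s in [("left", always_left), ("right", always_right), ("alt", alternating)]:
    d, memo = make_distance(s)
    print(name, d((A,), (B,)), len(memo))
```

Every strategy returns the same distance. Only the number of memoised pairs changes, which is the quantity the rest of the talk sets out to minimise.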
[Figure: Klein strategy on an example tree]
How to build an economical strategy?
[Figure: leftmost decomposition of l(f) ◦ t; labels: free of constraints, leftmost decompositions]
#subforest(l(f) ◦ t) = #subforest(l(f)) + |l(f)| + #subforest(t)
Cover strategies are economical strategies
For a tree A, define a cover φ for A as
. φ(i) ∈ {left, right} if the degree of i is 0 or 1: the direction
. φ(i) is a child of i otherwise: the favorite child
[Figure: leftmost decompositions, rightmost decompositions]
Zhang & Shasha and Klein are cover strategies.
Number of subforests for one tree
A = l(A1 ◦ . . . ◦An)
. Lower bound (no assumption on the strategy)
#subforest(A) ≥ |A| − |Aj| + #subforest(A1) + ··· + #subforest(An): O(n log(n))
Aj is the heaviest child
. Upper bound (no assumption on the strategy)
#subforest(A) ≤ n(n+3)/2 − Σ_{i∈A} |A(i)|
On average: (1/2) n² + (√π/2) n^(3/2) + O(n)
. Exact number for a cover strategy
#subforest(A) = |A| − |Aj|+ #subforest(A1) + · · ·+ #subforest(An)
Aj is the favorite child
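The counting formulas above are easy to evaluate directly. The sketch below assumes nested-tuple trees, and the helper names (`count_cover`, `count_all`, `first`, `last`, `heaviest`) are hypothetical. The shapes used for A (a root with three leaves) and B (a root with a leaf followed by a two-node chain) are inferred from the arithmetic of the example slide and should be read as assumptions.

```python
# A tree is (label, children) with children a tuple of trees.
def size(t):
    return 1 + sum(size(c) for c in t[1])

def subtrees(t):
    yield t
    for c in t[1]:
        yield from subtrees(c)

def count_cover(t, favorite):
    """#subforest(A) = |A| - |Aj| + sum of #subforest(Ai), Aj the favorite child."""
    children = t[1]
    if not children:
        return 1                       # a single node is its only subforest
    j = favorite(children)
    return size(t) - size(children[j]) + sum(count_cover(c, favorite) for c in children)

def count_all(t):
    """Upper bound n(n+3)/2 - sum over nodes i of |A(i)|."""
    n = size(t)
    return n * (n + 3) // 2 - sum(size(s) for s in subtrees(t))

first = lambda children: 0
last = lambda children: len(children) - 1
heaviest = lambda children: max(range(len(children)), key=lambda i: size(children[i]))

A = (1, ((2, ()), (3, ()), (4, ())))
B = (1, ((2, ()), (3, ((4, ()),))))
print(count_cover(A, first))     # 6 = 4 - 1 + 1 + 1 + 1
print(count_cover(B, first))     # 6 = 4 - 1 + 1 + 2
print(count_cover(B, last))      # 5 = 4 - 2 + 1 + 2
print(count_cover(B, heaviest))  # 5, the lower bound for B
print(count_all(B))              # 6 = 4 * 7 / 2 - 4 - 1 - 2 - 1
```

Choosing the heaviest child as favorite at every node attains the lower bound, since it subtracts the largest possible |Aj| at each level.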
Example
[Figure: example trees A and B, each on nodes 1 to 4, with their subforests]
4 − 1 + 1 + 1 + 1 = 6 subforests for A
#left(B) = 4 − 1 + 1 + 2 = 6
#right(B) = 4 − 2 + 1 + 2 = 5
#special(B) = 4 ∗ 7/2 − 4 − 1 − 2 − 1 = 6
What happens to the other tree B?
There are only three possibilities for the subforests of the cover tree A:
. being compared with all leftmost forests of B
. being compared with all rightmost forests of B
. being compared with all forests of B
#left(B): number of leftmost subforests of B
#right(B): number of rightmost subforests of B
#special(B): number of subforests of B
#left(B), #right(B), #special(B) are known
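The three counts can also be checked by brute force, by enumerating every non-empty forest that the recurrences can reach from B. In this sketch the function `closure` is hypothetical and B is assumed to be a root with a leaf child followed by a two-node chain, a shape inferred from the arithmetic of the example. With the mirrored shape the first two counts would swap.

```python
# A tree is (label, children); a forest is a tuple of trees.
def closure(forest, sides):
    """Non-empty forests reachable with decompositions on the given sides."""
    seen = set()
    stack = [forest]
    while stack:
        F = stack.pop()
        if not F or F in seen:
            continue
        seen.add(F)
        for side in sides:
            if side == "left":
                (_, f), rest = F[0], F[1:]
                stack.append(f + rest)     # root of the leftmost tree removed
                stack.append(F[:1])        # the leftmost tree on its own
            else:
                (_, f), rest = F[-1], F[:-1]
                stack.append(rest + f)     # root of the rightmost tree removed
                stack.append(F[-1:])       # the rightmost tree on its own
            stack.append(rest)             # the remaining forest
    return seen

B = (1, ((2, ()), (3, ((4, ()),))))
print(len(closure((B,), ["right"])))          # 6
print(len(closure((B,), ["left"])))           # 5
print(len(closure((B,), ["left", "right"])))  # 6
```

Under this assumed shape, rightmost decompositions alone reach 6 forests and leftmost decompositions alone reach 5, which agrees with the values 6 and 5 given for #left(B) and #right(B) if leftmost forests are read as those that keep the left part of B. The slides do not spell out that reading, so it is an interpretation.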
The favorite children inherit subforests from their parent.
4 kinds of nodes :
. Free: nodes that do not receive anything
. Left : nodes that inherit leftmost forests of B
. Right : nodes that inherit rightmost forests of B
. All: nodes that inherit all subforests of B
The status of a node depends on its direction and on what it inherits.
1. A is reduced to a node with direction right
Free(A) = Left(A) = #left(B)
All(A) = Right(A) = #special(B)
2. A is reduced to a node with direction left
Free(A) = Right(A) = #right(B)
All(A) = Left(A) = #special(B)
3. A = l(A′) and the direction of l is right
Free(A) = Left(A) = #left(B) + Right(A′)
All(A) = Right(A) = #special(B) + All(A′)
4. A = l(A′) and the direction of l is left
Free(A) = Right(A) = #right(B) + Left(A′)
All(A) = Left(A) = #special(B) + All(A′)
5. A = l(A1 ◦ ··· ◦ An) and the favorite child is A1
Free(A) = Left(A) = Σ_{i>1} Free(Ai) + Left(A1) + #left(B)(|A| − |A1|)
All(A) = Right(A) = Σ_{i>1} Free(Ai) + All(A1) + #special(B)(|A| − |A1|)
6. A = l(A1 ◦ ··· ◦ An) and the favorite child is An
Free(A) = Right(A) = Σ_{i<n} Free(Ai) + Right(An) + #right(B)(|A| − |An|)
All(A) = Left(A) = Σ_{i<n} Free(Ai) + All(An) + #special(B)(|A| − |An|)
7. otherwise: let Aj (1 < j < n) be the favorite child
Free(A) = Σ_{i≠j} Free(Ai) + All(Aj) + #right(B)(1 + |A1 ◦ ··· ◦ Aj−1|) + #special(B)|Aj ◦ ··· ◦ An|
Right(A) = Free(A)
All(A) = Left(A) = Σ_{i≠j} Free(Ai) + All(Aj) + #special(B)(|A| − |Aj|)
Example (end)
[Figure: trees A and B; each subforest of A is compared with the rightmost subforests, the leftmost subforests or all subforests of B]
32 pairs of subforests
How to construct an optimal cover ?
. Dynamic programming
. Four tables : Free, All, Left, Right
Free(A) = Σ_{i≥1} Free(Ai) + min of:
  Left(A1) − Free(A1) + #left(B) ∗ (|A| − |A1|)
  All(Aj) − Free(Aj) + #special(B)|Aj ◦ ··· ◦ An| + #right(B)(1 + |A1 ◦ ··· ◦ Aj−1|), 1 < j < n
  Right(An) − Free(An) + #right(B) ∗ (|A| − |An|)
. Preprocessing: O(Σ_i degree(A(i))) + O(|B|) = O(|A|) + O(|B|)
Number of pairs of subforests
. optimal: 340
. right: 405
. left: 350
. Klein: 391
[Figure: trees A and B with the optimal covering; legend: direction, favorite child]
[Figure: number of subforests (0 to 450000) against size of trees (0 to 100), for Klein, Zhang & Shasha and the optimal cover]