tree distance algorithm
TRANSCRIPT
Workshop on tree distance
By Hector Franco, francoph at tcd dot ie
Trinity College Dublin
summary
1. Levenshtein algorithm & Sub-string matching
2. The tree edit distance
3. Basic tree concepts
4. Maps
5. Tai maps
6. Tree distance
7. Zhang Shasha algorithm & alignment
String distance metrics: Levenshtein
• Edit-distance metrics
  – Precursor of tree distance.
  – The distance is the cost of the cheapest sequence of edit commands that transforms s into t.
  – Simplest set of operations:
    • Copy/map a character from s over to t (cost 0)
    • Delete a character in s (cost 1)
    • Insert a character in t (cost 1)
    • Substitute one character for another (cost 1)
  – This is the “Levenshtein distance”.
Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”) = 2
S (domain):  W I L L - I A M _ C O H E N
T (range):   W I L L L I A M _ C O H O N
op:          C C C C I C C C C C C C S C
cost:        0 0 0 0 1 1 1 1 1 1 1 1 2 2
(C = copy, I = insert, S = substitute; the cost row is the running total along the alignment.)
Computing Levenshtein distance
D(i,j) = score of the best alignment from s1..si to t1..tj
       = min of:
           D(i-1,j-1) + d(si,tj)   // substitute or copy
           D(i-1,j) + 1            // delete si
           D(i,j-1) + 1            // insert tj
(simplify by letting d(c,d) = 0 if c = d, 1 otherwise)
Also let D(i,0) = i (for i deletes) and D(0,j) = j (for j inserts).
Computing Levenshtein distance - 3
D(i,j) = min of:
    D(i-1,j-1) + d(si,tj)   // substitute or copy
    D(i-1,j) + 1            // delete si
    D(i,j-1) + 1            // insert tj
Table for s = MCCOHN (rows) and t = COHEN (columns); the bottom-right cell is D(s,t) = 3.
        C  O  H  E  N
     0  1  2  3  4  5
  M  1  1  2  3  4  5
  C  2  1  2  3  4  5
  C  3  2  2  3  4  5
  O  4  3  2  3  4  5
  H  5  4  3  2  3  4
  N  6  5  4  3  3  3
Row 0 and column 0 of the table (highlighted on the slide) are initialized in increasing order:
for (int x = 0; x <= size(target); x++) D(0,x) = x;
for (int x = 0; x <= size(source); x++) D(x,0) = x;
d(si,tj) is the cost of changing the letter si into the letter tj: 0 if the letters are the same and 1 if they are different.
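The recurrence and the initialization can be put together in a few lines. The following is a minimal Python sketch, not code from the workshop; the function name `levenshtein` is an assumption, and unit costs are used as on the slides.

```python
def levenshtein(source: str, target: str) -> int:
    """Fill the table D row by row and return D(len(source), len(target))."""
    rows, cols = len(source) + 1, len(target) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):          # column 0: i deletions
        D[i][0] = i
    for j in range(cols):          # row 0: j insertions
        D[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            d = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d,   # substitute or copy
                          D[i - 1][j] + 1,       # delete source[i-1]
                          D[i][j - 1] + 1)       # insert target[j-1]
    return D[-1][-1]


print(levenshtein("MCCOHN", "COHEN"))                   # 3
print(levenshtein("William Cohen", "Willliam Cohon"))   # 2
```

Both printed values agree with the examples on the slides.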
Computing Levenshtein distance - 3 (alignments)
Tracing back through the same table from D(6,5) = 3 gives two optimal alignments of MCCOHN (T1) with COHEN (T2), each of cost 3:
T1 T2 Cost
M - 1
C - 1
C C 0
O O 0
H H 0
- E 1
N N 0
T1 T2 Cost
M - 1
C C 0
C - 1
O O 0
H H 0
- E 1
N N 0
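An alignment like the ones listed can be recovered by walking back from the bottom-right cell of the table. The sketch below is one possible way to do it in Python; the function name `align` and the tie-breaking order (diagonal first, then deletion, then insertion) are assumptions, so it returns only one of the optimal alignments.

```python
def align(source: str, target: str):
    """Return one optimal alignment as a list of (T1, T2, cost) triples."""
    rows, cols = len(source) + 1, len(target) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i
    for j in range(cols):
        D[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            d = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d, D[i - 1][j] + 1, D[i][j - 1] + 1)

    # Trace back from D(len(source), len(target)) to D(0, 0).
    i, j, ops = len(source), len(target), []
    while i > 0 or j > 0:
        d = 1 if i == 0 or j == 0 or source[i - 1] != target[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d:
            ops.append((source[i - 1], target[j - 1], d))   # copy or substitute
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append((source[i - 1], "-", 1))             # delete
            i -= 1
        else:
            ops.append(("-", target[j - 1], 1))             # insert
            j -= 1
    return ops[::-1]


for t1, t2, cost in align("MCCOHN", "COHEN"):
    print(t1, t2, cost)
```

With this tie-breaking order the output is the first of the two alignments shown above.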
Practice
Try: align G O O D with G O D and check the result against the table.
        G  O  O  D
     0  1  2  3  4
  G  1  0  1  2  3
  O  2  1  0  1  2
  D  3  2  1  1  1
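Substring matching
• Which substring of T best matches S? (Here S = “nier”, in the rows, and T = “univers”, in the columns.)
        u  n  i  v  e  r  s
        0  0  0  0  0  0  0
  n     1  0  1  1  1  1  1
  i     2  1  0  1  2  2  2
  e     3  2  1  1  1  2  3
  r     4  3  2  2  2  1  2
• Row 0 is all zeros: deleting the positions of T before the substring costs 0.
• Look for the minimum value in the last row: it is 1, under “r”, and the best matching substring is “niver”.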
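The only change with respect to the Levenshtein table is the row of zeros and the search for the minimum in the last row. A minimal Python sketch under those assumptions (the function name `best_substring_match` is hypothetical, and ties are resolved by taking the first minimum):

```python
def best_substring_match(pattern: str, text: str):
    """Return (cost, substring) for the substring of text closest to pattern."""
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]    # row 0 stays at 0: free start
    for i in range(rows):
        D[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            d = 0 if pattern[i - 1] == text[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d, D[i - 1][j] + 1, D[i][j - 1] + 1)

    # The best match ends at the minimum of the last row.
    end = min(range(cols), key=lambda j: D[-1][j])

    # Walk back to row 0 to find where the match starts.
    i, j = len(pattern), end
    while i > 0:
        d = 1 if j == 0 or pattern[i - 1] != text[j - 1] else 0
        if j > 0 and D[i][j] == D[i - 1][j - 1] + d:
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return D[-1][end], text[j:end]


print(best_substring_match("nier", "univers"))   # (1, 'niver')
```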
Edit distance between trees
• Tai (1979) introduced a criterion for matching nodes between tree representations, and Zhang and Shasha (1989) developed an algorithm that finds an optimal matching for a given pair of trees.
• The algorithm considers left-to-right order, meaning that the order of the children of each node matters, and ancestry, meaning that the ancestors of each node matter.
• Three operations are allowed to convert one tree into another: deleting a node, inserting a node, and changing (relabeling) a node.
• When a node n is deleted, all its children are attached to the parent of n. An insertion is the opposite: the new node can take over consecutive children of its new parent as its own children. A change only affects the label of the node and leaves the shape of the tree untouched.
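The three operations can be illustrated on a small ordered tree. This is a sketch only; the `Node` class and the helper names `delete_child`, `insert_child` and `relabel` are assumptions made for the example and do not come from the slides.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def delete_child(parent, index):
    """Delete parent.children[index]; its children take its place, in order."""
    removed = parent.children[index]
    parent.children[index:index + 1] = removed.children
    return removed


def insert_child(parent, label, start, end):
    """Insert a new node under parent that adopts parent.children[start:end]."""
    new_node = Node(label, parent.children[start:end])
    parent.children[start:end] = [new_node]
    return new_node


def relabel(node, label):
    """Change only the label; the shape of the tree is untouched."""
    node.label = label


root = Node("a", [Node("b", [Node("d"), Node("e")]), Node("c")])
delete_child(root, 0)                          # d and e move up under a
print([child.label for child in root.children])   # ['d', 'e', 'c']
insert_child(root, "b", 0, 2)                  # b adopts d and e again
print([child.label for child in root.children])   # ['b', 'c']
```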
Post-order traversal of trees
• To traverse a non-empty binary tree in post-order, perform the following operations recursively at each node:
  1. Traverse the left subtree.
  2. Traverse the right subtree.
  3. Visit the node.
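The same recipe works for ordered trees with any number of children: visit the children from left to right, then the node itself. A minimal Python sketch, assuming a tree is written as a `(label, children)` pair (a representation chosen here for illustration):

```python
def postorder(tree):
    """Return the labels of an ordered tree in post-order.

    A tree is a (label, children) pair; children is a list of trees.
    """
    label, children = tree
    visited = []
    for child in children:            # subtrees first, left to right
        visited.extend(postorder(child))
    visited.append(label)             # the node itself comes last
    return visited


tree = ("R", [("A", [("B", []), ("C", [])]), ("D", [])])
print(postorder(tree))   # ['B', 'C', 'A', 'D', 'R']
```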
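Practice yourself
[Figure: a five-node tree to number in post-order.]
Ancestors:
Leftmost descendant: the function l(x) gives the leftmost leaf descendant of the node x.
Key roots: x is a key root if
¬∃ k | k > x, l(k) = l(x)
that is, no node numbered after x has the same leftmost descendant as x.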
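Both definitions can be computed in one post-order pass. The sketch below assumes trees written as `(label, children)` pairs and 1-based arrays with index 0 unused; the function names and the labels "A", "B", "C" are hypothetical. The two shapes are those of the trees t1 and t2 used in the worked sample later in the workshop.

```python
def number_postorder(tree):
    """Return (labels, left): 1-based lists in post-order, index 0 unused.

    left[x] is l(x), the post-order number of the leftmost leaf under x.
    """
    labels, left = [None], [None]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leftmost = visit(child)
            if first is None:
                first = leftmost
        labels.append(label)
        index = len(labels) - 1
        left.append(index if first is None else first)
        return left[index]

    visit(tree)
    return labels, left


def keyroots(left):
    """x is a key root if no k > x has l(k) = l(x)."""
    n = len(left) - 1
    return [x for x in range(1, n + 1)
            if not any(left[k] == left[x] for k in range(x + 1, n + 1))]


t1 = ("B", [("B", [("B", [("A", [])]), ("A", []), ("A", [])])])
t2 = ("C", [("B", [("A", [])]), ("B", [("A", [])]), ("A", [])])
_, left1 = number_postorder(t1)
_, left2 = number_postorder(t2)
print(left1[1:], keyroots(left1))   # [1, 1, 3, 4, 1, 1] [3, 4, 6]
print(left2[1:], keyroots(left2))   # [1, 1, 3, 3, 5, 1] [4, 5, 6]
```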
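Mappings
[Figure: a source tree (domain) and a target tree (range) with node labels such as M, H, S, I and B; lines connect the mapped nodes.]
Legend:
• Deleted: a node in the domain mapped to Λ
• Changed
• Exact match
• Inserted: a node in the range mapped from Λ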
Mappings
[Figure: the same two trees, with the nodes numbered 1 to 4 on each side.]
Transformation = { (1,2), (2,Λ), (3,1), (4,4), (Λ,3) }
Note: this is NOT a Tai map.
TR: transformations
M: map
The sets
[Figure: the same mapping between the two trees.]
• C: change, cost = 1
• EX: exact match, cost = 0
• I: insertion, cost = 1
• D: deletion, cost = 1
TR = EX + C + I + D
M = EX + C
Mappings
TR = { (1,2), (2,Λ), (3,1), (4,4), (Λ,3) }
[Figure: the same mapping between the two trees.]
• Sets:
  C: relabeled (a,b): { (1,2) }
  EX: exact match (c,c): { (3,1), (4,4) }
  I: insertions (Λ,b): { (Λ,3) }
  D: deletions (a,Λ): { (2,Λ) }
Mappings, more formally
• Let V(T) denote the set of nodes of a tree T.
• S is the source tree.
• T is the target tree.
• M is the mapping.
Tai mapping: a restricted map
For any two pairs (a,b) and (c,d) in M:
1. One-to-one: a = c iff b = d.
2. Sibling order preserved (brothers do not change order): a < c iff b < d.
3. Ancestor order preserved (the mapping creates no new ancestors): Anc(a,c) iff Anc(b,d).
[Figure: two small trees over the nodes a, b, c and d, e, f.]
• Sample of a Tai mapping:
• M = {(c,f), (a,e)}
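The three conditions can be checked pair by pair. The sketch below assumes nodes are identified by their post-order numbers and that each tree is described by its array of leftmost descendants l(x) (1-based, index 0 unused); with that numbering, a is a proper ancestor of c exactly when l(a) <= c < a. The function name `is_tai_mapping` and the example arrays are assumptions made for illustration.

```python
def is_tai_mapping(pairs, left1, left2):
    """Check the three Tai conditions for every two pairs (a,b), (c,d)."""

    def anc(left, a, c):
        # True if a is a proper ancestor of c (post-order numbering).
        return left[a] <= c < a

    for a, b in pairs:
        for c, d in pairs:
            if (a == c) != (b == d):                    # 1. one-to-one
                return False
            if (a < c) != (b < d):                      # 2. order preserved
                return False
            if anc(left1, a, c) != anc(left2, b, d):    # 3. ancestry preserved
                return False
    return True


# Leftmost-descendant arrays of two six-node trees (index 0 unused).
left1 = [None, 1, 1, 3, 4, 1, 1]
left2 = [None, 1, 1, 3, 3, 5, 1]
print(is_tai_mapping([(6, 6), (4, 5), (2, 2), (1, 1)], left1, left2))  # True
print(is_tai_mapping([(1, 2), (3, 1)], left1, left2))                  # False
```

The second mapping fails because 1 < 3 in the first tree while 2 > 1 in the second, so the order is not preserved.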
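Practice
[Figure: pairs of trees with numbered nodes. For each candidate mapping, decide whether it is a possible Tai mapping or whether it breaks the one-to-one ("multiple!"), sibling order or ancestry ("Ancestry!") condition.]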
Definition: tree distance
• There can be many possible Tai mappings between two trees, and at least one of them has the smallest cost.
• “The tree distance is the cost of the least expensive Tai mapping.”
• Cost = |I| + |D| + |C|
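The cost formula can be evaluated directly from a mapping: every unmapped source node is a deletion, every unmapped target node is an insertion, and every mapped pair with different labels is a change. A minimal sketch with unit costs; the function name `mapping_cost` and the tiny example are hypothetical.

```python
def mapping_cost(pairs, labels1, labels2):
    """Cost = |I| + |D| + |C| with unit costs.

    pairs: (a, b) node pairs; labels1/labels2: 1-based lists, index 0 unused.
    """
    mapped1 = {a for a, _ in pairs}
    mapped2 = {b for _, b in pairs}
    deletions = (len(labels1) - 1) - len(mapped1)    # source nodes left out
    insertions = (len(labels2) - 1) - len(mapped2)   # target nodes left out
    changes = sum(1 for a, b in pairs if labels1[a] != labels2[b])
    return insertions + deletions + changes


labels1 = [None, "a", "b", "c"]
labels2 = [None, "a", "x"]
print(mapping_cost([(1, 1), (3, 2)], labels1, labels2))   # 2
```

Here node 2 of the source is deleted and the pair (3,2) is a change, so the cost is 2.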
Practice: prove the “triangle inequality”
TR: transformations
M: map
• C: change, cost = 1
• EX: exact match, cost = 0
• I: insertion, cost = 1
• D: deletion, cost = 1
Prove: γ(M3) ≤ γ(M1) + γ(M2), where M3 = M1*M2
Clue: γ(M0) = |C0| + |I0| + |D0|
Practice
Every combination of an element of M1 with an element of M2 yields an element of M3 that costs the same as, or less than, the two together.
SET1  COST  SET2  COST  SET3  COST  DIFF
EX    0     EX    0     EX    0     0
EX    0     C     1     C     1     0
EX    0     D     1     D     1     0
C     1     EX    0     C     1     0
C     1     C     1     C/EX  1/0   -1/-2
C     1     D     1     D     1     -1
I     1     EX    0     I     1     0
I     1     C     1     I     1     -1
I     1     D     1     None  0     -2
D     1     None  0     D     1     0
D     1     I     1     C/EX  1/0   -1/-2
None  0     I     1     I     1     0
Zhang-Shasha algorithm
• Finds the tree distance (the cost of the least costly Tai mapping).
• Important concepts:
  • Tree table -> tscore
  • Forest table -> fscore
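Forest distance
• Forest distance = F(), tree distance = T().
• The cost function δ covers both F and T.
[Figure: a forest is a sequence of trees.]
Notation: F1 and F2 are forests, v and w are the roots of their rightmost trees, T(v) is the subtree rooted at v, and γ is the cost of a single edit.

Tree distance
δ(T(v), T(w)) = min of:
    δ(T(v) - v, T(w)) + γ(v -> Λ)        delete
    δ(T(v), T(w) - w) + γ(Λ -> w)        add
    δ(T(v) - v, T(w) - w) + γ(v -> w)    change

Forest distance
δ(F1, F2) = min of:
    δ(F1 - v, F2) + γ(v -> Λ)                   delete
    δ(F1, F2 - w) + γ(Λ -> w)                   add
    δ(F1 - T(v), F2 - T(w)) + δ(T(v), T(w))     match the rightmost trees
Not allowed for forests: δ(F1 - v, F2 - w) + γ(v -> w).
Why is this not allowed? Clue: check the Tai mapping restrictions.
When both forests are single trees, the third case is the change case: δ(T(v) - v, T(w) - w) + γ(v -> w).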
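Algorithm: pseudocode
• Preprocessing(): get the leftmost descendant of each node and the key roots of each tree.
• Main loop:
for (s = 1; s <= |keyroots(t1)|; s++)
  for (t = 1; t <= |keyroots(t2)|; t++) {
    i = keyroots(t1)[s];
    j = keyroots(t2)[t];
    Treedist(i, j);
  }
return tdist[i][j];   // i and j are now the roots of t1 and t2

Algorithm: 2 (prepare the forest table)
Treedist(pos1, pos2) {
  bound1 = pos1 - left1[pos1] + 2;
  bound2 = pos2 - left2[pos2] + 2;
  fdist = new int[bound1][bound2];
  fdist[0][0] = 0;
  for (i = 1; i < bound1; i++)   // delete the first i nodes of the source forest
    fdist[i][0] = fdist[i-1][0] + c[left1[pos1]+i-1][0];
  for (j = 1; j < bound2; j++)   // insert the first j nodes of the target forest
    fdist[0][j] = fdist[0][j-1] + c[0][left2[pos2]+j-1];

Algorithm: 3
  for (k = left1[pos1], i = 1; k <= pos1; k++, i++)
    for (l = left2[pos2], j = 1; l <= pos2; l++, j++)
      if (left1[k] == left1[pos1] && left2[l] == left2[pos2]) {
        // both forests are whole trees: this is a tree distance
        fdist[i][j] = MIN(
          fdist[i-1][j] + c[k][0],      // delete k
          fdist[i][j-1] + c[0][l],      // insert l
          fdist[i-1][j-1] + c[k][l]);   // change k into l
        tdist[k][l] = fdist[i][j];

Algorithm: 4
      } else {
        // at least one side is a forest: reuse a stored tree distance
        m = left1[k] - left1[pos1];
        n = left2[l] - left2[pos2];
        fdist[i][j] = MIN(
          fdist[i-1][j] + c[k][0],      // delete k
          fdist[i][j-1] + c[0][l],      // insert l
          fdist[m][n] + tdist[k][l]);   // match the subtrees rooted at k and l
      }
}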
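The four pseudocode slides can be assembled into one runnable function. The following Python version is a sketch under stated assumptions, not the presenter's implementation: nodes are numbered in post-order, arrays are 1-based with index 0 unused, all costs are unit costs (0 for equal labels), and the labels "A", "B", "C" are hypothetical labels chosen so that equal letters stand for equal colors in the six-node sample trees of the workshop.

```python
def tree_distance_table(labels1, left1, labels2, left2):
    """Return the tree distance table tdist; tdist[n1][n2] is the distance."""
    n1, n2 = len(labels1) - 1, len(labels2) - 1

    def keyroots(left, n):
        return [x for x in range(1, n + 1)
                if all(left[k] != left[x] for k in range(x + 1, n + 1))]

    def change(k, l):
        return 0 if labels1[k] == labels2[l] else 1

    tdist = [[None] * (n2 + 1) for _ in range(n1 + 1)]

    def treedist(pos1, pos2):
        bound1 = pos1 - left1[pos1] + 2
        bound2 = pos2 - left2[pos2] + 2
        fdist = [[0] * bound2 for _ in range(bound1)]
        for i in range(1, bound1):
            fdist[i][0] = fdist[i - 1][0] + 1          # deletions only
        for j in range(1, bound2):
            fdist[0][j] = fdist[0][j - 1] + 1          # insertions only
        for i in range(1, bound1):
            k = left1[pos1] + i - 1
            for j in range(1, bound2):
                l = left2[pos2] + j - 1
                delete = fdist[i - 1][j] + 1
                insert = fdist[i][j - 1] + 1
                if left1[k] == left1[pos1] and left2[l] == left2[pos2]:
                    # Both forests are trees: the diagonal is allowed.
                    fdist[i][j] = min(delete, insert,
                                      fdist[i - 1][j - 1] + change(k, l))
                    tdist[k][l] = fdist[i][j]
                else:
                    # Forest case: reuse the stored subtree distance.
                    m = left1[k] - left1[pos1]
                    n = left2[l] - left2[pos2]
                    fdist[i][j] = min(delete, insert,
                                      fdist[m][n] + tdist[k][l])

    # Key roots in increasing order, so stored distances exist when needed.
    for i in keyroots(left1, n1):
        for j in keyroots(left2, n2):
            treedist(i, j)
    return tdist


labels1 = [None, "A", "B", "A", "A", "B", "B"]
left1 = [None, 1, 1, 3, 4, 1, 1]
labels2 = [None, "A", "B", "A", "B", "A", "C"]
left2 = [None, 1, 1, 3, 3, 5, 1]
tdist = tree_distance_table(labels1, left1, labels2, left2)
print(tdist[6][6])   # 3
```

The value 3 is the bottom-right entry of the final tree distance table shown in the alignment slides.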
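Sample
• There is a permanent tree distance table and a dynamic forest distance table.
• Let’s follow the algorithm to solve this problem.
• Color means different labels.
[Figure: trees t1 and t2 with nodes numbered 1 to 6 in post-order. In t1, node 6 is the root with child 5; node 5 has children 2, 3 and 4; node 2 has child 1. In t2, node 6 is the root with children 2, 4 and 5; node 2 has child 1; node 4 has child 3.]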
Sample
[Figure: trees t1 and t2.]
Position:        0    1  2  3  4  5  6
Left1 array =    NaN  1  1  3  4  1  1
Left2 array =    NaN  1  1  3  3  5  1
LR_keyroots1 =   NaN  0  0  1  1  0  1
LR_keyroots2 =   NaN  0  0  0  1  1  1
Most left
• Left1 and Left2 hold, for each position x, the leftmost descendant l(x) of node x in t1 and in t2 (see the table above).
Key roots
• LR_keyroots1 and LR_keyroots2 mark the key roots with a 1: nodes 3, 4 and 6 in t1, and nodes 4, 5 and 6 in t2.
Atomic cost: c
• c[n][m] measures the cost of changing the label of node n (in t1) into the label of node m (in t2).
• The first row and column hold the cost of inserting/deleting a node with that label.
[Figure: trees t1 and t2.]
          Λ  1  2  3  4  5  6   (t2)
     Λ    -  1  1  1  1  1  1
     1    1  0  1  0  1  0  1
     2    1  1  0  1  0  1  1
     3    1  0  1  0  1  0  1
     4    1  0  1  0  1  0  1
     5    1  1  0  1  0  1  1
     6    1  1  0  1  0  1  1
   (t1)
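A matrix of this kind can be generated from the node labels. The sketch below uses unit costs; the function name `atomic_costs` and the labels "A", "B", "C" are assumptions (the slides only say that colors stand for labels), chosen so that the generated rows reproduce the matrix above. The unused corner c[0][0] is set to 0 here.

```python
def atomic_costs(labels1, labels2):
    """Build c: c[k][0] = delete k, c[0][l] = insert l, c[k][l] = change."""
    n1, n2 = len(labels1) - 1, len(labels2) - 1
    c = [[1] * (n2 + 1) for _ in range(n1 + 1)]   # row 0 and column 0 stay 1
    c[0][0] = 0                                   # unused corner
    for k in range(1, n1 + 1):
        for l in range(1, n2 + 1):
            c[k][l] = 0 if labels1[k] == labels2[l] else 1
    return c


# Hypothetical labels for t1 and t2, in post-order (index 0 unused).
labels1 = [None, "A", "B", "A", "A", "B", "B"]
labels2 = [None, "A", "B", "A", "B", "A", "C"]
for row in atomic_costs(labels1, labels2)[1:]:
    print(*row)
```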
Tree distance table
• The first row and column are unused; they just simplify the code.
• The other cells are supposed to start at 0, but we use N = NaN to show that only the cells that have been assigned a value are used.
• The matrix is 6 × 6, matching the number of nodes.
• Each position x stands for the subtree T[l(x)..x].
[Figure: trees t1 and t2.]
      0  1  2  3  4  5  6
  0   -  -  -  -  -  -  -
  1   -  N  N  N  N  N  N
  2   -  N  N  N  N  N  N
  3   -  N  N  N  N  N  N
  4   -  N  N  N  N  N  N
  5   -  N  N  N  N  N  N
  6   -  N  N  N  N  N  N
Treedist(3,4) step1
Calculations:
• bound1 = 3 - 3 + 2 = 2
• bound2 = 4 - 3 + 2 = 3
• fdist = new int[2][3]
• Prepare the forest distance table
Forest distance: a 2 × 3 table, still to be filled.
Tree distance table: every cell is still N.
Treedist(3,4) step2
Calculations:
• k = 3, i = 1
• l = 3, j = 1
• Is it a tree? -> yes
• Set the value in the forest distance table and in the tree distance table
• l++, j++
Forest distance:
  0 1 2
  1 0
Tree distance table, row 3 (every other cell is still N): N N 0 N N N
T(3,3) = min of:
  F(Λ,3) + γ(3 -> Λ) = 2
  F(3,Λ) + γ(Λ -> 3) = 2
  F(Λ,Λ) + γ(3 -> 3) = 0
It is a tree, so:
• We can look at the diagonal.
• We must copy the value into the tree table.
Treedist(3,4) step3
Calculations:
• k = 3, i = 1
• l = 4, j = 2
• Is it a tree? -> yes
• Set the value in the forest distance table and in the tree distance table
• l++, j++
Forest distance:
  0 1 2
  1 0 1
Tree distance table, row 3 (every other cell is still N): N N 0 1 N N
T(3,4) = min of:
  F(Λ,3..4) + γ(3 -> Λ) = 3
  F(3,3) + γ(Λ -> 4) = 1
  F(Λ,3) + γ(3 -> 4) = 2
Treedist(3,5) step1
Calculations:
• bound1 = 3 - 3 + 2 = 2
• bound2 = 5 - 5 + 2 = 2
• fdist = new int[2][2]
• Prepare the forest distance table
Forest distance: a 2 × 2 table, still to be filled.
Tree distance table, row 3 (every other cell is still N): N N 0 1 N N
Treedist(3,5) step2
Calculations:
• k = 3, i = 1
• l = 5, j = 1
• Is it a tree? -> yes
Forest distance:
  0 1
  1 0
Tree distance table, row 3 (every other cell is still N): N N 0 1 0 N
T(3,5) = min of:
  F(Λ,5) + γ(3 -> Λ) = 2
  F(3,Λ) + γ(Λ -> 5) = 2
  F(Λ,Λ) + γ(3 -> 5) = 0
Treedist(3,6) step1
Calculations:
• bound1 = 3 - 3 + 2 = 2
• bound2 = 6 - 1 + 2 = 7
• fdist = new int[2][7]
• Prepare the forest distance table
Forest distance:
  0 1 2 3 4 5 6
  1
Tree distance table, row 3 (every other cell is still N): N N 0 1 0 N
Treedist(3,6) step2
Calculations:
• k = 3, i = 1
• l = 1, j = 1
• Is it a tree? -> yes
Forest distance:
  0 1 2 3 4 5 6
  1 0
Tree distance table, row 3 (every other cell is still N): 0 N 0 1 0 N
T(3,1) = min of:
  F(Λ,1) + γ(3 -> Λ) = 2
  F(3,Λ) + γ(Λ -> 1) = 2
  F(Λ,Λ) + γ(3 -> 1) = 0
Treedist(3,6) step3
Calculations:
• k = 3, i = 1
• l = 2, j = 2
• Is it a tree? -> yes
Forest distance:
  0 1 2 3 4 5 6
  1 0 1
Tree distance table, row 3 (every other cell is still N): 0 1 0 1 0 N
T(3,2) = min of:
  F(Λ,1..2) + γ(3 -> Λ) = 3
  F(3,1) + γ(Λ -> 2) = 1
  F(Λ,1) + γ(3 -> 2) = 2
Treedist(3,6) step4
Calculations:
• k = 3, i = 1
• l = 3, j = 3
• Is it a tree? -> NO
• m = 0
• n = 2
Forest distance:
  0 1 2 3 4 5 6
  1 0 1 2
Tree distance table, row 3 (unchanged; every other cell is still N): 0 1 0 1 0 N
F(3,1..3) = min of:
  F(Λ,1..3) + γ(3 -> Λ) = 4
  F(3,1..2) + γ(Λ -> 3) = 2
  F(Λ,1..2) + T(3,3) = 2
It is NOT a tree, so:
• We can NOT look at the diagonal.
Treedist(3,6) step5
Calculations:
• k = 3, i = 1
• l = 4, j = 4
• Is it a tree? -> NO
• m = 0
• n = 2
Forest distance:
  0 1 2 3 4 5 6
  1 0 1 2 3
Tree distance table, row 3 (unchanged; every other cell is still N): 0 1 0 1 0 N
F(3,1..4) = min of:
  F(Λ,1..4) + γ(3 -> Λ) = 5
  F(3,1..3) + γ(Λ -> 4) = 3
  F(Λ,1..2) + T(3,4) = 3
Size of forest
[Figure: t1 with its key roots (3, 4 and 6), and nodes 1 to 4 of t2.]
• Tree distance (3,4) -> starting in [3,3] (most left): F(3..3,3..3), F(3..3,3..4)
• Tree distance (4,4) -> starting in [4,3] (most left): F(4..4,3..3), F(4..4,3..4)
• Tree distance (6,4) -> starting in [1,3] (most left).
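• Tree distance (6,4) -> starting in [1,3] (most left):
  F(1..1,3..4), tree
  F(1..2,3..4), tree
  F(1..3,3..4), forest
  F(1..4,3..4), forest
  F(1..5,3..4), tree
  F(1..6,3..4), tree
[Figure: t1 with its key roots, nodes 1 to 4 of t2, and the forest made of nodes 1 to 4 of t1.]
We need the result of the matching for the subtrees rooted at nodes 3 and 4, and it happens that they are key roots!
With F1 and F2 the two forests, v and w the roots of their rightmost trees, and T(v) the subtree rooted at v:
δ(F1, F2) = min of:
    δ(F1 - v, F2) + γ(v -> Λ)                   delete
    δ(F1, F2 - w) + γ(Λ -> w)                   add
    δ(F1 - T(v), F2 - T(w)) + δ(T(v), T(w))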
Practice:
• Calculate the tree distance between these two trees and observe the similarity with the Levenshtein algorithm. (Only 1 key root.)
[Figure: two chain-shaped trees, G-o-o-d and G-o-d.]
Forest table = tree table = Levenshtein table.
Tree distance
Time:
Space:
Calculate the alignment
The algorithm can be extended to produce the alignment. All the tables are needed. These tables correspond to the same example as before.
Tree distance
      0  1  2  3  4  5  6
  0   -  -  -  -  -  -  -
  1   -  0  1  0  1  0  5
  2   -  1  0  1  0  1  4
  3   -  0  1  0  1  0  5
  4   -  0  1  0  1  0  5
  5   -  4  3  4  3  4  2
  6   -  5  4  5  4  5  3
Forest distance (6 × 6)
  0  1  2  3  4  5  6
  1  0  1  2  3  4  5
  2  1  0  1  2  3  4
  3  2  1  0  1  2  3
  4  3  2  1  2  1  2
  5  4  3  2  3  2  2
  6  5  4  3  4  3  3
Calculate the alignment (continued)
Using the same two tables, the matched pairs are collected one at a time:
(6,6)
(6,6), (4,5)
(6,6), (4,5), (2,2)
(6,6), (4,5), (2,2), (1,1).
THANKS FOR YOUR ATTENTION