Similarity and Correction of Strings and Trees: Towards a Correction of XML Documents
Agata SAVARY
Université François-Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
Seminarium IPIPAN, 24 April 2006
String-to-string correction
A. Savary Seminarium IPIPAN, 24/04/2006 3
Traditional string-to-string correction
(Wagner & Fischer 1974, Lowrance & Wagner 1975, …)
• CONTEXT:
– Finite set of symbols (alphabet)
– Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters) with their costs (usually 1 per operation)
– Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (sums of the costs of the editing operations involved)
– Measure of similarity between words A and B (edit distance or error distance): minimum cost of all edit sequences transforming A into B
• INPUT:
– Two words A and B
• OUTPUT:
– Distance between A and B
Examples of elementary edit operations
• Insertion of a letter: monter → montaer, monter → montrer
• Deletion of a letter: monter → montr, monter → monte
• Replacement of a letter by another: monter → ponter, monter → conter
• Transposition of two adjacent letters: monter → mnoter, monter → montre
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
Edit sequence
• Edit sequence = sequence of elementary edit operations
• For each pair of words X and Y, many edit sequences exist that transform X into Y.
• Example 1: transforming sorting into string:
– sorting → srting → sting → string (3 operations)
– sorting → sotring → string (2 operations)
– sorting → srting → string (2 operations)
– sorting → strting → string (2 operations)
– sorting → srting → sting → sing → sring → string (5 operations)
– ...
• Example 2: transforming abc into ca:
– abc → ac → ca (2 operations)
– abc → cabc → cac → ca (3 operations)
• From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. sequences in which the operations are performed from left to right and no further operation may alter the result of a previous operation.
[Slide callouts: four of the sequences above are marked "Linear sequence".]
Edit (error) distance
• Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence
– sorting → srting → sting → string (3 operations): cost = 3
– sorting → sotring → string (2 operations): cost = 2
– sorting → srting → sting → sing → sring → string (5 operations): cost = 5
• Edit distance (error distance) between two words X and Y, ed(X,Y) = minimal cost of all edit sequences transforming X into Y:
– ed(sorting, string) = 2
– ed(abc, ca) = 2, if all edit sequences are taken into account
– ed(abc, ca) = 3, if only the linear edit sequences are taken into account
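The distance obtained when all edit sequences are taken into account can be checked by brute force. The following is a minimal sketch, not part of the talk: it explores words reachable by one elementary operation at a time, breadth first, until the target is reached. The names `one_edit` and `ed_unrestricted` are hypothetical, and restricting inserted or replacing letters to those occurring in the two words is an assumption made only to keep the search small.

```python
from collections import deque


def one_edit(word, alphabet):
    """All words reachable from `word` by exactly one elementary operation."""
    out = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])                  # insertion
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])                      # deletion
        for c in alphabet:
            out.add(word[:i] + c + word[i + 1:])              # replacement
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # transposition
    out.discard(word)  # an operation that changes nothing is not counted
    return out


def ed_unrestricted(x, y):
    """Length of a shortest edit sequence from x to y, found breadth first."""
    alphabet = sorted(set(x) | set(y))  # assumption: no other letters are needed
    seen = {x}
    frontier = deque([(x, 0)])
    while frontier:
        word, cost = frontier.popleft()
        if word == y:
            return cost
        for nxt in one_edit(word, alphabet):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, cost + 1))


print(ed_unrestricted("abc", "ca"))          # 2 (abc -> ac -> ca)
print(ed_unrestricted("sorting", "string"))  # 2
```

Because every sequence is allowed here, including the non-linear abc → ac → ca, the result for abc and ca is 2 rather than the value 3 obtained with linear sequences only.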
Calculating the edit distance (1/4)
Notation: word X = x_1 x_2 ... x_i ... x_n; the prefix of length i of X: X[i] = x_1 x_2 ... x_i
It is possible to calculate the distance between two prefixes X[i+1] and Y[j+1] on the basis of the distances between shorter prefixes: 3 cases.
Case 1: if x_{i+1} = y_{j+1} then
ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])
[Figure: the words X and Y with the prefixes X[i], X[i+1] and Y[j+1] and the positions i and j marked]
Calculating the edit distance (2/4)
Case 2: if x_i = y_{j+1} and x_{i+1} = y_j (the last 2 characters may be inverted) then 4 sub-cases are possible:
• The cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of x_i and x_{i+1}:
ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1 (transposition’s cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}:
ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement’s cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}:
ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion’s cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}:
ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion’s cost)
Calculating the edit distance (3/4)
Case 3: OTHERWISE (if x_{i+1} ≠ y_{j+1}, and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)) 3 sub-cases are possible:
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}:
ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement’s cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}:
ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion’s cost)
• The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}:
ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion’s cost)
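Calculating the edit distance (4/4)
Edit distance between X[i] and Y[j], recursive definition. For i = 0,...,m and j = 0,...,n:

\[
\begin{aligned}
1^\circ\quad & \mathrm{ed}(X[-1],Y[j]) = \mathrm{ed}(X[i],Y[-1]) = \max(m,n) \\
2^\circ\quad & \mathrm{ed}(X[0],Y[j]) = j, \qquad \mathrm{ed}(X[i],Y[0]) = i \\
3^\circ\quad & \mathrm{ed}(X[i+1],Y[j+1]) =
\begin{cases}
\mathrm{ed}(X[i],Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1]),\ \mathrm{ed}(X[i-1],Y[j-1])\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{\mathrm{ed}(X[i],Y[j]),\ \mathrm{ed}(X[i+1],Y[j]),\ \mathrm{ed}(X[i],Y[j+1])\} & \text{otherwise}
\end{cases}
\end{aligned}
\]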
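The recursive definition can be turned into a table-filling procedure. The following Python function is a minimal sketch under unit costs, not the author's implementation; the name `edit_matrix` is hypothetical. Instead of the sentinel value max(m,n) at index -1 (rule 1°), it guards the transposition case with an explicit bounds check, which is an implementation choice of this sketch.

```python
def edit_matrix(x, y):
    """Return the matrix d where d[i][j] = ed(X[i], Y[j]) with unit costs."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # rule 2: i deletions
    for j in range(n + 1):
        d[0][j] = j                      # rule 2: j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:     # rule 3, first case: letters agree
                d[i][j] = d[i - 1][j - 1]
                continue
            best = min(d[i - 1][j - 1],  # replacement
                       d[i][j - 1],      # insertion
                       d[i - 1][j])      # deletion
            # rule 3, second case: the last two letters are inverted
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])  # transposition
            d[i][j] = 1 + best
    return d


print(edit_matrix("sorting", "string")[-1][-1])  # 2
print(edit_matrix("abc", "ca")[-1][-1])          # 3
```

The second call reproduces the value given for linear edit sequences: the table never applies a further operation to letters that were already transposed, so abc → ac → ca is not available.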
Calculating the edit distance: dynamic programming

        s  o  r  t  i  n  g
     0  1  2  3  4  5  6  7
  s  1  0  1  2  3  4  5  6
  t  2  1  1  2  2  3  4  5
  r  3  2  2  1  2  3  4  5
  i  4  3  3  2  2  2  3  4
  n  5  4  4  3  3  3  2  3
  g  6  5  5  4  4  4  3  2

• Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word and the prefix [1,...,j] of the other word
• Cell [n,m] contains the edit distance between the 2 words
Dynamic programming: case 1
Condition: x_{i+1} = y_{j+1}. The cell [i+1, j+1] being calculated is the last filled cell (row t, column t).

        s  o  r  t  i  n  g
     0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  ?  ?  ?  ?  ?  ?  ?  ?
  i  ?  ?  ?  ?  ?  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?
Dynamic programming: case 2
Condition: x_i = y_{j+1} and x_{i+1} = y_j. The cell [i+1, j+1] being calculated is the last filled cell (row r, column t).

        s  o  r  t  i  n  g
     0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  3  2  2  1  2  ?  ?  ?
  i  ?  ?  ?  ?  ?  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?
Dynamic programming: case 3
Condition: x_{i+1} ≠ y_{j+1} and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j). The cell [i+1, j+1] being calculated is the last filled cell (row i, column t).

        s  o  r  t  i  n  g
     0  1  2  3  4  ?  ?  ?
  s  1  0  1  2  3  ?  ?  ?
  t  2  1  1  2  2  ?  ?  ?
  r  3  2  2  1  2  ?  ?  ?
  i  4  3  3  2  2  ?  ?  ?
  n  ?  ?  ?  ?  ?  ?  ?  ?
  g  ?  ?  ?  ?  ?  ?  ?  ?
String-to-language correction
String-to-language correction: problem definition
• CONTEXT:
– Finite set of symbols (alphabet)
– Elementary edit operations on symbols (as before) with their costs (1 per operation)
– Edit sequences (as before)
– Edit distance (error distance) between words: as before
• INPUT:
– Regular grammar describing words (a finite set of words in particular)
– Incorrect word A (unrecognizable by the grammar)
– Threshold t
• OUTPUT:
– A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)
String-to-language correction: simplistic approach
• METHOD:
– For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
– Propose candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
• FEASIBILITY:
– Impossible in case of infinite languages
• COMPLEXITY:
– O(n * m * |D|)
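For a finite word list the simplistic approach amounts to one full matrix per dictionary word. The sketch below illustrates this under unit costs; `ed`, `naive_correct` and the three-word list are hypothetical and serve only as an example.

```python
def ed(x, y):
    """Edit distance with insertion, deletion, replacement and transposition."""
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        d[i][0] = i
    for j in range(len(y) + 1):
        d[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
                continue
            best = min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d[-1][-1]


def naive_correct(word, dictionary, t):
    """Compare `word` with every dictionary entry; keep those within t."""
    scored = ((ed(word, b), b) for b in dictionary)
    return sorted((dist, b) for dist, b in scored if dist <= t)


# Hypothetical word list, for illustration only.
print(naive_correct("aply", ["apple", "apples", "apply"], 2))
# [(1, 'apply'), (2, 'apple')]
```

Every call to `ed` starts from an empty matrix, so nothing is shared between words with a common prefix, and the loop cannot terminate for an infinite language.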
String-to-language correction: threshold-controlled depth-first exploration of an FSA (Oflazer 1996, …)
String correction with respect to a deterministic FSA (1/4)
Word to be corrected: *aply, threshold 2
[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, l, y, e, s]

        a  p  p  l  e
     0  1  2  3  4  5
  a  1  0  1  2  3  4
  p  2  1  0  1  2  3
  l  3  2  1  1  1  2
  y  4  3  2  2  2  2

• The part of the matrix for the prefix appl is calculated only once for all valid words sharing this prefix
• Each time a transition is followed, a new column is calculated in the edit distance matrix
• If we get to a final state and the edit distance remains within the threshold, a new candidate has been found
Candidates found: apple
String correction with respect to a deterministic FSA (2/4)
Word to be corrected: *aply, threshold 2
[Figure: the same FSA; the transition s has been followed after e]

        a  p  p  l  e  s
     0  1  2  3  4  5  6
  a  1  0  1  2  3  4  5
  p  2  1  0  1  2  3  4
  l  3  2  1  1  1  2  3
  y  4  3  2  2  2  2  3

Candidates found: apple
String correction with respect to a deterministic FSA (3/4)
Word to be corrected: *aply, threshold 2
[Figure: the same FSA; the exploration backtracks from the transition s]

        a  p  p  l  e  s
     0  1  2  3  4  5  6
  a  1  0  1  2  3  4  5
  p  2  1  0  1  2  3  4
  l  3  2  1  1  1  2  3
  y  4  3  2  2  2  2  3

• Backtracking results in deleting the current column
Candidates found: apple
String correction with respect to a deterministic FSA (4/4)
Word to be corrected: *aply, threshold 2
[Figure: the same FSA; after backtracking, the transition y is followed from the prefix appl]

        a  p  p  l  y
     0  1  2  3  4  5
  a  1  0  1  2  3  4
  p  2  1  0  1  2  3
  l  3  2  1  1  1  2
  y  4  3  2  2  2  1

• Backtracking results in deleting the current column
Candidates found: apple, apply
Controlling the search space by the threshold
Word to be corrected: abcbb, t = 2
[Figure: FSA with states 1, 2, 8, 9 and transitions labelled a, b, c, d]

               a  b  b  b  b  b  b
       -2 -1   0  1  2  3  4  5  6
   -2   +  +   +  +  +  +  +  +  +
   -1   +  0   1  2  3  4  5  6  7
 a  0   +  1   0  1  2  3  4  5  6
 b  1   +  2   1  0  1  2  3  4  5
 c  2   +  3   2  1  1  2  3  4  5
 b  3   +  4   3  2  1  1  2  3  4
 b  4   +  5   4  3  2  1  1  2  3

• If the current column exceeds the threshold, the whole path is cut off
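The exploration described on these slides can be sketched as a recursive depth-first search that carries the last two matrix columns along each path. This is a simplified illustration, not Oflazer's algorithm as published: the cut-off test used here is "every value in the current column exceeds t", which is an assumption of this sketch, and the names `next_column` and `correct` as well as both automata below are hypothetical encodings chosen for the example.

```python
def next_column(word, prev, prev2, sym, prev_sym):
    """Matrix column obtained after following one more transition `sym`."""
    col = [prev[0] + 1]
    for i in range(1, len(word) + 1):
        cost = 0 if word[i - 1] == sym else 1
        best = min(prev[i - 1] + cost,  # match or replacement
                   prev[i] + 1,         # insertion of sym
                   col[i - 1] + 1)      # deletion of word[i-1]
        if (prev2 is not None and i > 1
                and word[i - 1] == prev_sym and word[i - 2] == sym):
            best = min(best, prev2[i - 2] + 1)  # transposition
        col.append(best)
    return col


def correct(word, fsa, start, finals, t):
    """All words accepted by the deterministic FSA within distance t of `word`.

    `fsa` maps a state to a dict {symbol: next state}.
    """
    results = []
    if start in finals and len(word) <= t:
        results.append(("", len(word)))

    def dfs(state, path, prev, prev2):
        for sym, nxt in fsa.get(state, {}).items():
            col = next_column(word, prev, prev2, sym, path[-1] if path else "")
            if min(col) > t:
                continue                 # cut off the whole path
            if nxt in finals and col[-1] <= t:
                results.append(("".join(path) + sym, col[-1]))
            dfs(nxt, path + [sym], col, prev)  # returning = backtracking

    dfs(start, [], list(range(len(word) + 1)), None)
    return sorted(results)


# Hypothetical automaton accepting apple, apples and apply.
words_fsa = {0: {"a": 1}, 1: {"p": 2}, 2: {"p": 3}, 3: {"l": 4},
             4: {"e": 5, "y": 7}, 5: {"s": 6}}
print(correct("aply", words_fsa, 0, {5, 6, 7}, 2))
# [('apple', 2), ('apply', 1)]

# Hypothetical automaton for the infinite language ab*c + db*.
loop_fsa = {0: {"a": 1, "d": 3}, 1: {"b": 1, "c": 2}, 3: {"b": 3}}
print([w for w, _ in correct("abcbb", loop_fsa, 0, {2, 3}, 2)])
# ['abbbbc', 'abbbc', 'abbc', 'abc', 'dbbb', 'dbbbb']
```

The columns for the shared prefix appl are computed once and reused for both continuations, and the search over the cyclic automaton terminates because each additional b eventually pushes the whole column above the threshold.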
Tree-to-tree correction
Tree-to-tree correction (Selkow 1977, …)
• CONTEXT:
– Finite set of node symbols (alphabet)
– Elementary edit operations on trees:
  • Insertion of a leaf
  • Deletion of a leaf
  • Renaming of a node (leaf or internal node)
– Non-negative cost for each elementary operation
– Edit sequences (sequences of edit operations) with their costs (sums of the costs of the editing operations involved)
– Edit distance between two trees A and B: minimum cost of all edit sequences transforming A into B
• INPUT:
– Two trees A and B
• OUTPUT:
– Distance between A and B
Comparing two trees (Selkow 1977, …)
• A partial tree A0:i consists of the root of A and its subtrees A0,...,Ai
• The comparison is based on comparing roots, and then recursively comparing the roots’ subtrees
[Figure: two example trees, A with root(A) and subtrees A0, A1, A2, and B with root(B) and subtrees B0, B1, B2, B3; the partial trees A0:1 and B0:2 are marked]
Edit distance matrix between two trees (Selkow 1977, …)

       -1   0   1   2   3
  -1    1   4  14  15  16
   0    4   2  12  13  14
   1   15  13   3   4   5
   2   16  14   4   4   4

• Cell [-1,-1] contains the cost of renaming root(A) into root(B)
• Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j
• Cell [n,m] contains the edit distance between the 2 trees
Calculation of the tree matrix (Selkow 1977, …)

       -1   0   1   2   3
  -1    1   4  14  15  16
   0    4   2  12  13  14
   1   15  13   3   4   5
   2   16  14   4   4   ?

The cell [i,j] marked ? is calculated by:
• Adding the edit distance between Ai and Bj (here +0)
• Adding the cost of deleting Ai (here +1)
• Adding the cost of inserting Bj (here +1)
• Taking the minimum (here min(4+0, 5+1, 4+1) = 4)
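The matrix calculation for trees can be sketched recursively. The following is a minimal illustration in the spirit of Selkow's method, assuming unit costs, so that inserting or deleting a whole subtree costs its number of nodes; the tuple representation and the names `size` and `tree_distance` are hypothetical. Rows and columns are indexed from 0 here, which corresponds to index -1 on the slides.

```python
def size(tree):
    """Number of nodes of a tree given as (label, [children])."""
    return 1 + sum(size(child) for child in tree[1])


def tree_distance(a, b):
    """Edit distance between trees a and b under unit costs."""
    ka, kb = a[1], b[1]
    d = [[0] * (len(kb) + 1) for _ in range(len(ka) + 1)]
    d[0][0] = 0 if a[0] == b[0] else 1           # renaming the root
    for i in range(1, len(ka) + 1):
        d[i][0] = d[i - 1][0] + size(ka[i - 1])  # deleting subtree by subtree
    for j in range(1, len(kb) + 1):
        d[0][j] = d[0][j - 1] + size(kb[j - 1])  # inserting subtree by subtree
    for i in range(1, len(ka) + 1):
        for j in range(1, len(kb) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + tree_distance(ka[i - 1], kb[j - 1]),
                d[i - 1][j] + size(ka[i - 1]),   # deleting the subtree of a
                d[i][j - 1] + size(kb[j - 1]),   # inserting the subtree of b
            )
    return d[-1][-1]


# Hypothetical example: rename c into d, then insert the leaf e below it.
A = ("a", [("b", []), ("c", [])])
B = ("a", [("b", []), ("d", [("e", [])])])
print(tree_distance(A, B))  # 2
```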
Extension to the correction of XML documents
• The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
• The "horizontal" correction on a siblings’ level is similar to the string-to-language correction (Oflazer 1996)
• The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)
[Figure: example XML tree built from the elements root, x, y, z and the sibling sequence a, b, c, b, b]
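The validity condition on a siblings' level can be illustrated by matching the sequence of child element names against the regular expression. The sketch below covers only the validity test, not the correction itself; it assumes single-letter element names so that the sequence can be written as a word, and `children_valid` is a hypothetical name.

```python
import re

# E = ab*c + db* from the slide, in Python regular expression syntax.
E = re.compile(r"ab*c|db*")


def children_valid(labels):
    """True if the sequence of child labels forms a word of the language E."""
    return E.fullmatch("".join(labels)) is not None


print(children_valid(["a", "b", "b", "c"]))       # True
print(children_valid(["a", "b", "c", "b", "b"]))  # False
```

The second sequence, a b c b b, is not a word of E, which makes it a candidate for the horizontal correction.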
Main idea
• String-to-string (Wagner & Fischer 1974)
• String-to-(regular) language (Oflazer 1996)
• Tree-to-tree (Selkow 1977)
• Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)
Edit distance matrix with edit sequences
Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j

       -1    0    1    2                                   3
  -1   ...   ...  ...  ...                                 ...
   0   ...   ...  ...  ...                                 ...
   1   ...   ...  ...  [3, <(R,0.1,f),(D,1.1,/),(I,2,e)>]  ...
   2   ...   ...  ...  ...                                 ...
Bibliography
• Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario
• Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302
• Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402
• Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183
• Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477
• Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89
• Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, Vol. 6(6), pp. 184-186
• Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, Vol. 17(5), pp. 265-268
• Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173
Some details of the state of the art
• Wagner & Fischer (1974):
– Elegant and solid theoretical definition of the string-to-string correction problem
– 3 elementary operations on single letters admitted (insertion, deletion, replacement)
– Model of a trace describing the edit distance between two strings
– Dynamic programming method
• Lowrance & Wagner (1975):
– Additional elementary operation: inversion of two adjacent letters
– Restriction of the cost function
• Du & Chang (1992):
– Cost 1 for each elementary operation
– Restriction to linear editing sequences
– Application to the nearest neighbor search in a dictionary, with a threshold
• Oflazer (1996):
– Nearest-neighbor search in finite-state automata
– Application to large natural-language dictionaries
• Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):
– Tree-to-tree correction problem
• Mihov & Schulz (2004):
– Levenshtein automaton
– Backward dictionary
• Bouchou, B. & Halfeld Ferrari Alves, M. (2003):
– Incremental validation of XML documents resulting from updates: human-computer interaction