Approximate Schemas. Michel de Rougemont, LRI, University Paris II
1. Distance between words (structures), O(1). Edit distance with moves
2. Distance between a word (structure) and a class of words (structures), O(1)
3. Distance between two languages (classes), Poly.
4. Applications: regular languages, DTDs
Distances between languages
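$L_1 \subseteq_\varepsilon L_2$ if for all (except finitely many) $v \in L_1$, $v$ is $\varepsilon$-close to $L_2$
$L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$
$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$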
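1. Approximate Satisfiability and Equivalence
1. Satisfiability: Tree |= F
2. Approximate satisfiability: Tree $\models_\varepsilon$ F
3. Approximate equivalence: $F \equiv_\varepsilon G$
Image on a class K of trees
[Figure: within the class K, the trees satisfying F, the trees ε-close to F and the trees ε-far from F; likewise for G.]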
An ε-tester for a property F is a probabilistic algorithm A such that:
• If U |= F, A accepts
• If U is ε-far from F, A rejects with high probability
• Time(A) is independent of n.
Tester usually implies a linear time corrector.
Self-testers and correctors for Linear Algebra, Blum & Kannan, 1989
Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994
Testers for graph properties: k-colorability, Goldreich et al., 1996
Graph properties have testers, Alon et al., 1999
Regular languages have testers, Alon et al., 2000
Testers for regular tree languages, de Rougemont and Magniez, ICALP 2004
Testers on a class K
1. Classical Edit Distance:
Insertions, Deletions, Modifications
2. Edit Distance with moves
0111000011110011001
0111011110000011001
3. Edit Distance with Moves generalizes to Trees
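The two binary words in item 2 can be checked mechanically. The sketch below is only an illustration: the helper names `apply_move` and `hamming`, and the chosen indices, are assumptions introduced here and do not come from the talk.

```python
def apply_move(w, i, j, t):
    """Cut the factor w[i:j] out of w and re-insert it at index t of the remainder."""
    block = w[i:j]
    rest = w[:i] + w[j:]
    return rest[:t] + block + rest[t:]


def hamming(a, b):
    """Number of positions where two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


W1 = "0111000011110011001"
W2 = "0111011110000011001"

# Moving the factor "000" (positions 5..7) behind the following "1111" yields W2.
moved = apply_move(W1, 5, 8, 9)
print(moved == W2)       # a single move is enough
print(hamming(W1, W2))   # number of mismatching positions
```

One move transforms the first word into the second, whereas the two words disagree in 6 positions, which is why a distance that counts a move as one operation is much smaller here.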
2. Equality tester
Block and uniform statistics
W=001010101110…… length n, subword of length k, n/k blocks
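Block statistics (the n/k consecutive blocks of length k):
$\mathrm{b.stat}(W) = \frac{k}{n}\,(n_1, n_2, \dots, n_{2^k})$, where $n_1$ = # of "00...0", $n_2$ = # of "00...1", ..., $n_{2^k}$ = # of "11...1"
For k=2, n/k=6: $\mathrm{b.stat}(W) = \frac{1}{6}\,(1, 0, 4, 1)$ (counts of 00, 01, 10, 11)
Uniform statistics (all n-k+1 subwords of length k): $\mathrm{u.stat}(W) = \frac{1}{11}\,(1, 4, 4, 2)$
$d_1 = |\mathrm{u.stat}(W) - \mathrm{u.stat}(W')|_1$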
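The block and uniform statistics of the example word W=001010101110 can be recomputed with a few lines of Python. This is a minimal sketch: the function names `b_stat`, `u_stat` and `d1`, and the ordering of the coordinates (00, 01, 10, 11), are choices made here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def _freq(pieces, k, alphabet="01"):
    """Frequency vector of the given length-k pieces, in lexicographic order."""
    counts = Counter(pieces)
    return [Fraction(counts["".join(u)], len(pieces))
            for u in product(alphabet, repeat=k)]


def b_stat(w, k):
    """Block statistics: frequencies over the consecutive blocks of length k."""
    return _freq([w[i:i + k] for i in range(0, len(w) - k + 1, k)], k)


def u_stat(w, k):
    """Uniform statistics: frequencies over all n-k+1 subwords of length k."""
    return _freq([w[i:i + k] for i in range(len(w) - k + 1)], k)


def d1(p, q):
    """L1 distance between two statistics vectors."""
    return sum(abs(x - y) for x, y in zip(p, q))


W = "001010101110"
print(b_stat(W, 2))   # block frequencies for k=2
print(u_stat(W, 2))   # subword frequencies for k=2
```

Exact fractions are used so that the result can be compared directly with the vectors on the slide.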
Goal: d1 approximates the distance
Let ε =1/k : For n>n0 dist – ε.n < d1 < dist + ε.n
Practical application: ε=10^-2, hence k=100, stat dimension 2^100
Words of length n=10^9: d1 is approximated by N samples, with a good approximation after N=O(1/ε^3) trials.
Remarks:
1. Distance with moves. W = 000…0001111…111, W' = 1111…111000…000: one move transforms W into W', and (when k divides n/2) $\mathrm{b.stat}(W) = \mathrm{b.stat}(W') = (0.5, 0, \dots, 0, 0.5)$.
2. Robustness to noise. If W, W' are noisy inputs (but ε-close), the method still works.
3. Random words are close with moves, far without.
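Remark 3 can be illustrated numerically. The sketch below draws two random binary words and compares their uniform statistics with the fraction of mismatching positions; the latter is only a crude stand-in for a distance without moves (an assumption made here), and the seed, n and k are arbitrary illustrative values.

```python
import random
from collections import Counter


def u_stat(w, k):
    """Uniform statistics as a dict: frequency of each subword of length k."""
    total = len(w) - k + 1
    counts = Counter(w[i:i + k] for i in range(total))
    return {u: c / total for u, c in counts.items()}


def l1(p, q):
    """L1 distance between two frequency dicts."""
    return sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in set(p) | set(q))


rng = random.Random(0)   # fixed seed, only for reproducibility
n, k = 20000, 3
w1 = "".join(rng.choice("01") for _ in range(n))
w2 = "".join(rng.choice("01") for _ in range(n))

stat_gap = l1(u_stat(w1, k), u_stat(w2, k))           # statistics nearly agree
mismatch = sum(a != b for a, b in zip(w1, w2)) / n    # about half the positions differ
print(stat_gap, mismatch)
```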
Tester for equality of strings
Edit distance with moves. NP-complete problem, but O(1)-approximable.
Uniform statistics: W=001010101110, $\mathrm{u.stat}(W) = \frac{1}{11}\,(1, 4, 4, 2)$
Theorem 1. |u.stat(w)-u.stat(w')| approximates dist(w,w')/n.
Sample N subwords of length k, compute Y(w) and Y(w'):
$Y(w) = \frac{1}{N}\sum_{i=1}^{N} X_i$, where $X_i = (0, \dots, 0, 1, 0)$ is the indicator vector of the i-th sampled subword; Y(w') is defined in the same way from samples of w'.
Theorem 2. Y(w) approximates u.stat(w).
Corollary. |Y(w)-Y(w')| approximates dist(w,w')/n.
Tester: If |Y(w)-Y(w')| < ε, accept, else reject.
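The sampling tester described above can be sketched as follows. This is an illustration under stated assumptions rather than the talk's implementation: the names `sample_stat` and `equality_tester`, the seed, and the parameter values k, ε and N are all hypothetical choices.

```python
import random
from collections import Counter


def sample_stat(w, k, N, rng):
    """Y(w): empirical frequencies of N uniformly sampled subwords of length k."""
    counts = Counter()
    for _ in range(N):
        i = rng.randrange(len(w) - k + 1)
        counts[w[i:i + k]] += 1
    return {u: c / N for u, c in counts.items()}


def l1(p, q):
    """L1 distance between two frequency dicts."""
    return sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in set(p) | set(q))


def equality_tester(w, w2, k, eps, N, seed=0):
    """Accept (True) when the sampled statistics differ by less than eps."""
    rng = random.Random(seed)
    return l1(sample_stat(w, k, N, rng), sample_stat(w2, k, N, rng)) < eps


# Illustrative inputs: a random word and the same word after one large move.
gen = random.Random(42)
base = "".join(gen.choice("01") for _ in range(10000))
moved = base[5000:] + base[:5000]

print(equality_tester(base, moved, k=3, eps=0.2, N=20000))              # close words
print(equality_tester("0" * 10000, "1" * 10000, k=3, eps=0.2, N=2000))  # far words
```

The running time depends on N and k only, not on the length of the words, which is the point of the tester.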
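3a. Tester for regular words
Definition: L is a regular language and A an automaton for L. Test w in L.
[Figure: automaton A with connected components $C_0, C_1, C_2, C_3, C_4$, from init to accept.]
Admissible path: $Z = C_0 \dots C_2\, C_3\, C_4$
A word W is Z-feasible if there are two states $q \in C_i$ and $q' \in C_j$ such that $q \xrightarrow{W} q'$ and $C_i \dots C_j \subseteq Z$.
$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$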
Tester for regular words
Tester. Input: W, A, ε
For $i = 1, \dots, \log(m/\varepsilon)$:
  Choose $N_i$ random subwords $w_{ij}$ of W of size $2^{i+1}$ ($N_i$ depends on $2^i$, $m^3$ and $1/\varepsilon$)
For every admissible path Z:
  If all $w_{ij}$ of W are Z-feasible, ACCEPT.
else REJECT.
Theorem: Tester(W, A, ε) is an ε-tester for L(A).
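The structure of the tester can be sketched in Python under a strong simplification, stated loudly: the automaton is treated as a single strongly connected component, so there is one admissible path and a subword is "feasible" when some state can read it entirely. The names `feasible`, `regular_tester` and the parameter `samples_per_round` (standing in for $N_i$) are assumptions made for this sketch.

```python
import math
import random


def feasible(sub, delta, states):
    """True if some state q can read `sub` completely (q --sub--> q')."""
    for q in states:
        cur = q
        for ch in sub:
            cur = delta.get((cur, ch))
            if cur is None:      # no transition: this start state fails
                break
        else:
            return True
    return False


def regular_tester(w, delta, states, eps, samples_per_round=20, seed=0):
    """Accept (True) if every sampled subword is feasible, else reject."""
    rng = random.Random(seed)
    m = len(states)
    rounds = max(1, math.ceil(math.log2(m / eps)))
    for i in range(1, rounds + 1):
        length = min(2 ** (i + 1), len(w))   # subwords of size 2^(i+1)
        for _ in range(samples_per_round):
            start = rng.randrange(len(w) - length + 1)
            if not feasible(w[start:start + length], delta, states):
                return False
    return True


# Illustrative automaton for (01)*: state 0 reads '0', state 1 reads '1'.
states = [0, 1]
delta = {(0, "0"): 1, (1, "1"): 0}

print(regular_tester("01" * 100, delta, states, eps=0.1))  # word in the language
print(regular_tester("0" * 200, delta, states, eps=0.1))   # word far from it
```

Handling several connected components would require repeating the feasibility check for every admissible path Z, as in the algorithm above.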
Proof schema of the Tester
Theorem: Regular words are testable.
Robustness lemma: If W is ε-far from L, then for every admissible path Z, there exists $i$ (logarithmic in $m/\varepsilon$) such that the number of Z-infeasible subwords of size $2^{i+1}$ is at least a fraction of $n$ that depends on $2^{i+1}$, $m$ and $\varepsilon$.
Splitting lemma: if W is far from L, there are many disjoint infeasible subwords.
Amplifying lemma: If there are many infeasible subwords, there are many short ones.
Merging words
Merging lemma: Let Z be an admissible path, and let F be a Z-feasible cut of size h'. Then $\mathrm{Dist}(F, L) \le m^2 h'$.
[Figure: the words of the cut arranged along the connected components C of the path Z.]
Take each word $w_i \in F$ and split it along its connected components, removing single letters. Rearrange all the words of the same component in its Z-order. Add gluing words $g_i$ to obtain W' in L:
$W' = w_0\, g_1\, w_1\, g_2\, w_2 \dots$
Splitting
Splitting lemma: If Z is an admissible path and W a word s.t. dist(W,L) > h, then W has more than $h/m^2$ disjoint Z-infeasible subwords ($h = \varepsilon n$).
Proof by contraposition:
W has less than $h' = h/m^2$ minimal Z-infeasible and disjoint subwords.
Removing the last letters provides a feasible cut F, with $\mathrm{Dist}(W, F) \le h'$.
By the merging lemma, $\mathrm{Dist}(F, L) \le m^2 h'$.
Hence $\mathrm{Dist}(W, L) \le h' + m^2 h'$.
And $\mathrm{Dist}(W, L) \le h$, up to the additive term $h' = h/m^2$.
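4. Equivalence testing of Regular Languages
1. Inclusion: $L_1 \subseteq_\varepsilon L_2$ if for all (except finitely many) $v \in L_1$, $v$ is $\varepsilon$-close to $L_2$
2. Equivalence: $L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$
Equivalence tester
If $L_1 = L_2$ then A accepts
If $\neg(L_1 \equiv_\varepsilon L_2)$ then A rejects with proba 2/3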
Automata for Regular languages
Basic property: if $w \in L$, then w is close to $v = u_1 \dots u_l$, where $u_1, \dots, u_l$ is a multi-set of A-compatible loops.
Proposition: $H = \text{Convex-Hull}(\mathrm{b.stat}(v_1), \dots, \mathrm{b.stat}(v_t))$, where $v_1, \dots, v_t$ are the A-loops of at most m blocks of length k.
Carathéodory's theorem: in dimension d, the convex hull of N points is the union of the convex hulls of the subsets of d+1 points.
Large loops can be decomposed. Small loops (less than m=|A|) suffice.
Approximate Parikh mapping
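Lemma: For every X in H, there exists w in L s.t. $|X - \mathrm{b.stat}(w)|$ is small.
[Figure: a point X of H and the b-stat(w) of a word w; dist(w, L) is bounded by n times a quantity depending on the distance from b-stat(w) to H.]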
H is a fair representation of L
Construction of H in polynomial time
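$P_t(i,j)$: set of the b.stat of the words of t blocks labelling a path from state i to state j (loops have length at most k.m).
Enumerate all loops:
Number of b-stat is less: some loops have the same b-stat, e.g. ABBA and BBAA. # partitions of a word of length m with « big blocks ».
Construct H by matrix iteration:
$P_{t+1} = P_t \cdot P_1$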
Example
Automaton A:
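Blocks, k=2, m=4, |Σ|=4, |Σ|^k + 1 = 17:
Loops: {(aa,ca:1),(bb:2),(cc,ac:3),(dd:4)}
[Figure: automaton A with states 1, 2, 3, 4 and transitions labelled a, b, c, d; the polytope H_A spanned by the block statistics of the loops aa, ca, ac, cc, bb, dd.]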
Equivalence tester
Tester for w in L (regular): Compute b-stat(w) and H. Decide if dist(w,L) > ε.n. Time is polynomial in m=|A|.
Previous tester was exponential in m.
Tester of $A \equiv_\varepsilon B$:
1. Compute $H_A$ and $H_B$
2. Reject if $H_A$ and $H_B$ are different.
Time polynomial in m=|A,B|
Application: Data Exchange
Source: (01)*11
Target: (1010)*
W=010101011, source. Which structure for the target?
Answer: if the two schemas are close, run a corrector and obtain W’=10101010, distance 3.
If the two schemas are not close, no guarantee.
General situation for data exchange and query answering.
Conclusion
1. Testers and Correctors
2. Constant algorithm for Edit Distance with moves
3a. Testers and Correctors for regular words
3b. Tester for regular trees and corrector for regular trees
4. Equivalence tester for automata
Polynomial time algorithm
Generalization to Büchi automata and Context-Free Tree regular languages
Generalizations
Büchi Automata. Distance on infinite words: two words w, w' are ε-close if $\limsup_{n \to \infty} \frac{\mathrm{dist}(w(n), w'(n))}{n} \le \varepsilon$, where w(n) is the prefix of length n.
A word w is ε-close to a language L if there exists w' in L s.t. w and w' are ε-close.
Statistics: set of accumulation points of $\mathrm{b.stat}(w(n))$ as $n \to \infty$
H: compatible loops of connected components of accepting states
Tester for Büchi Automata: Compute $H_A$ and $H_B$
Reject if $H_A$ and $H_B$ are different.
Equivalence of CF grammars is undecidable; approximate equivalence is in exponential time.
Soundness and Robustness
Let F be a property on a class K of structures U
F is Equality
Soundness: close structures ($\mathrm{dist}(w,w') \le \varepsilon n$) have close statistics
Robustness: far structures ($\mathrm{dist}(w,w') \ge \varepsilon n$) have far statistics
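Robustness of b.stat
Robustness of b-stat: $\mathrm{dist}(w,w') \le \left(\tfrac{1}{2}\,|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| + \varepsilon\right) n$
If $\mathrm{b.stat}(w) = \mathrm{b.stat}(w')$, then $\mathrm{dist}(w,w') \le \varepsilon n$.
If $\mathrm{b.stat}(w) \ne \mathrm{b.stat}(w')$, then construct w'' s.t. $\mathrm{b.stat}(w'') = \mathrm{b.stat}(w')$.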
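The construction of w'' can be sketched as a block rewriting procedure. This is an illustration only: the function names are introduced here, and the second word `W_prime` is a hypothetical example chosen so that its block counts for (00, 01, 10, 11) are (2, 0, 3, 1); the talk does not give W' explicitly.

```python
from collections import Counter


def blocks(w, k):
    """Consecutive blocks of length k (len(w) is assumed divisible by k)."""
    return [w[i:i + k] for i in range(0, len(w), k)]


def match_block_stat(w, w2, k):
    """Rewrite blocks of w until its block counts equal those of w2.

    Returns (w'', number of rewritten blocks). Both words are assumed
    to have the same length.
    """
    bw = blocks(w, k)
    target = Counter(blocks(w2, k))
    surplus = Counter(bw) - target                    # blocks w has too many of
    missing = list((target - Counter(bw)).elements()) # blocks w is lacking
    changed = 0
    for idx, b in enumerate(bw):
        if surplus[b] > 0:
            surplus[b] -= 1
            bw[idx] = missing.pop()
            changed += 1
    return "".join(bw), changed


W = "001010101110"
W_prime = "100010001110"   # hypothetical W' with block counts (2, 0, 3, 1)
W_second, changed = match_block_stat(W, W_prime, 2)
print(W_second, changed)
```

Each rewritten block costs at most k substitutions, and the number of rewritten blocks is (n/k) times half the L1 distance between the two block statistics, which gives the first term of the bound.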
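Example: $\mathrm{b.stat}(W) = \frac{1}{6}\,(1, 0, 4, 1)$, $\mathrm{b.stat}(W') = \frac{1}{6}\,(2, 0, 3, 1)$
# "00" = 1 in W and 2 in W', but # "10" = 4 in W and 3 in W'
W'' is obtained after at most $\frac{|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')|}{2}\, n$ substitutions on w.
W'': take one block "10" in W and change it into "00".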
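Soundness of u.stat
Soundness of u-stat: $\mathrm{dist}(w,w') \le \varepsilon^2 n \;\Rightarrow\; |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6\varepsilon$
Simple edit: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2k}{n-k+1}$
Move w=A.B.C.D, w'=A.C.B.D: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2 \cdot 3(k-1)}{n-k+1} \le \frac{6k}{n}$ (for $n \ge k^2$)
Hence, for $\varepsilon^2 n$ operations and $k = 1/\varepsilon$, $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6\varepsilon$.
Problem: robustness of u.stat? Harder! You need an auxiliary distribution and two key lemmas.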
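The two per-operation bounds can be checked on a concrete word. The sketch below is an illustration: the random word, the seed, the edited position and the cut points of the move are arbitrary choices made here, and exact fractions are used so that the comparisons are not affected by rounding.

```python
import random
from collections import Counter
from fractions import Fraction


def u_stat(w, k):
    """Uniform statistics as a dict of exact frequencies."""
    total = len(w) - k + 1
    counts = Counter(w[i:i + k] for i in range(total))
    return {u: Fraction(c, total) for u, c in counts.items()}


def l1(p, q):
    """L1 distance between two frequency dicts."""
    return sum(abs(p.get(u, 0) - q.get(u, 0)) for u in set(p) | set(q))


rng = random.Random(1)
n, k = 200, 4
w = "".join(rng.choice("01") for _ in range(n))

# Simple edit: flip the letter at one position.
p = 57
edited = w[:p] + ("1" if w[p] == "0" else "0") + w[p + 1:]

# Move: w = A.B.C.D becomes A.C.B.D.
a, b, c = 40, 90, 150
moved = w[:a] + w[b:c] + w[a:b] + w[c:]

edit_change = l1(u_stat(w, k), u_stat(edited, k))
move_change = l1(u_stat(w, k), u_stat(moved, k))
print(edit_change <= Fraction(2 * k, n - k + 1))
print(move_change <= Fraction(6 * (k - 1), n - k + 1))
```

A substitution touches at most k subwords and a move disturbs only the subwords crossing one of three boundaries, which is where the constants 2k and 2·3(k-1) come from.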
Block Uniform Statistics
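$\mathrm{bu.stat}(w) = \frac{K}{n}\sum_{i=1,\dots,n/K} E(\mathrm{b.stat}(v_i)) = E_v(\mathrm{b.stat}(v))$
$X_i = \mathrm{b.stat}(v_i)$, $X_i[u] = \mathrm{b.stat}(v_i)[u]$, $0 \le X_i[u] \le 1$
Each $X_i[u]$ is independent. Average is $\mathrm{bu.stat}(w)[u]$.
Chernoff bound: $\Pr[\,|\mathrm{b.stat}(v)[u] - \mathrm{bu.stat}(w)[u]| \ge t \cdot \mathrm{bu.stat}(w)[u]\,] \le e^{-t^2 n / 8K}$
Union bound: $\Pr[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \ge t \cdot \mathrm{bu.stat}(w)\,] \le 2^k \cdot e^{-t^2 n / 8K}$
$\Pr[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \le \varepsilon/2\,] > 0$
Lemma 1: there exists $v_w$ such that $|\mathrm{bu.stat}(w) - \mathrm{b.stat}(v_w)| \le \varepsilon/2$.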
Uniform Statistics
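# subwords of length k missed by bu: $(k-1)\,(n/K - 1)$
Lemma: Let $A \subseteq B$ and $\mu_A$, $\mu_B$ the two uniform distributions on A and B. Then $|\mu_A - \mu_B| \le 2\,\frac{|B - A|}{|B|}$.
Apply the previous lemma with $k = 1/\varepsilon$, $K = \varepsilon^3 n / \log n$ and $|B| \approx n$: $|\mathrm{u.stat}(w) - \mathrm{bu.stat}(w)| = O\!\left(\frac{\log n}{\varepsilon^4 n}\right)$
Lemma 2: for every w, $|\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)| = O\!\left(\frac{\log n}{\varepsilon^4 n}\right)$.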
Robustness of the uniform Statistics
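Robustness of u-stat: $\mathrm{dist}(w,w') \ge 5\varepsilon n \;\Rightarrow\; |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \ge 6.5\,\varepsilon$
By Lemma 1: $|\mathrm{bu.stat}(w) - \mathrm{b.stat}(v_w)| \le \varepsilon/2$
By Lemma 3: for every w, $|\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)| = O\!\left(\frac{\log n}{\varepsilon^4 n}\right)$
Get v, v' from w, w'.
Robustness of b-stat implies robustness of u-stat.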
Tester for the distance with moves
NP-complete problem, but O(1)-approximable.
Approximate u.stat: Sample N subwords of length k, compute Y:
$Y = \frac{1}{N}\sum_{i=1}^{N} X_i$, where $X_i = (0, \dots, 0, 1, 0)$ is the indicator vector of the i-th sampled subword.
Y is a good approximation of u.stat (Chernoff): $\Pr[\,|\mathrm{u.stat}(W) - Y| \ge a\,] \le e^{-8 a^2 N}$
Uniform statistics is a good approximation of the distance by soundness and robustness.
Tester: If Y < ε.n accept, else reject.