Approximate schemas Michel de Rougemont, LRI , University Paris II


1. Distance between words (structures), O(1). Edit distance with moves

2. Distance between a word (structure) and a class of words (structures), O(1)

3. Distance between two languages (classes), polynomial time

4. Applications: regular languages, DTDs

Distances between languages

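$L_1 \subseteq_\varepsilon L_2$ if every $v \in L_1$ (except finitely many) is $\varepsilon$-close to $L_2$.

$L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$.

$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$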

1. Approximate Satisfiability and Equivalence

1. Satisfiability: Tree |= F

2. Approximate satisfiability: Tree |=_ε F

3. Approximate equivalence: F ≡_ε G

[Figure: image on a class K of trees, showing F, G, and the trees that are ε-far from F.]

Testers on a class K

An ε-tester for a property F is a probabilistic algorithm A such that:
• If U |= F, A accepts.
• If U is ε-far from F, A rejects with high probability.
• Time(A) is independent of n.

A tester usually implies a linear-time corrector.

Self-testers and correctors for linear algebra, Blum & Kannan, 1989
Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994
Testers for graph properties: k-colorability, Goldreich et al., 1996
graph properties have testers, Alon et al., 1999
Regular languages have testers, Alon et al., 2000
Testers for regular tree languages, MdR and Magniez, ICALP 2004

1. Classical Edit Distance:

Insertions, Deletions, Modifications

2. Edit Distance with moves

0111000011110011001

0111011110000011001

3. Edit Distance with Moves generalizes to Trees
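The two binary words shown for the edit distance with moves differ by a single move. The sketch below checks this; the helper `move` and its parameters are names introduced here for illustration, not part of the slides.

```python
def move(word: str, start: int, length: int, dest: int) -> str:
    """Cut word[start:start+length] and reinsert it at index dest of the remainder."""
    block = word[start:start + length]
    rest = word[:start] + word[start + length:]
    return rest[:dest] + block + rest[dest:]


w = "0111000011110011001"
w_prime = "0111011110000011001"

# Move the block "000" (positions 4 to 6) so that it lands after "01111".
moved = move(w, start=4, length=3, dest=9)
print(moved == w_prime)
```

A move costs one operation here, whereas the classical edit distance would pay for each deleted and reinserted letter.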

2. Equality tester

Block and uniform statistics

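W = 001010101110…, of length n. With subwords of length k, W splits into n/k blocks.

Block statistics:

$\mathrm{b.stat}(W) = \frac{k}{n}\,(n_1, n_2, \dots, n_{2^k})^T$

where $n_1$ = number of blocks "00…0", $n_2$ = number of blocks "00…1", …, $n_{2^k}$ = number of blocks "11…1".

For k=2 and W = 001010101110, n/k = 6 and the blocks are 00, 10, 10, 10, 11, 10. In the order (00, 01, 10, 11):

$\mathrm{b.stat}(W) = \frac{1}{6}\,(1, 0, 4, 1)^T$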

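Uniform statistics, over all n−k+1 subwords of length k. For k=2 and W = 001010101110, in the order (00, 01, 10, 11):

$\mathrm{u.stat}(W) = \frac{1}{11}\,(1, 4, 4, 2)^T$

$d_1 = |\mathrm{u.stat}(W) - \mathrm{u.stat}(W')|$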
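The block and uniform statistics of the example word can be recomputed with a few lines of Python. This is a minimal sketch: the function names and the lexicographic ordering of the coordinates (00, 01, 10, 11) are choices made here, and a trailing incomplete block is simply dropped.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def all_words(k, alphabet="01"):
    """All words of length k in lexicographic order (00..0 first, 11..1 last)."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


def b_stat(w, k, alphabet="01"):
    """Block statistics: frequencies of the consecutive blocks of length k."""
    blocks = [w[i:i + k] for i in range(0, len(w) - k + 1, k)]
    counts = Counter(blocks)
    return [Fraction(counts[u], len(blocks)) for u in all_words(k, alphabet)]


def u_stat(w, k, alphabet="01"):
    """Uniform statistics: frequencies of all n-k+1 subwords of length k."""
    subwords = [w[i:i + k] for i in range(len(w) - k + 1)]
    counts = Counter(subwords)
    return [Fraction(counts[u], len(subwords)) for u in all_words(k, alphabet)]


def l1_distance(x, y):
    """L1 distance between two statistics vectors of the same dimension."""
    return sum(abs(a - b) for a, b in zip(x, y))


W = "001010101110"
print(b_stat(W, 2))  # block counts (1, 0, 4, 1) out of 6 blocks
print(u_stat(W, 2))  # subword counts (1, 4, 4, 2) out of 11 subwords
```

Exact fractions are used so that the vectors can be compared with the ones on the slide without rounding.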

Goal: d1 approximates the distance

Let ε = 1/k. For n > n0: dist − ε.n < d1 < dist + ε.n

Practical application: $\varepsilon = 10^{-2}$, hence k = 100, and the statistics have dimension $2^{100}$.

For words of length $n = 10^9$, d1 is approximated from N samples, and the approximation is good after $N = O(1/\varepsilon^3)$ trials.

Remarks:

1. Distance with moves: W = 000…0001111…111 and W' = 1111…111000…000.

2. Robustness to noise: if W, W' are noisy inputs (but ε-close), the method still works.

3. Random words are close with moves, far without.

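For the two words of remark 1, $\mathrm{b.stat}(W) \approx \mathrm{b.stat}(W') \approx (0.5, 0, \dots, 0, 0.5)^T$: about half of the blocks are 00…0 and half are 11…1.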

Tester for equality of strings

Edit distance with moves: an NP-complete problem, but O(1)-approximable.

Uniform statistics (k=2): W = 001010101110, $\mathrm{u.stat}(W) = \frac{1}{11}\,(1, 4, 4, 2)^T$ in the order (00, 01, 10, 11).

Theorem 1. |u.stat(w) − u.stat(w')| approximates dist(w,w')/n.

Sample N subwords of length k, compute Y(w) and Y(w'):

$Y(w) = \frac{1}{N} \sum_{i=1,\dots,N} X_i$, where each $X_i = (0, \dots, 0, 1, 0, \dots, 0)^T$ is the indicator vector of the i-th sampled subword. Y(w') is defined in the same way from samples of w'.

Theorem 2. Y(w) approximates u.stat(w).

Corollary. |Y(w) − Y(w')| approximates dist(w,w')/n.

Tester: if |Y(w) − Y(w')| < ε, accept; else reject.
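The sampling step of the equality tester can be sketched as follows. This is an illustration under assumptions: the function names, the fixed seed, and the values of k, N and the threshold in the usage lines are chosen here and are not taken from the slides.

```python
import random
from collections import Counter


def sample_stat(w, k, n_samples, rng):
    """Y(w): empirical distribution of n_samples random subwords of length k."""
    counts = Counter()
    for _ in range(n_samples):
        i = rng.randrange(len(w) - k + 1)
        counts[w[i:i + k]] += 1
    return {u: c / n_samples for u, c in counts.items()}


def l1_distance(y1, y2):
    """L1 distance between two sparse distributions stored as dicts."""
    keys = set(y1) | set(y2)
    return sum(abs(y1.get(u, 0.0) - y2.get(u, 0.0)) for u in keys)


def equality_tester(w1, w2, k, n_samples, threshold, seed=0):
    """Accept (True) when the sampled statistics of w1 and w2 are close."""
    rng = random.Random(seed)
    y1 = sample_stat(w1, k, n_samples, rng)
    y2 = sample_stat(w2, k, n_samples, rng)
    return l1_distance(y1, y2) < threshold


# Two words that are one move apart, and one word that is far from both.
w_a = "0" * 5000 + "1" * 5000
w_b = "1" * 5000 + "0" * 5000
w_c = "01" * 5000

print(equality_tester(w_a, w_b, k=4, n_samples=2000, threshold=0.5))
print(equality_tester(w_a, w_c, k=4, n_samples=2000, threshold=0.5))
```

The running time depends on N and k only, not on the length of the words, which is the point of the tester.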

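3a. Tester for regular words

Definition: L is a regular language and A an automaton for L. Test whether w is in L.

[Figure: automaton A with an initial state (init), an accepting state (accept), and connected components $C_0, C_1, C_2, C_3, C_4$.]

Admissible path: $Z = C_0 \to C_2 \to C_3 \to C_4$

A word W is Z-feasible if there are two states $q \in C_i$ and $q' \in C_j$ such that $q \xrightarrow{W} q'$ and $C_i \dots C_j \subseteq Z$.

$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$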

Tester for regular words

Tester. Input: W, A, ε

For $i = 1, \dots, \log(m/\varepsilon)$:
    Choose $N_i$ random subwords $w_{ij}$ of size $2^{i+1}$, where $N_i$ is of the order of $m^3/\varepsilon$, up to a power of 2 depending on $i$.

For every admissible path Z:
    If all $w_{ij}$ of W are Z-feasible, ACCEPT.

Else REJECT.

Theorem: Tester(W, A, ε) is an ε-tester for L(A).

Proof schema of the Tester

Theorem: Regular words are testable.

Splitting lemma: if W is far from L, there are many disjoint infeasible subwords.

Amplifying lemma: if there are many infeasible subwords, there are many short ones.

Robustness lemma: if W is ε-far from L, then for every admissible path Z, there exists $i \le \log(\dots)$ such that the number of Z-infeasible subwords of size $2^{i+1}$ is at least $\dots\, n / (2^{i+1} \dots)$, the bounds depending on m and ε.

Merging words

Merging lemma: Let Z be an admissible path, and let F be a Z-feasible cut of size h'. Then $\mathrm{Dist}(F, L) \le m^2 h'$.

[Figure: the words of the cut placed on the connected components C of the admissible path.]

Take each word $w_i \in F$ and split it along its connected components, removing single letters. Rearrange all the words of the same component in Z-order. Add gluing words $g_i$ to obtain W' in L:

$W' = g_0\, w_1\, g_1\, w_2\, g_2 \dots$

Splitting

Splitting lemma: If Z is an admissible path and W a word such that dist(W,L) > h, then W has more than $h/m^2$ disjoint Z-infeasible subwords ($h = \varepsilon n$).

Proof by contraposition:

Suppose W has less than $h' = h/m^2$ minimal, disjoint, Z-infeasible subwords.

Removing the last letters provides a feasible cut F, with $\mathrm{Dist}(W, F) \le h'$.

By the merging lemma, $\mathrm{Dist}(F, L) \le m^2 h'$.

Hence $\mathrm{Dist}(W, L) \le h' + m^2 h'$, so $\mathrm{Dist}(W, L)$ is at most about $h$.

3b. Correction in practice: right branch tree

http://www.lri.fr/~mdr/xml/

2 moves, dist=2

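4. Equivalence testing of Regular Languages

1. Inclusion: $L_1 \subseteq_\varepsilon L_2$ if every $v \in L_1$ (except finitely many) is $\varepsilon$-close to $L_2$.

2. Equivalence: $L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$.

Equivalence tester:

If $L_1 \equiv L_2$, then A accepts.

If $\neg(L_1 \equiv_\varepsilon L_2)$, then A rejects with probability 2/3.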

Automata for Regular languages

Basic property: if $w \in L$, then $w$ is ε-close to $v = u_1 \dots u_l$ with $|u_i| \le m$, where $u_1, \dots, u_l$ is a multi-set of A-compatible loops.

Proposition: $H = \bigcup \text{Convex-Hull}(\mathrm{b.stat}(v_1), \dots, \mathrm{b.stat}(v_t))$, the union ranging over A-compatible loops $v_1, \dots, v_t$ with $|v_i| \le m$.

Caratheodory's theorem: in dimension d, the convex hull of N points can be decomposed into the union of convex hulls of d+1 points.

Large loops can be decomposed. Small loops (of size less than m = |A|) suffice.

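Approximate Parikh mapping

Lemma: For every X in H, there exists w in L such that $|X - \mathrm{b.stat}(w)| \le \dots$

[Figure: a point X of H next to b-stat(w) for a word w, with the bound $\mathrm{dist}(w, L) \le (\dots/2)\, n$.]

H is a fair representation of L.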

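Construction of H in polynomial time

Enumerate all loops: their number is exponential in k.m.

The number of b-stat is smaller, of the order of $m^{|\Sigma|^k}$, because some loops have the same b-stat (ABBA and BBAA). It is the number of partitions of a word of length m with « big blocks ».

Construct H by matrix iteration, where $P_t(i,j)$ is the set of b.stat of the paths of length t from state i to state j:

$P_{t+1} = P_t \cdot P_1$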
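The matrix iteration can be illustrated with a small Python sketch. It is a simplified reading of the construction, under assumptions: entries of the matrices are sets of block multisets (stored as sorted tuples, which carry the same information as unnormalized b-stat vectors), the product combines entries through every intermediate state, and the one-state automaton over {a, b} is a hypothetical example.

```python
from itertools import product


def block_steps(delta, states, alphabet, k):
    """P_1: p1[i][j] holds the blocks of length k leading from state i to state j."""
    p1 = {i: {j: set() for j in states} for i in states}
    for i in states:
        for letters in product(alphabet, repeat=k):
            q = i
            for a in letters:
                q = delta.get((q, a))
                if q is None:
                    break
            if q is not None:
                p1[i][q].add(("".join(letters),))
    return p1


def compose(pt, p1, states):
    """P_{t+1} = P_t . P_1, merging block multisets along i -> l -> j."""
    out = {i: {j: set() for j in states} for i in states}
    for i, l, j in product(states, repeat=3):
        for x in pt[i][l]:
            for y in p1[l][j]:
                out[i][j].add(tuple(sorted(x + y)))
    return out


def loop_block_multisets(delta, states, alphabet, k, max_blocks):
    """Block multisets of all loops made of at most max_blocks blocks."""
    p1 = block_steps(delta, states, alphabet, k)
    pt, loops = p1, set()
    for t in range(1, max_blocks + 1):
        for q in states:
            loops |= pt[q][q]
        if t < max_blocks:
            pt = compose(pt, p1, states)
    return loops


# Hypothetical automaton: a single state looping on a and b.
states = [0]
delta = {(0, "a"): 0, (0, "b"): 0}
loops = loop_block_multisets(delta, states, "ab", k=2, max_blocks=2)
print(len(loops))
```

With k=2 there are 4 + 16 = 20 loop words of one or two blocks but fewer distinct multisets, since words such as abba and baab share the blocks {ab, ba}.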

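Example

Automaton A:

[Figure: automaton A with states 1, 2, 3, 4 and transitions labelled a, b, c, d.]

Blocks: k=2, m=4, |Σ|=4, $|\Sigma|^k + 1 = 17$

Loops: {(aa,ca:1),(bb:2),(cc,ac:3),(dd:4)}

[Figure: $H_A$, drawn on the points aa, ca, ac, cc, bb, dd.]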

Equivalence tester

Tester for w in L (regular): compute b-stat(w) and H, then decide if dist(w,L) > ε.n. Time is polynomial in m = |L|.

The previous tester was exponential in m.

Tester of $A \equiv_\varepsilon B$:

1. Compute $H_A$ and $H_B$.

2. Reject if $H_A$ and $H_B$ are different.

Time is polynomial in m = |A,B|.

Application: Data Exchange

Source: $(01)^*11$

Target: $(1010)^*$

W=010101011, source. Which structure for the target?

Answer: if the two schemas are close, run a corrector and obtain W’=10101010, distance 3.

If the two schemas are not close, no guarantee.

General situation for data exchange and query answering.

Conclusion

1. Testers and Correctors

2. Constant-time algorithm for Edit Distance with moves

3a. Testers and Correctors for regular words

3b. Tester and corrector for regular trees

4. Equivalence tester for automata: polynomial time algorithm

Generalization to Buchi automata and Context-Free Tree regular languages

Generalizations

Büchi automata. Distance on infinite words: two words w, w' are ε-close if $\limsup_{n \to \infty} \mathrm{dist}(w(n), w'(n))/n \le \varepsilon$.

A word w is ε-close to a language L if there exists w' in L such that w and w' are ε-close.

Statistics: set of accumulation points of $\mathrm{b.stat}(w(n))$.

H: compatible loops of connected components of accepting states.

Tester for Büchi automata: compute $H_A$ and $H_B$. Reject if $H_A$ and $H_B$ are different.

Equivalence of CF grammars is undecidable; approximate equivalence is in exponential time.

Soundness and Robustness

Let F be a property on a class K of structures U. Here F is Equality.

Soundness: close structures, $\mathrm{dist}(w, w') \le \varepsilon n$, have close statistics.

Robustness: far structures, $\mathrm{dist}(w, w') > \varepsilon n$, have far statistics.

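Robustness of b.stat

Robustness of b-stat: $\mathrm{dist}(w, w') \le \left(\tfrac{1}{2}\, |\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| + \varepsilon\right) n$

If $\mathrm{b.stat}(w) = \mathrm{b.stat}(w')$, then $\mathrm{dist}(w, w') \le \varepsilon n$.

If $\mathrm{b.stat}(w) \ne \mathrm{b.stat}(w')$, then construct w'' such that $\mathrm{b.stat}(w'') = \mathrm{b.stat}(w')$, after at most $\tfrac{n}{2}\, |\mathrm{b.stat}(w) - \mathrm{b.stat}(w')|$ substitutions on w.

Example, in the order (00, 01, 10, 11):

$\mathrm{b.stat}(W) = \frac{1}{6}\,(1, 0, 4, 1)^T$ and $\mathrm{b.stat}(W') = \frac{1}{6}\,(2, 0, 3, 1)^T$

#"00" is 1 in W and 2 in W', but #"10" is 4 in W and 3 in W'.

W'': take one block "10" in W and change it into "00".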

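Soundness of u.stat

Soundness of u-stat: $\mathrm{dist}(w, w') \le \varepsilon^2 n \implies |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6\varepsilon$

Simple edit: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2k}{n-k+1} \approx \frac{2}{\varepsilon n}$

Move w = A.B.C.D, w' = A.C.B.D: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{3 \cdot 2(k-1)}{n-k+1} \le \frac{6}{\varepsilon n}$

Hence, for $\varepsilon^2 n$ operations, $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6\varepsilon$.

Problem: robustness of u.stat? Harder! It needs an auxiliary distribution and two key lemmas.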
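The bound for a single move, at most 3·2(k−1) changed subword counts out of n−k+1, can be checked empirically. The sketch below is an illustration: the helper names, the random word, and the parameters are chosen here, and the check compares integer counts to avoid rounding.

```python
import random
from collections import Counter


def u_counts(w, k):
    """Counts of all subwords of length k of w."""
    return Counter(w[i:i + k] for i in range(len(w) - k + 1))


def count_difference(w1, w2, k):
    """L1 distance between the subword count vectors of w1 and w2."""
    c1, c2 = u_counts(w1, k), u_counts(w2, k)
    return sum(abs(c1[u] - c2[u]) for u in set(c1) | set(c2))


def random_move(w, rng):
    """Write w = A.B.C.D at three random cut points and return A.C.B.D."""
    a, b, c = sorted(rng.randrange(len(w) + 1) for _ in range(3))
    return w[:a] + w[b:c] + w[a:b] + w[c:]


rng = random.Random(0)
k, n = 5, 400
w = "".join(rng.choice("01") for _ in range(n))

worst = max(count_difference(w, random_move(w, rng), k) for _ in range(200))
print(worst, "<=", 6 * (k - 1))
```

Dividing the count difference by n−k+1 gives |u.stat(w) − u.stat(w')|, so the printed comparison is the move inequality before normalization.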

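Block Uniform Statistics

$\mathrm{bu.stat}(w) = \frac{K}{n} \sum_{i=1,\dots,n/K} E_{t_i}(\mathrm{b.stat}(v_i)) = E_t(\mathrm{b.stat}(v))$

$X_i = \mathrm{b.stat}(v_i)$, $X_i[u] = \mathrm{b.stat}(v_i)[u]$, $0 \le X_i[u] \le 1$

Each $X_i[u]$ is independent. The average is $\mathrm{bu.stat}(w)[u]$.

Chernoff bound: $\Pr_t[\, |\mathrm{b.stat}(v)[u] - \mathrm{bu.stat}(w)[u]| \ge \dots\, \mathrm{bu.stat}(w)[u] \,] \le e^{-\dots n/8K}$

Union bound: $\Pr_t[\, |\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \ge \dots \,] \le 2^k\, e^{-\dots n/8K}$

$\Pr_t[\, |\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \le \varepsilon/2 \,] > 0$

Lemma 1: there exists $v_w$ such that $|\mathrm{bu.stat}(w) - \mathrm{b.stat}(v_w)| \le \varepsilon/2$.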

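Uniform Statistics

Number of subwords of length k missed by bu: $(k-1)\left(\frac{n}{K} - 1\right)$

Lemma: Let $A \subseteq B$, and let $\mu_A$, $\mu_B$ be the two uniform distributions on A and B. Then $|\mu_A - \mu_B| \le 2\, \frac{|B \setminus A|}{|B|}$.

Apply the previous lemma with $|B| = n - k + 1$ and $K = \varepsilon^3 n / \log n$:

Lemma 2: $|\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)| = O\!\left(\frac{\log n}{\varepsilon^4 n}\right)$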

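Robustness of the uniform Statistics

Robustness of u-stat: $\mathrm{dist}(w, w') \ge 5\varepsilon n \implies |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \ge 6.5\, \varepsilon$

By Lemma 1: $|\mathrm{bu.stat}(w) - \mathrm{b.stat}(v_w)| \le \varepsilon/2$

By Lemma 2: $|\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)| = O\!\left(\frac{\log n}{\varepsilon^4 n}\right)$

Get v, v' from w, w'.

Robustness of b-stat implies robustness of u-stat.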

Tester for the distance with moves

NP-complete problem, but O(1)-approximable.

Approximate u.stat: sample N subwords of length k and compute Y:

$Y = \frac{1}{N} \sum_{i=1,\dots,N} X_i$, with $X_i = (0, \dots, 0, 1, 0, \dots, 0)^T$

Y is a good approximation of u.stat (Chernoff): $\Pr[\, |\mathrm{u.stat}(W) - Y| \ge a \,] \le e^{-\Theta(a^2 N)}$

Uniform statistics is a good approximation of the distance, by soundness and robustness.

Tester: if Y < ε.n, accept; else reject.