Algorithms on Strings
Michael Soltys
CSU Channel Islands, Computer Science
February 4, 2015
Strings - Soltys Math/CS Seminar Title - 1/27
String problems are at the heart of Computer Science:
Rewriting systems are Turing complete
In practice, analysis of strings is central to:
- Algorithmic biology
- Text processing
- Language theory
- Coding theory
Basics (COMP 454)
An alphabet is a finite, non-empty set of distinct symbols, usually denoted by Σ.
e.g., Σ = {0, 1} (binary alphabet)
Σ = {a, b, c, ..., z} (lower-case letters alphabet)
A string, also called a word, is a finite ordered sequence of symbols chosen from some alphabet.
e.g., 010011101011
|w| denotes the length of the string w.
e.g., |010011101011| = 12
The empty string ε, with |ε| = 0, is a string over any alphabet Σ by default.
Σ^k is the set of strings over Σ of length exactly k.
e.g., if Σ = {0, 1}, then
Σ^0 = {ε}
Σ^1 = Σ
Σ^2 = {00, 01, 10, 11}, etc. What is |Σ^k|?
Kleene's star Σ* is the set of all strings over Σ:
Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ ...
where Σ^+ = Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ ... is the set of non-empty strings, so Σ* = Σ^0 ∪ Σ^+.
Concatenation: if x, y are strings with x = a_1 a_2 ... a_m and y = b_1 b_2 ... b_n, then
x · y = xy = a_1 a_2 ... a_m b_1 b_2 ... b_n
(xy is written by juxtaposition). Compare the UNIX cat command.
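A minimal Python sketch of these definitions, assuming strings are represented as ordinary Python `str` values. The helper name `strings_of_length` is made up for illustration; it enumerates Σ^k and checks the count that the slide asks about.

```python
from itertools import product

def strings_of_length(sigma, k):
    """Return Σ^k: all strings of length exactly k over the alphabet sigma."""
    return ["".join(t) for t in product(sorted(sigma), repeat=k)]

sigma = {"0", "1"}
print(strings_of_length(sigma, 0))  # Σ^0 contains only the empty string ε
print(strings_of_length(sigma, 2))  # Σ^2

# Each of the k positions is chosen independently, so |Σ^k| = |Σ|^k.
for k in range(5):
    assert len(strings_of_length(sigma, k)) == len(sigma) ** k

# Concatenation is juxtaposition, and ε (the empty str) is its identity.
x, y = "0100", "111"
assert x + y == "0100111"
assert x + "" == "" + x == x
```

Enumerating Σ^0, Σ^1, Σ^2, ... in turn lists Σ* one length at a time, which is one way to read the union defining Kleene's star.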
A language L is a collection of strings over some alphabet Σ, i.e., L ⊆ Σ*. E.g.,
L = {ε, 01, 0011, 000111, ...} = {0^n 1^n | n ≥ 0}
Note:
- wε = εw = w.
- {ε} ≠ ∅; one is the language consisting of the single string ε, and the other is the empty language.
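As a small illustration, membership in the example language {0^n 1^n | n ≥ 0} can be tested directly. The function name `in_L` is an assumption made for this sketch.

```python
def in_L(w):
    """Return True iff w = 0^n 1^n for some n >= 0."""
    n, remainder = divmod(len(w), 2)
    return remainder == 0 and w == "0" * n + "1" * n

# ε is in L (take n = 0), so L is not the empty language.
assert in_L("")
assert in_L("01") and in_L("000111")
assert not in_L("0101") and not in_L("011")
```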
Consider L = {w ∈ Σ* | w is of the form x01y} where Σ = {0, 1}.
We want to specify a DFA A = (Q, Σ, δ, q0, F) that accepts all and only the strings in L.
Σ = {0, 1}, Q = {q0, q1, q2}, and F = {q1}.
Transition diagram: q0 loops on 1 and moves to q2 on 0; q2 loops on 0 and moves to q1 on 1; q1 loops on 0, 1.
Transition table
state   0    1
q0      q2   q0
q1      q1   q1
q2      q2   q1
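The transition table can be run directly. Below is a sketch in Python that encodes δ as a dictionary (the names `DELTA`, `START`, `ACCEPTING` and `accepts` are chosen here for illustration) and checks the DFA against the defining property of L on all short binary strings.

```python
from itertools import product

# δ copied from the transition table: (state, symbol) -> next state.
DELTA = {
    ("q0", "0"): "q2", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q1",
    ("q2", "0"): "q2", ("q2", "1"): "q1",
}
START, ACCEPTING = "q0", {"q1"}

def accepts(w):
    """Run the DFA on w and report whether it halts in an accepting state."""
    state = START
    for symbol in w:
        state = DELTA[(state, symbol)]
    return state in ACCEPTING

# L is the set of strings containing 01 as a substring.
for k in range(8):
    for t in product("01", repeat=k):
        w = "".join(t)
        assert accepts(w) == ("01" in w)
```

Intuitively, q0 means "no 0 seen yet", q2 means "the last symbol was 0 and no 01 has occurred", and q1 means "01 has occurred".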
A context-free grammar (CFG) is G = (V, T, P, S): Variables, Terminals, Productions, Start variable.
Ex. P → ε | 0 | 1 | 0P0 | 1P1.
Ex. G = ({E, I}, T, P, E) where T = {+, *, (, ), a, b, 0, 1} and P is the following set of productions:
E → I | E + E | E * E | (E)
I → a | b | Ia | Ib | I0 | I1
If αAβ ∈ (V ∪ T)*, A ∈ V, and A → γ is a production, then αAβ ⇒ αγβ. We use ⇒* to denote 0 or more steps.
L(G) = {w ∈ T* | S ⇒* w}
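To make the relation ⇒ concrete, here is a sketch that performs derivation steps in the expression grammar above. It assumes every grammar symbol is a single character; the names `PRODUCTIONS`, `rewrite_at` and `leftmost_derivation` are invented for this example.

```python
PRODUCTIONS = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}

def rewrite_at(sentential, position, body):
    """One step αAβ ⇒ αγβ: rewrite the variable A at `position` using A → body."""
    head = sentential[position]
    if body not in PRODUCTIONS.get(head, []):
        raise ValueError(f"{head} -> {body} is not a production")
    return sentential[:position] + body + sentential[position + 1:]

def leftmost_derivation(bodies, start="E"):
    """Apply each body to the leftmost variable in turn; return every sentential form."""
    forms = [start]
    for body in bodies:
        current = forms[-1]
        position = next(i for i, c in enumerate(current) if c in PRODUCTIONS)
        forms.append(rewrite_at(current, position, body))
    return forms

steps = leftmost_derivation(["E+E", "I", "a", "I", "Ib", "b"])
print(" => ".join(steps))
assert steps[-1] == "a+bb"  # so a+bb is in L(G)
```

The final form contains no variables, so it is a string of terminals derived from the start variable E in six steps.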
Context-sensitive grammars (CSG) have rules of the form:
α → β
where α, β ∈ (T ∪ V)* and |α| ≤ |β|. A language is context-sensitive if it has a CSG.
Fact: It turns out that CSL = NSPACE(n)
A rewriting system (also called a semi-Thue system) is a grammar with no restrictions on its rules: α → β for arbitrary α, β ∈ (V ∪ T)*.
Fact: It turns out that a rewriting system corresponds to the most general model of computation; i.e., a language has a rewriting system iff it is “computable” (more precisely, recursively enumerable).
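A short sketch of how a semi-Thue system rewrites strings, using a toy rule set that is not from the talk: the single rule ab → ba. The helper names `successors` and `reachable` are hypothetical, and the search is bounded because reachability is undecidable in general.

```python
from collections import deque

def successors(s, rules):
    """All strings obtained from s by one application of some rule α → β."""
    out = set()
    for alpha, beta in rules:
        start = s.find(alpha)
        while start != -1:
            out.add(s[:start] + beta + s[start + len(alpha):])
            start = s.find(alpha, start + 1)
    return out

def reachable(s, rules, max_steps):
    """Strings derivable from s in at most max_steps rewriting steps (breadth-first)."""
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == max_steps:
            continue
        for nxt in successors(current, rules) - seen:
            seen.add(nxt)
            frontier.append((nxt, depth + 1))
    return seen

RULES = [("ab", "ba")]  # toy example: each step moves one b past one a
print(sorted(reachable("aabb", RULES, max_steps=4)))
```

With this rule, bbaa is reachable from aabb in exactly four steps, since each step moves a single b to the left of a single a.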
A second course in Automata
Chomsky-Schützenberger Theorem: If L is a CFL, then there exists a regular language R, an n, and a homomorphism h, such that L = h(PAREN_n ∩ R), where PAREN_n is the language of balanced strings over n types of parentheses.
Parikh's Theorem: If Σ = {a_1, a_2, ..., a_n}, the signature of a string x ∈ Σ* is (#a_1(x), #a_2(x), ..., #a_n(x)), i.e., the number of occurrences of each symbol, in a fixed order. The signature of a language is defined by extension; regular languages and CFLs have the same signatures.
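The signature is easy to compute. The sketch below (function names are illustrative) compares finite samples of the CFL {0^n 1^n} and the regular language (01)*, which have the same signatures; this only illustrates the statement on one pair of languages and is not a proof.

```python
from collections import Counter

def signature(x, alphabet):
    """(#a_1(x), ..., #a_n(x)) for the symbols of `alphabet` in the given fixed order."""
    counts = Counter(x)
    return tuple(counts[a] for a in alphabet)

def language_signature(strings, alphabet):
    """Signature of a (finite sample of a) language: the set of its strings' signatures."""
    return {signature(x, alphabet) for x in strings}

cfl_sample = ["0" * n + "1" * n for n in range(6)]   # sample of {0^n 1^n | n >= 0}
regular_sample = ["01" * n for n in range(6)]        # sample of (01)*
assert language_signature(cfl_sample, "01") == language_signature(regular_sample, "01")
```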
This presentation is about algorithms on strings.
Based on two papers that are coming out in the next months:
- Neerja Mhaskar and Michael Soltys, Non-repetitive strings over alphabet lists, to appear in WALCOM, February 2015.
- Neerja Mhaskar and Michael Soltys, String Shuffle: Circuits and Graphs, accepted in the Journal of Discrete Algorithms, 2015.
Both at http://soltys.cs.csuci.edu (papers 3 & 19)
Non-repetitive strings
A word is non-repetitive if it does not contain a subword of the form vv, for a non-empty word v.
Word with repetition: 010101110
Word without repetition: 101
Easy observation: what is the smallest n so that any word over Σ = {0, 1} of length ≥ n has at least one repetition?
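The easy observation can be checked by brute force. This is a sketch with made-up helper names (`has_square`, `smallest_unavoidable_length`); the `limit` parameter is an arbitrary search bound added so that the search always terminates.

```python
from itertools import product

def has_square(w):
    """True iff w contains a subword vv with v non-empty."""
    n = len(w)
    for i in range(n):
        for k in range(1, (n - i) // 2 + 1):
            if w[i:i + k] == w[i + k:i + 2 * k]:
                return True
    return False

def smallest_unavoidable_length(alphabet, limit=10):
    """Smallest n <= limit such that every word of length n has a square, else None."""
    for n in range(1, limit + 1):
        if all(has_square("".join(t)) for t in product(alphabet, repeat=n)):
            return n
    return None

assert has_square("010101110") and not has_square("101")
print(smallest_unavoidable_length("01"))
```

Any longer word contains a subword of that length, so the bound carries over to all lengths ≥ n. Over a three-symbol alphabet the search finds no such n within the limit, which is consistent with Thue's result.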
Original Thue problem
Let Σ_3 = {1, 2, 3}, and consider the following morphism S, due to A. Thue:
S: 1 ↦ 12312
   2 ↦ 131232
   3 ↦ 1323132
Given a string w ∈ Σ_3*, we let S(w) denote w with every symbol replaced by its corresponding substitution:
S(w) = S(w_1 w_2 ... w_n) = S(w_1) S(w_2) ... S(w_n)
Lemma: If w is non-repetitive then so is S(w).
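A quick empirical check of the lemma, written as a sketch: starting from the non-repetitive word 1 and applying S repeatedly should give longer and longer non-repetitive words. The names `apply_morphism` and `has_square` are chosen here; the check covers only a few iterations and proves nothing in general.

```python
S = {"1": "12312", "2": "131232", "3": "1323132"}

def apply_morphism(w):
    """S(w): replace every symbol of w by its image under S."""
    return "".join(S[c] for c in w)

def has_square(w):
    """True iff w contains a subword vv with v non-empty."""
    n = len(w)
    for i in range(n):
        for k in range(1, (n - i) // 2 + 1):
            if w[i:i + k] == w[i + k:i + 2 * k]:
                return True
    return False

w = "1"
for _ in range(3):
    w = apply_morphism(w)
    assert not has_square(w)  # consistent with the lemma
    print(len(w))
```

Since each image has length at least 5, iterating S yields non-repetitive words over Σ_3 of unbounded length.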
Problem extended to alphabet lists
List of alphabets L = L_1, L_2, ..., L_n
Can we generate non-repetitive words w = w_1 w_2 ... w_n such that the symbol w_i ∈ L_i?
Studied by [GKM10] and [Sha09]; it is a natural extension of the original problem posed and solved by A. Thue.
E.g., L_1 = {a, b, c}, L_2 = {b, c, d}, L_3 = {a, d, 2}; in this case w = ac2 is over L_1, L_2, L_3 and non-repetitive.
Is that true for any list where |L_i| = 3 for all i?
[GKM10] shows that this can be done for |L_i| = 4 for all i with this algorithm:
pick any w_1 ∈ L_1
for i + 1 (w = w_1 w_2 ... w_i is non-repetitive) pick a ∈ L_{i+1}
  if wa is non-repetitive, then let w_{i+1} = a
  if wa has a square vv, then vv must be a suffix;
    delete the right copy of v from w, and restart.
Using a sophisticated Lovász Local Lemma argument and Catalan numbers, we can show that the above algorithm succeeds with non-zero probability.
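The procedure above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the slide: symbols are picked uniformly at random, and "restart" is read as "continue from the shortened word". The names `square_suffix` and `nonrepetitive_word`, and the `max_steps` guard, are also made up here.

```python
import random

def square_suffix(w):
    """Length of v if the list w ends in a square vv (shortest such v), else 0."""
    n = len(w)
    for k in range(1, n // 2 + 1):
        if w[n - 2 * k:n - k] == w[n - k:]:
            return k
    return 0

def nonrepetitive_word(lists, rng, max_steps=100_000):
    """Build w with w[i] in lists[i] and no square, erasing on each repetition."""
    w = []
    for _ in range(max_steps):
        if len(w) == len(lists):
            return w
        w.append(rng.choice(lists[len(w)]))  # assumption: uniform random pick
        k = square_suffix(w)
        if k:
            del w[-k:]  # delete the right copy of v, then continue
    raise RuntimeError("step budget exhausted")

lists = [list("abcd")] * 30
w = nonrepetitive_word(lists, random.Random(0))
assert len(w) == 30 and all(c in L for c, L in zip(w, lists))
# w has no square iff no prefix of w ends in a square.
assert all(square_suffix(w[:i]) == 0 for i in range(len(w) + 1))
```

After a deletion the remaining word is a prefix of the previous non-repetitive word, so the invariant "w is non-repetitive" holds at every step.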
Particular “yes” cases for L_1, L_2, ..., L_n
- Has a system of distinct representatives (SDR)
- Has the union property
- Can be mapped consistently to Σ_3 = {1, 2, 3}
- It is a partition
Open Problem 1
Given any list L_1, L_2, ..., L_n, where |L_i| = 3, can we always find a non-repetitive string w over such a list?
Shuffle
w is a shuffle of u, v: w = u ⊙ v
w = 0110110011101000
u = 01101110
v = 10101000
w is a shuffle of u and v provided:
u = x_1 x_2 ··· x_k
v = y_1 y_2 ··· y_k
for some (possibly empty) strings x_i, y_i, and w is obtained by “interleaving”: w = x_1 y_1 x_2 y_2 ··· x_k y_k.
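Whether w is a shuffle of u and v can be decided with a standard dynamic program over prefixes. This is a textbook technique offered as a sketch, not a method taken from the talk, and the name `is_shuffle` is chosen here.

```python
def is_shuffle(u, v, w):
    """True iff w can be obtained by interleaving u and v."""
    if len(u) + len(v) != len(w):
        return False
    # table[i][j] is True iff w[:i+j] is a shuffle of u[:i] and v[:j].
    table = [[False] * (len(v) + 1) for _ in range(len(u) + 1)]
    table[0][0] = True
    for i in range(len(u) + 1):
        for j in range(len(v) + 1):
            # The last symbol of w[:i+j] comes either from u or from v.
            if i > 0 and table[i - 1][j] and u[i - 1] == w[i + j - 1]:
                table[i][j] = True
            if j > 0 and table[i][j - 1] and v[j - 1] == w[i + j - 1]:
                table[i][j] = True
    return table[len(u)][len(v)]

# The example from the slide.
assert is_shuffle("01101110", "10101000", "0110110011101000")
```

The table has (|u| + 1)(|v| + 1) entries and each is filled in constant time, so the check runs in time O(|u| · |v|).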
Square Shuffle
w is a square provided it is equal to a shuffle of a string u with itself, i.e., ∃u s.t. w = u ⊙ u
The string w = 0110110011101000 is a square, with u = 01101100.
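For short strings, squareness can be tested by exhaustive search: choose which positions of w form the first copy of u and compare with the remaining positions. This is a brute-force sketch (exponential time), and the name `square_shuffle_root` is hypothetical.

```python
from itertools import combinations

def square_shuffle_root(w):
    """Return some u with w a shuffle of u with itself, or None if w is not a square."""
    n = len(w)
    if n % 2:
        return None
    for first in combinations(range(n), n // 2):
        chosen = set(first)
        u = "".join(w[i] for i in first)                          # first copy
        rest = "".join(w[i] for i in range(n) if i not in chosen)  # second copy
        if u == rest:
            return u
    return None

assert square_shuffle_root("0110110011101000") is not None  # the slide's example
assert square_shuffle_root("0110") is None
```

The search inspects up to C(n, n/2) position sets, which is fine for the 16-symbol example but not for long inputs.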
Result from 2013
Given an alphabet Σ with |Σ| ≥ 7,
Square = {w : ∃u (w = u ⊙ u)}
is NP-complete.
What we leave open:
- What about |Σ| = 2? (For |Σ| = 1, Square is just the set of even-length strings.)
- What if |Σ| = ∞ but each symbol cannot occur more often than, say, 6 times? (If each symbol occurs at most 4 times, Square can be reduced to 2-Sat; see P. Austrin's Stack Exchange post http://bit.ly/WATco3.)
Open Problem 2
Is Square NP-complete for alphabets of size 2, 3, 4, 5, or 6?
Upper and lower bounds
Shuffle(x, y, w) holds if and only if w is a shuffle of x, y.
Shuffle ∉ AC^0, but Shuffle ∈ AC^1.
Upper bound
Lower bound
Parity(x) = ⋁_{0 ≤ i ≤ |x|, i odd} Shuffle(0^{|x|−i}, 1^i, x).
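The reduction can be tested on small inputs. The key observation is that a binary string x is a shuffle of 0^{|x|−i} and 1^i exactly when x contains i ones. The sketch below uses a standard dynamic program for Shuffle (an illustrative choice, not the circuit from the talk) and compares the formula with a direct parity count.

```python
from itertools import product

def is_shuffle(u, v, w):
    """True iff w can be obtained by interleaving u and v."""
    if len(u) + len(v) != len(w):
        return False
    table = [[False] * (len(v) + 1) for _ in range(len(u) + 1)]
    table[0][0] = True
    for i in range(len(u) + 1):
        for j in range(len(v) + 1):
            if i > 0 and table[i - 1][j] and u[i - 1] == w[i + j - 1]:
                table[i][j] = True
            if j > 0 and table[i][j - 1] and v[j - 1] == w[i + j - 1]:
                table[i][j] = True
    return table[len(u)][len(v)]

def parity(x):
    """OR over odd i of Shuffle(0^(|x|-i), 1^i, x), as in the formula above."""
    n = len(x)
    return any(is_shuffle("0" * (n - i), "1" * i, x) for i in range(1, n + 1, 2))

for k in range(7):
    for t in product("01", repeat=k):
        x = "".join(t)
        assert parity(x) == (x.count("1") % 2 == 1)
```

Because Parity is not in AC^0 and the formula is a single OR of Shuffle instances, a constant-depth circuit for Shuffle would yield one for Parity, which is the lower-bound argument stated on the slide.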
[Figure: circuit computing Parity(x), with one Shuffle gate for each odd i = 1, 3, 5, ..., n, taking inputs 0^{n−i}, 1^i and x.]
Open Problem 3
Is Shuffle in NC^1?
Announcement of two upcoming seminars
1. February 16, 2015, 6:00-7:00pm, Bell Tower 1471
   Ryszard Janicki
   On Pairwise Comparisons Based Rankings
2. February 16, 2015, 7:00-8:00pm, Bell Tower 1471
   Neerja Mhaskar
   Repetition in Strings and String Shuffles
Computer Science Seminars: http://compsci.csuci.edu/degrees/seminars.htm
References
[GKM10] Jarosław Grytczuk, Jakub Kozik, and Piotr Micek. A new approach to nonrepetitive sequences. arXiv:1103.3809, December 2010.
[Sha09] Jeffrey Shallit. A second course in formal languages and automata theory. Cambridge University Press, 2009.