Finite state subautomata: Application to Electronic Dictionaries
Lamia Tounsi
Polytech'Tours, Computer Science Laboratory
François Rabelais University of Tours, France
Motivation
• DFSA are widely used in natural language processing.
• Goal: find all substructures in a given FSA.
• Searching for subautomata in a DFSA makes it possible to:
  - decompose a very large FSA into smaller ones
  - discover frequently occurring data
  - reduce memory consumption
Plan
• Mathematical preliminaries
  - Automaton
  - Subautomaton
• Search for subautomata
  - Smallest closed subautomaton
  - Smallest subautomaton
• Application to automata representing dictionaries
• Indexation and compression
• Conclusion
Automaton
A deterministic acyclic automaton A = <Σ, Q, δ, qi, qf>:
• Σ is the alphabet
• Q is the finite set of states
• δ is the transition function, δ : Q × Σ → Q
• qi is the initial state (qi ∈ Q)
• qf is the final state (qf ∈ Q)
Let a ∈ Σ and w ∈ Σ*:
δ(p, ε) = p
δ(p, wa) = δ(δ(p, w), a)
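The recursive definition of the transition function on words can be sketched in Python as follows. This is only an illustration: the dictionary encoding of the transition function, the names `delta_star` and `accepts`, and the toy automaton `DELTA` are assumptions, not part of the original slides.

```python
from typing import Dict, Optional, Tuple

# Assumed encoding: the transition function maps (state, symbol) to a state.
Delta = Dict[Tuple[int, str], int]

# Hypothetical toy automaton recognising {"ab", "cb"}: qi = 0, qf = 2.
DELTA: Delta = {(0, "a"): 1, (0, "c"): 1, (1, "b"): 2}


def delta_star(delta: Delta, p: int, w: str) -> Optional[int]:
    """Extended transition: the empty word leaves p unchanged, and each
    further symbol applies one step of the transition function.
    Returns None when a transition is undefined."""
    state: Optional[int] = p
    for a in w:
        if state is None:
            return None
        state = delta.get((state, a))
    return state


def accepts(delta: Delta, qi: int, qf: int, w: str) -> bool:
    """A word is recognised when it leads from qi to the unique final state."""
    return delta_star(delta, qi, w) == qf
```

For instance, `accepts(DELTA, 0, 2, "cb")` holds, while `delta_star(DELTA, 0, "ba")` is undefined and yields `None`.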
Successors & predecessors
Succ(p) = {q ∈ Q : ∃ σ ∈ Σ, δ(p, σ) = q}
Succ*(p) = {q ∈ Q : ∃ w ∈ Σ*, δ(p, w) = q}
Pred(p) = {q ∈ Q : ∃ σ ∈ Σ, δ(q, σ) = p}
Pred*(p) = {q ∈ Q : ∃ w ∈ Σ*, δ(q, w) = p}
Height:
• H(qf) = 0
• H(p) = max {H(q) : q ∈ Succ(p)} + 1
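A minimal sketch of these definitions, assuming the same dictionary encoding of the transition function (an assumption of this example). The six-state automaton `DELTA` is a hypothetical toy example with qi = 0 and qf = 5, and every state is assumed to reach qf.

```python
from typing import Dict, Set, Tuple

Delta = Dict[Tuple[int, str], int]

# Hypothetical toy automaton: qi = 0, qf = 5.
DELTA: Delta = {
    (0, "a"): 1,
    (1, "b"): 2, (1, "c"): 3,
    (2, "d"): 4, (3, "d"): 4,
    (4, "e"): 5,
}


def succ(delta: Delta, p: int) -> Set[int]:
    """Direct successors of p."""
    return {q for (s, _), q in delta.items() if s == p}


def pred(delta: Delta, p: int) -> Set[int]:
    """Direct predecessors of p."""
    return {s for (s, _), q in delta.items() if q == p}


def _closure(delta: Delta, p: int, step) -> Set[int]:
    """States reachable from p by repeating `step`; p itself is included
    because the empty word is allowed."""
    seen, stack = {p}, [p]
    while stack:
        for q in step(delta, stack.pop()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def succ_star(delta: Delta, p: int) -> Set[int]:
    return _closure(delta, p, succ)


def pred_star(delta: Delta, p: int) -> Set[int]:
    return _closure(delta, p, pred)


def height(delta: Delta, p: int, qf: int) -> int:
    """Length of the longest path from p to qf (0 for qf itself)."""
    if p == qf:
        return 0
    return 1 + max(height(delta, q, qf) for q in succ(delta, p))
```

The plain recursion in `height` is fine for a toy automaton; a large dictionary automaton would need memoisation or a single pass in reverse topological order.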
Source (E) & Initial State (p)
Let E ⊆ Q
• AP(E) = {w : w is a path from qi to p, p ∈ E}
• AN(E) = {p ∈ Q / ∀ w ∈ AP(E), p ∈ w}
Source(E) ∈ AN(E) is the state such that:
H(Source(E)) = min {H(q) : q ∈ AN(E)}
Let p ∈ Q, p ≠ qi
IS(p) = Source(Pred(p))
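One possible way to compute Source(E) and IS(p) is sketched below. It reads AN(E) as the set of states that lie on every path from qi to a state of E, and computes it with a dominator-style recursion; that recursion, the function names and the toy automaton `DELTA` (qi = 0, qf = 5) are assumptions of this example, not the method of the slides.

```python
from typing import Dict, Optional, Set, Tuple

Delta = Dict[Tuple[int, str], int]

# Hypothetical toy automaton: qi = 0, qf = 5.
DELTA: Delta = {
    (0, "a"): 1,
    (1, "b"): 2, (1, "c"): 3,
    (2, "d"): 4, (3, "d"): 4,
    (4, "e"): 5,
}


def succ(delta: Delta, p: int) -> Set[int]:
    return {q for (s, _), q in delta.items() if s == p}


def pred(delta: Delta, p: int) -> Set[int]:
    return {s for (s, _), q in delta.items() if q == p}


def height(delta: Delta, p: int, qf: int) -> int:
    if p == qf:
        return 0
    return 1 + max(height(delta, q, qf) for q in succ(delta, p))


def on_every_path_to(delta: Delta, p: int, memo: Optional[dict] = None) -> Set[int]:
    """States found on every path from qi to p (p included)."""
    memo = {} if memo is None else memo
    if p not in memo:
        preds = pred(delta, p)
        common = (set.intersection(*(on_every_path_to(delta, q, memo) for q in preds))
                  if preds else set())
        memo[p] = {p} | common
    return memo[p]


def source(delta: Delta, states: Set[int], qf: int) -> int:
    """State of minimal height among those common to all paths reaching `states`.
    `states` must be non-empty."""
    an = set.intersection(*(on_every_path_to(delta, p) for p in states))
    return min(an, key=lambda q: height(delta, q, qf))


def initial_state(delta: Delta, p: int, qf: int) -> int:
    """IS(p) = Source(Pred(p)), defined for p different from qi."""
    return source(delta, pred(delta, p), qf)
```

On the toy automaton, `initial_state(DELTA, 4, 5)` is state 1, the point where the two branches through states 2 and 3 diverge.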
Source (E) & Initial State (p)
Source(q2, q3, q5) = Source(q3, q4) = q2
Source(q3, q4, q5) = Source(q3, q4, q5 , q6) = q1
IS(q3)= q2
IS(q5)= q1
IS(q6)= q1
Sink (E) & Final State (p)
Let E ⊆ Q
• PP(E) = {w : w is a path from p to qf, p ∈ E}
• PN(E) = {p ∈ Q / ∀ w ∈ PP(E), p ∈ w}
Sink(E) ∈ PN(E) is the state such that:
H(Sink(E)) = max {H(q) : q ∈ PN(E)}
Let p ∈ Q, p ≠ qf
FS(p) = Sink(Succ(p))
Subautomaton (SA)
A' = <Σ, Q', δ', si, sf> is a subautomaton of A iff:
• Q' ⊆ Q
• {si, sf} ⊆ Q'
• δ' : Q' × Σ → Q'
• ∀ (q, σ) ∈ Q' × Σ : δ'(q, σ) = δ(q, σ)
• ∀ q ∈ Q' : q ∈ Succ*(si) and q ∈ Pred*(sf)
• ∀ q ∈ Q' \ {si, sf} : Succ(q) ⊆ Q' and Pred(q) ⊆ Q'
Closed subautomaton (CSA)
Let Q' ⊆ Q and si, sf two distinct states.
A subautomaton A' = <Σ, Q', δ', si, sf> is a closed subautomaton iff:
• ∀ q ∈ Q' \ {si} : Pred(q) ⊆ Q'
• ∀ q ∈ Q' \ {sf} : Succ(q) ⊆ Q'
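The two closure conditions can be checked directly. The sketch below assumes a dictionary encoding of the transition function and a hypothetical toy automaton `DELTA`; it tests only the closure conditions and takes for granted that the candidate state set already forms a subautomaton.

```python
from typing import Dict, Set, Tuple

Delta = Dict[Tuple[int, str], int]

# Hypothetical toy automaton: qi = 0, qf = 5.
DELTA: Delta = {
    (0, "a"): 1,
    (1, "b"): 2, (1, "c"): 3,
    (2, "d"): 4, (3, "d"): 4,
    (4, "e"): 5,
}


def succ(delta: Delta, p: int) -> Set[int]:
    return {q for (s, _), q in delta.items() if s == p}


def pred(delta: Delta, p: int) -> Set[int]:
    return {s for (s, _), q in delta.items() if q == p}


def is_closed(delta: Delta, states: Set[int], si: int, sf: int) -> bool:
    """Closure test: every state except si has all its predecessors inside
    `states`, and every state except sf has all its successors inside."""
    preds_inside = all(pred(delta, q) <= states for q in states - {si})
    succs_inside = all(succ(delta, q) <= states for q in states - {sf})
    return preds_inside and succs_inside
```

With this toy automaton, the states {1, 2, 3, 4} between si = 1 and sf = 4 pass the test, whereas {1, 2, 4} fails because state 4 also has predecessor 3.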
Smallest Closed subautomaton (SCSA)
Let Q' ⊆ Q and si, sf two distinct states.
A closed subautomaton A' = <Σ, Q', δ', si, sf> is a smallest closed subautomaton iff, ∀ q ∈ Q':
• (si, q) is a CSA ⇒ q = sf
• (q, sf) is a CSA ⇒ q = si
Smallest Closed subautomaton (SCSA)
[Figure: an automaton that recognizes the inflected forms of nine verbs, with four SCSAs marked]
Smallest subautomaton (SSA)
Let p ∈ Q \ {si, sf}
The subautomaton A' = <Σ, Q', δ', si, sf> is SSA(p) iff:
- A' strictly contains p
- ∀ A'' = <Σ, Q'', δ'', s''i, s''f> which strictly contains p : Q' ⊆ Q''
Smallest subautomaton (SSA)
[Figure: an automaton that recognizes the inflected forms of nine verbs, with SSA(6) and SSA(18) marked]
Search for SCSA
Property 1.
(si, sf) is a SCSA iff IS(sf) = si and FS(si) = sf
Property 2. (Associativity)
If E = E1 ∪ E2 with E1 ≠ ∅ and E2 ≠ ∅, then
Source(E) = Source(Source(E1), Source(E2))
Property 3. (Hierarchy between two SCSA)
• Either they have no common transitions,
• or one is strictly included in the other.
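Property 1 suggests a direct test for a candidate pair (si, sf). The sketch below assumes that IS and FS can be obtained from the states common to all incoming paths and all outgoing paths respectively; the helper names and the toy automaton `DELTA` (qi = 0, qf = 5) are hypothetical, and this naive version is not the O(n²) algorithm of the talk.

```python
from typing import Dict, Set, Tuple

Delta = Dict[Tuple[int, str], int]

# Hypothetical toy automaton: qi = 0, qf = 5.
DELTA: Delta = {
    (0, "a"): 1,
    (1, "b"): 2, (1, "c"): 3,
    (2, "d"): 4, (3, "d"): 4,
    (4, "e"): 5,
}


def succ(delta: Delta, p: int) -> Set[int]:
    return {q for (s, _), q in delta.items() if s == p}


def pred(delta: Delta, p: int) -> Set[int]:
    return {s for (s, _), q in delta.items() if q == p}


def height(delta: Delta, p: int, qf: int) -> int:
    if p == qf:
        return 0
    return 1 + max(height(delta, q, qf) for q in succ(delta, p))


def common_states(delta: Delta, p: int, step, memo: dict) -> Set[int]:
    """States on every path leaving p in the direction of `step` (p included):
    step=pred walks back to qi, step=succ walks forward to qf."""
    if p not in memo:
        nxt = step(delta, p)
        rest = (set.intersection(*(common_states(delta, q, step, memo) for q in nxt))
                if nxt else set())
        memo[p] = {p} | rest
    return memo[p]


def initial_state(delta: Delta, p: int, qf: int) -> int:
    """IS(p) = Source(Pred(p)); requires p != qi."""
    an = set.intersection(*(common_states(delta, q, pred, {}) for q in pred(delta, p)))
    return min(an, key=lambda q: height(delta, q, qf))


def final_state(delta: Delta, p: int, qf: int) -> int:
    """FS(p) = Sink(Succ(p)); requires p != qf."""
    pn = set.intersection(*(common_states(delta, q, succ, {}) for q in succ(delta, p)))
    return max(pn, key=lambda q: height(delta, q, qf))


def is_scsa(delta: Delta, si: int, sf: int, qi: int, qf: int) -> bool:
    """Property 1: (si, sf) is a SCSA iff IS(sf) = si and FS(si) = sf."""
    if sf == qi or si == qf:
        return False
    return initial_state(delta, sf, qf) == si and final_state(delta, si, qf) == sf
```

On the toy automaton the pair (1, 4), which encloses the two parallel branches, satisfies the property, while (0, 4) does not because IS(4) is state 1.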
Search for SCSA
Let p ∈ Q
1. p.IS: initial state associated with p.
2. p.FSmin: minimal final state associated with p, assuming that p is the initial state of a SCSA.
3. p.FSmax: maximal final state associated with p, assuming that p is the initial state of a SCSA.
Property 4.
∀ p > qi, (p.IS, p) is a SCSA iff p.IS.FSmin ≤ p ≤ p.IS.FSmax
Complexity of the algorithm: O(n²)
Search for SSA
Let A' = <Σ, Q', δ', si, sf> be a subautomaton
Property 5.
∀ E ⊆ Q' \ {sf} : Succ*(si) ∩ Pred*(E) ⊆ Q'
∀ E ⊆ Q' \ {si} : Pred*(sf) ∩ Succ*(E) ⊆ Q'
Search for SSA
Property 6.
Let p, p', q, q' ∈ Q. If
• {p, p'} ⊆ Pred(q) and {q, q'} ⊆ Succ(p)
• H(p') ≥ H(p) and H(q') ≤ H(q)
then p and q belong to the same SSA.
All Subautomata of an automaton
Algorithm (input: A; output: subautomata)
repeat
    repeat
        Detect, store and replace each parallel by one transition;
        Detect, store and replace each sequence by one transition;
    until the automaton is free of parallels and sequences
    Detect, store and replace each smallest subautomaton by one transition;
until the automaton A is reduced to a single transition
Valdes J., Tarjan R. E., Lawler E. L., The recognition of series parallel digraphs, SIAM J. Comput. 11(2):298-313, 1982.
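The inner loop of the algorithm (parallel and sequence reductions) can be sketched as below. This is only an illustration under stated assumptions: the automaton is given as a list of (source, label, target) edges, merged transitions receive regex-like labels such as `(bd|cd)`, and the outer step that detects smallest subautomata is not implemented. All function names are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Edge = Tuple[int, str, int]  # (source state, label, target state)


def reduce_parallels(edges: List[Edge]) -> Tuple[List[Edge], List[str]]:
    """Replace transitions sharing both end states by a single transition."""
    groups = defaultdict(list)
    for s, label, d in edges:
        groups[(s, d)].append(label)
    reduced, stored = [], []
    for (s, d), labels in groups.items():
        if len(labels) > 1:
            label = "(" + "|".join(sorted(labels)) + ")"
            stored.append(label)
        else:
            label = labels[0]
        reduced.append((s, label, d))
    return reduced, stored


def reduce_sequences(edges: List[Edge], qi: int, qf: int) -> Tuple[List[Edge], List[str]]:
    """Contract every inner state that has exactly one incoming and one
    outgoing transition, one state at a time."""
    stored, changed = [], True
    while changed:
        changed = False
        ins, outs = defaultdict(list), defaultdict(list)
        for e in edges:
            outs[e[0]].append(e)
            ins[e[2]].append(e)
        for q in list(ins):
            if q not in (qi, qf) and len(ins[q]) == 1 and len(outs[q]) == 1:
                before, after = ins[q][0], outs[q][0]
                merged = (before[0], before[1] + after[1], after[2])
                edges = [e for e in edges if e not in (before, after)] + [merged]
                stored.append(merged[1])
                changed = True
                break
    return edges, stored


def reduce_series_parallel(edges: List[Edge], qi: int, qf: int) -> Tuple[List[Edge], List[str]]:
    """Alternate both reductions until neither applies any more."""
    stored: List[str] = []
    while True:
        edges, parallels = reduce_parallels(edges)
        edges, sequences = reduce_sequences(edges, qi, qf)
        stored += parallels + sequences
        if not parallels and not sequences:
            return edges, stored


# Hypothetical toy automaton: qi = 0, qf = 5.
EDGES = [(0, "a", 1), (1, "b", 2), (1, "c", 3), (2, "d", 4), (3, "d", 4), (4, "e", 5)]
FINAL_EDGES, STORED = reduce_series_parallel(EDGES, 0, 5)
```

The toy automaton is series-parallel, so these two reductions alone bring it down to one transition. An automaton that is not series-parallel would stop with several transitions left, which is where the smallest-subautomaton step of the algorithm would take over.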
Dictionaries and automata
10 dictionaries, with words in lexicographic order:
• 6 Delaf: French, English, Serbian, German, English multiword expressions, French cities.
• 4 Web: French, Hungarian, Bulgarian and Portuguese.
Properties of the automata: finite set of states, acyclic, deterministic, unique initial state, unique final state, minimal.
Factorisation, indexation and compression
The search for subautomata detects sequences and parallels.
[Figures: a sequence subautomaton and a parallel subautomaton]
Proposal:
- Apply the directed acyclic word graph (DAWG), initially designed for indexing text, to index the subautomata.
- Use a heuristic to select the most interesting substructures to factorize.
Storage of an automaton
[Figure: example automaton]

Address  Boolean  Character  Address of arrival state
1        1        a          8
2        0        c          3
3        1        a          5
4        0        b          6
5        1        b          7
6        1        c          10
7        1        c          9
8        1        b          11
9        1        d          0
10       1        d          11
11       1        b          0

Field sizes:
• Character: log2(|Σ|)
• Address of arrival state: log2(Max address + 1)
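The size of such a table follows from the field sizes. The sketch below makes two assumptions that the slide does not state: the Boolean field takes one bit, and each logarithm is rounded up to a whole number of bits. `TABLE` copies the eleven entries of the table as (address, Boolean, character, arrival address); the function names are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, int, str, int]  # (address, boolean, character, arrival address)

TABLE: List[Entry] = [
    (1, 1, "a", 8), (2, 0, "c", 3), (3, 1, "a", 5), (4, 0, "b", 6),
    (5, 1, "b", 7), (6, 1, "c", 10), (7, 1, "c", 9), (8, 1, "b", 11),
    (9, 1, "d", 0), (10, 1, "d", 11), (11, 1, "b", 0),
]


def ceil_log2(n: int) -> int:
    """Smallest number of bits able to distinguish n values (n >= 1)."""
    return (n - 1).bit_length()


def bits_per_entry(table: List[Entry]) -> int:
    """Assumed layout: 1 bit for the Boolean, then the character field,
    then the arrival-address field."""
    alphabet = {char for _, _, char, _ in table}
    max_address = max(max(addr, target) for addr, _, _, target in table)
    return 1 + ceil_log2(len(alphabet)) + ceil_log2(max_address + 1)


def total_bits(table: List[Entry]) -> int:
    return len(table) * bits_per_entry(table)
```

Under these assumptions the four-letter alphabet needs 2 bits, the addresses 0 to 11 need 4 bits, so each entry takes 7 bits and the whole table 77 bits.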
Factorization
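[Figure: the example automaton before and after factorization]

After factorization:

Address  Boolean  Character  Address of arrival state
1        1        a          5
2        0        c          3
3        1        a          7
4        0                   6
5        1        b          6
6        1        b          0
7        1                   0

Before factorization:

Address  Boolean  Character  Address of arrival state
1        1        a          8
2        0        c          3
3        1        a          5
4        0        b          6
5        1        b          7
6        1        c          10
7        1        c          9
8        1        b          11
9        1        d          0
10       1        d          11
11       1        b          0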
How can we choose the subautomata to factorize ?
- The best candidates for factorization are those that improve storage efficiency and reduce the size of the initial automaton.
Profit = saved memory - consumed memory
- Memory is saved by eliminating all occurrences of the substructure.
- Memory is consumed by the extension of the alphabet and by the index.
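The slides do not give formulas for the saved and consumed memory, so the cost model below is purely an assumption made for illustration: each occurrence of a sequence of `length` transitions is replaced by one transition, and the sequence is stored once in the index. The cost of extending the alphabet is ignored here. The function name `profit` and its parameters are hypothetical.

```python
def profit(occurrences: int, length: int, bits_per_entry: int) -> int:
    """Profit = saved memory - consumed memory, under an assumed cost model.

    saved:    each occurrence shrinks from `length` entries to one entry
    consumed: the sequence is kept once in the index (`length` entries)
    """
    saved = occurrences * (length - 1) * bits_per_entry
    consumed = length * bits_per_entry
    return saved - consumed
```

With 7-bit entries, a sequence of three transitions occurring four times yields a positive profit (`profit(4, 3, 7)`), whereas the same sequence occurring once costs more than it saves (`profit(1, 3, 7)` is negative).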
Directed acyclic word graph (DAWG)
A DAWG is used to compute the frequency and the profit associated with each sequence.
[Figure: DAWG of the word aabba]
Greedy Algorithm of Compression
Algorithm (input: A; output: A, Alphabet)
Iterative process:
    Select the best sequence s from the DAWG
    Extend the alphabet to represent s
    Delete s from A and from the DAWG
    Update the DAWG
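The greedy loop can be illustrated on a simplified setting. The sketch below is not the algorithm of the talk: plain lowercase strings stand in for the label sequences of the automaton, naive substring counting replaces the DAWG, and the profit is an assumed character-count model (one character saved per replaced symbol, minus an index entry made of the sequence plus its new symbol). All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def best_sequence(words: List[str]) -> Tuple[Optional[str], int]:
    """Most profitable substring of length >= 2, or (None, 0) if none pays off."""
    candidates = {w[i:j] for w in words
                  for i in range(len(w)) for j in range(i + 2, len(w) + 1)}
    best, best_gain = None, 0
    for s in sorted(candidates):
        occurrences = sum(w.count(s) for w in words)  # non-overlapping count
        gain = occurrences * (len(s) - 1) - (len(s) + 1)
        if gain > best_gain:
            best, best_gain = s, gain
    return best, best_gain


def greedy_compress(words: List[str],
                    fresh: str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ") -> Tuple[List[str], Dict[str, str]]:
    """Repeatedly replace the best sequence by a new symbol.
    Assumes the input words contain no uppercase letters."""
    alphabet: Dict[str, str] = {}
    for symbol in fresh:
        s, _ = best_sequence(words)
        if s is None:
            break
        alphabet[symbol] = s                        # extend the alphabet
        words = [w.replace(s, symbol) for w in words]  # delete s from the data
    return words, alphabet


def expand(words: List[str], alphabet: Dict[str, str]) -> List[str]:
    """Undo the compression, expanding the newest symbols first."""
    for symbol in reversed(list(alphabet)):
        words = [w.replace(symbol, alphabet[symbol]) for w in words]
    return words


WORDS = ["walking", "talking", "walked", "talked"]
COMPRESSED, ALPHABET = greedy_compress(WORDS)
```

Recomputing all candidates at every iteration plays the role of the "update the DAWG" step; a real implementation would update the index incrementally instead.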
Conclusion
• Search for two kinds of smallest subautomata.
• Statistical analysis of the internal structure of some automata associated with dictionaries.
• Compression method based on the factorization of sequence or parallel subautomata.
• A minimised automaton does not always lead to the best compression.