Dynamic Finite-State Transducer Composition with Look-Ahead for Very-Large-Scale Speech Recognition
Cyril Allauzen - [email protected]
Ciprian Chelba - [email protected]
Boulos Harb - [email protected]
Michael Riley - [email protected]
Johan Schalkwyk - [email protected]
Aug 19, 2010
Weighted Finite-State Transducers in Speech Recognition - I
• WFSTs are a general and efficient representation for many speech and NLP problems; see: Mohri et al., “Speech recognition with weighted finite-state transducers”, in Handbook of Speech Processing. Springer, 2008.
• In ASR, they have been used to:
– Represent models:
∗ G: n-gram language model (automaton over words)
∗ L: pronunciation lexicon (transducer from CI phones to words)
∗ C: context dependency (transducer from CD phones to CI phones)
– Combine and optimize models (a minimal OpenFst sketch follows this list):
∗ Composition: Computes the relational composition of two transducers.
∗ Epsilon Removal: Finds an equivalent WFST with no ε-transitions.
∗ Determinization: Finds an equivalent WFST that has no identically-labeled transitions leaving a state.
∗ Minimization: Finds an equivalent deterministic WFST with the fewest states and arcs.
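As a concrete illustration, a minimal OpenFst sketch of this pipeline is given below. The file names L.fst and G.fst are hypothetical placeholders; any compatible transducers in OpenFst format would do, and determinizability of L ◦ G is assumed (in ASR cascades this is usually arranged with auxiliary disambiguation symbols).

#include <fst/fstlib.h>
#include <memory>

int main() {
  // Hypothetical model files: a lexicon L and a grammar G.
  std::unique_ptr<fst::StdVectorFst> l(fst::StdVectorFst::Read("L.fst"));
  std::unique_ptr<fst::StdVectorFst> g(fst::StdVectorFst::Read("G.fst"));

  // Composition requires one side to be sorted on the matching labels.
  fst::ArcSort(l.get(), fst::OLabelCompare<fst::StdArc>());

  fst::StdVectorFst lg;
  fst::Compose(*l, *g, &lg);      // L ∘ G

  fst::StdVectorFst det_lg;
  fst::Determinize(lg, &det_lg);  // det(L ∘ G), assuming determinizable input
  fst::Minimize(&det_lg);         // equivalent machine with fewest states

  det_lg.Write("det_LG.fst");
  return 0;
}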
Weighted Finite-State Transducers in Speech Recognition - II
• Advantages:
– Uniform data representation
– General, efficient, mathematically well-defined and reusable combination and optimization operations
– Variant systems realized in data, not code.
• OpenFst, an open-source finite-state transducer library, was used for
this work (http://www.openfst.org). Released under the Apache
license; used in many speech and NLP applications.
Weighted Acceptors
• Finite automata with labels and weights.
• Example: Word pronunciation acceptor:
[Figure: word pronunciation acceptor for “data” — a path d/1 · (ey/0.5 | ae/0.5) · (t/0.3 | dx/0.7) · ax/1 from the start state to the final state.]
Weighted Transducers
• Finite automata with input labels, output labels, and weights.
• Example: Word pronunciation transducer:
[Figure: word pronunciation transducer — a path d:data/1 · (ey:ε/0.5 | ae:ε/0.5) · (t:ε/0.3 | dx:ε/0.7) · ax:ε/1 for “data”, and a path d:dew/1 · uw:ε/1 for “dew”.]
• L: Closed union of |V | word pronunciation transducers.
• G: An n-gram model is a WFSA with (at most) |V |n−1 states.
Context-Dependent Triphone Transducer C
[Figure: triphone context-dependency transducer C over CI phones x and y — states remember the two most recent phones, and each arc pairs a CI phone with a CD phone, e.g. x:x/y_x reads CI phone x and outputs the triphone “x with left context y and right context x”.]
Recognition Transducer Construction
• The models C, L, G can be combined and optimized with weighted finite-state composition and determinization as:
C ◦ det(L ◦ G) (1)
• An alternative construction, producing an equivalent transducer, is:
(C ◦ det(L)) ◦ G (2)
If G is deterministic, Eq. 2 can be as efficient as Eq. 1 while avoiding the determinization of L ◦ G, greatly saving time and memory and allowing fast dynamic combination (useful in applications; a delayed-composition sketch follows the list below).
• However, standard composition presents three problems with Eq. 2:
1. Determinization of L moves word labels back (toward the end of each pronunciation), creating delay in matching and (possibly very many) useless composition paths.
2. The delayed word labels in L produce a much larger composed machine when G is an n-gram LM.
3. The delayed word labels push back the grammar weights along paths in the composed machine, to the detriment of ASR pruning.
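As a hedged sketch of the dynamic side of Eq. 2: OpenFst's ComposeFst is a delayed FST whose states are expanded only when visited, so the composed machine is never built offline (file names below are hypothetical).

#include <fst/fstlib.h>
#include <iostream>
#include <memory>

int main() {
  // CL = C ∘ det(L), built offline; G is the (deterministic) grammar.
  std::unique_ptr<fst::StdVectorFst> cl(fst::StdVectorFst::Read("CL.fst"));
  std::unique_ptr<fst::StdVectorFst> g(fst::StdVectorFst::Read("G.fst"));
  fst::ArcSort(cl.get(), fst::OLabelCompare<fst::StdArc>());

  // Delayed composition: states of (C ∘ det(L)) ∘ G are created on demand.
  fst::StdComposeFst clg(*cl, *g);

  // Iterating over a state's arcs forces just that state's expansion.
  for (fst::ArcIterator<fst::StdComposeFst> ai(clg, clg.Start());
       !ai.Done(); ai.Next()) {
    std::cout << ai.Value().ilabel << std::endl;
  }
  return 0;
}

With only the trivial or ε-matching filter, this delayed composition still suffers from the three problems above; the look-ahead filters introduced later address them inside the composition filter itself.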
Composition Example
[Figure: L, det(L), G — L is the union of the pronunciation transducers for red, read, reed, road and rode, outputting each word on its initial r-transition; det(L) is its determinization, which shares common prefixes and delays the word outputs to the final d-transitions (d:red, d:read, d:reed, d:road, d:rode); G is a grammar accepting red/0.6 and read/0.4.]
[Figure: L ◦ G versus det(L) ◦ G — in L ◦ G the word labels are matched against G immediately; in det(L) ◦ G the delayed labels let composition build useless paths (e.g. through ao:ε toward road and rode) that can never match G.]
Definitions and Notation – Paths
• Path π
– Origin or previous state: p[π].
– Destination or next state: n[π].
– Input label: i[π].
– Output label: o[π].
[Diagram: a transition from p[π] to n[π] with label i[π]:o[π].]
• Sets of paths
– P (R1, R2): set of all paths from R1 ⊆ Q to R2 ⊆ Q.
– P (R1, x, R2): paths in P (R1, R2) with input label x.
– P (R1, x, y, R2): paths in P (R1, x, R2) with output label y.
Definitions and Notation – Transducers
• Alphabets: input A, output B.
• States: Q, initial states I, final states F .
• Transitions: E ⊆ Q × (A ∪ {ε}) × (B ∪ {ε}) × K × Q.
• Weight functions:
initial weight function λ : I → K
final weight function ρ : F → K.
• Transducer T = (A, B, Q, I, F, E, λ, ρ) with, for all x ∈ A∗, y ∈ B∗:
[[T]](x, y) = ⊕_{π ∈ P(I,x,y,F)} λ(p[π]) ⊗ w[π] ⊗ ρ(n[π])
Semirings
A semiring (K,⊕,⊗, 0, 1) = a ring that may lack negation.
• Sum: to compute the weight of a sequence (sum of the weights of
the paths labeled with that sequence).
• Product: to compute the weight of a path (product of the weights
of constituent transitions).
Semiring     Set              ⊕     ⊗   0    1
Boolean      {0, 1}           ∨     ∧   0    1
Probability  R+               +     ×   0    1
Log          R ∪ {−∞, +∞}     ⊕log  +   +∞   0
Tropical     R ∪ {−∞, +∞}     min   +   +∞   0
String       B∗ ∪ {∞}         lcp   ·   ∞    ε
with ⊕log defined by: x ⊕log y = −log(e^−x + e^−y) (illustrated below).
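For illustration, a small sketch using OpenFst's weight classes, which implement the tropical and log semirings above (the numeric values are arbitrary):

#include <fst/fstlib.h>
#include <iostream>

int main() {
  // Tropical semiring: ⊕ = min, ⊗ = +.
  fst::TropicalWeight a(1.0), b(2.0);
  std::cout << fst::Plus(a, b) << std::endl;   // min(1, 2) = 1
  std::cout << fst::Times(a, b) << std::endl;  // 1 + 2 = 3

  // Log semiring: ⊕ = ⊕log, ⊗ = +.
  fst::LogWeight x(1.0), y(2.0);
  std::cout << fst::Plus(x, y) << std::endl;   // −log(e^−1 + e^−2) ≈ 0.687
  std::cout << fst::Times(x, y) << std::endl;  // 1 + 2 = 3
  return 0;
}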
(ε-Free) Composition Algorithm
• States: (q1, q2) with q1 a state of T1 and q2 a state of T2.
• Transitions: for e1 a transition leaving q1 and e2 leaving q2 such that o[e1] = i[e2]:
→ ((q1, q2), i[e1], o[e2], w[e1] ⊗ w[e2], (n[e1], n[e2])) — see the sketch below
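To make the construction concrete, here is a sketch implementing exactly this product construction over OpenFst structures. NaiveCompose is a hypothetical helper, not the library's Compose; it assumes ε-free inputs.

#include <fst/fstlib.h>
#include <map>
#include <queue>
#include <utility>

fst::StdVectorFst NaiveCompose(const fst::StdFst& t1, const fst::StdFst& t2) {
  using fst::StdArc;
  typedef std::pair<int, int> StatePair;
  fst::StdVectorFst result;
  std::map<StatePair, int> ids;  // (q1, q2) -> state id in the result
  std::queue<StatePair> queue;

  StatePair start(t1.Start(), t2.Start());
  ids[start] = result.AddState();
  result.SetStart(ids[start]);
  queue.push(start);

  while (!queue.empty()) {
    StatePair q = queue.front();
    queue.pop();
    int s = ids[q];
    // Final weight: ρ(q1, q2) = ρ1(q1) ⊗ ρ2(q2) (Zero() if either is non-final).
    result.SetFinal(s, Times(t1.Final(q.first), t2.Final(q.second)));
    for (fst::ArcIterator<fst::StdFst> a1(t1, q.first); !a1.Done(); a1.Next()) {
      for (fst::ArcIterator<fst::StdFst> a2(t2, q.second); !a2.Done(); a2.Next()) {
        const StdArc& e1 = a1.Value();
        const StdArc& e2 = a2.Value();
        if (e1.olabel != e2.ilabel) continue;  // require o[e1] = i[e2]
        StatePair next(e1.nextstate, e2.nextstate);
        if (ids.find(next) == ids.end()) {
          ids[next] = result.AddState();
          queue.push(next);
        }
        // ((q1, q2), i[e1], o[e2], w[e1] ⊗ w[e2], (n[e1], n[e2]))
        result.AddArc(s, StdArc(e1.ilabel, e2.olabel,
                                Times(e1.weight, e2.weight), ids[next]));
      }
    }
  }
  return result;
}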
Generalized Composition Algorithm
• Composition Filter:
Φ = (T1, T2, Q3, i3,⊥, ϕ)
– Q3: set of filter states with i3 initial and ⊥ final.
– ϕ : (e1, e2, q3) ↦ (e′1, e′2, q′3): transition filter
• Algorithm:
– States: (q1, q2, q3) with q1 in T1, q2 in T2 and q3 a filter state.
– Transitions: e1 a transition leaving q1 and e2 leaving q2 such that ϕ(e1, e2, q3) = (e′1, e′2, q′3) with q′3 ≠ ⊥
→ ((q1, q2, q3), i[e′1], o[e′2], w[e′1] ⊗ w[e′2], (n[e′1], n[e′2], q′3))
• Trivial filter Φtrivial:
– Allows all matching paths
Q3 = {0, ⊥}, i3 = 0 and
ϕ(e1, e2, 0) = (e1, e2, 0) if o[e1] = i[e2], (e1, e2, ⊥) otherwise
→ basic ε-free composition algorithm
Pseudo-code
Weighted-Composition(T1, T2)
 1  Q ← I ← S ← I1 × I2 × {i3}
 2  while S ≠ ∅ do
 3      (q1, q2, q3) ← Head(S)
 4      Dequeue(S)
 5      if (q1, q2, q3) ∈ F1 × F2 × Q3 then
 6          F ← F ∪ {(q1, q2, q3)}
 7          ρ(q1, q2, q3) ← ρ1(q1) ⊗ ρ2(q2) ⊗ ρ3(q3)
 8      M ← {(e1, e2) ∈ E^L[q1] × E^L[q2] such that ϕ(e1, e2, q3) = (e′1, e′2, q′3) with q′3 ≠ ⊥}
 9      for each (e1, e2) ∈ M do
10          (e′1, e′2, q′3) ← ϕ(e1, e2, q3)
11          if (n[e′1], n[e′2], q′3) ∉ Q then
12              Q ← Q ∪ {(n[e′1], n[e′2], q′3)}
13              Enqueue(S, (n[e′1], n[e′2], q′3))
14          E ← E ∪ {((q1, q2, q3), i[e′1], o[e′2], w[e′1] ⊗ w[e′2], (n[e′1], n[e′2], q′3))}
15  return T
Epsilon-Matching Filter
• An ε-transition in T1 (resp. T2) can be matched in T2 (resp. T1) by an ε-transition or by staying at the same state
(as if there were ε self-loops at each state in T1 and T2)
• Allowing all possible ε-matches:
→ redundant ε-paths in T1 ◦ T2
→ wrong result when the semiring is non-idempotent
• Filter Φε-match:
– Disallows redundant ε-paths, favoring matching actual ε-transitions
Q3 = {0, 1, 2, ⊥}, i3 = 0 and ϕ(e1, e2, q3) = (e1, e2, q′3) where:
q′3 = 0 if (o[e1], i[e2]) = (x, x) with x ∈ B,
      0 if (o[e1], i[e2]) = (ε, ε) and q3 = 0,
      1 if (o[e1], i[e2]) = (εL, ε) and q3 ≠ 2,
      2 if (o[e1], i[e2]) = (ε, εL) and q3 ≠ 1,
      ⊥ otherwise.
εL: label of the added self-loops
→ composition algorithm of [Mohri, Pereira and Riley, 96]
Label-Reachability Filter
– Disallows following an ε-path from q1 in T1 that will fail to reach a non-ε label matching some transition leaving q2
• Label-Reachability r : Q1 × B → {0, 1}
r(q, x) = 1 if there exists a path from q to some q′ with output label x, 0 otherwise
• Filter Φreach: same as Φtrivial except when o[e1] = ε and i[e2] = εL; then
ϕ(e1, e2, 0) = (e1, e2, q3) with
q3 = 0 if there exists e′2 leaving q2 such that r(n[e1], i[e′2]) = 1, ⊥ otherwise
(a DFS sketch of r follows the figure below)
[Figure: det(L) composed with the grammar G (red/0.6, read/0.4) under Φreach — the dead-end ε-paths of plain det(L) ◦ G (e.g. through ao:ε toward road and rode) are no longer created.]
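A sketch of computing r by depth-first search with memoization; Reachable is a hypothetical helper, label 0 is OpenFst's ε, and the ε-output subgraph is assumed acyclic (true for tree-shaped lexicons).

#include <fst/fstlib.h>
#include <map>
#include <set>

typedef std::set<int> LabelSet;

// Returns {x : r(q, x) = 1}: the output labels readable from q along
// output-ε paths, memoized per state in `memo`.
const LabelSet& Reachable(const fst::StdVectorFst& t, int q,
                          std::map<int, LabelSet>* memo) {
  std::map<int, LabelSet>::iterator it = memo->find(q);
  if (it != memo->end()) return it->second;
  LabelSet& r = (*memo)[q];  // std::map references remain valid on insert
  for (fst::ArcIterator<fst::StdVectorFst> ai(t, q); !ai.Done(); ai.Next()) {
    const fst::StdArc& e = ai.Value();
    if (e.olabel != 0) {
      r.insert(e.olabel);                // reached a non-ε output label
    } else {
      const LabelSet& sub = Reachable(t, e.nextstate, memo);
      r.insert(sub.begin(), sub.end());  // continue along the ε-output path
    }
  }
  return r;
}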
Label-Reachability Filter with Label Pushing
• When matching an ε-transition e1 with an εL-loop e2:
– if there exists a unique e′2 leaving q2 such that r(n[e1], i[e′2]) = 1,
– then allow matching e1 with e′2 instead
→ early output of o[e′2]
• Filter Φpush-label: Q3 = B ∪ {ε, ⊥} and i3 = ε
– the filter state encodes the label that has been consumed early
[Figure: with Φpush-label, composing det(L) with G outputs word labels as early as they become unique — e.g. the iy:ε transition becomes iy:read, the corresponding final transition becomes d:ε/0.4, and the filter state records the pushed label (composed states like (3, 1, read)).]
Label-Reachability Filter with Weight Pushing
• When matching an ε-transition e1 with an εL-loop e2:
– output early the ⊕-sum of the weights of the prospective matches
• Reachable weight wr : (q1, q2) ↦ ⊕_{e ∈ E[q2], r(q1, i[e]) = 1} w[e]
• Filter Φpush-weight: Q3 = K, i3 = 1 and ⊥ = 0
– the filter state encodes the weight that has been output early:
if o[e1] = ε and i[e2] = εL, then q′3 = wr(n[e1], q2) and w[e′2] = q3^−1 ⊗ q′3
[Figure: with Φpush-weight, composing det(L) with G pushes grammar weights forward — e.g. iy:ε acquires weight 0.4 (the ⊕-sum of the grammar weights still reachable), the final d:read transition becomes weightless, and the filter state records the weight output early (composed states like (3, 0, 0.4)).]
Implementation
• Representation of r
– Point representation: Rq = {x ∈ B : r(q, x) = 1}
→ inefficient in time and space
– Interval representation:
Iq = {[x, y) : x, y ∈ N, [x, y) ⊆ Rq, x − 1 ∉ Rq, y ∉ Rq}
→ efficiency depends on the number of intervals for each Rq
∗ one interval per state is trivial for a tree — found by DFS
∗ one interval per state is possible if the consecutive-ones property (C1P) holds → true for a unique-pronunciation L and preserved by determinization, minimization, closure and composition with C
∗ a multiple-pronunciation L typically fails C1P; however, a modification of Hsu's (2002) C1P test gives a greedy algorithm for minimizing the number of intervals per state (a membership-test sketch follows)
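A sketch of testing label membership against the interval representation Iq; InIntervals is a hypothetical helper and assumes the intervals are disjoint, half-open and sorted.

#include <algorithm>
#include <climits>
#include <utility>
#include <vector>

typedef std::pair<int, int> Interval;  // half-open [x, y)

// Is `label` contained in some interval of iq? Binary search: locate the
// last interval starting at or before `label`, then check its right end.
bool InIntervals(const std::vector<Interval>& iq, int label) {
  std::vector<Interval>::const_iterator it =
      std::upper_bound(iq.begin(), iq.end(), Interval(label, INT_MAX));
  if (it == iq.begin()) return false;  // every interval starts after label
  --it;                                // now it->first <= label
  return label < it->second;
}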
Implementation
• Efficient computation of wr
– Requires fast computation of sq(x, y) = ⊕_{e ∈ E[q], i[e] ∈ [x, y)} w[e] for q in T2, x and y in B = N
– Achieved by precomputing the cumulative weights cq(x) = ⊕_{e ∈ E[q], i[e] < x} w[e]
→ sq(x, y) = cq(y) − cq(x), using that ⊕ is invertible (e.g. + in the probability semiring; see the sketch below)
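A sketch of this cumulative-weight trick in the probability semiring, where ⊕ = + has an inverse so the subtraction is legitimate; LabeledWeight, CumulativeWeights and IntervalSum are hypothetical names.

#include <algorithm>
#include <vector>

struct LabeledWeight { int label; double prob; };  // arcs of q, sorted by label

struct LabelLess {
  bool operator()(const LabeledWeight& a, int v) const { return a.label < v; }
};

// c[k] = ⊕-sum (here: +) of the first k arc weights.
std::vector<double> CumulativeWeights(const std::vector<LabeledWeight>& arcs) {
  std::vector<double> c(arcs.size() + 1, 0.0);
  for (size_t k = 0; k < arcs.size(); ++k) c[k + 1] = c[k] + arcs[k].prob;
  return c;
}

// s_q(x, y) = c_q(y) − c_q(x): total weight of arcs with label in [x, y).
double IntervalSum(const std::vector<LabeledWeight>& arcs,
                   const std::vector<double>& c, int x, int y) {
  std::vector<LabeledWeight>::const_iterator lo =
      std::lower_bound(arcs.begin(), arcs.end(), x, LabelLess());
  std::vector<LabeledWeight>::const_iterator hi =
      std::lower_bound(arcs.begin(), arcs.end(), y, LabelLess());
  return c[hi - arcs.begin()] - c[lo - arcs.begin()];
}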
Composition Design - Options
• Composition Options:
typedef SortedMatcher<StdFst> SM;
typedef SequenceComposeFilter<StdArc> CF;
ComposeFstOptions<StdArc, SM, CF> opts;
opts.matcher1 = new SM(fst1, MATCH_NONE, kNoLabel);
opts.matcher2 = new SM(fst2, MATCH_INPUT, kNoLabel);
opts.filter = new CF(fst1, fst2);
StdComposeFst cfst(fst1, fst2, opts);
Composition Filters
• Predefined Filters:
Name Description
SequenceComposeFilter Requires FST1 epsilons to be read before FST2 epsilons
AltSequenceComposeFilter Requires FST2 epsilons to be read before FST1 epsilons
MatchComposeFilter Requires FST1 epsilons to be matched with FST2 epsilons
LookAheadComposeFilter<F> Supports lookahead in composition
PushWeightsComposeFilter<F> Supports pushing weights in composition
PushLabelsComposeFilter<F> Supports pushing labels in composition
• Three lookahead composition filters, each templated on an underlying filter F, are added. All three can be used together by cascading them (a hedged usage sketch follows).
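A sketch of how this cascade is typically reached in practice. The type names and the relabeling step below are from later OpenFst releases and should be treated as assumptions, not the exact API of this talk: StdOLabelLookAheadFst bundles an output-label lookahead matcher with the FST, and ComposeFst then selects the lookahead filter cascade by default.

#include <fst/fstlib.h>
#include <fst/matcher-fst.h>
#include <memory>

int main() {
  using fst::StdArc;
  // CL = C ∘ det(L), converted to a lookahead FST (hypothetical file names).
  std::unique_ptr<fst::StdFst> raw(fst::StdFst::Read("CL.fst"));
  fst::StdOLabelLookAheadFst cl(*raw);

  std::unique_ptr<fst::StdVectorFst> g(fst::StdVectorFst::Read("G.fst"));
  // Make G's input labels consistent with the lookahead FST's relabeling.
  fst::LabelLookAheadRelabeler<StdArc>::Relabel(g.get(), cl, true);
  fst::ArcSort(g.get(), fst::ILabelCompare<StdArc>());

  // Delayed composition with lookahead, weight pushing and label pushing
  // engaged through the matcher's flags.
  fst::StdComposeFst clg(cl, *g);
  return 0;
}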
Composition: Matcher Design
• Matchers can find and iterate through requested labels at FST states.
• Matcher Form:
template <class F>
class Matcher {
  typedef typename F::Arc Arc;
 public:
  void SetState(StateId s);            // Specifies current state
  bool Find(Label label);              // Checks state for match to label
  bool Done() const;                   // No more matches
  const Arc& Value() const;            // Current arc
  void Next();                         // Advance to next arc
  bool LookAhead(const Fst<Arc>& fst,  // (Optional) lookahead
                 StateId s, Weight &weight);
};
• A LookAhead() method, given the language (FST + initial state) to expect, is added (a usage example of the base interface follows).
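A usage sketch of the core matcher interface with SortedMatcher: list every arc leaving state s of g whose input label equals word_id (g, s and word_id are hypothetical; g must be sorted on input labels).

#include <fst/fstlib.h>
#include <iostream>

void PrintMatches(const fst::StdFst& g, int s, int word_id) {
  fst::SortedMatcher<fst::StdFst> matcher(g, fst::MATCH_INPUT);
  matcher.SetState(s);
  if (matcher.Find(word_id)) {
    for (; !matcher.Done(); matcher.Next()) {
      const fst::StdArc& arc = matcher.Value();
      std::cout << arc.ilabel << ":" << arc.olabel
                << "/" << arc.weight << std::endl;
    }
  }
}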
Matchers
• Predefined Matchers:
Name Description
SortedMatcher Binary search on sorted input
RhoMatcher<M> ρ symbol handling
SigmaMatcher<M> σ symbol handling
PhiMatcher<M> ϕ symbol handling
LabelLookAheadMatcher<M> Lookahead along epsilon paths
ArcLookAheadMatcher<M> Lookahead one transition
• Two lookahead matchers, each templated on an underlying matcher M , are
added.
• Special symbol matchers:
              Consumes no symbol   Consumes symbol
Matches all   ε                    σ
Matches rest  ϕ                    ρ
Recognition Experiments
Broadcast News vs. Spoken Query Task

Acoustic Model
• Broadcast News:
– Trained on the 96 and 97 DARPA Hub4 AM training sets
– PLP cepstra, LDA analysis, STC
– Triphonic, 8k tied states, 16 components per state
– Speaker adapted (both VTLN + CMLLR)
• Spoken Query Task:
– Trained on > 1000 hrs of voice search queries
– PLP cepstra, LDA analysis, STC
– Triphonic, 4k tied states, 4 - 128 components per state
– Speaker independent

Language Model
• Broadcast News:
– 1996 Hub4 CSR LM training sets
– 4-gram language model pruned to 8M n-grams
• Spoken Query Task:
– Trained on > 1B words of google.com and voice search queries
– 1 million word vocabulary
– Katz back-off model, pruned to various sizes
Recognition Experiments
Precomputation before recognition:

                                     Broadcast News           Spoken Query Task
Construction method                  Time      RAM   Result   Time      RAM    Result
Static
(1) with standard composition        7 min     5.3G  0.5G     10.5 min  11.2G  1.4G
(2) with generalized composition     2.5 min   2.9G  0.5G     4 min     5.3G   1.4G
Dynamic
(2) with generalized composition     none      none  0.2G     none      none   0.5G
Recognition Experiments
• A small part of the recognition transducer is visited during recognition (Spoken Query Task):
– Static: number of states in the recognition transducer: 25.4M
– Dynamic: number of states visited per second: 8K
• Very large language models can be used in the first pass:
[Plot: word error rate as a function of LM size on the Spoken Query Task — x-axis: number of n-grams from 1e+06 to 1e+09; y-axis: word error rate, roughly 17 to 21%. (With Ciprian Chelba and Boulos Harb.)]
Prior Work
• Caseiro and Trancoso (IEEE Trans. on ASLP 2006): they developed a specialized composition for a pronunciation lexicon L. If pronunciations are stored in a trie, then the words readable from a node form a lexicographic interval, which can be used to disallow non-coaccessible epsilon paths.
• Cheng et al. (ICASSP 2007); Oonishi et al. (Interspeech 2008): they use methods apparently similar to ours, but many details are left unspecified, such as the representation of the reachable label sets. There are no published complexities, and the published results show a very significant overhead for dynamic composition compared to a static recognition transducer.
• Our method:
– uses a very efficient representation of the label sets
– uses a very efficient computation of the weight pushing
– has a small overhead between static and dynamic composition
Conclusions
This work:
• Introduces a generalized composition filter for weighted finite-state
composition
• Presents composition filters that:
– Remove useless epsilon paths
– Push forward labels
– Push forward weights
• The combination of these filters permits composing large speech-recognition context-dependent lexicons and language models much more efficiently in time and space than before
• Experiments on Broadcast News and a spoken query task show a 5% to 10% overhead for dynamic, runtime composition compared to static, offline composition. To our knowledge, this is the first such system with so little overhead.