Dynamic Finite-State Transducer Composition with Look-Ahead for Very-Large-Scale Speech Recognition
Cyril Allauzen - [email protected]
Ciprian Chelba - [email protected]
Boulos Harb - [email protected]
Michael Riley - [email protected]
Johan Schalkwyk - [email protected]
Aug 19, 2010
Weighted Finite-State Transducers in Speech Recognition - I
• WFSTs are a general and efficient representation for many speech and NLP problems; see: Mohri et al., “Speech recognition with weighted finite-state transducers”, in Handbook of Speech Processing. Springer, 2008.
• In ASR, they have been used to:
– Represent models:
∗ G: n-gram language model (automaton over words)
∗ L: pronunciation lexicon (transducer from CI phones to words)
∗ C: context dependency (transducer from CD phones to CI phones)
– Combine and optimize models (a minimal OpenFst sketch follows this list):
∗ Composition: Computes the relational composition of two transducers.
∗ Epsilon Removal: Finds an equivalent WFST with no ε-transitions.
∗ Determinization: Finds an equivalent WFST that has no identically-labeled transitions leaving a state.
∗ Minimization: Finds an equivalent deterministic WFST with the fewest states and arcs.
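As a concrete illustration, a minimal OpenFst sketch of this pipeline is given below. The file names L.fst and G.fst are hypothetical placeholders; any compatible transducers in OpenFst format would do, and determinizability of L ◦ G is assumed (in ASR cascades this is usually arranged with auxiliary disambiguation symbols).

#include <fst/fstlib.h>
#include <memory>

int main() {
  // Hypothetical model files: a lexicon L and a grammar G.
  std::unique_ptr<fst::StdVectorFst> l(fst::StdVectorFst::Read("L.fst"));
  std::unique_ptr<fst::StdVectorFst> g(fst::StdVectorFst::Read("G.fst"));

  // Composition requires one side to be sorted on the matching labels.
  fst::ArcSort(l.get(), fst::OLabelCompare<fst::StdArc>());

  fst::StdVectorFst lg;
  fst::Compose(*l, *g, &lg);      // L ∘ G

  fst::StdVectorFst det_lg;
  fst::Determinize(lg, &det_lg);  // det(L ∘ G), assuming determinizable input
  fst::Minimize(&det_lg);         // equivalent machine with fewest states

  det_lg.Write("det_LG.fst");
  return 0;
}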
Weighted Finite-State Transducers in Speech Recognition - II
• Advantages:
– Uniform data representation
– General, efficient, mathematically well-defined and reusable combination and optimization operations
– Variant systems realized in data, not code.
• OpenFst, an open-source finite-state transducer library, was used for
this work (http://www.openfst.org). Released under the Apache
license; used in many speech and NLP applications.
Weighted Acceptors
• Finite automata with labels and weights.
• Example: Word pronunciation acceptor:
[Figure: word pronunciation acceptor for “data” — a path d/1 · (ey/0.5 | ae/0.5) · (t/0.3 | dx/0.7) · ax/1 from the start state to the final state.]
Weighted Transducers
• Finite automata with input labels, output labels, and weights.
• Example: Word pronunciation transducer:
[Figure: word pronunciation transducer — a path d:data/1 · (ey:ε/0.5 | ae:ε/0.5) · (t:ε/0.3 | dx:ε/0.7) · ax:ε/1 for “data”, and a path d:dew/1 · uw:ε/1 for “dew”.]
• L: Closed union of |V | word pronunciation transducers.
• G: An n-gram model is a WFSA with (at most) |V |n−1 states.
Context-Dependent Triphone Transducer C
[Figure: triphone context-dependency transducer C over CI phones x and y — states remember the two most recent phones, and each arc pairs a CI phone with a CD phone, e.g. x:x/y_x reads CI phone x and outputs the triphone “x with left context y and right context x”.]
Recognition Transducer Construction
• The models C, L, G can be combined and optimized with weighted finite-state composition and determinization as:
C ◦ det(L ◦ G) (1)
• An alternative construction, producing an equivalent transducer, is:
(C ◦ det(L)) ◦ G (2)
If G is deterministic, Eq. 2 can be as efficient as Eq. 1 while avoiding the determinization of L ◦ G, greatly saving time and memory and allowing fast dynamic combination (useful in applications; a delayed-composition sketch follows the list below).
• However, standard composition presents three problems with Eq. 2:
1. Determinization of L moves word labels back (toward the end of each pronunciation), creating delay in matching and (possibly very many) useless composition paths.
2. The delayed word labels in L produce a much larger composed machine when G is an n-gram LM.
3. The delayed word labels push back the grammar weights along paths in the composed machine, to the detriment of ASR pruning.
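As a hedged sketch of the dynamic side of Eq. 2: OpenFst's ComposeFst is a delayed FST whose states are expanded only when visited, so the composed machine is never built offline (file names below are hypothetical).

#include <fst/fstlib.h>
#include <iostream>
#include <memory>

int main() {
  // CL = C ∘ det(L), built offline; G is the (deterministic) grammar.
  std::unique_ptr<fst::StdVectorFst> cl(fst::StdVectorFst::Read("CL.fst"));
  std::unique_ptr<fst::StdVectorFst> g(fst::StdVectorFst::Read("G.fst"));
  fst::ArcSort(cl.get(), fst::OLabelCompare<fst::StdArc>());

  // Delayed composition: states of (C ∘ det(L)) ∘ G are created on demand.
  fst::StdComposeFst clg(*cl, *g);

  // Iterating over a state's arcs forces just that state's expansion.
  for (fst::ArcIterator<fst::StdComposeFst> ai(clg, clg.Start());
       !ai.Done(); ai.Next()) {
    std::cout << ai.Value().ilabel << std::endl;
  }
  return 0;
}

With only the trivial or ε-matching filter, this delayed composition still suffers from the three problems above; the look-ahead filters introduced later address them inside the composition filter itself.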
Composition Example
[Figure: L, det(L), G — L is the union of the pronunciation transducers for red, read, reed, road and rode, outputting each word on its initial r-transition; det(L) is its determinization, which shares common prefixes and delays the word outputs to the final d-transitions (d:red, d:read, d:reed, d:road, d:rode); G is a grammar accepting red/0.6 and read/0.4.]
[Figure: L ◦ G versus det(L) ◦ G — in L ◦ G the word labels are matched against G immediately; in det(L) ◦ G the delayed labels let composition build useless paths (e.g. through ao:ε toward road and rode) that can never match G.]
Definitions and Notation – Paths
• Path π
– Origin or previous state: p[π].
– Destination or next state: n[π].
– Input label: i[π].
– Output label: o[π].
[Diagram: a transition from p[π] to n[π] with label i[π]:o[π].]
• Sets of paths
– P (R1, R2): set of all paths from R1 ⊆ Q to R2 ⊆ Q.
– P (R1, x, R2): paths in P (R1, R2) with input label x.
– P (R1, x, y, R2): paths in P (R1, x, R2) with output label y.
Definitions and Notation – Transducers
• Alphabets: input A, output B.
• States: Q, initial states I, final states F .
• Transitions: E ⊆ Q × (A ∪ {ε}) × (B ∪ {ε}) × K × Q.
• Weight functions:
initial weight function λ : I → K
final weight function ρ : F → K.
• Transducer T = (A, B, Q, I, F, E, λ, ρ) with, for all x ∈ A∗, y ∈ B∗:
[[T]](x, y) = ⊕_{π ∈ P(I,x,y,F)} λ(p[π]) ⊗ w[π] ⊗ ρ(n[π])
Semirings
A semiring (K,⊕,⊗, 0, 1) = a ring that may lack negation.
• Sum: to compute the weight of a sequence (sum of the weights of
the paths labeled with that sequence).
• Product: to compute the weight of a path (product of the weights
of constituent transitions).
Semiring     Set              ⊕     ⊗   0    1
Boolean      {0, 1}           ∨     ∧   0    1
Probability  R+               +     ×   0    1
Log          R ∪ {−∞, +∞}     ⊕log  +   +∞   0
Tropical     R ∪ {−∞, +∞}     min   +   +∞   0
String       B∗ ∪ {∞}         lcp   ·   ∞    ε
with ⊕log defined by: x ⊕log y = −log(e^−x + e^−y) (illustrated below).
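For illustration, a small sketch using OpenFst's weight classes, which implement the tropical and log semirings above (the numeric values are arbitrary):

#include <fst/fstlib.h>
#include <iostream>

int main() {
  // Tropical semiring: ⊕ = min, ⊗ = +.
  fst::TropicalWeight a(1.0), b(2.0);
  std::cout << fst::Plus(a, b) << std::endl;   // min(1, 2) = 1
  std::cout << fst::Times(a, b) << std::endl;  // 1 + 2 = 3

  // Log semiring: ⊕ = ⊕log, ⊗ = +.
  fst::LogWeight x(1.0), y(2.0);
  std::cout << fst::Plus(x, y) << std::endl;   // −log(e^−1 + e^−2) ≈ 0.687
  std::cout << fst::Times(x, y) << std::endl;  // 1 + 2 = 3
  return 0;
}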
(ε-Free) Composition Algorithm
• States: (q1, q2) with q1 a state of T1 and q2 a state of T2.
• Transitions: for e1 a transition leaving q1 and e2 leaving q2 such that o[e1] = i[e2]:
→ ((q1, q2), i[e1], o[e2], w[e1] ⊗ w[e2], (n[e1], n[e2])) — see the sketch below
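To make the construction concrete, here is a sketch implementing exactly this product construction over OpenFst structures. NaiveCompose is a hypothetical helper, not the library's Compose; it assumes ε-free inputs.

#include <fst/fstlib.h>
#include <map>
#include <queue>
#include <utility>

fst::StdVectorFst NaiveCompose(const fst::StdFst& t1, const fst::StdFst& t2) {
  using fst::StdArc;
  typedef std::pair<int, int> StatePair;
  fst::StdVectorFst result;
  std::map<StatePair, int> ids;  // (q1, q2) -> state id in the result
  std::queue<StatePair> queue;

  StatePair start(t1.Start(), t2.Start());
  ids[start] = result.AddState();
  result.SetStart(ids[start]);
  queue.push(start);

  while (!queue.empty()) {
    StatePair q = queue.front();
    queue.pop();
    int s = ids[q];
    // Final weight: ρ(q1, q2) = ρ1(q1) ⊗ ρ2(q2) (Zero() if either is non-final).
    result.SetFinal(s, Times(t1.Final(q.first), t2.Final(q.second)));
    for (fst::ArcIterator<fst::StdFst> a1(t1, q.first); !a1.Done(); a1.Next()) {
      for (fst::ArcIterator<fst::StdFst> a2(t2, q.second); !a2.Done(); a2.Next()) {
        const StdArc& e1 = a1.Value();
        const StdArc& e2 = a2.Value();
        if (e1.olabel != e2.ilabel) continue;  // require o[e1] = i[e2]
        StatePair next(e1.nextstate, e2.nextstate);
        if (ids.find(next) == ids.end()) {
          ids[next] = result.AddState();
          queue.push(next);
        }
        // ((q1, q2), i[e1], o[e2], w[e1] ⊗ w[e2], (n[e1], n[e2]))
        result.AddArc(s, StdArc(e1.ilabel, e2.olabel,
                                Times(e1.weight, e2.weight), ids[next]));
      }
    }
  }
  return result;
}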
Generalized Composition Algorithm
• Composition Filter:
Φ = (T1, T2, Q3, i3,⊥, ϕ)
– Q3: set of filter states with i3 initial and ⊥ final.
– ϕ : (e1, e2, q3) ↦ (e′1, e′2, q′3): transition filter
• Algorithm:
– States: (q1, q2, q3) with q1 in T1, q2 in T2 and q3 a filter state.
– Transitions: e1 a transition leaving q1 and e2 leaving q2 such that ϕ(e1, e2, q3) = (e′1, e′2, q′3) with q′3 ≠ ⊥
→ ((q1, q2, q3), i[e′1], o[e′2], w[e′1] ⊗ w[e′2], (n[e′1], n[e′2], q′3))
• Trivial filter Φtrivial:
– Allows all matching paths
Q3 = {0, ⊥}, i3 = 0 and
ϕ(e1, e2, 0) = (e1, e2, 0) if o[e1] = i[e2], (e1, e2, ⊥) otherwise
→ basic ε-free composition algorithm
Pseudo-code
Weighted-Composition(T1, T2)
 1  Q ← I ← S ← I1 × I2 × {i3}
 2  while S ≠ ∅ do
 3      (q1, q2, q3) ← Head(S)
 4      Dequeue(S)
 5      if (q1, q2, q3) ∈ F1 × F2 × Q3 then
 6          F ← F ∪ {(q1, q2, q3)}
 7          ρ(q1, q2, q3) ← ρ1(q1) ⊗ ρ2(q2) ⊗ ρ3(q3)
 8      M ← {(e1, e2) ∈ E^L[q1] × E^L[q2] such that ϕ(e1, e2, q3) = (e′1, e′2, q′3) with q′3 ≠ ⊥}
 9      for each (e1, e2) ∈ M do
10          (e′1, e′2, q′3) ← ϕ(e1, e2, q3)
11          if (n[e′1], n[e′2], q′3) ∉ Q then
12              Q ← Q ∪ {(n[e′1], n[e′2], q′3)}
13              Enqueue(S, (n[e′1], n[e′2], q′3))
14          E ← E ∪ {((q1, q2, q3), i[e′1], o[e′2], w[e′1] ⊗ w[e′2], (n[e′1], n[e′2], q′3))}
15  return T
Epsilon-Matching Filter
• An ε-transition in T1 (resp. T2) can be matched in T2 (resp. T1) by an ε-transition or by staying at the same state
(as if there were ε self-loops at each state in T1 and T2)
• Allowing all possible ε-matches:
→ redundant ε-paths in T1 ◦ T2
→ wrong result when the semiring is non-idempotent
• Filter Φε-match:
– Disallows redundant ε-paths, favoring matching actual ε-transitions
Q3 = {0, 1, 2, ⊥}, i3 = 0 and ϕ(e1, e2, q3) = (e1, e2, q′3) where:
q′3 = 0 if (o[e1], i[e2]) = (x, x) with x ∈ B,
      0 if (o[e1], i[e2]) = (ε, ε) and q3 = 0,
      1 if (o[e1], i[e2]) = (εL, ε) and q3 ≠ 2,
      2 if (o[e1], i[e2]) = (ε, εL) and q3 ≠ 1,
      ⊥ otherwise.
εL: label of the added self-loops
→ composition algorithm of [Mohri, Pereira and Riley, 96]
Label-Reachability Filter
– Disallows following an ε-path from q1 in T1 that will fail to reach a non-ε label matching some transition leaving q2
• Label-Reachability r : Q1 × B → {0, 1}
r(q, x) = 1 if there exists a path from q to some q′ with output label x, 0 otherwise
• Filter Φreach: same as Φtrivial except when o[e1] = ε and i[e2] = εL; then
ϕ(e1, e2, 0) = (e1, e2, q3) with
q3 = 0 if there exists e′2 leaving q2 such that r(n[e1], i[e′2]) = 1, ⊥ otherwise
(a DFS sketch of r follows the figure below)
[Figure: det(L) composed with the grammar G (red/0.6, read/0.4) under Φreach — the dead-end ε-paths of plain det(L) ◦ G (e.g. through ao:ε toward road and rode) are no longer created.]
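A sketch of computing r by depth-first search with memoization; Reachable is a hypothetical helper, label 0 is OpenFst's ε, and the ε-output subgraph is assumed acyclic (true for tree-shaped lexicons).

#include <fst/fstlib.h>
#include <map>
#include <set>

typedef std::set<int> LabelSet;

// Returns {x : r(q, x) = 1}: the output labels readable from q along
// output-ε paths, memoized per state in `memo`.
const LabelSet& Reachable(const fst::StdVectorFst& t, int q,
                          std::map<int, LabelSet>* memo) {
  std::map<int, LabelSet>::iterator it = memo->find(q);
  if (it != memo->end()) return it->second;
  LabelSet& r = (*memo)[q];  // std::map references remain valid on insert
  for (fst::ArcIterator<fst::StdVectorFst> ai(t, q); !ai.Done(); ai.Next()) {
    const fst::StdArc& e = ai.Value();
    if (e.olabel != 0) {
      r.insert(e.olabel);                // reached a non-ε output label
    } else {
      const LabelSet& sub = Reachable(t, e.nextstate, memo);
      r.insert(sub.begin(), sub.end());  // continue along the ε-output path
    }
  }
  return r;
}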
Label-Reachability Filter with Label Pushing
• When matching an ε-transition e1 with an εL-loop e2:
– if there exists a unique e′2 leaving q2 such that r(n[e1], i[e′2]) = 1,
– then allow matching e1 with e′2 instead
→ early output of o[e′2]
• Filter Φpush-label: Q3 = B ∪ {ε, ⊥} and i3 = ε
– the filter state encodes the label that has been consumed early
[Figure: with Φpush-label, composing det(L) with G outputs word labels as early as they become unique — e.g. the iy:ε transition becomes iy:read, the corresponding final transition becomes d:ε/0.4, and the filter state records the pushed label (composed states like (3, 1, read)).]
Label-Reachability Filter with Weight Pushing
• When matching an ε-transition e1 with an εL-loop e2:
– output early the ⊕-sum of the weights of the prospective matches
• Reachable weight wr : (q1, q2) ↦ ⊕_{e ∈ E[q2], r(q1, i[e]) = 1} w[e]
• Filter Φpush-weight: Q3 = K, i3 = 1 and ⊥ = 0
– the filter state encodes the weight that has been output early:
if o[e1] = ε and i[e2] = εL, then q′3 = wr(n[e1], q2) and w[e′2] = q3^−1 ⊗ q′3
[Figure: with Φpush-weight, composing det(L) with G pushes grammar weights forward — e.g. iy:ε acquires weight 0.4 (the ⊕-sum of the grammar weights still reachable), the final d:read transition becomes weightless, and the filter state records the weight output early (composed states like (3, 0, 0.4)).]
Implementation
• Representation of r
– Point representation: Rq = {x ∈ B : r(q, x) = 1}
→ inefficient in time and space
– Interval representation:
Iq = {[x, y) : x, y ∈ N, [x, y) ⊆ Rq, x − 1 ∉ Rq, y ∉ Rq}
→ efficiency depends on the number of intervals for each Rq
∗ one interval per state is trivial for a tree — found by DFS
∗ one interval per state is possible if the consecutive-ones property (C1P) holds → true for a unique-pronunciation L and preserved by determinization, minimization, closure and composition with C
∗ a multiple-pronunciation L typically fails C1P; however, a modification of Hsu's (2002) C1P test gives a greedy algorithm for minimizing the number of intervals per state (a membership-test sketch follows)
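A sketch of testing label membership against the interval representation Iq; InIntervals is a hypothetical helper and assumes the intervals are disjoint, half-open and sorted.

#include <algorithm>
#include <climits>
#include <utility>
#include <vector>

typedef std::pair<int, int> Interval;  // half-open [x, y)

// Is `label` contained in some interval of iq? Binary search: locate the
// last interval starting at or before `label`, then check its right end.
bool InIntervals(const std::vector<Interval>& iq, int label) {
  std::vector<Interval>::const_iterator it =
      std::upper_bound(iq.begin(), iq.end(), Interval(label, INT_MAX));
  if (it == iq.begin()) return false;  // every interval starts after label
  --it;                                // now it->first <= label
  return label < it->second;
}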
Implementation
• Efficient computation of wr
– Requires fast computation of sq(x, y) = ⊕_{e ∈ E[q], i[e] ∈ [x, y)} w[e] for q in T2, x and y in B = N
– Achieved by precomputing the cumulative weights cq(x) = ⊕_{e ∈ E[q], i[e] < x} w[e]
→ sq(x, y) = cq(y) − cq(x), using that ⊕ is invertible (e.g. + in the probability semiring; see the sketch below)
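A sketch of this cumulative-weight trick in the probability semiring, where ⊕ = + has an inverse so the subtraction is legitimate; LabeledWeight, CumulativeWeights and IntervalSum are hypothetical names.

#include <algorithm>
#include <vector>

struct LabeledWeight { int label; double prob; };  // arcs of q, sorted by label

struct LabelLess {
  bool operator()(const LabeledWeight& a, int v) const { return a.label < v; }
};

// c[k] = ⊕-sum (here: +) of the first k arc weights.
std::vector<double> CumulativeWeights(const std::vector<LabeledWeight>& arcs) {
  std::vector<double> c(arcs.size() + 1, 0.0);
  for (size_t k = 0; k < arcs.size(); ++k) c[k + 1] = c[k] + arcs[k].prob;
  return c;
}

// s_q(x, y) = c_q(y) − c_q(x): total weight of arcs with label in [x, y).
double IntervalSum(const std::vector<LabeledWeight>& arcs,
                   const std::vector<double>& c, int x, int y) {
  std::vector<LabeledWeight>::const_iterator lo =
      std::lower_bound(arcs.begin(), arcs.end(), x, LabelLess());
  std::vector<LabeledWeight>::const_iterator hi =
      std::lower_bound(arcs.begin(), arcs.end(), y, LabelLess());
  return c[hi - arcs.begin()] - c[lo - arcs.begin()];
}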
Composition Design - Options
• Composition Options:
typedef SortedMatcher<StdFst> SM;
typedef SequenceComposeFilter<StdArc> CF;
ComposeFstOptions<StdArc, SM, CF> opts;
opts.matcher1 = new SM(fst1, MATCH_NONE, kNoLabel);
opts.matcher2 = new SM(fst2, MATCH_INPUT, kNoLabel);
opts.filter = new CF(fst1, fst2);
StdComposeFst cfst(fst1, fst2, opts);
Composition Filters
• Predefined Filters:
Name Description
SequenceComposeFilter Requires FST1 epsilons to be read before FST2 epsilons
AltSequenceComposeFilter Requires FST2 epsilons to be read before FST1 epsilons
MatchComposeFilter Requires FST1 epsilons to be matched with FST2 epsilons
LookAheadComposeFilter<F> Supports lookahead in composition
PushWeightsComposeFilter<F> Supports pushing weights in composition
PushLabelsComposeFilter<F> Supports pushing labels in composition
• Three lookahead composition filters, each templated on an underlying filter F, are added. All three can be used together by cascading them (a hedged usage sketch follows).
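A sketch of how this cascade is typically reached in practice. The type names and the relabeling step below are from later OpenFst releases and should be treated as assumptions, not the exact API of this talk: StdOLabelLookAheadFst bundles an output-label lookahead matcher with the FST, and ComposeFst then selects the lookahead filter cascade by default.

#include <fst/fstlib.h>
#include <fst/matcher-fst.h>
#include <memory>

int main() {
  using fst::StdArc;
  // CL = C ∘ det(L), converted to a lookahead FST (hypothetical file names).
  std::unique_ptr<fst::StdFst> raw(fst::StdFst::Read("CL.fst"));
  fst::StdOLabelLookAheadFst cl(*raw);

  std::unique_ptr<fst::StdVectorFst> g(fst::StdVectorFst::Read("G.fst"));
  // Make G's input labels consistent with the lookahead FST's relabeling.
  fst::LabelLookAheadRelabeler<StdArc>::Relabel(g.get(), cl, true);
  fst::ArcSort(g.get(), fst::ILabelCompare<StdArc>());

  // Delayed composition with lookahead, weight pushing and label pushing
  // engaged through the matcher's flags.
  fst::StdComposeFst clg(cl, *g);
  return 0;
}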
Composition: Matcher Design
• Matchers can find and iterate through requested labels at FST states.
• Matcher Form:
template <class F>
class Matcher {
  typedef typename F::Arc Arc;
 public:
  void SetState(StateId s);            // Specifies current state
  bool Find(Label label);              // Checks state for match to label
  bool Done() const;                   // No more matches
  const Arc& Value() const;            // Current arc
  void Next();                         // Advance to next arc
  bool LookAhead(const Fst<Arc>& fst,  // (Optional) lookahead
                 StateId s, Weight &weight);
};
• A LookAhead() method, given the language (FST + initial state) to expect, is added (a usage example of the base interface follows).
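A usage sketch of the core matcher interface with SortedMatcher: list every arc leaving state s of g whose input label equals word_id (g, s and word_id are hypothetical; g must be sorted on input labels).

#include <fst/fstlib.h>
#include <iostream>

void PrintMatches(const fst::StdFst& g, int s, int word_id) {
  fst::SortedMatcher<fst::StdFst> matcher(g, fst::MATCH_INPUT);
  matcher.SetState(s);
  if (matcher.Find(word_id)) {
    for (; !matcher.Done(); matcher.Next()) {
      const fst::StdArc& arc = matcher.Value();
      std::cout << arc.ilabel << ":" << arc.olabel
                << "/" << arc.weight << std::endl;
    }
  }
}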
Matchers
• Predefined Matchers:
Name Description
SortedMatcher Binary search on sorted input
RhoMatcher<M> ρ symbol handling
SigmaMatcher<M> σ symbol handling
PhiMatcher<M> ϕ symbol handling
LabelLookAheadMatcher<M> Lookahead along epsilon paths
ArcLookAheadMatcher<M> Lookahead one transition
• Two lookahead matchers, each templated on an underlying matcher M , are
added.
• Special symbol matchers:
              Consumes no symbol   Consumes symbol
Matches all   ε                    σ
Matches rest  ϕ                    ρ
Recognition Experiments
Broadcast News vs. Spoken Query Task

Acoustic Model
• Broadcast News:
– Trained on the 96 and 97 DARPA Hub4 AM training sets
– PLP cepstra, LDA analysis, STC
– Triphonic, 8k tied states, 16 components per state
– Speaker adapted (both VTLN + CMLLR)
• Spoken Query Task:
– Trained on > 1000 hrs of voice search queries
– PLP cepstra, LDA analysis, STC
– Triphonic, 4k tied states, 4 - 128 components per state
– Speaker independent

Language Model
• Broadcast News:
– 1996 Hub4 CSR LM training sets
– 4-gram language model pruned to 8M n-grams
• Spoken Query Task:
– Trained on > 1B words of google.com and voice search queries
– 1 million word vocabulary
– Katz back-off model, pruned to various sizes
Recognition Experiments
Precomputation before recognition:

                                     Broadcast News           Spoken Query Task
Construction method                  Time      RAM   Result   Time      RAM    Result
Static
(1) with standard composition        7 min     5.3G  0.5G     10.5 min  11.2G  1.4G
(2) with generalized composition     2.5 min   2.9G  0.5G     4 min     5.3G   1.4G
Dynamic
(2) with generalized composition     none      none  0.2G     none      none   0.5G
Recognition Experiments
• A small part of the recognition transducer is visited during recognition (Spoken Query Task):
– Static: number of states in the recognition transducer: 25.4M
– Dynamic: number of states visited per second: 8K
• Very large language models can be used in the first pass:
[Plot: word error rate as a function of LM size on the Spoken Query Task — x-axis: number of n-grams from 1e+06 to 1e+09; y-axis: word error rate, roughly 17 to 21%. (With Ciprian Chelba and Boulos Harb.)]
Prior Work
• Caseiro and Trancoso (IEEE Trans. on ASLP 2006): they developed a specialized composition for a pronunciation lexicon L. If pronunciations are stored in a trie, then the words readable from a node form a lexicographic interval, which can be used to disallow non-coaccessible epsilon paths.
• Cheng et al. (ICASSP 2007); Oonishi et al. (Interspeech 2008): they use methods apparently similar to ours, but many details are left unspecified, such as the representation of the reachable label sets. There are no published complexities, and the published results show a very significant overhead for dynamic composition compared to a static recognition transducer.
• Our method:
– uses a very efficient representation of the label sets
– uses a very efficient computation of the weight pushing
– has a small overhead between static and dynamic composition
Conclusions
This work:
• Introduces a generalized composition filter for weighted finite-state
composition
• Presents composition filters that:
– Remove useless epsilon paths
– Push forward labels
– Push forward weights
• The combination of these filters permits composing large speech-recognition context-dependent lexicons and language models much more efficiently in time and space than before
• Experiments on Broadcast News and a spoken query task show a 5% to 10% overhead for dynamic, runtime composition compared to static, offline composition. To our knowledge, this is the first such system with so little overhead.