
Page 1:

The SRI 2006 Spoken Term Detection System

Dimitra Vergyri, Andreas Stolcke, Ramana Rao Gadde, Wen Wang

Speech Technology & Research Laboratory
SRI International, Menlo Park, CA

STD-06 Workshop, Dec. 14, 2006

Page 2:

Outline

• STD system overview
  – STT systems
    • BNews system description
    • CTS system description
    • ConfMtg system description
  – Indexing
    • N-gram index from word lattices
    • NNet-based posterior estimation
  – Retrieval
• Time and memory requirements
• ATWV results
• Future work

Page 3:

SRI STD System

[System diagram: Audio → STT → word lattices → INDEXER → N-gram index with posteriors → RETRIEVER (queried with search terms) → terms with times and probabilities. STT and indexing together form the indexing step.]
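As a rough illustration of the record that flows from the indexer to the retriever, the minimal Python sketch below shows one plausible layout for an index entry; the class and field names are illustrative assumptions, not the actual SRILM index format.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    """One hypothesized N-gram occurrence extracted from a word lattice.
    Field names are illustrative; the real index stores equivalent information."""
    ngram: str        # space-separated words (1-gram up to 5-gram)
    waveform: str     # source waveform name
    channel: str      # audio channel
    start: float      # start time (seconds)
    end: float        # end time (seconds)
    posterior: float  # lattice posterior probability of this occurrence

# Example record, purely for illustration:
example = IndexEntry(ngram="spoken term detection", waveform="bnews_show_1",
                     channel="1", start=123.4, end=124.1, posterior=0.87)
```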

Page 4:

English BN STT System
• Single front end: PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-independent acoustic modeling
• Decision-tree-clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
• Acoustic training data: Hub4, TDT2+TDT4+TDT4a, BNr1234 subset
  – MLE training: 3300 hours; MPE training: 1700 hours
  – 2500 x 200 Gaussians for nonCW triphones
  – 3000 x 160 Gaussians for CW triphones
• Word bigram and 5-gram LMs trained on Hub4, TDT, and BNr1234 transcripts, Hub4 LM training data, and NABN (cutoff date Nov. 30, 2003)
  – 62k words, 29M bigrams, 27M trigrams, 15M 4-grams, 2.4M 5-grams
• Duration rescoring (word-specific phone durations)
• Two-pass decoding
  – First decoding stage: unadapted nonCW model with bigram LM
  – CW models adapted to the nonCW output after 5-gram LM and duration-model lattice rescoring
  – Lattice-constrained decoding with MLLR-adapted, SAT, CW model

Page 5:

English BN STT System

[Decoding flow diagram: PLP MPE nonCW models, 2-gram lattices, 4-gram lattices, PLP MPE CW models, adapted 5-gram lattices. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output.]

• Runtimes:
  – 2.5xRT for unadapted lattices
  – 5.4xRT for adapted lattices
• ~10% relative WER improvement after adaptation
• Both decoding stages use shortlists.

Page 6:

English CTS STT System
• Two front ends:
  – MFCC + voicing + MLP features (52 + 10 + 25 → 39 + 25 dim)
  – PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-dependent acoustic modeling
• Decision-tree-clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
• Acoustic training data: all Hub5 + Fisher training
  – 2500 x 128 x 2 Gaussians for nonCW triphones
  – 3000 x 128 x 2 Gaussians for CW triphones
• Prosodic rescoring (word-specific phone durations, pause trigram)
• Word bigram and 4-gram LMs
  – Interpolated + pruned LMs trained on CTS, BN, and Web data
  – 48k words, 16M bigrams, 16M trigrams, 12M 4-grams
• First lattice generation uses phone-loop MLLR, nonCW MFCC models, and a 2-gram LM
• Second, constrained lattice generation uses cross-adapted CW SAT PLP models.

Page 7:

English Meeting STT System
[Stolcke et al., MLMI'05; Janin et al., MLMI'06]
• Based on the CTS system architecture (2-pass system)
• Combination of CTS (narrow-band) and BN (wide-band) base models
• Acoustic models adapted to distant-mic meeting recordings using MMI-MAP
• MLP features adapted for meeting recordings by incremental training
• Mixture language model trained on meetings, CTS, and Web data
• System used in the RT-06S meeting evaluation, co-developed with ICSI

Page 8:

English CTS & Confmtg STT Systems

[Decoding flow diagram: MFCC-MLP MPE nonCW models, 2-gram lattices, 3-gram lattices, PLP MPE CW models, adapted 4-gram lattices. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output.]

• CTS runtime:
  – 1.8xRT for unadapted lattices
  – 2.5xRT for adapted lattices
• Confmtg runtime:
  – 5.4xRT for unadapted lattices
  – 6.8xRT for adapted lattices
• CTS system uses Gaussian shortlists in the first pass only
• Confmtg system does not use shortlists.

Page 9:

English STT Result Summary (WER)

• BN: eval02 10.7%, eval03 10.5%, STD-dev06 23.2%
• CTS: eval02 23.7%, eval03 24.0%, dev04 17.0%, STD-dev06 17.4%
• Confmtg: dev04 36.9%, eval04s 37.2%, STD-dev06 44.2%

• STD-dev06 WER measured using references constructed from RTTM files
  – Systematic differences compared to standard STT references
  – For example, BN scoring does not exclude commercial segments
• Note: STT systems were not especially tuned for STD; configurations were inherited from STT evaluations.

Page 10:

Indexing of Word Lattices

• SRILM lattice-tool dumps all word 1-grams to 5-grams in the lattices, along with side information (see the sketch below)
  – Posterior probabilities based on normalized recognizer scores
  – Start/end times, channel, waveform name
  – 0.5 s time tolerance to merge the same N-gram with different times
  – Pronunciations (to detect OOV words; not used yet)
  – N-grams with posterior < 0.001 are omitted to keep index size reasonable
• Index = term occurrence table sorted by N-gram
• Indexing function incorporated in SRILM release 1.5.1
  – lattice-tool -write-ngram-index option
  – Downloadable from www.speech.sri.com/projects/srilm/
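A minimal Python sketch of the indexing logic described above (the actual implementation is the -write-ngram-index function of SRILM's lattice-tool); the record layout and the posterior-merging rule are simplifying assumptions.

```python
from collections import defaultdict

POSTERIOR_CUTOFF = 0.001  # omit weak hypotheses to keep the index size reasonable
TIME_TOLERANCE = 0.5      # seconds; merge the same N-gram hypothesized at nearly the same time

def build_index(entries):
    """entries: iterable of dicts with keys
       ngram, waveform, channel, start, end, posterior."""
    merged = defaultdict(list)  # (ngram, waveform, channel) -> list of occurrences
    for e in entries:
        if e["posterior"] < POSTERIOR_CUTOFF:
            continue
        key = (e["ngram"], e["waveform"], e["channel"])
        for occ in merged[key]:
            if abs(occ["start"] - e["start"]) <= TIME_TOLERANCE:
                # Same term at (almost) the same time: merge the two hypotheses.
                # Accumulating the posterior mass is one plausible merging rule.
                occ["posterior"] = min(1.0, occ["posterior"] + e["posterior"])
                break
        else:
            merged[key].append(dict(e))
    # Flatten into a term-occurrence table sorted by N-gram, as described above.
    index = [occ for occs in merged.values() for occ in occs]
    index.sort(key=lambda occ: occ["ngram"])
    return index
```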

Page 11:

Score Calibration

• Neural net maps posteriors to unbiased STD scores (a sketch follows below)
  – Input features used: audio source (bnews/cts/confmtg), LM joint probability, LM N-gram length, #words, duration, lattice posterior
  – Used LNKnet software to train an MLP that predicts the correctness of a hypothesized term (1 hidden layer with 10 nodes)
  – Cross-entropy objective function
• Neural net trained using the dev06 term list
• Training on raw data improved Occurrence-Weighted Value, not Actual Term-Weighted Value (ATWV)
  – Also required re-tuning the posterior threshold.
• Resample training data to approximate ATWV
  – Downsample/upsample within the occurrences of each term to obtain an equal number of training samples for each term.
  – Posterior threshold 0.5 ended up being optimal for ATWV (at least on the training data).
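A minimal sketch of this calibration step, assuming scikit-learn's MLPClassifier as a stand-in for LNKnet; the feature layout, the resampling helper, and its parameters are illustrative assumptions, not the original configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def resample_per_term(X, y, terms, n_per_term=200, seed=0):
    """Down-/up-sample occurrences so that every term contributes the same
    number of training examples (approximating term-weighted training)."""
    X, y, terms = np.asarray(X), np.asarray(y), np.asarray(terms)
    rng = np.random.default_rng(seed)
    rows = []
    for t in np.unique(terms):
        idx = np.flatnonzero(terms == t)
        rows.append(rng.choice(idx, size=n_per_term, replace=len(idx) < n_per_term))
    rows = np.concatenate(rows)
    return X[rows], y[rows]

def train_calibrator(X, y, terms):
    """X: one row per hypothesized occurrence, with features such as audio source
    (one-hot), LM joint probability, N-gram length, #words, duration, lattice posterior.
    y: 1 if the occurrence is correct, 0 otherwise. terms: term string per row."""
    Xb, yb = resample_per_term(X, y, terms)
    mlp = MLPClassifier(hidden_layer_sizes=(10,),   # one hidden layer with 10 nodes
                        activation="logistic",
                        max_iter=500)
    mlp.fit(Xb, yb)  # optimizes a cross-entropy (log-loss) objective
    return mlp

# At search time the calibrated probability replaces the raw lattice posterior;
# a hit is accepted if it exceeds the 0.5 threshold found optimal on the training data:
#   yes = calibrator.predict_proba(x_new)[:, 1] > 0.5
```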

Page 12:

Searching & Retrieval

• Convert the search terms into a sorted list
• Run the Unix "join" command between the index list obtained in indexing and the term list (a Python sketch of the equivalent lookup follows below)
• YES/NO decision based on the posterior threshold 0.5
• Run time almost independent of the size of the search list (depends on the index size)
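A rough Python equivalent of this retrieval step (the actual system ran the Unix join command over the sorted index file); the record layout matches the hypothetical build_index sketch above, and the merge-style lookup is an illustrative assumption.

```python
def retrieve(index, search_terms, threshold=0.5):
    """index: occurrence records sorted by 'ngram' (e.g. the output of build_index).
    search_terms: the query term list. Returns putative hits with YES/NO decisions."""
    terms = sorted(set(search_terms))
    hits = []
    i = 0
    for term in terms:
        # Advance through the sorted index; because both lists are sorted,
        # run time depends mainly on the index size, not on the number of terms.
        while i < len(index) and index[i]["ngram"] < term:
            i += 1
        j = i
        while j < len(index) and index[j]["ngram"] == term:
            occ = index[j]
            hits.append({"term": term,
                         "waveform": occ["waveform"],
                         "start": occ["start"],
                         "end": occ["end"],
                         "score": occ["posterior"],
                         "decision": "YES" if occ["posterior"] >= threshold else "NO"})
            j += 1
    return hits
```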

Page 13:

Time and Memory Requirements

• The system was run on a 3.4 GHz Intel CPU with hyperthreading and 3 GB of memory
• Both index size and search time can be reduced significantly by keeping only candidates with high posteriors
• Note: STT run times were measured incorrectly in the submitted system description.

• STT run time: Confmtg (2h) 40440 s; CTS (3h) 26760 s; BN (3h) 58560 s
• Index size (# terms / MB): Confmtg 530K / 37 MB; CTS 602K / 37 MB; BN 944K / 74 MB
• Indexing: index from lattices 493 s; NNet run time 2711 s
• Search time needed for all terms: 13 s

Page 14:

STD Results

• Extra dev consists of RT02, RT03 (BN+CTS), dev04 (CTS+ConfMtg), RT04s (ConfMtg)

• Difficult to debug eval06 (no references were given), but the result on meetings seems much lower than on dev sets.

• Possibly overtrained neural net on meetings condition.
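For reference when reading the table below (the slides themselves do not define the metric): in the NIST STD-06 evaluation plan the term-weighted value at decision threshold $\theta$ is, roughly,

\[
\mathrm{TWV}(\theta) \;=\; 1 \;-\; \frac{1}{T}\sum_{t=1}^{T}\Bigl[P_{\mathrm{miss}}(t,\theta) \;+\; \beta\,P_{\mathrm{FA}}(t,\theta)\Bigr], \qquad \beta \approx 999.9,
\]

where the sum runs over the $T$ search terms, $P_{\mathrm{miss}}$ is the fraction of true occurrences of term $t$ that are missed, and $P_{\mathrm{FA}}$ is the false-alarm probability measured against the amount of non-target speech. ATWV (Actual Term-Weighted Value) is TWV at the system's actual YES/NO threshold, while Occ.WV averages over occurrences rather than over terms.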

Occ.WV / ATWV by data set (threshold 0.3 for "No NNet", 0.5 for "With NNet"; "---" = Occ.WV not reported):

System   | NNet      | Thres. | dev06       | dryrun06    | eval06      | Extra dev
BN       | No NNet   | 0.3    | 0.914/0.850 | 0.906/0.802 |             | 0.887/0.801
BN       | With NNet | 0.5    | 0.914/0.865 | 0.905/0.818 | ---/0.824   | 0.889/0.817
CTS      | No NNet   | 0.3    | 0.881/0.692 | 0.860/0.615 |             | 0.792/0.681
CTS      | With NNet | 0.5    | 0.881/0.714 | 0.860/0.660 | ---/0.665   | 0.800/0.712
Confmtg  | No NNet   | 0.3    | 0.585/0.275 | 0.566/0.205 |             | 0.631/0.462
Confmtg  | With NNet | 0.5    | 0.515/0.427 | 0.491/0.358 | ---/0.255   | 0.536/0.461
All      | No NNet   | 0.3    | 0.821/0.787 | 0.802/0.700 |             | 0.790/0.687
All      | With NNet | 0.5    | 0.804/0.817 | 0.782/0.739 |             | 0.784/0.718

Page 15:

Future Work

• Current system does not cover detection of terms with OOVs. Possible approaches:
  – Map the unknown search terms to the known vocabulary (OGI work; gave about 2-3% improvement on BNews).
  – Use phone recognition and phone-based indexing for OOVs
  – Hybrid word+graphone recognizer outputs both words and "graphone" units that can match OOVs (Bisani & Ney 2005)
• Improve the score mapper
  – Bigger dev set needed to avoid overtraining
  – Other models (decision tree, logistic regression)
• Found some mismatch between ASR vocabulary and term lists. Apply normalization rules to fix common problems (about 0.3% relative improvement with a few simple rules)
• Tune STT systems for indexing speed