Recent Work on Acoustic Modeling for CTS at ISL Florian Metze, Hagen Soltau, Christian Fügen, Hua Yu Interactive Systems Laboratories Universität Karlsruhe, Carnegie Mellon University


Page 1: Recent Work on Acoustic Modeling for CTS at ISL

Recent Work on Acoustic Modeling for CTS at ISL

Florian Metze, Hagen Soltau, Christian Fügen,

Hua Yu

Interactive Systems Laboratories

Universität Karlsruhe, Carnegie Mellon University

Page 2

EARS Workshop, December 2003, St. Thomas 2

Overview

• ISL's RT-03 system revisited

– System combination of Tree-150 & Tree-6

• Richer Acoustic Modeling

– Across-phone Clustering

– Gaussian Transition Modeling

– Modalities

– Articulatory Features

Page 3

Decoding Strategy

• System Combination

– Combine Tree-150 and Tree-6; 8 ms and 10 ms outputs

– Confusion networks over multiple lattices and ROVER

– Confidences computed from combined CNs

– Best single output (Tree-150): 25.4% WER

– CNC + ROVER: 24.9%

• Results on eval03

– Tree-150 single system: 24.2%

– CNC + ROVER: 23.4%
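The slot-wise voting behind confusion-network combination can be sketched in a few lines. This is a minimal illustration, not the ISL implementation: it assumes the systems' confusion networks are already aligned slot-by-slot (the hard lattice-alignment step is hidden), and all names and weights are made up.

```python
# Sketch of confusion-network combination (CNC) voting.
# Assumes pre-aligned slots: one {word: posterior} dict per system per slot.

def combine_slots(slots, weights=None):
    """Merge the same slot from several systems by weighted posterior sum."""
    weights = weights or [1.0] * len(slots)
    merged = {}
    for w_sys, slot in zip(weights, slots):
        for word, p in slot.items():
            merged[word] = merged.get(word, 0.0) + w_sys * p
    total = sum(merged.values())
    return {word: p / total for word, p in merged.items()}

def decode(cn_per_system, weights=None):
    """Pick the highest-posterior word in each combined slot."""
    hyp = []
    for slots in zip(*cn_per_system):
        merged = combine_slots(list(slots), weights)
        best = max(merged, key=merged.get)
        if best != "<eps>":          # skip null arcs
            hyp.append(best)
    return hyp
```

The merged per-slot posterior of the winning word doubles as a confidence score, which matches the slide's note that confidences are computed from the combined CNs.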

Page 4

Vocabulary

• Vocabulary Size

– 41k vocabulary selected from SWB, BN, CNN

• Pronunciation Variants

– 95k entries generated by a rule-based approach

• Pronunciation Probabilities

– From frequencies (forced alignment of training data)

– Viterbi decoding: penalties (e.g. max = 1)

– Confusion networks: real probabilities (e.g. sum = 1)
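The two normalizations on this slide differ only in the denominator. A minimal sketch, with made-up counts standing in for forced-alignment statistics:

```python
# Turning forced-alignment counts into pronunciation scores, two ways:
# max = 1 (penalties for Viterbi decoding) vs. sum = 1 (true probabilities
# for confusion networks). The counts below are illustrative only.

def max_normalize(counts):
    """Most frequent variant gets score 1; others become penalties < 1."""
    m = max(counts.values())
    return {v: c / m for v, c in counts.items()}

def sum_normalize(counts):
    """Scores form a proper distribution over the variants of a word."""
    s = sum(counts.values())
    return {v: c / s for v, c in counts.items()}

counts = {"B EH T AXR": 30, "B EH DX AXR": 70}  # variants of BETTER
penalties = max_normalize(counts)   # flapped variant -> 1.0, other -> 30/70
probs = sum_normalize(counts)       # 0.3 / 0.7, summing to 1
```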

Page 5

Clustering

• Entropy-based Divisive Clustering

• Standard way:

– Grow a tree for each context-independent HMM state

– 50 phones × 3 states: 150 trees

• Alternative: clustering across phones

– Global tree: parameter sharing across phones

– Computationally expensive, hence 6 trees (begin, middle, end for vowels and consonants)

– Quint-phone context
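One split of divisive clustering can be sketched compactly. This is a toy illustration, not the ISL code: real systems score a split by the likelihood of single Gaussians fitted to each side, whereas here a discrete distribution over made-up context labels stands in, so plain entropy suffices.

```python
import math

# Sketch of one split in entropy-based divisive clustering: evaluate each
# yes/no question by the weighted entropy decrease it yields and keep the
# best. Node contents and questions are illustrative.

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def split_gain(node_counts, question):
    """Weighted entropy decrease when splitting the node by a question."""
    yes = {k: c for k, c in node_counts.items() if question(k)}
    no = {k: c for k, c in node_counts.items() if not question(k)}
    n, ny, nn = (sum(d.values()) for d in (node_counts, yes, no))
    if not ny or not nn:
        return 0.0
    return entropy(node_counts) - (ny / n) * entropy(yes) - (nn / n) * entropy(no)

def best_question(node_counts, questions):
    return max(questions, key=lambda q: split_gain(node_counts, questions[q]))

# Toy node: training counts keyed by the left-context phone
node = {"AX": 40, "IX": 35, "T": 15, "S": 10}
questions = {"-1=vowel?": lambda p: p in {"AX", "IX"},
             "-1=stop?": lambda p: p == "T"}
```

On this toy node, "-1=vowel?" separates the data more evenly and wins the gain comparison.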

Page 6

Motivation for Alternative Clustering

• Pronunciation modeling is important for recognizing conversational speech

• Adding pronunciation variants often gives marginal improvements due to increased confusability

• Case study: Flapping of /T/

BETTER     B EH T AXR
BETTER(2)  B EH DX AXR

The dictionary contains only a single pronunciation, and the phonetic decision tree chooses whether or not to flap /T/

Page 7

Clustering Across Phones: Tree Construction

• How to grow a single tree?

– We expand the question set to allow questions about the sub-state identity and the center phone identity. Computationally expensive on 600k SWB quint-phones.

• Two dictionaries:

– conventional dictionary with 2.2 variants per word

– (almost) single-pronunciation dictionary with 1.1 variants per word

A simple procedure is used to reduce the number of pronunciation variants. Variants with a relative frequency of <20% are removed. For unobserved words, only the baseform is kept.
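The pruning procedure just described is simple enough to sketch directly; the lexicon format and counts below are illustrative, not the actual ISL data.

```python
# Sketch of the slide's variant-pruning procedure: within each word, drop
# variants whose relative frequency (from forced alignment) is below 20%;
# for words never observed in training, keep only the baseform.

def prune_variants(lexicon, counts, threshold=0.2):
    """lexicon: word -> [variants], first entry = baseform.
    counts: (word, variant) -> forced-alignment count."""
    pruned = {}
    for word, variants in lexicon.items():
        total = sum(counts.get((word, v), 0) for v in variants)
        if total == 0:                      # unobserved word: baseform only
            pruned[word] = variants[:1]
            continue
        kept = [v for v in variants
                if counts.get((word, v), 0) / total >= threshold]
        pruned[word] = kept or variants[:1]
    return pruned

lex = {"BETTER": ["B EH T AXR", "B EH DX AXR"],
       "ZYGOTE": ["Z AY G OW T"]}
cnt = {("BETTER", "B EH T AXR"): 8, ("BETTER", "B EH DX AXR"): 92}
# 8/100 = 0.08 < 0.2, so only the flapped BETTER variant survives;
# ZYGOTE is unobserved and keeps its baseform.
```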

Page 8

Clustering Across Phones

• Allows better parameter tying (tying is now possible across phones and sub-states)

• Alleviates lexical problems (over-specification, inconsistencies): no need for an optimal phone set; preferable for multilingual / non-native speech recognition

• Implicitly models subtle reduction in sloppy speech

[Decision-tree diagram: leaves AX-b, IX-m, AX-m; questions 0=vowel?, 0=obstruent?, 0=begin-state?, -1=syllabic?, 0=mid-state?, -1=obstruent?, 0=end-state?]

Page 9

Clustering Across Phones: Experiments

• Cross-substate clustering doesn’t make any difference

• Cross-phone clustering with 6 trees: {vowel|consonant}-{b|m|e}

• Single-pronunciation lexicon has 1.1 variants per word (instead of 2.2 variants per word)

Dictionary            Clustering    WER (66h training set)   WER (180h training set)
multi-pronunciation   traditional   34.4                     33.4
multi-pronunciation   cross-phone   33.9                     -
single pronunciation  traditional   34.1                     -
single pronunciation  cross-phone   33.1                     31.6

Results are based on first pass decoding on dev01

Page 10

Analysis

• Flexible tying works better with single pronunciation lexicon: Higher consistency, data-driven approach

• Significant cross-phone sharing: ~30% of the leaf nodes are shared by multiple phones

• Commonly tied vowels: AXR & ER, AE & EH, AH & AX; commonly tied consonants: DX & HH, L & W, N & NG

[Decision-tree excerpt for Vowel-b: questions -1=voiced?, -1=consonant?, 0=high-vowel?, 1=front-vowel?, 0=high-vowel?, -1=obstruent?, 0=L | R | W?]

Page 11

Gaussian Transition Modeling

• A linear sequence of GMMs may contain a mix of different model sequences.

• To further distinguish these paths, we can model transitions between Gaussians in adjacent states.

Page 12

Frame-independence Assumption

• The HMM assumes each speech frame to be conditionally independent given the hidden state sequence

[Diagram: the HMM as a generative model — a sequence of models emitting the sequence of speech frames]

Page 13

Gaussian Transition Modeling

GTM models transition probabilities between Gaussians
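The idea can be made concrete with a tiny numerical sketch. This is an illustration under strong simplifications, not the ISL models: 1-D toy Gaussians, a hand-written transition matrix between Gaussians, and Viterbi over component sequences.

```python
import math

# Sketch of Gaussian transition modeling (GTM): instead of treating the
# mixture component at each frame as independent, keep transition
# log-probabilities log_A[i][j] between Gaussians of adjacent states and
# run a small Viterbi over component sequences.

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gtm_viterbi(frames, gaussians, log_A, log_prior):
    """gaussians: [(mean, var)]; returns the best-trajectory log-score."""
    scores = [log_prior[j] + log_gauss(frames[0], *gaussians[j])
              for j in range(len(gaussians))]
    for x in frames[1:]:
        scores = [max(scores[i] + log_A[i][j] for i in range(len(scores)))
                  + log_gauss(x, *gaussians[j])
                  for j in range(len(gaussians))]
    return max(scores)
```

With transitions favoring self-loops, a smooth frame trajectory that stays near one Gaussian scores higher than under a transition matrix that forces component swaps, which is exactly the extra discrimination between trajectories that GTM buys.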

Page 14

GTM for Modeling Sloppy Speech

• Partial reduction/realization may be better modeled at the sub-phoneme level

• GTM can be thought of as a pronunciation network at the Gaussian level

• GTM can handle a large number of trajectories

• Advantages over parallel-path HMMs / segmental HMMs:

– there, the number of paths is very limited

– it is hard to determine the right number of paths

Page 15

Experiments

• GTM can be readily trained with the Baum-Welch algorithm

• Data sufficiency is an issue, since we are modeling a first-order variable

• Pruning transitions is important (backing off)

Pruning threshold   Avg. #transitions per Gaussian   WER (%)
Baseline            14.4                             34.1
1e-5                9.7                              33.7
1e-3                6.6                              33.7
0.01                4.6                              33.6
0.05                2.7                              33.9

WERs on Switchboard (hub5e-01)
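One way to realize "pruning with backing-off" is to drop rare transitions and redistribute their mass via the target Gaussian's unconditional prior. This is one reasonable back-off scheme sketched for illustration; the slide does not specify the exact ISL scheme, and all numbers are toy values.

```python
# Sketch: prune one row of the Gaussian-transition matrix. Transitions with
# probability below the threshold are removed; the freed mass is backed off
# to the target Gaussians' unconditional prior, renormalized over the
# pruned targets so the row still sums to 1.

def prune_row(row, prior, threshold):
    """row: p(j | i) over target Gaussians j; prior: unconditional p(j)."""
    kept = {j: p for j, p in row.items() if p >= threshold}
    if len(kept) == len(row):
        return dict(row)                  # nothing pruned
    backoff_mass = 1.0 - sum(kept.values())
    prior_mass = sum(p for j, p in prior.items() if j not in kept)
    pruned = dict(kept)
    for j, p in prior.items():
        if j not in kept:
            pruned[j] = backoff_mass * p / prior_mass
    return pruned
```

Because pruned targets fall back to smoothed prior mass rather than zero, unseen trajectories remain possible, addressing the concern about limited training data.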

Page 16

Experiments II

• GTM offers better discrimination between trajectories

– All trajectories are nonetheless still allowed

– Pruning away unlikely transitions leads to a more compact and prudent model

– However, we need to be careful not to prune away unseen trajectories due to a limited training set

• Using a first-order acoustic model in decoding requires maintaining the left history, which is expensive at word boundaries; the Viterbi approximation is used in the current implementation

• Log-likelihood improvement during Baum-Welch training: -50.67 to -49.18

Page 17

Modalities

• Would like to include additional information into divisive clustering, e.g.:

– Gender

– Signal-to-noise ratio

– Speaking rate

– Speaking style (normal vs hyper-articulated)

– Dialect

– Show type, data type (CNN, NBC, ...)

• Data-driven approach: sharing still possible
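Mechanically, adding modalities just means extending the clustering question set with non-phonetic tags; where a modality question is never chosen, states stay shared across its values. A minimal sketch with illustrative contexts and tags:

```python
# Sketch: modality tags (gender, dialect, ...) enter the decision-tree
# question set alongside phonetic questions, so a split on them uses the
# same mechanism as a split on phonetic context. All contexts, tags and
# question names are illustrative.

PHONETIC = {"-1=vowel?": lambda c: c["left"] in {"AX", "IX", "EH"},
            "0=obstruent?": lambda c: c["center"] in {"T", "S", "F"}}

MODALITY = {"0=female?": lambda c: c["gender"] == "f",
            "0=bavarian?": lambda c: c["dialect"] == "bavarian"}

QUESTIONS = {**PHONETIC, **MODALITY}   # one pool, data-driven selection

def partition(contexts, question):
    """One candidate tree split over a list of training contexts."""
    yes = [c for c in contexts if question(c)]
    no = [c for c in contexts if not question(c)]
    return yes, no

ctx = [{"left": "AX", "center": "T", "gender": "f", "dialect": "bavarian"},
       {"left": "S", "center": "T", "gender": "m", "dialect": "swabian"}]
```

Note that this presupposes modality-labeled training data, matching the slide's caveat below.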

Page 18

Modalities II

• Suitable for different corpora?

• Example:

– German dialects

– Male/female

[Decision-tree example with modality questions: -1=vowel?, -1=obstruent?, 0=bavarian?, -1=syllabic?, 0=suabian?, -1=obstruent?, 0=female?]

Page 19

Modalities III

• Tested on German Verbmobil data

• Not enough time to test on SWB / RT-03

• Proved beneficial in several applications

– Labeled data needed

– Our tests were not done on highly optimized systems (VTLN)

– Hyperarticulation: -1.7% for hyper-articulated speech, +0.3% for normal speech

Page 20

Modalities Results

Page 21

Articulatory Features

• Idea: combine very specific sub-phone models with generic models

• Articulatory Features: linguistically motivated

– /F/ = UNVOICED, FRICATIVE, LAB-DNT, ...

• Introduce new degrees of freedom for

– Modeling

– Adaptation

• Integrate into existing architecture, use existing training techniques (GMMs) for feature detectors

• Articulatory (voicing) features in the front-end did not help

Page 22

Articulatory Features

• Output from Feature Detectors:

p(FEAT) - p(NON_FEAT) + p0
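The detector score above can be sketched with two toy Gaussians standing in for the trained feature/non-feature GMMs; treating p(.) as the posteriors of the two models and p0 as an additive offset is one plausible reading of the slide, not a confirmed detail of the ISL system.

```python
import math

# Sketch of an articulatory-feature detector emitting
# p(FEAT) - p(NON_FEAT) + p0 per frame. A 1-D Gaussian per class stands in
# for each trained GMM; p0 is an offset/prior term.

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def detector_score(x, feat_model, nonfeat_model, p0=0.0):
    """Posterior of FEAT minus posterior of NON_FEAT, plus offset p0."""
    lf = log_gauss(x, *feat_model)
    ln = log_gauss(x, *nonfeat_model)
    m = max(lf, ln)                              # stabilize the softmax
    zf, zn = math.exp(lf - m), math.exp(ln - m)
    p_feat = zf / (zf + zn)
    return p_feat - (1.0 - p_feat) + p0
```

The score lives in [-1, 1] (before p0): positive when the feature model wins, negative otherwise, and zero where the two models are equally likely.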

Page 23

Articulatory Features

• Asymmetric stream setup: ~4k models

– ~4k GMMs in stream 0

– 2 GMMs each in streams 1...N ("feature streams")
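The streams are typically combined log-linearly with per-stream weights; the one-liner below is a hedged sketch of that combination (weights and scores are illustrative, and the exact ISL combination rule is not given on the slide).

```python
# Sketch of scoring in the asymmetric stream setup: one main stream with
# the regular context-dependent GMM score plus N two-class feature-stream
# scores, combined log-linearly with per-stream weights.

def combined_score(main_logp, feature_logps, weights):
    """log p = w0 * main + sum_i w_i * feature_i."""
    return weights[0] * main_logp + sum(
        w * lp for w, lp in zip(weights[1:], feature_logps))
```

The stream weights are the "new degrees of freedom" from the earlier slide: they can be tuned globally, per feature, or per speaker for adaptation.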

Page 24

Articulatory Features Results I

• Test on Read Speech (BN-F0): 13.4% → 11.6% with Articulatory Features

• Test on Multilingual Data: 13.1% → 11.5% (English with ML detectors)

• Significant improvements also seen on

– Hyper-Articulated Speech

– Spontaneous, Clean Speech (ESST)

Page 25

Articulatory Features Results II

• Test on Switchboard (RT-03 devset):

System     Corr   Sub    Del    Ins   WER    S.Err
Baseline   72.5   20.0   7.5    4.4   31.9   67.2
Features   68.3   18.3   13.4   2.2   33.9   68.4

• Result: substitutions ↓, insertions ↓, but deletions ↑

• No overall improvement yet; will keep working on the setup

Page 26

Thank You, ...

the ISL team!

Page 27

Related Work

• D. Jurafsky, et al.: What kind of pronunciation variation is hard for triphones to model? ICASSP’01

• T. Hain: Implicit pronunciation modeling in ASR. ISCA Pronunciation Modeling Workshop, 2002

• M. Saraclar, et al.: Pronunciation modeling by sharing Gaussian densities across phonetic models. Computer Speech and Language, Apr. 2000

Page 28

Related Work

• R. Iyer, et al.: Hidden Markov models for trajectory modeling, ICSLP’98

• M. Ostendorf, et al.: From HMMs to segment models: A unified view of stochastic modeling for speech recognition, IEEE Trans. Speech and Audio Processing, 1996

Page 29

Publications

• F. Metze and A. Waibel: A Flexible Stream Architecture for ASR using Articulatory Features; ICSLP 2002; Denver, CO

• C. Fügen and I. Rogina: Integrating Dynamic Speech Modalities into Context Decision Trees; ICASSP 2000; Istanbul, Turkey

• H. Yu and T. Schultz: Enhanced Tree Clustering with Single Pronunciation Dictionary for Conversational Speech Recognition; Eurospeech 2003; Geneva

• H. Soltau, H. Yu, F. Metze, C. Fügen, Q. Jin, and S. Jou: The ISL transcription system for conversational telephony speech; submitted to ICASSP 2004; Vancouver

• ISL web page:

http://isl.ira.uka.de