carlasimoes i ms ws speech tech


1/13

Acoustic Modeling
Introduction and Methodology
Fellowship in collaboration with Prof. Carlos Teixeira, FCUL

Carla Simões
[email protected]

I Microsoft Workshop on Speech Technology - Building bridges between industry and academia, May 2 2007, MLDC


2/13

Overview

Introduction
  Speech Components
  What are Acoustic Models?
  Why use them?

Methodology
  Training Acoustic Models

Modelling English Spoken by Portuguese Speakers

Conclusion and Future Work


3/13

Speech Components

[Diagram: speech components]
Corpus (speech + transcriptions)
Feature extraction (produces the feature vector)
Acoustic model training
Acoustic Models (Hidden Markov Models)
Lexicon (phonetic dictionary)
Language Pack (contains core SR and TTS engines)
  Speech Recognition Engine (SR)
  Text-to-speech Engine (TTS)
SAPI (developers' Speech API)
Grammar + Lexicon (for SR apps; grammar defines the permitted sequence of words)
Speech Applications
  Telephony (Speech Server 2007, Exchange 12)
  Mobility (Voice Command)
  Desktop (Office 12, Vista)
  Home (TV, Kitchen)


4/13

What are Acoustic Models?

They reflect the way we pronounce a certain language

Speech can be broken into phonetic segments called phones

Acoustic Models are representations of speech segments

Acoustic model training involves mapping models to acoustic examples obtained from training data


5/13

Why use them?

Basis of an automatic speech recognition (ASR) system

[Diagram: Speech Waveform -> Front End -> Sequence of observed speech vectors -> Sequence of symbols (S1 S2 S3)]

The acoustic model gives the likelihood of a given feature vector being produced by a particular phoneme
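
To make the likelihood idea concrete, here is a minimal sketch that scores one feature vector against two phoneme models. It assumes each phoneme is represented by a single diagonal-covariance Gaussian; the phoneme labels, the two-dimensional features, and all numeric values are hypothetical and chosen only for illustration. Real acoustic models typically use larger feature vectors and several states per phoneme.

```python
import math


def diag_gaussian_loglik(x, mean, var):
    """Log-likelihood of feature vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Hypothetical 2-dimensional models for two phonemes (illustrative values only).
PHONEME_MODELS = {
    "aa": {"mean": [1.0, 0.0], "var": [1.0, 1.0]},
    "iy": {"mean": [-1.0, 2.0], "var": [1.0, 1.0]},
}


def score_frame(x, models=PHONEME_MODELS):
    """Return {phoneme: log-likelihood} for a single feature vector."""
    return {
        phoneme: diag_gaussian_loglik(x, m["mean"], m["var"])
        for phoneme, m in models.items()
    }


def best_phoneme(x, models=PHONEME_MODELS):
    """Return the phoneme whose model assigns x the highest log-likelihood."""
    scores = score_frame(x, models)
    return max(scores, key=scores.get)
```

Calling `best_phoneme([0.9, 0.1])` picks the model whose mean lies closest to the vector. A recognizer would combine such per-frame scores over time rather than decide frame by frame.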


6/13

Methodology

Our acoustic models are based on Hidden Markov Models (HMMs)
Markov assumption: the probability of each state depends only on the previous state

Each HMM has 3 states
Each state represents a short segment of speech, described mathematically by Gaussian probability distributions

Acoustically similar information is shared across HMMs through shared states called senones

Cross-word triphone system
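
As a rough illustration of how a 3-state HMM assigns a probability to a sequence of frames, the sketch below runs the standard forward algorithm. The left-to-right topology, the transition values, and the per-frame emission likelihoods are assumptions made for the example; in a real system the emission likelihoods would come from the Gaussian distributions attached to each state.

```python
def forward_likelihood(trans, emit_probs, initial):
    """Forward algorithm: total probability of an observation sequence.

    trans[i][j]      -- P(state j at time t+1 | state i at time t)
    emit_probs[t][j] -- likelihood of frame t given state j
    initial[j]       -- P(state j at time 0)
    """
    n = len(initial)
    # Initialisation: start distribution weighted by the first frame.
    alpha = [initial[j] * emit_probs[0][j] for j in range(n)]
    # Recursion: each step depends only on the previous alpha (Markov assumption).
    for frame in emit_probs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * frame[j]
            for j in range(n)
        ]
    # Termination: sum over all states (no forced final state in this sketch).
    return sum(alpha)


# Hypothetical left-to-right topology: stay in a state or move to the next one.
TRANS = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
INITIAL = [1.0, 0.0, 0.0]
```

For example, `forward_likelihood(TRANS, [[1, 0, 0], [0, 1, 0], [0, 0, 1]], INITIAL)` follows the single path through states 1, 2, 3 and multiplies the two transition probabilities along it.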


7/13

Methodology

Training a cross-word triphone system for a new language
Acoustic model training involves mapping acoustic models (cross-word triphones or whole-word triphones) to their equivalent labels (transcriptions)

[Diagram: training flow]
Inputs: Corpus (speech + word-level transcriptions) + Lexicon; Phoneset + Question Set
Word-level transcriptions are converted into monophone-level transcriptions
Prototype monophone system is converted to an initial cross-word triphone system
Cross-word system is then updated to produce the final cross-word triphone system
Triphones are clustered into acoustically similar groups

A phoneset file should never contain more than 50 phones
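
The conversion of word-level transcriptions into phone-level labels can be sketched as follows. The lexicon entries, the phone symbols, the `sil` boundary label, and the `left-centre+right` triphone notation are all assumptions made for illustration; they are not taken from the training tools used in this work.

```python
# Hypothetical lexicon: word -> monophone pronunciation (illustrative symbols).
LEXICON = {
    "open": ["ow", "p", "ax", "n"],
    "file": ["f", "ay", "l"],
}


def words_to_monophones(words, lexicon=LEXICON):
    """Expand a word-level transcription into a monophone-level transcription."""
    phones = []
    for word in words:
        phones.extend(lexicon[word.lower()])
    return phones


def to_cross_word_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbours.

    Context is taken across word boundaries, so the last phone of one word
    sees the first phone of the next word as its right context.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]
```

For the utterance "open file", `to_cross_word_triphones(words_to_monophones(["open", "file"]))` yields one label per phone, including `ax-n+f`, whose right context belongs to the following word.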


8/13

Modelling English Spoken by Portuguese Speakers

Normally a speech recognizer's precision is lower for non-native users
Non-native accents are more problematic than dialects because they show more variability

Research on non-native accent modeling reveals large gains in performance when the acoustics and pronunciation of an accent are taken into account

A usage scenario: voice-controlled applications where Portuguese is the dominant language but English terms are supported with the same accuracy


9/13

Experiments are being developed concerning this problem

Corpus description
  4689 utterances covering a vocabulary of 227 words
  Files are sampled at 8 kHz, 16-bit linear
  11 male speakers

Model settings
  3468 utterances for training
  1221 utterances for testing
  43 minutes of speech
  1200 senones
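
A quick sanity check of the figures above: the sketch confirms that the training and test sets add up to the full corpus, computes the split proportions, and estimates the raw audio size. The size estimate assumes mono, uncompressed PCM, which the slide does not state.

```python
TOTAL_UTTERANCES = 4689
TRAIN_UTTERANCES = 3468
TEST_UTTERANCES = 1221
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2  # 16-bit linear samples
SPEECH_MINUTES = 43


def split_fractions(train, test):
    """Return the (train, test) shares of the combined utterance count."""
    total = train + test
    return train / total, test / total


def pcm_bytes(minutes, sample_rate_hz=SAMPLE_RATE_HZ,
              bytes_per_sample=BYTES_PER_SAMPLE, channels=1):
    """Raw PCM size in bytes; channels=1 (mono) is an assumption."""
    return minutes * 60 * sample_rate_hz * bytes_per_sample * channels


train_share, test_share = split_fractions(TRAIN_UTTERANCES, TEST_UTTERANCES)
audio_bytes = pcm_bytes(SPEECH_MINUTES)
```

Under these assumptions the split is roughly three quarters training to one quarter test, and 43 minutes of audio occupy about 41 MB.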



10/13


[Diagram: two experiment flows]
English spoken by Portuguese corpus -> Training -> New Model -> Testing (Test corpus)
English spoken by Portuguese corpus -> Update of ENU Model -> New Model -> Testing (Test corpus)


11/13


[Diagram: training with a mapped phoneset]
ENU Phoneset + PTG Phoneset -> English to Portuguese mapped phoneset
English spoken by Portuguese corpus + PTG corpus -> Training -> New Model
New Model -> Testing (Test corpus)
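
One way to picture the English to Portuguese mapped phoneset is as a lookup table applied to every pronunciation in the English lexicon. The sketch below assumes a one-to-one mapping; the phone symbols and the mapping itself are hypothetical placeholders, not the mapping used in these experiments.

```python
# Hypothetical ENU -> PTG phone map (placeholder symbols, not a phonetic claim).
ENU_TO_PTG = {"th": "t", "ih": "i", "ng": "n", "k": "k"}


def map_pronunciation(phones, phone_map):
    """Rewrite one pronunciation with the target phoneset.

    Raises KeyError if a phone has no mapping, so gaps in the table
    are noticed instead of being silently skipped.
    """
    missing = [p for p in phones if p not in phone_map]
    if missing:
        raise KeyError(f"no mapping for phones: {missing}")
    return [phone_map[p] for p in phones]


def map_lexicon(lexicon, phone_map):
    """Apply the phone map to every entry of a word -> phones lexicon."""
    return {word: map_pronunciation(phones, phone_map)
            for word, phones in lexicon.items()}
```

With a mapped lexicon, English words can be transcribed using Portuguese phone labels, which is what allows a Portuguese corpus to contribute training data for them.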


12/13

Future Work

Improving the acoustic models requires gathering hundreds of hours of speech data

The amount of data would have to be even larger when dealing with non-native speakers, because accent variability becomes too high

Possible solutions:

Define new phonesets, which implies a phonetic study of how Portuguese speakers pronounce English
Train the native models with the English spoken by Portuguese corpus


13/13

© 2007 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

Carla Simões

[email protected]

Thank you very much for your attention!


www.microsoft.com/portugal/mldc
