Lectures 1: Rabiner Speech Processing

ECE 259B – Fundamentals of Speech Recognition (12/28/2009)
Lecture 1: Introduction/Overview of Automatic Speech Recognition


DESCRIPTION

Rabiner digital speech processing

TRANSCRIPT

Page 1: ECE 259B – Fundamentals of Speech Recognition
Lecture 1: Introduction/Overview of Automatic Speech Recognition

Page 2: Why Digital Processing of Speech?

• digital processing of speech signals (DPSS) enjoys an extensive theoretical and experimental base developed over the past 75 years

• much research has been done since 1965 on the use of digital signal processing in speech communication problems

• highly advanced implementation technology (VLSI) exists that is well matched to the computational demands of DPSS

• there are abundant applications that are in widespread use commercially

Page 3: The Speech Stack

Fundamentals — acoustics, linguistics, pragmatics, speech perception

Speech Representations — temporal, spectral, homomorphic, LPC

Speech Algorithms — speech-silence, voiced-unvoiced, pitch, formants

Speech Applications — coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down

Page 4: Speech Recognition – 2001 (Stanley Kubrick View in 1968)

Page 5: Apple Navigator – 1988

Page 6: The Speech Advantage

• Reduce costs
– reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– access to information without a keyboard or touch-tones
• Customer retention
– provide personal services for customer preferences
– improve customer experience

Page 7: The Speech Circle

[Diagram: the speech circle. A customer voice request passes through Automatic Speech Recognition (ASR) to yield the words spoken ("I dialed a wrong number"); Spoken Language Understanding (SLU) extracts the meaning ("Billing credit"); Dialog Management (actions) and Spoken Language Generation (words), drawing on data, decide what's next ("Determine correct number"); and Text-to-Speech Synthesis (TTS) produces the voice reply to the customer ("What number did you want to call?").]

Page 8: Automatic Speech Recognition

• Goal: Accurately and efficiently convert a speech signal into a text message, independent of the device, speaker, or environment.

• Applications: Automation of complex operator-based tasks, e.g., customer care, dictation, form-filling applications, provisioning of new services, customer help lines, e-commerce, etc.

Page 9: Pattern Matching Problems

[Block diagram: speech -> A-to-D Converter -> Feature Analysis -> Pattern Matching -> symbols]

• speech recognition

• speaker recognition

• speaker verification

• word spotting

• automatic indexing of speech recordings

Page 10: Basic ASR Formulation (Bayes Method)

[Block diagram: the speaker's intention W passes through the speech production mechanisms (together, the speaker model) to give the speech signal s(n); the speech recognizer consists of an acoustic processor that converts s(n) into observations X, followed by a linguistic decoder that outputs the recognized word sequence \hat{W}.]

\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P_A(X \mid W)\, P_L(W)}{P(X)} = \arg\max_W P_A(X \mid W)\, P_L(W)

(The acoustic model P_A(X|W) is Step 1, the language model P_L(W) is Step 2, and the search for the arg max is Step 3.)

Page 11: Steps in Speech Recognition

Step 1 – Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.

Step 2 – Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.

Step 3 – Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the arg max over W (a minimal sketch follows below).

Page 12: Step 1 – The Acoustic Model

• We build the acoustic model by learning statistics of the acoustic features, X_i, from a training set, where we compute the variability of the acoustic features during the production of the sounds represented by the models.
• It is impractical to create a separate acoustic model, P_A(X|W_i), for every possible word in the language--it requires too much training data for words in every possible context.
• Instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word (a minimal sketch follows below).
• Similarly we build sentences (sequences of words) by concatenating word models.
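A minimal sketch of the concatenation idea, with made-up phone "models" represented only as lists of state names (a real acoustic model would attach HMM parameters to each state); the lexicon entry for "is" follows the example on Page 22:

```python
# Hypothetical phone models: each phone is a short sequence of HMM state names.
phone_models = {
    "ih": ["ih1", "ih2", "ih3"],
    "z":  ["z1", "z2", "z3"],
}

lexicon = {"is": ["ih", "z"]}   # word -> assumed phone sequence

def word_model(word):
    """Concatenate the phone models of a word's constituent phones."""
    return [state for phone in lexicon[word] for state in phone_models[phone]]

def sentence_model(words):
    """Concatenate word models to form a sentence-level model."""
    return [state for w in words for state in word_model(w)]

print(word_model("is"))            # ['ih1', 'ih2', 'ih3', 'z1', 'z2', 'z3']
print(sentence_model(["is", "is"]))
```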

Page 13: Step 2 – The Language Model

• the language model describes the probability of a sequence of words that form a valid sentence in the language
• a simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N words, i.e., an N-gram language model:

P_L(W) = P_L(w_1, w_2, ..., w_k) = \prod_{n=1}^{k} P_L(w_n \mid w_{n-1}, w_{n-2}, ..., w_{n-N})

where P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text.
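As a toy illustration of estimating these probabilities by counting relative frequencies, here is a bigram sketch over a tiny made-up corpus; a real language model would be trained on a large text corpus and would use smoothing and sentence-boundary handling:

```python
from collections import Counter

# Toy corpus; purely illustrative.
corpus = "yes on my credit card please yes on my calling card please".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """Relative-frequency estimate of P(word | prev) from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    """P(W) under the bigram model (no smoothing, no boundary symbols)."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(word, prev)
    return p

print(p_sentence("yes on my credit card please".split()))   # 0.5
```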

Page 14: Step 3 – The Search Problem

• the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood
• the size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods
• the use of methods from the field of Finite State Automata Theory provides Finite State Networks (FSN) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems

Page 15: Basic ASR Formulation

The basic equation of Bayes-rule-based speech recognition is

\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(W)\, P(X \mid W)}{P(X)} = \arg\max_W P(W)\, P(X \mid W)

where X = X_1, X_2, ..., X_N is the acoustic observation (feature vector) sequence, \hat{W} = w_1 w_2 ... w_M is the corresponding word sequence, P(X|W) is the acoustic model, and P(W) is the language model.

[Block diagram: s(n), W -> Speech Analysis -> X_n -> Decoder -> \hat{W}]

Page 16: Speech Recognition Process

[Block diagram: input speech passes through Feature Analysis (spectral analysis) and Pattern Classification (decoding, search), which draws on the Acoustic Model (HMM), the Word Lexicon, and the Language Model (N-gram); Utterance Verification attaches confidence scores to the output, e.g., "Hello World" (0.9) (0.8). The ASR box sits within the larger speech circle of ASR, SLU, DM, and TTS.]

Page 17: Speech Recognition Processes

• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
– text training data set => word lexicon, word grammar (language model), task grammar
– speech training data set => acoustic models
• Evaluate performance
– speech testing data set
• Training algorithm => build models from training set of text and speech
• Testing algorithm => evaluate performance from testing set of speech

Page 18: Feature Extraction

Goal: Extract robust features (information) from the speech that are relevant for ASR.

Method: Spectral analysis through either a bank-of-filters or through LPC, followed by a non-linearity and normalization (cepstrum).

Result: Signal compression, where roughly 30 cepstral features are extracted from each window of speech samples (64,000 b/s -> 5,200 b/s).

Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise, and echo. Choice of feature set for recognition--cepstral features or features from a high-dimensionality space.
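As a rough illustration of the cepstral analysis mentioned above, here is a minimal sketch for a single windowed frame (plain real cepstrum via FFT, log, and inverse FFT); it omits the pre-emphasis, mel filter banks, delta features, and normalization a production front end would include:

```python
import numpy as np

def cepstral_features(frame, n_ceps=13):
    """Minimal cepstral analysis of one speech frame:
    Hamming window -> magnitude spectrum -> log -> inverse FFT (real cepstrum)."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spectrum = np.log(spectrum + 1e-10)     # avoid log(0)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[:n_ceps]

# Example: a 20 ms frame at 8 kHz is 160 samples (random stand-in for real speech).
frame = np.random.randn(160)
print(cepstral_features(frame).shape)           # (13,)
```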


Page 19: Robustness

Problem: A mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.

Methods: Traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.

Perception Approach: Extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.

Page 20: Methods for Robust Speech Recognition

[Diagram: both the training path and the testing path run from signal to features to model; robustness methods can be applied at each stage: signal enhancement, feature normalization, and model adaptation.]
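As one concrete example of the feature-normalization stage above, here is a minimal sketch of cepstral mean subtraction, a standard way to remove stationary channel effects; it assumes cepstral features stored as a (frames x coefficients) NumPy array:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean from each cepstral coefficient,
    removing a constant channel/transducer offset. `features` is (T x D)."""
    return features - features.mean(axis=0, keepdims=True)

# Stand-in cepstra with an artificial channel offset of +5 on every coefficient.
frames = np.random.randn(200, 13) + 5.0
print(cepstral_mean_subtraction(frames).mean(axis=0).round(6))   # ~0 per coefficient
```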

Page 21: Acoustic Model

Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).

Hidden Markov Model (HMM): Statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with each phone. HMMs are also assigned for modeling extraneous events.

Advantages: Powerful statistical method for dealing with a wide range of data and reliably recognizing speech.

Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM--are there data-driven models that work better for some or all vocabularies?


Page 22: HMM for Speech

• Phone model: 'z' (/Z/) -- a three-state HMM with states z1, z2, z3
• Word model: 'is' (/IH/ /Z/) -- the concatenation of the /IH/ and /Z/ phone models, with states ih1, ih2, ih3, z1, z2, z3

Page 23: Isolated Word HMM

[Figure: a 5-state left-right HMM with self-loops a11, a22, a33, a44, a55 = 1, forward transitions a12, a23, a34, a45, skip transitions a13, a24, a35, and observation densities b1(Ot), ..., b5(Ot).]

Left-right HMM – highly constrained state sequences
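A minimal sketch of Viterbi alignment against such a model, assuming log-domain transition and observation scores; the left-right structure is imposed simply by zero probability (log of zero) on backward transitions:

```python
import numpy as np

def viterbi(log_A, log_B):
    """Viterbi alignment for an HMM with log transition matrix log_A (N x N)
    and per-frame log observation scores log_B (T x N); returns the best
    state sequence and its log score. Assumes the model starts in state 0."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)    # best score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0, 0] = log_B[0, 0]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + log_B[t, j]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

# Example: a 3-state left-right model with self-loops and forward transitions only.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
with np.errstate(divide="ignore"):              # log(0) -> -inf for forbidden moves
    log_A = np.log(A)
log_B = np.log(np.full((6, 3), 1.0 / 3.0))      # uninformative observation scores
print(viterbi(log_A, log_B))                    # best path: [0, 1, 2, 2, 2, 2]
```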

Page 24: Word Lexicon

Goal: Map legal phone sequences into words according to phonotactic rules. For example,

David: /d/ /ey/ /v/ /ih/ /d/

Multiple Pronunciations: Several words may have multiple pronunciations. For example,

Data: /d/ /ae/ /t/ /ax/
Data: /d/ /ey/ /t/ /ax/

Challenges: How do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?
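A minimal sketch of a word lexicon as a data structure, including the two pronunciations of "data" from the slide; the expansion function enumerates all phone-sequence alternatives for a word sequence:

```python
# Hypothetical word lexicon mapping words to one or more pronunciations.
lexicon = {
    "david": [["d", "ey", "v", "ih", "d"]],
    "data":  [["d", "ae", "t", "ax"], ["d", "ey", "t", "ax"]],
}

def pronunciations(words):
    """Expand a word sequence into all phone-sequence alternatives."""
    expansions = [[]]
    for w in words:
        expansions = [e + p for e in expansions for p in lexicon[w]]
    return expansions

for phones in pronunciations(["data"]):
    print(phones)        # two alternatives, one per pronunciation variant
```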


Page 25: Language Model

Goal: Map words into phrases and sentences based on task syntax.

Handcrafted: Deterministic grammars that are knowledge-based. For example,

Yes on my credit (card) please

Statistical: Compute estimates of word probabilities (N-gram model), e.g., alternative words weighted 0.6 and 0.4. For example,

Yes on my credit card please

Challenges: How do you build a language model rapidly for a new task?


Page 26: Pattern Classification

Goal: Combine information (probabilities) from the acoustic model, language model, and word lexicon to generate an "optimal" word sequence (highest probability).

Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.

Challenges: How do we build efficient structures (FSMs) for decoding and searching large-vocabulary, complex-language-model tasks?

• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network down to 10^8 states--14 orders of magnitude more efficient
• What is the theoretical limit of efficiency that can be achieved?


Page 27: Unlimited Vocabulary ASR

• The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping: features -> HMM states -> HMM units -> phones -> words -> sentences.
• For the WSJ 64,000-word vocabulary, this results in a network of 10^22 bytes!
• State-of-the-art methods including fast match, multi-pass decoding, and A* stack decoding provide tremendous speed-up at a cost of increased complexity and less portability.
• Advances in weighted finite state transducers have enabled us to represent this network in a unified mathematical framework with only 10^8 bytes!

Page 28: Weighted Finite State Transducers (WFST)

• Unified mathematical framework for ASR
• Efficiency in time and space

[Diagram: individual WFSTs mapping state:HMM, HMM:phone, phone:word, and word:phrase are combined and optimized into a single search network.]

Page 29: Weighted Finite State Transducer

Word pronunciation transducer for "data":

d:ε/1  ->  { ey:ε/0.4 | ae:ε/0.6 }  ->  { t:ε/0.2 | dx:ε/0.8 }  ->  ax:"data"/1
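To see how the arc weights combine, here is a small sketch that enumerates the paths of the "data" pronunciation transducer above and multiplies the arc weights along each path (epsilon outputs are represented as None):

```python
from itertools import product

# Arcs as (input phone, output label or None for epsilon, weight), per the figure.
first  = [("d", None, 1.0)]
second = [("ey", None, 0.4), ("ae", None, 0.6)]
third  = [("t", None, 0.2), ("dx", None, 0.8)]
fourth = [("ax", "data", 1.0)]

for path in product(first, second, third, fourth):
    phones = [arc[0] for arc in path]
    weight = 1.0
    for _, _, w in path:
        weight *= w                      # path weight = product of arc weights
    print(phones, round(weight, 2))
# Highest-weight path: ['d', 'ae', 'dx', 'ax'] with weight 0.48
```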

Page 30: Algorithmic Speed-up for Speech Recognition

[Plot: relative speed vs. year (1994-2002), comparing AT&T algorithmic improvements with Moore's Law (hardware), for the North American Business task (vocabulary: 40,000 words; branching factor: 85).]

Page 31: Utterance Verification

Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU, and DM.

Method: A confidence score based on a hypothesis test is associated with each recognized word. For example:

Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)

Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
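A minimal sketch of how such per-word confidence scores might be used, reusing the example above and an illustrative acceptance threshold (the threshold value is an assumption, not from the slide):

```python
# Recognized words with assumed confidence scores in [0, 1]
# (e.g., from a likelihood-ratio hypothesis test).
recognized = [("credit", 0.9), ("fees", 0.3)]
THRESHOLD = 0.5   # illustrative operating point

for word, confidence in recognized:
    status = "accept" if confidence >= THRESHOLD else "reject / ask user to confirm"
    print(f"{word}: {confidence:.1f} -> {status}")
```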


Page 32: Rejection

Problem: Extraneous acoustic events, noise, background speech, and out-of-domain speech deteriorate system performance.

Effect: Rejection improves the performance of the recognizer, the understanding system, and the dialogue manager.

Measure of Confidence: Associate word strings with a verification cost that provides an effective measure of confidence (utterance verification).

Page 33: State-of-the-Art Performance?

[Block diagram: input speech -> feature extraction -> pattern classification (decoding, search) -> utterance verification -> recognized sentence, drawing on the acoustic model, word lexicon, and language model; the ASR module sits alongside SLU, DM, and TTS.]

Page 34: Word Error Rates

Corpus                                     | Type                     | Vocabulary Size    | Word Error Rate
Connected Digit Strings -- TI Database     | Spontaneous              | 11 (zero-nine, oh) | 0.3%
Connected Digit Strings -- Mall Recordings | Spontaneous              | 11 (zero-nine, oh) | 2.0%
Connected Digit Strings -- HMIHY           | Conversational           | 11 (zero-nine, oh) | 5.0%
RM (Resource Management)                   | Read Speech              | 1,000              | 2.0%
ATIS (Airline Travel Information System)   | Spontaneous              | 2,500              | 2.5%
NAB (North American Business)              | Read Text                | 64,000             | 6.6%
Broadcast News                             | News Show                | 210,000            | 13-17%
Switchboard                                | Conversational Telephone | 45,000             | 25-29%
Call Home                                  | Conversational Telephone | 28,000             | 40%

Note the factor of ~17 increase in digit error rate from the cleanest (TI Database) to the most conversational (HMIHY) condition.
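For reference, word error rate is computed from the minimum edit distance between the reference and recognized word strings; here is a minimal sketch (standard Levenshtein dynamic program, with illustrative sentences):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(word_error_rate("what number did you want to call",
                      "what number you want a call"))   # 2 errors / 7 words ≈ 0.29
```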

Page 35: Lectures 1 Rabiner speech processing


Page 36: North American Business

vocabulary: 40,000 words; branching factor: 85

Page 37: Broadcast News

Page 38: Dictation Machine

Page 39: Algorithmic Accuracy for Speech Recognition

[Plot: word accuracy (roughly 30-80%) vs. year (1996-2002) for the Switchboard/Call Home task (vocabulary: 40,000 words; perplexity: 85).]

Page 40: Growth in Effective Recognition Vocabulary Size

[Plot: vocabulary size on a logarithmic scale (1 to 10,000,000 words) vs. year (1960-2010).]

Page 41: Human Speech Recognition vs. ASR

Page 42: Human Speech Recognition vs. ASR

[Plot: machine error (%) vs. human error (%) on log-log axes for several tasks (Digits, RM-LM, RM-null, NAB-mic, NAB-omni, WSJ, WSJ-22dB, SWBD), with reference lines at x1, x10, and x100; the region below the x1 line is where machines would outperform humans.]

Page 43: Voice-Enabling Services: Technology Components

[The speech circle diagram of Page 7, with the ASR component highlighted.]

Page 44: Spoken Language Understanding (SLU)

• Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
– accurate understanding can often be achieved without correctly recognizing every word
– SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: Exploit task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string; exploit 'salient' words and phrases to map high-information word sequences to the appropriate meaning (a minimal sketch follows after this list)
• Performance Evaluation: Accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: Automation of complex operator-based tasks, e.g., customer care, catalog ordering, form-filling systems, provisioning of new services, customer help lines, etc.
• Challenges: What lies beyond simple classification systems but below full natural language voice dialogue systems?
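A minimal illustration of the salient-phrase idea: spot high-information phrases in the recognized string and map them to routing intents. The phrase table and intent labels below are made up for illustration, not the actual HMIHY inventory:

```python
# Hypothetical salient phrases and the call-routing intents they map to.
SALIENT_PHRASES = {
    "wrong number": "BillingCredit",
    "account balance": "AccountBalance",
    "calling plan": "CallingPlans",
}

def interpret(recognized_text):
    """Map a recognized word string to an intent via salient-phrase spotting."""
    text = recognized_text.lower()
    for phrase, intent in SALIENT_PHRASES.items():
        if phrase in text:
            return intent
    return "Unrecognized"        # hand off to a human agent or re-prompt

print(interpret("I dialed a wrong number"))   # -> BillingCredit
```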

Page 45: Voice-Enabling Services: Technology Components

[The speech circle diagram of Page 7, repeated to highlight the component under discussion.]

Page 46: Dialog Management (DM)

• Goal: Combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
– DM makes viable complex services that require multiple exchanges between the system and the customer
– dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: Exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well-understood goal or system interaction
• Performance Evaluation: Speed and accuracy of attaining a well-defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: Customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: Is there a science of dialogues? How do you keep a dialogue efficient (turns, time, progress towards a goal) and attain goals (get answers)? Is there an art of dialogues? How does the user interface play into the art/science of dialogues? Sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak, which argues for multimodal interactions with machines.

Page 47: Customer Care IVR and HMIHY(sm)

[Diagram comparing two call flows to a reverse directory lookup:
– Customer Care IVR: sparkle tone and "Thank you for calling AT&T…", Network Menu, LEC Misdirect Announcement, Account Verification Routine, Main Menu, LD Sub-Menu, Reverse Directory Routine, with step times of 10, 30, 13, 26, 58, and 38 seconds -- total time to get to the reverse directory lookup: 2:55 minutes!
– HMIHY: sparkle tone and "AT&T, How may I help you?", Account Verification Routine, with step times of 20 and 8 seconds -- total time to get to the reverse directory lookup: 28 seconds!]

Page 48: HMIHY(sm) – How Does It Work

• Prompt is "AT&T. How may I help you?"
• User responds with totally unconstrained fluent speech
• System recognizes the words and determines the meaning of the user's speech, then routes the call
• Dialog technology enables task completion

[Diagram: the HMIHY router directs calls to destinations such as Account Balance, Calling Plans, Local service, Unrecognized Number, …]

Page 49: HMIHY Example Dialogs

• Irate Customer
• Rate Plan
• Account Balance
• Local Service
• Unrecognized Number
• Threshold Billing
• Billing Credit

Customer Satisfaction

• decreased repeat calls (37%)
• decreased 'OUTPIC' rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)

Page 50: Customer Care Scenario

Page 51: TTS – Closest to the Customer's Ear

[The speech circle diagram of Page 7, with the TTS component highlighted.]

Page 52: Speech Synthesis

[Block diagram: text -> DSP computer (applying linguistic rules) -> D-to-A converter -> speech]

Page 53: Speech Synthesis

• Synthesis of speech for effective human-machine communications
– reading email messages over a telephone
– telematics feedback in automobiles
– talking agents for completion of transactions
– call center help desks and customer care
– handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
– announcement machines that provide things like stock quotes, airline schedules, and updates of arrivals and departures of flights

Page 54: Giving Machines High Quality Voices and Faces

'Natural Speech' audio examples: U.S. English Female, U.S. English Male, Spanish Female

Page 55: Speech Synthesis Examples

• Soliloquy from Hamlet:

• Gettysburg Address:

• Third Grade Story:

Page 56: Speech Recognition Demos

Page 57: Au Clair de la Lune

Page 58: Information Kiosk

Page 59: Multimodal Language Processing

Unified Multimodal Experience: access to information through a voice interface, gesture, or both.

Multimodal Finite State: combination of speech, gesture, and meaning using finite state technology.

Example query: "Are there any cheap Italian places in this neighborhood?"

MATCH: Multimodal Access To City Help

Page 60: MiPad Demo -- Microsoft

Page 61: Voice-Enabled Services

• Desktop applications – dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, …)
• Agent technology – simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice Portals – 'convert any web page to a voice-enabled site' where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP, and others are key
• E-Contact services – call centers, customer care (HMIHY), and help desks where calls are triaged and answered appropriately using natural language voice dialogues
• Telematics – command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices – control of cellphones and PDAs from voice commands

Page 62: Milestones in Speech and Multimodal Technology Research

[Timeline, 1962-2002:]
• Small Vocabulary, Acoustic Phonetics-based: isolated words -- filter-bank analysis, time normalization, dynamic programming
• Medium Vocabulary, Template-based: isolated words, connected digits, continuous speech -- pattern recognition, LPC analysis, clustering algorithms, level building
• Large Vocabulary, Statistical-based: connected words, continuous speech -- hidden Markov models, stochastic language modeling
• Large Vocabulary; Syntax, Semantics: continuous speech, speech understanding -- stochastic language understanding, finite-state machines, statistical learning
• Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: spoken dialog, multiple modalities -- concatenative synthesis, machine learning, mixed-initiative dialog

Page 63: Future of Speech Recognition Technologies

[Timeline, 2002-2011:]
• System types: Robust Systems; Dialog Systems; Multilingual Systems and Multimodal Speech-Enabled Devices
• Capability progression: very large vocabulary, limited tasks, controlled environment -> very large vocabulary, limited tasks, arbitrary environment -> unlimited vocabulary, unlimited tasks, many languages

Page 64: Issues in Speech Recognition

Page 65: Issues in Speech Recognition

• Input speech format
– Isolated words/phrases
– Connected word sequences
– Continuous speech (essentially unconstrained)
• Recognition mode
– Speaker trained
– Speaker independent
– Amount of training material
• Speaking environment
– Quiet office
– Home
– Noisy surroundings (factory floor, cellular environments, speakerphones)
• Transducer and transmission system
– High quality microphone, close-talking/noise-cancelling microphone
– Telephone (carbon button/electret)
– Switched telephone network
– IP network (VoIP)
– Cellular network
• Vocabulary size and complexity (perplexity)
– Small (2-50 words)
– Medium (50-250 words)
– Large (250-2,000,000 words)
• Speaker characteristics
– Highly motivated, cooperative
– Casual
• Recognition task
– Syntax constrained (language model)
– Viable semantics
• Human factors
– Feedback to users
– Instructions
– Requests for repeats
– Rejections
• Tolerance for recognition errors
– Fail-soft systems
– Human intervention on errors/confusion
– Correction mechanisms built in
• System complexity
– Computation/hardware
– Real-time response capability

Page 66: Overview of Speech Recognition Processes

Page 67: Overview of Speech Recognition Processes

[Diagram: speech and its transcriptions (sampled at rate FS) are converted into features (log energy En, zero crossings Zn; temporal, spectral, cepstral, LPC features); these feed templates or models (VQ, HMM) aligned with distance/warping methods (d(X,Y), DTW); a dictionary and syntax supply the word/sound models and templates, yielding the recognized input.]

Page 68: Statistical Pattern Recognition

The basic speech recognition task may be defined as follows:

• a sequence of measurements (speech analysis frames) on the (endpoint-detected) speech signal of an utterance defines a pattern for that utterance
• this pattern is to be classified as belonging to one of several possible categories (classes) (for word/phrase recognition) or to a sequence of possible categories (for continuous speech recognition)
• the rules for this classification are formulated on the basis of a labeled set of training patterns or models

The type of measurement (temporal, spectral, cepstral, LPC features) and the classification rules (pattern alignment and distance, model alignment and probability) are the main factors that distinguish one method of speech recognition from another.

Page 69: Issues in Pattern Recognition

• A training phase is required; the more training data, the better the patterns (templates, models)
• Patterns are sensitive to the speaking environment, transmission environment, transducer (microphone), etc. (This is known as the speech robustness problem.)
• No speech-specific knowledge is required or exploited, except in the feature extraction stage
• Computational load is (more or less) linearly proportional to the number of patterns being recognized (at least for simple recognition problems, e.g., isolated word tasks)
• Pattern recognition techniques are applicable to a range of speech units, including phrases, words, and sub-word units (phonemes, syllables, dyads, etc.)
• Extensions are possible to large vocabulary, fluent speech recognition using word lexicons (dictionaries) and language models (grammars or syntax)
• Extensions are possible to natural language understanding systems

Page 70: Speech Recognition Processes

1. Fundamentals (Lectures 2-6)
– speech production (acoustic-phonetics, linguistics)
– speech perception (auditory (ear) models, neural models)
– pattern recognition (statistical, template-based)
– neural networks (classification methods)

2. Speech/Endpoint Detection (Lecture 7)
– algorithms
– speech features

Page 71: Speech Recognition Processes

3. Speech Analysis/Feature Extraction (Lectures 7-9)
– temporal parameters (log energy, zero crossings, autocorrelation)
– spectral parameters (STFT, OLA, FBS, spectrograms)
– cepstral parameters (cepstrum, Δ-cepstrum, Δ²-cepstrum)
– LPC parameters (reflection coefficients, area coefficients, LSP)

4. Distance/Distortion Measures (Lecture 10)
– temporal (quadratic, weighted)
– spectral (log spectral distance)
– cepstral (cepstral distance)
– LPC (Itakura distance)

Page 72: Speech Recognition Processes

5. Time Alignment Algorithms (Lectures 11-12)
– linear alignments
– dynamic time warping (DTW, dynamic programming)
– HMM alignments (Viterbi alignment)

6. Model Building/Training (Lectures 13-14)
– template methods
– clustering methods
– HMM methods -- Viterbi, Forward-Backward
– vector quantization (VQ) methods

Page 73: Speech Recognition Processes

7. Connected Word Modeling (Lecture 15)
– dynamic programming
– level building
– one-pass method

8. Testing/Evaluation Methods (Lecture 16)
– word/sound error rates
– dictionary of words
– task syntax
– task semantics
– task perplexity

Page 74: Speech Recognition Processes

9. Large Vocabulary Recognition (Lectures 17-18)
– phoneme models
– context-dependent models
– discrete, mixed, continuous density models
– N-gram language models
– natural language understanding
– insertions, deletions, substitutions
– other factors

Page 75: Putting It All Together

Page 76: Speech Recognition Course Topics

Page 77: What We Will Be Learning

• speech production model -- acoustics, articulatory concepts, speech production models
• speech perception model -- ear models, auditory signal processing, equivalent acoustic processing models
• signal processing approaches to speech recognition -- acoustic-phonetic methods, pattern recognition methods, statistical methods, neural network methods
• fundamentals of pattern recognition
• signal processing methods -- bank-of-filters model, short-time Fourier transforms, LPC methods, cepstral methods, perceptual linear prediction, mel cepstrum, vector quantization
• pattern recognition issues -- speech detection, distortion measures, time alignment and normalization, dynamic time warping
• speech system design issues -- source coding, template training, discriminative methods
• robustness issues -- spectral subtraction, cepstral mean subtraction, model adaptation
• Hidden Markov Model (HMM) fundamentals -- design issues
• connected word models -- dynamic programming, level building, one-pass method
• grammar networks -- finite state machine (FSM) basics
• large vocabulary speech recognition -- training, language models, perplexity, acoustic models for context-dependent sub-word units
• task-oriented designs -- natural language understanding, mixed-initiative systems, dialog management
• text-to-speech synthesis -- based on unit selection methods