Lecture 1: Rabiner, speech processing
DESCRIPTION: Rabiner digital speech processing
TRANSCRIPT
12/28/2009 Fundamentals of Speech Recognition-Overview of ASR
ECE 259B Fundamentals of Speech Recognition -- Lecture 1
Introduction/Overview of Automatic Speech Recognition
Why Digital Processing of Speech?
• digital processing of speech signals (DPSS) enjoys an extensive theoretical and experimental base developed over the past 75 years
• much research has been done since 1965 on the use of digital signal processing in speech communication problems
• highly advanced implementation technology (VLSI) exists that is well matched to the computational demands of DPSS
• there are abundant applications that are in widespread use commercially
The Speech Stack
Fundamentals — acoustics, linguistics, pragmatics, speech perception
Speech Representations — temporal, spectral, homomorphic, LPC
Speech Algorithms — speech-silence, voiced-unvoiced, pitch, formants
Speech Applications — coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down
Speech Recognition -- 2001 (Stanley Kubrick View in 1968)
Apple Navigator -- 1988
The Speech Advantage
• Reduce costs
– reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– access to information without a keyboard or touch-tones
• Customer retention
– provide personal services based on customer preferences
– improve customer experience
The Speech CircleThe Speech Circle
DM & SLG SLU
ASRTTS
Meaning
“Billing credit”
What’s next?
“Determine correct number”Words spoken
“I dialed a wrong number”
Voice reply to customer
“What number did you want to call?”
Customer voice request
Text-to-SpeechSynthesis
Automatic SpeechRecognition
Spoken LanguageUnderstanding
Dialog Management (Actions) and
Spoken Language
Generation (Words)
Data
Automatic Speech Recognition
• Goal: Accurately and efficiently convert a speech signal into a text message, independent of the device, speaker or the environment.
• Applications: Automation of complex operator-based tasks, e.g., customer care, dictation, form filling applications, provisioning of new services, customer help lines, e-commerce, etc.
Pattern Matching Problems
[Block diagram: speech → A-to-D Converter → Feature Analysis → Pattern Matching → symbols]
• speech recognition
• speaker recognition
• speaker verification
• word spotting
• automatic indexing of speech recordings
Basic ASR Formulation (Bayes Method)
[Block diagram: Speaker's Intention → Speech Production Mechanisms → W → s(n) → Acoustic Processor → X → Linguistic Decoder → Ŵ; the first two blocks form the speaker model, the last two the speech recognizer]

Ŵ = arg max_W P(W|X)
  = arg max_W [P(X|W) P(W)] / P(X)
  = arg max_W P_A(X|W) P_L(W)

Step 1: acoustic model P_A(X|W); Step 2: language model P_L(W); Step 3: the arg max search over W.
Steps in Speech Recognition
Step 1 - Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.
Step 2 - Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.
Step 3 - Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the arg max over W.
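The three steps combine into a single maximization. A toy sketch of that arg max, with invented probability tables (not a real recognizer's scores), shows how the language model can overrule a slightly better acoustic match:

```python
# Toy illustration of W_hat = argmax_W P_A(X|W) * P_L(W).
# The probability tables below are invented for illustration only.

def decode(acoustic_scores, language_scores):
    """Return the word sequence maximizing P_A(X|W) * P_L(W)."""
    return max(acoustic_scores,
               key=lambda w: acoustic_scores[w] * language_scores.get(w, 0.0))

# Hypothetical scores for two candidate word sequences given one utterance X
P_A = {"I dialed a wrong number": 0.20, "I styled a long cucumber": 0.25}
P_L = {"I dialed a wrong number": 0.010, "I styled a long cucumber": 0.0001}

best = decode(P_A, P_L)  # the language model overrules the acoustic match
```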
Step 1 - The Acoustic Model
• we build the acoustic models by learning statistics of the acoustic features, X, from a training set where we compute the variability of the acoustic features during the production of the sounds represented by the models
• it is impractical to create a separate acoustic model, P_A(X|W), for every possible word in the language -- it requires too much training data for words in every possible context
• instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word
• similarly we build sentences (sequences of words) by concatenating word models
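The concatenation idea can be sketched directly. The phone set, state names and lexicon entry below are illustrative placeholders, not a full phone inventory:

```python
# Sketch: build word models by concatenating phone models, and sentence
# models by concatenating word models. "States" here are just labels;
# the phone set and lexicon entries are illustrative, not a full system.

PHONE_MODELS = {
    "ih": ["ih1", "ih2", "ih3"],   # a 3-state model per phone
    "z":  ["z1", "z2", "z3"],
}

LEXICON = {"is": ["ih", "z"]}      # word -> phone sequence

def word_model(word):
    """Concatenate the phone models of a word's constituent phones."""
    return [s for phone in LEXICON[word] for s in PHONE_MODELS[phone]]

def sentence_model(words):
    """Concatenate word models to form a sentence model."""
    return [s for w in words for s in word_model(w)]
```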
Step 2 - The Language Model
• the language model describes the probability of a sequence of words that form a valid sentence in the language
• a simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N words -- an N-gram language model
P_L(W) = P_L(w_1, w_2, ..., w_L)
       = ∏_{n=1}^{L} P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N})

where P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text
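The counting estimate can be sketched for a bigram model (conditioning on one previous word); the toy corpus below stands in for the large text corpus the slide assumes:

```python
# Estimating a bigram language model by counting relative frequencies of
# word pairs; the tiny corpus is a toy stand-in for a large text corpus.
from collections import Counter

corpus = "yes on my credit card please yes on my account please".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """P_L(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]
```

For example, "on" always follows "yes" in this corpus, while "my" continues to "credit" or "account" with equal relative frequency.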
Step 3 - The Search Problem
• the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood
• the size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods
• the use of methods from the field of Finite State Automata Theory provides Finite State Networks (FSN) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems
Basic ASR Formulation
The basic equation of Bayes rule-based speech recognition is

Ŵ = arg max_W P(W|X)
  = arg max_W [P(W) P(X|W)] / P(X)
  = arg max_W P(W) P(X|W)

where X = X_1, X_2, ..., X_N is the acoustic observation (feature vector) sequence, Ŵ = w_1 w_2 ... w_M is the corresponding word sequence, P(X|W) is the acoustic model and P(W) is the language model.

[Block diagram: s(n), W → Speech Analysis → X_n → Decoder → Ŵ]
Speech Recognition Process
[Block diagram: Input Speech → Feature Analysis (Spectral Analysis) → Pattern Classification (Decoding, Search) → Utterance Verification (Confidence Scores) → "Hello World" (0.9) (0.8); the Acoustic Model (HMM), Language Model (N-gram) and Word Lexicon feed the pattern classifier; ASR sits alongside SLU, DM and TTS]
Speech Recognition Processes
• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
– text training data set => word lexicon, word grammar (language model), task grammar
– speech training data set => acoustic models
• Evaluate performance
– speech testing data set
• Training algorithm => build models from training set of text and speech
• Testing algorithm => evaluate performance from testing set of speech
Feature Extraction
Goal: Extract robust features (information) from the speech that are relevant for ASR.
Method: Spectral analysis through either a bank-of-filters or through LPC, followed by a non-linearity and normalization (cepstrum).
Result: Signal compression, where for each window of speech samples roughly 30 cepstral features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo. Feature set for recognition -- cepstral features or those from a high dimensionality space.
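The windowing step of the front end can be sketched as follows; the frame length, hop size and synthetic signal are illustrative choices, and a real front end would go on to compute roughly 30 cepstral features per frame rather than a single log energy:

```python
# Sketch of the front end's framing step: slice the waveform into
# overlapping analysis windows and compute a per-frame log energy.
# Frame sizes and the synthetic signal are illustrative only.
import math

def frames(signal, frame_len, hop):
    """Split a sample list into overlapping frames of length frame_len."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def log_energy(frame):
    """Log of the frame energy (small floor avoids log of zero)."""
    return math.log(sum(s * s for s in frame) + 1e-10)

# 400-sample synthetic sinusoid, 80-sample frames, 40-sample hop
signal = [math.sin(2 * math.pi * 0.01 * n) for n in range(400)]
feats = [log_energy(f) for f in frames(signal, 80, 40)]
```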
Robustness
Problem: A mismatch in the speech signal between the training phase and testing phase can result in performance degradation.
Methods: Traditional techniques for improving system robustness are based on signal enhancement, feature normalization and/or model adaptation.
Perception Approach: Extract fundamental acoustic information in narrow bands of speech. Robust integration of features across time and frequency.
Methods for Robust Speech Recognition
[Diagram: the mismatch between Training and Testing can be addressed at three points -- Signal (Enhancement), Features (Normalization), Model (Adaptation)]
Acoustic Model
Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
Hidden Markov Model (HMM): Statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with each phone. HMMs are also assigned for modeling extraneous events.
Advantages: Powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM -- are there data-driven models that work better for some or all vocabularies?
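The acoustic likelihood an HMM supplies, P(X|W), is computed with the forward algorithm. A minimal discrete-observation sketch, with an invented two-state model (real systems use continuous densities over cepstral features):

```python
# Minimal forward-algorithm sketch for computing P(X | HMM).
# The tiny 2-state parameters below are invented for illustration.

def forward(obs, pi, A, B):
    """Return P(obs) under an HMM with initial probs pi, transition
    matrix A, and discrete emission probs B[state][symbol]."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

pi = [1.0, 0.0]
A = [[0.6, 0.4], [0.0, 1.0]]                       # left-to-right transitions
B = [{"lo": 0.8, "hi": 0.2}, {"lo": 0.3, "hi": 0.7}]

likelihood = forward(["lo", "hi"], pi, A, B)
```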
HMM for Speech
• Phone model: 'z' (/Z/) -- a three-state HMM (z1 → z2 → z3)
• Word model: 'is' (/IH/ /Z/) -- phone models concatenated (ih1 ih2 ih3 → z1 z2 z3)
Isolated Word HMM
[Diagram: five-state left-right HMM with states 1-5; self-loops a11, a22, a33, a44, a55 = 1; step transitions a12, a23, a34, a45; skip transitions a13, a24, a35; observation densities b1(Ot) ... b5(Ot)]
Left-right HMM -- highly constrained state sequences
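The left-right transition structure (self-loop, step and skip transitions, with an absorbing final state) can be sketched as a matrix builder; the uniform probabilities are placeholders for trained values:

```python
# Sketch of the slide's left-right structure: each state allows a
# self-loop a(i,i), a step a(i,i+1) and a skip a(i,i+2); the final
# state is absorbing (a55 = 1). Uniform probabilities are placeholders.

def left_right_transitions(n_states, max_jump=2):
    """Build an upper-triangular left-right transition matrix."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        allowed = list(range(i, min(i + max_jump, n_states - 1) + 1))
        for j in allowed:
            A[i][j] = 1.0 / len(allowed)   # placeholder uniform weights
    return A

A = left_right_transitions(5)
```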
Word Lexicon
Goal: Map legal phone sequences into words according to phonotactic rules. For example,
David: /d/ /ey/ /v/ /ih/ /d/
Multiple Pronunciations: Several words may have multiple pronunciations. For example,
Data: /d/ /ae/ /t/ /ax/
Data: /d/ /ey/ /t/ /ax/
Challenges: How do you generate a word lexicon automatically; how do you add new variant dialects and word pronunciations?
Language Model
Goal: Map words into phrases and sentences based on task syntax.
Handcrafted: Deterministic grammars that are knowledge-based. For example,
Yes on my credit (card) please
Statistical: Compute estimates of word probabilities (N-gram model), e.g. arc probabilities 0.6/0.4 on
Yes on my credit card please
Challenges: How do you build a language model rapidly for a new task?
Pattern Classification
Goal: Combine information (probabilities) from the acoustic model, language model and word lexicon to generate an "optimal" word sequence (highest probability).
Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenges: How do we build efficient structures (FSMs) for decoding and searching large vocabulary, complex language model tasks?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network to 10^8 states -- 14 orders of magnitude more efficient
What is the theoretical limit of efficiency that can be achieved?
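The Viterbi algorithm the decoder uses finds the single best state path rather than summing over all paths. A minimal sketch with an invented two-state discrete HMM:

```python
# Minimal Viterbi search over an HMM: pick the single best state path.
# The 2-state model and observations are invented for illustration.

def viterbi(obs, pi, A, B):
    """Return (best_path, best_prob) for a discrete-observation HMM."""
    n = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(n)]
    back = []
    for o in obs[1:]:
        prev, new = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: delta[i] * A[i][j])
            prev.append(i_best)
            new.append(delta[i_best] * A[i_best][j] * B[j][o])
        back.append(prev)
        delta = new
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for prev in reversed(back):           # backtrace best predecessors
        path.append(prev[path[-1]])
    return path[::-1], delta[last]

pi = [0.7, 0.3]
A = [[0.8, 0.2], [0.1, 0.9]]
B = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
path, prob = viterbi(["a", "b", "b"], pi, A, B)
```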
Unlimited Vocabulary ASR
• The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping: Features → HMM states → HMM units → Phones → Words → Sentences
• For the WSJ 64,000-word vocabulary, this results in a network of 10^22 bytes!
• State-of-the-art methods including fast match, multi-pass decoding and A* stack search provide tremendous speed-up at a cost of increased complexity and less portability.
• Advances in weighted finite state transducers have enabled us to represent this network in a unified mathematical framework with only 10^8 bytes!
Weighted Finite State Transducers (WFST)
• Unified mathematical framework for ASR
• Efficiency in time and space
[Diagram: component WFSTs -- State:HMM, HMM:Phone, Phone:Word, Word:Phrase -- undergo Combination and Optimization to produce a single Search Network]
Weighted Finite State Transducer
Word pronunciation transducer for "data":
d:ε/1 → { ey:ε/0.4 | ae:ε/0.6 } → { t:ε/0.2 | dx:ε/0.8 } → ax:"data"/1
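Assuming the ey/ae and t/dx arc choices combine freely (an assumption about the slide's figure), the transducer admits four weighted pronunciation paths whose arc weights multiply along each path and sum to 1:

```python
# Sketch of the "data" pronunciation transducer as weighted paths.
# Arc weights follow the slide; the free combination of the ey/ae and
# t/dx choices into four paths is an assumption about the figure.

ARCS = [  # (phone sequence, arc weights along that path)
    (["d", "ey", "t", "ax"],  [1.0, 0.4, 0.2, 1.0]),
    (["d", "ey", "dx", "ax"], [1.0, 0.4, 0.8, 1.0]),
    (["d", "ae", "t", "ax"],  [1.0, 0.6, 0.2, 1.0]),
    (["d", "ae", "dx", "ax"], [1.0, 0.6, 0.8, 1.0]),
]

def path_weight(weights):
    """Multiply arc weights along one path."""
    p = 1.0
    for w in weights:
        p *= w
    return p

probs = {tuple(ph): path_weight(ws) for ph, ws in ARCS}
```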
Algorithmic Speed-up for Speech Recognition
[Chart: relative speed vs. year, 1994-2002, comparing AT&T algorithmic speed-ups (community and AT&T results) against Moore's Law hardware gains; task: North American Business, vocabulary 40,000 words, branching factor 85]
Utterance Verification
Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis test is associated with each recognized word. For example:
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
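Thresholding the per-word confidence scores gives a simple error flagger; the 0.5 threshold is an illustrative choice, not a calibrated operating point:

```python
# Sketch of using per-word confidence scores to flag likely recognition
# errors, as in the "credit fees (0.3)" example above. The threshold
# value is an illustrative choice, not a calibrated one.

def flag_low_confidence(words, scores, threshold=0.5):
    """Return the words whose confidence falls below the threshold."""
    return [w for w, s in zip(words, scores) if s < threshold]

recognized = ["credit", "fees"]
confidence = [0.9, 0.3]
suspect = flag_low_confidence(recognized, confidence)
```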
Rejection
Problem: Extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.
Effect: Improvement in the performance of the recognizer, understanding system and dialogue manager.
Measure of Confidence: Associating word strings with a verification cost provides an effective measure of confidence (Utterance Verification).
State-of-the-Art Performance?
[Block diagram: Input Speech → Feature Extraction → Pattern Classification (Decoding, Search) → Utterance Verification → Recognized Sentence, with the Acoustic Model, Language Model and Word Lexicon feeding the classifier; ASR alongside SLU, DM and TTS]
Word Error Rates

CORPUS | TYPE | VOCABULARY SIZE | WORD ERROR RATE
Connected Digit Strings (TI Database) | Spontaneous | 11 (zero-nine, oh) | 0.3%
Connected Digit Strings (Mall Recordings) | Spontaneous | 11 (zero-nine, oh) | 2.0%
Connected Digit Strings (HMIHY) | Conversational | 11 (zero-nine, oh) | 5.0%
RM (Resource Management) | Read Speech | 1,000 | 2.0%
ATIS (Airline Travel Information System) | Spontaneous | 2,500 | 2.5%
NAB (North American Business) | Read Text | 64,000 | 6.6%
Broadcast News | News Show | 210,000 | 13-17%
Switchboard | Conversational Telephone | 45,000 | 25-29%
Call Home | Conversational Telephone | 28,000 | 40%

Factor of 17 increase in digit error rate (0.3% → 5.0%) from the TI database to HMIHY conversational speech.
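Word error rate, as reported in tables like this one, is computed from a Levenshtein alignment of the hypothesis against the reference transcript:

```python
# Word error rate: (substitutions + deletions + insertions) / reference
# length, computed with a standard Levenshtein (edit distance) alignment.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

rate = wer("i dialed a wrong number", "i styled a long number")
```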
North American Business
vocabulary: 40,000 words; branching factor: 85
Broadcast News
Dictation Machine
Algorithmic Accuracy for Speech Recognition
[Chart: word accuracy (30-80%) vs. year, 1996-2002; task: Switchboard/Call Home, vocabulary 40,000 words, perplexity 85]
Growth in Effective Recognition Vocabulary Size
[Chart: vocabulary size (log scale, 1 to 10,000,000) vs. year, 1960-2010]
Human Speech Recognition vs ASR
Human Speech Recognition vs ASR
[Log-log scatter plot: machine error (%) vs. human error (%) for the Digits, RM-LM, RM-null, NAB-mic, NAB-omni, WSJ, WSJ-22dB and SWBD tasks, with x1, x10 and x100 diagonals; machine error rates sit roughly one to two orders of magnitude above human error rates, far from the "machines outperform humans" region]
Voice-Enabling Services: Technology Components
[Speech-circle diagram: customer voice request → Automatic Speech Recognition (words spoken: "I dialed a wrong number") → Spoken Language Understanding (meaning: "Billing credit") → Dialog Management (Actions) and Spoken Language Generation (Words) (what's next: "Determine correct number") → Text-to-Speech Synthesis (voice reply to customer: "What number did you want to call?")]
Spoken Language Understanding (SLU)
• Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
– accurate understanding can often be achieved without correctly recognizing every word
– SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: Exploit task grammar (syntax) and task semantics to restrict the range of meaning associated with the recognized word string; exploit 'salient' words and phrases to map high-information word sequences to appropriate meaning
• Performance Evaluation: Accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: Automation of complex operator-based tasks, e.g., customer care, catalog ordering, form filling systems, provisioning of new services, customer help lines, etc.
• Challenges: What goes beyond simple classification systems but below full Natural Language voice dialogue systems?
Voice-Enabling Services: Technology Components
[Speech-circle diagram repeated: customer voice request → ASR → SLU → DM & SLG → TTS → voice reply to customer]
Dialog Management (DM)
• Goal: Combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
– DM makes viable complex services that require multiple exchanges between the system and the customer
– dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: Exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well understood goal or system interaction
• Performance Evaluation: Speed and accuracy of attaining a well defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: Customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: Is there a science of dialogues -- how do you keep a dialog efficient (turns, time, progress towards a goal), and how do you attain goals (get answers)? Is there an art of dialogues? How does the User Interface play into the art/science of dialogues -- sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak, which argues for multimodal interactions with machines
Customer Care IVR and HMIHY
[Flow comparison diagram]
• Customer Care IVR: Sparkle Tone, "Thank you for calling AT&T…" → Network Menu → LEC Misdirect Announcement → Account Verification Routine → Main Menu → LD Sub-Menu → Reverse Directory Routine (individual steps of 10, 13, 26, 30, 38 and 58 seconds). Total time to get to Reverse Directory Lookup: 2:55 minutes!!!
• HMIHY: Sparkle Tone, "AT&T, How may I help you?" (20 seconds) → Account Verification Routine (8 seconds). Total time to get to Reverse Directory Lookup: 28 seconds!!!
HMIHY -- How Does It Work
• Prompt is "AT&T. How may I help you?"
• User responds with totally unconstrained fluent speech
• System recognizes the words and determines the meaning of the user's speech, then routes the call
• Dialog technology enables task completion
[Diagram: HMIHY routes to Account Balance, Calling Plans, Local, Unrecognized Number, . . .]
HMIHY Example Dialogs
• Irate Customer
• Rate Plan
• Account Balance
• Local Service
• Unrecognized Number
• Threshold Billing
• Billing Credit

Customer Satisfaction
• decreased repeat calls (37%)
• decreased 'OUTPIC' rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)
Customer Care Scenario
TTS -- Closest to the Customer's Ear
[Speech-circle diagram repeated: customer voice request → ASR → SLU → DM & SLG → TTS → voice reply to customer]
Speech Synthesis
[Block diagram: text → DSP computer applying linguistic rules → D-to-A Converter → speech]
Speech Synthesis
• Synthesis of speech for effective human-machine communications
– reading email messages over a telephone
– telematics feedback in automobiles
– talking agents for completion of transactions
– call center help desks and customer care
– handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
– announcement machines that provide things like stock quotes, airline schedules, updates of arrivals and departures of flights
Giving Machines High Quality Voices and Faces
• U.S. English Female
• U.S. English Male
• Spanish Female
• 'Natural Speech'
Speech Synthesis Examples
• Soliloquy from Hamlet
• Gettysburg Address
• Third Grade Story
Speech Recognition Demos
Au Clair de la Lune
Information Kiosk
Multimodal Language Processing
• Unified Multimodal Experience: access to information through a voice interface, gesture, or both.
• Multimodal Finite State: combination of speech, gesture and meaning using finite state technology.
Example query: "Are there any cheap Italian places in this neighborhood?"
MATCH: Multimodal Access To City Help
MiPad Demo -- Microsoft
Voice-Enabled Services
• Desktop applications -- dictation, command and control of desktop, control of document properties (fonts, styles, bullets, …)
• Agent technology -- simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice Portals -- 'convert any web page to a voice-enabled site' where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
• E-Contact services -- Call Centers, Customer Care (HMIHY) and Help Desks where calls are triaged and answered appropriately using natural language voice dialogues
• Telematics -- command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices -- control of cellphones, PDAs from voice commands
Milestones in Speech and Multimodal Technology Research
[Timeline, 1962-2002]
• Small Vocabulary, Acoustic Phonetics-based: isolated words; filter-bank analysis, time-normalization, dynamic programming
• Medium Vocabulary, Template-based: isolated words, connected digits, continuous speech; pattern recognition, LPC analysis, clustering algorithms, level building
• Large Vocabulary, Statistical-based: connected words, continuous speech; hidden Markov models, stochastic language modeling
• Large Vocabulary; Syntax, Semantics: continuous speech, speech understanding; stochastic language understanding, finite-state machines, statistical learning
• Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: spoken dialog, multiple modalities; concatenative synthesis, machine learning, mixed-initiative dialog
Future of Speech Recognition Technologies
[Timeline, 2002-2011]
• Robust Systems: very large vocabulary, limited tasks, controlled environment
• Dialog Systems: very large vocabulary, limited tasks, arbitrary environment
• Multilingual Systems; Multimodal Speech-Enabled Devices: unlimited vocabulary, unlimited tasks, many languages
Issues in Speech Recognition
Issues in Speech Recognition
• Input speech format
– Isolated words/phrases
– Connected word sequences
– Continuous speech (essentially unconstrained)
• Recognition mode
– Speaker trained
– Speaker independent
– Amount of training material
• Speaking environment
– Quiet office
– Home
– Noisy surroundings (factory floor, cellular environments, speakerphones)
• Transducer and transmission system
– High quality microphone, close talking/noise cancelling microphone
– Telephone (carbon button/electret)
– Switched telephone network
– IP network (VoIP)
– Cellular network
• Vocabulary size and complexity (perplexity)
– Small (2-50 words)
– Medium (50-250 words)
– Large (250-2,000,000 words)
• Speaker characteristics
– Highly motivated, cooperative
– Casual
• Recognition task
– Syntax constrained (language model)
– Viable semantics
• Human factors
– Feedback to users
– Instructions
– Requests for repeats
– Rejections
• Tolerance for recognition errors
– Fail soft systems
– Human intervention on errors/confusion
– Correction mechanisms built in
• System complexity
– Computation/hardware
– Real-time response capability
Overview of Speech Recognition Processes
[Block diagram of the recognition process: speech (sampled at rate FS) and transcriptions → feature analysis (log energy log En, zero crossings Zn; temporal, spectral, cepstral, LPC features) → pattern matching against templates (VQ, DTW with distance d(X,Y)) or models (HMM), using word/sound models and templates plus a dictionary and syntax → recognized input]
Statistical Pattern RecognitionStatistical Pattern RecognitionThe basic speech recognition task may be defined as follows:
• a sequence of measurements (speech analysis frames) on the (endpoint detected) speech signal of an utterance defines a pattern for that utterance• this pattern is to be classified as belonging to one of several possible categories (classes) (for word/phrase recognition) or to a sequence of possible categories (for continuous speech recognition)• the rules for this classification are formulated on the basis of a labeled set of training patterns or models
The type of measurement (temporal, spectral, cepstral, LPC features) and the classification rules (pattern alignment and distance, model alignment and probability) are the main factors that distinguish one method of speech recognition from another
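The classification rule described above can be sketched in a few lines. This is an illustrative minimal classifier, not the lecture's formulation: function names, the frame-averaged Euclidean distance, and the toy two-word training set are all invented for the sketch, and time alignment (covered in later lectures) is ignored by assuming equal-length patterns.

```python
# Minimal sketch of the pattern-recognition framework: classify an
# utterance's feature pattern by finding the nearest labeled training
# pattern under a simple distance measure.

def frame_distance(x, y):
    """Euclidean distance between two feature frames (lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def pattern_distance(px, py):
    """Average frame distance between two equal-length patterns
    (real systems insert a time-alignment step here)."""
    pairs = list(zip(px, py))
    return sum(frame_distance(x, y) for x, y in pairs) / len(pairs)

def classify(pattern, training_set):
    """Return the label of the nearest training pattern."""
    return min(training_set, key=lambda t: pattern_distance(pattern, t[1]))[0]

# training_set: list of (label, pattern); a pattern is a list of frames
train = [("yes", [[1.0, 0.0], [1.0, 0.1]]),
         ("no",  [[0.0, 1.0], [0.1, 1.0]])]
print(classify([[0.9, 0.1], [1.0, 0.0]], train))  # → yes
```

Swapping the distance function or the reference patterns is exactly how one recognition method is distinguished from another, as the slide notes.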
Issues in Pattern Recognition
• Training phase is required; the more training data, the better the patterns (templates, models)
• Patterns are sensitive to the speaking environment, transmission environment, transducer (microphone), etc. (this problem is known as the speech robustness problem)
• No speech-specific knowledge is required or exploited, except in the feature extraction stage
• Computational load is (more or less) linearly proportional to the number of patterns being recognized (at least for simple recognition problems, e.g., isolated word tasks)
• Pattern recognition techniques are applicable to a range of speech units, including phrases, words, and sub-word units (phonemes, syllables, dyads, etc.)
• Extensions possible to large vocabulary, fluent speech recognition using word lexicons (dictionaries) and language models (grammars or syntax)
• Extensions possible to natural language understanding systems
Speech Recognition Processes
1. Fundamentals (Lectures 2-6)
– speech production (acoustic-phonetics, linguistics)
– speech perception (auditory (ear) models, neural models)
– pattern recognition (statistical, template-based)
– neural networks (classification methods)
2. Speech/Endpoint Detection (Lecture 7)
– algorithms
– speech features
Speech Recognition Processes
3. Speech Analysis/Feature Extraction (Lectures 7-9)
– temporal parameters (log energy, zero crossings, autocorrelation)
– spectral parameters (STFT, OLA, FBS, spectrograms)
– cepstral parameters (cepstrum, Δ-cepstrum, Δ²-cepstrum)
– LPC parameters (reflection coefs, area coefs, LSP)
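Two of the temporal parameters listed above, short-time log energy and zero-crossing count, are easy to illustrate. The frame length, frame shift, and test signal below are arbitrary choices for the sketch, not values from the course.

```python
# Frame-by-frame short-time log energy and zero-crossing count.
import math

def frames(signal, frame_len, shift):
    """Yield successive analysis frames of the signal."""
    for start in range(0, len(signal) - frame_len + 1, shift):
        yield signal[start:start + frame_len]

def log_energy(frame, floor=1e-10):
    """10*log10 of the frame energy, floored to avoid log of zero."""
    return 10.0 * math.log10(max(sum(s * s for s in frame), floor))

def zero_crossings(frame):
    """Number of sign changes between adjacent samples in the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

# Toy signal: 50 Hz sinusoid at an assumed 8 kHz sampling rate
sig = [math.sin(2 * math.pi * 50 * n / 8000) for n in range(800)]
for f in frames(sig, frame_len=160, shift=160):
    print(round(log_energy(f), 1), zero_crossings(f))
```

Voiced speech shows high log energy and a low zero-crossing count; unvoiced (fricative) speech shows the opposite, which is why these two features appear together in endpoint and voiced/unvoiced detection.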
4. Distance/Distortion Measures (Lecture 10)
– temporal (quadratic, weighted)
– spectral (log spectral distance)
– cepstral (cepstral distance)
– LPC (Itakura distance)
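Of the measures listed in item 4, the cepstral distance is the simplest to sketch: a (truncated) Euclidean distance between two cepstral coefficient vectors, which approximates an RMS log spectral distance. The coefficient values below are made up for illustration.

```python
# Truncated cepstral distance between two cepstral coefficient vectors.
import math

def cepstral_distance(c1, c2):
    """Euclidean distance over however many coefficients both vectors share."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

c_ref  = [1.2, -0.4, 0.10, 0.05]   # reference frame's cepstrum (invented)
c_test = [1.0, -0.3, 0.20, 0.00]   # test frame's cepstrum (invented)
print(cepstral_distance(c_ref, c_test))
```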
Speech Recognition Processes
5. Time Alignment Algorithms (Lectures 11-12)
– linear alignments
– dynamic time warping (DTW, dynamic programming)
– HMM alignments (Viterbi alignment)
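The dynamic time warping of item 5 can be sketched compactly: align two feature sequences of possibly different lengths by dynamic programming over a cumulative-distance grid. The local moves used here are the simple (i-1,j), (i,j-1), (i-1,j-1) steps; practical recognizers add slope constraints and path weights, which this sketch omits.

```python
# Compact DTW: cumulative-distance table filled row by row;
# D[n][m] is the total distortion of the best warping path.

def dtw(seq_x, seq_y, dist):
    n, m = len(seq_x), len(seq_y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_x[i - 1], seq_y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch seq_x
                                 D[i][j - 1],      # stretch seq_y
                                 D[i - 1][j - 1])  # diagonal step
    return D[n][m]

# 1-D "features" for illustration; the warp absorbs the timing difference
d = dtw([1, 2, 3, 3, 2], [1, 2, 2, 3, 2], lambda a, b: abs(a - b))
print(d)  # → 0.0
```

In isolated-word recognition, `dist` would be one of the frame distortion measures from Lecture 10, and the template with the smallest `dtw` score wins.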
6. Model Building/Training (Lectures 13-14)
– template methods
– clustering methods
– HMM methods—Viterbi, Forward-Backward
– vector quantization (VQ) methods
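The vector quantization training of item 6 can be sketched as a few iterations of the generalized Lloyd (k-means) procedure: assign each training vector to its nearest codeword, then move each codeword to the centroid of its cell. Scalar "features", the initial codebook, and the data below are invented for illustration.

```python
# Toy VQ codebook training by iterated nearest-codeword assignment
# and centroid update (generalized Lloyd algorithm on scalars).

def train_codebook(data, codebook, iterations=10):
    for _ in range(iterations):
        # assignment step: partition data into nearest-codeword cells
        cells = {i: [] for i in range(len(codebook))}
        for x in data:
            i = min(range(len(codebook)), key=lambda k: abs(x - codebook[k]))
            cells[i].append(x)
        # update step: move each codeword to its cell centroid
        codebook = [sum(c) / len(c) if c else codebook[i]
                    for i, c in cells.items()]
    return codebook

data = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0]         # two obvious clusters
print(sorted(train_codebook(data, [0.0, 6.0])))  # → [1.0, 5.0]
```

In a real recognizer the scalars become cepstral or LPC vectors and the absolute difference becomes one of the Lecture 10 distortion measures; the structure of the loop is unchanged.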
Speech Recognition Processes
7. Connected word modeling (Lecture 15)
– dynamic programming
– level building
– one pass method
8. Testing/Evaluation Methods (Lecture 16)
– word/sound error rates
– dictionary of words
– task syntax
– task semantics
– task perplexity
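The word error rate of item 8 is conventionally computed by minimum-edit-distance alignment of the hypothesis against the reference, counting substitutions, insertions, and deletions. This is the standard Levenshtein recurrence as a sketch; the example sentences are invented.

```python
# Word error rate: (substitutions + insertions + deletions) / reference length,
# with the edit counts taken from the minimum-cost alignment.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i          # all deletions
    for j in range(m + 1):
        D[0][j] = j          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            D[i][j] = min(sub,                # substitution / match
                          D[i - 1][j] + 1,    # deletion
                          D[i][j - 1] + 1)    # insertion
    return D[n][m] / n

print(word_error_rate("show me the weather", "show the weather map"))  # → 0.5
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error *rate* rather than an accuracy.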
Speech Recognition Processes
9. Large Vocabulary Recognition (Lectures 17-18)
– phoneme models
– context dependent models
– discrete, mixed, continuous density models
– N-gram language models
– natural language understanding
– insertions, deletions, substitutions
– other factors
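The N-gram language models of item 9 can be sketched as a bigram model with add-one smoothing, evaluated by the perplexity measure that also appears in the testing lecture. The two-sentence corpus and function names are invented for illustration; real systems use far larger corpora and better smoothing.

```python
# Toy bigram language model with add-one smoothing, plus perplexity.
import math

def bigram_model(sentences):
    vocab = {w for s in sentences for w in s} | {"<s>", "</s>"}
    counts, context = {}, {}
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
            context[a] = context.get(a, 0) + 1
    V = len(vocab)
    def prob(a, b):
        """Add-one smoothed P(b | a)."""
        return (counts.get((a, b), 0) + 1) / (context.get(a, 0) + V)
    return prob

def perplexity(prob, sentence):
    """2 ** (average negative log2 probability per predicted token)."""
    toks = ["<s>"] + sentence + ["</s>"]
    logp = sum(math.log2(prob(a, b)) for a, b in zip(toks, toks[1:]))
    return 2 ** (-logp / (len(toks) - 1))

p = bigram_model([["call", "home"], ["call", "the", "office"]])
print(perplexity(p, ["call", "home"]))
```

Lower perplexity means the language model constrains the recognizer's search more tightly, which is exactly why perplexity is used as a measure of task difficulty.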
Putting It All Together
Speech Recognition Course Topics
What We Will Be Learning
• speech production model—acoustics, articulatory concepts, speech production models
• speech perception model—ear models, auditory signal processing, equivalent acoustic processing models
• signal processing approaches to speech recognition—acoustic-phonetic methods, pattern recognition methods, statistical methods, neural network methods
• fundamentals of pattern recognition
• signal processing methods—bank-of-filters model, short-time Fourier transforms, LPC methods, cepstral methods, perceptual linear prediction, mel cepstrum, vector quantization
• pattern recognition issues—speech detection, distortion measures, time alignment and normalization, dynamic time warping
• speech system design issues—source coding, template training, discriminative methods
• robustness issues—spectral subtraction, cepstral mean subtraction, model adaptation
• Hidden Markov Model (HMM) fundamentals—design issues
• connected word models—dynamic programming, level building, one pass method
• grammar networks—finite state machine (FSM) basics
• large vocabulary speech recognition—training, language models, perplexity, acoustic models for context dependent sub-word units
• task-oriented designs—natural language understanding, mixed initiative systems, dialog management
• text-to-speech synthesis—based on unit selection methods