Digital Speech Processing -- Lecture 19
Techniques for Speech and Natural Language Recognition and Understanding

Speech Recognition -- 2001 (Stanley Kubrick View in 1968)
Apple Navigator -- 1988

The Speech Advantage
• Reduce costs
– reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– access to information without a keyboard or touch-tones
• Customer retention
– provide personal services for customer preferences
– improve customer experience
The Speech Dialog Circle

[Diagram: the spoken dialog loop -- ASR (Automatic Speech Recognition), SLU (Spoken Language Understanding), DM (Dialog Management), SLG (Spoken Language Generation), and TTS (Text-to-Speech Synthesis), each drawing on data and rules.]
Speech in -- words spoken: “I dialed a wrong number”
SLU -- meaning: “Billing credit”
DM -- action, what’s next: “Determine correct number”
SLG/TTS -- voice reply to customer: “What number did you want to call?”
Automatic Speech Recognition
• Goal: accurately and efficiently convert a speech signal into a text message, independent of the device, speaker, or environment.
• Applications: automation of complex operator-based tasks, e.g., customer care, dictation, form filling applications, provisioning of new services, customer help lines, e-commerce, etc.
A Brief History of Speech Recognition

[Chart: speech recognition accuracy vs. time -- from heuristics (handcrafted rules, local optimization) to HMMs (mathematical formalization, global optimization, automatic learning from data). Caption joke: “Christopher C. discovers a new continent.”]
Basic ASR Formulation (Bayes Method)

[Diagram: speaker’s intention W -> speech production mechanisms -> s(n) -> acoustic processor -> X -> linguistic decoder -> Ŵ. The speaker model and the speech recognizer are matched halves of the channel.]

Ŵ = arg max_W P(W | X)
  = arg max_W [ P(X | W) P(W) / P(X) ]
  = arg max_W P_A(X | W) P_L(W)

Step 1: P_A(X | W); Step 2: P_L(W); Step 3: the arg max search over W.
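As a toy illustration of the arg max above, assume we already have acoustic and language log-scores for a few candidate transcriptions. All numbers and candidate strings below are invented for the sketch; a real recognizer searches an enormous hypothesis space rather than a small dictionary.

```python
# Hypothetical scores for three candidate word sequences:
# (log P_A(X|W), log P_L(W)) -- assumed values, not from any real recognizer.
candidates = {
    "I dialed a wrong number": (-120.0, -9.2),
    "I dialed a long number": (-118.5, -11.7),
    "eye dial Ed along numb er": (-119.0, -25.0),
}

def decode(candidates):
    """Return the word string maximizing log P_A(X|W) + log P_L(W).

    P(X) is constant over W, so it drops out of the arg max."""
    return max(candidates, key=lambda w: sum(candidates[w]))

best = decode(candidates)
```

Note that the language score can overturn a slightly better acoustic score, which is exactly why P_L(W) is in the product.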
Steps in Speech Recognition
Step 1 -- Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.
Step 2 -- Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.
Step 3 -- Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the arg max over W.
Step 1 -- The Acoustic Model
• we build the acoustic model by learning statistics of the acoustic features, X, from a training set where we compute the variability of the acoustic features during the production of the sounds represented by the models
• it is impractical to create a separate acoustic model, P_A(X|W), for every possible word in the language--it requires too much training data for words in every possible context
• instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word
• similarly we build sentences (sequences of words) by concatenating word models
Step 2 -- The Language Model
• the language model describes the probability of a sequence of words that form a valid sentence in the language
• a simple statistical method works well, based on a Markovian assumption: the probability of a word in a sentence is conditioned on only the previous N words, giving an N-gram language model
P_L(W) = P(w_1, w_2, ..., w_L)
       = ∏_{n=1}^{L} P(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N})

where P(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text
Step 3 -- The Search Problem
• the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood
• the size of the search space can be astronomically large and can take inordinate amounts of computing power to solve by heuristic methods
• the use of methods from the field of Finite State Automata Theory provides Finite State Networks (FSNs) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems
Speech Recognition Processes
1. Choose the task => sounds, word vocabulary, task syntax (grammar), task semantics
• Example: isolated digit recognition task
– sounds to be recognized -- whole words
– word vocabulary -- zero, oh, one, two, ..., nine
– task syntax -- any digit is allowed
– task semantics -- sequence of isolated digits must form a valid telephone number
Speech Recognition Processes
2. Train the models => create a method for building acoustic word models from a speech training data set, a word lexicon, a word grammar (language model), and a task grammar from a text training data set
– speech training set -- 100 people each speaking the 11 digits 10 times in isolation
– text training set -- listing of valid telephone numbers (or equivalently an algorithm that generates valid telephone numbers)
Speech Recognition Processes
3. Evaluate performance => determine word error rate and task error rate for the recognizer
– speech testing data set -- 25 new people each speaking 10 telephone numbers as sequences of isolated digits
– evaluate digit error rate, telephone number error rate
4. Testing algorithm => method for evaluating recognizer performance from the testing set of speech utterances
Speech Recognition Process

[Diagram: input speech s(n), W (“Hello World”) -> feature analysis (spectral analysis) -> X_n -> pattern classification (decoding, search) -> Ŵ with confidence scores, e.g., “Hello (0.9) World (0.8)”. The decoder draws on the acoustic model (HMM), word lexicon, language model (N-gram), and confidence scoring (utterance verification). The ASR box sits within the dialog circle of ASR, SLU, DM, SLG, and TTS.]
Feature Extraction
Goal: extract robust features (information) from the speech that are relevant for ASR.
Method: spectral analysis through either a bank of filters or through LPC, followed by a non-linearity and normalization (cepstrum).
Result: signal compression, where for each window of speech samples 30 or so cepstral features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise, and echo. Feature set for recognition -- cepstral features or those from a high-dimensionality space.
What Features to Use?
Short-time Spectral Analysis:
Acoustic features:
- mel cepstrum (LPC, filterbank, wavelets)
- formant frequencies, pitch, prosody
- zero-crossing rate, energy
Acoustic-phonetic features:
- manner of articulation (e.g., stop, nasal, voiced)
- place of articulation (labial, dental, velar)
Articulatory features:
- tongue position, jaw, lips, velum
Auditory features:
- ensemble interval histogram (EIH), synchrony
Temporal Analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
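The temporal analysis just described (velocity and acceleration via central differences) can be sketched as follows. The function name `add_deltas` and the edge-clamping behavior are illustrative choices for this sketch, not a prescribed implementation:

```python
import numpy as np

def add_deltas(cep, k=1):
    """Append delta and delta-delta features to a (frames x coeffs) cepstral
    matrix, using central differences d[t] = (c[t+k] - c[t-k]) / (2k).
    Frames at the edges are clamped by repeating the first/last frame."""
    padded = np.pad(cep, ((k, k), (0, 0)), mode="edge")
    delta = (padded[2 * k:] - padded[:-2 * k]) / (2 * k)
    pad2 = np.pad(delta, ((k, k), (0, 0)), mode="edge")
    delta2 = (pad2[2 * k:] - pad2[:-2 * k]) / (2 * k)
    return np.hstack([cep, delta, delta2])

# Toy input: 6 frames of 2 cepstral coefficients, a linear ramp.
cep = np.arange(12, dtype=float).reshape(6, 2)
feats = add_deltas(cep)  # 6 frames x (2 static + 2 delta + 2 delta-delta)
```

For the linear ramp, the interior delta values come out as the constant slope, which is a quick sanity check on the differencing.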
Feature Extraction Process

[Diagram: s(t) -> sampling and quantization -> s(n) -> preemphasis (α) -> s′(n) -> noise removal, filtering, normalization -> segmentation into frames (blocking, M, N) -> windowing W(n) -> spectral analysis -> cepstral analysis -> c_m(l) -> equalization (bias removal or normalization) -> c′_m(l) -> temporal derivatives -> delta cepstrum Δc′_m(l) and delta² cepstrum. Side outputs: pitch, formants, energy, zero-crossing rate.]
Robustness

Problem: a mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.
Methods: traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
Perception approach: extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.
Methods for Robust Speech Recognition

Training: signal -> features -> model
Testing: signal -> features -> model
The mismatch between the two chains can be attacked at each stage: signal enhancement, feature normalization, model adaptation.
Acoustic Model
Goal: map acoustic features into distinct phonetic labels (e.g., /s/, /aa/) or word labels.
Hidden Markov Model (HMM): a statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone or a word. HMMs are also assigned for modeling extraneous events.
Advantages: a powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
Challenges: understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM -- are there data-driven models that work better for some or all vocabularies?
Hidden Markov Models

[Example: a three-state ergodic HMM for a market that each day moves up, down, or stays unchanged. Each state carries an observation probability vector [P(up), P(down), P(unchanged)] -- e.g., state 1: [0.7, 0.1, 0.2], state 2: [0.1, 0.6, 0.3], state 3: [0.3, 0.3, 0.4] -- with initial state probabilities [0.5, 0.2, 0.3].]

Ergodic model -- can get to any state from any other state.
States are hidden -- we observe the effects, not the states.
HMM for speech
• phone model ‘s’: three states s1 -> s2 -> s3
• word model ‘is’ (ih-z): concatenation of the ‘ih’ model (i1 -> i2 -> i3) and the ‘z’ model (z1 -> z2 -> z3)
Isolated Word HMM

[Diagram: a 5-state left-right HMM. Self-loops a11, a22, a33, a44, a55; forward transitions a12, a23, a34, a45; skip transitions a13, a24, a35; observation densities b1(O_t), ..., b5(O_t).]
Left-right HMM -- highly constrained state sequences.
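The left-right constraint can be expressed directly as an upper-triangular transition matrix. A minimal sketch with uniform placeholder probabilities (real values would come from training, not from this uniform assignment):

```python
import numpy as np

N = 5  # number of states, as in the diagram
A = np.zeros((N, N))
for i in range(N):
    # Allowed successors: self-loop, step, and skip -- clipped at the last state.
    targets = [j for j in (i, i + 1, i + 2) if j < N]
    for j in targets:
        A[i, j] = 1.0 / len(targets)  # uniform placeholder probabilities
```

The left-right property shows up as a zero lower triangle: no transition ever returns to an earlier state, which is what constrains the state sequences so strongly.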
Basic Problems in HMMs
Given acoustic observation X and model Φ:
• Evaluation: compute P(X | Φ)
• Decoding: choose the optimal state sequence
• Re-estimation: adjust Φ to maximize P(X | Φ)
Evaluation: The Baum-Welch (Forward-Backward) Algorithm

Forward: α_t(j) = P(X_1 ... X_t, s_t = j | Φ) = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(X_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N,
with α_1(j) = π_j b_j(X_1)

Backward: β_t(i) = P(X_{t+1} ... X_T | s_t = i, Φ) = Σ_{j=1}^{N} a_ij b_j(X_{t+1}) β_{t+1}(j),  t = T-1, ..., 1, 1 ≤ i ≤ N

Posterior transition probability:
γ_t(i, j) = P(s_{t-1} = i, s_t = j | X_1^T, Φ)
          = P(s_{t-1} = i, s_t = j, X_1^T | Φ) / P(X_1^T | Φ)
          = α_{t-1}(i) a_ij b_j(X_t) β_t(j) / Σ_{k=1}^{N} α_T(k)

[Diagram: trellis of states s_1 ... s_N at times t-2 through t+1, showing α_{t-1}(i) flowing into state s_j through a_ij b_j(X_t) and β_t(j) flowing backward.]
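The forward recursion above takes only a few lines of NumPy. The two-state model and all its numbers below are made up purely to exercise the code:

```python
import numpy as np

def forward(pi, A, B):
    """Forward algorithm: alpha[t, j] = (sum_i alpha[t-1, i] * A[i, j]) * B[j, t].

    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    B: (N, T) emission likelihoods b_j(X_t), precomputed per frame.
    Returns the alpha trellis and P(X | model) = sum_j alpha[T-1, j]."""
    N, T = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, 0]            # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, t]
    return alpha, alpha[-1].sum()

# Tiny two-state, three-frame example with assumed numbers.
pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])            # left-right: no return to state 0
B = np.array([[0.9, 0.9, 0.1],
              [0.2, 0.2, 0.8]])
alpha, likelihood = forward(pi, A, B)
```

The cost is O(N²T) rather than the O(N^T) of enumerating state sequences, which is the whole point of the recursion.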
Decoding: The Viterbi Algorithm

Optimal alignment between X and S:
V_t(j) = max_i [ V_{t-1}(i) a_ij ] b_j(x_t)

[Diagram: trellis of states s_1 ... s_M against observations x_1 ... x_T; in the figure only the local paths from states j-1, j, and j+1 enter node (t, j), reflecting the left-right constraint.]
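A matching sketch of the Viterbi recursion with backtracking, on a hypothetical two-state model (all numbers assumed for illustration):

```python
import numpy as np

def viterbi(pi, A, B):
    """Viterbi decoding: V[t, j] = max_i V[t-1, i] * A[i, j] * B[j, t],
    with a backpointer table for recovering the best state sequence.
    Same (pi, A, B) conventions as the forward pass."""
    N, T = B.shape
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, 0]
    for t in range(1, T):
        scores = V[t - 1][:, None] * A     # scores[i, j] = V[t-1, i] * a_ij
        back[t] = scores.argmax(axis=0)    # best predecessor of each state j
        V[t] = scores.max(axis=0) * B[:, t]
    # Backtrack from the best final state.
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], V[-1].max()

pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4], [0.0, 1.0]])
B = np.array([[0.9, 0.9, 0.1], [0.2, 0.2, 0.8]])
path, score = viterbi(pi, A, B)
```

Replacing the sum of the forward pass with a max (plus backpointers) is the only structural difference between the two recursions.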
Reestimation: Training Algorithm
• find model parameters Φ = (A, B, π) that maximize the probability P(X | Φ) of observing some data X:

P(X | Φ) = Σ_S P(X, S | Φ) = Σ_S ∏_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(X_t)

• there is no closed-form solution, and thus we use the EM algorithm: given Φ, we compute model parameters Φ̂ that maximize the following auxiliary function Q, which decomposes into separate terms for the transition and emission parameters:

Q(Φ, Φ̂) = Σ_S [ P(X, S | Φ) / P(X | Φ) ] log P(X, S | Φ̂) = Q_i(Φ, â_ij) + Q_j(Φ, b̂_j)

• the re-estimate is guaranteed to have increased (or the same) likelihood
Training Procedure (Segmental K-Means Training)
Several iterations of the process:
[Diagram: input speech database + old (initial) HMM model -> estimate posteriors ζ_t(j,k), γ_t(i,j) -> maximize parameters a_ij, c_jk, μ_jk, Σ_jk -> updated HMM model, fed back as the new model.]
Isolated Word Recognition Using HMMs

Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:
1. for each word v in the vocabulary, we build an HMM λ_v, i.e., we re-estimate the model parameters (A, B, π) that optimize the likelihood of the training set observation vectors for the v-th word (TRAINING)
2. for each unknown word which is to be recognized, we do the following:
   a. measure the observation sequence O = [O_1 O_2 ... O_T]
   b. calculate the model likelihoods P(O | λ_v), 1 ≤ v ≤ V
   c. select the word whose model likelihood score is highest: v* = arg max_{1 ≤ v ≤ V} P(O | λ_v)
Computation on the order of V · N² · T is required; for V = 100, N = 5, T = 40, about 10^5 computations.
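The computation count is easy to check: scoring one utterance against every word model costs on the order of V · N² · T multiply-adds, since each forward pass is O(N²T) and there are V models:

```python
def recognition_cost(V, N, T):
    """Order-of-magnitude operation count for isolated word recognition:
    V word models, each scored with an O(N^2 * T) forward pass."""
    return V * N ** 2 * T

# The slide's example: 100 words, 5 states, 40 frames.
cost = recognition_cost(V=100, N=5, T=40)  # ~10^5 computations
```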
Continuous Observation Density HMMs

The most general form of pdf with a valid re-estimation procedure is a mixture density:

b_j(x) = Σ_{m=1}^{M} c_jm · N(x, μ_jm, U_jm),  1 ≤ j ≤ N

where
x = (x_1, x_2, ..., x_D) is the observation vector
M is the number of mixture densities
c_jm is the gain of the m-th mixture in state j
N is any log-concave or elliptically symmetric density (e.g., a Gaussian)
μ_jm is the mean vector for mixture m, state j
U_jm is the covariance matrix for mixture m, state j

subject to the constraints:
c_jm ≥ 0,  1 ≤ j ≤ N, 1 ≤ m ≤ M
Σ_{m=1}^{M} c_jm = 1,  1 ≤ j ≤ N
∫_{-∞}^{∞} b_j(x) dx = 1,  1 ≤ j ≤ N
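A minimal sketch of evaluating the mixture density b_j(x) for the diagonal-covariance Gaussian case; all parameter values below are assumed for illustration, not trained:

```python
import math

def gauss_diag(x, mu, var):
    """Diagonal-covariance Gaussian density N(x; mu, diag(var))."""
    expo = sum((xi - mi) ** 2 / (2 * vi) for xi, mi, vi in zip(x, mu, var))
    norm = math.prod(2 * math.pi * vi for vi in var) ** -0.5
    return norm * math.exp(-expo)

def mixture_density(x, weights, means, variances):
    """b_j(x) = sum_m c_jm * N(x; mu_jm, U_jm); mixture gains must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(c * gauss_diag(x, mu, var)
               for c, mu, var in zip(weights, means, variances))

# A two-mixture state with assumed parameters, evaluated at the origin.
b = mixture_density([0.0, 0.0],
                    weights=[0.6, 0.4],
                    means=[[0.0, 0.0], [2.0, 2.0]],
                    variances=[[1.0, 1.0], [1.0, 1.0]])
```

In practice these densities are evaluated in the log domain to avoid underflow over long utterances; the linear-domain form is shown only because it mirrors the formula.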
HMM Feature Vector Densities / Isolated Word HMM Recognizer

Choice of model parameters:
1. left-right model preferable to ergodic model (speech is a left-right process)
2. number of states in range 2-40 (from sounds to frames)
• on the order of the number of distinct sounds in the word
• on the order of the average number of observations in the word
3. observation vectors
• cepstral coefficients (and their second and third order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
• vector-quantized discrete symbols (16-256 codebook sizes)
4. constraints on the b_j(O) densities
• b_j(k) > ε for discrete densities
• c_jm > δ, U_jm(r,r) > δ for continuous densities
Performance vs. Number of States in Model
HMM Segmentation for /SIX/; Digit Recognition Using HMMs

[Figure: for an unknown utterance, the log energy, frame cumulative scores, frame likelihood scores, and state segmentation are shown for competing digit models, e.g., “one” vs. “nine”.]
[Figure: a second example comparing “seven” and “six” models on an unknown utterance -- log energy, frame cumulative scores, frame likelihood scores, and state segmentation.]
Word Lexicon
Goal: map legal phone sequences into words according to phonotactic rules. For example:
David: /d/ /ey/ /v/ /ih/ /d/
Multiple pronunciations: several words may have multiple pronunciations. For example:
Data: /d/ /ae/ /t/ /ax/
Data: /d/ /ey/ /t/ /ax/
Challenges: how do you generate a word lexicon automatically; how do you add new variant dialects and word pronunciations?
Language Model
Goal: model “acceptable” spoken phrases, constrained by task syntax.
Rule-based: deterministic grammars that are knowledge driven. For example:
flying from $city to $city on $date
Statistical: compute estimates of word probabilities (N-gram, class-based, CFG), with probabilities (e.g., 0.6/0.4) attached to competing word transitions. For example:
flying from Newark to Boston tomorrow
Challenges: how do you build a language model rapidly for a new task?
Formal Grammars

Rewrite rules:
1. S -> NP VP
2. VP -> V NP
3. VP -> AUX VP
4. NP -> ART NP1
5. NP -> ADJ NP1
6. NP1 -> ADJ NP1
7. NP1 -> N
8. NP -> NAME
9. NP -> PRON
10. NAME -> Mary
11. V -> loves
12. ADJ -> that
13. N -> person

[Parse tree for “Mary loves that person”: S -> NP (NAME: Mary) + VP (V: loves, NP (ADJ: that, NP1 (N: person))).]
N-Grams

P(W) = P(w_1, w_2, ..., w_N)
     = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_N | w_1, w_2, ..., w_{N-1})
     = ∏_{i=1}^{N} P(w_i | w_1, w_2, ..., w_{i-1})

• Trigram estimation:

P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
Example: bigrams -- probability of a word given the previous word

I want 0.02; want to 0.01; to go 0.03; go from 0.05; go to 0.01; from Boston 0.01; Boston to 0.02; to Denver 0.01; ...

Training sentences:
I want to go from Boston to Denver tomorrow
I want to go to Denver tomorrow from Boston
Tomorrow I want to go from Boston to Denver with United
A flight on United going from Boston to Denver tomorrow
A flight on United
Boston to Denver
Going to Denver
United from Boston
Boston with United tomorrow
Tomorrow a flight to Denver
Going to Denver tomorrow with United
Boston Denver with United
A flight with United tomorrow

Scoring “I want to go from Boston to Denver”:
0.02 · 0.01 · 0.03 · 0.05 · 0.01 · 0.02 · 0.02 ≈ 1.2e-12
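The worked product can be reproduced by chaining bigram probabilities over consecutive word pairs. The table below is hypothetical, filled in with the illustrative values the worked example multiplies together:

```python
# Hypothetical bigram table using the worked example's values.
bigram = {
    ("I", "want"): 0.02, ("want", "to"): 0.01, ("to", "go"): 0.03,
    ("go", "from"): 0.05, ("from", "Boston"): 0.01,
    ("Boston", "to"): 0.02, ("to", "Denver"): 0.02,
}

def sentence_prob(words, bigram):
    """Product of bigram probabilities over consecutive word pairs,
    ignoring the unigram probability of the first word (as the example does)."""
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigram.get((a, b), 0.0)  # unseen pair -> probability 0 (no back-off)
    return p

p = sentence_prob("I want to go from Boston to Denver".split(), bigram)
```

The `get(..., 0.0)` default is exactly the brittleness the next slide addresses: one unseen pair zeroes out the whole sentence.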
Generalization for N-Grams (Back-off)

If the bigram ab was never observed, we can estimate its probability as a fraction of the probability of word b following any word.

Without back-off, “I want a flight from Boston to Denver” scores
0.02 · 0.0 · 0.0 · 0.0 · 0.01 · 0.02 · 0.02 = 0.0 -- BAD!

With back-off, the unseen bigrams get small non-zero estimates:
0.02 · 0.003 · 0.001 · 0.002 · 0.01 · 0.02 · 0.02 ≈ 4.8e-15 -- GOOD
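A minimal back-off sketch: fall back to a scaled unigram probability when a bigram was never observed. The back-off weight `alpha` is an assumed constant here, not the properly normalized weight a real scheme (e.g., Katz back-off) would compute per history:

```python
def backoff_prob(a, b, bigram, unigram, alpha=0.4):
    """P(b | a): use the observed bigram when available, otherwise back off
    to alpha * P(b). alpha is an illustrative constant, not a trained weight."""
    if (a, b) in bigram:
        return bigram[(a, b)]
    return alpha * unigram.get(b, 0.0)

# Hypothetical counts-turned-probabilities for the sketch.
bigram = {("from", "Boston"): 0.01}
unigram = {"Boston": 0.005, "flight": 0.002}

p_seen = backoff_prob("from", "Boston", bigram, unigram)  # observed bigram
p_unseen = backoff_prob("a", "flight", bigram, unigram)   # backs off to unigram
```

The effect matches the slide's point: an unseen pair contributes a small positive factor instead of zeroing out the sentence.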
Pattern Classification
Goal: combine information (probabilities) from the acoustic model, language model, and word lexicon to generate an “optimal” word sequence (highest probability).
Method: the decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenges: how do we build efficient structures (FSMs) for decoding and searching large-vocabulary, complex-language-model tasks?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network to 10^8 states -- 14 orders of magnitude more efficient
• what is the theoretical limit of efficiency that can be achieved?
Unlimited Vocabulary ASR
• the basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping:
Features -> HMM states -> HMM units -> Phones -> Words -> Sentences
• for the WSJ 20,000-word vocabulary, this results in a network of 10^22 bytes!
• state-of-the-art methods, including fast match, multi-pass decoding, A* stack decoding, and finite-state transducers, all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function.
Weighted Finite State Transducers (WFST)
• a unified mathematical framework for ASR
• efficiency in time and space

[Diagram: cascade of WFSTs -- State:HMM, HMM:Phone, Phone:Word, Word:Phrase -- combined and optimized into a single search network. WFST compilation can reduce the network to 10^8 states -- 14 orders of magnitude more efficient.]

Word pronunciation transducer for “data”:
d:ε/1 -> { ey:ε/0.4 | ae:ε/0.6 } -> { dx:ε/0.8 | t:ε/0.2 } -> ax:"data"/1
Algorithmic Speed-up for Speech Recognition

[Chart: relative speed vs. year (1994-2002) on the North American Business task (vocabulary: 40,000 words; branching factor: 85), comparing AT&T algorithmic speed-up, Moore’s Law (hardware), and community results.]
Confidence Scoring
Goal: identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU, and DM.
Method: a confidence score based on a hypothesis test is associated with each recognized word. For example:
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Challenges: rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.

P(W | X) = P(W) P(X | W) / Σ_W P(W) P(X | W)
Rejection
Problem: extraneous acoustic events, noise, background speech, and out-of-domain speech deteriorate system performance.
Measure of confidence: associating word strings with a verification cost that provides an effective measure of confidence (utterance verification).
Effect: improvement in the performance of the recognizer, the understanding system, and the dialogue manager.
State-of-the-Art Performance

[Diagram: the full recognizer -- input speech through feature extraction and pattern classification (decoding, search), drawing on the acoustic model, word lexicon, and language model, plus confidence scoring -- producing the recognized sentence with confidence, embedded in the ASR/SLU/DM/SLG/TTS dialog circle.]
How to Evaluate Performance?
• Dictation applications: insertions, substitutions, and deletions

Word Error Rate = (#Subs + #Dels + #Ins) / (No. of words in the correct sentence) × 100%

• Command-and-control: false rejection and false acceptance => ROC curves.
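The error counts in the formula come from a minimum-cost alignment of the hypothesis against the reference, which is a standard Levenshtein (edit distance) computation over word sequences:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref) * 100%,
    with the error counts taken from the minimum edit distance alignment
    between the reference and hypothesis word strings."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(r)

wer = word_error_rate("credit please", "credit fees")  # one substitution
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.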
Word Error Rates

CORPUS | TYPE | VOCABULARY SIZE | WORD ERROR RATE
Connected Digit Strings (TI Database) | Spontaneous | 11 (zero-nine, oh) | 0.3%
Connected Digit Strings (Mall Recordings) | Spontaneous | 11 (zero-nine, oh) | 2.0%
Connected Digit Strings (HMIHY) | Conversational | 11 (zero-nine, oh) | 5.0%
RM (Resource Management) | Read Speech | 1,000 | 2.0%
ATIS (Airline Travel Information System) | Spontaneous | 2,500 | 2.5%
NAB (North American Business) | Read Text | 64,000 | 6.6%
Broadcast News | News Show | 210,000 | 13-17%
Switchboard | Conversational Telephone | 45,000 | 25-29%
Call Home | Conversational Telephone | 28,000 | 40%

(Note the factor-of-17 increase in digit error rate from the TI Database to HMIHY.)
NIST Benchmark Performance

[Chart: word error rate vs. year for Resource Management, ATIS, and NAB (North American Business; vocabulary: 40,000 words, branching factor: 85).]
Algorithmic Accuracy for Speech Recognition

[Chart: word accuracy (30-80%) vs. year (1996-2002) on Switchboard/Call Home (vocabulary: 40,000 words; perplexity: 85).]
Growth in Effective Recognition Vocabulary Size

[Chart: vocabulary size on a log scale (1 to 10,000,000 words) vs. year, 1960-2010.]
Human Speech Recognition vs. ASR

[Chart: machine error (%) vs. human error (%) on a log-log scale for Digits, RM-null, RM-LM, NAB-mic, NAB-omni, WSJ, WSJ-22dB, and SWBD, with reference lines at x1, x10, and x100; points below the x1 line would mean machines outperform humans, but the data sit roughly 10x-100x above human error.]
Challenges in ASR
System performance:
- accuracy
- efficiency (speed, memory)
- robustness
Operational performance:
- end-point detection
- user barge-in
- utterance rejection
- confidence scoring
Machines are 10-100 times less accurate than humans.
Large-Vocabulary ASR Demo
Courtesy of Murat Saraclar
Voice-Enabled System Technology Components

[Diagram: the dialog circle -- speech -> ASR (Automatic Speech Recognition) -> words -> SLU (Spoken Language Understanding) -> meaning -> DM (Dialog Management, with data and rules) -> action -> SLG (Spoken Language Generation) -> words -> TTS (Text-to-Speech Synthesis) -> speech.]
Spoken Language Understanding (SLU)
• Goal: interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
– accurate understanding can often be achieved without correctly recognizing every word
– SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: exploit task grammar (syntax) and task semantics to restrict the range of meaning associated with the recognized word string; exploit ‘salient’ words and phrases to map high-information word sequences to appropriate meaning
• Performance evaluation: accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: automation of complex operator-based tasks, e.g., customer care, catalog ordering, form filling systems, provisioning of new services, customer help lines, etc.
• Challenges: what goes beyond simple classification systems but below full natural-language voice dialogue systems?
Knowledge Sources for Speech Understanding

ASR draws on acoustic/phonetic, phonotactic, and syntactic knowledge; SLU adds semantic knowledge; DM adds pragmatic knowledge:
- Acoustic model -- acoustic/phonetic: relationship of speech sounds and English phonemes
- Word lexicon -- phonotactic: rules for phoneme sequences and pronunciation
- Language model -- syntactic: structure of words and phrases in a sentence
- Understanding model -- semantic: relationships and meanings among words
- Dialog manager -- pragmatic: discourse, interaction history, world knowledge
DARPA Communicator
• DARPA-sponsored research and development of mixed-initiative dialogue systems
• travel task involving airline, hotel, and car information and reservation
“Yeah I uh I would like to go from New York to Boston tomorrow night with United”
• SLU output (concept decoding):
Topic: Itinerary; Origin: New York; Destination: Boston; Day of the week: Sunday; Date: May 25th, 2002; Time: >6pm; Airline: United
• represented in an XML schema:
<itinerary>
  <origin><city></city><state></state></origin>
  <destination><city></city><state></state></destination>
  <date></date><time></time><airline></airline>
</itinerary>
Customer Care IVR and HMIHY

[Diagram comparing call flows. Conventional Customer Care IVR: sparkle tone, “Thank you for calling AT&T…” -> LEC misdirect announcement -> account verification routine -> main menu -> LD sub-menu -> reverse directory routine; total time to get to reverse directory lookup: 2:55 minutes! HMIHY: sparkle tone, “AT&T, How may I help you?” -> account verification routine; total time to get to reverse directory lookup: 28 seconds!]
HMIHY -- How Does It Work?
• prompt is “AT&T. How may I help you?”
• user responds with totally unconstrained fluent speech
• system recognizes the words and determines the meaning of the user’s speech, then routes the call (e.g., Account Balance, Calling Plans, Local, Unrecognized Number, ...)
• dialog technology enables task completion
HMIHY Example Dialogs
• Irate Customer • Rate Plan • Account Balance • Local Service • Unrecognized Number • Threshold Billing • Billing Credit

Customer Satisfaction
• decreased repeat calls (37%)
• decreased ‘OUTPIC’ rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)

Customer Care Scenario
Voice-Enabled System Technology Components

[Diagram: the same dialog circle as before -- ASR, SLU, DM, SLG, TTS -- with words, meaning, action, and synthesized speech flowing around the loop.]
Dialog Management (DM)
• Goal: combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
– DM makes viable complex services that require multiple exchanges between the system and the customer
– dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well-understood goal or system interaction
• Performance evaluation: speed and accuracy of attaining a well-defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: is there a science of dialogues -- how do you keep a dialog efficient (turns, time, progress towards a goal) and how do you attain goals (get answers)? Is there an art of dialogues? How does the user interface play into the art/science of dialogues -- sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak in multimodal interactions with machines
Dialog Strategies
• Completion (continuation): elicits missing input information from the user
• Constraining (disambiguation): reduces the scope of the request when multiple pieces of information have been retrieved
• Relaxation: increases the scope of the request when no information has been retrieved
• Confirmation: verifies that the system understood correctly; the strategy may differ depending on the SLU confidence measure
• Reprompting: used when the system expected input but did not receive any or did not understand what it received
• Context help: provides the user with help during periods of misunderstanding by either the system or the user
• Greeting/closing: maintains social protocol at the beginning and end of an interaction
• Mixed-initiative: allows users to manage the dialog
(These strategies, together with context interpretation and the backend, make up the dialog manager.)
Mixed-Initiative Dialog

Who manages the dialog? A spectrum from system initiative (“Please say collect, calling card, third number.”) to user initiative (“How may I help you?”).

• System initiative:
System: Please say just your departure city.
User: Chicago
System: Please say just your arrival city.
User: Newark
(Long dialogs, but easier to design.)
• Mixed initiative:
System: Please say your departure city.
User: I need to travel from Chicago to Newark tomorrow.
• User initiative:
System: How may I help you?
User: I need to travel from Chicago to Newark tomorrow.
(Shorter dialogs (better user experience), but more difficult to design.)
Voice-Enabled System Technology Components

[Diagram: the dialog circle again, highlighting SLG -- the meaning/action from DM is passed to SLG, which chooses the words to be synthesized, and TTS produces the spoken reply.]
Building Good Speech-Based Applications
• Good User Interfaces: make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
• Good Models of Dialog: keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
• Matching the Task to the Technology: be realistic about the capabilities of the technology
  – fail-soft methods
  – error-correcting techniques (data dips)
User Interface
• Good User Interfaces
  – make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
  – keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
  – a great UI cannot save a system with poor ASR and NLU, but the UI can make or break a system, even with excellent ASR and NLU
  – effective UI design is based on a set of elementary principles and common widgets: sequenced screen presentations; simple error-trap dialogs; a user manual
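A simple error-trap dialog combined with a fail-soft method can be sketched as an escalating sequence of prompts that ends in a hand-off to a human agent. Everything below is hypothetical (the prompt texts and the three-try limit are invented for illustration).

```python
# Hypothetical error-trap dialog: escalate prompts, then fail soft to an agent.

PROMPTS = [
    "Which city are you departing from?",
    "Sorry, I didn't catch that. Please say the name of your departure city.",
    "Let's try once more. For example, say 'Chicago'.",
]

def run_error_trap(recognizer_results):
    """Simulate the trap against a list of recognizer outputs (None = no usable input).

    Returns ('slot', value) on success, or ('agent', None) when the prompts run out."""
    for prompt, result in zip(PROMPTS, recognizer_results):
        # in a real system, `prompt` would be played to the user here
        if result is not None:
            return ("slot", result)
    return ("agent", None)   # fail soft: hand off to a human operator

print(run_error_trap([None, None, "chicago"]))   # recovered on the third try
print(run_error_trap([None, None, None]))        # escalated to an agent
```

The escalation keeps the conversation moving forward during uncertainty, and the agent hand-off is the fail-soft backstop mentioned above.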
Multimodal System Technology Components
[Diagram: the Speech Dialog Circle extended to multimodal input. Alongside speech, visual, pen, and gesture inputs feed the system; ASR, SLU, DM, SLG, and TTS convert speech to words, words to meaning, meaning to action, action to words, and words to speech, all drawing on data and rules.]
Multimodal Experience
• users need to have access to automated dialog applications at any time and anywhere, using any access device
• selection of the "most appropriate" UI mode, or combination of modes, depends on the device, the task, the environment, and the user's abilities and preferences
Microsoft MiPad
• Multimodal Interactive Pad
• usability studies show double throughput for English
• speech is mostly useful in cases with lots of alternatives
MiPad Video
AT&T MATCH (Multimodal Access To City Help)
Access to information through a voice interface, gesture, and GPS.
• Multimodal Integration: combination of speech, gesture, and meaning using finite-state technology
Example: "Are there any cheap Italian places in this neighborhood?"

MATCH Video
Courtesy of Michael Johnston
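MATCH itself integrates speech and gesture with weighted finite-state transducers over input lattices; the toy sketch below only illustrates the underlying idea of resolving a deictic phrase ("this neighborhood") against a concurrent pen gesture. All names, fields, and the time-alignment rule are invented for illustration.

```python
# Illustrative sketch (not the MATCH implementation): fill a deictic
# location slot in the speech meaning from the temporally closest gesture.

def integrate(speech_meaning, gestures):
    """Merge a speech-derived meaning with pen-gesture events.

    speech_meaning: dict with a 'time' stamp; 'location' may be '<deictic>'.
    gestures: list of dicts, each with 'area' and 'time'."""
    merged = dict(speech_meaning)
    if merged.get("location") == "<deictic>" and gestures:
        # resolve "this neighborhood" to the gesture closest in time
        g = min(gestures, key=lambda g: abs(g["time"] - merged["time"]))
        merged["location"] = g["area"]
    return merged

speech = {"request": "find", "cuisine": "italian", "price": "cheap",
          "location": "<deictic>", "time": 3.2}
gestures = [{"area": "circled-region", "time": 3.0},
            {"area": "other-region", "time": 9.0}]
print(integrate(speech, gestures))   # location resolved to 'circled-region'
```

The finite-state formulation in MATCH generalizes this idea: it composes uncertain speech and gesture hypotheses rather than single best interpretations.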
Voice-Enabled Services
• Desktop applications: dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, …)
• Agent technology: simple tasks like stock quotes, traffic reports, and weather; access to communications, e.g., voice dialing and voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice Portals: "convert any web page to a voice-enabled site," where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, and SOAP are key
• E-Contact services: call centers, customer care (HMIHY), and help desks where calls are triaged and answered appropriately using natural-language voice dialogs
• Telematics: command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices: control of cellphones and PDAs by voice commands
Milestones in Speech and Multimodal Technology Research
[Timeline, 1962–2002:]
• Isolated Words; Small Vocabulary; Acoustic-Phonetics-based: filter-bank analysis; time normalization; dynamic programming
• Connected Digits; Continuous Speech; Medium Vocabulary; Template-based: pattern recognition; LPC analysis; clustering algorithms; level building
• Connected Words; Continuous Speech; Large Vocabulary; Statistical-based: hidden Markov models; stochastic language modeling
• Continuous Speech; Speech Understanding; Large Vocabulary; Syntax and Semantics: stochastic language understanding; finite-state machines; statistical learning
• Spoken Dialog; Multiple Modalities; Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: concatenative synthesis; machine learning; mixed-initiative dialog
Future of Speech Recognition Technologies
[Timeline, 2002–2011:]
• Dialog Systems: very large vocabulary, limited tasks, controlled environment
• Robust Systems: very large vocabulary, limited tasks, arbitrary environment
• Multilingual Systems; Multimodal Speech-Enabled Devices: unlimited vocabulary, unlimited tasks, many languages
We've Come A Long Way -- Or Have We?