Digital Speech Processing -- Lecture 19
Techniques for Speech and Natural Language Recognition and Understanding

Speech Recognition -- 2001 (Stanley Kubrick View in 1968)
Apple Navigator -- 1988

The Speech Advantage
• Reduce costs
– reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– access to information without a keyboard or touch-tones
• Customer retention
– provide personal services for customer preferences
– improve customer experience
The Speech Dialog Circle

[Diagram: the spoken dialog loop -- ASR (Automatic Speech Recognition), SLU (Spoken Language Understanding), DM (Dialog Management), SLG (Spoken Language Generation), and TTS (Text-to-Speech Synthesis), each drawing on data and rules.]
Speech in -- words spoken: “I dialed a wrong number”
SLU -- meaning: “Billing credit”
DM -- action, what’s next: “Determine correct number”
SLG/TTS -- voice reply to customer: “What number did you want to call?”
Automatic Speech Recognition
• Goal: accurately and efficiently convert a speech signal into a text message, independent of the device, speaker, or environment.
• Applications: automation of complex operator-based tasks, e.g., customer care, dictation, form filling applications, provisioning of new services, customer help lines, e-commerce, etc.
A Brief History of Speech Recognition

[Chart: speech recognition accuracy vs. time -- from heuristics (handcrafted rules, local optimization) to HMMs (mathematical formalization, global optimization, automatic learning from data). Caption joke: “Christopher C. discovers a new continent.”]
Basic ASR Formulation (Bayes Method)

[Diagram: speaker’s intention W -> speech production mechanisms -> s(n) -> acoustic processor -> X -> linguistic decoder -> Ŵ. The speaker model and the speech recognizer are matched halves of the channel.]

Ŵ = arg max_W P(W | X)
  = arg max_W [ P(X | W) P(W) / P(X) ]
  = arg max_W P_A(X | W) P_L(W)

Step 1: P_A(X | W); Step 2: P_L(W); Step 3: the arg max search over W.
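As a toy illustration of the arg max above, assume we already have acoustic and language log-scores for a few candidate transcriptions. All numbers and candidate strings below are invented for the sketch; a real recognizer searches an enormous hypothesis space rather than a small dictionary.

```python
# Hypothetical scores for three candidate word sequences:
# (log P_A(X|W), log P_L(W)) -- assumed values, not from any real recognizer.
candidates = {
    "I dialed a wrong number": (-120.0, -9.2),
    "I dialed a long number": (-118.5, -11.7),
    "eye dial Ed along numb er": (-119.0, -25.0),
}

def decode(candidates):
    """Return the word string maximizing log P_A(X|W) + log P_L(W).

    P(X) is constant over W, so it drops out of the arg max."""
    return max(candidates, key=lambda w: sum(candidates[w]))

best = decode(candidates)
```

Note that the language score can overturn a slightly better acoustic score, which is exactly why P_L(W) is in the product.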
Steps in Speech Recognition
Step 1 -- Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.
Step 2 -- Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.
Step 3 -- Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the arg max over W.
Step 1 -- The Acoustic Model
• we build the acoustic model by learning statistics of the acoustic features, X, from a training set where we compute the variability of the acoustic features during the production of the sounds represented by the models
• it is impractical to create a separate acoustic model, P_A(X|W), for every possible word in the language--it requires too much training data for words in every possible context
• instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word
• similarly we build sentences (sequences of words) by concatenating word models
Step 2 -- The Language Model
• the language model describes the probability of a sequence of words that form a valid sentence in the language
• a simple statistical method works well, based on a Markovian assumption: the probability of a word in a sentence is conditioned on only the previous N words, giving an N-gram language model
P_L(W) = P(w_1, w_2, ..., w_L)
       = ∏_{n=1}^{L} P(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N})

where P(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text
Step 3 -- The Search Problem
• the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood
• the size of the search space can be astronomically large and can take inordinate amounts of computing power to solve by heuristic methods
• the use of methods from the field of Finite State Automata Theory provides Finite State Networks (FSNs) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems
Speech Recognition Processes
1. Choose the task => sounds, word vocabulary, task syntax (grammar), task semantics
• Example: isolated digit recognition task
– sounds to be recognized -- whole words
– word vocabulary -- zero, oh, one, two, ..., nine
– task syntax -- any digit is allowed
– task semantics -- sequence of isolated digits must form a valid telephone number
Speech Recognition Processes
2. Train the models => create a method for building acoustic word models from a speech training data set, a word lexicon, a word grammar (language model), and a task grammar from a text training data set
– speech training set -- 100 people each speaking the 11 digits 10 times in isolation
– text training set -- listing of valid telephone numbers (or equivalently an algorithm that generates valid telephone numbers)
Speech Recognition Processes
3. Evaluate performance => determine word error rate and task error rate for the recognizer
– speech testing data set -- 25 new people each speaking 10 telephone numbers as sequences of isolated digits
– evaluate digit error rate, telephone number error rate
4. Testing algorithm => method for evaluating recognizer performance from the testing set of speech utterances
Speech Recognition Process

[Diagram: input speech s(n), W (“Hello World”) -> feature analysis (spectral analysis) -> X_n -> pattern classification (decoding, search) -> Ŵ with confidence scores, e.g., “Hello (0.9) World (0.8)”. The decoder draws on the acoustic model (HMM), word lexicon, language model (N-gram), and confidence scoring (utterance verification). The ASR box sits within the dialog circle of ASR, SLU, DM, SLG, and TTS.]
Feature Extraction
Goal: extract robust features (information) from the speech that are relevant for ASR.
Method: spectral analysis through either a bank of filters or through LPC, followed by a non-linearity and normalization (cepstrum).
Result: signal compression, where for each window of speech samples 30 or so cepstral features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise, and echo. Feature set for recognition -- cepstral features or those from a high-dimensionality space.
What Features to Use?
Short-time Spectral Analysis:
Acoustic features:
- mel cepstrum (LPC, filterbank, wavelets)
- formant frequencies, pitch, prosody
- zero-crossing rate, energy
Acoustic-phonetic features:
- manner of articulation (e.g., stop, nasal, voiced)
- place of articulation (labial, dental, velar)
Articulatory features:
- tongue position, jaw, lips, velum
Auditory features:
- ensemble interval histogram (EIH), synchrony
Temporal Analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
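The temporal analysis just described (velocity and acceleration via central differences) can be sketched as follows. The function name `add_deltas` and the edge-clamping behavior are illustrative choices for this sketch, not a prescribed implementation:

```python
import numpy as np

def add_deltas(cep, k=1):
    """Append delta and delta-delta features to a (frames x coeffs) cepstral
    matrix, using central differences d[t] = (c[t+k] - c[t-k]) / (2k).
    Frames at the edges are clamped by repeating the first/last frame."""
    padded = np.pad(cep, ((k, k), (0, 0)), mode="edge")
    delta = (padded[2 * k:] - padded[:-2 * k]) / (2 * k)
    pad2 = np.pad(delta, ((k, k), (0, 0)), mode="edge")
    delta2 = (pad2[2 * k:] - pad2[:-2 * k]) / (2 * k)
    return np.hstack([cep, delta, delta2])

# Toy input: 6 frames of 2 cepstral coefficients, a linear ramp.
cep = np.arange(12, dtype=float).reshape(6, 2)
feats = add_deltas(cep)  # 6 frames x (2 static + 2 delta + 2 delta-delta)
```

For the linear ramp, the interior delta values come out as the constant slope, which is a quick sanity check on the differencing.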
Feature Extraction Process

[Diagram: s(t) -> sampling and quantization -> s(n) -> preemphasis (α) -> s′(n) -> noise removal, filtering, normalization -> segmentation into frames (blocking, M, N) -> windowing W(n) -> spectral analysis -> cepstral analysis -> c_m(l) -> equalization (bias removal or normalization) -> c′_m(l) -> temporal derivatives -> delta cepstrum Δc′_m(l) and delta² cepstrum. Side outputs: pitch, formants, energy, zero-crossing rate.]
Robustness

Problem: a mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.
Methods: traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
Perception approach: extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.
Methods for Robust Speech Recognition

Training: signal -> features -> model
Testing: signal -> features -> model
The mismatch between the two chains can be attacked at each stage: signal enhancement, feature normalization, model adaptation.
Acoustic Model
Goal: map acoustic features into distinct phonetic labels (e.g., /s/, /aa/) or word labels.
Hidden Markov Model (HMM): a statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone or a word. HMMs are also assigned for modeling extraneous events.
Advantages: a powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
Challenges: understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM -- are there data-driven models that work better for some or all vocabularies?
Hidden Markov Models

[Example: a three-state ergodic HMM for a market that each day moves up, down, or stays unchanged. Each state carries an observation probability vector [P(up), P(down), P(unchanged)] -- e.g., state 1: [0.7, 0.1, 0.2], state 2: [0.1, 0.6, 0.3], state 3: [0.3, 0.3, 0.4] -- with initial state probabilities [0.5, 0.2, 0.3].]

Ergodic model -- can get to any state from any other state.
States are hidden -- we observe the effects, not the states.
HMM for speech
• phone model ‘s’: three states s1 -> s2 -> s3
• word model ‘is’ (ih-z): concatenation of the ‘ih’ model (i1 -> i2 -> i3) and the ‘z’ model (z1 -> z2 -> z3)
Isolated Word HMM

[Diagram: a 5-state left-right HMM. Self-loops a11, a22, a33, a44, a55; forward transitions a12, a23, a34, a45; skip transitions a13, a24, a35; observation densities b1(O_t), ..., b5(O_t).]
Left-right HMM -- highly constrained state sequences.
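The left-right constraint can be expressed directly as an upper-triangular transition matrix. A minimal sketch with uniform placeholder probabilities (real values would come from training, not from this uniform assignment):

```python
import numpy as np

N = 5  # number of states, as in the diagram
A = np.zeros((N, N))
for i in range(N):
    # Allowed successors: self-loop, step, and skip -- clipped at the last state.
    targets = [j for j in (i, i + 1, i + 2) if j < N]
    for j in targets:
        A[i, j] = 1.0 / len(targets)  # uniform placeholder probabilities
```

The left-right property shows up as a zero lower triangle: no transition ever returns to an earlier state, which is what constrains the state sequences so strongly.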
Basic Problems in HMMs
Given acoustic observation X and model Φ:
• Evaluation: compute P(X | Φ)
• Decoding: choose the optimal state sequence
• Re-estimation: adjust Φ to maximize P(X | Φ)
Evaluation: The Baum-Welch (Forward-Backward) Algorithm

Forward: α_t(j) = P(X_1 ... X_t, s_t = j | Φ) = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(X_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N,
with α_1(j) = π_j b_j(X_1)

Backward: β_t(i) = P(X_{t+1} ... X_T | s_t = i, Φ) = Σ_{j=1}^{N} a_ij b_j(X_{t+1}) β_{t+1}(j),  t = T-1, ..., 1, 1 ≤ i ≤ N

Posterior transition probability:
γ_t(i, j) = P(s_{t-1} = i, s_t = j | X_1^T, Φ)
          = P(s_{t-1} = i, s_t = j, X_1^T | Φ) / P(X_1^T | Φ)
          = α_{t-1}(i) a_ij b_j(X_t) β_t(j) / Σ_{k=1}^{N} α_T(k)

[Diagram: trellis of states s_1 ... s_N at times t-2 through t+1, showing α_{t-1}(i) flowing into state s_j through a_ij b_j(X_t) and β_t(j) flowing backward.]
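The forward recursion above takes only a few lines of NumPy. The two-state model and all its numbers below are made up purely to exercise the code:

```python
import numpy as np

def forward(pi, A, B):
    """Forward algorithm: alpha[t, j] = (sum_i alpha[t-1, i] * A[i, j]) * B[j, t].

    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    B: (N, T) emission likelihoods b_j(X_t), precomputed per frame.
    Returns the alpha trellis and P(X | model) = sum_j alpha[T-1, j]."""
    N, T = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, 0]            # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, t]
    return alpha, alpha[-1].sum()

# Tiny two-state, three-frame example with assumed numbers.
pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])            # left-right: no return to state 0
B = np.array([[0.9, 0.9, 0.1],
              [0.2, 0.2, 0.8]])
alpha, likelihood = forward(pi, A, B)
```

The cost is O(N²T) rather than the O(N^T) of enumerating state sequences, which is the whole point of the recursion.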
Decoding: The Viterbi Algorithm

Optimal alignment between X and S:
V_t(j) = max_i [ V_{t-1}(i) a_ij ] b_j(x_t)

[Diagram: trellis of states s_1 ... s_M against observations x_1 ... x_T; in the figure only the local paths from states j-1, j, and j+1 enter node (t, j), reflecting the left-right constraint.]
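A matching sketch of the Viterbi recursion with backtracking, on a hypothetical two-state model (all numbers assumed for illustration):

```python
import numpy as np

def viterbi(pi, A, B):
    """Viterbi decoding: V[t, j] = max_i V[t-1, i] * A[i, j] * B[j, t],
    with a backpointer table for recovering the best state sequence.
    Same (pi, A, B) conventions as the forward pass."""
    N, T = B.shape
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, 0]
    for t in range(1, T):
        scores = V[t - 1][:, None] * A     # scores[i, j] = V[t-1, i] * a_ij
        back[t] = scores.argmax(axis=0)    # best predecessor of each state j
        V[t] = scores.max(axis=0) * B[:, t]
    # Backtrack from the best final state.
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], V[-1].max()

pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4], [0.0, 1.0]])
B = np.array([[0.9, 0.9, 0.1], [0.2, 0.2, 0.8]])
path, score = viterbi(pi, A, B)
```

Replacing the sum of the forward pass with a max (plus backpointers) is the only structural difference between the two recursions.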
Reestimation: Training Algorithm
• find model parameters Φ = (A, B, π) that maximize the probability P(X | Φ) of observing some data X:

P(X | Φ) = Σ_S P(X, S | Φ) = Σ_S ∏_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(X_t)

• there is no closed-form solution, and thus we use the EM algorithm: given Φ, we compute model parameters Φ̂ that maximize the following auxiliary function Q, which decomposes into separate terms for the transition and emission parameters:

Q(Φ, Φ̂) = Σ_S [ P(X, S | Φ) / P(X | Φ) ] log P(X, S | Φ̂) = Q_i(Φ, â_ij) + Q_j(Φ, b̂_j)

• the re-estimate is guaranteed to have increased (or the same) likelihood
Training Procedure (Segmental K-Means Training)
Several iterations of the process:
[Diagram: input speech database + old (initial) HMM model -> estimate posteriors ζ_t(j,k), γ_t(i,j) -> maximize parameters a_ij, c_jk, μ_jk, Σ_jk -> updated HMM model, fed back as the new model.]
Isolated Word Recognition Using HMMs

Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:
1. for each word v in the vocabulary, we build an HMM λ_v, i.e., we re-estimate the model parameters (A, B, π) that optimize the likelihood of the training set observation vectors for the v-th word (TRAINING)
2. for each unknown word which is to be recognized, we do the following:
   a. measure the observation sequence O = [O_1 O_2 ... O_T]
   b. calculate the model likelihoods P(O | λ_v), 1 ≤ v ≤ V
   c. select the word whose model likelihood score is highest: v* = arg max_{1 ≤ v ≤ V} P(O | λ_v)
Computation on the order of V · N² · T is required; for V = 100, N = 5, T = 40, about 10^5 computations.
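The computation count is easy to check: scoring one utterance against every word model costs on the order of V · N² · T multiply-adds, since each forward pass is O(N²T) and there are V models:

```python
def recognition_cost(V, N, T):
    """Order-of-magnitude operation count for isolated word recognition:
    V word models, each scored with an O(N^2 * T) forward pass."""
    return V * N ** 2 * T

# The slide's example: 100 words, 5 states, 40 frames.
cost = recognition_cost(V=100, N=5, T=40)  # ~10^5 computations
```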
Continuous Observation Density HMMs

The most general form of pdf with a valid re-estimation procedure is a mixture density:

b_j(x) = Σ_{m=1}^{M} c_jm · N(x, μ_jm, U_jm),  1 ≤ j ≤ N

where
x = (x_1, x_2, ..., x_D) is the observation vector
M is the number of mixture densities
c_jm is the gain of the m-th mixture in state j
N is any log-concave or elliptically symmetric density (e.g., a Gaussian)
μ_jm is the mean vector for mixture m, state j
U_jm is the covariance matrix for mixture m, state j

subject to the constraints:
c_jm ≥ 0,  1 ≤ j ≤ N, 1 ≤ m ≤ M
Σ_{m=1}^{M} c_jm = 1,  1 ≤ j ≤ N
∫_{-∞}^{∞} b_j(x) dx = 1,  1 ≤ j ≤ N
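A minimal sketch of evaluating the mixture density b_j(x) for the diagonal-covariance Gaussian case; all parameter values below are assumed for illustration, not trained:

```python
import math

def gauss_diag(x, mu, var):
    """Diagonal-covariance Gaussian density N(x; mu, diag(var))."""
    expo = sum((xi - mi) ** 2 / (2 * vi) for xi, mi, vi in zip(x, mu, var))
    norm = math.prod(2 * math.pi * vi for vi in var) ** -0.5
    return norm * math.exp(-expo)

def mixture_density(x, weights, means, variances):
    """b_j(x) = sum_m c_jm * N(x; mu_jm, U_jm); mixture gains must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(c * gauss_diag(x, mu, var)
               for c, mu, var in zip(weights, means, variances))

# A two-mixture state with assumed parameters, evaluated at the origin.
b = mixture_density([0.0, 0.0],
                    weights=[0.6, 0.4],
                    means=[[0.0, 0.0], [2.0, 2.0]],
                    variances=[[1.0, 1.0], [1.0, 1.0]])
```

In practice these densities are evaluated in the log domain to avoid underflow over long utterances; the linear-domain form is shown only because it mirrors the formula.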
HMM Feature Vector Densities / Isolated Word HMM Recognizer

Choice of model parameters:
1. left-right model preferable to ergodic model (speech is a left-right process)
2. number of states in range 2-40 (from sounds to frames)
• on the order of the number of distinct sounds in the word
• on the order of the average number of observations in the word
3. observation vectors
• cepstral coefficients (and their second and third order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
• vector-quantized discrete symbols (16-256 codebook sizes)
4. constraints on the b_j(O) densities
• b_j(k) > ε for discrete densities
• c_jm > δ, U_jm(r,r) > δ for continuous densities
Performance vs. Number of States in Model
HMM Segmentation for /SIX/; Digit Recognition Using HMMs

[Figure: for an unknown utterance, the log energy, frame cumulative scores, frame likelihood scores, and state segmentation are shown for competing digit models, e.g., “one” vs. “nine”.]
[Figure: a second example comparing “seven” and “six” models on an unknown utterance -- log energy, frame cumulative scores, frame likelihood scores, and state segmentation.]
Word Lexicon
Goal: map legal phone sequences into words according to phonotactic rules. For example:
David: /d/ /ey/ /v/ /ih/ /d/
Multiple pronunciations: several words may have multiple pronunciations. For example:
Data: /d/ /ae/ /t/ /ax/
Data: /d/ /ey/ /t/ /ax/
Challenges: how do you generate a word lexicon automatically; how do you add new variant dialects and word pronunciations?
Language Model
Goal: model “acceptable” spoken phrases, constrained by task syntax.
Rule-based: deterministic grammars that are knowledge driven. For example:
flying from $city to $city on $date
Statistical: compute estimates of word probabilities (N-gram, class-based, CFG), with probabilities (e.g., 0.6/0.4) attached to competing word transitions. For example:
flying from Newark to Boston tomorrow
Challenges: how do you build a language model rapidly for a new task?
Formal Grammars

Rewrite rules:
1. S -> NP VP
2. VP -> V NP
3. VP -> AUX VP
4. NP -> ART NP1
5. NP -> ADJ NP1
6. NP1 -> ADJ NP1
7. NP1 -> N
8. NP -> NAME
9. NP -> PRON
10. NAME -> Mary
11. V -> loves
12. ADJ -> that
13. N -> person

[Parse tree for “Mary loves that person”: S -> NP (NAME: Mary) + VP (V: loves, NP (ADJ: that, NP1 (N: person))).]
N-Grams

P(W) = P(w_1, w_2, ..., w_N)
     = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_N | w_1, w_2, ..., w_{N-1})
     = ∏_{i=1}^{N} P(w_i | w_1, w_2, ..., w_{i-1})

• Trigram estimation:

P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
Example: bigrams -- probability of a word given the previous word

I want 0.02; want to 0.01; to go 0.03; go from 0.05; go to 0.01; from Boston 0.01; Boston to 0.02; to Denver 0.01; ...

Training sentences:
I want to go from Boston to Denver tomorrow
I want to go to Denver tomorrow from Boston
Tomorrow I want to go from Boston to Denver with United
A flight on United going from Boston to Denver tomorrow
A flight on United
Boston to Denver
Going to Denver
United from Boston
Boston with United tomorrow
Tomorrow a flight to Denver
Going to Denver tomorrow with United
Boston Denver with United
A flight with United tomorrow

Scoring “I want to go from Boston to Denver”:
0.02 · 0.01 · 0.03 · 0.05 · 0.01 · 0.02 · 0.02 ≈ 1.2e-12
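The worked product can be reproduced by chaining bigram probabilities over consecutive word pairs. The table below is hypothetical, filled in with the illustrative values the worked example multiplies together:

```python
# Hypothetical bigram table using the worked example's values.
bigram = {
    ("I", "want"): 0.02, ("want", "to"): 0.01, ("to", "go"): 0.03,
    ("go", "from"): 0.05, ("from", "Boston"): 0.01,
    ("Boston", "to"): 0.02, ("to", "Denver"): 0.02,
}

def sentence_prob(words, bigram):
    """Product of bigram probabilities over consecutive word pairs,
    ignoring the unigram probability of the first word (as the example does)."""
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigram.get((a, b), 0.0)  # unseen pair -> probability 0 (no back-off)
    return p

p = sentence_prob("I want to go from Boston to Denver".split(), bigram)
```

The `get(..., 0.0)` default is exactly the brittleness the next slide addresses: one unseen pair zeroes out the whole sentence.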
Generalization for N-Grams (Back-off)

If the bigram ab was never observed, we can estimate its probability as a fraction of the probability of word b following any word.

Without back-off, “I want a flight from Boston to Denver” scores
0.02 · 0.0 · 0.0 · 0.0 · 0.01 · 0.02 · 0.02 = 0.0 -- BAD!

With back-off, the unseen bigrams get small non-zero estimates:
0.02 · 0.003 · 0.001 · 0.002 · 0.01 · 0.02 · 0.02 ≈ 4.8e-15 -- GOOD
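A minimal back-off sketch: fall back to a scaled unigram probability when a bigram was never observed. The back-off weight `alpha` is an assumed constant here, not the properly normalized weight a real scheme (e.g., Katz back-off) would compute per history:

```python
def backoff_prob(a, b, bigram, unigram, alpha=0.4):
    """P(b | a): use the observed bigram when available, otherwise back off
    to alpha * P(b). alpha is an illustrative constant, not a trained weight."""
    if (a, b) in bigram:
        return bigram[(a, b)]
    return alpha * unigram.get(b, 0.0)

# Hypothetical counts-turned-probabilities for the sketch.
bigram = {("from", "Boston"): 0.01}
unigram = {"Boston": 0.005, "flight": 0.002}

p_seen = backoff_prob("from", "Boston", bigram, unigram)  # observed bigram
p_unseen = backoff_prob("a", "flight", bigram, unigram)   # backs off to unigram
```

The effect matches the slide's point: an unseen pair contributes a small positive factor instead of zeroing out the sentence.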
Pattern Classification
Goal: combine information (probabilities) from the acoustic model, language model, and word lexicon to generate an “optimal” word sequence (highest probability).
Method: the decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenges: how do we build efficient structures (FSMs) for decoding and searching large-vocabulary, complex-language-model tasks?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network to 10^8 states -- 14 orders of magnitude more efficient
• what is the theoretical limit of efficiency that can be achieved?
Unlimited Vocabulary ASR
• the basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping:
Features -> HMM states -> HMM units -> Phones -> Words -> Sentences
• for the WSJ 20,000-word vocabulary, this results in a network of 10^22 bytes!
• state-of-the-art methods, including fast match, multi-pass decoding, A* stack decoding, and finite-state transducers, all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function.
Weighted Finite State Transducers (WFST)
• a unified mathematical framework for ASR
• efficiency in time and space

[Diagram: cascade of WFSTs -- State:HMM, HMM:Phone, Phone:Word, Word:Phrase -- combined and optimized into a single search network. WFST compilation can reduce the network to 10^8 states -- 14 orders of magnitude more efficient.]

Word pronunciation transducer for “data”:
d:ε/1 -> { ey:ε/0.4 | ae:ε/0.6 } -> { dx:ε/0.8 | t:ε/0.2 } -> ax:"data"/1
Algorithmic Speed-up for Speech Recognition

[Chart: relative speed vs. year (1994-2002) on the North American Business task (vocabulary: 40,000 words; branching factor: 85), comparing AT&T algorithmic speed-up, Moore’s Law (hardware), and community results.]
Confidence Scoring
Goal: identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU, and DM.
Method: a confidence score based on a hypothesis test is associated with each recognized word. For example:
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Challenges: rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.

P(W | X) = P(W) P(X | W) / Σ_W P(W) P(X | W)
Rejection
Problem: extraneous acoustic events, noise, background speech, and out-of-domain speech deteriorate system performance.
Measure of confidence: associating word strings with a verification cost that provides an effective measure of confidence (utterance verification).
Effect: improvement in the performance of the recognizer, the understanding system, and the dialogue manager.
State-of-the-Art Performance

[Diagram: the full recognizer -- input speech through feature extraction and pattern classification (decoding, search), drawing on the acoustic model, word lexicon, and language model, plus confidence scoring -- producing the recognized sentence with confidence, embedded in the ASR/SLU/DM/SLG/TTS dialog circle.]
How to Evaluate Performance?
• Dictation applications: insertions, substitutions, and deletions

Word Error Rate = (#Subs + #Dels + #Ins) / (No. of words in the correct sentence) × 100%

• Command-and-control: false rejection and false acceptance => ROC curves.
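The error counts in the formula come from a minimum-cost alignment of the hypothesis against the reference, which is a standard Levenshtein (edit distance) computation over word sequences:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref) * 100%,
    with the error counts taken from the minimum edit distance alignment
    between the reference and hypothesis word strings."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(r)

wer = word_error_rate("credit please", "credit fees")  # one substitution
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.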
Word Error Rates

CORPUS | TYPE | VOCABULARY SIZE | WORD ERROR RATE
Connected Digit Strings (TI Database) | Spontaneous | 11 (zero-nine, oh) | 0.3%
Connected Digit Strings (Mall Recordings) | Spontaneous | 11 (zero-nine, oh) | 2.0%
Connected Digit Strings (HMIHY) | Conversational | 11 (zero-nine, oh) | 5.0%
RM (Resource Management) | Read Speech | 1,000 | 2.0%
ATIS (Airline Travel Information System) | Spontaneous | 2,500 | 2.5%
NAB (North American Business) | Read Text | 64,000 | 6.6%
Broadcast News | News Show | 210,000 | 13-17%
Switchboard | Conversational Telephone | 45,000 | 25-29%
Call Home | Conversational Telephone | 28,000 | 40%

(Note the factor-of-17 increase in digit error rate from the TI Database to HMIHY.)
NIST Benchmark Performance

[Chart: word error rate vs. year for Resource Management, ATIS, and NAB (North American Business; vocabulary: 40,000 words, branching factor: 85).]
Algorithmic Accuracy for Speech Recognition

[Chart: word accuracy (30-80%) vs. year (1996-2002) on Switchboard/Call Home (vocabulary: 40,000 words; perplexity: 85).]
Growth in Effective Recognition Vocabulary Size

[Chart: vocabulary size on a log scale (1 to 10,000,000 words) vs. year, 1960-2010.]
Human Speech Recognition vs. ASR

[Chart: machine error (%) vs. human error (%) on a log-log scale for Digits, RM-null, RM-LM, NAB-mic, NAB-omni, WSJ, WSJ-22dB, and SWBD, with reference lines at x1, x10, and x100; points below the x1 line would mean machines outperform humans, but the data sit roughly 10x-100x above human error.]
Challenges in ASR
System performance:
- accuracy
- efficiency (speed, memory)
- robustness
Operational performance:
- end-point detection
- user barge-in
- utterance rejection
- confidence scoring
Machines are 10-100 times less accurate than humans.
Large-Vocabulary ASR Demo
Courtesy of Murat Saraclar
Voice-Enabled System Technology Components

[Diagram: the dialog circle -- speech -> ASR (Automatic Speech Recognition) -> words -> SLU (Spoken Language Understanding) -> meaning -> DM (Dialog Management, with data and rules) -> action -> SLG (Spoken Language Generation) -> words -> TTS (Text-to-Speech Synthesis) -> speech.]
Spoken Language Understanding (SLU)
• Goal: interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
– accurate understanding can often be achieved without correctly recognizing every word
– SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: exploit task grammar (syntax) and task semantics to restrict the range of meaning associated with the recognized word string; exploit ‘salient’ words and phrases to map high-information word sequences to appropriate meaning
• Performance evaluation: accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: automation of complex operator-based tasks, e.g., customer care, catalog ordering, form filling systems, provisioning of new services, customer help lines, etc.
• Challenges: what goes beyond simple classification systems but below full natural-language voice dialogue systems?
Knowledge Sources for Speech Understanding

ASR draws on acoustic/phonetic, phonotactic, and syntactic knowledge; SLU adds semantic knowledge; DM adds pragmatic knowledge:
- Acoustic model -- acoustic/phonetic: relationship of speech sounds and English phonemes
- Word lexicon -- phonotactic: rules for phoneme sequences and pronunciation
- Language model -- syntactic: structure of words and phrases in a sentence
- Understanding model -- semantic: relationships and meanings among words
- Dialog manager -- pragmatic: discourse, interaction history, world knowledge
DARPA Communicator
• DARPA-sponsored research and development of mixed-initiative dialogue systems
• travel task involving airline, hotel, and car information and reservation
“Yeah I uh I would like to go from New York to Boston tomorrow night with United”
• SLU output (concept decoding):
Topic: Itinerary; Origin: New York; Destination: Boston; Day of the week: Sunday; Date: May 25th, 2002; Time: >6pm; Airline: United
• represented in an XML schema:
<itinerary>
  <origin><city></city><state></state></origin>
  <destination><city></city><state></state></destination>
  <date></date><time></time><airline></airline>
</itinerary>
Customer Care IVR and HMIHY

[Diagram comparing call flows. Conventional Customer Care IVR: sparkle tone, “Thank you for calling AT&T…” -> LEC misdirect announcement -> account verification routine -> main menu -> LD sub-menu -> reverse directory routine; total time to get to reverse directory lookup: 2:55 minutes! HMIHY: sparkle tone, “AT&T, How may I help you?” -> account verification routine; total time to get to reverse directory lookup: 28 seconds!]
HMIHY -- How Does It Work?
• prompt is “AT&T. How may I help you?”
• user responds with totally unconstrained fluent speech
• system recognizes the words and determines the meaning of the user’s speech, then routes the call (e.g., Account Balance, Calling Plans, Local, Unrecognized Number, ...)
• dialog technology enables task completion
HMIHY Example Dialogs
• Irate Customer • Rate Plan • Account Balance • Local Service • Unrecognized Number • Threshold Billing • Billing Credit

Customer Satisfaction
• decreased repeat calls (37%)
• decreased ‘OUTPIC’ rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)

Customer Care Scenario
Voice-Enabled System Technology Components

[Diagram: the same dialog circle as before -- ASR, SLU, DM, SLG, TTS -- with words, meaning, action, and synthesized speech flowing around the loop.]
Dialog Management (DM)
• Goal: combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
– DM makes viable complex services that require multiple exchanges between the system and the customer
– dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well-understood goal or system interaction
• Performance evaluation: speed and accuracy of attaining a well-defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: is there a science of dialogues -- how do you keep a dialog efficient (turns, time, progress towards a goal) and how do you attain goals (get answers)? Is there an art of dialogues? How does the user interface play into the art/science of dialogues -- sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak in multimodal interactions with machines
Dialog Strategies
• Completion (continuation): elicits missing input information from the user
• Constraining (disambiguation): reduces the scope of the request when multiple pieces of information have been retrieved
• Relaxation: increases the scope of the request when no information has been retrieved
• Confirmation: verifies that the system understood correctly; the strategy may differ depending on the SLU confidence measure
• Reprompting: used when the system expected input but did not receive any or did not understand what it received
• Context help: provides the user with help during periods of misunderstanding by either the system or the user
• Greeting/closing: maintains social protocol at the beginning and end of an interaction
• Mixed-initiative: allows users to manage the dialog
(These strategies, together with context interpretation and the backend, make up the dialog manager.)
Mixed-Initiative Dialog

Who manages the dialog? A spectrum from system initiative (“Please say collect, calling card, third number.”) to user initiative (“How may I help you?”).

• System initiative:
System: Please say just your departure city.
User: Chicago
System: Please say just your arrival city.
User: Newark
(Long dialogs, but easier to design.)
• Mixed initiative:
System: Please say your departure city.
User: I need to travel from Chicago to Newark tomorrow.
• User initiative:
System: How may I help you?
User: I need to travel from Chicago to Newark tomorrow.
(Shorter dialogs (better user experience), but more difficult to design.)
Voice-Enabled System Technology Components

[Diagram: the dialog circle again, highlighting SLG -- the meaning/action from DM is passed to SLG, which chooses the words to be synthesized, and TTS produces the spoken reply.]
Building Good Speech-Based Applications
• Good User Interfaces: make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
• Good Models of Dialog: keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
• Matching the Task to the Technology: be realistic about the capabilities of the technology
  – fail-soft methods
  – error-correcting techniques (data dips)
User Interface
• Good User Interfaces
  – make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
  – keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
  – a great UI cannot save a system with poor ASR and NLU, but the UI can make or break a system, even with excellent ASR and NLU
  – effective UI design is based on a set of elementary principles and common widgets: sequenced screen presentations; simple error-trap dialogs; a user manual
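A simple error-trap dialog combined with a fail-soft method can be sketched as an escalating sequence of prompts that ends in a hand-off to a human agent. Everything below is hypothetical (the prompt texts and the three-try limit are invented for illustration).

```python
# Hypothetical error-trap dialog: escalate prompts, then fail soft to an agent.

PROMPTS = [
    "Which city are you departing from?",
    "Sorry, I didn't catch that. Please say the name of your departure city.",
    "Let's try once more. For example, say 'Chicago'.",
]

def run_error_trap(recognizer_results):
    """Simulate the trap against a list of recognizer outputs (None = no usable input).

    Returns ('slot', value) on success, or ('agent', None) when the prompts run out."""
    for prompt, result in zip(PROMPTS, recognizer_results):
        # in a real system, `prompt` would be played to the user here
        if result is not None:
            return ("slot", result)
    return ("agent", None)   # fail soft: hand off to a human operator

print(run_error_trap([None, None, "chicago"]))   # recovered on the third try
print(run_error_trap([None, None, None]))        # escalated to an agent
```

The escalation keeps the conversation moving forward during uncertainty, and the agent hand-off is the fail-soft backstop mentioned above.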
Multimodal System Technology Components
[Diagram: the Speech Dialog Circle extended to multimodal input. Alongside speech, visual, pen, and gesture inputs feed the system; ASR, SLU, DM, SLG, and TTS convert speech to words, words to meaning, meaning to action, action to words, and words to speech, all drawing on data and rules.]
Multimodal Experience
• users need to have access to automated dialog applications at any time and anywhere, using any access device
• selection of the "most appropriate" UI mode, or combination of modes, depends on the device, the task, the environment, and the user's abilities and preferences
Microsoft MiPad
• Multimodal Interactive Pad
• usability studies show double throughput for English
• speech is mostly useful in cases with lots of alternatives
MiPad Video
AT&T MATCH (Multimodal Access To City Help)
Access to information through a voice interface, gesture, and GPS.
• Multimodal Integration: combination of speech, gesture, and meaning using finite-state technology
Example: "Are there any cheap Italian places in this neighborhood?"

MATCH Video
Courtesy of Michael Johnston
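MATCH itself integrates speech and gesture with weighted finite-state transducers over input lattices; the toy sketch below only illustrates the underlying idea of resolving a deictic phrase ("this neighborhood") against a concurrent pen gesture. All names, fields, and the time-alignment rule are invented for illustration.

```python
# Illustrative sketch (not the MATCH implementation): fill a deictic
# location slot in the speech meaning from the temporally closest gesture.

def integrate(speech_meaning, gestures):
    """Merge a speech-derived meaning with pen-gesture events.

    speech_meaning: dict with a 'time' stamp; 'location' may be '<deictic>'.
    gestures: list of dicts, each with 'area' and 'time'."""
    merged = dict(speech_meaning)
    if merged.get("location") == "<deictic>" and gestures:
        # resolve "this neighborhood" to the gesture closest in time
        g = min(gestures, key=lambda g: abs(g["time"] - merged["time"]))
        merged["location"] = g["area"]
    return merged

speech = {"request": "find", "cuisine": "italian", "price": "cheap",
          "location": "<deictic>", "time": 3.2}
gestures = [{"area": "circled-region", "time": 3.0},
            {"area": "other-region", "time": 9.0}]
print(integrate(speech, gestures))   # location resolved to 'circled-region'
```

The finite-state formulation in MATCH generalizes this idea: it composes uncertain speech and gesture hypotheses rather than single best interpretations.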
Voice-Enabled Services
• Desktop applications: dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, …)
• Agent technology: simple tasks like stock quotes, traffic reports, and weather; access to communications, e.g., voice dialing and voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice Portals: "convert any web page to a voice-enabled site," where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, and SOAP are key
• E-Contact services: call centers, customer care (HMIHY), and help desks where calls are triaged and answered appropriately using natural-language voice dialogs
• Telematics: command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices: control of cellphones and PDAs by voice commands
Milestones in Speech and Multimodal Technology Research
[Timeline, 1962–2002:]
• Isolated Words; Small Vocabulary; Acoustic-Phonetics-based: filter-bank analysis; time normalization; dynamic programming
• Connected Digits; Continuous Speech; Medium Vocabulary; Template-based: pattern recognition; LPC analysis; clustering algorithms; level building
• Connected Words; Continuous Speech; Large Vocabulary; Statistical-based: hidden Markov models; stochastic language modeling
• Continuous Speech; Speech Understanding; Large Vocabulary; Syntax and Semantics: stochastic language understanding; finite-state machines; statistical learning
• Spoken Dialog; Multiple Modalities; Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: concatenative synthesis; machine learning; mixed-initiative dialog
Future of Speech Recognition Technologies
[Timeline, 2002–2011:]
• Dialog Systems: very large vocabulary, limited tasks, controlled environment
• Robust Systems: very large vocabulary, limited tasks, arbitrary environment
• Multilingual Systems; Multimodal Speech-Enabled Devices: unlimited vocabulary, unlimited tasks, many languages
We've Come A Long Way -- Or Have We?