Lectures 1: Rabiner Speech Processing

ECE 259B – Fundamentals of Speech Recognition (12/28/2009)
Lecture 1: Introduction/Overview of Automatic Speech Recognition


DESCRIPTION

Rabiner digital speech processing

TRANSCRIPT

Page 1: ECE 259B – Fundamentals of Speech Recognition
Lecture 1: Introduction/Overview of Automatic Speech Recognition

Page 2: Why Digital Processing of Speech?

• digital processing of speech signals (DPSS) enjoys an extensive theoretical and experimental base developed over the past 75 years

• much research has been done since 1965 on the use of digital signal processing in speech communication problems

• highly advanced implementation technology (VLSI) exists that is well matched to the computational demands of DPSS

• there are abundant applications that are in widespread use commercially

Page 3: The Speech Stack

Fundamentals — acoustics, linguistics, pragmatics, speech perception

Speech Representations — temporal, spectral, homomorphic, LPC

Speech Algorithms — speech-silence, voiced-unvoiced, pitch, formants

Speech Applications — coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down

Page 4: Speech Recognition – 2001 (Stanley Kubrick View in 1968)

Page 5: Apple Navigator – 1988

Page 6: The Speech Advantage

• Reduce costs
– reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– access to information without a keyboard or touch-tones
• Customer retention
– provide personal services for customer preferences
– improve customer experience

Page 7: The Speech Circle

[Diagram: the speech circle. A customer voice request passes through Automatic Speech Recognition (ASR) to yield the words spoken ("I dialed a wrong number"); Spoken Language Understanding (SLU) extracts the meaning ("Billing credit"); Dialog Management (actions) and Spoken Language Generation (words), drawing on data, decide what's next ("Determine correct number"); and Text-to-Speech Synthesis (TTS) produces the voice reply to the customer ("What number did you want to call?").]

Page 8: Automatic Speech Recognition

• Goal: Accurately and efficiently convert a speech signal into a text message, independent of the device, speaker, or environment.

• Applications: Automation of complex operator-based tasks, e.g., customer care, dictation, form-filling applications, provisioning of new services, customer help lines, e-commerce, etc.

Page 9: Pattern Matching Problems

[Block diagram: speech -> A-to-D Converter -> Feature Analysis -> Pattern Matching -> symbols]

• speech recognition

• speaker recognition

• speaker verification

• word spotting

• automatic indexing of speech recordings

Page 10: Basic ASR Formulation (Bayes Method)

[Block diagram: the speaker's intention W passes through the speech production mechanisms (together, the speaker model) to give the speech signal s(n); the speech recognizer consists of an acoustic processor that converts s(n) into observations X, followed by a linguistic decoder that outputs the recognized word sequence \hat{W}.]

\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P_A(X \mid W)\, P_L(W)}{P(X)} = \arg\max_W P_A(X \mid W)\, P_L(W)

(The acoustic model P_A(X|W) is Step 1, the language model P_L(W) is Step 2, and the search for the arg max is Step 3.)

Page 11: Steps in Speech Recognition

Step 1 – Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.

Step 2 – Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.

Step 3 – Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the arg max over W (a minimal sketch follows below).

Page 12: Step 1 – The Acoustic Model

• We build the acoustic model by learning statistics of the acoustic features, X_i, from a training set, where we compute the variability of the acoustic features during the production of the sounds represented by the models.
• It is impractical to create a separate acoustic model, P_A(X|W_i), for every possible word in the language--it requires too much training data for words in every possible context.
• Instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word (a minimal sketch follows below).
• Similarly we build sentences (sequences of words) by concatenating word models.
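A minimal sketch of the concatenation idea, with made-up phone "models" represented only as lists of state names (a real acoustic model would attach HMM parameters to each state); the lexicon entry for "is" follows the example on Page 22:

```python
# Hypothetical phone models: each phone is a short sequence of HMM state names.
phone_models = {
    "ih": ["ih1", "ih2", "ih3"],
    "z":  ["z1", "z2", "z3"],
}

lexicon = {"is": ["ih", "z"]}   # word -> assumed phone sequence

def word_model(word):
    """Concatenate the phone models of a word's constituent phones."""
    return [state for phone in lexicon[word] for state in phone_models[phone]]

def sentence_model(words):
    """Concatenate word models to form a sentence-level model."""
    return [state for w in words for state in word_model(w)]

print(word_model("is"))            # ['ih1', 'ih2', 'ih3', 'z1', 'z2', 'z3']
print(sentence_model(["is", "is"]))
```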

Page 13: Step 2 – The Language Model

• the language model describes the probability of a sequence of words that form a valid sentence in the language
• a simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N words, i.e., an N-gram language model:

P_L(W) = P_L(w_1, w_2, ..., w_k) = \prod_{n=1}^{k} P_L(w_n \mid w_{n-1}, w_{n-2}, ..., w_{n-N})

where P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text.
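As a toy illustration of estimating these probabilities by counting relative frequencies, here is a bigram sketch over a tiny made-up corpus; a real language model would be trained on a large text corpus and would use smoothing and sentence-boundary handling:

```python
from collections import Counter

# Toy corpus; purely illustrative.
corpus = "yes on my credit card please yes on my calling card please".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """Relative-frequency estimate of P(word | prev) from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    """P(W) under the bigram model (no smoothing, no boundary symbols)."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(word, prev)
    return p

print(p_sentence("yes on my credit card please".split()))   # 0.5
```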

Page 14: Step 3 – The Search Problem

• the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood
• the size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods
• the use of methods from the field of Finite State Automata Theory provides Finite State Networks (FSN) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems

Page 15: Basic ASR Formulation

The basic equation of Bayes-rule-based speech recognition is

\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(W)\, P(X \mid W)}{P(X)} = \arg\max_W P(W)\, P(X \mid W)

where X = X_1, X_2, ..., X_N is the acoustic observation (feature vector) sequence, \hat{W} = w_1 w_2 ... w_M is the corresponding word sequence, P(X|W) is the acoustic model, and P(W) is the language model.

[Block diagram: s(n), W -> Speech Analysis -> X_n -> Decoder -> \hat{W}]

Page 16: Speech Recognition Process

[Block diagram: input speech passes through Feature Analysis (spectral analysis) and Pattern Classification (decoding, search), which draws on the Acoustic Model (HMM), the Word Lexicon, and the Language Model (N-gram); Utterance Verification attaches confidence scores to the output, e.g., "Hello World" (0.9) (0.8). The ASR box sits within the larger speech circle of ASR, SLU, DM, and TTS.]

Page 17: Speech Recognition Processes

• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
– text training data set => word lexicon, word grammar (language model), task grammar
– speech training data set => acoustic models
• Evaluate performance
– speech testing data set
• Training algorithm => build models from training set of text and speech
• Testing algorithm => evaluate performance from testing set of speech

Page 18: Feature Extraction

Goal: Extract robust features (information) from the speech that are relevant for ASR.

Method: Spectral analysis through either a bank-of-filters or through LPC, followed by a non-linearity and normalization (cepstrum).

Result: Signal compression, where roughly 30 cepstral features are extracted from each window of speech samples (64,000 b/s -> 5,200 b/s).

Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise, and echo. Choice of feature set for recognition--cepstral features or features from a high-dimensionality space.
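As a rough illustration of the cepstral analysis mentioned above, here is a minimal sketch for a single windowed frame (plain real cepstrum via FFT, log, and inverse FFT); it omits the pre-emphasis, mel filter banks, delta features, and normalization a production front end would include:

```python
import numpy as np

def cepstral_features(frame, n_ceps=13):
    """Minimal cepstral analysis of one speech frame:
    Hamming window -> magnitude spectrum -> log -> inverse FFT (real cepstrum)."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spectrum = np.log(spectrum + 1e-10)     # avoid log(0)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[:n_ceps]

# Example: a 20 ms frame at 8 kHz is 160 samples (random stand-in for real speech).
frame = np.random.randn(160)
print(cepstral_features(frame).shape)           # (13,)
```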


Page 19: Robustness

Problem: A mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.

Methods: Traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.

Perception Approach: Extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.

Page 20: Methods for Robust Speech Recognition

[Diagram: both the training path and the testing path run from signal to features to model; robustness methods can be applied at each stage: signal enhancement, feature normalization, and model adaptation.]
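As one concrete example of the feature-normalization stage above, here is a minimal sketch of cepstral mean subtraction, a standard way to remove stationary channel effects; it assumes cepstral features stored as a (frames x coefficients) NumPy array:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean from each cepstral coefficient,
    removing a constant channel/transducer offset. `features` is (T x D)."""
    return features - features.mean(axis=0, keepdims=True)

# Stand-in cepstra with an artificial channel offset of +5 on every coefficient.
frames = np.random.randn(200, 13) + 5.0
print(cepstral_mean_subtraction(frames).mean(axis=0).round(6))   # ~0 per coefficient
```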

Page 21: Acoustic Model

Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).

Hidden Markov Model (HMM): Statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with each phone. HMMs are also assigned for modeling extraneous events.

Advantages: Powerful statistical method for dealing with a wide range of data and reliably recognizing speech.

Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM--are there data-driven models that work better for some or all vocabularies?


Page 22: HMM for Speech

• Phone model: 'z' (/Z/) -- a three-state HMM with states z1, z2, z3
• Word model: 'is' (/IH/ /Z/) -- the concatenation of the /IH/ and /Z/ phone models, with states ih1, ih2, ih3, z1, z2, z3

Page 23: Isolated Word HMM

[Figure: a 5-state left-right HMM with self-loops a11, a22, a33, a44, a55 = 1, forward transitions a12, a23, a34, a45, skip transitions a13, a24, a35, and observation densities b1(Ot), ..., b5(Ot).]

Left-right HMM – highly constrained state sequences
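A minimal sketch of Viterbi alignment against such a model, assuming log-domain transition and observation scores; the left-right structure is imposed simply by zero probability (log of zero) on backward transitions:

```python
import numpy as np

def viterbi(log_A, log_B):
    """Viterbi alignment for an HMM with log transition matrix log_A (N x N)
    and per-frame log observation scores log_B (T x N); returns the best
    state sequence and its log score. Assumes the model starts in state 0."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)    # best score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0, 0] = log_B[0, 0]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + log_B[t, j]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

# Example: a 3-state left-right model with self-loops and forward transitions only.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
with np.errstate(divide="ignore"):              # log(0) -> -inf for forbidden moves
    log_A = np.log(A)
log_B = np.log(np.full((6, 3), 1.0 / 3.0))      # uninformative observation scores
print(viterbi(log_A, log_B))                    # best path: [0, 1, 2, 2, 2, 2]
```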

Page 24: Word Lexicon

Goal: Map legal phone sequences into words according to phonotactic rules. For example,

David: /d/ /ey/ /v/ /ih/ /d/

Multiple Pronunciations: Several words may have multiple pronunciations. For example,

Data: /d/ /ae/ /t/ /ax/
Data: /d/ /ey/ /t/ /ax/

Challenges: How do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?
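A minimal sketch of a word lexicon as a data structure, including the two pronunciations of "data" from the slide; the expansion function enumerates all phone-sequence alternatives for a word sequence:

```python
# Hypothetical word lexicon mapping words to one or more pronunciations.
lexicon = {
    "david": [["d", "ey", "v", "ih", "d"]],
    "data":  [["d", "ae", "t", "ax"], ["d", "ey", "t", "ax"]],
}

def pronunciations(words):
    """Expand a word sequence into all phone-sequence alternatives."""
    expansions = [[]]
    for w in words:
        expansions = [e + p for e in expansions for p in lexicon[w]]
    return expansions

for phones in pronunciations(["data"]):
    print(phones)        # two alternatives, one per pronunciation variant
```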


Page 25: Language Model

Goal: Map words into phrases and sentences based on task syntax.

Handcrafted: Deterministic grammars that are knowledge-based. For example,

Yes on my credit (card) please

Statistical: Compute estimates of word probabilities (N-gram model), e.g., alternative words weighted 0.6 and 0.4. For example,

Yes on my credit card please

Challenges: How do you build a language model rapidly for a new task?


Page 26: Pattern Classification

Goal: Combine information (probabilities) from the acoustic model, language model, and word lexicon to generate an "optimal" word sequence (highest probability).

Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.

Challenges: How do we build efficient structures (FSMs) for decoding and searching large-vocabulary, complex-language-model tasks?

• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network down to 10^8 states--14 orders of magnitude more efficient
• What is the theoretical limit of efficiency that can be achieved?


Page 27: Unlimited Vocabulary ASR

• The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping: features -> HMM states -> HMM units -> phones -> words -> sentences.
• For the WSJ 64,000-word vocabulary, this results in a network of 10^22 bytes!
• State-of-the-art methods including fast match, multi-pass decoding, and A* stack decoding provide tremendous speed-up at a cost of increased complexity and less portability.
• Advances in weighted finite state transducers have enabled us to represent this network in a unified mathematical framework with only 10^8 bytes!

Page 28: Weighted Finite State Transducers (WFST)

• Unified mathematical framework for ASR
• Efficiency in time and space

[Diagram: individual WFSTs mapping state:HMM, HMM:phone, phone:word, and word:phrase are combined and optimized into a single search network.]

Page 29: Weighted Finite State Transducer

Word pronunciation transducer for "data":

d:ε/1  ->  { ey:ε/0.4 | ae:ε/0.6 }  ->  { t:ε/0.2 | dx:ε/0.8 }  ->  ax:"data"/1
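To see how the arc weights combine, here is a small sketch that enumerates the paths of the "data" pronunciation transducer above and multiplies the arc weights along each path (epsilon outputs are represented as None):

```python
from itertools import product

# Arcs as (input phone, output label or None for epsilon, weight), per the figure.
first  = [("d", None, 1.0)]
second = [("ey", None, 0.4), ("ae", None, 0.6)]
third  = [("t", None, 0.2), ("dx", None, 0.8)]
fourth = [("ax", "data", 1.0)]

for path in product(first, second, third, fourth):
    phones = [arc[0] for arc in path]
    weight = 1.0
    for _, _, w in path:
        weight *= w                      # path weight = product of arc weights
    print(phones, round(weight, 2))
# Highest-weight path: ['d', 'ae', 'dx', 'ax'] with weight 0.48
```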

Page 30: Algorithmic Speed-up for Speech Recognition

[Plot: relative speed vs. year (1994-2002), comparing AT&T algorithmic improvements with Moore's Law (hardware), for the North American Business task (vocabulary: 40,000 words; branching factor: 85).]

Page 31: Utterance Verification

Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU, and DM.

Method: A confidence score based on a hypothesis test is associated with each recognized word. For example:

Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)

Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
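A minimal sketch of how such per-word confidence scores might be used, reusing the example above and an illustrative acceptance threshold (the threshold value is an assumption, not from the slide):

```python
# Recognized words with assumed confidence scores in [0, 1]
# (e.g., from a likelihood-ratio hypothesis test).
recognized = [("credit", 0.9), ("fees", 0.3)]
THRESHOLD = 0.5   # illustrative operating point

for word, confidence in recognized:
    status = "accept" if confidence >= THRESHOLD else "reject / ask user to confirm"
    print(f"{word}: {confidence:.1f} -> {status}")
```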


Page 32: Rejection

Problem: Extraneous acoustic events, noise, background speech, and out-of-domain speech deteriorate system performance.

Effect: Rejection improves the performance of the recognizer, the understanding system, and the dialogue manager.

Measure of Confidence: Associate word strings with a verification cost that provides an effective measure of confidence (utterance verification).

Page 33: State-of-the-Art Performance?

[Block diagram: input speech -> feature extraction -> pattern classification (decoding, search) -> utterance verification -> recognized sentence, drawing on the acoustic model, word lexicon, and language model; the ASR module sits alongside SLU, DM, and TTS.]

Page 34: Word Error Rates

Corpus                                     | Type                     | Vocabulary Size    | Word Error Rate
Connected Digit Strings -- TI Database     | Spontaneous              | 11 (zero-nine, oh) | 0.3%
Connected Digit Strings -- Mall Recordings | Spontaneous              | 11 (zero-nine, oh) | 2.0%
Connected Digit Strings -- HMIHY           | Conversational           | 11 (zero-nine, oh) | 5.0%
RM (Resource Management)                   | Read Speech              | 1,000              | 2.0%
ATIS (Airline Travel Information System)   | Spontaneous              | 2,500              | 2.5%
NAB (North American Business)              | Read Text                | 64,000             | 6.6%
Broadcast News                             | News Show                | 210,000            | 13-17%
Switchboard                                | Conversational Telephone | 45,000             | 25-29%
Call Home                                  | Conversational Telephone | 28,000             | 40%

Note the factor of ~17 increase in digit error rate from the cleanest (TI Database) to the most conversational (HMIHY) condition.
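For reference, word error rate is computed from the minimum edit distance between the reference and recognized word strings; here is a minimal sketch (standard Levenshtein dynamic program, with illustrative sentences):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(word_error_rate("what number did you want to call",
                      "what number you want a call"))   # 2 errors / 7 words ≈ 0.29
```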

Page 35: Lectures 1 Rabiner speech processing


Page 36: North American Business

vocabulary: 40,000 words; branching factor: 85

Page 37: Broadcast News

Page 38: Dictation Machine

Page 39: Algorithmic Accuracy for Speech Recognition

[Plot: word accuracy (roughly 30-80%) vs. year (1996-2002) for the Switchboard/Call Home task (vocabulary: 40,000 words; perplexity: 85).]

Page 40: Growth in Effective Recognition Vocabulary Size

[Plot: vocabulary size on a logarithmic scale (1 to 10,000,000 words) vs. year (1960-2010).]

Page 41: Human Speech Recognition vs. ASR

Page 42: Human Speech Recognition vs. ASR

[Plot: machine error (%) vs. human error (%) on log-log axes for several tasks (Digits, RM-LM, RM-null, NAB-mic, NAB-omni, WSJ, WSJ-22dB, SWBD), with reference lines at x1, x10, and x100; the region below the x1 line is where machines would outperform humans.]

Page 43: Voice-Enabling Services: Technology Components

[The speech circle diagram of Page 7, with the ASR component highlighted.]

Page 44: Spoken Language Understanding (SLU)

• Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
– accurate understanding can often be achieved without correctly recognizing every word
– SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: Exploit task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string; exploit 'salient' words and phrases to map high-information word sequences to the appropriate meaning (a minimal sketch follows after this list)
• Performance Evaluation: Accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: Automation of complex operator-based tasks, e.g., customer care, catalog ordering, form-filling systems, provisioning of new services, customer help lines, etc.
• Challenges: What lies beyond simple classification systems but below full natural language voice dialogue systems?
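A minimal illustration of the salient-phrase idea: spot high-information phrases in the recognized string and map them to routing intents. The phrase table and intent labels below are made up for illustration, not the actual HMIHY inventory:

```python
# Hypothetical salient phrases and the call-routing intents they map to.
SALIENT_PHRASES = {
    "wrong number": "BillingCredit",
    "account balance": "AccountBalance",
    "calling plan": "CallingPlans",
}

def interpret(recognized_text):
    """Map a recognized word string to an intent via salient-phrase spotting."""
    text = recognized_text.lower()
    for phrase, intent in SALIENT_PHRASES.items():
        if phrase in text:
            return intent
    return "Unrecognized"        # hand off to a human agent or re-prompt

print(interpret("I dialed a wrong number"))   # -> BillingCredit
```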

Page 45: Voice-Enabling Services: Technology Components

[The speech circle diagram of Page 7, repeated to highlight the component under discussion.]

Page 46: Dialog Management (DM)

• Goal: Combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
– DM makes viable complex services that require multiple exchanges between the system and the customer
– dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: Exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well-understood goal or system interaction
• Performance Evaluation: Speed and accuracy of attaining a well-defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: Customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: Is there a science of dialogues? How do you keep a dialogue efficient (turns, time, progress towards a goal) and attain goals (get answers)? Is there an art of dialogues? How does the user interface play into the art/science of dialogues? Sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak, which argues for multimodal interactions with machines.

Page 47: Customer Care IVR and HMIHY(sm)

[Diagram comparing two call flows to a reverse directory lookup:
– Customer Care IVR: sparkle tone and "Thank you for calling AT&T…", Network Menu, LEC Misdirect Announcement, Account Verification Routine, Main Menu, LD Sub-Menu, Reverse Directory Routine, with step times of 10, 30, 13, 26, 58, and 38 seconds -- total time to get to the reverse directory lookup: 2:55 minutes!
– HMIHY: sparkle tone and "AT&T, How may I help you?", Account Verification Routine, with step times of 20 and 8 seconds -- total time to get to the reverse directory lookup: 28 seconds!]

Page 48: HMIHY(sm) – How Does It Work

• Prompt is "AT&T. How may I help you?"
• User responds with totally unconstrained fluent speech
• System recognizes the words and determines the meaning of the user's speech, then routes the call
• Dialog technology enables task completion

[Diagram: the HMIHY router directs calls to destinations such as Account Balance, Calling Plans, Local service, Unrecognized Number, …]

Page 49: HMIHY Example Dialogs

• Irate Customer
• Rate Plan
• Account Balance
• Local Service
• Unrecognized Number
• Threshold Billing
• Billing Credit

Customer Satisfaction

• decreased repeat calls (37%)
• decreased 'OUTPIC' rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)

Page 50: Customer Care Scenario

Page 51: TTS – Closest to the Customer's Ear

[The speech circle diagram of Page 7, with the TTS component highlighted.]

Page 52: Speech Synthesis

[Block diagram: text -> DSP computer (applying linguistic rules) -> D-to-A converter -> speech]

Page 53: Speech Synthesis

• Synthesis of speech for effective human-machine communications
– reading email messages over a telephone
– telematics feedback in automobiles
– talking agents for completion of transactions
– call center help desks and customer care
– handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
– announcement machines that provide things like stock quotes, airline schedules, and updates of arrivals and departures of flights

Page 54: Giving Machines High Quality Voices and Faces

'Natural Speech' audio examples: U.S. English Female, U.S. English Male, Spanish Female

Page 55: Speech Synthesis Examples

• Soliloquy from Hamlet:

• Gettysburg Address:

• Third Grade Story:

Page 56: Speech Recognition Demos

Page 57: Au Clair de la Lune

Page 58: Information Kiosk

Page 59: Multimodal Language Processing

Unified Multimodal Experience: access to information through a voice interface, gesture, or both.

Multimodal Finite State: combination of speech, gesture, and meaning using finite state technology.

Example query: "Are there any cheap Italian places in this neighborhood?"

MATCH: Multimodal Access To City Help

Page 60: MiPad Demo -- Microsoft

Page 61: Voice-Enabled Services

• Desktop applications – dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, …)
• Agent technology – simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice Portals – 'convert any web page to a voice-enabled site' where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP, and others are key
• E-Contact services – call centers, customer care (HMIHY), and help desks where calls are triaged and answered appropriately using natural language voice dialogues
• Telematics – command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices – control of cellphones and PDAs from voice commands

Page 62: Milestones in Speech and Multimodal Technology Research

[Timeline, 1962-2002:]
• Small Vocabulary, Acoustic Phonetics-based: isolated words -- filter-bank analysis, time normalization, dynamic programming
• Medium Vocabulary, Template-based: isolated words, connected digits, continuous speech -- pattern recognition, LPC analysis, clustering algorithms, level building
• Large Vocabulary, Statistical-based: connected words, continuous speech -- hidden Markov models, stochastic language modeling
• Large Vocabulary; Syntax, Semantics: continuous speech, speech understanding -- stochastic language understanding, finite-state machines, statistical learning
• Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: spoken dialog, multiple modalities -- concatenative synthesis, machine learning, mixed-initiative dialog

Page 63: Future of Speech Recognition Technologies

[Timeline, 2002-2011:]
• System types: Robust Systems; Dialog Systems; Multilingual Systems and Multimodal Speech-Enabled Devices
• Capability progression: very large vocabulary, limited tasks, controlled environment -> very large vocabulary, limited tasks, arbitrary environment -> unlimited vocabulary, unlimited tasks, many languages

Page 64: Issues in Speech Recognition

Page 65: Issues in Speech Recognition

• Input speech format
– Isolated words/phrases
– Connected word sequences
– Continuous speech (essentially unconstrained)
• Recognition mode
– Speaker trained
– Speaker independent
– Amount of training material
• Speaking environment
– Quiet office
– Home
– Noisy surroundings (factory floor, cellular environments, speakerphones)
• Transducer and transmission system
– High quality microphone, close-talking/noise-cancelling microphone
– Telephone (carbon button/electret)
– Switched telephone network
– IP network (VoIP)
– Cellular network
• Vocabulary size and complexity (perplexity)
– Small (2-50 words)
– Medium (50-250 words)
– Large (250-2,000,000 words)
• Speaker characteristics
– Highly motivated, cooperative
– Casual
• Recognition task
– Syntax constrained (language model)
– Viable semantics
• Human factors
– Feedback to users
– Instructions
– Requests for repeats
– Rejections
• Tolerance for recognition errors
– Fail-soft systems
– Human intervention on errors/confusion
– Correction mechanisms built in
• System complexity
– Computation/hardware
– Real-time response capability

Page 66: Overview of Speech Recognition Processes

Page 67: Overview of Speech Recognition Processes

[Diagram: speech and its transcriptions (sampled at rate FS) are converted into features (log energy En, zero crossings Zn; temporal, spectral, cepstral, LPC features); these feed templates or models (VQ, HMM) aligned with distance/warping methods (d(X,Y), DTW); a dictionary and syntax supply the word/sound models and templates, yielding the recognized input.]

Page 68: Statistical Pattern Recognition

The basic speech recognition task may be defined as follows:

• a sequence of measurements (speech analysis frames) on the (endpoint-detected) speech signal of an utterance defines a pattern for that utterance
• this pattern is to be classified as belonging to one of several possible categories (classes) (for word/phrase recognition) or to a sequence of possible categories (for continuous speech recognition)
• the rules for this classification are formulated on the basis of a labeled set of training patterns or models

The type of measurement (temporal, spectral, cepstral, LPC features) and the classification rules (pattern alignment and distance, model alignment and probability) are the main factors that distinguish one method of speech recognition from another.

Page 69: Issues in Pattern Recognition

• A training phase is required; the more training data, the better the patterns (templates, models)
• Patterns are sensitive to the speaking environment, transmission environment, transducer (microphone), etc. (This is known as the speech robustness problem.)
• No speech-specific knowledge is required or exploited, except in the feature extraction stage
• Computational load is (more or less) linearly proportional to the number of patterns being recognized (at least for simple recognition problems, e.g., isolated word tasks)
• Pattern recognition techniques are applicable to a range of speech units, including phrases, words, and sub-word units (phonemes, syllables, dyads, etc.)
• Extensions are possible to large vocabulary, fluent speech recognition using word lexicons (dictionaries) and language models (grammars or syntax)
• Extensions are possible to natural language understanding systems

Page 70: Speech Recognition Processes

1. Fundamentals (Lectures 2-6)
– speech production (acoustic-phonetics, linguistics)
– speech perception (auditory (ear) models, neural models)
– pattern recognition (statistical, template-based)
– neural networks (classification methods)

2. Speech/Endpoint Detection (Lecture 7)
– algorithms
– speech features

Page 71: Speech Recognition Processes

3. Speech Analysis/Feature Extraction (Lectures 7-9)
– temporal parameters (log energy, zero crossings, autocorrelation)
– spectral parameters (STFT, OLA, FBS, spectrograms)
– cepstral parameters (cepstrum, Δ-cepstrum, Δ²-cepstrum)
– LPC parameters (reflection coefficients, area coefficients, LSP)

4. Distance/Distortion Measures (Lecture 10)
– temporal (quadratic, weighted)
– spectral (log spectral distance)
– cepstral (cepstral distance)
– LPC (Itakura distance)

Page 72: Speech Recognition Processes

5. Time Alignment Algorithms (Lectures 11-12)
– linear alignments
– dynamic time warping (DTW, dynamic programming)
– HMM alignments (Viterbi alignment)

6. Model Building/Training (Lectures 13-14)
– template methods
– clustering methods
– HMM methods -- Viterbi, Forward-Backward
– vector quantization (VQ) methods

Page 73: Speech Recognition Processes

7. Connected Word Modeling (Lecture 15)
– dynamic programming
– level building
– one-pass method

8. Testing/Evaluation Methods (Lecture 16)
– word/sound error rates
– dictionary of words
– task syntax
– task semantics
– task perplexity

Page 74: Speech Recognition Processes

9. Large Vocabulary Recognition (Lectures 17-18)
– phoneme models
– context-dependent models
– discrete, mixed, continuous density models
– N-gram language models
– natural language understanding
– insertions, deletions, substitutions
– other factors

Page 75: Putting It All Together

Page 76: Speech Recognition Course Topics

Page 77: What We Will Be Learning

• speech production model -- acoustics, articulatory concepts, speech production models
• speech perception model -- ear models, auditory signal processing, equivalent acoustic processing models
• signal processing approaches to speech recognition -- acoustic-phonetic methods, pattern recognition methods, statistical methods, neural network methods
• fundamentals of pattern recognition
• signal processing methods -- bank-of-filters model, short-time Fourier transforms, LPC methods, cepstral methods, perceptual linear prediction, mel cepstrum, vector quantization
• pattern recognition issues -- speech detection, distortion measures, time alignment and normalization, dynamic time warping
• speech system design issues -- source coding, template training, discriminative methods
• robustness issues -- spectral subtraction, cepstral mean subtraction, model adaptation
• Hidden Markov Model (HMM) fundamentals -- design issues
• connected word models -- dynamic programming, level building, one-pass method
• grammar networks -- finite state machine (FSM) basics
• large vocabulary speech recognition -- training, language models, perplexity, acoustic models for context-dependent sub-word units
• task-oriented designs -- natural language understanding, mixed-initiative systems, dialog management
• text-to-speech synthesis -- based on unit selection methods