Speech and Language Processing: Chapter 9 of SLP, Automatic Speech Recognition

Upload: azuka

Post on 07-Jan-2016



TRANSCRIPT

Page 1: Speech and Language Processing

Speech and Language Processing

Chapter 9 of SLP: Automatic Speech Recognition

Page 2: Speech and Language Processing

Outline for ASR

ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system:
1) Feature Extraction
2) Acoustic Model
3) Language Model
4) Lexicon/Pronunciation Model (Introduction to HMMs again)
5) Decoder
Evaluation

Speech and Language Processing, Jurafsky and Martin. 04/20/23

Page 3: Speech and Language Processing

Speech Recognition

Applications of Speech Recognition (ASR):
Dictation
Telephone-based information (directions, air travel, banking, etc.)
Hands-free (in car)
Speaker identification
Language identification
Second language ('L2') (accent reduction)
Audio archive searching


Page 4: Speech and Language Processing

LVCSR

Large Vocabulary Continuous Speech Recognition
~20,000-64,000 words
Speaker-independent (vs. speaker-dependent)
Continuous speech (vs. isolated-word)


Page 5: Speech and Language Processing

Current error rates

Task                       Vocabulary   Error Rate (%)
Digits                     11           0.5
WSJ read speech            5K           3
WSJ read speech            20K          3
Broadcast news             64,000+      10
Conversational Telephone   64,000+      20

Ballpark numbers; exact numbers depend very much on the specific corpus.


Page 6: Speech and Language Processing

HSR versus ASR

Conclusions: machines are about 5 times worse than humans, and the gap increases with noisy speech. These numbers are rough; take them with a grain of salt.

Task                Vocab   ASR   Human SR
Continuous digits   11      0.5   0.009
WSJ 1995 clean      5K      3     0.9
WSJ 1995 w/noise    5K      9     1.1
SWBD 2004           65K     20    4


Page 7: Speech and Language Processing

Why is conversational speech harder?

A piece of an utterance without context

The same utterance with more context


Page 8: Speech and Language Processing

LVCSR Design Intuition

• Build a statistical model of the speech-to-words process

• Collect lots and lots of speech, and transcribe all the words.

• Train the model on the labeled speech

• Paradigm: Supervised Machine Learning + Search


Page 9: Speech and Language Processing

Speech Recognition Architecture


Page 10: Speech and Language Processing

The Noisy Channel Model

Search through space of all possible sentences.

Pick the one that is most probable given the waveform.


Page 11: Speech and Language Processing

The Noisy Channel Model (II)

What is the most likely sentence out of all sentences in the language L given some acoustic input O?

Treat acoustic input O as sequence of individual observations O = o1,o2,o3,…,ot

Define a sentence as a sequence of words: W = w1,w2,w3,…,wn


Page 12: Speech and Language Processing

Noisy Channel Model (III)

Probabilistic implication: pick the highest-probability sentence W:

  W^ = argmax_{W in L} P(W|O)

We can use Bayes' rule to rewrite this:

  W^ = argmax_{W in L} P(O|W) P(W) / P(O)

Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

  W^ = argmax_{W in L} P(O|W) P(W)
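The argmax above can be sketched in a few lines of code. The candidate sentences and their log-probabilities below are made-up illustrative numbers, not real model scores; the point is only that we maximize log P(O|W) + log P(W) and never need P(O).

```python
# Toy noisy-channel decoder: pick the candidate sentence W maximizing
# P(O|W) * P(W), done in log space for numerical stability.
# The scores below are invented for illustration.
candidates = {
    "recognize speech":   {"log_acoustic": -12.0, "log_prior": -3.0},  # log P(O|W), log P(W)
    "wreck a nice beach": {"log_acoustic": -11.5, "log_prior": -8.0},
}

def decode(candidates):
    # argmax_W  log P(O|W) + log P(W); the denominator P(O) is constant, so ignored
    return max(candidates,
               key=lambda w: candidates[w]["log_acoustic"] + candidates[w]["log_prior"])

print(decode(candidates))  # -> recognize speech
```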


Page 13: Speech and Language Processing

Noisy channel model

  W^ = argmax_{W in L} P(O|W) P(W)

where P(O|W) is the likelihood and P(W) is the prior.


Page 14: Speech and Language Processing

The noisy channel model

Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)


Page 15: Speech and Language Processing

Speech Architecture meets Noisy Channel


Page 16: Speech and Language Processing

Architecture: Five easy pieces (only 3-4 for today)

HMMs, Lexicons, and Pronunciation
Feature extraction
Acoustic modeling
Decoding
Language modeling (seen this already)


Page 17: Speech and Language Processing

Lexicon

A list of words
Each one with a pronunciation in terms of phones
We get these from an on-line pronunciation dictionary
CMU dictionary: 127K words
http://www.speech.cs.cmu.edu/cgi-bin/cmudict

We'll represent the lexicon as an HMM
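A minimal sketch of such a lexicon as a data structure, with word-to-phone mappings in the style of the CMU dictionary (the entries here are illustrative, not copied from it):

```python
# Toy pronunciation lexicon: words mapped to ARPAbet-style phone sequences.
# Entries are illustrative examples in the CMU-dictionary style.
lexicon = {
    "six":  ["s", "ih", "k", "s"],
    "five": ["f", "ay", "v"],
    "nine": ["n", "ay", "n"],
}

def phones_for(word):
    """Look up the phone sequence for a word; out-of-vocabulary words return None."""
    return lexicon.get(word.lower())

print(phones_for("five"))  # -> ['f', 'ay', 'v']
```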


Page 18: Speech and Language Processing

HMMs for speech: the word “six”


Page 19: Speech and Language Processing

Phones are not homogeneous!

[Figure: spectrogram slice (0.48 s to 0.94 s, 0 to 5000 Hz) of the phones [ay] and [k], showing that the spectral character changes within each phone]


Page 20: Speech and Language Processing

Each phone has 3 subphones


Page 21: Speech and Language Processing

Resulting HMM word model for “six” with its subphones
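The expansion from a word's phone string to its subphone HMM states can be sketched directly: each phone becomes a beginning, middle, and end subphone state. The phone string for "six" follows the lexicon; the `_b`/`_m`/`_e` state-name suffixes are an illustrative convention.

```python
# Expand a phone sequence into a 3-subphone-per-phone HMM state sequence,
# as in the word model for "six". State naming (_b, _m, _e) is illustrative.
def subphone_states(phones):
    states = []
    for p in phones:
        states += [f"{p}_b", f"{p}_m", f"{p}_e"]  # beginning, middle, end subphones
    return states

print(subphone_states(["s", "ih", "k", "s"]))  # 4 phones -> 12 HMM states
```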


Page 22: Speech and Language Processing

HMM for the digit recognition task


Page 23: Speech and Language Processing

Detecting Phones

Two stages:
Feature extraction (basically a slice of a spectrogram)
Building a phone classifier (using a GMM classifier)


Page 24: Speech and Language Processing

MFCC: Mel-Frequency Cepstral Coefficients


Page 25: Speech and Language Processing

MFCC process: windowing


Page 26: Speech and Language Processing

MFCC process: windowing


Page 27: Speech and Language Processing

Applying a Hamming window to the signal, and then computing the spectrum
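The framing-and-windowing step can be sketched in pure Python. The frame and hop sizes (25 ms frames every 10 ms at 16 kHz) are typical illustrative values, not mandated by the slides; a real front end would then take the FFT of each windowed frame to get the spectrum.

```python
import math

# Sketch of the MFCC windowing step: slice the signal into overlapping
# frames and apply a Hamming window to each frame.
def frames(signal, frame_len, hop):
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    # w[k] = 0.54 - 0.46 * cos(2*pi*k / (n-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def windowed_frames(signal, frame_len, hop):
    w = hamming(frame_len)
    return [[s * wk for s, wk in zip(frame, w)] for frame in frames(signal, frame_len, hop)]

# 16 kHz audio: 25 ms frame = 400 samples, 10 ms hop = 160 samples
sig = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
print(len(windowed_frames(sig, 400, 160)))  # -> 8 frames for 0.1 s of audio
```

The window tapers each frame's edges toward zero, which reduces the spectral artifacts a rectangular cutoff would introduce.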


Page 28: Speech and Language Processing

Final Feature Vector

39 features per 10 ms frame:
12 MFCC features
12 Delta MFCC features
12 Delta-Delta MFCC features
1 (log) frame energy
1 Delta (log) frame energy
1 Delta-Delta (log) frame energy

So each frame is represented by a 39-dimensional vector.
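Assembling the 39-dimensional vector can be sketched as follows: 13 base features (12 MFCCs plus log energy) are extended with their deltas and delta-deltas. The simple central difference used for the deltas is one common illustrative choice (real systems often use a wider regression window).

```python
# Build 39-dim frames from 13-dim base frames (12 MFCCs + log energy):
# base features + deltas + delta-deltas.
def deltas(feats):
    out = []
    for t in range(len(feats)):
        prev = feats[max(t - 1, 0)]            # clamp at the edges
        nxt = feats[min(t + 1, len(feats) - 1)]
        out.append([(b - a) / 2 for a, b in zip(prev, nxt)])  # central difference
    return out

def full_features(base):
    d = deltas(base)
    dd = deltas(d)
    return [b + x + y for b, x, y in zip(base, d, dd)]

base = [[float(t + i) for i in range(13)] for t in range(5)]  # fake 13-dim frames
print(len(full_features(base)[0]))  # -> 39
```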


Page 29: Speech and Language Processing

Acoustic Modeling (= Phone detection)

Given a 39-dimensional vector corresponding to the observation of one frame o_i,
and given a phone q we want to detect,
compute p(o_i|q).
Most popular method: GMM (Gaussian mixture models)
Other methods: neural nets, CRFs, SVMs, etc.

Page 30: Speech and Language Processing

Gaussian Mixture Models

Also called “fully-continuous HMMs” P(o|q) computed by a Gaussian:

  p(o|q) = (1 / (σ√(2π))) exp(-(o - μ)² / (2σ²))
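This density translates directly into code. A minimal sketch, with illustrative (untrained) parameter values:

```python
import math

# Univariate Gaussian likelihood p(o|q) for a phone q with mean mu and
# variance var, exactly the formula above.
def gaussian(o, mu, var):
    return (1.0 / math.sqrt(2 * math.pi * var)) * math.exp(-(o - mu) ** 2 / (2 * var))

# Highest at the mean, falling off with distance:
print(gaussian(0.0, 0.0, 1.0))  # ~0.3989
print(gaussian(3.0, 0.0, 1.0))  # much smaller
```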


Page 31: Speech and Language Processing

Gaussians for Acoustic Modeling

A Gaussian is parameterized by a mean and a variance.

[Figure: plot of P(o|q) as a function of o. P(o|q) is highest at the mean and low far from the mean; different means shift the curve.]


Page 32: Speech and Language Processing

Training Gaussians

A (single) Gaussian is characterized by a mean and a variance

Imagine that we had some training data in which each phone was labeled

And imagine that we were just computing 1 single spectral value (real valued number) as our acoustic observation

We could just compute the mean and variance from the data:

  μ_i = (1/T) Σ_{t=1..T} o_t            s.t. o_t is phone i

  σ_i² = (1/T) Σ_{t=1..T} (o_t - μ_i)²   s.t. o_t is phone i
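These two estimates are just the sample mean and variance of the frames labeled with the phone. A minimal sketch over made-up 1-D labeled data:

```python
# Train one Gaussian per phone from labeled observations: the mean and
# variance of all frames labeled with that phone. Data is invented.
def train_gaussian(observations, labels, phone):
    obs = [o for o, l in zip(observations, labels) if l == phone]
    mu = sum(obs) / len(obs)
    var = sum((o - mu) ** 2 for o in obs) / len(obs)
    return mu, var

obs    = [1.0, 1.2, 0.8, 5.0, 5.2]
labels = ["ay", "ay", "ay", "k", "k"]
print(train_gaussian(obs, labels, "ay"))  # mean 1.0, small variance
```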


Page 33: Speech and Language Processing

But we need 39 gaussians, not 1!

The observation o is really a vector of length 39

So need a vector of Gaussians:

  p(o⃗|q) = [ 1 / ((2π)^{D/2} √(Π_{d=1..D} σ²[d])) ] exp( -(1/2) Σ_{d=1..D} (o[d] - μ[d])² / σ²[d] )
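In practice this diagonal-covariance density is computed in log space, since a product of 39 small terms underflows. A minimal sketch with short illustrative vectors:

```python
import math

# Log-likelihood of a D-dimensional observation under a diagonal-covariance
# Gaussian: sum of per-dimension univariate Gaussian log-densities.
def diag_gaussian_loglik(o, mu, var):
    ll = 0.0
    for od, md, vd in zip(o, mu, var):
        ll += -0.5 * math.log(2 * math.pi * vd) - (od - md) ** 2 / (2 * vd)
    return ll

o   = [0.1, -0.2, 0.0]   # D = 3 here; a real system would use D = 39
mu  = [0.0,  0.0, 0.0]
var = [1.0,  1.0, 1.0]
print(diag_gaussian_loglik(o, mu, var))
```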


Page 34: Speech and Language Processing

Actually, mixture of gaussians

Each phone is modeled by a sum of different gaussians

Hence able to model complex facts about the data

Phone A

Phone B


Page 35: Speech and Language Processing

Gaussians acoustic modeling

Summary: each phone is represented by a GMM parameterized by:
M mixture weights
M mean vectors
M covariance matrices

Usually we assume the covariance matrix is diagonal, i.e. we just keep a separate variance for each cepstral feature.
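The mixture itself is just a weighted sum of Gaussians. A 1-D sketch with invented parameters (real acoustic models use 39-D diagonal Gaussians), showing how two components can cover a bimodal phone distribution:

```python
import math

# GMM likelihood: a weighted sum of M Gaussians, 1-D for clarity.
def gaussian(o, mu, var):
    return (1.0 / math.sqrt(2 * math.pi * var)) * math.exp(-(o - mu) ** 2 / (2 * var))

def gmm_likelihood(o, weights, means, variances):
    return sum(w * gaussian(o, m, v)
               for w, m, v in zip(weights, means, variances))

# Two components centered at 0.0 and 4.0:
weights, means, variances = [0.6, 0.4], [0.0, 4.0], [1.0, 1.0]
print(gmm_likelihood(0.0, weights, means, variances))  # high near a mode
print(gmm_likelihood(2.0, weights, means, variances))  # low between the modes
```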


Page 36: Speech and Language Processing

Where we are

Given: a wave file
Goal: output a string of words
What we know: the acoustic model
  How to turn the wave file into a sequence of acoustic feature vectors, one every 10 ms
  If we had a complete phonetic labeling of the training set, we know how to train a Gaussian "phone detector" for each phone
  We also know how to represent each word as a sequence of phones
What we knew from Chapter 4: the language model
Next time:
  Seeing all this back in the context of HMMs
  Search: how to combine the language model and the acoustic model to produce a sequence of words


Page 37: Speech and Language Processing

Decoding

In principle:

In practice:

Page 38: Speech and Language Processing

Why is ASR decoding hard?

Page 39: Speech and Language Processing

HMMs for speech

Page 40: Speech and Language Processing

HMM for digit recognition task

Page 41: Speech and Language Processing

The Evaluation (forward) problem for speech

The observation sequence O is a series of MFCC vectors

The hidden states W are the phones and words

For a given phone/word string W, our job is to evaluate P(O|W)

Intuition: how likely is the input to have been generated by just that word string W?

Page 42: Speech and Language Processing

Evaluation for speech: Summing over all different paths!

f ay ay ay ay v v v v
f f ay ay ay ay v v v
f f f f ay ay ay ay v
f f ay ay ay ay ay ay v
f f ay ay ay ay ay ay ay ay v
f f ay v v v v v v v
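The forward algorithm computes this sum over all paths efficiently. A minimal sketch for a tiny left-to-right HMM for "five" (states f, ay, v); the transition and emission probabilities are invented for illustration, and real systems would use subphone states and log probabilities:

```python
# Forward algorithm on a toy left-to-right HMM for "five".
states = ["f", "ay", "v"]
trans = {  # P(next | current): left-to-right with self-loops
    "f":  {"f": 0.5, "ay": 0.5},
    "ay": {"ay": 0.7, "v": 0.3},
    "v":  {"v": 1.0},
}

def forward(observations, emit):
    # emit[s][o] = P(o | s); start in "f" with probability 1
    alpha = {s: (emit[s][observations[0]] if s == "f" else 0.0) for s in states}
    for o in observations[1:]:
        # each new alpha[s] sums over ALL predecessor states, i.e. all paths
        alpha = {s: emit[s][o] * sum(alpha[p] * trans[p].get(s, 0.0) for p in states)
                 for s in states}
    return alpha["v"]  # must end in the final state "v"

# Toy emissions: each state strongly prefers "its own" symbol
emit = {s: {o: (0.8 if s == o else 0.1) for o in states} for s in states}
print(forward(["f", "ay", "ay", "v"], emit))
```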

Page 43: Speech and Language Processing

The forward lattice for “five”

Page 44: Speech and Language Processing

The forward trellis for “five”

Page 45: Speech and Language Processing

Viterbi trellis for “five”

Page 46: Speech and Language Processing

Viterbi trellis for “five”

Page 47: Speech and Language Processing

Search space with bigrams

Page 48: Speech and Language Processing

Viterbi trellis

Page 49: Speech and Language Processing

Viterbi backtrace

Page 50: Speech and Language Processing

Training

Page 51: Speech and Language Processing

Evaluation

How to evaluate the word string output by a speech recognizer?

Page 52: Speech and Language Processing

Word Error Rate

Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / Total Words in Correct Transcript

Alignment example:
REF: portable ****  PHONE  UPSTAIRS  last night so
HYP: portable FORM  OF     STORES    last night so
Eval:         I     S      S
WER = 100 × (1 + 2 + 0) / 6 = 50%
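WER is a minimum edit distance over words. A minimal sketch (substitutions, insertions, and deletions all cost 1), checked against the example above:

```python
# WER via word-level minimum edit distance (dynamic programming).
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # delete everything
    for j in range(len(h) + 1):
        d[0][j] = j          # insert everything
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))  # -> 50.0
```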

Page 53: Speech and Language Processing

NIST sctk-1.3 scoring software: computing WER with sclite

http://www.nist.gov/speech/tools/

sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human-transcribed)

id: (2347-b-013)

Scores: (#C #S #D #I) 9 3 1 2

REF: was an engineer SO I i was always with **** **** MEN UM and they

HYP: was an engineer ** AND i was always with THEM THEY ALL THAT and they

Eval: D S I I S S

Page 54: Speech and Language Processing

Sclite output for error analysis

CONFUSION PAIRS  Total (972)  With >= 1 occurrences (972)

1:  6 -> (%hesitation) ==> on
2:  6 -> the ==> that
3:  5 -> but ==> that
4:  4 -> a ==> the
5:  4 -> four ==> for
6:  4 -> in ==> and
7:  4 -> there ==> that
8:  3 -> (%hesitation) ==> and
9:  3 -> (%hesitation) ==> the
10: 3 -> (a-) ==> i
11: 3 -> and ==> i
12: 3 -> and ==> in
13: 3 -> are ==> there
14: 3 -> as ==> is
15: 3 -> have ==> that
16: 3 -> is ==> this

Page 55: Speech and Language Processing

Sclite output for error analysis

17: 3 -> it ==> that
18: 3 -> mouse ==> most
19: 3 -> was ==> is
20: 3 -> was ==> this
21: 3 -> you ==> we
22: 2 -> (%hesitation) ==> it
23: 2 -> (%hesitation) ==> that
24: 2 -> (%hesitation) ==> to
25: 2 -> (%hesitation) ==> yeah
26: 2 -> a ==> all
27: 2 -> a ==> know
28: 2 -> a ==> you
29: 2 -> along ==> well
30: 2 -> and ==> it
31: 2 -> and ==> we
32: 2 -> and ==> you
33: 2 -> are ==> i
34: 2 -> are ==> were

Page 56: Speech and Language Processing

Better metrics than WER?

WER has been useful But should we be more concerned with

meaning (“semantic error rate”)? Good idea, but hard to agree on Has been applied in dialogue systems, where

desired semantic output is more clear

Page 57: Speech and Language Processing

Summary: ASR Architecture

Five easy pieces: ASR Noisy Channel architecture
1) Feature Extraction: 39 "MFCC" features
2) Acoustic Model: Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model: HMM (what phones can follow each other)
4) Language Model: N-grams for computing p(wi|wi-1)
5) Decoder: Viterbi algorithm, dynamic programming for combining all these to get a word sequence from speech!

Page 58: Speech and Language Processing

Accents: An experiment

A word by itself

The word in context


Page 59: Speech and Language Processing

Summary: ASR Architecture

The Noisy Channel Model
Phonetics Background
Five easy pieces of an ASR system:
1) Lexicon/Pronunciation Model (Introduction to HMMs again)
2) Feature Extraction
3) Acoustic Model
4) Language Model
5) Decoder
Evaluation
