
Page 1: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Lirong Xia

Speech recognition, machine learning

Friday, April 4, 2014

Page 2: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Particle Filtering

2

• Sometimes |X| is too big to use exact inference
  – |X| may be too big to even store B(X)
  – E.g. X is continuous
  – |X|^2 may be too big to do updates

• Solution: approximate inference
  – Track samples of X, not all values
  – Samples are called particles
  – Time per step is linear in the number of samples
  – But: number needed may be large
  – In memory: list of particles

• This is how robot localization works in practice

Page 3: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Forward algorithm vs. particle filtering

3

Forward algorithm:
• Elapse of time: B'(Xt) = Σxt-1 p(Xt|xt-1) B(xt-1)
• Observe: B(Xt) ∝ p(et|Xt) B'(Xt)
• Renormalize: so that B(Xt) sums to 1

Particle filtering:
• Elapse of time: x ---> x', sampled from the transition model
• Observe: weight each particle by w(x') = p(et|x')
• Resample: resample N particles in proportion to their weights
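A minimal sketch of one particle-filtering update in Python, under the assumption that the model is given as a transition sampler and an emission probability function (those two function arguments are illustrative names, not the slides' notation):

```python
import random

def particle_filter_step(particles, evidence, transition_sample, emission_prob):
    """One elapse / observe / resample update over a list of particles."""
    # Elapse of time: move each particle x ---> x' by sampling the transition model
    moved = [transition_sample(x) for x in particles]

    # Observe: weight each particle by the likelihood of the evidence, w(x') = p(e|x')
    weights = [emission_prob(evidence, x) for x in moved]

    # Resample: draw N particles in proportion to their weights
    return random.choices(moved, weights=weights, k=len(moved))
```

Tracking B(X) then amounts to calling this once per time step and reading the belief off as the empirical distribution of the particles.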

Page 4: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Dynamic Bayes Nets (DBNs)

4

• We want to track multiple variables over time, using multiple sources of evidence

• Idea: repeat a fixed Bayes net structure at each time
• Variables from time t can condition on those from t-1

• DBNs with evidence at leaves are HMMs

Page 5: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

HMMs: MLE Queries

5

• HMMs are defined by:
  – States X
  – Observations E
  – Initial distribution: p(X1)
  – Transitions: p(Xt|Xt-1)
  – Emissions: p(Et|Xt)

• Query: most likely explanation:

arg max_{x1:t} p(x1:t | e1:t)

Page 6: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Viterbi Algorithm

6
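The slide's contents (the trellis picture and the recurrence) are not reproduced in this transcript; as a rough illustration, here is a minimal dictionary-based Viterbi sketch in Python for the most-likely-explanation query on the previous slide (the tables-as-dicts representation is an assumption of the example):

```python
def viterbi(states, init_p, trans_p, emit_p, observations):
    """Most likely state sequence x1:T for evidence e1:T (max-product forward pass)."""
    # best[t][x]: probability of the best path ending in state x after t+1 observations
    best = [{x: init_p[x] * emit_p[x][observations[0]] for x in states}]
    back = [{}]  # backpointers for path recovery

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for x in states:
            # maximize over the previous state instead of summing (as the forward algorithm would)
            prev, score = max(
                ((xp, best[t - 1][xp] * trans_p[xp][x]) for xp in states),
                key=lambda pair: pair[1])
            best[t][x] = score * emit_p[x][observations[t]]
            back[t][x] = prev

    # Recover the argmax path by following backpointers from the best final state
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```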

Page 7: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Today

7

• Speech recognition
  – A massive HMM!
  – Details of this section not required
  – CSCI 4962: Natural language processing by Prof. Heng Ji

• Start to learn machine learning
  – CSCI 4100: Machine learning by Prof. Malik Magdon-Ismail

Page 8: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Speech and Language

8

• Speech technologies
  – Automatic speech recognition (ASR)
  – Text-to-speech synthesis (TTS)
  – Dialog systems

• Language processing technologies
  – Machine translation
  – Information extraction
  – Web search, question answering
  – Text classification, spam filtering, etc.

Page 9: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Digitizing Speech

9

Page 10: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

The Input

10

• Speech input is an acoustic wave form

Graphs from Simon Arnfield's web tutorial on speech, Sheffield:

http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/

Page 11: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

11

Page 12: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

The Input

12

• Frequency gives pitch; amplitude gives volume
  – Sampling at ~8 kHz (phone), ~16 kHz (microphone)

• Fourier transform of the wave is displayed as a spectrogram
  – Darkness indicates energy at each frequency
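A minimal NumPy sketch of turning a waveform into the spectrogram described above; the 25 ms window and 10 ms hop are common illustrative choices, not values from the slides:

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Short-time Fourier magnitudes: one column of frequency energies per time slice."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per analysis window
    hop = int(sample_rate * hop_ms / 1000)           # samples between window starts
    window = np.hanning(frame_len)

    frames = [signal[start:start + frame_len] * window
              for start in range(0, len(signal) - frame_len + 1, hop)]
    # One FFT per windowed frame; the magnitude gives the energy at each frequency
    return np.abs(np.fft.rfft(np.array(frames), axis=1))
```

Plotting the log of these magnitudes against time gives the picture in which darkness indicates energy at each frequency.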

Page 13: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Acoustic Feature Sequence

13

• Time slices are translated into acoustic feature vectors (~39 real numbers per slice)

• These are the observations; now we need the hidden states X

Page 14: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

State Space

14

• p(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)

• p(X|X') encodes how sounds can be strung together
• We will have one state for each sound in each word
• From some state x, we can only:
  – Stay in the same state (e.g. speaking slowly)
  – Move to the next position in the word
  – At the end of the word, move to the start of the next word

• We build a little state graph for each word and chain them together to form our state space X
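A minimal sketch in Python of the per-word state graph just described; the phone list and the stay/advance probabilities are made-up placeholders, not values from the lecture:

```python
def word_transitions(phones, p_stay=0.6):
    """Transition probabilities for one word: each sound either repeats or advances."""
    trans = {}
    for i, phone in enumerate(phones):
        state = (phone, i)
        if i + 1 < len(phones):
            trans[state] = {state: p_stay, (phones[i + 1], i + 1): 1.0 - p_stay}
        else:
            # Last sound of the word: "END" is where the next word's graph gets chained on
            trans[state] = {state: p_stay, "END": 1.0 - p_stay}
    return trans

# e.g. a toy three-sound pronunciation of "yes"
yes_graph = word_transitions(["y", "eh", "s"])
```

Chaining the "END" of one word's graph to the first state of the next word (weighted, for example, by the bigram transitions on the next slides) yields the full state space X.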

Page 15: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

HMMs for Speech

15

Page 16: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Transitions with Bigrams

16

Page 17: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Decoding

17

• While there are some practical issues, finding the words given the acoustics is an HMM inference problem

• We want to know which state sequence x1:T is most likely given the evidence e1:T:

x*1:T = arg max_{x1:T} p(x1:T | e1:T) = arg max_{x1:T} p(x1:T, e1:T)

• From the sequence x, we can simply read off the words

Page 18: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Machine Learning

18

• Up until now: how to reason in a model and how to make optimal decisions

• Machine learning: how to acquire a model on the basis of data / experience
  – Learning parameters (e.g. probabilities)
  – Learning structure (e.g. BN graphs)
  – Learning hidden concepts (e.g. clustering)

Page 19: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Parameter Estimation

19

• Estimating the distribution of a random variable
• Elicitation: ask a human (why is this hard?)
• Empirically: use training data (learning!)

– E.g.: for each outcome x, look at the empirical rate of that value:

– This is the estimate that maximizes the likelihood of the data

p_ML(x) = count(x) / total samples

L(x, θ) = ∏i pθ(xi)

e.g. p_ML(r) = 1/3
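A minimal sketch of this empirical (maximum-likelihood) estimate in Python; the red/blue sample list is a made-up example:

```python
from collections import Counter

def mle_estimate(samples):
    """Relative frequencies: p_ML(x) = count(x) / total samples."""
    counts = Counter(samples)
    return {x: c / len(samples) for x, c in counts.items()}

# e.g. three draws from a bag of red/blue balls
print(mle_estimate(["r", "b", "b"]))   # {'r': 0.33..., 'b': 0.66...}
```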

Page 20: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Estimation: Smoothing

20

• Relative frequencies are the maximum likelihood estimates (MLEs)

• In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution

p_ML(x) = count(x) / total samples

θ_ML = arg max_θ p(X|θ) = arg max_θ ∏i pθ(Xi)

θ_MAP = arg max_θ p(θ|X) = arg max_θ p(X|θ) p(θ) / p(X) = arg max_θ p(X|θ) p(θ)

Page 21: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Estimation: Laplace Smoothing

21

• Laplace's estimate:
  – Pretend you saw every outcome once more than you actually did
  – Can derive this as a MAP estimate with Dirichlet priors

p_LAP(x) = (c(x) + 1) / (N + |X|)

Page 22: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Estimation: Laplace Smoothing

22

• Laplace's estimate (extended):
  – Pretend you saw every outcome k extra times
  – What's Laplace with k=0?
  – k is the strength of the prior

p_LAP,k(x) = (c(x) + k) / (N + k|X|)

Example comparisons on the slide: p_LAP,0(X), p_LAP,1(X), p_LAP,100(X)

• Laplace for conditionals:
  – Smooth each condition independently:

p_LAP,k(x|y) = (c(x,y) + k) / (c(y) + k|X|)
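A minimal sketch of the Laplace-smoothed estimate in Python, following the p_LAP,k formula above; the sample data in the usage line is a made-up example:

```python
from collections import Counter

def laplace_estimate(samples, outcomes, k=1):
    """p_LAP,k(x) = (c(x) + k) / (N + k|X|): every outcome gets k imaginary extra counts."""
    counts = Counter(samples)
    total = len(samples) + k * len(outcomes)
    return {x: (counts[x] + k) / total for x in outcomes}

# e.g. with k=1, an outcome never seen in the data still gets nonzero probability
print(laplace_estimate(["r", "b", "b"], outcomes=["r", "b", "g"], k=1))
```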

Page 23: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Example: Spam Filter

23

• Input: email
• Output: spam/ham
• Setup:

– Get a large collection of example emails, each labeled “spam” or “ham”

– Note: someone has to hand label all this data!

– Want to learn to predict labels of new, future emails

• Features: the attributes used to make the ham / spam decision
  – Words: FREE!
  – Text patterns: $dd, CAPS
  – Non-text: senderInContacts
  – …

Page 24: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Example: Digit Recognition

24

• Input: images / pixel grids
• Output: a digit 0-9
• Setup:

– Get a large collection of example images, each labeled with a digit

– Note: someone has to hand label all this data!

– Want to learn to predict labels of new, future digit images

• Features: the attributes used to make the digit decision
  – Pixels: (6,8) = ON
  – Shape patterns: NumComponents, AspectRatio, NumLoops
  – …

Page 25: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

A Digit Recognizer

25

• Input: pixel grids

• Output: a digit 0-9

Page 26: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Naive Bayes for Digits

26

• Simple version:
  – One feature Fij for each grid position <i,j>

– Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image

– Each input maps to a feature vector, e.g.

– Here: lots of features, each is binary valued

• Naive Bayes model:

• What do we need to learn?

Page 27: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

General Naive Bayes

27

• A general naive Bayes model: p(Y, F1, …, Fn) = p(Y) ∏i p(Fi|Y)

• We only specify how each feature depends on the class

• Total number of parameters is linear in n

Page 28: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Inference for Naive Bayes

28

• Goal: compute posterior over causes
  – Step 1: get joint probability of causes and evidence
  – Step 2: get probability of evidence
  – Step 3: renormalize

Page 29: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

General Naive Bayes

29

• What do we need in order to use naive Bayes?

– Inference (you know this part)
  • Start with a bunch of conditionals, p(Y) and the p(Fi|Y) tables

• Use standard inference to compute p(Y|F1…Fn)

• Nothing new here

– Estimates of local conditional probability tables
  • p(Y), the prior over labels

• p(Fi|Y) for each feature (evidence variable)

• These probabilities are collectively called the parameters of the model and denoted by θ

• Up until now, we assumed these appeared by magic, but…

• … they typically come from training data: we’ll look at this now
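As a rough illustration of where these parameters come from, here is a minimal Python sketch that estimates p(Y) and p(Fi|Y) from labeled examples by Laplace-smoothed counting; the (features dict, label) example format is an assumption, not the lecture's representation:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, k=1):
    """Estimate p(Y) and p(Fi|Y) by Laplace-smoothed counting over labeled examples.

    Each example is (features, label), where features is a dict {feature_name: value}.
    """
    label_counts = Counter(label for _, label in examples)
    prior = {y: label_counts[y] / len(examples) for y in label_counts}

    # counts[name][(value, label)] = how often feature `name` took `value` under `label`
    counts = defaultdict(Counter)
    values = defaultdict(set)
    for features, label in examples:
        for name, value in features.items():
            counts[name][(value, label)] += 1
            values[name].add(value)

    def cond_prob(name, value, label):
        # Laplace-smoothed p(Fname = value | Y = label), as in p_LAP,k above
        return (counts[name][(value, label)] + k) / (label_counts[label] + k * len(values[name]))

    return prior, cond_prob
```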

Page 30: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Examples: CPTs

30

Y    p(Y)   p(F3,1=on|Y)   p(F5,5=on|Y)
1    0.1    0.05           0.01
2    0.1    0.01           0.05
3    0.1    0.90           0.05
4    0.1    0.80           0.30
5    0.1    0.90           0.80
6    0.1    0.90           0.90
7    0.1    0.25           0.05
8    0.1    0.85           0.60
9    0.1    0.60           0.50
0    0.1    0.80           0.80
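A minimal sketch in Python of the three inference steps from the "Inference for Naive Bayes" slide, using the CPT values above; treating an "off" pixel as probability 1 minus the "on" entry is an assumption of the example:

```python
# CPTs transcribed from the table above
prior = {d: 0.1 for d in range(10)}
p_f31_on = {1: 0.05, 2: 0.01, 3: 0.90, 4: 0.80, 5: 0.90,
            6: 0.90, 7: 0.25, 8: 0.85, 9: 0.60, 0: 0.80}
p_f55_on = {1: 0.01, 2: 0.05, 3: 0.05, 4: 0.30, 5: 0.80,
            6: 0.90, 7: 0.05, 8: 0.60, 9: 0.50, 0: 0.80}

# Evidence: pixel (3,1) is on, pixel (5,5) is off
joint = {d: prior[d] * p_f31_on[d] * (1 - p_f55_on[d]) for d in range(10)}  # step 1: joint
evidence = sum(joint.values())                                              # step 2: p(e)
posterior = {d: joint[d] / evidence for d in joint}                         # step 3: renormalize

print(max(posterior, key=posterior.get))  # most likely digit given just these two pixels
```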

Page 31: Lirong Xia Speech recognition, machine learning Friday, April 4, 2014

Important Concepts

31

• Data: labeled instances, e.g. emails marked spam/ham
  – Training set
  – Held-out set
  – Test set

• Features: attribute-value pairs which characterize each x

• Experimentation cycle
  – Learn parameters (e.g. model probabilities) on the training set
  – (Tune hyperparameters on the held-out set)
  – Compute accuracy on the test set
  – Very important: never “peek” at the test set!

• Evaluation
  – Accuracy: fraction of instances predicted correctly

• Overfitting and generalization
  – Want a classifier which does well on test data
  – Overfitting: fitting the training data very closely, but not generalizing well
  – We’ll investigate overfitting and generalization formally in a few lectures
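A minimal sketch of this experimentation cycle in Python; the classifier interface, the 60/20/20 split, and the hyperparameter list are assumptions for the example, not prescriptions from the lecture:

```python
def accuracy(classifier, examples):
    """Fraction of instances predicted correctly."""
    correct = sum(1 for features, label in examples if classifier.predict(features) == label)
    return correct / len(examples)

def experiment(data, make_classifier, hyperparams):
    """Learn on the training set, tune on the held-out set, report test accuracy once."""
    n = len(data)
    train, held_out, test = data[:int(0.6 * n)], data[int(0.6 * n):int(0.8 * n)], data[int(0.8 * n):]

    # Tune: pick the hyperparameter (e.g. Laplace k) that does best on held-out data
    best = max(hyperparams, key=lambda h: accuracy(make_classifier(train, h), held_out))

    # Evaluate: only now touch the test set; never "peek" at it while tuning
    return accuracy(make_classifier(train, best), test)
```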