Speech Processing Overview & Supporting Concepts
Professor Marie Roch
Readings are listed in the schedule.
What is speech processing?
• Specialized branch of machine learning
• Interdisciplinary
  – signal processing
  – statistics
  – linear algebra
  – linguistics
  – optimization
  – perception & production

[Figure: surface plot of a joint pdf P(f1, f2) over features f1 and f2]
Basic ideas
• Acquire an audio signal
• Extract features
• Classify features to labels depending on goals, e.g.
  – speech recognition: speech to text
  – speaker recognition: identify a person
  – command and control: recognize directives or keywords

Rosie the robot, The Jetsons © Hanna-Barbera
Probability
• A measure of uncertainty.
• Sources of uncertainty:
  – Process is stochastic (nondeterministic).
  – We cannot observe all aspects of the process.
  – Incomplete modeling of the process.
Probability
Two ways to think about it:
• frequentist – related to the proportion of events occurring
• Bayesian – related to degree of belief

Both types of probability are treated using the same rules.
Random variables
• Variables that can be bound to values non-deterministically.
• Common to use notation to distinguish between*
  – a random variable X (capitalized)
  – and an observed value x (lower case, e.g. x=5)

* Goodfellow et al. use bold in place of capitalization.
Discrete random variables
• Values from a discrete set of labels:

$$X \in \{\text{red, orange, yellow, blue, indigo, violet}\}, \quad Y \in \{0, 1, 2\}$$

• Probability density (mass) function (pdf/pmf) is the probability of something occurring, P(X=x)*:

$$P(X=\text{red}) = 0.8, \quad P(Y=2) = \frac{1}{3}$$

• $$\forall x \in \mathrm{domain}(X),\; 0 \le P(X=x) \le 1 \quad \text{and} \quad \sum_{x \in \mathrm{domain}(X)} P(X=x) = 1$$

* Goodfellow et al. use PMF for discrete distributions. P(X=x) is commonly abbreviated to P(x).
Distribution
• Set of probabilities associated with the values of a random variable.
• Uniform distribution – special distribution where all probabilities are equal.
Example
• Let X be a die roll
  – P(X=3) denotes the probability of rolling a 3
  – Fair die: P(X=3) = 1/6
• If we don’t know if the die is fair, how would we approximate P(X=x)?
• Estimates of probability are denoted with a hat: $\hat{P}(X=x)$
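The counting approach hinted at above can be sketched quickly. A minimal example, assuming Python with NumPy; the simulated fair die and the number of rolls are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)  # simulate 10,000 rolls of a fair die

# Estimate P-hat(X=x) by counting occurrences of each face and
# dividing by the total number of rolls
p_hat = {x: np.mean(rolls == x) for x in range(1, 7)}
```

The estimates sum to 1 by construction and approach 1/6 per face as the number of rolls grows.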
Continuous random variables
• Values from a continuous domain
• Satisfies the following:

$$\forall x \in \mathrm{domain}(X),\; 0 \le P(X=x)$$

$$P(a \le X \le b) = \int_a^b P(X=x)\,dx$$

$$\int_x P(X=x)\,dx = 1$$

• Worth noting:
  – no max on P(X=x)
  – What is $P(a \le X \le a) = \int_a^a P(X=x)\,dx$?
Continuous uniform distribution
Notation: ~ means “has a distribution of.” U(a,b) is a uniform distribution between values a and b.

$$X \sim U(a,b) \text{ implies } P(X=x) = \begin{cases} 0 & x < a \text{ or } x > b \\ \dfrac{1}{b-a} & a \le x \le b \end{cases}$$
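The piecewise definition above translates directly to code. A small NumPy sketch (the interval U(2,6) is an arbitrary example, not from the slides):

```python
import numpy as np

def uniform_pdf(x, a, b):
    """P(X=x) for X ~ U(a, b): 1/(b-a) inside [a, b], 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

# X ~ U(2, 6): density is 1/(6-2) = 0.25 inside the interval, 0 outside
vals = uniform_pdf([1.0, 3.0, 6.0, 7.0], a=2, b=6)
```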
Expectation
Expected value of a function u of a random variable:

discrete: $$E[u(X)] = \sum_x u(x)\,P(x)$$

continuous: $$E[u(X)] = \int_x u(x)\,P(x)\,dx$$

When u(X) = X, we call this the mean, or average value: $\mu = E[X]$
Sample mean
• What if we don’t know P(X=x)?
• If we roll a die many times, we can estimate P(X=x) by counting the number of times each x occurs and dividing by the number of rolls, giving the estimate $\hat{\mu}$.
• To think about: How does this relate to our traditional idea of an average?
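The "to think about" question can be answered empirically: the expectation computed from estimated probabilities is exactly the ordinary average. A NumPy sketch (the die simulation is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=5000)

# Traditional average
mu_direct = rolls.mean()

# Same quantity via the expectation formula: sum over x of x * P-hat(X=x)
values, counts = np.unique(rolls, return_counts=True)
p_hat = counts / len(rolls)
mu_expect = np.sum(values * p_hat)
```

The two computations agree because P-hat(X=x) = count(x)/N makes the expectation collapse to the sample mean.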
Variance
• Answers the question: What is the expected squared deviation from the mean?
• Can be defined with the expectation operator:

$$\sigma^2 = E[(X-\mu)^2] = \sum_x (x-\mu)^2\,P(X=x)$$

• Sample variance (unbiased):

$$\hat{E}[(x-\mu)^2] = \frac{1}{N-1}\sum_x (x-\hat{\mu})^2$$
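The unbiased (divide by N-1) estimator above corresponds to `ddof=1` in NumPy. A quick sketch with illustrative synthetic data (true variance 4):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=2.0, size=10_000)  # true variance = 4

# Unbiased sample variance: divide by N-1
mu_hat = x.mean()
var_manual = np.sum((x - mu_hat) ** 2) / (len(x) - 1)
var_numpy = x.var(ddof=1)  # NumPy's equivalent computation
```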
Normal distribution
• Normal or Gaussian distributions are the so-called “bell curve” distributions
• Parameters: mean & variance: $X \sim N(\mu, \sigma^2)$

$$P(x\,|\,\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Normal distribution
• Mean controls the center, variance the breadth of the distribution.
• A normal distribution can be fit with the sample mean & variance.

  X ~ N(0,1), the standard normal
  X ~ N(2,1)
  X ~ N(0,0.04)
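Fitting a normal with the sample mean & variance can be sketched in a few lines of NumPy (the N(2,1) data below is a synthetic stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=20_000)  # draws from N(2, 1)

# Fit: sample mean and (unbiased) sample variance
mu_hat = data.mean()
var_hat = data.var(ddof=1)

def normal_pdf(x, mu, var):
    """Gaussian density P(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

peak = normal_pdf(mu_hat, mu_hat, var_hat)  # density is highest at the mean
```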
Application: Speech/noise segmentation
How can we determine if someone is talking?
1. Record: Acquire pressure signal
2. Frame: Break into pieces
3. Compute features: root mean square intensity
4. Train: Fit speech/noise distribution models
5. Classify: See which distribution matches new samples
Acquisition: pressure sensors
Basic ideas for microphones:
• Deformable membrane
  – pushed by compression
  – pulled by rarefaction
• Deformation is converted to voltage
• Samples: voltage measured N times/second and discretized
  – called the sample rate (Fs) and
  – measured in Hertz (Hz)
Pressure and Intensity

• Pressure
  – Amount of force per area
  – Typically measured in Pascals ($\mathrm{Pa} = \frac{N}{m^2}$) if the microphone is calibrated, otherwise in counts
• Intensity
  – Product of sound pressure and particle velocity per area
  – $\text{Intensity} \propto \text{Pressure}^2$
Framing
• Most audio signals vary over time
• Analyze small segments that are roughly static

[Figure: framed waveform of a chimpanzee pant hoot – courtesy Cat Hobaiter]
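Framing can be sketched as a strided indexing operation. A minimal NumPy version (the 16 kHz rate and 20 ms/10 ms frame parameters are illustrative choices, not from the slides):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split signal x into fixed-length frames advanced by hop samples.
    Trailing samples that do not fill a frame are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# e.g. 1 s of audio at Fs = 16000 Hz, 20 ms frames with a 10 ms advance
Fs = 16000
x = np.arange(Fs, dtype=float)
frames = frame_signal(x, frame_len=int(0.020 * Fs), hop=int(0.010 * Fs))
```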
decibel (dB) scale

• Human auditory sensitivity to pressure: 20 µPa – 200 Pa (a 10⁷ range!)
• The decibel scale allows us to compare the intensity between two sounds in a compressed range:
  – Intensity: $$10\log_{10}\frac{I}{I_0}$$
  – ~ with pressure: $$10\log_{10}\frac{P^2}{P_0^2} = 20\log_{10}\frac{P}{P_0}$$

$I_0$, $P_0$ are reference intensity/pressure.

The bel is named after Alexander Graham Bell: 10 decibels = 1 bel. (image credit: biography.com)
What value of P0?

calibrated terrestrial systems: P0 = 20 µPa (threshold of hearing), denoted dB re 20 µPa

uncalibrated systems: P0 = 1 (for convenience), denoted dB rel.

Rabiner/Juang 1993
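With P0 = 20 µPa, the two endpoints of the hearing range above map to 0 dB and 140 dB. A small sketch of the conversion (assuming Python with NumPy):

```python
import numpy as np

P0 = 20e-6  # reference pressure, 20 µPa (threshold of hearing)

def db_re_20upa(p_pascals):
    """Sound pressure level in dB re 20 µPa: 20*log10(P/P0)."""
    return 20 * np.log10(p_pascals / P0)

# The threshold of hearing itself maps to 0 dB;
# the top of the range (200 Pa) maps to 140 dB
quiet = db_re_20upa(20e-6)
loud = db_re_20upa(200.0)
```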
Computing intensity
• Computed for each frame
• Root-mean-square (RMS) pressure
  – squared
  – averaged
  – square root

$$p_{RMS} = \sqrt{\frac{1}{|\text{frame}|}\sum_{i \in \text{frame}} x_i^2}$$

• Pressure converted to intensity in dB:

$$20\log_{10}(p_{RMS}) = 10\log_{10}\frac{1}{|\text{frame}|}\sum_{i \in \text{frame}} x_i^2 \;\text{dB}$$
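The RMS-in-dB feature above can be computed per frame without the explicit square root, since 20·log10 of the root equals 10·log10 of the mean square. A NumPy sketch (the test sine wave is illustrative):

```python
import numpy as np

def rms_db(frames):
    """Relative intensity (dB rel., P0 = 1) of each frame.
    frames: 2-D array with one frame per row."""
    mean_square = np.mean(frames ** 2, axis=1)
    return 10 * np.log10(mean_square)  # == 20*log10(p_RMS)

# A full-scale sine wave has p_RMS = 1/sqrt(2), i.e. about -3.01 dB rel.
t = np.arange(1024)
sine = np.sin(2 * np.pi * 8 * t / 1024)  # 8 complete cycles
level = rms_db(sine[None, :])[0]
```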
Our first speech problem
• Given a recording, separate speech from constant background noise

[Figure: waveform; amplitude (counts) vs. time (s), 0–6 s]
Relative Intensity

[Figure: frame intensity, dB rel. vs. time (s); roughly -75 to -30 dB over 0–6 s]
Intensity histogram

[Figure: histogram of frame intensities; counts vs. dB rel., -75 to -30 dB]
So how can we characterize speech and silence?
[Figure: intensity histogram; counts vs. dB rel.]
Silence estimator
• Beginning of recording vs.
• Entire recording

[Figure: intensity histograms (counts vs. dB rel.) for the beginning of the recording and for the entire recording]
Normal fit of silence
$$P(X=x\,|\,sil) = \frac{1}{\sqrt{2\pi\sigma_{sil}^2}}\, e^{-\frac{(x-\mu_{sil})^2}{2\sigma_{sil}^2}}$$

[Figure: intensity histogram overlaid with the fitted normal density P(X|sil)]
Similar for speech 1.3 – 2.8 s
[Figure: intensity histogram overlaid with the fitted normal density P(X|speech), estimated from the 1.3–2.8 s region]
Intensity distribution with pdfs
[Figure: intensity histogram overlaid with both pdfs, P(X|sil) & P(X|speech)]

Where should we draw the boundary?
Conditional probability
The probability of something occurring can change when you know something…

P(X = "nice to meet you")
vs.
P(X = "nice to meet you" | meeting new person)
Formally, P(X|Y) is defined as:

$$P(X|Y) \triangleq \frac{P(X,Y)}{P(Y)}$$
A problem…
Our intensity pdfs showed

$P(X|speech)$ and $P(X|noise)$

but we would like to know

$P(speech|X)$ and $P(noise|X)$

which is known as the posterior distribution.

[Figure: intensity histogram with overlaid pdfs]
Bayes’ decision rule
The optimal classification rule is the one that maximizes the posterior probability:

$$\arg\max_{\omega \in \{speech,\, noise\}} P(\omega|X)$$

note: arg returns the argument (ω) rather than the value

Regrettably, we do not know $P(\omega|X)$, but we do know $P(X|\omega)$.

Reverend Thomas Bayes, 1702–1761
Bayes’ rule (of conditional probability)
$$P(A|B) = \frac{P(A,B)}{P(B)} \quad \triangleq \text{ conditional probability}$$

$P(A,B)$: probability that both A and B occur.

Note that swapping names A and B gives

$$P(B|A) = \frac{P(B,A)}{P(A)} \;\rightarrow\; P(B,A) = P(B|A)\,P(A)$$

and $P(A,B) = P(B,A)$

$$\therefore\; P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

Also known as Bayes’ Theorem/Bayes’ Law. Bayes’ Rule ≠ Bayes’ Decision Rule.
Bayes’ decision rule
Bayes’ rule to the rescue!

$$\arg\max_\omega P(\omega|X) = \arg\max_\omega \frac{P(X|\omega)\,P(\omega)}{P(X)}$$

where $P(\omega|X)$ is the posterior probability, $P(X|\omega)$ the class-conditional probability, $P(\omega)$ the prior probability, and $P(X)$ the observation probability.
almost there…
We know P(X|ω). We don’t need to know P(X) as

$$P(\omega|X) = \frac{P(X|\omega)\,P(\omega)}{P(X)}$$

$$\max\left(\frac{P(X|speech)\,P(speech)}{P(X)},\; \frac{P(X|noise)\,P(noise)}{P(X)}\right) = \max\left(P(X|speech)\,P(speech),\; P(X|noise)\,P(noise)\right)$$
Prior probability
• Probability of an observation without any prior information.
• When we don’t know this, we frequently assume a uniform (equally likely) distribution:

$$\max\left(P(X|speech)\,\tfrac{1}{2},\; P(X|noise)\,\tfrac{1}{2}\right)$$
Bayes decision rule revisited
Assuming a uniform prior, we can now make decisions about speech or noise:

$$\text{decision}(x) = \arg\max_{\omega \in \{speech,\, noise\}} \frac{1}{2}\,\frac{1}{\sqrt{2\pi\sigma_\omega^2}}\, e^{-\frac{(x-\mu_\omega)^2}{2\sigma_\omega^2}}$$

How can we solve for the boundary threshold?
Threshold for our minimal speech detector
Solve for x (2 roots):

$$\frac{1}{\sqrt{2\pi\sigma_{noise}^2}}\, e^{-\frac{(x-\mu_{noise})^2}{2\sigma_{noise}^2}} = \frac{1}{\sqrt{2\pi\sigma_{speech}^2}}\, e^{-\frac{(x-\mu_{speech})^2}{2\sigma_{speech}^2}}$$
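Taking logs of both sides of the equal-likelihood condition turns it into a quadratic in x, whose two roots are the thresholds. A NumPy sketch; the fitted means and standard deviations below are hypothetical values for illustration only:

```python
import numpy as np

# Hypothetical fitted parameters (dB rel.), for illustration only
mu_n, sigma_n = -65.0, 3.0    # noise
mu_s, sigma_s = -45.0, 5.0    # speech

# Log both Gaussians and collect terms into a*x^2 + b*x + c = 0
a = 1 / (2 * sigma_n**2) - 1 / (2 * sigma_s**2)
b = mu_s / sigma_s**2 - mu_n / sigma_n**2
c = (mu_n**2 / (2 * sigma_n**2) - mu_s**2 / (2 * sigma_s**2)
     + np.log(sigma_n / sigma_s))
roots = np.roots([a, b, c])  # the 2 roots

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
```

In practice only the root lying between the two means is used as the speech/noise boundary; the other root sits in a region where both likelihoods are negligible.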
Result
[Figure: frame intensity, dB rel. vs. time (s), with frames labeled speech/noise]

Caveat: Example is for pedagogical purposes; we can do better than this…
A better way
• This required us to have labels for the speech and noise
• Remember that the intensity histogram had two modes:

[Figure: bimodal intensity histogram; counts vs. dB rel.]

• Can we learn these automatically?
A better way
• Unsupervised model with unlabeled data
• Mixture models
  – Learn to fit complicated distributions with simpler parametric ones
  – We do not require labels
Multidimensional Gaussians

$$f(x_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)}$$

where d → # dimensions, T → transpose, |·| → determinant, µ → E[X], Σ → variance-covariance matrix

Maximum likelihood estimators: sample mean & variance

Note: We do not need d>1 for this problem, but let’s start thinking about higher dimensions.
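The density above can be evaluated directly with NumPy's linear algebra routines. A minimal sketch for a single point (the 2-D zero-mean, identity-covariance case is an illustrative check, since the peak then equals 1/(2π)):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density for a single d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# With Sigma = I (d=2) and x = mu, the density is 1/(2*pi)
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
peak = mvn_pdf(mu, mu, Sigma)
```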
Gaussian Intuitions: Size of Σ
• Three zero-mean examples: µ = [0 0] with Σ = I, Σ = 0.6I, and Σ = 2I
• As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.

Text and figures from Andrew Ng’s lecture notes for CS229 (via Jurafsky & Martin, Speech and Language Processing)
GMM: Parameters

• Parameters for each of K mixtures, k = 1...K:
  – μk: mean
  – Σk: variance-covariance matrix
  – ck: mixture weight (importance), where $\sum_{k=1}^{K} c_k = 1$
• Use Φk to denote (ck, μk, Σk) and Φ to denote the entire set of parameters
Sample 16 mixture model
Based upon R2 cepstral speech data

[Figure: surface plot of pdfs and equal-likelihood contour lines; mixtures have different weights, means, & covariance matrices]
GMM: Evaluation probability

• Probability of an observation:

$$P(\vec{x}\,|\,\Phi) = \sum_{k=1}^{K} c_k\, N_k(\vec{x}\,|\,\mu_k, \Sigma_k) = \sum_{k=1}^{K} c_k\, \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\mu_k)^T \Sigma_k^{-1} (\vec{x}-\mu_k)}$$
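For the 1-D intensity problem the sum above is just a weighted sum of scalar Gaussians. A NumPy sketch; the two components' weights, means, and standard deviations are hypothetical values for illustration:

```python
import numpy as np

def gmm_pdf(x, weights, means, sigmas):
    """1-D GMM density: weighted sum of Gaussian components."""
    comps = [
        c * np.exp(-(x - m) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
        for c, m, s in zip(weights, means, sigmas)
    ]
    return sum(comps)

# Two components with weights summing to 1 (hypothetical parameters)
w = [0.7, 0.3]
mu = [-65.0, -45.0]
sd = [3.0, 5.0]
density = gmm_pdf(np.array([-65.0, -45.0]), w, mu, sd)
```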
Learning GMM parameters (overview only)

• Given training data and a set of parameters, we would like to maximize the likelihood of the data.
• If the data are independent, select parameters to maximize:

$$P(x_1, x_2, \dots, x_T\,|\,\Phi) = \prod_{i=1}^{T} P(x_i\,|\,\Phi) = \prod_{i=1}^{T} \sum_{k=1}^{K} c_k\, \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)}$$
Learning GMM parameters
• If we knew which mixture generated each observation, we could use the derivative to select model parameters that maximize the likelihood $P(x_1, x_2, \dots, x_T\,|\,\Phi)$
• We do not know this…
EM algorithm
• Suppose we have an estimate, model Φ
• Use the E[·] operator to determine how likely a mixture is to generate an observation.
• Now we can maximize the likelihood.
• This is called the expectation-maximization (EM) algorithm.
EM algorithm
Basic ideas
• Use expectation and an extant model to estimate what you do not know
• Use maximization to come up with a new model
• Repeat

Convergence guaranteed under minimal assumptions… but no guarantee on rate
Old Faithful Data
[Figure: Old Faithful data – Wikipedia, EM algorithm; slide courtesy Kaitlin Palmer]
Mixture models
• We do not have time for the details of the implementation
• For those interested, see:
  – Moon, T. K. (1996). "The expectation-maximization algorithm," IEEE Signal Proc. Mag. 13(6), 47-60.
  or
  – Huang, X., Acero, A., and Hon, H. W. (2001). Spoken Language Processing (Prentice Hall PTR, Upper Saddle River, NJ).
• For our purposes, we will use an implementation from scikit-learn.
GMM estimation
• Our training data are the RMS intensity values
• We use two mixtures
• When predicting, we look at the likelihood of each mixture and select the higher one.
• How do we know which mixture represents noise and which mixture represents speech?
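The slides point to scikit-learn for the implementation; the steps above can be sketched with its `GaussianMixture` class. The synthetic intensities and the lower-mean-is-noise heuristic below are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for frame intensities (dB rel.): a quieter noise mode
# and a louder speech mode. Real data would come from the RMS computation.
rng = np.random.default_rng(4)
intensities = np.concatenate([
    rng.normal(-65, 3, size=400),   # quieter mode (noise)
    rng.normal(-45, 5, size=300),   # louder mode (speech)
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(intensities)

# One plausible answer to the slide's question: take the mixture with the
# lower mean intensity to be noise and the higher one to be speech
noise_k = int(np.argmin(gmm.means_.ravel()))
speech_k = 1 - noise_k
labels = gmm.predict(np.array([[-66.0], [-44.0]]))
```

`predict` returns the component with the higher likelihood for each frame, which is exactly the decision rule described above.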