TDT4171 Artificial Intelligence Methods Lecture 3 & 4 – Probabilistic Reasoning Over Time
Norwegian University of Science and Technology
Helge Langseth IT-VEST 310
Outline
1 Leftovers from last time
  Inference
2 Probabilistic Reasoning over Time
  Set-up
  Basic speech recognition
  Inference: Filtering, prediction, smoothing
  Inference for Hidden Markov models
  Kalman Filters
  Dynamic Bayesian networks
  Summary
3 Speech recognition
  Speech as probabilistic inference
  Speech sounds
  Word sequences
Leftovers from last time
Summary from last time
Bayes nets provide a natural representation for (causally induced) conditional independence
Topology + CPTs = compact representation of joint distribution
Generally easy to construct – also for non-experts
Canonical distributions (e.g., noisy-OR) = compact representation of CPTs
Announcements
The first assignment is due next Friday
Deliver it using It’s Learning
There will be no lecture next week!
Leftovers from last time Inference
Inference tasks
Simple queries: compute the posterior marginal P(X_i | E = e), e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e) (see the sketch after this list)
Optimal decisions: decision networks include utility information; probabilistic inference required for P(outcome | action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?
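As a small worked illustration of the first two query types, the following Python sketch computes a posterior marginal and a conjunctive query by summing out a tiny hand-made joint distribution (the variables X, Y, Z and all numbers are invented for the example, not taken from the slides):

from itertools import product

# Illustrative joint distribution over three Boolean variables (X, Y, Z);
# the numbers are made up and just need to sum to 1.
joint = {}
p_x, p_y_given, p_z_given = 0.3, {True: 0.8, False: 0.1}, {True: 0.7, False: 0.2}
for x, y, z in product([True, False], repeat=3):
    px = p_x if x else 1 - p_x
    py = p_y_given[x] if y else 1 - p_y_given[x]
    pz = p_z_given[x] if z else 1 - p_z_given[x]
    joint[(x, y, z)] = px * py * pz

def simple_query(evidence_z):
    """Posterior marginal P(X | Z = evidence_z): sum out Y, then normalize."""
    scores = {x: sum(joint[(x, y, evidence_z)] for y in [True, False]) for x in [True, False]}
    norm = sum(scores.values())
    return {x: s / norm for x, s in scores.items()}

def conjunctive_query(evidence_z):
    """Conjunctive query P(X, Y | Z = evidence_z), i.e. P(X | e) * P(Y | X, e)."""
    scores = {(x, y): joint[(x, y, evidence_z)] for x in [True, False] for y in [True, False]}
    norm = sum(scores.values())
    return {xy: s / norm for xy, s in scores.items()}

print(simple_query(True))
print(conjunctive_query(True))

Both answers come straight from the joint; the point of the algorithms below is to get the same answers without ever constructing that joint explicitly.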
Leftovers from last time Inference
Inference tasks – Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.
Simple query on the burglary network:
P(B | j, m) = P(B, j, m) / P(j, m) = α P(B, j, m)
= α Σ_e Σ_a P(B, e, a, j, m)
[Figure: the burglary network — B and E are parents of A, which is the parent of J and M]
Rewrite full joint entries using products of CPT entries:
P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
= α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
Recursive depth-first enumeration: O(n) space, O(d^n) time (naive summation over the full joint, without the factoring above, would be O(n · d^n))
Leftovers from last time Inference
Enumeration algorithm
function Enumeration-Ask(X, e, bn) returns a distribution over X
   inputs: X, the query variable
           e, observed values for variables E
           bn, a Bayesian network with variables {X} ∪ E ∪ Y
   Q(X) ← a distribution over X, initially empty
   for each value x_i of X do
       extend e with value x_i for X
       Q(x_i) ← Enum-All(Vars[bn], e)
   return Normalize(Q(X))

function Enum-All(vars, e) returns a real number
   if Empty?(vars) then return 1.0
   Y ← First(vars)
   if Y has value y in e
       then return P(y | Pa(Y)) × Enum-All(Rest(vars), e)
       else return Σ_y P(y | Pa(Y)) × Enum-All(Rest(vars), e_y)
            where e_y is e extended with Y = y
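To make the pseudocode concrete, here is a minimal runnable Python sketch of Enumeration-Ask on the burglary network; the dictionary-based representation and function names are our own, and the CPT numbers are the usual textbook values for this example:

# Inference by enumeration on the burglary network (textbook parameters).
# Variables are Booleans; each CPT maps parent-value tuples to P(variable = True).
cpts = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
order = ["B", "E", "A", "J", "M"]          # topological order of the network

def prob(var, value, evidence):
    """P(var = value | parents(var)), read from the CPT."""
    p_true = cpts[var][tuple(evidence[p] for p in parents[var])]
    return p_true if value else 1.0 - p_true

def enum_all(variables, evidence):
    """Sum over all unobserved variables, depth first."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in evidence:
        return prob(first, evidence[first], evidence) * enum_all(rest, evidence)
    return sum(prob(first, v, evidence) * enum_all(rest, {**evidence, first: v})
               for v in (True, False))

def enumeration_ask(query, evidence):
    """Posterior distribution over the Boolean query variable."""
    q = {v: enum_all(order, {**evidence, query: v}) for v in (True, False)}
    norm = sum(q.values())
    return {v: p / norm for v, p in q.items()}

# P(Burglary | JohnCalls = true, MaryCalls = true) ≈ {True: 0.284, False: 0.716}
print(enumeration_ask("B", {"J": True, "M": True}))

The recursion in enum_all mirrors Enum-All: an observed variable contributes a single CPT factor, while an unobserved one is summed out depth-first.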
Leftovers from last time Inference
Evaluation tree
[Figure: evaluation tree for P(b | j, m) — branching first on e, then on a, with the CPT entries P(b) = .001, P(e) = .002, P(¬e) = .998, P(a | b, e) = .95, P(¬a | b, e) = .05, P(a | b, ¬e) = .94, P(¬a | b, ¬e) = .06, P(j | a) = .90, P(j | ¬a) = .05, P(m | a) = .70, P(m | ¬a) = .01 multiplied along each path]
Enumeration is inefficient, as we have repeated computation of e.g., P (j|a)P (m|a) for each value of e. ⇒ Nice to know that better methods are available. . .
Leftovers from last time Inference
Summary of Chapter 14
Bayes nets provide a natural representation for (causally induced) conditional independence
Topology + CPTs = compact representation of joint
Generally easy to construct – also for non-experts
Canonical distributions (e.g., noisy-OR) = compact representation of CPTs
Efficient inference calculations are available (but the good ones are outside the scope of this course)
What you should know:
How to build models (and verify them using Conditional Independence and Causality)
What drives the…
  model building burden
  complexity of inference
Probabilistic Reasoning over Time Set-up
Time and uncertainty
Motivation: The world changes; we need to track and predict it
Static (vehicle diagnosis) vs. dynamic (diabetes management)
Basic idea: copy state and evidence variables for each time step
Rain_t = does it rain at time t?
This assumes discrete time; the step size depends on the problem
Here: a time step is one day, I guess (?)
Probabilistic Reasoning over Time Set-up
Markov processes (Markov chains)
If we want to construct a Bayes net from these variables, then what are the parents?
Assume we have observations of Rain_0, Rain_1, …, Rain_t and want to predict whether or not it rains on day t + 1: P(Rain_{t+1} | Rain_0, Rain_1, …, Rain_t). Try to build a BN over Rain_0, Rain_1, …, Rain_{t+1}:
  P(Rain_{t+1}) ≠ P(Rain_{t+1} | Rain_t); so base the prediction on Rain_t.
  P(Rain_{t+1} | Rain_t) ≈ P(Rain_{t+1} | Rain_t, Rain_{t−1}) (Do you agree?)
First-order Markov process:
  P(Rain_{t+1} | Rain_0, …, Rain_t) = P(Rain_{t+1} | Rain_t)
  “Future is conditionally independent of Past given Present”
k’th-order Markov process:
  P(Rain_{t+1} | Rain_0, …, Rain_t) = P(Rain_{t+1} | Rain_{t−k+1}, …, Rain_t)
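As a small illustration of a first-order Markov chain, here is a minimal Python sketch with an assumed transition model for Rain (the 0.7/0.3 persistence probabilities are the familiar umbrella-world values, used here purely as an example):

import random

# First-order Markov chain for Rain_t.
# transition[prev] = P(Rain_t = True | Rain_{t-1} = prev); illustrative values.
transition = {True: 0.7, False: 0.3}
p_rain0 = 0.8                       # illustrative prior P(Rain_0 = True)

def predict(p_rain, steps):
    """Propagate P(Rain = True) forward 'steps' days, using only the previous day."""
    p = p_rain
    for _ in range(steps):
        p = p * transition[True] + (1 - p) * transition[False]
    return p

def sample(days):
    """Sample one weather trajectory of the given length."""
    rain = random.random() < p_rain0
    traj = [rain]
    for _ in range(days - 1):
        rain = random.random() < transition[rain]
        traj.append(rain)
    return traj

print(predict(p_rain0, 5))   # converges towards the stationary value 0.5
print(sample(10))

Note that both prediction and sampling only ever look at the previous day, which is exactly the first-order Markov assumption.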
Probabilistic Reasoning over Time Set-up
Markov processes as Bayesian networks
If we want to construct a Bayes net from these variables, then what are the parents?
Markov assumption: X_t depends on a bounded subset of X_{0:t−1}
First-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−1})
Second-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−2}, X_{t−1})
[Figure: chain-structured Bayesian networks over X_{t−2}, X_{t−1}, X_t, X_{t+1}, X_{t+2}; in the first-order network each X_t has parent X_{t−1}, in the second-order network it has parents X_{t−2} and X_{t−1}]
Probabilistic Reasoning over Time Set-up
Is a first-order Markov process suitable?
First-order Markov assumption not exactly true in real world!
Possible fixes:
  1 Increase the order of the Markov process
  2 Augment the state, e.g., add Temp_t, Pressure_t
State augmentation is enough!
Any k’th-order Markov process can be expressed as a first-order Markov process – focus on first-order processes from now on.
“Proof”:
  1 Assume for simplicity that the process contains only the variable X, and that we have a second-order Markov process.
  2 Create a new variable X′_t identical to X_{t−1}.
  3 Let X_{t+1} have both X_t and X′_t as parents.
  4 Do this for all t. The augmented model is a first-order Markov process (see the sketch after this list).
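A minimal Python sketch of this construction, assuming a second-order chain over a single Boolean variable X with an invented transition table (all names and numbers are illustrative):

import random

# Second-order transition model: P(X_t = True | X_{t-1}, X_{t-2}); illustrative numbers.
second_order = {(True, True): 0.9, (True, False): 0.6,
                (False, True): 0.4, (False, False): 0.1}

# State augmentation: the new state is the pair S_t = (X_t, X'_t) with X'_t = X_{t-1}.
# The augmented chain is first-order: S_{t+1} depends only on S_t.
def step(state):
    x_t, x_prev = state                       # S_t = (X_t, X_{t-1})
    p_true = second_order[(x_t, x_prev)]
    x_next = random.random() < p_true
    return (x_next, x_t)                      # S_{t+1} = (X_{t+1}, X_t)

state = (True, False)                         # some initial pair (X_1, X_0)
trajectory = [state]
for _ in range(10):
    state = step(state)
    trajectory.append(state)
print([x for x, _ in trajectory])             # the sampled X_t values

The augmented state S_t = (X_t, X′_t) carries the extra history explicitly, so the chain over S_t is first-order even though the chain over X_t alone was second-order.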
Probabilistic Reasoning over Time Basic speech recognition
Speech as probabilistic inference
How can we recognize speech?
Speech signals are noisy, variable, ambiguous
What is the most likely word sequence, given the speech signal?
Why not choose Words to maximize P(Words|signal)?? Use Bayes’ rule:
P(Words | signal) = α P(signal | Words) P(Words)
I.e., decomposes into acoustic model + language model
Need to be able to do the required calculations!!
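As an illustration of the decomposition, the following Python sketch scores two candidate transcriptions by combining acoustic-model and language-model probabilities via Bayes’ rule; the candidate sentences and all probabilities are invented for the example:

# Candidate transcriptions for one utterance, with illustrative model scores.
# acoustic[w] ~ P(signal | Words = w)   (how well the words explain the audio)
# language[w] ~ P(Words = w)            (how plausible the word sequence is a priori)
acoustic = {"recognize speech": 0.00020, "wreck a nice beach": 0.00025}
language = {"recognize speech": 0.00100, "wreck a nice beach": 0.00001}

def posterior(candidates):
    """P(Words | signal) ∝ P(signal | Words) P(Words), normalized over the candidates."""
    scores = {w: acoustic[w] * language[w] for w in candidates}
    norm = sum(scores.values())
    return {w: s / norm for w, s in scores.items()}

post = posterior(acoustic)
print(post)                       # the language model favours "recognize speech"
print(max(post, key=post.get))    # most likely word sequence

Even when the acoustic scores are close, the language model P(Words) can decide between acoustically similar hypotheses.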