Transcript
Page 1

TDT4171 Artificial Intelligence Methods
Lecture 3 & 4 – Probabilistic Reasoning Over Time

Norwegian University of Science and Technology

Helge Langseth, IT-VEST 310

[email protected]

Page 2

Outline

1 Leftovers from last time
   Inference

2 Probabilistic Reasoning over Time
   Set-up
   Basic speech recognition
   Inference: Filtering, prediction, smoothing
   Inference for Hidden Markov models
   Kalman Filters
   Dynamic Bayesian networks
   Summary

3 Speech recognition
   Speech as probabilistic inference
   Speech sounds
   Word sequences

Page 3

Leftovers from last time

Summary from last time

Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution

Generally easy to construct – also for non-experts

Canonical distributions (e.g., noisy-OR) = compact representation of CPTs

Announcements

The first assignment is due next Friday

Deliver it using It’s Learning

There will be no lecture next week!

Page 4

Leftovers from last time Inference

Inference tasks

Simple queries: compute posterior marginal P(Xi|E = e), e.g., P(NoGas|Gauge = empty, Lights = on, Starts = false)

Conjunctive queries: P(Xi, Xj|E = e) = P(Xi|E = e) P(Xj|Xi, E = e)

Optimal decisions: decision networks include utility information; probabilistic inference required for P(outcome|action, evidence)

Value of information: which evidence to seek next?

Sensitivity analysis: which probability values are most critical?

Explanation: why do I need a new starter motor?

Page 5

Leftovers from last time Inference

Inference tasks – Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network:

P(B|j,m) = P(B, j,m)/P(j,m)
         = α P(B, j,m)
         = α Σe Σa P(B, e, a, j,m)

[Burglary network: B and E are parents of A; A is parent of J and M]

Rewrite full joint entries using product of CPT entries:

P(B|j,m) = α Σe Σa P(B) P(e) P(a|B, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(n · d^n) time

Page 6

Leftovers from last time Inference

Inference tasks – Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.

Simple query on the burglary network:

P(B|j,m) = P(B, j,m)/P(j,m)
         = α P(B, j,m)
         = α Σe Σa P(B, e, a, j,m)

[Burglary network: B and E are parents of A; A is parent of J and M]

Rewrite full joint entries using product of CPT entries:

P(B|j,m) = α Σe Σa P(B) P(e) P(a|B, e) P(j|a) P(m|a)
         = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(d^n) time

Page 7

Leftovers from last time Inference

Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
   inputs: X, the query variable
           e, observed values for variables E
           bn, a Bayesian network with variables {X} ∪ E ∪ Y

   Q(X) ← a distribution over X, initially empty
   for each value xi of X do
       extend e with value xi for X
       Q(xi) ← Enum-All(Vars[bn], e)
   return Normalize(Q(X))

function Enum-All(vars, e) returns a real number
   if Empty?(vars) then return 1.0
   Y ← First(vars)
   if Y has value y in e
       then return P(y | Pa(Y)) × Enum-All(Rest(vars), e)
       else return Σy P(y | Pa(Y)) × Enum-All(Rest(vars), ey)
            where ey is e extended with Y = y
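The pseudocode translates almost directly into Python. Below is a minimal sketch for the burglary network; the dict-based encoding is an illustrative choice of mine, and the two ¬b rows of the alarm CPT (P(a|¬b, e) = 0.29, P(a|¬b,¬e) = 0.001) are the standard textbook values, assumed here since they do not appear on these slides:

```python
# Minimal enumeration inference for the burglary network (sketch).
parents = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
order = ['B', 'E', 'A', 'J', 'M']          # topological order

# P(var = True | parent values), keyed by the tuple of parent values.
cpt = {
    'B': {(): 0.001},
    'E': {(): 0.002},
    'A': {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    'J': {(True,): 0.90, (False,): 0.05},
    'M': {(True,): 0.70, (False,): 0.01},
}

def prob(var, value, evidence):
    """P(var = value | pa(var)), with parent values read from evidence."""
    p_true = cpt[var][tuple(evidence[p] for p in parents[var])]
    return p_true if value else 1.0 - p_true

def enum_all(variables, evidence):
    """Sum of products of CPT entries over all unassigned variables."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in evidence:
        return prob(first, evidence[first], evidence) * enum_all(rest, evidence)
    return sum(prob(first, v, {**evidence, first: v})
               * enum_all(rest, {**evidence, first: v})
               for v in (True, False))

def enumeration_ask(X, evidence):
    """Posterior P(X | evidence) by enumeration, then normalization."""
    q = {v: enum_all(order, {**evidence, X: v}) for v in (True, False)}
    z = sum(q.values())
    return {v: p / z for v, p in q.items()}

print(enumeration_ask('B', {'J': True, 'M': True}))
# -> {True: 0.284..., False: 0.715...}
```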

Page 8

Leftovers from last time Inference

Evaluation tree

[Evaluation tree for P(b|j,m): P(b) = .001 at the root, branching over e (P(e) = .002, P(¬e) = .998) and a (P(a|b, e) = .95, P(¬a|b, e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06), with leaves P(j|a) P(m|a) = .90 · .70 and P(j|¬a) P(m|¬a) = .05 · .01 repeated under each value of e]

Enumeration is inefficient, as we have repeated computation of, e.g., P(j|a) P(m|a) for each value of e.
⇒ Nice to know that better methods are available. . .

Page 9

Leftovers from last time Inference

Summary of Chapter 14

Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint

Generally easy to construct – also for non-experts

Canonical distributions (e.g., noisy-OR) = compact representation of CPTs

Efficient inference calculations are available (but the good ones are outside the scope of this course)

What you should know:

How to build models (and verify them using Conditional Independence and Causality)

What drives the . . .
   model building burden
   complexity of inference

Page 10

Probabilistic Reasoning over Time Set-up

Time and uncertainty

Motivation: The world changes; we need to track and predict it.
Static (Vehicle diagnosis) vs. Dynamic (Diabetes management)

Basic idea: copy state and evidence variables for each time step

Raint = Does it rain at time t

This assumes discrete time; the step size depends on the problem. Here: a time step is one day, I guess (?)

Page 11

Probabilistic Reasoning over Time Set-up

Markov processes (Markov chains)

If we want to construct a Bayes net from these variables, then whatare the parents?

Assume we have observations of Rain0, Rain1, . . . , Raint and want to predict whether or not it rains at day t + 1: P(Raint+1|Rain0, Rain1, . . . , Raint)

Try to build a BN over Rain0, Rain1, . . . , Raint+1:

P(Raint+1) ≠ P(Raint+1|Raint); base on Raint.
P(Raint+1|Raint) ≈ P(Raint+1|Raint, Raint−1) (Do you agree?)

Page 12

Probabilistic Reasoning over Time Set-up

Markov processes (Markov chains)

If we want to construct a Bayes net from these variables, then whatare the parents?

Assume we have observations of Rain0, Rain1, . . . , Raint and want to predict whether or not it rains at day t + 1: P(Raint+1|Rain0, Rain1, . . . , Raint)

Try to build a BN over Rain0, Rain1, . . . , Raint+1:

P(Raint+1) ≠ P(Raint+1|Raint); base on Raint.
P(Raint+1|Raint) ≈ P(Raint+1|Raint, Raint−1) (Do you agree?)

First-order Markov process:

P(Raint+1|Rain0, . . . , Raint) = P(Raint+1|Raint)
“Future is cond. independent of Past given Present”

Page 13

Probabilistic Reasoning over Time Set-up

Markov processes (Markov chains)

If we want to construct a Bayes net from these variables, then whatare the parents?

Assume we have observations of Rain0, Rain1, . . . , Raint and want to predict whether or not it rains at day t + 1: P(Raint+1|Rain0, Rain1, . . . , Raint)

Try to build a BN over Rain0, Rain1, . . . , Raint+1:

P(Raint+1) ≠ P(Raint+1|Raint); base on Raint.
P(Raint+1|Raint) ≈ P(Raint+1|Raint, Raint−1) (Do you agree?)

k’th-order Markov process:

P(Raint+1|Rain0, . . . , Raint) = P(Raint+1|Raint−k+1, . . . , Raint)

Page 14

Probabilistic Reasoning over Time Set-up

Markov processes as Bayesian networks

If we want to construct a Bayes net from these variables, then whatare the parents?

Markov assumption: Xt depends on bounded subset of X0:t−1

First-order Markov process: P(Xt|X0:t−1) = P(Xt | Xt−1)

Second-order Markov process:

P(Xt|X0:t−1) = P(Xt|Xt−2,Xt−1)

[Chain diagrams: First-order — Xt−2 → Xt−1 → Xt → Xt+1 → Xt+2; Second-order — the same chain with additional arcs skipping one step, e.g., Xt−2 → Xt]

Page 15

Probabilistic Reasoning over Time Set-up

Is a first-order Markov process suitable?

First-order Markov assumption not exactly true in real world!

Possible fixes:
1 Increase order of Markov process
2 Augment state, e.g., add Tempt, Pressuret

State augmentation is enough!

Any k’th-order Markov process can be expressed as a first-order Markov process – focus on first-order processes from now on.

“Proof”:
1 Assume for simplicity that the process contains only the variable X, and that we have a second-order Markov process.
2 Create a new variable X′t identical to Xt−1.
3 Let Xt+1 have both Xt and X′t as parents.
4 Do this for all t. The augmented model is a first-order Markov process (a small sketch follows).
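A small sketch of this construction, assuming a binary X and a made-up second-order transition table:

```python
import itertools

# Illustrative second-order model: P(X_t = 1 | X_{t-1}, X_{t-2}).
p_second = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}

# Augmented state Z_t = (X_t, X'_t) with X'_t a copy of X_{t-1}.
# The first-order transition P(Z_t | Z_{t-1}) forces the copy
# component of Z_t to equal the X component of Z_{t-1}.
def p_first(z_new, z_old):
    x_new, x_copy = z_new
    x_old, x_old_copy = z_old
    if x_copy != x_old:
        return 0.0
    p1 = p_second[(x_old, x_old_copy)]
    return p1 if x_new == 1 else 1.0 - p1

# Sanity check: every row of the augmented transition model sums to 1,
# so Z_t is a proper first-order Markov process.
states = list(itertools.product((0, 1), repeat=2))
for z_old in states:
    assert abs(sum(p_first(z, z_old) for z in states) - 1.0) < 1e-12
```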

Page 16

Probabilistic Reasoning over Time Basic speech recognition

Speech as probabilistic inference

How can we recognize speech?

Speech signals are noisy, variable, ambiguous

What is the most likely word sequence, given the speechsignal?

Why not choose Words to maximize P(Words|signal)? Use Bayes’ rule:

P(Words|signal) = αP(signal|Words)P(Words)

I.e., decomposes into acoustic model + language model

Need to be able to do the required calculations!!

Page 17

Probabilistic Reasoning over Time Basic speech recognition

Generation of Speech

Page 18

Probabilistic Reasoning over Time Basic speech recognition

The sound signal - Characteristics

[Waveform plot: amplitude vs. time (s) for a short speech segment]

Sound is dynamic, and we must take this into account to represent it faithfully.

Sound is a “wavy” signal-train, with amplitude and frequency information changing all the time.

Volume of speech ↔ Global change of amplitudes
Speed of speech ↔ Global change of frequencies

Most information is carried by the frequencies around 1 kHz

Page 19

Probabilistic Reasoning over Time Basic speech recognition

The raw sound for recognition/classification

[Waveform plots (signal amplitude vs. time in seconds) for the four words]

The raw signal of the words “Start”, “Stop”, “Left”, and “Right”.

Page 20

Probabilistic Reasoning over Time Basic speech recognition

Phones

All human speech is composed from 40–50 phones, determined by the configuration of articulators

Form an intermediate (hidden) level between words and signal ⇒ speech of a word = uttering a sequence of phones.

ARPAbet designed for American English:

[iy] beat     [b] bet     [p] pet
[ih] bit      [ch] Chet   [r] rat
[ey] bet      [d] debt    [s] set
[ao] bought   [hh] hat    [th] thick
[ow] boat     [hv] high   [dh] that
[er] Bert     [l] let     [w] wet
[ix] roses    [ng] sing   [en] button
. . .         . . .       . . .

E.g., “ceiling” is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]

Page 21

Probabilistic Reasoning over Time Basic speech recognition

Markov processes and speech

Assume we observe phonemes directly. Let Xt be the phoneme uttered inside frame t:

Xt is a single, discrete variable.

Xt takes on a value from the state-space {1, 2, . . . , N}, where N is the total number of phonemes.

An observation sequence is {x1, x2, . . . , xT } (use x1:T as a shorthand).

It is common to assume a Markov process for speech signals.

Page 22

Probabilistic Reasoning over Time Basic speech recognition

(Observable) Markov processes; full set of assumptions

Stationary process:

Transition model P(Xt|pa (Xt)) fixed for all t

k’th-order Markov process:

P(Xt|X0:t−1) = P(Xt|Xt−k:t−1)

Parameters:

Transition matrix T : P(Xt|Xt−k:t−1).

Prior distribution π: P(X0:k−1)

Page 23

Probabilistic Reasoning over Time Basic speech recognition

Hidden Markov models

Phonemes are not observable themselves

Phoneme Xt is partially disclosed by the sound signal in frame t (or our representation of that). We call the observation Et.

Reasonable assumptions to make:

Stationary process:

Transition model P(Xt|pa (Xt)) fixed for all t

k’th-order Markov process:

P(Xt|X0:t−1) = P(Xt|Xt−k:t−1)

Sensor Markov assumption:

P(Et|X1:t,E1:t−1) = P(Et|Xt).

Page 24

Probabilistic Reasoning over Time Basic speech recognition

Hidden Markov models as Bayesian networks

[HMM as a Bayesian network: X0 → X1 → X2 → X3 → X4, with each Et a child of Xt (E1 . . . E4)]

The variables Xt are discrete and one-dimensional; the variables Et are vectors of variables used to represent the sound signal in that frame.

Page 25

Probabilistic Reasoning over Time Basic speech recognition

Example of Hidden Markov Model from the book

[Umbrella HMM: Raint−1 → Raint → Raint+1, with Umbrellat a child of Raint.
Transition model P(Rt|Rt−1): 0.7 if Rt−1 = t, 0.3 if Rt−1 = f.
Sensor model P(Ut|Rt): 0.9 if Rt = t, 0.2 if Rt = f.]

Page 26

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Recognition of isolated words

Let e1:T denote the observation of a sound signal over T frames.

Must define a model to find the likelihood P(e1:T|word) for an isolated word:

P(word|e1:T) = α P(e1:T|word) P(word)

Prior probability P(word) by counting word frequencies.

This leaves us with the problem of calculating P(e1:T|word) to do single-word speech recognition.

Page 27

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Top level design

[Top-level design: the raw signal is sent through a Transform step producing o1:T, which is passed to every word model word1 . . . wordn; model j returns pj]

The top-level structure for the classifier has one HMM per word. Note that the same data is sent to all models, and that the probability pj = P(e1:T|wordj) is returned from the HMMs.

Page 28

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Inference tasks

Filtering: P(Xt|e1:t). This is the belief state – input to the decision process of a rational agent. Also, as an artifact of the calculation scheme, we get the probability needed for speech recognition if we are interested.

Prediction: P(Xt+k|e1:t) for k > 0. Evaluation of possible action sequences; like filtering without the evidence.

Smoothing: P(Xk|e1:t) for 0 ≤ k < t. Better estimate of past states – essential for learning.

Most likely explanation: arg maxx1:t P(x1:t|e1:t). Speech recognition, decoding with a noisy channel.

Page 29

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering

Aim: devise a recursive state estimation algorithm:

P(Xt+1|e1:t+1) = Some-Func(P(Xt|e1:t), et+1)

P(Xt+1|e1:t+1) = P(Xt+1, e1:t, et+1)/P(e1:t+1)
= P(et+1|Xt+1, e1:t) · P(Xt+1|e1:t) · P(e1:t)/P(e1:t+1)
= P(et+1|Xt+1) · P(Xt+1|e1:t) · α
= α · P(et+1|Xt+1) [evidence] · P(Xt+1|e1:t) [prediction]

So, filtering is a prediction updated by evidence.

Page 30

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering

Aim: devise a recursive state estimation algorithm:

P(Xt+1|e1:t+1) = Some-Func(P(Xt|e1:t), et+1)

Prediction by summing out Xt:

P(Xt+1|e1:t+1) = α · P(et+1|Xt+1) · P(Xt+1|e1:t)
= α · P(et+1|Xt+1) · Σxt P(Xt+1|xt, e1:t) P(xt|e1:t)
= α · P(et+1|Xt+1) · Σxt P(Xt+1|xt) P(xt|e1:t)

(the last sum is P(Xt+1|e1:t), computed using what we already have)

All relevant information is contained in f1:t = P(Xt|e1:t); belief revision using f1:t+1 = Forward(f1:t, et+1).

Note! Time and space requirements for calculating f1:t+1 are constant (independent of t)
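A runnable sketch of this recursion for the umbrella model (transition 0.7/0.3, sensor 0.9/0.2); the printed values match the worked example on the following slides:

```python
# Forward (filtering) recursion for the umbrella HMM (sketch).
T = [[0.7, 0.3],    # T[i][j] = P(X_{t+1} = j | X_t = i), states [rain, ¬rain]
     [0.3, 0.7]]
O = [0.9, 0.2]      # P(umbrella | rain), P(umbrella | ¬rain)

def forward(f, umbrella):
    """One step: f_{1:t+1} = alpha * P(e|X) * sum_x P(X|x) f(x)."""
    pred = [sum(T[i][j] * f[i] for i in range(2)) for j in range(2)]
    like = [O[j] if umbrella else 1.0 - O[j] for j in range(2)]
    unnorm = [like[j] * pred[j] for j in range(2)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

f = [0.5, 0.5]                   # P(X_0)
f = forward(f, True); print(f)   # [0.818..., 0.181...]
f = forward(f, True); print(f)   # [0.883..., 0.116...]
```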

Page 31

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉

Page 32

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = ?

P(X1) = Σx0 P(X1|x0) · P(x0) = 〈0.7, 0.3〉 · 0.5 + 〈0.3, 0.7〉 · 0.5 = 〈0.5, 0.5〉

Page 33

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = 〈0.5, 0.5〉
P(X1|e1) = ?

P(X1|e1) = α · P(e1|X1) P(X1) = α · 〈0.9 · 0.5, 0.2 · 0.5〉 = 〈0.818, 0.182〉

Page 34

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = 〈0.5, 0.5〉
P(X1|e1) = 〈0.818, 0.182〉
P(X2|e1) = ?

P(X2|e1) = Σx1 P(X2|x1) · P(x1|e1) = 〈0.7, 0.3〉 · 0.818 + 〈0.3, 0.7〉 · 0.182 = 〈0.627, 0.373〉

Page 35

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = 〈0.5, 0.5〉
P(X1|e1) = 〈0.818, 0.182〉
P(X2|e1) = 〈0.627, 0.373〉
P(X2|e1:2) = ?

P(X2|e1:2) = α · P(e2|X2) · P(X2|e1) = α · 〈0.9, 0.2〉 · 〈0.627, 0.373〉 = α · 〈0.565, 0.075〉 = 〈0.883, 0.117〉

Page 36

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Filtering example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = 〈0.5, 0.5〉
P(X1|e1) = 〈0.818, 0.182〉
P(X2|e1) = 〈0.627, 0.373〉
P(X2|e1:2) = 〈0.883, 0.117〉

Page 37

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Prediction

P(Xt+k+1|e1:t) = Σxt+k P(Xt+k+1|xt+k) P(xt+k|e1:t)

Again we have a recursive formulation – this time over k. . .

As k → ∞, P(xt+k|e1:t) tends to the stationary distribution of the Markov chain. This means that the effect of e1:t will vanish as k increases, and predictions will become more and more dubious.

Mixing time depends on how stochastic the chain is (“how persistent X is”)

Page 38

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Prediction – Example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = 〈0.5, 0.5〉
P(X1|e1) = 〈0.818, 0.182〉
P(X2|e1) = 〈0.627, 0.373〉
P(X2|e1:2) = 〈0.883, 0.117〉

P(X3|e1:2) = Σx2 P(X3|x2) · P(x2|e1:2) = 〈0.7, 0.3〉 · 0.883 + 〈0.3, 0.7〉 · 0.117 = 〈0.653, 0.347〉

Page 39

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Prediction – Example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = 〈0.5, 0.5〉
P(X1|e1) = 〈0.818, 0.182〉
P(X2|e1) = 〈0.627, 0.373〉
P(X2|e1:2) = 〈0.883, 0.117〉

P(X4|e1:2) = Σx3 P(X4|x3) · P(x3|e1:2) = 〈0.7, 0.3〉 · 0.653 + 〈0.3, 0.7〉 · 0.347 = 〈0.561, 0.439〉

Page 40

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Prediction – Example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

P(X0) = 〈0.5, 0.5〉
P(X1) = 〈0.5, 0.5〉
P(X1|e1) = 〈0.818, 0.182〉
P(X2|e1) = 〈0.627, 0.373〉
P(X2|e1:2) = 〈0.883, 0.117〉

P(X10|e1:2) = Σx9 P(X10|x9) · P(x9|e1:2) = 〈0.7, 0.3〉 · 0.501 + 〈0.3, 0.7〉 · 0.499 = 〈0.500, 0.500〉

limk→∞ P(Xt+k|e1:t) = 〈1/2, 1/2〉 — but why?
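The “why”: repeated prediction steps converge to the fixed point p = T⊤p of the transition model, its stationary distribution; for this symmetric chain that is 〈0.5, 0.5〉. A quick numeric check:

```python
# Iterating the prediction step drives any distribution to the
# stationary distribution of the chain.
T = [[0.7, 0.3], [0.3, 0.7]]
p = [0.883, 0.117]                 # start from P(X2 | e_{1:2})
for _ in range(30):
    p = [sum(T[i][j] * p[i] for i in range(2)) for j in range(2)]
print(p)                           # -> [0.5, 0.5] (up to rounding)
```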

Page 41

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Example: Automatic recognition of hand-written digits

We have this system that can “recognise” hand-written digits:

[Figure: images of hand-written digits (0, 3, 6, 8) fed to a box computing P(image | Digit)]

Takes a binary image of a handwritten digit as input

Returns P(image | Digit) (The system we will consider is not very good)

Page 42

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Internals of recogniser – Naive Bayes

An image is a 16 × 16 matrix of binary variables Imagei,j: Imagei,j = true if pixel (i, j) is white, false otherwise.

How should we proceed? We need a model for P(image | Digit). Note that image is 256-dimensional.

Page 43

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Internals of recogniser – Naive Bayes

An image is a 16 × 16 matrix of binary variables Imagei,j: Imagei,j = true if pixel (i, j) is white, false otherwise.

Note! The different digits distribute white spots differently in the image ⇒ combine single-pixel information to find the digit.

Page 44

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Internals of recogniser – Naive Bayes

An image is a 16 × 16 matrix of binary variables Imagei,j: Imagei,j = true if pixel (i, j) is white, false otherwise.

In this example we assume that each location contributes independently (Naive Bayes model):

P(image | Digit) = ∏i ∏j P(imagei,j | Digit).
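A sketch of this likelihood in Python, with per-pixel parameters theta[d][i][j] = P(Imagei,j = true | Digit = d); the random parameter values and test image are stand-ins for what would be learned from data:

```python
import math, random

random.seed(0)
# Stand-in parameters; a real recogniser would estimate these from data.
theta = {d: [[random.uniform(0.05, 0.95) for _ in range(16)]
             for _ in range(16)] for d in range(10)}

def log_likelihood(image, d):
    """log P(image | Digit = d) = sum of per-pixel log-probabilities."""
    ll = 0.0
    for i in range(16):
        for j in range(16):
            p = theta[d][i][j]
            ll += math.log(p if image[i][j] else 1.0 - p)
    return ll

image = [[random.random() < 0.5 for _ in range(16)] for _ in range(16)]
print(max(range(10), key=lambda d: log_likelihood(image, d)))
```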

Page 45

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Scaling up: ZIP-codes

We want to build a system that can decode hand-written ZIP-codes for letters to Norway.

Digit1 Digit2 Digit3 Digit4

Page 46

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Scaling up: ZIP-codes

We want to build a system that can decode hand-written ZIP-codesfor letters to Norway.

There is a structure in this:

ZIP-codes always have 4 digits
Some ZIP-codes are more frequent than others (e.g., 0xxx–13xx for Oslo, 50xx for Bergen, 70xx for Trondheim)
Some ZIP-codes are not used, e.g., 5022 does not exist
. . . but some illegal numbers are often used, e.g., 7000 meaning “Wherever in Trondheim”

Can we utilise the internal structure to improve the digit recogniser?

Page 47

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

How to model the internal structure of ZIP-codes

Take 1: Full model

Digit1 Digit2 Digit3 Digit4

Page 48

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

How to model the internal structure of ZIP-codes

Take 1: Full model

The full model includes all relations between digits:

7465 is good, 7365 is not

The problem is related to the size of the CPTs:

How many numbers are needed to represent P(Digit4 | Pa(Digit4))?
What if we want to use this system to recognise KID numbers (> 10 digits)?

Page 49

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

How to model the internal structure of ZIP-codes

Take 2: Markov model

Digit1 Digit2 Digit3 Digit4

Page 50

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

How to model the internal structure of ZIP-codes

Take 2: Markov model

The reduced model includes only some relations between digits:

Can represent “If it starts with 7 and digit number three is 6, then the second one is probably 4”
Cannot represent “If it starts with 9 then digit number four is probably not 7”

What about making the model stationary?

Does not seem appropriate here.
Might be necessary and/or reasonable for KID recognition though.

Page 51

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Inference (filtering)

Step 1: First digit classified as a 4! (Not good! I told you!)

[Figure: the first digit image and the resulting distribution P(Digit1 | image1), with its mass concentrated on the digits 1, 4, and 7; Digit2–Digit4 are still unread]

Page 52

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Inference (filtering)

Step 1: First digit classified as a 4! (Not good! I told you!)

So what happened?

The Naive Bayes method supplies P(image1 | Digit1)

Using the calculation rule, the system finds

P(Digit1 | image1) = α · P(image1 | Digit1) · P(Digit1)

Page 53

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Inference (filtering)

Step 2: Second digit classified as a 4.

[Figure: the second digit image and the resulting distribution P(Digit2 | image1, image2); Digit3 and Digit4 are still unread]

Page 54

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Inference (filtering)

Step 2: Second digit classified as a 4.

So what happened?

The Naive Bayes method supplies P(image2 | Digit2)
Using the calculation rule, the system finds
P(Digit2 | image1, image2) = α · P(image2 | Digit2) · Σdigit1 P(Digit2 | digit1) P(digit1 | image1)

To do the classification, the system used the information that
the image is a very typical “4”
7 → 4 is probable
4 → 4 is not very probable, but possible

Can this structural information also be used “backwards”?

If the 2nd digit is 4, then the 1st digit is probably a 7, not a 4

This is called smoothing

Page 55

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing

[Chain X0 → X1 → · · · → Xk → · · · → Xt, with evidence E1 . . . Ek . . . Et]

Calculate P(Xk|e1:t) by dividing evidence e1:t into e1:k, ek+1:t:

P(Xk|e1:t) = P(Xk|e1:k, ek+1:t)
= P(Xk, e1:k, ek+1:t)/P(e1:k, ek+1:t)
= P(ek+1:t|Xk, e1:k) · P(Xk|e1:k) · P(e1:k)/P(e1:k, ek+1:t)
= P(ek+1:t|Xk) · P(Xk|e1:k) · α
= α · P(Xk|e1:k) · P(ek+1:t|Xk)
= α · f1:k · bk+1:t

where bk+1:t = P(ek+1:t|Xk).

Page 56

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing

[Chain X0 → X1 → · · · → Xk → · · · → Xt, with evidence E1 . . . Ek . . . Et]

Backward message computed by a backwards recursion:

P(ek+1:t|Xk) = Σxk+1 P(ek+1:t|Xk, xk+1) P(xk+1|Xk)
= Σxk+1 P(ek+1:t|xk+1) P(xk+1|Xk)
= Σxk+1 P(ek+1|xk+1) · P(ek+2:t|xk+1) · P(xk+1|Xk)

So. . .

bk+1:t = P(ek+1:t|Xk) = Σxk+1 P(ek+1|xk+1) · bk+2:t(xk+1) · P(xk+1|Xk)
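A sketch of the backward recursion for the umbrella model; it reproduces b2:2 = 〈0.690, 0.410〉 and the smoothed P(X1|e1:2) from the example that follows:

```python
# Backward recursion and smoothing for the umbrella HMM (sketch).
T = [[0.7, 0.3], [0.3, 0.7]]   # T[i][j] = P(X_{t+1} = j | X_t = i)
O = [0.9, 0.2]                  # P(umbrella | rain), P(umbrella | ¬rain)

def backward(b, umbrella):
    """One step: b_{k+1:t}(i) = sum_j P(e|j) * b_{k+2:t}(j) * P(j|i)."""
    like = [O[j] if umbrella else 1.0 - O[j] for j in range(2)]
    return [sum(like[j] * b[j] * T[i][j] for j in range(2)) for i in range(2)]

b = [1.0, 1.0]                  # b_{3:2}: no evidence yet (void)
b = backward(b, True)           # b_{2:2}
print(b)                        # [0.69, 0.41]

f11 = [0.818, 0.182]            # f_{1:1} from the filtering example
unnorm = [f11[i] * b[i] for i in range(2)]
z = sum(unnorm)
print([u / z for u in unnorm])  # P(X1 | e_{1:2}) = [0.883..., 0.116...]
```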

Page 57

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

f0 = 〈0.5, 0.5〉, f1:1 = 〈0.818, 0.182〉, f1:2 = 〈0.883, 0.117〉

b3:2 = ?

b3:2 = P(e3:2|X2) = 〈1, 1〉 (void evidence)

Page 58

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

f0 = 〈0.5, 0.5〉 f1:1 = 〈0.818, 0.182〉 f1:2 = 〈0.883, 0.117〉

b3:2 = 〈1, 1〉

P(X2|e1:2) = ?

P(X2|e1:2) = α · f1:2 · b3:2 = α · 〈0.883, 0.117〉 · 〈1, 1〉 = 〈0.883, 0.117〉

Page 59

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

f0 = 〈0.5, 0.5〉 f1:1 = 〈0.818, 0.182〉 f1:2 = 〈0.883, 0.117〉

b3:2 = 〈1, 1〉

P(X2|e1:2) = 〈0.883, 0.117〉

b2:2 = ?

b2:2 = P(e2:2|X1) = Σx2 P(e2|x2) · b3:2(x2) · P(x2|X1) = (0.9 · 1 · 〈0.7, 0.3〉) + (0.2 · 1 · 〈0.3, 0.7〉) = 〈0.690, 0.410〉

Page 60

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

f0 = 〈0.5, 0.5〉 f1:1 = 〈0.818, 0.182〉 f1:2 = 〈0.883, 0.117〉

b3:2 = 〈1, 1〉

P(X2|e1:2) = 〈0.883, 0.117〉

b2:2 = 〈0.690, 0.410〉

P(X1|e1:2) = ?

P(X1|e1:2) = α · f1:1 · b2:2 = α · 〈0.818, 0.182〉 · 〈0.690, 0.410〉 = 〈0.883, 0.117〉

Page 61

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing example

Rain0 Rain1 Rain2

Umbrella1 Umbrella2

f0 = 〈0.5, 0.5〉 f1:1 = 〈0.818, 0.182〉 f1:2 = 〈0.883, 0.117〉

b3:2 = 〈1, 1〉

P(X2|e1:2) = 〈0.883, 0.117〉

b2:2 = 〈0.690, 0.410〉

P(X1|e1:2) = 〈0.883, 0.117〉

Page 62

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Smoothing example — conclusion

[Figure: the umbrella network annotated with messages (True/False) per slice —
forward: 〈0.500, 0.500〉 at Rain0, 〈0.818, 0.182〉 at Rain1, 〈0.883, 0.117〉 at Rain2 (the prediction 〈0.627, 0.373〉 is also shown);
backward: 〈0.690, 0.410〉 at Rain1, 〈1.000, 1.000〉 at Rain2;
smoothed: 〈0.883, 0.117〉 at Rain1]

Forward–backward algorithm: cache forward messages as we move
Time linear in t (polytree inference), space O(t · |f|)

Page 63

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

How to classify ZIP-codes?

Digit1 Digit2 Digit3 Digit4

Can we take the most probable digit per image and use it for classification?

NO! The most likely sequence IS NOT the sequence of most likely states!

Page 64

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Most likely explanation

Most likely sequence ≠ sequence of most likely states!

Most likely path to each xt+1 is most likely path to some xt plus one more step

max{x1...xt} P(x1, . . . , xt, Xt+1|e1:t+1)
= max{x1...xt} P(x1, . . . , xt, Xt+1, e1:t+1)/P(e1:t+1)
= max{x1...xt} P(et+1|x1, . . . , xt, Xt+1, e1:t) · P(Xt+1|x1, . . . , xt, e1:t) · P(x1, . . . , xt|e1:t) · α
= max{x1...xt} α · P(et+1|Xt+1) · P(Xt+1|xt) · P(x1, . . . , xt|e1:t)
= α P(et+1|Xt+1) max{xt} ( P(Xt+1|xt) max{x1...xt−1} P(x1, . . . , xt−1, xt|e1:t) )

Page 65

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Most likely explanation

Most likely sequence ≠ sequence of most likely states!

Most likely path to each xt+1 is most likely path to some xt plus one more step

max{x1...xt} P(x1, . . . , xt, Xt+1|e1:t+1) = α P(et+1|Xt+1) · max{xt} ( P(Xt+1|xt) max{x1...xt−1} P(x1, . . . , xt|e1:t) )

Identical to filtering, except f1:t is replaced by

m1:t = max{x1...xt−1} P(x1, . . . , xt−1, Xt|e1:t),

i.e., m1:t(i) gives the probability of the most likely path to state i. The update has the sum replaced by max, giving the Viterbi algorithm:

m1:t+1 = P(et+1|Xt+1) max{xt} ( P(Xt+1|xt) · m1:t )

Page 66

Probabilistic Reasoning over Time Inference: Filtering, prediction, smoothing

Viterbi example

Rain1 . . . Rain5, umbrella observations: true, true, false, true, true

        m1:1   m1:2   m1:3   m1:4   m1:5
true:  .8182  .5155  .0361  .0334  .0210
false: .1818  .0491  .1237  .0173  .0024

[Figure: state-space paths, with the most likely path to each state marked]

m1:t+1 = P(et+1|Xt+1) max{xt} ( P(Xt+1|xt) · m1:t )
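A sketch of this update for the example above; it reproduces the columns m1:1 . . . m1:5 and recovers the most likely sequence from the stored argmax pointers:

```python
# Viterbi for the umbrella world (sketch).
T = [[0.7, 0.3], [0.3, 0.7]]          # T[i][j] = P(X_{t+1} = j | X_t = i)
O = [0.9, 0.2]                         # P(umbrella | rain), P(umbrella | ¬rain)
evidence = [True, True, False, True, True]

def obs(j, u):
    return O[j] if u else 1.0 - O[j]

m = [obs(j, evidence[0]) * 0.5 for j in range(2)]
m = [v / sum(m) for v in m]            # m_{1:1} = [.8182, .1818]
back = []                              # argmax pointers per step

for u in evidence[1:]:
    best = [max(range(2), key=lambda i: T[i][j] * m[i]) for j in range(2)]
    m = [obs(j, u) * T[best[j]][j] * m[best[j]] for j in range(2)]
    back.append(best)
    print([round(v, 4) for v in m])    # m_{1:2} ... m_{1:5} as in the table

state = max(range(2), key=lambda j: m[j])
path = [state]
for best in reversed(back):            # follow pointers backwards
    state = best[state]
    path.append(state)
print(["rain" if s == 0 else "no rain" for s in reversed(path)])
# -> ['rain', 'rain', 'no rain', 'rain', 'rain']
```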

Page 67

Probabilistic Reasoning over Time Inference for Hidden Markov models

Simplifications for Hidden Markov models

Xt is a single, discrete variable (usually Et is too)
Domain of Xt is {1, . . . , S}

Transition matrix Tij = P(Xt = j | Xt−1 = i), e.g., T = ( 0.7  0.3 ; 0.3  0.7 )

Sensor matrix Ot for each t, with diagonal elements P(et|Xt = i). For instance, with U1 = true we get
O1 = ( P(u1|x1)  0 ; 0  P(u1|¬x1) ) = ( 0.9  0 ; 0  0.2 )

Forward and backward messages as column vectors:

f1:t+1 = α Ot+1 T⊤ f1:t
bk+1:t = T Ok+1 bk+2:t

Forward-backward algorithm needs time O(S²t) and space O(St)
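The same messages in matrix form, as a short NumPy sketch:

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition matrix
O_true = np.diag([0.9, 0.2])             # sensor matrix when umbrella = true

f = np.array([0.5, 0.5])                 # prior over [rain, ¬rain]
f = O_true @ T.T @ f                     # one forward step (unnormalized)
print(f / f.sum())                       # [0.818..., 0.181...]

b = np.array([1.0, 1.0])                 # b_{3:2}
print(T @ O_true @ b)                    # b_{2:2} = [0.69, 0.41]
```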

Page 68

Probabilistic Reasoning over Time Inference for Hidden Markov models

Hidden Markov Models at work: Bach

Chord0 Chord1 Chord2

Melody1 Melody2

http://www.anc.inf.ed.ac.uk/demos/hmmbach/demo1.html

Page 69

Probabilistic Reasoning over Time Kalman Filters

Kalman filters

Modelling systems described by a set of continuous variables, e.g., tracking a bird flying — Xt = (X, Y, Z, Ẋ, Ẏ, Ż), i.e., position and velocity.

Also: Airplanes, robots, ecosystems, economies, chemical plants,planets, . . .

“Noisy” observations, continuous variables, dynamic model

[DBN slice: Xt → Xt+1 (state), with observation Zt a child of Xt]

Gaussian prior, linear Gaussian transition model and sensor model

Page 70

Probabilistic Reasoning over Time Kalman Filters

Continuous variables

Need a way to define a conditional density function for a child variable given continuous parents

Most common is the linear Gaussian model, e.g.:

P(Xt = xt | Xt−1 = xt−1) = N(a · xt−1 + b, σ)(xt) = (1/(σ√(2π))) exp( −(1/2) · ((xt − (a · xt−1 + b))/σ)² )

Mean of Xt varies linearly with Xt−1, variance is fixed

Linear variation and fixed variance may be unreasonable over the full range, but may work OK if the likely range of Xt is narrow

Page 71

Probabilistic Reasoning over Time Kalman Filters

Continuous variables (cont’d)

All-continuous network with LG distributions ⇒ full joint distribution is a multivariate Gaussian

Page 72

Probabilistic Reasoning over Time Kalman Filters

Updating Gaussian distributions

Prediction step: if P(Xt|e1:t) is Gaussian, then the prediction

P(Xt+1|e1:t) = ∫ P(Xt+1|xt) P(xt|e1:t) dxt

is Gaussian.

If P(Xt+1|e1:t) is Gaussian, then the updated distribution

P(Xt+1|e1:t+1) = αP(et+1|Xt+1)P(Xt+1|e1:t)

is Gaussian

Hence P(Xt|e1:t) is multivariate Gaussian N(µt,Σt) for all t

General (nonlinear, non-Gaussian) process: description of the posterior grows unboundedly as t → ∞

Page 73

Probabilistic Reasoning over Time Kalman Filters

Simple 1-D example

Task: Measure the Norwegian population’s level of job satisfaction on a monthly basis

Scale: Real numbers (typical values from −5 to +5)

Indirect measurement: Ask a random subset of N people

Modelling assumptions:

The true value cannot be measured (N < 4.5 · 10⁶), but the measurements (Zt) are correlated with the true value (Xt):

P(zt|Xt = xt) ∼ N(xt, Σz)(zt)

The true level at time t is related to the level at time t − 1:

P(xt|Xt−1 = xt−1) ∼ N(xt−1, Σx)(xt)

That is, we have a Gaussian Random Walk

Page 74

Probabilistic Reasoning over Time Kalman Filters

Simple 1-D example (cont’d)

Gaussian random walk on X–axis, s.d. σx, sensor s.d. σz

µt+1 = ( (σt² + σx²) zt+1 + σz² µt ) / ( σt² + σx² + σz² )

σ²t+1 = ( (σt² + σx²) σz² ) / ( σt² + σx² + σz² )

[Plot: P(x0), the prediction P(x1), and the update P(x1 | z1 = 2.5) with the observation z1 marked; x-axis “X position”, y-axis “P(X)”]

Page 75

Probabilistic Reasoning over Time Kalman Filters

General Kalman update

Transition and sensor models:

P(xt+1|xt) = N(Fxt, Σx)(xt+1)
P(zt|xt) = N(Hxt, Σz)(zt)

F is the matrix for the transition; Σx the transition noise covariance
H is the matrix for the sensors; Σz the sensor noise covariance

Filter computes the following update:

µt+1 = F µt + Kt+1(zt+1 − H F µt)
Σt+1 = (I − Kt+1 H)(F Σt F⊤ + Σx)

where Kt+1 = (F Σt F⊤ + Σx) H⊤ (H (F Σt F⊤ + Σx) H⊤ + Σz)⁻¹ is the Kalman gain matrix

Note! Σt and Kt are independent of the observation sequence, so they can be computed offline
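A sketch of this update in NumPy, instantiated with F = H = 1 so that it reduces to the 1-D random-walk formulas two slides back; the noise values and the observation are illustrative assumptions:

```python
import numpy as np

F = np.array([[1.0]]); H = np.array([[1.0]])    # random walk, direct sensing
Sx = np.array([[2.0]]); Sz = np.array([[1.0]])  # transition / sensor noise
mu, S = np.array([0.0]), np.array([[1.0]])      # prior N(0, 1)

def kalman_step(mu, S, z):
    P = F @ S @ F.T + Sx                               # predicted covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Sz)      # Kalman gain
    mu_new = F @ mu + K @ (z - H @ (F @ mu))
    S_new = (np.eye(len(mu)) - K @ H) @ P
    return mu_new, S_new

mu, S = kalman_step(mu, S, np.array([2.5]))
print(mu, S)   # mu = (1 + 2) * 2.5 / (1 + 2 + 1) = 1.875, sigma^2 = 0.75
```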

Page 76

Probabilistic Reasoning over Time Kalman Filters

2-D tracking example: filtering

[Plot: 2D filtering — true, observed, and filtered trajectories in the X–Y plane]

Page 77

Probabilistic Reasoning over Time Kalman Filters

2-D tracking example: smoothing

[Plot: 2D smoothing — true, observed, and smoothed trajectories in the X–Y plane]

Page 78

Probabilistic Reasoning over Time Kalman Filters

Where Kalman Filtering falls apart

Kalman Filters cannot be applied if the transition model is nonlinear

Extended Kalman Filter models the transition as locally linear around xt = µt. Fails if the system is locally unsmooth.
Switching Kalman Filter can be used to handle discontinuities

Page 79

Probabilistic Reasoning over Time Dynamic Bayesian networks

Dynamic Bayesian networks

Xt, Et contain arbitrarily many variables in a replicated Bayes net

[Figure: two DBNs. Left: the umbrella DBN — Rain0 → Rain1 with prior P(R0) = 0.7, transition P(R1|R0) = 0.7 (R0 = t) / 0.3 (R0 = f), and sensor P(U1|R1) = 0.9 (t) / 0.2 (f). Right: a robot DBN with X0 → X1, Battery0 → Battery1, BMeter1, and observation Z1]

Page 80

Probabilistic Reasoning over Time Dynamic Bayesian networks

DBNs vs. HMMs

Every HMM is a single-variable DBN; every discrete DBN is an HMM

[DBN slice: variables Xt, Yt, Zt with sparse links to Xt+1, Yt+1, Zt+1]

Sparse dependencies ⇒ exponentially fewer parameters

. . . e.g., with 20 state variables and three parents each, the DBN has 20 × 2³ = 160 parameters, the HMM 2²⁰ × 2²⁰ ≈ 10¹²

Page 81

Probabilistic Reasoning over Time Dynamic Bayesian networks

DBNs vs. Kalman filters

Every Kalman filter model is a DBN, but few DBNs are KFs, as the real world requires non-Gaussian posteriors: Where are my keys? What’s the battery charge? Does this system work?

[Figure: the battery DBN extended with BMBroken0 → BMBroken1, plus a plot over time steps of E(Battery | . . . 5555005555 . . .), E(Battery | . . . 5555000000 . . .), and P(BMBroken | . . .) for both observation sequences]

Page 82

Probabilistic Reasoning over Time Summary

Summary – Representation and inference in temporal models

Temporal models use state and sensor variables replicated over time

Markov assumptions and stationarity assumption, so we need
Transition model P(Xt|Xt−1)
Sensor model P(Et|Xt)

Tasks are filtering, prediction, smoothing, most likely sequence; all done recursively with constant cost per time step

Hidden Markov models have a single discrete state variable; used for speech recognition

Kalman filters allow n state variables, linear Gaussian, O(n³) update

Dynamic Bayes nets subsume HMMs, Kalman filters; exact update intractable; approximations exist

Page 83

Speech recognition Speech as probabilistic inference

Speech as probabilistic inference

Let us return to the question of how to recognize speech

Speech signals are noisy, variable, ambiguous

Classify as the Words that maximize P(Words|signal)? Use Bayes’ rule:

P(Words|signal) = αP(signal|Words)P(Words)

I.e., decomposes into acoustic model + language model

The Words are the hidden state sequence, signal is the observation sequence

We use Hidden Markov Models to model this

Page 84

Speech recognition Speech sounds

Speech sounds

Raw signal is the microphone displacement as a function of time; processed into overlapping 30 ms frames, each described by features

Analog acoustic signal:

Sampled, quantized digital signal:

Frames with features: [10 15 38], [52 47 82], [22 63 24], [89 94 11], [10 12 73]

Frame features are typically formants (peaks in the power spectrum)

Page 85

Speech recognition Speech sounds

Phone models

Frame features in P(features|phone) summarized by
an integer in [0 . . . 255] (using vector quantization); or
the parameters of a mixture of Gaussians

Three-state phones: each phone has three phases (Onset, Mid, End)
E.g., [t] has silent Onset, explosive Mid, hissing End
⇒ P(features|phone, phase)

Triphone context: each phone becomes n² distinct phones, depending on the phones to its left and right
E.g., [t] in “star” is written [t(s,aa)] (different from “tar”!)

Triphones are useful for handling coarticulation effects: the articulators have inertia and cannot switch instantaneously between positions
E.g., [t] in “eighth” has tongue against front teeth

Page 86

Speech recognition Speech sounds

Phone model example

Phone HMM for [m]:

[Three-state HMM: Onset (self-loop 0.3) → Mid (0.7); Mid (self-loop 0.9) → End (0.1); End (self-loop 0.4) → FINAL (0.6)]

Output probabilities for the phone HMM:

Onset: C1: 0.5, C2: 0.2, C3: 0.3
Mid: C3: 0.2, C4: 0.7, C5: 0.1
End: C4: 0.1, C6: 0.5, C7: 0.4

Page 87

Speech recognition Speech sounds

Word pronunciation models

Each word is described as a distribution over phone sequences
Distribution represented as an HMM transition model

[Pronunciation HMM for “tomato”: [t] branches to [ow] (0.2) or [ah] (0.8), then [m], then [ey] (0.5) or [aa] (0.5), then [t] → [ow]; the remaining transitions have probability 1.0]

P([towmeytow]|“tomato”) = P([towmaatow]|“tomato”) = 0.2 · 0.5 = 0.1
P([tahmeytow]|“tomato”) = P([tahmaatow]|“tomato”) = 0.8 · 0.5 = 0.4
(edge probabilities multiply along each path)

Structure is created manually, transition probabilities learned from data

Page 88

Speech recognition Speech sounds

Isolated words

Phone models + word models fix the likelihood P(e1:t|word) for an isolated word

P (word|e1:t) = αP (e1:t|word)P (word)

Prior probability P (word) by counting word frequencies

P(e1:t|word) can be computed recursively: define

ℓ1:t = P(Xt, e1:t)

and use the recursive update

ℓ1:t+1 = Forward(ℓ1:t, et+1)

and then P(e1:t|word) = Σxt ℓ1:t(xt)

Isolated-word dictation systems with training reach 95%–99% accuracy
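A sketch of this unnormalized forward computation; the two-state model is a stand-in for a real phone-level word HMM (a recogniser has one such model per word and picks the word maximizing α P(e1:T|word) P(word)):

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # illustrative transition model
O = {True: np.diag([0.9, 0.2]),          # illustrative sensor model
     False: np.diag([0.1, 0.8])}
prior = np.array([0.5, 0.5])

def likelihood(evidence):
    l = prior.copy()                     # l_{1:0} = P(X_0)
    for e in evidence:                   # l_{1:t+1} = O_{t+1} T^T l_{1:t}
        l = O[e] @ T.T @ l
    return l.sum()                       # P(e_{1:T} | word)

print(likelihood([True, True, False]))
```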

Page 89

Speech recognition Word sequences

Continuous speech

Not just a sequence of isolated-word recognition problems!

Adjacent words highly correlated

Sequence of most likely words is not equal to most likely sequence of words

Segmentation: there are few gaps in speech

Cross-word coarticulation, e.g., “next thing” ≈ “nexing” in daily speech

Continuous speech recognition is hard; currently the best systems manage 60%–80% accuracy

Page 90

Speech recognition Word sequences

Language model

Prior probability of a word sequence is given by the chain rule:

P(w1 · · · wn) = ∏i=1..n P(wi|w1 · · · wi−1)

Simplify using a bigram model:

P(wi|w1 · · · wi−1) ≈ P(wi|wi−1)

Train by counting all word pairs in a large text corpus

More sophisticated models (trigrams, grammars, etc.) help, but only a little bit
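A toy sketch of the counting step (the corpus is made up):

```python
from collections import Counter

corpus = "the umbrella is wet the umbrella is red".split()  # toy corpus
pair_counts = Counter(zip(corpus, corpus[1:]))
prev_counts = Counter(corpus[:-1])

def p_bigram(w, prev):
    """P(w_i | w_{i-1}) estimated by relative frequency."""
    return pair_counts[(prev, w)] / prev_counts[prev]

print(p_bigram("umbrella", "the"))  # 1.0 in this toy corpus
print(p_bigram("wet", "is"))        # 0.5
```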

Page 91

Speech recognition Word sequences

Summary – Speech

Since the mid-1970s, speech recognition has been formulated as probabilistic inference

Evidence = speech signal, hidden variables = word and phone sequences

“Context” effects (coarticulation etc.) are handled by augmenting state

Variability in human speech (speed, timbre, etc., etc.) and background noise make continuous speech recognition in real settings an open problem
