Computer Vision: Models, Learning and Inference–
Markov Random Fields, Part 1
Oren Freifeld and Ron Shapira-Weber
Computer Science, Ben-Gurion University
March 11, 2019
www.cs.bgu.ac.il/~cv192/ MRFs, Part 1 (ver. 1.01) Mar 11, 2019 1 / 36
Bayesian Image Restoration with a Markov Random Field
From left to right:
x: the true binary image.
y: its degraded version (20% random flips – this defines p(y|x)).
arg max_x p(x|y) = arg max_x p(y|x)p(x), where p(x) was taken to be a particular MRF prior called the Ising model.
A sample from p(x|y).
In about a week from now, you will know how to do it (in terms of both the math and the coding involved).
Figure from Winkler’s book on MRFs.
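The full treatment comes later in the course, but a minimal sketch of the kind of computation involved can be given now. The sketch below uses ICM (Iterated Conditional Modes), a simple greedy ascent on p(x|y), rather than the methods developed later; the toy image, the coupling strength beta, and the flip probability eps are all hypothetical choices, not the lecture's.

```python
import math
import random

random.seed(0)
H, W = 16, 16
beta, eps = 1.0, 0.2      # hypothetical Ising coupling strength; flip probability

# Toy "true" binary image in {-1,+1}: left half -1, right half +1 (hypothetical).
x_true = [[-1 if j < W // 2 else 1 for j in range(W)] for i in range(H)]
# Degrade it: flip each pixel independently with probability eps (this is p(y|x)).
y = [[-v if random.random() < eps else v for v in row] for row in x_true]
noisy_errors = sum(y[i][j] != x_true[i][j] for i in range(H) for j in range(W))

# ICM: repeatedly set each pixel to the value maximizing its local conditional,
# which depends on its 4 neighbors (Ising prior) and on y (likelihood).
h = 0.5 * math.log((1 - eps) / eps)   # data-term weight implied by the flip model
x = [row[:] for row in y]             # initialize at the noisy observation
for _ in range(10):
    for i in range(H):
        for j in range(W):
            nb = sum(x[a][b]
                     for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= a < H and 0 <= b < W)
            x[i][j] = 1 if beta * nb + h * y[i][j] >= 0 else -1

errors = sum(x[i][j] != x_true[i][j] for i in range(H) for j in range(W))
```

On this toy image ICM removes most of the isolated flips, since the Ising prior penalizes disagreeing neighbors; it only finds a local maximum of p(x|y), which is why the lecture will develop better tools.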
1 Few Words on Probabilistic Graphical Models
2 Markov Chains
3 Markov Random Fields
Few Words on Probabilistic Graphical Models
Probabilistic Graphical Models (PGMs)
PGMs come in two main flavors:
Bayesian Networks – directed graphs
Markov Random Fields (MRFs) – undirected graphs
In either case, a PGM encodes (and visualizes) the dependency structure of a joint pdf/pmf
Both types generalize Markov chains
PGMs and Neural Networks are different beasts – but there are relations between them.
pdf: probability density function
pmf: probability mass function
Markov Chains
Markov Chain as a Directed Linear Graph
A Markov Chain, (X1, X2, . . . , Xn), may be graphically represented as
x1 → x2 → · · · → xn−1 → xn
This highlights the fact that
p(x) = p(x1) ∏_{i=2}^{n} p(xi | xi−1)
Markov Chains
Markov Chain as an Undirected Linear Graph
A Markov Chain, (X1, X2, . . . , Xn), may also be graphically represented as
x1 — x2 — · · · — xn−1 — xn
This highlights the fact that
p(x1:n) = ∏_{i=1}^{n−1} φ_{i,i+1}(xi, xi+1)   (in sloppier notation: = ∏_{i=1}^{n−1} φ(xi, xi+1))

where:

φ_{1,2}(x1, x2) = p(x1)p(x2|x1)
φ_{i,i+1}(xi, xi+1) = p(xi+1|xi) ∀i ∈ {2, . . . , n − 1}
Markov Chains
Factorization Simplifies Computations
For now, forget about probability, and consider the following:
x, y, z are binary variables.
f : R³ → R≥0 factorizes as f(x, y, z) = φx,y(x, y)φy,z(y, z) for some two nonnegative functions, φx,y : R² → R≥0 and φy,z : R² → R≥0.
Want:
max_{x,y,z} f(x, y, z)   (1)
Markov Chains
Factorization Simplifies Computations
Brute force requires 2³ computations of f(x, y, z). However, we can do better by exploiting the factorization:
max_{x,y,z} f(x, y, z) = max_{x,y,z} φx,y(x, y) φy,z(y, z)
= max_{x,y} φx,y(x, y) max_z φy,z(y, z)    [define ψy(y) := max_z φy,z(y, z)]
= max_{x,y} φx,y(x, y) ψy(y)               [define ψx,y(x, y) := φx,y(x, y) ψy(y)]
= max_{x,y} ψx,y(x, y)
Markov Chains
Factorization Simplifies Computations
max_{x,y} ψx,y(x, y), where
ψx,y(x, y) := φx,y(x, y) ψy(y)
ψy(y) := max_z φy,z(y, z)
Now:
ψy(0) = func(φy,z(0, 0), φy,z(0, 1)) (2 evaluations of φy,z)
ψy(1) = func(φy,z(1, 0), φy,z(1, 1)) (2 evaluations of φy,z)
ψx,y(0, 0) = func(φx,y(0, 0), ψy(0)) (1 evaluation of φx,y)
ψx,y(0, 1) = func(φx,y(0, 1), ψy(1)) (1 evaluation of φx,y)
ψx,y(1, 0) = func(φx,y(1, 0), ψy(0)) (1 evaluation of φx,y)
ψx,y(1, 1) = func(φx,y(1, 1), ψy(1)) (1 evaluation of φx,y)
The solution = max {ψx,y(0, 0), ψx,y(0, 1), ψx,y(1, 0), ψx,y(1, 1)}.
Again 2³ evaluations, but of simpler functions.
There is some overhead (e.g., memory, bookkeeping)
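The bookkeeping above can be checked in a few lines; the factors below are hypothetical stand-ins for φx,y and φy,z:

```python
import itertools

# Hypothetical nonnegative factors (any such functions would do).
def phi_xy(x, y): return 1.0 + 2.0 * (x == y)
def phi_yz(y, z): return 0.5 + 1.5 * y * z

def f(x, y, z): return phi_xy(x, y) * phi_yz(y, z)

# Brute force: 2^3 evaluations of the full product f.
brute = max(f(x, y, z) for x, y, z in itertools.product((0, 1), repeat=3))

# Factorized: first eliminate z (computing psi_y), then maximize over (x, y).
psi_y = {y: max(phi_yz(y, z) for z in (0, 1)) for y in (0, 1)}
factored = max(phi_xy(x, y) * psi_y[y]
               for x, y in itertools.product((0, 1), repeat=2))

assert brute == factored   # same optimum, computed from simpler pieces
```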
Markov Chains
Factorization Simplifies Computations
More generally:
If the 3 variables, instead of being binary, take values in {0, 1, . . . , s − 1}, then brute force requires s³ evaluations of f while exploiting the factorization leads to 2s² evaluations of its factors.
If x = (x1, . . . , xn) where each xi takes values in {0, 1, . . . , s − 1}, and we want max_x f(x) where

f(x) = ∏_{i=1}^{n−1} φ_{i,i+1}(xi, xi+1)   (2)

then brute force requires s^n evaluations of f while exploiting the factorization leads to (n − 1)s² evaluations of its factors. The difference can be huge, e.g.: s = 10 and n = 100 ⇒ s^n = 10^100 while (n − 1)s² = 9900.
Obviously: more overhead due to memory and bookkeeping.
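The general chain case is the same elimination done n − 1 times. A sketch with randomly chosen positive factors standing in for the φ_{i,i+1} (n and s kept small so the brute-force check stays feasible):

```python
import itertools
import random

random.seed(0)
n, s = 6, 4  # small enough that brute force is still feasible for checking

# Hypothetical positive factors phi_{i,i+1}(x_i, x_{i+1}).
phi = [[[random.random() + 0.1 for _ in range(s)] for _ in range(s)]
       for _ in range(n - 1)]

def f(x):
    p = 1.0
    for i in range(n - 1):
        p *= phi[i][x[i]][x[i + 1]]
    return p

# Brute force: s^n evaluations of f.
brute = max(f(x) for x in itertools.product(range(s), repeat=n))

# Max-product dynamic programming: (n-1)*s^2 factor evaluations.
# m[v] = max over x_{i+1:n} of the product of factors from i onward, given x_i = v.
m = [1.0] * s
for i in reversed(range(n - 1)):
    m = [max(phi[i][v][w] * m[w] for w in range(s)) for v in range(s)]
dp = max(m)

assert abs(brute - dp) < 1e-12
```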
Markov Chains
Factorization Simplifies Computations
Similar results hold if f(x) = ∏_{i=1}^{n−1} φ_{i,i+1}(xi, xi+1) and we want ∑_x f(x); this is useful, e.g., if we want to create a normalized version of f, i.e., f(x)/∑_x f(x).

A bit less trivial: as we will see, similar results hold if f(x) = ∏_{i=1}^{n−1} φ_{i,i+1}(xi, xi+1) is a pmf and we want to sample from f:

x ∼ f(x)   (3)
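The sum and sampling variants can be sketched the same way: backward messages give ∑_x f(x) in (n − 1)s² work, and reusing them yields exact ancestral sampling from f/Z. The factors are again hypothetical random ones; the lecture will develop this properly later.

```python
import itertools
import random

random.seed(1)
n, s = 5, 3

# Hypothetical positive factors phi_{i,i+1}.
phi = [[[random.random() + 0.1 for _ in range(s)] for _ in range(s)]
       for _ in range(n - 1)]

def f(x):
    p = 1.0
    for i in range(n - 1):
        p *= phi[i][x[i]][x[i + 1]]
    return p

# Sum-product: Z = sum_x f(x) with (n-1)*s^2 work instead of s^n.
b = [1.0] * s                      # backward message: sum over the tail variables
msgs = [b]
for i in reversed(range(n - 1)):
    b = [sum(phi[i][v][w] * b[w] for w in range(s)) for v in range(s)]
    msgs.append(b)
msgs.reverse()                     # msgs[i][v] sums the factors from i on, given x_i = v
Z = sum(msgs[0])
assert abs(Z - sum(f(x) for x in itertools.product(range(s), repeat=n))) < 1e-9

# Ancestral sampling from f/Z: x_1 proportional to msgs[0], then x_{i+1} | x_i.
def sample():
    x = [0] * n
    x[0] = random.choices(range(s), weights=msgs[0])[0]
    for i in range(n - 1):
        w = [phi[i][x[i]][v] * msgs[i + 1][v] for v in range(s)]
        x[i + 1] = random.choices(range(s), weights=w)[0]
    return tuple(x)
```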
Markov Chains
Another Characterization of Markov Chain
Recall: a sequence, (X1, . . . , Xn), is called an MC if p(xi|x1:(i−1)) = p(xi|xi−1) for every i ∈ {2, . . . , n}. This property is referred to as the “1-sided MC” property.
If a sequence, (X1, . . . , Xn), satisfies
p(xi|x1:(i−1), x(i+1):n) = p(xi|xi−1, xi+1) ∀i ∈ {2, . . . , n − 1}
p(x1|x2:n) = p(x1|x2)
p(xn|x1:(n−1)) = p(xn|xn−1)
it is said to satisfy the “2-sided MC” property. In words: given all the others, each RV depends only on its neighbors.
Markov Chains
Fact
1-sided MC ⇐⇒ 2-sided MC
Corollary
By symmetry, it follows that if (X1, . . . , Xn) is an MC, then we also have
i ∈ {1, . . . , n − 1} ⇒ p(xi|x(i+1):n) = p(xi|xi+1)
i ∈ {1, . . . , n − 1} ⇒ p(xi:n) = p(xn) ∏_{j=i}^{n−1} p(xj|xj+1)
Particularly,
p(x) = p(xn) ∏_{i=1}^{n−1} p(xi|xi+1)
where x = (x1, . . . ,xn)
Markov Random Fields
Markov Random Fields
One of the two main types of Probabilistic Graphical Models
Generalize Markov Chains to general undirected graphs
Many computer-vision and machine-learning applications
Markov Random Fields
Markov Random Fields
Informal Definition
Associate an RV with each vertex of an undirected graph, G, and say that each variable, given all the others, depends only on its neighbors (according to the graph). In that case, we say that p, the joint pdf (or pmf) of all these RVs, is an MRF (w.r.t. G).
Example
Markov Random Fields
Cliques
Definition
A clique (in the graph) is a set of vertices that are fully connected. By convention, each singleton is a clique.
Example
Notation
Let C denote the set of all cliques in the graph. If c ∈ C, then xc := {xs : s ∈ c}
Markov Random Fields
The structure of MRFs leads to computational advantages in calculating probabilities on a graph. Examples of (graphs of) MRFs:
graphs defined over pixels (regular 2D lattice)
speech recognition
Markov Random Fields
Notation
S: collection of indices
Xa: a ∈ S, an RV
R: the range of Xa, called the “state space”. Usually |R| < ∞ (but we will see cases where this is not true)
XA, for A ⊂ S: the set {Xs : s ∈ A}
BXA := XA\B = {Xs : s ∈ A \ B} (the left superscript B denotes removing B)
If A = S, we can also just write BX = XS\B = {Xs : s ∈ S \ B}
p: the pmf (or pdf) of XS
xs: a generic value for Xs, s ∈ S.
Markov Random Fields
Notation
G = (S, η), where η is a “neighborhood system”; η = {ηs}s∈S where
ηs ⊂ S
s ∉ ηs
s ∈ ηt ⇐⇒ t ∈ ηs
Example
S = {1, 2, 3, 4, 5, 6, 7, 8}
η1 = {2, 3}, η2 = {1, 3, 4}, η3 = {1, 2}, η4 = {2, 5, 6}, η5 = {4}, η6 = {4, 7, 8}, η7 = {6, 8}, η8 = {6, 7}
Markov Random Fields
Cliques and Neighborhoods
Example
Recall C is the set of cliques in G; i.e., c ∈ C ⇒ c ⊂ S such that for all s, t ∈ c with s ≠ t, we have s ∈ ηt.
Markov Random Fields
Definition (Markov Random Field)
p is an MRF w.r.t. G if p(xs|sx) = p(xs|xηs) ∀s ∈ S (provided the LHS exists)
Remark: some authors also require p(x) > 0, ∀x
Markov Random Fields
Definition (Gibbs distribution)
p is Gibbs w.r.t. G if p(x) > 0 ∀x and
p(x) = ∏_{c∈C} Fc(xc)
for some set of functions {Fc}c∈C.
Markov Random Fields
Theorem (Hammersley & Clifford)
If p(x) > 0 ∀x then:
p MRF w.r.t. G ⇐⇒ p Gibbs w.r.t. G
AKA the fundamental theorem of random fields.
Markov Random Fields
Hammersley-Clifford and Markov Chains
Consider an MC, X = (X1, X2, . . . , Xn); i.e., p(x) factorizes as
p(x) = p(x1) ∏_{i=1}^{n−1} p(xi+1|xi)
Assume also p(x) > 0 ∀x.
HC ⇒ p(x) is an MRF w.r.t. G (which here is a linear undirected graph)
⇒ p(xi|ix) = p(xi|xi−1, xi+1) ∀i ∈ {2, . . . , n − 1}.
We just showed, using HC, that the 1-sided Markov property implies the 2-sided Markov property.
In fact, we don’t need HC for this, as we can prove it directly. But first, before we do, we need the following fact.
Markov Random Fields
Fact
If p(xs|xA, xB) = g(xs, xA) for some s ∈ S, A ⊂ S, B ⊂ S, with A ∩ B = ∅, s ∉ A ∪ B, and some function g, then g(xs, xA) = p(xs|xA). Thus, p(xs|xA, xB) = p(xs|xA).
Proof.
p(xs|xA) = ∑_{xB} p(xs, xB|xA) = ∑_{xB} p(xs|xA, xB) p(xB|xA)
(by the assumption)
= ∑_{xB} g(xs, xA) p(xB|xA) = g(xs, xA) ∑_{xB} p(xB|xA) = g(xs, xA).
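The fact can also be verified numerically on a small example: a 3-variable chain gives a conditional of exactly the assumed form, with s = 3 (states), A = {2}, B = {1}. All distributions below are random and purely hypothetical.

```python
import itertools
import random

random.seed(2)
s = 3

def norm(w):
    t = sum(w)
    return [v / t for v in w]

# Hypothetical chain p(x1) p(x2|x1) p(x3|x2): here p(x3|x1,x2) depends only on
# (x3, x2), i.e., it has the form g(x3, x2).
p1 = norm([random.random() for _ in range(s)])
p21 = [norm([random.random() for _ in range(s)]) for _ in range(s)]
p32 = [norm([random.random() for _ in range(s)]) for _ in range(s)]
p = {(a, b, c): p1[a] * p21[a][b] * p32[b][c]
     for a, b, c in itertools.product(range(s), repeat=3)}

# Check: p(x3|x1,x2) equals p(x3|x2) for every configuration.
max_diff = 0.0
for a, b, c in itertools.product(range(s), repeat=3):
    cond_full = p[a, b, c] / sum(p[a, b, cc] for cc in range(s))      # p(x3|x1,x2)
    num = sum(p[aa, b, c] for aa in range(s))                         # p(x2,x3)
    den = sum(p[aa, b, cc] for aa in range(s) for cc in range(s))     # p(x2)
    max_diff = max(max_diff, abs(cond_full - num / den))
```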
Markov Random Fields
1-sided Markov Property ⇒ 2-sided Markov Property.
p(xi|ix) = p(x) / p(ix) = p(x) / ∑_{xi} p(x)
(MC) = [p(x1) ∏_{j=1}^{n−1} p(xj+1|xj)] / [∑_{xi} p(x1) ∏_{j=1}^{n−1} p(xj+1|xj)]
= [p(xi|xi−1) p(xi+1|xi)] / [∑_{xi} p(xi|xi−1) p(xi+1|xi)]
=: g(xi, xi−1, xi+1)
Claim: p(xi|ix) = g(xi, xi−1, xi+1) ⇒ p(xi|ix) = p(xi|xi−1, xi+1). This follows directly from the previous fact: just take A = {i − 1, i + 1}.
Markov Random Fields
2-sided Markov Property ⇒ 1-sided Markov Property.
p(xi|ix) = p(xi|xi−1, xi+1) ⇒ p is an MRF w.r.t. the (linear) graph G
(HC) ⇒ p is Gibbs w.r.t. G ⇒ p(x) = ∏_{i=1}^{n−1} F(xi+1, xi) ⇒

p(xi+1|x1:i) = p(x1:(i+1)) / p(x1:i)
= [∑_{x(i+2):n} p(x)] / [∑_{x(i+1):n} p(x)]
= [∑_{x(i+2):n} ∏_{j=1}^{n−1} F(xj+1, xj)] / [∑_{x(i+1):n} ∏_{j=1}^{n−1} F(xj+1, xj)]
= [∏_{j=1}^{i} F(xj+1, xj) · ∑_{x(i+2):n} ∏_{j=i+1}^{n−1} F(xj+1, xj)] / [∏_{j=1}^{i−1} F(xj+1, xj) · ∑_{x(i+1):n} ∏_{j=i}^{n−1} F(xj+1, xj)]
(the first factor in the numerator is func(x1:(i+1)); after summing, the second is func(xi+1); similarly, the denominator is func(x1:i) times func(xi))
= F(xi+1, xi) · func(xi+1) / func(xi) =: g(xi+1, xi)
⇒ p(xi+1|x1:i) = p(xi+1|xi)
Markov Random Fields
Proof of the Hammersley-Clifford Theorem
Assume p(x) > 0 ∀x.
Proving “p is Gibbs w.r.t. G ⇒ p is MRF w.r.t. G”.
p is Gibbs w.r.t. G ⇒ p(x) = ∏_{c∈C} Fc(xc) for some {Fc}c∈C ⇒

p(xs|sx) = [∏_{c∈C} Fc(xc)] / [∑_{xs} ∏_{c∈C} Fc(xc)]
= [∏_{c∈C: s∉c} Fc(xc) · ∏_{c∈C: s∈c} Fc(xc)] / [∏_{c∈C: s∉c} Fc(xc) · ∑_{xs} ∏_{c∈C: s∈c} Fc(xc)]
= [∏_{c∈C: s∈c} Fc(xc)] / [∑_{xs} ∏_{c∈C: s∈c} Fc(xc)]
= func(xs, xηs) / func(xηs) = g(xs, xηs) = p(xs|xηs)

(recall that p(xs|sx) = g(xs, xηs) implies that p(xs|sx) = p(xs|xηs))
⇒ p is an MRF w.r.t. G.
The other direction is hard; we omit the proof (cf. Winkler’s book if interested).
Markov Random Fields
Marginals and Posteriors
Suppose we divide S into “Unobservable” (AKA hidden/latent) and “Observable” sites:
S = A ∪ B, A ∩ B = ∅
x = xS = (xA, yB)
Example
Of interest are the statistical structures of p(xA|yB) and p(yB).
Markov Random Fields
Fact (equivalent characterizations of MRFs)
Let p(x) > 0 ∀x. Let MP stand for “Markov Property”. If A ⊂ S, then ∂A = Aᶜ ∩ ⋃_{s∈A} ηs is called the Markov Blanket of A. Let Ā = A ∪ ∂A.
The following are equivalent:
1 p(xs|sx) = p(xs|xηs) ∀s ∈ S (i.e., our original definition of an MRF)
2 p is Gibbs w.r.t. G
3 global MP: A, B, C ⊂ S are disjoint and C separates* A and B ⇒ xA ⊥⊥ xB | xC
4 setwise local MP: A ⊂ S ⇒ xA ⊥⊥ xS\Ā | x∂A
5 local MP: s ∈ S ⇒ xs ⊥⊥ xS\({s}∪ηs) | xηs
6 pairwise MP: s, t ∈ S, s ∉ ηt ⇒ xs ⊥⊥ xt | xS\{s,t}

*I.e., for every s ∈ A and t ∈ B, any path in G between s and t passes through some q ∈ C
Markov Random Fields
Posteriors
For every clique c (more generally, any subset of S), we have
c = S ∩ c = (A ∪ B) ∩ c = (A ∩ c) ∪ (B ∩ c)
so we can write Fc(xc) = Fc(xA∩c, yB∩c)
We have
p(xA|yB) = p(xA, yB) / p(yB) = p(x) / p(yB) = [∏_{c∈C} Fc(xA∩c, yB∩c)] / p(yB)
= [∏_{c∈C: c∩A ≠ ∅} Fc(xA∩c, yB∩c) · ∏_{c∈C: c∩A = ∅} Fc(yB∩c)] / p(yB)
∝ ∏_{c∈C: c∩A ≠ ∅} Fc(xA∩c, yB∩c) = ∏_{c∈C: c∩A ≠ ∅} Fc(xA∩c)
(in the last step the fixed values yB∩c are absorbed into the functions Fc)
Markov Random Fields
Posteriors
p(xA|yB) ∝ ∏_{c∈C: c∩A ≠ ∅} Fc(xA∩c)
⇒ p(xA|yB) is Gibbs w.r.t. GA (i.e., G restricted to A).
⇒ p(xA|yB) is an MRF w.r.t. GA.
In words: conditioning on a subset of an MRF yields another (somewhat simpler/smaller) MRF.
Example (Hidden Markov Model (HMM))
Here p(xA|yB) is an MRF w.r.t. a linear graph (i.e., Markov Chain).
Markov Random Fields
Marginals
p(yB) = ∑_{xA} p(xA, yB) = ∑_{xA} ∏_{c∈C} Fc(xA∩c, yB∩c)
Example (y1 and y2 are conditionally independent but not independent)
p(x1, x2, y1, y2) = F12(x1, x2) G1(x1, y1) G2(x2, y2) ⇒
p(y1, y2) = ∑_{x1,x2} F12(x1, x2) G1(x1, y1) G2(x2, y2) = G12(y1, y2)
which typically ≠ H1(y1)H2(y2), so y1 ⊥̸⊥ y2 even though y1 ⊥⊥ y2 | x1, x2
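This example is easy to check numerically with hypothetical positive factors F12, G1, G2 (random tables standing in for the functions in the slide):

```python
import itertools
import random

random.seed(3)
vals = (0, 1)
# Hypothetical positive factors of the stated form.
F12 = {(a, b): random.random() + 0.1 for a in vals for b in vals}
G1 = {(a, c): random.random() + 0.1 for a in vals for c in vals}
G2 = {(b, d): random.random() + 0.1 for b in vals for d in vals}

p = {(a, b, c, d): F12[a, b] * G1[a, c] * G2[b, d]
     for a, b, c, d in itertools.product(vals, repeat=4)}
Z = sum(p.values())
p = {k: v / Z for k, v in p.items()}          # normalize to a pmf

def marg(keep):
    """Marginal pmf over the kept coordinate positions (0=x1, 1=x2, 2=y1, 3=y2)."""
    out = {}
    for k, v in p.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + v
    return out

px = marg((0, 1))
# Conditional independence: p(y1,y2|x1,x2) = p(y1|x1,x2) p(y2|x1,x2) holds exactly.
cond_diff = 0.0
for a, b, c, d in itertools.product(vals, repeat=4):
    lhs = p[a, b, c, d] / px[a, b]
    py1 = sum(p[a, b, c, dd] for dd in vals) / px[a, b]
    py2 = sum(p[a, b, cc, d] for cc in vals) / px[a, b]
    cond_diff = max(cond_diff, abs(lhs - py1 * py2))

# Marginal dependence: p(y1,y2) != p(y1) p(y2) for generic factors.
py12, py1m, py2m = marg((2, 3)), marg((2,)), marg((3,))
dep = max(abs(py12[c, d] - py1m[(c,)] * py2m[(d,)]) for c in vals for d in vals)
```

Here cond_diff is zero up to floating point, while dep is strictly positive for generic random factors, matching the slide's claim.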
Markov Random Fields
Marginals
In fact, more generally, every time we sum out a variable, we create a clique involving all its neighbors (“creating new edges”).
Example
Markov Random Fields
Marginals
p(yB) is an MRF w.r.t. GB (G restricted to B) with, in general, an added edge between s, t ∈ B provided there is a path in G from s to t that goes exclusively through A.
Example (Hidden Markov Model (HMM))
Here p(xA|yB) is an MRF w.r.t. a linear graph (i.e., Markov Chain) whilethe graph for p(yB) is fully connected.
Markov Random Fields
Version Log
11/3/2019, ver 1.01. S7: Changed R>0 to R≥0. S24: Added a sentence. S28: Added a step.
9/3/2019, ver 1.00.