markov random field (mrf)

Markov Random Field

Explained from the View of Probabilistic Graphical Models

SUPPLEMENTS FOR BAYESIAN NETWORKS COURSEYUAN‐KAI WANG, 2011/05/05

1

Goal of This Unit• We have seen that directed graphical models

specify a factorization of the joint distribution over a set of variables into a product of local conditional distributions

• We turn now to the second major class of graphical models that are described by undirected graphs and that again specify both a factorization and a set of conditional independence relations.

• We will talk Markov Random Field (MRF).• No inference algorithms• But more on modeling and energy function

2

Self‐Study Reference

• Source of this unit• Section 8.3 Markov Random Fields, Pattern

Recognition and Machine Learning, C. M. Bishop, 2006.

• Background of this unit• Chapter 8 Graphical Models, Pattern Recognition

and Machine Learning, C. M. Bishop, 2006.• Probabilistic Graphical Models, Yuan‐Kai Wang’s

Lecture Notes for Bayesian Networks Courses, 2011.

3

Contents

1. Background2. Conditional Independence Property3. Factorization Property4. Example: Image De‐noising5. Relation to Directed Graphs

4

1. Background

• We have seen that directed graphical models• Specify a factorization of the joint distribution

over a set of variables• Into a product of local conditional distributions

5

Bayesian Networks

Directed Acyclic Graph (DAG)

6

Bayesian Networks

General Factorization

7

What Is Markov Random Field (MRF)

• A Markov random field (MRF) has a set of • Nodes

• Each node corresponds to a variable or group of variables

• Links• Each connects a pair of nodes.

• The links are undirected• They do not carry arrows

• MRF is also known as• Markov network, or• Undirected graphical model

(Kindermann and Snell, 1980)

8

Why Use MRF for Computer Vision

• Image de‐noising• Image de‐blurring• Image segmentation• Image super‐resolution

9

2. Conditional Independence Property

• In the case of directed graphs, we can test whether a conditional independence (CI) property holds by applying a graphical test called d‐separation.• This involved testing whether or not the paths

connecting two sets of nodes were ‘blocked’.• The CI definition will not apply to MRF and

undirected graphical models (UGMs).• But we will find alternative semantics of CI

property for MRF and UGMs.

10

CI Definition for UGM

• Suppose that in an UGM we identify three sets of nodes, denoted A, B, and C,

• And we consider the CI property

• To test whether CI property is satisfied by a probability distribution defined by a UGM• We consider all possible paths that connect

nodes in set A to nodes in set B through C.

11

An Example of CI• Every path from any node in set A to any node in set B passes through at least one node in set C.

• Consequently the conditional independence property

holds for any probability distribution described by this graph.

12

Markov Blanket

• The Markov blanket for a UGM takes a particularly simple form,• Because a node will be conditionally

independent of all other nodes conditioned only on the neighbouring nodes.

Markov Blanket

13

3. Factorization Property

• It is a factorization rule for UGM corresponding to the conditional independence test.

• What is factorization?• Expressing the joint distribution p(x) as a

product of functions defined over sets of variables that are local to the graph.

• Remember the factorization rule in directed graphs Product of factors

14

The Factorization Rule – Two nodes

• Consider two nodes xi and xj that are not connected by a link• Then these variables must be conditionally

independent given all other nodes in the graph.• There is no direct path between the two nodes.• And all other paths pass through nodes that are

observed, and hence those paths are blocked.

• This CI property can be expressed as

x\{i,j} denotes the set x of all variables with xi and xj removed.

15

The Factorization Rule – All Nodes

• Extend the factorization of two nodes to the joint distribution p(x) of all nodes • It must be the product of a set of factors• Each factor has some nodes Xc={xi … xj} that do

not appear in other factors • In order for the CI property to hold for all possible

distributions belonging to the graph.

,

16

Clique

• How to find the set of {xc}?• We need to consider a graph terminology:

clique• It is a subset of the nodes in a graph such that

there exists a link between all pairs of nodes in the subset.

• The set of nodes in a clique is fully connected.

,

17

Cliques and Maximal Cliques

• A maximal clique is a clique that • It is not possible to include any other nodes

from the graph to the set without it ceasing to be a clique. Clique

Maximal Clique

18

An Example of Clique

• This graph has five cliques of two nodes• {x1, x2}, {x2, x3}, {x3, x4}, {x4, x2}, {x1, x3}

• It has two maximal cliques • {x1, x2, x3}, {x2, x3, x4}

• The set {x1, x2, x3, x4} is not a clique because of the missing linkfrom x1 to x4.

Clique

Maximal Clique

19

Factorization by Maximal Clique

• We can define the factors in the decomposition of the joint distribution to be functions of the variables in the cliques.

• The set of nodes xC is a clique• In fact, we can consider functions of the

maximal cliques, without loss of generality,• Because other cliques must be subsets of maximal

cliques.• The set of nodes xC is a maximal clique

,

20

The Factorization Rule

• Denote a clique by C and the set of variables in that clique by xC.

• The joint distribution p(x) is written as a product of potential functions C(xC) over the maximal cliques of the graph

• The quantity Z, sometimes called partition function, is a normalization constant and is given by

21

Why Not Probability Function for the Factorization Rule (1/2)

• Why potential function

but not probability function?• In directed graphs, each factor represents the

conditional distribution corresponding to its parents.

• But here we do not restrict the choice of potential functions to a specific probabilistic distribution.

22

Why Not Probability Function for the Factorization Rule (2/2)

• Why potential function

but not probability function?• It is all for flexibility• We can define any function as we want• But a little restriction (compared to probability)

has still to be made for potential function C(xC).

• And note that the p(x) is still a probability function

23

The Potential Function

• Potential function C(xC)• C(xC) 0, to ensure that p(x) 0.• Therefore it is usually convenient to express

them as exponentials

• E(xC) is called an energy function, and the exponential representation is called the Boltzmann distribution.

24

Energy Function

• The joint distribution is defined as the product of potentials

• The total energy is obtained by adding the energies of each of the maximal cliques.

1ψ

1exp

1exp

1exp

25

Total Energy Function

• The total energy function can define the property of a MRF

Next, we will use an example, image de‐noising, to illustrate how to design the energy function.

26

4. Illustration: Image De‐Noising

Original Image Noisy Image

De‐Noising (Noise removal, Recover)

(binary image)Image Capture

27

Binary Image

• Let the observed noisy image be described by an array of binary pixel values yi {−1,+1}, where the index i = 1, . . . ,Druns over all pixels.

28

Simulation: generate noisy images

• We have the original noise‐free image, described by binary pixel values xi{−1,+1}.

• We randomly flipping the sign of pixels with some small probability, say 10%.

Simulation

29

Modelling

• Because the noise level is small, • There is a strong correlation between xi and yi.

• We also know that • Neighbouring pixels

xi and xj in an image are strongly correlated.

(noisy pixel)

(noise‐free pixel)

xj

30

Modelling ‐ Cliques

• This graph has two types of cliques, each of which contains two variables.• { xi , yi }• { xi , xj }

• We need to definetwo energy functionsfor the two cliques.

xj

31

Modelling – Energy Function (1/2)• { xi , yi } energy function

• Expresses the correlationbetween these variables

• −xiyi• is a positive constant

• Why?• Remember that

• A lower energy encouraging a higher probability

• Low energy when xi and yi have the same sign• Higher energy when they have the opposite sign

32

Modelling – Energy Function (2/2)

• { xi , xj } energy function• Expresses the correlationbetween these variables

• −xixj• is a positive constant

• Why?• Low energy when xi and xj have the same sign• Higher energy when they have the opposite sign.

xj

33

Modelling ‐ Total Energy Function (1/2)

, = { },

,

34

Modelling ‐ Total Energy Function (2/2)

• The complete energy function for the model

(noisy pixel)

(noise‐free pixel)

• We add an extra term hxi for each pixel i in the noise‐free image.

• It has the effect of • Biasing the model towards pixel values that have one particular sign in preference to the other

,

,

35

Modelling – Final Representation

,1exp ,

,

,

= { },

36

Two Algorithms for Solutions

• How to find solution of

• Iterated Conditional Modes (ICM)• Proposed by Kittler & Foglein, 1984• Simply a coordinate‐wise gradient ascent algorithm• Local maximum solution• Description in Wikipedia

• Graph Cuts• Guaranteed to find the global maximum solution• Description in Wikipedia

37

Image De‐Noising ‐ ICM

Noisy Image

Restored Image (ICM)

38

Image De‐Noising – Graph Cuts

Restored Image (Graph cuts)

Restored Image (ICM)

39

5. Relation to Directed Graphs

• We have introduced two graphical frameworks for representing probability distributions, corresponding to directed and undirected graphs

• It is instructive to discuss the relation between these.

• Details is TBU(To Be Updated)

40

Converting Directed to Undirected Graphs (1)

41

Converting Directed to Undirected Graphs (2)Additional links

42

Directed vs. Undirected Graphs (1)

43

Directed vs. Undirected Graphs (2)

44

markov random field (mrf)

Education