
Page 1: Using Graphs to Describe Model Structure

Sargur N. Srihari
[email protected]

Page 2: Topics in Structured PGMs for Deep Learning

0. Overview
1. Challenge of Unstructured Modeling
2. Using Graphs to Describe Model Structure
3. Sampling from Graphical Models
4. Advantages of Structured Modeling
5. Learning about Dependencies
6. Inference and Approximate Inference
7. The Deep Learning Approach to Structured Probabilistic Models
   – Ex: The Restricted Boltzmann Machine

Page 3: Topics in Using Graphs to Describe Model Structure

1. Directed Models
2. Undirected Models
3. The Partition Function
4. Energy-based Models
5. Separation and D-separation
6. Converting between Undirected and Directed Graphs
7. Factor Graphs

Page 4: Graphs to describe model structure

• Model structure is described using graphs:
  – Each node represents a random variable
  – Each edge represents a direct interaction
• These direct interactions imply other, indirect interactions
• But only the direct interactions need be explicitly modeled

Page 5: Types of graphical models

• There is more than one way to describe the interactions in a probability distribution using a graph
• Graphical models can be largely divided into two categories:
  – Models based on directed acyclic graphs
  – Models based on undirected graphs

Page 6: 1. Directed Models

• One type of structured probabilistic model is the directed graphical model
• Also known as a belief network or a Bayesian network
• The term Bayesian is used since the probabilities can be judgmental
  – They usually represent degrees of belief rather than frequencies of events

Page 7: Example of a Directed Graphical Model

• Relay race example: Alice runs first and hands off to Bob, who hands off to Carol
• Bob's finishing time t1 depends on Alice's finishing time t0, and Carol's finishing time t2 depends on Bob's finishing time t1

Page 8: Meaning of directed edges

• Drawing an arrow from a to b means that we define a conditional probability distribution (CPD) over b, with a as one of the variables on the right side of the conditioning bar
  – i.e., the distribution over b depends on the value of a

Page 9: Formal directed graphical model

• A directed graphical model on variables x is defined by a directed acyclic graph G
  – whose vertices are the random variables in the model
• together with a set of local CPDs p(x_i | Pa_G(x_i))
  – where Pa_G(x_i) gives the parents of x_i in G
• The probability distribution over x is given by

  p(\mathbf{x}) = \prod_i p(x_i \mid Pa_G(x_i))

• In the relay race example:
  – p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1)
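To make the factorization concrete, here is a minimal sketch (not from the slides) that evaluates the relay-race joint from CPD tables; the times are discretized to two bins and the numeric values are illustrative assumptions:

```python
# Hypothetical CPD tables over finishing times discretized to {0, 1}
p_t0 = {0: 0.7, 1: 0.3}                        # p(t0)
p_t1_given_t0 = {0: {0: 0.9, 1: 0.1},          # p(t1 | t0)
                 1: {0: 0.2, 1: 0.8}}
p_t2_given_t1 = {0: {0: 0.8, 1: 0.2},          # p(t2 | t1)
                 1: {0: 0.1, 1: 0.9}}

def joint(t0, t1, t2):
    """p(t0,t1,t2) = p(t0) p(t1|t0) p(t2|t1): product of local CPDs."""
    return p_t0[t0] * p_t1_given_t0[t0][t1] * p_t2_given_t1[t1][t2]

# Because each local CPD sums to 1, the joint sums to 1 as well:
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```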

Page 10: Savings achieved by a directed model

• If t0, t1 and t2 are discrete with 100 values each, a single joint table would require 999,999 values
  – By making tables for only the conditional probabilities, we need only 99 + 9,900 + 9,900 = 19,899 values
• To model n discrete variables each having k values, the cost of a single table is O(k^n)
• If m is the maximum number of variables appearing on either side of the conditioning bar in a single CPD, then the cost of the tables for a directed PGM is O(k^m)
  – So long as each variable has few parents in the graph, the distribution can be represented with few parameters
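A quick check of these counts, under the assumption that each distribution stores k − 1 free values (the last is implied by normalization):

```python
# Parameter counts for three variables with k = 100 values each.
k = 100
full_joint = k**3 - 1                      # one table over (t0, t1, t2)
cpds = (k - 1) + k*(k - 1) + k*(k - 1)     # p(t0) + p(t1|t0) + p(t2|t1)
print(full_joint, cpds)                    # 999999 19899
```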

Page 11: 2. Undirected models

• Directed graphical models give us one language for describing structured probabilistic models
• Another language is that of undirected models
  – Synonyms: Markov Random Fields (MRFs), Markov networks
  – They use graphs whose edges are undirected
• Directed models are useful when there is clear directionality
  – Often this is when we understand the causality, and the causality flows in only one direction
• When interactions have no clear directionality, it is more appropriate to use an undirected graph

Page 12: Models without clear direction

• When interactions have no intrinsic direction, or operate in both directions, it is appropriate to use an undirected model
• An example with three binary variables is described next

Page 13: Ex: Undirected Model for Health

• A model over three binary variables:
  – Whether or not you are sick, h_y
  – Whether or not your coworker is sick, h_c
  – Whether or not your roommate is sick, h_r
• Assuming the coworker and roommate do not know each other, it is very unlikely that one of them will give a cold to the other
  – The event is so rare that we do not model it
• There is no clear directionality either
• This motivates using an undirected model

Page 14: The health undirected graph

• You and your roommate may infect each other with a cold
• You and your work colleague may do the same
• Assuming the roommate and colleague do not know each other, they can only get infected through you

Page 15: Undirected graph definition

• If two variables directly interact with each other, then their nodes are connected
• An edge has no arrow and has no CPD
• An undirected PGM is defined on a graph G
  – For each clique C in the graph, a factor φ(C), also called a clique potential, measures the affinity of the variables in C for being in each of their joint states
    • A clique is a subset of nodes that are all connected to each other
  – Together, the factors define an unnormalized distribution

  \tilde{p}(\mathbf{x}) = \prod_{C \in G} \phi(C)
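As a concrete sketch of this product of clique potentials, here is the health model with its two cliques (h_y, h_r) and (h_y, h_c); the numeric affinities are illustrative assumptions chosen only to match the ordering described on Page 18:

```python
# Health model sketch: two cliques, (h_y, h_r) and (h_y, h_c).
# 1 = healthy, 0 = sick. Illustrative affinities, ordered as on Page 18:
# both healthy (highest) > both sick > exactly one sick (lowest).
phi = {(1, 1): 2.0, (1, 0): 0.5, (0, 1): 0.5, (0, 0): 1.0}

def p_tilde(hy, hr, hc):
    """Unnormalized distribution: the product of the two clique potentials."""
    return phi[(hy, hr)] * phi[(hy, hc)]
```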

Page 16: Efficiency of the Unnormalized Distribution

• The unnormalized probability distribution is efficient to work with so long as the cliques are small
• It encodes the idea that states with higher affinity φ(C) are more likely
• Since there is little structure in the definition of the cliques, there is no guarantee that multiplying them together will yield a valid probability distribution

Page 17: Reading factorization information from an undirected graph

• This graph (with five cliques) implies that

  p(a,b,c,d,e,f) = \frac{1}{Z}\,\phi_{a,b}(a,b)\,\phi_{b,c}(b,c)\,\phi_{a,d}(a,d)\,\phi_{b,e}(b,e)\,\phi_{e,f}(e,f)

  – for an appropriate choice of the φ functions
• An example of clique potentials is shown next

Page 18: Ex: Clique potential

• One clique is between h_y and h_c
  – The factor φ(h_y, h_c) for this clique can be defined by a table
• A similar factor is needed for the other clique, between h_y and h_r

Notation: h_y = health of you, h_r = health of roommate, h_c = health of colleague

Table for φ(h_y, h_c): a state of 1 indicates good health, while a state of 0 indicates poor health. Both are usually healthy, so the corresponding state has the highest affinity. The state of only one of the two being sick has the lowest affinity, and the state of both being sick has a higher affinity than that of only one being sick.

Page 19: 3. The partition function

• The unnormalized probability distribution

  \tilde{p}(\mathbf{x}) = \prod_{C \in G} \phi(C)

  – is guaranteed to be non-negative everywhere
  – but is not guaranteed to sum or integrate to 1
• To obtain a valid probability distribution we must use the normalized (or Gibbs) distribution

  p(\mathbf{x}) = \frac{1}{Z}\,\tilde{p}(\mathbf{x}), \qquad Z = \int \tilde{p}(\mathbf{x})\, d\mathbf{x}

  – where Z is the value that causes the distribution to sum or integrate to one
• Z is a constant when the φ functions are held constant
  – If the φ functions have parameters, then Z is a function of those parameters; it is commonly written without its arguments
• Z is known as the partition function, a term borrowed from statistical physics
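Continuing the health-model sketch from Page 15: for small discrete models, Z can be computed by brute-force enumeration, which is exactly the operation that becomes intractable at scale:

```python
from itertools import product

# Clique potentials from the Page 15 sketch (illustrative values).
phi = {(1, 1): 2.0, (1, 0): 0.5, (0, 1): 0.5, (0, 0): 1.0}
p_tilde = lambda hy, hr, hc: phi[(hy, hr)] * phi[(hy, hc)]

# Z sums p_tilde over all joint states -- feasible only when the state space is small.
Z = sum(p_tilde(*s) for s in product((0, 1), repeat=3))

def p(hy, hr, hc):
    """Normalized (Gibbs) distribution."""
    return p_tilde(hy, hr, hc) / Z

print(sum(p(*s) for s in product((0, 1), repeat=3)))  # 1.0
```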

Page 20: Intractability of Z

• Since Z is an integral or sum over all possible joint values of x, it is often intractable to compute
• In order to compute a normalized probability in an undirected model:
  – The model structure and the definitions of the φ functions must be conducive to computing Z efficiently
  – In deep learning applications Z is typically intractable, and we must resort to approximations

Page 21: Choice of factors

• When designing undirected models, it is important to know that for some choices of factors, Z does not exist!
1. If there is a single scalar variable x ∈ ℝ and we choose the single clique potential φ(x) = x², then

   Z = \int x^2\, dx

   • This integral diverges, so there is no probability distribution corresponding to this choice
2. The choice of the parameters of the φ functions can also determine whether the distribution exists
   – For φ(x; β) = exp(−βx²), the β parameter determines whether Z exists
     • Positive β defines a Gaussian distribution over x
     • Other values of β make φ impossible to normalize
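A worked version of the second case, using the standard Gaussian integral, makes the condition on β explicit:

```latex
Z(\beta) = \int_{-\infty}^{\infty} e^{-\beta x^2}\, dx
         = \sqrt{\frac{\pi}{\beta}} \quad \text{for } \beta > 0
```

so for β > 0 the normalized distribution is p(x; β) = √(β/π) exp(−βx²), a Gaussian with mean 0 and variance 1/(2β); for β ≤ 0 the integrand does not decay and the integral diverges, so no distribution exists.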

Page 22: Key difference between BN & MN

• Directed models are:
  – defined directly in terms of probability distributions, from the start
• Undirected models are:
  – defined more loosely, in terms of φ functions that are then converted into probability distributions
• This changes the intuitions needed to work with these models
  – One key idea to keep in mind when working with MNs:
    • The domain of the variables has a dramatic effect on the kind of probability distribution a given set of φ functions corresponds to
  – We will now see how the distributions differ for different domains

Page 23: What distribution does an MN give?

• Consider an n-dimensional random variable x = {x_i}, i = 1,…,n
• And an undirected model parameterized by a vector of biases b
• Suppose we have one clique for each x_i: φ^{(i)}(x_i) = exp(b_i x_i)

  (Graph: isolated nodes x_1, …, x_i, …, x_n, with no edges between them)

• The resulting unnormalized and normalized distributions are

  \tilde{p}(\mathbf{x}) = \prod_i \phi^{(i)}(x_i) = \exp(b_1 x_1 + \dots + b_n x_n), \qquad p(\mathbf{x}) = \frac{1}{Z}\,\tilde{p}(\mathbf{x}), \qquad Z = \int \tilde{p}(\mathbf{x})\, d\mathbf{x}

• What kind of probability distribution is modeled?
• The answer is that we do not have enough information
  – Because we have not specified the domain of x
• Three example domains are:
  1. x ∈ ℝ^n, an n-dimensional vector of real values
  2. x ∈ {0,1}^n, an n-dimensional vector of binary values
  3. The set of elementary basis vectors {[1,0,…,0], [0,1,…,0], …, [0,0,…,1]}

Page 24: Effect of the domain of x on the distribution

• We have n random variables, x = {x_i}, i = 1,…,n
• For each x_i the clique potential is φ^{(i)}(x_i) = exp(b_i x_i)
• What kind of probability distribution is modeled?
  1. If x ∈ ℝ^n, then Z = ∫ \tilde{p}(\mathbf{x}) d\mathbf{x} diverges and no probability distribution exists
  2. If x ∈ {0,1}^n, then p(x) factorizes into n independent distributions with p(x_i = 1) = σ(b_i)
     – Each independent distribution is a Bernoulli with parameter σ(b_i), where

       \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}

  3. If the domain of x is the set of basis vectors {[1,0,…,0], [0,1,…,0], …, [0,0,…,1]}, then p(x) = softmax(b)
     – A large value of b_i drives p(x_j = 1) toward zero for all j ≠ i, giving a multinoulli (multiclass) distribution
• Often, by careful choice of the domain of x, we can obtain complicated behavior from a simple set of φ functions
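A small numeric check of cases 2 and 3, with illustrative bias values of my own choosing: the same potentials give independent Bernoullis on {0,1}^n and a softmax on the one-hot domain.

```python
import math
from itertools import product

b = [0.5, -1.0, 2.0]   # illustrative biases

def p_tilde(x):
    return math.exp(sum(bi * xi for bi, xi in zip(b, x)))

# Case 2 -- domain {0,1}^n: the marginal p(x_0 = 1) equals sigmoid(b_0).
states = list(product((0, 1), repeat=len(b)))
Z = sum(p_tilde(x) for x in states)
p_x0 = sum(p_tilde(x) for x in states if x[0] == 1) / Z
print(abs(p_x0 - 1 / (1 + math.exp(-b[0]))) < 1e-12)   # True

# Case 3 -- domain = one-hot basis vectors: normalizing gives softmax(b).
onehots = [tuple(int(i == j) for j in range(len(b))) for i in range(len(b))]
Z1 = sum(p_tilde(x) for x in onehots)
softmax = [math.exp(bi) / sum(math.exp(bj) for bj in b) for bi in b]
print(all(abs(p_tilde(x) / Z1 - s) < 1e-12 for x, s in zip(onehots, softmax)))  # True
```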

Page 25: 4. Energy-based Models (EBMs)

• Many interesting theoretical results about undirected models depend on the assumption that

  \forall \mathbf{x}, \ \tilde{p}(\mathbf{x}) > 0

• We can enforce this conveniently using an EBM, where

  \tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))

  – E(x) is known as the energy function
  – Because exp(z) > 0 for all z, no energy function can result in a probability of zero for any x
    • If we were to learn the clique potentials directly, we would need constrained optimization to impose a minimum probability value
    • By learning the energy function instead, we can use unconstrained optimization: probabilities can approach 0 but never reach it

Page 26: Boltzmann Machine Terminology

• Any distribution of the form

  \tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))

  is referred to as a Boltzmann distribution
• For this reason, many energy-based models are referred to as Boltzmann machines
  – There is no consensus on when to call a model an energy-based model and when to call it a Boltzmann machine
• The term Boltzmann machine originally referred only to models with binary variables
  – Today, mean-covariance restricted Boltzmann machines deal with real-valued variables as well
• The term Boltzmann machine now tends to refer to models with latent variables, while those without latent variables are referred to as MRFs or log-linear models

Page 27: Cliques, factors and energy

• Cliques in the undirected graph correspond to factors in the unnormalized probability function
• Because exp(a)exp(b) = exp(a + b), different cliques in the undirected graph also correspond to different terms of the energy function
  – i.e., an energy-based model is a special kind of Markov network
  – Exponentiation makes each term of the energy function correspond to the factor for a different clique
• Reading the form of the energy function from an undirected graph is shown next

Page 28: Graph and Corresponding Energy

• This graph (with five cliques) implies that

  E(a,b,c,d,e,f) = E_{a,b}(a,b) + E_{b,c}(b,c) + E_{a,d}(a,d) + E_{b,e}(b,e) + E_{e,f}(e,f)

• We can obtain the φ functions by setting each φ to the exponential of the corresponding negative energy term, e.g., φ_{a,b}(a,b) = exp(−E_{a,b}(a,b))
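A sketch of this correspondence in code: summing per-clique energies and exponentiating the negative total gives the same value as multiplying the per-clique factors. The pairwise energy values here are illustrative placeholders:

```python
import math

cliques = [("a", "b"), ("b", "c"), ("a", "d"), ("b", "e"), ("e", "f")]
# Illustrative pairwise energy table, shared by all cliques for brevity.
E_clique = {c: {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): -0.5} for c in cliques}

def energy(x):
    """Total energy: the sum of per-clique energy terms."""
    return sum(E_clique[(u, v)][(x[u], x[v])] for u, v in cliques)

def p_tilde(x):
    """Product-of-factors form, with phi = exp(-E) per clique."""
    prod = 1.0
    for u, v in cliques:
        prod *= math.exp(-E_clique[(u, v)][(x[u], x[v])])
    return prod

x = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 1, "f": 1}
print(abs(p_tilde(x) - math.exp(-energy(x))) < 1e-12)  # True
```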

Page 29: Energy-based Model as Experts

• An energy-based model with multiple terms in its energy function can be viewed as a product of experts
• Each term corresponds to a factor in the probability distribution
  – Each term determines whether a particular soft constraint is satisfied
  – Each expert may impose only one constraint, concerning a low-dimensional projection of the random variables
    • When combined by multiplication of probabilities, the experts together enforce a complicated high-dimensional constraint

Page 30: Role of the negative sign in energy

• The negative sign in

  \tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))

  serves no functional purpose from a machine learning perspective
• The sign could be incorporated into the definition of the energy function
• It is there mainly for compatibility with the physics literature
• Some machine learning researchers omit the negative sign and refer to the negative energy as harmony

Page 31: Free Energy instead of Probability

• Many algorithms that operate on probabilistic models do not need to compute p_model(x), but only

  \log \tilde{p}_{\text{model}}(\mathbf{x})

• For energy-based models with latent variables h, these algorithms are phrased in terms of the negative of this quantity, called the free energy:

  \mathcal{F}(\mathbf{x}) = -\log \sum_{\mathbf{h}} \exp\big(-E(\mathbf{x}, \mathbf{h})\big)

• Deep learning prefers this formulation
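A sketch of computing the free energy by enumerating binary latent states, with a toy energy function of my own choosing; the log-sum-exp uses max-subtraction for numerical stability:

```python
import math
from itertools import product

def E(x, h):
    """Toy energy over x in R^2 and h in {0,1}^2 (illustrative only)."""
    return -(x[0] * h[0] + x[1] * h[1]) + 0.5 * (h[0] + h[1])

def free_energy(x, n_h=2):
    """F(x) = -log sum_h exp(-E(x,h)), via stabilized log-sum-exp."""
    vals = [-E(x, h) for h in product((0, 1), repeat=n_h)]
    m = max(vals)
    return -(m + math.log(sum(math.exp(v - m) for v in vals)))

print(free_energy((1.0, -0.5)))
```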

Page 32: RBM as an Energy Model

Binary version of the RBM:
• Observed layer: a set of n_v binary random variables, v
• Latent (hidden) layer: a set of n_h binary random variables, h
• Its energy function is

  E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^T\mathbf{v} - \mathbf{c}^T\mathbf{h} - \mathbf{v}^T W \mathbf{h}

  where b, c and W are unconstrained, real-valued learnable parameters
• Thus the model is divided into two groups of units, v and h, and the interaction between them is described by the matrix W
• The joint probability distribution is specified by the energy function:

  P(\mathbf{v} = v, \mathbf{h} = h) = \frac{1}{Z}\exp(-E(v, h)), \qquad Z = \sum_v \sum_h \exp(-E(v, h))

  where Z is the partition function
• Since Z is intractable, P(v) is also intractable
• Although P(v) is intractable, the bipartite structure of the RBM has the special property that the conditionals P(h|v) and P(v|h) are factorial and easily computed
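A sketch of the binary RBM's energy and its factorial conditional: from the energy above, p(h_j = 1 | v) = σ(c_j + (Wᵀv)_j), one sigmoid per hidden unit (the random initialization is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
b, c = rng.normal(size=n_v), rng.normal(size=n_h)   # visible / hidden biases
W = rng.normal(size=(n_v, n_h))                     # interaction weights

def energy(v, h):
    """E(v,h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

def p_h_given_v(v):
    """Factorial conditional: one Bernoulli parameter per hidden unit."""
    return 1.0 / (1.0 + np.exp(-(c + W.T @ v)))

v = rng.integers(0, 2, size=n_v).astype(float)
print(p_h_given_v(v))
```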

Page 33: 5. Separation and D-Separation

• The edges in a graphical model tell us which variables directly interact
• We often need to know which variables indirectly interact
• Some of these indirect interactions can be enabled or disabled by observing other variables
• More formally, we would like to know which sets of variables are conditionally independent of each other, given the values of other sets of variables

Page 34: Separation in undirected models

• Identifying the conditional independences is very simple in the case of undirected models
  – In this case, conditional independence implied by the graph is called separation
  – A set of variables A is separated from a set of variables B given a third set of variables S if the graph structure implies that A is independent of B given S
  – If two variables a and b are connected by a path involving only unobserved variables, then those variables are not separated
    • If no path exists between them, or all paths contain an observed variable, then they are separated
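This definition translates directly into a reachability test. Below is a sketch: A is separated from B given S iff every path from A to B passes through an observed node in S, implemented as a BFS that refuses to enter observed nodes. The example graph is my own, chosen to be consistent with the next slide's description:

```python
from collections import deque

def separated(adj, A, B, S):
    """adj: dict node -> set of neighbors; A, B, S: sets of nodes."""
    frontier, seen = deque(A - S), set(A - S)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False                 # found an active (unblocked) path
        for w in adj[u]:
            if w not in seen and w not in S:   # observed nodes block paths
                seen.add(w)
                frontier.append(w)
    return True

# Edges a-b, b-c, a-d, matching the Page 35 description:
adj = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b"}, "d": {"a"}}
print(separated(adj, {"a"}, {"c"}, {"b"}))  # True: b blocks the only path
print(separated(adj, {"a"}, {"d"}, {"b"}))  # False: a - d is still active
```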

Page 35: Separation in undirected graphs

• b is shaded to indicate that it is observed
• b blocks the path from a to c, so a and c are separated given b
• There is an active path from a to d, so a and d are not separated given b

Page 36: Separation in Directed Graphs

• In the context of directed graphs, these separation concepts are called d-separation
  – The "d" stands for "dependence"
• D-separation is defined the same way as separation for undirected graphs:
  – A set of variables A is d-separated from a set of variables B given a third set of variables S if the graph structure implies that A is independent of B given S

Page 37: Examining Active Paths

• Two variables are dependent if there is an active path between them
• They are d-separated if no active path exists between them
• In directed nets, determining whether a path is active is more complicated than in undirected ones
• A guide to identifying active paths in a directed model is given next

Page 38: All active paths of length 2

(Figure: the active paths of length two between random variables a and b. In summary: a chain a→s→b or a←s←b and a common cause a←s→b are active when s is unobserved; a collider a→s←b is active when s, or one of its descendants, is observed.)
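The standard length-2 rules summarized above can be written as a small classifier; the interface below (edge-direction strings) is my own choice for illustration:

```python
def active(edge1, edge2, s_observed, s_descendant_observed=False):
    """edge1 in {'a->s', 'a<-s'}, edge2 in {'s->b', 's<-b'}."""
    collider = (edge1 == "a->s") and (edge2 == "s<-b")   # a -> s <- b
    if collider:
        # A collider is active only if s (or one of its descendants) is observed.
        return s_observed or s_descendant_observed
    # Chains and common causes are active only while s is unobserved.
    return not s_observed

print(active("a->s", "s->b", s_observed=False))  # chain, s unobserved: True
print(active("a<-s", "s->b", s_observed=True))   # common cause, s observed: False
print(active("a->s", "s<-b", s_observed=True))   # collider, s observed: True
```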

Page 39: Reading properties from a graph