An Introduction to Variational Methods for Graphical Models
Introduction (1)
Problem of probabilistic inference: H : set of hidden nodes; E : set of evidence nodes
P(E) : likelihood
Goal: provide a satisfactory solution to inference and learning in cases where the time or space complexity of the exact calculation is unacceptable
$$P(H \mid E) = \frac{P(H, E)}{P(E)}$$
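To make the inference problem concrete, here is a minimal Python sketch (not from the slides) that computes $P(H \mid E)$ by brute-force enumeration; the joint table is a hypothetical toy example:

```python
# A minimal sketch: posterior inference by enumeration over hidden variables.
# The joint table below is hypothetical; real models define P(H, E) via a graph.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}  # P(h, e)

e = 1                                         # observed evidence value
p_e = sum(joint[(h, e)] for h in (0, 1))      # likelihood P(E = e)
posterior = {h: joint[(h, e)] / p_e for h in (0, 1)}  # P(H | E) = P(H, E) / P(E)
print(p_e, posterior)
```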
Introduction (2)
Variational methods provide an approach to the design of approximate inference algorithms: deterministic approximation procedures
Intuition
“Complex graphs can be probabilistically simple.”
Exact Inference (1)
Overview of exact inference for graphical models
Junction Tree Algorithm
Directed graphical model (Bayesian network): joint probability distribution over all of the N nodes ($S_{\pi(i)}$ : the parents of node $S_i$)
$$P(S) = P(S_1, S_2, \ldots, S_N) = \prod_{i=1}^{N} P(S_i \mid S_{\pi(i)})$$
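A minimal sketch of this factorization, assuming a hypothetical three-node chain $S_1 \to S_2 \to S_3$ with binary nodes and invented conditional tables:

```python
# Joint probability as a product of local conditionals: P(S) = prod_i P(S_i | parents).
p_s1 = {1: 0.6, 0: 0.4}                       # P(S1)
p_s2 = {(1, 1): 0.7, (1, 0): 0.2}             # P(S2 = 1 | S1)
p_s3 = {(1, 1): 0.9, (1, 0): 0.1}             # P(S3 = 1 | S2)

def cond(table, value, parent):
    """P(node = value | parent), from a table of P(node = 1 | parent)."""
    p1 = table[(1, parent)]
    return p1 if value == 1 else 1.0 - p1

def joint(s1, s2, s3):
    return p_s1[s1] * cond(p_s2, s2, s1) * cond(p_s3, s3, s2)

print(joint(1, 1, 0))   # P(S1=1, S2=1, S3=0)
```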
Exact Inference (2)
Undirected graphical model
potential function $\phi_i(C_i)$ : defined on the set of configurations of a clique, associates a positive real number with each configuration
joint probability distribution:
$$P(S) = \frac{1}{Z} \prod_{i=1}^{M} \phi_i(C_i), \qquad Z = \sum_{\{S\}} \prod_{i=1}^{M} \phi_i(C_i)$$
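A minimal sketch, assuming a hypothetical two-clique model over binary nodes A, B, C with invented potentials; it computes Z by explicit summation:

```python
# Undirected model: P(S) = (1/Z) * prod_i phi_i(C_i), Z summed over all states.
from itertools import product

phi_ab = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}   # clique {A, B}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # clique {B, C}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

Z = sum(unnormalized(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(Z, unnormalized(1, 1, 1) / Z)   # partition function and P(A=1, B=1, C=1)
```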
Exact Inference (3)
Junction Tree Algorithm
Moralization step: compiles the directed graphical model into an undirected graphical model
Triangulation step: input : moral graph; output : undirected graph in which additional edges have been added that allow recursive calculation of probabilities to take place
The triangulated graph is compiled into a data structure known as a junction tree
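A minimal sketch of the moralization step alone, using a plain dict-of-sets graph representation invented here (the full algorithm also requires triangulation and junction tree construction):

```python
# Moralization: "marry" the parents of each node, then drop edge directions.
def moralize(parents):
    """parents: node -> set of parent nodes. Returns an undirected adjacency."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {n: set() for n in nodes}
    for child, ps in parents.items():
        for p in ps:                       # keep each original edge, undirected
            adj[child].add(p)
            adj[p].add(child)
        for u in ps:                       # connect ("marry") co-parents
            for v in ps:
                if u != v:
                    adj[u].add(v)
    return adj

# Hypothetical v-structure A -> C <- B: moralization links A and B.
print(moralize({"C": {"A", "B"}}))
```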
Junction Tree
running intersection property: “If a node appears in any two cliques in the tree, it appears in all cliques that lie on the path between the two cliques.”
local consistency implies global consistency
the time complexity of the probabilistic calculation depends on the size of the cliques; for discrete data, the number of values required to represent the potential is exponential in the number of nodes in the clique
The QMR-DT database (1)
large-scale probabilistic database intended to be used as a diagnostic aid
bipartite graphical model: upper layer : diseases; lower layer : symptoms
approximately 600 disease nodes and 4000 symptom nodes
The QMR-DT database (2)
finding : an observed symptom
f : the vector of findings; d : the vector of diseases
all nodes are binary
The QMR-DT database (3)
The joint probability over diseases and findings:
Prior probabilities of the diseases are obtained from archival data.
Conditional probabilities were obtained from expert assessments under a “noisy-OR” model.
$q_{ij}$ are parameters obtained from the expert assessments.
$$P(f, d) = P(f \mid d)\,P(d) = \Big[\prod_j P(f_j \mid d)\Big]\Big[\prod_i P(d_i)\Big]$$
$$P(f_i = 0 \mid d) = (1 - q_{i0}) \prod_{j \in \pi(i)} (1 - q_{ij})^{d_j}$$
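A minimal sketch of the noisy-OR conditional with hypothetical leak and inhibition parameters:

```python
# Noisy-OR: P(f_i = 0 | d) = (1 - q_i0) * prod_{j in pi(i)} (1 - q_ij)^{d_j}.
def p_finding_absent(d, q0, q):
    """d: {disease: 0/1}; q0: leak probability; q: {disease: q_ij}."""
    p = 1.0 - q0
    for j, qij in q.items():
        p *= (1.0 - qij) ** d[j]          # only active diseases (d_j = 1) contribute
    return p

d = {"flu": 1, "cold": 0}                 # hypothetical disease vector
print(p_finding_absent(d, q0=0.05, q={"flu": 0.8, "cold": 0.3}))  # = 0.95 * 0.2
```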
The QMR-DT database (4)
Rewrite the noisy-OR model
Joint probability distribution
Negative findings : benign with respect to the inference problem
Positive findings : cross-product terms couple the diseases; these coupling terms lead to exponential growth in inferential complexity
Diagnostic calculation under the QMR-DT model is generally infeasible.
$$P(f_i = 1 \mid d) = 1 - P(f_i = 0 \mid d)$$
$$P(f_i = 0 \mid d) = e^{-\theta_{i0} - \sum_{j \in \pi(i)} \theta_{ij} d_j}, \qquad \theta_{ij} = -\ln(1 - q_{ij})$$
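A minimal sketch checking the exponential rewrite against the product form, reusing the hypothetical parameters above:

```python
# theta_ij = -ln(1 - q_ij), so P(f_i = 0 | d) = exp(-theta_i0 - sum_j theta_ij d_j).
import math

q0, q = 0.05, {"flu": 0.8, "cold": 0.3}   # same hypothetical parameters as above
d = {"flu": 1, "cold": 0}

theta0 = -math.log(1.0 - q0)
theta = {j: -math.log(1.0 - qij) for j, qij in q.items()}

log_p_absent = -theta0 - sum(theta[j] * d[j] for j in d)
print(math.exp(log_p_absent))             # matches 0.95 * 0.2 from the product form
```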
Neural networks as graphical models
Neural networks: layered graphs endowed with a nonlinear “activation” function at each node
Activation function: bounded between zero and one, e.g. $f(z) = 1/(1 + e^{-z})$
Treat the neural network as a graphical model by associating a binary variable $S_i$ with each node and interpreting the activation of the node as the probability that the associated binary variable takes one of its two values
Neural networks (2)
Example (sigmoid belief network)
$\theta_{ij}$ : parameter associated with the edge between parent node $j$ and node $i$
$\theta_{i0}$ : bias
$$P(S_i = 1 \mid S_{\pi(i)}) = \frac{1}{1 + e^{-\sum_{j \in \pi(i)} \theta_{ij} S_j - \theta_{i0}}}$$
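A minimal sketch of this conditional with hypothetical weights and parent states:

```python
# Sigmoid belief network: P(S_i = 1 | parents) = sigmoid(sum_j theta_ij * S_j + theta_i0).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = {"j1": 1.5, "j2": -2.0}           # hypothetical weights from parents to node i
theta0 = -0.5                             # hypothetical bias of node i
parents = {"j1": 1, "j2": 1}              # parent states

z = theta0 + sum(theta[j] * s for j, s in parents.items())
print(sigmoid(z))                         # P(S_i = 1 | parents)
```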
Neural networks (3)
Exact inference is infeasible in general layered neural network models
a node has as parents all of the nodes in the preceding layer; thus the moralized neural network graph has links between all of the nodes in each layer
hidden units in the penultimate layer become probabilistically dependent, as do their ancestors in the preceding hidden layers
Factorial hidden Markov models (1)
FHMM
composed of a set of M chains
$X_i^{(m)}$ : state node for the mth chain at time i
$A^{(m)}$ : transition matrix for the mth chain
FHMM (2)
Overall transition probability
the effective state space for the FHMM is the Cartesian product of the state spaces associated with the individual chains
represents a large effective state space with a much smaller number of parameters
$$P(X_i \mid X_{i-1}) = \prod_{m=1}^{M} A^{(m)}\big(X_i^{(m)} \mid X_{i-1}^{(m)}\big)$$
FHMM (3)
Emission probabilities of the FHMM (Ghahramani and Jordan)
$B^{(m)}$ and $\Sigma$ : matrices of parameters
the states become stochastically coupled when the outputs are observed
$$P(Y_i \mid X_i) = \mathcal{N}\Big(\sum_{m} B^{(m)} X_i^{(m)},\ \Sigma\Big)$$
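A minimal sketch of the FHMM's factored dynamics with randomly generated hypothetical parameters: transitions factor across chains, and the emission mean is a sum of per-chain contributions:

```python
# FHMM sketch: factored transitions and an additive Gaussian emission mean.
import numpy as np

M, N, D = 2, 3, 2                          # chains, states per chain, output dim
rng = np.random.default_rng(0)             # hypothetical random parameters
A = [rng.dirichlet(np.ones(N), size=N) for _ in range(M)]   # A[m][i, j] = P(j | i)
B = [rng.normal(size=(D, N)) for _ in range(M)]             # emission loadings

def transition_prob(prev, nxt):
    """P(X_i | X_{i-1}) = prod_m A^(m)[prev[m], nxt[m]]."""
    return np.prod([A[m][prev[m], nxt[m]] for m in range(M)])

def emission_mean(state):
    """Mean of the Gaussian emission: sum_m B^(m) one_hot(x^(m))."""
    return sum(B[m][:, state[m]] for m in range(M))

print(transition_prob((0, 2), (1, 1)), emission_mean((1, 1)))
```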
FHMM (4)
Time complexity
N : the number of states in each chain
cliques for the hidden states (in the three-chain example) have size $N^3$, so the time complexity of exact inference is $O(N^3 T)$
triangulation creates cliques of size $N^4$, so the complexity of exact inference becomes $O(N^4 T)$
in general, with M chains : $O(N^{M+1} T)$
Hidden Markov decision trees
HMDT (Hidden Markov Decision Tree): makes decisions in a decision tree conditional not only on the current data point, but also on the decisions at the previous moment in time
Dependency is assumed to be level-specific: the probability of a decision depends only on the previous decision at the same level of the decision tree
Problem: given a sequence of input vectors $U_i$ and a sequence of output vectors $Y_i$, compute the conditional probability distribution over the hidden states
intractable (as for the FHMM)
Basics of variational methodology
Variational methods convert a complex problem into a simpler problem
The simpler problem is generally characterized by a decoupling of the degrees of freedom in the original problem
Decoupling is achieved via an expansion of the problem to include additional parameters (variational parameters) that must be fit to the problem.
Examples (1)
Express the logarithm function variationally:
$$\ln(x) = \min_{\lambda} \{\lambda x - \ln \lambda - 1\}$$
$\lambda$ : variational parameter
Examples (2)
For any given x and all $\lambda$:
$$\ln(x) \le \lambda x - \ln \lambda - 1$$
The variational transformation provides a family of upper bounds on the logarithm; the minimum over these bounds is the exact value of the logarithm.
Pragmatic justification: a nonlinear function is replaced by a linear one; the cost is a free parameter $\lambda$ for each x; if we set $\lambda$ well, we obtain a good bound
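A minimal numerical check of this transformation: every $\lambda$ yields an upper bound, and $\lambda = 1/x$ recovers $\ln(x)$ exactly:

```python
# Variational form of the logarithm: ln(x) = min_lambda { lambda*x - ln(lambda) - 1 }.
import math

def log_bound(x, lam):
    return lam * x - math.log(lam) - 1.0

x = 3.0
for lam in (0.1, 1.0 / x, 1.0):
    print(lam, log_bound(x, lam), ">=", math.log(x))   # tight only at lam = 1/x
```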
Example (3)
For binary-valued nodes, it is common to represent the probability that the node takes one of its values via a monotonic nonlinearity. Example: the logistic regression model
$$f(x) = \frac{1}{1 + e^{-x}}$$
x : weighted sum of the values of the parents of a node
f is neither convex nor concave, so a simple linear bound will not work; but it is log concave:
$$g(x) = -\ln(1 + e^{-x})$$
Example (4)
Bound the log logistic function with linear functions; this bounds the logistic function by an exponential
$$g(x) = \min_{\lambda} \{\lambda x - H(\lambda)\}$$
$H(\lambda)$ : binary entropy function, $H(\lambda) = -\lambda \ln \lambda - (1 - \lambda) \ln(1 - \lambda)$
$$f(x) = \min_{\lambda} e^{\lambda x - H(\lambda)} \le e^{\lambda x - H(\lambda)}$$
A good choice of $\lambda$ provides a better bound.
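A minimal numerical check of this bound; the minimizing $\lambda$ works out to $1 - f(x)$, where the bound is tight:

```python
# Logistic upper bound: f(x) <= exp(lambda*x - H(lambda)) for lambda in (0, 1).
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def H(lam):
    return -lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)

def upper_bound(x, lam):
    return math.exp(lam * x - H(lam))

x = 0.5
for lam in (0.2, 1.0 - f(x), 0.8):
    print(lam, upper_bound(x, lam), ">=", f(x))   # tight only at lam = 1 - f(x)
```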
Example (5)
Significance of the transformation: for conditional probabilities represented with logistic regression, we obtain products of functions of the form $f(x) = 1/(1 + e^{-x})$.
Augmenting the network representation with variational parameters, a bound on the joint probability is obtained by taking products of exponentials.
Convex duality (1)
General fact of convex analysis: a concave function f(x) can be represented via a conjugate or dual function
conjugate function $f^*(\lambda)$:
$$f(x) = \min_{\lambda} \{\lambda^T x - f^*(\lambda)\}, \qquad f^*(\lambda) = \min_{x} \{\lambda^T x - f(x)\}$$
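A minimal numerical sketch of the conjugate for the concave logarithm, minimizing over a grid of x; the closed form $f^*(\lambda) = \ln \lambda + 1$ follows by differentiation:

```python
# Conjugate duality: f*(lambda) = min_x { lambda*x - f(x) }, here with f = ln.
import math

def conjugate_of_log(lam, grid):
    """Numerical f*(lambda) = min_x { lambda*x - ln(x) } over a grid of x."""
    return min(lam * x - math.log(x) for x in grid)

grid = [i / 1000.0 for i in range(1, 20000)]   # x in (0, 20)
lam = 0.5
print(conjugate_of_log(lam, grid), "~", math.log(lam) + 1.0)
```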
Convex duality (2)
Consider f(x) and the linear function $\lambda x$ for a particular $\lambda$: shift $\lambda x$ vertically by an amount equal to the minimum of $\lambda x - f(x)$
This yields an upper-bounding line with slope $\lambda$ that touches f(x) at a single point
Convex duality (3)
The framework of convex duality applies equally well to lower bounds.
Convex duality is not restricted to linear bounds.
Approximation for joint probabilities and conditional probabilities
Directed graphs
Suppose we have a lower bound and an upper bound for each of the local conditional probabilities $P(S_i \mid S_{\pi(i)})$; that is, we have forms $P^U(S_i \mid S_{\pi(i)}, \lambda_i^U)$ and $P^L(S_i \mid S_{\pi(i)}, \lambda_i^L)$.
Let E and H be a disjoint partition of S.
$$P(S) = \prod_i P(S_i \mid S_{\pi(i)}) \le \prod_i P^U(S_i \mid S_{\pi(i)}, \lambda_i^U)$$
$$P(E) = \sum_{\{H\}} P(H, E) \le \sum_{\{H\}} \prod_i P^U(S_i \mid S_{\pi(i)}, \lambda_i^U)$$
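A minimal sketch of this bounding strategy on a hypothetical two-node sigmoid network $S_1 \to S_2$ with evidence $S_2 = 1$: the logistic conditional is replaced by its exponential upper bound from Example (4), and the sum over the hidden node is taken:

```python
# Upper-bounding P(E): swap each logistic conditional for exp(lambda*x - H(lambda)),
# then sum over the hidden node H = {S1}. Parameters below are hypothetical.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def H(lam):
    return -lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)

theta, theta0, p_s1 = 2.0, -1.0, 0.6      # weight, bias, prior P(S1 = 1)

exact = sum(p * sigmoid(theta * s1 + theta0)
            for s1, p in ((0, 1.0 - p_s1), (1, p_s1)))

lam = 0.5                                 # one shared variational parameter
bound = sum(p * math.exp(lam * (theta * s1 + theta0) - H(lam))
            for s1, p in ((0, 1.0 - p_s1), (1, p_s1)))

print(exact, "<=", bound)                 # minimizing over lam tightens the bound
```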
Approximation (2)
Given that the upper bound holds for any settings of the variational parameters $\lambda_i^U$, it holds in particular for the optimizing settings of the parameters.
The right-hand side of the equation is then a function to be minimized with respect to $\lambda_i^U$.
Distinction between joint prob. and marginal prob.
joint probabilities : if we allow the variational parameters to be set optimally for each value of the argument S, then it is possible to find optimizing settings of the variational parameters that recover the exact value of the joint probability
marginal probabilities : generally not able to recover the exact value