An Introduction to Variational Methods for Graphical Models
Introduction (1)
Problem of probabilistic inference: H : set of hidden nodes; E : set of evidence nodes
P(E) : likelihood
Goal: provide a satisfactory solution to inference and learning in cases where the time or space complexity of the exact calculation is unacceptable
$$P(H \mid E) = \frac{P(H, E)}{P(E)}$$
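To make the inference problem concrete, here is a minimal Python sketch (not from the slides) that computes $P(H \mid E)$ by brute-force enumeration; the joint table is a hypothetical toy example:

```python
# A minimal sketch: posterior inference by enumeration over hidden variables.
# The joint table below is hypothetical; real models define P(H, E) via a graph.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}  # P(h, e)

e = 1                                         # observed evidence value
p_e = sum(joint[(h, e)] for h in (0, 1))      # likelihood P(E = e)
posterior = {h: joint[(h, e)] / p_e for h in (0, 1)}  # P(H | E) = P(H, E) / P(E)
print(p_e, posterior)
```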
Introduction (2)
Variational methods provide an approach to the design of approximate inference algorithms: deterministic approximation procedures
Intuition
“Complex graphs can be probabilistically simple.”
Exact Inference (1)
Overview of exact inference for graphical models
Junction Tree Algorithm
Directed graphical model (Bayesian network): joint probability distribution over all of the N nodes ($S_{\pi(i)}$ : the parents of node $S_i$)
$$P(S) = P(S_1, S_2, \ldots, S_N) = \prod_{i=1}^{N} P(S_i \mid S_{\pi(i)})$$
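A minimal sketch of this factorization, assuming a hypothetical three-node chain $S_1 \to S_2 \to S_3$ with binary nodes and invented conditional tables:

```python
# Joint probability as a product of local conditionals: P(S) = prod_i P(S_i | parents).
p_s1 = {1: 0.6, 0: 0.4}                       # P(S1)
p_s2 = {(1, 1): 0.7, (1, 0): 0.2}             # P(S2 = 1 | S1)
p_s3 = {(1, 1): 0.9, (1, 0): 0.1}             # P(S3 = 1 | S2)

def cond(table, value, parent):
    """P(node = value | parent), from a table of P(node = 1 | parent)."""
    p1 = table[(1, parent)]
    return p1 if value == 1 else 1.0 - p1

def joint(s1, s2, s3):
    return p_s1[s1] * cond(p_s2, s2, s1) * cond(p_s3, s3, s2)

print(joint(1, 1, 0))   # P(S1=1, S2=1, S3=0)
```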
Exact Inference (2)
Undirected graphical model
potential function $\phi_i(C_i)$ : defined on the set of configurations of a clique, associates a positive real number with each configuration
joint probability distribution:
$$P(S) = \frac{1}{Z} \prod_{i=1}^{M} \phi_i(C_i), \qquad Z = \sum_{\{S\}} \prod_{i=1}^{M} \phi_i(C_i)$$
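A minimal sketch, assuming a hypothetical two-clique model over binary nodes A, B, C with invented potentials; it computes Z by explicit summation:

```python
# Undirected model: P(S) = (1/Z) * prod_i phi_i(C_i), Z summed over all states.
from itertools import product

phi_ab = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}   # clique {A, B}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # clique {B, C}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]

Z = sum(unnormalized(a, b, c) for a, b, c in product((0, 1), repeat=3))
print(Z, unnormalized(1, 1, 1) / Z)   # partition function and P(A=1, B=1, C=1)
```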
Exact Inference (3)
Junction Tree Algorithm
Moralization step: compiles the directed graphical model into an undirected graphical model
Triangulation step: input : moral graph; output : undirected graph in which additional edges have been added that allow recursive calculation of probabilities to take place
The triangulated graph is compiled into a data structure known as a junction tree
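A minimal sketch of the moralization step alone, using a plain dict-of-sets graph representation invented here (the full algorithm also requires triangulation and junction tree construction):

```python
# Moralization: "marry" the parents of each node, then drop edge directions.
def moralize(parents):
    """parents: node -> set of parent nodes. Returns an undirected adjacency."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {n: set() for n in nodes}
    for child, ps in parents.items():
        for p in ps:                       # keep each original edge, undirected
            adj[child].add(p)
            adj[p].add(child)
        for u in ps:                       # connect ("marry") co-parents
            for v in ps:
                if u != v:
                    adj[u].add(v)
    return adj

# Hypothetical v-structure A -> C <- B: moralization links A and B.
print(moralize({"C": {"A", "B"}}))
```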
Junction Tree
running intersection property: “If a node appears in any two cliques in the tree, it appears in all cliques that lie on the path between the two cliques.”
local consistency implies global consistency
the time complexity of the probabilistic calculation depends on the size of the cliques; for discrete data, the number of values required to represent the potential is exponential in the number of nodes in the clique
The QMR-DT database (1)
large-scale probabilistic database intended to be used as a diagnostic aid
bipartite graphical model: upper layer : diseases; lower layer : symptoms
approximately 600 disease nodes and 4000 symptom nodes
The QMR-DT database (2)
finding : an observed symptom
f : the vector of findings; d : the vector of diseases
all nodes are binary
The QMR-DT database (3)
The joint probability over diseases and findings:
Prior probabilities of the diseases are obtained from archival data.
Conditional probabilities were obtained from expert assessments under a “noisy-OR” model.
$q_{ij}$ are parameters obtained from the expert assessments.
$$P(f, d) = P(f \mid d)\,P(d) = \Big[\prod_j P(f_j \mid d)\Big]\Big[\prod_i P(d_i)\Big]$$
$$P(f_i = 0 \mid d) = (1 - q_{i0}) \prod_{j \in \pi(i)} (1 - q_{ij})^{d_j}$$
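A minimal sketch of the noisy-OR conditional with hypothetical leak and inhibition parameters:

```python
# Noisy-OR: P(f_i = 0 | d) = (1 - q_i0) * prod_{j in pi(i)} (1 - q_ij)^{d_j}.
def p_finding_absent(d, q0, q):
    """d: {disease: 0/1}; q0: leak probability; q: {disease: q_ij}."""
    p = 1.0 - q0
    for j, qij in q.items():
        p *= (1.0 - qij) ** d[j]          # only active diseases (d_j = 1) contribute
    return p

d = {"flu": 1, "cold": 0}                 # hypothetical disease vector
print(p_finding_absent(d, q0=0.05, q={"flu": 0.8, "cold": 0.3}))  # = 0.95 * 0.2
```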
The QMR-DT database (4)
Rewrite the noisy-OR model
Joint probability distribution
Negative findings : benign with respect to the inference problem
Positive findings : cross-product terms couple the diseases; these coupling terms lead to exponential growth in inferential complexity
Diagnostic calculation under the QMR-DT model is generally infeasible.
$$P(f_i = 1 \mid d) = 1 - P(f_i = 0 \mid d)$$
$$P(f_i = 0 \mid d) = e^{-\theta_{i0} - \sum_{j \in \pi(i)} \theta_{ij} d_j}, \qquad \theta_{ij} = -\ln(1 - q_{ij})$$
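A minimal sketch checking the exponential rewrite against the product form, reusing the hypothetical parameters above:

```python
# theta_ij = -ln(1 - q_ij), so P(f_i = 0 | d) = exp(-theta_i0 - sum_j theta_ij d_j).
import math

q0, q = 0.05, {"flu": 0.8, "cold": 0.3}   # same hypothetical parameters as above
d = {"flu": 1, "cold": 0}

theta0 = -math.log(1.0 - q0)
theta = {j: -math.log(1.0 - qij) for j, qij in q.items()}

log_p_absent = -theta0 - sum(theta[j] * d[j] for j in d)
print(math.exp(log_p_absent))             # matches 0.95 * 0.2 from the product form
```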
Neural networks as graphical models
Neural networks: layered graphs endowed with a nonlinear “activation” function at each node
Activation function: bounded between zero and one, e.g. $f(z) = 1/(1 + e^{-z})$
Treat the neural network as a graphical model by associating a binary variable $S_i$ with each node and interpreting the activation of the node as the probability that the associated binary variable takes one of its two values
Neural networks (2)
Example (sigmoid belief network)
$\theta_{ij}$ : parameter associated with the edge between parent node $j$ and node $i$
$\theta_{i0}$ : bias
$$P(S_i = 1 \mid S_{\pi(i)}) = \frac{1}{1 + e^{-\sum_{j \in \pi(i)} \theta_{ij} S_j - \theta_{i0}}}$$
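A minimal sketch of this conditional with hypothetical weights and parent states:

```python
# Sigmoid belief network: P(S_i = 1 | parents) = sigmoid(sum_j theta_ij * S_j + theta_i0).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = {"j1": 1.5, "j2": -2.0}           # hypothetical weights from parents to node i
theta0 = -0.5                             # hypothetical bias of node i
parents = {"j1": 1, "j2": 1}              # parent states

z = theta0 + sum(theta[j] * s for j, s in parents.items())
print(sigmoid(z))                         # P(S_i = 1 | parents)
```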
Neural networks (3)
Exact inference is infeasible in general layered neural network models
a node has as parents all of the nodes in the preceding layer; thus the moralized neural network graph has links between all of the nodes in each layer
hidden units in the penultimate layer become probabilistically dependent, as do their ancestors in the preceding hidden layers
Factorial hidden Markov models (1)
FHMM
composed of a set of M chains
$X_i^{(m)}$ : state node for the mth chain at time i
$A^{(m)}$ : transition matrix for the mth chain
FHMM (2)
Overall transition probability
the effective state space for the FHMM is the Cartesian product of the state spaces associated with the individual chains
represents a large effective state space with a much smaller number of parameters
$$P(X_i \mid X_{i-1}) = \prod_{m=1}^{M} A^{(m)}\big(X_i^{(m)} \mid X_{i-1}^{(m)}\big)$$
FHMM (3)
Emission probabilities of the FHMM (Ghahramani and Jordan)
$B^{(m)}$ and $\Sigma$ : matrices of parameters
the states become stochastically coupled when the outputs are observed
$$P(Y_i \mid X_i) = \mathcal{N}\Big(\sum_{m} B^{(m)} X_i^{(m)},\ \Sigma\Big)$$
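A minimal sketch of the FHMM's factored dynamics with randomly generated hypothetical parameters: transitions factor across chains, and the emission mean is a sum of per-chain contributions:

```python
# FHMM sketch: factored transitions and an additive Gaussian emission mean.
import numpy as np

M, N, D = 2, 3, 2                          # chains, states per chain, output dim
rng = np.random.default_rng(0)             # hypothetical random parameters
A = [rng.dirichlet(np.ones(N), size=N) for _ in range(M)]   # A[m][i, j] = P(j | i)
B = [rng.normal(size=(D, N)) for _ in range(M)]             # emission loadings

def transition_prob(prev, nxt):
    """P(X_i | X_{i-1}) = prod_m A^(m)[prev[m], nxt[m]]."""
    return np.prod([A[m][prev[m], nxt[m]] for m in range(M)])

def emission_mean(state):
    """Mean of the Gaussian emission: sum_m B^(m) one_hot(x^(m))."""
    return sum(B[m][:, state[m]] for m in range(M))

print(transition_prob((0, 2), (1, 1)), emission_mean((1, 1)))
```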
FHMM (4)
Time complexity
N : the number of states in each chain
cliques for the hidden states (in the three-chain example) have size $N^3$, so the time complexity of exact inference is $O(N^3 T)$
triangulation creates cliques of size $N^4$, so the complexity of exact inference becomes $O(N^4 T)$
in general, with M chains : $O(N^{M+1} T)$
Hidden Markov decision trees
HMDT (Hidden Markov Decision Tree): makes decisions in a decision tree conditional not only on the current data point, but also on the decisions at the previous moment in time
Dependency is assumed to be level-specific: the probability of a decision depends only on the previous decision at the same level of the decision tree
Problem: given a sequence of input vectors $U_i$ and a sequence of output vectors $Y_i$, compute the conditional probability distribution over the hidden states
intractable (as for the FHMM)
Basics of variational methodology
Variational methods convert a complex problem into a simpler problem
The simpler problem is generally characterized by a decoupling of the degrees of freedom in the original problem
Decoupling is achieved via an expansion of the problem to include additional parameters (variational parameters) that must be fit to the problem.
Examples (1)
Express the logarithm function variationally:
$$\ln(x) = \min_{\lambda} \{\lambda x - \ln \lambda - 1\}$$
$\lambda$ : variational parameter
Examples (2)
For any given x and all $\lambda$:
$$\ln(x) \le \lambda x - \ln \lambda - 1$$
The variational transformation provides a family of upper bounds on the logarithm; the minimum over these bounds is the exact value of the logarithm.
Pragmatic justification: a nonlinear function is replaced by a linear one; the cost is a free parameter $\lambda$ for each x; if we set $\lambda$ well, we obtain a good bound
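A minimal numerical check of this transformation: every $\lambda$ yields an upper bound, and $\lambda = 1/x$ recovers $\ln(x)$ exactly:

```python
# Variational form of the logarithm: ln(x) = min_lambda { lambda*x - ln(lambda) - 1 }.
import math

def log_bound(x, lam):
    return lam * x - math.log(lam) - 1.0

x = 3.0
for lam in (0.1, 1.0 / x, 1.0):
    print(lam, log_bound(x, lam), ">=", math.log(x))   # tight only at lam = 1/x
```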
Example (3)
For binary-valued nodes, it is common to represent the probability that the node takes one of its values via a monotonic nonlinearity. Example: the logistic regression model
$$f(x) = \frac{1}{1 + e^{-x}}$$
x : weighted sum of the values of the parents of a node
f is neither convex nor concave, so a simple linear bound will not work; but it is log concave:
$$g(x) = -\ln(1 + e^{-x})$$
Example (4)
Bound the log logistic function with linear functions; this bounds the logistic function by an exponential
$$g(x) = \min_{\lambda} \{\lambda x - H(\lambda)\}$$
$H(\lambda)$ : binary entropy function, $H(\lambda) = -\lambda \ln \lambda - (1 - \lambda) \ln(1 - \lambda)$
$$f(x) = \min_{\lambda} e^{\lambda x - H(\lambda)} \le e^{\lambda x - H(\lambda)}$$
A good choice of $\lambda$ provides a better bound.
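A minimal numerical check of this bound; the minimizing $\lambda$ works out to $1 - f(x)$, where the bound is tight:

```python
# Logistic upper bound: f(x) <= exp(lambda*x - H(lambda)) for lambda in (0, 1).
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def H(lam):
    return -lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)

def upper_bound(x, lam):
    return math.exp(lam * x - H(lam))

x = 0.5
for lam in (0.2, 1.0 - f(x), 0.8):
    print(lam, upper_bound(x, lam), ">=", f(x))   # tight only at lam = 1 - f(x)
```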
Example (5)
Significance of the transformation: for conditional probabilities represented with logistic regression, we obtain products of functions of the form $f(x) = 1/(1 + e^{-x})$.
Augmenting the network representation with variational parameters, a bound on the joint probability is obtained by taking products of exponentials.
Convex duality (1)
General fact of convex analysis: a concave function f(x) can be represented via a conjugate or dual function
conjugate function $f^*(\lambda)$:
$$f(x) = \min_{\lambda} \{\lambda^T x - f^*(\lambda)\}, \qquad f^*(\lambda) = \min_{x} \{\lambda^T x - f(x)\}$$
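A minimal numerical sketch of the conjugate for the concave logarithm, minimizing over a grid of x; the closed form $f^*(\lambda) = \ln \lambda + 1$ follows by differentiation:

```python
# Conjugate duality: f*(lambda) = min_x { lambda*x - f(x) }, here with f = ln.
import math

def conjugate_of_log(lam, grid):
    """Numerical f*(lambda) = min_x { lambda*x - ln(x) } over a grid of x."""
    return min(lam * x - math.log(x) for x in grid)

grid = [i / 1000.0 for i in range(1, 20000)]   # x in (0, 20)
lam = 0.5
print(conjugate_of_log(lam, grid), "~", math.log(lam) + 1.0)
```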
Convex duality (2)
Consider f(x) and the linear function $\lambda x$ for a particular $\lambda$: shift $\lambda x$ vertically by an amount equal to the minimum of $\lambda x - f(x)$
This yields an upper-bounding line with slope $\lambda$ that touches f(x) at a single point
Convex duality (3)
The framework of convex duality applies equally well to lower bounds.
Convex duality is not restricted to linear bounds.
Approximation for joint probabilities and conditional probabilities
Directed graphs
Suppose we have a lower bound and an upper bound for each of the local conditional probabilities $P(S_i \mid S_{\pi(i)})$; that is, we have forms $P^U(S_i \mid S_{\pi(i)}, \lambda_i^U)$ and $P^L(S_i \mid S_{\pi(i)}, \lambda_i^L)$.
Let E and H be a disjoint partition of S.
$$P(S) = \prod_i P(S_i \mid S_{\pi(i)}) \le \prod_i P^U(S_i \mid S_{\pi(i)}, \lambda_i^U)$$
$$P(E) = \sum_{\{H\}} P(H, E) \le \sum_{\{H\}} \prod_i P^U(S_i \mid S_{\pi(i)}, \lambda_i^U)$$
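A minimal sketch of this bounding strategy on a hypothetical two-node sigmoid network $S_1 \to S_2$ with evidence $S_2 = 1$: the logistic conditional is replaced by its exponential upper bound from Example (4), and the sum over the hidden node is taken:

```python
# Upper-bounding P(E): swap each logistic conditional for exp(lambda*x - H(lambda)),
# then sum over the hidden node H = {S1}. Parameters below are hypothetical.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def H(lam):
    return -lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)

theta, theta0, p_s1 = 2.0, -1.0, 0.6      # weight, bias, prior P(S1 = 1)

exact = sum(p * sigmoid(theta * s1 + theta0)
            for s1, p in ((0, 1.0 - p_s1), (1, p_s1)))

lam = 0.5                                 # one shared variational parameter
bound = sum(p * math.exp(lam * (theta * s1 + theta0) - H(lam))
            for s1, p in ((0, 1.0 - p_s1), (1, p_s1)))

print(exact, "<=", bound)                 # minimizing over lam tightens the bound
```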
Approximation (2)
Given that the upper bound holds for any settings of the variational parameters $\lambda_i^U$, it holds in particular for the optimizing settings of the parameters.
The right-hand side of the equation is then a function to be minimized with respect to $\lambda_i^U$.
Distinction between joint prob. and marginal prob.
joint probabilities : if we allow the variational parameters to be set optimally for each value of the argument S, then it is possible to find optimizing settings of the variational parameters that recover the exact value of the joint probability
marginal probabilities : generally not able to recover the exact value