
A New Look at the Generalized Distributive Law∗

Payam Pakzad ([email protected])
Venkat Anantharam ([email protected])

June 2001
Berkeley, California

Abstract

In this paper we develop a measure-theoretic version of the Junction Tree algorithm to compute the marginalizations of a product function. We reformulate the problem in a measure-theoretic framework, where the desired marginalizations are viewed as conditional expectations of a product function given certain σ-fields. We generalize the notions of independence and junction trees at the level of these σ-fields and produce algorithms to find or construct a junction tree on a given set of σ-fields. By taking advantage of structure at the atomic level of the sample space, our marginalization algorithm is capable of producing solutions far less complex than GDL-type algorithms (see [1]). Our formalism reduces to the old GDL as a special case, and any GDL marginalization problem can be reexamined with our framework for possible savings in complexity.

1 Introduction

Local message-passing algorithms on graphs have enjoyed much attention in recent years, mainly due to their success in decoding applications. A general framework for these algorithms was introduced by Shafer and Shenoy in [10]. This general framework is widely known as the Junction Tree algorithm in the Artificial Intelligence community. Aji and McEliece [1] gave an equivalent (and, for our purposes, slightly more convenient) framework known as the Generalized Distributive Law (GDL). The Viterbi algorithm, BCJR [3], Belief Propagation [9], the FFT over a finite field, and the Turbo [7] and LDPC [6] decoding algorithms are amongst implementations of the Junction Tree algorithm or GDL¹.

Given a marginalization problem, GDL makes use of the distributivity of the 'product' operation over 'summation' in an underlying semiring to reduce the complexity of the required calculations. In many cases this translates to substantial savings, but as we shall see in this paper, sometimes there is just more structure in the local functions than GDL can efficiently handle. In particular, GDL relies solely on the notion of variables. Any structure at a finer level than that of variables will be ignored by GDL. We illustrate these limitations in the following simple example:

∗This work was supported by grants (ONR/MURI) N00014-1-0637, (NSF) ECS-9873086 and (EPRI/DOD) EPRI-W08333-04.

¹Throughout this paper we will use both names, GDL and the Junction Tree algorithm, interchangeably, but will use the GDL notation from [1] to show connections with this work.

Example 1. Let X and Y be arbitrary real functions on {1, · · · , n}. Let µ(i, j) be a fixed real weight function for i, j ∈ {1, · · · , n}, given by an n×n matrix M with µ(i, j) = M_{i,j}. We would like to calculate the weighted average of X · Y:

  E = ∑_{i=1}^n ∑_{j=1}^n X(i)Y(j)µ(i, j).

The general GDL-type algorithm (assuming no structure on the weight function µ) will suggest the following:

  E = ∑_{i=1}^n X(i) ∑_{j=1}^n Y(j)µ(i, j),

requiring n(n + 1) multiplications and (n − 1)(n + 1) additions. But this is not always the simplest way to calculate E.

Consider a 'luckiest' case, when the matrix M has rank 1, i.e. the weight function µ(i, j) factorizes as f1(i) · f2(j). In this case

  E = (∑_{i=1}^n X(i)f1(i)) (∑_{j=1}^n Y(j)f2(j)),

requiring only 2n + 1 multiplications and 2n − 2 additions.

Suppose next that µ(i, j) does not factorize as above, but the matrix M has a low rank of 2, so that µ(i, j) = f1(i)f2(j) + g1(i)g2(j). Then we can compute E as follows:

  E = (∑_{i=1}^n X(i)f1(i)) (∑_{j=1}^n Y(j)f2(j)) + (∑_{i=1}^n X(i)g1(i)) (∑_{j=1}^n Y(j)g2(j)).

This requires 4n + 2 multiplications and 4n − 3 additions.

Next suppose that the fixed weight matrix M is sparse, for example with µ(i, j) = m_i δ(i + j − n − 1). Then

  E = ∑_{i=1}^n m_i X(i) Y(n + 1 − i),

requiring only 2n multiplications and n − 1 additions.
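The savings in Example 1 are easy to check numerically. The following sketch (with an illustrative size n and random data, none of it from the paper) evaluates E by the direct GDL formula and by the rank-1 and sparse shortcuts, and confirms the answers agree:

```python
import random

# Illustrative instance of Example 1 (0-based indices): X, Y arbitrary,
# weights mu(i, j) given by an n x n matrix M.
n = 6
random.seed(0)
X = [random.random() for _ in range(n)]
Y = [random.random() for _ in range(n)]

# 'Luckiest' case: M has rank 1, mu(i, j) = f1(i) * f2(j).
f1 = [random.random() for _ in range(n)]
f2 = [random.random() for _ in range(n)]
M = [[f1[i] * f2[j] for j in range(n)] for i in range(n)]

# Direct GDL evaluation: E = sum_i X(i) sum_j Y(j) mu(i, j)
# -- n(n+1) multiplications.
E_gdl = sum(X[i] * sum(Y[j] * M[i][j] for j in range(n)) for i in range(n))

# Factored evaluation -- only 2n + 1 multiplications.
E_rank1 = (sum(X[i] * f1[i] for i in range(n))
           * sum(Y[j] * f2[j] for j in range(n)))

# Sparse case: mu(i, j) = m_i when i + j = n - 1 (0-based), else 0.
m = [random.random() for _ in range(n)]
Ms = [[m[i] if i + j == n - 1 else 0.0 for j in range(n)] for i in range(n)]
E_gdl_sp = sum(X[i] * sum(Y[j] * Ms[i][j] for j in range(n)) for i in range(n))
E_sparse = sum(m[i] * X[i] * Y[n - 1 - i] for i in range(n))  # 2n mults
```

All three shortcuts compute the same value as the direct evaluation; only the operation counts differ.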

From this example it is evident that there are families of marginalization problems for which the GDL treatment is insufficient to produce the best method of calculation.

In this paper we introduce a probability-theoretic framework which eliminates the explicit use of 'variables' to represent the states of the data. Specifically, we replace GDL's concept of 'local domains' with σ-fields in an appropriate sample space. We also replace GDL's 'local kernels' with random variables measurable with respect to those σ-fields. The marginalization problem is then, naturally, replaced by taking the conditional expectation given a σ-field. As we shall see, this representation has the flexibility and the natural tool (in the form of a measure function on the sample space) to capture both full and partial independencies between the marginalizable functions. Our formalism reduces to the old GDL as a special case, and any GDL marginalization problem can be reexamined with our framework for possible savings in complexity. Although our results are generalizable to any arbitrary semifield², in order to avoid abstract distractions, we focus on the sum-product algebra.

Here is an outline of this paper. In Section 2 we review the GDL algorithm. In Section 3 we present the necessary concepts from probability theory. In Section 4 we give a probabilistic version of the marginalization problem that we address in this paper, and introduce probabilistic junction trees and the message-passing algorithm on them. We further produce an algorithm to find a junction tree on a given collection of σ-fields. Just as is the case with the ordinary GDL, junction trees do not always exist. In Section 5 we discuss a method to expand the problem in a minimal way so as to be able to construct a junction tree (much like the process of moralization and triangulation). Some examples and applications are given in Section 6, and in Section 7 we discuss our results.

2 GDL Algorithm

Definition 1. A (commutative) semiring R is a set with operations + and × such that both + and × are commutative and associative and have identity elements in R (0 and 1 respectively), and × is distributive over +.

Let x1, · · · , xn be variables taking values in sets A1, · · · , An respectively. Let {S1, · · · , SM} be a collection of subsets of {1, · · · , n}, and for i ∈ {1, · · · , M}, let αi : A_{S_i} −→ R be a function of x_{S_i}, taking values in some semiring R. The "Marginalize a Product Function" (MPF) problem is to find, for one or more of the indices i = 1, · · · , M, the S_i-marginalization of the product of the αi's, i.e.

  βi(x_{S_i}) ≜ ∑_{x_{S_i^c} ∈ A_{S_i^c}} ( ∏_{j=1}^M αj(x_{S_j}) )

In the language of GDL, the αi's are called the local kernels, and the variable lists x_{S_i} are called the local domains.

The GDL algorithm gives a message-passing solution when the sets {S1, · · · , SM} can be organized into a junction tree. A junction tree is a graph-theoretic tree with nodes corresponding to {S1, · · · , SM}, and with the property that the subgraph on the nodes that contain any variable xi is connected. An equivalent condition is that if A, B and C are subsets of {1, · · · , M} such that A and B are separated by C on the graph, then S_A ∩ S_B ⊆ S_C, where S_A ≜ ⋃_{i∈A} S_i. As we will see in Section 4, our definition of a junction tree will resemble this latter definition.

²A semifield is an algebraic structure with addition and multiplication, both of which are commutative and associative and have identity elements. Further, multiplication is distributive over addition, and every nonzero element has a multiplicative inverse. Such useful algebras as the sum-product and the max-sum are examples of semifields (see [2]).

Suppose G is a junction tree on nodes {1, · · · , M} with local kernels α1, · · · , αM. Let {E1, · · · , ET} be a message-passing schedule, viz. the 'message' function along the (directed) edge (i, j) of the graph is updated at time t iff (i, j) ∈ Et. The following asynchronous message-passing algorithm (GDL) will solve the MPF problem:

Algorithm 1. At each time t and for all pairs (i, j) of neighboring nodes in the graph, let the 'message' from i to j be a function µ^t_{i,j} : A_{S_i∩S_j} −→ R. Initialize all 'message' functions to 1. At each time t ∈ {1, · · · , T}, if the edge (i, j) ∈ Et then update the message from node i to j as follows:

  µ^t_{i,j}(x_{S_i∩S_j}) = ∑_{x_{S_i\S_j} ∈ A_{S_i\S_j}} αi(x_{S_i}) ∏_{k adj i, k≠j} µ^{t−1}_{k,i}(x_{S_k∩S_i})    (1)

where (k adj i) means that node k is adjacent to i on the tree.

This algorithm will converge in finite time, at which time we have:

  αi(x_{S_i}) ∏_{k adj i} µ_{k,i}(x_{S_k∩S_i}) = βi(x_{S_i}) = ∑_{x_{S_i^c} ∈ A_{S_i^c}} ( ∏_{j=1}^M αj(x_{S_j}) )

Proof. See [1].
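As a minimal illustration of Algorithm 1, consider a two-node junction tree with local domains S1 = {x1, x2} and S2 = {x2, x3} over the sum-product semiring. The sketch below (with illustrative random kernels, not from the paper) computes β1 by passing a single message over the separator {x2}, and checks the result against brute-force marginalization:

```python
import random

# Two local domains S1 = {x1, x2}, S2 = {x2, x3}; kernels alpha1, alpha2
# given as illustrative random tables; junction tree: node 1 -- node 2.
random.seed(1)
q = 4  # each variable ranges over {0, ..., q-1}
a1 = {(x1, x2): random.random() for x1 in range(q) for x2 in range(q)}
a2 = {(x2, x3): random.random() for x2 in range(q) for x3 in range(q)}

# Message over the separator {x2}: mu_{2,1}(x2) = sum_{x3} alpha2(x2, x3).
mu21 = {x2: sum(a2[(x2, x3)] for x3 in range(q)) for x2 in range(q)}

# beta1(x1, x2) = alpha1(x1, x2) * mu_{2,1}(x2), as in Algorithm 1.
beta1 = {(x1, x2): a1[(x1, x2)] * mu21[x2]
         for x1 in range(q) for x2 in range(q)}

# Brute-force marginalization of the product, for comparison.
brute = {(x1, x2): sum(a1[(x1, x2)] * a2[(x2, x3)] for x3 in range(q))
         for x1 in range(q) for x2 in range(q)}
```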

3 Probabilistic Preliminaries

First we review some notions from probability theory. Throughout this paper we focus on discrete sample spaces, i.e. the case when the sample space is finite or countably infinite.

Let (Ω, M) be a discrete measurable space, i.e. Ω is a finite or countable set and M is a σ-field on Ω. Let µ : M → (−∞, ∞) be a signed measure on (Ω, M), i.e. µ(∅) = 0 and, for any sequence {A_i}_{i=1}^∞ of disjoint sets in M, µ(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ µ(A_i). (As a matter of notation, we usually write µ(A_1, A_2, · · · , A_n) for µ(A_1 ∩ A_2 ∩ · · · ∩ A_n).) Then we call (Ω, M, µ) a measure space. If (Ω, M, µ) is a measure space and F1, · · · , FM are sub-σ-fields of M, then we call (Ω, {F1, · · · , FM}, µ) a collection of measure spaces.

Let F, G and H be sub-σ-fields of M.


Atoms of a σ-field: We define the set of atoms of F to be the collection of the minimal nonempty measurable sets in F w.r.t. inclusion:

  A(F) ≜ { f ∈ F : f ≠ ∅, and ∀g ∈ F, f ∩ g ∈ {∅, f} }

Augmentation of σ-fields: We denote by F ∨ G the span of F and G, i.e. the smallest σ-field containing both F and G. For a set A of indices, we write F_A for ⋁_{i∈A} F_i, with F_∅ ≜ {∅, Ω}, the trivial σ-field on Ω. Note that the atoms of F ∨ G are all of the form f ∩ g for some f ∈ A(F) and g ∈ A(G).
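On a finite Ω these notions are easy to compute with: a σ-field can be represented by its set of atoms (a partition of Ω), and by the remark above the atoms of F ∨ G are the nonempty intersections f ∩ g. A small sketch with hypothetical partitions:

```python
def join_atoms(atoms_F, atoms_G):
    """Atoms of the span F ∨ G: nonempty intersections of atoms."""
    return {f & g for f in atoms_F for g in atoms_G if f & g}

# Two partitions (atom sets) of Omega = {0,...,5}, chosen for illustration:
A_F = {frozenset({0, 1, 2}), frozenset({3, 4, 5})}
A_G = {frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})}
A_FG = join_atoms(A_F, A_G)
```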

Conditional Independence: We say F is conditionally independent of G given H, or F⊥⊥G | H w.r.t. µ, when for any atom h of H:

• if µ(h) = 0, then ∀f ∈ F, g ∈ G, µ(f, g, h) = 0;

• if µ(h) ≠ 0, then ∀f ∈ F, g ∈ G, µ(f, g, h)µ(h) = µ(f, h)µ(g, h).

When the underlying measure is obvious from the context, we omit the explicit mention of it.

Independence: We say F is independent of G, or F⊥⊥G w.r.t. µ, when F⊥⊥G | {∅, Ω}.

Note that these definitions are consistent with the usual definitions of independence when µ is a probability function.
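The atom-by-atom test above translates directly into code. The sketch below assumes the signed measure is given by (possibly negative) point masses on a finite Ω, so that µ of a set is the sum of its point masses; checking the two conditions on atoms suffices, since both sides of the defining equality are additive in f and in g. All measures here are illustrative:

```python
def mu(S, w):
    # Measure of a set: sum of (possibly negative) point masses.
    return sum(w[x] for x in S)

def cond_indep(A_F, A_G, A_H, w, tol=1e-9):
    """Check F ⊥⊥ G | H w.r.t. w; A_F, A_G, A_H are the atom sets."""
    for h in A_H:
        mh = mu(h, w)
        for f in A_F:
            for g in A_G:
                mfgh = mu(f & g & h, w)
                if abs(mh) < tol:
                    if abs(mfgh) > tol:
                        return False
                elif abs(mfgh * mh - mu(f & h, w) * mu(g & h, w)) > tol:
                    return False
    return True

# Omega = {0,1,2,3}, encoding two bits (i, j) as 2i + j.
A_F = {frozenset({0, 1}), frozenset({2, 3})}   # first coordinate
A_G = {frozenset({0, 2}), frozenset({1, 3})}   # second coordinate
A_triv = {frozenset({0, 1, 2, 3})}             # trivial σ-field {∅, Ω}

w_prod = {0: 0.03, 1: 0.07, 2: 0.27, 3: 0.63}  # product measure
w_corr = {0: 0.5, 1: 0.0, 2: 0.0, 3: 0.5}      # coordinates fully correlated
indep = cond_indep(A_F, A_G, A_triv, w_prod)
dep = cond_indep(A_F, A_G, A_triv, w_corr)
```

With the product measure the coordinates are independent; with the correlated measure they are not.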

Conditional Measure: Although it is not essential for our discussion, we define the conditional measure as a partially defined function µ(·|·) : M × M → (−∞, ∞), given by

  µ(a|b) ≜ µ(a, b)/µ(b),    defined for nonzero-measure b.

Expectation and Conditional Expectation: Let F and G be σ-fields with atoms A(F) = {f_i}_{i=1}^∞ and A(G) = {g_i}_{i=1}^∞ respectively. A partially-defined random variable X in F is a partially-defined function on Ω where, for each r in the range of X, X^{−1}(r) is measurable in F. We write X ∈ F, and denote by A_X(F) the subset of A(F) where X is defined.

Assuming µ(Ω) ≠ 0, the expectation of X is defined as

  E[X] ≜ (1/µ(Ω)) ∑_{f ∈ A_X(F)} X(f)µ(f) = ∑_{f ∈ A_X(F)} X(f)µ(f|Ω)

Then we define the conditional expectation of X given G as a partially-defined random variable Y in G, with A_Y(G) = {g ∈ A(G) : µ(g) ≠ 0}, as follows:

  E[X | G](g) ≜ (1/µ(g)) ∑_{f ∈ A_X(F)} X(f)µ(g, f) = ∑_{f ∈ A_X(F)} X(f)µ(f|g)    for each atom g ∈ A_Y(G)
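Both definitions can be computed directly on atoms. The following sketch (point-mass measure and illustrative values, not from the paper) evaluates E[X] and E[X | G]:

```python
def mu(S, w):
    # Measure of a set S, with w the point masses on a finite Omega.
    return sum(w[x] for x in S)

def cond_exp(X_vals, A_G, w, tol=1e-12):
    """E[X | G]: a partial function on the nonzero-measure atoms of G.
    X_vals maps each atom of A_X(F) to the constant value of X on it."""
    return {g: sum(x * mu(f & g, w) for f, x in X_vals.items()) / mu(g, w)
            for g in A_G if abs(mu(g, w)) > tol}

# Omega = {0,1,2,3}; X is measurable in F with atoms {0,1} and {2,3}.
w = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
X_vals = {frozenset({0, 1}): 1.0, frozenset({2, 3}): 3.0}

Omega = frozenset({0, 1, 2, 3})
E_X = cond_exp(X_vals, {Omega}, w)       # E[X] = E[X | {∅, Ω}]
A_G = {frozenset({0, 2}), frozenset({1, 3})}
E_XG = cond_exp(X_vals, A_G, w)          # E[X | G] on the atoms of G
```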

The signed Conditional Independence relation satisfies certain properties (inference rules) that we will state in the next theorem. See [9] and [5] for a discussion of inference rules for the case when µ is a probability function³.

Theorem 3.1. Let F, G, X, Y be σ-fields. Then the following properties hold:

  F⊥⊥G | X  ⟹  G⊥⊥F | X      Symmetry (2a)

  F⊥⊥G∨X | Y  ⟹  F⊥⊥G | Y  &  F⊥⊥X | Y      Decomposition (2b)

  F⊥⊥G | X  &  F⊥⊥Y | G∨X  ⟹  F⊥⊥G∨Y | X      Contraction (2c)

  F⊥⊥G | X  &  F∨G⊥⊥Y | X  ⟹  F⊥⊥G∨Y | X      (2d)

  F⊥⊥G | X  &  F∨X⊥⊥Y | G  ⟹  F⊥⊥G∨Y | X      (2e)

  F⊥⊥G∨X | Y  &  F∨G⊥⊥Y | X  ⟹  F⊥⊥G∨Y | X  &  F∨Y⊥⊥G | X      (2f)

Proof. Let f, g, x and y be arbitrary atoms of F, G, X and Y respectively. The proofs below consider all possible cases for the values of µ on these atoms.

(2a): Symmetry is obvious, since f ∩ g = g ∩ f.

(2b): If µ(y) = 0, then µ(f, g, x, y) = 0, and if µ(y) ≠ 0, then µ(f, g, x, y) = µ(f, y)µ(g, x, y)/µ(y) for all choices of f, g and x in F, G and X respectively. In particular, choosing x = Ω or g = Ω will yield the desired results.

(2c):

• µ(x) = 0. Then from F⊥⊥G | X we get µ(g, x) = 0 for all g. Then from F⊥⊥Y | G∨X we get µ(f, g, x, y) = 0 for all f and y, and so we are done.

• µ(x) ≠ 0 and µ(g, x) = 0. Then from F⊥⊥Y | G∨X we get µ(f, g, x, y) = 0 for all f and y, and so µ(f, g, x, y)µ(x) = µ(f, x)µ(g, x, y) = 0 and we are done.

³Note that the signed Conditional Independence relation satisfies symmetry, decomposition and contraction, but in general weak union does not hold. So the signed C.I. relation is not a semi-graphoid.

• µ(x) ≠ 0 and µ(g, x) ≠ 0. Then from F⊥⊥Y | G∨X we get

  µ(f, g, x, y) = µ(f, g, x)µ(g, x, y)/µ(g, x)    (3)

Also from F⊥⊥G | X we have µ(f, g, x)/µ(g, x) = µ(f, x)/µ(x). Substituting this into (3) we obtain µ(f, g, x, y) = µ(g, x, y)µ(f, x)/µ(x) and we are done.

(2d):

• µ(x) = 0. Then from F∨G⊥⊥Y | X, we get µ(f, g, x, y) = 0 for all f, g and y, and so we are done.

• µ(x) ≠ 0 and µ(x, y) = 0. Then from F∨G⊥⊥Y | X we get µ(f, g, x, y) = µ(f, g, x)µ(x, y)/µ(x) = 0 for all f, g and y, and in particular µ(g, x, y) = 0, and so we have the desired equality µ(f, g, x, y)µ(x) = µ(f, x)µ(g, x, y) = 0.

• µ(x) ≠ 0 and µ(x, y) ≠ 0. We have

  µ(f, g, x) = µ(f, x)µ(g, x)/µ(x)    since F⊥⊥G | X    (4)

  µ(f, g, x, y) = µ(f, g, x)µ(x, y)/µ(x)    since F∨G⊥⊥Y | X    (5)

  µ(g, x)/µ(x) = µ(g, x, y)/µ(x, y)    since, by (2b), G⊥⊥Y | X    (6)

Substituting (6) into (4) and then into (5), we obtain

  µ(f, g, x, y) = µ(f, x)µ(g, x, y)/µ(x).

(2e):

• µ(x) = 0 and µ(g) = 0. Then from F∨X⊥⊥Y | G, we get µ(f, g, x, y) = 0 and we are done.

• µ(x) = 0 and µ(g) ≠ 0. From F⊥⊥G | X we have µ(f, g, x) = 0. Then from F∨X⊥⊥Y | G, µ(f, g, x, y) = µ(f, g, x)µ(y, g)/µ(g) = 0 and we are done.

• µ(x) ≠ 0 and µ(g) = 0. Then from F∨X⊥⊥Y | G, we get both µ(f, g, x, y) = 0 and µ(g, x, y) = 0, so the desired equality holds:

  µ(f, g, x, y) = µ(f, x)µ(g, x, y)/µ(x) = 0.

• µ(x) ≠ 0 and µ(g) ≠ 0. Then from F∨X⊥⊥Y | G, we get µ(f, g, x, y) = µ(f, g, x)µ(g, y)/µ(g). Also from F⊥⊥G | X we have

  µ(f, g, x) = µ(f, x)µ(g, x)/µ(x).

So we obtain the equality µ(f, g, x, y) = µ(f, x)µ(g, x)µ(g, y)/(µ(g)µ(x)). Finally, decomposition applied to F∨X⊥⊥Y | G yields µ(g, x)µ(g, y)/µ(g) = µ(g, x, y). So we have proved µ(f, g, x, y) = µ(f, x)µ(g, x, y)/µ(x), and this completes the proof.


(2f):

• µ(x) = 0. Then from F∨G⊥⊥Y | X, we have µ(f, g, x, y) = 0 and we are done.

• µ(x) ≠ 0 and µ(x, y) = 0. Then from F∨G⊥⊥Y | X we have µ(f, g, x, y) = µ(f, g, x)µ(x, y)/µ(x), and so µ(f, g, x, y) = 0. Also, after applying (2b) to the above, we have µ(f, x, y) = µ(f, x)µ(x, y)/µ(x) = 0 and µ(g, x, y) = µ(g, x)µ(x, y)/µ(x) = 0. So we have the equality µ(f, g, x, y)µ(x) = µ(f, x, y)µ(g, x) = µ(f, x)µ(g, x, y) = 0, and we are done.

• µ(x) ≠ 0 and µ(x, y) ≠ 0. Then also µ(y) ≠ 0, or else from F⊥⊥G∨X | Y we would have µ(x, y) = 0. Then from F⊥⊥G∨X | Y we get µ(f, g, x, y) = µ(f, y)µ(g, x, y)/µ(y), and also, after applying (2b) to the above, we get µ(f, x, y)/µ(x, y) = µ(f, y)/µ(y). Substituting the latter equation into the former, we obtain

  µ(f, g, x, y) = µ(g, x, y)µ(f, x, y)/µ(x, y)    (7)

But from F∨G⊥⊥Y | X and by (2b) we have both µ(f, x, y)/µ(x, y) = µ(f, x)/µ(x) and µ(g, x, y)/µ(x, y) = µ(g, x)/µ(x). Substituting each of these into (7), we obtain

  µ(f, g, x, y) = µ(g, x, y)µ(f, x)/µ(x)

and

  µ(f, g, x, y) = µ(f, x, y)µ(g, x)/µ(x),

and we are done.

4 Probabilistic MPF and Junction Trees

We now formulate a probabilistic version of the MPF problem and introduce the corresponding concept of junction trees. The rest of this paper will analyze properties of these junction trees and describe a probabilistic GDL algorithm to solve this MPF problem.

Throughout this paper, let (Ω, {F1, · · · , FM}, µ) be a collection of measure spaces, and let X1, · · · , XM be a collection of partially defined random variables with Xi ∈ Fi.

Probabilistic MPF Problem: For one or more i ∈ {1, · · · , M}, find E[∏_j Xj | Fi], the conditional expectation of the product given Fi.

A GDL MPF problem in the format described in Section 2 can be represented as a probabilistic MPF problem, usually in more than one way, depending on the choice of assignment of GDL's local kernels as either random variables or factors of the measure function; either way, the product will be the same. Specifically, a marginalization ∑_{x_{S_i^c} ∈ A_{S_i^c}} ( ∏_{j=1}^M αj(x_{S_j}) ) can be viewed as a weighted average of the product of the αj's for j ∈ J ⊆ {1, · · · , M}, with the measure function µ({x_{1,··· ,n}}) = ∏_{k∈J^c} αk(x_{S_k}), for any subset J of {1, · · · , M}. Our sample space Ω is then the product space A_{{1,··· ,n}} = A1 × · · · × An. For each j ∈ J, we view αj as a random variable measurable in a σ-field Fj whose atoms are the hyper-planar subsets of Ω on which the coordinates corresponding to the elements of Sj are constant. In other words, each atom of this σ-field corresponds to a possible choice of x_{S_j} ∈ A_{S_j}. Denoting each atom by its corresponding element x_{S_j}, we then have

  E[∏_{i∈J} Xi | Fj](x_{S_j}) = (1/µ(x_{S_j})) ∑_{x_{S_j^c} ∈ A_{S_j^c}} ( ∏_{i=1}^M αi(x_{S_i}) ) = (1/µ(x_{S_j})) βj(x_{S_j}),

so the conditional expectation recovers the GDL marginalization βj up to the known factor µ(x_{S_j}); in particular, when J = {1, · · · , M} and µ is the counting measure, this factor is the constant q_{S_j^c} = |A_{S_j^c}|.

In most applications, for a family of MPF problems the local kernels can be categorized as either fixed or arbitrary. For example, in an LDPC decoding problem, the code itself is fixed, so the local kernels at the check nodes are fixed; we only receive new observations and try to find the most likely codeword given each observation set. As another example, when finding the Hadamard transform ∑_{x_1,··· ,x_n} ∏_{i=1}^n (−1)^{x_i y_i} f(x_1, · · · , x_n) of an arbitrary function f, the functions (−1)^{x_i y_i} are fixed. Typically, we want to assign (some of) the fixed kernels as the measure function, and the arbitrary kernels as the marginalizable random variables; this way, once a junction tree has been found for one problem, it can be used to marginalize the product of any arbitrary collection of random variables measurable in the same σ-fields. See Section 6 for more examples.

4.1 Junction Trees

As in the case of the old GDL, junction trees are defined to capture the underlying independencies in the marginalizable functions. Given the above problem setup, we define junction trees as follows:

Definition 2. Let G be a tree with nodes {1, · · · , M}. We say subsets A and B of {1, · · · , M} are separated by a node i if ∀x ∈ A, y ∈ B, the path from x to y contains i. Then we call G a Junction Tree if ∀A, B ⊂ {1, · · · , M} and i ∈ {1, · · · , M} s.t. i separates A and B on the tree, we have F_A ⊥⊥ F_B | F_i.

Lemma 4.1. Suppose there exists a junction tree with nodes corresponding to σ-fields F1, · · · , FM. Then if f is a zero-measure atom of any of the Fi's, and g ⊂ f is measurable in ⋁_{i=1}^M F_i, then µ(g) = 0.

Proof. Node i vacuously separates the empty subset of {1, · · · , M} from {1, · · · , M}\{i}. Thus {∅, Ω} ⊥⊥ ⋁_{j=1, j≠i}^M F_j | F_i. Thus, by the definition of conditional independence, whenever f ∈ A(F_i) has zero measure, all its subsets measurable in ⋁_{i=1}^M F_i also have measure zero.

Lemma 4.2. Let F1, F2 and F3 be σ-fields such that F1⊥⊥F3 | F2. Then for any partially-defined random variable X ∈ F1, the following equality holds:

  E[ E[X | F2] | F3 ] = E[X | F3]

Proof. Let Y = E[X | F2]. Then A_Y(F2) = {b ∈ A(F2) : µ(b) ≠ 0}, and for b ∈ A_Y(F2), Y(b) = (1/µ(b)) ∑_{a ∈ A_X(F1)} X(a)µ(a, b). Then, for any nonzero-measure atom c ∈ A(F3),

  E[ E[X | F2] | F3 ](c) = E[Y | F3](c)
  = (1/µ(c)) ∑_{b ∈ A_Y(F2)} Y(b)µ(b, c)
  = (1/µ(c)) ∑_{b ∈ A_Y(F2)} (1/µ(b)) ∑_{a ∈ A_X(F1)} X(a)µ(a, b)µ(b, c)
  = (1/µ(c)) ∑_{b ∈ A_Y(F2)} ∑_{a ∈ A_X(F1)} X(a)µ(a, b, c)    since F1⊥⊥F3 | F2
  = (1/µ(c)) ∑_{a ∈ A_X(F1)} X(a) ∑_{b ∈ A_Y(F2)} µ(a, b, c)
  = (1/µ(c)) ∑_{a ∈ A_X(F1)} X(a)µ(a, c)
  = E[X | F3](c)

where we have used the fact that F1⊥⊥F3 | F2, so ∑_{b ∈ A_Y(F2)} µ(a, b, c) = ∑_{b ∈ A(F2)} µ(a, b, c) = µ(a, c).
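Lemma 4.2 can be checked numerically. The sketch below builds a three-bit Ω with a Markov-type measure w(a, b, c) = p(a)q(b|a)r(c|b), so that σ(A) ⊥⊥ σ(C) | σ(B) holds by construction, and compares the two sides for an X ∈ σ(A); all numbers are illustrative:

```python
def mu(S, w):
    return sum(w[x] for x in S)

def cond_exp(X_vals, A_G, w, tol=1e-12):
    # E[X | G] on the nonzero-measure atoms of G, as defined in Section 3.
    return {g: sum(x * mu(f & g, w) for f, x in X_vals.items()) / mu(g, w)
            for g in A_G if abs(mu(g, w)) > tol}

# Omega = {0,...,7}, point 4a + 2b + c for bits (a, b, c); Markov measure.
p = {0: 0.3, 1: 0.7}
q = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.8}  # q[(a, b)] = q(b|a)
r = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1}  # r[(b, c)] = r(c|b)
w = {4*a + 2*b + c: p[a] * q[(a, b)] * r[(b, c)]
     for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def atom(bit, val):
    # Atom of sigma(bit): the points of Omega whose given bit equals val.
    return frozenset(x for x in range(8) if (x >> (2 - bit)) & 1 == val)

A_B = {atom(1, 0), atom(1, 1)}              # atoms of F2 = sigma(B)
A_C = {atom(2, 0), atom(2, 1)}              # atoms of F3 = sigma(C)
X = {atom(0, 0): -1.0, atom(0, 1): 2.5}     # X in F1 = sigma(A)

lhs = cond_exp(cond_exp(X, A_B, w), A_C, w)  # E[ E[X | F2] | F3 ]
rhs = cond_exp(X, A_C, w)                    # E[ X | F3 ]
```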

Lemma 4.3. Let F1, · · · , Fl and F be σ-fields such that F1, · · · , Fl are mutually conditionally independent given F. For each i = 1, · · · , l, let Xi be a partially-defined random variable in Fi. Then:

  E[ ∏_{i=1}^l Xi | F ] = ∏_{i=1}^l E[Xi | F]

Proof. We shall proceed by induction. The statement is vacuous for l = 1. For l = 2, let Y = E[X1X2 | F]. Then A_Y(F) = {f ∈ A(F) : µ(f) ≠ 0}. Also note that X1X2 is a partially-defined random variable in F1 ∨ F2 with A_{X1X2}(F1 ∨ F2) = A_{X1}(F1 ∨ F2) ∩ A_{X2}(F1 ∨ F2), and that any atom of F1 ∨ F2 can be written as a ∩ b for a ∈ A(F1) and b ∈ A(F2). Then for any f ∈ A_Y(F) we have:

  Y(f) = (1/µ(f)) ∑_{(a∩b) ∈ A_{X1X2}(F1∨F2), a ∈ A(F1), b ∈ A(F2)} X1(a)X2(b)µ(a, b, f)
  = (1/µ(f)) ∑_{a ∈ A_{X1}(F1)} ∑_{b ∈ A_{X2}(F2)} X1(a)X2(b)µ(a, f)µ(b, f)/µ(f)
  = [ (1/µ(f)) ∑_{a ∈ A_{X1}(F1)} X1(a)µ(a, f) ] [ (1/µ(f)) ∑_{b ∈ A_{X2}(F2)} X2(b)µ(b, f) ]
  = E[X1 | F](f) E[X2 | F](f)

where we have used the fact that F1⊥⊥F2 | F, so µ(a, f)µ(b, f)/µ(f) = µ(a, b, f).

For l > 2, assume inductively that the equality holds for l − 1. Then:

  E[ ∏_{i=1}^l Xi | F ] = E[ X1 ∏_{i=2}^l Xi | F ]
  = E[X1 | F] E[ ∏_{i=2}^l Xi | F ]    since F1 ⊥⊥ ⋁_{i=2}^l F_i | F
  = ∏_{i=1}^l E[Xi | F]    by the induction hypothesis
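A numeric check of Lemma 4.3 for l = 2, on the same kind of finite sample space: here F1 = σ(A) and F2 = σ(C) are conditionally independent given F = σ(B) by construction, since w(a, b, c) = p(b)u(a|b)v(c|b); all values are illustrative:

```python
def mu(S, w):
    return sum(w[x] for x in S)

def cond_exp(X_vals, A_G, w, tol=1e-12):
    # E[X | G] on the nonzero-measure atoms of G, as defined in Section 3.
    return {g: sum(x * mu(f & g, w) for f, x in X_vals.items()) / mu(g, w)
            for g in A_G if abs(mu(g, w)) > tol}

# Omega = {0,...,7}, point 4a + 2b + c; a and c are conditionally
# independent given b: w(a, b, c) = p(b) u(a|b) v(c|b).
p = {0: 0.4, 1: 0.6}
u = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # u[(a, b)] = u(a|b)
v = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.9}  # v[(c, b)] = v(c|b)
w = {4*a + 2*b + c: p[b] * u[(a, b)] * v[(c, b)]
     for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def atom(bit, val):
    return frozenset(x for x in range(8) if (x >> (2 - bit)) & 1 == val)

A_B = {atom(1, 0), atom(1, 1)}
X1 = {atom(0, a): 1.0 + a for a in (0, 1)}       # X1 in sigma(A)
X2 = {atom(2, c): 3.0 - 2 * c for c in (0, 1)}   # X2 in sigma(C)

# X1*X2 lives in sigma(A) v sigma(C); its atoms are intersections a ∩ c.
X12 = {f1 & f2: x1 * x2 for f1, x1 in X1.items() for f2, x2 in X2.items()}

lhs = cond_exp(X12, A_B, w)                            # E[X1 X2 | F]
E1, E2 = cond_exp(X1, A_B, w), cond_exp(X2, A_B, w)
rhs = {b: E1[b] * E2[b] for b in A_B}                  # E[X1|F] E[X2|F]
```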

4.2 Probabilistic Junction Tree Algorithm

Suppose G is a junction tree with σ-fields F1, · · · , FM, as defined above, and let X1, · · · , XM be random variables with Xi ∈ Fi. Then the following asynchronous message-passing algorithm will solve the probabilistic MPF problem:

Algorithm 2. For each edge (i, j) on the graph, define a 'message' Y_{i,j} from node i to j as a partially-defined random variable measurable in Fj. For each i = 1, · · · , M, define the set of neighbors of i as

  N_i = { k : (i, k) an edge in the tree },

and for each edge (i, j) in the tree, define

  N_{i,j} = N_i \ {j}.

Initialize all the messages to 1. For each edge (i, j) in the tree, update the message Y_{i,j} (asynchronously) as:

  Y_{i,j} = E[ Xi ∏_{k ∈ N_{i,j}} Y_{k,i} | Fj ]    (8)

This algorithm will converge in finite time, at which time we have:

  E[ ∏_{j=1}^M Xj | Fi ] = Xi ∏_{k ∈ N_i} Y_{k,i}

Proof. The scheduling theorem (Theorem 3.1 in [1]) also holds here, using Lemmas 4.2 and 4.3 above. For completeness we include that proof here.

We will show that if {Et} is the schedule for activation of the nodes (i.e. a directed edge (i, j) ∈ Et iff node i updates its message to its neighbor j at time t), then the message from a node i to a neighboring node j is

  Y_{i,j}(t) = E[ ∏_{k ∈ K_{i,j}(t)} Xk | Fj ],    (9)

where K_{i,j}(t) is a subset of the nodes defined recursively by:

  K_{i,j}(t) = ∅    if t = 0
  K_{i,j}(t) = K_{i,j}(t − 1)    if (i, j) ∉ Et
  K_{i,j}(t) = {i} ∪ ⋃_{l ∈ N_{i,j}} K_{l,i}(t − 1)    if (i, j) ∈ Et

We will prove this by induction on t. The case t = 0 is clear from the initialization. Now let t > 0 and assume that (9) above holds for t − 1. We can also assume that (i, j) ∈ Et, so that the message Y_{i,j} is being updated at time t. Then:

  Y_{i,j}(t) = E[ Xi ∏_{l ∈ N_{i,j}} Y_{l,i}(t − 1) | Fj ]
  = E[ Xi ∏_{l ∈ N_{i,j}} E[ ∏_{k ∈ K_{l,i}(t−1)} Xk | Fi ] | Fj ]    by induction
  = E[ E[ Xi ∏_{l ∈ N_{i,j}} ∏_{k ∈ K_{l,i}(t−1)} Xk | Fi ] | Fj ]    by the J.T. property and Lemma 4.3
  = E[ Xi ∏_{l ∈ N_{i,j}} ∏_{k ∈ K_{l,i}(t−1)} Xk | Fj ]    by the J.T. property and Lemma 4.2
  = E[ ∏_{k ∈ K_{i,j}(t)} Xk | Fj ]    by definition of K_{i,j}(t)

Indeed, K_{i,j}(t) above is the set of all the nodes whose 'information' has reached the edge (i, j) by time t. Similarly, with J_i(t) ≜ {i} ∪ ⋃_{j ∈ N_i} K_{j,i}(t), J_i(t) is the collection of all the nodes whose 'information' has reached node i by time t. It is natural to think of a message trellis up to time t, which is an M × t directed graph where, for any i, j ∈ {1, · · · , M} and n < t, i(n) is always connected to i(n + 1), and i(n) is connected to j(n + 1) iff (i, j) ∈ En. It follows that we will have J_i(t) = {1, · · · , M} when there is a path from every initial node (i.e. at t = 0) in the trellis to the node i(t). Then, since the tree has finite diameter, any infinite schedule that activates all the edges infinitely many times has a finite sub-schedule, say of length t0, such that J_i(t0) = {1, · · · , M} for all i. At that time we have:

  E[ Xi ∏_{j ∈ N_i} Y_{j,i}(t0) | Fi ]
  = E[ Xi ∏_{j ∈ N_i} E[ ∏_{k ∈ K_{j,i}(t0)} Xk | Fi ] | Fi ]    by (9)
  = E[ E[ Xi ∏_{j ∈ N_i} ∏_{k ∈ K_{j,i}(t0)} Xk | Fi ] | Fi ]    by the J.T. property and Lemma 4.3
  = E[ Xi ∏_{j ∈ N_i} ∏_{k ∈ K_{j,i}(t0)} Xk | Fi ]
  = E[ ∏_{k ∈ J_i(t0)} Xk | Fi ]    by definition of J_i(t)
  = E[ ∏_{k=1}^M Xk | Fi ]
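Algorithm 2 can be exercised end to end on a small example. In the sketch below (all measures and random variables illustrative), the σ-fields σ(A), σ(B), σ(C) of a three-bit Markov-type measure form a path-shaped junction tree 1 — 2 — 3; messages are passed from the leaf node 3 toward node 1, and X1 · Y_{2,1} is compared against a brute-force computation of E[X1 X2 X3 | F1]:

```python
def mu(S, w):
    return sum(w[x] for x in S)

def cond_exp(X_vals, A_G, w, tol=1e-12):
    # E[X | G] on the nonzero-measure atoms of G (Section 3).
    return {g: sum(x * mu(f & g, w) for f, x in X_vals.items()) / mu(g, w)
            for g in A_G if abs(mu(g, w)) > tol}

# Omega = {0,...,7}, point 4a + 2b + c; Markov measure p(a) q(b|a) r(c|b),
# so sigma(A) -- sigma(B) -- sigma(C) is a path-shaped junction tree.
p = {0: 0.3, 1: 0.7}
q = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.8}
r = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1}
w = {4*a + 2*b + c: p[a] * q[(a, b)] * r[(b, c)]
     for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def atom(bit, val):
    return frozenset(x for x in range(8) if (x >> (2 - bit)) & 1 == val)

A1 = {atom(0, 0), atom(0, 1)}                  # atoms of F1 = sigma(A)
A2 = {atom(1, 0), atom(1, 1)}                  # atoms of F2 = sigma(B)
X1 = {atom(0, a): 1.0 + a for a in (0, 1)}
X2 = {atom(1, b): 2.0 - b for b in (0, 1)}
X3 = {atom(2, c): 0.5 + c for c in (0, 1)}

# Messages toward node 1 (the efficient schedule for a single target):
Y32 = cond_exp(X3, A2, w)                                # Y_{3,2} = E[X3 | F2]
Y21 = cond_exp({b: X2[b] * Y32[b] for b in Y32}, A1, w)  # E[X2 Y_{3,2} | F1]
result = {f: X1[f] * Y21[f] for f in Y21}                # X1 * Y_{2,1}

# Brute force: E[X1 X2 X3 | F1], computed pointwise on Omega.
val = lambda X, x: next(v for f, v in X.items() if x in f)
prod = {x: val(X1, x) * val(X2, x) * val(X3, x) for x in range(8)}
brute = {f: sum(prod[x] * w[x] for x in f) / mu(f, w) for f in A1}
```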

4.3 Representation and Complexity

In this section we discuss the complexity of representation of a junction treeand the implementation of our algorithm.Let G be a junction tree with σ-fields F1, · · · ,FM, as defined above, and

let X1, · · · , XM be arbitrary random variables with Xi ∈ Fi. Denote by qithe number of atoms of the σ-field Fi, so qi = |A(Fi)|.It can be seen that, in general, the sample space Ω can have as many as

∏Mi=1 qi elements and thus full representation of σ-fields and the measure func-

tion requires exponentially large storage resources. Fortunately, however, a fullrepresentation is not required. Along each edge (i, j) on the tree, Algorithm 2only requires local computation of E

[

X∣

∣Fi]

for a random variable X ∈ Fj . Thisonly requires a qi×qj table of the joint measures of the atoms of Fi and Fj . Foran arbitrary edge (i, j), let A(Fi) = a1, · · · , aqi

and A(Fj) = b1, · · · , bqj

be the sets of atoms of Fi and Fj . Define W (i, j) to be the qi × qj matrixwith (r, s) entry equal to µ(ar|bs); note that from Lemma 4.1, (possibly aftertrivial simplification of the problem by eliminating the zero measure events,)no atom of Fj can have measure 0, so µ(ar|bs) is defined for all atoms of Fj .Then once a junction tree has been found, we need only keep 2(M − 1) suchmatrices (corresponding to the (M − 1) edges of the tree) to fully represent thealgorithm, for a total of 2

(i,j) an edge qiqj storage units.The arithmetic complexity of the algorithm depends on even a smaller quan-

tity, namely the total number of nonzero elements of the W (i, j) matrices. Letnz(i, j) denote the number of nonzero entries of the matrix W (i, j) (note that

13

Page 14: A New Look at the Generalized Distributive Lawananth/Preprints/Payam/...A New Look at the Generalized Distributive Law⁄ Payam Pakzad (payamp@eecs.berkeley.edu) Venkat Anantharam

nz(i, j) = nz(j, i)). Let X be an arbitrary random variable in Fi. Then

E[

X∣

∣Fj]

(bs) =

qi∑

r=1

X(ar)µ(ar|bs)

=∑

r:µ(ar|bs)6=0

X(ar)µ(ar|bs)

requiring nz(i, j) multiplications and nz(i, j) − q_j additions. Note that the measures are assumed to be fixed, and only the X_i's are allowed to be arbitrary random variables measurable in the F_i's, so it makes sense to exclude the multiplications and additions by 0 from the algorithm.

For each (directed) edge (i, j) in G define χ(i, j) = 2nz(i, j) − q_j to be the edge complexity, i.e. the number of additions and multiplications required for computing E[X_i | F_j]. By Algorithm 2, calculating the conditional expectation given a single σ-field F_i with the most efficient schedule requires updating the messages from the leaves towards the node i. Each edge is activated in one direction, and at each non-leaf node l the messages need to be multiplied to update the message from l to its neighbor in the direction of i. This requires, for each edge (k, l), an additional q_l multiplications. Thus the grand total of arithmetic operations needed to calculate E[∏_j X_j | F_i] is ∑_{(k,l) an edge} 2nz(k, l).

Note that nz(k, l) can be upper-bounded by q_k q_l, corresponding to carrying out the multiplications and additions even for the events of measure zero.
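As an illustration, the sparse evaluation of E[X | F_j] above can be sketched as follows; the column-wise layout `W_cols` is a hypothetical representation, not the authors' implementation, storing for each atom b_s of F_j only the nonzero entries of the corresponding column of W(i, j):

```python
# A sketch of the sparse evaluation of E[X | F_j]: with only the nonzero
# entries stored, the cost is exactly nz(i, j) multiplications and
# nz(i, j) - q_j additions, as counted in the text.

def cond_expectation(X, W_cols):
    """X[r] = X(a_r); W_cols[s] lists the pairs (r, mu(a_r|b_s)) with
    nonzero conditional measure (a hypothetical sparse layout)."""
    return [sum(X[r] * m for r, m in col) for col in W_cols]

# Tiny example: q_i = 3 atoms of F_i, q_j = 2 atoms of F_j.
X = [1.0, 2.0, 3.0]
W_cols = [[(0, 0.5), (1, 0.5)],   # mu(.|b_1): mass only on a_1 and a_2
          [(2, 1.0)]]             # mu(.|b_2): concentrated on a_3
print(cond_expectation(X, W_cols))   # -> [1.5, 3.0]
```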

The complexity of the full algorithm, in which E[∏_j X_j | F_i] is calculated for all i = 1, · · · ,M, can be found using similar ideas. For each node k, let d(k) denote the number of neighbors of k on the tree. Then for each directed edge (k, l), the d(k) − 1 messages from the other neighbors of k must be multiplied by X_k (requiring (d(k) − 1)q_k multiplications), and then the conditional expectation given F_l must be taken (requiring χ(k, l) operations). So the total number of operations required for the full algorithm is

∑_{(k,l) a dir. edge} (2nz(k, l) − q_l + (d(k) − 1)q_k)
  = ∑_{(k,l) a dir. edge} (2nz(k, l) + d(k)q_k) − ∑_{k=1}^{M} 2d(k)q_k
  = ∑_{(k,l) an edge} 4nz(k, l) + ∑_{k=1}^{M} (d(k)² − 2d(k))q_k

As noted in [1], it is possible to produce all the products of d(k) − 1 of the d(k) messages going into node k in a more efficient way. In this case, the total arithmetic complexity of the complete algorithm will be ∑_{(k,l) an edge} 4nz(k, l) + O(∑_{k=1}^{M} d(k)q_k).
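The bookkeeping above can be checked numerically. The sketch below, on an assumed toy tree with dense W matrices (so nz(k, l) = q_k q_l), verifies that summing the per-directed-edge cost 2nz(k, l) − q_l + (d(k) − 1)q_k reproduces the closed form ∑ 4nz(k, l) + ∑ (d(k)² − 2d(k))q_k:

```python
# Toy check of the operation-count identity derived above.
edges = [(0, 1), (1, 2), (1, 3)]                 # an assumed toy tree
q = {0: 2, 1: 3, 2: 4, 3: 2}                     # atom counts q_k
nz = {frozenset(e): q[e[0]] * q[e[1]]            # dense W => nz = q_k * q_l
      for e in edges}

d = {k: sum(k in e for e in edges) for k in q}   # node degrees d(k)

# Sum of 2nz(k,l) - q_l + (d(k)-1)q_k over all directed edges.
directed = sum(2 * nz[frozenset((k, l))] - q[l] + (d[k] - 1) * q[k]
               for (a, b) in edges for (k, l) in [(a, b), (b, a)])

# Closed form: sum over undirected edges of 4nz plus the degree term.
closed_form = (sum(4 * nz[frozenset(e)] for e in edges)
               + sum((d[k] ** 2 - 2 * d[k]) * q[k] for k in q))

assert directed == closed_form
print(directed)   # -> 97
```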


4.4 Existence of Junction Trees

Definition 3. A Valid Partition of {1, · · · ,M}\{i} with respect to a node i is a partition {p_1, · · · , p_l} of {1, · · · ,M}\{i} (i.e. ⋃_{j=1}^{l} p_j = {1, · · · ,M}\{i} and p_j ∩ p_k = ∅ for j ≠ k) such that the F_{p_j}'s are mutually conditionally independent given F_i.

Definition 4. Let P = {p_1, · · · , p_l} be any partition of {1, · · · ,M}\{i}. A tree with nodes {1, · · · ,M} is called compatible with partition P at node i if its subtrees hanging from i correspond to the elements of P.

Lemma 4.4. For every i ∈ {1, · · · ,M} there is a Finest Valid Partition w.r.t. i, which we shall denote by P_i, such that every other valid partition w.r.t. i is a coarsening of P_i. Further, if p is an element of P_i and p is the disjoint union of nonempty sets e_1 and e_2, then F_{e_1} and F_{e_2} are not conditionally independent given F_i.

Proof. Suppose A = {p_1, · · · , p_l} and B = {q_1, · · · , q_m} are valid partitions w.r.t. node i. Construct another partition C = {p ∩ q : p ∈ A and q ∈ B}. We claim that C is also a valid partition w.r.t. i, finer than both A and B. To see this, we need to show that for each d = p ∩ q ∈ C, F_d ⊥⊥ F_{d^c} | F_i. Using simple manipulations like F_p = F_{p∩(q∪q^c)} = F_{p∩q} ∨ F_{p∩q^c}, we get:

F_{p∩q} ∨ F_{p∩q^c} ⊥⊥ F_{p^c} | F_i

F_{p∩q} ∨ F_{p^c∩q} ⊥⊥ F_{p∩q^c} ∨ F_{p^c∩q^c} | F_i   ⇒   F_{p∩q} ⊥⊥ F_{p∩q^c} | F_i   by (2b)

Finally, the last two relations and (2d) imply that F_{p∩q} ⊥⊥ F_{p^c∪(p∩q^c)} | F_i, and hence F_{p∩q} ⊥⊥ F_{(p∩q)^c} | F_i. So a finest valid partition w.r.t. i exists, whose atoms are the intersections of the atoms of all the valid partitions w.r.t. i.

Now suppose p is an element of P_i, p is the disjoint union of nonempty sets e_1 and e_2, and F_{e_1} ⊥⊥ F_{e_2} | F_i. We also have F_{e_1} ∨ F_{e_2} ⊥⊥ F_{p^c} | F_i. Then from the last two relations and by (2d) we get F_{e_1} ⊥⊥ F_{p^c} ∨ F_{e_2} | F_i, and hence F_{e_1} ⊥⊥ F_{e_1^c} | F_i. Then e_1 and e_2 would be elements in a finer valid partition, which is a contradiction.

Lemma 4.5. Given {F_1, · · · ,F_M}, a tree with nodes {1, · · · ,M} is a junction tree iff at each node i it is compatible with some valid partition of {1, · · · ,M}\{i} w.r.t. i.

Proof. Immediate from the definitions.

Lemma 4.6. Let d be a subset of {1, · · · ,M} and let d′ be its complement in {1, · · · ,M}. Suppose there exist t ∈ d and i ∈ d′ such that F_d ⊥⊥ F_{d′} | F_t and F_d ⊥⊥ F_{d′} | F_i. Let G be any junction tree on d and G′ any junction tree on d′. Then the tree obtained by connecting G and G′ by adding an edge between t and i is a junction tree on {1, · · · ,M} (see Figure 1).


[Figure: the junction tree G on d, with distinguished node t, is joined to the junction tree G′ on d′, with distinguished node i, by the edge (t, i).]

Figure 1: Augmenting Junction Trees; Lemma 4.6

Proof. Let x be any node that separates A and B on the resultant tree. We will show that F_A ⊥⊥ F_B | F_x, and hence that we have a junction tree. Let A_1 = A ∩ d, A_2 = A ∩ d′, B_1 = B ∩ d and B_2 = B ∩ d′, and WLOG suppose x ∈ d. Then at least one of A_2 and B_2 must be empty, or else x would not separate A and B. Suppose A_2 = ∅.

First suppose x = t. Then we have:

F_{A_1} ⊥⊥ F_{B_1} | F_t   by the junction tree property on G
F_{A_1} ∨ F_{B_1} ⊥⊥ F_{B_2} | F_t   since A_1 ∪ B_1 ⊂ d and B_2 ⊂ d′

So by (2d) we have F_{A_1} ⊥⊥ F_{B_1} ∨ F_{B_2} | F_t, i.e. F_A ⊥⊥ F_B | F_t, and we are done.

Next suppose x ∈ d\{t}. Then x must also separate A_1 from B_1 ∪ {t} (assuming that B_2 is nonempty, which is no loss of generality). Then:

F_{A_1} ⊥⊥ F_{B_1} ∨ F_t | F_x   (10)
F_{A_1} ∨ F_x ∨ F_{B_1} ⊥⊥ F_{B_2} | F_t   since A_1 ∪ B_1 ∪ {x} ⊂ d and B_2 ⊂ d′   (11)

We will show that F_{A_1} ⊥⊥ F_{B_1} ∨ F_{B_2} ∨ F_t | F_x, and hence F_A ⊥⊥ F_B | F_x. Let χ, τ, α, β_1 and β_2 be arbitrary atoms of F_x, F_t, F_A, F_{B_1} and F_{B_2} respectively.

• Case µ(χ) = µ(τ) = 0. Then from (11) we have that µ(α, β_1, β_2, χ, τ) = 0, and we are done.

• Case µ(χ) = 0 and µ(τ) ≠ 0. Then from (11) we have µ(α, β_1, β_2, χ, τ) = µ(α, β_1, χ, τ)µ(β_2, τ)/µ(τ). But from (10), µ(α, β_1, χ, τ) = 0 since µ(χ) = 0. Thus µ(α, β_1, β_2, χ, τ) = 0 and we are done.

• Case µ(χ) ≠ 0 and µ(τ) = 0. Then from (11) we have that µ(α, β_1, β_2, χ, τ) = µ(β_1, β_2, χ, τ) = 0, and so we have the equality µ(α, β_1, β_2, χ, τ)µ(χ) = µ(α, χ)µ(β_1, β_2, χ, τ) = 0, and we are done.

• Case µ(χ) ≠ 0 and µ(τ) ≠ 0. Then from (11),

µ(α, β_1, β_2, χ, τ) = µ(α, β_1, χ, τ)µ(β_2, τ)/µ(τ),

and from (10), µ(α, β_1, χ, τ) = µ(α, χ)µ(β_1, χ, τ)/µ(χ). Substituting the latter into the former, we obtain

µ(α, β_1, β_2, χ, τ) = µ(α, χ)µ(β_1, χ, τ)µ(β_2, τ)/(µ(χ)µ(τ)).

But by (11), µ(β_1, χ, τ)µ(β_2, τ)/µ(τ) = µ(β_1, β_2, χ, τ), so

µ(α, β_1, β_2, χ, τ) = µ(α, χ)µ(β_1, β_2, χ, τ)/µ(χ)

and we are done.

We now state our main theorem on the existence of junction trees:

Theorem 4.7. Given a set of σ-fields {F_1, · · · ,F_M}, if there exists a junction tree on {1, · · · ,M}, then for every i ∈ {1, · · · ,M} there exists a junction tree compatible with P_i, the finest valid partition w.r.t. i.

Notice that Theorem 4.7 and Lemma 4.6 effectively give an algorithm to find a junction tree when one exists, as we shall describe in Section 4.5.

Proof. The claim is trivial for M ≤ 3. We will prove the theorem for M > 3 by induction. Let P_i = {c_1, · · · , c_l} with ⋃_{j=1}^{l} c_j = {1, · · · ,M}\{i} and c_j ∩ c_k = ∅ for j ≠ k. Let G be a junction tree, and let Q = {d_1, · · · , d_n} be the partition of {1, · · · ,M}\{i} compatible with G. Let d = d_j be an arbitrary element of Q, and let d′ = ⋃_{k≠j} d_k ∪ {i}. Let t ∈ d be the node in d that is the neighbor of i in the tree G. By Lemmas 4.4 and 4.5 above, d is the union of some of the c_k's. WLOG assume that d = ⋃_{k=1}^{K} c_k where K ≤ l, and also assume that t ∈ c_K. Then from the junction tree property, we have

F_i ⊥⊥ F_d | F_t   (12)

Since G is a junction tree, the subtree on d is also a junction tree. Now |d| < M, and so by the induction hypothesis there exists a junction tree on d compatible with P′_t, the finest valid partition of d\{t} w.r.t. t.

Now we claim that R = {c_k\{t} : 1 ≤ k ≤ K} is a valid partition of d\{t} w.r.t. t. To see this, let c = c_k for some arbitrary k = 1, · · · ,K, and let c′ = d\c, so that F_d = F_c ∨ F_{c′}. One of c and c′ contains t. Then by the properties of a valid partition w.r.t. i, we have:

F_c ∨ F_t ⊥⊥ F_{c′} | F_i   or   F_c ⊥⊥ F_{c′} ∨ F_t | F_i;

also, F_i ⊥⊥ F_c ∨ F_{c′} | F_t, since t separates i from d on G.

Then by (2f) followed by (2b), the last relations imply that F_c ⊥⊥ F_{c′} | F_t, and we are done.


Next we show that for all k ∈ {1, · · · ,K − 1} (so that t ∉ c_k), c_k is an element of P′_t. If not, then there exists a c = c_k ∈ R with t ∉ c, s.t. c is the disjoint union of some subsets e_1 and e_2 and F_{e_1} ⊥⊥ F_{e_2} | F_t. Also F_{e_1} ∨ F_{e_2} ⊥⊥ F_i | F_t, so by (2d) we get F_{e_1} ⊥⊥ F_{e_2} ∨ F_t | F_i. We also have F_{e_1} ∨ F_{e_2} ⊥⊥ F_t | F_i, since e_1 ∪ e_2 = c and t belongs to another set in the finest valid partition w.r.t. i. From the last two relations and by (2f) followed by (2b) we get F_{e_1} ⊥⊥ F_{e_2} | F_i. But by Lemma 4.4, c ∈ P_i cannot be so decomposed, so {e_1, e_2} = {c, ∅} and we have proved the claim.

So we have shown, by induction, that there exists a junction tree G_d on d, where node t has at least K − 1 neighbors with subtrees corresponding to c_k, 1 ≤ k ≤ K − 1. Now we modify the original junction tree G in K + 1 steps to get trees H, H_0, · · · , H_{K−1}, as follows.

First we form H by replacing the subtree in G on d with the tree G_d above, connecting i to t with an edge. By Lemma 4.6, H is a junction tree on {1, · · · ,M}. Let H_0 be the subtree of H after removing the subtrees around t on c_k, 1 ≤ k ≤ K − 1. Then H_0 is also a junction tree. For each j = 1, · · · ,K − 1 let L_j be the subtree of H on c_j, and let x_j be the node on c_j that was connected to t in H. Then at each step j = 1, · · · ,K − 1 we form H_j by joining H_{j−1} and L_j, adding the edge between i and x_j (see Figure 2).

[Figure: the trees H, H_0 and H_{K−1}. In H, node t carries the subtrees L_1, · · · , L_{K−1} on c_1, · · · , c_{K−1} (rooted at x_1, · · · , x_{K−1}) along with c_K and d′, and i hangs from t; in H_0 those subtrees have been removed; in H_{K−1} they have been reattached to i.]

Figure 2: Transformation of Junction Trees

We now show inductively that each H_j is a junction tree. By the induction hypothesis, H_{j−1} is a junction tree. At the same time L_j, being a subtree of a junction tree, is also a junction tree. Further,

F_{c_j} ⊥⊥ F_{c_K} ∨ F_{d′} ∨ F_{c_1∪···∪c_{j−1}} | F_i,

since c_j is a set in a valid partition w.r.t. i. Also,

F_{c_j} ⊥⊥ F_{c_K} ∨ F_{d′} ∨ F_{c_1∪···∪c_{j−1}} | F_{x_j},

since on the junction tree H, node x_j separates c_j from c_K ∪ d′ ∪ c_1 ∪ · · · ∪ c_{j−1}. Then by Lemma 4.6, each H_j is a junction tree. (Note that H_{K−1} is a junction tree on {1, · · · ,M}.)

Next we perform the same transformation on H_{K−1}, starting with the other neighbors of i. The resulting tree will be a junction tree, and will be compatible with P_i.


4.5 Algorithm to Find a Junction Tree

We will now produce an algorithm to find a junction tree when one exists. Given a set of σ-fields {F_1, · · · ,F_M}:

Algorithm 3. Pick any node i ∈ {1, · · · ,M} as the root.

• If M = 2 then the single edge (1, 2) is a junction tree. Stop.

• Find the finest valid partition of {1, · · · ,M}\{i} w.r.t. i, P_i = {c_1, · · · , c_l} (see notes below).

• For j = 1 to l

  • Find a node t ∈ c_j s.t. F_i ⊥⊥ F_{c_j} | F_t. If no such node exists, then stop; no junction tree exists.

  • Find a junction tree on c_j with node t as root. Attach this tree by adding the edge (i, t).

• End For

Note: In the general case of signed conditional independence, we know of no better way to find the finest valid partition than an exhaustive search over an exponential subset of all the partitions. In the case of unsigned measures, however, we can show that when a junction tree exists, the finest valid partition coincides with the finest pairwise partition, which can be found in polynomial time.

Proof. At each iteration, t is chosen so that F_i ⊥⊥ F_{c_j} | F_t. But we also had F_{c_j^c} ⊥⊥ F_t ∨ F_{c_j} | F_i. By (2e), the last two relations imply F_{c_j} ⊥⊥ F_i ∨ F_{c_j^c} | F_t. But we also have F_{c_j} ⊥⊥ F_i ∨ F_{c_j^c} | F_i. So by Lemma 4.6 we have a junction tree at each step. Also, from Theorem 4.7, if the algorithm fails then there is no junction tree.
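The recursion in Algorithm 3 can be sketched as follows. Here `finest_valid_partition` and `cond_indep` are placeholders for the partition search and the conditional-independence test discussed above, and the three-node chain used to exercise the sketch is an assumed toy instance:

```python
def build_junction_tree(nodes, root, finest_valid_partition, cond_indep):
    """Recursive skeleton of Algorithm 3; raises if no junction tree exists."""
    edges = []

    def grow(sub_nodes, i):
        for cell in finest_valid_partition(sub_nodes, i):
            # find t in the cell with F_i independent of F_cell given F_t
            t = next((t for t in cell
                      if cond_indep(frozenset({i}), frozenset(cell), t)), None)
            if t is None:
                raise ValueError("no junction tree exists")
            edges.append((i, t))
            grow(frozenset(cell), t)   # recurse: junction tree on the cell

    grow(frozenset(nodes), root)
    return edges

# Toy instance: a Markov chain F_1 - F_2 - F_3, encoded as lookup functions.
def chain_partition(sub_nodes, i):
    rest = sorted(sub_nodes - {i})
    if not rest:
        return []
    return [[1], [3]] if (i == 2 and rest == [1, 3]) else [rest]

def chain_indep(a, cell, t):
    return True   # in a chain, the neighbor on the path always separates

print(build_junction_tree({1, 2, 3}, 2, chain_partition, chain_indep))
# -> [(2, 1), (2, 3)]
```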

5 Construction of Junction Tree - Lifting

In the previous section we gave an algorithm to find a junction tree when one exists. In this section we deal with the case when Algorithm 3 declares that no junction tree exists for the given set of σ-fields. In particular, we would like to modify the σ-fields in some minimal sense, so as to ensure that we can construct a junction tree.

Definition 5. Let (Ω, {F_1, · · · ,F_M}, µ) be a collection of measure spaces. We call another collection of measure spaces (Ω′, {F′_1, · · · ,F′_{M′}}, µ′) a lifting of (Ω, {F_1, · · · ,F_M}, µ) if there are maps f : Ω′ −→ Ω and σ : {1, · · · ,M} −→ {1, · · · ,M′} such that:

• µ′ is consistent with µ under the map f, i.e. ∀ A ∈ F_{1,··· ,M}, µ(A) = µ′(f^{−1}(A)).

• For all i = 1, · · · ,M, f is (F′_{σ(i)}, F_i)-measurable, i.e. ∀ A ∈ F_i, f^{−1}(A) ∈ F′_{σ(i)},

where for A ⊂ Ω, f^{−1}(A) ≜ {ω′ ∈ Ω′ : f(ω′) ∈ A}.

In words, up to some renaming of the elements, each σ-field F_i is a sub-σ-field of some F′_j (namely F′_{σ(i)}), and this F′_j is obtained from F_i by splitting some of the atoms.

We now describe the connection of the above concept with our problem.

Let (Ω′, {F′_1, · · · ,F′_{M′}}, µ′) be a lifting of (Ω, {F_1, · · · ,F_M}, µ) with lifting maps f : Ω′ −→ Ω and σ : {1, · · · ,M} −→ {1, · · · ,M′} as described in Definition 5. Let G′ be a junction tree on {1, · · · ,M′} corresponding to the σ-fields {F′_1, · · · ,F′_{M′}}. We will construct a junction tree G′′ from G′ such that running Algorithm 2 on G′′ will produce the desired conditional expectations at the appropriate nodes.

For each i = 1, · · · ,M, let G_i be the σ-field on Ω′ with atoms A(G_i) = {f^{−1}(a) : a ∈ A(F_i)}, and let Y_i ∈ G_i be the random variable with Y_i(f^{−1}(a)) = X_i(a) for all a ∈ A(F_i), so that up to a renaming of the atoms and elements, (Ω, F_i, µ) and (Ω′, G_i, µ′) are the same measure space and X_i and Y_i are the same random variable. Let G′′ be the tree with nodes {1, · · · ,M′, M′ + 1, · · · ,M′ + M} — with corresponding σ-fields F′_1, · · · ,F′_{M′}, G_1, · · · ,G_M and random variables 1, · · · , 1, Y_1, · · · , Y_M — which is generated by starting with G′ and adding the edges (σ(j), M′ + j) for each j = 1, · · · ,M. In words, G′′ is a graph obtained from G′ by attaching, for each i = 1, · · · ,M, a new node with σ-field G_i (which is in turn equivalent to the original F_i) to the node whose σ-field contains that G_i. Then by Lemma 4.6, G′′ is a junction tree, and hence running Algorithm 2 on G′′ will produce E[∏_{i=1}^{M} Y_i | G_j] at the node labelled M′ + j for each j = 1, · · · ,M. But these are equivalent to E[∏_{i=1}^{M} X_i | F_j] for j = 1, · · · ,M, and we have thus solved the probabilistic MPF problem.

So we only need to establish how to lift a given collection of σ-fields to create the required independencies and form a junction tree.

Suppose we have three σ-fields F_1, F_2 and F_3, and we would like to lift the measure space (Ω, {F_1,F_2,F_3}, µ) into (Ω′, {F′_1,F′_2,F′_3}, µ′) in a way that creates the independence F′_1 ⊥⊥ F′_3 | F′_2. Let a_i ∈ A(F_1), c_k ∈ A(F_2) and b_j ∈ A(F_3) be arbitrary atoms. For each c_k, let A_k be the matrix with (i, j) entry equal to µ(a_i, b_j, c_k). Then µ(c_k) = 0 iff the sum of the entries of A_k is zero. However, we cannot work with a nonzero matrix whose entries sum to zero; if this is the case, we first decompose the matrix into the sum of two matrices, each with nonzero sum of entries. This corresponds to splitting the atom c_k in such a way that the new atoms have nonzero measure.

Next, for each such matrix A_k with nonzero sum of entries, the independence condition corresponding to c_k is exactly the condition that A_k is rank-one. If, however, A_k is not rank-one, we can split c_k using an optimal decomposition of A_k as the sum of, say, q rank-one matrices so that none of the matrices is zero-sum.⁴ This corresponds to splitting the atom c_k into q atoms c_{k_1}, · · · , c_{k_q}, where each of the c_{k_l}'s renders F_1 and F_3 independent. The c_{k_l}'s are then atoms of F′_2.
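A minimal sketch of the atom-splitting step, using a greedy pivot-based rank-one decomposition rather than the optimal decomposition described above; the 2 × 2 matrix A_k below is an assumed toy joint-measure table:

```python
# Split an atom c_k: decompose the joint-measure matrix A_k (entries
# mu(a_i, b_j, c_k)) into a sum of rank-one matrices by successive pivoting.
# Each rank-one term becomes the measure table of one new atom, which makes
# F_1 and F_3 conditionally independent given that atom.  (The optimal
# decomposition in the text would also minimize the number of terms and of
# nonzero entries; this greedy version is only illustrative.)

def rank_one_split(A, eps=1e-12):
    A = [row[:] for row in A]
    terms = []
    while True:
        pivot = next(((i, j) for i, row in enumerate(A)
                      for j, v in enumerate(row) if abs(v) > eps), None)
        if pivot is None:
            return terms
        p, q = pivot
        # Rank-one "cross" matching row p and column q of the remainder.
        T = [[A[i][q] * A[p][j] / A[p][q] for j in range(len(A[0]))]
             for i in range(len(A))]
        terms.append(T)
        A = [[A[i][j] - T[i][j] for j in range(len(A[0]))]
             for i in range(len(A))]

A_k = [[0.10, 0.20],
       [0.30, 0.15]]
terms = rank_one_split(A_k)
# The terms sum back to A_k; one new atom per term.
recon = [[sum(T[i][j] for T in terms) for j in range(2)] for i in range(2)]
assert all(abs(recon[i][j] - A_k[i][j]) < 1e-9 for i in range(2) for j in range(2))
```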

5.1 Algorithm to Construct a Junction Tree

Combining the above ideas with Algorithm 3, we obtain:

Algorithm 4. Pick any node i ∈ {1, · · · ,M} as the root.

• If M = 2 then the single edge (1, 2) is a junction tree. Stop.

• Find any⁵ valid partition of {1, · · · ,M}\{i} w.r.t. i, P_i = {c_1, · · · , c_l}.

• For j = 1 to l

  • Find a node t ∈ c_j s.t. F_i ⊥⊥ F_{c_j} | F_t. If no such node exists, then pick any t ∈ c_j. Lift F_t by splitting some of its atoms, as discussed above, so as to have F_i ⊥⊥ F_{c_j} | F_t (see notes below).

  • Find a junction tree on c_j with node t as root. Attach this tree by adding the edge (i, t).

• End For

The resulting collection (Ω′, {F′_1, · · · ,F′_{M′}}, µ′) is a lifting of the original collection with M′ = M, and the tree generated by this algorithm is a junction tree corresponding to this lifted collection of σ-fields.

In many cases, however, it is possible to achieve a less complex junction tree algorithm by allowing M′ to exceed M. The following optional steps can be used for a possible reduction of complexity:

• For each edge (i, j) in the resultant junction tree,

  • Create a σ-field G_{(i,j)} that renders F_i and F_j independent, i.e. F_i ⊥⊥ F_j | G_{(i,j)}. This can be done by starting with {∅, Ω} and applying the rank-one decomposition techniques of the previous section.

  • Insert a new node corresponding to G_{(i,j)} into the tree, between i and j.

• End For

⁴This can always be done with q = rank(A_k) if A_k is not zero-sum. Obviously rank(A_k) is also a lower bound for q. An optimal decomposition, however, not only aims to minimize q, but also involves minimizing the number of nonzero entries of the decomposing matrices, as discussed in Section 4.3.

⁵Although any valid partition will work, in general finer partitions should result in better and less complex algorithms (see Section 4.3).


Notes: In general, the size of the Ω space can grow multiplicatively with each 'lifting,' so a full representation of the measure spaces requires exponentially large storage resources. However, as mentioned in Section 4.3, a full description of the algorithm requires storing only the W(i, j) matrices of the conditional measures of the atoms of the neighboring σ-fields. A careful reader may have noticed that we have not completely specified the lifting maps, as defined in Definition 5, on the entire collection of measure spaces. Once the lifting has been done on a subset of the σ-fields (so as to create the desired independencies), there is a unique way to update the measure function on the entire sample space that ensures that the previous relations still hold. Such an extension, however, is unnecessary, since by the consistency of the measures in a lifting, the matrices of conditional measures along the other edges of the tree remain the same.

6 Examples

In this section we consider a few examples where GDL-based algorithms are applicable. We will work out the details of the process of making a junction tree, and compare the complexity of the message-passing algorithm with that of GDL.

Our Example 1 at the beginning of this paper can be generalized as follows:

Example 2. Let A_i, i = 1, · · · ,M, be finite sets of size q, and for each i = 1, · · · ,M let X_i be a real function on A_i. Let µ be a real (weight) function on A_1 × A_2 × · · · × A_M. We would like to compute the weighted average of the product of the X_i's, i.e.

E = ∑_{a_i∈A_i} X_1(a_1) X_2(a_2) · · · X_M(a_M) µ(a_1, · · · , a_M).

Suppose that the weight function is of the following form:

µ(a_1, · · · , a_M) = ∏_{i=1}^{M} f_i(a_i) + ∏_{i=1}^{M} g_i(a_i)

As long as the weight function µ is not in product form, the most efficient GDL algorithm will prescribe

E = ∑_{a_1∈A_1} X_1(a_1) · · · ∑_{a_M∈A_M} X_M(a_M) µ(a_1, · · · , a_M),

requiring O(q^M) additions and multiplications, corresponding to the following junction tree.

[Figure: a chain of nodes {a_1, · · · , a_M}, {a_1, · · · , a_{M−1}}, · · · , {a_1}, with the local functions X_M(a_M)µ(a), X_{M−1}(a_{M−1}), · · · , X_1(a_1) attached along the chain.]

Figure 3: GDL Junction Tree


Now let Ω be the product space A_1 × A_2 × · · · × A_M with signed measure µ, and for i = 1, · · · ,M let F_i be the σ–field containing the planes {a_i = c}, so that X_i ∈ F_i. Let F_0 = {∅, Ω} be the trivial σ–field. Then the problem is to find the conditional expectation of the product of the X_i's given F_0.

The best junction tree is obtained by lifting the space so that, given F′_0, all other σ–fields are mutually conditionally independent. To do this, we split the atom Ω of F_0 into two atoms Ω_1 and Ω_2. In effect, for each element (a_1, · · · , a_M) of Ω, the new space Ω′ has two elements, (a_1, · · · , a_M) ∩ Ω_1 and (a_1, · · · , a_M) ∩ Ω_2. The new weight function is defined on Ω′ as µ′((a_1, · · · , a_M) ∩ Ω_1) = ∏_{i=1}^{M} f_i(a_i) and µ′((a_1, · · · , a_M) ∩ Ω_2) = ∏_{i=1}^{M} g_i(a_i).

Then there is a star-shaped junction tree on {0, · · · ,M} with node 0 at the center. The message-passing algorithm on this tree is:

E = E[∏_i X_i | F_0](Ω)
  = E[∏_i X_i | F′_0](Ω_1) + E[∏_i X_i | F′_0](Ω_2)
  = ∏_{i=1}^{M} E[X_i | F′_0](Ω_1) + ∏_{i=1}^{M} E[X_i | F′_0](Ω_2)
  = ∏_{i=1}^{M} ∑_{a_i∈A_i} X_i(a_i)f_i(a_i) + ∏_{i=1}^{M} ∑_{a_i∈A_i} X_i(a_i)g_i(a_i)

Note that this requires only O(Mq) additions and multiplications.
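A small numerical sketch of Example 2, comparing the O(Mq) evaluation that exploits the two-product form of µ against the O(q^M) brute force; the random X_i, f_i, g_i below are assumed toy data:

```python
from itertools import product
import random

random.seed(0)
M, q = 4, 3
# Assumed toy data: X_i, f_i and g_i are arbitrary real functions on
# A_i = {0, ..., q-1}.
X = [[random.uniform(-1, 1) for _ in range(q)] for _ in range(M)]
f = [[random.uniform(-1, 1) for _ in range(q)] for _ in range(M)]
g = [[random.uniform(-1, 1) for _ in range(q)] for _ in range(M)]

# O(Mq): exploit mu = prod_i f_i + prod_i g_i, as on the lifted star tree.
def prod_of_sums(h):
    result = 1.0
    for i in range(M):
        result *= sum(X[i][a] * h[i][a] for a in range(q))
    return result

E_fast = prod_of_sums(f) + prod_of_sums(g)

# O(q^M): brute-force sum over the whole product space.
E_slow = 0.0
for a in product(range(q), repeat=M):
    term, mu, nu = 1.0, 1.0, 1.0
    for i in range(M):
        term *= X[i][a[i]]
        mu *= f[i][a[i]]
        nu *= g[i][a[i]]
    E_slow += (mu + nu) * term

assert abs(E_fast - E_slow) < 1e-9
```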

In the next example we show that Pearl's treatment of the belief propagation algorithm in the case of a node with disjunctive interaction or noisy-or-gate causation ([9], Section 4.3.2) can be viewed as a special case of our algorithm:

Example 3. Bayesian Network with Disjunctive Interaction. Let the binary inputs U = (U_1, U_2, · · · , U_n) be the parents of the binary node X in the Bayesian network of Figure 4, interacting on X through a noisy-or-gate.

[Figure: the nodes U_1, U_2, · · · , U_n, each with an arrow into the single child node X.]

Figure 4: Bayesian Network of Example 3

This means that there are parameters q_1, · · · , q_n ∈ [0, 1] so that

P(X = 0 | U) = ∏_{i∈T_u} q_i
P(X = 1 | U) = 1 − ∏_{i∈T_u} q_i


where T_u ≜ {i : U_i = 1}.

The normal moralization and triangulation technique applied to this graph will give a single clique with all the variables. However, because of the structure in the problem, a better solution exists.

Let F_X, F_1, · · · ,F_n be the σ-fields generated by the (independent) variables x, u_1, · · · , u_n respectively. All variables are binary, so each of the σ-fields has precisely two atoms. In our framework, let the 'random variables' be (the function) 1 ∈ F_X and π_X(u_i) = P(U_i = u_i) ∈ F_i for i = 1, · · · , n, with the underlying joint measure on (x, u_1, · · · , u_n) defined to be µ(x, u_1, · · · , u_n) = P(X = x | U = (u_1, · · · , u_n)). The F_i's are then not mutually conditionally independent given F_X; however, the following simple lifting of the space will create the independence. Let a variable x′ take values in {0, 1, 2}, where the event {x′ = 0} corresponds to {x = 0}, and {x′ = 1} ∪ {x′ = 2} corresponds to {x = 1}. Extend the measure as follows:

µ′(x′, u_1, · · · , u_n) =
  ∏_{i∈T_u} q_i      if x′ = 0
  −∏_{i∈T_u} q_i     if x′ = 1
  1                  if x′ = 2

Then we see that in this lifted space, the σ-fields F′_i (generated by the variables u_i respectively) are mutually conditionally independent given F′_{X′} (the σ-field generated by the variable x′). We thus have a junction tree in the shape of a star, with F′_{X′} corresponding to the central node. The junction tree algorithm will calculate the following marginalized random variable at the node corresponding to F′_{X′}:

β(x′) =
  ∏_{i=1}^{n} (P(U_i = 0) + P(U_i = 1)q_i)     if x′ = 0
  −∏_{i=1}^{n} (P(U_i = 0) + P(U_i = 1)q_i)    if x′ = 1
  1                                            if x′ = 2

Then the belief at X is the function

BEL(x) =
  ∏_{i=1}^{n} (P(U_i = 0) + P(U_i = 1)q_i)        if x = 0
  1 − ∏_{i=1}^{n} (P(U_i = 0) + P(U_i = 1)q_i)    if x = 1

where we have merged the atoms {x′ = 1} and {x′ = 2} of F′_{X′} to get back {x = 1}.

This is essentially the same as Equation (4.57) in [9].
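The noisy-or belief formula can be checked directly; the sketch below compares the closed-form product against a brute-force sum over all parent configurations (the parameter values p1 and q are assumed toys):

```python
from itertools import product

# Assumed toy parameters for a noisy-or node with n = 3 binary parents.
p1 = [0.2, 0.5, 0.7]            # P(U_i = 1)
q  = [0.9, 0.4, 0.6]            # noise parameters q_i

# Closed-form belief from the star-shaped junction tree:
# BEL(0) = prod_i (P(U_i=0) + P(U_i=1) q_i)
bel0 = 1.0
for pi, qi in zip(p1, q):
    bel0 *= (1 - pi) + pi * qi
bel = {0: bel0, 1: 1 - bel0}

# Brute-force check: BEL(x) = sum_u P(u) P(X = x | u)
brute = {0: 0.0, 1: 0.0}
for u in product([0, 1], repeat=3):
    pu = 1.0
    for ui, pi in zip(u, p1):
        pu *= pi if ui else 1 - pi
    px0 = 1.0                    # P(X=0|u) = prod over active parents of q_i
    for ui, qi in zip(u, q):
        if ui:
            px0 *= qi
    brute[0] += pu * px0
    brute[1] += pu * (1 - px0)

assert abs(bel[0] - brute[0]) < 1e-12 and abs(bel[1] - brute[1]) < 1e-12
```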

Example 4. Hadamard Transform. Let x_1, · · · , x_n be binary variables and let f(x_1, · · · , x_n) be a real function of the x's. The Hadamard transform of f is defined as

g(y_1, · · · , y_n) = ∑_{x_1,··· ,x_n} ∏_{i=1}^{n} (−1)^{x_i y_i} f(x_1, · · · , x_n)

where y_1, · · · , y_n are binary variables.

Since our framework is particularly useful when the underlying functions are structured, we consider the case when f is a symmetric function of x_1, · · · , x_n,


i.e. f depends only on the sum of the x_i's. Then it is easy to verify the following.

Claim: When f is symmetric, its Hadamard transform g is also a symmetric function.

We now set up the problem in our framework. Let Ω be {0, 1}^{2n}, with elements ω = (x_1, · · · , x_n, y_1, · · · , y_n). Let F and G be the σ–fields in which f and g, respectively, are measurable; in our case of symmetric f and g, A(F) = {α_k : k = 0, · · · , n} = {{ω : ∑_i x_i = k} : k = 0, · · · , n} and A(G) = {β_k : k = 0, · · · , n} = {{ω : ∑_i y_i = k} : k = 0, · · · , n}.

Next we note that all the factors involving the terms (−1)^{x_i y_i} can be summarized as a signed measure µ on F ∨ G as follows:

µ(α_j, β_k) = ∑_{ω∈α_j∩β_k} (−1)^{∑_i x_i y_i}
            = ∑_{ω : ∑_i x_i = j, ∑_i y_i = k} (−1)^{∑_i x_i y_i}
            = ∑_{(x_1,··· ,x_n) : ∑_i x_i = j} (−1)^{∑_{i=1}^{k} x_i}

Note that µ can be stored in an (n + 1) × (n + 1) table.

Now we have a junction tree with only two nodes, corresponding to F and G, and the marginalization is done as follows:

g(β_k) = E[f | G](β_k) = ∑_{j=0}^{n} f(α_j)µ(α_j, β_k)

where f(α_j) = f({(x_1, · · · , x_n) : ∑_i x_i = j}) and g(β_k) = g({(y_1, · · · , y_n) : ∑_i y_i = k}).

This requires only n additions and (n + 1) multiplications for each of the (n + 1) possible values of g.
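A sketch of the two-node junction tree computation for the symmetric Hadamard transform, checked against the direct definition; `f_sym` below is an assumed toy symmetric function, represented by its values on the atoms α_0, · · · , α_n:

```python
from itertools import product

n = 4
# f_sym[k] = f(alpha_k): an assumed toy symmetric function of x_1, ..., x_n.
f_sym = [1.0, 2.0, 0.5, -1.0, 3.0]

# mu(alpha_j, beta_k): sum over x of weight j of (-1)^(x_1 + ... + x_k),
# the pairing factor for any fixed y of weight k.
def mu(j, k):
    return sum((-1) ** sum(x[:k])
               for x in product([0, 1], repeat=n) if sum(x) == j)

# Two-node junction tree: g(beta_k) = sum_j f(alpha_j) mu(alpha_j, beta_k),
# i.e. n additions and n+1 multiplications per value of g.
g_sym = [sum(f_sym[j] * mu(j, k) for j in range(n + 1)) for k in range(n + 1)]

# Direct Hadamard transform for comparison.
def g_brute(y):
    return sum(f_sym[sum(x)] * (-1) ** sum(xi * yi for xi, yi in zip(x, y))
               for x in product([0, 1], repeat=n))

for k in range(n + 1):
    y = tuple(1 if i < k else 0 for i in range(n))   # representative of weight k
    assert abs(g_sym[k] - g_brute(y)) < 1e-9
```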

We have created a Matlab library containing the necessary functions to set up a general marginalization problem and create a junction tree. The following examples were processed using that code.

Example 5. Probabilistic State Machine. Consider the Bayesian network depicted in Figure 5, where u_i, s_i and y_i denote the inputs, hidden states and outputs of a chain of length n. Let m be the memory of the state machine, so that each state s_{i+1} can be taken to be (u_{i−m+1}, · · · , u_i).

The problem of finding the maximum-likelihood input symbols is solved by the BCJR algorithm [3], which is an implementation of the GDL algorithm on the junction tree obtained by moralization and triangulation of the above Bayesian network. The local functions are P(s_0), P(u_i), P(y_i | u_i, s_i) and P(s_i | u_{i−1}, s_{i−1}). At each stage i, a message-update involves marginalizing a


[Figure: a chain of hidden states s_0, s_1, · · · , s_{n−1}, with the input u_i and output y_i attached at each stage i.]

Figure 5: Bayesian Network for a Probabilistic State Machine

function f(u_{i−m}, · · · , u_i) over u_{i−m}. So in the case of binary inputs and outputs, the BCJR algorithm will require about 5n2^m additions and multiplications.

Now consider a case when the output of the state machine depends on the input and state in a simple, but non-product, form. For the purposes of this example, we have chosen the output y_i to be the outcome of an 'OR' gate on the state s_i and input u_i, passed through a binary symmetric channel, i.e.

P(y_i | u_i, s_i) = (1 − p) · 1(y_i = ∨_{j=0}^{m} u_{i−j}) + p · 1(y_i ≠ ∨_{j=0}^{m} u_{i−j})

where '∨' indicates a logical 'OR', and 1(·) is the indicator function.

We formed the Ω space as {0, 1}^n, with elements ω = (u_0, · · · , u_{n−1}). Then each function P(y_i | u_i, s_i) is measurable in a σ–field F_i, with two atoms {ω : ∨_{j=0}^{m} u_{i−j} = 0} and {ω : ∨_{j=0}^{m} u_{i−j} = 1}. Since we would like to calculate the posterior probabilities of each input u_i, we also include σ–fields G_i, each with two atoms {ω : u_i = 0} and {ω : u_i = 1}.

We then run our algorithm to create a junction tree on the F_i's and the G_i's, lifting the space whenever needed. The result is a chain consisting of the F′_i's, with each G′_i hanging from its corresponding F′_i (see Figure 6).

′i (see Figure 6).

We have run the algorithm on different values of n and m. Table 1 comparesthe complexity of the message-passing algorithm on the probabilistic junctiontree and the normal GDL (BCJR) algorithm. To count for the final marginaliza-tion at the atoms of the original σ-fields, we have used 5nz as the (approximate)arithmetic complexity of our algorithm, where nz =

(i,j) an edge nz(i, j) is thetotal number of nonzero elements in the tables of pairwise joint measure alongthe edges of the tree (see Section 4.3).

The details of the case n = 12, m = 6 are portrayed in Figure 6. The numbers underneath each F′_i show the number of atoms of F′_i ∨ G′_i after the lifting has been done. Note that with our setup, originally F_0 ∨ G_0 has 2 atoms, and all the other F_i ∨ G_i's have 3 atoms. The numbers under the brackets denote nz(i, j), the number of nonzero elements in the matrix of joint measures between the atoms of adjacent nodes; as mentioned before, the number of operations required in the computation of the message from one node to an adjacent node is proportional to this number.


(n, m)    (9,5)  (9,6)  (9,7)  (10,5) (10,6) (10,7) (11,5) (11,6) (11,7) (12,5) (12,6) (12,7)
nz          95    130    123    107    133    186    119    147    217    129    159    230
GDL ops   1440   2880   5760   1600   3200   6400   1760   3520   7040   1920   3840   7680
PGDL ops   475    650    615    535    665    930    595    735   1085    645    795   1150

Table 1: Comparison between the complexity of GDL and probabilistic GDL. Here nz denotes the total number of nonzero entries in the tables of pairwise joint measures. For a chain of length n and memory m, the GDL algorithm requires about 5n2^m arithmetic operations, while PGDL requires about 5nz operations.

[Figure: the chain F′_0, · · · ,F′_{11}, with each G′_i attached to its F′_i. The atom counts of the F′_i ∨ G′_i after lifting, and the values nz(i, j) along the edges, are marked on the tree.]

Figure 6: Junction Tree Created for Chain of Length 12 and Memory 6

Example 6. CH-ASIA. Our last example is CH-ASIA from [5], pp. 110–111. The chain graph of Figure 7 describes the dependencies between the variables. Thus the problem is to marginalize P(S,A,L,T,B,E,D,C,X), which is the

[Figure: the chain graph on the variables A, S, T, L, B, E, X, D and C.]

Figure 7: Graph of CH-ASIA Example

product of the following functions: P(S), P(A), P(L|S), P(T|A), P(B|S), 1(E = L ∨ T), P(X|E), f(C,D,B), g(C,B,E) and h(B,E).

Again we set up measurable spaces, with σ–fields corresponding to each of the above functions. We then ran the lifting algorithm to find a junction tree in the form of a chain, as in the previous example. This time, however, due to the lack of structure at the level of the marginalizable functions (i.e. the aforementioned


conditional probabilities), the algorithm produced exactly the junction tree that one could obtain by the process of moralization and triangulation at the level of the original variables. In other words, all liftings were done by the addition of one or more 'whole' orthogonal directions (i.e. GDL variables) of the Ω space to the σ–fields. After reconverting the σ–fields to 'variables', the junction tree we obtained was the following:

S — A,S — L,S,A — T,A,E,S — B,S,E — B,C,D,E — B,E,C — X,E

Figure 8: Junction Tree for CH-ASIA Example

In this case, our algorithm has reduced to GDL.

7 Discussion

In this paper we have developed a measure-theoretic version of the junction tree algorithm. We have generalized the notions of independence and junction tree at the level of σ–fields, and have produced algorithms to find or construct a junction tree on a given set of σ–fields. By taking advantage of structures at the atomic level of the sample space Ω, our marginalization algorithm is capable of producing solutions far less complex than GDL-type algorithms. The cost of generating a junction tree, however, is exponential in the size of the problem, as is the size of any complete representation of the sample space Ω. Once a junction tree has been constructed, however, the algorithm depends only on the joint measure of the atoms of adjacent pairs of σ–fields on the tree. This means that an ‘algorithm’ which was built by considering an Ω space with myriads of elements can be stored compactly and efficiently.

Using our framework, the tradeoff between the construction complexity of junction trees and the overall complexity of the marginalization algorithm can be made with an appropriate choice of representation for the measurable spaces; at one extreme, one considers the complete sample space, taking advantage of all the possible structure, and at the other, one represents the sample space with independent variables (i.e. orthogonal directions), in which case our framework reduces to GDL, both in concept and implementation.

The validity of this theory for signed measures is of enormous convenience; it allows for the introduction of atoms of negative weight in order to create independencies. It also greatly simplifies the task of lifting, since it can now be done by taking the singular value decomposition of a matrix. In contrast, the problem of finding a positive rank-one decomposition of a positive matrix (which would arise if one confined the problem to positive measures) is very hard (see [4]).
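To make the SVD remark concrete, here is a small sketch in Python. The matrix M below stands in for the joint-measure matrix of the atoms of two σ–fields (the entries are hypothetical, chosen to have rank one); the leading singular triple then yields an exact rank-one factorization, which is the signed-measure lifting step:

```python
import numpy as np

# Hypothetical joint-measure matrix of the atoms of two sigma-fields.
# Rows are proportional, so M has rank one.
M = np.array([[0.2, 0.1],
              [0.4, 0.2]])

# Lifting via SVD: the leading singular triple gives M = u v^T exactly
# when rank(M) == 1; for signed measures, negative entries in u or v
# are permitted, which is what makes this decomposition easy to find.
U, s, Vt = np.linalg.svd(M)
u = U[:, 0] * s[0]
v = Vt[0, :]
M_hat = np.outer(u, v)       # rank-one reconstruction

assert np.allclose(M, M_hat)
```

By contrast, requiring u and v to be entrywise nonnegative (the positive-measure case) turns this into the nonnegative rank-one decomposition problem discussed in [4], for which no comparably simple tool exists.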


References

[1] S.M. Aji and R.J. McEliece, “The Generalized Distributive Law,” IEEE Trans. Inform. Theory, 46 (no. 2), March 2000, pp. 325–343.

[2] F. Baccelli, G. Cohen, G.J. Olsder and J.P. Quadrat, Synchronization and Linearity, New York: Wiley, 1992.

[3] L.R. Bahl, J. Cocke, F. Jelinek and J. Raviv, “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate,” IEEE Trans. Inform. Theory, IT-20, March 1974, pp. 284–287.

[4] J.E. Cohen and U.G. Rothblum, “Nonnegative Ranks, Decompositions, and Factorization of Nonnegative Matrices,” Linear Algebra Appl., 190, 1993, pp. 149–168.

[5] R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J. Spiegelhalter, Probabilistic Networks and Expert Systems, New York: Springer-Verlag, 1999.

[6] D.J.C. MacKay and R.M. Neal, “Good Codes Based on Very Sparse Matrices,” Cryptography and Coding, 5th IMA Conference, Proceedings, pp. 100–111, Berlin: Springer-Verlag, 1995.

[7] R.J. McEliece, D.J.C. MacKay and J.F. Cheng, “Turbo Decoding as an Instance of Pearl’s ‘Belief Propagation’ Algorithm,” IEEE Journal on Selected Areas in Communications, 16 (no. 2), Feb. 1998, pp. 140–152.

[8] P. Pakzad and V. Anantharam, “Conditional Independence for Signed Measures,” in preparation.

[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann, 1988.

[10] G.R. Shafer and P.P. Shenoy, “Probability Propagation,” Ann. Math. Artif. Intell., 2, 1990, pp. 327–352.
