


Lazy Evaluation in Penniless Propagation over Join Trees

Andrés Cano, Department of Computer Science and Artificial Intelligence, University of Granada, Spain

Serafín Moral, Department of Computer Science and Artificial Intelligence, University of Granada, Spain

Antonio Salmerón, Department of Statistics and Applied Mathematics, University of Almería, Spain

In this paper, we investigate the application of the ideas behind Lazy propagation to the Penniless propagation scheme. Probabilistic potentials attached to the messages and the nodes of the join tree are represented in a factorized way as a product of (approximate) probability trees, and the combination operations are postponed until they are compulsory for the deletion of a variable. We tested two variations of the basic Lazy scheme: One is based on keeping a hash table for the operations with probabilistic potentials that are carried out more than once during the propagation, to avoid repeating computations; the other uses a heuristic method to determine the order of the operations when combining a set of potentials. © 2002 Wiley Periodicals, Inc.

Keywords: Bayesian networks; join trees; lazy evaluation; penniless propagation

1. INTRODUCTION

A Bayesian network is an efficient representation of a joint probability distribution over a set of variables, where the network structure encodes the independence relations between the variables. Bayesian networks are commonly used to draw inferences about the probability distribution on some variables of interest, given that the values of some other variables are known.

Several methods to obtain the exact marginal distributions by local computations have been proposed in recent years [8, 13, 23, 24]. Local computation consists of calculating the marginals without actually computing the joint distribution and is described in terms of a message passing scheme over a structure called a join tree. If the problem is too complicated, however, the application of these schemes may become infeasible due to a high requirement of resources (computing time and memory). A new technique to improve local computation algorithms was recently proposed, called Lazy propagation [16]. Lazy propagation basically tries to avoid unnecessary operations, thereby saving space and computing time. To achieve this, probabilistic potentials attached to nodes and separators in the join tree are stored without having been previously combined, and the combination operations are postponed as long as possible.

To deal with more complicated problems, approximate propagation algorithms have been developed. These provide inexact results, but with a lower resource requirement. An important group of approximate methods is based on Monte Carlo simulation [4-6, 19, 22]. In addition to simulation algorithms, there are deterministic procedures to perform approximate propagation. Different ideas have been investigated: replacing low probability values with zeroes, to reduce storage requirements [7]; reducing the complexity of the model by removing weak dependencies [9]; and enumerating the configurations of the variables in the model with the highest probability to obtain approximations of the marginal posterior distributions [18, 20]. A hybrid version of this approach was presented in [21].

A more recent contribution to the group of deterministic approximate methods is Penniless propagation [3]. This algorithm performs a Shenoy-Shafer message passing over a binary join tree [23], but both the messages and the potentials stored in the nodes of the join tree are represented using probability trees [19]. The use of probability trees allows large messages to be approximated by smaller ones, which enables this algorithm to run under limited resources or over very large networks. Probability trees provide a very general approach to approximate probability potentials: A probability tree can be approximated by another one of smaller size, by collapsing several of its branches into a single branch which contains the average of the values stored in the removed branches.

In this paper, we investigate the application of the ideas behind Lazy propagation to the Penniless propagation scheme. Probabilistic potentials attached to the messages and the nodes of the join tree are represented in a factorized way as a product of (approximate) probability trees, and the operations of combination are postponed until they are compulsory for the deletion of a variable.

We also implemented and experimentally tested two variations of the basic Lazy scheme, regarding the order in which combinations are carried out and the use of a cache. In particular, when probability trees are used to represent potentials, different orderings may result in operations of different complexity. It has been pointed out in [16] that this order can affect the performance of the algorithms. In this paper, we consider a heuristic method that, at each step, combines first the two potentials whose product has minimum size.

The cache is used because of the possible negative consequences of postponing combinations: On some occasions, the same combination of potentials may be repeated more than once. To avoid this problem, we propose that a cache memory (a hash table) of the computations be kept. When two potentials are to be combined, we first check to see if the same computation has been stored in the cache. If it has, there is no need to recompute it.

The paper is organized as follows: Section 2 contains the basis of Shenoy-Shafer propagation over join trees; Sections 3 and 4 briefly describe Lazy propagation, Penniless approximation, and computations with probability trees; Section 5 introduces the Lazy-Penniless propagation algorithm and proposes two variations of Lazy propagation: the use of the cache and the selection of potentials for combination; Section 6 analyzes the new algorithms by means of a series of experiments carried out over three large real-world Bayesian networks; and, finally, Section 7 presents the conclusions.

2. PROPAGATION OVER JOIN TREES

2.1. Notation

A Bayesian network is a directed acyclic graph where each node represents a random variable, and the topology of the graph encodes the independence relations between the variables, according to the d-separation criterion [17]. A probability distribution is associated with each node of the graph, conditioned on its parents, such that the joint distribution over all the variables in the network factorizes as the product of the conditional distributions.

Let $X = \{X_1, \ldots, X_n\}$ be the set of variables in the network, where each variable $X_i$ takes values on a finite set $U_i$ containing $|U_i|$ elements. Given a set of indices $I$, $X_I$ is the set of variables $\{X_i \mid i \in I\}$, defined on $U_I = \times_{i \in I} U_i$. $N = \{1, \ldots, n\}$ will denote the set of indices of all the variables in the network ($X_N = X$). Given $x \in U_I$ and $J \subseteq I$, $x_J$ denotes the element of $U_J$ obtained from $x$ by dropping the coordinates not in $J$.

A potential $\phi$ defined on $U_I$ is a mapping $\phi : U_I \to \mathbb{R}_0^+$, where $\mathbb{R}_0^+$ is the set of nonnegative real numbers. Probabilistic information (including a priori, conditional, and a posteriori distributions) will always be represented by means of potentials, as in [13].

By the size of a potential $\phi$, denoted as $\mathrm{size}(\phi)$, we mean the highest number of values necessary to completely specify it; that is, if $\phi$ is defined on $U_I$, its size is $|U_I| = \prod_{i \in I} |U_i|$.

If $\phi$ is a potential defined on $U_I$, $\mathrm{dom}(\phi)$ denotes the set of indices of the variables for which $\phi$ is defined [i.e., $\mathrm{dom}(\phi) = I$].

The marginal of a potential $\phi$ with $\mathrm{dom}(\phi) = I$ for a set of variables $X_J$, $J \subseteq I$, is denoted by $\phi^{\downarrow X_J}$ and is a function defined for variables $X_J$ as

$$\phi^{\downarrow X_J}(y) = \sum_{x : x_J = y} \phi(x), \quad \forall y \in U_J. \qquad (1)$$

The combination of two potentials $\phi$ and $\phi'$ is a new potential $\phi \cdot \phi'$ defined for variables $X_{\mathrm{dom}(\phi) \cup \mathrm{dom}(\phi')}$ and obtained by multiplication in the following way:

$$(\phi \cdot \phi')(x) = \phi(x_{\mathrm{dom}(\phi)}) \cdot \phi'(x_{\mathrm{dom}(\phi')}), \quad \forall x \in U_{\mathrm{dom}(\phi) \cup \mathrm{dom}(\phi')}. \qquad (2)$$
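As a concrete illustration of definitions (1) and (2), the following Python sketch stores a potential as a table indexed by configurations of its variables. It is a hypothetical, table-based implementation (the algorithms in this paper actually operate on probability trees), and the class name and interface are not taken from the authors' software.

```python
from itertools import product


class Potential:
    """Table-based potential: a mapping from configurations of a set of
    finite variables to nonnegative real numbers."""

    def __init__(self, variables, domains, values):
        self.variables = tuple(variables)   # e.g. ('X1', 'X2')
        self.domains = dict(domains)        # variable name -> list of states
        self.values = dict(values)          # tuple of states -> float

    def marginalize(self, target):
        """Marginal down to the variables in 'target' (Eq. 1): sum over all
        configurations that agree on the target coordinates."""
        keep = tuple(v for v in self.variables if v in target)
        out = {}
        for conf, val in self.values.items():
            key = tuple(c for v, c in zip(self.variables, conf) if v in keep)
            out[key] = out.get(key, 0.0) + val
        return Potential(keep, {v: self.domains[v] for v in keep}, out)

    def combine(self, other):
        """Pointwise product defined on the union of the domains (Eq. 2).
        Both operands are assumed to store a value for every configuration."""
        union = self.variables + tuple(v for v in other.variables
                                       if v not in self.variables)
        domains = {**self.domains, **other.domains}
        values = {}
        for conf in product(*(domains[v] for v in union)):
            assign = dict(zip(union, conf))
            a = self.values[tuple(assign[v] for v in self.variables)]
            b = other.values[tuple(assign[v] for v in other.variables)]
            values[conf] = a * b
        return Potential(union, domains, values)
```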

The conditional distribution of each variable $X_i$, $i = 1, \ldots, n$, given its parents $X_{\mathrm{pa}(i)}$ in the network is denoted by a potential $p_i(x_i \mid x_{\mathrm{pa}(i)})$, where $p_i$ is defined over $U_{\{i\} \cup \mathrm{pa}(i)}$.

The joint probability distribution for the $n$-dimensional random variable $X_N$ can then be expressed as

$$p(x) = \prod_{i \in N} p_i(x_i \mid x_{\mathrm{pa}(i)}), \quad \forall x \in U_N. \qquad (3)$$

An observation is the knowledge about the exact value $X_i = e_i$ of a variable. The set of observations will be denoted by $e$ and called the evidence set. $E$ will be the set of indices of the observed variables.

The goal of probability propagation is to calculate the a posteriori probability function $p(x_k \mid e) = p(x_k, e)/p(e)$, for every $x_k \in U_k$, where $k \in \{1, \ldots, n\} \setminus E$. Let $H = \{p_i(x_i \mid x_{\mathrm{pa}(i)}) \mid i = 1, \ldots, n\}$ be the initial set of potentials. Observations are incorporated by transforming each potential $\phi \in H$ into a potential $\phi_e$ defined on $\mathrm{dom}(\phi) \setminus E$ as $\phi_e(x) = \phi(y)$, where $y_{\mathrm{dom}(\phi) \setminus E} = x$ and $y_i = e_i$ for all $i \in E$.
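The incorporation of evidence just described can be sketched, under the same hypothetical table representation used above, as follows (the function name is illustrative, not one from the paper).

```python
def restrict_to_evidence(phi, evidence):
    """Build the potential phi_e, defined on the unobserved variables of phi:
    keep only the configurations consistent with the observations and drop
    the observed coordinates.  'evidence' maps observed variable names to
    their observed states."""
    keep = tuple(v for v in phi.variables if v not in evidence)
    values = {}
    for conf, val in phi.values.items():
        assign = dict(zip(phi.variables, conf))
        if all(assign[v] == state for v, state in evidence.items()
               if v in assign):
            values[tuple(assign[v] for v in keep)] = val
    return Potential(keep, {v: phi.domains[v] for v in keep}, values)
```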

Let us denote the marginal for a variable $X_k$, $p(x_k, e)$ ($x_k \in U_k$), by $\phi_{X_k}^m(x_k)$, where the superscript $m$ indicates marginal a posteriori. After the observations have been incorporated into $H$, we then have

$$\phi_{X_k}^m = \Big( \prod_{\phi \in H} \phi \Big)^{\downarrow X_k}. \qquad (4)$$

The conditional distribution for any variable $X_k$ can be computed from $\phi_{X_k}^m$ by normalizing it in such a way that the sum of its values is equal to 1.

The computation of $\phi_{X_k}^m$ is usually organized in a join tree. A join tree is a tree where each node $V$ is a subset of $X_N$ and such that, if a variable is in two distinct nodes, $V_1$ and $V_2$, then it is also in every node on the path between $V_1$ and $V_2$. A join tree is called binary if every node has no more than three neighbors.

Every potential $\phi \in H$ is assigned to a node $V_j$ such that $X_{\mathrm{dom}(\phi)} \subseteq V_j$. A potential constantly equal to 1 (unity potential) is assigned to nodes which did not receive any potential from $H$. In this way, attached to every node $V_i$ there is a potential $\phi_{V_i}$ defined over the set of variables $V_i$ and equal to the product of all the potentials assigned to it (perhaps represented in a factorized way). The algorithms proposed in this paper operate over binary join trees. For details of the construction of binary join trees, see [3, 23].

2.2. The Shenoy–Shafer Propagation Algorithm

The Shenoy-Shafer propagation algorithm is performed on a join tree. In this scheme, two mailboxes are placed on each edge of the join tree. Given an edge connecting nodes $V_i$ and $V_j$, one mailbox is for messages from $V_i$ to $V_j$ and the other is for messages from $V_j$ to $V_i$. The messages allocated in both mailboxes are probability potentials defined on $V_i \cap V_j$. Details about this propagation procedure can be found in [24]. In this section, we give the necessary notation to describe our algorithms.

Let us assume that we have an evidence set $e$. The message from $V_i$ to $V_j$ is computed as

$$\phi_{V_i \to V_j} = \Big( \phi_{V_i} \cdot \prod_{V_k \in \mathrm{ne}(V_i) \setminus \{V_j\}} \phi_{V_k \to V_i} \Big)^{\downarrow V_i \cap V_j}, \qquad (5)$$

where $\phi_{V_i}$ is the initial probability potential on $V_i$ reduced to the observations in $e$, $\phi_{V_k \to V_i}$ are the messages in the mailboxes $V_k$-outgoing and $V_i$-incoming, and $\mathrm{ne}(V_i)$ are the neighbors of $V_i$ in the join tree.

The propagation can be organized into two stages: In the first stage, messages are sent from the leaves to a previously selected root node (upward propagation), and in the second, messages are sent from the root to the leaves (downward propagation). When the message passing ends, the marginal for a variable $X_k$ can be obtained by determining a node $V_i$ containing $X_k$ and computing

$$\phi_{V_i}^m = \phi_{V_i} \cdot \prod_{V_k \in \mathrm{ne}(V_i)} \phi_{V_k \to V_i}. \qquad (6)$$

The conditional distribution for any variable $X_k$ given the evidence set $e$ can be calculated by marginalizing $\phi_{V_i}^m$ down to $X_k$ (obtaining $\phi_{X_k}^m$) and normalizing the result.
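Expressions (5) and (6) translate directly into code. The sketch below is hypothetical: it assumes a join_tree object offering neighbors(i) and node_vars(i), a dictionary phi_node of node potentials already restricted to the evidence, and a dictionary messages keyed by ordered pairs of nodes; none of these names come from the paper.

```python
def message(join_tree, phi_node, messages, i, j):
    """Shenoy-Shafer message from node i to node j (Eq. 5): combine the node
    potential with every incoming message except the one coming from j, then
    marginalize down to the separator, i.e. the variables shared by i and j."""
    result = phi_node[i]
    for k in join_tree.neighbors(i):
        if k != j:
            result = result.combine(messages[(k, i)])
    separator = [v for v in join_tree.node_vars(i)
                 if v in join_tree.node_vars(j)]
    return result.marginalize(separator)


def node_marginal(join_tree, phi_node, messages, i):
    """Potential at node i after propagation (Eq. 6): the node potential
    combined with all incoming messages; marginalizing it down to X_k and
    normalizing gives the conditional distribution of X_k given the evidence."""
    result = phi_node[i]
    for k in join_tree.neighbors(i):
        result = result.combine(messages[(k, i)])
    return result
```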

3. LAZY PROPAGATION

Lazy propagation [16] is an exact algorithm that operates by passing messages over a join tree. A feature of Lazy propagation is that the multiplication of potentials is postponed until it is strictly necessary for the deletion of a variable. In the Shenoy-Shafer algorithm, all the conditional distributions assigned to a node are combined and stored as a joint distribution over the variables in the node, forming the initial potential over that node. In the Lazy propagation architecture, however, distributions are not multiplied to form the initial potentials. Instead, a decomposition of the joint potential at each node is kept, and the factors are combined only when variables are eliminated while performing marginalizations.

Another difference between Lazy propagation and other join tree algorithms is the handling of evidence and the exploitation of barren variables when computing messages between two nodes in the join tree. However, these two features can be somehow incorporated in algorithms based on the Shenoy-Shafer architecture, if the evidence is taken into account during the construction of the join tree, as described in Section 4.

The Lazy propagation algorithm can be expressed in terms of operations on sets of potentials. The combination and marginalization operations are as follows:

• Combination. If $\Phi$ and $\Psi$ are two sets of potentials, their combination is the union of both sets, $\Phi \cup \Psi$.

• Marginalization. If $\Phi$ is a set of potentials over variables in $X_I$ and $J \subseteq I$, then $\Phi^{\downarrow X_J}$ is obtained by deleting in $\Phi$ every variable in $X_{I \setminus J}$. The deletion of a variable $X_k \in X_I$ is carried out as follows (a sketch of this deletion step is given below):

  - Let $\Phi_k = \{\phi \in \Phi \mid k \in \mathrm{dom}(\phi)\}$.
  - $\phi_k = \prod_{\phi \in \Phi_k} \phi$.
  - $\Phi \leftarrow (\Phi \setminus \Phi_k) \cup \{\phi_k^{\downarrow X_{\mathrm{dom}(\phi_k) \setminus \{k\}}}\}$.
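A minimal sketch of the variable deletion step above, assuming the same hypothetical Potential interface used earlier (only the factors mentioning the variable are combined; the remaining factors stay factorized):

```python
def delete_variable(potentials, var):
    """Delete 'var' from a set of potentials: combine only the potentials
    whose domain contains 'var', marginalize 'var' out of the product, and
    return the rest of the set unchanged together with the new factor."""
    relevant = [phi for phi in potentials if var in phi.variables]
    rest = [phi for phi in potentials if var not in phi.variables]
    if not relevant:
        return rest
    combined = relevant[0]
    for phi in relevant[1:]:
        combined = combined.combine(phi)
    reduced = combined.marginalize([v for v in combined.variables if v != var])
    return rest + [reduced]
```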

The message passing scheme in Lazy propagation can be performed as in the Shenoy-Shafer algorithm, but changing every potential $\phi$ for a set of potentials $\Phi$ and using the new operations defined above.

After applying (6) with each potential $\phi$ replaced by a set of potentials $\Phi$, $\Phi_{V_i}^m$ will contain a set of potentials defined for variables in $V_i$. The marginal for a variable $X_k \in V_i$ is obtained by deleting the variables in $V_i \setminus \{X_k\}$ and then computing the product of the resulting potentials. The conditional distribution for $X_k$ will be proportional to this product and can be obtained by normalization.



4. PENNILESS PROPAGATION

In situations where the problem is too complex or the resources are limited (storage capacity and CPU speed), exact algorithms may be of no practical value. Penniless propagation [3] is a deterministic approximate propagation algorithm based on Shenoy-Shafer's method, but which is able to provide (approximate) results under limited resources. To achieve this, messages are represented using probability trees [1]. This tool provides an approximate representation of potentials within a given maximum number of values [2, 10, 11, 19].

Penniless propagation is carried out on binary join trees, as the Shenoy-Shafer propagation scheme is more efficient [23] in this structure. For details of the construction of a binary join tree, see [3]. One important feature of the Penniless procedure [3] is that it considers a variation of the minimum-size criterion, consisting of first removing the unobserved variables whose descendants are unobserved as well, starting from the leaves upward. This has an important consequence: When we send messages upward in the join tree, many of them will be identically equal to 1, and it will be possible to efficiently approximate and represent them using probability trees; more precisely, with a probability tree containing just a single value: the unity. This happens when a variable $X$ is deleted from a potential corresponding to a conditional distribution for $X$; then, the result of deleting $X$ is a unity potential defined on the parents of $X$ in the network. This criterion was used in importance sampling propagation algorithms in [19], producing very good results. In the experiments which we have carried out with Penniless propagation, however, we observed that, in some cases, the simplicity of upward propagation does not compensate for the increase in complexity of downward propagation. We have therefore also considered the classical minimum-size criterion. Another drawback is that a new join tree is constructed when new evidence comes to the system, so this method should only be used if the saving in propagation time is greater than the time spent retriangulating and constructing a new join tree. These issues are addressed in the experiments reported in Section 6.

The messages sent during propagation are approximated to reduce the storage requirements. The criteria for approximating probability trees are described in Subsection 4.1.

Another difference with respect to Shenoy-Shafer's algorithm is the number of stages in which the propagation is carried out. In the exact algorithm, there are two stages: In the first, messages are sent from the leaves to the root, and in the second, in the opposite direction. Penniless propagation can perform more than two stages, in which messages are gradually improved by taking into account the information flowing across the join tree.

In this paper, we will use a simplified version of the Penniless algorithm, in which only two stages are considered. Adopting this simplified version enables the ideas of the Lazy propagation algorithm to be applied and its benefits evaluated.

In addition, we have considered the following change regarding the algorithm in [3]: Here, in the first stage (upward), we only approximate after marginalization, and in the second stage (downward), we approximate after each combination and marginalization. The reason for this is that, under our modification of the minimum-size criterion for triangulation, we always delete the leaves corresponding to unobserved variables first. This also happens very often with the minimum-size criterion, in which a leaf can always be deleted before its parents. As a result, with no observations, many messages going upward in the join tree are identically equal to 1 and can be stored as a tree with a single value. If we approximated combinations, we might lose this fact and actually use bigger trees. We will refer to this simplified version of the Penniless algorithm as Simple Penniless Propagation, denoted as the SP algorithm.
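The approximation policy of the SP algorithm can be summarized in a short sketch that parallels the earlier message routine; the prune argument stands for the tree-approximation procedure of Subsection 4.1, and all interfaces are assumptions rather than the authors' code.

```python
def sp_message(join_tree, phi_node, messages, i, j, downward, prune):
    """Simple Penniless message from i to j: as in Eq. (5), but the result of
    every marginalization is pruned and, in the downward stage only, the
    result of every combination is pruned as well."""
    result = phi_node[i]
    for k in join_tree.neighbors(i):
        if k != j:
            result = result.combine(messages[(k, i)])
            if downward:
                result = prune(result)   # approximate combinations downward only
    separator = [v for v in join_tree.node_vars(i)
                 if v in join_tree.node_vars(j)]
    # Always approximate after marginalization.
    return prune(result.marginalize(separator))
```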

4.1. Probability Trees

A probability tree [1, 2, 19] is a directed labeled tree, where each internal node represents a variable and each leaf node represents a probability value. Each internal node has one outgoing arc for each state of the variable associated with that node. Each leaf contains a nonnegative real number. The size of a tree $\mathcal{T}$, denoted as $\mathrm{size}(\mathcal{T})$, is defined as the number of leaves of $\mathcal{T}$.

A probability tree $\mathcal{T}$ on variables $X_I$ represents a potential $\phi : U_I \to \mathbb{R}_0^+$ if, for each $x_I \in U_I$, the value $\phi(x_I)$ is the number stored in the leaf node that is reached by starting from the root node and selecting, for each internal node labeled with $X_i$, the child corresponding to coordinate $x_i$. The potential represented by tree $\mathcal{T}$ is denoted by $\phi_{\mathcal{T}}$.
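A probability tree can be sketched as a small recursive structure; the following class is illustrative only (names and interface are assumptions) and shows how a potential value is read off the tree by following one branch per variable.

```python
class ProbTree:
    """Probability tree: an internal node holds a variable name and one child
    per state of that variable; a leaf holds a nonnegative number."""

    def __init__(self, variable=None, children=None, value=None):
        self.variable = variable          # None for a leaf
        self.children = children or {}    # state -> ProbTree
        self.value = value                # number stored at a leaf

    def is_leaf(self):
        return self.variable is None

    def lookup(self, assignment):
        """Value of the represented potential at a configuration: at each
        internal node, follow the branch of the state of that node's variable."""
        node = self
        while not node.is_leaf():
            node = node.children[assignment[node.variable]]
        return node.value

    def size(self):
        """Number of leaves of the tree."""
        if self.is_leaf():
            return 1
        return sum(child.size() for child in self.children.values())
```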

Example 1. Figure 1 displays a potential $\phi$ and its representation using a probability tree. The tree contains the same information as does the table, but using five values instead of eight.

FIG. 1. A potential $\phi$ and a probability tree representing it.

Three basic operations are necessary to use probability trees in propagation algorithms: restriction, combination, and marginalization; also, in the case of the Penniless algorithm, a way of approximating trees must be specified. These operations can be carried out directly on the probability tree representation (see [3, 19] for details). In what follows, we briefly describe the approximation operation.

Let $\mathcal{T}_1$ be a tree representation of a potential $\phi$. We will consider how to approximate $\phi$ with a tree $\mathcal{T}$ which is smaller than $\mathcal{T}_1$. One way of obtaining $\mathcal{T}$ is to prune $\mathcal{T}_1$. When the tree is pruned, a node is selected such that all its children are leaves, and the selected node and its children are replaced by a single value equal to the average of the values of the leaf nodes being removed (this minimizes the Kullback-Leibler divergence [12] between the original tree and the tree resulting from replacing the leaves pruned away with a single number [2, 19]).

The main issue is how to select the nodes to prune. Let us use $(X_J = x_J, X_k)$ to denote a path from the root node to a node $X_k$ whose children are leaves (numbers), and $\mathrm{sum}(X_J = x_J, X_k)$ to denote the sum of the children of $X_k$. We will use two criteria:

1. Consider a threshold $\varepsilon \geq 0$ and then approximate the children of $X_k$ by their average if the Kullback-Leibler divergence between the original tree and the approximate one is less than $\varepsilon$.

2. Consider a value $\delta \geq 0$ and then prune node $X_k$ if $\mathrm{sum}(X_J = x_J, X_k) \leq \delta \cdot S$, where $S = \sum_{x \in U_{\mathrm{dom}(\phi)}} \phi(x)$; that is, we prune every node $X_k$ such that the subtree rooted at it contains a proportion of the entire probability mass less than $\delta$.

The approximation steps are performed recursively, starting from the nodes whose children are leaves and going back to the root node. In this way, if all the children of an internal node are leaves or have previously been pruned to a number, then this node is again considered for approximation.

Method 1 has already been considered in [2, 19], and method 2 is introduced in this paper. We will use a combination of these. Method 1 is based on information theoretical concepts, and method 2 is simpler and provides a way of establishing an upper bound on the size of a tree after pruning (at most, there are $1/\delta$ leaves with a proportion of the mass greater than $\delta$).
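The decision of whether to collapse a set of sibling leaves can be sketched as below. The function and the way the two criteria are combined (prune when either one allows it) are assumptions for illustration; in particular, the Kullback-Leibler term is a local contribution computed under the simplifying assumption that each leaf covers a single configuration.

```python
import math


def prune_decision(leaf_values, total_mass, epsilon, delta):
    """Decide whether to replace a set of sibling leaf values by their average.
    Criterion 2: the leaves carry at most a fraction 'delta' of the total
    probability mass 'total_mass'.  Criterion 1: the (local) Kullback-Leibler
    divergence caused by averaging the values is below 'epsilon'."""
    s = sum(leaf_values)
    avg = s / len(leaf_values)
    if s <= delta * total_mass:          # criterion 2: negligible mass
        return True
    kl = sum(p * math.log(p / avg)       # criterion 1: small information loss
             for p in leaf_values if p > 0) / total_mass
    return kl < epsilon
```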

In [2, 19], we considered other approximation procedures, allowing the tree to be rearranged by changing the order of the internal nodes or limiting the size of the representation to a given threshold. Usually, however, the gains in accuracy obtained with these more complex methods did not compensate for the increase in computation time.

5. COMBINING LAZY AND PENNILESS PROPAGATION

In this section, we show how the Lazy propagation method can be combined with the Penniless algorithm. The starting point is the Simple Penniless algorithm, but potentials in the join tree will now be represented in a factorized form, as a product of a set of probability trees. We will call this new method Lazy-Penniless propagation.

The Lazy-Penniless algorithm depends on two parameters, $\varepsilon$ and $\delta$ (explained in Subsection 4.1), which determine to which extent the trees will be pruned.

The algorithm is performed in two message passing stages as in the Shenoy-Shafer architecture, but with the following features:

● Instead of a single potential, each node in the join tree contains a set of potentials, and the combination and marginalization are carried out as in Section 3.

● Each potential is represented by a probability tree, and the multiplication and marginalization operations are performed directly on the tree representation.

● Once a new tree has been computed as the result of a marginalization, it is then pruned according to the procedures in Subsection 4.1. The result of a combination is only pruned in the second stage (downward phase).

The tree representation is especially appropriate for Lazy propagation. It can take advantage of situations of causal independence without the need for any further manipulation of the algorithms as in [15]. This fact is illustrated in the following example:

Example 2. Let us assume that we have three variables $X_1$, $X_2$, and $X_3$, parents of variable $Y$, and that the conditional distribution of $Y$ given the other three variables has some asymmetrical independence relations which allow the tree on the left side of Figure 2 to be represented compactly. Let us now assume that in the same node of the join tree we have a priori probability distributions for $X_1$, $X_2$, and $X_3$. If we multiply all these a priori probability distributions by the tree representing the conditional distribution of $Y$, we obtain a full probability tree with the same number of leaves as values in a table representation. However, if we do not multiply all the potentials, and we combine them only when it is necessary to delete a variable, we can take greater advantage of the tree representation. Imagine that we want to delete variable $X_2$: We multiply the conditional probability tree by the marginal distribution of this variable (represented as the central tree in Fig. 2), and then $X_2$ is deleted by marginalization. The result is the tree on the right side of Figure 2. This process is repeated with the deletion of successive variables. It should be noted that this tree continues to represent asymmetrical conditional independence relations, and we have used only trees with fewer leaves than the product of the numbers of states of the variables involved. The improvements would be much greater if the number of parent variables were higher.

FIG. 2. Result of combining two trees, deleting $X_2$ afterward.

In the following two subsections, we describe some variations on the basic scheme described above.

5.1. The Order of Combinations

To delete variable $X_k$ from a set $\Phi$ according to the procedure in Section 3, we must first consider the set of potentials $\Phi_k$ and then combine all of its potentials. This is done by taking two potentials from $\Phi_k$ and replacing them in $\Phi_k$ with the result of their multiplication. The order in which probability trees are taken from $\Phi_k$ can be relevant. For instance, let us assume that we have three potentials in $\Phi_k$ defined for variables $(X_1, X_2)$, $(X_1, X_3)$, and $(X_1, X_2, X_3, X_5, X_7)$, respectively. It is clear that for full trees (the situation can be different if the tree representing the third potential is very small compared to the first and the second) it is more convenient to combine the first with the second and the result with the third. Any other order would imply more multiplications. In [14], this problem is considered from a more general point of view: If we have a set of potentials and we want to marginalize down to one variable, what is the optimal ordering of combinations and marginalizations? The authors carry out a theoretical study and propose some heuristics to select potentials based on the minimum size of the resulting potential. Here, we will concentrate on the problem of combinations and consider the features of the representation using probability trees to define a heuristic procedure to determine the next two potentials from $\Phi_k$ to be combined:

● Select two trees $\mathcal{T}_i$ and $\mathcal{T}_j$ from $\Phi_k$ such that, if $\mathcal{T}_i$ represents a potential on $U_I$ and $\mathcal{T}_j$ a potential on $U_J$, then $\min\{|U_{I \cup J}|,\ \mathrm{size}(\mathcal{T}_i) \cdot \mathrm{size}(\mathcal{T}_j)\}$ reaches its minimum, that is, the size of the frame in which their combination is defined is minimum. We will denote the Lazy-Penniless algorithm that uses this heuristic as SLP(prod_vars) (Simple Lazy-Penniless with minimum product size). The value used to select the trees to combine is an upper bound on the size of the product tree. The optimal criterion would be to use the true size of the product tree, but this would require actually computing the product, and the heuristic method would then be too slow. A sketch of this selection rule is given below.
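The pair-selection heuristic of SLP(prod_vars) can be sketched as follows. The names are illustrative, and each tree is assumed to expose the index set of its potential (variables) and its number of leaves (size()), as in the tree sketch of Subsection 4.1.

```python
from itertools import combinations


def select_pair_to_combine(trees, domain_sizes):
    """Choose the two probability trees whose combination has the smallest
    upper bound min{|U_{I union J}|, size(T_i) * size(T_j)}.
    'domain_sizes' maps each variable to its number of states."""
    def bound(ti, tj):
        frame = 1
        for v in set(ti.variables) | set(tj.variables):
            frame *= domain_sizes[v]
        return min(frame, ti.size() * tj.size())

    return min(combinations(trees, 2), key=lambda pair: bound(*pair))
```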

The name given to the basic version of the algorithm is Simple Lazy-Penniless, in which no special order in the combination of potentials is considered: They are taken as they are stored in the data structure where they are contained. This algorithm will be denoted as SLP.

5.2. The Use of Cache Memory to Avoid Redundant Computations

The efficiency of Lazy evaluation can decrease due to the unnecessary repetition of some computations. When two potentials $\phi$ and $\phi'$ are contained in the same message $\Phi$, these two potentials must be combined in every path that follows from the node that receives the message. In most cases, this is not a problem, because the potentials are combined in smaller frames and with simplified versions of the potentials (after combining and deleting some of their variables by marginalization), but we can find situations in which exactly the same combination is repeated in different parts of the propagation algorithm. As a simple example, let us consider Figure 3. On the left side, we have a directed acyclic graph and, on the right side, a valid join tree with sets of potentials assigned to each of its nodes. The root node contains the set of potentials $\{p(x_1), p(x_2), p(x_3 \mid x_1, x_2)\}$. It is necessary to combine these three potentials when sending the messages from $V_1$ to $V_2$ and from $V_1$ to $V_3$. In this case, it is clear that there is no advantage in postponing combinations and that doing so produces a loss in efficiency.

FIG. 3. An example in which Lazy repeats computations.

The use of a cache memory to avoid this kind of repetition has already been highlighted in [16], but no specific implementation of this idea was provided. A first approach might be to consider a hash table indexed by the combined potentials. In this way, the result of combining two potentials is saved in the cache, and the next time we need to perform the same combination, the result is taken from the cache. This procedure can, however, use up too much computer memory: Not all the operations saved in the cache will be useful later. In preliminary experiments, we also found that there was no improvement in efficiency in terms of computing time. Our explanation for this is that the extra time spent in memory management in the Java run-time environment that we used, Java 2 v1.3, was not compensated for by the savings in computations.

To avoid this problem, we propose a procedure whereby only the results of operations which are going to be used at a later stage are included in the cache. This requires propagating twice in the join tree. In the first propagation, we annotate as useful those operations (combination and marginalization) that are used more than once, but without actually carrying them out (we only store empty trees). In the second propagation, we do the standard calculations, but when an operation was annotated as useful in the first propagation and has not yet been calculated, we save it into the cache. Afterward, if we need to repeat the operation, we take it from the cache, avoiding the need to recalculate it. Merging this procedure with SLP constitutes the SLP(cache) algorithm.
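The two-pass caching scheme can be sketched as follows. This is a simplified, hypothetical illustration: operations are keyed by the identity of their operands, which assumes that the same potential objects are seen in both passes; the paper does not prescribe a particular keying scheme.

```python
class OperationCache:
    """Two-pass cache for repeated combinations (SLP(cache) sketch).
    First pass: combinations are only announced and counted.  Second pass:
    the results of combinations requested more than once are stored in a
    hash table and reused instead of being recomputed."""

    def __init__(self):
        self.counts = {}    # operation key -> number of requests (pass 1)
        self.results = {}   # operation key -> cached result (pass 2)

    @staticmethod
    def _key(phi1, phi2):
        # Order-independent key built from the operands' identities (assumption).
        return frozenset((id(phi1), id(phi2)))

    def announce(self, phi1, phi2):
        """First (dry) propagation: record that this combination is needed."""
        k = self._key(phi1, phi2)
        self.counts[k] = self.counts.get(k, 0) + 1

    def combine(self, phi1, phi2):
        """Second propagation: reuse the result if the combination was
        annotated as useful (requested more than once) in the first pass."""
        k = self._key(phi1, phi2)
        if k in self.results:
            return self.results[k]
        result = phi1.combine(phi2)
        if self.counts.get(k, 0) > 1:
            self.results[k] = result
        return result
```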

6. EXPERIMENTAL RESULTS

We conducted experiments with the algorithms SP, SLP, SLP(cache), and SLP(prod_vars) using three different networks, munin1, munin2, and water, corresponding to highly complex, real-world problems: They contain 189, 1003, and 32 variables, respectively. The three networks were borrowed from the Decision Support Systems Group at Aalborg University (http://www.cs.auc.dk/research/DSS/misc).

We considered the case of propagation with and without observed variables. The observed variables were selected at random (five for water, eight for munin1, and fifteen for munin2). In the case of considering observed variables, we always reduced the initial potentials by restriction to the observations.

Three different triangulations were used in the experiments:

t0: This is the minimum-size criterion, but taking into account the observations: Before triangulating, the potentials are reduced by restriction to the observations. In this case, observed variables are not included in the join tree.

t1: This is the modified minimum-size criterion, in which unobserved leaves are deleted first. The triangulation depends on the observations and, as in t0, potentials are reduced before triangulating. As in the previous case, observed variables are not included in the join tree.

t2: This is the minimum-size criterion, but without taking into account the observed variables. The triangulation does not depend on the observations, and the size of the complete potentials before the restriction to the observed cases is considered in the triangulation; that is, observed variables are included in the join tree, so that the same join tree can be used when new evidence comes to the system.

The compilation times for each of the triangulations are given in Table 1.

TABLE 1. Compilation times in seconds (including triangulation and construction of the join tree) for the networks used in the experiments with observed variables, with triangulation methods t0, t1, and t2.

Method    water    munin1    munin2
t0         0.51      1.19     12.57
t1         0.50      1.00     10.23
t2         0.51      1.09     12.83

TABLE 2. Detailed results of the experiments with observed variables for the network water.

          SP                                            SLP
δ         t0            t1            t2               t0            t1            t2
0.005     1.62–0.0198   3.12–0.0121   2.77–0.0449      1.22–0.0068   1.88–0.0128   1.87–0.0101
0.001     2.51–0.0022   5.02–0.0010   6.61–0.0052      1.85–0.0016   5.07–0.0022   4.72–0.0152
0.0005    3.1–0.0015    6.17–5.28E-4  8.67–0.0011      1.96–5.48E-4  5.68–0.0016   5.78–0.0042
0.0001    4.07–4.99E-4  9.35–1.57E-4  13.17–5.05E-4    2.32–3.55E-4  7.52–5.57E-4  12.80–0.0015

          SLP(cache)                                    SLP(prod_vars)
δ         t0            t1            t2               t0            t1            t2
0.005     1.27–0.0066   2.10–0.0126   1.71–0.0092      1.34–0.0175   1.42–0.0083   2.09–0.0332
0.001     1.75–0.0015   4.88–0.0021   4.69–0.0138      1.93–0.0020   2.73–0.0021   3.64–0.0072
0.0005    1.95–4.57E-4  5.41–0.0016   5.81–0.0027      2.11–0.0014   4.74–0.0016   4.42–0.0048
0.0001    2.25–2.77E-4  7.41–4.88E-4  11.49–0.0015     2.58–7.73E-4  9.58–0.0022   7.16–6.30E-4

Each cell contains the propagation time (in seconds) and the error, measured as the Kullback-Leibler divergence between the exact and estimated probability values.

Trees resulting from marginalizations are always pruned, and trees resulting from combinations are pruned only in the downward phase (see Section 4). The parameters that control the pruning ($\varepsilon$ and $\delta$) were chosen as follows: $\varepsilon$ was set to 0.001, except in the trials with network munin1, for which $\varepsilon$ was set to 0.01 to force a more severe pruning due to its large size; with respect to $\delta$, we carried out trials with four values (0.005, 0.001, 0.0005, 0.0001). This selection is aimed at testing the ability of $\delta$ to control the pruning. The way in which $\varepsilon$ can be used with the same purpose was reported in the experiments in [3, 19].

The results obtained were compared with the exact results using the Kullback-Leibler divergence from the estimated distribution on each variable to the exact one. The global divergence (displayed as the K-L divergence in the tables and figures) is measured as the average of the divergences on each of the unobserved variables in the network.
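The global error measure can be written down directly; the following sketch assumes both arguments map each unobserved variable to its normalized marginal (as a list of probabilities in the same state order) and that the exact values are strictly positive wherever the estimated ones are.

```python
import math


def global_kl_divergence(estimated, exact):
    """Average, over the unobserved variables, of the Kullback-Leibler
    divergence from the estimated marginal of each variable to the exact one."""
    divergences = []
    for var, q in estimated.items():
        p = exact[var]
        divergences.append(sum(qi * math.log(qi / pi)
                               for qi, pi in zip(q, p) if qi > 0))
    return sum(divergences) / len(divergences)
```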

For each trial, we measured the computing time and the K-L divergence. The algorithms were implemented in Java 2 version 1.3. Trials were run on an AMD K7 (800 MHz) computer with 768 MB of RAM and the Linux RedHat operating system with kernel 2.2.16.22.

The results of the experiments are displayed in Tables 2-7, where each pair of numbers indicates the propagation time and the K-L divergence.

TABLE 3. Detailed results of the experiments without observed variables for the network water.

          SP                           SLP
δ         t0/t2          t1            t0/t2          t1
0.005     3.39–0.2145    2.15–0.0888   2.96–0.1390    4.21–0.1472
0.001     10.36–0.0851   2.25–0.0087   5.20–0.0116    8.36–0.1141
0.0005    11.56–0.0259   2.72–0.0050   6.12–0.0074    10.48–0.0592
0.0001    15.68–0.0037   3.69–0.0027   8.16–0.0025    14.05–0.0127

          SLP(cache)                   SLP(prod_vars)
δ         t0/t2          t1            t0/t2          t1
0.005     3.19–0.1389    4.50–0.1472   4.04–0.2092    1.39–0.0384
0.001     5.35–0.0116    8.80–0.01141  7.96–0.0848    2.28–0.0080
0.0005    6.28–0.0074    10.94–0.0592  8.83–0.0259    2.74–0.0044
0.0001    8.06–0.0025    14.41–0.0127  10.38–0.0040   3.59–0.0028

Each cell contains the propagation time (in seconds) and the error, measured as the Kullback-Leibler divergence between the exact and estimated probability values.

TABLE 4. Detailed results of the experiments with observed variables for the network munin1.

          SP                                             SLP
δ         t0             t1              t2              t0             t1             t2
0.005     11.21–0.1507   20.72–0.1229    16.24–0.1426    10.28–0.3721   13.74–0.3184   13.35–0.3888
0.001     33.25–0.0855   61.41–0.0428    36.31–0.0833    31.49–0.2546   30.60–0.1723   23.19–0.2237
0.0005    38.37–0.0786   86.16–0.0231    44.91–0.0784    33.55–0.2100   40.57–0.1508   30.92–0.2022
0.0001    52.99–0.0742   206.73–0.0070   72.29–0.0738    51.27–0.1567   63.04–0.0851   58.76–0.1392

          SLP(cache)                                     SLP(prod_vars)
δ         t0             t1              t2              t0             t1              t2
0.005     11.01–0.4125   14.42–0.3099    13.52–0.4191    7.90–0.1726    38.22–0.1481    13.29–0.1426
0.001     26.48–0.2872   30.87–0.1517    23.44–0.2350    19.45–0.0833   77.33–0.0484    22.32–0.0803
0.0005    31.51–0.2345   38.64–0.1231    30.66–0.2100    22.92–0.0765   121.32–0.0284   29.51–0.0731
0.0001    51.15–0.1513   56.90–0.0777    60.79–0.1270    32.53–0.0739   244.303–0.0120  45.69–0.0692

Each cell contains the propagation time (in seconds) and the error, measured as the Kullback-Leibler divergence between the exact and estimated probability values.

TABLE 5. Detailed results of the experiments without observed variables for the network munin1.

          SP                               SLP
δ         t0/t2           t1               t0/t2          t1
0.005     20.73–0.1651    39.94–0.1706     11.40–0.3315   60.82–0.2864
0.001     55.15–0.1260    140.46–0.0712    16.23–0.2375   160.76–0.2325
0.0005    73.16–0.1138    295.27–0.0418    18.28–0.2103   188.62–0.1883
0.0001    123.32–0.0913   1084.96–0.0135   – –            571.71–0.1652

          SLP(cache)                       SLP(prod_vars)
δ         t0/t2           t1               t0/t2          t1
0.005     11.89–0.3564    49.39–0.2984     17.69–0.1766   66.90–0.1622
0.001     19.00–0.2666    125.48–0.2191    34.75–0.1245   216.81–0.0800
0.0005    22.24–0.2243    153.26–0.1796    50.47–0.1062   461.06–0.0533
0.0001    – –             390.77–0.1725    78.30–0.1096   1268.37–0.0274

Each cell contains the propagation time (in seconds) and the error, measured as the Kullback-Leibler divergence between the exact and estimated probability values. Cells filled with – – indicate that the system ran out of memory.



We also measured the computing times obtained by exact Lazy propagation (i.e., SLP with $\varepsilon = \delta = 0$) for the network water; for munin1 and munin2, the system ran out of memory. These times (in seconds) are 60.22 without observations and 20.64 with observations. Note that the times reported in this paper correspond to programs in Java (interpreted), and thus are higher than the times that could be obtained using compiled languages such as C.

6.1. Discussion of Results

As a first conclusion, we can say that the value of $\delta$ can control the level of approximation to the exact results, allowing us to adapt it to the available time. We can see in the tables that decreasing the value of $\delta$ decreases the error, but it normally increases the cost in terms of computing time. A typical situation is given in Figure 5, in which the data for munin1 with triangulation t0 and observations are represented graphically.

As a second conclusion, we consider that the approximate algorithms usually obtain very good approximations to the exact results in a very short time. In the water network, we obtain very accurate results in 1-2 seconds, rather than after 20 seconds, which is the time required by the exact lazy algorithm. In our implementation, the munin1 and munin2 networks ran out of memory when exact lazy propagation was applied. In both networks, good approximations were obtained in the range of 10-20 seconds. By investing more time, we could obtain even better approximations.

In general, triangulation t1, which was proposed in [19], produces bad results in comparison with triangulations t0 and t2. This triangulation was suitable for Monte Carlo importance sampling algorithms, but in our experiments, it is always slower for the same degree of approximation. The quality of the approximations is sometimes worse and sometimes better but, in general, t0 or t2 produces better results than t1 for the same time. One of the cases in which t1 is more competitive is in munin2 with the SLP(prod_vars) algorithm (see Fig. 4 for the results with observations); what actually happens is that we cannot compare the results, as times and approximations are never of a similar magnitude.

In the comparison of t0 and t2, in general, the errors are similar and the times are lower using t0. However, we should take into account that t0 needs recompilation for each set of observations, while for t2 we can use the same join tree in each propagation. Taking the times in Table 1 into account, we could conclude that in the networks water and munin1 triangulation t0 is better, whereas in munin2, if we add the compilation times to the t0 results, then t2 is clearly better. We could say that the convenience of taking observations into account to reduce the join tree depends on the time necessary to build such a join tree. Nevertheless, it is important to point out that these algorithms always take advantage of observations by restricting all the potentials to the observed values. The only difference is whether the join tree is optimized for these reduced potentials or for the initial ones. In the following discussions, triangulation t0 will always be considered for the sake of simplicity.

TABLE 6. Detailed results of the experiments with observed variables for the network munin2.

          SP                                             SLP
δ         t0             t1              t2              t0              t1              t2
0.005     21.57–0.1342   46.38–0.0228    25.49–0.1381    15.46–0.0758    61.47–0.1767    16.03–0.0635
0.001     31.28–0.1025   120.99–0.0052   32.73–0.1039    16.24–0.0577    119.91–0.0223   13.69–0.0595
0.0005    35.14–0.0735   197.74–0.0026   37.65–0.0736    38.86–0.0351    180.13–0.0176   43.51–0.0416
0.0001    44.46–0.0677   266.40–0.0011   50.80–0.0676    74.93–5.36E-4   276.67–0.0030   82.23–0.0016

          SLP(cache)                                     SLP(prod_vars)
δ         t0             t1              t2              t0              t1              t2
0.005     15.61–0.0758   54.13–0.1739    15.66–0.0639    12.07–0.0963    75.45–0.0222    12.55–0.0950
0.001     16.12–0.0577   95.40–0.0217    14.50–0.0597    22.65–0.0811    123.32–0.0059   21.88–0.0825
0.0005    34.78–0.0351   135.38–0.0173   39.50–0.0419    31.75–0.0807    163.35–0.0036   30.92–0.0823
0.0001    67.69–5.95E-4  198.05–0.0031   74.47–0.0025    40.15–0.0686    264.23–0.0012   42.19–0.0700

Each cell contains the propagation time (in seconds) and the error, measured as the Kullback-Leibler divergence between the exact and estimated probability values.

TABLE 7. Detailed results of the experiments without observed variables for the network munin2.

          SP                               SLP
δ         t0/t2           t1               t0/t2          t1
0.005     22.10–0.1593    76.34–0.0699     14.63–0.0707   50.16–0.2354
0.001     25.57–0.1204    804.96–0.0471    11.58–0.0655   92.85–0.0210
0.0005    32.24–0.0888    574.89–0.0420    38.10–0.0480   112.82–0.0124
0.0001    45.91–0.0816    799.01–0.0375    72.23–0.0019   156.53–0.0047

          SLP(cache)                       SLP(prod_vars)
δ         t0/t2           t1               t0/t2          t1
0.005     14.84–0.0708    55.81–0.2358     12.70–0.1132   158.79–0.0722
0.001     13.39–0.0656    98.97–0.0211     22.75–0.0986   459.30–0.0475
0.0005    34.54–0.0484    118.67–0.0125    34.48–0.0962   514.03–0.0423
0.0001    67.20–0.0024    163.32–0.0048    44.66–0.0823   950.73–0.0379

Each cell contains the propagation time (in seconds) and the error, measured as the Kullback-Leibler divergence between the exact and estimated probability values.

FIG. 4. K-L divergence versus time for the munin2 network with observed variables, corresponding to the algorithm SLP(prod_vars) with triangulations t0, t1, and t2.




Simple Lazy-Penniless propagation (SLP) generally improves the results of the Simple Penniless algorithm (SP). Only in the munin1 network (with and without observations) is there a reduction in the time, but the error increases. In this network, Lazy-Penniless propagation with the heuristic to select the potentials to combine [SLP(prod_vars)] reduces the time even more and obtains errors similar to SP. See Figure 5 for a comparison of the algorithms over the network munin1 with triangulation t0 and with observations.

FIG. 5. K-L divergence versus time for the munin1 network with observed variables, corresponding to the different algorithms used in the experiments with triangulation t0.

In general, the use of the cache in Lazy-Penniless [SLP(cache)] produces results similar to SLP in three of the six experiments (the three networks, with and without observations). In one of them, the results are worse, and in two they are better. The cases in which the results improve are the most difficult ones (munin2). A possible explanation is that in simple cases it takes longer to maintain the cache and to make a double propagation than to perform the redundant calculations with small probability trees, so that the use of the cache is interesting only if many operations are stored in it. For instance, in the experiments without observations and triangulation t0, the numbers of operations stored in the cache are 652, 111, and 24 for the networks munin2, munin1, and water, respectively, which clearly suggests that algorithms over munin2 are more likely to benefit from the use of the cache.

The use of a heuristic to select the order in which the potentials are combined in Lazy-Penniless [SLP(prod_vars)] looks promising. In general, it considerably reduces the time for equal values of $\delta$. The exceptions are water and munin1 without observations, but in these cases, the differences in terms of propagation time are small. Furthermore, water is a rather easy problem, and it is not surprising that a more sophisticated algorithm performs worse on it. On the contrary, over munin1, SLP(prod_vars) is able to provide a result with the lowest value of $\delta$, while SLP runs out of memory. We do not, however, have a general rule to establish the quality of the approximation (divergence). In some cases, the error is lower, and in some cases, it is higher. It is, in fact, very difficult to expect an algorithm to always provide a lower error than another in every situation. We must take into account that the accuracy of the approximation depends on several parameters, which are difficult to manage in practice, and any slight variation in the order of operations can make the difference go in either direction.

7. CONCLUSIONS AND FUTURE WORK

We have presented a family of approximate algorithms for probability propagation in Bayesian networks. Empirical work has demonstrated that these algorithms obtain good approximate results in a short time compared to the time necessary to obtain exact results with Lazy propagation. They can be adapted to the available time using the parameter $\delta$.

Lazy evaluation was combined with the Penniless technique and, consequently, good results were obtained in the experimental work. In addition, two possible improvements to Lazy-Penniless were considered. First, we studied the use of a heuristic to determine the order in which the potentials are combined. This heuristic selects the potentials to combine by taking into account the sizes of the potentials represented by trees and the maximum size of the product potential. In general, this has reduced the propagation time. We also implemented a solution based on the use of a cache in order to avoid redundant calculations in the Lazy evaluation. The use of the cache looks promising for hard problems.

We think that the new algorithms enlarge the class of tractable Bayesian networks, as the time and space costs are smaller than the costs of the HUGIN or Shenoy-Shafer architectures. When an exact solution cannot be reached, we have the possibility of searching for an approximate solution.

In the future, we plan to test other heuristics for the selection of the combination order based on entropy measures. We are also willing to consider the use of conditional approximations as in [3], but in a selective way: only when the information value of a potential is high.




Another possibility is to approximate potentials represented by very large trees by a product of several potentials. Here, we only considered the case of approximating a potential by another one, but there are situations in which a potential is better approximated by the product of several ones (consider the case in which we are close to a situation of conditional independence). This is more difficult to carry out, but it can be exploited by Lazy propagation in difficult problems, as we can keep the approximating potentials without combining them.

Acknowledgments

The authors are very grateful to the anonymous referees for their constructive and useful comments. In particular, in our first version of the paper, only triangulation t1 was considered, which did, in fact, provide the worst results. While reading the reviews, we realized that this was possibly a very bad selection.

REFERENCES

[1] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller, Context-specific independence in Bayesian networks, Proc 12th Conf on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, 1996, pp. 115-123.

[2] A. Cano and S. Moral, Propagación exacta y aproximada con árboles de probabilidad, Proc VII Conf of the Spanish Assoc for Artificial Intelligence, 1997, pp. 635-644.

[3] A. Cano, S. Moral, and A. Salmerón, Penniless propagation in join trees, Int J Intell Syst 15 (2000), 1027-1059.

[4] P. Dagum and M. Luby, An optimal approximation algorithm for Bayesian inference, Artif Intell 93 (1997), 1-27.

[5] R. Fung and K.C. Chang, "Weighting and integrating evidence for stochastic simulation in Bayesian networks," M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, Editors, Uncertainty in Artificial Intelligence, North-Holland, Amsterdam, 1990, Vol. 5, pp. 209-220.

[6] L.D. Hernández, S. Moral, and A. Salmerón, A Monte Carlo algorithm for probabilistic propagation in belief networks based on importance sampling and stratified simulation techniques, Int J Approx Reason 18 (1998), 53-91.

[7] F. Jensen and S.K. Andersen, Approximations in Bayesian belief universes for knowledge-based systems, Proc 6th Conf on Uncertainty in Artificial Intelligence, Elsevier, New York, 1990, pp. 162-169.

[8] F.V. Jensen, S.L. Lauritzen, and K.G. Olesen, Bayesian updating in causal probabilistic networks by local computation, Comput Stat Q 4 (1990), 269-282.

[9] U. Kjærulff, Reduction of computational complexity in Bayesian networks through removal of weak dependencies, Proc 10th Conf on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, 1994, pp. 374-382.

[10] A.V. Kozlov, Efficient inference in Bayesian networks, PhD thesis, Stanford University, 1998.

[11] A.V. Kozlov and D. Koller, Nonuniform dynamic discretization in hybrid networks, Proc 13th Conf on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, 1997, pp. 302-313.

[12] S. Kullback and R. Leibler, On information and sufficiency, Ann Math Stat 22 (1951), 76-86.

[13] S.L. Lauritzen and D.J. Spiegelhalter, Local computations with probabilities on graphical structures and their application to expert systems, J R Stat Soc B 50 (1988), 157-224.

[14] Z. Li and B. D'Ambrosio, Efficient inference in Bayes networks as a combinatorial optimization problem, Int J Approx Reason 11 (1994), 55-81.

[15] A.L. Madsen and B. D'Ambrosio, Lazy propagation and independence of causal influence, Symbolic and Quantitative Approaches to Reasoning and Uncertainty, Lecture Notes in Artificial Intelligence, Springer, Berlin, 1999, Vol. 1638, pp. 293-304.

[16] A.L. Madsen and F.V. Jensen, Lazy propagation: A junction tree inference algorithm based on lazy evaluation, Artif Intell 113 (1999), 203-245.

[17] J. Pearl, Probabilistic reasoning in intelligent systems, Morgan Kaufmann, San Mateo, 1988.

[18] D. Poole, Average-case analysis of a search algorithm for estimating prior and posterior probabilities in Bayesian networks with extreme probabilities, Proc 13th Int Joint Conf on Artificial Intelligence (IJCAI-93), Morgan Kaufmann, San Mateo, 1993, pp. 606-612.

[19] A. Salmerón, A. Cano, and S. Moral, Importance sampling in Bayesian networks using probability trees, Comput Stat Data An 34 (2000), 387-413.

[20] E. Santos and S.E. Shimony, Belief updating by enumerating high-probability independence-based assignments, Proc 10th Conf on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, 1994, pp. 506-513.

[21] E. Santos, S.E. Shimony, and E. Williams, Hybrid algorithms for approximate belief updating in Bayes nets, Int J Approx Reason 17 (1997), 191-216.

[22] R.D. Shachter and M.A. Peot, "Simulation approaches to general probabilistic inference on belief networks," M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, Editors, Uncertainty in Artificial Intelligence, North-Holland, Amsterdam, 1990, Vol. 5, pp. 221-231.

[23] P.P. Shenoy, Binary join trees for computing marginals in the Shenoy-Shafer architecture, Int J Approx Reason 17 (1997), 239-263.

[24] P.P. Shenoy and G. Shafer, "Axioms for probability and belief function propagation," R.D. Shachter, T.S. Levitt, L.N. Kanal, and J.F. Lemmer, Editors, Uncertainty in Artificial Intelligence, North-Holland, Amsterdam, 1990, Vol. 4, pp. 169-198.
