
Representing and Querying Uncertain Data

Prithviraj Sen

Dissertation submitted in partial fulfillment of requirements for the degree of Doctor of Philosophy to the Department of Computer Science, University of Maryland, College Park

Thesis Advisers: Dr. Amol V. Deshpande and Dr. Lise C. Getoor


Abstract

There has been a longstanding interest in building systems that can handle uncertain data. Traditional database systems inherently assume exact data and harbour fundamental limitations when it comes to handling uncertain data. In this dissertation, we present a probabilistic database model that can compactly represent uncertainty models in full generality. Our representation is associated with precise and intuitive semantics, and we show that the answer to every user-submitted query can be obtained by performing probabilistic inference. To query large-scale probabilistic databases, we propose a number of techniques that help scale probabilistic inference. Foremost among these techniques is a novel lifted inference algorithm that determines and exploits symmetries in the uncertainty model to speed up query evaluation. For cases when the uncertainty model stored in the database does not contain symmetries, we propose a number of techniques that perform approximate lifted inference. Our techniques for approximate lifted inference have the added advantage of allowing the user to control the degree of approximation through a handful of tunable parameters. Besides scaling probabilistic inference, we also develop techniques that alter the structure of the inference required to evaluate a query. More specifically, we show that for a restricted model of our probabilistic database, if each result tuple can be represented by a boolean formula with special characteristics, i.e., it is a read-once function, then the complexity of inference can be drastically reduced. We conclude the dissertation with a listing of directions for future work.


To Mummy, Baba and Shiny,
Dadu and Diddey,
Amma and Dada,

for always supporting me


Contents

1 Introduction
  1.1 A Small Example
  1.2 Our Approach
  1.3 Outline and Contributions

2 Related Work
  2.1 Uncertainty and Databases
  2.2 First-Order Graphical Models
  2.3 Lifted Inference
  2.4 Query Evaluation in Probabilistic Databases
  2.5 Conclusion

3 Representing Uncertain Data
  3.1 Background: Probabilistic Graphical Models
  3.2 Probabilistic Databases with Probabilistic Graphical Models
    3.2.1 Possible World Semantics
    3.2.2 Examples
  3.3 Query Evaluation
    3.3.1 Example
    3.3.2 General Relational Algebra Queries
    3.3.3 Complexity of Probabilistic Inference
  3.4 Experiments and Discussion
    3.4.1 Case Study: Querying Publication Datasets
    3.4.2 Experiments with TPC-H Benchmark
    3.4.3 Aggregation Queries
  3.5 Conclusion

4 Bisimulation-based Lifted Inference for Probabilistic Databases
  4.1 Motivating Example
    4.1.1 Limitations of Naive Inference Algorithms
  4.2 Inference with Shared Factors
    4.2.1 The ELIMRV Operator
    4.2.2 The RV-ELIM Graph
    4.2.3 Identifying Shared Factors
    4.2.4 Bisimulation for RV-ELIM Graphs
    4.2.5 Inference with the Compressed RV-ELIM Graph
  4.3 Computing Elimination Orders
  4.4 Experimental Evaluation
    4.4.1 Car DB Experiments
    4.4.2 Experiments with Uncertain Join Attributes
    4.4.3 Experiments with TPC-H Data
  4.5 Discussion
  4.6 Conclusion

5 Approximate Lifted Inference for Probabilistic Databases
  5.1 Running Example
  5.2 Approximate Lifted Inference with Approximate Bisimulation
  5.3 Approximate Lifted Inference with Factor Binning
  5.4 Unified Lifted Inference
    5.4.1 Bounded Complexity Lifted Inference
    5.4.2 Unified Lifted Inference Engine
  5.5 Experimental Evaluation
    5.5.1 Synthetic Bayesian Network Generator
    5.5.2 Lifted Inference with Approximate Bisimulation
    5.5.3 Approximate Lifted Inference with Factor Binning
    5.5.4 Unified Lifted Inference Engine
    5.5.5 Experiments on Real-World Data
  5.6 Conclusion and Future Work

6 Read-Once Functions and Probabilistic Databases
  6.1 Preliminaries
  6.2 Read-Once Functions: Unateness, P4-Free and Normality
  6.3 Hierarchical Queries and Read-Once Functions
  6.4 Read-Once Expressions for Probabilistic Databases
  6.5 Read-Once Functions and Conjunctive Queries without Self-Joins
  6.6 Discussion
  6.7 Conclusion

7 Conclusion
  7.1 Summary of Contributions
  7.2 Avenues for Future Work
  7.3 Conclusion

Bibliography


Chapter 1

Introduction

Many applications produce massive amounts of data that needs to be stored in an organized manner so that users can sift through and find information that is of interest. Database systems have become the de facto standard for storing such large amounts of data. At least from the end user's perspective, one of the most important reasons for the success of database systems is the declarative querying capabilities they offer through query languages such as SQL. The use of a declarative query language allows the lay user to pose complex queries against the underlying data without having to worry about algorithmic or efficiency issues associated with evaluating the query.

Unfortunately, current database systems are not well suited to store data with uncertainties. If we state that John's salary is $55,000 per annum, then John cannot have any salary other than $55,000. We cannot, for example, state that John's salary could lie anywhere between $55,000 and $60,000 per annum, or that the temperature on the first floor measured through a sensor at 10:28AM this morning was more likely to be 52.6°F than 52.8°F. We refer to such data with uncertainties as uncertain data or inexact data.

A number of real world applications produce large amounts of uncertain data. Examples include data collected from sensor networks [Deshpande et al., 2004], information extraction systems [Jayram et al., 2006] and mobile object tracking systems [Cheng et al., 2003]. Traditional database management systems are not suited for storing uncertain data, which means that the declarative querying capabilities offered by database systems are unavailable to people who need to deal with and search through such data to find information of interest.

In this dissertation, our aim is to develop a database system that can store and query uncertain data. To achieve our goal, we need to answer two fundamental questions: 1) How do we represent uncertainty in a database? and 2) How do we use the uncertainty model along with the data to produce relevant answers to a user-submitted query? In the ensuing chapters, we will see how we answer the former question by combining traditional database ideas with uncertainty representation models from machine learning. Further, we also show how efficient query evaluation can be performed in such databases by developing novel algorithms based on ideas from graph theory. But before we go any further, let us consider a small example that illustrates some of the differences between querying exact data using traditional database query processing techniques and handling uncertain data.

1.1 A Small Example

Figure 1.1(a) shows a small relation, Ads, where each row corresponds to an advertisement (ad) that we pulled off a pre-owned car sales website. For simplicity, we depict only the Make and Price of the cars associated with each ad in Figure 1.1(a); a real pre-owned car sales database will likely contain many more attributes of interest. Ads contains four ads, the first (depicting a Honda for sale) being an instance of certain data and the last three (s2, s3 and s4) being uncertain. Also shown in Ads against each tuple are numbers depicting how certain/uncertain each tuple is. For now, we will refrain from specifying exactly what these numbers mean other than saying that they are a measure of how likely it is for the corresponding tuple to exist in the real world, or their degree of certainty. Such numbers may be useful in capturing the fact that many ads posted on websites remain visible even after the car in question has been sold, and thus with the passage of time it is less likely for a car being advertised to be still up for sale. For the purposes of presenting our example, a degree closer to 1 increases the chance of the corresponding tuple being present in the real world, while a degree closer to 0 decreases the chance of that ad still being valid. Thus, in Figure 1.1(a), s1 is an instance of exact data where we know that a car of make Honda is definitely up for sale, while s2, s3 and s4 are uncertain: we are not quite sure if those cars are still available, but there is a good chance (since 0.7 is closer to 1 than to 0) that they are still up for sale.

Suppose a user is interested in finding out the makes of the cars that are for sale and so wants to issue the query ∏Make(Ads). Now, we have a slight problem since we don't know how to deal with the degrees of the tuples. We will consider two simple approaches.


(a) Ads
        Make    Price    degree
    s1  Honda   $12,000  1.0
    s2  Toyota  $8,000   0.7
    s3  Dodge   $6,000   0.7
    s4  Dodge   $9,000   0.7

(b) Make          (c) Make
    Honda             Honda
                      Toyota
                      Dodge

Figure 1.1: (a) A small pre-owned car sales database. (b) ∏Make(Ads) ignoring uncertain data. (c) ∏Make(Ads) treating uncertain data just like exact data but with an extra attribute.

In the first approach, we will simply ignore all uncertain data (s2, s3 and s4) and not return them as query results. The result (shown in Figure 1.1(b)) contains only one tuple and is unsatisfactory because it seems to suggest that the only make available to the user is Honda, and that if the user does not want to purchase a car of this make then there are no other cars to choose from, which is not entirely true. There is a possibility that the other (uncertain) tuples in Ads are still valid ads, and the user should be able to find out about these through her query. The main issue here is that when we throw away (uncertain) data we actually throw away information, and this leads to query results of worse quality. In many domains, such as sensor networks, the bulk of data collected is uncertain for reasons ranging from uncertainty associated with the sensing mechanism of the sensors to an inadequate number of sensors being placed in the environment being measured. Throwing away the uncertain data, in such cases, leaves the database with precious little data to work with, which, albeit certain, is still unlikely to be enough to ensure good quality query results.

A second approach is to treat the degrees of certainty of the tuples as an extra attribute and append it to the list of attributes of Ads. Executing the same query under this paradigm returns the result shown in Figure 1.1(c), which now indicates that there are Toyotas and Dodges up for sale besides the Honda. Unfortunately, this result is not quite satisfactory for two reasons:


• Honda vs. Toyota: Notice that the query result gives Honda and Toyota equal billing. However, the Honda result is derived from a certain tuple whereas the Toyota result is derived from an uncertain tuple which may not correspond to a valid ad.

• Toyota vs. Dodge: The query also gives Toyota and Dodge equal billing, even though Toyota was derived from a single uncertain tuple with a degree of 0.7 whereas Dodge was derived from two uncertain tuples, each with a degree of 0.7. This suggests that it is more likely for the user to find a Dodge up for sale, and the query result does not reflect this.

The first discrepancy (Honda and Toyota getting equal billing in the result) suggests that while evaluating a query we need to look at the degrees of the tuples; the second discrepancy (Toyota and Dodge getting equal billing) suggests that we may also need non-trivial reasoning mechanisms to combine and compare degrees if we are to return useful query results.
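To make the second discrepancy concrete, here is a minimal sketch of such a reasoning mechanism (ours, not from the dissertation; it assumes the degrees are independent existence probabilities, whereas the model developed in later chapters also handles correlations). A make should appear in the result exactly when at least one of its supporting tuples is valid:

    # Degrees read as independent tuple-existence probabilities.
    ads = [("Honda", 1.0), ("Toyota", 0.7), ("Dodge", 0.7), ("Dodge", 0.7)]

    absent = {}
    for make, p in ads:
        # Pr[make absent from the result] shrinks with each supporting tuple.
        absent[make] = absent.get(make, 1.0) * (1.0 - p)

    for make, q in absent.items():
        print(make, round(1.0 - q, 2))
    # Honda 1.0, Toyota 0.7, Dodge 0.91: under this reading, a Dodge is
    # more likely than a Toyota to still be up for sale.

Under this reading, a useful result would rank Dodge (0.91) above Toyota (0.7), which is exactly the distinction the flat result in Figure 1.1(c) fails to make.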

The above example should make it clear that handling uncertain data is quite different from (and perhaps more challenging than) handling exact data. Uncertain data is typically richer than exact data, the richness coming from the quantitative expression of uncertainties, which is something traditional database research has not considered in depth. Effectively storing and querying uncertain data requires that we use the information present in the uncertainties appropriately so that we can help users sift through the data and arrive at answers of interest. We next describe, at a high level of abstraction, the basic ideas used in this dissertation to design such a database system.

1.2 Our Approach

Any system that deals with uncertain data has to begin by describing a representation scheme that allows users to compactly yet flexibly represent the uncertainties present in the data. In this dissertation, we borrow extensively from the machine learning literature and use probability theory coupled with the language of probabilistic graphical models to augment databases so that they can represent uncertain data. For this reason, henceforth, we will refer to a database containing uncertain data as a probabilistic database. Probabilistic graphical models [Cowell et al., 1999; Pearl, 1988] are compact representations of joint probability distributions involving a large number of random variables. By redefining a probabilistic database in terms of a probabilistic graphical model, we inherit all of their nice compactness properties. Additionally, we show how to use probabilistic graphical models to represent all the different kinds of uncertainty that a user might want to express in a probabilistic database, along with correlations. Correlations allow one to couple uncertainties among multiple random variables. For instance, relating back to the example from the previous section, suppose Ads contained another attribute, Color. Also, suppose that for some ad we knew neither the color nor the make, but we knew that if the make of the car in the ad is Honda then its color must come from a restricted set of colors, say red or black. This coupling between the color and make attributes can be represented in a probabilistic database by expressing it as a correlation (see the sketch after the list below). A number of applications produce uncertain data with known correlations, such as sensor networks for habitat monitoring, where it has been shown that utilizing spatial and temporal correlations can drastically improve the quality of query results [Deshpande et al., 2004]. In short, our formulation of a probabilistic database can represent:

• attribute uncertainty: tuples with uncertain attribute values,

• tuple uncertainty: tuples whose existence we are unsure of,

• intra-tuple attribute-attribute correlations: tuples whose attribute values are uncertain and correlated,

• inter-tuple attribute-attribute correlations: attribute values from different tuples that are both uncertain and correlated (note that these tuples can belong to different relations),

• inter- and intra-tuple correlations between attribute values and tuple existence.
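As an illustration of the intra-tuple correlation mentioned above, the following is a minimal sketch (ours; the factor values are made up purely for illustration) of a factor coupling the Make and Color attributes of a single ad, in the spirit of the factors defined formally in Chapter 3:

    # Hypothetical factor f(Make, Color) for one ad: if the make is Honda,
    # the color is restricted to red or black; all weights are illustrative.
    f_make_color = {
        ("Honda",  "red"):   0.5,
        ("Honda",  "black"): 0.5,
        ("Honda",  "white"): 0.0,   # ruled out by the correlation
        ("Toyota", "red"):   0.2,
        ("Toyota", "black"): 0.3,
        ("Toyota", "white"): 0.5,
    }

    # Conditional distribution of Color for an ad known to be a Honda.
    colors = sorted({c for (_, c) in f_make_color})
    z = sum(f_make_color[("Honda", c)] for c in colors)
    for c in colors:
        print(c, f_make_color[("Honda", c)] / z)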

In the previous section, when we discussed simple ways of handling uncertain data using traditional database systems, we showed how such techniques led to query results that were qualitatively unsatisfactory. However, we did not discuss what the correct query result should look like. This, in part, relates to the question of assigning semantics to a probabilistic database. What does a probabilistic database actually mean? Possible worlds semantics [Halpern, 1990] is one set of semantics that has formed the basis of numerous probabilistic models proposed in the machine learning literature, and it is known that databases based on possible worlds semantics are associated with particularly intuitive and precise query evaluation semantics [Dalvi and Suciu, 2004; Fuhr and Rolleke, 1997]. Essentially, under possible worlds semantics, a probabilistic database is simply a distribution over many traditional databases, each referred to as a possible world.


Query evaluation under possible worlds semantics means evaluating the query against each possible world (which we know how to do, since each possible world is a database devoid of any uncertainty) and, for each result, adding up the probabilities of all possible worlds that produce the result. Fortunately, our probabilistic graphical model-based formulation of probabilistic databases lends itself naturally to possible worlds semantics, thus defining precise semantics for the query evaluation problem.
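These semantics can be stated directly as a deliberately brute-force sketch (ours, assuming independent tuple existence for readability). It enumerates every possible world, evaluates the query in each, and credits each world's probability to the results it produces; the inference machinery of later chapters exists precisely to avoid doing this naively:

    from itertools import product

    # Tuples of Ads with existence probabilities (independence assumed here).
    ads = [("Honda", 1.0), ("Toyota", 0.7), ("Dodge", 0.7), ("Dodge", 0.7)]

    answer = {}
    # Each bit vector picks which tuples exist, yielding one possible world.
    for world in product([0, 1], repeat=len(ads)):
        p_world = 1.0
        for (make, p), present in zip(ads, world):
            p_world *= p if present else (1.0 - p)
        # Evaluate the projection query in this (certain) world ...
        makes = {make for (make, p), present in zip(ads, world) if present}
        # ... and add the world's probability to every result it produces.
        for m in makes:
            answer[m] = answer.get(m, 0.0) + p_world

    print({m: round(pr, 2) for m, pr in sorted(answer.items())})
    # {'Dodge': 0.91, 'Honda': 1.0, 'Toyota': 0.7}

Note that the loop runs over 2^4 = 16 worlds here; with n uncertain tuples it would run over 2^n, which is why query evaluation is reformulated as probabilistic inference instead.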

Of course, defining the query evaluation problem by associating it with precise semantics is one thing; efficiently evaluating queries is another. Even though possible worlds semantics precisely defines what the result of posing a query to a probabilistic database should be, it does not provide an efficient means of computing it. To this end, we develop an approach to evaluating a user-submitted query by reformulating it as a probabilistic inference problem in an appropriately constructed graphical model. More precisely, given a query q (expressed in some declarative query language such as relational algebra or SQL) to be evaluated against a probabilistic database with an underlying probabilistic graphical model, we show how to augment the probabilistic graphical model on the fly to construct an augmented probabilistic graphical model from which we can compute the result of q by solving a probabilistic inference problem. Exactly how we augment the probabilistic graphical model underlying the probabilistic database depends on q and the operators appearing in it. This reformulation in terms of probabilistic inference has two clear benefits:

1. The general problem of evaluating queries on probabilistic databases is known to be #P-complete [Dalvi and Suciu, 2004], but we also know that for some queries this problem is efficiently solvable. By expressing a query evaluation problem as a probabilistic inference problem to be evaluated on an appropriately constructed probabilistic graphical model, we can now identify exactly which queries lead to hard problems, since the hardness of running probabilistic inference is well understood and can be determined by measuring the treewidth [Arnborg, 1985] of the probabilistic graphical model (see the sketch after this list).

2. By reformulating the query evaluation problem as a probabilistic inference problem, we gain access to the host of probabilistic inference algorithms developed in the machine learning literature, and by appropriately choosing the inference algorithm, we can obtain various time vs. space vs. accuracy trade-offs depending on the requirements of the user.
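The treewidth measurement mentioned in point 1 can be probed with off-the-shelf tools. The following sketch is ours (not from the dissertation) and uses networkx's heuristic treewidth routine, so it yields an upper bound on the true treewidth rather than the exact value:

    import networkx as nx
    from networkx.algorithms.approximation import treewidth_min_degree

    # Build the graph of a tiny PGM: one node per random variable and an
    # edge between every pair of variables that co-occur in some factor.
    g = nx.Graph()
    g.add_edges_from([("x1", "x2"), ("x2", "x3")])  # a chain of 3 variables

    width, _decomposition = treewidth_min_degree(g)
    print("treewidth upper bound:", width)  # 1: tree-structured, easy case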


Besides utilizing probabilistic inference algorithms and the various optimizations they come with to evaluate queries on probabilistic databases, another aspect that affects the complexity of the query evaluation problem is the data stored in the probabilistic database. We can reduce the complexity of query evaluation by exploiting special properties of the data stored in the probabilistic database at hand. One such property is the presence of shared correlations, where the same correlations and uncertainties occur repeatedly in the data. For instance, in the example from the previous section, if we had two ads concerning Honda vehicles and we didn't know their respective colors, then it is likely that the Color attribute values of the two corresponding tuples would be governed by the same uncertainties and probability distributions. Essentially, uncertainties and probability distributions rarely vary on a tuple-to-tuple basis and usually come from general statistics, which leads to repeated probability factors and correlations. Besides occurring naturally in the data, shared correlations are also introduced when we augment the probabilistic graphical model defined on the base data to evaluate a query. In the presence of shared correlations, any standard inference algorithm would treat each copy of a shared correlation separately and perform the same computation steps repeatedly. We develop an inference algorithm based on bisimulation [Kanellakis and Smolka, 1983] that helps identify such shared correlations and avoid repetitive computations. We validate our algorithm by showing that even in the presence of a few shared correlations our algorithm does significantly better than standard inference algorithms.
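The core trick can be sketched in a few lines. The following is our own simplification, not the Chapter 4 algorithm itself (which iteratively partitions an entire computation graph rather than single factors): factors with identical content are detected via a content signature, so the work of processing a shared factor is done once and reused:

    # Three tuples share uncertainty over Color; the values are illustrative.
    # Each factor is a (scope, table) pair over a single random variable.
    factors = [
        (("color_ad1",), {("red",): 0.6, ("black",): 0.4}),
        (("color_ad2",), {("red",): 0.6, ("black",): 0.4}),  # shared content
        (("color_ad3",), {("red",): 0.1, ("black",): 0.9}),
    ]

    cache = {}
    for scope, table in factors:
        # The signature ignores variable identity and keeps only the content.
        sig = tuple(sorted(table.items()))
        if sig not in cache:
            # Perform the (potentially expensive) step once per signature,
            # e.g. summing out the factor's variable.
            cache[sig] = sum(table.values())
        print(scope, "->", cache[sig])

    print(len(cache), "distinct computations for", len(factors), "factors")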

We further develop our approach to leveraging shared correlations during query evaluation by devising approximate versions of the above inference algorithm. For many applications, perfect accuracy in query results may not be a requirement and some errors can be tolerated; our approximate inference techniques are aimed at such applications, where we make more aggressive use of shared correlations and trade off accuracy to reduce the time spent running inference. More specifically, we propose two different ways to implement approximate inference, both closely related to bisimulation. These two techniques can be combined for more aggressive exploitation of shared correlations. Further, our techniques can be combined with bounded complexity inference techniques such as mini-buckets [Dechter and Rish, 2003]. We report experiments on both synthetic and real data to show that in the presence of symmetries, run-times for inference can be improved significantly, with approximate lifted inference providing orders of magnitude speedup over standard inference algorithms and the previously developed shared correlations-aware exact inference algorithm.

In the last part of the dissertation, our focus remains on efficient query evaluation, but the questions we ask are slightly different. Recall that while evaluating queries, we first take the (uncertain) data from the database and the user-submitted query, and generate a probabilistic graphical model on which we need to run inference to compute the result of the query. Note that, for the same query, many different query plans are possible. Further, different query plans for the same query may result in different probabilistic graphical models, all of which are equivalent with respect to the results of inference. The obvious question to ask in such a scenario is: are all of these graphical models similar in complexity, or is there a graphical model/query plan on which it is easier to run inference; in other words, is there a low-treewidth graphical model? Previous attempts to answer this question led to the concept of hierarchical queries. Hierarchical queries represent the class of queries for which there exists a particular query plan that lets us generate a tree-structured probabilistic graphical model (which is easy to run inference on) for any (tuple-independent) probabilistic database. However, because of their query-centric definition that does not involve the database, hierarchical queries represent an overly pessimistic way of defining the class of tractable queries. It is easy to construct examples where a non-hierarchical query run on an appropriate database gives rise to a tractable query evaluation problem. In the final part of the dissertation, we go beyond the notion of hierarchical queries. Our goal is to develop query evaluation algorithms that, given the database and the query, generate a tree-structured graphical model (if one exists) by leveraging both the data and the query. For a tuple-level probabilistic database, it is easy to show that every result tuple is associated with a boolean formula, and query evaluation reduces to computing the marginal probability of the boolean formula holding true. It is also easy to see that if the result tuple is such that its associated boolean formula can be factorized into a form where every boolean variable (or tuple-existence variable, in our case) appears no more than once, then its marginal probability can be computed efficiently. We propose novel approaches that generate such factorizations of result tuples produced by evaluating queries. By doing so, we leverage both data and query to solve queries on probabilistic databases in the most efficient manner possible.
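A tiny worked example of the read-once idea (ours, with made-up tuple variables and probabilities): the lineage formula (x1 AND y1) OR (x1 AND y2) mentions x1 twice, but factorizes into the read-once form x1 AND (y1 OR y2), in which every variable appears exactly once, so independence gives the marginal probability directly:

    from itertools import product

    # Independent tuple-existence probabilities (illustrative values).
    p = {"x1": 0.5, "y1": 0.6, "y2": 0.7}

    # Read-once form x1 AND (y1 OR y2): Pr[OR] = 1 - prod(1 - p_i) and
    # Pr[AND] = prod(p_i) apply because each variable appears only once.
    p_or = 1 - (1 - p["y1"]) * (1 - p["y2"])
    p_readonce = p["x1"] * p_or

    # Brute-force check of the original formula over all 8 assignments.
    p_brute = 0.0
    for x1, y1, y2 in product([0, 1], repeat=3):
        if (x1 and y1) or (x1 and y2):
            weight = 1.0
            for var, val in zip(("x1", "y1", "y2"), (x1, y1, y2)):
                weight *= p[var] if val else 1 - p[var]
            p_brute += weight

    print(round(p_readonce, 4), round(p_brute, 4))  # both 0.44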

This dissertation forms the first few steps in developing a full-fledged database system that can manage and store uncertain data. Given the level of interest in probabilistic databases and the wide array of applications that can benefit from developments in this area of research, it should come as no surprise that much work still needs to be done before we see a viable, useful system being released, and that the number of possible directions of future work far exceeds what can be accommodated in a few pages of this dissertation. Still, some of these directions are more compelling and require more urgent attention than others. We conclude the dissertation with a summary of contributions made and a listing of these possible avenues for future work.

1.3 Outline and Contributions

The rest of the dissertation is organized as follows:

• In the next chapter, we begin by discussing prior related work. The work described in this dissertation contributes and is related to a number of different fields of research, and in Chapter 2 we review the more relevant references, organized according to different areas of research, to help the reader place our contributions in context.

• Chapter 3 describes the basic representation scheme, which can express all types of uncertainties that one may want to express in a relational database. This chapter is based on work that appeared in Sen and Deshpande [2007]; Sen et al. [2007, 2009c]. More precisely, in this chapter:

– We define probabilistic databases in terms of probabilistic graphical models.

– We show how our formulation naturally lends itself to possible worlds semantics.

– We show that the query evaluation problem can be recast as a probabilistic inference problem in an appropriately constructed probabilistic graphical model.

– We show how to construct the probabilistic graphical model for a query on the fly at query time.

– We show how standard probabilistic inference algorithms, along with various optimizations, can be used to answer queries in our probabilistic database.

• In Chapter 4, we develop an inference algorithm that exploits shared correlations; it is the first inference algorithm of its kind that can be applied to any probabilistic graphical model (even ones that do not arise out of query evaluation for probabilistic databases). This chapter is based on work that appeared in Sen et al. [2008a, 2009c]. More precisely, in this chapter:

– We define shared correlations and motivate their presence in uncertain data using examples.

– We develop an inference algorithm based on bisimulation that exploits shared correlations to avoid repetitive computation.

– We develop an effective heuristic to construct elimination orders (a key step in most exact inference algorithms) and show that our heuristic produces orders that work well with our inference algorithm.

– We validate our inference algorithm by running experiments and comparing against standard inference algorithms.

• In Chapter 5, we develop approximate versions of the previously developed shared correlation-aware inference algorithm. This chapter is based on work that appeared in Sen et al. [2009a]. More precisely, in this chapter:

– We devise two different ways to implement approximate inference with shared correlations: one based on approximate bisimulation and another based on factor binning.

– We show that these two approaches can be combined for more aggressive exploitation of shared correlations.

– We also show how these techniques can be combined with bounded complexity inference mechanisms.

– We develop a unified inference engine that, through a handful of tunable parameters, allows the user to control the degree of approximation and the extent to which we exploit shared correlations, thus allowing the user to achieve a trade-off between accuracy of inference and time spent running inference.

– We demonstrate through experiments on both synthetic and real data how the approximate inference procedures can provide orders of magnitude speedup over standard inference algorithms and our previously developed shared correlation-aware exact inference algorithm.


• In Chapter 6, we develop query evaluation algorithms that generate tree-structured graphical models (if possible) given the query and the database. More precisely, in this chapter:

– We review the concept of hierarchical queries.

– We review the relationship between hierarchical queries, tree-structured graphical models and read-once functions.

– We propose a very simple query evaluation algorithm that makes use of previous work on read-once functions and generates tree-structured graphical models whenever possible given any query to be run on a probabilistic database.

– We consider the special case of conjunctive queries and show that read-once functions can be generated more efficiently for this case.

• We conclude the dissertation with Chapter 7, which contains a summary of contributions and a listing of possible avenues of future work.


Chapter 2

Related Work

The broader field of uncertainty management in databases has seen a lot of work in recent years. In this chapter, we attempt to list the more relevant works and contrast them with the contributions made in this dissertation. Moreover, the work described in the ensuing chapters relates to various fields of research besides database systems, such as machine learning. In what follows, we divide the related work according to these various fields of research, and within each sub-division we mention how our work relates to relevant prior work.

2.1 Uncertainty and Databases

The topic of representing and modeling uncertainty has been in the collective conscience of the database community for a fairly long period of time. Consequently, a wide array of approaches have been proposed. Very early on, the subject of dealing with null values, or logical uncertainty, in a principled manner received a fair amount of attention [Imielinski and Lipski, Jr., 1984]. More recently, there has been more work along these lines that attempts to concisely represent such databases by employing vertical partitioning methods [Antova et al., 2007]. Das Sarma et al. [2006] explore various models of logical uncertainty with varying representation power.

When it comes to dealing with uncertainty involving a measure of uncertainty or beliefs, a number of different approaches have been proposed. However, the consensus seems to be that probability theory has the right balance of power and tractability, which is why the bulk of research in this area falls under the sub-area known as probabilistic databases. Barbara et al. [1992] is one of the earliest works along these lines; it explores attribute uncertainty models focussing on intra-tuple correlations. Fuhr and Rolleke [1997] is perhaps one of the earliest works that proposed a coherent, albeit simplistic, model of a probabilistic database; the application in focus was combining information retrieval and database techniques into one single system. ProbView [Lakshmanan et al., 1997] posits that each tuple is associated not with a point estimate of probability but with a range, and goes on to develop query evaluation techniques based on linear programming. In recent developments, Dalvi and Suciu [2004] present a probabilistic database model based on simple semantics (possible worlds) and show how query rewriting techniques can help solve intractable queries under this model.

In Chapter 3, we develop compact yet powerful models of probabilistic databases based on probability theory and factored representations of joint probability distributions. Our approach is closely related to representing uncertainty with probabilistic graphical models from the machine learning literature. Our techniques allow the user to express all kinds of uncertainty within a relational database, along with correlations. We also show that our model of a probabilistic database is associated with precise and intuitive semantics, possible worlds, and query evaluation can be performed by running standard probabilistic inference algorithms on an appropriately constructed probabilistic graphical model. The work described in Chapter 3 illustrates that it is possible to go beyond simplistic tuple-level uncertainty models that assume complete independence and still come up with models of probabilistic databases that have desirable properties such as simple semantics and tractable querying.

Chapter 3 is based on previously published works [Deshpande et al., 2008; Sen and Deshpande, 2007; Sen et al., 2007, 2009c]. In Sen and Deshpande [2007], we introduced models of probabilistic databases that allowed tuple-level uncertainty with correlations, of both intra-relation and inter-relation varieties; this was perhaps one of the first works to include correlations. It marked a significant departure from prior work: both Fuhr and Rolleke [1997] and Dalvi and Suciu [2004] worked with tuple-level uncertainty models assuming complete independence among tuples. Since most applications produce data which requires modeling correlations, our work significantly broadened the applicability of probabilistic databases. Subsequently, in Sen et al. [2007, 2009c], we made our model of probabilistic databases more general by including attribute- and tuple-level uncertainty, and also by including first-order graphical models based on shared correlations. The concept of shared correlations is introduced in Chapter 4, wherein we represent numerous identical correlations together instead of representing them separately. This allows our uncertainty model to become even more compact. Shared correlations are the basis of state-of-the-art first-order uncertainty representation models from machine learning (e.g., probabilistic relational models [Friedman et al., 1999] and Markov logic networks [Richardson and Domingos, 2006], reviewed in more detail below).

There are a number of other works that also come under the umbrella of probabilistic databases. Fuhr and Rolleke [1996] proposed an extension based on NF2 relational algebra. Trio [Benjelloun et al., 2006] proposes the concept of x-tuples, an x-tuple being an uncertain tuple represented by its various alternatives. MystiQ [Re et al., 2006] proposes the block-independent disjoint model, which is similar to x-tuples. SPROUT [Koch and Olteanu, 2008] employs a model referred to as a world-set tree, and Li and Deshpande [2009] employ a similar and/xor tree. None of these approaches discuss concisely describing the uncertainty model using shared correlations and first-order graphical models as we do in Chapter 4.

Among the various models that go beyond the use of probability theory, there are models based on fuzzy logic [Bosc and Pivert, 2005; Buckles and Petry, 1982] and models based on Dempster-Shafer theory [Choenni et al., 2006].

2.2 First-Order Graphical Models

On the topic of representing uncertainty, researchers in machine learning have devoted a lot of thought and time to developing concise models that possess the requisite representation power. The result is the development of probabilistic graphical models (PGMs), which contain as special cases Bayesian networks [Pearl, 1988] and Markov networks [Cowell et al., 1999]. As reviewed in Chapter 3, a probabilistic graphical model represents a joint distribution among many random variables by representing it in little pieces called factors. Bayesian networks include only directed dependencies and Markov networks only allow undirected dependencies. There exist generalizations that allow a mix of directed and undirected dependencies but disallow directed cycles, such as chain graphs [Lauritzen, 1996] and factor graphs [Frey, 2003]. Further, generalizations that allow directed cycles have also been studied [Richardson, 1997]. Our approach outlined in Chapter 3 can make use of any of these approaches.

However, PGMs are not without limitations. These models are easier to visualize, reason about and deal with when the number of random variables is in the few hundreds or less. Even a small-to-moderately sized probabilistic database is likely to exceed this number, which makes it unreasonable to assume that an uncertainty model expressed as a PGM will be easy to handle. Machine learning researchers, specifically statistical relational learning (SRL) researchers, have in the past decade or so taken cognisance of this fact and come up with a new class of PGMs frequently referred to as first-order graphical models (FO-models). FO-models are essentially PGMs with an additional layer of specification that uses first-order rules to specify correlations among classes of random variables. The same first-order rule applies to all random variables belonging to the respective classes, and these are, essentially, shared correlations (Chapter 4). This allows FO-models to be compact, easier to maintain and visualize, and also statistically easier to estimate from data. Listing the various FO-models produces a veritable alphabet soup: PRMs [Friedman et al., 1999] (probabilistic relational models), RMNs [Taskar et al., 2002] (relational Markov networks), MLNs [Richardson and Domingos, 2006] (Markov logic networks), BLOGs [Milch et al., 2005] (Bayesian logic), etc. We refer the interested reader to Getoor and Taskar [2007] for a more extensive and detailed survey.

Our use of shared correlations and FO-models to specify models of uncertainty in probabilistic databases means we have close connections to this area of work, although there are some differences. Most of the work on FO-models has concentrated on how to specify and learn a class-level probabilistic model for relational data; answering queries expressed in a standard query language (e.g., relational algebra or SQL) was not their main focus, as is the case in research on probabilistic databases. In fact, very few FO-models proposed in the literature even consider querying with a structured query language. ProbLog [De Raedt et al., 2007], which uses Prolog, is perhaps the only exception. We believe that the best way to view the work described in this dissertation is to look upon it as taking the best of both FO-models and probabilistic databases, since the representation schemes we develop in Chapter 3 allow us to represent shared correlations in databases while the query evaluation algorithms we develop later (in Chapter 4 and Chapter 5) can exploit the same shared correlations to allow the user to efficiently and declaratively query the probabilistic database.


2.3 Lifted Inference

Even though probabilistic inference can be used to evaluate queries in probabilistic databases, there may still be cases when probabilistic inference is inefficient. In Chapter 4 and Chapter 5, we show how to exploit special properties of the uncertain data, i.e., shared correlations, to speed up inference during query evaluation. Shared correlations occur in the data when the same uncertainties and probability distributions occur repeatedly. In such a case, standard inference algorithms treat each instance of these shared correlations separately and repeatedly perform the same computation steps. We develop an approach based on the graph-theoretic concept of bisimulation [Kanellakis and Smolka, 1983; Paige and Tarjan, 1987] that avoids such repeated computation. In Chapter 4, we present an exact inference algorithm based on these ideas and, in Chapter 5, we extend the techniques in multiple different ways to perform approximate inference.

The inference algorithms presented in Chapter 4 and Chapter 5 are closely related to lifted inference algorithms [de Salvo Braz et al., 2005; Poole, 2003] developed by the SRL community. Lifted inference aims to exploit the symmetry provided by FO-models in the form of shared correlations to achieve more efficient inference. The basic idea behind lifted inference is to develop inference algorithms that, instead of summing over random variables and multiplying factors, sum over sets of random variables and multiply sets of factors, thus reducing redundant computation (see the sketch below). Most works in lifted inference assume that they are provided a PGM expressed as an FO-model and that the symmetry of shared correlations is explicitly provided in first-order logic. In Chapter 4, we make no such assumptions. This is mainly because, to evaluate queries in probabilistic databases, one first needs to build the PGM on which we need to perform inference (described in detail in Chapter 3), and it is not straightforward to obtain a PGM expressed as an FO-model via this approach. Instead, our bisimulation-based approach to lifted inference discovers the symmetry due to shared correlations in the constructed PGM on the fly. To the best of our knowledge, our approach is the first general lifted inference approach that can be applied to any PGM.
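To see the flavor of the savings, consider a minimal sketch (ours; real lifted inference operates on the first-order representation itself, and this only shows the underlying arithmetic shortcut). When n variables each contribute an identical factor connected to a query variable y, ground inference sums out each variable separately, while the lifted view sums out one representative and exponentiates:

    import math

    # n identical factors f(x_i, y); each x_i and y is binary.
    f = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.8}
    n = 1000

    # Ground elimination: sum out each x_i separately (n identical passes).
    ground = {0: 1.0, 1: 1.0}
    for _ in range(n):
        for y in (0, 1):
            ground[y] *= f[(0, y)] + f[(1, y)]

    # Lifted elimination: sum out one representative x, then exponentiate.
    lifted = {y: (f[(0, y)] + f[(1, y)]) ** n for y in (0, 1)}

    for y in (0, 1):
        print(y, math.isclose(ground[y], lifted[y], rel_tol=1e-9))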

Some attempts have been made by the probabilistic database community to make more direct use of existing lifted inference work. In Wang et al. [2008], the authors state that among the various issues complicating the use of Parameterized Variable Elimination [Poole, 2003] for query evaluation in probabilistic databases are the presence of evidence and, presumably, handling joins among different relations; Wang et al. only report experiments on single-relation selection queries.

Poole [2003] was one of the first to show that variable elimination [Zhang and Poole, 1994] can be modified to work directly with FO-models to avoid propositionalization during inference. Subsequently, de Salvo Braz et al. [2005] built on Poole's work, referring to the technique as inversion elimination. They also introduce another technique for lifted inference known as counting elimination, which is more expensive than inversion elimination (since it requires considering all possible combinations of assignments to a set of random variables [de Salvo Braz et al., 2005]) but can help in certain situations where the complexity of the ground model renders ground inference infeasible. It is straightforward to show that our bisimulation-based approach to lifted inference subsumes inversion elimination (and partial inversion [de Salvo Braz et al., 2006]). We provide more discussion illustrating this connection, along with an example, in Section 4.6.

Lifted inference is still a very young field, but there has been some work on designing approximate lifted inference algorithms. Jaimovich et al. [2007], Kersting et al. [2009] and Singla and Domingos [2008] all, essentially, propose to use a bisimulation-like algorithm on the factor graph [Kschischang et al., 2001] representing the probabilistic model to find clusters of random variables that send and receive identical messages, which helps speed up inference with loopy belief propagation (LBP) [Yedidia et al., 2000], a ground approximate inference algorithm. Our work on approximate lifted inference described in Chapter 5 differs from lifted LBP on two distinct counts. First, except for Kersting et al., the above works depend on receiving the FO-model as input, whereas our approximate lifted inference techniques, in effect, determine the first-order representation on the fly. Second, as Singla and Domingos acknowledge, LBP often has problems with convergence, whereas the approaches we describe in Chapter 5 are always guaranteed to converge.

2.4 Query Evaluation in Probabilistic Databases

Keeping with the wide array of representation schemes proposed, a number of diverse schemes for query evaluation in probabilistic databases have also been tried. Until the last decade, there were mainly two competing schools of thought: intensional and extensional query evaluation. Intensional evaluation always provides coherent results adhering to possible worlds semantics. Extensional evaluation, however, does not always come with guaranteed semantics, so in that sense the results may be wrong, even though extensional evaluation is always cheaper than intensional evaluation. Dalvi and Suciu [2004] illustrated that these two approaches were not completely at loggerheads, and that there exists a subset of SQL whose queries, when run under extensional semantics with a particular plan, lead to results in accordance with possible worlds semantics. This subset of relational algebra has since been referred to as queries with safe plans [Dalvi and Suciu, 2004] or hierarchical queries [Dalvi and Suciu, 2007].

Interestingly, it is easy to show that safe plans always give rise to tree-structured PGMs when expressed in our formulation. This means that our approach to evaluating queries is also quite efficient when extensional evaluation provides correct query results (since tree-structured PGMs are easy to run inference on), besides always adhering to possible worlds semantics. In Chapter 6, we take this idea one step further. Instead of looking at the query to find out if it is tractable or not, as is done in most other works on tractable queries [Dalvi and Suciu, 2007, 2004; Olteanu and Huang, 2009, 2008], we ask if the PGM constructed for query evaluation can be re-ordered into a tree-structured graphical model. Essentially, the definition of hierarchical queries [Dalvi and Suciu, 2007] does not take into account the data contained in the database. One way to describe this tractable class of queries is to say that if a query is tractable for all possible databases then it belongs to this class. However, this represents a very pessimistic way of defining tractable queries. Since the PGM on which we need to run inference to compute the results is a combination of the query and the database, we need to look at both aspects in order to determine tractability. In Chapter 6, we explore these issues and make connections to literature in graph theory on factorizing boolean formulas [Golumbic et al., 2006]. We develop algorithms that take each result tuple and check whether the corresponding PGM can be converted to a tree-structured one; if so, we proceed to build this tree-structured PGM and run inference on it. We also show that for a large class of queries, conjunctive queries without self-joins, some of the checks that need to be performed in the most general case can be avoided, resulting in more efficiency. By doing so, we show that both data and query can be leveraged to the fullest to evaluate queries over probabilistic databases.

The original work on hierarchical queries [Dalvi and Suciu, 2004] mainly considered queries involving equality join predicates. Since then, there have been other works along these lines extending the notion to various other operators. In recent work, there have been attempts to show that, at least in some cases, inequality predicates, ≠ [Olteanu and Huang, 2008] and >, < [Olteanu and Huang, 2009], also allow for tractable query evaluation. These approaches are currently out of scope for our framework since they may not lead to tree-structured PGMs. As regards the assumption of no self-joins in the query, Dalvi and Suciu [2007] is the only work we are aware of that attempts to remove this assumption. From the machine learning community, Darwiche [2002] proposes utilizing boolean formula factorization algorithms so that a given probabilistic model can be compiled into a more tractable form usually referred to as an arithmetic circuit. This is advantageous because performing inference using the compiled arithmetic circuit is more efficient than performing inference with the original probabilistic model. More importantly, Darwiche can handle attribute uncertainty (they consider general PGMs). However, Darwiche relies on the use of an exponential-sized intermediate representation called a multi-linear formula. In Chapter 6, we consider the simpler case of a probabilistic database with tuple-level uncertainty. Developing techniques that can handle attribute-level uncertainty is deferred to future work.

Since our work illustrating that query evaluation requires probabilistic inference (Chapter 3), a number of other works have utilized different inference procedures. In this dissertation, we propose the use of exact inference procedures such as variable elimination [Zhang and Poole, 1994] and the junction tree algorithm [Pearl, 1988]; Benjelloun et al. [2006] and Fuhr and Rolleke [1997] have utilized the inclusion-exclusion principle for boolean formulas; Re et al. [2007] proposed the use of a Markov chain Monte Carlo technique; Koch and Olteanu [2008] use ordered binary decision diagrams; and, as mentioned earlier, Wang et al. [2008] makes direct use of existing work on lifted inference [Poole, 2003] developed in the SRL community.

Other techniques to improve the efficiency of evaluating queries in probabilistic databases include Trio's memoization techniques [Das Sarma et al., 2008], index structures for uncertain data retrieval [Singh et al., 2007] and index structures for junction trees [Kanagal and Deshpande, 2009]. These techniques are fairly generic and can be used in conjunction with the techniques proposed in this dissertation.

2.5 Conclusion

Having surveyed the relevant related work, we are now ready to proceed with the rest of the dissertation. In the next chapter we propose models for representing uncertainty to be used in conjunction with probabilistic databases, develop a technique that expresses a query evaluation problem as an inference problem on a PGM and illustrate the connection between query evaluation and probabilistic inference.


Chapter 3

Representing Uncertain Data

The database community has seen a lot of work on managing uncertain data, and the search for an ideal representation scheme has been a topic of constant interest. In the past, a number of different approaches have been proposed to represent uncertainty (we surveyed some of these in Chapter 2). Among these, perhaps the most frequently proposed approach has been the use of probability theory, likely due to its balance between power and simplicity; probability theory is general enough to represent most kinds of uncertainty we encounter in practical applications and is still simple enough to be amenable to algebraic manipulation, so that we can use it to perform various operations such as query evaluation.

In this chapter, we describe our approach to representing uncertainty in databases. We use probability theory in conjunction with probabilistic graphical models (PGMs) to develop a compact scheme to represent uncertain data with correlations. In the next section, we provide background on PGMs. In Section 3.2, we formally define a probabilistic database in terms of PGMs and describe its semantics, in addition to providing a few examples that illustrate how correlations can be represented and how they affect the distribution represented by a probabilistic database. In Section 3.3, we discuss query evaluation and optimizations that can lead to efficient query evaluation, especially for aggregate computation. We conclude the chapter with Section 3.5, after describing experimental results in Section 3.4.

3.1 Background: Probabilistic Graphical Models

Probabilistic graphical models (PGMs) form a powerful class of approaches that can compactly represent and reason about complex dependency patterns involving large numbers of correlated random variables [Cowell et al., 1999; Pearl, 1988].


Pr(x1, x2, x3) = (1/Z) f1(x1) f12(x1, x2) f23(x2, x3)

    x1  f1       x1  x2  f12      x2  x3  f23
    1   0.5      1   1   0.8      1   1   0.7
    2   0.3      1   2   0.1      1   2   0.2
    3   0.2      1   3   0.1      1   3   0.1
                 2   1   0.1      2   1   0.2
                 2   2   0.8      2   2   0.7
                 2   3   0.1      2   3   0.1
                 3   1   0.1      3   1   0.2
                 3   2   0.1      3   2   0.1
                 3   3   0.8      3   3   0.7

(a)

[Graph: X1 -- X2 -- X3]

(b)

    x1  x2  x3  Pr         x1  x2  x3  Pr         x1  x2  x3  Pr
    1   1   1   0.280      2   1   1   0.021      3   1   1   0.014
    1   1   2   0.080      2   1   2   0.006      3   1   2   0.004
    1   1   3   0.040      2   1   3   0.003      3   1   3   0.002
    1   2   1   0.010      2   2   1   0.048      3   2   1   0.004
    1   2   2   0.035      2   2   2   0.168      3   2   2   0.014
    1   2   3   0.005      2   2   3   0.024      3   2   3   0.002
    1   3   1   0.010      2   3   1   0.006      3   3   1   0.032
    1   3   2   0.005      2   3   2   0.003      3   3   2   0.016
    1   3   3   0.035      2   3   3   0.021      3   3   3   0.112

(c)

Figure 3.1: Example involving three dependent random variables each with a ternary domain: (a) factored representation (b) graphical model representation (c) resulting joint probability distribution.

The key idea behind PGMs is exploiting conditional independence [Pearl, 1988]. Most random variables only show local interactions or correlations with other random variables, and in many cases only a few such correlations need to be captured to represent the joint probability distribution defined over the collection of random variables. PGMs allow the specification of such correlations by defining small functions we refer to as factors∗; the joint probability distribution over the collection of random variables can then be defined as a normalized product of all factors.

∗Factors are a generalization of conditional probability tables in Bayesian networks [Pearl, 1988].



Let X denote a random variable with domain dom(X) and let Pr(X) denote a probability distribution over it. Similarly, let X = {X1, X2, . . . , Xn} denote a set of n random variables, each with its own associated domain dom(Xi), and let Pr(X) denote the joint probability distribution over them.

Definition 1. A factor f(X) is a function over a (small) set of random variables X = {X1, . . . , Xn} such that 0 ≤ f(x), ∀x ∈ dom(X1) × . . . × dom(Xn).

Definition 2. A probabilistic graphical model (PGM) P = 〈F, X〉 defines a joint distribution over the set of random variables X via a set of factors F, where ∀f(X) ∈ F, X ⊆ X. Given a complete joint assignment x ∈ ×_{X∈X} dom(X), the joint distribution is defined by Pr(x) = (1/Z) ∏_{f∈F} f(x_f), where x_f denotes the assignment restricted to the arguments of f and Z = ∑_{x′} ∏_{f∈F} f(x′_f)†.

Figure 3.1 shows a small example of a PGM expressing a joint probability distribution over three random variables, each with domain {1, 2, 3}. The complete joint distribution is shown in Figure 3.1(c); note that representing it requires storing 27 real numbers (26, if we exploit the fact that the distribution should add up to 1). However, if we are willing to exploit conditional independence among X1, X2 and X3, then we can represent the joint probability distribution with far fewer numbers. For instance, the distribution is such that X3 is conditionally independent of X1 given the value of X2; in terms of correlations, X1 only directly affects X2's value and X2 only directly affects X3's value. Exploiting these properties, we can represent the same distribution using the three factors shown in Figure 3.1(a). Note that the factors only require storing 21 real numbers (3 + 9 + 9), five fewer than the joint distribution described earlier. The savings usually increase with more random variables and larger domains. In Figure 3.1(b) we show a "graphical" representation of the PGM where vertices represent random variables and edges depict correlations.
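To make the factored representation concrete, the following minimal sketch (Python; the factor tables of Figure 3.1 are hard-coded and the variable names are ours, used only for illustration) multiplies the three factors, normalizes, and checks one entry of the joint distribution against Figure 3.1(c):

    # Factor tables from Figure 3.1(a).
    f1 = {1: 0.5, 2: 0.3, 3: 0.2}
    f12 = {(1, 1): 0.8, (1, 2): 0.1, (1, 3): 0.1,
           (2, 1): 0.1, (2, 2): 0.8, (2, 3): 0.1,
           (3, 1): 0.1, (3, 2): 0.1, (3, 3): 0.8}
    f23 = {(1, 1): 0.7, (1, 2): 0.2, (1, 3): 0.1,
           (2, 1): 0.2, (2, 2): 0.7, (2, 3): 0.1,
           (3, 1): 0.2, (3, 2): 0.1, (3, 3): 0.7}

    # Unnormalized product of factors, then normalization (Definition 2).
    joint = {(x1, x2, x3): f1[x1] * f12[(x1, x2)] * f23[(x2, x3)]
             for x1 in (1, 2, 3) for x2 in (1, 2, 3) for x3 in (1, 2, 3)}
    Z = sum(joint.values())              # here Z happens to be 1
    pr = {x: v / Z for x, v in joint.items()}

    assert abs(pr[(1, 1, 1)] - 0.280) < 1e-9   # matches Figure 3.1(c)
    # The factors store 3 + 9 + 9 = 21 numbers versus 27 for the full joint.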

†Note that since we allow factors to return 0, technically there is a possibility of Z being 0. This only happens when we are dealing with a PGM P that encodes the trivial joint probability distribution which maps all joint assignments to 0. As long as there exists at least one joint assignment x such that ∏_{f∈F} f(x_f) > 0, this case does not arise.


3.2 Probabilistic Databases with Probabilistic Graphical Models

We are now ready to define a probabilistic database in terms of a PGM. The basic idea is to use random variables to depict uncertain attribute values and factors to represent correlations. Let R denote a probabilistic relation or simply, relation, and let attr(R) denote the set of attributes of R. A relation R consists of a set of probabilistic tuples or simply, tuples, each of which is a mapping from attr(R) to random variables. Let t.a denote the random variable for attribute a ∈ attr(R) of tuple t ∈ R. Besides mapping each attribute to a random variable, every tuple t is also associated with a boolean-valued random variable which captures the existence uncertainty of t; we denote this by t.e.

Definition 3. A probabilistic database or simply, a database, D is a pair 〈R, P〉 where R is a set of relations and P denotes a PGM defined over the set of random variables associated with the tuples in R.

3.2.1 Possible World Semantics

We now define semantics for our formulation of a probabilistic database. Let X denote the set of random variables associated with database D = 〈R, P〉. Possible world semantics defines a database D as a probability distribution over deterministic databases or possible worlds [Dalvi and Suciu, 2004], each of which is obtained by assigning X a joint assignment x ∈ ×_{X∈X} dom(X)‡. The probability associated with the possible world obtained from the joint assignment x is given by the distribution defined by the PGM P (Definition 2).

3.2.2 Examples

We now present a few examples to further explain our notion of a probabilistic database. Consider the two-relation database shown in Figure 3.2(a). In this database, every tuple has an uncertain attribute value (the B attribute).

‡Note that not all joint assignments are legal; a legal joint assignment should satisfy ¬t.e ⇒ (t.a = ∅), ∀t ∈ R, ∀a ∈ attr(R), ∀R ∈ R, where R denotes the set of relations in D and ∅ is a special "null" assignment. In other words, a tuple's attributes cannot be assigned values unless the tuple exists. It is easy to define the factors in such a way that all illegal assignments are assigned 0 probability.


    S   A   B                         T   B                  C
    s1  a1  {1: 0.6, 2: 0.4}         t1  {2: 0.5, 3: 0.5}   c
    s2  a2  {1: 0.6, 2: 0.4}

(a)

    s1.B  fs1.B       s2.B  fs2.B       t1.B  ft1.B
    1     0.6         1     0.6         2     0.5
    2     0.4         2     0.4         3     0.5

(b)

Figure 3.2: A small database with independent uncertain attribute values.

These uncertain values are indicated in Figure 3.2(a) by specifying their respective domains, with each entry from the domain followed by the probability with which the attribute can take that value. In the database, we represent the uncertainty associated with each uncertain value using a random variable and the corresponding probability distribution using a factor (assuming complete independence). For instance, s2.B can be assigned the value 1 with probability 0.6 and the value 2 with probability 0.4, and we represent this using the factor fs2.B shown in Figure 3.2(b). We show all three required factors fs1.B(s1.B), fs2.B(s2.B) and ft1.B(t1.B) in Figure 3.2(b). In addition to the random variables which denote uncertain attribute values, we can introduce tuple existence random variables s1.e, s2.e and t1.e, which capture tuple uncertainty. These are boolean-valued random variables and can have associated factors. In Figure 3.2, we assume the tuples are certain, so we do not show the existence random variables for the base tuples. We next explain the semantics of our example database in terms of possible worlds.

The database shown in Figure 3.2 represents a distribution over many deterministic databases (possible worlds), and each possible world is obtained by assigning all three random variables s1.B, s2.B and t1.B values from their respective domains. Since the three random variables depicted in Figure 3.2 each have a domain of size 2, there are 2^3 = 8 possible worlds. Figure 3.3 shows all 8 possible worlds with the corresponding probabilities listed under the column "prob. (ind.)". The probability associated with each possible world is obtained by multiplying the appropriate numbers returned by the factors and normalizing if necessary. For instance, for the possible world obtained by the assignment s1.B = 1, s2.B = 2, t1.B = 2 (D3 in Figure 3.3), the probability is 0.6 × 0.4 × 0.5 = 0.12.


    possible world (s1.B, s2.B, t1.B)   prob.(ind.)  prob.(implies)  prob.(diff.)  prob.(pos.corr.)
    D1 = (1, 1, 2)                      0.18         0.50            0.30          0.06
    D2 = (1, 1, 3)                      0.18         0.02            0.06          0.30
    D3 = (1, 2, 2)                      0.12         0               0.20          0.04
    D4 = (1, 2, 3)                      0.12         0.08            0.04          0.20
    D5 = (2, 1, 2)                      0.12         0               0             0.24
    D6 = (2, 1, 3)                      0.12         0.08            0.24          0
    D7 = (2, 2, 2)                      0.08         0               0             0.16
    D8 = (2, 2, 3)                      0.08         0.32            0.16          0

(In each world Di, relation S contains s1 = (a1, s1.B) and s2 = (a2, s2.B), and relation T contains t1 = (t1.B, c).)

Figure 3.3: Possible worlds for the example in Figure 3.2(a).


Pr_implies(s1.B, s2.B, t1.B) = f^implies_t1.B(t1.B) f^implies_t1.B,s1.B(t1.B, s1.B) f^implies_t1.B,s2.B(t1.B, s2.B)

    t1.B  f^implies_t1.B      t1.B  s1.B  f^implies_t1.B,s1.B      t1.B  s2.B  f^implies_t1.B,s2.B
    2     0.5                 2     1     1                        2     1     1
    3     0.5                 2     2     0                        2     2     0
                              3     1     0.2                      3     1     0.2
                              3     2     0.8                      3     2     0.8

(a)

Pr_diff(s1.B, s2.B, t1.B) = f^diff_t1.B,s1.B(t1.B, s1.B) f^diff_s2.B(s2.B)

    t1.B  s1.B  f^diff_t1.B,s1.B      s2.B  f^diff_s2.B
    2     1     0.5                   1     0.6
    2     2     0                     2     0.4
    3     1     0.1
    3     2     0.4

(b)

Pr_pos.corr(s1.B, s2.B, t1.B) = f^pos.corr._t1.B,s1.B(t1.B, s1.B) f^pos.corr._s2.B(s2.B)

    t1.B  s1.B  f^pos.corr._t1.B,s1.B      s2.B  f^pos.corr._s2.B
    2     1     0.1                        1     0.6
    2     2     0.4                        2     0.4
    3     1     0.5
    3     2     0

(c)

Figure 3.4: Factors for the probabilistic databases with dependencies (we have omitted the normalization constant Z because the numbers are such that the distribution is already normalized): (a) implies correlation (b) different correlation (c) positive correlation.

Let us now try to modify our example to illustrate how to represent correlations in a probabilistic database. In particular, we will try to construct three different databases, each containing one of the following dependencies, respectively:

• implies: t1.B = 2 implies s1.B ≠ 2 and s2.B ≠ 2; in other words, (t1.B = 2) ⇒ (s1.B = 1) ∧ (s2.B = 1).

• different: t1.B and s1.B cannot have the same assignment; in other words, (t1.B = 2) ⇔ (s1.B = 1), or equivalently, (s1.B = 2) ⇔ (t1.B = 3).

• positive correlation: high positive correlation between t1.B and s1.B; if one is assigned 2 then the other is also assigned the same value with high probability.

Figure 3.3 shows, for each of the above correlations, a distribution over the possible worlds that satisfies it (the columns are labeled with abbreviations of the names of the correlations; e.g., the column for positive correlation is labeled "pos. corr.").
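These distributions can be checked mechanically against the factors shown in Figure 3.4. For instance, the following minimal sketch (Python, with the implies factor tables hard-coded under names of our own choosing) recomputes two entries of the "implies" column of Figure 3.3:

    from itertools import product

    f_t1B = {2: 0.5, 3: 0.5}
    # f^implies_t1.B,s.B: forces s.B = 1 whenever t1.B = 2.
    f_imp = {(2, 1): 1.0, (2, 2): 0.0, (3, 1): 0.2, (3, 2): 0.8}

    pr = {(s1B, s2B, t1B): f_t1B[t1B] * f_imp[(t1B, s1B)] * f_imp[(t1B, s2B)]
          for s1B, s2B, t1B in product((1, 2), (1, 2), (2, 3))}

    assert abs(sum(pr.values()) - 1.0) < 1e-9   # already normalized (Z = 1)
    assert abs(pr[(1, 1, 2)] - 0.50) < 1e-9     # world D1 in Figure 3.3
    assert abs(pr[(2, 2, 3)] - 0.32) < 1e-9     # world D8 in Figure 3.3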

To represent the possible worlds of our example database with the new correlations, we simply redefine the factors in the database. However, in this case, since we need to represent correlations, we will need to use factors defined over multiple random variables. Figure 3.4 shows the three sets of factors, each corresponding to a database with one of the previously defined dependencies, that depict the required distributions over possible worlds from Figure 3.3. For instance, Figure 3.4(a) shows the factors required to define the possible worlds distribution depicted in column "implies" in Figure 3.3, and this is achieved by defining factors f^implies_t1.B,s1.B and f^implies_t1.B,s2.B which denote the implication dependencies defined earlier. Similarly, notice how factor f^diff_t1.B,s1.B (Figure 3.4(b)) enforces that t1.B and s1.B be assigned different values. Lastly, f^pos.corr._t1.B,s1.B enforces the positive correlation between t1.B and s1.B depicted in the third example.

Note that in Definition 3, we make no restrictions as to which random variables appear as arguments in a factor. Thus, if the user wishes, s/he may define a factor including random variables from the same tuple, different tuples, tuples from different relations, or tuple existence and attribute value random variables, which means that in our formulation we can express any kind of correlation that one might think of representing in a probabilistic database.

3.3 Query Evaluation

Having defined our representation scheme, we now move our discussion to query evaluation. The main advantage of associating possible world semantics with a probabilistic database is that it lends precise semantics to the query evaluation problem. Given a user-submitted query q (expressed in some standard query language such as relational algebra) and a database D, the result of evaluating q against D is defined to be the set of results obtained by evaluating q against each possible world of D, augmented with the probabilities of the possible worlds. Relating back to our earlier examples, suppose we want to run the query q = Π_C(S ⋈_B T). Figure 3.5(a) shows the set of results obtained from each possible world, augmented with the corresponding probabilities, for each of the example databases we ran the query against.

Now, even though query evaluation under possible world semantics is clear and intuitive, it still has some issues that prevent us from executing it directly. First and foremost among these issues is the size of the result. Since the number of possible worlds is exponential in the number of random variables in the database (the product of the domain sizes of all random variables, to be more precise), in the case that every possible world returns a different result, returning the result to the user or storing it is only going to be feasible for the smallest of databases. To get around this issue, it is traditional to compress the result before returning it to the user. One way of doing this is to collect all tuples from the set of results returned by possible world semantics and return these along with the sum of the probabilities of the possible worlds that return the tuple as a result [Dalvi and Suciu, 2004]. In Figure 3.5(a), there is only one tuple that is returned as a result, and this tuple is returned by possible worlds D3, D5 and D7. In Figure 3.5(b), we show the resulting probabilities obtained by summing across these three possible worlds for each example database.

The second issue is, of course, related to the complexity of computing the results of a query from first principles. Since the number of possible worlds is going to be large for any non-trivial database, evaluating results directly by enumerating all of its possible worlds is going to be infeasible. To get around this issue, we first make the connection between computing query results for a probabilistic database and the marginal probability computation problem for probabilistic graphical models.

Definition 4. Given a PGM P = 〈F, X〉 and a random variable X ∈ X, the marginal probability associated with the assignment X = x, where x ∈ dom(X), is defined as µ_X(x) = ∑_{x∼x} Pr(x), where Pr(x) denotes the distribution defined by the PGM and x ∼ x denotes a joint assignment to X in which X is assigned x.



    possible  query    prob.   prob.      prob.    prob.
    world     result   (ind.)  (implies)  (diff.)  (pos.corr.)
    D1        ∅        0.18    0.50       0.30     0.06
    D2        ∅        0.18    0.02       0.06     0.30
    D3        {C: c}   0.12    0          0.20     0.04
    D4        ∅        0.12    0.08       0.04     0.20
    D5        {C: c}   0.12    0          0        0.24
    D6        ∅        0.12    0.08       0.24     0
    D7        {C: c}   0.08    0          0        0.16
    D8        ∅        0.08    0.32       0.16     0

(a)

    query               Pr(D3) + Pr(D5) + Pr(D7)
    result    ind.    implies   diff.   pos.corr.
    {C: c}    0.32    0         0.20    0.44

(b)

Figure 3.5: Results of running the query Π_C(S ⋈_B T) on the different example databases.

Since each possible world is obtained by assigning all random variables in the database a joint assignment, at least intuitively, it does seem like we are computing marginal probabilities when we sum over all possible worlds to evaluate a query. However, we have yet to express the result tuples using random variables (the random variables in the database are the ones associated with the base tuples). Therefore, to cast the query evaluation problem into a marginal probability computation problem, we have to first show how to augment the PGM underlying the database such that the augmented PGM contains random variables representing result tuples. We can then express the probability computation associated with evaluating the query as a standard marginal probability computation problem, and thus allow ourselves to use any of the host of probabilistic inference algorithms designed to perform marginal probability computations to solve the query evaluation problem. We next present an example to illustrate the basic ideas underlying our approach to augmenting the PGM underlying the database given a query; after that, we discuss how to augment the PGM in the general case given any relational algebra query.

3.3.1 Example

Consider running the query Π_C(S ⋈_B T) on the database presented in Figure 3.2(a). Our query evaluation approach is very similar to query evaluation in traditional database systems and is depicted in Figure 3.6. Just as in traditional database query processing, in Figure 3.6 we introduce intermediate tuples produced by the join (i1 and i2) and a result tuple (r1) produced by the projection operation. What makes query processing for probabilistic databases different from traditional database query processing is the fact that we need to preserve the correlations among the random variables representing the intermediate and result tuples and the random variables representing the tuples they were produced from. In our example, there are three such correlations that we need to maintain:

• i1 (produced by the join between s1 and t1) exists, or i1.e is true, only in those possible worlds where both s1.B and t1.B are assigned the value 2.

• Similarly, i2.e is true only in those possible worlds where both s2.B and t1.B are assigned the value 2.

• Finally, r1 (the result tuple produced by the projection) exists, or r1.e is true, only in those possible worlds that produce at least one of i1 or i2, or both.

To enforce these correlations, during query evaluation we introduce intermediate factors defined over the appropriate random variables. For our example, we introduce the following three factors:


    S   A   B                         T   B                 C
    s1  a1  {1:0.6, 2:0.4}           t1  {2:0.5, 3:0.5}    c
    s2  a2  {1:0.6, 2:0.4}

      --S ⋈_B T-->          A   B   C        introduced factors: fi1.e, fi2.e
                        i1  a1  2   c
                        i2  a2  2   c

      --Π_C(S ⋈_B T)-->     C                introduced factor: fr1.e
                        r1  c

Figure 3.6: Evaluating Π_C(S ⋈_B T) on the database from Figure 3.2(a).

• For the correlation among i1.e, s1.B and t1.B, we introduce the factor fi1.e, which is defined as:

    fi1.e(i1.e, s1.B, t1.B) = { 1 if i1.e ⇔ ((s1.B = 2) ∧ (t1.B = 2))
                              { 0 otherwise

• Similarly, for the correlation among i2.e, s2.B and t1.B, we introduce the factor fi2.e, which is defined as:

    fi2.e(i2.e, s2.B, t1.B) = { 1 if i2.e ⇔ ((s2.B = 2) ∧ (t1.B = 2))
                              { 0 otherwise

• For the correlation among r1.e, i1.e and i2.e, we introduce a factor fr1.e capturing the or semantics. In other words, we would like to enforce that r1.e is true when at least one of i1.e or i2.e holds true:

    fr1.e(r1.e, i1.e, i2.e) = { 1 if r1.e ⇔ (i1.e ∨ i2.e)
                              { 0 otherwise

Figure 3.6 depicts the full run of the query along with the introduced factors.

Now, to compute the probability of existence of r1 (which is what we did in Figure 3.5 by enumerating over all possible worlds), we simply need to compute the marginal probability associated with the assignment r1.e = true from the PGM formed by the set of factors on the base data and the factors introduced during query evaluation. For instance, for the example where we assumed complete independence among all uncertain attribute values (Figure 3.2(b)), our augmented PGM is given by the collection fs1.B, fs2.B, ft1.B, fi1.e, fi2.e and fr1.e, and to compute the marginal probability we can simply use any of the exact inference algorithms available in the machine learning literature, such as variable elimination [Dechter, 1996; Zhang and Poole, 1994] or the junction tree algorithm [Huang and Darwiche, 1994].
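As a sanity check, the marginal probability µ_r1.e(true) for the independent database can be computed by brute-force enumeration of the augmented PGM. The sketch below (Python; variable names are ours) hard-codes the base factors of Figure 3.2(b); since the introduced factors fi1.e, fi2.e and fr1.e are deterministic, they simply fix the values of i1.e, i2.e and r1.e in each world:

    from itertools import product

    f_s1B = {1: 0.6, 2: 0.4}
    f_s2B = {1: 0.6, 2: 0.4}
    f_t1B = {2: 0.5, 3: 0.5}

    marginal_true = 0.0
    for s1B, s2B, t1B in product(f_s1B, f_s2B, f_t1B):
        p = f_s1B[s1B] * f_s2B[s2B] * f_t1B[t1B]
        i1e = (s1B == 2) and (t1B == 2)   # enforced by fi1.e
        i2e = (s2B == 2) and (t1B == 2)   # enforced by fi2.e
        r1e = i1e or i2e                  # enforced by fr1.e
        if r1e:
            marginal_true += p

    assert abs(marginal_true - 0.32) < 1e-9   # column "ind." of Figure 3.5(b)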


3.3.2 General Relational Algebra Queries

Query evaluation for general relational algebra also follows the same basic ideas. In what follows, we modify the traditional relational algebra operators so that they not only generate intermediate tuples but also introduce intermediate factors which, combined with the factors on the base data, provide a PGM that can then be used to compute marginal probabilities of the random variables associated with result tuples of interest. We next describe the modified σ, ×, Π, δ, ∪, − and γ (aggregation) operators, where we use ∅ to denote a special "null" symbol.

Select: Let σc(R) denote the query we are interested in, where c denotes the predicate of the select operation. Every tuple t ∈ R can be jointly instantiated with values from ×_{a∈attr(R)} dom(t.a). If none of these instantiations satisfies c, then t does not give rise to any result tuple. If even a single instantiation satisfies c, then we generate an intermediate tuple r that maps attributes from R to random variables, besides being associated with a tuple existence random variable r.e. We then introduce factors encoding the correlations among the random variables for r and the random variables for t. The first factor we introduce is f^σ_r.e, which encodes the correlation for r.e:

    f^σ_r.e(r.e, t.e, {t.a}_a∈attr(R)) = { 1 if (t.e ∧ c({t.a}_a∈attr(R))) ⇔ r.e
                                         { 0 otherwise

where c({t.a}_a∈attr(R)) is true if the joint assignment to the attribute value random variables of t satisfies the predicate c, and false otherwise.

We also introduce a factor f^σ_r.a for r.a, ∀a ∈ attr(R) (where dom(r.a) = dom(t.a)). f^σ_r.a takes t.a, r.e and r.a as arguments and can be defined as:

    f^σ_r.a(r.a, r.e, t.a) = { 1 if r.e ∧ (t.a = r.a)
                             { 1 if ¬r.e ∧ (r.a = ∅)
                             { 0 otherwise

Cartesian Product: Suppose R1 and R2 are the two relations involved in the cartesian product operation. Let r denote the join result of two tuples t1 ∈ R1 and t2 ∈ R2. Thus r maps every attribute from attr(R1) ∪ attr(R2) to a random variable, besides being associated with a tuple existence random variable r.e. The factor for r.e, denoted by f^×_r.e, takes t1.e, t2.e and r.e as arguments, and is defined as:

    f^×_r.e(r.e, t1.e, t2.e) = { 1 if (t1.e ∧ t2.e) ⇔ r.e
                               { 0 otherwise


We also introduce a factor f^×_r.a for each a ∈ attr(R1) ∪ attr(R2), and this is defined exactly in the same fashion as f^σ_r.a. Basically, for a ∈ attr(R1) (resp. a ∈ attr(R2)), it returns 1 if r.e ∧ (t1.a = r.a) (resp. r.e ∧ (t2.a = r.a)) holds or if ¬r.e ∧ (r.a = ∅) holds, and 0 otherwise.

Project (without duplicate elimination): Let Π_a(R) denote the operation we are interested in, where a ⊆ attr(R) denotes the set of attributes we want to project onto. Let r denote the result of projecting t ∈ R. Thus r maps each attribute a ∈ a to a random variable, besides being associated with r.e. The factor for r.e, denoted by f^Π_r.e, takes t.e and r.e as arguments, and is defined as follows:

    f^Π_r.e(r.e, t.e) = { 1 if t.e ⇔ r.e
                        { 0 otherwise

Each factor f^Π_r.a, introduced for r.a, ∀a ∈ a, is defined exactly as f^σ_r.a; in other words, f^Π_r.a(r.a, r.e, t.a) = f^σ_r.a(r.a, r.e, t.a).

Duplicate Elimination: Duplicate elimination is a slightly more complex operation because it can give rise to multiple intermediate tuples even if there was only one input tuple to begin with. Let R denote the relation from which we want to eliminate duplicates; the resulting relation after duplicate elimination will contain tuples whose existence is uncertain but whose attribute values are known. Any element from ⋃_{t∈R} ×_{a∈attr(R)} dom(t.a) may correspond to the values of a possible result tuple. Let r denote any such result tuple; its attribute values are known, and only r.e is not known with certainty. Denote by r_a the value of attribute a in r. We only need to introduce the factor f^δ_r.e for r.e. To do this, we compute the set of tuples from R that may give rise to r: any tuple t that satisfies ⋀_{a∈attr(R)} (r_a ∈ dom(t.a)) may give rise to r. Let y^r_t be an intermediate random variable with dom(y^r_t) = {true, false} such that y^r_t is true iff t gives rise to r and false otherwise. This is easily done by introducing a factor f^δ_{y^r_t} that takes {t.a}_a∈attr(R), t.e and y^r_t as arguments and is defined as:

    f^δ_{y^r_t}(y^r_t, {t.a}_a∈attr(R), t.e) = { 1 if (t.e ∧ ⋀_a (t.a = r_a)) ⇔ y^r_t
                                               { 0 otherwise

where {t.a}_a∈attr(R) denotes all attribute value random variables of t. We can then define f^δ_r.e in terms of the y^r_t. f^δ_r.e takes as arguments {y^r_t}_t∈T_r, where T_r denotes the set of tuples that may give rise to r (i.e., those that contain the assignment {r_a}_a∈attr(R) in their joint domain), and r.e, and is defined as:

    f^δ_r.e(r.e, {y^r_t}_t∈T_r) = { 1 if (⋁_{t∈T_r} y^r_t) ⇔ r.e
                                  { 0 otherwise

Union and set difference: These operators require set semantics. Let R1 and R2 denote the relations on which we want to apply one of these two operators, either R1 ∪ R2 or R1 − R2. We will assume that both R1 and R2 are sets of tuples such that every tuple contained in them has its attribute values fixed, and the only uncertainty associated with these tuples is their existence (if not, we can apply a δ operation to convert them to this form). Now, consider a result tuple r and the sets of tuples T^1_r, containing all tuples from R1 that match r's attribute values, and T^2_r, containing all tuples from R2 that match r's attribute values. The required factors for r.e can now be defined as follows:

    f^∪_r.e(r.e, {t1.e}_t1∈T^1_r, {t2.e}_t2∈T^2_r) = { 1 if (⋁_{t∈T^1_r ∪ T^2_r} t.e) ⇔ r.e
                                                     { 0 otherwise

    f^−_r.e(r.e, {t1.e}_t1∈T^1_r, {t2.e}_t2∈T^2_r) = { 1 if ((⋁_{t∈T^1_r} t.e) ∧ ¬(⋁_{t∈T^2_r} t.e)) ⇔ r.e
                                                     { 0 otherwise

Aggregation operators: Aggregation operators are also easily handled using factors. Suppose we want to compute the sum aggregate on attribute a of relation R; then we simply define a random variable r.a for the result and introduce a factor that takes as arguments {t.a}_t∈R and r.a, defined so that it returns 1 if r.a = ∑_{t∈R} t.a and 0 otherwise. Thus, for any aggregate operator γ and result tuple random variable r.a, we can define the following factor:

    f^γ_r.a(r.a, {t.a}_t∈R) = { 1 if r.a = γ_{t∈R} t.a
                              { 1 if (r.a = ∅) ⇔ ⋀_{t∈R} (t.a = ∅)
                              { 0 otherwise

Optimizations: In the above operator modifications we have attempted to be completely general, and as such the factors introduced may look slightly more complicated than need be. For example, it is not necessary that f^σ_r.e take as arguments all random variables {t.a}_a∈attr(R) (as defined above); it only needs to take as arguments those t.a random variables that are involved in the predicate c of the σ operation. Also, given a theta-join, we do not need to implement this as a cartesian product followed by a select operation. It is straightforward to push the select operation into the cartesian product factors and implement the theta-join directly by modifying f^×_r.e appropriately using c.

Another type of optimization that is extremely useful for aggregate computation, duplicate elimination and the set-theoretic operations (∪ and −) is to exploit decomposable functions. A decomposable function is one whose result does not depend on the order in which the inputs are presented to it. For instance, ∨ is a decomposable function, and so are most of the aggregation operators, including sum, count, max and min. The problem with some of the redefined relational algebra operators is that, if implemented naively, they may lead to large intermediate factors. For instance, while running a δ operation, if T_r contains n tuples for some r, then the factor f^δ_r.e will be of size 2^{n+1}, which is inefficient. By exploiting the decomposability of ∨, we can implement the same factor using a linear number of constant-sized (3-argument) factors, which may lead to significant speedups; a sketch of this decomposition follows below. We refer the interested reader to Rish [1999]; Zhang and Poole [1996] for more details. The only aggregation operator mentioned above that is not decomposable is avg, but even in this case we can exploit the same ideas by implementing avg in terms of sum and count, both of which are decomposable.
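As an illustration (a sketch under our own naming, not the system's actual implementation), the following Python code replaces the monolithic OR factor by a chain of 3-argument factors over intermediate variables z_i ⇔ (z_{i-1} ∨ y_i), and verifies that summing out the z's recovers the OR semantics:

    from itertools import product

    def or_chain_factors(n):
        """3-argument factors whose product (with z[0] = y[0]) enforces
        z[n-1] <=> (y[0] or ... or y[n-1])."""
        factors = []
        for i in range(1, n):
            # factor over (z[i], z[i-1], y[i]): 1 iff z[i] <=> (z[i-1] or y[i])
            f = {(zi, zp, yi): 1.0 if zi == (zp or yi) else 0.0
                 for zi in (False, True) for zp in (False, True)
                 for yi in (False, True)}
            factors.append(f)
        return factors

    n = 4
    fs = or_chain_factors(n)        # n - 1 factors of 2^3 = 8 entries each,
    for ys in product((False, True), repeat=n):   # versus one of size 2^(n+1)
        weight = {False: 0.0, True: 0.0}
        for z in product((False, True), repeat=n):
            if z[0] != ys[0]:
                continue            # base case: z[0] is y[0]
            w = 1.0
            for i, f in enumerate(fs, start=1):
                w *= f[(z[i], z[i - 1], ys[i])]
            weight[z[-1]] += w
        assert weight[any(ys)] == 1.0 and weight[not any(ys)] == 0.0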

3.3.3 Complexity of probabilistic inference

The above operators help generate the augmented PGM for any relational algebra query to be executed on a database. After generating the augmented PGM, the last step of query evaluation requires that we run probabilistic inference. Exact probabilistic inference is known to be NP-hard in general [Cooper, 1990]. More specifically, the complexity of exact probabilistic inference is exponential in a quantity known as the treewidth [Arnborg, 1985], which depends on the structure of the graph depicting the PGM (where vertices denote random variables and edges denote correlations; see Figure 3.1(b) for an example). However, many applications provide PGMs with sparse graph structures that allow efficient probabilistic computation [Zhang and Poole, 1994]. Variable elimination, also known as bucket elimination [Dechter, 1996; Zhang and Poole, 1994], and the junction tree algorithm [Huang and Darwiche, 1994] are two exact inference algorithms (among others) that have the ability to exploit such structure. In particular, the inference problem is easy if the PGM is or closely resembles a tree, and the problem becomes progressively harder as the PGM deviates more from being a tree.


3.4 Experiments and Discussion

We performed three sets of experiments. In the first set of experiments, we illustrate the importance of modeling correlations. Here, we chose the problem of querying publication datasets as a case study and show that if we do not model natural mutual exclusivity correlations then results can be counter-intuitive. In the second set of experiments, we experiment on the TPC-H benchmark [TPC-H Benchmark] with slight modifications (to add probabilities) to demonstrate the scalability of query evaluation with probabilistic inference. In the third set of experiments, we demonstrate the range of queries that can be evaluated with probabilistic inference by evaluating aggregation queries.

3.4.1 Case Study: Querying Publication Datasets

Most applications produce data that requires modeling uncertainty, and if one wants to be faithful to the underlying distribution, complex correlations need to be represented. However, the complexity of managing uncertain databases increases with increasingly correlated models. In this section, we present experiments that indicate the need for modeling correlations: unless we do so, the quality of results obtained from query evaluation can be exceedingly poor.

Consider a publications database containing two relations: (1) PUBS(PID, Title), and (2) AUTHS(PID, Name), where PID is the unique publication id, and consider the task of retrieving all publications with title y written by an author with name x. Assuming that the user is not sure of the spellings of x and y, we might use the following query to perform the above task:

    Π_Title(σ_Name≈x(AUTHS) ⋈ σ_Title≈y(PUBS))

One way to handle the uncertain predicates used above is to interpret them in terms of probabilities. Given a predicate of the form R.a ≈ k, where a is a string attribute and k is a string constant, the system assigns a probability to each tuple t based on how similar t.a is to k. Following Dalvi and Suciu [2004], we compute the 3-gram distance [Ukkonen, 1992] between t.a and k, and convert it to a posterior probability by assuming that the distance is normally distributed with mean 0 and variance σ (σ is a parameter fed to the system). For the above query, the similarity predicates will cause both the relations PUBS and AUTHS to be converted into probabilistic relations, AUTHSp and PUBSp. However, note that AUTHSp contains natural mutual exclusion dependencies with respect to this query. Since the user is looking for publications by a single author with name x, it is not possible for x to match two AUTHSp tuples corresponding to the same publication in the same possible world. Thus, any two AUTHSp tuples with the same PID exhibit a mutual exclusion dependency, and a possible world containing both of them should be assigned zero probability.
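The exact normalization used to turn a 3-gram distance into a probability is not spelled out above; one plausible instantiation of "distance normally distributed with mean 0" is a Gaussian kernel, sketched below. The distance variant and the helper names are assumptions made here for illustration only:

    import math

    def trigrams(s):
        s = "##" + s.lower() + "##"   # pad so short strings still yield 3-grams
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def trigram_distance(a, b):
        # A simple set-based 3-gram distance; Ukkonen [1992] uses multisets.
        return len(trigrams(a) ^ trigrams(b))

    def match_probability(a, b, sigma):
        d = trigram_distance(a, b)
        return math.exp(-d * d / (2.0 * sigma * sigma))

    # A small sigma ("peaked" Gaussian) punishes mismatches heavily; a large
    # sigma assigns non-negligible probability even to poor matches.
    print(match_probability("T. Michel", "T. Mitchell", sigma=10))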

To illustrate the drawbacks of ignoring such mutual exclusion dependencies, we ran the above query with x = "T. Michel" and y = "Reinforment Leaning hiden stat" on two probabilistic databases, one assuming complete independence among tuples (IND DB) and another that models the dependencies (MUTEX DB). We ran the query on an extraction of 860 publications from the real-world CiteSeer dataset [Giles et al., 1998b]. We report results across various settings of σ. Figure 3.7 shows the top three results obtained from the two databases at three different settings of σ (we also list the author names to aid the reader's understanding). MUTEX DB returns intuitive and similar results at all three values of σ. IND DB returns reasonable results only at σ = 10, whereas at σ = 50, 100 it returns very odd results ("Decision making and problem solving" does not match the string "Reinforment Leaning hiden stat" very closely and yet it is assigned the highest rank at σ = 100). Figure 3.8 (i) shows the cumulative recall graph for IND DB for various values of σ, where we plot the fraction of the top N results returned by MUTEX DB that were present in the top N results returned by IND DB. As we can see, at σ = 50 and 100, IND DB exhibits poor recall.

Figure 3.7 shows that IND DB favors publications with long author lists. This does not affect the results at low values of σ (= 10) because, in that case, we use a "peaked" gaussian which assigns negligible probabilities to possible worlds with multiple AUTHSp tuples from the same publication. At larger settings of σ, however, these possible worlds are assigned larger probabilities and IND DB returns poor results. MUTEX DB assigns these possible worlds zero probability by modeling dependencies on the base tuples. We would like to note that, although setting the value of σ carefully may have resulted in a good answer for IND DB in this case, choosing σ is not easy in general and depends on various factors such as user preferences, the distributions of the attributes in the database, etc. Modeling mutual exclusion dependencies explicitly using our approach naturally alleviates this problem.


(i) MUTEX DB results at σ = 10, 50, 100:
    Reinforcement learning with hidden states (by L. Lin, T. Mitchell)
    Feudal Reinforcement Learning (by C. Atkeson, P. Dayan, . . .)
    Reasoning (by C. Bereiter, M. Scardamalia)

(ii) IND DB results at σ = 10:
    Reinforcement learning with hidden states (by L. Lin, T. Mitchell)
    Feudal Reinforcement Learning (by C. Atkeson, P. Dayan, . . .)
    Reasoning (by C. Bereiter, M. Scardamalia)

(iii) IND DB results at σ = 50:
    Feudal Reinforcement Learning (by C. Atkeson, P. Dayan, . . .)
    Decision making and problem solving (by G. Dantzig, R. Hogarth, . . .)
    Multimodal Learning Interfaces (by U. Bub, R. Houghton, . . .)

(iv) IND DB results at σ = 100:
    Decision making and problem solving (by G. Dantzig, R. Hogarth, . . .)
    HERMES: A heterogeneous reasoning and mediator system (by S. Adali, A. Brink, . . .)
    Induction and reasoning from cases (by K. Althoff, E. Auriol, . . .)

Figure 3.7: Top three results for a similarity query: (i) shows results from MUTEX DB; (ii), (iii) and (iv) show results from IND DB.

3.4.2 Experiments with TPC-H Benchmark

We also show scalability results for our proposed query execution strategies using a randomly generated TPC-H dataset of size 10MB. For simplicity, we assume complete independence among the base tuples (though the intermediate tuples may still be correlated). Figure 3.8 (iii) shows the execution times on TPC-H queries Q2 to Q8 (modified to remove the top-level aggregations). The first bar on each query indicates the time it took for our implementation to run the full query, including all the database operations and the probability computations. The second bar on each query indicates the time it took to run only the database operations using our JAVA-based implementation. Here is a summary of the results:

• As we can see in Figure 3.8 (iii), for most queries the additional cost of probability computations is comparable to the cost of normal query processing.

• The two exceptions are Q3 and Q4, which severely tested our probabilistic inference engine. With the aggregate operations removed, Q3 resulted in a relation of size in excess of 60,000 result tuples. Although Q4 resulted in a very small relation, each result tuple was associated with a probabilistic graphical model containing in excess of 15,000 random variables. Each of these graphical models is fairly sparse, but bookkeeping for such large data structures took a significant amount of time.

• Q7 and Q8 are supposed to be intractable queries (i.e., they are not hierarchical queries [Dalvi and Suciu, 2004]), yet their run-times are surprisingly fast. By taking a closer look, we noticed that both these queries gave rise to tree-structured graphical models for which the treewidth is low, justifying our belief that there may be databases where the data allows query evaluation to be tractable even if query compilation techniques [Dalvi and Suciu, 2004; Olteanu and Huang, 2009, 2008] suggest otherwise.

3.4.3 Aggregation Queries

Our approach also naturally supports efficient computation of a variety of aggregate operators over probabilistic relations using the decomposition techniques described in Section 3.3. Figure 3.8 (ii) shows the result of running an average query over a synthetically generated dataset containing 500 tuples. As we can see, the final result can be a fairly complex probability distribution, which is quite common for aggregate operations.


[Plots omitted: (i) cumulative recall vs. N for σ = 10, 50, 100; (ii) probability distribution of the average; (iii) run-times (sec) on queries Q2–Q8 for "Full Query" and "Bare Query".]

Figure 3.8: (i) Cumulative recall graph comparing results of IND DB and MUTEX DB for σ = 10, 50, 100. (ii) AVG aggregate computed over 500 randomly generated tuples with attribute values ranging from 1 to 5. (iii) Run-times on TPC-H data.

3.5 Conclusion

In this chapter, we described an approach to representing uncertain data with arbitrary correlations in a probabilistic database using probabilistic graphical models. Probabilistic graphical models allow us to exploit conditional independence present in the data to provide a compact scheme that can represent both attribute and tuple level uncertainty in the same database. We showed how our representation scheme naturally lends itself to possible world semantics, thus associating precise semantics with the query evaluation problem.


We further showed that it is possible to recast the query evaluation problem into a marginal probability computation problem on an appropriately constructed probabilistic graphical model that can be generated on the fly. Our approach allows us to use a host of probabilistic inference algorithms (exact and approximate) developed in the machine learning community to evaluate queries; however, there are certain aspects of query evaluation in probabilistic databases that make it unique and different from the inference problems traditionally considered in machine learning research. In the next chapter we show how these aspects can be exploited to speed up query evaluation for probabilistic databases.


Chapter 4

Bisimulation-based Lifted Inference for Probabilistic Databases

In the last chapter, we described our representation scheme for uncertain data and showed that query evaluation reduces to probabilistic inference in such databases. In reality, most probabilistic database formulations (whether the one described in the previous chapter or other formulations based on tuple-level uncertainty such as Benjelloun et al. [2006] or Re and Suciu [2007]) require general probabilistic inference at some level of abstraction. Thus it is imperative that we design efficient inference approaches to make probabilistic databases a feasible and viable option. Given that we already know general inference is a #P-complete problem [Dalvi and Suciu, 2004], the only way we can achieve this is to utilize the special properties of the data at hand. In this chapter, we motivate the presence of one such property, which we refer to as shared correlations, and show how to exploit it to speed up inference during query evaluation for probabilistic databases.

Consider the example database containing pre-owned car sales ads from the last chapter, which we used to contrast tuple-level uncertainty databases with databases that can express uncertainty at both tuple and attribute levels (shown again in Figure 4.1 for convenience). Recall that the first tuple shows an ad with the color of the car missing, the third tuple shows one with the make missing, and the second tuple represents an ad with both attributes missing. Figure 4.1 also shows the probability distributions associated with these missing values.


    AdID  Make    Color   Price
    1     Honda   ?       $9,000
    2     ?       ?       $6,000
    3     ?       Beige   $8,000
    ...   ...     ...     ...

    Color  fcolor          Make    fmake
    Black  0.75            Honda   0.55
    Beige  0.25            Toyota  0.45

Figure 4.1: Pre-owned car ads with missing values.

More specifically, fmake defines the distribution over missing make values in the database (assuming our universe can contain only two makes, Honda and Toyota) and fcolor defines the distribution over missing color values (assuming our universe contains only black and beige cars). Note that the distributions make no reference to any tuple-specific information. In other words, no matter how many tuples with missing color are present in the relation, their uncertainty will still be defined by the same distribution represented by fcolor; along with fmake, these distributions are examples of shared correlations (more precisely defined in Section 4.2).

In many cases, the uncertainty in the data is defined using general statistics that do not vary on a per-tuple basis, and this, in turn, leads to shared correlations. Various earlier works have also described applications with shared correlations. For instance, Andritsos et al. [2006] describe a customer relationship management application where the objective is to merge data from two or more source databases and each source database is assigned a probability value based on the quality of the information it contains. Even here, probabilities do not change from tuple to tuple, since tuples from the same source are assigned the same source probability. Another source of shared correlations in probabilistic databases is the query evaluation approach itself. Recall from the previous chapter that while evaluating queries we first build an augmented PGM by introducing small factors that depict probability distributions and correlations on the fly. For instance, if tuples t1 and t2 join to produce join tuple r, then one needs to introduce a factor that encodes the correlation that r exists iff both t1 and t2 exist (f^×_r.e(r.e, t1.e, t2.e), defined in the last chapter). More importantly, such a factor is introduced whenever any pair of tuples join, leading to repeated copies of the same factor and thus introducing additional shared correlations. Our aim in this chapter is to exploit such shared correlations to make exact probability computation for query evaluation in probabilistic databases more efficient.


    S   A   B                         T   B                 C
    s1  a1  {1:0.6, 2:0.4}           t1  {2:0.5, 3:0.5}    c
    s2  a2  {1:0.6, 2:0.4}

      --S ⋈_B T-->          A   B   C        introduced factors: fi1.e, fi2.e
                        i1  a1  2   c
                        i2  a2  2   c

Figure 4.2: Running example for this chapter.

Our motivation for shared correlations and more efficient inference in this context closely ties in with recent work done in the machine learning community. In the past decade or so, machine learning researchers have devised approaches that exploit shared correlations to come up with more compact ways of describing PGMs. These models are sometimes referred to as first-order graphical models. Lifted inference is the sub-field that aims to devise more efficient inference techniques for first-order models that exploit such shared correlations. In fact, the inference approach we devise in this chapter is a novel lifted inference algorithm that automatically determines symmetries in the uncertainty model denoted by shared factors. We surveyed these related areas of research along with various works on lifted inference in Chapter 2. The rest of this chapter is organized as follows: in the next section, we introduce a motivating example that shows how standard inference algorithms fail to exploit shared correlations; in Section 4.2 and Section 4.3 we formally define shared correlations and describe our approach to inference with shared correlations; in Section 4.4 we describe our experimental results comparing our approach to standard inference algorithms; and finally, we conclude with Section 4.6 after a discussion in Section 4.5.

4.1 Motivating Example

For the purposes of this chapter, we will use a slightly simplified version of the example from the last chapter (Figure 4.2). In this modified version, we have the same two relations S and T, but we run a simple join query, S ⋈_B T, in this case. The inference task remains the same, i.e., we need to compute the marginal probabilities of the two result tuples produced.


µ_i1.e(i1.e) = ∑_{s1.B, t1.B} f_t1.B(t1.B) f_s1.B(s1.B) f_i1.e(i1.e, s1.B, t1.B)
             = ∑_{t1.B} f_t1.B(t1.B) · m_s1.B(i1.e, t1.B),
               where m_s1.B(i1.e, t1.B) = ∑_{s1.B} f_s1.B(s1.B) f_i1.e(i1.e, s1.B, t1.B)

µ_i2.e(i2.e) = ∑_{s2.B, t1.B} f_t1.B(t1.B) f_s2.B(s2.B) f_i2.e(i2.e, s2.B, t1.B)
             = ∑_{t1.B} f_t1.B(t1.B) · m_s2.B(i2.e, t1.B),
               where m_s2.B(i2.e, t1.B) = ∑_{s2.B} f_s2.B(s2.B) f_i2.e(i2.e, s2.B, t1.B)

Figure 4.3: How variable elimination proceeds to solve the query evaluated in Figure 4.2.

In other words, we need to compute the marginal probabilities corresponding to the assignments i1.e = true and i2.e = true from the augmented PGM comprising the factors fs1.B, fs2.B, ft1.B, fi1.e and fi2.e (see the previous chapter for full definitions of the factors). Now let us try to see how standard inference algorithms such as variable elimination (VE) [Dechter, 1996; Zhang and Poole, 1994] and the junction tree algorithm [Huang and Darwiche, 1994] would proceed to solve this problem. Here we take the example of VE. Recall that marginal probability computation basically means that we simply sum over all the other random variables from the PGM except for the random variable whose marginal probability we need to compute (Definition 4). VE runs by first choosing an elimination order which specifies the order in which to sum over (eliminate) the random variables. It then repeatedly picks the next random variable from the order, pushes the corresponding summation as far into the product of factors as possible, sums it out, and proceeds in this fashion. In Figure 4.3 we show the first few steps of how VE would proceed when used to compute the probability of i1.e and i2.e using the elimination order O = {s1.B, s2.B, t1.B} (variables are eliminated left to right).
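The first elimination step of Figure 4.3 can be carried out directly. The sketch below (Python; our own encoding of the factors from the running example) sums out s1.B and reproduces the table for m_s1.B(i1.e, t1.B) shown in the next subsection:

    from itertools import product

    f_s1B = {1: 0.6, 2: 0.4}

    def f_i1e(i1e, s1B, t1B):
        # 1 iff i1.e <=> (s1.B = 2 and t1.B = 2)
        return 1.0 if i1e == ((s1B == 2) and (t1B == 2)) else 0.0

    # m_s1.B(i1.e, t1.B) = sum over s1.B of f_s1.B(s1.B) * f_i1.e(i1.e, s1.B, t1.B)
    m_s1B = {(i1e, t1B): sum(f_s1B[s1B] * f_i1e(i1e, s1B, t1B) for s1B in f_s1B)
             for i1e, t1B in product((True, False), (2, 3))}

    assert abs(m_s1B[(True, 2)] - 0.4) < 1e-9
    assert abs(m_s1B[(True, 3)] - 0.0) < 1e-9
    assert abs(m_s1B[(False, 2)] - 0.6) < 1e-9
    assert abs(m_s1B[(False, 3)] - 1.0) < 1e-9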


4.1.1 Limitations of Naive Inference Algorithms

The main issue with VE (or any other standard exact probabilistic inference algorithm, for that matter) is that it does not exploit shared correlations. For instance, in Figure 4.3, in the process of computing the probabilities for i1.e and i2.e, we produce intermediate factors m_s1.B(i1.e, t1.B) and m_s2.B(i2.e, t1.B). If we take a closer look at both of these factors, we will notice that they both map exactly the same inputs to the same outputs:

    i1.e   t1.B   ms1.B          i2.e   t1.B   ms2.B
    True   2      0.4            True   2      0.4
    True   3      0              True   3      0
    False  2      0.6            False  2      0.6
    False  3      1              False  3      1

This indicates that we went through the exact same multiplication and summation steps to compute both m_s1.B(i1.e, t1.B) and m_s2.B(i2.e, t1.B). In fact, these are shared factors and represent shared correlations (which will be defined more precisely in the next section), and this repeated computation is what we would like to avoid. In hindsight, it is not really surprising that m_s1.B(i1.e, t1.B) and m_s2.B(i2.e, t1.B) turned out to be virtual copies of each other. If we look closely, m_s1.B was computed by multiplying f_s1.B(s1.B) with f_i1.e(i1.e, s1.B, t1.B) followed by a summation operation, whereas m_s2.B was computed by multiplying f_s2.B(s2.B) with f_i2.e(i2.e, s2.B, t1.B) followed by a summation operation, and f_s1.B(s1.B) and f_s2.B(s2.B), and f_i1.e(i1.e, s1.B, t1.B) and f_i2.e(i2.e, s2.B, t1.B), were pairs of shared factors. Often during the run of inference such intermediate shared factors multiply with each other and give rise to more intermediate shared factors, making it imperative that we recognize and take advantage of such symmetry before we actually compute these shared factors. Developing an approach that achieves this is the topic of the next section.

4.2 Inference with Shared Factors

We begin by formally defining shared factors, and for this we need to take a closer look at the definition of a factor (Definition 1). A factor consists of two distinct parts: the first part is the list of random variables it takes as arguments, and the second part is the function that maps input assignments to outputs. Thus, it is possible for two factors f1 and f2 to have different argument lists but use the same function to map inputs to outputs.


    ms1.B                              ms2.B
    args.:  i1.e, t1.B                 args.:  i2.e, t1.B
    func.:  True, 2  → 0.4             func.:  True, 2  → 0.4
            True, 3  → 0                       True, 3  → 0
            False, 2 → 0.6                     False, 2 → 0.6
            False, 3 → 1                       False, 3 → 1

Figure 4.4: Pair of shared factors.

Definition 5. Let f1 and f2 denote two factors, f1.func and f2.func denote their function components, and dom1 and dom2 denote the domains of f1.func and f2.func, respectively. f1 and f2 are shared factors, denoted f1 ≅ f2, if dom1 = dom2 = dom and f1.func(d) = f2.func(d), ∀d ∈ dom.

Figure 4.4 shows two factors from the previous section where we have clearly separated their argument lists and function components.
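Definition 5 suggests a direct way to detect shared factors: compare function components while ignoring argument lists. A minimal sketch (Python, with a hypothetical Factor class; the actual system's data structures may differ) that assigns equal ids exactly to shared factors:

    class Factor:
        def __init__(self, args, func):
            self.args = args   # ordered list of random-variable names
            self.func = func   # dict: assignment tuple -> non-negative float

    def factor_id(f, _cache={}):
        """factor_id(f1) == factor_id(f2) iff f1 and f2 are shared (Definition 5)."""
        key = tuple(sorted(f.func.items()))
        return _cache.setdefault(key, len(_cache))

    table = {(True, 2): 0.4, (True, 3): 0.0, (False, 2): 0.6, (False, 3): 1.0}
    m_s1B = Factor(["i1.e", "t1.B"], dict(table))
    m_s2B = Factor(["i2.e", "t1.B"], dict(table))

    assert factor_id(m_s1B) == factor_id(m_s2B)   # shared despite different args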

We will assume that we are given a PGM P = 〈F, X〉 (constructed by running a query on a database and containing shared factors) and a random variable X (associated with a result tuple) whose marginal probabilities need to be computed. We will also assume that every f ∈ F is associated with an id denoted by id(f) such that for any pair of factors, id(f1) = id(f2) ⇔ f1 ≅ f2.

The basic idea behind our approach to performing probabilistic inference with shared factors is to represent a run of the inference algorithm explicitly as a labeled graph. Once we do that, we will then show that it is possible to examine the graph and identify the shared intermediate factors that are generated during the inference process. To explain our approach, we will first define the semantics associated with the edges of the labeled graph by introducing an operator that forms the basis of most exact probabilistic inference algorithms (e.g., variable elimination [Zhang and Poole, 1994] and the junction tree algorithm [Huang and Darwiche, 1994]).

4.2.1 The ELIMRV operator

The elimrv operator (which stands for ELIMinate a Random Variable) is the basic operator that is used repeatedly while running inference to compute marginal probabilities. It essentially takes a random variable Y and a collection of factors F, each of which involves Y as an argument, and sums Y out from the product of all factors in F to return a new factor. We denote the


resulting (intermediate) factor by mY, followed by its list of arguments if they are not clear from the context. For instance, when we were computing µi1.e(i1.e) for the example in Section 4.1, to sum over s1.B we had to first multiply the collection of factors formed by fi1.e(i1.e, s1.B, t1.B) and fs1.B(s1.B) and then sum over s1.B from the product to produce the new intermediate factor ms1.B(i1.e, t1.B). Note that F may contain intermediate factors produced by earlier applications of the elimrv operator.

We first note a few properties of the elimrv operator. The order in which the factors appear in F is important. For instance, suppose we want to sum over X2 from the collection formed by fa(X1, X2) and fb(X2, X3). Then we would produce the product fc(X1, X2, X3) and perform the summation to produce fd(X1, X3). In other words, there is an implicit assumption that the arguments in the product are ordered by scanning the arguments of the input factors from left to right, and this affects the resulting factor produced after the summation operation. If instead we had multiplied fb(X2, X3) and fa(X1, X2), then we would first produce a factor f′c(X2, X3, X1) and then produce f′d(X3, X1) after the summation. In addition, the way the arguments overlap across the input factors (in the above case, the second argument of fa overlaps with the first argument of fb) and the position of the argument that is being summed over also matter. We would like to make these points about the elimrv operator explicit, and for this purpose, we feed the operator an explicit label that specifies the information described above.

Example 1. For the examples that follow, we use the following simple format for constructing labels that specify the argument order, how the arguments overlap, and which argument is being summed over. For each elimrv operation, we go through the list of factors in F, assigning each argument a unique id if it has not been seen before. We then construct the label by traversing the list of factors again, writing the id of each argument that appears, enclosing the lists of arguments in square braces and, finally, appending to the label the id of the argument being summed over. For the above example involving X2, fa(X1, X2) and fb(X2, X3), the label turns out to be {[1, 2], [2, 3], 2} using this format.
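
This label format is simple to mechanize. The following hypothetical helper (the name make_label is ours) reproduces the label from the example above:

```python
def make_label(factors_args, sum_var):
    """Build an elimrv label in the format of Example 1.

    factors_args: list of argument lists, e.g. [["X1","X2"], ["X2","X3"]]
    sum_var:      the random variable being summed over, e.g. "X2"
    """
    ids = {}
    for args in factors_args:            # first pass: assign fresh ids
        for a in args:
            ids.setdefault(a, len(ids) + 1)
    parts = ["[" + ",".join(str(ids[a]) for a in args) + "]"
             for args in factors_args]   # second pass: write the ids out
    return "{" + ",".join(parts) + "," + str(ids[sum_var]) + "}"

# Reproduces the label from Example 1:
assert make_label([["X1", "X2"], ["X2", "X3"]], "X2") == "{[1,2],[2,3],2}"
```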

We can now define the elimrv operator as follows:

Definition 6. The elimrv(Y, F, l) operation takes as input a random variable Y, an ordered list of factors F and a label l, and computes a new factor ∑_Y ∏_{f∈F} f according to the label l.


4.2.2 The RV-ELIM Graph

For the purposes of introducing our graph-based data structure, we will assume that we are given, besides X and P = 〈F, X〉, an elimination order O that contains all random variables involved in X except for X. In the next section (Section 4.3), we discuss in detail how to construct such an elimination order that suits our purposes. Note that once we have an elimination order, we have the sequence of elimrv operations defined for our inference procedure. The inference procedure proceeds as follows: we collect all factors from F in a pool, pick the first random variable Y to be eliminated from O, collect all factors that include Y as an argument from the pool, perform the corresponding elimrv operation, add the resulting intermediate factor mY back to the pool, and continue in the same fashion until we have exhausted all random variables from O. The rv-elim graph (which stands for Random Variable ELIMination graph) essentially encodes this sequence of elimrv operations using a labeled graph.

Definition 7. The rv-elim graph G = (V, E, LV, LE) is a directed graph with vertex labels LV(v), ∀v ∈ V, and edge labels LE(e), ∀e ∈ E, that represents a run of inference on a PGM P = 〈F, X〉 according to elimination order O such that:

• Every v ∈ V represents a factor. If v is a root, then it represents a factor f from F and LV(v) = id(f); if v is not a root, then it represents an intermediate factor mY = elimrv(Y, F, l) produced during the run of inference and LV(v) = l.

• For each mY = elimrv(Y, F, l) produced during inference, for the ith factor f in F, we add an edge from vf to vmY labeled i, where vf denotes the vertex corresponding to f and vmY denotes the vertex corresponding to mY.

Figure 4.5 (a) shows the rv-elim graph for our running example using the same elimination order we defined in Section 4.1. One point to note about the rv-elim graph is that, in general, it can never contain a directed cycle (in other words, it has to be a directed acyclic graph (DAG)). Our example happens to be a tree; in general this is not always going to be the case: Figure 4.6 shows an rv-elim graph that is not a tree.

4.2.3 Identifying Shared Factors

The advantage of representing a run of inference as a graph is that we can now identify exactly when two vertices in the graph represent shared factors.



Figure 4.5: (a) rv-elim graph for the example from Figure 4.3, (b) its compressed version obtained using bisimulation. The rv-elim graph shown in (a) is a vertex-labeled, edge-labeled graph. The edges are labeled with integers (in this case, 1 or 2) and denote the order in which the parent factors are present in the elimrv operation. The vertices are labeled with strings shown alongside each vertex: if the vertex is a source vertex then the label is a letter (e.g., a for the first source vertex in the top left corner); otherwise it is a string denoting how the arguments overlap for the elimrv operation that created the intermediate factor corresponding to this vertex (for instance, {[1, 2], [2], 2} for the sink vertices in the rv-elim graph). The compressed rv-elim graph shown in (b) is also an edge-labeled, vertex-labeled graph with the extent of every vertex depicted next to it in square braces. Note that the compressed rv-elim graph in this case consists of 5 vertices whereas the rv-elim graph itself contains 9 vertices, a significant reduction considering we have such a small running example.

Denote by fv the factor represented by vertex v in an rv-elim graph.

Claim 1. For rv-elim graph G = (V, E, LV, LE), two vertices v1, v2 ∈ V are shared factors fv1 ≅ fv2 if:

• LV(v1) = LV(v2).

• For every edge u1 → v1 labeled i, there exists an edge u2 → v2 labeled i such that fu1 ≅ fu2.

• For every edge u2 → v2 labeled i, there exists an edge u1 → v1 labeled i such that fu1 ≅ fu2.


[Figure 4.6 appears here. It shows relation S with attribute A (tuple s1 with A distributed as {1:0.7, 2:0.3}), relation T with attributes A, B (tuples t1 = (1, 2) and t2 = (1, 3)), and relation U with attribute B (tuple u1 with B distributed as {2:0.5, 3:0.5}). The join S ⋈A T produces tuples i1 = (1, 2) and i2 = (1, 3) with factors fi1.e, fi2.e; the subsequent join ⋈B U produces tuples j1 = (1, 2) and j2 = (1, 3) with factors fj1.e, fj2.e. The rv-elim graph combines fi1.e(i1.e, s1.A), fj1.e(j1.e, i1.e, u1.B), fi2.e(i2.e, s1.A), fj2.e(j2.e, i2.e, u1.B), fs1.A(s1.A) and fu1.B(u1.B) into intermediate factors mi1.e(s1.A, j1.e, u1.B), mi2.e(s1.A, j2.e, u1.B), m1s1.A(j1.e, u1.B), m2s1.A(j2.e, u1.B), m1u1.B(j1.e) and m2u1.B(j2.e).]

Figure 4.6: A three-relation join that produces a non-tree structured rv-elim graph (edge and vertex labels not shown for legibility). Note that to compute the marginal probabilities of j1.e we do not need to multiply all factors in the PGM; certain factors such as fi2.e are only required to compute the marginal probabilities of the other result tuple's random variable (j2.e). We handle this by "tagging" factors in the PGM with the random variables whose marginal probability computations they are involved in; subsequently, while performing inference, we make sure that we multiply two factors only if they have at least one tag in common.

Essentially, what the claim says is that two intermediate factors fv1 and fv2 generated during inference (using elimrv operations) are shared if:

• they were produced by multiplying sets of factors containing the same function components (the parents are shared), and

• the argument orders, argument alignments and the argument being summed over all match (the labels on v1 and v2 are the same).

Note that for a given internal vertex in the rv-elim graph, all incoming


edges from parents are assigned distinct edge labels, since we label the edges with the index indicating the position of the factor represented by the parent in F of the corresponding elimrv operation, and two factors cannot be at the same position (Definition 7).

We can now use Claim 1 to determine the intermediate shared factors that get generated during the inference process. The important thing to realize is that we can do this without actually computing these intermediate factors. For instance, recall that in Section 4.1 we showed that during the run of inference for our running example, ms1.B and ms2.B were intermediate factors that turned out to be shared (shown in dashed boxes in Figure 4.5(a)). By looking at the rv-elim graph (Figure 4.5(a)) this is now easy to see since:

• They have the same vertex label {[1],[2,1,3],1}.

• Both ms1.B and ms2.B have parents fs1.B and fs2.B, respectively, via edges labeled 1, and fs1.B ≅ fs2.B since they have the same vertex label a and are roots.

• Both ms1.B and ms2.B have parents fi1.e and fi2.e, respectively, via edges labeled 2, and fi1.e ≅ fi2.e since they have the same vertex label b and are also roots.

Thus by Claim 1, ms1.B ≅ ms2.B.

Given a graph (like the rv-elim graph shown in Figure 4.5(a)) and a property (such as the one specified in Claim 1), we now need an algorithm for partitioning the vertices into collections of shared factors. It turns out that there exist reasonably fast algorithms that can partition the set of vertices into disjoint sets which, because of our construction, will satisfy this property. These algorithms generally go by the term bisimulation (also known as the relational coarsest partition problem [Paige and Tarjan, 1987]). For the special case of the graph being a DAG, there exist algorithms that run in time linear in the size of the graph.

Dovier et al. [2001] describe one such algorithm that runs on an edge-labeled, vertex-labeled graph and not only partitions the set of vertices but also returns another (smaller) graph where each disjoint set in the partition is represented by a vertex, and the edges between vertices p1, representing one disjoint set in the partition, and p2, representing another, are the result of taking the union of all edges between the vertices of the input graph in p1 and the vertices in p2. We will refer to each resulting disjoint set of the vertices of the rv-elim graph


as an extent, and to the graph returned as a result of running bisimulation on the rv-elim graph as the compressed rv-elim graph. Figure 4.5(b) shows the compressed rv-elim graph returned as a result of running bisimulation on the rv-elim graph shown in Figure 4.5(a). Notice how vertex A represents both factors fs1.B and fs2.B. We show this in Figure 4.5(b) by indicating A's extent in square braces next to it. More interestingly, the pair of intermediate shared factors that we identified earlier (ms1.B and ms2.B) has also been collapsed into one single vertex, denoted by C in the compressed rv-elim graph.

Unfortunately, we cannot apply the bisimulation algorithm described in Dovier et al. [2001] directly to our problem, because we have not yet addressed an important issue. Recall that we discussed how the order in which the factors appear in the elimrv operator affects the results of applying the operator. We have not yet discussed how to choose an order. For traditional inference algorithms, when eliminating a random variable, any ordering of the factors works. However, in our case, Claim 1 actually uses the order of the parents of the vertices in the rv-elim graph to determine which ones represent shared factors. This means that for us the order matters. If we do not choose the correct order then we might end up with cases such as the one shown in Figure 4.7, where instead of ordering the parents of ms2.B with fs2.B as the first parent and fi2.e as the second, we have placed fi2.e as the first parent and fs2.B as the second. A direct consequence of this is that the labels on the vertices representing ms1.B and ms2.B in the rv-elim graph are now different, which means that using Claim 1 we cannot decree them to form a pair of shared factors.

The problem is that we do not know the order in which we should present the factors to each elimrv operation; some orders produce more symmetric rv-elim graphs (with more shared factors) than others, and we need to choose these orders. One approach is to try all possible parent orderings, but this will likely be too expensive. Instead, we introduce a novel heuristic for choosing better orderings. Our bisimulation algorithm, based on Dovier et al. [2001]'s, requires a different interleaving of the steps, so for completeness we first present our bisimulation algorithm, and then the heuristic we developed for ordering parents.

4.2.4 Bisimulation for RV-ELIM Graphs

We will assume that we are given an rv-elim graph G = (V, E, LV, LE) for computing marginal probabilities of random variable X from PGM P using the elimination order O. Each root v ∈ V is labeled by id(fv),


[Figure 4.7 appears here: ms1.B(i1.e, t1.B), built from parents fs1.B(s1.B) (edge label 1) and fi1.e(i1.e, s1.B, t1.B) (edge label 2), receives label {[1],[2,1,3],1}; ms2.B(i2.e, t1.B), built from parents fi2.e(i2.e, s2.B, t1.B) (edge label 1) and fs2.B(s2.B) (edge label 2), receives label {[1,2,3],[2],2}.]

Figure 4.7: A poor ordering of parent vertices.

where fv denotes the factor from F represented by v. We will assign the remaining vertex labels (for the internal vertices) and the edge labels in G dynamically through the bisimulation algorithm we present.

A partition denotes a division of the set of vertices of the rv-elim graph into disjoint sets; each disjoint set is denoted a block. The full algorithm is described in Algorithm 1. The bisimulation algorithm starts by computing ranks for each vertex in the rv-elim graph (a simple depth-first search should do this). After computing ranks, the algorithm starts by assigning the roots in the rv-elim graph to the blocks formed by their labels. After this, it goes through the vertices at rank i, partitioning them into blocks. Note that when we are dealing with vertices at rank i, we only need the partitioning of the vertices at ranks i′ < i since, according to Claim 1, the partitioning of a vertex only depends on its label and its parents' partitioning, and the parents of vertices at rank i can only have ranks i′ < i (the rank computation scheme guarantees this). The nested for loops basically achieve this. They take all vertices at rank i, choose orders for each vertex's parents (we will discuss how this is done shortly), form the label and the key based on this ordering, and partition these vertices based on the constructed keys. See Dovier et al. [2001] for a proof of correctness for the case when the vertex and edge labels can be statically allocated.

Parent ordering heuristic. To order the parents of each internal vertex v in the rv-elim graph before partitioning them, we simply order the parents based on their block-ids (assuming the block-ids can be ordered). We can do this using Algorithm 1 since, when we are about to decide which block to place v in, we have the blocks of its parents available. Recall that Claim 1 requires both the labels to match and the parent sets of both vertices to be aligned before we decree vertices v and v′ to represent shared factors. This heuristic helps align the parent vertices.

Algorithm 1, by itself, is reasonably efficient.


Algorithm 1: Bisimulation for RV-Elim Graphs.
input: RV-Elim graph G = (V, E, LV, LE) with roots labeled.
output: A disjoint partition over V.

d(v) = 0 if v is a root, else 1 + max{d(v′) | v′ → v ∈ E}   /* compute depths */
ρ ← max{d(v) | v ∈ V}
B0,l = {v | v is a root ∧ LV(v) = l}   /* compute initial partition */
C = {B0,l}
Bi = {v | d(v) = i}, ∀i = 1 . . . ρ
for i = 1 . . . ρ do
    foreach v ∈ Bi do   /* construct keys to partition on */
        order the parents of v by their block-ids
        construct label LV(v)
        construct key kv from LV(v) and the parents' block-ids
    end
    add each Bi,k = {v ∈ Bi | kv = k} to C
end
return the final partition C

Its time complexity, assuming we use the heuristic that orders parents by block-ids, is O(|V| + |E|) to compute ranks in step 1, plus ∑_{v∈V} (d_v log d_v + d_v) to order the parents and form the keys, where d_v is the in-degree of v (ignoring the time spent to construct LV(v)), plus O(|V|) to partition the vertices at each rank into blocks based on their keys. Adding up, this gives us O(∑_v d_v log d_v + |V|) = O(|E| log D + |V|), where D is the maximum in-degree of any vertex in the rv-elim graph.
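
For concreteness, here is a simplified Python sketch of Algorithm 1 (hypothetical, not the thesis's implementation). It assumes each vertex comes with its precomputed label and parent list, and it approximates the key construction by pairing the label with the sorted multiset of the parents' block-ids; the full algorithm additionally re-derives the vertex label from the chosen parent order:

```python
def bisimulate(graph):
    """graph: vertex -> (label, parents); roots have an empty parent list."""
    # Rank = longest distance from a root; parents always have smaller rank.
    rank = {}
    def depth(v):
        if v not in rank:
            parents = graph[v][1]
            rank[v] = 0 if not parents else 1 + max(depth(p) for p in parents)
        return rank[v]
    for v in graph:
        depth(v)

    block, key_to_block = {}, {}
    # Process vertices rank by rank so parents are partitioned first.
    for v in sorted(graph, key=rank.get):
        label, parents = graph[v]
        # Parent-ordering heuristic: align parents by sorting on block-ids.
        key = (label, tuple(sorted(block[p] for p in parents)))
        block[v] = key_to_block.setdefault(key, len(key_to_block))
    return block

# The portion of Figure 4.5(a) feeding the two marginals:
g = {
    "fs1.B": ("a", []), "fs2.B": ("a", []),
    "fi1.e": ("b", []), "fi2.e": ("b", []), "ft1.B": ("c", []),
    "ms1.B": ("{[1],[2,1,3],1}", ["fs1.B", "fi1.e"]),
    "ms2.B": ("{[1],[2,1,3],1}", ["fs2.B", "fi2.e"]),
    "mu_i1.e": ("{[1,2],[2],2}", ["ms1.B", "ft1.B"]),
    "mu_i2.e": ("{[1,2],[2],2}", ["ms2.B", "ft1.B"]),
}
b = bisimulate(g)
assert b["ms1.B"] == b["ms2.B"] and b["mu_i1.e"] == b["mu_i2.e"]
```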

4.2.5 Inference with the Compressed RV-ELIM Graph

Having computed the partition of the vertices using Algorithm 1, as indicated earlier, we can now construct the compressed rv-elim graph: each block in the partition is represented by a vertex, the label on the block is the label on the vertices in the block, and two blocks have an edge labeled i between them if there exists a pair of vertices in the two blocks that have an edge labeled i. These definitions are consistent because the blocks of the partition correspond to particular keys constructed by Algorithm 1, which contain the vertex labels and edge labels, and all vertices in a block have the same key.

We can now perform inference on the compressed rv-elim graph. To seed the inference, we simply copy the function components of the factors


corresponding to roots of the rv-elim graph to the roots of the compressed rv-elim graph. Then we call a depth-first search procedure (dfs) from the leaf in the compressed rv-elim graph, which begins by looking at the parents, the labels on the edges and the vertices, and applies the elimrv operator to compute the functions on the child. If a parent's functions have not been computed yet, then we make the dfs call on the parent before applying elimrv on the child. Finally, we will have the (unnormalized) marginal distribution computed at the leaf of the compressed rv-elim graph. If our inference requires computing marginal probabilities of multiple random variables then this can also be done using our approach, but in this case the compressed rv-elim graph may have multiple leaves. If the user requests marginal probabilities for random variable X, then we simply need to find the leaf in the compressed rv-elim graph that contains (unnormalized) µ(X) in its extent and return that (after normalization). This last step can be made faster if we maintain a mapping from random variables X to the leaves of the compressed rv-elim graph that contain the corresponding (unnormalized) marginal probability functions.
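
The procedure just described can be sketched as a short memoized depth-first search. The sketch below is hypothetical: it assumes the compressed rv-elim graph is given as a map from blocks to (label, ordered parents), and it takes the label-driven elimrv routine as a parameter rather than fixing one implementation:

```python
def infer_marginal(cgraph, functions, leaf, apply_elimrv):
    """Evaluate a compressed rv-elim graph bottom-up via memoized dfs.

    cgraph:       block -> (elimrv label, ordered list of parent blocks);
                  root blocks have an empty parent list.
    functions:    block -> function component, pre-seeded for the roots by
                  copying the roots of the (uncompressed) rv-elim graph.
    apply_elimrv: callable(label, parent_functions) -> function component;
                  a label-driven version of the elimrv sketch shown earlier.
    """
    def dfs(v):
        if v not in functions:        # compute each block's function once
            label, parents = cgraph[v]
            functions[v] = apply_elimrv(label, [dfs(p) for p in parents])
        return functions[v]

    mu = dfs(leaf)                    # unnormalized marginal at the leaf
    z = sum(mu.values())
    return {val: p / z for val, p in mu.items()}   # normalize
```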

4.3 Computing Elimination Orders

One of the important steps in performing probabilistic inference is to choose a good elimination order that helps run inference without producing too many large intermediate factors (in terms of number of arguments) during the run of inference. This can make the difference between inference being tractable or intractable, since the size of a factor is proportional to the product of the domain sizes of its argument random variables. In our case, since we are interested in exploiting shared factors, and since the elimination order affects the rv-elim graph constructed, we would like elimination orders that produce smaller factors and, at the same time, produce rv-elim graphs that can be compressed using bisimulation. Unfortunately, even without consideration of shared factors, the problem is known to be NP-Hard [Arnborg, 1985]. Thus, as is done in traditional inference algorithms, we resort to heuristics. In particular, we introduce a novel version of the popular minimum size heuristic (MSH) [Kjaerulff, 1990], used with traditional exact inference algorithms, to construct effective elimination orders that can help exploit shared factors∗.

Our elimination order generation heuristic works in two phases:

∗For an alternate, more integrated, elimination order generation heuristic see Sen et al. [2009c].



Figure 4.8: (a) Example PGM graph (b) its compressed version.

• We first identify sets of "similar" random variables; as we will explain shortly, this should help construct elimination orders that lead to rv-elim graphs which can be compressed better.

• Traditional MSH defines a notion of neighborhood for random variables. We show below that, for our purposes, this notion is no longer adequate, and we introduce a novel version of MSH that helps avoid large intermediate factors.

We first explain the need to look for random variables that are "similarly" positioned in the PGM produced by query evaluation. Recall from our running example that we eliminated s1.B and s2.B one after another. If instead we had eliminated t1.B first, then we would risk combining shared factors into potentially one single factor, and risk losing the symmetry in the resulting rv-elim graph. What we need to do here is find sets of random variables that occur in shared factors. Eliminating these one after another should help generate rv-elim graphs with better compression properties. Fortunately, we can easily represent a PGM as a graph where the random variables are represented using vertices and correlations are represented using edges (Figure 4.8(a) shows the PGM graph for our running example), and we can use this PGM graph to find similar random variables simply by labeling the vertices using the ids of the factors from the PGM (if the random variable is present in multiple factors then we aggregate their ids using some operation such as max or sum, assuming the ids are numbers). Then we run a bisimulation on the PGM graph to compute a partition on the random variables of the PGM and the corresponding compressed PGM graph (Figure 4.8(b) shows the compressed PGM graph for our running example). Each extent thus obtained after bisimulation contains similar random variables. Note that, unlike rv-elim graphs, which are guaranteed to be directed acyclic graphs, the PGM graph can be cyclic. Bisimulation algorithms for


general graphs (with cycles) are available [Dovier et al., 2001; Paige and Tarjan, 1987].

We now explain the second step. Having constructed the sets of similar random variables, we would now like to ensure that we eliminate those random variables one after another and that we avoid generating large factors in the process. One simple way to do this is to produce an ordering on the vertices of the compressed PGM graph and then expand the entries in that ordering using the extents. In addition, producing an ordering on the vertices of the compressed PGM graph is likely to be faster, since the compressed graph is likely to contain fewer vertices than the number of random variables in the PGM. We now proceed towards applying (some suitable modification of) MSH on the compressed PGM graph, and for this we need some background on MSH. The basic tenet underlying traditional MSH is the notion of the neighborhood of a random variable, which is defined as the set of distinct random variables with which it appears as arguments to factors in the PGM. MSH works by greedily picking the random variable with the smallest neighborhood to be eliminated first, then updating the neighborhoods of all random variables involved in the intermediate factor introduced by the elimination, until all random variables to be eliminated have been picked.

However, the original MSH may not work on the compressed PGM graph. The problem here is that the neighborhood of a vertex in the compressed PGM graph is not a good indicator of the size of the intermediate factor produced by an elimination. This leads us towards defining a new neighborhood criterion that involves not only the neighborhood in the compressed PGM graph but also the extents of the vertices in the neighborhood. Define the avg. neighborhood size of v to be ∑_{v′∈N(v)} |extent(v′)| / |extent(v)|, where N(v) denotes the neighborhood of v in the compressed PGM graph. Essentially, avg. neighborhood size assumes that there are as many neighbors of vertex v as there are random variables in all its neighbors' extents summed up. It essentially tries to estimate the neighborhood of the vertex with respect to the uncompressed PGM graph, and it compensates for the case when v itself has a large extent by dividing by the extent size. Thus it tries to make MSH behave as if we were running it on the uncompressed PGM graph, but actually runs on the compressed PGM graph, making it more efficient. Algorithm 2 shows the final modified minimum size heuristic algorithm.


Algorithm 2: Modified Minimum Size Heuristic
input: Compressed PGM graph G = (V, E), and vertex vX that contains X (whose marginals we need) in its extent.
output: Ordering over all random variables that need to be eliminated.

initialize empty list Ō
while ∃v ∈ V s.t. v ≠ vX, v ∉ Ō do
    pick the vertex v ≠ vX with the smallest avg. neighborhood size
    add v to Ō
    introduce an edge between every pair of neighbors of v
end
construct O by expanding the entries in Ō with their extents
add extent(vX) \ {X} to O
return O
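
A hypothetical Python rendering of Algorithm 2, under the assumption (ours) that the compressed PGM graph is given as adjacency sets plus extents; avg_nbhd implements the avg. neighborhood size defined above:

```python
def modified_msh(neighbors, extent, vX, X):
    """neighbors: vertex -> set of adjacent vertices (mutated as we go);
    extent: vertex -> list of the PGM random variables the vertex stands for;
    vX: the vertex whose extent contains the query variable X."""
    def avg_nbhd(v):
        return sum(len(extent[u]) for u in neighbors[v]) / len(extent[v])

    order, remaining = [], set(neighbors) - {vX}
    while remaining:
        v = min(remaining, key=avg_nbhd)   # smallest avg. neighborhood first
        order.append(v)
        remaining.discard(v)
        # Connect every pair of v's neighbors, as eliminating v would.
        for u in list(neighbors[v]):
            neighbors[u] |= neighbors[v] - {u, v}
            neighbors[u].discard(v)
        neighbors[v] = set()
    # Expand each compressed-graph vertex into its extent, then append
    # vX's extent (minus the query variable X) at the end.
    O = [x for v in order for x in extent[v]]
    O += [x for x in extent[vX] if x != X]
    return O
```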

4.4 Experimental Evaluation

Our experimental evaluations were designed to answer the question: when is it worthwhile to apply our bisimulation-based approach to a query evaluation problem? Note that standard inference algorithms take a PGM and a random variable, and simply begin multiplying factors and summing over random variables (after computing the elimination order). Instead, our approach first constructs the rv-elim graph, applies bisimulation to compress it, and then begins multiplying function components of factors and summing over arguments from them. So it is plausible that there may be cases where our approach performs poorly because it spends too much time before actually getting to the point where it can perform (a hopefully smaller set of) multiplications and summations. Our experimental results suggest the following:

• In most cases, our approach is significantly faster than the standard inference algorithm.

• In a small number of cases, our approach loses out to the baseline inference approach we compare against; but in these cases the difference between the time it took to run our approach and the baseline approach was not large.†

†Note that early stopping techniques are possible: for instance, once we run bisimulation on the PGM graph and find that the extents of the compressed PGM graph are small, we can switch our inference engine and resort to standard


We compare against a baseline exact inference algorithm, denoted BatchVE, which is a modified version of variable elimination (VE), except that if the PGM contains multiple random variables whose marginal probabilities we are interested in, then it avoids the multiple passes through the PGM that standard VE [Zhang and Poole, 1994] makes. We refer to our approach, which constructs a compressed PGM to exploit shared factors, as Lifted Inference or, in short, LiftedInf. For each experiment we report five numbers:

• Relational algebra operations (Rel. alg. ops.): Reports the time taken to perform the relational algebra operations in the query to construct the PGM.

• BatchVE arithmetic operations (BatchVE arith. ops.): Reports the time taken to multiply factors and sum over random variables during inference for BatchVE.

• BatchVE remaining operations (BatchVE rem.): Reports the time required to perform the remaining BatchVE operations, such as determining the elimination order.

• LiftedInf arithmetic operations (LiftedInf arith. ops.): Reports the time spent multiplying functions of factors and summing over arguments (on the compressed rv-elim graph) for our approach.

• LiftedInf remaining operations (LiftedInf rem.): Reports the time taken to perform the remaining operations for the approach we described in this chapter; this includes the various runs of bisimulation and the time spent determining the elimination order from the compressed PGM graph.

For each experiment, we report three bars (except for Figure 4.9(e)): the first bar reporting the rel. alg. ops. time; the second, the time spent by BatchVE; and the third, the time spent by LiftedInf. See the legend (shown at the top in Figure 4.9) for more details. Note that no single bar reports the actual time to run the query. To find out the total time taken to run the query we need to add the rel. alg. ops. time to the second bar or the third bar, depending on the algorithm.

All our experiments were run on a dual-processor Xeon 3 GHz machine with 3 GB of RAM. Our implementation is in Java and the numbers we report were averaged over 10 runs.

inference, but for our experiments we did not include this approach.


[Figure 4.9 appears here. Legend: LiftedInf arith. ops., LiftedInf rem., BatchVE arith. ops., BatchVE rem., Rel. alg. ops. The panels plot time (s) against: (a) number of tuples (in millions), (b) domain size, (c) fanout per primary key, (d) λ, (e) number of primary key buckets, (f) number of common entries in domain, (g) number of join values, (h) scale factor.]

Figure 4.9: Plots for experiments on synthetic and TPC-H data. The legend is shown at the top.


4.4.1 Car DB experiments

For our first set of experiments, we developed the pre-owned car ads example further and randomly generated data and factors to illustrate how the performance of the two algorithms varies with various characteristics of the data. In addition to the relation containing the various advertisements (Ad) described in Figure 4.1, we added another relation which denotes the source websites from which the ads were pulled (S). Each tuple in S is an uncertain tuple with an associated probability of existence which depends on how reliable the website's information is. For these experiments, we ran the following query:

Π_AdID((σ_Color=c Ad) ⋈_SID S), where c denotes a specific color and SID is a primary key in S that acts as a foreign key in Ad. Besides the uncertain tuples in S, we set the Color attribute values to be uncertain, and these were correlated with the corresponding Make attributes. A car of a certain Make can have one of 4 distinct Colors. The parameters that we varied for these experiments are d (domain size of Make, default was 50), n (the number of attribute uncertainty tuples in Ad, default value is 1000) and fanout (the number of tuples in Ad that each tuple from S joins with, default value is 1000).

In Figure 4.9(a), we show how LiftedInf and BatchVE perform when we vary n from 100 to 1000. Notice that LiftedInf significantly reduces the time spent performing arithmetic operations. Note that on the x-axis in Figure 4.9(a), we report the size of Ad in terms of the number of tuple uncertainty tuples to help the reader compare with previous work on probabilistic databases, since our formulation can deal with both attribute uncertainty and tuple uncertainty but most recent work can handle tuple uncertainty only. A simple rule of thumb to compute the size of attribute uncertainty relations in tuple uncertainty format is n × d1 × d2 × ... × d|attr(R)|, where n is the number of attribute uncertainty tuples and di is the domain size of the ith uncertain attribute in the relation (assuming all ith uncertain attribute values in the relation have the same domain size). For our experiment, this gives us n × d × 4d. See Section 4.5 for more details on this conversion from attribute uncertainty to tuple uncertainty.

Figure 4.9(b) shows the performance of the two inference algorithms with varying domain sizes. Notice how at d = 10 LiftedInf performs worse (because small domain sizes mean small factors and, therefore, less time spent on arithmetic operations), but the difference between its time and BatchVE's time is not large.

The third experiment we ran (Figure 4.9(c)) is the most interesting experiment in this subsection. Here we varied the fanout from 1 to 10 to vary


the symmetry in the PGMs produced by the query (but kept the number of tuples in Ad fixed). At fanout 1, we have no symmetry and no shared factors in the base data, since every tuple from S has a unique existence probability, but the shared factors increase as we increase the fanout. Thus, at fanout 1, LiftedInf should perform worse, and it does, but not by a huge amount. At fanout 2, where we have a slight amount of symmetry in the query (every tuple from S joins with exactly 2 tuples from Ad), LiftedInf is already doing better than BatchVE. At fanout 10, it does much better than BatchVE.

In Figure 4.9(d), instead of keeping the fanout constant for all tuples in S, we sampled it from a Poisson distribution with parameter λ. In this case, however, we kept the number of tuples in S fixed. Note that at λ = 1, most fanouts sampled turn out to be 1, but some samplings produce numbers greater than 1 and LiftedInf utilizes this to do better than BatchVE, even at λ = 1. At λ = 10, LiftedInf performs much better.

Until now, we had kept the existence probabilities of tuples in relation S distinct. In the next experiment, we introduced some shared factors for existence probabilities by dividing the tuples in S into buckets. Two tuples in the same bucket had the same existence probability. The number of tuples in S was fixed at 600, so at 600 buckets (right end of the plot in Figure 4.9(e)), we have exactly 1 tuple belonging to each bucket. Figure 4.9(e) shows how LiftedInf's performance deteriorates as the number of buckets increases. Note that we do not show the time taken by BatchVE in this case since it would obscure the trend of LiftedInf (BatchVE took around 25 seconds for this experiment).

4.4.2 Experiments with uncertain join attributes

The next two plots (Figure 4.9(f) and (g)) relate to a two-relation join between S and Ad where the join attribute SID itself was uncertain. This relates to the case of link uncertainty or structure uncertainty [Getoor et al., 2002], where we are unsure about the primary/foreign key values in the data. For instance, we may have another relation in our database which stores the id of the person who posted the pre-owned car ad, and we may want to join with that relation so we can take into account the reliability of the seller while trying to return to the user cars of her/his interest. However, we may not know the seller's identity, as this information may not have been properly extracted or is simply unavailable (s/he used the guest login). Joins on uncertain attributes give rise to very complicated PGMs and we wanted to keep some control over the complexity of the PGM. We


set up this experiment in the following fashion: first we constructed k key values; then, for each tuple in either relation, we randomly drew m distinct keys from this pool to include in the domain of the uncertain join attribute value; finally, we padded each attribute value's domain with unique key values so that the total domain size is 50. Thus, increasing k makes it less likely that two tuples from the two relations join; on the other hand, increasing m increases the chance that two tuples join. Note that if two tuples join then this may be due to multiple entries being common in their domains. Figure 4.9(f) (varying m with k held constant at 100) and Figure 4.9(g) (varying k with m held constant at 2) show the results.

4.4.3 Experiments with TPC-H data

Following previous work, we also ran experiments based on the TPC-H schema. We picked Q5 from the TPC-H specification since this involves a join among six relations, of which we made 4 relations (customer, lineitem, supplier and order) probabilistic. The query tries to determine how much volume of sales is being generated in various regions. Each customer makes k1 orders, each order is broken down into k2 sub-orders each of which is a lineitem entry, and each sub-order is then diverted to a supplier. Each tuple from customer is uncertain, and these were divided into p1 buckets such that tuples from the same bucket had the same existence probabilities; similarly, the supplier tuples were divided into p2 buckets. Moreover, each customer sub-order is usually (with 95% probability) routed to one of c suppliers; otherwise the supplier is chosen randomly. For the lineitem and order relations, we made the discount attribute uncertain (domain size 4d) and correlated with the type of the part being ordered (domain size d), and the orderdate attribute uncertain (domain size d). We set the parameters in the following manner: k1 ∼ Poisson(2), k2 ∼ Poisson(3), p1 = p2 = 5, c = 3, d = 50. We defined the scale factor to be the number of tuples in lineitem in tuple-uncertainty format divided by 6 × 10^6. The results are shown in Figure 4.9(h). The results showed similar trends when we tried other settings of the parameter values; for instance, the execution time for LiftedInf went down when we decreased c and increased d, and so on.

In almost all our experiments, we noticed significant speedups ranging from 200% to 700%. Even in cases where there was no symmetry, LiftedInf performed only slightly worse than BatchVE, incurring about 25% extra time to compress rv-elim graphs. Given that the datasets we generated were extremely simple in their correlation structure, we believe we will do even better on real-world data with richer correlation structure containing


(a)
AdID  Make   Color  Price
1     Honda  ?      $9,000
2     ?      ?      $6,000
3     ?      Beige  $8,000
...

Color  fcolor        Make    fmake
Black  0.75          Honda   0.55
Beige  0.25          Toyota  0.45

(b)
AdID  Make    Color  Price   prob.
1     Honda   Black  $9,000  0.75
1     Honda   Beige  $9,000  0.25
2     Honda   Black  $6,000  0.4125
2     Honda   Beige  $6,000  0.1375
2     Toyota  Black  $6,000  0.3375
2     Toyota  Beige  $6,000  0.1125
3     Honda   Beige  $8,000  0.55
3     Toyota  Beige  $8,000  0.45
...

Figure 4.10: Database with pre-owned cars for sale: (a) attribute-uncertainty format, (b) pure tuple-uncertainty format.

shared factors.

4.5 Discussion

Recall that, in Chapter 2, we surveyed a number of related works proposing different ways of modeling uncertainty in probabilistic databases. Broadly speaking, the various models can be categorized as models that associate uncertainty at the tuple level (tuple-level uncertainty) and models that represent uncertainty at both tuple and attribute levels. Especially in recent times, a number of tuple-level uncertainty models have been proposed in the probabilistic database community. These include (but are not limited to) MystiQ's block-independent disjoint formalism [Re et al., 2006] and Trio's x-tuples [Benjelloun et al., 2006; Das Sarma et al., 2006]. In contrast, our approach based on PGMs and shared correlations (first-order graphical models) allows expressing uncertainty at the tuple and/or attribute levels. Having reviewed both of these approaches, one question that begs asking is whether we are any closer to choosing one single way of modeling uncertainty. Both approaches have advantages and disadvantages, and to compare them we first need to understand how to represent the same fragment of uncertain data in either approach. To this end, we discuss a simple transformation that takes a database with attribute and tuple uncertainty and returns its representation in pure tuple uncertainty format. After that, we discuss the pros and cons of one representation scheme over the other.

Figure 4.10 shows an example where the database contains ads for

66

Page 73: Representing and Querying Uncertain Data Prithviraj Sensen/pubs/thesis/thesis.pdf · 2010-05-21 · Representing and Querying Uncertain Data Prithviraj Sen Dissertation submitted

pre-owned vehicles up for sale. In Figure 4.10 (a), some of the attribute values are missing; in particular, the tuple with AdID 1 has its Color attribute missing, the tuple with AdID 3 has its Make attribute missing, and the tuple with AdID 2 has both attributes missing. Figure 4.10 (a) also shows, at the bottom, the probability distributions associated with the missing attributes (common across all tuples). To represent such data with attribute uncertainty in pure tuple uncertainty format, one approach is to compute all possible joint instantiations of every tuple present in the attribute-level uncertainty database. For instance, the first tuple in Figure 4.10 (a) can be instantiated to the two tuples (1, Honda, Black, $9,000) and (1, Honda, Beige, $9,000), where the first instantiation's probability of existence is 0.75 while the second instantiation's is 0.25 (given by the distribution on the Color attribute from Figure 4.10 (a)). Note that these two instantiations cannot exist together since they come from the same attribute-level uncertainty tuple; in other words, they are correlated with a mutually exclusive dependency. In Figure 4.10 (b), we show all three tuples from Figure 4.10 (a) represented with tuple-level uncertainty, where tuples present in the same block are mutually exclusive (note that this is the same representation used in other works on probabilistic databases, such as x-tuples [Benjelloun et al., 2006] and the block-independent disjoint formalism [Re and Suciu, 2007]).
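
The transformation is mechanical. The following hypothetical Python generator (the names are ours, for illustration) expands a single attribute-uncertainty tuple into its block of mutually exclusive tuple-uncertainty instantiations, reproducing the probabilities shown for AdID 2 in Figure 4.10 (b):

```python
from itertools import product

def to_tuple_uncertainty(tuple_values, dists):
    """Expand one attribute-uncertainty tuple into a mutually exclusive
    block of fully instantiated tuples with existence probabilities.

    tuple_values: attribute -> value, with None marking a missing value.
    dists:        attribute -> {value: probability} prior for missing ones.
    """
    unknown = [a for a, v in tuple_values.items() if v is None]
    for choice in product(*(dists[a].items() for a in unknown)):
        inst, p = dict(tuple_values), 1.0
        for a, (val, prob) in zip(unknown, choice):
            inst[a] = val
            p *= prob
        yield inst, p

# The tuple with AdID 2 from Figure 4.10 (a): Make and Color both missing.
ad2 = {"AdID": 2, "Make": None, "Color": None, "Price": "$6,000"}
dists = {"Make": {"Honda": 0.55, "Toyota": 0.45},
         "Color": {"Black": 0.75, "Beige": 0.25}}
for inst, p in to_tuple_uncertainty(ad2, dists):
    print(inst, p)   # 4 mutually exclusive tuples, e.g. Honda/Black: 0.4125
```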

It is difficult to see how approaches that only allow representing tuple-level uncertainty can exploit shared correlations for efficient query evaluation. As the example shows, the shared factors for Color and Make in Figure 4.10 (a) get completely obscured once we convert to tuple-level uncertainty, so much so that no pair of tuples in Figure 4.10 (b) has the same probability of existence. Moreover, the tuple-level uncertainty representation (Figure 4.10 (b)) requires 8 tuples to represent the same information that required only 3 tuples using attribute-level uncertainty (Figure 4.10 (a)), which means representing data using tuple-level uncertainty requires more space. In the above example, if the color attribute had nc values in its domain and the make attribute nm, then the tuple with AdID 2 in Figure 4.10 (a) would blow up into nc × nm tuples in pure tuple-level uncertainty format. Not only does this imply that the tuple-level uncertainty format requires more space to represent the same data, it also means that this form of representation involves more random variables, which is another reason why query evaluation may be slower under this approach.

The problem with using tuple-level uncertainty for data that contains attribute uncertainty is that it requires computing joint distributions, and this becomes an expensive operation in terms of the size of the representation when we have many uncertain attribute values connected via correlations.



Figure 4.11: Inversion elimination is a special case of bisimulation-based inference: (a) the rv-elim graph and (b) its compressed version.

This observation has been made in other contexts also, such as selectivity estimation in databases [Getoor et al., 2001], and is one of the main reasons why researchers in machine learning prefer working with factored representations of joint probability distributions such as probabilistic graphical models. Note that many domains produce uncertain data that can be naturally modeled using attribute-level uncertainty rather than tuple-level uncertainty, such as mobile object databases [Cheng et al., 2003] and sensor network data [Deshpande et al., 2004], and these can be easily read into a database that can represent both attribute and tuple uncertainty, whereas we need to perform some transformation before we can read them into a database that can only represent tuple-level uncertainty, which may lead to loss of their natural structure.

On a side note, recall that while discussing related work on lifted inference in Section 2.3, we mentioned that inversion elimination [de Salvo Braz et al., 2005; Poole, 2003] is one popular approach. We are now ready to demonstrate that inversion elimination is a special case of our bisimulation-based lifted inference. Given a computation of the form ∑_Y ∑_{X_i} ∏_i ψ(X_i, Y) (all ψ's are shared factors), inversion elimination avoids the complexity of eliminating each X_i, ∀i = 1, ..., n separately by pushing each summation over X_i against the corresponding ψ, eliminating X_i once, and then eliminating Y:

∑_Y ∑_{X_i} ∏_{i=1}^n ψ(X_i, Y) = ∑_Y ∏_{i=1}^n ∑_{X_i} ψ(X_i, Y) = ∑_Y ∏_{i=1}^n ψ′(Y) = ∑_Y ψ′^n(Y) = ψ′′().

Figure 4.11 shows how our approach achieves the same.
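
The identity is easy to check numerically. The sketch below (assuming NumPy, with randomly chosen factor values; an illustration of ours, not part of the thesis's implementation) compares the ground computation against the inverted one for n = 3 binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
psi = rng.random((2, 2))         # one shared factor psi[x, y], binary X and Y

# Ground computation: materialize the joint over (X1, X2, X3, Y), sum it out.
joint = np.einsum("ay,by,cy->abcy", psi, psi, psi)
ground = joint.sum()

# Inversion elimination: eliminate X against psi once, raise to the n-th
# power, then eliminate Y -- n eliminations collapse into one.
psi_prime = psi.sum(axis=0)      # psi'(Y) = sum_X psi(X, Y)
lifted = (psi_prime ** n).sum()  # psi'' = sum_Y psi'(Y)^n

assert np.isclose(ground, lifted)
```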


4.6 Conclusion

In this chapter, we showed how to exploit shared correlations to speed up probabilistic inference during query evaluation for probabilistic databases. Shared correlations are likely to exist in many probabilistic databases since probabilities and correlations often come from general statistics learnt from (large amounts of) data and rarely vary on a tuple-to-tuple basis. In addition, the query evaluation approach that builds the probabilistic graphical model on which we finally need to run inference itself tends to introduce shared correlations. We introduced a new graph-based data structure and explained how to build it from the probabilistic graphical model. We then showed how the graph can be compressed using an algorithm based on bisimulation. We empirically evaluated our approach and showed that, even in the presence of a few shared correlations, we do significantly better than naive inference approaches.


Chapter 5

Approximate Lifted Inference for Probabilistic Databases

In the last chapter, we discussed how to implement a lifted inference algorithm that leverages shared correlations to efficiently perform large-scale probabilistic inference while evaluating queries in probabilistic databases. We discussed how probabilistic databases are likely to contain shared correlations that result in identical factors, and how these identical factors represent a kind of symmetry that lifted inference algorithms attempt to exploit to speed up inference. Although lifted inference often works well in cases when the PGM provides significant symmetry, sometimes such symmetry may not exist. In such cases, and in cases when the application can tolerate errors in the result of inference, we may want to resort to approximate inference to scale up to large datasets.

In this chapter, just as in the last one, we continue with our goal of designing efficient large-scale inference procedures. However, unlike the last chapter, where we concentrated on designing an exact lifted inference algorithm, here our goal is to design approximate procedures. The main question we investigate here is whether it is possible to design approximate lifted inference techniques that allow the user to trade off accuracy of inference for computational efficiency. We answer this question in the affirmative and develop two such techniques. Moreover, we show that these two techniques can be combined for more aggressive exploitation of shared correlations. Also, we show how it is possible to combine our techniques with bounded complexity inference procedures such as the mini-bucket scheme [Dechter and Rish, 2003], so that we do not need to incur the full treewidth of the PGM in question to run inference. Finally, we develop a unified lifted


S   A   B                        T   B                C
s1  a1  {1:0.6, 2:0.4}           t1  {2:0.5, 3:0.5}   c
s2  a2  {1:0.6, 2:0.4}
s3  a3  {1:0.62, 2:0.38}

S ⋈B T   A   B   C
i1       a1  2   c     (factors fi1.e, fi2.e, fi3.e)
i2       a2  2   c
i3       a3  2   c

Figure 5.1: Running example for this chapter. Note that the prior on s3.B is slightly different from the priors on s1.B and s2.B.

inference engine that, via the use of a handful of tunable parameters, gives the user full control over what type of lifted inference algorithm s/he desires. The unified lifted inference engine allows the user to choose between standard and lifted inference of the exact or approximate variety, along with specifying the complexity incurred by inference.

Continuing in the vein of the work described in the last chapter, even though our focus in this one remains that of query evaluation in probabilistic databases, the techniques we develop can be applied to general PGMs generated from any application, even when there are no shared correlations. Also, like the last chapter, we will refrain from assuming that we are given a first-order specification of the PGM (an assumption often made by many lifted inference algorithms [de Salvo Braz et al., 2005; Milch et al., 2008; Pfeffer et al., 1999; Poole, 2003; Singla and Domingos, 2008]). Part of our task is to develop techniques that automatically determine the first-order symmetry in PGMs and exploit it to speed up inference.

In the next section, we introduce a modified version of the running example from the last chapter, which will serve as the running example for this one. In Section 5.2 and Section 5.3, we present two techniques for approximate lifted inference. In Section 5.4, we discuss how to combine our ideas with bounded complexity inference techniques and, along with that, present a unified lifted inference engine that combines all the techniques. In Section 5.5, we evaluate our approaches on synthetic and real-world data and show that our techniques can achieve orders of magnitude speedup over standard ground inference procedures and the bisimulation-based exact lifted inference procedure introduced in the last chapter. We conclude with some pointers for future work in Section 5.6.


[Figure 5.2 appears here: random variables s1.B, s2.B, s3.B and t1.B, with prior factors fs1.B, fs2.B, fs3.B and ft1.B, connected to the result-tuple random variables i1.e, i2.e and i3.e through the factors fi1.e, fi2.e and fi3.e.]

Figure 5.2: PGM produced by the running example described in Figure 5.1.

5.1 Running Example

Similar to the running example introduced in Section 4.1, in Figure 5.1 we once again have a simple 2-relation join query. In this case, S contains 3 tuples, with the probability distribution associated with the new tuple's attribute, s3.B, being slightly different from those of s1.B and s2.B. As expected, the join produces three result tuples. Our task is now to compute the marginal probabilities associated with the assignments i1.e = true, i2.e = true and i3.e = true from the PGM shown in Figure 5.2. Figure 5.3 (a) shows the rv-elim graph obtained using the elimination order O = {s1.B, s2.B, s3.B, t1.B}, and Figure 5.3 (b) shows the corresponding compressed rv-elim graph on which we can now perform inference faster. Notice how µi1.e and µi2.e turn out to be identical (just like in the last chapter), but µi3.e is partitioned into a different node in the compressed rv-elim graph. This is because one of the factors that leads to the computation of µi3.e, i.e., fs3.B, is slightly different from the corresponding factors fs1.B and fs2.B. If our application could tolerate an adequate level of approximation in the marginals computed, then we may want to pool µi3.e along with the other two marginal distributions. This would result in a smaller compressed rv-elim graph and could lead to significant savings in computation time.

While the exact lifted inference approach described in the previous chapter can provide significant speedups when the probabilistic model contains moderate to large amounts of symmetry, in many cases we can do much better if we are willing to accept approximations in the marginal probability distributions computed. The main idea here is to explore looser versions of Claim 1 so that we can partition the vertices of the rv-elim graph into bigger blocks and thus arrive at a smaller compressed rv-elim graph. In what



Figure 5.3: (a) RV-Elim graph for the running example (vertices parti-tioned into 8 blocks, shading indicates partitioning), (b) correspondingcompressed rv-elim graph.

In what follows, we describe two separate and orthogonal generalizations of Claim 1 that can be used to implement approximate lifted inference. After that, we discuss how to combine our techniques with bounded complexity inference algorithms and, finally, we discuss how to combine all of our proposed ideas together into one single general-purpose approximate lifted inference engine.

5.2 Approximate Lifted Inference with Approximate Bisimulation

To introduce our first technique, we require some notation. Given a vertex- and edge-labeled graph G = (V, E, LV, LE), such as an rv-elim graph, let v0, . . . , vn denote an n-length vertex path such that ∀i = 0, . . . , n : vi ∈ V and ∀i = 0, . . . , n − 1 : ∃j s.t. vi →j vi+1 ∈ E (where →j denotes an edge with label j). Further, we say that label path, or simply path, l0(l′0)l1(l′1) . . . ln(l′n)ln+1 matches vertex path v0, . . . , vn+1 (and vice versa) if ∀i = 0, . . . , n + 1 : LV(vi) = li and ∀i = 0, . . . , n : LE(vi → vi+1) = l′i.
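As a small illustration of this definition, the following sketch checks whether a label path matches a vertex path; the encoding of the graph as dictionaries of vertex and edge labels is an assumption of ours for exposition and is not taken from our implementation:

def path_matches(vpath, vlabels, elabels, lpath_v, lpath_e):
    """vpath: vertex ids v0..vn+1; vlabels[v]: vertex label LV(v);
    elabels[(u, v)]: edge label LE(u -> v); lpath_v: labels l0..ln+1;
    lpath_e: edge labels l'0..l'n."""
    if len(lpath_v) != len(vpath) or len(lpath_e) != len(vpath) - 1:
        return False
    if any(vlabels[v] != l for v, l in zip(vpath, lpath_v)):
        return False  # some vertex label disagrees
    return all(elabels[(vpath[i], vpath[i + 1])] == lpath_e[i]
               for i in range(len(vpath) - 1))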

We will now revisit Claim 1 and try to assign it a path-based interpretation. Using a simple induction (and the fact that edges with the same head have distinct edge labels), it is possible to show that two vertices v1 and v2 in an rv-elim graph are bisimilar iff their incoming sets of paths from the roots are identical.


Figure 5.4: Results of running approximate bisimulation on the running example (shading indicates partitioning): (a) path-length=0 (partitioning on labels, vertices partitioned into 6 blocks); (b) path-length=1 (vertices partitioned into 7 blocks). At path-length=2, we obtain the results of exact bisimulation (see Figure 5.3 (a)).

For instance, in Figure 5.3 (a), recall that ms1.B ∼= ms2.B: both have the same set of incoming paths from the roots, {"a(1){[1],[2,1,3],1}", "c(2){[1],[2,1,3],1}"}; the matching vertex paths for ms1.B are fs1.B, ms1.B and fi1.e, ms1.B, resp., and the matching vertex paths for ms2.B are fs2.B, ms2.B and fi2.e, ms2.B, resp. Notice that this path-based interpretation of Claim 1 shows that it is a fairly stringent criterion (albeit necessary for exact inference). For instance, consider a case where two vertices deep in the rv-elim graph have large sets of long incoming paths and both sets are almost identical, except for one incoming path to the second vertex which contains a single label that does not allow it to match any incoming path to the first vertex; based on Claim 1, these two vertices would be placed in different blocks of the final partition, and the compressed rv-elim graph would be correspondingly bloated. This sort of behaviour is, in fact, on display in our running example, where µi2.e ≇ µi3.e simply because, of the three incoming paths to µi3.e, "b(1){[1],[2,1,3],1}(1){[1,2],[2],2}" (matching fs3.B, ms3.B, µi3.e) does not match any of µi2.e's incoming paths.

Instead of comparing the sets of all incoming paths to vertices, we propose to relax Claim 1 by comparing only the sets of incoming paths of length k (and shorter), where k is a tunable parameter we refer to as the path-length.


Algorithm 3: Approximate Lifted Inference with Approximate Bisimulation

input : RV-Elim Graph G = (V, E, LV, LE) and path-length k.
output: A disjoint partitioning over V.

d(v) = 0 if v is a root, else 1 + max{d(v′) | v′ → v ∈ E}
ρ ← max{d(v) | v ∈ V}
B0,l ← {v | d(v) = 0 ∧ LV(v) = l}
Bi ← {v | d(v) = i} ∀i = 1 . . . ρ
C ← {B0,l}∀l ∪ {Bi} for i = 1 . . . ρ
X ← C
for j = 1 . . . k do
    for i = 1 . . . ρ do
        foreach B ∈ C at depth i do
            order parents by block-ids in X
            construct labels LV(v) ∀v ∈ B
            construct key kv ∀v ∈ B with LV(v) and parent block-ids in X
            partition B based on keys kv
            replace B in C with the new blocks
        end
    end
    X ← C
end
return C

Our compression algorithm permits high compression when the path-length is set to a low value and approaches exact bisimulation as we increase it. Figure 5.4 (a) shows the result of partitioning the vertices of our example rv-elim graph with k set to 0, where we simply partition vertices based on their labels. Figure 5.4 (b) is more interesting: there we have set k to 1 and so compare incoming paths of length 1. Note how, in this case, ms3.B has been differentiated from ms1.B and ms2.B, since ms3.B has an incoming path "b(1){[1],[2,1,3],1}" (matching fs3.B, ms3.B) of length 1 which does not match any incoming 1-length path of ms1.B or ms2.B. In contrast, ms1.B, ms2.B and ms3.B were all placed into the same block in Figure 5.4 (a). Also notice that, in Figure 5.4 (b), µi1.e, µi2.e and µi3.e are still partitioned into the same block (leaf vertices tiled with bricks), and this is because the only path that differentiates µi3.e from µi1.e and µi2.e is a path of length 2 (vertex path fs3.B, ms3.B, µi3.e), which is beyond the scope of the current path-length setting of 1. This changes, however, when we set the path-length to 2 and obtain the results of exact bisimulation shown earlier in Figure 5.3 (a).


Figure 5.5: The compressed graph obtained at path-length=1, with the dotted edge being deleted since its tail has a smaller extent.

The partitioning based on comparing incoming k-length paths can be obtained by computing k-bisimilarity [Kanellakis and Smolka, 1983] (for which algorithms are available), since these two properties are equivalent (this can be proved by induction). We formalize the k-bisimilarity property as follows:

Property 1. Given an rv-elim graph G = (V, E, LV, LE), ∼=k is defined inductively. For vertices v1, v2 ∈ V:

• v1 ∼=0 v2 iff LV(v1) = LV(v2).

• v1 ∼=k v2 (k > 0) iff LV(v1) = LV(v2) and ∀u1 →i v1, ∃u2 →i v2 s.t. u1 ∼=k−1 u2, and vice versa.

The algorithm for obtaining the partition based on ∼=k, Algorithm 3, begins by computing the depth d(v) of each vertex and constructing an initial partition based on the labels of the roots and the depths of the internal vertices. Throughout Algorithm 3, we maintain two partitions, X and C. In the ith iteration, X maintains ∼=i−1 and is used to update C, where we construct ∼=i. Note that the inner two loops can be performed in O(|E| log D + |V|)


time (not counting the time spent to construct the vertex labels), where D is the maximum in-degree in the rv-elim graph. Thus, Algorithm 3 runs in O(k(|E| log D + |V|)) time (in contrast to Algorithm 1, which runs in O(|E| log D + |V|) time). Note that constructing the compressed rv-elim graph corresponding to ∼=k is a bit more complicated now, since we are no longer guaranteed that, if two internal vertices fall into the same block of the partition, then their parents will also have been placed into the same block (which holds for Claim 1). Figure 5.5 (the compressed graph obtained at k=1) illustrates this issue: all µ's have been merged into one block but their first parents are not, thus G has two first parents, D and F, which is problematic if we want to use the compressed graph to run inference. Here, we simply get rid of the edge that corresponds to the smaller-sized block (the dotted edge F → G in Figure 5.5, since F represents a block of size 1 versus D, whose block size is 2) so as to maximize the number of correct marginal probability computations.
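For concreteness, here is a minimal Python sketch of the partitioning loop of Algorithm 3. It is a simplification under stated assumptions: the graph is encoded as dictionaries of parents and labels (names are ours, not our implementation's), and the whole partition is refined k times using the previous round's blocks for parents, mirroring the role of X:

from collections import defaultdict

def approx_bisim_partition(vertices, parents, vlabel, elabel, k):
    """vertices: list of vertex ids; parents[v]: list of parent ids (empty
    for roots); vlabel[v]: vertex label; elabel[(u, v)]: edge label."""
    depth = {}
    def d(v):  # depth: 0 for roots, else 1 + max parent depth
        if v not in depth:
            depth[v] = 0 if not parents[v] else 1 + max(d(u) for u in parents[v])
        return depth[v]
    for v in vertices:
        d(v)
    # initial partition: roots split by label, internal vertices by depth
    block = {v: (('root', vlabel[v]) if depth[v] == 0 else ('depth', depth[v]))
             for v in vertices}
    for _ in range(k):  # k refinement rounds; old `block` plays the role of X
        new_block = {}
        for v in vertices:
            if depth[v] == 0:
                new_block[v] = block[v]  # roots stay in their label blocks
            else:
                # key: own label plus (edge label, parent block) pairs,
                # ordered canonically so equal keys mean matching parents
                pairs = sorted(((elabel[(u, v)], block[u]) for u in parents[v]),
                               key=repr)
                new_block[v] = (vlabel[v], tuple(pairs))
        block = new_block
    groups = defaultdict(list)
    for v in vertices:
        groups[block[v]].append(v)
    return list(groups.values())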

5.3 Approximate Lifted Inference with Factor Binning

We now introduce another way of implementing approximate lifted inference using an orthogonal generalization of Claim 1. We begin by associating with Claim 1 a distance-based interpretation. Recall that Claim 1 bins two factors into the same block of the partition when we can guarantee that their input-output mappings are exactly the same without actually computing them. Stated differently, given any user-defined distance measure that can measure the "distance" between two factors, Claim 1 deems that these factors belong to the same block only if the distance between them is zero. Note that the converse is not true. That is, it is possible for two internal vertices in the rv-elim graph to actually represent factors that comprise identical input-output mappings, but because their parents do not belong to the same blocks, or because the parents' arguments do not overlap in the same fashion, Claim 1 cannot bin these into the same block


of the partition. We illustrate this with the following example:

∑Y:  X Y | f1        ×   Y | f2       =   X | mY
     t t | 0.8           t | 0.5          t | 0.5
     t f | 0.2           f | 0.5          f | 0.5
     f t | 0.4
     f f | 0.6

∑Y′: X′ Y′ | f′1     ×   Y′ | f′2     =   X′ | mY′
     t  t  | 0.2         t  | 0.5         t  | 0.5
     t  f  | 0.8         f  | 0.5         f  | 0.5
     f  t  | 0.6
     f  f  | 0.4

where t and f denote true and false, resp. Notice how factors f1 and f′1 have different input-output mappings (f1(t, t) = 0.8 ≠ 0.2 = f′1(t, t)) and hence cannot be binned into the same block, which means that it is not possible to determine that the resulting factors mY and mY′ comprise the same input-output mappings solely using Claim 1. This, in turn, means that any intermediate factors derived from these two factors during the inference process will always be binned separately, thus leading to a bloated compressed rv-elim graph.

Such symmetries cannot be captured without actually looking into the factors and computing the distance between them (any distance measure, such as KL-divergence or root mean squared distance, would do). For this purpose, we ask the user for a separate parameter ε that specifies an upper bound on the distance between two factors for them to be considered shared. Note that, unlike in the previous algorithm, we cannot compute the distance between two intermediate factors without computing the factors.

To determine such a distance-based partitioning of the factors, we will need to solve the factor binning problem (FB):

Given: a set of factors 𝓕 = {f1, . . . , fn}, a threshold ε, and a distance function dist(·, ·)
Return: argmin_{F ⊆ 𝓕} |F| such that ∀fi ∈ 𝓕 \ F, ∃f ∈ F s.t. dist(fi, f) ≤ ε

We will shortly show that the factor binning problem is equivalent to


the dominating set problem (DS):

Given: a graph G with vertex set V and edge set E; denote by Nv the neighborhood of vertex v
Return: argmin_{D ⊆ V} |D| such that ∀vi ∈ V \ D, ∃v ∈ D s.t. v ∈ Nvi

Theorem 1. FB is equivalent to DS.

Proof. The proof is in two parts: we first show that any instance of FB can be reduced to DS, and then vice versa. To show the first part, we specify the reduction to DS. Given an instance of FB, define the corresponding DSFB by setting:

DSFB : V = 𝓕, Nfi = {fi} ∪ {f | dist(fi, f) ≤ ε}

Note that any solution to DSFB is a solution to FB. We show this by contradiction. Suppose solution D to DSFB is not a solution to FB; in other words, ∃fi ∈ 𝓕 \ D s.t. dist(fi, f) > ε ∀f ∈ D. This implies Nfi ∩ D = ∅, which means that D is not a solution to DSFB, and thus we have a contradiction. Similarly, any solution to FB is a solution to DSFB. Again, assume that solution F to FB is not a solution to DSFB. Thus, ∃fi ∈ 𝓕 \ F s.t. Nfi ∩ F = ∅. This implies dist(fi, f) > ε ∀f ∈ F, which means F is not a solution to FB, and we have a contradiction. Given that the solution spaces of FB and DSFB are the same, and that the objective functions are also the same, we have shown that FB can be solved by solving DSFB.

The reduction in the other direction is also easy. Given an instance of DS, define the corresponding FBDS by setting:

FBDS : 𝓕 = V, ε = 0, dist(vi, vj) = 0 if (vi, vj) ∈ E, and 1 otherwise

Let D denote a solution to DS that is not a solution to FBDS:

⇒ ∃vi ∈ V \ D s.t. dist(vi, v) > ε = 0 ∀v ∈ D
⇒ ∃vi ∈ V \ D s.t. dist(vi, v) = 1 ∀v ∈ D
⇒ ∃vi ∈ V \ D s.t. (vi, v) ∉ E ∀v ∈ D
⇒ ∃vi ∈ V \ D s.t. Nvi ∩ D = ∅
⇒ D is not a soln. to DS, and we have a contradiction


Algorithm 4: Approximate Lifted Inference with Factor Binning

input : RV-Elim Graph G = (V, E, LV, LE), a distance function and ε.
output: A disjoint partitioning over V.

d(v) = 0 if v is a root, else 1 + max{d(v′) | v′ → v ∈ E}
ρ ← max{d(v) | v ∈ V}
B0,l ← {v | v is a root ∧ LV(v) = l}
instantiate one factor per block B0,l
Bds_0 ← compute dominating set and construct a new set of blocks by merging {B0,l}
C ← Bds_0
Bi ← {v | d(v) = i}, ∀i = 1 . . . ρ
for i = 1 . . . ρ do
    foreach v ∈ Bi do
        order parents by block-ids
        construct label LV(v)
        construct key kv with LV(v) and parents' block-ids
    end
    Bi,k ← {v ∈ Bi | kv = k}
    instantiate one factor per new block Bi,k
    Bds_i ← compute dominating set and construct a new set of blocks by merging {Bi,k}
    C ← C ∪ Bds_i
end
return C

Also, trying it the other way round, let F denote a solution to FBDS that is not a solution to DS:

⇒ ∃vi ∈ V \ F s.t. Nvi ∩ F = ∅
⇒ ∃vi ∈ V \ F s.t. dist(vi, v) = 1 > 0 = ε ∀v ∈ F
⇒ F is not a soln. to FBDS, and we have a contradiction

DS is NP-Complete [Garey and Johnson, 1979]. Further, Feige [1998] showed that DS is not approximable to a factor of (1 − o(1)) ln(|V|) unless NP has "slightly super-polynomial time" algorithms (or NP ⊂ DTIME(n^(log log |V|))). One way to solve DS is to utilize the fact that it is a special case of set cover


and use the obvious greedy heuristic (described below) for set cover. This gives us an ln(|V|)-approximation algorithm [Vazirani, 2001]. Thus, for our experiments, we use the same greedy approach to solve FB. FB is also equivalent to the ρ-dominating set problem [Bar-Ilan et al., 1993], which, in turn, is the converse of the classic k-center problem [Kariv and Hakimi, 1979], where we are given a graph from which we need to choose a subset of k vertices so that their distance from the other vertices is minimized.

Even though the above discussion suggests that FB is hard to solve, the situation is not so dire. When the distance function satisfies special properties, better algorithms may be available, but this requires us to "tweak" the definition of FB. For instance, when dealing with Euclidean spaces, there are algorithms that can solve the minimum geometric disk cover problem (GDC) near-optimally. GDC is posed as follows: consider a set of points in some high-dimensional space. Our task is to return the smallest set of points (centers) from the space such that each input point is within distance r of a center. Hochbaum and Maass [1985] describe approximation algorithms for this problem with bounds on their (non-)optimality. Note that the bounds depend on the dimensionality of the space; the algorithms work better in low-dimensional spaces. Now consider the following reduction of a modified version of FB to GDC. We will assume that each input factor has d rows. We will interpret each input factor as a point in a d-dimensional space, where the coordinates of the corresponding point are given by the outputs of the factor. Note that this is a one-to-one mapping from points to factors. Once we solve GDC on these points, we can get back the factors corresponding to the returned centers. Note that this is not the same definition of FB as before, since the centers need not correspond to any of the input factors. Also, to make this work desirably, one may need to normalize the outputs of the factors in some sensible way.

The algorithm to obtain the greedy solution for FB is to first construct each subset Nfi (as defined above) and repeatedly pick the fi corresponding to the currently largest Nfi to include in our solution. Every time we pick fi, we update all the Nfj's by deleting from them all factors that are within ε distance of fi. Another question we need to consider is whether to bin factors based on distance once and then run approximate lifted inference, or whether to also bin the intermediate factors based on distance. For our experiments, we also binned the intermediate factors, since this allows us to compress the rv-elim graph more aggressively. Algorithm 4 shows the complete algorithm to run approximate lifted inference using FB.
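A minimal sketch of this greedy heuristic follows, assuming factors are encoded as dicts from assignments to values over a common domain and using the root mean squared distance described in Section 5.5.3; all names are illustrative and not taken from our implementation:

import math

def rms_dist(f1, f2):
    """Root mean squared distance between two factors on a common domain."""
    D = list(f1.keys())
    return math.sqrt(sum((f1[x] - f2[x]) ** 2 for x in D) / len(D))

def greedy_factor_binning(factors, eps, dist=rms_dist):
    """Return indices of representative factors such that every factor
    is within eps of some representative (greedy set-cover style)."""
    n = len(factors)
    # N_fi = {fi} ∪ {f | dist(fi, f) <= eps}
    nbr = [{j for j in range(n) if j == i or dist(factors[i], factors[j]) <= eps}
           for i in range(n)]
    uncovered, reps = set(range(n)), []
    while uncovered:
        # pick the factor whose neighborhood covers the most uncovered factors
        best = max(range(n), key=lambda i: len(nbr[i] & uncovered))
        reps.append(best)
        uncovered -= nbr[best]
    return reps

Computing all pairwise distances up front is the quadratic-time step whose cost shows up in the experiments of Section 5.5.3.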


5.4 Unified Lifted Inference

The approximation techniques we have introduced so far do not alleviate the worst-case complexity of the inference procedure. In other words, these techniques would not help if the ground inference procedure is associated with high treewidth (common with structured probabilistic graphical models). We next show how to incorporate the mini-bucket scheme [Dechter and Rish, 2003], a bounded complexity approximate (ground) inference algorithm; this allows us to keep tight control over the complexity of inference incurred. We then discuss how to combine all the ideas proposed in this chapter to construct one single unified lifted inference engine.

5.4.1 Bounded Complexity Lifted Inference

The mini-bucket scheme is a modification of the variable elimination algorithm [Zhang and Poole, 1994] where, at each step, instead of eliminating a random variable by multiplying all the factors in which it appears as an argument, one devises a set of mini-buckets, each containing a (disjoint) subset of the factors that contain that variable as an argument, and then eliminates the variable separately from each mini-bucket. More precisely, given a set of factors, one first constructs a canonical partition such that all subsumed factors are placed into the same bucket of the partition. A factor f is said to be subsumed by factor f′ if every argument of f is also an argument of f′. After constructing the canonical partition, the user has two choices:

• construct mini-buckets by restricting the total number of arguments i (a user-defined parameter) in each mini-bucket. Since inference complexity is directly affected by the size of the largest factor encountered, this is one way to control the amount of computation incurred.

• specify how many buckets m of the canonical partition to merge to form a mini-bucket. Again, this (indirectly) controls the size of the largest factor generated and keeps the complexity bounded.

Dechter and Rish [2003] show how such a modification of the variable elimination algorithm provides an upper bound on the numbers produced in the resulting factors.
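As a rough illustration of the first choice, the sketch below greedily packs the factors mentioning a variable into mini-buckets of at most i total arguments. It is a simplification under stated assumptions (it skips the canonical partition, and factors are encoded simply as frozensets of their argument names); it is not our implementation:

def mini_buckets(factors, var, i):
    """factors: iterable of frozensets of argument names; var: the variable
    being eliminated; i: max number of distinct arguments per mini-bucket."""
    buckets = []  # each bucket is a pair (set of arguments, list of factors)
    for f in factors:
        if var not in f:
            continue  # only factors mentioning var go into its mini-buckets
        for args, fs in buckets:
            if len(args | f) <= i:  # fits without exceeding the argument limit
                args |= f
                fs.append(f)
                break
        else:
            buckets.append((set(f), [f]))
    return buckets

# e.g. eliminating Y with i = 2 splits {X,Y},{Y} from {Y,Z} into two buckets:
# mini_buckets([frozenset('XY'), frozenset('Y'), frozenset('YZ')], 'Y', 2)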

It is easy to combine our approaches with the mini-bucket scheme. Instead of building the rv-elim graph by introducing internal vertices corresponding to the intermediate factors produced by multiplying all the factors involving a certain random variable as an argument, we simply introduce vertices corresponding to the factors produced by the mini-bucket scheme. Since our approaches work on any rv-elim graph, this requires no change to the approaches presented earlier, while keeping the complexity of inference bounded.


Parameter Name                  Description
UB (bisimulation)               compresses rv-elim graphs if true
PL (path length)                approximate bisimulation parameter; uses exact bisimulation when set to ∞
ε                               factor binning parameter; uses factor binning if ε > 0
UMB (mini-bucket)               allows using mini-buckets if true
ACR (arg. count restriction)    if true, restricts based on the number of arguments in mini-buckets
MBR (mini-bucket restriction)   if ACR=true then this is i (the max number of args per mini-bucket), else it is interpreted as m (the number of canonical partition buckets merged to form a mini-bucket)

Table 5.1: Parameters for our unified lifted inference engine.

5.4.2 Unified Lifted Inference Engine

By interleaving the various steps, it is possible to combine all the ideas we have presented in this chapter into one unified approximate lifted inference engine. Our combined inference engine takes a handful of parameters which define the combination of techniques we would like to invoke (see Table 5.1). The experiments presented in the next section use this generic inference engine.
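To give a feel for how these knobs compose, here is one hypothetical way to bundle them; the dictionary keys mirror Table 5.1 and the values correspond to a setting used in Section 5.5.4, but the names are illustrative rather than the actual interface of our implementation:

engine_params = {
    "UB":  True,   # compress rv-elim graphs via bisimulation
    "PL":  3,      # path-length for approximate bisimulation (∞ would be exact)
    "eps": 0.01,   # factor binning threshold; 0 disables binning
    "UMB": True,   # use the mini-bucket scheme
    "ACR": True,   # restrict mini-buckets by argument count
    "MBR": 3,      # interpreted as i (max args per mini-bucket) since ACR=True
}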

5.5 Experimental Evaluation

Through our experiments, we want to emphasize that even though our focus is query evaluation in probabilistic databases, our techniques work for any PGM. We conducted experiments on synthetic and real data to determine how lifted inference with approximate bisimulation and factor binning perform on their own. We also report experiments with our unified lifted inference engine, where we used both approaches in tandem. Each number we report is an average over 3 runs; our implementation is in JAVA, and our experiments were performed on a machine with a 3GHz


Xeon processor and 3GB of RAM. We compare our results with two baseline algorithms: a ground inference procedure, which is basically variable elimination [Zhang and Poole, 1994] modified so that we obtain all marginals in a single pass, and the exact lifted inference procedure introduced in Chapter 4. We report two metrics for each experiment: the run times incurred by the various algorithms in seconds (Time), and the error, measured by computing the average number of marginal probabilities which were not within 10^−8 of their correct values (Avg. #Probs. Incorrect).

5.5.1 Synthetic Bayesian Network Generator

We set up a synthetic Bayesian network (BN) generator to test various aspects of our algorithms. The generator produces BNs where the random variables are organized in layers and random variables from the ith layer randomly choose parents from the (i−1)th layer. For our experiments, we generated BNs with 3 layers: the first layer contained 1000 random variables, the second 500 and the third 250. We introduced priors randomly for each variable in the first layer; every 25th prior was identical. The random variables in the last layer are our query variables, for which we computed marginal probabilities. All random variables had domains of size 30. To generate the factors defining the dependency between random variables from the ith and (i−1)th layers, for each variable in the ith layer we randomly chose 2 parents from the previous layer. Two children can choose the same parents, so we generated non-tree-structured BNs. All factors with children from the ith layer are identical. This closely follows many structured probabilistic graphical models we have come across, where the priors usually closely resemble each other but may not be identical, whereas the factors defining dependencies between the various random variables come from generic rules and are thus identical. We used a parameter to control how many times a random variable can be picked as a parent; this helps vary the complexity of the inference problem. We also used a parameter to add random noise after the factors are generated. We tried other parameter settings as well, and the trends were as expected. For instance, increasing the domain size increases the speedups obtained, since with larger domains we increase the time spent summing over random variables and multiplying factors while running ground inference; our lifted inference procedures are designed to save on this, assuming the symmetry among factors is kept constant. Similarly, increasing the number of random variables with constant symmetry also increases the speedups obtained.


Figure 5.6: Experimental results for lifted inference with approximate bisimulation: (a) and (b) report time (sec) and error (avg. % of probabilities incorrect) with varying path-length, respectively, for approximate bisimulation (AB) alone and AB with various mini-bucket settings. Variable elimination took 25.95 sec and exact lifted inference took 8.2 sec.

5.5.2 Lifted Inference with Approximate Bisimulation

Our first set of experiments tests our algorithm for lifted inference with approximate bisimulation. The results are reported in Figure 5.6 (a) and Figure 5.6 (b). The plots show that as we increase the path-length (the x-axis in these plots), the time for inference (Figure 5.6 (a)) slowly increases, but the error decreases (Figure 5.6 (b)). The solid line with triangles depicts the results of running lifted inference with approximate bisimulation without mini-buckets; with path-length set to 3, we see that the error stands around 18%, and the inference procedure took about 3 seconds to run, which is almost a 3 times speedup over exact lifted inference (which took 8.2 seconds) and almost a 9 times speedup over ground inference (which took 25.95 sec). All the other lines in the plots correspond to lifted inference with approximate bisimulation run with various mini-bucket schemes. Among these, the mini-bucket scheme with mini-buckets restricted by argument count at i = 3 seems to be a promising setting (dotted line with triangles), since it runs faster than lifted inference with approximate bisimulation but does not incur significantly higher error. Another interesting point that shows up in these plots is that with mini-buckets with i = 4 or m = 2 at path-length set to 3, the time taken to run inference goes up noticeably. This shows that at very low path-lengths, using mini-buckets could actually lead to loss of symmetry in the rv-elim graph.


Figure 5.7: Experimental results for lifted inference with factor binning: (a) and (b) report time (sec) and error (avg. % of probabilities incorrect) with varying ε, respectively, for factor binning (FB) alone and FB with various mini-bucket settings. Variable elimination took 33.12 seconds and exact lifted inference took 25.24 seconds.

5.5.3 Approximate Lifted Inference with Factor Binning

Our second set of experiments tests our factor binning approach. The results are shown in Figure 5.7 (a) and Figure 5.7 (b). For these experiments, we used the root mean squared distance to compare two factors. More precisely, given two factors f1 and f2 with a common joint domain D, dist(f1, f2) = √((1/|D|) ∑_{x∈D} (f1(x) − f2(x))²). The plots show that as we increase ε (on the x-axis), the times for inference go down (Figure 5.7 (a)) and the error goes up (Figure 5.7 (b)). In these experiments, ground inference took about 33 seconds and exact lifted inference took 25.24 seconds, which means factor binning without mini-buckets (solid line with triangles) achieves a speedup of about 3.5 times over exact lifted inference, and a speedup of almost 5 times over ground inference. Among the various mini-bucket schemes, once again i = 3 (dotted line with triangles) seems to be the best setting; it gives small but noticeable reductions in run-times at almost no cost to accuracy. Notice that mini-buckets with small settings of either m or i tend to perform very poorly, neither giving good accuracies nor providing good run-times; this is likely due to the sheer number of factors with which we are dealing. At such small settings, the mini-bucket scheme produces a lot of factors, and computing the dominating set (which has a quadratic time complexity) becomes too expensive.


Figure 5.8: Experimental results for the unified lifted inference engine. (a) depicts how variable elimination, exact lifted inference and approximate lifted inference compare with respect to time (sec) for varying numbers of random variables in the PGM. (b) shows, by plotting time (sec) against the probability of two factors being identical, how unified lifted inference does not require that factors be exactly identical.

5.5.4 Unified Lifted Inference Engine

In our last set of experiments, we used both approximate bisimulation (path-length=3) and factor binning (ε = 0.01) with mini-buckets (restricted by argument count i = 3). Here we report run-times for probabilistic models with a varying number of random variables. The results are reported in Figure 5.8 (a) and Figure 5.8 (b). As should be clear from Figure 5.8 (a), with increasing size of the probabilistic model, all three inference procedures (ground inference, exact lifted inference and approximate lifted inference) show an increase in run-time, but there is an order of magnitude difference in times between ground inference and exact lifted inference (which partitions identical factors together), and another order of magnitude speedup over exact lifted inference for approximate lifted inference (which also bins nearly identical factors together), while keeping the accuracy within bounds. Thus, approximate lifted inference is more than two orders of magnitude faster than ground inference. The accuracies for approximate lifted inference in these experiments varied between 65-95%. For these complex networks, we could not run ground inference on models with more than 256 random variables due to memory limitations. Figure 5.8 (b) makes it clear how the run-times of exact lifted inference and approximate lifted inference compare. Here we set all the priors in our probabilistic model similar to each other but varied the probability of two factors being identical. The plot shows that as this probability increases, exact lifted inference captures the symmetry and does better, whereas approximate lifted inference keeps run-times low throughout.


Dataset    Inf. Alg.     Time (s)   Arith. Ops.   Rem. Ops.   Acc.
Cora       Ground Inf.   163.5      163           0.5         77.8
           Lifted Inf.   60.6       59.9          0.7         73
CiteSeer   Ground Inf.   101.0      100.8         0.2         68.7
           Lifted Inf.   65.0       63.9          1.1         66.8

(a)

Figure 5.9: (a) Times for Cora and CiteSeer. (b) Precision-Recall curve for Cora-ER and (c) number of factors generated.

5.5.5 Experiments on Real-World Data

We experimented with a number of real-world datasets. We first report results on the Cora [McCallum et al., 2000] and CiteSeer [Giles et al., 1998a] datasets. The Cora dataset contains 2708 machine learning papers with 5429 citations; each paper is labeled with one of seven topics. The CiteSeer dataset consists of 3316 publications with 4591 citations; each paper is labeled with one of 6 topics. The task is to predict the correct topic label of the papers. We divided each dataset into three roughly equal splits and performed three-fold cross validation. For each experiment, we train on two splits and test on the third, randomly choosing 10% of the papers' class labels to be our query nodes. Each number we report is an average across all splits. For these experiments, we produce Markov networks using the citations in the datasets as dependencies among the topic labels and then


perform collective classification [Sen et al., 2008b]. Note that the treewidth of the Markov networks thus produced can be unbounded, so we compare against ground inference with mini-buckets restricted to 6 arguments. Also, while testing on the third split, we include as evidence the topic labels of the papers belonging to the training set linked to from the test set. We tried various parameter settings with our approximate lifted inference engine and report the best results. As Figure 5.9 (a) shows, we obtained a 2.7 times speedup for Cora and a 1.55 times speedup for CiteSeer with our approximate lifted inference engine over ground inference. The loss in accuracy was 4.8% for Cora and 1.9% for CiteSeer. These results were obtained with path length = 2, ε = 0.01, and using mini-buckets restricted to 6 arguments. We also show how much time was spent by each inference scheme to multiply factors and sum over random variables (arithmetic operations, or "Arith. Ops." in Figure 5.9 (a)) and on the remaining operations ("Rem. Ops." in Figure 5.9 (a)). As should be clear from Figure 5.9 (a), the various operations required to implement lifted inference (bisimulation algorithms and dominating set computations) do not really add much overhead; we spend about 0.7 − 0.5 = 0.2 seconds for Cora and 1.1 − 0.2 = 0.9 seconds for CiteSeer (column "Rem. Ops." in Figure 5.9 (a)).

We also experimented with the Cora dataset for entity resolution (Cora-ER) [Cora Entity Resolution Dataset]. For this experiment, we used a Markov logic network with 46 distinct rules. Unfortunately, we could not get any noticeable speedup for this dataset. This dataset consists solely of random variables with domain size 2 (match/non-match). As a result, all the factors produced are extremely small in size (the size of a factor is determined by the number of rows in it), which implies that the time spent performing arithmetic operations (multiplying factors and eliminating random variables) is not the bottleneck during inference. The techniques proposed in this chapter are mainly directed towards reducing the time spent performing arithmetic operations. However, we do present the precision-recall curve we obtained for Cora-ER (Figure 5.9 (b); increasing the argument count restriction for the mini-bucket scheme reduces precision but increases recall), and we also counted the number of intermediate factors computed by ground and lifted inference for various samplings of the dataset consisting of 50-250 bibliographic citations to be deduplicated. Figure 5.9 (c) shows that lifted inference produces far fewer intermediate factors during inference than ground inference; recall that ground inference produces an intermediate factor every time a random variable is eliminated, but lifted inference saves on this computation by computing one factor for each block in the final partitioning. This, in turn, indicates that the dataset possesses


symmetry which could lead to speedups if the domain sizes of the random variables and factors were large. Note that Figure 5.9 (c) also gives an idea of the reduced memory consumption of lifted inference.

5.6 Conclusion and Future Work

In this chapter, we described light-weight, generally applicable approximate algorithms for lifted inference based on the graph-theoretic concept of bisimulation. Essentially, our techniques are wrap-arounds for variable elimination [Zhang and Poole, 1994] and can be used whenever variable elimination is applicable, even though our focus in this thesis happens to be probabilistic databases. Besides being able to compute single-node marginal probabilities, the techniques we propose here can also be used to perform other kinds of inference, including computing joint conditional probabilities and MAP assignments (by switching from the sum-product operator to max-product). One interesting avenue of future work is to look for other bounded complexity inference algorithms (besides mini-buckets) that can be combined with the techniques introduced in this chapter. Other avenues of future work are determining the optimal values of the various parameters (path-length and ε) automatically, and building the compressed rv-elim graph directly from the first-order description of the probabilistic model.


Chapter 6

Read-Once Functions and Probabilistic Databases

Until now, we have discussed a general representation for probabilistic databases (Chapter 3) and efficient large-scale query processing (Chapter 4 and Chapter 5). In this chapter, we take a closer look at the query evaluation problem itself. How hard is it to evaluate a single query or, more specifically, to compute the marginal probability of a single result tuple, and what can we do to do this efficiently? Recall that, for probabilistic databases based on possible worlds semantics, query evaluation is #P-Complete. Most of the probabilistic databases proposed in the prior literature use one of two approaches to circumvent this issue: they either resort to approximate results using approximate inference [Jampani et al., 2008; Re et al., 2007] or they restrict their attention to a smaller class of tractable queries for which efficient evaluation is possible (hierarchical queries or safe plans [Dalvi and Suciu, 2007, 2004]).

In this chapter, we take a closer look at the latter approach. Hierarchical queries [Dalvi and Suciu, 2007] are a purely query-centric way of determining whether a query posed on a tuple-independent probabilistic database can be solved efficiently or not. More precisely, if the query q satisfies a certain criterion (defined in [Dalvi and Suciu, 2007] and reviewed in the next section), then it can be evaluated in PTIME. Since the criterion only looks at the query and does not involve the database, it stands to reason that if the query is hierarchical, then it can be evaluated efficiently for any tuple-independent probabilistic database. As the reader may have already guessed, this is quite a pessimistic way of determining the solvability of queries. Usually, the user is more interested in evaluating her/his query on


the database at hand and not on all possible databases; in other words, the final evaluation problem is a result of the combination of the query and the database, not just the query alone, and we need to leverage both query and data if we are to evaluate queries in the most efficient way possible.

In this chapter, we view the problem of evaluating queries on probabilistic databases with tuple-level uncertainty at a different level of abstraction. It is straightforward to show that in such databases, the PGM corresponding to the query evaluation problem can also be represented using result-tuple-specific boolean formulas, and the query evaluation problem reduces to computing the marginal probability of the boolean formula holding true. Prior research performed by the graph theory community has shown that if the boolean formula can be factorized into a form where every boolean variable (a tuple-level existence variable, in our case) appears not more than once, then one can compute the marginal probability for the formula extremely efficiently. Boolean formulas that have such a factorization are known as read-once functions [Golumbic et al., 2006; Hayes, 1975]. It is also possible to show that hierarchical queries only produce result tuples with read-once functions, thus providing a connection to efficient query evaluation in probabilistic databases. In this chapter, we propose to turn the previous approach to efficient query evaluation on its head. Instead of adopting a query-centric approach, we evaluate the user-submitted query and propose algorithms that generate, for each result tuple, its factorized form (if it exists), so that we can compute the required marginal probabilities efficiently. With this approach, we allow efficient computation not only for hierarchical queries, but also for non-hierarchical queries that produce result tuples with read-once functions on the given database.

Another issue with hierarchical queries is that their definition involves the specific operators used in the query. As reviewed in Chapter 2, most of the work on hierarchical queries [Dalvi and Suciu, 2007, 2004] almost exclusively deals with equality joins∗. Presumably, extending the approach to deal with other operators requires effort, and dealing with queries composed of different operators is even more cumbersome. On the contrary, our approach of treating result tuples as boolean formulas allows us to restrict our attention to only two operators, ∧ (and) and ∨ (or). We do not care what kind of join operator (equality, inequality or anything else) gave rise to the boolean formula associated with the result tuple. Thus, our techniques are likely to be wider in range than earlier work on efficient query evaluation.

∗Olteanu and Huang [2009, 2008] are notable exceptions.


Here we restrict ourselves to the simpler case of probabilistic databases with tuple-level uncertainty only. The more difficult case of databases with attribute and tuple uncertainty will be left open, although some proposals in the machine learning community [Darwiche, 2002] may help in this regard (see Chapter 2 for more discussion). In the next section, we review probabilistic databases with tuple-level uncertainty and discuss how to generate boolean formulas for result tuples while evaluating queries. In Section 6.2, we discuss read-once functions. In Section 6.3, we discuss hierarchical queries and show how they always generate read-once functions. In Section 6.4, we propose our approach to solving queries by generating read-once functions in probabilistic databases. In Section 6.5, we consider the special case of conjunctive queries without self-joins, allowing for non-equality predicates. We conclude the chapter with Section 6.7, after a discussion in Section 6.6.

6.1 Preliminaries

Most of the notation remains the same, but since we are dealing with a simpler model of a probabilistic database, some changes/simplifications are in order. As before, let R denote a relation defined over a set of attributes attr(R) = {A1, A2, . . . , A|attr(R)|}, and each tuple t ∈ R is a mapping from attributes to values from some pre-defined domain. We associate with t a unique (boolean-valued) random variable denoted by xt and a probability of existence pt. Often, if it is clear from the context, we will abuse notation and refer to the tuple's random variable by the tuple itself. Possible worlds semantics remains the same: a (probabilistic) database D = {R1, . . . , Rm} is a set of relations and represents a distribution over many possible worlds, each obtained by choosing a (sub)set of tuples in each relation Ri to be present. If a tuple t is present, we say xt is assigned the value true or t, and false or f otherwise. Each possible world w is associated with a probability:

Pr(w) = ∏_{i=1}^{m} ( ∏_{t∈Ri : xt=t} pt ) ( ∏_{t∈Ri : xt=f} (1 − pt) )
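As a direct transcription of this product (assuming, for illustration only, that each relation is encoded as a list of (pt, present) pairs):

def world_probability(database):
    """database: list of relations; each relation is a list of (p_t, present)
    pairs, where `present` says whether the tuple exists in world w."""
    pr = 1.0
    for relation in database:
        for p_t, present in relation:
            pr *= p_t if present else (1.0 - p_t)
    return pr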

Given a query q to be evaluated against database D, the result of the query is defined to be the union of the results returned by each possible world, along with the marginal probabilities of each result tuple. Since we are dealing exclusively with uncertain tuples, one way to compute the marginal probability of a result tuple t produced by (relational algebra) query q is to


f(t ∈ R) = xt
f(σc(t)) = if c(t) then f(t) else f
f(∏(t1, . . . , tk)) = ∨_{i=1}^{k} f(ti)
f(t × t′) = f(t) ∧ f(t′)

Figure 6.1: Extended definitions for σc, ×, ∏, where c denotes a selection predicate; t denotes true and f denotes false.

L:  X        J:  X  Y        R:  Y
x1: x1       z1: x1 y1       y1: y1
x2: x2       z2: x1 y2       y2: y2
x3: x3       z3: x2 y3       y3: y3
             z4: x3 y3

q() :− L(X), J(X,Y), R(Y)
r = x1z1y1 + x1z2y2 + x2z3y3 + x3z4y3

Figure 6.2: A query q, its singleton result r and the corresponding boolean formula.

extend each (relational algebra) operator in q so that it builds a boolean formula for each (intermediate) tuple generated during query evaluation. We refer to the boolean formula for t by f(t). Figure 6.1 provides these extended definitions for the operators σ, × and ∏. The marginal probability of the result tuple can then be obtained by computing the probability of the corresponding boolean formula holding true. Figure 6.2 shows a three-relation join query which produces a singleton result tuple r and the corresponding result tuple's boolean formula.
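For concreteness, a minimal sketch of the extended operators of Figure 6.1 follows, with boolean formulas encoded as nested tuples; the encoding and the function names are our own illustrative assumptions:

def f_base(xt):                  # f(t ∈ R) = xt
    return ('var', xt)

def f_select(c_holds, f_t):      # f(σc(t)) = f(t) if c(t) holds, else false
    return f_t if c_holds else False

def f_project(formulas):         # f(∏(t1, ..., tk)) = ∨i f(ti)
    return ('or', *formulas)

def f_join(f_t, f_t2):           # f(t × t') = f(t) ∧ f(t')
    return ('and', f_t, f_t2)

# e.g. the result tuple r of Figure 6.2: x1z1y1 + x1z2y2 + x2z3y3 + x3z4y3
r = f_project([f_join(f_join(f_base(x), f_base(z)), f_base(y))
               for x, z, y in [('x1', 'z1', 'y1'), ('x1', 'z2', 'y2'),
                               ('x2', 'z3', 'y3'), ('x3', 'z4', 'y3')]])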

6.2 Read-Once Functions: Unateness, P4-Free and Normality

Even though computing the marginal probability of a result tuple's boolean formula holding true is #P-Complete in general [Dalvi and Suciu, 2004], special classes of boolean formulas allow tractable computations. Read-once functions are one such class of formulas:

Definition 8 (Read-Once Function [Hayes, 1975]). A boolean formula φ is said to be read-once if there exists a factorization such that each variable appears not more than once.


Figure 6.3: (i) φ = ab + bc + cd; the co-occurrence graph is a P4 and φ is not read-once. (ii) φ = c(ab + d); the co-occurrence graph is P4-free and φ is read-once. (iii) The co-occurrence graph for r from Figure 6.2.

Further, the read-once factorized form of the boolean formula is known as its read-once expression. For instance, r in Figure 6.2 is a read-once function with the read-once expression x1(z1y1 + z2y2) + y3(x2z3 + x3z4). Prior work [Golumbic et al., 2006] has identified the properties that a formula should satisfy for it to be read-once:

Theorem 2 ([Golumbic et al., 2006]). A boolean formula is read-once iff it is unate, P4-free and normal.

We describe each property in turn. A boolean formula φ is said to be unate (or monotone) [Golumbic et al., 2006] if every variable appears either in its positive form or in its negated form throughout. Thus, ab and ab + ac are unate, but a formula in which some variable appears both positively and negated, such as ab + (¬a)c, is not.

For any boolean formula φ, the co-occurrence graph Gφ is formed by representing every variable in φ by a vertex and introducing an undirected edge between variables xi and xj if they appear together in some clause when φ is expressed in disjunctive normal form (dnf). Let X denote a subset of vertices; then the subgraph of G induced by X is the subgraph formed by restricting the edges of G to edges with both end points in X. The graph P4 denotes a chordless path with 4 vertices and 3 edges (see Figure 6.3 (i)). φ is P4-free if no induced subgraph of Gφ forms a P4. Figure 6.3 (i) and (ii) show two formulas, one of which is not read-once because it contains a P4; Figure 6.3 (iii) shows the co-occurrence graph for the result tuple r from Figure 6.2. Notice that in Figure 6.3 (ii), even though a, b, c and d do form a path of length 3, they do not form a P4, because a and c have an edge between them that provides a chord.
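To make the definitions concrete, the following brute-force sketch builds the co-occurrence graph from a dnf (given as a list of clauses, each a set of variables) and tests P4-freeness by enumeration; linear-time cograph recognition algorithms exist (see Section 6.4), so this is purely illustrative:

from itertools import combinations, permutations

def cooccurrence_graph(dnf):
    """dnf: list of clauses, each a set of variable names."""
    edges = set()
    for clause in dnf:
        edges |= {frozenset(e) for e in combinations(clause, 2)}
    return edges

def is_p4_free(variables, edges):
    adj = lambda u, v: frozenset((u, v)) in edges
    for quad in combinations(sorted(variables), 4):
        for a, b, c, d in permutations(quad):
            # chordless path a-b-c-d: exactly the three path edges, no chords
            if (adj(a, b) and adj(b, c) and adj(c, d)
                    and not adj(a, c) and not adj(a, d) and not adj(b, d)):
                return False
    return True

# Figure 6.3 (i): ab + bc + cd contains a P4, so it is not read-once
print(is_p4_free('abcd', cooccurrence_graph([{'a','b'}, {'b','c'}, {'c','d'}])))  # False
# Figure 6.3 (ii): c(ab + d) has dnf abc + cd, whose graph is P4-free
print(is_p4_free('abcd', cooccurrence_graph([{'a','b','c'}, {'c','d'}])))         # True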

A formula φ is said to be normal (or clique-maximal) if every clique in its co-occurrence graph is contained in some clause of its dnf form [Golumbic et al., 2006]. For instance, even though the two formulas φ1 = abc and φ2 = ab + bc + ca both have the same co-occurrence graph (the triangle), φ1 is normal (and read-once) whereas φ2 is not.


Figure 6.4: Co-tree of the result tuple r from Figure 6.2.

Traditionally, co-trees [Corneil et al., 1981] have been used to concisely represent the read-once expressions of read-once functions. Co-trees are trees whose leaves correspond to boolean variables, while an internal node labeled ① represents ∧ and one labeled ⓪ represents ∨. A given read-once expression can be represented by many co-trees, but there exists a canonical co-tree, in which ① and ⓪ alternate on every path. Given the co-tree for a read-once result tuple,

the probability can be computed using a simple, bottom-up procedure:

Pr(v) = ∏_{c∈ch(v)} Pr(c)             if v is ①
Pr(v) = 1 − ∏_{c∈ch(v)} (1 − Pr(c))   if v is ⓪
Pr(v) = pt                            if v = xt

where ch(v) denotes the children of v. The marginal probability of the result tuple can then be retrieved from the root of the co-tree. Figure 6.4 shows the co-tree for the result tuple from Figure 6.2.
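A minimal sketch of this bottom-up pass follows, assuming a co-tree encoded as nested tuples ('and', child, ...) / ('or', child, ...) with leaves holding the tuple probabilities pt; the encoding is ours, for illustration:

def cotree_prob(node):
    """Bottom-up probability computation on a co-tree."""
    if isinstance(node, (int, float)):   # leaf xt: Pr(xt = true) = p_t
        return float(node)
    op, *children = node
    probs = [cotree_prob(c) for c in children]
    prod = 1.0
    if op == 'and':                      # a 1-node: product over children
        for p in probs:
            prod *= p
        return prod
    for p in probs:                      # a 0-node: 1 - prod of (1 - Pr(c));
        prod *= (1.0 - p)                # valid since subtrees share no variables
    return 1.0 - prod

# e.g. the read-once expression x1(z1y1 + z2y2) + y3(x2z3 + x3z4) of
# Figure 6.2, with every tuple probability set to 0.5 for illustration:
p = 0.5
r = ('or', ('and', p, ('or', ('and', p, p), ('and', p, p))),
           ('and', p, ('or', ('and', p, p), ('and', p, p))))
print(cotree_prob(r))

The independence of sibling subtrees (no shared variables) is exactly what makes the ⓪-node rule 1 − ∏(1 − Pr(c)) correct.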

6.3 Hierarchical Queries and Read-Once Functions

Earlier work on query evaluation in probabilistic databases has identified tractable queries for which probability computation is efficient, and this set of queries is referred to as hierarchical queries [Dalvi and Suciu, 2007]. We next illustrate the close connection between hierarchical queries and read-once functions. For the ensuing discussion, we will assume that all queries are projected onto the empty set of attributes. Queries projected onto a non-empty set of attributes can be handled by replacing these attributes with constants. Let q denote a query in datalog notation. Let A denote an


attribute in q and sg(A) denote the set of relations A is mentioned in. In other words, sg(A) denotes the set of relations, or subgoals, of A. For the example in Figure 6.2, sg(X) = {L, J} and sg(Y) = {J, R}.

Definition 9 (Hierarchical Query [Dalvi and Suciu, 2007]). A (conjunctive) query q is hierarchical if for any two attributes A and B, either sg(A) ⊆ sg(B), sg(A) ⊇ sg(B) or sg(A) ∩ sg(B) = ∅.

For instance, the query q in Figure 6.2 is not hierarchical (sg(X) ∩ sg(Y) = {J} ≠ ∅, sg(X) ⊈ sg(Y), sg(Y) ⊈ sg(X)), but q′() :− S(X,Y), T(Y) is. Further, an attribute A is said to be maximal if ∀B, sg(B) ∩ sg(A) ≠ ∅ ⇒ sg(B) ⊆ sg(A). Note that, using the notion of maximality, it is possible to divide the attributes in any hierarchical query q into disjoint sets A1 ∪ . . . ∪ Ak such that:

• subgoals of attributes across the sets are disjoint: sg(A) ∩ sg(B) = ∅ ∀A ∈ Ai, ∀B ∈ Aj, i ≠ j;

• there is a maximal attribute A in each set Ai: ∃A ∈ Ai s.t. sg(Am) ⊆ sg(A) ∀Am ∈ Ai, ∀i = 1, . . . , k.
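Definition 9 is easy to check programmatically. The sketch below assumes a query is encoded as a dict mapping each attribute to its set of subgoals (an encoding of our own choosing) and illustrates the check on the two queries above:

def is_hierarchical(sg):
    """sg: dict mapping attribute -> set of subgoals (relation names)."""
    attrs = list(sg)
    for i, A in enumerate(attrs):
        for B in attrs[i + 1:]:
            a, b = sg[A], sg[B]
            # Definition 9: contained one way, the other way, or disjoint
            if not (a <= b or b <= a or not (a & b)):
                return False
    return True

print(is_hierarchical({'X': {'L', 'J'}, 'Y': {'J', 'R'}}))  # q of Figure 6.2: False
print(is_hierarchical({'X': {'S'}, 'Y': {'S', 'T'}}))       # q': True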

Dalvi and Suciu [2007] showed that hierarchical queries always give rise to result tuples with read-once functions. Here, we express the same proof for the simple case of queries without self-joins in our notation:

Proposition 6.3.1. Hierarchical queries always produce result tuples with read-once expressions.

Proof. Assume q is hierarchical. By induction on the number of attributes in q, we can show that the result tuple is read-once. The base case is when q has only one attribute A, which is maximal and whose set of subgoals contains all the relations in q. The boolean formula for the result tuple produced by q can be expressed as ∨_{c∈UA} q[A/c], where q[A/c] denotes the query obtained with A set to constant c and UA denotes the domain of A. Note that q[A/c] indexes into a different set of variables for different c's; thus, variables appearing in q[A/ci] and q[A/cj] for i ≠ j are distinct. Also, within q[A/c], we essentially have a cartesian product among tuples from different relations satisfying A = c (if |sg(A)| > 1), which is clearly read-once. Thus, q produces a read-once result tuple. For the inductive case, let us assume that all sub-queries of q with at least one less attribute produce read-once result tuples. Given that we can divide the attributes in q into disjoint sets A1 ∪ . . . ∪ Am such that each set has a distinct maximal attribute and subgoals for attributes across sets are disjoint, let qi denote the part of q restricted to the subgoals of Ai (the maximal attribute in Ai). Then, we can express the result tuple's boolean formula as ∧_i ∨_{c∈UAi} qi[Ai/c]. No two variables in ∨_{c∈UAi} qi[Ai/c] and ∨_{c∈UAj} qj[Aj/c] for i ≠ j can be identical, since the relations are distinct. Also, qi[Ai/c] is read-once by our inductive hypothesis, since it contains at least one less attribute than q; and qi[Ai/c] and qi[Ai/c′] for c ≠ c′ do not share variables, since Ai is maximal and they index into different sets of tuples, thus implying that ∨_{c∈UAi} qi[Ai/c] is read-once. These two observations put together imply that ∧_i ∨_{c∈UAi} qi[Ai/c] is read-once; hence q produces a read-once result tuple.

Since read-once functions form the basis of the tractability of hierarchical queries, and we have already seen how hierarchical queries guarantee read-once result tuples, a natural question to ask is whether the converse is true. That is, if we have a read-once result tuple, is it necessary for the query that produced it to be hierarchical? If the answer is yes, that would imply that by equipping our query engine with the techniques to deal with hierarchical queries introduced in Dalvi and Suciu [2007, 2004], we have done all we can to deal with tractable cases. The answer, however, is no, as should be clear from our running example. In Figure 6.2, we showed a query that is not hierarchical; however, the result tuple it produced has a read-once expression for which probability computation is easy. In this chapter, we would like to develop techniques that help us evaluate such cases efficiently.

6.4 Read-Once Expressions for Probabilistic Databases

We now concentrate our efforts on devising a query evaluation engine that efficiently evaluates read-once result tuples without restricting itself to hierarchical queries. We first describe a simple query evaluator that works for all result tuples without making any assumptions about the query. We then discuss the complexity of our proposed engine. After that, we concentrate on a subset of relational algebra queries for which we attempt to devise a faster approach.

One viable approach to evaluating queries is to generate boolean formulas for result tuples (using the extended operators in Figure 6.1) and then determine whether each is a read-once function. If the result tuple is read-once, then we compute its probability from its read-once expression's co-tree; else we resort to a general-purpose inference engine. Checking for read-once result tuples is possible in polynomial time. We briefly discuss the algorithms that have been proposed previously in the literature to check the three properties that determine whether a formula is read-once.
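Since the co-tree is what ultimately drives the probability computation, we sketch that step first. The following is a minimal sketch, assuming a nested-tuple representation of the co-tree; the ('and'/'or', children) encoding and the helper name cotree_prob are our own illustrative conventions, not the thesis's implementation.

from math import prod

# Sketch: marginal probability of a read-once result tuple from its
# co-tree. Internal nodes are ('and', children) or ('or', children);
# leaves are tuple-existence variable names. Because the children of a
# co-tree node share no variables, independence applies at every node.
def cotree_prob(node, p):
    # p maps each leaf variable to its marginal existence probability
    if isinstance(node, str):
        return p[node]
    op, children = node
    probs = [cotree_prob(child, p) for child in children]
    if op == 'and':
        return prod(probs)                          # independent conjunction
    if op == 'or':
        return 1.0 - prod(1.0 - q for q in probs)   # independent disjunction
    raise ValueError('unknown node type: %r' % (op,))

# phi = x1 AND (y1 OR y2), a read-once formula:
tree = ('and', ['x1', ('or', ['y1', 'y2'])])
print(cotree_prob(tree, {'x1': 0.9, 'y1': 0.5, 'y2': 0.4}))
# approximately 0.63, i.e., 0.9 * (1 - 0.5 * 0.6)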

In what follows, let φ denote a result tuple's boolean formula, |φ| the length of its dnf form (with different occurrences of the same variable counted multiple times) and Vars(φ) the set of distinct variables in it. To check for unateness, a linear scan of the formula is sufficient, which requires O(|φ|) time. To check for P4's in φ's co-occurrence graph Gφ, a handful of algorithms have been proposed in the literature [Bretscher et al., 2008; Corneil et al., 1985; Habib and Paul, 2005]†. The common aspect of all of these algorithms is that they require Gφ to be provided as input and they run in time linear in the size of Gφ (O(|Vars(φ)| + |E|), where |E| is the number of edges in Gφ). Gφ can be obtained easily from the formula's dnf form. For instance, Corneil et al. [1985] takes the co-occurrence graph, picks each variable from the graph along with its neighbours, and incrementally builds the co-tree which depicts the read-once function. If the algorithm returns a co-tree successfully, then the co-occurrence graph did not contain any P4; if there is a P4, then the algorithm stops and provides the P4. Thus, the good thing about this algorithm is that not only does it check for the absence of P4's, it also returns the read-once expression as a co-tree which we can subsequently use for probability computation. Figure 6.5 shows a run of Corneil et al.'s approach to build the co-tree for the result tuple in Figure 6.2. Bretscher et al. [2008] and Habib and Paul [2005] take a slightly different approach: they first build an ordering on the variables using the co-occurrence graph (Habib and Paul [2005] uses vertex partitioning techniques, while Bretscher et al. [2008] uses LexBFS techniques) and then subsequently use the ordering to build the co-tree. Checking for normality is also possible in polynomial time, but is more expensive than checking for unateness or P4-freeness. Golumbic et al. [2006] describes a way to check for normality in O(|Vars(φ)||φ|) time using the co-tree obtained from the previous step of checking for P4's.
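To make the objects involved concrete, here is a small sketch that builds the co-occurrence graph of a monotone dnf and tests it for P4's by brute force. The cited recognition algorithms are linear-time and considerably more involved; the naive O(|Vars(φ)|^4) enumeration below is only meant to illustrate the definitions. The example formula is the standard non-read-once pattern x1y1 ∨ x1y2 ∨ x2y2.

from itertools import combinations, permutations

# Sketch: build G_phi from a monotone dnf (a list of clauses, each a set
# of variable names) and check for an induced P4 by brute force.
def cooccurrence_graph(dnf):
    edges = set()
    for clause in dnf:
        for u, v in combinations(clause, 2):
            edges.add(frozenset((u, v)))
    return edges

def has_p4(variables, edges):
    adjacent = lambda u, v: frozenset((u, v)) in edges
    for quad in combinations(sorted(variables), 4):
        for a, b, c, d in permutations(quad):
            # an induced path a-b-c-d with no chords is a P4
            if (adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
                    and not adjacent(a, c) and not adjacent(a, d)
                    and not adjacent(b, d)):
                return True
    return False

dnf = [{'x1', 'y1'}, {'x1', 'y2'}, {'x2', 'y2'}]
print(has_p4(set().union(*dnf), cooccurrence_graph(dnf)))
# True: y1 - x1 - y2 - x2 is an induced P4, so the formula is not read-once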

6.5 Read-Once Functions and Conjunctive Queries without Self-Joins

Recall that the most expensive step while generating read-once functions is the step that checks for normality. In this section, we specifically look at the case of generating read-once functions for result tuples produced by conjunctive queries without self-joins, also known as select(distinct)-project-join or SPJ queries.

†Note that P4-free graphs are also referred to as Cographs. Thus, some of the algorithms that test for the absence of P4's also go by the name of "Cograph recognition algorithms".


Figure 6.5: Run of Corneil et al. [1985]'s algorithm to generate the co-tree for result tuple r in Figure 6.2. In each iteration, an ellipse depicts the variable being added and the asterisks denote its neighbors already present in the tree.

Essentially, we show that for conjunctive queries without self-joins, the normality check at the end is not required. Conjunctive queries form a large fragment of relational algebra (or SQL) and other works have also concentrated their efforts on this subclass of queries [Dalvi and Suciu, 2004; Olteanu and Huang, 2008]. We first define our notion of conjunctive queries.

Let A denote an attribute. An atomic formula is a predicate of the form A op B, where B is either an attribute or a constant conforming to the type of A, and op is any binary operator conforming to the same type, such as =, >, <, ≠, etc. A conjunctive query q without self-joins is a relational algebra query that involves the three operators σ, ⋈ and π, where the joins are among distinct relations R1, . . . Rk. We refer to the relations in q by Rels(q). We allow join and selection predicates of the form c1 ∧ c2 . . . ∧ cn, where each ci is an atomic formula. Note that if the final set of projected attributes in q is empty, then we refer to it as a boolean conjunctive query. Also note that we allow operators besides equality in our selection and join predicates, which makes our definition of conjunctive queries more general than what is usually considered.

As earlier, we will denote by r the input result tuple (whose marginal probability we would like to compute), by φ its boolean formula (which we would like to factorize) and by Gφ its co-occurrence graph. We may also abuse notation and refer to the result tuple by its formula φ whenever it is clear from the context. A clause C = x1 x2 . . . xn is a conjunction of multiple boolean variables. A monotone clause is one where all variables appear in their positive form, with no negations. We will often refer to a clause as a set of variables. Further, since the variables in φ come from tuple-existence random variables, we will denote the relation of variable x by Rel(x). We will also frequently refer to φ's dnf form by φdnf. Since we consider conjunctive queries without self-joins, φdnf has a very uniform structure:

Definition 10 (k-monotone dnf). Given a conjunctive query q without self-joins and any result tuple φ produced by it, φdnf is a k-monotone dnf where every clause is monotone (or unate), contains exactly one variable from each relation in Rels(q) and is of size k, where k = |Rels(q)|‡.
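For instance, for a hypothetical query joining three relations R1, R2 and R3 (so that k = 3), a result tuple's formula might take the shape

φdnf = x1y1z1 ∨ x1y2z1 ∨ x2y2z2,

where the x's come from R1, the y's from R2 and the z's from R3; every clause is monotone and contains exactly one variable per relation. The particular variables and clauses here are illustrative only.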

Given that conjunctive queries do not allow negations, it follows that the result tuples we will be dealing with are automatically unate (variables appear only in their positive form). We next show that P4-free result tuples produced by conjunctive queries without self-joins are guaranteed to be normal. This implies that when we are generating read-once functions for such result tuples, the only operation we need to perform is to check for P4's and generate the co-tree corresponding to the read-once expression. The other steps for generating read-once functions are not required, and this should help make our approach much more efficient. We next make an observation about result tuples produced by conjunctive queries and then prove a lemma which will allow us to prove our main result.

‡These observations have been made in prior work [Re and Suciu, 2008].

Property 2 (Conjunctive Query Clique Structure). Given a result tuple r produced by a conjunctive query q without self-joins along with its formula φ, the set of variables C = {x1, . . . x|Rels(q)|} represents a clause in φdnf iff C is a clique in Gφ.

Proof. Note that if C is a clause in φdnf then it has to be a clique in Gφ by construction. The other direction is also easy. An edge between two variables a, b in Gφ implies that the corresponding tuples satisfy all join and selection predicates in q, and agree with r on all of the final projected attributes (if a or b have any of those). Thus, a |Rels(q)|-sized clique in Gφ implies that all member variables' tuples satisfy all predicates associated with the query and agree with r's values; they should therefore produce an intermediate join tuple and thus appear as a clause in φdnf.
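The following sketch checks Property 2 on a small handcrafted instance: it enumerates all variable combinations with one variable per relation and reports those that form cliques in Gφ, which should coincide exactly with the clauses of φdnf. The relation assignment and formula here are illustrative, not drawn from the thesis's examples.

from itertools import combinations, product

# Sketch: for a k-monotone dnf, the |Rels(q)|-sized cliques of G_phi with
# one variable per relation coincide with the clauses (Property 2).
def cliques_one_per_relation(relations, edges):
    found = []
    for combo in product(*relations.values()):
        if all(frozenset((u, v)) in edges
               for u, v in combinations(combo, 2)):
            found.append(set(combo))
    return found

dnf = [{'x1', 'y1', 'z1'}, {'x1', 'y2', 'z1'}]
edges = {frozenset((u, v)) for c in dnf for u, v in combinations(c, 2)}
relations = {'R1': ['x1'], 'R2': ['y1', 'y2'], 'R3': ['z1']}
print(cliques_one_per_relation(relations, edges) == [set(c) for c in dnf])
# True: the one-per-relation cliques are exactly the clauses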

Lemma 6.5.1. Let φ denote a result tuple produced by a conjunctive query q without self-joins. Let Rels(q) = {R1, . . . R|Rels(q)|}, and let a ∈ R1 and b ∈ R2 denote two variables in φ. Let C1, C2, C3 denote three clauses in φdnf such that a ∈ C1, b ∉ C1; a ∉ C2, b ∈ C2; and a, b ∈ C3. If φ is P4-free, then there exists a clause C4 in φdnf that contains both a, b and variables w3, . . . w|Rels(q)| such that either wi ∈ C1 or wi ∈ C2, for all i = 3, . . . |Rels(q)|.

Proof. Let us begin by completing clauses C1, C2:

• C1 = {a, b′, x3, . . . xn, zn+1, . . . z|Rels(q)|}, b′ ≠ b

• C2 = {a′, b, y3, . . . yn, zn+1, . . . z|Rels(q)|}, a′ ≠ a

where a, a′ ∈ R1, b, b′ ∈ R2, xi, yi ∈ Ri with xi ≠ yi for all i = 3, . . . n, and zi ∈ Ri for all i = n + 1, . . . |Rels(q)|. The x's and y's denote the variables in which C1 and C2 differ, besides a and b; the z's denote the variables they share in common. Further, note that n can be as small as 2 (no x's or y's) or as large as |Rels(q)| (no z's).

First note that if neither edge xi − b nor yi − a exists, then Gφ has a P4, namely the induced path xi − a − b − yi:

xi /− yi ∵ Rel(xi) = Rel(yi), no self-joins
xi − a ∵ {xi, a} ⊆ C1
yi − b ∵ {yi, b} ⊆ C2
a − b ∵ {a, b} ⊆ C3

Thus, for every i, at least one of the edges xi − b and yi − a must exist.

Now consider the following selection procedure that picks variables from {x3, . . . xn} and {y3, . . . yn}:

if edge xi − b exists in Gφ then pick xi, else pick yi

Note that, once we have picked a set of x's and y's, we will have picked a variable from each relation Ri, i = 3, . . . n. Also note that if among the chosen variables there exists a pair xi ∈ Ri and yj ∈ Rj (i ≠ j) such that xi /− yj in Gφ, then we have a P4, namely the induced path xj − xi − b − yj:

xj /− yj ∵ Rel(xj) = Rel(yj)
xj − xi ∵ {xi, xj} ⊆ C1
xj /− b ∵ otherwise we would have picked xj, not yj
xi − b ∵ otherwise we would not have picked xi
yj − b ∵ {b, yj} ⊆ C2

Thus, if φ is P4-free, then the chosen x's and y's, along with a, b, zn+1, . . . z|Rels(q)|, form a |Rels(q)|-sized clique in Gφ (each picked xi is adjacent to b by construction, and each picked yi is adjacent to a by the first observation), and by Property 2 this set of variables forms the clause C4 we need.

Proposition 6.5.1. Let φ be a k-monotone dnf produced by some conjunctive query q without self-joins. If φ is P4-free then φ is normal.

Proof. Assume the contrary, i.e., let φ be P4-free but not normal. Then φ must have a distributed 3-clique; in other words, φ has at least three clauses C1, C2, C3 such that a, b ∈ C1, c ∉ C1; b, c ∈ C2, a ∉ C2; c, a ∈ C3, b ∉ C3, but no clause C such that a, b, c ∈ C. However, by Lemma 6.5.1, since φ is P4-free there must be another clause C′ that contains a, b and, for every remaining relation, a variable drawn from either C2 or C3. Since c is the variable from its relation in both C2 and C3, C′ must also contain c, and hence we have a contradiction.
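A distributed 3-clique is easy to search for directly. The sketch below scans a monotone dnf for triples of variables that pairwise co-occur in some clause yet never appear all together; the example formula x1y1 ∨ y1z1 ∨ z1x1 is a classic non-normal pattern, and note that it could not have been produced by a conjunctive query without self-joins, since its clauses do not contain one variable from every relation.

from itertools import combinations

# Sketch: find "distributed 3-cliques" in a monotone dnf -- triples that
# pairwise co-occur in some clause but never all appear in one clause.
# By Proposition 6.5.1, no such triple exists in a P4-free result tuple
# of a conjunctive query without self-joins.
def distributed_3_cliques(dnf):
    variables = sorted(set().union(*dnf))
    hits = []
    for a, b, c in combinations(variables, 3):
        pairwise = all(any({u, v} <= clause for clause in dnf)
                       for u, v in ((a, b), (b, c), (a, c)))
        together = any({a, b, c} <= clause for clause in dnf)
        if pairwise and not together:
            hits.append((a, b, c))
    return hits

print(distributed_3_cliques([{'x1', 'y1'}, {'y1', 'z1'}, {'z1', 'x1'}]))
# [('x1', 'y1', 'z1')]: phi = x1y1 + y1z1 + z1x1 is not normal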


X: A1 B1          Y: A2          Z: B2
x1: 3  4          y1: 1          z1: 2
x2: 1  2          y2: 3          z2: 4

q() :− X(A1, B1), Y(A2), Z(B2), A1 = A2 ∨ B1 = B2

r = x1y1z2 + x1y2z1 + x1y2z2 + x2y1z1 + x2y1z2 + x2y2z1

[Gφ: the complete graph on {x1, x2, y1, y2, z1, z2} minus the within-relation edges x1 − x2, y1 − y2 and z1 − z2.]

Figure 6.6: A disjunctive query where Proposition 6.5.1 does not hold.

6.6 Discussion

Having considered the case of conjunctive queries, the next obvious question is whether we can do something similar for queries with disjunctions. Some disjunctions can be allowed in the queries we considered in this chapter without breaking any of our results. For instance, if the cumulative join predicate in the query can be partitioned into a conjunction of smaller formulas c1 ∧ c2 . . . cm such that each formula ci involves attributes from only two relations, then allowing disjunctions inside each ci still lets us use our efficient read-once function building approach. However, if a disjunction appears between any of the ci's, then seemingly innocuous queries produce cases where our results do not hold. Figure 6.6 shows one such case, where we have a three-relation boolean join query whose join predicate involves a disjunction among attributes from three separate relations, A1 = A2 ∨ B1 = B2. The result tuple φ's co-occurrence graph turns out to be the complete graph minus the edges connecting tuples from the same relation (x1 /− x2, y1 /− y2, z1 /− z2), which means it is P4-free. But there are cliques in Gφ, namely {x1, y1, z1} and {x2, y2, z2}, that are not present as clauses in φdnf, implying φdnf is not normal, which means that Proposition 6.5.1 does not hold.
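This counterexample is small enough to verify mechanically. The standalone sketch below (brute-forcing the P4 check, as before) confirms that the co-occurrence graph of r from Figure 6.6 contains no induced P4, yet two of its one-per-relation cliques are not clauses of the dnf:

from itertools import combinations, permutations, product

dnf = [set(c) for c in ({'x1','y1','z2'}, {'x1','y2','z1'}, {'x1','y2','z2'},
                        {'x2','y1','z1'}, {'x2','y1','z2'}, {'x2','y2','z1'})]
edges = {frozenset((u, v)) for c in dnf for u, v in combinations(c, 2)}
adjacent = lambda u, v: frozenset((u, v)) in edges

def has_p4(variables):
    for quad in combinations(sorted(variables), 4):
        for a, b, c, d in permutations(quad):
            if (adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
                    and not adjacent(a, c) and not adjacent(a, d)
                    and not adjacent(b, d)):
                return True
    return False

print(has_p4(set().union(*dnf)))  # False: G_phi is P4-free
# every one-variable-per-relation clique should be a clause, but:
for clique in product(('x1', 'x2'), ('y1', 'y2'), ('z1', 'z2')):
    if all(adjacent(u, v) for u, v in combinations(clique, 2)):
        if not any(set(clique) <= clause for clause in dnf):
            print('clique that is not a clause:', clique)
# prints ('x1','y1','z1') and ('x2','y2','z2'), so phi_dnf is not normal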

Even though our discussion throughout the chapter mainly involved tuple-independent probabilistic databases, the techniques we proposed are likely to be useful for databases with correlated tuples also. In this case, our techniques can be used to convert the part of the graphical model generated during query evaluation into a tree. It is easy to show that the combined treewidth of the complete probabilistic graphical model thus produced (including the probabilistic graphical model among the base tuples and the part constructed during query evaluation) is not larger than the treewidth of the graphical model that would have otherwise been produced.

6.7 Conclusion

In summary, we considered the problem of efficiently evaluating queries over probabilistic databases with tuple-level uncertainty. For such databases, every result tuple is associated with a boolean formula, and the problem reduces to computing the marginal probabilities of the result tuples returned by the query. Previously proposed approaches to this problem have either resorted to the use of expensive (exact or approximate) inference algorithms or concentrated on a subset of the query language that allows efficient evaluation. In this chapter, we built on the latter approach by going beyond just looking at the query to decide whether it is PTIME-solvable or not. Inference problems arising out of query evaluation on probabilistic databases are a combination of both the query and the database. If the result tuple's formula can be factored into a tree-structured form, then computing its marginal probability is in PTIME. We proposed efficient algorithms that return such a factorization if it exists.


Chapter 7

Conclusion

In this dissertation, we presented (a few of) the nuts and bolts that may one day form part of a system that can manage uncertain data. Here, we briefly summarize the main contributions made and list a few of the broader, more compelling avenues for future work.

7.1 Summary of Contributions

Here is a brief listing of the major contributions made in this dissertation:

• We began by showing how the concept of probabilistic graphical models from the machine learning literature can be utilized in probabilistic databases as a means of modeling the uncertainty associated with data. We showed that probabilistic databases based on probabilistic graphical models have precise and intuitive semantics in terms of possible worlds, and that every query posed on such a database has precisely defined answers.

• We showed how queries can be evaluated under such a setting by first generating an augmented probabilistic graphical model and then running probabilistic inference on it. We illustrated the generality of our approach: for any query, a PGM can be generated on which we simply need to run inference to obtain the desired results. This also allowed us to utilize any inference algorithm (exact or approximate) developed previously to build our query evaluation engine.

• We then generalized our representation for modeling uncertainty. Instead of using standard graphical models (such as Bayesian networks and Markov networks), we motivated the use of first-order probabilistic graphical models (such as probabilistic relational models and Markov logic networks). First-order graphical models represent one of the more popular approaches to modeling uncertainty, not only due to their compactness and ease of maintenance, but also because they are easier to estimate statistically.

• First-order graphical models provide symmetry in the form of shared correlations. We designed an inference procedure that exploits shared correlations to perform large-scale inference efficiently for evaluating queries in probabilistic databases. Moreover, our inference procedure is general enough to be applied to any probabilistic graphical model. It also subsumes inversion elimination, a popular lifted inference procedure developed in the machine learning community.

• We generalized our lifted inference scheme to be able to perform faster, approximate lifted inference. We introduced two different techniques to do this. Moreover, both techniques can be combined and, along with bounded-complexity inference techniques, they form the core of a unified lifted inference scheme that lets the user specify her/his desired level of lifting, approximation and complexity of inference through the use of a handful of tunable parameters.

• Finally, we designed a novel query evaluation scheme where we first attempt to reorder the probabilistic graphical model produced during query evaluation so that we get an optimal, tree-structured graphical model (if one exists) on which we can efficiently run inference. This has the potential to reduce an intractable query evaluation problem to a tractable one.

7.2 Avenues for Future Work

Due to the almost ubiquitous need to model uncertainty in large-scale data, we believe probabilistic databases are going to be an overwhelming driving force behind database and machine learning research in the near future. Besides the aspects of user interfacing and query languages that require our immediate attention, we list below some of the main research areas that, we think, are of specific interest.


Information Integration and Information Extraction Two of the main applications that can immediately benefit from the application of probabilistic databases are information extraction and information integration. These two areas have been of interest to researchers for a long period of time; however, neither is close to being solved. Most machine learning approaches to solving these problems face issues when scaling to large data sets, and most solutions proposed by the database community tend to ignore the rich correlations that can help achieve good quality solutions. By looking at these problems from the point of view of probabilistic databases, perhaps for the first time, we can deal with these issues in a uniform and principled manner to achieve practical solutions that can immediately benefit many applications.

Efficient Algorithms for Lifted Inference Our original work on lifted inference just scratches the surface of this very exciting field. Recall that the techniques introduced in Chapter 4 subsume inversion elimination, and in Chapter 5 we proposed algorithms for approximate lifted inference. In the future, we would like to explore whether other kinds of lifted inference can be included in our general framework. Foremost on this list would be extending our approach to include counting elimination [de Salvo Braz et al., 2005] and counting formulas [Milch et al., 2008], techniques that may, in some cases, lead to exponential speedups during inference if implemented properly.

Unifying Uncertainty Model Description and Query Evaluation Most probabilistic databases allow the user to express queries in a high-level logic-based language (usually SQL, barring a few exceptions), but do not allow declarative specification of the uncertainty model. Machine learning researchers, on the other hand, regularly use first-order logic to describe uncertainty models but rarely allow the use of a high-level declarative language for querying purposes. We would like to explore the interplay between these two aspects in the context of probabilistic databases, since we believe both are essential for a system to be usable. To the best of our knowledge, there is no work to date that has systematically explored the expressiveness of the various languages used to describe uncertainty models, and we would also like to explore whether such high-level model descriptions can be exploited to make query processing more efficient.

108

Page 115: Representing and Querying Uncertain Data Prithviraj Sensen/pubs/thesis/thesis.pdf · 2010-05-21 · Representing and Querying Uncertain Data Prithviraj Sen Dissertation submitted

Approximate Inference Algorithms based on Generalizations of Read-Once Functions Following our work described in Chapter 6, where we showed how to generate a tree-structured PGM given a query, the next obvious question is: what if a tree-structured PGM does not exist? In such cases, it may be possible to construct a PGM that is "close" to being tree-structured, which may help run inference fast with reasonably accurate query results. There is ample work in the graph theory community on generalizations of P4-free graphs (such as P4-tidy graphs [Giakoumakis et al., 1997]) that may lead us to novel approximate query evaluation algorithms which have not been seen before in the database or machine learning communities.

7.3 Conclusion

One of the over-arching themes underlying this dissertation has been to explore the synergy between related fields of research. Probabilistic databases is a topic that lies at the intersection of database research, machine learning and graph theory. Even though the challenges of working in such a setting are obvious (one needs expertise in not one but each of the related fields to produce original, useful research), the rewards are also plentiful. Among the various pieces of work that form parts of this dissertation, perhaps the most rewarding are the ones that find use beyond probabilistic databases. For instance, our work on lifted inference (Chapter 4 and Chapter 5) is of obvious interest to the machine learning community, and our work on altering the structure of PGMs to produce read-once functions may find use in the statistical relational learning community, where people have recently begun to investigate the use of declarative querying (e.g., the ProbLog system [De Raedt et al., 2007]). This seems to be true for most work done in the context of probabilistic databases, and that, we believe, is what makes it worthwhile to work in such a multi-disciplinary environment. We hope that further research with the canvas of probabilistic databases as the background will lead to more synergy among related research communities and will eventually lead to a system that can efficiently handle large-scale uncertain data.


Bibliography

P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In International Conference on Data Engineering (ICDE), 2006.

L. Antova, C. Koch, and D. Olteanu. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. In International Conference on Data Engineering (ICDE), 2007.

S. Arnborg. Efficient algorithms for combinatorial problems on graphs with bounded decomposability - a survey. BIT, 25(1):2-23, 1985.

J. Bar-Ilan, G. Kortsarz, and D. Peleg. How to allocate network centers. Journal of Algorithms, 1993.

D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 1992.

O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In Proceedings of Very Large Data Bases (PVLDB), Seoul, Korea, 2006.

P. Bosc and O. Pivert. About projection-selection-join queries addressed to possibilistic relational databases. IEEE Transactions on Fuzzy Systems, 2005.

A. Bretscher, D. Corneil, M. Habib, and C. Paul. A simple linear time LexBFS cograph recognition algorithm. SIAM Journal on Discrete Mathematics, 2008.

B. Buckles and F. Petry. A fuzzy model for relational databases. International Journal of Fuzzy Sets and Systems, 1982.

R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In International Conference on Management of Data (SIGMOD), 2003.

S. Choenni, H. E. Blok, and E. Leertouwer. Handling uncertainty and ignorance in databases: A rule to combine dependent data. In Database Systems for Advanced Applications, 2006.

G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 1990.

Cora Entity Resolution Dataset. http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz.

D. Corneil, H. Lerchs, and L. Burlingham. Complement reducible graphs. Discrete Applied Mathematics, 1981.

D. Corneil, Y. Perl, and L. Stewart. A linear recognition algorithm for cographs. SIAM Journal of Computing, 1985.

R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999.

N. Dalvi and D. Suciu. Management of probabilistic data: foundations and challenges. In Principles of Database Systems (PODS), 2007.

N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In Proceedings of Very Large Data Bases (PVLDB), 2004.

A. Darwiche. A logical approach to factoring belief networks. In International Conference on Principles and Knowledge Representation and Reasoning, 2002.

A. Das Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In International Conference on Data Engineering (ICDE), Atlanta, Georgia, 2006.

A. Das Sarma, M. Theobald, and J. Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In International Conference on Data Engineering (ICDE), 2008.

L. De Raedt, A. Kimmig, and H. Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In International Joint Conference on Artificial Intelligence (IJCAI), 2007.

R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In International Joint Conference on Artificial Intelligence (IJCAI), 2005.

R. de Salvo Braz, E. Amir, and D. Roth. MPE and partial inversion in lifted probabilistic variable elimination. In AAAI Conference on Artificial Intelligence, 2006.

R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Uncertainty in Artificial Intelligence (UAI), 1996.

R. Dechter and I. Rish. Mini-buckets: A general scheme for bounded inference. Journal of the ACM, 2003.

A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In Proceedings of Very Large Data Bases (PVLDB), 2004.

A. Deshpande, L. Getoor, and P. Sen. Graphical models for uncertain data. In C. Aggarwal, editor, Managing and Mining Uncertain Data. Springer, 2008.

A. Dovier, C. Piazza, and A. Policriti. A fast bisimulation algorithm. In International Conference on Computer Aided Verification, Paris, France, 2001.

U. Feige. A threshold of ln(n) for approximating set cover. Journal of the ACM, 1998.

B. Frey. Extending factor graphs so as to unify directed and undirected graphical models. In Uncertainty in Artificial Intelligence (UAI), 2003.

N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In International Joint Conference on Artificial Intelligence (IJCAI), 1999.

N. Fuhr and T. Rolleke. A probabilistic NF2 relational algebra for integrated information retrieval and database systems. In World Conference on Integrated Design and Process Technology, 1996.

N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 15(1):32-66, 1997.

M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007.

L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In International Conference on Management of Data (SIGMOD), 2001.

L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679-707, 2002.

V. Giakoumakis, F. Roussel, and H. Thuillier. On P4-tidy graphs. Discrete Mathematics and Theoretical Computer Science, 1997.

C. Giles, K. Bollacker, and S. Lawrence. CiteSeer: An automatic indexing system. ACM Digital Libraries, 1998a.

C. L. Giles, K. Bollacker, and S. Lawrence. CiteSeer: An automatic citation indexing system. In Conference on Digital Libraries, 1998b.

M. Golumbic, A. Mintz, and U. Rotics. Factoring and recognition of read-once functions using cographs and normality and the readability of functions associated with partial k-trees. Discrete Applied Mathematics, 2006.

M. Habib and C. Paul. A simple linear time algorithm for cograph recognition. Discrete Applied Mathematics, 2005.

J. Halpern. An analysis of first-order logics for reasoning about probability. Artificial Intelligence, 1990.

J. Hayes. The fanout structure of switching functions. Journal of the ACM, 1975.

D. Hochbaum and W. Maass. Approximation schemes for covering and packing problems in image processing and VLSI. Journal of the ACM, 1985.

C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. International Journal of Approximate Reasoning, 15(3):225-263, 1994.

T. Imielinski and W. Lipski, Jr. Incomplete information in relational databases. Journal of the ACM, 1984.

A. Jaimovich, O. Meshi, and N. Friedman. Template based inference in symmetric relational Markov random fields. In Uncertainty in Artificial Intelligence (UAI), 2007.

R. Jampani, F. Xu, M. Wu, L. Perez, C. Jermaine, and P. Haas. MCDB: A monte carlo approach to managing uncertain data. In International Conference on Management of Data (SIGMOD), 2008.

T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. In IEEE Data Engineering Bulletin, 2006.

B. Kanagal and A. Deshpande. Indexing correlated probabilistic databases. In International Conference on Management of Data (SIGMOD), 2009.

P. Kanellakis and S. Smolka. CCS expressions, finite state processes, and three problems of equivalence. In ACM Symposium on Principles of Distributed Computing (PODC), Montreal, Canada, 1983.

O. Kariv and S. Hakimi. An algorithmic approach to network location problems I: The p-centers. SIAM Journal on Applied Mathematics, 1979.

K. Kersting, B. Ahmadi, and S. Natarajan. Counting belief propagation. In Uncertainty in Artificial Intelligence (UAI), 2009.

U. Kjaerulff. Triangulation of graphs - algorithms giving small total state space. Technical report, University of Aalborg, Denmark, 1990.

C. Koch and D. Olteanu. Conditioning probabilistic databases. In Proceedings of Very Large Data Bases (PVLDB), 2008.

F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 2001.

L. V. S. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. ProbView: a flexible probabilistic database system. ACM Transactions on Data Base Systems, 1997.

S. Lauritzen. Graphical Models. Oxford University Press, 1996.

J. Li and A. Deshpande. Consensus answers for queries over probabilistic databases. In Principles of Database Systems (PODS), 2009.

A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval Journal, 2000.

B. Milch, B. Marthi, S. J. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In International Joint Conference on Artificial Intelligence (IJCAI), 2005.

B. Milch, L. Zettlemoyer, K. Kersting, M. Haimes, and L. Kaelbling. Lifted probabilistic inference with counting formulas. In AAAI Conference on Artificial Intelligence, 2008.

D. Olteanu and J. Huang. Secondary-storage confidence computation for conjunctive queries with inequalities. In International Conference on Management of Data (SIGMOD), 2009.

D. Olteanu and J. Huang. Using OBDDs for efficient query evaluation on probabilistic databases. In Scalable Uncertainty Management, 2008.

R. Paige and R. Tarjan. Three partition refinement algorithms. SIAM Journal of Computing, 16(6):973-989, 1987.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

A. Pfeffer, D. Koller, B. Milch, and K. Takusagawa. SPOOK: A system for probabilistic object-oriented knowledge representation. In Uncertainty in Artificial Intelligence (UAI), Stockholm, Sweden, 1999.

D. Poole. First-order probabilistic inference. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

C. Re and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In Proceedings of Very Large Data Bases (PVLDB), Vienna, Austria, 2007.

C. Re and D. Suciu. Approximate lineage for probabilistic databases. In Proceedings of Very Large Data Bases (PVLDB), Auckland, New Zealand, 2008.

C. Re, N. Dalvi, and D. Suciu. Query evaluation on probabilistic databases. IEEE Data Engineering Bulletin Special Issue on Probabilistic Data Management, 2006.

C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In International Conference on Data Engineering (ICDE), 2007.

M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107-136, 2006.

T. Richardson. A characterization of Markov equivalence for directed cyclic graphs. International Journal of Approximate Reasoning, 1997.

I. Rish. Efficient Reasoning in Graphical Models. PhD thesis, University of California, Irvine, 1999.

P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In International Conference on Data Engineering (ICDE), 2007.

P. Sen, A. Deshpande, and L. Getoor. Representing tuple and attribute uncertainty in probabilistic databases. In ICDM Workshop on Data Mining of Uncertain Data, 2007.

P. Sen, A. Deshpande, and L. Getoor. Exploiting shared correlations in probabilistic databases. Proceedings of Very Large Data Bases (PVLDB), 1(1):809-820, 2008a.

P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3), 2008b.

P. Sen, A. Deshpande, and L. Getoor. Bisimulation-based approximate lifted inference. In Uncertainty in Artificial Intelligence (UAI), 2009a.

P. Sen, A. Deshpande, and L. Getoor. Read-once functions and query evaluation in probabilistic databases. In preparation, 2009b.

P. Sen, A. Deshpande, and L. Getoor. PrDB: Managing and exploiting shared correlations in probabilistic databases. Special issue on Uncertain and Probabilistic Databases, VLDB Journal, 2009c.

S. Singh, C. Mayfield, S. Prabhakar, S. Hambrusch, and R. Shah. Indexing uncertain categorical data. In International Conference on Data Engineering (ICDE), 2007.

P. Singla and P. Domingos. Lifted first-order belief propagation. In AAAI Conference on Artificial Intelligence, 2008.

B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Uncertainty in Artificial Intelligence (UAI), Edmonton, Canada, 2002.

TPC-H Benchmark. http://www.tpc.org/tpch/.

E. Ukkonen. Approximate string matching with q-grams and maximal matches. In Theoretical Computer Science, 1992.

V. Vazirani. Approximation Algorithms. Springer, 2001.

D. Wang, E. Michelakis, M. Garofalakis, and J. Hellerstein. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In Proceedings of Very Large Data Bases (PVLDB), 2008.

J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In Neural Information Processing Systems Conference (NIPS), 2000.

N. Zhang and D. Poole. A simple approach to Bayesian network computations. In Canadian Conference on Artificial Intelligence, Banff, Canada, 1994.

N. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 1996.