
    review articles

DOI:10.1145/1859204.1859227

What are Bayesian networks and why are their applications growing across all fields?

BY ADNAN DARWICHE

Bayesian Networks

Bayesian networks have been receiving considerable attention over the last few decades from scientists and engineers across a number of fields, including computer science, cognitive science, statistics, and philosophy. In computer science, the development of Bayesian networks was driven by research in artificial intelligence, which aimed at producing a practical framework for commonsense reasoning.29 Statisticians have also contributed to the development of Bayesian networks, where they are studied under the broader umbrella of probabilistic graphical models.5,11

Interestingly enough, a number of other more specialized fields, such as genetic linkage analysis, speech recognition, information theory and reliability analysis, have developed representations that can be thought of as concrete instantiations or restricted cases of Bayesian networks. For example, pedigrees and their associated phenotype/genotype information, reliability block diagrams, and hidden Markov models (used in many fields including speech recognition and bioinformatics) can all be viewed as Bayesian networks. Canonical instances of Bayesian networks also exist and have been used to solve standard problems that span across domains such as computer vision, the Web, and medical diagnosis.

So what are Bayesian networks, and why are they widely used, either directly or indirectly, across so many fields and application areas? Intuitively, Bayesian networks provide a systematic and localized method for structuring probabilistic information about a situation into a coherent whole. They also provide a suite of algorithms that allow one to automatically derive many implications of this information, which can form the basis for important conclusions and decisions about the corresponding situation (for example, computing the overall reliability of a system, finding the most likely message that was sent across a noisy channel, identifying the most likely users that would respond to an ad, restoring a noisy image, mapping genes onto a chromosome, among others). Technically speaking, a Bayesian network is a compact representation of a probability distribution that is usually too large to be handled using traditional specifications from probability and statistics such as tables and equations. For example, Bayesian networks with thousands of variables have been constructed and reasoned about successfully, allowing one to efficiently represent and reason about probability distributions whose size is exponential in that number of variables (for example, in genetic linkage analysis,12 low-level vision,34 and networks synthesized from relational models4).

key insights

• Bayesian networks provide a systematic and localized method for structuring probabilistic information about a situation into a coherent whole, and are supported by a suite of inference algorithms.

• Bayesian networks have been established as a ubiquitous tool for modeling and reasoning under uncertainty.

• Many applications can be reduced to Bayesian network inference, allowing one to capitalize on Bayesian network algorithms instead of having to invent specialized algorithms for each new application.


For a concrete feel of Bayesian networks, Figure 1 depicts a small network over six binary variables. Every Bayesian network has two components: a directed acyclic graph (called a structure), and a set of conditional probability tables (CPTs). The nodes of a structure correspond to the variables of interest, and its edges have a formal interpretation in terms of probabilistic independence. We will discuss this interpretation later, but suffice to say here that in many practical applications, one can often interpret network edges as signifying direct causal influences. A Bayesian network must include a CPT for each variable, which quantifies the relationship between that variable and its parents in the network. For example, the CPT for variable A specifies the conditional probability distribution of A given its parents F and T. According to this CPT, the probability of A = true given F = true and T = false is Pr(A = true | F = true, T = false) = .9900 and is called a network parameter.a

A main feature of Bayesian networks is their guaranteed consistency and completeness: there is one and only one probability distribution that satisfies the constraints of a Bayesian network. For example, the network in Figure 1 induces a unique probability distribution over the 64 instantiations of its variables. This distribution provides enough information to attribute a probability to every event that can be expressed using the variables appearing in this network, for example, the probability of alarm tampering given no smoke and a report of people leaving the building.
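To make the chain-rule semantics concrete, the following sketch enumerates the 64 instantiations of the Figure 1 network and answers the query just mentioned by brute force. The Smoke and Alarm CPTs come from Figure 1; the priors on Fire and Tampering and the CPTs for Leaving and Report are not shown in the article, so the values below are hypothetical placeholders.

```python
from itertools import product

# The network in Figure 1 induces a unique joint distribution via the chain rule:
#   Pr(f, t, s, a, l, r) = Pr(f) Pr(t) Pr(s|f) Pr(a|f,t) Pr(l|a) Pr(r|l).

pr_f = 0.01                                   # Pr(F = true), hypothetical
pr_t = 0.02                                   # Pr(T = true), hypothetical
pr_s = {True: 0.90, False: 0.01}              # Pr(S = true | F), from Figure 1
pr_a = {(True, True): 0.5000, (True, False): 0.9900,
        (False, True): 0.8500, (False, False): 0.0001}   # Pr(A = true | F, T)
pr_l = {True: 0.88, False: 0.001}             # Pr(L = true | A), hypothetical
pr_r = {True: 0.75, False: 0.01}              # Pr(R = true | L), hypothetical

def bernoulli(p, value):
    return p if value else 1 - p

def joint(f, t, s, a, l, r):
    """Probability of one of the 64 variable instantiations."""
    return (bernoulli(pr_f, f) * bernoulli(pr_t, t) * bernoulli(pr_s[f], s) *
            bernoulli(pr_a[(f, t)], a) * bernoulli(pr_l[a], l) *
            bernoulli(pr_r[l], r))

def probability(event):
    """Sum the joint over all instantiations that satisfy `event`."""
    return sum(joint(*w) for w in product([True, False], repeat=6) if event(*w))

# Probability of alarm tampering given no smoke and a report of people leaving:
num = probability(lambda f, t, s, a, l, r: t and not s and r)
den = probability(lambda f, t, s, a, l, r: not s and r)
print(num / den)
```

This sketch generates the underlying distribution explicitly, which is exactly what the algorithms discussed next avoid doing.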

Another feature of Bayesian networks is the existence of efficient algorithms for computing such probabilities without having to explicitly generate the underlying probability distribution (which would be computationally infeasible for many interesting networks).

(a) Bayesian networks may contain continuous variables, yet our discussion here is restricted to the discrete case.


These algorithms, to be discussed in detail later, apply to any Bayesian network, regardless of its topology. Yet, the efficiency of these algorithms, and their accuracy in the case of approximation algorithms, may be quite sensitive to this topology and the specific query at hand. Interestingly enough, in domains such as genetics, reliability analysis, and information theory, scientists have developed algorithms that are indeed subsumed by the more general algorithms for Bayesian networks. In fact, one of the main objectives of this article is to raise awareness about these connections. The more general objective, however, is to provide an accessible introduction to Bayesian networks, which allows scientists and engineers to more easily identify problems that can be reduced to Bayesian network inference, putting them in a position where they can capitalize on the vast progress that has been made in this area over the last few decades.

Causality and Independence

We will start by unveiling the central insight behind Bayesian networks that allows them to compactly represent very large distributions. Consider Figure 1 and the associated CPTs. Each probability that appears in one of these CPTs does specify a constraint that must be satisfied by the distribution induced by the network. For example, the distribution must assign the probability .01 to having smoke without fire, Pr(S = true | F = false), since this is specified by the CPT of variable S. These constraints, however, are not sufficient to pin down a unique probability distribution. So what additional information is being appealed to here?

The answer lies in the structure of a Bayesian network, which specifies additional constraints in the form of probabilistic conditional independencies. In particular, every variable in the structure is assumed to become independent of its non-descendants once its parents are known. In Figure 1, variable L is assumed to become independent of its non-descendants T, F, S once its parent A is known. In other words, once the value of variable A is known, the probability distribution of variable L will no longer change due to new information about variables T, F and S. Another example from Figure 1: variable A is assumed to become independent of its non-descendant S once its parents F and T are known. These independence constraints are known as the Markovian assumptions of a Bayesian network. Together with the numerical constraints specified by CPTs, they are satisfied by exactly one probability distribution.

Does this mean that every time a Bayesian network is constructed, one must verify the conditional independencies asserted by its structure? This really depends on the construction method. I will discuss three main methods in the section entitled "How Are Bayesian Networks Constructed?" that include subjective construction, synthesis from other specifications, and learning from data. The first method is the least systematic, but even in that case, one rarely thinks about conditional independence when constructing networks. Instead, one thinks about causality, adding the edge X → Y whenever X is perceived to be a direct cause of Y. This leads to a causal structure in which the Markovian assumptions read: each variable becomes independent of its non-effects once its direct causes are known. The ubiquity of Bayesian networks stems from the fact that people are quite good at identifying direct causes from a given set of variables, and at deciding whether the set of variables contains all of the relevant direct causes. This ability is all that one needs for constructing a causal structure.

The distribution induced by a Bayesian network typically satisfies additional independencies, beyond the Markovian ones discussed above. Moreover, all such independencies can be identified efficiently using a graphical test known as d-separation.29 According to this test, variables X and Y are guaranteed to be independent given variables Z if every path between X and Y is blocked by Z.

Figure 1. Bayesian network with some of its conditional probability tables (CPTs). Nodes: Fire (F), Tampering (T), Smoke (S), Alarm (A), Leaving (L), Report (R).

Fire    Smoke    θs|f
true    true     .90
false   true     .01

Fire    Tampering    Alarm    θa|f,t
true    true         true     .5000
true    false        true     .9900
false   true         true     .8500
false   false        true     .0001

Figure 2. Hidden Markov model (HMM) and its corresponding dynamic Bayesian network (DBN). (a) HMM (state diagram) over states a, b, c and outputs x, y, z; among the annotations shown are a .30 probability of transitioning from b to c and a .15 probability of b emitting z. (b) DBN with state variables S1, S2, S3, …, Sn and output variables O1, O2, O3, …, On.


Intuitively, a path is blocked when it cannot be used to justify a dependence between X and Y in light of our knowledge of Z. For an example, consider the path S ← F → A ← T in Figure 1 and suppose we know the alarm has triggered (that is, we know the value of variable A). This path can then be used to establish a dependence between variables S and T as follows. First, observing smoke increases the likelihood of fire since fire is a direct cause of smoke according to this path. Moreover, the increased likelihood of fire explains away tampering as a cause of the alarm, leading to a decrease in the probability of tampering (fire and tampering are two competing causes of the alarm according to this path). Hence, the path could be used to establish a dependence between S and T in this case. Variables S and T are therefore not independent given A due to the presence of this unblocked path. One can verify, however, that this path cannot be used to establish a dependence between S and T in case we know the value of variable F instead of A. Hence, the path is blocked by F.

Even though we appealed to the notion of causality when describing the d-separation test, one can phrase and prove the test without any appeal to causality; we only need the Markovian assumptions. The full d-separation test gives the precise conditions under which a path between two variables is blocked, guaranteeing independence whenever all paths are blocked. The test can be implemented in time linear in the Bayesian network structure, without the need to explicitly enumerate paths as suggested previously.
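The following is a minimal sketch of a d-separation test. It does not use the linear-time algorithm alluded to above; instead it relies on an equivalent classical criterion: X and Y are d-separated by Z exactly when X and Y are disconnected after moralizing the relevant ancestral subgraph and removing Z.

```python
# A minimal d-separation test via the ancestral moral graph criterion.
# `dag` maps each node to the set of its parents.

def ancestors(dag, nodes):
    """All nodes with a directed path into `nodes`, plus `nodes` themselves."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        n = frontier.pop()
        for p in dag.get(n, ()):
            if p not in result:
                result.add(p)
                frontier.append(p)
    return result

def d_separated(dag, xs, ys, zs):
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    # Moralize: connect each node to its parents, and marry every pair of
    # parents of a common child; then drop edge directions.
    adj = {n: set() for n in keep}
    for child in keep:
        parents = [p for p in dag.get(child, ()) if p in keep]
        for p in parents:
            adj[child].add(p); adj[p].add(child)
        for i, p in enumerate(parents):
            for q in parents[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # Remove Z and test whether any Y is reachable from X.
    reached, frontier = set(xs), list(xs)
    while frontier:
        n = frontier.pop()
        for m in adj[n] - set(zs):
            if m not in reached:
                reached.add(m)
                frontier.append(m)
    return not (reached & set(ys))

# The network of Figure 1, as a map from each variable to its parents:
dag = {'F': set(), 'T': set(), 'S': {'F'}, 'A': {'F', 'T'},
       'L': {'A'}, 'R': {'L'}}
print(d_separated(dag, {'S'}, {'T'}, {'A'}))  # False: S, T dependent given A
print(d_separated(dag, {'S'}, {'T'}, {'F'}))  # True: the path is blocked by F
```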

The d-separation test can be used to directly derive results that have been proven for specialized probabilistic models used in a variety of fields. One example is hidden Markov models (HMMs), which are used to model dynamic systems whose states are not observable, yet their outputs are. One uses an HMM when interested in making inferences about these changing states, given the sequence of outputs they generate. HMMs are widely used in applications requiring temporal pattern recognition, including speech, handwriting, and gesture recognition; and various problems in bioinformatics.31 Figure 2a depicts an HMM, which models a system with three states (a, b, c) and three outputs (x, y, z). The figure depicts the possible transitions between the system states, which need to be annotated by their probabilities. For example, state b can transition to states a or c, with a 30% chance of transitioning to state c. Each state can emit a number of observable outputs, again, with some probabilities. In this example, state b can emit any of the three outputs, with output z having a 15% chance of being emitted by this state.

This HMM can be represented by the Bayesian network in Figure 2b.32 Here, variable St has three values a, b, c and represents the system state at time t, while variable Ot has the values x, y, z and represents the system output at time t. Using d-separation on this network, one can immediately derive the characteristic property of HMMs: once the state of the system at time t is known, its states and outputs at times > t become independent of its states and outputs at times < t.

We also note the network in Figure 2b is one of the simplest instances of what is known as dynamic Bayesian networks (DBNs).9 A number of extensions have been considered for HMMs, which can be viewed as more structured instances of DBNs. When proposing such extensions, however, one has the obligation of offering a corresponding algorithmic toolbox for inference. By viewing these extended HMMs as instances of Bayesian networks, one immediately inherits the corresponding Bayesian network algorithms for this purpose.
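As an illustration of why the DBN view pays off algorithmically, here is a sketch of the standard forward (filtering) algorithm for the HMM of Figure 2a, which exploits the DBN factorization to compute state beliefs in time linear in the sequence length. Only two numbers below come from the figure (the .30 transition from b to c, with the .70 transition from b to a following from it, and the .15 emission of z from b); the rest are hypothetical.

```python
# Filtering in the DBN of Figure 2b: the factorization
#   Pr(s1..sn, o1..on) = Pr(s1) Pr(o1|s1) * prod_t Pr(st|st-1) Pr(ot|st)
# lets us compute Pr(St | o1..ot) with a single forward pass.

states = ('a', 'b', 'c')
prior = {'a': 0.4, 'b': 0.3, 'c': 0.3}                 # hypothetical
trans = {'a': {'a': 0.6, 'b': 0.4, 'c': 0.0},          # hypothetical, except
         'b': {'a': 0.7, 'b': 0.0, 'c': 0.3},          # b->c = .30 (Figure 2)
         'c': {'a': 0.5, 'b': 0.5, 'c': 0.0}}
emit  = {'a': {'x': 0.8, 'y': 0.1, 'z': 0.1},          # hypothetical, except
         'b': {'x': 0.25, 'y': 0.60, 'z': 0.15},       # b emits z = .15
         'c': {'x': 0.1, 'y': 0.2, 'z': 0.7}}

def forward(observations):
    """Return Pr(St | o1..ot) for the final time step t."""
    belief = {s: prior[s] * emit[s][observations[0]] for s in states}
    for o in observations[1:]:
        belief = {s: emit[s][o] * sum(belief[r] * trans[r][s] for r in states)
                  for s in states}
    total = sum(belief.values())
    return {s: p / total for s, p in belief.items()}

print(forward(['x', 'z', 'y']))
```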

How Are Bayesian Networks Constructed?

One can identify three main methods for constructing Bayesian networks.8 According to the first method, which is largely subjective, one reflects on their own knowledge or the knowledge of others (typically, perceptions about causal influences) and then captures it in a Bayesian network. The network in Figure 1 is an example of this construction method. The network structure of Figure 3 depicts another example, yet the parameters of this network can be obtained from more formal sources, such as population statistics and test specifications. According to this network, we have a population that is 55% male and 45% female, whose members can suffer from a medical condition C that is more likely to occur in males. Moreover, two diagnostic tests are available for detecting this condition, T1 and T2, with the second test being more effective on females. The CPTs of this network also reveal that the two tests are equally effective on males.

The second method for constructing Bayesian networks is based on automatically synthesizing them from some other type of formal knowledge.

Figure 3. Bayesian network that models a population, a medical condition, and two corresponding tests. Nodes: Sex (S), Condition (C), Test (T1), Test (T2). Here C = yes means that the condition C is present, and Ti = +ve means the result of test Ti is positive.

S       θs
male    .55

S        C     θc|s
male     yes   .05
female   yes   .01

C     T1    θt1|c
yes   +ve   .80
no    +ve   .20

S        C     T2    θt2|s,c
male     yes   +ve   .80
male     no    +ve   .20
female   yes   +ve   .95
female   no    +ve   .05


For example, in many applications that involve system analysis, such as reliability and diagnosis, one can synthesize a Bayesian network automatically from a formal system design. Figure 4 depicts a reliability block diagram (RBD) used in reliability analysis. The RBD depicts system components and the dependencies between their availability. For example, Processor 1 requires either of the fans for its availability, and each of the fans requires power for its availability. The goal here is to compute the overall reliability of the system (the probability of its availability) given the reliabilities of its individual components. Figure 4 also shows how one may systematically convert each block in an RBD into a Bayesian network fragment, while Figure 5 depicts the corresponding Bayesian network constructed according to this method. The CPTs of this figure can be completely constructed based on the reliabilities of individual components (not shown here) and the semantics of the transformation method.8
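A brute-force sketch of this reliability computation: component failures are independent, subsystem availability is a deterministic AND/OR function of component states as in the fragment mapping of Figure 4, and the overall reliability is the total probability of the worlds in which the system is available. The wiring below is one plausible reading of the diagram, and the component reliabilities are hypothetical.

```python
from itertools import product

# Components: Power Supply (E), Fans (F1, F2), Processors (P1, P2), Drive (D).
reliability = {'E': 0.99, 'F1': 0.95, 'F2': 0.95,
               'P1': 0.90, 'P2': 0.90, 'D': 0.98}
components = sorted(reliability)

def system_available(works):
    """Deterministic AND/OR availability logic; an assumed reading of Figure 4."""
    fan1 = works['F1'] and works['E']           # each fan requires power
    fan2 = works['F2'] and works['E']
    proc1 = works['P1'] and (fan1 or fan2)      # a processor requires some fan
    proc2 = works['P2'] and (fan1 or fan2)
    drive = works['D'] and works['E']
    return (proc1 or proc2) and drive           # assumed system requirement

total = 0.0
for values in product([True, False], repeat=len(components)):
    works = dict(zip(components, values))
    p = 1.0
    for c in components:
        p *= reliability[c] if works[c] else 1 - reliability[c]
    if system_available(works):
        total += p
print(f"Pr(S = avail) = {total:.4f}")
```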

The third method for constructing Bayesian networks is based on learning them from data, such as medical records or student admissions data. Consider Figure 3 and the data set depicted in the following table as an example.

Sex S    Condition C    Test T1    Test T2
male     no             ?          −ve
male     ?              −ve        +ve
female   yes            +ve        ?

Each row of the table corresponds to an individual and what we know about them. One can use such a data set to learn the network parameters given its structure, or to learn both the structure and its parameters. Learning parameters only is an easier task computationally. Moreover, learning either structure or parameters always becomes easier when the data set is complete, that is, when the value of each variable is known in each data record.

Since learning is an inductive process, one needs a principle of induction to guide the learning process. The two main principles for this purpose lead to the maximum likelihood and Bayesian approaches to learning (see, for example, the work of 5,8,17,22,27). The maximum likelihood approach favors Bayesian networks that maximize the probability of observing the given data set. The Bayesian approach, on the other hand, uses the likelihood principle in addition to some prior information which encodes preferences on Bayesian networks.

Suppose we are only learning network parameters. The Bayesian approach allows one to put a prior distribution on the possible values of each network parameter. This prior distribution, together with the data set, induces a posterior distribution on the values of that parameter. One can then use this posterior to pick a value for that parameter (for example, the distribution mean). Alternatively, one can decide to avoid committing to a fixed parameter value, while computing answers to queries by averaging over all possible parameter values according to their posterior probabilities.
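A minimal sketch of both induction principles for a single network parameter, say Pr(T1 = +ve | C = yes) in Figure 3, from a hypothetical set of complete records for individuals known to have the condition. Maximum likelihood picks the relative frequency; the Bayesian approach places a Beta prior on the parameter and works with the resulting posterior.

```python
# Hypothetical complete records of test T1 for individuals with C = yes.
records = ['+ve', '+ve', '-ve', '+ve', '+ve', '-ve', '+ve', '+ve']
k, n = records.count('+ve'), len(records)

# Maximum likelihood: the relative frequency maximizes Pr(data | theta).
theta_ml = k / n

# Bayesian: a Beta(a, b) prior yields a Beta(a + k, b + n - k) posterior;
# committing to the posterior mean gives a smoothed estimate, or one can
# average query answers over the whole posterior instead of picking a value.
a, b = 1, 1                                   # uniform prior, hypothetical
theta_bayes = (a + k) / (a + b + n)

print(theta_ml, theta_bayes)                  # 0.75 vs 0.7
```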

It is critical to observe here that the term "Bayesian network" does not necessarily imply a commitment to the Bayesian approach for learning networks. This term was coined by Judea Pearl28 to emphasize three aspects: the often subjective nature of the information used in constructing these networks; the reliance on Bayes conditioning when reasoning with Bayesian networks; and the ability to perform causal as well as evidential reasoning on these networks, which is a distinction underscored by Thomas Bayes.1

Figure 4. Reliability block diagram (top), with a systematic method for mapping its blocks into Bayesian network fragments (bottom). The diagram contains the blocks Power Supply, Fan 1, Fan 2, Processor 1, Processor 2, and Hard Drive; in the fragment mapping, a block B fed by Sub-System 1 and Sub-System 2 yields a node Block B Available that is determined from Block B, Sub-System 1 Available, and Sub-System 2 Available through a logical AND.

Figure 5. Bayesian network generated automatically from a reliability block diagram. Variables E (Power Supply), F1, F2 (Fans), P1, P2 (Processors), and D (Hard Drive) represent the availability of the corresponding components, and variable S represents the availability of the whole system. Variables Ai (A1–A4) correspond to logical ANDs, and variables Oi (O1, O2) correspond to logical ORs.


These learning approaches are meant to induce Bayesian networks that are meaningful independently of the tasks for which they are intended. Consider for example a network which models a set of diseases and a corresponding set of symptoms. This network may be used to perform diagnostic tasks, by inferring the most likely disease given a set of observed symptoms. It may also be used for prediction tasks, where we infer the most likely symptom given some diseases. If we concern ourselves with only one of these tasks, say diagnostics, we can use a more specialized induction principle that optimizes the diagnostic performance of the learned network. In machine learning jargon, we say we are learning a discriminative model in this case, as it is often used to discriminate among patients according to a predefined set of classes (for example, has cancer or not). This is to be contrasted with learning a generative model, which is evaluated based on its ability to generate the given data set, regardless of how it performs on any particular task.

We finally note that it is not uncommon to assume some canonical network structure when learning Bayesian networks from data, in order to reduce the problem of learning structure and parameters to the easier problem of learning parameters only. Perhaps the most common such structure is what is known as naive Bayes: C → A1, …, C → An, where C is known as the class variable and variables A1, …, An are known as attributes.

This structure has proven to be very popular and effective in a number of applications, in particular classification and clustering.14
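A sketch of the naive Bayes structure in action, with maximum likelihood parameters (plus add-one smoothing) estimated from a hypothetical toy data set; classification computes Pr(C | a1, …, an), which factorizes thanks to the structure C → A1, …, C → An.

```python
from collections import Counter

# Hypothetical toy data: (class, attributes observed in the document).
data = [('spam', ('offer', 'win')), ('spam', ('win', 'now')),
        ('ham', ('meeting', 'now')), ('ham', ('offer', 'meeting'))]

class_docs = Counter(c for c, _ in data)                 # counts for Pr(C)
word_counts = Counter((c, a) for c, attrs in data for a in attrs)
words_per_class = Counter()
for c, attrs in data:
    words_per_class[c] += len(attrs)
vocab = {a for _, attrs in data for a in attrs}

def posterior(attrs):
    """Pr(C | attrs) via the naive Bayes factorization, with smoothing."""
    scores = {}
    for c in class_docs:
        score = class_docs[c] / len(data)                # Pr(C)
        for a in attrs:                                  # prod_i Pr(Ai | C)
            score *= (word_counts[(c, a)] + 1) / (words_per_class[c] + len(vocab))
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(posterior(('win', 'offer')))                       # favors 'spam'
```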

Canonical Bayesian Networks

A number of canonical Bayesian networks have been proposed for modeling some well-known problems in a variety of fields. For example, genetic linkage analysis is concerned with mapping genes onto a chromosome, utilizing the fact that the distance between genes is inversely proportional to the extent to which genes are linked (two genes are linked when it is more likely than not that their states are passed together from a single grandparent, instead of one state from each grandparent). To assess the likelihood of a linkage hypothesis, one uses a pedigree with some information about the genotype and phenotype of associated individuals. Such information can be systematically translated into a Bayesian network (see Figure 6), where the likelihood of a linkage hypothesis corresponds to computing the probability of an event with respect to this network.12 By casting this problem in terms of inference on Bayesian networks, and by capitalizing on state-of-the-art algorithms for this purpose, the scalability of genetic linkage analysis was advanced considerably, leading to the most efficient algorithms for exact linkage computations on general pedigrees (for example, the SUPERLINK program initiated by Fishelson and Geiger12).

Figure 6. Bayesian network generated automatically from a pedigree that contains three individuals (father: 1, mother: 2, child: 3). Variables GPij and GMij represent the states of gene j for individual i, which are inherited from the father and mother of i, respectively. Variable Pij represents the phenotype determined by gene j of individual i. Variables SPij and SMij are meant to model recombination events, which occur when the states of two genes inherited from a parent do not come from a single grandparent. The CPTs of these variables contain parameters which encode linkage hypotheses.

Figure 7. Bayesian network that models a noisy channel with input (U1,…,U4, X1,…,X3) and output (Y1,…,Y7). Bits X1,…,X3 are redundant coding bits and their CPTs capture the coding protocol. The CPTs of variables Yi capture the noise model. This network can be used to recover the most likely sent message U1,…,U4 given a received message Y1,…,Y7.


Canonical models also exist for modeling the problem of passing information over a noisy channel, where the goal is to compute the most likely message sent over such a channel, given the channel output.13 For example, Figure 7 depicts a Bayesian network corresponding to a situation where seven bits are sent across a noisy channel (four original bits and three redundant ones).
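A sketch of such decoding by enumeration over the 16 possible messages. The parity equations defining X1,…,X3 are an assumption (a Hamming-style code chosen for illustration), as is the bit-flip noise model; the article does not specify the coding protocol.

```python
from itertools import product

eps = 0.1                                     # hypothetical bit-flip probability

def encode(u):
    """Message bits U1..U4 plus redundant bits X1..X3 (assumed parity checks)."""
    u1, u2, u3, u4 = u
    return (u1, u2, u3, u4, u1 ^ u2 ^ u3, u1 ^ u2 ^ u4, u1 ^ u3 ^ u4)

def likelihood(sent, received):
    """Each channel bit Yi flips the corresponding input independently."""
    p = 1.0
    for s, r in zip(sent, received):
        p *= (1 - eps) if s == r else eps
    return p

def decode(received):
    """Most likely message U1..U4 given channel output Y1..Y7."""
    return max(product([0, 1], repeat=4),
               key=lambda u: likelihood(encode(u), received))

print(decode((1, 0, 0, 1, 1, 0, 0)))          # -> (1, 0, 0, 1) under this code
```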

Another class of canonical Bayesian networks has been used in various problems relating to vision, including image restoration and stereo vision. Figure 8 depicts two examples of images that were restored by posing a query to a corresponding Bayesian network. Figure 9a depicts the Bayesian network in this case, where we have one node Pi for each pixel i in the image; the values pi of Pi represent the gray levels of pixel i. For each pair of neighboring pixels, i and j, a child node Cij is added with a CPT that imposes a smoothness constraint between the pixels. That is, the probability Pr(Cij = true | Pi = pi, Pj = pj) specified by the CPT decreases as the difference in gray levels |pi − pj| increases. The only additional information needed to completely specify the Bayesian network is a CPT for each node Pi, which provides a prior distribution on the gray levels of each pixel i. These CPTs are chosen to give the highest probability to the gray level vi appearing in the input image, with the prior probability Pr(Pi = pi) decreasing as the difference |pi − vi| increases. By simply adjusting the domain and the prior probabilities of nodes Pi, while asserting an appropriate degree of smoothness using variables Cij, one can use this model to perform other pixel-labeling tasks such as stereo vision, photomontage, and binary segmentation.34 The formulation of these tasks as inference on Bayesian networks is not only elegant, but has also proven to be very powerful. For example, such inference is the basis for almost all top-performing stereo methods.34
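A tiny restoration sketch in the spirit of Figure 9a, shrunk to a one-dimensional "image" of four pixels so that the query can be answered by enumeration. The exponential forms of the pixel priors and of the smoothness CPTs are assumptions; only their qualitative shape (decreasing in |pi − vi| and |pi − pj|) comes from the text.

```python
from itertools import product
from math import exp

levels = range(4)                      # gray levels 0..3
observed = [0, 3, 1, 1]                # hypothetical noisy input image

def prior(pi, vi):                     # Pr(Pi = pi), peaked at the observed vi
    return exp(-abs(pi - vi))

def smooth(pi, pj):                    # Pr(Cij = true | pi, pj)
    return exp(-abs(pi - pj))

def score(pixels):
    """Unnormalized Pr(pixels, Cij = true for all neighbors i, j)."""
    p = 1.0
    for pi, vi in zip(pixels, observed):
        p *= prior(pi, vi)
    for pi, pj in zip(pixels, pixels[1:]):
        p *= smooth(pi, pj)
    return p

# Restoration = most likely pixel values given Cij = true everywhere.
restored = max(product(levels, repeat=len(observed)), key=score)
print(restored)                        # the isolated 3 is pulled toward its neighbors
```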

Canonical Bayesian network models have also been emerging in recent years in the context of other important applications, such as the analysis of documents and text. Many of these networks are based on topic models that view a document as arising from a set of unknown topics, and provide a framework for reasoning about the connections between words, documents, and underlying topics.2,33 Topic models have been applied to many kinds of documents, including email, scientific abstracts, and newspaper archives, allowing one to utilize inference on Bayesian networks to tackle tasks such as measuring document similarity, discovering emergent topics, and browsing through documents based on their underlying content instead of traditional indexing schemes.

What Can One Do with a Bayesian Network?

Similar to any modeling language, the value of Bayesian networks is mainly tied to the class of queries they support. Consider the network in Figure 3 for an example and the following queries: Given a male that came out positive on both tests, what is the probability he has the condition? Which group of the population is most likely to test negative on both tests? Considering the network in Figure 5: What is the overall reliability of the given system? What is the most likely configuration of the two fans given that the system is unavailable? What single component can be replaced to increase the overall system reliability by 5%? Consider Figure 7: What is the most likely channel input that would yield the channel output 1001100?

Figure 8. Images from left to right: input, restored (using Bayesian network inference) and original.

Figure 9. Modeling low-level vision problems using two types of graphical models: (a) a Bayesian network, with edges Pi → Cij ← Pj between neighboring pixels; (b) a Markov random field (MRF), with a single undirected edge between Pi and Pj.


These are example questions that would be of interest in these application domains, and they are questions that can be answered systematically using three canonical Bayesian network queries.8 A main benefit of using Bayesian networks in these application areas is the ability to capitalize on existing algorithms for these queries, instead of having to invent a corresponding specialized algorithm for each application area.

Probability of Evidence. This query computes the probability Pr(e), where e is an assignment of values to some variables E in the Bayesian network; e is called a variable instantiation or evidence in this case. For example, in Figure 3, we can compute the probability that an individual will come out positive on both tests using the probability-of-evidence query Pr(T1 = +ve, T2 = +ve). We can also use the same query to compute the overall reliability of the system in Figure 5, Pr(S = avail). The decision version of this query is known to be PP-complete. It is also related to another common query, which asks for computing the probability Pr(x|e) for each network variable X and each of its values x. This is known as the node marginals query.
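Both queries can be answered on the Figure 3 network by summing the chain-rule joint over the instantiations consistent with the evidence; all numbers below are the CPTs given in Figure 3.

```python
from itertools import product

# Chain rule for Figure 3: Pr(s, c, t1, t2) = Pr(s) Pr(c|s) Pr(t1|c) Pr(t2|s,c).
pr_male = 0.55
pr_c = {'male': 0.05, 'female': 0.01}                 # Pr(C = yes | S)
pr_t1 = {'yes': 0.80, 'no': 0.20}                     # Pr(T1 = +ve | C)
pr_t2 = {('male', 'yes'): 0.80, ('male', 'no'): 0.20, # Pr(T2 = +ve | S, C)
         ('female', 'yes'): 0.95, ('female', 'no'): 0.05}

def joint(s, c, t1, t2):
    p = pr_male if s == 'male' else 1 - pr_male
    p *= pr_c[s] if c == 'yes' else 1 - pr_c[s]
    p *= pr_t1[c] if t1 == '+ve' else 1 - pr_t1[c]
    p *= pr_t2[(s, c)] if t2 == '+ve' else 1 - pr_t2[(s, c)]
    return p

worlds = list(product(['male', 'female'], ['yes', 'no'],
                      ['+ve', '-ve'], ['+ve', '-ve']))

# Probability of evidence: both tests come out positive.
evidence = sum(joint(*w) for w in worlds if w[2] == '+ve' and w[3] == '+ve')
print(f"Pr(T1 = +ve, T2 = +ve) = {evidence:.4f}")

# A node marginal under the same evidence.
num = sum(joint(*w) for w in worlds
          if w[1] == 'yes' and w[2] == '+ve' and w[3] == '+ve')
print(f"Pr(C = yes | both +ve) = {num / evidence:.4f}")
```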

Most Probable Explanation (MPE). Given an instantiation e of some variables E in the Bayesian network, this query identifies the instantiation q of all remaining variables Q that maximizes the probability Pr(q|e). In Figure 3, we can use an MPE query to find the most likely group, dissected by sex and condition, that will yield negative results for both tests, by taking the evidence e to be T1 = −ve, T2 = −ve and Q = {S, C}. We can also use an MPE query to restore images as shown in Figures 8 and 9. Here, we take the evidence e to be Cij = true for all i, j and Q to include Pi for all i. The decision version of MPE is known to be NP-complete and is therefore easier than the probability-of-evidence query under standard assumptions of complexity theory.

Maximum a Posteriori Hypothesis (MAP). Given an instantiation e of some variables E in the Bayesian network, this query identifies the instantiation q of some variables Q that maximizes the probability Pr(q|e). Note the subtle difference with MPE queries: Q is a subset of the variables not in E, instead of being all of them. MAP is a more difficult problem than MPE since its decision version is known to be NP^PP-complete, while MPE is only NP-complete. As an example of this query, consider Figure 5 and suppose we are interested in the most likely configuration of the two fans given that the system is unavailable. We can find this configuration using a MAP query with the evidence e being S = un_avail and Q = {F1, F2}.

Figure 10. An arithmetic circuit for the Bayesian network B ← A → C, with alternating addition (+) and multiplication (*) nodes. Inputs labeled with θ variables correspond to network parameters, while those labeled with λ variables capture evidence. Probability-of-evidence, node-marginals, and MPE queries can all be answered using linear-time traversal algorithms of the arithmetic circuit.

Figure 11. Two networks that represent the same set of conditional independencies, over the variables Earthquake? (E), Burglary? (B), Radio? (R), Alarm? (A), and Call? (C): (a) a network consistent with causal perceptions; (b) an equivalent network that contains the edges A → E and A → B.

Figure 12. Extending Bayesian networks to account for interventions: the networks of Figure 11, each extended with a Tampering (T) node added as a new direct cause of the Alarm? (A) variable, for (a) the causal structure and (b) the non-causal structure.


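The following sketch contrasts MPE and MAP by enumeration, on the small network of Figure 3 (rather than Figure 5, whose CPTs are not listed): MPE maximizes over all remaining variables at once, while MAP first sums out the variables excluded from Q.

```python
from itertools import product

# CPTs of Figure 3.
pr_male = 0.55
pr_c = {'male': 0.05, 'female': 0.01}
pr_t1 = {'yes': 0.80, 'no': 0.20}
pr_t2 = {('male', 'yes'): 0.80, ('male', 'no'): 0.20,
         ('female', 'yes'): 0.95, ('female', 'no'): 0.05}

def joint(s, c, t1, t2):
    p = pr_male if s == 'male' else 1 - pr_male
    p *= pr_c[s] if c == 'yes' else 1 - pr_c[s]
    p *= pr_t1[c] if t1 == '+ve' else 1 - pr_t1[c]
    p *= pr_t2[(s, c)] if t2 == '+ve' else 1 - pr_t2[(s, c)]
    return p

# MPE: most likely (S, C) given that both tests are negative.
mpe = max(product(['male', 'female'], ['yes', 'no']),
          key=lambda sc: joint(sc[0], sc[1], '-ve', '-ve'))
print("MPE:", mpe)

# MAP over Q = {S} given the same evidence: sum out C, then maximize.
def map_score(s):
    return sum(joint(s, c, '-ve', '-ve') for c in ['yes', 'no'])
print("MAP:", max(['male', 'female'], key=map_score))
```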

One can use these canonical queries to implement more sophisticated queries, such as the ones demanded by sensitivity analysis. This is a mode of analysis that allows one to check the robustness of conclusions drawn from Bayesian networks against perturbations in network parameters (for example, see Darwiche8). Sensitivity analysis can also be used for automatically revising these parameters in order to satisfy some global constraints that are imposed by experts, or derived from the formal specifications of tasks under consideration. Suppose for example that we compute the overall system reliability using the network in Figure 5 and it turns out to be 95%. Suppose we wish this reliability to be no less than 99%: Pr(S = avail) ≥ 99%. Sensitivity analysis can be used to identify components whose reliability is relevant to achieving this objective, together with the new reliabilities they must attain for this purpose. Note that component reliabilities correspond to network parameters in this example.

How Well Do Bayesian Networks Scale?

Algorithms for inference on Bayesian networks fall into two main categories: exact and approximate algorithms. Exact algorithms make no compromises on accuracy and tend to be more expensive computationally. Much emphasis was placed on exact inference in the 1980s and early 1990s, leading to two classes of algorithms based on the concepts of elimination10,24,36 and conditioning.6,29 In their pure form, the complexity of these algorithms is exponential only in the network treewidth, which is a graph-theoretic parameter that measures the resemblance of a graph to a tree structure. For example, the treewidth of trees is 1 and, hence, inference on such tree networks is quite efficient. As the network becomes more and more connected, its treewidth increases and so does the complexity of inference. For example, the network in Figure 9 has a treewidth of n, assuming an underlying image with n × n pixels. This is usually too high, even for modest-size images, to make these networks accessible to treewidth-based algorithms.

The pure form of elimination and conditioning algorithms are called structure-based, as their complexity is sensitive only to the network structure. In particular, these algorithms will consume the same computational resources when applied to two networks that share the same structure, regardless of what probabilities are used to annotate them. It has long been observed that inference algorithms can be made more efficient if they also exploit the structure exhibited by network parameters, including determinism (0 and 1 parameters) and context-specific independence (independence that follows from network parameters and is not detectable by d-separation3). Yet, algorithms for exploiting parametric structure have only matured in the last few years, allowing one to perform exact inference on some networks whose treewidth is quite large (see the survey8). Networks that correspond to genetic linkage analysis (Figure 6) tend to fall in this category,12 and so do networks that are synthesized from relational models.4

One of the key techniques for exploiting parametric structure is based on compiling Bayesian networks into arithmetic circuits, allowing one to reduce probabilistic inference to a process of circuit propagation;7 see Figure 10. The size of these compiled circuits is determined by both the network topology and its parameters, leading to relatively compact circuits in some situations where the parametric structure is excessive, even if the network treewidth is quite high (for example, Chavira et al.4). Reducing inference to circuit propagation also makes it easier to support applications that require real-time inference, as in certain diagnosis applications.25
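A minimal sketch of what circuit propagation looks like for the network B ← A → C of Figure 10: the circuit alternates additions and multiplications over parameter inputs (thetas) and evidence indicators (lambdas), and a single bottom-up pass yields the probability of evidence. The CPT values are hypothetical, and the circuit is built by hand rather than by a compiler.

```python
# Circuit nodes: ('in', name), ('*', children), ('+', children).
def mul(*kids): return ('*', kids)
def add(*kids): return ('+', kids)
def leaf(name): return ('in', name)

def branch(a):   # lambda_a * theta_a * (sum over b) * (sum over c)
    return mul(leaf(f'lam_A{a}'), leaf(f'th_A{a}'),
               add(*[mul(leaf(f'lam_B{b}'), leaf(f'th_B{b}|A{a}')) for b in (0, 1)]),
               add(*[mul(leaf(f'lam_C{c}'), leaf(f'th_C{c}|A{a}')) for c in (0, 1)]))

circuit = add(branch(0), branch(1))

def evaluate(node, inputs):
    """One bottom-up (linear-time) traversal of the circuit."""
    kind, arg = node
    if kind == 'in':
        return inputs[arg]
    vals = [evaluate(k, inputs) for k in arg]
    if kind == '*':
        out = 1.0
        for v in vals:
            out *= v
        return out
    return sum(vals)

# Hypothetical parameters; evidence zeroes out the indicators it rules out.
inputs = {'th_A0': 0.3, 'th_A1': 0.7,
          'th_B0|A0': 0.9, 'th_B1|A0': 0.1, 'th_B0|A1': 0.2, 'th_B1|A1': 0.8,
          'th_C0|A0': 0.6, 'th_C1|A0': 0.4, 'th_C0|A1': 0.5, 'th_C1|A1': 0.5}
inputs.update({f'lam_{v}{i}': 1.0 for v in 'ABC' for i in (0, 1)})
inputs['lam_B0'] = 0.0             # evidence: B = 1
print(evaluate(circuit, inputs))   # Pr(B = 1) = 0.59
```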

Around the mid-1990s, a strong belief started forming in the inference community that the performance of exact algorithms must be exponential in treewidth; this was before parametric structure was being exploited effectively. At about the same time, methods for automatically constructing Bayesian networks started maturing to the point of yielding networks whose treewidth was too large to be handled efficiently by the exact algorithms of the time. This led to a surge of interest in approximate inference algorithms, which are generally independent of treewidth. Today, approximate inference algorithms are the only choice for networks that have a high treewidth, yet lack sufficient parametric structure; the networks used in low-level vision applications tend to have this property. An influential class of approximate inference algorithms is based on reducing the inference problem to a constrained optimization problem, with loopy belief propagation and its generalizations as one key example.29,35 Loopy belief propagation is actually the common algorithm of choice today for handling networks with very high treewidth, such as the ones arising in vision or channel coding applications. Algorithms based on stochastic sampling have also been pursued for a long time and are especially important for inference in Bayesian networks that contain continuous variables.8,15,22 Variational methods provide another important class of approximation techniques19,22 and are key for inference on some Bayesian networks, such as the ones arising in topic models.2

Causality, Again

One of the most intriguing aspects of Bayesian networks is the role they play in formalizing causality. To illustrate this point, consider Figure 11, which depicts two Bayesian network structures over the same set of variables. One can verify using d-separation that these structures represent the same set of conditional independencies. As such, they are representationally equivalent, as they can induce the same set of probability distributions when augmented with appropriate CPTs.


Note, however, that the network in Figure 11a is consistent with common perceptions of causal influences, yet the one in Figure 11b violates these perceptions due to edges A → E and A → B. Is there any significance to this discrepancy? In other words, is there some additional information that can be extracted from one of these networks, which cannot be extracted from the other? The answer is yes, according to a body of work on causal Bayesian networks, which is concerned with a key question:16,30 how can one characterize the additional information captured by a causal Bayesian network and, hence, what queries can be answered only by Bayesian networks that have a causal interpretation?

According to this body of work, only causal networks are capable of updating probabilities based on interventions, as opposed to observations. To give an example of this difference, consider Figure 11 again and suppose that we want to compute the probabilities of various events given that someone has tampered with the alarm, causing it to go off. This is an intervention, to be contrasted with an observation, where we know the alarm went off but without knowing the reason. In a causal network, interventions are handled as shown in Figure 12a: by simply adding a new direct cause for the alarm variable. This local fix, however, cannot be applied to the non-causal network in Figure 11b. If we do, we obtain the network in Figure 12b, which asserts the following (using d-separation): if we observe the alarm did go off, then knowing it was not tampered with is irrelevant to whether a burglary or an earthquake took place. This independence, which is counterintuitive, does not hold in the causal structure, and represents one example of what may go wrong when using a non-causal structure to answer questions about interventions.

Causal structures can also be used to answer more sophisticated queries, such as counterfactuals. For example, the query "what is the probability that the patient would have been alive had he not taken the drug?" requires reasoning about interventions (and sometimes might even require functional information, beyond standard causal Bayesian networks30). Other types of queries include ones for distinguishing between direct and indirect causes and for determining the sufficiency and necessity of causation.30 Learning causal Bayesian networks has also been treated,16,30 although not as extensively as the learning of general Bayesian networks.

Beyond Bayesian Networks

Viewed as graphical representations of probability distributions, Bayesian networks are only one of several models for this purpose. In fact, in areas such as statistics (and now also in AI), Bayesian networks are studied under the broader class of probabilistic graphical models, which include other instances such as Markov networks and chain graphs (for example, Edwards11 and Koller and Friedman22). Markov networks correspond to undirected graphs, and chain graphs have both directed and undirected edges. Both of these models can be interpreted as compact specifications of probability distributions, yet their semantics tend to be less transparent than Bayesian networks. For example, both of these models include numeric annotations, yet one cannot interpret these numbers directly as probabilities even though the whole model can be interpreted as a probability distribution. Figure 9b depicts a special case of a Markov network, known as a Markov random field (MRF), which is typically used in vision applications. Comparing this model to the Bayesian network in Figure 9a, one finds that smoothness constraints between two adjacent pixels Pi and Pj can now be represented by a single undirected edge between Pi and Pj, instead of two directed edges and an additional node, Pi → Cij ← Pj. In this model, each edge is associated with a function f(Pi, Pj) over the states of adjacent pixels. The values of this function can be used to capture the smoothness constraint for these pixels, yet do not admit a direct probabilistic interpretation.

Bayesian networks are meant to model probabilistic beliefs, yet the interest in such beliefs is typically motivated by the need to make rational decisions. Since such decisions are often contemplated in the presence of uncertainty, one needs to know the likelihood and utilities associated with various decision outcomes.


A classical example in this regard concerns an oil wildcatter that needs to decide whether or not to drill for oil at a specific site, with an additional decision on whether to request seismic soundings that may help determine the geological structure of the site. Each of these decisions has an associated cost. Moreover, their potential outcomes have associated utilities and probabilities. The need to integrate these probabilistic beliefs, utilities, and decisions has led to the development of influence diagrams, which are extensions of Bayesian networks that include three types of nodes: chance, utility, and decision.18 Influence diagrams, also called decision networks, come with a toolbox that allows one to compute optimal strategies: ones that are guaranteed to produce the highest expected utility.20,22

Bayesian networks have also been extended in ways that are meant to facilitate their construction. In many domains, such networks tend to exhibit regular and repetitive structures, with the regularities manifesting in both CPTs and network structure. In these situations, one can synthesize large Bayesian networks automatically from compact high-level specifications. A number of concrete specifications have been proposed for this purpose. For example, template-based approaches require two components for specifying a Bayesian network: a set of network templates whose instantiation leads to network segments, and a specification of which segments to generate and how to connect them together.22,23 Other approaches include languages based on first-order logic, allowing one to reason about situations with varying sets of objects (for example, Milch et al.26).

The Challenges Ahead

Bayesian networks have been established as a ubiquitous tool for modeling and reasoning under uncertainty. The reach of Bayesian networks, however, is tied to their effectiveness in representing the phenomena of interest, and to the scalability of their inference algorithms. To further improve the scope and ubiquity of Bayesian networks, one therefore needs sustained progress on both fronts. The main challenges on the first front lie in increasing the expressive power of Bayesian network representations, while maintaining the key features that have proven necessary for their success: modularity of representation, transparent graphical nature, and efficiency of inference. On the algorithmic side, there is a need to better understand the theoretical and practical limits of exact inference algorithms based on the two dimensions that characterize Bayesian networks: their topology and parametric structure.

With regard to approximate inference algorithms, the main challenges seem to be in better understanding their behavior, to the point where we can characterize the conditions under which they are expected to yield good approximations, and provide tools for practically trading off approximation quality with computational resources. Pushing the limits of inference algorithms will immediately push the envelope with regard to learning Bayesian networks, since many learning algorithms rely heavily on inference. One cannot emphasize enough the importance of this line of work, given the extent to which data is available today, and the abundance of applications that require the learning of networks from such data.

References

1. Bayes, T. An essay towards solving a problem in the doctrine of chances. Phil. Trans. 3 (1763), 370–418. Reproduced in W.E. Deming.
2. Blei, D.M., Ng, A.Y. and Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
3. Boutilier, C., Friedman, N., Goldszmidt, M. and Koller, D. Context-specific independence in Bayesian networks. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (1996), 115–123.
4. Chavira, M., Darwiche, A. and Jaeger, M. Compiling relational Bayesian networks for exact inference. International Journal of Approximate Reasoning 42, 1-2 (2006), 4–20.
5. Cowell, R., Dawid, A., Lauritzen, S. and Spiegelhalter, D. Probabilistic Networks and Expert Systems. Springer, 1999.
6. Darwiche, A. Recursive conditioning. Artificial Intelligence 126, 1-2 (2001), 5–41.
7. Darwiche, A. A differential approach to inference in Bayesian networks. Journal of the ACM 50, 3 (2003).
8. Darwiche, A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
9. Dean, T. and Kanazawa, K. A model for reasoning about persistence and causation. Computational Intelligence 5, 3 (1989), 142–150.
10. Dechter, R. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (1996), 211–219.
11. Edwards, D. Introduction to Graphical Modelling. Springer, 2nd edition, 2000.
12. Fishelson, M. and Geiger, D. Exact genetic linkage computations for general pedigrees. Bioinformatics 18, 1 (2002), 189–198.
13. Frey, B., ed. Graphical Models for Machine Learning and Digital Communication. MIT Press, Cambridge, MA, 1998.
14. Friedman, N., Geiger, D. and Goldszmidt, M. Bayesian network classifiers. Machine Learning 29, 2-3 (1997), 131–163.
15. Gilks, W., Richardson, S. and Spiegelhalter, D. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Chapman & Hall/CRC, 1995.
16. Glymour, C. and Cooper, G., eds. Computation, Causation, and Discovery. MIT Press, Cambridge, MA, 1999.
17. Heckerman, D. A tutorial on learning with Bayesian networks. Learning in Graphical Models. Kluwer, 1998, 301–354.
18. Howard, R.A. and Matheson, J.E. Influence diagrams. Principles and Applications of Decision Analysis, Vol. 2. Strategic Decisions Group, Menlo Park, CA, 1984, 719–762.
19. Jaakkola, T. Tutorial on variational approximation methods. Advanced Mean Field Methods. D. Saad and M. Opper, eds. MIT Press, Cambridge, MA, 2001, 129–160.
20. Jensen, F.V. and Nielsen, T.D. Bayesian Networks and Decision Graphs. Springer, 2007.
21. Jordan, M., Ghahramani, Z., Jaakkola, T. and Saul, L. An introduction to variational methods for graphical models. Machine Learning 37, 2 (1999), 183–233.
22. Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.
23. Koller, D. and Pfeffer, A. Object-oriented Bayesian networks. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (1997), 302–313.
24. Lauritzen, S.L. and Spiegelhalter, D.J. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50, 2 (1988), 157–224.
25. Mengshoel, O., Darwiche, A., Cascio, K., Chavira, M., Poll, S. and Uckun, S. Diagnosing faults in electrical power systems of spacecraft and aircraft. In Proceedings of the 20th Innovative Applications of Artificial Intelligence Conference (2008), 1699–1705.
26. Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. and Kolobov, A. BLOG: Probabilistic models with unknown objects. In Proceedings of the International Joint Conference on Artificial Intelligence (2005), 1352–1359.
27. Neapolitan, R. Learning Bayesian Networks. Prentice Hall, Englewood, NJ, 2004.
28. Pearl, J. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the Cognitive Science Society (1985), 329–334.
29. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
30. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
31. Durbin, R., Eddy, S.R., Krogh, A. and Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
32. Smyth, P., Heckerman, D. and Jordan, M. Probabilistic independence networks for hidden Markov probability models. Neural Computation 9, 2 (1997), 227–269.
33. Steyvers, M. and Griffiths, T. Probabilistic topic models. Handbook of Latent Semantic Analysis. T.K. Landauer, D.S. McNamara, S. Dennis, and W. Kintsch, eds. 2007, 427–448.
34. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M.F. and Rother, C. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. Pattern Anal. Mach. Intell. 30, 6 (2008), 1068–1080.
35. Yedidia, J., Freeman, W. and Weiss, Y. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51, 7 (2005), 2282–2312.
36. Zhang, N.L. and Poole, D. A simple approach to Bayesian network computations. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence (1994), 171–178.

Adnan Darwiche ([email protected]) is a professor and former chair of the computer science department at the University of California, Los Angeles, where he also directs the Automated Reasoning Group.

© 2010 ACM 0001-0782/10/1200 $10.00