TRANSCRIPT
THE UNIVERSITY OF EDINBURGH
Probabilistic Reasoning
R&N: 13.2-13.6, 14.1-14.5; P&M: 8.1-8.4, 8.6
Jacques Fleuriot
University of Edinburgh, School of Informatics
Jacques Fleuriot Probabilistic Reasoning, R&N: 13.2-13.6,14.1-14.5; P&M: 8.1-8.4, 8.6 1/71
Probabilities and Bayesian Inference
Note This week’s lectures will review the following at a quick pace:
1. Basic Properties of Probability
2. Bayes’ Rule and Inference
3. Product Probability Spaces
And then cover:
4. Representing Distributions with Bayesian Networks
5. How to Construct the Network
6. Inference using Networks
7. ProbLog: Logic programming with probabilities
Probabilities
Probabilities can be used to model “objectively” the situation in the world.
A person with a cold sneezes (say, at least once a day) with probability 75%.
The objective interpretation is that this is a truly random event: every person with a cold may or may not sneeze, and does so with probability 75%.
Or . . .
Probabilities can model our “subjective” belief. For example: if we know that a person has a cold, then we believe they will sneeze with probability 75%. The probabilities relate to an agent’s state of knowledge and change with new evidence.
We will use the subjective interpretation, though probability theory itself is independent of this choice.
Probabilities: Some Terminology I
Elementary Event: outcome of some experiment, e.g. a coin comes up Heads (H) when tossed.
Sample Space: the SET of possible elementary events, e.g. {H, T}. If the sample space S is finite, we can denote the number of elementary events in S by n(S).
Example: For one throw of a die, the sample space is given by:
S = {1, 2, 3, 4, 5, 6}
and so n(S) = 6
Probabilities: Some Terminology II
Event: a subset of the sample space, i.e. if an event is denoted by E then
E ⊆ S
and also
n(E) ≤ n(S)
Example: Let
Eo be the event “the number is odd” when a die is rolled; then Eo = {1, 3, 5} and n(Eo) = 3
Ee be the event “the number is less than 3” when a die is rolled; then Ee = {1, 2} and n(Ee) = 2
Probabilities: Running Example I
We throw 2 dice (each with 6 sides and uniform construction).
Each elementary event is a pair of numbers (a, b).
The sample space, i.e. the set of elementary events, is
{(1, 1), (1, 2), . . . , (6, 6)}
Events, i.e. subsets of the sample space, include:
E1: outcomes where the first die has value 1.
E2: outcomes where the sum of the two numbers is even.
E3: outcomes where the sum of the two numbers is at least 11.
Events A and B are Mutually Exclusive if A ∩ B = ∅ (the empty set).
All pairs of distinct elementary events are mutually exclusive.
Probabilities: Running Example II
In Example:
Events E1 and E3 are mutually exclusive
Events E1 and E2 are not
Events E2 and E3 are not
Classical Definition of Probability
If the sample space S consists of a finite (but non-zero) number of equally likely outcomes, then the probability of an event E, written P(E), is defined as
P(E) = n(E)/n(S)
This definition satisfies the (Kolmogorov) axioms of probability:
1. P(E) ≥ 0 for any event E, since n(E) ≥ 0 and n(S) > 0
2. P(S) = n(S)/n(S) = 1
3. If A and B are mutually exclusive: P(A ∪ B) = P(A) + P(B)
This is because A ∩ B = ∅, so n(A ∪ B) = n(A) + n(B), and therefore
P(A ∪ B) = n(A ∪ B)/n(S) = (n(A) + n(B))/n(S) = n(A)/n(S) + n(B)/n(S) = P(A) + P(B)
Note: P(A ∩ B) is also denoted by P(A,B) and P(A and B)
Probability Distributions
It is sufficient to specify the probability of elementary events.
Other probabilities can be computed from these.
In the example: each die gives a result 1, 2, 3, 4, 5, 6 with probability 1/6, independently of the other die. So each elementary event has probability 1/36.
P(E1) = P((1, 1)) + . . . + P((1, 6)) = 6/36 = 1/6
P(E2) = P((1, 1)) + P((1, 3)) + P((1, 5)) + P((2, 2)) + P((2, 4)) + P((2, 6)) + . . . = 1/2
P(E3) = P((5, 6)) + P((6, 5)) + P((6, 6)) = 3/36 = 1/12
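These dice probabilities are easy to check by brute-force enumeration of the 36 elementary events. A minimal Python sketch (the helper `p` and the event predicates are illustrative, not part of the slides):

```python
from fractions import Fraction

# Sample space for two fair dice: 36 equally likely pairs (a, b).
S = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def p(event):
    """Classical probability: n(E)/n(S), valid for equally likely outcomes."""
    return Fraction(sum(1 for e in S if event(e)), len(S))

E1 = lambda e: e[0] == 1                  # first die has value 1
E2 = lambda e: (e[0] + e[1]) % 2 == 0     # sum is even
E3 = lambda e: e[0] + e[1] >= 11          # sum is at least 11

print(p(E1), p(E2), p(E3))  # 1/6 1/2 1/12
```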
Properties of Probabilities
We know that P(A) = n(A)/n(S)
Since event A is a subset of the sample space S,
0 ≤ n(A) ≤ n(S), i.e. 0 ≤ n(A)/n(S) ≤ 1
So we have 0 ≤ P(A) ≤ 1
If P(A) = 0 then the event cannot occur, e.g. the intersection of mutually exclusive events A and B, since then P(A ∩ B) = P(∅) = 0
If P(A) = 1 then the event is certain to occur
Properties of Probabilities
Let Ā denote the event “A does not occur”
P(Ā) = n(Ā)/n(S)
     = (n(S) − n(A))/n(S)
     = 1 − n(A)/n(S)
     = 1 − P(A)
So, P(Ā) = 1 − P(A)
Conditional Probability
If A and B are two events and P(B) ≠ 0, then the probability of A given that B has already occurred is written P(A|B) and defined as
P(A|B) = P(A ∩ B)/P(B)
Where does this formula come from? Since we know that event B has already occurred, the sample space for event A given event B is B.
P(A|B) = n(A ∩ B)/n(B)
       = (n(A ∩ B)/n(S)) / (n(B)/n(S))
       = P(A ∩ B)/P(B)
Conditional Probability: Example
P(E1 ∩ E2) = P((1, 1)) + P((1, 3)) + P((1, 5)) = 1/12
P(E2 ∩ E3) = P((6, 6)) = 1/36
P(E1 ∩ E3) = P(∅) = 0
P(E1|E2) = P(E1 ∩ E2)/P(E2) = (1/12)/(1/2) = 1/6
P(E2|E3) = P(E2 ∩ E3)/P(E3) = (1/36)/(1/12) = 1/3
P(E1|E3) = P(E1 ∩ E3)/P(E3) = 0/(1/12) = 0
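The conditional probabilities above can be verified by the same enumeration, this time representing events as sets so that P(A|B) = n(A ∩ B)/n(B) is computed directly (the helper `cond` is illustrative):

```python
from fractions import Fraction

S = [(a, b) for a in range(1, 7) for b in range(1, 7)]
E1 = {e for e in S if e[0] == 1}               # first die is 1
E2 = {e for e in S if (e[0] + e[1]) % 2 == 0}  # sum is even
E3 = {e for e in S if e[0] + e[1] >= 11}       # sum is at least 11

def cond(A, B):
    """P(A|B) = n(A ∩ B)/n(B) for equally likely outcomes."""
    return Fraction(len(A & B), len(B))

print(cond(E1, E2), cond(E2, E3), cond(E1, E3))  # 1/6 1/3 0
```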
More on Conditional Probability
The conditional probability rule is often written and used in the following form:
P(A ∩ B) = P(A|B)P(B)
Since P(A ∩ B) = P(B ∩ A), we also have:
P(A ∩ B) = P(B|A)P(A)
and hence P(A|B)P(B) = P(B|A)P(A)
If A and B are mutually exclusive events then, as P(A ∩ B) = 0 and P(B) ≠ 0 (from the def. of conditional probability), it follows that P(A|B) = 0.
Next, we look at an important result: Bayes’ Theorem
Bayes’ Theorem
We know that P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Rearranging, we get:
P(A|B) = P(A)P(B|A)/P(B)
Dice Example: P(E2|E1) = P(E2)P(E1|E2)/P(E1) = (1/2)(1/6)/(1/6) = 1/2
This will be the basis of our Bayesian inference and learning procedures!
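Bayes’ theorem can be checked numerically for the dice example, using the values derived earlier (a small sketch with exact fractions):

```python
from fractions import Fraction

# Known quantities from the dice example.
p_e1 = Fraction(1, 6)           # P(E1)
p_e2 = Fraction(1, 2)           # P(E2)
p_e1_given_e2 = Fraction(1, 6)  # P(E1|E2)

# Bayes' theorem: P(E2|E1) = P(E2) P(E1|E2) / P(E1).
p_e2_given_e1 = p_e2 * p_e1_given_e2 / p_e1
print(p_e2_given_e1)  # 1/2
```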
Independent Events I
If the occurrence or non-occurrence of an event B does not influence in any way the probability of an event A, then event A is (statistically) independent of event B, and we have:
P(A|B) = P(A)
Since
P(A|B)P(B) = P(B|A)P(A)
⇒ P(A)P(B) = P(B|A)P(A)
⇒ P(B|A) = P(B) also holds
But, we also know that
P(A ∩ B) = P(A|B)P(B)
Independent Events II
So, A and B independent also means that
P(A ∩ B) = P(A)P(B)
In the Dice Example:
Events E1 and E2 are statistically independent
Events E2 and E3 are not
Events E1 and E3 are not
Independence can be used to simplify computations!
Independence: Example
Sometimes we know from the probabilistic experiment thatcertain events are “physically” independent. In this case theyare also statistically independent.
Event E4: outcomes where the second die has value 4.
Since each throw of the die is independent (the first throwdoes not change the probabilities of outcomes of the secondthrow), E1 and E4 are independent. They are also statisticallyindependent.
This implies: P(E1 ∩ E4) = P(E1)P(E4) = 1/36Can be verified using equations as above.
Probability Distributions
A random variable associates a value with each possibleoutcome of an experiment e.g. X = H or X = T for outcomesof tossing a coin.
An elementary event is an assignment of values to allvariables.
A probability distribution describes the probability of everyelementary event and can be represented using a probabilitytable.
A Joint Probability Distribution contains information aboutprobabilities of all variables and the connections betweenthem.
Joint Distribution: Example
Two variables A and B, each of which can take values 0 or 1 (i.e. boolean variables).
Table has one column for each variable
Table has 2× 2 = 4 rows
A B P(A,B)
0 0 v00
0 1 v01
1 0 v10
1 1 v11
P(A = 0,B = 0) = v00
P(A = 1,B = 0) = v10
Particular probabilitiesdenoted by P()
The table is denoted byP(A,B)
Joint Probability Distribution: Another Example
We have 4 variables describing the situation of a person with an allergy to cats:
Cold: the person has a cold
Cat: there is a cat in the vicinity of the person
Allergy: the person is showing an allergic reaction
Sneeze: the person sneezed
Each of these is Boolean: taking values 0 or 1.
The joint distribution can be described in a table of size 2⁴ = 16.
The Joint Distribution
Cold Cat Allergy Sneeze P(Cold ,Cat,Allergy ,Sneeze)
0 0 0 0 0.84645
0 0 0 1 0.0
0 0 1 0 0.004455
0 0 1 1 0.040095
0 1 0 0 0.0018
0 1 0 1 0.0
0 1 1 0 0.00072
0 1 1 1 0.00648
1 0 0 0 0.009405
1 0 0 1 0.084645
1 0 1 0 0.0002475
1 0 1 1 0.0047025
1 1 0 0 0.00002
1 1 0 1 0.00018
1 1 1 0 0.00004
1 1 1 1 0.00076
Marginal Distribution I
This is an induced probability distribution for one or more variables, obtained from a joint distribution.
To compute the marginal distribution of one or more variables, we ignore the other variables.
Example:
A B P(A,B)
0 0 v00
0 1 v01
1 0 v10
1 1 v11
A P(A)
0 v00 + v01
1 v10 + v11
B P(B)
0 v00 + v10
1 v01 + v11
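Summing out can be sketched in a few lines; the joint values below are made-up placeholders standing in for v00, …, v11 (assumed only for illustration):

```python
# A joint distribution P(A, B) stored as a dict keyed by (a, b).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # placeholder values

def marginal(joint, axis):
    """Sum out the other variable: axis=0 gives P(A), axis=1 gives P(B)."""
    out = {}
    for values, p in joint.items():
        out[values[axis]] = out.get(values[axis], 0.0) + p
    return out

pa, pb = marginal(joint, 0), marginal(joint, 1)
print(round(pa[0], 10), round(pa[1], 10))  # 0.3 0.7
print(round(pb[0], 10), round(pb[1], 10))  # 0.4 0.6
```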
Marginal Distribution II
How do we “ignore variables”?
Example: We ignore B when computing the marginal distribution for A by summing it out. This is described generically by:
P(A) = Σ_{v∈{0,1}} P(A, B = v)
Similarly, for the marginal distribution of B, we have:
P(B) = Σ_{v∈{0,1}} P(A = v, B)
Marginal Distribution III
And for particular values:
P(A = 0) = Σ_{v=0}^{1} P(A = 0 and B = v)
Dice Example:
P(X1 = 1) = Σ_{v=1}^{6} P(X1 = 1 and X2 = v) = 1/6
Allergy Example:
P(Cold = 1) = Σ . . . = 0.1
P(Cold = 0) = Σ . . . = 0.9
P(Sneeze = 1) = Σ . . . = 0.1368625
Marginal Distribution: Another Example
C A B P(A,B|C )
0 0 0 v000
0 0 1 v001
0 1 0 v010
0 1 1 v011
1 0 0 v100
1 0 1 v101
1 1 0 v110
1 1 1 v111
P(A,B|C )
Sum out B:
C A P(A|C )
0 0 v000 + v001
0 1 v010 + v011
1 0 v100 + v101
1 1 v110 + v111
P(A|C ) = Σ_v P(A,B = v |C )
Inference using the Joint
Compute probabilities of events:
P(Cold = 1 and Sneeze = 1) = Σ . . . = 0.0902875
Causal Inference:
P(Sneeze = 1|Cold = 1) = 0.0902875/0.1 = 0.902875
Diagnostic Inference:
P(Cold = 1|Sneeze = 1) = 0.0902875/0.1368625 = 0.66
Inter-Causal Inference:
P(Cold = 1|Sneeze = 1 and Allergy = 1)
Normalization
P(Cold = 1|Sneeze = 1) = P(Sneeze = 1|Cold = 1)P(Cold = 1)/P(Sneeze = 1) = A/α
P(Cold = 0|Sneeze = 1) = P(Sneeze = 1|Cold = 0)P(Cold = 0)/P(Sneeze = 1) = B/α
But A/α + B/α = 1, so α = A + B and
P(Cold = 1|Sneeze = 1) = A/(A + B)
We will often use normalization to avoid computing the denominator.
Conditional Independence
Events A and B are statistically independent given event C iff
P(A ∩ B|C ) = P(A|C )P(B|C )
This is equivalent to the condition P(A|B ∩ C ) = P(A|C )
As in standard independence this can simplify thecomputations.
Event E5: outcomes where at least one die has value 6.
P(E2|E3) = 1/3
P(E2|E3 and E5) = 1/3
Conditional Independence: Derivation
P(A ∩ B|C ) = P(A ∩ B ∩ C )/P(C )
= P(A|B ∩ C )P(B ∩ C )/P(C )
= P(A|B ∩ C )P(B|C )P(C )/P(C )
= P(A|B ∩ C )P(B|C )
So, P(A|C )P(B|C ) = P(A|B ∩ C )P(B|C )
⇒ P(A|B ∩ C ) = P(A|C )
Independence in Joint Probability Distributions
In joint probability distributions we can express a more generalform of independence.
X1 and X2 are independent iff P(X1|X2) = P(X1)
This means that for all v1 and v2
P(X1 = v1|X2 = v2) = P(X1 = v1)
And similarly for conditional independenceX1 and X2 are independent given X3 iffP(X1|X2,X3) = P(X1|X3)
This means that for all v1, v2, v3
P(X1 = v1|X2 = v2,X3 = v3) = P(X1 = v1|X3 = v3)
Product Probability Spaces
We have n variables, X1, . . . ,Xn
Each variable Xi ranges over a finite set of values v_{i,1}, . . . , v_{i,k_i}.
We can write a big table with n columns and ∏_i k_i rows describing the probability of every elementary event.
The table grows exponentially with n: not feasible unless n is very small.
Bayesian Networks
Allow us to represent distributions more compactly.
Take advantage of the structure available in a domain.
Basic idea: represent dependence and independence explicitly.
If P(X1 ∩ X2) = P(X1)P(X2) then we can use two 1-dimensional tables instead of a 2-dimensional table.
If each has 6 values, this means 12 entries instead of 36!
Example
Edges represent “direct influence”.
Assume that Cat and Cold do not depend on other variables.
Allergy depends only on Cat.
Sneeze depends on Cold and Allergy.
Sneeze depends on Cat, BUT only through Allergy.
For each node we associate a conditional probability table.
[Figure: network with edges Cat → Allergy, Allergy → Sneeze, Cold → Sneeze]
Example II
The network structure expresses independence of variables.
P(Cat|Cold) = P(Cat)
P(Allergy |Cat,Cold) = P(Allergy |Cat)
P(Sneeze|Allergy ,Cat,Cold) = P(Sneeze|Allergy ,Cold)
More generally, the joint distribution can be expressed as the product of the distributions in the network. Example:
P(Cat,Cold ,Allergy ,Sneeze) = P(Cat)P(Cold)P(Allergy |Cat)P(Sneeze|Allergy ,Cold)
Example III
P(Cat = 0) P(Cat = 1)
0.99 0.01

P(Cold = 0) P(Cold = 1)
0.90 0.10

Cat P(Allergy = 0) P(Allergy = 1)
0 0.95 0.05
1 0.20 0.80

Cold Allergy P(Sneeze = 0) P(Sneeze = 1)
0 0 1.00 0.00
0 1 0.10 0.90
1 0 0.10 0.90
1 1 0.05 0.95
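The product rule turns these four CPTs into the full joint distribution. A sketch in Python (the dictionary encodings of the CPTs are ours; the values are those on the slide):

```python
from itertools import product

p_cat1 = 0.01                             # P(Cat = 1)
p_cold1 = 0.10                            # P(Cold = 1)
p_allergy1 = {0: 0.05, 1: 0.80}           # P(Allergy = 1 | Cat)
p_sneeze1 = {(0, 0): 0.00, (0, 1): 0.90,  # P(Sneeze = 1 | Cold, Allergy)
             (1, 0): 0.90, (1, 1): 0.95}

def joint(cat, cold, allergy, sneeze):
    """P(Cat)P(Cold)P(Allergy|Cat)P(Sneeze|Allergy,Cold)."""
    pc = p_cat1 if cat else 1 - p_cat1
    pd = p_cold1 if cold else 1 - p_cold1
    pa = p_allergy1[cat] if allergy else 1 - p_allergy1[cat]
    ps = p_sneeze1[(cold, allergy)] if sneeze else 1 - p_sneeze1[(cold, allergy)]
    return pc * pd * pa * ps

print(round(joint(0, 1, 0, 1), 6))  # P(Cat=0,Cold=1,Allergy=0,Sneeze=1) = 0.084645
# Sanity check: the sixteen entries sum to 1.
assert abs(sum(joint(*e) for e in product([0, 1], repeat=4)) - 1.0) < 1e-9
```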
Example: Joint Distribution in Terms of CPDs I
We can use the conditional probability rule (in its various forms)and Bayes’ Theorem to express Joint Distributions in terms ofconditional probability distributions (stored with nodes in Bayesiannetwork as Conditional Probability Tables (CPTs)).
Example: Express P(Cat,Cold ,Allergy ,Sneeze) in terms of knownconditional probability distributions (tables):
P(Cat),P(Cold),P(Allergy |Cat),P(Sneeze|Allergy ,Cold)
Example: Joint Distribution in Terms of CPDs II
P(Cat,Cold ,Allergy ,Sneeze)
= P(Sneeze|Allergy ,Cat,Cold)P(Allergy ,Cat,Cold)
by conditional probability rule applied to distributions
= P(Sneeze|Allergy ,Cold)P(Allergy |Cat,Cold)P(Cat,Cold)
by conditional probability rule applied to distributions
= P(Sneeze|Allergy ,Cold)P(Allergy |Cat)P(Cat|Cold)P(Cold)
since Cat and Cold are independent
= P(Sneeze|Allergy ,Cold)P(Allergy |Cat)P(Cat)P(Cold)
Another Useful Result
P(A,B|C ) = P(A|B,C )P(B|C ) (†)
Proof:
RHS = P(A|B,C )P(B|C ) = [P(A,B,C )/P(B,C )] · [P(B,C )/P(C )]
= P(A,B,C )/P(C )
= P(A,B|C )P(C )/P(C )
= P(A,B|C ) = LHS
Also:
P(A = a,B = b|C = c) = P(A = a|B = b,C = c)P(B = b|C = c)
Bayesian Networks I
We have n variables, X1, . . . ,Xn
The graph has n nodes, each corresponding to one variable.
The graph is acyclic; there is no directed path from Xi toitself.
For each node associate a conditional probability table (CPT)describing P(Xi | parents(Xi )).
The joint probability distribution can be expressed as theproduct of the distributions in the network:
P(X1,X2, . . . ,Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi ))
Bayesian Networks II
P(Cat = 0,Cold = 1,Allergy = 0,Sneeze = 1)
= P(Cat = 0)P(Cold = 1)P(Allergy = 0|Cat = 0)P(Sneeze = 1|Allergy = 0,Cold = 1)
= 0.99 · 0.1 · 0.95 · 0.9 = 0.084645
So to represent a distribution we need to represent thenetwork and CPTs.
If for all nodes the number of parents is small then all CPTsare small and we have a compact representation.
How to Construct a Network? I
We can represent any distribution using a network, by repeated application of P(A,B) = P(A|B)P(B):
Choose an ordering of the variables X1, . . . ,Xn.
For i = 1 to n:
Add Xi to the network with P(Xi |X1, . . . ,Xi−1).
The joint distribution is: P(X1,X2, . . . ,Xn) = ∏_{i=1}^{n} P(Xi |X1, . . . ,Xi−1).
But this is no improvement, as Xn is connected to all its predecessors (so we need a huge table for it).
How to Construct a Network? II
Instead, at each stage choose a subset such thatparents(Xi ) ⊆ {X1, . . . ,Xi−1}.
Choice of parents must satisfyP(Xi |X1, . . . ,Xi−1) = P(Xi | parents(Xi ))
so that Xi is independent of other predecessors given itsparents.
If parents(Xi ) is small then the representation is compact.
For example, if for all Xi , | parents(Xi )| ≤ 3, then instead of 2ⁿ entries we have n · 2³ = 8n entries!
How should we order the variables?
Causal links tend to produce small representations.
Choose “root causes” first. Then continue with causalstructure as much as possible.
Example
You are at work; neighbour John calls to say your home alarm is ringing, but neighbour Mary doesn’t call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables and ordering:
Burglar , Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects “causal” knowledge:
The Network
[Figure: the burglary network. Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B) = .001    P(E) = .002

B E P(A)
t t .95
t f .94
f t .29
f f .001

A P(J)
t .90
f .05

A P(M)
t .70
f .01
What if we choose another ordering?
Variable ordering (1):
MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
Variable ordering (2):
MaryCalls, JohnCalls, Earthquake, Burglary, Alarm
[Figure: (a) the network for ordering (1); (b) the network for ordering (2)]
How to Construct a Network
May work with domain experts to decide on structure andthen find values of probabilities.
Choosing a “bad order” can have adverse effect:
Large probability tables.
“Unnatural” dependencies that are in turn hard to estimate.
Luckily experts are often good at identifying causal structure,and prefer giving probability judgements for causal rules.
Learning the CPTs and even the structure is possible.
Computing with Bayes Nets
We have seen how to reconstruct the joint distribution fromthe network.
Can we compute other probabilities efficiently?
P(Cold = 1 and Sneeze = 1) = ?
P(Sneeze = 1|Cold = 1) = ?
P(Cold = 1|Sneeze = 1) = ?
P(Cold = 1|Sneeze = 1 and Allergy = 1) = ?
Computing a Marginal Distribution I
We want to compute P(Cold ,Sneeze).
By conditional probability rule:
P(Cold , Sneeze) = P(Sneeze|Cold)P(Cold)
Recall that we have the following CPTs attached to nodes of ournetwork:
P(Cat) P(Cold)
P(Allergy |Cat) P(Sneeze|Allergy ,Cold)
Computing a Marginal Distribution II
We need to compute P(Sneeze|Cold):
P(Sneeze|Cold)
= Σ_{v∈{0,1}} P(Sneeze,Allergy = v |Cold)
by def. of marginal distribution (we sum Allergy out)
= Σ_{v∈{0,1}} P(Sneeze|Allergy = v ,Cold)P(Allergy = v |Cold)
by result (†)
= Σ_{v∈{0,1}} P(Sneeze|Allergy = v ,Cold)P(Allergy = v)
since Allergy is independent of Cold
We can read out P(Sneeze|Allergy = v ,Cold) for all v using our CPT for P(Sneeze|Allergy ,Cold).
Computing a Marginal Distribution III
We now need to compute the marginal distribution P(Allergy).
P(Allergy = v)
= Σ_{v2∈{0,1}} P(Allergy = v ,Cat = v2)
by def. of marginal distribution (we sum Cat out)
= Σ_{v2∈{0,1}} P(Allergy = v |Cat = v2)P(Cat = v2)
by the conditional probability rule
Computing a Marginal Distribution IV
So, we have now expressed all computations in terms of the CPTs in the network. We can write the computation fully as:
P(Cold ,Sneeze) = (Σ_{v∈{0,1}} P(Sneeze|Allergy = v ,Cold) Σ_{v2∈{0,1}} P(Allergy = v |Cat = v2)P(Cat = v2)) P(Cold)
In general, “sum out” variables that do not appear in the query. Try to maintain small tables along the way.
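Mirroring the two summations above, the whole computation fits in a few lines of Python (a sketch; the variable names are illustrative):

```python
p_cat1, p_cold1 = 0.01, 0.10
p_allergy1 = {0: 0.05, 1: 0.80}           # P(Allergy = 1 | Cat)
p_sneeze1 = {(0, 0): 0.00, (0, 1): 0.90,  # P(Sneeze = 1 | Cold, Allergy)
             (1, 0): 0.90, (1, 1): 0.95}

# P(Allergy = 1): sum Cat out.
p_a1 = p_allergy1[1] * p_cat1 + p_allergy1[0] * (1 - p_cat1)

# P(Sneeze = 1 | Cold = 1): sum Allergy out.
p_s1_given_c1 = p_sneeze1[(1, 1)] * p_a1 + p_sneeze1[(1, 0)] * (1 - p_a1)

print(round(p_a1, 6))                     # 0.0575
print(round(p_s1_given_c1 * p_cold1, 7))  # P(Cold=1, Sneeze=1) = 0.0902875
```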
Resulting Distributions
P(Allergy = 0) P(Allergy = 1)
0.9425 0.0575

Cold P(Sneeze = 0) P(Sneeze = 1)
0 0.94825 0.05175
1 0.097125 0.902875

Cold Sneeze P(Sneeze,Cold)
0 0 0.853425
0 1 0.046575
1 0 0.0097125
1 1 0.0902875
Causal Inference
To compute P(Sneeze = 1|Cold = 1)
First compute P(Allergy) as in previous example.
Then compute
P(Sneeze = 1|Cold = 1) = Σ_{v∈{0,1}} P(Sneeze = 1|Allergy = v ,Cold = 1)P(Allergy = v)
Diagnostic Inference
Use Bayes’ Rule to compute P(Cold = 1|Sneeze = 1)
A1 = P(Cold = 1|Sneeze = 1) = P(Sneeze = 1|Cold = 1)P(Cold = 1)/P(Sneeze = 1) = N1/P(Sneeze = 1)
A0 = P(Cold = 0|Sneeze = 1) = P(Sneeze = 1|Cold = 0)P(Cold = 0)/P(Sneeze = 1) = N0/P(Sneeze = 1)
But A0 + A1 = 1, so
P(Cold |Sneeze = 1) = (P(Cold = 0|Sneeze = 1),P(Cold = 1|Sneeze = 1)) = (N0/(N0 + N1), N1/(N0 + N1)), using normalization.
Diagnostic Inference
From the CPTs we have: P(Cold = 0) = 0.90 and P(Cold = 1) = 0.10
P(Sneeze = 1|Cold = 0) = 0.05175 and P(Sneeze = 1|Cold = 1) = 0.902875, computed as in the previous example using P(Allergy)
N0 = 0.05175 · 0.90 = 0.046575
N1 = 0.902875 · 0.10 = 0.0902875
P(Cold = 1|Sneeze = 1) = N1/(N0 + N1) = 0.66
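A quick numeric check of this normalization step (a sketch using the slide's values for N0 and N1):

```python
n0 = 0.05175 * 0.90    # P(Sneeze=1 | Cold=0) P(Cold=0)
n1 = 0.902875 * 0.10   # P(Sneeze=1 | Cold=1) P(Cold=1)

# Normalize: the denominator P(Sneeze = 1) is never computed explicitly.
posterior = n1 / (n0 + n1)  # P(Cold = 1 | Sneeze = 1)
print(round(posterior, 2))  # 0.66
```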
Inference in Burglary Example
Use B,E ,A, J,M to denote variables
Compute P(A = 1|E = 1, J = 1)
P(A = 1|E = 1, J = 1) = P(J = 1|A = 1,E = 1)P(A = 1|E = 1)/P(J = 1|E = 1)
By computing and normalising P(J = 1|A = b,E = 1)P(A = b|E = 1) for b ∈ {0, 1}:
P(A = 1|E = 1) = 0.001 · 0.95 + 0.999 · 0.29 = 0.29066
P(A = 0|E = 1) = 0.70934
P(J = 1|A = 1,E = 1)P(A = 1|E = 1) = 0.90 · 0.29066 = 0.261594
P(J = 1|A = 0,E = 1)P(A = 0|E = 1) = 0.05 · 0.70934 = 0.035467
P(A = 1|E = 1, J = 1) = 0.261594/(0.261594 + 0.035467) = 0.88
Exercise: Work out the details of all the computations.
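As a partial check on the exercise, the figures above can be reproduced in a few lines (a sketch; the CPT values are those of the burglary network):

```python
p_b1 = 0.001                  # P(B = 1)
p_a1_e1 = {0: 0.29, 1: 0.95}  # P(A = 1 | E = 1, B = b)
p_j1 = {0: 0.05, 1: 0.90}     # P(J = 1 | A = a)

# P(A = 1 | E = 1): sum B out (B and E are independent).
p_a1 = p_a1_e1[1] * p_b1 + p_a1_e1[0] * (1 - p_b1)

# Normalize P(J = 1 | A = a, E = 1) P(A = a | E = 1) over a.
num = p_j1[1] * p_a1
den = num + p_j1[0] * (1 - p_a1)
print(round(p_a1, 5))       # 0.29066
print(round(num / den, 2))  # 0.88
```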
Inference in Burglary Example: Derivation I
P(A = 1|E = 1, J = 1)
= P(A = 1,E = 1, J = 1)/P(J = 1,E = 1)
= P(J = 1|A = 1,E = 1)P(A = 1,E = 1)/P(J = 1,E = 1)
= P(J = 1|A = 1,E = 1)P(A = 1|E = 1)P(E = 1) / (P(J = 1|E = 1)P(E = 1))
= P(J = 1|A = 1,E = 1)P(A = 1|E = 1)/P(J = 1|E = 1)
Inference in Burglary Example: Derivation II
P(A = 1|E = 1)
= P(A = 1,E = 1)/P(E = 1)
by def. of marginal distribution:
= Σ_{v∈{0,1}} P(A = 1,E = 1,B = v) / P(E = 1)
= Σ_{v∈{0,1}} P(A = 1|E = 1,B = v)P(E = 1,B = v) / P(E = 1)
since E (earthquake) and B (burglary) are independent:
= Σ_{v∈{0,1}} P(A = 1|E = 1,B = v)P(E = 1)P(B = v) / P(E = 1)
Inference in Burglary Example: Derivation III
= Σ_{v∈{0,1}} P(A = 1|E = 1,B = v)P(B = v)
by expanding the summation:
= P(A = 1|E = 1,B = 0)P(B = 0) + P(A = 1|E = 1,B = 1)P(B = 1)
using probabilities from the CPTs in the network:
= (0.29)(0.999) + (0.95)(0.001) = 0.29066
Inference
In general the situation is more complicated than in oursimple network. The problem is NP-Hard.
Can always compute via the joint but this may not be efficient.
Efficient algorithms are known for graphs with polytree structure. These are singly connected networks: there are no cycles, even ignoring the direction of edges.
Otherwise, try to turn graph into a tree, or use simulationmethods to approximate the probability.
Simulation is not guaranteed to give good answers but workswell in some cases.
Inference by Simulation
To compute P(X = v1|Y = v2) = P(X = v1,Y = v2)/P(Y = v2):
Repeat many times:
choose random values for all nodes using the CPTs,
choosing first for nodes with no parents, and then for nodes whose parents have already been chosen
if Y = v2 was chosen then:
Ycounter = Ycounter + 1
if X = v1 was chosen then Xcounter = Xcounter + 1
Return Xcounter/Ycounter
May require many rounds if Y = v2 occurs with low probability.
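The loop above can be sketched for the allergy network, estimating P(Cold = 1|Sneeze = 1) by rejection (the seed and sample count are arbitrary choices):

```python
import random

random.seed(0)

P_SNEEZE1 = {(0, 0): 0.00, (0, 1): 0.90,  # P(Sneeze = 1 | Cold, Allergy)
             (1, 0): 0.90, (1, 1): 0.95}

def sample():
    """Forward-sample the network: parents before children."""
    cat = random.random() < 0.01
    cold = random.random() < 0.10
    allergy = random.random() < (0.80 if cat else 0.05)
    sneeze = random.random() < P_SNEEZE1[(cold, allergy)]
    return cold, sneeze

y_count = x_count = 0
for _ in range(200_000):
    cold, sneeze = sample()
    if sneeze:            # evidence Sneeze = 1 holds
        y_count += 1
        if cold:          # query event Cold = 1
            x_count += 1
print(round(x_count / y_count, 2))  # estimate close to the exact 0.66
```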
Simulation by Likelihood Weighting
Improves the simulation by forcing evidence values (Y = v2)to be chosen.
Initialise weight w = 1
In sampling, if a node Xi corresponds to an evidence variable, choose the given value vi .
The current round will not have full weight; instead, w ← w · P(Xi = vi | values chosen for parents)
Example I
Compute P(Allergy = 1|Cold = 1, Sneeze = 1)
Set up general counter CG and counter CA for Allergy = 1(both initialised to zero).
Here we illustrate just one sample
Choose Cold = 1, so w ← 0.1
Choose Cat randomly; in the following we assume 0 was chosen.
Choose Allergy using P(Allergy |Cat = 0); in the following we assume 0 was chosen.
Choose Sneeze = 1 and update w ← w · P(Sneeze = 1|Cold = 1,Allergy = 0) = 0.1 · 0.9 = 0.09
Example II
The round is over. We forced the evidence to succeed, but Allergy = 1 did not occur.
CG ← CG + 0.09
CA ← CA (no change)
After many rounds, return CA/CG.
Faster than simple simulation, but may still need a long time to get good answers.
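Likelihood weighting for this example can be sketched as follows; only Cat and Allergy are actually sampled, since Cold and Sneeze are evidence (seed and sample count are arbitrary choices):

```python
import random

random.seed(1)

P_SNEEZE1 = {(1, 0): 0.90, (1, 1): 0.95}  # P(Sneeze = 1 | Cold = 1, Allergy)

def weighted_sample():
    """One round: fix evidence Cold = 1, Sneeze = 1; weight by its likelihood."""
    w = 1.0
    cat = random.random() < 0.01
    w *= 0.10                            # evidence Cold = 1: P(Cold = 1)
    allergy = random.random() < (0.80 if cat else 0.05)
    w *= P_SNEEZE1[(1, allergy)]         # evidence Sneeze = 1 given parents
    return allergy, w

cg = ca = 0.0                            # general and Allergy = 1 counters
for _ in range(100_000):
    allergy, w = weighted_sample()
    cg += w
    if allergy:
        ca += w
print(round(ca / cg, 3))  # estimate of P(Allergy=1 | Cold=1, Sneeze=1), exact ≈ 0.06
```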
Probabilistic Logic Programming (PLP)
Growing interest in combining probability and logic to dealwith complex relational domains.
Probabilistic logic programs enable some of the facts to beannotated with probabilities i.e. they extend logic programs.
Related field: Statistical Relational Learning, which extendsprobabilistic graphical models like Bayesian networks withlogical and relational representations.
Many inference and learning tasks that use graphical modelscan be expressed as PLP programs.
Example PLP languages include ProbLog, ICL, PRISM, etc.
Problog Syntax in a Nutshell
A ProbLog program consists of two parts: a set of ground probabilistic facts, and a logic program.¹
A ground probabilistic fact, written p::f, is a ground fact f annotated with a probability p.
Probabilistic statements are of the form p::f(X1,X2,...,Xn) :- body, where body is a conjunction of calls to non-probabilistic facts. Such a statement defines the domains of the variables X1, X2, ..., Xn.
An atom that unifies with a ground probabilistic fact is called aprobabilistic atom, while an atom that unifies with the head of somerule in the logic program is called a derived atom.
The set of probabilistic atoms must be disjoint from the set ofderived atoms.
All variables in the head of a rule should also appear in a positiveliteral in the body of the rule.
¹ https://dtai.cs.kuleuven.be/problog/index.html
Burglar Example revisited in Problog
A possible (propositional) ProbLog program for
computing P(A = 1|E = 1, J = 1):
0.01::burglary.
0.02::earthquake.
0.95::alarm :- burglary, earthquake.
0.94::alarm :- burglary, \+ earthquake.
0.29::alarm :- \+ burglary, earthquake.
0.01::alarm :- \+ burglary, \+ earthquake.
0.9::calls_john :- alarm.
0.05::calls_john :- \+alarm.
0.7::calls_mary :- alarm.
0.01::calls_mary :- \+alarm.
evidence(calls_john,true).
evidence(earthquake,true).
query(alarm).
[Figure: the burglary network and its CPTs, as shown earlier]
Note: \+ denotes negation.
Example: A First-Order ProbLog Program
A slightly different (extended) version of the burglary example.
person(john).
person(mary).
0.01::burglary.
0.02::earthquake.
0.95::alarm :- burglary, earthquake.
0.94::alarm :- burglary, \+ earthquake.
0.29::alarm :- \+ burglary, earthquake.
0.01::alarm :- \+ burglary, \+ earthquake.
0.8::calls(X) :- alarm, person(X).
0.1::calls(X) :- \+ alarm, person(X).
evidence(calls(john),true).
evidence(calls(mary),true).
query(burglary).
query(earthquake).
Assume there are N people and each person independently calls the police with a certain probability, depending on the alarm ringing or not: if the alarm rings, the probability of calling is 0.8, otherwise it is 0.1. The case for N = 2 is shown.
Diagnostic: P(Cold = 1|Sneeze = 1)
Our allergy example revisited as ProbLog program:
0.01::cat.
0.10::cold.
0.8::allergy :- cat.
0.05::allergy :- \+ cat.
0.95::sneeze :- cold, allergy.
0.90::sneeze :- \+ cold, allergy.
0.90::sneeze :- cold, \+ allergy.
evidence(sneeze,true).
query(cold).
Question: Why do we need only 3 clauses to capture P(Sneeze|Cold ,Allergy)?
Exercise
A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
85% of the cabs in the city are Green and 15% are Blue.
A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identifies each of the two colours 80% of the time and fails 20% of the time.
1. Express the problem as a belief network and work out theprobability that the cab involved in the accident was Blue.
2. Express the problem as a ProbLog program (you may checkyour answer using the online engine on the ProbLog webpage).
(Problem by D. Kahneman, Thinking Fast and Slow, 2011)