TRANSCRIPT
THE UNIVERSITY OF EDINBURGH
Probabilistic Reasoning
R&N: 13.2-13.6, 14.1-14.5; P&M: 8.1-8.4, 8.6
Jacques Fleuriot
University of Edinburgh, School of Informatics
Jacques Fleuriot Probabilistic Reasoning, R&N: 13.2-13.6,14.1-14.5; P&M: 8.1-8.4, 8.6 1/71
Probabilities and Bayesian Inference
Note This week’s lectures will review the following at a quick pace:
1. Basic Properties of Probability
2. Bayes’ Rule and Inference
3. Product Probability Spaces
And then cover:
4. Representing Distributions with Bayesian Networks
5. How to Construct the Network
6. Inference using Networks
7. ProbLog: Logic programming with probabilities
Probabilities
Probabilities can be used to model “objectively” the situation in the world.
A person with a cold sneezes (say, at least once a day) with probability 75%.
The objective interpretation is that this is a truly random event: every person with a cold may or may not sneeze, and does so with probability 75%.
Or . . .
Probabilities can model our “subjective” belief. For example: if we know that a person has a cold, then we believe they will sneeze with probability 75%. The probabilities relate to an agent’s state of knowledge and change with new evidence.
We will use the subjective interpretation, though probability theory itself is independent of this choice.
Probabilities: Some Terminology I
Elementary Event: outcome of some experiment, e.g. a coin comes up Heads (H) when tossed.
Sample Space: the SET of possible elementary events, e.g. {H, T}. If the sample space S is finite, we can denote the number of elementary events in S by n(S).
Example: For one throw of a die, the sample space is given by:
S = {1, 2, 3, 4, 5, 6}
and so n(S) = 6
Probabilities: Some Terminology II
Event: a subset of the sample space, i.e. if an event is denoted by E then
E ⊆ S
and also
n(E) ≤ n(S)
Example: Let
Eo be the event “the number is odd” when a die is rolled; then Eo = {1, 3, 5} and n(Eo) = 3
Ee be the event “the number is less than 3” when a die is rolled; then Ee = {1, 2} and n(Ee) = 2
Probabilities: Running Example I
We throw 2 dice (each with 6 sides and uniform construction).
Each elementary event is a pair of numbers (a, b).
The sample space, i.e. the set of elementary events, is
{(1, 1), (1, 2), . . . , (6, 6)}
Events, i.e. subsets of the sample space, include:
E1: outcomes where the first die has value 1.
E2: outcomes where the sum of the two numbers is even.
E3: outcomes where the sum of the two numbers is at least 11.
Events A and B are Mutually Exclusive if A ∩ B = ∅ (the empty set).
All pairs of distinct elementary events are mutually exclusive.
Probabilities: Running Example II
In Example:
Events E1 and E3 are mutually exclusive
Events E1 and E2 are not
Events E2 and E3 are not
Classical Definition of Probability
If the sample space S consists of a finite (but non-zero) number of equally likely outcomes, then the probability of an event E, written P(E), is defined as
P(E) = n(E)/n(S)
This definition satisfies the (Kolmogorov) axioms of probability:
1. P(E) ≥ 0 for any event E, since n(E) ≥ 0 and n(S) > 0
2. P(S) = n(S)/n(S) = 1
3. If A and B are mutually exclusive: P(A ∪ B) = P(A) + P(B)
This is because A ∩ B = ∅, so n(A ∪ B) = n(A) + n(B), and therefore
P(A ∪ B) = n(A ∪ B)/n(S) = (n(A) + n(B))/n(S) = n(A)/n(S) + n(B)/n(S) = P(A) + P(B)
Note: P(A ∩ B) is also denoted by P(A,B) and P(A and B)
Probability Distributions
It is sufficient to specify the probability of elementary events.
Other probabilities can be computed from these.
In the example: each die gives a result 1, 2, 3, 4, 5, 6 with probability 1/6, independently of the other die. So each elementary event has probability 1/36.
P(E1) = P((1, 1)) + . . . + P((1, 6)) = 6/36 = 1/6
P(E2) = P((1, 1)) + P((1, 3)) + P((1, 5)) + P((2, 2)) + P((2, 4)) + P((2, 6)) + . . . = 1/2
P(E3) = P((5, 6)) + P((6, 5)) + P((6, 6)) = 3/36 = 1/12
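These dice probabilities are easy to check by brute-force enumeration of the 36 elementary events. A minimal Python sketch (the helper `p` and the event predicates are illustrative, not part of the slides):

```python
from fractions import Fraction

# Sample space for two fair dice: 36 equally likely pairs (a, b).
S = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def p(event):
    """Classical probability: n(E)/n(S), valid for equally likely outcomes."""
    return Fraction(sum(1 for e in S if event(e)), len(S))

E1 = lambda e: e[0] == 1                  # first die has value 1
E2 = lambda e: (e[0] + e[1]) % 2 == 0     # sum is even
E3 = lambda e: e[0] + e[1] >= 11          # sum is at least 11

print(p(E1), p(E2), p(E3))  # 1/6 1/2 1/12
```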
Properties of Probabilities
We know that P(A) = n(A)/n(S)
Since event A is a subset of the sample space S,
0 ≤ n(A) ≤ n(S), i.e. 0 ≤ n(A)/n(S) ≤ 1
So we have 0 ≤ P(A) ≤ 1
If P(A) = 0 then the event cannot occur, e.g. the intersection of mutually exclusive events A and B, since then P(A ∩ B) = P(∅) = 0
If P(A) = 1 then the event is certain to occur
Properties of Probabilities
Let Ā denote the event “A does not occur”
P(Ā) = n(Ā)/n(S)
     = (n(S) − n(A))/n(S)
     = 1 − n(A)/n(S)
     = 1 − P(A)
So, P(Ā) = 1 − P(A)
Conditional Probability
If A and B are two events and P(B) ≠ 0, then the probability of A given that B has already occurred is written P(A|B) and defined as
P(A|B) = P(A ∩ B)/P(B)
Where does this formula come from? Since we know that event B has already occurred, the sample space for event A given event B is B.
P(A|B) = n(A ∩ B)/n(B)
       = (n(A ∩ B)/n(S)) / (n(B)/n(S))
       = P(A ∩ B)/P(B)
Conditional Probability: Example
P(E1 ∩ E2) = P((1, 1)) + P((1, 3)) + P((1, 5)) = 1/12
P(E2 ∩ E3) = P((6, 6)) = 1/36
P(E1 ∩ E3) = P(∅) = 0
P(E1|E2) = P(E1 ∩ E2)/P(E2) = (1/12)/(1/2) = 1/6
P(E2|E3) = P(E2 ∩ E3)/P(E3) = (1/36)/(1/12) = 1/3
P(E1|E3) = P(E1 ∩ E3)/P(E3) = 0/(1/12) = 0
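The conditional probabilities above can be verified by the same enumeration, this time representing events as sets so that P(A|B) = n(A ∩ B)/n(B) is computed directly (the helper `cond` is illustrative):

```python
from fractions import Fraction

S = [(a, b) for a in range(1, 7) for b in range(1, 7)]
E1 = {e for e in S if e[0] == 1}               # first die is 1
E2 = {e for e in S if (e[0] + e[1]) % 2 == 0}  # sum is even
E3 = {e for e in S if e[0] + e[1] >= 11}       # sum is at least 11

def cond(A, B):
    """P(A|B) = n(A ∩ B)/n(B) for equally likely outcomes."""
    return Fraction(len(A & B), len(B))

print(cond(E1, E2), cond(E2, E3), cond(E1, E3))  # 1/6 1/3 0
```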
More on Conditional Probability
The conditional probability rule is often written and used in the following form:
P(A ∩ B) = P(A|B)P(B)
Since P(A ∩ B) = P(B ∩ A), we also have:
P(A ∩ B) = P(B|A)P(A)
and hence P(A|B)P(B) = P(B|A)P(A)
If A and B are mutually exclusive events then, as P(A ∩ B) = 0 and P(B) ≠ 0 (from the def. of conditional probability), it follows that P(A|B) = 0.
Next, we look at an important result: Bayes’ Theorem
Bayes’ Theorem
We know that P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Rearranging, we get:
P(A|B) = P(A)P(B|A)/P(B)
Dice Example: P(E2|E1) = P(E2)P(E1|E2)/P(E1) = (1/2)(1/6)/(1/6) = 1/2
This will be the basis of our Bayesian inference and learning procedures!
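Bayes’ theorem can be checked numerically for the dice example, using the values derived earlier (a small sketch with exact fractions):

```python
from fractions import Fraction

# Known quantities from the dice example.
p_e1 = Fraction(1, 6)           # P(E1)
p_e2 = Fraction(1, 2)           # P(E2)
p_e1_given_e2 = Fraction(1, 6)  # P(E1|E2)

# Bayes' theorem: P(E2|E1) = P(E2) P(E1|E2) / P(E1).
p_e2_given_e1 = p_e2 * p_e1_given_e2 / p_e1
print(p_e2_given_e1)  # 1/2
```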
Independent Events I
If the occurrence or non-occurrence of an event B does not influence in any way the probability of an event A, then event A is (statistically) independent of event B, and we have:
P(A|B) = P(A)
Since
P(A|B)P(B) = P(B|A)P(A)
⇒ P(A)P(B) = P(B|A)P(A)
⇒ P(B|A) = P(B) also holds
But, we also know that
P(A ∩ B) = P(A|B)P(B)
Independent Events II
So, A and B independent also means that
P(A ∩ B) = P(A)P(B)
In the Dice Example:
Events E1 and E2 are statistically independent
Events E2 and E3 are not
Events E1 and E3 are not
Independence can be used to simplify computations!
Independence: Example
Sometimes we know from the probabilistic experiment thatcertain events are “physically” independent. In this case theyare also statistically independent.
Event E4: outcomes where the second die has value 4.
Since each throw of the die is independent (the first throwdoes not change the probabilities of outcomes of the secondthrow), E1 and E4 are independent. They are also statisticallyindependent.
This implies: P(E1 ∩ E4) = P(E1)P(E4) = 1/36Can be verified using equations as above.
Probability Distributions
A random variable associates a value with each possibleoutcome of an experiment e.g. X = H or X = T for outcomesof tossing a coin.
An elementary event is an assignment of values to allvariables.
A probability distribution describes the probability of everyelementary event and can be represented using a probabilitytable.
A Joint Probability Distribution contains information aboutprobabilities of all variables and the connections betweenthem.
Joint Distribution: Example
Two variables A and B, each of which can take values 0 or 1 (i.e. boolean variables).
Table has one column for each variable
Table has 2× 2 = 4 rows
A B P(A,B)
0 0 v00
0 1 v01
1 0 v10
1 1 v11
P(A = 0,B = 0) = v00
P(A = 1,B = 0) = v10
Particular probabilitiesdenoted by P()
The table is denoted byP(A,B)
Joint Probability Distribution: Another Example
We have 4 variables describing the situation of a person with an allergy to cats:
Cold: the person has a cold
Cat: there is a cat in the vicinity of the person
Allergy: the person is showing an allergic reaction
Sneeze: the person sneezed
Each of these is Boolean: taking values 0 or 1.
The joint distribution can be described in a table of size 2⁴ = 16.
The Joint Distribution
Cold Cat Allergy Sneeze P(Cold ,Cat,Allergy ,Sneeze)
0 0 0 0 0.84645
0 0 0 1 0.0
0 0 1 0 0.004455
0 0 1 1 0.040095
0 1 0 0 0.0018
0 1 0 1 0.0
0 1 1 0 0.00072
0 1 1 1 0.00648
1 0 0 0 0.009405
1 0 0 1 0.084645
1 0 1 0 0.0002475
1 0 1 1 0.0047025
1 1 0 0 0.00002
1 1 0 1 0.00018
1 1 1 0 0.00004
1 1 1 1 0.00076
Marginal Distribution I
This is an induced probability distribution for one or more variables, obtained from a joint distribution.
To compute the marginal distribution of one or more variables, we ignore the other variables.
Example:
A B P(A,B)
0 0 v00
0 1 v01
1 0 v10
1 1 v11
A P(A)
0 v00 + v01
1 v10 + v11
B P(B)
0 v00 + v10
1 v01 + v11
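Summing out can be sketched in a few lines; the joint values below are made-up placeholders standing in for v00, …, v11 (assumed only for illustration):

```python
# A joint distribution P(A, B) stored as a dict keyed by (a, b).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # placeholder values

def marginal(joint, axis):
    """Sum out the other variable: axis=0 gives P(A), axis=1 gives P(B)."""
    out = {}
    for values, p in joint.items():
        out[values[axis]] = out.get(values[axis], 0.0) + p
    return out

pa, pb = marginal(joint, 0), marginal(joint, 1)
print(round(pa[0], 10), round(pa[1], 10))  # 0.3 0.7
print(round(pb[0], 10), round(pb[1], 10))  # 0.4 0.6
```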
Marginal Distribution II
How do we “ignore variables”?
Example: We ignore B when computing the marginal distribution for A by summing it out. This is described generically by:
P(A) = Σ_{v∈{0,1}} P(A, B = v)
Similarly, for the marginal distribution of B, we have:
P(B) = Σ_{v∈{0,1}} P(A = v, B)
Marginal Distribution III
And for particular values:
P(A = 0) = Σ_{v=0}^{1} P(A = 0 and B = v)
Dice Example:
P(X1 = 1) = Σ_{v=1}^{6} P(X1 = 1 and X2 = v) = 1/6
Allergy Example:
P(Cold = 1) = Σ . . . = 0.1
P(Cold = 0) = Σ . . . = 0.9
P(Sneeze = 1) = Σ . . . = 0.1368625
Marginal Distribution: Another Example
C A B P(A,B|C )
0 0 0 v000
0 0 1 v001
0 1 0 v010
0 1 1 v011
1 0 0 v100
1 0 1 v101
1 1 0 v110
1 1 1 v111
P(A,B|C )
Sum out B:
C A P(A|C )
0 0 v000 + v001
0 1 v010 + v011
1 0 v100 + v101
1 1 v110 + v111
P(A|C ) = Σ_v P(A,B = v |C )
Inference using the Joint
Compute probabilities of events:
P(Cold = 1 and Sneeze = 1) = Σ . . . = 0.0902875
Causal Inference:
P(Sneeze = 1|Cold = 1) = 0.0902875/0.1 = 0.902875
Diagnostic Inference:
P(Cold = 1|Sneeze = 1) = 0.0902875/0.1368625 = 0.66
Inter-Causal Inference:
P(Cold = 1|Sneeze = 1 and Allergy = 1)
Normalization
P(Cold = 1|Sneeze = 1) = P(Sneeze = 1|Cold = 1)P(Cold = 1)/P(Sneeze = 1) = A/α
P(Cold = 0|Sneeze = 1) = P(Sneeze = 1|Cold = 0)P(Cold = 0)/P(Sneeze = 1) = B/α
But A/α + B/α = 1, so α = A + B and
P(Cold = 1|Sneeze = 1) = A/(A + B)
We will often use normalization to avoid computing the denominator.
Conditional Independence
Events A and B are statistically independent given event C iff
P(A ∩ B|C ) = P(A|C )P(B|C )
This is equivalent to the condition P(A|B ∩ C ) = P(A|C )
As in standard independence this can simplify thecomputations.
Event E5: outcomes where at least one die has value 6.
P(E2|E3) = 1/3
P(E2|E3 and E5) = 1/3
Conditional Independence: Derivation
P(A ∩ B|C ) = P(A ∩ B ∩ C )/P(C )
= P(A|B ∩ C )P(B ∩ C )/P(C )
= P(A|B ∩ C )P(B|C )P(C )/P(C )
= P(A|B ∩ C )P(B|C )
So, P(A|C )P(B|C ) = P(A|B ∩ C )P(B|C )
⇒ P(A|B ∩ C ) = P(A|C )
Independence in Joint Probability Distributions
In joint probability distributions we can express a more generalform of independence.
X1 and X2 are independent iff P(X1|X2) = P(X1)
This means that for all v1 and v2
P(X1 = v1|X2 = v2) = P(X1 = v1)
And similarly for conditional independenceX1 and X2 are independent given X3 iffP(X1|X2,X3) = P(X1|X3)
This means that for all v1, v2, v3
P(X1 = v1|X2 = v2,X3 = v3) = P(X1 = v1|X3 = v3)
Product Probability Spaces
We have n variables, X1, . . . ,Xn
Each variable Xi ranges over a finite set of values v_{i,1}, . . . , v_{i,k_i}.
We can write a big table with n columns and ∏_i k_i rows describing the probability of every elementary event.
The table grows exponentially with n: not feasible unless n is very small.
Bayesian Networks
Allow us to represent distributions more compactly.
Take advantage of the structure available in a domain.
Basic idea: represent dependence and independence explicitly.
If P(X1 ∩ X2) = P(X1)P(X2) then we can use two 1-dimensional tables instead of a 2-dimensional table.
If each has 6 values, this means 12 entries instead of 36!
Example
Edges represent “direct influence”.
Assume that Cat and Cold do not depend on other variables.
Allergy depends only on Cat.
Sneeze depends on Cold and Allergy.
Sneeze depends on Cat, BUT only through Allergy.
For each node we associate a conditional probability table.
[Figure: network with edges Cat → Allergy, Allergy → Sneeze, Cold → Sneeze]
Example II
The network structure expresses independence of variables.
P(Cat|Cold) = P(Cat)
P(Allergy |Cat,Cold) = P(Allergy |Cat)
P(Sneeze|Allergy ,Cat,Cold) = P(Sneeze|Allergy ,Cold)
More generally, the joint distribution can be expressed as the product of the distributions in the network. Example:
P(Cat,Cold ,Allergy ,Sneeze) = P(Cat)P(Cold)P(Allergy |Cat)P(Sneeze|Allergy ,Cold)
Example III
P(Cat = 0) P(Cat = 1)
0.99 0.01

P(Cold = 0) P(Cold = 1)
0.90 0.10

Cat P(Allergy = 0) P(Allergy = 1)
0 0.95 0.05
1 0.20 0.80

Cold Allergy P(Sneeze = 0) P(Sneeze = 1)
0 0 1.00 0.00
0 1 0.10 0.90
1 0 0.10 0.90
1 1 0.05 0.95
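The product rule turns these four CPTs into the full joint distribution. A sketch in Python (the dictionary encodings of the CPTs are ours; the values are those on the slide):

```python
from itertools import product

p_cat1 = 0.01                             # P(Cat = 1)
p_cold1 = 0.10                            # P(Cold = 1)
p_allergy1 = {0: 0.05, 1: 0.80}           # P(Allergy = 1 | Cat)
p_sneeze1 = {(0, 0): 0.00, (0, 1): 0.90,  # P(Sneeze = 1 | Cold, Allergy)
             (1, 0): 0.90, (1, 1): 0.95}

def joint(cat, cold, allergy, sneeze):
    """P(Cat)P(Cold)P(Allergy|Cat)P(Sneeze|Allergy,Cold)."""
    pc = p_cat1 if cat else 1 - p_cat1
    pd = p_cold1 if cold else 1 - p_cold1
    pa = p_allergy1[cat] if allergy else 1 - p_allergy1[cat]
    ps = p_sneeze1[(cold, allergy)] if sneeze else 1 - p_sneeze1[(cold, allergy)]
    return pc * pd * pa * ps

print(round(joint(0, 1, 0, 1), 6))  # P(Cat=0,Cold=1,Allergy=0,Sneeze=1) = 0.084645
# Sanity check: the sixteen entries sum to 1.
assert abs(sum(joint(*e) for e in product([0, 1], repeat=4)) - 1.0) < 1e-9
```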
Example: Joint Distribution in Terms of CPDs I
We can use the conditional probability rule (in its various forms)and Bayes’ Theorem to express Joint Distributions in terms ofconditional probability distributions (stored with nodes in Bayesiannetwork as Conditional Probability Tables (CPTs)).
Example: Express P(Cat,Cold ,Allergy ,Sneeze) in terms of knownconditional probability distributions (tables):
P(Cat),P(Cold),P(Allergy |Cat),P(Sneeze|Allergy ,Cold)
Example: Joint Distribution in Terms of CPDs II
P(Cat,Cold ,Allergy ,Sneeze)
= P(Sneeze|Allergy ,Cat,Cold)P(Allergy ,Cat,Cold)
by conditional probability rule applied to distributions
= P(Sneeze|Allergy ,Cold)P(Allergy |Cat,Cold)P(Cat,Cold)
by conditional probability rule applied to distributions
= P(Sneeze|Allergy ,Cold)P(Allergy |Cat)P(Cat|Cold)P(Cold)
since Cat and Cold are independent
= P(Sneeze|Allergy ,Cold)P(Allergy |Cat)P(Cat)P(Cold)
Another Useful Result
P(A,B|C ) = P(A|B,C )P(B|C ) (†)
Proof:
RHS = P(A|B,C )P(B|C ) = [P(A,B,C )/P(B,C )] · [P(B,C )/P(C )]
= P(A,B,C )/P(C )
= P(A,B|C )P(C )/P(C )
= P(A,B|C ) = LHS
Also:
P(A = a,B = b|C = c) = P(A = a|B = b,C = c)P(B = b|C = c)
Bayesian Networks I
We have n variables, X1, . . . ,Xn
The graph has n nodes, each corresponding to one variable.
The graph is acyclic; there is no directed path from Xi toitself.
For each node associate a conditional probability table (CPT)describing P(Xi | parents(Xi )).
The joint probability distribution can be expressed as theproduct of the distributions in the network:
P(X1,X2, . . . ,Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi ))
Bayesian Networks II
P(Cat = 0,Cold = 1,Allergy = 0,Sneeze = 1)
= P(Cat = 0)P(Cold = 1)P(Allergy = 0|Cat = 0)P(Sneeze = 1|Allergy = 0,Cold = 1)
= 0.99 · 0.1 · 0.95 · 0.9 = 0.084645
So to represent a distribution we need to represent thenetwork and CPTs.
If for all nodes the number of parents is small then all CPTsare small and we have a compact representation.
How to Construct a Network? I
We can represent any distribution using a network, by repeated application of P(A,B) = P(A|B)P(B):
Choose an ordering of the variables X1, . . . ,Xn.
For i = 1 to n:
Add Xi to the network with P(Xi |X1, . . . ,Xi−1).
The joint distribution is: P(X1,X2, . . . ,Xn) = ∏_{i=1}^{n} P(Xi |X1, . . . ,Xi−1).
But this is no improvement, as Xn is connected to all its predecessors (so we need a huge table for it).
How to Construct a Network? II
Instead, at each stage choose a subset such thatparents(Xi ) ⊆ {X1, . . . ,Xi−1}.
Choice of parents must satisfyP(Xi |X1, . . . ,Xi−1) = P(Xi | parents(Xi ))
so that Xi is independent of other predecessors given itsparents.
If parents(Xi ) is small then the representation is compact.
For example, if for all Xi , | parents(Xi )| ≤ 3, then instead of 2ⁿ entries we have n · 2³ = 8n entries!
How should we order the variables?
Causal links tend to produce small representations.
Choose “root causes” first. Then continue with causalstructure as much as possible.
Example
You are at work; neighbour John calls to say your home alarm is ringing, but neighbour Mary doesn’t call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables and ordering:
Burglar , Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects “causal” knowledge:
The Network
[Figure: the burglary network. Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B) = .001    P(E) = .002

B E P(A)
t t .95
t f .94
f t .29
f f .001

A P(J)
t .90
f .05

A P(M)
t .70
f .01
What if we choose another ordering?
Variable ordering (1):
MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
Variable ordering (2):
MaryCalls, JohnCalls, Earthquake, Burglary, Alarm
[Figure: (a) the network for ordering (1); (b) the network for ordering (2)]
How to Construct a Network
May work with domain experts to decide on structure andthen find values of probabilities.
Choosing a “bad order” can have adverse effect:
Large probability tables.
“Unnatural” dependencies that are in turn hard to estimate.
Luckily experts are often good at identifying causal structure,and prefer giving probability judgements for causal rules.
Learning the CPTs and even the structure is possible.
Computing with Bayes Nets
We have seen how to reconstruct the joint distribution fromthe network.
Can we compute other probabilities efficiently?
P(Cold = 1 and Sneeze = 1) = ?
P(Sneeze = 1|Cold = 1) = ?
P(Cold = 1|Sneeze = 1) = ?
P(Cold = 1|Sneeze = 1 and Allergy = 1) = ?
Computing a Marginal Distribution I
We want to compute P(Cold ,Sneeze).
By conditional probability rule:
P(Cold , Sneeze) = P(Sneeze|Cold)P(Cold)
Recall that we have the following CPTs attached to nodes of ournetwork:
P(Cat) P(Cold)
P(Allergy |Cat) P(Sneeze|Allergy ,Cold)
Computing a Marginal Distribution II
We need to compute P(Sneeze|Cold):
P(Sneeze|Cold)
= Σ_{v∈{0,1}} P(Sneeze,Allergy = v |Cold)
by def. of marginal distribution (we sum Allergy out)
= Σ_{v∈{0,1}} P(Sneeze|Allergy = v ,Cold)P(Allergy = v |Cold)
by result (†)
= Σ_{v∈{0,1}} P(Sneeze|Allergy = v ,Cold)P(Allergy = v)
since Allergy is independent of Cold
We can read out P(Sneeze|Allergy = v ,Cold) for all v using our CPT for P(Sneeze|Allergy ,Cold).
Computing a Marginal Distribution III
We now need to compute the marginal distribution P(Allergy).
P(Allergy = v)
= Σ_{v2∈{0,1}} P(Allergy = v ,Cat = v2)
by def. of marginal distribution (we sum Cat out)
= Σ_{v2∈{0,1}} P(Allergy = v |Cat = v2)P(Cat = v2)
by the conditional probability rule
Computing a Marginal Distribution IV
So, we have now expressed all computations in terms of the CPTs in the network. We can write the computation fully as:
P(Cold ,Sneeze) = (Σ_{v∈{0,1}} P(Sneeze|Allergy = v ,Cold) Σ_{v2∈{0,1}} P(Allergy = v |Cat = v2)P(Cat = v2)) P(Cold)
In general, “sum out” variables that do not appear in the query. Try to maintain small tables along the way.
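Mirroring the two summations above, the whole computation fits in a few lines of Python (a sketch; the variable names are illustrative):

```python
p_cat1, p_cold1 = 0.01, 0.10
p_allergy1 = {0: 0.05, 1: 0.80}           # P(Allergy = 1 | Cat)
p_sneeze1 = {(0, 0): 0.00, (0, 1): 0.90,  # P(Sneeze = 1 | Cold, Allergy)
             (1, 0): 0.90, (1, 1): 0.95}

# P(Allergy = 1): sum Cat out.
p_a1 = p_allergy1[1] * p_cat1 + p_allergy1[0] * (1 - p_cat1)

# P(Sneeze = 1 | Cold = 1): sum Allergy out.
p_s1_given_c1 = p_sneeze1[(1, 1)] * p_a1 + p_sneeze1[(1, 0)] * (1 - p_a1)

print(round(p_a1, 6))                     # 0.0575
print(round(p_s1_given_c1 * p_cold1, 7))  # P(Cold=1, Sneeze=1) = 0.0902875
```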
Resulting Distributions
P(Allergy = 0) P(Allergy = 1)
0.9425 0.0575

Cold P(Sneeze = 0) P(Sneeze = 1)
0 0.94825 0.05175
1 0.097125 0.902875

Cold Sneeze P(Sneeze,Cold)
0 0 0.853425
0 1 0.046575
1 0 0.0097125
1 1 0.0902875
Causal Inference
To compute P(Sneeze = 1|Cold = 1)
First compute P(Allergy) as in previous example.
Then compute
P(Sneeze = 1|Cold = 1) = Σ_{v∈{0,1}} P(Sneeze = 1|Allergy = v ,Cold = 1)P(Allergy = v)
Diagnostic Inference
Use Bayes’ Rule to compute P(Cold = 1|Sneeze = 1)
A1 = P(Cold = 1|Sneeze = 1) = P(Sneeze = 1|Cold = 1)P(Cold = 1)/P(Sneeze = 1) = N1/P(Sneeze = 1)
A0 = P(Cold = 0|Sneeze = 1) = P(Sneeze = 1|Cold = 0)P(Cold = 0)/P(Sneeze = 1) = N0/P(Sneeze = 1)
But A0 + A1 = 1, so
P(Cold |Sneeze = 1) = (P(Cold = 0|Sneeze = 1),P(Cold = 1|Sneeze = 1)) = (N0/(N0 + N1), N1/(N0 + N1)), using normalization.
Diagnostic Inference
From the CPTs we have: P(Cold = 0) = 0.90 and P(Cold = 1) = 0.10
P(Sneeze = 1|Cold = 0) = 0.05175 and P(Sneeze = 1|Cold = 1) = 0.902875, computed as in the previous example using P(Allergy)
N0 = 0.05175 · 0.90 = 0.046575
N1 = 0.902875 · 0.10 = 0.0902875
P(Cold = 1|Sneeze = 1) = N1/(N0 + N1) = 0.66
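A quick numeric check of this normalization step (a sketch using the slide's values for N0 and N1):

```python
n0 = 0.05175 * 0.90    # P(Sneeze=1 | Cold=0) P(Cold=0)
n1 = 0.902875 * 0.10   # P(Sneeze=1 | Cold=1) P(Cold=1)

# Normalize: the denominator P(Sneeze = 1) is never computed explicitly.
posterior = n1 / (n0 + n1)  # P(Cold = 1 | Sneeze = 1)
print(round(posterior, 2))  # 0.66
```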
Inference in Burglary Example
Use B,E ,A, J,M to denote variables
Compute P(A = 1|E = 1, J = 1)
P(A = 1|E = 1, J = 1) = P(J = 1|A = 1,E = 1)P(A = 1|E = 1)/P(J = 1|E = 1)
By computing and normalising P(J = 1|A = b,E = 1)P(A = b|E = 1) for b ∈ {0, 1}:
P(A = 1|E = 1) = 0.001 · 0.95 + 0.999 · 0.29 = 0.29066
P(A = 0|E = 1) = 0.70934
P(J = 1|A = 1,E = 1)P(A = 1|E = 1) = 0.90 · 0.29066 = 0.261594
P(J = 1|A = 0,E = 1)P(A = 0|E = 1) = 0.05 · 0.70934 = 0.035467
P(A = 1|E = 1, J = 1) = 0.261594/(0.261594 + 0.035467) = 0.88
Exercise: Work out the details of all the computations.
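As a partial check on the exercise, the figures above can be reproduced in a few lines (a sketch; the CPT values are those of the burglary network):

```python
p_b1 = 0.001                  # P(B = 1)
p_a1_e1 = {0: 0.29, 1: 0.95}  # P(A = 1 | E = 1, B = b)
p_j1 = {0: 0.05, 1: 0.90}     # P(J = 1 | A = a)

# P(A = 1 | E = 1): sum B out (B and E are independent).
p_a1 = p_a1_e1[1] * p_b1 + p_a1_e1[0] * (1 - p_b1)

# Normalize P(J = 1 | A = a, E = 1) P(A = a | E = 1) over a.
num = p_j1[1] * p_a1
den = num + p_j1[0] * (1 - p_a1)
print(round(p_a1, 5))       # 0.29066
print(round(num / den, 2))  # 0.88
```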
Inference in Burglary Example: Derivation I
P(A = 1|E = 1, J = 1)
= P(A = 1,E = 1, J = 1)/P(J = 1,E = 1)
= P(J = 1|A = 1,E = 1)P(A = 1,E = 1)/P(J = 1,E = 1)
= P(J = 1|A = 1,E = 1)P(A = 1|E = 1)P(E = 1) / (P(J = 1|E = 1)P(E = 1))
= P(J = 1|A = 1,E = 1)P(A = 1|E = 1)/P(J = 1|E = 1)
Inference in Burglary Example: Derivation II
P(A = 1|E = 1)
= P(A = 1,E = 1)/P(E = 1)
by def. of marginal distribution:
= Σ_{v∈{0,1}} P(A = 1,E = 1,B = v) / P(E = 1)
= Σ_{v∈{0,1}} P(A = 1|E = 1,B = v)P(E = 1,B = v) / P(E = 1)
since E (earthquake) and B (burglary) are independent:
= Σ_{v∈{0,1}} P(A = 1|E = 1,B = v)P(E = 1)P(B = v) / P(E = 1)
Inference in Burglary Example: Derivation III
= Σ_{v∈{0,1}} P(A = 1|E = 1,B = v)P(B = v)
by expanding the summation:
= P(A = 1|E = 1,B = 0)P(B = 0) + P(A = 1|E = 1,B = 1)P(B = 1)
using probabilities from the CPTs in the network:
= (0.29)(0.999) + (0.95)(0.001) = 0.29066
Inference
In general the situation is more complicated than in oursimple network. The problem is NP-Hard.
Can always compute via the joint but this may not be efficient.
Efficient algorithms are known for graphs with polytree structure. These are singly connected networks: there are no cycles, even ignoring the direction of edges.
Otherwise, try to turn graph into a tree, or use simulationmethods to approximate the probability.
Simulation is not guaranteed to give good answers but workswell in some cases.
Inference by Simulation
To compute P(X = v1|Y = v2) = P(X = v1,Y = v2)/P(Y = v2):
Repeat many times:
choose random values for all nodes using the CPTs,
choosing first for nodes with no parents, and then for nodes whose parents have already been chosen
if Y = v2 was chosen then:
Ycounter = Ycounter + 1
if X = v1 was chosen then Xcounter = Xcounter + 1
Return Xcounter/Ycounter
May require many rounds if Y = v2 occurs with low probability.
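The loop above can be sketched for the allergy network, estimating P(Cold = 1|Sneeze = 1) by rejection (the seed and sample count are arbitrary choices):

```python
import random

random.seed(0)

P_SNEEZE1 = {(0, 0): 0.00, (0, 1): 0.90,  # P(Sneeze = 1 | Cold, Allergy)
             (1, 0): 0.90, (1, 1): 0.95}

def sample():
    """Forward-sample the network: parents before children."""
    cat = random.random() < 0.01
    cold = random.random() < 0.10
    allergy = random.random() < (0.80 if cat else 0.05)
    sneeze = random.random() < P_SNEEZE1[(cold, allergy)]
    return cold, sneeze

y_count = x_count = 0
for _ in range(200_000):
    cold, sneeze = sample()
    if sneeze:            # evidence Sneeze = 1 holds
        y_count += 1
        if cold:          # query event Cold = 1
            x_count += 1
print(round(x_count / y_count, 2))  # estimate close to the exact 0.66
```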
Simulation by Likelihood Weighting
Improves the simulation by forcing evidence values (Y = v2)to be chosen.
Initialise weight w = 1
In sampling, if a node Xi corresponds to an evidence variable, choose the given value vi .
The current round will not have full weight; instead, w ← w · P(Xi = vi | values chosen for parents)
Example I
Compute P(Allergy = 1|Cold = 1, Sneeze = 1)
Set up general counter CG and counter CA for Allergy = 1(both initialised to zero).
Here we illustrate just one sample
Choose Cold = 1, so w ← 0.1
Choose Cat randomly; in the following we assume 0 was chosen.
Choose Allergy using P(Allergy |Cat = 0); in the following we assume 0 was chosen.
Choose Sneeze = 1 and update w ← w · P(Sneeze = 1|Cold = 1,Allergy = 0) = 0.1 · 0.9 = 0.09
Example II
The round is over. We forced the evidence to succeed, but Allergy = 1 did not occur.
CG ← CG + 0.09
CA ← CA (no change)
After many rounds, return CA/CG.
Faster than simple simulation, but may still need a long time to get good answers.
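Likelihood weighting for this example can be sketched as follows; only Cat and Allergy are actually sampled, since Cold and Sneeze are evidence (seed and sample count are arbitrary choices):

```python
import random

random.seed(1)

P_SNEEZE1 = {(1, 0): 0.90, (1, 1): 0.95}  # P(Sneeze = 1 | Cold = 1, Allergy)

def weighted_sample():
    """One round: fix evidence Cold = 1, Sneeze = 1; weight by its likelihood."""
    w = 1.0
    cat = random.random() < 0.01
    w *= 0.10                            # evidence Cold = 1: P(Cold = 1)
    allergy = random.random() < (0.80 if cat else 0.05)
    w *= P_SNEEZE1[(1, allergy)]         # evidence Sneeze = 1 given parents
    return allergy, w

cg = ca = 0.0                            # general and Allergy = 1 counters
for _ in range(100_000):
    allergy, w = weighted_sample()
    cg += w
    if allergy:
        ca += w
print(round(ca / cg, 3))  # estimate of P(Allergy=1 | Cold=1, Sneeze=1), exact ≈ 0.06
```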
Probabilistic Logic Programming (PLP)
Growing interest in combining probability and logic to dealwith complex relational domains.
Probabilistic logic programs enable some of the facts to beannotated with probabilities i.e. they extend logic programs.
Related field: Statistical Relational Learning, which extendsprobabilistic graphical models like Bayesian networks withlogical and relational representations.
Many inference and learning tasks that use graphical modelscan be expressed as PLP programs.
Example PLP languages include ProbLog, ICL, PRISM, etc.
Problog Syntax in a Nutshell
A ProbLog program consists of two parts: a set of ground probabilistic facts, and a logic program.¹
A ground probabilistic fact, written p::f, is a ground fact f annotated with a probability p.
Probabilistic statements are of the form p::f(X1,X2,...,Xn) :- body, where body is a conjunction of calls to non-probabilistic facts. Such a statement defines the domains of the variables X1, X2, ..., Xn.
An atom that unifies with a ground probabilistic fact is called aprobabilistic atom, while an atom that unifies with the head of somerule in the logic program is called a derived atom.
The set of probabilistic atoms must be disjoint from the set ofderived atoms.
All variables in the head of a rule should also appear in a positiveliteral in the body of the rule.
¹ https://dtai.cs.kuleuven.be/problog/index.html
Burglar Example revisited in Problog
A possible (propositional) ProbLog program for
computing P(A = 1|E = 1, J = 1):
0.01::burglary.
0.02::earthquake.
0.95::alarm :- burglary, earthquake.
0.94::alarm :- burglary, \+ earthquake.
0.29::alarm :- \+ burglary, earthquake.
0.01::alarm :- \+ burglary, \+ earthquake.
0.9::calls_john :- alarm.
0.05::calls_john :- \+alarm.
0.7::calls_mary :- alarm.
0.01::calls_mary :- \+alarm.
evidence(calls_john,true).
evidence(earthquake,true).
query(alarm).
[Figure: the burglary network and its CPTs, as shown earlier]
Note: \+ denotes negation.
Example: A First-Order ProbLog Program
A slightly different (extended) version of the burglary example.
person(john).
person(mary).
0.01::burglary.
0.02::earthquake.
0.95::alarm :- burglary, earthquake.
0.94::alarm :- burglary, \+ earthquake.
0.29::alarm :- \+ burglary, earthquake.
0.01::alarm :- \+ burglary, \+ earthquake.
0.8::calls(X) :- alarm, person(X).
0.1::calls(X) :- \+ alarm, person(X).
evidence(calls(john),true).
evidence(calls(mary),true).
query(burglary).
query(earthquake).
Assume there are N people and each person independently calls the police with a certain probability, depending on the alarm ringing or not: if the alarm rings, the probability of calling is 0.8, otherwise it is 0.1. The case for N = 2 is shown.
Diagnostic: P(Cold = 1|Sneeze = 1)
Our allergy example revisited as ProbLog program:
0.01::cat.
0.10::cold.
0.8::allergy :- cat.
0.05::allergy :- \+ cat.
0.95::sneeze :- cold, allergy.
0.90::sneeze :- \+ cold, allergy.
0.90::sneeze :- cold, \+ allergy.
evidence(sneeze,true).
query(cold).
Question: Why do we need only 3 clauses to capture P(Sneeze|Cold ,Allergy)?
Exercise
A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
85% of the cabs in the city are Green and 15% are Blue.
A witness identified the cab as Blue. The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identifies each of the two colours 80% of the time and fails 20% of the time.
1. Express the problem as a belief network and work out theprobability that the cab involved in the accident was Blue.
2. Express the problem as a ProbLog program (you may checkyour answer using the online engine on the ProbLog webpage).
(Problem by D. Kahneman, Thinking Fast and Slow, 2011)