Directed Graphical Models or Bayesian Networks
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Bayesian Networks
One of the most exciting recent advancements in statistical AI
Compact representation for exponentially-large probability distributions
Fast marginalization algorithm
Exploit conditional independencies
Difference from undirected graphical models!
[Figure: example BN — Flu → Sinus ← Allergy; Sinus → Headache; Sinus → Nose]
Handwriting recognition
Handwriting recognition
[Figure: sequence of character variables X_1, X_2, X_3, X_4, X_5]
Summary: basic concepts for R.V.
Outcome: assign x_1, …, x_n to X_1, …, X_n
Conditional probability: P(X, Y) = P(X) P(Y|X)
Bayes rule: P(Y|X) = P(X|Y) P(Y) / P(X)
Chain rule:
P(X_1, …, X_n) = P(X_1) P(X_2|X_1) ⋯ P(X_n|X_1, …, X_{n−1})
Summary: conditional independence
X is independent of Y given Z if P(X = x | Y = y, Z = z) = P(X = x | Z = z), ∀ x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)
Shorthand:
(𝑋 ⊥ 𝑌 | 𝑍)
For (𝑋 ⊥ 𝑌| ∅), write 𝑋 ⊥ 𝑌
Proposition: (𝑋 ⊥ 𝑌 | 𝑍) if and only if 𝑃(𝑋, 𝑌|𝑍) = 𝑃(𝑋|𝑍)𝑃(𝑌|𝑍)
Representation of Bayesian Networks
Consider P(X_i):
Assign a probability to each x_i ∈ Val(X_i)
Assume |Val(X_i)| = k; how many independent parameters?
k − 1
Consider P(X_1, …, X_n):
How many independent parameters if |Val(X_i)| = k?
k^n − 1
Bayesian networks can represent the same joint probability with fewer parameters
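As a quick check of these counts, a minimal sketch (not from the slides; the function names are mine):

```python
# Independent-parameter counts for discrete distributions.

def single_var_params(k: int) -> int:
    """One k-ary variable needs k - 1 independent parameters
    (the last probability is fixed by summing to one)."""
    return k - 1

def full_table_params(n: int, k: int) -> int:
    """A full joint table over n k-ary variables has k^n entries,
    minus 1 for the sum-to-one constraint."""
    return k**n - 1

print(single_var_params(2))      # 1
print(full_table_params(5, 2))   # 31
print(full_table_params(10, 2))  # 1023 -- grows exponentially with n
```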
Simple Bayesian Networks
What if variables are independent?
What if all variables are independent?
Is it enough to have X_i ⊥ X_j, ∀ i, j?
Not enough!!!
Must assume that 𝒳 ⊥ 𝒴 for all subsets 𝒳, 𝒴 of {X_1, …, X_n}
e.g., {X_1, X_3} ⊥ {X_2, X_4}, X_1 ⊥ {X_7, X_8, X_9}, …
Can write as
P(X_1, …, X_n) = ∏_{i=1…n} P(X_i)
A Bayesian network with no edges
How many independent parameters now?
n(k − 1)
[Figure: BN with nodes X_1, X_2, X_3, …, X_n and no edges]
Conditional parameterization – two nodes
Grade (G) is determined by intelligence (I)
P(I):

| I | VH | H |
|---|------|------|
|   | 0.85 | 0.15 |

P(G|I):

| G \ I | VH | H |
|-------|-----|-----|
| A | 0.9 | 0.5 |
| B | 0.1 | 0.5 |

P(I = VH, G = B) = P(I = VH) P(G = B | I = VH) = 0.85 × 0.1 = 0.085

[Figure: two-node BN, I → G]
Conditional parameterization – three nodes
Grade and SAT score are determined by intelligence
(G ⊥ S | I) ⇔ P(S | G, I) = P(S | I)
P(I, G, S) = P(I) P(G|I) P(S|I), why?
Chain rule: P(I, G, S) = P(I) P(G|I) P(S|G, I)
Using the conditional independence, we get P(I) P(G|I) P(S|I)
[Figure: BN with edges I → G and I → S, and CPTs P(I), P(G|I), P(S|I)]
The naïve Bayes model
Class variable: 𝐶
Evidence variables: 𝑋1, …𝑋𝑛
Assume that (𝒳 ⊥ 𝒴 | C) for all subsets 𝒳, 𝒴 of {X_1, …, X_n}
P(C, X_1, …, X_n) = P(C) ∏_i P(X_i | C), why?
Chain rule: P(C, X_1, …, X_n) = P(C) P(X_1|C) P(X_2|C, X_1) ⋯ P(X_n|C, X_1, …, X_{n−1})
P(X_2|C, X_1) = P(X_2|C) since X_1 ⊥ X_2 | C
P(X_n|C, X_1, …, X_{n−1}) = P(X_n|C) since X_n ⊥ {X_1, …, X_{n−1}} | C
[Figure: naïve Bayes BN, C → X_1, C → X_2, …, C → X_n]
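The naïve Bayes factorization can be checked numerically. A sketch with made-up CPTs for binary C, X_1, X_2 (all numbers are hypothetical):

```python
import itertools

# Hypothetical CPTs for a tiny naive Bayes model (numbers are illustrative).
P_C  = {True: 0.4, False: 0.6}                                          # P(C)
P_X1 = {True: {True: 0.8, False: 0.2}, False: {True: 0.3, False: 0.7}}  # P(X1|C)
P_X2 = {True: {True: 0.5, False: 0.5}, False: {True: 0.1, False: 0.9}}  # P(X2|C)

def joint(c, x1, x2):
    # Factorized joint: P(C) * P(X1|C) * P(X2|C)
    return P_C[c] * P_X1[c][x1] * P_X2[c][x2]

# Verify that X1 and X2 really are conditionally independent given C
# in the joint we built:
for c in (True, False):
    p_c = sum(joint(c, a, b) for a, b in itertools.product((True, False), repeat=2))
    for x1, x2 in itertools.product((True, False), repeat=2):
        lhs  = joint(c, x1, x2) / p_c                                 # P(X1,X2|C)
        p_x1 = sum(joint(c, x1, b) for b in (True, False)) / p_c      # P(X1|C)
        p_x2 = sum(joint(c, a, x2) for a in (True, False)) / p_c      # P(X2|C)
        assert abs(lhs - p_x1 * p_x2) < 1e-12
print("X1 and X2 are conditionally independent given C")
```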
More Complicated Bayesian Networks and an Example
Causal Structure
We will learn the semantics of Bayesian Networks (BNs), relate them to independence assumptions encoded by the graph
Suppose we know the following:
The flu (F) causes sinus inflammation (S)
Allergies (A) cause sinus inflammation
Sinus inflammation causes a runny nose (N)
Sinus inflammation causes headaches (H)
How are these variables connected?
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
Possible queries
Probabilistic Inference
E.g., P(A = t | H = t, N = f)
Most probable explanation
max_{F,A,S} P(F, A, S | H = t, N = t)
Active data collection
What is the next best variable to observe?
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
Car starts BN
18 binary variables
P(A, F, L, …, Spark)
Inference
P(BatteryAge | Starts = f)
= Σ over the remaining variables of P(A, F, L, …, FanBeltAlt)
Sum over 2^16 terms; why is the BN so fast?
Use the sparse graph structure
For the HailFinder BN in JavaBayes:
more than 3^54 = 58149737003040059690390169 terms
Factored joint distribution: preview
P(F, A, S, H, N) = P(F) P(A) P(S|F, A) P(H|S) P(N|S)
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache, with CPTs P(F), P(A), P(S|F, A), P(H|S), P(N|S)]
Number of parameters
The Bayesian network (BN) has 1 + 1 + 4 + 2 + 2 = 10 parameters
The full probability table P(F, A, S, H, N) explicitly has 2^5 − 1 = 31 parameters

Parameters per CPT: P(F): 1, P(A): 1, P(S|F, A): 4, P(H|S): 2, P(N|S): 2

P(S|F, A):

| S \ (F, A) | tt | tf | ft | ff |
|------------|-----|-----|-----|-----|
| t | 0.9 | 0.7 | 0.8 | 0.2 |
| f | 0.1 | 0.3 | 0.2 | 0.8 |

[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache, with CPTs P(F), P(A), P(S|F, A), P(H|S), P(N|S)]
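These counts can be reproduced mechanically: each CPT P(X | Pa_X) contributes (|Val(X)| − 1) parameters per configuration of its parents. A sketch (the function name is mine):

```python
# Parameter count for the flu/allergy network vs. the full joint table.

def cpt_params(k: int, parent_cards: list) -> int:
    """A CPT for a k-ary child needs (k - 1) parameters per
    configuration of its parents."""
    configs = 1
    for c in parent_cards:
        configs *= c
    return (k - 1) * configs

bn_total = (
    cpt_params(2, [])        # P(F): 1
    + cpt_params(2, [])      # P(A): 1
    + cpt_params(2, [2, 2])  # P(S|F,A): 4
    + cpt_params(2, [2])     # P(H|S): 2
    + cpt_params(2, [2])     # P(N|S): 2
)
full_table = 2**5 - 1

print(bn_total, full_table)  # 10 31
```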
Key: Independence assumptions
¬(H ⊥ N)
H ⊥ N | S
¬(A ⊥ N)
A ⊥ N | S
Knowing Sinus separates the symptom variables from each other
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
Marginal independence
Flu and Allergy are (marginally) independent
𝐹 ⊥ 𝐴
𝑃 𝐹, 𝐴 = 𝑃 𝐹 𝑃 𝐴
More generally:
𝒳 ⊥ 𝒴 for all subsets 𝒳, 𝒴 of {X_1, …, X_n}
P(X_1, …, X_n) = ∏_i P(X_i)

P(Flu): Flu = t: 0.1, Flu = f: 0.9
P(Al): Al = t: 0.3, Al = f: 0.7

P(Flu, Al) = P(Flu) P(Al):

|        | Flu = t | Flu = f |
|--------|------------------|------------------|
| Al = t | 0.1 × 0.3 = 0.03 | 0.9 × 0.3 = 0.27 |
| Al = f | 0.1 × 0.7 = 0.07 | 0.9 × 0.7 = 0.63 |

[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
Conditional independence
Flu and headache are not (marginally) independent
¬(F ⊥ H): P(F|H) ≠ P(F), P(F, H) ≠ P(F) P(H)
Flu and headache are independent given sinus infection
F ⊥ H | S: P(F, H | S) = P(F|S) P(H|S)
P(F | H, S) = P(F|S)
More generally:
X_1, X_2, …, X_n are independent of each other given C:
P(X_1, X_2, …, X_n | C) = P(X_1|C) P(X_2, …, X_n | C)
P(X_1, X_2, …, X_n | C) = ∏_i P(X_i | C)
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
The conditional independence assumption
Local Markov Assumption: a variable X is independent of its non-descendants given its parents, and only its parents
X ⊥ NonDescendants_X | Pa_X
Flu: Pa_Flu = ∅, NonDescendants_Flu = {A}
F ⊥ A
Nose: Pa_Nose = {S}, NonDescendants_Nose = {F, A, H}
N ⊥ {F, A, H} | S
Sinus: Pa_Sinus = {F, A}, NonDescendants_Sinus = ∅
No assumption about S ⊥ ??? | F, A
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
Explaining away
Local Markov Assumption: a variable X is independent of its non-descendants given its parents, and only its parents
X ⊥ NonDescendants_X | Pa_X
The local Markov assumption does not imply F ⊥ A | S
XOR: P(A = t) = 0.5, P(F = t) = 0.5, S = F XOR A
S = t, A = t ⇒ F = f,
i.e., P(F = t | S = t, A = t) = 0
E.g., with other CPTs: P(F = t) = 0.2, P(F = t | S = t) = 0.5, P(F = t | S = t, A = t) = 0.3
P(F = t) ≤ P(F = t | S = t, A = t) ≤ P(F = t | S = t)
Knowing A = t lowers the probability of F = t (A = t explains away F = t)
This depends on P(S|F, A); it could instead be that P(F = t | S = t, A = t) ≥ P(F = t | S = t)
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
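The XOR case can be verified by enumeration. A sketch (function names are mine):

```python
# XOR example: P(A=t) = P(F=t) = 0.5, S = F XOR A (deterministic CPT).

def joint(f, a, s):
    p_s = 1.0 if s == (f != a) else 0.0   # P(S=s | F=f, A=a) for the XOR
    return 0.5 * 0.5 * p_s                # P(F) * P(A) * P(S|F,A)

def cond(f, s, a=None):
    """P(F=f | S=s) or P(F=f | S=s, A=a), by brute-force enumeration."""
    tf = (True, False)
    if a is None:
        num = sum(joint(f, aa, s) for aa in tf)
        den = sum(joint(ff, aa, s) for ff in tf for aa in tf)
    else:
        num = joint(f, a, s)
        den = sum(joint(ff, a, s) for ff in tf)
    return num / den

print(cond(True, True))        # P(F=t | S=t)      = 0.5
print(cond(True, True, True))  # P(F=t | S=t, A=t) = 0.0 -- A=t explains S=t away
```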
Chain rule and Local Markov Assumption
Pick a topological ordering of the nodes
Interpret the BN using the chain rule in that particular order
Use the conditional independence assumptions
P(N, H, S, A, F) = P(F) P(A|F) P(S|F, A) P(H|S, F, A) P(N|S, F, A, H)
P(N, H, S, A, F) = P(F) P(A) P(S|F, A) P(H|S) P(N|S)
using A ⊥ F, H ⊥ {F, A} | S, and N ⊥ {F, A, H} | S
Why can we decompose the joint distribution this way?
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache, with topological order F = 1, A = 2, S = 3, H = 4, N = 5]
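The factored joint can be assembled directly from the CPTs. A sketch using P(F), P(A), and P(S|F, A) from the earlier slides, and made-up values for P(H|S) and P(N|S):

```python
import itertools

# Joint from the factorization P(F) P(A) P(S|F,A) P(H|S) P(N|S).
# P(F), P(A), P(S|F,A) come from the lecture's tables; P(H|S) and P(N|S)
# are hypothetical numbers for illustration.
P_F = {True: 0.1, False: 0.9}                   # P(F = t) = 0.1
P_A = {True: 0.3, False: 0.7}                   # P(A = t) = 0.3
P_S = {(True, True): 0.9, (True, False): 0.7,   # P(S = t | F, A)
       (False, True): 0.8, (False, False): 0.2}
P_H = {True: 0.6, False: 0.1}                   # hypothetical P(H = t | S)
P_N = {True: 0.8, False: 0.2}                   # hypothetical P(N = t | S)

def bern(p_true, v):
    """P(V = v) for a binary variable with P(V = t) = p_true."""
    return p_true if v else 1 - p_true

def joint(f, a, s, h, n):
    return (bern(P_F[True], f) * bern(P_A[True], a)
            * bern(P_S[(f, a)], s) * bern(P_H[s], h) * bern(P_N[s], n))

# A valid joint distribution must sum to one over all 2^5 assignments.
total = sum(joint(*v) for v in itertools.product((True, False), repeat=5))
print(round(total, 10))  # 1.0
```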
General Bayesian Networks
Set of random variables 𝑋1, … , 𝑋𝑛
Directed acyclic graph (DAG)
Loops (undirected cycles) are OK
But no directed cycles
Local Markov Assumption:
A variable X is independent of its non-descendants given its parents, and only its parents (X ⊥ NonDescendants_X | Pa_X)
Conditional probability tables (CPTs), P(X_i | Pa_X_i), one for each X_i
Joint distribution: P(X_1, …, X_n) = ∏_i P(X_i | Pa_X_i)
A general Bayes net
What distributions can be represented by a Bayesian network?
Which Bayesian networks can represent a given distribution?
What are the independence assumptions encoded by a BN?
In addition to the local Markov assumption
Question?
[Figure: chain A → B → C → D]
Local Markov: A ⊥ C | B, D ⊥ {A, B} | C
Derived independence: A ⊥ D | B
Conditional independence in the problem
World, Data, Reality: the true distribution P contains conditional independence assertions I(P)
Bayesian Networks: the graph G encodes local independence assumptions I_l(G)
Key representational assumption: I_l(G) ⊆ I(P)
The representation theorem: true conditional independence ⇒ BN factorization
The BN encodes local conditional independence assumptions I_l(G)
If the local conditional independences in the BN are a subset of the conditional independences in P, i.e., I_l(G) ⊆ I(P),
then we obtain that the joint probability P can be written as
P(X_1, …, X_n) = ∏_i P(X_i | Pa_X_i)
The representation theorem: BN factorization ⇒ true conditional independence
The BN encodes local conditional independence assumptions I_l(G)
If the joint probability P can be written as
P(X_1, …, X_n) = ∏_i P(X_i | Pa_X_i),
then we obtain that the local conditional independences in the BN are a subset of the conditional independences in P, i.e., I_l(G) ⊆ I(P)
Example: naïve Bayes, true conditional independence ⇒ BN factorization
Independence assumptions:
X_i are independent of each other given C
That is, (𝒳 ⊥ 𝒴 | C) for all subsets 𝒳, 𝒴 of {X_1, …, X_n}
e.g., {X_1, X_3} ⊥ {X_2, X_4} | C, X_1 ⊥ {X_7, X_8, X_9} | C, …
Same as the local conditional independences
Prove P(C, X_1, …, X_n) = P(C) ∏_i P(X_i | C)
Use the chain rule and the local Markov property: P(C, X_1, …, X_n) = P(C) P(X_1|C) P(X_2|C, X_1) ⋯ P(X_n|C, X_1, …, X_{n−1})
P(X_2|C, X_1) = P(X_2|C) since X_1 ⊥ X_2 | C
P(X_n|C, X_1, …, X_{n−1}) = P(X_n|C) since X_n ⊥ {X_1, …, X_{n−1}} | C
[Figure: naïve Bayes BN, C → X_1, C → X_2, …, C → X_n]
Example: naïve Bayes, BN factorization ⇒ true conditional independence
Assume P(C, X_1, …, X_n) = P(C) ∏_i P(X_i | C)
Prove the independence assumptions:
X_i are independent of each other given C
That is, (𝒳 ⊥ 𝒴 | C) for all subsets 𝒳, 𝒴 of {X_1, …, X_n}
e.g., {X_1, X_3} ⊥ {X_2, X_4} | C, X_1 ⊥ {X_7, X_8, X_9} | C, …
E.g., for n = 4:
P(X_1, X_2 | C) = P(X_1, X_2, C) / P(C) = Σ_{x_3, x_4} P(X_1, X_2, x_3, x_4, C) / P(C)
= (1 / P(C)) Σ_{x_3, x_4} P(C) P(X_1|C) P(X_2|C) P(x_3|C) P(x_4|C)
= P(X_1|C) P(X_2|C) (Σ_{x_3} P(x_3|C)) (Σ_{x_4} P(x_4|C))
= P(X_1|C) P(X_2|C) ⋅ 1 ⋅ 1
[Figure: naïve Bayes BN, C → X_1, C → X_2, …, C → X_n]
How about the general case?
The BN encodes local conditional independence assumptions I_l(G)
One direction: if the local conditional independences in the BN are a subset of those in P, i.e., I_l(G) ⊆ I(P) (this BN is an I-map of P), then the joint probability P can be written as P(X_1, …, X_n) = ∏_i P(X_i | Pa_X_i), i.e., P factorizes according to the BN. Every P has at least one BN structure G.
The other direction: if the joint probability P can be written as P(X_1, …, X_n) = ∏_i P(X_i | Pa_X_i), then I_l(G) ⊆ I(P): we can read independences of P from the BN structure G.
Proof from I-map to factorization
Topological ordering of 𝑋1, 𝑋2, …𝑋𝑛:
Number variables such that
Parent has lower number than child
i.e., X_i → X_j ⇒ i < j
Each variable has a lower number than all of its descendants
Use the chain rule:
P(X_1, …, X_n) = P(X_1) P(X_2|X_1) ⋯ P(X_n|X_1, …, X_{n−1})
Consider P(X_i | X_1, …, X_{i−1}):
Pa_X_i ⊆ {X_1, …, X_{i−1}},
and there is no descendant of X_i in {X_1, …, X_{i−1}}
So by the local Markov assumption (X ⊥ NonDescendants_X | Pa_X):
P(X_i | X_1, …, X_{i−1}) = P(X_i | Pa_X_i)
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache, with topological order F = 1, A = 2, S = 3, H = 4, N = 5]
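Such an ordering can be computed with Kahn's algorithm; a sketch for the flu network (variable names as in the slides):

```python
from collections import deque

# Topological ordering of the flu network via Kahn's algorithm:
# every parent appears before its children, as the proof requires.
children = {"Flu": ["Sinus"], "Allergy": ["Sinus"],
            "Sinus": ["Headache", "Nose"], "Headache": [], "Nose": []}

# In-degree = number of parents still unplaced.
indeg = {v: 0 for v in children}
for v in children:
    for w in children[v]:
        indeg[w] += 1

queue = deque(v for v in children if indeg[v] == 0)  # start with root nodes
order = []
while queue:
    v = queue.popleft()
    order.append(v)
    for w in children[v]:
        indeg[w] -= 1
        if indeg[w] == 0:
            queue.append(w)

print(order)  # ['Flu', 'Allergy', 'Sinus', 'Headache', 'Nose']
```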
Adding edges doesn't hurt
Let G be an I-map for P; then any DAG G′ that includes the same directed edges as G (and possibly more) is also an I-map of P
G′ is strictly more expressive than G
If G is an I-map for P, then adding edges still results in an I-map
E.g., add the edge A → N, so Nose gets the CPT P(N|S, A); setting
P(N|S, A = t) = P(N|S, A = f)
⇒ P(N|S) = P(N|S, A)
⇔ N ⊥ A | S
[Figure: Flu → Sinus ← Allergy; Sinus → Nose; Sinus → Headache]
Minimal I-maps
G is a minimal I-map for P if deleting any edge from G makes it no longer an I-map
E.g., ¬(A ⊥ B), A ⊥ B | C: a minimal I-map is A → C → B; if we remove an edge, it is no longer an I-map
To obtain a minimal I-map given a set of variables and conditional independence assertions:
Choose an ordering on the variables X_1, …, X_n
For i = 1 to n:
Add X_i to the network
Define the parents of X_i, Pa_X_i, in the graph as the minimal subset of nodes such that the local Markov assumption holds
Define/learn the CPT P(X_i | Pa_X_i)
Read conditional independence from BN structure G
Conditional Independence encoded in BN
Local Markov assumption
𝑋 ⊥ 𝑁𝑜𝑛𝐷𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡𝑠𝑋 | 𝑃𝑎𝑋
There are other conditional independencies
E.g., explaining away (a derived independence relation)
What other derived conditional independencies are there?
[Figure: Flu → Sinus ← Allergy: A ⊥ F, but ¬(A ⊥ F | S)]
[Figure: chain A → B → C → D: local Markov gives A ⊥ C | B and D ⊥ {A, B} | C; derived independence A ⊥ D | B]
3-node cases
Causal effect (chain): Allergy → Sinus → Headache: ¬(A ⊥ H), A ⊥ H | S
Common cause: Headache ← Sinus → Nose: ¬(N ⊥ H), N ⊥ H | S
Common effect (V-structure): Flu → Sinus ← Allergy: A ⊥ F, ¬(A ⊥ F | S)
How about other derived relations?
F ⊥ {B, E, G, J}? Yes: F ⊥ {B, E, G, J}
{A, C, F, I} ⊥ {B, E, G, J}? Yes: {A, C, F, I} ⊥ {B, E, G, J}
B ⊥ J | E? Yes: B ⊥ J | E
E ⊥ F | K? No: ¬(E ⊥ F | K)
E ⊥ F | {K, I}? Yes: E ⊥ F | {K, I}
F ⊥ G | D? No: ¬(F ⊥ G | D)
F ⊥ G | H? No: ¬(F ⊥ G | H)
F ⊥ G | {H, K}? No: ¬(F ⊥ G | {H, K})
F ⊥ G | {H, A}? Yes: F ⊥ G | {H, A}
[Figure: example DAG over A, B, C, D, E, F, G, H, I, J, K]
Active trails
A trail X_1 – X_2 – ⋯ – X_k is an active trail, when variables O ⊆ {X_1, …, X_n} are observed, if for each consecutive triplet in the trail:
X_{i−1} → X_i → X_{i+1} and X_i is not observed (X_i ∉ O)
X_{i−1} ← X_i ← X_{i+1} and X_i is not observed (X_i ∉ O)
X_{i−1} ← X_i → X_{i+1} and X_i is not observed (X_i ∉ O)
X_{i−1} → X_i ← X_{i+1} and X_i or one of its descendants is observed (V-structure)
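The four rules above translate directly into code. A sketch of a d-separation test for small graphs (it enumerates all simple undirected paths, so it is illustrative rather than efficient; function names are mine):

```python
# d-separation via the active-trail rules, on the flu network.
parents = {"F": [], "A": [], "S": ["F", "A"], "H": ["S"], "N": ["S"]}
children = {v: [u for u in parents if v in parents[u]] for v in parents}

def descendants(v):
    out, stack = set(), list(children[v])
    while stack:
        u = stack.pop()
        if u not in out:
            out.add(u)
            stack.extend(children[u])
    return out

def simple_paths(x, y, path=None):
    """All simple paths x..y in the undirected skeleton."""
    path = path or [x]
    if path[-1] == y:
        yield path
        return
    for nxt in parents[path[-1]] + children[path[-1]]:
        if nxt not in path:
            yield from simple_paths(x, y, path + [nxt])

def triplet_active(a, b, c, obs):
    if a in parents[b] and c in children[b]:   # a -> b -> c: causal chain
        return b not in obs
    if a in children[b] and c in parents[b]:   # a <- b <- c: reversed chain
        return b not in obs
    if a in children[b] and c in children[b]:  # a <- b -> c: common cause
        return b not in obs
    # a -> b <- c: V-structure, active iff b or a descendant of b is observed
    return b in obs or bool(descendants(b) & obs)

def d_separated(x, y, obs):
    obs = set(obs)
    for p in simple_paths(x, y):
        if all(triplet_active(p[i - 1], p[i], p[i + 1], obs)
               for i in range(1, len(p) - 1)):
            return False  # found an active trail
    return True

print(d_separated("F", "A", set()))   # True:  F and A are d-separated
print(d_separated("F", "A", {"S"}))   # False: explaining away activates the trail
print(d_separated("H", "N", {"S"}))   # True:  observing S blocks the common cause
```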
Active trails and conditional independence
Variables X_i and X_j are independent given Z ⊆ {X_1, …, X_n} if there is no active trail between X_i and X_j when the variables in Z are observed: (X_i ⊥ X_j | Z)
We say that X_i and X_j are d-separated given Z (dependency separation)
E.g., F ⊥ G
[Figure: example DAG over A, B, C, D, E, F, G, H, I, J, K]
Soundness of d-separation
Given BN with structure 𝐺
Set of conditional independence assertions obtained by d-separation
I(G) = {X ⊥ Y | Z : d-sep_G(X; Y | Z)}
Soundness of d-separation
If 𝑃 factorizes over 𝐺 then 𝐼 𝐺 ⊆ 𝐼 𝑃 (not only 𝐼𝑙 𝐺 ⊆ 𝐼 𝑃 )
Interpretation: d-separation only captures true conditional independencies
For most P’s that factorize over G, 𝐼 𝐺 = 𝐼 𝑃 (P-map)
Bayesian Networks are not enough
Non-existence of P-maps, example 1
XOR: A = B XOR C
A ⊥ B, ¬(A ⊥ B | C)
B ⊥ C, ¬(B ⊥ C | A)
C ⊥ A, ¬(C ⊥ A | B)
[Figure: minimal I-map A → C ← B; it encodes A ⊥ B, but it is not a P-map: we cannot read B ⊥ C or A ⊥ C from it]
Non-existence of P-maps, example 2
Swinging couples of variables (X_1, Y_1) and (X_2, Y_2)
X_1 ⊥ Y_1
X_2 ⊥ Y_2
X_1 ⊥ X_2 | {Y_1, Y_2}
Y_1 ⊥ Y_2 | {X_1, X_2}
No Bayesian network P-map exists
Need undirected graphical models!
[Figure: four variables X_1, X_2, Y_1, Y_2]