Probabilistic Graphical Models
David Sontag
New York University
Lecture 1, January 31, 2013
David Sontag (NYU) Graphical Models Lecture 1, January 31, 2013 1 / 44
One of the most exciting advances in machine learning (AI, signal processing, coding, control, ...) in the last decades
How can we gain global insight based on local observations?
Key idea
1. Represent the world as a collection of random variables $X_1, \ldots, X_n$ with joint distribution $p(X_1, \ldots, X_n)$
2. Learn the distribution from data
3. Perform “inference” (compute conditional distributions $p(X_i \mid X_1 = x_1, \ldots, X_m = x_m)$)
Reasoning under uncertainty
As humans, we are continuously making predictions under uncertainty
Classical AI and ML research ignored this phenomenon
Many of the most recent advances in technology are possible because of this new, probabilistic, approach
Applications: Deep question answering
Applications: Machine translation
Applications: Speech recognition
Applications: Stereo vision
[Figure: stereo vision example. Input: two images. Output: disparity.]
Key challenges
1. Represent the world as a collection of random variables $X_1, \ldots, X_n$ with joint distribution $p(X_1, \ldots, X_n)$
   - How does one compactly describe this joint distribution?
   - Directed graphical models (Bayesian networks)
   - Undirected graphical models (Markov random fields, factor graphs)
2. Learn the distribution from data
   - Maximum likelihood estimation. Other estimation methods?
   - How much data do we need?
   - How much computation does it take?
3. Perform “inference” (compute conditional distributions $p(X_i \mid X_1 = x_1, \ldots, X_m = x_m)$)
Syllabus overview
We will study Representation, Inference & Learning
First in the simplest case
  - Only discrete variables
  - Fully observed models
  - Exact inference & learning

Then generalize
  - Continuous variables
  - Partially observed data during learning (hidden variables)
  - Approximate inference & learning

Learn about algorithms, theory & applications
Logistics: class
Class webpage: http://cs.nyu.edu/~dsontag/courses/pgm13/
  - Sign up for mailing list!
  - Draft slides posted before each lecture

Book: Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman, MIT Press (2009)
  - Required readings for each lecture posted to course website.
  - Many additional reference materials available!

Office hours: Wednesday 5-6pm and by appointment. 715 Broadway, 12th floor, Room 1204
Teaching Assistant: Li Wan ([email protected])
Li’s Office hours: Monday 5-6pm. 715 Broadway, Room 1231
Logistics: prerequisites & grading
Prerequisites:
  - Previous class on machine learning
  - Basic concepts from probability and statistics
  - Algorithms (e.g., dynamic programming, graphs, complexity)
  - Calculus

Grading: problem sets (65%) + in-class final exam (30%) + participation (5%)
  - Class attendance is required.
  - 7-8 assignments (every 1–2 weeks). Both theory and programming.
  - First homework out today, due next Thursday (Feb. 7) at 5pm
  - Important: See collaboration policy on class webpage

Solutions to the theoretical questions require formal proofs.
For the programming assignments, I recommend Python, Java, or Matlab. Do not use C++.
Review of probability: outcomes
Reference: Chapter 2 and Appendix A
An outcome space specifies the possible outcomes that we would like to reason about, e.g.
  - Coin toss: $\Omega = \{\text{heads}, \text{tails}\}$
  - Die toss: $\Omega = \{1, 2, 3, 4, 5, 6\}$

We specify a probability $p(\omega)$ for each outcome $\omega$ such that
$$p(\omega) \ge 0, \qquad \sum_{\omega \in \Omega} p(\omega) = 1$$
E.g., for a biased coin, the two outcomes could have probabilities .6 and .4
Review of probability: events
An event is a subset of the outcome space, e.g.
  - O = odd die tosses: $O = \{1, 3, 5\}$
  - E = even die tosses: $E = \{2, 4, 6\}$

The probability of an event is given by the sum of the probabilities of the outcomes it contains,
$$p(E) = \sum_{\omega \in E} p(\omega)$$
E.g., $p(E) = p(2) + p(4) + p(6) = 1/2$, if fair die
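The definitions of outcome space and event can be sketched in a few lines of Python. This is a minimal illustration, assuming a fair six-sided die; the names `omega`, `p`, `prob`, `odd`, and `even` are hypothetical and do not come from the lecture.

```python
from fractions import Fraction

# Outcome space for one toss of a die; uniform weights assume a fair die.
omega = [1, 2, 3, 4, 5, 6]
p = {w: Fraction(1, 6) for w in omega}

# A valid distribution is non-negative and sums to one.
assert all(pw >= 0 for pw in p.values())
assert sum(p.values()) == 1

def prob(event):
    """Probability of an event: the sum of p over the outcomes it contains."""
    return sum(p[w] for w in event)

odd = {1, 3, 5}
even = {2, 4, 6}
print(prob(even))  # 1/2
```

Exact fractions are used here only so that equalities can be checked without floating-point tolerance.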
Independence of events
Two events A and B are independent if
$$p(A \cap B) = p(A)\,p(B)$$
Are these two events independent?
[Figure: events A and B, each a single, different face of the same die]
No! $p(A \cap B) = 0$, while $p(A)\,p(B) = \left(\frac{1}{6}\right)^2$
Now suppose our outcome space had two different dice:
$\Omega$ = all pairs of faces from 2 die tosses, $6^2 = 36$ outcomes
and the probability distribution is such that each die is independent, i.e. the probability of every pair of faces is the product of the probabilities of the two individual faces
Independence of events
Two events A and B are independent if
$$p(A \cap B) = p(A)\,p(B)$$
Are these two events independent?
[Figure: A is a face shown by the first die, B is a face shown by the second die]
Yes! $p(A \cap B) = p(A)\,p(B) = \left(\frac{1}{6}\right)^2$
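The two independence questions can be checked numerically. The sketch below assumes two fair, independent dice (uniform weight 1/36 on each ordered pair); the specific faces chosen for the events are hypothetical, since the faces in the slide figures are not recoverable here.

```python
from fractions import Fraction
from itertools import product

# Outcome space: all 36 ordered pairs (first die, second die).
omega = list(product(range(1, 7), repeat=2))
p = {w: Fraction(1, 36) for w in omega}

def prob(event):
    return sum(p[w] for w in event)

def independent(a, b):
    """Check the definition p(A and B) = p(A) p(B)."""
    return prob(a & b) == prob(a) * prob(b)

# Events about different dice.
first_is_1 = {w for w in omega if w[0] == 1}
second_is_2 = {w for w in omega if w[1] == 2}

# A second event about the same (first) die; disjoint from first_is_1.
first_is_2 = {w for w in omega if w[0] == 2}

print(independent(first_is_1, second_is_2))  # True
print(independent(first_is_1, first_is_2))   # False
```

Note that disjoint events with positive probability are never independent: their intersection has probability zero while the product of their probabilities does not.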
Conditional probability
[Figure: events A and B, next to an example joint distribution over temperature T and weather W. Caption: a simple relation between joint and conditional probabilities; in fact, this is taken as the definition of a conditional probability]

| T | W | P |
|---|---|---|
| hot | sun | 0.4 |
| hot | rain | 0.1 |
| cold | sun | 0.2 |
| cold | rain | 0.3 |

Let A, B be events, $p(B) > 0$.
$$p(A \mid B) = \frac{p(A \cap B)}{p(B)}$$
Claim 1: $\sum_{\omega \in S} p(\omega \mid S) = 1$
Claim 2: If A and B are independent, then $p(A \mid B) = p(A)$
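As a small worked example of the definition, the sketch below conditions on "hot" in the temperature/weather table from the slide. The helper names `prob` and `cond` are hypothetical; the probabilities are the four table entries written as exact fractions.

```python
from fractions import Fraction

# Joint distribution over (temperature, weather), taken from the slide's table.
p = {
    ("hot", "sun"): Fraction(4, 10),
    ("hot", "rain"): Fraction(1, 10),
    ("cold", "sun"): Fraction(2, 10),
    ("cold", "rain"): Fraction(3, 10),
}

def prob(event):
    return sum(p[w] for w in event)

def cond(a, b):
    """p(A | B) = p(A and B) / p(B), defined only when p(B) > 0."""
    pb = prob(b)
    if pb == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / pb

hot = {w for w in p if w[0] == "hot"}
sun = {w for w in p if w[1] == "sun"}

print(cond(sun, hot))  # 4/5

# Claim 1: the conditional probabilities of the outcomes in S sum to one.
assert sum(cond({w}, hot) for w in hot) == 1
```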
Two important rules
1. Chain rule: Let $S_1, \ldots, S_n$ be events, $p(S_i) > 0$.
$$p(S_1 \cap S_2 \cap \cdots \cap S_n) = p(S_1)\,p(S_2 \mid S_1) \cdots p(S_n \mid S_1, \ldots, S_{n-1})$$
2. Bayes’ rule: Let $S_1, S_2$ be events, $p(S_1) > 0$ and $p(S_2) > 0$.
$$p(S_1 \mid S_2) = \frac{p(S_1 \cap S_2)}{p(S_2)} = \frac{p(S_2 \mid S_1)\,p(S_1)}{p(S_2)}$$
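Both rules can be verified on a concrete case. The following is a sketch under the assumption of a single fair die; the events `s1`, `s2`, and `s3` are hypothetical choices made for illustration.

```python
from fractions import Fraction

# One fair die (an assumption for this example).
p = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(event):
    return sum(p[w] for w in event)

def cond(a, b):
    return prob(a & b) / prob(b)

s1 = {2, 4, 6}   # the toss is even
s2 = {4, 5, 6}   # the toss is greater than 3
s3 = {6}         # the toss is a six

# Chain rule: p(S1 and S2 and S3) = p(S1) p(S2 | S1) p(S3 | S1 and S2)
lhs = prob(s1 & s2 & s3)
rhs = prob(s1) * cond(s2, s1) * cond(s3, s1 & s2)
assert lhs == rhs

# Bayes' rule: p(S1 | S2) = p(S2 | S1) p(S1) / p(S2)
bayes = cond(s2, s1) * prob(s1) / prob(s2)
assert cond(s1, s2) == bayes
print(bayes)  # 2/3
```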
Discrete random variables
Often each outcome corresponds to a setting of various attributes (e.g., “age”, “gender”, “hasPneumonia”, “hasDiabetes”)
A random variable X is a mapping $X : \Omega \to D$
  - D is some set (e.g., the integers)
  - Induces a partition of all outcomes Ω

For some $x \in D$, we say
$$p(X = x) = p(\{\omega \in \Omega : X(\omega) = x\})$$
“probability that variable X assumes state x”
Notation: Val(X) = set D of all values assumed by X (will interchangeably call these the “values” or “states” of variable X)
p(X) is a distribution: $\sum_{x \in \mathrm{Val}(X)} p(X = x) = 1$
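To make the "mapping" view concrete, here is a minimal sketch in which a random variable is literally a Python function on outcomes. The die, the variable `X` (parity of the toss), and the helper `distribution` are assumptions chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction

omega = range(1, 7)                       # one fair die (assumed)
p = {w: Fraction(1, 6) for w in omega}

def X(w):
    """A random variable: maps each outcome to a value in D = {"odd", "even"}."""
    return "even" if w % 2 == 0 else "odd"

def distribution(rv):
    """p(X = x) = p({w in Omega : X(w) = x}) for every x in Val(X)."""
    dist = defaultdict(Fraction)          # Fraction() is zero
    for w in omega:
        dist[rv(w)] += p[w]               # outcomes with the same value share a cell
    return dict(dist)

p_X = distribution(X)
assert sum(p_X.values()) == 1             # p(X) is itself a distribution
```

Grouping outcomes by the value of `X` is exactly the partition of Ω that the slide mentions.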
Multivariate distributions
Instead of one random variable, have random vector
X(ω) = [X1(ω), . . . ,Xn(ω)]
Xi = xi is an event. The joint distribution
p(X1 = x1, . . . ,Xn = xn)
is simply defined as p(X1 = x1 ∩ · · · ∩ Xn = xn)
We will often write p(x1, . . . , xn) instead of p(X1 = x1, . . . ,Xn = xn)
Conditioning, chain rule, Bayes’ rule, etc. all apply
Working with random variables
For example, the conditional distribution
$$p(X_1 \mid X_2 = x_2) = \frac{p(X_1, X_2 = x_2)}{p(X_2 = x_2)}.$$
This notation means
$$p(X_1 = x_1 \mid X_2 = x_2) = \frac{p(X_1 = x_1, X_2 = x_2)}{p(X_2 = x_2)} \quad \forall x_1 \in \mathrm{Val}(X_1)$$
Two random variables are independent, X1 ⊥ X2, if
p(X1 = x1,X2 = x2) = p(X1 = x1)p(X2 = x2)
for all values x1 ∈ Val(X1) and x2 ∈ Val(X2).
Example
Consider three binary-valued random variables
$X_1, X_2, X_3$, with $\mathrm{Val}(X_i) = \{0, 1\}$
Let outcome space Ω be the cross-product of their states:
$$\Omega = \mathrm{Val}(X_1) \times \mathrm{Val}(X_2) \times \mathrm{Val}(X_3)$$
$X_i(\omega)$ is the value for $X_i$ in the assignment $\omega \in \Omega$
Specify $p(\omega)$ for each outcome $\omega \in \Omega$ by a big table:

| x1 | x2 | x3 | p(x1, x2, x3) |
|---|---|---|---|
| 0 | 0 | 0 | .11 |
| 0 | 0 | 1 | .02 |
| ... | ... | ... | ... |
| 1 | 1 | 1 | .05 |

How many parameters do we need to specify? $2^3 - 1$
Marginalization
Suppose X and Y are random variables with distribution p(X, Y)
  - X: Intelligence, Val(X) = {“Very High”, “High”}
  - Y: Grade, Val(Y) = {“a”, “b”}

Joint distribution specified by:

| | X = vh | X = h |
|---|---|---|
| Y = a | 0.7 | 0.15 |
| Y = b | 0.1 | 0.05 |

$p(Y = a) = {?}$ Answer: $0.85$
More generally, suppose we have a joint distribution $p(X_1, \ldots, X_n)$. Then,
$$p(X_i = x_i) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_n} p(x_1, \ldots, x_n)$$
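The marginalization sum can be written as a short loop over the joint table. This is a sketch, not a general-purpose implementation: the dictionary layout (keys are assignment tuples `(x, y)`) and the function name `marginal` are assumptions.

```python
# Joint distribution p(X, Y) from the Intelligence/Grade table; keys are (x, y).
joint = {
    ("vh", "a"): 0.7, ("h", "a"): 0.15,
    ("vh", "b"): 0.1, ("h", "b"): 0.05,
}

def marginal(joint, axis):
    """Sum the joint over every variable except the one at position `axis`."""
    out = {}
    for assignment, pr in joint.items():
        value = assignment[axis]
        out[value] = out.get(value, 0.0) + pr
    return out

p_Y = marginal(joint, axis=1)
p_X = marginal(joint, axis=0)
print(round(p_Y["a"], 10))  # 0.85
```

The same function works unchanged for tables over more than two variables, because it only inspects one position of each assignment tuple.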
Conditioning
Suppose X and Y are random variables with distribution p(X, Y)
  - X: Intelligence, Val(X) = {“Very High”, “High”}
  - Y: Grade, Val(Y) = {“a”, “b”}

| | X = vh | X = h |
|---|---|---|
| Y = a | 0.7 | 0.15 |
| Y = b | 0.1 | 0.05 |

Can compute the conditional probability
$$\begin{aligned}
p(Y = a \mid X = vh) &= \frac{p(Y = a, X = vh)}{p(X = vh)} \\
&= \frac{p(Y = a, X = vh)}{p(Y = a, X = vh) + p(Y = b, X = vh)} \\
&= \frac{0.7}{0.7 + 0.1} = 0.875.
\end{aligned}$$
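Conditioning on a table amounts to keeping the matching entries and renormalizing. The sketch below reproduces the 0.875 computed above; the function name `condition` and the tuple-keyed dictionary layout are assumptions made for illustration.

```python
# Joint distribution p(X, Y) from the Intelligence/Grade table; keys are (x, y).
joint = {
    ("vh", "a"): 0.7, ("h", "a"): 0.15,
    ("vh", "b"): 0.1, ("h", "b"): 0.05,
}

def condition(joint, axis, value):
    """Return p(remaining variables | variable at `axis` = value)."""
    kept = {a: pr for a, pr in joint.items() if a[axis] == value}
    z = sum(kept.values())            # p(variable = value), obtained by marginalization
    if z == 0:
        raise ValueError("conditioning event has probability zero")
    # Drop the conditioned position from each key and renormalize.
    return {a[:axis] + a[axis + 1:]: pr / z for a, pr in kept.items()}

p_Y_given_vh = condition(joint, axis=0, value="vh")
print(round(p_Y_given_vh[("a",)], 3))  # 0.875
```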
Example: Medical diagnosis
Variable for each symptom (e.g. “fever”, “cough”, “fast breathing”, “shaking”, “nausea”, “vomiting”)
Variable for each disease (e.g. “pneumonia”, “flu”, “common cold”, “bronchitis”, “tuberculosis”)
Diagnosis is performed by inference in the model:
p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)
One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 findings
Representing the distribution
Naively, could represent multivariate distributions with table of probabilities for each outcome (assignment)
How many outcomes are there in QMR-DT? $2^{4600}$
Estimation of joint distribution would require a huge amount of data
Inference of conditional probabilities, e.g.
p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)
would require summing over exponentially many variables’ values
Moreover, defeats the purpose of probabilistic modeling, which is to make predictions with previously unseen observations
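To get a feel for the size of such a table, the count of outcomes can be computed exactly, since Python integers have arbitrary precision. This short sketch assumes, as the $2^{4600}$ figure implies, that every disease and finding variable is binary.

```python
n_variables = 600 + 4000          # diseases plus findings in QMR-DT
n_outcomes = 2 ** n_variables     # one table entry per joint assignment

# The number is far too large to enumerate; report how many decimal digits it has.
print(len(str(n_outcomes)))       # 1385
```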
Structure through independence
If $X_1, \ldots, X_n$ are independent, then
$$p(x_1, \ldots, x_n) = p(x_1)\,p(x_2) \cdots p(x_n)$$
$2^n$ entries can be described by just n numbers (if $|\mathrm{Val}(X_i)| = 2$)!
However, this is not a very useful model: observing a variable $X_i$ cannot influence our predictions of $X_j$
If $X_1, \ldots, X_n$ are conditionally independent given Y, denoted as $X_i \perp X_{-i} \mid Y$, then
$$\begin{aligned}
p(y, x_1, \ldots, x_n) &= p(y)\,p(x_1 \mid y) \prod_{i=2}^{n} p(x_i \mid x_1, \ldots, x_{i-1}, y) \\
&= p(y)\,p(x_1 \mid y) \prod_{i=2}^{n} p(x_i \mid y).
\end{aligned}$$
This is a simple, yet powerful, model
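The savings can be made concrete by counting free parameters. The sketch below assumes every variable, including Y, is binary; the function names are hypothetical.

```python
def full_joint_params(n):
    """Unrestricted table over Y, X1..Xn: 2^(n+1) entries, minus one because they sum to one."""
    return 2 ** (n + 1) - 1

def fully_independent_params(n):
    """X1..Xn mutually independent: one number p(Xi = 1) per variable."""
    return n

def cond_independent_params(n):
    """p(Y = 1), plus p(Xi = 1 | y) for each i and each of the two values of y."""
    return 1 + 2 * n

for n in (3, 10, 30):
    print(n, full_joint_params(n), cond_independent_params(n))
```

The conditionally independent model grows linearly in n while still letting each $X_i$ inform predictions about $X_j$ through Y.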
Example: naive Bayes for classification
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
  - Let 1 : n index the words in our vocabulary (e.g., English)
  - $X_i = 1$ if word i appears in an e-mail, and 0 otherwise
  - E-mails are drawn according to some distribution $p(Y, X_1, \ldots, X_n)$

Suppose that the words are conditionally independent given Y. Then,
$$p(y, x_1, \ldots, x_n) = p(y) \prod_{i=1}^{n} p(x_i \mid y)$$
Estimate the model with maximum likelihood. Predict with:
$$p(Y = 1 \mid x_1, \ldots, x_n) = \frac{p(Y = 1) \prod_{i=1}^{n} p(x_i \mid Y = 1)}{\sum_{y \in \{0, 1\}} p(Y = y) \prod_{i=1}^{n} p(x_i \mid Y = y)}$$
Are the independence assumptions made here reasonable?
Philosophy: Nearly all probabilistic models are “wrong”, but many are nonetheless useful
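The prediction formula can be sketched directly. All numbers below are made-up parameters for a hypothetical three-word vocabulary, not estimates from any data set; in practice they would come from maximum likelihood estimation on labeled e-mails.

```python
# Hypothetical parameters (assumptions for illustration only).
prior = {0: 0.7, 1: 0.3}                  # p(Y = y)
p_word = {                                # p(X_i = 1 | Y = y), one entry per word
    0: [0.05, 0.40, 0.30],
    1: [0.60, 0.10, 0.30],
}

def joint(y, x):
    """p(y, x_1..x_n) = p(y) * prod_i p(x_i | y)."""
    result = prior[y]
    for xi, theta in zip(x, p_word[y]):
        result *= theta if xi == 1 else 1.0 - theta
    return result

def posterior_spam(x):
    """p(Y = 1 | x): normalize the joint over y in {0, 1}."""
    return joint(1, x) / (joint(0, x) + joint(1, x))

print(posterior_spam([1, 0, 0]))
```

With a realistic vocabulary the product of many small probabilities underflows in floating point, so implementations usually add logarithms instead of multiplying; that refinement is omitted here to keep the correspondence with the formula visible.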
Bayesian networks
Reference: Chapter 3
A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
1. One node $i \in V$ for each random variable $X_i$
2. One conditional probability distribution (CPD) per node, $p(x_i \mid x_{\mathrm{Pa}(i)})$, specifying the variable’s probability conditioned on its parents’ values

Corresponds 1-1 with a particular factorization of the joint distribution:
$$p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})$$
Powerful framework for designing algorithms to perform probability computations
Example
Consider the following Bayesian network:
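[Figure: Bayesian network over Difficulty, Intelligence, Grade, SAT and Letter, with edges Difficulty → Grade, Intelligence → Grade, Intelligence → SAT and Grade → Letter. Each node is annotated with its CPD:]

p(d):

| d0 | d1 |
|---|---|
| 0.6 | 0.4 |

p(i):

| i0 | i1 |
|---|---|
| 0.7 | 0.3 |

p(g | i, d):

| | g1 | g2 | g3 |
|---|---|---|---|
| i0, d0 | 0.3 | 0.4 | 0.3 |
| i0, d1 | 0.05 | 0.25 | 0.7 |
| i1, d0 | 0.9 | 0.08 | 0.02 |
| i1, d1 | 0.5 | 0.3 | 0.2 |

p(s | i):

| | s0 | s1 |
|---|---|---|
| i0 | 0.95 | 0.05 |
| i1 | 0.2 | 0.8 |

p(l | g):

| | l0 | l1 |
|---|---|---|
| g1 | 0.1 | 0.9 |
| g2 | 0.4 | 0.6 |
| g3 | 0.99 | 0.01 |

What is its joint distribution?
$$p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})$$
$$p(d, i, g, s, l) = p(d)\,p(i)\,p(g \mid i, d)\,p(s \mid i)\,p(l \mid g)$$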
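The factorization turns the joint probability of any full assignment into a product of five table lookups. The sketch below encodes the CPD values from the slide as nested dictionaries. The assignment of numbers to rows is an assumption: it is the reading under which every row of each conditional table sums to one. The dictionary names are hypothetical.

```python
from itertools import product

p_d = {"d0": 0.6, "d1": 0.4}
p_i = {"i0": 0.7, "i1": 0.3}
p_g = {  # p(g | i, d), keyed by the parents' values (i, d)
    ("i0", "d0"): {"g1": 0.3, "g2": 0.4, "g3": 0.3},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
    ("i1", "d0"): {"g1": 0.9, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.5, "g2": 0.3, "g3": 0.2},
}
p_s = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}  # p(s | i)
p_l = {  # p(l | g)
    "g1": {"l0": 0.1, "l1": 0.9},
    "g2": {"l0": 0.4, "l1": 0.6},
    "g3": {"l0": 0.99, "l1": 0.01},
}

def joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)."""
    return p_d[d] * p_i[i] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

print(joint("d0", "i1", "g2", "s1", "l0"))  # approximately 0.004608

# Sanity check: the 2 * 2 * 3 * 2 * 2 = 48 joint entries sum to one.
total = sum(joint(d, i, g, s, l)
            for d, i, g, s, l in product(p_d, p_i, ["g1", "g2", "g3"],
                                         ["s0", "s1"], ["l0", "l1"]))
```

The five tables hold 15 free parameters in total, compared with 47 for an unrestricted table over the same 48 assignments.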
More examples
$$p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})$$
Will my car start this morning?
Heckerman et al., Decision-Theoretic Troubleshooting, 1995
More examples
$$p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{\mathrm{Pa}(i)})$$
What is the differential diagnosis?
Beinlich et al., The ALARM Monitoring System, 1989
Bayesian networks are generative models
[Figure: naive Bayes network, with the label Y as the single parent of the features X1, X2, X3, ..., Xn]
Evidence is denoted by shading in a node
Can interpret Bayesian network as a generative process. For example, to generate an e-mail, we
1. Decide whether it is spam or not spam, by sampling $y \sim p(Y)$
2. For each word i = 1 to n, sample $x_i \sim p(X_i \mid Y = y)$
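The two-step generative process can be sketched as follows. The parameters are hypothetical values for a three-word vocabulary, and the function name `sample_email` is an assumption; only the order of sampling (label first, then each word given the label) comes from the slide.

```python
import random

# Hypothetical parameters (assumptions for illustration only).
prior_spam = 0.3                          # p(Y = 1)
p_word = {0: [0.05, 0.40, 0.30],          # p(X_i = 1 | Y = y)
          1: [0.60, 0.10, 0.30]}

def sample_email(rng):
    """Sample (y, x) by following the edges of the network: Y first, then each X_i."""
    y = 1 if rng.random() < prior_spam else 0
    x = [1 if rng.random() < theta else 0 for theta in p_word[y]]
    return y, x

rng = random.Random(0)                    # fixed seed so runs are repeatable
samples = [sample_email(rng) for _ in range(10000)]
spam_rate = sum(y for y, _ in samples) / len(samples)
print(spam_rate)                          # should be close to prior_spam
```

Sampling parents before children in this way generalizes to any Bayesian network, because the graph is acyclic and therefore always admits such an ordering.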
Bayesian network structure implies conditional independencies!
[Figure: the Bayesian network from the previous example, with edges Difficulty → Grade, Intelligence → Grade, Intelligence → SAT and Grade → Letter]
The joint distribution corresponding to the above BN factors as
$$p(d, i, g, s, l) = p(d)\,p(i)\,p(g \mid i, d)\,p(s \mid i)\,p(l \mid g)$$
However, by the chain rule, any distribution can be written as
$$p(d, i, g, s, l) = p(d)\,p(i \mid d)\,p(g \mid i, d)\,p(s \mid i, d, g)\,p(l \mid g, d, i, s)$$
Thus, we are assuming the following additional independencies: $D \perp I$, $S \perp \{D, G\} \mid I$, $L \perp \{I, D, S\} \mid G$. What else?
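As a worked check of the first of these independencies, $D \perp I$ can be derived from the factorization alone by summing out g, s and l. Each inner sum is over a conditional distribution and therefore equals one:

$$\begin{aligned}
p(d, i) &= \sum_{g} \sum_{s} \sum_{l} p(d)\,p(i)\,p(g \mid i, d)\,p(s \mid i)\,p(l \mid g) \\
&= p(d)\,p(i) \left( \sum_{s} p(s \mid i) \right) \sum_{g} p(g \mid i, d) \left( \sum_{l} p(l \mid g) \right) \\
&= p(d)\,p(i).
\end{aligned}$$

The same style of argument, summing out variables that appear only as children, yields the other two statements.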
Bayesian network structure implies conditional independencies!
Generalizing the above arguments, we obtain that a variable is independent from its non-descendants given its parents
Common parent: fixing B decouples A and C
[Figure: B → A and B → C]
Cascade: knowing B decouples A and C
[Figure: A → B → C]
V-structure: knowing C couples A and B
[Figure: A → C ← B]
This important phenomenon is called explaining away and is what makes Bayesian networks so powerful
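Explaining away can be seen numerically in the smallest possible v-structure. The sketch below is an illustration under stated assumptions: A and B are independent binary causes with hypothetical priors of 1/10 each, and C is a deterministic OR of the two. None of these numbers come from the lecture.

```python
from fractions import Fraction
from itertools import product

p_a = Fraction(1, 10)   # hypothetical prior p(A = 1)
p_b = Fraction(1, 10)   # hypothetical prior p(B = 1)

def joint(a, b, c):
    """p(a, b, c) = p(a) p(b) p(c | a, b), with C the logical OR of A and B."""
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = 1 if c == (a or b) else 0
    return pa * pb * pc

def query(target, evidence):
    """p(target | evidence) by brute-force enumeration over all assignments."""
    num = den = Fraction(0)
    for a, b, c in product((0, 1), repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in evidence.items()):
            w = joint(a, b, c)
            den += w
            if all(world[k] == v for k, v in target.items()):
                num += w
    return num / den

print(query({"a": 1}, {"b": 1}))            # 1/10: A and B are marginally independent
print(query({"a": 1}, {"c": 1}))            # 10/19: observing the effect raises belief in A
print(query({"a": 1}, {"c": 1, "b": 1}))    # 1/10: B explains C, so belief in A drops back
```

Once C is observed, learning B changes the probability of A, which is exactly the coupling the v-structure describes.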
A simple justification (for common parent)
[Figure: common-parent structure, B → A and B → C]
We’ll show that p(A,C | B) = p(A | B)p(C | B) for any distribution p(A,B,C) that factors according to this graph structure, i.e.
p(A,B,C) = p(B)p(A | B)p(C | B)
Proof.
p(A,C | B) = p(A,B,C) / p(B) = p(B)p(A | B)p(C | B) / p(B) = p(A | B)p(C | B)
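The identity can also be checked numerically. The following is a minimal sketch, assuming binary variables and randomly drawn CPDs (the tables and the helper names `random_bernoulli`, `joint`, and `max_gap` are hypothetical). It builds a joint that factors as p(B)p(A | B)p(C | B), recomputes every conditional from the joint alone, and reports the largest deviation from p(A,C | B) = p(A | B)p(C | B).

```python
import random
from itertools import product

random.seed(0)  # fixed seed so the run is reproducible


def random_bernoulli():
    """A random distribution over {0, 1}."""
    p = random.uniform(0.05, 0.95)
    return {1: p, 0: 1.0 - p}


# Hypothetical CPDs for the common-parent graph B -> A, B -> C.
p_b = random_bernoulli()
p_a_given_b = {b: random_bernoulli() for b in (0, 1)}
p_c_given_b = {b: random_bernoulli() for b in (0, 1)}


def joint(a, b, c):
    """p(A=a, B=b, C=c) = p(b) p(a | b) p(c | b)."""
    return p_b[b] * p_a_given_b[b][a] * p_c_given_b[b][c]


def max_gap():
    """Largest |p(a, c | b) - p(a | b) p(c | b)|, computed from the joint only."""
    gap = 0.0
    for b in (0, 1):
        marg_b = sum(joint(a, b, c) for a, c in product((0, 1), repeat=2))
        for a, c in product((0, 1), repeat=2):
            p_ac = joint(a, b, c) / marg_b
            p_a = sum(joint(a, b, c2) for c2 in (0, 1)) / marg_b
            p_c = sum(joint(a2, b, c) for a2 in (0, 1)) / marg_b
            gap = max(gap, abs(p_ac - p_a * p_c))
    return gap


print(max_gap())  # zero up to floating-point rounding
```

A check like this is not a proof, but it is a quick sanity test that the factorization really implies the conditional independence.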
D-separation (“directed separated”) in Bayesian networks
Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation
Look to see if there is an active path between X and Z when variables Y are observed:
[Figure: three-node graphs over X, Y, Z, panels (a) and (b)]
If no such path, then X and Z are d-separated with respect to Y
d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)
Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
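To make the "connectivity in graphs" point concrete, here is a minimal sketch of a d-separation test. Note that it does not enumerate active paths as described above. Instead it uses an equivalent, well-known criterion: restrict the graph to the ancestors of X, Z, and the observed set, moralize it (connect co-parents and drop edge directions), delete the observed nodes, and check ordinary graph connectivity. The graph encoding (a dict mapping each node to its parents) and the function names are assumptions for illustration.

```python
from collections import deque


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(parents, x, z, observed):
    """True if x and z are d-separated given `observed` (x, z not in observed)."""
    observed = set(observed)
    keep = ancestors(parents, {x, z} | observed)

    # Moralize the ancestral subgraph: undirected child-parent edges,
    # plus an edge between every pair of parents sharing a child.
    neighbors = {node: set() for node in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                neighbors[p].add(q)
                neighbors[q].add(p)

    # Breadth-first search from x that never enters an observed node.
    visited = {x}
    frontier = deque([x])
    while frontier:
        node = frontier.popleft()
        if node == z:
            return False
        for nb in neighbors[node]:
            if nb not in visited and nb not in observed:
                visited.add(nb)
                frontier.append(nb)
    return True


# The three canonical three-node structures, written as child -> parents.
cascade = {"B": ["A"], "C": ["B"]}        # A -> B -> C
common_parent = {"A": ["B"], "C": ["B"]}  # A <- B -> C
v_structure = {"C": ["A", "B"]}           # A -> C <- B

print(d_separated(cascade, "A", "C", {"B"}))      # B observed
print(d_separated(v_structure, "A", "B", set()))  # nothing observed
print(d_separated(v_structure, "A", "B", {"C"}))  # C observed
```

The moralization-based test was chosen here because it is short and easy to verify; a path-based (reachability) implementation gives the same answers.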
D-separation example 1
[Figure: Bayesian network over X1, ..., X6]
D-separation example 2
[Figure: Bayesian network over X1, ..., X6]
The 2011 Turing Award (Judea Pearl) was for Bayesian networks
Summary
A Bayesian network is given by (G, P), where P is specified as a set of local conditional probability distributions (CPDs) associated with G’s nodes
One interpretation of a BN is as a generative model, where variables are sampled in topological order
Local and global independence properties are identifiable via d-separation criteria
The probability of any assignment is obtained by multiplying CPDs
Bayes’ rule is used to compute conditional probabilities
Marginalization or inference is often computationally difficult
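Two of the points above, sampling in topological order and multiplying CPDs to score an assignment, can be sketched in a few lines. The network below (B → A, B → C with binary variables), its CPD values, and the function names are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical three-node network B -> A, B -> C over binary variables.
ORDER = ["B", "A", "C"]  # a topological order: parents before children
PARENTS = {"B": (), "A": ("B",), "C": ("B",)}
# CPDS[node][tuple of parent values] = p(node = 1 | parents)
CPDS = {
    "B": {(): 0.3},
    "A": {(0,): 0.2, (1,): 0.7},
    "C": {(0,): 0.1, (1,): 0.6},
}


def p_node(node, value, assignment):
    """p(node = value | parent values taken from `assignment`)."""
    key = tuple(assignment[parent] for parent in PARENTS[node])
    p_one = CPDS[node][key]
    return p_one if value == 1 else 1.0 - p_one


def joint_probability(assignment):
    """Probability of a full assignment: the product of one CPD entry per node."""
    prob = 1.0
    for node in ORDER:
        prob *= p_node(node, assignment[node], assignment)
    return prob


def sample(rng):
    """Ancestral sampling: draw each variable given its already-sampled parents."""
    assignment = {}
    for node in ORDER:
        key = tuple(assignment[parent] for parent in PARENTS[node])
        assignment[node] = 1 if rng.random() < CPDS[node][key] else 0
    return assignment


rng = random.Random(0)
print(sample(rng))
print(joint_probability({"B": 1, "A": 1, "C": 0}))  # p(B=1) p(A=1|B=1) p(C=0|B=1)
```

Scoring a full assignment takes one table lookup per node, which is why it is cheap, whereas marginalizing over unobserved variables requires summing over many such assignments.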