Notes on Graphical Models
Transcript of lecture slides by Padhraic Smyth, Department of Computer Science, University of California, Irvine.
Notes on Graphical Models
Padhraic Smyth
Department of Computer Science
University of California, Irvine
Probabilistic Model
Real World Data
P(Data | Parameters)
Probabilistic Model
Real World Data
P(Data | Parameters)
P(Parameters | Data)
Probabilistic Model
Real World Data
P(Data | Parameters)
P(Parameters | Data)
Generative Model, Probability
Inference, Statistics
Part 1: Review of Probability
Notation and Definitions
• X is a random variable
  – Lower-case x is some possible value for X
  – “X = x” is a logical proposition: that X takes value x
  – There is uncertainty about the value of X
• e.g., X is the Dow Jones index at 5pm tomorrow
• p(X = x) is the probability that proposition X=x is true
  – often shortened to p(x)
• If the set of possible x’s is finite, we have a probability distribution and Σx p(x) = 1
• If the set of possible x’s is infinite, p(x) is a density function, and p(x) integrates to 1 over the range of X
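To make the finite case concrete, here is a toy sketch in Python (the outcome names and numbers are invented for illustration):

```python
# A made-up discrete distribution over a finite set of outcomes for X.
p = {"up": 0.5, "down": 0.3, "flat": 0.2}

# A valid distribution: probabilities are non-negative and sum to 1.
assert all(v >= 0 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12
```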
Example
• Let X be the Dow Jones Index (DJI) at 5pm Monday August 22nd (tomorrow)
• X can take real values from 0 to some large number
• p(x) is a density representing our uncertainty about X
  – This density could be constructed from historical data, e.g.,
  – After 5pm there is no uncertainty about x: once we hear from Wall Street what x is, all the probability mass is on that value
Probability as Degree of Belief
• Different agents can have different p(x)’s
  – Your p(x) and the p(x) of a Wall Street expert might be quite different
  – OR: if we were on vacation we might not have access to stock market information
    • we would still be uncertain about x after 5pm
• So we should really think of p(x) as p(x | BI)
– Where BI is background information available to agent I
– (will drop explicit conditioning on BI in notation)
• Thus, p(x) represents the degree of belief that agent I has in proposition x, conditioned on available background information
Comments on Degree of Belief
• Different agents can have different probability models
  – There is no necessarily “correct” p(x)
  – Why? Because p(x) is a model built on whatever assumptions or background information we use
  – Naturally leads to the notion of updating
    • p(x | BI) -> p(x | BI, CI)

• This is the subjective Bayesian interpretation of probability
  – Generalizes other interpretations (such as frequentist)
  – Can be used in cases where frequentist reasoning is not applicable
  – We will use “degree of belief” as our interpretation of p(x) in this tutorial
• Note!
  – Degree of belief is just our semantic interpretation of p(x)
  – The mathematics of probability (e.g., Bayes rule) remain the same regardless of our semantic interpretation
Multiple Variables
• p(x, y, z)
  – Probability that X=x AND Y=y AND Z=z
  – Possible values: cross-product of X × Y × Z

  – e.g., X, Y, Z each take 10 possible values
    • (x, y, z) can take 10^3 possible values
    • p(x, y, z) is a 3-dimensional array/table, defining 10^3 probabilities
    • Note the exponential increase as we add more variables

  – e.g., X, Y, Z are all real-valued
    • (x, y, z) lives in a 3-dimensional vector space
    • p(x, y, z) is a positive function defined over this space that integrates to 1
Conditional Probability
• p(x | y, z)
  – Probability of x given that Y=y and Z=z
  – Could be
    • hypothetical, e.g., “if Y=y and if Z=z”
    • observational, e.g., we observed values y and z
  – can also have p(x, y | z), etc.
  – “all probabilities are conditional probabilities”

• Computing conditional probabilities is the basis of many prediction and learning problems, e.g.,
  – p(DJI tomorrow | DJI last week)
  – expected value of [DJI tomorrow | DJI last week]
  – most likely value of a parameter given observed data
Computing Conditional Probabilities
• Variables A, B, C, D
  – All distributions of interest related to A, B, C, D can be computed from the full joint distribution p(a, b, c, d)

• Examples, using the Law of Total Probability
  – p(a) = Σ{b,c,d} p(a, b, c, d)
  – p(c, d) = Σ{a,b} p(a, b, c, d)
  – p(a, c | d) = Σ{b} p(a, b, c | d)
      where p(a, b, c | d) = p(a, b, c, d) / p(d)
• These are standard probability manipulations: however, we will see how to use these to make inferences about parameters and unobserved variables, given data
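These manipulations can be sketched directly on a joint probability table. The following is a minimal NumPy illustration with a random (made-up) joint over four variables, each taking 3 values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint p(a, b, c, d): each variable takes 3 values.
joint = rng.random((3, 3, 3, 3))
joint /= joint.sum()                       # normalize so the table is a valid distribution

# Law of Total Probability: marginalize by summing out variables.
p_a = joint.sum(axis=(1, 2, 3))            # p(a)
p_cd = joint.sum(axis=(0, 1))              # p(c, d)

# Conditional: p(a, b, c | d) = p(a, b, c, d) / p(d)
p_d = joint.sum(axis=(0, 1, 2))            # p(d)
p_abc_given_d = joint / p_d                # broadcasts over the last axis (d)
p_ac_given_d = p_abc_given_d.sum(axis=1)   # sum out b -> p(a, c | d)

assert np.isclose(p_a.sum(), 1.0)
assert np.allclose(p_ac_given_d.sum(axis=(0, 1)), 1.0)  # sums to 1 for each d
```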
Two Practical Problems
(Assume for simplicity each variable takes K values)
• Problem 1: Computational Complexity
  – Conditional probability computations scale as O(K^N)
• where N is the number of variables being summed over
• Problem 2: Model Specification
  – To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?
Two Key Ideas
• Problem 1: Computational Complexity
  – Idea: Graphical models
• Structured probability models lead to tractable inference
• Problem 2: Model Specification
  – Idea: Probabilistic learning
• General principles for learning from data
Part 2: Graphical Models
“…probability theory is more fundamentally concerned with the structure of reasoning and causation than with numbers.”
Glenn Shafer and Judea Pearl
Introduction to Readings in Uncertain Reasoning, Morgan Kaufmann, 1990
Conditional Independence
• A is conditionally independent of B given C iff p(a | b, c) = p(a | c)
(also implies that B is conditionally independent of A given C)
• In words, B provides no information about A, if value of C is known
• Example:
  – a = “reading ability”
  – b = “height”
  – c = “age”
• Note that conditional independence does not imply marginal independence
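The definition can be checked numerically. Below is a sketch with made-up tables: a joint constructed so that A and B are conditionally independent given C, in which p(a | b, c) = p(a | c) holds exactly while A and B remain marginally dependent:

```python
import numpy as np

# Construct p(a, b, c) = p(a|c) p(b|c) p(c); all numbers are illustrative.
p_c = np.array([0.6, 0.4])
p_a_given_c = np.array([[0.9, 0.2],     # rows: a, cols: c
                        [0.1, 0.8]])
p_b_given_c = np.array([[0.7, 0.3],     # rows: b, cols: c
                        [0.3, 0.7]])

joint = np.einsum('ac,bc,c->abc', p_a_given_c, p_b_given_c, p_c)

# Conditional independence: p(a | b, c) == p(a | c) for all a, b, c.
p_bc = joint.sum(axis=0)                # p(b, c)
p_a_given_bc = joint / p_bc             # p(a | b, c), shape (a, b, c)
assert np.allclose(p_a_given_bc, p_a_given_c[:, None, :])

# But A and B are NOT marginally independent in this model.
p_ab = joint.sum(axis=2)
p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)
assert not np.allclose(p_ab, np.outer(p_a, p_b))
```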
Graphical Models
• Represent dependency structure with a directed graph
  – Node <-> random variable
  – Edges encode dependencies
    • Absence of edge -> conditional independence
  – Directed and undirected versions

• Why is this useful?
  – A language for communication
  – A language for computation

• Origins:
  – Wright 1920’s
  – Independently developed by Spiegelhalter and Lauritzen in statistics and Pearl in computer science in the late 1980’s
Examples of 3-way Graphical Models
A   B   C

Marginal Independence:
p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Graphical Models
A

B   C

Conditionally independent effects:
p(A,B,C) = p(B|A) p(C|A) p(A)

B and C are conditionally independent given A

e.g., A is a disease, and we model B and C as conditionally independent symptoms given A
Examples of 3-way Graphical Models
A B
C
Independent Causes:
p(A,B,C) = p(C|A,B) p(A) p(B)
Examples of 3-way Graphical Models
A   B   C

Markov dependence:
p(A,B,C) = p(C|B) p(B|A) p(A)
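These factored forms can be multiplied back into a full joint table. The sketch below builds the Markov-chain case A -> B -> C from made-up binary tables and checks the implied conditional independence of C from A given B:

```python
import numpy as np

# Markov dependence A -> B -> C, with made-up 2-value tables.
p_a = np.array([0.3, 0.7])
p_b_given_a = np.array([[0.8, 0.4],     # rows: b, cols: a
                        [0.2, 0.6]])
p_c_given_b = np.array([[0.9, 0.1],     # rows: c, cols: b
                        [0.1, 0.9]])

# p(a, b, c) = p(c|b) p(b|a) p(a)
joint = np.einsum('a,ba,cb->abc', p_a, p_b_given_a, p_c_given_b)
assert np.isclose(joint.sum(), 1.0)

# In this chain, p(c | a, b) = p(c | b): C is independent of A given B.
p_ab = joint.sum(axis=2)
p_c_given_ab = joint / p_ab[:, :, None]
assert np.allclose(p_c_given_ab, p_c_given_b.T[None, :, :])
```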
Real-World Example
Monitoring Intensive-Care Patients
• 37 variables
• 509 parameters … instead of 2^37
(figure courtesy of Kevin Murphy/Nir Friedman)
[Figure: a 37-node belief network for intensive-care monitoring (the ALARM network); nodes include MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, CATECHOL, SAO2, HRBP, CVP, and BP]
Directed Graphical Models
A B
C
p(A,B,C) = p(C|A,B)p(A)p(B)
Directed Graphical Models
A B
C
In general, p(X1, X2, …, XN) = Π p(Xi | parents(Xi))
p(A,B,C) = p(C|A,B)p(A)p(B)
Directed Graphical Models
A B
C
• Probability model has simple factored form
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Also known as belief networks, Bayesian networks, causal networks
In general, p(X1, X2, …, XN) = Π p(Xi | parents(Xi))
p(A,B,C) = p(C|A,B)p(A)p(B)
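The factored form above can be assembled directly into a joint table. A minimal sketch for the v-structure A -> C <- B, with made-up binary tables:

```python
import numpy as np

# p(A,B,C) = p(C|A,B) p(A) p(B); all numbers are illustrative.
p_a = np.array([0.6, 0.4])
p_b = np.array([0.5, 0.5])
p_c_given_ab = np.array([[[0.9, 0.1],
                          [0.3, 0.7]],
                         [[0.4, 0.6],
                          [0.2, 0.8]]])   # indices [a, b, c]; sums to 1 over c

joint = np.einsum('a,b,abc->abc', p_a, p_b, p_c_given_ab)
assert np.isclose(joint.sum(), 1.0)

# No edge between A and B: they are marginally independent in this model.
p_ab = joint.sum(axis=2)
assert np.allclose(p_ab, np.outer(p_a, p_b))
```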
Reminders from Probability….
• Law of Total Probability

    P(a) = Σb P(a, b) = Σb P(a | b) P(b)

  – Conditional version:

    P(a | c) = Σb P(a, b | c) = Σb P(a | b, c) P(b | c)

• Factorization or Chain Rule
  – P(a, b, c, d) = P(a | b, c, d) P(b | c, d) P(c | d) P(d), or
                  = P(b | a, c, d) P(c | a, d) P(d | a) P(a), or
                  = …..
Graphical Models for Computation
[Figure: the 37-node intensive-care network shown earlier]
• Say we want to compute P(BP|Press)
• Law of total probability: -> must sum over all other variables -> exponential in # variables
• Factorization: -> joint distribution factors into smaller tables
• Can now sum over smaller tables, can reduce complexity dramatically
Example
[Figure: a DAG over A–G with edges D -> B, D -> E, B -> A, B -> C, E -> F, E -> G]
Example
p(A, B, C, D, E, F, G) = Π p( variable | parents )
                       = p(A|B) p(C|B) p(B|D) p(F|E) p(G|E) p(E|D) p(D)
Example
[Figure: the same DAG, with c and g observed]
Say we want to compute p(a | c, g)
Example
Direct calculation: p(a|c,g) = Σ{b,d,e,f} p(a, b, d, e, f | c, g)

Complexity of the sum is O(K^4)
Example
Reordering (using factorization):

Σb p(a|b) Σd p(b|d,c) Σe p(d|e) Σf p(e,f | g)
Example
Reordering:

Σb p(a|b) Σd p(b|d,c) Σe p(d|e) p(e|g)

(the innermost sum Σf p(e,f | g) has been collapsed to p(e|g))
Example
Reordering:

Σb p(a|b) Σd p(b|d,c) p(d|g)

(Σe p(d|e) p(e|g) has been collapsed to p(d|g))
Example
Reordering:

Σb p(a|b) p(b|c,g)

(Σd p(b|d,c) p(d|g) has been collapsed to p(b|c,g))
Example
Reordering:

Σb p(a|b) p(b|c,g) = p(a | c, g)

Complexity is O(K), compared to O(K^4)
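The whole sequence can be checked numerically. This sketch uses random (made-up) conditional tables for the DAG in this example, computes p(a | c, g) once by brute force over the full joint and once by pushing the sums inward, and confirms the two agree:

```python
import numpy as np

# Factorization: p(A..G) = p(A|B) p(C|B) p(B|D) p(F|E) p(G|E) p(E|D) p(D).
rng = np.random.default_rng(1)
K = 4  # arity of every variable

def cpt(*shape):
    """Random conditional table p(child | parents), normalized over axis 0."""
    t = rng.random(shape)
    return t / t.sum(axis=0)

p_d = rng.random(K); p_d /= p_d.sum()
p_b_d, p_e_d = cpt(K, K), cpt(K, K)     # p(b|d), p(e|d)
p_a_b, p_c_b = cpt(K, K), cpt(K, K)     # p(a|b), p(c|b)
p_f_e, p_g_e = cpt(K, K), cpt(K, K)     # p(f|e), p(g|e)

# Brute force: build the full joint (fine for K=4) and condition on c, g.
joint = np.einsum('ab,cb,bd,fe,ge,ed,d->abcdefg',
                  p_a_b, p_c_b, p_b_d, p_f_e, p_g_e, p_e_d, p_d)
c_obs, g_obs = 0, 2
direct = joint[:, :, c_obs, :, :, :, g_obs].sum(axis=(1, 2, 3, 4))
direct = direct / direct.sum()           # p(a | c, g)

# Variable elimination: sum out f, then e, then d, then b.
m_e = p_g_e[g_obs] * p_f_e.sum(axis=0)   # sum_f p(f|e) = 1, times p(g|e)
m_d = m_e @ p_e_d                        # sum_e p(g|e) p(e|d)
m_b = (p_b_d * (p_d * m_d)).sum(axis=1)  # sum_d p(b|d) p(d) m_d[d]
unnorm = (p_a_b * (p_c_b[c_obs] * m_b)).sum(axis=1)  # sum_b p(a|b) p(c|b) m_b[b]
elim = unnorm / unnorm.sum()

assert np.allclose(elim, direct)
```

The elimination never materializes anything larger than a K×K table, while the direct computation touches all entries of the K^7 joint.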
Graphs with “loops”
[Figure: the same nodes, but with an extra edge so that two nodes are connected by multiple paths]
Message passing algorithm does not work when there are multiple paths between 2 nodes
Graphs with “loops”
General approach: “cluster” variables together to convert the graph to a tree
Reduce to a Tree
[Figure: B and E clustered into a single node {B, E}, turning the graph into a tree over D, A, {B,E}, C, F, G]
Probability Calculations on Graphs
• General algorithms exist - beyond trees
  – Complexity is typically O(m^(number of parents)) (where m = arity of each node)
  – If single parents (e.g., a tree) -> O(m)
  – The sparser the graph, the lower the complexity

• Technique can be “automated”
  – i.e., a fully general algorithm for arbitrary graphs
  – For continuous variables: replace sum with integral
  – For identification of most likely values: replace sum with max operator
Part 3: Learning with Graphical Models
Further Reading:
M. Jordan, Graphical models, Statistical Science: Special Issue on Bayesian Statistics, vol. 19, no. 1, pp. 140-155, Feb. 2004
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis (2nd ed.), Chapman and Hall, 2004
Probabilistic Model
Real World Data
P(Data | Parameters)
P(Parameters | Data)
Generative Model, Probability
Inference, Statistics
The Likelihood Function
• Likelihood = p(data | parameters)
             = p(D | θ)
             = L(θ)

• Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters

• Details
  – Constants that do not involve θ can be dropped in defining L(θ)
  – Often easier to work with log L(θ)
Comments on the Likelihood Function
• Constructing a likelihood function L(θ) is the first step in probabilistic modeling

• The likelihood function implicitly assumes an underlying probabilistic model M with parameters θ

• L(θ) connects the model to the observed data
• Graphical models provide a useful language for constructing likelihoods
Binomial Likelihood

• Binomial model
  – n memoryless trials, 2 outcomes
  – θ = probability of success at each trial

• Observed data
  – r successes in n trials
  – Defines a likelihood:
      L(θ) = p(D | θ)
           = p(successes) p(non-successes)
           = θ^r (1 − θ)^(n−r)
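A quick sketch of this likelihood in Python, using invented data (7 successes in 10 trials): the log-likelihood r log θ + (n − r) log(1 − θ) peaks at the maximum likelihood estimate r/n.

```python
import math

def binomial_log_likelihood(theta, r, n):
    """log L(theta) = r log(theta) + (n - r) log(1 - theta), constants dropped."""
    return r * math.log(theta) + (n - r) * math.log(1 - theta)

r, n = 7, 10                              # made-up data: 7 successes in 10 trials
grid = [i / 100 for i in range(1, 100)]   # candidate theta values in (0, 1)
theta_hat = max(grid, key=lambda t: binomial_log_likelihood(t, r, n))

assert abs(theta_hat - r / n) < 1e-12     # the grid maximum is at r/n = 0.7
```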
Binomial Likelihood Examples
Multinomial Likelihood

• Multinomial model
  – n memoryless trials, K outcomes
  – θ = probability vector for the K outcomes at each trial

• Observed data
  – nj outcomes of type j in n trials
  – Defines a likelihood:
      L(θ) = Πj θj^nj
  – Maximum likelihood estimates:
      θj = nj / n
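With made-up counts, the multinomial MLE is just the vector of observed proportions:

```python
import numpy as np

counts = np.array([12, 5, 3])             # hypothetical n_j for K = 3 outcomes
n = counts.sum()                          # n = 20 trials

theta_ml = counts / n                     # MLE: theta_j = n_j / n
assert np.allclose(theta_ml, [0.6, 0.25, 0.15])
assert np.isclose(theta_ml.sum(), 1.0)    # a valid probability vector
```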
Graphical Model for Multinomial
θ = [ p(w1), p(w2), ….. p(wK) ]

θ : parameters
w1, w2, …, wn : observed data (one node per observation, each with parent θ)
“Plate” Notation
[Plate diagram: node wi inside a plate labeled i = 1:n, with parent θ]

Data = D = {w1, … wn}
θ = model parameters
Plate (rectangle) indicates replicated nodes in a graphical model
Variables within a plate are conditionally independent given the parent
Learning in Graphical Models
Data = D = {w1,…wn}
θ = model parameters
Can view learning in a graphical model as computing the most likely value of the parameter node given the data nodes
Maximum Likelihood (ML) Principle (R. Fisher ~ 1922)
L(θ) = p(Data | θ) = Π p(wi | θ)
Maximum Likelihood: θML = arg max{ Likelihood(θ) }
Select the parameters that make the observed data most likely
Data = {w1,…wn}
θ = model parameters
The Bayesian Approach to Learning
Fully Bayesian: p(θ | Data) = p(Data | θ) p(θ) / p(Data)

Maximum A Posteriori: θMAP = arg max{ Likelihood(θ) × Prior(θ) }

Prior(θ) = p(θ)
Learning a Multinomial
• Likelihood: same as before

• Prior: p(θ) = Dirichlet(α1, …, αK)
  – proportional to Πj θj^(αj − 1)
  – has mean E[θj] = αj / Σk αk
  – αj acts as a prior weight (pseudo-count) for outcome j
  – Can set all αj = α for a “uniform” prior
Dirichlet Shapes
From: http://en.wikipedia.org/wiki/Dirichlet_distribution
Bayesian Learning
• p(θ | D, α) is proportional to p(data | θ) p(θ)
    = Πj θj^nj × Πj θj^(αj − 1)
    = Dirichlet(n1 + α1, …, nK + αK)

• Posterior mean estimate: E[θj | D] = (nj + αj) / (n + Σk αk)
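The conjugate update is a one-line computation. A sketch with made-up prior weights and counts:

```python
import numpy as np

# Dirichlet(alpha) prior + multinomial counts n_j -> Dirichlet(alpha_j + n_j).
alpha = np.array([1.0, 1.0, 1.0])         # "uniform" prior, all alpha_j = 1
counts = np.array([12, 5, 3])             # hypothetical observed n_j, n = 20

posterior_alpha = alpha + counts          # Dirichlet(n_j + alpha_j)
posterior_mean = posterior_alpha / posterior_alpha.sum()

assert np.allclose(posterior_mean, [13/23, 6/23, 4/23])

# Relative to the ML estimate n_j / n, the flat prior smooths the
# posterior mean slightly toward uniform.
theta_ml = counts / counts.sum()
```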
Summary of Bayesian Learning
• Can use graphical models to describe relationships between parameters and data
• P(data | parameters) = Likelihood function
• P(parameters) = prior
  – In applications such as text mining, the prior can be “uninformative”, i.e., flat
  – The prior can also be optimized for prediction (e.g., on validation data)
• We can compute P(parameters | data, prior) or a “point estimate” (e.g., posterior mode or mean)
• Computation of posterior estimates can be computationally intractable
  – Monte Carlo techniques often used