Describing Data
• The canonical descriptive strategy is to describe the data in terms of their underlying distribution
• As usual, we have a p-dimensional data matrix with variables X1, …, Xp
• The joint distribution is P(X1, …, Xp)
• The joint gives us complete information about the variables
• Given the joint distribution, we can answer any question about the relationships among any subset of variables
– are X2 and X5 independent?
– generating approximate answers to queries for large databases, or selectivity estimation
• Given a query (conditions that observations must satisfy), estimate the fraction of rows that satisfy this condition (the selectivity of the query)
• These estimates are needed during query optimization
• If we have a good approximation for the joint distribution of the data, we can use it to efficiently compute approximate selectivities
Graphical Models
• In the next 3-4 lectures, we will be studying graphical models
• e.g., Bayesian networks (Bayes nets, belief nets), Markov networks, etc.
• We will study:
– representation
– reasoning
– learning
• Materials based on an upcoming book by Nir Friedman and Daphne Koller. Slides courtesy of Nir Friedman.
Probability Distributions
• Let X1,…,Xp be random variables
• Let P be a joint distribution over X1,…,Xp
• If the variables are binary, then we need O(2^p) parameters to describe P
• Can we do better?
• Key idea: use properties of independence
Independent Random Variables
• Two variables X and Y are independent if
– P(X = x | Y = y) = P(X = x) for all values x, y
– That is, learning the value of Y does not change the prediction of X
• If X and Y are independent, then
– P(X,Y) = P(X|Y)P(Y) = P(X)P(Y)
• In general, if X1,…,Xp are independent, then
– P(X1,…,Xp) = P(X1)⋯P(Xp)
– Requires only O(p) parameters
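To make the saving concrete, here is a small sketch (the numbers and variable names are made up, not from the slides) of how a joint over independent binary variables is assembled from p marginal parameters instead of a 2^p table:

```python
# Sketch: under full independence, p numbers replace a 2**p - 1 parameter table.
from itertools import product

p = 4
marginals = [0.1, 0.5, 0.3, 0.8]   # P(X_i = 1), one parameter per variable

def joint(assignment):
    """P(x_1, ..., x_p) as the product of marginals."""
    prob = 1.0
    for x, q in zip(assignment, marginals):
        prob *= q if x == 1 else 1 - q
    return prob

# Sanity check: the implied joint sums to 1 over all 2**p assignments
total = sum(joint(a) for a in product([0, 1], repeat=p))
print(total)
```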
Conditional Independence
• Unfortunately, most random variables of interest are not independent of each other
• A more suitable notion is that of conditional independence
• Two variables X and Y are conditionally independent given Z if
– P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z
– That is, learning the value of Y does not change the prediction of X once we know the value of Z
– Notation: I( X ; Y | Z )
Example: Naïve Bayesian Model
• A common model in early diagnosis:
– Symptoms are conditionally independent given the disease (or fault)
• Thus, if
– X1,…,Xp denote the symptoms exhibited by the patient (headache, high fever, etc.), and
– H denotes the hypothesis about the patient's health
• then P(X1,…,Xp,H) = P(H)P(X1|H)…P(Xp|H)
• This naïve Bayesian model allows a compact representation
– It does, however, embody strong independence assumptions
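A minimal sketch of this factorization, with invented CPD numbers for two symptoms and a binary hypothesis H:

```python
# Naive Bayes joint: P(x_1, ..., x_p, H) = P(H) * prod_i P(x_i | H).
# All probabilities below are made up for illustration.
p_h = {0: 0.9, 1: 0.1}              # P(H): 0 = healthy, 1 = diseased
p_x_given_h = [                     # P(X_i = 1 | H), one entry per symptom
    {0: 0.05, 1: 0.80},             # headache
    {0: 0.01, 1: 0.90},             # high fever
]

def joint(symptoms, h):
    prob = p_h[h]
    for x, cpd in zip(symptoms, p_x_given_h):
        q = cpd[h]
        prob *= q if x == 1 else 1 - q
    return prob

# Posterior P(H = 1 | both symptoms present), via Bayes rule
num = joint((1, 1), 1)
den = joint((1, 1), 0) + joint((1, 1), 1)
print(num / den)
```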
Example: Family Trees (Pedigree)

• A node represents an individual's genotype
• Noisy stochastic process
• Modeling assumption: ancestors can affect descendants' genotype only by passing genetic material through intermediate generations

[Figure: pedigree — Homer and Marge are the parents of Bart, Lisa, and Maggie]
Markov Assumption

• We now make this independence assumption more precise for directed acyclic graphs (DAGs)
• Each random variable X is independent of its non-descendants, given its parents Pa(X)
• Formally, I( X ; NonDesc(X) | Pa(X) )

[Figure: a node X with parents Y1 and Y2; an ancestor, a descendant, and non-descendants are marked]
Markov Assumption Example
• In this example:
– I( E ; B )
– I( B ; {E, R} )
– I( R ; {A, B, C} | E )
– I( A ; R | B, E )
– I( C ; {B, E, R} | A )

[Figure: Burglary network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
I-Maps
• A DAG G is an I-Map of a distribution P if all the Markov assumptions implied by G are satisfied by P
(assuming G and P both use the same set of random variables)

Examples:

X   Y

x y P(x,y)
0 0 0.25
0 1 0.25
1 0 0.25
1 1 0.25

X   Y

x y P(x,y)
0 0 0.2
0 1 0.3
1 0 0.4
1 1 0.1
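The two example joints can be checked mechanically; this sketch (the helper name and tolerance are my own) tests whether a table factors as P(x)P(y):

```python
# Check whether a joint table over two binary variables satisfies P(x,y) = P(x)P(y).
def is_independent(joint, tol=1e-9):
    px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
    py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol for (x, y) in joint)

uniform = {(0, 0): .25, (0, 1): .25, (1, 0): .25, (1, 1): .25}
skewed  = {(0, 0): .20, (0, 1): .30, (1, 0): .40, (1, 1): .10}
print(is_independent(uniform))   # independence holds in the first table
print(is_independent(skewed))    # but not in the second
```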
Factorization
• Given that G is an I-Map of P, can we simplify the representation of P?
• Example: X   Y (no edge)

• Since I(X; Y), we have that P(X|Y) = P(X)
• Applying the chain rule:
P(X,Y) = P(X|Y) P(Y) = P(X) P(Y)
• Thus, we have a simpler representation of P(X,Y)
Factorization Theorem

Thm: if G is an I-Map of P, then
P(X1,…,Xp) = ∏_i P(Xi | Pa(Xi))

Proof:
• wlog. X1,…,Xp is an ordering consistent with G
• By the chain rule:
P(X1,…,Xp) = ∏_i P(Xi | X1,…,Xi−1)
• From the assumption on the ordering:
Pa(Xi) ⊆ {X1,…,Xi−1} and {X1,…,Xi−1} − Pa(Xi) ⊆ NonDesc(Xi)
• Since G is an I-Map, I( Xi ; NonDesc(Xi) | Pa(Xi) )
• Hence, I( Xi ; {X1,…,Xi−1} − Pa(Xi) | Pa(Xi) )
• We conclude, P(Xi | X1,…,Xi−1) = P(Xi | Pa(Xi))
Factorization Example
P(C,A,R,E,B) = P(B)P(E|B)P(R|E,B)P(A|R,B,E)P(C|A,R,B,E)

versus

P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)

[Figure: Burglary network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
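One way to see the saving: count free parameters in each factorization, assuming binary variables (the helper below is illustrative, not from the slides):

```python
# Each CPD P(X | parents) over binary variables needs 2**|parents| free parameters.
def n_params(parent_sets):
    return sum(2 ** len(pa) for pa in parent_sets)

# Chain-rule factorization for ordering B, E, R, A, C vs. the network factorization
chain_rule = [[], ['B'], ['E', 'B'], ['R', 'B', 'E'], ['A', 'R', 'B', 'E']]
factored   = [[], [], ['E'], ['B', 'E'], ['A']]

print(n_params(chain_rule))  # 1 + 2 + 4 + 8 + 16 = 31, same as the full joint
print(n_params(factored))    # 1 + 1 + 2 + 4 + 2  = 10
```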
Consequences
• We can write P in terms of "local" conditional probabilities
• If G is sparse,
– that is, |Pa(Xi)| < k,
then each conditional probability can be specified compactly
– e.g., for binary variables, these require O(2^k) params
• The representation of P is then compact
– linear in the number of variables
Pause… Summary

We defined the following concepts:
• The Markov independencies of a DAG G
– I( Xi ; NonDesc(Xi) | Pai )
• G is an I-Map of a distribution P
– if P satisfies the Markov independencies implied by G

We proved the factorization theorem:
• if G is an I-Map of P, then
P(X1,…,Xn) = ∏_i P(Xi | Pai)
Conditional Independencies

• Let Markov(G) be the set of Markov independencies implied by G
• The factorization theorem shows:
G is an I-Map of P ⇒ P(X1,…,Xn) = ∏_i P(Xi | Pai)
• We can also show the opposite:
Thm: P(X1,…,Xn) = ∏_i P(Xi | Pai) ⇒ G is an I-Map of P
Proof (Outline)

Example: X → Y, X → Z, so the factorization is P(X)P(Y|X)P(Z|X)

P(Z | X, Y) = P(X,Y,Z) / P(X,Y)
            = P(X)P(Y|X)P(Z|X) / (P(X)P(Y|X))
            = P(Z|X)
Implied Independencies
• Does a graph G imply additional independencies as a consequence of Markov(G)?
• We can define a logic of independence statements
• Some axioms:
– I( X ; Y | Z ) ⇒ I( Y ; X | Z )  (symmetry)
– I( X ; Y1, Y2 | Z ) ⇒ I( X ; Y1 | Z )  (decomposition)
d-separation

• A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets X, Y, and Z, returns either yes or no
• Goal: d-sep(X; Y | Z, G) = yes iff I(X; Y | Z) follows from Markov(G)
Paths
• Intuition: dependency must “flow” along paths in the graph
• A path is a sequence of neighboring variables

Examples:
• R ← E → A ← B
• C ← A ← E → R

[Figure: Burglary network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
Paths
• We want to know when a path is
– active -- creates dependency between end nodes
– blocked -- cannot create dependency between end nodes
• We want to classify situations in which paths are active
Path Blockage

Three cases:
– Common cause (e.g., R ← E → A):
blocked when E is observed; active when E is not observed

[Figure: blocked vs. unblocked configurations for R ← E → A]
Path Blockage

Three cases:
– Common cause
– Intermediate cause (e.g., E → A → C):
blocked when A is observed; active when A is not observed

[Figure: blocked vs. unblocked configurations for E → A → C]
Path Blockage

Three cases:
– Common cause
– Intermediate cause
– Common effect (e.g., E → A ← B, with C a descendant of A):
blocked when neither A nor its descendant C is observed; active when A or C is observed

[Figure: blocked vs. unblocked configurations for E → A ← B with descendant C]
Path Blockage -- General Case

A path is active, given evidence Z, if
• Whenever we have the configuration A → B ← C (a v-structure), B or one of its descendants is in Z
• No other nodes in the path are in Z

A path is blocked, given evidence Z, if it is not active.
Example

– d-sep(R, B)?

[Figure: Burglary network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
Example

– d-sep(R, B) = yes
– d-sep(R, B | A)?

[Figure: Burglary network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
Example

– d-sep(R, B) = yes
– d-sep(R, B | A) = no
– d-sep(R, B | E, A)?

[Figure: Burglary network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
d-Separation
• X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z.
• Checking d-separation can be done efficiently (linear time in the number of edges)
– Bottom-up phase: mark all nodes whose descendants are in Z
– X-to-Y phase: traverse (BFS) all edges on paths from X to Y and check whether they are blocked
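The three blocking cases can be turned into a small checker. The sketch below is a brute-force path enumeration, not the linear-time procedure described above; it assumes x and y are single nodes disjoint from Z, and is exponential in general but fine for toy networks.

```python
# d-separation by enumerating every undirected path and testing blockage.
def d_separated(parents, x, y, z):
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    def descendants(v):
        out, stack = set(), [v]
        while stack:
            for c in children[stack.pop()]:
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def active(path):
        for i in range(1, len(path) - 1):
            prev, v, nxt = path[i - 1], path[i], path[i + 1]
            if prev in parents[v] and nxt in parents[v]:   # v-structure at v
                if v not in z and not (descendants(v) & z):
                    return False        # collider with no evidence: blocked
            elif v in z:
                return False            # chain or fork observed: blocked
        return True

    def paths(cur, path):
        if cur == y:
            yield path
            return
        for n in parents[cur] | children[cur]:
            if n not in path:
                yield from paths(n, path + [n])

    return not any(active(p) for p in paths(x, [x]))

# Burglary network: E -> R, E -> A <- B, A -> C
pa = {'B': set(), 'E': set(), 'R': {'E'}, 'A': {'B', 'E'}, 'C': {'A'}}
print(d_separated(pa, 'R', 'B', set()))       # yes
print(d_separated(pa, 'R', 'B', {'A'}))       # no
print(d_separated(pa, 'R', 'B', {'E', 'A'}))  # yes
```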
Soundness
Thm:
• If
– G is an I-Map of P
– d-sep( X; Y | Z, G ) = yes
• then
– P satisfies I( X; Y | Z )

Informally,
• Any independence reported by d-separation is satisfied by the underlying distribution
Completeness
Thm:
• If d-sep( X; Y | Z, G ) = no
• then there is a distribution P such that
– G is an I-Map of P
– P does not satisfy I( X; Y | Z )

Informally,
• Any independence not reported by d-separation might be violated by the underlying distribution
• We cannot determine this by examining the graph structure alone
I-Maps revisited
• The fact that G is an I-Map of P might not be that useful
• For example, complete DAGs
– A DAG G is complete if we cannot add an arc without creating a cycle
• These DAGs do not imply any independencies
• Thus, they are I-Maps of any distribution

[Figure: two complete DAGs over X1, X2, X3, X4]
Minimal I-Maps
A DAG G is a minimal I-Map of P if
• G is an I-Map of P
• If G' ⊂ G, then G' is not an I-Map of P

That is, removing any arc from G introduces (conditional) independencies that do not hold in P
Minimal I-Map Example
• If G (a DAG over X1,…,X4) is a minimal I-Map,
• then the graphs obtained from it by removing arcs are not I-Maps

[Figure: a minimal I-Map over X1,…,X4 and four variants, with one arc removed each, that are not I-Maps]
Constructing minimal I-Maps
The factorization theorem suggests an algorithm
• Fix an ordering X1,…,Xn
• For each i,
– select Pai to be a minimal subset of {X1,…,Xi−1}
such that I( Xi ; {X1,…,Xi−1} − Pai | Pai )
• Clearly, the resulting graph is a minimal I-Map
Non-uniqueness of minimal I-Map
• Unfortunately, there may be several minimal I-Maps for the same distribution
– Applying the I-Map construction procedure with different orders can lead to different structures

[Figure: the original I-Map (the Burglary network over E, B, R, A, C) vs. the minimal I-Map obtained with order C, R, A, E, B]
Choosing Ordering & Causality
• The choice of order can have a drastic impact on the complexity of the minimal I-Map
• Heuristic argument: construct the I-Map using a causal ordering among the variables
• Justification?
– It is often reasonable to assume that graphs of causal influence should satisfy the Markov properties
P-Maps
• A DAG G is a P-Map (perfect map) of a distribution P if
– I(X; Y | Z) if and only if d-sep(X; Y | Z, G) = yes

Notes:
• A P-Map captures all the independencies in the distribution
• P-Maps are unique, up to DAG equivalence
P-Maps
• Unfortunately, some distributions do not have a P-Map
• Example:

P(A, B, C) = 1/12 if A ⊕ B ⊕ C = 0
P(A, B, C) = 1/6  if A ⊕ B ⊕ C = 1

• A minimal I-Map: A → C ← B

• This is not a P-Map since I(A; C) holds but d-sep(A; C) = no
Bayesian Networks
• A Bayesian network specifies a probability distribution via two components:
– A DAG G– A collection of conditional probability distributions
P(Xi|Pai)
• The joint distribution P is defined by the factorization
P(X1,…,Xn) = ∏_i P(Xi | Pai)
• Additional requirement: G is a minimal I-Map of P
Summary
• We explored DAGs as a representation of conditional independencies:
– Markov independencies of a DAG
– Tight correspondence between Markov(G) and the factorization defined by G
– d-separation, a sound & complete procedure for computing the consequences of the independencies
– Notion of minimal I-Map
– P-Maps
• This theory is the basis for defining Bayesian networks
Markov Networks
• We now briefly consider an alternative representation of conditional independencies
• Let U be an undirected graph
• Let Ni be the set of neighbors of Xi
• Define Markov(U) to be the set of independencies
I( Xi ; {X1,…,Xn} − Ni − {Xi} | Ni )
• U is an I-Map of P if P satisfies Markov(U)
Example
This graph implies that
• I(A; C | B, D )
• I(B; D | A, C )

• Note: this example does not have a directed P-Map

[Figure: a four-cycle over A, B, C, D (edges A–B, B–C, C–D, D–A)]
Markov Network Factorization
Thm: if
• P is strictly positive, that is, P(x1, …, xn) > 0 for all assignments
then
• U is an I-Map of P
if and only if
• there is a factorization

P(X1,…,Xn) = ∏_i f_i(Ci)

where C1, …, Ck are the maximal cliques in U

Alternative form:

P(X1,…,Xn) = (1/Z) e^{∑_i g_i(Ci)}
Relationship between Directed & Undirected Models
[Figure: chain graphs subsume both directed graphs and undirected graphs]
CPDs
• So far, we focused on how to represent independencies using DAGs
• The "other" component of a Bayesian network is the specification of the conditional probability distributions (CPDs)
• We start with the simplest representation of CPDs and then discuss additional structure
Tabular CPDs
• When the variables of interest are all discrete, the common representation is a table:
• For example, P(C|A,B) can be represented by

A B P(C = 0 | A, B) P(C = 1 | A, B)
0 0 0.25 0.75
0 1 0.50 0.50
1 0 0.12 0.88
1 1 0.33 0.67
Tabular CPDs
Pros:
• Very flexible; can capture any CPD of discrete variables
• Can be easily stored and manipulated

Cons:
• Representation size grows exponentially with the number of parents!
• Unwieldy to assess probabilities for more than a few parents
Structured CPD
• To avoid the exponential blowup in representation, we need to focus on specialized types of CPDs
• This comes at a cost in terms of expressive power
• We now consider several types of structured CPDs
Causal Independence
• Consider the following situation:
• In a tabular CPD, we need to assess the probability of fever in eight cases
• These involve all possible interactions between diseases
• For three diseases, this might be feasible… For ten diseases, not likely…

[Figure: Disease 1, Disease 2, Disease 3 → Fever]
Causal Independence
• Simplifying assumption:
– Each disease attempts to cause fever, independently of the other diseases
– The patient has fever if one of the diseases "succeeds"
• We can model this using a Bayesian network fragment

[Figure: each Disease i points to a hypothetical variable Fever i ("fever caused by Disease i"); Spontaneous Fever and Fever 1–3 feed an OR gate: F = or(SF, F1, F2, F3)]
Noisy-Or CPD
• Models P(X|Y1,…,Yk), where X, Y1,…, Yk are all binary
• Parameters:
– pi -- probability of X = 1 due to Yi = 1
– p0 -- probability of X = 1 due to other causes
• Plugging these into the model, we get

P(X = 0 | Y1,…,Yk) = (1 − p0) ∏_i (1 − pi)^{Yi}
P(X = 1 | Y1,…,Yk) = 1 − P(X = 0 | Y1,…,Yk)
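A direct implementation of the noisy-or CPD, with invented parameter values:

```python
# Noisy-or: P(X=0 | ys) = (1 - p0) * prod_i (1 - p_i)**y_i
def noisy_or(ys, ps, p0):
    """P(X = 1 | y_1, ..., y_k) with per-cause probabilities ps and leak p0."""
    p_x0 = 1 - p0
    for y, p in zip(ys, ps):
        p_x0 *= (1 - p) ** y
    return 1 - p_x0

ps = [0.9, 0.8, 0.7]   # probability each disease alone causes fever (made up)
p0 = 0.01              # spontaneous fever
print(noisy_or([0, 0, 0], ps, p0))  # only the leak term remains
print(noisy_or([1, 1, 0], ps, p0))  # 1 - 0.99 * 0.1 * 0.2
```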
Noisy-Or CPD
• Benefits of noisy-or:
– "Reasonable" assumptions in many domains
• e.g., the medical domain
– Few parameters
– Each parameter can be estimated independently of the others
• The same idea can be extended to other functions: noisy-max, noisy-and, etc.
• Frequently used in large medical expert systems
Context Specific Independence
• Consider the following examples:
• Alarm sound depends on
– Whether the alarm was set before leaving the house
– Burglary
– Earthquake
• Arriving on time depends on
– Travel route
– The congestion on the two possible routes

[Figures: Set, Burglary, Earthquake → Alarm; Travel Route, Route 1 traffic, Route 2 traffic → Arrive on time]
Context-Specific Independence
• In both of these examples we have context-specific independencies (CSI)
– Independencies that depend on a particular value of one or more variables
• In our examples:
– Ind( A ; B, E | S = 0 )
The alarm sound is independent of B and E when the alarm is not set
– Ind( A ; R2 | T = 1 )
Arrival time is independent of traffic on route 2 if we choose to travel on route 1
Representing CSI
• When we have such CSI, P(X | Y1,…,Yk) is the same for several values of Y1,…,Yk
• There are many ways of representing these regularities
• A natural representation: decision trees
– Internal nodes: tests on parents
– Leaves: probability distributions on X
• Evaluate P(X | Y1,…,Yk) by traversing the tree

[Figure: a decision-tree CPD testing S, then B, then E, with leaf probabilities .0, .8, .1, .7]
Detecting CSI
• Given evidence on some nodes, we can identify the "relevant" parts of the tree
– These are the paths in the tree that are consistent with the context
• Example:
– Context S = 0
– Only one path of the tree is relevant
• A parent is independent given the context if it does not appear on one of the relevant paths

[Figure: the same decision-tree CPD over S, B, E, with the S = 0 path highlighted]
Decision Tree CPDs
Benefits:
• Decision trees offer a flexible and intuitive language to represent CSI
• Incorporated into several commercial tools for constructing Bayesian networks

Comparison to noisy-or:
• Noisy-or CPDs require full trees to represent
• General decision-tree CPDs cannot be represented by noisy-or
Inference in Bayesian Networks
Inference
• We now have compact representations of probability distributions:
– Bayesian networks
– Markov networks
• Network describes a unique probability distribution P
• How do we answer queries about P?
• We use inference as a name for the process of computing answers to such queries
Queries: Likelihood
• There are many types of queries we might ask.
• Most of these involve evidence
– Evidence e is an assignment of values to a set E of variables in the domain
– Without loss of generality, E = { Xk+1, …, Xn }
• Simplest query: compute the probability of the evidence

P(e) = ∑_{x1} ⋯ ∑_{xk} P(x1,…,xk, e)

• This is often referred to as computing the likelihood of the evidence
Queries: A posteriori belief
• Often we are interested in the conditional probability of a variable given the evidence

P(X | e) = P(X, e) / P(e)

• This is the a posteriori belief in X, given evidence e
• A related task is computing the term P(X, e)
– i.e., the likelihood of e and X = x for each value x of X
– we can recover the a posteriori belief by

P(X = x | e) = P(X = x, e) / ∑_{x'} P(X = x', e)
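Both query types can be computed by brute-force enumeration on a toy two-node network X → Y (CPD numbers are invented); enumeration here stands in for a real inference engine:

```python
# Likelihood P(e) and a posteriori belief P(X | e) on X -> Y by enumeration.
p_x = {0: 0.7, 1: 0.3}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

def joint(x, y):
    return p_x[x] * p_y_given_x[x][y]

# Likelihood of evidence Y = 1: sum out the non-evidence variable X
likelihood = sum(joint(x, 1) for x in (0, 1))

# A posteriori belief P(X | Y = 1)
posterior = {x: joint(x, 1) / likelihood for x in (0, 1)}
print(likelihood)    # 0.7*0.1 + 0.3*0.8 = 0.31
print(posterior[1])  # 0.24 / 0.31
```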
A posteriori belief
This query is useful in many cases:
• Prediction: what is the probability of an outcome given the starting condition?
– The target is a descendant of the evidence
• Diagnosis: what is the probability of a disease/fault given symptoms?
– The target is an ancestor of the evidence
• As we shall see, the direction of the edges between variables does not restrict the direction of the queries
– Probabilistic inference can combine evidence from all parts of the network
Queries: A posteriori joint
• In this query, we are interested in the conditional probability of several variables, given the evidence
P(X, Y, … | e )
• Note that the size of the answer to this query is exponential in the number of variables in the joint
Queries: MAP
• In this query we want to find the maximum a posteriori assignment for some variable of interest (say X1,…,Xl )
• That is, x1,…,xl maximize the probability
P(x1,…,xl | e)
• Note that this is equivalent to maximizing
P(x1,…,xl, e)
Queries: MAP
We can use MAP for:• Classification
– find most likely label, given the evidence
• Explanation – What is the most likely scenario, given the evidence
Queries: MAP
Cautionary note:
• The MAP depends on the set of variables
• Example:
– MAP of X is 1 (since P(X=1) = 0.6)
– MAP of (X, Y) is (0, 0) (with probability 0.35)

x y P(x,y)
0 0 0.35
0 1 0.05
1 0 0.30
1 1 0.30
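The cautionary example, computed directly from the table above:

```python
# MAP of X alone vs. the X-component of the MAP of (X, Y): they disagree.
joint = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.30, (1, 1): 0.30}

p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
map_x = max(p_x, key=p_x.get)        # marginal MAP of X
map_xy = max(joint, key=joint.get)   # joint MAP of (X, Y)
print(map_x, map_xy)
```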
Complexity of Inference
Thm: Computing P(X = x) in a Bayesian network is NP-hard

This is not surprising, since we can simulate Boolean gates.
Hardness
• Hardness does not mean we cannot solve inference
– It implies that we cannot find a general procedure that works efficiently for all networks
– For particular families of networks, we can have provably efficient procedures
Approaches to inference
• Exact inference
– Inference in simple chains
– Variable elimination
– Clustering / join tree algorithms
• Approximate inference
– Stochastic simulation / sampling methods
– Markov chain Monte Carlo methods
– Mean field theory
Inference in Simple Chains
How do we compute P(X2)?

X1 → X2

P(x2) = ∑_{x1} P(x1, x2) = ∑_{x1} P(x1) P(x2 | x1)
Inference in Simple Chains (cont.)
How do we compute P(X3)?

X1 → X2 → X3

P(x3) = ∑_{x2} P(x2, x3) = ∑_{x2} P(x2) P(x3 | x2)

• and we already know how to compute P(X2):

P(x2) = ∑_{x1} P(x1, x2) = ∑_{x1} P(x1) P(x2 | x1)
Inference in Simple Chains (cont.)
How do we compute P(Xn)?

X1 → X2 → X3 → … → Xn

• Compute P(X1), P(X2), P(X3), …
• We compute each term by using the previous one:

P(xi+1) = ∑_{xi} P(xi) P(xi+1 | xi)

Complexity:
• Each step costs O(|Val(Xi)| × |Val(Xi+1)|) operations
• Compare to naïve evaluation, which requires summing over the joint values of n−1 variables
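The forward iteration above, sketched for a chain of binary variables with arbitrary CPD numbers:

```python
# Forward pass along a chain: P(x_{i+1}) = sum_{x_i} P(x_i) P(x_{i+1} | x_i).
p1 = {0: 0.6, 1: 0.4}                              # P(X1)
cpds = [
    {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},   # P(X2 | X1)
    {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}},   # P(X3 | X2)
]

marginal = p1
for cpd in cpds:
    marginal = {
        x_next: sum(marginal[x] * cpd[x][x_next] for x in marginal)
        for x_next in (0, 1)
    }
print(marginal)  # P(X3)
```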
Inference in Simple Chains (cont.)
• Suppose that we observe the value X2 = x2
• How do we compute P(X1 | x2)?
– Recall that it suffices to compute P(X1, x2)

X1 → X2

P(x1, x2) = P(x2 | x1) P(x1)
Inference in Simple Chains (cont.)
• Suppose that we observe the value X3 = x3

X1 → X2 → X3

• How do we compute P(X1, x3)?

P(x1, x3) = P(x1) P(x3 | x1)

• How do we compute P(x3 | x1)?

P(x3 | x1) = ∑_{x2} P(x2, x3 | x1)
           = ∑_{x2} P(x2 | x1) P(x3 | x1, x2)
           = ∑_{x2} P(x2 | x1) P(x3 | x2)
Inference in Simple Chains (cont.)
• Suppose that we observe the value Xn = xn

X1 → X2 → X3 → … → Xn

• How do we compute P(X1, xn)?

P(x1, xn) = P(x1) P(xn | x1)

• We compute P(xn | xn−1), P(xn | xn−2), … iteratively:

P(xn | xi) = ∑_{xi+1} P(xi+1, xn | xi)
           = ∑_{xi+1} P(xi+1 | xi) P(xn | xi+1)
Inference in Simple Chains (cont.)
• Suppose that we observe the value Xn = xn
• We want to find P(Xk | xn)

X1 → X2 → … → Xk → … → Xn

• How do we compute P(Xk, xn)?

P(xk, xn) = P(xk) P(xn | xk)

• We compute P(Xk) by forward iterations
• We compute P(xn | Xk) by backward iterations
Elimination in Chains
• We now try to understand the simple chain example using first-order principles
• Using the definition of probability, we have

P(e) = ∑_d ∑_c ∑_b ∑_a P(a, b, c, d, e)

A → B → C → D → E
Elimination in Chains
• By chain decomposition, we get

P(e) = ∑_d ∑_c ∑_b ∑_a P(a, b, c, d, e)
     = ∑_d ∑_c ∑_b ∑_a P(a) P(b|a) P(c|b) P(d|c) P(e|d)

A → B → C → D → E
Elimination in Chains
• Rearranging terms ...

P(e) = ∑_d ∑_c ∑_b ∑_a P(a) P(b|a) P(c|b) P(d|c) P(e|d)
     = ∑_d ∑_c ∑_b P(c|b) P(d|c) P(e|d) ∑_a P(a) P(b|a)

A → B → C → D → E
Elimination in Chains
• Now we can perform the innermost summation

P(e) = ∑_d ∑_c ∑_b P(c|b) P(d|c) P(e|d) ∑_a P(a) P(b|a)
     = ∑_d ∑_c ∑_b P(c|b) P(d|c) P(e|d) p(b)

• This summation is exactly the first step in the forward iteration we described before

A → B → C → D → E (A eliminated)
Elimination in Chains
• Rearranging and then summing again, we get

P(e) = ∑_d ∑_c ∑_b P(c|b) P(d|c) P(e|d) p(b)
     = ∑_d ∑_c P(d|c) P(e|d) ∑_b P(c|b) p(b)
     = ∑_d ∑_c P(d|c) P(e|d) p(c)

A → B → C → D → E (A, B eliminated)
Elimination in Chains with Evidence
• Similarly, we can understand the backward pass
• We write the query in explicit form

P(a, e) = ∑b ∑c ∑d P(a, b, c, d, e)
        = ∑b ∑c ∑d P(a) P(b|a) P(c|b) P(d|c) P(e|d)

A → B → C → D → E
![Page 84: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/84.jpg)
Elimination in Chains with Evidence
• Eliminating d, we get

P(a, e) = ∑b ∑c ∑d P(a) P(b|a) P(c|b) P(d|c) P(e|d)
        = ∑b ∑c P(a) P(b|a) P(c|b) ∑d P(d|c) P(e|d)
        = ∑b ∑c P(a) P(b|a) P(c|b) p(e|c)

A → B → C → D → E   (D eliminated)
![Page 85: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/85.jpg)
Elimination in Chains with Evidence
• Eliminating c, we get

P(a, e) = ∑b ∑c P(a) P(b|a) P(c|b) p(e|c)
        = ∑b P(a) P(b|a) ∑c P(c|b) p(e|c)
        = ∑b P(a) P(b|a) p(e|b)

A → B → C → D → E   (C, D eliminated)
![Page 86: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/86.jpg)
Elimination in Chains with Evidence
• Finally, we eliminate b

P(a, e) = ∑b P(a) P(b|a) p(e|b)
        = P(a) ∑b P(b|a) p(e|b)
        = P(a) P(e|a)

A → B → C → D → E   (B, C, D eliminated)
![Page 87: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/87.jpg)
Variable Elimination
General idea:
• Write the query in the form

P(X1, e) = ∑xk … ∑x3 ∑x2 ∏i P(xi | pai)

• Iteratively
– Move all irrelevant terms outside of the innermost sum
– Perform the innermost sum, getting a new term
– Insert the new term into the product
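The iterative procedure above can be sketched as a small sum-product routine. Everything here — the factor representation, the restriction to binary variables, and the toy chain at the end — is a hypothetical illustration, not the book's implementation:

```python
from itertools import product

# Minimal variable-elimination sketch. A factor is (vars, table), where
# table maps a tuple of binary values (one per var) to a number.

def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vs = list(dict.fromkeys(fv + gv))           # union of scopes, order kept
    table = {}
    for assign in product([0, 1], repeat=len(vs)):
        a = dict(zip(vs, assign))
        table[assign] = (ft[tuple(a[v] for v in fv)] *
                         gt[tuple(a[v] for v in gv)])
    return (vs, table)

def sum_out(f, var):
    fv, ft = f
    i = fv.index(var)
    vs = fv[:i] + fv[i+1:]
    table = {}
    for assign, val in ft.items():
        key = assign[:i] + assign[i+1:]
        table[key] = table.get(key, 0.0) + val
    return (vs, table)

def eliminate(factors, order):
    # For each variable: multiply the factors that mention it, sum it out,
    # and insert the new term back into the pool of factors.
    factors = list(factors)
    for var in order:
        touched = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = touched[0]
        for f in touched[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Toy chain A -> B with hypothetical CPTs; eliminating A yields P(B).
pa = (['A'], {(0,): 0.6, (1,): 0.4})
pb_a = (['A', 'B'], {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
vars_, table = eliminate([pa, pb_a], ['A'])
print(table)   # table for P(B)
```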
![Page 88: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/88.jpg)
A More Complex Example
• The “Asia” network:

Visit to Asia → Tuberculosis
Smoking → Lung Cancer, Bronchitis
Tuberculosis, Lung Cancer → Abnormality in Chest
Abnormality in Chest → X-Ray
Abnormality in Chest, Bronchitis → Dyspnea
![Page 89: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/89.jpg)
(Asia network: V → T; S → L, B; T, L → A; A → X; A, B → D)

• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b

Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
![Page 90: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/90.jpg)
(Asia network)

• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b

Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: v
Compute: fv(t) = ∑v P(v) P(t|v)
⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)

Note: fv(t) = P(t)
In general, the result of elimination is not necessarily a probability term
![Page 91: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/91.jpg)
(Asia network)

• We want to compute P(d)
• Need to eliminate: s, x, t, l, a, b

• Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)

Eliminate: s
Compute: fs(b,l) = ∑s P(s) P(b|s) P(l|s)

Summing over s results in a factor with two arguments, fs(b,l)
In general, the result of elimination may be a function of several variables
![Page 92: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/92.jpg)
(Asia network)

• We want to compute P(d)
• Need to eliminate: x, t, l, a, b

• Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)

Eliminate: x
Compute: fx(a) = ∑x P(x|a)

Note: fx(a) = 1 for all values of a !!
![Page 93: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/93.jpg)
(Asia network)

• We want to compute P(d)
• Need to eliminate: t, l, a, b

• Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)

Eliminate: t
Compute: ft(a,l) = ∑t fv(t) P(a|t,l)
![Page 94: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/94.jpg)
(Asia network)

• We want to compute P(d)
• Need to eliminate: l, a, b

• Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)
⇒ fx(a) fl(a,b) P(d|a,b)

Eliminate: l
Compute: fl(a,b) = ∑l fs(b,l) ft(a,l)
![Page 95: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/95.jpg)
(Asia network)

• We want to compute P(d)
• Need to eliminate: a, b

• Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)
⇒ fx(a) fl(a,b) P(d|a,b)
⇒ fa(b,d) ⇒ fb(d)

Eliminate: a, b
Compute: fa(b,d) = ∑a fx(a) fl(a,b) P(d|a,b)   and   fb(d) = ∑b fa(b,d)
![Page 96: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/96.jpg)
Variable Elimination
• We now understand variable elimination as a sequence of rewriting operations
• The actual computation is done in the elimination steps
• Exactly the same computation procedure applies to Markov networks
• The computation depends on the order of elimination
![Page 97: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/97.jpg)
Complexity of variable elimination

• Suppose that in one elimination step we compute

fx(y1, …, yk) = ∑x f'x(x, y1, …, yk)

f'x(x, y1, …, yk) = ∏i=1..m fi(x, yi,1, …, yi,li)

This requires:
• m ⋅ |Val(X)| ⋅ ∏i |Val(Yi)| multiplications
– For each value of x, y1, …, yk, we do m multiplications
• |Val(X)| ⋅ ∏i |Val(Yi)| additions
– For each value of y1, …, yk, we do |Val(X)| additions

Complexity is exponential in the number of variables in the intermediate factor!
![Page 98: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/98.jpg)
Understanding Variable Elimination
• We want to select “good” elimination orderings that reduce complexity
• We start by attempting to understand variable elimination via the graph we are working with
• This will reduce the problem of finding a good ordering to a well-understood graph-theoretic operation
![Page 99: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/99.jpg)
Undirected graph representation
• At each stage of the procedure, we have an algebraic term that we need to evaluate
• In general this term is of the form

P(x1, …, xn) = ∑y1 … ∑yk ∏i fi(Zi)

where the Zi are sets of variables
• We now draw an undirected graph with an edge X–Y whenever X and Y are arguments of some factor
– that is, whenever X and Y are in some Zi
• Note: this is the Markov network that describes the probability distribution over the variables we have not yet eliminated
![Page 100: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/100.jpg)
Chordal Graphs
• An elimination ordering induces an undirected chordal graph

Graph:
• Maximal cliques are factors in the elimination
• Factors in the elimination are cliques in the graph
• Complexity is exponential in the size of the largest clique in the graph

(Figure: the Asia network over V, S, T, L, A, B, X, D and the chordal graph induced by eliminating its variables)
![Page 101: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/101.jpg)
Induced Width
• The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination
• This quantity is called the induced width of a graph according to the specified ordering
• Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
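One way to make the induced width concrete is to simulate elimination on the undirected graph. The sketch below is an illustration under stated assumptions (adjacency sets, width counted as the size of the eliminated vertex's remaining neighborhood, i.e. largest induced clique minus one); it is not code from the lecture:

```python
# Simulate elimination: when a vertex is eliminated, its remaining
# neighbors are connected pairwise (fill-in edges). The induced width is
# the largest neighborhood encountered along the way.

def induced_width(edges, order):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    width = 0
    for x in order:
        nbrs = adj.pop(x, set())
        width = max(width, len(nbrs))
        for u in nbrs:
            adj[u].discard(x)
            adj[u].update(nbrs - {u})   # add fill-in edges among neighbors
    return width

# A chain A-B-C-D-E: eliminating from one end gives width 1.
chain = [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E')]
print(induced_width(chain, ['A', 'B', 'C', 'D', 'E']))   # 1
# Eliminating C first connects B and D, creating a larger clique.
print(induced_width(chain, ['C', 'A', 'B', 'D', 'E']))   # 2
```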
![Page 102: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/102.jpg)
General Networks
• From graph theory:

Thm: Finding an ordering that minimizes the induced width is NP-hard.

However:
• There are reasonable heuristics for finding “relatively” good orderings
• There are provable approximations to the best induced width
• If the graph has a small induced width, there are algorithms that find it in polynomial time
![Page 103: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/103.jpg)
Elimination on Trees
• Formally, for any tree, there is an elimination ordering with induced width = 1

Thm: Inference on trees is linear in the number of variables
![Page 104: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/104.jpg)
PolyTrees
• A polytree is a network where there is at most one path from one variable to another

Thm: Inference in a polytree is linear in the representation size of the network
– This assumes a tabular CPT representation

(Figure: an example polytree over A, B, C, D, E, F, G, H)
![Page 105: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/105.jpg)
Approaches to inference
•Exact inference – Inference in Simple Chains– Variable elimination– Clustering / join tree algorithms
•Approximate inference– Stochastic simulation / sampling methods– Markov chain Monte Carlo methods– Mean field theory
![Page 106: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/106.jpg)
Stochastic simulation
•Suppose you are given values for some subset of the variables, G, and want to infer values for the unknown variables, U
•Randomly generate a very large number of instantiations from the BN
– Generate instantiations for all variables: start at the root variables and work your way “forward”
•Only keep those instantiations that are consistent with the values for G
•Use the frequency of values for U to get estimated probabilities
•The accuracy of the results depends on the size of the sample (it asymptotically approaches the exact results)
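The procedure above — forward sampling followed by filtering on the evidence, often called rejection sampling — can be sketched as follows. The two-node network and its CPT values are hypothetical, chosen only for illustration:

```python
import random

# Rejection sampling on the hypothetical network Rain -> WetGrass.
# We estimate P(Rain = 1 | WetGrass = 1) by forward sampling and keeping
# only the instantiations consistent with the evidence.

P_RAIN = 0.2
P_WET_GIVEN_RAIN = {1: 0.9, 0: 0.1}   # P(WetGrass = 1 | Rain)

def sample_once(rng):
    rain = int(rng.random() < P_RAIN)                  # root variable first
    wet = int(rng.random() < P_WET_GIVEN_RAIN[rain])   # then work "forward"
    return rain, wet

def estimate(n, rng):
    kept = rain_count = 0
    for _ in range(n):
        rain, wet = sample_once(rng)
        if wet == 1:              # keep only samples consistent with evidence
            kept += 1
            rain_count += rain
    return rain_count / kept

est = estimate(100000, random.Random(0))
print(est)   # exact answer: .2*.9 / (.2*.9 + .8*.1) ≈ 0.692
```

Note that when the evidence is unlikely, most samples are thrown away — one motivation for the MCMC methods on the next slide.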
![Page 107: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/107.jpg)
Markov chain Monte Carlo methods
• So called because
– Markov chain: each instance generated in the sample is dependent on the previous instance
– Monte Carlo: statistical sampling method
• Perform a random walk through the variable-assignment space, collecting statistics as you go
– Start with a random instantiation, consistent with the evidence variables
– At each step, for some nonevidence variable, randomly sample its value, consistent with the other current assignments
• Given enough samples, MCMC gives an accurate estimate of the true distribution of values
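The random walk above can be sketched as a minimal Gibbs sampler. The v-structure network and its CPTs below are hypothetical stand-ins, not the lecture's example:

```python
import random

# Gibbs sampling on the hypothetical network Rain -> Wet <- Sprinkler with
# evidence Wet = 1. At each step we resample one nonevidence variable from
# its conditional given the current values of the others.

P_R, P_S = 0.2, 0.3
P_W = {(0, 0): 0.05, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}  # P(Wet=1 | R,S)

def resample(var, r, s, rng):
    # Unnormalized weights of the two values of `var`, holding the other
    # variable and the evidence Wet = 1 fixed.
    weights = []
    for v in (0, 1):
        rr, ss = (v, s) if var == 'R' else (r, v)
        prior = P_R if var == 'R' else P_S
        p_var = prior if v == 1 else 1 - prior
        weights.append(p_var * P_W[(rr, ss)])
    return int(rng.random() < weights[1] / (weights[0] + weights[1]))

rng = random.Random(1)
r, s = 0, 0
rain_count = 0
n = 50000
for _ in range(n):
    r = resample('R', r, s, rng)   # random walk: one variable at a time
    s = resample('S', r, s, rng)
    rain_count += r
est = rain_count / n
print(est)   # estimate of P(Rain = 1 | Wet = 1)
```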
![Page 108: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/108.jpg)
Learning Bayesian Networks
![Page 109: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/109.jpg)
Learning Bayesian networks
Data + Prior information → Inducer → network + parameters

(Figure: the Inducer outputs a network over E, B, R, A, C together with the CPT P(A | E,B), whose four rows for the combinations of e, b contain the entries .9/.1, .7/.3, .8/.2, .99/.01)
![Page 110: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/110.jpg)
Known Structure -- Complete Data
(Figure: a data table over E, B, A — <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y> — is fed to the Inducer, which fills in the unknown “?” entries of the CPT P(A | E,B) for the fixed structure E → A ← B)

• Network structure is specified
– The Inducer needs to estimate the parameters
• Data does not contain missing values
![Page 111: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/111.jpg)
Unknown Structure -- Complete Data
(Figure: the same data table over E, B, A is fed to the Inducer, which must now produce both the structure — e.g. E → A ← B — and the entries of the CPT P(A | E,B))

• Network structure is not specified
– The Inducer needs to select the arcs & estimate the parameters
• Data does not contain missing values
![Page 112: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/112.jpg)
Known Structure -- Incomplete Data
(Figure: a data table over E, B, A with missing entries — <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y> — is fed to the Inducer, which estimates the CPT P(A | E,B) for the fixed structure E → A ← B)

• Network structure is specified
• Data contains missing values
– We consider assignments to the missing values
![Page 113: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/113.jpg)
Known Structure / Complete Data
• Given a network structure G
– And a choice of parametric family for P(Xi | Pai)
• Learn the parameters for the network

Goal:
• Construct a network that is “closest” to the probability distribution that generated the data
![Page 114: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/114.jpg)
Learning Parameters for a Bayesian Network
(Network: E → A ← B, A → C)

• Training data has the form:

D = [ E[1] B[1] A[1] C[1]
            ⋮
      E[M] B[M] A[M] C[M] ]
![Page 115: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/115.jpg)
Learning Parameters for a Bayesian Network
(Network: E → A ← B, A → C)

• Since we assume i.i.d. samples, the likelihood function is

L(Θ : D) = ∏m P(E[m], B[m], A[m], C[m] : Θ)
![Page 116: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/116.jpg)
Learning Parameters for a Bayesian Network
(Network: E → A ← B, A → C)

• By the definition of the network, we get

L(Θ : D) = ∏m P(E[m], B[m], A[m], C[m] : Θ)
         = ∏m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)
![Page 117: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/117.jpg)
Learning Parameters for a Bayesian Network
(Network: E → A ← B, A → C)

• Rewriting terms, we get

L(Θ : D) = ∏m P(E[m], B[m], A[m], C[m] : Θ)
         = ∏m P(E[m] : Θ) ⋅ ∏m P(B[m] : Θ) ⋅ ∏m P(A[m] | B[m], E[m] : Θ) ⋅ ∏m P(C[m] | A[m] : Θ)
![Page 118: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/118.jpg)
General Bayesian Networks
Generalizing for any Bayesian network:

L(Θ : D) = ∏m P(x1[m], …, xn[m] : Θ)          (i.i.d. samples)
         = ∏m ∏i P(xi[m] | Pai[m] : Θi)        (network factorization)
         = ∏i ∏m P(xi[m] | Pai[m] : Θi)
         = ∏i Li(Θi : D)

• The likelihood decomposes according to the structure of the network.
![Page 119: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/119.jpg)
General Bayesian Networks (Cont.)
Decomposition ⇒ Independent Estimation Problems

If the parameters for each family are not related, then they can be estimated independently of each other.
![Page 120: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/120.jpg)
From Binomial to Multinomial
• For example, suppose X can take the values 1, 2, …, K
• We want to learn the parameters θ1, θ2, …, θK

Sufficient statistics:
• N1, N2, …, NK — the number of times each outcome is observed

Likelihood function:

L(Θ : D) = ∏k=1..K θk^Nk

MLE:

θ̂k = Nk / ∑ℓ Nℓ
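A tiny worked sketch of these formulas on toy data with K = 3 outcomes (the sample is made up for illustration):

```python
from collections import Counter

# MLE for a multinomial: the sufficient statistics are the outcome
# counts N1, ..., NK, and theta_k = Nk / sum_l Nl.

data = [1, 2, 2, 3, 3, 3]          # toy sample over outcomes {1, 2, 3}
counts = Counter(data)             # sufficient statistics N1, N2, N3
n = len(data)
mle = {k: counts[k] / n for k in sorted(counts)}
print(mle)   # {1: 1/6, 2: 1/3, 3: 1/2}
```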
![Page 121: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/121.jpg)
Likelihood for Multinomial Networks
• When we assume that P(Xi | Pai) is multinomial, we get a further decomposition:

L(Θi : D) = ∏m P(xi[m] | Pai[m] : Θi)
          = ∏pai ∏{m : Pai[m] = pai} P(xi[m] | pai : Θi)
          = ∏pai ∏xi P(xi | pai : Θi)^N(xi, pai)
          = ∏pai ∏xi θxi|pai^N(xi, pai)
![Page 122: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/122.jpg)
Likelihood for Multinomial Networks
• When we assume that P(Xi | Pai) is multinomial, we get a further decomposition:

L(Θi : D) = ∏pai ∏xi θxi|pai^N(xi, pai)

• For each value pai of the parents of Xi we get an independent multinomial problem
• The MLE is

θ̂xi|pai = N(xi, pai) / N(pai)
Maximum Likelihood Estimation
Consistency
• The estimate converges to the best possible value as the number of examples grows
• To make this formal, we need to introduce some definitions
![Page 124: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/124.jpg)
KL-Divergence
• Let P and Q be two distributions over X
• A measure of distance between P and Q is the Kullback-Leibler divergence

KL(P||Q) = ∑x P(x) log [P(x) / Q(x)]

• KL(P||Q) = 1 (when logs are in base 2) means that, on average, Q assigns an instance half the probability that P assigns to it
• KL(P||Q) ≥ 0
• KL(P||Q) = 0 iff P and Q are equal
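The definition can be checked numerically; a small sketch with toy distributions (chosen so that P assigns exactly twice the probability Q does on P's support):

```python
from math import log2

# KL divergence in base 2, restricted to the support of P.

def kl(p, q):
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.25, 'b': 0.25, 'c': 0.5}
print(kl(p, q))   # each P(x) is twice Q(x) on P's support, so KL = 1.0
print(kl(p, p))   # a distribution is at distance 0 from itself
```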
![Page 125: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/125.jpg)
Consistency
• Let P(X | Θ) be a parametric family
– We need various regularity conditions that we won’t go into now
• Let P*(X) be the distribution that generates the data
• Let Θ̂D be the MLE estimate given a dataset D

Thm: As N → ∞, Θ̂D → Θ* with probability 1, where

Θ* = argminΘ KL(P*(X) || P(X : Θ))
![Page 126: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/126.jpg)
Consistency -- Geometric Interpretation
(Figure: in the space of probability distributions, P* is projected onto P(X | Θ*), the closest member of the set of distributions representable as P(X | Θ))
![Page 127: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/127.jpg)
Bayesian Inference
Frequentist approach:
• Assumes there is an unknown but fixed parameter Θ
• Estimates Θ with some confidence
• Predicts using the estimated parameter value

Bayesian approach:
• Represents uncertainty about the unknown parameter
• Uses probability to quantify this uncertainty:
– Unknown parameters are treated as random variables
• Prediction follows from the rules of probability:
– Expectation over the unknown parameters
![Page 128: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/128.jpg)
Bayesian Inference (cont.)
• We can represent our uncertainty about the sampling process using a Bayesian network

Θ → X[1], X[2], …, X[m], X[m+1]

(X[1], …, X[m] are the observed data; X[m+1] is the query)

• The values of X are independent given Θ
• The conditional probabilities P(x[m] | Θ) are the parameters in the model
• Prediction is now inference in this network
![Page 129: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/129.jpg)
Bayesian Inference (cont.)
Prediction as inference in this network:

P(x[M+1] | x[1], …, x[M]) = ∫ P(x[M+1] | Θ, x[1], …, x[M]) P(Θ | x[1], …, x[M]) dΘ
                          = ∫ P(x[M+1] | Θ) P(Θ | x[1], …, x[M]) dΘ

where

P(Θ | x[1], …, x[M]) = P(x[1], …, x[M] | Θ) P(Θ) / P(x[1], …, x[M])

(Posterior = Likelihood × Prior / Probability of the data)
![Page 130: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/130.jpg)
Dirichlet Priors
• Recall that the likelihood function is

L(Θ : D) = ∏k=1..K θk^Nk

• A Dirichlet prior with hyperparameters α1, …, αK is defined as

P(Θ) ∝ ∏k=1..K θk^(αk − 1)     for legal θ1, …, θK

• Then the posterior has the same form:

P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏k θk^(αk − 1) ∏k θk^Nk = ∏k θk^(αk + Nk − 1)

i.e. Dirichlet with hyperparameters α1 + N1, …, αK + NK
![Page 131: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/131.jpg)
Dirichlet Priors (cont.)
• We can compute the prediction on a new event in closed form:
• If P(Θ) is Dirichlet with hyperparameters α1, …, αK, then

P(X[1] = k) = ∫ θk P(Θ) dΘ = αk / ∑ℓ αℓ

• Since the posterior is also Dirichlet, we get

P(X[M+1] = k | D) = ∫ θk P(Θ | D) dΘ = (αk + Nk) / ∑ℓ (αℓ + Nℓ)
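These closed-form predictions are easy to compute; a toy sketch with hypothetical hyperparameters and counts:

```python
# Dirichlet prediction: with hyperparameters alpha_k and observed counts
# N_k, the predictive probability of outcome k is
# (alpha_k + N_k) / sum_l (alpha_l + N_l).

alpha = [1.0, 1.0, 1.0]       # uniform prior over K = 3 outcomes
counts = [2, 0, 4]            # sufficient statistics from the data D

posterior_alpha = [a + n for a, n in zip(alpha, counts)]
total = sum(posterior_alpha)
predictive = [a / total for a in posterior_alpha]
print(predictive)   # [3/9, 1/9, 5/9]
```

Note that the unobserved outcome still receives nonzero probability 1/9 — the prior counts smooth the estimate, unlike the MLE.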
![Page 132: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/132.jpg)
Prior Knowledge
• The hyperparameters α1, …, αK can be thought of as “imaginary” counts from our prior experience
• Equivalent sample size = α1 + … + αK
• The larger the equivalent sample size, the more confident we are in our prior
![Page 133: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/133.jpg)
Bayesian Prediction (cont.)

• Given these observations, we can compute the posterior for each multinomial Xi | pai independently
– The posterior is Dirichlet with parameters
α(Xi=1 | pai) + N(Xi=1 | pai), …, α(Xi=k | pai) + N(Xi=k | pai)

• The predictive distribution is then represented by the parameters

θ̃xi|pai = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))
![Page 134: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/134.jpg)
Learning Parameters: Summary
• Estimation relies on sufficient statistics
– For multinomials these are of the form N(xi, pai)
– Parameter estimation:

MLE:                  θ̂xi|pai = N(xi, pai) / N(pai)
Bayesian (Dirichlet): θ̃xi|pai = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))

• Bayesian methods also require a choice of priors
• MLE and Bayesian estimates are asymptotically equivalent and consistent
• Both can be implemented in an on-line manner by accumulating sufficient statistics
![Page 135: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/135.jpg)
Learning Structure from Complete Data
![Page 136: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/136.jpg)
Why Struggle for Accurate Structure?
Adding an arc:
• Increases the number of parameters to be fitted
• Makes wrong assumptions about causality and domain structure

Missing an arc:
• Cannot be compensated for by accurate fitting of parameters
• Also misses causality and domain structure

(Figure: the network Earthquake, Burglary → Alarm Set → Sound, shown once with an added arc and once with a missing arc)
![Page 137: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/137.jpg)
Approaches to Learning Structure
• Constraint based
– Perform tests of conditional independence
– Search for a network that is consistent with the observed dependencies and independencies

• Pros & Cons
+ Intuitive; follows closely the construction of BNs
+ Separates structure learning from the form of the independence tests
− Sensitive to errors in individual tests
![Page 138: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/138.jpg)
Approaches to Learning Structure
• Score based
– Define a score that evaluates how well the (in)dependencies in a structure match the observations
– Search for a structure that maximizes the score

• Pros & Cons
+ Statistically motivated
+ Can make compromises
+ Takes the structure of conditional probabilities into account
− Computationally hard
![Page 139: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/139.jpg)
Likelihood Score for Structures
First-cut approach:
– Use the likelihood function

• Recall, the likelihood score for a network structure and parameters is

L(G, ΘG : D) = ∏m P(x1[m], …, xn[m] : G, ΘG) = ∏m ∏i P(xi[m] | Pai,G[m] : G, ΘG)

• Since we know how to maximize over the parameters, from now on we assume

L(G : D) = maxΘG L(G, ΘG : D)
![Page 140: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/140.jpg)
Avoiding Overfitting
A “classic” issue in learning. Approaches:
• Restricting the hypothesis space
– Limits the overfitting capability of the learner
– Example: restrict the # of parents or # of parameters
• Minimum description length
– Description length measures complexity
– Prefer models that compactly describe the training data
• Bayesian methods
– Average over all possible parameter values
– Use prior knowledge
![Page 141: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/141.jpg)
Bayesian Inference
• Bayesian reasoning: compute the expectation over the unknown G

P(x[M+1] | D) = ∑G P(x[M+1] | D, G) P(G | D)

• Assumption: the Gs are mutually exclusive and exhaustive
• We know how to compute P(x[M+1] | G, D)
– Same as prediction with a fixed structure
• How do we compute P(G | D)?
![Page 142: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/142.jpg)
Using Bayes rule:

P(G | D) = P(D | G) P(G) / P(D)

(posterior score = marginal likelihood × prior over structures / probability of the data)

P(D) is the same for all structures G, so it can be ignored when comparing structures.
![Page 143: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/143.jpg)
Marginal Likelihood
• By introducing the parameters, we have

P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ

(likelihood × prior over parameters)

• This integral measures sensitivity to the choice of parameters
![Page 144: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/144.jpg)
Marginal Likelihood: Multinomials
The same argument generalizes to multinomials with a Dirichlet prior
• P(Θ) is Dirichlet with hyperparameters α1, …, αK
• D is a dataset with sufficient statistics N1, …, NK

Then

P(D) = [Γ(∑ℓ αℓ) / Γ(∑ℓ (αℓ + Nℓ))] ⋅ ∏ℓ [Γ(αℓ + Nℓ) / Γ(αℓ)]
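In practice this product of Gamma functions is computed in log space; a sketch using the standard library's lgamma. The sanity check uses the uniform prior α = (1, 1), for which the marginal likelihood of a particular binary sequence of length M with N1 ones is known to be N1! (M − N1)! / (M + 1)!:

```python
from math import lgamma, exp

# Log marginal likelihood of a multinomial dataset under a Dirichlet prior:
# log Gamma(sum a) - log Gamma(sum a + N) + sum_k [lgamma(a_k + N_k) - lgamma(a_k)]

def log_marginal(alpha, counts):
    a0, n0 = sum(alpha), sum(counts)
    res = lgamma(a0) - lgamma(a0 + n0)
    for a, n in zip(alpha, counts):
        res += lgamma(a + n) - lgamma(a)
    return res

# Uniform prior on a binary variable, data with counts (2, 1):
ml = exp(log_marginal([1.0, 1.0], [2, 1]))
print(ml)   # 2! * 1! / 4! = 1/12
```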
![Page 145: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/145.jpg)
Marginal Likelihood for General Network
The marginal likelihood has the form:

P(D | G) = ∏i ∏paiG [Γ(α(paiG)) / Γ(α(paiG) + N(paiG))] ⋅ ∏xi [Γ(α(xi, paiG) + N(xi, paiG)) / Γ(α(xi, paiG))]

where
• N(..) are the counts from the data
• α(..) are the hyperparameters for each family, given G

The inner term is the Dirichlet marginal likelihood for the sequence of values of Xi when Xi’s parents have a particular value.
![Page 146: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/146.jpg)
Priors
• We need prior counts α(..) for each network structure G
• This can be a formidable task
– There are exponentially many structures…
BDe Score
Possible solution: the BDe prior

• Represent the prior using two elements, M0 and B0
– M0: equivalent sample size
– B0: a network representing the prior probability of events
![Page 148: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/148.jpg)
BDe Score
Intuition: M0 prior examples distributed by B0
• Set α(xi, paiG) = M0 ⋅ P(xi, paiG | B0)
– Note that the paiG are not necessarily the same as the parents of Xi in B0
– Compute P(xi, paiG | B0) using standard inference procedures

• Such priors have desirable theoretical properties
– Equivalent networks are assigned the same score
Bayesian Score: Asymptotic Behavior
Theorem: If the prior P(Θ | G) is “well-behaved”, then

log P(D | G) = ℓ(G : D) − (log M / 2) dim(G) + O(1)
![Page 150: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/150.jpg)
Asymptotic Behavior: Consequences
log P(D | G) = ℓ(G : D) − (log M / 2) dim(G) + O(1)

• The Bayesian score is consistent
– As M → ∞, the “true” structure G* maximizes the score (almost surely)
– For sufficiently large M, the maximal scoring structures are equivalent to G*
• Observed data eventually overrides prior information
– Assuming that the prior assigns positive probability to all cases
![Page 151: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/151.jpg)
Asymptotic Behavior
• This score can also be justified by the Minimum Description Length (MDL) principle
• This equation explicitly shows the tradeoff between
– Fitness to data: the likelihood term
– Penalty for complexity: the regularization term

Score(G : D) = ℓ(G : D) − (log M / 2) dim(G)
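A sketch of this score for a single multinomial family, with toy counts; the likelihood term uses the MLE parameters, and a K-valued variable contributes K − 1 free parameters to dim(G):

```python
from math import log

# MDL/BIC-style score: maximized log-likelihood minus (log M / 2) * dim(G).
# Here the likelihood term is illustrated for one multinomial variable.

def score(counts, dim):
    m = sum(counts)
    loglik = sum(n * log(n / m) for n in counts if n > 0)  # MLE log-likelihood
    return loglik - (log(m) / 2) * dim

counts = [30, 20, 50]                    # toy sufficient statistics
val = score(counts, dim=len(counts) - 1) # K - 1 free parameters
print(val)
```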
![Page 152: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/152.jpg)
Scores -- Summary
• Likelihood, MDL, and (log) BDe scores all have the form

Score(G : D) = ∑i Score(Xi | PaiG : N(Xi, Pai))

• BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience
• BDe is consistent and asymptotically equivalent (up to a constant) to MDL
• All are score-equivalent
– G equivalent to G’ ⇒ Score(G) = Score(G’)
![Page 153: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/153.jpg)
Optimization Problem
Input:
– Training data
– Scoring function (including priors, if needed)
– Set of possible structures
• Including prior knowledge about structure

Output:
– A network (or networks) that maximizes the score

Key property:
– Decomposability: the score of a network is a sum of per-family terms
![Page 154: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/154.jpg)
Difficulty
Theorem: Finding the maximal scoring network structure with at most k parents for each variable is NP-hard for k > 1
![Page 155: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/155.jpg)
Heuristic Search
We address the problem by using heuristic search

• Define a search space:
– nodes are possible structures
– edges denote adjacency of structures
• Traverse this space looking for high-scoring structures

Search techniques:
– Greedy hill-climbing
– Best-first search
– Simulated annealing
– ...
![Page 156: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/156.jpg)
Heuristic Search (cont.)
• Typical operations on a network over S, C, E, D:
– Add an edge (e.g. C → D)
– Delete an edge (e.g. C → E)
– Reverse an edge (e.g. C → E)
![Page 157: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/157.jpg)
Exploiting Decomposability in Local Search
• Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move

(Figure: four variants of a network over S, C, E, D, each differing in a single edge)
![Page 158: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/158.jpg)
Greedy Hill-Climbing
• Simplest heuristic local search
– Start with a given network
• the empty network
• the best tree
• a random network
– At each iteration
• Evaluate all possible changes
• Apply the change that leads to the best improvement in score
• Reiterate
– Stop when no modification improves the score

• Each step requires evaluating approximately n new changes
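The loop above can be sketched generically. The "structure" and score below are deliberately toy stand-ins — a set of edges and a score that rewards a hypothetical target edge set — since a real implementation would plug in a DAG representation and a decomposable network score:

```python
# Greedy hill-climbing skeleton: evaluate all neighbors of the current
# structure, move to the best one if it improves the score, stop otherwise.

TARGET = {('A', 'B'), ('B', 'C')}        # hypothetical "best" structure

def score(edges):
    # Toy stand-in for a decomposable score: reward target edges,
    # penalize spurious ones.
    return len(edges & TARGET) - len(edges - TARGET)

def neighbors(edges, all_edges):
    for e in all_edges:                  # one-edge changes: add or delete e
        yield edges ^ {e}

def hill_climb(start, all_edges):
    current = start
    while True:
        best = max(neighbors(current, all_edges), key=score)
        if score(best) <= score(current):
            return current               # no modification improves the score
        current = best

all_edges = [('A', 'B'), ('B', 'C'), ('A', 'C')]
result = hill_climb(frozenset(), all_edges)   # start from the empty network
print(sorted(result))
```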
![Page 159: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/159.jpg)
Greedy Hill-Climbing: Possible Pitfalls
• Greedy hill-climbing can get stuck in:
– Local maxima:
• All one-edge changes reduce the score
– Plateaus:
• Some one-edge changes leave the score unchanged
• Happens because equivalent networks receive the same score and are neighbors in the search space

• Both occur during structure search
• Standard heuristics can escape both
– Random restarts
– TABU search
![Page 160: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/160.jpg)
Model Selection
• So far, we have focused on a single model
– Find the best scoring model
– Use it to predict the next example
• Implicit assumption:
– The best scoring model dominates the weighted sum

• Pros:
– We get a single structure
– Allows for efficient use in our tasks
• Cons:
– We are committing to the independencies of a particular structure
– Other structures might be as probable given the data
![Page 161: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/161.jpg)
Model Averaging
• Recall, Bayesian analysis started with

P(x[M+1] | D) = ∑G P(x[M+1] | G, D) P(G | D)

– This requires us to average over all possible models
![Page 162: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/162.jpg)
Model Averaging (cont.)
• Full averaging
– Sum over all structures
– Usually intractable: there are exponentially many structures
• Approximate averaging
– Find the K largest scoring structures
– Approximate the sum by averaging over their predictions
– The weight of each structure is determined by the Bayes factor

P(G | D) / P(G’ | D) = [P(G) P(D | G) / P(D)] / [P(G’) P(D | G’) / P(D)]

where P(D | G) is the actual score we compute
![Page 163: Describing Data](https://reader036.vdocument.in/reader036/viewer/2022062423/568146a9550346895db3c617/html5/thumbnails/163.jpg)
Search: Summary
• A discrete optimization problem
• In general, NP-hard
– Need to resort to heuristic search
– In practice, search is relatively fast (~100 vars in ~10 min), thanks to:
• Decomposability
• Sufficient statistics
• In some cases, we can reduce the search problem to an easy optimization problem
– Example: learning trees