


Carnegie Mellon

Junction trees

Trees where each node is a set of variables

- Running intersection property: every clique on the path between Ci and Cj contains Ci ∩ Cj

- Ci and Cj are neighbors ⇒ Sij = Ci ∩ Cj is called a separator

- Example:

- Notation: Vij is the set of all variables on the same side of edge i-j as clique Cj:

- V34 = {G,F}, V31 = {A}, V43 = {A,D}

- Encoded independencies: (Vij ⊥ Vji | Sij)
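The definitions above can be made concrete with a small sketch. All names here (`satisfies_rip`, the frozenset/edge-list encoding) are hypothetical, not from the poster; the check verifies the running intersection property on the poster's six-clique example.

```python
from itertools import combinations

def satisfies_rip(cliques, edges):
    """Check the running intersection property: for every pair of cliques
    Ci, Cj, each clique on the tree path between them contains Ci & Cj.
    cliques: list of frozensets; edges: list of (i, j) index pairs."""
    adj = {i: set() for i in range(len(cliques))}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    def path(i, j):  # the unique path in a tree, found by DFS
        stack, prev = [i], {i: None}
        while stack:
            u = stack.pop()
            if u == j:
                break
            for v in adj[u]:
                if v not in prev:
                    prev[v] = u
                    stack.append(v)
        p, u = [], j
        while u is not None:
            p.append(u)
            u = prev[u]
        return p

    for i, j in combinations(range(len(cliques)), 2):
        inter = cliques[i] & cliques[j]
        if any(not inter <= cliques[u] for u in path(i, j)):
            return False
    return True

# The poster's example: cliques AB-BC, BC-CD, BC-BE, BE-EF, BE-EG
cliques = [frozenset(c) for c in ("AB", "BC", "CD", "BE", "EF", "EG")]
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
```

Calling `satisfies_rip(cliques, edges)` on this example returns True, since e.g. the path AB-BC-BE keeps B in every clique.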

Efficient Principled Learning of Junction Trees

Anton Chechetka and Carlos Guestrin

Carnegie Mellon University

Motivation

Probabilistic graphical models are everywhere
- Medical diagnosis, datacenter performance monitoring, sensor nets, …

Main advantages
- Compact representation of probability distributions
- Exploit structure to speed up inference

But also problems
- Compact representation ≠ tractable inference
- Exact inference is #P-complete in general
- Often still need exponential time even for compact models
- Example: ≤4 neighbors per variable (a constant!), but inference is still hard
- Often do not even have structure, only data
- The best structure is NP-complete to find
- Most structure learning algorithms return complex models, where inference is intractable
- Very few structure learning algorithms have global quality guarantees

We address both of these issues! We provide
- an efficient structure learning algorithm
- guaranteed to learn tractable models
- with global guarantees on the quality of the results

This work: contributions

- The first polynomial-time algorithm with PAC guarantees for learning low-treewidth graphical models, with guaranteed tractable inference!

- Key theoretical insight: polynomial-time upper bound on conditional mutual information for arbitrarily large sets of variables

- Empirical viability demonstration

Tractability guarantees:
- Inference is exponential in clique size k
- Small cliques ⇒ tractable inference

JTs as approximations

Often exact conditional independence is too strict a requirement
- generalization: conditional mutual information
  I(A, B | C) = H(A | C) – H(A | B,C)
- H(· | ·) is conditional entropy
- I(A, B | C) ≥ 0 always
- I(A, B | C) = 0 ⇔ (A ⊥ B | C)
- intuitively: if C is already known, how much new information about A is contained in B?

Goal: find an ε-junction tree with fixed clique size k in polynomial (in |V|) time
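As a concrete illustration of this definition, here is a minimal empirical estimator (a hypothetical helper, not the poster's code), using the equivalent identity I(A;B|C) = H(A,C) + H(B,C) − H(A,B,C) − H(C):

```python
from collections import Counter
from math import log2

def cond_mutual_info(samples, A, B, C):
    """Empirical I(A ; B | C) in bits, estimated from a list of dicts
    mapping variable name -> value. A, B, C are disjoint variable lists.
    Illustrative only; the poster's theory works with exact distributions
    or finite-sample entropy estimates."""
    n = len(samples)

    def joint_entropy(vars_):
        counts = Counter(tuple(s[v] for v in vars_) for s in samples)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    A, B, C = sorted(A), sorted(B), sorted(C)
    # I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)
    return (joint_entropy(A + C) + joint_entropy(B + C)
            - joint_entropy(A + B + C) - joint_entropy(C))
```

For independent uniform bits x and y this returns 0; for perfectly correlated bits it returns 1.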

[Figure: an example junction tree with cliques AB, BC, CD, BE, EF, EG (numbered 1–6) and separators B, C, B, E; the variable set V34 is marked.]

Approximation quality guarantee:

Theorem [Narasimhan and Bilmes, UAI05]: If for every separator Sij in the junction tree it holds that the conditional mutual information

I(Vij, Vji | Sij) < ε (call it an ε-junction tree)

then KL(P || Ptree) < ε|V|

Constraint-based learning

Naively:
- for every candidate separator S of size k
  - for every X ⊆ V\S
    - if I(X, V\S\X | S) < ε, add (S,X) to the “list of useful components” L
- find a JT consistent with L
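The naive pass above can be sketched as follows (hypothetical names; `cmi(X, Y, S)` is an oracle for I(X;Y|S)). Note the inner loop over all subsets X is what makes it exponential in |V|:

```python
from itertools import combinations

def naive_useful_components(variables, k, eps, cmi):
    """Naive constraint-based pass: for every candidate separator S of
    size k and every proper subset X of the remaining variables, keep
    (S, X) when I(X, V\\S\\X | S) < eps. Exponential in |V| because X
    ranges over all subsets of V\\S."""
    V = set(variables)
    L = []
    for S in combinations(sorted(V), k):
        rest = sorted(V - set(S))
        for r in range(1, len(rest)):          # all proper subset sizes
            for X in combinations(rest, r):
                Y = set(rest) - set(X)
                if cmi(set(X), Y, set(S)) < eps:
                    L.append((frozenset(S), frozenset(X)))
    return L
```

With |V| = n this enumerates n^k separators times 2^n subsets, matching the "Naïve" column below.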

Complexity:

  Naïve     Our work
  n^k       n^k
  O(2^n)    O(n^{k+3})
  O(2^n)    O(2^{4k+4})
  O(2^n)    O(n^{k+2})

Key theoretical result

Efficient upper bound for I(·, · | ·)

Intuition: Suppose a distribution P(V) can be well approximated by a junction tree with clique size k. Then for every set S ⊆ V of size k and A, B ⊆ V of arbitrary size, to check that I(A, B | S) is small, it is enough to check, for all subsets X ⊆ A, Y ⊆ B of size at most k, that I(X, Y | S) is small.

Computation time is reduced from exponential in |V| to polynomial!

The set S does not have to relate to the separators of the “true” JT in any way!

Theorem 1: Suppose an ε-JT of treewidth k exists for P(V). Suppose the sets S ⊆ V of size k and A ⊆ V\S of arbitrary size are s.t. for every X ⊆ V\S of size k+1 it holds that

I(X∩A, X∩(V\S\A) | S) < δ

then

I(A, V\S\A | S) < |V|(ε+δ)

Complexity: O(n^{k+1}). Polynomial in n, instead of O(exp(n)) for a straightforward computation.
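A sketch of this O(n^{k+1}) check (hypothetical names; `cmi` is again an oracle for I(X;Y|S)): iterate over all size-(k+1) subsets X of V\S and verify the small-set condition, instead of computing one exponential-size mutual information.

```python
from itertools import combinations

def small_subset_bound_holds(V, S, A, k, delta, cmi):
    """Check the hypothesis of Theorem 1: for every X in V\\S with
    |X| = k+1, I(X & A, X \\ A | S) < delta. If this holds, the theorem
    bounds I(A, V\\S\\A | S) < |V|(eps + delta) -- an O(n^{k+1}) check."""
    rest = sorted(set(V) - set(S))
    for X in combinations(rest, k + 1):
        XA = set(X) & set(A)          # part of X on A's side
        XB = set(X) - set(A)          # part of X on the other side
        if XA and XB and cmi(XA, XB, set(S)) >= delta:
            return False
    return True
```

Only subsets X that actually straddle A and its complement contribute a constraint; the rest are skipped.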

[Figure: sets A and B separated by S. I(A, B | S) = ?? — only need to compute I(X, Y | S) for small X ⊆ A and Y ⊆ B!]

Finding almost independent subsets

Question: if S is a separator of an ε-JT, which variables are on the same side of S?
- More than one correct answer possible:
- We will settle for finding one
- Drop the complexity from exponential to polynomial

Example: cliques AB, AC, AD with S={A}: both {B}, {C,D} and {B,C}, {D} are possible partitionings.

Intuition: Consider the set of variables Q={B,C,D}. Suppose an ε-JT (e.g. the one above) with separator S={A} exists s.t. some of the variables in Q ({B}) are on the left of S and the remaining ones ({C,D}) are on the right. Then a partitioning of Q into X and Y exists s.t. I(X, Y | S) < ε.

If no such splits exist, all variables of Q must be on the same side of S.

Alg. 1 (given candidate separator S and threshold δ):
- each variable of V\S starts out as a separate partition
- for every Q ⊆ V\S of size at most k+2 (a fixed size, regardless of n)
  - if min_{X⊂Q} I(X, Q\X | S) > δ
  - merge all partitions that have variables in Q

Complexity: O(n^{k+3}). Polynomial in n.
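Alg. 1 might be implemented with a union-find structure over V\S, merging whenever no low-information split of Q exists. A minimal sketch, assuming an oracle `cmi(X, Y, S)` for I(X;Y|S) (all names hypothetical):

```python
from itertools import combinations

def almost_independent_partitions(V, S, k, delta, cmi):
    """Sketch of Alg. 1: each variable of V\\S starts as its own
    partition; for every Q of size <= k+2, if even the best split of Q
    into X, Q\\X has I(X, Q\\X | S) > delta, merge all partitions that
    touch Q. Union-find with path halving keeps the merging cheap."""
    rest = sorted(set(V) - set(S))
    parent = {v: v for v in rest}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for size in range(2, k + 3):
        for Q in combinations(rest, size):
            best_split = min(cmi(set(X), set(Q) - set(X), set(S))
                             for r in range(1, size)
                             for X in combinations(Q, r))
            if best_split > delta:          # no low-information split
                for v in Q[1:]:             # merge everything touching Q
                    parent[find(v)] = find(Q[0])

    groups = {}
    for v in rest:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=sorted)
```

With an oracle that reports high mutual information only between b and c, the function keeps d separate and merges b with c.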

Theorem (quality of results): If after invoking Alg.1(S, δ=ε) a set U is a connected component, then
- For every Z s.t. I(Z, V\Z\S | S) < ε it holds that U ⊆ Z (we never mistakenly put variables together)
- I(U, V\U\S | S) < nkε (incorrect splits are not too bad)

[Figure: an example run with ε=0.25. Pairwise I(·, · | S) values 0.4, 0.3, 0.2 label the edges; each edge is tested in turn, variables are merged when I exceeds the threshold and left apart when I is too low; the final panel shows the merged end result.]

Constructing a junction tree

Using Alg.1 for every S ⊆ V, obtain a list L of pairs (S,Q) s.t. I(Q, V\S\Q | S) < |V|(ε+δ)

[Figure: an example list of separator–component pairs (S, Q) for the junction tree above, with cliques AB, BC, CD, BE, EF and separators C, E.]

Problem: From L, reconstruct a junction tree. This is non-trivial. Complications:
- L may encode more independencies than a single JT encodes
- Several different JTs may be consistent with the independencies in L

Key insight [Arnborg+al: SIAM-JADM 1987, Narasimhan+Bilmes: UAI05]: In a junction tree, components (S,Q) have a recursive decomposition: a clique in the junction tree plus smaller components from L.

DP algorithm (input: list L of pairs (S,Q)):
- sort L in order of increasing |Q|
- mark (S,Q) ∈ L with |Q|=1 as positive
- for (S,Q) ∈ L with |Q| ≥ 2, in the sorted order
  - if ∃ x∈Q and (S1,Q1), …, (Sm,Qm) ∈ L s.t.
    - Si ⊆ S∪{x} and (Si,Qi) is positive
    - Qi ∩ Qj = ∅
    - ∪_{i=1:m} Qi = Q\{x}
  - then mark (S,Q) positive
    - decomposition(S,Q) = (S1,Q1), …, (Sm,Qm)
- if ∃ S s.t. all (S,Qi) ∈ L are positive
  - return the corresponding junction tree

Look for such recursive decompositions in L!

Deciding whether the ∃ step succeeds is NP-complete, so we use a greedy heuristic.

Greedy heuristic for decomposition search
- initialize the decomposition to empty
- iteratively add pairs (Si,Qi) that do not conflict with those already in the decomposition
- if all variables of Q are covered, success
- May fail even if a decomposition exists
- But we prove that for certain distributions it is guaranteed to work
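The DP pass combined with the greedy cover heuristic described above might be sketched as follows (all names hypothetical; `L` holds (S, Q) pairs as frozensets):

```python
def mark_positive(L):
    """|Q| = 1 components are positive by definition; a larger (S, Q)
    becomes positive if, for some x in Q, the rest of Q can be covered by
    disjoint positive components whose separators fit inside S | {x}.
    The cover search is greedy, so (as noted above) it can miss
    decompositions that do exist."""
    L = sorted(L, key=lambda sq: len(sq[1]))
    positive = set()
    for S, Q in L:
        if len(Q) == 1:
            positive.add((S, Q))
            continue
        for x in Q:                        # candidate clique S | {x}
            covered = set()
            for Si, Qi in L:               # greedily take compatible parts
                if ((Si, Qi) in positive and Si <= S | {x}
                        and Qi <= Q - {x} and not (Qi & covered)):
                    covered |= Qi
            if covered == set(Q) - {x}:
                positive.add((S, Q))
                break
    return positive
```

For example, with L = [({b},{a}), ({b},{c}), ({b},{a,c})], the pair ({b},{a,c}) is marked positive: picking x=a gives clique {a,b}, and the positive singleton ({b},{c}) covers the rest.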

[Figure: an example recursive decomposition involving cliques ABEF and ABCD and their smaller components from L.]

Theoretical guarantees

Intuition: if the intra-clique dependencies are strong enough, we are guaranteed to find a well-approximating JT in polynomial time.

Theorem: Suppose a maximal ε-JT of treewidth k exists for P(V) s.t. for every clique C and separator S of the tree it holds that

min_{X⊂(C\S)} I(X, C\S\X | S) > (k+3)(ε+δ)

Then our algorithm will find a k|V|(ε+δ)-JT for P(V) with probability at least (1−γ), using a number of samples and an amount of time that are polynomial in n and log(1/γ) (and exponential in the treewidth k).

Corollary: Maximal JTs of fixed treewidth s.t. for every clique C and separator S it holds that

min_{X⊂(C\S)} I(X, C\S\X | S) > α for fixed α > 0

are efficiently PAC learnable.

Experimental results

Model quality (log-likelihood on test set). We compare this work with
- ordering-based search (OBS) [Teyssier+Koller:UAI-05]
- the Chow-Liu alg. [Chow+Liu:IEEE-1968]
- the Karger-Srebro alg. [Karger+Srebro:SODA-01]
- local search
- this work + local search (using our algorithm to initialize local search)

Data: Beinlich+al:ECAIM-1988 — 37 variables, treewidth 4, learned treewidth 3
Data: Krause+Guestrin:UAI-05 — 32 variables, treewidth 3
Data: Deshpande+al:VLDB-04 — 54 variables, treewidth 2

Future work
- Extend to non-maximal junction trees
- Heuristics to speed up performance
- Use information about edge likelihoods (e.g. from L1-regularized logistic regression) to cut down on computation

Related work

  Ref.       Model      Guarantees    Time
  [1,2]      tractable  local         poly(n)
  [3]        tree       global        O(n^2 log n)
  [4]        tree mix   local         O(n^2 log n)
  [5]        compact    local         poly(n)
  [6]        all        global        exp(n)
  [7]        tractable  const-factor  poly(n)
  [8]        compact    PAC           poly(n)
  [9]        tractable  PAC           exp(n)
  this work  tractable  PAC           poly(n)

[1] Bach+Jordan:NIPS-02
[2] Choi+al:UAI-05
[3] Chow+Liu:IEEE-1968
[4] Meila+Jordan:JMLR-01
[5] Teyssier+Koller:UAI-05
[6] Singh+Moore:CMU-CALD-05
[7] Karger+Srebro:SODA-01
[8] Abbeel+al:JMLR-06
[9] Narasimhan+Bilmes:UAI-04