


Carnegie Mellon

Junction trees

Trees where each node is a set of variables

- Running intersection property: every clique on the path between Ci and Cj contains Ci ∩ Cj

- Ci and Cj are neighbors ⇒ Sij = Ci ∩ Cj is called a separator

- Example:

- Notation: Vij is the set of all variables on the same side of edge i-j as clique Cj:

- V34 = {G,F}, V31 = {A}, V43 = {A,D}

- Encoded independencies: (Vij ⊥ Vji | Sij)
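The definitions above can be made concrete with a small sketch. All names here (`satisfies_rip`, the frozenset/edge-list encoding) are hypothetical, not from the poster; the check verifies the running intersection property on the poster's six-clique example.

```python
from itertools import combinations

def satisfies_rip(cliques, edges):
    """Check the running intersection property: for every pair of cliques
    Ci, Cj, each clique on the tree path between them contains Ci & Cj.
    cliques: list of frozensets; edges: list of (i, j) index pairs."""
    adj = {i: set() for i in range(len(cliques))}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    def path(i, j):  # the unique path in a tree, found by DFS
        stack, prev = [i], {i: None}
        while stack:
            u = stack.pop()
            if u == j:
                break
            for v in adj[u]:
                if v not in prev:
                    prev[v] = u
                    stack.append(v)
        p, u = [], j
        while u is not None:
            p.append(u)
            u = prev[u]
        return p

    for i, j in combinations(range(len(cliques)), 2):
        inter = cliques[i] & cliques[j]
        if any(not inter <= cliques[u] for u in path(i, j)):
            return False
    return True

# The poster's example: cliques AB-BC, BC-CD, BC-BE, BE-EF, BE-EG
cliques = [frozenset(c) for c in ("AB", "BC", "CD", "BE", "EF", "EG")]
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
```

Calling `satisfies_rip(cliques, edges)` on this example returns True, since e.g. the path AB-BC-BE keeps B in every clique.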

Efficient Principled Learning of Junction Trees

Anton Chechetka and Carlos Guestrin

Carnegie Mellon University

Motivation

Probabilistic graphical models are everywhere
- Medical diagnosis, datacenter performance monitoring, sensor nets, …

Main advantages
- Compact representation of probability distributions
- Exploit structure to speed up inference

But also problems
- Compact representation ≠ tractable inference
- Exact inference is #P-complete in general
- Often still need exponential time even for compact models
- Example: ≤4 neighbors per variable (a constant!), but inference is still hard
- Often do not even have structure, only data
- The best structure is NP-complete to find
- Most structure learning algorithms return complex models, where inference is intractable
- Very few structure learning algorithms have global quality guarantees

We address both of these issues! We provide
- an efficient structure learning algorithm
- guaranteed to learn tractable models
- with global guarantees on the quality of the results

This work: contributions

- The first polynomial-time algorithm with PAC guarantees for learning low-treewidth graphical models, with guaranteed tractable inference!

- Key theoretical insight: polynomial-time upper bound on conditional mutual information for arbitrarily large sets of variables

- Empirical viability demonstration

Tractability guarantees:
- Inference is exponential in clique size k
- Small cliques ⇒ tractable inference

JTs as approximations

Often exact conditional independence is too strict a requirement
- generalization: conditional mutual information
  I(A, B | C) = H(A | C) – H(A | B,C)
- H(· | ·) is conditional entropy
- I(A, B | C) ≥ 0 always
- I(A, B | C) = 0 ⇔ (A ⊥ B | C)
- intuitively: if C is already known, how much new information about A is contained in B?

Goal: find an ε-junction tree with fixed clique size k in polynomial (in |V|) time
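As a concrete illustration of this definition, here is a minimal empirical estimator (a hypothetical helper, not the poster's code), using the equivalent identity I(A;B|C) = H(A,C) + H(B,C) − H(A,B,C) − H(C):

```python
from collections import Counter
from math import log2

def cond_mutual_info(samples, A, B, C):
    """Empirical I(A ; B | C) in bits, estimated from a list of dicts
    mapping variable name -> value. A, B, C are disjoint variable lists.
    Illustrative only; the poster's theory works with exact distributions
    or finite-sample entropy estimates."""
    n = len(samples)

    def joint_entropy(vars_):
        counts = Counter(tuple(s[v] for v in vars_) for s in samples)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    A, B, C = sorted(A), sorted(B), sorted(C)
    # I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)
    return (joint_entropy(A + C) + joint_entropy(B + C)
            - joint_entropy(A + B + C) - joint_entropy(C))
```

For independent uniform bits x and y this returns 0; for perfectly correlated bits it returns 1.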

[Figure: an example junction tree with cliques AB, BC, CD, BE, EF, EG (numbered 1–6) and separators B, C, B, E; the variable set V34 is marked.]

Approximation quality guarantee:

Theorem [Narasimhan and Bilmes, UAI05]: If for every separator Sij in the junction tree it holds that the conditional mutual information

I(Vij, Vji | Sij) < ε (call it an ε-junction tree)

then KL(P || Ptree) < ε|V|

Constraint-based learning

Naively:
- for every candidate separator S of size k
  - for every X ⊆ V\S
    - if I(X, V\S\X | S) < ε, add (S,X) to the “list of useful components” L
- find a JT consistent with L
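The naive pass above can be sketched as follows (hypothetical names; `cmi(X, Y, S)` is an oracle for I(X;Y|S)). Note the inner loop over all subsets X is what makes it exponential in |V|:

```python
from itertools import combinations

def naive_useful_components(variables, k, eps, cmi):
    """Naive constraint-based pass: for every candidate separator S of
    size k and every proper subset X of the remaining variables, keep
    (S, X) when I(X, V\\S\\X | S) < eps. Exponential in |V| because X
    ranges over all subsets of V\\S."""
    V = set(variables)
    L = []
    for S in combinations(sorted(V), k):
        rest = sorted(V - set(S))
        for r in range(1, len(rest)):          # all proper subset sizes
            for X in combinations(rest, r):
                Y = set(rest) - set(X)
                if cmi(set(X), Y, set(S)) < eps:
                    L.append((frozenset(S), frozenset(X)))
    return L
```

With |V| = n this enumerates n^k separators times 2^n subsets, matching the "Naïve" column below.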

Complexity:

  Naïve     Our work
  n^k       n^k
  O(2^n)    O(n^{k+3})
  O(2^n)    O(2^{4k+4})
  O(2^n)    O(n^{k+2})

Key theoretical result

Efficient upper bound for I(·, · | ·)

Intuition: Suppose a distribution P(V) can be well approximated by a junction tree with clique size k. Then for every set S ⊆ V of size k and A, B ⊆ V of arbitrary size, to check that I(A, B | S) is small, it is enough to check, for all subsets X ⊆ A, Y ⊆ B of size at most k, that I(X, Y | S) is small.

Computation time is reduced from exponential in |V| to polynomial!

The set S does not have to relate to the separators of the “true” JT in any way!

Theorem 1: Suppose an ε-JT of treewidth k exists for P(V). Suppose the sets S ⊆ V of size k and A ⊆ V\S of arbitrary size are s.t. for every X ⊆ V\S of size k+1 it holds that

I(X∩A, X∩(V\S\A) | S) < δ

then

I(A, V\S\A | S) < |V|(ε+δ)

Complexity: O(n^{k+1}). Polynomial in n, instead of O(exp(n)) for a straightforward computation.
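A sketch of this O(n^{k+1}) check (hypothetical names; `cmi` is again an oracle for I(X;Y|S)): iterate over all size-(k+1) subsets X of V\S and verify the small-set condition, instead of computing one exponential-size mutual information.

```python
from itertools import combinations

def small_subset_bound_holds(V, S, A, k, delta, cmi):
    """Check the hypothesis of Theorem 1: for every X in V\\S with
    |X| = k+1, I(X & A, X \\ A | S) < delta. If this holds, the theorem
    bounds I(A, V\\S\\A | S) < |V|(eps + delta) -- an O(n^{k+1}) check."""
    rest = sorted(set(V) - set(S))
    for X in combinations(rest, k + 1):
        XA = set(X) & set(A)          # part of X on A's side
        XB = set(X) - set(A)          # part of X on the other side
        if XA and XB and cmi(XA, XB, set(S)) >= delta:
            return False
    return True
```

Only subsets X that actually straddle A and its complement contribute a constraint; the rest are skipped.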

[Figure: sets A and B separated by S. I(A, B | S) = ?? — only need to compute I(X, Y | S) for small X ⊆ A and Y ⊆ B!]

Finding almost independent subsets

Question: if S is a separator of an ε-JT, which variables are on the same side of S?
- More than one correct answer possible:
- We will settle for finding one
- Drop the complexity from exponential to polynomial

Example: cliques AB, AC, AD with S={A}: both {B}, {C,D} and {B,C}, {D} are possible partitionings.

Intuition: Consider the set of variables Q={B,C,D}. Suppose an ε-JT (e.g. the one above) with separator S={A} exists s.t. some of the variables in Q ({B}) are on the left of S and the remaining ones ({C,D}) are on the right. Then a partitioning of Q into X and Y exists s.t. I(X, Y | S) < ε.

If no such splits exist, all variables of Q must be on the same side of S.

Alg. 1 (given candidate separator S and threshold δ):
- each variable of V\S starts out as a separate partition
- for every Q ⊆ V\S of size at most k+2 (a fixed size, regardless of n)
  - if min_{X⊂Q} I(X, Q\X | S) > δ
  - merge all partitions that have variables in Q

Complexity: O(n^{k+3}). Polynomial in n.
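Alg. 1 might be implemented with a union-find structure over V\S, merging whenever no low-information split of Q exists. A minimal sketch, assuming an oracle `cmi(X, Y, S)` for I(X;Y|S) (all names hypothetical):

```python
from itertools import combinations

def almost_independent_partitions(V, S, k, delta, cmi):
    """Sketch of Alg. 1: each variable of V\\S starts as its own
    partition; for every Q of size <= k+2, if even the best split of Q
    into X, Q\\X has I(X, Q\\X | S) > delta, merge all partitions that
    touch Q. Union-find with path halving keeps the merging cheap."""
    rest = sorted(set(V) - set(S))
    parent = {v: v for v in rest}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for size in range(2, k + 3):
        for Q in combinations(rest, size):
            best_split = min(cmi(set(X), set(Q) - set(X), set(S))
                             for r in range(1, size)
                             for X in combinations(Q, r))
            if best_split > delta:          # no low-information split
                for v in Q[1:]:             # merge everything touching Q
                    parent[find(v)] = find(Q[0])

    groups = {}
    for v in rest:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=sorted)
```

With an oracle that reports high mutual information only between b and c, the function keeps d separate and merges b with c.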

Theorem (quality of results): If after invoking Alg.1(S, δ=ε) a set U is a connected component, then
- For every Z s.t. I(Z, V\Z\S | S) < ε it holds that U ⊆ Z (we never mistakenly put variables together)
- I(U, V\U\S | S) < nkε (incorrect splits are not too bad)

[Figure: an example run with ε=0.25. Pairwise I(·, · | S) values 0.4, 0.3, 0.2 label the edges; each edge is tested in turn, variables are merged when I exceeds the threshold and left apart when I is too low; the final panel shows the merged end result.]

Constructing a junction tree

Using Alg.1 for every S ⊆ V, obtain a list L of pairs (S,Q) s.t. I(Q, V\S\Q | S) < |V|(ε+δ)

[Figure: an example list of separator–component pairs (S, Q) for the junction tree above, with cliques AB, BC, CD, BE, EF and separators C, E.]

Problem: From L, reconstruct a junction tree. This is non-trivial. Complications:
- L may encode more independencies than a single JT encodes
- Several different JTs may be consistent with the independencies in L

Key insight [Arnborg+al: SIAM-JADM 1987, Narasimhan+Bilmes: UAI05]: In a junction tree, components (S,Q) have a recursive decomposition: a clique in the junction tree plus smaller components from L.

DP algorithm (input: list L of pairs (S,Q)):
- sort L in order of increasing |Q|
- mark (S,Q) ∈ L with |Q|=1 as positive
- for (S,Q) ∈ L with |Q| ≥ 2, in the sorted order
  - if ∃ x∈Q and (S1,Q1), …, (Sm,Qm) ∈ L s.t.
    - Si ⊆ S∪{x} and (Si,Qi) is positive
    - Qi ∩ Qj = ∅
    - ∪_{i=1:m} Qi = Q\{x}
  - then mark (S,Q) positive
    - decomposition(S,Q) = (S1,Q1), …, (Sm,Qm)
- if ∃ S s.t. all (S,Qi) ∈ L are positive
  - return the corresponding junction tree

Look for such recursive decompositions in L!

Deciding whether the ∃ step succeeds is NP-complete, so we use a greedy heuristic.

Greedy heuristic for decomposition search
- initialize the decomposition to empty
- iteratively add pairs (Si,Qi) that do not conflict with those already in the decomposition
- if all variables of Q are covered, success
- May fail even if a decomposition exists
- But we prove that for certain distributions it is guaranteed to work
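The DP pass combined with the greedy cover heuristic described above might be sketched as follows (all names hypothetical; `L` holds (S, Q) pairs as frozensets):

```python
def mark_positive(L):
    """|Q| = 1 components are positive by definition; a larger (S, Q)
    becomes positive if, for some x in Q, the rest of Q can be covered by
    disjoint positive components whose separators fit inside S | {x}.
    The cover search is greedy, so (as noted above) it can miss
    decompositions that do exist."""
    L = sorted(L, key=lambda sq: len(sq[1]))
    positive = set()
    for S, Q in L:
        if len(Q) == 1:
            positive.add((S, Q))
            continue
        for x in Q:                        # candidate clique S | {x}
            covered = set()
            for Si, Qi in L:               # greedily take compatible parts
                if ((Si, Qi) in positive and Si <= S | {x}
                        and Qi <= Q - {x} and not (Qi & covered)):
                    covered |= Qi
            if covered == set(Q) - {x}:
                positive.add((S, Q))
                break
    return positive
```

For example, with L = [({b},{a}), ({b},{c}), ({b},{a,c})], the pair ({b},{a,c}) is marked positive: picking x=a gives clique {a,b}, and the positive singleton ({b},{c}) covers the rest.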

[Figure: an example recursive decomposition involving cliques ABEF and ABCD and their smaller components from L.]

Theoretical guarantees

Intuition: if the intra-clique dependencies are strong enough, we are guaranteed to find a well-approximating JT in polynomial time.

Theorem: Suppose a maximal ε-JT of treewidth k exists for P(V) s.t. for every clique C and separator S of the tree it holds that

min_{X⊂(C\S)} I(X, C\S\X | S) > (k+3)(ε+δ)

Then our algorithm will find a k|V|(ε+δ)-JT for P(V) with probability at least (1−γ), using a number of samples and an amount of time that are polynomial in n and log(1/γ) (and exponential in the treewidth k).

Corollary: Maximal JTs of fixed treewidth s.t. for every clique C and separator S it holds that

min_{X⊂(C\S)} I(X, C\S\X | S) > α for fixed α > 0

are efficiently PAC learnable.

Experimental results

Model quality (log-likelihood on test set). We compare this work with
- ordering-based search (OBS) [Teyssier+Koller:UAI-05]
- the Chow-Liu alg. [Chow+Liu:IEEE-1968]
- the Karger-Srebro alg. [Karger+Srebro:SODA-01]
- local search
- this work + local search (using our algorithm to initialize local search)

Data: Beinlich+al:ECAIM-1988 — 37 variables, treewidth 4, learned treewidth 3
Data: Krause+Guestrin:UAI-05 — 32 variables, treewidth 3
Data: Deshpande+al:VLDB-04 — 54 variables, treewidth 2

Future work
- Extend to non-maximal junction trees
- Heuristics to speed up performance
- Use information about edge likelihoods (e.g. from L1-regularized logistic regression) to cut down on computation

Related work

  Ref.       Model      Guarantees    Time
  [1,2]      tractable  local         poly(n)
  [3]        tree       global        O(n^2 log n)
  [4]        tree mix   local         O(n^2 log n)
  [5]        compact    local         poly(n)
  [6]        all        global        exp(n)
  [7]        tractable  const-factor  poly(n)
  [8]        compact    PAC           poly(n)
  [9]        tractable  PAC           exp(n)
  this work  tractable  PAC           poly(n)

[1] Bach+Jordan:NIPS-02
[2] Choi+al:UAI-05
[3] Chow+Liu:IEEE-1968
[4] Meila+Jordan:JMLR-01
[5] Teyssier+Koller:UAI-05
[6] Singh+Moore:CMU-CALD-05
[7] Karger+Srebro:SODA-01
[8] Abbeel+al:JMLR-06
[9] Narasimhan+Bilmes:UAI-04