
Survey of contemporary Bayesian Network Structure Learning methods

Ligon Liu

September 2015

Ligon Liu (CUNY) Survey on Bayesian Network Structure Learning (slide 1) September 2015 1 / 38

Bayesian Network

Definition. Let V be a set of variables. A Bayesian Network consists of a discrete structure part and a continuous parameter part:

Structure part: a Directed Acyclic Graph (V, E), where V is the set of random variables and E ⊂ V × V.

Parameter part: the conditional probability of every variable given its parents in the DAG.

Example. The Y-shaped Bayesian Network: V = {0, 1, 2, 3}, E = {(0, 2), (1, 2), (2, 3)}

[Figure: the Y-shaped DAG with edges 0 → 2, 1 → 2 and 2 → 3]


Bayesian Network

Example. Conditional Probability Tables

P(x2 | x0, x1):

x0  x1  x2  Pr
0   0   1   0.1
0   0   2   0.25
0   0   3   0.65
0   1   1   0.2
0   1   2   0.35
0   1   3   0.45
1   0   2   0.5
1   0   3   0.5
1   2   1   0.1
1   2   2   0.4
1   2   3   0.5

P(x0):

x0  Pr
0   0.6
1   0.4

P(x1):

x1  Pr
0   0.3
1   0.4
2   0.3

P(x3 | x2):

x2  x3  Pr
1   0   0.2
1   1   0.8
2   0   0.5
2   1   0.5
3   1   1
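As a hedged illustration of how the structure part and the parameter part combine, the sketch below factorises the joint probability of the Y-shaped network as P(x0) · P(x1) · P(x2 | x0, x1) · P(x3 | x2). The dictionary layout and the function name `joint` are assumptions made for this example; only the table entries come from the slide.

```python
# CPT entries copied from the tables above; parent combinations that the
# slide does not list are simply absent from the dictionaries.
p_x0 = {0: 0.6, 1: 0.4}
p_x1 = {0: 0.3, 1: 0.4, 2: 0.3}
p_x2 = {  # (x0, x1) -> {x2: probability}
    (0, 0): {1: 0.1, 2: 0.25, 3: 0.65},
    (0, 1): {1: 0.2, 2: 0.35, 3: 0.45},
    (1, 0): {2: 0.5, 3: 0.5},
    (1, 2): {1: 0.1, 2: 0.4, 3: 0.5},
}
p_x3 = {1: {0: 0.2, 1: 0.8}, 2: {0: 0.5, 1: 0.5}, 3: {1: 1.0}}


def joint(x0, x1, x2, x3):
    """Chain-rule factorisation along the DAG 0 -> 2 <- 1, 2 -> 3.

    Raises KeyError for (x0, x1) combinations missing from the slide's table.
    """
    return (p_x0[x0] * p_x1[x1]
            * p_x2[(x0, x1)].get(x2, 0.0)
            * p_x3[x2].get(x3, 0.0))


example = joint(0, 0, 1, 1)  # 0.6 * 0.3 * 0.1 * 0.8
```

Each factor only looks at a variable and its parents, which is the property the decomposable scores later in the deck rely on.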


BN Structure Learning Problem

Counted indexed relation dataset (V, R, c)

Scoring function s(V, R, c, E), abbreviated s(V, E). Usually s(V, E) is required to be decomposable:

$$ s(V, E) = \sum_{v \in V} S(v, \mathrm{Pa}(v)) $$

Find the DAG (V, E) that maximizes s(V, E).


Clusters of surveyed articles

Conditional Independence (C.I.) constraint-based algorithms [16], [1], [11], [12], [20]

Ordering-based search [10], [19], [15]; Branch and bound [4], [14]; Parent Graph shortest path [22], [5, 7, 6]

Integer Linear Programming and LP relaxation based approximate algorithms [8], [9], [17, 18], [2, 3]


Conditional Independence and its testing

Definition. Let P be a distribution over the variable set V, and let X, Y, M ⊂ V. X and Y are said to be conditionally independent given M if

P(X, Y | M) = P(X | M) · P(Y | M)

Conditional independence statements can be tested on data or inferred from known conditional independences.

Testing (for discrete variables), e.g.

χ² test

G² test

Monte Carlo permutation test

Inferring, e.g. with the semi-graphoid rules:

(1) Symmetry: CI(A, B | C) ⇔ CI(B, A | C)

(2) Decomposition: CI(A, B ∪ D | C) ⇒ CI(A, B | C)

(3) Weak union: CI(A, B ∪ C | D) ⇒ CI(A, B | C ∪ D)

(4) Contraction: CI(A, B | C ∪ D) ∧ CI(A, C | D) ⇒ CI(A, B ∪ C | D)
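To make the testing side concrete, here is a minimal sketch of the G² statistic for discrete data, G² = 2 Σ n(x, y, m) · ln( n(x, y, m) · n(m) / (n(x, m) · n(y, m)) ). The function name `g2_statistic` and the data layout (a list of tuples indexed by variable position) are assumptions of this example. Turning the statistic into a p-value additionally needs the χ² distribution with the appropriate degrees of freedom, which is not shown here.

```python
from collections import Counter
from math import log


def g2_statistic(samples, x, y, cond):
    """G^2 statistic for testing X independent of Y given the variables in cond."""
    n_xyz, n_xz, n_yz, n_z = Counter(), Counter(), Counter(), Counter()
    for row in samples:
        z = tuple(row[i] for i in cond)
        n_xyz[(row[x], row[y], z)] += 1
        n_xz[(row[x], z)] += 1
        n_yz[(row[y], z)] += 1
        n_z[z] += 1
    g2 = 0.0
    for (xv, yv, z), n in n_xyz.items():
        # Expected count of the cell if X and Y were independent given z.
        expected = n_xz[(xv, z)] * n_yz[(yv, z)] / n_z[z]
        g2 += 2.0 * n * log(n / expected)
    return g2


independent = [(a, b) for a in (0, 1) for b in (0, 1)]   # uniform 2x2 table
dependent = [(0, 0), (1, 1)] * 5                          # x and y always agree
g2_indep = g2_statistic(independent, 0, 1, [])
g2_dep = g2_statistic(dependent, 0, 1, [])
```

A statistic near zero is consistent with independence, while a large value speaks against it.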


Conditional Independence and Bayesian Network DAG

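Notation. Let (V, E) be a Directed Acyclic Graph and v ∈ V. The parent set of v is denoted Pa(v), i.e.

Pa(v) = {u | (u, v) ∈ E}

Lemma. P(v | Nd(v)) = P(v | Pa(v)), where Nd(v) ⊇ Pa(v) is the set of non-descendants of v.

Definition. Let (V, E) be a DAG and M ⊂ V. Vertices u, v ∈ V − M are said to be d-separated given M if every path between u and v, taken without regard to edge direction, is blocked by M: the path contains either a non-collider that belongs to M, or a collider that neither belongs to M nor has a descendant in M. Equivalently, after restricting the graph to u, v, M and their ancestors and connecting the parents of every collider, every path between u and v passes through M.

On a Bayesian Network DAG, if vertices u and v are d-separated by M, then u ⊥ v | M holds in the BN probability distribution.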
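A d-separation query can be answered with the moralised ancestral graph: keep u, v, M and their ancestors, connect parents that share a child, drop edge directions, delete M, and check whether u can still reach v. The sketch below implements that criterion; the function name `d_separated` and the edge-list representation are assumptions of this example, and u, v are assumed not to lie in M.

```python
def d_separated(edges, u, v, cond):
    """True if u and v are d-separated by cond in the DAG given as (parent, child) pairs."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)

    # 1. Keep only u, v, cond and all of their ancestors.
    keep, stack = set(), [u, v, *cond]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralise: link every node to its parents and marry co-parents.
    nbrs = {n: set() for n in keep}
    for child in keep:
        ps = sorted(parents.get(child, ()))
        for i, p in enumerate(ps):
            nbrs[p].add(child)
            nbrs[child].add(p)
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)

    # 3. Delete cond and test whether v is still reachable from u.
    blocked, seen, stack = set(cond), {u}, [u]
    while stack:
        node = stack.pop()
        for m in nbrs[node] - blocked - seen:
            seen.add(m)
            stack.append(m)
    return v not in seen


y_edges = [(0, 2), (1, 2), (2, 3)]  # the Y-shaped example network
```

On the Y-shaped network, 0 and 1 are d-separated by the empty set but not by {2} or {3}, because conditioning on the collider 2 or on its descendant 3 opens the path.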


Early Conditional Independence-based algorithms

C.I.-based algorithms rest on the following facts:

For a distribution that is faithful to the DAG, d-separation on the Bayesian Network DAG ⟺ Conditional Independence.

Whether an edge u − v exists can be decided from conditional independences: one independence u ⊥ v | M is enough to rule the edge out. Testing more C.I. triples (u, v, M) may increase confidence.

C.I. tests are computationally expensive to perform on datasets, so the number of tests should be minimized.


Early Conditional Independence-based algorithms

SGS (the first C.I.-based algorithm to learn BN)

PC (PC*, Stable-PC and Conservative-PC)

Grow-Shrink, IAMB, SRS


PC Algorithm brief

PC is an iterative algorithm that learns a Bayesian Network from C.I. tests, treating the graph's edge set E as a variable.

1. Start with the complete undirected graph.

2. Edge elimination: for each pair of variables u, v, run C.I. tests on condition sets of increasing size,
   size 0 (unconditional): | Ø
   size 1: | {i}, | {j}, ...
   size 2: | {i, j}, | {i, k}, | {j, k}, ...
   larger condition sets, up to | V − {u, v},
until a conditional independence u ⊥ v | M is found. Eliminate any edge between two variables that are conditionally independent given some condition set. For any pair of variables, the PC algorithm only tests condition sets made of variables on a path between the pair.

3. Direct the edges with the "unshielded collider rule" and the "loop removal rule".
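The edge-elimination phase can be sketched as follows. This is a minimal illustration, not the reference implementation: it draws condition sets from the current neighbours of u (the usual PC choice) rather than from paths, and the C.I. test is passed in as a function. The names `pc_skeleton` and `y_oracle` are assumptions of this example; `y_oracle` is a hard-coded exact oracle for the Y-shaped network, standing in for a statistical test.

```python
from itertools import combinations


def pc_skeleton(variables, ci_test):
    """Return (undirected edges, separating sets) found by PC-style edge elimination."""
    adj = {v: set(variables) - {v} for v in variables}   # complete undirected graph
    sepset = {}
    size = 0
    while any(len(adj[u]) - 1 >= size for u in variables):
        for u in variables:
            for v in sorted(adj[u]):
                candidates = sorted(adj[u] - {v})
                for cond in combinations(candidates, size):
                    if ci_test(u, v, set(cond)):
                        # One independence is enough to remove the edge.
                        adj[u].discard(v)
                        adj[v].discard(u)
                        sepset[frozenset((u, v))] = set(cond)
                        break
        size += 1
    edges = {frozenset((u, v)) for u in variables for v in adj[u]}
    return edges, sepset


def y_oracle(u, v, cond):
    """Exact C.I. oracle for the DAG 0 -> 2 <- 1, 2 -> 3 (hard-coded assumption)."""
    pair = {u, v}
    if pair == {0, 1}:
        return not (set(cond) & {2, 3})
    if pair in ({0, 3}, {1, 3}):
        return 2 in cond
    return False


skeleton, separators = pc_skeleton([0, 1, 2, 3], y_oracle)
```

The recorded separating sets are what the orientation phase consults: 2 is absent from the separator of 0 and 1, so 0 − 2 − 1 is oriented as a collider.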


Edge direction rules

Unshielded Collider Rule: If two variables u, w are not directly connected but are connected as u → v − w, orient v − w as v → w to avoid forming the unshielded collider u → v ← w.

Loop Removal Rule: If two variables u and v are connected both by an undirected edge and by a directed path from u to v, orient the undirected edge as u → v, since v → u would close a directed cycle.


Advantages of PC based algorithms

1. Speed. On sparse graphs, PC runs in polynomial time.

2. Compared to SGS, propagating C.I. constraints with the semi-graphoid rules saves many C.I. tests. In addition, if a parallel machine is available, redundant C.I. tests can be run to improve confidence [1].

3. Because independences with smaller condition sets M are tested first, the conditional independence tests have higher confidence on high-dimensional datasets.


Robustness of PC based algorithms

Researchers have questioned the robustness of C.I.-based algorithms [11], [1]. Factors that undermine the robustness of PC on high-dimensional datasets:

Sampling loss: when the marginal dataset is small relative to the graph complexity, local C.I. tests are usually less accurate. This is also described as the C.I. relations being unfaithful to the distribution.

C.I. testing order: when an earlier independence test happens to have low confidence, it can prevent later tests from producing contradictory C.I. results with higher confidence.

Two algorithms, Conservative-PC [11] and Stable-PC [1], were designed to overcome the instability with respect to C.I. testing order. They use redundant C.I. tests to detect unfaithfulness and a voting mechanism to find the most likely C.I.


Markov blanket

Definition. Let (V, E) be a DAG. The Markov Blanket of v ∈ V, denoted MB(v), is the set of nodes composed of v's parents, children, and children's parents in the DAG. These are the vertices needed to d-separate v from the rest of the graph.

Theorem. v is d-separated from V − {v} − MB(v) by MB(v).

Definition. Let (V, E) be a DAG. The Moral Graph (V, F) of (V, E) is formed by connecting nodes that have a common child and then making all edges in the graph undirected, i.e.

F = {{u, v} | (u, v) ∈ E or (v, u) ∈ E or ∃w : (u, w), (v, w) ∈ E}


Markov blanket

Corollary. Let (V, E) be a DAG and (V, F) the Moral Graph of (V, E). The Markov Blanket of v ∈ V is the set of neighbors of v in (V, F).

[Figure: a DAG, its Moral Graph, and the Markov Blanket of node E]
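The corollary translates directly into code: build the moral graph, then read off the neighbours of v. The sketch below assumes the DAG is given as a list of (parent, child) pairs; the function names are choices made for this example.

```python
def moral_graph(edges):
    """Undirected edge set F of the moral graph of a DAG given as (parent, child) pairs."""
    parents = {}
    undirected = set()
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
        undirected.add(frozenset((u, v)))      # drop the direction of every edge
    for ps in parents.values():
        for p in ps:
            for q in ps:
                if p != q:
                    undirected.add(frozenset((p, q)))  # marry parents of a common child
    return undirected


def markov_blanket(edges, v):
    """Neighbours of v in the moral graph: parents, children and children's parents."""
    return {w for pair in moral_graph(edges) if v in pair for w in pair if w != v}


y_edges = [(0, 2), (1, 2), (2, 3)]  # the Y-shaped example network
```

For the Y-shaped network, moralisation adds the single edge {0, 1}, so the blanket of node 0 contains its child 2 and its co-parent 1.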


Grow-Shrink and IAMB algorithms

Definition. Let (V, E) be a DAG. The Markov Blanket of v ∈ V is the set of nodes composed of v's parents, children, and children's parents. Equivalently, a Markov Blanket M is a minimal subset of V − {v} that satisfies:

∀U ⊆ V − {v} − M : v ⊥ U | M

Finding every variable's Markov Blanket is equivalent to finding the DAG's Moral Graph.

Grow-Shrink algorithm

IAMB algorithm: Grow-Shrink with a greedy ordering of the condition sets
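A minimal sketch of the Grow-Shrink idea, assuming a C.I. test supplied as a function: the grow phase adds any variable that is still dependent on the target given the current blanket, and the shrink phase removes variables that become independent given the rest. The names `grow_shrink` and `y_oracle` are assumptions of this example; `y_oracle` is a hard-coded exact oracle for the Y-shaped network rather than a statistical test.

```python
def grow_shrink(target, variables, ci_test):
    """Estimate the Markov blanket of target using a conditional independence test."""
    blanket = []
    changed = True
    while changed:                                   # grow phase
        changed = False
        for x in variables:
            if x == target or x in blanket:
                continue
            if not ci_test(target, x, set(blanket)):
                blanket.append(x)
                changed = True
    for x in list(blanket):                          # shrink phase
        if ci_test(target, x, set(blanket) - {x}):
            blanket.remove(x)
    return set(blanket)


def y_oracle(u, v, cond):
    """Exact C.I. oracle for the DAG 0 -> 2 <- 1, 2 -> 3 (hard-coded assumption)."""
    pair = {u, v}
    if pair == {0, 1}:
        return not (set(cond) & {2, 3})
    if pair in ({0, 3}, {1, 3}):
        return 2 in cond
    return False


blankets = {v: grow_shrink(v, [0, 1, 2, 3], y_oracle) for v in [0, 1, 2, 3]}
```

For target 3 the grow phase first picks up 0 and 1, which are dependent on 3 until 2 enters the blanket; the shrink phase then discards them.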


Multiple Markov Blankets

Since the Bayesian Network decomposition of a data distribution is usually not unique, one variable may have multiple different Markov Blankets.

For example, two different sets M1 and M2 may both satisfy the following, with M = M1 and with M = M2:

∀U ⊆ V − {v} − M : v ⊥ U | M

Definition. Let (V, E) be a directed graph. A variable u ∈ V is called Strongly Relevant to v ∈ V if and only if

∀S ⊆ V − {v, u} : P(v | S) ≠ P(v | S ∪ {u})

A variable u ∈ V is called Weakly Relevant to v ∈ V if and only if

∃S ⊆ V − {v, u} : P(v | S) ≠ P(v | S ∪ {u})

and u is not Strongly Relevant to v.


“Selection via Represent Sets” Algorithm

Definition. Let V be the variable set. A representative set of v ∈ V consists of a variable u in v's Markov blanket together with u's corresponding correlated features.

Proposition. u is strongly relevant to v if and only if u belongs to the set of parents and children of variable v in a faithful Bayesian Network.

SRS Algorithm

Step 1: G_v ← Get-PC(v) (PC means Parents & Children); for each u in G_v: G_u ← {u} ∪ Get-PC(u)

Step 2: Search for a group of strongly relevant variables' Parent-Child sets {G_i} ⊆ {G_u | u ∈ SR(v)} such that ∪_i G_i is a best Markov Blanket under the given measure.


Decomposable Scoring Function

Let V be the variables and s(V, E) : P(V × V) → ℝ a scoring function.

s is called decomposable if and only if

$$ s(V, E) = \sum_{v \in V} S(v, \mathrm{Pa}(v)), \quad \text{where } \mathrm{Pa}(v) = \{u \mid (u, v) \in E\} $$

Commonly used decomposable scoring functions: penalized log-likelihood (AIC, BIC), BD (BDe, BDeu)

Define BN learning as finding E for V that maximizes s(V, E).
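As one concrete instance of a decomposable score, the sketch below computes a BIC local term S(v, Pa(v)) = Σ n_jk · ln(n_jk / n_j) − (ln N / 2) · q · (r − 1), where r is the number of states of v and q the number of parent configurations, and sums the local terms into s(V, E). The function names, the row-tuple data layout and the `arity` list are assumptions of this example.

```python
from collections import Counter
from math import log, prod


def bic_local(data, arity, v, parents):
    """BIC local score of variable v with the given parent list."""
    parents = sorted(parents)
    n_jk = Counter((tuple(row[p] for p in parents), row[v]) for row in data)
    n_j = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(n * log(n / n_j[j]) for (j, _), n in n_jk.items())
    n_params = prod(arity[p] for p in parents) * (arity[v] - 1)
    return loglik - 0.5 * log(len(data)) * n_params


def bic_total(data, arity, parent_sets):
    """Decomposable score: the sum of local scores over all variables."""
    return sum(bic_local(data, arity, v, pa) for v, pa in parent_sets.items())


data = [(0, 0), (1, 1)] * 10      # toy data in which x1 always equals x0
arity = [2, 2]
with_parent = bic_local(data, arity, 1, [0])
without_parent = bic_local(data, arity, 1, [])
```

On this toy data the local score of x1 is higher with x0 as parent than with no parent, because the gain in likelihood outweighs the extra penalty.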


Ordering Search

Given an order O of the variables and a decomposable scoring function s, the best DAG consistent with O can be found by choosing, for each variable from sink to source, its best parent set among the variables that precede it in O. This takes time polynomial in the number of variables when the number of parents per variable is bounded. [15]

Modern ordering search algorithms propagate constraints inferred from the scoring function's properties and from background knowledge to reduce the search space.

Algorithms: branch-and-bound search, A* heuristic search


Dynamic Programming of Parent Sets

Lemma [13]. Let v ∈ V, Q ⊆ V, v ∉ Q. Then

$$ \max_{P \subseteq Q} S(v, P) = \max\Bigl( S(v, Q),\ \max_{u \in Q} \max_{P \subseteq Q - \{u\}} S(v, P) \Bigr) $$

This enables dynamic programming to propagate argmax_{P ⊆ Q} S(v, P) over all subsets Q of V − {v}. It is the step that all dynamic programming algorithms use to obtain the optimal parent sets of every variable.
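The recursion in the lemma can be sketched as a table filled in order of increasing |Q|: the best parent set inside Q is either Q itself or the best one inside some Q − {u}. The function `best_parent_sets` and the local score `toy_score` are hypothetical; `toy_score` is an arbitrary deterministic function used only to exercise the recursion.

```python
from itertools import combinations


def best_parent_sets(v, candidates, local_score):
    """best[Q] = (score, P), with P the best-scoring subset of Q, for every Q within candidates."""
    candidates = sorted(candidates)
    best = {}
    for size in range(len(candidates) + 1):          # smaller sets first
        for q in combinations(candidates, size):
            q_set = frozenset(q)
            entry = (local_score(v, q_set), q_set)   # option 1: Q itself
            for u in q_set:                          # option 2: best inside Q - {u}
                sub = best[q_set - {u}]
                if sub[0] > entry[0]:
                    entry = sub
            best[q_set] = entry
    return best


def toy_score(v, parents):
    """Hypothetical local score, used only to exercise the recursion."""
    return -abs(len(parents) - 1) - 0.1 * ((7 * v + sum(3 * u + 1 for u in parents)) % 5)


best = best_parent_sets(0, [1, 2, 3], toy_score)
```

Each table entry costs |Q| lookups instead of a scan over all 2^|Q| subsets.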


The 2006 ordering search algorithm[15]

Variable set: V = {1, ..., N}.

Parent candidate set of variable v:

Pa(v) ⊆ V − {v}

1. Calculate the local scores for all N · 2^{N−1} different (v, Pa(v)) pairs: [s(v, Pa(v)) | v ∈ V, Pa(v) ⊆ V − {v}]

2. Find the optimal parent set Pα(v, G) ⊆ G for every v = 1, ..., N and every candidate set G ⊆ V − {v}, working from the sets that are smaller by one element: Pα(v, G) is the better-scoring of G itself and Pα(v, G − {u}) for the best u ∈ G.

3. Find the best sink for all 2^N variable sets: [sink(W) = argmax_{s ∈ W} skore(W, s) | W ⊆ V]

4. Using the best sinks, find a best ordering of the variables: O_i = sink(V − ∪_{j=i+1}^{N} {O_j})

5. Compute the best network using the above best parents, best sinks and best ordering.

Ordering by Sink Score

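Lemma [15]. Let W ⊆ V. A variable k is the last variable (called the sink) in an optimal order of W if and only if

$$ k = \arg\max_{k' \in W} \Bigl( \max_{P \subseteq W - \{k'\}} S(k', P) + S(W - \{k'\}) \Bigr) $$

where S(W − {k′}) denotes the best score of a network over W − {k′}. This enables dynamic programming for computing the optimal sink. The quantity max_{P ⊆ W − {k}} S(k, P) + S(W − {k}) is called the sink score.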
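Putting best parents and best sinks together gives a compact exact search. The sketch below is a hedged illustration of the dynamic programme in the spirit of [15], not the authors' code: it stores everything in dictionaries keyed by frozensets, which is only practical for a handful of variables. `optimal_network`, `brute_force` and `toy_score` are names invented for this example, and `toy_score` is an arbitrary stand-in for a real local score.

```python
from itertools import combinations, permutations


def optimal_network(variables, local_score):
    """Return (best total score, variable order, parent sets) by sink-based DP."""
    variables = sorted(variables)
    # Best parent set of v inside every candidate set Q that excludes v.
    bps = {}
    for v in variables:
        others = [u for u in variables if u != v]
        for size in range(len(others) + 1):
            for q in combinations(others, size):
                q = frozenset(q)
                entry = (local_score(v, q), q)
                for u in q:
                    entry = max(entry, bps[v, q - {u}], key=lambda e: e[0])
                bps[v, q] = entry
    # Best sink and best total score for every subset W.
    total, sink = {frozenset(): 0.0}, {}
    for size in range(1, len(variables) + 1):
        for w in combinations(variables, size):
            w = frozenset(w)
            sink[w] = max(w, key=lambda k: total[w - {k}] + bps[k, w - {k}][0])
            total[w] = total[w - {sink[w]}] + bps[sink[w], w - {sink[w]}][0]
    # Peel sinks off V to recover the order and the parent sets.
    order, parents, w = [], {}, frozenset(variables)
    while w:
        k = sink[w]
        parents[k] = bps[k, w - {k}][1]
        order.append(k)
        w = w - {k}
    return total[frozenset(variables)], order[::-1], parents


def brute_force(variables, local_score):
    """Exhaustive check over all orderings; feasible only for tiny N."""
    return max(
        sum(max(local_score(v, frozenset(p))
                for r in range(i + 1) for p in combinations(perm[:i], r))
            for i, v in enumerate(perm))
        for perm in permutations(variables)
    )


def toy_score(v, parents):
    """Hypothetical local score, used only to exercise the DP."""
    return -abs(len(parents) - 1) - 0.1 * ((7 * v + sum(3 * u + 1 for u in parents)) % 5)


best_total, best_order, best_parents = optimal_network(range(4), toy_score)
```

The returned parent sets always respect the returned order, so the resulting graph is acyclic by construction.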


Example of Optimal Parents and Optimal Sink

Add graphic example


Optimization techniques

AD-Tree [19]

If U′ ⊂ U and s(v, U′) ≥ s(v, U), remove U from the candidates [19]

Partition parent sets by size: reduces space to $2^n (3/4)^p n^{O(1)}$, where p is the degree of parents [14]


Structural constraints

Optimal DAGs under common scoring functions (MDL, BDeu) share structural constraints [4] that can be used to prune the search.

Hard limits on in-degree

Corollary. Using BIC or AIC as criterion, the optimal graph (V, E) has at most ⌈log₂ N⌉ parents per node.

Optimal parent set score has upper bounds: various heuristics

Optimal parent set has upper bounds


Upper bound of optimal parent set[4]

Theorem. Let N be the total count of (V, R, c) and |L_U| = ∏_{u ∈ U} |L_u|. With BIC or AIC as score function, if

$$ \left| L_{\mathrm{Pa}(v)} \right| > \frac{N}{w} \cdot \frac{\log |L_v|}{|L_v| - 1} $$

then no proper superset of Pa(v) is the parent set of vertex v in an optimal structure.

Theorem. Given a BD score and two parent sets Pa′(v) and Pa(v) for a node v such that Pa′(v) ⊂ Pa(v), let K_{vj} = |L_{Pa(v) | p_j}|. If

$$ S(v, \mathrm{Pa}'(v)) > \sum_{j \,:\, K_{vj} \ge 2} f\bigl(K_{vj}, (\alpha_{vjk})_{\forall k}\bigr) + \sum_{j \,:\, K_{vj} = 1} \log \frac{\alpha_{vjk}}{\alpha_{vj}} $$

then Pa(v) is not an optimal parent set of v.


Upper bound of optimal parent set score[22, 21]

Theorem. Given a BD score S and two parent sets Pa′(v) and Pa(v) for a node v such that Pa′(v) ⊂ Pa(v), let K_{vj} = |L_{Pa(v) | p_j}|. If

$$ S(v, \mathrm{Pa}'(v)) > \sum_{j \,:\, K_{vj} \ge 2} f\bigl(K_{vj}, (\alpha_{vjk})_{\forall k}\bigr) + \sum_{j \,:\, K_{vj} = 1} \log \frac{\alpha_{vjk}}{\alpha_{vj}} $$

then Pa(v) is not an optimal parent set of v.


Order Graph

Definition. Let V = {1, ..., n} be the index set of variables. The order graph (𝒱, ℰ) is the graph whose vertex set 𝒱 is the power set P(V) and whose edge set is ℰ = {(X, Y) | X, Y ∈ P(V), X ⊂ Y, |X| + 1 = |Y|}.

Every order graph is a DAG, because each edge leads to a strictly larger set.

Example. Order graph of V = {1, 2, 3, 4}


Shortest Path formulation of Optimal Parent Set Problem

Let S(v, Pa(v)) be the scoring function term for v ∈ V and its parents, read here as a cost to be minimized. Finding an optimal BN is equivalent to finding a shortest path on the Order Graph (𝒱, ℰ) from Ø to V, if we define the length of edge (X, Y) to be:

d(X, Y) = min_{Pa ⊆ X} S(Y − X, Pa)

Advantages:

Shortest path on a directed graph has well-studied algorithms (Dijkstra, BFBnB, A*, etc.)

It generally does not require pre-generating all graph data; vertices and edges can be computed dynamically.


Shortest Path Example

Add an example of shortest path on order graph <==> optimal parent set


A* best-first search algorithm

A heuristics-enhanced variation of Dijkstra's algorithm that uses a "priority function" to decide the next step of the search.

It finds a shortest path from vertex x to y on (V, E), where the length d(u, v) of each edge (u, v) ∈ E is computable at a fixed time cost.

The "priority function" on vertex v ∈ V:

f(v) = d(x, v) + h(v, y)

d(x, v): the already computed distance from x to v

h(v, y): the heuristically estimated distance from v to y


One A* heuristic function for d(X, Y) = min_{Pa ⊆ X} S(Y − X, Pa)

Definition. Let (𝒱, ℰ) be the order graph of the variable set V and U ⊆ V. The heuristic function used in [22], denoted h(U), is defined by

$$ h(U) = \sum_{v \in V - U} \min_{\mathrm{Pa} \subseteq V - \{v\}} S(\{v\}, \mathrm{Pa}) $$

Remark: h(U) is obtained by using the best parent set for each vertex in V − U, regardless of whether the resulting graph is a DAG.

Theorem. h(U) is monotonic. [22]
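The pieces above (order graph, edge length d, priority f = g + h, and the heuristic h(U)) fit together as in the following sketch. It is a small illustration under stated assumptions rather than the implementation of [22]: edge lengths are recomputed by brute force over subsets, `toy_cost` is an arbitrary integer-valued local cost to be minimized, and all function names are invented for this example.

```python
import heapq
from itertools import combinations, permutations


def subsets(items):
    """All subsets of items, as frozensets."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for c in combinations(items, size):
            yield frozenset(c)


def astar_order_graph(variables, cost):
    """Length of a shortest path from the empty set to V on the order graph."""
    full = frozenset(variables)

    def edge(x, v):                       # d(X, X + {v}) = min over Pa within X
        return min(cost(v, pa) for pa in subsets(x))

    unconstrained = {v: edge(full - {v}, v) for v in full}

    def h(u):                             # every remaining variable picks freely
        return sum(unconstrained[v] for v in full - u)

    start = frozenset()
    g = {start: 0.0}
    frontier = [(h(start), 0, start)]     # (priority f, tie-breaker, vertex)
    tie, closed = 1, set()
    while frontier:
        _, _, x = heapq.heappop(frontier)
        if x == full:
            return g[x]
        if x in closed:
            continue
        closed.add(x)
        for v in full - x:
            y = x | {v}
            cand = g[x] + edge(x, v)
            if cand < g.get(y, float("inf")):
                g[y] = cand
                heapq.heappush(frontier, (cand + h(y), tie, y))
                tie += 1
    return None


def brute_force(variables, cost):
    """Exhaustive check over all orderings; feasible only for tiny variable sets."""
    return min(
        sum(min(cost(v, pa) for pa in subsets(order[:i])) for i, v in enumerate(order))
        for order in permutations(variables)
    )


def toy_cost(v, parents):
    """Hypothetical integer local cost, used only to exercise the search."""
    return (3 * v + 2 * sum(parents)) % 7 + abs(len(parents) - 1)


shortest = astar_order_graph(range(4), toy_cost)
```

Because h never overestimates and is monotonic, the first time the full set V is taken from the queue its distance is optimal.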


Integer valued Multisets(imsets)

Definition. Let V be a set of integers. An Integer Valued Multiset (imset) is a mapping from P(V) to the set of integers Z.

Example. Let a, b, c be integers. An example imset u with V = {a, b, c}:

u = δ{b} − δ{a,b} − δ{b,c} + δ{a,b,c}

δ: the Kronecker delta imset, defined on the following page.


Arithmetic notation of imsets

Definition. Let V be a set of integers and U ⊆ V. The U Kronecker delta imset, denoted δU, is defined by

$$ \delta_U(X) = \begin{cases} 1 & X = U \\ 0 & X \neq U \end{cases} $$

Definition. Let V be a set of integers and let a and b be imsets P(V) → Z. Their sum is defined pointwise:

(a + b)(X) = a(X) + b(X)

Subtraction is defined in the same way.
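These definitions map naturally onto a sparse dictionary from subsets to integers. The sketch below rebuilds the example imset u = δ{b} − δ{a,b} − δ{b,c} + δ{a,b,c}; the helper names `delta`, `add` and `value`, and the concrete choice a, b, c = 1, 2, 3, are assumptions of this example.

```python
def delta(u):
    """Kronecker delta imset of the set u, stored sparsely as {frozenset: integer}."""
    return {frozenset(u): 1}


def add(a, b, sign=1):
    """Pointwise a + sign * b; entries that become zero are dropped."""
    out = dict(a)
    for key, val in b.items():
        out[key] = out.get(key, 0) + sign * val
    return {k: v for k, v in out.items() if v != 0}


def value(imset, x):
    """Value of the imset at the set x (zero where nothing is stored)."""
    return imset.get(frozenset(x), 0)


a, b, c = 1, 2, 3
u = add(add(add(delta([b]), delta([a, b]), -1), delta([b, c]), -1), delta([a, b, c]))
```

Storing only non-zero entries keeps the representation small even though P(V) has 2^|V| elements.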


DAG to {0,1} to [0,1]

Family Variable Vector

{φvU = 1 if Pa(v) = U, 0 otherwise}

Standard Imset

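$$ u_{(V,E)} = \delta_V - \delta_{\emptyset} + \sum_{v \in V} \bigl( \delta_{\mathrm{Pa}(v)} - \delta_{\{v\} \cup \mathrm{Pa}(v)} \bigr) $$

Characteristic Imset

$$ c_{(V,E)}(U) = 1 - \sum_{W :\, U \subseteq W \subseteq V} u_{(V,E)}(W) $$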
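As a worked instance of the two formulas, the sketch below computes the standard imset and the characteristic imset of the Y-shaped network from earlier in the deck. The function names and the parent-dictionary representation are assumptions of this example, and the characteristic imset is evaluated on non-empty subsets only.

```python
from itertools import combinations


def standard_imset(variables, parents):
    """u = delta_V - delta_empty + sum over v of (delta_Pa(v) - delta_({v} + Pa(v)))."""
    u = {}

    def bump(s, val):
        s = frozenset(s)
        u[s] = u.get(s, 0) + val

    bump(variables, 1)
    bump((), -1)
    for v in variables:
        bump(parents.get(v, ()), 1)
        bump({v, *parents.get(v, ())}, -1)
    return {s: val for s, val in u.items() if val}


def characteristic_imset(variables, parents):
    """c(U) = 1 - sum of u(W) over all W that contain U, for non-empty U."""
    u = standard_imset(variables, parents)
    c = {}
    items = sorted(variables)
    for size in range(1, len(items) + 1):
        for s in combinations(items, size):
            s = frozenset(s)
            c[s] = 1 - sum(val for w, val in u.items() if s <= w)
    return c


y_parents = {2: {0, 1}, 3: {2}}          # the Y-shaped network 0 -> 2 <- 1, 2 -> 3
u_y = standard_imset([0, 1, 2, 3], y_parents)
c_y = characteristic_imset([0, 1, 2, 3], y_parents)
```

In this example c takes the value 1 on {0, 2}, {1, 2}, {2, 3} and {0, 1, 2}, and 0 on the non-adjacent pair {0, 1}, so the adjacencies and the collider at 2 can be read off the {0, 1} values.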


Linear Program of Family Variable Vector

Family Variable Vector

{φvU = 1 if Pa(v) = U, 0 otherwise}


References

[1] Diego Colombo and Marloes H. Maathuis. Order-independent constraint-based causal structure learning. The Journal of Machine Learning Research, 15(1):3741–3782, 2014.

[2] James Cussens. Integer programming for Bayesian network structure learning. 2014.

[3] James Cussens, David Haws, and Milan Studeny. Polyhedral aspects of score equivalence in Bayesian network structure learning. arXiv preprint arXiv:1503.00829, 2015.

[4] Cassio P. De Campos and Qiang Ji. Efficient structure learning of Bayesian networks using constraints. The Journal of Machine Learning Research, 12:663–689, 2011.

[5] Xiannian Fan, Brandon Malone, and Changhe Yuan. Finding optimal Bayesian network structures with constraints learned from data. In Proceedings of the 30th Annual Conference on Uncertainty in Artificial Intelligence (UAI-14), 2014.

[6] Xiannian Fan and Changhe Yuan. An improved lower bound for Bayesian network structure learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[7] Xiannian Fan, Changhe Yuan, and Brandon Malone. Tightening bounds for Bayesian network structure learning. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[8] Yuhong Guo and Dale Schuurmans. Convex structure learning for Bayesian networks: Polynomial feature selection and approximate ordering.

[9] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning Bayesian network structure using LP relaxations. In International Conference on Artificial Intelligence and Statistics, pages 358–365, 2010.

[10] Mikko Koivisto and Kismat Sood. Exact Bayesian structure discovery in Bayesian networks. The Journal of Machine Learning Research, 5:549–573, 2004.

[11] Jan Lemeire, Stijn Meganck, and Francesco Cartella. Robust independence-based causal structure learning in absence of adjacency faithfulness. on Probabilistic Graphical Models, page 169, 2010.

[12] Dimitris Margaritis. Learning Bayesian network model structure from data. PhD thesis, US Army, 2003.

[13] Sascha Ott and Satoru Miyano. Finding optimal gene networks using biological constraints. Genome Informatics, 14:124–133, 2003.

[14] Pekka Parviainen and Mikko Koivisto. Bayesian structure discovery in Bayesian networks with less space. In International Conference on Artificial Intelligence and Statistics, pages 589–596, 2010.

[15] Tomi Silander and Petri Myllymaki. A simple approach for finding the globally optimal Bayesian network structure. arXiv preprint arXiv:1206.6875, 2006.

[16] Peter Spirtes, Clark N. Glymour, and Richard Scheines. Causation, prediction, and search, volume 81. MIT Press, 2000.

[17] Milan Studeny and David Haws. On polyhedral approximations of polytopes for learning Bayes nets. Technical report, 2011.

[18] Milan Studeny and David Haws. Learning Bayesian network structure: Towards the essential graph by integer linear programming tools. International Journal of Approximate Reasoning, 55(4):1043–1071, 2014.

[19] Marc Teyssier. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In UAI, 2005.

[20] Kui Yu, Xindong Wu, Zan Zhang, Yang Mu, Hao Wang, and Wei Ding. Markov blanket feature selection with non-faithful data distributions. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 857–866. IEEE, 2013.

[21] Changhe Yuan and Brandon Malone. An improved admissible heuristic for learning optimal Bayesian networks.

[22] Changhe Yuan, Brandon Malone, and Xiaojian Wu. Learning optimal Bayesian networks using A* search. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 2186, 2011.

