Advances in Probabilistic Graphical Models
Margarida Nunes de Almeida Rodrigues de Sousa
Thesis to obtain the Master of Science Degree in
Matemática e Aplicações
Supervisors: Prof. Alexandra Sofia Martins de Carvalho, Prof. Mário Alexandre Teles de Figueiredo, Prof. Paulo Alexandre Carreira Mateus
Examination Committee
Chairperson: Prof. Maria Cristina De Sales Viana Serôdio Sernadas
Supervisor: Prof. Alexandra Sofia Martins de Carvalho
Supervisor: Prof. Mário Alexandre Teles de Figueiredo
Member of the Committee: Prof. Paulo Alexandre Carreira Mateus
October 2017
Acknowledgments
I want to thank my supervisors, Alexandra Carvalho, Mário Figueiredo and Paulo Mateus, for their important support throughout this journey.
I want to thank Mae, Isabel, Ana, Pai and Pedro for always faithfully believing in me and for giving me
strength.
I would also like to thank Reuma.pt for providing the Rheumatoid arthritis data.
Resumo
A descrição de comprimento mínimo (MDL) é um critério de selecção de modelos bastante conhecido, baseado em teoria da informação. O MDL escolhe o modelo que minimiza o comprimento da descrição dos dados e do modelo. Contudo, Rissanen observou que este critério é redundante, no sentido em que não tem em conta que os parâmetros do modelo são enviados antecipadamente para o receptor. Portanto, só os conjuntos de dados compatíveis com estes parâmetros devem ser considerados, o que torna possível comprimir mais a descrição dos dados. Rissanen propôs um novo critério, chamado Descrição Completa de Comprimento Mínimo (CMDL), que resolve este problema.
Nesta tese, consideramos modelos de redes de Bayes e implementamos um algoritmo de aprendizagem usando o CMDL como função de pontuação, o algoritmo ganancioso de escalada como procedimento de procura e o conjunto das redes de cobertura como espaço de procura. Analisamos o desempenho deste novo critério de selecção usando dados sintéticos e dados reais.
Na segunda parte desta tese, propomos um novo algoritmo de aprendizagem de redes de Bayes dinâmicas k-estruturadas consistentes. O algoritmo proposto aumenta exponencialmente o espaço de procura das estruturas de dependências intra-temporais das redes de transição, quando comparado com o estado da arte (estruturas em árvore). Analisamos o desempenho deste novo algoritmo usando dados sintéticos e reais.
Abstract
The Minimum Description Length (MDL) is a well-known information-theoretical model selection criterion, based on a two-part asymptotic code. MDL selects the model that minimizes the description length of both the data and the model. However, Rissanen observed that this criterion is redundant, in the sense that it does not take into account that the parameters of the model were sent beforehand to the receiver. Therefore, only the data sets compatible with these parameters should be considered, and it becomes possible to further compress the data. Rissanen proposed a new criterion called Complete Minimum Description Length (CMDL) that solves this issue.
In this thesis, we consider Bayesian network models and implement a score-based learning algorithm
using the CMDL as a scoring function, the greedy hill climber as the search procedure and with the set
of covering networks as the search space. We analyze the performance of this model selection criterion,
using synthetic and real data.
In a second part, we propose a new polynomial-time algorithm for learning dynamic Bayesian net-
works. The proposed algorithm increases exponentially the search space for the intra-slice connections
of the transition networks. This algorithm considers the set of consistent k-graphs, instead of the state-
of-the-art tree-network structures.
Keywords: minimum description length, complete minimum description length, compression,
model selection, learning Bayesian networks, dynamic Bayesian networks
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction 1
2 Bayesian Networks 3
2.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Learning Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Parameter Estimation in Bayesian Networks . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Scoring Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Dynamic Bayesian Networks 26
3.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Learning Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Proposed Method 31
5 Experimental Results 35
5.1 Learning Bayesian Networks with CMDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Learning cDBNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6 Conclusions 55
Bibliography 57
Chapter 1
Introduction
We are in the big data era: the amount of data available has increased exponentially in the last decade. Therefore, intelligent and efficient ways of analyzing and learning from these large amounts of data become crucial. Machine learning is defined as the set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty [31].
Bayesian networks are probabilistic graphical models that represent, in a compact way, relations between random variables [34]. They give rise to generative classifiers, as they model the class-conditional probability density functions. They are used in a large variety of real-world applications such as diagnosis, forecasting, automated vision, sensor fusion, manufacturing control, program debugging, information retrieval and troubleshooting of system failures [25].
Given a data set and a set of possible models, the problem of deciding which model to select arises. In this thesis we focus on information-theoretical model selection approaches. These criteria are based on a measure called description length, which expresses the compression achieved when transmitting a given data set. The minimum description length (MDL) [37] is based on a two-part asymptotic code. In the MDL approach, the description length of encoding a data set with a given model is the sum of the length of encoding the data set and the length of encoding the model. Rissanen observed that the MDL is redundant [38]: as the parameters are sent beforehand, only the data sets compatible with these parameters should be considered in the second part of the description. Rissanen proposed a new criterion, called Complete Minimum Description Length (CMDL), that takes this fact into account [37]. In this thesis we implement an algorithm for learning Bayesian network models using the CMDL. We analyze its performance, in terms of learning and compression achieved, using synthetic and real data.
Furthermore, dynamic Bayesian networks (DBN) model stochastic processes [32]. They are used in a large variety of applications, such as protein sequencing [47], speech recognition [48] and clinical forecasting [45]. In the second part of this thesis, we propose a new polynomial-time algorithm for learning DBNs that exponentially increases the search space for the intra-slice connections of the transition networks. We consider the search space for these connections to be the set of consistent k-graphs, whereas the current state-of-the-art algorithm takes the search space to be tree graphs [28]. We analyze the performance of
the proposed algorithm using synthetic and real data.
Claim of contributions
The main contributions of this thesis are:
1. A review of Bayesian networks, dynamic Bayesian networks and their learning algorithms.
2. An implementation of a score based learning algorithm for Bayesian Networks with a new proposed
scoring function, Complete Minimum Description Length. The algorithm was made freely available
at https://margaridanarsousa.github.io/learn_cmdl/.
3. A new polynomial-time algorithm for learning consistent k-graph dynamic Bayesian networks (cDBNs). The algorithm was made freely available at https://margaridanarsousa.github.io/learn_cDBN/.
4. An analysis of the developed methods on simulated and real data, including comparisons to other
methods and to results obtained in other publications.
Thesis outline
In Chapter 2 we start by defining basic concepts on Bayesian networks. We introduce the problem of
learning Bayesian networks in Section 2.2, which has two variants: parameter estimation, defined in
Subsection 2.2.1, and structure learning, defined in Subsection 2.2.2. In order to learn the structure of
Bayesian networks it becomes necessary to specify a scoring function, a search space and a search
procedure. In Subsection 2.2.3 we start by introducing basic coding and data compression concepts
and then describe information theoretical scoring functions.
In Section 3.1 we introduce dynamic Bayesian networks (DBN), that are extensions of Bayesian
networks that evolve in time. We describe the previously proposed methods for learning DBN in Section
3.2. We propose a new learning algorithm for DBN in Chapter 4.
Furthermore, in Chapter 5 we present the experimental results. In Section 5.1 the results regarding
the implementation of the score-based learning algorithm using the Complete Minimum Description
Length are analyzed and discussed. In Section 5.2 the results of the proposed learning algorithm for
dynamic Bayesian networks are presented and discussed.
Finally, in Chapter 6 we make some final remarks and propose directions for future work.
Chapter 2
Bayesian Networks
2.1 Basic Concepts
Let X denote a discrete random variable that takes values over a finite set 𝒳. Furthermore, let X = (X1, . . . , Xn) represent an n-dimensional random vector, where each Xi takes values in 𝒳i = {xi1, . . . , xiri}, and let P(x) denote the probability that X takes the value x.
A Bayesian network encodes the joint probability distribution of a set of n random variables X1, . . . , Xn [34].
The underlying structure of a Bayesian network is based on a directed graph, therefore they are also
known as directed graphical models. They are also named belief networks, generative models or causal
models. Suppose that each of these n random variables has K states; using the chain rule,
P(x) = P(xn | x1, . . . , xn−1) · · · P(x2 | x1) P(x1). (2.1)
In order to determine the joint distribution, we would need to estimate K^n − 1 probabilities, one for each of the K^n possible values that (X1, . . . , Xn) may take. Therefore, computing the joint probability requires space exponential in the number of random variables n. Assuming certain independence properties, the joint probability can be represented in a more compact way and requires fewer parameters.
Definition 1 (Conditional Independence). Let X,Y and Z be sets of random variables. X is said to be
conditionally independent of Y given Z if P (x|y, z) = P (x|z), for all x,y and z. Let X ⊥ Y|Z denote that
X is conditionally independent of Y given Z.
Definition 2 (Bayesian Network). An n-dimensional Bayesian Network (BN) is a triple B = (X, G, Θ), where:
• X = (X1, . . . , Xn) and each random variable Xi takes values in the set {xi1, . . . , xiri}, where xik denotes the k-th value that Xi takes.
• G = (X, E) is a directed acyclic graph (DAG) with nodes in X and edges E representing direct
dependencies between the nodes.
Let ΠXi denote the set of parents of Xi in the network G. Define an ordering for the set of all possible configurations of ΠXi, {wi1, . . . , wiqi}, where qi = ∏_{Xj ∈ ΠXi} rj is the total number of configurations and wij corresponds to the j-th configuration of ΠXi.
• Each random variableXi has an associated conditional probability distribution (CPD) or local prob-
abilistic model with parameters:
Θijk = PB(Xi = xik|ΠXi = wij). (2.2)
The set Θ encodes the parameters {Θijk}, for i ∈ {1, . . . , n}, j ∈ {1, . . . , qi} and k ∈ {1, . . . , ri}, of the network G.
Let Nij be the number of instances in the data D where the variables ΠXi take their j-th configuration wij. Observe that Xi | ΠXi ∼ Multinomial(Nij, θij1, . . . , θijri) for i ∈ {1, . . . , n} and j ∈ {1, . . . , qi}, i.e., the distribution of a node Xi conditioned on a parent configuration of ΠXi is multinomial.
Example 3 (Medical Diagnosis). Consider the Bayesian network depicted in Figure 2.1, representing
two diseases, Pneumonia and Flu. Both diseases cause Fever, however the XRay only shows signs in
the case of Pneumonia and the Muscular Pain is only caused by a Flu. Consider the following notation:
Pneumonia → Pn, Flu → Fl, Fever → Fe, Xray → Xr and Muscular Pain → Mp. All of the random
variables are binary. Consider P(Pn) = 0.05 and P(Fl) = 0.02; the CPD tables are depicted in Figure 2.2. The number of rows of each table is the number of parent configurations, and each row represents the distribution of the random variable given that parent configuration, which is a multinomial distribution.
Figure 2.1: A Bayesian network representing dependencies between diagnosis and diseases.
(a) Pn, Fl | P(Fe | Pn, Fl): (1, 0) | 0.8; (0, 1) | 0.6; (0, 0) | 0.2; (1, 1) | 0.01
(b) Pn | P(Xr | Pn): 0 | 0.8; 1 | 0.6
(c) Fl | P(Mp | Fl): 0 | 0.8; 1 | 0.6
Figure 2.2: CPDs Tables of Example 3.
A BN B induces a unique joint probability distribution over X given by:
P_B(X1, . . . , Xn) = ∏_{i=1}^{n} P_B(Xi | ΠXi). (2.3)
Intuitively the graph of a BN can be viewed as a network structure that provides the skeleton for
representing the joint probability compactly in a factorized way, and making inferences in the probabilistic
graphical model provides the mechanism for gluing all these components back together in a probabilistic
coherent manner [26].
Definition 4 (Markov Local Assumptions). Given a BN with network structure G over random variables X1, . . . , Xn, G encodes the following set of conditional independence assumptions:
Xi ⊥ NonDescendants(Xi) | ΠXi, for all random variables Xi, (2.4)
where NonDescendants(Xi) are the variables in G that are non-descendants of Xi. These assumptions are called the Markov local assumptions.
Example 5. The BN depicted in Figure 2.1 encodes the following Markov local assumptions: Pn ⊥ Fl | ∅, Fl ⊥ Pn | ∅, Fe ⊥ {Xr, Mp} | {Pn, Fl}, Xr ⊥ {Mp, Fe, Fl} | Pn and Mp ⊥ {Xr, Pn, Fe} | Fl.
Bayesian networks reduce the number of values that must be determined when computing the joint probability P_B(X1, . . . , Xn) to a number that is exponential only in max_{i ∈ {1,...,n}} |ΠXi|.
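To make this reduction concrete, the following Python sketch (a minimal illustration; the helper names are ours) compares the K^n − 1 probabilities of the full joint distribution with the ∑_i (ri − 1)qi parameters of the factorization induced by the structure of Figure 2.1.

def full_joint_parameters(r):
    # r1 * ... * rn - 1 free probabilities for the full joint distribution.
    total = 1
    for ri in r:
        total *= ri
    return total - 1

def bn_parameters(r, parents):
    # Sum over nodes of (r_i - 1) * q_i, where q_i is the number of parent configurations.
    total = 0
    for i, ri in enumerate(r):
        qi = 1
        for p in parents[i]:
            qi *= r[p]
        total += (ri - 1) * qi
    return total

# Five binary variables ordered as (Pn, Fl, Fe, Xr, Mp), with Fe <- {Pn, Fl}, Xr <- Pn, Mp <- Fl.
r = [2, 2, 2, 2, 2]
parents = [[], [], [0, 1], [0], [1]]
print(full_joint_parameters(r))   # 31
print(bn_parameters(r, parents))  # 10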
Informally, two Bayesian networks are equivalent if they encode the same joint probability distribution.
The next theorem provides sufficient and necessary conditions for the equivalence of two Bayesian
networks.
Definition 6 (v-structure). In a directed acyclic graph, a v-structure is a local dependency X → Y ← Z.
Example 7. The edges E = {(Pn, Fe), (Fl, Fe)} in the graph represented in Figure 2.1 form a v-structure.
The skeleton of any DAG is the undirected graph resulting from ignoring the direction of every edge.
Theorem 8 (From [2]). Two directed acyclic graphs are equivalent if and only if they have the same
skeleton and the same v-structures.
Since tree networks have no v-structures, two trees with the same edges are equivalent, indepen-
dently of the direction of the edges.
2.2 Learning Bayesian Networks
Learning a Bayesian network has two variants: parameter learning and structure learning. When learning the parameters, we assume the underlying graph G is given, and our goal is to estimate the set of parameters Θ of the network. When learning the structure, the goal is to find a structure G, given only the training data. We assume the data is complete, i.e., each instance is fully observed (there are no missing or hidden values), and that the training set D is given by a set of N i.i.d. instances, D = {x1, . . . , xl, . . . , xN}.
2.2.1 Parameter Estimation in Bayesian Networks
There are two approaches to estimating the Bayesian network parameters: maximum likelihood estimation and Bayesian variants. Both approaches are based on the likelihood function. We will begin by describing the maximum likelihood estimation approach. The likelihood of a set of parameters ΘG, given an underlying graph G, is:
L(D, ΘG) = P(D | ΘG) = ∏_{l=1}^{N} P(x_l | ΘG).
Considering the Markov local independence assumptions, and that the sets of parameters θ_{Xi|ΠXi} are disjoint for i ∈ {1, . . . , n}, the likelihood can be decomposed into the product of the local likelihood functions of each node Xi, and becomes:
L(D, ΘG) = ∏_{i=1}^{n} Li(θ_{Xi|ΠXi}, D),
where
Li(θ_{Xi|ΠXi}, D) = ∏_{l=1}^{N} P(x_{il} | Π_{x_{il}}, θ_{Xi|ΠXi}),
and x_{il} denotes the observed value of the variable Xi in instance l of D and Π_{x_{il}} denotes the observed parent configuration of Xi in instance l. In this case, our problem reduces to maximizing each local likelihood Li independently.
Let Nijk be the number of instances in the data set D where the variable Xi takes the value xik and its parent set ΠXi takes the configuration wij. Denote by Nij the number of instances in D where ΠXi takes the configuration wij,
Nij = ∑_{k=1}^{ri} Nijk.
Let N be the total number of instances in the data D. Assuming that Xi | ΠXi ∼ Multinomial(Nij, θij1, . . . , θijri), the local likelihood of Xi simplifies to:
Li(θ_{Xi|ΠXi}, D) = ∏_{j=1}^{qi} ∏_{k=1}^{ri} θijk^{Nijk}. (2.5)
Our goal is to maximize Li(θ_{Xi|ΠXi}, D) for all i ∈ {1, . . . , n}, subject to the constraints
∑_{k=1}^{ri} θijk = 1, for all j ∈ {1, . . . , qi}. (2.6)
Using the general result for the maximum likelihood estimate of a multinomial distribution, we obtain the estimate
θijk = Nijk / Nij, (2.7)
which is known as the observed frequency estimate (OFE). The maximum likelihood estimate, however, overfits the training data in many situations. Moreover, this estimate assigns probability zero to events that are extremely unlikely, but not impossible.
In a Bayesian approach a regularization term is added to the parameters, which gives rise to a significantly more robust estimator. Consider a Dirichlet prior distribution over the parameters θijk with hyperparameters αijk, θijk ∼ Dir(αijk); as this distribution is the conjugate prior of a multinomial distribution, the posterior distribution is again a Dirichlet distribution with hyperparameters Nijk + αijk, θijk | D ∼ Dir(Nijk + αijk), which yields the following estimate:
θijk = (Nijk + αijk) / ∑_k (Nijk + αijk). (2.8)
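The following Python sketch computes both estimates, (2.7) and (2.8), from complete data; the function and variable names are ours and purely illustrative, and the code is not the implementation released with this thesis.

from collections import defaultdict

def estimate_parameters(data, parents, alpha=0.0):
    # data    : list of tuples, one value per variable.
    # parents : parents[i] is the list of parent indices of variable i.
    # alpha   : Dirichlet hyperparameter; alpha = 0 gives the OFE (2.7),
    #           alpha > 0 gives the smoothed estimate (2.8).
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    values = defaultdict(set)
    for row in data:
        for i, xi in enumerate(row):
            cfg = tuple(row[p] for p in parents[i])
            counts[i][cfg][xi] += 1
            values[i].add(xi)
    theta = {}
    for i in counts:
        theta[i] = {}
        for cfg, cnt in counts[i].items():
            denom = sum(cnt.values()) + alpha * len(values[i])
            theta[i][cfg] = {v: (cnt.get(v, 0.0) + alpha) / denom for v in values[i]}
    return theta

# The dataset of Figure 2.6(b), with the v-structure X1 -> X3 <- X2.
data = [(0, 0, 0), (0, 1, 1), (0, 1, 1), (1, 1, 1)]
parents = [[], [], [0, 1]]
print(estimate_parameters(data, parents, alpha=0.5)[2])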
2.2.2 Structure Learning
The main methods proposed to solve the problem of learning the structure of a general Bayesian network are: independence tests or constraint-based approaches [41], Bayesian model averaging approaches [19], and search-based methods [10, 24].
The constraint-based approach views Bayesian networks as encodings of conditional dependencies and independencies, and tries to test and infer these conditions in the data in order to construct a network. The Bayesian model averaging approach does not look for a single network, but rather tries to define a probability distribution over all possible structures and to average the predictions over all networks.
The most common method is the search-based method, and we will focus on this approach. The space of all Bayesian networks with n nodes has a superexponential number of structures, 2^{O(n²)}. Learning general Bayesian networks is an NP-hard problem: Cooper [9] proved that inference in a general Bayesian network is NP-hard. Later, Dagum and Luby proved that even finding an approximate solution is NP-hard [13]. Chow and Liu [8] and Edmonds [16] use an optimal branching algorithm that finds the optimal Bayesian network when the search space is constrained to tree graphs. Cooper [10] proposes a polynomial-time algorithm for learning Bayesian networks consistent with a given order and a bounded in-degree¹. Chickering [7] proved that even constraining to graphs with in-degree at most 2 is NP-hard. Therefore, we resort to heuristic search techniques. Score-based methods reduce the problem of learning a Bayesian network to a model selection problem, viewing a BN as a statistical prediction model. Define a scoring function φ : S × X → R that measures how well a Bayesian network B fits the data D (where S denotes the search space). The problem reduces to an optimization problem: given a scoring function, a data set, a search space and a search procedure, find the network that maximizes this score. However, the heuristic-search method is not guaranteed to find the optimal network. We will consider as the search space the set of all Bayesian networks with n variables, denoted by Bn.
Definition 9 (Learning a Bayesian Network). Given data D = {x1, . . . , xN} and a scoring function φ, the problem of learning a Bayesian network is to find a Bayesian network B ∈ Bn that maximizes the value φ(B, D).²
Thus, search-based methods can be improved by finding new scoring criteria or new search methods. In this work we implement a score-based learning algorithm using a new scoring function. In addition, we propose a new search procedure for the dynamic counterpart of Bayesian networks.
¹ The in-degree of a node Xi is |ΠXi|.
² Note that, for clarity, in Subsection 2.2.3 and in Section 5.1 this problem is defined as the minimization of −φ(B, D).
As was mentioned in the beginning of this Section, if we restrict the search space S to tree networks
or networks with known ordering over the variables and bounded in-degree, it is possible to obtain a
global optimum solution for the structure learning problem. We will now describe the search procedures
for the mentioned search spaces.
The generalization of the Chow-Liu algorithm [8] to any score-equivalent and decomposable scoring function, proposed by Heckerman et al. [24], is depicted in Algorithm 1. It starts by building a complete weighted undirected graph, such that the weight of the edge between Xi and Xj is φj(Xi, D) − φj(∅, D). Then, it is possible to determine a maximal weighted spanning tree in polynomial time. An arbitrary node is chosen to be the root of the tree and the direction of every edge is set to be outward from it.
Algorithm 1 Learning tree Bayesian networks, for any decomposable and score equivalent φ-score
1: Compute φj(Xi, D) − φj(∅, D) between each pair of attributes Xi and Xj, with i ≠ j and i, j ≤ n.
2: Build a complete undirected graph with attributes X1, . . . , Xn as nodes. Annotate the weight of an
edge connecting Xi and Xj by the value computed in the previous step.
3: Build a maximal weight (undirected) spanning tree.
4: Transform the resulting undirected tree to a directed one by choosing a root variable and setting the
direction of all edges to be outward from it and return the resulting tree.
Heckerman also proposes a polynomial-time algorithm for the case of scoring functions that are decomposable but not score equivalent, represented in Algorithm 2 [24]. In this case, the edge Xi → Xj may have a different score from the edge Xj → Xi, and so one must build a directed spanning tree. Edmonds' algorithm [16] finds an optimal spanning tree, given a root. By ranging over all possible roots, it is possible to find an optimal spanning tree in polynomial time.
Algorithm 2 Learning tree Bayesian networks, for any decomposable φ-score
1: Compute φj(Xi, D) − φj(∅, D) for each edge from Xi to Xj, with i ≠ j and i, j ≤ n.
2: Build a complete directed graph with attributes X1, . . . , Xn as nodes. Annotate the weight of an edge
connecting Xi and Xj by the value computed in the previous step.
3: Build a maximal weight directed spanning tree.
In the case where the BN is consistent with a given ordering and has bounded in-degree k, an algorithm named K2 was proposed, represented in Algorithm 3 [10]. For each node Xi the algorithm tests all parent sets among the subsets of {X1, . . . , Xi−1} with at most k elements and selects the optimal one. The algorithm is polynomial in the number of variables, but exponential in k [10].
A polynomial-time algorithm to learn Bayesian networks with underlying consistent k-graphs (CkG) was proposed and is represented in Algorithm 4 [6]. The set of networks consistent with the optimal branching and with bounded in-degree is exponentially larger, in the number of variables, than the set of trees. In Figure 2.4 the inclusion relations of trees, polytrees³ and CkG graphs are represented.
3A polytree is a DAG such that the underlying undirected graph is a tree.
Algorithm 3 K2 algorithm
input: A set of nodes X1, . . . , Xn, an ordering on the nodes, an upper bound on the in-degree k, a
data set D and a scoring function φ.
output: The optimal parent set for each node.
1: for each node Xi, following the given ordering, do
2: for each subset S of {X1, . . . , Xi−1} with at most k nodes do
3: Compute φi(S, D).
4: if φi(S, D) is the maximal score for Xi then
5: Set ΠXi to S.
6: end if
7: end for
8: end for
9: Output the directed graph such that the parents of a node Xi are ΠXi.
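A direct way to realize this exhaustive search is sketched below in Python; score(i, parent_set) stands for the local score φi(S, D) and is assumed to be supplied by the caller (the code is illustrative, not the thesis implementation).

from itertools import combinations

def learn_with_order(order, k, score):
    # For each node, test every subset of its predecessors of size at most k
    # and keep the one with the highest local score.
    parent_sets = {}
    for pos, xi in enumerate(order):
        predecessors = order[:pos]
        best_set, best_score = (), score(xi, ())
        for size in range(1, k + 1):
            for subset in combinations(predecessors, size):
                s = score(xi, subset)
                if s > best_score:
                    best_set, best_score = subset, s
        parent_sets[xi] = best_set
    return parent_sets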
Definition 10 (k-graph). A k-graph is a graph where each node has in-degree at most k.
Definition 11 (Consistent k-graph). Given a directed tree R over a set of nodes V , a graph G = (V,E)
is said to be a consistent k-graph (CkG) w.r.t R if it is a k-graph and for any edge in E from Xi to Xj the
node Xi is in the path from the root of R to Xj . We denote by CkR the set of all CkG’s w.r.t. R.
Figure 2.3: Network structure for Example 12.
Example 12. Considering the optimal branching represented in Figure 2.3, we observe that adding the edge (X1, X5) gives rise to a consistent 2-graph, whereas adding the edge (X2, X4) does not.
Figure 2.4: Inclusion relations of trees, CkG and polytree graphs [6].
The algorithm for learning CkG structures starts by determining the optimal branching, and then adds the relevant edges that could not be included due to the tree restriction, choosing the optimal subset of ancestors S as the parent set of each node Xi.
Algorithm 4 Learning CkG networks
1: Run a deterministic algorithm Aφ that outputs an optimal branching R.
2: for each node Xi in R do
3: Compute the set αi of ancestors of i, that is, the set of nodes connecting the root of R and Xi.
4: for each subset S of αi with at most k nodes do
5: Compute φi(S,D).
6: if φi(S,D) is the maximal score for Xi then
7: Set ΠXito S.
8: end if
9: end for
10: end for
11: Output the directed graph such that the parents of a node Xi are ΠXi.
For general Bayesian networks, the heuristic search procedure attempts to find the optimal BN,
but is not guaranteed to. The greedy hill-climber search (GHC) is the most common procedure and
Heckerman et al. found it to yield the best combination between accuracy and efficiency. We will define
the neighborhood of a given structure in DAG-space to be all networks we can reach by applying one of
the following operations:
• add an edge;
• delete an edge;
• flip an edge.
The GHC starts with an initial network, which can be empty, random or constructed using prior knowledge. At each search step it moves through the neighborhood of the current network and selects the network with the largest improvement in the score, which becomes the current network. The process is repeated until there is no network in the neighborhood that improves the current score. There are a few
extensions of the GHC:
• TABU list: keeps track of recently visited structures and avoids them, i.e., it is not considered legal to move to any of these structures in the next search steps. This strategy helps avoiding getting stuck in a local maximum [26].
• Random restarts: once stuck, apply random operations (add, remove or flip an edge) and restart the greedy search. This strategy helps escaping a basin of attraction, instead of moving from local maximum to local maximum [26].
The GHC with the extensions described is represented in Algorithm 5 [26]. We will now introduce the
concept of scoring criterion in more detail.
Algorithm 5 GHC algorithm for learning BNs with tabu list and random restarts
input: Initial structure Ginit, dataset D, a scoring function φ and a stopping criteria C.
output: final structure Gres.
1: Gres = Ginit, G′ = Gres and TABU = {Gres}
2: while C not satisfied do
3: G′′ = arg max_{G ∈ neighbourhood(G′) \ TABU} φ(G)
4: if φ(G′) > φ(G′′) then
5: G′′ = random(G′) (random restart: apply random operations to G′)
6: end if
7: if φ(G′′) > φ(Gres) then
8: Gres = G′′
9: end if
10: TABU = TABU ∪ {G′}
11: G′ = G′′
12: end while
return Gres
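The following Python skeleton illustrates the same search loop in the spirit of Algorithm 5, with a simple tabu list (random restarts omitted for brevity). The score function maps an edge set to a real number and would in practice be one of the scoring functions of Subsection 2.2.3; all names are ours and illustrative.

def is_acyclic(n, edges):
    # Kahn's algorithm: the directed graph has no cycles iff all nodes can be removed.
    indeg = [0] * n
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        indeg[b] += 1
    stack = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for w in adj[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    return seen == n

def neighbours(n, edges):
    # All DAGs reachable by adding, deleting or flipping one edge.
    result = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (i, j) in edges:
                result.append(edges - {(i, j)})                 # delete
                result.append((edges - {(i, j)}) | {(j, i)})    # flip
            elif (j, i) not in edges:
                result.append(edges | {(i, j)})                 # add
    return [e for e in result if is_acyclic(n, e)]

def greedy_hill_climb(n, score, max_steps=100):
    current = frozenset()
    best, best_score = current, score(current)
    tabu = {current}
    for _ in range(max_steps):
        candidates = [g for g in neighbours(n, current) if frozenset(g) not in tabu]
        if not candidates:
            break
        nxt = frozenset(max(candidates, key=score))
        if score(nxt) <= score(current):
            break                                               # local maximum reached
        current = nxt
        tabu.add(current)
        if score(current) > best_score:
            best, best_score = current, score(current)
    return best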
2.2.3 Scoring Functions
A large variety of scoring functions has been proposed in the literature [5]. A scoring function φ : S × X → R measures how well a Bayesian network B fits the data D (where S denotes the search space). Score-based learning algorithms are efficient if the scoring criterion is decomposable, since in this case a local change in the neighborhood of a node Xi only changes the local score φi, for i ∈ {1, . . . , n}.
Definition 13 (Decomposable scoring function). A scoring function φ is decomposable if the score
assigned to each network decomposes over the network in such a way that it can be expressed as a
sum of local scores that depends only on each node and its parents, that is, scores of the following form:
φ(B, D) = ∑_{i=1}^{n} φi(ΠXi, D). (2.9)
Another important property of scoring functions is the score equivalence; we will define some pre-
liminary concepts in order to define this property.
Definition 14 (Partially directed acyclic graph). A partially directed acyclic graph is a graph that contains
both directed and undirected edges, with no directed cycle in its directed subgraph.
A partially directed acyclic graph can be viewed as a representative of an equivalence class of DAGs.
Definition 15 (Compelled edge). A directed edge X → Y is compelled in a directed acyclic graph G if for every directed acyclic graph G′ equivalent to G, X → Y exists in G′.
By Theorem 8 (page 5), any edge participating in a v-structure is compelled. If a directed edge is not compelled, we call it reversible, as there may exist another DAG in the same equivalence class with the reverse edge.
Definition 16 (Essential graph). An essential graph, denoting an equivalence class of directed acyclic
graphs, is the partially directed acyclic graph consisting of a directed edge for every compelled edge in
the equivalence class, and an undirected edge for every reversible edge in the equivalence class.
For tree-network structures, the essential graph corresponds to its skeleton.
Definition 17 (Score Equivalence). A scoring function φ is score equivalent if it assigns the same score
to all directed acyclic graphs that are represented by the same essential graph.
Scoring functions are divided into two classes: Bayesian and information-theoretical. We will focus on information-theoretical scoring functions: log-likelihood, minimum description length, complete minimum description length and normalized maximum likelihood. Information-theoretical scoring criteria are based on the compression achieved when describing a data set with an optimal code induced by the probability distribution encoded by a Bayesian network. The rationale is to choose the representation of the data that corresponds to the minimum description length. The idea is the following: the more we are able to compress a data set, the more regularities the data set has, and therefore the more we learn about the data.
Example 18. This example was adapted from [22]. Consider two sequences of binary data of 10000
bits each represented by:
0001000100010001000100010001...00010001000100010001,
0111010000100101011101110001...11101000101011101001.
The first sequence is the repetition of the pattern 0001 2500 times; therefore we can predict that future data will follow the same "law". The second sequence is random: there is no regularity underlying it. Therefore, the first sequence can be compressed: it can be described as "2500 repetitions of 0001" instead of describing the entire sequence, whereas the second sequence cannot be summarized.
We will introduce some basic concepts of coding, data compression and information theory that will
be important to understand this class of scoring functions.
Basic Coding and Data Compression Concepts
Let Y∗ denote the set of finite-length strings of symbols from a Y-ary alphabet.
Definition 19 (Code). Given a random variable X with range X and a set of finite-length strings of
symbols from a Y-ary alphabet, Y∗, a code C is a mapping:
C : X → Y∗. (2.10)
Let C(x) denote the codeword corresponding to x and let l(x) denote the length of C(x).
Definition 20 (Expected length of a code). The expected length L(C) of a code C(x) for a random
variable X with probability mass function Q is given by:
L(C) = ∑_{x ∈ 𝒳} Q(x) l(x), (2.11)
where l(x) is the length of the codeword associated with x.
By assigning short codewords to common outcomes and longer codewords to less frequent outcomes, it is possible to decrease the redundancy of the data and therefore compress it.
Example 21 (Huffman's Algorithm). This example was adapted from [23]. Let 𝒳 = {a, b, c} and P be the probability distribution on 𝒳 with P(a) = 1/2 and P(b) = P(c) = 1/4. Construct a code following Huffman's algorithm: first choose the two elements with the smallest probabilities, b and c, and connect them with leaves 0 and 1 (assigned arbitrarily), to form the intermediate node bc with node probability P(bc) = 1/2. The constructed code is depicted in Figure 2.5. The resulting code is a → 0, b → 10, c → 11, with codeword lengths l(a) = 1, l(b) = l(c) = 2 and expected length L(C) = (1/2) × 1 + (1/2) × 2 = 1.5.
Figure 2.5: Huffman code of Example 21.
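The standard Huffman construction used in Example 21 can be sketched in a few lines of Python (an illustrative sketch; the tie-breaking counter only makes the heap ordering well defined).

import heapq

def huffman_code(probabilities):
    # probabilities: dict symbol -> probability. Returns dict symbol -> codeword.
    heap = [(p, i, {symbol: ""}) for i, (symbol, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

p = {"a": 0.5, "b": 0.25, "c": 0.25}
code = huffman_code(p)
expected_length = sum(p[s] * len(w) for s, w in code.items())
print(code, expected_length)   # {'a': '0', 'b': '10', 'c': '11'} and 1.5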
Definition 22 (Prefix code). A code is called a prefix code or an instantaneous code if no codeword is a
prefix of any other codeword.
If no codeword is a prefix of any other codeword, we can decode instantaneously, in the sense that we do not need to observe future codewords in order to decode the current one; moreover, unique decodability is guaranteed.
Example 23. This example was adapted from [18]. The code constructed in Example 21 is a prefix code. For example, the string 0101110 is uniquely decoded to abca. Consider now the code C : {a, b, c, d} → {0, 1}∗, where C(a) = 01, C(b) = 11, C(c) = 00 and C(d) = 110. Given the string 110...0...011, if the number of zeros between the two occurrences of 11 is even, then the first codeword is decoded as b; if it is odd, then the first codeword is decoded as d. Therefore, to decode the first codeword we may need to observe an arbitrary number of future codewords.
Our goal is to define prefix codes with minimum expected length; however, assigning short codewords to all source symbols while keeping the code prefix-free is clearly infeasible. Consider the case described in Example 21: if a → 0, neither b nor c can be assigned the codeword 1 if we want to construct a prefix code. The following theorem expresses this relation. Denote the length of the codeword of xi by li = l(xi) and write Pi = P(xi).
Theorem 24 (Kraft Inequality [12]). For any prefix code over an alphabet of size Y, the codeword lengths
l1, . . . , lm must satisfy the following inequality:
∑_i Y^{−li} ≤ 1. (2.12)
Conversely, given a set of codeword lengths that satisfy this inequality, there exists a prefix code with
these word lengths.
Suppose elements of 𝒳 are generated according to a known probability distribution P. Shannon's Source Coding Theorem states that the expected code length defined in (2.11) is minimized when Q = P.
Theorem 25 (Shannon's Source Coding Theorem, from [12]). Suppose elements of 𝒳 are generated according to a probability distribution P. For any prefix code on 𝒳 with length function l, the expected code length L(C) is bounded below by H(P), the entropy of P. That is,
L(C) ≥ H(P) = −∑_{x ∈ 𝒳} P(x) log_Y P(x). (2.13)
The optimal code lengths l*_1, . . . , l*_m, which minimize the expected code length, satisfy:
l*_i = −log_Y Pi. (2.14)
However, l*_i as defined above is not necessarily an integer, and it is not possible to define codewords with non-integer lengths. Defining li = ⌈−log_Y Pi⌉ solves this problem, while still satisfying the Kraft inequality. Thus an optimal code Copt satisfies:
H(P) ≤ L(Copt) ≤ H(P) + 1. (2.15)
Shannon-Fano and Huffman codes are examples of optimal codes [12].
In the Bayesian network setting, consider HG, the set of hypotheses subsuming that the data D was generated by some Bayesian network with structure G. We will use an optimal code, defining the source set 𝒳 as the data D we want to model, with the probability function over D induced by a given hypothesis HG ∈ HG. We will use the description length of D as a measure to select the model.
Definition 26 (Description length). Given data D and a set of probability distributions HG encoded by a
Bayesian network, that may be used to describe D , the description length of D with HG ∈ HG is given
by:
L(D,HG) = L(D|HG) + L(HG), (2.16)
where L(D|HG) is the length of the description of D when encoded with HG and L(HG) is the length of
the description of HG.
By Shannon's Source Coding Theorem, using an optimal code, the length of the description of D when encoded with hypothesis HG is:
L(D | HG) = −LL(HG | D) = −log P_{HG}(D) = −∑_{i=1}^{n} ∑_{j=1}^{qi} ∑_{k=1}^{ri} Nijk log(θijk). (2.17)
Next, we will introduce the information theoretical scoring functions. What distinguishes these scoring
functions is how description length is defined.
Log-Likelihood Criterion
The Log-likelihood criterion assumes that the hypothesis HG is transmitted cost-free and it is enough to
choose HG that minimizes the maximum likelihood estimate, in this case:
L(D,HG) = L(D|HG) = −LL(HG|D). (2.18)
This criterion favors complete network structures, and does not generalize well, leading to the over-
fitting of the model to the training data.
Minimum Description Length Criterion
The minimum description length (MDL) criterion, proposed by Rissanen [35], imposes that the parameters of the hypothesis HG must also be transmitted. The length of these parameters is a form of penalized likelihood, the price one must pay for not knowing which hypothesis generated the data. The MDL criterion follows Occam's reasoning, selecting simple models. Hence, in this case we want to choose the HG that minimizes:
L(D,HG) = L(D|HG) + L(HG) = −LL(HG|D) + L(HG). (2.19)
The MDL principle can be viewed as a two-part coding scheme:
1. In a first stage, the parameters that minimize (2.19), ΘG, are estimated. Then, the parameters are
transmitted using a uniform encoder and a certain precision.
2. In a second stage, an optimal prefix code is constructed using the distribution indexed by ΘG and
the data set D is encoded using the induced code and sent to the receiver.
Now we will describe how to encode ΘG. First, suppose the parameters are integers. Elias [17] and Rissanen [36] constructed a universal code⁴ for integers, under which the description of an integer x takes
log*_2(x) = ∑_{j≥1} max(log_2^{(j)}(x), 0) + log_2 c_0 (2.20)
bits, where log_2^{(j)}(·) is the j-th composition of the binary logarithm and c_0 is given by
c_0 := ∑_{n≥1} 2^{−log*_2 n} ≈ 2.865064. (2.21)
⁴ A universal code for integers is a prefix code with the additional property that, whatever the true probability distribution on integers, as long as the distribution is monotonic, the expected lengths of the codewords are within a constant factor of the expected lengths that the optimal code for that probability distribution would have assigned.
However, the parameters of a Bayesian network are real numbers. In this case, a real number x should be represented by the integer x/δx, where δx is the precision of the representation. By approximating log* ≈ log, it is possible to compute the optimal precision δ*x = 1/√N. Considering the asymptotic case, with the number of independent samples N → ∞, the description length of x becomes
log*(x / δ*x) → (1/2) ln(N) (2.22)
bits, and we arrive at the following number of bits required to represent a Bayesian network B:
−ln(1/√N) |B| = (1/2) ln(N) |B|, (2.23)
where |B| corresponds to the number of parameters Θ of the network and is given by:
|B| = ∑_{i=1}^{n} (ri − 1) qi. (2.24)
An intuitive way to understand the defined optimal precision of 1/√N is that this value corresponds
to the maximum magnitude of the estimation error of the parameters ΘG, hence, there is no need to
encode the estimator with a greater precision.
The minimum description length criterion becomes:
MDL(B | D) = −LL(B | D) + (1/2) ln(N) |B|. (2.25)
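The score in (2.25) can be computed directly from the counts Nijk, as in the following Python sketch (natural logarithms are used throughout; the data layout and function names are ours and illustrative).

import math

def log_likelihood(counts):
    # counts[i][j][k] = N_ijk; theta_ijk is implicitly the OFE N_ijk / N_ij.
    ll = 0.0
    for node_counts in counts:
        for parent_counts in node_counts:
            n_ij = sum(parent_counts)
            for n_ijk in parent_counts:
                if n_ijk > 0:
                    ll += n_ijk * math.log(n_ijk / n_ij)
    return ll

def mdl_score(counts, num_samples):
    # MDL(B|D) = -LL(B|D) + (1/2) ln(N) |B|, with |B| = sum over nodes of (r_i - 1) q_i.
    num_params = sum(len(pc) - 1 for nc in counts for pc in nc)
    return -log_likelihood(counts) + 0.5 * math.log(num_samples) * num_params

# One binary node with no parents, observed 2 zeros and 3 ones (as in Example 27 below).
counts = [[[2, 3]]]
print(mdl_score(counts, 5))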
Example 27. This example is adapted from [23]. Consider that the sender wishes to transmit a binary string y = y1, . . . , yn to a receiver and uses a Bernoulli(θ) model to send the string. In a Bayesian network approach, this can be represented by a single node corresponding to a binary random variable. The binary string can be viewed as a set of n i.i.d. observations sampled from the distribution Bernoulli(θ). Let k be the number of 1's in the string. The parameter θ needs to be estimated first and then sent to the receiver. The maximum likelihood estimate is θ = k/n. This parameter takes (1/2) ln n nats to send. Then the sender encodes all the symbols in the string, which takes −log2(k/n) bits for a 1 and −log2(1 − k/n) bits for a 0. Therefore, transmitting the string requires an additional
−k log2(k/n) − (n − k) log2(1 − k/n)
bits. Consider the particular case where n = 5, y = (1, 1, 1, 0, 0) and the maximum likelihood parameter is θ = 3/5. The parameter takes (1/2) ln 5 ≈ 0.8047 nats, which corresponds approximately to 1.1609 bits, to communicate. Encoding the data set takes −log2((3/5)³(2/5)²) ≈ 4.8548 bits.
The minimum description length criterion is equivalent to the Bayesian scoring function called Bayesian information criterion (BIC) [39].
Complete Minimum Description Length Criterion
Rissanen observed that the MDL criterion is redundant and incomplete [38] in the sense that as the
parameters are sent beforehand to the receiver, the data has to be compatible with these parameters,
allowing to further compress it. He therefore proposed a new criterion called Complete Minimum De-
scription Length (CMDL), that solves this issue.
Example 28. Let us consider the Complete Minimum Description Length approach in the setting of Example 27. As the receiver knows the parameter θ = k/n, he knows the data set must contain exactly k 1's; therefore, if an enumeration of all the compatible data sets is defined, the length of encoding the data set is
log2 C(n, k)
bits, where C(n, k) denotes the binomial coefficient. Considering the particular case described in Example 27, the length for encoding k is log2 3 ≈ 1.5850 bits, and the length of encoding the data set is
log2 C(5, 3) ≈ 3.3219
bits, which is significantly smaller than the corresponding length in the MDL approach.
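The two data-encoding lengths of Examples 27 and 28 can be checked with a few lines of Python (an illustrative sketch of the comparison, not part of the thesis implementation).

from math import comb, log2

n, k = 5, 3
# MDL-style data term: each symbol coded with the optimal code induced by theta = k/n.
mdl_data_bits = -k * log2(k / n) - (n - k) * log2(1 - k / n)
# CMDL-style data term: a uniform code over the datasets compatible with theta = k/n.
cmdl_data_bits = log2(comb(n, k))
print(round(mdl_data_bits, 4), round(cmdl_data_bits, 4))   # ~4.8548 and ~3.3219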
Rissanen defines the Complete Minimum Description Length distribution as the one that minimizes the length of the code of D, given that the receiver already knows the parameters. Given a set of hypotheses HG and denoting by H_{HG}(D) the BN with parameters given by the OFEs in D, the CMDL distribution is given by:
P^{CMDL}_{HG}(D) = P_{H_{HG}(D)}(D) / ∑_{D′ : θ_{D′} = θ_D} P_{H_{HG}(D′)}(D′).
Since the data instances are assumed to be sampled from a multinomial distribution, two data sets D and D′ have the same parameters if and only if they are a permutation of each other and H_{HG}(D) = H_{HG}(D′). Hence we get P_{H_{HG}(D′)}(D′) = P_{H_{HG}(D)}(D), and the CMDL distribution simplifies to:
P^{CMDL}_{HG}(D) = 1 / |{D′ : θ_{D′} = θ_D}|.
The length of the optimal code induced by the CMDL distribution is given by:
CMDL(G | D) = −log(P^{CMDL}_{HG}(D)) + L(Θ_D) = log(|{D′ : θ_{D′} = θ_D}|) + L(Θ_D). (2.26)
The problem of computing CMDL(G | D) is therefore reduced to the problem of counting how many datasets induce the same OFE and sending these parameters to the receiver. We will start by deriving the cardinality of the set {D′ : θ_{D′} = θ_D}.
Counting the number of data sets compatible with the OFEs
The number of data sets compatible with the OFEs has an analytical solution for forest BNs; for general structures, a non-trivial solution is proposed. We will start by formally defining forest graphs.
Definition 29 (Forest). A forest is a disjoint union of trees.
Given expression (2.7) for the OFE parameters {Θijk}_{i,j,k} of a given BN, one observes that two datasets induce the same OFE parameters if and only if they induce the same family of counts N = {Nijk}_{i,j,k}, and we have the following result for forest BNs.
Theorem 30 ([4]). Let D be a dataset of size N, B a forest BN, and N = {Nijk}_{i,j,k} the family of counts for each parent-child pair in B induced by D. The number of datasets of size N that induce the same family of counts N for B is:
∏_{i=1}^{n} ∏_{j=1}^{qi} (Nij choose Nij1, . . . , Nijri) = ∏_{i=1}^{n} ∏_{j=1}^{qi} Nij! / (Nij1! · · · Nijri!). (2.27)
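Expression (2.27) is a simple product of multinomial coefficients, as sketched below in Python (names and data layout are ours; the example uses the counts of Figure 2.6(c), for which, since the structure is not a forest, the value 16 is only the upper bound discussed in Example 31).

from math import factorial
from functools import reduce

def multinomial(counts):
    # (sum(counts))! / (counts_1! ... counts_r!)
    return factorial(sum(counts)) // reduce(lambda acc, c: acc * factorial(c), counts, 1)

def compatible_datasets_forest(counts):
    # counts[i][j] is the list (N_ij1, ..., N_ijri) for node i and parent configuration j.
    total = 1
    for node in counts:
        for parent_cfg in node:
            total *= multinomial(parent_cfg)
    return total

counts = [[[3, 1]], [[1, 3]], [[1, 0], [0, 2], [0, 0], [0, 1]]]
print(compatible_datasets_forest(counts))   # 16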
Denote the family of OFE multinomials {Mult(Nij, θij1, . . . , θijri)}_{i,j} by {Mij}_{i,j}. Expression (2.27) only holds for forest BNs. Next we give a counterexample to prove this fact, which considers a v-structure BN where the multinomials M11 and M21 are not pairwise independent, so that (2.27) is only an upper bound for the number of data sets compatible with the OFEs.
(a) A v-structure network with edges X1 → X3 ← X2.
(b) Dataset:
X1 X2 X3
0 0 0
0 1 1
0 1 1
1 1 1
(c) Counts:
X1: N111 = 3, N112 = 1, N11 = 4
X2: N211 = 1, N212 = 3, N21 = 4
X3: N311 = 1, N312 = 0, N31 = 1; N321 = 0, N322 = 2, N32 = 2; N331 = 0, N332 = 0, N33 = 0; N341 = 0, N342 = 1, N34 = 1
Figure 2.6: Network structures and datasets for Example 31.
Example 31. Consider the BN depicted in Figure 2.6(a), where all Xi, with i ∈ {1, 2, 3}, are binary random variables; the data set is represented in Figure 2.6(b) and the counts in Figure 2.6(c). According to Theorem 30, the number of compatible datasets is:
∏_{i=1}^{n} ∏_{j=1}^{qi} Nij! / (Nij1! · · · Nijri!) = (4!/(3!1!)) (4!/(1!3!)) (1!/(1!0!)) (2!/(2!0!)) (0!/(0!0!)) (1!/(0!1!)) = 4 × 4 × 1 × 1 × 1 × 1 = 16.
However, we can deduce the counts of X1 and X2 from the counts of X3. Moreover, the counts of X1 and X2 are not independent. Since N33 = 0 and N34 = 1, we know that X1 takes the value 1 exactly once; furthermore, from the same counts we know that when X1 takes the value 1, X2 always takes the same value. Therefore, the true number of compatible data sets is given by:
(N choose N31, N32, N33, N34) = (4 choose 1, 2, 0, 1) = 4!/(1!2!0!1!) = 12.
Hence, expression (2.27) is only an upper bound for the number of compatible data sets, which is attained only when the multinomial distributions are independent.
We will reduce the problem of determining the number of compatible data sets for a general network to the simpler problem for forests, by constructing a quotient over the set of nodes that gives rise to a forest in the resulting quotient graph.
Now we introduce some notation. We represent a directed graph by G = (V, E), where V = {1, . . . , n} are the nodes and E = {(i, j) : i, j ∈ V} ⊆ V² are the edges. We denote the edge (i, j) ∈ E by i →G j. When node j is reachable in zero or more steps from node i we write i →*_G j. If i and j are reachable from each other we write i ↔*_G j.
Definition 32 (Strongly Connected Components). The strongly connected components (SCC) of a graph form the partition {V1, . . . , Vm} of the nodes V such that:
1. Vl ∩ Vk = ∅ for l ≠ k.
2. V = V1 ∪ · · · ∪ Vm.
3. i ↔*_G j for all Vl and all i, j ∈ Vl.
4. It is the coarsest partition fulfilling conditions 1, 2 and 3.
Given an arbitrary directed graph, Tarjan's algorithm computes the SCCs in time O(|V|²) [42]. Tarjan's algorithm works as follows: we perform a depth-first search over the graph, such that each node is visited exactly once; nodes are placed on a stack in the order they are visited. A node v and its descendants are popped from the stack if and only if there is no path in the graph from any of these nodes to some node earlier on the stack; in this case an SCC with root v and all the nodes above it on the stack is determined. If such a path exists, node v remains on the stack. S denotes the stack of nodes that were discovered but do not yet belong to an SCC. Tarjan's algorithm is represented in Algorithm 6. A summary of the functions and variables used in the algorithm follows:
• v.index: order in which node v was discovered.
• v.lowlink: smallest index of any node known to be reachable from node v.
• strongconnect(v): function that performs a single depth-first search from node v, visiting all its successors, and determines the strongly connected components of that subgraph.
• v.onStack: flag indicating whether node v is currently on the stack.
Figure 2.7: Network structure for Example 33.
Example 33. In the graph represented in Figure 2.7, the strongly connected components are {V1, V2}, where V1 = {1, 2, 3} and V2 = {4}.
Algorithm 6 Tarjan’s Algorithm
input: Graph G = (V, E).
output: Set of strongly connected components.
1: index = 0
2: S = empty stack
3: for each v ∈ V do
4: if v.index is undefined then
5: strongconnect(v)
6: end if
7: end for
8: function strongconnect(v)
9: v.index = index, v.lowlink = index, index = index + 1
10: push v onto S and set v.onStack = true
11: for each (v, w) ∈ E do
12: if w.index is undefined then
13: strongconnect(w)
14: v.lowlink = min(v.lowlink, w.lowlink)
15: else if w.onStack then
16: v.lowlink = min(v.lowlink, w.index)
17: end if
18: end for
19: if v.lowlink = v.index then
20: start a new SCC
21: repeat
22: w = S.pop(), w.onStack = false, add w to the current SCC
23: until w = v
24: output the current SCC
25: end if
26: end function
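A compact recursive rendering of the same procedure in Python is given below (illustrative; since the exact edges of Figure 2.7 are not reproduced here, the example graph at the end is only assumed to have a cycle through nodes 1, 2, 3 and a separate node 4).

def tarjan_scc(graph):
    # graph: adjacency dict {node: [successors]}. Returns a list of SCCs.
    index_counter = [0]
    index, lowlink, on_stack = {}, {}, {}
    stack, components = [], []

    def strongconnect(v):
        index[v] = lowlink[v] = index_counter[0]
        index_counter[0] += 1
        stack.append(v)
        on_stack[v] = True
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif on_stack.get(w, False):
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack[w] = False
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return components

print(tarjan_scc({1: [2], 2: [3], 3: [1, 4], 4: []}))   # [[4], [3, 2, 1]]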
Definition 34 (Quotient Graph). Given a graph G = (V, E) and an equivalence relation R ⊆ V², the quotient graph is the graph G/R = (V/R, E/R), where V/R is the set of equivalence classes induced by R, and [i]R →_{G/R} [j]R whenever k →G l for some k ∈ [i]R and l ∈ [j]R, with [i]R ≠ [j]R.
Let Δ_V denote the diagonal relation of V, Δ_V = {(i, i) : i ∈ V}; trivially G/Δ_V ≃ G. We now focus on a particular equivalence relation, which we call forestification and denote by ∼, such that given an acyclic graph G, the quotient graph G/∼ is a forest.
Definition 35 (Forestification). Let G be an acyclic graph. The forestification relation ∼ for G is the finest equivalence relation such that: i) i ∼ j whenever there exist k and l such that k ∼ l and there are edges [i] →_{G/∼} [k] and [j] →_{G/∼} [l]; ii) G/∼ is acyclic.
The forestification will be computed as a fixed point of an operator. Consider the operator Φ_G : 2^{V²} → 2^{V²} defined by:
Φ_G(R) = Λ_G(Ω_G(R)),
such that
Ω_G(R) = R ∪ {(j, j′) : (i, i′) ∈ R, j ∈ Π^G_i, j′ ∈ Π^G_{i′}}
merges nodes that have children in the same equivalence class, so that in G/R there are no nodes with more than one parent, and
Λ_G(R) = {(i, j) : i ↔*_{G ∪ G_R} j},
where G_R = (V, R). Λ_G guarantees that the cycles that may be formed by Ω_G(R) belong to the same equivalence class.
[Figure 2.8, three panels: (a) initial network structure; (b) resulting network after applying ΩG; (c) resulting network after applying ΛG.]
Figure 2.8: Given the initial structure represented in (a), the operator ΩG merges nodes X1 and X2, as they are both parents of the node X3, and the supernode X[1] = {X1, X2} is created. As the edges (X[1], X3), (X3, X4), (X4, X[1]) form a cycle, the operator ΛG merges them into the supernode X[1] = {X1, X2, X3, X4}. The resulting graph is a tree.
Theorem 36 ([4]). Let R be an equivalence relation over V. Then G/R is a forest iff R is a fixed point of Φ_G. Moreover, the forestification relation ∼ of G is the least fixed point (lfp) of Φ_G, and we have
∼ = Φ_G^{|V|}(∅).
The forestification can thus be computed as the least fixed point of Φ_G. The forestification algorithm is represented in Algorithm 7 and takes O(|V|³) time [4]. Let G′ = (V′, E′) represent the quotient of the graph G = (V, E), where V′ represents the set of equivalence classes of V. As Φ_G(∅) = Δ_V = {(i, i) : i ∈ V}, V′ is initialized as the trivial partition V′ = {{i} : i ∈ V} and E′ is initialized as E′ = {(i, j) : (i, j) ∈ E}. The algorithm applies Φ_G to V′ until V′ is a fixed point of Φ_G, and so V′ = lfp(Φ_G).
Algorithm 7 Algorithm to compute the forestification relation
input: Graph G = (V, E).
output: The forestification relation ∼ of G.
V′ = {{i} : i ∈ V}, E′ = {(i, j) : (i, j) ∈ E} and G′ = (V′, E′)
flag = false
while flag = false do
E″ = E′
for all i ∈ V′ with parent set {j1, . . . , jk} = Π^{G′}_i, ordered so that jl < jl+1, do
E″ = E″ ∪ ⋃_{l=1}^{k−1} {(jl, jl+1), (jl+1, jl)}
end for
∼ = partition into SCCs computed by Tarjan(V′, E″)
if ∼ = Δ_{V′} then
flag = true
else
G′ = (V′, E′)/∼
end if
end while
Observe that two datasets D and D′ that induce the same counts N = {Nijk}_{i,j,k} for a graph G may induce different counts when nodes are aggregated according to ∼. To illustrate this fact we consider two intertwined v-structures, forming the w-structure represented in Figure 2.9.
Example 37. Consider the BN with network structure G (Figure 2.9(a)). Its forestification G/∼ is depicted in Figure 2.9(b), where X[1] ≡ {X1, X2, X3}. Moreover, consider two datasets, D (Figure 2.9(c)) and D′ (Figure 2.9(d)), drawn from binary random variables.
(a) Network G. (b) Network G/∼.
(c) Dataset D:
X1 X2 X3 X4 X5
1 0 1 0 1
0 0 1 0 1
1 0 0 1 1
(d) Dataset D′:
X1 X2 X3 X4 X5
0 0 0 0 1
1 0 1 0 1
1 0 1 1 1
Figure 2.9: Network structures and datasets for Example 37.
We aim to illustrate that D and D′ induce the same counts for G, which does not happen with G/∼. Indeed, for G we have ri = 2 for all i ∈ {1, . . . , 5}, q1 = q2 = q3 = 1 and q4 = q5 = 4. Let xi1 = 0 and xi2 = 1 for all i ∈ {1, . . . , 5}, w11 = w21 = w31 = ε, where ε is the empty parent configuration, and w41 = w51 = 00, w42 = w52 = 01, w43 = w53 = 10, and w44 = w54 = 11. For both datasets D and D′ the
counts induced by G are given by:
X1: N111 = 1, N112 = 2, N11 = 3
X2: N211 = 3, N212 = 0, N21 = 3
X3: N311 = 1, N312 = 2, N31 = 3
X4: N411 = 1, N412 = 0, N41 = 1; N421 = 0, N422 = 0, N42 = 0; N431 = 1, N432 = 1, N43 = 2; N441 = 0, N442 = 0, N44 = 0
X5: N511 = 0, N512 = 1, N51 = 1; N521 = 0, N522 = 2, N52 = 2; N531 = 0, N532 = 0, N53 = 0; N541 = 0, N542 = 0, N54 = 0
Observe, however, that for G/∼ the datasets D and D′ induce different counts. As X[1] corresponds to the equivalence class {X1, X2, X3} in G/∼, the only possible configuration for its parents is the empty one, and so r[1] = 8 and q[1] = 1. In this case, x[1]1 = 000, x[1]2 = 001, x[1]3 = 010, x[1]4 = 011, x[1]5 = 100, x[1]6 = 101, x[1]7 = 110, x[1]8 = 111, and w[1]1 = ε. Concerning X4 and X5, r4 = r5 = 2, q4 = q5 = 8, with xi1 = 0, xi2 = 1 and wij = x[1]j, for i = 4, 5 and j = 1, . . . , 8.
Having set up the values of the nodes and the parents' configurations, the counts induced by
G/ ∼ for D are given by:
X[1]: M[1]11 = 0, M[1]12 = 1, M[1]13 = 0, M[1]14 = 0, M[1]15 = 1, M[1]16 = 1, M[1]17 = 0, M[1]18 = 0
X4: N411 = 0, N421 = 1, N431 = 0, N441 = 0, N451 = 0, N461 = 1, N471 = 0, N481 = 0; N412 = 0, N422 = 0, N432 = 0, N442 = 0, N452 = 1, N462 = 0, N472 = 0, N482 = 0
X5: N511 = 0, N521 = 0, N531 = 0, N541 = 0, N551 = 0, N561 = 0, N571 = 0, N581 = 0; N512 = 0, N522 = 1, N532 = 0, N542 = 0, N552 = 1, N562 = 1, N572 = 0, N582 = 0
whereas, the counts induced by G/ ∼ for D′ are given by:
X[1]: M[1]11 = 1, M[1]12 = 0, M[1]13 = 0, M[1]14 = 0, M[1]15 = 0, M[1]16 = 2, M[1]17 = 0, M[1]18 = 0
X4: N411 = 1, N421 = 0, N431 = 0, N441 = 0, N451 = 0, N461 = 1, N471 = 0, N481 = 0; N412 = 0, N422 = 0, N432 = 0, N442 = 0, N452 = 0, N462 = 1, N472 = 0, N482 = 0
X5: N511 = 0, N521 = 0, N531 = 0, N541 = 0, N551 = 0, N561 = 0, N571 = 0, N581 = 0; N512 = 1, N522 = 0, N532 = 0, N542 = 0, N552 = 0, N562 = 2, N572 = 0, N582 = 0
and so several of the counts differ between D and D′.
Definition 38 (Compatible counts in the quotient graph). A count M = {M[i][j][k]}_{[i],[j],[k]} for the quotient graph G/∼ is said to be compatible with a count N = {Nijk}_{i,j,k} for the BN with graph G and data D, which we denote by M ↓ N, if there is a dataset D′ of the same size as D such that the counts for the structure G of D′ coincide with N and, moreover, M is the count for the structure G/∼ of D′.
As illustrated in the previous example, there are several possible counts for the quotient graph G/ ∼
that are compatible with N . Therefore, we can deduce a generalization of Theorem 30, page 18, for a
general BN.
Theorem 39 ([4]). Let D be a dataset of size N, B a BN and N the family of counts for each parent-child pair induced by B on D. The number of datasets of size N that induce the same counts for B is
∑_{M ↓ N} ∏_{[i]=1}^{[n]} ∏_{[j]=1}^{q[i]} (M[i][j] choose M[i][j]1, . . . , M[i][j]r[i]). (2.28)
However, there is no analytical expression for the number of compatible counts for the quotient graph. Therefore, we restrict ourselves to structures for which there is only one compatible count for the quotient graph, which we call covering graphs. Given a graph G, consider the covering CG = {{Xi1, . . . , Xik, Xj} : Πj = {Xi1, . . . , Xik}}. In Figure 2.10(a) a graph that is not covering is represented; in Figure 2.10(b) a covering graph is represented.
Definition 40 (Covering graph). A graph G is said to be covering if for all Xi ∈ X there is a C ∈ CG such
that [Πi]∼ ∪ [Xi]∼ ⊆ C, where [Πi]∼ is either the empty set, if [Xi]∼ has no parents in G/ ∼, or [Πi]∼ is
the parent of [Xi]∼ in G/ ∼.
[Figure 2.10: two graphs over the nodes X1, X2, X3, X4, labelled (a) and (b).]
Figure 2.10: The graph represented in (a) is not covering, since [Π4]∼ = {X1, X2} and the set {X4, X2, X1} does not belong to the covering CG = {{X3, X1, X2}, {X4, X2}, {X1}, {X2}}. However, the graph represented in (b) is covering.
Theorem 41. Let G be a covering graph. Then there is only one count M for G/∼ that is compatible with N.
Thus, for the case of covering graphs, expression (2.28) simplifies to:
$$\prod_{[i]=1}^{[n]} \prod_{[j]=1}^{q_{[i]}} \binom{M_{[i][j]}}{M_{[i][j]1} \cdots M_{[i][j]r_{[i]}}}.$$
Sending the OFEs
We will consider that the description length of the parameters is given by the asymptotic approximation derived for MDL [35]:
$$L(\Theta) = \frac{1}{2}\,\ln(N)\,|B|,$$
where |B| was defined in (2.24), page 16.
Thus, for the case of covering graphs, we are able to compute the CMDL as
$$\mathrm{CMDL}(G \mid D) = \mathrm{CMDL}(G/\!\sim \;\mid D) = \sum_{[i]=1}^{m} \sum_{[j]=1}^{q_{[i]}} \log \binom{M_{[i][j]}}{M_{[i][j]1} \cdots M_{[i][j]r_{[i]}}} + \frac{1}{2}\,\ln(N)\,|B|.$$
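As an illustration, the following is a minimal Java sketch of how the CMDL score above could be evaluated for a covering graph, assuming the quotient-graph counts M[i][j][k] and the number of parameters |B| have already been computed; natural logarithms are used in both terms for simplicity, and all names are illustrative rather than taken from the released implementation.

// Minimal sketch of the CMDL score for a covering graph. Log-multinomial coefficients
// are computed through log-factorials; assumes counts and |B| are supplied.
public final class Cmdl {

    private static double logFactorial(int n) {
        double s = 0.0;
        for (int v = 2; v <= n; v++) s += Math.log(v);
        return s;
    }

    /** log of the multinomial coefficient (M choose M_1 ... M_r), natural log. */
    private static double logMultinomial(int[] cellCounts) {
        int total = 0;
        double s = 0.0;
        for (int c : cellCounts) { total += c; s -= logFactorial(c); }
        return s + logFactorial(total);
    }

    /** CMDL(G | D) = sum over [i],[j] of log-multinomial of M_[i][j]k  +  (1/2) ln(N) |B|. */
    public static double score(int[][][] quotientCounts, int numParameters, int sampleSize) {
        double dataPart = 0.0;
        for (int[][] node : quotientCounts)       // equivalence classes [i]
            for (int[] parentConfig : node)       // parent configurations [j]
                dataPart += logMultinomial(parentConfig);
        return dataPart + 0.5 * Math.log(sampleSize) * numParameters;
    }
}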
Normalized Maximum Likelihood Criterion
The Normalized Maximum Likelihood criterion (NML), proposed by Rissanen [38], defines a different
description length for encoding the hypothesis HG. Instead of using a universal encoder, NML uses an
approach related to Rissanen's stochastic complexity.
Let HG be a given set of probability distributions and suppose the sender believes there is an HG ∈ HG that assigns a high likelihood to a given data set D; we call it the best-fitting hypothesis. Given a
hypothesis HG′, which does not necessarily belong to HG, the regret of HG′ relative to HG is given by:
$$-\log P(D \mid H_{G'}) - \min_{H_G \in \mathcal{H}_G} \bigl(-\log P(D \mid H_G)\bigr), \qquad (2.29)$$
and corresponds to the extra bits spent when the data set D is encoded with HG′, compared to the best
hypothesis in HG. The worst-case regret, relative to data of fixed size N, is defined as:
$$\max_{D : |D| = N} \bigl(-\log P(D \mid H_{G'}) + \log P(D \mid H_G)\bigr). \qquad (2.30)$$
The goal is to find a hypothesis HG′ that minimizes the worst-case regret. The solution to this minimax
problem is the normalized maximum likelihood distribution, which induces codes with the following length:
$$L(D, H_G) = -\mathrm{LL}(H_G \mid D) + \mathcal{C}_D(H_G), \qquad (2.31)$$
where CD(HG) is the parametric complexity of HG for data D. The parametric complexity is in general
not computable; however, a linear-time algorithm was proposed to compute the parametric complexity of a
single multinomial variable [27].
Since, in the case of a Bayesian network, the distribution of Xi | ΠXi, for i ∈ {1, . . . , n}, is
a multinomial distribution with ri states and Nij observations, it is possible to decompose the parametric
complexity into:
$$\mathcal{C}_D(H) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \mathcal{C}^{r_i}_{N_{ij}},$$
where C^{ri}_{Nij} is the parametric complexity associated with data of size Nij generated by a multinomial with
ri states. This gives rise to the factorized normalized maximum likelihood (fNML) scoring function, which
is given by:
$$\mathrm{fNML}(B \mid D) = -\mathrm{LL}(B \mid D) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \mathcal{C}^{r_i}_{N_{ij}}.$$
The parametric complexity can be computed recursively using Algorithm 8, presented next.

Algorithm 8 Compute C^r_m
input: Natural numbers r, m.
output: C^r_m
1: C^1_m = 1
2: C^2_m = ∑_{h=0}^{m} (m choose h) (h/m)^h ((m−h)/m)^(m−h)
3: for k = 3, . . . , r do
4:   C^k_m = C^{k−1}_m + (m/(k−2)) C^{k−2}_m
5: end for
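A minimal Java sketch of this computation is given below, assuming the recurrence C^r_m = C^{r−1}_m + (m/(r−2)) C^{r−2}_m from [27] and using double-precision arithmetic; the names are illustrative and this is not the thesis code.

// Minimal sketch of Algorithm 8: multinomial parametric complexity C^r_m.
public final class ParametricComplexity {

    /** Returns C^r_m for a multinomial with r states and m observations. */
    public static double compute(int r, int m) {
        if (m == 0) return 1.0;
        double c1 = 1.0;                                   // C^1_m
        double c2 = 0.0;                                   // C^2_m
        for (int h = 0; h <= m; h++) {
            double term = binomial(m, h);
            if (h > 0)     term *= Math.pow((double) h / m, h);
            if (m - h > 0) term *= Math.pow((double) (m - h) / m, m - h);
            c2 += term;
        }
        if (r == 1) return c1;
        double prev = c1, curr = c2;                       // C^(k-2)_m and C^(k-1)_m
        for (int k = 3; k <= r; k++) {
            double next = curr + ((double) m / (k - 2)) * prev;
            prev = curr;
            curr = next;
        }
        return curr;
    }

    private static double binomial(int n, int k) {
        double res = 1.0;
        for (int i = 1; i <= k; i++) res = res * (n - k + i) / i;
        return res;
    }
}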
Chapter 3
Dynamic Bayesian Networks
3.1 Basic Concepts
Dynamic Bayesian networks (DBN) model the stochastic evolution of a set of random variables over time
[32]. Consider the discretization of time into time slices T = {0, . . . , T}. Let X[t] = (X1[t], . . . , Xn[t]) be a
random vector that denotes the value of the set of attributes at time t. Furthermore, let X[t1 : t2] denote
the set of random variables X for the interval t1 ≤ t ≤ t2. Consider a set of individuals H measured
over T sequential instants of time. The set of observations is represented as {xh[t]}h∈H, t∈T, where
xh[t] = (xh1, . . . , xhn) ∈ Rn is a single observation of n attributes, measured at time t and referring to
individual h.
In DBNs our goal is to define a joint probability distribution over all possible trajectories, i.e., over the possible
values of each attribute Xi at each instant t, Xi[t]. Let P(X[t1 : t2]) denote the joint probability distribution
over the trajectory of the process from X[t1] to X[t2]. The space of possible trajectories, however, is very
large, and so it is necessary to make simplifying assumptions in order to obtain a tractable problem.
Observations are viewed as i.i.d. samples of a sequence of probability distributions {Pθ[t]}t∈T. For
all individuals h ∈ H and a fixed time t, the probability distribution is considered constant, i.e., xh[t] ∼
Pθ[t], for all h ∈ H. Using the chain rule, the joint probability over X is given by:
$$P(X[0:T]) = P(X[0]) \prod_{t=0}^{T-1} P\bigl(X[t+1] \mid X[0:t]\bigr).$$
A common assumption is to consider that the attributes in time slice t + 1 only depend on those in time
slice t, for t ∈ {0, . . . , T − 1}.
Definition 42 (mth-order Markov Assumption). A stochastic process over X satisfies the mth-order
Markov assumption if, for all t ≥ 0
$$P\bigl(X[t+1] \mid X[0] \cup \cdots \cup X[t]\bigr) = P\bigl(X[t+1] \mid X[t-m+1] \cup \cdots \cup X[t]\bigr).$$
In this case m is called the Markov lag of the process.
A simplistic approach is to assume that the process is stationary; this may hold in some particular cases,
but in most it does not. Nevertheless, when the number of instances in the training data is small, it is common
to assume stationarity. Another option is to consider the process piece-wise stationary.
Definition 43 (Stationary stochastic process). A stochastic process is stationary (also called time-invariant
or homogeneous) if P(X[t+1] | X[t]) is the same for all time slices t ∈ {0, . . . , T − 1}.
Considering the first-order Markov assumption, we can encode the joint probability in a compact way, by
defining an initial distribution P(X[0]) and the transition distributions P(X[t+1] | X[t]), for all t ∈ {0, . . . , T − 1},
so that P(X[0:T]) = P(X[0]) ∏_{t=0}^{T−1} P(X[t+1] | X[t]).
Definition 44 (First-order dynamic Bayesian network). A non-stationary first-order Markov DBN consists of:
• A prior network B0, which specifies a distribution over the initial states X[0].
• A set of transition networks B_t^{t+1} over the variables X[t] ∪ X[t+1], representing the state transition
probabilities, for 0 ≤ t ≤ T − 1.
The transition network has the additional constraint that edges between slices must flow forward in
time. A stationary network contains only one prior network and one transition network. A first-order
Markov DBN has a prior network and a transition network for each transition of time t→ t+ 1. Observe
that a transition network encodes the inter-slice dependencies (from time transitions t → t + 1) and
intra-slice dependencies (in time slice t+ 1). Figure 3.1 represents a DBN.
3.2 Learning Dynamic Bayesian Networks
Learning dynamic Bayesian networks, considering no hidden variables or missing values, i.e., consider-
ing a fully observable process, reduces simply to applying the methods described for Bayesian networks
in Section 2.2 for each transition of time [20]. Not taking into account the acyclicity constraints, it was
proved that learning Bayesian networks does not have to be NP -hard [15]. This result can be applied
to DBNs, as the resulting “unrolled” graph, that contains a “copy” of each attribute in each time step, is
acyclic. And, on the other hand, it was also derived in the same paper a time complexity bound in the
number of random variables for the MDL and the Bayesian Dirichlet Equivalence scores. More recently
27
[Figure 3.1: a prior network over X1[0], X2[0], X3[0] (panel a) and a transition network over X1, X2, X3 at times 0 and 1 (panel b).]
Figure 3.1: An example of a DBN B. On the left, the prior network B0 is depicted and, on the right, the transition network B_0^1 is represented. The edges E1 = {(X1[0], X1[1]), (X2[0], X2[1])} are the inter-slice connections and the edge E2 = {(X2[1], X3[1])} represents the intra-slice connection.
a polynomial-time algorithm for learning optimal DBNs was proposed, using mutual information tests
(MIT) [46]. Software for learning DBNs that does not consider the intra-slice networks has also been proposed [30].
In addition, a polynomial-time algorithm was proposed that learns both the inter-slice and intra-slice connections in a
transition network; the resulting network is denoted by tDBN [28]. However, the search space for the
intra-slice networks is restricted to tree-augmented networks, i.e., acyclic networks such that each attribute
has at most one parent from the same time slice, but can have a finite number p of parents from
the previous time slices. The letter t in the tDBN notation reflects the search space considered.
We will now describe this algorithm for a first-order DBN. Denote by P≤p(X[t]) the set of subsets of X[t] with
cardinality less than or equal to p. For each Xi[t+1] ∈ X[t+1], the optimal set of parents ΠXi[t] ∈ P≤p(X[t])
from the previous time slice yields the following score:
$$s_i = \max_{\Pi_{X_i}[t] \in \mathcal{P}_{\leq p}(X[t])} \phi_i\bigl(\Pi_{X_i}[t], D_t^{t+1}\bigr),$$
where φi is the local score of attribute Xi and D_t^{t+1} is the subset of observations for the time transition
t → t+1. Then, allowing at most one parent Xj[t+1] from the current time slice, the maximal score is
defined as:
$$s_{ij} = \max_{\Pi_{X_i}[t] \in \mathcal{P}_{\leq p}(X[t])} \phi_i\bigl(\Pi_{X_i}[t] \cup \{X_j[t+1]\}, D_t^{t+1}\bigr). \qquad (3.1)$$
For each transition t → t+1, a complete directed graph over X[t+1] is built, the optimal set of parents for
all nodes is determined, and the maximal branching is computed using Algorithm 1. The tDBN algorithm
has a worst-case complexity of O(n^{p+3} r^{p+2} N), where r is the maximum number of discrete states a
variable can take [28]; the procedure is summarized in Algorithm 9.
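The following is a minimal Java sketch of the parent-set search behind si and sij in Equation (3.1), assuming a decomposable local score is supplied as a function; the weight sij is obtained by appending the candidate intra-slice parent Xj[t+1] to each enumerated set before scoring. The interface and names are illustrative and do not correspond to the tDBN implementation.

// Minimal sketch: enumerate parent sets of size at most p and keep the best local score.
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public final class ParentSearch {

    /** Enumerates all subsets of 'candidates' (indices into the previous slice) with at most p elements. */
    static List<List<Integer>> subsetsUpTo(List<Integer> candidates, int p) {
        List<List<Integer>> result = new ArrayList<>();
        result.add(new ArrayList<>());                       // empty parent set
        for (int c : candidates) {
            int size = result.size();
            for (int s = 0; s < size; s++) {
                List<Integer> extended = new ArrayList<>(result.get(s));
                if (extended.size() < p) {
                    extended.add(c);
                    result.add(extended);
                }
            }
        }
        return result;
    }

    /** s_i = max over parent sets from the previous slice of phi(i, parents). */
    static double bestScore(int i, List<Integer> previousSlice, int p,
                            BiFunction<Integer, List<Integer>, Double> phi) {
        double best = Double.NEGATIVE_INFINITY;
        for (List<Integer> parents : subsetsUpTo(previousSlice, p))
            best = Math.max(best, phi.apply(i, parents));
        return best;
    }
}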
Algorithm 9 Optimal non-stationary first-order Bayesian tDBN structure learning
input: Set of attributes X, dataset D and a decomposable scoring function φ.
output: A tree-augmented DBN structure.
1: for each transition t→ t+ 1 do
2: Build a complete directed graph in X[t+ 1].
3: Calculate the weight of all edges and the optimal set of parents of all nodes.
4: Apply a maximum branching algorithm.
5: Extract transition t→ t+ 1 network using the maximum branching and the optimal set of parents.
6: end for
7: Collect transition networks to obtain the DBN structure.
Chapter 4
Proposed Method
We propose a polynomial-time algorithm, in the number of attributes, for learning consistent k-graph
dynamic Bayesian networks, denoted by cDBN. It was proved that the class of consistent k-graphs is
exponentially larger, in the number of variables, than the class of tree-network structures [6]. The
algorithm for learning cDBN structures starts by deriving the optimal branching of the input data, and
then determines, for each attribute, the optimal set of parents with cardinality less than or equal to k that is
consistent with the order induced by the optimal branching [6]. Recall the definition given in Subsection 2.2.4
for consistent k-graphs and the CkG learning algorithm represented in Algorithm 4, page 10.
A polynomial-time algorithm for learning optimal tree-augmented dynamic Bayesian networks was
proposed in [28]. Considering a first-order Markov DBN, for each time transition t → t + 1 the algorithm
outputs the maximum branching for the intra-slice connections in time step t + 1 and the optimal set of
parents, with maximum cardinality p, from the previous time step t. Recall the algorithm for learning
optimal non-stationary first-order Markov tree-augmented networks depicted in Section 3.2, Algorithm 9, page 29.
The proposed algorithm increases exponentially the search space of the intra-slice connections for
each transition network, by applying the CkG learning algorithm. We start by giving a formal definition
of a consistent k-graph dynamic Bayesian network.
Definition 45 (Consistent k-graph dynamic Bayesian network). A dynamic Bayesian network is called a
consistent k-graph DBN, denoted by cDBN, if for each intra-slice transition network Gt+1, t ∈ {0, . . . , T −
1}, the following holds: i) Gt+1 is a k-graph, i.e., each node has in-degree at most k; ii) given the optimal
branching R over the set of nodes X[t+1], for every edge in Gt+1 from Xi[t+1] to Xj[t+1], the node
Xi[t+1] is in the path from the root of R to Xj[t+1].
Theorem 46. Algorithm 10 finds an optimal mth-order cDBN, given a decomposable scoring function φ,
a set of n random variables, a maximum number p of parents from the previous m time steps and a
bounded in-degree k in each intra-slice network.
Proof. Let B be the optimal cDBN and B′ be the DBN output by Algorithm 10. Without loss of generality,
consider the transition {t−m+1, . . . , t} → t+1, and let B_{t−m+1}^{t+1} and B′_{t−m+1}^{t+1} be the corresponding transition
networks (Algorithm 10 is presented below).
Algorithm 10 Learning Optimal mth-order Markov cDBN
input: Set of attributes X, dataset D, a Markov lag m, a decomposable scoring function φ, maximum
intra-slice graph in-degree of k and maximum number of parents from the previous time slices of p.
output: Optimal mth-order cDBN.
1: for each transition t−m+ 1, . . . , t → t+ 1 do
2: Build a complete directed graph in X[t+ 1].
3: Calculate the weight of all edges and the optimal set of p parents from t −m + 1, . . . , t for all
nodes.
4: Apply a maximum branching algorithm to the intra-slice graph in t+ 1 that outputs the maximum
branching R.
5: for each node Xi ∈ R do
6: Compute the set αi of ancestors of i, that is, the set of nodes connecting the root R and Xi.
7: for each subset S of αi with at most k nodes do
8: Compute φi(S,D).
9: if φi(S,D) is the maximal score for Xi then
10: Set ΠXi to S.
11: end if
12: end for
13: end for
14: end for
15: Collect the transition networks to obtain the DBN structure.
Denote by D_{t−m+1}^{t+1} the subset of observations regarding the transition {t−m+1, . . . , t} → t+1.
By definition of the optimal cDBN:
$$\phi\bigl(B_{t-m+1}^{t+1}, D_{t-m+1}^{t+1}\bigr) \geq \phi\bigl(B'^{\,t+1}_{t-m+1}, D_{t-m+1}^{t+1}\bigr).$$
We will prove by contradiction that φ(B_{t−m+1}^{t+1}, D_{t−m+1}^{t+1}) ≤ φ(B′_{t−m+1}^{t+1}, D_{t−m+1}^{t+1}).
Suppose φ(B, D_{t−m+1}^{t+1}) > φ(B′, D_{t−m+1}^{t+1}). Then, since the scoring function φ is decomposable:
$$\phi_R\bigl(\emptyset, D_{t-m+1}^{t+1}\bigr) + \sum_{i \neq R} \phi_i\bigl(\Pi_i[t-m+1] \cup \cdots \cup \Pi_i[t] \cup \{X_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) >$$
$$\phi_{R'}\bigl(\emptyset, D_{t-m+1}^{t+1}\bigr) + \sum_{i \neq R'} \phi_i\bigl(\Pi'_i[t-m+1] \cup \cdots \cup \Pi'_i[t] \cup \{X'_j[t+1]\}, D_{t-m+1}^{t+1}\bigr), \qquad (4.1)$$
where Πi[t−m+1] ∪ · · · ∪ Πi[t] are the parents from the time slices t−m+1, . . . , t, Xj[t+1] is the
parent from the time slice t+1 and R is the root of the unrolled graph. Let ∆i[t−m+1] ∪ · · · ∪ ∆i[t] be the
optimal set of parents from time slices t−m+1, . . . , t determined in Step 3 for node i. Equation (4.1)
is equivalent to:
$$\sum_{i \neq R} \Bigl[\phi_i\bigl(\Pi_i[t-m+1] \cup \cdots \cup \Pi_i[t] \cup \{X_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) - \phi_i\bigl(\Delta_i[t-m+1] \cup \cdots \cup \Delta_i[t], D_{t-m+1}^{t+1}\bigr)\Bigr] >$$
$$\sum_{i \neq R'} \Bigl[\phi_i\bigl(\Pi'_i[t-m+1] \cup \cdots \cup \Pi'_i[t] \cup \{X'_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) - \phi_i\bigl(\Delta_i[t-m+1] \cup \cdots \cup \Delta_i[t], D_{t-m+1}^{t+1}\bigr)\Bigr].$$
Notice, however, that the maximum branching algorithm applied to the intra-slice graph, Step 4 of
Algorithm 10, constructs a complete graph such that the edge X′j → X′i is weighted by
$$\phi_i\bigl(\Pi'_i[t-m+1] \cup \cdots \cup \Pi'_i[t] \cup \{X'_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) - \phi_i\bigl(\Delta_i[t-m+1] \cup \cdots \cup \Delta_i[t], D_{t-m+1}^{t+1}\bigr),$$
and outputs the maximal spanning tree. Moreover, in Steps 5–11, all sets of parents from the time slice
t+1 with cardinality at most k that are consistent with the maximal spanning tree are checked. Therefore,
the optimal set of parents is found for each node. On the other hand, the selected graph is acyclic: if there
existed a cycle X1, . . . , Xi, X1, then X1 would have to be in the path connecting the root R to X1 itself.
Hence, we arrive at a contradiction. Therefore, B_{t−m+1}^{t+1} = B′_{t−m+1}^{t+1} and, generalizing over all
transitions {t−m+1, . . . , t} → t+1, with t ∈ {0, . . . , T − 1}, we conclude B = B′.
Theorem 47. Algorithm 10 takes time
$$\max\bigl\{O(n^{p+3} m^{p+4} r^{p+2} N T),\; O(n^{k+1} r^{k+1} N T)\bigr\},$$
given a decomposable scoring function φ, a Markov lag m, a set of n random variables, a maximum
number p of parents from the previous m time steps, a bounded in-degree k in each intra-slice network,
and a set of observations of N individuals over T time steps.
Proof. For each transition {t−m+1, . . . , t} → t+1, in Step 3, iterating over all the edges takes
time O((nm)^2). The number of subsets of parents with at most p elements is given by:
$$|\mathcal{P}_{\leq p}(X[t])| = \sum_{i=1}^{p} \binom{nm}{i} < \sum_{i=1}^{p} (nm)^i \in O\bigl((nm)^p\bigr). \qquad (4.2)$$
To calculate the score of each parent set, considering that the maximum number of states a variable may
take is r, and that each variable has at most p+1 parents (p from the previous m time slices and one from
the current), the number of possible configurations is r^{p+2}. The score of each configuration
is computed over the set of observations D_{t−m+1}^{t+1}, which has |D_{t−m+1}^{t+1}| elements. Denote the number of
individuals by N. The scores are stored in a |D_{t−m+1}^{t+1}| × n(m+1) matrix, therefore taking O(m^2 n N)
comparisons to determine the optimal set of parents. The maximum branching, Step 4, has
time complexity O(n^2); therefore, Steps 2–4 take time O(n^{p+3} m^{p+4} r^{p+2} N). Step 5 takes O(n) time,
as it ranges over all variables. The number of subsets with at most k elements is, as in (4.2), in O(n^k).
For each set of ancestors, the number of possible configurations is r^{k+1}, and these are stored in a |D^{t+1}| × n
matrix; therefore, Steps 5–11 take time O(n^{k+1} r^{k+1} N). Algorithm 10 ranges over all T time transitions,
hence it takes time max{O(n^{p+3} m^{p+4} r^{p+2} N T), O(n^{k+1} r^{k+1} N T)}.
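As an illustration of Steps 5–8 of Algorithm 10, the following Java sketch collects the ancestors αi of a node along the optimal branching and selects the best consistent intra-slice parent set with at most k elements; the scoring interface and names are illustrative and do not correspond to the released implementation.

// Minimal sketch: ancestors along the branching and the best consistent parent set.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public final class ConsistentParents {

    /** alpha_i: the nodes on the path from the root of the branching to node i (excluding i).
     *  branchingParent[i] gives the parent of node i in the branching (-1 for the root). */
    static List<Integer> ancestors(int i, int[] branchingParent) {
        List<Integer> path = new ArrayList<>();
        for (int v = branchingParent[i]; v != -1; v = branchingParent[v])
            path.add(v);
        return path;
    }

    /** Best intra-slice parent set for node i among subsets of alpha_i with at most k elements. */
    static List<Integer> bestConsistentParents(int i, int[] branchingParent, int k,
                                               Function<List<Integer>, Double> phiOfParents) {
        List<Integer> alpha = ancestors(i, branchingParent);
        List<Integer> best = new ArrayList<>();
        double bestScore = phiOfParents.apply(best);          // start with the empty set
        // enumerate subsets of alpha by bitmask (alpha is small: at most the tree depth)
        for (int mask = 1; mask < (1 << alpha.size()); mask++) {
            if (Integer.bitCount(mask) > k) continue;          // in-degree bound
            List<Integer> subset = new ArrayList<>();
            for (int b = 0; b < alpha.size(); b++)
                if ((mask & (1 << b)) != 0) subset.add(alpha.get(b));
            double score = phiOfParents.apply(subset);
            if (score > bestScore) { bestScore = score; best = subset; }
        }
        return best;
    }
}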
Chapter 5
Experimental Results
The experimental results are organized as follows: in Section 5.1, the results for the CMDL learning
algorithm are presented; in Section 5.2, the results for the cDBN learning algorithm are presented.
5.1 Learning Bayesian Networks with CMDL
We implemented a score-based Bayesian network learning algorithm using the CMDL as scoring function.
The implementation was done in Java, following an object-oriented paradigm, and is released under a free
software license, available at https://margaridanarsousa.github.io/learn_cmdl/. We used the greedy
hill climber (GHC) as search procedure and the covering graphs as search space. The experiments
were run on an Intel Core i5-4200U CPU @ 1.60GHz×4 machine. We start by analyzing a benchmark
data set, LED, for which the Bayesian network structure used to generate it is known. Then, we analyze
the compression achieved for real data sets, comparing the CMDL, MDL and LL codes.
LED Data Set
The LED database was used by Fung and Crawford [21] and Singh and Valtorta [40]. The network
represents a faulty LED display. There are eight variables, one representing the digit key and the remaining
seven corresponding to the seven segments of the display. In this case, segment 1 is conditionally
independent of the digit key given the state of LED segments 2 and 3, whereas in a normal display
knowledge about the depressed key is sufficient to indicate which LED segments are on. The original
network is represented in Figure 5.1. Figures 5.2, 5.3, 5.4 and 5.5 represent the evolution of the learned
network using CMDL as scoring function, considering N = 1000, N = 2000, N = 3000 and N = 4000
observations, respectively (for N = 5000 the same network as for N = 4000 was recovered). In Table 5.1,
the compression achieved using the CMDL code is represented. Figures 5.6 and 5.7 represent the learned
networks using MDL as scoring criterion, for N = 1000 up to N = 4000 and for N = 5000 observations,
respectively. In Table 5.2, the compression achieved using the MDL code is depicted. In Figures 5.8 and 5.9,
the evolution of the learned networks using LL is depicted and, in Table 5.3, the compression achieved
using this code is represented.
Figure 5.1: LED database network.
Figure 5.2: Learned network using CMDL with N = 1000 observations and 1000 random restarts in GHC.
Figure 5.3: Learned network using CMDL with N = 2000 observations and 1000 random restarts in GHC.
Figure 5.4: Learned network using CMDL with N = 3000 observations and 1000 random restarts in GHC.
Figure 5.5: Learned network using CMDL with N = 4000 observations and 1000 random restarts in GHC. For N = 5000 the same network was recovered.
N CMDL-true (bits) CMDL-optimal (bits) -LL-true (bits) -LL-optimal (bits)
1000 4876.35 4510.52 4831.28 6973.49
2000 8796.83 6747.01 9313.41 14273.07
3000 12743.26 7543.56 13852.30 21487.60
4000 16753.36 16319.94 16068.92 18439.11
5000 20707.54 20256.814 22912.41 20004.45
Table 5.1: Compression achieved using the CMDL code. CMDL-optimal and LL-optimal correspond to the length of the codes induced by the optimal structure found by the GHC; 1000 random restarts were considered. CMDL-true and LL-true correspond to the length of the codes induced by the initial structure, represented in Figure 5.1. N is the number of instances considered.
Figure 5.6: Learned network using MDL with N = 1000, N = 2000, N = 3000, N = 4000 observations and 1000 restarts in GHC.
Figure 5.7: Learned network using MDL with N = 5000 observations and 1000 restarts in GHC.
N MDL-true (bits) MDL-optimal (bits) -LL-true (bits) -LL-optimal (bits)
1000 5194.98 4687.96 4831.28 4324.27
2000 9713.63 8599.92 9313.41 8166.80
3000 14273.89 12543.60 13852.30 12087.37
4000 18875.84 16541.56 18439.11 16068.92
5000 23360.91 20489.81 22912.41 20004.45
Table 5.2: Compression achieved using the MDL code. MDL-optimal and LL-optimal correspond to the length of the codes induced by the optimal structure found by the GHC; 1000 random restarts were considered for the GHC. MDL-true and LL-true correspond to the length of the codes induced by the initial structure, represented in Figure 5.1. N is the number of instances considered.
Figure 5.8: Learned network using LL with N = 1000 observations and 1000 restarts in GHC.
Figure 5.9: Learned network using LL with N = 2000, N = 3000, N = 4000, N = 5000 observations and 1000 restarts in GHC.
Real Data
We evaluated the compression achieved with the CMDL and MDL codes using four datasets from the
UCI repository [1]. Results are presented in Table 5.4.
N      -LL-true (bits)   -LL-optimal (bits)
1000    4831.28            4324.27
2000    9313.41            8166.80
3000   13852.30           12087.36
4000   18439.11           16068.92
5000   22912.41           20004.45

Table 5.3: Compression achieved using the LL code. LL-optimal and LL-true correspond to the length induced by the optimal structure found by the GHC and by the initial structure, represented in Figure 5.1. As usual, 1000 random restarts were considered for the GHC.
Data set Nb of Attributes Nb of Classes Nb of Instances MDL (bits) CMDL (bits)
chess 36 2 3196 57200.64 54487.32
letter 16 26 20000 731659.06 715538.73
shuttle-small 9 7 5800 77736.14 76483.61
waveform-21 21 3 5000 202127.20 200098.62
Table 5.4: Description of the data sets used in the experiments and the compression achieved with the MDL and CMDL codes.
Discussion
From the experimental results regarding the LED data set, we observe that none of the scoring functions
is able to recover the original structure, even considering N = 5000 instances. Moreover, none of the scores
gives rise to a structure that captures the conditional independence of segment 1 and the digit key given
segments 2 and 3.
CMDL is the scoring criterion that selects the most complex structures for N = 1000, 2000, 3000 observations.
On the other hand, it yields the maximum compression rate for N = 2000, N = 3000, N = 4000, N = 5000.
Intuitively, this may be interpreted as follows: the data set is being compressed aggressively, so the
regularities of the training data are captured in an exaggerated manner; this leads to overfitting to the training
data and to the selection of complex structures. Roughly, as the number of instances increases, the behavior
of CMDL and its code length seem to approach those of LL.
All scoring criteria converge to the same network for N = 5000 instances. MDL is the most consistent
criterion, in the sense that it selects the same structure for N = 1000, N = 2000, N = 3000 and N = 4000
observations. On the other hand, it gives rise to the highest description length. Notice that, in all cases
except CMDL with N = 1000, the log-likelihood is always higher for the optimal model selected by the
GHC than for the model that generated the data.
From the results for the real data, we observe that CMDL compresses the data further than MDL.
Moreover, the results in Table 5.4 show that increasing the number of instances increases the difference
between the code lengths of MDL and CMDL.
Therefore, we conclude that CMDL is not an advantage, in terms of learning, when compared to MDL.
However, it gives rise to considerably higher compression rates.
5.2 Learning cDBNs
We will now compare the results obtained using Algorithm 9 [28], denoted by tDBN, which restricts the
search space for the intra-slice network of the transition networks to tree-network structures, and
Algorithm 10, proposed in this thesis, denoted by cDBN, which increases the search space exponentially,
to consistent k-graphs. For Algorithm 9 we used the implementation released under a free software
license, available at http://josemonteiro.github.io/tDBN/. Algorithm 10 was implemented in Java,
following an object-oriented paradigm, and is released under a free software license, available at https:
//margaridanarsousa.github.io/learn_cDBN/. The experiments were run on an Intel Core i5-4200U
CPU @ 1.60GHz×4 machine. We start by analyzing the performance of the proposed algorithm on
synthetic data generated from stationary first-order Markov cDBNs. Then, a first-order Markov cDBN
is used to model the evolution of patients with rheumatoid arthritis.
Experiment 1 – Synthetic Data
A first-order cDBN structure and parameters were determined, and observations were sampled from
the generated network. Algorithms 9 and 10 were applied to the resulting data sets, and the ability
to learn and recover the original network structure was measured. The maximum intra-slice in-degree
k considered in Algorithm 10 was taken to be that of the initial structure. To fully evaluate the
performance of the cDBN learning algorithm, we did not consider the topological order induced by the
optimal branching (see note 1); we considered instead the breadth-first-search order of the optimal
branching (see note 2). We compared the original and recovered networks using the precision, recall
and F1 metrics, defined as follows:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad (5.1)$$
$$\text{recall} = \frac{TP}{TP + FN}, \qquad (5.2)$$
$$F_1 = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \qquad (5.3)$$
where TP are the true positive edges, FP are the false positive edges and FN are the false negative
edges. Five independent datasets were sampled from the generated network, for each given number of
observations. The initial networks considered are represented in Figure 5.10. The results are depicted in
Tables 5.5 and 5.6, where the presented values are annotated with a 95% confidence interval. tDBN+LL
and tDBN+MDL denote the tDBN learning algorithm applied with the LL and MDL scoring functions,
respectively. cDBN+LL and cDBN+MDL denote the cDBN learning algorithm applied with the LL and
MDL scoring functions, respectively.
Note 1: A topological order of a DAG G = (V,E) is a total ordering of all its vertices such that, if E contains an edge (u, v), then u appears before v in the ordering [11].
Note 2: We consider the breadth-first-search order as defined in [11].
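A minimal Java sketch of how the structure-recovery metrics (5.1)–(5.3) can be computed from the true and learned edge sets is given below; edges are represented as strings for simplicity and the code is illustrative only.

// Minimal sketch: precision, recall and F1 between true and learned edge sets
// (assumes both sets are non-empty).
import java.util.HashSet;
import java.util.Set;

public final class RecoveryMetrics {

    /** Returns {precision, recall, F1} for a recovered edge set against the true one. */
    public static double[] evaluate(Set<String> trueEdges, Set<String> learnedEdges) {
        Set<String> tp = new HashSet<>(learnedEdges);
        tp.retainAll(trueEdges);                                    // true positives
        int fp = learnedEdges.size() - tp.size();                   // false positives
        int fn = trueEdges.size() - tp.size();                      // false negatives

        double precision = tp.size() / (double) (tp.size() + fp);
        double recall = tp.size() / (double) (tp.size() + fn);
        double f1 = 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}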
[Figure 5.10: four initial transition networks over the variables Xi[0] and Xi[1].]
Figure 5.10: Initial networks for the experiments, considering the following parameters (from left to right): n = 5, r = 2, k = 2; n = 5, r = 2, k = 4; n = 10, r = 3, k = 5; n = 10, r = 4, k = 6.
Table 5.5: Comparative structure recovery results for tDBN+LL and tDBN+MDL on simulated data. n is the number of network attributes, p is the number of parents from the preceding time-slice, r is the number of states of all attributes and N is the number of observations. Running time is in seconds.

Network 1 (n = 5, r = 2)
N     | tDBN+LL (Pre, Rec, F1, Time)                 | tDBN+MDL (Pre, Rec, F1, Time)
250   | 0.457±0.0501, 0.582±0.0638, 0.512±0.0561, 1  | 0.803±0.0568, 0.6±0.0637, 0.686±0.0613, 1
500   | 0.543±0.0638, 0.691±0.0813, 0.608±0.0715, 1  | 0.853±0.128, 0.655±0.117, 0.74±0.122, 1
750   | 0.557±0.0469, 0.691±0.0813, 0.624±0.0715, 1  | 0.908±0.114, 0.727±0.101, 0.807±0.106, 1
1000  | 0.614±0.0307, 0.782±0.0390, 0.688±0.0344, 1  | 0.856±0.0835, 0.654±0.0781, 0.741±0.0795, 1

Network 2 (n = 5, r = 2)
250   | 0.586±0.0469, 0.547±0.0437, 0.566±0.0452, 1  | 0.831±0.0938, 0.333±0.0522, 0.475±0.0664, 1
500   | 0.600±0.0307, 0.660±0.0286, 0.580±0.0296, 1  | 0.857±0, 0.4±0, 0.545±0, 1
750   | 0.614±0.0307, 0.573±0.0286, 0.593±0.0296, 1  | 0.893±0.0475, 0.440±0.0286, 0.589±0.0324, 1
1000  | 0.614±0.0307, 0.573±0.0286, 0.593±0.0296, 1  | 0.918±0.0591, 0.440±0.0286, 0.594±0.0362, 1

Network 3 (n = 10, r = 3)
250   | 0.497±0.0779, 0.411±0.0645, 0.45±0.0706, 1   | 0.583±0.103, 0.194±0.0401, 0.291±0.0579, 1
500   | 0.538±0.0308, 0.446±0.0255, 0.488±0.0279, 1  | 0.804±0.0652, 0.314±0.0388, 0.452±0.0507, 1
750   | 0.593±0.0352, 0.491±0.0292, 0.538±0.0319, 1  | 0.784±0.0885, 0.314±0.0411, 0.449±0.0564, 1
1000  | 0.579±0.0226, 0.48±0.0187, 0.525±0.0205, 1   | 0.893±0.0596, 0.383±0.0255, 0.536±0.0358, 1

Network 4 (n = 10, r = 4)
250   | 0.345±0.0331, 0.303±0.0291, 0.323±0.0310, 1  | 0.273±0, 0.0909±0, 0.136±0, 1
500   | 0.359±0.0308, 0.315±0.0271, 0.335±0.0288, 1  | 0.297±0.0260, 0.103±0.0130, 0.153±0.0178, 1
750   | 0.414±0.0382, 0.367±0.0336, 0.387±0.0358, 1  | 0.374±0.0180, 0.145±0.0106, 0.209±0.0139, 1
1000  | 0.469±0.0242, 0.412±0.0212, 0.439±0.0226, 1  | 0.385±0, 0.152±0, 0.217±0, 1
Table 5.6: Comparative structure recovery results for cDBN+LL and cDBN+MDL. n is the number of network attributes, p is the number of parents from the preceding time-slice, r is the number of states of all attributes, k is the number of parents from the current time slice and N is the number of observations. Running time is in seconds. The parameter k is taken to be the maximum in-degree of the intra-slice network of the initial structure.

Network 1 (n = 5, k = 2, r = 2)
N     | cDBN+LL (Pre, Rec, F1, Time)                  | cDBN+MDL (Pre, Rec, F1, Time)
250   | 0.541±0.0601, 0.836±0.0930, 0.657±0.030, 2    | 0.733±0.0477, 0.600±0.0400, 0.66±0.0429, 2
500   | 0.576±0.0386, 0.891±0.0597, 0.7±0.0469, 3     | 0.871±0.0312, 0.727±0, 0.792±0.0134, 2
750   | 0.635±0.0206, 0.982±0.0319, 0.771±0.0469, 4   | 0.920±0.0621, 0.782±0.0390, 0.844±0.0408, 4
1000  | 0.612±0.0252, 0.945±0.0390, 0.743±0.0307, 4   | 0.933±0.0477, 0.782±0.0637, 0.850±0.0561, 5

Network 2 (n = 5, k = 4, r = 2)
250   | 0.740±0.0175, 0.987±0.0234, 0.846±0.0200, 2   | 1.00±0, 0.600±0, 0.750±0, 2
500   | 0.750±0, 1.00±0, 0.857±0, 2                   | 0.98±0.0351, 0.613±0.0234, 0.754±0.0226, 3
750   | 0.750±0, 1.00±0, 0.857±0, 4                   | 0.980±0.0351, 0.600±0, 0.744±0.0105, 4
1000  | 0.750±0, 1.00±0, 0.857±0, 4                   | 0.96±0.0428, 0.600±0, 0.738±0.0128, 4

Network 3 (n = 10, k = 5, r = 3)
250   | 0.407±0.00781, 0.862±0.0165, 0.553±0.0106, 10 | 0.820±0.0534, 0.623±0.0252, 0.708±0.0356, 10
500   | 0.415±0.0186, 0.877±0.0393, 0.563±0.0252, 24  | 0.856±0.0327, 0.638±0.0270, 0.731±0.0286, 29
750   | 0.433±0.0156, 0.915±0.0330, 0.588±0.0212, 32  | 0.914±0.0159, 0.731±0, 0.812±0.00616, 34
1000  | 0.418±0.0175, 0.885±0.0370, 0.568±0.0237, 55  | 0.884±0.0221, 0.708±0.0165, 0.786±0.0287, 54

Network 4 (n = 10, k = 6, r = 4)
250   | 0.495±0.0111, 0.885±0.0199, 0.635±0.0143, 52  | 0.389±0.0227, 0.224±0.0130, 0.284±0.0165, 51
500   | 0.498±0.0119, 0.891±0.0212, 0.639±0.0152, 110 | 0.453±0.0226, 0.261±0.0130, 0.331±0.0165, 109
750   | 0.495±0.00594, 0.885±0.0106, 0.635±0.00762, 167 | 0.463±0.0185, 0.267±0.0106, 0.338±0.0135, 162
1000  | 0.492±0.0133, 0.879±0.0238, 0.63±0.0170, 225  | 0.463±0.0185, 0.267±0.0106, 0.338±0.0135, 229
Experiment 2
We further show that, given data generated from a fixed structure, the cDBN learning algorithm is able to
recover the initial network. Figure 5.11 shows the evolution of the learned structure as the number of
observations N increases, considering an initial structure with n = 5 attributes, p = 2 maximum number
of parents from the previous time slice and k = 2 maximum in-degree in the intra-slice network.
Figure 5.12 considers an initial structure with n = 5, p = 1 and k = 2. In both cases, MDL was used as
scoring criterion.
[Figure 5.11 panels: (a) Original network; (b) tDBN and cDBN for N = 250, N = 500; (c) tDBN and cDBN for N = 750; (d) cDBN for N = 1250; (e) tDBN for N = 1250.]
Figure 5.11: Recovered networks for the tDBN and cDBN algorithms.
[Figure 5.12 panels: (a) Initial structure; (b) tDBN for N = 500; (c) cDBN for N = 500; (d) tDBN for N = 1000; (e) cDBN for N = 1000; (f) tDBN for N = 2000 until N = 4500; (g) cDBN for N = 2000 until N = 4500; (h) tDBN for N = 5000; (i) cDBN for N = 5000.]
Figure 5.12: Recovered networks for the tDBN and cDBN algorithms.
Experiment 3 – Real Data: Rheumatoid Arthritis
As a next step, we used the proposed algorithm to model the evolution of the rheumatoid arthritis (RA)
disease in patients. RA is a chronic disease that causes joint pain, stiffness, swelling and decreased
movement of the joints [33]. The activity of this disease is not constant: there are periods of
mild activity and periods of increased disease activity. We considered a stationary DBN because no
temporal alignment of the individuals, with respect to the disease evolution, was expected in the dataset;
this also allows us to use a larger number of observations and, consequently, to obtain more complex structures.
We used the database provided by Reuma.pt [3], which contains the observations of 426 patients over
9305 hospital visits. For each patient and hospital visit, the characteristics of the patient (age, medical
history), the disease activity (medical scores, health assessment questionnaires, joint evaluation, lab
tests, adverse events) and the therapy (active agents) were recorded. We considered the preprocessed
data from [29], where a selection of attributes was made with the following criteria: attributes that
did not change over time were discarded, as were attributes with more than 25% missing values.
Continuous attributes were discretized into 10 equal-width intervals and the median
of each interval was chosen as representative. We considered the resulting attributes and observations,
and used the cDBN algorithm to predict the disease activity score (DAS) class for the following time
slice. The resulting attributes are described next [29]:
• n_meses_inicio_bio: number of months since the beginning of the treatment with the current
biological agent.
• eva_doente: visual analogue of pain according to the patient.
• vs: the rate at which red blood cells sediment, used as a non-specific measure of inflammation (units: mm/h).
• pcr: amount of C-reactive protein (CRP), a protein found in the blood plasma whose levels rise in response to inflammation (units: mg/l).
• ndDAS: number of painful joints from the 28 joints measured to assess the DAS.
• ntDAS: number of swollen joints from the 28 joints measured to assess the DAS.
• nd: total number of painful joints.
• nt: total number of swollen joints.
• idade_consulta_arred: current age of the patient, in years.
• desc_bio_activo: current biological agent for RA treatment.
• anos_doenca_ate_cons: number of years since the patient was diagnosed with RA.
• i_manif_ea: indication of disease manifestation besides the joints.
• cod_actividade_das: DAS class.
The measure of disease activity (DAS) in patients that suffer from RA is defined as [44]:
$$\mathrm{DAS} = 0.56\sqrt{\mathrm{ndDAS}} + 0.28\sqrt{\mathrm{ntDAS}} + 0.70\,\ln(\mathrm{vs}) + 0.014\,\mathrm{eva\_doente}. \qquad (5.4)$$
The resulting DAS was further discretized into 4 classes, defined as [44]:
• Remission (Class 0) for DAS < 2.6.
• Low disease activity (Class 1) for 2.6 ≤ DAS ≤ 3.2.
• Medium disease activity (Class 2) for 3.2 < DAS ≤ 5.1.
• High disease activity (Class 3) for DAS > 5.1.
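For illustration, a minimal Java sketch of the DAS computation in (5.4) and of its discretization into the four classes above is given below; the attribute names follow the dataset description, and the code is not the preprocessing pipeline of [29].

// Minimal sketch: DAS value (5.4) and the corresponding class code.
public final class Das {

    /** DAS = 0.56*sqrt(ndDAS) + 0.28*sqrt(ntDAS) + 0.70*ln(vs) + 0.014*eva_doente. */
    public static double das(double ndDAS, double ntDAS, double vs, double evaDoente) {
        return 0.56 * Math.sqrt(ndDAS) + 0.28 * Math.sqrt(ntDAS)
             + 0.70 * Math.log(vs) + 0.014 * evaDoente;
    }

    /** Maps a DAS value to the class used as cod_actividade_das. */
    public static int dasClass(double das) {
        if (das < 2.6) return 0;            // remission
        if (das <= 3.2) return 1;           // low disease activity
        if (das <= 5.1) return 2;           // medium disease activity
        return 3;                           // high disease activity
    }
}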
Figures 5.13, 5.14, 5.15 and 5.16 represent the learned first-order cDBNs for different values of k,
from one to three, and different scoring criteria, LL or MDL.
[Figure 5.13: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.13: From top to bottom: tDBN with m = 1, p = 1 and MDL; cDBN with m = 1, p = 1, k = 2 and MDL.
[Figure 5.14: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.14: Top: Resulting cDBN for m = 1, p = 1, k = 3 using MDL. Bottom: Resulting cDBN for m = 1, p = 1, k = 2 and MDL, considering the topological order induced by the tree containing all the attributes such that the class DAS has the highest depth.
[Figure 5.15: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.15: Top: Resulting cDBN for m = 1, p = 1, k = 3 and MDL, considering the topological order induced by the tree containing all the attributes such that the class DAS has the highest depth. Bottom: Resulting cDBN considering p = 1, m = 1, k = 2 and LL.
[Figure 5.16: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.16: Top: Resulting cDBN considering p = 1, m = 1, k = 3 and LL. Bottom: Resulting cDBN considering p = 1, m = 1, k = 3 with LL.
Classification
We used the cDBN algorithm to predict the DAS class of a patient from one hospital visit to the next,
comparing its performance with that of the tDBN algorithm. We measured the average accuracy and the precision,
defined as:
$$\text{average accuracy} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \qquad (5.5)$$
and
$$\text{precision} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)}, \qquad (5.6)$$
where C is the number of classes and TPi, TNi, FPi and FNi are, respectively, the true positive, true
negative, false positive and false negative counts for class i.
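A minimal Java sketch of how metrics (5.5) and (5.6) can be computed from a confusion matrix is given below; the representation is illustrative only.

// Minimal sketch: multi-class average accuracy (5.5) and precision (5.6)
// from a confusion matrix where confusion[actual][predicted] counts test instances.
public final class ClassificationMetrics {

    /** Returns {averageAccuracy, precision} over C classes. */
    public static double[] evaluate(int[][] confusion) {
        int classes = confusion.length;
        int total = 0;
        for (int[] row : confusion) for (int c : row) total += c;

        double accuracySum = 0.0;
        int tpSum = 0, tpFpSum = 0;
        for (int i = 0; i < classes; i++) {
            int tp = confusion[i][i];
            int fn = 0, fp = 0;
            for (int j = 0; j < classes; j++) {
                if (j != i) {
                    fn += confusion[i][j];      // class i instances predicted as j
                    fp += confusion[j][i];      // class j instances predicted as i
                }
            }
            int tn = total - tp - fn - fp;
            accuracySum += (tp + tn) / (double) total;
            tpSum += tp;
            tpFpSum += tp + fp;
        }
        return new double[] { accuracySum / classes, tpSum / (double) tpFpSum };
    }
}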
It was previously observed that these metrics do not depend on the Markov lag m, and the optimal number of
parents from the previous time slice was determined to be p = 1 [29]. Therefore, we kept m = 1 and p = 1 and
varied the number k of parents from the same time slice and the scoring function considered, LL or MDL.
These metrics were measured using 10-fold cross-validation. The results are presented in Table 5.7.
Model k Precision Average Accuracy
tDBN+LL 1 0.444 0.632
tDBN+MDL 0.516 0.691
cDBN+LL 2 0.465 0.644
cDBN+MDL 0.522 0.696
cDBN+LL 3 0.464 0.641
cDBN+MDL 0.520 0.693
cDBN+LL 4 0.459 0.636
cDBN+MDL 0.523 0.696
cDBN+LL 5 0.462 0.639
cDBN+MDL 0.523 0.697
Table 5.7: Experimental results for the tDBN and cDBN classification algorithms, where N = 4721 observations were considered.
Discussion
We will now analyze the results of Experiments 1, 2 and 3.
Experiment 1
Regarding network 1, represented in Figure 5.10, page 41, we observe that, for all numbers of observations
N considered, cDBN+LL clearly outperforms tDBN+LL. This result was expected, since the
LL scoring function does not penalize the complexity of the structures, and therefore the more complex
consistent 2-graphs are recovered. The selected structures give rise to a considerably higher recall
and to a similar precision. On the other hand, the results for cDBN+MDL and tDBN+MDL for a small
number of observations, e.g. N = 250 and N = 500, are similar. This was also expected, since MDL
penalizes the complexity of the structures, so a larger number of observations is necessary
to select consistent 2-graphs. For a higher number of observations, e.g. N = 750 and N = 1000,
cDBN+MDL performs better than cDBN+LL, as the precision is considerably higher for the former.
Considering network 2, represented in Figure 5.10, page 41, cDBN+LL outperforms the other
implementations. The intra-slice network considered is the fully connected consistent 4-graph, and is therefore
clearly biased towards the cDBN+LL algorithm. In this case, cDBN+LL learns the necessary and sufficient
connections, whereas in the other settings considered it clearly overfits the training data.
The cDBN+MDL implementation, however, improves its performance, with some fluctuations, as the
number of observations N increases.
In the case of network 3, represented in Figure 5.10, page 41, cDBN+MDL gives rise to the
best results. The penalizing term in MDL prevents false positive edges from being chosen, resulting in
significantly higher precision values compared to LL.
Considering network 4, represented in Figure 5.10, page 41, each variable has r = 4 possible values;
the number of parameters therefore increases, which explains the stronger regularization effect of the
MDL scoring function. In this case, cDBN+LL is the implementation that yields the best results: the
precision obtained is similar to that of cDBN+MDL, but the recall is considerably higher.
In general, the recall obtained with the cDBN learning algorithms, when compared to tDBN, is always
greater, while the precision is similar in both cases. Comparing the MDL and LL implementations, MDL
has higher precision, while LL has higher recall. The performance of the implementations
using MDL as scoring function improves with the number of observations, giving rise to a higher recall.
In terms of running time, tDBN has a constant running time of 1 second for all networks considered; the
cDBN algorithm has a higher running time, but it was always less than 4 minutes. The cDBN algorithm
improves the F1 measure in all cases, by at least roughly 5%.
The number of observations necessary for the cDBN to recover the first and second structures represented
in Figure 5.10 is, respectively, 6000 ± 748.33 and 14904.28 ± 6665.30, with a 95% confidence
interval, where five independent datasets were sampled from the generated network and MDL was
used. These numbers are considerably high for networks with five attributes. When k increases,
the number of necessary observations increases significantly.
Experiment 2
From Figures 5.11 and 5.12, we observe that, in order for the cDBN+MDL algorithm to recover both the
inter-slice and intra-slice connections of the initial structure, a substantial number of observations is
necessary. In Figure 5.11, considering n = 5 attributes, p = 2 parents from the previous time slice and
k = 2 parents from the current time slice, the algorithm converges when N = 1250. In Figure 5.12,
considering n = 5, p = 1 and k = 2, the algorithm only converges to the initial structure for N = 5000
observations.
53
Experiment 3
Regarding Figures 5.13 and 5.14 (top), we observe that the number of observations considered (N =
4721) is not sufficient for the cDBN+MDL algorithm to learn the intra-slice consistent k-graphs,
taking k = 2, 3. The cDBN+MDL for k = 2, 3, like the tDBN+MDL, only selects the vs attribute from the
future hospital visit to influence the predicted cod_actividade_das.
Since our goal is to predict the DAS class of a given patient from one hospital visit to the next, we
considered the topological order of the nodes induced by a tree such that the DAS has the highest depth.
The results are represented in Figures 5.14 (bottom) and 5.15 (top). In this case, the cDBN+MDL
algorithm selected the attributes cod_actividade_das (from the previous visit) and ndDAS (from the future visit)
to influence the prediction of cod_actividade_das. However, it is not able to learn more complex
consistent k-graphs for k = 2, 3.
The cDBN+LL algorithm, on the other hand, is able to learn consistent 2-graphs and 3-graphs
for the intra-slice connections. Taking k = 2, the algorithm selects the attributes eva_doente and ndDAS to
influence the prediction of cod_actividade_das. Considering k = 3, it selects the attributes eva_doente,
ndDAS and idade_consulta_arred. Notice that the variable idade_consulta_arred is not used to compute
the DAS class; see Equation (5.4). For k = 4, the same dependencies for the intra-slice network are
learned; however, cod_actividade_das from the previous visit is no longer considered to influence this
measure for the future visit. Instead, vs is the only attribute that influences the future DAS class.
Table 5.7 depicts the results for the DAS class classification task. We observe that
the average accuracy always increases when using the cDBN algorithm; however, this improvement
is not substantial. The maximum average accuracy improvement is of 0.6% and is obtained for
k = 5. Using the LL scoring function yields, in all cases, a lower average accuracy and precision than
MDL. The maximum precision improvement is of 0.7% and is obtained for k = 4 and k = 5.
Chapter 6
Conclusions
The main advantage of CMDL, compared to MDL, is its completeness, in the sense that MDL
reserves many code words to encode the same sequence, whereas CMDL reserves one code word for
each parameter. From the experimental results, we verified that the CMDL scoring criterion compresses
the data aggressively; the regularities of the training data are therefore over-learned and the criterion does
not generalize well. Hence, in terms of learning, MDL clearly outperforms CMDL. However, CMDL gives
rise to considerably lower description lengths. These facts are discussed in Section 10.2 of Grunwald's
book on the Minimum Description Length [22].
The cDBN learning algorithm has polynomial time complexity with respect to the number of attributes
and can be applied to stationary or non-stationary Markov processes. The proposed algorithm increases
the search space of the intra-slice connections exponentially, compared with the tDBN algorithm. When
more complex k-graphs are considered (with k > 1), cDBN is a good alternative to tDBN: it is able to
recover a larger number of dependencies and, in all cases considered, improves the performance of the
state-of-the-art tDBN algorithm in terms of the F1 measure.
Directions of future work
As future work, we could derive a non-asymptotic code for the parameters of the Bayesian networks. In
this case, the precise parameters are sent to the receiver, so MDL saves approximately (1/2) log N
bits in the first part of the description, since only the truncated parameters are encoded. On the other
hand, in the second part of the description, MDL reserves code words for all possible instances in the
data, whereas CMDL reserves code words only for a subset of them. Asymptotically, these two effects cancel
and both codes give rise to approximately equal code lengths [22]. Considering the precise parameters
would bring modifications to the learning and to the compression achieved.
Comparing the compression achieved using CMDL with the Bayesian network compression methods
proposed by Davies and Moore could also be considered [14].
In terms of the implementation, a more efficient search procedure could be considered [43], instead
of the greedy hill climber.
The cDBN considers the topological order induced by the optimal branching as a heuristic for a
causality order between the network variables. However, there are n! ways to order the
n variables, and other orders could be considered. On the other hand, considering a total order would
increase the search space significantly. The breadth-first-search order of the optimal branching is a good
candidate [6].
Bibliography
[1] Catherine L Blake. UCI repository of Machine Learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[2] P Bonissone, M Henrion, L Kanal, and J Lemmer. Equivalence and synthesis of causal models. In
UAI, volume 6, page 255, 1991.
[3] Helena Canhao, Augusto Faustino, Fernando Martins, Joao Eurico Fonseca, Patrıcia Nero, and
Jaime C Branco. Reuma.pt-The rheumatic diseases portuguese register. Acta reumatologica por-
tuguesa, 36(1):45–56, 2011.
[4] Alexandra Carvalho, Mario Figueiredo, and Margarida Sousa. Complete Minimum Description
Length for Learning Bayesian networks (to be submitted).
[5] Alexandra M Carvalho. Scoring functions for learning Bayesian networks. INES-ID Tec. Rep, 2009.
[6] Alexandra M Carvalho and Arlindo L Oliveira. Learning Bayesian networks consistent with the
optimal branching. In Machine Learning and Applications, 2007. ICMLA 2007. Sixth International
Conference on, pages 369–374. IEEE, 2007.
[7] David Maxwell Chickering. Learning Bayesian networks is NP-complete. Learning from data: Arti-
ficial Intelligence and Statistics V, 112:121–130, 1996.
[8] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees.
IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[9] Gregory F Cooper. The computational complexity of probabilistic inference using Bayesian belief
networks. Artificial Intelligence, 42(2-3):393–405, 1990.
[10] Gregory F Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic
networks from data. Machine Learning, 9(4):309–347, 1992.
[11] Thomas H Cormen. Introduction to algorithms. MIT press, 2009.
[12] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[13] Paul Dagum and Michael Luby. Approximating probabilistic inference in Bayesian belief networks
is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.
[14] Scott Davies and Andrew Moore. Bayesian networks for lossless dataset compression. In Proceed-
ings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 387–391. ACM, 1999.
[15] Norbert Dojer. Learning Bayesian networks does not have to be NP-hard. In MFCS, pages 305–
314. Springer, 2006.
[16] Jack Edmonds. Optimum branchings. Mathematics and the Decision Sciences, Part, 1:335–345,
1968.
[17] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on
Information Theory, 21(2):194–203, 1975.
[18] Mario AT Figueiredo. Elementos de Teoria da Informacao. 2011.
[19] Nir Friedman and Daphne Koller. Being Bayesian about network structure. A Bayesian approach to
structure discovery in Bayesian networks. Machine Learning, 50(1-2):95–125, 2003.
[20] Nir Friedman, Kevin Murphy, and Stuart Russell. Learning the structure of dynamic probabilistic
networks. In Proceedings of the Fourteenth conference on UAI, pages 139–147. Morgan Kaufmann
Publishers Inc., 1998.
[21] Robert M. Fung and Stuart L. Crawford. Constructor: A system for the induction of probabilistic
models. In AAAI, volume 90, pages 762–769, 1990.
[22] Peter Grunwald. Minimum description length tutorial. Advances in minimum description length:
Theory and applications, pages 23–80, 2005.
[23] Mark H Hansen and Bin Yu. Model selection and the principle of minimum description length.
Journal of the American Statistical Association, 96(454):746–774, 2001.
[24] David Heckerman, Dan Geiger, and David M Chickering. Learning Bayesian networks: The combi-
nation of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[25] David Heckerman, Abe Mamdani, and Michael P Wellman. Real-world applications of Bayesian
networks. Communications of the ACM, 38(3):24–26, 1995.
[26] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT
press, 2009.
[27] Petri Kontkanen and Petri Myllymaki. A linear-time algorithm for computing the multinomial stochas-
tic complexity. Information Processing Letters, 103(6):227–233, 2007.
[28] Jose L Monteiro, Susana Vinga, and Alexandra M Carvalho. Polynomial-time algorithm for learning
optimal tree-augmented dynamic Bayesian networks. In UAI, pages 622–631, 2015.
[29] Jose Maria Pedro Serra Libano Monteiro. Learning from short multivariate time series. Master
Thesis, Instituto Superior Tecnico, 2014.
[30] Kevin Murphy et al. The Bayes net toolbox for matlab. Computing Science and Statistics,
33(2):1024–1034, 2001.
[31] Kevin P Murphy. Machine Learning: a probabilistic perspective. MIT press, 2012.
[32] Kevin Patrick Murphy and Stuart Russell. Dynamic Bayesian networks: representation, inference
and learning. 2002.
[33] American College of Rheumatology. Rheumatoid Arthritis, 2017.
[34] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Mor-
gan Kaufmann, 2014.
[35] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[36] Jorma Rissanen. A universal prior for integers and estimation by minimum description length. The
Annals of Statistics, pages 416–431, 1983.
[37] Jorma Rissanen. Minimum Description Length Principle. Wiley Online Library, 1985.
[38] Jorma J Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information
Theory, 42(1):40–47, 1996.
[39] Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464,
1978.
[40] Moninder Singh and Marco Valtorta. Construction of Bayesian network structures from data: a brief
survey and an efficient algorithm. International Journal of Approximate Reasoning, 12(2):111–131,
1995.
[41] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT
press, 2000.
[42] Robert Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing,
1(2):146–160, 1972.
[43] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for
learning Bayesian networks. arXiv preprint arXiv:1207.1429, 2012.
[44] DM Van der Heijde, Martin A van’t Hof, PL Van Riel, LA Theunisse, Evelien W Lubberts, Miek A van
Leeuwen, Martin H van Rijswijk, and LB Van de Putte. Judging disease activity in clinical practice
in rheumatoid arthritis: first step in the development of a disease activity score. Annals of the
Rheumatic Diseases, 49(11):916–920, 1990.
[45] Marcel AJ Van Gerven, Babs G Taal, and Peter JF Lucas. Dynamic Bayesian networks as prog-
nostic models for clinical patient management. Journal of Biomedical Informatics, 41(4):515–529,
2008.
[46] Nguyen Xuan Vinh, Madhu Chetty, Ross Coppel, and Pramod P Wangikar. Polynomial time al-
gorithm for learning globally optimal dynamic Bayesian network. In International Conference on
Neural Information Processing, pages 719–729. Springer, 2011.
[47] Xin-Qiu Yao, Huaiqiu Zhu, and Zhen-Su She. A dynamic Bayesian network approach to protein
secondary structure prediction. BMC Bioinformatics, 9(1):49, 2008.
[48] Geoffrey Zweig and Stuart Russell. Speech recognition with dynamic Bayesian networks. 1998.