Advances in Probabilistic Graphical Models
Margarida Nunes de Almeida Rodrigues de Sousa
Thesis to obtain the Master of Science Degree in
Matemática e Aplicações
Supervisors: Prof. Alexandra Sofia Martins de Carvalho, Prof. Mário Alexandre Teles de Figueiredo, Prof. Paulo Alexandre Carreira Mateus
Examination Committee
Chairperson: Prof. Maria Cristina De Sales Viana Serôdio Sernadas
Supervisor: Prof. Alexandra Sofia Martins de Carvalho
Supervisor: Prof. Mário Alexandre Teles de Figueiredo
Member of the Committee: Prof. Paulo Alexandre Carreira Mateus
October 2017
Acknowledgments
I want to thank my supervisors, Alexandra Carvalho, Mário Figueiredo and Paulo Mateus, for their important support throughout this journey.
I want to thank Mae, Isabel, Ana, Pai and Pedro for always faithfully believing in me and for giving me
strength.
I would also like to thank Reuma.pt for providing the Rheumatoid arthritis data.
Resumo
A descrição de comprimento mínimo (MDL) é um critério de selecção de modelos bastante conhecido, baseado em teoria da informação. O MDL escolhe o modelo que minimiza o comprimento da descrição dos dados e do modelo. Contudo, Rissanen observou que este critério é redundante, no sentido em que não tem em conta que os parâmetros do modelo são enviados antecipadamente para o receptor. Portanto, só os conjuntos de dados compatíveis com estes parâmetros devem ser considerados, o que torna possível comprimir mais a descrição dos dados. Rissanen propôs um novo critério, chamado Descrição Completa de Comprimento Mínimo (CMDL), que resolve este problema.
Nesta tese, consideramos modelos de redes de Bayes e implementamos um algoritmo de aprendizagem usando o CMDL como função de pontuação, o algoritmo ganancioso de escalada como procedimento de procura e o conjunto das redes de cobertura como espaço de procura. Analisamos o desempenho deste novo critério de selecção usando dados sintéticos e dados reais.
Na segunda parte desta tese, propomos um novo algoritmo de aprendizagem de redes de Bayes dinâmicas k-estruturadas consistentes. O algoritmo proposto aumenta exponencialmente o espaço de procura das estruturas de dependências intra-temporais das redes de transição, quando comparado com o estado da arte (estruturas em árvore). Analisamos o desempenho deste novo algoritmo usando dados sintéticos e reais.
Abstract
The Minimum Description Length (MDL) is a well-known information-theoretical model selection criterion, based on a two-part asymptotic code. MDL selects the model that minimizes the description length of both the data and the model. However, Rissanen observed that this criterion is redundant, in the sense that it does not take into account that the parameters of the model were sent beforehand to the receiver. Therefore, only the data sets compatible with these parameters should be considered, and it becomes possible to further compress the data. Rissanen proposed a new criterion called Complete Minimum Description Length (CMDL) that solves this issue.
In this thesis, we consider Bayesian network models and implement a score-based learning algorithm
using the CMDL as a scoring function, the greedy hill climber as the search procedure and with the set
of covering networks as the search space. We analyze the performance of this model selection criterion,
using synthetic and real data.
In a second part, we propose a new polynomial-time algorithm for learning dynamic Bayesian net-
works. The proposed algorithm increases exponentially the search space for the intra-slice connections
of the transition networks. This algorithm considers the set of consistent k-graphs, instead of the state-
of-the-art tree-network structures.
Keywords: minimum description length, complete minimum description length, compression,
model selection, learning Bayesian networks, dynamic Bayesian networks
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Introduction 1
2 Bayesian Networks 3
2.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Learning Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Parameter Estimation in Bayesian Networks . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Scoring Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Dynamic Bayesian Networks 26
3.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Learning Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Proposed Method 31
5 Experimental Results 35
5.1 Learning Bayesian Networks with CMDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Learning cDBNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6 Conclusions 55
Bibliography 57
Chapter 1
Introduction
We are in the big data era: the amount of data available has increased exponentially in the last decade. Therefore, intelligent and efficient ways of analyzing and learning from these large amounts of data become crucial. Machine learning is defined as the set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty [31].
Bayesian networks are probabilistic graphical models that represent, in a compact way, relations between random variables [34]. They give rise to generative classifiers, as they model the class-conditional probability density functions. They are used in a large variety of real-world applications such as diagnosis, forecasting, automated vision, sensor fusion, manufacturing control, program debugging, information retrieval and troubleshooting of system failures [25].
Given a data set and a set of possible models, the problem of deciding which model to select arises. In this thesis we focus on information-theoretical model selection approaches. These criteria are based on a measure called description length, which expresses the compression achieved when transmitting a given data set. The minimum description length (MDL) [37] is based on a two-part asymptotic code. In the MDL approach, the description length of encoding a data set with a given model is the sum of the length of encoding the data set and the length of encoding the model. Rissanen observed that the MDL is redundant [38]: as the parameters are sent beforehand, only the data sets compatible with these parameters should be considered in the second part of the description. Rissanen proposed a new criterion, called Complete Minimum Description Length (CMDL), that takes this fact into account [37]. In this thesis we implement an algorithm for learning Bayesian network models using the CMDL. We analyze its performance, in terms of learning and compression achieved, using synthetic and real data.
Furthermore, dynamic Bayesian networks (DBN) model stochastic processes [32]. They are used in a large variety of applications, such as protein sequencing [47], speech recognition [48] and clinical forecasting [45]. In the second part of this thesis, we propose a new polynomial-time algorithm for learning DBNs that exponentially increases the search space for the intra-slice connections of the transition networks. We consider the search space for these connections to be the set of consistent k-graphs, whereas the current state-of-the-art algorithm takes the search space to be tree graphs [28]. We analyze the performance of
the proposed algorithm using synthetic and real data.
Claim of contributions
The main contributions of this thesis are:
1. A review of Bayesian networks, dynamic Bayesian networks and their learning algorithms.
2. An implementation of a score based learning algorithm for Bayesian Networks with a new proposed
scoring function, Complete Minimum Description Length. The algorithm was made freely available
at https://margaridanarsousa.github.io/learn_cmdl/.
3. A new polynomial-time algorithm for learning consistent k-graph dynamic Bayesian networks (cDBNs). The algorithm was made freely available at https://margaridanarsousa.github.io/learn_cDBN/.
4. An analysis of the developed methods on simulated and real data, including comparisons to other
methods and to results obtained in other publications.
Thesis outline
In Chapter 2 we start by defining basic concepts on Bayesian networks. We introduce the problem of
learning Bayesian networks in Section 2.2, which has two variants: parameter estimation, defined in
Subsection 2.2.1, and structure learning, defined in Subsection 2.2.2. In order to learn the structure of
Bayesian networks it becomes necessary to specify a scoring function, a search space and a search
procedure. In Subsection 2.2.3 we start by introducing basic coding and data compression concepts
and then describe information theoretical scoring functions.
In Section 3.1 we introduce dynamic Bayesian networks (DBN), that are extensions of Bayesian
networks that evolve in time. We describe the previously proposed methods for learning DBN in Section
3.2. We propose a new learning algorithm for DBN in Chapter 4.
Furthermore, in Chapter 5 we present the experimental results. In Section 5.1 the results regarding
the implementation of the score-based learning algorithm using the Complete Minimum Description
Length are analyzed and discussed. In Section 5.2 the results of the proposed learning algorithm for
dynamic Bayesian networks are presented and discussed.
Finally, in Chapter 6 we make some final remarks and propose directions for future work.
Chapter 2
Bayesian Networks
2.1 Basic Concepts
Let X denote a discrete random variable that takes values over a finite set 𝒳. Furthermore, let X = (X1, . . . , Xn) represent an n-dimensional random vector, where each Xi takes values in 𝒳i = {xi1, . . . , xiri}, and let P(x) denote the probability that X takes the value x.
A Bayesian network encodes the joint probability distribution of a set of n random variables X1, . . . , Xn [34].
The underlying structure of a Bayesian network is based on a directed graph, therefore they are also
known as directed graphical models. They are also named belief networks, generative models or causal
models. Suppose that each of these n random variables has K states; using the chain rule,
P(x) = P(xn | x1, . . . , xn−1) · · · P(x2 | x1) P(x1). (2.1)
In order to determine the joint distribution, we would need to estimate K^n − 1 probabilities, one for each of the K^n possible values that (X1, . . . , Xn) may take. Therefore, computing the joint probability requires space exponential in the number of random variables n. Assuming certain independence properties, the joint probability can be represented in a more compact way and requires fewer parameters.
Definition 1 (Conditional Independence). Let X,Y and Z be sets of random variables. X is said to be
conditionally independent of Y given Z if P (x|y, z) = P (x|z), for all x,y and z. Let X ⊥ Y|Z denote that
X is conditionally independent of Y given Z.
Definition 2 (Bayesian Network). An n-dimensional Bayesian Network (BN) is a triple B = (X, G, Θ), where:
• X = (X1, . . . , Xn) and each random variable Xi takes values in the set {xi1, . . . , xiri}, where xik denotes the k-th value that Xi takes.
• G = (X, E) is a directed acyclic graph (DAG) with nodes in X and edges E representing direct
dependencies between the nodes.
Let ΠXi denote the set of parents of Xi in the network G. Define an ordering for the set of all possible configurations of ΠXi, {wi1, . . . , wiqi}, where qi = ∏_{Xj ∈ ΠXi} rj is the total number of configurations and wij corresponds to the j-th configuration of ΠXi.
• Each random variableXi has an associated conditional probability distribution (CPD) or local prob-
abilistic model with parameters:
Θijk = PB(Xi = xik|ΠXi = wij). (2.2)
The set Θ encodes the parameters {Θijk}, for i ∈ {1, . . . , n}, j ∈ {1, . . . , qi} and k ∈ {1, . . . , ri}, of the network G.
Let Nij be the number of instances in the data D where the variables ΠXi take their j-th configuration wij. Observe that Xi | ΠXi ∼ Multinomial(Nij, θij1, . . . , θijri) for i ∈ {1, . . . , n} and j ∈ {1, . . . , qi}, i.e., the distribution of a node Xi conditioned on a parent configuration of ΠXi is multinomial.
Example 3 (Medical Diagnosis). Consider the Bayesian network depicted in Figure 2.1, representing
two diseases, Pneumonia and Flu. Both diseases cause Fever, however the XRay only shows signs in
the case of Pneumonia and the Muscular Pain is only caused by a Flu. Consider the following notation:
Pneumonia → Pn, Flu → Fl, Fever → Fe, Xray → Xr and Muscular Pain → Mp. All of the random
variables are binary. Consider P(Pn) = 0.05 and P(Fl) = 0.02; the CPD tables are depicted in Figure 2.2. The number of rows of each table is the number of parent configurations, and each row represents the distribution of the random variable given that parent configuration, which is a multinomial distribution.
Figure 2.1: A Bayesian network representing dependencies between diagnosis and diseases.
(a) Pn, Fl | P(Fe | Pn, Fl): (1, 0) | 0.8; (0, 1) | 0.6; (0, 0) | 0.2; (1, 1) | 0.01
(b) Pn | P(Xr | Pn): 0 | 0.8; 1 | 0.6
(c) Fl | P(Mp | Fl): 0 | 0.8; 1 | 0.6
Figure 2.2: CPDs Tables of Example 3.
A BN B induces a unique joint probability distribution over X given by:
P_B(X1, . . . , Xn) = ∏_{i=1}^{n} P_B(Xi | ΠXi). (2.3)
Intuitively the graph of a BN can be viewed as a network structure that provides the skeleton for
representing the joint probability compactly in a factorized way, and making inferences in the probabilistic
graphical model provides the mechanism for gluing all these components back together in a probabilistic
coherent manner [26].
Definition 4 (Markov Local Assumptions). Given a BN with network structure G over random variables X1, . . . , Xn, G encodes the following set of conditional independence assumptions:
Xi ⊥ NonDescendants(Xi) | ΠXi, for all random variables Xi, (2.4)
where NonDescendants(Xi) are the variables in G that are non-descendants of Xi. These assumptions are called the Markov local assumptions.
Example 5. The BN depicted in Figure 2.1 encodes the following Markov local assumptions: Pn ⊥ Fl | ∅, Fl ⊥ Pn | ∅, Fe ⊥ {Xr, Mp} | {Pn, Fl}, Xr ⊥ {Mp, Fe, Fl} | Pn and Mp ⊥ {Xr, Pn, Fe} | Fl.
Bayesian networks reduce the number of values that must be determined when computing the joint probability P_B(X1, . . . , Xn) to a number that is exponential only in max_{i ∈ {1,...,n}} |ΠXi|.
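To make this reduction concrete, the following Python sketch (a minimal illustration; the helper names are ours) compares the K^n − 1 probabilities of the full joint distribution with the ∑_i (ri − 1)qi parameters of the factorization induced by the structure of Figure 2.1.

def full_joint_parameters(r):
    # r1 * ... * rn - 1 free probabilities for the full joint distribution.
    total = 1
    for ri in r:
        total *= ri
    return total - 1

def bn_parameters(r, parents):
    # Sum over nodes of (r_i - 1) * q_i, where q_i is the number of parent configurations.
    total = 0
    for i, ri in enumerate(r):
        qi = 1
        for p in parents[i]:
            qi *= r[p]
        total += (ri - 1) * qi
    return total

# Five binary variables ordered as (Pn, Fl, Fe, Xr, Mp), with Fe <- {Pn, Fl}, Xr <- Pn, Mp <- Fl.
r = [2, 2, 2, 2, 2]
parents = [[], [], [0, 1], [0], [1]]
print(full_joint_parameters(r))   # 31
print(bn_parameters(r, parents))  # 10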
Informally, two Bayesian networks are equivalent if they encode the same joint probability distribution.
The next theorem provides sufficient and necessary conditions for the equivalence of two Bayesian
networks.
Definition 6 (v-structure). In a directed acyclic graph, a v-structure is a local dependency X → Y ← Z.
Example 7. The edges E = {(Pn, Fe), (Fl, Fe)} in the graph represented in Figure 2.1 form a v-structure.
The skeleton of any DAG is the undirected graph resulting from ignoring the direction of every edge.
Theorem 8 (From [2]). Two directed acyclic graphs are equivalent if and only if they have the same
skeleton and the same v-structures.
Since tree networks have no v-structures, two trees with the same edges are equivalent, indepen-
dently of the direction of the edges.
2.2 Learning Bayesian Networks
Learning a Bayesian network has two variants: parameter learning and structure learning. When learning the parameters, we assume the underlying graph G is given, and our goal is to estimate the set of parameters Θ of the network. When learning the structure, the goal is to find a structure G, given only the training data. We assume the data is complete, i.e., each instance is fully observed (there are no missing or hidden values), and that the training set D is given by a set of N i.i.d. instances, D = {x1, . . . , xl, . . . , xN}.
2.2.1 Parameter Estimation in Bayesian Networks
There are two approaches to estimating the Bayesian network parameters: maximum likelihood estimation and Bayesian variants. Both approaches are based on the likelihood function. We will begin by describing the maximum likelihood estimation approach. The likelihood of a set of parameters ΘG, given an underlying graph G, is:
L(D, ΘG) = P(D | ΘG) = ∏_{l=1}^{N} P(x_l | ΘG).
Considering the Markov local independence assumptions, and that the sets of parameters θ_{Xi|ΠXi} are disjoint for i ∈ {1, . . . , n}, the likelihood can be decomposed into the product of the local likelihood functions of each node Xi, and becomes:
L(D, ΘG) = ∏_{i=1}^{n} Li(θ_{Xi|ΠXi}, D),
where
Li(θ_{Xi|ΠXi}, D) = ∏_{l=1}^{N} P(x_{il} | Π_{x_{il}}, θ_{Xi|ΠXi}),
and x_{il} denotes the observed value of the variable Xi in instance l of D and Π_{x_{il}} denotes the observed parent configuration of Xi in instance l. In this case, our problem reduces to maximizing each local likelihood Li independently.
Let Nijk be the number of instances in the data set D where the variable Xi takes the value xik and its parent set ΠXi takes the configuration wij. Denote by Nij the number of instances in D where ΠXi takes the configuration wij,
Nij = ∑_{k=1}^{ri} Nijk.
Let N be the total number of instances in the data D. Assuming that Xi | ΠXi ∼ Multinomial(Nij, θij1, . . . , θijri), the local likelihood of Xi simplifies to:
Li(θ_{Xi|ΠXi}, D) = ∏_{j=1}^{qi} ∏_{k=1}^{ri} θijk^{Nijk}. (2.5)
Our goal is to maximize Li(θ_{Xi|ΠXi}, D) for all i ∈ {1, . . . , n}, subject to the constraints
∑_{k=1}^{ri} θijk = 1, for all j ∈ {1, . . . , qi}. (2.6)
Using the general result for the maximum likelihood estimate of a multinomial distribution, we obtain the estimate
θijk = Nijk / Nij, (2.7)
which is known as the observed frequency estimate (OFE). The maximum likelihood estimate, however, overfits the training data in many situations. Moreover, this estimate assigns probability zero to events that are extremely unlikely, but not impossible.
In a Bayesian approach a regularization term is added to the parameters, which gives rise to a significantly more robust estimator. Consider a Dirichlet prior distribution over the parameters θijk with hyperparameters αijk, θijk ∼ Dir(αijk); as this distribution is the conjugate prior of a multinomial distribution, the posterior distribution is again a Dirichlet distribution with hyperparameters Nijk + αijk, θijk | D ∼ Dir(Nijk + αijk), which yields the following estimate:
θijk = (Nijk + αijk) / ∑_k (Nijk + αijk). (2.8)
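The following Python sketch computes both estimates, (2.7) and (2.8), from complete data; the function and variable names are ours and purely illustrative, and the code is not the implementation released with this thesis.

from collections import defaultdict

def estimate_parameters(data, parents, alpha=0.0):
    # data    : list of tuples, one value per variable.
    # parents : parents[i] is the list of parent indices of variable i.
    # alpha   : Dirichlet hyperparameter; alpha = 0 gives the OFE (2.7),
    #           alpha > 0 gives the smoothed estimate (2.8).
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    values = defaultdict(set)
    for row in data:
        for i, xi in enumerate(row):
            cfg = tuple(row[p] for p in parents[i])
            counts[i][cfg][xi] += 1
            values[i].add(xi)
    theta = {}
    for i in counts:
        theta[i] = {}
        for cfg, cnt in counts[i].items():
            denom = sum(cnt.values()) + alpha * len(values[i])
            theta[i][cfg] = {v: (cnt.get(v, 0.0) + alpha) / denom for v in values[i]}
    return theta

# The dataset of Figure 2.6(b), with the v-structure X1 -> X3 <- X2.
data = [(0, 0, 0), (0, 1, 1), (0, 1, 1), (1, 1, 1)]
parents = [[], [], [0, 1]]
print(estimate_parameters(data, parents, alpha=0.5)[2])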
2.2.2 Structure Learning
The main methods proposed to solve the problem of learning the structure of a general Bayesian network are: independence tests or constraint-based approaches [41], Bayesian model averaging approaches [19], and search-based methods [10, 24].
The constraint-based approach views Bayesian networks as encodings of conditional dependencies and independencies, and tries to test and infer these conditions in the data in order to construct a network. The Bayesian model averaging approach does not look for a single network, but rather tries to define a probability distribution over all possible structures and to average the predictions over all networks.
The most common method is the search-based method, and we will focus on this approach. The space of all Bayesian networks with n nodes has a superexponential number of structures, 2^{O(n²)}. Learning general Bayesian networks is an NP-hard problem: Cooper [9] proved that inference in a general Bayesian network is NP-hard. Later, Dagum and Luby proved that even finding an approximate solution is NP-hard [13]. Chow and Liu [8] and Edmonds [16] use an optimal branching algorithm that finds the optimal Bayesian network when the search space is constrained to tree graphs. Cooper [10] proposes a polynomial-time algorithm for learning Bayesian networks consistent with a given order and a bounded in-degree¹. Chickering [7] proved that even constraining to graphs with in-degree at most 2 is NP-hard. Therefore, we resort to heuristic search techniques. Score-based methods reduce the problem of learning a Bayesian network to a model selection problem, viewing a BN as a statistical prediction model. Define a scoring function φ : S × X → R that measures how well a Bayesian network B fits the data D (where S denotes the search space). The problem reduces to an optimization problem: given a scoring function, a data set, a search space and a search procedure, find the network that maximizes this score. However, the heuristic-search method is not guaranteed to find the optimal network. We will consider as the search space the set of all Bayesian networks with n variables, denoted by Bn.
Definition 9 (Learning a Bayesian Network). Given data D = {x1, . . . , xN} and a scoring function φ, the problem of learning a Bayesian network is to find a Bayesian network B ∈ Bn that maximizes the value φ(B, D).²
Thus, search-based methods can be improved by finding new scoring criteria or new search methods. In this work we implement a score-based learning algorithm using a new scoring function. In addition, we propose a new search procedure for the dynamic counterpart of Bayesian networks.
¹ The in-degree of a node Xi is |ΠXi|.
² Note that, for clarity, in Subsection 2.2.3 and in Section 5.1 this problem is defined as the minimization of −φ(B, D).
As was mentioned in the beginning of this Section, if we restrict the search space S to tree networks
or networks with known ordering over the variables and bounded in-degree, it is possible to obtain a
global optimum solution for the structure learning problem. We will now describe the search procedures
for the mentioned search spaces.
The generalization of the Chow-Liu algorithm [8] to any score-equivalent and decomposable scoring function, proposed by Heckerman et al. [24], is depicted in Algorithm 1. It starts by building a complete weighted undirected graph, such that the weight of the edge between Xi and Xj is φj(Xi, D) − φj(∅, D). Then, it is possible to determine a maximal weighted spanning tree in polynomial time. An arbitrary node is chosen to be the root of the tree and the direction of every edge is set to be outward from it.
Algorithm 1 Learning tree Bayesian networks, for any decomposable and score equivalent φ-score
1: Compute φj(Xi, D) − φj(∅, D) between each pair of attributes Xi and Xj, with i ≠ j and i, j ≤ n.
2: Build a complete undirected graph with attributes X1, . . . , Xn as nodes. Annotate the weight of an
edge connecting Xi and Xj by the value computed in the previous step.
3: Build a maximal weight (undirected) spanning tree.
4: Transform the resulting undirected tree to a directed one by choosing a root variable and setting the
direction of all edges to be outward from it and return the resulting tree.
Heckerman also proposes a polynomial-time algorithm for the case of scoring functions that are decomposable but not score equivalent, represented in Algorithm 2 [24]. In this case, the edge Xi → Xj may have a different score from the edge Xj → Xi, and so one must build a directed spanning tree. Edmonds' algorithm [16] finds an optimal spanning tree, given a root. By ranging over all possible roots, it is possible to find an optimal spanning tree in polynomial time.
Algorithm 2 Learning tree Bayesian networks, for any decomposable φ-score
1: Compute φj(Xi, D) − φj(∅, D) for each edge from Xi to Xj, with i ≠ j and i, j ≤ n.
2: Build a complete directed graph with attributes X1, . . . , Xn as nodes. Annotate the weight of an edge
connecting Xi and Xj by the value computed in the previous step.
3: Build a maximal weight directed spanning tree.
In the case where the BN is consistent with a given ordering and has bounded in-degree k, an algorithm named K2 was proposed, represented in Algorithm 3 [10]. For each node Xi the algorithm tests all parent sets among the subsets of {X1, . . . , Xi−1} with at most k elements and selects the optimal one. The algorithm is polynomial in the number of variables, but exponential in k [10].
A polynomial-time algorithm to learn Bayesian networks with underlying consistent k-graphs (CkG) was proposed and is represented in Algorithm 4 [6]. The set of networks consistent with the optimal branching and with bounded in-degree is exponentially larger, in the number of variables, than the set of trees. In Figure 2.4 the inclusion relations of trees, polytrees³ and CkG graphs are represented.
3A polytree is a DAG such that the underlying undirected graph is a tree.
Algorithm 3 K2 algorithm
input: A set of nodes X1, . . . , Xn, an ordering on the nodes, an upper bound on the in-degree k, a
data set D and a scoring function φ.
output: The optimal parent set for each node.
1: for each node Xi, following the given ordering, do
2: for each subset S of {X1, . . . , Xi−1} with at most k nodes do
3: Compute φi(S, D).
4: if φi(S, D) is the maximal score for Xi then
5: Set ΠXi to S.
6: end if
7: end for
8: end for
9: Output the directed graph such that the parents of a node Xi are ΠXi.
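A direct way to realize this exhaustive search is sketched below in Python; score(i, parent_set) stands for the local score φi(S, D) and is assumed to be supplied by the caller (the code is illustrative, not the thesis implementation).

from itertools import combinations

def learn_with_order(order, k, score):
    # For each node, test every subset of its predecessors of size at most k
    # and keep the one with the highest local score.
    parent_sets = {}
    for pos, xi in enumerate(order):
        predecessors = order[:pos]
        best_set, best_score = (), score(xi, ())
        for size in range(1, k + 1):
            for subset in combinations(predecessors, size):
                s = score(xi, subset)
                if s > best_score:
                    best_set, best_score = subset, s
        parent_sets[xi] = best_set
    return parent_sets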
Definition 10 (k-graph). A k-graph is a graph where each node has in-degree at most k.
Definition 11 (Consistent k-graph). Given a directed tree R over a set of nodes V , a graph G = (V,E)
is said to be a consistent k-graph (CkG) w.r.t R if it is a k-graph and for any edge in E from Xi to Xj the
node Xi is in the path from the root of R to Xj . We denote by CkR the set of all CkG’s w.r.t. R.
Figure 2.3: Network structure for Example 12.
Example 12. Considering the optimal branching represented in Figure 2.3, we observe that adding the edge (X1, X5) gives rise to a consistent 2-graph, whereas adding the edge (X2, X4) does not.
Figure 2.4: Inclusion relations of trees, CkG and polytree graphs [6].
The algorithm for learning CkG structures starts by determining the optimal branching, and then adds the relevant edges that could not be included due to the tree restriction, choosing the optimal subset of ancestors S as the parent set of each node Xi.
Algorithm 4 Learning CkG networks
1: Run a deterministic algorithm Aφ that outputs an optimal branching R.
2: for each node Xi in R do
3: Compute the set αi of ancestors of i, that is, the set of nodes connecting the root of R and Xi.
4: for each subset S of αi with at most k nodes do
5: Compute φi(S,D).
6: if φi(S,D) is the maximal score for Xi then
7: Set ΠXito S.
8: end if
9: end for
10: end for
11: Output the directed graph such that the parents of a node Xi are ΠXi.
For general Bayesian networks, the heuristic search procedure attempts to find the optimal BN,
but is not guaranteed to. The greedy hill-climber search (GHC) is the most common procedure and
Heckerman et al. found it to yield the best combination between accuracy and efficiency. We will define
the neighborhood of a given structure in DAG-space to be all networks we can reach by applying one of
the following operations:
• add an edge;
• delete an edge;
• flip an edge.
The GHC starts with an initial network, which can be empty, random or constructed using prior knowledge. At each search step it moves through the neighborhood of the current network and selects the network with the largest improvement in the score, which becomes the current network. The process is repeated until there is no network in the neighborhood that improves the current score. There are a few
extensions of the GHC:
• TABU list: keeps track of recently visited structures and avoids them, i.e., it is not considered legal to move to any of these structures in the next search steps. This strategy helps avoiding getting stuck in a local maximum [26].
• Random restarts: once stuck, apply random operations (add, remove or flip an edge) and restart the greedy search. This strategy helps escaping a basin of attraction, instead of moving from local maximum to local maximum [26].
The GHC with the extensions described is represented in Algorithm 5 [26]. We will now introduce the
concept of scoring criterion in more detail.
Algorithm 5 GHC algorithm for learning BNs with tabu list and random restarts
input: Initial structure Ginit, dataset D, a scoring function φ and a stopping criteria C.
output: final structure Gres.
1: Gres = Ginit, G′ = Gres and TABU = {Gres}
2: while C not satisfied do
3: G′′ = arg max_{G ∈ neighbourhood(G′) \ TABU} φ(G)
4: if φ(G′) > φ(G′′) then
5: G′′ = random(G′) (random restart: apply random operations to G′)
6: end if
7: if φ(G′′) > φ(Gres) then
8: Gres = G′′
9: end if
10: TABU = TABU ∪ {G′}
11: G′ = G′′
12: end while
return Gres
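The following Python skeleton illustrates the same search loop in the spirit of Algorithm 5, with a simple tabu list (random restarts omitted for brevity). The score function maps an edge set to a real number and would in practice be one of the scoring functions of Subsection 2.2.3; all names are ours and illustrative.

def is_acyclic(n, edges):
    # Kahn's algorithm: the directed graph has no cycles iff all nodes can be removed.
    indeg = [0] * n
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        indeg[b] += 1
    stack = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for w in adj[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    return seen == n

def neighbours(n, edges):
    # All DAGs reachable by adding, deleting or flipping one edge.
    result = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (i, j) in edges:
                result.append(edges - {(i, j)})                 # delete
                result.append((edges - {(i, j)}) | {(j, i)})    # flip
            elif (j, i) not in edges:
                result.append(edges | {(i, j)})                 # add
    return [e for e in result if is_acyclic(n, e)]

def greedy_hill_climb(n, score, max_steps=100):
    current = frozenset()
    best, best_score = current, score(current)
    tabu = {current}
    for _ in range(max_steps):
        candidates = [g for g in neighbours(n, current) if frozenset(g) not in tabu]
        if not candidates:
            break
        nxt = frozenset(max(candidates, key=score))
        if score(nxt) <= score(current):
            break                                               # local maximum reached
        current = nxt
        tabu.add(current)
        if score(current) > best_score:
            best, best_score = current, score(current)
    return best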
2.2.3 Scoring Functions
A large variety of scoring functions has been proposed in the literature [5]. A scoring function φ : S × X → R measures how well a Bayesian network B fits the data D (where S denotes the search space). Score-based learning algorithms are efficient if the scoring criterion is decomposable, since in this case a local change in the neighborhood of a node Xi only changes the local score φi, for i ∈ {1, . . . , n}.
Definition 13 (Decomposable scoring function). A scoring function φ is decomposable if the score
assigned to each network decomposes over the network in such a way that it can be expressed as a
sum of local scores that depends only on each node and its parents, that is, scores of the following form:
φ(B, D) = ∑_{i=1}^{n} φi(ΠXi, D). (2.9)
Another important property of scoring functions is the score equivalence; we will define some pre-
liminary concepts in order to define this property.
Definition 14 (Partially directed acyclic graph). A partially directed acyclic graph is a graph that contains
both directed and undirected edges, with no directed cycle in its directed subgraph.
A partially directed acyclic graph can be viewed as a representative of an equivalence class of DAGs.
Definition 15 (Compelled edge). A directed edge X → Y is compelled in a directed acyclic graph G if for every directed acyclic graph G′ equivalent to G, X → Y exists in G′.
By Theorem 8 (page 5), any edge participating in a v-structure is compelled. If a directed edge is not compelled, we call it reversible, as there may exist another DAG in the same equivalence class with the reverse edge.
Definition 16 (Essential graph). An essential graph, denoting an equivalence class of directed acyclic
graphs, is the partially directed acyclic graph consisting of a directed edge for every compelled edge in
the equivalence class, and an undirected edge for every reversible edge in the equivalence class.
For tree-network structures, the essential graph corresponds to its skeleton.
Definition 17 (Score Equivalence). A scoring function φ is score equivalent if it assigns the same score
to all directed acyclic graphs that are represented by the same essential graph.
Scoring functions are divided into two classes: Bayesian and information-theoretical. We will focus on information-theoretical scoring functions: log-likelihood, minimum description length, complete minimum description length and normalized maximum likelihood. Information-theoretical scoring criteria are based on the compression achieved when describing a data set with an optimal code induced by the probability distribution encoded by a Bayesian network. The rationale is to choose the representation of the data that corresponds to the minimum description length. The idea is the following: the more we are able to compress a data set, the more regularities the data set has, and therefore the more we learn about the data.
Example 18. This example was adapted from [22]. Consider two sequences of binary data of 10000
bits each represented by:
0001000100010001000100010001...00010001000100010001,
0111010000100101011101110001...11101000101011101001.
The first sequence is the repetition of the pattern 0001 2500 times; therefore we can predict that future data will follow the same "law". The second sequence is random: there is no regularity underlying it. Therefore, the first sequence can be compressed: it can be described as "2500 repetitions of 0001" instead of describing the entire sequence, whereas the second sequence cannot be summarized.
We will introduce some basic concepts of coding, data compression and information theory that will
be important to understand this class of scoring functions.
Basic Coding and Data Compression Concepts
Let Y∗ denote the set of finite-length strings of symbols from a Y-ary alphabet.
Definition 19 (Code). Given a random variable X with range X and a set of finite-length strings of
symbols from a Y-ary alphabet, Y∗, a code C is a mapping:
C : X → Y∗. (2.10)
Let C(x) denote the codeword corresponding to x and let l(x) denote the length of C(x).
Definition 20 (Expected length of a code). The expected length L(C) of a code C(x) for a random
variable X with probability mass function Q is given by:
L(C) = ∑_{x ∈ 𝒳} Q(x) l(x), (2.11)
where l(x) is the length of the codeword associated with x.
By assigning short codewords to common outcomes and longer codewords to less frequent outcomes, it is possible to decrease the redundancy of the data and therefore compress it.
Example 21 (Huffman's Algorithm). This example was adapted from [23]. Let 𝒳 = {a, b, c} and P be the probability distribution on 𝒳 with P(a) = 1/2 and P(b) = P(c) = 1/4. Construct a code following Huffman's algorithm: first choose the two elements with the smallest probabilities, b and c, and connect them with leaves 0 and 1 (assigned arbitrarily), to form the intermediate node bc with node probability P(bc) = 1/2. The constructed code is depicted in Figure 2.5. The resulting code is a → 0, b → 10, c → 11, with codeword lengths l(a) = 1, l(b) = l(c) = 2 and expected length L(C) = (1/2) × 1 + (1/2) × 2 = 1.5.
Figure 2.5: Huffman code of Example 21.
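The standard Huffman construction used in Example 21 can be sketched in a few lines of Python (an illustrative sketch; the tie-breaking counter only makes the heap ordering well defined).

import heapq

def huffman_code(probabilities):
    # probabilities: dict symbol -> probability. Returns dict symbol -> codeword.
    heap = [(p, i, {symbol: ""}) for i, (symbol, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

p = {"a": 0.5, "b": 0.25, "c": 0.25}
code = huffman_code(p)
expected_length = sum(p[s] * len(w) for s, w in code.items())
print(code, expected_length)   # {'a': '0', 'b': '10', 'c': '11'} and 1.5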
Definition 22 (Prefix code). A code is called a prefix code or an instantaneous code if no codeword is a
prefix of any other codeword.
If no codeword is a prefix of any other codeword, we can decode instantaneously, in the sense that we do not need to observe future codewords in order to decode the current one; moreover, unique decodability is guaranteed.
Example 23. This example was adapted from [18]. The code constructed in Example 21 is a prefix code. For example, the string 0101110 is uniquely decoded to abca. Consider now the code C : {a, b, c, d} → {0, 1}∗, where C(a) = 01, C(b) = 11, C(c) = 00 and C(d) = 110. Given the string 110...0...011, if the number of zeros between the two occurrences of 11 is even, then the first codeword is decoded as b; if it is odd, then the first codeword is decoded as d. Therefore, to decode the first codeword we may need to observe an arbitrary number of future codewords.
Our goal is to define prefix codes with minimum expected length; however, assigning short codewords to all source symbols while keeping the code prefix-free is clearly infeasible. Consider the case described in Example 21: if a → 0, neither b nor c can be assigned the codeword 1 if we want to construct a prefix code. The following theorem expresses this relation. Denote the length of the codeword of xi by li = l(xi) and write Pi = P(xi).
Theorem 24 (Kraft Inequality [12]). For any prefix code over an alphabet of size Y, the codeword lengths
l1, . . . , lm must satisfy the following inequality:
∑_i Y^{−li} ≤ 1. (2.12)
Conversely, given a set of codeword lengths that satisfy this inequality, there exists a prefix code with
these word lengths.
Suppose elements of 𝒳 are generated according to a known probability distribution P. Shannon's Source Coding Theorem states that the expected code length defined in (2.11) is minimized when Q = P.
Theorem 25 (Shannon's Source Coding Theorem, from [12]). Suppose elements of 𝒳 are generated according to a probability distribution P. For any prefix code on 𝒳 with length function l, the expected code length L(C) is bounded below by H(P), the entropy of P. That is,
L(C) ≥ H(P) = −∑_{x ∈ 𝒳} P(x) log_Y P(x). (2.13)
The optimal code lengths l*_1, . . . , l*_m, which minimize the expected code length, satisfy:
l*_i = −log_Y Pi. (2.14)
However, l*_i as defined above is not necessarily an integer, and it is not possible to define codewords with non-integer lengths. Defining li = ⌈−log_Y Pi⌉ solves this problem, while still satisfying the Kraft inequality. Thus an optimal code Copt satisfies:
H(P) ≤ L(Copt) ≤ H(P) + 1. (2.15)
Shannon-Fano and Huffman codes are examples of optimal codes [12].
In the Bayesian network setting, consider HG, the set of hypotheses subsuming that the data D was generated by some Bayesian network with structure G. We will use an optimal code, defining the source set 𝒳 as the data D we want to model, with the probability function over D induced by a given hypothesis HG ∈ HG. We will use the description length of D as a measure to select the model.
Definition 26 (Description length). Given data D and a set of probability distributions HG encoded by a
Bayesian network, that may be used to describe D , the description length of D with HG ∈ HG is given
by:
L(D,HG) = L(D|HG) + L(HG), (2.16)
where L(D|HG) is the length of the description of D when encoded with HG and L(HG) is the length of
the description of HG.
By Shannon's Source Coding Theorem, using an optimal code, the length of the description of D when encoded with hypothesis HG is:
L(D | HG) = −LL(HG | D) = −log P_{HG}(D) = −∑_{i=1}^{n} ∑_{j=1}^{qi} ∑_{k=1}^{ri} Nijk log(θijk). (2.17)
Next, we will introduce the information theoretical scoring functions. What distinguishes these scoring
functions is how description length is defined.
Log-Likelihood Criterion
The Log-likelihood criterion assumes that the hypothesis HG is transmitted cost-free and it is enough to
choose HG that minimizes the maximum likelihood estimate, in this case:
L(D,HG) = L(D|HG) = −LL(HG|D). (2.18)
This criterion favors complete network structures, and does not generalize well, leading to the over-
fitting of the model to the training data.
Minimum Description Length Criterion
The minimum description length (MDL) criterion, proposed by Rissanen [35], imposes that the parameters of the hypothesis HG must also be transmitted. The length of these parameters is a form of penalized likelihood, the price one must pay for not knowing which hypothesis generated the data. The MDL criterion follows Occam's reasoning, selecting simple models. Hence, in this case we want to choose the HG that minimizes:
L(D,HG) = L(D|HG) + L(HG) = −LL(HG|D) + L(HG). (2.19)
The MDL principle can be viewed as a two-part coding scheme:
1. In a first stage, the parameters that minimize (2.19), ΘG, are estimated. Then, the parameters are
transmitted using a uniform encoder and a certain precision.
2. In a second stage, an optimal prefix code is constructed using the distribution indexed by ΘG and
the data set D is encoded using the induced code and sent to the receiver.
Now we will describe how to encode ΘG. First, suppose the parameters are integers. Elias [17] and Rissanen [36] constructed a universal code⁴ for integers, under which the description of an integer x takes
log*_2(x) = ∑_{j≥1} max(log_2^{(j)}(x), 0) + log_2 c_0 (2.20)
bits, where log_2^{(j)}(·) is the j-th composition of the binary logarithm and c_0 is given by
c_0 := ∑_{n≥1} 2^{−log*_2 n} ≈ 2.865064. (2.21)
⁴ A universal code for integers is a prefix code with the additional property that, whatever the true probability distribution on integers, as long as the distribution is monotonic, the expected lengths of the codewords are within a constant factor of the expected lengths that the optimal code for that probability distribution would have assigned.
However, the parameters of a Bayesian network are real numbers. In this case, a real number x should be represented by the integer x/δx, where δx is the precision of the representation. By approximating log* ≈ log, it is possible to compute the optimal precision δ*x = 1/√N. Considering the asymptotic case, with the number of independent samples N → ∞, the description length of x becomes
log*(x / δ*x) → (1/2) ln(N) (2.22)
bits, and we arrive at the following number of bits required to represent a Bayesian network B:
−ln(1/√N) |B| = (1/2) ln(N) |B|, (2.23)
where |B| corresponds to the number of parameters Θ of the network and is given by:
|B| = ∑_{i=1}^{n} (ri − 1) qi. (2.24)
An intuitive way to understand the defined optimal precision of 1/√N is that this value corresponds
to the maximum magnitude of the estimation error of the parameters ΘG, hence, there is no need to
encode the estimator with a greater precision.
The minimum description length criterion becomes:
MDL(B | D) = −LL(B | D) + (1/2) ln(N) |B|. (2.25)
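The score in (2.25) can be computed directly from the counts Nijk, as in the following Python sketch (natural logarithms are used throughout; the data layout and function names are ours and illustrative).

import math

def log_likelihood(counts):
    # counts[i][j][k] = N_ijk; theta_ijk is implicitly the OFE N_ijk / N_ij.
    ll = 0.0
    for node_counts in counts:
        for parent_counts in node_counts:
            n_ij = sum(parent_counts)
            for n_ijk in parent_counts:
                if n_ijk > 0:
                    ll += n_ijk * math.log(n_ijk / n_ij)
    return ll

def mdl_score(counts, num_samples):
    # MDL(B|D) = -LL(B|D) + (1/2) ln(N) |B|, with |B| = sum over nodes of (r_i - 1) q_i.
    num_params = sum(len(pc) - 1 for nc in counts for pc in nc)
    return -log_likelihood(counts) + 0.5 * math.log(num_samples) * num_params

# One binary node with no parents, observed 2 zeros and 3 ones (as in Example 27 below).
counts = [[[2, 3]]]
print(mdl_score(counts, 5))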
Example 27. This example is adapted from [23]. Consider that the sender wishes to transmit a binary string y = y1, . . . , yn to a receiver and uses a Bernoulli(θ) model to send the string. In a Bayesian network approach, this can be represented by a single node corresponding to a binary random variable. The binary string can be viewed as a set of n i.i.d. observations sampled from the distribution Bernoulli(θ). Let k be the number of 1's in the string. The parameter θ needs to be estimated first and then sent to the receiver. The maximum likelihood estimate is θ = k/n. This parameter takes (1/2) ln n nats to send. Then the sender encodes all the symbols in the string, which takes −log2(k/n) bits for a 1 and −log2(1 − k/n) bits for a 0. Therefore, transmitting the string requires an additional
−k log2(k/n) − (n − k) log2(1 − k/n)
bits. Consider the particular case where n = 5, y = (1, 1, 1, 0, 0) and the maximum likelihood parameter is θ = 3/5. The parameter takes (1/2) ln 5 ≈ 0.8047 nats, which corresponds approximately to 1.1609 bits, to communicate. Encoding the data set takes −log2((3/5)³(2/5)²) ≈ 4.8548 bits.
The minimum description length criterion is equivalent to the Bayesian scoring function called Bayesian information criterion (BIC) [39].
Complete Minimum Description Length Criterion
Rissanen observed that the MDL criterion is redundant and incomplete [38] in the sense that as the
parameters are sent beforehand to the receiver, the data has to be compatible with these parameters,
allowing to further compress it. He therefore proposed a new criterion called Complete Minimum De-
scription Length (CMDL), that solves this issue.
Example 28. Let us consider the Complete Minimum Description Length approach in the setting of Example 27. As the receiver knows the parameter θ = k/n, he knows the data set must contain exactly k 1's; therefore, if an enumeration of all the compatible data sets is defined, the length of encoding the data set is
log2 C(n, k)
bits, where C(n, k) denotes the binomial coefficient. Considering the particular case described in Example 27, the length for encoding k is log2 3 ≈ 1.5850 bits, and the length of encoding the data set is
log2 C(5, 3) ≈ 3.3219
bits, which is significantly smaller than the corresponding length in the MDL approach.
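The two data-encoding lengths of Examples 27 and 28 can be checked with a few lines of Python (an illustrative sketch of the comparison, not part of the thesis implementation).

from math import comb, log2

n, k = 5, 3
# MDL-style data term: each symbol coded with the optimal code induced by theta = k/n.
mdl_data_bits = -k * log2(k / n) - (n - k) * log2(1 - k / n)
# CMDL-style data term: a uniform code over the datasets compatible with theta = k/n.
cmdl_data_bits = log2(comb(n, k))
print(round(mdl_data_bits, 4), round(cmdl_data_bits, 4))   # ~4.8548 and ~3.3219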
Rissanen defines the Complete Minimum Description Length distribution as the one that minimizes the length of the code of D, given that the receiver already knows the parameters. Given a set of hypotheses HG and denoting by H_{HG}(D) the BN with parameters given by the OFEs in D, the CMDL distribution is given by:
P^{CMDL}_{HG}(D) = P_{H_{HG}(D)}(D) / ∑_{D′ : θ_{D′} = θ_D} P_{H_{HG}(D′)}(D′).
Since the data instances are assumed to be sampled from a multinomial distribution, two data sets D and D′ have the same parameters if and only if they are a permutation of each other and H_{HG}(D) = H_{HG}(D′). Hence we get P_{H_{HG}(D′)}(D′) = P_{H_{HG}(D)}(D), and the CMDL distribution simplifies to:
P^{CMDL}_{HG}(D) = 1 / |{D′ : θ_{D′} = θ_D}|.
The length of the optimal code induced by the CMDL distribution is given by:
CMDL(G | D) = −log(P^{CMDL}_{HG}(D)) + L(Θ_D) = log(|{D′ : θ_{D′} = θ_D}|) + L(Θ_D). (2.26)
The problem of computing CMDL(G | D) is therefore reduced to the problem of counting how many datasets induce the same OFE and sending these parameters to the receiver. We will start by deriving the cardinality of the set {D′ : θ_{D′} = θ_D}.
Counting the number of data sets compatible with the OFEs
The number of data sets compatible with the OFEs has an analytical solution for forest BNs; for general structures, a non-trivial solution is proposed. We will start by formally defining forest graphs.
Definition 29 (Forest). A forest is a disjoint union of trees.
Given expression (2.7) for the OFE parameters {Θijk}_{i,j,k} of a given BN, one observes that two datasets induce the same OFE parameters if and only if they induce the same family of counts N = {Nijk}_{i,j,k}, and we have the following result for forest BNs.
Theorem 30 ([4]). Let D be a dataset of size N, B a forest BN, and N = {Nijk}_{i,j,k} the family of counts for each parent-child pair in B induced by D. The number of datasets of size N that induce the same family of counts N for B is:
∏_{i=1}^{n} ∏_{j=1}^{qi} (Nij choose Nij1, . . . , Nijri) = ∏_{i=1}^{n} ∏_{j=1}^{qi} Nij! / (Nij1! · · · Nijri!). (2.27)
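Expression (2.27) is a simple product of multinomial coefficients, as sketched below in Python (names and data layout are ours; the example uses the counts of Figure 2.6(c), for which, since the structure is not a forest, the value 16 is only the upper bound discussed in Example 31).

from math import factorial
from functools import reduce

def multinomial(counts):
    # (sum(counts))! / (counts_1! ... counts_r!)
    return factorial(sum(counts)) // reduce(lambda acc, c: acc * factorial(c), counts, 1)

def compatible_datasets_forest(counts):
    # counts[i][j] is the list (N_ij1, ..., N_ijri) for node i and parent configuration j.
    total = 1
    for node in counts:
        for parent_cfg in node:
            total *= multinomial(parent_cfg)
    return total

counts = [[[3, 1]], [[1, 3]], [[1, 0], [0, 2], [0, 0], [0, 1]]]
print(compatible_datasets_forest(counts))   # 16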
Denote the family of OFE multinomials {Mult(Nij, θij1, . . . , θijri)}_{i,j} by {Mij}_{i,j}. Expression (2.27) only holds for forest BNs. Next we give a counterexample to prove this fact, which considers a v-structure BN where the multinomials M11 and M21 are not pairwise independent, so that (2.27) is only an upper bound for the number of data sets compatible with the OFEs.
(a) A v-structure network with edges X1 → X3 ← X2.
(b) Dataset:
X1 X2 X3
0 0 0
0 1 1
0 1 1
1 1 1
(c) Counts:
X1: N111 = 3, N112 = 1, N11 = 4
X2: N211 = 1, N212 = 3, N21 = 4
X3: N311 = 1, N312 = 0, N31 = 1; N321 = 0, N322 = 2, N32 = 2; N331 = 0, N332 = 0, N33 = 0; N341 = 0, N342 = 1, N34 = 1
Figure 2.6: Network structures and datasets for Example 31.
Example 31. Consider the BN depicted in Figure 2.6(a), where all Xi, with i ∈ {1, 2, 3}, are binary random variables; the data set is represented in Figure 2.6(b) and the counts in Figure 2.6(c). According to Theorem 30, the number of compatible datasets is:
∏_{i=1}^{n} ∏_{j=1}^{qi} Nij! / (Nij1! · · · Nijri!) = (4!/(3!1!)) (4!/(1!3!)) (1!/(1!0!)) (2!/(2!0!)) (0!/(0!0!)) (1!/(0!1!)) = 4 × 4 × 1 × 1 × 1 × 1 = 16.
However, we can deduce the counts of X1 and X2 from the counts of X3. Moreover, the counts of X1 and X2 are not independent. Since N33 = 0 and N34 = 1, we know that X1 takes the value 1 exactly once; furthermore, from the same counts we know that when X1 takes the value 1, X2 always takes the same value. Therefore, the true number of compatible data sets is given by:
(N choose N31, N32, N33, N34) = (4 choose 1, 2, 0, 1) = 4!/(1!2!0!1!) = 12.
Hence, expression (2.27) is only an upper bound for the number of compatible data sets, which is attained only when the multinomial distributions are independent.
We will reduce the problem of determining the number of compatible data sets for a general network to the simpler problem for forests, by constructing a quotient over the set of nodes that gives rise to a forest in the resulting quotient graph.
Now we introduce some notation. We represent a directed graph by G = (V, E), where V = {1, . . . , n} are the nodes and E = {(i, j) : i, j ∈ V} ⊆ V² are the edges. We denote the edge (i, j) ∈ E by i →G j. When node j is reachable in zero or more steps from node i we write i →*_G j. If i and j are reachable from each other we write i ↔*_G j.
Definition 32 (Strongly Connected Components). The strongly connected components (SCC) of a graph form the partition {V1, . . . , Vm} of the nodes V such that:
1. Vl ∩ Vk = ∅ for l ≠ k.
2. V = V1 ∪ · · · ∪ Vm.
3. i ↔*_G j for all Vl and all i, j ∈ Vl.
4. It is the coarsest partition fulfilling conditions 1, 2 and 3.
Given an arbitrary directed graph, Tarjan's algorithm computes the SCCs in time O(|V|²) [42]. Tarjan's algorithm works as follows: we perform a depth-first search over the graph, such that each node is visited exactly once; nodes are placed on a stack in the order they are visited. A node v and its descendants are popped from the stack if and only if there is no path in the graph from any of these nodes to some node earlier on the stack; in this case an SCC with root v and all the nodes above it on the stack is determined. If such a path exists, node v remains on the stack. S denotes the stack of nodes that were discovered but do not yet belong to an SCC. Tarjan's algorithm is represented in Algorithm 6. A summary of the functions and variables used in the algorithm follows:
• v.index: order in which node v was discovered.
• v.lowlink: smallest index of any node known to be reachable from node v.
• strongconnect(v): function that performs a single depth-first search from node v, visiting all its successors, and determines the strongly connected components of that subgraph.
• v.onStack: flag indicating whether node v is currently on the stack.
Figure 2.7: Network structure for Example 33.
Example 33. In the graph represented in Figure 2.7, the strongly connected components are {V1, V2}, where V1 = {1, 2, 3} and V2 = {4}.
Algorithm 6 Tarjan’s Algorithm
input: Graph G = (V, E).
output: Set of strongly connected components.
1: index = 0
2: S = empty stack
3: for each v ∈ V do
4: if v.index is undefined then
5: strongconnect(v)
6: end if
7: end for
8: function strongconnect(v)
9: v.index = index, v.lowlink = index, index = index + 1
10: push v onto S and set v.onStack = true
11: for each (v, w) ∈ E do
12: if w.index is undefined then
13: strongconnect(w)
14: v.lowlink = min(v.lowlink, w.lowlink)
15: else if w.onStack then
16: v.lowlink = min(v.lowlink, w.index)
17: end if
18: end for
19: if v.lowlink = v.index then
20: start a new SCC
21: repeat
22: w = S.pop(), w.onStack = false, add w to the current SCC
23: until w = v
24: output the current SCC
25: end if
26: end function
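A compact recursive rendering of the same procedure in Python is given below (illustrative; since the exact edges of Figure 2.7 are not reproduced here, the example graph at the end is only assumed to have a cycle through nodes 1, 2, 3 and a separate node 4).

def tarjan_scc(graph):
    # graph: adjacency dict {node: [successors]}. Returns a list of SCCs.
    index_counter = [0]
    index, lowlink, on_stack = {}, {}, {}
    stack, components = [], []

    def strongconnect(v):
        index[v] = lowlink[v] = index_counter[0]
        index_counter[0] += 1
        stack.append(v)
        on_stack[v] = True
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif on_stack.get(w, False):
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack[w] = False
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return components

print(tarjan_scc({1: [2], 2: [3], 3: [1, 4], 4: []}))   # [[4], [3, 2, 1]]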
Definition 34 (Quotient Graph). Given a graph G = (V, E) and an equivalence relation R ⊆ V², the quotient graph is the graph G/R = (V/R, E/R), where V/R is the set of equivalence classes induced by R, and [i]R →_{G/R} [j]R whenever k →G l for some k ∈ [i]R and l ∈ [j]R, with [i]R ≠ [j]R.
Let Δ_V denote the diagonal relation of V, Δ_V = {(i, i) : i ∈ V}; trivially G/Δ_V ≃ G. We now focus on a particular equivalence relation, which we call forestification and denote by ∼, such that given an acyclic graph G, the quotient graph G/∼ is a forest.
Definition 35 (Forestification). Let G be an acyclic graph. The forestification relation ∼ for G is the finest equivalence relation such that: i) i ∼ j whenever there exist k and l such that k ∼ l and there are edges [i] →_{G/∼} [k] and [j] →_{G/∼} [l]; ii) G/∼ is acyclic.
The forestification will be computed as a fixed point of an operator. Consider the operator Φ_G : 2^{V²} → 2^{V²} defined by:
Φ_G(R) = Λ_G(Ω_G(R)),
such that
Ω_G(R) = R ∪ {(j, j′) : (i, i′) ∈ R, j ∈ Π^G_i, j′ ∈ Π^G_{i′}}
merges nodes that have children in the same equivalence class, so that in G/R there are no nodes with more than one parent, and
Λ_G(R) = {(i, j) : i ↔*_{G ∪ G_R} j},
where G_R = (V, R). Λ_G guarantees that the cycles that may be formed by Ω_G(R) belong to the same equivalence class.
[Figure 2.8, three panels: (a) initial network structure; (b) resulting network after applying ΩG; (c) resulting network after applying ΛG.]
Figure 2.8: Given the initial structure represented in (a), the operator ΩG merges nodes X1 and X2, as they are both parents of the node X3, and the supernode X[1] = {X1, X2} is created. As the edges (X[1], X3), (X3, X4), (X4, X[1]) form a cycle, the operator ΛG merges them into the supernode X[1] = {X1, X2, X3, X4}. The resulting graph is a tree.
Theorem 36 ([4]). Let R be an equivalence relation over V. Then G/R is a forest iff R is a fixed point of Φ_G. Moreover, the forestification relation ∼ of G is the least fixed point (lfp) of Φ_G, and we have
∼ = Φ_G^{|V|}(∅).
The forestification can thus be computed as the least fixed point of Φ_G. The forestification algorithm is represented in Algorithm 7 and takes O(|V|³) time [4]. Let G′ = (V′, E′) represent the quotient of the graph G = (V, E), where V′ represents the set of equivalence classes of V. As Φ_G(∅) = Δ_V = {(i, i) : i ∈ V}, V′ is initialized as the trivial partition V′ = {{i} : i ∈ V} and E′ is initialized as E′ = {(i, j) : (i, j) ∈ E}. The algorithm applies Φ_G to V′ until V′ is a fixed point of Φ_G, and so V′ = lfp(Φ_G).
Algorithm 7 Algorithm to compute the forestification relation
input: Graph G = (V, E).
output: The forestification relation ∼ of G.
V′ = {{i} : i ∈ V}, E′ = {(i, j) : (i, j) ∈ E} and G′ = (V′, E′)
flag = false
while flag = false do
E″ = E′
for all i ∈ V′ with parent set {j1, . . . , jk} = Π^{G′}_i, ordered so that jl < jl+1, do
E″ = E″ ∪ ⋃_{l=1}^{k−1} {(jl, jl+1), (jl+1, jl)}
end for
∼ = partition into SCCs computed by Tarjan(V′, E″)
if ∼ = Δ_{V′} then
flag = true
else
G′ = (V′, E′)/∼
end if
end while
Observe that two datasets D and D′ that induce the same counts N = {Nijk}_{i,j,k} for a graph G may induce different counts when nodes are aggregated according to ∼. To illustrate this fact we consider two intertwined v-structures, forming the w-structure represented in Figure 2.9.
Example 37. Consider the BN with network structure G (Figure 2.9(a)). Its forestification G/∼ is depicted in Figure 2.9(b), where X[1] ≡ {X1, X2, X3}. Moreover, consider two datasets, D (Figure 2.9(c)) and D′ (Figure 2.9(d)), drawn from binary random variables.
(a) Network G. (b) Network G/∼.
(c) Dataset D:
X1 X2 X3 X4 X5
1 0 1 0 1
0 0 1 0 1
1 0 0 1 1
(d) Dataset D′:
X1 X2 X3 X4 X5
0 0 0 0 1
1 0 1 0 1
1 0 1 1 1
Figure 2.9: Network structures and datasets for Example 37.
We aim to illustrate that D and D′ induce the same counts for G, which does not happen with G/∼. Indeed, for G we have ri = 2 for all i ∈ {1, . . . , 5}, q1 = q2 = q3 = 1 and q4 = q5 = 4. Let xi1 = 0 and xi2 = 1 for all i ∈ {1, . . . , 5}, w11 = w21 = w31 = ε, where ε is the empty parent configuration, and w41 = w51 = 00, w42 = w52 = 01, w43 = w53 = 10, and w44 = w54 = 11. For both datasets D and D′ the
counts induced by G are given by:
X1: N111 = 1, N112 = 2, N11 = 3
X2: N211 = 3, N212 = 0, N21 = 3
X3: N311 = 1, N312 = 2, N31 = 3
X4: N411 = 1, N412 = 0, N41 = 1; N421 = 0, N422 = 0, N42 = 0; N431 = 1, N432 = 1, N43 = 2; N441 = 0, N442 = 0, N44 = 0
X5: N511 = 0, N512 = 1, N51 = 1; N521 = 0, N522 = 2, N52 = 2; N531 = 0, N532 = 0, N53 = 0; N541 = 0, N542 = 0, N54 = 0
Observe, however, that for G/∼ the datasets D and D′ induce different counts. As X[1] corresponds to the equivalence class {X1, X2, X3} in G/∼, the only possible configuration for its parents is the empty one, and so r[1] = 8 and q[1] = 1. In this case, x[1]1 = 000, x[1]2 = 001, x[1]3 = 010, x[1]4 = 011, x[1]5 = 100, x[1]6 = 101, x[1]7 = 110, x[1]8 = 111, and w[1]1 = ε. Concerning X4 and X5, r4 = r5 = 2, q4 = q5 = 8, with xi1 = 0, xi2 = 1 and wij = x[1]j, for i = 4, 5 and j = 1, . . . , 8.
Having set up the values of the nodes and the parents' configurations, the counts induced by
G/ ∼ for D are given by:
X[1]: M[1]11 = 0, M[1]12 = 1, M[1]13 = 0, M[1]14 = 0, M[1]15 = 1, M[1]16 = 1, M[1]17 = 0, M[1]18 = 0
X4: N411 = 0, N421 = 1, N431 = 0, N441 = 0, N451 = 0, N461 = 1, N471 = 0, N481 = 0; N412 = 0, N422 = 0, N432 = 0, N442 = 0, N452 = 1, N462 = 0, N472 = 0, N482 = 0
X5: N511 = 0, N521 = 0, N531 = 0, N541 = 0, N551 = 0, N561 = 0, N571 = 0, N581 = 0; N512 = 0, N522 = 1, N532 = 0, N542 = 0, N552 = 1, N562 = 1, N572 = 0, N582 = 0
whereas, the counts induced by G/ ∼ for D′ are given by:
X[1]: M[1]11 = 1, M[1]12 = 0, M[1]13 = 0, M[1]14 = 0, M[1]15 = 0, M[1]16 = 2, M[1]17 = 0, M[1]18 = 0
X4: N411 = 1, N421 = 0, N431 = 0, N441 = 0, N451 = 0, N461 = 1, N471 = 0, N481 = 0; N412 = 0, N422 = 0, N432 = 0, N442 = 0, N452 = 0, N462 = 1, N472 = 0, N482 = 0
X5: N511 = 0, N521 = 0, N531 = 0, N541 = 0, N551 = 0, N561 = 0, N571 = 0, N581 = 0; N512 = 1, N522 = 0, N532 = 0, N542 = 0, N552 = 0, N562 = 2, N572 = 0, N582 = 0
and so several of the counts differ between D and D′.
Definition 38 (Compatible counts in the quotient graph). A count M = {M[i][j][k]}_{[i],[j],[k]} for the quotient graph G/∼ is said to be compatible with a count N = {Nijk}_{i,j,k} for the BN with graph G and data D, which we denote by M ↓ N, if there is a dataset D′ of the same size as D such that the counts for the structure G of D′ coincide with N and, moreover, M is the count for the structure G/∼ of D′.
As illustrated in the previous example, there are several possible counts for the quotient graph G/ ∼
that are compatible with N . Therefore, we can deduce a generalization of Theorem 30, page 18, for a
general BN.
Theorem 39 ([4]). Let D be a dataset of size N, B a BN and N the family of counts for each parent-child pair induced by B on D. The number of datasets of size N that induce the same counts for B is
∑_{M ↓ N} ∏_{[i]=1}^{[n]} ∏_{[j]=1}^{q[i]} (M[i][j] choose M[i][j]1, . . . , M[i][j]r[i]). (2.28)
However, there is no analytical expression for the number of compatible counts for the quotient graph. Therefore, we restrict ourselves to structures for which there is only one compatible count for the quotient graph, which we call covering graphs. Given a graph G, consider the covering CG = {{Xi1, . . . , Xik, Xj} : Πj = {Xi1, . . . , Xik}}. In Figure 2.10(a) a graph that is not covering is represented; in Figure 2.10(b) a covering graph is represented.
Definition 40 (Covering graph). A graph G is said to be covering if for all Xi ∈ X there is a C ∈ CG such
that [Πi]∼ ∪ [Xi]∼ ⊆ C, where [Πi]∼ is either the empty set, if [Xi]∼ has no parents in G/ ∼, or [Πi]∼ is
the parent of [Xi]∼ in G/ ∼.
[Figure 2.10: two graphs over the nodes X1, X2, X3, X4, labelled (a) and (b).]
Figure 2.10: The graph represented in (a) is not covering, since [Π4]∼ = {X1, X2} and the set {X4, X2, X1} does not belong to the covering CG = {{X3, X1, X2}, {X4, X2}, {X1}, {X2}}. However, the graph represented in (b) is covering.
Theorem 41. Let G be a covering graph. Then there is only one count M for G/∼ that is compatible with N.
Thus, for the case of covering graphs, expression (2.28) simplifies to:
$$\prod_{[i]=1}^{[n]} \prod_{[j]=1}^{q_{[i]}} \binom{M_{[i][j]}}{M_{[i][j]1} \cdots M_{[i][j]r_{[i]}}}.$$
Sending the OFEs
We will consider that the description length of the parameters is given by the asymptotic approximation derived for MDL [35]:
$$L(\Theta) = \frac{1}{2}\,\ln(N)\,|B|,$$
where |B| was defined in (2.24), page 16.
Thus, for the case of covering graphs, we are able to compute the CMDL as
$$\mathrm{CMDL}(G \mid D) = \mathrm{CMDL}(G/\!\sim \;\mid D) = \sum_{[i]=1}^{m} \sum_{[j]=1}^{q_{[i]}} \log \binom{M_{[i][j]}}{M_{[i][j]1} \cdots M_{[i][j]r_{[i]}}} + \frac{1}{2}\,\ln(N)\,|B|.$$
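As an illustration, the following is a minimal Java sketch of how the CMDL score above could be evaluated for a covering graph, assuming the quotient-graph counts M[i][j][k] and the number of parameters |B| have already been computed; natural logarithms are used in both terms for simplicity, and all names are illustrative rather than taken from the released implementation.

// Minimal sketch of the CMDL score for a covering graph. Log-multinomial coefficients
// are computed through log-factorials; assumes counts and |B| are supplied.
public final class Cmdl {

    private static double logFactorial(int n) {
        double s = 0.0;
        for (int v = 2; v <= n; v++) s += Math.log(v);
        return s;
    }

    /** log of the multinomial coefficient (M choose M_1 ... M_r), natural log. */
    private static double logMultinomial(int[] cellCounts) {
        int total = 0;
        double s = 0.0;
        for (int c : cellCounts) { total += c; s -= logFactorial(c); }
        return s + logFactorial(total);
    }

    /** CMDL(G | D) = sum over [i],[j] of log-multinomial of M_[i][j]k  +  (1/2) ln(N) |B|. */
    public static double score(int[][][] quotientCounts, int numParameters, int sampleSize) {
        double dataPart = 0.0;
        for (int[][] node : quotientCounts)       // equivalence classes [i]
            for (int[] parentConfig : node)       // parent configurations [j]
                dataPart += logMultinomial(parentConfig);
        return dataPart + 0.5 * Math.log(sampleSize) * numParameters;
    }
}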
Normalized Maximum Likelihood Criterion
The Normalized Maximum Likelihood criterion (NML), proposed by Rissanen [38], defines a different
description length for encoding the hypothesis HG. Instead of using a universal encoder, NML uses an
approach related to Rissanen's stochastic complexity.
Let HG be a given set of probability distributions and suppose the sender believes there is an HG ∈ HG that assigns a high likelihood to a given data set D; we call it the best-fitting hypothesis. Given a
hypothesis HG′, which does not necessarily belong to HG, the regret of HG′ relative to HG is given by:
$$-\log P(D \mid H_{G'}) - \min_{H_G \in \mathcal{H}_G} \bigl(-\log P(D \mid H_G)\bigr), \qquad (2.29)$$
and corresponds to the extra bits spent when the data set D is encoded with HG′, compared to the best
hypothesis in HG. The worst-case regret, relative to data of fixed size N, is defined as:
$$\max_{D : |D| = N} \bigl(-\log P(D \mid H_{G'}) + \log P(D \mid H_G)\bigr). \qquad (2.30)$$
The goal is to find a hypothesis HG′ that minimizes the worst-case regret. The solution to this minimax
problem is the normalized maximum likelihood distribution, which induces codes with the following length:
$$L(D, H_G) = -\mathrm{LL}(H_G \mid D) + \mathcal{C}_D(H_G), \qquad (2.31)$$
where CD(HG) is the parametric complexity of HG for data D. The parametric complexity is in general
not computable; however, a linear-time algorithm was proposed to compute the parametric complexity of a
single multinomial variable [27].
Since, in the case of a Bayesian network, the distribution of Xi | ΠXi, for i ∈ {1, . . . , n}, is
a multinomial distribution with ri states and Nij observations, it is possible to decompose the parametric
complexity into:
$$\mathcal{C}_D(H) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \mathcal{C}^{r_i}_{N_{ij}},$$
where C^{ri}_{Nij} is the parametric complexity associated with data of size Nij generated by a multinomial with
ri states. This gives rise to the factorized normalized maximum likelihood (fNML) scoring function, which
is given by:
$$\mathrm{fNML}(B \mid D) = -\mathrm{LL}(B \mid D) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \mathcal{C}^{r_i}_{N_{ij}}.$$
The parametric complexity can be computed recursively using Algorithm 8, presented next.

Algorithm 8 Compute C^r_m
input: Natural numbers r, m.
output: C^r_m
1: C^1_m = 1
2: C^2_m = ∑_{h=0}^{m} (m choose h) (h/m)^h ((m−h)/m)^(m−h)
3: for k = 3, . . . , r do
4:   C^k_m = C^{k−1}_m + (m/(k−2)) C^{k−2}_m
5: end for
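A minimal Java sketch of this computation is given below, assuming the recurrence C^r_m = C^{r−1}_m + (m/(r−2)) C^{r−2}_m from [27] and using double-precision arithmetic; the names are illustrative and this is not the thesis code.

// Minimal sketch of Algorithm 8: multinomial parametric complexity C^r_m.
public final class ParametricComplexity {

    /** Returns C^r_m for a multinomial with r states and m observations. */
    public static double compute(int r, int m) {
        if (m == 0) return 1.0;
        double c1 = 1.0;                                   // C^1_m
        double c2 = 0.0;                                   // C^2_m
        for (int h = 0; h <= m; h++) {
            double term = binomial(m, h);
            if (h > 0)     term *= Math.pow((double) h / m, h);
            if (m - h > 0) term *= Math.pow((double) (m - h) / m, m - h);
            c2 += term;
        }
        if (r == 1) return c1;
        double prev = c1, curr = c2;                       // C^(k-2)_m and C^(k-1)_m
        for (int k = 3; k <= r; k++) {
            double next = curr + ((double) m / (k - 2)) * prev;
            prev = curr;
            curr = next;
        }
        return curr;
    }

    private static double binomial(int n, int k) {
        double res = 1.0;
        for (int i = 1; i <= k; i++) res = res * (n - k + i) / i;
        return res;
    }
}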
Chapter 3
Dynamic Bayesian Networks
3.1 Basic Concepts
Dynamic Bayesian networks (DBN) model the stochastic evolution of a set of random variables over time
[32]. Consider the discretization of time into time slices T = {0, . . . , T}. Let X[t] = (X1[t], . . . , Xn[t]) be a
random vector that denotes the value of the set of attributes at time t. Furthermore, let X[t1 : t2] denote
the set of random variables X for the interval t1 ≤ t ≤ t2. Consider a set of individuals H measured
over T sequential instants of time. The set of observations is represented as {xh[t]}h∈H, t∈T, where
xh[t] = (xh1, . . . , xhn) ∈ Rn is a single observation of n attributes, measured at time t and referring to
individual h.
In DBNs our goal is to define a joint probability distribution over all possible trajectories, i.e., over the possible
values of each attribute Xi at each instant t, Xi[t]. Let P(X[t1 : t2]) denote the joint probability distribution
over the trajectory of the process from X[t1] to X[t2]. The space of possible trajectories, however, is very
large, and so it is necessary to make simplifying assumptions in order to obtain a tractable problem.
Observations are viewed as i.i.d. samples of a sequence of probability distributions {Pθ[t]}t∈T. For
all individuals h ∈ H and a fixed time t, the probability distribution is considered constant, i.e., xh[t] ∼
Pθ[t], for all h ∈ H. Using the chain rule, the joint probability over X is given by:
$$P(X[0:T]) = P(X[0]) \prod_{t=0}^{T-1} P\bigl(X[t+1] \mid X[0:t]\bigr).$$
A common assumption is to consider that the attributes in time slice t + 1 only depend on those in time
slice t, for t ∈ {0, . . . , T − 1}.
Definition 42 (mth-order Markov Assumption). A stochastic process over X satisfies the mth-order
Markov assumption if, for all t ≥ 0
$$P\bigl(X[t+1] \mid X[0] \cup \cdots \cup X[t]\bigr) = P\bigl(X[t+1] \mid X[t-m+1] \cup \cdots \cup X[t]\bigr).$$
In this case m is called the Markov lag of the process.
A simplistic approach is to assume that the process is stationary; this may hold in some particular cases,
but in most it does not. Nevertheless, when the number of instances in the training data is small, it is common
to assume stationarity. Another option is to consider the process piece-wise stationary.
Definition 43 (Stationary stochastic process). A stochastic process is stationary (also called time-invariant
or homogeneous) if P(X[t+1] | X[t]) is the same for all time slices t ∈ {0, . . . , T − 1}.
Considering the first-order Markov assumption, we can encode the joint probability in a compact way, by
defining an initial distribution P(X[0]) and the transition distributions P(X[t+1] | X[t]), for all t ∈ {0, . . . , T − 1},
so that P(X[0:T]) = P(X[0]) ∏_{t=0}^{T−1} P(X[t+1] | X[t]).
Definition 44 (First-order dynamic Bayesian network). A non-stationary first-order Markov DBN consists of:
• A prior network B0, which specifies a distribution over the initial states X[0].
• A set of transition networks B_t^{t+1} over the variables X[t] ∪ X[t+1], representing the state transition
probabilities, for 0 ≤ t ≤ T − 1.
The transition network has the additional constraint that edges between slices must flow forward in
time. A stationary network contains only one prior network and one transition network. A first-order
Markov DBN has a prior network and a transition network for each transition of time t→ t+ 1. Observe
that a transition network encodes the inter-slice dependencies (from time transitions t → t + 1) and
intra-slice dependencies (in time slice t+ 1). Figure 3.1 represents a DBN.
3.2 Learning Dynamic Bayesian Networks
Learning dynamic Bayesian networks, considering no hidden variables or missing values, i.e., consider-
ing a fully observable process, reduces simply to applying the methods described for Bayesian networks
in Section 2.2 for each transition of time [20]. Not taking into account the acyclicity constraints, it was
proved that learning Bayesian networks does not have to be NP -hard [15]. This result can be applied
to DBNs, as the resulting “unrolled” graph, that contains a “copy” of each attribute in each time step, is
acyclic. And, on the other hand, it was also derived in the same paper a time complexity bound in the
number of random variables for the MDL and the Bayesian Dirichlet Equivalence scores. More recently
27
[Figure 3.1: a prior network over X1[0], X2[0], X3[0] (panel a) and a transition network over X1, X2, X3 at times 0 and 1 (panel b).]
Figure 3.1: An example of a DBN B. On the left, the prior network B0 is depicted and, on the right, the transition network B_0^1 is represented. The edges E1 = {(X1[0], X1[1]), (X2[0], X2[1])} are the inter-slice connections and the edge E2 = {(X2[1], X3[1])} represents the intra-slice connection.
a polynomial-time algorithm for learning optimal DBNs was proposed, using mutual information tests
(MIT) [46]. Software for learning DBNs that does not consider the intra-slice networks has also been proposed [30].
In addition, a polynomial-time algorithm was proposed that learns both the inter-slice and intra-slice connections in a
transition network; the resulting network is denoted by tDBN [28]. However, the search space for the
intra-slice networks is restricted to tree-augmented networks, i.e., acyclic networks such that each attribute
has at most one parent from the same time slice, but can have a finite number p of parents from
the previous time slices. The letter t in the tDBN notation reflects the search space considered.
We will now describe this algorithm for a first-order DBN. Denote by P≤p(X[t]) the set of subsets of X[t] with
cardinality less than or equal to p. For each Xi[t+1] ∈ X[t+1], the optimal set of parents ΠXi[t] ∈ P≤p(X[t])
from the previous time slice yields the following score:
$$s_i = \max_{\Pi_{X_i}[t] \in \mathcal{P}_{\leq p}(X[t])} \phi_i\bigl(\Pi_{X_i}[t], D_t^{t+1}\bigr),$$
where φi is the local score of attribute Xi and D_t^{t+1} is the subset of observations for the time transition
t → t+1. Then, allowing at most one parent Xj[t+1] from the current time slice, the maximal score is
defined as:
$$s_{ij} = \max_{\Pi_{X_i}[t] \in \mathcal{P}_{\leq p}(X[t])} \phi_i\bigl(\Pi_{X_i}[t] \cup \{X_j[t+1]\}, D_t^{t+1}\bigr). \qquad (3.1)$$
For each transition t → t+1, a complete directed graph over X[t+1] is built, the optimal set of parents for
all nodes is determined, and the maximal branching is computed using Algorithm 1. The tDBN algorithm
has a worst-case complexity of O(n^{p+3} r^{p+2} N), where r is the maximum number of discrete states a
variable can take [28]; the procedure is summarized in Algorithm 9.
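The following is a minimal Java sketch of the parent-set search behind si and sij in Equation (3.1), assuming a decomposable local score is supplied as a function; the weight sij is obtained by appending the candidate intra-slice parent Xj[t+1] to each enumerated set before scoring. The interface and names are illustrative and do not correspond to the tDBN implementation.

// Minimal sketch: enumerate parent sets of size at most p and keep the best local score.
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public final class ParentSearch {

    /** Enumerates all subsets of 'candidates' (indices into the previous slice) with at most p elements. */
    static List<List<Integer>> subsetsUpTo(List<Integer> candidates, int p) {
        List<List<Integer>> result = new ArrayList<>();
        result.add(new ArrayList<>());                       // empty parent set
        for (int c : candidates) {
            int size = result.size();
            for (int s = 0; s < size; s++) {
                List<Integer> extended = new ArrayList<>(result.get(s));
                if (extended.size() < p) {
                    extended.add(c);
                    result.add(extended);
                }
            }
        }
        return result;
    }

    /** s_i = max over parent sets from the previous slice of phi(i, parents). */
    static double bestScore(int i, List<Integer> previousSlice, int p,
                            BiFunction<Integer, List<Integer>, Double> phi) {
        double best = Double.NEGATIVE_INFINITY;
        for (List<Integer> parents : subsetsUpTo(previousSlice, p))
            best = Math.max(best, phi.apply(i, parents));
        return best;
    }
}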
Algorithm 9 Optimal non-stationary first-order Bayesian tDBN structure learning
input: Set of attributes X, dataset D and a decomposable scoring function φ.
output: A tree-augmented DBN structure.
1: for each transition t→ t+ 1 do
2: Build a complete directed graph in X[t+ 1].
3: Calculate the weight of all edges and the optimal set of parents of all nodes.
4: Apply a maximum branching algorithm.
5: Extract transition t→ t+ 1 network using the maximum branching and the optimal set of parents.
6: end for
7: Collect transition networks to obtain the DBN structure.
Chapter 4
Proposed Method
We propose a polynomial-time algorithm, in the number of attributes, for learning consistent k-graph
dynamic Bayesian networks, denoted by cDBN. It was proved that the class of consistent k-graphs is
exponentially larger, in the number of variables, than the class of tree-network structures [6]. The
algorithm for learning cDBN structures starts by deriving the optimal branching of the input data, and
then determines, for each attribute, the optimal set of parents with cardinality less than or equal to k that is
consistent with the order induced by the optimal branching [6]. Recall the definition given in Subsection 2.2.4
for consistent k-graphs and the CkG learning algorithm represented in Algorithm 4, page 10.
A polynomial-time algorithm for learning optimal tree-augmented dynamic Bayesian networks was
proposed in [28]. Considering a first-order Markov DBN, for each time transition t → t + 1 the algorithm
outputs the maximum branching for the intra-slice connections in time step t + 1 and the optimal set of
parents, with maximum cardinality p, from the previous time step t. Recall the algorithm for learning
optimal non-stationary first-order Markov tree-augmented networks depicted in Section 3.2, Algorithm 9, page 29.
The proposed algorithm increases exponentially the search space of the intra-slice connections for
each transition network, by applying the CkG learning algorithm. We start by giving a formal definition
of a consistent k-graph dynamic Bayesian network.
Definition 45 (Consistent k-graph dynamic Bayesian network). A dynamic Bayesian network is called a
consistent k-graph DBN, denoted by cDBN, if for each intra-slice transition network Gt+1, t ∈ {0, . . . , T −
1}, the following holds: i) Gt+1 is a k-graph, i.e., each node has in-degree at most k; ii) given the optimal
branching R over the set of nodes X[t+1], for every edge in Gt+1 from Xi[t+1] to Xj[t+1], the node
Xi[t+1] is in the path from the root of R to Xj[t+1].
Theorem 46. Algorithm 10 finds an optimal mth-order cDBN, given a decomposable scoring function φ,
a set of n random variables, a maximum number p of parents from the previous m time steps and a
bounded in-degree k in each intra-slice network.
Proof. Let B be the optimal cDBN and B′ be the DBN output by Algorithm 10. Without loss of generality,
consider the transition {t−m+1, . . . , t} → t+1, and let B_{t−m+1}^{t+1} and B′_{t−m+1}^{t+1} be the corresponding transition
networks (Algorithm 10 is presented below).
Algorithm 10 Learning Optimal mth-order Markov cDBN
input: Set of attributes X, dataset D, a Markov lag m, a decomposable scoring function φ, maximum
intra-slice graph in-degree of k and maximum number of parents from the previous time slices of p.
output: Optimal mth-order cDBN.
1: for each transition t−m+ 1, . . . , t → t+ 1 do
2: Build a complete directed graph in X[t+ 1].
3: Calculate the weight of all edges and the optimal set of p parents from t −m + 1, . . . , t for all
nodes.
4: Apply a maximum branching algorithm to the intra-slice graph in t+ 1 that outputs the maximum
branching R.
5: for each node Xi ∈ R do
6: Compute the set αi of ancestors of i, that is, the set of nodes connecting the root R and Xi.
7: for each subset S of αi with at most k nodes do
8: Compute φi(S,D).
9: if φi(S,D) is the maximal score for Xi then
10: Set ΠXi to S.
11: end if
12: end for
13: end for
14: end for
15: Collect the transition networks to obtain the DBN structure.
Denote by D_{t−m+1}^{t+1} the subset of observations regarding the transition {t−m+1, . . . , t} → t+1.
By definition of the optimal cDBN:
$$\phi\bigl(B_{t-m+1}^{t+1}, D_{t-m+1}^{t+1}\bigr) \geq \phi\bigl(B'^{\,t+1}_{t-m+1}, D_{t-m+1}^{t+1}\bigr).$$
We will prove by contradiction that φ(B_{t−m+1}^{t+1}, D_{t−m+1}^{t+1}) ≤ φ(B′_{t−m+1}^{t+1}, D_{t−m+1}^{t+1}).
Suppose φ(B, D_{t−m+1}^{t+1}) > φ(B′, D_{t−m+1}^{t+1}). Then, since the scoring function φ is decomposable:
$$\phi_R\bigl(\emptyset, D_{t-m+1}^{t+1}\bigr) + \sum_{i \neq R} \phi_i\bigl(\Pi_i[t-m+1] \cup \cdots \cup \Pi_i[t] \cup \{X_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) >$$
$$\phi_{R'}\bigl(\emptyset, D_{t-m+1}^{t+1}\bigr) + \sum_{i \neq R'} \phi_i\bigl(\Pi'_i[t-m+1] \cup \cdots \cup \Pi'_i[t] \cup \{X'_j[t+1]\}, D_{t-m+1}^{t+1}\bigr), \qquad (4.1)$$
where Πi[t−m+1] ∪ · · · ∪ Πi[t] are the parents from the time slices t−m+1, . . . , t, Xj[t+1] is the
parent from the time slice t+1 and R is the root of the unrolled graph. Let ∆i[t−m+1] ∪ · · · ∪ ∆i[t] be the
optimal set of parents from time slices t−m+1, . . . , t determined in Step 3 for node i. Equation (4.1)
is equivalent to:
$$\sum_{i \neq R} \Bigl[\phi_i\bigl(\Pi_i[t-m+1] \cup \cdots \cup \Pi_i[t] \cup \{X_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) - \phi_i\bigl(\Delta_i[t-m+1] \cup \cdots \cup \Delta_i[t], D_{t-m+1}^{t+1}\bigr)\Bigr] >$$
$$\sum_{i \neq R'} \Bigl[\phi_i\bigl(\Pi'_i[t-m+1] \cup \cdots \cup \Pi'_i[t] \cup \{X'_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) - \phi_i\bigl(\Delta_i[t-m+1] \cup \cdots \cup \Delta_i[t], D_{t-m+1}^{t+1}\bigr)\Bigr].$$
Notice, however, that the maximum branching algorithm applied to the intra-slice graph, Step 4 of
Algorithm 10, constructs a complete graph such that the edge X′j → X′i is weighted by
$$\phi_i\bigl(\Pi'_i[t-m+1] \cup \cdots \cup \Pi'_i[t] \cup \{X'_j[t+1]\}, D_{t-m+1}^{t+1}\bigr) - \phi_i\bigl(\Delta_i[t-m+1] \cup \cdots \cup \Delta_i[t], D_{t-m+1}^{t+1}\bigr),$$
and outputs the maximal spanning tree. Moreover, in Steps 5–11, all sets of parents from the time slice
t+1 with cardinality at most k that are consistent with the maximal spanning tree are checked. Therefore,
the optimal set of parents is found for each node. On the other hand, the selected graph is acyclic: if there
existed a cycle X1, . . . , Xi, X1, then X1 would have to be in the path connecting the root R to X1 itself.
Hence, we arrive at a contradiction. Therefore, B_{t−m+1}^{t+1} = B′_{t−m+1}^{t+1} and, generalizing over all
transitions {t−m+1, . . . , t} → t+1, with t ∈ {0, . . . , T − 1}, we conclude B = B′.
Theorem 47. Algorithm 10 takes time
$$\max\bigl\{O(n^{p+3} m^{p+4} r^{p+2} N T),\; O(n^{k+1} r^{k+1} N T)\bigr\},$$
given a decomposable scoring function φ, a Markov lag m, a set of n random variables, a maximum
number p of parents from the previous m time steps, a bounded in-degree k in each intra-slice network,
and a set of observations of N individuals over T time steps.
Proof. For each transition {t−m+1, . . . , t} → t+1, in Step 3, iterating over all the edges takes
time O((nm)^2). The number of subsets of parents with at most p elements is given by:
$$|\mathcal{P}_{\leq p}(X[t])| = \sum_{i=1}^{p} \binom{nm}{i} < \sum_{i=1}^{p} (nm)^i \in O\bigl((nm)^p\bigr). \qquad (4.2)$$
To calculate the score of each parent set, considering that the maximum number of states a variable may
take is r, and that each variable has at most p+1 parents (p from the previous m time slices and one from
the current), the number of possible configurations is r^{p+2}. The score of each configuration
is computed over the set of observations D_{t−m+1}^{t+1}, which has |D_{t−m+1}^{t+1}| elements. Denote the number of
individuals by N. The scores are stored in a |D_{t−m+1}^{t+1}| × n(m+1) matrix, therefore taking O(m^2 n N)
comparisons to determine the optimal set of parents. The maximum branching, Step 4, has
time complexity O(n^2); therefore, Steps 2–4 take time O(n^{p+3} m^{p+4} r^{p+2} N). Step 5 takes O(n) time,
as it ranges over all variables. The number of subsets with at most k elements is, as in (4.2), in O(n^k).
For each set of ancestors, the number of possible configurations is r^{k+1}, and these are stored in a |D^{t+1}| × n
matrix; therefore, Steps 5–11 take time O(n^{k+1} r^{k+1} N). Algorithm 10 ranges over all T time transitions,
hence it takes time max{O(n^{p+3} m^{p+4} r^{p+2} N T), O(n^{k+1} r^{k+1} N T)}.
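As an illustration of Steps 5–8 of Algorithm 10, the following Java sketch collects the ancestors αi of a node along the optimal branching and selects the best consistent intra-slice parent set with at most k elements; the scoring interface and names are illustrative and do not correspond to the released implementation.

// Minimal sketch: ancestors along the branching and the best consistent parent set.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public final class ConsistentParents {

    /** alpha_i: the nodes on the path from the root of the branching to node i (excluding i).
     *  branchingParent[i] gives the parent of node i in the branching (-1 for the root). */
    static List<Integer> ancestors(int i, int[] branchingParent) {
        List<Integer> path = new ArrayList<>();
        for (int v = branchingParent[i]; v != -1; v = branchingParent[v])
            path.add(v);
        return path;
    }

    /** Best intra-slice parent set for node i among subsets of alpha_i with at most k elements. */
    static List<Integer> bestConsistentParents(int i, int[] branchingParent, int k,
                                               Function<List<Integer>, Double> phiOfParents) {
        List<Integer> alpha = ancestors(i, branchingParent);
        List<Integer> best = new ArrayList<>();
        double bestScore = phiOfParents.apply(best);          // start with the empty set
        // enumerate subsets of alpha by bitmask (alpha is small: at most the tree depth)
        for (int mask = 1; mask < (1 << alpha.size()); mask++) {
            if (Integer.bitCount(mask) > k) continue;          // in-degree bound
            List<Integer> subset = new ArrayList<>();
            for (int b = 0; b < alpha.size(); b++)
                if ((mask & (1 << b)) != 0) subset.add(alpha.get(b));
            double score = phiOfParents.apply(subset);
            if (score > bestScore) { bestScore = score; best = subset; }
        }
        return best;
    }
}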
Chapter 5
Experimental Results
The experimental results are organized as follows: in Section 5.1, the results for the CMDL learning
algorithm are presented; in Section 5.2, the results for the cDBN learning algorithm are presented.
5.1 Learning Bayesian Networks with CMDL
We implemented a score-based Bayesian network learning algorithm using the CMDL as scoring function.
The implementation was done in Java, following an object-oriented paradigm, and is released under a free
software license, available at https://margaridanarsousa.github.io/learn_cmdl/. We used the greedy
hill climber (GHC) as search procedure and the covering graphs as search space. The experiments
were run on an Intel Core i5-4200U CPU @ 1.60GHz×4 machine. We start by analyzing a benchmark
data set, LED, for which the Bayesian network structure used to generate it is known. Then, we analyze
the compression achieved for real data sets, comparing the CMDL, MDL and LL codes.
LED Data Set
The LED database was used by Fung and Crawford [21] and Singh and Valtorta [40]. The network
represents a faulty LED display. There are eight variables, one representing the digit key and the remaining
seven corresponding to the seven segments of the display. In this case, segment 1 is conditionally
independent of the digit key given the state of LED segments 2 and 3, whereas in a normal display
knowledge about the depressed key is sufficient to indicate which LED segments are on. The original
network is represented in Figure 5.1. Figures 5.2, 5.3, 5.4 and 5.5 represent the evolution of the learned
network using CMDL as scoring function, considering N = 1000, N = 2000, N = 3000 and N = 4000
observations, respectively (for N = 5000 the same network as for N = 4000 was recovered). In Table 5.1,
the compression achieved using the CMDL code is represented. Figures 5.6 and 5.7 represent the learned
networks using MDL as scoring criterion, for N = 1000 up to N = 4000 and for N = 5000 observations,
respectively. In Table 5.2, the compression achieved using the MDL code is depicted. In Figures 5.8 and 5.9,
the evolution of the learned networks using LL is depicted and, in Table 5.3, the compression achieved
using this code is represented.
Figure 5.1: LED database network.
Figure 5.2: Learned network using CMDL with N = 1000 observations and 1000 random restarts in GHC.
Figure 5.3: Learned network using CMDL with N = 2000 observations and 1000 random restarts in GHC.
Figure 5.4: Learned network using CMDL with N = 3000 observations and 1000 random restarts in GHC.
Figure 5.5: Learned network using CMDL with N = 4000 observations and 1000 random restarts in GHC. For N = 5000 the same network was recovered.
N CMDL-true (bits) CMDL-optimal (bits) -LL-true (bits) -LL-optimal (bits)
1000 4876.35 4510.52 4831.28 6973.49
2000 8796.83 6747.01 9313.41 14273.07
3000 12743.26 7543.56 13852.30 21487.60
4000 16753.36 16319.94 16068.92 18439.11
5000 20707.54 20256.814 22912.41 20004.45
Table 5.1: Compression achieved using the CMDL code. CMDL-optimal and LL-optimal correspond to the length of the codes induced by the optimal structure found by the GHC; 1000 random restarts were considered. CMDL-true and LL-true correspond to the length of the codes induced by the initial structure, represented in Figure 5.1. N is the number of instances considered.
Figure 5.6: Learned network using MDL with N = 1000, N = 2000, N = 3000, N = 4000 observations and 1000 restarts in GHC.
Figure 5.7: Learned network using MDL with N = 5000 observations and 1000 restarts in GHC.
N MDL-true (bits) MDL-optimal (bits) -LL-true (bits) -LL-optimal (bits)
1000 5194.98 4687.96 4831.28 4324.27
2000 9713.63 8599.92 9313.41 8166.80
3000 14273.89 12543.60 13852.30 12087.37
4000 18875.84 16541.56 18439.11 16068.92
5000 23360.91 20489.81 22912.41 20004.45
Table 5.2: Compression achieved using the MDL code. MDL-optimal and LL-optimal correspond to the length of the codes induced by the optimal structure found by the GHC; 1000 random restarts were considered for the GHC. MDL-true and LL-true correspond to the length of the codes induced by the initial structure, represented in Figure 5.1. N is the number of instances considered.
Figure 5.8: Learned network using LL with N = 1000 observations and 1000 restarts in GHC.
Figure 5.9: Learned network using LL with N = 2000, N = 3000, N = 4000, N = 5000 observations and 1000 restarts in GHC.
Real Data
We evaluated the compression achieved with the CMDL and MDL codes using four datasets from the
UCI repository [1]. Results are presented in Table 5.4.
N      -LL-true (bits)   -LL-optimal (bits)
1000    4831.28            4324.27
2000    9313.41            8166.80
3000   13852.30           12087.36
4000   18439.11           16068.92
5000   22912.41           20004.45

Table 5.3: Compression achieved using the LL code. LL-optimal and LL-true correspond to the length induced by the optimal structure found by the GHC and by the initial structure, represented in Figure 5.1. As usual, 1000 random restarts were considered for the GHC.
Data set Nb of Attributes Nb of Classes Nb of Instances MDL (bits) CMDL (bits)
chess 36 2 3196 57200.64 54487.32
letter 16 26 20000 731659.06 715538.73
shuttle-small 9 7 5800 77736.14 76483.61
waveform-21 21 3 5000 202127.20 200098.62
Table 5.4: Description of the data sets used in the experiments and the compression achieved with the MDL and CMDL codes.
Discussion
From the experimental results regarding the LED data set, we observe that none of the scoring functions
is able to recover the original structure, even considering N = 5000 instances. Moreover, none of the scores
gives rise to a structure that captures the conditional independence of segment 1 and the digit key given
segments 2 and 3.
CMDL is the scoring criterion that selects the most complex structures for N = 1000, 2000, 3000 observations.
On the other hand, it yields the maximum compression rate for N = 2000, N = 3000, N = 4000, N = 5000.
Intuitively, this may be interpreted as follows: the data set is being compressed aggressively, so the
regularities of the training data are captured in an exaggerated manner; this leads to overfitting to the training
data and to the selection of complex structures. Roughly, as the number of instances increases, the behavior
of CMDL and its code length seem to approach those of LL.
All scoring criteria converge to the same network for N = 5000 instances. MDL is the most consistent
criterion, in the sense that it selects the same structure for N = 1000, N = 2000, N = 3000 and N = 4000
observations. On the other hand, it gives rise to the highest description length. Notice that, in all cases
except CMDL with N = 1000, the log-likelihood is always higher for the optimal model selected by the
GHC than for the model that generated the data.
From the results for the real data, we observe that CMDL compresses the data further than MDL.
Moreover, the results in Table 5.4 show that increasing the number of instances increases the difference
between the code lengths of MDL and CMDL.
Therefore, we conclude that CMDL is not an advantage, in terms of learning, when compared to MDL.
However, it gives rise to considerably higher compression rates.
5.2 Learning cDBNs
We will now compare the results obtained using Algorithm 9 [28], denoted by tDBN, which restricts the
search space for the intra-slice network of the transition networks to tree-network structures, and
Algorithm 10, proposed in this thesis, denoted by cDBN, which increases the search space exponentially,
to consistent k-graphs. For Algorithm 9 we used the implementation released under a free software
license, available at http://josemonteiro.github.io/tDBN/. Algorithm 10 was implemented in Java,
following an object-oriented paradigm, and is released under a free software license, available at https:
//margaridanarsousa.github.io/learn_cDBN/. The experiments were run on an Intel Core i5-4200U
CPU @ 1.60GHz×4 machine. We start by analyzing the performance of the proposed algorithm on
synthetic data generated from stationary first-order Markov cDBNs. Then, a first-order Markov cDBN
is used to model the evolution of patients with rheumatoid arthritis.
Experiment 1 – Synthetic Data
A first-order cDBN structure and parameters were determined, and observations were sampled from
the generated network. Algorithms 9 and 10 were applied to the resulting data sets, and the ability
to learn and recover the original network structure was measured. The maximum intra-slice in-degree
k considered in Algorithm 10 was taken to be that of the initial structure. To fully evaluate the
performance of the cDBN learning algorithm, we did not consider the topological order induced by the
optimal branching (see note 1); we considered instead the breadth-first-search order of the optimal
branching (see note 2). We compared the original and recovered networks using the precision, recall
and F1 metrics, defined as follows:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad (5.1)$$
$$\text{recall} = \frac{TP}{TP + FN}, \qquad (5.2)$$
$$F_1 = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \qquad (5.3)$$
where TP are the true positive edges, FP are the false positive edges and FN are the false negative
edges. Five independent datasets were sampled from the generated network, for each given number of
observations. The initial networks considered are represented in Figure 5.10. The results are depicted in
Tables 5.5 and 5.6, where the presented values are annotated with a 95% confidence interval. tDBN+LL
and tDBN+MDL denote the tDBN learning algorithm applied with the LL and MDL scoring functions,
respectively. cDBN+LL and cDBN+MDL denote the cDBN learning algorithm applied with the LL and
MDL scoring functions, respectively.
Note 1: A topological order of a DAG G = (V,E) is a total ordering of all its vertices such that, if E contains an edge (u, v), then u appears before v in the ordering [11].
Note 2: We consider the breadth-first-search order as defined in [11].
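A minimal Java sketch of how the structure-recovery metrics (5.1)–(5.3) can be computed from the true and learned edge sets is given below; edges are represented as strings for simplicity and the code is illustrative only.

// Minimal sketch: precision, recall and F1 between true and learned edge sets
// (assumes both sets are non-empty).
import java.util.HashSet;
import java.util.Set;

public final class RecoveryMetrics {

    /** Returns {precision, recall, F1} for a recovered edge set against the true one. */
    public static double[] evaluate(Set<String> trueEdges, Set<String> learnedEdges) {
        Set<String> tp = new HashSet<>(learnedEdges);
        tp.retainAll(trueEdges);                                    // true positives
        int fp = learnedEdges.size() - tp.size();                   // false positives
        int fn = trueEdges.size() - tp.size();                      // false negatives

        double precision = tp.size() / (double) (tp.size() + fp);
        double recall = tp.size() / (double) (tp.size() + fn);
        double f1 = 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}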
[Figure 5.10: four initial transition networks over the variables Xi[0] and Xi[1].]
Figure 5.10: Initial networks for the experiments, considering the following parameters (from left to right): n = 5, r = 2, k = 2; n = 5, r = 2, k = 4; n = 10, r = 3, k = 5; n = 10, r = 4, k = 6.
Table 5.5: Comparative structure recovery results for tDBN+LL and tDBN+MDL on simulated data. n is the number of network attributes, p is the number of parents from the preceding time-slice, r is the number of states of all attributes and N is the number of observations. Running time is in seconds.

Network 1 (n = 5, r = 2)
N     | tDBN+LL (Pre, Rec, F1, Time)                 | tDBN+MDL (Pre, Rec, F1, Time)
250   | 0.457±0.0501, 0.582±0.0638, 0.512±0.0561, 1  | 0.803±0.0568, 0.6±0.0637, 0.686±0.0613, 1
500   | 0.543±0.0638, 0.691±0.0813, 0.608±0.0715, 1  | 0.853±0.128, 0.655±0.117, 0.74±0.122, 1
750   | 0.557±0.0469, 0.691±0.0813, 0.624±0.0715, 1  | 0.908±0.114, 0.727±0.101, 0.807±0.106, 1
1000  | 0.614±0.0307, 0.782±0.0390, 0.688±0.0344, 1  | 0.856±0.0835, 0.654±0.0781, 0.741±0.0795, 1

Network 2 (n = 5, r = 2)
250   | 0.586±0.0469, 0.547±0.0437, 0.566±0.0452, 1  | 0.831±0.0938, 0.333±0.0522, 0.475±0.0664, 1
500   | 0.600±0.0307, 0.660±0.0286, 0.580±0.0296, 1  | 0.857±0, 0.4±0, 0.545±0, 1
750   | 0.614±0.0307, 0.573±0.0286, 0.593±0.0296, 1  | 0.893±0.0475, 0.440±0.0286, 0.589±0.0324, 1
1000  | 0.614±0.0307, 0.573±0.0286, 0.593±0.0296, 1  | 0.918±0.0591, 0.440±0.0286, 0.594±0.0362, 1

Network 3 (n = 10, r = 3)
250   | 0.497±0.0779, 0.411±0.0645, 0.45±0.0706, 1   | 0.583±0.103, 0.194±0.0401, 0.291±0.0579, 1
500   | 0.538±0.0308, 0.446±0.0255, 0.488±0.0279, 1  | 0.804±0.0652, 0.314±0.0388, 0.452±0.0507, 1
750   | 0.593±0.0352, 0.491±0.0292, 0.538±0.0319, 1  | 0.784±0.0885, 0.314±0.0411, 0.449±0.0564, 1
1000  | 0.579±0.0226, 0.48±0.0187, 0.525±0.0205, 1   | 0.893±0.0596, 0.383±0.0255, 0.536±0.0358, 1

Network 4 (n = 10, r = 4)
250   | 0.345±0.0331, 0.303±0.0291, 0.323±0.0310, 1  | 0.273±0, 0.0909±0, 0.136±0, 1
500   | 0.359±0.0308, 0.315±0.0271, 0.335±0.0288, 1  | 0.297±0.0260, 0.103±0.0130, 0.153±0.0178, 1
750   | 0.414±0.0382, 0.367±0.0336, 0.387±0.0358, 1  | 0.374±0.0180, 0.145±0.0106, 0.209±0.0139, 1
1000  | 0.469±0.0242, 0.412±0.0212, 0.439±0.0226, 1  | 0.385±0, 0.152±0, 0.217±0, 1
Table 5.6: Comparative structure recovery results for cDBN+LL and cDBN+MDL. n is the number of network attributes, p is the number of parents from the preceding time-slice, r is the number of states of all attributes, k is the number of parents from the current time slice and N is the number of observations. Running time is in seconds. The parameter k is taken to be the maximum in-degree of the intra-slice network of the initial structure.

Network 1 (n = 5, k = 2, r = 2)
N     | cDBN+LL (Pre, Rec, F1, Time)                  | cDBN+MDL (Pre, Rec, F1, Time)
250   | 0.541±0.0601, 0.836±0.0930, 0.657±0.030, 2    | 0.733±0.0477, 0.600±0.0400, 0.66±0.0429, 2
500   | 0.576±0.0386, 0.891±0.0597, 0.7±0.0469, 3     | 0.871±0.0312, 0.727±0, 0.792±0.0134, 2
750   | 0.635±0.0206, 0.982±0.0319, 0.771±0.0469, 4   | 0.920±0.0621, 0.782±0.0390, 0.844±0.0408, 4
1000  | 0.612±0.0252, 0.945±0.0390, 0.743±0.0307, 4   | 0.933±0.0477, 0.782±0.0637, 0.850±0.0561, 5

Network 2 (n = 5, k = 4, r = 2)
250   | 0.740±0.0175, 0.987±0.0234, 0.846±0.0200, 2   | 1.00±0, 0.600±0, 0.750±0, 2
500   | 0.750±0, 1.00±0, 0.857±0, 2                   | 0.98±0.0351, 0.613±0.0234, 0.754±0.0226, 3
750   | 0.750±0, 1.00±0, 0.857±0, 4                   | 0.980±0.0351, 0.600±0, 0.744±0.0105, 4
1000  | 0.750±0, 1.00±0, 0.857±0, 4                   | 0.96±0.0428, 0.600±0, 0.738±0.0128, 4

Network 3 (n = 10, k = 5, r = 3)
250   | 0.407±0.00781, 0.862±0.0165, 0.553±0.0106, 10 | 0.820±0.0534, 0.623±0.0252, 0.708±0.0356, 10
500   | 0.415±0.0186, 0.877±0.0393, 0.563±0.0252, 24  | 0.856±0.0327, 0.638±0.0270, 0.731±0.0286, 29
750   | 0.433±0.0156, 0.915±0.0330, 0.588±0.0212, 32  | 0.914±0.0159, 0.731±0, 0.812±0.00616, 34
1000  | 0.418±0.0175, 0.885±0.0370, 0.568±0.0237, 55  | 0.884±0.0221, 0.708±0.0165, 0.786±0.0287, 54

Network 4 (n = 10, k = 6, r = 4)
250   | 0.495±0.0111, 0.885±0.0199, 0.635±0.0143, 52  | 0.389±0.0227, 0.224±0.0130, 0.284±0.0165, 51
500   | 0.498±0.0119, 0.891±0.0212, 0.639±0.0152, 110 | 0.453±0.0226, 0.261±0.0130, 0.331±0.0165, 109
750   | 0.495±0.00594, 0.885±0.0106, 0.635±0.00762, 167 | 0.463±0.0185, 0.267±0.0106, 0.338±0.0135, 162
1000  | 0.492±0.0133, 0.879±0.0238, 0.63±0.0170, 225  | 0.463±0.0185, 0.267±0.0106, 0.338±0.0135, 229
Experiment 2
We further show that, given data generated from a fixed structure, the cDBN learning algorithm is able to
recover the initial network. Figure 5.11 shows the evolution of the learned structure as the number of
observations N increases, considering an initial structure with n = 5 attributes, p = 2 maximum number
of parents from the previous time slice and k = 2 maximum in-degree in the intra-slice network.
Figure 5.12 considers an initial structure with n = 5, p = 1 and k = 2. In both cases, MDL was used as
scoring criterion.
[Figure 5.11 panels: (a) Original network; (b) tDBN and cDBN for N = 250, N = 500; (c) tDBN and cDBN for N = 750; (d) cDBN for N = 1250; (e) tDBN for N = 1250.]
Figure 5.11: Recovered networks for the tDBN and cDBN algorithms.
[Figure 5.12 panels: (a) Initial structure; (b) tDBN for N = 500; (c) cDBN for N = 500; (d) tDBN for N = 1000; (e) cDBN for N = 1000; (f) tDBN for N = 2000 until N = 4500; (g) cDBN for N = 2000 until N = 4500; (h) tDBN for N = 5000; (i) cDBN for N = 5000.]
Figure 5.12: Recovered networks for the tDBN and cDBN algorithms.
Experiment 3 – Real Data: Rheumatoid Arthritis
As a next step, we used the proposed algorithm to model the evolution of the rheumatoid arthritis (RA)
disease in patients. RA is a chronic disease that causes joint pain, stiffness, swelling and decreased
movement of the joints [33]. The activity of this disease is not constant: there are periods of
mild activity and periods of increased disease activity. We considered a stationary DBN because no
temporal alignment of the individuals, with respect to the disease evolution, was expected in the dataset;
this also allows us to use a larger number of observations and, consequently, to obtain more complex structures.
We used the database provided by Reuma.pt [3], which contains the observations of 426 patients over
9305 hospital visits. For each patient and hospital visit, the characteristics of the patient (age, medical
history), the disease activity (medical scores, health assessment questionnaires, joint evaluation, lab
tests, adverse events) and the therapy (active agents) were recorded. We considered the preprocessed
data from [29], where a selection of attributes was made with the following criteria: attributes that
did not change over time were discarded, as were attributes with more than 25% missing values.
Continuous attributes were discretized into 10 equal-width intervals and the median
of each interval was chosen as representative. We considered the resulting attributes and observations,
and used the cDBN algorithm to predict the disease activity score (DAS) class for the following time
slice. The resulting attributes are described next [29]:
• n_meses_inicio_bio: number of months since the beginning of the treatment with the current
biological agent.
• eva_doente: visual analogue of pain according to the patient.
• vs: the rate at which red blood cells sediment, used as a non-specific measure of inflammation (units: mm/h).
• pcr: amount of C-reactive protein (CRP), a protein found in the blood plasma whose levels rise in response to inflammation (units: mg/l).
• ndDAS: number of painful joints from the 28 joints measured to assess the DAS.
• ntDAS: number of swollen joints from the 28 joints measured to assess the DAS.
• nd: total number of painful joints.
• nt: total number of swollen joints.
• idade_consulta_arred: current age of the patient, in years.
• desc_bio_activo: current biological agent for RA treatment.
• anos_doenca_ate_cons: number of years since the patient was diagnosed with RA.
• i_manif_ea: indication of disease manifestation besides the joints.
• cod_actividade_das: DAS class.
The measure of disease activity (DAS) in patients that suffer from RA is defined as [44]:
$$\mathrm{DAS} = 0.56\sqrt{\mathrm{ndDAS}} + 0.28\sqrt{\mathrm{ntDAS}} + 0.70\,\ln(\mathrm{vs}) + 0.014\,\mathrm{eva\_doente}. \qquad (5.4)$$
The resulting DAS was further discretized into 4 classes, defined as [44]:
• Remission (Class 0) for DAS < 2.6.
• Low disease activity (Class 1) for 2.6 ≤ DAS ≤ 3.2.
• Medium disease activity (Class 2) for 3.2 < DAS ≤ 5.1.
• High disease activity (Class 3) for DAS > 5.1.
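For illustration, a minimal Java sketch of the DAS computation in (5.4) and of its discretization into the four classes above is given below; the attribute names follow the dataset description, and the code is not the preprocessing pipeline of [29].

// Minimal sketch: DAS value (5.4) and the corresponding class code.
public final class Das {

    /** DAS = 0.56*sqrt(ndDAS) + 0.28*sqrt(ntDAS) + 0.70*ln(vs) + 0.014*eva_doente. */
    public static double das(double ndDAS, double ntDAS, double vs, double evaDoente) {
        return 0.56 * Math.sqrt(ndDAS) + 0.28 * Math.sqrt(ntDAS)
             + 0.70 * Math.log(vs) + 0.014 * evaDoente;
    }

    /** Maps a DAS value to the class used as cod_actividade_das. */
    public static int dasClass(double das) {
        if (das < 2.6) return 0;            // remission
        if (das <= 3.2) return 1;           // low disease activity
        if (das <= 5.1) return 2;           // medium disease activity
        return 3;                           // high disease activity
    }
}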
Figures 5.13, 5.14, 5.15 and 5.16 represent the learned first-order cDBNs for different values of k,
from one to three, and different scoring criteria, LL or MDL.
[Figure 5.13: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.13: From top to bottom: tDBN with m = 1, p = 1 and MDL; cDBN with m = 1, p = 1, k = 2 and MDL.
[Figure 5.14: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.14: Top: Resulting cDBN for m = 1, p = 1, k = 3 using MDL. Bottom: Resulting cDBN for m = 1, p = 1, k = 2 and MDL, considering the topological order induced by the tree containing all the attributes such that the class DAS has the highest depth.
[Figure 5.15: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.15: Top: Resulting cDBN for m = 1, p = 1, k = 3 and MDL, considering the topological order induced by the tree containing all the attributes such that the class DAS has the highest depth. Bottom: Resulting cDBN considering p = 1, m = 1, k = 2 and LL.
[Figure 5.16: two learned transition networks over the RA attributes, at times 0 and 1.]
Figure 5.16: Top: Resulting cDBN considering p = 1, m = 1, k = 3 and LL. Bottom: Resulting cDBN considering p = 1, m = 1, k = 3 with LL.
Classification
We used the cDBN algorithm to predict the DAS class of a patient from one hospital visit to the next,
comparing its performance with that of the tDBN algorithm. We measured the average accuracy and the precision,
defined as:
$$\text{average accuracy} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \qquad (5.5)$$
and
$$\text{precision} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)}, \qquad (5.6)$$
where C is the number of classes and TPi, TNi, FPi and FNi are, respectively, the true positive, true
negative, false positive and false negative counts for class i.
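A minimal Java sketch of how metrics (5.5) and (5.6) can be computed from a confusion matrix is given below; the representation is illustrative only.

// Minimal sketch: multi-class average accuracy (5.5) and precision (5.6)
// from a confusion matrix where confusion[actual][predicted] counts test instances.
public final class ClassificationMetrics {

    /** Returns {averageAccuracy, precision} over C classes. */
    public static double[] evaluate(int[][] confusion) {
        int classes = confusion.length;
        int total = 0;
        for (int[] row : confusion) for (int c : row) total += c;

        double accuracySum = 0.0;
        int tpSum = 0, tpFpSum = 0;
        for (int i = 0; i < classes; i++) {
            int tp = confusion[i][i];
            int fn = 0, fp = 0;
            for (int j = 0; j < classes; j++) {
                if (j != i) {
                    fn += confusion[i][j];      // class i instances predicted as j
                    fp += confusion[j][i];      // class j instances predicted as i
                }
            }
            int tn = total - tp - fn - fp;
            accuracySum += (tp + tn) / (double) total;
            tpSum += tp;
            tpFpSum += tp + fp;
        }
        return new double[] { accuracySum / classes, tpSum / (double) tpFpSum };
    }
}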
It was previously observed that these metrics do not depend on the Markov lag m, and the optimal number of
parents from the previous time slice was determined to be p = 1 [29]. Therefore, we kept m = 1 and p = 1 and
varied the number k of parents from the same time slice and the scoring function considered, LL or MDL.
These metrics were measured using 10-fold cross-validation. The results are presented in Table 5.7.
Model k Precision Average Accuracy
tDBN+LL 1 0.444 0.632
tDBN+MDL 0.516 0.691
cDBN+LL 2 0.465 0.644
cDBN+MDL 0.522 0.696
cDBN+LL 3 0.464 0.641
cDBN+MDL 0.520 0.693
cDBN+LL 4 0.459 0.636
cDBN+MDL 0.523 0.696
cDBN+LL 5 0.462 0.639
cDBN+MDL 0.523 0.697
Table 5.7: Experimental results for the tDBN and cDBN classification algorithms, where N = 4721 observations were considered.
Discussion
We will now analyze the results of Experiments 1, 2 and 3.
Experiment 1
Regarding network 1, represented in Figure 5.10, page 41, we observe that, for all numbers of observations
N considered, cDBN+LL clearly outperforms tDBN+LL. This result was expected, since the
LL scoring function does not penalize the complexity of the structures, and therefore the more complex
consistent 2-graphs are recovered. The selected structures give rise to a considerably higher recall
and to a similar precision. On the other hand, the results for cDBN+MDL and tDBN+MDL for a small
number of observations, e.g. N = 250 and N = 500, are similar. This was also expected, since MDL
penalizes the complexity of the structures, so a larger number of observations is necessary
to select consistent 2-graphs. For a higher number of observations, e.g. N = 750 and N = 1000,
cDBN+MDL performs better than cDBN+LL, as the precision is considerably higher for the former.
Considering network 2, represented in Figure 5.10, page 41, cDBN+LL outperforms the other
implementations. The intra-slice network considered is the fully connected consistent 4-graph, and is therefore
clearly biased towards the cDBN+LL algorithm. In this case, cDBN+LL learns the necessary and sufficient
connections, whereas in the other settings considered it clearly overfits the training data.
The cDBN+MDL implementation, however, improves its performance, with some fluctuations, as the
number of observations N increases.
In the case of network 3, represented in Figure 5.10, page 41, cDBN+MDL gives rise to the
best results. The penalizing term in MDL prevents false positive edges from being chosen, resulting in
significantly higher precision values compared to LL.
Considering network 4, represented in Figure 5.10, page 41, each variable has r = 4 possible values;
the number of parameters therefore increases, which explains the stronger regularization effect of the
MDL scoring function. In this case, cDBN+LL is the implementation that yields the best results: the
precision obtained is similar to that of cDBN+MDL, but the recall is considerably higher.
In general, the recall obtained with the cDBN learning algorithms, when compared to tDBN, is always
greater, while the precision is similar in both cases. Comparing the MDL and LL implementations, MDL
has higher precision, while LL has higher recall. The performance of the implementations
using MDL as scoring function improves with the number of observations, giving rise to a higher recall.
In terms of running time, tDBN has a constant running time of 1 second for all networks considered; the
cDBN algorithm has a higher running time, but it was always less than 4 minutes. The cDBN algorithm
improves the F1 measure in all cases, by at least roughly 5%.
The number of observations necessary for the cDBN to recover the first and second structures represented
in Figure 5.10 is, respectively, 6000 ± 748.33 and 14904.28 ± 6665.30, with a 95% confidence
interval, where five independent datasets were sampled from the generated network and MDL was
used. These numbers are considerably high for networks with five attributes. When k increases,
the number of necessary observations increases significantly.
Experiment 2
From Figures 5.11 and 5.12, we observe that, in order for the cDBN+MDL algorithm to recover both the
inter-slice and intra-slice connections of the initial structure, a substantial number of observations is
necessary. In Figure 5.11, considering n = 5 attributes, p = 2 parents from the previous time slice and
k = 2 parents from the current time slice, the algorithm converges when N = 1250. In Figure 5.12,
considering n = 5, p = 1 and k = 2, the algorithm only converges to the initial structure for N = 5000
observations.
53
Experiment 3
Regarding Figures 5.13 and 5.14 (top), we observe that the number of observations considered (N =
4721) is not sufficient for the cDBN+MDL algorithm to learn the intra-slice consistent k-graphs,
taking k = 2, 3. The cDBN+MDL for k = 2, 3, like the tDBN+MDL, only selects the vs attribute from the
future hospital visit to influence the predicted cod_actividade_das.
Since our goal is to predict the DAS class of a given patient from one hospital visit to the next, we
considered the topological order of the nodes induced by a tree such that the DAS has the highest depth.
The results are represented in Figures 5.14 (bottom) and 5.15 (top). In this case, the cDBN+MDL
algorithm selected the attributes cod_actividade_das (from the previous visit) and ndDAS (from the future visit)
to influence the prediction of cod_actividade_das. However, it is not able to learn more complex
consistent k-graphs for k = 2, 3.
The cDBN+LL algorithm, on the other hand, is able to learn consistent 2-graphs and 3-graphs
for the intra-slice connections. Taking k = 2, the algorithm selects the attributes eva_doente and ndDAS to
influence the prediction of cod_actividade_das. Considering k = 3, it selects the attributes eva_doente,
ndDAS and idade_consulta_arred. Notice that the variable idade_consulta_arred is not used to compute
the DAS class; see Equation (5.4). For k = 4, the same dependencies for the intra-slice network are
learned; however, cod_actividade_das from the previous visit is no longer considered to influence this
measure for the future visit. Instead, vs is the only attribute that influences the future DAS class.
Table 5.7 depicts the results for the DAS class classification task. We observe that
the average accuracy always increases when using the cDBN algorithm; however, this improvement
is not substantial. The maximum average accuracy improvement is of 0.6% and is obtained for
k = 5. Using the LL scoring function yields, in all cases, a lower average accuracy and precision than
MDL. The maximum precision improvement is of 0.7% and is obtained for k = 4 and k = 5.
Chapter 6
Conclusions
The main advantage of CMDL, compared to MDL, is its completeness, in the sense that MDL
reserves many code words to encode the same sequence, whereas CMDL reserves one code word for
each parameter. From the experimental results, we verified that the CMDL scoring criterion compresses
the data aggressively; the regularities of the training data are therefore over-learned and the criterion does
not generalize well. Hence, in terms of learning, MDL clearly outperforms CMDL. However, CMDL gives
rise to considerably lower description lengths. These facts are discussed in Section 10.2 of Grunwald's
book on the Minimum Description Length [22].
The cDBN learning algorithm has polynomial time complexity with respect to the number of attributes
and can be applied to stationary or non-stationary Markov processes. The proposed algorithm increases
the search space of the intra-slice connections exponentially, compared with the tDBN algorithm. When
more complex k-graphs are considered (with k > 1), cDBN is a good alternative to tDBN: it is able to
recover a larger number of dependencies and, in all cases considered, improves the performance of the
state-of-the-art tDBN algorithm in terms of the F1 measure.
Directions of future work
As future work, we could derive a non-asymptotic code for the parameters of the Bayesian networks. In
this case, the precise parameters are sent to the receiver, so MDL saves approximately (1/2) log N
bits in the first part of the description, since only the truncated parameters are encoded. On the other
hand, in the second part of the description, MDL reserves code words for all possible instances in the
data, whereas CMDL reserves code words only for a subset of them. Asymptotically, these two effects cancel
and both codes give rise to approximately equal code lengths [22]. Considering the precise parameters
would bring modifications to the learning and to the compression achieved.
Comparing the compression achieved using CMDL with the Bayesian network compression methods
proposed by Davies and Moore could also be considered [14].
In terms of the implementation, a more efficient search procedure could be considered [43], instead
of the greedy hill climber.
The cDBN considers the topological order induced by the optimal branching as a heuristic for a
causality order between the network variables. However, there are n! ways to order the
n variables, and other orders could be considered. On the other hand, considering a total order would
increase the search space significantly. The breadth-first-search order of the optimal branching is a good
candidate [6].
Bibliography
[1] Catherine L Blake. UCI repository of Machine Learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[2] P Bonissone, M Henrion, L Kanal, and J Lemmer. Equivalence and synthesis of causal models. In
UAI, volume 6, page 255, 1991.
[3] Helena Canhao, Augusto Faustino, Fernando Martins, Joao Eurico Fonseca, Patrıcia Nero, and
Jaime C Branco. Reuma.pt-The rheumatic diseases portuguese register. Acta reumatologica por-
tuguesa, 36(1):45–56, 2011.
[4] Alexandra Carvalho, Mario Figueiredo, and Margarida Sousa. Complete Minimum Description
Length for Learning Bayesian networks (to be submitted).
[5] Alexandra M Carvalho. Scoring functions for learning Bayesian networks. INES-ID Tec. Rep, 2009.
[6] Alexandra M Carvalho and Arlindo L Oliveira. Learning Bayesian networks consistent with the
optimal branching. In Machine Learning and Applications, 2007. ICMLA 2007. Sixth International
Conference on, pages 369–374. IEEE, 2007.
[7] David Maxwell Chickering. Learning Bayesian networks is NP-complete. Learning from data: Arti-
ficial Intelligence and Statistics V, 112:121–130, 1996.
[8] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees.
IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[9] Gregory F Cooper. The computational complexity of probabilistic inference using Bayesian belief
networks. Artificial Intelligence, 42(2-3):393–405, 1990.
[10] Gregory F Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic
networks from data. Machine Learning, 9(4):309–347, 1992.
[11] Thomas H Cormen. Introduction to algorithms. MIT press, 2009.
[12] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[13] Paul Dagum and Michael Luby. Approximating probabilistic inference in Bayesian belief networks
is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.
[14] Scott Davies and Andrew Moore. Bayesian networks for lossless dataset compression. In Proceed-
ings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 387–391. ACM, 1999.
[15] Norbert Dojer. Learning Bayesian networks does not have to be NP-hard. In MFCS, pages 305–
314. Springer, 2006.
[16] Jack Edmonds. Optimum branchings. Mathematics and the Decision Sciences, Part, 1:335–345,
1968.
[17] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on
Information Theory, 21(2):194–203, 1975.
[18] Mario AT Figueiredo. Elementos de Teoria da Informacao. 2011.
[19] Nir Friedman and Daphne Koller. Being Bayesian about network structure. A Bayesian approach to
structure discovery in Bayesian networks. Machine Learning, 50(1-2):95–125, 2003.
[20] Nir Friedman, Kevin Murphy, and Stuart Russell. Learning the structure of dynamic probabilistic
networks. In Proceedings of the Fourteenth conference on UAI, pages 139–147. Morgan Kaufmann
Publishers Inc., 1998.
[21] Robert M. Fung and Stuart L. Crawford. Constructor: A system for the induction of probabilistic
models. In AAAI, volume 90, pages 762–769, 1990.
[22] Peter Grunwald. Minimum description length tutorial. Advances in minimum description length:
Theory and applications, pages 23–80, 2005.
[23] Mark H Hansen and Bin Yu. Model selection and the principle of minimum description length.
Journal of the American Statistical Association, 96(454):746–774, 2001.
[24] David Heckerman, Dan Geiger, and David M Chickering. Learning Bayesian networks: The combi-
nation of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[25] David Heckerman, Abe Mamdani, and Michael P Wellman. Real-world applications of Bayesian
networks. Communications of the ACM, 38(3):24–26, 1995.
[26] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT
press, 2009.
[27] Petri Kontkanen and Petri Myllymaki. A linear-time algorithm for computing the multinomial stochas-
tic complexity. Information Processing Letters, 103(6):227–233, 2007.
[28] Jose L Monteiro, Susana Vinga, and Alexandra M Carvalho. Polynomial-time algorithm for learning
optimal tree-augmented dynamic Bayesian networks. In UAI, pages 622–631, 2015.
[29] Jose Maria Pedro Serra Libano Monteiro. Learning from short multivariate time series. Master
Thesis, Instituto Superior Tecnico, 2014.
[30] Kevin Murphy et al. The Bayes net toolbox for matlab. Computing Science and Statistics,
33(2):1024–1034, 2001.
[31] Kevin P Murphy. Machine Learning: a probabilistic perspective. MIT press, 2012.
[32] Kevin Patrick Murphy and Stuart Russell. Dynamic Bayesian networks: representation, inference
and learning. 2002.
[33] American College of Rheumatology. Rheumatoid Arthritis, 2017.
[34] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Mor-
gan Kaufmann, 2014.
[35] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[36] Jorma Rissanen. A universal prior for integers and estimation by minimum description length. The
Annals of Statistics, pages 416–431, 1983.
[37] Jorma Rissanen. Minimum Description Length Principle. Wiley Online Library, 1985.
[38] Jorma J Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information
Theory, 42(1):40–47, 1996.
[39] Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464,
1978.
[40] Moninder Singh and Marco Valtorta. Construction of Bayesian network structures from data: a brief
survey and an efficient algorithm. International Journal of Approximate Reasoning, 12(2):111–131,
1995.
[41] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT
press, 2000.
[42] Robert Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing,
1(2):146–160, 1972.
[43] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for
learning Bayesian networks. arXiv preprint arXiv:1207.1429, 2012.
[44] DM Van der Heijde, Martin A van’t Hof, PL Van Riel, LA Theunisse, Evelien W Lubberts, Miek A van
Leeuwen, Martin H van Rijswijk, and LB Van de Putte. Judging disease activity in clinical practice
in rheumatoid arthritis: first step in the development of a disease activity score. Annals of the
Rheumatic Diseases, 49(11):916–920, 1990.
[45] Marcel AJ Van Gerven, Babs G Taal, and Peter JF Lucas. Dynamic Bayesian networks as prog-
nostic models for clinical patient management. Journal of Biomedical Informatics, 41(4):515–529,
2008.
[46] Nguyen Xuan Vinh, Madhu Chetty, Ross Coppel, and Pramod P Wangikar. Polynomial time al-
gorithm for learning globally optimal dynamic Bayesian network. In International Conference on
Neural Information Processing, pages 719–729. Springer, 2011.
[47] Xin-Qiu Yao, Huaiqiu Zhu, and Zhen-Su She. A dynamic Bayesian network approach to protein
secondary structure prediction. BMC Bioinformatics, 9(1):49, 2008.
[48] Geoffrey Zweig and Stuart Russell. Speech recognition with dynamic Bayesian networks. 1998.