TRANSCRIPT
Structure learning in Bayesian networks
Probabilistic Graphical Models
Sharif University of Technology
Spring 2018
Soleymani
Recall: Learning problem
Target: the true distribution $P^*$ that may correspond to a model $\mathcal{M}^* = \langle \mathcal{K}^*, \boldsymbol{\theta}^* \rangle$
Hypothesis space: a specified family of probabilistic graphical models
Data: a set of instances sampled from $P^*$
Learning goal: select a model $\mathcal{M}$ that constructs the best approximation to $P^*$ according to a performance metric
We may use Bayesian model averaging instead of selecting a single model $\mathcal{M}$
Structure learning of Bayesian networks
For a fixed set of variables (i.e., nodes), we intend to learn the directed graph structure $\mathcal{G}^*$:
Which edges are included in the model?
Fewer edges miss dependencies
More edges cause spurious dependencies
We often prefer sparser structures that show better
generalization
Learning the structure of DGMs
Approaches
Constraint-based
Find conditional independencies in data using statistical tests
Find an I-map graph for the obtained conditional independencies
Score-based
View a BN structure and parameter learning as a density estimation
task and address structure learning as a model selection problem
Score function measures how well the model fits the observed data
Maximum likelihood
Bayesian score
Score-based approach
Input:
Hypothesis space: set of possible structures
Training data
Output: A network that maximizes the score function
Thus, we need a score function (including priors, if needed)
and a search strategy (i.e. optimization procedure) to find a
structure in the hypothesis space
Structural search
Hypothesis space
General graphs: the number of graphs on $n$ nodes is $O(2^{n^2})$ (super-exponential)
Trees: the number of trees on $n$ nodes is $O(n!)$
Combinatorial optimization problem
Many heuristics to solve this problem but no guarantee on attaining optimality
Likelihood score
Find $\mathcal{G}, \hat{\boldsymbol{\theta}}_\mathcal{G}$ that maximize the likelihood:
$$\max_{\mathcal{G}, \boldsymbol{\theta}_\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_\mathcal{G}) = \max_{\mathcal{G}} \left[ \max_{\boldsymbol{\theta}_\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_\mathcal{G}) \right] = \max_{\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_\mathcal{G})$$
$$\mathrm{Score}_L(\mathcal{G}, \mathcal{D}) = P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_\mathcal{G})$$
Likelihood score: Information-theoretic interpretation
Given the structure, the log-likelihood of the data is:
$$\ln P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_\mathcal{G}, \mathcal{G}) = \ln \prod_{m=1}^{M} P(\boldsymbol{x}^{(m)} \mid \hat{\boldsymbol{\theta}}_\mathcal{G}, \mathcal{G}) = \sum_{m=1}^{M} \sum_{i} \ln P\!\left(x_i^{(m)} \mid \boldsymbol{pa}_i^{(m)}, \hat{\boldsymbol{\theta}}_{X_i \mid Pa_i}, \mathcal{G}\right)$$
$$= \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{pa}_i \in Val(Pa_i)} Count(x_i, \boldsymbol{pa}_i) \ln \hat{P}(x_i \mid \boldsymbol{pa}_i)$$
Using $Count(x_i, \boldsymbol{pa}_i) = M \hat{P}(x_i, \boldsymbol{pa}_i)$:
$$\ln P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_\mathcal{G}, \mathcal{G}) = M \sum_{i} \sum_{x_i, \boldsymbol{pa}_i} \hat{P}(x_i, \boldsymbol{pa}_i) \ln\!\left( \frac{\hat{P}(x_i \mid \boldsymbol{pa}_i)}{\hat{P}(x_i)} \, \hat{P}(x_i) \right) = M \sum_{i} \left[ I_{\hat{P}}\!\left(X_i; Pa_i^{\mathcal{G}}\right) - H_{\hat{P}}(X_i) \right]$$
This score measures the strength of the dependencies between variables and their parents.
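As a minimal sketch (assuming the data set is a list of tuples of discrete values; all helper names are hypothetical), the information-theoretic form of the likelihood score can be computed from empirical counts:

```python
import math
from collections import Counter

def empirical_mi(data, i, parents):
    """I_Phat(X_i; Pa_i): empirical mutual information between column i
    and its parent columns, estimated from the samples in `data`."""
    M = len(data)
    joint = Counter((r[i], tuple(r[j] for j in parents)) for r in data)
    px = Counter(r[i] for r in data)
    ppa = Counter(tuple(r[j] for j in parents) for r in data)
    return sum((c / M) * math.log(c * M / (px[x] * ppa[pa]))
               for (x, pa), c in joint.items())

def entropy(data, i):
    """H_Phat(X_i): empirical entropy of column i."""
    M = len(data)
    return -sum((c / M) * math.log(c / M)
                for c in Counter(r[i] for r in data).values())

def likelihood_score(data, parent_sets):
    """Score_L(G; D) = M * sum_i [ I(X_i; Pa_i) - H(X_i) ]."""
    M = len(data)
    return M * sum(empirical_mi(data, i, pa) - entropy(data, i)
                   for i, pa in enumerate(parent_sets))
```

Note that adding an edge can only increase this score, which is exactly the overfitting problem discussed next.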
Limitations of likelihood score
We have $I_P(X; Y) \ge 0$, and $I_P(X; Y) = 0$ iff $P(X, Y) = P(X) P(Y)$
In the training data, almost always $I_{\hat{P}}(X; Y) > 0$
Due to statistical noise in the training data, exact independence almost never occurs in the empirical distribution
Thus, the ML score never prefers the simpler network
The score is maximized by the fully connected network, since $I(X; Y \cup Z) \ge I(X; Y)$, with equality iff $X \perp Z \mid Y$
Therefore the likelihood score fails to generalize well
Avoiding overfitting
Restricting the hypothesis space explicitly: restrict the number of parents or the number of parameters
Imposing a penalty on the complexity of the structure explicitly: Bayesian score and BIC
Tree structure: Chow-Liu algorithm
Compute $w_{ij} = I(X_i; X_j)$ for each pair of nodes $X_i$ and $X_j$
Find a maximum-weight spanning tree (MWST), i.e., the tree with the greatest total weight:
$$\max_{T} \sum_{(i,j) \in T} w_{ij}$$
Give directions to the edges in the MWST: pick an arbitrary node as the root and draw arrows going away from it (e.g., using a BFS algorithm)
In this way, we can find an optimal tree in polynomial time
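The steps above can be sketched as follows (a minimal Chow-Liu implementation, assuming discrete samples as tuples and using Kruskal's algorithm for the maximum-weight spanning tree; the function names are hypothetical):

```python
import math
from collections import Counter, deque
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i; X_j) from the samples."""
    M = len(data)
    pij = Counter((r[i], r[j]) for r in data)
    pi = Counter(r[i] for r in data)
    pj = Counter(r[j] for r in data)
    return sum((c / M) * math.log(c * M / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu(data, n):
    """MWST on MI weights (Kruskal), then edges oriented away from
    node 0 by BFS; returns a list of directed arcs (u, v)."""
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))          # union-find for Kruskal
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    adj = {u: [] for u in range(n)}  # orient away from root 0 via BFS
    for i, j in tree:
        adj[i].append(j); adj[j].append(i)
    directed, seen, q = [], {0}, deque([0])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                q.append(v)
    return directed
```

With $n$ nodes this costs $O(n^2)$ MI computations plus an MST, i.e., polynomial time as claimed.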
BIC score
$$\mathrm{Score}_{BIC}(\mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_\mathcal{G}) - \frac{\log M}{2} \mathrm{Dim}(\mathcal{G})$$
$$= M \sum_{i} I_{\hat{P}}\!\left(X_i; Pa_i^{\mathcal{G}}\right) - M \sum_{i} H_{\hat{P}}(X_i) - \frac{\log M}{2} \mathrm{Dim}(\mathcal{G})$$
Here $\mathrm{Dim}(\mathcal{G})$ is the model complexity (i.e., the number of independent parameters), and the entropy term can be ignored since it does not depend on $\mathcal{G}$.
Its negation is also known as Minimum Description Length (MDL)
For larger $M$, more emphasis is given to the fit to the data: mutual information grows linearly with $M$ while complexity grows logarithmically with $M$
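A minimal sketch of the BIC score (assuming discrete samples as tuples, a list of parent sets, and per-variable cardinalities; the function name is hypothetical):

```python
import math
from collections import Counter

def bic_score(data, parent_sets, card):
    """Score_BIC(G) = log P(D | theta_hat) - (log M / 2) * Dim(G),
    where Dim(G) = sum_i (card_i - 1) * prod of parent cardinalities."""
    M = len(data)
    loglik, dim = 0.0, 0
    for i, pa in enumerate(parent_sets):
        joint = Counter((r[i], tuple(r[j] for j in pa)) for r in data)
        marg = Counter(tuple(r[j] for j in pa) for r in data)
        # maximum-likelihood log-likelihood, family by family
        loglik += sum(c * math.log(c / marg[u]) for (x, u), c in joint.items())
        # (card_i - 1) free parameters per parent configuration
        npa = 1
        for j in pa:
            npa *= card[j]
        dim += (card[i] - 1) * npa
    return loglik - 0.5 * math.log(M) * dim
```

Unlike the plain likelihood score, the penalty lets a genuinely needed edge win while a spurious one loses.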
Asymptotic consistency of BIC score
Definition: A scoring function is consistent if, as $M \to \infty$ and with probability $\to 1$:
$\mathcal{G}^*$ (or any I-equivalent structure) maximizes the score
All structures that are not I-equivalent to $\mathcal{G}^*$ have strictly lower scores
Theorem: As $M \to \infty$, the structure that maximizes the BIC score is the true structure $\mathcal{G}^*$ (or any I-equivalent structure)
Asymptotically, spurious dependencies will not contribute to likelihood
and will be penalized
Required dependencies will be added due to linear growth of likelihood
term compared to logarithmic growth of model complexity
Bayesian score
Uncertainty over parameters and structure: we average over all possible parameter values and also consider the structure prior
$$P(\mathcal{G} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{G})\, P(\mathcal{G})}{P(\mathcal{D})}$$
where $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood, $P(\mathcal{G})$ is the structure prior, and $P(\mathcal{D})$ is independent of $\mathcal{G}$.
Bayesian score:
$$\mathrm{Score}_B(\mathcal{G}) = \log P(\mathcal{D} \mid \mathcal{G}) + \log P(\mathcal{G})$$
We consider the whole distribution over parameters $\boldsymbol{\theta}_\mathcal{G}$ for the structure $\mathcal{G}$ to find the marginal likelihood:
$$P(\mathcal{D} \mid \mathcal{G}) = \int_{\boldsymbol{\theta}_\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_\mathcal{G})\, P(\boldsymbol{\theta}_\mathcal{G} \mid \mathcal{G})\, d\boldsymbol{\theta}_\mathcal{G}$$
Marginal likelihood
$$P(\mathcal{D} \mid \mathcal{G}) = P(\boldsymbol{x}[1], \ldots, \boldsymbol{x}[M] \mid \mathcal{G}) = P(\boldsymbol{x}[1] \mid \mathcal{G}) \times P(\boldsymbol{x}[2] \mid \boldsymbol{x}[1], \mathcal{G}) \times \cdots \times P(\boldsymbol{x}[M] \mid \boldsymbol{x}[1], \ldots, \boldsymbol{x}[M-1], \mathcal{G})$$
Dirichlet priors and a single variable:
$$P(x[1], \ldots, x[M]) = \frac{\prod_{k=1}^{K} \alpha_k (\alpha_k + 1) \cdots (\alpha_k + M[k] - 1)}{\alpha (\alpha + 1) \cdots (\alpha + M - 1)} = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \times \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}$$
where $M[k] = \sum_{m=1}^{M} I(x[m] = k)$, $\alpha = \sum_{k=1}^{K} \alpha_k$, and we used $\alpha (\alpha + 1) \cdots (\alpha + M - 1) = \Gamma(\alpha + M) / \Gamma(\alpha)$.
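The closed form above is easy to evaluate in log space with the log-gamma function. A minimal sketch for a single variable (the function name is hypothetical; `alphas` maps each value to its Dirichlet hyperparameter):

```python
import math
from collections import Counter

def log_marginal_likelihood(samples, alphas):
    """log P(x[1..M]) for one discrete variable under a Dirichlet prior:
    Gamma(alpha)/Gamma(alpha+M) * prod_k Gamma(alpha_k + M[k])/Gamma(alpha_k)."""
    counts = Counter(samples)                   # M[k]
    alpha = sum(alphas.values())                # alpha = sum_k alpha_k
    M = len(samples)
    logp = math.lgamma(alpha) - math.lgamma(alpha + M)
    for k, a_k in alphas.items():
        logp += math.lgamma(a_k + counts[k]) - math.lgamma(a_k)
    return logp
```

For example, with a uniform Beta(1, 1) prior over a binary variable, the sequence (heads, heads) has probability $(1/2)(2/3) = 1/3$, matching the chain-rule decomposition of the marginal likelihood.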
Marginal likelihood in Bayesian networks: global & local parameter independence
Let $P(\boldsymbol{\theta}_\mathcal{G} \mid \mathcal{G})$ satisfy the global parameter independence assumption:
$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \int_{\boldsymbol{\theta}_{X_i \mid Pa_i^{\mathcal{G}}}} \prod_{m=1}^{M} P\!\left(x_i[m] \mid \boldsymbol{pa}_i^{\mathcal{G}}[m], \boldsymbol{\theta}_{X_i \mid Pa_i^{\mathcal{G}}}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid Pa_i^{\mathcal{G}}} \mid \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid Pa_i^{\mathcal{G}}}$$
Let $P(\boldsymbol{\theta}_\mathcal{G} \mid \mathcal{G})$ satisfy the global and local parameter independence assumptions:
$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{pa}_i \in Val(Pa_i^{\mathcal{G}})} \int_{\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i}} \prod_{m:\, \boldsymbol{pa}_i^{\mathcal{G}}[m] = \boldsymbol{pa}_i} P\!\left(x_i[m] \mid \boldsymbol{pa}_i, \boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i}, \mathcal{G}\right) P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i} \mid \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i}$$
If the Dirichlet priors $P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{pa}_i} \mid \mathcal{G})$ have hyperparameters $\alpha^{\mathcal{G}}_{x_i \mid \boldsymbol{pa}_i}$ for $x_i \in Val(X_i)$:
$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{pa}_i \in Val(Pa_i^{\mathcal{G}})} \frac{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \alpha^{\mathcal{G}}_{x_i \mid \boldsymbol{pa}_i}\right)}{\Gamma\!\left(\sum_{x_i \in Val(X_i)} \left(\alpha^{\mathcal{G}}_{x_i \mid \boldsymbol{pa}_i} + M[x_i, \boldsymbol{pa}_i]\right)\right)} \prod_{x_i \in Val(X_i)} \frac{\Gamma\!\left(\alpha^{\mathcal{G}}_{x_i \mid \boldsymbol{pa}_i} + M[x_i, \boldsymbol{pa}_i]\right)}{\Gamma\!\left(\alpha^{\mathcal{G}}_{x_i \mid \boldsymbol{pa}_i}\right)}$$
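The closed-form product above decomposes over families, so each family's contribution can be computed separately. A minimal sketch for one family (hypothetical function name; `alpha` maps pairs $(x_i, \boldsymbol{pa}_i)$ to Dirichlet hyperparameters):

```python
import math
from collections import Counter

def log_family_marginal(data, i, parents, alpha):
    """log of the marginal-likelihood factor for one family X_i | Pa_i,
    using the closed form implied by global + local parameter
    independence with Dirichlet priors alpha[(x_i, pa_i)]."""
    m_xu = Counter((r[i], tuple(r[j] for j in parents)) for r in data)
    m_u = Counter(tuple(r[j] for j in parents) for r in data)
    a_u = Counter()                      # alpha summed per parent config
    for (x, u), a in alpha.items():
        a_u[u] += a
    logp = 0.0
    for u, a in a_u.items():
        logp += math.lgamma(a) - math.lgamma(a + m_u[u])
    for (x, u), a in alpha.items():
        logp += math.lgamma(a + m_xu[(x, u)]) - math.lgamma(a)
    return logp
```

For a node with no parents this reduces to the single-variable Dirichlet marginal likelihood of the previous slide.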
Asymptotic behavior of Bayesian score
As $M \to \infty$, a network $\mathcal{G}$ with Dirichlet priors satisfies:
$$\log P(\mathcal{D} \mid \mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_\mathcal{G}) - \frac{\log M}{2} \mathrm{Dim}(\mathcal{G}) + O(1)$$
where the $O(1)$ term does not grow with $M$.
Thus, the Bayesian score is asymptotically equivalent to BIC, and hence asymptotically consistent
Priors
Structure prior $P(\mathcal{G})$:
Uniform prior: $P(\mathcal{G}) \propto$ constant
Prior penalizing the number of edges: $P(\mathcal{G}) \propto c^{|E(\mathcal{G})|}$ $(0 < c < 1)$
Prior penalizing the number of parameters
The normalizing constant can be ignored
Parameter priors: $P(\boldsymbol{\theta} \mid \mathcal{G})$ is usually instantiated as the BDe prior:
$$\alpha\!\left(x_i, \boldsymbol{pa}_i^{\mathcal{G}}\right) = \alpha \cdot P'\!\left(x_i, \boldsymbol{pa}_i^{\mathcal{G}}\right)$$
where $\alpha$ is the equivalent sample size and $P'$ is represented by a prior network $B_0$. A single network $B_0$ thus provides priors for all candidate networks; note that $Pa_i^{\mathcal{G}}$ need not be the same as the parents of $X_i$ in $B_0$.
BDe is the unique prior with the property that I-equivalent networks have the same Bayesian score
Score Equivalence
A score function satisfies score equivalence when all I-equivalent networks have the same score
The likelihood score and BIC satisfy score equivalence
General graph structure
Theorem: The problem of learning a BN structure with at most $d$ ($d \ge 2$) parents per node is NP-hard.
Search in the space of graph structures using local search algorithms
Algorithms like hill climbing and simulated annealing
Exploit score decomposition during the search to alleviate computational overhead
Optimization by local searches
Graphs as states
The score function is used as the objective function
Neighbors of a state are found using these modifications:
Edge addition
Edge deletion
Edge reversal
We only consider legal neighbors, i.e., modifications that result in DAGs
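A minimal hill-climbing sketch over these three move types (function names are hypothetical; `score` is any structure score on edge sets, and acyclicity is checked with Kahn's algorithm):

```python
def has_cycle(n, edges):
    """True if the directed graph on n nodes has a cycle (Kahn's algorithm)."""
    indeg = [0] * n
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v); indeg[v] += 1
    stack = [u for u in range(n) if indeg[u] == 0]
    removed = 0
    while stack:
        u = stack.pop(); removed += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return removed < n

def neighbors(n, edges):
    """Legal moves: edge addition, deletion, and reversal, DAGs only."""
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            if (u, v) in edges:
                yield edges - {(u, v)}                     # deletion
                cand = (edges - {(u, v)}) | {(v, u)}       # reversal
                if not has_cycle(n, cand):
                    yield cand
            elif (v, u) not in edges:
                cand = edges | {(u, v)}                    # addition
                if not has_cycle(n, cand):
                    yield cand

def hill_climb(n, score, max_steps=100):
    """Greedy local search: move to the best-scoring legal neighbor
    until no neighbor improves on the current structure."""
    current = frozenset()
    for _ in range(max_steps):
        best = max(neighbors(n, current), key=score, default=current)
        if score(best) <= score(current):
            return current
        current = best
    return current
```

This finds only a local optimum; restarts or simulated annealing are commonly used to escape it.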
Decomposable score
Decomposable structure score:
$$\mathrm{Score}(\mathcal{G}; \mathcal{D}) = \sum_{i} \mathrm{FamScore}\!\left(X_i \mid Pa_i^{\mathcal{G}}; \mathcal{D}\right)$$
where $\mathrm{FamScore}(X \mid \boldsymbol{U}; \mathcal{D})$ measures how well the set of variables $\boldsymbol{U}$ serves as parents of $X$ in the data set $\mathcal{D}$.
Decomposability is useful to reduce the computational overhead of evaluating different structures during search: with a decomposable score, a local change in the structure does not change the score of the remaining parts of the structure.
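A sketch of why this matters for search (`fam_score` is a hypothetical family-score callback): the score change of a local move is just the sum of family-score differences over the few nodes whose parent sets changed, so the other families never need rescoring.

```python
def delta_score(fam_score, old_parents, new_parents, data):
    """Score(G'; D) - Score(G; D) for a decomposable score: only the
    families whose parent sets differ contribute to the change."""
    delta = 0.0
    for i, (old, new) in enumerate(zip(old_parents, new_parents)):
        if set(old) != set(new):
            delta += fam_score(i, new, data) - fam_score(i, old, data)
    return delta
```

An edge addition or deletion changes one family; a reversal changes two.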
Decomposability of scores
The likelihood score is decomposable
The Bayesian score is decomposable when:
the parameter priors satisfy global parameter independence and parameter modularity:
$$Pa_i^{\mathcal{G}} = Pa_i^{\mathcal{G}'} = \boldsymbol{U} \;\Rightarrow\; P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}\right) = P\!\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}'\right)$$
and the structure prior satisfies structure modularity:
$$P(\mathcal{G}) \propto \prod_{i} P\!\left(Pa_i = Pa_i^{\mathcal{G}}\right)$$
Summary
Score functions
Likelihood score
BIC score
Bayesian score
Optimal structure for trees is found using a greedy algorithm
When hypothesis space includes general graphs, the model
selection problem will be NP-hard and we use local search
strategies to optimize the structure
References
D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", MIT Press, 2009, Sections 18.1, 18.3, 18.4.