
Page 1

Structure learning in Bayesian networks

Probabilistic Graphical Models

Sharif University of Technology

Spring 2018

Soleymani

Page 2

Recall: Learning problem

Target: the true distribution $P^*$, which may correspond to a model $\mathcal{M}^* = \langle \mathcal{K}^*, \boldsymbol{\theta}^* \rangle$

Hypothesis space: a specified family of probabilistic graphical models

Data: a set of instances sampled from $P^*$

Learning goal: select a model $\mathcal{M}$ that constructs the best approximation to $\mathcal{M}^*$ according to a performance metric

We may use Bayesian model averaging instead of selecting a single model $\mathcal{M}$

Page 3

Structure learning of Bayesian networks

For a fixed set of variables (i.e., nodes), we intend to learn the directed graph structure $\mathcal{G}^*$:

Which edges are included in the model?

Too few edges miss dependencies; too many edges introduce spurious dependencies

We often prefer sparser structures, which tend to generalize better

Page 4

Learning the structure of DGMs

Approaches:

Constraint-based

Find conditional independencies in the data using statistical tests

Find an I-map graph for the obtained conditional independencies

Score-based

View BN structure and parameter learning as a density-estimation task and address structure learning as a model-selection problem

A score function measures how well the model fits the observed data

Maximum likelihood

Bayesian score

Page 5

Score-based approach

Input:

Hypothesis space: the set of possible structures

Training data

Output: a network that maximizes the score function

Thus, we need a score function (including priors, if needed) and a search strategy (i.e., an optimization procedure) to find a structure in the hypothesis space

Page 6

Structural search

Hypothesis space

General graphs: the number of graphs on $M$ nodes is $O(2^{M^2})$, i.e., super-exponential

Trees: the number of trees on $M$ nodes is $O(M!)$

Structural search is therefore a combinatorial optimization problem: many heuristics exist for solving it, but none guarantees attaining optimality

Page 7

Likelihood score

Find the $\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}$ that maximize the likelihood:

$$\max_{\mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) = \max_{\mathcal{G}} \left[ \max_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) \right] = \max_{\mathcal{G}} P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$$

Likelihood score: $\mathrm{Score}_L(\mathcal{G}; \mathcal{D}) = P(\mathcal{D} \mid \mathcal{G}, \hat{\boldsymbol{\theta}}_{\mathcal{G}})$, where $\hat{\boldsymbol{\theta}}_{\mathcal{G}}$ denotes the maximum-likelihood parameters for structure $\mathcal{G}$

Page 8

Likelihood score: Information-theoretic interpretation

Given the structure, the log-likelihood of the data is:

$$\ln P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) = \ln \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) = \sum_{n=1}^{N} \ln \prod_{i} P\left(x_i^{(n)} \mid \boldsymbol{pa}_{X_i}^{(n)}, \hat{\boldsymbol{\theta}}_{X_i \mid Pa_{X_i}}, \mathcal{G}\right)$$

$$= \sum_{i} \sum_{n=1}^{N} \ln P\left(x_i^{(n)} \mid \boldsymbol{pa}_{X_i}^{(n)}, \hat{\boldsymbol{\theta}}_{X_i \mid Pa_{X_i}}, \mathcal{G}\right) = N \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{u}_i \in Val(\boldsymbol{Pa}_{X_i})} \frac{Count(x_i, \boldsymbol{u}_i)}{N} \ln P\left(x_i \mid \boldsymbol{u}_i, \hat{\boldsymbol{\theta}}_{X_i \mid Pa_{X_i}}, \mathcal{G}\right)$$

Since the maximum-likelihood parameters equal the empirical conditional distributions $\hat{P}$:

$$\ln P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) = N \sum_{i} \sum_{x_i \in Val(X_i)} \sum_{\boldsymbol{u}_i \in Val(\boldsymbol{Pa}_{X_i})} \hat{P}(x_i, \boldsymbol{u}_i) \ln \left[ \frac{\hat{P}(x_i \mid \boldsymbol{u}_i)}{\hat{P}(x_i)} \hat{P}(x_i) \right] = N \sum_{i} \left[ I_{\hat{P}}\left(X_i ; \boldsymbol{Pa}_{X_i}\right) - H_{\hat{P}}(X_i) \right]$$

This score measures the strength of the dependencies between variables and their parents.
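To make the decomposition concrete, here is a minimal sketch (not from the slides) that evaluates the likelihood score $N \sum_i [I(X_i; \boldsymbol{Pa}_{X_i}) - H(X_i)]$ from empirical counts; the data layout and the `parents` dictionary are hypothetical encodings.

```python
import numpy as np
from collections import Counter

def likelihood_score(data, parents):
    """Likelihood score N * sum_i [ I(X_i; Pa_i) - H(X_i) ] from empirical
    counts. `data` is an (N, d) integer array of discrete samples and
    `parents` maps a variable index to a tuple of parent indices
    (both are assumed encodings, not notation from the slides)."""
    N, d = data.shape
    score = 0.0
    for i in range(d):
        pa = list(parents.get(i, ()))
        joint = Counter(map(tuple, data[:, [i] + pa]))   # counts of (x_i, u_i)
        x_marg = Counter(data[:, i].tolist())            # counts of x_i
        u_marg = Counter(map(tuple, data[:, pa]))        # counts of u_i
        # empirical mutual information I(X_i; Pa_i)
        mi = sum(c / N * np.log(c * N / (x_marg[k[0]] * u_marg[k[1:]]))
                 for k, c in joint.items())
        # empirical entropy H(X_i)
        h = -sum(c / N * np.log(c / N) for c in x_marg.values())
        score += N * (mi - h)
    return score
```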

Page 9

Limitations of likelihood score

We have $I_{\hat{P}}(X; Y) \geq 0$, and $I_{\hat{P}}(X; Y) = 0$ iff $\hat{P}(X, Y) = \hat{P}(X)\hat{P}(Y)$

In the training data, almost always $I_{\hat{P}}(X; Y) > 0$: due to statistical noise, exact independence almost never occurs in the empirical distribution

Moreover, $I(X; Y \cup Z) \geq I(X; Y)$, with equality iff $X \perp Z \mid Y$, so adding parents can never decrease the score

Thus, the ML score never prefers the simpler network: the score is maximized by a fully connected network, and therefore the likelihood score fails to generalize well

Page 10

Avoiding overfitting

Restricting the hypothesis space explicitly: restrict the number of parents or the number of parameters

Imposing an explicit penalty on the complexity of the structure: Bayesian score and BIC

Page 11

Tree structure: Chow-Liu algorithm

Compute $w_{ij} = I_{\hat{P}}(X_i ; X_j)$ for each pair of nodes $X_i$ and $X_j$

Find a maximum-weight spanning tree (MWST), i.e., the tree with the greatest total weight:

$$\max_{T} \sum_{(i,j) \in T} w_{ij}$$

Give directions to the edges of the MWST: pick an arbitrary node as the root and draw arrows pointing away from it (e.g., using BFS)

In this way we can find an optimal tree in polynomial time
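A compact sketch of the full procedure, assuming networkx is available (the function name and data layout are my own, not from the slides):

```python
import numpy as np
import networkx as nx

def chow_liu(data):
    """Chow-Liu sketch: pairwise empirical mutual information as edge
    weights, a maximum-weight spanning tree, then edges oriented away from
    an arbitrary root by BFS. `data` is an (N, d) array of non-negative
    integer codes (an assumed encoding)."""
    N, d = data.shape
    g = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            vi, vj = data[:, i].max() + 1, data[:, j].max() + 1
            joint = np.histogram2d(data[:, i], data[:, j], bins=(vi, vj),
                                   range=[[0, vi], [0, vj]])[0] / N
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            with np.errstate(divide='ignore', invalid='ignore'):
                mi = np.nansum(joint * np.log(joint / np.outer(pi, pj)))
            g.add_edge(i, j, weight=mi)   # w_ij = empirical I(X_i; X_j)
    tree = nx.maximum_spanning_tree(g)    # MWST over the w_ij weights
    return nx.bfs_tree(tree, source=0)    # arbitrary root; arrows point away
```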

Page 12

BIC score

$$\mathrm{Score}_{BIC}(\mathcal{G} ; \mathcal{D}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2} \mathrm{Dim}(\mathcal{G}) = N \sum_{i} I_{\hat{P}}\left(X_i ; \boldsymbol{Pa}_{X_i}\right) - N \sum_{i} H_{\hat{P}}(X_i) - \frac{\log N}{2} \mathrm{Dim}(\mathcal{G})$$

Here $\mathrm{Dim}(\mathcal{G})$ is the model complexity (i.e., the number of independent parameters), and the entropy term can be ignored since it does not depend on the structure

The negation of this score is also known as Minimum Description Length (MDL)

The larger $N$ is, the more emphasis is given to the fit to the data: the mutual-information term grows linearly with $N$, while the complexity penalty grows only logarithmically with $N$
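As a sketch, the BIC penalty is straightforward to compute once the maximized log-likelihood and the parameter count are known; `dim_tabular` assumes tabular CPDs and is my own helper, not notation from the slides.

```python
import numpy as np

def bic_score(loglik, dim_g, N):
    """Score_BIC = maximized log-likelihood - (log N / 2) * Dim(G)."""
    return loglik - 0.5 * np.log(N) * dim_g

def dim_tabular(cards, parents):
    """Dim(G) for tabular CPDs: sum_i (|Val(X_i)| - 1) * |Val(Pa_i)|.
    `cards[i]` is the cardinality of X_i (an assumed encoding)."""
    return sum((cards[i] - 1) *
               int(np.prod([cards[j] for j in parents.get(i, ())]))
               for i in range(len(cards)))
```

Combined with the earlier `likelihood_score` sketch, `bic_score(likelihood_score(data, parents), dim_tabular(cards, parents), N)` would evaluate a candidate structure.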

Page 13

Asymptotic consistency of BIC score

Definition: a scoring function is consistent if, as $N \to \infty$ and with probability $\to 1$:

$\mathcal{G}^*$ (or any I-equivalent structure) maximizes the score

All non-I-equivalent structures have strictly lower score

Theorem: as $N \to \infty$, the structure that maximizes the BIC score is the true structure $\mathcal{G}^*$ (or any structure I-equivalent to it)

Asymptotically, spurious dependencies will not contribute to the likelihood and will be penalized; required dependencies will be added due to the linear growth of the likelihood term compared to the logarithmic growth of the model-complexity penalty

Page 14

Bayesian score

Uncertainty over both parameters and structure: we average over all possible parameter values and also consider a structure prior

$$P(\mathcal{G} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{G}) \, P(\mathcal{G})}{P(\mathcal{D})}$$

where $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood, $P(\mathcal{G})$ is the structure prior, and $P(\mathcal{D})$ is independent of $\mathcal{G}$

Bayesian score:

$$\mathrm{Score}_B(\mathcal{G} ; \mathcal{D}) = \log P(\mathcal{D} \mid \mathcal{G}) + \log P(\mathcal{G})$$

We integrate over the whole distribution of the parameters $\boldsymbol{\theta}_{\mathcal{G}}$ for structure $\mathcal{G}$ to find the marginal likelihood:

$$P(\mathcal{D} \mid \mathcal{G}) = \int_{\boldsymbol{\theta}_{\mathcal{G}}} P(\mathcal{D} \mid \mathcal{G}, \boldsymbol{\theta}_{\mathcal{G}}) \, P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G}) \, d\boldsymbol{\theta}_{\mathcal{G}}$$

Page 15

Marginal likelihood

$$P(\mathcal{D} \mid \mathcal{G}) = P\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)} \mid \mathcal{G}\right) = P\left(\boldsymbol{x}^{(1)} \mid \mathcal{G}\right) \times P\left(\boldsymbol{x}^{(2)} \mid \boldsymbol{x}^{(1)}, \mathcal{G}\right) \times \cdots \times P\left(\boldsymbol{x}^{(N)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N-1)}, \mathcal{G}\right)$$

For a Dirichlet prior and a single variable:

$$P\left(x^{(1)}, \ldots, x^{(N)}\right) = \frac{\prod_{k=1}^{K} \alpha_k (\alpha_k + 1) \cdots (\alpha_k + M[k] - 1)}{\alpha (\alpha + 1) \cdots (\alpha + N - 1)} = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \times \prod_{k=1}^{K} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}$$

where $M[k] = \sum_{n=1}^{N} I(x^{(n)} = k)$, $\alpha = \sum_{k=1}^{K} \alpha_k$, and we used $\alpha (\alpha + 1) \cdots (\alpha + N - 1) = \frac{\Gamma(\alpha + N)}{\Gamma(\alpha)}$

Page 16

Marginal likelihood in Bayesian networks: global & local parameter independence

Let $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfy the global parameter independence assumption:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \int_{\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}} \prod_{n=1}^{N} P\left(x_i^{(n)} \mid \boldsymbol{pa}_{X_i}^{\mathcal{G},(n)}, \boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}, \mathcal{G}\right) P\left(\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}} \mid \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid Pa_{X_i}^{\mathcal{G}}}$$

Let $P(\boldsymbol{\theta}_{\mathcal{G}} \mid \mathcal{G})$ satisfy both the global and local parameter independence assumptions:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \int_{\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}} \prod_{n:\; \boldsymbol{pa}_{X_i}^{\mathcal{G},(n)} = \boldsymbol{u}_i} P\left(x_i^{(n)} \mid \boldsymbol{u}_i, \boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}, \mathcal{G}\right) P\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \mid \mathcal{G}\right) d\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i}$$

If the Dirichlet priors $P(\boldsymbol{\theta}_{X_i \mid \boldsymbol{u}_i} \mid \mathcal{G})$ have hyperparameters $\left\{\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} \mid x_i \in Val(X_i)\right\}$:

$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{i} \prod_{\boldsymbol{u}_i \in Val(Pa_{X_i}^{\mathcal{G}})} \frac{\Gamma\left(\sum_{x_i \in Val(X_i)} \alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\right)}{\Gamma\left(\sum_{x_i \in Val(X_i)} \left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\right)\right)} \prod_{x_i \in Val(X_i)} \frac{\Gamma\left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}} + M_i[x_i, \boldsymbol{u}_i]\right)}{\Gamma\left(\alpha_{x_i \mid \boldsymbol{u}_i}^{\mathcal{G}}\right)}$$
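This family-wise closed form is what score-based implementations actually compute; a sketch for a single family (the two-dimensional count layout is my own convention):

```python
import numpy as np
from scipy.special import gammaln

def log_bde_family(counts, alphas):
    """log marginal likelihood of one family X_i | Pa_i under Dirichlet
    priors. counts[u, x] = M_i[x, u] and alphas[u, x] = alpha^G_{x|u};
    rows index parent configurations u, columns index values x (an assumed
    layout). log P(D|G) is the sum of this quantity over all families i."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return np.sum(gammaln(alphas.sum(axis=1))
                  - gammaln(alphas.sum(axis=1) + counts.sum(axis=1))
                  + np.sum(gammaln(alphas + counts) - gammaln(alphas), axis=1))
```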

Page 17

Asymptotic behavior of Bayesian score

As $N \to \infty$, a network $\mathcal{G}$ with Dirichlet priors satisfies:

$$\log P(\mathcal{D} \mid \mathcal{G}) = \log P(\mathcal{D} \mid \hat{\boldsymbol{\theta}}_{\mathcal{G}}, \mathcal{G}) - \frac{\log N}{2} \mathrm{Dim}(\mathcal{G}) + O(1)$$

where the $O(1)$ term does not grow with $N$

Thus, the Bayesian score is asymptotically equivalent to BIC and asymptotically consistent

Page 18

Priors

Structure prior $P(\mathcal{G})$:

Uniform prior: $P(\mathcal{G}) \propto$ constant

Prior penalizing the number of edges: $P(\mathcal{G}) \propto c^{|E(\mathcal{G})|}$ $(0 < c < 1)$

Prior penalizing the number of parameters

The normalizing constant can be ignored

Parameter priors: $P(\boldsymbol{\theta} \mid \mathcal{G})$ is usually instantiated as the BDe prior:

$$\alpha\left(X_i, \boldsymbol{Pa}_{X_i}^{\mathcal{G}}\right) = \alpha \, P'\left(X_i, \boldsymbol{Pa}_{X_i}^{\mathcal{G}}\right)$$

where $\alpha$ is the equivalent sample size and $P'$ is represented by a prior network $B_0$ (the parent sets $\boldsymbol{Pa}_{X_i}^{\mathcal{G}}$ need not be the same as the parents of $X_i$ in $B_0$)

BDe prior: a single prior network $B_0$ provides priors for all candidate networks; it is the unique prior with the property that I-equivalent networks have the same Bayesian score

Page 19

Score Equivalence

A score function satisfies score equivalence when all I-equivalent networks have the same score

The likelihood score and BIC satisfy score equivalence

Page 20

General graph structure

Theorem: the problem of learning a BN structure with at most $d$ parents $(d \geq 2)$ is NP-hard.

Search the space of graph structures using local search algorithms:

Algorithms such as hill climbing and simulated annealing

Exploit score decomposition during the search to alleviate the computational overhead

Page 21

Optimization by local searches

Graphs as states

The score function is used as the objective function

Neighbors of a state are found using these modifications, where only legal neighbors, i.e., those that result in DAGs, are considered (a hill-climbing sketch over these moves follows below):

Edge addition

Edge deletion

Edge reversal
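A minimal hill-climbing sketch over these three moves, assuming networkx and some score function `score` (both hypothetical here; a real implementation would exploit decomposability rather than rescoring whole graphs):

```python
import itertools
import networkx as nx

def hill_climb(d, score, max_iters=100):
    """Greedy local search over DAGs on d nodes. `score` takes an
    nx.DiGraph and returns a real number (an assumed interface)."""
    g = nx.DiGraph()
    g.add_nodes_from(range(d))
    for _ in range(max_iters):
        best, best_score = None, score(g)
        for u, v in itertools.permutations(range(d), 2):
            cand = g.copy()
            if g.has_edge(u, v):
                cand.remove_edge(u, v)            # edge deletion
            elif g.has_edge(v, u):
                cand.remove_edge(v, u)
                cand.add_edge(u, v)               # edge reversal
            else:
                cand.add_edge(u, v)               # edge addition
            if nx.is_directed_acyclic_graph(cand):  # legal neighbors only
                s = score(cand)
                if s > best_score:
                    best, best_score = cand, s
        if best is None:    # no improving legal neighbor: local optimum
            return g
        g = best
    return g
```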

Page 22

Decomposable score

Decomposable structure score:

$$\mathrm{Score}(\mathcal{G} ; \mathcal{D}) = \sum_{i} \mathrm{FamScore}\left(X_i \mid \boldsymbol{Pa}_{X_i}^{\mathcal{G}} ; \mathcal{D}\right)$$

$\mathrm{FamScore}(X \mid \boldsymbol{U} ; \mathcal{D})$ measures how well the set of variables $\boldsymbol{U}$ serves as the parents of $X$ in the data set $\mathcal{D}$

Decomposability is useful for reducing the computational overhead of evaluating different structures during the search: with a decomposable score, a local change in the structure does not change the score of the remaining parts of the structure (see the delta-score sketch below)

Page 23

Decomposability of scores

The likelihood score is decomposable

The Bayesian score is decomposable when:

the parameter priors satisfy global parameter independence and parameter modularity:

$$\boldsymbol{Pa}_{X_i}^{\mathcal{G}} = \boldsymbol{Pa}_{X_i}^{\mathcal{G}'} = \boldsymbol{U} \;\Rightarrow\; P\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}\right) = P\left(\boldsymbol{\theta}_{X_i \mid \boldsymbol{U}} \mid \mathcal{G}'\right)$$

and the structure prior satisfies structure modularity:

$$P(\mathcal{G}) \propto \prod_{i} P\left(Pa_{X_i} = \boldsymbol{Pa}_{X_i}^{\mathcal{G}}\right)$$

Page 24

Summary

Score functions:

Likelihood score

BIC score

Bayesian score

The optimal structure for trees can be found using a greedy algorithm

When the hypothesis space includes general graphs, the model-selection problem is NP-hard, and we use local search strategies to optimize the structure

Page 25

References

D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", MIT Press, 2009, Sections 18.1, 18.3, 18.4.