

Direct Learning of Sparse Changes in Markov Networks by Density Ratio Estimation

SONG LIU 1, JOHN QUINN 2, MICHAEL GUTMANN 3, AND MASASHI SUGIYAMA 1

"Tempora mutantur, nos et mutamur in illis. ""Times change, and we change with time. “

Latin Phrase.

SL is supported by the JST PRESTO program and the JSPS fellowship, JQ is supported by the JST PRESTO program, and MS is supported by the JST CREST program. MUG is supported by the Finnish Centre-of-Excellence in Computational Inference Research COIN (251170).


1. Tokyo Institute of Technology, Japan

2. Makerere University, Uganda

3. University of Helsinki and HIIT, Finland.

Introduction, Changes in Interactions

Interactions are common, and they change over time and across experimental conditions.

Genes regulate each other differently when stimuli change.

Co-occurrences between words appear or disappear when the domain of a text corpus shifts.

Correlations among pixels may change when a surveillance camera captures an anomaly.

Revealing such changes is interesting, but challenging.

[Figure: Heatmap of Gene Expression (Wikipedia)]


Introduction, Markov Networks (MN)

In statistical Machine Learning, interactions are measured by Conditional Independence among Random Variables (R.V.s).

Markov Networks (MN) are undirected graphical models widely used for capturing interactions between R.V.s.

The joint distribution of a MN factorizes over its cliques.

A pairwise MN considers only the factorization over the smallest cliques: edges and nodes.

In this case, $g_{i,j}$ characterizes the interaction on edge $E_{i,j}$.
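In formulas (a standard statement of these two facts; $\psi_c$ denotes a clique potential and $Z$ the normalizing constant):

$$p(\boldsymbol{x}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\boldsymbol{x}_c), \qquad p(\boldsymbol{x}) \propto \exp\Bigl(\sum_{i \ge j} g_{i,j}(x_i, x_j)\Bigr) \;\; \text{in the pairwise case}.$$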


[Figure: a pairwise Markov network over five numbered nodes]

Outline

1. Introduction

2. Problem Formulation

3. Related Works

4. Proposed Approach

5. Experiments

6. Conclusion


Problem Formulation, (pairwise) Log-linear Model

Two groups of samples are observed, drawn from distributions $p$ and $q$; both $p(\boldsymbol{x})$ and $q(\boldsymbol{x})$ are MNs with univariate and pairwise factors.

Our model is linear in the parameter $\boldsymbol{\theta}$.

$f$ can be any function basis, e.g. a Gaussian basis or a polynomial basis.

$q(\boldsymbol{x})$ is modelled in the same way.

$g_{i,j}: \mathbb{R}^2 \to \mathbb{R}$
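Concretely, the pairwise log-linear model can be written as follows (a reconstruction following the authors' published formulation; $\boldsymbol{f}$ is the chosen basis and $Z$ the normalizer):

$$p(\boldsymbol{x}; \boldsymbol{\theta}^{(p)}) = \frac{1}{Z(\boldsymbol{\theta}^{(p)})} \exp\Bigl(\sum_{u \ge v} {\boldsymbol{\theta}^{(p)}_{u,v}}{}^{\top} \boldsymbol{f}(x_u, x_v)\Bigr), \qquad Z(\boldsymbol{\theta}^{(p)}) = \int \exp\Bigl(\sum_{u \ge v} {\boldsymbol{\theta}^{(p)}_{u,v}}{}^{\top} \boldsymbol{f}(x_u, x_v)\Bigr)\,\mathrm{d}\boldsymbol{x},$$

with $q(\boldsymbol{x}; \boldsymbol{\theta}^{(q)})$ taking the same form.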


Outline

1. Introduction

2. Problem Formulation

3. Related Works
   1. Maximum Likelihood Estimation
   2. Fused-lasso Approach
   3. Non-paranormal Methods

4. Proposed Approach

5. Experiments

6. Conclusion


Related Works, MLE on Gaussian Markov Network (GMN)

Obtain sparse parameterizations of 𝑝(𝒙) and 𝑞(𝒙).

Off-the-shelf software can be used.

What if p and q are not sparse? The change can still be sparse!

Easy to compute? A bigger issue: the normalization problem.

Meinshausen et al., 2006; Schmidt & Murphy, 2010; Ravikumar et al., 2010

[Diagram: a sparse network P, a sparse network Q, and the sparse change between them]


Related Works, MLE on Gaussian Markov Network (GMN)

Nonetheless, let's consider the Gaussian model.

The MLE boils down to the graphical lasso, where $S$ is the sample covariance matrix and $\Theta$ is the inverse covariance matrix (Friedman et al., 2007; Banerjee et al., 2008).

A Gaussian copula may extend this approach to nonparanormal distributions.
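The objective in question is the standard graphical-lasso problem (a textbook statement, not copied from the slide):

$$\hat{\Theta} = \arg\max_{\Theta \succ 0} \; \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \lVert\Theta\rVert_1.$$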


Related Works, Using Fused-lasso on GMN

We can impose sparsity directly on the change, using the fused lasso (Tibshirani et al., 2005). Consider the following objective:

A very similar approach for the Gaussian case has been proposed, in which the likelihood term is the Gaussian node-wise conditional density (Zhang and Wang, 2010).

Can we penalize the change jointly?
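A sketch of such a fused-lasso objective, written in the notation above ($\ell_p$ and $\ell_q$ denote the log-likelihoods on the two sample sets; this is my reconstruction, not the slide's exact formula):

$$\max_{\boldsymbol{\theta}^{(p)}, \boldsymbol{\theta}^{(q)}} \; \ell_p(\boldsymbol{\theta}^{(p)}) + \ell_q(\boldsymbol{\theta}^{(q)}) - \lambda_1\bigl(\lVert\boldsymbol{\theta}^{(p)}\rVert_1 + \lVert\boldsymbol{\theta}^{(q)}\rVert_1\bigr) - \lambda_2 \lVert\boldsymbol{\theta}^{(p)} - \boldsymbol{\theta}^{(q)}\rVert_1.$$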


Outline

1. Introduction

2. Problem Formulation

3. Related Works

4. Proposed Approach
   1. Vapnik says…
   2. Modelling Changes Directly
   3. Density Ratio Estimation
   4. Sparsity-inducing Norm

5. Experiments

6. Conclusion


Proposed Approach, Vapnik says…

Vapnik says:

“If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step.”

Statistical Learning Theory, 1998.

The separate-MLE method is a more difficult “intermediate step” because:
• It is redundant!
• It suffers from the normalization issue!
• The sparsity of changes cannot be directly controlled.


Proposed Approach, Modeling Changes Directly

Our model is linear in $g(\boldsymbol{x})$, so the difference between the two MNs can easily be tested through their parameters.

The ratio of the two MNs naturally incorporates the parameter difference!

So, model the ratio directly!


Proposed Approach, Modeling Changes Directly

We model the density ratio instead of the density functions. The normalization term ensures that the ratio model integrates to one with respect to $q$; we approximate it by a sample average. This also works when the density function itself has no closed-form normalization term.
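Reconstructed in the notation above (following the authors' published formulation; $\boldsymbol{\theta}_{u,v} = \boldsymbol{\theta}^{(p)}_{u,v} - \boldsymbol{\theta}^{(q)}_{u,v}$ is the change to be estimated):

$$r(\boldsymbol{x}; \boldsymbol{\theta}) = \frac{1}{N(\boldsymbol{\theta})} \exp\Bigl(\sum_{u \ge v} \boldsymbol{\theta}_{u,v}^{\top} \boldsymbol{f}(x_u, x_v)\Bigr), \qquad N(\boldsymbol{\theta}) = \int q(\boldsymbol{x}) \exp\Bigl(\sum_{u \ge v} \boldsymbol{\theta}_{u,v}^{\top} \boldsymbol{f}(x_u, x_v)\Bigr)\mathrm{d}\boldsymbol{x} \approx \frac{1}{n_q} \sum_{i=1}^{n_q} \exp\Bigl(\sum_{u \ge v} \boldsymbol{\theta}_{u,v}^{\top} \boldsymbol{f}\bigl(x^{(i)}_{q,u}, x^{(i)}_{q,v}\bigr)\Bigr).$$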


Proposed Approach, Estimating the Density Ratio

Kullback-Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008; Tsuboi et al., 2009).

The normalization constraint is automatically satisfied by our self-normalizing log-linear model!
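In the notation above, the KLIEP objective maximizes the average log-ratio on the samples from $p$ (a reconstruction consistent with the self-normalizing model):

$$\max_{\boldsymbol{\theta}} \; \frac{1}{n_p} \sum_{i=1}^{n_p} \log r\bigl(\boldsymbol{x}_p^{(i)}; \boldsymbol{\theta}\bigr) \quad \text{s.t.} \quad \frac{1}{n_q} \sum_{i=1}^{n_q} r\bigl(\boldsymbol{x}_q^{(i)}; \boldsymbol{\theta}\bigr) = 1,$$

where the constraint holds by construction thanks to the sample-average normalizer $N(\boldsymbol{\theta})$.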


Proposed Approach, Sparsity-inducing Norm

Imposing sparsity constraints on the change of each factor $\boldsymbol{\theta}_t$ is equivalent to imposing sparsity constraints on the group norms $\lVert\boldsymbol{\theta}_t\rVert_2$.

We don't assume that p or q is sparse, only that the change is. p and q can be dense, or even almost fully connected!

So finally, we can obtain a $\hat{\boldsymbol{\theta}}$ with group sparsity, by combining:
• L2 regularizers (one per factor)
• the group-lasso regularizer
• an Elastic-Net-style penalty

A sketch of the resulting optimization follows.
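Putting the pieces together, the regularized KLIEP objective takes a form like (my reconstruction, using the $\lambda_1$ and $\lambda_2$ that reappear in the experiments):

$$\max_{\boldsymbol{\theta}} \; \frac{1}{n_p} \sum_{i=1}^{n_p} \log r\bigl(\boldsymbol{x}_p^{(i)}; \boldsymbol{\theta}\bigr) - \lambda_1 \lVert\boldsymbol{\theta}\rVert_2^2 - \lambda_2 \sum_{t} \lVert\boldsymbol{\theta}_t\rVert_2.$$

A minimal proximal-gradient sketch in Python, assuming one interaction feature $x_u x_v$ per factor (so the group-lasso proximal step reduces to coordinate-wise soft-thresholding); `pairwise_features` and `kliep_group_lasso` are my illustrative names, not the authors' code:

```python
import numpy as np

def pairwise_features(X):
    """Map each sample to the vector (x_u * x_v) for all pairs u >= v.
    One feature per factor; the talk's basis f can be richer."""
    iu, iv = np.tril_indices(X.shape[1])
    return X[:, iu] * X[:, iv]                    # shape (n, d*(d+1)/2)

def kliep_group_lasso(Xp, Xq, lam1=0.0, lam2=0.1, lr=0.05, n_iter=2000):
    """Estimate the sparse change theta = theta_p - theta_q by regularized KLIEP."""
    Fp, Fq = pairwise_features(Xp), pairwise_features(Xq)
    theta = np.zeros(Fp.shape[1])
    for _ in range(n_iter):
        a = Fq @ theta
        w = np.exp(a - a.max())
        w /= w.sum()                              # softmax weights = gradient of log N(theta)
        grad = -Fp.mean(axis=0) + Fq.T @ w + 2.0 * lam1 * theta
        theta -= lr * grad                        # gradient step on the smooth part
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam2, 0.0)  # prox step
    return theta
```

The design point is that only the single change vector $\boldsymbol{\theta}$ is optimized, with no partition function for p or q: the normalizer is just the sample average over Xq inside the softmax.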


Outline

1. Introduction

2. Problem Formulation

3. Related Works

4. Proposed Approach

5. Experiments
   1. Numerical Experiments
   2. Real-world Application

6. Conclusion


Experiments, Numerical Experiments

Randomly generated Gaussian and non-Gaussian distributions.

Comparison Methods: KLIEP, Graphical Lasso (Glasso), Fused-lasso (Flasso)

$f_{\mathrm{gaussian}}(x, y) = (x^2, y^2, xy)$, $\quad f_{\mathrm{poly}}(x, y) = (x^k, y^k, x^{k-1}y, \ldots, x, y, 1)$
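As Python functions (a direct transcription of the two bases; the monomial ordering inside `f_poly` is immaterial):

```python
import numpy as np

def f_gaussian(x, y):
    # Gaussian-model basis: (x^2, y^2, x*y)
    return np.array([x**2, y**2, x * y])

def f_poly(x, y, k):
    # all monomials x^a * y^b with a + b <= k, i.e. (x^k, y^k, x^{k-1}y, ..., x, y, 1)
    return np.array([x**a * y**b for a in range(k + 1) for b in range(k + 1 - a)])
```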

Performance evaluation: a) regularization path, b) precision-recall curve.

[Figure panels: Gaussian, Diamond]


Experiments, Regularization Paths

One important performance measure is the regularization path:

• Plot the magnitude of each $\boldsymbol{\theta}_t$ as the parameter $\lambda_2$ increases.

• All $\boldsymbol{\theta}_t$ should drop to 0 as $\lambda_2$ grows.

• However, we would hope that the $\boldsymbol{\theta}_t$ corresponding to changed edges hit 0 last,

• because they are associated with the non-zero factors in the “true model” (see the sketch after this list).
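For concreteness, such a path can be traced with the earlier sketch (again, `kliep_group_lasso` is my illustrative helper, and `Xp`, `Xq` are the two sample sets):

```python
import numpy as np

lambdas = np.logspace(-2, 1, 30)
path = np.array([np.abs(kliep_group_lasso(Xp, Xq, lam2=lam)) for lam in lambdas])
# path[k, t] = |theta_t| at lambda_2 = lambdas[k]; factors tied to truly
# changed edges should be the last columns to shrink to zero.
```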


Experiments, Gaussian Distribution (n = 50/100, d = 40)

Dropping 15 edges from a randomly generated MN at 25% sparsity.


Experiments, Gaussian Distribution (Precision/Recall)

By varying 𝜆2, we can obtain a set of precision/recall rates. We plot the averaged P/R curve over 20 generated datasets.


Experiments, Diamond Distribution (n = 5000, d = 9)

Diamond Distribution:

Samples are drawn by slice sampling.
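Slice sampling is a generic MCMC scheme that needs only an unnormalized log-density, which suits distributions like this one with no closed-form normalizer. A minimal coordinate-wise sampler (my own sketch, not the authors' code; `log_p` is any unnormalized log-density):

```python
import numpy as np

def slice_sample(log_p, x0, n_samples, w=1.0, seed=0):
    """Coordinate-wise slice sampler (Neal, 2003) for an unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    def f(i, v):                                   # log_p with coordinate i set to v
        old = x[i]; x[i] = v
        out = log_p(x); x[i] = old
        return out
    samples = []
    for _ in range(n_samples):
        for i in range(len(x)):
            y = log_p(x) + np.log(rng.uniform())   # slice level under the density
            lo = x[i] - w * rng.uniform()
            hi = lo + w
            while f(i, lo) > y: lo -= w            # step the bracket out
            while f(i, hi) > y: hi += w
            while True:                            # shrink until a point is accepted
                v = rng.uniform(lo, hi)
                if f(i, v) > y:
                    x[i] = v
                    break
                if v < x[i]: lo = v
                else: hi = v
        samples.append(x.copy())
    return np.array(samples)
```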


Only the proposed method with the correct model has good performance.

Experiments, Real-world Applications

Gene Network
◦ Detecting changes from the original network to the modified network.

Twitter Messages
◦ Samples are the frequencies of 10 related keywords over time.
◦ Detecting the change in keyword co-occurrences before and after a certain event.

[Image source: Wikipedia]


Experiments, Gene Network

[Figure: the regulatory networks P (before) and Q (after)]

The gene regulatory network is modified manually. 50 samples are collected before (P) and after (Q) the change.

A polynomial kernel is used for $f_t$. $\lambda_1$ is chosen by hold-out cross-validation. Only KLIEP and Flasso (Gaussian) are compared.


Experiments, Gene Network (n = 50, d = 13)

The P/R curve is averaged over 20 simulations.


Experiments, Twitter Keywords

[Figure: timeline with a pre-event window Q (the 3 weeks up to 4.17) followed by successive post-event windows P]

We choose the Deepwater Horizon oil spill as the target event.

Samples of distribution P are drawn from different time periods and compared with the pre-event co-occurrences (Q).


Experiments, Twitter Keywords, From 4.17-6.5

[Figure panels: KLIEP, Flasso]

Green: structure shared by both graphs. Red: structure detected only by KLIEP.

$\lambda_1$ is chosen by cross-validated likelihood.


Experiments, Twitter Keywords, From 7.26-9.14

[Figure panels: KLIEP, Flasso]


Outline

1. Introduction

2. Problem Formulation

3. Related Works

4. Proposed Approach

5. Experiments

6. Conclusion


Conclusion

Learning the sparse change between two Markov Networks, directly, by density ratio estimation!

Two advantages compared to conventional methods:

Simplicity
◦ modelling the difference of parameters
◦ halving the number of parameters in the optimization

Wide applicability
◦ P and Q are not limited to discrete, Gaussian, or NPN distributions.
