
On Sparse, Spectral and Other Parameterizations of Binary Probabilistic Models

David Buchman¹, Mark Schmidt², Shakir Mohamed¹, David Poole¹, Nando de Freitas¹

¹ Department of Computer Science, University of British Columbia, Vancouver, B.C., Canada
² INRIA – SIERRA Team, Laboratoire d'Informatique de l'École Normale Supérieure, Paris, France

Artificial Intelligence and Statistics (AISTATS) 2012

Introduction

Consider a general log-linear (Markov) model with binary variables S = (x_1, ..., x_n):

    p(x) = (1/Z) ∏_{A⊆S} exp( φ_A(x_A) )

• May contain potentials over arbitrary sets of variables

• 2^n potentials; typically most are not modeled (set to zero)

Parameterizations for each potential φ_A(x_A):

• “Full” / “Ising” / “Generalized Ising” / “Canonical” / “Spectral” / ...
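For intuition, here is a minimal sketch (not from the poster) of the model above for small n: Z is computed by brute-force enumeration of all 2^n configurations, with each potential φ_A supplied as an arbitrary function of x_A. The {−1, +1} variable encoding and the example potentials are illustrative assumptions.

```python
import itertools
import numpy as np

def log_potentials(x, phis):
    """Sum of potentials phi_A(x_A) for one configuration x.
    `phis` maps a tuple of variable indices A to a function of x_A."""
    return sum(phi(tuple(x[i] for i in A)) for A, phi in phis.items())

def exact_distribution(n, phis):
    """Enumerate all 2^n configurations (tractable for small n, e.g. n <= 16)
    and return the normalized probabilities p(x) = exp(sum_A phi_A(x_A)) / Z."""
    states = list(itertools.product([-1, +1], repeat=n))      # binary values in {-1,+1}
    scores = np.array([log_potentials(x, phis) for x in states])
    log_Z = np.logaddexp.reduce(scores)                       # log partition function
    return states, np.exp(scores - log_Z)

# Example: n = 3, one unary and one pairwise potential (spectral-style terms).
phis = {
    (0,):   lambda xa: 0.5 * xa[0],
    (0, 1): lambda xa: 1.2 * xa[0] * xa[1],
}
states, probs = exact_distribution(3, phis)
assert abs(probs.sum() - 1.0) < 1e-12
```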

Parameterizations – Properties

Parameterization        Complete      Minimal   Symmetric        Symmetric           Uniquely
                                                w.r.t. Values    w.r.t. Variables    Defined
Full                    Yes           No        Yes              Yes                 Yes
Ising                   No            No        No               Yes                 Yes
Generalized Ising       When binary   No        No               Yes                 Yes
Canonical               Yes           Yes       No               No                  No
Canonical (C1/C2)       Yes           Yes       No               Yes                 Yes
Spectral (for binary)   Yes           Yes       Yes              Yes                 Yes

Parameterizations

(The poster's table also illustrates the 1-, 2- and 3-variable potential bases of each parameterization as 0/±1 matrices; only the pairwise potential forms and parameter counts are reproduced here.)

For binary variables:

  Full:               φ_ij(x_i, x_j) = ∑_{s1,s2} w_{ij,s1,s2} I[x_i = s1, x_j = s2];   2^k params per k-way potential;   3^n params in total
  Ising:              φ_ij(x_i, x_j) = w_ij I[x_i = x_j]   (1-variable potentials get special treatment);   1 param per potential;   2^n in total
  Generalized Ising:  φ_ij(x_i, x_j) = ∑_s w_{ij,s} I[x_i = x_j = s];   2 params per potential;   2^n · 2 in total
  Canonical (C1):     φ_ij(x_i, x_j) = w_ij I[x_i ≠ x_i^ref, x_j ≠ x_j^ref];   1 param per potential;   2^n in total
  Spectral:           φ_ij(x_i, x_j) = w_ij x_i x_j;   1 param per potential;   2^n in total

Extension to discrete variables with c values (#params per k-way potential; total #params):

  Full:               c^k;   (c + 1)^n
  Ising:              1;   2^n   (special treatment for 1-variable potentials)
  Generalized Ising:  c;   2^n · c
  Canonical (C1):     (c − 1)^k;   c^n

Parameterizing for Learning

• All complete parameterizations share the same ML estimate.

• (parameterization, regularizer) = prior

• MAP: ŵ = arg max_w ( ∑_i log p(x_i | w) − reg(w) )   (a sketch of this objective follows below)

• New priors: (Spectral, *)

• What makes a good prior?

  ▸ Prediction accuracy
  ▸ Sparsity
  ▸ Computation
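As one concrete (hypothetical) instance of the MAP objective above, the sketch below evaluates the exact penalized negative log-likelihood for the spectral parameterization with a flat ℓ1 regularizer, computing Z by enumerating all 2^n states. The subset list and the {−1, +1} data encoding are assumptions for illustration; minimizing the objective requires a solver that handles the nonsmooth ℓ1 term (e.g., proximal gradient).

```python
import itertools
import numpy as np

def spectral_features(x, subsets):
    """Spectral sufficient statistics for one configuration: prod_{i in A} x_i, with x_i in {-1,+1}."""
    return np.array([np.prod([x[i] for i in A]) for A in subsets])

def penalized_nll(w, data, subsets, lam):
    """Exact negative log-likelihood plus a flat L1 penalty.
    Z is computed by enumerating all 2^n states, so this only makes sense for small n (say n <= 16)."""
    n = data.shape[1]
    states = np.array(list(itertools.product([-1, +1], repeat=n)))
    Phi_all = np.array([spectral_features(x, subsets) for x in states])   # (2^n, #subsets)
    log_Z = np.logaddexp.reduce(Phi_all @ w)
    Phi_data = np.array([spectral_features(x, subsets) for x in data])    # (m, #subsets)
    nll = -(Phi_data @ w - log_Z).sum()
    return nll + lam * np.abs(w).sum()

# Illustrative use: all factors up to pairwise on n = 4 variables, random +/-1 data.
n = 4
subsets = [A for k in (1, 2) for A in itertools.combinations(range(n), k)]
rng = np.random.default_rng(0)
data = rng.choice([-1, +1], size=(50, n))
w0 = np.zeros(len(subsets))
print(penalized_nll(w0, data, subsets, lam=0.1))
```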

Comparing Parameterizations & Regularizers

[Figure: test-set negative log-likelihood (top row) and number of non-zero parameters (bottom row, log scale) for each method on three data sets: (a) NLTCS, (b) Yeast, (c) USPS. Methods compared: I, C1, C2, CR, S, GI, F, GI-group, F-group, I-h, C1-h, C2-h, CR-h, S-h, GI-h, F-h, flat L1, flat L2, PC.]

Regularizers:

  ℓ1:                      reg(w) = ∑_{A⊆S} λ_A ‖w_A‖_1 = ∑_{A⊆S} λ_A ∑_j |w_A^(j)|
  Group ℓ1:                reg(w) = ∑_{A⊆S} λ_A ‖w_A‖_2 = ∑_{A⊆S} λ_A ( ∑_j (w_A^(j))^2 )^{1/2}
  Hierarchical Group ℓ1:   reg(w) = ∑_{A⊆S} λ_A ( ∑_{B⊇A} ‖w_B‖_2^2 )^{1/2} = ∑_{A⊆S} λ_A ( ∑_{B⊇A} ∑_j (w_B^(j))^2 )^{1/2}
  Flat ℓ1:                 reg(w) = λ ‖w‖_1 = λ ∑_j |w^(j)|
  Flat ℓ2:                 reg(w) = λ ‖w‖_2^2 = λ ∑_j (w^(j))^2

Method names (parameterization + regularizer):

  Ising:              I (ℓ1),  I-h (hierarchical group ℓ1)
  Canonical – C1:     C1,  C1-h
  Canonical – C2:     C2,  C2-h
  Canonical – CR:     CR,  CR-h
  Spectral:           S,  S-h,  flat L1,  flat L2
  Generalized Ising:  GI,  GI-group (group ℓ1),  GI-h
  Full:               F,  F-group,  F-h

CR = Canonical with a random reference state.
PC = Pseudo-Counts estimate.
λ_A = I[|A| ≥ 2] · 2^{|A|−2} · λ, with λ chosen using cross-validation.
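The regularizers above can be evaluated directly from a per-subset parameter dictionary. The sketch below is a plain-Python illustration (the function names and data layout are assumptions, not the authors' code); keys are tuples of variable indices A and values are the parameter vectors w_A.

```python
import numpy as np

def lambda_A(A, lam):
    """lambda_A = I[|A| >= 2] * 2^(|A|-2) * lambda, as used on the poster."""
    return (2.0 ** (len(A) - 2)) * lam if len(A) >= 2 else 0.0

def reg_l1(w, lam):
    """L1: sum_A lambda_A * ||w_A||_1."""
    return sum(lam[A] * np.abs(wA).sum() for A, wA in w.items())

def reg_group_l1(w, lam):
    """Group L1: sum_A lambda_A * ||w_A||_2 (can zero out a whole factor at once)."""
    return sum(lam[A] * np.linalg.norm(wA) for A, wA in w.items())

def reg_hier_group_l1(w, lam):
    """Hierarchical group L1: sum_A lambda_A * ( sum_{B superset of A} ||w_B||_2^2 )^(1/2)."""
    total = 0.0
    for A, lamA in lam.items():
        sq = sum(np.dot(wB, wB) for B, wB in w.items() if set(A) <= set(B))
        total += lamA * np.sqrt(sq)
    return total

# Illustrative parameters: one unary, one pairwise and one triple factor.
w = {(0,): np.array([0.3]), (0, 1): np.array([0.7]), (0, 1, 2): np.array([0.1])}
lam = {A: lambda_A(A, 0.5) for A in w}
print(reg_l1(w, lam), reg_group_l1(w, lam), reg_hier_group_l1(w, lam))
```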

n ≤ 16 =⇒ Partition function Z is tractable =⇒ Can compute the exact global optimum =⇒ Comparison results more reliable.

=⇒ Conclusions:

• Use a complete & minimal parameterization (spectral / canonical).
• Prefer the standard ℓ1.
• No natural reference state? =⇒ (Spectral, standard ℓ1) seems the natural choice.

Spectral Interpretation

Dual representation for continuous signals:

    Time Domain  ⇄  Frequency Domain    (via the Fourier Transform (FT) and inverse FT)

Dual representation for binary distributions:

    Log-Probability Domain  ⇄  “Interactions” Domain (“Spectrum”)    (via the Walsh-Hadamard Transform (WHT) and inverse WHT)

    (log p =) q = 2^{n/2} H_n w,        w = 2^{−n/2} H_n q

Hadamard matrix: H_n = 2^{−n/2} [ 1  1 ; 1  −1 ]^{⊗n}   (n-fold Kronecker power)
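A small sketch of this transform pair: H_n is built as a normalized Kronecker power, and because it is symmetric and orthonormal the forward and inverse maps use the same matrix up to scaling. The coefficient ordering (which index corresponds to which subset A) is a convention not spelled out on the poster.

```python
import numpy as np
from functools import reduce

def hadamard(n):
    """Orthonormal Hadamard matrix H_n = 2^{-n/2} [[1,1],[1,-1]] Kronecker-powered n times."""
    H1 = np.array([[1.0, 1.0], [1.0, -1.0]])
    return reduce(np.kron, [H1] * n) / 2 ** (n / 2)

n = 3
H = hadamard(n)

# Forward transform: from spectral parameters w to (unnormalized) log-probabilities q.
rng = np.random.default_rng(0)
w = rng.normal(size=2 ** n)          # one coefficient per subset A of S (ordering is a convention)
q = 2 ** (n / 2) * H @ w             # q = 2^{n/2} H_n w  (= log p up to the normalizing constant)

# Inverse transform: recover the interactions w from q.
w_back = 2 ** (-n / 2) * H @ q       # w = 2^{-n/2} H_n q
assert np.allclose(w, w_back)        # H_n is symmetric and orthonormal, so H_n^{-1} = H_n
```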

• “Spectral” seems a “natural” parameterization.
• “Canonical” may be “natural” when the problem domain has a natural reference state.

The Statistics (Spectrum) of Binary Data Sets

• The spectral parameterization defines a dual “interactions” representation for binary distributions:

    log p(x) ⇐⇒ {w_A}     (w = {w_A} = “interactions”)

• How can we measure the interactions in a binary data set?

  ▸ Learn an (approximate) distribution p′(x) ≈ p(x) from the data
  ▸ Represent p′(x) as {w′_A} ≈ {w_A}

• Learning p′(x):

  ▸ Avoid prior bias for smaller-cardinality factors: use a parameterization which treats all 2^n − 1 spectral parameters equally =⇒ Use (spectral, “flat” priors) / PC / ...

  ▸ High modeling accuracy (p′(x) ≈ p(x)) =⇒ Narrow down to: (spectral, flat ℓ1)
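A sketch of this recipe using the pseudo-counts (PC) option: estimate p′(x) by add-α smoothing of the empirical counts (the smoothing constant α is an assumption), take the WHT of log p′ to obtain the interactions w′_A, and average |w′_A| by order |A|. The state/coefficient ordering convention is assumed; grouping by |A| only uses the popcount of the coefficient index, which is convention-independent.

```python
import itertools
import numpy as np
from functools import reduce

def spectrum_by_order(data, alpha=1.0):
    """Pseudo-counts estimate p'(x) from binary data in {-1,+1}^n, then the WHT
    of log p' gives interactions w'_A; report mean |w'_A| grouped by |A|."""
    m, n = data.shape
    H1 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = reduce(np.kron, [H1] * n) / 2 ** (n / 2)       # orthonormal Hadamard H_n

    # Empirical counts over all 2^n states (state index = binary encoding of x).
    states = list(itertools.product([-1, +1], repeat=n))
    index = {x: i for i, x in enumerate(states)}
    counts = np.full(2 ** n, alpha)                    # add-alpha smoothing ("pseudo-counts")
    for row in data:
        counts[index[tuple(row)]] += 1
    q = np.log(counts / counts.sum())                  # log p'(x)

    w = 2 ** (-n / 2) * H @ q                          # interactions w' = 2^{-n/2} H_n q
    order = np.array([bin(b).count("1") for b in range(2 ** n)])   # |A| for coefficient b
    return {k: np.abs(w[order == k]).mean() for k in range(1, n + 1)}

# Tiny example with random data; on real data sets the higher-order means shrink rapidly.
rng = np.random.default_rng(1)
data = rng.choice([-1, +1], size=(200, 5))
print(spectrum_by_order(data))
```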

[Figure: mean interaction magnitude E(|w_A|) (log scale, panels a and c) and fraction of non-zero parameters (panels b and d) versus factor cardinality |A|. Panels (a), (b): Yeast, NLTCS, USPS; panels (c), (d): Czech, Rochdale, Jokes, Flow2. Solid lines = flat ℓ1, dashed lines = flat ℓ2, dotted lines = PC.]

• Results:
  ▸ Higher-order interactions {w′_A} decrease exponentially with #vars = |A|
  ▸ Confirms intuition & practical experience (adding higher-order potentials to models gives diminishing returns)
  ▸ Rationale for prior bias for lower-order factors

Select References

• Yvonne M. Bishop, Stephen E. Fienberg, and Paul W. Holland. Discrete multivariate analysis: Theory and practice. MIT Press, 1975.
• Daphne Koller and Nir Friedman. Probabilistic graphical models: Principles and techniques. MIT Press, 2009.
• Mark Schmidt and Kevin P. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. Artificial Intelligence and Statistics, 2010.
• Masashi Kato, Qian Ji Gao, Hiroshi Chigira, Hiroyuki Shindo, and Masato Inoue. A haplotype inference method based on sparsely connected multi-body Ising model. Journal of Physics: Conference Series, 233(1), 2010.