
On Sparse, Spectral and Other Parameterizations of Binary Probabilistic Models

David Buchman¹, Mark Schmidt², Shakir Mohamed¹, David Poole¹, Nando de Freitas¹

¹ Department of Computer Science, University of British Columbia, Vancouver, B.C., Canada
² INRIA – SIERRA Team, Laboratoire d'Informatique de l'École Normale Supérieure, Paris, France

Artificial Intelligence and Statistics (AISTATS) 2012

Introduction

Consider a general log-linear (Markov) model with binary variables S = (x_1, ..., x_n):

    p(x) = (1/Z) ∏_{A⊆S} exp( φ_A(x_A) )

• May contain potentials over arbitrary sets of variables

• 2^n potentials; typically most are not modeled (set to zero)

Parameterizations for each potential φ_A(x_A):

• “Full” / “Ising” / “Generalized Ising” / “Canonical” / “Spectral” / ...
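For intuition, here is a minimal sketch (not from the poster) of the model above for small n: Z is computed by brute-force enumeration of all 2^n configurations, with each potential φ_A supplied as an arbitrary function of x_A. The {−1, +1} variable encoding and the example potentials are illustrative assumptions.

```python
import itertools
import numpy as np

def log_potentials(x, phis):
    """Sum of potentials phi_A(x_A) for one configuration x.
    `phis` maps a tuple of variable indices A to a function of x_A."""
    return sum(phi(tuple(x[i] for i in A)) for A, phi in phis.items())

def exact_distribution(n, phis):
    """Enumerate all 2^n configurations (tractable for small n, e.g. n <= 16)
    and return the normalized probabilities p(x) = exp(sum_A phi_A(x_A)) / Z."""
    states = list(itertools.product([-1, +1], repeat=n))      # binary values in {-1,+1}
    scores = np.array([log_potentials(x, phis) for x in states])
    log_Z = np.logaddexp.reduce(scores)                       # log partition function
    return states, np.exp(scores - log_Z)

# Example: n = 3, one unary and one pairwise potential (spectral-style terms).
phis = {
    (0,):   lambda xa: 0.5 * xa[0],
    (0, 1): lambda xa: 1.2 * xa[0] * xa[1],
}
states, probs = exact_distribution(3, phis)
assert abs(probs.sum() - 1.0) < 1e-12
```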

Parameterizations – Properties

Parameterization        Complete      Minimal   Symmetric        Symmetric           Uniquely
                                                w.r.t. Values    w.r.t. Variables    Defined
Full                    Yes           No        Yes              Yes                 Yes
Ising                   No            No        No               Yes                 Yes
Generalized Ising       When binary   No        No               Yes                 Yes
Canonical               Yes           Yes       No               No                  No
Canonical (C1/C2)       Yes           Yes       No               Yes                 Yes
Spectral (for binary)   Yes           Yes       Yes              Yes                 Yes

Parameterizations

(The poster's table also illustrates the 1-, 2- and 3-variable potential bases of each parameterization as 0/±1 matrices; only the pairwise potential forms and parameter counts are reproduced here.)

For binary variables:

  Full:               φ_ij(x_i, x_j) = ∑_{s1,s2} w_{ij,s1,s2} I[x_i = s1, x_j = s2];   2^k params per k-way potential;   3^n params in total
  Ising:              φ_ij(x_i, x_j) = w_ij I[x_i = x_j]   (1-variable potentials get special treatment);   1 param per potential;   2^n in total
  Generalized Ising:  φ_ij(x_i, x_j) = ∑_s w_{ij,s} I[x_i = x_j = s];   2 params per potential;   2^n · 2 in total
  Canonical (C1):     φ_ij(x_i, x_j) = w_ij I[x_i ≠ x_i^ref, x_j ≠ x_j^ref];   1 param per potential;   2^n in total
  Spectral:           φ_ij(x_i, x_j) = w_ij x_i x_j;   1 param per potential;   2^n in total

Extension to discrete variables with c values (#params per k-way potential; total #params):

  Full:               c^k;   (c + 1)^n
  Ising:              1;   2^n   (special treatment for 1-variable potentials)
  Generalized Ising:  c;   2^n · c
  Canonical (C1):     (c − 1)^k;   c^n

Parameterizing for Learning

• All complete parameterizations share the same ML estimate.

• (parameterization, regularizer) = prior

• MAP: ŵ = arg max_w ( ∑_i log p(x_i | w) − reg(w) )   (a sketch of this objective follows below)

• New priors: (Spectral, *)

• What makes a good prior?

  ▸ Prediction accuracy
  ▸ Sparsity
  ▸ Computation
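As one concrete (hypothetical) instance of the MAP objective above, the sketch below evaluates the exact penalized negative log-likelihood for the spectral parameterization with a flat ℓ1 regularizer, computing Z by enumerating all 2^n states. The subset list and the {−1, +1} data encoding are assumptions for illustration; minimizing the objective requires a solver that handles the nonsmooth ℓ1 term (e.g., proximal gradient).

```python
import itertools
import numpy as np

def spectral_features(x, subsets):
    """Spectral sufficient statistics for one configuration: prod_{i in A} x_i, with x_i in {-1,+1}."""
    return np.array([np.prod([x[i] for i in A]) for A in subsets])

def penalized_nll(w, data, subsets, lam):
    """Exact negative log-likelihood plus a flat L1 penalty.
    Z is computed by enumerating all 2^n states, so this only makes sense for small n (say n <= 16)."""
    n = data.shape[1]
    states = np.array(list(itertools.product([-1, +1], repeat=n)))
    Phi_all = np.array([spectral_features(x, subsets) for x in states])   # (2^n, #subsets)
    log_Z = np.logaddexp.reduce(Phi_all @ w)
    Phi_data = np.array([spectral_features(x, subsets) for x in data])    # (m, #subsets)
    nll = -(Phi_data @ w - log_Z).sum()
    return nll + lam * np.abs(w).sum()

# Illustrative use: all factors up to pairwise on n = 4 variables, random +/-1 data.
n = 4
subsets = [A for k in (1, 2) for A in itertools.combinations(range(n), k)]
rng = np.random.default_rng(0)
data = rng.choice([-1, +1], size=(50, n))
w0 = np.zeros(len(subsets))
print(penalized_nll(w0, data, subsets, lam=0.1))
```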

Comparing Parameterizations & Regularizers

[Figure: test-set negative log-likelihood (top row) and number of non-zero parameters (bottom row, log scale) for each method on three data sets: (a) NLTCS, (b) Yeast, (c) USPS. Methods compared: I, C1, C2, CR, S, GI, F, GI-group, F-group, I-h, C1-h, C2-h, CR-h, S-h, GI-h, F-h, flat L1, flat L2, PC.]

Regularizers:

  ℓ1:                      reg(w) = ∑_{A⊆S} λ_A ‖w_A‖_1 = ∑_{A⊆S} λ_A ∑_j |w_A^(j)|
  Group ℓ1:                reg(w) = ∑_{A⊆S} λ_A ‖w_A‖_2 = ∑_{A⊆S} λ_A ( ∑_j (w_A^(j))^2 )^{1/2}
  Hierarchical Group ℓ1:   reg(w) = ∑_{A⊆S} λ_A ( ∑_{B⊇A} ‖w_B‖_2^2 )^{1/2} = ∑_{A⊆S} λ_A ( ∑_{B⊇A} ∑_j (w_B^(j))^2 )^{1/2}
  Flat ℓ1:                 reg(w) = λ ‖w‖_1 = λ ∑_j |w^(j)|
  Flat ℓ2:                 reg(w) = λ ‖w‖_2^2 = λ ∑_j (w^(j))^2

Method names (parameterization + regularizer):

  Ising:              I (ℓ1),  I-h (hierarchical group ℓ1)
  Canonical – C1:     C1,  C1-h
  Canonical – C2:     C2,  C2-h
  Canonical – CR:     CR,  CR-h
  Spectral:           S,  S-h,  flat L1,  flat L2
  Generalized Ising:  GI,  GI-group (group ℓ1),  GI-h
  Full:               F,  F-group,  F-h

CR = Canonical with a random reference state.
PC = Pseudo-Counts estimate.
λ_A = I[|A| ≥ 2] · 2^{|A|−2} · λ, with λ chosen using cross-validation.
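The regularizers above can be evaluated directly from a per-subset parameter dictionary. The sketch below is a plain-Python illustration (the function names and data layout are assumptions, not the authors' code); keys are tuples of variable indices A and values are the parameter vectors w_A.

```python
import numpy as np

def lambda_A(A, lam):
    """lambda_A = I[|A| >= 2] * 2^(|A|-2) * lambda, as used on the poster."""
    return (2.0 ** (len(A) - 2)) * lam if len(A) >= 2 else 0.0

def reg_l1(w, lam):
    """L1: sum_A lambda_A * ||w_A||_1."""
    return sum(lam[A] * np.abs(wA).sum() for A, wA in w.items())

def reg_group_l1(w, lam):
    """Group L1: sum_A lambda_A * ||w_A||_2 (can zero out a whole factor at once)."""
    return sum(lam[A] * np.linalg.norm(wA) for A, wA in w.items())

def reg_hier_group_l1(w, lam):
    """Hierarchical group L1: sum_A lambda_A * ( sum_{B superset of A} ||w_B||_2^2 )^(1/2)."""
    total = 0.0
    for A, lamA in lam.items():
        sq = sum(np.dot(wB, wB) for B, wB in w.items() if set(A) <= set(B))
        total += lamA * np.sqrt(sq)
    return total

# Illustrative parameters: one unary, one pairwise and one triple factor.
w = {(0,): np.array([0.3]), (0, 1): np.array([0.7]), (0, 1, 2): np.array([0.1])}
lam = {A: lambda_A(A, 0.5) for A in w}
print(reg_l1(w, lam), reg_group_l1(w, lam), reg_hier_group_l1(w, lam))
```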

n ≤ 16 =⇒ Partition function Z is tractable =⇒ Can compute the exact global optimum =⇒ Comparison results more reliable.

=⇒ Conclusions:

• Use a complete & minimal parameterization (spectral / canonical).
• Prefer the standard ℓ1.
• No natural reference state? =⇒ (Spectral, standard ℓ1) seems the natural choice.

Spectral Interpretation

Dual representation for continuous signals:

    Time Domain  ⇄  Frequency Domain    (via the Fourier Transform (FT) and inverse FT)

Dual representation for binary distributions:

    Log-Probability Domain  ⇄  “Interactions” Domain (“Spectrum”)    (via the Walsh-Hadamard Transform (WHT) and inverse WHT)

    (log p =) q = 2^{n/2} H_n w,        w = 2^{−n/2} H_n q

Hadamard matrix: H_n = 2^{−n/2} [ 1  1 ; 1  −1 ]^{⊗n}   (n-fold Kronecker power)
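A small sketch of this transform pair: H_n is built as a normalized Kronecker power, and because it is symmetric and orthonormal the forward and inverse maps use the same matrix up to scaling. The coefficient ordering (which index corresponds to which subset A) is a convention not spelled out on the poster.

```python
import numpy as np
from functools import reduce

def hadamard(n):
    """Orthonormal Hadamard matrix H_n = 2^{-n/2} [[1,1],[1,-1]] Kronecker-powered n times."""
    H1 = np.array([[1.0, 1.0], [1.0, -1.0]])
    return reduce(np.kron, [H1] * n) / 2 ** (n / 2)

n = 3
H = hadamard(n)

# Forward transform: from spectral parameters w to (unnormalized) log-probabilities q.
rng = np.random.default_rng(0)
w = rng.normal(size=2 ** n)          # one coefficient per subset A of S (ordering is a convention)
q = 2 ** (n / 2) * H @ w             # q = 2^{n/2} H_n w  (= log p up to the normalizing constant)

# Inverse transform: recover the interactions w from q.
w_back = 2 ** (-n / 2) * H @ q       # w = 2^{-n/2} H_n q
assert np.allclose(w, w_back)        # H_n is symmetric and orthonormal, so H_n^{-1} = H_n
```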

• “Spectral” seems a “natural” parameterization.
• “Canonical” may be “natural” when the problem domain has a natural reference state.

The Statistics (Spectrum) of Binary Data Sets

• The spectral parameterization defines a dual “interactions” representation for binary distributions:

    log p(x) ⇐⇒ {w_A}     (w = {w_A} = “interactions”)

• How can we measure the interactions in a binary data set?

  ▸ Learn an (approximate) distribution p′(x) ≈ p(x) from the data
  ▸ Represent p′(x) as {w′_A} ≈ {w_A}

• Learning p′(x):

  ▸ Avoid prior bias for smaller-cardinality factors: use a parameterization which treats all 2^n − 1 spectral parameters equally =⇒ Use (spectral, “flat” priors) / PC / ...

  ▸ High modeling accuracy (p′(x) ≈ p(x)) =⇒ Narrow down to: (spectral, flat ℓ1)
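A sketch of this recipe using the pseudo-counts (PC) option: estimate p′(x) by add-α smoothing of the empirical counts (the smoothing constant α is an assumption), take the WHT of log p′ to obtain the interactions w′_A, and average |w′_A| by order |A|. The state/coefficient ordering convention is assumed; grouping by |A| only uses the popcount of the coefficient index, which is convention-independent.

```python
import itertools
import numpy as np
from functools import reduce

def spectrum_by_order(data, alpha=1.0):
    """Pseudo-counts estimate p'(x) from binary data in {-1,+1}^n, then the WHT
    of log p' gives interactions w'_A; report mean |w'_A| grouped by |A|."""
    m, n = data.shape
    H1 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = reduce(np.kron, [H1] * n) / 2 ** (n / 2)       # orthonormal Hadamard H_n

    # Empirical counts over all 2^n states (state index = binary encoding of x).
    states = list(itertools.product([-1, +1], repeat=n))
    index = {x: i for i, x in enumerate(states)}
    counts = np.full(2 ** n, alpha)                    # add-alpha smoothing ("pseudo-counts")
    for row in data:
        counts[index[tuple(row)]] += 1
    q = np.log(counts / counts.sum())                  # log p'(x)

    w = 2 ** (-n / 2) * H @ q                          # interactions w' = 2^{-n/2} H_n q
    order = np.array([bin(b).count("1") for b in range(2 ** n)])   # |A| for coefficient b
    return {k: np.abs(w[order == k]).mean() for k in range(1, n + 1)}

# Tiny example with random data; on real data sets the higher-order means shrink rapidly.
rng = np.random.default_rng(1)
data = rng.choice([-1, +1], size=(200, 5))
print(spectrum_by_order(data))
```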

[Figure: mean interaction magnitude E(|w_A|) (log scale, panels a and c) and fraction of non-zero parameters (panels b and d) versus factor cardinality |A|. Panels (a), (b): Yeast, NLTCS, USPS; panels (c), (d): Czech, Rochdale, Jokes, Flow2. Solid lines = flat ℓ1, dashed lines = flat ℓ2, dotted lines = PC.]

• Results:
  ▸ Higher-order interactions {w′_A} decrease exponentially with #vars = |A|
  ▸ Confirms intuition & practical experience (adding higher-order potentials to models gives diminishing returns)
  ▸ Rationale for prior bias for lower-order factors

Select References

• Yvonne M. Bishop, Stephen E. Fienberg, and Paul W. Holland. Discrete multivariate analysis: Theory and practice. MIT Press, 1975.
• Daphne Koller and Nir Friedman. Probabilistic graphical models: Principles and techniques. MIT Press, 2009.
• Mark Schmidt and Kevin P. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. Artificial Intelligence and Statistics, 2010.
• Masashi Kato, Qian Ji Gao, Hiroshi Chigira, Hiroyuki Shindo, and Masato Inoue. A haplotype inference method based on sparsely connected multi-body Ising model. Journal of Physics: Conference Series, 233(1), 2010.