TRANSCRIPT
On Sparse, Spectral and Other Parameterizations of Binary Probabilistic Models
David Buchman¹, Mark Schmidt², Shakir Mohamed¹, David Poole¹, Nando de Freitas¹
¹ Department of Computer Science, University of British Columbia, Vancouver, B.C., Canada
² INRIA – SIERRA Team, Laboratoire d'Informatique de l'École Normale Supérieure, Paris, France
Artificial Intelligence and Statistics (AISTATS) 2012
Introduction
Consider a general log-linear (Markov) model with binary variables S = (x1, . . . , xn):
p(x) = (1/Z) ∏_{A⊆S} e^{φ_A(x_A)}
• May contain potentials over arbitrary sets of variables
• 2^n potentials; typically most are not modeled (set to zero)
Parameterizations for each potential φ_A(x_A):
• "Full" / "Ising" / "Generalized Ising" / "Canonical" / "Spectral" / . . .
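To make the model concrete, here is a minimal sketch (my illustration, not code from the paper) that evaluates p(x) by enumerating all 2^n states of a tiny model. The subsets and weights in "weights" are hypothetical, and the spectral form φ_A(x_A) = w_A ∏_{i∈A} x_i over x_i ∈ {−1,+1} is assumed:

import itertools
import math

n = 3
# Hypothetical modeled potentials: subset of variable indices -> weight w_A.
weights = {(): 0.0, (0,): 0.3, (1, 2): -0.7, (0, 1, 2): 0.1}

def log_potential(x):
    """Unnormalized log-probability: sum_A w_A * prod_{i in A} x_i."""
    return sum(w * math.prod(x[i] for i in A) for A, w in weights.items())

# Enumerate all 2^n joint states x in {-1,+1}^n to get the partition function Z.
states = list(itertools.product([-1, +1], repeat=n))
Z = sum(math.exp(log_potential(x)) for x in states)

def p(x):
    return math.exp(log_potential(x)) / Z

assert abs(sum(p(x) for x in states) - 1.0) < 1e-12
print(p((+1, -1, -1)))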
Parameterizations – Properties
Parameterization       Complete     Minimal  Symmetric      Symmetric         Uniquely
                                             w.r.t. Values  w.r.t. Variables  Defined
Full                   Yes          No       Yes            Yes               Yes
Ising                  No           No       No             Yes               Yes
Generalized Ising      When binary  No       No             Yes               Yes
Canonical              Yes          Yes      No             No                No
Canonical (C1/C2)      Yes          Yes      No             Yes               Yes
Spectral (for binary)  Yes          Yes      Yes            Yes               Yes
Parameterizations
For binary variables:

Parameterization    Pairwise potential φ_ij(x_i, x_j)            #Params per k-way pot.   Total #params
Full                ∑_{s1,s2} w_ij^{s1,s2} I[x_i=s1, x_j=s2]     2^k                      3^n
Ising               w_ij I[x_i = x_j]                            1                        2^n
Generalized Ising   ∑_s w_ij^s I[x_i = x_j = s]                  2                        2^n · 2
Canonical (C1)      w_ij I[x_i ≠ x_i^ref, x_j ≠ x_j^ref]         1                        2^n
Spectral            w_ij x_i x_j   (x_i ∈ {−1,+1})               1                        2^n

Extension to discrete variables with c values:

Parameterization    #Params per k-way pot.    Total #params
Full                c^k                       (c+1)^n
Ising               1 (special treatment)     2^n
Generalized Ising   c                         2^n · c
Canonical (C1)      (c−1)^k                   c^n

(The original table also displayed each parameterization's basis tables, with 0/1 indicator or ±1 entries, for 1-, 2- and 3-variable potentials.)
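To illustrate what complete and minimal mean here (a sketch of my own construction, not from the paper): for a single pairwise binary potential, the four spectral basis functions {1, x_i, x_j, x_i·x_j} evaluated on the four joint states form an invertible matrix, so every potential table has exactly one spectral representation:

import itertools
import numpy as np

states = list(itertools.product([-1, +1], repeat=2))           # 4 joint states
basis = np.array([[1, xi, xj, xi * xj] for xi, xj in states])  # rows: states

# Full rank <=> any potential table phi(xi, xj) has unique spectral weights.
assert np.linalg.matrix_rank(basis) == 4

phi = np.array([0.2, -1.0, 0.5, 0.3])   # an arbitrary potential table
w = np.linalg.solve(basis, phi)         # unique weights (w_0, w_i, w_j, w_ij)
assert np.allclose(basis @ w, phi)
print(w)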
Parameterizing for Learning
• All complete parameterizations share the same ML estimate.
• (parameterization, regularizer) = prior
• MAP: w = argmax_w ( ∑_i log p(x^i | w) − reg(w) )
• New priors: (Spectral, *)
• What makes a good prior?
  – Prediction accuracy
  – Sparsity
  – Computation
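A minimal sketch of this MAP problem (assumptions: spectral parameterization, flat ℓ1 regularizer, synthetic data, and n small enough that the exact gradient is available by enumerating all 2^n states). It uses plain proximal gradient with soft-thresholding, which is my choice of solver, not necessarily the one used in the experiments:

import itertools
import numpy as np

n = 3
states = np.array(list(itertools.product([-1, +1], repeat=n)))
subsets = [A for k in range(1, n + 1) for A in itertools.combinations(range(n), k)]
# F[s, a] = prod_{i in A_a} x_i : spectral features of every joint state.
F = np.array([[np.prod(x[list(A)]) for A in subsets] for x in states])

# Sample synthetic data from a sparse "true" model (hypothetical weights).
rng = np.random.default_rng(0)
w_true = np.zeros(len(subsets))
w_true[subsets.index((0, 1))] = 0.8          # one pairwise interaction
p_true = np.exp(F @ w_true); p_true /= p_true.sum()
idx = rng.choice(len(states), size=1000, p=p_true)
f_bar = F[idx].mean(axis=0)                  # empirical feature means

def grad_nll(w):
    """Gradient of the average negative log-likelihood: E_w[f] - f_bar."""
    logits = F @ w
    p = np.exp(logits - logits.max()); p /= p.sum()
    return F.T @ p - f_bar

lam, step, w = 0.01, 0.2, np.zeros(len(subsets))
for _ in range(1000):
    w = w - step * grad_nll(w)
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # prox of lam*||.||_1
print(dict(zip(subsets, np.round(w, 2))))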
Comparing Parameterizations & Regularizers
[Figures: bar charts comparing parameterizations (I, C1, C2, CR, S, GI, F) and regularizers (group, hierarchical "-h", flat ℓ1, flat ℓ2, PC) on three data sets. Top row: test-set negative log-likelihood for (a) NLTCS, (b) Yeast, (c) USPS. Bottom row: number of non-zero parameters (log scale) for the same data sets.]
reg(w) for each regularizer:

ℓ1:                     ∑_{A⊆S} λ_A ‖w_A‖_1                   = ∑_{A⊆S} λ_A ∑_j |w_A^(j)|
Group ℓ1:               ∑_{A⊆S} λ_A ‖w_A‖_2                   = ∑_{A⊆S} λ_A (∑_j (w_A^(j))^2)^{1/2}
Hierarchical group ℓ1:  ∑_{A⊆S} λ_A (∑_{B⊇A} ‖w_B‖_2^2)^{1/2} = ∑_{A⊆S} λ_A (∑_{B⊇A} ∑_j (w_B^(j))^2)^{1/2}
Flat ℓ1:                λ‖w‖_1                                = λ ∑_j |w^(j)|
Flat ℓ2:                λ‖w‖_2^2                              = λ ∑_j (w^(j))^2

Method names, by (parameterization, regularizer):

Parameterization    ℓ1    Group ℓ1    Hierarchical group ℓ1    Flat ℓ1    Flat ℓ2
Ising               I     –           I-h                      –          –
Canonical – C1      C1    –           C1-h                     –          –
Canonical – C2      C2    –           C2-h                     –          –
Canonical – CR      CR    –           CR-h                     –          –
Spectral            S     –           S-h                      flat L1    flat L2
Generalized Ising   GI    GI-group    GI-h                     –          –
Full                F     F-group     F-h                      –          –

CR = Canonical with a random reference state.
PC = pseudo-counts estimate.
λ_A = I[|A| ≥ 2] · 2^{|A|−2} · λ, with λ chosen using cross-validation.
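For concreteness, a sketch of these penalties (my illustration) for weights stored as a map from each subset A to its parameter vector w_A, using the λ_A schedule above:

import numpy as np

def lam_A(A, lam):
    # lam_A = I[|A| >= 2] * 2^(|A| - 2) * lam
    return lam * 2 ** (len(A) - 2) if len(A) >= 2 else 0.0

def l1(w, lam):                 # sum_A lam_A ||w_A||_1
    return sum(lam_A(A, lam) * np.abs(wA).sum() for A, wA in w.items())

def group_l1(w, lam):           # sum_A lam_A ||w_A||_2
    return sum(lam_A(A, lam) * np.linalg.norm(wA) for A, wA in w.items())

def hier_group_l1(w, lam):      # sum_A lam_A (sum_{B >= A} ||w_B||_2^2)^(1/2)
    return sum(
        lam_A(A, lam)
        * np.sqrt(sum(np.sum(wB ** 2) for B, wB in w.items() if set(A) <= set(B)))
        for A in w
    )

# Hypothetical weights for three potentials.
w = {(0,): np.array([0.3]), (0, 1): np.array([0.5]), (0, 1, 2): np.array([-0.2])}
print(l1(w, 0.1), group_l1(w, 0.1), hier_group_l1(w, 0.1))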
n ≤ 16 ⇒ the partition function Z is tractable ⇒ the exact global optimum can be computed ⇒ the comparison results are more reliable.
⇒ Conclusions:
• Use a complete & minimal parameterization (spectral / canonical).
• Prefer the standard ℓ1.
• No natural reference state? ⇒ (Spectral, standard ℓ1) seems the natural choice.
Spectral Interpretation
Dual representation for continuous signals:
  Time Domain ⇄ Frequency Domain, via the Fourier Transform (FT) and the inverse FT.
Dual representation for binary distributions:
  Log-Probability Domain ⇄ "Interactions" Domain (the "Spectrum"), via the Walsh–Hadamard Transform (WHT) and the inverse WHT:
  (log p =) q = 2^{n/2} H_n w,   w = 2^{−n/2} H_n q
Hadamard matrix: H_n = 2^{−n/2} [[1, 1], [1, −1]]^{⊗n} (Kronecker power)
• "Spectral" seems a "natural" parameterization.
• "Canonical" may be "natural" when the problem domain has a natural reference state.
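This duality is easy to check numerically. The following sketch (my illustration of the poster's formulas) builds H_n as a Kronecker power and verifies that w = 2^{−n/2} H_n q inverts q = 2^{n/2} H_n w:

import numpy as np
from functools import reduce

def hadamard(n):
    """H_n = 2^(-n/2) * [[1,1],[1,-1]]^(Kronecker power n); orthogonal and symmetric."""
    H1 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    return reduce(np.kron, [H1] * n)

n = 3
Hn = hadamard(n)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2 ** n))            # a random distribution over 2^n states
q = np.log(p)                                 # log-probability domain
w = 2 ** (-n / 2) * Hn @ q                    # interactions ("spectrum")
assert np.allclose(2 ** (n / 2) * Hn @ w, q)  # inverse WHT recovers log p
print(np.round(w, 3))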
The Statistics (Spectrum) of Binary Data Sets
• The spectral parameterization defines a dual "interactions" representation for binary distributions:
  log p(x) ⇔ {w_A}   (w = {w_A} = "interactions")
• How can we measure the interactions in a binary data set?
  – Learn an (approximate) distribution p′(x) ≈ p(x) from the data
  – Represent p′(x) as {w′_A} ≈ {w_A}
• Learning p′(x):
  – Avoid prior bias toward smaller-cardinality factors: use a parameterization that treats all 2^n − 1 spectral parameters equally ⇒ use (spectral, "flat" priors) / PC / . . .
  – High modeling accuracy (p′(x) ≈ p(x)) ⇒ narrow down to: (spectral, flat ℓ1)
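A sketch of this measurement procedure, with two simplifications I introduce for brevity: synthetic random data, and a pseudo-counts (PC) estimate of p′(x) in place of a regularized fit:

import numpy as np
from functools import reduce

n, N = 4, 500
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(N, n))       # fake binary data

# Pseudo-counts estimate of p'(x) over all 2^n states (add-one smoothing).
counts = np.ones(2 ** n)
for x in data:
    counts[int("".join(map(str, x)), 2)] += 1
p = counts / counts.sum()

H1 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
Hn = reduce(np.kron, [H1] * n)
w = 2 ** (-n / 2) * Hn @ np.log(p)           # interactions w_A

# Row k of the Kronecker-power Hadamard corresponds to the subset A whose
# members are the 1-bits of k, so group |w_A| by the popcount of k.
orders = [bin(k).count("1") for k in range(2 ** n)]
for order in range(1, n + 1):
    mask = [k for k in range(2 ** n) if orders[k] == order]
    print(order, np.mean(np.abs(w[mask])))   # mean |w_A| at each order |A|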
[Figures: data-set spectra. Panels (a), (b): Yeast, NLTCS, USPS (|A| = 1..16); panels (c), (d): Czech, Rochdale, Jokes, Flow2 (|A| = 1..11). Panels (a), (c): E(|w_A|) vs. |A| on a log scale; panels (b), (d): fraction of non-zero parameters vs. |A|.]
Solid lines = flat ℓ1, dashed lines = flat ℓ2, dotted lines = PC.
• Results:
  – Higher-order interactions {w′_A} decrease exponentially with the number of variables |A|
  – Confirms intuition & practical experience (adding higher-order potentials to models gives diminishing returns)
  – Provides a rationale for a prior bias toward lower-order factors