Tractable Probabilistic Models: Representations, Inference, Learning, Applications
Antonio Vergari (University of California, Los Angeles), Nicola Di Mauro (University of Bari), Guy Van den Broeck (University of California, Los Angeles)
July 22, 2019, Conference on Uncertainty in Artificial Intelligence (UAI 2019), Tel Aviv


Page 1: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable Probabilistic Models

Representations, Inference, Learning, Applications

Antonio Vergari, University of California, Los Angeles

Nicola Di Mauro, University of Bari

Guy Van den Broeck, University of California, Los Angeles

July 22, 2019 - Conference on Uncertainty in Artificial Intelligence (UAI 2019) Tel Aviv

Page 2: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

AoGs PDGs NBs Fully factorized sd-DNNF PSDDs

Trees LTMs DNNFs OBDDs CNets SPNs NADEs

Thin Junction Trees NNF FBDDs BDDs ACs VAEs

Polytrees d-NNFs ADDs SDDs TACs GANs NFs

Mixtures XADDs XSDDs MNs BNs FGs

The Alphabet Soup of models in AI

2/147

Page 3: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

AoGs PDGs NBs Fully factorized sd-DNNF PSDDs

Trees LTMs DNNFs OBDDs CNets SPNs NADEs

Thin Junction Trees NNF FBDDs BDDs ACs VAEs

Polytrees d-NNFs ADDs SDDs TACs GANs NFs

Mixtures XADDs XSDDs MNs BNs FGs

Logical and Probabilistic models

3/147

Page 4: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

AoGs PDGs NBs Fully factorized sd-DNNF PSDDs

Trees LTMs DNNFs OBDDs CNets SPNs NADEs

Thin Junction Trees NNF FBDDs BDDs ACs VAEs

Polytrees d-NNFs ADDs SDDs TACs GANs NFs

Mixtures XADDs XSDDs MNs BNs FGs

Tractable and Intractable probabilistic models

4/147

Page 5: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

AoGs PDGs NBs Fully factorized sd-DNNF PSDDs

Trees LTMs DNNFs OBDDs CNets SPNs NADEs

Thin Junction Trees NNF FBDDs BDDs ACs VAEs

Polytrees d-NNFs ADDs SDDs TACs GANs NFs

Mixtures XADDs XSDDs MNs BNs FGs

Expressive models without compromises

5/147

Page 6: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why tractable inference? or: expressiveness vs tractability

Probabilistic circuits: a unified framework for tractable models

Building circuits: learning them from data and compiling other models

Applications: what are circuits useful for

6/147

Page 7: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why tractable inference? or: the inherent trade-off of tractability vs. expressiveness

Page 8: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why probabilistic inference?

q1: What is the probability that today is a Monday and there is a traffic jam on Herzl Str.?

q2: Which day is most likely to have a traffic jam on my route to work?

pinterest.com/pin/190417890473268205/

8/147

Page 9: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why probabilistic inference?

q1: What is the probability that today is a Monday and there is a traffic jam on Herzl Str.?

q2: Which day is most likely to have a traffic jam on my route to work?

⇒ fitting a predictive model!

pinterest.com/pin/190417890473268205/

8/147

Page 10: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why probabilistic inference?

q1: What is the probability that today is a Monday and there is a traffic jam on Herzl Str.?

q2: Which day is most likely to have a traffic jam on my route to work?

⇒ fitting a predictive model!
⇒ answering probabilistic queries on a probabilistic model of the world m

q1(m) = ?   q2(m) = ?

pinterest.com/pin/190417890473268205/

8/147

Page 11: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why probabilistic inference?

q1: What is the probability that today is a Monday and there is a traffic jam on Herzl Str.?

X = Day, Time, JamStr1, JamStr2, …, JamStrN

q1(m) = pm(Day = Mon, JamHerzl = 1)

pinterest.com/pin/190417890473268205/

8/147

Page 12: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why probabilistic inference?

q1: What is the probability that today is a Monday and there is a traffic jam on Herzl Str.?

X = Day, Time, JamStr1, JamStr2, …, JamStrN

q1(m) = pm(Day = Mon, JamHerzl = 1)

⇒ marginals

pinterest.com/pin/190417890473268205/

8/147

Page 13: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why probabilistic inference?

q2: Which day is most likely to have a traffic jam on my route to work?

X = Day, Time, JamStr1, JamStr2, …, JamStrN

q2(m) = argmax_d pm(Day = d ∧ ⋁_{i ∈ route} JamStr_i)

pinterest.com/pin/190417890473268205/

8/147

Page 14: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Why probabilistic inference?

q2: Which day is most likely to have a traffic jam on my route to work?

X = Day, Time, JamStr1, JamStr2, …, JamStrN

q2(m) = argmax_d pm(Day = d ∧ ⋁_{i ∈ route} JamStr_i)

⇒ marginals + MAP + logical events

pinterest.com/pin/190417890473268205/

8/147

Page 15: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable Probabilistic Inference

A class of queries Q is tractable on a family of probabilistic models M iff for any query q ∈ Q and model m ∈ M, exactly computing q(m) runs in time O(poly(|q| · |m|)).

9/147

Page 16: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable Probabilistic Inference

A class of queries Q is tractable on a family of probabilistic models M iff for any query q ∈ Q and model m ∈ M, exactly computing q(m) runs in time O(poly(|q| · |m|)).

⇒ often poly will in fact be linear!

9/147

Page 17: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable Probabilistic Inference

A class of queries Q is tractable on a family of probabilistic models M iff for any query q ∈ Q and model m ∈ M, exactly computing q(m) runs in time O(poly(|q| · |m|)).

⇒ often poly will in fact be linear!

Note: if M and Q are compact in the number of random variables X, that is, |m|, |q| ∈ O(poly(|X|)), then query time is O(poly(|X|)).

9/147

Page 18: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

What about approximate inference?

Why approximate when we can do exact?
⇒ and do we lose something in terms of expressiveness?

Approximations can be intractable as well [Dagum et al. 1993; Roth 1996]

⇒ But sometimes approximate inference comes with guarantees (Rina)

Approximate inference by exact inference in approximate model [Dechter et al. 2002; Choi et al. 2010; Lowd et al. 2010; Sontag et al. 2011; Friedman et al. 2018]

Approximate inference (even with guarantees) can mislead learners [Kulesza et al. 2007]
⇒ Chaining approximations is flying with a blindfold on

10/147

Page 19: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Stay Tuned For …

Next:

1. What are classes of queries?

2. Are my favorite models tractable?

3. Are tractable models expressive?

After: We introduce probabilistic circuits as a unified framework for tractable probabilistic modeling

11/147

Page 20: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Complete evidence queries (EVI)

q3: What is the probability that today is a Monday at 12.00 and there is a traffic jam only on Herzl Str.?

pinterest.com/pin/190417890473268205/

12/147

Page 21: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Complete evidence queries (EVI)

q3: What is the probability that today is a Monday at 12.00 and there is a traffic jam only on Herzl Str.?

X = Day, Time, JamHerzl, JamStr2, …, JamStrN

q3(m) = pm(X = Mon, 12.00, 1, 0, . . . , 0)

pinterest.com/pin/190417890473268205/

12/147

Page 22: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Complete evidence queries (EVI)

q3: What is the probability that today is a Monday at 12.00 and there is a traffic jam only on Herzl Str.?

X = Day, Time, JamHerzl, JamStr2, …, JamStrN

q3(m) = pm(X = Mon, 12.00, 1, 0, …, 0)

…fundamental in maximum likelihood learning

θ_m^MLE = argmax_θ ∏_{x∈D} pm(x; θ)

pinterest.com/pin/190417890473268205/

12/147
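As a small, hedged illustration of the point above (not part of the original slides): for a fully factorized model over binary variables, the likelihood is just a product of EVI evaluations and the MLE has a closed form. The dataset and parameters below are made up.

import numpy as np

# Toy dataset over 3 binary variables (hypothetical numbers).
D = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])

# MLE for a product of Bernoullis: theta_i is the empirical mean of X_i.
theta = D.mean(axis=0)

def evi(x, theta):
    """Complete-evidence query p_m(x; theta) for one full assignment x."""
    return np.prod(np.where(x == 1, theta, 1.0 - theta))

# Log-likelihood of the dataset: the objective maximized by theta_MLE above.
log_lik = sum(np.log(evi(x, theta)) for x in D)
print(theta, log_lik)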

Page 23: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Generative Adversarial Networks

min_θ max_ϕ E_{x∼p_data(x)}[log D_ϕ(x)] + E_{z∼p(z)}[log(1 − D_ϕ(G_θ(z)))]

Goodfellow et al., “Generative adversarial nets”, 2014 13/147

Page 24: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Generative Adversarial Networks

min_θ max_ϕ E_{x∼p_data(x)}[log D_ϕ(x)] + E_{z∼p(z)}[log(1 − D_ϕ(G_θ(z)))]

no explicit likelihood!
⇒ adversarial training instead of MLE ⇒ no tractable EVI

good sample quality ⇒ but lots of samples needed for MC

unstable training ⇒ mode collapse

Goodfellow et al., “Generative adversarial nets”, 2014 14/147
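To make the objective above concrete, here is a minimal numpy sketch that estimates the two expectation terms from samples. The 1-D generator and logistic discriminator are hypothetical stand-ins, not the networks from Goodfellow et al.

import numpy as np

rng = np.random.default_rng(0)

def G(z, theta):                       # toy generator: shift and scale the noise
    return theta[0] + theta[1] * z

def D(x, phi):                         # toy discriminator: sigmoid of an affine score
    return 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * x)))

x_data = rng.normal(2.0, 1.0, size=1000)    # samples standing in for p_data
z = rng.normal(size=1000)                   # samples from the prior p(z)
theta, phi = np.array([0.0, 1.0]), np.array([0.0, 1.0])

# Monte Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]:
value = np.mean(np.log(D(x_data, phi))) + np.mean(np.log(1.0 - D(G(z, theta), phi)))
print(value)   # minimized over theta (generator), maximized over phi (discriminator)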

Page 25: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Variational Autoencoders

log pθ(x) = log ∫ pθ(x | z) p(z) dz

an explicit likelihood model!

Rezende et al., “Stochastic backprop. and approximate inference in deep generative models”, 2014
Kingma et al., “Auto-Encoding Variational Bayes”, 2014 15/147

Page 26: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Variational Autoencoders

log pθ(x) ≥ E_{z∼qϕ(z|x)}[log pθ(x | z)] − KL(qϕ(z | x) ‖ p(z))

an explicit likelihood model!

... but computing log pθ(x) is intractable

⇒ an infinite and uncountable mixture ⇒ no tractable EVI

we need to optimize the ELBO… ⇒ which is “broken”

[Alemi et al. 2017; Dai et al. 2019]

16/147
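As a hedged, toy-sized illustration of the bound above (not the models cited on the slide): take a 1-D latent variable model with p(z) = N(0,1), decoder p(x|z) = N(z,1) and an assumed Gaussian posterior q(z|x); the reconstruction term is estimated by Monte Carlo and the KL term has a closed form for Gaussians.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D model: p(z) = N(0,1), p(x|z) = N(z,1), q(z|x) = N(mu, sigma^2).
x, mu, sigma = 1.5, 1.0, 0.8

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean) ** 2 / (2 * std**2)

z = rng.normal(mu, sigma, size=10_000)                   # z ~ q(z|x)
recon = np.mean(log_normal(x, z, 1.0))                   # E_q[log p(x|z)], Monte Carlo
kl = 0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))   # KL(N(mu,sigma^2) || N(0,1))
elbo = recon - kl
print(elbo)   # a lower bound on log p(x); the exact value still needs the integral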

Page 27: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Probabilistic Graphical Models (PGMs)

Declarative semantics: a clean separation of modeling assumptions from inference

Nodes: random variables
Edges: dependencies

+

Inference: conditioning [Darwiche 2001; Sang et al. 2005]

elimination [Zhang et al. 1994; Dechter 1998]

message passing [Yedidia et al. 2001; Dechter et al. 2002; Choi et al. 2010; Sontag et al. 2011]

X1

X2

X3

X4

X5

17/147

Page 28: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

PGMs: MNs and BNs

Markov Networks (MNs)

p(X) = (1/Z) ∏_c ϕ_c(X_c)

X1

X2

X3

X4

X5

18/147

Page 29: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

PGMs: MNs and BNs

Markov Networks (MNs)

p(X) = (1/Z) ∏_c ϕ_c(X_c)

Z = ∫ ∏_c ϕ_c(X_c) dX

⇒ EVI queries are intractable!

X1

X2

X3

X4

X5

18/147

Page 30: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

PGMs: MNs and BNs

Markov Networks (MNs)

p(X) = (1/Z) ∏_c ϕ_c(X_c)

Z = ∫ ∏_c ϕ_c(X_c) dX

⇒ EVI queries are intractable!

Bayesian Networks (BNs)

p(X) = ∏_i p(Xi | pa(Xi))

⇒ EVI queries are tractable!

X1

X2

X3

X4

X5

X1

X2

X3

X4

X5

18/147
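A minimal sketch of the claim above, with made-up CPTs for a two-variable chain A → B: an EVI query in a BN is one CPT lookup per variable, multiplied together, with no normalization constant to compute.

p_A = {0: 0.3, 1: 0.7}                        # p(A)
p_B_given_A = {0: {0: 0.9, 1: 0.1},           # p(B | A=0)
               1: {0: 0.2, 1: 0.8}}           # p(B | A=1)

def evi(a, b):
    """p(A=a, B=b) = p(a) * p(b | a): a product of local factors."""
    return p_A[a] * p_B_given_A[a][b]

print(evi(1, 1))   # 0.7 * 0.8 = 0.56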

Page 31: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Marginal queries (MAR)

q1: What is the probability that today is a Monday at 12.00 and there is a traffic jam only on Herzl Str.?

pinterest.com/pin/190417890473268205/

19/147

Page 32: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Marginal queries (MAR)

q1: What is the probability that today is a Monday at 12.00 and there is a traffic jam only on Herzl Str.?

q1(m) = pm(Day = Mon, JamHerzl = 1)

pinterest.com/pin/190417890473268205/

19/147

Page 33: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Marginal queries (MAR)

q1: What is the probability that today is a Monday at 12.00 and there is a traffic jam only on Herzl Str.?

q1(m) = pm(Day = Mon, JamHerzl = 1)

General: pm(e) = ∫ pm(e, H) dH, where E ⊂ X and H = X \ E

pinterest.com/pin/190417890473268205/

19/147

Page 34: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Conditional queries (CON)

q4: What is the probability that there is a traffic jam on Herzl Str. given that today is a Monday?

pinterest.com/pin/190417890473268205/

20/147

Page 35: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Conditional queries (CON)

q4: What is the probability that there is a traffic jam on Herzl Str. given that today is a Monday?

q4(m) = pm(JamHerzl = 1 | Day = Mon)

pinterest.com/pin/190417890473268205/

20/147

Page 36: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Conditional queries (CON)

q4: What is the probability that there is a traffic jam on Herzl Str. given that today is a Monday?

q4(m) = pm(JamHerzl = 1 | Day = Mon)

If you can answer MAR queries, then you can also answer conditional queries (CON):

pm(Q | E) = pm(Q, E) / pm(E)

pinterest.com/pin/190417890473268205/

20/147
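A small sketch of the reduction above, using a made-up joint distribution over two binary variables and brute-force marginals; any model m that answers MAR could be plugged in place of the explicit table.

from itertools import product

joint = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.35, (1, 1): 0.15}  # hypothetical p(Day, Jam)

def mar(**evidence):
    """MAR query: sum the joint over all states consistent with the evidence."""
    total = 0.0
    for day, jam in product([0, 1], repeat=2):
        state = {"day": day, "jam": jam}
        if all(state[k] == v for k, v in evidence.items()):
            total += joint[(day, jam)]
    return total

# CON from two MAR calls: p(Jam=1 | Day=0) = p(Jam=1, Day=0) / p(Day=0)
print(mar(jam=1, day=0) / mar(day=0))   # 0.20 / 0.50 = 0.4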

Page 37: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Complexity of MAR on PGMs

Exact complexity: Computing MAR and CON is #P-complete [Cooper 1990; Roth 1996].

Approximation complexity: Computing MAR and CON approximately within a relative error of 2^(n^(1−ϵ)) for any fixed ϵ is NP-hard [Dagum et al. 1993; Roth 1996].

Treewidth: Informally, how tree-like is the graphical model m? Formally, the minimum width of any tree-decomposition of m.

Fixed-parameter tractable: MAR and CON on a graphical model m with treewidth w take time O(|X| · 2^w), which is linear for fixed width w [Dechter 1998; Koller et al. 2009].

⇒ what about bounding the treewidth by design?

21/147

Page 38: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Complexity of MAR on PGMs

Exact complexity: Computing MAR and CON is #P-complete [Cooper 1990; Roth 1996].

Approximation complexity: Computing MAR and CON approximately within a relative error of 2^(n^(1−ϵ)) for any fixed ϵ is NP-hard [Dagum et al. 1993; Roth 1996].

Treewidth: Informally, how tree-like is the graphical model m? Formally, the minimum width of any tree-decomposition of m.

Fixed-parameter tractable: MAR and CON on a graphical model m with treewidth w take time O(|X| · 2^w), which is linear for fixed width w [Dechter 1998; Koller et al. 2009].

⇒ what about bounding the treewidth by design?

21/147

Page 39: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Low-treewidth PGMs

Trees [Meilă et al. 2000]    Polytrees [Dasgupta 1999]    Thin Junction Trees [Bach et al. 2001]

If treewidth is bounded (e.g. ≈ 20), exact MAR and CON inference is possible in practice

22/147

Page 40: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Low-treewidth PGMs: trees

A tree-structured BN [Meilă et al. 2000] where each Xi ∈ X has at most one parent Pa_Xi.

(figure: a tree BN over X1, …, X5)

p(X) = ∏_{i=1}^{n} p(xi | Pa_xi)

Exact querying: EVI, MAR, CON tasks are linear for trees: O(|X|)
Exact learning from d examples takes O(|X|² · d) with the classical Chow-Liu algorithm¹

1Chow et al., “Approximating discrete probability distributions with dependence trees”, 1968 23/147
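A compact sketch of the Chow-Liu procedure referenced above (assumed binary data, add-epsilon smoothing; not the authors' implementation): estimate all pairwise mutual informations and keep a maximum-weight spanning tree.

import numpy as np
from itertools import combinations

def mutual_information(D, i, j, eps=1e-6):
    """Empirical MI between binary columns i and j of D (lightly smoothed)."""
    n, mi = len(D), 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = (np.sum((D[:, i] == a) & (D[:, j] == b)) + eps) / n
            p_a = (np.sum(D[:, i] == a) + eps) / n
            p_b = (np.sum(D[:, j] == b) + eps) / n
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(D):
    """Edges of a maximum spanning tree over pairwise MI (Prim's algorithm)."""
    n_vars = D.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i, j in combinations(range(n_vars), 2):
        mi[i, j] = mi[j, i] = mutual_information(D, i, j)
    in_tree, edges = {0}, []
    while len(in_tree) < n_vars:
        i, j = max(((i, j) for i in in_tree for j in range(n_vars) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append((i, j))          # variable j gets parent i in the tree BN
        in_tree.add(j)
    return edges

D = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]])  # toy data
print(chow_liu_edges(D))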

Page 41: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

What do we lose?

Expressiveness: Ability to compactly represent rich and complex classes of distributions

X1

X2

X3

X4

X5

Bounded-treewidth PGMs lose the ability to represent all possible distributions …

Cohen et al., “On the expressive power of deep learning: A tensor analysis”, 2016
Martens et al., “On the Expressive Efficiency of Sum Product Networks”, 2014 24/147

Page 42: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Mixtures

Mixtures as a convex combination of k (simpler) probabilistic models

(plot: a two-component mixture density p(X1) over X1)

p(X) = w1 · p1(X) + w2 · p2(X)

EVI, MAR, CON queries scale linearly in k

25/147
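A minimal sketch of why EVI and MAR stay linear in the number of components k (an illustrative Gaussian mixture with hypothetical parameters): each query touches every component exactly once.

import numpy as np
from scipy.stats import norm

weights = np.array([0.25, 0.50, 0.25])            # k = 3 components
means, stds = np.array([-4.0, 0.0, 5.0]), np.array([1.0, 2.0, 1.5])

def evi(x):
    """p(x) = sum_i w_i p_i(x): one term per component."""
    return np.sum(weights * norm.pdf(x, means, stds))

def mar(lo, hi):
    """p(lo <= X <= hi): marginals reduce to k tractable component CDFs."""
    return np.sum(weights * (norm.cdf(hi, means, stds) - norm.cdf(lo, means, stds)))

print(evi(0.3), mar(-1.0, 1.0))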

Page 43: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Mixtures

Mixtures as a convex combination of k (simpler) probabilistic models

(plot: a two-component mixture density p(X1) over X1)

p(X) = p(Z = 1) · p1(X | Z = 1) + p(Z = 2) · p2(X | Z = 2)

Mixtures marginalize a categorical latent variable Z with k values ⇒ increased expressiveness

25/147

Page 44: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Expressiveness and efficiency

Expressiveness: Ability to compactly represent rich and effective classes of functions

⇒ mixture of Gaussians can approximate any distribution!

Expressive efficiency (succinctness) compares model sizes in terms of their ability to compactly represent functions

⇒ but how many components do they need?

Cohen et al., “On the expressive power of deep learning: A tensor analysis”, 2016
Martens et al., “On the Expressive Efficiency of Sum Product Networks”, 2014 26/147

Page 45: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Mixture models: expressive efficiency

⇒ deeper mixtures would be more efficient than shallow ones 27/147

Page 46: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Maximum A Posteriori (MAP), aka Most Probable Explanation (MPE)

q5: Which combination of roads is most likely to be jammed on Monday at 9am?

pinterest.com/pin/190417890473268205/

28/147

Page 47: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Maximum A Posteriori (MAP), aka Most Probable Explanation (MPE)

q5: Which combination of roads is most likely to be jammed on Monday at 9am?

q5(m) = argmax_j pm(j1, j2, … | Day=M, Time=9)

pinterest.com/pin/190417890473268205/

28/147

Page 48: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Maximum A Posteriori (MAP), aka Most Probable Explanation (MPE)

q5: Which combination of roads is most likely to be jammed on Monday at 9am?

q5(m) = argmax_j pm(j1, j2, … | Day=M, Time=9)

General: argmax_q pm(q | e), where Q ∪ E = X

pinterest.com/pin/190417890473268205/

28/147

Page 49: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Maximum A Posteriori (MAP), aka Most Probable Explanation (MPE)

q5: Which combination of roads is most likely to be jammed on Monday at 9am?

…intractable for latent variable models!

max_q pm(q | e) = max_q ∑_z pm(q, z | e) ≠ ∑_z max_q pm(q, z | e)

pinterest.com/pin/190417890473268205/

28/147

Page 50: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Marginal MAP (MMAP), aka BN MAP

q6: Which combination of roads is most likely to be jammed on Monday at 9am?

pinterest.com/pin/190417890473268205/

29/147

Page 51: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Marginal MAP (MMAP), aka BN MAP

q6: Which combination of roads is most likely to be jammed on Monday at 9am?

q6(m) = argmax_j pm(j1, j2, … | Time=9)

pinterest.com/pin/190417890473268205/

29/147

Page 52: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Marginal MAP (MMAP), aka BN MAP

q6: Which combination of roads is most likely to be jammed on Monday at 9am?

q6(m) = argmax_j pm(j1, j2, … | Time=9)

General: argmax_q pm(q | e) = argmax_q ∑_h pm(q, h | e), where Q ∪ H ∪ E = X

pinterest.com/pin/190417890473268205/

29/147

Page 53: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Marginal MAP (MMAP), aka BN MAP

q6: Which combination of roads is most likely to be jammed on Monday at 9am?

q6(m) = argmax_j pm(j1, j2, … | Time=9)

⇒ NP^PP-complete [Park et al. 2006]

⇒ NP-hard for trees [Campos 2011]

⇒ NP-hard even for Naive Bayes [ibid.]

pinterest.com/pin/190417890473268205/

29/147

Page 54: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Advanced queries

q2: Which day is most likely to have a traffic jam on my route to work?

pinterest.com/pin/190417890473268205/

Bekker et al., “Tractable Learning for Complex Probability Queries”, 2015 30/147

Page 55: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Advanced queries

q2: Which day is most likely to have a traffic jam on my route to work?

q2(m) = argmax_d pm(Day = d ∧ ⋁_{i ∈ route} JamStr_i)

⇒ marginals + MAP + logical events

pinterest.com/pin/190417890473268205/

Bekker et al., “Tractable Learning for Complex Probability Queries”, 2015 30/147

Page 56: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Advanced queries

q2: Which day is most likely to have a traffic jam on my route to work?

q7: What is the probability of seeing more traffic jams in Jaffa than Marina?

pinterest.com/pin/190417890473268205/

Bekker et al., “Tractable Learning for Complex Probability Queries”, 2015 30/147

Page 57: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Advanced queries

q2: Which day is most likely to have a traffic jam on my route to work?

q7: What is the probability of seeing more traffic jams in Jaffa than Marina?

⇒ counts + group comparison

pinterest.com/pin/190417890473268205/

Bekker et al., “Tractable Learning for Complex Probability Queries”, 2015 30/147

Page 58: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Advanced queries

q2: Which day is most likely to have a traffic jam on my route to work?

q7: What is the probability of seeing more traffic jams in Jaffa than Marina?

and more:

expected classification agreement [Oztok et al. 2016; Choi et al. 2017, 2018]

expected predictions [Khosravi et al. 2019a]

pinterest.com/pin/190417890473268205/

Bekker et al., “Tractable Learning for Complex Probability Queries”, 2015 30/147

Page 59: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Fully factorized models

A completely disconnected graph. Example: Product of Bernoullis (PoBs)

(figure: a fully disconnected graph over X1, …, X5)

p(X) = ∏_{i=1}^{n} p(xi | Pa_xi)

Complete evidence, marginal, MAP and MMAP inference is linear!

⇒ but definitely not expressive…

31/147
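A small sketch of why these queries are linear for a product of Bernoullis (hypothetical parameters): marginalizing a variable simply drops its factor, and MAP maximizes each factor independently.

theta = [0.9, 0.2, 0.6, 0.5]                 # theta[i] = p(X_i = 1)

def mar(evidence):
    """MAR: keep a factor only for observed variables; the rest integrate to 1."""
    p = 1.0
    for i, v in evidence.items():
        p *= theta[i] if v == 1 else 1.0 - theta[i]
    return p

# MAP: the mode of a product is the product of the modes.
map_state = [1 if t >= 0.5 else 0 for t in theta]

print(mar({0: 1, 2: 0}), map_state)          # p(X1=1, X3=0) = 0.9 * 0.4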

Page 60: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

(2D map: expressive efficiency on one axis, number of tractable query classes on the other)

32/147

Page 61: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

(2D map: BNs, NFs, NADEs, MNs, VAEs and GANs sit high on expressive efficiency but low on tractable queries)

Expressive models are not very tractable… 33/147

Page 62: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

(2D map: fully factorized models, NB, Trees, Polytrees, TJTs, LTMs and Mixtures sit high on tractable queries but low on expressive efficiency, opposite to BNs, NFs, NADEs, MNs, VAEs and GANs)

and tractable ones are not very expressive… 34/147

Page 63: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

(2D map as before, with an X marking the region that is both expressive-efficient and tractable)

probabilistic circuits are at the “sweet spot” 35/147

Page 64: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Probabilistic Circuits

Page 65: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Stay Tuned For …

Next:

1. What are the building blocks of tractable models? ⇒ build them into a computational graph: a probabilistic circuit

2. For which queries are probabilistic circuits tractable? ⇒ tractability classes induced by structural properties

After: How are probabilistic circuits related to the alphabet soup of models?

37/147

Page 66: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Base Case: Univariate Distributions

x

X

pX(x)

Generally, univariate distributions are tractable for:

EVI: output p(Xi) (density or mass)

MAR: output 1 (normalized) or Z (unnormalized)

MAP: output the mode

38/147

Page 67: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Base Case: Univariate Distributions

x

X

pX(x)

Generally, univariate distributions are tractable for:

EVI: output p(Xi) (density or mass)

MAR: output 1 (normalized) or Z (unnormalized)

MAP: output the mode

⇒ often 100% probability for one value of a categorical random variable
⇒ for example, X or ¬X for a Boolean random variable

38/147

Page 68: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Base Case: Univariate Distributions

.74

X

.33

Generally, univariate distributions are tractable for:

EVI: output p(Xi) (density or mass)

MAR: output 1 (normalized) or Z (unnormalized)

MAP: output the mode

⇒ often 100% probability for one value of a categorical random variable
⇒ for example, X or ¬X for a Boolean random variable

38/147
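A hedged sketch of such an input unit (a Bernoulli leaf with a made-up interface): EVI reads off the mass of the observed value, MAR of a normalized leaf is 1, and MAP returns the mode.

class BernoulliLeaf:
    def __init__(self, var, p):
        self.var, self.p = var, p
    def evi(self, x):                        # mass of the observed value
        return self.p if x[self.var] == 1 else 1.0 - self.p
    def mar(self):                           # a normalized leaf integrates to 1
        return 1.0
    def map(self):                           # the mode and its probability
        return (1, self.p) if self.p >= 0.5 else (0, 1.0 - self.p)

leaf = BernoulliLeaf(var="X1", p=0.74)
print(leaf.evi({"X1": 1}), leaf.mar(), leaf.map())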

Page 69: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Factorizations are products: divide and conquer complexity

p(X1, X2, X3) = p(X1) · p(X2) · p(X3)

(figure: the joint as a product node × over univariate leaf distributions for X1, X2, X3)

⇒ e.g. modeling a multivariate Gaussian with diagonal covariance matrix

39/147

Page 70: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Factorizations are products: divide and conquer complexity

p(x1, x2, x3) = p(x1) · p(x2) · p(x3)

(figure: the product node × over leaves for X1, X2, X3 outputting 0.8, 0.5, 0.9)

⇒ e.g. modeling a multivariate Gaussian with diagonal covariance matrix

39/147

Page 71: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Factorizations are products: divide and conquer complexity

p(x1, x2, x3) = p(x1) · p(x2) · p(x3)

(figure: the same product node now outputs .36 = 0.8 · 0.5 · 0.9)

⇒ e.g. modeling a multivariate Gaussian with diagonal covariance matrix

39/147

Page 72: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Mixtures are sums

Also mixture models can be treated as a simple computational unit over distributions

(plot: a two-component mixture density p(X1) over X1)

p(X) = w1 · p1(X) + w2 · p2(X)

40/147

Page 73: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Mixtures are sums

Also mixture models can be treated as a simple computational unit over distributions

X1 X1

w1 w2

p(x) = 0.2·p1(x)+0.8·p2(x)

40/147

Page 74: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Mixtures are sums

Also mixture models can be treated as a simple computational unit over distributions

(figure: a sum node with weights 0.2 and 0.8 over two leaves for X1 outputting 0.2 and 0.5; the sum outputs .44)

p(x) = 0.2 · p1(x) + 0.8 · p2(x)

With mixtures, we increase expressiveness ⇒ by stacking them we increase expressive efficiency

40/147

Page 75: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

A grammar for tractable models: recursive semantics of probabilistic circuits

X1

41/147

Page 76: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

A grammar for tractable models: recursive semantics of probabilistic circuits

X1 X1 X1

w1 w2

41/147

Page 77: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

A grammar for tractable models: recursive semantics of probabilistic circuits

X1 X1 X1

w1 w2

×

X1 X2

×

X1 X2

w1 w2

41/147

Page 78: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

A grammar for tractable models: recursive semantics of probabilistic circuits

X1 X1 X1

w1 w2

×

X1 X2

×

X1 X2

w1 w2

× ×

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

41/147
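The recursive grammar above can be written down directly. A minimal sketch (tuples as nodes, Bernoulli leaves, all names hypothetical): a circuit is a leaf, a product of circuits over disjoint variables, or a weighted sum of circuits over the same variables, and EVI is one bottom-up pass.

import math

def leaf(var, p):                            # univariate Bernoulli leaf
    return ("leaf", var, p)

def product(*children):                      # factorization over disjoint variables
    return ("prod", list(children))

def mixture(weights, *children):             # convex combination over the same variables
    return ("sum", list(weights), list(children))

def evi(node, x):                            # complete evidence, one bottom-up pass
    if node[0] == "leaf":
        _, var, p = node
        return p if x[var] == 1 else 1.0 - p
    if node[0] == "prod":
        return math.prod(evi(c, x) for c in node[1])
    weights, children = node[1], node[2]
    return sum(w * evi(c, x) for w, c in zip(weights, children))

circuit = mixture([0.3, 0.7],
                  product(leaf("X1", 0.8), leaf("X2", 0.5)),
                  product(leaf("X1", 0.2), leaf("X2", 0.9)))
print(evi(circuit, {"X1": 1, "X2": 0}))      # 0.3*(0.8*0.5) + 0.7*(0.2*0.1)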

Page 79: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Probabilistic circuits are not PGMs!

They are probabilistic and graphical, however …

            PGMs                              Circuits
Nodes:      random variables                  units of computation
Edges:      dependencies                      order of execution
Inference:  conditioning, elimination,        feedforward pass,
            message passing                   backward pass

⇒ they are computational graphs, more like neural networks

42/147

Page 80: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

The perks of being a computational graph

Computations that are repeated can be cached! ⇒ amortizing inference; parameter/structure sharing

Clear operational semantics! ⇒ Tractability in terms of circuit size

Differentiable! ⇒ gradient-based optimization

Structural properties on the computational graph cleanly map to tractable query classes…

43/147

Page 81: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Just sums, products and distributions?

X1

X2

X3

X4

X5

X6

×

×

×

×

×

×

×

×

×

×

×

×

×

×

×

×

just arbitrarily compose them like a neural network!

44/147

Page 82: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Just sums, products and distributions?

X1

X2

X3

X4

X5

X6

×

×

×

×

×

×

×

×

×

×

×

×

×

×

×

×

just arbitrarily compose them like a neural network!
⇒ structural constraints needed for tractability

44/147

Page 83: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

How do we ensure tractability?

45/147

Page 84: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Decomposability

A product node is decomposable if its children depend on disjoint sets of variables ⇒ just like in factorization!

×

X1 X2 X3

decomposable circuit

×

X1 X1 X3

non-decomposable circuit

Darwiche et al., “A knowledge compilation map”, 2002 46/147

Page 85: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Smoothness, aka completeness

A sum node is smooth if its children depend on the same set of variables ⇒ otherwise some variables are not accounted for

X1 X1

w1 w2

smooth circuit

X1 X2

w1 w2

non-smooth circuit

⇒ smoothness can be easily enforced [Shih et al. 2019]

Darwiche et al., “A knowledge compilation map”, 2002 47/147

Page 86: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAR/CON

Smoothness and decomposability enable tractable MAR/CON queries

× ×

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

48/147

Page 87: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAR/CON

Smoothness and decomposability enable tractable MAR/CON queries

If p(x, y) = p(x) p(y) (decomposability):

∫∫ p(x, y) dx dy = ∫∫ p(x) p(y) dx dy = ∫ p(x) dx ∫ p(y) dy

⇒ larger integrals decompose into easier ones

× ×

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

48/147

Page 88: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAR/CON

Smoothness and decomposability enable tractable MAR/CON queries

If p(x) = ∑_i w_i p_i(x) (smoothness):

∫ p(x) dx = ∫ ∑_i w_i p_i(x) dx = ∑_i w_i ∫ p_i(x) dx

⇒ integrals are “pushed down” to children

× ×

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

48/147

Page 89: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAR/CON

Smoothness and decomposability enable tractable MAR/CON queries

Forward pass evaluation ⇒ linear in circuit size!

E.g. to compute p(X2, X3), let the input distributions over X1 and X4 output Z
⇒ for a normalized leaf distribution, 1.0

.35 .64

.49

.58 .77

.58 .77.58 .77

1.0

.61

1.0

.83

1.0 .58 1.0 .77

48/147
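A hedged sketch of the forward pass above, reusing the tuple representation from the earlier grammar sketch: leaves whose variable is marginalized output their integral (1 for normalized leaves), so MAR is the same single bottom-up pass as EVI.

import math

def mar(node, evidence):                     # evidence holds only the observed variables
    if node[0] == "leaf":
        _, var, p = node
        if var not in evidence:
            return 1.0                       # integrate the leaf out
        return p if evidence[var] == 1 else 1.0 - p
    if node[0] == "prod":
        return math.prod(mar(c, evidence) for c in node[1])
    weights, children = node[1], node[2]
    return sum(w * mar(c, evidence) for w, c in zip(weights, children))

circuit = ("sum", [0.3, 0.7],
           [("prod", [("leaf", "X1", 0.8), ("leaf", "X2", 0.5)]),
            ("prod", [("leaf", "X1", 0.2), ("leaf", "X2", 0.9)])])
print(mar(circuit, {"X2": 1}))               # p(X2=1) = 0.3*0.5 + 0.7*0.9 = 0.78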

Page 90: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Determinism, aka selectivity

A sum node is deterministic if the output of only one child is non-zero for any input ⇒ e.g. if its children's distributions have disjoint support

×

X1 ≤ θ X2

×

X1 > θ X2

w1 w2

deterministic circuit

×

X1 X2

×

X1 X2

w1 w2

non-deterministic circuit 49/147

Page 91: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

× ×

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

50/147

Page 92: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

If p(q, e) = p(qx, ex, qy, ey) = p(qx, ex) p(qy, ey) (decomposable product node):

argmax_q p(q | e) = argmax_q p(q, e) = argmax_{qx, qy} p(qx, ex, qy, ey) = (argmax_{qx} p(qx, ex), argmax_{qy} p(qy, ey))

⇒ solving the optimizations independently

× ×

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

50/147

Page 93: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

If p(q, e) = ∑_i w_i p_i(q, e) = w_c p_c(q, e) (deterministic sum node):

argmax_q p(q, e) = argmax_q ∑_i w_i p_i(q, e) = argmax_q max_i w_i p_i(q, e) = argmax_q w_c p_c(q, e)

⇒ only one non-zero child c

× ×

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

50/147

Page 94: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

Evaluating the circuit twice: bottom-up and top-down ⇒ still linear in circuit size!

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

50/147

Page 95: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

Evaluating the circuit twice: bottom-up and top-down ⇒ still linear in circuit size!

In practice:

1. turn sum into max nodes

2. evaluate p(e) bottom-up

3. retrieve max activations top-down

4. compute MAP queries at leaves

× ×

max

max max

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

50/147

Page 96: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

Evaluating the circuit twice: bottom-up and top-down ⇒ still linear in circuit size!

In practice:

1. turn sum into max nodes

2. evaluate p(e) bottom-up

3. retrieve max activations top-down

4. compute MAP queries at leaves

× ×

max

max max

× ×× ×

X1

X2

X1

X2

X3 X4 X3 X4

50/147

Page 97: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

Evaluating the circuit twice: bottom-up and top-down ⇒ still linear in circuit size!

In practice:

1. turn sum into max nodes

2. evaluate p(e) bottom-up

3. retrieve max activations top-down

4. compute MAP queries at leaves

× ×

× ×× ×

X1

X2

0

1

1 X4 X3 0

50/147

Page 98: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable MAP

The addition of determinism enables tractable MAP queries!

Evaluating the circuit twice: bottom-up and top-down ⇒ still linear in circuit size!

In practice:

1. turn sum into max nodes

2. evaluate p(e) bottom-up

3. retrieve max activations top-down

4. compute MAP queries at leaves

× ×

× ×× ×

X1

X2

0

1

1 X4 X3 0

50/147
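A hedged sketch of the procedure above, on the same tuple representation (a toy deterministic circuit: the sum's children put all their mass on opposite values of X1, so at most one child is non-zero for any input). The recursion below folds the bottom-up max and the top-down traceback into one pass that returns both the MAP value and the maximizing assignment.

def map_query(node):
    if node[0] == "leaf":                        # leaf: return its mode
        _, var, p = node
        return (p, {var: 1}) if p >= 0.5 else (1.0 - p, {var: 0})
    if node[0] == "prod":                        # decomposable product:
        value, assignment = 1.0, {}              # maximize each child independently
        for child in node[1]:
            v, a = map_query(child)
            value *= v
            assignment.update(a)
        return value, assignment
    weights, children = node[1], node[2]         # deterministic sum: keep the best branch
    scored = [(w * v, a) for w, (v, a) in zip(weights, map(map_query, children))]
    return max(scored, key=lambda t: t[0])

circuit = ("sum", [0.4, 0.6],
           [("prod", [("leaf", "X1", 1.0), ("leaf", "X2", 0.3)]),
            ("prod", [("leaf", "X1", 0.0), ("leaf", "X2", 0.8)])])
print(map_query(circuit))                        # (0.48, {'X1': 0, 'X2': 1})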

Page 99: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Approximate MAP

If the probabilistic circuit is non-deterministic, MAP is intractable: ⇒ e.g. with latent variables Z

argmax_q ∑_i w_i p_i(q, e) = argmax_q ∑_z p(q, z, e) ≠ argmax_q max_z p(q, z, e)

However, the same two-step algorithm is still used as an approximation to MAP [Liu et al. 2013; Peharz et al. 2016]

51/147

Page 100: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Structured decomposability

A product node is structured decomposable if it decomposes according to a node in a vtree ⇒ a stronger requirement than decomposability

X3

X1 X2

vtree

×

X1 X2

×

X1 X2

X3

×

×

X1 X2

×

X1 X2

X3

×

structured decomposable circuit52/147

Page 101: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Structured decomposability

A product node is structured decomposable if it decomposes according to a node in a vtree ⇒ a stronger requirement than decomposability

X3

X1 X2

vtree

×

X1 X2

×

X2 X3

X2

×

×

X2 X3

×

X1 X2

X3

×

non structured decomposable circuit52/147

Page 102: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Structured decomposability enables tractable …

Entropy of probabilistic circuit [Liang et al. 2017b]

Symmetric and group queries (exactly-k, odd-number, more, etc.) [Bekker et al. 2015]

For the “right” vtree

Probability of logical circuit event in probabilistic circuit [ibid.]

Multiply two probabilistic circuits [Shen et al. 2016]

KL Divergence between probabilistic circuits [Liang et al. 2017b]

Same-decision probability [Oztok et al. 2016]

Expected same-decision probability [Choi et al. 2017]

Expected classifier agreement [Choi et al. 2018]

Expected predictions [Khosravi et al. 2019b]53/147

Page 103: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Structured decomposability enables tractable …

Entropy of probabilistic circuit [Liang et al. 2017b]

Symmetric and group queries (exactly-k, odd-number, more, etc.) [Bekker et al. 2015]

For the “right” vtree

Probability of logical circuit event in probabilistic circuit [ibid.]

Multiply two probabilistic circuits [Shen et al. 2016]

KL Divergence between probabilistic circuits [Liang et al. 2017b]

Same-decision probability [Oztok et al. 2016]

Expected same-decision probability [Choi et al. 2017]

Expected classifier agreement [Choi et al. 2018]

Expected predictions [Khosravi et al. 2019b]53/147

Page 104: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Stay Tuned For …

Next:

1. How are probabilistic circuits related to logical ones? ⇒ a historical perspective

2. How do probabilistic circuits in the literature relate and differ? ⇒ SPNs, ACs, CNets, PSDDs

3. How can classical tractable models be turned into a circuit? ⇒ Compiling low-treewidth PGMs

After: How do I build my own probabilistic circuit? 54/147

Page 105: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractability to other semi-rings

Tractable probabilistic inference exploits efficient summation for decomposable functions in the probability commutative semiring: (ℝ, +, ×, 0, 1)

Analogously, efficient computations can be done in other semirings: (S, ⊕, ⊗, 0_⊕, 1_⊗) ⇒ Algebraic model counting [Kimmig et al. 2017], Semiring programming [Belle et al. 2016]

Historically, very well studied for Boolean functions: (B = {0, 1}, ∨, ∧, 0, 1) ⇒ logical circuits! 55/147

Page 106: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Logical circuits

∧ ∧

X4 X3

∨ ∨

∧ ∧∧ ∧

X3 X4

X1 X2 X1 X2

s/d-D/DNFs[Darwiche et al. 2002]

O/BDDs[Bryant 1986]

SDDs[Darwiche 2011]

Logical circuits are compact representations for boolean functions…56/147

Page 107: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Logical circuits: structural properties

…and, as for probabilistic circuits, one can define structural properties: (structured) decomposability, smoothness and determinism, allowing for tractable computations

Darwiche et al., “A knowledge compilation map”, 2002 57/147

Page 108: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Logical circuits: a knowledge compilation map

…inducing a hierarchy of tractable query classes

Darwiche et al., “A knowledge compilation map”, 2002 58/147

Page 109: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Logical circuits: connection to probabilistic circuits through WMC

A task called weighted model counting (WMC):

WMC(∆, w) = ∑_{x ⊨ ∆} ∏_{l ∈ x} w(l)

Two decades' worth of connections:
1. Encode the probabilistic model as WMC (add variable placeholders for parameters)
2. Compile ∆ into a d-DNNF (or OBDD, SDD, etc.)
3. Tractable MAR/CON by tractable WMC on the circuit
4. Depending on the WMC encoding, even tractable MAP

End result equivalent to a probabilistic circuit: efficiently replace parameter variables in the logical circuit by edge parameters in the probabilistic circuit

59/147
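A brute-force sketch of the WMC definition above (hypothetical formula and literal weights): a compiled d-DNNF or probabilistic circuit computes the same quantity in time linear in circuit size instead of enumerating all assignments.

from itertools import product

variables = ["A", "B"]
delta = lambda x: x["A"] or x["B"]                  # the formula Delta: A or B
w = {("A", True): 0.3, ("A", False): 0.7,           # literal weights w(l)
     ("B", True): 0.6, ("B", False): 0.4}

wmc = 0.0
for values in product([True, False], repeat=len(variables)):
    x = dict(zip(variables, values))
    if delta(x):                                    # keep only models of Delta
        weight = 1.0
        for var, val in x.items():
            weight *= w[(var, val)]
        wmc += weight
print(wmc)                                          # 1 - 0.7*0.4 = 0.72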

Page 110: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A B

A = 0 A = 1 B = 0 B = 1

C = 0 C = 1

× ×

× ×

D = 0 D = 1

60/147

Page 111: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A B

Bottom-up compilation: starting from leaves…

60/147

Page 112: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A B

…compile a leaf CPT

A = 0 A = 1

.3 .7

p(A|C = 0)

60/147

Page 113: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A B

…compile a leaf CPT

A = 0 A = 1

.6 .4

p(A|C = 1)

60/147

Page 114: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A B

…compile a leaf CPT…for all leaves…

A = 0 A = 1 B = 0 B = 1

p(A|C) p(B|C)

60/147

Page 115: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A B

…and recurse over parents…

A = 0 A = 1 B = 0 B = 1

C = 0 C = 1

× ×

.2.8

p(C|D = 0)

60/147

Page 116: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A B

…while reusing previously compiled nodes!…

A = 0 A = 1 B = 0 B = 1

C = 0 C = 1

× ×

.9

.1

p(C|D = 1)

60/147

Page 117: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

From trees to circuits, via compilation

D

C

A BA = 0 A = 1 B = 0 B = 1

C = 0 C = 1

× ×

× ×

D = 0 D = 1

.5 .5

p(D)

60/147

Page 118: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Low-treewidh PGMs

Trees, polytrees and Thin Junction Trees can be turned into

decomposable

smooth

deterministic

circuits

Therefore they support tractable

EVI

MAR/CON

MAP

D

C

A B

61/147

Page 119: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Arithmetic Circuits (ACs)

ACs [Darwiche 2003] are

decomposable

smooth

deterministic

They support tractable

EVI

MAR/CON

MAP

⇒ parameters are attached to the leaves ⇒ …but can be moved to the sum node edges [Rooshenas et al. 2014]

⇒ Also see related AND/OR search spaces [Dechter et al. 2007]

Lowd et al., “Learning Markov Networks With Arithmetic Circuits”, 2013 62/147

Page 120: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Sum-Product Networks (SPNs)

SPNs [Poon et al. 2011] are

decomposable

smooth

deterministic

They support tractable

EVI

MAR/CON

MAP

× ×

× ×× ×

X3 X4

X1 X2 X1 X2

0.3 0.7

0.5 0.5 0.9 0.1

⇒ deterministic SPNs are also called selective [Peharz et al. 2014a]

63/147

Page 121: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Cutset Networks (CNets)

A CNet [Rahman et al. 2014] is a weighted model tree [Dechter et al. 2007] whose leaves are tree Bayesian networks

X1

X2 X3

C1

C2 C30.4 0.6

X4X5

X6 X3

0.7

X5X6

X4 X2

0.8

X5X4

X6 X2

0.2

X3X6

X5 X4

0.3

⇒ they can be represented as probabilistic circuits 64/147

Page 122: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

CNets as probabilistic circuits

Every decision node in the CNet can be represented as a deterministic, smooth sum node

X1

M′X\1

M′′X\1

C1

C2 C3

M′X\1

M′′X\1

w10 w1

1 = × ×

w10 w1

1

M′X\1,2

λX1=0

M′X\1,3

λX1=1

M′X\1,2

λX1=0

M′X\1,3

λX1=1

and we can recurse on each child node until a BN tree is reached ⇒ compilable into a deterministic, smooth and decomposable circuit!

65/147

Page 123: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

CNets as probabilistic circuits

CNets are

decomposable

smooth

deterministic

They support tractable

EVI

MAR/CON

MAP

× ×

w10 w1

1

M′X\1,2

λX1=0

M′X\1,3

λX1=1

M′X\1,2

λX1=0

M′X\1,3

λX1=1

⇒ EVI can be computed in O(|X|)

66/147

Page 124: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Probabilistic Sentential Decision Diagrams

PSDDs [Kisa et al. 2014a] are

structured decomposable

smooth

deterministic

They support tractable

EVI

MAR/CON

MAP

Complex queries!

Kisa et al., “Probabilistic sentential decision diagrams”, 2014
Choi et al., “Tractable learning for structured probability spaces: A case study in learning preference distributions”, 2015
Shen et al., “Conditional PSDDs: Modeling and learning with modular knowledge”, 2018 67/147

Page 125: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

[Figure: existing models placed along two axes, expressive efficiency vs. number of tractable queries: fully factorized models, NB, trees, polytrees, TJTs, LTMs and mixtures lie on the more-tractable but less-expressive side; BNs, MNs, NADEs, NFs, VAEs and GANs on the more-expressive but less-tractable side; a question mark sits where a model that is both would go]

where are probabilistic circuits? 68/147

Page 126: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

[Figure: the same plane with probabilistic circuits added: ACs, SPNs, PSDDs, CNets and AoGs occupy the region combining many tractable queries with high expressive efficiency]

tractability vs expressive efficiency 69/147

Page 127: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

How expressive are probabilistic circuits?

Measuring average test set log-likelihood on 20 density estimation benchmarks

Comparing against intractable models:

Bayesian networks (BN) [Chickering 2002] with sophisticated context-specific CPDs

MADEs [Germain et al. 2015]

VAEs [Kingma et al. 2014] (IWAE ELBO [Burda et al. 2015])

Gens et al., “Learning the Structure of Sum-Product Networks”, 2013
Peharz et al., “Probabilistic deep learning using random sum-product networks”, 2018 70/147

Page 128: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

How expressive are probabilistic circuits?
density estimation benchmarks

dataset     best circuit   BN        MADE      VAE
nltcs       -5.99          -6.02     -6.04     -5.99
msnbc       -6.04          -6.04     -6.06     -6.09
kdd         -2.12          -2.19     -2.07     -2.12
plants      -11.84         -12.65    -12.32    -12.34
audio       -39.39         -40.50    -38.95    -38.67
jester      -51.29         -51.07    -52.23    -51.54
netflix     -55.71         -57.02    -55.16    -54.73
accidents   -26.89         -26.32    -26.42    -29.11
retail      -10.72         -10.87    -10.81    -10.83
pumbs*      -22.15         -21.72    -22.3     -25.16
dna         -79.88         -80.65    -82.77    -94.56
kosarek     -10.52         -10.83    -         -10.64
msweb       -9.62          -9.70     -9.59     -9.73
book        -33.82         -36.41    -33.95    -33.19
movie       -50.34         -54.37    -48.7     -47.43
webkb       -149.20        -157.43   -149.59   -146.9
cr52        -81.87         -87.56    -82.80    -81.33
c20ng       -151.02        -158.95   -153.18   -146.9
bbc         -229.21        -257.86   -242.40   -240.94
ad          -14.00         -18.35    -13.65    -18.81

71/147

Page 129: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Building circuits

Page 130: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable Learning

A learner L is a tractable learner for a class of queries Q iff

(1) for any dataset D, learner L(D) runs in time O(poly(|D|)), and
(2) outputs a probabilistic model that is tractable for queries Q.

73/147

Page 131: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable Learning

A learner L is a tractable learner for a class of queries Q iff

(1) for any dataset D, learner L(D) runs in time O(poly(|D|)), and
⇒ Guarantees the learned model has size O(poly(|D|))
⇒ Guarantees the learned model has size O(poly(|X|))

(2) outputs a probabilistic model that is tractable for queries Q.

73/147

Page 132: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Tractable Learning

A learner L is a tractable learner for a class of queries Q iff

(1) for any dataset D, learner L(D) runs in time O(poly(|D|)), and
⇒ Guarantees the learned model has size O(poly(|D|))
⇒ Guarantees the learned model has size O(poly(|X|))

(2) outputs a probabilistic model that is tractable for queries Q.
⇒ Guarantees efficient querying for Q in time O(poly(|X|))

73/147

Page 133: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Stay Tuned For …

Next:

1. How to learn circuit parameters?
⇒ convex optimization, EM, SGD, Bayesian learning, …

2. How to learn the structure of circuits?
⇒ local search, random structures, ensembles, …

3. How to compile other models to circuits?
⇒ PGM compilation, probabilistic databases, probabilistic programming

After: What is this used for? 74/147

Page 134: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Learning circuit parameters

A sum node distribution p(X) can be interpreted as the marginal distribution of p(X, Z) over X and a latent variable Z

p(X|Z = k) child distribution

p(Z = k) = wk weight

Even leaf distributions could be parametrized by θ

Learning parameters involves learning both sum and leaf parameters (w, θ)

[Figure: a sum node with weights w1, w2 over two leaf distributions on X1]

75/147

Page 135: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Learning circuit parameters

deterministic circuits:
  closed-form, convex optimization [Kisa et al. 2014b; Liang et al. 2019]

non-deterministic circuits:
  SGD [Peharz et al. 2018]
  soft/hard EM [Poon et al. 2011; Peharz 2015]
  Bayesian moment matching [Jaini et al. 2016]
  collapsed variational Bayes [Zhao et al. 2016a]
  CCCP [Zhao et al. 2016b]
  Extended Baum-Welch [Rashwan et al. 2018]

76/147

Page 136: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Deterministic circuits

Given a deterministic circuit and a complete dataset D, maximize the likelihood of the parameters given the examples in the dataset

θ_MLE = argmax_θ L(θ; D) = argmax_θ ∏_i p_θ(d_i)

With determinism, L decomposes over the parameters, and θ_MLE has a closed-form solution

⇒ compute sufficient statistics (just count)
⇒ a single pass over the dataset is required!

Kisa et al., “Probabilistic sentential decision diagrams”, 2014
Liang et al., “Learning Logistic Circuits”, 2019 77/147
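A minimal sketch of this closed-form estimation (assumed interfaces, not the cited papers' code): with determinism each example activates exactly one child of every sum node it reaches, so the ML weights are smoothed, normalized counts collected in a single pass. `reached_child(node, x)` is a hypothetical helper returning the unique child of a sum node that is non-zero on example x (or None if x does not reach the node), and sum nodes are assumed to expose a `children` list.

from collections import defaultdict

def mle_sum_weights(sum_nodes, data, reached_child, alpha=1.0):
    # count, for every sum node, how often each child is "selected" by an example
    counts = {s: defaultdict(float) for s in sum_nodes}
    for x in data:
        for s in sum_nodes:
            c = reached_child(s, x)
            if c is not None:
                counts[s][c] += 1.0
    # normalize the (Laplace-smoothed) counts into weights
    weights = {}
    for s in sum_nodes:
        total = sum(counts[s].values()) + alpha * len(s.children)
        weights[s] = {c: (counts[s][c] + alpha) / total for c in s.children}
    return weights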

Page 137: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Hard/Soft Parameter Updating
Gradient Descent

Compute the likelihood gradient and optimize by gradient descent (GD)

∆w_pc:

Soft gradient
  generative:      ∇_{w_pc} log S(x) = S_c(x) · ∂S(x)/∂S_p(x) / S(x)
  discriminative:  ∇_{w_pc} log S(y|x) = ∇_{w_pc} S(y|x) / S(y|x) − ∇_{w_pc} S(∗|x) / S(∗|x)

Hard gradient
  generative:      ∇_{w_pc} log M(x) = #{w_pc ∈ W_x} / w_pc
  discriminative:  ∇_{w_pc} log M(y|x) = ( #{w_pc ∈ W_{y|x}} − #{w_pc ∈ W_{1|x}} ) / w_pc

Gens et al., “Discriminative Learning of Sum-Product Networks”, 2012 78/147

Page 138: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Hard/Soft Parameter Updating
Expectation Maximization

…or using EM by considering each sum node as the marginalization of a hidden variable

Soft posterior:  p(H_p = c | x) ∝ (1 / S(x)) · ∂S(x)/∂S_p(x) · S_c(x) · w_pc

Hard posterior:  p(H_p = c | x) = 1 if w_pc ∈ W_x, 0 otherwise

Peharz et al., “On the Latent Variable Interpretation in Sum-Product Networks”, 2016 79/147
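A toy sketch of one soft-EM update for a single sum node under its latent-variable interpretation (assumed set-up, not the papers' code): given each child's log-likelihood on every example, compute the posterior responsibilities and renormalize their expected counts. For an internal sum node the responsibility additionally carries the backpropagated term ∂S(x)/∂S_p(x); here the sum node is assumed to be the circuit root, where that term is 1.

import numpy as np

def soft_em_step(weights, child_lls):
    """weights: (K,) current sum weights; child_lls: (N, K) log-likelihoods of the children."""
    log_joint = child_lls + np.log(weights)                          # log w_k + log S_k(x_n)
    log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    resp = np.exp(log_post)                                          # p(H = k | x_n)
    new_w = resp.sum(axis=0)                                         # expected counts
    return new_w / new_w.sum()

w = np.array([0.5, 0.5])
child_lls = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]))
print(soft_em_step(w, child_lls))                                    # -> [0.633..., 0.366...]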

Page 139: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Bayesian Parameter Learning

Bayesian learning starts by expressing a prior p(w) over the weights
⇒ learning corresponds to computing the posterior based on the data

p(w|D) ∝ p(w)p(D|w)

the posterior is intractable:
assuming a prior p(w) = ∏_{i ∈ sumNodes} Dir(w_i | α_i)
and considering circuits with normalized weights w_ij ≥ 0, ∑_j w_ij = 1 for all i ∈ sumNodes,
the posterior becomes a mixture of products of Dirichlets
whose number of mixture components is exponential in the number of sum nodes

80/147

Page 140: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Bayesian Parameter Learning

Moment matching (oBMM): approximate the posterior after each update with a tractable distribution that matches some moments of the exact, but intractable, posterior

the joint p(w) is approximated by a product of Dirichlets

the first and second moments of each marginal p(w_i) are used to set the hyperparameters α_i of each Dirichlet in the product

oBMM has been extended to continuous models with Gaussian leaves

CVB-SPN: a collapsed variational inference algorithm
  better results than oBMM

Rashwan et al., “Online and Distributed Bayesian Moment Matching for Parameter Learning in Sum-Product Networks”, 2016
Jaini et al., “Online Algorithms for Sum-Product Networks with Continuous Variables”, 2016
Zhao et al., “Collapsed Variational Inference for Sum-Product Networks”, 2016 81/147
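A sketch of the moment-matching projection behind oBMM-style updates (an assumed form, not the authors' code): given the first and second moments of each weight under the exact but intractable posterior, recover the hyperparameters of the approximating Dirichlet whose marginals match them.

import numpy as np

def dirichlet_from_moments(m1, m2):
    """m1[j] = E[w_j], m2[j] = E[w_j^2] under the exact posterior; returns Dirichlet alphas."""
    # For Dir(alpha): E[w_j] = a_j / a0 and E[w_j^2] = a_j (a_j + 1) / (a0 (a0 + 1)),
    # hence a0 = (m1_j - m2_j) / (m2_j - m1_j^2); average over j for numerical stability.
    a0 = np.mean((m1 - m2) / (m2 - m1 ** 2))
    return m1 * a0

# moments of Dir(3, 2): E[w] = [0.6, 0.4], E[w^2] = [0.4, 0.2]
print(dirichlet_from_moments(np.array([0.6, 0.4]), np.array([0.4, 0.2])))   # -> [3. 2.]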

Page 141: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Parameter Learning
Sequential monomial approximation & concave-convex procedure

Any complete and decomposable circuit is equivalent to a mixture of trees, where each tree corresponds to a product of univariate distributions

learning the parameters by the MLE principle can be formulated as a signomial program ⇒ Sequential Monomial Approximation (SMA)

the signomial program formulation can be equivalently transformed into a difference of convex functions ⇒ Concave-Convex Procedure (CCCP)

Zhao et al., “A Unified Approach for Learning the Parameters of Sum-Product Networks”, 2016 82/147

Page 142: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Structure learning

Greedy layerwise: LearnSPN and variants

Structure learning as search: defining operators

Local search: LearnPSDD

Random structures: XCNets, RAT-SPNs 83/147

Page 143: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

LearnSPN

[Figure: a data matrix over instances (rows 1–8) and RVs X1, …, X5 (columns)]

Learning both structure and parameters of a circuit by starting from a data matrix

Gens et al., “Learning the Structure of Sum-Product Networks”, 2013 84/147

Page 144: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

LearnSPN

Looking for sub-populations in the data (clustering) to introduce sum nodes…

Gens et al., “Learning the Structure of Sum-Product Networks”, 2013 84/147

Page 145: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

LearnSPN

…seeking independencies among sets of RVs to factorize into product nodes

Gens et al., “Learning the Structure of Sum-Product Networks”, 2013 84/147

Page 146: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

LearnSPN

…learning smaller estimators as a recursive data crawler

Gens et al., “Learning the Structure of Sum-Product Networks”, 2013 84/147
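A high-level skeleton of this recursion (a sketch, not the authors' code): alternately cluster rows to introduce sum nodes and split approximately independent column groups to introduce product nodes, recursing until a univariate slice is left. `data` is assumed to be a 2-D numpy array; `cluster_rows` and `independent_columns` are hypothetical helpers standing in for, e.g., EM/k-means clustering and pairwise independence tests; a real implementation also handles the degenerate case where clustering returns a single group.

def learn_spn(data, scope, cluster_rows, independent_columns, min_rows=30):
    if len(scope) == 1:
        return ("leaf", scope[0], data[:, scope[0]])       # fit a univariate model here
    if len(data) >= min_rows:
        groups = independent_columns(data, scope)          # partition of the scope
        if len(groups) > 1:                                # product node over independent groups
            return ("product",
                    [learn_spn(data, g, cluster_rows, independent_columns, min_rows)
                     for g in groups])
    parts = cluster_rows(data, scope)                      # list of row-index arrays
    weights = [len(p) / len(data) for p in parts]          # sum node over row clusters
    return ("sum", weights,
            [learn_spn(data[p], scope, cluster_rows, independent_columns, min_rows)
             for p in parts])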

Page 147: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

LearnSPN variants

ID-SPN [Rooshenas et al. 2014]

LearnSPN-b/T/B [Vergari et al. 2015]

for heterogeneous data [Bueff et al. 2018; Molina et al. 2018]

using k-means [Butz et al. 2018a] or SVD splits [Adel et al. 2015]

learning DAGs [Dennis et al. 2015; Jaini et al. 2018]

approximating independence tests [Di Mauro et al. 2018]

85/147

Page 148: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

ID-SPN

ID-SPN works like LearnSPN: clustering instances and variables for sum and product nodes

start with a single AC representing a tractableMarkov network

stop the process before reaching univariatedistributions

learn a tractable MN represented by an ACfactorizing a multivariate distribution

⇒ SPNs with tractable multivariate distributions as leaves—MN ACs

Rooshenas et al., “Learning Sum-Product Networks with Direct and Indirect Variable Interactions”, 2014 86/147

Page 149: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Other variants

Bottom up learning [Peharz et al. 2013]

starting from simple models over small variable scopes

growing models over larger variable scopes, building successively more expressive models guided by dependence tests and a maximum mutual information principle

Greedy for deterministic circuits [Peharz et al. 2014a]

hill climbing, transforming the network with split and merge operations

Graph SPNs from tree SPNs by merging similar sub-structures [Rahman et al. 2016b]

bottom-up merging of sub-SPNs with similar distributions defined over the same variables

87/147

Page 150: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Cut(e)set Network

For deterministic circuits, structure scores decompose.

CNet likelihood decomposition:
  L_D(G; θ) = ∑_i α_i + L_{D_i}(G_i; θ_i)

BIC score decomposition (accept a refinement G′ of G when):
  L_D(G′; θ′) − L_D(G; θ) > (log M) / 2

Structure Learning

start with a single tractable multivariate model (CLT)

substitute a leaf node with the best CNet improving both the LL and the BIC score

[Figure: the example CNet again: an OR tree over X1, X2, X3 with edge weights 0.4/0.6, 0.7, 0.8, 0.2, 0.3 and tree Bayesian networks at the leaves]

Di Mauro et al., “Learning Accurate Cutset Networks by Exploiting Decomposability”, 2015 88/147

Page 151: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

PSDD Structure Learning
Learning a vtree

A variable tree (vtree)

a full binary tree

leaves are labeled with variables

internal vtree nodes split the variables into those appearing in the left subtree X and those in the right subtree Y

it can be learned from data in a top-down or bottom-up fashion
⇒ maximising pairwise MI instead of joint MI

Liang et al., “Learning the structure of probabilistic sentential decision diagrams”, 2017 89/147

Page 152: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

PSDD Structure Learning
Local operations

incrementally change the PSDD structure while preserving syntactic soundness

Liang et al., “Learning the structure of probabilistic sentential decision diagrams”, 2017 90/147

Page 153: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

LearnPSDD

LearnPSDD incrementally improves the structure of a PSDD to better fit the data

in every step, the structure is changed by executing an operation

learning continues until the log-likelihood on validation data stagnates, or a desired time or size limit is reached

the operation to execute is greedily chosen based on the best likelihood improvement per size increment

score = ( log L(r′ | D) − log L(r | D) ) / ( size(r′) − size(r) )

Liang et al., “Learning the structure of probabilistic sentential decision diagrams”, 2017 91/147

Page 154: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Learning Logistic Circuits

propagates values and parameters bottom-up

logistic function at the root node with weight function g_r(x)

Pr(Y = 1 | x) = 1 / (1 + exp(−g_r(x)))

Liang et al., “Learning Logistic Circuits”, 2019 92/147

Page 155: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Learning Logistic Circuits

Parameter Learning

Due to decomposability and determinism, any logistic circuit model can be reduced to a logistic regression model over a particular feature set

Pr(Y = 1 | x) = 1 / (1 + exp(−Xθ))

where X is a vector of features extracted from the raw example x

Structure Learning

use split operations as in LearnPSDD

Liang et al., “Learning Logistic Circuits”, 2019 93/147
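A sketch of the reduction above (assumed feature extraction, not the paper's code): once the circuit's features φ(x) have been computed for every example, learning the parameters is plain logistic regression, e.g. with scikit-learn.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_logistic_circuit_params(phi, y, C=1.0):
    """phi: (N, num_circuit_features) feature activations extracted from the circuit; y: (N,) labels."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(phi, y)
    return clf.coef_.ravel(), clf.intercept_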

Page 156: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Bayesian Structure Learning
A prior distribution for SPN trees

The priors are defined recursively, node by node

the prior of each sum node s is a Dirichlet Process, with concentration parameter α_s and base distribution G_P(s)

G_P(s): a probability distribution over the set of possible product nodes with scope s

the prior distribution over SPNs is specified as a tree of Dirichlet Processes over product nodes

The model is straightforwardly altered for DAGs using hierarchical Dirichlet Processes

Lee et al., “Non-Parametric Bayesian Sum-Product Networks”, 2014 94/147

Page 157: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

ABDA
Automatic Bayesian Density Analysis

Overcoming the problem, in density estimation, of assuming homogeneous RVs and shallow dependency structures

ABDA relies on SPNs to capture statistical dependencies in heterogeneous data at different granularities through a hierarchical co-clustering

inference for both the statistical data types and (parametric) likelihood models
robust estimation of missing values
detection of corrupt or anomalous data
automatic discovery of the statistical dependencies and local correlations in the data

Vergari et al., “Automatic Bayesian Density Analysis”, 2018 95/147

Page 158: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

ABDA
Generative model

Z_n^S ∼ Cat(Ω^S),  Ω^S ∼ Dir(γ)

w_j^d ∼ Dir(α),  s_{j,n}^d ∼ Cat(w_j^d)

prior on η_{j,l}^d parametrized with λ_l^d

Vergari et al., “Automatic Bayesian Density Analysis”, 2018 96/147

Page 159: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Bayesian SPNs
Learning both the structure and parameters

A well-principled Bayesian approach to SPN learning, simultaneously over both structure and parameters

the structure learning problem decomposes into two steps:
1. proposing a computational graph
   ⇒ laying out the arrangement of sums, products and leaf distributions
2. learning the scope function, which assigns to each node its scope

Trapp et al., “Bayesian Learning of Sum-Product Networks”, 2019 97/147

Page 160: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Bayesian SPNs
Generative model

Trapp et al., “Bayesian Learning of Sum-Product Networks”, 2019 98/147

Page 161: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Randomized structure learning: RAT-SPNs

Random Tensorized SPNs (RAT-SPNs)

SPNs are obtained by first constructing a random region graph

subsequently populating the region graph with tensors of SPN nodes

implemented in TensorFlow and easily optimized using automatic differentiation, SGD, and automatic GPU-parallelization

implementing an SPN dropout heuristic with an elegant probabilistic interpretation: marginalization of missing features (dropout at inputs) and injection of discrete noise (dropout at sum nodes)

comparable to DNNs; complete joint distribution over the variables; robust in the presence of missing features; well-calibrated uncertainty estimates over their inputs

Peharz et al., “Probabilistic deep learning using random sum-product networks”, 2018 99/147

Page 162: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

RAT-SPNs
Losses

Generative training (EM):       LL = (1/N) ∑_{n=1}^N log S(x_n)

Discriminative training (SGD):  CE = −(1/N) ∑_{n=1}^N log ( S_{y_n}(x_n) / ∑_{y′} S_{y′}(x_n) )

Hybrid training (SGD):          O = λ · CE − (1 − λ) · LL / |X|

More details and results during the UAI oral session, tomorrow at 2:30pm

100/147
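A sketch of these three objectives with numpy (assumed shapes, not the authors' TensorFlow code): `log_Sy[n, y]` holds log S_y(x_n) for each class-wise root, so LL uses a logsumexp over the roots while CE is the usual softmax cross-entropy over the same values.

import numpy as np

def rat_spn_objective(log_Sy, labels, num_features, lam=0.5):
    """log_Sy: (N, C) log outputs of the class-wise roots; labels: (N,) integer classes."""
    log_mix = np.logaddexp.reduce(log_Sy, axis=1)                      # log sum_y S_y(x_n)
    ll = np.mean(log_mix)                                              # generative LL term
    ce = -np.mean(log_Sy[np.arange(len(labels)), labels] - log_mix)    # discriminative CE term
    return lam * ce - (1.0 - lam) * ll / num_features                  # hybrid objective O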

Page 163: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Ensembles of Probabilistic Circuits

To mitigate issues like the limited accuracy of a single model and its tendency to overfit, circuits can be employed as the components of a mixture

p(X) = ∑_{i=1}^K λ_i C_i(X),   λ_i ≥ 0,  ∑_{i=1}^K λ_i = 1

Employing EM to alternately learn both the weights and the mixture components
⇒ convergence and instability issues of EM make this impractical

101/147

Page 164: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Ensembles of Probabilistic Circuits

Bagging Probabilistic Circuits

more efficient than EM

mixture coefficients are set to be uniform

mixture components can be learned independently on different bootstraps

Adding random subspace projections to bagged networks (like for CNets)

more efficient than bagging

Di Mauro et al., “Learning Accurate Cutset Networks by Exploiting Decomposability”, 2015
Di Mauro et al., “Learning Bayesian Random Cutset Forests”, 2015 102/147
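A sketch of bagging circuits into a uniform mixture (assumed interfaces, not the papers' code): each component is learned independently on a bootstrap resample, mixture coefficients are uniform, and the ensemble log-likelihood is a logsumexp over the component log-likelihoods. `learn_circuit` is a hypothetical learner returning a model with a `logpdf(x)` method.

import numpy as np

def bag_circuits(D, learn_circuit, K, rng):
    n = len(D)
    return [learn_circuit(D[rng.integers(0, n, size=n)]) for _ in range(K)]

def ensemble_logpdf(components, x):
    lls = np.array([c.logpdf(x) for c in components])
    return np.logaddexp.reduce(lls) - np.log(len(components))          # uniform mixture weights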

Page 165: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Ensembles of Probabilistic Circuits

Boosting Probabilistic Circuits

BDE: boosting density estimation
  sequentially grows the ensemble, adding a weak base learner at each stage
  at each boosting step m, find a weak learner c_m and a coefficient η_m maximizing the weighted LL of the new model

  f_m = (1 − η_m) f_{m−1} + η_m c_m

GBDE: a kernel-based generalization of BDE (an AdaBoost-style algorithm)
  sequential EM: at each step m, jointly optimize η_m and c_m keeping f_{m−1} fixed

Rahman et al., “Learning Ensembles of Cutset Networks”, 2016 103/147

Page 166: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Extremely Randomized CNets: XCNets

Learning both the structure and the parameters of a CNet amounts to searching in the space of all probabilistic weighted model trees

a problem tackled in a two-stage greedy fashion:
1. performing a top-down search in the space of weighted OR trees
2. learning TPMs as leaf distributions

[Figure: the example CNet again: an OR tree over X1, X2, X3 with edge weights 0.4/0.6, 0.7, 0.8, 0.2, 0.3 and tree Bayesian networks at the leaves]

104/147

Page 167: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

XCNets
LearnCNet(D, X, α, δ, σ)

1:  Input: a dataset D over RVs X; α; δ: min number of samples; σ: min number of features
2:  Output: a CNet C encoding p_C(X) learned from D
3:  if |D| > δ and |X| > σ then
4:      X_i ← select(D, X, α)                          ▷ select the RV to condition on
5:      D0 ← {ξ ∈ D : ξ[X_i] = 0},  D1 ← {ξ ∈ D : ξ[X_i] = 1}
6:      w0 ← |D0| / |D|,  w1 ← |D1| / |D|
7:      C ← w0 · LearnCNet(D0, X\i, α, δ, σ) + w1 · LearnCNet(D1, X\i, α, δ, σ)
8:  else
9:      C ← learnLeafDistribution(D, X, α)
10: return C

XCNets (Extremely Randomized CNets): select chooses one RV at random

Di Mauro et al., “Fast and Accurate Density Estimation with Extremely Randomized Cutset Networks”, 2017 105/147
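A minimal Python rendering of the pseudocode above (a sketch under the stated assumptions, not the authors' code): XCNets instantiate `select` as a uniformly random choice of the conditioning variable, and for brevity the leaf learner here fits smoothed fully factorized Bernoulli marginals instead of a Chow-Liu tree.

import numpy as np

def learn_cnet(D, scope, rng, delta=100, sigma=2, alpha=0.1):
    if len(D) > delta and len(scope) > sigma:
        i = int(rng.choice(scope))                       # XCNet: pick the split RV at random
        D0, D1 = D[D[:, i] == 0], D[D[:, i] == 1]
        if len(D0) and len(D1):
            w1 = len(D1) / len(D)
            rest = [v for v in scope if v != i]
            return ("or", i, 1.0 - w1, w1,
                    learn_cnet(D0, rest, rng, delta, sigma, alpha),
                    learn_cnet(D1, rest, rng, delta, sigma, alpha))
    # leaf: smoothed marginals P(X_v = 1) over the remaining scope
    thetas = {v: (D[:, v].sum() + alpha) / (len(D) + 2 * alpha) for v in scope}
    return ("leaf", thetas)

rng = np.random.default_rng(0)
D = rng.integers(0, 2, size=(1000, 5))
cnet = learn_cnet(D, list(range(5)), rng)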

Page 168: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Online Learning

Discrete data [Lee et al. 2013]

a variant of LearnSPN using online clustering

sum nodes can be extended with more children

product nodes are never modified

Continuous data [Hsu et al. 2017]

starting with a network assuming all variables independent

correlations are incrementally introduced in the form of a multivariate Gaussian or a mixture distribution

106/147

Page 169: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Knowledge Compilation

107/147

Page 170: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Knowledge Compilation
Compilation to arithmetic circuits

the joint distribution P(A, B, C) can be represented as an AC

the AC's inputs are variable assignments (e.g. A and ¬A) or constants

internal nodes are sums or products

for a complete assignment: set the corresponding variable indicators to 1 (and the opposing ones to 0);
the root of the AC then evaluates the weight (unnormalized probability) of that world

Darwiche, Modeling and Reasoning with Bayesian Networks, 2009 108/147
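A toy sketch of evaluating a compiled AC bottom-up (an assumed tuple representation, not a specific compiler's output): leaves are either parameters or indicators λ_{X=x}, and leaving all indicators of an unobserved variable at 1 marginalizes it out, so the same pass answers both EVI and MAR queries.

import math

def eval_ac(node, indicators):
    kind = node[0]
    if kind == "param":
        return node[1]
    if kind == "ind":                                    # ("ind", "A", 1) is the indicator for A=1
        _, var, val = node
        return indicators.get((var, val), 1.0)           # unset indicators default to 1 (marginalized)
    vals = [eval_ac(child, indicators) for child in node[1]]
    return sum(vals) if kind == "+" else math.prod(vals)

# a tiny AC encoding P(A, B) = P(A) P(B | A)
bA1 = ("+", [("*", [("ind", "B", 1), ("param", 0.8)]), ("*", [("ind", "B", 0), ("param", 0.2)])])
bA0 = ("+", [("*", [("ind", "B", 1), ("param", 0.3)]), ("*", [("ind", "B", 0), ("param", 0.7)])])
ac  = ("+", [("*", [("ind", "A", 1), ("param", 0.6), bA1]),
             ("*", [("ind", "A", 0), ("param", 0.4), bA0])])
print(eval_ac(ac, {("A", 1): 1, ("A", 0): 0, ("B", 1): 1, ("B", 0): 0}))   # P(A=1, B=1) = 0.48
print(eval_ac(ac, {("A", 1): 1, ("A", 0): 0}))                             # P(A=1) = 0.6 (B marginalized)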

Page 171: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Hybridizing TPMs with intractable models
Collapsed compilation

Inference algorithms based on a knowledge compilation approach perform exact inference by compiling a worst-case exponentially-sized arithmetic circuit representation

online collapsed importance sampling
  choosing which variable to sample next based on the values sampled for previous variables

collapsed compilation
  maintaining a partially compiled arithmetic circuit during online collapsed sampling

Friedman et al., “Approximate Knowledge Compilation by Online Collapsed Importance Sampling”, 2018 109/147

Page 172: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Hybridizing TPMs with intractable models
Sum-Product Graphical Model (SPGM)

A probabilistic architecture combining SPNs and Graphical Models (GMs)
⇒ tractable inference (SPN) + high-level abstraction (PGM)

Desana et al., “Sum–product graphical models”, 2019 110/147

Page 173: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Hybridizing TPMs with intractable models
sum-product VAE

Tan et al., “Hierarchical Decompositional Mixtures of Variational Autoencoders”, 2019 111/147

Page 174: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Applications

Page 175: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Stay Tuned For …

Next:

1. what have probabilistic circuits been used for?
⇒ computer vision, sop, speech, planning, …

2. what are the current trends in tractable learning?
⇒ hybrid models, probabilistic programming, …

3. what are the current challenges?
⇒ benchmarks, scaling, reasoning

After: Conclusions 113/147

Page 176: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

20 Datasets
current state-of-the-art

dataset     single models             ensembles
nltcs       -5.99   [ID-SPN]          -5.99   [LearnPSDDs]
msnbc       -6.04   [Prometheus]      -6.04   [LearnPSDDs]
kdd         -2.12   [Prometheus]      -2.12   [LearnPSDDs]
plants      -12.54  [ID-SPN]          -11.84  [XCNets]
audio       -39.77  [BNP-SPN]         -39.39  [XCNets]
jester      -52.42  [BNP-SPN]         -51.29  [LearnPSDDs]
netflix     -56.36  [ID-SPN]          -55.71  [LearnPSDDs]
accidents   -26.89  [SPGM]            -29.10  [XCNets]
retail      -10.85  [ID-SPN]          -10.72  [LearnPSDDs]
pumbs*      -22.15  [SPGM]            -22.67  [SPN-btb]
dna         -79.88  [SPGM]            -80.07  [SPN-btb]
kosarek     -10.59  [Prometheus]      -10.52  [LearnPSDDs]
msweb       -9.73   [ID-SPN]          -9.62   [XCNets]
book        -34.14  [ID-SPN]          -33.82  [SPN-btb]
movie       -51.49  [Prometheus]      -50.34  [XCNets]
webkb       -151.84 [ID-SPN]          -149.20 [XCNets]
cr52        -83.35  [ID-SPN]          -81.87  [XCNets]
c20ng       -151.47 [ID-SPN]          -151.02 [XCNets]
bbc         -248.5  [Prometheus]      -229.21 [XCNets]
ad          -15.40  [CNetXD]          -14.00  [XCNets]

114/147

Page 177: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Challenge #1
better benchmarks

Move beyond toy benchmarks to datasets reflecting the complex and heterogeneous nature of real data!

115/147

Page 178: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Computer vision

Image reconstruction and inpainting ⇒ MAP inference

[Figure 7.3: examples of face image reconstructions, where the left half of each face is covered; completions by several SPN variants are compared to the originals]

Reconstructing some symmetries (eyes, but not beards, glasses).

Good results for 2011…

Poon et al., “Sum-Product Networks: a New Deep Architecture”, 2011
Sguerra et al., “Image classification using sum-product networks for autonomous flight of micro aerial vehicles”, 2016 116/147

Page 179: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Image segmentation

Semantic segmentation is again MAP inference!

Even approximate MAP for non-deterministic circuits (SPNs) has good performance.

Rathke et al., “Locally adaptive probabilistic models for global segmentation of pathological oct scans”, 2017
Yuan et al., “Modeling spatial layout for scene image understanding via a novel multiscale sum-product network”, 2016
Friesen et al., “Submodular Sum-product Networks for Scene Understanding”, 2016 117/147

Page 180: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Scene Understanding: Su-PAIR

Stelzner et al., “Faster Attend-Infer-Repeat with Tractable Probabilistic Models”, 2019 118/147

Page 181: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Challenge #2
hybridizing tractable and intractable models

Hybridize probabilistic inference: tractable models inside intractable loops, and intractable small boxes glued by tractable inference!

119/147

Page 182: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Activity recognition

Exploiting part-based decomposability along pixels and time (frames). Probabilistic circuits for MAP and MMAP inference and explanations.

[Figure (Amer et al.): a video is represented by a counting grid of visual words; an SPN over Bag-of-Words space-time windows is parsed bottom-up and top-down to identify foreground windows and localize the activity, e.g. “unloading of the trunk” in a VIRAT sequence]

Amer et al., “Sum Product Networks for Activity Recognition”, 2015
Wang et al., “Hierarchical spatial sum–product networks for action recognition in still images”, 2016
Chiradeep Roy et al., “Explainable Activity Recognition in Videos using Dynamic Cutset Networks”, 2019 120/147

Page 183: Tractable Representations Probabilistic Inference Learning ...guyvdb/slides/TPMTutorialUAI19.pdf · Tractable Probabilistic Models Representations Inference Learning Applications

Speech reconstruction and extension

Probabilistic circuits to model the joint pdf of observables in HMMs (HMM-SPNs), again leveraging tractable inference: marginals and MAP

??

?

?

YtYt 1- Yt 1+Yt 2- Yt 2+

??

?

?

??

?

?

??

?

?

??

?

?

St 2-

_

St 1-

_

St

_

St 1+

_

St 2+

_

Fig. 1. Illustration of the HMM with SPN observation models. State-dependent SPNs are symbolized by triangles with a circle on top.For the forward-backward algorithm, frequency bins marked with“?” (missing) are marginalized out by the SPNs.

frames 1, . . . , (t + λ). An illustration of the modified HMM usedin this paper is given in Figure 1. Following [6], we use most-probable-explanation (MPE) inference for recovering the missingspectrogram content, where we reconstruct the high-band only. LetSt,k = (St,k(1), . . . St,k(F ))T be the MPE-reconstruction of the tth

time frame, using the SPN depending on the kth HMM-state. Thenwe use the following bandwidth-extended log-spectrogram

S(t, f) =

!S(t, f) if f < f ′"K

k=1 p(Yt = k|et)St,k(f) o.w.(1)

where f ′ corresponds to 4000Hz.

4. RECONSTRUCTING TIME SIGNALS

To synthesize a time-signal from the bandwidth extended log-spectrogram, we need to associate a phase to the estimated magni-tude spectrogram eS(t,f). The problem of recovering a time-domainsignal given a modified magnitude appears in many speech appli-cations, such as single-channel speech enhancement [17, 18, 19],single-channel source separation [20, 21, 22, 23] and speech sig-nal modification [24, 25]. These signal modifications are solelyemployed in spectral amplitude domain while the phase informa-tion of the desired signal is not available. A typical approach is touse the observed (noisy) phase spectrum or to replace it with anenhanced/estimated phase.

In order to recover phase information for ABE, we use the it-erative algorithm proposed by Griffin and Lim (GL) [26]. Let j ∈0, . . . , J be an iteration index, and C(j) be a complex valued ma-trix generated in the j th iteration. For j = 0, we have

C(0)(t, f) =

!C(t, f) 1 ≤ f ≤ f ′

eS(t,f) o.w.(2)

where C is the complex spectrogram of the bandpass filtered inputsignal. Within the telephone band, phase information is consideredreliable and copied from the input. Outside of the narrow-band,phase is initialized with zero. Note that in general C(0) is not a validspectrogram since a time signal whose STFT equals C(0) might notexist. The j th iteration of the GL algorithm is given by

C(j)(t, f) =

!C(t, f) 1 ≤ f ≤ f ′

eS(t,f) ei G(C(j−1))(t,f) o.w.(3)

G(C) = STFT(STFT−1(C)). (4)

At each iteration, the magnitude of the approximate STFT C(j)

equals the magnitude eS estimated by our model, while temporalcoherence of the signal is enforced by the operator G(·) (see e.g. [25]for more details). The estimated time signal sj at the j th iterationis given by sj = STFT−1

#C(j)

$. At each iteration, the mean

square error between |STFT(sj)| and |C(0)| is reduced [26]. Inour experiments, we set the number of iterations J = 100, whichappeared to be sufficient for convergence.

5. EXPERIMENTS

We used 2 baselines in our experiments. The first baseline is themethod proposed in [13], based on the vocal tract filter model usinglinear prediction. We used 64 HMM states and 16 components perstate-dependent GMM, which performed best in [13]. We refer asHMM-LP to this baseline. The second baseline is almost identicalto our method, where we replaced the SPN with a Gaussian mixturemodel with 256 components with diagonal covariance matrices. Fortraining GMMs, we ran the EM algorithm for maximal 100 itera-tions and using 3 random restarts. Inference using the GMM modelworks the same way as described in section 3, since a GMM can beformulated as an SPN with a single sum node [7]. We refer as HMM-GMM to this baseline. To our method, we refer as HMM-SPN. ForHMM-GMM and HMM-SPN, we used the same clustering of log-spectra using a codebook size of 64.

We used time-frames of 512 samples length, with 75% over-lap, which using a sampling frequency of 16 kHz corresponds to aframe length of 32ms and a frame rate of 8ms. Before applyingthe FFT, the frames were weighted with a Hamming window. Forthe forward-backward algorithm we used a look-ahead of λ = 3frames, which corresponds to the minimal delay introduced by the75% frame-overlap. We performed our experiments on the GRIDcorpus [27], where we used the test speakers with numbers 1, 2, 18,and 20, referred to as s1, s2, s18, and s20, respectively. Speakerss1 and s2 are male, and s18 and s20 are female. We trained speakerdependent and speaker independent models. For speaker dependentmodels we used 10 minutes of speech of the respective speaker. Forspeaker independent models we used 10 minutes of speech obtainedfrom the remaining 30 speakers of the corpus, each speaker provid-ing approximately 20 seconds of speech. For testing we used 50utterances per test speaker, not included in the training set.

Fig. 2 shows log-spectrograms of a test utterance of speaker s18and the bandwidth extended signals by HMM-LP, HMM-GMM andHMM-SPN, using speaker dependent models. We see that HMM-LPsucceeds in reconstructing a harmonic structure for voiced sounds.However, we see that fricative and plosive sounds are not wellcaptured. The reconstruction by HMM-GMM is blurry and doesnot recover the harmonic structure of the original signal well, butpartly recovers high-frequency content related to consonants. TheHMM-SPN method recovers a natural high frequency structure,which largely resembles the original full-band signal: the harmonicstructure appears more natural than the one delivered by HMM-LPand consonant sounds seem to be better detected and reconstructedthan by HMM-GMM. According to informal listening tests1, the vi-sual impression corresponds to the listening experience: the signalsdelivered by HMM-SPN clearly enhance the high-frequency contentand sound more natural than the signals delivered by HMM-LP and

1Formal listening tests were out of the scope of the paper. All ABE sig-nals, the full-band and the narrow-band telephone signals can be obtained asWAV files from http://www2.spsc.tugraz.at/people/peharz/ABE/

(a) Original full bandwidth (b) Reconstruction HMM-LP

(c) Reconstruction HMM-GMM (d) Reconstruction HMM-SPN

Fig. 2. Log-spectrogram of the utterance “Bin green at zed 5 now”,spoken by s18. (a): original full bandwidth signal. (b): ABE resultof HMM-LP [13]. (c): ABE result of HMM-GMM (this paper). (d):ABE results of HMM-SPN (this paper).

Table 1. Average LSD using speaker-dependent models.s1 s2 s18 s20

HMM-LP 7.13 7.57 6.48 6.41HMM-GMM 3.18 2.93 2.28 2.82HMM-SPN 3.12 2.84 2.15 2.59

HMM-GMM. HMM-GMM and HMM-SPN both deliver a morerealistic extension for fricative and plosive sounds. However, thisintroduces also a some high frequency noise. According to our lis-tening experience, these artifacts are less severe for the HMM-SPNsignals.

For an objective evaluation, we use the log-spectral distortion(LSD) in the high-band [13]. Given an original signal and an ABEreconstruction, we perform Lth-order LPC analysis for each frame,where L = 9. This yields (L + 1)-dimensional coefficient vectorsaτ and aτ of the original and the reconstructed signals, respectively,where τ is the frame index. The spectral envelope modeled by ageneric LPC coefficient vector a = (a0, . . . , aL)

t is given as

Ea(ejΩ) =

σ

|!L

k=0 ake−jkΩ|, (5)

where σ is the square-root of the variance of the LPC-analyzed sig-nal. The LSD for the τ th frame, in high-band is calculated as

LSDτ =

"# π

ν(20 logEaτ (ejΩ)− 20 logEaτ (ejΩ))

2 dΩ

π − ν, (6)

where ν = π 4000fs/2 , fs being the sampling frequency. The LSD at

utterance level is given as the average of LSDτ over all frames.Tables 1 and 2 show the LSD of all three methods for the speaker

dependent and speaker independent scenarios, respectively, averagedover the 50 test sentences. For each speaker, we see a clear ranking

Table 2. Average LSD using speaker-independent models.s1 s2 s18 s20

HMM-LP 7.12 7.66 6.60 6.34HMM-GMM 3.62 4.46 3.82 3.60HMM-SPN 3.42 3.85 3.05 3.36

of the three method, and that the HMM-SPN method always per-forms best. All differences are significant at a 0.95 confidence level,according to a paired one-sided t-test.

6. DISCUSSION

We demonstrated that SPNs are a promising probabilistic model forspeech, applying them to the ill-posed problem of artificial band-width extension. Motivated by the success of SPNs on the alsoill-posed and related problem of image completion, we used SPNsas observation models in HMMs, modeling the temporal evolutionof log short-time spectra. While the model is trained on full-bandspeech, the fact that the high and very low frequencies are miss-ing in telephone signals is naturally treated by marginalization ofmissing frequency bins. Recovering the missing high frequencies,is naturally treated by MPE inference. The resulting system clearlyimproves the state of the art both in subjective listening tests and ob-jective performance evaluation using the log-spectral distortion mea-sure.

This performance improvement comes at an increased computa-tional cost. The trained observation SPNs have 136 layers and tensof thousand of nodes and parameters. Therefore, bandwidth exten-sion using our HMM-SPN approach currently takes about 1−2 min-utes computation time per utterance on a standard desktop computer,using a non-optimized Matlab/C++-based prototype. Inference us-ing the HMM-GMM model requires approximately 0.5− 1 minutesper utterance; inference in the HMM-LP model requires some sec-onds. Therefore, although we designed the overall system to be real-time capable (small HMM look-ahead), it is currently not suitablefor a real-time application implemented on a low-energy embeddedsystem. For non-real-time systems, e.g. for offline processing of tele-phone speech databases, the approach presented here is appropriate.The basic motivation in this paper, however, was to demonstrate theapplicability of SPNs for modeling speech; according to prior studies[6, 8], SPNs are able to express complex interaction with comparablelittle inference time. Therefore one can conjecture that an ABE sys-tem with classical graphical models, expressing a similar amount ofdependencies as the used SPNs, would have an overall computationtime in the range of hours.

The system presented in this paper is trained in a two-step approach, i.e. (i) clustering the training data, which delivers the HMM states and statistics, and (ii) subsequent training of state-dependent observation models. Incorporating state-sequence modeling directly into SPN training, similar as in dynamic graphical models, is an interesting future research direction. Finally, future directions for research on SPN-based speech models are further speech-related applications, such as packet loss concealment, (single channel) source separation, and speech enhancement.

Fig. 2. Log-spectrogram of the utterance "Bin green at zed 5 now", spoken by s18. (a): original full bandwidth signal. (b): ABE result of HMM-LP [13]. (c): ABE result of HMM-GMM (this paper). (d): ABE results of HMM-SPN (this paper).

Table 1. Average LSD using speaker-dependent models.
          s1     s2     s18    s20
HMM-LP    7.13   7.57   6.48   6.41
HMM-GMM   3.18   2.93   2.28   2.82
HMM-SPN   3.12   2.84   2.15   2.59


State-of-the-art high frequency reconstruction (MAP inference)

Peharz et al., "Modeling speech with sum-product networks: Application to bandwidth extension", 2014
Zohrer et al., "Representation learning for single-channel source separation and bandwidth extension", 2015 121/147


Sequence labeling

Figure 2: SPN for language modeling.

probability as

    P(Y=y | X=x) = Φ(Y=y | X=x) / Σ_{y′} Φ(Y=y′ | X=x)
                 = Σ_h Φ(Y=y, H=h | X=x) / Σ_{y′,h} Φ(Y=y′, H=h | X=x),

where Φ(Y=y | X=x) is an unnormalized probability. Thus the partial derivative of the conditional log-likelihood with respect to a weight w in an SPN is given by:

    ∂/∂w log P(y|x) = ∂/∂w log Σ_h Φ(y, h | x) − ∂/∂w log Σ_{y′,h} Φ(y′, h | x)    (1)

To train an SPN, we first specify its architecture, i.e., its sum and product nodes, and the connections between them. Then we learn the weights of the sum nodes via gradient descent to maximize the conditional log-likelihood of a training set of (x, y) examples. The gradient of each weight (Equation 1) is computed via backpropagation. The first summation on the right-hand side of Equation 1 can be computed tractably in a single upward pass through the SPN by setting all hidden variables to 1, and the second summation can be computed similarly by setting both hidden and query variables to 1. The partial derivatives are passed from parent to child according to the chain rule as described by [14]. Each weight is changed by multiplying a learning rate parameter η with Equation 1, i.e., Δw = η · ∂/∂w log P(y|x). To speed up training, we could estimate the gradient by computing it with a subset (mini-batch) of examples from the training set, rather than using all examples.
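As a rough illustration of this training loop, the sketch below maximizes the conditional log-likelihood of a tiny sum-product-style scorer, with PyTorch's automatic differentiation standing in for the hand-derived backpropagation of Equation 1. The structure (a single hidden variable h with Bernoulli leaves) and all dimensions are made up for the example; it is not the architecture of Section 3.

```python
import torch

# Toy discriminative setup: binary features x in {0,1}^d, query y in {0,...,K-1}.
# The unnormalized circuit value is Phi(y, x) = sum_h v[y, h] * prod_j B(x_j; p[h, j]),
# with h playing the role of the hidden variables summed out in Equation 1.
d, K, H = 4, 3, 5
p = (0.25 + 0.5 * torch.rand(H, d)).requires_grad_()   # Bernoulli leaf parameters
v = torch.rand(K, H).requires_grad_()                  # sum-node weights

def log_phi(x):
    """log Phi(y, x) for every y: one upward pass with h summed out."""
    leaf = x * torch.log(p) + (1 - x) * torch.log(1 - p)       # (H, d) log-leaves
    comp = leaf.sum(dim=1)                                      # log prod_j, per h
    return torch.logsumexp(torch.log(v.clamp_min(1e-8)) + comp, dim=1)   # (K,)

def cll(x, y):
    """log P(y | x) = log Phi(y, x) - log sum_{y'} Phi(y', x)."""
    lp = log_phi(x)
    return lp[y] - torch.logsumexp(lp, dim=0)

opt = torch.optim.SGD([p, v], lr=0.1)
x, y = torch.tensor([1.0, 0.0, 1.0, 1.0]), 1
for _ in range(100):                      # plain gradient ascent on the CLL
    opt.zero_grad()
    (-cll(x, y)).backward()               # autograd computes d/dw log P(y|x)
    opt.step()
    with torch.no_grad():                 # keep parameters in a valid range
        p.clamp_(1e-3, 1 - 1e-3)
        v.clamp_(min=1e-6)
```

Mini-batching, as mentioned above, would simply average cll over a subset of (x, y) pairs before the backward pass.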

3. SPN Architecture

Figure 2 shows the architecture of our discriminative SPN for language modeling.¹ To predict a word (a query variable), we use its previous N words as evidence in our SPN. Each previous word is represented by a K-dimensional vector, where K is the number of words in a vocabulary. Each vector has exactly one 1 at the index corresponding to the word it represents, and 0's everywhere else. When we predict the ith word, we have a vector v_ij (1 ≤ j ≤ N) at the bottommost layer for each of the previous N words.

¹ https://github.com/stakok/lmspn/blob/master/faq.md contains more details about the architecture.

Above the bottommost layer, we have a (hidden) layer of sum nodes. There are D sum nodes H_j1 . . . H_jD for each vector v_ij. Each sum node H_jl has an edge connecting it to every entry in v_ij. Let the mth entry in v_ij be denoted by v_ij^m, and the weight of the edge from H_jl to v_ij^m be denoted by w_lm. We constrain each weight w_lm to be the same for each pair of H_jl and v_ij^m (1 ≤ j ≤ N). This layer of sum nodes can be interpreted as compressing each K-dimensional vector v_ij into a smaller continuous-valued D-dimensional feature vector (thus gaining the same advantages of [5] as described in Section 1). Because the weights w_lm are constrained to be the same between each pair of K-dimensional input vector and D-dimensional feature vector, we ensure that the weights are position independent, i.e., the same word will be compressed into the same feature vector regardless of its position. This also makes it easier to train the SPN by reducing the number of weights to be learned.

Above the H_jl layer, we have another layer of sum nodes. In this layer, each node M_k (1 ≤ k ≤ K) is connected to every H_jl node. Moving up, we have a layer of product nodes. Each G_k product node is connected via two edges to an M_k node. Each G_k node transforms the output from its child M_k node by squaring it. This helps to capture more complicated dependency among the input words.

Moving up, we have another layer of sum nodes. Each B_k node in this layer is connected to an M_k node and a G_k node in the lower layers. Above this, there is a layer of S_k nodes, each of which is connected to a B_k node and an indicator variable y_k representing a value in our categorical query variable (i.e., the ith word which we are predicting). y_k = 1 if the query variable is the kth word, and y_k = 0 otherwise. Intuitively, the indicator variables select which part of the SPN below an S_k node gets "activated". Finally, we have an S node which connects to all S_k nodes. When we normalize the weights between S and the S_k nodes to sum to 1, S's output is the conditional probability of the ith word given its previous N words.
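To make the layer structure concrete, here is a schematic numpy forward pass that mirrors the description above: one-hot context vectors, a tied K-to-D compression, the M_k / G_k / B_k / S_k stack, and a normalized top sum node. Dimensions, the random weights, and the function name are illustrative; weight positivity constraints and the exact normalization used in the paper are glossed over.

```python
import numpy as np

K, N, D = 10000, 4, 30        # vocabulary size, context length, feature dimension
rng = np.random.default_rng(0)

W = rng.random((D, K))        # tied weights from each v_ij to H_j1..H_jD (shared across positions j)
U = rng.random((K, N * D))    # weights of the M_k sum nodes over all H_jl
alpha, beta = rng.random(K), rng.random(K)   # B_k mixes M_k and its square G_k
w = rng.random(K)             # weights of the top sum node S over the S_k nodes

def conditional_distribution(context_word_ids):
    """P(next word | previous N words) for the layer stack sketched above."""
    V = np.zeros((N, K))                     # bottommost layer: one-hot vectors v_ij
    V[np.arange(N), context_word_ids] = 1.0
    H = V @ W.T                              # (N, D): position-independent compression
    m = U @ H.reshape(-1)                    # M_k sum nodes over every H_jl
    g = m ** 2                               # G_k product nodes square their M_k child
    b = alpha * m + beta * g                 # B_k sum nodes
    s = w * b                                # S_k = B_k gated by indicator y_k; evaluating
                                             # all y_k at once yields a score per word
    return s / s.sum()                       # normalized output of the top sum node S

probs = conditional_distribution([12, 7, 430, 9981])
print(probs.shape, probs.sum())              # (10000,) 1.0 (up to rounding)
```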

4. Experiments

4.1. Dataset

We performed our experiments on the commonly used Penn Treebank corpus [15], and adhered to the experimental setup used in previous work [6, 9]. We used sections 0-20, sections 21-22, and sections 23-24 respectively as training, validation and test sets. These sections contain segments of news reports from the Wall Street Journal. We treated punctuation as words, and used the 10,000 most frequent words in the corpus to create a vocabulary. All other words are regarded as unknown and mapped to the token <unk>. The percentages of out-of-vocabulary (<unk>) tokens in them are about 5.91%, 6.96% and 6.63% respectively. Thus only a small fraction of the dataset consists of unknown words.
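The preprocessing just described boils down to frequency counting and OOV mapping. A minimal sketch (with a toy corpus standing in for the Penn Treebank sections, and a simplified whitespace tokenization) might look as follows.

```python
from collections import Counter

def build_vocab(train_tokens, size=10000, unk="<unk>"):
    """Keep the `size` most frequent tokens; everything else maps to `unk`."""
    counts = Counter(train_tokens)
    vocab = {w for w, _ in counts.most_common(size)}
    return vocab, unk

def map_oov(tokens, vocab, unk="<unk>"):
    return [t if t in vocab else unk for t in tokens]

# punctuation is treated as ordinary tokens, as in the setup described above
train = "the market fell , analysts said .".split()
vocab, unk = build_vocab(train, size=5)
print(map_oov("the analysts rallied .".split(), vocab))
```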

4.2. Methodology

Using the training set, we learned the weights of all sum nodes in our SPN described in Section 3. To evaluate

Ratajczak et al., "Sum-Product Networks for Structured Prediction: Context-Specific Deep Conditional Random Fields", 2014
Ratajczak et al., "Sum-Product Networks for Sequence Labeling", 2018
Cheng et al., "Language modeling with Sum-Product Networks", 2014 122/147


Robotics

Hierarchical planning robot executions

Scenes and maps decompose along circuit structures

Pronobis et al., "Learning Deep Generative Spatial Models for Mobile Robots", 2016
Pronobis et al., "Deep spatial affordance hierarchy: Spatial knowledge representation for planning in large-scale environments", 2017
Zheng et al., "Learning graph-structured sum-product networks for probabilistic semantic maps", 2018 123/147


SOP: Preference learning

Preferences and rankings as logical constraints

Structured decomposable circuits for advanced queries

SOTA on modeling densities over rankings

Choi et al., "Tractable learning for structured probability spaces: A case study in learning preference distributions", 2015
Shen et al., "A Tractable Probabilistic Model for Subset Selection.", 2017 124/147


SOP: Routing
Decomposing complex (conditional) probability spaces via circuits

Shen et al., "Conditional PSDDs: Modeling and learning with modular knowledge", 2018
Shen et al., "Structured Bayesian Networks: From Inference to Learning with Routes", 2019 125/147


Challenge #3
scaling tractable learning

Learn tractable models on millions of datapoints and thousands of features in tractable time!

126/147


Probabilistic programming

Chavira et al., "Compiling relational Bayesian networks for exact inference", 2006
Holtzen et al., "Symbolic Exact Inference for Discrete Probabilistic Programs", 2019
De Raedt et al.; Riguzzi; Fierens et al.; Vlasselaer et al., "ProbLog: A Probabilistic Prolog and Its Application in Link Discovery."; "A top down interpreter for LPAD and CP-logic"; "Inference and Learning in Probabilistic Logic Programs using Weighted Boolean Formulas"; "Anytime Inference in Probabilistic Logic Programs with Tp-compilation", 2007; 2007; 2015; 2015
Olteanu et al.; Van den Broeck et al., "Using OBDDs for efficient query evaluation on probabilistic databases"; Query Processing on Probabilistic Data: A Survey, 2008; 2017
Vlasselaer et al., "Exploiting Local and Repeated Structure in Dynamic Bayesian Networks", 2016 127/147


and more…

fault prediction [Nath et al. 2016]

computational psychology [Joshi et al. 2018]

biology [Butz et al. 2018b]

low-energy prediction [Galindez Olascoaga et al. 2019; Shah et al. 2019]

calibration of analog/RF circuits [Andraud et al. 2018]

stochastic constraint optimization [Latour et al. 2017]

neuro-symbolic learning [Xu et al. 2018]

probabilistic and symbolic reasoning integration [Li 2015]

relational learning [Broeck et al. 2011; Domingos et al. 2012; Broeck 2013; Nath et al. 2014, 2015; Niepert et al. 2015; Van Haaren et al. 2015]

128/147


Challenge #4
better benchmarks

Move beyond toy queries towards fully automated reasoning!

129/147


[Figure: probabilistic models arranged along two axes (less vs. more expressive efficient, and less vs. more tractable queries), including fully factorized models, NB, Trees, Polytrees, TJT, LTM, Mixtures, CNets, SPNs, PSDDs, ACs, AoGs, BNs, MNs, NADEs, NFs, VAEs and GANs.]

takeaway #1: tractability is a spectrum

130/147


[Figure: the same expressiveness vs. tractability chart as above.]

takeaway #2: you can be both tractable and expressive

131/147


[Figure: two small probabilistic circuits: a product node over leaves X1, X2, X3, and a weighted sum (weights w1, w2) of two products that split on X1 ≤ θ and X1 > θ, each paired with X2.]

takeaway #3: probabilistic circuits are a foundation for tractable inference and learning

132/147


Open challenges

1. new benchmarks are needed!

2. scaling tractable learning!

3. take the best from approximate reasoning!

4. move to complex reasoning!

takeaway #4: lots to do still,…

133/147


References I

⊕ Chow, C and C Liu (1968). "Approximating discrete probability distributions with dependence trees". In: IEEE Transactions on Information Theory 14.3, pp. 462–467.

⊕ Bryant, R (1986). “Graph-based algorithms for boolean manipulation”. In: IEEE Transactions on Computers, pp. 677–691.

⊕ Cooper, Gregory F (1990). “The computational complexity of probabilistic inference using Bayesian belief networks”. In: Artificial intelligence 42.2-3, pp. 393–405.

⊕ Dagum, Paul and Michael Luby (1993). “Approximating probabilistic inference in Bayesian belief networks is NP-hard”. In: Artificial intelligence 60.1, pp. 141–153.

⊕ Zhang, Nevin Lianwen and David Poole (1994). "A simple approach to Bayesian network computations". In: Proceedings of the Biennial Conference-Canadian Society for Computational Studies of Intelligence, pp. 171–178.

⊕ Roth, Dan (1996). “On the hardness of approximate reasoning”. In: Artificial Intelligence 82.1–2, pp. 273–302.

⊕ Dechter, Rina (1998). “Bucket elimination: A unifying framework for probabilistic inference”. In: Learning in graphical models. Springer, pp. 75–104.

⊕ Dasgupta, Sanjoy (1999). “Learning polytrees”. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp. 134–141.

⊕ Meilă, Marina and Michael I. Jordan (2000). “Learning with mixtures of trees”. In: Journal of Machine Learning Research 1, pp. 1–48.

⊕ Bach, Francis R. and Michael I. Jordan (2001). “Thin Junction Trees”. In: Advances in Neural Information Processing Systems 14. MIT Press, pp. 569–576.

⊕ Darwiche, Adnan (2001). “Recursive conditioning”. In: Artificial Intelligence 126.1-2, pp. 5–41.

⊕ Yedidia, Jonathan S, William T Freeman, and Yair Weiss (2001). “Generalized belief propagation”. In: Advances in neural information processing systems, pp. 689–695. 134/147


References II

⊕ Chickering, Max (2002). "The WinMine Toolkit". In: Microsoft, Redmond.

⊕ Darwiche, Adnan and Pierre Marquis (2002). “A knowledge compilation map”. In: Journal of Artificial Intelligence Research 17, pp. 229–264.

⊕ Dechter, Rina, Kalev Kask, and Robert Mateescu (2002). "Iterative join-graph propagation". In: Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp. 128–136.

⊕ Darwiche, Adnan (2003). “A Differential Approach to Inference in Bayesian Networks”. In: J.ACM.

⊕ Sang, Tian, Paul Beame, and Henry A Kautz (2005). “Performing Bayesian inference by weighted model counting”. In: AAAI. Vol. 5, pp. 475–481.

⊕ Chavira, Mark, Adnan Darwiche, and Manfred Jaeger (2006). “Compiling relational Bayesian networks for exact inference”. In: International Journal of Approximate Reasoning 42.1-2,pp. 4–20.

⊕ Park, James D and Adnan Darwiche (2006). “Complexity results and approximation strategies for MAP explanations”. In: Journal of Artificial Intelligence Research 21, pp. 101–133.

⊕ De Raedt, Luc, Angelika Kimmig, and Hannu Toivonen (2007). “ProbLog: A Probabilistic Prolog and Its Application in Link Discovery.”. In: IJCAI. Vol. 7. Hyderabad, pp. 2462–2467.

⊕ Dechter, Rina and Robert Mateescu (2007). “AND/OR search spaces for graphical models”. In: Artificial intelligence 171.2-3, pp. 73–106.

⊕ Kulesza, A. and F. Pereira (2007). “Structured Learning with Approximate Inference”. In: Advances in Neural Information Processing Systems 20. MIT Press, pp. 785–792.

⊕ Riguzzi, Fabrizio (2007). “A top down interpreter for LPAD and CP-logic”. In: Congress of the Italian Association for Artificial Intelligence. Springer, pp. 109–120. 135/147


References III

⊕ Olteanu, Dan and Jiewen Huang (2008). "Using OBDDs for efficient query evaluation on probabilistic databases". In: International Conference on Scalable Uncertainty Management.

Springer, pp. 326–340.

⊕ Darwiche, Adnan (2009). Modeling and Reasoning with Bayesian Networks. Cambridge.

⊕ Koller, Daphne and Nir Friedman (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

⊕ Choi, Arthur and Adnan Darwiche (2010). “Relax, compensate and then recover”. In: JSAI International Symposium on Artificial Intelligence. Springer, pp. 167–180.

⊕ Lowd, Daniel and Pedro Domingos (2010). “Approximate inference by compilation to arithmetic circuits”. In: Advances in Neural Information Processing Systems, pp. 1477–1485.

⊕ Broeck, Guy Van den et al. (2011). "Lifted probabilistic inference by first-order knowledge compilation". In: Proceedings of the Twenty-Second international joint conference on Artificial Intelligence. AAAI Press/International Joint Conferences on Artificial Intelligence; Menlo …, pp. 2178–2185.

⊕ Campos, Cassio Polpo de (2011). “New complexity results for MAP in Bayesian networks”. In: IJCAI. Vol. 11, pp. 2100–2106.

⊕ Darwiche, Adnan (2011). “SDD: A New Canonical Representation of Propositional Knowledge Bases”. In: Proceedings of the Twenty-Second International Joint Conference on ArtificialIntelligence - Volume Volume Two. IJCAI’11. Barcelona, Catalonia, Spain. ISBN: 978-1-57735-514-4.

⊕ Poon, Hoifung and Pedro Domingos (2011). “Sum-Product Networks: a New Deep Architecture”. In: UAI 2011.

⊕ Sontag, David, Amir Globerson, and Tommi Jaakkola (2011). “Introduction to dual decomposition for inference”. In: Optimization for Machine Learning 1, pp. 219–254.

⊕ Domingos, Pedro and William Austin Webb (2012). “A tractable first-order probabilistic logic”. In: Twenty-Sixth AAAI Conference on Artificial Intelligence. 136/147


References IV

⊕ Gens, Robert and Pedro Domingos (2012). "Discriminative Learning of Sum-Product Networks". In: Advances in Neural Information Processing Systems 25, pp. 3239–3247.

⊕ Broeck, Guy Van den (2013). “Lifted inference and learning in statistical relational models”. PhD thesis. Ph. D. Dissertation, KU Leuven.

⊕ Gens, Robert and Pedro Domingos (2013). “Learning the Structure of Sum-Product Networks”. In: Proceedings of the ICML 2013, pp. 873–880.

⊕ Lee, Sang-Woo, Min-Oh Heo, and Byoung-Tak Zhang (2013). “Online Incremental Structure Learning of Sum-Product Networks”. In: Neural Information Processing: 20th InternationalConference, ICONIP 2013, Daegu, Korea, November 3-7, 2013. Proceedings, Part II. Ed. by Minho Lee et al. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 220–227. ISBN:978-3-642-42042-9. DOI: 10.1007/978-3-642-42042-9_28. URL: http://dx.doi.org/10.1007/978-3-642-42042-9_28.

⊕ Liu, Qiang and Alexander Ihler (2013). “Variational algorithms for marginal MAP”. In: The Journal of Machine Learning Research 14.1, pp. 3165–3200.

⊕ Lowd, Daniel and Amirmohammad Rooshenas (2013). “Learning Markov Networks With Arithmetic Circuits”. In: Proceedings of the 16th International Conference on ArtificialIntelligence and Statistics. Vol. 31. JMLR Workshop Proceedings, pp. 406–414.

⊕ Peharz, Robert, Bernhard Geiger, and Franz Pernkopf (2013). “Greedy Part-Wise Learning of Sum-Product Networks”. In: ECML-PKDD 2013.

⊕ Cheng, Wei-Chen et al. (2014). “Language modeling with Sum-Product Networks”. In: INTERSPEECH 2014, pp. 2098–2102.

⊕ Goodfellow, Ian et al. (2014). “Generative adversarial nets”. In: Advances in neural information processing systems, pp. 2672–2680.

⊕ Kingma, Diederik P and Max Welling (2014). “Auto-Encoding Variational Bayes”. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR). 2014.

137/147


References V

⊕ Kisa, Doga et al. (July 2014a). "Probabilistic sentential decision diagrams". In: Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning

(KR). Vienna, Austria.

⊕ — (July 2014b). “Probabilistic sentential decision diagrams”. In: Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR).Vienna, Austria. URL: http://starai.cs.ucla.edu/papers/KisaKR14.pdf.

⊕ Lee, Sang-Woo, Christopher Watkins, and Byoung-Tak Zhang (2014). “Non-Parametric Bayesian Sum-Product Networks”. In: Workshop on Learning Tractable Probabilistic Models.Citeseer.

⊕ Martens, James and Venkatesh Medabalimi (2014). “On the Expressive Efficiency of Sum Product Networks”. In: CoRR abs/1411.7717.

⊕ Nath, Aniruddh and Pedro Domingos (2014). “Learning Tractable Statistical Relational Models”. In: Workshop on Learning Tractable Probabilistic Models, ICML 2014.

⊕ Peharz, Robert, Robert Gens, and Pedro Domingos (2014a). “Learning Selective Sum-Product Networks”. In: Workshop on Learning Tractable Probabilistic Models. LTPM.

⊕ Peharz, Robert et al. (2014b). “Modeling speech with sum-product networks: Application to bandwidth extension”. In: ICASSP2014.

⊕ Rahman, Tahrima, Prasanna Kothalkar, and Vibhav Gogate (2014). "Cutset Networks: A Simple, Tractable, and Scalable Approach for Improving the Accuracy of Chow-Liu Trees". In: Machine Learning and Knowledge Discovery in Databases. Vol. 8725. LNCS. Springer, pp. 630–645.

⊕ Ratajczak, Martin, S Tschiatschek, and F Pernkopf (2014). "Sum-Product Networks for Structured Prediction: Context-Specific Deep Conditional Random Fields". In: Proc. Workshop on Learning Tractable Probabilistic Models 1, pp. 1–10.

138/147


References VI

⊕ Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra (2014). "Stochastic backprop. and approximate inference in deep generative models". In: arXiv preprint

arXiv:1401.4082.

⊕ Rooshenas, Amirmohammad and Daniel Lowd (2014). “Learning Sum-Product Networks with Direct and Indirect Variable Interactions”. In: Proceedings of ICML 2014.

⊕ Adel, Tameem, David Balduzzi, and Ali Ghodsi (2015). “Learning the Structure of Sum-Product Networks via an SVD-based Algorithm”. In: Uncertainty in Artificial Intelligence.

⊕ Amer, Mohamed and Sinisa Todorovic (2015). “Sum Product Networks for Activity Recognition”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on.

⊕ Bekker, Jessa et al. (2015). “Tractable Learning for Complex Probability Queries”. In: Advances in Neural Information Processing Systems 28 (NIPS).

⊕ Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov (2015). “Importance weighted autoencoders”. In: arXiv preprint arXiv:1509.00519.

⊕ Choi, Arthur, Guy Van den Broeck, and Adnan Darwiche (2015). "Tractable learning for structured probability spaces: A case study in learning preference distributions". In: Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI).

⊕ Dennis, Aaron and Dan Ventura (2015). “Greedy Structure Search for Sum-product Networks”. In: IJCAI’15. Buenos Aires, Argentina: AAAI Press, pp. 932–938. ISBN:978-1-57735-738-4.

⊕ Di Mauro, Nicola, Antonio Vergari, and Floriana Esposito (2015a). “Learning Accurate Cutset Networks by Exploiting Decomposability”. In: Proceedings of AIXIA. Springer, pp. 221–232.

⊕ Di Mauro, Nicola, Antonio Vergari, and Teresa M.A. Basile (2015b). “Learning Bayesian Random Cutset Forests”. In: Proceedings of ISMIS. Springer, pp. 122–132.

139/147


References VII

⊕ Fierens, Daan et al. (May 2015). "Inference and Learning in Probabilistic Logic Programs using Weighted Boolean Formulas". In: Theory and Practice of Logic Programming 15 (03),

pp. 358–401. ISSN: 1475-3081. DOI: 10.1017/S1471068414000076. URL: http://starai.cs.ucla.edu/papers/FierensTPLP15.pdf.

⊕ Germain, Mathieu et al. (2015). “MADE: Masked Autoencoder for Distribution Estimation”. In: CoRR abs/1502.03509.

⊕ Li, Weizhuo (2015). “Combining sum-product network and noisy-or model for ontology matching.”. In: OM, pp. 35–39.

⊕ Nath, Aniruddh and Pedro Domingos (2015). “Learning Relational Sum-Product Networks”. In: Proceedings of the AAAI Conference on Artificial Intelligence.

⊕ Niepert, Mathias and Pedro Domingos (2015). “Learning and inference in tractable probabilistic knowledge bases”. In: AUAI Press.

⊕ Peharz, Robert (2015). “Foundations of Sum-Product Networks for Probabilistic Modeling”. PhD thesis. Graz University of Technology, SPSC.

⊕ Van Haaren, Jan et al. (2015). “Lifted Generative Learning of Markov Logic Networks”. In: Machine Learning 103.1, pp. 27–55. DOI: 10.1007/s10994-015-5532-x.

⊕ Vergari, Antonio, Nicola Di Mauro, and Floriana Esposito (2015). “Simplifying, Regularizing and Strengthening Sum-Product Network Structure Learning”. In: ECML-PKDD 2015.

⊕ Vlasselaer, Jonas et al. (2015). “Anytime Inference in Probabilistic Logic Programs with Tp-compilation”. In: Proceedings of 24th International Joint Conference on Artificial Intelligence(IJCAI). URL: http://starai.cs.ucla.edu/papers/VlasselaerIJCAI15.pdf.

⊕ Zohrer, Matthias, Robert Peharz, and Franz Pernkopf (2015). "Representation learning for single-channel source separation and bandwidth extension". In: Audio, Speech, and Language Processing, IEEE/ACM Transactions on 23.12, pp. 2398–2409.

⊕ Belle, Vaishak and Luc De Raedt (2016). “Semiring Programming: A Framework for Search, Inference and Learning”. In: arXiv preprint arXiv:1609.06954. 140/147


References VIII

⊕ Cohen, Nadav, Or Sharir, and Amnon Shashua (2016). "On the expressive power of deep learning: A tensor analysis". In: Conference on Learning Theory, pp. 698–728.

⊕ Friesen, Abram L and Pedro Domingos (2016). “Submodular Sum-product Networks for Scene Understanding”. In:

⊕ Jaini, Priyank et al. (2016). “Online Algorithms for Sum-Product Networks with Continuous Variables”. In: Probabilistic Graphical Models - Eighth International Conference, PGM 2016,Lugano, Switzerland, September 6-9, 2016. Proceedings, pp. 228–239. URL: http://jmlr.org/proceedings/papers/v52/jaini16.html.

⊕ Nath, Aniruddh and Pedro M. Domingos (2016). “Learning Tractable Probabilistic Models for Fault Localization”. In: CoRR abs/1507.01698. URL:http://arxiv.org/abs/1507.01698.

⊕ Oztok, Umut, Arthur Choi, and Adnan Darwiche (2016). “Solving PP-PP-complete problems using knowledge compilation”. In: Fifteenth International Conference on the Principles ofKnowledge Representation and Reasoning.

⊕ Peharz, Robert et al. (2016). “On the Latent Variable Interpretation in Sum-Product Networks”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence PP, Issue 99. URL:http://arxiv.org/abs/1601.06180.

⊕ Pronobis, A. and R. P. N. Rao (2016). “Learning Deep Generative Spatial Models for Mobile Robots”. In: ArXiv e-prints. arXiv: 1610.02627 [cs.RO].

⊕ Rahman, Tahrima and Vibhav Gogate (2016a). “Learning Ensembles of Cutset Networks”. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI’16. Phoenix,Arizona: AAAI Press, pp. 3301–3307. URL: http://dl.acm.org/citation.cfm?id=3016100.3016365.

⊕ — (2016b). “Merging Strategies for Sum-Product Networks: From Trees to Graphs”. In: UAI, ??–??

141/147


References IX

⊕ Rashwan, Abdullah, Han Zhao, and Pascal Poupart (2016). "Online and Distributed Bayesian Moment Matching for Parameter Learning in Sum-Product Networks". In: Proceedings

of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1469–1477.

⊕ Sguerra, Bruno Massoni and Fabio G Cozman (2016). "Image classification using sum-product networks for autonomous flight of micro aerial vehicles". In: 2016 5th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, pp. 139–144.

⊕ Shen, Yujia, Arthur Choi, and Adnan Darwiche (2016). “Tractable Operations for Arithmetic Circuits of Probabilistic Models”. In: Advances in Neural Information Processing Systems 29:Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3936–3944.

⊕ Vlasselaer, Jonas et al. (Mar. 2016). “Exploiting Local and Repeated Structure in Dynamic Bayesian Networks”. In: Artificial Intelligence 232, pp. 43 –53. ISSN: 0004-3702. DOI:10.1016/j.artint.2015.12.001.

⊕ Wang, Jinghua and Gang Wang (2016). "Hierarchical spatial sum–product networks for action recognition in still images". In: IEEE Transactions on Circuits and Systems for Video Technology 28.1, pp. 90–100.

⊕ Yuan, Zehuan et al. (2016). “Modeling spatial layout for scene image understanding via a novel multiscale sum-product network”. In: Expert Systems with Applications 63, pp. 231–240.

⊕ Zhao, Han, Pascal Poupart, and Geoffrey J Gordon (2016a). “A Unified Approach for Learning the Parameters of Sum-Product Networks”. In: Advances in Neural InformationProcessing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., pp. 433–441.

⊕ Zhao, Han et al. (2016b). “Collapsed Variational Inference for Sum-Product Networks”. In: In Proceedings of the 33rd International Conference on Machine Learning. Vol. 48.

⊕ Alemi, Alexander A et al. (2017). “Fixing a broken ELBO”. In: arXiv preprint arXiv:1711.00464. 142/147


References X

⊕ Choi, YooJung, Adnan Darwiche, and Guy Van den Broeck (2017). "Optimal feature selection for decision robustness in Bayesian networks". In: Proceedings of the 26th International

Joint Conference on Artificial Intelligence (IJCAI).

⊕ Di Mauro, Nicola et al. (2017). “Fast and Accurate Density Estimation with Extremely Randomized Cutset Networks”. In: ECML-PKDD 2017.

⊕ Hsu, Wilson, Agastya Kalra, and Pascal Poupart (2017). “Online Structure Learning for Sum-Product Networks with Gaussian Leaves”. In: 5th International Conference on LearningRepresentations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. URL: https://openreview.net/forum?id=By7LxZNFe.

⊕ Kimmig, Angelika, Guy Van den Broeck, and Luc De Raedt (2017). “Algebraic model counting”. In: Journal of Applied Logic 22, pp. 46–62.

⊕ Latour, Anna et al. (Aug. 2017). “Combining Stochastic Constraint Optimization and Probabilistic Programming: From Knowledge Compilation to Constraint Solving”. In: Proceedingsof the 23rd International Conference on Principles and Practice of Constraint Programming (CP). DOI: 10.1007/978-3-319-66158-2_32.

⊕ Liang, Yitao, Jessa Bekker, and Guy Van den Broeck (2017a). “Learning the structure of probabilistic sentential decision diagrams”. In: Proceedings of the 33rd Conference onUncertainty in Artificial Intelligence (UAI).

⊕ Liang, Yitao and Guy Van den Broeck (Aug. 2017b). “Towards Compact Interpretable Models: Shrinking of Learned Probabilistic Sentential Decision Diagrams”. In: IJCAI 2017Workshop on Explainable Artificial Intelligence (XAI). URL: http://starai.cs.ucla.edu/papers/LiangXAI17.pdf.

⊕ Pronobis, Andrzej, Francesco Riccio, and Rajesh PN Rao (2017). "Deep spatial affordance hierarchy: Spatial knowledge representation for planning in large-scale environments". In: ICAPS 2017 Workshop on Planning and Robotics, Pittsburgh, PA, USA.

⊕ Rathke, Fabian, Mattia Desana, and Christoph Schnörr (2017). “Locally adaptive probabilistic models for global segmentation of pathological oct scans”. In: International Conferenceon Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 177–184. 143/147


References XI

⊕ Shen, Yujia, Arthur Choi, and Adnan Darwiche (2017). "A Tractable Probabilistic Model for Subset Selection.". In: UAI.

⊕ Van den Broeck, Guy and Dan Suciu (Aug. 2017). Query Processing on Probabilistic Data: A Survey. Foundations and Trends in Databases. Now Publishers. DOI:10.1561/1900000052. URL: http://starai.cs.ucla.edu/papers/VdBFTDB17.pdf.

⊕ Andraud, Martin et al. (2018). “On the use of Bayesian Networks for Resource-Efficient Self-Calibration of Analog/RF ICs”. In: 2018 IEEE International Test Conference (ITC). IEEE,pp. 1–10.

⊕ Bueff, Andreas, Stefanie Speichert, and Vaishak Belle (2018). “Tractable Querying and Learning in Hybrid Domains via Sum-Product Networks”. In: arXiv preprint arXiv:1807.05464.

⊕ Butz, Cory J et al. (2018a). “An Empirical Study of Methods for SPN Learning and Inference”. In: International Conference on Probabilistic Graphical Models, pp. 49–60.

⊕ Butz, Cory J et al. (2018b). "Efficient Examination of Soil Bacteria Using Probabilistic Graphical Models". In: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, pp. 315–326.

⊕ Choi, YooJung and Guy Van den Broeck (2018). “On robust trimming of Bayesian network classifiers”. In: arXiv preprint arXiv:1805.11243.

⊕ Di Mauro, Nicola et al. (2018). “Sum-Product Network structure learning by efficient product nodes discovery”. In: Intelligenza Artificiale 12.2, pp. 143–159.

⊕ Friedman, Tal and Guy Van den Broeck (Dec. 2018). “Approximate Knowledge Compilation by Online Collapsed Importance Sampling”. In: Advances in Neural Information ProcessingSystems 31 (NeurIPS). URL: http://starai.cs.ucla.edu/papers/FriedmanNeurIPS18.pdf.

⊕ Jaini, Priyank, Amur Ghose, and Pascal Poupart (2018). “Prometheus: Directly Learning Acyclic Directed Graph Structures for Sum-Product Networks”. In: International Conference onProbabilistic Graphical Models, pp. 181–192. 144/147


References XII

⊕ Joshi, Himanshu, Paul S Rosenbloom, and Volkan Ustun (2018). "Exact, tractable inference in the Sigma cognitive architecture via sum-product networks". In: Advances in Cognitive

Systems.

⊕ Molina, Alejandro et al. (2018). “Mixed Sum-Product Networks: A Deep Architecture for Hybrid Domains”. In: AAAI.

⊕ Peharz, Robert et al. (2018). “Probabilistic deep learning using random sum-product networks”. In: arXiv preprint arXiv:1806.01910.

⊕ Rashwan, Abdullah, Pascal Poupart, and Chen Zhitang (2018). “Discriminative Training of Sum-Product Networks by Extended Baum-Welch”. In: International Conference onProbabilistic Graphical Models, pp. 356–367.

⊕ Ratajczak, Martin, Sebastian Tschiatschek, and Franz Pernkopf (2018). “Sum-Product Networks for Sequence Labeling”. In: arXiv preprint arXiv:1807.02324.

⊕ Shen, Yujia, Arthur Choi, and Adnan Darwiche (2018). “Conditional PSDDs: Modeling and learning with modular knowledge”. In: Thirty-Second AAAI Conference on Artificial Intelligence.

⊕ Vergari, Antonio et al. (2018). “Automatic Bayesian Density Analysis”. In: CoRR abs/1807.09306. arXiv: 1807.09306. URL: http://arxiv.org/abs/1807.09306.

⊕ Xu, Jingyi et al. (July 2018). “A Semantic Loss Function for Deep Learning with Symbolic Knowledge”. In: Proceedings of the 35th International Conference on Machine Learning (ICML).

⊕ Zheng, Kaiyu, Andrzej Pronobis, and Rajesh PN Rao (2018). “Learning graph-structured sum-product networks for probabilistic semantic maps”. In: Thirty-Second AAAI Conference onArtificial Intelligence.

⊕ Chiradeep Roy, Tahrima Rahman and Vibhav Gogate (2019). “Explainable Activity Recognition in Videos using Dynamic Cutset Networks”. In: TPM2019.

⊕ Dai, Bin and David Wipf (2019). “Diagnosing and enhancing vae models”. In: arXiv preprint arXiv:1903.05789. 145/147


References XIII

⊕ Desana, Mattia and Christoph Schnörr (2019). "Sum–product graphical models". In: Machine Learning.

⊕ Galindez Olascoaga, Laura Isabel et al. (2019). "Towards Hardware-Aware Tractable Learning of Probabilistic Models". In: Proceedings of the ICML Workshop on Tractable Probabilistic Modeling (TPM). URL: http://starai.cs.ucla.edu/papers/GalindezTPM19.pdf.

⊕ Holtzen, Steven, Todd Millstein, and Guy Van den Broeck (2019). “Symbolic Exact Inference for Discrete Probabilistic Programs”. In: arXiv preprint arXiv:1904.02079.

⊕ Khosravi, Pasha et al. (2019a). “What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features”. In: arXiv preprint arXiv:1903.01620.

⊕ Khosravi, Pasha et al. (2019b). “What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features”. In: Proceedings of the 28th International Joint Conference onArtificial Intelligence (IJCAI).

⊕ Liang, Yitao and Guy Van den Broeck (2019). “Learning Logistic Circuits”. In: Proceedings of the 33rd Conference on Artificial Intelligence (AAAI).

⊕ Shah, Nimish et al. (2019). “ProbLP: A framework for low-precision probabilistic inference”. In: Proceedings of the 56th Annual Design Automation Conference 2019. ACM, p. 190.

⊕ Shen, Yujia et al. (2019). “Structured Bayesian Networks: From Inference to Learning with Routes”. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI).

⊕ Shih, Andy et al. (2019). “Smoothing Structured Decomposable Circuits”. In: arXiv preprint arXiv:1906.00311.

⊕ Stelzner, Karl, Robert Peharz, and Kristian Kersting (2019). “Faster Attend-Infer-Repeat with Tractable Probabilistic Models”. In: Proceedings of the 36th International Conference onMachine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR,pp. 5966–5975. URL: http://proceedings.mlr.press/v97/stelzner19a.html. 146/147


References XIV

⊕ Tan, Ping Liang and Robert Peharz (2019). "Hierarchical Decompositional Mixtures of Variational Autoencoders". In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR, pp. 6115–6124. URL: http://proceedings.mlr.press/v97/tan19b.html.

⊕ Trapp, Martin et al. (2019). “Bayesian Learning of Sum-Product Networks”. In: CoRR abs/1905.10884. arXiv: 1905.10884. URL: http://arxiv.org/abs/1905.10884.

147/147