Query-Specific Learning and Inference for Probabilistic Graphical Models
Anton Chechetka. PowerPoint presentation transcript.
![Page 1: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/1.jpg)
Carnegie Mellon
Query-Specific Learning and Inference for Probabilistic Graphical Models
Thesis committee: Carlos Guestrin Eric Xing J. Andrew Bagnell Pedro Domingos (University of Washington)
14 June 2011
Anton Chechetka
![Page 2: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/2.jpg)
2
Motivation
Fundamental problem: to reason accurately about noisy, high-dimensional data with local interactions
![Page 3: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/3.jpg)
3
Sensor networks
• noisy: sensors fail, noise in readings
• high-dimensional: many sensors, many readings (temperature, humidity, …) per sensor
• local interactions: nearby locations have high correlations
![Page 4: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/4.jpg)
4
Hypertext classification
• noisy: automated text understanding is far from perfect
• high-dimensional: a variable for every webpage
• local interactions: directly linked pages have correlated topics
![Page 5: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/5.jpg)
5
Image segmentation
• noisy: local information is not enough; camera sensor noise; compression artifacts
• high-dimensional: a variable for every patch
• local interactions: cows are next to grass, airplanes next to sky
![Page 6: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/6.jpg)
6
Probabilistic graphical models
Noisy, high-dimensional data with local interactions
A graph to encode only direct interactions
Probabilistic inference over many variables:
P(Q | E) = P(Q, E) / P(E)
(Q: query, E: evidence)
![Page 7: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/7.jpg)
7
Graphical models semantics
Factorized distributions:
P(X) = (1/Z) ∏_{α ∈ F} ψ_α(X_α)
Graph structure: nodes X1, …, X7 (figure; one edge labeled "separator")
Example clique: X_α = {X3, X4, X5}
X_α are small subsets of X ⇒ compact representation
![Page 8: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/8.jpg)
8
Graphical models workflow
Factorized distributions: P(X) = (1/Z) ∏_{α ∈ F} ψ_α(X_α); graph structure over X1, …, X7 (figure)
Workflow: learn/construct structure → learn/define parameters → inference P(Q | E = E)
![Page 9: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/9.jpg)
9
Graphical models: fundamental problems
Learn/construct structure: NP-complete
Learn/define parameters: exp(|X|)
Inference P(Q | E = E): #P-complete (exact), NP-complete (approx.)
Compounding errors across the pipeline
![Page 10: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/10.jpg)
10
Domain knowledge structures don’t help
(webpages)
Domain knowledge-based structures do not support tractable inference
![Page 11: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/11.jpg)
11
This thesis: general directions
Emphasizing the computational aspects of the graph.
Learn accurate and tractable models:
• Compensate for reduced expressive power with exact inference and optimal parameters
• Gain significant speedups
Inference speedups via better prioritization of computation:
• Estimate the long-term effects of propagating information through the graph
• Use long-term estimates to prioritize updates
New algorithms for learning and inference in graphical models, to make answering the queries better.
![Page 12: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/12.jpg)
12
Thesis contributions
Learn accurate and tractable models:
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
![Page 13: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/13.jpg)
13
Generative learning
P(Q | E) = P(Q, E) / P(E)
Query goal: P(Q | E); learning goal: the joint P(Q, E)
Useful when E is not known in advance
Sensors fail unpredictably
Measurements are expensive (e.g. user time), want adaptive evidence selection
![Page 14: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/14.jpg)
14
Tractable vs intractable models workflow
Tractable models: learn a simple tractable structure from domain knowledge + data → optimal parameters, exact inference → approx. P(Q | E = E)
Intractable models: construct an intractable structure from domain knowledge, or learn an intractable structure from data → approximate algorithms, no quality guarantees → approx. P(Q | E = E)
![Page 15: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/15.jpg)
Tractability via low treewidth
• Exact inference is exponential in treewidth (sum-product)
• Treewidth is NP-complete to compute in general
• Low-treewidth graphs are easy to construct
• Convenient representation: junction tree
• Other tractable model classes exist too
Treewidth: the size of the largest clique in a triangulated graph, minus one (figure: triangulated graph over variables 1–7)
![Page 16: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/16.jpg)
16
Junction trees
• Cliques connected by edges with separators
• Running intersection property
• Finding the most likely junction tree of a given treewidth > 1 is NP-complete
• We will look for good approximations
(Figure: graph over variables 1–7 and a junction tree with cliques C1 = {X1,X2,X7}, C2 = {X1,X2,X5}, C3 = {X1,X3,X5}, C4 = {X1,X4,X5}, C5 = {X4,X5,X6}, with separators such as {X1,X2}, {X1,X5}, {X4,X5}.)
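The running intersection property can be checked mechanically on a clique tree like the one in this figure. A minimal sketch in Python (the clique-list encoding is my own, not from the talk):

```python
def running_intersection(cliques, tree_edges):
    # cliques: list of variable sets; tree_edges: (i, j) index pairs forming a tree.
    # The property: for every variable, the cliques containing it must form
    # a connected subtree of the junction tree.
    adj = {i: set() for i in range(len(cliques))}
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)
    for v in set().union(*cliques):
        holds = {i for i, c in enumerate(cliques) if v in c}
        seen, stack = set(), [next(iter(holds))]
        while stack:  # flood-fill, staying inside cliques that contain v
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            stack.extend(n for n in adj[i] if n in holds)
        if seen != holds:
            return False
    return True
```

For the junction tree on this slide, the chain C1-C2-C3-C4-C5 satisfies the property, while rewiring the tree so that a clique without X2 sits between the two cliques containing X2 violates it.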
![Page 17: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/17.jpg)
17
Independencies in low-treewidth distributions
If P(X) factorizes according to a junction tree (C, E):
P(X) = ∏_{C ∈ cliques} P(X_C) / ∏_{S ∈ separators} P(X_S)
then the conditional independencies hold: I(X_left, X_right | S) = 0 for every separator S.
Example (figure): for S = {X1, X5}, X_left = {X2, X3, X7} and X_right = {X4, X6}.
It works in the other way too: if I(X_left, X_right | S) ≤ ε for every separator S, then KL(P || P_(C,E)) is small.
![Page 18: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/18.jpg)
18
Constraint-based structure learning
If I(X_left, X_right | S) ≤ ε for all separators S, then KL(P || P_(C,E)) is small. Look for JTs where this holds (constraint-based structure learning):
1. Enumerate all candidate separators over the variables X: S1 = {X1, X2}, S2 = {X1, X3}, S3 = {X1, X4}, …, Sm = {Xn−1, Xn}
2. For each candidate separator, partition the remaining variables into weakly dependent subsets: I(X_A, X_B | S3) < ε
3. Find a consistent junction tree (figure: cliques C1–C5 over separators S1, S3, S7, S8)
![Page 19: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/19.jpg)
19
Mutual information complexity
I(X_A, X_−A | S) = H(X_A | S) − H(X_A | X_−A, S)
(X_−A: everything except X_A; H(· | ·): conditional entropy)
I(X_A, X_−A | S) depends on all assignments to X: exp(|X|) complexity in general
Our contribution: polynomial-time upper bound
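For a concrete sense of the quantity involved, conditional mutual information can be computed exactly from a full joint table; doing so is precisely what becomes infeasible as |X| grows. A small Python sketch (the tuple encoding of assignments is my own):

```python
import math

def cond_mutual_info(p, xs, ys, zs):
    # I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), computed from a joint
    # table p mapping full assignments (tuples over all variables) to probs.
    def marginal(idx):
        out = {}
        for assign, pr in p.items():
            key = tuple(assign[i] for i in idx)
            out[key] = out.get(key, 0.0) + pr
        return out

    def H(idx):
        return -sum(pr * math.log2(pr) for pr in marginal(idx).values() if pr > 0)

    return H(xs + zs) + H(ys + zs) - H(xs + ys + zs) - H(zs)
```

The table p has one entry per joint assignment, so the cost is exponential in the number of variables; the thesis bound avoids ever materializing it.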
![Page 20: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/20.jpg)
20
Mutual info upper bound: intuition
I(A, B | C): hard to compute directly.
Only look at small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ k: I(D, F | C) is easy.
A polynomial number of small subsets, polynomial complexity for every pair.
Any conclusions about I(A, B | C)? In general, no. If a good junction tree exists, yes.
![Page 21: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/21.jpg)
21
Contribution: mutual info upper bound
Theorem: Suppose an ε-JT of treewidth k for P(A ∪ B ∪ C) exists, i.e. I(X_left, X_right | S) ≤ ε for all its separators S.
Let δ = max I(D, F | C) over D ⊆ A, F ⊆ B with |D ∪ F| ≤ treewidth + 1.
Then I(A, B | C) ≤ |A ∪ B ∪ C| (ε + δ).
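Under the theorem's assumptions, the bound is computable by brute force over the small subsets. A sketch where `cmi(D, F, C)` stands in for an empirical estimator of I(D, F | C) (a hypothetical interface, not the thesis implementation):

```python
import itertools

def mi_upper_bound(A, B, C, cmi, eps, k):
    # Theorem sketch: if an eps-JT of treewidth k exists, then
    # I(A, B | C) <= |A ∪ B ∪ C| * (eps + delta), where delta is the largest
    # I(D, F | C) over D ⊆ A, F ⊆ B with |D| + |F| <= k + 1.
    delta = 0.0
    for size_d in range(1, k + 1):
        for size_f in range(1, k + 2 - size_d):
            for D in itertools.combinations(A, size_d):
                for F in itertools.combinations(B, size_f):
                    delta = max(delta, cmi(D, F, C))
    return len(set(A) | set(B) | set(C)) * (eps + delta)
```

The loops enumerate only subsets of size at most k + 1, which is where the polynomial complexity comes from.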
![Page 22: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/22.jpg)
22
Mutual info upper bound: complexity
Direct computation: complexity exp(|A ∪ B ∪ C|).
Our upper bound: O(|A ∪ B|^(treewidth+1)) small subsets D ⊆ A, F ⊆ B with |D ∪ F| ≤ treewidth + 1, each taking exp(|C| + treewidth) time; |C| = treewidth for structure learning.
⇒ polynomial(|A ∪ B ∪ C|) complexity
![Page 23: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/23.jpg)
23
Guarantees on learned model quality
Theorem: Suppose a strongly connected ε-JT of treewidth k for P(X) exists. Then our algorithm will, with probability at least (1 − γ), find a JT such that
KL(P || P_JT) ≤ |X| (k + 1)(ε + 2δ)   [quality guarantee]
using O(log(|X|/γ) / ε²) samples [poly samples] and O(|X|^(2k+3) log(1/γ) / ε²) time [poly time].
Corollary: strongly connected junction trees are PAC-learnable
![Page 24: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/24.jpg)
24
Related work
Reference                   Model                          Guarantees    Time
[Bach+Jordan:2002]          tractable                      local         poly(n)
[Chow+Liu:1968]             tree                           global        O(n² log n)
[Meila+Jordan:2001]         tree mix                       local         O(n² log n)
[Teyssier+Koller:2005]      compact                        local         poly(n)
[Singh+Moore:2005]          all                            global        exp(n)
[Karger+Srebro:2001]        tractable                      const-factor  poly(n)
[Abbeel+al:2006]            compact                        PAC           poly(n)
[Narasimhan+Bilmes:2004]    tractable                      PAC           exp(n)
our work                    tractable                      PAC           poly(n)
[Gogate+al:2010]            tractable with high treewidth  PAC           poly(n)
![Page 25: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/25.jpg)
25
Results – typical convergence time
good results early on in practice
(Plot: test log-likelihood over time; higher is better.)
![Page 26: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/26.jpg)
26
Results – log-likelihood
(Plots: test log-likelihood across datasets; higher is better.)
Our method vs.: OBS (local search in limited in-degree Bayes nets), Chow-Liu (most likely JTs of treewidth 1), Karger-Srebro (constant-factor approximation JTs)
![Page 27: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/27.jpg)
27
Conclusions
• A tractable upper bound on conditional mutual info
• Graceful quality degradation and PAC learnability guarantees
• Analysis of when dynamic programming works [in the thesis]
• Dealing with unknown mutual information threshold [in the thesis]
• Speedups preserving the guarantees; further speedups without guarantees
![Page 28: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/28.jpg)
28
Thesis contributions
Learn accurate and tractable models:
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
![Page 29: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/29.jpg)
29
Discriminative learning
P(Q | E) = P(Q, E) / P(E)
Query goal and learning goal: P(Q | E)
Useful when the variables E are always the same (non-adaptive, one-shot observation):
• Image pixels → scene description
• Document text → topic, named entities
Better accuracy than generative models
![Page 30: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/30.jpg)
30
Discriminative log-linear models
P(Q | E, w) = (1/Z(E, w)) exp ⟨w, f(Q, E)⟩
f: features (domain knowledge); w: weights (learn from data); Z(E, w): evidence-dependent normalization
• Don't sum over all values of E
• Don't model P(E)
• No need for structure over E
(Figure: query variables connected by features f12, f34; evidence attached to the query.)
![Page 31: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/31.jpg)
31
Model tractability still important
Observation #1: tractable models are necessary for exact inference and parameter learning in the discriminative setting
Tractability is determined by the structure over query
![Page 32: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/32.jpg)
32
Simple local models: motivation
(Figure: query Q as a function of evidence E, Q = f(E); locally almost linear.)
Exploiting evidence values overcomes the expressive power deficit of simple models
We will learn local tractable models
![Page 33: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/33.jpg)
33
Context-specific independence
Observation #2: use evidence values at test time to tune the structure of the models, do not commit to a single tractable model
(figure label: no edge)
![Page 34: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/34.jpg)
34
Low-dimensional dependencies in generative structure learning
Generative structure learning often relies only on low-dimensional marginals:
• Junction trees have decomposable scores: LLH = Σ_{S ∈ separators} H(S) − Σ_{C ∈ cliques} H(C)
• Low-dimensional independence tests: I(A, B | S)
• Small changes to structure ⇒ quick score recomputation
Discriminative structure learning: need inference in the full model for every datapoint, even for small changes in structure
![Page 35: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/35.jpg)
35
Leverage generative learning
Observation #3: generative structure learning algorithms have very useful properties, can we leverage them?
![Page 36: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/36.jpg)
36
Observations so far
• The discriminative setting has extra information, including evidence values at test time; want to use it to learn local tractable models
• Good structure learning algorithms exist for the generative setting that only require low-dimensional marginals P(Q_β)
Approach: 1. use local conditionals P(Q_β | E = E) as "fake marginals" to learn local tractable structures; 2. learn exact discriminative feature weights
![Page 37: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/37.jpg)
37
Evidence-specific CRF overview
Approach: 1. use local conditionals P(Q | E = E) as "fake marginals" to learn local tractable structures; 2. learn exact discriminative feature weights
Pipeline: evidence value E = E → local conditional density estimators P(Q | E) → P(Q | E = E) → generative structure learning algorithm → tractable structure for E = E; together with feature weights w → tractable evidence-specific CRF
![Page 38: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/38.jpg)
Evidence-specific CRF formalism
P(Q | E, w, u) = (1/Z(E, w, u)) exp Σ_i w_i f_i(Q, E) I_i(E, u)
Observation: an identically zero feature (f ≡ 0) does not affect the model.
Evidence-specific structure: I(E, u) ∈ {0, 1}; u are extra "structural" parameters.
(Figure: fixed dense model × evidence-specific tree "mask" = evidence-specific model; different evidence values E = E1, E2, E3 produce different masks and thus different evidence-specific feature values.)
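The effect of the 0/1 mask can be illustrated on a toy query with made-up features and weights; this is a sketch of the formalism, not the thesis code:

```python
import itertools
import math

def ess_crf_prob(q, features, w, mask):
    # P(Q = q | E, w, u) ∝ exp Σ_i w_i f_i(Q, E) I_i(E, u). Here the evidence
    # enters only through the 0/1 mask I_i(E, u), which zeroes out the
    # features dropped by the evidence-specific structure.
    def score(assign):
        return math.exp(sum(w_i * f_i(assign) * m_i
                            for w_i, f_i, m_i in zip(w, features, mask)))

    states = list(itertools.product([0, 1], repeat=len(q)))
    return score(tuple(q)) / sum(score(s) for s in states)
```

With the mask set to 0 the feature vanishes and the toy distribution is uniform; with the mask set to 1 the same weight vector induces a strong agreement preference.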
![Page 39: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/39.jpg)
39
Evidence-specific CRF learning
Learning proceeds in the same order as testing:
evidence value E = E → local conditional density estimators P(Q | E) → P(Q | E = E) → generative structure learning algorithm → tractable structure for E = E; together with feature weights w → tractable evidence-specific CRF
![Page 40: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/40.jpg)
40
Plug in generative structure learning
P(Q | E, w, u) = (1/Z(E, w, u)) exp Σ_i w_i f_i(Q, E) I_i(E, u)
I(E, u) encodes the output of the chosen structure learning algorithm.
Directly generalize generative algorithms:
Generative: P(Qi, Qj) (pairwise marginals) + Chow-Liu algorithm = optimal tree
Discriminative: P(Qi, Qj | E = E) (pairwise conditionals) + Chow-Liu algorithm = good tree for E = E
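The Chow-Liu step is a maximum spanning tree over pairwise (conditional) mutual information scores. A minimal Kruskal-style sketch (union-find with path halving; the `mi` dictionary interface is my own):

```python
def chow_liu_tree(mi):
    # mi: dict mapping frozenset({i, j}) -> mutual information score I(Q_i; Q_j).
    # The maximum spanning tree under these weights is the Chow-Liu tree; in
    # the discriminative variant the scores would be conditionals for E = E.
    nodes = set()
    for e in mi:
        nodes |= e
    parent = {v: v for v in nodes}

    def find(v):  # union-find root with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for edge in sorted(mi, key=mi.get, reverse=True):  # heaviest edges first
        i, j = tuple(edge)
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append(edge)
    return tree
```

Swapping the weight dictionary per evidence value is all that changes between the generative and evidence-specific uses.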
![Page 41: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/41.jpg)
41
Evidence-specific CRF learning: structure
Choose generative structure learning algorithm A
Identify low-dimensional subsets Qβ that A may need
Chow-Liu
All pairs (Qi, Qj)
Decompose the original problem (E, Q) into low-dimensional pairwise problems: (E, {Q1,Q2}), (E, {Q1,Q3}), (E, {Q3,Q4}), …
Estimate P̂(Q1,Q2 | E, u), P̂(Q1,Q3 | E, u), P̂(Q3,Q4 | E, u), …
![Page 42: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/42.jpg)
42
Estimating low-dimensional conditionals
Use the same features as the baseline high-treewidth model.
Baseline CRF: P(Q | E, w) = (1/Z(E, w)) exp ⟨w, f(Q, E)⟩
Low-dimensional model: P(Q_β | E, u) = (1/Z(E, u)) exp Σ_{α s.t. Q_α ⊆ Q_β} ⟨u_α, f_α(Q_α, E)⟩  (scope restriction)
End result: optimal u
![Page 43: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/43.jpg)
43
Evidence-specific CRF learning: weights
P(Q | E, w, u) = (1/Z(E, w, u)) exp Σ_i w_i f_i(Q, E) I_i(E, u)
• Already chosen the algorithm behind I(E, u)
• Already learned the structural parameters u
• Only need to learn the feature weights w
log P(Q | E, w, u) is concave in w ⇒ unique global optimum
The products f_i(Q, E) I_i(E, u) act as "effective features".
![Page 44: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/44.jpg)
44
Evidence-specific CRF learning: weights
∂ log P(Q | E, w, u) / ∂w = f(Q, E) I(E, u) − E_{P(Q' | E, w, u)} [ f(Q', E) I(E, u) ]
P(· | E, w, u) is a tree-structured distribution (fixed dense model × evidence-specific tree "mask"), so exact tree-structured gradients w.r.t. w are available for each datapoint (E = E1, Q = Q1), (E = E2, Q = Q2), (E = E3, Q = Q3); summing them gives the overall (dense) gradient.
![Page 45: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/45.jpg)
45
Results – WebKB: text + links → webpage topic
(Bar charts: prediction error for SVM, RMN, ESS-CRF, M3N, and training time for RMN, ESS-CRF, M3N; lower is better.)
Methods: SVM (ignores links), RMN (standard dense CRF), ESS-CRF (our work), M3N (max-margin model)
![Page 46: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/46.jpg)
46
Image segmentation – accuracy: local segment features + neighbor segments → type of object
(Bar chart: accuracy of logistic regression, dense CRF, and ESS-CRF; higher is better.)
Methods: logistic regression (ignores links), dense CRF (standard), ESS-CRF (our work)
![Page 47: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/47.jpg)
47
Image segmentation – time
(Bar charts, log scale: train time and test time for logistic regression, dense CRF, and ESS-CRF; lower is better.)
Methods: logistic regression (ignores links), dense CRF (standard), ESS-CRF (our work)
![Page 48: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/48.jpg)
48
Conclusions
• Using evidence values to tune low-treewidth model structure compensates for the reduced expressive power
• Order of magnitude speedup at test time (sometimes train time too)
• General framework for plugging in existing generative structure learners
• Straightforward relational extension [in the thesis]
![Page 49: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/49.jpg)
49
Thesis contributions
Learn accurate and tractable models:
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
![Page 50: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/50.jpg)
50
Why high-treewidth models?
• A dense model expressing laws of nature (e.g. protein folding)
• Max-margin parameters don't work well (yet?) with evidence-specific structures
![Page 51: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/51.jpg)
51
Query-specific inference problem
(Figure: a large model with a small query region, observed evidence, and a "not interesting" remainder.)
Use information about the query to speed up convergence of belief propagation for the query marginals.
P(X) ∝ ∏_{(i,j) ∈ E} f_ij(X_i, X_j)
![Page 52: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/52.jpg)
52
(Loopy) belief propagation: passing messages along edges
Update rule: m_ij^(t+1)(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i) \ j} m_ki^(t)(x_i)
Variable belief: P^(t)(x_i) ∝ ∏_{j ∈ Γ(i)} m_ji^(t)(x_i)
Result: all single-variable beliefs
(Figure: messages on a small graph with nodes r, s, j, k, i, h, u.)
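The update rule and belief equation above can be sketched directly; a toy flooding-schedule implementation over pairwise factors (my own interface; on a tree this converges to the exact marginals):

```python
def loopy_bp(factors, n_states, iters=50):
    # factors: dict (i, j) -> pairwise potential f(x_i, x_j) for edge i-j.
    directed = [(i, j) for (i, j) in factors] + [(j, i) for (i, j) in factors]
    nbrs = {}
    for (i, j) in directed:
        nbrs.setdefault(i, []).append(j)
    msgs = {e: [1.0] * n_states for e in directed}

    def pot(i, j, xi, xj):  # look up the potential regardless of direction
        return factors[(i, j)](xi, xj) if (i, j) in factors else factors[(j, i)](xj, xi)

    for _ in range(iters):
        for (i, j) in directed:
            new = []
            for xj in range(n_states):
                total = 0.0
                for xi in range(n_states):
                    prod = pot(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:           # product over incoming messages, except from j
                            prod *= msgs[(k, i)][xi]
                    total += prod
                new.append(total)
            z = sum(new)
            msgs[(i, j)] = [v / z for v in new]  # normalize for stability

    beliefs = {}
    for i in nbrs:
        b = [1.0] * n_states
        for xi in range(n_states):
            for k in nbrs[i]:
                b[xi] *= msgs[(k, i)][xi]
        z = sum(b)
        beliefs[i] = [v / z for v in b]
    return beliefs
```

This uses the fixed flooding schedule that the next slides improve on with prioritized updates.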
![Page 53: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/53.jpg)
53
(Loopy) belief propagation: message dependencies are local; m_ij depends only on the messages into i (figure).
Freedom in scheduling updates. Round-robin schedule: fix a message order and apply updates in that order until convergence.
![Page 54: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/54.jpg)
54
Dynamic update prioritization
A fixed update sequence is not the best option; dynamic update scheduling can speed up convergence:
• Tree-Reweighted BP [Wainwright et al., AISTATS 2003]
• Residual BP [Elidan et al., UAI 2006]
Residual BP: apply the largest change first.
(Figure: an informative update propagates a large change toward regions that still have large changes; spending updates where changes are small is wasted computation.)
![Page 55: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/55.jpg)
55
Residual BP [Elidan et al., UAI 2006]
Update rule: m_ij^(NEW)(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i) \ j} m_ki^(OLD)(x_i)
Pick the edge with the largest residual max_ij || m_ij^(NEW) − m_ij^(OLD) ||, apply the update (old ← new).
More effort on the difficult parts of the model. But no query.
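The residual-first schedule is independent of the BP details; a generic sketch where `compute_message` and the dependency map are caller-supplied (my own interface, not the paper's):

```python
def residual_bp(edges, depends_on, compute_message, init, tol=1e-9, max_updates=1000):
    # Greedy residual scheduling: repeatedly apply the update with the largest
    # residual ||m_new - m_old||_inf, then refresh the residuals of the
    # updates that read the changed message (depends_on[e]).
    msgs = {e: list(init[e]) for e in edges}

    def res(e):
        new = compute_message(e, msgs)
        return max(abs(a - b) for a, b in zip(new, msgs[e]))

    residuals = {e: res(e) for e in edges}
    updates = 0
    while updates < max_updates:
        e = max(residuals, key=residuals.get)  # largest residual first
        if residuals[e] < tol:
            break  # converged: every pending change is negligible
        msgs[e] = compute_message(e, msgs)
        residuals[e] = 0.0
        for d in depends_on[e]:
            residuals[d] = res(d)
        updates += 1
    return msgs, updates
```

A real implementation would keep the residuals in a priority queue rather than scanning for the max; the greedy policy is the same.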
![Page 56: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/56.jpg)
56
Why edge importance weights?
• Residual BP updates with no influence on the query are wasted computation
• Want to update the edges that will influence the query in the future
(Figure: two candidate updates with residual(1) < residual(2); which to update?)
Residual BP: max immediate residual reduction. Our work: max approximate eventual effect on P(query).
![Page 57: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/57.jpg)
57
Query-Specific BP
Update rule: m_ij^(NEW)(x_j) = Σ_{x_i} f_ij(x_i, x_j) ∏_{k ∈ Γ(i) \ j} m_ki^(OLD)(x_i)
Pick the edge with the largest weighted residual max_ij A_ij || m_ij^(NEW) − m_ij^(OLD) ||, where A_ij is the edge importance (the only change!), and apply the update (old ← new).
Rest of the talk: defining and computing edge importance.
![Page 58: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/58.jpg)
Edge importance base case
Goal: approximate the eventual update effect on P(Q).
For an edge (j → i) directly connected to the query: || P^(NEW)(Q) − P^(OLD)(Q) || ≤ || m_ji^(NEW) − m_ji^(OLD) ||
(change in query belief ≤ change in message; the bound is tight)
Base case: A_ji = 1.
![Page 59: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/59.jpg)
Edge importance one step away
For an edge (r → j) one step away from the query: || ΔP(Q) || ≤ sup || ∂m_ji / ∂m_rj || · || Δm_rj ||, where the sup is over the values of all other messages.
The message importance sup || ∂m_ji / ∂m_rj || can be computed in closed form looking only at f_ji [Mooij, Kappen; 2007].
So A_rj = sup || ∂m_ji / ∂m_rj ||.
![Page 60: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/60.jpg)
Edge importance general case
Base case: A_ji = 1. One step away: A_rj = sup || ∂m_ji / ∂m_rj ||.
Generalization along a path: sup || ∂P(Q) / ∂m_sh || ≤ sup || ∂m_ji / ∂m_rj || · sup || ∂m_rj / ∂m_hr || · sup || ∂m_hr / ∂m_sh || = sensitivity(path): the max impact along the path.
The direct bound sup || ∂P(Q) / ∂m_sh || is expensive to compute and may be infinite.
![Page 61: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/61.jpg)
Edge importance general case
A_sh = max over all paths λ from h to the query of sensitivity(λ)
There are a lot of paths in a graph; trying out every one is intractable.
![Page 62: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/62.jpg)
62
Efficient edge importance computation
A = max over all paths λ to the query of sensitivity(λ)
sensitivity(h → r → j → i → query) = sup || ∂m_ji / ∂m_rj || · sup || ∂m_rj / ∂m_hr || · sup || ∂m_hr / ∂m_sh ||
Each factor is always ≤ 1, so sensitivity decomposes into individual edge contributions and always decreases as the path grows.
Dijkstra's (shortest paths) algorithm will efficiently find max-sensitivity paths for every edge.
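Because every factor is at most 1, a max-product variant of Dijkstra's algorithm recovers all importances in one pass from the query. A sketch where per-edge sensitivities are given as plain numbers (in the thesis they come from the closed-form message bound):

```python
import heapq

def max_sensitivity(graph, query):
    # graph: dict node -> list of (neighbor, edge sensitivity in [0, 1]).
    # Importance A[v] = max over paths from the query to v of the product of
    # edge sensitivities; products only shrink, so Dijkstra's greedy
    # settle-the-best-node-first argument still applies.
    A = {query: 1.0}
    heap = [(-1.0, query)]
    while heap:
        neg_a, u = heapq.heappop(heap)
        if -neg_a < A[u]:
            continue  # stale heap entry
        for v, s in graph[u]:
            cand = A[u] * s
            if cand > A.get(v, 0.0):
                A[v] = cand
                heapq.heappush(heap, (-cand, v))
    return A
```

A longer path through strong dependencies can beat a shorter path through weak ones, which is exactly what the weighted residual schedule exploits.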
![Page 63: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/63.jpg)
63
Query-Specific BP
A_ji = max over all paths λ from i to the query of sensitivity(λ)
1. Run Dijkstra's algorithm starting at the query to get edge weights
2. Pick the edge with the largest weighted residual max_ij A_ij || m_ij^(NEW) − m_ij^(OLD) || and apply the update (old ← new)
More effort on the difficult and relevant parts of the model: takes into account not only the graphical structure, but also the strength of dependencies.
![Page 64: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/64.jpg)
64
Experiments – single query
Easy model (sparse connectivity, weak interactions) vs. hard model (dense connectivity, strong interactions)
(Convergence plots: standard residual BP vs. our work.)
Faster convergence, but long initialization still a problem
![Page 65: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/65.jpg)
65
Anytime query-specific BP
Query-specific BP: run Dijkstra's algorithm to completion, then start BP updates.
Anytime QSBP: interleave Dijkstra's algorithm with BP updates; same BP update sequence!
![Page 66: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/66.jpg)
66
Experiments – anytime QSBP
Easy model (sparse connectivity, weak interactions) vs. hard model (dense connectivity, strong interactions)
(Convergence plots: standard residual BP, our work, our work + anytime.)
Much shorter initialization
![Page 67: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/67.jpg)
67
Experiments – multiquery
Easy model (sparse connectivity, weak interactions) vs. hard model (dense connectivity, strong interactions)
(Convergence plots: standard residual BP, our work, our work + anytime.)
![Page 68: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/68.jpg)
68
Conclusions
• Weighting edges is a simple and effective way to improve prioritization
• We introduce a principled notion of edge importance based on both structure and parameters of the model
• Robust speedups in the query-specific setting: don't spend computation on nuisance variables unless needed for the query marginal
• Deferring BP initialization has a large impact
![Page 69: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/69.jpg)
69
Thesis contributions
Learn accurate and tractable models:
In the generative setting P(Q,E) [NIPS 2007]
In the discriminative setting P(Q|E) [NIPS 2010]
Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]
![Page 70: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/70.jpg)
70
Future work
• More practical JT learning: SAT solvers to construct structure, pruning heuristics, …
• Evidence-specific learning: trade efficiency for accuracy; max-margin evidence-specific models; theory on ES structures too
• Inference: beyond query-specific (better prioritization in general); beyond BP (query-specific Gibbs sampling?)
![Page 71: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/71.jpg)
71
Thesis conclusions

• Graphical models are a regularization technique for high-dimensional distributions.
• Representation-based structure (conditional independencies) is well understood.
• Right now, structured computation is a "consequence" of representation, with major issues in tractability and approximation quality.
• The logical next step: structured computation as a primary basis of regularization. This thesis: computation-centric approaches have better efficiency and do not sacrifice accuracy.
![Page 72: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/72.jpg)
72
Thank you!
Collaborators: Carlos Guestrin, Joseph Bradley, Dafna Shahaf
![Page 73: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/73.jpg)
73
Mutual info upper bound: quality

Upper bound: suppose an ε-JT exists and δ is the largest mutual information over small subsets. Then

  I(A, B | C) ≤ |A ∪ B ∪ C| (ε + δ)

• No need to know the ε-JT, only that it exists.
• No connection between C and the JT separators.
• C can be of any size, with no connection to the JT treewidth.
• The bound is loose only when there is no hope of learning a good JT.
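The conditional mutual information being bounded here can be computed by brute force from a small joint distribution. A minimal sketch (not from the thesis; function and argument names are mine):

```python
from math import log2

def cond_mutual_info(p, A, B, C):
    """I(A; B | C) in bits for a joint distribution `p`: a dict
    mapping full assignments (tuples) to probabilities; A, B, C
    are disjoint lists of variable indices (C may be empty)."""
    def marg(indices):
        # marginalize the joint onto the given variable indices
        m = {}
        for x, pr in p.items():
            key = tuple(x[i] for i in indices)
            m[key] = m.get(key, 0.0) + pr
        return m
    pABC, pAC, pBC, pC = marg(A + B + C), marg(A + C), marg(B + C), marg(C)
    mi = 0.0
    for abc, pr in pABC.items():
        if pr <= 0.0:
            continue
        a, b, c = abc[:len(A)], abc[len(A):len(A) + len(B)], abc[len(A) + len(B):]
        mi += pr * log2(pr * pC[c] / (pAC[a + c] * pBC[b + c]))
    return mi
```

For independent variables the value is 0; for two perfectly correlated binary variables it is 1 bit.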
![Page 74: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/74.jpg)
74
Typical graphical models workflow

Learn/construct structure → learn/define parameters → inference → P(Q|E=e)

• The graph is primarily a representation tool: domain knowledge gives reasonable but intractable structures.
• Inference then falls back on approximate algorithms with no quality guarantees, so we only get approximate P(Q|E=e).
![Page 75: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/75.jpg)
75
Contributions – tractable models

Learn accurate and tractable models:
• In the generative setting [NIPS 2007]:
  - polynomial-time conditional mutual information upper bound
  - first PAC-learning result for strongly connected junction trees
  - graceful degradation guarantees
  - speedup heuristics
• In the discriminative setting [NIPS 2010]:
  - general framework for learning CRF structure that depends on evidence values at test time
  - extensions to the relational setting
  - empirical: order-of-magnitude speedups with the same accuracy as high-treewidth models
![Page 76: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/76.jpg)
76
Contributions – faster inference

Speed up belief propagation for cases with many nuisance variables [AISTATS 2010]:
• A framework of importance-weighted residual belief propagation:
  - a principled measure of the eventual impact of an edge update on the query belief
  - prioritize updates by importance for the query instead of absolute magnitude
• An anytime modification to defer much of the initialization:
  - initial inference results available much sooner
  - often much faster eventual convergence
  - the same fixed points as the full model
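The prioritization change at the heart of this contribution can be sketched in a few lines (hypothetical names; the real algorithm maintains a priority queue rather than scanning all edges):

```python
def next_update(edges):
    """Pick the next message update. `edges` maps an edge to a
    (residual, importance_weight) pair. Standard residual BP would
    use the residual alone; query-specific BP uses the product
    residual * importance weight, favoring edges that matter to
    the query even when their residuals are small."""
    return max(edges, key=lambda e: edges[e][0] * edges[e][1])
```

Note how a small residual near the query can outrank a large residual on a nuisance edge.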
![Page 77: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/77.jpg)
77
Future work

Two main bottlenecks:
• Constructing JTs given mutual information values, especially with non-uniform treewidth and dependence strength:
  - large sample: learnability guarantees for non-uniform treewidth
  - small sample: non-uniform treewidth for regularization
  - constraint satisfaction, SAT solvers, etc.?
  - relax the strong connectivity requirement?
• Evaluating mutual information: need to look at 2k+1 variables instead of k+1, a large penalty:
  - branch on features instead of sets of variables? [Gogate et al., 2010]
  - speedups without guarantees: local search, greedy separator construction, …
![Page 78: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/78.jpg)
78
Log-linear parameter learning

Conditional log-likelihood:

  LLH(w | D) = Σ_{(Q,E) ∈ D} log P(Q | E, w)

• Convex optimization: unique global maximum.
• Gradient: features minus expected features:

  ∂/∂w log P(Q | E, w) = f(Q, E) − E_{P(Q|E,w)}[ f(Q, E) ]

• Need inference for every E given the current w.
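To make the gradient concrete, here is a toy sketch (all names are mine) for a model with a single binary query variable, where "inference" is just a two-term sum over the values of Q:

```python
import math

def cll_gradient(w, data, features):
    """Gradient of the conditional log-likelihood for a toy
    log-linear model P(Q|E,w) ∝ exp(w · f(Q,E)) with binary Q.
    `features(q, e)` returns a feature vector; exact inference
    is a sum over q ∈ {0, 1}."""
    grad = [0.0] * len(w)
    for q_obs, e in data:
        # unnormalized scores for each value of Q
        scores = {q: math.exp(sum(wi * fi for wi, fi in zip(w, features(q, e))))
                  for q in (0, 1)}
        Z = sum(scores.values())
        f_obs = features(q_obs, e)
        # expected features under the model P(Q | E=e, w)
        f_exp = [sum(scores[q] / Z * features(q, e)[k] for q in (0, 1))
                 for k in range(len(w))]
        for k in range(len(w)):
            grad[k] += f_obs[k] - f_exp[k]
    return grad
```

With intractable models this inner expectation is exactly the step that forces approximate inference.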
![Page 79: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/79.jpg)
79
Log-linear parameter learning

|             | Generative (E = ∅) | Discriminative |
|-------------|--------------------|----------------|
| Tractable   | Closed-form        | Exact gradient-based |
| Intractable | Approximate gradient-based (no guarantees) | Approximate gradient-based (no guarantees) |

• Generative: inference once per weights update. Discriminative: inference for every datapoint (Q, E) once per weights update, a "manageable" slowdown proportional to the number of datapoints.
• There is a complexity "phase transition" between the tractable and intractable rows.
![Page 80: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/80.jpg)
80
Plug in generative structure learning

  P(Q | E, w, u) = (1 / Z(E, w, u)) exp( Σ_i w_i f_i(Q, E) I_i(E, u) )

I(E, u) encodes the output of the chosen structure learning algorithm:
• Chow-Liu for optimal trees
• our thin junction tree learning from part 1
• Karger-Srebro for high-quality low-treewidth junction trees
• local search, etc.

Fix the algorithm → always get structures with the desired properties (e.g. bounded treewidth): replace P(Q) with the approximate conditionals P(Q | E=E, u) everywhere.
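A minimal sketch of how the structure indicators I(E, u) gate the features, assuming a single binary query variable so the normalizer is a two-term sum (names are mine, not the thesis code):

```python
import math

def es_crf_prob(w, features, mask, e):
    """P(Q = q | E = e, w, u) for a toy model with one binary
    query variable. `mask` plays the role of the indicators
    I(E, u): features outside the evidence-specific structure
    are multiplied by 0 and drop out of the model."""
    def score(q):
        f = features(q, e)
        return math.exp(sum(wi * fi * mi for wi, fi, mi in zip(w, f, mask)))
    Z = score(0) + score(1)
    return {q: score(q) / Z for q in (0, 1)}
```

Zeroing the mask removes a feature's influence entirely, which is how the structure selection controls tractability.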
![Page 81: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/81.jpg)
81
Evidence-specific CRF learning: weights

  P(Q | E, w, u) = (1 / Z(E, w, u)) exp( Σ_i w_i f_i(Q, E) I_i(E, u) )

• We already know the algorithm behind I(E, u) and have already learned u, so we only need to learn w.
• The structure induced by I(E, u) is always tractable.
• We can find the evidence-specific structure I(E=E, u) for every training datapoint (Q, E), so we can learn the optimal w exactly:

  ∂/∂w log P(Q | E, w, u) = I(E, u) f(Q, E) − E_{P(Q|E,w,u)}[ I(E, u) f(Q, E) ]

where the expectation is over a tree-structured distribution.
![Page 82: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/82.jpg)
82
Relational evidence-specific CRF

Relational models: templated features + shared weights.
• Relation: LinksTo(webpage, webpage).
• Groundings: every pair of pages connected by a LinksTo edge.
• Learn a single weight w_LINK and copy that weight to every grounding.
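Weight tying also means the gradient with respect to a template weight such as w_LINK is the sum of the per-grounding gradients. A tiny sketch (hypothetical names):

```python
def accumulate_tied_gradients(grounding_grads):
    """Relational weight tying: each template weight (e.g. w_LINK)
    is copied to every grounding of its relation, so the gradient
    with respect to the template weight is the sum over groundings.
    `grounding_grads` is a list of (template_name, gradient) pairs."""
    totals = {}
    for template, g in grounding_grads:
        totals[template] = totals.get(template, 0.0) + g
    return totals
```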
![Page 83: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/83.jpg)
83
Relational evidence-specific CRF

Relational models: templated features + shared weights. Every grounding is a separate datapoint for structure training → use the propositional approach + shared weights.

[Figure: a grounded model over x1 … x5; every grounded pair (x1 x2, x1 x3, x1 x4, x1 x5, x2 x3, x2 x4, x2 x5, x3 x4, x3 x5, x4 x5) becomes a training dataset for the "structural" parameters u.]
![Page 84: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/84.jpg)
84
Future work

• Faster learning: pseudolikelihood is really fast, need to compete.
• Larger treewidth: trade time for accuracy.
• Theory on learning the "structural parameters" u.
• Max-margin learning:
  - inference is a basic step in max-margin learning too, so tractable models are useful beyond log-likelihood
  - optimizing the feature weights w given local trees is straightforward
  - optimizing the "structural parameters" u for max-margin is hard: what is the right objective?
• Almost-tractable structures and other tractable models: make sure loops do not hurt too much.
![Page 85: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/85.jpg)
85
Query versus nuisance variables

We may actually care about only a few variables:
• What are the topics of the webpages on the first page of Google search results for my query?
• Smart heating control: is anybody going to be at home for the next hour?
• Does the patient need immediate doctor attention?

But the model may need many other variables to be accurate enough: we do not care about them per se, yet we must look at them to get the query right. Both query and nuisance variables are unobserved, so standard inference algorithms do not see a difference. We speed up inference by focusing on the query: look at nuisance variables only to the extent needed to answer the query.
![Page 86: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/86.jpg)
86
Our contributions
Using weighted residuals to prioritize updates
Define message weights reflecting the importance of the message to the query
Computing importance weights efficiently
Experiments: faster convergence on large relational models
![Page 87: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/87.jpg)
87
Interleaving

Dijkstra's expands the highest-weight edges first. [Figure: edges near the query, marked as expanded on a previous iteration, just expanded, or not yet expanded.]

Let A = min importance weight over the expanded edges, and let

  M = max_{(i→j) ∈ all edges} || m_ij^NEW − m_ij^OLD ||

Since Dijkstra's expands edges in order of decreasing weight, every not-yet-expanded edge has weight at most A, so M · A is an upper bound on its priority. Suppose

  max_{(i→j) ∈ expanded} || m_ij^NEW − m_ij^OLD || · A_ij ≥ M · A

i.e. the actual priority of some expanded edge already beats this upper bound: then there is no need to expand further at this point.
![Page 88: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/88.jpg)
88
Deferring BP initialization
Observation: Dijkstra’s alg. expands the most important edges first
Do we really need to look at every low importance edgebefore applying BP updates?
No! Can use upper bounds on priority instead.
![Page 89: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/89.jpg)
89
Upper bounds in priority queue

Observation: for edges low in the priority queue, an upper bound on the priority is enough.

[Figure: the updates priority queue; the exact priority is needed only for the top element, while a priority upper bound is enough further down.]
![Page 90: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/90.jpg)
90
Priority upper bound for not-yet-seen edges

  priority(i→j) = residual(i→j) × importance weight(i→j)

Expand several edges with Dijkstra's; for each of those, (residual) × (weight) = (exact priority). For all the other edges:
• importance weight(i→j) ≤ importance weight(k→s) for any already-expanded edge (k→s), since Dijkstra's expands highest-weight edges first
• residual(i→j) ≤ || factor(i→j) ||

This gives a component-wise upper bound on the priority without looking at the edge!
![Page 91: Query-Specific Learning and Inference for Probabilistic Graphical Models](https://reader031.vdocument.in/reader031/viewer/2022013004/56813498550346895d9b893a/html5/thumbnails/91.jpg)
91
Interleaving BP and Dijkstra's

Alternate between the two: Dijkstra, BP, Dijkstra, Dijkstra, BP, BP, …
• If (best exact priority) > (best upper bound): do a BP update.
• If (best exact priority) < (best upper bound): let Dijkstra's expand one more edge.

[Figure: the query inside the full model.]
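The alternation above can be sketched as a loop that does a BP update whenever the best exact priority beats the bound on the unexpanded frontier, and otherwise lets Dijkstra's expand one more edge (a toy sketch with hypothetical names):

```python
import heapq

def interleave(exact_queue, frontier_bound, expand, update, steps):
    """Toy interleaving loop. `exact_queue` is a max-heap of
    (-priority, edge) entries with exact priorities;
    `frontier_bound()` is an upper bound on the priority of any
    edge Dijkstra's has not yet expanded; `expand()` returns the
    next (priority, edge) in Dijkstra order, or None when done."""
    trace = []
    for _ in range(steps):
        bound = frontier_bound()
        if exact_queue and -exact_queue[0][0] >= bound:
            # best known exact priority beats every unseen edge: BP update
            _, edge = heapq.heappop(exact_queue)
            update(edge)
            trace.append(('bp', edge))
        else:
            # an unseen edge might rank higher: expand one more edge
            nxt = expand()
            if nxt is None:
                break
            pr, edge = nxt
            heapq.heappush(exact_queue, (-pr, edge))
            trace.append(('dijkstra', edge))
    return trace
```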