School of Computer Science
Probabilistic Graphical Models
Posterior Regularization: An Integrative Paradigm for Learning GMs
Matt Gormley (Slides by Jun Zhu and Eric Xing)
April 13, 2016 Reading: Zhu, Chen, & Xing (2014)
1 © Eric Xing @ CMU, 2005-2014
Regularized Bayesian Inference for learning GMs: an overview. It combines max-margin learning (generalization, dual sparsity, efficient solvers, …), prior knowledge and bypassing model selection, data integration and scalable inference, and nonlinear transformations of rich forms of data.
Bayesian Inference
• A coherent framework for dealing with uncertainty
• Bayes' rule offers a mathematically rigorous computational mechanism for combining prior knowledge with incoming evidence
Thomas Bayes (1702–1761)
• M: a model from some hypothesis space
• x: observed data
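With M and x as defined above, Bayes' rule can be written explicitly (prior π(M), likelihood p(x | M), posterior p(M | x)):

```latex
p(M \mid x) \;=\; \frac{p(x \mid M)\,\pi(M)}{\int p(x \mid M)\,\pi(M)\,\mathrm{d}M}
```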
Parametric Bayesian Inference
• θ is represented as a finite set of parameters
• A parametric likelihood p(x | θ), a prior on θ, and the resulting posterior distribution
• Examples:
  • Gaussian prior + 2D Gaussian likelihood → Gaussian posterior
  • Dirichlet prior + 2D multinomial likelihood → Dirichlet posterior
  • Sparsity-inducing priors + some likelihood models → sparse Bayesian inference
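The first example can be checked numerically. Below is a minimal sketch (the function name `gaussian_posterior` is ours, not from the slides) of the conjugate update for the mean of a 1-D Gaussian with known variance:

```python
from statistics import mean

def gaussian_posterior(data, mu0, tau0_sq, sigma_sq):
    """Posterior over the mean of a Gaussian with known variance sigma_sq,
    under a conjugate Gaussian prior N(mu0, tau0_sq).

    Returns (posterior mean, posterior variance)."""
    n = len(data)
    precision = 1.0 / tau0_sq + n / sigma_sq            # posterior precision
    mu_n = (mu0 / tau0_sq + n * mean(data) / sigma_sq) / precision
    return mu_n, 1.0 / precision
```

With a standard normal prior, unit noise variance, and two observations at 1.0, the posterior is N(2/3, 1/3): the prior mean is pulled toward the data, and the variance shrinks.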
Nonparametric Bayesian Inference
• M is a richer model, e.g., with an infinite set of parameters
• A nonparametric likelihood p(x | M), a prior on M, and the resulting posterior distribution
• Examples: see next slide
Nonparametric Bayesian Inference
• Dirichlet process prior [Antoniak, 1974] (a prior over probability measures) + multinomial/Gaussian/softmax likelihood
• Indian buffet process prior [Griffiths & Ghahramani, 2005] (a prior over binary matrices) + Gaussian/sigmoid/softmax likelihood
• Gaussian process prior [Doob, 1944; Rasmussen & Williams, 2006] (a prior over functions) + Gaussian/sigmoid/softmax likelihood
Why Bayesian Nonparametrics?
• Let the data speak for themselves
• Bypass the model selection problem:
  • let the data determine model complexity (e.g., the number of components in mixture models)
  • allow model complexity to grow as more data are observed
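"Model complexity grows with data" can be seen by simulating the Chinese restaurant process, the clustering scheme induced by a Dirichlet process mixture (the helper `crp_tables` is our illustration, not from the slides):

```python
import random

def crp_tables(n, alpha, seed=0):
    """Simulate a Chinese restaurant process with concentration alpha
    over n customers; returns the list of table (cluster) sizes."""
    rng = random.Random(seed)
    tables = []
    for i in range(n):
        # Customer i joins table k with prob size_k / (i + alpha),
        # or starts a new table with prob alpha / (i + alpha).
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, size in enumerate(tables):
            acc += size
            if r < acc:
                tables[k] += 1
                break
        else:
            tables.append(1)
    return tables
```

Running it with increasing n (same seed) shows the number of occupied tables, i.e., the effective number of mixture components, growing with the amount of data.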
Can we further control the posterior distribution?
It is desirable to further regularize the posterior distribution:
• An extra degree of freedom when performing Bayesian inference
• Arguably a more direct way to control the behavior of models
• Can be easier and more natural in some examples
Can we further control the posterior distribution?
• Directly control the posterior distribution? Not obvious how …
• Hard constraints: a single feasible space
• Soft constraints: many feasible subspaces with different complexities/penalties
A reformulation of Bayesian inference
• Bayes' rule is equivalent to an optimization problem with a direct but trivial constraint on the posterior distribution [Zellner, Am. Stat. 1988]
• E.T. Jaynes (1988): "this fresh interpretation of Bayes' theorem could make the use of Bayesian methods more attractive and widespread, and stimulate new developments in the general theory of inference"
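The equivalence is Zellner's variational characterization: the Bayesian posterior is the solution of an optimization problem over distributions,

```latex
\min_{q(M)} \; \mathrm{KL}\big(q(M)\,\|\,\pi(M)\big) \;-\; \mathbb{E}_{q(M)}\big[\log p(x \mid M)\big]
\quad \text{s.t.} \quad q(M) \in \mathcal{P}_{\mathrm{prob}},
```

where P_prob is the trivial constraint that q is a valid probability distribution; the optimum is exactly q(M) = p(M | x).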
Regularized Bayesian Inference (Ganchev et al., 2010)
• The posterior is found by solving a constrained optimization problem
• Solving such a constrained optimization problem requires convex duality theory
• So, where do the constraints come from?
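In the notation of Zhu, Chen & Xing (2014), regularized Bayesian inference takes the following form (a hedged sketch; U denotes a convex penalty and P_post(ξ) the constrained subspace of posteriors, relaxed by slack variables ξ):

```latex
\min_{q(M),\,\xi} \; \mathrm{KL}\big(q(M)\,\|\,\pi(M)\big) \;-\; \mathbb{E}_{q(M)}\big[\log p(x \mid M)\big] \;+\; U(\xi)
\quad \text{s.t.} \quad q(M) \in \mathcal{P}_{\mathrm{post}}(\xi).
```

Setting U = 0 and taking P_post to be the set of all distributions recovers ordinary Bayesian inference from the previous slide.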
Recall our evolution of the max-margin learning paradigms:
• SVM → M3N
• SVM → MED
• MED + M3N → MED-MN = SMED + "Bayesian" M3N
Maximum Entropy Discrimination Markov Networks
• Structured MaxEnt discrimination (SMED)
• Feasible subspace of weight distributions
• Averaging over a distribution of M3Ns
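The SMED objective and its feasible subspace can be written out (a hedged reconstruction in the notation of Zhu & Xing's MaxEnDNet; ΔF_i and Δℓ_i denote the margin and loss differences for example i):

```latex
\min_{p(w),\,\xi} \; \mathrm{KL}\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
\quad \text{s.t.} \quad p(w) \in \mathcal{F}_1(\xi),
\qquad
\mathcal{F}_1(\xi) = \Big\{ p(w) : \mathbb{E}_{p}\big[\Delta F_i(y; w)\big] \ge \Delta\ell_i(y) - \xi_i,\;\; \forall i,\; \forall y \ne y^{(i)} \Big\},
```

so prediction averages over the posterior: argmax_y E_p[F(x, y; w)].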
Can we use this scheme to learn models other than MN?
Recall the three advantages of MEDN:
• An averaging model: PAC-Bayesian prediction error guarantee (Theorem 3)
• Entropy regularization: introducing useful biases
  • Standard normal prior => reduction to standard M3N (we've seen it)
  • Laplace prior => posterior shrinkage effects (sparse M3N)
• Integrating generative and discriminative principles (next class)
  • Incorporate latent variables and structures (PoMEN)
  • Semi-supervised learning (with partially labeled data)
Latent Hierarchical MaxEnDNet
• Web data extraction
• Goal: extract Name, Image, Price, Description, etc.
• Hierarchical labeling of the page tree: a web page contains data records; each record contains leaves such as Image, Name, Desc, Price, and Note, with internal labels such as {Head}, {Tail}, {Info Block}, and {Repeat Block}
• Advantages:
  • Computational efficiency
  • Long-range dependency
  • Joint extraction
Partially Observed MaxEnDNet (PoMEN) (Zhu et al., NIPS 2008)
• Now we are given partially labeled data
• PoMEN: learning
• Prediction
Alternating Minimization Algorithm
• Factorization assumption on the posterior distribution
• Alternating minimization:
  • Step 1: keep one factor fixed, optimize over the other
    • Normal prior: an M3N problem (QP)
    • Laplace prior: a Laplace M3N problem (VB)
  • Step 2: keep that factor fixed, optimize over the first; this is equivalently reduced to an LP with a polynomial number of constraints
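The alternating scheme can be illustrated on a simpler biconvex problem, rank-1 matrix factorization, where fixing one factor makes the other a closed-form least-squares solve (all names here are illustrative; this is not the PoMEN update itself):

```python
def rank1_als(M, iters=50):
    """Alternating minimization sketch: fit M ~= u v^T by fixing one factor
    and solving least squares for the other (mirrors Step 1 / Step 2)."""
    rows, cols = len(M), len(M[0])
    u = [1.0] * rows
    v = [1.0] * cols
    for _ in range(iters):
        # Step 1: fix v, optimize u (closed-form least squares per row)
        vv = sum(x * x for x in v)
        u = [sum(M[i][j] * v[j] for j in range(cols)) / vv for i in range(rows)]
        # Step 2: fix u, optimize v
        uu = sum(x * x for x in u)
        v = [sum(M[i][j] * u[i] for i in range(rows)) / uu for j in range(cols)]
    return u, v
```

Each step solves an easy subproblem exactly, and the objective decreases monotonically, exactly the structure exploited above.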
Experimental Results
• Web data extraction: Name, Image, Price, Description
• Methods: hierarchical CRFs, hierarchical M3N, PoMEN, partially observed HCRFs
• Data: pages from 37 templates
  • Training: 185 pages (5 per template), or 1585 data records
  • Testing: 370 pages (10 per template), or 3391 data records
• Record-level evaluation: leaf nodes are labeled
• Page-level evaluation:
  • Supervision level 1: leaf nodes and data-record nodes are labeled
  • Supervision level 2: level 1 + the nodes above data-record nodes
Record-Level Evaluations
• Overall performance:
  • Avg F1: average F1 over all attributes
  • Block instance accuracy: % of records whose Name, Image, and Price are all correct
• Attribute performance
Page-Level Evaluations
• Supervision level 1: leaf nodes and data-record nodes are labeled
• Supervision level 2: level 1 + the nodes above data-record nodes
Key message from PoMEN
• Structured MaxEnt discrimination (SMED)
• Feasible subspace of weight distributions
• Averaging over a distribution of PoMENs
• We can use this scheme for any p and p0!
An all-inclusive paradigm for learning general GMs: RegBayes (subsuming max-margin learning)
Predictive Latent Subspace Learning via a large-margin approach
… where M is any subspace model and p is a parametric Bayesian prior
Unsupervised Latent Subspace Discovery
• Finding latent subspace representations (an old topic)
• Mapping a high-dimensional representation into a latent low-dimensional representation, where each dimension can have some interpretable meaning, e.g., a semantic topic
• Examples:
  • Topic models (aka LDA) [Blei et al., 2003]
  • Total scene latent space models [Li et al., 2009]
  • Multi-view latent Markov models [Xing et al., 2005]
  • PCA, CCA, …
Predictive Subspace Learning with Supervision
• Unsupervised latent subspace representations are generic but can be sub-optimal for prediction
• Many datasets come with supervised side information
  • It can be noisy, but it is not random noise (Ames & Naaman, 2007)
  • Labels and rating scores are usually assigned based on some intrinsic property of the data
  • They help suppress noise and capture the most useful aspects of the data
• Goal: discover latent subspace representations that are both predictive and interpretable by exploiting weak supervision information
• Example sources: TripAdvisor hotel reviews (http://www.tripadvisor.com), LabelMe (http://labelme.csail.mit.edu/), Flickr (http://www.flickr.com/), and many others
I. LDA: Latent Dirichlet Allocation (Blei et al., 2003)
• Generative procedure, for each document d:
  • Sample a topic proportion
  • For each word: sample a topic, then sample a word from that topic
• Joint distribution: exact inference is intractable!
• Variational inference: minimize the variational bound to estimate parameters and infer the posterior distribution
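The generative procedure translates directly into a toy sampler (a sketch under assumed toy parameters; the name `generate_lda_corpus` is ours):

```python
import random

def generate_lda_corpus(n_docs, doc_len, alpha, beta, seed=0):
    """Sample documents from the LDA generative procedure.

    alpha: Dirichlet parameter (list of length K, the number of topics)
    beta:  K x V topic-word probability table (rows sum to 1)."""
    rng = random.Random(seed)
    K, V = len(beta), len(beta[0])
    corpus = []
    for _ in range(n_docs):
        # theta_d ~ Dirichlet(alpha), via normalized Gamma draws
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        theta = [x / sum(g) for x in g]
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(K), weights=theta)[0]    # topic for this word
            w = rng.choices(range(V), weights=beta[z])[0]  # word from topic z
            doc.append(w)
        corpus.append(doc)
    return corpus
```

Inference goes in the opposite direction: given only the word indices in `corpus`, recover the per-document topic proportions and the table `beta`, which is what makes the posterior intractable.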
Maximum Entropy Discrimination LDA (MedLDA) (Zhu et al., ICML 2009)
• Bayesian sLDA as the underlying likelihood model
• MED estimation balances predictive accuracy against model fitting
• MedLDA regression model
• MedLDA classification model
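For the classification model, the MED estimation problem takes the following form (a hedged reconstruction following Zhu et al., ICML 2009; L(q) is the sLDA variational bound, the "model fitting" term, and Δf_d(y) the expected feature difference for document d):

```latex
\min_{q,\,\xi} \; \mathcal{L}(q) \;+\; C \sum_{d} \xi_d
\quad \text{s.t.} \quad
\mathbb{E}_q\big[\eta^{\top} \Delta \mathbf{f}_d(y)\big] \;\ge\; \Delta\ell_d(y) - \xi_d,
\quad \xi_d \ge 0,
\quad \forall d,\; \forall y \ne y_d,
```

so the margin constraints supply the "predictive accuracy" term.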
Document Modeling
• Data set: 20 Newsgroups
• 110 topics + 2D embedding with t-SNE (van der Maaten & Hinton, 2008); MedLDA vs. LDA
Classification
• Data set: 20 Newsgroups
  • Binary classification: "alt.atheism" vs. "talk.religion.misc" (Simon et al., 2008)
  • Multiclass classification: all 20 categories
• Models: DiscLDA, sLDA (binary only: classification sLDA, Wang et al., 2009), LDA+SVM (baseline), MedLDA, MedLDA+SVM
• Measure: relative improvement ratio
Regression
• Data set: Movie Review (Blei & McAuliffe, 2007)
• Models: MedLDA (partial), MedLDA (full), sLDA, LDA+SVR
• Measure: predictive R² and per-word log-likelihood
Time Efficiency
• Binary classification
• Multiclass: MedLDA is comparable with LDA+SVM
• Regression: MedLDA is comparable with sLDA
II. Upstream Scene Understanding Models
• The "Total Scene Understanding" model (Li et al., CVPR 2009)
• Uses MLE to estimate model parameters
• Example: class "Polo", with regions labeled Athlete, Horse, Grass, Trees, Sky, Saddle
Scene Classification
• 8-category sports data set (Li & Fei-Fei, 2007):
  • 1574 images (50/50 split)
  • Pre-segment each image into regions
  • Region features: color, texture, and location; patches with SIFT features
  • Global features: Gist (Oliva & Torralba, 2001) and sparse SIFT codes (Yang et al., 2009)
• Baselines: Fei-Fei's theme model: 0.65 (different image representation); SVM: 0.673
MIT Indoor Scene
• 67-category MIT indoor scene data set (Quattoni & Torralba, 2009):
  • ~80 images per category for training; ~20 per category for testing
  • Same feature representation as above, plus Gist global features
• Classification results: ROI+Gist (annotation) used human-annotated interest regions
III. Supervised Multi-view RBMs
• A probabilistic method with an additional view of response variables Y1, …, YL
• Parameters can be learned with maximum likelihood estimation, e.g., the special supervised Harmonium (Yang et al., 2007)
• Contrastive divergence is the commonly used approximation method for learning undirected latent variable models (Welling et al., 2004; Salakhutdinov & Murray, 2008), since the normalization factor is intractable
Predictive Latent Representation
• t-SNE (van der Maaten & Hinton, 2008) 2D embedding of the discovered latent space representation on the TRECVID 2003 data (MMH vs. TWH)
• Avg-KL: average pairwise divergence
Predictive Latent Representation
• Example latent topics discovered by a 60-topic MMH on the Flickr animal data
Classification Results
• Data sets: TRECVID 2003 (text + image features, left) and Flickr 13 Animal (SIFT + image features, right)
• Models: baseline (SVM), DWH+SVM, GM-Mixture+SVM, GM-LDA+SVM, TWH, MedLDA (SIFT only), MMH
Retrieval Results
• Data set: TRECVID 2003
  • Each test sample is treated as a query; training samples are ranked by cosine similarity to the query
  • Similarity is computed on the discovered latent topic representations
• Models: DWH, GM-Mixture, GM-LDA, TWH, MMH
• Measures: average precision on different topics (left) and precision-recall curves (right)
Infinite SVM and Infinite Latent SVM: where SVMs meet NB for classification and feature selection
… where M is any combination of classifiers and p is a nonparametric Bayesian prior
Mixture of SVMs
• Dirichlet process mixture of large-margin kernel machines
• Learns flexible nonlinear local classifiers; potentially gives better control of model complexity, e.g., few unnecessary components
• The first attempt to integrate Bayesian nonparametrics, large-margin learning, and kernel methods
• Figure: SVM with an RBF kernel vs. a mixture of 2 linear SVMs vs. a mixture of 2 RBF SVMs
Infinite SVM
• RegBayes framework:
  • Model: latent class model
  • Prior: Dirichlet process
  • Likelihood: Gaussian
  • Posterior constraints: max-margin constraints (direct and rich constraints on the posterior distribution, with a convex penalty function)
Infinite SVM
• DP mixture of large-margin classifiers
• Given a component classifier, define the overall discriminant function, the prediction rule, and the learning problem
• Graphical model with the stick-breaking construction of the DP; the mixture assignment determines which classifier to use
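The component and overall discriminants can be sketched as follows (a hedged reconstruction; φ(x, y) denotes a feature map and η_z the weights of component z, our notation): each component scores with η_z^⊤φ(x, y), and the overall discriminant averages over the posterior q,

```latex
F(x, y) \;=\; \mathbb{E}_{q(z,\boldsymbol{\eta})}\big[\eta_z^{\top} \phi(x, y)\big],
\qquad
\hat{y} \;=\; \operatorname*{argmax}_{y} \; F(x, y).
```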
Infinite SVM
• Assumption and relaxation:
  • Truncated variational distribution
  • Upper bound on the KL regularizer
• Optimization with coordinate descent:
  • For one block of variables, we solve an SVM learning problem
  • For another block, we get a closed-form update rule; the last term regularizes the mixing proportions to favor prediction
  • For the remaining block, the same update rules as in (Blei & Jordan, 2006)
• Graphical model with the stick-breaking construction of the DP
Experiments on High-Dimensional Real Data
• Classification results and test time
• Clusters: images with similar backgrounds group together, and each cluster covers fewer categories
• For training, linear iSVM is very efficient (~200 s); RBF iSVM is much slower, but can be significantly improved using efficient kernel methods (Rahimi & Recht, 2007; Fine & Scheinberg, 2001)
Learning Latent Features
• Infinite SVM is a Bayesian nonparametric latent class model:
  • discovers clustering structures
  • each data point is assigned to a single cluster/class
• Infinite latent SVM is a Bayesian nonparametric latent feature/factor model:
  • discovers latent factors
  • each data point is mapped to a (possibly infinite) set of latent factors
• Latent factor analysis is a key technique in many fields; popular models include FA, PCA, ICA, NMF, LSI, etc.
Infinite Latent SVM
• RegBayes framework:
  • Model: latent feature model
  • Prior: Indian buffet process
  • Likelihood: Gaussian
  • Posterior constraints: max-margin constraints (direct and rich constraints on the posterior distribution, with a convex penalty function)
Beta-Bernoulli Latent Feature Model
• A random finite binary latent feature model
• The feature probabilities give the relative probability of each feature being on
• The binary feature vectors give the latent structure that is used to generate the data
Indian Buffet Process
• A stochastic process over infinite binary feature matrices
• Generative procedure:
  • Customer 1 chooses a Poisson(α) number of initial dishes
  • Customer i chooses:
    • each existing dish k with probability m_k / i, where m_k is the number of previous customers who chose dish k
    • a Poisson(α / i) number of additional dishes
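The buffet metaphor translates directly into a small simulator (a hedged sketch; `indian_buffet` and `_poisson` are our names, and the Poisson draws use Knuth's method to stay dependency-free):

```python
import math
import random

def _poisson(rng, lam):
    """Knuth's Poisson sampler: multiply uniforms until below exp(-lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def indian_buffet(n, alpha, seed=0):
    """Sample which dishes (features) each of n customers takes, following
    the IBP generative procedure; returns (rows, total number of dishes)."""
    rng = random.Random(seed)
    counts = []                # counts[k]: customers who took dish k so far
    rows = []
    for i in range(1, n + 1):
        row = set()
        for k in range(len(counts)):
            if rng.random() < counts[k] / i:   # existing dish k, prob m_k / i
                row.add(k)
                counts[k] += 1
        for _ in range(_poisson(rng, alpha / i)):  # Poisson(alpha/i) new dishes
            counts.append(1)
            row.add(len(counts) - 1)
        rows.append(row)
    return rows, len(counts)
```

Unlike the CRP, each customer (data point) can take many dishes at once, which is exactly the latent feature (rather than latent class) behavior the previous slide contrasts.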
Posterior Constraints: Classification
• Supposing the latent features z are given, we define a latent discriminant function
• Define an effective discriminant function by taking an expectation (to reduce uncertainty)
• Posterior constraints follow the max-margin principle
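In hedged notation (η a weight vector, z the binary latent features, y_d in {+1, −1} the label of document d, our symbols), the three steps read:

```latex
f(x; z, \eta) = \eta^{\top} z,
\qquad
f(x; q) = \mathbb{E}_{q(z,\eta)}\big[\eta^{\top} z\big],
\qquad
y_d\, f(x_d; q) \;\ge\; 1 - \xi_d \quad \forall d.
```

Taking the expectation inside the constraint is the linear expectation operator noted in the summary slide.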
Experimental Results
• Classification: accuracy and F1 scores on the TRECVID 2003 and Flickr image datasets
Summary
• Large-margin learning
• Linear expectation operator (resolves uncertainty)
Summary
• A general framework, MaxEnDNet, for learning structured input/output models
  • Subsumes the standard M3Ns
  • Model averaging: PAC-Bayes theoretical error bound
  • Entropic regularization: sparse M3Ns
  • Generative + discriminative: latent variables, semi-supervised learning on partially labeled data, fast inference
• PoMEN
  • Provides an elegant approach to incorporating latent variables and structures under the max-margin framework
  • Enables learning arbitrary graphical models discriminatively
• Predictive latent subspace learning
  • MedLDA for text topic learning
  • MED total scene model for image understanding
  • MED latent MNs for multi-view inference
• Bayesian nonparametrics meets max-margin learning
• Experimental results show the advantages of max-margin learning over likelihood-based methods in every case
Remember: Elements of Learning
• Here are some important elements to consider before you start:
  • Task: embedding? Classification? Clustering? Topic extraction? …
  • Data and other info: input and output (e.g., continuous, binary, counts, …); supervised, unsupervised, or a blend of everything? Prior knowledge? Bias?
  • Models and paradigms: BN? MRF? Regression? SVM? Bayesian/frequentist? Parametric/nonparametric?
  • Objective/loss function: MLE? MCLE? Max margin? Log loss, hinge loss, squared loss? …
  • Tractability and exactness trade-offs: exact inference? MCMC? Variational? Gradient? Greedy search? Online? Batch? Distributed?
  • Evaluation: visualization? Human interpretability? Perplexity? Predictive accuracy?