A Tutorial on Bayesian Nonparametrics
Fatima Al-Raisi
Carnegie Mellon University
October 25, 2016
Fatima Al-Raisi (Carnegie Mellon University) A Tutorial on Bayesian Nonparametrics October 25, 2016 1 / 45
1 Introduction
2 Bayesian Non-Parametrics Motivation
   Intuitions and Assumptions
   Theoretical Motivation
   Practical Motivation
3 Dirichlet Process
4 Chinese Restaurant Process
   Pitman-Yor Process
5 Discussion and Concluding Remarks
6 List of Tutorials
Development of Interest in Topic Over Time
An interesting “interest over time” pattern!
Interest Over Time: Deep Learning
Interest Over Time: Reinforcement Learning
Interest Over Time: Nonparametric Statistics!
Interest Over Time: Bayesian Inference!
Terminology: What does “Bayesian Nonparametrics” mean?
Bayesian inference: data and parameters, priors and posteriors
P(parameters|data) ∝ P(parameters)P(data|parameters)
Bayesian inference vs. Bayes rule (Bayesian inference does not merely mean using Bayes rule!)
Non-parametric? (misnomer): large/unbounded number of parameters, growing number of parameters, infinite parameter space
“the number of parameters grow with the amount of training data”
No (strong) assumption about underlying distribution of the data
Terminology note: non-parametric vs. nonparametric
Terminology: Formal Definition
A statistical model is a collection of distributions {Pθ : θ ∈ Θ} indexed by a parameter θ
Parametric Model: the indexing parameter is a finite-dimensional vector: Θ ⊂ R^k
Nonparametric Model: Θ ⊂ F for some possibly infinite-dimensional space F
Semiparametric Model: the parameter has both a finite-dimensional component and an infinite-dimensional component: Θ ⊂ R^k × F, where F is an infinite-dimensional space
Review: Probabilistic Modeling
Data: x1, x2, . . . , xn
Latent variables: z1, z2, . . . , zn
Parameter: θ
A probabilistic model is a parametrized joint distribution over variables: P(x1, x2, . . . , xn, z1, z2, . . . , zn|θ)
Typically interpreted as a generative model of data
Inference of latent variables given observed data:
\[ P(z_1, \dots, z_n \mid x_1, \dots, x_n, \theta) = \frac{P(x_1, \dots, x_n, z_1, \dots, z_n \mid \theta)}{P(x_1, \dots, x_n \mid \theta)} \]
Review: Probabilistic Modeling
Learning (e.g., by maximum likelihood):
\[ \hat{\theta} = \arg\max_{\theta} P(x_1, x_2, \dots, x_n \mid \theta) \]
Prediction: P(xn+1|x1, x2, . . . , xn, θ)
Classification:
\[ \arg\max_{c} P(x_{n+1} \mid \theta_c) \]
Standard algorithms: EM, VI, MCMC, etc.
Review: Bayesian Modeling
Prior distribution: P(θ)
Posterior distribution:
\[ P(z_1, \dots, z_n, \theta \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n, z_1, \dots, z_n \mid \theta)\, P(\theta)}{P(x_1, \dots, x_n)} \]
The above is doing both inference and learning
Clustering: Parametric Approach
Think of data as generated from a number of sources
Model each cluster using a parametric model
A data item i is drawn as follows:
  z_i | π ∼ Discrete(π)
  x_i | z_i, θ*_k ∼ F(θ*_{z_i}), where F is a parametric model (e.g., Gaussian with parameter vector θ = (µ, σ))
Mixing proportions: π = (π_1, . . . , π_k) | α ∼ Dirichlet(α/k, . . . , α/k)
More on the Dirichlet distribution later
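The generative process on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say how the component parameters θ*_k are chosen, so the sketch draws hypothetical means from a wide Gaussian and fixes σ = 1; the function names and default values are also made up for the example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def sample_finite_mixture(n, k, alpha, rng):
    """Generate n points from a k-component 1-D Gaussian mixture."""
    # Mixing proportions: pi | alpha ~ Dirichlet(alpha/k, ..., alpha/k)
    pi = sample_dirichlet([alpha / k] * k, rng)
    # Hypothetical component parameters theta*_j = (mu_j, sigma_j);
    # the slides do not specify how these are chosen.
    thetas = [(rng.gauss(0.0, 5.0), 1.0) for _ in range(k)]
    z, x = [], []
    for _ in range(n):
        zi = rng.choices(range(k), weights=pi)[0]   # z_i | pi ~ Discrete(pi)
        mu, sigma = thetas[zi]
        z.append(zi)
        x.append(rng.gauss(mu, sigma))              # x_i | z_i ~ F(theta*_{z_i})
    return pi, thetas, z, x

rng = random.Random(0)
pi, thetas, z, x = sample_finite_mixture(n=200, k=3, alpha=1.0, rng=rng)
```

Note that k must be supplied up front, which is exactly the difficulty the following slides raise.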
Motivation
Question: What is the number of sources?
Is it 5?
Or maybe 3?
In practice an ad-hoc approach is followed to decide k. For example,
guess the number of clusters, then run EM for a Gaussian Mixture Model, look at results and goodness of fit, and then if needed try again with a different k
or run hierarchical agglomerative clustering, and cut the tree at a “reasonable looking” level
But we want a principled approach for discovering k. After all, it is an essential part of the problem to be solved!
Motivation: Intuitive and Theoretical Motivation
Natural Phenomena:
Topics:
  - (Wikipedia) dynamic traversal
  - Clustering
Species discovery
Annotation and labeling
Knowledge-base entity types
. . .
For any fixed k, as we see more data, there is a positive probability that we will encounter a data point that does not fit in the current scheme; i.e.,
k grows with data
Motivation: Theoretical Motivation: De Finetti’s Theorem
Infinite Exchangeability
A data sequence is infinitely exchangeable if the distribution of any N data points does not change under any permutation σ:
\[ p(X_1, \dots, X_N) = p(X_{\sigma(1)}, \dots, X_{\sigma(N)}) \]
Theorem (De Finetti’s Theorem)
A sequence X_1, X_2, . . . is infinitely exchangeable if and only if, for all N and some distribution P:
\[ p(X_1, \dots, X_N) = \int_{\theta} \prod_{n=1}^{N} p(X_n \mid \theta)\, P(d\theta) \]
Motivation: Theoretical Motivation
De Finetti’s Theorem. General proof: Hewitt, Savage 1955; Aldous 1983
The theorem motivates:
Parameters
Likelihood
Priors
Non-parametric Bayesian priors
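A small numerical check can make the theorem concrete. The sketch below assumes a Beta(a, b) mixing distribution over a Bernoulli parameter θ (a choice made here for illustration, not one taken from the slides). It computes the probability of a binary sequence in two ways: one predictive step at a time, and through the integral representation, which has the closed form B(a + s, b + N − s) / B(a, b) for a sequence with s ones. Every reordering of the sequence should receive the same probability.

```python
import math
from itertools import permutations

def sequential_prob(xs, a=1.0, b=1.0):
    """p(x_1..x_N) for a Beta(a, b)-Bernoulli model, built one
    predictive step at a time (a Polya-urn scheme)."""
    ones = zeros = 0
    prob = 1.0
    for x in xs:
        p_one = (a + ones) / (a + b + ones + zeros)
        prob *= p_one if x == 1 else 1.0 - p_one
        ones += x
        zeros += 1 - x
    return prob

def log_beta(p, q):
    """Logarithm of the Beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

def mixture_prob(xs, a=1.0, b=1.0):
    """Same quantity via the integral over theta:
    int theta^s (1 - theta)^(N - s) Beta(theta; a, b) dtheta."""
    s, n = sum(xs), len(xs)
    return math.exp(log_beta(a + s, b + n - s) - log_beta(a, b))

xs = (1, 1, 0, 1, 0)
target = mixture_prob(xs)
# Largest deviation over all orderings of the same sequence.
max_gap = max(abs(sequential_prob(p) - target) for p in permutations(xs))
```

Here max_gap stays at floating-point rounding level, since the probability depends on the data only through the count of ones.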
Motivation: Theoretical Motivation
What happens under the parametric regime? Let’s take the example of regression.
When fitting/optimizing, we’re finding the best fit within the chosen (parametric) family of functions; i.e., we’re optimizing to get the closest approximation to the true target function.
But this may not be good enough: if the true function lies outside the chosen family, even the best fit within the family carries an approximation error that more data cannot remove.
Motivation: Theoretical Motivation: Non-parametric Bayesian Approach
Motivation: Practical Problem-solving Motivation
Human intuitions about high-dimensional problems are often misleading!
Example: a recent result from Random Matrix Theory: proving the proliferation of saddle points in comparison to local minima in high-dimensional problems [Dauphin et al. 2015]
Assumptions often made when attempting to solve different problems with data are naturally part of the problem to be solved, e.g.,
number of clusters in clustering
“type” or class of function in regression
number of factors in factor analysis
. . .
The Bayesian non-parametric approach:
no unreasonable assumptions about the data (e.g., that the true model for a complex phenomenon is governed by a small number of parameters)
model that can adapt its complexity to the data
Let the data determine model complexity
naturally no separate model-selection step → less risk of underfitting or overfitting → no explicit regularization required
Motivation: Practical Problem-solving Motivation
Learning structures
Bayesian prior over combinatorial structures
Lack of intuitive parametric prior over these complex structures
Nonparametric priors sometimes end up simpler than parametric priors
Motivation: Practical Problem-solving Motivation: Structure Learning
Motivation: Desirable Properties of Non-parametric Models
Exchangeability
Naturally captures power laws
Flexible ways of building complex models (e.g., hierarchical models)
When conjugate priors are used, problems often become computationally tractable
Dirichlet Process
Fundamental concept in Bayesian nonparametrics
Formally defined by [Ferguson 1973] as a distribution over measures
Can be derived in different ways, and as special cases of different processes:
  - Infinite limit of a Gibbs sampler for finite mixture models
  - Chinese restaurant process
  - Stick-breaking construction
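Of the three routes listed, the stick-breaking construction is the easiest to simulate. The sketch below assumes the usual formulation, which the slides do not spell out: β_k ∼ Beta(1, α) and π_k = β_k ∏_{j<k} (1 − β_j). It is truncated at a fixed number of sticks, which is an approximation chosen for illustration; in a full Dirichlet process draw each weight would also be paired with an atom drawn independently from the base distribution.

```python
import random

def stick_breaking_weights(alpha, num_sticks, rng):
    """First num_sticks weights of the stick-breaking construction.

    beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j<k} (1 - beta_j).
    Also returns the length of the stick that is still unbroken.
    """
    weights = []
    remaining = 1.0                      # length of stick not yet broken off
    for _ in range(num_sticks):
        beta_k = rng.betavariate(1.0, alpha)
        weights.append(beta_k * remaining)
        remaining *= 1.0 - beta_k
    return weights, remaining

rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=50, rng=rng)
```

The weights and the leftover piece always add up to one; a smaller α concentrates the mass on the first few sticks, while a larger α spreads it over many.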
Chinese Restaurant Process
A partition ϱ of a set S is:
A disjoint family of non-empty subsets of S whose union is S.
Denote the set of all partitions of S as P_S
Random partitions are random variables taking values in P_S
We will consider partitions of S
Chinese Restaurant Process
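Each customer comes into the restaurant and sits at a table. With n customers already seated and concentration parameter α, the standard CRP rule is:
\[ P(\text{sit at table } c) = \frac{n_c}{\alpha + n}, \qquad P(\text{sit at a new table}) = \frac{\alpha}{\alpha + n} \]
where n_c is the number of customers at table c
Customers correspond to elements of S, and tables to clusters in ϱ
Rich-gets-richer: large clusters more likely to attract more items
Multiplying conditional probabilities together, the overall probability of ϱ, called the exchangeable partition probability function (EPPF), is (with n = |S|):
\[ P(\varrho \mid \alpha) = \frac{\alpha^{|\varrho|}\, \Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{c \in \varrho} \Gamma(|c|) \]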
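A minimal sketch of the seating process, assuming the standard CRP rule with concentration parameter α: a customer joins an occupied table with probability proportional to its current size and opens a new table with probability proportional to α. The second function evaluates the log-EPPF in its Gamma-function form, α^{|ϱ|} Γ(α) / Γ(α + n) ∏_c Γ(|c|). Function names are illustrative.

```python
import math
import random

def sample_crp(n, alpha, rng):
    """Seat n customers one at a time; return the list of table sizes."""
    sizes = []
    for _ in range(n):
        # Occupied table c has weight n_c, a new table has weight alpha;
        # rng.choices normalizes the weights internally.
        weights = sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)               # open a new table
        else:
            sizes[choice] += 1            # rich-gets-richer
    return sizes

def log_eppf(sizes, alpha):
    """log P(partition) for a partition with the given block sizes."""
    n = sum(sizes)
    return (len(sizes) * math.log(alpha)
            + math.lgamma(alpha) - math.lgamma(alpha + n)
            + sum(math.lgamma(m) for m in sizes))

rng = random.Random(0)
sizes = sample_crp(n=100, alpha=1.0, rng=rng)
```

Because the EPPF depends only on the block sizes and not on the order in which customers arrived, the resulting distribution over partitions is exchangeable.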
Chinese Restaurant Process: Number of clusters
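How the number of clusters grows can be worked out directly, assuming the standard CRP rule in which customer i + 1 opens a new table with probability α / (α + i). The expected number of occupied tables after n customers is then the sum of these probabilities, which grows roughly like α log n. The sketch below compares the exact sum with the approximation α log(1 + n/α); the function name is illustrative.

```python
import math

def expected_num_tables(n, alpha):
    """E[K_n] = sum_{i=0}^{n-1} alpha / (alpha + i) under CRP(alpha)."""
    return sum(alpha / (alpha + i) for i in range(n))

alpha = 1.0
for n in (10, 100, 1000, 10000):
    exact = expected_num_tables(n, alpha)
    approx = alpha * math.log(1.0 + n / alpha)   # logarithmic growth in n
    print(n, round(exact, 2), round(approx, 2))
```

The slow logarithmic growth is what lets the number of clusters increase with the data without being fixed in advance.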
Nonparametric approach to clustering
Partitions are natural latent objects in clustering
Given a dataset S , partition it into clusters of similar items
Cluster c ∈ ϱ described by a model F(θ*_c) parameterized by θ*_c
Bayesian approach: introduce prior over ϱ and θ*_c
Compute posterior over both
CRP mixture model: Use CRP prior over ϱ, and an iid prior H over cluster parameters
Computation becomes efficient when H is the conjugate prior for F
  - One of the reasons why Gaussians are popular in modeling is their nice mathematical properties including conjugacy
Nonparametric approach to clustering
Generative model of data:
  ϱ ∼ CRP(α)
  θ*_c | ϱ ∼ H for each cluster c ∈ ϱ
  x_i | θ*, ϱ ∼ F(θ*_c) for the cluster c containing i
The CRP prior is a prior over partitions of the data; the number of clusters is unknown a priori and is determined as part of inference
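The generative model can be simulated directly. The sketch below makes illustrative choices that are not in the slides: H is taken to be Normal(0, prior_sd²) and F(θ) to be Normal(θ, noise_sd²), a conjugate pair in one dimension. It interleaves the CRP seating step with drawing θ*_c and x_i, which gives the same joint distribution as drawing the whole partition first.

```python
import random

def sample_crp_mixture(n, alpha, rng, prior_sd=5.0, noise_sd=1.0):
    """Draw n points from a CRP mixture of 1-D Gaussians.

    H = Normal(0, prior_sd^2) and F(theta) = Normal(theta, noise_sd^2)
    are assumptions made for this example.
    """
    sizes, thetas = [], []      # per-cluster counts and parameters theta*_c
    labels, data = [], []
    for _ in range(n):
        # CRP step: join cluster c with weight n_c, or a new one with weight alpha.
        c = rng.choices(range(len(sizes) + 1), weights=sizes + [alpha])[0]
        if c == len(sizes):
            sizes.append(0)
            thetas.append(rng.gauss(0.0, prior_sd))    # theta*_c ~ H
        sizes[c] += 1
        labels.append(c)
        data.append(rng.gauss(thetas[c], noise_sd))     # x_i ~ F(theta*_c)
    return labels, thetas, data

rng = random.Random(1)
labels, thetas, data = sample_crp_mixture(n=200, alpha=1.0, rng=rng)
num_clusters = len(thetas)
```

No number of clusters is passed in; num_clusters is an outcome of the draw, which is the point of the CRP prior.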
Nonparametric approach to clustering
Consider a finite mixture model with K sources. How can we describe the partition of data into clusters?
What is the distribution over partitions ϱ?
\[ P(\varrho \mid \alpha) = [K]^{|\varrho|}_{-1} \, \frac{\prod_{c \in \varrho} [\alpha/K]^{|c|}_{1}}{[\alpha]^{n}_{1}} \]
where [x]^a_b = x(x + b) . . . (x + (a − 1)b)
Taking the limit K → ∞, we obtain a distribution over partitions without a limit on the number of sources (K disappears in the limit):
\[ P(\varrho \mid \alpha) = \frac{\alpha^{|\varrho|} \prod_{c \in \varrho} [1]^{|c|-1}_{1}}{[\alpha]^{n}_{1}} \]
Note where the exchangeability comes from!
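The limit can be checked numerically. The sketch below assumes the finite mixture uses symmetric Dirichlet(α/K, . . . , α/K) mixing proportions, as on the parametric clustering slide, so that a partition ϱ of n items has probability [K]^{|ϱ|}_{-1} ∏_c [α/K]^{|c|}_1 / [α]^n_1, and compares it with the limiting form α^{|ϱ|} ∏_c (|c| − 1)! / [α]^n_1. Function names and the example block sizes are illustrative.

```python
import math

def rising(x, a, b=1.0):
    """[x]^a_b = x (x + b) ... (x + (a - 1) b); equals 1 when a == 0."""
    out = 1.0
    for j in range(a):
        out *= x + j * b
    return out

def finite_partition_prob(sizes, alpha, K):
    """P(partition) under a K-component mixture with Dirichlet(alpha/K, ...) weights."""
    n = sum(sizes)
    labelings = rising(K, len(sizes), -1.0)        # K (K-1) ... (K-|rho|+1)
    blocks = math.prod(rising(alpha / K, m) for m in sizes)
    return labelings * blocks / rising(alpha, n)

def crp_partition_prob(sizes, alpha):
    """Limit K -> infinity: alpha^|rho| * prod_c (|c|-1)! / [alpha]^n_1."""
    n = sum(sizes)
    blocks = math.prod(rising(1.0, m - 1) for m in sizes)
    return alpha ** len(sizes) * blocks / rising(alpha, n)

sizes, alpha = [3, 2, 1], 1.5
limit = crp_partition_prob(sizes, alpha)
gaps = [abs(finite_partition_prob(sizes, alpha, K) - limit)
        for K in (10, 100, 1000, 10000)]
```

The gaps shrink as K grows. Both expressions depend on the partition only through its block sizes, which is where the exchangeability comes from.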
Pitman-Yor Process
The Pitman-Yor Process is a generalization of the Dirichlet Process
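Recall the CRP probabilities, with n customers already seated and n_c customers at table c:
\[ P(\text{sit at table } c) = \frac{n_c}{\alpha + n}, \qquad P(\text{sit at a new table}) = \frac{\alpha}{\alpha + n} \]
Here the difference is a discount parameter d (0 ≤ d < 1), with |ϱ| the current number of tables:
\[ P(\text{sit at table } c) = \frac{n_c - d}{\alpha + n}, \qquad P(\text{sit at a new table}) = \frac{\alpha + d\,|\varrho|}{\alpha + n} \]
The effect is that as d increases, the model tends to create more tables, and more tables with fewer customers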
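The role of the discount can be seen in a short simulation. The sketch assumes the standard two-parameter seating rule: an occupied table c is chosen with probability (n_c − d) / (α + n) and a new table with probability (α + d·T) / (α + n), where T is the current number of tables, 0 ≤ d < 1, and α > 0 is assumed here for simplicity. Setting d = 0 recovers the CRP. Function names are illustrative.

```python
import random

def py_seating_probs(sizes, alpha, d):
    """Seating probabilities for the next customer under Pitman-Yor(alpha, d).

    Returns one probability per occupied table, followed by the
    probability of opening a new table.
    """
    n, num_tables = sum(sizes), len(sizes)
    probs = [(m - d) / (alpha + n) for m in sizes]
    probs.append((alpha + d * num_tables) / (alpha + n))
    return probs

def sample_pitman_yor(n, alpha, d, rng):
    """Seat n customers and return the table sizes."""
    sizes = []
    for _ in range(n):
        probs = py_seating_probs(sizes, alpha, d)
        c = rng.choices(range(len(probs)), weights=probs)[0]
        if c == len(sizes):
            sizes.append(1)          # new table
        else:
            sizes[c] += 1
    return sizes

rng = random.Random(0)
tables_crp = len(sample_pitman_yor(2000, alpha=1.0, d=0.0, rng=rng))
tables_py = len(sample_pitman_yor(2000, alpha=1.0, d=0.5, rng=rng))
```

With d > 0 the number of tables typically grows polynomially in n rather than logarithmically, so tables_py is usually much larger than tables_crp.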
Pitman-Yor Process
The Pitman-Yor Process is better at capturing natural power-law phenomena, especially around the tails and peak of the distribution.
Example: English word frequencies and ranks [Wood et al. 2011]
Discussion
Note on the Statistical Properties of Nonparametric Models:
  - Consistency
  - Efficiency (i.e., statistical efficiency)
  - Coverage (Bayesian analogue of Confidence Intervals)
Computationally expensive (also related to decoupling of models and algorithms)
How to compare against the parametric counterpart:
  - Accuracy alone is not a good metric for comparison. It is a function of the model and a specific dataset
  - Asymptotic performance as the amount of data increases is better for comparison
Nonparametric models are extremely popular in settings where the data follows a power law
Should be considered when we suspect a continuous increase in possible configurations as we see more data
Should not be used when we know that the distribution of the data is likely to follow a parametric form or is generated using a finite number of sources (no coverage guarantees in this case)
Tutorials:
A webpage with a list of tutorials on Bayesian Nonparametrics: http://stat.columbia.edu/~porbanz/npb-tutorial.html
A tutorial on Bayesian nonparametric models by Gershman and Blei. http://gershmanlab.webfactional.com/pubs/GershmanBlei12.pdf
Dirichlet Process. Yee Whye Teh. http://www.stats.ox.ac.uk/~teh/research/npbayes/Teh2010a.pdf
A Tutorial on Gaussian Processes. Mark Ebden. http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf
Video tutorials:
Bayesian Nonparametrics - Yee Whye Teh - MLSS 2013 Tübingen (Max Planck Institute for Intelligent Systems Tübingen)
Bayesian Nonparametrics - Tamara Broderick - MLSS 2015 Tübingen
Bayesian Nonparametrics Lectures - Larry Wasserman
Tutorials and Further References: Courses on Bayesian Nonparametrics
Nonparametric modeling. UIC. http://georgek.people.uic.edu/Nonparametric.htm
Bayesian Nonparametric Statistics. UNITO.it. http://www.master-sds.unito.it/do/corsi.pl/Show?_id=meln
Bayesian Nonparametrics - Foundations and Applications. FSU. http://stat.fsu.edu/~sethu/st718outline.pdf
A Course in Bayesian Statistics. Stanford University. http://statweb.stanford.edu/~sabatti/Stat370/
Tutorials and Further References: Bayesian Nonparametrics Textbooks