
A Tutorial on Bayesian Nonparametrics

Fatima Al-Raisi

Carnegie Mellon University

[email protected]

October 25, 2016

Fatima Al-Raisi (Carnegie Mellon University) A Tutorial on Bayesian Nonparametrics October 25, 2016 1 / 45

1 Introduction

2 Bayesian Nonparametrics: Motivation (Intuitions and Assumptions; Theoretical Motivation; Practical Motivation)

3 Dirichlet Process

4 Chinese Restaurant Process; Pitman-Yor Process

5 Discussion and Concluding Remarks

6 List of Tutorials


Development of Interest in Topic Over Time

An interesting “interest over time” pattern!


Interest Over Time: Deep Learning


Interest Over Time: Reinforcement Learning


Interest Over Time: Nonparametric Statistics!


Interest Over Time: Bayesian Inference!


Terminology: What does “Bayesian Nonparametrics” mean?

Bayesian inference: data and parameters, priors and posteriors

P(parameters|data) ∝ P(parameters)P(data|parameters)

Bayesian inference vs. Bayes rule (Bayesian inference does not mean using Bayes rule!)

Non-parametric? (a misnomer): a large/unbounded number of parameters, a growing number of parameters, an infinite parameter space

“the number of parameters grows with the amount of training data”

No (strong) assumption about underlying distribution of the data

Terminology note: “non-parametric” vs. “nonparametric”
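As a concrete sketch of the posterior ∝ prior × likelihood update above, a conjugate Beta-Bernoulli model can be worked out in a few lines. This example is illustrative, not from the slides; the Beta(1, 1) prior and the coin-flip data are assumptions for the demo.

```python
# A minimal sketch of Bayesian inference with a conjugate Beta-Bernoulli model.
# A Beta(a, b) prior on a coin's bias is updated by counts of heads and tails;
# conjugacy makes the posterior another Beta: Beta(a + heads, b + tails).

def beta_bernoulli_posterior(a, b, data):
    """Return posterior parameters (a', b') after observing 0/1 outcomes."""
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Uniform prior Beta(1, 1); observe 7 heads in 10 flips.
a_post, b_post = beta_bernoulli_posterior(1, 1, [1, 1, 1, 0, 1, 1, 0, 1, 0, 1])
print(a_post, b_post)             # (8, 4)
print(beta_mean(a_post, b_post))  # posterior mean 8/12
```

Note how no explicit use of Bayes rule is needed: conjugacy reduces the update to bookkeeping on counts, which is the computational appeal mentioned later in the tutorial.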


Terminology: Formal Definition

A statistical model is a collection of distributions {P_θ : θ ∈ Θ} indexed by a parameter θ

Parametric model: the indexing parameter is a finite-dimensional vector: Θ ⊂ R^k

Nonparametric model: Θ ⊂ F for some possibly infinite-dimensional space F

Semiparametric model: the parameter has both a finite-dimensional component and an infinite-dimensional component: Θ ⊂ R^k × F, where F is an infinite-dimensional space


Review: Probabilistic Modeling

Data: x1, x2, . . . , xn

Latent variables: z1, z2, . . . , zn

Parameter: θ

A probabilistic model is a parametrized joint distribution over variables: P(x1, x2, . . . , xn, z1, z2, . . . , zn | θ)

Typically interpreted as a generative model of data

Inference of latent variables given observed data:

P(z1, z2, . . . , zn | x1, x2, . . . , xn, θ) = P(x1, x2, . . . , xn, z1, z2, . . . , zn | θ) / P(x1, x2, . . . , xn | θ)


Review: Probabilistic Modeling

Learning (e.g., by maximum likelihood): θ̂ = argmax_θ P(x1, x2, . . . , xn | θ)

Prediction: P(xn+1|x1, x2, . . . , xn, θ)

Classification: argmax_c P(xn+1 | θ_c)

Standard algorithms: EM, VI, MCMC, etc.


Review: Bayesian Modeling

Prior distribution: P(θ)

Posterior distribution:

P(z1, . . . , zn, θ | x1, . . . , xn) = P(x1, . . . , xn, z1, . . . , zn | θ) P(θ) / P(x1, . . . , xn)

The above performs both inference and learning


Clustering: Parametric Approach

Think of the data as generated from a number of sources

Model each cluster using a parametric model

A data item i is drawn as follows:
z_i | π ∼ Discrete(π)
x_i | z_i, θ*_{z_i} ∼ F(θ*_{z_i}), where F is a parametric model (e.g., a Gaussian with parameter vector θ = (µ, σ))

Mixing proportions: π = (π1, . . . , πk) | α ∼ Dirichlet(α/k, . . . , α/k)

More on the Dirichlet distribution later
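The finite-mixture generative process above can be sketched in plain Python. This is an illustrative sketch: the component parameters `thetas`, the value `alpha = 1.0`, and the Gaussian choice for F are assumptions, and the Dirichlet draw uses the standard normalized-Gamma construction.

```python
# Sketch of the parametric mixture generative process:
#   pi ~ Dirichlet(alpha/k, ..., alpha/k)
#   z_i | pi ~ Discrete(pi)
#   x_i | z_i ~ N(mu_{z_i}, sigma_{z_i})
import random

def sample_dirichlet(alpha, k):
    """Dirichlet(alpha/k, ..., alpha/k) via normalized Gamma draws."""
    gs = [random.gammavariate(alpha / k, 1.0) for _ in range(k)]
    total = sum(gs)
    return [g / total for g in gs]

def sample_finite_mixture(n, alpha, thetas):
    """Draw n (label, value) pairs from a k-component Gaussian mixture."""
    k = len(thetas)
    pi = sample_dirichlet(alpha, k)
    data = []
    for _ in range(n):
        z = random.choices(range(k), weights=pi)[0]  # z_i ~ Discrete(pi)
        mu, sigma = thetas[z]
        data.append((z, random.gauss(mu, sigma)))    # x_i ~ F(theta*_{z_i})
    return pi, data

random.seed(0)
pi, data = sample_finite_mixture(5, alpha=1.0, thetas=[(-3, 1), (0, 1), (3, 1)])
```

Note that k must be fixed in advance here, which is exactly the limitation the motivation slides that follow are about.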


Motivation

Question: What is the number of sources?


Motivation

Question: What is the number of sources?

Is it 5?


Motivation

Question: What is the number of sources?

Or maybe 3?


Motivation

Question: What is the number of sources?

In practice an ad-hoc approach is followed to decide k. For example,

guess the number of clusters, then run EM for a Gaussian Mixture Model, look at the results and goodness of fit, and then, if needed, try again with a different k

or run hierarchical agglomerative clustering, and cut the tree at a “reasonable-looking” level


Motivation

Question: What is the number of sources?

In practice an ad-hoc approach is followed to decide on k.

But we want a principled approach for discovering k. After all, it is an essential part of the problem to be solved!


Motivation: Intuitive and Theoretical Motivation

Natural Phenomena:

Topics: (Wikipedia) dynamic traversal; clustering

Species discovery

Annotation and labeling

Knowledge-base entity types

. . .

For any fixed k, as we see more data, there is a positive probability that we will encounter a data point that does not fit in the current scheme; i.e.,

k grows with data


Motivation: Theoretical Motivation (De Finetti’s Theorem)

Infinite Exchangeability

A data sequence is infinitely exchangeable if the distribution of any N data points does not change under permutation: p(X1, . . . , XN) = p(X_σ(1), . . . , X_σ(N))


Theorem (De Finetti’s Theorem)

A sequence X1, X2, . . . is infinitely exchangeable if and only if, for all N and some distribution P:

p(X1, . . . , XN) = ∫_Θ ∏_{n=1}^{N} p(Xn | θ) P(dθ)


Motivation: Theoretical Motivation

De Finetti’s Theorem. General proof: Hewitt and Savage 1955; Aldous 1983

Theorem (De Finetti’s Theorem)

A sequence X1, X2, . . . is infinitely exchangeable if and only if, for all N and some distribution P:

p(X1, . . . , XN) = ∫_Θ ∏_{n=1}^{N} p(Xn | θ) P(dθ)

Motivates:

Parameters

Likelihood

Priors

Non-parametric Bayesian priors


Motivation: Theoretical Motivation

What happens under the parametric regime?


Motivation: Theoretical Motivation

What happens under the parametric regime? Let’s take the example of regression


Motivation: Theoretical Motivation

What happens under the parametric regime? When fitting/optimizing, we are finding the best fit within the chosen (parametric) family of functions; i.e., we are optimizing to get the closest approximation to the true target function.



But this may not be good enough.

Motivation: Theoretical Motivation (Non-parametric Bayesian Approach)


Motivation: Practical Problem-solving Motivation

Human intuitions about high-dimensional problems are often misleading!

Example: a recent result from Random Matrix Theory proving the proliferation of saddle points, in comparison to local minima, in high-dimensional problems [Dauphin et al. 2015]

Assumptions often made when attempting to solve different problemsare naturally part of the problem to be solved, e.g.,


Motivation: Practical Problem-solving Motivation

Assumptions often made when attempting to solve different problems with data are naturally part of the problem to be solved, e.g.,

number of clusters in clustering

“type” or class of function in regression

number of factors in factor analysis

. . .

The Bayesian non-parametric approach:

no unreasonable assumptions about the data (e.g., that the true model for a complex phenomenon is governed by a small number of parameters)

a model that can adapt its complexity to the data

Let the data determine model complexity

naturally no fitting or model selection → no underfitting or overfitting → no regularization required


Motivation: Practical Problem-solving Motivation

Learning structures

Bayesian prior over combinatorial structures

Lack of intuitive parametric prior over these complex structures

Nonparametric priors sometimes end up simpler than parametric priors


Motivation: Practical Problem-solving Motivation (Structure Learning)


Motivation: Desirable Properties of Non-parametric Models

Exchangeability

Naturally captures power laws

Flexible ways of building complex models (e.g., hierarchical models)

When conjugate priors are used, problems often become computationally tractable


Dirichlet Process

Fundamental concept in Bayesian nonparametrics

Formally defined by [Ferguson 1973] as a distribution over measures

Can be derived in different ways, and as special cases of differentprocesses:

Infinite limit of a Gibbs sampler for finite mixture models

Chinese restaurant process

Stick-breaking construction
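The stick-breaking construction listed above can be sketched in a few lines. This is an illustrative sketch: in practice the infinite sequence of sticks must be truncated, and the truncation level (20 sticks) and α = 2.0 below are assumptions for the demo.

```python
# Sketch of the (truncated) stick-breaking construction of DP weights:
#   v_k ~ Beta(1, alpha)
#   w_k = v_k * prod_{j < k} (1 - v_j)
# i.e., repeatedly break off a Beta-distributed fraction of the remaining stick.
import random

def stick_breaking(alpha, num_sticks):
    """Return a truncated sequence of Dirichlet process weights."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        v = random.betavariate(1.0, alpha)
        weights.append(remaining * v)   # w_k = v_k * (length still unbroken)
        remaining *= (1.0 - v)          # shorten the remaining stick
    return weights

random.seed(1)
w = stick_breaking(alpha=2.0, num_sticks=20)
# The truncated weights are positive and sum to less than 1;
# the remainder corresponds to the infinitely many unbroken sticks.
```

Smaller α concentrates mass on the first few sticks (fewer effective clusters); larger α spreads it out.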


Chinese Restaurant Process

A partition % of a set S is:

a disjoint family of non-empty subsets of S whose union is S.

Denote the set of all partitions of S as P_S

Random partitions are random variables taking values in P_S

We will consider partitions of S


Chinese Restaurant Process

Each customer comes into the restaurant and sits at a table: customer n + 1 joins an existing table c with probability |c| / (n + α) and starts a new table with probability α / (n + α)

Customers correspond to elements of S, and tables to clusters in %

Rich-gets-richer: large clusters are more likely to attract more items

Multiplying the conditional probabilities together, the overall probability of %, called the exchangeable partition probability function (EPPF), is:

P(%) = α^|%| ∏_{c ∈ %} (|c| − 1)! / (α(α + 1) · · · (α + n − 1))
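The seating rule described above translates directly into a short sampler. This is a minimal sketch; the α value and the number of customers are illustrative.

```python
# Sketch of the Chinese restaurant process seating rule:
# customer n+1 joins an existing table c with probability |c| / (n + alpha)
# and starts a new table with probability alpha / (n + alpha).
import random

def sample_crp(num_customers, alpha):
    """Return a list of table sizes: one random partition of the customers."""
    tables = []
    for n in range(num_customers):
        probs = [size / (n + alpha) for size in tables]
        probs.append(alpha / (n + alpha))  # probability of opening a new table
        choice = random.choices(range(len(probs)), weights=probs)[0]
        if choice == len(tables):
            tables.append(1)       # new table
        else:
            tables[choice] += 1    # rich-gets-richer: big tables grow faster
    return tables

random.seed(2)
tables = sample_crp(100, alpha=1.0)
# Table sizes sum to the number of customers; typically a few large tables
# and several small ones.
```

Running this repeatedly shows the rich-gets-richer effect: most customers end up at a handful of large tables, and the number of tables grows slowly (logarithmically, in expectation) with n.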


Chinese Restaurant Process: Number of clusters


Nonparametric approach to clustering

Partitions are natural latent objects in clustering

Given a dataset S, partition it into clusters of similar items

Cluster c ∈ % is described by a model F(θ*_c) parameterized by θ*_c

Bayesian approach: introduce a prior over % and θ*_c

Compute posterior over both

CRP mixture model: use a CRP prior over %, and an i.i.d. prior H over cluster parameters

Computation becomes efficient when H is the conjugate prior for F

One of the reasons Gaussians are popular in modeling is their nice mathematical properties, including conjugacy


Nonparametric approach to clustering

Generative model of data:

% ∼ CRP(α)
θ*_c | % ∼ H
x_i | θ*, % ∼ F(θ*_c)

The CRP prior is a prior over partitions of the data, with the number of partitions/clusters unknown a priori and part of the inference
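The CRP mixture generative model above can be sketched end to end. This is an illustrative sketch: the slides do not fix H or F, so the wide Gaussian base distribution H and the unit-variance Gaussian likelihood F below are assumed choices.

```python
# Sketch of the CRP mixture generative model:
#   partition ~ CRP(alpha); theta*_c ~ H; x_i ~ F(theta*_{c_i})
# Here H = N(0, h_sigma^2) and F = N(theta*_c, 1) are illustrative choices.
import random

def sample_crp_mixture(n, alpha, h_sigma=3.0):
    """Return (cluster assignments, cluster parameters, data points)."""
    assignments, table_sizes, thetas, xs = [], [], [], []
    for i in range(n):
        # CRP seating rule for customer i.
        probs = [s / (i + alpha) for s in table_sizes]
        probs.append(alpha / (i + alpha))
        c = random.choices(range(len(probs)), weights=probs)[0]
        if c == len(table_sizes):
            table_sizes.append(1)
            thetas.append(random.gauss(0.0, h_sigma))  # new theta*_c ~ H
        else:
            table_sizes[c] += 1
        assignments.append(c)
        xs.append(random.gauss(thetas[c], 1.0))        # x_i ~ F(theta*_c)
    return assignments, thetas, xs

random.seed(4)
assignments, thetas, xs = sample_crp_mixture(50, alpha=1.0)
```

The number of clusters is not an input: it is whatever the CRP seating produced, which is exactly the point of the nonparametric approach.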


Nonparametric approach to clustering

Consider a finite mixture model with K sources. How can we describe the partition of the data into clusters?

What is the distribution over partitions %?

where [x]_{a,b} = x(x + b) · · · (x + (a − 1)b)

Taking the limit K → ∞, we obtain a distribution over partitions without a limit on the number of sources (K disappears in the limit):

Note where the exchangeability comes from!

Pitman-Yor Process

The Pitman-Yor Process is a generalization of the Dirichlet Process

Recall the CRP probabilities: an occupied table c is chosen with probability |c| / (n + α), and a new table with probability α / (n + α)

Here the difference is a discount parameter d: an occupied table c is chosen with probability (|c| − d) / (n + α), and a new table with probability (α + d · K) / (n + α), where K is the current number of tables

The effect is that, as d increases, the model tends to create more tables, with fewer customers per table
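The Pitman-Yor seating rule is a small change to the CRP sampler. This is a minimal sketch; the parameter values (α = 1.0, d = 0.0 vs. d = 0.8, 500 customers) are illustrative choices to show the effect of the discount.

```python
# Sketch of the Pitman-Yor seating rule: an occupied table c is chosen with
# probability (|c| - d) / (n + alpha), a new table with probability
# (alpha + d * K) / (n + alpha), where K is the current number of tables.
# With d = 0 this reduces to the ordinary CRP.
import random

def sample_pitman_yor(num_customers, alpha, d):
    """Return table sizes for a Pitman-Yor random partition (0 <= d < 1)."""
    tables = []
    for n in range(num_customers):
        k = len(tables)
        probs = [(size - d) / (n + alpha) for size in tables]
        probs.append((alpha + d * k) / (n + alpha))  # discount boosts new tables
        choice = random.choices(range(len(probs)), weights=probs)[0]
        if choice == k:
            tables.append(1)
        else:
            tables[choice] += 1
    return tables

random.seed(3)
crp_tables = sample_pitman_yor(500, alpha=1.0, d=0.0)  # ordinary CRP
py_tables = sample_pitman_yor(500, alpha=1.0, d=0.8)   # heavy discount
# The discounted run produces many more tables, most of them small,
# which is the power-law behavior discussed on the next slide.
```

Comparing the two runs shows the claim on this slide directly: increasing d inflates the number of tables while shrinking typical table sizes.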


Pitman-Yor Process

The Pitman-Yor Process is better at capturing natural power-law phenomena, especially around the tails and peak of the distribution.

Example: English word frequencies and ranks [Wood et al. 2011]


Discussion: Note on the Statistical Properties of Nonparametric Models

Consistency

Efficiency (i.e., statistical efficiency)

Coverage (the Bayesian analogue of confidence intervals)

Computationally expensive (also related to the decoupling of models and algorithms)

How to compare against a non-parametric counterpart:

Accuracy alone is not a good metric for comparison; it is a function of the model and a specific dataset

Asymptotic performance as the amount of data increases is better for comparison

Nonparametric models are extremely popular in settings where the data follows a power law

They should be considered when we suspect a continuous increase in possible configurations as we see more data

They should not be used when we know that the distribution of the data is likely to follow a parametric form or is generated from a finite number of sources (no coverage guarantees in this case)


Tutorials:

A webpage with a list of tutorials on Bayesian Nonparametrics: http://stat.columbia.edu/~porbanz/npb-tutorial.html

A tutorial on Bayesian nonparametric models, by Gershman and Blei: http://gershmanlab.webfactional.com/pubs/GershmanBlei12.pdf

Dirichlet Process. Yee Whye Teh. http://www.stats.ox.ac.uk/~teh/research/npbayes/Teh2010a.pdf

A Tutorial on Gaussian Processes. Mark Ebden. http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf

Video tutorials:

Bayesian Nonparametrics - Yee Whye Teh - MLSS 2013 Tübingen (Max Planck Institute for Intelligent Systems, Tübingen)

Bayesian Nonparametrics - Tamara Broderick - MLSS 2015 Tübingen

Bayesian Nonparametrics Lectures - Larry Wasserman


Tutorials and Further References: Courses on Bayesian Nonparametrics

Nonparametric Modeling. UIC. http://georgek.people.uic.edu/Nonparametric.htm

Bayesian Nonparametric Statistics. UNITO.it. http://www.master-sds.unito.it/do/corsi.pl/Show?_id=meln

Bayesian Nonparametrics - Foundations and Applications. FSU.http://stat.fsu.edu/~sethu/st718outline.pdf

A Course in Bayesian Statistics. Stanford University.http://statweb.stanford.edu/~sabatti/Stat370/


Tutorials and Further References: Bayesian Nonparametrics Textbooks


Thank you!
