
A Tutorial on Bayesian Nonparametrics

Fatima Al-Raisi

Carnegie Mellon University

[email protected]

October 25, 2016

Fatima Al-Raisi (Carnegie Mellon University) A Tutorial on Bayesian Nonparametrics October 25, 2016 1 / 45

1 Introduction

2 Bayesian Nonparametrics: Motivation (Intuitions and Assumptions; Theoretical Motivation; Practical Motivation)

3 Dirichlet Process

4 Chinese Restaurant Process; Pitman-Yor Process

5 Discussion and Concluding Remarks

6 List of Tutorials


Development of Interest in Topic Over Time

An interesting “interest over time” pattern!


Interest Over Time: Deep Learning


Interest Over Time: Reinforcement Learning


Interest Over Time: Nonparametric Statistics!


Interest Over Time: Bayesian Inference!


Terminology: What does “Bayesian Nonparametrics” mean?

Bayesian inference: data and parameters, priors and posteriors

P(parameters|data) ∝ P(parameters)P(data|parameters)

Bayesian inference vs. Bayes rule (Bayesian inference does not mean using Bayes rule!)

Non-parametric? (a misnomer): a large/unbounded number of parameters, a growing number of parameters, an infinite parameter space

“the number of parameters grows with the amount of training data”

No (strong) assumption about underlying distribution of the data

Terminology note: “non-parametric” vs. “nonparametric”
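As a concrete sketch of the posterior ∝ prior × likelihood update above, a conjugate Beta-Bernoulli model can be worked out in a few lines. This example is illustrative, not from the slides; the Beta(1, 1) prior and the coin-flip data are assumptions for the demo.

```python
# A minimal sketch of Bayesian inference with a conjugate Beta-Bernoulli model.
# A Beta(a, b) prior on a coin's bias is updated by counts of heads and tails;
# conjugacy makes the posterior another Beta: Beta(a + heads, b + tails).

def beta_bernoulli_posterior(a, b, data):
    """Return posterior parameters (a', b') after observing 0/1 outcomes."""
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Uniform prior Beta(1, 1); observe 7 heads in 10 flips.
a_post, b_post = beta_bernoulli_posterior(1, 1, [1, 1, 1, 0, 1, 1, 0, 1, 0, 1])
print(a_post, b_post)             # (8, 4)
print(beta_mean(a_post, b_post))  # posterior mean 8/12
```

Note how no explicit use of Bayes rule is needed: conjugacy reduces the update to bookkeeping on counts, which is the computational appeal mentioned later in the tutorial.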


Terminology: Formal Definition

A statistical model is a collection of distributions {P_θ : θ ∈ Θ} indexed by a parameter θ

Parametric model: the indexing parameter is a finite-dimensional vector: Θ ⊂ R^k

Nonparametric model: Θ ⊂ F for some possibly infinite-dimensional space F

Semiparametric model: the parameter has both a finite-dimensional component and an infinite-dimensional component: Θ ⊂ R^k × F, where F is an infinite-dimensional space


Review: Probabilistic Modeling

Data: x1, x2, . . . , xn

Latent variables: z1, z2, . . . , zn

Parameter: θ

A probabilistic model is a parametrized joint distribution over variables: P(x1, x2, . . . , xn, z1, z2, . . . , zn | θ)

Typically interpreted as a generative model of data

Inference of latent variables given observed data:

P(z1, z2, . . . , zn | x1, x2, . . . , xn, θ) = P(x1, x2, . . . , xn, z1, z2, . . . , zn | θ) / P(x1, x2, . . . , xn | θ)


Review: Probabilistic Modeling

Learning (e.g., by maximum likelihood): θ̂ = argmax_θ P(x1, x2, . . . , xn | θ)

Prediction: P(xn+1|x1, x2, . . . , xn, θ)

Classification: argmax_c P(xn+1 | θ_c)

Standard algorithms: EM, VI, MCMC, etc.


Review: Bayesian Modeling

Prior distribution: P(θ)

Posterior distribution:

P(z1, . . . , zn, θ | x1, . . . , xn) = P(x1, . . . , xn, z1, . . . , zn | θ) P(θ) / P(x1, . . . , xn)

The above performs both inference and learning


Clustering: Parametric Approach

Think of the data as generated from a number of sources

Model each cluster using a parametric model

A data item i is drawn as follows:
z_i | π ∼ Discrete(π)
x_i | z_i, θ*_{z_i} ∼ F(θ*_{z_i}), where F is a parametric model (e.g., a Gaussian with parameter vector θ = (µ, σ))

Mixing proportions: π = (π1, . . . , πk) | α ∼ Dirichlet(α/k, . . . , α/k)

More on the Dirichlet distribution later
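The finite-mixture generative process above can be sketched in plain Python. This is an illustrative sketch: the component parameters `thetas`, the value `alpha = 1.0`, and the Gaussian choice for F are assumptions, and the Dirichlet draw uses the standard normalized-Gamma construction.

```python
# Sketch of the parametric mixture generative process:
#   pi ~ Dirichlet(alpha/k, ..., alpha/k)
#   z_i | pi ~ Discrete(pi)
#   x_i | z_i ~ N(mu_{z_i}, sigma_{z_i})
import random

def sample_dirichlet(alpha, k):
    """Dirichlet(alpha/k, ..., alpha/k) via normalized Gamma draws."""
    gs = [random.gammavariate(alpha / k, 1.0) for _ in range(k)]
    total = sum(gs)
    return [g / total for g in gs]

def sample_finite_mixture(n, alpha, thetas):
    """Draw n (label, value) pairs from a k-component Gaussian mixture."""
    k = len(thetas)
    pi = sample_dirichlet(alpha, k)
    data = []
    for _ in range(n):
        z = random.choices(range(k), weights=pi)[0]  # z_i ~ Discrete(pi)
        mu, sigma = thetas[z]
        data.append((z, random.gauss(mu, sigma)))    # x_i ~ F(theta*_{z_i})
    return pi, data

random.seed(0)
pi, data = sample_finite_mixture(5, alpha=1.0, thetas=[(-3, 1), (0, 1), (3, 1)])
```

Note that k must be fixed in advance here, which is exactly the limitation the motivation slides that follow are about.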


Motivation

Question: What is the number of sources?


Motivation

Question: What is the number of sources?

Is it 5?


Motivation

Question: What is the number of sources?

Or maybe 3?


Motivation

Question: What is the number of sources?

In practice an ad-hoc approach is followed to decide k. For example,

guess the number of clusters, then run EM for a Gaussian Mixture Model, look at the results and goodness of fit, and then, if needed, try again with a different k

or run hierarchical agglomerative clustering, and cut the tree at a “reasonable-looking” level


Motivation

Question: What is the number of sources?

In practice an ad-hoc approach is followed to decide on k.

But we want a principled approach for discovering k. After all, it is an essential part of the problem to be solved!


Motivation: Intuitive and Theoretical Motivation

Natural Phenomena:

Topics: (Wikipedia) dynamic traversal; clustering

Species discovery

Annotation and labeling

Knowledge-base entity types

. . .

For any fixed k, as we see more data, there is a positive probability that we will encounter a data point that does not fit in the current scheme; i.e.,

k grows with data


Motivation: Theoretical Motivation (De Finetti’s Theorem)

Infinite Exchangeability

A data sequence is infinitely exchangeable if the distribution of any N data points does not change under permutation: p(X1, . . . , XN) = p(X_σ(1), . . . , X_σ(N))


Theorem (De Finetti’s Theorem)

A sequence X1, X2, . . . is infinitely exchangeable if and only if, for all N and some distribution P:

p(X1, . . . , XN) = ∫_Θ ∏_{n=1}^{N} p(Xn | θ) P(dθ)


Motivation: Theoretical Motivation

De Finetti’s Theorem. General proof: Hewitt and Savage 1955; Aldous 1983

Theorem (De Finetti’s Theorem)

A sequence X1, X2, . . . is infinitely exchangeable if and only if, for all N and some distribution P:

p(X1, . . . , XN) = ∫_Θ ∏_{n=1}^{N} p(Xn | θ) P(dθ)

Motivates:

Parameters

Likelihood

Priors

Non-parametric Bayesian priors


Motivation: Theoretical Motivation

What happens under the parametric regime?


Motivation: Theoretical Motivation

What happens under the parametric regime? Let’s take the example of regression


Motivation: Theoretical Motivation

What happens under the parametric regime? When fitting/optimizing, we are finding the best fit within the chosen (parametric) family of functions; i.e., we are optimizing to get the closest approximation to the true target function.



But this may not be good enough.

Motivation: Theoretical Motivation (Non-parametric Bayesian Approach)


Motivation: Practical Problem-solving Motivation

Human intuitions about high-dimensional problems are often misleading!

Example: a recent result from Random Matrix Theory proving the proliferation of saddle points, in comparison to local minima, in high-dimensional problems [Dauphin et al. 2015]

Assumptions often made when attempting to solve different problemsare naturally part of the problem to be solved, e.g.,


Motivation: Practical Problem-solving Motivation

Assumptions often made when attempting to solve different problems with data are naturally part of the problem to be solved, e.g.,

number of clusters in clustering

“type” or class of function in regression

number of factors in factor analysis

. . .

The Bayesian non-parametric approach:

no unreasonable assumptions about the data (e.g., that the true model for a complex phenomenon is governed by a small number of parameters)

a model that can adapt its complexity to the data

Let the data determine model complexity

naturally no fitting or model selection → no underfitting or overfitting → no regularization required


Motivation: Practical Problem-solving Motivation

Learning structures

Bayesian prior over combinatorial structures

Lack of intuitive parametric prior over these complex structures

Nonparametric priors sometimes end up simpler than parametric priors


Motivation: Practical Problem-solving Motivation (Structure Learning)


Motivation: Desirable Properties of Non-parametric Models

Exchangeability

Naturally captures power laws

Flexible ways of building complex models (e.g., hierarchical models)

When conjugate priors are used, problems often become computationally tractable


Dirichlet Process

Fundamental concept in Bayesian nonparametrics

Formally defined by [Ferguson 1973] as a distribution over measures

Can be derived in different ways, and as special cases of differentprocesses:

Infinite limit of a Gibbs sampler for finite mixture models

Chinese restaurant process

Stick-breaking construction
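The stick-breaking construction listed above can be sketched in a few lines. This is an illustrative sketch: in practice the infinite sequence of sticks must be truncated, and the truncation level (20 sticks) and α = 2.0 below are assumptions for the demo.

```python
# Sketch of the (truncated) stick-breaking construction of DP weights:
#   v_k ~ Beta(1, alpha)
#   w_k = v_k * prod_{j < k} (1 - v_j)
# i.e., repeatedly break off a Beta-distributed fraction of the remaining stick.
import random

def stick_breaking(alpha, num_sticks):
    """Return a truncated sequence of Dirichlet process weights."""
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        v = random.betavariate(1.0, alpha)
        weights.append(remaining * v)   # w_k = v_k * (length still unbroken)
        remaining *= (1.0 - v)          # shorten the remaining stick
    return weights

random.seed(1)
w = stick_breaking(alpha=2.0, num_sticks=20)
# The truncated weights are positive and sum to less than 1;
# the remainder corresponds to the infinitely many unbroken sticks.
```

Smaller α concentrates mass on the first few sticks (fewer effective clusters); larger α spreads it out.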


Chinese Restaurant Process

A partition % of a set S is:

a disjoint family of non-empty subsets of S whose union is S.

Denote the set of all partitions of S as P_S

Random partitions are random variables taking values in P_S

We will consider partitions of S


Chinese Restaurant Process

Each customer comes into the restaurant and sits at a table: customer n + 1 joins an existing table c with probability |c| / (n + α) and starts a new table with probability α / (n + α)

Customers correspond to elements of S, and tables to clusters in %

Rich-gets-richer: large clusters are more likely to attract more items

Multiplying the conditional probabilities together, the overall probability of %, called the exchangeable partition probability function (EPPF), is:

P(%) = α^|%| ∏_{c ∈ %} (|c| − 1)! / (α(α + 1) · · · (α + n − 1))
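The seating rule described above translates directly into a short sampler. This is a minimal sketch; the α value and the number of customers are illustrative.

```python
# Sketch of the Chinese restaurant process seating rule:
# customer n+1 joins an existing table c with probability |c| / (n + alpha)
# and starts a new table with probability alpha / (n + alpha).
import random

def sample_crp(num_customers, alpha):
    """Return a list of table sizes: one random partition of the customers."""
    tables = []
    for n in range(num_customers):
        probs = [size / (n + alpha) for size in tables]
        probs.append(alpha / (n + alpha))  # probability of opening a new table
        choice = random.choices(range(len(probs)), weights=probs)[0]
        if choice == len(tables):
            tables.append(1)       # new table
        else:
            tables[choice] += 1    # rich-gets-richer: big tables grow faster
    return tables

random.seed(2)
tables = sample_crp(100, alpha=1.0)
# Table sizes sum to the number of customers; typically a few large tables
# and several small ones.
```

Running this repeatedly shows the rich-gets-richer effect: most customers end up at a handful of large tables, and the number of tables grows slowly (logarithmically, in expectation) with n.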


Chinese Restaurant Process: Number of clusters


Nonparametric approach to clustering

Partitions are natural latent objects in clustering

Given a dataset S, partition it into clusters of similar items

Cluster c ∈ % is described by a model F(θ*_c) parameterized by θ*_c

Bayesian approach: introduce a prior over % and θ*_c

Compute posterior over both

CRP mixture model: use a CRP prior over %, and an i.i.d. prior H over cluster parameters

Computation becomes efficient when H is the conjugate prior for F

One of the reasons Gaussians are popular in modeling is their nice mathematical properties, including conjugacy


Nonparametric approach to clustering

Generative model of data:

% ∼ CRP(α)
θ*_c | % ∼ H
x_i | θ*, % ∼ F(θ*_c)

The CRP prior is a prior over partitions of the data, with the number of partitions/clusters unknown a priori and part of the inference
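The CRP mixture generative model above can be sketched end to end. This is an illustrative sketch: the slides do not fix H or F, so the wide Gaussian base distribution H and the unit-variance Gaussian likelihood F below are assumed choices.

```python
# Sketch of the CRP mixture generative model:
#   partition ~ CRP(alpha); theta*_c ~ H; x_i ~ F(theta*_{c_i})
# Here H = N(0, h_sigma^2) and F = N(theta*_c, 1) are illustrative choices.
import random

def sample_crp_mixture(n, alpha, h_sigma=3.0):
    """Return (cluster assignments, cluster parameters, data points)."""
    assignments, table_sizes, thetas, xs = [], [], [], []
    for i in range(n):
        # CRP seating rule for customer i.
        probs = [s / (i + alpha) for s in table_sizes]
        probs.append(alpha / (i + alpha))
        c = random.choices(range(len(probs)), weights=probs)[0]
        if c == len(table_sizes):
            table_sizes.append(1)
            thetas.append(random.gauss(0.0, h_sigma))  # new theta*_c ~ H
        else:
            table_sizes[c] += 1
        assignments.append(c)
        xs.append(random.gauss(thetas[c], 1.0))        # x_i ~ F(theta*_c)
    return assignments, thetas, xs

random.seed(4)
assignments, thetas, xs = sample_crp_mixture(50, alpha=1.0)
```

The number of clusters is not an input: it is whatever the CRP seating produced, which is exactly the point of the nonparametric approach.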


Nonparametric approach to clustering

Consider a finite mixture model with K sources. How can we describe the partition of the data into clusters?

What is the distribution over partitions %?

where [x]_{a,b} = x(x + b) · · · (x + (a − 1)b)

Taking the limit K → ∞, we obtain a distribution over partitions without a limit on the number of sources (K disappears in the limit):

Note where the exchangeability comes from!

Pitman-Yor Process

The Pitman-Yor Process is a generalization of the Dirichlet Process

Recall the CRP probabilities: an occupied table c is chosen with probability |c| / (n + α), and a new table with probability α / (n + α)

Here the difference is a discount parameter d: an occupied table c is chosen with probability (|c| − d) / (n + α), and a new table with probability (α + d · K) / (n + α), where K is the current number of tables

The effect is that, as d increases, the model tends to create more tables, with fewer customers per table
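The Pitman-Yor seating rule is a small change to the CRP sampler. This is a minimal sketch; the parameter values (α = 1.0, d = 0.0 vs. d = 0.8, 500 customers) are illustrative choices to show the effect of the discount.

```python
# Sketch of the Pitman-Yor seating rule: an occupied table c is chosen with
# probability (|c| - d) / (n + alpha), a new table with probability
# (alpha + d * K) / (n + alpha), where K is the current number of tables.
# With d = 0 this reduces to the ordinary CRP.
import random

def sample_pitman_yor(num_customers, alpha, d):
    """Return table sizes for a Pitman-Yor random partition (0 <= d < 1)."""
    tables = []
    for n in range(num_customers):
        k = len(tables)
        probs = [(size - d) / (n + alpha) for size in tables]
        probs.append((alpha + d * k) / (n + alpha))  # discount boosts new tables
        choice = random.choices(range(len(probs)), weights=probs)[0]
        if choice == k:
            tables.append(1)
        else:
            tables[choice] += 1
    return tables

random.seed(3)
crp_tables = sample_pitman_yor(500, alpha=1.0, d=0.0)  # ordinary CRP
py_tables = sample_pitman_yor(500, alpha=1.0, d=0.8)   # heavy discount
# The discounted run produces many more tables, most of them small,
# which is the power-law behavior discussed on the next slide.
```

Comparing the two runs shows the claim on this slide directly: increasing d inflates the number of tables while shrinking typical table sizes.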


Pitman-Yor Process

The Pitman-Yor Process is better at capturing natural power-law phenomena, especially around the tails and peak of the distribution.

Example: English word frequencies and ranks [Wood et al. 2011]


Discussion: Note on the Statistical Properties of Nonparametric Models

Consistency

Efficiency (i.e., statistical efficiency)

Coverage (the Bayesian analogue of confidence intervals)

Computationally expensive (also related to the decoupling of models and algorithms)

How to compare against a non-parametric counterpart:

Accuracy alone is not a good metric for comparison; it is a function of the model and a specific dataset

Asymptotic performance as the amount of data increases is better for comparison

Nonparametric models are extremely popular in settings where the data follows a power law

They should be considered when we suspect a continuous increase in possible configurations as we see more data

They should not be used when we know that the distribution of the data is likely to follow a parametric form or is generated from a finite number of sources (no coverage guarantees in this case)


Tutorials:

A webpage with a list of tutorials on Bayesian Nonparametrics: http://stat.columbia.edu/~porbanz/npb-tutorial.html

A tutorial on Bayesian nonparametric models, by Gershman and Blei: http://gershmanlab.webfactional.com/pubs/GershmanBlei12.pdf

Dirichlet Process. Yee Whye Teh. http://www.stats.ox.ac.uk/~teh/research/npbayes/Teh2010a.pdf

A Tutorial on Gaussian Processes. Mark Ebden. http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf

Video tutorials:

Bayesian Nonparametrics - Yee Whye Teh - MLSS 2013 Tübingen (Max Planck Institute for Intelligent Systems, Tübingen)

Bayesian Nonparametrics - Tamara Broderick - MLSS 2015 Tübingen

Bayesian Nonparametrics Lectures - Larry Wasserman


Tutorials and Further References: Courses on Bayesian Nonparametrics

Nonparametric Modeling. UIC. http://georgek.people.uic.edu/Nonparametric.htm

Bayesian Nonparametric Statistics. UNITO.it. http://www.master-sds.unito.it/do/corsi.pl/Show?_id=meln

Bayesian Nonparametrics - Foundations and Applications. FSU.http://stat.fsu.edu/~sethu/st718outline.pdf

A Course in Bayesian Statistics. Stanford University.http://statweb.stanford.edu/~sabatti/Stat370/


Tutorials and Further References: Bayesian Nonparametrics Textbooks


Thank you!
