PROBABILITY DENSITY ESTIMATION
USING PRODUCT OF CONDITIONAL
EXPERTS
Project Guides:
Dr. Harish Karnick (IIT Kanpur)
Dr. Vinod Nair (MSR India)
Dr. Sundar S. (MSR India)
Submitted by:
Chirag Gupta 10212
Pulkit Jain 10543
Density Estimation
Construct an estimate of the underlying probability distribution function from observed data
Why?
Reveals the underlying pattern in the data
Yields useful statistical information, e.g.:
Modality
Skewness
How is it done?
The observed data are considered a set of i.i.d. samples from the distribution
Choose a model that can estimate the underlying probability density function
Fit the model to the observed data
Maximum Likelihood Estimation
Maximize the probability of the observed data under the model p(x) = f(x; θ)
Maximize p(data) = p(x1) · p(x2) · … · p(xn), or equivalently its logarithm
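The maximum-likelihood recipe above can be sketched in a few lines; here a 1-D Gaussian stands in for the model family, and the data points are illustrative assumptions, not from the deck:

```python
# Minimal MLE sketch: maximize p(data) = p(x1) * ... * p(xn) by working
# with the log-likelihood instead. Gaussian family and data are assumed
# for illustration only.
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log p(x_i) under N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

def fit_gaussian_mle(data):
    """Closed-form ML estimates: sample mean and (biased) sample std."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)

data = [1.2, 0.8, 1.1, 0.9, 1.0]
mu_hat, sigma_hat = fit_gaussian_mle(data)
# The MLE scores at least as high as any other candidate parameters:
assert gaussian_log_likelihood(data, mu_hat, sigma_hat) >= \
       gaussian_log_likelihood(data, 0.0, 1.0)
```

Maximizing the log instead of the raw product avoids numerical underflow when n is large.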
Background/Previous Work
High-dimensional data is modeled by combining relatively simple models
Mixture models
The combination rule takes a weighted arithmetic mean of the individual distributions
Inefficient for high-dimensional data
Product of Experts
The combination rule multiplies the relatively simple probabilistic models and renormalizes
High-dimensional data is relatively easy to handle
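The two combination rules can be contrasted on toy discrete distributions; the two "experts" below are illustrative assumptions:

```python
# Sketch: mixture vs. product combination of two experts over three
# discrete outcomes. A mixture averages; a product of experts multiplies
# pointwise and renormalizes. The expert distributions are made up.
p1 = [0.6, 0.3, 0.1]   # expert 1
p2 = [0.1, 0.3, 0.6]   # expert 2

mixture = [0.5 * a + 0.5 * b for a, b in zip(p1, p2)]

unnorm = [a * b for a, b in zip(p1, p2)]
z = sum(unnorm)                      # partition function (trivial here)
product = [u / z for u in unnorm]

# The product concentrates mass where *all* experts agree, which is why
# each expert only needs to rule out part of a high-dimensional space.
assert abs(sum(mixture) - 1.0) < 1e-12
assert abs(sum(product) - 1.0) < 1e-12
```

Note that for high-dimensional spaces the partition function z is a sum over the whole data space and is no longer trivial; that is the source of the intractability discussed under Learning/Training.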
Product Of Experts
Probability of a data point d under the model is calculated as
p(d | θ_1, …, θ_n) = ∏_i f_i(d | θ_i) / Σ_c ∏_i f_i(c | θ_i)
f_i is the i-th relatively simple 'expert'; θ_i are the parameters of the i-th expert
The sum in the denominator runs over all points c in the data space (the normalization term)
Product of Conditional Experts
Often, suitable individual experts are not known
We use a Product of Conditional Experts, wherein each expert gives the conditional probability of one dimension of the data given the others
Product of Conditional Experts
Conditional probabilities can be estimated using classification models that associate a probability with the output class
Classification models like Logistic Regression, Kernel Logistic Regression, and Decision Trees can be used
A wrapper model takes the user's choice of experts and builds the estimate
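The wrapper idea above can be sketched as follows. Each conditional expert is a logistic-regression-style classifier predicting p(x_i | remaining dimensions), and the unnormalized model score is the product of the conditionals; the weights here are illustrative assumptions, not trained values:

```python
# Sketch of a PoCE score for a binary vector: one logistic-regression
# "conditional expert" per dimension. Weights/biases below are made up,
# chosen so experts prefer dimensions that agree.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditional_expert(x, i, w, b):
    """p(x_i | x_{-i}) from the i-th logistic-regression expert."""
    z = b + sum(w[j] * x[j] for j in range(len(x)) if j != i)
    p1 = sigmoid(z)
    return p1 if x[i] == 1 else 1.0 - p1

def poce_log_score(x, weights, biases):
    """Unnormalized log-probability: sum of log conditional experts."""
    return sum(
        math.log(conditional_expert(x, i, weights[i], biases[i]))
        for i in range(len(x))
    )

weights = [[0.0, 2.0, 2.0], [2.0, 0.0, 2.0], [2.0, 2.0, 0.0]]
biases = [-2.0, -2.0, -2.0]
# A fully agreeing vector scores higher than one with a flipped bit:
assert poce_log_score([1, 1, 1], weights, biases) > \
       poce_log_score([1, 0, 1], weights, biases)
```

Turning this score into a proper density still requires dividing by the partition function, which is exactly the term that learning must approximate.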
Learning/Training
Follow a gradient ascent algorithm to maximize the average log probability of the observed data
The gradient of the objective function with respect to the i-th expert's parameters is
∂ log p(d | θ_1, …, θ_n) / ∂θ_i = ∂ log f_i(d | θ_i) / ∂θ_i − Σ_c p(c | θ_1, …, θ_n) · ∂ log f_i(c | θ_i) / ∂θ_i
Learning/Training
The second term in the gradient expression (arising from the normalization term) is intractable to compute
Contrastive divergence [1] is therefore used to approximate the gradient
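As a hedged illustration of contrastive divergence, here is CD-1 for a tiny Bernoulli RBM (itself a product of experts, one per hidden unit; see [4]): one Gibbs step from the data replaces the intractable model expectation. Sizes, weights, and the seed are assumptions for the sketch, not the project's actual model:

```python
# CD-1 sketch for a tiny Bernoulli RBM. The intractable negative phase
# <v h>_model is approximated by a one-step Gibbs reconstruction.
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample(p):
    return 1 if random.random() < p else 0

def cd1_gradient(v0, W):
    """Approximate dlogp/dW ~ <v h>_data - <v h>_reconstruction."""
    n_v, n_h = len(W), len(W[0])
    # positive phase: hidden probabilities given the data vector
    h0 = [sigmoid(sum(W[i][j] * v0[i] for i in range(n_v)))
          for j in range(n_h)]
    h_samp = [sample(p) for p in h0]
    # negative phase: one Gibbs step v0 -> h -> v1 -> h1
    v1 = [sample(sigmoid(sum(W[i][j] * h_samp[j] for j in range(n_h))))
          for i in range(n_v)]
    h1 = [sigmoid(sum(W[i][j] * v1[i] for i in range(n_v)))
          for j in range(n_h)]
    return [[v0[i] * h0[j] - v1[i] * h1[j] for j in range(n_h)]
            for i in range(n_v)]

W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
grad = cd1_gradient([1, 0, 1], W)   # one approximate gradient entry per weight
```

In training, each such gradient would be scaled by a learning rate and added to W, then the process repeated over the dataset.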
Experts Considered
Logistic Regression (a linear classification model)
Kernel Logistic Regression (a non-linear model)
Density Estimation: Artificial Datasets
Set 1 (100 examples, 5 dimensions): true distribution alongside the distributions learnt by the linear and non-linear models
Set 2 (100 examples, 5 dimensions): true distribution alongside the distributions learnt by the linear and non-linear models
Density Estimation: Real Datasets
Adult (200 train + 500 test)
Adult (500 train + 500 test)
OCR (200 train + 500 test)
OCR (500 train + 500 test)
Density Estimation
For the artificial sets, the learned distribution is close to the actual distribution
MoB clearly performs better on the training examples than the linear and non-linear models
However, the PoCE model generalizes better:
Higher log probability on the test set
The advantage is significantly larger with fewer training examples
Application: Outlier Detection
Detect points that do not belong to a particular class of points
If the model builds a good enough representation of the data, it should assign high probability to points in the inlier class and relatively low probability to points outside it
We train the model on a mix of two classes, with fewer examples (< 5%) from the outlier class
We then test whether outliers and inliers are assigned low and high probabilities, respectively
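The evaluation idea can be sketched with a stand-in density model; a 1-D Gaussian replaces the PoCE model here, and the data (with a single contaminating point, a larger fraction than the deck's < 5%, for brevity) are illustrative assumptions:

```python
# Sketch: fit a density to data dominated by the inlier class, then check
# that the contaminating point receives the lowest probability.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / \
           math.sqrt(2 * math.pi * sigma ** 2)

inliers = [0.9, 1.0, 1.1, 1.05, 0.95]
outlier = 5.0
train = inliers + [outlier]

# ML fit on the contaminated training set
mu = sum(train) / len(train)
sigma = math.sqrt(sum((x - mu) ** 2 for x in train) / len(train))

scores = {x: gaussian_pdf(x, mu, sigma) for x in train}
# The outlier should receive the lowest density under the fitted model:
assert min(scores, key=scores.get) == outlier
```

Ranking test points by model probability in this way is also what produces the precision-recall curves shown later.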
Application: Outlier Detection
(Slides 19-22: per-case probability figures; not transcribed)
Application: Outlier Detection
In 3 out of 4 cases, the outliers in both the training and test data get lower average probability
This suggests that outlier detection can be carried out to some extent
We now present the precision-recall curves obtained for similar class pairs; five outliers were kept in both the training and test sets to obtain these curves
Application: Outlier Detection
(Slides 24-25: precision-recall curves for class pair CYT – MIT; slides 26-27: precision-recall curves for class pair CYT – NUC)
Application: Outlier Detection
Test set: all three models perform equally well
Training set: KLR does as well as MoB; LR does both better and worse than MoB
Future Work
Partition function calculation
Annealed Importance Sampling
Evaluate on larger datasets
Try other experts
Decision Trees
Known for interpretability
References
[1] Geoffrey Hinton, Training Products of Experts by Minimizing Contrastive Divergence. 2002.
[2] Hugo Larochelle, Iain Murray, The Neural Autoregressive Distribution Estimator. 2011.
[3] KL Divergence, http://en.wikipedia.org/wiki/Kullback–Leibler_divergence
[4] Restricted Boltzmann Machines, http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
[5] Logistic Regression, http://en.wikipedia.org/wiki/Logistic_regression
[6] Mixture Models, http://en.wikipedia.org/wiki/Mixture_model
[7] UCI Repository, http://archive.ics.uci.edu/ml/
THANK YOU