PROBABILITY DENSITY ESTIMATION
USING PRODUCT OF CONDITIONAL
EXPERTS
Project Guides:
Dr. Harish Karnick (IIT Kanpur)
Dr. Vinod Nair (MSR India)
Dr. Sundar S. (MSR India)
Submitted by:
Chirag Gupta 10212
Pulkit Jain 10543
Density Estimation
Construct an estimate of the underlying probability distribution function from observed data
Why?
Reveals the underlying pattern in the data
Yields useful statistical information, e.g.:
Modality
Skewness
How is it done?
The observed data are considered a set of i.i.d. samples from the distribution
Choose a model that can estimate the underlying probability density function
Fit the model to the observed data
Maximum Likelihood Estimation
Maximize the probability of the observed data under the model p(x) = f(x; θ)
Maximize p(data) = p(x1) · p(x2) · … · p(xn), or equivalently its logarithm
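The maximum-likelihood recipe above can be sketched in a few lines; here a 1-D Gaussian stands in for the model family, and the data points are illustrative assumptions, not from the deck:

```python
# Minimal MLE sketch: maximize p(data) = p(x1) * ... * p(xn) by working
# with the log-likelihood instead. Gaussian family and data are assumed
# for illustration only.
import math

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log p(x_i) under N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

def fit_gaussian_mle(data):
    """Closed-form ML estimates: sample mean and (biased) sample std."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)

data = [1.2, 0.8, 1.1, 0.9, 1.0]
mu_hat, sigma_hat = fit_gaussian_mle(data)
# The MLE scores at least as high as any other candidate parameters:
assert gaussian_log_likelihood(data, mu_hat, sigma_hat) >= \
       gaussian_log_likelihood(data, 0.0, 1.0)
```

Maximizing the log instead of the raw product avoids numerical underflow when n is large.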
Background/Previous Work
High-dimensional data is modeled by combining relatively simple models
Mixture models
The combination rule takes a weighted arithmetic mean of the individual distributions
Inefficient for high-dimensional data
Product of Experts
The combination rule multiplies the relatively simple probabilistic models and renormalizes
High-dimensional data is relatively easy to handle
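The two combination rules can be contrasted on toy discrete distributions; the two "experts" below are illustrative assumptions:

```python
# Sketch: mixture vs. product combination of two experts over three
# discrete outcomes. A mixture averages; a product of experts multiplies
# pointwise and renormalizes. The expert distributions are made up.
p1 = [0.6, 0.3, 0.1]   # expert 1
p2 = [0.1, 0.3, 0.6]   # expert 2

mixture = [0.5 * a + 0.5 * b for a, b in zip(p1, p2)]

unnorm = [a * b for a, b in zip(p1, p2)]
z = sum(unnorm)                      # partition function (trivial here)
product = [u / z for u in unnorm]

# The product concentrates mass where *all* experts agree, which is why
# each expert only needs to rule out part of a high-dimensional space.
assert abs(sum(mixture) - 1.0) < 1e-12
assert abs(sum(product) - 1.0) < 1e-12
```

Note that for high-dimensional spaces the partition function z is a sum over the whole data space and is no longer trivial; that is the source of the intractability discussed under Learning/Training.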
Product Of Experts
Probability of a data point d under the model is calculated as
p(d | θ_1, …, θ_n) = ∏_i f_i(d | θ_i) / Σ_c ∏_i f_i(c | θ_i)
f_i is the i-th relatively simple 'expert'; θ_i are the parameters of the i-th expert
The sum in the denominator runs over all points c in the data space (the normalization term)
Product of Conditional Experts
Often, suitable individual experts are not known
We use a Product of Conditional Experts, wherein each expert gives the conditional probability of one dimension of the data given the others
Product of Conditional Experts
Conditional probabilities can be estimated using classification models that associate a probability with the output class
Classification models like Logistic Regression, Kernel Logistic Regression, and Decision Trees can be used
A wrapper model takes the user's choice of experts and builds the estimate
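The wrapper idea above can be sketched as follows. Each conditional expert is a logistic-regression-style classifier predicting p(x_i | remaining dimensions), and the unnormalized model score is the product of the conditionals; the weights here are illustrative assumptions, not trained values:

```python
# Sketch of a PoCE score for a binary vector: one logistic-regression
# "conditional expert" per dimension. Weights/biases below are made up,
# chosen so experts prefer dimensions that agree.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditional_expert(x, i, w, b):
    """p(x_i | x_{-i}) from the i-th logistic-regression expert."""
    z = b + sum(w[j] * x[j] for j in range(len(x)) if j != i)
    p1 = sigmoid(z)
    return p1 if x[i] == 1 else 1.0 - p1

def poce_log_score(x, weights, biases):
    """Unnormalized log-probability: sum of log conditional experts."""
    return sum(
        math.log(conditional_expert(x, i, weights[i], biases[i]))
        for i in range(len(x))
    )

weights = [[0.0, 2.0, 2.0], [2.0, 0.0, 2.0], [2.0, 2.0, 0.0]]
biases = [-2.0, -2.0, -2.0]
# A fully agreeing vector scores higher than one with a flipped bit:
assert poce_log_score([1, 1, 1], weights, biases) > \
       poce_log_score([1, 0, 1], weights, biases)
```

Turning this score into a proper density still requires dividing by the partition function, which is exactly the term that learning must approximate.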
Learning/Training
Follow a gradient ascent algorithm to maximize the average log probability of the observed data
The gradient of the objective function with respect to the i-th expert's parameters is
∂ log p(d | θ_1, …, θ_n) / ∂θ_i = ∂ log f_i(d | θ_i) / ∂θ_i − Σ_c p(c | θ_1, …, θ_n) · ∂ log f_i(c | θ_i) / ∂θ_i
Learning/Training
The second term in the gradient expression (arising from the normalization term) is intractable to compute
Contrastive divergence [1] is therefore used to approximate the gradient
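As a hedged illustration of contrastive divergence, here is CD-1 for a tiny Bernoulli RBM (itself a product of experts, one per hidden unit; see [4]): one Gibbs step from the data replaces the intractable model expectation. Sizes, weights, and the seed are assumptions for the sketch, not the project's actual model:

```python
# CD-1 sketch for a tiny Bernoulli RBM. The intractable negative phase
# <v h>_model is approximated by a one-step Gibbs reconstruction.
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample(p):
    return 1 if random.random() < p else 0

def cd1_gradient(v0, W):
    """Approximate dlogp/dW ~ <v h>_data - <v h>_reconstruction."""
    n_v, n_h = len(W), len(W[0])
    # positive phase: hidden probabilities given the data vector
    h0 = [sigmoid(sum(W[i][j] * v0[i] for i in range(n_v)))
          for j in range(n_h)]
    h_samp = [sample(p) for p in h0]
    # negative phase: one Gibbs step v0 -> h -> v1 -> h1
    v1 = [sample(sigmoid(sum(W[i][j] * h_samp[j] for j in range(n_h))))
          for i in range(n_v)]
    h1 = [sigmoid(sum(W[i][j] * v1[i] for i in range(n_v)))
          for j in range(n_h)]
    return [[v0[i] * h0[j] - v1[i] * h1[j] for j in range(n_h)]
            for i in range(n_v)]

W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
grad = cd1_gradient([1, 0, 1], W)   # one approximate gradient entry per weight
```

In training, each such gradient would be scaled by a learning rate and added to W, then the process repeated over the dataset.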
Experts Considered
Logistic Regression (a linear classification model)
Kernel Logistic Regression (a non-linear model)
Density Estimation: Artificial Datasets
Set 1 (100 examples, 5 dimensions): true distribution alongside the distributions learnt by the linear and non-linear models
Set 2 (100 examples, 5 dimensions): true distribution alongside the distributions learnt by the linear and non-linear models
Density Estimation: Real Datasets
Adult (200 train + 500 test)
Adult (500 train + 500 test)
OCR (200 train + 500 test)
OCR (500 train + 500 test)
Density Estimation
For the artificial sets, the learned distribution is close to the actual distribution
MoB clearly performs better on the training examples than the linear and non-linear models
However, the PoCE model generalizes better:
Higher log probability on the test set
The advantage is significantly larger with fewer training examples
Application: Outlier Detection
Detect points that do not belong to a particular class of points
If the model builds a good enough representation of the data, it should assign high probability to points in the inlier class and relatively low probability to points outside it
We train the model on a mix of two classes, with fewer examples (< 5%) from the outlier class
We then test whether outliers and inliers are assigned low and high probabilities, respectively
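The evaluation idea can be sketched with a stand-in density model; a 1-D Gaussian replaces the PoCE model here, and the data (with a single contaminating point, a larger fraction than the deck's < 5%, for brevity) are illustrative assumptions:

```python
# Sketch: fit a density to data dominated by the inlier class, then check
# that the contaminating point receives the lowest probability.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / \
           math.sqrt(2 * math.pi * sigma ** 2)

inliers = [0.9, 1.0, 1.1, 1.05, 0.95]
outlier = 5.0
train = inliers + [outlier]

# ML fit on the contaminated training set
mu = sum(train) / len(train)
sigma = math.sqrt(sum((x - mu) ** 2 for x in train) / len(train))

scores = {x: gaussian_pdf(x, mu, sigma) for x in train}
# The outlier should receive the lowest density under the fitted model:
assert min(scores, key=scores.get) == outlier
```

Ranking test points by model probability in this way is also what produces the precision-recall curves shown later.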
Application: Outlier Detection
(Slides 19-22: per-case probability figures; not transcribed)
Application: Outlier Detection
In 3 out of 4 cases, the outliers in both the training and test data get lower average probability
This suggests that outlier detection can be carried out to some extent
We now present the precision-recall curves obtained for similar class pairs; five outliers were kept in both the training and test sets to obtain these curves
Application: Outlier Detection
(Slides 24-25: precision-recall curves for class pair CYT – MIT; slides 26-27: precision-recall curves for class pair CYT – NUC)
Application: Outlier Detection
Test set: all three models perform equally well
Training set: KLR does as well as MoB; LR does both better and worse than MoB
Future Work
Partition function calculation
Annealed Importance Sampling
Evaluate on larger datasets
Try other experts
Decision Trees
Known for interpretability
References
[1] Geoffrey Hinton, Training Products of Experts by Minimizing Contrastive Divergence. 2002.
[2] Hugo Larochelle, Iain Murray, The Neural Autoregressive Distribution Estimator. 2011.
[3] KL Divergence, http://en.wikipedia.org/wiki/Kullback–Leibler_divergence
[4] Restricted Boltzmann Machines, http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
[5] Logistic Regression, http://en.wikipedia.org/wiki/Logistic_regression
[6] Mixture Models, http://en.wikipedia.org/wiki/Mixture_model
[7] UCI Repository, http://archive.ics.uci.edu/ml/
THANK YOU