
Bayesian Evidence and Model Selection: A Tutorial in Two Acts

Kevin H. Knuth
Depts. of Physics and Informatics, University at Albany, Albany NY USA

Based on the paper: Knuth K.H., Habeck M., Malakar N.K., Mubeen A.M., Placek B. 2015. Bayesian evidence and model selection. In press at Digital Signal Processing. doi:10.1016/j.dsp.2015.06.012

DOWNLOAD TALK NOW: Google ‘knuthlab’ Click ‘Talks’

7/19/2015 MaxEnt 2015 Tutorial 1


This tutorial follows the paper: Knuth K.H., Habeck M., Malakar N.K., Mubeen A.M., Placek B. 2015. Bayesian evidence and model selection. In press at Digital Signal Processing. doi:10.1016/j.dsp.2015.06.012

References are not provided in the talk slides; please consult the paper. Equations in the talk are numbered in accordance with the paper. When referencing anything from Act 1 of this talk, please reference the paper; when referencing anything from Act 2, please reference the slides.


Bayesian Evidence
- Odds Ratios
- Evidence, Model Order and Priors
Numerical Techniques
- Laplace Approximation
- Importance Sampling
- Annealed Importance Sampling
- Variational Bayes
- Nested Sampling

Applications
- Signal Detection: Brain-Computer Interface / Neuroscience
- Sensor Characterization: Robotics / Signal Processing
- Exoplanet Characterization: Astronomy / Astrophysics
Examples
- Nested Sampling Demo
- Nested Sampling and Phase Transitions


Bayesian Evidence

Bayesian Evidence : Odds Ratios

Bayes' Theorem

  P(m | d, M, I) = P(m | M, I) P(d | m, M, I) / P(d | M, I)

M represents a class of models, specified by a set of model parameters; m represents a particular model, defined by a set of particular model parameter values; d represents the acquired data.

Here P(m | d, M, I) is the posterior probability, P(m | M, I) is the prior probability, P(d | m, M, I) is the likelihood, and P(d | M, I) is the evidence or marginal likelihood.


Bayesian Evidence. The Bayesian evidence can be found by marginalizing the joint distribution P(m, d | M, I) over all model parameter values:

  P(d | M, I) = ∫ dm P(m, d | M, I) = ∫ dm P(m | M, I) P(d | m, M, I)

M represents a class of models, specified by a set of model parameters; m represents a particular model, defined by a set of particular model parameter values; d represents the acquired data; I represents the dependence on any relevant prior information.


Model Comparison. We derive the ratio of the probabilities of two models given the data:

  P(M1 | d, I) / P(M2 | d, I) = [P(M1 | I) / P(M2 | I)] × [P(d | M1, I) / P(d | M2, I)]

If the prior probabilities of the models are equal, then this is the ratio of evidences.
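As a concrete sketch of this comparison (the log-evidence values below are hypothetical, not from the talk): with equal model priors the posterior odds reduce to the evidence ratio, and working in log space avoids underflow.

```python
import math

# Posterior odds from evidences (hypothetical log-evidence values).
# With equal model priors, the posterior ratio reduces to the evidence ratio.
log_Z1, log_Z2 = -104.2, -107.9             # hypothetical log-evidences
log_prior_odds = math.log(1.0 / 1.0)        # equal model priors
log_posterior_odds = log_prior_odds + (log_Z1 - log_Z2)
print(math.exp(log_posterior_odds))         # about 40 in favor of model 1
```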


Odds Ratio or Bayes Factor

The ratio of the probabilities of the models given the data is proportional to the odds ratio

  OR = P(d | M1, I) / P(d | M2, I)

Bayesian Evidence : Evidence, Model Order and Priors

Evidence: Model Order and Priors. It is instructive to see how the evidence depends on both the model order and the prior probabilities. Consider a model with a single parameter x ∈ [x_min, x_max], with a prior width of Δx = x_max − x_min. Define the effective width δx of the likelihood by

  δx = (1 / L_max) ∫ dx L(x)

where L_max is the maximum likelihood value.


Model with a Single Parameter. Consider a model with a single parameter x ∈ [x_min, x_max], with a prior width of Δx = x_max − x_min. Given the effective width δx, the evidence is

  P(d | M, I) = L_max (δx / Δx)


Occam Factor. The evidence is proportional to the ratio of the effective width of the likelihood to the width of the prior:

  P(d | M, I) = L_max (δx / Δx)

The ratio δx / Δx is called the Occam factor, after Occam's Razor:

"Non sunt multiplicanda entia sine necessitate" ("Entities must not be multiplied beyond necessity")

- William of Ockham


Model Order. For models with multiple parameters, this generalizes to the ratio of the volume of models compatible with both the data and the prior to the prior volume. If we assume that each of the K parameters has prior width Δx, then the Occam factor scales as (δx / Δx)^K. As model parameters are added, eventually one fits the data asymptotically well, so that δx attains a maximum value and further model parameters can only decrease the Occam factor. If we increase the flexibility of our model by introducing more model parameters, we reduce the Occam factor.
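The scaling above can be illustrated numerically; all numbers here are invented for illustration:

```python
# Toy illustration with invented numbers: evidence ≈ L_max * (dx / Dx)^K.
# Adding parameters raises the best-fit likelihood at first, but once the
# data are fit asymptotically well, each extra parameter only multiplies
# the evidence by another small Occam factor dx / Dx.

def evidence(L_max, dx, Dx, K):
    """Evidence as the best-fit likelihood times the Occam factor (dx/Dx)^K."""
    return L_max * (dx / Dx) ** K

# Suppose the best-fit likelihood saturates after K = 2 parameters:
L_max_by_order = {1: 0.05, 2: 0.9, 3: 0.9, 4: 0.9}
for K, L_max in L_max_by_order.items():
    print(K, evidence(L_max, dx=0.1, Dx=1.0, K=K))
# The evidence peaks at K = 2 and then falls: Occam's razor in action.
```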


Odds Ratios and Occam Factors. We compute the odds ratio for a model M0 without model parameters against a model M1 with a single model parameter:

  OR = P(d | M0, I) / P(d | M1, I) = [L(M0) / L_max(M1)] × (Δx / δx)

The first factor is the maximum-likelihood ratio; the second is the inverse of the Occam factor. The likelihood ratio is a classical statistic in frequentist model selection. If we consider only the likelihood ratio in model comparison problems, we fail to acknowledge the importance of Occam factors.


Numerical Techniques


Numerical Techniques. There are a wide variety of techniques that can be used to estimate the Bayesian evidence:

- Laplace Approximation
- Importance Sampling
- Path Sampling
- Thermodynamic Integration
- Simulated Annealing
- Annealed Importance Sampling
- Variational Bayes (Ensemble Learning)
- Nested Sampling

Numerical Techniques : Laplace Approximation

The Laplace approximation is a simple and useful method for approximating a unimodal probability density function with a Gaussian. Consider a function p(x) with a peak at x = x0. We write a Taylor series expansion of ln p(x) about x = x0, which can be simplified to

  ln p(x) ≈ ln p(x0) + (1/2) (d²/dx² ln p(x))|_{x=x0} (x − x0)²

since the first derivative vanishes at the peak x0.


Laplace Approximation. By defining

  σ² = −[(d²/dx² ln p(x))|_{x=x0}]⁻¹

we can write

  ln p(x) ≈ ln p(x0) − (x − x0)² / (2σ²)


Laplace Approximation. By taking the exponential, we can approximate the density by

  p(x) ≈ p(x0) exp(−(x − x0)² / (2σ²))

with an integral (evidence) of

  Z ≈ p(x0) √(2πσ²)


Laplace Approximation. The evidence is then

  Z ≈ p(x0) √(2πσ²)

In the case of a multidimensional posterior we have

  Z ≈ p(x0) (2π)^(K/2) (det Σ)^(1/2)

where Σ is the covariance matrix, given by the inverse of the negative Hessian of ln p(x) evaluated at the peak, Σ⁻¹ = −∇∇ ln p(x)|_{x=x0}.
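As a sanity check on the one-dimensional formula, here is a short sketch (the toy density and its constants are our own choice, not from the talk) comparing the Laplace estimate of the evidence with direct numerical integration:

```python
import numpy as np

# Laplace approximation of an unnormalized, unimodal density p(x):
#   Z ≈ p(x0) * sqrt(2*pi*sigma^2),  sigma^2 = -1 / (d^2 ln p / dx^2)|_{x0}
# The toy density below is our own choice (exactly Gaussian, so the
# approximation should match direct numerical integration closely).

def log_p(x):
    return np.log(3.0) - (x - 1.0) ** 2 / 0.5

xs = np.linspace(-4.0, 6.0, 20001)
x0 = xs[np.argmax(log_p(xs))]          # locate the peak on a grid

h = 1e-4                               # finite-difference curvature at x0
d2 = (log_p(x0 + h) - 2.0 * log_p(x0) + log_p(x0 - h)) / h ** 2
sigma2 = -1.0 / d2

Z_laplace = np.exp(log_p(x0)) * np.sqrt(2.0 * np.pi * sigma2)
Z_numeric = float(np.sum(np.exp(log_p(xs))) * (xs[1] - xs[0]))  # Riemann sum
print(Z_laplace, Z_numeric)
```

For a density that is not exactly Gaussian, the two numbers would differ; the gap is a useful diagnostic of how trustworthy the Laplace estimate is.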

Numerical Techniques : Importance Sampling

Importance sampling allows one to find expectation values with respect to one distribution p(x) by computing expectation values with respect to a second distribution q(x) that is easier to sample from. The expectation value of f(x) with respect to p(x) is given by

  ⟨f⟩_p = ∫ dx f(x) p(x)

One can write p(x) as [p(x) / q(x)] q(x) as long as q(x) is nonzero whenever p(x) is nonzero.


Importance Sampling. Writing p(x) as [p(x) / q(x)] q(x), we have:

  ⟨f⟩_p = ∫ dx f(x) [p(x) / q(x)] q(x)

As long as the ratio p(x) / q(x) does not attain extreme values, we can estimate this with samples x_i drawn from q(x):

  ⟨f⟩_p ≈ (1/N) Σ_i f(x_i) [p(x_i) / q(x_i)]
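A minimal sketch of this estimator, using toy target and proposal densities of our own choosing:

```python
import numpy as np

# Importance-sampling sketch with toy densities of our own choosing:
# estimate <f>_p for p = N(0, 1) using samples drawn from q = N(0, 2^2).
rng = np.random.default_rng(0)

def p(x):   # target density, N(0, 1)
    return np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def q(x):   # proposal density, N(0, 4); nonzero wherever p is nonzero
    return np.exp(-x ** 2 / 8.0) / np.sqrt(8.0 * np.pi)

def f(x):
    return x ** 2   # <x^2>_p = 1 for a standard normal

x = rng.normal(0.0, 2.0, 200_000)   # samples from q
w = p(x) / q(x)                     # importance weights (never extreme here)
estimate = float(np.mean(w * f(x)))
print(estimate)   # close to 1
```

The broad proposal keeps the weights p(x)/q(x) bounded; a proposal narrower than the target would produce occasional huge weights and a high-variance estimate.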


Importance Sampling. Importance sampling can be used to compute ratios of evidence values in a similar fashion.


Importance Sampling. The evidence ratio can be found by sampling from q(x), as long as p(x) is sufficiently close to q(x) to avoid extreme ratios of p(x) / q(x).

Numerical Techniques : Variational Bayes

Variational Bayes, also known as ensemble learning, relies on approximating the posterior P(m | d, M, I) with another distribution Q(m). By defining the negative free energy

  F[Q] = ∫ dm Q(m) ln [P(d, m | M, I) / Q(m)]

and the Kullback-Leibler (KL) divergence

  D_KL[Q ∥ P] = ∫ dm Q(m) ln [Q(m) / P(m | d, M, I)]

we can write

  ln P(d | M, I) = F[Q] + D_KL[Q ∥ P]


Variational Bayes. With this expression in hand, and since the KL divergence is non-negative, the negative free energy is a lower bound on the log evidence:

  ln P(d | M, I) ≥ F[Q]

By maximizing the negative free energy (equivalently, minimizing the KL divergence), we can approximate the evidence.
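The bound can be verified on a small discrete toy problem (the joint values below are invented for illustration):

```python
import numpy as np

# Discrete toy model (the joint values are invented): P(d, m) over three
# parameter values m.  log Z = log sum_m P(d, m), and the negative free
# energy F[Q] = sum_m Q(m) log(P(d, m) / Q(m)) satisfies F[Q] <= log Z,
# with equality when Q is the posterior P(m | d) = P(d, m) / Z.
joint = np.array([0.05, 0.30, 0.15])       # P(d, m) for m = 0, 1, 2
log_Z = float(np.log(joint.sum()))

def free_energy(Q):
    return float(np.sum(Q * np.log(joint / Q)))

Q_uniform = np.ones(3) / 3.0
Q_posterior = joint / joint.sum()
print(free_energy(Q_uniform), free_energy(Q_posterior), log_Z)
```

A uniform Q gives a strict lower bound, while setting Q equal to the posterior recovers log Z exactly, which is the sense in which maximizing F approximates the evidence.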


Variational Bayes. By choosing a distribution Q(m) that factorizes into

  Q(m) = Q0(m0) Q1(m1)

where the set of parameters m0 is disjoint from m1, we can maximize the negative free energy and estimate the evidence by optimizing each factor in turn.

Numerical Techniques : Nested Sampling

Nested Sampling was developed by John Skilling to stochastically integrate the posterior probability to obtain the evidence; posterior samples are used to obtain model parameter estimates. Nested sampling aims to estimate the cumulative distribution function of the density of states (DOS), which is the prior probability mass enclosed within a likelihood boundary.


Nested Sampling. Given a likelihood L, one can find the prior mass X(L) of those states whose likelihood is greater than L:

  X(L) = ∫_{L(m) > L} dm P(m | M, I)

[Figure: nested likelihood contours in parameter space]


Nested Sampling. One can then estimate the evidence via stochastic integration using samples distributed according to the prior, with the likelihood integrated over the prior mass:

  Z = ∫₀¹ L(X) dX


Nested Sampling. One begins with a set of N samples drawn from the prior. Use the sample with the lowest likelihood to define an implicit likelihood boundary; discarding it results in an average decrease of the enclosed prior volume by a factor of 1/N. Sample from the prior (uniformly is easiest) within the implicit likelihood boundary to maintain N samples. Keep track of Σ_i L_i (X_{i−1} − X_i) to estimate the evidence Z.
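The steps above can be sketched in a few lines; the toy prior and likelihood here are our own choice, picked so that the constrained-prior draw is trivial:

```python
import math
import random

# Minimal nested-sampling sketch on a toy problem of our choosing:
# prior uniform on [0, 1], likelihood L(x) = exp(-x^2 / (2 * sigma^2)).
# The constrained-prior draw is easy here: L(x) > L(worst) <=> x < worst,
# so we sample uniformly on [0, worst].
random.seed(1)
sigma = 0.1
L = lambda x: math.exp(-x * x / (2.0 * sigma * sigma))

N = 200                                  # number of live samples
live = [random.random() for _ in range(N)]
Z, X_prev = 0.0, 1.0                     # evidence accumulator, prior mass
for i in range(1, 3000):
    worst = max(live)                    # largest x = lowest likelihood here
    X_i = math.exp(-i / N)               # prior volume shrinks by ~1/N per step
    Z += L(worst) * (X_prev - X_i)       # accumulate L_i * (X_{i-1} - X_i)
    X_prev = X_i
    live.remove(worst)
    live.append(random.uniform(0.0, worst))  # fresh draw inside the boundary

Z += X_prev * sum(L(x) for x in live) / N    # remaining live-sample mass
print(Z)   # the exact value for this toy problem is about 0.1253
```

The scatter of the estimate around the true value shrinks like 1/√N in the number of live samples, which is why production runs use more sophisticated constrained sampling rather than this toy shortcut.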


Nested Sampling. Note how the prior volume contracts by a factor of about 1/N at each step. Early steps contribute little to the integral Z since the likelihood is very low; later steps contribute little since the change in prior volume is very small. The steps that contribute most are in the middle of the sequence.


Nested Sampling. Since nested sampling contracts along the prior volume, it is relatively unaffected by local maxima in the evidence (phase transitions). (See Figure A.) Methods based on tempering, such as simulated annealing, follow the slope of the log L curve and, as such, get stuck at phase transitions. (See Figure B.)


Nested Sampling. The great challenge is sampling uniformly (from the prior) within the implicit likelihood boundaries. Several versions of nested sampling now exist:

- MultiNest (developed by Feroz and Hobson): clusters samples (K-means) and fits the clusters with ellipsoids, then samples uniformly from within those ellipsoids. Very fast, with excellent performance on multi-modal distributions. Clustering limits this to tens of parameters, and the ellipsoids may not cover the high-likelihood regions.
- Galilean Monte Carlo (developed by Feroz and Skilling): moves a new sample with momentum, reflecting off of log L boundaries. Excellent at handling ridges, both angled and curved.
- Constrained Hamiltonian Monte Carlo (developed by M. Betancourt): similar to Galilean Monte Carlo.
- Diffusive Nested Sampling (developed by Brewer): allows samples to diffuse to lower-likelihood nested levels and takes a weighted average.
- Nested Sampling with Demons (developed by M. Habeck): utilizes "demon variables" that smooth the constraint boundary and push the samples away from it.


Signal Detection: Brain-Computer Interface / Neuroscience

Applications: Signal Detection

Signal Detection (Mubeen and Knuth)

We consider a practical signal detection problem in which the log odds ratio can be derived analytically. The specific application was originally the detection of evoked brain responses.


The signal-absent case models the recording x in channel m as noise alone. The signal-present case models the recording x in channel m as signal plus noise, where the signal has an amplitude parameter α and can be coupled differently to different detectors (via C).


Considering the Evidence


The odds ratio can be written as the ratio of the two evidences. For the noise-only case, which has no parameters to marginalize over, the evidence is simply the likelihood (Gaussian).


Considering the Evidence


In the signal-plus-noise case, the evidence is obtained by assigning a Gaussian likelihood and a prior for α, and marginalizing over α.


Considering the Evidence


We can then write the evidence in closed form.


Considering the Evidence


If the signal amplitude must be positive, α ∈ [0, +∞), the log odds ratio takes one form; if the amplitude can be positive or negative, α ∈ (−∞, +∞), it takes another.


Considering the Evidence


Look at the expression for the evidence, Eq. (86) in the paper: it contains the cross-correlation term, which is what is typically used for the detection of a target signal in ongoing recordings. The log-OR detection filters incorporate more information, which leads to extra terms that serve to aid in target-signal detection.


Detecting Signals


A. The P300 template target signal. B. An example of three channels (Cz, Pz, Fz) of synthetic ongoing EEG with two P300 target signal events (indicated by the arrows) at an SNR of 5 dB.


Signal Detection Performance


Detection performance is measured by the area under the ROC curve as a function of signal SNR. Both OR techniques outperform cross-correlation!


Sensor Characterization: Robotics / Signal Processing

Applications: Sensor Characterization

Modeling a Robotic Sensor (Malakar, Gladkov, Knuth)

In this project, we aim to model the spatial sensitivity function of a LEGO light sensor for use on a robotic system.


Here the application is to develop a robotic arm that can characterize the white circle by measuring light intensities at various locations. By modeling the light sensor, we aim to increase the robot's performance.


Modeling a Robotic Sensor. The LEGO light sensor was slowly moved over a black-and-white albedo pattern on the surface of a table to obtain calibration data. The sensor orientation was varied as well.



Modeling a Robotic Sensor. Mixture of Gaussians (MoG) models were used; four model orders were tested using nested sampling. The 1-MoG model was slightly favored.


Note the increasing uncertainty as the model becomes more complex. This suggests that the permutation space was not fully explored.


Examining the Sensor Model Performance. Here we show a comparison between the 1-MoG model and the data.



Star System Characterization: Astronomy / Astrophysics

Applications: Star System Characterization

Star System Characterization (Placek and Knuth)

In our DSP paper, we give an example of Bayesian model testing applied to exoplanet characterization. Ben Placek also has a paper and poster here at MaxEnt 2015 on the topic. Here I will apply these model testing concepts to determining the orbital configuration of a triple star system.

[Image: Digital Sky Survey (DSS)]


KIC 5436161: Two Periods. This star exhibits oscillations at two commensurate periods in its light curve: 6.45 days and 0.645 days (a rare 10:1 resonance!). Photometric data obtained from the Kepler mission: (A) Quarter 13 light curve folded on the P1 = 6.45 day period; (B) Quarter 13 light curve folded on the P2 = 0.645 day period; (C) the entire Q13 light curve.


Courtesy of Geoff Marcy and Howard Isaacson

KIC 5436161: Radial Velocity Measurements. Eleven radial velocity measurements were taken over the span of a week. The 6.45 day period is visible, but the 0.645 day period is not.


(A) A hierarchical arrangement (C1 and C2 orbit G with 6.45 day period, and orbit one another with 0.645 day period)

(B) A planetary arrangement (C1 orbits with 6.45 day period, and C2 orbits with 0.645 day period)


KIC 5436161: Models. Two possible models of the system. The main star is a G-star (like our Sun); at least one of the other companions (C1) is an M-dwarf.


KIC 5436161: Results. Testing the hierarchical model against the planetary model using the radial velocity data: the circular hierarchical model has the greatest evidence (by a factor of exp(3.73) ≈ 42).


KIC 5436161. This system is a hierarchical triple system consisting of a G-star with two co-orbiting M-dwarfs in a 1:10 resonance (P1 = 6.45 days, P2 = 0.645 days).


KIC 5436161


Nested Sampling Demo (sans model testing)

Demonstrations: Nested Sampling: Lighthouse Problem

Nested Sampling Demo: The Lighthouse Problem (Gull). Consider a lighthouse located just off of a straight shore that extends a great distance. Imagine that the lighthouse has a laser beam that it fires at random times as it rotates with uniform speed. Along the shore are light detectors that detect laser-beam hits. Based on these data, where is the lighthouse?


The Likelihood Function. It is a useful exercise to derive the likelihood

  p(x | α, β, I) = β / { π [ β² + (α − x)² ] }

via the change of variables

  x = β tan θ + α

where α is the lighthouse's position along the shore and β is its distance from shore. We assign a uniform prior for the location parameters α and β.
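A short sketch of this likelihood (the flash positions below are hypothetical):

```python
import math

# Lighthouse likelihood sketch: a flash from a lighthouse at (alpha, beta)
# lands on shore at x with the Cauchy density derived above.
def flash_density(x, alpha, beta):
    return beta / (math.pi * (beta ** 2 + (alpha - x) ** 2))

def log_likelihood(data, alpha, beta):
    """Log-likelihood of independent flash positions."""
    return sum(math.log(flash_density(x, alpha, beta)) for x in data)

# Hypothetical shore positions of detected flashes:
data = [4.7, -1.2, 3.3, 0.8, 2.1, 5.9, 1.4]
print(log_likelihood(data, alpha=2.0, beta=1.5))
```

Because the Cauchy density has no finite mean, the sample mean of the flash positions is a poor estimator of α, which is part of what makes this a good demonstration problem.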


Nested Sampling Run, using D = 64 data points (recorded flashes) and N = 100 samples. Iteration is halted when Δ log Z < 10⁻⁷.


[Figure: live samples (o) and used samples (+) in x-y position space, with the location of the lighthouse marked]
# Iterations = 1193; log Z = −0.401 ± 0.076; mean(x) = 0.48 ± 0.26; mean(y) = 0.51 ± 0.28


Nested Sampling Run This shows the relationship between log L and log Prior Volume


Nested Sampling Run with a Gaussian Likelihood This shows the relationship between log L and log Prior Volume


Nested Sampling and Phase Transitions

Applications: Nested Sampling and Phase Transitions

Peaks on Peaks Here is a Gaussian Likelihood with a taller peak on the side


Nested Sampling with Phase Transitions Phase Transitions represent local peaks in the evidence


Phase Transition


Acoustic Source Localization: One Detector. Consider an acoustic source localization problem using a single detector. There is a low-frequency (red) and a high-frequency (blue) source. Note how the high-frequency source is found first, inducing a phase transition.


Acoustic Source Localization: Two Detectors In this example, we have two detectors, which allow us to localize the sources to rings. Again, the low frequency source is found first.


Acknowledgements: Michael Habeck, Nabin Malakar, Asim Mubeen, Ben Placek