Expectation Propagation


DESCRIPTION

This is the deck for a Hulu internal machine learning workshop; it introduces the background, theory, and applications of the expectation propagation method.

TRANSCRIPT

Expectation Propagation: Theory and Application

Dong Guo, Research Workshop 2013, Hulu Internal

See more details in:
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/

   

Outline

• Overview
• Background
• Theory
• Applications

OVERVIEW  

Bayesian Paradigm

• Infer the posterior distribution: Prior + Data → Posterior → Make decision

Note: the LDA figure is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.

Bayesian inference methods

• Exact inference
  – Belief propagation

• Approximate inference
  – Stochastic (sampling)
  – Deterministic
    • Assumed density filtering
    • Expectation propagation
    • Variational Bayes

Message passing

• A form of communication used in multiple domains of computer science
  – Parallel computing (MPI)
  – Object-oriented programming
  – Inter-process communication
  – Bayesian inference

• A family of methods to infer posterior distributions

Expectation Propagation

• Belongs to the message passing family

• Approximate method (iteration is needed)

• Very popular in Bayesian inference, especially for graphical models

Researchers  

• Thomas Minka
  – EP was proposed in his PhD thesis

• Kevin P. Murphy
  – Machine Learning: A Probabilistic Perspective

BACKGROUND  

Background  

• (Truncated) Gaussian
• Exponential family
• Graphical model
• Factor graph
• Belief propagation
• Moment matching

Gaussian and Truncated Gaussian

• Gaussian operations are the basis of EP inference
  – Gaussian +, ×, ÷ Gaussian
  – Gaussian integrals

• The truncated Gaussian is used in many EP applications

• See details here (a minimal sketch of the core operations follows below)
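Not part of the original deck: a minimal Python sketch of the 1-D Gaussian product and quotient in natural-parameter form, the two operations EP uses when combining and removing factors. The function names are illustrative only.

```python
# Illustrative sketch: 1-D Gaussians in natural parameters (precision tau = 1/v,
# precision-mean rho = m/v), where multiplication and division of densities
# become addition and subtraction of parameters.

def to_natural(m, v):
    return m / v, 1.0 / v              # (rho, tau)

def to_moments(rho, tau):
    return rho / tau, 1.0 / tau        # (mean, variance)

def gaussian_multiply(m1, v1, m2, v2):
    """Product of two Gaussian densities is (up to a constant) a Gaussian."""
    rho1, tau1 = to_natural(m1, v1)
    rho2, tau2 = to_natural(m2, v2)
    return to_moments(rho1 + rho2, tau1 + tau2)

def gaussian_divide(m1, v1, m2, v2):
    """Quotient of Gaussians, as in EP's 'remove one factor' step; the result
    may have negative precision (an improper Gaussian)."""
    rho1, tau1 = to_natural(m1, v1)
    rho2, tau2 = to_natural(m2, v2)
    return to_moments(rho1 - rho2, tau1 - tau2)

print(gaussian_multiply(0.0, 1.0, 1.0, 2.0))   # N(0,1) * N(1,2) -> (1/3, 2/3)
```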

Exponential family distribution

• Very good summary on Wikipedia
• Sufficient statistics of the Gaussian distribution: (x, x^2) (see the small sketch after this slide)
• Typical distributions:

q(z) = h(z) g(η) exp{η^T u(z)}

Note: the above 4 figures are from Wikipedia.
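An editor-added sketch (not from the slides) showing the 1-D Gaussian written in the exponential-family form above, with sufficient statistics u(z) = (z, z²); the helper names are hypothetical.

```python
import numpy as np

# The 1-D Gaussian N(z | mean, var) in exponential-family form:
#   natural parameters eta = (mean/var, -1/(2*var)), sufficient stats u(z) = (z, z^2),
#   h(z) = 1, and log g(eta) = -mean^2/(2*var) - 0.5*log(2*pi*var).

def gaussian_expfam_pdf(z, mean, var):
    eta = np.array([mean / var, -0.5 / var])
    u = np.array([z, z ** 2])
    log_g = -0.5 * mean ** 2 / var - 0.5 * np.log(2 * np.pi * var)
    return np.exp(log_g + eta @ u)

# Sanity check against the standard density formula
z, mean, var = 0.3, 1.0, 2.0
direct = np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
assert np.isclose(gaussian_expfam_pdf(z, mean, var), direct)
```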

Graphical Models

• Directed graph (Bayesian Network)
• Undirected graph (Conditional Random Field)

P(x) = ∏_{k=1}^{K} p(x_k | pa_k)

[Figures: a directed graph and an undirected graph over nodes x1, x2, x3, x4]

Factor graph

• Expresses the relations between variable nodes explicitly
  – A relation on an edge becomes a factor node

• Hides the difference between BN and CRF during inference
• Makes inference more intuitive (a small structural sketch follows after this slide)

[Figures: the two graphs above redrawn as factor graphs over x1–x4, with factor nodes fa, fc]
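Not from the deck: a hypothetical Python sketch of a minimal factor-graph representation, just variable nodes and factor nodes that track their neighbors; the class names and toy factors are made up for illustration.

```python
# Hypothetical minimal factor-graph structure: variables and factors keep
# references to their neighbors, which is all message passing needs.

class Variable:
    def __init__(self, name):
        self.name = name
        self.factors = []               # neighboring factor nodes

class Factor:
    def __init__(self, name, variables, potential):
        self.name = name
        self.variables = variables      # variable nodes this factor touches
        self.potential = potential      # callable: assignment dict -> nonnegative value
        for v in variables:
            v.factors.append(self)

# Toy example: p(x1, x2) proportional to fa(x1, x2) * fb(x2) over binary variables
x1, x2 = Variable("x1"), Variable("x2")
fa = Factor("fa", [x1, x2], lambda a: 0.9 if a["x1"] == a["x2"] else 0.1)
fb = Factor("fb", [x2], lambda a: 0.7 if a["x2"] == 1 else 0.3)
print([f.name for f in x2.factors])     # -> ['fa', 'fb']
```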

BELIEF PROPAGATION

Belief Propagation Overview

• Exact Bayesian method to infer marginal distributions
  – 'sum-product' message passing

• Key components
  – Calculate the posterior distribution of a variable node
  – Two kinds of messages

Posterior distribution of a variable node

• Factor graph

p(X) = ∏_{s∈ne(x)} F_s(x, X_s), for any variable x in the graph

p(x) = ∑_{X\x} p(X) = ∑_{X\x} ∏_{s∈ne(x)} F_s(x, X_s) = ∏_{s∈ne(x)} ∑_{X_s} F_s(x, X_s) = ∏_{s∈ne(x)} μ_{f_s→x}(x)

in which μ_{f_s→x}(x) = ∑_{X_s} F_s(x, X_s)

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Message: factor -> variable node

• Factor graph

μ_{f_s→x}(x) = ∑_{x_1} … ∑_{x_M} f_s(x, x_1, …, x_M) ∏_{x_m∈ne(f_s)\x} μ_{x_m→f_s}(x_m),

in which {x_1, …, x_M} is the set of variables on which the factor f_s depends

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Message: variable -> factor node

• Factor graph

μ_{x_m→f_s}(x_m) = ∏_{l∈ne(x_m)\f_s} μ_{f_l→x_m}(x_m)

Summary: the posterior distribution is determined only by the factors!

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Whole steps of BP

• Steps to calculate the posterior distribution of a given variable node (a small numeric example follows after this slide)
  – Step 1: construct the factor graph
  – Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
  – Step 3: apply the message passing steps recursively until the root node has received messages from all of its neighbors
  – Step 4: get the marginal distribution by multiplying all incoming messages

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
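An editor-added sketch (not the deck's code) of sum-product message passing on a small discrete chain x1 – f12 – x2 – f23 – x3, computing the marginal of the last variable; the factor values are arbitrary.

```python
import numpy as np

# Sum-product on a chain: each pairwise factor is a matrix F[i, j] with
# i indexing the previous variable and j the next one. Passing a message
# through a factor means multiplying in the incoming message and summing
# out the previous variable.

def chain_marginal(p_first, pairwise_factors):
    msg = p_first                        # message from the leaf variable x1
    for F in pairwise_factors:
        msg = F.T @ msg                  # factor -> next variable message
    return msg / msg.sum()               # normalized marginal of the last variable

p_x1 = np.array([0.5, 0.5])
f12 = np.array([[0.9, 0.1], [0.2, 0.8]])
f23 = np.array([[0.7, 0.3], [0.3, 0.7]])
print(chain_marginal(p_x1, [f12, f23]))  # marginal distribution of x3
```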

BP: example

• Infer the marginal distribution of x_3

• Infer the marginal distribution of every variable

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.

The posterior is sometimes intractable

• Examples
  – Infer the mean of a Gaussian distribution
  – Ad predictor

p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI)

p(θ) = N(θ | 0, bI)

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Distribution Approximation

Approximate p(x) with q(x), which belongs to the exponential family, such that q(x) = h(x) g(η) exp{η^T u(x)}.

KL(p || q) = −∫ p(x) ln (q(x)/p(x)) dx = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx
           = −∫ p(x) ln g(η) dx − ∫ p(x) η^T u(x) dx + const
           = −ln g(η) − η^T E_{p(x)}[u(x)] + const,
where the const terms are independent of the natural parameter η.

Minimize KL(p || q) by setting the gradient with respect to η to zero:
  => −∇ ln g(η) = E_{p(x)}[u(x)]
By leveraging formula (2.226) in PRML:
  => E_{q(x)}[u(x)] = −∇ ln g(η) = E_{p(x)}[u(x)]

Moment matching

• Moments of a distribution (see the numeric sketch after this slide)

It is called moment matching when q(x) is a Gaussian distribution; then u(x) = (x, x²)^T

=> ∫ q(x) x dx = ∫ p(x) x dx, and ∫ q(x) x² dx = ∫ p(x) x² dx
=> mean_{q(x)} = ∫ q(x) x dx = ∫ p(x) x dx = mean_{p(x)},
   variance_{q(x)} = ∫ q(x) x² dx − (mean_{q(x)})² = ∫ p(x) x² dx − (mean_{p(x)})² = variance_{p(x)}

The k-th moment: M_k = ∫_a^b x^k f(x) dx
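As referenced above, an editor-added numeric sketch of moment matching: fit a Gaussian q(x) to a target p(x) by equating the first two moments, computed on a dense grid; the target mixture is arbitrary.

```python
import numpy as np
from scipy.stats import norm

def moment_match(p_unnormalized, grid):
    """Return the (mean, variance) of the Gaussian that matches the first two
    moments of the target density, evaluated numerically on a grid."""
    dx = grid[1] - grid[0]
    w = p_unnormalized(grid)
    w = w / (w.sum() * dx)                 # normalize the target on the grid
    mean = (w * grid).sum() * dx           # E_p[x]
    second = (w * grid ** 2).sum() * dx    # E_p[x^2]
    return mean, second - mean ** 2

# Target: a two-component Gaussian mixture, the kind of factor EP approximates
p = lambda x: 0.7 * norm.pdf(x, -1.0, 0.5) + 0.3 * norm.pdf(x, 2.0, 1.0)
grid = np.linspace(-10.0, 10.0, 20001)
print(moment_match(p, grid))               # q = N(m, v) minimizing KL(p || q)
```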

EXPECTATION PROPAGATION = Belief Propagation + Moment Matching?

Key Idea

• Approximate each factor with a Gaussian distribution

• Approximate the corresponding factor pairs one by one?

• Approximate each factor in turn in the context of all the remaining factors (proposed by Minka)

Refine the approximate factor f̃_j(θ) by ensuring that q^new(θ) ∝ f̃_j(θ) q^{\j}(θ) is close to f_j(θ) q^{\j}(θ),

in which q^{\j}(θ) = q(θ) / f̃_j(θ)

EP: The detailed steps

1. Initialize all of the approximating factors f̃_i(θ).

2. Initialize the posterior approximation by setting q(θ) ∝ ∏_i f̃_i(θ).

3. Until convergence:

   (a) Choose a factor f̃_j(θ) to refine.

   (b) Remove f̃_j(θ) from the posterior by division: q^{\j}(θ) = q(θ) / f̃_j(θ).

   (c) Get the new posterior by setting the sufficient statistics (moments) of q^new(θ) equal to those of f_j(θ) q^{\j}(θ) / Z_j
       (i.e. minimize KL( f_j(θ) q^{\j}(θ) / Z_j || q^new(θ) )), in which Z_j = ∫ f_j(θ) q^{\j}(θ) dθ.

   (d) Get the refined factor f̃_j(θ) = K q^new(θ) / q^{\j}(θ), so that q^new(θ) = (1/K) f̃_j(θ) q^{\j}(θ).

Example: The clutter problem

• Infer the mean θ of a Gaussian distribution
• Want to try MLE, but the likelihood below is a Gaussian mixture, so exact inference is intractable

• Approximate with
  – a Gaussian approximation of each mixture factor (a numeric sketch follows after this slide)

p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI)

p(θ) = N(θ | 0, bI)

q(θ) = N(θ | m, vI), and each approximate factor f̃_n(θ) = N(θ | m_n, v_n I)

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
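Not part of the original slides: a hedged numerical sketch of EP for the 1-D clutter problem defined above, following steps (a)-(d) from the algorithm slide. The tilted-distribution moments are computed on a grid rather than in closed form, and the toy data and constants are made up.

```python
import numpy as np
from scipy.stats import norm

# 1-D clutter problem: x_n ~ (1-w) N(theta, 1) + w N(0, a), prior theta ~ N(0, b).
# q(theta) = N(m, v) is tracked in natural parameters (tau = 1/v, rho = m/v),
# and so is each approximate site factor. No guard against negative cavity
# precisions is included; this is only a sketch.

w, a, b = 0.5, 10.0, 100.0
x = np.array([0.8, 1.3, -0.2, 2.1, 0.9])          # toy observations
grid = np.linspace(-40.0, 40.0, 80001)
dx = grid[1] - grid[0]

def likelihood(xn, theta):
    return (1 - w) * norm.pdf(xn, theta, 1.0) + w * norm.pdf(xn, 0.0, np.sqrt(a))

tau_site = np.zeros(len(x))       # site precisions (start uninformative)
rho_site = np.zeros(len(x))       # site precision-means
tau_q, rho_q = 1.0 / b, 0.0       # posterior starts at the prior

for sweep in range(20):
    for n in range(len(x)):
        # (b) remove site n -> cavity distribution q^{\n}
        tau_cav, rho_cav = tau_q - tau_site[n], rho_q - rho_site[n]
        m_cav, v_cav = rho_cav / tau_cav, 1.0 / tau_cav
        # (c) moment-match q^new to the tilted distribution f_n(theta) * q^{\n}(theta)
        tilted = likelihood(x[n], grid) * norm.pdf(grid, m_cav, np.sqrt(v_cav))
        Z = tilted.sum() * dx
        m_new = (tilted * grid).sum() * dx / Z
        v_new = (tilted * grid ** 2).sum() * dx / Z - m_new ** 2
        tau_q, rho_q = 1.0 / v_new, m_new / v_new
        # (d) refined site = q^new / cavity (division in natural parameters)
        tau_site[n], rho_site[n] = tau_q - tau_cav, rho_q - rho_cav

print("q(theta): mean %.3f, variance %.3f" % (rho_q / tau_q, 1.0 / tau_q))
```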

Example: The clutter problem (2)

• Approximate a complex factor (e.g. a Gaussian mixture) with a Gaussian

f_n(θ) in blue, f̃_n(θ) in red, and q^{\n}(θ) in green. Remember that the variance of q^{\n}(θ) is usually very small, so f̃_n(θ) only needs to approximate f_n(θ) over a small range.

Note: the above 2 figures are from the book 'Pattern Recognition and Machine Learning'.

Application: Bayesian CTR predictor for Bing

• See the details here
  – Inference step by step
  – Make predictions

• Some insights
  – The variance of each feature's weight decreases after every exposure

  – A sample with more features will have a bigger variance

• Independence assumption for the features (a simplified update sketch follows below)
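An editor-added, simplified sketch of an adPredictor-style online Bayesian probit update, based on the Bing CTR paper cited in the references rather than any Hulu code; the class name, parameters, and update form are a reconstruction and should be checked against the paper.

```python
import numpy as np
from scipy.stats import norm

class AdPredictorSketch:
    """Each binary feature i has an independent Gaussian weight N(mu[i], var[i])."""

    def __init__(self, n_features, beta=1.0, prior_var=1.0):
        self.beta = beta                        # performance noise
        self.mu = np.zeros(n_features)
        self.var = np.full(n_features, prior_var)

    def predict(self, active):
        """P(click) for an impression with active feature indices `active`."""
        s = np.sqrt(self.beta ** 2 + self.var[active].sum())
        return norm.cdf(self.mu[active].sum() / s)

    def update(self, active, y):
        """One ADF/EP-style update; y is +1 (click) or -1 (no click)."""
        total_var = self.beta ** 2 + self.var[active].sum()
        s = np.sqrt(total_var)
        t = y * self.mu[active].sum() / s
        v = norm.pdf(t) / norm.cdf(t)           # truncated-Gaussian correction terms
        w = v * (v + t)
        self.mu[active] += y * (self.var[active] / s) * v
        self.var[active] *= 1.0 - (self.var[active] / total_var) * w

model = AdPredictorSketch(n_features=10)
model.update([0, 3, 7], y=+1)                   # one clicked impression
print(model.predict([0, 3, 7]))
```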

Experimentation

• The dataset is very inhomogeneous

• Performance
  – Other metrics

• Pros: speed, low parameter-tuning cost, online learning support, interpretability, support for adding more factors
• Cons: sparsity
• Code

Model    FTRL     OWLQN    Ad Predictor
AUC      0.638    0.641    0.639

Application: XBOX skill rating system

• See details on pp. 793–798 of Machine Learning: A Probabilistic Perspective (a simplified update sketch follows below)

Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
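Not from the deck: a simplified sketch of a two-player, no-draw TrueSkill-style update, using the same truncated-Gaussian correction terms as the CTR sketch above; the default prior and beta follow the cited TrueSkill paper, but treat the details as an assumption.

```python
import numpy as np
from scipy.stats import norm

def trueskill_update(winner, loser, beta=25.0 / 6):
    """Update (mu, sigma) skill beliefs after a win/loss game with no draw."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = np.sqrt(2 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v = norm.pdf(t) / norm.cdf(t)       # truncated-Gaussian moment corrections
    w = v * (v + t)
    mu_w += sig_w ** 2 / c * v          # winner's mean goes up
    mu_l -= sig_l ** 2 / c * v          # loser's mean goes down
    sig_w *= np.sqrt(1 - sig_w ** 2 / c ** 2 * w)   # both players get more certain
    sig_l *= np.sqrt(1 - sig_l ** 2 / c ** 2 * w)
    return (mu_w, sig_w), (mu_l, sig_l)

# Both players start at the default prior N(25, (25/3)^2); player A beats player B.
print(trueskill_update((25.0, 25.0 / 3), (25.0, 25.0 / 3)))
```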

Apply to all Bayesian models

• Infer.NET (Microsoft/Bishop)
  – A framework for running Bayesian inference in graphical models

  – Model-based machine learning

References

• Books
  – Chapters 2, 8, and 10 of Pattern Recognition and Machine Learning
  – Chapter 22 of Machine Learning: A Probabilistic Perspective

• Papers
  – A Family of Algorithms for Approximate Bayesian Inference
  – From Belief Propagation to Expectation Propagation
  – TrueSkill: A Bayesian Skill Rating System
  – Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine

• Roadmap for EP
