Expectation Propagation


DESCRIPTION

This is the deck for a Hulu internal machine learning workshop; it introduces the background, theory, and applications of the expectation propagation method.

TRANSCRIPT

Expectation Propagation: Theory and Application

Dong Guo, Research Workshop 2013, Hulu Internal

See more details in:
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/

   

Outline

• Overview
• Background
• Theory
• Applications

OVERVIEW  

Bayesian Paradigm

• Infer the posterior distribution: Prior + Data → Posterior → Make decision

Note: the LDA figure is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.

Bayesian inference methods

• Exact inference
  – Belief propagation

• Approximate inference
  – Stochastic (sampling)
  – Deterministic
    • Assumed density filtering
    • Expectation propagation
    • Variational Bayes

Message passing

• A form of communication used in multiple domains of computer science
  – Parallel computing (MPI)
  – Object-oriented programming
  – Inter-process communication
  – Bayesian inference

• A family of methods to infer posterior distributions

Expectation Propagation

• Belongs to the message passing family

• Approximate method (iteration is needed)

• Very popular in Bayesian inference, especially for graphical models

Researchers  

• Thomas Minka
  – EP was proposed in his PhD thesis

• Kevin P. Murphy
  – Machine Learning: A Probabilistic Perspective

BACKGROUND  

Background  

• (Truncated) Gaussian
• Exponential family
• Graphical model
• Factor graph
• Belief propagation
• Moment matching

Gaussian and Truncated Gaussian

• Gaussian operations are the basis of EP inference
  – Gaussian +, ×, ÷ Gaussian
  – Gaussian integrals

• The truncated Gaussian is used in many EP applications

• See details here (a minimal sketch of the core operations follows below)
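Not part of the original deck: a minimal Python sketch of the 1-D Gaussian product and quotient in natural-parameter form, the two operations EP uses when combining and removing factors. The function names are illustrative only.

```python
# Illustrative sketch: 1-D Gaussians in natural parameters (precision tau = 1/v,
# precision-mean rho = m/v), where multiplication and division of densities
# become addition and subtraction of parameters.

def to_natural(m, v):
    return m / v, 1.0 / v              # (rho, tau)

def to_moments(rho, tau):
    return rho / tau, 1.0 / tau        # (mean, variance)

def gaussian_multiply(m1, v1, m2, v2):
    """Product of two Gaussian densities is (up to a constant) a Gaussian."""
    rho1, tau1 = to_natural(m1, v1)
    rho2, tau2 = to_natural(m2, v2)
    return to_moments(rho1 + rho2, tau1 + tau2)

def gaussian_divide(m1, v1, m2, v2):
    """Quotient of Gaussians, as in EP's 'remove one factor' step; the result
    may have negative precision (an improper Gaussian)."""
    rho1, tau1 = to_natural(m1, v1)
    rho2, tau2 = to_natural(m2, v2)
    return to_moments(rho1 - rho2, tau1 - tau2)

print(gaussian_multiply(0.0, 1.0, 1.0, 2.0))   # N(0,1) * N(1,2) -> (1/3, 2/3)
```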

Exponential family distribution

• Very good summary on Wikipedia
• Sufficient statistics of the Gaussian distribution: (x, x^2) (see the small sketch after this slide)
• Typical distributions:

q(z) = h(z) g(η) exp{η^T u(z)}

Note: the above 4 figures are from Wikipedia.
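An editor-added sketch (not from the slides) showing the 1-D Gaussian written in the exponential-family form above, with sufficient statistics u(z) = (z, z²); the helper names are hypothetical.

```python
import numpy as np

# The 1-D Gaussian N(z | mean, var) in exponential-family form:
#   natural parameters eta = (mean/var, -1/(2*var)), sufficient stats u(z) = (z, z^2),
#   h(z) = 1, and log g(eta) = -mean^2/(2*var) - 0.5*log(2*pi*var).

def gaussian_expfam_pdf(z, mean, var):
    eta = np.array([mean / var, -0.5 / var])
    u = np.array([z, z ** 2])
    log_g = -0.5 * mean ** 2 / var - 0.5 * np.log(2 * np.pi * var)
    return np.exp(log_g + eta @ u)

# Sanity check against the standard density formula
z, mean, var = 0.3, 1.0, 2.0
direct = np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
assert np.isclose(gaussian_expfam_pdf(z, mean, var), direct)
```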

Graphical Models

• Directed graph (Bayesian Network)
• Undirected graph (Conditional Random Field)

P(x) = ∏_{k=1}^{K} p(x_k | pa_k)

[Figures: a directed graph and an undirected graph over nodes x1, x2, x3, x4]

Factor graph

• Expresses the relations between variable nodes explicitly
  – A relation on an edge becomes a factor node

• Hides the difference between BN and CRF during inference
• Makes inference more intuitive (a small structural sketch follows after this slide)

[Figures: the two graphs above redrawn as factor graphs over x1–x4, with factor nodes fa, fc]
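Not from the deck: a hypothetical Python sketch of a minimal factor-graph representation, just variable nodes and factor nodes that track their neighbors; the class names and toy factors are made up for illustration.

```python
# Hypothetical minimal factor-graph structure: variables and factors keep
# references to their neighbors, which is all message passing needs.

class Variable:
    def __init__(self, name):
        self.name = name
        self.factors = []               # neighboring factor nodes

class Factor:
    def __init__(self, name, variables, potential):
        self.name = name
        self.variables = variables      # variable nodes this factor touches
        self.potential = potential      # callable: assignment dict -> nonnegative value
        for v in variables:
            v.factors.append(self)

# Toy example: p(x1, x2) proportional to fa(x1, x2) * fb(x2) over binary variables
x1, x2 = Variable("x1"), Variable("x2")
fa = Factor("fa", [x1, x2], lambda a: 0.9 if a["x1"] == a["x2"] else 0.1)
fb = Factor("fb", [x2], lambda a: 0.7 if a["x2"] == 1 else 0.3)
print([f.name for f in x2.factors])     # -> ['fa', 'fb']
```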

BELIEF PROPAGATION

Belief Propagation Overview

• Exact Bayesian method to infer marginal distributions
  – 'sum-product' message passing

• Key components
  – Calculate the posterior distribution of a variable node
  – Two kinds of messages

Posterior distribution of a variable node

• Factor graph

p(X) = ∏_{s∈ne(x)} F_s(x, X_s), for any variable x in the graph

p(x) = ∑_{X\x} p(X) = ∑_{X\x} ∏_{s∈ne(x)} F_s(x, X_s) = ∏_{s∈ne(x)} ∑_{X_s} F_s(x, X_s) = ∏_{s∈ne(x)} μ_{f_s→x}(x)

in which μ_{f_s→x}(x) = ∑_{X_s} F_s(x, X_s)

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Message: factor -> variable node

• Factor graph

μ_{f_s→x}(x) = ∑_{x_1} … ∑_{x_M} f_s(x, x_1, …, x_M) ∏_{x_m∈ne(f_s)\x} μ_{x_m→f_s}(x_m),

in which {x_1, …, x_M} is the set of variables on which the factor f_s depends

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Message: variable -> factor node

• Factor graph

μ_{x_m→f_s}(x_m) = ∏_{l∈ne(x_m)\f_s} μ_{f_l→x_m}(x_m)

Summary: the posterior distribution is determined only by the factors!

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Whole steps of BP

• Steps to calculate the posterior distribution of a given variable node (a small numeric example follows after this slide)
  – Step 1: construct the factor graph
  – Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
  – Step 3: apply the message passing steps recursively until the root node has received messages from all of its neighbors
  – Step 4: get the marginal distribution by multiplying all incoming messages

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
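An editor-added sketch (not the deck's code) of sum-product message passing on a small discrete chain x1 – f12 – x2 – f23 – x3, computing the marginal of the last variable; the factor values are arbitrary.

```python
import numpy as np

# Sum-product on a chain: each pairwise factor is a matrix F[i, j] with
# i indexing the previous variable and j the next one. Passing a message
# through a factor means multiplying in the incoming message and summing
# out the previous variable.

def chain_marginal(p_first, pairwise_factors):
    msg = p_first                        # message from the leaf variable x1
    for F in pairwise_factors:
        msg = F.T @ msg                  # factor -> next variable message
    return msg / msg.sum()               # normalized marginal of the last variable

p_x1 = np.array([0.5, 0.5])
f12 = np.array([[0.9, 0.1], [0.2, 0.8]])
f23 = np.array([[0.7, 0.3], [0.3, 0.7]])
print(chain_marginal(p_x1, [f12, f23]))  # marginal distribution of x3
```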

BP: example

• Infer the marginal distribution of x_3

• Infer the marginal distribution of every variable

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.

The posterior is sometimes intractable

• Examples
  – Infer the mean of a Gaussian distribution
  – Ad predictor

p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI)

p(θ) = N(θ | 0, bI)

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Distribution Approximation

Approximate p(x) with q(x), which belongs to the exponential family, such that q(x) = h(x) g(η) exp{η^T u(x)}.

KL(p || q) = −∫ p(x) ln (q(x)/p(x)) dx = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx
           = −∫ p(x) ln g(η) dx − ∫ p(x) η^T u(x) dx + const
           = −ln g(η) − η^T E_{p(x)}[u(x)] + const,
where the const terms are independent of the natural parameter η.

Minimize KL(p || q) by setting the gradient with respect to η to zero:
  => −∇ ln g(η) = E_{p(x)}[u(x)]
By leveraging formula (2.226) in PRML:
  => E_{q(x)}[u(x)] = −∇ ln g(η) = E_{p(x)}[u(x)]

Moment matching

• Moments of a distribution (see the numeric sketch after this slide)

It is called moment matching when q(x) is a Gaussian distribution; then u(x) = (x, x²)^T

=> ∫ q(x) x dx = ∫ p(x) x dx, and ∫ q(x) x² dx = ∫ p(x) x² dx
=> mean_{q(x)} = ∫ q(x) x dx = ∫ p(x) x dx = mean_{p(x)},
   variance_{q(x)} = ∫ q(x) x² dx − (mean_{q(x)})² = ∫ p(x) x² dx − (mean_{p(x)})² = variance_{p(x)}

The k-th moment: M_k = ∫_a^b x^k f(x) dx
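As referenced above, an editor-added numeric sketch of moment matching: fit a Gaussian q(x) to a target p(x) by equating the first two moments, computed on a dense grid; the target mixture is arbitrary.

```python
import numpy as np
from scipy.stats import norm

def moment_match(p_unnormalized, grid):
    """Return the (mean, variance) of the Gaussian that matches the first two
    moments of the target density, evaluated numerically on a grid."""
    dx = grid[1] - grid[0]
    w = p_unnormalized(grid)
    w = w / (w.sum() * dx)                 # normalize the target on the grid
    mean = (w * grid).sum() * dx           # E_p[x]
    second = (w * grid ** 2).sum() * dx    # E_p[x^2]
    return mean, second - mean ** 2

# Target: a two-component Gaussian mixture, the kind of factor EP approximates
p = lambda x: 0.7 * norm.pdf(x, -1.0, 0.5) + 0.3 * norm.pdf(x, 2.0, 1.0)
grid = np.linspace(-10.0, 10.0, 20001)
print(moment_match(p, grid))               # q = N(m, v) minimizing KL(p || q)
```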

EXPECTATION PROPAGATION = Belief Propagation + Moment Matching?

Key Idea

• Approximate each factor with a Gaussian distribution

• Approximate the corresponding factor pairs one by one?

• Approximate each factor in turn in the context of all the remaining factors (proposed by Minka)

Refine the approximate factor f̃_j(θ) by ensuring that q^new(θ) ∝ f̃_j(θ) q^{\j}(θ) is close to f_j(θ) q^{\j}(θ),

in which q^{\j}(θ) = q(θ) / f̃_j(θ)

EP: The detailed steps

1. Initialize all of the approximating factors f̃_i(θ).

2. Initialize the posterior approximation by setting q(θ) ∝ ∏_i f̃_i(θ).

3. Until convergence:

   (a) Choose a factor f̃_j(θ) to refine.

   (b) Remove f̃_j(θ) from the posterior by division: q^{\j}(θ) = q(θ) / f̃_j(θ).

   (c) Get the new posterior by setting the sufficient statistics (moments) of q^new(θ) equal to those of f_j(θ) q^{\j}(θ) / Z_j
       (i.e. minimize KL( f_j(θ) q^{\j}(θ) / Z_j || q^new(θ) )), in which Z_j = ∫ f_j(θ) q^{\j}(θ) dθ.

   (d) Get the refined factor f̃_j(θ) = K q^new(θ) / q^{\j}(θ), so that q^new(θ) = (1/K) f̃_j(θ) q^{\j}(θ).

Example: The clutter problem

• Infer the mean θ of a Gaussian distribution
• Want to try MLE, but the likelihood below is a Gaussian mixture, so exact inference is intractable

• Approximate with
  – a Gaussian approximation of each mixture factor (a numeric sketch follows after this slide)

p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI)

p(θ) = N(θ | 0, bI)

q(θ) = N(θ | m, vI), and each approximate factor f̃_n(θ) = N(θ | m_n, v_n I)

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
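Not part of the original slides: a hedged numerical sketch of EP for the 1-D clutter problem defined above, following steps (a)-(d) from the algorithm slide. The tilted-distribution moments are computed on a grid rather than in closed form, and the toy data and constants are made up.

```python
import numpy as np
from scipy.stats import norm

# 1-D clutter problem: x_n ~ (1-w) N(theta, 1) + w N(0, a), prior theta ~ N(0, b).
# q(theta) = N(m, v) is tracked in natural parameters (tau = 1/v, rho = m/v),
# and so is each approximate site factor. No guard against negative cavity
# precisions is included; this is only a sketch.

w, a, b = 0.5, 10.0, 100.0
x = np.array([0.8, 1.3, -0.2, 2.1, 0.9])          # toy observations
grid = np.linspace(-40.0, 40.0, 80001)
dx = grid[1] - grid[0]

def likelihood(xn, theta):
    return (1 - w) * norm.pdf(xn, theta, 1.0) + w * norm.pdf(xn, 0.0, np.sqrt(a))

tau_site = np.zeros(len(x))       # site precisions (start uninformative)
rho_site = np.zeros(len(x))       # site precision-means
tau_q, rho_q = 1.0 / b, 0.0       # posterior starts at the prior

for sweep in range(20):
    for n in range(len(x)):
        # (b) remove site n -> cavity distribution q^{\n}
        tau_cav, rho_cav = tau_q - tau_site[n], rho_q - rho_site[n]
        m_cav, v_cav = rho_cav / tau_cav, 1.0 / tau_cav
        # (c) moment-match q^new to the tilted distribution f_n(theta) * q^{\n}(theta)
        tilted = likelihood(x[n], grid) * norm.pdf(grid, m_cav, np.sqrt(v_cav))
        Z = tilted.sum() * dx
        m_new = (tilted * grid).sum() * dx / Z
        v_new = (tilted * grid ** 2).sum() * dx / Z - m_new ** 2
        tau_q, rho_q = 1.0 / v_new, m_new / v_new
        # (d) refined site = q^new / cavity (division in natural parameters)
        tau_site[n], rho_site[n] = tau_q - tau_cav, rho_q - rho_cav

print("q(theta): mean %.3f, variance %.3f" % (rho_q / tau_q, 1.0 / tau_q))
```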

Example: The clutter problem (2)

• Approximate a complex factor (e.g. a Gaussian mixture) with a Gaussian

f_n(θ) in blue, f̃_n(θ) in red, and q^{\n}(θ) in green. Remember that the variance of q^{\n}(θ) is usually very small, so f̃_n(θ) only needs to approximate f_n(θ) over a small range.

Note: the above 2 figures are from the book 'Pattern Recognition and Machine Learning'.

Application: Bayesian CTR predictor for Bing

• See the details here
  – Inference step by step
  – Make predictions

• Some insights
  – The variance of each feature's weight decreases after every exposure

  – A sample with more features will have a bigger variance

• Independence assumption for the features (a simplified update sketch follows below)
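An editor-added, simplified sketch of an adPredictor-style online Bayesian probit update, based on the Bing CTR paper cited in the references rather than any Hulu code; the class name, parameters, and update form are a reconstruction and should be checked against the paper.

```python
import numpy as np
from scipy.stats import norm

class AdPredictorSketch:
    """Each binary feature i has an independent Gaussian weight N(mu[i], var[i])."""

    def __init__(self, n_features, beta=1.0, prior_var=1.0):
        self.beta = beta                        # performance noise
        self.mu = np.zeros(n_features)
        self.var = np.full(n_features, prior_var)

    def predict(self, active):
        """P(click) for an impression with active feature indices `active`."""
        s = np.sqrt(self.beta ** 2 + self.var[active].sum())
        return norm.cdf(self.mu[active].sum() / s)

    def update(self, active, y):
        """One ADF/EP-style update; y is +1 (click) or -1 (no click)."""
        total_var = self.beta ** 2 + self.var[active].sum()
        s = np.sqrt(total_var)
        t = y * self.mu[active].sum() / s
        v = norm.pdf(t) / norm.cdf(t)           # truncated-Gaussian correction terms
        w = v * (v + t)
        self.mu[active] += y * (self.var[active] / s) * v
        self.var[active] *= 1.0 - (self.var[active] / total_var) * w

model = AdPredictorSketch(n_features=10)
model.update([0, 3, 7], y=+1)                   # one clicked impression
print(model.predict([0, 3, 7]))
```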

Experimentation

• The dataset is very inhomogeneous

• Performance
  – Other metrics

• Pros: speed, low parameter-tuning cost, online learning support, interpretability, support for adding more factors
• Cons: sparsity
• Code

Model    FTRL     OWLQN    Ad Predictor
AUC      0.638    0.641    0.639

Application: XBOX skill rating system

• See details on pp. 793–798 of Machine Learning: A Probabilistic Perspective (a simplified update sketch follows below)

Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
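Not from the deck: a simplified sketch of a two-player, no-draw TrueSkill-style update, using the same truncated-Gaussian correction terms as the CTR sketch above; the default prior and beta follow the cited TrueSkill paper, but treat the details as an assumption.

```python
import numpy as np
from scipy.stats import norm

def trueskill_update(winner, loser, beta=25.0 / 6):
    """Update (mu, sigma) skill beliefs after a win/loss game with no draw."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = np.sqrt(2 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v = norm.pdf(t) / norm.cdf(t)       # truncated-Gaussian moment corrections
    w = v * (v + t)
    mu_w += sig_w ** 2 / c * v          # winner's mean goes up
    mu_l -= sig_l ** 2 / c * v          # loser's mean goes down
    sig_w *= np.sqrt(1 - sig_w ** 2 / c ** 2 * w)   # both players get more certain
    sig_l *= np.sqrt(1 - sig_l ** 2 / c ** 2 * w)
    return (mu_w, sig_w), (mu_l, sig_l)

# Both players start at the default prior N(25, (25/3)^2); player A beats player B.
print(trueskill_update((25.0, 25.0 / 3), (25.0, 25.0 / 3)))
```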

Apply to all Bayesian models

• Infer.NET (Microsoft/Bishop)
  – A framework for running Bayesian inference in graphical models

  – Model-based machine learning

References

• Books
  – Chapters 2, 8, and 10 of Pattern Recognition and Machine Learning
  – Chapter 22 of Machine Learning: A Probabilistic Perspective

• Papers
  – A Family of Algorithms for Approximate Bayesian Inference
  – From Belief Propagation to Expectation Propagation
  – TrueSkill: A Bayesian Skill Rating System
  – Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine

• Roadmap for EP
