


Heterogeneous Multi-output Gaussian Process Prediction
Pablo Moreno-Muñoz¹, Antonio Artés-Rodríguez¹, Mauricio A. Álvarez²

¹Universidad Carlos III de Madrid, Spain · ²University of Sheffield, UK
{pmoreno, antonio}@tsc.uc3m.es, mauricio.alvarez@sheffield.ac.uk

Introduction
A novel extension of multi-output Gaussian processes (MOGPs) for handling heterogeneous outputs (binary, real, categorical, ...). Each output has its own likelihood distribution, and we use a MOGP prior to jointly model the parameters of all likelihoods as latent functions. We obtain tractable variational bounds amenable to stochastic variational inference (SVI).

Multi-output GPs
We use a linear-model-of-coregionalisation covariance function to express correlations between the latent parameter functions f_{d,j}(x) (LPFs).

Each LPF is a linear combination of independent latent functions U = {u_q(x)}_{q=1}^{Q}. Each u_q(x) is assumed to be drawn from a GP prior, u_q(·) ~ GP(0, k_q(·, ·)), where k_q can be any valid covariance function.

f_{d,j}(x) = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} a_{d,j,q}^{i} \, u_q^{i}(x),

We assume that R_q = 1, meaning that the coregionalisation matrices are rank one. In the literature, this model is known as the semiparametric latent factor model.
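As an illustration of the construction above, the rank-one (R_q = 1) linear combination can be simulated by sampling each u_q(x) from its GP prior and mixing with scalar weights a_{d,j,q}. This is a toy sketch, not the HetMOGP implementation; the RBF kernel, sizes, and variable names are assumptions.

```python
import numpy as np

def rbf(X1, X2, lengthscale=0.2):
    # Squared-exponential covariance k_q(x, x')
    d2 = (X1[:, None, 0] - X2[None, :, 0]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)[:, None]
Q = 2                      # number of shared latent functions u_q
D, J = 2, 2                # outputs and parameter functions per output

# Draw each u_q(x) from its GP prior, u_q ~ N(0, K_q)
K = [rbf(X, X) + 1e-8 * np.eye(len(X)) for _ in range(Q)]
U = np.stack([rng.multivariate_normal(np.zeros(len(X)), K[q]) for q in range(Q)])

# Rank-one coregionalisation: one scalar weight a_{d,j,q} per (output, parameter, q)
A = rng.standard_normal((D, J, Q))
F = np.einsum('djq,qn->djn', A, U)   # f_{d,j}(x) = sum_q a_{d,j,q} u_q(x)

print(F.shape)  # (2, 2, 50): D x J_d x N latent parameter functions
```

Because the mixing is linear, every f_{d,j} inherits smoothness from the shared u_q, and outputs become correlated through the shared latent functions.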

Heterogeneous Likelihood Model
Consider a set of output functions Y = {y_d(x)}_{d=1}^{D}, with x ∈ R^p, that we want to model jointly using GPs. Let y(x) = [y_1(x), y_2(x), ..., y_D(x)]^T be a vector-valued function. If the outputs are conditionally independent given the vector of parameters θ(x) = [θ_1(x), θ_2(x), ..., θ_D(x)]^T, we may define

p(y(x) | θ(x)) = p(y(x) | f(x)) = \prod_{d=1}^{D} p(y_d(x) | θ_d(x)) = \prod_{d=1}^{D} p(y_d(x) | \tilde{f}_d(x)),

where \tilde{f}_d(x) = [f_{d,1}(x), ..., f_{d,J_d}(x)]^T ∈ R^{J_d × 1} is the set of LPFs that specify the parameters in θ_d(x), for an arbitrary number D of likelihood functions.
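The factorised heterogeneous likelihood can be sketched for D = 2 outputs, one heteroscedastic Gaussian (two LPFs) and one Bernoulli (one LPF). This is a minimal illustration under assumed data; the function names are hypothetical, not the package API.

```python
import numpy as np

def log_lik_gaussian(y, f_mu, f_logvar):
    # Heteroscedastic Gaussian: mu(x) = f1, sigma^2(x) = exp(f2)
    var = np.exp(f_logvar)
    return -0.5 * (np.log(2 * np.pi * var) + (y - f_mu) ** 2 / var)

def log_lik_bernoulli(y, f):
    # Bernoulli with logistic link: rho(x) = 1 / (1 + exp(-f))
    return y * f - np.log1p(np.exp(f))

def heterogeneous_log_lik(y_list, f_list):
    # Conditional independence across outputs given their LPFs:
    # log p(y | f) = sum_d log p(y_d | f~_d)
    ll_gauss = log_lik_gaussian(y_list[0], *f_list[0])
    ll_bern = log_lik_bernoulli(y_list[1], f_list[1][0])
    return ll_gauss.sum() + ll_bern.sum()

y = [np.array([0.1, -0.3]), np.array([1.0, 0.0])]
f = [(np.zeros(2), np.zeros(2)), (np.zeros(2),)]
print(round(heterogeneous_log_lik(y, f), 3))  # -3.274
```

Each output contributes its own log-likelihood term, so mixing likelihood families only changes the per-output factors, not the overall factorisation.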

Variational Bounds
Sparse approximations in MOGPs: We define the set of M inducing variables per latent function u_q(x) as u_q = [u_q(z_1), ..., u_q(z_M)]^T, evaluated at a set of inducing inputs Z = {z_m}_{m=1}^{M} ∈ R^{M×p}. We also define u = [u_1^T, ..., u_Q^T]^T ∈ R^{QM×1}. We approximate the posterior p(f, u | y, X) as follows:

p(f, u | y, X) ≈ q(f, u) = p(f | u) q(u) = \prod_{d=1}^{D} \prod_{j=1}^{J_d} p(f_{d,j} | u) \prod_{q=1}^{Q} q(u_q).
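The inducing-point machinery can be illustrated with the standard sparse-GP conditional mean, E[f | u] = K_{xz} K_{zz}^{-1} u. This is a toy numpy sketch under an assumed RBF kernel, sampling u from its prior purely for illustration; it is not the package code.

```python
import numpy as np

def rbf(X1, X2, ls=0.3):
    # Squared-exponential covariance
    d2 = (X1[:, None, 0] - X2[None, :, 0]) ** 2
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 100)[:, None]   # training inputs
Z = np.linspace(0, 1, 10)[:, None]    # M = 10 inducing inputs z_m

Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
Kxz = rbf(X, Z)

# Sample u_q (here from its prior, for illustration) and propagate it
# through the conditional mean of p(f | u): E[f | u] = Kxz Kzz^{-1} u
u = rng.multivariate_normal(np.zeros(len(Z)), Kzz)
f_mean = Kxz @ np.linalg.solve(Kzz, u)

print(f_mean.shape)  # (100,)
```

The M inducing variables summarise each latent function, so the dominant cost is O(M³) per latent function rather than O(N³), which is what makes the SVI scheme scale to large datasets.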

Variational inference: Exact posterior inference is intractable in our model due to the presence of an arbitrary number of non-Gaussian likelihoods. We use variational inference to compute a lower bound L on the marginal log-likelihood log p(y) and to approximate the posterior distribution p(f, u | D):

L = \sum_{d=1}^{D} \mathbb{E}_{q(\tilde{f}_d)} \left[ \log p(y_d(x_n) | \tilde{f}_d) \right] - \sum_{q=1}^{Q} \mathrm{KL}\left( q(u_q) \,\|\, p(u_q) \right)

Acknowledgements: PMM is supported by a doctoral FPI grant (BES2016-077626) under the project Macro-ADOBE (TEC2015-67719-P), MINECO, Spain. AAR acknowledges the projects ADVENTURE (TEC2015-69868-C2-1-R), AID (TEC2014-62194-EXP) and CASI-CAM-CM (S2013/ICE-2845). MAA has been partially financed by the Engineering and Physical Sciences Research Council (EPSRC) Research Projects EP/N014162/1 and EP/R034303/1.
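The variational bound L above can be sketched numerically: a Monte Carlo estimate of the expected log-likelihood term minus a closed-form KL divergence between Gaussians. All function names and the toy data are illustrative assumptions, not the HetMOGP API.

```python
import numpy as np

def kl_gauss(m_q, S_q, S_p):
    # KL( N(m_q, S_q) || N(0, S_p) ) between multivariate Gaussians
    M = len(m_q)
    Sp_inv = np.linalg.inv(S_p)
    return 0.5 * (np.trace(Sp_inv @ S_q) + m_q @ Sp_inv @ m_q - M
                  + np.log(np.linalg.det(S_p) / np.linalg.det(S_q)))

def elbo(y, f_samples, log_lik, kl_terms):
    # Monte Carlo estimate of E_q[log p(y_d | f~_d)] minus the KL penalty
    exp_ll = np.mean([log_lik(y, f).sum() for f in f_samples])
    return exp_ll - sum(kl_terms)

rng = np.random.default_rng(2)
y = rng.standard_normal(20)
f_samples = [rng.standard_normal(20) * 0.1 for _ in range(50)]
gauss_ll = lambda y, f: -0.5 * (np.log(2 * np.pi) + (y - f) ** 2)

M = 5
kl = kl_gauss(np.zeros(M), 0.5 * np.eye(M), np.eye(M))
print(elbo(y, f_samples, gauss_ll, [kl]))
```

Because both terms decompose over data points and latent functions, the bound admits unbiased minibatch estimates, which is what makes it amenable to SVI.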

Results
Code: github.com/pmorenoz/HetMOGP
Missing gap prediction: We predict observations in one output (binary classification) using training information from another one (Gaussian regression). Multi-output test-NLPD: 32.5 ± 0.2 × 10⁻²; single-output test-NLPD: 40.51 ± 0.08 × 10⁻².

[Figure: three panels over a real input in [0, 1]. Output 1: Gaussian regression (real output, −6 to 6); Output 2: binary classification (binary output, 0 to 1); Single output: binary classification (binary output, 0 to 1).]

London house price: Complete register of properties sold in the Greater London County during 2017. All property addresses were translated to latitude-longitude points. For each spatial input, we considered two observations, one binary (property type) and one real (sale price).

[Figure: four maps of Greater London (longitude −0.51 to 0.33, latitude 51.29 to 51.69). Panels: Property Type (Flat / Other); Sale Price (79K£ to 1.5M£); Probability of Flat House (0 to 1); Log-price Variance (0.3 to 2.4).]

Test-NLPD [London]   Bernoulli      Heteroscedastic   Global
HetMOGP              6.38 ± 0.46    10.05 ± 0.64      16.44 ± 0.01
ChainedGP            6.75 ± 0.25    10.56 ± 1.03      17.31 ± 1.06

Human behavior data: We model human behavior in psychiatric patients. Our data come from a medical study that uses the monitoring app eB2.

[Figure: three weekly traces (Monday to Sunday). Output 1: binary presence/absence at home (0 to 1); Output 2: log-distance from home in km (−4 to 4); Output 3: binary use/non-use of WhatsApp (0 to 1).]

Conclusions
We present a MOGP model for handling heterogeneous observations that is able to work on large-scale datasets. Experimental results show relevant improvements with respect to independent learning.

References
Y. W. Teh et al., Semiparametric latent factor models. AISTATS, 2005.
M. A. Álvarez et al., Sparse convolved Gaussian processes for multi-output regression. NIPS, 2008.
J. D. Hadfield, MCMC methods for multi-response GLMMs. JSS, 2010.
J. Hensman et al., Gaussian processes for big data. UAI, 2013.
A. Saul et al., Chained Gaussian processes. AISTATS, 2016.

Likelihood      Linked parameters
Gaussian        μ(x) = f, σ(x)
Het. Gaussian   μ(x) = f₁, σ(x) = exp(f₂)
Bernoulli       ρ(x) = exp(f) / (1 + exp(f))
Categorical     ρ_k(x) = exp(f_k) / (1 + Σ_{k'=1}^{K−1} exp(f_{k'}))
Poisson         λ(x) = exp(f)
Gamma           a(x) = exp(f₁), b(x) = exp(f₂)
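The link functions in the table can be written out directly. A small sketch (function names are illustrative): the categorical link treats class K as the reference with logit 0, matching the K−1 latent functions in the table.

```python
import numpy as np

def bernoulli_link(f):
    # rho(x) = exp(f) / (1 + exp(f)), i.e. the logistic function
    return 1.0 / (1.0 + np.exp(-f))

def categorical_link(F):
    # rho_k(x) = exp(f_k) / (1 + sum_{k'=1}^{K-1} exp(f_{k'}))
    # F has K-1 columns; the K-th class is the reference with logit 0
    expF = np.exp(F)
    denom = 1.0 + expF.sum(axis=-1, keepdims=True)
    probs = expF / denom
    return np.concatenate([probs, 1.0 - probs.sum(axis=-1, keepdims=True)], axis=-1)

def het_gaussian_link(f1, f2):
    # mu(x) = f1, sigma(x) = exp(f2): exp keeps the scale positive
    return f1, np.exp(f2)

p = categorical_link(np.zeros((1, 2)))  # K = 3 classes, all logits zero
print(np.round(p, 3))  # [[0.333 0.333 0.333]]
```

In every row of the table, an unconstrained LPF is mapped through exp or the logistic function so that the resulting likelihood parameter respects its natural constraint (positivity or the simplex).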