

Neurocomputing 192 (2016) 29–37


Efficient sparsification for Gaussian process regression ☆

Jens Schreiter a,b,*, Duy Nguyen-Tuong a, Marc Toussaint b

a Cognitive Systems Group, Corporate Sector Research and Advance Engineering, Robert Bosch GmbH, 70465 Stuttgart, Germany
b Machine Learning and Robotics Laboratory, Institute for Parallel and Distributed Systems, University of Stuttgart, 70569 Stuttgart, Germany

Article info

Article history: Received 29 June 2015; Received in revised form 17 November 2015; Accepted 1 February 2016; Available online 3 March 2016.

Keywords: Gaussian processes; Sparse approximations; Greedy subset selection; Learning robot inverse dynamics.

Abstract

Sparse Gaussian process models provide an efficient way to perform regression on large data sets. Sparsification approaches deal with the selection of a representative subset of available training data for inducing the sparse model approximation. A variety of insertion and deletion criteria have been proposed, but they either lack accuracy or suffer from high computational costs. In this paper, we present a new and straightforward criterion for successive selection and deletion of training points in sparse Gaussian process regression. The proposed novel strategies for sparsification are as fast as the purely randomized schemes and, thus, appropriate for applications in online learning. Experiments on real-world robot data demonstrate that our obtained regression models are competitive with the computationally intensive state-of-the-art methods in terms of generalization and accuracy. Furthermore, we employ our approach in learning inverse dynamics models for compliant robot control using very large data sets, i.e. with half a million training points. In this experiment, it is also shown that our approximated sparse Gaussian process model is sufficiently fast for real-time prediction in robot control.

© 2016 Elsevier B.V. All rights reserved.

☆ This paper is a longer and more detailed version of the publication [1].
* Corresponding author at: Cognitive Systems Group, Corporate Sector Research and Advance Engineering, Robert Bosch GmbH, 70465 Stuttgart, Germany. E-mail address: [email protected] (J. Schreiter).
http://dx.doi.org/10.1016/j.neucom.2016.02.032

1. Introduction

Nowadays, Gaussian processes (GPs) are a widely used non-parametric Bayesian modeling technique [2]. In contrast to other kernel approaches such as support vector machines (SVMs), see [3] for more details, GPs offer a probabilistic framework. This leads to predictive distributions for test points, and model selection is easy to achieve with standard Bayesian procedures. However, the applicability of full Gaussian process regression (GPR) to large-scale problems with a high number of training points $n$ is limited due to the unfavourable scaling in training time and memory requirements. The dominating factors are usually the $\mathcal{O}(n^3)$ cost for solving a linear system and computing a logarithmic determinant with respect to a dense covariance matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ between all available training points, and the $\mathcal{O}(n^2)$ space required to store it in memory. Furthermore, the full GPR model needs $\mathcal{O}(n^2)$ cost per predictive test point variance, as well as $\mathcal{O}(dn)$ for predicting the mean, if a kernel evaluation costs $\mathcal{O}(d)$, where $d$ is the data dimension.

To overcome these limitations in computational cost and storage requirements, many approximations to full GPR have been proposed. In Fig. 1, an illustration showing different approximation approaches is presented. Local GPR approaches, e.g. as proposed in Nguyen-Tuong et al. [4], can be used to increase modeling performance. These methods are based on a partition of the input space, where for each region a local GP model is trained. On the other hand, more sophisticated GPR approximation techniques consider either approximations of the dense covariance matrix $\mathbf{K}$ or focus on sparse likelihood approximations. For example, covariance matrix approximations such as the Nystrøm method [6] can be employed to reduce the modeling effort. Moreover, Fourier kernel approximations [7] like the sparse spectrum GPR scheme [8] directly consider an approximation of the specified covariance function to increase computational speed. Additionally, various sparse likelihood approximations have emerged recently, whose relations have been formalized in the unifying framework [9]. The fully independent training conditional (FITC) approximation [10] uses a flexible subset of virtual training points to generate a sparse GPR model and optimizes the virtual training points along with all other hyperparameters. In contrast, the deterministic training conditional (DTC) approximation selects a representative subset of real training points, the so-called active points, that induces the sparse likelihood approximation. A variational formalism for both sparse approximation techniques, which leads to a regularized log marginal likelihood for hyperparameter learning and the additional optimization of virtual training points with respect to the FITC approximation, plus a new greedy selection method for the DTC approximation, is presented in [11].


[Fig. 1 shows a taxonomy: full GPR; approximated GPR splitting into local GPR, covariance matrix approximations (Nystrøm method, kernel approximations), and sparse likelihood approximations (DTC approximation, FITC approximation).]

Fig. 1. Relation between different approximation techniques for Gaussian process regression. In this paper, we focus on the insertion and deletion strategies for the deterministic training conditional (DTC) approximation. Note that this illustration is by far not complete. For example, Gaussian process models based on network architectures [5] are not considered in this overview.


Here, greedy schemes are employed for the DTC approximation due to the high computational complexity of the optimal subset selection problem. A fast information gain criterion for the insertion of training points into the active set is proposed in [12]. Smola and Bartlett, cf. [13], use a computationally costly selection heuristic which approximates the logarithmic marginal posterior probability. The same formalism as in the previous criterion is used in [14], while improving computational performance by using a simpler approximation of the posterior probability. Quiñonero-Candela's [15] selection is based on the increase in the log marginal likelihood of the sparse GP by the insertion of a training point into the active subset. In [16], Csató and Opper measure the projection-induced error in the reproducing kernel Hilbert space (RKHS) and select the point which maximally extends the spanned subspace of the RKHS. Based on this idea, they also introduce a heuristic for the deletion of training points from the active set. They show that removing active points can considerably reduce the prediction times for test points while only slightly decreasing generalization accuracy. All of the insertion and deletion methods mentioned above either lack computational speed, have high memory requirements, or lack modeling accuracy. Moreover, if the regression model generation is based on a purely randomized selection or on a method with a small randomly selected subset of remaining training points for criteria evaluation, e.g. as done in [11,13–15], the performance in hard regression tasks deteriorates.

Our proposed novel sparsification method is closely related to the inclusion heuristic by Smola and Bartlett [13]. However, by employing some reasonable assumptions, we are able to significantly reduce the computational costs to the level of randomized selection, without a huge loss in model accuracy. Compared to the deletion criterion by Csató and Opper [16], our approach offers nearly the same prediction performance with lower computing time.

The remainder of the paper is organized as follows. In the following section, we introduce the sparse GPR setting and review the state-of-the-art sparsification criteria for inclusion and deletion. Our novel strategies for fast greedy insertion and deletion are presented in Section 3. We also describe an efficient way for learning the resulting hyperparameters with a generalized expectation maximization (EM) algorithm in our specific setup. In Section 4, we report on the results of our comprehensive comparison on several benchmark data sets for learning inverse dynamics models. Furthermore, we demonstrate the real-time applicability of our learned inverse dynamics models for a compliant robot control task. Finally, in Section 5, we discuss the results of our method and give directions for future work.

2. Sparse Gaussian process regression

Let $\mathcal{D} = (\mathbf{y}, \mathbf{X})$ be the training data set, where $\mathbf{y} \in \mathbb{R}^n$ is a vector of noisy realizations $y_i$ of the underlying regression function $f: \mathbb{R}^d \to \mathbb{R}$ with $f(\mathbf{x}_i) = f_i$, obeying the relationship $y_i = f_i + \varepsilon_i$ with Gaussian noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Furthermore, the $n$ training inputs $\mathbf{x}_i \in \mathbb{R}^d$ are summarized row-wise in $\mathbf{X} \in \mathbb{R}^{n \times d}$. Our goal is the construction of a sparse GPR model for the underlying regression function. Csató [17] and Seeger [18] laid the foundation for this sparse GPR model under the DTC approximation, which is presented below. We adopt the notation of Seeger et al. [12] to facilitate the comparability of the different criteria. Let $I$ be the index set of size $m$ of all active points $\mathbf{x}_i$ with $i \in I$, i.e. training points that represent the sparse approximation, and $R$ be the index set of all remaining points, such that $I \cup R = \{1, \dots, n\}$. The centered prior distribution over the latent function values $\mathbf{f}_I \in \mathbb{R}^m$ corresponding to the active subset is then given by

$$P(\mathbf{f}_I \mid \mathbf{X}_I) = \mathcal{N}(\mathbf{f}_I \mid \mathbf{0}, \mathbf{K}_I) \quad \text{with} \quad \mathbf{K}_I = \big(k(\mathbf{x}_i, \mathbf{x}_j)\big)_{i,j \in I} \in \mathbb{R}^{m \times m}. \tag{1}$$

Here, $\mathbf{K}_I$ is the covariance matrix over the active training points determined through the specified covariance function $k_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. The sparseness of this method is introduced via a likelihood approximation $Q_I(\mathbf{y} \mid \mathbf{f}_I, \mathbf{X})$ that is optimized with respect to the Kullback–Leibler divergence (KL-divergence) and induced through the active training points, which leads to

$$Q_I(\mathbf{y} \mid \mathbf{f}_I, \mathbf{X}) = \mathcal{N}\big(\mathbf{y} \mid \mathbf{P}_I^{\mathsf T}\mathbf{f}_I,\; \sigma^2\mathbf{I}\big). \tag{2}$$

The projection matrix $\mathbf{P}_I = \mathbf{K}_I^{-1}\mathbf{K}_{I,\cdot} \in \mathbb{R}^{m \times n}$, where $\mathbf{K}_{I,\cdot} \in \mathbb{R}^{m \times n}$ comprises the covariance function values between all training points (the $\cdot$ notation) and the active subset of training points, maps $\mathbf{f}_I$ to the prior conditional mean $\mathbb{E}_P[\mathbf{f} \mid \mathbf{f}_I] = \mathbf{K}_{I,\cdot}^{\mathsf T}\mathbf{K}_I^{-1}\mathbf{f}_I \in \mathbb{R}^n$. With Bayesian inference we get the approximated posterior distribution

$$Q_I(\mathbf{f}_I \mid \mathbf{y}, \mathbf{X}) = \mathcal{N}\big(\mathbf{f}_I \mid \mathbf{L}\mathbf{M}^{-1}\mathbf{V}\mathbf{y},\; \sigma^2\mathbf{L}\mathbf{M}^{-1}\mathbf{L}^{\mathsf T}\big), \tag{3}$$

which is proportional to the product of the prior in Eq. (1) and the approximated likelihood in Eq. (2). Here $\mathbf{L} \in \mathbb{R}^{m \times m}$ is the lower Cholesky factor of $\mathbf{K}_I$, $\mathbf{V} = \mathbf{L}^{-1}\mathbf{K}_{I,\cdot} \in \mathbb{R}^{m \times n}$, and $\mathbf{M} = \sigma^2\mathbf{I} + \mathbf{V}\mathbf{V}^{\mathsf T} \in \mathbb{R}^{m \times m}$ for fixed $I$ of size $m$. The approximated marginal likelihood directly follows from the integration of the same product over the active function values $\mathbf{f}_I$ and results in

$$Q_I(\mathbf{y} \mid \mathbf{X}) = \mathcal{N}\big(\mathbf{y} \mid \mathbf{0},\; \sigma^2\mathbf{I} + \mathbf{V}^{\mathsf T}\mathbf{V}\big). \tag{4}$$

The predictive distribution for a test point $\mathbf{x}_* \in \mathbb{R}^d$ is given by

$$Q_I(f_* \mid \mathbf{x}_*, \mathbf{y}, \mathbf{X}) = \mathcal{N}\big(f_* \mid \mathbf{k}_{I,*}^{\mathsf T}\mathbf{L}^{-\mathsf T}\mathbf{L}_M^{-\mathsf T}\boldsymbol{\beta}_I,\; k_{**} - \|\mathbf{L}^{-1}\mathbf{k}_{I,*}\|^2 + \sigma^2\|\mathbf{L}_M^{-1}\mathbf{L}^{-1}\mathbf{k}_{I,*}\|^2\big) \tag{5}$$

with the Cholesky decomposition $\mathbf{M} = \mathbf{L}_M\mathbf{L}_M^{\mathsf T}$, $\boldsymbol{\beta}_I = \mathbf{L}_M^{-1}\mathbf{V}\mathbf{y} \in \mathbb{R}^m$, and the covariance vector $\mathbf{k}_{I,*} \in \mathbb{R}^m$ between the test input and the active points. If only the predicted mean values are of interest, the prediction vector $\boldsymbol{\alpha}_I = \mathbf{L}^{-\mathsf T}\mathbf{L}_M^{-\mathsf T}\boldsymbol{\beta}_I$ can be precomputed to perform computations of mean values with only $\mathcal{O}(md)$ cost. Note that this cost depends on the calculation of the vector $\mathbf{k}_{I,*}$ and, thus, on the specified covariance function, and is typically proportional to the input dimension $d$. The predictive variance is feasible in $\mathcal{O}(m^2)$ if $d < m$. The approximated posterior distribution for all training points $Q_I(\mathbf{f} \mid \mathbf{y}, \mathbf{X})$ induced through the active subset indexed by $I$ is given by

$$Q_I(\mathbf{f} \mid \mathbf{y}, \mathbf{X}) = \mathcal{N}\big(\mathbf{f} \mid \mathbf{V}^{\mathsf T}\mathbf{L}_M^{-\mathsf T}\boldsymbol{\beta}_I,\; \mathbf{K} - \mathbf{V}^{\mathsf T}\mathbf{V} + \sigma^2\mathbf{V}^{\mathsf T}\mathbf{M}^{-1}\mathbf{V}\big) \tag{6}$$

with the estimated mean vector $\boldsymbol{\mu}_I = \mathbb{E}_{Q_I}[\mathbf{f} \mid \mathbf{y}, \mathbf{X}] = \mathbf{V}^{\mathsf T}\mathbf{L}_M^{-\mathsf T}\boldsymbol{\beta}_I \in \mathbb{R}^n$. Due to the matrix–matrix multiplications, the training complexity of this sparse GPR model is $\mathcal{O}(nm^2)$. Predicting the mean for one test point is feasible in $\mathcal{O}(dm)$.
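To make the preceding quantities concrete, the following minimal NumPy sketch assembles the DTC model of Eqs. (1)–(6) and the prediction of Eq. (5). The kernel callable `k(X1, X2)`, the added jitter, and all function names are assumptions of this illustration, not the authors' implementation.

```python
# Minimal sketch of the DTC quantities, assuming a kernel k(X1, X2) that
# returns the covariance matrix between two sets of inputs.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def dtc_fit(X, y, I, k, sigma2):
    """Precompute the sparse DTC model for the active index set I."""
    m = len(I)
    K_I = k(X[I], X[I])                                     # Eq. (1), m x m
    L = cholesky(K_I + 1e-10 * np.eye(m), lower=True)       # lower Cholesky factor of K_I (with jitter)
    V = solve_triangular(L, k(X[I], X), lower=True)         # V = L^{-1} K_{I,.}, m x n
    M = sigma2 * np.eye(m) + V @ V.T                        # M = sigma^2 I + V V^T
    L_M = cholesky(M, lower=True)                           # lower Cholesky factor of M
    beta = solve_triangular(L_M, V @ y, lower=True)         # beta_I = L_M^{-1} V y
    tmp = solve_triangular(L_M.T, beta, lower=False)        # L_M^{-T} beta_I
    alpha = solve_triangular(L.T, tmp, lower=False)         # alpha_I = L^{-T} L_M^{-T} beta_I
    mu = V.T @ tmp                                          # posterior mean on the training inputs, Eq. (6)
    return dict(L=L, L_M=L_M, beta=beta, alpha=alpha, mu=mu, X_I=X[I])

def dtc_predict(model, x_star, k, sigma2):
    """Predictive mean and variance of Eq. (5) for a single test input."""
    k_star = k(model['X_I'], x_star[None, :]).ravel()       # k_{I,*}
    mean = k_star @ model['alpha']                          # O(md) once alpha_I is cached
    a = solve_triangular(model['L'], k_star, lower=True)    # L^{-1} k_{I,*}
    b = solve_triangular(model['L_M'], a, lower=True)       # L_M^{-1} L^{-1} k_{I,*}
    k_ss = k(x_star[None, :], x_star[None, :])[0, 0]
    var = k_ss - a @ a + sigma2 * (b @ b)
    return mean, var
```

Caching `alpha` inside `dtc_fit` corresponds to the precomputed prediction vector $\boldsymbol{\alpha}_I$ that enables the $\mathcal{O}(md)$ mean prediction mentioned above.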

2.1. State-of-the-art strategies for sparsification

Before the various state-of-the-art insertion and deletion schemes are presented, the most important symbols of the sparse GPR technique are summarized in Table 1. Most of the GP approximation techniques differ in the way the active set $\mathbf{X}_I$ is selected [9]. Sparsification deals with the insertion and deletion of training points to or from the active set. Usually, the remaining point $\mathbf{x}_i$ with $i \in R$ that has maximum gain with respect to an insertion criterion $\Delta_i$ will be selected. Analogously, we will remove the active point from the current posterior model given by Eq. (6) that has minimal loss with respect to a deletion criterion $\nabla_i$. In the following, let $I' = I \cup \{i\}$. One of the simplest and fastest point selection and deletion methods is to randomly select one of the training points. This approach provides the baseline strategy to which we compare all other methods.

Currently, one of the best selection methods with respect to modeling accuracy is the one proposed by Smola and Bartlett. Their greedy scheme [13] selects the remaining point that maximizes the posterior likelihood

$$P(\boldsymbol{\alpha} \mid \mathbf{y}, \mathbf{X}) \propto P(\mathbf{y} \mid \boldsymbol{\alpha}, \mathbf{X})\,P(\boldsymbol{\alpha} \mid \mathbf{X}) = \mathcal{N}\big(\mathbf{y} \mid \mathbf{K}\boldsymbol{\alpha}, \sigma^2\mathbf{I}\big)\,\mathcal{N}\big(\boldsymbol{\alpha} \mid \mathbf{0}, \mathbf{K}\big) \tag{7}$$

for the admission of the prediction vector $\boldsymbol{\alpha} \in \mathbb{R}^n$ of the full GP model under the given data set $\mathcal{D}$.

Table 1. The most important symbols for the presented sparse GPR approximation and the considered insertion and deletion strategies, with a short description of the used notation.

$\mathbf{K}$ ($n \times n$): full covariance matrix.
$\mathbf{K}_I$ ($m \times m$): covariance matrix induced by the active set of training points.
$\mathbf{K}_{I,\cdot}$ ($m \times n$): covariance matrix induced by the active set points and all other training points.
$\mathbf{P}_I$ ($m \times n$): projection matrix $\mathbf{P}_I = \mathbf{K}_I^{-1}\mathbf{K}_{I,\cdot}$ which induces the sparse approximation.
$\mathbf{L}$ ($m \times m$): lower Cholesky factor of $\mathbf{K}_I$.
$\mathbf{V}$ ($m \times n$): Nystrøm factor for approximating the full covariance matrix $\mathbf{K}$.
$\mathbf{M}$ ($m \times m$): short-hand notation for the matrix $\mathbf{M} = \sigma^2\mathbf{I} + \mathbf{V}\mathbf{V}^{\mathsf T}$.
$\mathbf{L}_M$ ($m \times m$): lower Cholesky factor of $\mathbf{M}$.
$\mathbf{k}_{I,*}$ ($m \times 1$): covariance vector induced by the active set points and the test point $\mathbf{x}_*$.
$\mathbf{k}_{I,i}$ ($m \times 1$): covariance vector induced by the active set points and the remaining point $\mathbf{x}_i$.
$\mathbf{k}_{\cdot,i}$ ($n \times 1$): covariance vector induced by all training points and the remaining point $\mathbf{x}_i$.
$\boldsymbol{\alpha}_I$ ($m \times 1$): prediction vector $\boldsymbol{\alpha}_I = \mathbf{L}^{-\mathsf T}\mathbf{L}_M^{-\mathsf T}\boldsymbol{\beta}_I$.
$\boldsymbol{\beta}_I$ ($m \times 1$): short-hand notation for the vector $\boldsymbol{\beta}_I = \mathbf{L}_M^{-1}\mathbf{V}\mathbf{y}$.
$\boldsymbol{\mu}_I$ ($n \times 1$): estimated mean vector $\boldsymbol{\mu}_I = \mathbf{V}^{\mathsf T}\mathbf{L}_M^{-\mathsf T}\boldsymbol{\beta}_I$ of the sparse GPR model.

This likelihood approach is based on the transformation $\boldsymbol{\alpha} = \mathbf{K}^{-1}\mathbf{f}$ and leads to the equivalent formulation

$$\tau_I = \min_{\boldsymbol{\alpha}_I} \tau(\boldsymbol{\alpha}_I) = \min_{\boldsymbol{\alpha}_I}\Big(\tfrac{1}{2}\boldsymbol{\alpha}_I^{\mathsf T}\mathbf{L}\mathbf{M}\mathbf{L}^{\mathsf T}\boldsymbol{\alpha}_I - \boldsymbol{\alpha}_I^{\mathsf T}\mathbf{L}\mathbf{V}\mathbf{y}\Big) = -\tfrac{1}{2}\boldsymbol{\beta}_I^{\mathsf T}\boldsymbol{\beta}_I \tag{8}$$

in the sparse sense, as pointed out in [12], i.e. with $\boldsymbol{\alpha}_R = \mathbf{0}$. The decrease in the sparse posterior likelihood derived from Eq. (7) defines the selection criterion of Smola and Bartlett (SB), i.e.

$${}^{\mathrm{SB}}\Delta_i = \tau_I - \tau_{I'} = \tfrac{1}{2}\beta_{I',i}^2 \tag{9}$$

for each remaining point, with the new component $\beta_{I',i}$ of the updated vector $\boldsymbol{\beta}_{I'} \in \mathbb{R}^{m+1}$. Due to the high computational cost of $\mathcal{O}(nm)$ for the criterion calculation per remaining point, the criterion is only evaluated for a randomly chosen subset of cardinality $\kappa$. The authors of [13] recommend $\kappa = 59$, which they justify with a probabilistic argument. Nevertheless, they end up with a high computational cost of $\mathcal{O}(\kappa n m^2)$ for the whole DTC approximation. The conjugation of this selection heuristic defines the corresponding deletion criterion

$${}^{\mathrm{SB}}\nabla_i = {}^{\mathrm{SB}}\Delta_i, \tag{10}$$

which leads to a cost of $\mathcal{O}(m^2)$ per active point.

To increase the performance of the former selection heuristic from Eq. (9), a matching pursuit approach (MPA), which reduces the computational effort but not the memory requirements, is presented in [14]. Here, the authors fix $\boldsymbol{\alpha}_I \in \mathbb{R}^m$ in the minimization of $\tau(\boldsymbol{\alpha}_{I'})$ in Eq. (8) and only vary $\alpha_{I',i}$ with $i \in R$. This yields the insertion criterion

$${}^{\mathrm{MPA}}\Delta_i = \tau_I - \min_{\alpha_{I',i}} \tau(\boldsymbol{\alpha}_{I'}) = \frac{\big(\mathbf{k}_{\cdot,i}^{\mathsf T}(\mathbf{y} - \boldsymbol{\mu}_I) - \sigma^2\mu_{I,i}\big)^2}{2\big(\sigma^2 k_{ii} + \mathbf{k}_{\cdot,i}^{\mathsf T}\mathbf{k}_{\cdot,i}\big)} \tag{11}$$

with covariance vector $\mathbf{k}_{\cdot,i} \in \mathbb{R}^n$. However, despite the lower computational cost of $\mathcal{O}(dn)$ per remaining point, on large data sets or under high input dimensions the authors also select a randomized subset of size $\kappa$ for the criterion evaluation to boost efficiency. An additional matrix cache that contains, for example, $\kappa$ rows of the full covariance matrix $\mathbf{K}$ can help to speed up the criterion evaluations, but increases the memory requirements considerably.
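As a small illustration, the criterion of Eq. (11) can be evaluated for one candidate point as follows; the names `k_col` (the covariance column $\mathbf{k}_{\cdot,i}$), `mu` (the current sparse posterior mean $\boldsymbol{\mu}_I$) and the function itself are assumptions of this sketch.

```python
# Sketch of the matching pursuit criterion of Eq. (11) for one candidate i.
import numpy as np

def mpa_delta(k_col, y, mu, i, sigma2):
    r = y - mu                                         # residual under the current sparse model
    num = (k_col @ r - sigma2 * mu[i]) ** 2            # squared gradient of tau w.r.t. the new alpha_i
    den = 2.0 * (sigma2 * k_col[i] + k_col @ k_col)    # curvature term
    return num / den
```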

In [12], a very fast greedy criterion with a computational complexity of $\mathcal{O}(1)$ per remaining point is proposed. Here, the information gain (IG)

$${}^{\mathrm{IG}}\Delta_i = \mathrm{KL}\big[\tilde{Q}_{I'}(\mathbf{f} \mid \mathbf{y}, \mathbf{X}) \,\|\, Q_I(\mathbf{f} \mid \mathbf{y}, \mathbf{X})\big] \tag{12}$$

is measured by the increase in the KL-divergence between the current posterior distribution $Q_I(\mathbf{f} \mid \mathbf{y}, \mathbf{X})$ following from Eq. (6) and an approximated one $\tilde{Q}_{I'}(\mathbf{f} \mid \mathbf{y}, \mathbf{X})$ after inclusion of the remaining point $\mathbf{x}_i$. Thereby, couplings between the latent function value $f_i$ and the targets $\mathbf{y}_{\setminus i}$, i.e. without the $i$-th element, are ignored to guarantee low computational costs.

The intention of the following greedy selection criterion by Quiñonero-Candela (QC) is to increase the logarithmic marginal likelihood $\varphi_I(\boldsymbol{\theta})$ obtained from Eq. (4) by the inclusion of a remaining point. This leads to the equivalent criterion

$${}^{\mathrm{QC}}\Delta_i = \varphi_{I'}(\boldsymbol{\theta}) - \varphi_I(\boldsymbol{\theta}) = \frac{{}^{\mathrm{SB}}\Delta_i}{\sigma^2} - \log(l_{M,ii}) + \log(\sigma), \tag{13}$$

where only the change induced through the inclusion is considered, i.e. the hyperparameters $\boldsymbol{\theta}$ are fixed during the selection process, cf. [15]. More details on the adaptation of the hyperparameters are presented in Section 3.2. As pointed out in Eq. (13), this heuristic is closely related to the criterion in [13], since $\beta_{I',i}^2 \propto l_{M,ii}^{-2}$, where $l_{M,ii}$ is the $i$-th diagonal element of $\mathbf{L}_M$. Consequently, this criterion leads to the same computational cost of $\mathcal{O}(nm)$ per remaining point and is also calculated for only a small randomly selected remaining subset of size $\kappa$.


The idea in [11] is the same as in [15], but a regularized logarithmic marginal likelihood given through a variational (VAR) framework is increased. This results in the criterion

$${}^{\mathrm{VAR}}\Delta_i = {}^{\mathrm{QC}}\Delta_i + \frac{\|\mathbf{k}_{\cdot,i} - \mathbf{V}^{\mathsf T}\mathbf{L}^{-1}\mathbf{k}_{I,i}\|^2}{2\sigma^2\big(k_{ii} - \|\mathbf{L}^{-1}\mathbf{k}_{I,i}\|^2\big)} \tag{14}$$

for active point selection. The relation to the criterion in [15] induces the same computational complexity of $\mathcal{O}(nm)$ per remaining point. To increase performance, the disadvantageous sub-sampling of a randomly chosen subset of remaining training points is again required.

The last insertion criterion which we discuss here is introduced in [17]. Csató (CS) defined his selection heuristic over the projection-induced error in the RKHS specified by the covariance function, see [3] for more details, which leads to

$${}^{\mathrm{CS}}\Delta_i = k_{ii} - \|\mathbf{L}^{-1}\mathbf{k}_{I,i}\|^2 \tag{15}$$

with covariance vector $\mathbf{k}_{I,i} \in \mathbb{R}^m$. Due to its relatively low computational cost of $\mathcal{O}(m)$ per remaining point, it is possible to evaluate this criterion for all remaining points, which slightly increases the overall complexity of the DTC approximation to $\mathcal{O}(nm^3)$. Furthermore, in [16] Csató defined a greedy deletion criterion given by

$${}^{\mathrm{CS}}\nabla_i = |{}^{\mathrm{CS}}\Delta_i\, \alpha_{I,i}|. \tag{16}$$

The main difference between the deletion and selection heuristic by Csató lies in the influence of the respective element of the prediction vector $\boldsymbol{\alpha}_I$. Since $\boldsymbol{\alpha}_I$ is known, we also need $\mathcal{O}(m^2)$ arithmetic operations per active point, which is equally costly as the removal heuristic by Smola and Bartlett in Eq. (10). Note that the active point of the posterior model in Eq. (6) with minimal loss in terms of a deletion criterion is always removed.
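Written out directly, Eqs. (15) and (16) amount to the small sketch below. Computing the triangular solve from scratch costs $\mathcal{O}(m^2)$ per point; the $\mathcal{O}(m)$ figure above presumably relies on reusing these quantities across iterations. All names are illustrative assumptions.

```python
# Sketch of Csato's insertion criterion (15) and deletion criterion (16),
# assuming L is the lower Cholesky factor of K_I, k_Ii the covariance vector
# between the active set and the candidate x_i, and alpha_i the corresponding
# entry of the prediction vector alpha_I.
import numpy as np
from scipy.linalg import solve_triangular

def cs_insert(L, k_Ii, k_ii):
    a = solve_triangular(L, k_Ii, lower=True)   # L^{-1} k_{I,i}
    return k_ii - a @ a                         # projection-induced error in the RKHS, Eq. (15)

def cs_delete(cs_delta_i, alpha_i):
    return abs(cs_delta_i * alpha_i)            # Eq. (16)
```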

3. Maximum error criterion for sparse Gaussian process regression

All the insertion and deletion methods discussed above either lack computational speed, have high memory requirements, or lack modeling accuracy. Moreover, if the regression model is generated based on a purely randomized selection or on a method with a small randomly selected remaining subset for criterion evaluation, e.g. as done in [13–15], the performance in hard regression tasks deteriorates. Our novel approach aims to provide a favorable compromise between modeling accuracy, computational cost, and memory requirements.

3.1. Maximum error criterion for fast greedy insertion and deletion of training points

In this section, we first discuss the successive inclusion of training points into the active subset. To include a remaining point $\mathbf{x}_i$ with $i \in R$ in the active subset, we have to update the Cholesky factors $\mathbf{L}$, $\mathbf{L}_M$ and the matrix $\mathbf{V}$, respectively $\mathbf{K}_{I,\cdot}$, the vector $\boldsymbol{\beta}_I$, and the mean $\boldsymbol{\mu}_I$ of the posterior distribution given in Eq. (6), as shown in [12]. Thus, the cost for the sequential insertion in the $m$-th iteration is $\mathcal{O}(nm)$.

Similar to the method of Smola and Bartlett in [13], our approach maximizes the posterior probability given by Eqs. (7) and (8). The greedy scheme in Eq. (9) successively maximizes the Euclidean norm of the vector $\boldsymbol{\beta}_{I'}$. This task is equivalent to iteratively minimizing $\|\mathbf{y} - \boldsymbol{\mu}_{I'}\|$ for the normalized vector $\mathbf{y}$ and, thus, approximately normalized $\boldsymbol{\mu}_{I'}$, since we have $\|\boldsymbol{\beta}_{I'}\|^2 = \boldsymbol{\beta}_{I'}^{\mathsf T}\boldsymbol{\beta}_{I'} = \mathbf{y}^{\mathsf T}\boldsymbol{\mu}_{I'}$ after an inclusion. Due to the equivalence of norms in finite-dimensional spaces, it holds that $\|\mathbf{y} - \boldsymbol{\mu}_{I'}\| \le n \max_j |y_j - \mu_{I',j}|$. In the limit, i.e. with increasing $m$, we approximately have $\boldsymbol{\mu}_{I'} \approx \boldsymbol{\mu}_I$. Since this convergence assumption only holds for large $m$, we do not necessarily obtain an upper bound for the model error after an inclusion. Nevertheless, we define

$${}^{\mathrm{ME}}\Delta_i = |y_i - \mu_{I,i}| \tag{17}$$

as our new insertion criterion and select the remaining point that has the maximal error (ME) under the current posterior model in Eq. (6). This computationally efficient approach has $\mathcal{O}(1)$ cost for the criterion calculation per remaining point. The convergence assumption obviates the update of the posterior model for each remaining point, as needed for other selection criteria, e.g. in [11,13,15].
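A compact sketch of the resulting greedy insertion loop is given below, reusing the hypothetical `dtc_fit` helper sketched in Section 2. For brevity it re-fits the model from scratch after every inclusion instead of performing the incremental Cholesky updates described above, so it illustrates the selection rule rather than its $\mathcal{O}(nm)$ update cost.

```python
# Greedy active set selection with the maximum error criterion of Eq. (17).
import numpy as np

def greedy_me_selection(X, y, k, sigma2, m_final):
    n = len(y)
    I = [int(np.argmax(np.abs(y)))]              # arbitrary seed: the largest target
    model = dtc_fit(X, y, I, k, sigma2)
    while len(I) < m_final:
        R = np.setdiff1d(np.arange(n), I)        # indices of the remaining points
        errors = np.abs(y[R] - model['mu'][R])   # ME criterion, Eq. (17), O(1) per point
        I.append(int(R[np.argmax(errors)]))      # insert the worst-explained point
        model = dtc_fit(X, y, I, k, sigma2)      # stand-in for the rank-one updates
    return I, model
```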

In the following, we present our maximum error deletion criterion for the removal of active points. Typically, the maximum number of active points $m$ is predefined, since it influences the computing time quadratically and the memory requirements linearly. If a stopping criterion for $m$ is used, for example by monitoring the averaged square training error as in [12], the deletion of appropriate active points improves the predictive performance without significantly deteriorating the existing model quality. The deletion also provides a way to reduce redundancy in the greedily selected active subset. Similar to the presented insertion strategy, we opt for a greedy criterion to successively delete active points. Note that deleting an active point does not necessarily lead to a state that was previously encountered when iteratively inserting training points. The reason is that the underlying assumptions for greedy insertion and deletion differ considerably. While for the inclusion strategies Cholesky updates are sufficiently fast and stable, QR-downdates based on the factorization $\mathbf{Q}\mathbf{R} = \mathbf{L}\mathbf{M}\mathbf{L}^{\mathsf T}$ are used for deleting active points, since they offer higher numerical stability. This advantageous behavior is discussed in [19]. For our proposed technique, the cost for the deletion of one point is equal to its insertion cost, i.e. $\mathcal{O}(nm)$ in the $m$-th iteration of the DTC approximation. Inspired by the criterion in [17], we define our new deletion criterion as follows. Beginning with an already selected subset determined by $I$, we remove the active point with minimal value regarding the deletion criterion

$${}^{\mathrm{ME}}\nabla_i = |{}^{\mathrm{ME}}\Delta_i\, \alpha_{I,i}|, \tag{18}$$

where $\boldsymbol{\alpha}_I = \mathbf{R}^{-1}\mathbf{Q}^{\mathsf T}\mathbf{K}_{I,\cdot}\mathbf{y} \in \mathbb{R}^m$. Note that we use the maximum error ${}^{\mathrm{ME}}\Delta_i$ instead of the expensive projection-induced error ${}^{\mathrm{CS}}\Delta_i$. Thus, we obtain the same low complexity of $\mathcal{O}(1)$ per active point for a deletion criterion evaluation. Here, we couple the error of an active training point in the current sparse model given by Eq. (6) with its importance for prediction in relation to the behavior in Eq. (7). Thus, our deletion criterion controls both the current model accuracy and the generalization capability.
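The deletion rule of Eq. (18) can be scored just as cheaply, as in the following sketch; the QR downdating of the factorization above is omitted, and `model` is assumed to come from the hypothetical `dtc_fit` helper.

```python
# Scoring of the maximum error deletion criterion, Eq. (18).
import numpy as np

def me_delete_one(I, y, model):
    I = np.asarray(I)
    scores = np.abs((y[I] - model['mu'][I]) * model['alpha'])  # |ME_Delta_i * alpha_{I,i}|
    return int(I[np.argmin(scores)])                           # active point with minimal loss
```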

Finally, for a better comparison, the complexity of each presented insertion criterion is summarized in Table 2. The memory entries describe only the storage amount needed for the criterion calculation for a single remaining training point. Since the FITC approximation does not depend on a selection criterion, the respective entries in the table are empty.

3.2. Generalized expectation maximization for learning hyperparameters

The presented sparse GPR model depends not only on the active points $\mathbf{x}_i$ with $i \in I$, but also on the hyperparameters of the specified covariance function and the variance $\sigma^2$ of the Gaussian noise model. Let the vector $\boldsymbol{\theta}$ denote the collection of all hyperparameters. For the sake of notational simplicity, the dependency of the above formulas on $\boldsymbol{\theta}$ was neglected. The adaptation of the hyperparameters can be realized by gradient-based optimization algorithms.


Table 2. Computational complexity and storage requirements of the discussed sparse GPR approximation methods. The memory column describes only the additional storage amount needed for the criterion calculation based on the already determined predictive model (6). Nevertheless, the practical memory effort and computational performance of each method depend very strongly on the implementation, especially for the last four intelligent selection schemes.

Method: criterion cost per remaining point / complete cost / memory
FITC: – / $\mathcal{O}(nm^2)$ / –
DTC + ME: $\mathcal{O}(1)$ / $\mathcal{O}(nm^2)$ / $\mathcal{O}(1)$
DTC + IG: $\mathcal{O}(1)$ / $\mathcal{O}(nm^2)$ / $\mathcal{O}(1)$
DTC + CS: $\mathcal{O}(m)$ / $\mathcal{O}(nm^3)$ / $\mathcal{O}(m)$
DTC + MPA: $\mathcal{O}(dn)$ / $\mathcal{O}(nm^2)$ / $\mathcal{O}(n)$
DTC + SB: $\mathcal{O}(nm)$ / $\mathcal{O}(\kappa nm^2)$ / $\mathcal{O}(n)$
DTC + QC: $\mathcal{O}(nm)$ / $\mathcal{O}(\kappa nm^2)$ / $\mathcal{O}(n)$
DTC + VAR: $\mathcal{O}(nm)$ / $\mathcal{O}(\kappa nm^2)$ / $\mathcal{O}(n)$
DTC + Random: $\mathcal{O}(1)$ / $\mathcal{O}(nm^2)$ / $\mathcal{O}(1)$


These algorithms maximize the logarithmic marginal likelihood

$$\varphi_I(\boldsymbol{\theta}) = \log\big(Q_I(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})\big) = -(n-m)\log(\sigma) - \sum_{i=1}^{m}\log(l_{M,ii}) - \frac{1}{2\sigma^2}\big(\mathbf{y}^{\mathsf T}\mathbf{y} - \boldsymbol{\beta}_I^{\mathsf T}\boldsymbol{\beta}_I\big) - \frac{n}{2}\log(2\pi) \tag{19}$$

obtained from Eq. (4). The values $l_{M,ii}$ are the diagonal entries of the Cholesky factor $\mathbf{L}_M$. In [11], the author uses a regularized version of the above logarithmic marginal likelihood, i.e.

$${}^{\mathrm{VAR}}\varphi_I(\boldsymbol{\theta}) = \varphi_I(\boldsymbol{\theta}) - \frac{1}{2\sigma^2}\,\mathrm{trace}\big(\mathbf{K} - \mathbf{V}^{\mathsf T}\mathbf{V}\big), \tag{20}$$

described through his variational (VAR) approach. This additional regularization term corrects the Nystrøm approximation of the full covariance matrix in the hyperparameter learning process and gives a lower bound to the true logarithmic marginal likelihood in Eq. (19) of the DTC approximation, cf. [11]. One problem encountered when maximizing $\varphi_I(\boldsymbol{\theta})$ or ${}^{\mathrm{VAR}}\varphi_I(\boldsymbol{\theta})$ in the variational framework, respectively, is their dependence on the active subset of training points determined by $I$. To solve this problem, we take alternating constrained optimization steps in an expectation maximization manner, employing the theory of [20]. The expectation step for estimating the new posterior distribution $Q_{I_{\mathrm{new}}}(\mathbf{f}_{I_{\mathrm{new}}} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\theta})$ from Eq. (3) with fixed hyperparameters $\boldsymbol{\theta}$ is given by

$$Q_{I_{\mathrm{new}}}(\mathbf{f}_{I_{\mathrm{new}}} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\theta}) = \underset{Q_I(\mathbf{f}_I \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\theta})\, \in\, \mathcal{Q}_I(\mathbf{y}, \mathbf{X}, \boldsymbol{\theta})}{\arg\min}\; \mathrm{KL}\big[Q_I(\mathbf{f}_I \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\theta}) \,\|\, Q_I(\mathbf{y}, \mathbf{f}_I \mid \mathbf{X}, \boldsymbol{\theta})\big]. \tag{21}$$

Here, the posterior distribution in the expectation step in Eq. (21) is constrained, with respect to the KL-divergence, to the set $\mathcal{Q}_I(\mathbf{y}, \mathbf{X}, \boldsymbol{\theta})$ of approximated posteriors induced by an active subset of size $m$. This condition is handled with a fixed final size $m$ of the active subset in the greedy selection process. Furthermore, the maximization step results in

$$\boldsymbol{\theta}_{\mathrm{new}} = \underset{\boldsymbol{\theta}}{\arg\max}\; \mathbb{E}_{\mathbf{f}_I \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\theta}}\big[\log\big(Q_I(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})\big)\big] \tag{22}$$

to determine an updated set of hyperparameters $\boldsymbol{\theta}_{\mathrm{new}}$. The maximization step in Eq. (22) is realized with only a few gradient ascents on the logarithmic marginal likelihood from Eq. (19) or (20), respectively, with a fixed active point set $\mathbf{X}_I$. In this case, the repeated alternating computation of the E- and M-steps leads to a generalized EM algorithm, since we only increase the logarithmic marginal likelihood. Since the generalized EM algorithm converges to local maxima, the choice of the active training points is important in order to obtain a good set of hyperparameters $\boldsymbol{\theta}$. For the selection of the active subset we employ our efficient maximum error criterion to keep the hyperparameter learning fast and stable.
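The alternating scheme can be summarized by the following sketch, which builds on the hypothetical `dtc_fit` and `greedy_me_selection` helpers above. The mappings `kernel_of(theta)` and `noise_of(theta)` from the hyperparameter vector to a kernel and a noise variance are assumptions, and the numbers of EM and gradient steps are free choices.

```python
# Generalized EM for the DTC hyperparameters (Section 3.2).
import numpy as np
from scipy.optimize import minimize

def dtc_log_marginal(theta, X, y, I, kernel_of, noise_of):
    """Log marginal likelihood of Eq. (19) for a fixed active set I."""
    sigma2 = noise_of(theta)
    model = dtc_fit(X, y, I, kernel_of(theta), sigma2)
    n, m = len(y), len(I)
    return (-(n - m) * 0.5 * np.log(sigma2)
            - np.sum(np.log(np.diag(model['L_M'])))
            - (y @ y - model['beta'] @ model['beta']) / (2.0 * sigma2)
            - 0.5 * n * np.log(2.0 * np.pi))

def dtc_em(X, y, kernel_of, noise_of, theta0, m_final, em_steps=10, grad_steps=5):
    theta = np.asarray(theta0, dtype=float)
    I = None
    for _ in range(em_steps):
        # E-step: re-select the active subset under the current hyperparameters.
        I, _ = greedy_me_selection(X, y, kernel_of(theta), noise_of(theta), m_final)
        # M-step: a few gradient-based steps on Eq. (19) with the active set I held fixed.
        res = minimize(lambda t: -dtc_log_marginal(t, X, y, I, kernel_of, noise_of),
                       theta, method='L-BFGS-B', options={'maxiter': grad_steps})
        theta = res.x
    return theta, I
```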

4. Evaluations

In this section, we compare our maximum error insertion and deletion criteria with the other methods for the DTC approximation presented in Section 2. Furthermore, we also consider the FITC approximation given in [10] to present an extensive comparison.

For all experiments, we use the stationary squared exponential covariance function

$$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}_i - \mathbf{x}_j)^{\mathsf T}\boldsymbol{\Lambda}^{-2}(\mathbf{x}_i - \mathbf{x}_j)\Big) \tag{23}$$

with magnitude $\sigma_f^2$ and automatic relevance determination, i.e. with one lengthscale parameter $\lambda_l$ per input dimension $l$ in the diagonal matrix $\boldsymbol{\Lambda} \in \mathbb{R}^{d \times d}$, see [2]. The accuracy of the methods under consideration is measured by the normalized mean square error (NMSE), cf. [4].

4.1. Comparisons on learning inverse dynamics models

Here, a real benchmark data set from the 7 Degree of Freedom (DoF) SARCOS master arm (13,922 training and 5569 test points) and a simulation data set from the SARCOS model (14,904 training and 5520 test points) are used for validation, see [4,21]. Each point of the data sets has 21 input dimensions, i.e. position, velocity and acceleration in joint space, and 7 targets, i.e. one torque for each joint of the SARCOS robot arm, see Fig. 3(c). The goal is to learn the inverse dynamics models, i.e. the mapping from position, velocity and acceleration to the corresponding torque for each joint. Both robot data sets contain independent training and test points for all DoFs.

The convergence trends with respect to the NMSE for all discussed sparse GP approximations on the first DoF of the real SARCOS test data are shown in Fig. 2(a) and (b). Here, for all DTC-type approaches, the hyperparameters were learned with ten EM steps and randomized active point selection up to a final set size of $m = 2000$. For a fair comparison, we adapt the number of gradient steps in the FITC approximation linearly with an increasing number of virtual training points, i.e. we use $150 + m/4$ optimization steps, as the number of hyperparameters in the FITC approximation also grows linearly with $m$. The NMSE results for randomized selection in the DTC approximation are averaged over ten runs. The learning times and training times for the different sparsifications are further shown in Fig. 2(c) and (d), respectively. As shown by the results, the costs for insertion are similar for Smola and Bartlett [13] and Quiñonero-Candela [15], as pointed out in Section 2.1. The variational framework in [11] leads to constantly higher effort in the learning process, e.g. compared to the curves for [13,15], since the regularization term increases the cost of gradient-based optimization techniques. We always use a remaining set size of $\kappa = 59$ to speed up the criteria calculation of the methods above. Our maximum error approach outperforms all DTC selection criteria with respect to training times, with a low NMSE on test data, see Fig. 2(e). For large active set sizes, we nearly reach the same accuracy as the more costly selection heuristics in [13,15] and outperform the matching pursuit approach from [14], see Fig. 2(a). Our selection criterion also performs very well for small active set sizes, which empirically justifies the assumptions adopted in the derivation of the maximum error (ME) strategy. As shown by the right column in Fig. 2, we outperform the DTC deletion criterion from Smola and Bartlett [13] and the randomized version in terms of generalization accuracy.

[Fig. 2 panels: (a) NMSE for sparse GPR approximations, (b) NMSE for various DTC deletion criteria, (c) learning time for sparse model selection, (d) deletion time of DTC-type approaches, (e) insertion time of DTC methods vs. NMSE, (f) deletion time of DTC methods vs. NMSE; all over the active set size m or the computation time in seconds.]

Fig. 2. Learning results as NMSE and learning times for the first DoF. The evaluation is performed on the SARCOS test data for different sparse GPR approximations, i.e. by Smola and Bartlett (SB) [13], Seeger (IG) [18], Titsias (VAR) [11], Keerthi and Chu (MPA) [14], Quiñonero-Candela (QC) [15], and Csató (CS) [16]. The right column shows the performance of the different deletion criteria of the DTC approximation. Our novel strategy (ME) gives the best trade-off between low computing times and accurate prediction. As shown in Fig. 2(e), our approach yields the lowest learning curve and, thus, provides the best trade-off between model accuracy and computation time.

[Fig. 3 panels: (a) NMSE on real SARCOS test data, (b) NMSE on simulated SARCOS test data, per DoF and in percent, comparing ν-SVR, LGP, GPR, RFRLS, FITC, and DTC + ME; (c) the SARCOS arm.]

Fig. 3. Prediction errors as NMSE (in percent) for each degree of freedom (DoF) after prediction on the test sets with real SARCOS data 3(a) and simulated robot data 3(b) from the SARCOS master arm 3(c). Overall, the DTC approximation with the novel and fast maximum error (ME) criterion performs best, closely followed by the FITC approximation.

Fig. 4. The PR2 robot.

Fig. 5. Feed-forward control with inverse dynamics models.


Our approach also yields the best compromise between low computational effort and high prediction precision, as shown in Fig. 2(f). In Fig. 2(b), DTC + Random (training) describes the convergence trend of the randomized sparse GP model generation which forms the basis for the subsequently evaluated deletion criteria. Thus, the deletion times presented in Fig. 2(d) depend on the current set size of $m = 2000$, and it is helpful to read all plots in the right column from right to left. Note that all deletion schemes yield better NMSE results than the simple randomized deletion.

In Fig. 3, we compare our maximum error selection method for the sparse DTC approximation and the FITC approximation [10] with other established regression procedures on all DoFs for real and simulated SARCOS test data. The results for the other methods, i.e. local Gaussian processes (LGP), ν-SVR, GPR, and random Fourier regularized least squares (RFRLS), are taken from [4] and [22]. For comparison, we also employ a final set size of $m = 2000$ active or virtual training points for both sparse GP approximations. We again use ten generalized EM steps for the hyperparameter learning of the DTC approximation and 650 gradient ascents for the optimization of the virtual training points with subsequent prediction. The higher errors for the fifth and sixth DoF on the real SARCOS test data are due to more complex non-linearities in these DoFs, e.g. induced by their joint inertia. Compared to these regression algorithms, our approach is efficient and one of the best performing methods. Compared to the FITC approximation, we have much less effort for sparse model selection with higher generalization accuracy.

4.2. Compliant and real-time tracking control

In this section, we employ learned inverse dynamics models for a real-time tracking control task [23] on a PR2 robot, as shown in Fig. 4. Here, the model-based tracking control law determines the joint torques $\mathbf{y}$ for each of the seven DoFs necessary for following a desired joint trajectory $\mathbf{x}_d, \dot{\mathbf{x}}_d, \ddot{\mathbf{x}}_d$, where $\mathbf{x}, \dot{\mathbf{x}}, \ddot{\mathbf{x}}$ are the joint angles, velocities and accelerations of the robot, as shown in Fig. 5. This control paradigm uses a dynamics model, while employing feedback in order to stabilize the system. Here, the dynamics model of the robot can be used as a feed-forward model that predicts the joint torques $\mathbf{y}_{\mathrm{ff}}$ required to perform the desired trajectory, while a feedback term $\mathbf{y}_{\mathrm{fb}}$ ensures the stability of the tracking control, with a resulting control law of $\mathbf{y} = \mathbf{y}_{\mathrm{ff}} + \mathbf{y}_{\mathrm{fb}}$. The feedback term can be a linear control law such as $\mathbf{y}_{\mathrm{fb}} = \mathbf{G}_P\mathbf{e} + \mathbf{G}_D\dot{\mathbf{e}}$, where $\mathbf{e} = \mathbf{x}_d - \mathbf{x}$ denotes the tracking error and $\mathbf{G}_P$ as well as $\mathbf{G}_D$ the position and velocity gains, respectively. If an accurate inverse dynamics model can be obtained, the feed-forward term $\mathbf{y}_{\mathrm{ff}}$ will largely cancel the robot's non-linearities [24]. In this case, the gains $\mathbf{G}_P$ and $\mathbf{G}_D$ can be chosen to have small values, enabling compliant control performance [25].
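A hedged sketch of this control law is given below; the per-DoF models with a `predict_torque` method, the diagonal gain matrices, and all names are assumptions for illustration and do not reflect the actual PR2 interface.

```python
# Feed-forward tracking control y = y_ff + y_fb with a learned inverse dynamics model.
import numpy as np

def tracking_control(models, x, xdot, x_d, xdot_d, xddot_d, G_P, G_D):
    # Feed-forward torques predicted by the learned inverse dynamics models,
    # one sparse GP per degree of freedom, queried at the desired trajectory.
    query = np.concatenate([x_d, xdot_d, xddot_d])          # 21-dimensional input
    y_ff = np.array([m.predict_torque(query) for m in models])
    # Low-gain PD feedback stabilizes the system around the desired trajectory.
    e, edot = x_d - x, xdot_d - xdot
    y_fb = G_P @ e + G_D @ edot
    return y_ff + y_fb
```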

To obtain a global and precise dynamics model, we sample 517,783 data points at a frequency of 100 Hz from the right arm of the PR2 robot. Furthermore, we train a sparse GP model given through the DTC approximation with our maximum error (ME) criterion for each of the seven DoFs. Thereby, the hyperparameters are always learned with 10 generalized EM steps, as explained in Section 3.2.

[Fig. 6 panels: (a) torque percentage for each DoF, split into DTC + ME effort, P effort, and D effort; (b) total RMSE tracking error in x, y, z for the circle, eight, and star test trajectories under DTC + ME, PD low, and PD high control.]

Fig. 6. The torque percentage for each DoF of the right PR2 arm is shown in Fig. 6(a). The higher the torque percentage, the larger the contribution of the corresponding part. Here, the DTC + ME dynamics models usually have a torque contribution of far over 50%, which enables compliant control. Plot 6(b) shows the tracking errors in task space (x, y, z) for the three test trajectories, i.e. circle-, eight- and star-shape. The RMSE value of each dimension is computed for three different control schemes, i.e. low-gain DTC + ME model-based control, PD-control with low gains, and PD-control with high gains (about four times higher gains than in the low-gain control schemes).

[Fig. 7 panels: desired trajectory vs. DTC + ME tracking and low-gain PD tracking in the y-z plane for circle-, eight- and star-shaped test trajectories.]

Fig. 7. Tracking performance on the three test trajectories using the low-gain DTC + ME model-based controller and the PD-controller with low gains.

We choose a final active set size of $m = 1000$, which is sufficient to reach a good model quality while yielding prediction times of less than 3 ms for all seven DoFs. Hereby, we are able to perform tracking control in real time at 100 Hz. In Fig. 6(a), the percentage of the total torque for each DoF of the PR2 robot arm is shown, where the gains for the feedback control are chosen very small in order to enable compliant control. The contribution of our sparse GPR model, i.e. approximated with DTC + ME, to the control effort is usually far over 50%. Here, a high contribution to the control effort indicates a well approximated model, since the feedback control loop does not need to intervene strongly during the control task. The corresponding tracking performance in task space is presented in Fig. 7 for the three test trajectories, i.e. circle-, eight- and star-shape. Here, we compare the low-gain feed-forward control using the learned dynamics model with the standard PD-control scheme. The results with respect to the root mean square error (RMSE) are shown in Fig. 6(b). It can be seen that, by applying learned dynamics models, we obtain compliant tracking control and, at the same time, tracking accuracy comparable to the high-gain PD-control scheme.

5. Conclusion

In this paper, we proposed a very fast greedy insertion and deletion scheme for sparse GPR or, more precisely, for the DTC approximation. Our criterion is based on the maximum error between the model and the training data, inspired by [13] and [16]. It leads to a stable and efficient way for automatic sparse model selection. The primary advantage of our maximum error greedy selection is the combination of high accuracy with low computational costs for the criterion calculation for all remaining points.

In contrast, the insertion methods in [11,13–15] have to select a small random subset of remaining points for criterion evaluation. This random restriction can lead to poorer results, especially on harder regression tasks. Even without caching, we are nearly as fast as a randomized insertion. For the removal of active points, our approach almost reaches the accuracy of Csató's deletion method, outperforms the deletion criterion in [13], and is still nearly as fast as a randomized removal. Compared to the FITC approximation [10], all DTC methods lead to higher prediction accuracy and lower learning times. Further results on a real PR2 robot show that the proposed method can be employed for compliant, real-time robot control.

Acknowledgments

The authors would like to thank A. Gijsberts, Idiap Research Institute, for providing his results on the SARCOS data sets. We would also like to thank S. Schaal, University of Southern California and Max-Planck-Institute for Intelligent Systems in Tübingen, for providing the image of the SARCOS master arm.

References

[1] J. Schreiter, D. Nguyen-Tuong, H. Markert, M. Hanselmann, M. Toussaint, Fast greedy insertion and deletion in sparse Gaussian process regression, in: M. Verleysen (Ed.), Proceedings of the 23rd European Symposium on Artificial Neural Networks, 2015, pp. 101–106.
[2] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, Cambridge, Massachusetts, 2006.
[3] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, Cambridge, Massachusetts, 2002.
[4] D. Nguyen-Tuong, J. Peters, M. Seeger, Local Gaussian process regression for real time online model learning and control, in: D. Koller, D. Schuurmans, Y. Bengio, L. Bottou (Eds.), Advances in Neural Information Processing Systems, vol. 21, 2009, pp. 1193–1200.
[5] A.C. Damianou, N.D. Lawrence, Deep Gaussian processes, in: C.M. Carvalho, P. Ravikumar (Eds.), Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, JMLR: W&CP, vol. 31, 2013, pp. 207–215.
[6] C.K.I. Williams, M. Seeger, Using the Nystrøm method to speed up kernel machines, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, 2001, pp. 682–688.
[7] E.G. Băzăvan, F. Li, C. Sminchisescu, Fourier kernel learning, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Proceedings of the 12th European Conference on Computer Vision, 2012, pp. 459–473.
[8] M. Lázaro-Gredilla, J. Quiñonero-Candela, C.E. Rasmussen, A.R. Figueiras-Vidal, Sparse spectrum Gaussian process regression, in: T. Jaakkola (Ed.), J. Mach. Learn. Res. 11 (2010) 1865–1881.
[9] J. Quiñonero-Candela, C.E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, in: R. Herbrich (Ed.), J. Mach. Learn. Res. 6 (2005) 1939–1959.
[10] E.L. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Y. Weiss, B. Schölkopf, J. Platt (Eds.), Advances in Neural Information Processing Systems, vol. 18, 2006, pp. 1257–1264.
[11] M.K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: D. van Dyk, M. Welling (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, JMLR: W&CP, vol. 5, 2009, pp. 567–574.
[12] M. Seeger, C.K.I. Williams, N.D. Lawrence, Fast forward selection to speed up sparse Gaussian process regression, in: C.M. Bishop, B.J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003, pp. 205–212.
[13] A.J. Smola, P.L. Bartlett, Sparse greedy Gaussian process regression, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, 2001, pp. 619–625.
[14] S.S. Keerthi, W. Chu, A matching pursuit approach to sparse Gaussian process regression, in: Y. Weiss, B. Schölkopf, J. Platt (Eds.), Advances in Neural Information Processing Systems, vol. 18, 2006, pp. 643–650.
[15] J. Quiñonero-Candela, Learning with uncertainty – Gaussian processes and relevance vector machines (Ph.D. thesis), Technical University of Denmark, 2004.
[16] L. Csató, M. Opper, Sparse representation for Gaussian process regression, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems, vol. 13, 2001, pp. 444–450.
[17] L. Csató, Gaussian processes – iterative sparse approximations (Ph.D. thesis), Aston University, Birmingham, 2002.
[18] M. Seeger, Bayesian Gaussian process models – PAC-Bayesian generalisation error bounds and sparse approximations (Ph.D. thesis), University of Edinburgh, 2003.
[19] L. Foster, A. Waagen, N. Aijaz, M. Hurley, A. Luis, J. Rinsky, C. Satyavolu, M.J. Way, P. Gazis, A. Srivastava, Stable and efficient Gaussian process calculations, in: C.K.I. Williams (Ed.), J. Mach. Learn. Res. 10 (2009) 857–882.
[20] J.V. Graça, K. Ganchev, B. Taskar, Expectation maximization and posterior constraints, in: J. Platt, D. Koller, Y. Singer, S.T. Roweis (Eds.), Advances in Neural Information Processing Systems, vol. 20, 2008, pp. 569–576.
[21] D. Nguyen-Tuong, J. Peters, Incremental sparsification for real-time online model learning, in: A. Bouchachia, N. Nedjah, C.-S. Leung (Eds.), Neurocomputing 74 (2011) 1859–1867.
[22] A. Gijsberts, Incremental learning for robotics with constant update complexity (Ph.D. thesis), University of Genoa, 2011.


[23] J. Schreiter, P. Englert, D. Nguyen-Tuong, M. Toussaint, Sparse Gaussian process regression for compliant, real-time robot control, in: A. Okamura, S. Hutchinson (Eds.), Proceedings of the IEEE International Conference on Robotics and Automation, 2015, pp. 2586–2591.
[24] M.W. Spong, S. Hutchinson, M. Vidyasagar, Robot Dynamics and Control, vol. 2, John Wiley & Sons, Hoboken, New Jersey, 2004.
[25] D. Nguyen-Tuong, J. Peters, M. Seeger, Computed torque control with nonparametric regression models, in: Proceedings of the American Control Conference, 2008, pp. 212–217.

Jens Schreiter received his Bachelor and Master degrees in mathematics from the University of Applied Sciences in Mittweida, Germany, in 2010 and 2012, respectively. Since 2012, he has been working on his Doctoral degree at the Bosch Corporate Research facility in Renningen, located to the west of Stuttgart, Germany. The Machine Learning and Robotics Laboratory at the University of Stuttgart, Germany, supports his industrial Ph.D. studentship. His research interests include Gaussian process modeling techniques for transient systems and active learning methods for optimal design of experiments, both with automotive applications.

Duy Nguyen-Tuong has been with Bosch Corporate Research in Renningen since 2011, where he works on machine learning with a focus on robotics and automotive applications. Before joining Bosch Research, he studied control and automation engineering at the University of Stuttgart and the National University of Singapore. From 2007 to 2011, he was a member of the Robot Learning Laboratory at the Max Planck Institute for Biological Cybernetics, where he completed his Ph.D. studies. His main research interest is the application of machine learning techniques for model learning. Recently, he has also become interested in control and policy learning, which is useful for numerous automotive and industrial applications.

Marc Toussaint has been a full Professor for Machine Learning and Robotics at the University of Stuttgart since 2012. Before that, he was an Assistant Professor and led an Emmy Noether research group at FU & TU Berlin. His research focuses on the combination of decision theory and machine learning, motivated by fundamental research questions in robotics. Specific interests are combining geometry, logic and probabilities in learning and reasoning, and appropriate representations and priors for real-world robot learning. He is a Coordinator of the German research priority program Autonomous Learning, a Reviewer for the German Research Foundation, and a Program Committee Member of several top conferences (UAI, RSS, ICRA, AISTATS, ICML).