
Page 1

Gaussian Processes - Part III: Advanced Topics

Philipp Hennig

MLSS 2013, 30 August 2013

Max Planck Institute for Intelligent Systems
Department of Empirical Inference
Tübingen, Germany

Page 2

Gaussians have been discovered before
In virtually every area of science affected by uncertainty

▸ thermodynamics: Brownian motion, Ornstein-Uhlenbeck process
▸ stochastic calculus: stochastic differential equations, Itô calculus
▸ control theory: stochastic control, Kalman filter
▸ signal processing: filtering
▸ other communities use other names for the same concept:
  Kriging; ridge regression; Kolmogorov-Wiener prediction; least-squares regression; Wiener process; Brownian bridge, . . .
▸ now: Gaussians show up in numerical methods, too:
  quadrature, optimization, solving ODEs, control, . . .

Gaussian processes are central to many machine learning techniques, and to all areas of quantitative science.

Page 3

The big picture
we need a coherent framework for hierarchical machine learning

[Diagram: data D, latent quantity x, parameters θ, connected by "conditioning by quadrature", "estimation by optimization", "prediction by analysis", and "action by control".]

▸ uncertainty caused by finite computations should be accounted for
▸ uncertainty should propagate among numerical methods
▸ joint language required: probability

"off-the-shelf" methods are convenient, but not always efficient.

Page 4

Numerical algorithms are the elements of inference
inferring solutions of non-analytic problems        http://www.probabilistic-numerics.org

Numerical algorithms estimate (infer) an intractable property of a function, given evaluations of function values.

quadrature    estimate ∫_a^b f(x) dx                           given {f(x_i)}
optimization  estimate arg min_x f(x)                          given {f(x_i), ∇f(x_i)}
analysis      estimate x(t) under x′ = f(x, t)                 given {f(x_i, t_i)}
control       estimate min_u x(t, u) under x′ = f(x, t, u)     given {f(x_i, t_i, u_i)}

▸ even deterministic problems can be uncertain
▸ not a new idea¹, but rarely studied

We need a theory of probabilistic numerics.

Gaussians, because of their connection to linear functions, are at the heart of probabilistic interpretations of numerics.

¹ H. Poincaré 1896; Diaconis 1988; O'Hagan 1992

Page 5

Recall: GPs are closed under linear maps

p(z) = N(z; µ, Σ)   ⇒   p(Az) = N(Az; Aµ, AΣA⊺)

▸ this is not restricted to finite linear operators (matrices) A
▸ A(x) = I(a < x < b) gives Af = ∫_a^b f(x) dx

p( ∫_a^b f(x) dx, ∫_c^d f(x) dx )
  = N[ ( ∫_a^b f(x) dx, ∫_c^d f(x) dx ) ; ( ∫_a^b µ(x) dx, ∫_c^d µ(x) dx ) ,
       ( ∫_a^b ∫_a^b k(x,x′) dx dx′    ∫_a^b ∫_c^d k(x,x′) dx dx′
         ∫_a^b ∫_c^d k(x,x′) dx dx′    ∫_c^d ∫_c^d k(x,x′) dx dx′ ) ]
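To make the closure concrete, here is a small numpy sketch (not from the slides): the two integrals are approximated by Riemann sums, i.e. explicit linear maps A acting on function values on a grid, so p(Af) = N(Aµ, AKA⊺) follows directly. Kernel, length scale and integration limits are illustrative choices.

```python
# Minimal sketch: a GP pushed through two discretized integration operators.
import numpy as np

def k_rbf(x, y, ell=0.3):
    return np.exp(-0.5 * (x[:, None] - y[None, :])**2 / ell**2)

x = np.linspace(0.0, 1.0, 400)               # fine grid over the input domain
mu = np.zeros_like(x)                         # prior mean µ(x) = 0
K = k_rbf(x, x)                               # prior covariance k(x, x')
dx = x[1] - x[0]

def indicator_operator(a, b):
    """Row vector A with A f ≈ ∫_a^b f(x) dx (Riemann sum)."""
    return ((x > a) & (x < b)).astype(float) * dx

A = np.vstack([indicator_operator(0.0, 0.5),    # F1 = ∫_0^0.5 f dx
               indicator_operator(0.3, 1.0)])   # F2 = ∫_0.3^1 f dx

m_F = A @ mu                                  # p(Af) = N(Aµ, A K Aᵀ)
S_F = A @ K @ A.T
print("mean of (F1, F2):", m_F)
print("covariance of (F1, F2):\n", S_F)
```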

Page 6

Inferring F = ∫ f from observations of f
quadrature


Page 8

Quadrature with GPs
A. O'Hagan, 1991; T. Minka, 2000; M. Osborne et al., 2012

[Figure: integrand over x ∈ [0, 1] with the number of evaluations per location x∗ (y-axis: # evals, 0 to 10).]

▸ say what functions you expect to integrate
▸ find arg min_X [ k_FF − k_{F f_X} k_{f_X f_X}⁻¹ k_{f_X F} ]   (depends on the kernel!)
▸ Bayesian quadrature   (a small numerical sketch follows below)
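A minimal numerical sketch of Bayesian quadrature under these assumptions: an RBF kernel (an arbitrary choice), a handful of nodes, and kernel means ∫ k(x, x_i) dx computed by a simple Riemann sum (for the RBF kernel they are also available in closed form via error functions). Per the slide, the node set X would then be chosen to minimize the resulting posterior variance.

```python
# Hedged sketch of Bayesian quadrature for F = ∫_0^1 f(x) dx under a GP prior.
import numpy as np

def k_rbf(x, y, ell=0.2):
    return np.exp(-0.5 * np.subtract.outer(x, y)**2 / ell**2)

f = lambda x: np.sin(3 * x) + x**2             # integrand (example choice)
X = np.array([0.1, 0.35, 0.6, 0.85])           # evaluation nodes
y = f(X)

grid = np.linspace(0.0, 1.0, 2000)
w = np.full(grid.size, grid[1] - grid[0])      # Riemann-sum quadrature weights

kFf = w @ k_rbf(grid, X)                       # ∫ k(x, X) dx        (n-vector)
kFF = w @ k_rbf(grid, grid) @ w                # ∫∫ k(x, x') dx dx'  (scalar)
Kff = k_rbf(X, X) + 1e-10 * np.eye(X.size)

alpha = np.linalg.solve(Kff, y)
post_mean = kFf @ alpha                        # E[F | f(X)]
post_var = kFF - kFf @ np.linalg.solve(Kff, kFf)   # V[F | f(X)]
print(post_mean, post_var)
```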

Page 9

Gaussian processes can be used to construct quadrature rules.

Page 10

Inferring f from observations of F

µ_{f|F_X} = µ_f + k_{f F_X} k_{F_X F_X}⁻¹ (F_X − ∫_X µ)
k_{ff|F_X} = k_{ff} − k_{f F_X} k_{F_X F_X}⁻¹ k_{F_X f}
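The reverse direction, sketched with the same discretized kernel integrals as above: condition the GP over f on a single (noisy) observation of its integral F_X. All numerical settings here are illustrative.

```python
# Sketch: condition a GP over f on one noisy observation of F = ∫_0^1 f(x) dx.
import numpy as np

def k_rbf(x, y, ell=0.25):
    return np.exp(-0.5 * np.subtract.outer(x, y)**2 / ell**2)

grid = np.linspace(0.0, 1.0, 500)
w = np.full(grid.size, grid[1] - grid[0])       # weights such that w @ v ≈ ∫ v(x) dx

mu = np.zeros(grid.size)                        # prior mean µ_f = 0
K = k_rbf(grid, grid)

F_obs = 0.7                                     # observed integral value (example)
kfF = K @ w                                     # k_{f F_X} = ∫ k(x, x') dx'
kFF = w @ K @ w + 1e-6                          # k_{F_X F_X} plus observation noise

mu_post = mu + kfF * (F_obs - w @ mu) / kFF     # µ_{f|F_X}
K_post = K - np.outer(kfF, kfF) / kFF           # k_{ff|F_X}
print(mu_post[:5], np.diag(K_post)[:5])
```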

Page 11

Optimization
continuous, nonlinear, unconstrained

For f : R^N → R, find a local minimum arg min f(x), starting at x₀.

An old idea: Newton's method

f(x) ≈ f(x_t) + (x − x_t)⊺ ∇f(x_t) + ½ (x − x_t)⊺ ∇∇⊺f(x_t) (x − x_t),   with B(x_t) := ∇∇⊺f(x_t)
⇒ x_{t+1} = x_t − B⁻¹(x_t) ∇f(x_t)

Cost: O(N³)

High-dimensional optimization requires giving up knowledge in return for lower cost.
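A plain implementation of the Newton iteration above, on a small convex test function chosen only for illustration; each step solves an N×N linear system, hence the O(N³) cost per step.

```python
# Newton's method as on the slide: local quadratic model, one O(N^3) solve per step.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)                     # convex quadratic part (example)

def f(x):    return 0.5 * x @ A @ x + np.sin(x).sum()
def grad(x): return A @ x + np.cos(x)
def hess(x): return A + np.diag(-np.sin(x))

x = rng.standard_normal(5)
for t in range(20):
    B = hess(x)                                 # B(x_t) = ∇∇ᵀ f(x_t)
    x = x - np.linalg.solve(B, grad(x))         # x_{t+1} = x_t − B⁻¹(x_t) ∇f(x_t)
print("‖∇f‖ at the found point:", np.linalg.norm(grad(x)))
```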

Page 12

Quasi-Newton methods (think BFGS, DFP, . . . )
aka. variable metric optimization: low-rank estimators for Hessians

▸ Instead of evaluating the Hessian, build a (low-rank) estimator fulfilling the local difference relation . . .

      y_t := ∇f(x_{t+1}) − ∇f(x_t) = B_{t+1}(x_{t+1} − x_t) = B_{t+1} s_t

▸ . . . otherwise close to the previous estimator in ‖B_{t+1} − B_t‖_{F,V}
▸ . . . so minimize a regularised loss (a one-line check of the resulting update follows below)

      B_{t+1} = arg min_{B ∈ R^{N×N}} { lim_{β→0} (1/β) ‖y_t − B s_t‖²_V + ‖B − B_t‖²_{F,V} }
              = lim_{β→0} arg max_B N(y_t; B s_t, βV) · N(B⃗; B⃗_t, V ⊗ V)
              = arg max_B N[ B; B_t + (y_t − B_t s_t) s_t⊺V / (s_t⊺V s_t),  V ⊗ (V − V s s⊺V / (s⊺V s)) ]

  (the last Gaussian is the posterior)

Quasi-Newton methods perform local maximum a-posteriori Gaussian inference on the Hessian's elements.
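A quick check of the boxed claim, under the simplifying assumption V = I and without the symmetrization used by DFP/BFGS: the posterior-mean expression from the slide is a rank-one update that satisfies the secant condition B_{t+1} s_t = y_t exactly.

```python
# Sketch: the (non-symmetric) quasi-Newton update as a Gaussian posterior mean.
import numpy as np

rng = np.random.default_rng(1)
N = 4
B_t = np.eye(N)                                 # current Hessian estimate
V = np.eye(N)                                   # prior covariance factor (a choice)
s = rng.standard_normal(N)                      # step s_t = x_{t+1} − x_t
y = rng.standard_normal(N)                      # gradient difference y_t

Vs = V @ s
B_next = B_t + np.outer(y - B_t @ s, Vs) / (s @ Vs)   # posterior mean from the slide

print(np.allclose(B_next @ s, y))               # secant condition B_{t+1} s_t = y_t
print(np.linalg.norm(B_next - B_t, "fro"))      # the change is a rank-one correction
```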

Page 13

Optimization with GPs
nonparametric quasi-Newton methods        Hennig & Kiefel, ICML 2012, JMLR 2013

▸ Idea: replace
      ∇f(x_{t+1}) − ∇f(x_t) ≈ B (x_{t+1} − x_t)
  by
      ∇f(x_{t+1}) − ∇f(x_t) = ∫_{x_t}^{x_{t+1}} B(x) dx
▸ Gaussian process prior on B(x⊺, x):
      p(B) = GP(B; B_0(x⊺, x), k(x⊺, x′⊺) ⊗ k(x, x′))
▸ Gaussian likelihoods:
      p(y_i(x⊺) | B, s_i)  = lim_{β→0} N( y_i; Σ_m s_{im} ∫_0^1 B(x⊺, x(t)) dt, k(x⊺, x′⊺) ⊗ β )
      p(y_i(x)⊺ | B, s_i⊺) = lim_{β→0} N( y_i⊺; Σ_m s_{im}⊺ ∫_0^1 B(x⊺(t), x) dt, β ⊗ k(x, x′) )
▸ the posterior has the same algebraic form as before, but with linear maps of nonlinear (integrals of k) entries
▸ same computational complexity as L-BFGS (Nocedal, 1980): O(N)

Page 14

A consistent model of the Hessian function
nonparametric inference on elements of the Hessian        P.H. & M. Kiefel, ICML 2012, JMLR 2013

[Figure: four panels over roughly [−2, 2] × [−1, 3] illustrating the inferred Hessian model; colour bars range up to a few thousand.]


Page 19

nonparametric quasi-Newton methods
functional generalizations        Hennig, ICML 2013

Learning nonparametric models of Hessians allows
▸ optimizing noisy functions
▸ dynamically changing functions
▸ parallelization
▸ . . .

[Figure: ‖x − x∗‖ (10⁻²⁵ to 10¹) against # function evaluations (10⁰ to 10²), log-log, for gradient descent and Newton (left panel) and Hessian-free and the nonparametric method (right panel).]

Page 20

Gaussian processes can be used in optimization.

Page 21

GPs are closed under differentiation

µ_{f|f′_X} = µ_f + k_{f f′_X} k_{f′_X f′_X}⁻¹ (f′_X − µ′_X)
k_{ff|f′_X} = k_{ff} − k_{f f′_X} k_{f′_X f′_X}⁻¹ k_{f′_X f}

Rasmussen & Williams, 2006, §9.4
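A sketch of this closure in practice, assuming a one-dimensional RBF kernel (for which the cross-covariances cov(f, f′) and cov(f′, f′) have the simple closed forms used below): condition a zero-mean GP on observed gradients and read off the posterior over f.

```python
# Conditioning an RBF-kernel GP on derivative observations f'(X)
# (cf. Rasmussen & Williams §9.4); all data values are illustrative.
import numpy as np

ell = 0.5
d    = lambda a, b: np.subtract.outer(a, b)                      # pairwise x − x'
k    = lambda a, b: np.exp(-0.5 * d(a, b)**2 / ell**2)           # cov(f(x), f(x'))
k_fd = lambda a, b: k(a, b) * d(a, b) / ell**2                   # cov(f(x), f'(x'))
k_dd = lambda a, b: k(a, b) * (1/ell**2 - d(a, b)**2 / ell**4)   # cov(f'(x), f'(x'))

X  = np.array([-1.0, 0.0, 1.0])             # where the gradient is observed
dy = np.array([ 1.0, 0.0, -1.0])            # observed f'(X) (example values)
xs = np.linspace(-2, 2, 7)                  # prediction points

G = k_dd(X, X) + 1e-9 * np.eye(X.size)
mu_post = k_fd(xs, X) @ np.linalg.solve(G, dy)                   # µ_{f|f'_X}
cov_post = k(xs, xs) - k_fd(xs, X) @ np.linalg.solve(G, k_fd(xs, X).T)
print(mu_post)
print(np.diag(cov_post))
```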

Page 22

GPs can have multiple outputs
Reminder of Part I

Page 23

Solving ODEs with GPs
observe c′(t), infer c(t)        Skilling, 1991

solve c′(t) = f(c(t), t)   such that c(0) = a and c(1) = b

p(c(t)) = GP(c; µ_c, V ⊗ k)
p(y_t | c) = N(f(c_t, t); c′_t, U)

▸ repeatedly estimate c_t using the GP posterior mean, to "observe" c′(t) = f(c_t) + δf   (a toy sketch follows below)
▸ estimate the error in this observation by propagating Gaussian uncertainty through f

Recent work:
▸ Chkrebtii, Campbell, Girolami, Calderhead   http://arxiv.org/abs/1306.2365
▸ Hennig & Hauberg                            http://arxiv.org/abs/1306.0308
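A toy sketch in the spirit of this slide, not the specific algorithms of Chkrebtii et al. or Hennig & Hauberg: walk along a time grid, "observe" c′(t) = f(ĉ_t, t) at the current posterior mean ĉ_t, and refit a GP over c(t). The ODE, kernel and step size are illustrative choices, and numerical robustness is not the point here.

```python
# Toy Skilling-style GP ODE solver sketch for c' = -c, c(0) = 1 on [0, 2].
import numpy as np

f = lambda c, t: -c                              # example ODE; true solution e^{-t}
c0, ell = 1.0, 0.8
d   = lambda a, b: np.subtract.outer(a, b)
kff = lambda a, b: np.exp(-0.5 * d(a, b)**2 / ell**2)            # cov(c, c)
kfd = lambda a, b: kff(a, b) * d(a, b) / ell**2                  # cov(c, c')
kdd = lambda a, b: kff(a, b) * (1/ell**2 - d(a, b)**2 / ell**4)  # cov(c', c')

tv, yv = np.array([0.0]), np.array([c0])         # value observation c(0) = c0
td, yd = [0.0], [f(c0, 0.0)]                     # derivative observations, seeded at t = 0

def post_mean(tq):
    """Posterior mean of c(tq) given the value and derivative observations."""
    T = np.array(td)
    y = np.concatenate([yv, yd])
    G = np.block([[kff(tv, tv), kfd(tv, T)],
                  [-kfd(T, tv), kdd(T, T)]]) + 1e-6 * np.eye(1 + T.size)
    # note: cov(c'(T), c(0)) = -kfd(T, tv) elementwise for this kernel
    kq = np.concatenate([kff(np.atleast_1d(tq), tv),
                         kfd(np.atleast_1d(tq), T)], axis=1)
    return (kq @ np.linalg.solve(G, y))[0]

for t in np.linspace(0.2, 2.0, 10):
    c_hat = post_mean(t)                         # current estimate of c(t)
    td.append(t); yd.append(f(c_hat, t))         # "observe" c'(t) = f(c_hat, t) there

print(post_mean(2.0), np.exp(-2.0))              # GP estimate vs. true solution
```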


Page 26

The Advantages of a Probabilistic Formulation
joint uncertainty over solution        Hennig & Hauberg, under review

[Figure: projection onto the 1st and 2nd principal components; legend: x, solver c.]

Page 27

The Advantages of a Probabilistic Formulation
uncertainty over problem        Hennig & Hauberg, under review

[Figure: x₁ vs. x₂ (arbitrary units); legend: x, solver c.]

Page 28

Gaussian processes can be used to solve differential equations.

Page 29

Lots of "Gaussian integrals" are known
and can be used to map uncertainty through almost any function        see e.g. M. Deisenroth's PhD, 2010

▸ write f(x) = Σ_i φ_i(x)⊺ w such that
      ∫ φ_i(x) N(x; µ, Σ) dx   and   ∫ φ_i(x) φ_j(x) N(x; µ, Σ) dx
  are analytic

Page 30

Lots of "Gaussian integrals" are known
and can be used to map uncertainty through almost any function        see e.g. M. Deisenroth's PhD, 2010

∫ f(x) N(x; µ, Σ) dx  = Σ_i w_i ∫ φ_i(x) N(x; µ, Σ) dx
∫ f²(x) N(x; µ, Σ) dx = Σ_i Σ_j w_i w_j ∫ φ_i(x) φ_j(x) N(x; µ, Σ) dx

▸ this also works if f ∈ R^N, and if p(w) = N(w; m, V)   (sketch below)
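A sketch of this recipe with unnormalised Gaussian (RBF) features, for which both feature moments are analytic; the feature centres, width and weights are arbitrary illustrative choices, and a Monte Carlo estimate is printed alongside as a check.

```python
# Analytic E[f(X)] and E[f(X)^2] for f(x) = Σ_i w_i φ_i(x) with RBF features,
# X ~ N(µ, σ²); checked against sampling.
import numpy as np

def normpdf(x, m, v):                        # N(x; m, v), v = variance
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

c = np.linspace(-2, 2, 9)                    # feature centres (illustrative)
lam2 = 0.3**2                                # feature width λ²
w = np.sin(c)                                # some fixed weights
mu, sig2 = 0.5, 0.8**2                       # X ~ N(µ, σ²)

# analytic feature moments under the Gaussian measure
Ephi = np.sqrt(2*np.pi*lam2) * normpdf(c, mu, lam2 + sig2)
Ephiphi = (2*np.pi*lam2) * normpdf(c[:, None], c[None, :], 2*lam2) \
          * normpdf(0.5*(c[:, None] + c[None, :]), mu, 0.5*lam2 + sig2)

Ef  = w @ Ephi                               # E[f(X)]
Ef2 = w @ Ephiphi @ w                        # E[f(X)^2]
print("analytic:", Ef, Ef2 - Ef**2)

xs = np.random.default_rng(0).normal(mu, np.sqrt(sig2), 200_000)
fx = (w * np.exp(-0.5*(xs[:, None] - c)**2 / lam2)).sum(axis=1)
print("sampled :", fx.mean(), fx.var())
```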

Page 31

Some useful Gaussian integrals
an expressive basis set for function approximation

∫ x^p N(x; 0, σ²) dx = { 0                     if p odd
                         σ^p · 1·3·…·(p−1)     if p even }

∫ |x|^p N(x; 0, σ²) dx = σ^p 2^{p/2} Γ((p+1)/2) / √π

∫ (x − m)⊺ V (x − m) N(x; µ, Σ) dx = (µ − m)⊺ V (µ − m) + tr[V Σ]

∫ N(x; a, A) N(x; b, B) dx = N(a; b, A + B)

∫ [ ∫_{−∞}^{(x−m)/s} N(t; 0, 1) dt ] N(x; µ, σ²) dx = ∫_{−∞}^{(µ−m)/√(s²+σ²)} N(t; 0, 1) dt
                                                     = ½ [ 1 + erf( (µ − m) / √(2(s² + σ²)) ) ]

cf. D.B. Owen, A table of normal integrals. Comm. Stat.-Sim. Comp., 1980
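A quick numerical sanity check of a few of these identities in one dimension (grid quadrature against the closed forms); all parameter values are arbitrary.

```python
# Numerical checks of some tabulated Gaussian integrals (1-d, illustrative).
import numpy as np
from math import gamma, sqrt, pi

x = np.linspace(-12, 12, 200_001); dx = x[1] - x[0]
N = lambda t, m, v: np.exp(-0.5*(t - m)**2 / v) / np.sqrt(2*np.pi*v)
sigma = 0.7

p = 4                                         # even moment: σ^p · 1·3·…·(p−1)
print(np.sum(x**p * N(x, 0, sigma**2)) * dx, sigma**p * np.prod(np.arange(1, p, 2)))

p = 3                                         # absolute moment
print(np.sum(np.abs(x)**p * N(x, 0, sigma**2)) * dx,
      sigma**p * 2**(p/2) * gamma((p + 1)/2) / sqrt(pi))

m, v, mu, s2 = 0.4, 2.0, -0.3, 0.5**2         # quadratic form, 1-d case
print(np.sum(v*(x - m)**2 * N(x, mu, s2)) * dx, v*(mu - m)**2 + v*s2)

a, A, b, B = 0.2, 0.6, -0.5, 0.9              # product of two Gaussians
print(np.sum(N(x, a, A) * N(x, b, B)) * dx, N(a, b, A + B))
```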

Page 32

Expected values of monomials
for moment computations

[Figure: monomial moments of a Gaussian over roughly x ∈ [−4, 8].]

∫ x^p N(x; µ, σ²) dx = σ^p (−i √2 sgn µ)^p U( −p/2, 1/2, −µ²/(2σ²) ),   p ∈ N₀

where U is Tricomi's confluent hypergeometric function (cheap to evaluate)


Page 36

Expected values of error functions
for moment computations        e.g. Rasmussen & Williams, §3.9

[Figure: Φ((x − m)/v) and a Gaussian over roughly x ∈ [−4, 6].]

Φ(x) = ∫_{−∞}^{x} N(t; 0, 1) dt,        z = (µ − m) / √(v² + σ²)

∫ Φ((x − m)/v) N(x; µ, σ²) dx = Φ(z)

Page 37

Expected values of error functions
for moment computations        e.g. Rasmussen & Williams, §3.9

[Figure: Φ((x − m)/v) and a Gaussian over roughly x ∈ [−4, 6].]

Φ(x) = ∫_{−∞}^{x} N(t; 0, 1) dt,        z = (µ − m) / √(v² + σ²)

∫ x Φ((x − m)/v) N(x; µ, σ²) dx = µ Φ(z) + σ²/√(v² + σ²) · N(z; 0, 1)

Page 38

Expected values of error functions
for moment computations        e.g. Rasmussen & Williams, §3.9

[Figure: Φ((x − m)/v) and a Gaussian over roughly x ∈ [−4, 6].]

Φ(x) = ∫_{−∞}^{x} N(t; 0, 1) dt,        z = (µ − m) / √(v² + σ²)

∫ x² Φ((x − m)/v) N(x; µ, σ²) dx = (µ² + σ²) Φ(z) + ( 2µσ²/√(v² + σ²) − zσ⁴/(v² + σ²) ) N(z; 0, 1)

(a numerical check of these three identities follows below)
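The three error-function expectations from these slides, checked numerically for one arbitrary choice of (m, v, µ, σ).

```python
# Grid-quadrature check of the Φ-moment identities on slides 36-38.
import numpy as np
from math import erf, sqrt, pi

x = np.linspace(-15, 15, 400_001); dx = x[1] - x[0]
Npdf = lambda t, m, v: np.exp(-0.5*(t - m)**2 / v) / np.sqrt(2*np.pi*v)
Phi  = lambda t: 0.5*(1.0 + np.vectorize(erf)(t / np.sqrt(2.0)))   # standard normal CDF

m, v, mu, sig = 0.3, 0.8, -0.4, 0.6             # arbitrary parameter choice
z = (mu - m) / sqrt(v**2 + sig**2)
Phi_z = 0.5 * (1 + erf(z / sqrt(2)))
phi_z = np.exp(-0.5*z**2) / np.sqrt(2*pi)       # N(z; 0, 1)
w = Phi((x - m)/v) * Npdf(x, mu, sig**2)        # integrand weight Φ((x−m)/v) N(x; µ, σ²)

print(np.sum(w) * dx,        Phi_z)
print(np.sum(x * w) * dx,    mu*Phi_z + sig**2/np.sqrt(v**2 + sig**2) * phi_z)
print(np.sum(x**2 * w) * dx, (mu**2 + sig**2)*Phi_z
      + (2*mu*sig**2/np.sqrt(v**2 + sig**2) - z*sig**4/(v**2 + sig**2)) * phi_z)
```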

Page 39

Treating Cancer with GPs
Analytical Probabilistic Modelling in Radiation Therapy        Bangert, Hennig, Oelfke, 2013

[Image, source: wikipedia]

Page 40

the data
CT images        source: wikipedia

Page 41

the parameter space
multi-beam plans        source: M. Bangert, DKFZ

Page 42

setup errors can be disastrous
human bodies are complicated        Mark Bangert, DKFZ

▸ setup errors of 5 mm and less can drastically change the clinical outcome
▸ accounting for these errors is currently not clinical practice
▸ some prior work²,³, but problems of computational cost

² Unkelbach et al.: Reducing the sensitivity of IMPT treatment plans to setup errors and range uncertainties via probabilistic treatment planning. 2009 Med. Phys. 36: 149
³ Sobotta et al.: Accelerated evaluation of the robustness of treatment plans against geometric uncertainties by Gaussian processes. 2012 Phys. Med. Biol. 57 (23): 8023

Page 43

Propagating Gaussian uncertainty through nonlinearities
using integrals against Gaussian measures

[Figure (FIG. 1 of the paper): (a) depth dose profile of a 147 MeV proton beam (crosses) and the suggested analytical approximation (solid), composed of a weighted superposition of ten Gaussians (dashed); (b) measured lateral profile of a 1 cm × 1 cm 6 MeV photon beam at 50 mm depth (crosses) and the suggested analytical approximation with two error functions (solid).]

Because exponential functions can be integrated analytically against Gaussians, the dose in voxel i takes the form (eq. (6) of the paper)

d_i = Σ_j w_j · [ ∫_{a^x_{ij}}^{b^x_{ij}} dµ^x_j / √(2πσ²_{ij}) · e^{−(x_{ij} − µ^x_j)² / (2σ²_{ij})} ] · [ ∫_{a^y_{ij}}^{b^y_{ij}} dµ^y_j / √(2πσ²_{ij}) · e^{−(y_{ij} − µ^y_j)² / (2σ²_{ij})} ] · A e^{−c z_{ij}}        (6)

where x_{ij} and y_{ij} denote the lateral coordinates of voxel i projected onto the beam orientation of beamlet j; the integration limits a^{x/y}_{ij} and b^{x/y}_{ij} are given by the projection of the aperture outline onto the geometric depth z^{geo}_{ij} of voxel i from the virtual photon source of beamlet j; σ_{ij} increases with the radiological depth z_{ij} to account for the blur-out of the photon penumbra; and A and c are found with a least-squares fit of measured depth dose profiles.

▸ works on virtually any continuous function
▸ guaranteed numerical precision, fixed at design time
▸ low computational cost: just matrix-matrix multiplications

Page 44

Error Bars on Radiation Dose
setup error 1 mm ± 2 mm, range error 3%        Bangert, Hennig, Oelfke, 2013

[Figure (FIG. 5 of the paper): nominal dose d (first row), expectation value E[d] (second row), and standard deviation σ for delivery in 1 (third row), 10 (fourth row), and 30 fractions (fifth row), for 9 photon beams (left column) and 3 proton beams (right column); dose maps over roughly 400 mm × 250 mm. Assumed: a range error of 3%, a systematic setup error of 1 mm, and a random setup error of 2 mm. The colour scale of the first row (5 to 65 Gy) applies to the nominal dose and the expectation value; the colour scale of the third row (1 to 9 Gy) applies to the standard deviation.]

Page 45

Gaussian algebra can be used to build numerical methods for probabilistic computations.

Page 46

Gaussians provide the linear algebra of inference

▸ products of Gaussians are Gaussians
      N(x; a, A) · N(x; b, B) = N(x; c, C) · N(a; b, A + B),
      C := (A⁻¹ + B⁻¹)⁻¹,   c := C(A⁻¹a + B⁻¹b)

▸ marginals of Gaussians are Gaussians
      ∫ N[ (x, y); (µ_x, µ_y), (Σ_xx, Σ_xy; Σ_yx, Σ_yy) ] dy = N(x; µ_x, Σ_xx)

▸ (linear) conditionals of Gaussians are Gaussians
      p(x | y) = p(x, y) / p(y) = N( x; µ_x + Σ_xy Σ_yy⁻¹ (y − µ_y), Σ_xx − Σ_xy Σ_yy⁻¹ Σ_yx )

▸ linear projections of Gaussians are Gaussians
      p(z) = N(z; µ, Σ)  ⇒  p(Az) = N(Az; Aµ, AΣA⊺)

▸ analytical integrals allow moment matching, "projection to Gaussians"
      ∫ f(x) N(x; µ, Σ) dx is known e.g. for f(x) = x^p, erf(x), N(x), x⊺Vx

(a code sketch of these operations follows below)
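A compact sketch of this "linear algebra" for multivariate Gaussians represented by (mean, covariance) pairs; the test numbers are arbitrary. Marginalisation is simply reading off the corresponding sub-blocks of µ and Σ.

```python
# Product, conditional and linear projection of multivariate Gaussians.
import numpy as np

def product(a, A, b, B):
    """N(x;a,A)·N(x;b,B) = N(x;c,C)·N(a;b,A+B); returns c, C."""
    C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
    c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
    return c, C

def condition(mu, Sigma, ix, iy, y):
    """p(x|y) for a joint Gaussian split into blocks x (indices ix) and y (indices iy)."""
    Sxx, Sxy = Sigma[np.ix_(ix, ix)], Sigma[np.ix_(ix, iy)]
    Syy = Sigma[np.ix_(iy, iy)]
    gain = Sxy @ np.linalg.inv(Syy)
    return mu[ix] + gain @ (y - mu[iy]), Sxx - gain @ Sxy.T

def project(A, mu, Sigma):
    """p(Az) = N(Az; Aµ, AΣAᵀ)."""
    return A @ mu, A @ Sigma @ A.T

rng = np.random.default_rng(2)
L = rng.standard_normal((4, 4)); Sigma = L @ L.T + np.eye(4); mu = rng.standard_normal(4)

print(product(np.zeros(2), np.eye(2), np.ones(2), 2*np.eye(2)))
mx, Sx = condition(mu, Sigma, ix=[0, 1], iy=[2, 3], y=np.array([0.5, -1.0]))
print(mx, np.diag(Sx))
print(project(np.array([[1.0, 1.0, 0.0, 0.0]]), mu, Sigma))
```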

Page 47

Generalised linear models learn nonlinear functions

f(x) = φ(x)⊺ w,        p(w) = N(w; µ, Σ)
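A sketch of such a generalised linear model with a Gaussian weight prior, fit to noisy data; the RBF feature set, noise level and data are illustrative choices. The Gaussian posterior over w induces a Gaussian posterior over f(x) = φ(x)⊺w at any test input.

```python
# Bayesian generalised linear regression: Gaussian prior on w, Gaussian posterior on f.
import numpy as np

centres = np.linspace(-3, 3, 12)
phi = lambda x: np.exp(-0.5*(x[:, None] - centres[None, :])**2 / 0.3**2)   # RBF features

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, 25)
y = np.sin(X) + 0.1 * rng.standard_normal(25)    # noisy observations

F = phi(X)                        # n × d feature matrix
S0 = np.eye(F.shape[1])           # prior p(w) = N(0, Σ), here Σ = I
noise = 0.1**2

Sw = np.linalg.inv(F.T @ F / noise + np.linalg.inv(S0))   # posterior covariance of w
mw = Sw @ F.T @ y / noise                                  # posterior mean of w

xs = np.linspace(-3, 3, 5)
print(phi(xs) @ mw)                                        # posterior mean of f
print(np.sqrt(np.diag(phi(xs) @ Sw @ phi(xs).T)))          # posterior std of f
```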


Page 49

infinite feature sets give nonparametric models

p(f) = GP(f; µ, k)

Page 50

Gaussian processes are powerful, but not magic

powerful models
▸ kernels use infinitely many features
▸ kernels can be combined to form expressive models
▸ hyperparameters can be learned by hierarchical inference
▸ individual nonlinear effects can be separated from superpositions
▸ some kernels are universal

but no magic
▸ every model has parameters chosen a priori
▸ universal kernels can have logarithmic convergence rates

Page 51

Gaussian processes are at the heart of probabilistic numerics

Gaussians have great algebraic properties
▸ GPs are closed under linear projections, including
  ▸ differentiation
  ▸ integration
▸ GPs can be integrated against an expressive set of functions

They are the elementary tool of probabilistic numerics
▸ quadrature rules can be derived from GPs
▸ quasi-Newton optimization can be generalised using GPs
▸ GPs allow ODE solvers capable of handling probabilistic input
▸ moment matching allows numerical probabilistic computations

Numerics is about turning nonlinear problems into linear ones. That's what Gaussian regression does.

Page 52

Questions?

Page 53

Bibliography

▸ A. O'Hagan. Bayes-Hermite Quadrature. J. Statistical Planning and Inference 29, pp. 245–260
▸ C.E. Rasmussen & C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006
▸ T. Minka. Deriving quadrature rules from Gaussian processes. Tech. Report, 2000
▸ M.A. Osborne, D. Duvenaud, R. Garnett, C.E. Rasmussen, S.J. Roberts, Z. Ghahramani. Active Learning of Model Evidence Using Bayesian Quadrature. Advances in NIPS, 2012
▸ P. Hennig & M. Kiefel. Quasi-Newton Methods: a new direction. ICML 2012 (short form), and JMLR 14 (2013), pp. 807–829
▸ P. Hennig. Fast Probabilistic Optimization from Noisy Gradients. ICML 2013
▸ J. Skilling. Bayesian solution of ordinary differential equations. Maximum Entropy and Bayesian Methods, 1991
▸ O. Chkrebtii, D.A. Campbell, M.A. Girolami, B. Calderhead. Bayesian Uncertainty Quantification for Differential Equations. http://arxiv.org/abs/1306.2365
▸ M. Bangert, P. Hennig, U. Oelfke. Analytical probabilistic modeling for radiation therapy treatment planning. Physics in Medicine and Biology, 2013, in press