
  • Course in Bayesian Optimization

    Javier González (most of the slides today: Neil Lawrence)

    University of Sheffield, Sheffield, UK

    27th October 2015

  • Many thanks to

    - Mauricio Álvarez.
    - Cristian Guarnizo.
    - Neil Lawrence, University of Sheffield.
    - Zhenwen Dai, University of Sheffield.
    - Machine Learning group, University of Sheffield.
    - Philipp Hennig, Max Planck Institute.
    - Michael Osborne, University of Oxford.

  • Outline of the Course

    - Lecture 1: Uncertainty and Gaussian Processes.

    - Lecture 2: Introduction to Bayesian (probabilistic) optimization.

    - Lecture 3: Advanced topics in Bayesian Optimization.

  • Outline of the Course

    - Lab 1: Introduction to GPy.

    - Lab 2: Introduction to GPyOpt.

    - Lab 3: Advanced GPyOpt.

    - Day 4: Projects + presentations.

  • Points of the day

    What is machine learning?

    What is uncertainty? What types are there?

    How does uncertainty play a role in the learning process?

    Gaussian processes as models to handle uncertainty.


  • What is to learn?

  • The human learning process?

    Cancerous tumor?

  • What is Machine Learning?

    data + model = prediction

    - data: observations, which could be actively or passively acquired (meta-data).

    - model: assumptions, based on previous experience (other data! transfer learning etc.), or beliefs about the regularities of the universe. Inductive bias.

    - prediction: an action to be taken, a categorization, or a quality score.


  • Historical Perspective

    - A data-driven approach to Artificial Intelligence.
    - Inspired by attempts to model the brain (the connectionists).
    - A community that transcended traditional boundaries (psychology, statistical physics, signal processing).
    - Led to an approach that dominates in the modern data-rich world.

  • Two Dominant Approaches

    - Machine Learning as Optimization:
      - Formulate your learning problem as an optimization problem.
      - Typically intractable, so minimize a relaxed version of the cost function.
      - Prove characteristics of the resulting solution.

    - Machine Learning as Probabilistic Modelling:
      - Formulate your learning problem as a probabilistic model.
      - Relate variables through probability distributions.
      - If Bayesian, treat parameters with probability distributions.
      - Required integrals are often intractable: use approximations (MCMC, variational etc.).


  • Modelling Assumptions

    - Modelling assumptions are included either as:
      - a regularizer (optimization), or
      - in the probability distribution (probabilistic approach).

    - Typical assumptions: sparsity, smoothness.


  • Applications of Machine Learning

    Handwriting Recognition: recognising handwritten characters. For example LeNet, http://bit.ly/d26fwK.

    Friend Identification: suggesting friends on social networks, https://www.facebook.com/help/501283333222485.

    Ranking: learning the relative skills of online game players, e.g. the TrueSkill system, http://research.microsoft.com/en-us/projects/trueskill/, and http://www.netflixprize.com/.

    Internet Search: for example Ad Click-Through Rate prediction, http://bit.ly/a7XLH4.

    News Personalisation: for example Zite, http://www.zite.com/.

    Game Play Learning: for example, learning to play Go, http://bit.ly/cV77zM.

  • Learning is Optimization

  • y = mx + c

  • [Figure: data points plotted as y against x (both axes 0-5) with candidate lines y = mx + c; the intercept c and slope m are annotated on some slides. Successive slides show different candidate fits to the data.]

  • y = mx + c

    point 1: x = 1, y = 3

    3 = m + c

    point 2: x = 3, y = 1

    1 = 3m + c

    point 3: x = 2, y = 2.5

    2.5 = 2m + c

  • y = mx + c + ε

    point 1: x = 1, y = 3

    3 = m + c + ε_1

    point 2: x = 3, y = 1

    1 = 3m + c + ε_2

    point 3: x = 2, y = 2.5

    2.5 = 2m + c + ε_3

  • Regression Revisited

    - We introduce an error function of the form

      E(m, c) = ∑_{i=1}^{n} (y_i − m x_i − c)²

    - Minimize the error function with respect to m and c.

  • Mathematical Interpretation

    - What is the mathematical interpretation?
      - There is a cost function.
      - It expresses the mismatch between your prediction and reality.

      E(m, c) = ∑_{i=1}^{n} (y_i − m x_i − c)²

    - This is known as the sum-of-squares error.

  • Learning is Optimization

    - Learning is minimization of the cost function.
    - At the minima the gradient is zero.
    - Coordinate descent: find the gradient in each coordinate and set it to zero.

    For the slope m:

    dE(m)/dm = −2 ∑_{i=1}^{n} x_i (y_i − m x_i − c)

    0 = −2 ∑_{i=1}^{n} x_i y_i + 2 ∑_{i=1}^{n} m x_i² + 2 ∑_{i=1}^{n} c x_i

    m = (∑_{i=1}^{n} (y_i − c) x_i) / (∑_{i=1}^{n} x_i²)

    For the intercept c:

    dE(c)/dc = −2 ∑_{i=1}^{n} (y_i − m x_i − c)

    0 = −2 ∑_{i=1}^{n} y_i + 2 ∑_{i=1}^{n} m x_i + 2nc

    c = (∑_{i=1}^{n} (y_i − m x_i)) / n

  • Fixed Point Updates

    Worked example.

    c* = (∑_{i=1}^{n} (y_i − m* x_i)) / n,

    m* = (∑_{i=1}^{n} x_i (y_i − c*)) / (∑_{i=1}^{n} x_i²),

    (σ²)* = (∑_{i=1}^{n} (y_i − m* x_i − c*)²) / n
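    These fixed-point updates are easy to check numerically. Below is a minimal sketch (not from the slides; the toy data and variable names are assumptions) that alternates the updates for m and c on the three points used earlier and then computes the noise variance.

```python
import numpy as np

# Toy data from the earlier slides: (1, 3), (3, 1), (2, 2.5).
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

m, c = 0.0, 0.0  # arbitrary starting point
for _ in range(100):
    # Fixed-point update for the slope, holding c fixed.
    m = np.sum((y - c) * x) / np.sum(x ** 2)
    # Fixed-point update for the intercept, holding m fixed.
    c = np.sum(y - m * x) / len(x)

# Noise variance at the converged solution.
sigma2 = np.sum((y - m * x - c) ** 2) / len(x)
print(m, c, sigma2)  # matches the direct least-squares solution
```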

  • Coordinate Descent

    [Figure: contour plot of the error surface E(m, c), with axes c and m. Successive slides trace the coordinate-descent path over iterations 1 to 30, zooming into a small region around the minimum as the iterates converge.]

  • Important Concepts Not Covered

    - Optimization methods:
      - Second-order methods, conjugate gradient, quasi-Newton and Newton.
      - Effective heuristics such as momentum, CMA, etc.

    - Local vs global solutions (Bayesian optimization!).

  • Learning is probabilistic modelling

  • Machine Learning and Probability

    - The world is an uncertain place.

      Epistemic uncertainty: uncertainty arising through lack of knowledge. (What colour socks is that person wearing?)

      Aleatoric uncertainty: uncertainty arising through an underlying stochastic system. (Where will a sheet of paper fall if I drop it?)


  • Probability: A Framework to Characterise Uncertainty

    - We need a framework to characterise the uncertainty.
    - In this course we make use of probability theory to characterise uncertainty.


  • Richard Price

    - Welsh philosopher and essay writer.
    - Edited Thomas Bayes's essay, which contained the foundations of Bayesian philosophy.

    Figure: Richard Price, 1723–1791. (source Wikipedia)

  • Laplace

    - French mathematician and astronomer.

    Figure: Pierre-Simon Laplace, 1749–1827. (source Wikipedia)

  • Probabilistic Interpretation

    - Quadratic error functions can be seen as Gaussian noise models [1, 2].

    - Imagine we are seeing data given by

      y(x_i) = m x_i + c + ε,

      where ε is Gaussian noise with standard deviation σ,

      ε ∼ N(0, σ²).

  • Noise Corrupted Mapping

    - This implies that

      y_i ∼ N(m x_i + c, σ²),

    - which we also write as

      p(y_i | w, σ) = N(y_i | m x_i + c, σ²).

  • Gaussian Likelihood

    - If the noise is sampled independently for each data point from the same density, we have

      p(y | m, c, σ²) = ∏_{i=1}^{n} N(y_i | m x_i + c, σ²)

    - This is an i.i.d. assumption about the noise.

    - Writing out the functional form we have

      p(y | m, c, σ²) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(y_i − m x_i − c)² / (2σ²))

      ∝ ∏_{i=1}^{n} exp(−(y_i − m x_i − c)² / (2σ²))

      ∝ exp(−∑_{i=1}^{n} (y_i − m x_i − c)² / (2σ²))

  • Gaussian Log Likelihood

    - Taking the logarithm of the likelihood above gives

      log p(y | m, c, σ²) = −(1/(2σ²)) ∑_{i=1}^{n} (y_i − m x_i − c)² + const

    - or, as a negative log likelihood,

      −log p(y | m, c, σ²) = (1/(2σ²)) ∑_{i=1}^{n} (y_i − m x_i − c)² + const

      = (1/(2σ²)) E(m, c) + const

  • Probabilistic Interpretation of the Error Function

    - The probabilistic interpretation of the error function is the negative log likelihood.

    - Minimizing the error function is equivalent to maximizing the log likelihood.

    - Maximizing the log likelihood is equivalent to maximizing the likelihood, because log is monotonic.

    - Probabilistic interpretation: minimizing the error function is equivalent to maximum likelihood with respect to the parameters.

  • Sample Based Approximation implies i.i.d

    - The log likelihood is

      L(θ) = log P(y | θ)

    - If the likelihood is independent over the individual data points,

      P(y | θ) = ∏_{i=1}^{n} P(y_i | θ)

    - This is equivalent to the assumption that the data is independent and identically distributed (i.i.d.).

    - Now the log likelihood is

      L(θ) = ∑_{i=1}^{n} log P(y_i | θ)

    - We take the negative log likelihood to recover the sum-of-squares error.

  • Bayesian perspective

  • Underdetermined System

    What about two unknowns and one observation?

    y_1 = m x_1 + c

    [Figure: a single data point, y against x, with candidate lines through it.]

    We can compute m given c:

    m = (y_1 − c) / x_1

    Sampling values of c gives different solutions, for example:

    c = 1.75 ⟹ m = 1.25
    c = −0.777 ⟹ m = 3.78
    c = −4.01 ⟹ m = 7.01
    c = −0.718 ⟹ m = 3.72
    c = 2.45 ⟹ m = 0.545
    c = −0.657 ⟹ m = 3.66
    c = −3.13 ⟹ m = 6.13
    c = −1.47 ⟹ m = 4.47

    Assume c ∼ N(0, 4); we find a distribution of solutions.
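    A minimal sketch (assumed, not from the slides) of this construction: sample the intercept from its prior and solve for the slope, giving a distribution over lines that all pass through the single observation.

```python
import numpy as np

x1, y1 = 1.0, 3.0                     # a single observation
rng = np.random.default_rng(0)

c = rng.normal(0.0, 2.0, size=1000)   # c ~ N(0, 4): standard deviation sqrt(4) = 2
m = (y1 - c) / x1                     # each sampled c determines the slope exactly

# Every (m, c) pair fits the data perfectly; together they form a
# distribution of solutions to the underdetermined system.
print(m.mean(), m.std())
```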

  • Bayesian Approach

    - The likelihood for the regression example has the form

      p(y | w, σ²) = ∏_{i=1}^{n} N(y_i | w^⊤ φ_i, σ²).

    - The suggestion was to maximize this likelihood with respect to w.

    - This can be done with gradient-based optimization of the log likelihood.

    - Alternative approach: integration across w.

    - Consider the expected value of the likelihood under a range of potential w's.

    - This is known as the Bayesian approach.

  • Note on the Term Bayesian

    - We will use Bayes' rule to invert probabilities in the Bayesian approach.

    - Bayesian is not named after Bayes' rule (a very common confusion).

    - The term Bayesian refers to the treatment of the parameters as stochastic variables.

    - This approach was proposed by Laplace and Bayes independently.

    - For early statisticians this was very controversial (Fisher et al.).

  • Bayesian Controversy

    - The Bayesian controversy relates to treating epistemic uncertainty as aleatoric uncertainty.

    - Another analogy:
      - Before a football match, the uncertainty about the result is aleatoric.
      - If I watch a recorded match without knowing the result, the uncertainty is epistemic.

  • Simple Bayesian Inference

    posterior = (likelihood × prior) / marginal likelihood

    - Four components:
      1. Prior distribution: represents belief about parameter values before seeing data.
      2. Likelihood: gives the relation between parameters and data.
      3. Posterior distribution: represents updated belief about parameters after data is observed.
      4. Marginal likelihood: represents an assessment of the quality of the model. Can be compared with other models (likelihood/prior combinations). Ratios of marginal likelihoods are known as Bayes factors.

  • Recap

    - We can see the learning process as an optimization problem.

    - We can also see the learning process as probabilistic modelling (which also depends on parameters that need to be optimised).

    - The Bayesian framework allows us to handle 'epistemic' uncertainty in systems.

    - Examples only for linear functions, so far.

  • Generalizations: What if we want to use a more expressive model (a non-linear function)?

    - ML as optimization: regularization in RKHSs (kernel methods).

    - Bayesian perspective: Gaussian processes.

    Both approaches are related, in the same way that standard regression and Bayesian regression are.

    More important for us: Gaussian processes.

  • Basis Functions: Nonlinear Regression

    - Problem with linear regression: x may not be linearly related to y.

    - Potential solution: create a feature space. Define φ(x), where φ(·) is a nonlinear function of x.

    - The model for the target is a linear combination of these nonlinear functions:

      f(x) = ∑_{j=1}^{K} w_j φ_j(x)    (1)

  • Quadratic Basis

    - Basis functions can be global. E.g. the quadratic basis:

      [1, x, x²]

    [Figure: the three quadratic basis functions φ(x) = 1, φ(x) = x and φ(x) = x², plotted over x ∈ [−1, 1].]

  • Functions Derived from Quadratic Basis

    f(x) = w_1 + w_2 x + w_3 x²

    [Figure: three random functions from the quadratic basis, with weights (w_1, w_2, w_3) = (0.87466, −0.38835, −2.0058), (−0.35908, 1.2274, −0.32825) and (−1.5638, −0.73577, 1.6861).]

  • Radial Basis Functions

    - Or they can be local. E.g. the radial (or Gaussian) basis

      φ_j(x) = exp(−(x − µ_j)² / ℓ²)

    [Figure: three radial basis functions φ_1(x) = e^{−2(x+1)²}, φ_2(x) = e^{−2x²} and φ_3(x) = e^{−2(x−1)²}.]

  • Functions Derived from Radial Basis

    f(x) = w_1 e^{−2(x+1)²} + w_2 e^{−2x²} + w_3 e^{−2(x−1)²}

    [Figure: three random functions from the radial basis, with weights (w_1, w_2, w_3) = (−0.47518, −0.18924, −1.8183), (0.50596, −0.046315, 0.26813) and (0.07179, 1.3591, 0.50604).]
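    A short sketch (assumed; the centres and lengthscale match the figures, everything else is illustrative) that evaluates the three radial basis functions and draws a random function from the basis.

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.0])    # basis centres from the figures
ell2 = 0.5                         # (x - mu)^2 / ell^2 with ell^2 = 0.5 gives e^{-2 (x - mu)^2}

def phi(x):
    """Radial basis: one column per centre mu_j."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / ell2)

x = np.linspace(-3.0, 3.0, 200)
Phi = phi(x)                       # design matrix, shape (200, 3)

rng = np.random.default_rng(1)
w = rng.normal(size=3)             # random weights
f = Phi @ w                        # one random function f(x) = sum_j w_j phi_j(x)
```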

  • Probabilistic Model with Basis Functions

    - Define a general function:

      f(x_i) = w^⊤ φ(x_i)

    - Corrupt it with independent noise:

      y(x_i) = f(x_i) + ε_i,  with  ε_i ∼ N(0, σ²)

    - This implies the following likelihood:

      p(y | w, σ²) = ∏_{i=1}^{n} N(y_i | w^⊤ φ(x_i), σ²)

  • Multivariate Regression Likelihood

    - Noise-corrupted data point:

      y_i = w^⊤ x_{i,:} + ε_i

    - Multivariate regression likelihood:

      p(y | X, w) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) ∑_{i=1}^{n} (y_i − w^⊤ x_{i,:})²)

    - Now use a multivariate Gaussian prior:

      p(w) = (1/(2πα)^{p/2}) exp(−(1/(2α)) w^⊤ w)

  • Posterior Density

    - Once again we want to know the posterior:

      p(w | y, X) ∝ p(y | X, w) p(w)

    - We can compute it by completing the square:

      log p(w | y, X) = −(1/(2σ²)) ∑_{i=1}^{n} y_i² + (1/σ²) ∑_{i=1}^{n} y_i x_{i,:}^⊤ w
                        −(1/(2σ²)) ∑_{i=1}^{n} w^⊤ x_{i,:} x_{i,:}^⊤ w − (1/(2α)) w^⊤ w + const.

  • Computing the Posterior

    - By inspection we extract the inverse covariance. In matrix form:

      log p(w | y, X) = −(1/(2σ²)) y^⊤ y + (1/σ²) y^⊤ X w − (1/2) w^⊤ [σ^{−2} X^⊤ X + α^{−1} I] w + const.

    - Completing the square allows us to compute the mean.

  • Making Predictions

    - Giving a Gaussian density

      p(w | y, X) = N(w | µ_w, C_w)

      C_w = [σ^{−2} X^⊤ X + α^{−1} I]^{−1}

      µ_w = C_w σ^{−2} X^⊤ y

    - The posterior is combined with the 'test data' likelihood to make future predictions:

      p(y_* | x_*, X, y) = ∫ p(y_* | x_*, w) p(w | X, y) dw
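    A minimal numpy sketch (assumed, following the formulas above) of the posterior over weights and the resulting predictive mean and variance; the toy data and test point are illustrative.

```python
import numpy as np

def posterior(X, y, sigma2, alpha):
    """Gaussian posterior over the weights: p(w | y, X) = N(w | mu_w, C_w)."""
    p = X.shape[1]
    Cw = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
    mu_w = Cw @ X.T @ y / sigma2
    return mu_w, Cw

def predict(x_star, mu_w, Cw, sigma2):
    """Predictive mean and variance at a test input x_star."""
    mean = x_star @ mu_w
    var = x_star @ Cw @ x_star + sigma2   # weight uncertainty plus noise
    return mean, var

# Example: fit y = mx + c using the design matrix with columns [x, 1].
x = np.array([1.0, 3.0, 2.0]); y = np.array([3.0, 1.0, 2.5])
X = np.stack([x, np.ones_like(x)], axis=1)
mu_w, Cw = posterior(X, y, sigma2=0.01, alpha=1.0)
print(predict(np.array([2.5, 1.0]), mu_w, Cw, sigma2=0.01))
```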

  • Bayesian vs Maximum Likelihood

    - Note the similarity between the posterior mean

      µ_w = (σ^{−2} X^⊤ X + α^{−1} I)^{−1} σ^{−2} X^⊤ y

    - and the maximum likelihood solution

      ŵ = (X^⊤ X)^{−1} X^⊤ y

  • Marginal Likelihood

    - In some sense the real model is now the marginal likelihood.

    - Marginalization of w follows the sum rule:

      p(y | X) = ∫ p(y | X, w) p(w) dw

      giving

      p(y | X) = N(y | 0, α X X^⊤ + σ² I)

    - Often the integral is intractable.

    - This leads to variational approximations, MCMC (Michael Betancourt, Mark Girolami) and the Laplace approximation (Håvard Rue).

    - For the case of Gaussians it's trivial!

  • Marginal Likelihood

    - We can compute the marginal likelihood as:

      p(y | X, α, σ) = N(y | 0, α X X^⊤ + σ² I)

    - Or, if we use a basis set, we have

      p(y | X, α, σ) = N(y | 0, α Φ Φ^⊤ + σ² I)

    - This Gaussian is no longer i.i.d. across the data, and this is where things get interesting.

  • Marginal Likelihood

    - The marginal likelihood can also be computed explicitly; it has the form:

      p(y | X, σ², α) = (1/((2π)^{n/2} |K|^{1/2})) exp(−(1/2) y^⊤ K^{−1} y),  where K = α Φ Φ^⊤ + σ² I.

    - So it is a zero-mean n-dimensional Gaussian with covariance matrix K.

  • Sampling a Function

    Multivariate Gaussians

    - We will consider a Gaussian with a particular structure of covariance matrix.

    - Generate a single sample from this 25-dimensional Gaussian distribution, f = [f_1, f_2, ..., f_25].

    - We will plot these points against their index.

  • Gaussian Distribution Sample

    [Figure: (a) a sample from a 25-dimensional correlated Gaussian, with the values f_i plotted against their index i; (b) a colormap showing the correlations between dimensions, with values from 0 to 1. Repeated over several slides for different samples.]
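    A sketch of the experiment (assumed; the slides do not give the exact covariance, so an exponentiated-quadratic structure over the indices is used for illustration):

```python
import numpy as np

n = 25
t = np.linspace(-1.0, 1.0, n)                  # one input location per index
ell = 0.3                                      # assumed lengthscale

# Structured covariance: nearby indices are strongly correlated.
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * ell ** 2))

rng = np.random.default_rng(2)
L = np.linalg.cholesky(K + 1e-10 * np.eye(n))  # small jitter for stability
f = L @ rng.normal(size=n)                     # one sample f = [f_1, ..., f_25]
```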

  • Computing the Expected Output

    - Given the posterior for the parameters, how can we compute the expected output at a given location?

    - The output of the model at location x_i is given by

      f(x_i; w) = φ_i^⊤ w

    - We want the expected output under the posterior density p(w | y, X, σ², α).

    - The mean of the mapping function is given by

      ⟨f(x_i; w)⟩_{p(w | y, X, σ², α)} = φ_i^⊤ ⟨w⟩ = φ_i^⊤ µ_w

  • Variance of Expected Output

    - The variance of the model at location x_i is given by

      var(f(x_i; w)) = ⟨(f(x_i; w))²⟩ − ⟨f(x_i; w)⟩²
                     = φ_i^⊤ ⟨w w^⊤⟩ φ_i − φ_i^⊤ ⟨w⟩ ⟨w⟩^⊤ φ_i
                     = φ_i^⊤ C_w φ_i,

      where all these expectations are taken under the posterior density p(w | y, X, σ², α).

  • Book

  • Olympic Marathon Data

    - Gold medal times for the Olympic Marathon since 1896.

    - Marathons before 1924 didn't have a standardised distance.

    - Present results using pace per km.

    - In 1904 the Marathon was badly organised, leading to very slow times.

    Image from Wikimedia Commons, http://bit.ly/16kMKHQ

  • Olympic Marathon Data

    [Figure: Olympic marathon data; pace y in min/km (roughly 3 to 5) against year x (1900 to 2020).]

  • Olympics Data analysis

    - Use the Bayesian approach on the Olympics data with polynomials.

    - Choose a prior w ∼ N(0, αI) with α = 1.

    - Choose noise variance σ² = 0.01.

  • Sampling the Prior

    - It is always useful to perform a 'sanity check' and sample from the prior before observing the data.

    - Since y = Φw + ε, we just need to sample

      w ∼ N(0, αI)

      ε ∼ N(0, σ²)

      with α = 1 and σ² = 0.01.
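    A sketch of this sanity check (assumed; the polynomial order and input rescaling are illustrative): draw w from the prior, form Φw and add noise.

```python
import numpy as np

def poly_design(x, order):
    """Polynomial basis: columns 1, x, x^2, ..., x^order."""
    return np.stack([x ** k for k in range(order + 1)], axis=1)

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 50)     # rescaled inputs (e.g. years mapped to [-1, 1])
Phi = poly_design(x, order=3)

alpha, sigma2 = 1.0, 0.01
w = rng.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])        # w ~ N(0, alpha I)
y = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=len(x))   # y = Phi w + eps
```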

  • Recall: Validation Set for Maximum Likelihood

    [Figure pair per polynomial order: left, the maximum likelihood fit to the marathon data; right, model error against polynomial order 0-7.]

    Order   Training error   Validation error   σ²       σ
    0       −1.8774          −0.13132           0.302    0.549
    1       −15.325          2.5863             0.0733   0.271
    2       −17.579          −8.4831            0.0578   0.240
    3       −18.064          11.27              0.0549   0.234
    4       −18.245          232.92             0.0539   0.232
    5       −20.471          9898.1             0.0426   0.207
    6       −22.881          67775              0.0331   0.182

  • Validation Set

    [Figure pair per polynomial order: left, the Bayesian fit (posterior mean) to the marathon data; right, model error against polynomial order 0-7.]

    Order   Training error   Validation error   σ²       σ
    0       29.757           −0.29243           0.302    0.550
    1       14.942           4.4027             0.0762   0.276
    2       9.7206           −8.6623            0.0580   0.241
    3       10.416           −6.4726            0.0555   0.236
    4       11.34            −8.431             0.0555   0.236
    5       11.986           −10.483            0.0551   0.235
    6       12.369           −3.3823            0.0537   0.232

  • Regularized Mean

    - The validation fit here is based on the mean solution for w only.

    - For the Bayesian solution

      µ_w = [σ^{−2} Φ^⊤ Φ + α^{−1} I]^{−1} σ^{−2} Φ^⊤ y

      instead of

      w* = [Φ^⊤ Φ]^{−1} Φ^⊤ y

    - The two are equivalent when α → ∞.

    - That is equivalent to a prior for w with infinite variance.

    - In other cases αI regularizes the system (keeps the parameters smaller).

  • Sampling the Posterior

    - Now check samples by extracting w from the posterior.

    - Now for y = Φw + ε we need

      w ∼ N(µ_w, C_w)

      with C_w = [σ^{−2} Φ^⊤ Φ + α^{−1} I]^{−1} and µ_w = C_w σ^{−2} Φ^⊤ y,

      ε ∼ N(0, σ²)

      with α = 1 and σ² = 0.01.
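    Continuing the prior-sampling sketch (assumed), posterior samples use the same generative equation with w drawn from N(µ_w, C_w) instead of the prior:

```python
import numpy as np

def posterior_samples(Phi, y, alpha, sigma2, n_samples, rng):
    """Draw w from N(mu_w, C_w) and return the sampled functions Phi @ w."""
    p = Phi.shape[1]
    Cw = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(p) / alpha)
    mu_w = Cw @ Phi.T @ y / sigma2
    W = rng.multivariate_normal(mu_w, Cw, size=n_samples)  # each row is one w
    return W @ Phi.T                                       # functions evaluated at the inputs
```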

  • Direct Construction of Covariance Matrix

    Use matrix notation to write the function,

    f(x_i; w) = ∑_{k=1}^{m} w_k φ_k(x_i)

    which, computed at the training data, gives a vector

    f = Φw.

    w ∼ N(0, αI)

    w and f are only related by an inner product.

    Φ ∈ ℜ^{n×m} is the design matrix (row i contains the m basis functions evaluated at x_i).

  • Expectations

    - We have

      ⟨f⟩ = Φ ⟨w⟩.

    - The prior mean of w was zero, giving

      ⟨f⟩ = 0.

    - The prior covariance of f is

      K = ⟨f f^⊤⟩ − ⟨f⟩ ⟨f⟩^⊤

      with

      ⟨f f^⊤⟩ = Φ ⟨w w^⊤⟩ Φ^⊤,

      giving

      K = α Φ Φ^⊤.

    We use ⟨·⟩ to denote expectations under prior distributions.

  • Covariance between Two Points

    - The prior covariance between two points x_i and x_j is

      k(x_i, x_j) = α φ(x_i)^⊤ φ(x_j),

      or in sum notation

      k(x_i, x_j) = α ∑_{k=1}^{m} φ_k(x_i) φ_k(x_j)

    - For the radial basis used above this gives

      k(x_i, x_j) = α ∑_{k=1}^{m} exp(−(|x_i − µ_k|² + |x_j − µ_k|²) / (2ℓ²)).
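    A sketch (assumed; the centres and lengthscale reuse the earlier radial basis) that builds this finite-basis covariance and confirms it equals α Φ Φ^⊤, a valid (positive semi-definite) covariance matrix:

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.0])    # basis centres
ell2 = 0.5

def phi(x):
    """Radial basis functions evaluated at the inputs, one column per centre."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / ell2)

alpha = 1.0
x = np.linspace(-2.0, 2.0, 10)
Phi = phi(x)

# Entry (i, j) is k(x_i, x_j) = alpha * phi(x_i)^T phi(x_j).
K = alpha * Phi @ Phi.T
assert np.allclose(K, K.T)                       # symmetric
assert np.all(np.linalg.eigvalsh(K) > -1e-10)    # positive semi-definite
```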

  • Covariance Functions and Mercer Kernels

    - Mercer kernels and covariance functions are similar.

    - The kernel perspective does not make a probabilistic interpretation of the covariance function.

    - Algorithms can be simpler, but the probabilistic interpretation is crucial for kernel parameter optimization.

  • More on Mercer Kernels

    Let X be a metric space and K : X × X → ℜ a continuous and symmetric function. If we assume that K is positive definite, that is, for any set x = {x_1, ..., x_n} ⊂ X the n × n matrix K with components

    K_{ij} = K(x_i, x_j)

    is positive semi-definite, then K is a Mercer kernel.

  • More on Mercer Kernels

    Mercer's Theorem (1909): Let K : X × X → ℝ be a Mercer kernel, let λ_j be the j-th eigenvalue of the associated integral operator L_K, and let {φ_j}_{j≥1} be the corresponding eigenfunctions. Then, for all x, x′ ∈ X,

    K(x, x′) = ∑_{j=1}^{∞} λ_j φ_j(x) φ_j(x′),

    where the convergence is absolute (for each (x, x′) ∈ X × X) and uniform (on X × X).

    By using a kernel directly we are implicitly using a basis of functions (possibly with infinitely many elements: Bayesian non-parametrics).
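    Mercer's expansion can be checked numerically on a grid; the following sketch (illustrative, using an exponentiated quadratic kernel) eigendecomposes the Gram matrix, the discrete analogue of L_K, and shows that a short truncation of the expansion already reconstructs K:

        import numpy as np

        x = np.linspace(-1, 1, 100)
        ell = 0.3                                   # illustrative length scale
        K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))

        # Discrete Mercer expansion: K = sum_j lambda_j phi_j phi_j^T
        lam, phi = np.linalg.eigh(K)                # eigh: K is symmetric PSD
        lam, phi = lam[::-1], phi[:, ::-1]          # decreasing eigenvalue order

        K10 = (phi[:, :10] * lam[:10]) @ phi[:, :10].T
        print(np.max(np.abs(K - K10)))              # tiny: the spectrum decays fast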

  • Prediction with Correlated Gaussians

    ▶ Prediction of f_∗ from f requires the multivariate conditional density.

    ▶ The multivariate conditional density is also Gaussian,

    p(f_∗|f) = N(f_∗|μ, Σ),

    μ = K_∗,f K_f,f⁻¹ f,

    Σ = K_∗,∗ − K_∗,f K_f,f⁻¹ K_f,∗.

    ▶ Here the covariance of the joint density is given by

    K = [ K_f,f   K_f,∗ ]
        [ K_∗,f   K_∗,∗ ]
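    These formulas map directly onto code; below is a minimal noise-free sketch (the data and kernel choices are illustrative) computing μ and Σ for a grid of test inputs:

        import numpy as np

        def k(X, Xp, alpha=1.0, ell=0.5):           # exponentiated quadratic kernel
            return alpha * np.exp(-(X[:, None] - Xp[None, :])**2 / (2 * ell**2))

        X = np.array([-1.5, -0.5, 0.3, 1.1])        # training inputs
        f = np.sin(3 * X)                           # observed function values
        Xs = np.linspace(-2, 2, 100)                # test inputs

        Kff = k(X, X) + 1e-10 * np.eye(len(X))      # jitter for numerical stability
        Ksf = k(Xs, X)

        mu = Ksf @ np.linalg.solve(Kff, f)                     # K_*,f K_f,f^-1 f
        Sigma = k(Xs, Xs) - Ksf @ np.linalg.solve(Kff, Ksf.T)  # K_*,* - K_*,f K_f,f^-1 K_f,*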

  • Constructing Covariance Functions

    ▶ The sum of two covariance functions is also a covariance function:

    k(x, x′) = k_1(x, x′) + k_2(x, x′)

  • Constructing Covariance Functions

    ▶ The product of two covariance functions is also a covariance function:

    k(x, x′) = k_1(x, x′) k_2(x, x′)

  • Multiply by Deterministic Function

    ▶ If f(x) is a Gaussian process and g(x) is a deterministic function, then h(x) = f(x) g(x) has covariance

    k_h(x, x′) = g(x) k_f(x, x′) g(x′),

    where k_h is the covariance for h(·) and k_f is the covariance for f(·).
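    A short sketch of these three constructions (illustrative kernels; the PSD check inspects the eigenvalues of the resulting Gram matrices):

        import numpy as np

        def k_eq(X, Xp, ell=0.5):                   # exponentiated quadratic
            return np.exp(-(X[:, None] - Xp[None, :])**2 / (2 * ell**2))

        def k_lin(X, Xp):                           # linear covariance
            return np.outer(X, Xp)

        g = lambda x: 1.0 + x**2                    # a deterministic function

        X = np.linspace(-2, 2, 40)
        K_sum = k_eq(X, X) + k_lin(X, X)                        # sum
        K_prod = k_eq(X, X) * k_lin(X, X)                       # elementwise product
        K_h = g(X)[:, None] * k_eq(X, X) * g(X)[None, :]        # g(x) k_f(x,x') g(x')

        # each construction is still a valid covariance: no negative eigenvalues
        for K in (K_sum, K_prod, K_h):
            assert np.linalg.eigvalsh(K).min() > -1e-8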

  • Covariance Functions

    MLP Covariance Function

    k(x, x′) = α asin( (w x⊤x′ + b) / (√(w x⊤x + b + 1) √(w x′⊤x′ + b + 1)) )

    ▶ Based on the infinite neural network model.

    ▶ Plotted with w = 40, b = 4.


  • Covariance Functions

    Linear Covariance Function

    k(x, x′) = α x⊤x′

    ▶ Bayesian linear regression.

    ▶ Plotted with α = 1.


  • Covariance Functions
    Where did this covariance matrix come from?

    Ornstein-Uhlenbeck (stationary Gauss-Markov) covariance function

    k(x, x′) = α exp( −|x − x′| / (2ℓ²) )

    ▶ In one dimension, arises from a stochastic differential equation: Brownian motion in a parabolic tube.

    ▶ In higher dimensions, arises from a Fourier filter of the form 1/(π(1 + x²)).


  • Covariance Functions
    Where did this covariance matrix come from?

    Markov Process

    k(t, t′) = α min(t, t′)

    ▶ The covariance matrix is built using the inputs to the function, t.


  • Covariance Functions
    Where did this covariance matrix come from?

    Matérn 5/2 Covariance Function

    k(x, x′) = α (1 + √5 r + (5/3) r²) exp(−√5 r),  where r = ‖x − x′‖₂ / ℓ

    ▶ Matérn 5/2 is a twice-differentiable covariance.

    ▶ The Matérn family is constructed with Student-t filters in Fourier space.


  • Covariance Functions

    RBF Basis Functions

    k(x, x′) = α φ(x)⊤ φ(x′),

    φ_k(x) = exp( −‖x − μ_k‖²₂ / ℓ² ),

    with basis centres μ = (−1, 0, 1).


  • Covariance Functions
    Where did this covariance matrix come from?

    Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

    k(x, x′) = α exp( −‖x − x′‖²₂ / (2ℓ²) )

    ▶ The covariance matrix is built using the inputs to the function, x.

    ▶ For the example above it was based on Euclidean distance.

    ▶ The covariance function is also known as a kernel.

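    For reference, one-dimensional implementations of the covariance functions above (a sketch; names and default parameters are illustrative, and the OU form follows the slide's parameterization):

        import numpy as np

        def k_mlp(x, xp, alpha=1.0, w=40.0, b=4.0):     # MLP (arcsine) covariance
            num = w * np.outer(x, xp) + b
            den = np.sqrt(w * x**2 + b + 1)[:, None] * np.sqrt(w * xp**2 + b + 1)[None, :]
            return alpha * np.arcsin(num / den)

        def k_lin(x, xp, alpha=1.0):                    # linear (Bayesian linear regression)
            return alpha * np.outer(x, xp)

        def k_ou(x, xp, alpha=1.0, ell=1.0):            # Ornstein-Uhlenbeck
            return alpha * np.exp(-np.abs(x[:, None] - xp[None, :]) / (2 * ell**2))

        def k_brownian(t, tp, alpha=1.0):               # Markov process, for t, t' >= 0
            return alpha * np.minimum(t[:, None], tp[None, :])

        def k_matern52(x, xp, alpha=1.0, ell=1.0):      # Matern 5/2
            r = np.abs(x[:, None] - xp[None, :]) / ell
            return alpha * (1 + np.sqrt(5) * r + (5.0 / 3) * r**2) * np.exp(-np.sqrt(5) * r)

        def k_eq(x, xp, alpha=1.0, ell=1.0):            # exponentiated quadratic (RBF)
            return alpha * np.exp(-(x[:, None] - xp[None, :])**2 / (2 * ell**2))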

  • Gaussian Process Interpolation

    [Figure: animation frames of a Gaussian process interpolating a set of observations; vertical axis f(x), horizontal axis x.]

    Figure: Real example: BACCO (see e.g. [3]). Interpolation through outputs from slow computer simulations (e.g. atmospheric carbon levels).

  • Gaussian Noise

    ▶ Gaussian noise model,

    p(y_i | f_i) = N(y_i | f_i, σ²),

    where σ² is the variance of the noise.

    ▶ Equivalent to a covariance function of the form

    k(x_i, x_j) = δ_{i,j} σ²,

    where δ_{i,j} is the Kronecker delta function.

    ▶ The additive nature of Gaussians means we can simply add this term to existing covariance matrices.
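    In code, the noise model is exactly this addition of σ² to the diagonal of the kernel matrix before solving (a sketch with illustrative data):

        import numpy as np

        def k_eq(X, Xp, alpha=1.0, ell=0.5):
            return alpha * np.exp(-(X[:, None] - Xp[None, :])**2 / (2 * ell**2))

        X = np.linspace(-2, 2, 20)
        y = np.sin(3 * X) + 0.1 * np.random.randn(len(X))   # noisy observations
        Xs = np.linspace(-2, 2, 100)

        sigma2 = 0.01                                       # noise variance
        Ky = k_eq(X, X) + sigma2 * np.eye(len(X))           # k(x_i, x_j) + sigma^2 delta_ij
        Ksf = k_eq(Xs, X)

        mu = Ksf @ np.linalg.solve(Ky, y)                        # predictive mean
        Sigma = k_eq(Xs, Xs) - Ksf @ np.linalg.solve(Ky, Ksf.T)  # predictive covariance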

  • Gaussian Process Regression

    [Figure: animation frames of Gaussian process regression on noisy observations; vertical axis y(x), horizontal axis x.]

    Figure: Examples include WiFi localization and the C14 calibration curve.
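    The same regression takes a few lines in GPy, the library introduced in Lab 1 (a sketch; the data and kernel settings are illustrative):

        import numpy as np
        import GPy

        X = np.random.uniform(-2, 2, (20, 1))            # GPy expects 2-D arrays
        Y = np.sin(3 * X) + 0.1 * np.random.randn(20, 1)

        kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=1.0)
        m = GPy.models.GPRegression(X, Y, kernel)        # Gaussian noise is included
        m.optimize()                                     # fits theta; see the next slides

        Xs = np.linspace(-2, 2, 100)[:, None]
        mean, var = m.predict(Xs)                        # predictive mean and variance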

  • ▶ Learning with Gaussian processes allows us to characterize the uncertainty in the problem.

    ▶ We can choose a basis of functions, or directly select a kernel (equivalent, but it is better to choose the kernel: non-parametric models).

    ▶ Given a covariance (prior), how do we select the right parameters?

    ▶ Back to optimization...

  • Learning Covariance Parameters
    Can we determine covariance parameters from the data?

    N(y|0, K) = (1 / ((2π)^{n/2} |K|^{1/2})) exp( −y⊤K⁻¹y / 2 )

    Taking the logarithm,

    log N(y|0, K) = −(1/2) log|K| − (1/2) y⊤K⁻¹y − (n/2) log 2π,

    and dropping the constant gives the negative log-likelihood to minimize,

    E(θ) = (1/2) log|K| + (1/2) y⊤K⁻¹y.

    The parameters are inside the covariance function (matrix):

    k_{i,j} = k(x_i, x_j; θ)
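    A minimal sketch of this objective in code, minimizing E(θ) over the length scale of an exponentiated quadratic kernel with a 1-D search (illustrative; practical implementations use gradients and also optimize the signal and noise variances):

        import numpy as np
        from scipy.optimize import minimize_scalar

        X = np.linspace(-2, 2, 30)
        y = np.sin(3 * X) + 0.1 * np.random.randn(len(X))

        def E(log_ell, alpha=1.0, sigma2=0.01):
            """E(theta) = (1/2) log|K| + (1/2) y^T K^-1 y, constant dropped."""
            ell = np.exp(log_ell)                        # optimize on the log scale
            K = alpha * np.exp(-(X[:, None] - X[None, :])**2 / (2 * ell**2))
            K += sigma2 * np.eye(len(X))
            _, logdet = np.linalg.slogdet(K)
            return 0.5 * logdet + 0.5 * y @ np.linalg.solve(K, y)

        res = minimize_scalar(E)                         # 1-D search over log ell
        print("selected length scale:", np.exp(res.x))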

  • Learning Covariance Parameters
    Can we determine length scales and noise levels from the data?

    [Figure: animation frames showing the data fit y(x) versus x alongside the objective E(θ) = (1/2) log|K| + (1/2) y⊤K⁻¹y as a function of the length scale ℓ, from 10⁻¹ to 10¹; the minimum of E(θ) picks out the appropriate length scale.]

  • Limitations of Gaussian Processes

    ▶ Inference is O(n³) due to the matrix inverse (in practice, use the Cholesky decomposition, as in the sketch below).

    ▶ Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).

    ▶ The widely used exponentiated quadratic (RBF) covariance can be too smooth in practice (but there are many alternatives!).
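    On the first point, the standard trick is one Cholesky factorization instead of an explicit inverse; a sketch:

        import numpy as np
        from scipy.linalg import cho_factor, cho_solve

        n = 500
        X = np.linspace(-2, 2, n)
        K = np.exp(-(X[:, None] - X[None, :])**2 / 2) + 1e-2 * np.eye(n)
        y = np.sin(3 * X)

        # Factor once (O(n^3)); never form K^-1 explicitly
        c, low = cho_factor(K)
        alpha = cho_solve((c, low), y)           # solves K alpha = y stably
        logdet = 2 * np.sum(np.log(np.diag(c)))  # log|K| from the Cholesky diagonal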

  • Conclusions

    ▶ Machine learning has focussed on prediction.

    ▶ Two main approaches: optimize an objective, or model probabilistically.

    ▶ Both approaches require optimizing parameters.

    ▶ Gaussian processes: fundamental models for dealing with uncertainty in complex scenarios.

  • Tomorrow

    ▶ Global optimization.

    ▶ Parameter tuning in machine learning as a global optimization problem.

    ▶ Can we automate the parameter choice of machine learning algorithms?

    ▶ Yes! Bayesian optimization.

  • References

    [1] Carl Friedrich Gauss. Theoria Motus Corporum Coelestium. Perthes et Besser, Hamburg, 1809.

    [2] Pierre Simon Laplace. Mémoire sur la probabilité des causes par les évènemens. In Mémoires de mathématique et de physique, presentés à l'Académie Royale des Sciences, par divers savans, & lûs dans ses assemblées 6, pages 621–656, 1774. Translated in [5].

    [3] Jeremy Oakley and Anthony O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.

    [4] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

    [5] Stephen M. Stigler. Laplace's 1774 memoir on inverse probability. Statistical Science, 1:359–378, 1986.
