
Page 1: An Introduction to Probabilistic Modeling

Oliver Stegle and Karsten Borgwardt
Machine Learning and Computational Biology Research Group,
Max Planck Institute for Biological Cybernetics and
Max Planck Institute for Developmental Biology, Tübingen

Page 2: Motivation

Why probabilistic modeling?

- Inferences from data are intrinsically uncertain.
- Probability theory: model uncertainty instead of ignoring it!
- Applications: machine learning, data mining, pattern recognition, etc.
- Goal of this part of the course:
  - Overview of probabilistic modeling
  - Key concepts
  - Focus on applications in bioinformatics

Page 5: Motivation

Further reading, useful material

- Christopher M. Bishop: Pattern Recognition and Machine Learning.
  - Good background; covers most of the course material and much more!
  - Substantial parts of this tutorial borrow figures and ideas from this book.
- David J.C. MacKay: Information Theory, Inference, and Learning Algorithms.
  - Well worth reading, though it does not overlap with the lecture synopsis quite as closely.
  - Freely available online.

Page 6: Motivation

Lecture overview

1. An introduction to probabilistic modeling
2. Applications: linear models, hypothesis testing
3. An introduction to Gaussian processes
4. Applications: time series, model comparison
5. Applications: continued

Page 8: Outline

Motivation
Prerequisites
Probability Theory
Parameter Inference for the Gaussian
Summary

Page 9: Prerequisites

Key concepts: Data

- Let D denote a dataset consisting of N datapoints, D = {(x_n, y_n)}_{n=1}^N, where the x_n are inputs and the y_n are outputs.
- Typical setting (this course):
  - x = (x_1, ..., x_D): multivariate, spanning D features for each observation (nodes in a graph, etc.).
  - y: univariate (fitness, expression level, etc.).
- Notation:
  - Scalars are printed as y.
  - Vectors are printed in bold: x.
  - Matrices are printed in capital bold: Σ.

[Figure: scatter plot of inputs X against outputs Y]

Page 12: Prerequisites

Key concepts: Predictions

- Observed dataset D = {(x_n, y_n)}_{n=1}^N.
- Given D, what can we say about y* at an unseen test input x*?

[Figure: training data (X, Y) with the unknown prediction at x* marked by "?"]

Page 14: Prerequisites

Key concepts: Model

- Observed dataset D = {(x_n, y_n)}_{n=1}^N.
- Given D, what can we say about y* at an unseen test input x*?
- To make predictions we need to make assumptions.
- A model H encodes these assumptions and often depends on some parameters θ.
- Curve fitting: the model relates x to y (see the sketch below),

  y = f(x | θ) = θ_0 + θ_1 · x   (example: a linear model)

[Figure: training data (X, Y) with a fitted line and the prediction at x*]
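To make the curve-fitting picture concrete, here is a minimal sketch in Python; the data, the least-squares fit, and the test input x_star are illustrative assumptions, not taken from the slides:

```python
# A linear model y = theta_0 + theta_1 * x fitted to made-up data,
# then used to predict at an unseen test input x_star.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 1.2, 1.8, 2.1, 2.9])     # noisy observations

theta_1, theta_0 = np.polyfit(x, y, deg=1)  # least-squares fit of a line
x_star = 5.0                                # unseen test input
y_star = theta_0 + theta_1 * x_star         # prediction f(x* | theta)
print(theta_0, theta_1, y_star)
```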

Page 16: Prerequisites

Key concepts: Uncertainty

- Virtually all steps involve uncertainty:
  - Measurement uncertainty (D)
  - Parameter uncertainty (θ)
  - Uncertainty regarding the correct model (H)
- Uncertainty can occur in both inputs and outputs.
- How do we represent uncertainty?

[Figure: measurement uncertainty shown as error bars on the (X, Y) data]

Page 19: Outline (Probability Theory)

Page 20: Probability Theory

Probabilities

- Let X be a random variable defined over a set X (or a measurable space).
- P(X = x) denotes the probability that X takes the value x; short: p(x).
- Probabilities are non-negative: P(X = x) ≥ 0.
- Probabilities sum to one:

  ∫_{x ∈ X} p(x) dx = 1 (continuous)    Σ_{x ∈ X} p(x) = 1 (discrete)

- Special case of no uncertainty: p(x) = δ(x − x_0), a point mass on a single value x_0.

Page 21: Probability Theory

Probability Theory

Consider two discrete random variables X and Y, where n_{i,j} is the number of observations with X = x_i and Y = y_j, c_i = Σ_j n_{i,j} is the number of observations with X = x_i, and N is the total number of observations.

Joint probability:

  P(X = x_i, Y = y_j) = n_{i,j} / N

Marginal probability:

  P(X = x_i) = c_i / N

Conditional probability:

  P(Y = y_j | X = x_i) = n_{i,j} / c_i

Product rule:

  P(X = x_i, Y = y_j) = n_{i,j} / N = (n_{i,j} / c_i) · (c_i / N) = P(Y = y_j | X = x_i) P(X = x_i)

Sum rule:

  P(X = x_i) = c_i / N = (1/N) Σ_{j=1}^L n_{i,j} = Σ_j P(X = x_i, Y = y_j)

(C.M. Bishop, Pattern Recognition and Machine Learning)

Page 24: Probability Theory

The Rules of Probability

Sum and product rule:

  Sum rule:     p(x) = Σ_y p(x, y)
  Product rule: p(x, y) = p(y | x) p(x)
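As a quick check of the two rules, a minimal sketch on a made-up count table (the counts are illustrative assumptions):

```python
# Sum and product rules on a count table, where n[i, j] counts
# observations with X = x_i and Y = y_j.
import numpy as np

n = np.array([[3.0, 1.0],
              [2.0, 4.0]])
N = n.sum()

p_xy = n / N                        # joint       P(X = x_i, Y = y_j)
p_x = p_xy.sum(axis=1)              # sum rule:   p(x) = sum_y p(x, y)
p_y_given_x = p_xy / p_x[:, None]   # conditional P(Y = y_j | X = x_i)

# Product rule: p(x, y) = p(y | x) p(x)
assert np.allclose(p_xy, p_y_given_x * p_x[:, None])
```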

Page 25: Probability Theory

The Rules of Probability

Bayes' Theorem

- Using the product rule we obtain

  p(y | x) = p(x | y) p(y) / p(x),    where p(x) = Σ_y p(x | y) p(y)
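A sketch of Bayes' theorem on a made-up count table, recovering p(y | x) from p(x | y) and p(y); the counts are again illustrative assumptions:

```python
# Bayes' theorem: p(y | x) = p(x | y) p(y) / p(x),
# with the evidence p(x) = sum_y p(x | y) p(y).
import numpy as np

n = np.array([[3.0, 1.0],
              [2.0, 4.0]])          # n[i, j]: counts for (x_i, y_j)
p_xy = n / n.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y[None, :]   # P(X = x_i | Y = y_j)

evidence = (p_x_given_y * p_y[None, :]).sum(axis=1)           # = p(x)
p_y_given_x = p_x_given_y * p_y[None, :] / evidence[:, None]  # Bayes
assert np.allclose(p_y_given_x, p_xy / p_x[:, None])          # direct conditional
```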

Page 26: Probability Theory

Bayesian probability calculus

- Bayes' rule is the basis for inference and learning.
- Assume we have a model with parameters θ, e.g. y = θ_0 + θ_1 · x.
- Goal: learn the parameters θ given data D:

  p(θ | D) = p(D | θ) p(θ) / p(D)

  posterior ∝ likelihood · prior

- Here p(θ | D) is the posterior, p(D | θ) the likelihood, and p(θ) the prior.

[Figure: linear model fit to (X, Y) data with test input x*]

Page 28: Probability Theory

Information and Entropy

- Information is the reduction of uncertainty.
- Entropy H(X) is the quantitative description of uncertainty:
  - H(X) = 0: certainty about X.
  - H(X) is maximal if all possibilities are equally probable.
  - Uncertainty and information are additive.
- These conditions are fulfilled by the entropy function (see the sketch below):

  H(X) = − Σ_{x ∈ X} P(X = x) log P(X = x)
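A minimal sketch of the entropy function; using log base 2 so that entropy is in bits, with illustrative distributions:

```python
# Entropy H(X) = -sum_x p(x) log p(x).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0, 0.0]))   # 0.0: certainty about X
print(entropy([0.25] * 4))             # 2.0: maximal for 4 equally probable outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))   # ~1.36: in between
```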

Page 30: Probability Theory

Definitions related to entropy and information

- Entropy is the average surprise:

  H(X) = Σ_{x ∈ X} P(X = x) · (− log P(X = x)),

  where − log P(X = x) is the surprise of outcome x.

- Conditional entropy:

  H(X | Y) = − Σ_{x ∈ X, y ∈ Y} P(X = x, Y = y) log P(X = x | Y = y)

- Mutual information:

  I(X : Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y)

- If X and Y are independent, p(x, y) = p(x) p(y), the mutual information vanishes.
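A sketch of these quantities for a made-up joint distribution (the probabilities are illustrative assumptions):

```python
# Conditional entropy and mutual information, in bits.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_x, H_y, H_xy = entropy(p_x), entropy(p_y), entropy(p_xy)
H_x_given_y = H_xy - H_y               # chain rule: H(X | Y) = H(X, Y) - H(Y)
I_xy = H_x - H_x_given_y               # mutual information
assert np.isclose(I_xy, H_x + H_y - H_xy)
# Under independence p(x, y) = p(x) p(y), the mutual information is zero:
assert np.isclose(entropy(np.outer(p_x, p_y)), H_x + H_y)
```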

Page 34: Probability Theory

Entropy in action: the optimal weighing problem

- Given 12 balls, all equal except for one that is lighter or heavier.
- What is the ideal weighing strategy, and how many weighings are needed to identify the odd ball?
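The information-theoretic lower bound (a fact about the puzzle, not a strategy): there are 12 × 2 = 24 equally likely states, and each weighing has at most 3 outcomes, so at least ⌈log_3 24⌉ = 3 weighings are needed. A one-line check:

```python
# Lower bound for the 12-ball weighing puzzle.
import math

states = 12 * 2              # which ball is odd, times (lighter or heavier)
print(math.log(states, 3))   # ~2.89: each weighing has 3 outcomes
                             # (left, right, balanced), so >= 3 weighings
```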

Page 35: Probability Theory

Probability distributions

- Gaussian:

  p(x | µ, σ²) = N(x | µ, σ²) = (1 / √(2πσ²)) · exp(−(x − µ)² / (2σ²))

  [Figure: univariate Gaussian density]

- Multivariate Gaussian:

  p(x | µ, Σ) = N(x | µ, Σ) = (1 / √|2πΣ|) · exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))

  [Figure: contours of a bivariate Gaussian with Σ = (1, 0.8; 0.8, 1)]
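A minimal numerical sketch of both densities, using µ = 0 and the Σ from the bivariate example above; the evaluation points and sample size are arbitrary:

```python
# Evaluating and sampling Gaussian densities with numpy.
import numpy as np

def gauss_pdf(x, mu, var):
    # univariate N(x | mu, sigma^2)
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mvn_pdf(x, mu, Sigma):
    # multivariate N(x | mu, Sigma) = exp(-0.5 d^T Sigma^-1 d) / sqrt(|2 pi Sigma|)
    d = x - mu
    norm = np.sqrt(np.linalg.det(2 * np.pi * Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

print(gauss_pdf(0.0, 0.0, 1.0))    # 1/sqrt(2 pi) ~ 0.399, the peak in the figure
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
print(mvn_pdf(np.zeros(2), np.zeros(2), Sigma))
samples = np.random.default_rng(0).multivariate_normal(np.zeros(2), Sigma, 1000)
print(np.cov(samples.T))           # close to Sigma
```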

Page 37: Probability Theory

Probability distributions, continued

- Bernoulli:

  p(x | θ) = θ^x (1 − θ)^(1−x)

- Gamma:

  p(x | a, b) = (b^a / Γ(a)) · x^(a−1) e^(−bx)

  [Figure: Gamma density p(x | a = 1, b = 1)]

Page 39: Probability Theory

Probability distributions: the Gaussian revisited

- Gaussian PDF:

  N(x | µ, σ²) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²))

- Positive: N(x | µ, σ²) > 0
- Normalized: ∫_{−∞}^{+∞} N(x | µ, σ²) dx = 1 (check)
- Expectation:

  ⟨x⟩ = ∫_{−∞}^{+∞} N(x | µ, σ²) · x dx = µ

- Variance:

  Var[x] = ⟨x²⟩ − ⟨x⟩² = µ² + σ² − µ² = σ²

[Figure: univariate Gaussian density]
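These properties can be checked numerically; a sketch with assumed values µ = 1, σ² = 2 and a simple Riemann sum over a wide grid:

```python
# Numerical check of normalization, mean, and variance of the Gaussian.
import numpy as np

mu, var = 1.0, 2.0
x, dx = np.linspace(-14.0, 16.0, 30001, retstep=True)
p = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

print(np.sum(p) * dx)                        # ~1.0: normalized
mean = np.sum(p * x) * dx
print(mean)                                  # ~1.0: <x> = mu
print(np.sum(p * x ** 2) * dx - mean ** 2)   # ~2.0: <x^2> - <x>^2 = sigma^2
```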

Page 41: Outline (Parameter Inference for the Gaussian)

Page 42: Parameter Inference for the Gaussian

Inference for the Gaussian: Ingredients

- Data:

  D = {x_1, ..., x_N}

- Model H_Gauss, the Gaussian PDF:

  N(x | µ, σ²) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²)),    θ = {µ, σ²}

- Likelihood:

  p(D | θ) = Π_{n=1}^N N(x_n | µ, σ²)

[Figure: Gaussian density p(x) with a datapoint x_n and its contribution N(x_n | µ, σ²) marked] (C.M. Bishop, Pattern Recognition and Machine Learning)

Page 45: Parameter Inference for the Gaussian

Inference for the Gaussian: Maximum likelihood

- Likelihood:

  p(D | θ) = Π_{n=1}^N N(x_n | µ, σ²)

- Maximum likelihood:

  θ̂ = argmax_θ p(D | θ)

[Figure: Gaussian density p(x) with a datapoint x_n and its contribution N(x_n | µ, σ²) marked] (C.M. Bishop, Pattern Recognition and Machine Learning)

Page 46: Parameter Inference for the Gaussian

Inference for the Gaussian: Maximum likelihood

  θ̂ = argmax_θ p(D | θ) = argmax_θ Π_{n=1}^N (1 / √(2πσ²)) e^(−(x_n − µ)² / (2σ²))

Since the logarithm is monotonic, we can equivalently maximize the log-likelihood:

  θ̂ = argmax_θ ln p(D | θ) = argmax_θ ln Π_{n=1}^N (1 / √(2πσ²)) e^(−(x_n − µ)² / (2σ²))

Page 48: Parameter Inference for the Gaussian

Inference for the Gaussian: Maximum likelihood

  θ̂ = argmax_θ ln p(D | θ) = argmax_θ [ −(N/2) ln(2π) − (N/2) ln σ² − (1/(2σ²)) Σ_{n=1}^N (x_n − µ)² ]

Setting the derivatives to zero:

  µ:  d/dµ ln p(D | µ) = 0        σ²:  d/dσ² ln p(D | σ²) = 0


Page 51: Parameter Inference for the Gaussian

Inference for the Gaussian: Maximum likelihood

- Maximum likelihood solutions:

  µ_ML = (1/N) Σ_{n=1}^N x_n

  σ²_ML = (1/N) Σ_{n=1}^N (x_n − µ_ML)²

  Equivalent to the common mean and variance estimators (almost: σ²_ML divides by N rather than N − 1, and hence is biased).

- Maximum likelihood ignores parameter uncertainty.
- Think of the ML solution for a single observed datapoint x_1 (see the sketch below):

  µ_ML = x_1,    σ²_ML = (x_1 − µ_ML)² = 0

- How about Bayesian inference?
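A sketch of the ML estimators on made-up data, including the degenerate N = 1 case from the slide:

```python
# Maximum-likelihood mean and variance for a Gaussian.
import numpy as np

x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])
mu_ml = x.mean()                     # (1/N) sum_n x_n
var_ml = np.mean((x - mu_ml) ** 2)   # (1/N) sum_n (x_n - mu_ml)^2 (biased)
print(mu_ml, var_ml)

x1 = x[:1]                           # a single observed datapoint
print(x1.mean(), np.mean((x1 - x1.mean()) ** 2))   # mu_ML = x_1, sigma^2_ML = 0
```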

Page 54: Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian: Ingredients

- Data:

  D = {x_1, ..., x_N}

- Model H_Gauss, the Gaussian PDF:

  N(x | µ, σ²) = (1 / √(2πσ²)) e^(−(x − µ)² / (2σ²)),    θ = {µ}

- For simplicity: assume the variance σ² is known.
- Likelihood:

  p(D | µ) = Π_{n=1}^N N(x_n | µ, σ²)

[Figure: Gaussian density p(x) with a datapoint x_n and its contribution N(x_n | µ, σ²) marked] (C.M. Bishop, Pattern Recognition and Machine Learning)

Page 57: Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian: Bayes' rule

- Combine the likelihood with a Gaussian prior over µ:

  p(µ) = N(µ | m_0, s_0²)

- The posterior is proportional to

  p(µ | D, σ²) ∝ p(D | µ, σ²) p(µ)

Page 58: Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

  p(µ | D, σ²) ∝ p(D | µ) p(µ)

  = [ Π_{n=1}^N (1/√(2πσ²)) e^(−(x_n − µ)²/(2σ²)) ] · (1/√(2πs_0²)) e^(−(µ − m_0)²/(2s_0²))

  = (1/√(2πσ²))^N (1/√(2πs_0²)) · exp[ −(1/(2s_0²))(µ² − 2µm_0 + m_0²) − (1/(2σ²)) Σ_{n=1}^N (µ² − 2µx_n + x_n²) ]

    (the prefactor is a constant, C_1)

  = C_2 exp[ −(1/2) (1/s_0² + N/σ²) ( µ² − 2µ · σ̄² (m_0/s_0² + (1/σ²) Σ_{n=1}^N x_n) ) + C_3 ]

  with 1/σ̄² = 1/s_0² + N/σ² and µ̄ = σ̄² (m_0/s_0² + (1/σ²) Σ_{n=1}^N x_n).

- The posterior parameters µ̄ and σ̄² follow as the new coefficients.
- Note: all the constants we dropped on the way yield the model evidence:

  p(µ | D, σ²) = p(D | µ) p(µ) / Z

Page 61: Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian

- Posterior of the mean: p(µ | D, σ²) ∝ N(µ | µ̄, σ̄²), where after some rewriting

  µ̄ = σ²/(Ns_0² + σ²) · m_0 + Ns_0²/(Ns_0² + σ²) · µ_ML,    µ_ML = (1/N) Σ_{n=1}^N x_n

  1/σ̄² = 1/s_0² + N/σ²

- Limiting cases for no data and an infinite amount of data (see the sketch below):

        N = 0    N → ∞
  µ̄     m_0      µ_ML
  σ̄²    s_0²     0
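A minimal sketch of this conjugate update; the prior parameters (m_0 = 0, s_0² = 10) and the synthetic data are illustrative assumptions:

```python
# Conjugate posterior for the Gaussian mean, sigma^2 known.
import numpy as np

def posterior_mu(x, var, m0, s0sq):
    N = len(x)
    var_bar = 1.0 / (1.0 / s0sq + N / var)           # 1/sigma_bar^2 = 1/s0^2 + N/sigma^2
    mu_bar = var_bar * (m0 / s0sq + x.sum() / var)   # sigma_bar^2 (m0/s0^2 + sum_n x_n/sigma^2)
    return mu_bar, var_bar

x = np.random.default_rng(1).normal(0.5, 1.0, 100)
print(posterior_mu(x, 1.0, 0.0, 10.0))     # near (mu_ML, 0) for large N
print(posterior_mu(x[:0], 1.0, 0.0, 10.0)) # N = 0: recovers the prior (m0, s0^2)
```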

Page 62: Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian: Examples

- Posterior p(µ | D, σ²) for increasing data sizes.

[Figure: posterior densities over µ for N = 0, 1, 2, 10, narrowing as N grows] (C.M. Bishop, Pattern Recognition and Machine Learning)

Page 63: Parameter Inference for the Gaussian

Conjugate priors

- It is not by chance that the posterior

  p(µ | D, σ²) ∝ p(D | µ, σ²) p(µ)

  is tractable in closed form for the Gaussian.

Conjugate prior: p(θ) is a conjugate prior for a particular likelihood p(D | θ) if the posterior is of the same functional form as the prior.

Page 65: Parameter Inference for the Gaussian

Conjugate priors: Exponential family distributions

- A large class of probability distributions belongs to the exponential family (all distributions in this course) and can be written as

  p(x | θ) = h(x) g(θ) exp{θᵀ u(x)}

- For example, for the Gaussian:

  p(x | µ, σ²) = (1/√(2πσ²)) exp{−(1/(2σ²))(x² − 2xµ + µ²)} = h(x) g(θ) exp{θᵀ u(x)}

  with θ = (µ/σ², −1/(2σ²))ᵀ, h(x) = 1/√(2π), u(x) = (x, x²)ᵀ, g(θ) = (−2θ_2)^(1/2) exp(θ_1²/(4θ_2))
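A numerical check that this exponential-family parameterization matches the standard Gaussian density; µ, σ², and the evaluation point are arbitrary:

```python
# Exponential-family form of the Gaussian vs. the standard form.
import numpy as np

mu, var = 1.5, 0.8
theta = np.array([mu / var, -1.0 / (2 * var)])   # natural parameters
g = np.sqrt(-2 * theta[1]) * np.exp(theta[0] ** 2 / (4 * theta[1]))

x = 0.3
u = np.array([x, x ** 2])                        # sufficient statistics u(x)
h = 1.0 / np.sqrt(2 * np.pi)                     # h(x)
standard = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
assert np.isclose(h * g * np.exp(theta @ u), standard)
```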

Page 66: Parameter Inference for the Gaussian

Conjugate priors: Exponential family distributions

Conjugacy and exponential family distributions

- For all members of the exponential family it is possible to construct a conjugate prior.
- Intuition: the exponential form ensures that we can construct a prior that keeps its functional form.
- Conjugate priors for the Gaussian N(x | µ, σ²):
  - p(µ) = N(µ | m_0, s_0²)
  - p(1/σ²) = Γ(1/σ² | a_0, b_0)

Page 68: Parameter Inference for the Gaussian

Bayesian Inference for the Gaussian: Sequential learning

- Bayes' rule naturally lends itself to sequential learning.
- Assume multiple datasets become available one by one: D_1, ..., D_S. Then

  p_1(θ) ∝ p(D_1 | θ) p(θ)
  p_2(θ) ∝ p(D_2 | θ) p_1(θ)
  ...

- Note: assuming the datasets are independent, sequential updates and a single batch learning step yield the same answer (see the sketch below).
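A sketch of this equivalence for the Gaussian mean with known variance; the prior parameters and the synthetic datasets are illustrative assumptions:

```python
# Sequential conjugate updates agree with one batch update.
import numpy as np

def posterior_mu(x, var, m0, s0sq):
    # Gaussian-mean posterior with known variance (see the earlier sketch).
    var_bar = 1.0 / (1.0 / s0sq + len(x) / var)
    return var_bar * (m0 / s0sq + x.sum() / var), var_bar

rng = np.random.default_rng(2)
D1, D2 = rng.normal(0.5, 1.0, 30), rng.normal(0.5, 1.0, 70)

m1, s1 = posterior_mu(D1, 1.0, 0.0, 10.0)   # posterior after D1 ...
m2, s2 = posterior_mu(D2, 1.0, m1, s1)      # ... becomes the prior for D2
mb, sb = posterior_mu(np.concatenate([D1, D2]), 1.0, 0.0, 10.0)
assert np.isclose(m2, mb) and np.isclose(s2, sb)
```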

Page 69: Outline (Summary)

Page 70: Summary

Summary

- Probability theory: the language of uncertainty.
- Key rules of probability: sum rule, product rule.
- Bayes' rule forms the basis of learning (posterior ∝ likelihood · prior).
- Entropy quantifies uncertainty.
- Parameter learning using maximum likelihood.
- Bayesian inference for the Gaussian.