Advanced Probabilistic Machine Learning
Lecture 1 – Introduction, probabilistic modeling

Niklas Wahlström
Division of Systems and Control
Department of Information Technology
Uppsala University

[email protected]
www.it.uu.se/katalog/nikwa778



What is the course about?


Previous course – Statistical machine learning

What was that course about? Supervised machine learning:
• Learning a model from labeled data (labels e.g. mat, mirror, boat, ...).
  [Diagram: training data → learning algorithm → model]
• Predicting the output of new data based on this model.

[Figure: pages from A. Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature, vol. 542, pp. 115–118, February 2017 – a deep CNN (Google Inception v3, fine-tuned on 129,450 skin lesion images covering 2,032 diseases) that classifies skin lesions at dermatologist level, shown as a real-world example of supervised learning.]

[Diagram: unseen data → model → prediction?]

We learned multiple methods for finding such models:
• linear/logistic regression, discriminant analysis, trees, k-NN, neural networks, ensemble methods, ...
• ... and strategies for improving them (e.g. cross-validation).


What is this course about? (I/II)

This course extends the SML course in two aspects:
1. Probabilistic machine learning – we will take a probabilistic (a.k.a. Bayesian) viewpoint on machine learning problems.
2. Beyond supervised machine learning – we will consider other ML problems than just supervised ML.

1. Probabilistic machine learning

Probabilistic? You already talked about noise, random variables and such in the SML course!?
• Previously we treated the output data y as a random variable.
• We now treat the model itself as a random variable.
• Advantage: probabilistic models express the uncertainty of their predictions.


What is this course about? (II/II)

2. Beyond supervised machine learning

We consider problems where we, for example, want to ...
• ... rank objects based on data (mini-project)
• ... generate more data similar to the training data
• ... compress or summarize the data

We will also learn about universal models/methods that are useful in probabilistic machine learning, but also elsewhere:
• Graphical models
• Monte Carlo methods
• Variational inference

In this sense this course is broader, deeper and more research-oriented than the SML course.


Example – Building magnetic field maps

From my own research: build a map of the indoor magnetic field using Gaussian processes.

https://www.youtube.com/watch?v=enlMiUqPVJo

More about Gaussian processes in lectures 7 and 8.

[1] Niklas Wahlström, Manon Kok, Thomas B. Schön and Fredrik Gustafsson. Modeling magnetic fields using Gaussian processes. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[2] Arno Solin, Manon Kok, Niklas Wahlström, Thomas B. Schön and Simo Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. ArXiv e-prints, September 2015. arXiv:1509.04634.


Example – Indoor localization

MSc thesis project: compute the position of a person moving around indoors using sensors (inertial, magnetometer and radio) and a map.

[Movie shown in lecture]

More about Monte Carlo methods in lecture 4.

Johan Kihlberg, Simon Tegelid, Manon Kok and Thomas B. Schön. Map aided indoor positioning using particle filters. Reglermöte (Swedish Control Conference), Linköping, Sweden, June 2014.


Example – Probabilistic ranking

Aim: estimate the skill of chess players throughout history.

Pierre Dangauthier, Ralf Herbrich, Tom Minka and Thore Graepel. TrueSkill Through Time: Revisiting the History of Chess. NIPS, 2007.

You will work with this ranking model in the mini-project (but not necessarily applied to chess).


Course information


Course elements

• 11 lectures
• 10 problem solving sessions
• 1 mini-project (3–4 students, written report)
• 1 computer lab (4 h, no report)
• Complete course information (including lecture slides) is available from the course home page: www.it.uu.se/edu/course/homepage/apml


Teachers

Teachers involved in the course (in approximate order of appearance):
• Niklas Wahlström, room 2319
• Andreas Lindholm, room 2340
• Riccardo Risuleo, room 2237
• Thomas Schön, room 2209

All room numbers are at ITC Polacksbacken. You can reach us by email: <firstname.lastname>@it.uu.se.


Lecture outline

1. Introduction, probabilistic modeling
2. Bayesian linear regression
3. Bayesian graphical models
4. Monte Carlo methods
5. Factor graphs
6. Variational inference
7. Gaussian processes I
8. Gaussian processes II
9. Unsupervised learning
10. Variational autoencoders
11. Summary and guest lecture by James Hensman


Problem solving sessions

10 problem solving sessions:
• Solve problems, discuss and ask questions! ("räknestuga")
• 5 pen-and-paper sessions
• 5 computer-based sessions (using Python)
• Feel free to use your own laptop – Python is freely available
• Exercises available via the homepage or the student portal

The computer-based sessions are scheduled in one computer room plus one normal classroom; the latter is intended for students who choose to work on their own laptops.

A great opportunity to discuss and ask questions!


Examination (I/II)

Mini-project:
• Solved in groups of 3 or 4 students (form groups no later than September 10)
• Written report (deadline: October 3)
• Peer review: read and review another group's report (anonymously)
• The material most relevant for the mini-project is presented in lectures 3–6, but you can start working on the solution after lecture 2
• Graded U/G

Lab:
• 4 h computer lab, solved in groups of 2 students, graded U/G
• 4 sessions available – sign up for one of them
• Solve the preparatory exercises before the lab session!


Oral examination (II/II)

Instead of a written exam, we have an oral examination at the end of the course.
• The exam is individual.
• 25 minutes of discussion with the teacher(s) about the course.
• You start with a 7-minute presentation about the course.
• After the presentation the teacher(s) will lead the discussion.
• The exam is graded U, 3, 4 or 5.
• Time slots for the oral exam will be in weeks 43 and 44.

For more information about the oral exam, see the course homepage.


Course literature

We recommend two books:
• David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
• Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Both are freely available online, linked from the course homepage.

For some lectures we will use additional resources, which will be available from the homepage.


Probability fundamentals


Medical inference (hamburgers) I/III

Ex 1.2 (Barber):
• 90% of people with Creutzfeldt–Jakob disease ate hamburgers.
• One in 100,000 has this disease.
• Assume half of the population eats hamburgers.

What is the probability that a hamburger eater will have Creutzfeldt–Jakob disease?

Define the following events:
KJ = having Creutzfeldt–Jakob disease
H = eating hamburgers

We know that
p(H = Yes | KJ = Yes) = 90%
p(KJ = Yes) = 0.001%
p(H = Yes) = 50%

Q: What is p(KJ = Yes | H = Yes)?


Medical inference (hamburgers) II/III

Consider a population of 1,000,000.
• p(KJ = Yes) = 0.001% have KJ disease, i.e. 10 people.
• Of these, p(H = Yes | KJ = Yes) = 90%, i.e. nine, eat hamburgers; one of them doesn't.
• p(H = Yes) = 50% eat hamburgers, i.e. 500,000 people.
• 499,991 of them do not have KJ disease.

           H = Yes    H = No
KJ = Yes         9         1
KJ = No    499,991   499,999


Medical inference (hamburgers) III/III

           H = Yes    H = No
KJ = Yes         9         1
KJ = No    499,991   499,999

p(KJ = Yes | H = Yes) is the proportion of all hamburger eaters who have KJ disease:

p(KJ = Yes | H = Yes) = 9 / (9 + 499,991) = 0.0018%

This can also be written as

p(KJ = Y | H = Y) = p(KJ = Y, H = Y) / (p(KJ = Y, H = Y) + p(KJ = N, H = Y))
                  = p(KJ = Y, H = Y) / p(H = Y)

This is an example of conditional probability and marginalization.


Conditioning, marginalization (discrete)

Conditional probability is defined as

p(x | y) = p(x, y) / p(y),   where p(y) ≠ 0.

Marginalization is defined as

p(x) = Σ_y p(x, y).

Much of probability theory can be derived from these two rules.

Bayes' theorem is derived by using the definition of conditional probability twice:

p(x | y) p(y) = p(x, y) = p(y | x) p(x)   ⇒   p(x | y) = p(y | x) p(x) / p(y)

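The two rules can be checked numerically on the hamburger example. A minimal sketch in Python (the language used in the course's computer sessions), treating the population counts as a joint distribution:

```python
import numpy as np

# Joint distribution p(KJ, H) from the hamburger example, built from the
# population counts (rows: KJ = Yes/No, columns: H = Yes/No).
counts = np.array([[9, 1],
                   [499_991, 499_999]])
p_joint = counts / counts.sum()

# Marginalization: p(H) = sum over KJ of p(KJ, H)
p_H = p_joint.sum(axis=0)

# Conditioning: p(KJ | H = Yes) = p(KJ, H = Yes) / p(H = Yes)
p_KJ_given_H_yes = p_joint[:, 0] / p_H[0]

print(p_KJ_given_H_yes[0])   # ≈ 1.8e-05, i.e. 0.0018%
```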

Medical inference (hamburgers), revisited

Consider again the hamburger/Creutzfeldt–Jakob disease problem:

KJ = having Creutzfeldt–Jakob disease
H = eating hamburgers

We know that
p(H = Yes | KJ = Yes) = 90%
p(KJ = Yes) = 0.001%
p(H = Yes) = 50%

By applying Bayes' theorem we get

p(KJ = Y | H = Y) = p(H = Y | KJ = Y) p(KJ = Y) / p(H = Y)
                  = (9/10 × 1/100,000) / (1/2)
                  = 1.8 · 10^-5 = 0.0018%
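The same number falls out of a few lines of Python; a sketch with the three given probabilities hard-coded:

```python
# Creutzfeldt-Jakob / hamburger example, directly via Bayes' theorem:
# p(KJ | H) = p(H | KJ) p(KJ) / p(H)
p_H_given_KJ = 0.90        # 90% of KJ patients ate hamburgers
p_KJ = 1 / 100_000         # one in 100,000 has the disease
p_H = 0.50                 # half of the population eats hamburgers

p_KJ_given_H = p_H_given_KJ * p_KJ / p_H
print(p_KJ_given_H)        # ≈ 1.8e-05, i.e. 0.0018%
```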


Continuous random variables – probability distribution

The probability distribution p(x) describes the probability of a continuous random variable falling into a given interval:

p(a < x < b) = ∫_a^b p(x) dx

[Plot: a density p(x), with the area between a and b shaded]

p(x) is also called the probability density.


Continuous random variables – conditioning and marginalization

Consider the joint distribution p(x, y).

[Plot: a joint density p(x, y) over (x, y), with the marginals p(x) and p(y), the slice p(x, y = γ), and the conditional p(x | y = γ)]

Conditional probability:   p(x | y) = p(x, y) / p(y),   p(y) ≠ 0
Marginalization:           p(x) = ∫_y p(x, y) dy


Probabilistic/Bayesian inference

In this course most of the solutions to the problems can be stated as

p(θ | D) = p(D | θ) p(θ) / p(D)

• D: the observed data
• θ: the parameters of some model explaining the data
• p(θ): the prior belief about the parameters, before we collected any data
• p(θ | D): the posterior belief about the parameters, after observing the data
• p(D | θ): the likelihood of the data in view of the parameters
• p(D): the marginal likelihood


Probabilistic/Bayesian inference

In this course most of the solutions to the problems can be stated as

p(θ | D) = p(D | θ) p(θ) / p(D)

• If we view the quantities as functions of θ, we can write

  p(θ | D) ∝ p(D | θ) p(θ)
  (posterior ∝ likelihood × prior)

  Here ∝ means "proportional to, with respect to the parameters θ". Hence, p(D) can be viewed as a normalization constant.

• Using marginalization, we can express p(D) in terms of the likelihood and the prior:

  p(D) = ∫ p(D, θ) dθ = ∫ p(D | θ) p(θ) dθ

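The role of p(D) as a normalization constant is easy to see on a grid. A sketch, assuming the Bernoulli coin model that the following slides introduce (uniform prior, m = 4 heads in N = 5 flips):

```python
import numpy as np

# Grid over the unknown parameter (here the coin's head probability mu)
mu = np.linspace(0.0, 1.0, 10_001)
dmu = mu[1] - mu[0]

prior = np.ones_like(mu)              # uniform prior p(mu) = 1 on [0, 1]
likelihood = mu**4 * (1 - mu)**1      # p(D | mu) for m = 4 heads in N = 5 flips

# Marginal likelihood p(D) = integral of p(D | mu) p(mu) dmu (Riemann sum)
unnormalized = likelihood * prior
p_D = np.sum(unnormalized) * dmu

posterior = unnormalized / p_D        # the posterior now integrates to one
print(p_D)                            # ≈ 0.0333... = 1/30
```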

Example: Flipping a coin

Consider a binary random variable x ∈ {0, 1} representing the outcome of flipping a coin.
• x = 1 represents "heads" and x = 0 "tails".
• The probability of x = 1 is denoted by the parameter µ:

  p(x = 1 | µ) = µ,   0 ≤ µ ≤ 1

  (assume a possibly damaged coin, so not necessarily µ = 0.5)

Question: Given a dataset D = {x1, . . . , xN}, what is p(µ | D)?

Solution: Bayes' theorem states that

p(µ | D) ∝ p(D | µ) p(µ)
(posterior ∝ likelihood × prior)

Find the likelihood and the prior, and then multiply! We start with the likelihood.


Example: Flipping a coin
Solution – the likelihood (I/II)

We know that p(x = 1 | µ) = µ, and consequently p(x = 0 | µ) = 1 − µ.

The distribution for one observation x can be written as

p(x | µ) = Bern(x; µ) = µ^x (1 − µ)^(1−x)

This is the Bernoulli distribution.

[Plot: the Bernoulli probability mass function]

E[x] = µ
Var[x] = µ(1 − µ)

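A quick numerical check of the pmf and the stated moments, for a hypothetical coin with µ = 0.25:

```python
import numpy as np

mu = 0.25                                    # hypothetical coin parameter

def bern_pmf(x, mu):
    """Bernoulli pmf: mu**x * (1 - mu)**(1 - x), for x in {0, 1}."""
    return mu**x * (1 - mu)**(1 - x)

print(bern_pmf(1, mu), bern_pmf(0, mu))      # 0.25 0.75

# E[x] = mu and Var[x] = mu * (1 - mu), checked by simulation
rng = np.random.default_rng(seed=1)
x = (rng.random(200_000) < mu).astype(float)
print(x.mean(), x.var())                     # ≈ 0.25 and ≈ 0.1875
```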

Example: Flipping a coin
Solution – the likelihood (II/II)

The N observations are drawn independently. This gives the likelihood

p(D | µ) = ∏_{n=1}^{N} p(x_n | µ) = ∏_{n=1}^{N} µ^{x_n} (1 − µ)^{1−x_n}
         = µ^{Σ_n x_n} (1 − µ)^{N − Σ_n x_n} = µ^m (1 − µ)^{N−m}

where m = Σ_{n=1}^{N} x_n, i.e. the number of heads.

Note: the likelihood depends on the data D only via m.

The likelihood of m is proportional to p(D | µ):

p(m | µ) = Bin(m; N, µ) = (N choose m) µ^m (1 − µ)^{N−m},

where (N choose m) = N! / ((N − m)! m!) is the number of sequences giving m heads. This is the binomial distribution.

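As a sketch, the relation between the likelihood and the binomial pmf for a hypothetical dataset of 4 heads in 5 flips:

```python
import math

# Hypothetical dataset (the one used later in the lecture): 4 heads in 5 flips
D = [1, 0, 1, 1, 1]
N, m = len(D), sum(D)                 # N = 5, m = 4

def likelihood(mu):
    """p(D | mu) = mu**m * (1 - mu)**(N - m); depends on D only through m."""
    return mu**m * (1 - mu)**(N - m)

def binom_pmf(m, N, mu):
    """p(m | mu) = C(N, m) * mu**m * (1 - mu)**(N - m)."""
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

# The binomial pmf is the likelihood times the number of orderings C(N, m):
assert math.isclose(binom_pmf(m, N, 0.6), math.comb(N, m) * likelihood(0.6))
```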

Binomial distribution

m ∼ Bin(m; N, µ) = (N choose m) µ^m (1 − µ)^{N−m}

[Plot: the binomial distribution for N = 10 and µ = 0.25]

E[m] = Nµ
Var[m] = Nµ(1 − µ)

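The stated moments can be checked directly against the pmf; a short sketch for the plotted case N = 10, µ = 0.25:

```python
import math

N, mu = 10, 0.25
pmf = [math.comb(N, m) * mu**m * (1 - mu)**(N - m) for m in range(N + 1)]

mean = sum(m * p for m, p in enumerate(pmf))             # should equal N * mu
var = sum((m - mean)**2 * p for m, p in enumerate(pmf))  # N * mu * (1 - mu)

print(mean, var)   # ≈ 2.5 and ≈ 1.875
```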

Example: Flipping a coin
Solution – the prior (I/II)

Remember Bayes' theorem:
p(µ | m) ∝ p(m | µ) p(µ)
(posterior ∝ likelihood × prior)

• Multiple possible prior distributions p(µ) exist.
• We opt for a prior which has attractive analytical properties.

We choose a prior such that the posterior will be of the same functional form as the prior. We call this a conjugate prior.

The conjugate prior of the binomial distribution is the Beta distribution:

Beta(µ; a, b) = Γ(a + b) / (Γ(a) Γ(b)) µ^{a−1} (1 − µ)^{b−1}

where Γ(a) is the Gamma function.


Example: Flipping a coin
Solution – the posterior

The posterior can now be computed:

p(µ | m) ∝ p(m | µ) p(µ)
         ∝ Bin(m; N, µ) Beta(µ; a, b)
         ∝ µ^m (1 − µ)^{N−m} µ^{a−1} (1 − µ)^{b−1}
         = µ^{m+a−1} (1 − µ)^{N−m+b−1}.

Hence, the posterior is also a Beta distribution,

p(µ | m) = Beta(µ; a∗, b∗),

where

a∗ = m + a,
b∗ = N − m + b.

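The conjugate update is just two additions. A sketch (with a hypothetical `beta_pdf` helper) that also verifies, on a grid, that likelihood × prior renormalizes to Beta(µ; a∗, b∗):

```python
import numpy as np
from math import gamma

def beta_pdf(mu, a, b):
    """Beta density: Gamma(a+b) / (Gamma(a) Gamma(b)) * mu**(a-1) * (1-mu)**(b-1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

def update(a, b, m, N):
    """Conjugate Beta-binomial update: a* = m + a, b* = N - m + b."""
    return m + a, N - m + b

a, b, m, N = 1, 1, 4, 5                  # uniform prior; 4 heads in 5 flips
a_post, b_post = update(a, b, m, N)
print(a_post, b_post)                    # 5 2

# Grid check: likelihood x prior, renormalized, matches Beta(mu; a*, b*)
mu = np.linspace(1e-6, 1 - 1e-6, 20_001)
unnorm = mu**m * (1 - mu)**(N - m) * beta_pdf(mu, a, b)
posterior = unnorm / (np.sum(unnorm) * (mu[1] - mu[0]))
assert np.allclose(posterior, beta_pdf(mu, a_post, b_post), atol=1e-2)
```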

The Beta distribution

Beta(µ; a, b) = Γ(a + b) / (Γ(a) Γ(b)) µ^{a−1} (1 − µ)^{b−1},

where Γ(a) = ∫_0^∞ e^{−x} x^{a−1} dx.

[Plot: Beta densities for (a, b) = (1, 1), (0.1, 0.1) and (2, 3)]

E[µ] = a / (a + b)
Var[µ] = ab / ((a + b)² (a + b + 1))


Example: Flipping a coin
Bayesian inference

Prior           Likelihood function   Posterior
p(µ)            p(m | µ)              p(µ | m)
Beta(µ; a, b)   Bin(m; N, µ)          Beta(µ; a∗, b∗)
a = 1, b = 1    m = 1, N = 1          a∗ = 2, b∗ = 1

[Plots: prior × likelihood → posterior, each as a function of µ]

• If you don't know anything about the coin, start with an uninformative prior: Beta(µ; 1, 1) = 1.
• Assume we get one data point x1 = 1.
• Posterior ∝ likelihood × prior.


Example: Flipping a coin
Bayesian inference

Prior           Likelihood function   Posterior
p(µ)            p(m | µ)              p(µ | m)
Beta(µ; a, b)   Bin(m; N, µ)          Beta(µ; a∗, b∗)
a = 1, b = 1    m = 4, N = 5          a∗ = 5, b∗ = 2

[Plots: prior × likelihood → posterior, each as a function of µ]

Assume you get N = 5 data points, of which m = 4 are heads: D = {1, 0, 1, 1, 1}.


Example: Flipping a coin
Sequential Bayesian inference

• Bayesian inference admits sequential inference.
• After observing new data, we use the posterior as the new prior.

Prior           Likelihood function   Posterior
p(µ)            p(m | µ)              p(µ | m)
Beta(µ; a, b)   Bin(m; N, µ)          Beta(µ; a∗, b∗)
a = 1, b = 1    m = 1, N = 1          a∗ = 2, b∗ = 1

[Plots: prior × likelihood → posterior]

D1 = {1},  D2 = {0}, D3 = {1}, D4 = {1}, D5 = {1}


[The slide is repeated once per observation, with the previous posterior as the new prior:]

D2 = {0}: prior a = 2, b = 1; m = 0, N = 1 → posterior a∗ = 2, b∗ = 2
D3 = {1}: prior a = 2, b = 2; m = 1, N = 1 → posterior a∗ = 3, b∗ = 2
D4 = {1}: prior a = 3, b = 2; m = 1, N = 1 → posterior a∗ = 4, b∗ = 2
D5 = {1}: prior a = 4, b = 2; m = 1, N = 1 → posterior a∗ = 5, b∗ = 2

The final posterior, Beta(µ; 5, 2), is the same as the batch posterior computed from all of D = {1, 0, 1, 1, 1} at once.
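The whole sequence collapses into a short loop; a minimal sketch using the same data:

```python
# Sequential Bayesian inference for the coin example: after each observation,
# the posterior Beta(mu; a*, b*) becomes the prior for the next update.
# Data as on the slides: D = {1, 0, 1, 1, 1}.
a, b = 1, 1                        # uninformative prior Beta(mu; 1, 1)
for x in [1, 0, 1, 1, 1]:
    a, b = a + x, b + (1 - x)      # conjugate update with N = 1, m = x
    print(a, b)                    # the posterior after this flip

# Final posterior Beta(mu; 5, 2): the same as the batch update with m = 4, N = 5.
```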


Concluding remarks

Probabilistic/Bayesian inference is a flexible way of dealing with machine learning problems.

Properties:
• We treat not only the data, but also the model and its parameters (if parametric) as random variables.
• After learning, you do not get just a single model – you get a distribution over likely models.
• You can encode prior knowledge you might have about the model and its parameters.


A few concepts to summarize lecture 1

Probability distribution: function that describes the likelihood of obtaining the possible values that a random variable can assume.
Conditioning and marginalization: two basic rules for manipulating probability distributions.
Bayes' theorem: p(x|y) = p(y|x) p(x) / p(y).
Prior: belief about the parameter before we have seen any data.
Likelihood: belief about the data in view of the parameters.
Posterior: belief about the parameter after observing the data.
Bernoulli distribution: distribution for a binary random variable.
Binomial distribution: distribution for the sum of multiple binary random variables.
Beta distribution: conjugate prior for the binomial distribution.
Conjugate prior: a prior ensuring that the posterior and the prior belong to the same probability distribution family.