TRANSCRIPT
Advanced Probabilistic Machine Learning
Lecture 1 – Introduction, probabilistic modeling
Niklas Wahlström
Division of Systems and Control, Department of Information Technology, Uppsala University
[email protected]/katalog/nikwa778
Previous course - Statistical machine learning
What was that course about? Supervised machine learning
Learning a model from labeled data.
Labels e.g. mat, mirror, boat, …

[Figure: Training data → Learning algorithm → Model]

Predicting the output of new data based on this model.
[Slide: supervised learning example reproduced from Nature, vol. 542, 2 February 2017 (skin lesion classification with a deep convolutional neural network). Figure 1 of that paper shows the CNN layout: a skin lesion image is mapped to a probability distribution over clinical classes of skin disease by a Google Inception v3 network, pretrained on ImageNet and fine-tuned on 129,450 skin lesion images covering 2,032 diseases; 757 training classes are aggregated into coarser inference classes such as malignant or benign melanocytic lesions. Figure 2 shows part of the tree-structured taxonomy of skin disease and example test images for three tasks: epidermal lesions, melanocytic lesions, and melanocytic lesions under dermoscopy. On these tasks the CNN's accuracy is on par with that of the tested dermatologists.]
[Figure: unseen data → model → model prediction]
We learned multiple methods for finding such models:
• Linear/logistic regression, discriminant analysis, trees, k-NN, neural networks, ensemble methods, …
• … and strategies for improving them (cross-validation)
What is this course about? (I/II)
This course extends the SML course in two aspects:
1. Probabilistic machine learning: We will have a probabilistic (a.k.a. Bayesian) viewpoint on machine learning problems.
2. Beyond supervised machine learning: We will consider other ML problems than just supervised ML.
1. Probabilistic machine learning
Probabilistic? You talked about noise, random variables and stuff already in the SML course!?
• Previously we treated the output data y as random variables.
• We now treat the model itself as a random variable.
• Advantage: Probabilistic models express the uncertainty of predictions.
What is this course about? (II/II)
2. Beyond supervised machine learning
We consider problems where we, for example, want to ...
• ... rank objects based on data (mini-project)
• ... generate more data similar to the training data
• ... compress or summarize the data

We will also learn about universal models/methods that are useful in probabilistic machine learning, but also elsewhere:
• Graphical models
• Monte Carlo methods
• Variational inference

In this sense, this course is broader, deeper and more research-oriented than the SML course.
Example - Building magnetic field maps
From my own research: Build a map of the indoor magnetic field using Gaussian processes.
https://www.youtube.com/watch?v=enlMiUqPVJo
More about Gaussian processes in lectures 7 and 8.

[1] Niklas Wahlström, Manon Kok, Thomas B. Schön and Fredrik Gustafsson. Modeling magnetic fields using Gaussian processes. The 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[2] Arno Solin, Manon Kok, Niklas Wahlström, Thomas B. Schön and Simo Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. arXiv e-prints, September 2015. arXiv:1509.04634.
Example – indoor localization
MSc thesis project: Compute the position of a person moving around indoors using sensors (inertial, magnetometer and radio) and a map.
Show movie
More about Monte Carlo methods in lecture 4.

Johan Kihlberg, Simon Tegelid, Manon Kok and Thomas B. Schön. Map aided indoor positioning using particle filters. Reglermöte (Swedish Control Conference), Linköping, Sweden, June 2014.
Example - Probabilistic ranking
Aim: Estimate the skill of chess players throughout history.
Pierre Dangauthier, Ralf Herbrich, Tom Minka and Thore Graepel. TrueSkill Through Time: Revisiting the History of Chess. NIPS, 2007.
You will work with this ranking model in the mini-project (but not necessarily applied to chess).
Course elements
• 11 lectures
• 10 problem solving sessions
• 1 mini-project (3–4 students, written report)
• 1 computer lab (4 h, no report)
• Complete course information (including lecture slides) is available from the course home page: www.it.uu.se/edu/course/homepage/apml
Teachers
Teachers involved in the course (in approximate order of appearance):
• Niklas Wahlström, Room 2319
• Andreas Lindholm, Room 2340
• Riccardo Risuleo, Room 2237
• Thomas Schön, Room 2209

All room numbers are at ITC Polacksbacken. You can reach us by email: <firstname.lastname>@it.uu.se.
Lecture outline
1. Introduction, probabilistic modeling
2. Bayesian linear regression
3. Bayesian graphical models
4. Monte Carlo methods
5. Factor graphs
6. Variational inference
7. Gaussian processes I
8. Gaussian processes II
9. Unsupervised learning
10. Variational autoencoders
11. Summary and guest lecture by James Hensman
Problem solving sessions
10 problem solving sessions:
• Solve problems, discuss and ask questions! ("räknestuga")
• 5 pen-and-paper sessions
• 5 computer-based sessions (using Python)
• Feel free to use your own laptops – Python is freely available
• Exercises available via the homepage or the student portal

The computer-based sessions are scheduled in 1 computer room + 1 normal classroom. The latter is intended for students who choose to work on their own laptops.
A great opportunity to discuss and ask questions!
Examination (I/II)
Mini project:
• Solved in groups of 3 or 4 students (no later than September 10)
• Written report (deadline: October 3)
• Peer-review: read and review another group's report (anonymously)
• Material most relevant for the mini-project is presented at lectures 3–6, but you can start working on the solution after lecture 2
• Graded U/G

Computer lab:
• 4 h computer lab, solved in groups of 2 students, graded U/G
• 4 sessions available – sign up for one of these
• Solve the preparatory exercises before the lab session!
Oral Examination (II/II)
Instead of a written exam, we have an oral examination at the end of the course.
• The exam is individual.
• 25-minute discussion with teacher(s) about the course.
• You start with a 7-minute presentation about the course.
• After the presentation the teacher(s) will lead the discussion.
• The exam will be graded as U, 3, 4, or 5.
• Time slots for the oral exam will be in weeks 43 and 44.
For more information about the oral exam, see the course homepage.
Course literature
We recommend two books:
• Barber, D. Bayesian Reasoning and Machine Learning, 2012.
• Bishop, C. M. Pattern Recognition and Machine Learning, Springer, 2006.

Both of them are freely available online, linked from the course homepage.

For some lecture(s) we will use additional resources, which will be available from the homepage.
Medical inference (Hamburgers) I/III
Ex 1.2 (Barber)
• 90% of people with Kreuzfeld-Jacob disease ate hamburgers.
• One in 100 000 has this disease.
• Assume half of the population eats hamburgers.

What is the probability that a hamburger eater will have Kreuzfeld-Jacob disease?

Define the following events:
KJ = Having Kreuzfeld-Jacob disease
H = Eating hamburger
We know that
p(H = Yes|KJ = Yes) = 90%
p(KJ = Yes) = 0.001%
p(H = Yes) = 50%
Q: What is p(KJ = Yes | H = Yes)?
Medical inference (Hamburgers) II/III
Consider a population of 1 000 000.
• p(KJ = Yes) = 0.001% have KJ disease, i.e. 10 people.
• p(H = Yes | KJ = Yes) = 90%, i.e. nine of them eat hamburgers; one of them doesn't.
• p(H = Yes) = 50% eat hamburgers, i.e. 500 000 people.
• 499 991 of them do not have KJ disease.

              H = Yes    H = No
KJ = Yes            9          1
KJ = No       499 991    499 999
Medical inference (Hamburgers) III/III
              H = Yes    H = No
KJ = Yes            9          1
KJ = No       499 991    499 999
p(KJ = Yes | H = Yes) is the proportion of all hamburger eaters who have KJ disease:

p(KJ = Yes | H = Yes) = 9 / (9 + 499 991) = 0.0018%

This can also be written as

p(KJ = Y | H = Y) = p(KJ = Y, H = Y) / (p(KJ = Y, H = Y) + p(KJ = N, H = Y)) = p(KJ = Y, H = Y) / p(H = Y)

This is an example of conditional probability and marginalization.
Conditioning, marginalization (discrete)
Conditional probability is defined as
p(x|y) = p(x, y) / p(y),  where p(y) ≠ 0

Marginalization is defined as

p(x) = ∑_y p(x, y)

Much of probability theory can be derived from these two rules.

Bayes' theorem is derived by using the definition of conditional probability twice:

p(x|y) p(y) = p(x, y) = p(y|x) p(x)   ⇒   p(x|y) = p(y|x) p(x) / p(y)
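To make the two rules concrete (this example is not on the slides, but it reuses the hamburger table above), they can be checked numerically in Python; the joint probabilities are simply the table counts divided by the population of 1 000 000.

```python
import numpy as np

# Joint distribution p(KJ, H) from the hamburger example,
# rows: KJ = Yes/No, columns: H = Yes/No (counts / 1 000 000).
p_joint = np.array([[9, 1],
                    [499_991, 499_999]]) / 1_000_000

# Marginalization: p(H) = sum over KJ of p(KJ, H)
p_H = p_joint.sum(axis=0)            # [p(H = Yes), p(H = No)]

# Conditioning: p(KJ | H = Yes) = p(KJ, H = Yes) / p(H = Yes)
p_KJ_given_Hyes = p_joint[:, 0] / p_H[0]

print(p_H)                  # [0.5, 0.5]
print(p_KJ_given_Hyes[0])   # 1.8e-05, i.e. 0.0018 %
```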
Medical inference (Hamburgers), revisited
Consider again the hamburger/Kreuzfeld-Jacob disease problem
KJ = Having Kreuzfeld-Jacob disease
H = Eating hamburger
We know that
p(H = Yes|KJ = Yes) = 90%
p(KJ = Yes) = 0.001%
p(H = Yes) = 50%
By applying Bayes’ theorem we get
p(KJ = Y | H = Y) = p(H = Y | KJ = Y) p(KJ = Y) / p(H = Y)
                  = (9/10 × 1/100 000) / (1/2)
                  = 1.8 · 10^−5 = 0.0018%
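The same number can be reproduced directly from Bayes' theorem with the three stated probabilities; a minimal sketch:

```python
# Bayes' theorem: p(KJ|H) = p(H|KJ) p(KJ) / p(H)
p_H_given_KJ = 0.90    # p(H = Yes | KJ = Yes)
p_KJ = 1e-5            # p(KJ = Yes) = 0.001 %
p_H = 0.5              # p(H = Yes)

p_KJ_given_H = p_H_given_KJ * p_KJ / p_H
print(p_KJ_given_H)    # 1.8e-05 = 0.0018 %
```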
Continuous random variables
Probability distribution
The probability distribution p(x) describes the probability for a continuous random variable falling into a given interval:

p(a < x < b) = ∫_a^b p(x) dx

[Figure: a probability density p(x); p(a < x < b) is the area under p(x) between a and b]
p(x) is also called the probability density
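As a numerical sketch of this definition (the slide does not specify a density, so a standard normal and an arbitrary interval are assumed here purely for illustration), the interval probability can be computed by integrating the density and compared with the CDF:

```python
from scipy import stats, integrate

a, b = -1.0, 0.5                     # assumed interval, for illustration only
p = stats.norm(loc=0.0, scale=1.0)   # assumed density: standard normal

# p(a < x < b) = integral of p(x) from a to b
prob_quad, _ = integrate.quad(p.pdf, a, b)
prob_cdf = p.cdf(b) - p.cdf(a)

print(prob_quad, prob_cdf)           # both ≈ 0.5328
```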
Continuous random variables
Conditioning and marginalization
Consider the joint distribution p(x, y)
[Figure: a joint density p(x, y) with its marginals p(x) and p(y), the slice p(x, y = γ) and the conditional p(x | y = γ)]

Conditional probability: p(x|y) = p(x, y) / p(y),  p(y) ≠ 0
Marginalization: p(x) = ∫ p(x, y) dy
Probabilistic/Bayesian inference
In this course most of the solutions to the problems can be stated as
p(θ|D) = p(D|θ) p(θ) / p(D)

• D: observed data
• θ: parameters of some model explaining the data
• p(θ): prior belief of parameters before we collected any data
• p(θ|D): posterior belief of parameters after observing the data
• p(D|θ): likelihood of the data in view of the parameters
• p(D): the marginal likelihood
Probabilistic/Bayesian inference
In this course most of the solutions to the problems can be stated as
p(θ|D) = p(D|θ) p(θ) / p(D)
• If we view the quantities as functions of θ, we can write
p(θ|D) ∝ p(D|θ) p(θ)
(posterior ∝ likelihood × prior)

∝ means "proportional to with respect to the parameters θ". Hence, p(D) can be viewed as a normalization constant.

• Using marginalization, we can express p(D) in terms of the likelihood and the prior:

p(D) = ∫ p(D, θ) dθ = ∫ p(D|θ) p(θ) dθ
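A minimal numerical sketch of these relations (the model here is an assumption chosen for illustration: a coin-flip likelihood with a uniform prior, so the integral is easy to check): p(D) is obtained by integrating likelihood × prior over θ, and dividing by it normalizes the posterior.

```python
import numpy as np

# Assumed illustrative model: θ = probability of heads, uniform prior on [0, 1],
# data D with m heads out of N flips, so p(D|θ) = θ^m (1 − θ)^(N − m).
N, m = 5, 4
theta = np.linspace(0, 1, 1001)

likelihood = theta**m * (1 - theta)**(N - m)   # p(D|θ)
prior = np.ones_like(theta)                    # p(θ) = 1 (uniform)

# Marginal likelihood p(D) = ∫ p(D|θ) p(θ) dθ  (trapezoidal approximation)
p_D = np.trapz(likelihood * prior, theta)

posterior = likelihood * prior / p_D           # p(θ|D), normalized
print(p_D)                                     # ≈ 1/30 ≈ 0.0333
print(np.trapz(posterior, theta))              # ≈ 1.0, the posterior integrates to one
```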
Example: Flipping of a coin
Consider a binary random variable x ∈ {0, 1} representing the outcome of flipping a coin.
• x = 1 represents "head" and x = 0 "tail".
• The probability of x = 1 is denoted by the parameter µ:

p(x = 1|µ) = µ, 0 ≤ µ ≤ 1

(assume a damaged coin, so not necessarily µ = 0.5)

Question: Given a dataset D = {x1, . . . , xN}, what is p(µ|D)?

Solution: Bayes' theorem states that

p(µ|D) ∝ p(D|µ) p(µ)
(posterior ∝ likelihood × prior)

Find likelihood and prior and then multiply! We start with the likelihood.
Example: Flipping of a coin
Solution - The likelihood (I/II)

We know that p(x = 1|µ) = µ, and consequently p(x = 0|µ) = 1 − µ.

The distribution for one observation x can be written as

p(x|µ) = Bern(x; µ) = µ^x (1 − µ)^(1−x)
This is the Bernoulli distribution.
[Figure: the Bernoulli pmf over x ∈ {0, 1}]

E[x] = µ
Var[x] = µ(1 − µ)
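A quick check of the Bernoulli pmf and its moments with scipy (the value of µ is arbitrary, chosen only for illustration):

```python
from scipy import stats

mu = 0.3                     # assumed value of µ for illustration
x = stats.bernoulli(mu)

print(x.pmf(1), x.pmf(0))    # µ and 1 − µ: 0.3, 0.7
print(x.mean(), x.var())     # E[x] = µ = 0.3, Var[x] = µ(1 − µ) = 0.21
```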
Example: Flipping of a coin
Solution - The likelihood (II/II)

The N observations are drawn independently. This gives the likelihood

p(D|µ) = ∏_{n=1}^N p(x_n|µ) = ∏_{n=1}^N µ^{x_n} (1 − µ)^{1−x_n} = µ^{∑_n x_n} (1 − µ)^{N − ∑_n x_n} = µ^m (1 − µ)^{N−m}

where m = ∑_{n=1}^N x_n, i.e. the number of heads.

Note: The likelihood depends on the data D only via m.

The likelihood of m is proportional to p(D|µ):

p(m|µ) = Bin(m; N, µ) = (N choose m) µ^m (1 − µ)^{N−m},

where (N choose m) = N! / ((N − m)! m!) is the number of sequences giving m heads. This is the binomial distribution.
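The derivation can be checked numerically: the product of Bernoulli terms equals µ^m (1 − µ)^{N−m}, so the likelihood indeed depends on D only through m (a sketch; the value of µ is arbitrary, and the dataset is the one used on the later slides):

```python
import numpy as np

D = np.array([1, 0, 1, 1, 1])      # example dataset (N = 5, m = 4)
mu = 0.6                           # assumed value of µ for the check
N, m = len(D), D.sum()

lik_product = np.prod(mu**D * (1 - mu)**(1 - D))   # ∏ µ^{x_n} (1 − µ)^{1 − x_n}
lik_compact = mu**m * (1 - mu)**(N - m)            # µ^m (1 − µ)^{N − m}

print(lik_product, lik_compact)    # both ≈ 0.05184
```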
Binomial distribution
m ∼ Bin(m; N, µ) = (N choose m) µ^m (1 − µ)^{N−m}

[Figure: bar plot of the binomial pmf over m = 0, 1, …, 10]

The binomial distribution for N = 10 and µ = 0.25.

E[m] = Nµ
Var[m] = Nµ(1 − µ)
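A sketch reproducing the numbers on this slide with scipy (N = 10, µ = 0.25):

```python
import numpy as np
from scipy import stats

N, mu = 10, 0.25
m = np.arange(N + 1)
pmf = stats.binom.pmf(m, N, mu)    # Bin(m; N, µ) for m = 0, …, 10

print(pmf.round(3))                # the bar heights in the figure
print(stats.binom.mean(N, mu))     # E[m] = Nµ = 2.5
print(stats.binom.var(N, mu))      # Var[m] = Nµ(1 − µ) = 1.875
```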
Example: Flipping of a coin
Solution - The prior (I/II)

Remember Bayes' theorem:  p(µ|m) ∝ p(m|µ) p(µ)   (posterior ∝ likelihood × prior)

• Multiple possible prior distributions p(µ) exist.
• We opt for a prior which has attractive analytical properties.

We choose a prior such that the posterior will be of the same functional form as the prior. We call this a conjugate prior.

The conjugate prior of the Binomial distribution is the Beta distribution

Beta(µ; a, b) = Γ(a + b) / (Γ(a) Γ(b)) · µ^{a−1} (1 − µ)^{b−1}

where Γ(a) is the Gamma function.
Example: Flipping of a coin
Solution - The posterior

The posterior can now be computed:

p(µ|m) ∝ p(m|µ) p(µ)
       ∝ Bin(m; N, µ) Beta(µ; a, b)
       ∝ µ^m (1 − µ)^{N−m} · µ^{a−1} (1 − µ)^{b−1}
       = µ^{m+a−1} (1 − µ)^{N−m+b−1}.

Hence, the posterior is also a Beta distribution

p(µ|m) = Beta(µ; a*, b*),

where

a* = m + a,
b* = N − m + b.
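The conjugate update is a one-line computation; a sketch (the numbers are taken from the batch example on the later slides: a uniform Beta(1, 1) prior and m = 4 heads out of N = 5 flips):

```python
from scipy import stats

def beta_posterior(a, b, m, N):
    """Beta(a, b) prior + m heads out of N flips -> Beta(a*, b*) posterior."""
    return a + m, b + N - m

a_star, b_star = beta_posterior(a=1, b=1, m=4, N=5)
posterior = stats.beta(a_star, b_star)

print(a_star, b_star)      # 5, 2
print(posterior.mean())    # E[µ | D] = a*/(a* + b*) ≈ 0.714
```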
The Beta distribution
Beta(µ; a, b) = Γ(a + b) / (Γ(a) Γ(b)) · µ^{a−1} (1 − µ)^{b−1},

where Γ(a) = ∫_0^∞ e^{−x} x^{a−1} dx

[Figure: Beta densities over µ ∈ [0, 1] for (a, b) = (1, 1), (0.1, 0.1) and (2, 3)]

E[µ] = a / (a + b),   Var[µ] = ab / ((a + b)^2 (a + b + 1))
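The curves in the figure and the moment formulas can be reproduced with scipy.stats.beta (a sketch using the (a, b) pairs from the figure):

```python
import numpy as np
from scipy import stats

mu = np.linspace(0.01, 0.99, 5)                     # coarse grid over (0, 1)
for a, b in [(1, 1), (0.1, 0.1), (2, 3)]:           # the (a, b) pairs in the figure
    print(a, b, stats.beta.pdf(mu, a, b).round(2))  # Beta(µ; a, b) on the grid

a, b = 2, 3
mean, var = stats.beta.stats(a, b, moments='mv')
print(mean, var)   # a/(a + b) = 0.4,  ab/((a + b)^2 (a + b + 1)) = 0.04
```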
Example: Flipping of a coin
Bayesian inference

Prior            Likelihood function    Posterior
p(µ)             p(m|µ)                 p(µ|m)
Beta(µ; a, b)    Bin(m; N, µ)           Beta(µ; a*, b*)
a = 1, b = 1     m = 1, N = 1           a* = 2, b* = 1

[Figure: prior × likelihood ∝ posterior, each plotted over µ ∈ [0, 1]]

• If you don't know anything about the coin, start with an uninformative prior: Beta(µ; 1, 1) = 1.
• Assume we get one data point x1 = 1.
• Posterior ∝ likelihood × prior
Example: Flipping of a coin
Bayesian inference

Prior            Likelihood function    Posterior
p(µ)             p(m|µ)                 p(µ|m)
Beta(µ; a, b)    Bin(m; N, µ)           Beta(µ; a*, b*)
a = 1, b = 1     m = 4, N = 5           a* = 5, b* = 2

[Figure: prior × likelihood ∝ posterior, each plotted over µ ∈ [0, 1]]

Assume you get N = 5 data points, of which m = 4 are heads, D = {1, 0, 1, 1, 1}.
Example: Flipping of a coin
Sequential Bayesian inference

• Bayesian inference admits sequential inference.
• After observing new data we use the posterior as the new prior.

Prior            Likelihood function    Posterior
p(µ)             p(m|µ)                 p(µ|m)
Beta(µ; a, b)    Bin(m; N, µ)           Beta(µ; a*, b*)
a = 1, b = 1     m = 1, N = 1           a* = 2, b* = 1

[Figure: prior, likelihood and posterior over µ ∈ [0, 1]]

Data: D1 = {1} (current), D2 = {0}, D3 = {1}, D4 = {1}, D5 = {1}
Sequential Bayesian inference, step 2:

Prior            Likelihood function    Posterior
p(µ)             p(m|µ)                 p(µ|m)
Beta(µ; a, b)    Bin(m; N, µ)           Beta(µ; a*, b*)
a = 2, b = 1     m = 0, N = 1           a* = 2, b* = 2

Data: D1 = {1}, D2 = {0} (current), D3 = {1}, D4 = {1}, D5 = {1}
Sequential Bayesian inference, step 3:

Prior            Likelihood function    Posterior
p(µ)             p(m|µ)                 p(µ|m)
Beta(µ; a, b)    Bin(m; N, µ)           Beta(µ; a*, b*)
a = 2, b = 2     m = 1, N = 1           a* = 3, b* = 2

Data: D1 = {1}, D2 = {0}, D3 = {1} (current), D4 = {1}, D5 = {1}
Sequential Bayesian inference, step 4:

Prior            Likelihood function    Posterior
p(µ)             p(m|µ)                 p(µ|m)
Beta(µ; a, b)    Bin(m; N, µ)           Beta(µ; a*, b*)
a = 3, b = 2     m = 1, N = 1           a* = 4, b* = 2

Data: D1 = {1}, D2 = {0}, D3 = {1}, D4 = {1} (current), D5 = {1}
Sequential Bayesian inference, step 5:

Prior            Likelihood function    Posterior
p(µ)             p(m|µ)                 p(µ|m)
Beta(µ; a, b)    Bin(m; N, µ)           Beta(µ; a*, b*)
a = 4, b = 2     m = 1, N = 1           a* = 5, b* = 2

Data: D1 = {1}, D2 = {0}, D3 = {1}, D4 = {1}, D5 = {1} (current)
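The five steps above correspond to the following loop, in which the posterior after each observation becomes the prior for the next one; a minimal sketch that ends in the same Beta(µ; 5, 2) as the batch update:

```python
a, b = 1, 1                      # uninformative prior Beta(µ; 1, 1)
data = [1, 0, 1, 1, 1]           # D1, …, D5

for x in data:
    # one observation: m = x heads out of N = 1 flip
    a, b = a + x, b + (1 - x)    # posterior Beta(µ; a*, b*) becomes the new prior
    print(f"after x = {x}: Beta(mu; {a}, {b})")

# final posterior: Beta(mu; 5, 2), identical to processing all the data at once
```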
Concluding remarks
Probabilistic/Bayesian inference is a flexible way of dealing with machine learning problems.

Properties:
• Treat not only the data, but also the model and its parameters (if parametric) as random variables.
• After learning, you not only get a single model, you get a distribution of likely models.
• You can encode prior knowledge you might have about the model and its parameters.
A few concepts to summarize lecture 1
Probability distribution: Function that describes the likelihood of obtaining the possible values that a random variable can assume.
Conditioning and marginalization: Two basic rules for manipulating probability distributions.
Bayes' theorem: p(x|y) = p(y|x) p(x) / p(y)
Prior: Belief of parameter before we have seen any data.
Likelihood: Belief of data in view of the parameters.
Posterior: Belief of parameter after observing the data.
Bernoulli distribution: Distribution for a binary random variable.
Binomial distribution: Distribution for the sum of multiple binary random variables.
Beta distribution: Conjugate prior for the Binomial distribution.
Conjugate prior: A prior ensuring that the posterior and the prior belong to the same probability distribution family.