Primer on Probabilities and Statistics
TRANSCRIPT
Stefanos Zafeiriou, Adv. Statistical Machine Learning (495)
Probability and Statistics Primer
Basic Concepts
Maximum Likelihood Parameter Estimation
Reading:
• Many primers (check the internet), e.g., Chapters 1 and 2 of
Pattern Recognition and Machine Learning by C. Bishop
A Probability Primer
• Assume an event where there is a degree of uncertainty in the outcome.
• Random variable: a function which maps events or outcomes to a number set (e.g., the integers or the reals); it refers to an event.
Frequentist Definition
• Frequency probability: probability is the limit of an outcome's relative frequency in a large number of trials,

p(x) = lim_{n_t → ∞} n_x / n_t,   or approximately   p(x) ≈ n_x / n_t,

where n_x is the number of trials in which outcome x occurred and n_t is the total number of trials.
• It is the relative frequency with which an outcome would be obtained if the process were repeated a large number of times under exactly the same conditions.
Bayesian View
• Bayesian view: probability is a measure of belief regarding the predicted outcome of an event.
• Uses Bayes' theorem to develop a calculus for probabilistic reasoning.

Bayes' theorem:

p(x, y) = p(x|y) p(y) = p(y|x) p(x)

or:

p(x|y) = p(x, y) / p(y),   p(y|x) = p(x, y) / p(x)

or:

p(x|y) = p(y|x) p(x) / p(y)
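As a quick sanity check of these identities, one can tabulate a small joint distribution and verify that the conditional obtained directly from the joint matches the one obtained via Bayes' theorem. A minimal Python sketch (the joint-table numbers here are illustrative, not from the slides):

```python
# Illustrative joint distribution p(x, y) over two binary variables (sums to 1).
p_joint = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.25, (1, 1): 0.35}

def p_x(x):
    # Marginal p(x) = sum over y of p(x, y)
    return sum(v for (xi, yi), v in p_joint.items() if xi == x)

def p_y(y):
    # Marginal p(y) = sum over x of p(x, y)
    return sum(v for (xi, yi), v in p_joint.items() if yi == y)

def p_x_given_y(x, y):
    # p(x|y) = p(x, y) / p(y)
    return p_joint[(x, y)] / p_y(y)

def p_y_given_x(y, x):
    # p(y|x) = p(x, y) / p(x)
    return p_joint[(x, y)] / p_x(x)

# Bayes' theorem: p(x|y) = p(y|x) p(x) / p(y)
lhs = p_x_given_y(1, 1)
rhs = p_y_given_x(1, 1) * p_x(1) / p_y(1)
print(lhs, rhs)  # the two values agree
```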
Joint Probability Distribution
• Joint probabilities can be defined between any number of variables, e.g., p(X, Y, Z).
• For every combination of values we need to define how probable that combination is, e.g., p(X = 1, Y = 1, Z = 1).
• The probabilities of all combinations need to sum up to 1.
• For 3 random variables each taking two values, the table contains 8 entries:

X Y Z  p(X, Y, Z)
0 0 0  0.10
0 0 1  0.20
0 1 0  0.05
0 1 1  0.05
1 0 0  0.30
1 0 1  0.10
1 1 0  0.05
1 1 1  0.15
(the entries sum up to 1)
Joint Probability Distribution
• Given the joint probability distribution p(X, Y, Z) (the table on the previous slide), you can calculate any probability involving X, Y, and Z.
• Note: this may need marginalization and Bayes' rule (neither of which is discussed in detail in these slides).
Examples of things you can compute:
• p(X = 1) = Σ_{y,z} p(X = 1, Y = y, Z = z)   (sum rule)
• p(Y = 1, Z = 1 | X = 1) = p(X = 1, Y = 1, Z = 1) / p(X = 1)   (Bayes' theorem)
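Both example computations can be carried out directly on the table. A short Python sketch using the table's values:

```python
# Joint distribution table p(X, Y, Z) from the slide.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.20, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
     (1, 0, 0): 0.30, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.15}

# Sum rule: p(X=1) = sum over y, z of p(X=1, Y=y, Z=z)
p_x1 = sum(v for (x, y, z), v in p.items() if x == 1)

# Conditioning: p(Y=1, Z=1 | X=1) = p(X=1, Y=1, Z=1) / p(X=1)
p_yz_given_x = p[(1, 1, 1)] / p_x1

print(round(p_x1, 10))          # ≈ 0.6
print(round(p_yz_given_x, 10))  # ≈ 0.25
```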
Bayes Theorem: An Example
• A school has 60% boys and 40% girls as its students.
• The female students wear trousers or skirts in equal numbers;
• all boys wear trousers.
An observer sees a (random) student from a distance, and all the observer can see is that this student is wearing trousers.
What is the probability this student is a girl? The correct answer can be computed using Bayes' theorem.
Bayes Theorem: An Example
An observer sees a (random) student from a distance, and all the observer can see is that this student is wearing trousers. What is the probability this student is a girl?
The requested probability is translated as p(girl | trousers).
Bayes Theorem: An Example
• Bayes' theorem: p(girl | trousers) = p(trousers | girl) p(girl) / p(trousers)
• What do we know? p(girl) = 0.4, p(boy) = 0.6, p(trousers | girl) = 0.5, p(trousers | boy) = 1
• What are we missing? The denominator p(trousers)
Bayes Theorem: An Example
• How can we find it? By marginalization:
p(trousers) = p(trousers | girl) p(girl) + p(trousers | boy) p(boy) = 0.5 × 0.4 + 1 × 0.6 = 0.8
• Hence p(trousers, girl) = p(trousers | girl) p(girl) = 0.5 × 0.4 = 0.2
• And Bayes again to find p(girl | trousers) = p(trousers, girl) / p(trousers) = 0.2 / 0.8 = 0.25
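The whole worked example fits in a few lines of Python:

```python
# Priors and conditionals from the example.
p_girl, p_boy = 0.4, 0.6
p_trousers_given_girl = 0.5   # girls wear trousers or skirts in equal numbers
p_trousers_given_boy = 1.0    # all boys wear trousers

# Marginalization: p(trousers) = p(t|g)p(g) + p(t|b)p(b)
p_trousers = p_trousers_given_girl * p_girl + p_trousers_given_boy * p_boy

# Bayes' theorem: p(girl | trousers) = p(t|g)p(g) / p(t)
p_girl_given_trousers = p_trousers_given_girl * p_girl / p_trousers

print(p_trousers)             # ≈ 0.8
print(p_girl_given_trousers)  # ≈ 0.25
```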
Independence
How is independence useful?
• Suppose you have n coin flips and you want to calculate the joint distribution p(c_1, …, c_n).
• If the coin flips are not independent, you need a table with 2^n values.
• If the coin flips are independent, then
p(c_1, …, c_n) = ∏_{i=1}^n p(c_i).
Each p(c_i) table has 2 entries and there are n of them, for a total of 2n values.
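The saving is easy to see in code. A small sketch comparing the two table sizes and evaluating a factorized joint (the per-flip probabilities are made up for illustration, assuming fair coins):

```python
n = 10
print(2 ** n, 2 * n)  # 1024 values for the full table vs 20 for the factorized one

# Factorized joint under independence: p(c1,...,cn) = prod_i p(ci)
p = [{0: 0.5, 1: 0.5} for _ in range(n)]  # n fair coins (illustrative)

def joint(seq):
    prob = 1.0
    for i, c in enumerate(seq):
        prob *= p[i][c]
    return prob

print(joint((1,) * n))  # 0.5**10 = 0.0009765625
```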
Independence
Variables a and b are conditionally independent given c if any of the following hold:
• p(a, b | c) = p(a | c) p(b | c)
• p(a | b, c) = p(a | c)
• p(b | a, c) = p(b | c)
Knowing c tells me everything about b (i.e., I don't gain anything by also knowing a), either because a doesn't influence b or because knowing c provides all the information knowing a would give.
(Figure: graphical model in which c is the parent of both a and b.)
Continuous Variables
• Probability density function (PDF): p(x), with p(x) ≥ 0.
• Cumulative distribution function (CDF), the probability of x being below a threshold a:
F(a) = P(x < a) = ∫_{x = −∞}^{a} p(x) dx
• The PDF is the derivative of the CDF:
p(x) = dF(x)/dx
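These two relations can be checked numerically. A sketch using the exponential distribution (pdf p(x) = e^{−x} for x ≥ 0, CDF F(x) = 1 − e^{−x}), with a midpoint Riemann sum standing in for the integral and a finite difference for the derivative:

```python
import math

def pdf(x):
    # Exponential pdf on x >= 0
    return math.exp(-x)

def cdf(x):
    # Its closed-form CDF
    return 1.0 - math.exp(-x)

def cdf_numeric(a, steps=100_000):
    # F(a) = integral of p(x) dx from 0 to a (midpoint rule)
    h = a / steps
    return sum(pdf((k + 0.5) * h) for k in range(steps)) * h

def pdf_numeric(x, eps=1e-6):
    # p(x) = dF(x)/dx (central finite difference)
    return (cdf(x + eps) - cdf(x - eps)) / (2 * eps)

print(cdf(2.0), cdf_numeric(2.0))  # both ≈ 0.8647
print(pdf(2.0), pdf_numeric(2.0))  # both ≈ 0.1353
```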
Continuous Variables
• Mean operator (first-order moment):
E[x] = ∫ x p(x) dx
• Variance operator (second-order central moment):
E[(x − μ)²] = ∫ (x − μ)² p(x) dx
Popular PDFs
• Gaussian or normal distribution, written x ~ N(x | μ, σ²).
• PDF:
p(x) = (1 / (σ√(2π))) e^{−(x − μ)² / (2σ²)}
• Parameters: the mean μ and the variance σ².
• Moments: E[x] = μ and E[(x − μ)²] = σ².
(Figure: plots of the Gaussian CDF and PDF.)
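The Gaussian pdf is straightforward to implement, and its normalization and mean can be verified by numerical integration; a small sketch (the particular μ, σ values are illustrative):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # N(x | mu, sigma^2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 1.0, 2.0

# Numerically integrate p(x) and x p(x) over a wide interval (midpoint rule);
# +/- 10 sigma captures essentially all of the probability mass.
lo, hi, steps = mu - 10 * sigma, mu + 10 * sigma, 200_000
h = (hi - lo) / steps
xs = [lo + (k + 0.5) * h for k in range(steps)]
total = sum(gaussian_pdf(x, mu, sigma) for x in xs) * h       # ≈ 1 (normalization)
mean = sum(x * gaussian_pdf(x, mu, sigma) for x in xs) * h    # ≈ mu (first moment)

print(total, mean)
```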
Parameter Estimation with Gaussians
• What is estimation? Given a set of observations and a model, estimate the model's parameters.
• First example: given a population {x_1, x_2, x_3, …, x_N}, and assuming that the x_i are independent samples from a normal distribution x ~ N(x | μ, σ²), find an estimate for μ, σ².
• How do we approach the problem?
(1) We write the joint probability distribution (the likelihood):
p(x_1, x_2, x_3, …, x_N | μ, σ²) = ∏_{i=1}^N p(x_i | μ, σ²)
Parameter Estimation with Gaussians
(2) We substitute our distributional assumptions:
p(x_1, x_2, x_3, …, x_N | μ, σ²) = ∏_{i=1}^N (1 / (σ√(2π))) e^{−(x_i − μ)² / (2σ²)}
= (2π)^{−N/2} σ^{−N} e^{−Σ_{i=1}^N (x_i − μ)² / (2σ²)}
(3) A common practice is to take the log of the joint function:
log p(μ, σ²) = −(N/2) log 2π − N log σ − Σ_{i=1}^N (x_i − μ)² / (2σ²)
Parameter Estimation with Gaussians
(4) We take the maximum of the log-likelihood (ML):
μ₀, σ₀ = argmax_{μ,σ} log p(μ, σ²)
(5) We take the derivatives of log p with respect to μ, σ and set them to zero:
∂ log p / ∂μ = 0,   ∂ log p / ∂σ = 0
which gives
μ₀ = (1/N) Σ_{i=1}^N x_i,   σ₀² = (1/N) Σ_{i=1}^N (x_i − μ₀)²
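The closed-form ML estimates are just the sample mean and the (biased, divide-by-N) sample variance. A sketch on synthetic data (the "true" parameter values are made up for illustration):

```python
import random

random.seed(0)
true_mu, true_sigma = 3.0, 1.5
xs = [random.gauss(true_mu, true_sigma) for _ in range(100_000)]

N = len(xs)
mu0 = sum(xs) / N                            # ML estimate of the mean
var0 = sum((x - mu0) ** 2 for x in xs) / N   # ML estimate of the variance (divides by N, not N-1)

print(mu0, var0)  # close to 3.0 and 2.25
```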
ML Estimation: Linear Regression
The linear regression problem: fit a line f(x) = ax + b to observed data points (x_i, y_i).
(Figure: scatter plot of points (x_i, y_i) with the fitted line.)
ML Estimation: Linear Regression
Observations: D = {(x_1, y_1), …, (x_N, y_N)}
Model: y_i = f(x_i) + ε_i, with f(x) = ax + b and ε_i ~ N(ε_i | 0, σ²)
Parameters: a, b, σ
Methodology: maximum likelihood
ML Estimation: Linear Regression
(1) We write the joint probability distribution (likelihood):
p((x_1, y_1), (x_2, y_2), …, (x_N, y_N) | a, b, σ²) = ∏_{i=1}^N p(x_i, y_i | a, b, σ²)
= ∏_{i=1}^N p(y_i | a, b, σ², x_i) p(x_i)
so we need to compute the probability p(y_i | a, b, σ², x_i). We know that y_i = f(x_i) + ε_i with ε_i ~ N(ε_i | 0, σ²).
ML Estimation: Linear Regression
• Change of random variables. Assume a random variable ε with pdf p_ε(ε), and a second random variable y such that ε and y are related by y = g(ε). What is the pdf of y, p_y(y)?
|p_y(y) dy| = |p_ε(ε) dε|
⇒ p_y(y) = |dε/dy| p_ε(ε) = |d g⁻¹(y)/dy| p_ε(g⁻¹(y))
ML Estimation: Linear Regression
• Let's go back to our case:
p(ε_i) = N(ε_i | 0, σ²),   y_i = g(ε_i) = f(x_i) + ε_i
• We compute p(y_i) using the previous result. Since ε_i = y_i − f(x_i) (so |dε_i/dy_i| = 1):
p(y_i) = N(y_i − f(x_i) | 0, σ²)
or, putting back what is held constant, we write more correctly:
p(y_i | x_i, a, b, σ²) = N(y_i | f(x_i), σ²)
ML Estimation: Linear Regression
p(D | a, b, σ²) = ∏_{i=1}^N p(y_i | a, b, σ², x_i) p(x_i)
= c ∏_{i=1}^N (1 / √(2πσ²)) e^{−(y_i − f(x_i))² / (2σ²)}
= c (2π)^{−N/2} σ^{−N} e^{−Σ_{i=1}^N (y_i − f(x_i))² / (2σ²)}
where c = ∏_{i=1}^N p(x_i) is constant with respect to the parameters.
ML Estimation: Linear Regression
• Choosing to maximize the logarithm of the likelihood we get
a*, b* = argmax_{a,b} [ log c − (N/2) log 2π − N log σ − (1/(2σ²)) Σ_{i=1}^N (y_i − f(x_i))² ]
and removing the constant terms we get
a*, b* = argmin_{a,b} Σ_{i=1}^N (f(x_i) − y_i)²
ML Estimation: Linear Regression
a*, b* = argmin_{a,b} E(a, b),   where   E(a, b) = Σ_{i=1}^N (a x_i + b − y_i)²
By taking the partial derivatives with respect to a and b and setting them equal to zero we get:
∂E/∂b = 0 ⇒ Σ_{i=1}^N 2(a x_i + b − y_i) = 0 ⇒ Nb = −a Σ_{i=1}^N x_i + Σ_{i=1}^N y_i
⇒ b = ȳ − a x̄   (1)
where ȳ = (1/N) Σ_{i=1}^N y_i and x̄ = (1/N) Σ_{i=1}^N x_i.
ML Estimation: Linear Regression
∂E/∂a = 0 ⇒ Σ_{i=1}^N x_i (a x_i + b − y_i) = 0
⇒ Σ_{i=1}^N x_i y_i = a Σ_{i=1}^N x_i² + b Σ_{i=1}^N x_i   (2)
Putting (1) into (2) we get:
Σ_{i=1}^N x_i y_i − N x̄ ȳ = a (Σ_{i=1}^N x_i² − N x̄²)
⇒ a = (Σ_{i=1}^N x_i y_i − N x̄ ȳ) / (Σ_{i=1}^N x_i² − N x̄²)
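The closed-form solution translates directly into code. A sketch on synthetic noisy data (the true a, b, σ values are made up for illustration):

```python
import random

random.seed(0)
a_true, b_true, sigma = 2.0, -1.0, 0.5
xs = [random.uniform(0, 10) for _ in range(10_000)]
ys = [a_true * x + b_true + random.gauss(0, sigma) for x in xs]

N = len(xs)
x_bar = sum(xs) / N
y_bar = sum(ys) / N

# a = (sum x_i y_i - N x_bar y_bar) / (sum x_i^2 - N x_bar^2)
a = (sum(x * y for x, y in zip(xs, ys)) - N * x_bar * y_bar) / \
    (sum(x * x for x in xs) - N * x_bar ** 2)
# b = y_bar - a x_bar   (equation (1))
b = y_bar - a * x_bar

print(a, b)  # close to 2.0 and -1.0
```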