Primer on Probabilities and Statistics


Stefanos Zafeiriou, Adv. Statistical Machine Learning (495)

Primer on Probabilities

Probability and Statistics Primer
• Basic Concepts
• Maximum Likelihood Parameter Estimation

Reading:
• Many primers (check the internet), e.g., Chapters 1 and 2 of Pattern Recognition and Machine Learning by C. Bishop


A Probability Primer

• Consider an event where there is a degree of uncertainty in its outcome.

• Random variable: a function that maps events or outcomes to a number set (e.g., the integers or the reals).

• A value of a random variable thus refers to an event.


Frequentist Definition

• Frequency probability: probability is the limit of an outcome's relative frequency in a large number of trials,

$p(x) = \lim_{n_t \to \infty} \frac{n_x}{n_t}$, or approximately $p(x) \approx \frac{n_x}{n_t}$,

where $n_x$ is the number of trials in which $x$ occurred and $n_t$ is the total number of trials.

• It is the relative frequency with which an outcome would be obtained if the process were repeated a large number of times under exactly the same conditions.
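To make the definition concrete, here is a minimal sketch in Python (the coin bias of 0.3 and the trial counts are our arbitrary choices): the relative frequency $n_x / n_t$ approaches the true probability as the number of trials grows.

```python
import random

# A minimal sketch of the frequentist definition: estimate p(heads)
# of a biased coin by its relative frequency n_x / n_t.
# The bias 0.3 is an arbitrary illustrative choice.
true_p = 0.3
for n_t in (100, 10_000, 1_000_000):
    n_x = sum(random.random() < true_p for _ in range(n_t))
    print(f"n_t = {n_t:>9}: p(x) ~= n_x/n_t = {n_x / n_t:.4f}")
# As n_t grows, n_x/n_t converges to the true probability 0.3.
```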


Bayesian View

• Bayesian view: probability is a measure of belief regarding the predicted outcome of an event.

• It uses Bayes' theorem to develop a calculus for performing probabilistic reasoning.

Bayes' theorem:

$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$

or:

$p(x \mid y) = \frac{p(x, y)}{p(y)}, \qquad p(y \mid x) = \frac{p(x, y)}{p(x)}$

or:

$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$
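A quick numeric sanity check of these identities, assuming a made-up binary example (all numbers below are ours, not from the slides):

```python
# Pick p(x) and p(y|x), derive the rest, and verify Bayes' theorem.
p_x = 0.3
p_y_given_x = 0.8
p_y_given_not_x = 0.4

p_xy = p_y_given_x * p_x                        # p(x,y) = p(y|x) p(x)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)
p_x_given_y = p_y_given_x * p_x / p_y           # Bayes' theorem

assert abs(p_x_given_y - p_xy / p_y) < 1e-12    # p(x|y) = p(x,y)/p(y)
print(p_x_given_y)  # 0.24 / 0.52 ~= 0.4615
```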


Joint Probability Distribution

• Joint probabilities can be defined over any number of variables, e.g., $p(a, b, c)$.

• For every combination of variables we need to define how probable that combination is.

• The probabilities of all combinations need to sum up to 1.

• For 3 random variables each taking two values, the table contains $2^3 = 8$ entries:

a  b  c   p(a,b,c)
0  0  0   0.10
0  0  1   0.20
0  1  0   0.05
0  1  1   0.05
1  0  0   0.30
1  0  1   0.10
1  1  0   0.05
1  1  1   0.15

The entries sum up to 1; the last row, for example, gives $p(a = 1, b = 1, c = 1) = 0.15$.


Joint Probability Distribution (cont.)

• Given the joint probability distribution (the table above), you can calculate any probability involving $a$, $b$, and $c$.

• Note: this may require marginalization and Bayes' rule.

Examples of things you can compute (see the code sketch below):

• $p(a = 1) = \sum_{b, c} p(a = 1, b, c)$  (sum the rows with $a = 1$)

• $p(a = 1, b = 1 \mid c = 1) = p(a = 1, b = 1, c = 1)\, /\, p(c = 1)$  (Bayes' theorem)


Bayes Theorem: An example

• A school has 60% boys and 40% girls as its students.

• The female students wear trousers or skirts in equal numbers.

• All boys wear trousers.

An observer sees a (random) student from a distance, and all the observer can see is that this student is wearing trousers. What is the probability that this student is a girl? The correct answer can be computed using Bayes' theorem.


Bayes Theorem: An example (cont.)

The requested probability is translated as follows. Let $s$ denote the sex of the student ($s = 1$: girl, $s = 0$: boy) and $x$ the clothing ($x = 0$: trousers). We want $p(s = 1 \mid x = 0)$.

• Bayes' theorem:

$p(s = 1 \mid x = 0) = \frac{p(x = 0 \mid s = 1)\, p(s = 1)}{p(x = 0)}$

• What do we know? $p(s = 1) = 0.4$ and $p(s = 0) = 0.6$ (the student proportions), $p(x = 0 \mid s = 1) = 0.5$ (girls wear trousers or skirts in equal numbers), and $p(x = 0 \mid s = 0) = 1$ (all boys wear trousers).

• What are we missing? The denominator $p(x = 0)$.

• How can we find it? By marginalization:

$p(x = 0) = p(x = 0, s = 1) + p(x = 0, s = 0)$

• And Bayes again to find each joint term:

$p(x = 0, s = 1) = p(x = 0 \mid s = 1)\, p(s = 1) = 0.5 \times 0.4 = 0.2$

$p(x = 0, s = 0) = p(x = 0 \mid s = 0)\, p(s = 0) = 1 \times 0.6 = 0.6$

• Hence $p(x = 0) = 0.2 + 0.6 = 0.8$ and

$p(s = 1 \mid x = 0) = \frac{0.2}{0.8} = 0.25$


Independence

How is independence useful?

• Suppose you have $n$ coin flips and you want to calculate the joint distribution $p(c_1, \ldots, c_n)$.

• If the coin flips are not independent, you need $2^n$ values in the table.

• If the coin flips are independent, then

$p(c_1, \ldots, c_n) = \prod_{i=1}^{n} p(c_i)$

Each $p(c_i)$ table has 2 entries and there are $n$ of them, for a total of $2n$ values.
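A sketch of this saving, assuming arbitrary per-coin biases: the joint over all $2^n$ outcomes is recovered from $n$ two-entry tables, and it still sums to 1.

```python
from itertools import product

# For n independent binary flips we keep n two-entry tables (2n numbers)
# instead of the full 2**n-entry joint table.
n = 4
p_heads = [0.5, 0.3, 0.7, 0.9]  # arbitrary per-coin biases (illustrative)

def p_flip(i, c):
    # Two-entry table for coin i: p(c_i = 1) and p(c_i = 0).
    return p_heads[i] if c == 1 else 1.0 - p_heads[i]

def p_joint(flips):
    # Any joint entry is a product of marginals: prod_i p(c_i).
    result = 1.0
    for i, c in enumerate(flips):
        result *= p_flip(i, c)
    return result

# The implied joint over all 2**n outcomes still sums to 1.
total = sum(p_joint(flips) for flips in product((0, 1), repeat=n))
print(total)  # 1.0 (up to floating point)
```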


Conditional Independence

Variables $a$ and $b$ are conditionally independent given $c$ if any of the following hold:

• $p(a, b \mid c) = p(a \mid c)\, p(b \mid c)$

• $p(a \mid b, c) = p(a \mid c)$

• $p(b \mid a, c) = p(b \mid c)$

Knowing $c$ tells me everything about $b$ (i.e., I don't gain anything by also knowing $a$), either because $a$ doesn't influence $b$ or because knowing $c$ provides all the information that knowing $a$ would give.

(Diagram: a graphical model with node $c$ pointing to nodes $a$ and $b$.)


Continuous Variables

• Probability density function (PDF): $p(x) \ge 0$

• Probability of $x < b$: the cumulative distribution function (CDF)

$F(b) = p(x < b) = \int_{-\infty}^{b} p(x)\, dx$

• The PDF is the derivative of the CDF:

$p(x) = \frac{dF(x)}{dx}$


Continuous Variables (cont.)

• Mean operator (first-order moment):

$E(x) = \int_x x\, p(x)\, dx$

• Variance operator (second-order central moment):

$E\big((x - \mu)^2\big) = \int_x (x - \mu)^2\, p(x)\, dx$


Popular PDFs

• Gaussian or normal distribution: $x \sim N(x \mid \mu, \sigma^2)$

$p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$

• Parameters: the mean $\mu$ and the variance $\sigma^2$:

$E(x) = \mu, \qquad E\big((x - \mu)^2\big) = \sigma^2$

(Figure: plots of the Gaussian PDF and CDF.)


Parameter Estimation with Gaussians

• What is estimation? Given a set of observations and a model, estimate the model's parameters.

• First example: given a population $\{x_1, x_2, x_3, \ldots, x_N\}$, assuming the $x_i$ are independent samples from a normal distribution $x \sim N(x \mid \mu, \sigma^2)$, find an estimate for $\mu, \sigma^2$.

• How do we approach the problem?

(1) We write the joint probability distribution (likelihood):

$p(x_1, x_2, x_3, \ldots, x_N \mid \mu, \sigma^2) = \prod_{i=1}^{N} p(x_i \mid \mu, \sigma^2)$


Parameter Estimation with Gaussians (cont.)

(2) We substitute our distributional assumptions:

$p(x_1, \ldots, x_N \mid \mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x_i - \mu)^2 / 2\sigma^2} = (2\pi)^{-N/2}\, \sigma^{-N}\, e^{-\sum_{i=1}^{N} (x_i - \mu)^2 / 2\sigma^2}$

(3) A common practice is to take the log of the likelihood:

$\log p(\mu, \sigma^2) = -\frac{N}{2} \log 2\pi - N \log \sigma - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}$


Parameter Estimation with Gaussians (cont.)

(4) We take the maximum of the log-likelihood (ML):

$\mu_0, \sigma_0 = \arg\max_{\mu, \sigma} \log p(\mu, \sigma^2)$

(5) We take the derivatives of $\log p$ with respect to $\mu, \sigma^2$ and set them to zero:

$\frac{d \log p}{d\mu} = 0, \qquad \frac{d \log p}{d\sigma} = 0$

which gives

$\mu_0 = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma_0^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_0)^2$
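A sketch of the closed-form estimates, checked on synthetic data (the true parameters 2 and 3 are our arbitrary choices):

```python
import random

# ML estimates of a Gaussian's mean and variance on synthetic data
# drawn from N(2, 3^2).
random.seed(0)
xs = [random.gauss(2.0, 3.0) for _ in range(100_000)]
N = len(xs)

mu0 = sum(xs) / N                            # ML mean
var0 = sum((x - mu0) ** 2 for x in xs) / N   # ML variance (divides by N)
print(mu0, var0)  # ~= 2.0 and ~= 9.0
# Note: the ML variance divides by N, not N - 1, so it is slightly biased.
```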


ML Estimation: Linear Regression

The linear regression problem: fit a line $f(x) = ax + b$ to data points $(x_i, y_i)$.

(Figure: scatter plot of points $(x_i, y_i)$ with a fitted line.)


ML Estimation: Linear Regression (cont.)

• Observations: $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

• Model: $y_i = f(x_i) + e_i$ with $f(x) = ax + b$ and $e_i \sim N(e_i \mid 0, \sigma^2)$

• Parameters: $a, b, \sigma$

• Methodology: maximum likelihood


ML Estimation: Linear Regression (cont.)

(1) We write the joint probability distribution (likelihood):

$p((x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \mid f, \sigma^2) = \prod_{i=1}^{N} p(x_i, y_i \mid f, \sigma^2) = \prod_{i=1}^{N} p(y_i \mid f, \sigma^2, x_i)\, p(x_i)$

We need to compute the probability $p(y_i \mid f, \sigma^2, x_i)$. We know that $y_i = f(x_i) + e_i$ with $e_i \sim N(e_i \mid 0, \sigma^2)$.


ML Estimation: Linear Regression (cont.)

• Change of random variables. Assume a random variable $a$ with pdf $p_a$, and a second random variable $b$ such that $a, b$ are related by $b = g(a)$. What is the pdf $p_b$ of $b$?

$|p_b(b)\, db| = |p_a(a)\, da|$

$p_b(b) = \left| \frac{da}{db} \right| p_a(a)$

$p_b(b) = \left| \frac{d g^{-1}(b)}{db} \right| p_a\big(g^{-1}(b)\big)$
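A Monte Carlo check of the formula, assuming the illustrative map $b = g(a) = 2a + 3$ with $a \sim N(0, 1)$ (all constants are ours):

```python
import math
import random

# For b = g(a) = 2a + 3 we have g^{-1}(b) = (b - 3)/2 and |dg^{-1}/db| = 1/2,
# so the formula predicts p_b(b) = 0.5 * p_a((b - 3) / 2).
def p_a(a):
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def p_b(b):
    return 0.5 * p_a((b - 3.0) / 2.0)

# Estimate the density of b near b0 from samples and compare.
random.seed(0)
samples = [2.0 * random.gauss(0.0, 1.0) + 3.0 for _ in range(1_000_000)]
b0, h = 3.5, 0.05
density_est = sum(abs(s - b0) < h for s in samples) / (len(samples) * 2 * h)
print(density_est, p_b(b0))  # both ~= 0.193
```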


ML Estimation: Linear Regression (cont.)

• Let's go back to our case:

$p(e_i) = N(e_i \mid 0, \sigma^2), \qquad y_i = g(e_i) = f(x_i) + e_i, \qquad e_i = y_i - f(x_i)$

Using the change of variables above (here $|de_i / dy_i| = 1$), we compute $p(y_i)$:

$p(y_i) = N(y_i - f(x_i) \mid 0, \sigma^2)$

or, putting back what is held constant, we write more correctly:

$p(y_i \mid x_i, f, \sigma^2) = N(y_i \mid f(x_i), \sigma^2)$


ML Estimation: Linear Regression (cont.)

$p(D \mid f, \sigma^2) = \prod_{i=1}^{N} p(y_i \mid f, \sigma^2, x_i)\, p(x_i) = c \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - f(x_i))^2}$

where $c = \prod_{i=1}^{N} p(x_i)$ is a constant (it does not depend on $f$ or $\sigma$). Hence

$p(D \mid f, \sigma^2) = c\, (2\pi)^{-N/2}\, \sigma^{-N}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f(x_i))^2}$


ML Estimation: Linear Regression (cont.)

• Choosing to maximize its logarithm, we get

$f^* = \arg\max_f \left[ \log c - \frac{N}{2} \log 2\pi - N \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f(x_i))^2 \right]$

Removing the terms that are constant with respect to $f$, we get

$f^* = \arg\min_f \sum_{i=1}^{N} (f(x_i) - y_i)^2$

i.e., under Gaussian noise, maximum likelihood estimation of $f$ reduces to least squares.


ML Estimation: Linear Regression (cont.)

$f^* = \arg\min_{a, b} \sum_{i=1}^{N} (a x_i + b - y_i)^2, \qquad g(a, b) = \sum_{i=1}^{N} (a x_i + b - y_i)^2$

By taking the partial derivatives with respect to $a$ and $b$ and setting them equal to zero, we get:

$\frac{\partial g}{\partial b} = 0 \;\Rightarrow\; 2 \sum_{i=1}^{N} (a x_i + b - y_i) = 0 \;\Rightarrow\; N b = -a \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i \;\Rightarrow\; b = \bar{y} - a \bar{x} \quad (1)$

where $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$ and $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$.


ML Estimation: Linear Regression (cont.)

$\frac{\partial g}{\partial a} = 0 \;\Rightarrow\; \sum_{i=1}^{N} x_i (a x_i + b - y_i) = 0 \;\Rightarrow\; \sum_{i=1}^{N} x_i y_i = a \sum_{i=1}^{N} x_i^2 + b \sum_{i=1}^{N} x_i \quad (2)$

Putting (1) into (2) we get:

$\sum_{i=1}^{N} x_i y_i - N \bar{x} \bar{y} = a \left( \sum_{i=1}^{N} x_i^2 - N \bar{x}^2 \right) \;\Rightarrow\; a = \frac{\sum_{i=1}^{N} x_i y_i - N \bar{x} \bar{y}}{\sum_{i=1}^{N} x_i^2 - N \bar{x}^2}$
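Putting (1) and (2) together in code, on synthetic data generated from $y = 2x + 1$ plus Gaussian noise (the true line and noise level are our arbitrary choices):

```python
import random

# Closed-form ML (least-squares) fit of f(x) = a x + b.
random.seed(0)
N = 1000
xs = [random.uniform(0.0, 10.0) for _ in range(N)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]

x_bar = sum(xs) / N
y_bar = sum(ys) / N

# a = (sum x_i y_i - N x_bar y_bar) / (sum x_i^2 - N x_bar^2);  b = y_bar - a x_bar
a = (sum(x * y for x, y in zip(xs, ys)) - N * x_bar * y_bar) / \
    (sum(x * x for x in xs) - N * x_bar ** 2)
b = y_bar - a * x_bar
print(a, b)  # ~= 2.0 and ~= 1.0
```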