CE-717: Machine Learning Sharif University of Technology
M. Soleymani
Review (Probability & Linear Algebra)
Outline
Axioms of probability theory
Joint probability, conditional probability, Bayes' theorem
Discrete and continuous random variables
Probability mass and density functions
Expected value, variance, standard deviation
Expectation for two variables
Covariance, correlation
Some probability distributions
Gaussian distribution
Linear Algebra
Basic Probability Elements
Sample space (Ω): set of all possible outcomes (or worlds)
Outcomes are assumed to be mutually exclusive.
An event A is a certain set of outcomes (i.e., a subset of Ω).
A random variable is a function defined over the sample space
Gender: Students → {male, female}
Height: Students → ℝ+
Probability Space
A probability space is defined as a triple (Ω, F, P):
A sample space Ω that contains the set of all possible outcomes (outcomes are also called states of nature)
A set F whose elements are called events. The events are subsets of Ω. F should be a "Borel field".
P represents the probability measure that assigns probabilities to events.
Probability Axioms (Kolmogorov)
Axioms define a reasonable theory of uncertainty
Kolmogorov's probability axioms:
P(A) ≥ 0  (∀A ⊆ Ω)
P(Ω) = 1
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)  (∀A, B ⊆ Ω)
[Venn diagram: events A and B inside Ω, overlapping in A ∩ B]
Random Variables
Random variables: variables in probability theory
Domain of random variables: Boolean, discrete, or continuous
Probability distribution: the function describing probabilities of possible values of a random variable
P(Dice = 1) = 1/6, P(Dice = 2) = 1/6, …
Random Variables
A random variable is a function that maps every outcome in Ω to a real (complex) number.
To define probabilities easily as functions defined on (real) numbers
To compute expectation, variance, …
[Figure: a coin toss mapped to the real line, Tail ↦ 0 and Head ↦ 1]
Base Definitions
Joint probability distribution
The rules of probability (sum and product rules)
Bayes' theorem
Independence
New evidence may be irrelevant
Joint Probability Distribution
Probability of all combinations of the values for a set of random variables
If two or more random variables are considered together, they can be described in terms of their joint probability
Example: joint probability of features P(x_1, x_2, …, x_d)
Two Fundamental Rules
Sum rule:
P(X) = Σ_Y P(X, Y)
Product rule:
P(X, Y) = P(Y|X) P(X)
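The two rules can be sketched on a toy joint distribution; the weather/umbrella table below is an illustrative assumption, not from the slides:

```python
# Toy joint distribution P(X, Y) as a dict; the events and numbers
# here are made up for illustration.
joint = {
    ("rain", "umbrella"): 0.3,
    ("rain", "no_umbrella"): 0.1,
    ("sun", "umbrella"): 0.1,
    ("sun", "no_umbrella"): 0.5,
}

def marginal_x(x):
    # Sum rule: P(X = x) = sum over y of P(X = x, Y = y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional_y_given_x(y, x):
    # Product rule rearranged: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
    return joint[(x, y)] / marginal_x(x)
```

This recovers P(rain) = 0.4 and P(umbrella|rain) = 0.75 (up to floating point).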
Chain Rule
The chain rule is derived by successive application of the product rule:
P(X_1, …, X_n) = P(X_1, …, X_{n−1}) P(X_n|X_1, …, X_{n−1})
              = P(X_1, …, X_{n−2}) P(X_{n−1}|X_1, …, X_{n−2}) P(X_n|X_1, …, X_{n−1})
              = …
              = P(X_1) Π_{i=2}^{n} P(X_i|X_1, …, X_{i−1})
Sum Rule: Example
[Bishop, Section 1.2]
Conditional Probability
P(Y|X) = P(X, Y) / P(X)   if P(X) > 0
Obtained from the product rule
P(Y|X) obeys the same rules as probabilities:
Σ_Y P(Y|X) = 1
Conditional Probability
For statistically dependent variables, knowing the value of one variable may allow us to better estimate the other.
All probabilities in effect are conditional probabilities
E.g., P(A) = P(A | our background knowledge)
P(·|B) renormalizes the probability of events that jointly occur with B
[Figure: conditioning on B restricts the sample space Ω to the event B]
Conditional Probability: Example
Rolling a fair die:
A: the outcome is an even number
B: the outcome is a prime number
P(A|B) = P(A, B) / P(B) = (1/6) / (1/2) = 1/3
Meningitis (M) & stiff neck (S):
P(S = 1, M = 1) = 1/5000
P(S = 1) = 1/2000
P(M = 1|S = 1) = P(S = 1, M = 1) / P(S = 1) = 0.4
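The die half of this example can be verified by brute-force enumeration; a minimal sketch using exact rational arithmetic:

```python
from fractions import Fraction

# Fair die: each outcome in {1, ..., 6} has probability 1/6
A = {2, 4, 6}   # even outcomes
B = {2, 3, 5}   # prime outcomes

p_B = Fraction(len(B), 6)        # P(B) = 1/2
p_AB = Fraction(len(A & B), 6)   # P(A, B) = 1/6 (only outcome 2)
p_A_given_B = p_AB / p_B         # P(A|B) = 1/3
```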
Conditional Probability: Another Example
[Bishop, Section 1.2]
Bayes Theorem
P(Y|X) = P(X|Y) P(Y) / P(X)
Obtained from the product rule and the symmetry property:
P(X, Y) = P(Y, X)
P(X, Y) = P(Y|X) P(X) = P(X|Y) P(Y)
In some problems, it may be difficult to compute P(Y|X) directly, yet we might have information about P(X|Y).
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
Bayes Theorem
Often it is useful to expand the denominator using the sum and product rules:
P(Y|X) = P(X|Y) P(Y) / P(X) = P(X|Y) P(Y) / Σ_Y P(X|Y) P(Y)
Bayes Theorem: Example
Meningitis (M) & stiff neck (S):
P(M = 1) = 1/5000
P(S = 1) = 0.01
P(S = 1|M = 1) = 0.7
P(M = 1|S = 1) = ?
P(M = 1|S = 1) = P(S = 1|M = 1) P(M = 1) / P(S = 1) = 0.7 × 0.0002 / 0.01 = 0.014
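Plugging the example's numbers into Bayes' theorem (a sketch; the variable names are mine):

```python
# P(M=1|S=1) = P(S=1|M=1) * P(M=1) / P(S=1)
p_m = 1 / 5000          # prior P(M = 1) = 0.0002
p_s = 0.01              # evidence P(S = 1)
p_s_given_m = 0.7       # likelihood P(S = 1 | M = 1)

p_m_given_s = p_s_given_m * p_m / p_s   # posterior
```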
Prior and Posterior Probabilities
Prior or unconditional probabilities: belief in the absence of any other evidence
e.g., P(S = 1) = 0.01
Posterior or conditional probabilities: belief in the presence of evidence
e.g., P(S = 1|M = 1) = 0.7
Independence of Random Variables
X and Y are independent iff:
P(X|Y) = P(X)
P(Y|X) = P(Y)
P(X, Y) = P(X) P(Y)
Knowing Y tells us nothing about X (and vice versa)
Probability Mass Function (PMF)
The Probability Mass Function (PMF) shows the probability of each value of a discrete random variable
Each impulse magnitude is equal to the probability of the corresponding outcome
P(X = x) ≥ 0
Σ_{x∈X} P(X = x) = 1
Example: PMF of a fair die
[Figure: impulses of height 1/6 at x = 1, …, 6]
Probability Density Function (PDF)
The Probability Density Function (PDF) is defined for continuous random variables
The probability of x ∈ (x_0, x_0 + δx) is p(x_0) × δx (for δx → 0)
p(x): probability density over x
p(x) ≥ 0
∫ p(x) dx = 1
[Figure: a density p(x) with the value p(x_0) marked at x_0]
Cumulative Distribution Function (CDF)
The Cumulative Distribution Function (CDF) is defined as the integral of the PDF:
CDF(x) = F(x) = ∫_{−∞}^{x} p(t) dt
Similarly defined for discrete variables (summation instead of integration)
Non-decreasing
Right continuous
F(−∞) = 0
F(∞) = 1
dF(x)/dx = p(x)
P(u ≤ x ≤ v) = F(v) − F(u)
[Figure: a PDF p(x) and its CDF F(x)]
Distribution Statistics
Basic descriptors of probability distributions:
Mean value
Variance & standard deviation
Moments
Covariance & correlation
Expected Value
Expected (or mean) value: weighted average of all possible values of the random variable
Expectation of a discrete random variable X:
E[x] = Σ_x x P(x)
Expectation of a function of a discrete random variable X:
E[f(x)] = Σ_x f(x) P(x)
Expected value of a continuous random variable X:
E[x] = ∫ x p(x) dx
Expectation of a function of a continuous random variable X:
E[f(x)] = ∫ f(x) p(x) dx
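For a fair die these sums are easy to carry out exactly; a sketch with rational arithmetic (E[x^2] is an instance of E[f(x)] with f(x) = x^2):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die

mean = sum(x * p for x, p in pmf.items())              # E[x] = 7/2
second_moment = sum(x**2 * p for x, p in pmf.items())  # E[x^2] = 91/6
```

These two quantities combine into the variance of the next slides: E[x^2] − E[x]^2 = 91/6 − 49/4 = 35/12.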
Expected Value
For the expectation of a function of several variables, a subscript is used to specify the variable being averaged over.
Examples:
E_x[f(x, y)] = Σ_x p(x) f(x, y)
E_{x|y}[f(x, y)] = Σ_x p(x|y) f(x, y)
Other notation: E_x[f(x, y)|y]
E_{x,y}[f(x, y)] = Σ_x Σ_y p(x, y) f(x, y)
Variance
Variance: a measure of how far values of a random variable are spread out around its expected value
Var[x] = E[(x − E[x])^2] = E[x^2] − E[x]^2
Standard deviation: the square root of the variance:
σ_x = √Var[x]
Moments
n-th order moment of a random variable X:
E[x^n]
Normalized (central) n-th order moment:
E[(x − E[x])^n]
The first-order moment is the mean value.
The second-order moment is the variance plus the square of the mean: E[x^2] = Var[x] + E[x]^2.
Correlation & Covariance
Correlation:
Corr[x, y] = E_{x,y}[xy]
Covariance is the correlation of the mean-removed variables:
Cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])]
For discrete RVs:
E_{x,y}[xy] = Σ_x Σ_y xy P(x, y)
Covariance: Example
[Figure: two scatter plots of (x, y) samples; left: Cov[x, y] = 0; right: Cov[x, y] = 0.9]
Covariance Properties
The covariance value shows the tendency of a pair of RVs to increase together:
Cov_xy > 0: x and y tend to increase together
Cov_xy < 0: x tends to decrease when y increases
Cov_xy = 0: no linear correlation between x and y
Pearsonโs Product Moment Correlation
ρ_XY = Cov(X, Y) / (σ_X σ_Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)
Defined only if both σ_X and σ_Y are finite and nonzero.
ρ_XY shows the degree of linear dependence between X and Y:
−1 ≤ ρ_XY ≤ 1
By the Cauchy–Schwarz inequality, E[(X − μ_X)(Y − μ_Y)]^2 ≤ E[(X − μ_X)^2] E[(Y − μ_Y)^2], i.e., |Cov| ≤ σ_X σ_Y
ρ_XY = 1 shows a perfect positive linear relationship and ρ_XY = −1 shows a perfect negative linear relationship
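A sketch of ρ computed from samples (the data points are an arbitrary illustration; for an exactly linear relation ρ should come out as 1):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # y = 2x: a perfect positive linear relation

def mean(v):
    return sum(v) / len(v)

mx, my = mean(xs), mean(ys)
cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
sx = math.sqrt(mean([(x - mx) ** 2 for x in xs]))
sy = math.sqrt(mean([(y - my) ** 2 for y in ys]))
rho = cov / (sx * sy)   # Pearson correlation; here 1
```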
Pearsonโs Correlation: Examples
ρ_XY = Cov(X, Y) / (σ_X σ_Y)
[Wikipedia]
Orthogonal, Uncorrelated & Independent RVs
Orthogonal random variables (E[xy] = 0):
Corr[x, y] = 0
Uncorrelated random variables (E[(x − E[x])(y − E[y])] = 0):
Cov[x, y] = 0
Independent random variables ⟹ Cov[x, y] = 0
Cov[x, y] = 0 does not imply independent random variables
Covariance Matrix
If x is a vector of random variables (d-dimensional random vector):
μ_x = E[x] = [E(x_1), …, E(x_d)]^T
Σ = E[(x − μ_x)(x − μ_x)^T]
  = [ E((x_1 − μ_1)(x_1 − μ_1))  ⋯  E((x_1 − μ_1)(x_d − μ_d))
      ⋮                          ⋱  ⋮
      E((x_d − μ_d)(x_1 − μ_1))  ⋯  E((x_d − μ_d)(x_d − μ_d)) ]
The covariance matrix indicates the tendency of each pair of RVs to vary together.
Covariance Matrix: Two Variables
Σ = C = [[σ_1^2, σ_12], [σ_21, σ_2^2]],  where σ_21 = σ_12 = Cov(X, Y)
Examples: Σ = [[1, 0], [0, 1]] and Σ = [[1, 0.9], [0.9, 1]]
[Figure: scatter plots of samples drawn with each covariance matrix]
Covariance Matrix
σ_ij shows the covariance of x_i and x_j:
σ_ij = σ_ji = Cov(x_i, x_j)
Σ = [ σ_1^2  σ_12   ⋯  σ_1d
      σ_21   σ_2^2  ⋯  σ_2d
      ⋮      ⋮      ⋱  ⋮
      σ_d1   σ_d2   ⋯  σ_d^2 ]
Sums of Random Variables
z = x + y
Mean: E[z] = E[x] + E[y]
Variance: Var(z) = Var(x) + Var(y) + 2 Cov(x, y)
If x, y independent: Var(z) = Var(x) + Var(y)
Distribution:
p_Z(z) = ∫ p_{X,Y}(x, z − x) dx
If x, y independent: p_Z(z) = ∫ p_X(x) p_Y(z − x) dx  (a convolution)
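For independent discrete variables the integral becomes a sum; a sketch of the convolution for the sum of two fair dice:

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}   # fair-die PMF

# p_Z(z) = sum_x p_X(x) p_Y(z - x): a discrete convolution
pz = {}
for x, px in die.items():
    for y, py in die.items():
        pz[x + y] = pz.get(x + y, Fraction(0)) + px * py
```

The total z = 7 comes out most likely with probability 1/6, and the probabilities still sum to 1.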
Some Famous Probability Density Functions
Uniform:
p(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise   (x ~ U(a, b))
[Figure: uniform density of height 1/(b − a) on [a, b]]
Gaussian (Normal):
p(x) = (1/(√(2π) σ)) e^{−(x − μ)^2 / (2σ^2)} = N(μ, σ^2)   (x ~ N(μ, σ^2))
[Figure: bell-shaped density with peak 1/(√(2π) σ) at x = μ]
Gaussian (Normal) Distribution
It is widely used to model the distribution of continuous variables
68% of the probability mass lies within [μ − σ, μ + σ]
95% lies within [μ − 2σ, μ + 2σ]
Standard normal distribution: μ = 0, σ = 1
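The 68%/95% figures can be checked numerically from the Gaussian CDF, which is expressible with the error function (a sketch using math.erf):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF of N(mu, sigma^2) via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

within_1_sigma = normal_cdf(1) - normal_cdf(-1)   # ~0.683
within_2_sigma = normal_cdf(2) - normal_cdf(-2)   # ~0.954
```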
Some Famous Probability Density Functions
Exponential:
p(x) = λ e^{−λx} u(x) = λ e^{−λx} for x ≥ 0, 0 for x < 0
[Figure: exponential density, starting at λ and decaying for x ≥ 0]
Some Famous Probability Mass Functions
Bernoulli: x ∈ {0, 1}
P(x) = p^x (1 − p)^{1−x}
Binomial:
P(X = k) = C(n, k) p^k (1 − p)^{n−k}   (x ~ Binomial(n, p))
[Figure: PMFs of the Bernoulli and binomial distributions]
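A sketch of the binomial PMF (math.comb gives C(n, k)); summing over k confirms it is a valid PMF:

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
total = sum(binom_pmf(k, n, p) for k in range(n + 1))   # should be 1
```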
Multivariate Gaussian Distribution
x is a vector of d Gaussian variables:
p(x) ~ N(μ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)^T Σ^{−1} (x − μ))
μ = E[x] = [E(x_1), …, E(x_d)]^T
Σ = E[(x − μ)(x − μ)^T]
Multivariate Gaussian Distribution
The covariance matrix is always symmetric and positive semi-definite.
The multivariate Gaussian is completely specified by d + d(d + 1)/2 parameters.
Special cases of Σ:
Σ = σ^2 I: independent random variables with the same variance (circularly symmetric Gaussian)
Diagonal Σ = diag(σ_1^2, …, σ_d^2): independent random variables with different variances
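For the diagonal special case the density factorizes into a product of univariate Gaussians, which is easy to verify numerically (a sketch; the test point and variances are arbitrary):

```python
import math

def mvn_pdf_diag(x, mu, var):
    # N(mu, Sigma) density with diagonal Sigma = diag(var):
    # |Sigma| is the product of the variances and the quadratic form decouples
    d = len(x)
    det = 1.0
    quad = 0.0
    for xi, mi, vi in zip(x, mu, var):
        det *= vi
        quad += (xi - mi) ** 2 / vi
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (d / 2) * math.sqrt(det))

def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

joint = mvn_pdf_diag([0.5, -1.0], [0.0, 0.0], [1.0, 4.0])
product = normal_pdf(0.5, 0.0, 1.0) * normal_pdf(-1.0, 0.0, 4.0)  # same value
```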
Multivariate Gaussian Distribution
Level Surfaces
The Gaussian distribution is constant on surfaces in x-space for which:
(x − μ)^T Σ^{−1} (x − μ) = C
Principal axes of the hyper-ellipsoids are the eigenvectors of Σ.
Bivariate Gaussian: curves of constant density are ellipses.
λ_1 and λ_2 are the eigenvalues of Σ (λ_1 ≥ λ_2) and u_1 and u_2 are the corresponding eigenvectors
[Figure: constant-density ellipses with axes along u_1 and u_2, scaled by √λ_1 and √λ_2]
Gaussian Distribution Properties
Some attractive properties of the Gaussian distribution:
Marginal and conditional distributions of a Gaussian are also Gaussian
After a linear transformation, a Gaussian distribution is again Gaussian
There exists a linear transformation that diagonalizes the covariance matrix (whitening transform); it converts the multivariate normal distribution into a spherical one.
Gaussian is the distribution that maximizes the entropy (for a fixed mean and variance)
Gaussian is stable and infinitely divisible
Central Limit Theorem
Some distributions can be approximated by a Gaussian when their parameter value is sufficiently large (e.g., binomial)
Central Limit Theorem
(under mild conditions)
Suppose i.i.d. (independent, identically distributed) RVs X_i (i = 1, …, N) with finite variances.
Let S_N = Σ_{i=1}^{N} X_i be the sum of these RVs.
The distribution of S_N converges to a normal distribution as N increases, regardless of the distribution of the RVs.
Example:
X_i ~ uniform, i = 1, …, N;  S_N = (1/N) Σ_{i=1}^{N} X_i
[Figure: histograms of S_N for N = 1, 2, 10]
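A quick simulation of the theorem (a sketch; the sample size, trial count, and seed are arbitrary choices): means of n uniform(0, 1) draws should concentrate around 0.5 with variance close to 1/(12n).

```python
import random
import statistics

random.seed(0)           # for reproducibility
n, trials = 30, 2000

# Each entry is the mean of n i.i.d. uniform(0,1) samples; by the CLT
# these means are approximately N(0.5, 1/(12 n))
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

avg = statistics.fmean(means)
var = statistics.pvariance(means)
```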
Linear Algebra: Basic Definitions
Matrix A:
A = [a_ij]_{m×n} = [ a_11  a_12  …  a_1n
                     a_21  a_22  …  a_2n
                     …     …     …  …
                     a_m1  a_m2  …  a_mn ]
Matrix transpose:
B = A^T  ⟺  b_ij = a_ji  (1 ≤ i ≤ n, 1 ≤ j ≤ m)
Symmetric matrix: A = A^T
Vector a:
a = [a_1, …, a_n]^T
Linear Mapping
Linear function:
f(u + v) = f(u) + f(v)  ∀u, v ∈ V
f(αu) = α f(u)  ∀u ∈ V, α ∈ F
A linear function: f(x) = w_1 x_1 + ⋯ + w_d x_d = w^T x
In general, a matrix W_{m×d} can be used to denote a map f: ℝ^d → ℝ^m where f_i(x) = w_{i1} x_1 + ⋯ + w_{id} x_d = w_i^T x
Linear Algebra: Basic Definitions
Inner (dot) product:
⟨a, b⟩ = a^T b = Σ_{i=1}^{n} a_i b_i
Matrix multiplication:
A = [a_ij]_{m×p},  B = [b_ij]_{p×n}
AB = C = [c_ij]_{m×n},  c_ij = a_i^T b_j  (a_i: i-th row of A; b_j: j-th column of B)
Inner Product
Inner (dot) product:
a^T b = Σ_{i=1}^{n} a_i b_i
Length (Euclidean norm) of a vector:
‖a‖_2 = √(a^T a) = √(Σ_{i=1}^{n} a_i^2)
a is normalized iff ‖a‖_2 = 1
Angle between vectors:
cos θ = a^T b / (‖a‖_2 ‖b‖_2)
Orthogonal vectors a and b:
a^T b = 0
Orthonormal set of vectors v_1, v_2, …, v_n:
∀i, j:  v_i^T v_j = 1 if i = j, 0 otherwise
Linear Independence
A set of vectors is linearly independent if no vector is a linear combination of the others:
c_1 v_1 + c_2 v_2 + ⋯ + c_n v_n = 0  ⟹  c_1 = c_2 = ⋯ = c_n = 0
Matrix Determinant and Trace
Determinant:
det(A) = Σ_{j=1}^{n} a_ij A_ij  for any i = 1, …, n
where A_ij = (−1)^{i+j} det(M_ij) is the cofactor of a_ij (M_ij: the minor of a_ij)
det(AB) = det(A) × det(B)
Trace:
tr[A] = Σ_{j=1}^{n} a_jj
Matrix Inversion
Inverse of A_{n×n}:
AB = BA = I_n  ⟹  B = A^{−1}
A^{−1} exists iff det(A) ≠ 0 (A is nonsingular)
Singular: det(A) = 0
Ill-conditioned: A is nonsingular but close to being singular
Pseudo-inverse of a non-square matrix: A# = (A^T A)^{−1} A^T
(when A^T A is not singular)
A# A = I
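A sketch of the pseudo-inverse on a small full-column-rank matrix, in exact arithmetic; the helper functions here are hand-rolled illustrations, not a library API:

```python
from fractions import Fraction as F

A = [[F(1), F(0)],
     [F(0), F(1)],
     [F(1), F(1)]]          # 3x2 with full column rank, chosen for illustration

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def inv2(M):
    # Closed-form inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

At = transpose(A)
pinv = matmul(inv2(matmul(At, A)), At)   # A# = (A^T A)^{-1} A^T
check = matmul(pinv, A)                  # A# A should be the 2x2 identity
```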
Matrix Rank
rank(A): maximum number of linearly independent columns or rows of A
A_{m×n}: rank(A) ≤ min(m, n)
Full-rank A_{n×n}: rank(A) = n iff A is nonsingular (det(A) ≠ 0)
Eigenvectors and Eigenvalues
Av = λv
Characteristic equation:
det(A − λI_n) = 0
An n-th order polynomial in λ, with n roots
tr(A) = Σ_{j=1}^{n} λ_j
det(A) = Π_{j=1}^{n} λ_j
Eigenvector: Example
A = [[2, 1], [1, 2]]
[Wikipedia]
Eigenvectors and Eigenvalues:
Symmetric Matrix
For a symmetric matrix, the eigenvectors corresponding to distinct eigenvalues are orthogonal.
These eigenvectors can be used to form an orthonormal set (∀i ≠ j: v_i^T v_j = 0 and ‖v_i‖ = 1)
Eigen Decomposition: Symmetric Matrix
V = [v_1 … v_d],  Λ = diag(λ_1, …, λ_d)
AV = VΛ  ⟹  A V V^T = V Λ V^T,  and V V^T = I
Eigen decomposition of a symmetric matrix: A = V Λ V^T
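Verifying A = V Λ V^T for the symmetric example A = [[2, 1], [1, 2]] from the earlier slide, whose eigenvalues are 3 and 1 with orthonormal eigenvectors (1, 1)/√2 and (1, −1)/√2 (a hand-rolled sketch, no linear-algebra library):

```python
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]

s = 1 / math.sqrt(2)
V = [[s,  s],
     [s, -s]]                  # columns: eigenvectors for eigenvalues 3 and 1
L = [[3.0, 0.0],
     [0.0, 1.0]]               # Lambda = diag(3, 1)

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(M):
    return [list(row) for row in zip(*M)]

recon = matmul(matmul(V, L), transpose(V))   # V Lambda V^T, should equal A
```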
Positive Definite Matrix
Symmetric A_{n×n} is positive definite iff:
∀x ∈ ℝ^n, x ≠ 0:  x^T A x > 0
Eigenvalues of a positive definite matrix are positive:
∀i: λ_i > 0
Vector Derivatives
∂(x^T A x)/∂x = (A + A^T) x
∂(b^T x)/∂x = b
You can see more on vector derivatives in the uploaded review materials.
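Both identities can be spot-checked with central finite differences (a sketch; the matrix, vector, and evaluation point are arbitrary illustrations):

```python
A = [[1.0, 2.0],
     [0.0, 3.0]]          # deliberately non-symmetric
b = [2.0, -1.0]
x = [0.5, -1.0]
h = 1e-6

def quad(v):              # f(v) = v^T A v
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

def lin(v):               # g(v) = b^T v
    return sum(bi * vi for bi, vi in zip(b, v))

def num_grad(f, v):
    # Central-difference approximation of the gradient of f at v
    g = []
    for i in range(len(v)):
        up = [vj + (j == i) * h for j, vj in enumerate(v)]
        dn = [vj - (j == i) * h for j, vj in enumerate(v)]
        g.append((f(up) - f(dn)) / (2 * h))
    return g

grad_quad = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]
grad_lin = b              # analytic gradients: (A + A^T) x and b
```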