CE-717: Machine Learning Sharif University of Technology M. Soleymani Review (Probability & Linear Algebra)


Page 1: Sharif University of Technology, ce.sharif.edu/courses/95-96/1/ce717-2/resources/root/Review/Review.pdf

CE-717: Machine Learning Sharif University of Technology

M. Soleymani

Review (Probability & Linear Algebra)

Page 2:

Outline

Axioms of probability theory

Joint probability, conditional probability, Bayes theorem

Discrete and continuous random variables

Probability mass and density functions

Expected value, variance, standard deviation

Expectation for two variables

covariance, correlation

Some probability distributions

Gaussian distribution

Linear Algebra

Page 3:

Basic Probability Elements


Sample space (Ω): set of all possible outcomes (or worlds)

Outcomes are assumed to be mutually exclusive.

An event 𝐴 is a certain set of outcomes (i.e., subset of Ω).

A random variable is a function defined over the sample

space

Gender: 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 → {𝑚𝑎𝑙𝑒, 𝑓𝑒𝑚𝑎𝑙𝑒}

Height: 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 → ℝ+

Page 4:

Probability Space

A probability space is defined as a triple (Ω, 𝐹, 𝑃):

A sample space Ω ≠ ∅ that contains the set of all possible

outcomes (outcomes also called states of nature)

A set 𝐹 whose elements are called events. The events are subsets of Ω. 𝐹 should be a “Borel field” (a σ-field).

𝑃 represents the probability measure that assigns

probabilities to events.

Page 5:

Probability Axioms (Kolmogorov)


Axioms define a reasonable theory of uncertainty

Kolmogorov’s probability axioms

𝑃(𝐴) ≥ 0 (∀𝐴 ⊆ Ω)

𝑃(Ω) = 1

𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵) (∀𝐴, 𝐵 ⊆ Ω)

[Venn diagram: events 𝐴 and 𝐵 in Ω, overlapping in 𝐴 ∩ 𝐵]

Page 6:

Random Variables

Random variables: Variables in probability theory

Domain of random variables: Boolean, discrete, or continuous

Probability distribution: the function describing probabilities

of possible values of a random variable

𝑃(𝐷𝑖𝑐𝑒 = 1) = 1/6, 𝑃(𝐷𝑖𝑐𝑒 = 2) = 1/6, …

Page 7:

Random Variables

Random variable is a function that maps every outcome

in Ω to a real (complex) number.

To define probabilities easily as functions defined on (real)

numbers.

To compute expectation, variance,…

[Figure: a coin toss mapped to the real line, Tail ↦ 0 and Head ↦ 1]

Page 8:

Base Definitions


Joint probability distribution

The rules of probability (sum and product rule)

Bayes’ theorem

Independence

new evidence may be irrelevant

Page 9:

Joint Probability Distribution


Probability of all combinations of the values for a set of

random variables.

If two or more random variables are considered together, they can

be described in terms of their joint probability

Example: Joint probability of features

𝑃(𝑋1, 𝑋2, … , 𝑋𝑑)

Page 10:

Two Fundamental Rules


Sum rule:

𝑃(𝑌) = Σ_𝑋 𝑃(𝑋, 𝑌)

Product rule: 𝑃(𝑋, 𝑌) = 𝑃(𝑋|𝑌)𝑃(𝑌)
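Both rules can be sketched numerically on a small assumed joint table (the probabilities below are made up for illustration):

```python
# Hypothetical joint distribution P(X, Y) over X in {'a', 'b'}, Y in {0, 1}.
P = {('a', 0): 0.2, ('a', 1): 0.3, ('b', 0): 0.4, ('b', 1): 0.1}

# Sum rule: marginalize X out to get P(Y = 0).
P_y0 = sum(p for (x, y), p in P.items() if y == 0)

# Product rule: P(X, Y) = P(X | Y) P(Y), so P(X='a' | Y=0) = P('a', 0) / P(Y=0).
P_a_given_y0 = P[('a', 0)] / P_y0
```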

Page 11:

Chain Rule

Chain rule is derived by successive application of product rule:

𝑃(𝑋1, … , 𝑋𝑛) = 𝑃(𝑋1, … , 𝑋𝑛−1) 𝑃(𝑋𝑛|𝑋1, … , 𝑋𝑛−1)
= 𝑃(𝑋1, … , 𝑋𝑛−2) 𝑃(𝑋𝑛−1|𝑋1, … , 𝑋𝑛−2) 𝑃(𝑋𝑛|𝑋1, … , 𝑋𝑛−1)
= …
= 𝑃(𝑋1) ∏_{𝑖=2}^{𝑛} 𝑃(𝑋𝑖|𝑋1, … , 𝑋𝑖−1)

Page 12:

Sum Rule: Example


[Bishop, Section 1.2]

Page 13:

Conditional Probability

𝑃(𝑋|𝑌) = 𝑃(𝑋, 𝑌)/𝑃(𝑌) if 𝑃(𝑌) > 0

Obtained from the product rule

𝑃(𝑋|𝑌) obeys the same rules as probabilities

Σ_𝑋 𝑃(𝑋|𝑌) = 1

Page 14:

Conditional Probability

For statistically dependent variables, knowing the value of

one variable may allow us to better estimate the other.

All probabilities in effect are conditional probabilities

E.g., 𝑃(𝐴) = 𝑃(𝐴 | 𝑜𝑢𝑟 𝑏𝑎𝑐𝑘𝑔𝑟𝑜𝑢𝑛𝑑 𝑘𝑛𝑜𝑤𝑙𝑒𝑑𝑔𝑒)

[Venn diagram: conditioning on 𝐵 restricts the sample space from Ω to 𝐵]

Conditioning renormalizes the probabilities of events that occur jointly with 𝐵

Page 15:

Conditional Probability: Example (rolling a fair die)

𝐴 : the outcome is an even number

𝐵 : the outcome is a prime number

𝑃(𝐴|𝐵) = 𝑃(𝐴, 𝐵)/𝑃(𝐵) = (1/6)/(1/2) = 1/3
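The die calculation above can be checked directly with exact fractions (a minimal sketch):

```python
from fractions import Fraction

outcomes = range(1, 7)                     # a fair six-sided die
A = {x for x in outcomes if x % 2 == 0}    # even outcomes
B = {2, 3, 5}                              # prime outcomes
p = Fraction(1, 6)                         # each outcome equally likely

P_AB = len(A & B) * p                      # A and B together: only the outcome 2
P_B = len(B) * p
P_A_given_B = P_AB / P_B                   # = (1/6) / (1/2) = 1/3
```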

Meningitis(𝑀) & Stiff neck (𝑆)

𝑃(𝑀 = 1, 𝑆 = 1) = 1/5000

𝑃(𝑀 = 1) = 1/2000

𝑃(𝑆 = 1|𝑀 = 1) = 𝑃(𝑀 = 1, 𝑆 = 1)/𝑃(𝑀 = 1) = 0.4

Page 16:

Conditional Probability: Another Example


[Bishop, Section 1.2]

Page 17:

Bayes Theorem

𝑃(𝑌|𝑋) = 𝑃(𝑋|𝑌)𝑃(𝑌)/𝑃(𝑋)

Obtained from the product rule and the symmetry property

𝑃(𝑋, 𝑌) = 𝑃(𝑌, 𝑋)

𝑃(𝑋, 𝑌) = 𝑃(𝑋|𝑌)𝑃(𝑌) = 𝑃(𝑌|𝑋)𝑃(𝑋)

In some problems, it may be difficult to compute 𝑃(𝑌|𝑋) directly, yet we might have information about 𝑃(𝑋|𝑌).

𝑃(𝐶𝑎𝑢𝑠𝑒|𝐸𝑓𝑓𝑒𝑐𝑡) = 𝑃(𝐸𝑓𝑓𝑒𝑐𝑡|𝐶𝑎𝑢𝑠𝑒) 𝑃(𝐶𝑎𝑢𝑠𝑒) / 𝑃(𝐸𝑓𝑓𝑒𝑐𝑡)

Page 18:

Bayes Theorem

Often it is useful to expand the denominator using the sum rule:

𝑃(𝑌|𝑋) = 𝑃(𝑋|𝑌)𝑃(𝑌)/𝑃(𝑋) = 𝑃(𝑋|𝑌)𝑃(𝑌) / Σ_𝑌 𝑃(𝑋|𝑌)𝑃(𝑌)

Page 19:

Bayes Theorem: Example


Meningitis(𝑀) & Stiff neck (𝑆)

𝑃(𝑀 = 1) = 1/5000

𝑃(𝑆 = 1) = 0.01

𝑃(𝑆 = 1|𝑀 = 1) = 0.7

𝑃(𝑀 = 1|𝑆 = 1) = ?

𝑃(𝑀 = 1|𝑆 = 1) = 𝑃(𝑆 = 1|𝑀 = 1)𝑃(𝑀 = 1)/𝑃(𝑆 = 1) = 0.7 × 0.0002/0.01 = 0.014
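The same computation in code, with the numbers taken from the slide:

```python
# Bayes' theorem: P(M=1 | S=1) = P(S=1 | M=1) P(M=1) / P(S=1)
p_m = 1 / 5000          # prior probability of meningitis
p_s = 0.01              # prior probability of a stiff neck
p_s_given_m = 0.7       # likelihood

p_m_given_s = p_s_given_m * p_m / p_s   # ≈ 0.014
```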

Page 20:

Prior and Posterior Probabilities

Prior or unconditional probabilities: belief in the absence of

any other evidence

e.g., 𝑃(𝑆 = 1) = 0.01

Posterior or conditional probabilities: belief in the presence of

evidences

e.g., 𝑃(𝑆 = 1|𝑀 = 1) = 0.7

Page 21:

Independence of Random Variables

𝑋 and 𝑌 are independent iff

𝑃(𝑋|𝑌) = 𝑃(𝑋)

𝑃(𝑌|𝑋) = 𝑃(𝑌)

𝑃(𝑋, 𝑌) = 𝑃(𝑋) 𝑃(𝑌)

Knowing 𝑌 tells us nothing about 𝑋 (and vice versa)

Page 22:

Probability Mass Function (PMF)

Probability Mass Function (PMF) shows the probability

for each value of a discrete random variable

Each impulse magnitude is equal to the probability of the

corresponding outcome

Example: PMF of a fair die

𝑃(𝑋 = 𝑥) ≥ 0

Σ_{𝑥∈𝑋} 𝑃(𝑋 = 𝑥) = 1
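A quick sanity check of these two PMF conditions for the fair-die example:

```python
from fractions import Fraction

# PMF of a fair die: every probability is nonnegative and they sum to one.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
nonnegative = all(p >= 0 for p in pmf.values())
total = sum(pmf.values())   # exactly 1
```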

Page 23:

Probability Density Function (PDF)

Probability Density Function (PDF) is defined for

continuous random variables

The probability of 𝑥 ∈ (𝑥0, 𝑥0 + 𝛿𝑥) is 𝑝(𝑥0) × 𝛿𝑥 (for 𝛿𝑥 → 0)

𝑝(𝑥): probability density over 𝑥


𝑝(𝑥) ≥ 0, ∫ 𝑝(𝑥) 𝑑𝑥 = 1

Page 24:

Cumulative Distribution Function (CDF)

Cumulative Distribution Function (CDF)

Defined as the integration of PDF

Similarly defined on discrete variables (summation instead of integration)

Non-decreasing

Right Continuous

𝐹(−∞) = 0

𝐹(∞) = 1

𝑑𝐹(𝑥)/𝑑𝑥 = 𝑝(𝑥)

𝑃(𝑢 ≤ 𝑥 ≤ 𝑣) = 𝐹(𝑣) − 𝐹(𝑢)

𝐶𝐷𝐹: 𝐹(𝑥) = ∫_{−∞}^{𝑥} 𝑝(𝑡) 𝑑𝑡

Page 25:

Distribution Statistics

Basic descriptors of probability distributions:

Mean value

Variance & standard deviation

Moments

Covariance & correlation

Page 26:

Expected Value

Expected (or mean) value: weighted average of all possible

values of the random variable

Expectation of a discrete random variable 𝑋:

𝐸[𝑥] = Σ_𝑥 𝑥 𝑝(𝑥)

Expectation of a function of a discrete random variable 𝑋 :

𝐸[𝑓(𝑥)] = Σ_𝑥 𝑓(𝑥) 𝑝(𝑥)

Expected value of a continuous random variable 𝑋 :

𝐸[𝑥] = ∫ 𝑥 𝑝(𝑥) 𝑑𝑥

Expectation of a function of a continuous random variable 𝑋 :

𝐸[𝑓(𝑥)] = ∫ 𝑓(𝑥) 𝑝(𝑥) 𝑑𝑥
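As a concrete instance of the discrete formula, the expected value of a fair die (a minimal sketch using exact fractions):

```python
from fractions import Fraction

# E[x] = sum over x of x * p(x), with p(x) = 1/6 for a fair die.
p = Fraction(1, 6)
E = sum(x * p for x in range(1, 7))   # 7/2
```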

Page 27:

Expected Value


For the expectation of a function of several variables, a subscript is used to specify the variable being averaged over

Examples:

𝐸_𝑥[𝑓(𝑥, 𝑦)] = Σ_𝑥 𝑝(𝑥) 𝑓(𝑥, 𝑦)

𝐸_{𝑥|𝑦}[𝑓(𝑥, 𝑦)] = Σ_𝑥 𝑝(𝑥|𝑦) 𝑓(𝑥, 𝑦) (other notation: 𝐸_𝑥[𝑓(𝑥, 𝑦)|𝑦])

𝐸_{𝑥,𝑦}[𝑓(𝑥, 𝑦)] = Σ_𝑥 Σ_𝑦 𝑝(𝑥, 𝑦) 𝑓(𝑥, 𝑦)

Page 28: Sharif University of Technologyce.sharif.edu/courses/95-96/1/ce717-2/resources/root/Review/Review.pdfSharif University of Technology M. Soleymani Review (Probability & Linear Algebra)

Variance

Variance: a measure of how far values of a random

variable are spread out around its expected value

𝑉𝑎𝑟[𝑥] = 𝐸[(𝑥 − 𝐸[𝑥])²] = 𝐸[𝑥²] − (𝐸[𝑥])²

Standard deviation: square root of variance:

σ_𝑥 = √𝑉𝑎𝑟[𝑥]
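The identity 𝑉𝑎𝑟[𝑥] = 𝐸[𝑥²] − (𝐸[𝑥])² can be verified on the fair-die example:

```python
from fractions import Fraction

xs = range(1, 7)
p = Fraction(1, 6)
E = sum(x * p for x in xs)                       # 7/2
E2 = sum(x * x * p for x in xs)                  # 91/6
var_direct = sum((x - E) ** 2 * p for x in xs)   # E[(x - E[x])^2]
var_identity = E2 - E ** 2                       # E[x^2] - (E[x])^2
```

Both forms give 35/12, as the identity predicts.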

Page 29:

Moments

The nth-order moment of a random variable 𝑋:

𝐸[𝑥ⁿ]

The nth-order central (mean-removed) moment:

𝐸[(𝑥 − 𝐸[𝑥])ⁿ]

The first order moment is the mean value.

The second order moment is the variance plus the square of the mean.

Page 30:

Correlation & Covariance

Correlation

𝐶𝑜𝑟𝑟(𝑥, 𝑦) = 𝐸_{𝑥,𝑦}[𝑥𝑦]

Covariance is the correlation of mean removed variables:

𝐶𝑜𝑣(𝑥, 𝑦) = 𝐸_{𝑥,𝑦}[(𝑥 − 𝐸[𝑥])(𝑦 − 𝐸[𝑦])]


For discrete RVs: 𝐸_{𝑥,𝑦}[𝑥𝑦] = Σ_𝑥 Σ_𝑦 𝑥𝑦 𝑝(𝑥, 𝑦)
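A small sketch of these discrete formulas on an assumed joint PMF (the table is made up for illustration):

```python
# Hypothetical joint PMF over x, y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

corr = sum(x * y * p for (x, y), p in joint.items())  # E[xy]
Ex = sum(x * p for (x, y), p in joint.items())
Ey = sum(y * p for (x, y), p in joint.items())
cov = corr - Ex * Ey   # equals E[(x - E[x])(y - E[y])]
```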

Page 31:

Covariance: Example


[Scatter plots of (𝑥, 𝑦): one with 𝐶𝑜𝑣(𝑥, 𝑦) = 0 and one with 𝐶𝑜𝑣(𝑥, 𝑦) = 0.9]

Page 32:

Covariance Properties


The covariance value shows the tendency of the pair of

RVs to increase together

𝐶𝑜𝑣𝑥𝑦 > 0 ∶ 𝑥 and 𝑦 tend to increase together

𝐶𝑜𝑣𝑥𝑦 < 0 : 𝑥 tends to decrease when 𝑦 increases

𝐶𝑜𝑣𝑥𝑦 = 0 : no linear correlation between 𝑥 and 𝑦

Page 33:

Pearson’s Product Moment Correlation


𝜌_𝑋𝑌 = 𝐶𝑜𝑣(𝑋, 𝑌)/(𝜎_𝑋𝜎_𝑌) = 𝐸[(𝑋 − 𝜇_𝑋)(𝑌 − 𝜇_𝑌)]/(𝜎_𝑋𝜎_𝑌)

Defined only if both 𝜎𝑋 and 𝜎𝑌 are finite and nonzero.

𝜌𝑋𝑌 shows the degree of linear dependence between 𝑋 and 𝑌.

−1 ≤ 𝜌𝑋𝑌≤ 1

𝐸[(𝑋 − 𝜇_𝑋)(𝑌 − 𝜇_𝑌)]² ≤ 𝐸[(𝑋 − 𝜇_𝑋)²] 𝐸[(𝑌 − 𝜇_𝑌)²] according to the Cauchy-Schwarz inequality (|𝐶_𝑋𝑌| ≤ 𝜎_𝑋𝜎_𝑌)

𝜌𝑋𝑌 = 1 shows a perfect positive linear relationship and 𝜌𝑋𝑌 = −1

shows a perfect negative linear relationship
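Continuing the same style of assumed joint PMF example, 𝜌 can be computed directly:

```python
import math

# Hypothetical joint PMF over X, Y in {0, 1}, with a positive linear tendency.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

Ex = sum(x * p for (x, y), p in joint.items())
Ey = sum(y * p for (x, y), p in joint.items())
cov = sum((x - Ex) * (y - Ey) * p for (x, y), p in joint.items())
sx = math.sqrt(sum((x - Ex) ** 2 * p for (x, y), p in joint.items()))
sy = math.sqrt(sum((y - Ey) ** 2 * p for (x, y), p in joint.items()))
rho = cov / (sx * sy)   # ≈ 0.6, always between -1 and 1
```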

Page 34:

Pearson’s Correlation: Examples


[Wikipedia]

𝜌_𝑋𝑌 = 𝐶𝑜𝑣(𝑋, 𝑌)/(𝜎_𝑋𝜎_𝑌)

Page 35:

Orthogonal, Uncorrelated & Independent RVs

Orthogonal random variables (𝐸[𝑥𝑦] = 0): 𝐶𝑜𝑟𝑟(𝑥, 𝑦) = 0

Uncorrelated random variables (𝐸[(𝑥 − 𝐸[𝑥])(𝑦 − 𝐸[𝑦])] = 0): 𝐶𝑜𝑣(𝑥, 𝑦) = 0

Independent random variables ⇒ 𝐶𝑜𝑣(𝑥, 𝑦) = 0

𝐶𝑜𝑣(𝑥, 𝑦) = 0 ⇏ independent random variables

Page 36:

Covariance Matrix

If 𝒙 is a vector of random variables (𝑑 -dim random

vector):

Covariance matrix indicates the tendency of each pair of RVs

to vary together


𝜮 = 𝐸[(𝒙 − 𝝁_𝒙)(𝒙 − 𝝁_𝒙)ᵀ] =
[𝐸((𝑥₁ − 𝜇₁)(𝑥₁ − 𝜇₁)) ⋯ 𝐸((𝑥₁ − 𝜇₁)(𝑥_𝑑 − 𝜇_𝑑))]
[⋮ ⋱ ⋮]
[𝐸((𝑥_𝑑 − 𝜇_𝑑)(𝑥₁ − 𝜇₁)) ⋯ 𝐸((𝑥_𝑑 − 𝜇_𝑑)(𝑥_𝑑 − 𝜇_𝑑))]

𝝁_𝒙 = [𝜇₁, …, 𝜇_𝑑]ᵀ = [𝐸(𝑥₁), …, 𝐸(𝑥_𝑑)]ᵀ
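A sketch of 𝜮 computed entrywise from an assumed 2-D joint PMF (made-up numbers):

```python
# Hypothetical joint PMF over a 2-D random vector with components in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

mu = [sum(z[i] * p for z, p in joint.items()) for i in range(2)]
Sigma = [[sum((z[i] - mu[i]) * (z[j] - mu[j]) * p for z, p in joint.items())
          for j in range(2)] for i in range(2)]
# Sigma is symmetric: variances on the diagonal, Cov(x1, x2) off the diagonal.
```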

Page 37:

Covariance Matrix: Two Variables


𝜮 = 𝐶 = [𝜎₁² 𝜎₁₂; 𝜎₂₁ 𝜎₂²], where 𝜎₂₁ = 𝜎₁₂ = 𝐶𝑜𝑣(𝑋, 𝑌)

[Scatter plots of (𝑋, 𝑌) for 𝜮 = [1 0; 0 1] and 𝜮 = [1 0.9; 0.9 1]]

Page 38:

Covariance Matrix


𝜎𝑖𝑗 shows the covariance of 𝑋𝑖 and 𝑋𝑗:

𝜎𝑖𝑗 = 𝜎𝑗𝑖 = 𝐶𝑜𝑣(𝑋𝑖 , 𝑋𝑗)

𝜮 = [𝜎₁² 𝜎₁₂ ⋯ 𝜎₁𝑑; 𝜎₂₁ 𝜎₂² ⋯ 𝜎₂𝑑; ⋮ ⋮ ⋱ ⋮; 𝜎𝑑₁ 𝜎𝑑₂ ⋯ 𝜎𝑑²]

Page 39:

Sums of Random Variables

𝑍 = 𝑋 + 𝑌

Mean: 𝐸[𝑧] = 𝐸[𝑥] + 𝐸[𝑦]

Variance: 𝑉𝑎𝑟(𝑍) = 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) + 2𝐶𝑜𝑣(𝑋, 𝑌)

If 𝑋, 𝑌 independent: 𝑉𝑎𝑟(𝑍) = 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌)

Distribution:

𝑝_𝑍(𝑧) = ∫ 𝑝_{𝑋,𝑌}(𝑥, 𝑧 − 𝑥) 𝑑𝑥

If 𝑋 and 𝑌 are independent: 𝑝_𝑍(𝑧) = ∫ 𝑝_𝑋(𝑥) 𝑝_𝑌(𝑧 − 𝑥) 𝑑𝑥 (the convolution of 𝑝_𝑋 and 𝑝_𝑌)
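The variance rule can be checked on two independent fair 0/1 coin flips (a minimal sketch):

```python
from fractions import Fraction

# Joint distribution of two independent fair 0/1 variables.
q = Fraction(1, 4)
joint = {(x, y): q for x in (0, 1) for y in (0, 1)}

def variance(dist):
    """Variance of a finite distribution given as {value: probability}."""
    m = sum(v * p for v, p in dist.items())
    return sum((v - m) ** 2 * p for v, p in dist.items())

# Distribution of Z = X + Y.
zdist = {}
for (x, y), p in joint.items():
    zdist[x + y] = zdist.get(x + y, 0) + p

var_x = variance({0: Fraction(1, 2), 1: Fraction(1, 2)})  # 1/4
var_z = variance(zdist)                                   # 1/2 = Var(X) + Var(Y)
```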

Page 40:

Some Famous Probability Density Functions

Uniform

Gaussian (Normal)

Uniform: 𝑥~𝑈(𝑎, 𝑏), 𝑝(𝑥) = 1/(𝑏 − 𝑎) for 𝑎 ≤ 𝑥 ≤ 𝑏 (0 otherwise)

Gaussian: 𝑥~𝑁(𝜇, 𝜎²), 𝑝(𝑥) = (1/(√(2𝜋)𝜎)) 𝑒^{−(𝑥−𝜇)²/(2𝜎²)}

[Figures: the uniform density of height 1/(𝑏 − 𝑎) on [𝑎, 𝑏], and the Gaussian density peaking at height 1/(√(2𝜋)𝜎) at 𝑥 = 𝜇]

Page 41:

Gaussian (Normal) Distribution


68% of the probability mass lies within [𝜇 − 𝜎, 𝜇 + 𝜎], and 95% within [𝜇 − 2𝜎, 𝜇 + 2𝜎]

It is widely used to model the distribution of continuous variables

Standard Normal distribution: 𝜇 = 0, 𝜎 = 1

Page 42:

Some Famous Probability Density Functions


Exponential

𝑝(𝑥) = 𝜆𝑒^{−𝜆𝑥} 𝑈(𝑥), i.e., 𝑝(𝑥) = 𝜆𝑒^{−𝜆𝑥} for 𝑥 ≥ 0 and 0 for 𝑥 < 0

[Figure: the exponential density, starting at height 𝜆 at 𝑥 = 0 and decaying]

Page 43:

Some Famous Probability Mass Functions

Bernoulli: 𝑥 ∈ {0, 1}

𝑝(𝑥) = 𝜇^𝑥 (1 − 𝜇)^{1−𝑥}

Binomial

𝑃(𝑋 = 𝑘) = (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘}, 𝑥~𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝)

[Figures: the Bernoulli PMF on {0, 1} with 𝑃(𝑥 = 1) = 𝜇, and the binomial PMF 𝑃(𝑥 = 𝑘) peaking near 𝑘 = 𝑛𝑝]

Page 44:

Multivariate Gaussian Distribution

𝒙 is a vector of 𝑑 Gaussian variables

𝑝(𝒙)~𝑁(𝝁, 𝜮) = (1/((2𝜋)^{𝑑/2} |𝜮|^{1/2})) 𝑒^{−(1/2)(𝒙−𝝁)ᵀ𝜮⁻¹(𝒙−𝝁)}

𝝁 = [𝜇₁, …, 𝜇_𝑑]ᵀ = [𝐸(𝑥₁), …, 𝐸(𝑥_𝑑)]ᵀ

𝜮 = 𝐸[(𝒙 − 𝝁)(𝒙 − 𝝁)ᵀ]

Page 45:

Multivariate Gaussian Distribution


The covariance matrix is always symmetric and positive semi-

definite

Multivariate Gaussian is completely specified by 𝑑 + 𝑑(𝑑 + 1)/2 parameters

Special cases of 𝜮:

𝜮 = 𝜎²𝑰: independent random variables with the same variance (circularly symmetric Gaussian)

Diagonal 𝜮 = diag(𝜎₁², …, 𝜎_𝑑²): independent random variables with different variances

Page 46:

Multivariate Gaussian Distribution

Level Surfaces


The Gaussian density is constant on surfaces (hyper-ellipsoids) in 𝒙-space for which (𝒙 − 𝝁)ᵀ𝜮⁻¹(𝒙 − 𝝁) = 𝐶

Principal axes of the hyper-ellipsoids are the eigenvectors of 𝜮.

Bivariate Gaussian: curves of constant density are ellipses.

Page 47:

Bivariate Gaussian distribution

𝜆₁ and 𝜆₂ are the eigenvalues of 𝜮 (𝜆₁ ≥ 𝜆₂), and 𝒗₁ and 𝒗₂ are the corresponding eigenvectors.

[Figure: constant-density ellipse with axes along 𝒗₁ and 𝒗₂ and axis-length ratio 𝑙₁/𝑙₂ = √(𝜆₁/𝜆₂)]

Page 48:

Gaussian Distribution Properties

Some attractive properties of the Gaussian distribution:

Marginal and conditional distributions of a Gaussian are also Gaussian

After a linear transformation, a Gaussian distribution is again Gaussian

There exists a linear transformation that diagonalizes the covariance matrix

(whitening transform).

It converts the multivariate normal distribution into a spherical one.

Gaussian maximizes the entropy among distributions with a given mean and variance

Gaussian is stable and infinitely divisible

Central Limit Theorem

Some distributions can be approximated by a Gaussian when their parameter value is sufficiently large (e.g., Binomial)

Page 49:

Central Limit Theorem

(under mild conditions)

Suppose i.i.d. (independent, identically distributed) RVs 𝑋_𝑖 (𝑖 = 1, …, 𝑁) with finite variances

Let 𝑆_𝑁 = Σ_{𝑖=1}^{𝑁} 𝑋_𝑖 be the sum of these RVs

The distribution of 𝑆_𝑁 converges to a normal distribution as 𝑁 increases, regardless of the distribution of the 𝑋_𝑖.

Example:

𝑋_𝑖 ~ uniform, 𝑖 = 1, …, 𝑁

𝑆_𝑁 = (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝑋_𝑖

[Figure: histograms of 𝑆₁, 𝑆₂, 𝑆₁₀, approaching a Gaussian shape]
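A simulation sketch of this effect (the sample size and the 12-uniform construction are illustrative choices, not from the slide):

```python
import random
import statistics

random.seed(0)
# Sum of 12 U(0,1) draws minus 6: mean 0 and variance 1 (each uniform has var 1/12),
# and by the CLT the result is already close to a standard normal.
samples = [sum(random.random() for _ in range(12)) - 6 for _ in range(2000)]

mean = statistics.mean(samples)        # close to 0
var = statistics.variance(samples)     # close to 1
```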

Page 50:

Linear Algebra: Basic Definitions

Matrix 𝑨: 𝑨 = [𝑎_𝑖𝑗]_{𝑚×𝑛} =
[𝑎₁₁ 𝑎₁₂ … 𝑎₁𝑛]
[𝑎₂₁ 𝑎₂₂ … 𝑎₂𝑛]
[… … … …]
[𝑎𝑚₁ 𝑎𝑚₂ … 𝑎𝑚𝑛]

Matrix transpose: 𝑩 = 𝑨ᵀ where 𝑏_𝑖𝑗 = 𝑎_𝑗𝑖 (1 ≤ 𝑖 ≤ 𝑛, 1 ≤ 𝑗 ≤ 𝑚)

Symmetric matrix: 𝑨 = 𝑨ᵀ

Vector 𝒂: 𝒂 = [𝑎₁, …, 𝑎𝑛]ᵀ

Page 51:

Linear Mapping

Linear function

𝑓(𝒙 + 𝒚) = 𝑓(𝒙) + 𝑓(𝒚) ∀𝒙, 𝒚 ∈ 𝑉

𝑓(𝑎𝒙) = 𝑎𝑓(𝒙) ∀𝒙 ∈ 𝑉, 𝑎 ∈ 𝐹

A linear function: 𝑓 𝒙 = 𝑤1𝑥1 +⋯+𝑤𝑑𝑥𝑑 = 𝒘𝑇 𝒙

In general, a matrix 𝐖_{𝑚×𝑑} = [𝒘₁ ⋯ 𝒘𝑚]ᵀ can be used to denote a map 𝑓: ℝ^𝑑 → ℝ^𝑚 where 𝑓_𝑖(𝒙) = 𝑤_{𝑖1}𝑥₁ + ⋯ + 𝑤_{𝑖𝑑}𝑥_𝑑 = 𝒘_𝑖ᵀ𝒙

Page 52:

Linear Algebra: Basic Definitions

Inner (dot) product: 𝒂ᵀ𝒃 = Σ_{𝑖=1}^{𝑛} 𝑎_𝑖𝑏_𝑖

Matrix multiplication: 𝑨𝑩 = 𝑪, where 𝑨 = [𝑎_𝑖𝑗]_{𝑚×𝑝}, 𝑩 = [𝑏_𝑖𝑗]_{𝑝×𝑛}, and 𝑪 = [𝑐_𝑖𝑗]_{𝑚×𝑛} with 𝑐_𝑖𝑗 = (i-th row of 𝑨) · (j-th column of 𝑩)

Page 53:

Inner Product

Inner (dot) product: 𝒂ᵀ𝒃 = Σ_{𝑖=1}^{𝑑} 𝑎_𝑖𝑏_𝑖

Length (Euclidean norm) of a vector: ‖𝒂‖₂ = √(𝒂ᵀ𝒂) = √(Σ_{𝑖=1}^{𝑑} 𝑎_𝑖²)

𝒂 is normalized iff ‖𝒂‖₂ = 1

Angle between vectors: cos 𝜃 = 𝒂ᵀ𝒃 / (‖𝒂‖₂‖𝒃‖₂)

Orthogonal vectors 𝒂 and 𝒃: 𝒂ᵀ𝒃 = 0

Orthonormal set of vectors 𝒂₁, 𝒂₂, …, 𝒂𝑛: ∀𝑖, 𝑗: 𝒂_𝑖ᵀ𝒂_𝑗 = 1 if 𝑖 = 𝑗, and 0 otherwise
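These definitions in code, for two small example vectors:

```python
import math

a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

dot = sum(x * y for x, y in zip(a, b))       # inner product aᵀb = 4
norm_a = math.sqrt(sum(x * x for x in a))    # Euclidean norm ‖a‖₂ = 3
norm_b = math.sqrt(sum(x * x for x in b))    # ‖b‖₂ = √5
cos_theta = dot / (norm_a * norm_b)          # cosine of the angle between a and b
```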

Page 54:

Linear Independence

A set of vectors is linearly independent if no vector is

a linear combination of other vectors.

𝑐₁𝒗₁ + 𝑐₂𝒗₂ + … + 𝑐𝑘𝒗𝑘 = 0 ⇒ 𝑐₁ = 𝑐₂ = … = 𝑐𝑘 = 0

Page 55:

Matrix Determinant and Trace

Determinant:

det(𝑨𝑩) = det(𝑨) × det(𝑩)

det(𝑨) = Σ_{𝑗=1}^{𝑛} 𝑎_𝑖𝑗 𝑨_𝑖𝑗 for any 𝑖 = 1, …, 𝑛, where the cofactor is 𝑨_𝑖𝑗 = (−1)^{𝑖+𝑗} det(𝑴_𝑖𝑗)

Trace:

tr[𝑨] = Σ_{𝑗=1}^{𝑛} 𝑎_𝑗𝑗

Page 56:

Matrix Inversion

Inverse of 𝐴_{𝑛×𝑛}: 𝐴𝐵 = 𝐵𝐴 = 𝐼_𝑛 ⇒ 𝐵 = 𝐴⁻¹

𝐴⁻¹ exists iff det(𝐴) ≠ 0 (𝐴 is nonsingular)

Singular: det(𝐴) = 0

Ill-conditioned: 𝐴 is nonsingular but close to being singular

Pseudo-inverse for a non-square matrix: 𝐴# = (𝐴ᵀ𝐴)⁻¹𝐴ᵀ (assuming 𝐴ᵀ𝐴 is nonsingular); 𝐴#𝐴 = 𝐼

Page 57:

Matrix Rank

𝑟𝑎𝑛𝑘(𝑨) : maximum number of linearly independent

columns or rows of A.

𝑨𝑚×𝑛: 𝑟𝑎𝑛𝑘(𝑨) ≤ min(𝑚, 𝑛)

Full rank 𝑨_{𝑛×𝑛}: 𝑟𝑎𝑛𝑘(𝑨) = 𝑛 iff 𝑨 is nonsingular (det(𝑨) ≠ 0)

Page 58:

Eigenvectors and Eigenvalues

𝐴𝒗 = 𝜆𝒗

Characteristic equation: det(𝐴 − 𝜆𝐼_𝑛) = 0, an n-th order polynomial with n roots

tr(𝐴) = Σ_{𝑗=1}^{𝑛} 𝜆_𝑗

det(𝐴) = ∏_{𝑗=1}^{𝑛} 𝜆_𝑗

Page 59:

Eigenvector: Example

𝑨 = [2 1; 1 2]

[Wikipedia]

Page 60:

Eigenvectors and Eigenvalues:

Symmetric Matrix

For a symmetric matrix, the eigenvectors corresponding

to distinct eigenvalues are orthogonal

These eigenvectors can be used to form an orthonormal set (∀𝑖 ≠ 𝑗: 𝒗_𝑖ᵀ𝒗_𝑗 = 0, and ‖𝒗_𝑖‖ = 1)

Page 61:

Eigen Decomposition: Symmetric Matrix

𝑽 = [𝒗₁ … 𝒗𝑁], 𝜦 = diag(𝜆₁, …, 𝜆𝑁)

𝑨𝑽 = 𝑽𝜦 ⇒ 𝑨𝑽𝑽ᵀ = 𝑽𝜦𝑽ᵀ, and since 𝑽𝑽ᵀ = 𝑰:

𝑨 = 𝑽𝜦𝑽ᵀ

Eigen decomposition of a symmetric matrix: 𝑨 = 𝑽𝜦𝑽ᵀ
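For the symmetric example 𝑨 = [2 1; 1 2] from the earlier slide (eigenvalues 3 and 1, eigenvectors (1, 1)/√2 and (1, −1)/√2), the decomposition can be verified by hand:

```python
import math

s = 1 / math.sqrt(2)
V = [[s, s], [s, -s]]              # orthonormal eigenvectors as columns
Lam = [[3.0, 0.0], [0.0, 1.0]]     # eigenvalues on the diagonal

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Vt = [[V[j][i] for j in range(2)] for i in range(2)]   # V transpose
A = matmul(matmul(V, Lam), Vt)     # V Λ Vᵀ reconstructs [[2, 1], [1, 2]]
```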

Page 62:

Positive Definite Matrix

Symmetric 𝐴_{𝑛×𝑛} is positive definite iff:

∀𝒙 ∈ ℝⁿ, 𝒙 ≠ 𝟎: 𝒙ᵀ𝐴𝒙 > 0

Eigenvalues of a positive definite matrix are positive:

∀𝑖, 𝜆_𝑖 > 0
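For the same symmetric example 𝐴 = [2 1; 1 2] (eigenvalues 3 and 1, both positive), the quadratic form stays positive on a few nonzero test vectors:

```python
def quad_form(x1, x2):
    # xᵀAx for A = [[2, 1], [1, 2]] expands to 2*x1^2 + 2*x1*x2 + 2*x2^2.
    return 2 * x1 ** 2 + 2 * x1 * x2 + 2 * x2 ** 2

checks = [quad_form(1, 0), quad_form(-1, 1), quad_form(0.3, -0.7)]
all_positive = all(q > 0 for q in checks)
```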

Page 63:

Vector Derivatives

𝜕(𝒙ᵀ𝑨𝒙)/𝜕𝒙 = (𝑨 + 𝑨ᵀ)𝒙

𝜕(𝒃ᵀ𝒙)/𝜕𝒙 = 𝒃
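The first identity can be spot-checked with finite differences (a sketch on an assumed small 𝑨 and 𝒙):

```python
# Finite-difference check of d(xᵀAx)/dx = (A + Aᵀ)x on a 2x2 example.
A = [[2.0, 1.0], [0.0, 3.0]]
x = [1.0, 2.0]

def f(v):
    # The quadratic form vᵀAv.
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

h = 1e-6
grad_num = []
for i in range(2):
    xp, xm = x[:], x[:]
    xp[i] += h
    xm[i] -= h
    grad_num.append((f(xp) - f(xm)) / (2 * h))   # central difference

grad_ana = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]
# grad_ana = [6.0, 13.0]; grad_num agrees to about 1e-6.
```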

You can see more on the vector derivatives in the

uploaded review materials
