CE-717: Machine Learning Sharif University of Technology M. Soleymani Review (Probability & Linear Algebra)


Page 1: Sharif University of Technology, ce.sharif.edu/courses/95-96/1/ce717-2/resources/root/Review/Review.pdf

CE-717: Machine Learning Sharif University of Technology

M. Soleymani

Review (Probability & Linear Algebra)

Page 2:

Outline

Axioms of probability theory

Joint probability, conditional probability, Bayes theorem

Discrete and continuous random variables

Probability mass and density functions

Expected value, variance, standard deviation

Expectation for two variables

covariance, correlation

Some probability distributions

Gaussian distribution

Linear Algebra

Page 3:

Basic Probability Elements


Sample space (Ω): set of all possible outcomes (or worlds)

Outcomes are assumed to be mutually exclusive.

An event 𝐴 is a certain set of outcomes (i.e., subset of Ω).

A random variable is a function defined over the sample

space

Gender: 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 → {𝑚𝑎𝑙𝑒, 𝑓𝑒𝑚𝑎𝑙𝑒}

Height: 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 → ℝ+

Page 4:

Probability Space

A probability space is defined as a triple (Ω, 𝐹, 𝑃):

A sample space Ω ≠ ∅ that contains the set of all possible

outcomes (outcomes also called states of nature)

A set 𝐹 whose elements are called events. The events are subsets of Ω. 𝐹 should be a “Borel field” (a σ-field).

𝑃 represents the probability measure that assigns

probabilities to events.

Page 5:

Probability Axioms (Kolmogorov)


Axioms define a reasonable theory of uncertainty

Kolmogorov’s probability axioms

𝑃(𝐴) ≥ 0 (∀𝐴 ⊆ Ω)

𝑃(Ω) = 1

𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵) (∀𝐴, 𝐵 ⊆ Ω)

[Venn diagram: events 𝐴 and 𝐵 in Ω, overlapping in 𝐴 ∩ 𝐵]

Page 6:

Random Variables

Random variables: Variables in probability theory

Domain of random variables: Boolean, discrete, or continuous

Probability distribution: the function describing probabilities

of possible values of a random variable

𝑃(𝐷𝑖𝑐𝑒 = 1) = 1/6, 𝑃(𝐷𝑖𝑐𝑒 = 2) = 1/6, …

Page 7:

Random Variables

Random variable is a function that maps every outcome

in Ω to a real (complex) number.

To define probabilities easily as functions defined on (real)

numbers.

To compute expectation, variance,…

[Figure: a coin toss mapped to the real line, Tail ↦ 0 and Head ↦ 1]

Page 8:

Base Definitions


Joint probability distribution

The rules of probability (sum and product rule)

Bayes’ theorem

Independence

new evidence may be irrelevant

Page 9:

Joint Probability Distribution


Probability of all combinations of the values for a set of

random variables.

If two or more random variables are considered together, they can

be described in terms of their joint probability

Example: Joint probability of features

𝑃(𝑋1, 𝑋2, … , 𝑋𝑑)

Page 10:

Two Fundamental Rules


Sum rule:

𝑃(𝑌) = Σ_𝑋 𝑃(𝑋, 𝑌)

Product rule: 𝑃(𝑋, 𝑌) = 𝑃(𝑋|𝑌)𝑃(𝑌)
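Both rules can be sketched numerically on a small assumed joint table (the probabilities below are made up for illustration):

```python
# Hypothetical joint distribution P(X, Y) over X in {'a', 'b'}, Y in {0, 1}.
P = {('a', 0): 0.2, ('a', 1): 0.3, ('b', 0): 0.4, ('b', 1): 0.1}

# Sum rule: marginalize X out to get P(Y = 0).
P_y0 = sum(p for (x, y), p in P.items() if y == 0)

# Product rule: P(X, Y) = P(X | Y) P(Y), so P(X='a' | Y=0) = P('a', 0) / P(Y=0).
P_a_given_y0 = P[('a', 0)] / P_y0
```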

Page 11:

Chain Rule

Chain rule is derived by successive application of product rule:

𝑃(𝑋1, … , 𝑋𝑛) = 𝑃(𝑋1, … , 𝑋𝑛−1) 𝑃(𝑋𝑛|𝑋1, … , 𝑋𝑛−1)
= 𝑃(𝑋1, … , 𝑋𝑛−2) 𝑃(𝑋𝑛−1|𝑋1, … , 𝑋𝑛−2) 𝑃(𝑋𝑛|𝑋1, … , 𝑋𝑛−1)
= …
= 𝑃(𝑋1) ∏_{𝑖=2}^{𝑛} 𝑃(𝑋𝑖|𝑋1, … , 𝑋𝑖−1)

Page 12:

Sum Rule: Example


[Bishop, Section 1.2]

Page 13:

Conditional Probability

𝑃(𝑋|𝑌) = 𝑃(𝑋, 𝑌)/𝑃(𝑌) if 𝑃(𝑌) > 0

Obtained from the product rule

𝑃(𝑋|𝑌) obeys the same rules as probabilities

Σ_𝑋 𝑃(𝑋|𝑌) = 1

Page 14:

Conditional Probability

For statistically dependent variables, knowing the value of

one variable may allow us to better estimate the other.

All probabilities in effect are conditional probabilities

E.g., 𝑃(𝐴) = 𝑃(𝐴 | 𝑜𝑢𝑟 𝑏𝑎𝑐𝑘𝑔𝑟𝑜𝑢𝑛𝑑 𝑘𝑛𝑜𝑤𝑙𝑒𝑑𝑔𝑒)

[Venn diagram: conditioning on 𝐵 restricts the sample space from Ω to 𝐵]

Conditioning renormalizes the probabilities of events that occur jointly with 𝐵

Page 15:

Conditional Probability: Example (rolling a fair die)

𝐴 : the outcome is an even number

𝐵 : the outcome is a prime number

𝑃(𝐴|𝐵) = 𝑃(𝐴, 𝐵)/𝑃(𝐵) = (1/6)/(1/2) = 1/3
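The die calculation above can be checked directly with exact fractions (a minimal sketch):

```python
from fractions import Fraction

outcomes = range(1, 7)                     # a fair six-sided die
A = {x for x in outcomes if x % 2 == 0}    # even outcomes
B = {2, 3, 5}                              # prime outcomes
p = Fraction(1, 6)                         # each outcome equally likely

P_AB = len(A & B) * p                      # A and B together: only the outcome 2
P_B = len(B) * p
P_A_given_B = P_AB / P_B                   # = (1/6) / (1/2) = 1/3
```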

Meningitis(𝑀) & Stiff neck (𝑆)

𝑃(𝑀 = 1, 𝑆 = 1) = 1/5000

𝑃(𝑀 = 1) = 1/2000

𝑃(𝑆 = 1|𝑀 = 1) = 𝑃(𝑀 = 1, 𝑆 = 1)/𝑃(𝑀 = 1) = 0.4

Page 16:

Conditional Probability: Another Example


[Bishop, Section 1.2]

Page 17:

Bayes Theorem

𝑃(𝑌|𝑋) = 𝑃(𝑋|𝑌)𝑃(𝑌)/𝑃(𝑋)

Obtained from the product rule and the symmetry property

𝑃(𝑋, 𝑌) = 𝑃(𝑌, 𝑋)

𝑃(𝑋, 𝑌) = 𝑃(𝑋|𝑌)𝑃(𝑌) = 𝑃(𝑌|𝑋)𝑃(𝑋)

In some problems, it may be difficult to compute 𝑃(𝑌|𝑋) directly, yet we might have information about 𝑃(𝑋|𝑌).

𝑃(𝐶𝑎𝑢𝑠𝑒|𝐸𝑓𝑓𝑒𝑐𝑡) = 𝑃(𝐸𝑓𝑓𝑒𝑐𝑡|𝐶𝑎𝑢𝑠𝑒) 𝑃(𝐶𝑎𝑢𝑠𝑒) / 𝑃(𝐸𝑓𝑓𝑒𝑐𝑡)

Page 18:

Bayes Theorem

Often it is useful to expand the denominator using the sum rule:

𝑃(𝑌|𝑋) = 𝑃(𝑋|𝑌)𝑃(𝑌)/𝑃(𝑋) = 𝑃(𝑋|𝑌)𝑃(𝑌) / Σ_𝑌 𝑃(𝑋|𝑌)𝑃(𝑌)

Page 19:

Bayes Theorem: Example


Meningitis(𝑀) & Stiff neck (𝑆)

𝑃(𝑀 = 1) = 1/5000

𝑃(𝑆 = 1) = 0.01

𝑃(𝑆 = 1|𝑀 = 1) = 0.7

𝑃(𝑀 = 1|𝑆 = 1) = ?

𝑃(𝑀 = 1|𝑆 = 1) = 𝑃(𝑆 = 1|𝑀 = 1)𝑃(𝑀 = 1)/𝑃(𝑆 = 1) = 0.7 × 0.0002/0.01 = 0.014
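The same computation in code, with the numbers taken from the slide:

```python
# Bayes' theorem: P(M=1 | S=1) = P(S=1 | M=1) P(M=1) / P(S=1)
p_m = 1 / 5000          # prior probability of meningitis
p_s = 0.01              # prior probability of a stiff neck
p_s_given_m = 0.7       # likelihood

p_m_given_s = p_s_given_m * p_m / p_s   # ≈ 0.014
```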

Page 20:

Prior and Posterior Probabilities

Prior or unconditional probabilities: belief in the absence of

any other evidence

e.g., 𝑃(𝑆 = 1) = 0.01

Posterior or conditional probabilities: belief in the presence of

evidences

e.g., 𝑃(𝑆 = 1|𝑀 = 1) = 0.7

Page 21:

Independence of Random Variables

𝑋 and 𝑌 are independent iff

𝑃(𝑋|𝑌) = 𝑃(𝑋)

𝑃(𝑌|𝑋) = 𝑃(𝑌)

𝑃(𝑋, 𝑌) = 𝑃(𝑋) 𝑃(𝑌)

Knowing 𝑌 tells us nothing about 𝑋 (and vice versa)

Page 22:

Probability Mass Function (PMF)

Probability Mass Function (PMF) shows the probability

for each value of a discrete random variable

Each impulse magnitude is equal to the probability of the

corresponding outcome

Example: PMF of a fair die

𝑃(𝑋 = 𝑥) ≥ 0

Σ_{𝑥∈𝑋} 𝑃(𝑋 = 𝑥) = 1
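A quick sanity check of these two PMF conditions for the fair-die example:

```python
from fractions import Fraction

# PMF of a fair die: every probability is nonnegative and they sum to one.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
nonnegative = all(p >= 0 for p in pmf.values())
total = sum(pmf.values())   # exactly 1
```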

Page 23:

Probability Density Function (PDF)

Probability Density Function (PDF) is defined for

continuous random variables

The probability of 𝑥 ∈ (𝑥0, 𝑥0 + 𝛿𝑥) is 𝑝(𝑥0) × 𝛿𝑥 (for 𝛿𝑥 → 0)

𝑝(𝑥): probability density over 𝑥


𝑝(𝑥) ≥ 0, ∫ 𝑝(𝑥) 𝑑𝑥 = 1

Page 24:

Cumulative Distribution Function (CDF)

Cumulative Distribution Function (CDF)

Defined as the integration of PDF

Similarly defined on discrete variables (summation instead of integration)

Non-decreasing

Right Continuous

𝐹(−∞) = 0

𝐹(∞) = 1

𝑑𝐹(𝑥)/𝑑𝑥 = 𝑝(𝑥)

𝑃(𝑢 ≤ 𝑥 ≤ 𝑣) = 𝐹(𝑣) − 𝐹(𝑢)

𝐶𝐷𝐹: 𝐹(𝑥) = ∫_{−∞}^{𝑥} 𝑝(𝑡) 𝑑𝑡

Page 25:

Distribution Statistics

Basic descriptors of probability distributions:

Mean value

Variance & standard deviation

Moments

Covariance & correlation

Page 26:

Expected Value

Expected (or mean) value: weighted average of all possible

values of the random variable

Expectation of a discrete random variable 𝑋:

𝐸[𝑥] = Σ_𝑥 𝑥 𝑝(𝑥)

Expectation of a function of a discrete random variable 𝑋 :

𝐸[𝑓(𝑥)] = Σ_𝑥 𝑓(𝑥) 𝑝(𝑥)

Expected value of a continuous random variable 𝑋 :

𝐸[𝑥] = ∫ 𝑥 𝑝(𝑥) 𝑑𝑥

Expectation of a function of a continuous random variable 𝑋 :

𝐸[𝑓(𝑥)] = ∫ 𝑓(𝑥) 𝑝(𝑥) 𝑑𝑥
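As a concrete instance of the discrete formula, the expected value of a fair die (a minimal sketch using exact fractions):

```python
from fractions import Fraction

# E[x] = sum over x of x * p(x), with p(x) = 1/6 for a fair die.
p = Fraction(1, 6)
E = sum(x * p for x in range(1, 7))   # 7/2
```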

Page 27:

Expected Value


For the expectation of a function of several variables, a subscript is used to specify the variable being averaged over

Examples:

𝐸_𝑥[𝑓(𝑥, 𝑦)] = Σ_𝑥 𝑝(𝑥) 𝑓(𝑥, 𝑦)

𝐸_{𝑥|𝑦}[𝑓(𝑥, 𝑦)] = Σ_𝑥 𝑝(𝑥|𝑦) 𝑓(𝑥, 𝑦) (other notation: 𝐸_𝑥[𝑓(𝑥, 𝑦)|𝑦])

𝐸_{𝑥,𝑦}[𝑓(𝑥, 𝑦)] = Σ_𝑥 Σ_𝑦 𝑝(𝑥, 𝑦) 𝑓(𝑥, 𝑦)

Page 28: Sharif University of Technologyce.sharif.edu/courses/95-96/1/ce717-2/resources/root/Review/Review.pdfSharif University of Technology M. Soleymani Review (Probability & Linear Algebra)

Variance

Variance: a measure of how far values of a random

variable are spread out around its expected value

𝑉𝑎𝑟[𝑥] = 𝐸[(𝑥 − 𝐸[𝑥])²] = 𝐸[𝑥²] − (𝐸[𝑥])²

Standard deviation: square root of variance:

σ_𝑥 = √𝑉𝑎𝑟[𝑥]
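The identity 𝑉𝑎𝑟[𝑥] = 𝐸[𝑥²] − (𝐸[𝑥])² can be verified on the fair-die example:

```python
from fractions import Fraction

xs = range(1, 7)
p = Fraction(1, 6)
E = sum(x * p for x in xs)                       # 7/2
E2 = sum(x * x * p for x in xs)                  # 91/6
var_direct = sum((x - E) ** 2 * p for x in xs)   # E[(x - E[x])^2]
var_identity = E2 - E ** 2                       # E[x^2] - (E[x])^2
```

Both forms give 35/12, as the identity predicts.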

Page 29:

Moments

The nth-order moment of a random variable 𝑋:

𝐸[𝑥ⁿ]

The nth-order central (mean-removed) moment:

𝐸[(𝑥 − 𝐸[𝑥])ⁿ]

The first order moment is the mean value.

The second order moment is the variance plus the square of the mean.

Page 30:

Correlation & Covariance

Correlation

𝐶𝑜𝑟𝑟(𝑥, 𝑦) = 𝐸_{𝑥,𝑦}[𝑥𝑦]

Covariance is the correlation of mean removed variables:

𝐶𝑜𝑣(𝑥, 𝑦) = 𝐸_{𝑥,𝑦}[(𝑥 − 𝐸[𝑥])(𝑦 − 𝐸[𝑦])]


For discrete RVs: 𝐸_{𝑥,𝑦}[𝑥𝑦] = Σ_𝑥 Σ_𝑦 𝑥𝑦 𝑝(𝑥, 𝑦)
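A small sketch of these discrete formulas on an assumed joint PMF (the table is made up for illustration):

```python
# Hypothetical joint PMF over x, y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

corr = sum(x * y * p for (x, y), p in joint.items())  # E[xy]
Ex = sum(x * p for (x, y), p in joint.items())
Ey = sum(y * p for (x, y), p in joint.items())
cov = corr - Ex * Ey   # equals E[(x - E[x])(y - E[y])]
```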

Page 31:

Covariance: Example


[Scatter plots of (𝑥, 𝑦): one with 𝐶𝑜𝑣(𝑥, 𝑦) = 0 and one with 𝐶𝑜𝑣(𝑥, 𝑦) = 0.9]

Page 32:

Covariance Properties


The covariance value shows the tendency of the pair of

RVs to increase together

𝐶𝑜𝑣𝑥𝑦 > 0 ∶ 𝑥 and 𝑦 tend to increase together

𝐶𝑜𝑣𝑥𝑦 < 0 : 𝑥 tends to decrease when 𝑦 increases

𝐶𝑜𝑣𝑥𝑦 = 0 : no linear correlation between 𝑥 and 𝑦

Page 33:

Pearson’s Product Moment Correlation


𝜌_𝑋𝑌 = 𝐶𝑜𝑣(𝑋, 𝑌)/(𝜎_𝑋𝜎_𝑌) = 𝐸[(𝑋 − 𝜇_𝑋)(𝑌 − 𝜇_𝑌)]/(𝜎_𝑋𝜎_𝑌)

Defined only if both 𝜎𝑋 and 𝜎𝑌 are finite and nonzero.

𝜌𝑋𝑌 shows the degree of linear dependence between 𝑋 and 𝑌.

−1 ≤ 𝜌𝑋𝑌≤ 1

𝐸[(𝑋 − 𝜇_𝑋)(𝑌 − 𝜇_𝑌)]² ≤ 𝐸[(𝑋 − 𝜇_𝑋)²] 𝐸[(𝑌 − 𝜇_𝑌)²] according to the Cauchy-Schwarz inequality (|𝐶_𝑋𝑌| ≤ 𝜎_𝑋𝜎_𝑌)

𝜌𝑋𝑌 = 1 shows a perfect positive linear relationship and 𝜌𝑋𝑌 = −1

shows a perfect negative linear relationship
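Continuing the same style of assumed joint PMF example, 𝜌 can be computed directly:

```python
import math

# Hypothetical joint PMF over X, Y in {0, 1}, with a positive linear tendency.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

Ex = sum(x * p for (x, y), p in joint.items())
Ey = sum(y * p for (x, y), p in joint.items())
cov = sum((x - Ex) * (y - Ey) * p for (x, y), p in joint.items())
sx = math.sqrt(sum((x - Ex) ** 2 * p for (x, y), p in joint.items()))
sy = math.sqrt(sum((y - Ey) ** 2 * p for (x, y), p in joint.items()))
rho = cov / (sx * sy)   # ≈ 0.6, always between -1 and 1
```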

Page 34:

Pearson’s Correlation: Examples


[Wikipedia]

𝜌_𝑋𝑌 = 𝐶𝑜𝑣(𝑋, 𝑌)/(𝜎_𝑋𝜎_𝑌)

Page 35:

Orthogonal, Uncorrelated & Independent RVs

Orthogonal random variables (𝐸[𝑥𝑦] = 0): 𝐶𝑜𝑟𝑟(𝑥, 𝑦) = 0

Uncorrelated random variables (𝐸[(𝑥 − 𝐸[𝑥])(𝑦 − 𝐸[𝑦])] = 0): 𝐶𝑜𝑣(𝑥, 𝑦) = 0

Independent random variables ⇒ 𝐶𝑜𝑣(𝑥, 𝑦) = 0

𝐶𝑜𝑣(𝑥, 𝑦) = 0 ⇏ independent random variables

Page 36:

Covariance Matrix

If 𝒙 is a vector of random variables (𝑑 -dim random

vector):

Covariance matrix indicates the tendency of each pair of RVs

to vary together


𝜮 = 𝐸[(𝒙 − 𝝁_𝒙)(𝒙 − 𝝁_𝒙)ᵀ] =
[𝐸((𝑥₁ − 𝜇₁)(𝑥₁ − 𝜇₁)) ⋯ 𝐸((𝑥₁ − 𝜇₁)(𝑥_𝑑 − 𝜇_𝑑))]
[⋮ ⋱ ⋮]
[𝐸((𝑥_𝑑 − 𝜇_𝑑)(𝑥₁ − 𝜇₁)) ⋯ 𝐸((𝑥_𝑑 − 𝜇_𝑑)(𝑥_𝑑 − 𝜇_𝑑))]

𝝁_𝒙 = [𝜇₁, …, 𝜇_𝑑]ᵀ = [𝐸(𝑥₁), …, 𝐸(𝑥_𝑑)]ᵀ
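A sketch of 𝜮 computed entrywise from an assumed 2-D joint PMF (made-up numbers):

```python
# Hypothetical joint PMF over a 2-D random vector with components in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

mu = [sum(z[i] * p for z, p in joint.items()) for i in range(2)]
Sigma = [[sum((z[i] - mu[i]) * (z[j] - mu[j]) * p for z, p in joint.items())
          for j in range(2)] for i in range(2)]
# Sigma is symmetric: variances on the diagonal, Cov(x1, x2) off the diagonal.
```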

Page 37:

Covariance Matrix: Two Variables


𝜮 = 𝐶 = [𝜎₁² 𝜎₁₂; 𝜎₂₁ 𝜎₂²], where 𝜎₂₁ = 𝜎₁₂ = 𝐶𝑜𝑣(𝑋, 𝑌)

[Scatter plots of (𝑋, 𝑌) for 𝜮 = [1 0; 0 1] and 𝜮 = [1 0.9; 0.9 1]]

Page 38:

Covariance Matrix


𝜎𝑖𝑗 shows the covariance of 𝑋𝑖 and 𝑋𝑗:

𝜎𝑖𝑗 = 𝜎𝑗𝑖 = 𝐶𝑜𝑣(𝑋𝑖 , 𝑋𝑗)

𝜮 = [𝜎₁² 𝜎₁₂ ⋯ 𝜎₁𝑑; 𝜎₂₁ 𝜎₂² ⋯ 𝜎₂𝑑; ⋮ ⋮ ⋱ ⋮; 𝜎𝑑₁ 𝜎𝑑₂ ⋯ 𝜎𝑑²]

Page 39:

Sums of Random Variables

𝑍 = 𝑋 + 𝑌

Mean: 𝐸[𝑧] = 𝐸[𝑥] + 𝐸[𝑦]

Variance: 𝑉𝑎𝑟(𝑍) = 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) + 2𝐶𝑜𝑣(𝑋, 𝑌)

If 𝑋, 𝑌 independent: 𝑉𝑎𝑟(𝑍) = 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌)

Distribution:

𝑝_𝑍(𝑧) = ∫ 𝑝_{𝑋,𝑌}(𝑥, 𝑧 − 𝑥) 𝑑𝑥

If 𝑋 and 𝑌 are independent: 𝑝_𝑍(𝑧) = ∫ 𝑝_𝑋(𝑥) 𝑝_𝑌(𝑧 − 𝑥) 𝑑𝑥 (the convolution of 𝑝_𝑋 and 𝑝_𝑌)
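The variance rule can be checked on two independent fair 0/1 coin flips (a minimal sketch):

```python
from fractions import Fraction

# Joint distribution of two independent fair 0/1 variables.
q = Fraction(1, 4)
joint = {(x, y): q for x in (0, 1) for y in (0, 1)}

def variance(dist):
    """Variance of a finite distribution given as {value: probability}."""
    m = sum(v * p for v, p in dist.items())
    return sum((v - m) ** 2 * p for v, p in dist.items())

# Distribution of Z = X + Y.
zdist = {}
for (x, y), p in joint.items():
    zdist[x + y] = zdist.get(x + y, 0) + p

var_x = variance({0: Fraction(1, 2), 1: Fraction(1, 2)})  # 1/4
var_z = variance(zdist)                                   # 1/2 = Var(X) + Var(Y)
```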

Page 40:

Some Famous Probability Density Functions

Uniform

Gaussian (Normal)

Uniform: 𝑥~𝑈(𝑎, 𝑏), 𝑝(𝑥) = 1/(𝑏 − 𝑎) for 𝑎 ≤ 𝑥 ≤ 𝑏 (0 otherwise)

Gaussian: 𝑥~𝑁(𝜇, 𝜎²), 𝑝(𝑥) = (1/(√(2𝜋)𝜎)) 𝑒^{−(𝑥−𝜇)²/(2𝜎²)}

[Figures: the uniform density of height 1/(𝑏 − 𝑎) on [𝑎, 𝑏], and the Gaussian density peaking at height 1/(√(2𝜋)𝜎) at 𝑥 = 𝜇]

Page 41:

Gaussian (Normal) Distribution


68% of the probability mass lies within [𝜇 − 𝜎, 𝜇 + 𝜎], and 95% within [𝜇 − 2𝜎, 𝜇 + 2𝜎]

It is widely used to model the distribution of continuous variables

Standard Normal distribution: 𝜇 = 0, 𝜎 = 1

Page 42:

Some Famous Probability Density Functions


Exponential

𝑝(𝑥) = 𝜆𝑒^{−𝜆𝑥} 𝑈(𝑥), i.e., 𝑝(𝑥) = 𝜆𝑒^{−𝜆𝑥} for 𝑥 ≥ 0 and 0 for 𝑥 < 0

[Figure: the exponential density, starting at height 𝜆 at 𝑥 = 0 and decaying]

Page 43:

Some Famous Probability Mass Functions

Bernoulli: 𝑥 ∈ {0, 1}

𝑝(𝑥) = 𝜇^𝑥 (1 − 𝜇)^{1−𝑥}

Binomial

𝑃(𝑋 = 𝑘) = (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘}, 𝑥~𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝)

[Figures: the Bernoulli PMF on {0, 1} with 𝑃(𝑥 = 1) = 𝜇, and the binomial PMF 𝑃(𝑥 = 𝑘) peaking near 𝑘 = 𝑛𝑝]

Page 44:

Multivariate Gaussian Distribution

𝒙 is a vector of 𝑑 Gaussian variables

𝑝(𝒙)~𝑁(𝝁, 𝜮) = (1/((2𝜋)^{𝑑/2} |𝜮|^{1/2})) 𝑒^{−(1/2)(𝒙−𝝁)ᵀ𝜮⁻¹(𝒙−𝝁)}

𝝁 = [𝜇₁, …, 𝜇_𝑑]ᵀ = [𝐸(𝑥₁), …, 𝐸(𝑥_𝑑)]ᵀ

𝜮 = 𝐸[(𝒙 − 𝝁)(𝒙 − 𝝁)ᵀ]

Page 45:

Multivariate Gaussian Distribution


The covariance matrix is always symmetric and positive semi-

definite

Multivariate Gaussian is completely specified by 𝑑 + 𝑑(𝑑 + 1)/2 parameters

Special cases of 𝜮:

𝜮 = 𝜎²𝑰: independent random variables with the same variance (circularly symmetric Gaussian)

Diagonal 𝜮 = diag(𝜎₁², …, 𝜎_𝑑²): independent random variables with different variances

Page 46:

Multivariate Gaussian Distribution

Level Surfaces


The Gaussian density is constant on surfaces (hyper-ellipsoids) in 𝒙-space for which (𝒙 − 𝝁)ᵀ𝜮⁻¹(𝒙 − 𝝁) = 𝐶

Principal axes of the hyper-ellipsoids are the eigenvectors of 𝜮.

Bivariate Gaussian: curves of constant density are ellipses.

Page 47:

Bivariate Gaussian distribution

𝜆₁ and 𝜆₂ are the eigenvalues of 𝜮 (𝜆₁ ≥ 𝜆₂), and 𝒗₁ and 𝒗₂ are the corresponding eigenvectors.

[Figure: constant-density ellipse with axes along 𝒗₁ and 𝒗₂ and axis-length ratio 𝑙₁/𝑙₂ = √(𝜆₁/𝜆₂)]

Page 48:

Gaussian Distribution Properties

Some attractive properties of the Gaussian distribution:

Marginal and conditional distributions of a Gaussian are also Gaussian

After a linear transformation, a Gaussian distribution is again Gaussian

There exists a linear transformation that diagonalizes the covariance matrix

(whitening transform).

It converts the multivariate normal distribution into a spherical one.

Gaussian maximizes the entropy among distributions with a given mean and variance

Gaussian is stable and infinitely divisible

Central Limit Theorem

Some distributions can be approximated by a Gaussian when their parameter value is sufficiently large (e.g., Binomial)

Page 49:

Central Limit Theorem

(under mild conditions)

Suppose i.i.d. (independent, identically distributed) RVs 𝑋_𝑖 (𝑖 = 1, …, 𝑁) with finite variances

Let 𝑆_𝑁 = Σ_{𝑖=1}^{𝑁} 𝑋_𝑖 be the sum of these RVs

The distribution of 𝑆_𝑁 converges to a normal distribution as 𝑁 increases, regardless of the distribution of the 𝑋_𝑖.

Example:

𝑋_𝑖 ~ uniform, 𝑖 = 1, …, 𝑁

𝑆_𝑁 = (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝑋_𝑖

[Figure: histograms of 𝑆₁, 𝑆₂, 𝑆₁₀, approaching a Gaussian shape]
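A simulation sketch of this effect (the sample size and the 12-uniform construction are illustrative choices, not from the slide):

```python
import random
import statistics

random.seed(0)
# Sum of 12 U(0,1) draws minus 6: mean 0 and variance 1 (each uniform has var 1/12),
# and by the CLT the result is already close to a standard normal.
samples = [sum(random.random() for _ in range(12)) - 6 for _ in range(2000)]

mean = statistics.mean(samples)        # close to 0
var = statistics.variance(samples)     # close to 1
```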

Page 50:

Linear Algebra: Basic Definitions

Matrix 𝑨: 𝑨 = [𝑎_𝑖𝑗]_{𝑚×𝑛} =
[𝑎₁₁ 𝑎₁₂ … 𝑎₁𝑛]
[𝑎₂₁ 𝑎₂₂ … 𝑎₂𝑛]
[… … … …]
[𝑎𝑚₁ 𝑎𝑚₂ … 𝑎𝑚𝑛]

Matrix transpose: 𝑩 = 𝑨ᵀ where 𝑏_𝑖𝑗 = 𝑎_𝑗𝑖 (1 ≤ 𝑖 ≤ 𝑛, 1 ≤ 𝑗 ≤ 𝑚)

Symmetric matrix: 𝑨 = 𝑨ᵀ

Vector 𝒂: 𝒂 = [𝑎₁, …, 𝑎𝑛]ᵀ

Page 51:

Linear Mapping

Linear function

𝑓(𝒙 + 𝒚) = 𝑓(𝒙) + 𝑓(𝒚) ∀𝒙, 𝒚 ∈ 𝑉

𝑓(𝑎𝒙) = 𝑎𝑓(𝒙) ∀𝒙 ∈ 𝑉, 𝑎 ∈ 𝐹

A linear function: 𝑓 𝒙 = 𝑤1𝑥1 +⋯+𝑤𝑑𝑥𝑑 = 𝒘𝑇 𝒙

In general, a matrix 𝐖_{𝑚×𝑑} = [𝒘₁ ⋯ 𝒘𝑚]ᵀ can be used to denote a map 𝑓: ℝ^𝑑 → ℝ^𝑚 where 𝑓_𝑖(𝒙) = 𝑤_{𝑖1}𝑥₁ + ⋯ + 𝑤_{𝑖𝑑}𝑥_𝑑 = 𝒘_𝑖ᵀ𝒙

Page 52:

Linear Algebra: Basic Definitions

Inner (dot) product: 𝒂ᵀ𝒃 = Σ_{𝑖=1}^{𝑛} 𝑎_𝑖𝑏_𝑖

Matrix multiplication: 𝑨𝑩 = 𝑪, where 𝑨 = [𝑎_𝑖𝑗]_{𝑚×𝑝}, 𝑩 = [𝑏_𝑖𝑗]_{𝑝×𝑛}, and 𝑪 = [𝑐_𝑖𝑗]_{𝑚×𝑛} with 𝑐_𝑖𝑗 = (i-th row of 𝑨) · (j-th column of 𝑩)

Page 53:

Inner Product

Inner (dot) product: 𝒂ᵀ𝒃 = Σ_{𝑖=1}^{𝑑} 𝑎_𝑖𝑏_𝑖

Length (Euclidean norm) of a vector: ‖𝒂‖₂ = √(𝒂ᵀ𝒂) = √(Σ_{𝑖=1}^{𝑑} 𝑎_𝑖²)

𝒂 is normalized iff ‖𝒂‖₂ = 1

Angle between vectors: cos 𝜃 = 𝒂ᵀ𝒃 / (‖𝒂‖₂‖𝒃‖₂)

Orthogonal vectors 𝒂 and 𝒃: 𝒂ᵀ𝒃 = 0

Orthonormal set of vectors 𝒂₁, 𝒂₂, …, 𝒂𝑛: ∀𝑖, 𝑗: 𝒂_𝑖ᵀ𝒂_𝑗 = 1 if 𝑖 = 𝑗, and 0 otherwise
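These definitions in code, for two small example vectors:

```python
import math

a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

dot = sum(x * y for x, y in zip(a, b))       # inner product aᵀb = 4
norm_a = math.sqrt(sum(x * x for x in a))    # Euclidean norm ‖a‖₂ = 3
norm_b = math.sqrt(sum(x * x for x in b))    # ‖b‖₂ = √5
cos_theta = dot / (norm_a * norm_b)          # cosine of the angle between a and b
```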

Page 54:

Linear Independence

A set of vectors is linearly independent if no vector is

a linear combination of other vectors.

𝑐₁𝒗₁ + 𝑐₂𝒗₂ + … + 𝑐𝑘𝒗𝑘 = 0 ⇒ 𝑐₁ = 𝑐₂ = … = 𝑐𝑘 = 0

Page 55:

Matrix Determinant and Trace

Determinant:

det(𝑨𝑩) = det(𝑨) × det(𝑩)

det(𝑨) = Σ_{𝑗=1}^{𝑛} 𝑎_𝑖𝑗 𝑨_𝑖𝑗 for any 𝑖 = 1, …, 𝑛, where the cofactor is 𝑨_𝑖𝑗 = (−1)^{𝑖+𝑗} det(𝑴_𝑖𝑗)

Trace:

tr[𝑨] = Σ_{𝑗=1}^{𝑛} 𝑎_𝑗𝑗

Page 56:

Matrix Inversion

Inverse of 𝐴_{𝑛×𝑛}: 𝐴𝐵 = 𝐵𝐴 = 𝐼_𝑛 ⇒ 𝐵 = 𝐴⁻¹

𝐴⁻¹ exists iff det(𝐴) ≠ 0 (𝐴 is nonsingular)

Singular: det(𝐴) = 0

Ill-conditioned: 𝐴 is nonsingular but close to being singular

Pseudo-inverse for a non-square matrix: 𝐴# = (𝐴ᵀ𝐴)⁻¹𝐴ᵀ (assuming 𝐴ᵀ𝐴 is nonsingular); 𝐴#𝐴 = 𝐼

Page 57:

Matrix Rank

𝑟𝑎𝑛𝑘(𝑨) : maximum number of linearly independent

columns or rows of A.

𝑨𝑚×𝑛: 𝑟𝑎𝑛𝑘(𝑨) ≤ min(𝑚, 𝑛)

Full rank 𝑨_{𝑛×𝑛}: 𝑟𝑎𝑛𝑘(𝑨) = 𝑛 iff 𝑨 is nonsingular (det(𝑨) ≠ 0)

Page 58:

Eigenvectors and Eigenvalues

𝐴𝒗 = 𝜆𝒗

Characteristic equation: det(𝐴 − 𝜆𝐼_𝑛) = 0, an n-th order polynomial with n roots

tr(𝐴) = Σ_{𝑗=1}^{𝑛} 𝜆_𝑗

det(𝐴) = ∏_{𝑗=1}^{𝑛} 𝜆_𝑗

Page 59:

Eigenvector: Example

𝑨 = [2 1; 1 2]

[Wikipedia]

Page 60:

Eigenvectors and Eigenvalues:

Symmetric Matrix

For a symmetric matrix, the eigenvectors corresponding

to distinct eigenvalues are orthogonal

These eigenvectors can be used to form an orthonormal set (∀𝑖 ≠ 𝑗: 𝒗_𝑖ᵀ𝒗_𝑗 = 0, and ‖𝒗_𝑖‖ = 1)

Page 61:

Eigen Decomposition: Symmetric Matrix

𝑽 = [𝒗₁ … 𝒗𝑁], 𝜦 = diag(𝜆₁, …, 𝜆𝑁)

𝑨𝑽 = 𝑽𝜦 ⇒ 𝑨𝑽𝑽ᵀ = 𝑽𝜦𝑽ᵀ, and since 𝑽𝑽ᵀ = 𝑰:

𝑨 = 𝑽𝜦𝑽ᵀ

Eigen decomposition of a symmetric matrix: 𝑨 = 𝑽𝜦𝑽ᵀ
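For the symmetric example 𝑨 = [2 1; 1 2] from the earlier slide (eigenvalues 3 and 1, eigenvectors (1, 1)/√2 and (1, −1)/√2), the decomposition can be verified by hand:

```python
import math

s = 1 / math.sqrt(2)
V = [[s, s], [s, -s]]              # orthonormal eigenvectors as columns
Lam = [[3.0, 0.0], [0.0, 1.0]]     # eigenvalues on the diagonal

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Vt = [[V[j][i] for j in range(2)] for i in range(2)]   # V transpose
A = matmul(matmul(V, Lam), Vt)     # V Λ Vᵀ reconstructs [[2, 1], [1, 2]]
```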

Page 62:

Positive Definite Matrix

Symmetric 𝐴_{𝑛×𝑛} is positive definite iff:

∀𝒙 ∈ ℝⁿ, 𝒙 ≠ 𝟎: 𝒙ᵀ𝐴𝒙 > 0

Eigenvalues of a positive definite matrix are positive:

∀𝑖, 𝜆_𝑖 > 0
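For the same symmetric example 𝐴 = [2 1; 1 2] (eigenvalues 3 and 1, both positive), the quadratic form stays positive on a few nonzero test vectors:

```python
def quad_form(x1, x2):
    # xᵀAx for A = [[2, 1], [1, 2]] expands to 2*x1^2 + 2*x1*x2 + 2*x2^2.
    return 2 * x1 ** 2 + 2 * x1 * x2 + 2 * x2 ** 2

checks = [quad_form(1, 0), quad_form(-1, 1), quad_form(0.3, -0.7)]
all_positive = all(q > 0 for q in checks)
```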

Page 63:

Vector Derivatives

𝜕(𝒙ᵀ𝑨𝒙)/𝜕𝒙 = (𝑨 + 𝑨ᵀ)𝒙

𝜕(𝒃ᵀ𝒙)/𝜕𝒙 = 𝒃
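The first identity can be spot-checked with finite differences (a sketch on an assumed small 𝑨 and 𝒙):

```python
# Finite-difference check of d(xᵀAx)/dx = (A + Aᵀ)x on a 2x2 example.
A = [[2.0, 1.0], [0.0, 3.0]]
x = [1.0, 2.0]

def f(v):
    # The quadratic form vᵀAv.
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

h = 1e-6
grad_num = []
for i in range(2):
    xp, xm = x[:], x[:]
    xp[i] += h
    xm[i] -= h
    grad_num.append((f(xp) - f(xm)) / (2 * h))   # central difference

grad_ana = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]
# grad_ana = [6.0, 13.0]; grad_num agrees to about 1e-6.
```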

You can see more on the vector derivatives in the

uploaded review materials
