TRANSCRIPT
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU
L2: Review of probability and statistics
• Probability
  – Definition of probability
  – Axioms and properties
  – Conditional probability
  – Bayes theorem
• Random variables
  – Definition of a random variable
  – Cumulative distribution function
  – Probability density function
  – Statistical characterization of random variables
• Random vectors
  – Mean vector
  – Covariance matrix
• The Gaussian random variable
Review of probability theory
• Definitions (informal)
  – Probabilities are numbers assigned to events that indicate "how likely" it is that the event will occur when a random experiment is performed
  – A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
  – The sample space $S$ of a random experiment is the set of all possible outcomes
• Axioms of probability
  – Axiom I: $P[A_i] \ge 0$
  – Axiom II: $P[S] = 1$
  – Axiom III: $A_i \cap A_j = \emptyset \Rightarrow P[A_i \cup A_j] = P[A_i] + P[A_j]$
[Figure: a probability law maps each event $A_1 \dots A_4$ in the sample space to its probability]
• Warm-up exercise
  – I show you three colored cards
    • One BLUE on both sides
    • One RED on both sides
    • One BLUE on one side, RED on the other
  – I shuffle the three cards, then pick one and show you one side only. The side visible to you is RED
    • Obviously, the card has to be either A or C, right?
  – I am willing to bet $1 that the other side of the card has the same color, and need someone in class to bet another $1 that it is the other color
    • On average we will end up even, right?
    • Let's try it!

[Figure: the three cards, labeled A, B, C]
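The bet can be checked with a quick Monte Carlo sketch (an illustration, not from the slides): condition on the visible side being RED and count how often the hidden side matches.

```python
import random

def card_experiment(trials=100_000, seed=0):
    """Pick a card and show a random side; among trials where the visible
    side is RED, estimate how often the hidden side is the same color."""
    rng = random.Random(seed)
    cards = [("R", "R"), ("B", "B"), ("R", "B")]  # cards A, B, C
    same, red_shown = 0, 0
    for _ in range(trials):
        card = rng.choice(cards)
        side = rng.randrange(2)            # which side is visible
        if card[side] == "R":              # condition on seeing RED
            red_shown += 1
            same += card[1 - side] == "R"  # hidden side matches
    return same / red_shown

print(card_experiment())
```

The estimate comes out near 2/3, not 1/2: of the three RED faces you could be looking at, two belong to the all-RED card, so the bet favors "same color".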
• More properties of probability
  – $P[A^c] = 1 - P[A]$
  – $P[A] \le 1$
  – $P[\emptyset] = 0$
  – Given $A_1 \dots A_N$ with $A_i \cap A_j = \emptyset \;\forall ij$, then $P\left[\bigcup_{k=1}^{N} A_k\right] = \sum_{k=1}^{N} P[A_k]$
  – $P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1 \cap A_2]$
  – $P\left[\bigcup_{k=1}^{N} A_k\right] = \sum_{k=1}^{N} P[A_k] - \sum_{j<k}^{N} P[A_j \cap A_k] + \dots + (-1)^{N+1} P[A_1 \cap A_2 \dots \cap A_N]$
  – $A_1 \subset A_2 \Rightarrow P[A_1] \le P[A_2]$
• Conditional probability
  – If A and B are two events, the probability of event A when we already know that event B has occurred is
    $P[A|B] = \frac{P[A \cap B]}{P[B]} \quad \text{if } P[B] > 0$
  • This conditional probability P[A|B] is read:
    – the "conditional probability of A conditioned on B", or simply
    – the "probability of A given B"
  – Interpretation
    • The new evidence "B has occurred" has the following effects
    • The original sample space S (the square) becomes B (the rightmost circle)
    • The event A becomes A∩B
    • P[B] simply re-normalizes the probability of events that occur jointly with B
[Figure: Venn diagrams before and after conditioning; once B has occurred, S shrinks to B and A becomes A∩B]
• Theorem of total probability
  – Let $B_1, B_2 \dots B_N$ be a partition of $S$ (mutually exclusive events whose union is $S$)
  – Any event $A$ can be represented as
    $A = A \cap S = A \cap (B_1 \cup B_2 \dots B_N) = (A \cap B_1) \cup (A \cap B_2) \dots (A \cap B_N)$
  – Since $B_1, B_2 \dots B_N$ are mutually exclusive, then
    $P[A] = P[A \cap B_1] + P[A \cap B_2] + \dots + P[A \cap B_N]$
  – and, therefore
    $P[A] = P[A|B_1]P[B_1] + \dots + P[A|B_N]P[B_N] = \sum_{k=1}^{N} P[A|B_k]P[B_k]$
[Figure: Venn diagram of an event A overlapping a partition $B_1, B_2 \dots B_N$ of the sample space]
• Bayes theorem
  – Assume $B_1, B_2 \dots B_N$ is a partition of $S$
  – Suppose that event $A$ occurs
  – What is the probability of event $B_j$?
  – Using the definition of conditional probability and the theorem of total probability we obtain
    $P[B_j|A] = \frac{P[A \cap B_j]}{P[A]} = \frac{P[A|B_j]\,P[B_j]}{\sum_{k=1}^{N} P[A|B_k]\,P[B_k]}$
  – This is known as Bayes theorem or Bayes rule, and is (one of) the most useful relations in probability and statistics
• Bayes theorem and statistical pattern recognition
  – When used for pattern classification, Bayes theorem is generally expressed as
    $P[\omega_j|x] = \frac{P[x|\omega_j]\,P[\omega_j]}{\sum_{k=1}^{N} P[x|\omega_k]\,P[\omega_k]} = \frac{P[x|\omega_j]\,P[\omega_j]}{P[x]}$
    • where $\omega_j$ is the $j$-th class (e.g., phoneme) and $x$ is the feature/observation vector (e.g., vector of MFCCs)
  – A typical decision rule is to choose the class $\omega_j$ with the highest $P[\omega_j|x]$
    • Intuitively, we choose the class that is most "likely" given observation $x$
  – Each term in Bayes theorem has a special name
    • $P[\omega_j]$: prior probability (of class $\omega_j$)
    • $P[\omega_j|x]$: posterior probability (of class $\omega_j$ given the observation $x$)
    • $P[x|\omega_j]$: likelihood (probability of observation $x$ given class $\omega_j$)
    • $P[x]$: normalization constant (does not affect the decision)
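The decision rule above can be sketched in a few lines of NumPy (the priors and likelihood values below are made-up numbers for illustration):

```python
import numpy as np

priors = np.array([0.5, 0.3, 0.2])          # P[w_j] for three hypothetical classes
likelihoods = np.array([0.05, 0.20, 0.10])  # P[x|w_j] evaluated at the observed x

evidence = np.sum(likelihoods * priors)       # P[x], the normalization constant
posteriors = likelihoods * priors / evidence  # P[w_j|x] via Bayes theorem

decision = int(np.argmax(posteriors))  # choose the class with the highest posterior
print(posteriors, decision)
```

Note that the decision is unchanged if we skip the division by $P[x]$, which is why the evidence is called a normalization constant.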
• Example
  – Consider a clinical problem where we need to decide if a patient has a particular medical condition on the basis of an imperfect test
    • Someone with the condition may go undetected (false negative)
    • Someone free of the condition may yield a positive result (false positive)
  – Nomenclature
    • The true-negative rate P[NEG|¬COND] of a test is called its SPECIFICITY
    • The true-positive rate P[POS|COND] of a test is called its SENSITIVITY
  – Problem
    • Assume a population of 10,000 with a 1% prevalence for the condition
    • Assume that we design a test with 98% specificity and 90% sensitivity
    • Assume you take the test, and the result comes out POSITIVE
    • What is the probability that you have the condition?
  – Solution
    • Fill in the joint frequency table below, or
    • Apply Bayes rule
                      TEST IS POSITIVE                 TEST IS NEGATIVE                 ROW TOTAL
  HAS CONDITION       True positive P[POS|COND]        False negative P[NEG|COND]       100
                      100×0.90 = 90                    100×(1−0.90) = 10
  FREE OF CONDITION   False positive P[POS|¬COND]      True negative P[NEG|¬COND]       9,900
                      9,900×(1−0.98) = 198             9,900×0.98 = 9,702
  COLUMN TOTAL        288                              9,712                            10,000
– Applying Bayes rule
  $P[\text{cond}|+] = \frac{P[+|\text{cond}]\,P[\text{cond}]}{P[+]} = \frac{P[+|\text{cond}]\,P[\text{cond}]}{P[+|\text{cond}]\,P[\text{cond}] + P[+|\neg\text{cond}]\,P[\neg\text{cond}]} = \frac{0.90 \times 0.01}{0.90 \times 0.01 + (1-0.98) \times 0.99} = 0.3125$
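The same computation in code, following both routes from the slides, the joint frequency table and Bayes rule (a sketch of the slide's arithmetic):

```python
sensitivity = 0.90   # P[+|cond]
specificity = 0.98   # P[-|no cond]
prevalence = 0.01    # P[cond]
population = 10_000

# Route 1: joint frequency table
tp = population * prevalence * sensitivity              # true positives: 90
fp = population * (1 - prevalence) * (1 - specificity)  # false positives: 198
posterior_table = tp / (tp + fp)

# Route 2: Bayes rule
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior_bayes = sensitivity * prevalence / p_pos

print(posterior_table, posterior_bayes)  # both ≈ 0.3125
```

Despite the test being positive, the probability of actually having the condition is only about 31%, because the condition is rare and false positives outnumber true positives.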
• Random variables
  – When we perform a random experiment we are usually interested in some measurement or numerical attribute of the outcome
    • e.g., weights in a population of subjects, execution times when benchmarking CPUs, shape parameters when performing ATR
  – These examples lead to the concept of a random variable
    • A random variable $X$ is a function that assigns a real number $X(\xi)$ to each outcome $\xi$ in the sample space of a random experiment
    • $X(\xi)$ maps from all possible outcomes in sample space onto the real line
  – The function that assigns values to each outcome is fixed and deterministic, i.e., as in the rule "count the number of heads in three coin tosses"
    • Randomness in $X$ is due to the underlying randomness of the outcome $\xi$ of the experiment
  – Random variables can be
    • Discrete, e.g., the resulting number after rolling a die
    • Continuous, e.g., the weight of a sampled individual
[Figure: the random variable $X$ maps each outcome $\xi$ in sample space $S$ onto a point $x = X(\xi)$ on the real line]
• Cumulative distribution function (cdf)
  – The cumulative distribution function $F_X(x)$ of a random variable $X$ is defined as the probability of the event $\{X \le x\}$
    $F_X(x) = P[X \le x] \quad -\infty < x < \infty$
  – Intuitively, $F_X(b)$ is the long-term proportion of times when $X(\xi) \le b$
  – Properties of the cdf
    • $0 \le F_X(x) \le 1$
    • $\lim_{x \to \infty} F_X(x) = 1$
    • $\lim_{x \to -\infty} F_X(x) = 0$
    • $F_X(a) \le F_X(b)$ if $a \le b$
    • $F_X(b) = \lim_{h \to 0^+} F_X(b+h) = F_X(b^+)$ (the cdf is continuous from the right)
[Figures: cdf $P[X \le x]$ for rolling a fair die (staircase rising in steps of 1/6 at x = 1…6) and cdf for a person's weight in lb (smooth curve rising to 1)]
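The cdf for rolling a fair die is a simple staircase; a sketch, to make the listed properties concrete:

```python
import numpy as np

def die_cdf(x):
    """F_X(x) = P[X <= x] for a fair six-sided die: a staircase that
    jumps by 1/6 at each of x = 1..6."""
    return float(np.clip(np.floor(x), 0, 6)) / 6.0

# The cdf properties listed above, checked at a few points:
assert die_cdf(-5) == 0.0                    # limit as x -> -inf
assert die_cdf(100) == 1.0                   # limit as x -> +inf
assert die_cdf(2.2) <= die_cdf(4.9)          # non-decreasing
assert die_cdf(3.0) == die_cdf(3.4) == 0.5   # flat between jumps, right-continuous
print(die_cdf(3.5))
```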
• Probability density function (pdf)
  – The probability density function $f_X(x)$ of a continuous random variable $X$, if it exists, is defined as the derivative of $F_X(x)$
    $f_X(x) = \frac{dF_X(x)}{dx}$
  – For discrete random variables, the equivalent to the pdf is the probability mass function
    $f_X(x) = \frac{\Delta F_X(x)}{\Delta x}$
  – Properties
    • $f_X(x) \ge 0$
    • $P[a < X < b] = \int_a^b f_X(x)\,dx$
    • $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$
    • $1 = \int_{-\infty}^{\infty} f_X(x)\,dx$
    • $f_X(x|A) = \frac{d}{dx} F_X(x|A)$, where $F_X(x|A) = \frac{P[\{X \le x\} \cap A]}{P[A]}$ if $P[A] > 0$
[Figures: pdf for a person's weight in lb, and pmf for rolling a fair die (height 1/6 at each face)]
• What is the probability of somebody weighing 200 lb?
  • According to the pdf, this is about 0.62
  • This number seems reasonable, right?
• Now, what is the probability of somebody weighing 124.876 lb?
  • According to the pdf, this is about 0.43
  • But, intuitively, we know that the probability should be zero (or very, very small)
• How do we explain this paradox?
  • The pdf DOES NOT define a probability, but a probability DENSITY!
  • To obtain the actual probability we must integrate the pdf over an interval
  • So we should have asked the question: what is the probability of somebody weighing 124.876 lb plus or minus 2 lb?
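To make the paradox concrete, here is a sketch using a hypothetical Normal(160 lb, 30 lb) weight model (illustrative numbers, not the distribution behind the slide's plots): the density at a point is not a probability; the integral over an interval is.

```python
import numpy as np

mu, sigma = 160.0, 30.0  # hypothetical weight distribution, in lb

def pdf(x):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Density at exactly 124.876 lb: a density value, NOT a probability
print(pdf(124.876))

# P[124.876 - 2 <= X <= 124.876 + 2]: integrate the pdf over the interval
a, b = 124.876 - 2, 124.876 + 2
xs = np.linspace(a, b, 4001)
prob = np.sum(pdf(xs)) * (xs[1] - xs[0])   # simple Riemann sum
print(prob)   # small but nonzero; P[X == 124.876] itself is exactly 0
```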
• The probability mass function is a "true" probability (the reason why we call it a "mass" as opposed to a "density")
• The pmf indicates that the probability of any number when rolling a fair die is the same for all numbers, and equal to 1/6, a very legitimate answer
• The pmf DOES NOT need to be integrated to obtain the probability (it cannot be integrated in the first place)
• Statistical characterization of random variables
  – The cdf or the pdf are SUFFICIENT to fully characterize a r.v.
  – However, a r.v. can be PARTIALLY characterized with other measures
  – Expectation (center of mass of a density)
    $E[X] = \mu = \int_{-\infty}^{\infty} x f_X(x)\,dx$
  – Variance (spread about the mean)
    $\mathrm{var}[X] = \sigma^2 = E[(X - E[X])^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx$
  – Standard deviation
    $\mathrm{std}[X] = \sigma = \mathrm{var}[X]^{1/2}$
  – N-th moment
    $E[X^N] = \int_{-\infty}^{\infty} x^N f_X(x)\,dx$
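These definitions can be checked numerically; a sketch for a Normal(μ = 2, σ = 3) density, approximating each integral with a fine Riemann sum:

```python
import numpy as np

mu, sigma = 2.0, 3.0
xs = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = xs[1] - xs[0]
f = np.exp(-(xs - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

mean = np.sum(xs * f) * dx              # E[X]   = integral of x f(x) dx
var = np.sum((xs - mean)**2 * f) * dx   # var[X] = integral of (x - mu)^2 f(x) dx
m2 = np.sum(xs**2 * f) * dx             # second moment E[X^2]

print(mean, var, m2)   # ~2, ~9, ~13  (note E[X^2] = sigma^2 + mu^2)
```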
• Random vectors
  – An extension of the concept of a random variable
    • A random vector $X$ is a function that assigns a vector of real numbers to each outcome $\xi$ in sample space $S$
    • We generally denote a random vector by a column vector
  – The notions of cdf and pdf are replaced by "joint cdf" and "joint pdf"
    • Given random vector $X = [x_1, x_2 \dots x_N]^T$ we define the joint cdf as
      $F_X(x) = P[\{X_1 \le x_1\} \cap \{X_2 \le x_2\} \dots \{X_N \le x_N\}]$
    • and the joint pdf as
      $f_X(x) = \frac{\partial^N F_X(x)}{\partial x_1 \partial x_2 \dots \partial x_N}$
  – The term marginal pdf is used to represent the pdf of a subset of all the random vector dimensions
    • A marginal pdf is obtained by integrating out the variables that are of no interest
    • e.g., for a 2D random vector $X = [x_1, x_2]^T$, the marginal pdf of $x_1$ is
      $f_{X_1}(x_1) = \int_{x_2=-\infty}^{x_2=+\infty} f_{X_1 X_2}(x_1, x_2)\,dx_2$
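A numerical sketch of marginalization, using a bivariate Gaussian with illustrative parameters: integrating the joint pdf over $x_2$ recovers the univariate density of $x_1$.

```python
import numpy as np

m1, m2, s1, s2, rho = 0.0, 0.0, 1.0, 2.0, 0.5  # illustrative parameters

def joint_pdf(x1, x2):
    """Bivariate Normal density with correlation rho."""
    q = ((x1 - m1)**2 / s1**2
         - 2 * rho * (x1 - m1) * (x2 - m2) / (s1 * s2)
         + (x2 - m2)**2 / s2**2)
    return np.exp(-q / (2 * (1 - rho**2))) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

# Marginal pdf of x1 at one point: integrate out x2 with a Riemann sum
x2_grid = np.linspace(-20, 20, 8001)
dx2 = x2_grid[1] - x2_grid[0]
marginal = np.sum(joint_pdf(0.7, x2_grid)) * dx2

# It should match the univariate Normal N(m1, s1^2) evaluated at the same point
univariate = np.exp(-(0.7 - m1)**2 / (2 * s1**2)) / (np.sqrt(2 * np.pi) * s1)
print(marginal, univariate)
```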
• Statistical characterization of random vectors
  – A random vector is also fully characterized by its joint cdf or joint pdf
  – Alternatively, we can (partially) describe a random vector with measures similar to those defined for scalar random variables
  – Mean vector
    $E[X] = \mu = [E[x_1], E[x_2] \dots E[x_N]]^T = [\mu_1, \mu_2 \dots \mu_N]^T$
  – Covariance matrix
    $\mathrm{cov}[X] = \Sigma = E[(X-\mu)(X-\mu)^T] = \begin{bmatrix} E[(x_1-\mu_1)^2] & \cdots & E[(x_1-\mu_1)(x_N-\mu_N)] \\ \vdots & \ddots & \vdots \\ E[(x_1-\mu_1)(x_N-\mu_N)] & \cdots & E[(x_N-\mu_N)^2] \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \cdots & c_{1N} \\ \vdots & \ddots & \vdots \\ c_{1N} & \cdots & \sigma_N^2 \end{bmatrix}$
– The covariance matrix indicates the tendency of each pair of features (dimensions in a random vector) to vary together, i.e., to co-vary
• The covariance has several important properties
  – If $x_i$ and $x_k$ tend to increase together, then $c_{ik} > 0$
  – If $x_i$ tends to decrease when $x_k$ increases, then $c_{ik} < 0$
  – If $x_i$ and $x_k$ are uncorrelated, then $c_{ik} = 0$
  – $|c_{ik}| \le \sigma_i \sigma_k$, where $\sigma_i$ is the standard deviation of $x_i$
  – $c_{ii} = \sigma_i^2 = \mathrm{var}[x_i]$
• The covariance terms can be expressed as $c_{ii} = \sigma_i^2$ and $c_{ik} = \rho_{ik}\sigma_i\sigma_k$
  – where $\rho_{ik}$ is called the correlation coefficient
[Figure: scatter plots of $(x_i, x_k)$ for $\rho_{ik} = -1, -\tfrac{1}{2}, 0, +\tfrac{1}{2}, +1$, with $C_{ik} = \rho_{ik}\sigma_i\sigma_k$ in each case]
A numerical example
• Given the following samples from a 3D distribution
  – Compute the covariance matrix
  – Generate scatter plots for every pair of variables
  – Can you observe any relationships between the covariance and the scatter plots?
  – You may work your solution in the templates below

[Worksheet template: one row per example, with columns for $x_1$, $x_2$, $x_3$, the deviations $x_i - \mu_i$, the squared deviations $(x_i - \mu_i)^2$, the cross products $(x_1-\mu_1)(x_2-\mu_2)$, $(x_1-\mu_1)(x_3-\mu_3)$, $(x_2-\mu_2)(x_3-\mu_3)$, and a final row for the averages]
              Variables (or features)
  Examples    x1    x2    x3
  1           2     2     4
  2           3     4     6
  3           5     4     2
  4           6     6     4
[Figure: empty axes for the three pairwise scatter plots of $x_1$, $x_2$, and $x_3$]
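For reference, a sketch of the computation with NumPy (using the biased, divide-by-N estimator, which matches averaging each worksheet column):

```python
import numpy as np

# The four 3D examples from the table (rows = examples, columns = x1, x2, x3)
X = np.array([[2, 2, 4],
              [3, 4, 6],
              [5, 4, 2],
              [6, 6, 4]], dtype=float)

mu = X.mean(axis=0)                         # mean vector
Sigma = np.cov(X, rowvar=False, bias=True)  # covariance matrix, divide by N
print(mu)      # [4. 4. 4.]
print(Sigma)
# c12 > 0: x1 and x2 tend to increase together
# c13 < 0: x3 tends to decrease when x1 increases
# c23 = 0: x2 and x3 are uncorrelated
```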
• The Normal or Gaussian distribution
  – The multivariate Normal distribution $N(\mu, \Sigma)$ is defined as
    $f_X(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$
  – For a single dimension, this expression reduces to
    $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
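A direct transcription of the density into code (the helper name gaussian_pdf and the parameter values are ours); for n = 1 it agrees with the univariate expression:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Normal density N(mu, Sigma) evaluated at the vector x."""
    n = len(mu)
    d = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)) / norm

# Sanity check: with n = 1 the formula reduces to the univariate Normal
mu, sigma, x = 1.5, 2.0, 0.3
univariate = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
multivariate = gaussian_pdf(np.array([x]), np.array([mu]), np.array([[sigma**2]]))
print(univariate, multivariate)
```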
– Gaussian distributions are very popular since
  • The parameters $\mu, \Sigma$ uniquely characterize the normal distribution
  • If all variables $x_i$ are uncorrelated ($E[x_i x_k] = E[x_i]E[x_k]$), then
    – the variables are also independent ($P[x_i x_k] = P[x_i]P[x_k]$), and
    – $\Sigma$ is diagonal, with the individual variances in the main diagonal
  • Central Limit Theorem (below)
  • The marginal and conditional densities are also Gaussian
  • Any linear transformation of any $N$ jointly Gaussian r.v.'s results in $N$ r.v.'s that are also Gaussian
    – For $X = [X_1, X_2 \dots X_N]^T$ jointly Gaussian, and $A_{N \times N}$ invertible, then $Y = AX$ is also jointly Gaussian
      $f_Y(y) = \frac{f_X(A^{-1}y)}{|A|}$
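An empirical check of the linear-transformation property (illustrative μ, Σ, and A; we verify the mean and covariance of Y = AX rather than the full density):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])   # invertible linear map

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T                  # y = A x, applied to every sample (row)

# If X ~ N(mu, Sigma), then Y ~ N(A mu, A Sigma A^T)
print(Y.mean(axis=0), A @ mu)
print(np.cov(Y, rowvar=False))
print(A @ Sigma @ A.T)
```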
• Central Limit Theorem
  – Given any distribution with mean $\mu$ and variance $\sigma^2$, the sampling distribution of the mean approaches a normal distribution with mean $\mu$ and variance $\sigma^2/N$ as the sample size $N$ increases
    • No matter what the shape of the original distribution is, the sampling distribution of the mean approaches a normal distribution
    • $N$ is the sample size used to compute the mean, not the overall number of samples in the data
  – Example: 500 experiments are performed using a uniform distribution
    • $N = 1$
      – One sample is drawn from the distribution and its mean is recorded (500 times)
      – The histogram resembles a uniform distribution, as one would expect
    • $N = 4$
      – Four samples are drawn and the mean of the four samples is recorded (500 times)
      – The histogram starts to look more Gaussian
    • As $N$ grows, the shape of the histograms resembles a Normal distribution more closely
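The 500-experiment demonstration is easy to reproduce; a sketch (the seed and the particular values of N are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def sampling_distribution(n, experiments=500):
    """Each experiment draws n uniform(0,1) samples and records their mean."""
    return rng.uniform(0.0, 1.0, size=(experiments, n)).mean(axis=1)

for n in (1, 4, 32):
    means = sampling_distribution(n)
    # CLT: the mean of the means stays at 0.5, while their variance
    # shrinks like sigma^2 / n, with sigma^2 = 1/12 for uniform(0,1)
    print(n, round(means.mean(), 3), round(means.var(), 4), round(1 / 12 / n, 4))
```

Plotting a histogram of the recorded means for each n shows the flat uniform shape at n = 1 turning into a Gaussian bell as n grows.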