02 machine learning - introduction probability
TRANSCRIPT
Machine Learning for Data Mining: Probability Review
Andres Mendez-Vazquez
May 14, 2015
Outline
1 Basic Theory
  Intuitive Formulation
  Axioms
2 Independence
  Unconditional and Conditional Probability
  Posterior (Conditional) Probability
3 Random Variables
  Types of Random Variables
  Cumulative Distribution Function
  Properties of the PMF/PDF
  Expected Value and Variance
4 Statistical Decision
  Statistical Decision Model
  Hypothesis Testing
  Estimation
Gerolamo Cardano: Gambling out of Darkness

Gambling
Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later.

Gerolamo Cardano (16th century)
While gambling he developed the following rule!!!

Equal conditions
"The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent's favour, you are a fool, and if in your own, you are unjust."
Gerolamo Cardano's Definition

Probability
"If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1."

Meaning
Probability as a ratio of favorable to all possible outcomes!!! As long as all events are equiprobable...

Thus, we get

P(favourable throws) = (Number of favourable throws) / (Number of all throws)   (1)
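As a quick illustration, here is a minimal Python sketch of Cardano's counting for the quoted example: wanting an ace, deuce, or trey on at least one of two dice.

```python
from itertools import product

# Classical probability: favourable outcomes over all outcomes,
# valid only when all outcomes are equally likely.
outcomes = list(product(range(1, 7), repeat=2))          # two dice: 36 throws
favourable = [w for w in outcomes if any(d in (1, 2, 3) for d in w)]

print(len(favourable), len(outcomes))                    # 27 36
print(len(favourable) / len(outcomes))                   # 0.75 -> odds 3 to 1
```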
Intuitive Formulation

Empiric Definition
Intuitively, the probability of an event A could be defined as:

P(A) = lim_{n→∞} N(A)/n

where N(A) is the number of times that event A happens in n trials.

Example
Imagine you have three dice, then
The total number of outcomes is 6^3
If we have event A = all numbers are equal, |A| = 6
Then, we have that P(A) = 6/6^3 = 1/36
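A minimal Monte Carlo sketch of this relative-frequency definition, using Python's standard random module; the estimate should approach 1/36 as n grows:

```python
import random

# Estimate P(all three dice equal) by the relative frequency N(A)/n.
n = 200_000
hits = sum(1 for _ in range(n)
           if len({random.randint(1, 6) for _ in range(3)}) == 1)

print(hits / n, 1 / 36)  # the estimate should be close to 0.0278
```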
Axioms of Probability

Axioms
Given a sample space S of events, we have that
1 0 ≤ P(A) ≤ 1
2 P(S) = 1
3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0 for i ≠ j), then:

P(A1 ∪ A2 ∪ ... ∪ An) = Σ_{i=1}^{n} P(Ai)
Set Operations

We are using
Set notation

Thus
What operations?
Example

Setup
Throw a biased coin twice

HH .36   HT .24
TH .24   TT .16

We have the following event
At least one head!!! Can you tell me which outcomes are part of it?

What about this one?
Tail on first toss.
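A small Python sketch of these two events over the four outcomes; it also checks the axiom P(S) = 1 for this table:

```python
# Outcome probabilities for two tosses of the biased coin above.
p = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}
assert abs(sum(p.values()) - 1.0) < 1e-12    # axiom: P(S) = 1

at_least_one_head = [w for w in p if "H" in w]   # HH, HT, TH
tail_on_first = [w for w in p if w[0] == "T"]    # TH, TT

print(sum(p[w] for w in at_least_one_head))  # 0.84
print(sum(p[w] for w in tail_on_first))      # 0.40
```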
We need to count!!!

We have four main methods of counting
1 Ordered samples of size r with replacement
2 Ordered samples of size r without replacement
3 Unordered samples of size r without replacement
4 Unordered samples of size r with replacement
Ordered samples of size r with replacement

Definition
The number of possible sequences (a_{i1}, ..., a_{ir}) for n different numbers is n × n × ... × n = n^r

Example
If you throw three dice you have 6 × 6 × 6 = 216
Ordered samples of size r without replacement

Definition
The number of possible sequences (a_{i1}, ..., a_{ir}) for n different numbers is n × (n − 1) × ... × (n − (r − 1)) = n!/(n − r)!

Example
The number of different numbers that can be formed if no digit can be repeated. For example, if you have 4 digits and you want numbers of size 3: 4!/(4 − 3)! = 24.
Unordered samples of size r without replacement

Definition
Actually, we want the number of possible unordered sets.

However
We have n!/(n − r)! collections where we care about the order. Thus

[n!/(n − r)!] / r! = n! / (r!(n − r)!) = C(n, r)   (2)

where C(n, r) is the binomial coefficient "n choose r".
Unordered samples of size r with replacement

Definition
We want to find an unordered set {a_{i1}, ..., a_{ir}} with replacement

Use a digit trick for that
Look at the Board

Thus

C(n + r − 1, r)   (3)
How?
Change encoding by adding more signs. Imagine all the strings of three numbers with 1, 2, 3.

We have
Old String   New String
111          1+0, 1+1, 1+2 = 123
112          1+0, 1+1, 2+2 = 124
113          1+0, 1+1, 3+2 = 125
122          1+0, 2+1, 2+2 = 134
123          1+0, 2+1, 3+2 = 135
133          1+0, 3+1, 3+2 = 145
222          2+0, 2+1, 2+2 = 234
223          2+0, 2+1, 3+2 = 235
233          2+0, 3+1, 3+2 = 245
333          3+0, 3+1, 3+2 = 345

That is 10 strings in total, matching C(3 + 3 − 1, 3) = C(5, 3) = 10.
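A short sketch checking all four counting formulas against brute-force enumeration with Python's itertools, for a small case n = 4, r = 3:

```python
from itertools import product, permutations, combinations, combinations_with_replacement
from math import comb, perm

n, r = 4, 3  # small enough to enumerate and check

# 1. Ordered with replacement: n**r
assert len(list(product(range(n), repeat=r))) == n**r
# 2. Ordered without replacement: n!/(n-r)!
assert len(list(permutations(range(n), r))) == perm(n, r)
# 3. Unordered without replacement: C(n, r)
assert len(list(combinations(range(n), r))) == comb(n, r)
# 4. Unordered with replacement: C(n+r-1, r)
assert len(list(combinations_with_replacement(range(n), r))) == comb(n + r - 1, r)
```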
Independence

Definition
Two events A and B are independent if and only if

P(A, B) = P(A ∩ B) = P(A)P(B)
Example

We have two dice
Thus, we have all pairs (i, j) such that i, j = 1, 2, ..., 6

We have the following events
A = first die is 1, 2 or 3
B = first die is 3, 4 or 5
C = the sum of the two faces is 9

So, we can do
Look at the board!!! Independence between A, B, C
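A brute-force sketch of the board exercise, comparing P(X ∩ Y) with P(X)P(Y) for each pair of events over the 36 equally likely outcomes:

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs

def P(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] in (1, 2, 3)
B = lambda w: w[0] in (3, 4, 5)
C = lambda w: w[0] + w[1] == 9

for name, X, Y in [("A,B", A, B), ("A,C", A, C), ("B,C", B, C)]:
    joint = P(lambda w: X(w) and Y(w))
    print(name, "independent" if joint == P(X) * P(Y) else "not independent")
```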
We can use this to derive the Binomial Distribution

WHAT?????
First, we use a sequence of n Bernoulli Trials

We have this
"Success" has a probability p.
"Failure" has a probability 1 − p.

Examples
Toss a coin independently n times.
Examine components produced on an assembly line.

Now
We take S = all 2^n ordered sequences of length n, with components 0 (failure) and 1 (success).
Thus, taking a sample ω
ω = 11···10···0, i.e. k 1's followed by n − k 0's.

We have then

P(ω) = P(A1 ∩ A2 ∩ ... ∩ Ak ∩ A^c_{k+1} ∩ ... ∩ A^c_n)
     = P(A1) P(A2) ··· P(Ak) P(A^c_{k+1}) ··· P(A^c_n)
     = p^k (1 − p)^{n−k}

Important
The number of such samples is the number of sets with k elements... or... C(n, k)
Did you notice?

We do not care where the 1's and 0's are
Thus all the probabilities are equal to p^k (1 − p)^{n−k}

Thus, we are looking to sum the probabilities of all those combinations of 1's and 0's

Σ_{ω with k 1's} p(ω)

Then

Σ_{ω with k 1's} p(ω) = C(n, k) p^k (1 − p)^{n−k}
Proving this is a probability

Sum of these probabilities is equal to 1

Σ_{k=0}^{n} C(n, k) p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1

The other is simple

0 ≤ C(n, k) p^k (1 − p)^{n−k} ≤ 1 ∀k

This is known as
The Binomial probability function!!!
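A minimal sketch of the binomial pmf just derived, with a numeric check that it sums to 1:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(exactly k successes in n independent Bernoulli(p) trials)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.6
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
assert abs(sum(pmf) - 1.0) < 1e-12  # (p + (1 - p))**n = 1
print(pmf[6])  # P of exactly 6 successes in 10 trials with p = 0.6
```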
Different Probabilities

Unconditional
This is the probability of an event A prior to the arrival of any evidence; it is denoted by P(A). For example:
P(Cavity) = 0.1 means that "in the absence of any other information, there is a 10% chance that the patient is having a cavity".

Conditional
This is the probability of an event A given some evidence B; it is denoted P(A|B). For example:
P(Cavity|Toothache) = 0.8 means that "there is an 80% chance that the patient is having a cavity given that he is having a toothache".
Posterior Probabilities

Relation between conditional and unconditional probabilities
Conditional probabilities can be defined in terms of unconditional probabilities:

P(A|B) = P(A, B) / P(B)

which generalizes to the chain rule P(A, B) = P(B)P(A|B) = P(A)P(B|A).

Law of Total Probability
If B1, B2, ..., Bn is a partition of mutually exclusive events and A is an event, then P(A) = Σ_{i=1}^{n} P(A ∩ Bi). A special case: P(A) = P(A, B) + P(A, B^c).
In addition, this can be rewritten as P(A) = Σ_{i=1}^{n} P(A|Bi)P(Bi).
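A tiny sketch of the law of total probability, borrowing the numbers from the two-coin example later in this review (probability 3/4 of picking an unbiased coin, 1/4 of picking a two-headed one):

```python
# P(head) = P(head|B1)P(B1) + P(head|B2)P(B2) over the partition {B1, B2}
P_B = {"B1": 0.75, "B2": 0.25}           # which coin was chosen
P_head_given_B = {"B1": 0.5, "B2": 1.0}  # unbiased vs two-headed

P_head = sum(P_head_given_B[b] * P_B[b] for b in P_B)
print(P_head)  # 0.625
```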
Example

Three cards are drawn from a deck
Find the probability of not obtaining a heart

We have
52 cards
39 of them are not a heart

Define
Ai = card i is not a heart. Then?
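The chain rule finishes it; a quick sketch with exact arithmetic:

```python
from fractions import Fraction

# P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1, A2): no replacement between draws
p = Fraction(39, 52) * Fraction(38, 51) * Fraction(37, 50)
print(p, float(p))  # 703/1700, about 0.4135
```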
Independence and Conditional

From here, we have that...
If A and B are independent, P(A|B) = P(A) and P(B|A) = P(B).

Conditional independence
A and B are conditionally independent given C if and only if

P(A|B, C) = P(A|C)

Example: P(WetGrass|Season, Rain) = P(WetGrass|Rain).
Bayes Theorem
One Version

P(A|B) = P(B|A)P(A) / P(B)

Where
P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called the likelihood.
P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
General Form of the Bayes Rule

Definition
If A1, A2, ..., An is a partition of mutually exclusive events and B any event, then:

P(Ai|B) = P(B|Ai)P(Ai) / P(B) = P(B|Ai)P(Ai) / Σ_{j=1}^{n} P(B|Aj)P(Aj)

where

P(B) = Σ_{i=1}^{n} P(B ∩ Ai) = Σ_{i=1}^{n} P(B|Ai)P(Ai)
Example

Setup
Throw two unbiased dice independently.

Let
1 A = sum of the faces = 8
2 B = faces are equal

Then calculate P(B|A)
Look at the board
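A brute-force sketch: A has the five outcomes (2,6), (3,5), (4,4), (5,3), (6,2), and only (4,4) has equal faces, so P(B|A) = 1/5.

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))
A = [w for w in omega if w[0] + w[1] == 8]   # 5 outcomes
B_and_A = [w for w in A if w[0] == w[1]]     # only (4, 4)

print(Fraction(len(B_and_A), len(A)))  # P(B|A) = 1/5
```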
Another Example

We have the following
Two coins are available, one unbiased and the other two-headed

Assume
That you have a probability of 3/4 of choosing the unbiased one

Events
A = head comes up
B1 = unbiased coin chosen
B2 = two-headed coin chosen

If a head comes up, find the probability that the two-headed coin was chosen
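A worked sketch of the Bayes rule for this example, with exact fractions:

```python
from fractions import Fraction

P_B1, P_B2 = Fraction(3, 4), Fraction(1, 4)                # prior on the coins
P_A_given_B1, P_A_given_B2 = Fraction(1, 2), Fraction(1, 1)  # P(head | coin)

P_A = P_A_given_B1 * P_B1 + P_A_given_B2 * P_B2  # total probability: 5/8
print(P_A_given_B2 * P_B2 / P_A)                 # P(B2|A) = 2/5
```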
Random Variables I

Definition
In many experiments, it is easier to deal with a summary variable than with the original probability structure.

Example
In an opinion poll, we ask 50 people whether they agree or disagree with a certain issue.
Suppose we record a "1" for agree and "0" for disagree.
The sample space for this experiment has 2^50 elements. Why?
Suppose we are only interested in the number of people who agree.
Define the variable X = number of "1"'s recorded out of 50.
It is easier to deal with this sample space (it has only 51 elements).
Thus...

It is necessary to define a function, the "random variable", as follows

X : S → R

Graphically
Random Variables II

How?
How is the probability function of the random variable defined from the probability function of the original sample space?

Suppose the sample space is S = {s1, s2, ..., sn}
Suppose the range of the random variable X is {x1, x2, ..., xm}

Then, we observe X = xj if and only if the outcome of the random experiment is an sj ∈ S s.t. X(sj) = xj, or

P(X = xj) = P({sj ∈ S | X(sj) = xj})
Example

Setup
Throw a coin 10 times, and let R be the number of heads.

Then
S = all sequences of length 10 with components H and T

We have, for
ω = HHHHTTHTTH ⇒ R(ω) = 6
Example

Setup
Let R be the number of heads in two independent tosses of a coin.
Probability of head is .6

What are the probabilities?
Ω = {HH, HT, TH, TT}

Thus, we can calculate
P(R = 0), P(R = 1), P(R = 2)
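A short sketch enumerating Ω to get the pmf of R:

```python
from itertools import product

p_head = 0.6
P_R = {0: 0.0, 1: 0.0, 2: 0.0}
for w in product("HT", repeat=2):        # HH, HT, TH, TT
    p_w = 1.0
    for c in w:
        p_w *= p_head if c == "H" else 1 - p_head
    P_R[w.count("H")] += p_w

print({k: round(v, 2) for k, v in P_R.items()})  # {0: 0.16, 1: 0.48, 2: 0.36}
```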
Types of Random Variables

Discrete
A discrete random variable can assume only a countable number of values.

Continuous
A continuous random variable can assume a continuous range of values.
Properties

Probability Mass Function (PMF) and Probability Density Function (PDF)
The pmf/pdf of a random variable X assigns a probability to each possible value of X.

Properties of the pmf and pdf
Some properties of the pmf:
Σ_x p(x) = 1 and P(a ≤ X ≤ b) = Σ_{k=a}^{b} p(k).
In a similar way for the pdf:
∫_{−∞}^{∞} p(x) dx = 1 and P(a < X < b) = ∫_a^b p(t) dt.
Cumulative Distribution Function I

Cumulative Distribution Function
With every random variable, we associate a function called the Cumulative Distribution Function (CDF), which is defined as follows:

F_X(x) = P(X ≤ x)

With properties:
F_X(x) ≥ 0
F_X(x) is a non-decreasing function of x.

Example
If X is discrete, its CDF can be computed as follows:

F_X(x) = P(X ≤ x) = Σ_{x_k ≤ x} P(X = x_k).
Example: Discrete Function

[Figure: bar plot of a pmf with masses .16, .48, .36, and the corresponding step-function CDF rising to 1.]
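A sketch of such a discrete CDF, using the two-toss pmf from the earlier example (masses .16, .48, .36 at 0, 1, 2):

```python
pmf = {0: 0.16, 1: 0.48, 2: 0.36}

def cdf(x):
    # F_X(x) = sum of p(k) over all k <= x
    return sum(p for k, p in pmf.items() if k <= x)

print(cdf(-1), cdf(0), cdf(1.5), cdf(2))  # 0 0.16 0.64 1.0 (up to float rounding)
```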
Cumulative Distribution Function II

Continuous Function
If X is continuous, its CDF can be computed as follows:

F(x) = ∫_{−∞}^{x} f(t) dt.

Remark
Based on the fundamental theorem of calculus, we have the following equality:

p(x) = dF(x)/dx

Note
This particular p(x) is known as the Probability Density Function (PDF).
Example: Continuous Function

Setup
A number X is chosen at random between a and b
X has a uniform distribution:
f_X(x) = 1/(b − a) for a ≤ x ≤ b
f_X(x) = 0 for x < a and x > b

We have

F_X(x) = P{X ≤ x} = ∫_{−∞}^{x} f_X(t) dt   (4)

P{a < X ≤ b} = ∫_a^b f_X(t) dt   (5)
Graphically
[Figure: the uniform distribution example, with the CDF rising linearly from 0 to 1 between a and b.]
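A minimal sketch of the uniform CDF just described:

```python
def uniform_cdf(x, a=0.0, b=1.0):
    # F_X(x) for the uniform distribution on [a, b]
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

print(uniform_cdf(0.25), uniform_cdf(0.75))  # 0.25 0.75 on [0, 1]
```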
Properties of the PMF/PDF I

Conditional PMF/PDF
We have the conditional pdf:

p(y|x) = p(x, y) / p(x).

From this, we have the general chain rule

p(x1, x2, ..., xn) = p(x1|x2, ..., xn) p(x2|x3, ..., xn) ··· p(xn).

Independence
If X and Y are independent, then:

p(x, y) = p(x)p(y).
Properties of the PMF/PDF II

Law of Total Probability

p(y) = Σ_x p(y|x) p(x).
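A small sketch of this marginalization for a made-up joint pmf (the numbers here are illustrative, not from the lecture):

```python
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

# p(y) = sum over x of p(y|x) p(x)
p_y = {y: sum(p_y_given_x[x][y] * p_x[x] for x in p_x) for y in (0, 1)}
print(p_y)  # {0: 0.55, 1: 0.45} (up to float rounding)
```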
Expectation

Something Notable
You have the random variables R1, R2 representing how long a call lasts and how much you pay for an international call
If 0 ≤ R1 ≤ 3 (minutes), R2 = 10 (cents)
If 3 < R1 ≤ 6 (minutes), R2 = 20 (cents)
If 6 < R1 ≤ 9 (minutes), R2 = 30 (cents)

We have then the probabilities
P{R2 = 10} = 0.6, P{R2 = 20} = 0.25, P{R2 = 30} = 0.15

If we observe N calls and N is very large
We can say that we have N × 0.6 such calls and 10 × N × 0.6 as the cost of those calls
Expectation

Similarly
R2 = 20 ⟹ 0.25N calls and total cost 5N
R2 = 30 ⟹ 0.15N calls and total cost 4.5N

We have then
The total cost is 6N + 5N + 4.5N = 15.5N, or on average 15.5 cents per call

The average
[10(0.6N) + 20(0.25N) + 30(0.15N)] / N = 10(0.6) + 20(0.25) + 30(0.15) = Σ_y y P{R2 = y}
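The same average as a couple of lines of Python:

```python
# E[R2] = sum over y of y * P(R2 = y), for the call-cost example above
pmf = {10: 0.6, 20: 0.25, 30: 0.15}
print(sum(y * p for y, p in pmf.items()))  # 15.5 cents per call
```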
Expected Value

Definition
Discrete random variable X: E(X) = Σ_x x p(x).
Continuous random variable Y: E(Y) = ∫_y y p(y) dy.

Extension to a function g(X)
E(g(X)) = Σ_x g(x) p(x) (discrete case).
E(g(X)) = ∫_{−∞}^{∞} g(x) p(x) dx (continuous case)

Linearity property
E(af(X) + bg(Y)) = aE(f(X)) + bE(g(Y))
Example

Imagine the following
We have the following density
1 f(x) = e^{−x}, x ≥ 0
2 f(x) = 0, x < 0

Find
The expected value
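Integration by parts gives E(X) = ∫_0^∞ x e^{−x} dx = 1; a crude numerical check (trapezoidal rule over [0, 50], where the tail is negligible):

```python
import math

N, H = 200_000, 50.0 / 200_000
vals = [i * H * math.exp(-i * H) for i in range(N + 1)]  # x * f(x) on a grid
integral = H * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
print(integral)  # approximately 1.0
```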
Variance

Definition
Var(X) = E((X − µ)^2), where µ = E(X)

Standard Deviation
The standard deviation is simply σ = √Var(X).
Example

Suppose
You have that the number of calls made per day at a given exchange has a Poisson distribution with an unknown parameter θ:

p(x|θ) = (θ^x e^{−θ}) / x!,  x = 0, 1, 2, ...   (6)

We need to obtain information about θ
For this, we observe that certain information is needed!!!

For example
We could need more of certain equipment if θ > θ0
We do not need it if θ ≤ θ0
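The Poisson pmf in code, for reference in the decision example that follows (θ = 2 is an illustrative value):

```python
from math import exp, factorial

def poisson_pmf(x, theta):
    # p(x|theta) = theta**x * exp(-theta) / x!
    return theta**x * exp(-theta) / factorial(x)

print(poisson_pmf(3, 2.0))  # about 0.1804
```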
Thus, we want to take a decision about θ

To avoid making an incorrect decision
To avoid losing money!!!
Ingredients of statistical decision models

First
N, the set of states

Second
A random variable or random vector X, the observable, whose distribution Fθ depends on θ ∈ N

Third
A, the set of possible actions:

A = N = (0, ∞)

Fourth
A loss (cost) function L(θ, a), θ ∈ N, a ∈ A:
It represents the loss of taking a decision.
Hypothesis Testing

Suppose
H0 and H1 are two subsets such that
H0 ∩ H1 = ∅
H0 ∪ H1 = N

In the telephone example
H0 = {θ | θ ≤ θ0}
H1 = {θ | θ > θ0}

In other words
"θ ∈ H0"
"θ ∈ H1"
Simple Hypothesis Vs. Simple Alternative

In this specific case
Each of H0 and H1 contains one element, θ0 and θ1 respectively

Thus
We have that our random variable X, which depends on θ, satisfies:
If we are in H0, X ∼ f0
If we are in H1, X ∼ f1

Thus, the problem
It is deciding whether X has density f0 or f1
What do we do?

We define a function
ϕ : E → [0, 1], interpreted as the probability of rejecting H0 when x is observed

We have then
If ϕ(x) = 1, we reject H0
If ϕ(x) = 0, we accept H0
If 0 < ϕ(x) < 1, we toss a coin with probability ϕ(x) of heads
  If the coin comes up heads, reject H0
  If the coin comes up tails, accept H0
Thus

{x | ϕ(x) = 1}
It is called the rejection region or critical region.

And
ϕ is called a test!!!

Clearly the decision could be erroneous!!!
A type 1 error occurs if we reject H0 when H0 is true!!!
A type 2 error occurs if we accept H0 when H1 is true!!!
Thus the probability of error when X = x

If H0 is rejected when true
Probability of a type 1 error

α = ∫_{−∞}^{∞} ϕ(x) f0(x) dx   (7)

If H0 is accepted when false
Probability of a type 2 error

β = ∫_{−∞}^{∞} (1 − ϕ(x)) f1(x) dx   (8)
Actually

If the test is an indicator function, ϕ(x) = I_{Reject H0}(x) and 1 − ϕ(x) = I_{Retain H0}(x)

[Table: the 2×2 layout of retain/reject H0 against H0/H1 being true, with the type 1 and type 2 errors off the diagonal.]
Problem!!!

There is not a unique answer to the question of what is a good test
Thus, we suppose there is a nonnegative cost ci associated to a type i error.
In addition, we have a prior probability p of H0 being true.

The over-all average cost associated with ϕ is

B(ϕ) = p · c1 · α(ϕ) + (1 − p) · c2 · β(ϕ)   (9)
We can do the following

The over-all average cost associated with ϕ is

B(ϕ) = p · c1 · ∫_{−∞}^{∞} ϕ(x) f0(x) dx + (1 − p) · c2 · ∫_{−∞}^{∞} (1 − ϕ(x)) f1(x) dx

Thus

B(ϕ) = ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) + (1 − p) c2 (1 − ϕ(x)) f1(x)] dx
     = ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) − (1 − p) c2 ϕ(x) f1(x) + (1 − p) c2 f1(x)] dx
     = ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) − (1 − p) c2 ϕ(x) f1(x)] dx + (1 − p) c2 ∫_{−∞}^{∞} f1(x) dx

We have that, since ∫_{−∞}^{∞} f1(x) dx = 1,

B(ϕ) = ∫_{−∞}^{∞} ϕ(x) [p c1 f0(x) − (1 − p) c2 f1(x)] dx + (1 − p) c2
Bayes Risk
We have that... B(ϕ) is called the Bayes risk associated with the test function ϕ.
In addition, a test that minimizes B(ϕ) is called a Bayes test corresponding to the given p, c1, c2, f0 and f1.
71 / 87
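A sketch of the Bayes risk as code may help fix ideas: the function below approximates B(ϕ) of Equation (9) for any test ϕ on a grid. The Gaussian densities, costs, and threshold test in the example call are assumptions for illustration only.

```python
import numpy as np

def bayes_risk(phi, f0, f1, p, c1, c2, xs):
    """Approximate B(phi) = p*c1*alpha(phi) + (1-p)*c2*beta(phi), Eq. (9)."""
    dx = xs[1] - xs[0]
    alpha = np.sum(phi(xs) * f0(xs)) * dx        # type 1 error probability
    beta = np.sum((1 - phi(xs)) * f1(xs)) * dx   # type 2 error probability
    return p * c1 * alpha + (1 - p) * c2 * beta

# Illustrative call: N(0,1) vs N(1,1), equal prior and costs, threshold at 0.5.
f0 = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f1 = lambda x: np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)
phi = lambda x: (x > 0.5).astype(float)
xs = np.linspace(-10.0, 10.0, 200001)
print(bayes_risk(phi, f0, f1, p=0.5, c1=1.0, c2=1.0, xs=xs))
```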
What do we want?
We want to minimize

∫_S ϕ(x) g(x) dx

where g collects everything multiplying ϕ(x) above. Understanding the sign of g(x) will tell us how to select the correct hypothesis!!!
72 / 87
What do we want?
Case 1: If g(x) < 0, it is best to take ϕ(x) = 1 for all x ∈ S.
Case 2: If g(x) > 0, it is best to take ϕ(x) = 0 for all x ∈ S.
Case 3: If g(x) = 0, ϕ(x) may be chosen arbitrarily.
73 / 87
Finally
We choose
g(x) = p c1 f0(x) − (1 − p) c2 f1(x)   (10)

We look at the points where g(x) = 0:

p c1 f0(x) − (1 − p) c2 f1(x) = 0
p c1 f0(x) = (1 − p) c2 f1(x)
p c1 / ((1 − p) c2) = f1(x) / f0(x)
74 / 87
Bayes Solution
Thus, we have
Let L(x) = f1(x) / f0(x). Then:
If L(x) > p c1 / ((1 − p) c2), take ϕ(x) = 1, i.e. reject H0.
If L(x) < p c1 / ((1 − p) c2), take ϕ(x) = 0, i.e. accept H0.
If L(x) = p c1 / ((1 − p) c2), take ϕ(x) = anything.
75 / 87
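The Bayes solution translates directly into code. Below is a minimal sketch; the densities, prior, and costs are placeholder assumptions, and ties (L(x) exactly equal to the threshold) are broken by accepting H0, which is one of the equally optimal choices.

```python
import math

def bayes_test(x, f0, f1, p, c1, c2):
    """Bayes test: reject H0 iff L(x) = f1(x)/f0(x) exceeds p*c1/((1-p)*c2)."""
    threshold = (p * c1) / ((1 - p) * c2)
    L = f1(x) / f0(x)  # assumes f0(x) > 0
    return "reject H0" if L > threshold else "accept H0"

# Illustrative densities: N(0,1) under H0, N(1,1) under H1; equal prior/costs.
f0 = lambda x: math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
f1 = lambda x: math.exp(-(x - 1)**2 / 2) / math.sqrt(2 * math.pi)
print(bayes_test(1.2, f0, f1, p=0.5, c1=1.0, c2=1.0))  # reject H0
```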
Likelihood Ratio
We have
L is called the likelihood ratio.
For the test ϕ there is a constant 0 ≤ λ ≤ ∞ such that
ϕ(x) = 1 when L(x) > λ
ϕ(x) = 0 when L(x) < λ
Remark: This is known as the Likelihood Ratio Test (LRT).
76 / 87
Example
Let X be a discrete random variable taking the values x = 0, 1, 2, 3.
We have then

x        0    1    2    3
p0(x)   .1   .2   .3   .4
p1(x)   .2   .1   .4   .3

We have the following likelihood ratios, in increasing order:

x        1     3     2     0
L(x)    1/2   3/4   4/3    2
77 / 87
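A few lines of Python reproduce the likelihood-ratio table above from the two pmfs (a check, nothing more):

```python
p0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}  # pmf under H0
p1 = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}  # pmf under H1

L = {x: p1[x] / p0[x] for x in p0}     # likelihood ratio L(x)
for x in sorted(L, key=L.get):         # in increasing order of L(x)
    print(x, L[x])
# Prints x = 1, 3, 2, 0 with L(x) = 0.5, 0.75, 1.333..., 2.0
```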
Example
We have the following situation:

LRT threshold      Rejection region   Acceptance region   α    β
0 ≤ λ < 1/2        all x              empty               1    0
1/2 < λ < 3/4      x = 0, 2, 3        x = 1               .8   .1
3/4 < λ < 4/3      x = 0, 2           x = 1, 3            .4   .4
4/3 < λ < 2        x = 0              x = 1, 2, 3         .1   .8
2 < λ ≤ ∞          empty              all x               0    1
78 / 87
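The rows of this table can be checked mechanically: for a λ strictly inside each interval, the non-randomized LRT has a fixed rejection region, and α and β follow by summing the pmfs. A sketch (the probe values of λ are arbitrary points inside each interval):

```python
p0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
p1 = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}
L = {x: p1[x] / p0[x] for x in p0}

def errors(lam):
    """alpha = P0(reject H0), beta = P1(accept H0) for the LRT at level lam.

    Valid for lam strictly between the values of L(x); at those values the
    randomized version on the next slide is needed.
    """
    reject = {x for x in L if L[x] > lam}
    alpha = sum(p0[x] for x in reject)
    beta = sum(p1[x] for x in L if x not in reject)
    return alpha, beta

for lam in (0.25, 0.6, 1.0, 1.5, 3.0):  # one probe per row of the table
    print(lam, errors(lam))
# Reproduces (1, 0), (.8, .1), (.4, .4), (.1, .8), (0, 1).
```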
Example
Assume λ = 3/4
Reject H0 if x = 0, 2.
Accept H0 if x = 1.
If x = 3 (where L(x) = λ), we randomize, i.e. reject H0 with probability a, 0 ≤ a ≤ 1. Thus

α = p0(0) + p0(2) + a p0(3) = 0.4 + 0.4a
β = p1(1) + (1 − a) p1(3) = 0.1 + 0.3(1 − a)
79 / 87
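The formulas above trace out a line segment as a moves from 0 to 1, interpolating between two rows of the previous table; a short sketch confirms the endpoints:

```python
p0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
p1 = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}

def randomized_errors(a):
    """lambda = 3/4: reject for x in {0, 2}, accept for x = 1, and reject
    with probability a when x = 3 (where L(x) equals the threshold)."""
    alpha = p0[0] + p0[2] + a * p0[3]
    beta = p1[1] + (1 - a) * p1[3]
    return alpha, beta

for a in (0.0, 0.5, 1.0):
    print(a, randomized_errors(a))
# a = 0 gives (0.4, 0.4) and a = 1 gives (0.8, 0.1), matching the table rows.
```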
The Graph of B (ϕ)
Thus, we have the graph of B(ϕ) for each λ value. (Figure not reproduced in this transcript.)
80 / 87
Thus, we have several tests
The classic one: the Minimax Test, the test that minimizes max{α, β}.
Which means: an admissible test with constant risk (α = β) is minimax.
Then, we have only one test with α = β, namely α = β = 0.4, obtained for 3/4 < λ < 4/3. Thus:
We reject H0 when x = 0 or 2.
We accept H0 when x = 1 or 3.
81 / 87
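This can be confirmed numerically, at least within the one-parameter family of randomized tests at λ = 3/4 from the previous slide, where α(a) = 0.4 + 0.4a and β(a) = 0.4 − 0.3a. A tiny grid search over a:

```python
# max(alpha, beta) over randomization probabilities a in [0, 1] at lambda = 3/4.
best = min((max(0.4 + 0.4 * a, 0.4 - 0.3 * a), a)
           for a in (i / 100 for i in range(101)))
print(best)  # (0.4, 0.0): the non-randomized test, alpha = beta = 0.4
```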
Remark
From these ideas, we can work out the classics of hypothesis testing.
82 / 87
Outline
1 Basic Theory
  Intuitive Formulation
  Axioms
2 Independence
  Unconditional and Conditional Probability
  Posterior (Conditional) Probability
3 Random Variables
  Types of Random Variables
  Cumulative Distributive Function
  Properties of the PMF/PDF
  Expected Value and Variance
4 Statistical Decision
  Statistical Decision Model
  Hypothesis Testing
  Estimation
83 / 87
Introduction
Suppose γ is a real-valued function on the set N of states of nature. Now, when we observe X = x, we want to produce a number ψ(x) that is close to γ(θ).
There are different ways of doing this:
Maximum Likelihood (ML)
Expectation Maximization (EM)
Maximum A Posteriori (MAP)
84 / 87
Maximum Likelihood Estimation
Suppose the following: let fθ be a density or probability function corresponding to the state of nature θ, and assume for simplicity that γ(θ) = θ.
If X = x, the ML estimate of θ is the value of θ that maximizes fθ(x).
85 / 87
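In code, the ML recipe is just an argmax over candidate values of θ. A generic sketch by grid search; the Gaussian model and the grid are illustrative assumptions, and the example that follows does the binomial case analytically.

```python
def ml_estimate(x, log_likelihood, grid):
    """Return the theta on `grid` maximizing the log-likelihood of x."""
    return max(grid, key=lambda theta: log_likelihood(theta, x))

# Illustrative model: X ~ N(theta, 1), one observation.  The ML estimate of
# the mean from a single draw is the draw itself.
loglik = lambda theta, x: -(x - theta) ** 2 / 2  # log f_theta(x), up to a constant
grid = [i / 100 for i in range(-300, 301)]       # candidate thetas in [-3, 3]
print(ml_estimate(2.37, loglik, grid))           # 2.37
```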
Example
Let X have a binomial distribution with parameters n and θ, 0 ≤ θ ≤ 1.
The pmf is

pθ(x) = (n choose x) θ^x (1 − θ)^(n−x),  x = 0, 1, 2, ..., n

To maximize it, differentiate the log-likelihood with respect to θ and set the result to zero:

∂/∂θ ln pθ(x) = 0
86 / 87
Example
We get

x/θ − (n − x)/(1 − θ) = 0  ⟹  θ̂ = x/n

Now, we can regard X as a sum of independent variables

X = X1 + X2 + ... + Xn

where Xi is 1 with probability θ and 0 with probability 1 − θ.

We get finally, by the law of large numbers,

θ̂(X) = (1/n) Σ_{i=1}^{n} Xi → E(Xi) = θ  as n → ∞
87 / 87
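A quick simulation illustrates the limit just stated: draw X as a sum of n Bernoulli(θ) variables and compare θ̂ = X/n with θ. The particular θ and n below are arbitrary choices.

```python
import random

theta, n = 0.3, 100_000
x = sum(random.random() < theta for _ in range(n))  # X ~ Binomial(n, theta)
theta_hat = x / n                                   # ML estimate derived above
print(theta_hat)  # close to 0.3, as the law of large numbers guarantees
```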