Probabilities, Likelihood, Bayesian inference
Tuesday, April 12, 2011
Taking samples for statistical analysis
Copyright © 2011 Paul O. Lewis
Combining probabilities
• Multiply probabilities if the component events must happen simultaneously (i.e. where you would naturally use the word AND when describing the problem)
Using 2 dice, what is the probability of [one particular face] AND [another particular face]?
(1/6) × (1/6) = 1/36
Combining probabilities
• Add probabilities if the component events are mutually exclusive (i.e. where you would naturally use the word OR in describing the problem)
Using one die, what is the probability of [one particular face] OR [another particular face]?
(1/6) + (1/6) = 1/3
Combining AND and OR
What is the probability that the sum of two dice is 7?
1 and 6
2 and 5
3 and 4
4 and 3
5 and 2
6 and 1
(1/36) + (1/36) + (1/36) + (1/36) + (1/36) + (1/36) = 1/6
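This AND-then-OR calculation is easy to verify by brute force. A small Python sketch (not part of the original slides) enumerates all 36 equally likely outcomes of two dice:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die1, die2) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

# Each specific pair is an AND event with probability (1/6) * (1/6) = 1/36;
# the pairs summing to 7 are mutually exclusive, so their probabilities add (OR).
p_pair = Fraction(1, 6) * Fraction(1, 6)
p_sum7 = sum(p_pair for d1, d2 in outcomes if d1 + d2 == 7)
print(p_sum7)  # 1/6
```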
Joint probabilities (B = Black, W = White, S = Solid, D = Dotted)
Conditional probabilities
Independence
This is always true:
Pr(A and B) = Pr(A) Pr(B|A)
(joint probability = probability of A times the conditional probability of B given A)
If we can say this:
Pr(B|A) = Pr(B)
...then events A and B are independent and we can express the joint probability as the product of Pr(A) and Pr(B):
Pr(A and B) = Pr(A) Pr(B)
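The independence test above can be checked concretely on the two-dice sample space. In this sketch (the two events are my own choice for illustration, not from the slides), "first die is even" and "the dice sum to 7" turn out to be independent:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    # Probability = (outcomes satisfying the event) / (total outcomes)
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):  # first die shows an even number
    return o[0] % 2 == 0

def B(o):  # the two dice sum to 7
    return o[0] + o[1] == 7

# Pr(A and B) = Pr(A) Pr(B|A) always; independence means Pr(B|A) = Pr(B),
# equivalently Pr(A and B) = Pr(A) Pr(B).
p_joint = prob(lambda o: A(o) and B(o))
print(p_joint == prob(A) * prob(B))  # True: A and B are independent here
```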
The Likelihood Criterion
The probability of the observations computed using a model tells us how surprised we should be. The preferred model is the one that surprises us least.
Suppose I threw 20 dice down on the table and this was the result...

The Fair Dice model
Pr(obs. | fair dice model) = (1/6)^20 = 1/3,656,158,440,062,976
You should have been very surprised at this result because the probability of this event is very small: only 1 in 3.6 quadrillion!
The Trick Dice model (assumes dice each have 5 on every side)
Pr(obs. | trick dice model) = 1^20 = 1
You should not be surprised at all at this result because the observed outcome is certain under this model.
Results

Model        Likelihood                 Surprise level
Fair dice    1/3,656,158,440,062,976    Very, very, very surprised
Trick dice   1.0                        Not surprised at all

The winning model maximizes the likelihood (and thus minimizes surprise).
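The two likelihoods can be reproduced exactly with rational arithmetic. This sketch assumes the observed throw was the outcome the trick dice make certain (twenty 5s), which is how the slides set the example up:

```python
from fractions import Fraction

n_dice = 20  # twenty dice thrown; assume every one landed showing 5

# Fair dice model: each die shows 5 with probability 1/6.
L_fair = Fraction(1, 6) ** n_dice

# Trick dice model: every side shows 5, so the observation is certain.
L_trick = Fraction(1) ** n_dice

print(L_fair)            # 1/3656158440062976
print(L_trick)           # 1
print(L_trick > L_fair)  # True: the trick-dice model maximizes the likelihood
```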
Likelihood and model comparison
• Analyses using likelihoods ultimately involve model comparison
• The models compared can be discrete (as in the fair vs. trick dice example)
• More often the models compared differ continuously:
– Model 1: branch length is 0.05
– Model 2: branch length is 0.06
Rather than having an infinity of models, we instead think of the branch length as a parameter within one model.
Likelihood versus log likelihood
[Figure: the likelihood L (vertical axis roughly 0.05 to 0.25) and the log likelihood Log[L] (vertical axis roughly −14 to −2) plotted against the same parameter over the range 0.2 to 1.0. Both curves reach their maximum at the same parameter value.]
Invariance principle
Coin flipping
Pr(y|p) = C(n,y) p^y (1-p)^(n-y) = L(p|y)
y = observed number of heads
n = number of flips (sample size)
p = (unobserved) proportion of heads
Note that the same formula serves as both the:
- probability of y (if p is fixed)
- likelihood of p (if y is fixed)
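The dual reading of the binomial formula can be demonstrated directly. In this sketch the sample size n = 10 and the observed y = 6 heads are assumed values chosen for illustration:

```python
from math import comb

def binom(y, n, p):
    # Pr(y | p) = C(n, y) p^y (1 - p)^(n - y)
    return comb(n, y) * p**y * (1 - p)**(n - y)

n = 10  # assumed number of flips

# As a probability distribution over the data y (model p fixed at 0.5),
# the values sum to 1 over all possible y:
total = sum(binom(y, n, 0.5) for y in range(n + 1))
print(round(total, 10))  # 1.0

# As a likelihood function of p (data fixed at y = 6 heads),
# the values need not sum to 1; they rank models instead:
print(binom(6, n, 0.6) > binom(6, n, 0.5))  # True: p = 0.6 fits 6/10 better
```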
Outcome   Fair coin model   Two-heads model
H         0.5               1.0
T         0.5               0.0
Sum       1.0               1.0
Likelihood vs. Probability
• The symbol Pr(D|M) is often used to represent the likelihood (I will be following convention in this regard)
Pr(D|M): probabilities are functions of the data (the model is fixed)
L(M|D): likelihoods are functions of models (the data are fixed)
Examples
Mathematica example
Bayesian inference
Bayes’ rule
Probability of "Dotted"
Bayes' rule (cont.)
Pr(D) is the marginal probability of being dotted. To compute it, we marginalize over colors.
Joint probabilities

     B          W
D    Pr(D,B)    Pr(D,W)
S    Pr(S,B)    Pr(S,W)
Marginal probabilities
D: Pr(D,B) + Pr(D,W) = marginal probability of being dotted
S: Pr(S,B) + Pr(S,W) = marginal probability of being solid
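Marginalizing is just summing joint probabilities across a row or down a column. In this sketch the four joint probabilities are invented for illustration (the slides leave them symbolic):

```python
from fractions import Fraction

# Invented joint probabilities for the marble example (illustration only).
joint = {
    ("D", "B"): Fraction(1, 10), ("D", "W"): Fraction(2, 10),
    ("S", "B"): Fraction(3, 10), ("S", "W"): Fraction(4, 10),
}

# Marginalize over color (sum across a row) to get Pr(D) and Pr(S):
pr_dotted = joint[("D", "B")] + joint[("D", "W")]
pr_solid = joint[("S", "B")] + joint[("S", "W")]

# Marginalize over "dottedness" (sum down a column) to get Pr(W):
pr_white = joint[("D", "W")] + joint[("S", "W")]

print(pr_dotted, pr_solid, pr_white)   # 3/10 7/10 3/5
print(pr_dotted + pr_solid == 1)       # True: the marginals sum to 1
```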
Marginalizing over "dottedness"

     B          W
D    Pr(D,B)    Pr(D,W)
S    Pr(S,B)    Pr(S,W)

Pr(D,W) + Pr(S,W) = marginal probability of being a white marble
Bayes' rule (cont.)
Bayes' rule in Statistics
D refers to the "observables" (i.e. the Data)! refers to one or more "unobservables"
(i.e. parameters of a model, or the model itself):– tree model (i.e. tree topology)– substitution model (e.g. JC, F84, GTR, etc.)– parameter of a substitution model (e.g. a branch length,
a base frequency, transition/transversion rate ratio, etc.)– hypothesis (i.e. a special case of a model)– a latent variable (e.g. ancestral state)
Pr(θ|D) = Pr(D|θ) Pr(θ) / Σ_θ Pr(D|θ) Pr(θ)

Posterior probability of hypothesis θ = (likelihood of hypothesis θ) × (prior probability of hypothesis θ) / (marginal probability of the data, marginalizing over hypotheses)
Simple (albeit silly) paternity example

Possibilities        θ1     θ2     Row sum
Genotypes            AA     Aa     ---
Prior                1/2    1/2    1
Likelihood           1      1/2    ---
Prior × Likelihood   1/2    1/4    3/4
Posterior            2/3    1/3    1

θ1 and θ2 are assumed to be the only possible fathers; the child has genotype Aa and the mother has genotype aa, so the child must have received allele A from the true father. Note: the data in this case is the child's genotype (Aa).
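The posterior row of the table follows from normalizing the prior-times-likelihood row. A sketch of that arithmetic:

```python
from fractions import Fraction

# Prior and likelihood columns from the table: the child is Aa and the
# mother is aa, so the true father must have passed allele A.
prior = {"theta1": Fraction(1, 2), "theta2": Fraction(1, 2)}
likelihood = {"theta1": Fraction(1), "theta2": Fraction(1, 2)}  # AA always passes A; Aa half the time

unnorm = {h: prior[h] * likelihood[h] for h in prior}
marginal = sum(unnorm.values())                    # Pr(data) = 3/4
posterior = {h: unnorm[h] / marginal for h in unnorm}

print(posterior["theta1"], posterior["theta2"])    # 2/3 1/3
```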
The prior can be your friend
Suppose the test for a rare disease is 99% accurate:
Pr(+|disease) = 0.99
Pr(+|healthy) = 0.01
(Here the test result is the datum and disease status is the hypothesis. Note that we do not need to consider the case of a negative test result.)
Suppose further I test positive for the disease. How worried should I be?
It is very tempting to (mis)interpret the likelihood as a posterior probability and conclude "There is a 99% chance that I have the disease."
The posterior probability is 0.99 only if the prior probability of having the disease is 0.5:
Pr(disease|+) = Pr(+|disease) (1/2) / [Pr(+|disease) (1/2) + Pr(+|healthy) (1/2)]
             = (0.99)(1/2) / [(0.99)(1/2) + (0.01)(1/2)]
             = 0.99
If, however, the prior odds against having the disease are a million to 1, then the posterior probability is much more reassuring:
Pr(disease|+) = (0.99)(1/1,000,000) / [(0.99)(1/1,000,000) + (0.01)(999,999/1,000,000)]
             ≈ 0.0001
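Both disease calculations are the same formula with different priors. A small sketch (the function name is mine) makes the comparison explicit:

```python
def posterior_disease(prior_disease, p_pos_disease=0.99, p_pos_healthy=0.01):
    # Bayes' rule: Pr(disease | +) for a given prior Pr(disease).
    numerator = p_pos_disease * prior_disease
    denominator = numerator + p_pos_healthy * (1 - prior_disease)
    return numerator / denominator

# With a 50:50 prior, the posterior equals the tempting "99%" reading:
print(round(posterior_disease(0.5), 6))   # 0.99

# With million-to-1 prior odds against the disease, it is far smaller:
print(round(posterior_disease(1e-6), 6))  # about 0.0001
```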
An important caveat
This (rare disease) example involves a tiny amount of data (one observation) and an extremely informative prior, and gives the impression that maximum likelihood (ML) inference is not very reliable.
However, in phylogenetics, we often have lots of data and use much less informative priors, so in phylogenetics ML inference is generally very reliable.
Discrete vs. Continuous
• So far, we've been dealing with discrete hypotheses (e.g. either this father or that father, have disease or don’t have disease)
• In phylogenetics, substitution models represent an infinite number of hypotheses (each combination of parameter values is in some sense a separate hypothesis)
• How do we use Bayes' rule when our hypotheses form a continuum?
Bayes' rule: continuous case
f(θ|D) = f(D|θ) f(θ) / ∫ f(D|θ) f(θ) dθ

Posterior probability density = (likelihood) × (prior probability density) / (marginal probability of the data)
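For a one-parameter model the integral in the denominator can be approximated on a grid. This sketch applies the continuous Bayes' rule to the coin-flipping likelihood with a uniform prior; the data (6 heads in 10 flips) are assumed for illustration:

```python
from math import comb

# Assumed data: y = 6 heads in n = 10 flips; uniform prior density f(p) = 1.
n, y = 10, 6
dp = 1 / 1000
grid = [i / 1000 for i in range(1001)]  # p from 0.0 to 1.0

likelihood = [comb(n, y) * p**y * (1 - p)**(n - y) for p in grid]
prior = [1.0] * len(grid)
unnorm = [l * f for l, f in zip(likelihood, prior)]

# The denominator of Bayes' rule: the integral of f(D|p) f(p) dp,
# approximated as a Riemann sum over the grid.
marginal = sum(unnorm) * dp
posterior = [u / marginal for u in unnorm]

# The posterior density integrates to 1 and peaks at p = y/n = 0.6.
print(round(sum(posterior) * dp, 6))          # 1.0
print(grid[posterior.index(max(posterior))])  # 0.6
```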
If you had to guess...
Not knowing anything about my archery abilities, draw a curve representing your view of the chances of my arrow landing a distance d from the center of the target (if it helps, I'm standing 50 meters away from the target).
[Figure: a 1-meter target; d is the distance from the arrow to the target center.]
Case 1: assume I have talent
An informative prior (low variance) that says most of my arrows will fall within 10 cm of the center (thanks for your confidence!).
[Figure: a narrow density over distance in centimeters from target center; axis 0.0 to 80.0.]
Case 2: assume I have a talent for missing the target!
Also an informative prior, but one that says most of my arrows will fall within a narrow range just outside the entire target!
[Figure: a narrow density concentrated just beyond the target's edge; axis in cm from target center, 0.0 to 60.0.]
Case 3: assume I have no talent
This is a vague prior: its high variance reflects nearly total ignorance of my abilities, saying that my arrows could land nearly anywhere!
[Figure: a broad, nearly flat density; axis in cm from target center, 0.0 to 80.0.]
A matter of scale
Notice that I haven't provided a scale for the vertical axis. What exactly does the height of this curve mean? For example, does the height of the dotted line represent the probability that my arrow lands 60 cm from the center of the target?
Probabilities apply to intervals
Probabilities are attached to intervals (i.e. ranges of values), not individual values. The probability of any given point (e.g. d = 60.0) is zero! However, we can ask about the probability that d falls in a particular range, e.g. 50.0 < d < 65.0.
Probabilities vs. probability densities
Probability density function. Note: the height of this curve does not represent a probability (if it did, it would not exceed 1.0).
Integration of densities
The density curve is scaled so that the value of the integral over its entire range (i.e. the total area under the curve) equals 1.0.
Integration of a probability density yields a probability
The area under the density curve from 0 to 2 is the probability that θ is less than 2.
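The area-under-the-curve statement can be checked numerically. The slides do not name the density, so this sketch assumes a standard exponential density, for which the area from 0 to 2 is exactly 1 − exp(−2):

```python
from math import exp

# Assumed density: f(x) = exp(-x) for x >= 0 (a standard exponential),
# whose total area is 1. The area from 0 to 2 is Pr(theta < 2).
def density(x):
    return exp(-x)

# Midpoint-rule numerical integration from 0 to 2:
steps = 100_000
width = 2 / steps
area = sum(density((i + 0.5) * width) * width for i in range(steps))

print(round(area, 6))         # 0.864665
print(round(1 - exp(-2), 6))  # 0.864665, the exact answer
```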
Archery Priors Revisited
[Figure: three gamma density curves, with mean=2.5, var=3.125; mean=60, var=3; and mean=200, var=40000.]
These density curves are all variations of a gamma probability distribution. We could have used a gamma distribution to specify each of the prior probability distributions for the archery example. Note that higher variance means less informative.
The posterior is generally more informative than the prior (data contains information)
[Figure: a uniform prior density for p alongside a peaked posterior density; a shaded area under the posterior represents a posterior probability (mass).]
Mathematica example
Beta prior gives more flexibility
[Figure: Beta(2,2) prior density and the resulting posterior density for p.]
Posterior probability of p between 0.45 and 0.55 is 0.223.
Note:
• prior is vague but not flat
• posterior has smaller variance (i.e. it is more informative) than the prior
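The interval probability quoted on the slide comes from integrating the posterior density over 0.45 < p < 0.55. The slide's underlying coin data are not shown, so this sketch assumes 6 heads in 10 flips to illustrate the method (it will not reproduce the 0.223 figure exactly):

```python
from math import comb

# Assumed data: y = 6 heads in n = 10 flips (for illustration only).
n, y = 10, 6
dp = 1 / 10000
grid = [i * dp for i in range(1, 10000)]  # open interval (0, 1)

beta22 = [6 * p * (1 - p) for p in grid]  # Beta(2,2) prior density
likelihood = [comb(n, y) * p**y * (1 - p)**(n - y) for p in grid]
unnorm = [f * l for f, l in zip(beta22, likelihood)]
marginal = sum(unnorm) * dp
posterior = [u / marginal for u in unnorm]

# Posterior probability that 0.45 < p < 0.55 (area under the posterior):
prob = sum(d * dp for p, d in zip(grid, posterior) if 0.45 < p < 0.55)
print(round(prob, 3))
```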
Usually there are many parameters...
[The continuous Bayes' rule again, now for a 2-parameter example: posterior probability density = likelihood × prior probability density / marginal probability of the data.]
An analysis of 100 sequences under the simplest model (JC69) requires 197 branch length parameters. The denominator is a 197-fold integral in this case! Now consider summing over all possible tree topologies! It would thus be nice to avoid having to calculate the marginal probability of the data...
ISC5317-Fall 2007-PB Computational Evolutionary Biology
The more data we accumulate, the more the posterior will be influenced by the likelihood. Eventually, the posterior will be determined entirely by the likelihood. For instance, if 10,000 throws still indicated that the coin had a heads probability of 0.45, then the posterior would spike on that value even for a Beta(100, 100) prior distribution that put a lot of probability on fair coins (p = 0.5).
6 Accumulating scientific knowledge
One of the most attractive aspects of Bayesian theory is that it provides a rational and consistent
framework for accumulating scientific knowledge. According to the Bayesian view, knowledge is
accumulated by successively updating probability distributions to take new data into account. In
fact, we can show that a successive series of Bayesian updates based on stepwise accumulation of
evidence is mathematically equivalent to a single update based on all the evidence. Assume that
we start with an initial prior P (θ) and an initial set of observations D1. The posterior probability
distribution after this step would be:
P(θ|D1) = P(θ) P(D1|θ) / P(D1)
If we use this posterior probability distribution as the prior in the next step, where we take a second
set of observations D2 into account, we get:
P(θ|D2, D1) = P(θ|D1) P(D2|θ) / P(D2)
            = P(θ) P(D1|θ) P(D2|θ) / [P(D1) P(D2)]
            = P(θ) P(D1, D2|θ) / P(D1, D2)
The last rearrangement can be made because D1 and D2 are independent sets of observations.
Note that the last equation is the same as applying Bayes’ theorem directly on the collected set
of observations, D1 + D2. In other words, there is a rational procedure for accumulating scientific
knowledge in the Bayesian framework and the end result is the same regardless of the order or
exact way in which the data are added as long as the posterior probability distribution obtained in
one analysis is used as the prior in the next analysis. In maximum likelihood analyses, background
knowledge is typically ignored.
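The stepwise-equals-batch result is easy to see in the conjugate beta-binomial case, where each update simply adds counts to the Beta parameters. The data sets in this sketch are invented for illustration:

```python
# Beta-binomial conjugate updating: a Beta(a, b) prior plus h heads and
# t tails yields a Beta(a + h, b + t) posterior.
def update(prior, heads, tails):
    a, b = prior
    return (a + heads, b + tails)

prior = (1, 1)   # Beta(1,1): uniform prior
d1 = (3, 2)      # first data set: 3 heads, 2 tails
d2 = (6, 9)      # second data set: 6 heads, 9 tails

# Stepwise: the posterior from d1 becomes the prior for d2.
stepwise = update(update(prior, *d1), *d2)

# Batch: a single update on the pooled data.
batch = update(prior, d1[0] + d2[0], d1[1] + d2[1])

print(stepwise == batch)  # True: same posterior either way
print(stepwise)           # (10, 12)
```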
Accumulation of scientific knowledge