what is the probability that of 10 newborn babies at least 7 are boys? p(girl) = p(boy) = 0.5...
Post on 19-Dec-2015
What is the probability that of 10 newborn babies at least 7 are boys?
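$$p(k) = \binom{n}{k} p^k q^{n-k}$$
$$p(k > 6) = \binom{10}{7} 0.5^{7}\, 0.5^{3} + \binom{10}{8} 0.5^{8}\, 0.5^{2} + \binom{10}{9} 0.5^{9}\, 0.5^{1} + \binom{10}{10} 0.5^{10}\, 0.5^{0} = 0.172$$
[Figure: binomial distribution p(X) for n = 10 and p = 0.5; x-axis X from 0 to 10, y-axis p(X) from 0 to 0.3]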
p(girl) = p(boy) = 0.5
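The newborn question can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `binomial_pmf` and `at_least` are illustrative and not part of the lecture.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def at_least(k_min, n, p):
    """Probability of k_min or more successes in n trials."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))


# At least 7 boys among 10 newborns with p(boy) = 0.5
p_boys = at_least(7, 10, 0.5)
print(round(p_boys, 3))  # 0.172
```

Summing the four terms k = 7, 8, 9, 10 gives 176/1024, which rounds to the 0.172 stated on the slide.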
Lecture 10: Important statistical distributions
$$\sum_{i=0}^{n} p_i = 1$$
Bernoulli distribution
$$p(k) = \binom{n}{k} p^k q^{n-k}$$
$$F(k) = p(x \le k) = \sum_{x=0}^{k} \binom{n}{x} p^x q^{n-x}$$
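The Bernoulli or binomial distribution comes from the expansion of the binomial (p + q)^n
$$(p+q)^n = \sum_{i=0}^{n} \binom{n}{i} p^i q^{n-i}$$
$$\sum_{i=0}^{n} \binom{n}{i} p^i (1-p)^{n-i} = (p + (1-p))^n = 1$$
$$\mu = np, \qquad \sigma^2 = npq$$
[Figure: binomial distribution for n = 10 and p = 0.2; x-axis k from 0 to 10, y-axis f(p) from 0 to 0.35]
$$p(k) = \binom{10}{k} 0.2^{k}\, 0.8^{10-k}$$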
Bernoulli or binomial distribution
Assume the probability to find a certain disease in a tree population is 0.01. A bio-monitoring program surveys 10 stands of trees and takes in each case a random sample of
100 trees. How large is the probability that in these stands 1, 2, 3, and more than 3 cases of this disease will occur?
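Mean, variance, standard deviation
$$\mu = np = 1000 \cdot 0.01 = 10$$
$$\sigma^2 = npq = 1000 \cdot 0.01 \cdot 0.99 = 9.9$$
$$\sigma = \sqrt{9.9} = 3.146$$
$$p(1) = \binom{1000}{1} 0.01^{1} \cdot 0.99^{999} = 0.0004$$
$$p(2) = \binom{1000}{2} 0.01^{2} \cdot 0.99^{998} = 0.0022$$
$$p(3) = \binom{1000}{3} 0.01^{3} \cdot 0.99^{997} = 0.0074$$
$$p(k > 3) = 1 - p(k \le 3) = 1 - \sum_{i=0}^{3} \binom{1000}{i} 0.01^{i}\, 0.99^{1000-i}$$
$$= 1 - \left[ \binom{1000}{0} 0.01^{0}\, 0.99^{1000} + \binom{1000}{1} 0.01^{1}\, 0.99^{999} + \binom{1000}{2} 0.01^{2}\, 0.99^{998} + \binom{1000}{3} 0.01^{3}\, 0.99^{997} \right] = 0.99$$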
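The binomial solution of the tree survey (n = 10 stands x 100 trees = 1000, p = 0.01) can be reproduced with a short sketch. The variable names are illustrative; only the standard library is assumed.

```python
from math import comb, sqrt

n, p = 1000, 0.01  # 10 stands x 100 trees, disease probability 0.01
q = 1 - p

mean = n * p                   # 10
variance = n * p * q           # 9.9
std_dev = sqrt(variance)       # about 3.146


def pmf(k):
    """Binomial probability of exactly k infected trees."""
    return comb(n, k) * p**k * q ** (n - k)


# Probability of more than three cases is the complement of 0..3 cases
p_more_than_3 = 1 - sum(pmf(k) for k in range(4))

for k in (1, 2, 3):
    print(k, round(pmf(k), 4))
print("more than 3:", round(p_more_than_3, 2))
```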
What happens if the number of trials n becomes larger and larger and the event probability p becomes smaller and smaller?
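With p = λ/(r + λ) and q = r/(r + λ):
$$p(X=k) = \frac{(k+r-1)!}{k!\,(r-1)!} \left(\frac{\lambda}{r+\lambda}\right)^{k} \left(\frac{r}{r+\lambda}\right)^{r} = \frac{\lambda^k}{k!} \cdot \frac{(k+r-1)!}{(r-1)!\,(r+\lambda)^k} \cdot \frac{1}{\left(1+\frac{\lambda}{r}\right)^{r}}$$
$$\lim_{r \to \infty} \frac{1}{\left(1+\frac{\lambda}{r}\right)^{r}} = e^{-\lambda}$$
$$\lim_{r \to \infty} \frac{(k+r-1)!}{(r-1)!\,(r+\lambda)^k} = 1$$
$$p(X=k) = \frac{\lambda^k}{k!} e^{-\lambda}$$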
Poisson distribution
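$$p(k) = \binom{n}{k} p^k q^{n-k}$$
$$p = \frac{\lambda}{r+\lambda}, \qquad q = 1 - p = \frac{r}{r+\lambda}, \qquad \lambda = \frac{rp}{1-p}$$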
The distribution of rare events
Assume the probability to find a certain disease in a tree population is 0.01. A bio-monitoring program surveys 10 stands of trees and takes in each case a random sample of
100 trees. How large is the probability that in these stands 1, 2, 3, and more than 3 cases of this disease will occur?
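$$\lambda = np = 1000 \cdot 0.01 = 10$$
Poisson solution
$$p(1) = \frac{10^{1}}{1!} e^{-10} = 0.00045$$
$$p(2) = \frac{10^{2}}{2!} e^{-10} = 0.0023$$
$$p(3) = \frac{10^{3}}{3!} e^{-10} = 0.0076$$
Bernoulli solution
$$p(1) = 0.0004$$
$$p(2) = 0.0022$$
$$p(3) = 0.0074$$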
The probability that no infected tree will be detected
$$p(0) = \frac{10^{0}}{0!} e^{-10} = e^{-10} = 0.000045$$
The probability of more than three infected trees
$$p(k \le 3) = p(0) + p(1) + p(2) + p(3) = 0.000045 + 0.00045 + 0.0023 + 0.0076 = 0.0104$$
$$p(k > 3) = 1 - p(k \le 3) = 1 - 0.0104 = 0.9896$$
Bernoulli solution
$$p(k > 3) = 0.99$$
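The Poisson solution can be recomputed as a check on the arithmetic. This is a sketch with an illustrative helper `poisson_pmf`, assuming λ = np = 10 as in the example.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events with mean lam."""
    return lam**k / factorial(k) * exp(-lam)


lam = 1000 * 0.01  # lambda = n * p = 10

# Probabilities of 0, 1, 2 and 3 infected trees
probs = [poisson_pmf(k, lam) for k in range(4)]

# More than three infected trees: complement of 0..3
p_more_than_3 = 1 - sum(probs)

for k, pk in enumerate(probs):
    print(k, round(pk, 6))
print("more than 3:", round(p_more_than_3, 2))
```

The four terms sum to about 0.010, so the Poisson and the Bernoulli solution agree at p(k > 3) of about 0.99.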
[Figure: Poisson distributions p(k) for λ = 1, 2, 3, 4, and 6; x-axis k from 0 to 13, y-axis p(k) from 0 to 0.4]
Variance, mean
$$\sigma^2 = \mu = \lambda$$
Skewness
$$\gamma = \frac{1}{\sqrt{\lambda}}$$
What is the probability that the Duży Lotek jackpot cumulates (nobody wins) three times in a row if 14 000 000 people bet the first time, 20 000 000 the second time, and 30 000 000 the third time?
The probability to win is
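$$p(6) = \frac{6!\,43!}{49!} \approx \frac{1}{14000000}$$
$$\lambda_1 = 14000000 \cdot \frac{1}{14000000} = 1$$
$$\lambda_2 = 20000000 \cdot \frac{1}{14000000} = 1.428571$$
$$\lambda_3 = 30000000 \cdot \frac{1}{14000000} = 2.142857$$
$$p_1(0) = \frac{1^{0}}{0!} e^{-1} = 0.368$$
$$p_2(0) = \frac{1.428571^{0}}{0!} e^{-1.428571} = 0.239$$
$$p_3(0) = \frac{2.142857^{0}}{0!} e^{-2.142857} = 0.117$$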
The events are independent:
$$p_{1,2,3} = 0.368 \cdot 0.239 \cdot 0.117 = 0.01$$
The zero term of the Poisson distribution gives the probability of no event.
The probability of at least one event:
$$p(k \ge 1) = 1 - e^{-\lambda}$$
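The lottery example can be sketched as follows. Note one deliberate difference from the slide: the sketch uses the exact number of combinations C(49, 6) = 13 983 816 instead of the rounded 14 000 000, so the individual values differ slightly in the third decimal. All variable names are illustrative.

```python
from math import comb, exp

combinations = comb(49, 6)   # 13 983 816 possible tickets
p_win = 1 / combinations     # probability that a single bet wins

bets = [14_000_000, 20_000_000, 30_000_000]

# Expected number of winners per draw (lambda) and the zero term
# of the Poisson distribution, i.e. the probability of no winner.
lambdas = [b * p_win for b in bets]
p_no_winner = [exp(-lam) for lam in lambdas]

# The three draws are independent, so the probabilities multiply.
p_three_cumulations = 1.0
for pk in p_no_winner:
    p_three_cumulations *= pk

# Probability of at least one winner in the first draw: 1 - e^(-lambda)
p_at_least_one_first = 1 - p_no_winner[0]

print([round(pk, 3) for pk in p_no_winner], round(p_three_cumulations, 3))
```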
T→CTCA→GAG→C→GTG→C→AAACG
TTCA→GAGTGCCCT
Single substitution
Parallel substitution
Back substitution
Multiple substitution
Probabilities of DNA substitution
We assume equal substitution probabilities. Let the probability of each particular substitution be p:
[Figure: the four nucleotides A, T, C, and G; every pair is connected by a substitution with probability p]
The probability that A mutates to T, C, or G is
$$P_{\neg A} = p + p + p = 3p$$
The probability of no mutation is
$$p_A = 1 - 3p$$
Independent events
$$p(A \wedge B) = p(A)\,p(B)$$
The probability that A mutates to T and C mutates to G is
$$P_{AC} = p \cdot p$$
$$p(A \to T) + p(A \to C) + p(A \to G) + p(A \to A) = 1$$
The construction of evolutionary trees from DNA sequence data
The probability matrix (rows and columns in the order A, T, C, G)
$$P = \begin{pmatrix} 1-3p & p & p & p \\ p & 1-3p & p & p \\ p & p & 1-3p & p \\ p & p & p & 1-3p \end{pmatrix}$$
What is the probability that after 5 generations A did not change?
$$p_5 = (1 - 3p)^5$$
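A small sketch may help to separate two related quantities: the probability that A never changed in five generations, (1 - 3p)^5, and the probability that the site shows A again after five generations, which is the diagonal element of the fifth power of the probability matrix and also counts back substitutions. The value p = 0.01 is a hypothetical choice for illustration, and the helper names are not from the lecture.

```python
def substitution_matrix(p):
    """4x4 probability matrix: 1 - 3p on the diagonal, p elsewhere."""
    return [[1 - 3 * p if i == j else p for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def matrix_power(m, n):
    """n-th power of a 4x4 matrix by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(n):
        result = matmul(result, m)
    return result


p = 0.01  # hypothetical substitution probability per generation

never_changed = (1 - 3 * p) ** 5          # A stayed A in every generation
P5 = matrix_power(substitution_matrix(p), 5)
same_after_5 = P5[0][0]                   # A is A again, back substitutions included

print(round(never_changed, 4), round(same_after_5, 4))
```

The second value is slightly larger because a site can leave A and return to it within five generations.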
The Jukes - Cantor model (JC69) now assumes that all substitution probabilities are equal.
Arrhenius model
The Jukes Cantor model assumes equal substitution probabilities within these 4 nucleotides.
Substitution probability after time t
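$$P(t) = \begin{pmatrix}
\frac{1}{4}+\frac{3}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} \\
\frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}+\frac{3}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} \\
\frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}+\frac{3}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} \\
\frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}-\frac{1}{4}e^{-4pt} & \frac{1}{4}+\frac{3}{4}e^{-4pt}
\end{pmatrix}$$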
Transition matrix
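$$P = \begin{pmatrix} 1-3p & p & p & p \\ p & 1-3p & p & p \\ p & p & 1-3p & p \\ p & p & p & 1-3p \end{pmatrix}$$
$$P(t) = P(0)^{t}$$
$$\frac{dP(t)}{dt} = \Lambda P(t) \quad \Rightarrow \quad P(t) = P(0)\, e^{\Lambda t}$$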
Substitution matrix
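[Figure: nucleotide A at time 0 becomes A, T, G, or C after time t]
The probability that nothing changes is the zero term of the Poisson distribution
$$P(\text{no substitution}) = e^{-\lambda} = e^{-4pt}$$
The probability of at least one substitution is
$$P(\text{at least one substitution}) = 1 - e^{-\lambda} = 1 - e^{-4pt}$$
The probability to reach a nucleotide from any other is
$$P(A \to T, G, C, A) = \frac{1}{4}\left(1 - e^{-4pt}\right)$$
The probability that a nucleotide doesn’t change after time t is
$$P(A \to A \mid T, C, G, A) = 1 - 3\left(\frac{1}{4}\left(1 - e^{-4pt}\right)\right) = \frac{1}{4} + \frac{3}{4} e^{-4pt}$$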
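Probability for a single difference
$$P(A \to T, C, G) = 3\left(\frac{1}{4}\left(1 - e^{-4pt}\right)\right) = \frac{3}{4} - \frac{3}{4} e^{-4pt}$$
What is the probability of x differences among n nucleotides after time t?
In the binomial below, the success probability p is the probability of a single difference given above.
$$p(x,t) = \binom{n}{x} p^{x} (1-p)^{n-x} = \binom{n}{x} \left(\frac{3}{4} - \frac{3}{4} e^{-4pt}\right)^{x} \left(1 - \left(\frac{3}{4} - \frac{3}{4} e^{-4pt}\right)\right)^{n-x}$$
$$\ln p(x,t) = \ln\binom{n}{x} + x \ln p + (n-x) \ln(1-p) = \ln\binom{n}{x} + x \ln\left(\frac{3}{4} - \frac{3}{4} e^{-4pt}\right) + (n-x) \ln\left(\frac{1}{4} + \frac{3}{4} e^{-4pt}\right)$$
$$pt = -\frac{1}{4} \ln\left(1 - \frac{4x}{3n}\right)$$
This is the mean time to get x different sites from a sequence of n nucleotides. It is also a measure of distance that depends only on the number of substitutions.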
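The distance formula pt = -1/4 ln(1 - 4x/(3n)) is easy to evaluate. The sketch below assumes x observed differences among n aligned nucleotides; the function name and the example numbers (10 differences in 100 sites) are hypothetical.

```python
from math import log


def jukes_cantor_pt(x, n):
    """Estimate pt from x differing sites among n aligned nucleotides.

    Implements pt = -1/4 * ln(1 - 4x / (3n)). The formula is only
    defined while the fraction of differing sites is below 3/4.
    """
    fraction = x / n
    if not 0 <= fraction < 0.75:
        raise ValueError("fraction of differing sites must be in [0, 0.75)")
    return -0.25 * log(1 - 4 * fraction / 3)


# Hypothetical example: 10 differences in a sequence of 100 nucleotides
print(round(jukes_cantor_pt(10, 100), 4))  # 0.0358
```

The estimate grows faster than the raw fraction of differences x/n, because multiple and back substitutions hide part of the changes.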
We use the principle of maximum likelihood and the Bernoulli distribution
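[Figure: binomial distribution for n = 10 and p = 0.2; x-axis k from 0 to 10, y-axis f(p) from 0 to 0.35]
$$p(k) = \binom{10}{k} 0.2^{k}\, 0.8^{10-k}$$
[Figure: phylogenetic tree of Gorilla, Pan paniscus, Pan troglodytes, Homo sapiens, and Homo neandertalensis plotted against time]
$$pt = -\frac{1}{4} \ln\left(1 - \frac{4x}{3n}\right)$$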
Divergence - number of substitutions
Phylogenetic trees are the basis of any systematic classification
A pile model to generate the binomial. If the number of steps is very, very large the binomial becomes smooth.
The normal distribution is the continuous equivalent of the discrete Bernoulli distribution
Abraham de Moivre (1667-1754)
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
$$f(x) = C e^{-(x-\mu)^{2}}$$
If we have a series of random variates Xn, a new random variate Yn that is the sum of all Xn will for n→∞ be a variate that is asymptotically normally distributed.
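The statement about sums of random variates can be illustrated by simulation. The sketch below sums 30 uniform random numbers many times and checks how much of the resulting distribution lies within one and two standard deviations of its mean; the number of terms, the number of samples, and the seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, pstdev


def sample_sums(n_terms, n_samples, seed=1):
    """Draw n_samples sums, each of n_terms uniform(0, 1) variates."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_terms)) for _ in range(n_samples)]


sums = sample_sums(n_terms=30, n_samples=20000)
mu = mean(sums)
sd = pstdev(sums)

# Fractions of the simulated sums within 1 and 2 standard deviations
within_1sd = sum(abs(s - mu) <= sd for s in sums) / len(sums)
within_2sd = sum(abs(s - mu) <= 2 * sd for s in sums) / len(sums)

print(round(mu, 2), round(sd, 2), round(within_1sd, 2), round(within_2sd, 2))
```

Although each single variate is uniform, the fractions come out close to the 0.68 and 0.95 expected for a normal distribution.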
[Figure: six frequency histograms; x-axis X from -2 to 2, y-axis Frequency]
The central limit theorem
[Figure: distributions f(x) against X for n = 10, n = 20, and n = 50]
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
[Figure: normal density f(x) and cumulative distribution F(x); x-axis X from 0 to 5]
$$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(v-\mu)^{2}}{2\sigma^{2}}}\, dv$$
The normal or Gaussian distribution
Mean: μ
Variance: σ²
Important features of the normal distribution
• The function is defined for every real x.
• The frequency at x = μ is given by
$$p(x = \mu) = \frac{1}{\sigma\sqrt{2\pi}} \approx \frac{0.4}{\sigma}$$
• The distribution is symmetrical around μ.
• The points of inflection are given by the second derivative. Setting this to zero gives
$$x = \mu - \sigma, \qquad x = \mu + \sigma$$
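[Figure: normal density f(x) with the intervals from -σ to +σ (area 0.68) and from -2σ to +2σ (area 0.95) marked]
$$\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\, dx = 1$$
$$\frac{1}{\sigma\sqrt{2\pi}} \int_{\mu-\sigma}^{\mu+\sigma} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\, dx = 0.68$$
$$\frac{1}{\sigma\sqrt{2\pi}} \int_{\mu-2\sigma}^{\mu+2\sigma} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\, dx = 0.95$$
$$\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\mu} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\, dx = 0.5$$
$$\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\mu+2\sigma} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\, dx = 0.975$$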
Many statistical tests compare observed values with those of the standard normal distribution and assign
the respective probabilities to H1.
$$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(v-\mu)^{2}}{2\sigma^{2}}}\, dv$$
The Z-transform
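$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
$$Z = \frac{x - \mu}{\sigma}$$
$$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} Z^{2}}$$
The variate Z has a mean of 0 and a variance of 1.
A Z-transform standardizes every statistical distribution to a mean of 0 and a variance of 1. Tables of statistical distributions are usually given as Z-transforms.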
The standard normal
The 95% confidence limit
P(μ - σ < X < μ + σ) = 68%
P(μ - 1.65σ < X < μ + 1.65σ) = 90%
P(μ - 1.96σ < X < μ + 1.96σ) = 95%
P(μ - 2.58σ < X < μ + 2.58σ) = 99%
P(μ - 3.29σ < X < μ + 3.29σ) = 99.9%
The Fisherian significance levels
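The multiples of σ listed for the significance levels can be recovered from the standard normal distribution. A minimal sketch with `statistics.NormalDist` from the Python standard library follows; the two helper names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def central_probability(z):
    """P(mu - z*sigma < X < mu + z*sigma) for a normal variate."""
    return std_normal.cdf(z) - std_normal.cdf(-z)


def z_for_central(prob):
    """Multiple of sigma that encloses the central probability prob."""
    return std_normal.inv_cdf(0.5 + prob / 2)


for z in (1, 1.65, 1.96, 2.58, 3.29):
    print(z, round(central_probability(z), 3))

print(round(z_for_central(0.95), 2))  # 1.96
```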
[Figure: standardized normal density f(x) with the intervals from -σ to +σ (area 0.68) and from -2σ to +2σ (area 0.95) marked]
The Z-transformed (standardized) normal distribution
Why is the normal distribution so important?
The normal distribution is often at least approximately found in nature. Many additive or multiplicative processes generate distributions of patterns that are normal. Examples are body sizes, intelligence, abundances, phylogenetic branching patterns, metabolism rates of individuals, plant and animal organ sizes, or egg numbers. Indeed, following the Belgian statistician Adolphe Quetelet (1796-1874), the normal distribution was long even held to be a natural law. However, new studies showed that most often the normal distribution is only an approximation and that real distributions frequently follow more complicated asymmetrical distributions, for instance skewed normals.
The normal distribution follows from the binomial. Hence if we take samples out of a large population of discrete events we expect the distribution of events (their frequency) to be normally distributed.
The central limit theorem holds that means of additive variables should be normally distributed. This is a generalization of the second argument. In other words, the normal is the expectation when dealing with a large number of influencing variables.
Gauß derived the normal distribution from the distribution of errors within his treatment of measurement errors. If we measure the same thing many times our measurements will not always give the same value. Because many factors might influence our measurement errors, the central limit theorem points again to a normal distribution of errors around the mean.
In the next lecture we will see that the normal distribution can be approximated by a number of other important distributions that form the basis of important statistical tests.
[Figure: a population with parameters μ, σ from which several samples are drawn, each with its own mean x̄ and standard deviation s]
The estimation of the population mean from a series of samples
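$$Z = \frac{\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \mu_i}{\sqrt{\sum_{i=1}^{n} \sigma_i^{2}}} = \frac{n\bar{x} - n\mu}{\sigma\sqrt{n}} = \frac{(\bar{x} - \mu)\sqrt{n}}{\sigma}$$
The n samples come from an additive random variate.
Z is asymptotically normally distributed.
$$\mu = \bar{x} \pm Z_{\alpha} \frac{\sigma}{\sqrt{n}}$$
Confidence limit of the estimate of a mean from a series of samples.
α is the desired probability level.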
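The confidence limits of a mean can be sketched as below. Two assumptions are made that the slide does not state: the sample standard deviation s is used in place of the unknown σ, and the normal quantile is used rather than a t quantile, which is only adequate for larger samples. The measurements are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_limits(sample, level=0.95):
    """Normal-approximation confidence limits for the population mean."""
    n = len(sample)
    x_bar = mean(sample)
    standard_error = stdev(sample) / sqrt(n)   # s / sqrt(n), s estimates sigma
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. 1.96 for 95%
    return x_bar - z * standard_error, x_bar + z * standard_error


# Hypothetical measurements
sample = [9.8, 10.1, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
low, high = mean_confidence_limits(sample, level=0.95)
print(round(low, 2), round(high, 2))
```

Because the standard error shrinks with the square root of n, quadrupling the sample size halves the width of the interval.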
[Figure: distribution of sample means f(x); the intervals of one and two standard errors around the mean contain 0.68 and 0.95 of the values]
Standard error
How to apply the normal distribution
Intelligence is approximately normally distributed with a mean of 100 (by definition) and a standard deviation of 16 (in North America). For an intelligence study we need 100 persons with an IQ above 130. How many persons do we have to test to find this number if we take random samples (and do not test university students only)?
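$$F(x > 130) = 1 - \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{130} e^{-\frac{(v-\mu)^{2}}{2\sigma^{2}}}\, dv = \frac{1}{\sigma\sqrt{2\pi}} \int_{130}^{\infty} e^{-\frac{(v-\mu)^{2}}{2\sigma^{2}}}\, dv$$
$$z = \frac{a - \mu}{\sigma} \quad \rightarrow \quad F(x = a)$$
[Figure: normal density f(IQ) for IQ from 40 to 160, divided at 130 into the regions IQ < 130 and IQ > 130]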
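The IQ question can be answered numerically. The sketch below assumes an exactly normal distribution with mean 100 and standard deviation 16, as stated in the example; the variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

# One-sided probability of an IQ above 130 (z = (130 - 100) / 16 = 1.875)
p_above_130 = 1 - iq.cdf(130)

# Expected number of persons to test in order to find 100 such persons
persons_to_test = ceil(100 / p_above_130)

print(round(p_above_130, 4), persons_to_test)
```

Only about 3% of a random sample exceeds an IQ of 130, so roughly 3300 persons have to be tested.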
One and two sided tests
We measure blood sugar concentrations and know that our method estimates the concentration with an error of about 3%. What is the probability that our
measurement deviates from the real value by more than 5%?
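The blood sugar question is a two-sided problem, because a deviation of more than 5% can be upward or downward. The sketch below assumes that the stated error of 3% is the standard deviation of a normally distributed measurement error with mean 0; this interpretation is an assumption, not something the slide states.

```python
from statistics import NormalDist

# Assumption: measurement error ~ Normal(0, 3), in percent of the real value
error = NormalDist(mu=0, sigma=3.0)

# One-sided: measurement is more than 5% too high
p_one_sided = 1 - error.cdf(5.0)

# Two-sided: measurement deviates by more than 5% in either direction
p_two_sided = error.cdf(-5.0) + (1 - error.cdf(5.0))

print(round(p_one_sided, 3), round(p_two_sided, 3))
```

Under these assumptions the two-sided probability is twice the one-sided one, a little under 10%.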
Albinos are rare in human populations. Assume their frequency is 1 per 100000 persons. What is the probability to find 15
albinos among 1000000 persons?
$$p(X = 15) = \binom{1000000}{15} (0.00001)^{15} (0.99999)^{999985}$$
=KOMBINACJE(1000000,15)*0.00001^15*(1-0.00001)^999985 = 0.0347
$$\mu = np, \qquad \sigma^2 = npq$$
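The albino example can be recomputed both with the exact binomial term and with the Poisson approximation for λ = np = 10. This is a sketch with illustrative variable names.

```python
from math import comb, exp, factorial

n, p, k = 1_000_000, 0.00001, 15

# Exact binomial probability of exactly 15 albinos
p_binomial = comb(n, k) * p**k * (1 - p) ** (n - k)

# Poisson approximation with lambda = n * p
lam = n * p
p_poisson = lam**k / factorial(k) * exp(-lam)

mean = n * p                 # mu = np
variance = n * p * (1 - p)   # sigma^2 = npq

print(round(p_binomial, 4), round(p_poisson, 4))
```

For such a rare event both distributions give the same result to four decimals, 0.0347.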
Homework and literature
Refresh:
• Bernoulli distribution
• Poisson distribution
• Normal distribution
• Central limit theorem
• Confidence limits
• One, two sided tests
• Z-transform
Prepare for the next lecture:
• χ² test
• Mendel rules
• t-test
• F-test
• Contingency table
• G-test
Literature:
Łomnicki: Statystyka dla biologów
Mendel: http://en.wikipedia.org/wiki/Mendelian_inheritance
Pearson Chi2 test: http://en.wikipedia.org/wiki/Pearson's_chi-square_test