population genetics ii. bio5488 - 2016genetics.wustl.edu/bio5488/...popgen_2016_lecture2.pdf ·...
Post on 26-Feb-2021
3 Views
Preview:
TRANSCRIPT
Population Genetics II.�Bio5488 - 2016
Don Conraddconrad@genetics.wustl.edu
Agenda
• Population Genetic Inference
– Mutation – Selection – Recombination
The Coalescent Process • “Backward in time process” • Discovered by JFC Kingman,
F. Tajima, R. R. Hudson c. 1980
• DNA sequence diversity is shaped by genealogical history
• Genealogies are unobserved but can be estimated
• Conceptual framework for population genetic inference: mutation, recombination, demographic history
ACTT
ACGT ACGT ACTT ACTT AGTT
T
G
C G
2 sample coalescent MRCA
sequence1 sequence2
N = population size of diploid individuals
n = sample size of haploid chromosomes
MRCA = most recent common ancestor
T2 = coalescence time for 2 chromosomes
T2
2 sample coalescent
P(T2 = t) = 1− 12N
"
#$
%
&'t−1 12N"
#$
%
&'
P(T2 = t) =12N!
"#
$
%&e
−12N!
"#
$
%&t
MRCA
sequence1 sequence2
Probability that the time of MRCA is t generations ago
If we “blur” our eyes a bit (as N gets very large) this becomes
T2
2 sample coalescent E(T2 ) = 2NMRCA
sequence1 sequence2
T2
If we consider t’ = t/2N, call that “coalescent time”
Then E(T2) = 1
Var(T2 ) = 2N2
n-coalescent There are possible pairs, each coalescing at rate 1/2N
n2
!
"#
$
%&=
n(n−1)2
P(Tn = t) =
n2
!
"#
$
%&
2N
!
"
#####
$
%
&&&&&
e
−
n2
!
"##
$
%&&
2N
!
"
#####
$
%
&&&&&
t
E(Tn ) =2Nn2
!
"#
$
%&
n-coalescent
T2
T3
T4
T5
Mean elapsed time (*2N)
1
1/3
1/61/10 = 2/(5*4)
E(TMRCA for n chromosomes)= T2 + T3 + T4 … + Tn = 2(1-1/n) coalescent units
Ne and coalescence times in humans and other animals
In humans, it is known that “appropriate” values for Ne are surprisingly small. This is approximation is called the “effective population size”:
Ne ≈ 10,000 in EuropeNe ≈ 9,500 in East AsiaNe < 25,000 for all human populations, highest in Africa
Simon Myers
Ne and coalescence times in humans and other animals
The mean coalescence time for two lineages is just in units of 2Ne generations, so if we have G=22 years per generation, the average ancestry depth for 2 human chromosomes is
1 ×2Ne ×G in years (20,000-50,000) × 22 = 440,000-1,100,000 years
Ne varies widely across species (Charlesworth, Nature Reviews Genetics 2009):25,000,000 for E.coli2,000,000 for fruit fly D. Melanogaster
1)( 2 =TE
Simon Myers
<100 for Salamanders (Funk et al. 1999)
Adding mutations For neutral models, can separately model the genealogical process (the tree) and the mutation process (genetic types)
-Infinite sites mutation model
What is expected number of mutations between 2 chroms?
t ~ Expo (2N) π ~ Pois(2tμ)
tE(t) = 2N E(π | t) = 2tµ
E(π ) = 2µE(t) = 4Nµ
4Nu comes up repeatedly in population genetics, often referred to as theta
θ = 4Nµ
μ = mutation rate per bp per generation
Number of segregating sites in a sample of size n
θ̂W =S1ii=1
n−1
∑
= iE(ti )i=2
n
∑
= i 4Ni(i−1)#
$%
&
'(
i=2
n
∑
= 4N 1ii=1
n−1
∑
The total time in the tree for a sample of “n” chromosomes is L = 4t4 + 3t3 + 2t2
or in general: L =
Watterson Estimator for population scaled mutation rate
t2
t3t4
i* tii=2
n
∑
E(L) E(S) = µE(L) = 4Nµ 1ii=1
n−1
∑ =θ1ii=1
n−1
∑
Estimators of Theta
θ̂W =S1ii=1
n−1
∑
θ̂T =nn−1
2pi (1−i=1
S
∑ pi )There is also the Tajima estimator – “Average # of pairwise differences”
Watterson estimator is just one approach, uses summary of the data
pi = allele frequency of mutation i�S = number of polymorphic sites
The site frequency spectrum
Under standard neutral model, the site frequency spectrum is a beta distribution with parameters theta/2,theta/2
Expected proportion of sites with count “x” is theta/x
Plot of theoretical SFS for 1Mb
●
●
●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0 10 20 30 40 50
010
020
030
040
0
Derived allele count
Num
ber o
f site
s
The sfs under neutrality and selection
splice-disrupting 621stop-gain 1,654non-synonymous 84,358synonymous 61,155
Daniel MacArthur, Suganti Balasubramaniam
The sfs of genic variants
Sample-based estimators of Θ using the sfs
€
θπ =n2
$
% & '
( )
−1
i n − i( )ξ ii=1
n−1∑
€
θξ e = ξe = ξ1
€
θH =n2
#
$ % &
' (
−1
i2ξ ii=1
n−1∑
€
θL =1n −1
iξ ii=1
n−1∑
Estimator Sensitivity Source
low Watterson (1975)
intermediate Tajima (1989)
singleton Fu and Li (1993)
high Fay and Wu (2000)
high Zeng et al. (2006)
€
θW =11ii=1
n−1∑
ξ ii=1
n−1∑
Sensitivity = the frequency of observed polymorphisms that makes estimates using a given estimator large relative to the others.
Eckert, UCDavis
Tajima’s D
Dt=θπ −θWC
Dt = 0 neutral evolution
Dt > 0 balancing selection, more intermediate variants
Dt < 0 positive selection
Bamshad & Wooding 2001
Nielsen, et al 2007
LCTD
A typical population genomics study design for detecting positive selection.
Akey J M Genome Res. 2009;19:711-722
©2009 by Cold Spring Harbor Laboratory Press
Are humans still experiencing adaptive evolution?
iHS(x), a homozygosity-based statistic• function of test allele frequency• summarizes extent of haplotype homozygosity of derived allele
chromosomes compared to ancestral• built-in control for local recombination rate PLoS Biology 2006
Conflicting evidence of population-specific selection
Coop, et al PLoS 2009
(A) Genic SNPs are more likely than non-genic SNPs to have extreme allele frequency differences
Conflicting evidence of population-specific selection
Coop, et al PLoS 2009
(B) Maximum allele frequency difference is well explained by average genetic distance between populations
Fst
Polygenic model of adaptation
Pritchard, et al Curr Biology, 2010
Linkage Disequilibrium
“nonrandom associations between alleles” • Compare to HWE:
– Under HWE, gametes unite at random. Pr(A,a) = pr(A) * pr(a) where A and a are alleles at same locus
• LD statistics measure to what extent Pr(A,B) = pr(A) * pr(B) when A and B are alleles at different loci
• Example applications: mapping genes and measuring recombination
Linkage Disequilibrium
A/a B/b
Frequencies: PA PB Pa Pb
The 4 allele frequencies
The 4 haplotype frequencies
PAB PAb PaB Pab
A B
A b
a B
a b
Define D = PAB – PAPB = PAB Pab – PAbPaB
• If D != 0 then we have LD • r2 = D/(PAPBPaPb)
Final Aligned Data Set:
Where does LD come from?
l locate disease locus l Unlikely to be among our genotyped markers
è Detect indirectly using available markers
--A--------C--------A----G---X----T---C---A---- --T--------C--------A----G---X----C---C---A---- --A--------G--------G----G---X----C---C---A---- --A--------C--------A----G---X----T---C---A---- --T--------C--------A----G---X----T---C---A---- --T--------C--------A----T---X----T---A---A----
Cases (affected)
Controls (unaffected)
LD-based Case-Control Association Study
--A--------C--------A----T---X----T---A---A---- --A--------G--------G----G---X----C---C---A---- --A--------G--------G----T---X----C---A---A---- --A--------C--------A----G---X----T---C---A---- --T--------C--------A---T---X----T---A---A---- --T--------C--------G----T---X----A---A---A----
Where does array content come from? 2003 2005 2007 2008
10k 500k 1M
Human Genome Project
Phase I HapMap Project
Phase II HapMap
CNV Project
1000 Genomes Project
Number of SNPs on an Affymetrix Chip
Publicly Funded Genomics Projects
The International �HapMap Project
-HapMap project (2002-2008) cost $120 Million USD -Measured variation at 4 million SNPs in 4 populations (90 Europeans, 90 Nigerians, 45 Chinese, 45 Japanese) Results:- Over 2.8 M SNP with > 5% allele frequency-80% of variants within each population can be captured with 30% of the most informative SNPs, “tagSNPS”-Nigerians require most tag SNPs, followed by Europeans and then Asians
HapMap tagSNPs are useful for other populations
Conrad, et al. (2006) genotyped 3000 SNPs in 52 populations across the globe (the “Human Genome Diversity Panel” or HGDP)
C+J: Asian, CEU: European, YRI: Nigerian
The serial bottleneck model
The recombination rate
Can vary hugely along a sequence
Determines association between loci in the population
Is hard to measure directly, because recombination occurs on average only ~1 in 100,000,000 meioses between any pair of successive nucleotides in the genome.
Can be measured indirectly, by parametric analysis of variation dataResearchers in Oxford, and elsewhere, have developed such parametric approaches (Li and Stephens, 2003; Ptak et al. 2005; Hudson 2001, McVean 2002, McVean et al. 2004)
Now, we are considering ρ = 4Nr
Marginal Trees
Time
Position A Position B Position C
Physical position
L(4Nr) = L(ni,nj,nij | 4Nrij )ij∑
Calculated a “composite likelihood” for a sample of haplotypes:
-sum over pairs-lookup table-RJMCMC
Nature Genet 2008
Science 2009
Nat Genet 2010
top related