population genetics ii. bio5488 - 2016genetics.wustl.edu/bio5488/...popgen_2016_lecture2.pdf ·...

Population Genetics II.�Bio5488 - 2016

Don Conraddconrad@genetics.wustl.edu

Agenda

•  Population Genetic Inference

– Mutation – Selection – Recombination

The Coalescent Process •  “Backward in time process” •  Discovered by JFC Kingman,

F. Tajima, R. R. Hudson c. 1980

•  DNA sequence diversity is shaped by genealogical history

•  Genealogies are unobserved but can be estimated

•  Conceptual framework for population genetic inference: mutation, recombination, demographic history

ACGT ACGT ACTT ACTT AGTT

2 sample coalescent MRCA

sequence1 sequence2

N = population size of diploid individuals

n = sample size of haploid chromosomes

MRCA = most recent common ancestor

T2 = coalescence time for 2 chromosomes

2 sample coalescent

P(T2 = t) = 1− 12N

&'t−1 12N"

P(T2 = t) =12N!

−12N!

sequence1 sequence2

Probability that the time of MRCA is t generations ago

If we “blur” our eyes a bit (as N gets very large) this becomes

2 sample coalescent E(T2 ) = 2NMRCA

sequence1 sequence2

If we consider t’ = t/2N, call that “coalescent time”

Then E(T2) = 1

Var(T2 ) = 2N2

n-coalescent There are possible pairs, each coalescing at rate 1/2N

n(n−1)2

P(Tn = t) =

E(Tn ) =2Nn2

n-coalescent

Mean elapsed time (*2N)

1/61/10 = 2/(5*4)

E(TMRCA for n chromosomes)= T2 + T3 + T4 … + Tn = 2(1-1/n) coalescent units

Ne and coalescence times in humans and other animals

In humans, it is known that “appropriate” values for Ne are surprisingly small. This is approximation is called the “effective population size”:

Ne ≈ 10,000 in EuropeNe ≈ 9,500 in East AsiaNe < 25,000 for all human populations, highest in Africa

Simon Myers

Ne and coalescence times in humans and other animals

The mean coalescence time for two lineages is just in units of 2Ne generations, so if we have G=22 years per generation, the average ancestry depth for 2 human chromosomes is

1 ×2Ne ×G in years (20,000-50,000) × 22 = 440,000-1,100,000 years

Ne varies widely across species (Charlesworth, Nature Reviews Genetics 2009):25,000,000 for E.coli2,000,000 for fruit fly D. Melanogaster

1)( 2 =TE

Simon Myers

<100 for Salamanders (Funk et al. 1999)

Adding mutations For neutral models, can separately model the genealogical process (the tree) and the mutation process (genetic types)

-Infinite sites mutation model

What is expected number of mutations between 2 chroms?

t ~ Expo (2N) π ~ Pois(2tμ)

tE(t) = 2N E(π | t) = 2tµ

E(π ) = 2µE(t) = 4Nµ

4Nu comes up repeatedly in population genetics, often referred to as theta

θ = 4Nµ

μ = mutation rate per bp per generation

Number of segregating sites in a sample of size n

θ̂W =S1ii=1

= iE(ti )i=2

= i 4Ni(i−1)#

= 4N 1ii=1

The total time in the tree for a sample of “n” chromosomes is L = 4t4 + 3t3 + 2t2

or in general: L =

Watterson Estimator for population scaled mutation rate

i* tii=2

E(L) E(S) = µE(L) = 4Nµ 1ii=1

∑ =θ1ii=1

Estimators of Theta

θ̂W =S1ii=1

θ̂T =nn−1

2pi (1−i=1

∑ pi )There is also the Tajima estimator – “Average # of pairwise differences”

Watterson estimator is just one approach, uses summary of the data

pi = allele frequency of mutation i�S = number of polymorphic sites

The site frequency spectrum

Under standard neutral model, the site frequency spectrum is a beta distribution with parameters theta/2,theta/2

Expected proportion of sites with count “x” is theta/x

Plot of theoretical SFS for 1Mb

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 10 20 30 40 50

Derived allele count

f site

The sfs under neutrality and selection

splice-disrupting 621stop-gain 1,654non-synonymous 84,358synonymous 61,155

Daniel MacArthur, Suganti Balasubramaniam

The sfs of genic variants

Sample-based estimators of Θ using the sfs

θπ =n2

i n − i( )ξ ii=1

n−1∑

θξ e = ξe = ξ1

θH =n2

i2ξ ii=1

n−1∑

θL =1n −1

iξ ii=1

n−1∑

Estimator Sensitivity Source

low Watterson (1975)

intermediate Tajima (1989)

singleton Fu and Li (1993)

high Fay and Wu (2000)

high Zeng et al. (2006)

θW =11ii=1

n−1∑

ξ ii=1

n−1∑

Sensitivity = the frequency of observed polymorphisms that makes estimates using a given estimator large relative to the others.

Eckert, UCDavis

Tajima’s D

Dt=θπ −θWC

Dt = 0 neutral evolution

Dt > 0 balancing selection, more intermediate variants

Dt < 0 positive selection

Bamshad & Wooding 2001

Nielsen, et al 2007

A typical population genomics study design for detecting positive selection.

Akey J M Genome Res. 2009;19:711-722

Are humans still experiencing adaptive evolution?

iHS(x), a homozygosity-based statistic•  function of test allele frequency•  summarizes extent of haplotype homozygosity of derived allele

chromosomes compared to ancestral•  built-in control for local recombination rate PLoS Biology 2006

Conflicting evidence of population-specific selection

Coop, et al PLoS 2009

(A) Genic SNPs are more likely than non-genic SNPs to have extreme allele frequency differences

Conflicting evidence of population-specific selection

Coop, et al PLoS 2009

(B) Maximum allele frequency difference is well explained by average genetic distance between populations

Polygenic model of adaptation

Pritchard, et al Curr Biology, 2010

Linkage Disequilibrium

“nonrandom associations between alleles” •  Compare to HWE:

–  Under HWE, gametes unite at random. Pr(A,a) = pr(A) * pr(a) where A and a are alleles at same locus

•  LD statistics measure to what extent Pr(A,B) = pr(A) * pr(B) when A and B are alleles at different loci

•  Example applications: mapping genes and measuring recombination

Linkage Disequilibrium

A/a B/b

Frequencies: PA PB Pa Pb

The 4 allele frequencies

The 4 haplotype frequencies

PAB PAb PaB Pab

Define D = PAB – PAPB = PAB Pab – PAbPaB

•  If D != 0 then we have LD •  r2 = D/(PAPBPaPb)

Final Aligned Data Set:

Where does LD come from?

l  locate disease locus l  Unlikely to be among our genotyped markers

è  Detect indirectly using available markers

--A--------C--------A----G---X----T---C---A---- --T--------C--------A----G---X----C---C---A---- --A--------G--------G----G---X----C---C---A---- --A--------C--------A----G---X----T---C---A---- --T--------C--------A----G---X----T---C---A---- --T--------C--------A----T---X----T---A---A----

Cases (affected)

Controls (unaffected)

LD-based Case-Control Association Study

--A--------C--------A----T---X----T---A---A---- --A--------G--------G----G---X----C---C---A---- --A--------G--------G----T---X----C---A---A---- --A--------C--------A----G---X----T---C---A---- --T--------C--------A---T---X----T---A---A---- --T--------C--------G----T---X----A---A---A----

Where does array content come from? 2003 2005 2007 2008

10k 500k 1M

Human Genome Project

Phase I HapMap Project

Phase II HapMap

CNV Project

1000 Genomes Project

Number of SNPs on an Affymetrix Chip

Publicly Funded Genomics Projects

The International �HapMap Project

-HapMap project (2002-2008) cost $120 Million USD -Measured variation at 4 million SNPs in 4 populations (90 Europeans, 90 Nigerians, 45 Chinese, 45 Japanese) Results:- Over 2.8 M SNP with > 5% allele frequency-80% of variants within each population can be captured with 30% of the most informative SNPs, “tagSNPS”-Nigerians require most tag SNPs, followed by Europeans and then Asians

HapMap tagSNPs are useful for other populations

Conrad, et al. (2006) genotyped 3000 SNPs in 52 populations across the globe (the “Human Genome Diversity Panel” or HGDP)

C+J: Asian, CEU: European, YRI: Nigerian

The serial bottleneck model

The recombination rate

Can vary hugely along a sequence

Determines association between loci in the population

Is hard to measure directly, because recombination occurs on average only ~1 in 100,000,000 meioses between any pair of successive nucleotides in the genome.

Can be measured indirectly, by parametric analysis of variation dataResearchers in Oxford, and elsewhere, have developed such parametric approaches (Li and Stephens, 2003; Ptak et al. 2005; Hudson 2001, McVean 2002, McVean et al. 2004)

Now, we are considering ρ = 4Nr

Marginal Trees

Position A Position B Position C

Physical position

L(4Nr) = L(ni,nj,nij | 4Nrij )ij∑

Calculated a “composite likelihood” for a sample of haplotypes:

-sum over pairs-lookup table-RJMCMC

Nature Genet 2008

Science 2009

Nat Genet 2010

population genetics ii. bio5488 - 2016genetics.wustl.edu/bio5488/...popgen_2016_lecture2.pdf ·...

Documents

primate comparative...

genetics classic-genetics-mendelian-genetics2320

mendelian genetics. introduction to genetics introduction to...

genetics developed from curiosity about inheritance genetics...

population genetics i. bio5488 -2017

genetics, genetics, and more genetics - science4inquiry ·...

genetics classic genetics – mendelian genetics

january, 2002 genetics 453 evolutionary genetics history...

population genetics, lecture...

dea conrad-curry partner in education email me:...

research development fellows program may 10, 2010 david...

clinical genetics cytogenetics molecular genetics national...

molecular genetics genetics, gen 301

assignment 12: substitution rates and identifying...

genetics & genomics of disease...

bio5488 genomics spring,...

genetics and variations genetics and variations the language...

human molecular genetics iv. genetics of common diseases/...

midterm(review -...

population genetics, lecture...