lecture 14: population structure and population assignment
Lecture 14: Population Structure and Population Assignment
October 12, 2012
Lab 7 Update
Corrected instructions for lab 7 will be posted today
Problem 1: consider relative levels of F-statistics as well as significance from bootstrapping
Up to 3 points extra credit if problem 2 is done correctly
See lab open hours schedule on lab web page
Caveat: exams and class usage of lab
Other computers are available: see Hari or me
Population structure from worldwide human populations (figure; regions shown: America, Africa, Eurasia, East Asia, Oceania)
Population = subpopulation. Group = Region
Lab 7 Revised Problem 1
Problem 1. File human_struc.xls contains data for 10 microsatellite loci used to genotype 41 human populations from a worldwide sample.
a.) Convert the file into Arlequin format and perform AMOVA based on this grouping of populations within regions using distance. How do you interpret these results? Report values of the phi-statistics and their statistical significance for each AMOVA you run.
b.) Do you think that any of these regions can justifiably be divided into subregions? Pick a region, form a hypothesis for what would be a reasonable grouping of populations into subregions, then run AMOVA only for the region you selected using distance measures. Was your hypothesis supported by the data?
c.) GRADUATE STUDENTS: Which of the 5 initially defined regions has the highest diversity in terms of effective number of alleles? What is your biological explanation for this?
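For part c, the effective number of alleles at a locus is commonly computed as the reciprocal of the sum of squared allele frequencies. A minimal sketch, assuming that definition; the function name and the example frequencies are hypothetical and are not taken from human_struc.xls:

```python
def effective_number_of_alleles(freqs):
    """Return A_e = 1 / sum(p_i^2) for the allele frequencies at one locus."""
    total = sum(freqs)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 / sum(p * p for p in freqs)


# Hypothetical loci: four equally common alleles vs. one dominant allele.
even = effective_number_of_alleles([0.25, 0.25, 0.25, 0.25])
skewed = effective_number_of_alleles([0.85, 0.05, 0.05, 0.05])
print(even, skewed)
```

Both loci carry four alleles, but the evenly distributed locus has the larger effective number, which is why this measure is more informative about diversity than a raw allele count.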
Lab 7 Original Problem 2 (worth 8 points if you answer this). Use Structure to further test the hypotheses you developed in Problem 1.
a.) Calculate the posterior probabilities to test whether:
i. All populations form a single genetically homogeneous group.
ii. There are two genetically distinct groups within your selected region.
iii. There are three genetically distinct groups within your selected region.
b.) Use the ΔK method to determine the most likely number of groups. How does this compare to the method based on posterior probabilities?
c.) How do the groupings of subpopulations compare to your expectations from Problem 1?
d.) Is there evidence of admixture among the groups? If so, include a table or figure showing the proportion of each subpopulation assigned to each group.
e.) GRADUATE STUDENTS: Provide a brief, literature-based explanation for the groupings you observe.
Last Time
Sample calculation of FST
Defining populations on genetic criteria: introduction to Structure
Today
Interpretation of F-statistics
More on the Structure program
Principal Components Analysis
Population assignment
FST: What does it tell us?
Degree of differentiation of subpopulations
Rules of thumb:
0.05 to 0.15 is weak to moderate
0.15 to 0.25 is strong differentiation
>0.25 is very strong differentiation
Related to the historical level of gene exchange between populations
May not represent current conditions
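As a rough illustration of the rules of thumb, the sketch below computes FST = (HT - HS) / HT for a single biallelic locus and labels the result. It assumes equal-sized subpopulations and expected heterozygosities of 2p(1 - p); the function names and allele frequencies are hypothetical.

```python
def fst_biallelic(p_subpops):
    """F_ST = (H_T - H_S) / H_T for one biallelic locus, equal-sized subpopulations."""
    n = len(p_subpops)
    p_bar = sum(p_subpops) / n
    h_t = 2 * p_bar * (1 - p_bar)                       # expected heterozygosity, total
    h_s = sum(2 * p * (1 - p) for p in p_subpops) / n   # mean within subpopulations
    if h_t == 0:
        raise ValueError("locus is monomorphic in the total population")
    return (h_t - h_s) / h_t


def describe(fst):
    """Label a value using the rule-of-thumb ranges listed on the slide."""
    if fst > 0.25:
        return "very strong"
    if fst >= 0.15:
        return "strong"
    if fst >= 0.05:
        return "weak to moderate"
    return "below the listed ranges"


# Hypothetical frequencies of one allele in two subpopulations.
for freqs in ([0.4, 0.6], [0.3, 0.7], [0.2, 0.8]):
    value = fst_biallelic(freqs)
    print(freqs, round(value, 3), describe(value))
```

The farther apart the subpopulation frequencies are, the larger the share of total heterozygosity that lies among subpopulations.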
FST is related to life history
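FST by life-history trait (Loveless and Hamrick, 1984)
Trait               Category            FST
Seed dispersal      Gravity             0.446
Seed dispersal      Explosive/capsule   0.262
Seed dispersal      Winged/plumose      0.079
Successional stage  Early               0.411
Successional stage  Middle              0.184
Successional stage  Late                0.105
Life cycle          Annual              0.430
Life cycle          Short-lived         0.262
Life cycle          Long-lived          0.077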
Structure Program
One of the most widely-used programs in population genetics (original paper cited >8,000 times since 2000)
Very flexible model can determine:
The most likely number of uniform groups (populations, K)
The genomic composition of each individual (admixture coefficients)
Possible population of origin
Individuals in our sample represent a mixture of K (unknown) ancestral populations.
Each population is characterized by (unknown) allele frequencies at each locus.
Within populations, markers are in Hardy-Weinberg and linkage equilibrium.
Roughly speaking, the model sorts individuals into K clusters so as to minimize departures from HWE and Linkage Equilibrium.
A simple model of population structure
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting
More on the model...
Let A1, A2, …, AK represent the (unknown) allele frequencies in each subpopulation
Let Z1, Z2, … , Zm represent the (unknown) subpopulation of origin of the sampled individuals
Assuming Hardy-Weinberg and linkage equilibrium within subpopulations, the likelihood of an individual’s genotype in subpopulation k is given by the product of the relevant allele frequencies:
$$\Pr(G_i \mid Z_i = k, A_k) = \prod_{l \in \text{loci}} P_l$$
where P_l is the probability of observing the individual’s genotype at locus l in subpopulation k
Probability of observing a genotype in a subpopulation
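The probability of observing a genotype at locus l by chance in a population is a function of the allele frequencies:
Homozygote A_iA_i: $P_l = p_i^2$
Heterozygote A_iA_j: $P_l = 2 p_i p_j$
where p_i and p_j are the frequencies of alleles A_i and A_j at locus l in that population. For m loci:
$$P = \prod_{l=1}^{m} P_l$$
Assumes unlinked (independent) loci and Hardy-Weinberg equilibrium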
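The genotype probability described here (squared allele frequency for a homozygote, twice the product of the two frequencies for a heterozygote, multiplied across independent loci) can be sketched as follows. The function names, allele labels, and frequencies are hypothetical.

```python
def locus_probability(genotype, freqs):
    """Hardy-Weinberg probability of one diploid genotype at one locus."""
    a, b = genotype
    if a == b:
        return freqs[a] ** 2          # homozygote: p_i^2
    return 2 * freqs[a] * freqs[b]    # heterozygote: 2 * p_i * p_j


def genotype_probability(genotypes, freqs_by_locus):
    """Multiply per-locus probabilities, assuming unlinked loci."""
    prob = 1.0
    for genotype, freqs in zip(genotypes, freqs_by_locus):
        prob *= locus_probability(genotype, freqs)
    return prob


# Hypothetical subpopulation allele frequencies at two loci.
freqs_by_locus = [{"A": 0.6, "a": 0.4}, {"B": 0.3, "b": 0.7}]
individual = [("A", "A"), ("B", "b")]
prob = genotype_probability(individual, freqs_by_locus)
print(prob)  # 0.6**2 * (2 * 0.3 * 0.7)
```

With many loci the product becomes very small, so in practice the per-locus terms are usually summed on a log scale.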
If we knew the population allele frequencies in advance, then it would be easy to assign individuals.
If we knew the individual assignments, it would be easy to estimate frequencies.
In practice, we don’t know either of these, but the following MCMC algorithm converges to sensible joint estimates of both.
MCMC algorithm (for fixed K)
Start with random assignment of individuals to populations
Step 1: Gene frequencies in each population are estimated based on the individuals that are assigned to it.
Step 2: Individuals are assigned to populations based on gene frequencies in each population.
And this is repeated...
…Estimation of K performed separately.
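The two alternating steps can be sketched as a small Gibbs sampler. This is a simplified illustration, not the Structure implementation: it assumes the no-admixture model, biallelic loci coded as counts of one allele (0, 1, or 2), and a uniform Beta(1, 1) prior on allele frequencies. The function name and data are hypothetical.

```python
import math
import random


def gibbs_no_admixture(genotypes, K, n_iter=200, seed=0):
    """Alternate between sampling allele frequencies and sampling assignments."""
    rng = random.Random(seed)
    n, n_loci = len(genotypes), len(genotypes[0])
    z = [rng.randrange(K) for _ in range(n)]  # random starting assignment
    for _ in range(n_iter):
        # Step 1: sample allele frequencies for each cluster from its members.
        freqs = []
        for k in range(K):
            members = [genotypes[i] for i in range(n) if z[i] == k]
            row = []
            for l in range(n_loci):
                ref = sum(g[l] for g in members)
                alt = 2 * len(members) - ref
                p = rng.betavariate(1 + ref, 1 + alt)
                row.append(min(max(p, 1e-9), 1 - 1e-9))  # keep log() finite
            freqs.append(row)
        # Step 2: reassign each individual given the sampled frequencies.
        for i in range(n):
            logp = []
            for k in range(K):
                lp = 0.0
                for l in range(n_loci):
                    g, p = genotypes[i][l], freqs[k][l]
                    lp += g * math.log(p) + (2 - g) * math.log(1 - p)
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(x - top) for x in logp]
            z[i] = rng.choices(range(K), weights=weights)[0]
    return z


# Hypothetical data: two groups fixed for alternative alleles at 8 loci.
data = [[2] * 8 for _ in range(5)] + [[0] * 8 for _ in range(5)]
labels = gibbs_no_admixture(data, K=2, n_iter=50, seed=1)
print(labels)
```

Cluster labels are arbitrary, so two runs can agree on the grouping while swapping label 0 and label 1. The binomial coefficient for heterozygotes is omitted in Step 2 because it is the same for every cluster and cancels when the weights are normalized.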
Admixed individuals are mosaics of ancestry from the original populations
Ancestral Populations
The two basic ancestry models used by Structure
No Admixture: each individual is derived completely from a single subpopulation
Admixture: individuals may have mixed ancestry: some fraction qk of the genome of individual i is derived from subpopulation k.
The admixture model allows for hybrids, is more flexible, and often provides a better fit for complicated structure. This is what we used in lab.
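Under the admixture model, each allele copy is drawn from subpopulation k with probability qk, so the chance of seeing a given allele is the qk-weighted average of its frequency across subpopulations. A minimal sketch with hypothetical names and numbers:

```python
def allele_probability(q, freqs_by_pop, allele):
    """Probability of one allele copy for an individual with ancestry fractions q."""
    if abs(sum(q) - 1.0) > 1e-6:
        raise ValueError("ancestry fractions must sum to 1")
    return sum(qk * freqs[allele] for qk, freqs in zip(q, freqs_by_pop))


# Hypothetical individual: 75% of the genome from population 1, 25% from population 2.
q = [0.75, 0.25]
freqs_by_pop = [{"A": 0.9, "a": 0.1}, {"A": 0.1, "a": 0.9}]
p_A = allele_probability(q, freqs_by_pop, "A")
print(p_A)  # 0.75 * 0.9 + 0.25 * 0.1
```

Setting one qk to 1 and the rest to 0 recovers the no-admixture case.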
Notes on Estimating the Number of Subpopulations (K)
Likelihood-based method is the simplest, but likelihood often increases continuously with K
More variability in likelihood at values of K beyond the “natural” value
Evanno et al. (2005) method measures change in likelihood and discounts for variation
Use biological reasoning in arriving at a final value
Priors based on population locations, other information
Often need to do hierarchical analyses: break into subregions and run Structure separately for each
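The ΔK idea (change in likelihood, discounted by variation among runs) can be sketched as below. This version takes the absolute second difference of the mean log-likelihoods and divides by the standard deviation across replicate runs at that K, which is one common reading of Evanno et al. (2005). The function name and the log-likelihood values are hypothetical.

```python
from statistics import mean, stdev


def delta_k(lnp):
    """lnp maps K -> list of ln P(X|K) values from replicate runs (at least 2 per K)."""
    avg = {k: mean(values) for k, values in lnp.items()}
    result = {}
    for k in sorted(lnp):
        if k - 1 in lnp and k + 1 in lnp:  # undefined for the smallest and largest K
            second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            result[k] = second_diff / stdev(lnp[k])
    return result


# Hypothetical replicate log-likelihoods for K = 1..4.
runs = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-890, -895, -885],
    4: [-885, -880, -890],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

In this made-up example the likelihood keeps creeping upward past K = 2, but the jump from K = 1 to K = 2 dominates, so ΔK peaks at 2. The statistic cannot be evaluated at the smallest or largest K tested, so it can never select K = 1.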
Inferred human population structure
Each individual is a thin vertical line that is partitioned into K colored segments according to its membership coefficients in K clusters.
Groups shown: Africans, Europeans, MidEast, Cent/S Asia, Asia, Oceania, America
Rosenberg et al. 2002 Science 298: 2381-2385
Structure is Hierarchical: Groups reveal more substructure when examined separately
Rosenberg et al. 2002 Science 298: 2381-2385
Alternative clustering method: Principal Components Analysis
Structure is very computationally intensive
Often no clear best-supported K-value
Alternative is to use traditional multivariate statistics to find uniform groups
Principal Components Analysis is the most commonly used algorithm
EIGENSOFT (PCA, Patterson et al., 2006; PLoS Genetics 2:e190).
Eckert, Population Structure, 5-Aug-2008
Principal Components Analysis
Efficient way to summarize multivariate data like genotypes
Each axis passes through maximum variation in data, explains a component of the variation
http://www.mech.uq.edu.au/courses/mech4710/pca/s1.htm
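To make the idea of an axis through the maximum variation concrete, the sketch below finds the first principal component of a small genotype matrix by power iteration on the covariance matrix. This is a standard-library stand-in for a full eigendecomposition, not what EIGENSOFT does internally. Genotypes are assumed to be coded as allele counts (0, 1, 2); the function name and data are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=500):
    """Return (axis, scores) for the first principal component of the rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix between loci (m x m).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero starting vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("data have no variation along the current vector")
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores


# Hypothetical allele counts at 4 loci for 6 individuals from two groups.
genotypes = [
    [2, 2, 0, 1], [2, 1, 0, 1], [2, 2, 1, 1],   # group 1
    [0, 0, 2, 1], [0, 1, 2, 1], [0, 0, 1, 1],   # group 2
]
axis, scores = first_principal_component(genotypes)
print([round(s, 2) for s in scores])
```

Individuals from the same group fall on the same side of zero along this axis, and the invariant fourth locus gets zero weight because it contributes no variation.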
How do we identify population of origin?
Human Population Assignment with SNP
Assayed 500,000 SNP genotypes for 3,192 Europeans
Used Principal Components Analysis to ordinate samples in space
High correspondence between sample ordination and geographic origin of samples
Individuals assigned to populations of origin with high accuracy
Novembre et al. 2008 Nature 456:98
Likelihood Approaches
Allow evaluation of alternative hypotheses by comparing their relative likelihoods given the evidence
$$L(H_1, H_2 \mid E) = \frac{P(E \mid H_1)}{P(E \mid H_2)}$$
In a population assignment or forensic context, definition of the competing hypotheses is the most essential component
Population Assignment: Likelihood
Assume you find skin cells and blood under fingernails of a murder victim
Victim had major debts with the Sicilian mafia as well as the Chinese mafia
Can population assignment help to focus investigation?
$$L(H_1, H_2 \mid G) = LR = \frac{P(G \mid H_1)}{P(G \mid H_2)}$$
What is H1 and what is H2?
Population Assignment: Likelihood
"Assignment Tests" based on allele frequencies in source populations and genetic composition of individuals
Likelihood-Based Approaches
Calculate likelihood that individual genotype originated in particular population
Assume Hardy-Weinberg and linkage equilibria
Genotype frequencies corrected for presence of sampled individual
Usually reported as log10 likelihood for origin in given population relative to other population
Implemented in ‘GENECLASS’ program (http://www.montpellier.inra.fr/URLB/geneclass/geneclass.html)
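For m loci, the likelihood of the genotype in population l is
$$P = \prod_{k=1}^{m} P_{lk}$$
$$P_{lk} = p_{li}^2$$ for homozygote A_iA_i in population l at locus k
$$P_{lk} = 2 p_{li} p_{lj}$$ for heterozygote A_iA_j in population l at locus k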
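A likelihood-based assignment can be sketched by computing the log10 likelihood of a genotype in each candidate population and comparing them. This is a simplified illustration, not the GENECLASS implementation: it omits the correction for the sampled individual and assumes every observed allele has a non-zero frequency in every population. Population names, allele labels, and frequencies are hypothetical.

```python
import math


def log10_likelihood(genotypes, freqs_by_locus):
    """Sum of log10 Hardy-Weinberg genotype probabilities across loci."""
    total = 0.0
    for (a, b), freqs in zip(genotypes, freqs_by_locus):
        p = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        total += math.log10(p)  # raises ValueError if an allele has frequency 0
    return total


def assign(genotypes, populations):
    """Return the best-supported population and the score for each population."""
    scores = {name: log10_likelihood(genotypes, freqs)
              for name, freqs in populations.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical allele frequencies at two loci in two source populations.
populations = {
    "pop_A": [{"1": 0.8, "2": 0.2}, {"1": 0.5, "2": 0.5}],
    "pop_B": [{"1": 0.2, "2": 0.8}, {"1": 0.1, "2": 0.9}],
}
sample = [("1", "1"), ("1", "2")]
best, scores = assign(sample, populations)
log10_lr = scores["pop_A"] - scores["pop_B"]  # log10 of the likelihood ratio
print(best, round(log10_lr, 3))
```

The difference between the two scores is the log10 likelihood ratio for the two competing hypotheses of origin; a value near zero means the genotype does not discriminate between the populations.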
Power of Population Assignment using Likelihood
Assignment success depends on:
Number of markers used
Polymorphism of markers
Number of possible source populations
Differentiation of populations
Accuracy of allele frequency estimations
Rules of Thumb (Cornuet et al. 1999) for 100% assignment success, for 10 reference populations need:
30 to 50 reference individuals per population
10 microsatellite loci
HE > 0.6
FST > 0.1
Knowing what you know about human population genetics, is it worth the effort to assign our skin sample to Asian or Sicilian populations?
Carmichael et al. 2001 Mol Ecol 10:2787
Population Assignment Example: Wolf Populations in Northwest Territories
Wolves sampled from island and mainland populations in Canadian Northwest Territories
Immigrants detected on mainland (black circles) from Banks Island (white circles)
Population Assignment Example: Fish Stories
Fishing competition on Lake Saimaa in Southeast Finland
Contestant allegedly caught a 5.5 kg salmon, much larger than usual for the lake
Compared fish from the lake to fish from local markets (originating from Norway and Baltic sea)
7 microsatellites
Based on likelihood analysis, fish was purchased rather than caught in lake
Lake Saimaa Market