lecture 2: population structuresssykim/teaching/s13/slides/lecture2.pdfpopulation structure from...

Lecture 2: Population Structure

02-715 Advanced Topics in Computational Genomics

What is population structure?

• Population Structure
  – A set of individuals characterized by some measure of genetic distinction
  – A “population” is usually characterized by a distinct distribution over genotypes
  – Example: genotypes aa, aA, AA

[Figure: genotype distributions for Population 1 and Population 2]

Motivation

• Reconstructing individual ancestry: The Genographic Project
  – https://genographic.nationalgeographic.com/genographic/index.html

• Studying human migration
  – Out of Africa
  – Multi-regional hypothesis

• Study of various traits
  – Lactose intolerance
  – Origins in Europe?
  – Infer from
    • Migration studies
    • Mutation studies in populations

[Figure: maps of human migration at 200,000, 50,000, 30,000, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]

Overview

• Background
  – Hardy-Weinberg Equilibrium
  – Genetic drift
  – Wright’s FST

• Inferring population structure from genotype data
  – Structure (Falush et al., 2003)
  – Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)

Hardy-Weinberg Equilibrium

• Hardy-Weinberg Equilibrium
  – Under random mating, both allele and genotype frequencies in a population remain constant over generations.
  – Assumptions of the standard random-mating model
    • Diploid organism
    • Sexual reproduction
    • Nonoverlapping generations
    • Random mating
    • Large population size
    • Equal allele frequencies in the sexes
    • No migration/mutation/selection
  – Chi-square test for Hardy-Weinberg equilibrium


• p, q: allele frequencies of A and a

• D, H, R: genotype frequencies for AA, Aa, aa, respectively.
  – $D = p^2$
  – $H = 2pq$
  – $R = q^2$

• The genotype and allele frequencies of the offspring
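The claim that allele frequencies stay constant can be checked in one line. As a sketch, assuming random mating and parents at the genotype frequencies D, H, R above, and writing $p_{\text{offspring}}$ and $q_{\text{offspring}}$ (symbols not used in the slides) for the allele frequencies in the next generation:

```latex
p_{\text{offspring}} = D + \tfrac{1}{2}H = p^2 + pq = p(p + q) = p,
\qquad
q_{\text{offspring}} = R + \tfrac{1}{2}H = q^2 + pq = q(p + q) = q
```

Since the allele frequencies are unchanged, random mating again produces the genotype frequencies $p^2$, $2pq$, $q^2$ among the offspring.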

Testing Whether Hardy-Weinberg Equilibrium Holds

• Chi-square test
  – Null hypothesis: HWE holds in the observed data
  – Test if the null hypothesis is violated in the data by comparing the observed genotype frequencies (in the parent generation) with the expected frequencies (in the offspring generation)

Genotype    AA      Aa     aa    Total
Observed    224     64     6     294
Expected    222.9   66.2   4.9   294

Step 1: Compute allele frequencies from the observed data

$$p = \frac{224 \times 2 + 64}{294 \times 2} = 0.871, \qquad q = 1 - p = 0.129$$

Step 2: Compute the expected genotype frequencies

$$\text{Expected(AA)} = p^2 n = 0.8707^2 \times 294 = 222.9$$

Step 3: Compute the test statistic

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(224 - 222.9)^2}{222.9} + \frac{(64 - 66.2)^2}{66.2} + \frac{(6 - 4.9)^2}{4.9} = 0.32$$
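The three steps of the test can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the function name `hwe_chi_square` is hypothetical, and the p-value line assumes 1 degree of freedom (three genotype classes, minus one, minus one estimated allele frequency).

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy-Weinberg equilibrium at one biallelic locus."""
    n = n_AA + n_Aa + n_aa
    # Step 1: allele frequencies from the observed genotype counts
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    # Step 2: expected genotype counts under HWE
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Step 3: chi-square statistic over the three genotype classes
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, expected, chi2, p_value

p, expected, chi2, p_value = hwe_chi_square(224, 64, 6)
print(f"p = {p:.3f}, q = {1 - p:.3f}")
print("expected counts:", [round(e, 1) for e in expected])
print(f"chi2 = {chi2:.2f}")
```

On the counts from the slide, the statistic is small and the p-value is well above 0.05, so the data give no evidence against HWE.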

Genetic Drift

• The change in allele frequencies in a population due to random sampling

• Neutral process unlike natural selection
  – But genetic drift can eliminate an allele from the given population.

• The effect of genetic drift is larger in a small population
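The effect of population size on drift can be illustrated with a small simulation. The sketch below uses the Wright-Fisher model, which the slides do not name; it is an assumption chosen here because it is the simplest model of random sampling between generations. The function names and the population sizes are hypothetical.

```python
import random

def wright_fisher(n_individuals, p0, generations, rng):
    """Track the frequency of allele A in a diploid population of fixed size."""
    n_copies = 2 * n_individuals  # each diploid individual carries two allele copies
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Every allele copy of the next generation is drawn independently
        # from the current generation (binomial sampling).
        count_A = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count_A / n_copies
        trajectory.append(p)
    return trajectory

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
final_small = [wright_fisher(10, 0.5, 50, rng)[-1] for _ in range(100)]
final_large = [wright_fisher(200, 0.5, 50, rng)[-1] for _ in range(100)]
print("variance of final frequency, N = 10: ", variance(final_small))
print("variance of final frequency, N = 200:", variance(final_large))
```

Once the frequency reaches 0 or 1 it stays there, which is how drift eliminates an allele; the spread of final frequencies is much wider for the small population.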

Population Divergence

• Wright’s FST
  – Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
  – Summarizes the average deviation of a collection of populations away from the mean
  – $F_{ST} = \mathrm{Var}(p_k) / \big(p'(1 - p')\big)$
    • $p'$: the overall frequency of an allele across all subpopulations
    • $p_k$: the allele frequency within population k
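As a worked example of the formula, the sketch below computes FST for one biallelic locus. It assumes equally sized subpopulations, so that p' is the plain average of the p_k and Var(p_k) is the population variance; the function name `fst` is hypothetical.

```python
def fst(subpop_freqs):
    """Wright's F_ST = Var(p_k) / (p' (1 - p')) for one biallelic locus."""
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k  # p': overall frequency (equal-size assumption)
    if p_bar in (0.0, 1.0):
        raise ValueError("allele is absent or fixed overall; F_ST is undefined")
    var = sum((p - p_bar) ** 2 for p in subpop_freqs) / k  # variance of p_k
    return var / (p_bar * (1.0 - p_bar))

print(fst([0.5, 0.5]))  # identical subpopulations
print(fst([0.2, 0.8]))  # diverged subpopulations
print(fst([0.0, 1.0]))  # each subpopulation fixed for a different allele
```

The three calls give 0, about 0.36, and 1, covering the range from no divergence to complete divergence.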

Scenarios of How Populations Evolve


Methods for Learning Population Structure from Genetic Markers

• Low-dimensional projection
  – Matrix-factorization-based methods (Patterson et al., PLoS Genetics 2006)

• Model-based clustering
  – STRUCTURE (Pritchard et al., Genetics 2000)

Low-dimensional Projections

• Genetic data is very large
  – Number of markers may range from a few hundred to hundreds of thousands
  – Thus each individual is described by a high-dimensional vector of marker configurations
  – A low-dimensional projection allows easy visualization

• Allows projection of individuals into a low-dimensional space

• Usually projected to 2 dimensions to allow visualization

Matrix Factorization and Population Structure

• Matrix factorization for learning population structure

Genotype Data (N×P matrix) = Individuals’ ancestry proportions (N×K matrix) × Subpopulation Allele Frequencies (K×P matrix)

N: number of samples
P: number of genotypes
K: number of subpopulations
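The shapes in this factorization are easy to check on a toy example. The sketch below uses made-up values (3 individuals, 2 subpopulations, 3 markers) and a hypothetical helper `matmul`; the product is best read as each individual's expected allele frequency per marker, which real genotype data only approximates.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x p) matrix, both given as lists of rows."""
    return [[sum(a_row[j] * b[j][c] for j in range(len(b))) for c in range(len(b[0]))]
            for a_row in a]

# Hypothetical toy values: N = 3 individuals, K = 2 subpopulations, P = 3 markers.
ancestry = [         # N x K, each row sums to 1
    [1.0, 0.0],      # individual entirely from subpopulation 1
    [0.5, 0.5],      # evenly admixed individual
    [0.0, 1.0],      # individual entirely from subpopulation 2
]
allele_freqs = [     # K x P, frequency of one allele at each marker
    [0.1, 0.9, 0.5],
    [0.7, 0.3, 0.5],
]
expected = matmul(ancestry, allele_freqs)  # N x P
for row in expected:
    print([round(v, 2) for v in row])
```

The admixed individual's row is the average of the two subpopulation rows, which is what the ancestry proportions are meant to express.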

Unifying Framework of Matrix Factorization

• PCA
  – Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
  – Works well for the case of isolation-by-distance (continuous variation of populations among individuals)

• Admixture
  – Based on probability models: rows of Λ and columns of F should sum to 1.
  – Works well if the individuals are admixtures of discretely separated populations

• Sparse factor model
  – Sparsity via automatic relevance determination prior

Principal Component Analysis

• Most common form of factor analysis

• The new variables/dimensions ...
  – Are linear combinations of the original ones
  – Are uncorrelated with one another
    • Orthogonal in original dimension space
  – Capture as much of the original variance in the data as possible
  – Are called Principal Components

What are the new axes?

[Figure: scatter plot with axes Original Variable A and Original Variable B, with the directions PC 1 and PC 2 overlaid]

• Orthogonal directions of greatest variance in data

• Projections along PC1 discriminate the data most along any one axis

Principal Components

• First principal component is the direction of greatest variability (covariance) in the data

• Second is the next orthogonal (uncorrelated) direction of greatest variability
  – So first remove all the variability along the first component, and then find the next direction of greatest variability

• And so on …

Dimensionality Reduction

Can ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don’t lose much

– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
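The procedure above can be sketched without any numerical library. The example below is an illustration under stated assumptions, not the method used in the cited papers: it finds eigenvectors of the sample covariance matrix by power iteration and removes each found component by deflation, mirroring "remove the variability along the first component, then find the next direction". All function names and the toy data are hypothetical.

```python
import math
import random

def covariance_matrix(data):
    """Center the rows of data; return (centered rows, sample covariance matrix)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov

def top_eigenpair(mat, rng, iters=1000):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(mat)
    v = [rng.random() + 0.1 for _ in range(d)]  # random start, assumed not orthogonal to the answer
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:  # nothing left to find in this matrix
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * mat[a][b] * v[b] for a in range(d) for b in range(d))
    return eigenvalue, v

def pca(data, p, seed=0):
    """Return the first p (eigenvalue, eigenvector) pairs and the p-dimensional projection."""
    rng = random.Random(seed)
    centered, cov = covariance_matrix(data)
    d = len(cov)
    components = []
    for _ in range(p):
        lam, v = top_eigenpair(cov, rng)
        components.append((lam, v))
        # Deflation: subtract the variability along the component just found.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    projected = [[sum(r[j] * v[j] for j in range(d)) for _, v in components]
                 for r in centered]
    return components, projected

toy = [[1, 2], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.8]]  # points close to the line y = 2x
components, projected = pca(toy, 2)
total = sum(lam for lam, _ in components)
for lam, v in components:
    print(f"eigenvalue {lam:.4f} ({lam / total:.1%} of variance), direction {v}")
```

For this toy data nearly all the variance lies along the first component, so keeping p = 1 loses very little information.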

PCA Analysis (Cavalli-Sforza, 1978)

• Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
  – First: blue
  – Second: green
  – Third: red

Discrete/Admixed Populations

[Figure: Loading 1, Loading 2, and Loading 3 (one loading per population) as estimated by SFA, PCA, and Admixture]

Analysis of European Genotype Data

[Figure: panels for PCA, SFAm, and Admixture]

Probabilistic Models for Population Structure

• Mixture model
  – Cluster individuals into K populations

• Admixture model
  – The genotypes of each individual are an admixture of multiple ancestor populations
  – Assumes alleles are in linkage equilibrium

• Linkage model
  – Model recombination, correlation in alleles across chromosome

• Organizing data into clusters such that there is
  • high intra-cluster similarity
  • low inter-cluster similarity

• Informally, finding natural groupings among objects.

[Figure: scatter plot on axes from 0 to 5 with three cluster centers k1, k2, k3, redrawn at each step below]

• For a pre-defined number of clusters K, initialize K centers randomly

• Iterate between the following two steps
  – Assign all objects to the nearest center.
  – Move a center to the mean of its members.

• After moving centers, re-assign the objects to nearest centers.

• Move a center to the mean of its new members.

• Re-assign and move centers, until no objects change membership.
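The steps above describe what is commonly called k-means clustering. A minimal sketch follows, assuming Euclidean distance and, when no starting centers are given, initial centers drawn at random from the data points; the function names and the `init_centers` parameter are hypothetical conveniences.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, init_centers=None, seed=0, max_iters=100):
    """Alternate between assigning points to centers and moving centers to means."""
    rng = random.Random(seed)
    centers = list(init_centers) if init_centers is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each object goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(pt, centers[c]))
                      for pt in points]
        if new_labels == labels:  # no object changed membership: converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centers

points = [(1, 1), (1.5, 2), (1, 0.5), (4, 4), (4.5, 5), (5, 4.5)]
labels, centers = kmeans(points, 2, init_centers=[(0, 0), (5, 5)])
print(labels)
print(centers)
```

With random initialization the result can depend on the starting centers, so in practice the algorithm is often run from several starts.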

Soft-Clustering of Individuals into Three Clusters with Gaussian Mixture Model

Probability of    Cluster 1   Cluster 2   Cluster 3   Sum
Individual 1      0.1         0.4         0.5         1
Individual 2      0.8         0.1         0.1         1
Individual 3      0.7         0.2         0.1         1
Individual 4      0.10        0.05        0.85        1
Individual 5      …           …           …           1
Individual 6      …           …           …           1
Individual 7      …           …           …           1
Individual 8      …           …           …           1
Individual 9      …           …           …           1
Individual 10     …           …           …           1

• Each individual can be assigned to more than one cluster with a certain probability.

• For each individual, the probabilities for all clusters should sum to 1 (i.e., each row should sum to 1).

• Each cluster is explained by a cluster center variable (i.e., cluster mean).
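A row of such a table can be computed as follows. This is a sketch of only the membership-probability step of a Gaussian mixture, assuming one-dimensional data and cluster weights, means, and variances that are already known; the function names and the numbers are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Probability that observation x belongs to each cluster (entries sum to 1)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

weights = [1 / 3, 1 / 3, 1 / 3]  # hypothetical mixing proportions
means = [0.0, 5.0, 10.0]         # hypothetical cluster centers
variances = [4.0, 4.0, 4.0]
for x in [1.0, 4.0, 7.5, 9.0]:
    row = responsibilities(x, weights, means, variances)
    print(x, [round(r, 3) for r in row], "sum =", round(sum(row), 3))
```

An observation halfway between two centers is split evenly between them, while one close to a center is assigned to it almost entirely.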

Mixture Model

• The goal is to discover K clusters for K populations from the N×J genotype matrix (N: # of samples, J: # of loci) ($x_{i,n}$ in the plate diagram)

• Assume K populations (clusters)

• θ = Distribution over populations
  – Mixing proportions in mixture model

• β = Distribution over alleles at each locus in each population
  – Mixture component model in mixture model

• To generate an individual’s genome
  – All individuals share the same θ
  – Sample $z_n$ from Multinomial(θ)
  – For each locus
    • Sample $x_{i,n}$ from β corresponding to the population chosen by $z_n$

[Figure: plate diagram over the variables α, θ, z, $x_{i,n}$, β, and λ, with plates over loci (i = 1…J), individuals (n = 1…N), and populations (k = 1…K)]
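The generative process of the mixture model can be sketched as below. The slide does not say how a genotype is drawn from β, so the example assumes biallelic loci, a diploid individual, and a genotype equal to the number of copies of allele A in two independent draws; the function name and toy parameters are hypothetical.

```python
import random

def sample_mixture_individual(theta, beta, rng):
    """theta[k]: mixing proportion of population k.
    beta[k][i]: frequency of allele A at locus i in population k."""
    # One population label for the whole genome of this individual.
    z = rng.choices(range(len(theta)), weights=theta)[0]
    genotype = []
    for freq in beta[z]:
        # Assumption: genotype = number of A copies among two independent draws.
        genotype.append(sum(1 for _ in range(2) if rng.random() < freq))
    return z, genotype

rng = random.Random(0)
theta = [0.3, 0.7]                      # hypothetical mixing proportions
beta = [[0.9, 0.8, 0.1, 0.2],           # population 1 allele frequencies
        [0.1, 0.3, 0.9, 0.7]]           # population 2 allele frequencies
for _ in range(5):
    print(sample_mixture_individual(theta, beta, rng))
```

Every locus of an individual is drawn from the same population, which is the assumption the admixture model relaxes.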

Admixture Model

• Relax the assumption of one population per individual in mixture model

• Individuals can be assigned to multiple different populations in different loci

The Admixture Model

• β = Distribution over alleles
  – One per population–locus pair

• To generate an individual’s genome
  – Sample $\theta_n$ from Dirichlet(α)
  – For each locus
    • Sample $z_{i,n}$ from Multinomial($\theta_n$)
    • Sample $x_{i,n}$ from β corresponding to the population chosen by $z_{i,n}$
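A sketch of this generative process is given below. For simplicity it assumes biallelic loci and draws a single allele copy per locus (coded 1 for allele A, 0 otherwise); the Dirichlet draw is obtained by normalizing gamma variates. The function names and parameter values are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_admixed_individual(alpha, beta, rng):
    """beta[k][i]: frequency of allele A at locus i in population k."""
    theta_n = sample_dirichlet(alpha, rng)  # this individual's ancestry proportions
    ancestry, alleles = [], []
    for i in range(len(beta[0])):
        # Each locus gets its own population label, drawn from theta_n.
        z = rng.choices(range(len(alpha)), weights=theta_n)[0]
        # Assumption: one allele copy per locus, drawn from the chosen population.
        x = 1 if rng.random() < beta[z][i] else 0
        ancestry.append(z)
        alleles.append(x)
    return theta_n, ancestry, alleles

rng = random.Random(0)
alpha = [1.0, 1.0]                      # hypothetical Dirichlet parameter
beta = [[0.9, 0.8, 0.1, 0.2, 0.5],      # population 1 allele frequencies
        [0.1, 0.3, 0.9, 0.7, 0.5]]      # population 2 allele frequencies
theta_n, ancestry, alleles = sample_admixed_individual(alpha, beta, rng)
print("theta_n: ", [round(t, 3) for t in theta_n])
print("ancestry:", ancestry)
print("alleles: ", alleles)
```

Unlike the mixture model, the population label can change from locus to locus within one individual.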

Structure Model

• Hypothesis: Modern populations are created by an intermixing of ancestral populations.

• An individual’s genome contains contributions from one or more ancestral populations.

• The contributions of populations can be different for different individuals.

• Other assumptions
  – Hardy-Weinberg equilibrium
  – No linkage disequilibrium
  – Markers are i.i.d. (independent and identically distributed)

Linkage Model

• From the admixture model, replace the assumption that the ancestry labels $z_{il}$ for individual i, locus l are independent with the assumption that adjacent $z_{il}$ are correlated.

• Use a Poisson process to model the correlation between neighboring alleles
  – $d_l$: distance between locus l and locus l+1
  – r: recombination rate
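One way to read the Poisson-process assumption is as a Markov chain along the chromosome. The sketch below is an interpretation, not the slides' formula: with probability exp(−d_l · r) no breakpoint falls between locus l and locus l+1 and the ancestry label is copied, otherwise a new label is drawn from the individual's ancestry proportions. The function name and the argument `theta_n` (borrowed from the admixture model) are assumptions.

```python
import math
import random

def sample_linked_ancestry(theta_n, distances, r, rng):
    """Sample ancestry labels along a chromosome as a Markov chain.
    distances[l] is d_l, the distance between locus l and locus l+1."""
    k_pops = len(theta_n)
    z = [rng.choices(range(k_pops), weights=theta_n)[0]]
    for d in distances:
        p_no_break = math.exp(-d * r)  # Poisson probability of zero breakpoints
        if rng.random() < p_no_break:
            z.append(z[-1])            # no breakpoint: copy the previous label
        else:
            # Breakpoint: draw a fresh label from the ancestry proportions.
            z.append(rng.choices(range(k_pops), weights=theta_n)[0])
    return z

rng = random.Random(0)
theta_n = [0.5, 0.5]      # hypothetical ancestry proportions
distances = [1.0] * 29    # 30 equally spaced loci
print("r = 0.05:", sample_linked_ancestry(theta_n, distances, 0.05, rng))
print("r = 50:  ", sample_linked_ancestry(theta_n, distances, 50.0, rng))
```

With a small r the labels form long blocks of shared ancestry; with a large r every locus is effectively redrawn on its own.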

Linkage Model

• As recombination rate r goes to infinity, all loci become independent and the linkage model becomes the admixture model.

• Recombination rate r can be viewed as being related to the number of generations since admixture occurred.

• Use an MCMC algorithm to fit the unknown parameters.

Population Structure from Ancestry Proportion of Each Individual

• How to display population structure?

Genetic Structure of Human Populations (Rosenberg et al., Science 2002)

[Figure: ancestral proportion of each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania]

Population of Origin Assignments of a Single Individual

[Figure: three panels showing True origin, Estimated origin (unphased data), and Estimated origin (phased data)]

Comparison of Different Methods

PCA

• Advantages
  – Statistical tests for significance of results (Patterson et al. 2006)
  – Easy visualization

• Disadvantages
  – No intuition about underlying processes

Model-based Clustering

• Advantages
  – Generative process that explicitly models admixture
  – Clustering is probabilistic: it is possible to assign a confidence level to clusters

• Disadvantages
  – Computationally more demanding
  – Based on assumptions of evolutionary models
    • Structure: no models of mutation, recombination
    • Recombination added in extension by Falush et al.
