natural selection in humans sharareh noorbaloochi cs 374 oct 10, 2006
Post on 15-Jan-2016
Natural Selection in Humans
Sharareh Noorbaloochi, CS 374, Oct 10, 2006
Papers to be presented…
Science, 16 June 2006, Volume 312
PLoS Biology, March 2006, Volume 4
Overview
• Pursuit of natural selection
• Biological background
• Methods for detecting positive selection
• Genome-wide studies
• From candidate to function
Images from: Voight et al. 2006
Adaptability of Modern Humans
Humans have undergone tremendous cultural and environmental changes during the last ~40-50 KY.
• Spread around the world (migration out of Africa ~100 KYA)
• Global warming trend since the last ice age (~14 KYA)
• Transition from hunter-gatherer to agricultural society (< ~10 KYA)
• Increase in pathogen load due to greater population density and proximity to livestock
Voight et al. (2006)
Pursuit of Natural Selection
Some Facts
• In human beings, 99.9 percent of bases are the same.
• The remaining 0.1 percent makes a person unique.
– Different attributes / characteristics / traits
• how a person looks,
• diseases he or she develops.
• These variations can be:
– Harmless (change in phenotype)
– Harmful (diabetes, cancer, heart disease, Huntington's disease, and hemophilia)
– Latent (variations found in coding and regulatory regions that are not harmful on their own; the change in each gene only becomes apparent under certain conditions, e.g. susceptibility to lung cancer)
Human Genetic Variations
Two types of genetic mutation events for today:
• Single base mutation which substitutes one nucleotide for another
-- Single Nucleotide Polymorphisms (SNP)
• Insertion or deletion of one or more nucleotide(s)
--Tandem Repeat Polymorphisms --Insertion/Deletion Polymorphisms
• Structural variations also important (copy numbers)
• One of the most common types of genetic variation
What is a SNP?
• A SNP is defined as a single base change in a DNA sequence that occurs in a significant proportion (more than 1 percent) of a large population.
For example, a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA.
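The example sequence pair can be compared position by position. This is only a minimal sketch: the function name is made up for illustration, and a mismatch between two sequences counts as a SNP only if the variant is also present in more than 1 percent of the population, which this code does not check.

```python
def find_single_base_changes(ref: str, alt: str) -> list[tuple[int, str, str]]:
    """Return (position, ref_base, alt_base) for every mismatching position.

    Positions are 0-based. The two sequences are assumed to be aligned.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# The pair from the slide differs at a single base (A -> T at index 1).
changes = find_single_base_changes("AAGGCTAA", "ATGGCTAA")
print(changes)  # [(1, 'A', 'T')]
```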
SNP facts
• SNPs are found in
– coding and (mostly) noncoding regions.
• Occur with a very high frequency
– about 1 in 1200 bases on average.
– approximately 10 million SNPs occur commonly in the human genome.
Allele
• Allele: Any one of a number of viable DNA codings occupying a given locus (position) on a chromosome.
• Usually alleles are DNA sequences that code for a gene, but sometimes the term is used to refer to a non-gene sequence.
• In a diploid organism, like humans, one that has two copies of each chromosome, two alleles make up the individual's genotype.
Haplotype
• Haplotype is a set of SNPs on a single chromatid that are statistically associated.
SNP Maps
• Sequence genomes of a large number of people
• Compare the base sequences to discover SNPs.
• Generate a single map of the human genome containing all possible SNPs => SNP maps
SNP Maps
The HapMap Project
• The DNA samples for the HapMap come from a total of 270 people:
1. Yoruba people in Ibadan, Nigeria (30 both-parent-and-adult-child trios),
2. Japanese in Tokyo (45 unrelated individuals),
3. Han Chinese in Beijing (45 unrelated individuals),
4. CEPH (European) (30 trios).
• These numbers of samples will allow the Project to find almost all haplotypes with frequencies of 5% or higher.
• Ascertainment bias: there are not enough samples to look at frequencies lower than 5%
http://www.hapmap.org/index.html.en
Hapmap, SNPs, Haplotype, Tag SNPs
The construction of the HapMap occurs in three steps.
• (a) Single nucleotide polymorphisms (SNPs) are identified in DNA samples from multiple individuals.
• (b) Adjacent SNPs that are inherited together are compiled into "haplotypes."
• (c) "Tag" SNPs within haplotypes are identified that uniquely identify those haplotypes.
By genotyping the three tag SNPs shown in this figure, researchers can identify which of the four haplotypes shown here are present in each individual.
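Step (c) can be sketched as a small search. The four haplotypes below are invented for illustration (they are not the ones in the HapMap figure), and the brute-force search is an assumption of this sketch, not the HapMap project's actual tag-selection method.

```python
from itertools import combinations


def find_tag_snps(haplotypes):
    """Return the smallest set of SNP positions that tells all haplotypes apart.

    Brute force: try every subset of positions, smallest subsets first.
    Returns None if the haplotypes are not all distinct.
    """
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return sites
    return None


# Hypothetical haplotypes over six SNP positions.
haplotypes = ["ACGTAC", "ACGTGC", "TCGAAC", "TCAAAC"]
tags = find_tag_snps(haplotypes)
print(tags)  # (0, 2, 4)
```

Genotyping only positions 0, 2 and 4 is enough here to say which of the four haplotypes a chromosome carries.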
SNPs may / may not alter protein structure
• Genetic variants that alter protein function are usually deleterious and are therefore less likely to become common or fixed.
• Synonymous (AKA silent) mutations: mutations that have no functional effect on the protein.
• Non-synonymous: amino acid-altering mutations (e.g. sickle cell anemia)
[Figure: synonymous versus non-synonymous mutations]
• Degeneracy of Genetic Code
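Because the genetic code is degenerate, some codon changes leave the amino acid untouched. A minimal sketch of the distinction, using the standard genetic code; the function name is illustrative and the two example codon changes are chosen here, not taken from the slides (GAG to GTG is the well-known sickle cell change, glutamate to valine).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(before: str, after: str) -> str:
    """Label a codon substitution as synonymous or non-synonymous."""
    same_amino_acid = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same_amino_acid else "non-synonymous"


print(classify_codon_change("GAA", "GAG"))  # synonymous (both encode E)
print(classify_codon_change("GAG", "GTG"))  # non-synonymous (E -> V)
```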
How does human history affect genetic variation?
A genome-wide survey of Linkage Disequilibrium
Linkage disequilibrium is a phenomenon whereby genetic variants are associated: people who have one variant tend to have a second variant as well.
Slide by: David Reich, Broad Institute
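The association between two variants can be put into numbers. The sketch below uses the standard coefficients D and r^2, which the slides do not spell out, so treat the choice of measure and all names as assumptions. Each haplotype is a two-character string: the allele at the first site, then the allele at the second.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic sites.

    D = P(AB) - P(A) * P(B), where A and B are the alleles seen on the
    first haplotype. r_squared normalises D by the allele frequencies.
    """
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]  # reference alleles
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h[0] == a and h[1] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Perfect association: carriers of A always carry G.
print(linkage_disequilibrium(["AG", "AG", "TC", "TC"]))  # (0.25, 1.0)
# No association: every combination is equally common.
print(linkage_disequilibrium(["AG", "AC", "TG", "TC"]))  # (0.0, 0.0)
```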
Variation Over time
Variations in Chromosomes Within a Population
Emergence of Variations Over Time
[Figure: timeline from the common ancestor to the present, with mutations arising along the way]
Slide by: David Reich, Broad Institute
What Determines Extent of LD?
[Figure: mutations arising 2,000 and 1,000 generations ago, compared at time = present]
Slide by: David Reich, Broad Institute
Recombination is the key!
Neutral Evolution Versus Positive Natural Selection
Neutral Evolution
[Figure: allele frequencies over generations 1 to 10. Reproduced from Sabeti et al.]
Genetic drift: a slow process. The frequency of neutral mutations in the population changes randomly.
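Random change in frequency can be illustrated with a small simulation. This is a sketch under a Wright-Fisher style assumption (each chromosome in the next generation copies a randomly chosen parent), which is not named in the slides; population size, seed and function name are arbitrary.

```python
import random


def wright_fisher_drift(n_chromosomes, start_freq, generations, seed=0):
    """Simulate the frequency of a neutral allele under drift alone.

    Each generation, every chromosome inherits the allele with probability
    equal to its current frequency. Returns the frequency per generation.
    """
    rng = random.Random(seed)
    count = round(start_freq * n_chromosomes)
    trajectory = [count / n_chromosomes]
    for _ in range(generations):
        p = count / n_chromosomes
        count = sum(1 for _ in range(n_chromosomes) if rng.random() < p)
        trajectory.append(count / n_chromosomes)
    return trajectory


# Ten generations in a population of 100 chromosomes, starting at 50%.
print(wright_fisher_drift(100, 0.5, 10))
```

The frequency wanders up and down without a preferred direction; an allele that reaches 0% or 100% stays there.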
Positive Natural Selection
[Figure: allele frequencies over generations 1 to 10. Reproduced from Sabeti et al.]
Positive selection: a selective regime that favors the fixation of an allele that increases the fitness of its carrier.
Fixation: The process by which one allele increases in a population until all other alleles go extinct and the locus becomes monomorphic.
Simply: 100% frequency.
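The rise of a favored allele toward fixation can be sketched with a deterministic recursion. The model is an assumption of this sketch (a haploid population where carriers have relative fitness 1 + s and drift is ignored); the selection coefficient and starting frequency are made up.

```python
def selection_trajectory(start_freq, s, generations):
    """Expected frequency of a favored allele, generation by generation.

    Uses p' = p * (1 + s) / (1 + p * s): carriers have fitness 1 + s,
    non-carriers have fitness 1, and random drift is ignored.
    """
    freqs = [start_freq]
    for _ in range(generations):
        p = freqs[-1]
        freqs.append(p * (1 + s) / (1 + p * s))
    return freqs


# A rare beneficial allele (1%) with a strong advantage approaches 100%.
trajectory = selection_trajectory(start_freq=0.01, s=0.5, generations=50)
print(round(trajectory[10], 3), round(trajectory[-1], 3))
```

With s = 0 the frequency never moves, which is the deterministic counterpart of neutrality.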
Methods for detecting selection
• Differences between species
1. High proportion of function-altering mutations
• Within-species variation
2. Low diversity
3. Excess of derived alleles
4. Long unbroken haplotypes
5. Differences between populations
Methods for detecting selection
Test 1: Function altering mutations
Age: many millions of years
P. C. Sabeti et al., Science 312, 1614 -1620 (2006)
Excess of function-altering mutations in PRM1 exon 2
Test 1: High proportion of function-altering mutations
• Over a prolonged period, positive selection can increase the fixation rate of beneficial function-altering mutations.
• Signature detected by comparing the rate of function-altering mutations with the rate of synonymous mutations.
• Power is limited: multiple selected changes are needed before a gene stands out from the background neutral rate.
• Common statistical tests:
– Ka/Ks test
– Relative rate test
– McDonald-Kreitman test
Ka/Ks test (Li et al. 1985)
• Main idea: contrast two types of substitution events.
• Goal of the test:
– calculate the synonymous rate (Ks) and the non-synonymous rate (Ka) at each codon site.
• Purifying (negative) selection: Ka decreases; Ka/Ks < 1 is indicative of purifying selection.
• Positive selection: Ka increases (replacement of the amino acid is beneficial to the organism); Ka/Ks > 1.
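The ratio and its reading can be sketched as follows. This is a simplification, not the Li et al. estimator: rates are taken as plain proportions (changes per site) with no correction for multiple substitutions at the same site, and the counts in the example are invented.

```python
def ka_ks_ratio(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Ka/Ks from raw counts, using uncorrected proportions (an assumption)."""
    ka = nonsyn_changes / nonsyn_sites  # non-synonymous changes per site
    ks = syn_changes / syn_sites        # synonymous changes per site
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks


def interpret(ratio):
    """Map a Ka/Ks ratio to the kind of selection it suggests."""
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "purifying selection"
    return "neutral"


# Hypothetical gene: 6 changes at 300 non-synonymous sites,
# 10 changes at 100 synonymous sites.
ratio = ka_ks_ratio(6, 300, 10, 100)
print(interpret(ratio))  # purifying selection
```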
Within-species Tests
• Test 2: Low diversity, many rare alleles (age < 250,000 years)
• Test 3: Many high-frequency derived alleles (age < 80,000 years)
• Test 4: Long common haplotypes unbroken by recombination (age < 30,000 years)
Sweep Signatures
Within-species Test 5: Population Differences
Age < 50,000 to 70,000 years
Example 1: Extreme population differences (PD) in FY*O allele frequency.
• The FY*O allele, which confers resistance to P. vivax malaria, is prevalent and even fixed in many African populations, but virtually absent outside Africa.
New data sets make genome surveys possible
• Full sequence for human, chimpanzee, mouse
• Dense surveys of human genetic variation
Genome-wide Studies
• Limited power to detect selection at single genes
• Powerful for rapidly changing functional classes of genes:
– Sperm-related genes
– Olfactory (sense of smell) receptors
Between-species results
Finding selective sweeps
• Statistical tests:
– Distinguish the pattern of genetic variation expected under neutrality from that expected under natural selection
• Pick a statistical test to detect sweeps
• Apply the statistic across the genome
Problem: We do not fully know the shape of the neutral distribution and how it’s affected by other factors such as demographic history.
Finding selective sweeps
However, the best we can do:
• Use a statistic based on simulations
• Apply it to empirical genome-wide data sets
• Identify the loci in the extreme tail: the most likely candidates of selection
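The last step, picking the loci in the extreme tail of the empirical distribution, might look like the sketch below. The 1% default cut-off, the use of absolute values, and the locus names are assumptions made for illustration.

```python
def extreme_tail(scores, fraction=0.01):
    """Return the loci whose |score| falls in the top `fraction` of all loci.

    `scores` maps a locus name to its test statistic. At least one locus
    is always returned.
    """
    ranked = sorted(scores, key=lambda locus: abs(scores[locus]), reverse=True)
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]


# Hypothetical genome-wide scores for 100 loci; keep the top 5%.
scores = {f"locus{i}": float(i) for i in range(100)}
print(extreme_tail(scores, fraction=0.05))
# ['locus99', 'locus98', 'locus97', 'locus96', 'locus95']
```

Loci picked this way are candidates only: an outlier in the empirical distribution is not proof of selection.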
Test based on the relationship between allele frequency and extent of linkage disequilibrium
No Selection
• Old alleles: low or high frequency, short-range LD
• Young alleles: low frequency, long-range LD (long haplotypes)
Positive Selection
• Young alleles: high frequency, long-range LD
Slide by: David Reich, Broad Institute
The signal of selection
[Figure: linkage disequilibrium (homozygosity) versus allele frequency, under neutrality and under positive selection]
Slide by: David Reich, Broad Institute
Let us understand these methods better …
Methods for detecting selection
Test 2: Low genetic diversity/many rare alleles
Age < 250,000 years
Within-species Tests
Test 2: Low genetic diversity/many rare alleles
P. C. Sabeti et al., Science 312, 1614 -1620 (2006)
Low diversity and many rare alleles at the Kell blood antigen cluster
• As a selected allele increases in population frequency, variants at nearby locations on the same chromosome (linked variants) also rise in frequency.
• This so-called "hitchhiking" produces a "selective sweep".
• Most common type of variant used: SNPs
• Common statistical tests:
– Tajima’s D
– Hudson-Kreitman-Aguade (HKA)
– Fu and Li’s D*
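Of the tests listed, Tajima's D is compact enough to sketch. It compares the average number of pairwise differences with the number of segregating sites; an excess of rare alleles pushes D negative. The input format (equal-length aligned strings) and the toy sequences are assumptions of this sketch.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for a list of aligned, equal-length sequences (n >= 4)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    # S: number of segregating (polymorphic) sites.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("no segregating sites")
    # pi: average number of differences over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    # Constants from Tajima's variance formula.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Every variant is a singleton (many rare alleles): D is negative.
print(tajimas_d(["1000", "0100", "0010", "0001"]) < 0)  # True
# Variants at intermediate frequency: D is positive.
print(tajimas_d(["11", "11", "00", "00"]) > 0)  # True
```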
Methods for detecting selection
Test 3: High-frequency derived alleles
Age < 80,000 years
Within-species Tests
Test 3: Many high-frequency derived alleles
• Derived alleles: non-ancestral alleles
– Arise by new mutations
– Typically lower allele frequencies than ancestral
– However, in a selective sweep, derived alleles linked to the beneficial alleles can hitchhike to high frequency.
P. C. Sabeti et al., Science 312, 1614 -1620 (2006)
Figure: Excess of high-frequency derived alleles at the Duffy red cell antigen (FY) gene (34)
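One way to score an excess of high-frequency derived alleles is Fay and Wu's H. The slide does not name the statistic used for this figure, so its use here is an assumption of this sketch; the derived-allele counts are invented.

```python
def fay_wu_h(derived_counts, n):
    """Fay and Wu's H = pi - theta_H for a sample of n chromosomes.

    `derived_counts` holds, for each polymorphic site, how many of the n
    chromosomes carry the derived allele. theta_H weights each site by the
    square of that count, so high-frequency derived alleles make H negative.
    """
    pairs = n * (n - 1)
    pi = sum(2 * k * (n - k) for k in derived_counts) / pairs
    theta_h = sum(2 * k * k for k in derived_counts) / pairs
    return pi - theta_h


# Derived alleles near fixation (9, 9 and 8 copies out of 10): H < 0.
print(fay_wu_h([9, 9, 8], n=10) < 0)  # True
# Derived alleles rare (1, 1 and 2 copies out of 10): H > 0.
print(fay_wu_h([1, 1, 2], n=10) > 0)  # True
```

Computing H requires knowing which allele is ancestral, typically inferred from an outgroup species.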
Methods for detecting selection
Test 5: Differences between populations
Age < 50,000 to 70,000 years
Within-species Tests
Test 5: Differences between populations
• Geographically separate populations are subject to distinct environmental or cultural pressures, which can change allele frequencies in one population and not the other.
• Can only arise when populations are at least partially isolated reproductively.
– For humans, after the major human migrations out of Africa some 50,000 to 70,000 years ago.
• Weakness of the test: as with other population genetic signatures, it can be hard to distinguish genuine selection from the effect of demographic history (especially a population bottleneck) on genetic variation.
• Common statistical tests:
– FST
– Pexcess
Population bottleneck: a reduction in size of a single, previously larger, population and a loss of prior diversity.
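FST can be sketched for the simplest case of one biallelic site and two populations. The equal weighting of the two populations and the heterozygosity-based definition, FST = (HT - HS) / HT, are assumptions of this sketch; the allele frequencies are illustrative.

```python
def fst_two_populations(p1, p2):
    """FST at one biallelic site from the allele frequency in two populations.

    HT is the expected heterozygosity of the pooled population, HS the
    average expected heterozygosity within each population.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_t == 0:
        return 0.0  # site is not variable in either population
    return (h_t - h_s) / h_t


# Fixed in one population and absent in the other: maximal differentiation.
print(fst_two_populations(1.0, 0.0))  # 1.0
# Same frequency in both populations: no differentiation.
print(fst_two_populations(0.3, 0.3))  # 0.0
```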
Extreme Population Differences
Example 1:
Extreme population differences (PD) in FY*O allele frequency.
• The FY*O allele, which confers resistance to P. vivax malaria, is prevalent and even fixed in many African populations, but virtually absent outside Africa.
Example 2:
• The region around the LCT locus demonstrates large PD between Europeans and non-Europeans, indicating strong selection for the lactase persistence allele in Europeans.
LCT
Extreme population Difference
Genome-wide Survey using Tests: Low diversity and population separation
Outliers: low diversity with high population differentiation
A Little Break?
Interesting fact: Pardis Sabeti is a rock star!
Back to work now…
Methods for detecting selection
Test 4: Long Haplotypes
Age < 30,000 years
Within-species Tests
Recent past
Advent of a beneficial allele
A short while later…
Present
Model: Maynard Smith and Haigh, 1974, Simulation by SelSim, Coop and Spencer, 2004
Core Haplotypes
[Figure: haplotypes 1 to 5 around the gene, plotted against distance in Mb]
Slide by: David Reich, Broad Institute
Adjacent SNPs that are inherited together are compiled into "haplotypes."
Long-range multi-SNP haplotypes
[Figure: haplotypes 1 to 5 at the gene, with core markers and long-range markers (C/T A/G A/G C/T C/T C/T); decay of LD]
Slide by: David Reich, Broad Institute
Long-range multi-SNP haplotypes
Decay of homozygosity: the probability, at any distance, that any two haplotypes that start out the same have all the same SNP genotypes.
[Figure: homozygosity at the gene for core markers and long-range markers (C/T A/G A/G C/T C/T C/T); values shown: 100%, 75%, 35%, 18%]
Slide by: David Reich, Broad Institute
Test Statistic: Extended Haplotype Homozygosity (EHH)
(A) Decay of haplotypes in a single region in which a new selected allele (red) is sweeping to fixation, replacing the ancestral allele (blue).
(B) Decay of haplotype homozygosity for ten replicate simulations.
Right side: derived alleles are favored, sigma = 2Ns = 250
Haplotype Homozygosity: Sabeti et al. 2002
Simulations of decay of haplotype homozygosity
10 simulations: SelSim
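EHH can be sketched directly from its definition as the chance that two randomly chosen chromosomes carrying the same core allele are identical over all markers out to a given distance. The input format (strings that start at the core and extend outward) and the toy haplotypes are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def ehh(core_haplotypes, n_markers):
    """Extended haplotype homozygosity over the first `n_markers` markers.

    `core_haplotypes` are the chromosomes that carry one core allele, each
    written as a string starting at the core. The result is the fraction of
    chromosome pairs that are identical across those markers.
    """
    n = len(core_haplotypes)
    groups = Counter(h[:n_markers] for h in core_haplotypes)
    return sum(comb(count, 2) for count in groups.values()) / comb(n, 2)


# Four chromosomes sharing core allele 'C'; identity decays with distance.
haps = ["CAAC", "CAAC", "CAAT", "CAGT"]
print([round(ehh(haps, k), 3) for k in range(1, 5)])  # [1.0, 1.0, 0.5, 0.167]
```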
iHS: Measures the extent of haplotypes along alleles at a given SNP
[Figure: EHH versus genetic distance for the ancestral allele and the derived allele, with the value 0.05 marked]
iHHA: iHH with respect to the ancestral core allele.
iHHD: iHH with respect to the derived core allele.
iHS Score
• Useful for variants that have not yet reached fixation.
• Large negative iHS: the derived allele has swept up in frequency.
• Large positive iHS: the ancestral allele hitchhikes with the selected site.
• Hence, both cases are considered interesting!
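The sign convention can be checked with a sketch of the unstandardized score, ln(iHHA / iHHD), where each iHH is the area under an EHH curve. Several things are simplified here and should be read as assumptions: the area is a plain trapezoid sum over the points given, no 0.05 cut-off is applied, and the published score is additionally standardized against SNPs of similar frequency. The EHH values are invented.

```python
from math import log


def integrated_ehh(positions, ehh_values):
    """Area under an EHH curve (iHH), by the trapezoid rule."""
    points = list(zip(positions, ehh_values))
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def unstandardized_ihs(positions, ehh_ancestral, ehh_derived):
    """ln(iHH_A / iHH_D): negative when the derived allele has longer haplotypes."""
    ihh_a = integrated_ehh(positions, ehh_ancestral)
    ihh_d = integrated_ehh(positions, ehh_derived)
    return log(ihh_a / ihh_d)


positions = [0.0, 1.0, 2.0, 3.0]       # genetic distance from the core SNP
slow_decay = [1.0, 0.9, 0.8, 0.7]      # long haplotypes
fast_decay = [1.0, 0.5, 0.2, 0.1]      # short haplotypes

# Long haplotypes on the derived allele give a negative score.
print(unstandardized_ihs(positions, fast_decay, slow_decay) < 0)  # True
```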
The Data: Hapmap Project
• 860,000 SNPs genome-wide
• 60 unrelated individuals
– European (CEPH): CEU
– Nigerians from Ibadan (Yoruba): YRI
• 89 unrelated individuals
– Han Chinese from Beijing and Japanese from Tokyo: ASN
[Figures: plots of |iHS| and -iHS]
Lines of Evidence: Selection
• Enrichment of signal in genic relative to non-genic regions (p < 10^-20)
• Replication of previously published candidates
– LCT (Bersaglieri et al. 2004, Coelho et al. 2005)
– ADH (Osier et al. 2002)
– 17q23inv (Stefansson et al. 2005)
– CYP3A5 (Thompson et al. 2004)
– Ch. 11 olfactory gene cluster (Gilad et al. 2003)
• Correlates with departures in the frequency spectrum (Fay and Wu, 2000)
SYT1, Yoruba
Haplotype decay at SYT1, iHS = -4.7
Plot of high iHS scores on Chromosome 12
Binds Ca2+; implicated in release of neurotransmitters [OMIM]
SPAG4, East Asians
Haplotype decay at SPAG4, iHS = -3.1
Plot of high iHS scores on Chromosome 20
Interacts with ODF27, gene found in mammalian sperm tails [OMIM]
How much do regions overlap across populations?
Distribution of Regions Across Populations
Do signals of selection correlate with known biological processes?
Enriched Ontological Categories
• Olfaction [CEU/YRI]
• Gametogenesis/Fertilization [ASN/CEU]
• MHC-I related Immunity [CEU/YRI]
• Metabolism [All]
Other Interesting Stories...
• Skin Pigmentation
– MYO5A, OCA2, DTNBP1, TYRP1, SLC24A5* [CEU]
• Sugar Metabolism
– MAN2A1 [ASN/YRI]; SI [ASN]; LCT [CEU]
• Processing of Fatty Acids
– SLC27A4, PPARD [CEU]; LEPR [ASN]; NCOA1 [YRI]
Conclusion on iHS Method
• Pervasive signals of positive selection across the human genome
• Both population-specific signals and signals shared between populations
• Strong evidence of selection in Africa (unlike other reports)
• Putative medical relevance because:
– Have phenotypic consequences
– Differences between populations
Haplotter: http://hg-wen.uchicago.edu/selection/haplotter.html