Edward BucklerUSDA-ARS
Cornell Universityhttp://www.maizegenetics.net
Why can GBS be complicated? Tools
for filtering, error correction and
imputation.
Many Organisms Are Diverse
Humans are at the lower end of diversity, which results in most genomic and bioinformatics
being optimized to this situation
What is diverse: Grapes, Flies, Arabidopsis, and most dominant species on the planet
Maize has more molecular diversity than humans and apes combined
Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001)
1.34%0.09%
1.42%
Maize genetic variation has been evolving for 5 million years
Modern Variation Begins Evolving
Sister Genus Diverges
Zea species begin diverging
Maize domesticated
5mya
4mya
3mya
2mya
1mya
War
m
Plio
cene
Col
d Pl
eist
ocen
eDivergence from Chimps
Ardipithecus
Homo erectus
Modern HumansModern Variation Begins
Australopithecus
Features of TASSEL GBS BioinformaticPipeline
• Designed for large numbers of samples
• Modest computing requirements• Species or Genus‐level SNP calling & filtering
• Filtering to deal with extensive paralogy
Glaubitz et al 2014 PLOSone. Available in TASSEL at maizegenetics.net
Reference‐based GBS bioinformatics pipelineDiscovery
Tag Counts
SNP Caller
Genotypes
Tags by Taxa
FASTQ
Production
FASTQ
Filtered Genotypes
Tags on Physical Map
Tags on Physical Map
Genotypes
What are our expectations with GBS?
High Diversity Ensures High Return on Sequencing
• Proportion of informative markers– Highly repetitive – 15% not easily
informative– Half the genome is not shared between two
maize line• Potentially all of these are informative with a
large enough database– Low copy shared proportion (1% diversity)
• Bi-parental information = (1-0.01)^64bp = 48% informative
• Association information = (1-0.05)^64bp= 97% informative
Expectation of marker distribution
Biallelic, 17%
Too Repetitiv
e, 15%
Non-polymor
phic; 18%
Presense/Absense
, 50%
Multiallelic, 34%
Too Repetitiv
e, 15%
Non-polymorphic; 1%
Presense/Absense
, 50%
Biparental population Across the species
Sequencing Error
Illumina Basic Error Rate is ~1%
• Error rates are associated with distance from start of sequence– Bad – GBS puts these all at the same
position– Good – Reverse reads can correct– Good – Error are consistent and modelable
Reads with errors
• Perfect sequences:0.9964=52.5% of the 64bp sequences are
perfect47.5 are NOT perfect
The errors are autocorrelated so the proportion of perfect sequence is a little higher, and those with 2 or more is also higher.
Do we see these errors?• Assume 10,000 lines genotyped at
0.5X coverage
Base Type Read # (no SNP)
Read # (w/ SNP)
A Major 4950 4900
C Minor 17 67 (50 real)
G Error 17 17
T Error 17 17
Do Errors Matter?• Yes –Imputation, Haplotype
reconstruction• Maybe – GWAS for low frequency
SNPs• No – GS, genetic distance, mapping
on biparental populations
Expectations of Real SNPs
• Vast majority are biallelic• Homozygosity is predicted by
inbreeding coefficient• Allele frequency is constrained in
structured populations• In linkage disequilibrium with
neighboring SNPs
Studying Errors in Biparental populationsLimited range of alleles,
expected allele frequencies, high LD
Maize RIL population expectations
• Allele frequency 0% or 50%• Nearby sites should be in very high
LD (r2>50%)• Most sites can be tested if multiple
populations are available
Bi-parental populations allow identification of error, and non-Mendelian segregation
Error
Non-segregating
Segregating
Bi-parental populations allow identification of error, and non-Mendelian segregation
Error
Median error rate is 0, but there is a long tail of some high error sites
Median
GBS error rates vs. Maize 50K SNP Chip• 7,254 SNPs in common• 279 maize inbreds in common (“Maize282” panel)
Comparison to 50K SNPs Mean Error Rate (per SNP)
Median ErrorRate (per SNP)
Filtered GBS genotypes
All genotypic comparisons: 1.18% 0.93%Homozygotes only: 0.58% 0.42%
Internal GBS recall error rate with 1X coverage is ~0.2%, half of 50K chip errors are paralogyissues between the systems
MergeDuplicateSNPsPlugin
• When restriction sites are less than 128bp apart, we may read SNP from both directions (strands)
• ~13% of all sites• Fusing increases coverage• Fixes errors• -misMat = set maximum mismatch rate• -callHets = mismatch set to hets or not
Use the biology of your system to filter SNP
• Hardy-Weinberg Disequilibrium• For inbreeding crops use the expected
inbreeding coefficient to filter SNPs• How to deal with the low coverage—
– Although many individuals only have 1X coverage (27% of the samples will have 2X or more). Use these individuals to evaluate the quality of segregation.
Product of Filtering
• After filters, in maize we find 0.0018 error rate– AA<>aa = < 0.0018– AA<>Aa = 0.8 at low coverage
• SNPs in wrong location <~1%. Lower in other species.
Only a limited proportion of the best alignments are shared
between alignersBowtie2 BWA
51.5%31.2% 17.4%
Bowtie2 BWA
45.4%
BWA-MEM
16.9% 13.2%
14.4%
2.5%
4.7% 3.0%
12.4%
Bowtie2
BWA Blast
BWA-MEM
9.5% 12.8%
10.9%1.4%
1.3%
6.5%
1.2% 0.8%
1.6%
0.6%
3.9%3.3%
0.8%
32.9%
Which alignment is real
Genetic Mapping of All Unique Tags
• Test for association between between taxa with tags and an anchor map
• Apply in both an association and linkage context
• Trillions of tests, so computational speed is key
Fei Lu in review
Fei Lu
What do YOU want to filter on?• MAF• Mapping accuracy• Support from Paired-End• Support from multiple aligners• Heterozygosity• Coverage• Agreement with WGS• Allele frequency within certain populations• LD with other SNPs• Stats within a particular population
Finding the Good SNPs
Discovery TOPM
Discovery TOPM
SNP & Features
SNP & Features TrainTrain Filter
TOPMFilter TOPM
Production TOPM
Production TOPM
Unstable Genomes
Using the Presence/Absence Variants
• In species like maize, this is the majority of the data
• Less subject to sequencing error• Need imputation methods to
differentiate between missing from sampling and biologically missing
Only 50% of the maize genome is shared between two varieties
Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010
50%
Plant 1
Plant 2 Plant 3
99%
Person 1
Person 2 Person 3
Maize Humans
GWAS and Joint linkage mapping
Read depth (B73 vs Non-B73)
Identify structural variations
Tags
GWAS
JointLinkage
If Map AlignIf
AlignY
PAVsN
Chromosome
B73
Non‐B73
Bin 1 Bin 2 Bin 3
PAV (Deletion) CNV (Duplication)
Insertion
Developed by Fei Lu
Dependent variable, distance (abs(genetic position – physical position))
Attributes: P-value, likelihood ratio, tag count, etc.
Machine learning models: decision tree, SVM, M5Rules, etc.
Model training for mapping accuracy
6.4 M
0.5 M
8.6 M
GJ
G
J
Y
Y
Y
4.5 M mapped tags26M tags
GWAS
Joint linkage
Model training and prediction using M5Rules
High-resolution mapped tags
95%
50%
4.5 million mapped tags in all maize lines
100 bp 10 Kb 1 Mb 10 Kb1 Gb 10 Gb
Median resolution
Framework of maize pan genome
Paired-End Sequencing
• Most of the tags are less than 500bp in size.
• Full contigs can be developed by sequencing them on a MiSeq with paired-end 250bp reads
• Strategies – physically bridge from the mapped end, genetically anchor one end.
Phylogenetic (Cross Species) Analysis
• When divergence <10% between taxa, GBS makes sense as 40% of the cut site pairs are preserved. If focused a genic regions, can go higher.
• Currently pipeline tools not great for very high levels of divergence. Standard nextgen or Paired-End approaches might be a good strategy.
Future• Clean output and input into
machine learning algorithms• Developing a community to
explore and develop filters
Become a power user:
• Play with the code to add your features– Documentation at maizegenetics.net– Source code at SourceForge
• Guide to programming TASSEL:– http://bit.ly/KZLi8l
• Google Group– https://groups.google.com/forum/#!forum/ta
ssel