gbs usage cases: non-model organisms...vcf pipeline snp calling algorithm hohenlohe pa, cresko wa et...

Post on 17-May-2020

10 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

GBS Usage

Cases:

Non-model

OrganismsKatie E. Hyma, PhD

Bioinformatics Core

Institute for Genomic Diversity

Cornell University

Q: How many SNPs will I get?

A: 42.

What question do you really want to ask?

Q: How many SNPs do I need?

A: 42.

What question are you trying to answer?

The Best Advice I can Give You:

The Best Advice I can Give You:

Does my Experimental Design make Sense?

Do I have a Plan for Analysis?

Do my Results make Biological Sense??

Why can GBS be

complicated?

Sequencing Error

SNP calling

Missing data

Large data sets

Population type

Biological Reality

Is GBS more complicated for

non-model organisms?

I don’t have a reference genome

We don’t know anything about

population structure or diversity

My reference genome is in a pre-

assembly state, or the physical ordering is

inaccurate

My species is outcrossing

Roadmap

No reference genome (UNEAK pipeline)

SNP CALLING – reference pipeline

Examples and usage cases

The non-model problem:No Reference Genome

Universal Network Enabled Analysis Kit (UNEAK)

A reference free SNP calling pipeline

Fei Lu, Cornell University

Overview of UNEAK

Genome is digested, sequenced using GBS

Reads are trimmed to 64 bp

Identical reads = tag

A

B

slide courtesy Fei Lu, Cornell University

Overview of UNEAK – Network filter

Pairwise alignment to findtag pairs with 1 bp mismatch

Topology of tag networks

Keep common reciprocal tags

C

E

D

errorreal tags

F

Build tag networks

count

slide courtesy Fei Lu, Cornell University

Topology of tag networks

Tag

Error

Plastid &Highly repetitivetags

Moderatelyrepetitive tags,Paralogs &SNPs

Networks of 2496 tags

slide courtesy Fei Lu, Cornell University

Details about network filter

SNP

Error tolerance

slide courtesy Fei Lu, Cornell University

Pipeline validated with maize inbred linkage population –network filter

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0

0.0

8

0.1

6

0.2

4

0.3

2

0.4

0.4

8

0.5

6

0.6

4

0.7

2

0.8

0.8

8

0.9

6Pro

po

rtio

n o

f S

NP

s

Allele frequency

Single-site rate(Blast to maize)

Allele frequencydistribution

0

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

0.045

0.05

0

0.0

7

0.1

4

0.2

1

0.2

8

0.3

5

0.4

2

0.4

9

0.5

6

0.6

3

0.7

0.7

7

0.8

4

0.9

1

0.9

8

Pro

po

rtio

n o

f S

NP

s

Allele frequency

23.3% 85.0%

Step 1 Pairwise alignment of tags

Step 2Network filter

Evaluationcriteria

slide courtesy Fei Lu, Cornell University

Pipeline validated with maize inbred linkage population –SNP validation

LD distribution (SNPs against 1106 markers)

Alignment (B97 tags against B97 shotgun genome from Maize HapMap2 data)

93.2% SNPs are polymorphic

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0

0.0

4

0.0

8

0.1

2

0.1

6

0.2

0.2

4

0.2

8

0.3

2

0.3

6

0.4

0.4

4

0.4

8

0.5

2

0.5

6

0.6

0.6

4

0.6

8

0.7

2

0.7

6

0.8

0.8

4

0.8

8

0.9

2

0.9

6

Pro

po

rtio

n o

f S

NP

s

LD (r2)

0.2

92.2%

slide courtesy Fei Lu, Cornell University

Discovery

Tag Counts

SNP Caller

Genotypes

Tags by Taxa

Sequence

Network Filter

GBS UNEAK Pipeline Fork

1) UTagCountToTagPairPlugin

2) UExportTagPairPlugin

TASSEL 3.0

No Reference Genome SNPs can be called with the UNEAK pipeline if

there is no reference genome.

Be careful with your sampling strategy

Optimized for single species

Allele frequency can matter!

Decide whether your downstream analysis will be qualitatively influenced by biases in SNP calling

Are you doing association mapping? Estimating population genetic parameters?

No Reference Genome

Tag pairs are maximum 1 bp different / 64

bp sequenced

Examples when sampling strategy matters:

The same locus may be identified as two separate loci when there is

population structure and overall divergence is > 1.5%

With highly divergent sub-populations some loci that are monomorphic within

a subpopulation but diverged > 1bp/tag between species will not be

detected

No Reference Genome

Allele frequency can matter for SNP calling

in the UNEAK pipeline – and is influenced by

your population/ sampling strategy

Undersampled diversity (and true rare

alleles) can look like error, and are even

harder to detect with the non-reference

pipeline

Sequence based SNP calling:

What’s in a SNP?

Total Reads# minor allele reads to call

heterozygote (1% error rate)

2 1

3 1

4 1

5 1

6 1

7 2

8 2

9 2

10 2

11 2

12 2

13 2

14 3

Lookup table for GBS pipeline is calculated based on binomial distribution

Sequence based SNP calling:

What’s in a SNP?

Total Reads# minor allele reads to call

heterozygote (1% error rate)# minor allele reads to call

heterozygote (0.33% error rate)

2 1 1

3 1 1

4 1 1

5 1 1

6 1 1

7 2 1

8 2 1

9 2 2

10 2 2

11 2 2

12 2 2

13 2 2

14 3 2

Lookup table for GBS pipeline is calculated based on binomial distribution

Sequence based SNP calling:

What’s in a SNP?

Total read depth: 7

Major allele (“G”) count: 4

Minor allele (“C”) count: 3

SNP call: “S” (heterozygous G/C)

Sequence based SNP calling:

What’s in a SNP?

Total read

depth: 7

Major allele

(“G”) count: 6

Minor allele

(“A”) count: 1

SNP call: “G”

(invariant)

Sequence based SNP calling:

What’s in a SNP?

Total read

depth: 3

Major allele

(“G”) count: 3

Minor allele

(“C”) count: 0

SNP call: “G”

(homozygote)

Sequence based SNP calling:

What’s in a SNP?

Total read depth: 3

Major allele (“G”) count: 2

Minor allele (“A”) count: 1

SNP call: “R” (A/G heterozygote)

Sequence based SNP calling:

What’s in a SNP?

Depth Depth

Heterozygote undercalling in

Vitis spp.

Without filters Aa > AA error rate

(heterozygote undercalling)

in Vitis is

30-50%

50,951 SNPs

Mean depth 5.7x

Heterozygote Undercalling

Solutions

1. Focus on the high coverage genotypes (through VCF plugin)

Ex: Vitisgen project

2. Use genetics (ex. Maize project)

Each project is different, and different solutions are appropriate for each situation

SNP calling in Vitis

(heterozygous)

Read Depth of Each allele is captured for SNP calling in Vitis using the Variant Call format (VCF)

0/0:3,0,0:3:88:0,0,108

Plugins available in TASSEL 3 to output VCF format tbt2vcfPlugin MergeDuplicateSNP_vcf_Plugin Does NOT work with HDF5 TBT/TOPM

VCF integration in TASSEL 4 in now implemented

Qi Sun Jon Zhang

Jeff Glaubitz

Terry Casstevens

Sequenced based SNP calling: What’s in a SNP?

VCF pipeline SNP calling algorithm

Hohenlohe PA, Cresko WA et al. PLOS Genetics

2010 Feb 26;6(2):e1000862.

Likelihood score calculation. No allele

frequency based prior probability was

used due to mixed population

structures.

Heterozygote undercalling in

Vitis spp.

0%

20%

40%

60%

80%

100%

120%

no filter >3 >5 GQ>98

Filtering criteria

Call rate

% Errors

8%

15%

30%

3%

100%

82%

56%

67%

depth depth- - -

Working With VCF Files VCFtools

http://vcftools.sourceforge.net/

VCFtools output formats: plink *** can be used as TASSEL input

012 IMPUTE ldhat BEAGLE-GL

VCFtools is installed on CBSU BioHPC Computing Lab http://cbsu.tc.cornell.edu/lab/lab.aspx

How to Get VCF Output Tassel 3.0

1. QseqToTagCountPlugin/FastqToTagCountPlugin2. MergeMultipleTagCountPlugin3. TagCountToFastqPlugin4. Align5. SAMConverterPlugin6. QseqToTBTPlugin/FastqToTBTPlugin (use –y to output in TBTByte format)7. MergeTagsByTaxaFilesPlugin8. tbt2vcfPlugin9. MergeDuplicateSNP_vcf_Plugin

running each plugin without any options will output a list of available options on the console there is no plugin to merge taxa for VCF files. Merge samples by name at the TBT stage if desired using

the flag –x with the MergeTagsByTaxaFilesPlugin.

TASSEL 4.0 From TagsToSNPByAlignmentPlugin

The non-model problem:Less-than-perfect reference genome

How less than perfect is it? Even relatively short contigs are suitable for anchoring 64bp

GBS tags!

Assembly errors can affect alignment and SNP calling.

Common with short read assemblies

Reference allele does not match GBS alleles

Reference genome assembly

Paralogous regions

Biological Reality

Less than Perfect Reference

SNPs can be called with the reference

pipeline even on genomes that are not

fully assembled

Hint: concatenate contigs into pseudo-

chromosomes, but pad with “Ns” in

between contigs to avoid aligning to

junctions – for 64 bp tags pad with > 64 “Ns”

A non-model case study:Vitis – a highly heterozygous outcrossing crop

species with a genus-level breeding program

“Mapping the way to a better grape”

– 37 Mapping populations

– Genotyping-by-Sequencing (GBS) and SSR based approach to identify QTL and candidate genes

– biotic stress resistance, abiotic stress resistance, and fruit quality.

– Inclusion of 32 different wild and cultivated species to help in map building and genome assembly.

Outcrosses

• Use pseudo-testcross markers (Aa x aa)

• Leverage deep sequencing of parentalgenotyping to phase markers

• Filter by Mendelian errors and/or Linkage Disequilibrium

• Methods development in Vitis for genetic mapping with GBS data

• Dense Mapping of 4-way Cross GBS Data

• Three pipelines:

– Synteny

– Synteny with genetic ordering

– De-novo linkage group formation and genetic ordering

Dense maps from GBS data – 4 way cross

Dense maps from GBS data – 4 way cross

Progeny Number (index)

• ~26,000 markers for a single bi-parental family with ~ 200 progeny

• Separate maps for male and female parents

• No parental genotypes

• Up to 50% missing data per SNP locus

Chromosome 1 markers from male parent in a VitisGen family (phased) and ordered by physical position

Ph

ysic

al p

osi

tio

n

The non-model problem:Take home messages

Designing your GBS study Know what you need for your analyses before

planning your project

GBS is a powerful reduced representation approach, with two distinct components molecular methods

bioinformatics pipelines

The GBS bioinformatics pipelines were developed and optimized for specific applications in mind They can work for you, but be mindful of your

needs

Keep Calm and… Filter – expect to filter up to 90% of SNPs when imputation is not

practical

Pedigree independent filtering: Read depth/genotype quality Percentage of absent calls (missingness) Minor allele frequency For inbred species, inbreeding coefficient filters out most paralogous

sites

Likely to require customized bioinformatics – especially without a reference genome

Be ready for post-processing of data into other formats! Have a plan!

The Best Advice I can Give You:

The Best Advice I can Give You:

Does my Experimental Design make Sense?

Do I have a Plan for Analysis?

Do my Results make Biological Sense??

top related