GBS Usage Cases: Non-model Organisms Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University


Page 1: GBS Usage Cases: Non-model Organisms...VCF pipeline SNP calling algorithm Hohenlohe PA, Cresko WA et al. PLOS Genetics 2010 Feb 26;6(2):e1000862. Likelihood score calculation. No allele

GBS Usage Cases: Non-model Organisms

Katie E. Hyma, PhD

Bioinformatics Core

Institute for Genomic Diversity

Cornell University

Page 2:

Q: How many SNPs will I get?

A: 42.

What question do you really want to ask?

Page 3:

Q: How many SNPs do I need?

A: 42.

What question are you trying to answer?


Page 5:

The Best Advice I can Give You:

Does my Experimental Design make Sense?

Do I have a Plan for Analysis?

Do my Results make Biological Sense??

Page 6:

Why can GBS be complicated?

Sequencing Error

SNP calling

Missing data

Large data sets

Population type

Biological Reality

Page 7:

Is GBS more complicated for non-model organisms?

I don’t have a reference genome

We don’t know anything about population structure or diversity

My reference genome is in a pre-assembly state, or the physical ordering is inaccurate

My species is outcrossing

Page 8:

Roadmap

No reference genome (UNEAK pipeline)

SNP CALLING – reference pipeline

Examples and usage cases

Page 9:

The non-model problem: No Reference Genome

Page 10:

Universal Network Enabled Analysis Kit (UNEAK)

A reference-free SNP calling pipeline

Fei Lu, Cornell University

Page 11:

Overview of UNEAK

Genome is digested, sequenced using GBS

Reads are trimmed to 64 bp

Identical reads = tag


slide courtesy Fei Lu, Cornell University
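The tag definition on this slide (trim reads to 64 bp, then treat identical reads as one tag) can be sketched in a few lines of Python. This is an illustration only, not the UNEAK implementation: the function name `collapse_reads_to_tags` and the toy reads are hypothetical, and the sketch assumes barcodes have already been removed.

```python
from collections import Counter

TAG_LENGTH = 64  # trim length given on the slide


def collapse_reads_to_tags(reads, tag_length=TAG_LENGTH):
    """Trim each read to tag_length bp and count identical trimmed reads.

    Reads shorter than tag_length are skipped here for simplicity; how the
    real pipeline treats short reads is outside the scope of this sketch.
    """
    trimmed = (read[:tag_length] for read in reads if len(read) >= tag_length)
    return Counter(trimmed)


# Toy example: three reads, two of which are identical after trimming.
reads = ["A" * 70, "A" * 64 + "CCC", "A" * 63 + "G"]
tags = collapse_reads_to_tags(reads)
```

Each key of `tags` is a tag sequence and each value is its read count, which is the input the network filter works from.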

Page 12:

Overview of UNEAK – Network filter

Pairwise alignment to find tag pairs with 1 bp mismatch

Build tag networks

Topology of tag networks

Keep common reciprocal tags

[Figure panels C–F: tag networks distinguishing real tags from error tags by count]

slide courtesy Fei Lu, Cornell University
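A minimal sketch of the idea behind the network filter, under stated assumptions: tags that differ by exactly 1 bp are linked, and only pairs whose members are each other's sole neighbor are kept. The function names and the `min_count` cutoff are hypothetical stand-ins; the actual UNEAK error-tolerance logic is not reproduced here.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))


def reciprocal_tag_pairs(tag_counts, min_count=2):
    """Return tag pairs that differ by 1 bp and are each other's only neighbor.

    tag_counts: dict mapping tag sequence -> read count.
    min_count:  hypothetical cutoff, used as a crude stand-in for an error
                filter; rarer tags are dropped before pairing.
    """
    tags = [t for t, c in tag_counts.items() if c >= min_count]
    neighbors = {t: set() for t in tags}
    for a, b in combinations(tags, 2):
        if len(a) == len(b) and hamming_distance(a, b) == 1:
            neighbors[a].add(b)
            neighbors[b].add(a)
    pairs = []
    for a in tags:
        if len(neighbors[a]) == 1:
            (b,) = neighbors[a]
            # Keep the pair once, and only if the link is reciprocal.
            if neighbors[b] == {a} and a < b:
                pairs.append((a, b))
    return pairs


# Toy example with 8 bp "tags" for readability.
counts = {
    "ACGTACGT": 40,  # allele 1 of a candidate SNP
    "ACGTACGA": 35,  # allele 2, 1 bp away
    "ACGTACGC": 1,   # rare tag, dropped by min_count as a likely error
    "TTTTTTTT": 50,  # three mutually linked tags: a repetitive-looking network
    "TTTTTTTA": 30,
    "TTTTTTTC": 30,
}
pairs = reciprocal_tag_pairs(counts)
```

The all-against-all comparison is quadratic in the number of tags, so it only suits toy data; it is here to show the pairing rule, not to scale.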

Page 13:

Topology of tag networks

[Figure: networks of 2496 tags. Legend: tag, error. Labeled network types: plastid and highly repetitive tags; moderately repetitive tags, paralogs, and SNPs]

slide courtesy Fei Lu, Cornell University

Page 14:

Details about network filter

SNP

Error tolerance

slide courtesy Fei Lu, Cornell University

Page 15:

Pipeline validated with maize inbred linkage population – network filter

[Figure: two allele frequency distributions (x-axis: allele frequency; y-axis: proportion of SNPs). Panels: Step 1, pairwise alignment of tags; Step 2, network filter. Evaluation criteria: allele frequency distribution and single-site rate (BLAST to maize); values shown: 23.3%, 85.0%]

slide courtesy Fei Lu, Cornell University

Page 16:

Pipeline validated with maize inbred linkage population – SNP validation

LD distribution (SNPs against 1106 markers)

Alignment (B97 tags against B97 shotgun genome from Maize HapMap2 data)

93.2% of SNPs are polymorphic

[Figure: LD distribution (x-axis: LD (r2); y-axis: proportion of SNPs); annotations: 0.2, 92.2%]

slide courtesy Fei Lu, Cornell University

Page 18:

Discovery

Tag Counts

SNP Caller

Genotypes

Tags by Taxa

Sequence

Network Filter

GBS UNEAK Pipeline Fork

1) UTagCountToTagPairPlugin

2) UExportTagPairPlugin

TASSEL 3.0

Page 19:

No Reference Genome

SNPs can be called with the UNEAK pipeline if there is no reference genome.

Be careful with your sampling strategy

Optimized for single species

Allele frequency can matter!

Decide whether your downstream analysis will be qualitatively influenced by biases in SNP calling

Are you doing association mapping? Estimating population genetic parameters?

Page 20:

No Reference Genome

Tag pairs differ by at most 1 bp out of the 64 bp sequenced

Examples when sampling strategy matters:

The same locus may be identified as two separate loci when there is population structure and overall divergence is > 1.5%

With highly divergent sub-populations, some loci that are monomorphic within a subpopulation but diverged by > 1 bp/tag between species will not be detected
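The 1.5% figure follows from the tag-pair rule: one mismatch in 64 bp is 1/64, or about 1.56% divergence. As a rough illustration, assuming (hypothetically) that sites diverge independently at a uniform rate, the chance that two copies of a 64 bp tag differ at no more than one site is a simple binomial sum. The function name is illustrative only.

```python
def prob_at_most_one_mismatch(divergence, tag_length=64):
    """P(two copies of a tag differ at 0 or 1 sites), assuming each site
    differs independently with probability `divergence`."""
    no_mismatch = (1.0 - divergence) ** tag_length
    one_mismatch = (
        tag_length * divergence * (1.0 - divergence) ** (tag_length - 1)
    )
    return no_mismatch + one_mismatch


# Divergence at which the expected number of mismatches per tag equals 1.
max_pairable_divergence = 1 / 64

for d in (0.005, 0.015, 0.03):
    print(f"divergence {d:.3f}: P(pairable) = {prob_at_most_one_mismatch(d):.2f}")
```

Under this simplified model, the share of loci that can still be paired drops quickly once divergence passes roughly 1 bp per tag, which is why sampling across divergent groups needs care.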

Page 21:

No Reference Genome

Allele frequency can matter for SNP calling in the UNEAK pipeline, and is influenced by your population/sampling strategy

Undersampled diversity (and true rare alleles) can look like error, and is even harder to detect with the non-reference pipeline

Page 22:

Sequence based SNP calling:

What’s in a SNP?

Total reads    # minor allele reads to call heterozygote (1% error rate)
2              1
3              1
4              1
5              1
6              1
7              2
8              2
9              2
10             2
11             2
12             2
13             2
14             3

Lookup table for GBS pipeline is calculated based on binomial distribution

Page 23:

Sequence based SNP calling:

What’s in a SNP?

Total reads    # minor allele reads to call heterozygote
               (1% error rate)    (0.33% error rate)
2              1                  1
3              1                  1
4              1                  1
5              1                  1
6              1                  1
7              2                  1
8              2                  1
9              2                  2
10             2                  2
11             2                  2
12             2                  2
13             2                  2
14             3                  2

Lookup table for GBS pipeline is calculated based on binomial distribution
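The slide says only that the lookup table comes from the binomial distribution. One rule that reproduces both columns above is a likelihood-ratio comparison: take the smallest minor allele count k for which "k or fewer minor reads from a true heterozygote" is more probable than "k or more reads produced by error". Treat the sketch below as consistent with the slide's numbers rather than as the pipeline's code; the function names are hypothetical and 0.33% is entered as 0.0033.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def min_minor_reads_for_het(total_reads, error_rate):
    """Smallest minor-allele count k where P_het(X <= k) / P_err(X >= k) > 1.

    P_het uses p = 0.5 (each read samples either allele of a heterozygote);
    P_err uses p = error_rate (minor-allele reads arise only from error).
    """
    for k in range(1, total_reads + 1):
        p_het = sum(binom_pmf(i, total_reads, 0.5) for i in range(k + 1))
        p_err = sum(
            binom_pmf(i, total_reads, error_rate)
            for i in range(k, total_reads + 1)
        )
        if p_het / p_err > 1.0:
            return k
    return total_reads


depths = range(2, 15)
table_1pct = [min_minor_reads_for_het(n, 0.01) for n in depths]
table_033pct = [min_minor_reads_for_het(n, 0.0033) for n in depths]

for n, a, b in zip(depths, table_1pct, table_033pct):
    print(n, a, b)
```

A lower assumed error rate lets a single minor-allele read count as evidence for a heterozygote at higher depths, which is the difference between the two columns.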

Page 24:

Sequence based SNP calling:

What’s in a SNP?

Total read depth: 7

Major allele (“G”) count: 4

Minor allele (“C”) count: 3

SNP call: “S” (heterozygous G/C)

Page 25:

Sequence based SNP calling:

What’s in a SNP?

Total read depth: 7

Major allele (“G”) count: 6

Minor allele (“A”) count: 1

SNP call: “G” (invariant)

Page 26:

Sequence based SNP calling:

What’s in a SNP?

Total read depth: 3

Major allele (“G”) count: 3

Minor allele (“C”) count: 0

SNP call: “G” (homozygote)

Page 27:

Sequence based SNP calling:

What’s in a SNP?

Total read depth: 3

Major allele (“G”) count: 2

Minor allele (“A”) count: 1

SNP call: “R” (A/G heterozygote)
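The four worked examples on these slides can be checked against the 1% error rate lookup table. The sketch below hard-codes the thresholds from that table and uses standard IUPAC ambiguity codes for heterozygotes; the function name `call_genotype` is hypothetical, and depths outside the table (below 2 or above 14) are deliberately not handled.

```python
# Thresholds copied from the slide's lookup table (1% error rate).
MIN_MINOR_READS_1PCT = {
    2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 2,
    9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 3,
}

# IUPAC ambiguity codes for two-base heterozygotes.
IUPAC_HET = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_genotype(major, major_count, minor, minor_count):
    """Call a heterozygote only if the minor allele reaches the threshold
    for this read depth; otherwise report the major allele."""
    depth = major_count + minor_count
    threshold = MIN_MINOR_READS_1PCT[depth]  # KeyError outside depths 2-14
    if minor_count >= threshold:
        return IUPAC_HET[frozenset((major, minor))]
    return major


print(call_genotype("G", 4, "C", 3))  # depth 7, 3 minor reads
print(call_genotype("G", 6, "A", 1))  # depth 7, 1 minor read
print(call_genotype("G", 3, "C", 0))  # depth 3, no minor reads
print(call_genotype("G", 2, "A", 1))  # depth 3, 1 minor read
```

The second and fourth cases both have a single minor-allele read; only the read depth changes the call.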

Page 28:

Sequence based SNP calling:

What’s in a SNP?

[Figure: two panels, each with a depth axis]

Page 29:

Heterozygote undercalling in Vitis spp.

Without filters, the Aa > AA error rate (heterozygote undercalling) in Vitis is 30-50%

50,951 SNPs

Mean depth 5.7x
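Low read depth alone goes a long way toward explaining heterozygote undercalling. As a back-of-the-envelope sketch, assume each read from a true heterozygote samples either allele with probability 0.5 and ignore sequencing error; the probability that too few reads of one allele are seen at a given depth is then a binomial tail. The function name and the thresholds passed to it are illustrative, and the sketch does not attempt to reproduce the 30-50% figure above.

```python
from math import comb


def het_undercall_probability(depth, min_minor_reads):
    """P(a true heterozygote is called homozygous) at a given read depth.

    Assumes each read samples either allele with probability 0.5 and that
    sequencing error is negligible for this purpose.
    """
    ways_called_het = sum(
        comb(depth, k)
        for k in range(depth + 1)
        if min(k, depth - k) >= min_minor_reads
    )
    return 1.0 - ways_called_het / 2**depth


for depth, threshold in [(1, 1), (2, 1), (3, 1), (7, 2)]:
    p = het_undercall_probability(depth, threshold)
    print(f"depth {depth}, threshold {threshold}: P(undercall) = {p:.3f}")
```

At a depth of one read a heterozygote can never be called, and at two reads it is missed half the time under these assumptions.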

Page 30:

Heterozygote Undercalling

Solutions

1. Focus on the high coverage genotypes (through VCF plugin)

Ex: Vitisgen project

2. Use genetics (ex. Maize project)

Each project is different, and different solutions are appropriate for each situation

Page 31:

SNP calling in Vitis (heterozygous)

Read depth of each allele is captured for SNP calling in Vitis using the Variant Call Format (VCF)

0/0:3,0,0:3:88:0,0,108

Plugins available in TASSEL 3 to output VCF format:

tbt2vcfPlugin

MergeDuplicateSNP_vcf_Plugin

Does NOT work with HDF5 TBT/TOPM

VCF integration in TASSEL 4 is now implemented

Qi Sun Jon Zhang

Jeff Glaubitz

Terry Casstevens
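The example genotype field on this slide can be unpacked with a few lines of Python. The slide does not show the FORMAT column, so the key order `GT:AD:DP:GQ:PL` (genotype, per-allele depths, total depth, genotype quality, phred-scaled likelihoods) is an assumption here, as is the function name.

```python
def parse_sample_field(sample, fmt="GT:AD:DP:GQ:PL"):
    """Split one VCF sample column into named values.

    fmt is the assumed FORMAT string; the slide shows only the sample field.
    """
    record = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "GT": record["GT"],
        "AD": [int(x) for x in record["AD"].split(",")],
        "DP": int(record["DP"]),
        "GQ": int(record["GQ"]),
        "PL": [int(x) for x in record["PL"].split(",")],
    }


call = parse_sample_field("0/0:3,0,0:3:88:0,0,108")
print(call["GT"], call["AD"], call["DP"], call["GQ"])
```

Under that assumed layout, the example is a homozygous reference call supported by three reads of a single allele.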

Page 32:

Sequence based SNP calling: What’s in a SNP?

VCF pipeline SNP calling algorithm

Hohenlohe PA, Cresko WA et al. PLOS Genetics 2010 Feb 26;6(2):e1000862.

Likelihood score calculation. No allele frequency based prior probability was used due to mixed population structures.
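A likelihood-based call without an allele frequency prior can be sketched as follows. This is one reading of the multinomial model in the cited paper, restricted to two observed alleles and using a fixed error rate (an assumption made here for simplicity); whether the VCF plugin computes its scores exactly this way is not stated on the slide, and the function names are hypothetical.

```python
from math import log


def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log-likelihood of the read counts under each genotype, flat prior.

    The multinomial coefficient is the same for all three genotypes and is
    dropped. error_rate is a fixed, assumed per-read error probability.
    """
    e = error_rate
    p_match_hom = 1.0 - 3.0 * e / 4.0  # read shows the homozygote's allele
    p_match_het = 0.5 - e / 4.0        # read shows one of the two het alleles
    p_other = e / 4.0                  # read shows one specific wrong base
    return {
        "hom_ref": ref_count * log(p_match_hom) + alt_count * log(p_other),
        "het": (ref_count + alt_count) * log(p_match_het),
        "hom_alt": alt_count * log(p_match_hom) + ref_count * log(p_other),
    }


def most_likely_genotype(ref_count, alt_count, error_rate=0.01):
    """Genotype with the highest likelihood; no allele-frequency prior."""
    scores = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(scores, key=scores.get)


print(most_likely_genotype(4, 3))
print(most_likely_genotype(3, 0))
```

Because every genotype gets the same prior weight, the call depends only on the read counts at that site in that sample, which suits a mixed panel where no single allele frequency applies.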

Page 33:

Heterozygote undercalling in Vitis spp.

[Figure: bar chart of call rate and % errors by filtering criterion (no filter, depth >3, depth >5, GQ>98); y-axis 0% to 120%. Data labels in extraction order: 8%, 15%, 30%, 3%; 100%, 82%, 56%, 67%]

Page 34:

Working With VCF Files

VCFtools

http://vcftools.sourceforge.net/

VCFtools output formats:

plink *** can be used as TASSEL input

012

IMPUTE

ldhat

BEAGLE-GL

VCFtools is installed on CBSU BioHPC Computing Lab

http://cbsu.tc.cornell.edu/lab/lab.aspx

Page 35:

How to Get VCF Output

TASSEL 3.0

1. QseqToTagCountPlugin/FastqToTagCountPlugin
2. MergeMultipleTagCountPlugin
3. TagCountToFastqPlugin
4. Align
5. SAMConverterPlugin
6. QseqToTBTPlugin/FastqToTBTPlugin (use -y to output in TBTByte format)
7. MergeTagsByTaxaFilesPlugin
8. tbt2vcfPlugin
9. MergeDuplicateSNP_vcf_Plugin

Running each plugin without any options will output a list of available options on the console.

There is no plugin to merge taxa for VCF files. Merge samples by name at the TBT stage if desired, using the flag -x with the MergeTagsByTaxaFilesPlugin.

TASSEL 4.0

From TagsToSNPByAlignmentPlugin

Page 36:

The non-model problem: Less-than-perfect reference genome

Page 37:

How less than perfect is it?

Even relatively short contigs are suitable for anchoring 64 bp GBS tags!

Assembly errors can affect alignment and SNP calling.

Common with short read assemblies

Reference allele does not match GBS alleles

Reference genome assembly

Paralogous regions

Biological Reality

Page 38:

Less than Perfect Reference

SNPs can be called with the reference pipeline even on genomes that are not fully assembled

Hint: concatenate contigs into pseudo-chromosomes, but pad with “Ns” in between contigs to avoid aligning to junctions – for 64 bp tags pad with > 64 “Ns”
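The padding hint can be sketched as a small helper that joins contigs with runs of N and records where each contig starts, so positions on the pseudo-chromosome can be mapped back later. The function name and the default pad of 80 Ns (any value above 64 satisfies the hint) are hypothetical choices.

```python
def build_pseudochromosome(contigs, pad_length=80):
    """Concatenate contigs with N padding between them.

    contigs:    list of (name, sequence) tuples, in the order to join.
    pad_length: number of Ns between contigs; should exceed the tag length
                (64 bp) so no tag can align across a junction.
    Returns (sequence, offsets), where offsets maps each contig name to its
    0-based start position in the pseudo-chromosome.
    """
    pad = "N" * pad_length
    pieces = []
    offsets = {}
    position = 0
    for index, (name, seq) in enumerate(contigs):
        if index > 0:
            pieces.append(pad)
            position += pad_length
        offsets[name] = position
        pieces.append(seq)
        position += len(seq)
    return "".join(pieces), offsets


pseudo, offsets = build_pseudochromosome([("contig1", "ACGT"), ("contig2", "GG")])
print(len(pseudo), offsets)
```

Keeping the offsets table lets SNP positions be translated back to contig coordinates once a better assembly or ordering is available.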

Page 39:

A non-model case study: Vitis – a highly heterozygous outcrossing crop species with a genus-level breeding program

Page 40:

“Mapping the way to a better grape”

– 37 Mapping populations

– Genotyping-by-Sequencing (GBS) and SSR based approach to identify QTL and candidate genes

– biotic stress resistance, abiotic stress resistance, and fruit quality.

– Inclusion of 32 different wild and cultivated species to help in map building and genome assembly.

Page 41:

Outcrosses

• Use pseudo-testcross markers (Aa x aa)

• Leverage deep sequencing of parental genotypes to phase markers

• Filter by Mendelian errors and/or Linkage Disequilibrium

• Methods development in Vitis for genetic mapping with GBS data
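The first and third bullets can be illustrated with a short sketch: pick markers where one parent is heterozygous and the other homozygous (Aa x aa), then flag progeny genotypes that the cross cannot produce. Genotypes are written as two-letter strings such as "AG"; the function names and the toy family are hypothetical.

```python
def is_heterozygous(genotype):
    """True if the two alleles of a genotype string like 'AG' differ."""
    return genotype[0] != genotype[1]


def is_pseudo_testcross(parent_a, parent_b):
    """True for an Aa x aa configuration: exactly one heterozygous parent,
    and the homozygous parent carries one of the heterozygote's alleles."""
    het_a, het_b = is_heterozygous(parent_a), is_heterozygous(parent_b)
    if het_a == het_b:
        return False
    het, hom = (parent_a, parent_b) if het_a else (parent_b, parent_a)
    return hom[0] in het


def mendelian_error_rate(parent_a, parent_b, progeny):
    """Fraction of called progeny genotypes not obtainable from the cross.

    Missing calls are given as None and ignored.
    """
    expected = {frozenset((x, y)) for x in parent_a for y in parent_b}
    called = [g for g in progeny if g is not None]
    if not called:
        return 0.0
    errors = sum(frozenset(g) not in expected for g in called)
    return errors / len(called)


# Toy marker: AG x GG, with one impossible progeny genotype (AA).
progeny = ["AG", "GG", "AA", None, "GA"]
print(is_pseudo_testcross("AG", "GG"))
print(mendelian_error_rate("AG", "GG", progeny))
```

With low-depth GBS data some apparent Mendelian errors are undercalled heterozygotes rather than bad markers, so a tolerance above zero is usually needed when filtering on this rate.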

Page 42:

• Dense Mapping of 4-way Cross GBS Data

• Three pipelines:

– Synteny

– Synteny with genetic ordering

– De-novo linkage group formation and genetic ordering

Dense maps from GBS data – 4 way cross

Page 43:

Dense maps from GBS data – 4 way cross

• ~26,000 markers for a single bi-parental family with ~ 200 progeny

• Separate maps for male and female parents

• No parental genotypes

• Up to 50% missing data per SNP locus

[Figure: chromosome 1 markers from the male parent in a VitisGen family (phased), ordered by physical position; x-axis: progeny number (index); y-axis: physical position]

Page 44:

The non-model problem: Take home messages

Page 45:

Designing your GBS study

Know what you need for your analyses before planning your project

GBS is a powerful reduced representation approach, with two distinct components:

molecular methods

bioinformatics pipelines

The GBS bioinformatics pipelines were developed and optimized with specific applications in mind

They can work for you, but be mindful of your needs

Page 46:

Keep Calm and… Filter

Expect to filter up to 90% of SNPs when imputation is not practical

Pedigree independent filtering:

Read depth/genotype quality

Percentage of absent calls (missingness)

Minor allele frequency

For inbred species, inbreeding coefficient filters out most paralogous sites

Likely to require customized bioinformatics – especially without a reference genome

Be ready for post-processing of data into other formats! Have a plan!
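The pedigree independent filters listed above can be sketched for a single SNP as follows. Genotypes are coded as 0, 1, or 2 copies of one allele, with None for a missing call, and the inbreeding coefficient is estimated as F = 1 - Hobs/Hexp. The function names and all threshold values are hypothetical; suitable cutoffs depend on the population and the analysis.

```python
def snp_summary(genotypes):
    """Per-SNP missingness, minor allele frequency, and inbreeding
    coefficient F = 1 - (observed het / expected het).

    genotypes: list of 0, 1, 2 (allele copies) or None (missing call).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    missingness = 1.0 - n / len(genotypes)
    if n == 0:
        return {"missingness": missingness, "maf": None, "F": None}
    allele_freq = sum(called) / (2 * n)
    maf = min(allele_freq, 1.0 - allele_freq)
    h_obs = sum(g == 1 for g in called) / n
    h_exp = 2.0 * allele_freq * (1.0 - allele_freq)
    f = 1.0 - h_obs / h_exp if h_exp > 0 else None
    return {"missingness": missingness, "maf": maf, "F": f}


def passes_filters(summary, max_missing=0.5, min_maf=0.05, min_f=None):
    """Apply hypothetical thresholds; min_f is only meaningful for inbreds."""
    if summary["maf"] is None:
        return False
    if summary["missingness"] > max_missing or summary["maf"] < min_maf:
        return False
    if min_f is not None and (summary["F"] is None or summary["F"] < min_f):
        return False
    return True


good = snp_summary([0, 0, 2, 2, None, 1])
paralog_like = snp_summary([1, 1, 1, 1])  # every sample looks heterozygous
print(good, passes_filters(good))
print(paralog_like, passes_filters(paralog_like, min_f=0.8))
```

A site where every sample appears heterozygous gives a strongly negative F, which is the signature that lets the inbreeding coefficient screen out paralogous sites in inbred material.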


Page 48:

The Best Advice I can Give You:

Does my Experimental Design make Sense?

Do I have a Plan for Analysis?

Do my Results make Biological Sense??