2013 pag-equine-workshop

C. Titus Brown

Asst Prof, CSE and Microbiology; BEACON NSF STC

Michigan State University

[email protected]

Next-Gen Sequencing: 4 years in the trenches

These slides are available online.

“titus brown slideshare”

You can also e-mail me: [email protected]

Also note that these are my opinions and observations, culled from personal experience, online material, and reading. I’m happy to cite/explain further upon request, but:

Your Mileage May Vary

Things I won’t talk about

Don’t work on/with/have anything useful to say about:

Exome sequencing
Ancient DNA
ChIP-seq (protein-DNA interactions)

Work on but you’re probably not interested in:

Metagenomics (sequencing uncultured microbial communities)
Bioinformatics data structures and algorithms

Overview

Shotgun sequencing basics

Things everyone wants to know: how much $$...

Various current problems & challenges

Technology, now and future

Some papers and projects worth looking at; & our own experiences

Two specific concepts:

First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)

Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.
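A rough illustration of the second concept, assuming an idealized square substrate with a square grid of wells. The function name and the dimensions are made up for the example; real flow cells and plates are not laid out this simply.

```python
def wells_on_substrate(side_um: float, pitch_um: float) -> int:
    """Count wells on a square substrate with a square grid of wells.

    side_um:  side length of the substrate in micrometers (hypothetical)
    pitch_um: center-to-center well spacing in micrometers (hypothetical)
    """
    per_side = int(side_um // pitch_um)
    return per_side * per_side


base = wells_on_substrate(10_000, 2.0)
denser = wells_on_substrate(10_000, 1.0)  # spacing halved: 2x linear density

# Doubling the linear density quadruples the number of wells (and sequences).
print(denser / base)
```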

These two concepts underlie the recent stunning increases in sequencing capacity.

What are current costs for Illumina?

Approximate costs from MSU sequencing center, a few months ago, including labor:

RNAseq: $200 prep / sample
Single-ended 1x50: $1100/lane, 100-150 million reads
Paired-end 2x100: $2500/lane, 200-300 million reads (/ 2)

Barcoding samples, etc., gets complicated.

Discuss biology, etc. with a sequencing geek before going forward!
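To get a feel for how these numbers combine, here is a back-of-the-envelope sketch. The prices are the approximate ones quoted above; the function name is hypothetical, and the calculation assumes samples can be barcoded and pooled perfectly evenly across lanes, which real barcoding rarely achieves.

```python
import math

PREP_PER_SAMPLE = 200    # USD per sample, from the slide
LANE_COST = 1100         # USD per single-ended 1x50 lane, from the slide
READS_PER_LANE = 100e6   # low end of the quoted 100-150 million reads


def rnaseq_cost(n_samples: int, reads_per_sample: float) -> dict:
    """Estimate lanes and cost, assuming perfectly even multiplexing."""
    total_reads = n_samples * reads_per_sample
    lanes = math.ceil(total_reads / READS_PER_LANE)
    total = n_samples * PREP_PER_SAMPLE + lanes * LANE_COST
    return {
        "lanes": lanes,
        "total_usd": total,
        "usd_per_sample": total / n_samples,
    }


# 12 samples at 50 million reads each
print(rnaseq_cost(12, 50e6))
```

Using the low end of the reads-per-lane range makes the estimate conservative; a sequencing center's actual quote may differ.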

What does this data really give you??

With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100m reads/sample is probably “enough” (see: http://blog.fejes.ca/?p=607 for a good discussion).

With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth.

De novo assembly of complex vertebrate genomes is not casual:

Cheap short-read sequencing does not yet deliver good long-range contiguity; repeats, heterozygosity get in the way.

Assembly & scaffolding process itself is still evolving.

Why so much data?

Why do we need 10-20x coverage (resequencing) or 50-100m reads (mRNAseq) with Illumina?

Two (linked) reasons:

Shotgun sequencing is random
Counting/sampling variation

1. Useful minimum coverage depends on high average coverage

2. mRNAseq quantitation: must overcome sampling variation
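The first point can be made concrete with a small sketch. It assumes reads land uniformly at random, so that per-base depth is approximately Poisson-distributed around the average coverage; real data has additional bias, so treat the numbers as optimistic. The function name is made up for the example.

```python
import math


def frac_covered_at_least(mean_cov: float, min_depth: int) -> float:
    """P(depth >= min_depth) when per-base depth ~ Poisson(mean_cov)."""
    below = sum(
        math.exp(-mean_cov) * mean_cov**k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


# Fraction of bases seen at least 5 times at different average coverages.
for c in (5, 10, 20):
    print(c, round(frac_covered_at_least(c, 5), 4))
```

Under this model, an average coverage equal to the desired minimum leaves a large fraction of the genome below that minimum; the average has to be several times higher before nearly every base reaches it.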

Coverage conclusions

More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample)

Your desired coverage numbers should be driven by sensitivity considerations.
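One way to turn a sensitivity target into a read count for mRNAseq, as a sketch only: it assumes the read count for a transcript is Poisson-distributed (pure sampling variation, no biological or technical variance), so a count with mean m has coefficient of variation 1/sqrt(m). The function name, the transcript fraction, and the CV target are illustrative assumptions.

```python
import math


def reads_needed(transcript_fraction: float, target_cv: float) -> int:
    """Total reads so a transcript's Poisson count has CV <= target_cv.

    transcript_fraction: fraction of all reads expected to come from
                         the transcript of interest (hypothetical value)
    target_cv:           acceptable coefficient of variation from
                         sampling alone
    """
    min_count = 1.0 / target_cv**2  # CV = 1/sqrt(m)  =>  m = 1/CV^2
    return math.ceil(min_count / transcript_fraction)


# A transcript at one read per million, quantified to about 20% CV,
# needs on the order of 25 million reads.
print(reads_needed(1e-6, 0.2))
```

Rarer transcripts or tighter targets push the requirement up quickly, which is consistent with the 50-100m reads/sample figure quoted earlier.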

Problems and challenges

Systematic bias in sequencing and software.

Genome assembly: scaffolding and sensitivity

Gene references

mRNAseq isoform construction

Resequencing: bias and error

Calling SNPs by mapping (U. Colorado)

http://genomics-course.jasondk.org/?p=395

Both sequencing and bioinformatics yield many low-frequency artifacts!

“Obvious” things like misalignments to paralogous/repeat sequences.

Indels are handled badly by current tools (up to 60% false positive rate?!)

Oxidation of DNA during library prep step (acoustic shearing) generated 8-oxoguanine “lesions” responsible for artifacts involving C>A/G>T triplets.

=> With any data set, especially big ones, there will be both random and systematic error and bias.

http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/

Suggestion: Cortex variant caller

Iqbal et al., Nat Genet. 2012, pmid 22231483

Genome assembly: scaffolding & sensitivity

Everyone wants two things from a genome assembly:

Long/correct scaffolds

See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon

Complete genome content

Sequence data: reads

[Figure labels: original DNA, fragments, sequenced ends]

http://www.cbcb.umd.edu/research/assembly_primer.shtml

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Contigs: building contigs

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Scaffolds: ordered, oriented contigs

[Figure labels: contigs, mate pairs, gap size estimate; scaffold, contig, gap]

http://dx.doi.org/10.6084/m9.figshare.100940
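The gap size estimate in a scaffold can be sketched as simple arithmetic on mate pairs that bridge two contigs. This is an illustrative simplification, not any particular scaffolder's method: the function names are hypothetical, and it assumes the library's insert size is known and that both mates map unambiguously.

```python
from statistics import median


def estimate_gap(insert_size: int, left_tail: int, right_head: int) -> int:
    """Gap between two contigs implied by one bridging mate pair.

    insert_size: expected outer distance between the two mates
    left_tail:   bases from the start of mate 1 to the end of the left contig
    right_head:  bases from the start of the right contig to the end of mate 2
    """
    return insert_size - left_tail - right_head


def consensus_gap(pairs, insert_size: int) -> float:
    """Median gap estimate over all bridging pairs (robust to outliers)."""
    return median(estimate_gap(insert_size, l, r) for l, r in pairs)


# Three mate pairs from a hypothetical 3 kb library bridging the same gap.
print(consensus_gap([(1200, 1000), (900, 1350), (1500, 700)], 3000))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap.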


Longer reads!

Long reads can span repeats and heterozygous regions

[Figure labels: repeat copy 1, repeat copy 2; contig 1, polymorphic contig 2, polymorphic contig 3, contig 4]


Cod: PacBio results

Mapping to the published genome

11.4 kbp subread

10.6 kbp subread

10.9 kbp subread


Sensitivity – does your genome include everything?

Generally not!

For example, the chick genome is missing a substantial number of genes from microchromosomes:

723 genes from HSA19q are missing from chicken galGal4.

ESTs and RNAseq transcripts exist for many or most of them.

Approach - Digital normalization (a computational version of library normalization)

Digital normalization “smooths out” coverage from different loci, and can “recover” low coverage regions for assembly.
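A minimal sketch of the idea, assuming the usual median-k-mer-abundance formulation of digital normalization: a read is kept only while the median count of its k-mers, among the reads kept so far, is below a cutoff. It uses an exact in-memory Counter for clarity, which is an assumption of this example and would not scale to a 70x vertebrate data set; the function names and parameter values are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k: int = 20, cutoff: int = 20):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to the counts
    return kept


# Toy data: one sequence sampled ten times, another sampled once.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
print(normalize_by_median(reads, k=4, cutoff=3))
```

In the toy run, the heavily sampled sequence is capped near the cutoff while the singly sampled one is retained, which is the "smoothing" effect described above.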

Applying diginorm to increase sensitivity

Reassembled chick genome from 70x Illumina -> normalized reads in ~24 hours.

Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq.

Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), and Jerry Dodgson (MSU), proposing to apply PacBio and normalization to improve the chick genome; should be a generalizable approach.

Mapping => mRNAseq quantitation

Reference transcriptome required.

Existing chick gene models lack exons, isoforms

*This gene contains at least 4 isoforms.

Our data

Models

Likit Preeyanon

(Exon detection is pretty good.)

Likit Preeyanon

Gene Modeler Pipeline (“gimme”?)

Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, iterate.

Construct gene models
Remove redundant sequences
Predict strands and ORFs

Likit Preeyanon
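To illustrate only the first step (merging transcripts based on where they map), here is a hypothetical sketch that groups mapped transcripts into loci by genomic overlap. It is not the pipeline's actual implementation: the function name and input format are assumptions, and it ignores strand, exon structure, and isoforms.

```python
def group_into_loci(transcripts):
    """Group transcripts on one chromosome into loci by overlap.

    transcripts: list of (name, start, end) genomic spans
    Returns a list of loci, each a list of transcript names whose
    spans overlap, directly or through a chain of other transcripts.
    """
    loci = []
    current, current_end = [], None
    for name, start, end in sorted(transcripts, key=lambda t: (t[1], t[2])):
        if current and start <= current_end:
            current.append(name)
            current_end = max(current_end, end)
        else:
            if current:
                loci.append(current)
            current, current_end = [name], end
    if current:
        loci.append(current)
    return loci


mapped = [("t1", 100, 500), ("t3", 2000, 2500), ("t2", 400, 900)]
print(group_into_loci(mapped))
```

Each resulting locus would then be the unit within which gene models are constructed and redundant sequences removed.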

Some thoughts on bioinfo

Software is evolving very fast. Don’t worry about using the latest, but keep an eye on possible artifacts/problems with what you do use.

In NGS, online information (seqanswers, biostar, Twitter) is generally far more up to date than publications.

Technology – where next?

Most slides taken from Lex Nederbragt:

http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond

High-throughput sequencing

Phase 1: more is better

Year  Instrument  Reads        Read length  Yield
2005  GS20        200 000      100 bp       0.02 Gb/run
2011  GS FLX+     1.2 million  750 bp       0.7 Gb/run
2006  GA          28 million   25 bp        0.7 Gb/run
2011  HiSeq 2000  3 billion    2x100 bp     600 Gb/run
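The per-run yields follow roughly from reads times read length. A quick sketch of that arithmetic, with a made-up function name and the figures from the Phase 1 numbers above:

```python
def yield_gb(n_reads: float, read_len_bp: int, ends: int = 1) -> float:
    """Approximate run yield in gigabases: reads x read length x ends."""
    return n_reads * read_len_bp * ends / 1e9


runs = {
    "GS20 (2005)": yield_gb(200_000, 100),
    "GS FLX+ (2011)": yield_gb(1.2e6, 750),
    "GA (2006)": yield_gb(28e6, 25),
    "HiSeq 2000 (2011)": yield_gb(3e9, 100, ends=2),
}
for name, gb in runs.items():
    print(f"{name}: {gb:.2f} Gb/run")
```

For the GS FLX+ the product comes out near 0.9 Gb rather than the quoted 0.7 Gb, presumably because 750 bp is closer to a maximum than an average read length; the other three match the quoted yields.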


High-throughput sequencing

Phase 2: smaller is better

Benchtop instrument                     Yield                  Read length
GS Junior from Roche/454                0.04 Gb/run            400 bp reads
MiSeq from Illumina                     4.5 Gb/run             2x150 bp reads
PGM from Ion Torrent/Life Technologies  0.01, 0.1 or 1 Gb/run  100 or 200 bp reads

For comparison, the full-size instruments: 0.7 Gb/run with 700 bp reads (Roche/454) and 600 Gb/run with 2x100 bp reads (Illumina).


High-throughput sequencing

Why benchtop sequencing instruments?

Affordable price per instrument
Small projects
Diagnostics
Fast turnaround time



Which instrument to choose?


High-throughput sequencing

Phase 3: single-molecule

C2 (current) chemistry:

Average read length 2500 bp
36 000 reads
90 Mb per ‘run’


High-throughput sequencing

Technology


Need to combine Illumina + PacBio still.

[Figure: P_errorCorrection pipeline; input labels 2.7x + 23x; 24 cpus, 4.5 days, 100 Gb RAM]

[Figure: alignments of at least 1 kb to the published cod assembly, raw reads vs. error-corrected reads; 93% of reads recovered]


My perspective on tech:

Illumina HiSeq + benchtop sequencers (MiSeq) currently most reliable for data generation: data in hand, decent quality.

PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).

Two final pieces of advice

Should you work with genome centers? Maybe.

Genome centers are good at large, well funded projects.

Their default pipelines are reliable but not always cutting edge.

“Weird” problems (high heterozygosity, or complex repeats) may require more attention than they can give.

They also have their own schedules and incentives.

Where should you go for contract sequencing?

I get asked this a lot!

My best recommendation is UC Davis.

“Cheaper” is not always “better”; data quality can vary immensely.

Advertisement: next-gen sequence course

June 10-June 20, Kellogg Biological Station; < $500

Hands-on exposure to data, analysis tools.

http://bioinformatics.msu.edu/ngs-summer-course-2013

Acknowledgements

I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab.

Hans Cheng is primary collaborator on chick work.

USDA funded our technology development.

Lex Nederbragt for his slides :)
