2013 pag-equine-workshop
![Page 1: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/1.jpg)
C. Titus Brown
Asst Prof, CSE and Microbiology; BEACON NSF STC
Michigan State University
[email protected]
Next-Gen Sequencing: 4 years in the trenches
![Page 2: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/2.jpg)
These slides are available online.
“titus brown slideshare”
You can also e-mail me: [email protected]
Also note that these are my opinions and observations, culled from personal experience,
online material, and reading. I’m happy to cite/explain further upon request, but:
Your Mileage May Vary
![Page 3: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/3.jpg)
Things I won’t talk about
Don’t work on/with/have anything useful to say about:
Exome sequencing
Ancient DNA
ChIP-seq (protein-DNA interactions)
Work on but you’re probably not interested in:
Metagenomics (sequencing uncultured microbial communities)
Bioinformatics data structures and algorithms
![Page 4: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/4.jpg)
Overview
Shotgun sequencing basics
Things everyone wants to know: how much $$...
Various current problems & challenges
Technology, now and future
Some papers and projects worth looking at; & our own experiences
![Page 5: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/5.jpg)
![Page 6: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/6.jpg)
![Page 7: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/7.jpg)
Two specific concepts:
First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)
Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.
These two concepts underlie the recent stunning increases in sequencing capacity.
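A rough sketch of the second concept, assuming features sit on a simple square grid (the pitch values below are illustrative, not taken from any instrument): halving the spacing between features quadruples the number of sequences per unit area.

```python
def features_per_area(pitch_um: float, area_mm2: float = 1.0) -> float:
    """Features on a square grid with the given center-to-center pitch.

    Assumption: an idealized square grid; real flow cells and well plates
    have their own geometry and unusable regions.
    """
    per_mm = 1000.0 / pitch_um        # features along one millimetre
    return per_mm ** 2 * area_mm2     # density scales with the square


base = features_per_area(2.0)
for pitch in (2.0, 1.0, 0.5):         # hypothetical pitches, in micrometres
    n = features_per_area(pitch)
    print(f"pitch {pitch} um: {n:,.0f} features/mm^2 ({n / base:.0f}x)")
```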
![Page 8: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/8.jpg)
![Page 9: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/9.jpg)
![Page 10: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/10.jpg)
![Page 11: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/11.jpg)
![Page 12: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/12.jpg)
What are current costs for Illumina?
Approximate costs from MSU sequencing center, a few months ago, including labor:
RNAseq: $200 prep / sample
Single-ended 1x50 -- $1100/lane -- 100-150 million reads
Paired-end 2x100 -- $2500/lane -- 200-300 million reads (/ 2)
Barcoding samples, etc, gets complicated.
Discuss biology, etc with a sequencing geek before going forward!
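A back-of-envelope sketch of how the lane and prep prices above turn into a per-sample cost. It assumes a lane can be split evenly among barcoded samples at no extra cost, which is a simplification (barcoding "gets complicated"); the 50 million reads/sample target in the example is only an illustrative choice.

```python
import math


def cost_per_sample(lane_cost, reads_per_lane, reads_per_sample, prep_cost):
    """Approximate cost of one sample, assuming lanes split evenly.

    Assumption: barcoding overhead and uneven pooling are ignored.
    """
    samples_per_lane = math.floor(reads_per_lane / reads_per_sample)
    if samples_per_lane < 1:
        # One sample needs more than one lane: buy whole lanes.
        lanes = math.ceil(reads_per_sample / reads_per_lane)
        return prep_cost + lanes * lane_cost
    return prep_cost + lane_cost / samples_per_lane


# 1x50 lane at the low end of its yield (100 million reads), $200 prep.
print(cost_per_sample(1100, 100e6, 50e6, 200))
```

With two samples sharing the lane, the example works out to $550 of sequencing plus $200 of prep per sample.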
![Page 13: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/13.jpg)
What does this data really give you??
With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100m reads/sample is probably “enough” (see: http://blog.fejes.ca/?p=607 for a good discussion)
With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth.
De novo assembly of complex vertebrate genomes is not casual:
Cheap short-read sequencing does not yet deliver good long-range contiguity; repeats, heterozygosity get in the way.
Assembly & scaffolding process itself is still evolving.
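To turn a depth recommendation into a read count, mean depth is just total sequenced bases divided by genome size. A minimal sketch; the 2.7 Gbp genome size is an assumed, roughly horse-sized figure used only for illustration.

```python
import math


def reads_for_depth(depth, genome_size_bp, read_len_bp):
    """Reads needed so that reads * read length / genome size >= depth."""
    return math.ceil(depth * genome_size_bp / read_len_bp)


def mean_depth(n_reads, read_len_bp, genome_size_bp):
    """Average depth delivered by n_reads reads of the given length."""
    return n_reads * read_len_bp / genome_size_bp


# Assumed genome size (~2.7 Gbp); 100 bp reads; 20x target depth.
n = reads_for_depth(20, 2_700_000_000, 100)
print(f"{n:,} reads of 100 bp for 20x")
```

This counts every sequenced base as usable; reads that fail to map or get trimmed push the real requirement higher.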
![Page 14: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/14.jpg)
Why so much data?
Why do we need 10-20x coverage (resequencing) or 50-100m reads (mRNAseq) with Illumina?
Two (linked) reasons:
Shotgun sequencing is random
Counting/sampling variation
![Page 15: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/15.jpg)
1. Useful minimum coverage depends on high average coverage
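One way to see this, as a sketch under an idealized assumption that per-base depth follows a Poisson distribution around the mean (real libraries are more uneven than this, so treat the numbers as optimistic): compute the fraction of bases that fall below a minimum usable depth for several average depths.

```python
import math


def frac_below(mean_cov: float, min_depth: int) -> float:
    """P(depth < min_depth) when depth ~ Poisson(mean_cov).

    Assumption: idealized random shotgun sampling, no sequence bias.
    """
    return sum(
        math.exp(-mean_cov) * mean_cov ** i / math.factorial(i)
        for i in range(min_depth)
    )


# min_depth=5 is an illustrative threshold, not a recommendation.
for mean_cov in (5, 10, 20):
    print(f"mean {mean_cov}x: {frac_below(mean_cov, 5):.2e} of bases below 5x")
```

Even under this best-case model, a sizeable share of the genome sits below the threshold unless the average is several times higher than the minimum you actually need.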
![Page 16: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/16.jpg)
2. mRNAseq quantitation – must overcome sampling variation
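A sketch of why depth matters for quantitation, assuming read counts for a transcript behave like Poisson counts (biological and library variation come on top of this): the relative noise on a count of n reads is about 1/sqrt(n), so rare transcripts need many total reads before their counts are stable.

```python
import math


def expected_count(total_reads: float, fraction: float) -> float:
    """Expected reads from a transcript making up `fraction` of the library."""
    return total_reads * fraction


def poisson_cv(count: float) -> float:
    """Coefficient of variation of a Poisson count with the given mean."""
    return 1.0 / math.sqrt(count)


# Hypothetical transcript contributing one read per million.
for total in (1e6, 10e6, 100e6):
    n = expected_count(total, 1e-6)
    print(f"{total:,.0f} reads: ~{n:.0f} expected, CV ~{poisson_cv(n):.0%}")
```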
![Page 17: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/17.jpg)
Coverage conclusions
More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample)
Your desired coverage numbers should be driven by sensitivity considerations.
![Page 18: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/18.jpg)
Problems and challenges
Systematic bias in sequencing and software.
Genome assembly: scaffolding and sensitivity
Gene references
mRNAseq isoform construction
![Page 19: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/19.jpg)
Resequencing: bias and error
Calling SNPs by mapping -- U. Colorado
http://genomics-course.jasondk.org/?p=395
![Page 20: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/20.jpg)
Both sequencing and bioinformatics yield many low-frequency artifacts!
“Obvious” things like misalignments to paralogous/repeat sequences.
Indels are handled badly by current tools (up to 60% false positive rate?!)
Oxidation of DNA during library prep step (acoustic shearing) generated 8-oxoguanine “lesions” responsible for artifacts involving C>A/G>T triplets.
=> With any data set, especially big ones, there will be both random and systematic error and bias.
http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/
![Page 21: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/21.jpg)
Suggestion: Cortex variant caller
Iqbal et al., Nat Genet. 2012, pmid 22231483
![Page 22: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/22.jpg)
Genome assembly: scaffolding & sensitivity
Everyone wants two things from a genome assembly --
Long/correct scaffolds
See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon
Complete genome content
![Page 23: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/23.jpg)
Sequence data
[Figure; labels: original DNA, fragments, sequenced ends, reads]
http://www.cbcb.umd.edu/research/assembly_primer.shtml
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 24: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/24.jpg)
Contigs
Building contigs
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 25: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/25.jpg)
Scaffolds
Ordered, oriented contigs
[Figure; labels: contigs, mate pairs, gap size estimate, scaffold, contig, gap]
http://dx.doi.org/10.6084/m9.figshare.100940
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 26: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/26.jpg)
Longer reads!
Long reads can span repeats and heterozygous regions
[Figure; labels: Repeat copy 1, Repeat copy 2, Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 27: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/27.jpg)
Cod: PacBio results
Mapping to the published genome
11.4 kbp subread
10.6 kbp subread
10.9 kbp subread
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 28: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/28.jpg)
Sensitivity – does your genome include everything?
Generally not!
For example, the chick genome is missing a substantial number of genes from microchromosomes:
723 genes from HSA19q missing from chicken galGal4.
ESTs and RNAseq transcripts for many or most.
![Page 29: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/29.jpg)
Approach - Digital normalization
(a computational version of library normalization)
Digital normalization “smooths out” coverage from different loci, and can “recover” low coverage regions for assembly.
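A minimal sketch of the idea, not the production implementation: estimate each read's coverage as the median count of its k-mers among reads kept so far, and keep the read only if that estimate is below a cutoff. The function names, the plain in-memory `Counter`, and the default `k` and `cutoff` values are illustrative assumptions; reverse complements are ignored here, and a real tool would need a memory-efficient counting structure for genome-scale data.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep reads whose estimated coverage is still below `cutoff`.

    Assumption: exact counts in a dict-like Counter; strands are not
    canonicalized, so this is a teaching sketch only.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue                      # read shorter than k
        # Median k-mer count so far approximates this read's coverage.
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            counts.update(kms)            # only kept reads add to counts
    return kept


abundant = "ACGGTCATGCA"                  # toy high-coverage locus
rare = "TTGACCGATAG"                      # toy low-coverage locus
kept = digital_normalize([abundant] * 50 + [rare], k=4, cutoff=5)
print(kept.count(abundant), kept.count(rare))
```

In the toy example the 50 copies of the abundant read are trimmed down to the cutoff while the single rare read is retained, which is the "smoothing" described above.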
![Page 30: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/30.jpg)
Applying diginorm to increase sensitivity
Reassembled chick genome from 70x Illumina -> normalized reads in ~24 hours.
Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq.
Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), and Jerry Dodgson (MSU), proposing to apply PacBio and normalization to improve the chick genome; should be a generalizable approach.
![Page 31: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/31.jpg)
Mapping => mRNAseq quantitation
Reference transcriptome required.
![Page 32: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/32.jpg)
Existing chick gene models lack exons, isoforms
*This gene contains at least 4 isoforms.
Our data
Models
Likit Preeyanon
![Page 33: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/33.jpg)
(Exon detection is pretty good.)
Likit Preeyanon
![Page 34: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/34.jpg)
Gene Modeler Pipeline (“gimme”?)
Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, iterate.
Construct gene models
Remove redundant sequences
Predict strands and ORFs
Likit Preeyanon
![Page 35: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/35.jpg)
Some thoughts on bioinfo
Software is evolving very fast. Don’t worry about using the latest, but keep an eye on possible artifacts/problems with what you do use.
In NGS, online information (seqanswers, biostar, Twitter) is generally far more up to date than publications.
![Page 36: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/36.jpg)
Technology – where next?
Most slides taken from Lex Nederbragt:
http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond
![Page 37: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/37.jpg)
High-throughput sequencing
Phase 1: more is better
2005 GS20: 200 000 reads, 100 bp, 0.02 Gb/run
2011 GS FLX+: 1.2 million reads, 750 bp, 0.7 Gb/run
2006 GA: 28 million reads, 25 bp, 0.7 Gb/run
2011 HiSeq 2000: 3 billion reads, 2x100 bp, 600 Gb/run
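A quick sanity check on these figures: yield per run is roughly reads times read length (times two for paired ends). The sketch below assumes the listed read length applies to every read; for the GS FLX+ the product comes out above the listed 0.7 Gb, presumably because 750 bp is not the mean read length.

```python
def yield_gb(n_reads, read_len_bp, paired=False):
    """Approximate gigabases per run from read count and read length.

    Assumption: every read has the listed length.
    """
    bases = n_reads * read_len_bp * (2 if paired else 1)
    return bases / 1e9


runs = [
    ("2005 GS20", 200_000, 100, False),
    ("2011 GS FLX+", 1_200_000, 750, False),
    ("2006 GA", 28_000_000, 25, False),
    ("2011 HiSeq 2000", 3_000_000_000, 100, True),
]
for name, n_reads, length, paired in runs:
    print(f"{name}: ~{yield_gb(n_reads, length, paired):g} Gb/run")
```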
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 38: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/38.jpg)
High-throughput sequencing
Phase 2: smaller is better
GS Junior from Roche/454
MiSeq from Illumina
PGM from Ion Torrent/Life Technologies
0.04 Gb/run, 400 bp reads
0.7 Gb/run, 700 bp reads
4.5 Gb/run, 2x150 bp reads
600 Gb/run, 2x100 bp reads
0.01, 0.1 or 1 Gb/run, 100 or 200 bp reads
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 39: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/39.jpg)
High-throughput sequencing
Why benchtop sequencing instruments?
Affordable price per instrument
Small projects
Diagnostics
Fast turn around time
http://pennystockalerts.com/ http://www.highqualitylinkbuildingservice.com/ http://www.vetlearn.com/ http://vanillajava.blogspot.com
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 40: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/40.jpg)
Which instrument to choose?
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 41: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/41.jpg)
High-throughput sequencing
Phase 3: single-molecule
C2 (current) chemistry:
Average read length 2500 bp
36 000 reads
90 Mb per ‘run’
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 42: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/42.jpg)
High-throughput sequencing
Technology
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 43: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/43.jpg)
Need to combine Illumina + PacBio still.
[Figure; labels: 2.7x, 23x, raw reads, error-corrected reads]
P_errorCorrection pipeline from
24 cpus, 4.5 days, 100 Gb RAM
Alignments of at least 1kb to cod published assembly
93% of reads recovered
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
![Page 44: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/44.jpg)
My perspective on tech:
Illumina HiSeq + benchtop sequencers (MiSeq) currently most reliable for data generation: data in hand, decent quality.
PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).
![Page 45: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/45.jpg)
Two final pieces of advice
Should you work with genome centers? Maybe.
Genome centers are good at large, well funded projects.
Their default pipelines are reliable but not always cutting edge.
“Weird” problems (high heterozygosity, or complex repeats) may require more attention than they can give.
They also have their own schedules and incentives.
Where should you go for contract sequencing?
I get asked this a lot!
My best recommendation is UC Davis.
“Cheaper” is not always “better”; data quality can vary immensely.
![Page 46: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/46.jpg)
Advertisement: next-gen sequence course
June 10-June 20, Kellogg Biological Station; < $500
Hands on exposure to data, analysis tools.
http://bioinformatics.msu.edu/ngs-summer-course-2013
![Page 47: 2013 pag-equine-workshop](https://reader031.vdocument.in/reader031/viewer/2022020110/554e746ab4c905f66a8b4cc9/html5/thumbnails/47.jpg)
Acknowledgements
I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab
Hans Cheng is primary collaborator on chick work
USDA funded our technology development.
Lex Nederbragt for his slides :)