![Page 1: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/1.jpg)
Genetics 211 - 2014 Lecture 2
High Throughput Sequencing
Gavin Sherlock [email protected]
January 14th 2014
![Page 2: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/2.jpg)
Applications of Next-Gen Sequencing
genome, chromatin, transcriptome
• interactions between nucleic acids and proteins
• transcript identity
• transcript abundance
• RNA editing
• SNPs
• Allele specific expression
• Regulation
• Nucleosome positioning
• 3D genome architecture
• Active promoters
• interactions between nucleic acids and proteins
• chromatin modifications
• genome variability
• metagenomics
• genome modifications
• detection of mutations
• association studies
• phylogeny
• evolution
de novo sequencing
assembly
annotation
mapping
resequencing
detection of variants
mapping
Hi-C
3D reconstruction
mapping
ChIP-Seq
detection of binding sites
mapping
RNA-Seq
transcript detection and quantification
mapping
ATAC-Seq
Identify open chromatin
![Page 3: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/3.jpg)
How do we make an Illumina Genomic DNA library?
1. Genomic DNA
2. Fragment (Covaris)
3. Polish, add dA overhang
4. Add adaptors, size select
5. Sequence
![Page 4: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/4.jpg)
Making fragments asymmetric
Fragmented, end polished, phosphorylated, dA ligated DNA sample:
5'-pNNNN.........NNNNA-3'
3'-ANNNN.........NNNNp-5'
Genomic Y-adapter:
5'-ACACTCTTTCCCTACACGAC GCTCTTCCGATCT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGp-5'
Ligate
5'-ACACTCTTTCCCTACACGAC GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG CAGCACATCCCTTTCTCACA-5'
[Ligation product is gel purified, selecting only those products in a certain size range]
![Page 5: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/5.jpg)
Making our genomic DNA library asymmetric
Round 1 of PCR
Template (ligation product):
5'-ACACTCTTTCCCTACACGAC GCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCG CAGCACATCCCTTTCTCACA-5'
PCR primer (anneals at the 3' end of each template strand):
3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT-3'
Products of first round:
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'

5'-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3'
3'-GTTCGTCTTCTGCCGTATGCTCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACA-5'
![Page 6: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/6.jpg)
Finishing and Sequencing the Library
Template (product of first round):
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
PCR primers:
3'-TCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
Rounds 2-18
Product of PCR amplification:
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
[Anneal to flow cell. Perform cluster generation]
Genomic DNA Sequencing Primer, shown against the amplified product:
5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNN.........NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG-3'
3'-TTACTATGCCGCTGGTGGCTCTAGATGTGAGAAAGGGATGTGCTGCGAGAAGGCTAGANNNN.........NNNNTCTAGCCTTCTCGAGCATACGGCAGAAGACGAAC-5'
5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
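The relationships between the oligos on these slides can be checked with a few lines of Python. This is only a sketch: the oligo sequences are copied from the slides, but the variable and function names are made up for illustration, and the assumption that the first read starts immediately after the sequencing primer site is stated here rather than taken from the slides.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


# Oligo sequences copied from the slides; the names are illustrative only.
PCR_PRIMER_LONG = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
PCR_PRIMER_SHORT = "CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT"
SEQUENCING_PRIMER = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"


def build_top_strand(insert: str) -> str:
    """Top strand of the amplified library molecule for a given insert."""
    return PCR_PRIMER_LONG + insert + reverse_complement(PCR_PRIMER_SHORT)


def first_read_bases(top_strand: str, n: int) -> str:
    """Bases directly 3' of the sequencing primer site (assumed read start)."""
    start = top_strand.index(SEQUENCING_PRIMER) + len(SEQUENCING_PRIMER)
    return top_strand[start:start + n]


if __name__ == "__main__":
    top = build_top_strand("GATTACA")
    print(top)
    print(first_read_bases(top, 7))
```

The sequencing primer is the 3' portion of the longer PCR primer, and the reverse complement of the shorter PCR primer is the sequence that follows the insert on the top strand, which is why the finished molecule is asymmetric.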
![Page 7: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/7.jpg)
How Much Sequence?
• HiSeq 2500 can give ~250 million reads/lane of paired end 100bp reads
• This is 50Gb of sequence
• This is ~4000x coverage of yeast (12Mb).
• This is an obvious waste of resources (it’s also ~500x C. elegans, and ~500x D. melanogaster)
• How can we sequence on a HiSeq and not waste all these resources when sequencing smaller genomes?
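The arithmetic behind these numbers can be sketched as follows. The code assumes that "250 million reads" counts read pairs (clusters), since that is what gives the 50 Gb quoted on the slide; the function names and the 100x target coverage in the example are illustrative, not part of the lecture.

```python
def total_bases(read_pairs: int, read_length: int, reads_per_pair: int = 2) -> int:
    """Total sequenced bases for one lane."""
    return read_pairs * reads_per_pair * read_length


def coverage(bases: int, genome_size: int) -> float:
    """Average fold coverage of a genome of the given size."""
    return bases / genome_size


def samples_per_lane(bases: int, genome_size: int, target_coverage: int) -> int:
    """How many samples fit in one lane at a chosen per-sample coverage."""
    return bases // (genome_size * target_coverage)


if __name__ == "__main__":
    lane = total_bases(read_pairs=250_000_000, read_length=100)
    yeast = 12_000_000  # 12 Mb, as on the slide
    print(f"{lane / 1e9:.0f} Gb per lane")
    print(f"{coverage(lane, yeast):.0f}x yeast coverage")
    # Hypothetical target of 100x per sample:
    print(samples_per_lane(lane, yeast, 100), "yeast samples per lane at 100x")
```

Dividing the lane among many samples in this way is what motivates the barcoding strategies on the following slides.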
![Page 8: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/8.jpg)
Barcode Sequencing
• Two ways to perform barcode sequencing
  – In-line barcodes
    • Barcode is read as part of the normal sequencing read
  – Multiplex barcodes
    • Barcode is read as a third, short sequencing run (also known as index reads)
• Can be used to run multiple samples from any particular origin on the same lane of a HiSeq, with the barcodes allowing the samples to be de-convoluted afterwards.
• Barcodes should be designed so that they are balanced in GC content, and as dissimilar as possible.
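A minimal sketch of both ideas, checking a barcode set and de-convoluting in-line barcoded reads, might look like the following. The four barcodes, the sample names and the one-mismatch tolerance are hypothetical choices for illustration; real pipelines typically also handle quality scores and index reads.

```python
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: Sequence[str]) -> int:
    """Smallest Hamming distance between any two barcodes (higher is safer)."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def assign_sample(read: str, barcodes: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[Tuple[str, str]]:
    """Match the start of a read to an in-line barcode.

    Returns (sample name, read with barcode removed), or None when the
    read matches no barcode or more than one barcode.
    """
    length = len(next(iter(barcodes)))
    if len(read) < length:
        return None
    prefix = read[:length]
    hits = [sample for barcode, sample in barcodes.items()
            if hamming(prefix, barcode) <= max_mismatches]
    if len(hits) != 1:
        return None
    return hits[0], read[length:]


# Hypothetical barcode set: all pairs differ at every position, GC is 50%.
BARCODES = {"ACGT": "sample_A", "CATG": "sample_B",
            "GTAC": "sample_C", "TGCA": "sample_D"}

if __name__ == "__main__":
    print(min_pairwise_distance(list(BARCODES)))
    print([gc_fraction(b) for b in BARCODES])
    print(assign_sample("ACGTTTGACC", BARCODES))
```

Tolerating a mismatch is only safe when the barcodes are far enough apart that an erroneous barcode still has a single closest match, which is one reason to make them as dissimilar as possible.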
![Page 9: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/9.jpg)
In-line Barcode Sequencing
![Page 10: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/10.jpg)
Multiplex, or Index barcoding
![Page 11: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/11.jpg)
Random barcoding
• During the PCR step, each template gets amplified many times
• If your library is of insufficient complexity, or you overamplify, you may have PCR duplicates
• You want to make independent observations, not redundant observations
• When sequencing to high coverage, you may have identical, but non-redundant observations.
• Want to be able to distinguish these.
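One way to use random barcodes, sketched below, is to count distinct barcodes at each mapping position: reads that share both position and barcode are treated as PCR duplicates, while reads at the same position with different barcodes are treated as independent molecules. The `MappedRead` record and its fields are hypothetical, and the sketch ignores sequencing errors within the barcode itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class MappedRead(NamedTuple):
    chrom: str
    position: int
    strand: str
    random_barcode: str


Site = Tuple[str, int, str]


def count_independent_molecules(reads: Iterable[MappedRead]) -> Dict[Site, int]:
    """Count distinct random barcodes observed at each mapping site."""
    barcodes_at: Dict[Site, Set[str]] = defaultdict(set)
    for read in reads:
        site = (read.chrom, read.position, read.strand)
        barcodes_at[site].add(read.random_barcode)
    return {site: len(barcodes) for site, barcodes in barcodes_at.items()}


if __name__ == "__main__":
    reads = [
        MappedRead("chrI", 100, "+", "AAC"),
        MappedRead("chrI", 100, "+", "AAC"),  # same barcode: PCR duplicate
        MappedRead("chrI", 100, "+", "GTT"),  # different barcode: independent
        MappedRead("chrI", 250, "-", "AAC"),
    ]
    print(count_independent_molecules(reads))
```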
![Page 12: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/12.jpg)
Random Barcoding
![Page 13: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/13.jpg)
What are the data?
• Illumina produces data in fastq format.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 1: ‘@’ followed by a sequence identifier
Line 2: The sequence
Line 3: ‘+’, optionally followed by a sequence identifier
Line 4: The quality scores
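As a rough illustration of the four-line layout, the sketch below reads records and converts the quality string to numbers. It assumes strict four-line records (no wrapped sequences) and a Phred+33 quality encoding; the slide does not state the encoding, and older Illumina pipelines used a different offset, so `offset` is left as a parameter.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file with exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

if __name__ == "__main__":
    for record in read_fastq(io.StringIO(EXAMPLE)):
        scores = phred_scores(record.quality)
        print(record.identifier, len(record.sequence), min(scores), max(scores))
```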
![Page 14: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/14.jpg)
Example of Illumina SeqID
@HWUSI-EAS100R:6:73:941:1973#0/1
HWUSI-EAS100R: The unique instrument name
6: Flowcell lane
73: Tile number within the flowcell
941: 'x'-coordinate of the cluster within the tile
1973: 'y'-coordinate of the cluster within the tile
#0: index number for a multiplexed sample (0 for no indexing)
/1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
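The field layout above can be pulled apart with a regular expression, as in this sketch. It handles only the identifier layout shown on this slide; the class and function names are illustrative, and the index is kept as a string because the slide does not say whether it is always numeric.

```python
import re
from typing import NamedTuple, Optional


class IlluminaReadId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair_member: Optional[int]


_READ_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_read_id(identifier: str) -> IlluminaReadId:
    """Split an identifier of the form shown on the slide into its fields."""
    match = _READ_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    pair = match.group("pair")
    return IlluminaReadId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair_member=int(pair) if pair else None,
    )


if __name__ == "__main__":
    print(parse_read_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```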
![Page 15: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/15.jpg)
Assessing Quality
![Page 16: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/16.jpg)
FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
![Page 17: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/17.jpg)
HTQC
[Figure with panels A to G]
https://sourceforge.net/projects/htqc
![Page 18: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/18.jpg)
De novo assembly
• Several methods available
• Short reads require long overlaps
  • e.g., 33 bp reads must overlap by 20 bp
  • end-trimming helps, to remove low quality bases.
• Most de novo short read assemblers use a k-mer hashing based approach.
• The central challenge of genome assembly is resolving repeat regions.
![Page 19: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/19.jpg)
De novo Assembly Strategies
• Many, many different algorithms and open source (as well as closed source) software for short read sequence assembly.
• Choice of tool depends on exactly what you are trying to assemble:
  – Genome size
  – Genome complexity
  – Level of polymorphism
  – Genome vs. transcriptome vs.
  – Sequence coverage you have (more is generally better)
  – Paired-end vs. single end (you should really have paired-end data)
• E.g.
  – SSAKE (Warren et al, 2007)
    • Uses DNA prefix tree to find k-mer matches
  – Edena (Hernandez et al, 2008)
    • Overlap layout algorithm plus error correction
  – Velvet (Zerbino and Birney, 2008)
    • Uses de Bruijn graph algorithm plus error correction
![Page 20: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/20.jpg)
Example of Velvet de novo Assembly
![Page 21: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/21.jpg)
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
Sequence (7bp reads)
Hashing (k = 4)
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
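The hashing step can be illustrated by breaking each read into overlapping k-mers and counting how often each one is seen. This is a simplified sketch rather than Velvet's implementation (it ignores reverse complements, for example); the function names are illustrative.

```python
from collections import Counter
from typing import Iterable, List


def kmers(read: str, k: int) -> List[str]:
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer across a collection of reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    # A few of the 7 bp reads from the slide.
    reads = ["TAGTCGA", "AGTCGAG", "GTCGAGG", "CGAGGCT"]
    for kmer, count in sorted(count_kmers(reads, 4).items()):
        print(kmer, f"({count}x)")
```

A Python `Counter` is itself a hash table, so looking up how many times a given k-mer was observed takes roughly constant time regardless of the number of reads.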
![Page 22: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/22.jpg)
k-mers (k = 4) with observed counts, row by row as drawn in the figure:
• TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (16x), GCTT (11x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x)
• CTTC (1x), TTCA (2x), TCAG (2x), CAGA (1x)
• CTTT (8x), TTTA (8x), TTAG (12x), TAGA (16x)
• CGAC (1x), GACG (1x), ACGC (1x)
• TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x)
• GATT (1x)
• AGAA (1x)
Graph Building
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
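Graph building can be sketched by treating each k-mer as a node and adding an edge whenever two k-mers appear next to each other in a read, so that they overlap by k-1 bases. This is a simplified illustration of the idea, not Velvet's data structure: it tracks a single strand and does not store coverage.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_kmer_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each k-mer to the sorted list of k-mers that follow it in a read."""
    successors: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            left = read[i:i + k]
            right = read[i + 1:i + 1 + k]
            successors[left].add(right)
            successors.setdefault(right, set())  # make sure every node exists
    return {node: sorted(nexts) for node, nexts in successors.items()}


if __name__ == "__main__":
    # The third read carries a hypothetical error that creates a branch.
    reads = ["TAGTCGA", "GTCGAGG", "GTCGACG"]
    for node, nexts in sorted(build_kmer_graph(reads, 4).items()):
        print(node, "->", ", ".join(nexts) if nexts else "(end)")
```

A node with more than one successor, such as TCGA here, marks a point where the reads disagree or where a repeat rejoins the path.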
![Page 23: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/23.jpg)
[Figure: the k-mer graph from the previous slide, redrawn with each linear stretch merged into a single node]
Nodes after merging, as labelled in the figure: TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GATCCGATGAG, AGAT, AGAA, GATT, GAGGCT, TAGA, AGAGA, AGACAG
Simplification of Linear Stretches
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
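Merging linear stretches can be sketched as follows: whenever a node has exactly one successor and that successor has exactly one predecessor, the two are joined into a longer sequence. The code below is an illustrative single-strand version with made-up function names; it does not handle isolated cycles and is not Velvet's implementation. Run on error-free 7 bp reads tiled across the slide's example sequence, it produces the four nodes that the slides show once errors have been removed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tile_reads(sequence: str, read_length: int) -> List[str]:
    """Every window of the given length: idealised, error-free reads."""
    return [sequence[i:i + read_length]
            for i in range(len(sequence) - read_length + 1)]


def build_edges(reads: Iterable[str], k: int) -> Set[Tuple[str, str]]:
    """Edges between k-mers that are adjacent within some read."""
    edges: Set[Tuple[str, str]] = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + 1 + k]))
    return edges


def merge_linear_stretches(edges: Set[Tuple[str, str]]) -> List[str]:
    """Collapse unbranched paths of k-mers into single sequences."""
    succ: Dict[str, Set[str]] = defaultdict(set)
    pred: Dict[str, Set[str]] = defaultdict(set)
    nodes: Set[str] = set()
    for left, right in edges:
        succ[left].add(right)
        pred[right].add(left)
        nodes.update((left, right))

    def starts_stretch(node: str) -> bool:
        # A node continues a stretch only if it has a single predecessor
        # and that predecessor has no other successor.
        if len(pred[node]) != 1:
            return True
        (previous,) = pred[node]
        return len(succ[previous]) != 1

    merged: List[str] = []
    visited: Set[str] = set()
    for node in sorted(nodes):
        if not starts_stretch(node):
            continue  # note: isolated cycles have no start and are skipped
        path, current = node, node
        visited.add(node)
        while len(succ[current]) == 1:
            (nxt,) = succ[current]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            path += nxt[-1]  # consecutive k-mers overlap by k-1 bases
            visited.add(nxt)
            current = nxt
        merged.append(path)
    return merged


if __name__ == "__main__":
    reference = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
    edges = build_edges(tile_reads(reference, 7), k=4)
    print(merge_linear_stretches(edges))
```

The repeated GAGGCTTTAGA stretch ends up as a single node with two ways in and two ways out, which is a small-scale version of the repeat problem that limits short-read assembly.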
![Page 24: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/24.jpg)
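[Figure: the simplified graph, with its "Tips" and "Bubble" marked, and the graph that remains after they are removed]
Nodes before error removal: TAGTCGA, CGAG, CGACGC, GCTCTAG, GCTTTAG, GATCCGATGAG, AGAT, AGAA, GATT, GAGGCT, TAGA, AGAGA, AGACAG
Nodes after error removal: TAGTCGAG, GAGGCTTTAGA, AGAGACAG, AGATCCGATGAG
Error (tip and bubble) removal
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG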
![Page 25: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/25.jpg)
De novo Assembler Performance
• All three programs run with default parameters on the same dataset
  – Input: 8.6 million reads
  – Platform: 64-bit Opteron, 4 CPUs, 32 GB memory

| Program | Version | CPU time | Wall clock |
|---|---|---|---|
| SSAKE | 3.0 | 2:24:59 | 5:08:59 |
| Edena | 2.11 | 0:28:31 | 28:58 |
| Velvet | 0.5 | 0:08:48 | 10:36 |
![Page 26: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/26.jpg)
De novo assemblies

| Program | # Contigs >200 bp | N50 (bp) | Sum (bp) | Singletons |
|---|---|---|---|---|
| SSAKE | 12,532 | 549 | 6,090,567 | 3,164,495 |
| Edena | 8,316 | 902 | 5,759,209 | 3,955,865 |
| Velvet | 7,382 | 1,252 | 6,474,426 | 1,273,164 |

| Program | # Contigs | N50 (bp) | Sum (bp) | Max contig |
|---|---|---|---|---|
| SSAKE | 185,030 | 87 | 14,287,079 | 5,490 |
| Edena | 11,180 | 837 | 6,175,460 | 11,300 |
| Velvet | 10,684 | 1,184 | 6,841,458 | 16,239 |
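The slides do not define N50, so the sketch below assumes the usual definition: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the summed contig length. Whether the tabulated values were computed in exactly this way is not stated.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Contig length at which half of the total assembly length is reached."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because very short contigs and singletons pull the total up without adding long sequences, the same assembly can have quite different N50 values depending on whether a length cutoff such as >200 bp is applied.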
![Page 27: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/27.jpg)
Assembly Limitations
• Common repeat regions are typically missing/collapsed
  – Han Chinese genome missing ~420Mbp of repeats
• Same is true for segmental duplications
  – Han Chinese genome only contains ~10Mbp of ~150Mbp of segmental duplications.
• You typically get very large numbers of contigs, which range in size from very small, to sometimes quite large.
![Page 28: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/28.jpg)
Recent Assembler Comparisons
• Earl et al (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21: 2224-2241.
  – Used a simulated dataset for all competitors to assemble
• Salzberg et al (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22(3):557-67.
  – Applied several assembly algorithms to their own datasets, for several different sized genomes
• Bradnam et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
  – See http://assemblathon.org/
• If you have an assembly problem, you should read these papers to gain some insights into strengths and weaknesses of different assemblers
![Page 29: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/29.jpg)
Improving de novo Assemblies
• Need to generate additional long range continuity to be able to orient and order contigs
  • Mate pairs
  • Hybrid approach using PacBio Reads
  • Synthetic long reads (aka Moleculo)
![Page 30: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/30.jpg)
Mate-pair libraries
• Goal is to have the equivalent of 2-5kb insert libraries.
• However, technology is limited to ~700 bp fragments
  – Means you have to use some molecular biology to accomplish the equivalent.
![Page 31: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/31.jpg)
Genomic DNA
Fragment
Size Select (2-5kb)
Biotinylate
![Page 32: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/32.jpg)
Circularize
Fragment (400-600bp)
![Page 33: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/33.jpg)
Enrich Biotinylated fragments
Standard Paired End Illumina Sequencing
Incorporate Mate-pair information into assembly
![Page 34: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/34.jpg)
Leveraging Multiple Technologies
• Illumina is great, because you can get a ton of data
  – BUT read length is short
• PacBio is great, because read lengths are long
  – BUT the data quality is terrible
• Two approaches have been used:
  1. Hybrid error correction
     • Use short reads to perform correction of long PacBio reads, and then assemble those
  2. Use PacBio reads to improve existing (short read or Sanger based) assemblies
     • E.g. With 24× mapped coverage of PacBio long-reads applied to a D. pseudoobscura assembly, 99% of gaps were addressed, with 69% being closed and a further 12% improved.
![Page 35: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/35.jpg)
Long “Synthetic Reads” aka Moleculo
Genomic DNA
Fragment
Size Select (10kb)
Polish, ligate amplification adaptors
![Page 36: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/36.jpg)
~10 kb DNA
Dilute to 500 molecules per well
Amplify, fragment, add sequencing adaptors
![Page 37: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/37.jpg)
Pool
Sequence
![Page 38: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/38.jpg)
Separate, based on bar code
Remove barcodes, assemble 10kb fragments
Assemble genome from 10kb fragments
![Page 39: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/39.jpg)
![Page 40: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/40.jpg)
Mapping Short Reads
• Many options; often a trade off between speed, resources and sensitivity.
• Several open source projects to solve this problem, continually improving in speed and memory requirements.
• New features being added all the time.
• When dealing with short read data, make sure you have the very latest versions of the software you’re using, as some are updated frequently.
![Page 41: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/41.jpg)
Alignment
@HWI-EAS412_4:1:1:1376:380
AATGATACGGCGACCACCGAGATCTA
![Page 42: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/42.jpg)
Approaches to Short Read Alignment
• Hash-Based mapping
  – Hashing of reads (E.g. Maq, Eland, SHRiMP)
  – Hashing of genome (E.g. novoalign, SHRiMP2)
• Indexing using Suffix Array/Burrows-Wheeler Transform (BWT) (E.g. bowtie, bwa)
![Page 43: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/43.jpg)
SHRiMP2
• Uses hash of genome to find alignment seeds, then performs Smith-Waterman
  – SW, while slow, is accelerated; requires x86_64 processor (which most macs have nowadays)
• Can detect indels, as well as mismatches
• As of v2.2, takes into account quality scores
• Takes longer than bowtie and bwa, but is more sensitive than both
![Page 44: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/44.jpg)
MAQ
• Much faster than SHRiMP, at the cost of accuracy (cannot find indels)
• Uses a hashing technique to index the reads
• Guaranteed to find alignments with up to 2 mismatches
• Can take advantage of paired end reads
• Uses sequence quality scores to determine best alignments
• Generally no longer used
http://sourceforge.net/projects/maq/
![Page 45: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/45.jpg)
How does “hashing” work?
• A hash function simply converts a string (“key”) to an integer (“value”).
• The integer is then used as an index in an array, for fast look up.
• In MAQ, the reads are “hashed”, using 6 different permutations of the first 28bp.
• The genome is then looked through, in 28bp chunks, to see if they match, via the hash, to reads.
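The read-hashing idea can be sketched with a Python dictionary, which is itself a hash table: the start of each read (the seed) is the key, and the genome is then scanned window by window for seeds that are present in the table. This is a deliberately simplified illustration that finds exact seed matches only; it does not reproduce MAQ's six permuted templates, its mismatch handling or its use of quality scores, and the function names are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def index_reads(reads: Dict[str, str], seed_length: int) -> Dict[str, List[str]]:
    """Hash table from seed (first seed_length bases) to read names."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name, sequence in reads.items():
        if len(sequence) >= seed_length:
            index[sequence[:seed_length]].append(name)
    return index


def scan_genome(genome: str, index: Dict[str, List[str]],
                seed_length: int) -> List[Tuple[str, int]]:
    """Slide along the genome and report (read name, 0-based position) hits."""
    hits: List[Tuple[str, int]] = []
    for position in range(len(genome) - seed_length + 1):
        window = genome[position:position + seed_length]
        for name in index.get(window, ()):
            hits.append((name, position))
    return hits


if __name__ == "__main__":
    genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
    reads = {"r1": "GATCCGATG", "r2": "GGCTTTAGA", "r3": "AAAAAAAA"}
    index = index_reads(reads, seed_length=8)
    print(scan_genome(genome, index, seed_length=8))
```

Each hit is only a candidate location; a real aligner would then extend and score the full read at that position. A read whose seed falls in a repeat, like `r2` here, is reported at more than one position.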
![Page 46: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/46.jpg)
Bowtie
• Similar to MAQ, in that it uses quality scores to find best alignments.
• Uses “Burrows-Wheeler index” to keep its memory footprint small.
• Can find alignments with up to 3 mismatches in the first L bases of the read.
• Only ungapped alignments
• Also supports paired end reads.
• Bowtie2 supports gapped alignments too.
http://bowtie-bio.sourceforge.net/
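To give a feel for how a Burrows-Wheeler index supports searching, the sketch below builds the transform naively from sorted rotations and counts exact occurrences of a pattern by backward search, matching the pattern from its last base to its first. It is a teaching sketch only: it finds exact matches, rebuilds the index on every call, and uses far more memory than a real index, and it does not attempt Bowtie's mismatch handling or quality-aware search.

```python
from collections import Counter
from typing import Dict, List


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(text: str, pattern: str) -> int:
    """Count exact matches of pattern in text using backward search."""
    last_column = bwt(text)
    counts = Counter(last_column)

    # first_row[c]: number of characters in the text that sort before c.
    first_row: Dict[str, int] = {}
    total = 0
    for char in sorted(counts):
        first_row[char] = total
        total += counts[char]

    # occ[c][i]: occurrences of c in the first i characters of the last column.
    occ: Dict[str, List[int]] = {char: [0] * (len(last_column) + 1) for char in counts}
    for i, seen in enumerate(last_column):
        for char in counts:
            occ[char][i + 1] = occ[char][i] + (char == seen)

    low, high = 0, len(last_column)  # current range of matching rows
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + occ[char][low]
        high = first_row[char] + occ[char][high]
        if low >= high:
            return 0
    return high - low


if __name__ == "__main__":
    genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
    print(bwt("banana"))
    print(count_occurrences(genome, "GGCTTTAG"))
```

Each step of the search narrows a range of rows using only two table lookups, so the time to find a read depends on the read length rather than on the size of the genome.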
![Page 47: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/47.jpg)
Bowtie Algorithm
@HWI-EAS412_4:1:1:1376:380
AATGATACGGCGACCACCGAGATCTA
![Page 48: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/48.jpg)
![Page 49: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/49.jpg)
![Page 50: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/50.jpg)
![Page 51: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/51.jpg)
![Page 52: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/52.jpg)
![Page 53: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/53.jpg)
![Page 54: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/54.jpg)
![Page 55: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/55.jpg)
![Page 56: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/56.jpg)
![Page 57: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/57.jpg)
![Page 58: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/58.jpg)
![Page 59: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/59.jpg)
![Page 60: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/60.jpg)
![Page 61: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/61.jpg)
Bowtie Algorithm
@HWI-EAS412_4:1:1:1376:380
AATGATACGGCGACCACCGAGATCTA
AATAATACGGCGACCACCGAGATCTA
![Page 62: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/62.jpg)
Bowtie Algorithm
@HWI-EAS412_4:1:1:1376:380
AATAATACGGCGACCACCGAGATCTA
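The slides above step through the example read and then a variant with a single substitution (G to A at the fourth base). A minimal Python sketch of that idea, assuming a naive FM-index over a toy reference, is given here. The function names `build_fm_index` and `search` and the toy reference are hypothetical, and Bowtie's actual backtracking policy (quality-aware, with further optimizations) is more elaborate than this.

```python
def build_fm_index(text):
    """Return (suffix array, C table, Occ table) for text; naive construction."""
    text += "$"  # sentinel that sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(text))
    # C[c] = number of characters in text that sort strictly before c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def search(read, sa, C, occ, mismatches=0):
    """Reference positions where read aligns with at most `mismatches` substitutions."""
    hits = set()

    def extend(i, lo, hi, budget):
        # [lo, hi) is the suffix-array range matching read[i+1:]
        if lo >= hi:
            return  # no suffix matches: backtrack
        if i < 0:
            hits.update(sa[lo:hi])  # whole read consumed
            return
        for c in "ACGT":
            if c not in C:
                continue
            cost = 0 if c == read[i] else 1  # substituting a base costs 1
            if cost <= budget:
                extend(i - 1, C[c] + occ[c][lo], C[c] + occ[c][hi], budget - cost)

    extend(len(read) - 1, 0, len(sa), mismatches)
    return sorted(hits)


# Toy reference (an assumption): the slide's read embedded at offset 3
reference = "TTG" + "AATGATACGGCGACCACCGAGATCTA" + "CAT"
sa, C, occ = build_fm_index(reference)
print(search("AATGATACGGCGACCACCGAGATCTA", sa, C, occ))                # exact match
print(search("AATAATACGGCGACCACCGAGATCTA", sa, C, occ))                # no exact match
print(search("AATAATACGGCGACCACCGAGATCTA", sa, C, occ, mismatches=1))  # one substitution allowed
```

The read is consumed from its last base to its first, narrowing a suffix-array range at each step; when the range becomes empty, the search backtracks and tries a different base at an earlier position.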
![Page 63: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/63.jpg)
BWA
• From the author of Maq, but uses the Burrows-Wheeler transform for a significant speedup.
• Can also find small indels, in contrast to both Maq and Bowtie.
• Is slightly slower than Bowtie, but its ability to find indels makes it more useful if SNVs are important to you.
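To illustrate why indel-aware (gapped) alignment matters, the sketch below compares an ungapped mismatch count with an edit distance for a read that carries a one-base deletion. The read is hypothetical, derived from the example sequence used earlier, and the functions are generic textbook routines, not BWA's algorithm.

```python
def hamming(a, b):
    """Mismatch count between equal-length strings (cost of an ungapped alignment)."""
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x from a
                           cur[j - 1] + 1,            # insert y into a
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]


reference = "AATGATACGGCGACCACCGAGATCTA"
# Hypothetical read: the reference with the G at index 3 deleted
read = reference[:3] + reference[4:]

ungapped = hamming(read, reference[:len(read)])  # every base after the deletion is shifted
gapped = edit_distance(read, reference)          # a single deletion explains the read
print(ungapped, gapped)
```

An aligner that only allows substitutions sees a run of mismatches downstream of the deletion, which can produce an unaligned read or spurious SNV calls; a gapped aligner explains the same read with one event.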
![Page 64: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/64.jpg)
Comparison
• PC: 2.4 GHz Intel Core 2, 2 GB RAM
• Server: 2.4 GHz AMD Opteron, 32 GB RAM
• Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
• SOAP not run on PC due to memory constraints
• Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
• Reference: Human (NCBI 36.3, contigs)

| Aligner (platform) | CPU time | Wall clock time | Reads per hour | Peak virtual memory footprint | Bowtie speedup | Reads aligned (%) |
|---|---|---|---|---|---|---|
| Bowtie -v 2 (server) | 15m:07s | 15m:41s | 33.8 M | 1,149 MB | - | 67.4 |
| SOAP (server) | 91h:57m:35s | 91h:47m:46s | 0.08 M | 13,619 MB | 351x | 67.3 |
| Bowtie (PC) | 16m:41s | 17m:57s | 29.5 M | 1,353 MB | - | 71.9 |
| Maq (PC) | 17h:46m:35s | 17h:53m:07s | 0.49 M | 804 MB | 59.8x | 74.7 |
| Bowtie (server) | 17m:58s | 18m:26s | 28.8 M | 1,353 MB | - | 71.9 |
| Maq (server) | 32h:56m:53s | 32h:58m:39s | 0.27 M | 804 MB | 107x | 74.7 |
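The derived columns can be checked with a little arithmetic. The sketch below assumes that reads per hour and Bowtie speedup are computed from wall clock time. The table does not state this, but it reproduces the values for the Bowtie and Maq rows used here. The helper names are hypothetical.

```python
def seconds(h=0, m=0, s=0):
    """Convert hours, minutes and seconds to seconds."""
    return h * 3600 + m * 60 + s


READS_M = 8.84  # millions of reads in the test set (from the slide)

# Wall clock times copied from the table
wall = {
    "Bowtie (PC)": seconds(m=17, s=57),
    "Maq (PC)": seconds(h=17, m=53, s=7),
    "Bowtie (server)": seconds(m=18, s=26),
    "Maq (server)": seconds(h=32, m=58, s=39),
}


def reads_per_hour(name):
    """Throughput in millions of reads per wall clock hour."""
    return READS_M / (wall[name] / 3600)


def speedup(slow, fast):
    """Ratio of wall clock times, slow over fast."""
    return wall[slow] / wall[fast]


for name in wall:
    print(f"{name}: {reads_per_hour(name):.2f} M reads/hour")
print(f"Maq vs Bowtie (PC): {speedup('Maq (PC)', 'Bowtie (PC)'):.1f}x")
print(f"Maq vs Bowtie (server): {speedup('Maq (server)', 'Bowtie (server)'):.0f}x")
```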
![Page 65: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/65.jpg)
Comparison
• Bowtie delivers about 30 million alignments per CPU hour
![Page 66: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/66.jpg)
Comparison
• Bowtie and Maq have memory footprints compatible with a typical workstation with 2 GB of RAM
• SOAP requires a computer with >13 GB of RAM
• SOAP2 claims to be “super-fast” and to require less RAM (it also uses the Burrows-Wheeler transform).
• Your choice will be dictated by your needs (sensitivity, genome size, number of reads) and your computing resources, and may change over time.
![Page 67: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/67.jpg)
More Comparison Data
Figure: Precision and recall by amount of variation for 4 datasets, by polymorphism (number of SNPs, indel size).
David M, Dzamba M, Lister D, Ilie L, Brudno M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.
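For reference, precision and recall for a mapper are often computed on simulated reads whose true positions are known. The sketch below uses one common definition (an assumption; the cited paper's exact criteria may differ): precision is the fraction of reported mappings that are correct, and recall is the fraction of all reads that are correctly mapped. The read names and positions are made up for illustration.

```python
def precision_recall(true_pos, mapped_pos):
    """
    true_pos:   read id -> true reference position (all simulated reads)
    mapped_pos: read id -> position reported by the mapper (unmapped reads absent)
    """
    correct = sum(1 for read, pos in mapped_pos.items() if true_pos.get(read) == pos)
    precision = correct / len(mapped_pos) if mapped_pos else 0.0
    recall = correct / len(true_pos) if true_pos else 0.0
    return precision, recall


# Made-up example: 4 simulated reads, 3 reported mappings, 2 of them correct
truth = {"r1": 100, "r2": 250, "r3": 400, "r4": 730}
mapped = {"r1": 100, "r2": 250, "r3": 9000}

precision, recall = precision_recall(truth, mapped)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A mapper that reports only confident alignments tends to gain precision at the expense of recall, which is why the two are reported together.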
![Page 68: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/68.jpg)
Current Practice
• Most people use bwa for mapping their short read data if they want to discover variants
• If not interested in variants, people use bowtie for speed
• Most people don’t determine whether the tool they are using is the best for their purpose
• There is no standard benchmark dataset, though see:
– Holtgrewe et al. (2011). A novel and well-defined benchmarking method for second generation read mapping. BMC Bioinformatics 12:210.
• It doesn’t hurt to experiment
![Page 69: Genetics 211 - 2014 Lecture 2 - Stanford Medicinemed.stanford.edu/content/dam/sm/genetics/documents/... · Genetics 211 - 2014 Lecture 2 High Throughput Sequencing Gavin Sherlock](https://reader035.vdocument.in/reader035/viewer/2022070800/5f0256cf7e708231d403c828/html5/thumbnails/69.jpg)
Recommended Reading
Mapping
• Li, H., Ruan, J. and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11):1851-8. (MAQ)
• David, M., Dzamba, M., Lister, D., Ilie, L. and Brudno, M. (2011). SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics 27(7):1011-2.
• Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3):R25. (Bowtie)
• Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357-359.
• Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K. and Wang, J. (2009). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966-7.
• Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754-60. (BWA)
Assembly
• Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5):821-9.
• Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009). Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4(12):e8407.
• Simpson, J.T. and Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-73.
• Simpson, J.T. and Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3):549-56. (SGA)
• Earl, D. et al. (2011). Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research 21(12):2224-41.
• Bradnam, K.R. et al. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2(1):10.
• English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., Worley, K.C., Gibbs, R.A. (2012). Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11):e47768.
Moleculo
• Voskoboynik, A., et al. (2013). The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2:e00569.