Wrangling Short Read Data with SHRiMP
Stephen M. Rumble
Department of Computer Science
University of Toronto
07/19/08
Handling NGS Data
• NGS: at least 3 distinct read types:
– Illumina/Solexa, 454: letter-space
– AB SOLiD: color-space (di-base sequencing)
– 2-pass SMS (Helicos): 2 reads of the same location, higher error rates
• Need new algorithms
– SOLiD: Biologists want letters, not colors
– 2-pass: How to best handle two reads?
SHRiMP Overview
Isolate similarity in stages:
1. Spaced Seed Filtering
2. Vectorized Smith-Waterman
3. Full Alignment
– Specialized for SOLiD, 2-pass, Letter-space
4. Compute p-values (and other statistics)
[Slide annotation: a brace marks stages as "Common" to all read types]
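The first stage, spaced seed filtering, can be illustrated with a small sketch. It uses the seed 1111001111 quoted later in the talk, where 1 marks a position that must match and 0 a position that is ignored. The function names, the toy genome, and the choice to index the genome rather than the reads are assumptions made for illustration, not SHRiMP's actual implementation.

```python
from collections import defaultdict

def spaced_key(window, seed):
    """Keep only the letters that sit under a '1' in the seed mask."""
    return "".join(ch for ch, bit in zip(window, seed) if bit == "1")

def index_genome(genome, seed):
    """Map each spaced key to the genome offsets where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - len(seed) + 1):
        index[spaced_key(genome[pos:pos + len(seed)], seed)].append(pos)
    return index

def candidate_hits(read, index, seed):
    """Return (read_offset, genome_offset) pairs whose spaced keys agree."""
    hits = []
    for off in range(len(read) - len(seed) + 1):
        key = spaced_key(read[off:off + len(seed)], seed)
        hits.extend((off, g) for g in index.get(key, []))
    return hits

seed = "1111001111"
genome = "ACGTACGTTAGCCGATAGGCTTAACG"   # hypothetical toy reference
read = "TAGCCTATAGGC"                   # genome[8:20] with one mismatch (G -> T)
hits = candidate_hits(read, index_genome(genome, seed), seed)
print(hits)  # [(0, 8), (1, 9)]
```

Every 10-letter window of this read contains the mismatch, so a contiguous 10-mer seed would find nothing; the two windows in which the mismatch falls under a 0 of the spaced seed still produce hits, and both point at the same genome diagonal.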
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
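AB SOLiD: Color-space Sequencing
Color for each pair of adjacent bases:
    A  C  G  T
A   0  1  2  3
C   1  0  3  2
G   2  3  0  1
T   3  2  1  0
[Figure: transition diagram on the bases A, C, G, T; self-loops carry color 0 and edges between different bases carry colors 1, 2, 3]
AB SOLiD reads look like this:
T012233102
The read T012233102 translates to TGAGCGTTC. With a single miscalled color, T012033102 translates to TGAATACCT, which agrees with the original only in its first three letters:
TGAGCGTTC
|||
TGAATACCT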
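The color matrix on this slide equals the XOR of 2-bit base codes (A=0, C=1, G=2, T=3), which makes encoding and decoding short to write down. The following is a minimal sketch under that observation; the function names are hypothetical and a read is assumed to be a primer letter followed by color digits.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3

def encode(primer, seq):
    """Letter-space to color-space: each color is the XOR of adjacent base codes."""
    prev, colors = BASES.index(primer), []
    for ch in seq:
        cur = BASES.index(ch)
        colors.append(str(prev ^ cur))
        prev = cur
    return primer + "".join(colors)

def decode(read):
    """Color-space to letter-space, starting from the known primer base."""
    cur, letters = BASES.index(read[0]), []
    for c in read[1:]:
        cur ^= int(c)
        letters.append(BASES[cur])
    return "".join(letters)

print(encode("T", "TGAGCGTTC"))  # T012233102
print(decode("T012233102"))      # TGAGCGTTC
print(decode("T012033102"))      # TGAATACCT
```

Decoding T012033102 under this matrix yields TGAATACCT: the letters before the miscalled color (TGA) still agree with TGAGCGTTC, and every letter after it changes.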
AB SOLiD: Color space is complex!
G: TTGAGTTATGGAT  012210331023
R: TTGACTTATGGAT  012120331023
SNPs
TGAGTT  12210
TGACTT  12120
TGAATT  12030
TGATTT  12300
INDELS
TGAGTTA   122103
TGA-TTA   12-303
TGAGTTTA  1221003
TGAGTATA  1221333
It’s bloody complicated!
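In the SNP examples on this slide, one changed letter changes two adjacent colors, and the XOR of those two colors stays the same because it depends only on the two flanking bases. A hedged sketch of a check built on that property follows; the function names are hypothetical, and it only handles equal-length color strings with an isolated SNP.

```python
def color_diffs(ref, read):
    """Indices at which two equal-length color strings differ."""
    return [i for i, (a, b) in enumerate(zip(ref, read)) if a != b]

def looks_like_snp(ref, read):
    """True if exactly two adjacent colors changed and their XOR is preserved."""
    d = color_diffs(ref, read)
    if len(d) != 2 or d[1] != d[0] + 1:
        return False
    i, j = d
    return int(ref[i]) ^ int(ref[j]) == int(read[i]) ^ int(read[j])

print(looks_like_snp("12210", "12120"))          # True  (TGAGTT vs TGACTT)
print(looks_like_snp("12210", "12030"))          # True  (TGAGTT vs TGAATT)
print(looks_like_snp("012233102", "012033102"))  # False (one color changed: likely a read error)
print(looks_like_snp("12210", "12000"))          # False (adjacent changes, XOR not preserved)
```

A single isolated color change cannot come from a single SNP, which is one reason a color-aware aligner can tell sequencing errors from polymorphisms.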
AB SOLiD: Translations
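• Look at: 012233102
• Recall: 012033102
• 4 translations for every color sequence
[Figure: transition diagram on the bases A, C, G, T; self-loops carry color 0 and edges between different bases carry colors 1, 2, 3]
Colors:         0 1 2 0 3 3 1 0 2
Translation A: A A C T T A T G G A
Translation C: C C A G G C G T T C
Translation G: G G T C C G C A A G
Translation T: T T G A A T A C C T
Read with the color error, translated from the primer T:
TGAGCGTTC
|||
TGAATACCT
Read with the error corrected:
TGAGCGTTC
|||||||||
TGAGCGTTC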
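The four translations of a color sequence follow mechanically from the choice of starting base. A small sketch, assuming the XOR form of the color matrix (A=0, C=1, G=2, T=3) and a hypothetical function name:

```python
BASES = "ACGT"

def translations(colors):
    """Return the letter translation of a color string for each possible start base."""
    rows = {}
    for start in BASES:
        cur, letters = BASES.index(start), [start]
        for c in colors:
            cur ^= int(c)          # next base = previous base XOR color
            letters.append(BASES[cur])
        rows[start] = "".join(letters)
    return rows

rows = translations("012033102")
for start, row in rows.items():
    print(start, row)
# A AACTTATGGA
# C CCAGGCGTTC
# G GGTCCGCAAG
# T TTGAATACCT
```

At every position the four translations carry four different letters, so whichever letter the genome has at that position, exactly one translation matches it.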
AB SOLiD: Modified Smith-Waterman
• 4 S-W matrices, one per translation
• Errors transition into other matrix
• ‘Crossover’ penalty charged for errors
[Figure: S-W matrices for Translation A and Translation C, each aligned against the genome; an alignment path crosses from one matrix to the other at a color error]
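The idea of four matrices with a crossover penalty can be sketched as a small dynamic program. This is a simplified illustration, not SHRiMP's implementation: it uses linear gap costs, ignores the primer (a local alignment may start in any matrix), and the scoring values, the crossover penalty of -3, the function names, and the toy genome are all assumptions.

```python
BASES = "ACGT"

def translations(colors):
    """The 4 letter translations of a color list, indexed by start base A, C, G, T."""
    out = []
    for start in range(4):
        cur, letters = start, []
        for c in colors:
            cur ^= c
            letters.append(BASES[cur])
        out.append("".join(letters))
    return out

def color_sw(colors, genome, match=4, mismatch=-3, gap=-2, crossover=-3):
    """Best local score over 4 coupled S-W matrices, one per translation."""
    reads = translations(colors)
    n, m = len(colors), len(genome)
    # H[k][i][j]: best score ending at read position i (translation k), genome position j
    H = [[[0] * (m + 1) for _ in range(n + 1)] for _ in range(4)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(4):
                sub = match if reads[k][i - 1] == genome[j - 1] else mismatch
                # Diagonal move, possibly arriving from another matrix (a color error).
                diag = max(H[p][i - 1][j - 1] + (0 if p == k else crossover)
                           for p in range(4))
                H[k][i][j] = max(0, diag + sub,
                                 H[k][i - 1][j] + gap,   # read letter left unaligned
                                 H[k][i][j - 1] + gap)   # genome letter skipped
                best = max(best, H[k][i][j])
    return best

colors = [int(c) for c in "012033102"]   # read T012033102 without its primer
genome = "GGCATGAGCGTTCAAT"              # hypothetical reference containing TGAGCGTTC
print(color_sw(colors, genome))          # 33 = 9 matches * 4 - one crossover
```

With the crossover disabled (for example `crossover=-10**6`) the same read scores lower, because no single translation matches the genome over its full length.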
AB SOLiD: Obligatory Comparison
• SHRiMP and AB Mapper (1.6)
– SHRiMP seed 1111001111
– AB 35_2, 35_3 schemas
• 10,000 35mers
– C. savignyi (173Mb), very high polymorphism
• Considering single top hits only
           SHRiMP   AB 35_2   AB 35_3
% mapped   19.83    6.67      10.94
Runtime    13m04    1h24      2h25
AB SOLiD: Resultant Alignments
• SHRiMP emits letter-space alignments
– Clear to biologists
– Color-space need not be scary!
G:   798   GAACCCCTTACAACTGAACCCCTTAC   823
           ||X||||||||||||||||||| |||
T:         GAaCCCCTTACAACTGAACCCC-TAC
R:     1  T1211000203110121201000-231   25
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
2-pass SMS Reads
• SMS reads have high error rates
– “Dark bases” (skipped letters)
– Multiple passes are possible
– Ameliorate errors over passes
  • Good chance of missing base in one read
  • Acceptable chance of getting it in at least one
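A rough calculation shows why a second pass helps. The sketch below assumes the two passes drop bases independently (an assumption, not something the talk states) and borrows the 7% deletion rate used for the synthetic reads later in the talk; the function name is hypothetical.

```python
def two_pass_miss_probabilities(p_miss):
    """Per-base probabilities for two passes, assuming independent dark bases."""
    missed_in_both = p_miss ** 2
    missed_in_at_least_one = 1 - (1 - p_miss) ** 2
    return missed_in_both, missed_in_at_least_one

both, at_least_one = two_pass_miss_probabilities(0.07)
print(f"missed in both passes:        {both:.4f}")          # 0.0049
print(f"missed in at least one pass:  {at_least_one:.4f}")  # 0.1351
print(f"seen in at least one pass:    {1 - both:.4f}")      # 0.9951
```

Under these assumptions a given base is missing from at least one of the two reads fairly often, yet it is almost always present in at least one of them.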
SMS 2-pass: SHRiMP with 2 reads
Match = +4 Mismatch = -3 Gap = -2
[Figure: alignment matrix with the first read, CTGACT, across the top and the second read, CAGCAT, down the side]
CTG-ACT
CAGCA-T
S=9
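The scores on this slide can be reproduced with a plain global alignment of the two reads under Match = +4, Mismatch = -3, Gap = -2. The following is a minimal Needleman-Wunsch sketch with linear gap costs; the function names are hypothetical, and whether SHRiMP aligns the two reads to each other in exactly this way is an assumption.

```python
def score_columns(top, bottom, match=4, mismatch=-3, gap=-2):
    """Score an already-gapped alignment column by column."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total

def global_align(a, b, match=4, mismatch=-3, gap=-2):
    """Needleman-Wunsch: optimal global score and one optimal alignment."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub, D[i - 1][j] + gap, D[i][j - 1] + gap)
    # Trace back one optimal path from the bottom-right corner.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = global_align("CTGACT", "CAGCAT")
print(score)                                  # 9
print(score_columns("CTG-ACT", "CAGCA-T"))    # 9
print(score_columns("C-TG-ACT", "CA-GCA-T"))  # 8
```

The traceback returns only one optimal alignment; several alignments can tie at the optimal score, which is what motivates keeping a graph of alignments rather than a single one.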
SMS 2-pass: SHRiMP with 2 reads
Match = +4 Mismatch = -3 Gap = -2
[Figure: the matrix of CTGACT (across) against CAGCAT (down) with scores filled into its cells, and a graph whose nodes are aligned column pairs such as CC, AT, -T, A-, GG, AA, -C, C-, TT]
S=9
CTG-ACT    CTGAC-T
CAGCA-T    CAG-CAT
S=8
C-TG-ACT   CT-GAC-T
CA-GCA-T   C-AG-CAT
SMS 2-pass: SHRiMP with 2-pass data
• Build a DAG representing the (near) optimal alignments of the two reads
• Generate seeds (short paths) from the DAG
• Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW
• Do full WSG alignment for top hits
[Figure: DAG of aligned column pairs (CC, AT, -T, A-, GG, -A, AA, -C, C-, TT) built from the two reads]
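One way to find which matrix cells belong to (near) optimal alignments of the two reads is to add a forward (prefix) score and a backward (suffix) score for every cell and keep the cells whose sum is within some slack of the optimum. The sketch below illustrates that idea only; it is not claimed to be how SHRiMP builds its DAG, and the function names and the `slack` parameter are hypothetical.

```python
def prefix_scores(a, b, match=4, mismatch=-3, gap=-2):
    """D[i][j] = best global score of a[:i] against b[:j] (linear gaps)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                cands.append(D[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch))
            if i > 0:
                cands.append(D[i - 1][j] + gap)
            if j > 0:
                cands.append(D[i][j - 1] + gap)
            D[i][j] = max(cands)
    return D

def through_scores(a, b, **scoring):
    """Best score of any global alignment passing through each cell (i, j)."""
    n, m = len(a), len(b)
    forward = prefix_scores(a, b, **scoring)
    backward = prefix_scores(a[::-1], b[::-1], **scoring)  # suffix scores via reversal
    return [[forward[i][j] + backward[n - i][m - j] for j in range(m + 1)]
            for i in range(n + 1)]

def near_optimal_cells(a, b, slack=1, **scoring):
    """Optimal score and the set of cells on alignments within `slack` of it."""
    T = through_scores(a, b, **scoring)
    best = T[0][0]
    cells = {(i, j) for i, row in enumerate(T) for j, s in enumerate(row)
             if s >= best - slack}
    return best, cells

best, optimal = near_optimal_cells("CTGACT", "CAGCAT", slack=0)
_, near = near_optimal_cells("CTGACT", "CAGCAT", slack=1)
print(best)                  # 9
print(sorted(near - optimal))
```

Cells kept with `slack=0` lie on some optimal alignment; raising the slack admits cells used only by slightly worse alignments, and edges between consecutive kept cells would give a DAG from which short paths can be read off as seeds.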
SMS 2-pass: Results (in brief)
• 10,000 synthetic reads (~25-65 bp)
– 7% deletion, 1% insertion, 1% substitution rate
• Mapped to Human chromosome 1
– Spaced seed span 9: 111110111
Type         Separate   Profile   WSG
No hits %    0.13       4.91      4.31
Multiple %   26.45      9.34      9.13
Uniq cor %   63.00      74.90     75.84
Runtime      9m         11m       12m
SHRiMP Summary
• Fast mapping of short reads to a genome
– Handles:
  • color-space (SOLiD) reads
  • 2-pass (SMS) reads
  • insertions and deletions
– Easy to parallelize
• Computation of p-values & other statistics for hits
Acknowledgements
• SHRiMP is brought to you by:
– Michael Brudno, Adrian Dalca, Marc Fiume, Vlad Yanovsky (University of Toronto)
– Phil Lacroute, Arend Sidow (Stanford University)
http://compbio.cs.toronto.edu/shrimp