Wrangling Short Read Data with SHRiMP Stephen M. Rumble Department of Computer Science University of Toronto 07/19/08


Page 1: Wrangling Short Read Data with SHRiMP

Wrangling Short Read Data with SHRiMP

Stephen M. Rumble
Department of Computer Science
University of Toronto
07/19/08

Page 2: Wrangling Short Read Data with SHRiMP

Handling NGS Data

• NGS: at least 3 distinct read types:
  – Illumina/Solexa, 454: letter-space
  – AB SOLiD: color-space (di-base sequencing)
  – 2-pass SMS (Helicos): 2 reads, same location, higher error rates

• Need new algorithms
  – SOLiD: Biologists want letters, not colors
  – 2-pass: How to best handle two reads?

Page 3: Wrangling Short Read Data with SHRiMP

SHRiMP Overview

Isolate similarity in stages:

1. Spaced Seed Filtering

2. Vectorized Smith-Waterman

3. Full Alignment
   – Specialized for SOLiD, 2-pass, letter-space

4. Compute p-values (and other statistics)

[Slide annotation: a brace labeled "Common" groups stages in the list above]
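The first stage can be illustrated with a small sketch of spaced-seed filtering. It uses the seed mask 1111001111 quoted later in the talk; the function names, the dictionary index and the toy sequences are assumptions made for illustration, not SHRiMP's actual code.

```python
from collections import defaultdict

def spaced_keys(seq, mask):
    """Yield (offset, key) for each window of len(mask); the key keeps only bases under '1'."""
    care = [i for i, m in enumerate(mask) if m == "1"]
    for off in range(len(seq) - len(mask) + 1):
        window = seq[off:off + len(mask)]
        yield off, "".join(window[i] for i in care)

def build_read_index(reads, mask):
    """Map each spaced key to the (read_id, read_offset) pairs that produce it."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for off, key in spaced_keys(read, mask):
            index[key].append((read_id, off))
    return index

def candidate_hits(genome, index, mask):
    """Scan the genome once and report (read_id, genome_offset, read_offset) seed matches."""
    hits = []
    for g_off, key in spaced_keys(genome, mask):
        for read_id, r_off in index.get(key, []):
            hits.append((read_id, g_off, r_off))
    return hits

mask = "1111001111"                 # seed from the comparison slide
reads = ["TGAGCGTTCA"]              # hypothetical read
genome = "CCTGAGAATTCAGG"           # hypothetical genome; differs from the read only under the 0s
print(candidate_hits(genome, build_read_index(reads, mask), mask))  # [(0, 2, 0)]
```

The two mismatching positions fall under the zeros of the mask, so the seed still matches; candidates found this way would then go on to the Smith-Waterman stages.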

Page 4: Wrangling Short Read Data with SHRiMP

Outline

1. AB SOLiD Reads

2. 2-pass (SMS) Reads

Page 5: Wrangling Short Read Data with SHRiMP

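AB SOLiD: Color-space Sequencing

AB SOLiD reads look like this:

T012233102

Color code for each pair of adjacent bases:

    A C G T
A   0 1 2 3
C   1 0 3 2
G   2 3 0 1
T   3 2 1 0

[Figure: the bases A, C, G, T drawn as states, with transitions between them labeled by colors 0-3]

TGAGCGTTC
T012033102
TGAATAGGA

TGAGCGTTC
|||
TGAATAGGA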
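The color table on this slide coincides with the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). A minimal sketch of encoding and decoding under that observation follows; the function names are made up for illustration.

```python
BASES = "ACGT"  # index of each base is its 2-bit code

def encode(primer, seq):
    """Turn a letter sequence into a color-space read that starts with the primer base."""
    colors = []
    prev = primer
    for base in seq:
        colors.append(str(BASES.index(prev) ^ BASES.index(base)))
        prev = base
    return primer + "".join(colors)

def decode(read):
    """Turn a color-space read (primer base + colors) back into letters."""
    prev = read[0]
    letters = []
    for color in read[1:]:
        prev = BASES[BASES.index(prev) ^ int(color)]
        letters.append(prev)
    return "".join(letters)

print(decode("T012233102"))      # TGAGCGTTC
print(encode("T", "TGAGCGTTC"))  # T012233102
```

Because every decoded letter depends on the previous one, a single wrong color changes every letter after it.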

Page 6: Wrangling Short Read Data with SHRiMP

AB SOLiD: Color space is complex!

G: TTGAGTTATGGAT
    012210331023
R:  012120331023
   TTGACTTATGGAT

SNPs
TGAGTT  12210
TGACTT  12120
TGAATT  12030
TGATTT  12300

INDELS
TGAGTTA   122103
TGA-TTA   12-303
TGAGTTTA  1221003
TGAGTATA  1221333

It’s bloody complicated!
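The G/R example can be checked mechanically: a single-letter change (SNP) alters two adjacent colors. The sketch below assumes the XOR form of the color code (A=0, C=1, G=2, T=3); the helper names are illustrative.

```python
BASES = "ACGT"

def colors_of(seq):
    """Colors between each pair of adjacent letters (no primer base)."""
    return "".join(str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:]))

def differing_positions(x, y):
    """0-based positions where two equal-length strings differ."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]

genome = "TTGAGTTATGGAT"
read = "TTGACTTATGGAT"   # same as the genome except G -> C at one position

print(colors_of(genome))                                       # 012210331023
print(colors_of(read))                                         # 012120331023
print(differing_positions(colors_of(genome), colors_of(read))) # [3, 4]
```

An isolated single-color difference, by contrast, cannot come from a SNP, which is one way to tell sequencing errors from real variation.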

Page 7: Wrangling Short Read Data with SHRiMP

AB SOLiD: Translations

• Look at: 012233102
• Recall: 012033102
• 4 translations for every color sequence

Colors:   0 1 2 0 3 3 1 0 2
         A A C T T A T G G A
         C C A G G C G T T C
         G G T C C G C A A G
         T T G A A T A C C T

[Figure: the bases A, C, G, T drawn as states, with transitions between them labeled by colors 0-3]

TGAGCGTTC
|||
TGAATAGGA

TGAGCGTTC
|||||||||
TGAGCGTTC
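The four translations on this slide can be reproduced by decoding the same colors from each possible starting base. This is a sketch that assumes the XOR form of the color code (A=0, C=1, G=2, T=3); `translations` is an illustrative name.

```python
BASES = "ACGT"

def translations(colors):
    """Return {start_base: letters} for all four ways to decode a color string."""
    result = {}
    for start in BASES:
        prev = start
        letters = [start]
        for color in colors:
            prev = BASES[BASES.index(prev) ^ int(color)]
            letters.append(prev)
        result[start] = "".join(letters)
    return result

for start, letters in translations("012033102").items():
    print(start, letters)
# A AACTTATGGA
# C CCAGGCGTTC
# G GGTCCGCAAG
# T TTGAATACCT
```

Note that the tail of the C translation (CGTTC) agrees with the error-free sequence TGAGCGTTC: after a color error, the correct letters continue in a different translation.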

Page 8: Wrangling Short Read Data with SHRiMP

AB SOLiD: Modified Smith-Waterman

• 4 S-W matrices, one per translation
• Errors transition into other matrix
• ‘Crossover’ penalty charged for errors

[Figure: Smith-Waterman matrices for Translation A and Translation C, each aligned against the genome]
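The crossover idea can be sketched in a much simpler setting than the real aligner: no gaps, and the read already placed against a genome segment of the same length. The dynamic program tracks which of the four translations is active and charges a penalty whenever it switches. The scores (10, -15, -14) and the function name are assumptions chosen for illustration, not SHRiMP's parameters.

```python
BASES = "ACGT"

def best_translation(read, genome, match=10, mismatch=-15, crossover=-14):
    """Ungapped sketch: return (score, decoded letters, 1-based color positions of crossovers)."""
    colors = [int(c) for c in read[1:]]
    assert len(colors) == len(genome)
    primer = BASES.index(read[0])
    # prefix[i] = XOR of colors[0..i]; translation s emits base (s XOR prefix[i]) at position i
    prefix, acc = [], 0
    for c in colors:
        acc ^= c
        prefix.append(acc)
    n = len(colors)
    score = [[0] * 4 for _ in range(n)]
    back = [[primer] * 4 for _ in range(n)]
    for i in range(n):
        for s in range(4):
            emit = match if BASES[s ^ prefix[i]] == genome[i] else mismatch
            if i == 0:
                # the primer fixes the initial translation; leaving it costs a crossover
                score[i][s] = emit + (0 if s == primer else crossover)
            else:
                prev = max(range(4), key=lambda p: score[i - 1][p] + (0 if p == s else crossover))
                score[i][s] = score[i - 1][prev] + (0 if prev == s else crossover) + emit
                back[i][s] = prev
    s = max(range(4), key=lambda k: score[n - 1][k])
    total = score[n - 1][s]
    states = [s]
    for i in range(n - 1, 0, -1):
        s = back[i][s]
        states.append(s)
    states.reverse()
    letters = "".join(BASES[st ^ prefix[i]] for i, st in enumerate(states))
    previous = [primer] + states[:-1]
    errors = [i + 1 for i, (p, q) in enumerate(zip(previous, states)) if p != q]
    return total, letters, errors

print(best_translation("T012033102", "TGAGCGTTC"))  # (76, 'TGAGCGTTC', [4])
```

With the erroneous read from the earlier slides, the best path pays one crossover at the fourth color and recovers the letters TGAGCGTTC. A full implementation would also handle insertions, deletions and local alignment.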

Page 9: Wrangling Short Read Data with SHRiMP

AB SOLiD: Obligatory Comparison

• SHRiMP and AB Mapper (1.6)
  – SHRiMP seed 1111001111
  – AB 35_2, 35_3 schemas

• 10,000 35mers
  – C. savignyi (173Mb), very high polymorphism

• Considering single top hits only

            SHRiMP   AB 35_2   AB 35_3
% mapped     19.83      6.67     10.94
Runtime      13m04      1h24      2h25

Page 10: Wrangling Short Read Data with SHRiMP

AB SOLiD: Resultant Alignments

• SHRiMP emits letter-space alignments
  – Clear to biologists
  – Color-space need not be scary!

G: 798  GAACCCCTTACAACTGAACCCCTTAC  823
        ||X||||||||||||||||||| |||
T:      GAaCCCCTTACAACTGAACCCC-TAC
R:   1 T1211000203110121201000-231  25

Page 11: Wrangling Short Read Data with SHRiMP

Outline

1. AB SOLiD Reads

2. 2-pass (SMS) Reads

Page 12: Wrangling Short Read Data with SHRiMP

2-pass SMS Reads

• SMS reads have high error rates
  – “Dark bases” (skipped letters)
  – Multiple passes are possible
  – Ameliorate errors over passes
    • Good chance of missing base in one read
    • Acceptable chance of getting it in at least one
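A back-of-the-envelope sketch of why a second pass helps, assuming each pass misses a given base independently with the same probability. The 7% figure is the deletion rate quoted for the synthetic reads later in the talk; the independence assumption and the function names are mine.

```python
def p_missed_in_all_passes(miss_rate, passes=2):
    """Probability that a base is skipped in every pass (independent passes assumed)."""
    return miss_rate ** passes

def p_seen_at_least_once(miss_rate, passes=2):
    """Probability that at least one pass reports the base."""
    return 1.0 - p_missed_in_all_passes(miss_rate, passes)

print(p_missed_in_all_passes(0.07))  # about 0.0049
print(p_seen_at_least_once(0.07))    # about 0.9951
```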

Page 13: Wrangling Short Read Data with SHRiMP

SMS 2-pass: SHRiMP with 2 reads

[Figure: empty alignment matrix with read CTGACT along the top and read CAGCAT down the side]

Match = +4   Mismatch = -3   Gap = -2

CTG-ACT
CAGCA-T
S=9
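The score S=9 follows directly from the column scores on the slide. A minimal sketch, with an illustrative function name and a linear gap penalty:

```python
def score_alignment(top, bottom, match=4, mismatch=-3, gap=-2):
    """Sum the column scores of a gapped pairwise alignment ('-' marks a gap)."""
    assert len(top) == len(bottom)
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

print(score_alignment("CTG-ACT", "CAGCA-T"))  # 9
```

Here four matches (+16), one mismatch (-3) and two gaps (-4) give 9.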

Page 14: Wrangling Short Read Data with SHRiMP

SMS 2-pass: SHRiMP with 2 reads

[Figure: the same alignment matrix (CTGACT along the top, CAGCAT down the side), with scores filled in for some cells]

Match = +4   Mismatch = -3   Gap = -2

CTG-ACT        CTGAC-T
CAGCA-T        CAG-CAT
S=9
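That 9 is the best achievable score for these two reads can be confirmed with a standard Needleman-Wunsch recurrence. This sketch assumes a global alignment with a linear gap penalty, which is my reading of the slide rather than something it states.

```python
def nw_score(a, b, match=4, mismatch=-3, gap=-2):
    """Optimal global alignment score of a and b, keeping only one row of the matrix."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]

print(nw_score("CTGACT", "CAGCAT"))  # 9
```

Both alignments shown on this slide reach that optimum, which is why a single "best" alignment is not enough to describe how the two reads relate.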

Page 15: Wrangling Short Read Data with SHRiMP

SMS 2-pass: SHRiMP with 2 reads

[Figure: the same scored alignment matrix, plus a graph (DAG) whose nodes are aligned column pairs, for example CC, GG, AA, TT and gapped pairs such as A- and -T]

Match = +4   Mismatch = -3   Gap = -2

CTG-ACT        CTGAC-T
CAGCA-T        CAG-CAT
S=9

C-TG-ACT       CT-GAC-T
CA-GCA-T       C-AG-CAT
S=8

Page 16: Wrangling Short Read Data with SHRiMP

SMS 2-pass: SHRiMP with 2-pass data

• Build a DAG representing the (near) optimal alignments of the two reads
• Generate seeds (short paths) from the DAG
• Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW
• Do full WSG alignment for top hits

[Figure: the DAG of aligned column pairs from the previous slide]
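One way to collect the (near) optimal alignments into a graph is a forward and a backward dynamic program: a step in the alignment matrix is kept if the best alignment passing through it scores within some slack of the optimum. The sketch below follows that textbook approach with the scoring from the earlier slides; the slack parameter, the edge representation and the function names are assumptions, and the talk does not say that SHRiMP builds its DAG this way.

```python
def near_optimal_edges(a, b, slack=1, match=4, mismatch=-3, gap=-2):
    """Return (best score, set of matrix steps lying on an alignment scoring >= best - slack)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i] == b[j] else mismatch

    # fwd[i][j]: best score aligning a[:i] with b[:j]
    fwd = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            cands = []
            if i and j:
                cands.append(fwd[i - 1][j - 1] + sub(i - 1, j - 1))
            if i:
                cands.append(fwd[i - 1][j] + gap)
            if j:
                cands.append(fwd[i][j - 1] + gap)
            if cands:
                fwd[i][j] = max(cands)
    # bwd[i][j]: best score aligning a[i:] with b[j:]
    bwd = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            cands = []
            if i < n and j < m:
                cands.append(bwd[i + 1][j + 1] + sub(i, j))
            if i < n:
                cands.append(bwd[i + 1][j] + gap)
            if j < m:
                cands.append(bwd[i][j + 1] + gap)
            if cands:
                bwd[i][j] = max(cands)
    best = fwd[n][m]
    edges = set()
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i < n and j < m:
                steps.append((i + 1, j + 1, sub(i, j)))
            if i < n:
                steps.append((i + 1, j, gap))
            if j < m:
                steps.append((i, j + 1, gap))
            for i2, j2, w in steps:
                if fwd[i][j] + w + bwd[i2][j2] >= best - slack:
                    edges.add(((i, j), (i2, j2)))
    return best, edges

def alignment_path(top, bottom):
    """Convert a gapped alignment into the list of matrix steps it takes."""
    i = j = 0
    path = []
    for x, y in zip(top, bottom):
        i2, j2 = i + int(x != "-"), j + int(y != "-")
        path.append(((i, j), (i2, j2)))
        i, j = i2, j2
    return path

best, edges = near_optimal_edges("CTGACT", "CAGCAT", slack=1)
print(best)                                                   # 9
print(set(alignment_path("C-TG-ACT", "CA-GCA-T")) <= edges)   # True
```

With a slack of 1, both S=9 alignments and both S=8 alignments from the earlier slides are paths in the resulting graph. Short paths through such a graph are what the second bullet calls seeds.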

Page 17: Wrangling Short Read Data with SHRiMP

SMS 2-pass: Results (in brief)

• 10,000 synthetic reads (~25-65 bp)
  – 7% deletion, 1% insertion, 1% substitution rate

• Mapped to human chromosome 1
  – Spaced seed span 9: 111110111

Type         Separate   Profile     WSG
No hits %        0.13      4.91    4.31
Multiple %      26.45      9.34    9.13
Uniq cor %      63.00     74.90   75.84
Runtime            9m       11m     12m
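The talk does not describe how the synthetic reads were generated. As a rough sketch, reads with the quoted error rates could be simulated as below, assuming independent per-base errors; the function names and the per-base error model are assumptions.

```python
import random

def simulate_read(template, del_rate=0.07, ins_rate=0.01, sub_rate=0.01, rng=None):
    """Copy the template, applying independent per-base insertions, deletions and substitutions."""
    rng = rng or random.Random()
    out = []
    for base in template:
        if rng.random() < ins_rate:
            out.append(rng.choice("ACGT"))      # spurious extra base before this one
        r = rng.random()
        if r < del_rate:
            continue                            # "dark base": skipped entirely
        if r < del_rate + sub_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_two_pass(template, rng=None, **rates):
    """Two independent noisy passes over the same template."""
    rng = rng or random.Random()
    return simulate_read(template, rng=rng, **rates), simulate_read(template, rng=rng, **rates)

pass1, pass2 = simulate_two_pass("CTGACTCAGCATTGACCA", rng=random.Random(0))
print(pass1)
print(pass2)
```

Passing a seeded `random.Random` makes a run reproducible; the exact reads depend on the seed.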

Page 18: Wrangling Short Read Data with SHRiMP

SHRiMP Summary

• Fast mapping of short reads to a genome
  – Handles:
    • color-space (SOLiD) reads
    • 2-pass (SMS) reads
    • insertions and deletions
  – Easy to parallelize

• Computation of p-values & other statistics for hits

Page 19: Wrangling Short Read Data with SHRiMP

Acknowledgements

• SHRiMP is brought to you by:
  – University of Toronto: Michael Brudno, Adrian Dalca, Marc Fiume, Vlad Yanovsky
  – Stanford University: Phil Lacroute, Arend Sidow

http://compbio.cs.toronto.edu/shrimp