SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign Email: [email protected] 10/05/2009, CAMDA 2009, Chicago


Page 1

SeqMapReduce: software and web service for accelerating sequence mapping

Yanen Li
Department of Computer Science, University of Illinois at Urbana-Champaign
Email: [email protected]

10/05/2009, CAMDA 2009, Chicago

Page 2

Challenge of NGS Alignment

• Sequences: short (25 ~ 76 bp)
• Size of data set: large, still increasing
• BLAST?

Transaction / long query: BLAST
Batch / short query: NGS aligner

We need an INDEX!

Page 3

The NGS Aligner War

Where are you?

Page 4

NGS Aligner Classification

• Standalone Algorithms

Hash Reads: Eland, RMAP, MAQ, SHRiMP …
Pros: less RAM, less overhead
Cons: wasteful genome scan

Hash Genome: SOAP, PASS, Mosaik, BFAST …
Pros: fast, scale up well
Cons: big RAM, heavy overhead

Index Genome (Burrows-Wheeler): Bowtie, BWA

Page 5

NGS Aligner Classification

Parallel algorithm options and things to consider:

Multi-thread: hard to scale up to many cores
Cluster Computing: load balancing, fault tolerance
Cloud Computing: restricted programming interface

Page 6

Programming Model of Cloud Computing

• MapReduce
Developer supplies two functions

– All v’ with the same k’ are reduced together

Simple framework usually can scale up well
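The programming model above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: the names `map_reduce`, `count_bases_mapper`, and `sum_reducer` are hypothetical, and base counting is an arbitrary example task. It shows the one rule stated on the slide: all values v' emitted with the same key k' reach the same reduce call.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on every (k, v) record, group the emitted (k', v')
    pairs by k', then call reducer once per distinct k'."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, values) for k, values in groups.items()}


def count_bases_mapper(read_id, sequence):
    """Map: emit (base, 1) for every base in one read."""
    for base in sequence:
        yield base, 1


def sum_reducer(base, counts):
    """Reduce: add up all counts emitted for one base."""
    return sum(counts)


reads = [("r1", "ACGT"), ("r2", "AACC")]
base_counts = map_reduce(reads, count_bases_mapper, sum_reducer)
```

In a real cluster the grouping step is the shuffle performed by the framework, which is why the developer only writes the two functions.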

Page 7

Why Is Cloud Computing Attractive?

• Fit for Data Intensive Computing (DIC)
• NGS alignment is DIC in nature

• Hadoop – open-source Cloud Computing system
Built-in load balancing and fault tolerance
Easy to program

Page 8

Cloud Based NGS Aligner

Hash Reads: SeqMapReduce *
Hash Genome: (none yet)
Hash Both: CloudBurst *

Hash/index Genome will be the next

SeqMapReduce: hashes all reads in RAM on every node
CloudBurst: hashes reads and the genome, but not in RAM

Page 9

The SeqMapReduce Framework

Page 10

Inside SeqMapReduce

• Pre-processing: formatting the genome

Format once, use every time

Bases at the end are duplicated
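One plausible reading of "bases at the end are duplicated" is that the formatted genome is cut into chunks and each chunk carries a copy of the first bases of the next one, so that a read spanning a chunk boundary is still found. The slide does not spell this out, so the sketch below is an assumption throughout: the function name `chunk_genome`, the fixed `chunk_size`, and the choice of `read_len - 1` duplicated bases are all hypothetical.

```python
def chunk_genome(genome, chunk_size, read_len):
    """Split genome into chunks of chunk_size bases, each extended by
    read_len - 1 duplicated bases taken from the start of the next chunk.

    Returns a list of (start_offset, chunk_sequence) pairs.
    Assumed scheme for illustration; not taken from the SeqMapReduce code.
    """
    overlap = read_len - 1
    chunks = []
    for start in range(0, len(genome), chunk_size):
        chunks.append((start, genome[start:start + chunk_size + overlap]))
    return chunks


chunks = chunk_genome("ACGTACGTAC", chunk_size=4, read_len=3)
```

With this overlap, every window of `read_len` bases in the genome lies entirely inside at least one chunk, so chunks can be scanned independently.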

Page 11

Inside SeqMapReduce

• Map phase: Seed & Filtering
Divide a read into K parts. If there are M mismatches, at least (K-M) parts are exactly matched.
e.g. K=4, M=2: 4-2=2 parts exactly matched, $\binom{4}{2} = 6$ combinations
We need only 6 hash tables

Genome seqs are scanned for potential hits
Then go to mismatches counting
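The seed-and-filter idea can be sketched as follows: with K parts and at most M mismatches, at least K-M parts must match exactly, so one hash table is built per choice of K-M parts. This is a minimal in-memory illustration, assuming all reads have the same length; the function names and data layout are hypothetical and are not taken from the SeqMapReduce source.

```python
from collections import defaultdict
from itertools import combinations


def part_spans(read_len, k):
    """Return (start, end) offsets of k contiguous, near-equal parts."""
    bounds = [i * read_len // k for i in range(k + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(k)]


def build_tables(reads, k, m):
    """Build one hash table per combination of (k - m) parts.
    Each table maps the chosen parts of a read to the read ids."""
    spans = part_spans(len(reads[0]), k)
    tables = {}
    for combo in combinations(range(k), k - m):
        table = defaultdict(list)
        for read_id, read in enumerate(reads):
            key = tuple(read[spans[i][0]:spans[i][1]] for i in combo)
            table[key].append(read_id)
        tables[combo] = table
    return spans, tables


def candidate_hits(genome, reads, k, m):
    """Scan the genome once and return (read_id, position) candidates.
    Candidates still need exact mismatch counting afterwards."""
    read_len = len(reads[0])
    spans, tables = build_tables(reads, k, m)
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for combo, table in tables.items():
            key = tuple(window[spans[i][0]:spans[i][1]] for i in combo)
            for read_id in table.get(key, ()):
                hits.add((read_id, pos))
    return hits


genome = "ACGTACGTTTGACCAGTACG"
reads = ["AGTTAGAC"]  # genome[5:13] = "CGTTTGAC" with 2 substitutions
spans, tables = build_tables(reads, k=4, m=2)
hits = candidate_hits(genome, reads, k=4, m=2)
```

For K=4 and M=2 the sketch builds exactly 6 tables, matching the count on the slide, and the filter only passes windows that share at least two exact parts with some read.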

Page 12

Inside SeqMapReduce

• Reduce Phase
Aggregating intermediate results

• Post Processing
Duplication detection
Mismatches counting
Final output report

Page 13

Inside SeqMapReduce

• Mismatches counting
Naive way: simple counting (O(N))
• Mismatches counting using bit operations
Bit-wise XOR (Exclusive or)

XOR   00   01   10   11
00    00   01   10   11
01    01   00   11   10
10    10   11   00   01
11    11   10   01   00

Page 14

Mismatches counting

• Original R (read), and G (genome)
• W = R XOR G
• Define 2 constants
W1 = 10101010…
W2 = 01010101…
X = W & W1 (keep 10, clear 01, 11 => 10)
Y = W & W2 (keep 01, clear 10, 11 => 01)
Then Y << 1
N = POPCNT(X | Y)

W is combinations of 00 01 10 11

X = W & W1
W          00  01  10  11
W1         10  10  10  10
X=W & W1   00  00  10  10

Y = W & W2
W          00  01  10  11
W2         01  01  01  01
Y=W & W2   00  01  00  01
Y << 1     00  10  00  10

X | Y
W          00  01  10  11
X | Y      00  10  10  10
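The bit trick above can be checked with a short Python sketch. The 2-bit encoding (A=00, C=01, G=10, T=11) is an assumption, since the slides do not state which codes are used; the function names are hypothetical, and `bin(...).count("1")` stands in for a hardware POPCNT instruction.

```python
# Assumed 2-bit encoding; the slides do not specify the actual codes.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(read, genome_window):
    """Count mismatching bases between two equal-length sequences
    using XOR, two masks, one shift, and a population count."""
    if len(read) != len(genome_window):
        raise ValueError("sequences must have the same length")
    n = len(read)
    if n == 0:
        return 0
    w1 = int("10" * n, 2)  # W1 = 1010...: high bit of every pair
    w2 = int("01" * n, 2)  # W2 = 0101...: low bit of every pair
    w = encode(read) ^ encode(genome_window)  # nonzero pair = mismatch
    x = w & w1            # keeps 10, clears 01, turns 11 into 10
    y = (w & w2) << 1     # keeps 01, clears 10, then moves it to 10
    return bin(x | y).count("1")  # one set bit per mismatching base
```

After the OR, every mismatching pair collapses to exactly `10`, so the population count equals the number of mismatches regardless of which bases differ.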

Page 15

Web Service of SeqMapReduce

Page 16

Web Service of SeqMapReduce

• Input format: .zip of FASTA format reads
• Reads can be uploaded through the web site
• Supports 13 model organisms
• Supports reads longer than 32 bp
• Up to 5 mismatches
• No indels in current version (will update soon)
• Output in ELAND format
• Free of charge for academics
• Users: small labs that want quick results but cannot afford expensive hardware and software

Page 17

Results on CAMDA 2009 datasets

• Pol II ChIP-seq FC201WVA_20080307_s_5 (4.5 million)

• IFNg stimulated STAT1 ChIP-seq FC302MA_20080507_s_1 (6.2 million)

• Illinois Cloud Computing Testbed (CCT). Each node: 64 bit 2.6 GHz CPUs, 16 GB RAM, and 2 TB storage.

• 2 mismatches are allowed.
• Accuracy: 95% of results are the same as MAQ.

Page 18

Speed Up

[Figure: two panels, "Run time vs. No. of cores, Pol II data set" and "Run time vs. No. of cores, STAT1 data set". x-axis: 1, 2, 4, 8, 16, 32 cores. Series in each panel: with overhead, without overhead. Ave overhead time for the two panels: 67.22 s and 86.09 s.]

Speed up is quasi-linear to the No. of cores

Page 19

Scale Up

Data set   Size          Size Ratio   Run Time      Run Time Ratio
STAT1      6.2 million   1.38         364 seconds   1.03
Pol II     4.5 million                354 seconds

RAM requirement: ~50 MB per million reads
Can scale up to tens of millions of reads with several GB of RAM
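The two ratios appear to be STAT1 relative to Pol II; that reading is an inference from the numbers on the slide, which it reproduces:

```latex
\text{size ratio} = \frac{6.2 \text{ million}}{4.5 \text{ million}} \approx 1.38,
\qquad
\text{run time ratio} = \frac{364 \text{ s}}{354 \text{ s}} \approx 1.03
```

In other words, about 38% more reads cost only about 3% more run time in this comparison.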

Page 20

Comparison to CloudBurst

Why is CloudBurst slow?
It hashes both reads and genome, using the Hadoop system hash function
No filtering in the Map phase: heavy I/O to the Reduce phase

Page 21

Results on Amazon EC2

• Speed up similar to that on the UIUC Hadoop Cluster, but slower overall
• Large Standard Instances are chosen
• Cost: $99.01

Page 22

Future Plans

• Apply to bisulfite reads for genome-wide methylation analysis

• Web-based visualization of short-read alignments

Page 23

Acknowledgements

• UIUC Cloud Test Bed
• Michael Schatz
• CAMDA Organizers

This work is supported by NSF DBI 08-45823 (SZ)

Thank you!