SeqMapReduce: software and web service for accelerating sequence mapping
Yanen Li
Department of Computer Science, University of Illinois at Urbana-Champaign
Email: [email protected]
10/05/2009, CAMDA 2009, Chicago
Challenge of NGS Alignment
• Sequences: short (25 ~ 76 bp)
• Size of data set: large, and still increasing
• BLAST?
BLAST: transaction-style workload, long queries
NGS aligner: batch workload, short queries

We need an INDEX!
The NGS Aligner War
Where are you?
NGS Aligner Classification
• Standalone algorithms
  – Hash reads: Eland, RMAP, MAQ, SHRiMP …
    Pros: less RAM, less overhead. Cons: wasteful full scan of the genome
  – Hash genome: SOAP, PASS, Mosaik, BFAST …
    Pros: fast, scales up well. Cons: big RAM footprint, heavy overhead
  – Index genome (Burrows-Wheeler): Bowtie, BWA
NGS Aligner Classification
Parallel algorithm option   Things to consider
Multi-threading             Hard to scale up to many cores
Cluster computing           Load balancing, fault tolerance
Cloud computing             Restricted programming interface
Programming Model of Cloud Computing
• MapReduce
  Developer supplies two functions:
  – map(k, v) → list of (k', v') pairs
  – reduce(k', list of v') → output
  All v' with the same k' are reduced together
  A simple framework that usually scales up well
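The MapReduce contract above can be sketched in a few lines of plain Python (not Hadoop). The seed-extraction map/reduce pair below is purely illustrative, not the actual SeqMapReduce code:

```python
from collections import defaultdict

# Minimal sketch of the MapReduce contract: map(k, v) emits (k', v')
# pairs, the framework groups them by k' (the "shuffle"), and reduce
# sees all v' that share the same k'.

def map_fn(read_id, read_seq):
    """Emit (seed, read_id) for each non-overlapping 8-mer of a read."""
    for i in range(0, len(read_seq) - 7, 8):
        yield read_seq[i:i + 8], read_id

def reduce_fn(seed, read_ids):
    """Aggregate all read ids that share a seed."""
    return seed, sorted(set(read_ids))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)              # shuffle: group values by k'
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    return [reduce_fn(k2, vs) for k2, vs in sorted(groups.items())]

reads = [("r1", "ACGTACGTTTTTAAAA"), ("r2", "ACGTACGTCCCCGGGG")]
result = dict(run_mapreduce(reads, map_fn, reduce_fn))
# the shared seed "ACGTACGT" is reduced together: result["ACGTACGT"] == ["r1", "r2"]
```

In Hadoop the shuffle and the grouping happen across machines; this single-process version only illustrates the programming contract.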
Why Cloud Computing Attractive?
• Fit for Data-Intensive Computing (DIC)
• NGS alignment is DIC in nature
• Hadoop: an open-source cloud computing system
  – Built-in load balancing and fault tolerance
  – Easy to program
Cloud Based NGS Aligner
              Hash reads   Hash genome   Hash both
SeqMapReduce      *
CloudBurst                                   *

• Hashing/indexing the genome will be the next step
• SeqMapReduce: hashes all reads in RAM on every node
• CloudBurst: hashes the reads and the genome, but not in RAM
The SeqMapReduce Framework
Inside SeqMapReduce
• Pre-processing: formatting the genome
Format once, use every time
Bases at the end are duplicated
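The duplicated bases at the end can be sketched as overlapping genome chunks, so that an alignment spanning a chunk boundary is never lost. This is a minimal sketch under that assumption; `chunk_size` and the overlap of `read_len - 1` are illustrative parameters, not the tool's actual ones:

```python
def partition_genome(genome, chunk_size, read_len):
    """Split a genome string into chunks, duplicating read_len - 1 bases
    past each boundary so no alignment spanning a boundary is lost.
    (Illustrative sketch; the parameters are assumptions.)"""
    overlap = read_len - 1
    chunks = []
    pos = 0
    while pos < len(genome):
        # each chunk records its genome offset plus an overlapping tail
        chunks.append((pos, genome[pos:pos + chunk_size + overlap]))
        pos += chunk_size
    return chunks

chunks = partition_genome("ACGTACGTACGTACGT", chunk_size=8, read_len=4)
# the last 3 bases of chunk 0 are duplicated at the start of chunk 1
```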
Inside SeqMapReduce
• Map phase: seed & filter
  – Divide a read into K parts; with at most M mismatches, at least K − M parts match exactly
  – E.g. K = 4, M = 2: 4 − 2 = 2 parts match exactly; C(4, 2) = 6 combinations, so we need only 6 hash tables
  – Genome sequences are scanned for potential hits, which then go to mismatch counting
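The pigeonhole filter above can be sketched as follows: one hash table per choice of the K − M exactly-matching part positions, so a genome window is a candidate hit if any table matches. An illustrative sketch, not the actual SeqMapReduce implementation:

```python
from itertools import combinations

# K parts per read, M allowed mismatches (the slide's example values)
K, M = 4, 2

def split_parts(seq, k=K):
    """Split a sequence into k equal parts."""
    n = len(seq) // k
    return [seq[i * n:(i + 1) * n] for i in range(k)]

def build_tables(reads):
    """One hash table per choice of K - M exactly-matching part
    positions: C(4, 2) = 6 tables for K = 4, M = 2."""
    tables = {combo: {} for combo in combinations(range(K), K - M)}
    for rid, seq in reads.items():
        parts = split_parts(seq)
        for combo, table in tables.items():
            key = "".join(parts[i] for i in combo)
            table.setdefault(key, []).append(rid)
    return tables

def candidate_hits(window, tables):
    """A genome window passes the filter if, in some table, the parts
    at that table's positions match a read's parts exactly."""
    parts = split_parts(window)
    hits = set()
    for combo, table in tables.items():
        key = "".join(parts[i] for i in combo)
        hits.update(table.get(key, []))
    return hits

reads = {"r1": "ACGTACGTACGTACGT"}
tables = build_tables(reads)
```

A window with mismatches confined to at most M = 2 of the parts still passes the filter; one with mismatches in 3 or more parts is rejected without any per-base comparison.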
Inside SeqMapReduce
• Reduce phase: aggregating intermediate results
• Post-processing
  – Duplicate detection
  – Mismatch counting
  – Final output report
Inside SeqMapReduce
• Mismatch counting, naive way: simple counting (O(N))
• Mismatch counting using bit operations: bit-wise XOR (exclusive or)
XOR | 00  01  10  11
----+----------------
 00 | 00  01  10  11
 01 | 01  00  11  10
 10 | 10  11  00  01
 11 | 11  10  01  00
Mismatches counting
• Original: R (read) and G (genome), 2 bits per base
• W = R XOR G
• Define 2 constants: W1 = 101010…, W2 = 010101…
• X = W & W1 (keeps 10, clears 01, 11 ⇒ 10)
• Y = W & W2 (keeps 01, clears 10, 11 ⇒ 01)
• Shift: Y << 1
• N = POPCNT(X | (Y << 1))
Worked example (W is a combination of 00, 01, 10, 11):

W            00  01  10  11
W1           10  10  10  10
X = W & W1   00  00  10  10

W            00  01  10  11
W2           01  01  01  01
Y = W & W2   00  01  00  01
Y << 1       00  10  00  10

X | (Y << 1) 00  10  10  10
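The steps above, put together as a small Python sketch. The 2-bit base encoding is an assumption (any fixed 2-bit code works), and POPCNT is emulated with `bin(...).count("1")`:

```python
# 2-bit base encoding (an assumption; any fixed 2-bit code works)
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a sequence into an integer, 2 bits per base."""
    x = 0
    for b in seq:
        x = (x << 2) | ENC[b]
    return x

def count_mismatches(read, genome):
    """Count mismatching bases between two equal-length sequences
    using the XOR/mask/popcount steps from the slide."""
    n = len(read)
    W = pack(read) ^ pack(genome)        # W = R XOR G
    W1 = int("10" * n, 2)                # 101010...
    W2 = int("01" * n, 2)                # 010101...
    X = W & W1                           # keep 10, clear 01, 11 -> 10
    Y = W & W2                           # keep 01, clear 10, 11 -> 01
    return bin(X | (Y << 1)).count("1")  # POPCNT: one set bit per mismatch
```

Exactly as in the worked table: every nonzero 2-bit pair of W contributes a single set bit to X | (Y << 1), so the popcount equals the number of mismatching bases.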
Web Service of SeqMapReduce
• Input format: .zip of FASTA-format reads
• Reads can be uploaded through the web site
• Supports 13 model organisms
• Supports reads longer than 32 bp
• Up to 5 mismatches
• No indels in the current version (will be updated soon)
• Output in ELAND format
• Free of charge for academics
• Target users: small labs that want quick results but cannot afford expensive hardware and software
Results on CAMDA 2009 datasets
• Pol II ChIP-seq FC201WVA_20080307_s_5 (4.5 million)
• IFNg stimulated STAT1 ChIP-seq FC302MA_20080507_s_1 (6.2 million)
• Illinois Cloud Computing Testbed (CCT). Each node: 64-bit 2.6 GHz CPUs, 16 GB RAM, and 2 TB storage
• 2 mismatches are allowed
• Accuracy: 95% of results are the same as MAQ's
Speed Up
Run time vs. no. of cores, Pol II data set
Run time vs. no. of cores, STAT1 data set
Speed-up is quasi-linear in the number of cores
[Figures: run time (seconds) vs. 1, 2, 4, 8, 16, 32 cores for each data set, plotted with and without overhead. Average overhead time: 67.22 s and 86.09 s for the two data sets.]
Scale Up

         Size         Size ratio   Run time      Run-time ratio
Pol II   4.5 million  1.00         354 seconds   1.00
STAT1    6.2 million  1.38         364 seconds   1.03
• RAM requirement: ~50 MB per million reads
• Can scale up to tens of millions of reads with several GB of RAM
Comparison to CloudBurst
Why is CloudBurst slow?
• It hashes both the reads and the genome, using the Hadoop system hash function
• No filtering in the map phase: heavy I/O into the reduce phase
Results on Amazon EC2
• Speed-up similar to that on the UIUC Hadoop cluster, but slower overall
• Large Standard Instances were chosen
• Cost: $99.01
Future Plans
• Apply to bisulfite reads for genome-wide methylation analysis
• Web-based visualization of short-read alignments
Acknowledgements
• UIUC Cloud Test Bed • Michael Schatz • CAMDA Organizers
This work is supported by NSF DBI 08-45823 (SZ)
Thank you!