Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA

Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller-Wittig

Presenter: Erkan Okuyan

Motivation

• Massive amounts of sequencing data (Illumina, 454, SOLiD): short reads with a high error rate

• Assembly is sensitive to errors in the reads, so sequencing errors need to be corrected

• The size of the error correction problem makes it computationally demanding

Definitions

- Let R = {r_1, r_2, …, r_k} be a set of k reads with |r_i| = L.

- Let r_i be in {A, C, G, T}^L for all 1 ≤ i ≤ k.

- Let m (multiplicity) and l (length) satisfy m > 1 and l < L.

• Definition 1 (Solid and Weak): An l-tuple (a DNA string of length l) is called solid with respect to R and m if it is a substring of at least m reads in R, and weak otherwise.

– An l-tuple replicated in at least m reads is probably a correct l-tuple

• Definition 2 (Spectrum): The spectrum of R with respect to m and l, denoted T_{m,l}(R), is the set of all solid l-tuples with respect to R and m.

– The spectrum T_{m,l}(R) is treated as the set of all correct l-tuples
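A minimal sketch of Definitions 1 and 2 in Python, assuming the reads fit in memory as plain strings. The function name `spectrum` and the toy reads are illustrative only. Each l-tuple is counted once per read, so "substring of at least m reads" is taken literally.

```python
from collections import Counter

def spectrum(reads, l, m):
    """Return the set of solid l-tuples: those occurring in at least m reads."""
    reads_containing = Counter()
    for read in reads:
        # A set ensures an l-tuple repeated inside one read is counted once.
        tuples_in_read = {read[i:i + l] for i in range(len(read) - l + 1)}
        reads_containing.update(tuples_in_read)
    return {t for t, count in reads_containing.items() if count >= m}

# Toy input (hypothetical): three reads of length L = 4, with l = 3 and m = 2.
toy_reads = ["ACGT", "ACGA", "TCGT"]
solid = spectrum(toy_reads, l=3, m=2)
```

Here "ACG" and "CGT" each appear in two reads and are solid; "CGA" and "TCG" appear in one read each and are weak.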


• Definition 3 (T-string): A DNA string s is called a T_{m,l}(R)-string if every l-tuple in s is an element of T_{m,l}(R).

• Definition 4 (Spectral Alignment Problem, SAP): Given a DNA string s and a spectrum T_{m,l}(R), find a T_{m,l}(R)-string s* that minimizes the distance function d(s, s*).
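A small sketch of Definitions 3 and 4, under two assumptions that the slides do not state: the distance d is the Hamming distance, and the search is limited to strings at most one substitution away from s. The names `is_t_string` and `nearest_t_string` are hypothetical.

```python
def is_t_string(s, spec, l):
    """True if every l-tuple of s belongs to the spectrum `spec`."""
    return all(s[i:i + l] in spec for i in range(len(s) - l + 1))

def nearest_t_string(s, spec, l, alphabet="ACGT"):
    """Brute-force search restricted to Hamming distance <= 1 (an assumption).

    Returns s itself if it is already a T-string, otherwise the first
    single-substitution variant that is a T-string, or None if there is none.
    """
    if is_t_string(s, spec, l):
        return s
    for pos in range(len(s)):
        for base in alphabet:
            if base == s[pos]:
                continue
            candidate = s[:pos] + base + s[pos + 1:]
            if is_t_string(candidate, spec, l):
                return candidate
    return None

# Hypothetical spectrum for l = 3.
toy_spectrum = {"ACG", "CGT"}
corrected = nearest_t_string("ACCT", toy_spectrum, l=3)
```

With this toy spectrum, "ACCT" is one substitution away from the T-string "ACGT".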

CUDA (Compute Unified Device Architecture)

Serial Code (host)

Parallel Kernel (device)

KernelA<<< nBlk,nTid >>>(args);

Serial Code (host)

Parallel Kernel (device)

KernelB<<< nBlk,nTid >>>(args);

• Integrated host+device application program

– Serial or modestly parallel parts in host C code

– Highly parallel parts in device SPMD kernel C code

CUDA Execution

• A GPU device

– Is a coprocessor to the CPU or host

– Has its own DRAM (device memory)

– Runs many threads in parallel

• Data-parallel portions of an application are expressed as device kernels which run on many threads

• Differences between GPU and CPU threads

– GPU threads are extremely lightweight

– Very little creation overhead

– GPU needs 1000s of threads for full efficiency

Parallel Error Correction with CUDA

• Each kernel thread is responsible for correcting a single read r_i.

• Voting-based algorithm

– First step: calculation of the voting matrix

– Second step: single-mutation fixing, trimming, or discarding
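To illustrate the one-thread-per-read mapping, the index arithmetic a kernel launch would need can be modelled on the host in Python. The block size of 256 and the names `launch_config` and `read_index` are assumptions for illustration; the slides do not give the launch parameters used.

```python
def launch_config(num_reads, threads_per_block=256):
    """Number of blocks needed so that every read gets its own thread."""
    num_blocks = (num_reads + threads_per_block - 1) // threads_per_block
    return num_blocks, threads_per_block

def read_index(block_idx, thread_idx, threads_per_block, num_reads):
    """Global thread id, as in blockIdx.x * blockDim.x + threadIdx.x.

    Threads in the last block that fall beyond the final read get None
    and would simply return without doing any work.
    """
    i = block_idx * threads_per_block + thread_idx
    return i if i < num_reads else None

blocks, threads = launch_config(1000)
```

For 1000 reads this yields 4 blocks of 256 threads, with the last 24 threads idle.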

Step1: Voting Matrix Calculation

Step2: Fixing/Trimming/Discarding Reads
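A minimal sketch of how the two steps could look for one read, written as host-side Python rather than a CUDA kernel. The voting rule used here (every weak l-tuple votes for each single-base substitution that would make it solid) and the names `voting_matrix` and `fix_single_mutation` are assumptions for illustration, not the authors' kernel. Trimming and discarding are not shown, and a plain set stands in for the spectrum.

```python
BASES = "ACGT"

def voting_matrix(read, spec, l):
    """votes[pos][b] counts weak l-tuples that become solid when
    read[pos] is replaced by BASES[b]."""
    votes = [[0] * len(BASES) for _ in read]
    for start in range(len(read) - l + 1):
        tup = read[start:start + l]
        if tup in spec:
            continue  # solid tuples cast no votes
        for offset in range(l):
            for b, base in enumerate(BASES):
                if base == tup[offset]:
                    continue
                mutated = tup[:offset] + base + tup[offset + 1:]
                if mutated in spec:
                    votes[start + offset][b] += 1
    return votes

def fix_single_mutation(read, spec, l):
    """Apply the substitution with the most votes, if any vote was cast."""
    votes = voting_matrix(read, spec, l)
    best_votes, pos, b = max(
        ((v, p, b) for p, row in enumerate(votes) for b, v in enumerate(row)),
        default=(0, 0, 0),
    )
    if best_votes == 0:
        return read
    return read[:pos] + BASES[b] + read[pos + 1:]

# Hypothetical spectrum (l = 3) and a read with one error at position 2.
toy_spectrum = {"ACG", "CGT", "GTA"}
fixed = fix_single_mutation("ACCTA", toy_spectrum, l=3)
```

In this toy case all three weak 3-tuples of "ACCTA" vote for G at position 2, and the read is corrected to "ACGTA".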

Fast Membership Tests

• The first algorithm (kernel) dominates the runtime

– (L − l) · (l + 3 · p · l) membership tests are required, where p is the number of l-tuples that do not belong to the spectrum

– A space-efficient Bloom filter speeds up membership tests against the spectrum

• Compute the Bloom filter on the CPU and store it in texture memory (a fast read-only cache) on the device
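To get a feel for the size of the membership-test count, the expression can be evaluated directly, taking it exactly as written on the slide. The values L = 35, l = 20 and p = 2 below are hypothetical and not taken from the evaluation.

```python
def membership_tests(L, l, p):
    """Evaluate (L - l) * (l + 3 * p * l) as given on the slide."""
    return (L - l) * (l + 3 * p * l)

# Hypothetical parameters: read length 35, tuple length 20.
clean_read_tests = membership_tests(L=35, l=20, p=0)   # no weak l-tuples
noisy_read_tests = membership_tests(L=35, l=20, p=2)   # two weak l-tuples
```

With these values the count grows from 15 · 20 = 300 to 15 · 140 = 2100, which shows how quickly weak l-tuples inflate the number of lookups and why a fast membership test matters.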

Bloom Filter

• Probabilistic data structure

– No false negatives

– Small percentage of false positives

– Space efficient and fast

• Uses a bit array B of length m and d hash functions h_1, …, h_d

– To insert x, we set B[h_i(x)] = 1 for i = 1, …, d

– To query y, we check whether B[h_i(y)] = 1 for all i = 1, …, d
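The insert and query rules above can be sketched in a few lines of Python. The d hash functions are derived here from SHA-256 with a per-function prefix, which is an assumption made for the example; the slides do not say which hash functions were used.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, d):
        self.m = m              # number of bits in the array B
        self.d = d              # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # h_i(item): hash the item together with the index i (assumed scheme).
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, d=3)
bf.insert("ACGTACGT")
```

A query for an inserted item always returns true, since its d bits were set at insertion. A query for any other item can still return true if all of its bits happen to be set by other insertions.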

Bloom Filter Example

• a and b are inserted into a Bloom filter with m = 10 bits, n = 2 inserted elements, and d = 3 hash functions

• A query for c returns false, since at least one of its bits is 0.

• A query for d returns true, since all of its bits are 1 (a false positive).
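The example can be replayed with concrete numbers. The bit positions assigned to a, b, c and d below are made up for illustration and are not the ones shown on the original slide.

```python
# Hypothetical outputs of the d = 3 hash functions for each element.
HASH_POSITIONS = {
    "a": (1, 4, 7),
    "b": (2, 4, 9),
    "c": (1, 5, 9),   # position 5 is never set
    "d": (2, 7, 9),   # every position is set by a or b
}

bits = [0] * 10  # bit array of length 10

for item in ("a", "b"):          # only a and b are inserted
    for pos in HASH_POSITIONS[item]:
        bits[pos] = 1

def query(item):
    return all(bits[pos] for pos in HASH_POSITIONS[item])
```

After the two insertions, bits 1, 2, 4, 7 and 9 are set. The query for c fails on bit 5, while the query for d succeeds even though d was never inserted.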

Overall Algorithm

1) Pre-computation on the CPU: Program the Bloom filter (counting Bloom filter) bit vector by hashing each l-tuple present in the read set R.

2) Data transfer from CPU to GPU: Allocate device memory and transfer the Bloom filter and the reads.

3) Execute CUDA kernel.

4) Data transfer from GPU to CPU: Transfer the set of corrected/trimmed reads.
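Step 1 can be sketched on the host as follows. The way counters are turned into a bit vector here (a bit is set when its counter reaches the multiplicity threshold) is an assumption about how a counting Bloom filter could encode the spectrum, and the SHA-256-based hashing and all function names are hypothetical.

```python
import hashlib

def positions(item, size, d):
    """Positions of `item` under d hash functions (assumed hashing scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % size
        for i in range(d)
    ]

def build_counting_filter(reads, l, size, d):
    """Increment d counters for each distinct l-tuple of each read."""
    counters = [0] * size
    for read in reads:
        for tup in {read[i:i + l] for i in range(len(read) - l + 1)}:
            for pos in positions(tup, size, d):
                counters[pos] += 1
    return counters

def to_bit_vector(counters, multiplicity):
    """Keep a 1 only where the counter reached the multiplicity threshold."""
    return [1 if c >= multiplicity else 0 for c in counters]

def is_solid(tup, bits, d):
    return all(bits[pos] for pos in positions(tup, len(bits), d))

toy_reads = ["ACGT", "ACGA", "TCGT"]
counters = build_counting_filter(toy_reads, l=3, size=1024, d=3)
bits = to_bit_vector(counters, multiplicity=2)
```

Every counter touched by an l-tuple is at least that tuple's read count, so a truly solid l-tuple always passes `is_solid`; weak l-tuples can occasionally pass as false positives. Only the resulting bit vector would need to be transferred to the device in step 2.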

Performance Evaluation

• System Parameters

– NVIDIA GeForce GTX 280 with 1 GB memory

– AMD Opteron dual-core 2.2 GHz CPU with 2 GB memory

• Datasets

– Artificial sets (1%, 2%, 3% error rates)

• Yeast chromosomes (S.cer5, S.cer7)

• Bacterial genomes (H.inf, E.col)

– Real set

• Staphylococcus aureus strain MW2 (H.Aci) (error rate ~1%)

Performance Evaluation

Discussion/Conclusion (GOOD)

• Runtime speedups of 10 to 19 times are reported.

• Bigger datasets are not an issue as long as the Bloom filter fits in texture memory: the reads can be processed in more than one round of loading and correcting.

• Further parallelization on distributed-memory GPU farms is possible.
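The multi-round idea for large datasets can be sketched as a simple batching loop. `correct_batch` stands in for the transfer-and-kernel step and is a hypothetical callback; the batch size would in practice be chosen to fit device memory.

```python
def batches(reads, batch_size):
    """Yield consecutive slices of at most batch_size reads."""
    for start in range(0, len(reads), batch_size):
        yield reads[start:start + batch_size]

def correct_in_rounds(reads, batch_size, correct_batch):
    """Load and correct one batch per round; the filter stays resident."""
    corrected = []
    for batch in batches(reads, batch_size):
        corrected.extend(correct_batch(batch))
    return corrected

# Stand-in for the GPU step: a no-op "correction" that returns reads unchanged.
result = correct_in_rounds(["ACGT", "CGTA", "GTAC", "TACG", "ACGA"], 2, lambda b: list(b))
```

Because each read is corrected independently against the same filter, the order and size of the rounds do not affect the result.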

Discussion/Conclusion (BAD)

• Does not exploit the fast shared memory within thread blocks. Each read r_i does not really have to be handled by a single thread, and the voting matrix can be constructed in parallel, so further speedup is possible.

• The predetermined read length L is somewhat restrictive.

Thank You
