Leveraging Succinct Data Structures for DNA Sequence Mapping on FPGA
27th IEEE Reconfigurable Architecture Workshop, May 18th 2020
POLITECNICO DI MILANO
Guido Walter Di Donato [email protected] Zeni
Lorenzo Di Tucci, Marco D. Santambrogio
[email protected], [email protected], [email protected]
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) - Politecnico di Milano
-
DNA Analysis for Precision Medicine
DNA Sample → Sequencing → Genome Assembly → Genome Analysis → Drug Creation
Computational Bottleneck
-
Sequence Mapping
[Figure labels: Sequenced reads, Sequence Mapping]
Sequence mapping is a computationally intensive step involved in genome assembly and
genomic analysis pipelines, leading to long execution times and high computational costs.[1]
[1] Beretta, S.: Algorithms for strings and sequences: Pairwise alignment. Encyclopaedia of Bioinformatics and Computational Biology. Oxford: Academic Press, 2019.
Sequence mapping is the computational bottleneck of
many genome assembly and genome analysis pipelines:
➡ Long execution time
➡ High computational cost
-
Burrows-Wheeler Transform
A reversible permutation of a string that can be searched directly and tends to contain long runs
of repeated characters (great for compression).
Input: TACGACGTCGACT$
Rotations:
1 TACGACGTCGACT$
2 ACGACGTCGACT$T
3 CGACGTCGACT$TA
… …
13 T$TACGACGTCGAC
14 $TACGACGTCGACT
Rows sorted in lexicographic order:
14 $TACGACGTCGACT
2 ACGACGTCGACT$T
5 ACGTCGACT$TACG
… …
1 TACGACGTCGACT$
8 TCGACT$TACGACG
BWT output (last column): TTGGATAACCCC$G
Suffix array: 14-2-5-11-3-9-6-12-4-10-7-13-1-8
Burrows, M. and Wheeler, D. J.: A block-sorting lossless data compression algorithm. 1994.
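The construction on this slide can be reproduced with a few lines of Python. This is a minimal sketch for small inputs, not the construction used in the presented work: it sorts all rotations explicitly, which is quadratic in memory, and the function names `bwt_by_rotations` and `inverse_bwt` are hypothetical. The second function illustrates the "reversible" claim by rebuilding the text from the BWT alone.

```python
def bwt_by_rotations(text):
    """Return (bwt, suffix_array) for a text that ends with a unique '$'."""
    n = len(text)
    # Sort rotation start positions by the rotation they produce.
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The BWT is the last column: the character just before each rotation start.
    bwt = "".join(text[i - 1] for i in order)
    suffix_array = [i + 1 for i in order]  # 1-based, as on the slide
    return bwt, suffix_array


def inverse_bwt(bwt):
    """Rebuild the original text from its BWT using the LF mapping."""
    n = len(bwt)
    # A stable sort of BWT positions by character yields the first column;
    # lf[i] is the sorted row whose first character is the occurrence bwt[i].
    order = sorted(range(n), key=lambda i: bwt[i])
    lf = [0] * n
    for row, i in enumerate(order):
        lf[i] = row
    row, out = 0, []  # row 0 is the rotation that starts with '$'
    for _ in range(n - 1):
        out.append(bwt[row])  # characters come out last to first
        row = lf[row]
    return "".join(reversed(out)) + "$"


if __name__ == "__main__":
    bwt, sa = bwt_by_rotations("TACGACGTCGACT$")
    print(bwt)
    print("-".join(map(str, sa)))
    print(inverse_bwt(bwt))
```

Run on the slide's input, the sketch prints the same BWT output and suffix array as shown above, followed by the original string.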
-
BWaveR Data Structure
A flexible combination of succinct data structures, namely Wavelet Trees and RRR
Sequences, able to store long sequences in an amount of space close to the information-theoretic
lower bound, while still supporting efficient rank queries.
S., Beller, T., Moffat, A., and Petri, M.: From theory to practice: Plug and play with succinct data structures, 2014
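To make the rank query concrete, here is a minimal wavelet tree sketch. It is an illustration under assumptions, not the BWaveR implementation: the class name `WaveletTree` is hypothetical, each node keeps its bitvector as a plain Python list, and rank on a bitvector is a linear scan, whereas the structure described here would store those bitvectors as RRR sequences.

```python
class WaveletTree:
    """Balanced wavelet tree over a small alphabet (illustrative only)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else list(alphabet)
        self.size = len(seq)
        self.bits, self.left, self.right = [], None, None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            low = set(self.alphabet[:mid])
            # Bit 0 sends a symbol to the left child, bit 1 to the right child.
            self.bits = [0 if s in low else 1 for s in seq]
            self.left = WaveletTree([s for s in seq if s in low],
                                    self.alphabet[:mid])
            self.right = WaveletTree([s for s in seq if s not in low],
                                     self.alphabet[mid:])

    def rank(self, sym, i):
        """Number of occurrences of sym among the first i symbols."""
        if sym not in self.alphabet:
            return 0
        i = max(0, min(i, self.size))
        node = self
        while node.left is not None:
            mid = len(node.alphabet) // 2
            ones = sum(node.bits[:i])  # binary rank; RRR would answer this
            if sym in node.alphabet[mid:]:
                i, node = ones, node.right
            else:
                i, node = i - ones, node.left
        return i


if __name__ == "__main__":
    wt = WaveletTree("TTGGATAACCCCG")  # the example BWT without '$'
    print(wt.rank("C", 12), wt.rank("A", 8))
```

For a four-letter DNA alphabet the tree has two levels, so each rank query costs two binary rank operations regardless of the sequence length.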
RRR Sequences
Parameters: block size & superblock factor
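The role of the two parameters can be sketched as follows. In an RRR sequence, the bitvector is cut into blocks of `block_size` bits, each stored as a class (its number of ones) and an offset (its index among all blocks of that class), while a cumulative rank is kept once every `superblock_factor` blocks. The class below is a hedged illustration with hypothetical names; the memory layout of the actual implementation may differ, and Python lists are used here instead of packed bits.

```python
from math import comb


class RRRSketch:
    """Class/offset block encoding with sampled superblock ranks."""

    def __init__(self, bits, block_size=15, superblock_factor=50):
        self.b, self.sf, self.n = block_size, superblock_factor, len(bits)
        padded = list(bits) + [0] * (-len(bits) % self.b)
        self.classes, self.offsets, self.super_ranks = [], [], []
        running = 0
        for k in range(0, len(padded), self.b):
            if (k // self.b) % self.sf == 0:
                self.super_ranks.append(running)  # ones before this superblock
            block = padded[k:k + self.b]
            self.classes.append(sum(block))
            self.offsets.append(self._encode(block))
            running += sum(block)

    def _encode(self, block):
        """Index of block among all b-bit blocks with the same popcount."""
        offset, ones_left = 0, sum(block)
        for pos, bit in enumerate(block):
            if bit:
                # Skip every block that has a 0 at this position.
                offset += comb(self.b - pos - 1, ones_left)
                ones_left -= 1
        return offset

    def _decode(self, cls, offset):
        block, ones_left = [], cls
        for pos in range(self.b):
            zero_first = comb(self.b - pos - 1, ones_left)
            if ones_left and offset >= zero_first:
                block.append(1)
                offset -= zero_first
                ones_left -= 1
            else:
                block.append(0)
        return block

    def rank1(self, i):
        """Number of ones among the first i bits."""
        i = max(0, min(i, self.n))
        if i == 0:
            return 0
        blk = i // self.b
        sb = min(blk // self.sf, len(self.super_ranks) - 1)
        total = self.super_ranks[sb] + sum(self.classes[sb * self.sf:blk])
        rem = i % self.b
        if rem:  # decode only the block that contains position i
            total += sum(self._decode(self.classes[blk], self.offsets[blk])[:rem])
        return total
```

A larger superblock factor stores fewer sampled ranks and so saves space, at the price of summing more block classes per query; a larger block size needs fewer classes but wider offsets.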
-
Backward Search on BWaveR Data Structure
INITIALIZATION
start = C(X) + 1
end = C(X+1)

ITERATIONS
If end < pos($):
start = C(X) + rankWT(X, start-1) + 1
end = C(X) + rankWT(X, end)
If start < pos($) ≤ end:
start = C(X) + rankWT(X, start-1) + 1
end = C(X) + rankWT(X, end-1)
If start ≥ pos($):
start = C(X) + rankWT(X, start-2) + 1
end = C(X) + rankWT(X, end-1)

X     $  A  C  G  T
C(X)  0  1  4  8  11

BWT: TTGGATAACCCC$G
A pattern of length p is found in O(p) time,
independently of the reference size!
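The O(p) behaviour can be checked on the running example with a textbook backward search. This is a sketch under stated assumptions, not the slide's exact formulation: the helper names `build_index`, `occ`, and `backward_search` are hypothetical, rank is a naive count instead of a wavelet tree query, and the boundary convention around pos($) is this sketch's own (the sequence is stored without '$', and any prefix that reaches past pos($) is shortened by one), so it may split the cases differently from the slide.

```python
def build_index(text):
    """Index a text ending in '$': C table, '$'-free BWT, and pos($)."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    C, total = {}, 0
    for ch in sorted(set(text)):
        C[ch] = total            # characters strictly smaller than ch
        total += text.count(ch)
    return {"C": C, "seq": bwt.replace("$", ""),
            "pos": bwt.index("$") + 1, "n": n}  # pos is 1-based


def occ(index, x, i):
    """Occurrences of x among the first i characters of the full BWT."""
    # The stored sequence has no '$', so prefixes reaching pos($) lose one slot.
    j = i if i < index["pos"] else i - 1
    return index["seq"][:j].count(x)  # stand-in for rankWT(X, j)


def backward_search(index, pattern):
    """Return the 1-based inclusive row interval (start, end), or None."""
    start, end = 1, index["n"]
    for x in reversed(pattern):  # one constant-size step per pattern character
        if x not in index["C"]:
            return None
        start = index["C"][x] + occ(index, x, start - 1) + 1
        end = index["C"][x] + occ(index, x, end)
        if start > end:
            return None
    return start, end


if __name__ == "__main__":
    idx = build_index("TACGACGTCGACT$")
    print(idx["C"], idx["pos"])
    print(backward_search(idx, "CGAC"), backward_search(idx, "CTA"))
```

Starting from the full interval makes the first loop iteration coincide with the initialization step (start = C(X) + 1, end = C(X+1)). The number of loop iterations depends only on the pattern length; the reference size affects only the cost of each rank query, which the succinct structure keeps small.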
-
Custom Sequence Mapping Architecture
FPGA-tailored optimizations:
1. Reference structure stored in BRAM to reduce memory access time
2. Query sequence structures sized to exploit memory bursts
3. Parallel mapping of each query sequence and its reverse complement
4. Rank queries implemented using a reduction strategy
[Figure: data flow between Host, DRAM, and FPGA. Transfers: load reference structure, load query sequence, send mapping result. On-FPGA steps: compute reverse complement, mapping via backward search (initialization, loop).]
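Optimization 3 relies on the reverse complement of each read, which can be sketched in software as below. The names `reverse_complement` and `map_both_strands` are hypothetical, the `search` argument stands for any mapping routine, and the two searches run one after the other here, whereas the architecture described above performs them in parallel.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Complement every base, then reverse the read."""
    return read.translate(_COMPLEMENT)[::-1]


def map_both_strands(read, search):
    """Map a read and its reverse complement with the same search routine."""
    return {
        "forward": search(read),
        "reverse_complement": search(reverse_complement(read)),
    }


if __name__ == "__main__":
    reference = "TACGACGTCGACT"
    # A naive occurrence counter stands in for the backward search.
    count = lambda pattern: sum(
        reference.startswith(pattern, k) for k in range(len(reference))
    )
    print(reverse_complement("CGAC"))
    print(map_both_strands("CGAC", count))
```

Mapping both strands matters because a sequenced read may originate from either strand of the DNA, while the index is built for only one of them.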
-
Results - Memory Footprint
[Figure: data structure size (bytes) versus superblock factor (2, 7, 17, 32, 50, 100) for block sizes 3, 7, 11, 15. Panels: E. coli genome (y-axis 1M to 18M) and Human Chr. 21 (y-axis 10M to 140M).]
E. coli genome: memory reduction up to
63.7% w.r.t. BWT sequence
98.1% w.r.t. naive Occ matrix
Human Chr. 21: memory reduction up to
68.3% w.r.t. BWT sequence
98.4% w.r.t. naive Occ matrix
-
Results - Mapping Time (HW)
Time for mapping 100 bp reads against Human Chr. 21 (b = 15, sf = 50)
[Figure: mapping time in ms (log scale, 1 to 100000) for 1 million, 10 million, and 100 million reads. Series: BWaveR - FPGA, BWaveR - CPU, Bowtie2 - 1 thread, Bowtie2 - 8 threads, Bowtie2 - 16 threads.]
Factors annotated on the plot:
1 million reads: 14x, 7.8x, 1.4x, 0.7x
10 million reads: 62x, 42x, 7.6x, 4x
100 million reads: 70x, 51x, 9.5x, 4.9x
-
Results - Energy Consumption (HW)
Energy for mapping 100 bp reads against Human Chr. 21 (b = 15, sf = 50)
[Figure: energy consumption in J (log scale, 1 to 10000) for 1 million, 10 million, and 100 million reads. Series: BWaveR - FPGA, BWaveR - CPU, Bowtie2 - 1 thread, Bowtie2 - 8 threads, Bowtie2 - 16 threads.]
Factors annotated on the plot:
1 million reads: 74x, 42x, 7.7x, 4x
10 million reads: 337x, 225x, 41x, 21x
100 million reads: 380x, 274x, 51x, 27x
-
Conclusions
This paper presents BWaveR, a fast and memory-efficient sequence mapper
leveraging succinct data structures and HW acceleration on FPGA:
➡ Flexibility of usage
➡ Great compression capabilities
➡ Time independence from reference size
➡ Low energy consumption
➡ Low execution time
For questions regarding this work, email:
Guido Walter Di Donato: [email protected]