![Page 1: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/1.jpg)
Wide-SIMD Parallelization of Streaming Dataflow, with
Applications to Bioinformatics
Jeremy Buhler For CSE 591
NSF Awards CNS-1500173, CNS-1763503
![Page 2: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/2.jpg)
Take-Home Message
• Biological sequence analysis is a source of high-impact computational problems
• Using SIMD parallel computing for these problems requires dealing with irregularity
• MERCATOR is an ongoing research effort to make irregular application development on SIMD platforms easier.
![Page 3: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/3.jpg)
Who Am I?
• I study how to accelerate high-impact bioinformatics problems.
• One way to do this is via parallelization on modern architectures (FPGAs, GPUs, …)
• Along the way, many interesting CS questions…
– Streaming computation [FCCM’07, JVSP’07, M&M’09]
– Systolic array design [FPL’09, FCCM’10, ASAP’10]
– Deadlock avoidance [SPAA’10, PPoPP’12, DFM’13, JPDC’17]
– SIMD mapping [ISPDC’14, DFM’15, HPCS’17]
![Page 4: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/4.jpg)
Talk Overview
• Problems: DNA comparison and read mapping
• Algorithmic approach – Why SIMD?
• MERCATOR overview and performance
• Research challenges
![Page 5: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/5.jpg)
Molecular Biology is Fundamental
• Genetic basis of disease and disease risk
• Systems biology – what are your cells doing?
• Studying natural history and evolution
• Engineering cells’ behavior for medicine, industry, agriculture
![Page 6: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/6.jpg)
The First Step: DNA Sequencing
• Sequencing can tell us what is in a genome…
• … but it is also the basis of experiments to probe gene expression, protein binding, chromosome conformation, epigenetic marks, polymorphism, copy number variants …
…acaggatagtaccgataccat cacccggataggacctatgag ggacacaggacttatggcattt…
![Page 7: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/7.jpg)
![Page 8: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/8.jpg)
[Figure: NIH Sequence Read Archive Size (Open Access Only). Vertical axis: Billions of DNA Bases, log scale from 1.E+00 to 1.E+07. Source: http://www.ncbi.nlm.nih.gov/Traces/sra/]
![Page 9: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/9.jpg)
Problem: Classical Similarity Search
• Given
– a genome-sized or larger DNA sequence database D
– a “query” sequence q of some length L << |D|
• Does q appear in D with at most k differences, and if so, where?
![Page 10: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/10.jpg)
Typical Parameters
• Database D has size 10^9 – 10^10 bases
• Query q has size 10^2 – 10^4 bases
• # differences k is 5-25% of |q| (bases added, deleted, changed)
![Page 11: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/11.jpg)
Tools for Similarity Search
• BLAST [Altschul et al. 1990, 1996]
• BLAT [Kent 2002]
These tools use variants of the same basic search algorithm.
![Page 12: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/12.jpg)
Problem: Short-Read Mapping
• Given
– a genome-sized or larger DNA sequence database D
– N “reads”: DNA seqs of some length L << |D|
• For each read, does it appear in D with at most k differences, and if so, where?
![Page 13: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/13.jpg)
Typical Parameters
• Database D has size 10^9 – 10^10 bases
• Number of reads N is 10^6 – 10^8
• Length L is 75-150 (may vary among reads)
• # differences k is 0-3 (added, deleted, changed)
![Page 14: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/14.jpg)
Tools for Mapping
• Bowtie [Langmead et al. 2009, 2012]
• BWA [Li & Durbin 2009, 2010]
• SOAP2 [Li et al. 2009]
All these tools use variants of the same basic search algorithm.
![Page 15: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/15.jpg)
Why Short-Read Mapping?
• Some experimental procedure selects a subset of everything in the database
• Reads are sampled from this subset by your sequencing machine
• Mapping tells you which parts of database are present in your sample
![Page 16: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/16.jpg)
Problem: Alignment-Free Organism ID
• Given
– a metagenome-sized DNA sequence database D
– N microbial genomes: DNA seqs of some length L << |D|
• For each genome, do (some of) its sequences appear in D?
![Page 17: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/17.jpg)
Alignment-Free Techniques
• Min-Hash Sketching – convert a seq to a small sample (m ~ 1000) of hash values
• Approximate Containment: how much of (the sketch for) a genome overlaps (the sketch for) a metagenome?
• MASH (Ondov et al. 2016)
• SourMASH (Brown et al. 2016)
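The sketching and containment ideas above can be illustrated with a small Python sketch. It is not the MASH or SourMASH implementation: the function names, the SHA-1 based hash, the tiny k-mer size, and the choice to test the genome sketch against all k-mer hashes of the metagenome are assumptions made for illustration.

```python
import hashlib

def kmer_hashes(seq, k):
    """Set of 64-bit hashes of all k-mers in seq (hash choice is an assumption)."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }

def bottom_sketch(seq, k, m):
    """Bottom-m MinHash sketch: the m smallest k-mer hash values of seq."""
    return sorted(kmer_hashes(seq, k))[:m]

def containment(genome, metagenome, k=4, m=1000):
    """Estimate the fraction of the genome's k-mers that occur in the metagenome."""
    sketch = bottom_sketch(genome, k, m)
    if not sketch:
        return 0.0
    meta = kmer_hashes(metagenome, k)
    return sum(h in meta for h in sketch) / len(sketch)

genome = "acagaccaga"
present = containment(genome, "ttt" + genome + "ggg")
absent = containment("acgtacgt", "tttttttt")
```

Because only the m smallest hashes of each genome are kept, the cost per genome depends on m rather than on the genome length once the sketch exists.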
![Page 18: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/18.jpg)
Talk Overview
• Problems: DNA alignment and read mapping
• Algorithmic approach – Why SIMD?
• MERCATOR overview and performance
• Research challenges
![Page 19: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/19.jpg)
How BLAST Works
[Figure: pipeline of three stages (Substring Matching, Ungapped Filter, Gapped Filter), annotated with SEQUENCES REMAINING and COMPUTATIONAL COST]
BLAST operates as a pipeline of computational stages.
![Page 20: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/20.jpg)
Stages of BLAST
• Stage 1: identify potential match locations between q, D
• Stage 2: keep only those locations that look somewhat promising
• Stage 3: keep only those locations that actually yield high-similarity alignments
![Page 21: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/21.jpg)
Generating Possible Matches
• Every place where some 11-mer from q matches an 11-mer from D exactly is a candidate.
• Can rapidly find all such matching locations using a hash table of the 11-mers in sequence q
accagatacatagcactcgctacgtcagatgggtaca gttaagtcagatgggtagactcaggatgacagtggaca
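A minimal Python sketch of this candidate-generation step, applied to the two sequences shown on the slide. Treating the first sequence as q and the second as D is an assumption, as are the function names; real BLAST implementations use more compact tables.

```python
from collections import defaultdict

def kmer_index(q, w=11):
    """Map each w-mer of q to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(q) - w + 1):
        index[q[i:i + w]].append(i)
    return index

def find_seeds(q, d, w=11):
    """Return (q_pos, d_pos) pairs where a w-mer of q equals a w-mer of d."""
    index = kmer_index(q, w)
    seeds = []
    for j in range(len(d) - w + 1):
        # .get avoids creating empty entries for w-mers absent from q
        for i in index.get(d[j:j + w], ()):
            seeds.append((i, j))
    return seeds

# Sequences from the slide; which one plays q and which plays D is assumed.
q = "accagatacatagcactcgctacgtcagatgggtaca"
d = "gttaagtcagatgggtagactcaggatgacagtggaca"
seeds = find_seeds(q, d)
```

The two sequences share the 12-base substring gtcagatgggta, so two overlapping 11-mer seeds are reported. Each lookup is a single hash probe, which is why this stage is cheap compared with the later filters.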
![Page 22: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/22.jpg)
Filtering Candidates
• Uses explicit edit distance computation between q and part of D (Smith-Waterman algorithm)
• Expensive dynamic programming!
• “Easy” version (substitutions only), followed by hard version (add/delete chars allowed)
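To make the cost of the gapped filter concrete, here is a small Smith-Waterman scoring sketch in Python. The scoring parameters (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration and are not BLAST's defaults; the function returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

exact = smith_waterman_score("acgt", "ttacgttt")
gapped = smith_waterman_score("acgtacgt", "acgtcgt")
```

The double loop touches every cell of a len(a) by len(b) table, which is the "expensive dynamic programming" the slide refers to. Dropping the two gap terms from the `max` gives the cheaper substitutions-only variant.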
![Page 23: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/23.jpg)
BLAST Parallelization
• Can generate candidates in parallel at each DB location, then filter them in parallel.
[Figure: pipeline of stages: Gen Candidates, Filter Ungapped, Filter Gapped]
![Page 24: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/24.jpg)
What About Read Mapping?
• Uses an index (virtual suffix tree) of the database
• Matching involves tracing a path down the index tree for each read
• (must try several paths if differences are allowed)
• Can do in parallel for many reads at once!
![Page 25: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/25.jpg)
Suffix Tree Example
D = acagaccaga$
(positions 0 1 2 3 4 5 6 7 8 9 10)
Sorted suffixes (start position, suffix):
10 $
9 a$
0 aca…
4 acc…
7 aga$
2 agac…
6 caga$
1 cagac…
5 cca…
8 ga$
3 gac…
A = 9 0 4 7 2 6 1 5 8 3
[Figure: suffix tree T over D, with leaves in the order given by A]
![Page 26: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/26.jpg)
Rapid Matching vs Suffix Tree
• Can find all matches to a read in D in time proportional to read length L.
[Figure: suffix tree T with leaves 9 0 4 7 2 6 1 5 8 3]
Where is cag?
![Page 27: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/27.jpg)
![Page 28: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/28.jpg)
![Page 29: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/29.jpg)
![Page 30: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/30.jpg)
Rapid Matching vs Suffix Tree
• Can find all matches to a read in D in time proportional to read length L.
[Figure: suffix tree T with leaves 9 0 4 7 2 6 1 5 8 3]
Where is cag?
D = acagaccaga$ (positions 0 to 10)
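The lookup can be reproduced with a short Python sketch. It swaps in a sorted suffix array with binary search in place of the suffix tree, so it costs O(L log |D|) character comparisons rather than time proportional to L; the function names and the naive construction are assumptions for illustration.

```python
def suffix_array(d):
    """Start positions of all suffixes of d in sorted order (naive build)."""
    return sorted(range(len(d)), key=lambda i: d[i:])

def find_all(d, sa, read):
    """All positions where read occurs in d, via two binary searches on sa."""
    n = len(read)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= read
        mid = (lo + hi) // 2
        if d[sa[mid]:sa[mid] + n] < read:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix is > read
        mid = (lo + hi) // 2
        if d[sa[mid]:sa[mid] + n] <= read:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

D = "acagaccaga$"
A = suffix_array(D)
hits = find_all(D, A, "cag")
```

Here `A` begins with 10 (the suffix `$`) followed by 9 0 4 7 2 6 1 5 8 3, and `hits` is `[1, 6]`: cag starts at positions 1 and 6 of D. All suffixes that begin with the read occupy one contiguous range of the array, which corresponds to one subtree of the suffix tree.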
![Page 31: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/31.jpg)
Extension to Inexact Matching
• To permit matches with k substitutions, try multiple paths, but charge for each mismatch.
• To permit matches with k differences, we do dynamic programming to compute edit distance of read against each path in tree.
• Descent stops for a read when we hit the bottom of the tree or find that the path requires > k differences.
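The substitutions-only case can be sketched as a depth-first search that carries a mismatch budget down the tree. The Python below builds a naive, uncompressed suffix trie (quadratic space, only sensible for toy inputs); the data layout and function names are assumptions, and the edit-distance variant with insertions and deletions is not shown.

```python
def build_suffix_trie(d):
    """Naive suffix trie; each node records the start positions passing through it."""
    root = {"pos": list(range(len(d))), "kids": {}}
    for start in range(len(d)):
        node = root
        for ch in d[start:]:
            node = node["kids"].setdefault(ch, {"pos": [], "kids": {}})
            node["pos"].append(start)
    return root

def search(node, read, k, depth=0):
    """Start positions where read matches with at most k substitutions."""
    if depth == len(read):
        return list(node["pos"])
    hits = []
    for ch, child in node["kids"].items():
        cost = 0 if ch == read[depth] else 1
        if cost <= k:                   # prune paths that exceed the budget
            hits += search(child, read, k - cost, depth + 1)
    return hits

trie = build_suffix_trie("acagaccaga")
exact = sorted(search(trie, "cag", 0))
one_off = sorted(search(trie, "cac", 1))
```

With k = 0 only one path is followed. With k = 1 the read cac has no exact occurrence but matches at positions 1, 3, and 6 with one substitution each, and the search explores several branches to find them.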
![Page 32: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/32.jpg)
Parallel Alignment is a SIMD Computation
• We process every BLAST starting location / every read through the same filtering computation
• Single Instruction stream, Multiple Data items
Thread 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
![Page 33: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/33.jpg)
SIMD Targets
• Our work: NVIDIA GPUs (32 SIMD lanes x 4+ threads x 2-64 cores)
• Other possibilities: any multicore with wide vector instructions (Intel Xeon, AMD, ARM, …)
• Nearly all modern processors have wide SIMD!
![Page 34: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/34.jpg)
Batched Traversal (Short Reads)
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3]
![Page 35: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/35.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3]
![Page 36: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/36.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3; three x marks]
Some reads may accumulate > k diffs before others
![Page 37: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/37.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3; five x marks]
![Page 38: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/38.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3; six x marks]
Stop descending when all reads either have > k diffs or are completely matched with fewer diffs
![Page 39: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/39.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3; six x marks]
Continue on next branch starting from batch on top of stack
![Page 40: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/40.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3; one row of six x marks and one row of eight]
![Page 41: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/41.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3; six x marks]
![Page 42: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/42.jpg)
Batched Traversal
[Figure: suffix tree with leaves 9 0 4 7 2 6 1 5 8 3; four x marks]
![Page 43: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/43.jpg)
Performance?
• Each stage of BLAST costs more but processes less input.
• 98% of threads idle for 110/111 ms
• 1.99% of threads idle for 100/111 ms
• SIMD EFFICIENCY: 1.1%
Stage: Substring Matching | Ungapped Filter | Gapped Filter
Computational cost: 1 ms | 10 ms | 100 ms
Sequences remaining: 100% | 2% | 0.01%
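The 1.1% figure can be checked with a short calculation. The model below is an assumption about how the slide's numbers combine: every input occupies one SIMD lane for the full 111 ms pipeline, and a lane does useful work in a stage only if its input is still alive when that stage runs.

```python
stage_ms = [1, 10, 100]        # cost of each stage, from the slide
reach = [1.0, 0.02, 0.0001]    # fraction of inputs that reach each stage

def simd_efficiency(stage_ms, reach):
    """Useful lane-time divided by total lane-time when all lanes run every stage."""
    total = sum(stage_ms)
    useful = sum(t * r for t, r in zip(stage_ms, reach))
    return useful / total

eff = simd_efficiency(stage_ms, reach)
```

The useful time per lane averages 1 + 0.2 + 0.01 = 1.21 ms out of 111 ms, so `eff` is about 0.0109, which rounds to the 1.1% on the slide.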
![Page 44: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/44.jpg)
Irregular Computations
• DNA alignment is an irregular computation: different inputs (i.e. DB locations, reads) require different amounts of work to process.
• Antithesis of, e.g., linear algebra calculations that are easily vectorized
• Irregular computations are highly inefficient if naively implemented on SIMD processors.
![Page 45: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/45.jpg)
The Key Problem
• How can we efficiently map irregular computations onto a SIMD architecture?
![Page 46: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/46.jpg)
Talk Overview
• Problems: DNA alignment and read mapping
• Algorithmic approach – Why SIMD?
• MERCATOR overview and performance
• Research challenges
![Page 47: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/47.jpg)
Pause for MERCATOR demo
![Page 48: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/48.jpg)
MERCATOR Paradigm
• Application processes a long stream of inputs
• Application graph consists of nodes (computations), edges (data transfer)
• Data flows through graph of computations
• Irregularity: paths differ per input, and each input to a node generates 0, 1, or multiple outputs
![Page 49: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/49.jpg)
Handling Irregularity
• Each edge between nodes has a queue
• MERCATOR queues inputs to a node until there are enough to fill all its SIMD lanes
• A node fires only when it has a “full ensemble” of inputs, one in every lane.
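The queueing rule can be illustrated with a toy, single-threaded Python model. It is not MERCATOR's API: the `Node` class, its method names, and the `flush` option for draining a partial ensemble at end of stream are all assumptions made for this sketch.

```python
from collections import deque

class Node:
    """Dataflow node that fires only when a full ensemble of inputs is queued."""

    def __init__(self, fn, width):
        self.fn = fn            # maps one input to a list of 0, 1, or more outputs
        self.width = width      # number of SIMD lanes (ensemble size)
        self.queue = deque()
        self.firings = 0

    def push(self, items):
        self.queue.extend(items)

    def fire_ready(self, flush=False):
        """Fire on each full ensemble; if flush is set, also drain a partial one."""
        out = []
        while len(self.queue) >= self.width or (flush and self.queue):
            n = min(self.width, len(self.queue))
            batch = [self.queue.popleft() for _ in range(n)]
            self.firings += 1
            for x in batch:     # sequential stand-in for the SIMD lanes
                out.extend(self.fn(x))
        return out

keep_even = Node(lambda x: [x] if x % 2 == 0 else [], width=4)
keep_even.push(range(6))
first = keep_even.fire_ready()          # one firing on inputs 0..3
waiting = len(keep_even.queue)          # inputs 4 and 5 stay queued
rest = keep_even.fire_ready(flush=True)
```

With six inputs and four lanes, the node fires once on a full ensemble and leaves two inputs waiting, so no lane is idle during a normal firing. Only the explicit flush at end of stream runs with partially filled lanes.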
![Page 50: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/50.jpg)
Illustration of Queues
![Page 51: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/51.jpg)
![Page 52: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/52.jpg)
![Page 53: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/53.jpg)
![Page 54: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/54.jpg)
![Page 55: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/55.jpg)
![Page 56: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/56.jpg)
![Page 57: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/57.jpg)
![Page 58: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/58.jpg)
![Page 59: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/59.jpg)
![Page 60: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/60.jpg)
A Few Complications
• Shared Code – two or more nodes may do same thing (e.g. Viola-Jones)
• Overhead – queueing isn’t free
• Asynchrony – must use multiple processors, each with multiple SIMD lanes
• Ordering – are inputs processed “in order”?
![Page 61: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/61.jpg)
Exploiting Shared Code
• “Module type” CUDA code
• Multiple nodes with same function have same module type
• We execute all nodes of a given module type in parallel!
• [Requires pulling data from each node’s queue concurrently]
![Page 62: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/62.jpg)
Minimizing Overhead
• Queue manipulation is itself parallelized
• Easy case: “read next k inputs from queue into threads 1..k.”
• More fun: “read k total inputs from all queues combined into threads 1..k, and remember which queue each input came from.”
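One common way to perform the combined read is a prefix sum over the queue lengths followed by a per-thread binary search. The Python sketch below models that idea sequentially; whether MERCATOR does exactly this is an assumption, as are the function name and the 0-based thread numbering.

```python
from bisect import bisect_right
from itertools import accumulate

def gather(queues, k):
    """Assign up to k queued inputs to threads 0..k-1, tagged with their queue."""
    ends = list(accumulate(len(q) for q in queues))   # inclusive prefix sums
    total = ends[-1] if ends else 0
    lanes = []
    for t in range(min(k, total)):      # each iteration models one thread
        qi = bisect_right(ends, t)      # first queue whose end exceeds t
        offset = t - (ends[qi - 1] if qi else 0)
        lanes.append((qi, queues[qi][offset]))
    return lanes

queues = [["a", "b"], [], ["c"], ["d", "e", "f"]]
lanes = gather(queues, 5)
```

Each thread needs only its own index `t` and the shared prefix sums, so the threads do not have to coordinate with one another. Empty queues are skipped automatically because their end offset equals that of the preceding queue.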
![Page 63: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/63.jpg)
Sneaky Tricks
• Parallel scan
• Branch-free binary search
• Parallel output compaction
• [exploits, maintains input ordering]
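As a hedged illustration of scan-based output compaction, the Python sketch below packs the outputs of lanes that produced something into a dense array while keeping lane order. It models one output per lane at most and uses `None` for "no output"; both choices, and the function name, are assumptions for this sketch.

```python
from itertools import accumulate

def compact(outputs):
    """Pack the non-None lane outputs into a dense list, preserving lane order."""
    flags = [0 if x is None else 1 for x in outputs]
    # Exclusive prefix sum: each producing lane's slot in the dense output.
    slots = ([0] + list(accumulate(flags))[:-1]) if flags else []
    dense = [None] * sum(flags)
    for lane, x in enumerate(outputs):  # each lane writes independently
        if flags[lane]:
            dense[slots[lane]] = x
    return dense

packed = compact([None, "a", None, "b", "c", None])
```

Because each lane's slot comes from the scan, the writes never collide and the relative order of the inputs is preserved in the output.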
![Page 64: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/64.jpg)
Results of Synthetic Trial
![Page 65: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/65.jpg)
Dealing with Asynchrony
• Shared input / output buffers
• Output order with multiple processors?
• [Need stream-synchronized signaling]
• Associative (and commutative?) reductions
![Page 66: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/66.jpg)
Applications with Cycles
• App graph can have back edges
• Issue: deadlock prevention
• [topology restrictions, queueing policy]
• Order preservation?
![Page 67: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/67.jpg)
Optimization Opportunities
• Parameter tuning (queue sizes, scheduler, …)
• Latency-sensitive applications vs occupancy
• Fusing nodes to elide queueing (at what cost to occupancy?)
![Page 68: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows](https://reader030.vdocument.in/reader030/viewer/2022041103/5f029cde7e708231d405226b/html5/thumbnails/68.jpg)
Want to Play?
• https://github.com/jdbuhler/mercator
• MERCATOR will be a testing ground for SIMD-aware irregular streaming computation
• Many interesting problems still to be solved!
• thesis topics