1
Parallel Longest Common Subsequence using Graphics Hardware
John Kloetzli, Brian Strege
Jonathan Decker, Dr. Marc Olano
Presented by: Brian Strege
2
Overview
• Introduction
– Problem Statement
• Background and Related Work
– The NVIDIA G80 Architecture
• Algorithm Description
• Results and Analysis
• Conclusion
3
Introduction
• Worked on GPU acceleration of Dynamic Programming
– Specifically, problems in the Gaussian Elimination Paradigm (GEP)
– More specifically, Longest Common Subsequence as a representative problem belonging to the GEP
4
Problem Statement
• Design and implement an algorithm for finding the LCS of two arbitrary length strings on a CPU + GPU machine
– Must make efficient use of both CPU and GPU architectures
– Must have theoretical justification of design
6
Related Work
• General-Purpose Computation on Graphics Hardware
– NVIDIA CUDA
– Owens et al. (2005)
• Linear Dynamic Programming
– Hirschberg (1975)
– Chowdhury et al. (2006)
• GPU Sequence Alignment
– Liu et al. (2007)
– Schatz et al. (2007)
7
The NVIDIA G80 Architecture
• 16 multiprocessors, 8 cores each: 128 logical processors
• 1.35 GHz
• 768 MB of RAM
• 86.4 GB/sec transfer rate (8.5 GB/sec Core 2 Duo)
• 520 GFLOPS (22 GFLOPS Core 2 Duo)
(Figure source: NVIDIA CUDA Programming Guide, 1.0)
8
The NVIDIA G80 Architecture
Program workflow:
• CPU (host) creates kernel program
• GPU maps kernel “blocks” to processors
• Processors map kernel “threads” to processor cores
• Cores execute in parallel
(Figure source: NVIDIA CUDA Programming Guide, 1.0)
10
Algorithm Description
• The SIMPLE-LCS recurrence
– Requires quadratic space, which limits scalability
– Faster than Chowdhury et al. linear space method
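For reference, the textbook LCS length recurrence is written out below. The notation is ours rather than the slides': c[i,j] is the LCS length of the first i characters of x and the first j characters of y.

```latex
c[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
c[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } x_i = y_j,\\
\max\bigl(c[i-1,j],\; c[i,j-1]\bigr) & \text{otherwise.}
\end{cases}
```

Keeping every c[i,j] for two strings of length n takes (n+1)^2 entries, which is where the quadratic space cost comes from.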
11-30
SIMPLE-LCS Example
(Animation frames, condensed. Rows are labeled by AABB and columns by ABAB; the table is filled in one cell at a time, row by row. The completed table is shown.)
      A  B  A  B
   0  0  0  0  0
A  0  1  1  1  1
A  0  1  1  2  2
B  0  1  2  2  3
B  0  1  2  2  3
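The example can be reproduced with a short sketch of the classic quadratic-space algorithm. The function names `lcs_table` and `traceback` are our own, chosen for illustration; the slides do not show the authors' code.

```python
def lcs_table(x, y):
    """Fill the full (len(x)+1) x (len(y)+1) LCS table; x labels the rows."""
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def traceback(c, x, y):
    """Walk back from the bottom-right cell to recover one LCS."""
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


table = lcs_table("AABB", "ABAB")
for row in table:
    print(*row)
print(traceback(table, "AABB", "ABAB"))
```

The bottom-right entry of the table is the LCS length (3 for these two strings). Other tie-breaking rules in the traceback would return a different LCS of the same length.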
31
Algorithm Description
• Chowdhury et al. run a CPU quadratic space algorithm on small subproblems
– CH-LCS is their linear space algorithm
– CUTOFF ranges from 2^8 to 2^10
32
Algorithm Description
• Our approach is to add another base case solved quickly on the GPU
– GPU-LCS is our new algorithm (not recursive)
– GPU-CUTOFF is 2^16
– CUTOFF is 2^11
33
Algorithm Description
• CH: CPU Linear Space DP
• GPU: GPU DP
– GPU level 1: GPU Quadratic Space DP (block level)
– GPU level 2: GPU Linear Space DP (thread level)
• Simple: CPU Quadratic Space DP
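One way to picture how the three solvers fit together is a size-based dispatch. The sketch below is our reading of the slides, not the authors' code: the cutoff values 2^16 and 2^11 are the ones given in the talk, while the function name `choose_solver` and the choice of which side of each threshold is inclusive are assumptions.

```python
GPU_CUTOFF = 2 ** 16  # largest subproblem handed to the GPU solver (value from the slides)
CUTOFF = 2 ** 11      # at or below this, the CPU quadratic-space solver runs (value from the slides)


def choose_solver(n):
    """Pick a solver name for an n x n subproblem.

    Assumption: thresholds are inclusive on the smaller side.
    """
    if n > GPU_CUTOFF:
        return "CH"      # CPU linear-space recursion keeps splitting the problem
    if n > CUTOFF:
        return "GPU"     # GPU-LCS base case
    return "Simple"      # CPU quadratic-space base case


for size in (2 ** 20, 2 ** 16, 2 ** 11):
    print(size, choose_solver(size))
```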
34
CH: CPU Linear Space DP
Two recursive functions used:
• Output boundary
• LCS reconstruction
35
CH: CPU Linear Space DP
Output boundary:
• Given input boundary, computes output boundary
• Expects subproblem size to be square, with power-of-two lengths
36-54
Pushing Example
(Animation frames, condensed. Rows are labeled by AABB and columns by ABAB. The input boundary is the top row 19 20 21 22 22 and the left column 20 20 20 20. Cells are filled in one 2x2 quadrant at a time: top-left, top-right, bottom-left, bottom-right. The completed matrix is shown.)
       A   B   A   B
   19  20  21  22  22
A  20  20  21  22  22
A  20  21  21  22  22
B  20  21  22  22  23
B  20  21  22  22  23
Boundary array at the start: 20 20 20 20 19 20 21 22 22
Boundary array at the end: 20 21 22 22 23 23 22 22 22
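The frames suggest that a single array of 2n+1 values is updated in place as the boundary is pushed through the matrix. A minimal sketch under that reading follows. The function name `push_boundary`, the array layout (left column from bottom to top, then the corner, then the top row) and the index rule k = n - i + j are inferred from the example, not taken from the authors' code.

```python
def push_boundary(x, y, boundary):
    """Push an input boundary through an n x n LCS subproblem in O(n) space.

    boundary holds 2n+1 values: left column (bottom to top), corner, top row.
    Returns the output boundary: bottom row (left to right), then the right
    column (bottom to top). x labels the rows and y the columns.
    """
    n = len(x)
    if n == 0 or n & (n - 1) or len(y) != n or len(boundary) != 2 * n + 1:
        raise ValueError("expects a square, power-of-two subproblem")
    b = list(boundary)

    def solve(i0, j0, size):
        if size == 1:
            i, j = i0 + 1, j0 + 1       # 1-based cell coordinates
            k = n - i + j               # slot shared by the cell's diagonal
            if x[i - 1] == y[j - 1]:
                b[k] = b[k] + 1         # b[k] still holds the diagonal neighbor
            else:
                b[k] = max(b[k - 1], b[k + 1])  # left and upper neighbors
            return
        h = size // 2
        solve(i0, j0, h)          # top-left quadrant
        solve(i0, j0 + h, h)      # top-right quadrant
        solve(i0 + h, j0, h)      # bottom-left quadrant
        solve(i0 + h, j0 + h, h)  # bottom-right quadrant

    solve(0, 0, n)
    return b


print(push_boundary("AABB", "ABAB", [20, 20, 20, 20, 19, 20, 21, 22, 22]))
```

The quadrant order is safe because every cell's left, upper and diagonal neighbors lie either in the same quadrant or in one visited earlier.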
56
GPU Processing Overview
• Two levels of parallelism
– Blocks are executed on a processor
– Threads are executed on a processor core
– Each thread is computed by exactly one processor core
57
GPU Level 1: Quadratic Space
• Computes the length of the LCS for inputs of length up to 2^16
• Divide the DP matrix into “blocks”; each block is solved by one of the GPU processors
• We must enforce the correct order of block execution
– Each diagonal can be computed in parallel
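The diagonal ordering can be sketched as a schedule of "waves", where every block in a wave depends only on blocks from earlier waves. This is an illustration of the ordering constraint only; the name `block_wavefronts` is hypothetical and the code says nothing about how the authors actually launch kernels.

```python
def block_wavefronts(blocks_per_side):
    """Group block coordinates (row, col) by anti-diagonal.

    A block needs its upper, left and upper-left neighbors, all of which
    sit on earlier anti-diagonals, so each group can run in parallel.
    """
    waves = []
    for d in range(2 * blocks_per_side - 1):
        first_row = max(0, d - blocks_per_side + 1)
        last_row = min(d, blocks_per_side - 1)
        waves.append([(r, d - r) for r in range(first_row, last_row + 1)])
    return waves


for wave in block_wavefronts(3):
    print(wave)
```

With inputs of length 2^16 and a block width of 512 (the values used in the talk), there are 128 blocks per side and therefore 255 waves.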
58
GPU Level 1: Quadratic Space
• The basic quadratic space DP algorithm would require 16 GB of memory
– We “fold” the memory to store only the input/output boundary for each block
– Reduces the storage required to 64 MB
– From n^2 to 2(n^2/m) where m = 512
– Duplicate some values to avoid memory contention
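The two memory figures can be checked with a few lines of arithmetic. The entry size of 4 bytes is our assumption; it is the value that makes n^2 entries come out to 16 GB for n = 2^16.

```python
n = 2 ** 16          # maximum input length handled on the GPU (from the slides)
m = 512              # block width (from the slides)
BYTES_PER_ENTRY = 4  # assumption: 32-bit table entries

full_bytes = n * n * BYTES_PER_ENTRY               # whole quadratic table
folded_bytes = 2 * (n * n // m) * BYTES_PER_ENTRY  # block boundaries only

print(full_bytes // 2 ** 30, "GiB")    # prints "16 GiB"
print(folded_bytes // 2 ** 20, "MiB")  # prints "64 MiB"
```

Folding reduces storage by a factor of m/2 = 256.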
60
GPU Level 2: Linear Space
• Within each block we also have more parallelism
– Divide each block into “threads”
– Each processor core computes one thread at a time
– Hardware-level synchronization ensures the correct diagonal ordering
– Each core reuses the same space (white) and computes the entire logical matrix (grey)
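The idea of reusing a small buffer while still covering the whole logical matrix can be illustrated on the CPU with the standard two-row LCS length computation. This is a generic sketch, not the GPU kernel from the talk, and `lcs_length_linear` is a name we made up.

```python
def lcs_length_linear(x, y):
    """LCS length using two rows of len(y)+1 entries instead of a full table."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0] * (len(y) + 1)
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur  # the older row is no longer needed
    return prev[-1]


print(lcs_length_linear("AABB", "ABAB"))
```

Only the length survives; recovering the subsequence itself needs extra work, which is why a separate reconstruction step exists.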
61
GPU Level 2: Linear Space
• Each thread is a 4x4 subproblem
– The size was determined by experimentation
– This memory is on chip, so we do not have to worry about memory conflicts
– The linear space algorithm allows us to make each block as large as possible, which allows for very fast execution
63
Simple: CPU Quadratic Space DP
• Only gets called when a subproblem is too small for the GPU
• Implements SIMPLE-LCS, the “classic” matrix-based LCS algorithm
65
Results and Analysis
GPU thread width of 4 proves optimal
66
Results and Analysis
GPU block width of 512 is slightly faster
67
Results and Analysis
CPU/GPU cutoff sizes determined experimentally
68
Results and Analysis
• Test DNA sequence data obtained from Mike Brudno
• Over five-fold performance improvement over the results in Chowdhury et al. on all sequence comparisons

Species   Length (millions)
Human     1.80
Chimp     1.32
Baboon    1.51
Chicken   0.42
Fugu      0.27
Cow       1.46
Mouse     1.49
Rat       1.50
Cat       1.16
Dog       1.05
69
Conclusion
• We present a GPU-based Dynamic Programming algorithm to compute the LCS of very large sequences
• The GPU implementation gives an over five-fold performance boost over a single-CPU implementation
70
Future Work
• We believe our algorithm can be accelerated further with careful optimization
– Memory management on the GPU
– Memory transfer between CPU and GPU
• Investigation of other computation models
– Implementations using 8xCPU + 2xGPU?
71
Questions?
Special thanks to Rezaul Chowdhury for his support and Mike Brudno for the DNA sequence data
72
NVIDIA CUDA
• Compute Unified Device Architecture
• Available on G80 Series
• Architecture for utilizing the GPU as a data-parallel computing device
• Eliminates the need to map computation through graphics API
• User writes a C style function which is then run in parallel on the GPU
73
CH: CPU Linear Space DP
LCS reconstruction
• Computes output boundaries in specific order
• Traces back through boundaries to generate LCS
• Linear space
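The slides do not spell out the CH-LCS reconstruction. As a stand-in, here is a sketch of the classic linear-space reconstruction by Hirschberg (1975), which is listed under Related Work. It is a different technique from the boundary traceback described above and is shown only to illustrate that an LCS can be recovered without a quadratic table. All function names are ours.

```python
def last_row(x, y):
    """Last row of the LCS table for x against y, in O(len(y)) space."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg(x, y):
    """Return one LCS of x and y using divide and conquer."""
    if not x or not y:
        return ""
    if len(x) == 1:
        return x if x in y else ""
    mid = len(x) // 2
    left = last_row(x[:mid], y)               # forward pass on the top half
    right = last_row(x[mid:][::-1], y[::-1])  # backward pass on the bottom half
    n = len(y)
    # Split y where the two halves together give the longest result.
    split = max(range(n + 1), key=lambda k: left[k] + right[n - k])
    return hirschberg(x[:mid], y[:split]) + hirschberg(x[mid:], y[split:])


result = hirschberg("AABB", "ABAB")
print(result, len(result))
```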
74
CH: CPU Linear Space DP
LCS reconstruction omissions:
• Non-power-of-two sequence lengths
• Non-equal sequence lengths
75
Integration with Parallel CPUs
• Chowdhury et al. implemented a parallel version of their algorithm
– No data available for LCS, but results from other algorithms show we should expect ~6 times speedup for LCS using 8 server processors
– Disadvantages:
  • Number of processors which can be effectively used scales poorly with input size
  • Server CPUs cost between $500 and $1600 each, while the GPU we used cost $550