lossless data compression on gpus - nvidiadeveloper.download.nvidia.com/.../presentationpdf/... ·...
TRANSCRIPT
![Page 1: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/1.jpg)
Lossless Data Compression on GPUs
Ritesh Patel Jason Mak University of California, Davis University of California, Davis
1
![Page 2: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/2.jpg)
Motivation • Data storage
• Computation vs. Data Transfer o Is compress-then-send worthwhile?
2
GPU CPU GPU CPU 20 MB 100 MB
Less transfer time
More computation
![Page 3: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/3.jpg)
PCI-E vs. GPU bandwidth
3
0
50
100
150
200
250
1995 1999 2003 2007 2011
Ban
dw
idth
(G
B/s
)
GPU Global Memory
Bandwidth
PCI-Express
Bandwdith
![Page 4: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/4.jpg)
Topics Covered • Related Work
• Three Algorithms in the Compression Pipeline 1. Burrows-Wheeler Transform
2. Move-to-Front Transform
3. Huffman Coding
• Results
• Future Work
4
![Page 5: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/5.jpg)
Related Work • Domain-specific compression
o Floating-point data compression
o Texture compression
• Parallel bzip2 (pbzip2) o Uses pthread library
o bzip2 not data-parallel-friendly
5
![Page 6: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/6.jpg)
Compression Pipeline
6
Burrows-Wheeler Transform
Input Data: n characters
Move-to-Front Transform
Huffman Coding
Output Data: compressed
n characters
n characters
Many runs of repeated characters
Low entropy Lots of 0s, 1s, etc.
![Page 7: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/7.jpg)
Burrows-Wheeler Transform • Transforms a string and gives has many
runs of repeated characters
• Same characters get grouped
7
Input Data
Output Data: compressed
![Page 8: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/8.jpg)
Burrows-Wheeler Transform • Transforms a string and gives has many
runs of repeated characters
• Same characters get grouped
8
Input Data
Output Data: compressed
ababacabac ccbbbaaaaa
Input Output
![Page 9: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/9.jpg)
Burrows-Wheeler Transform
a b a b a c a b a c
c a b a b a c a b a
a c a b a b a c a b
b a c a b a b a c a
a b a c a b a b a c
c a b a c a b a b a
a c a b a c a b a b
b a c a b a c a b a
a b a c a b a c a b
b a b a c a b a c a
BWT Input: ababacabac
1) Cyclical Rotations
9
Input Data
Output Data: compressed
![Page 10: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/10.jpg)
Burrows-Wheeler Transform
a b a b a c a b a c
c a b a b a c a b a
a c a b a b a c a b
b a c a b a b a c a
a b a c a b a b a c
c a b a c a b a b a
a c a b a c a b a b
b a c a b a c a b a
a b a c a b a c a b
b a b a c a b a c a
a b a b a c a b a c
a b a c a b a b a c
a b a c a b a c a b
a c a b a b a c a b
a c a b a c a b a b
b a b a c a b a c a
b a c a b a b a c a
b a c a b a c a b a
c a b a b a c a b a
c a b a c a b a b a
BWT Input: ababacabac
1) Cyclical Rotations 2) Sort
10
Input Data
Output Data: compressed
![Page 11: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/11.jpg)
Burrows-Wheeler Transform
a b a b a c a b a c
a b a c a b a b a c
a b a c a b a c a b
a c a b a b a c a b
a c a b a c a b a b
b a b a c a b a c a
b a c a b a b a c a
b a c a b a c a b a
c a b a b a c a b a
c a b a c a b a b a
Index: 0 BWT: ccbbbaaaaa
11
Input Data
Output Data: compressed
Sorted Cyclical Rotations
![Page 12: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/12.jpg)
BWT with Merge Sort • Sorting n strings each n characters long
o n = 1 million per block in our implementation
• Radix sort not a good fit
• Break ties on the fly
12
Input Data
Output Data: compressed
![Page 13: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/13.jpg)
BWT with Merge Sort • Sorting n strings each n characters long
o n = 1 million per block in our implementation
• Radix sort not a good fit
• Break ties on the fly
• String sorting on GPU o Non-uniform
13
Input Data
Output Data: compressed
a b a b a c a b a c
a b a c a b a b a c
a b a c a b a c a b
a c a b a b a c a b
a c a b a c a b a b
b a b a c a b a c a
b a c a b a b a c a
b a c a b a c a b a
c a b a b a c a b a
c a b a c a b a b a
Pair 1: 7 ties
Pair 2: 0 ties
Pair 3: 4 ties
![Page 14: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/14.jpg)
BWT with Merge Sort • Sorting n strings each n characters long
o n = 1 million per block in our implementation
• Radix sort not a good fit
• Break ties on the fly
• String sorting on GPU o Non-uniform
o Parallelize BWT with string sorting algorithm based on merge sort
o Currently fastest string sort on GPU
14
Input Data
Output Data: compressed
![Page 15: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/15.jpg)
Move-to-Front Transform • Exploit clumpy characters from BWT
o BWT: ababacabac ccbbbaaaaa
• Improves effectiveness of entropy
encoding
15
Input Data
Output Data: compressed
![Page 16: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/16.jpg)
Move-to-Front Transform • Each symbol in the data is replaced by
its index in the list o Initial list: …abcd… (ASCII)
• Recently seen characters are kept at
front of the list o Long sequences of identical symbols replaced by many zeros
16
Input Data
Output Data: compressed ccbbbaaaaa 99,0,99,0,0,99,0,0,0,0
![Page 17: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/17.jpg)
Move-to-Front Transform Input MTF List Output
ccbbbaaaaa …abc… (ASCII) [99]
17
• ‘c’ occurs at 99th index Output ‘99’
Input Data
Output Data: compressed
![Page 18: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/18.jpg)
Move-to-Front Transform Input MTF List Output
ccbbbaaaaa c…ab… (ASCII) [99]
18
• ‘c’ occurs at 99th index Output ’99’ • Move ‘c’ to front of the MTF list • Shift all previous elements to the right
Input Data
Output Data: compressed
![Page 19: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/19.jpg)
Move-to-Front Transform Input MTF List Output
ccbbbaaaaa …abc… (ASCII) [99]
ccbbbaaaaa c…ab… [99,0]
19
• ‘c’ occurs at 0th index Output ’0’
Input Data
Output Data: compressed
![Page 20: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/20.jpg)
Move-to-Front Transform Input MTF List Output
ccbbbaaaaa …abc… (ASCII) [99]
ccbbbaaaaa c…ab… [99,0]
20
• ‘c’ occurs at 0th index Output ’0’ • ‘c’ already at front, no shifts
Input Data
Output Data: compressed
![Page 21: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/21.jpg)
Move-to-Front Transform Input MTF List Output
ccbbbaaaaa …abc… (ASCII) [99]
ccbbbaaaaa c…ab… [99,0]
21
• ‘c’ occurs at 0th index Output ’0’ • ‘c’ already at front, no shifts • Repeat for all elements
Input Data
Output Data: compressed
![Page 22: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/22.jpg)
Move-to-Front Transform Iteration MTF List Transformed String
ccbbbaaaaa …abc… (ASCII) [99]
ccbbbaaaaa c…ab… [99,0]
ccbbbaaaaa c…ab… [99,0,99]
ccbbbaaaaa bc…a… [99,0,99,0]
ccbbbaaaaa bc…a… [99,0,99,0,0]
ccbbbaaaaa bc…a… [99,0,99,0,0,99]
ccbbbaaaaa abc… [99,0,99,0,0,99,0]
ccbbbaaaaa abc… [99,0,99,0,0,99,0,0]
ccbbbaaaaa abc… [99,0,99,0,0,99,0,0,0]
ccbbbaaaaa abc… [99,0,99,0,0,99,0,0,0,0]
22
Input Data
Output Data: compressed
![Page 23: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/23.jpg)
Parallel MTF • MTF appears to be highly serial
o Character-by-character dependency
23
Input Data
Output Data: compressed
![Page 24: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/24.jpg)
Parallel MTF • MTF appears to be highly serial
o Character-by-character dependency
• Goal: generate multiple MTF lists at
different indices o Compute MTF output in parallel
24
Input Data
Output Data: compressed
![Page 25: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/25.jpg)
Parallel MTF • MTF appears to be highly serial
o Character-by-character dependency
• Goal: generate multiple MTF lists at
different indices o Compute MTF output in parallel
• How to parallelize? o Two key insights
1. Generate partial MTF list of a substring s
2. Combine two adjacent partial MTF lists (using
“concatenation”)
25
Input Data
Output Data: compressed
![Page 26: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/26.jpg)
Example Parallel MTF
B A D A D D D A C C C B B A A A
26
Input Data
Output Data: compressed
![Page 27: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/27.jpg)
Example Parallel MTF
B A D A D D D A C C C B B A A A
27
Input Data
Output Data: compressed
A D B
Part 1
MTF List
End of part 1
![Page 28: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/28.jpg)
Example Parallel MTF
B A D A D D D A C C C B B A A A
28
Input Data
Output Data: compressed
Part 1 Part 2
A B C D Final MTF List
End of part 2
![Page 29: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/29.jpg)
Example Parallel MTF
B A D A D D D A
29
Input Data
Output Data: compressed
A D B A B C 2 partial MTF lists (in parallel)
List 1 List 2
C C C B B A A A Break input into 2 chunks
![Page 30: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/30.jpg)
Example Parallel MTF
30
Input Data
Output Data: compressed
A D B A B C 2 partial MTF lists (in parallel)
Appends all unique symbols from list 1 to list 2
A B C D
B A D A D D D A C C C B B A A A
![Page 31: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/31.jpg)
Example Parallel MTF
31
Input Data
Output Data: compressed
A D B A B C 2 partial MTF lists (in parallel)
Same MTF list as we had before! A B C D
B A D A D D D A C C C B B A A A
![Page 32: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/32.jpg)
Example Parallel MTF
B A D A D D D A C C C B B A A A
32
Input Data
Output Data: compressed
B A D A D D D A C C C B B A A A
Break input into 4 chunks
![Page 33: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/33.jpg)
Example Parallel MTF
B A D A D D D A C C C B B A A A
33
Input Data
Output Data: compressed
B A D A D D D A C C C B B A A A
Break input into 4 chunks
Generate partial MTF lists
A D B… A D… B C… A B…
![Page 34: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/34.jpg)
Example Parallel MTF A D B A D B C A B
34
Input Data
Output Data: compressed
Operator Appends all unique symbols to current list
MTF list at index 4
![Page 35: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/35.jpg)
Example Parallel MTF A D B A D B C A B
A D B
35
Input Data
Output Data: compressed
Operator Appends all unique symbols to current list
MTF list at index 8
![Page 36: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/36.jpg)
Example Parallel MTF A D B A D B C A B
A D B
36
Input Data
Output Data: compressed
Operator Appends all unique symbols to current list
B C A D MTF list at index 12
![Page 37: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/37.jpg)
Example Parallel MTF A D B A D B C A B
A D B
37
Input Data
Output Data: compressed
Operator Appends all unique symbols to current list
B C A D
A B C D MTF list at index 16
![Page 38: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/38.jpg)
Example Parallel MTF A D B A D B C A B
A D B
38
Input Data
Output Data: compressed
Operator Appends all unique symbols to current list
B C A D
A B C D
4
8
12
16
Generated lists at indices: 4,8,12,16
![Page 39: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/39.jpg)
Example Parallel MTF A D B A D B C A B
A D B A B C
39
Input Data
Output Data: compressed
Operator Appends all unique symbols to current list
![Page 40: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/40.jpg)
Example Parallel MTF A D B A D B C A B
A D B A B C
A B C D
40
Input Data
Output Data: compressed
Operator Appends all unique symbols to current list
![Page 41: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/41.jpg)
Example Parallel MTF
Operator Appends all unique symbols to current list
A D B A D B C A B
A D B A B C
A B C D
B C A D 41
Input Data
Output Data: compressed
![Page 42: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/42.jpg)
Example Parallel MTF A D B A D B C A B
A D B A B C
A B C D
B C A D
4
8
12
16
42
Input Data
Output Data: compressed
Generated lists at indices: 4,8,12,16
![Page 43: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/43.jpg)
Example Parallel MTF
B A D A D D D A C C C B B A A A
B A D A D D D A C C C B B A A A
A D B… B C A D… A D B… A B C… (Initial MTF List)
43
0 4 8 12
Input Data
Output Data: compressed
MTFLists
In
![Page 44: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/44.jpg)
Example Parallel MTF
B A D A D D D A C C C B B A A A
B A D A D D D A C C C B B A A A
A D B… B C A D… A D B… A B C… (Initial MTF List)
[1,0,0,1] [0,2,0,0] [3,0,0,3] [1,1,3,1]
44
0 4 8 12
Input Data
Output Data: compressed
MTFLists
In
MTFOut
![Page 45: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/45.jpg)
So far…
45
Burrows-Wheeler Transform
ababacabac
Move-to-Front Transform
Huffman Coding
Output Data: compressed
ccbbbaaaaa
99,0,99,0,0,99,0,0,0,0
Grouped characters
Lots of 0s
Input
![Page 46: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/46.jpg)
Huffman Coding • Final stage that performs actual
compression
• Replace each character with a bit code
• Characters that occur more often get
shorter codes
• Huffman Coding is more effective with
repetitive data (after MTF)
46
Input Data
Output Data: compressed
![Page 47: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/47.jpg)
Huffman Coding
47
“HUFF” 00 01 1 1 32 bits 6 bits
Input Data
Output Data: compressed
![Page 48: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/48.jpg)
Huffman Coding • Huffman Stages:
1. Generate 256-bin histogram
2. Build Huffman tree
3. Replace characters with codes
48
Input Data
Output Data: compressed
![Page 49: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/49.jpg)
Huffman Histogram • Histogram
o Build a 256-entry histogram to count characters
o To do this on the GPU, divide the input data among
threads with each thread maintaining its own
histogram
o Merging histograms requires atomic operations
49
+ + =
Input Data
Output Data: compressed
![Page 50: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/50.jpg)
Huffman Tree • Huffman Tree Algorithm
o Remove two lowest counts, combine to form composite node
o Composite node “inserted” into histogram with combined
counts
o Each step is dependent on the previous step
o Parallelize finding the two lowest counts (maximum 256-way
parallelism)
50
Input Data
Output Data: compressed
![Page 51: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/51.jpg)
Huffman Tree Algorithm
51
Histogram H: 1 U: 1 F: 2
“HUFF”
Input Data
Output Data: compressed
![Page 52: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/52.jpg)
Huffman Tree Algorithm
52
Histogram H: 1 U: 1 F: 2
H U
1 1
Input Data
Output Data: compressed
![Page 53: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/53.jpg)
Huffman Tree Algorithm
53
Histogram F: 2
H U
1 1
Input Data
Output Data: compressed
![Page 54: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/54.jpg)
Huffman Tree Algorithm
54
Histogram H+U: 2
F: 2
H U
1 1
2
Input Data
Output Data: compressed
![Page 55: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/55.jpg)
Huffman Tree Algorithm
55
Histogram H+U: 2
F: 2
H U
F
1 1
2 2
Input Data
Output Data: compressed
![Page 56: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/56.jpg)
Huffman Tree Algorithm
56
H U
F
1 1
2 2
Histogram (empty)
Input Data
Output Data: compressed
![Page 57: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/57.jpg)
Huffman Tree Algorithm
57
H U
F
0 1
0 1
Root
Huffman Codes
H: 00 U: 01 F: 1
Left 0 Right 1
Input Data
Output Data: compressed
![Page 58: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/58.jpg)
Huffman Coding • Huffman coding assigns prefix codes
• No codes share the same prefix
58
Huffman Codes H: 00 U: 01 F: 1
Input Data
Output Data: compressed
![Page 59: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/59.jpg)
Huffman Coding
59
Huffman Codes
H: 00 U: 01 F: 1
“HUFF” 000111 32 bits 6 bits
Input Data
Output Data: compressed
![Page 60: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/60.jpg)
Huffman Coding • Replace characters with codes
o To do in parallel, divide input among threads
o Hard to do on the GPU because codes are variable-length
o Each thread must calculate the correct offset to begin writing
its code
60
1 0010 000011 0010 000011
t0 t1 t2 t3 t4
Input Data
Output Data: compressed
![Page 61: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/61.jpg)
Results • GPU vs. Bzip2
o Overall
• GPU 2.78x slower
o BWT
• GPU 2.9x slower (91% of runtime)
o MTF + Huffman
• GPU 1.34x slower
61
![Page 62: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/62.jpg)
Results
62
File (Size)
Compress Rate (MB/s)
BWT Sort Rate (Mstrings/s)
MTF+Huffman Rate (MB/s)
Compress Ratio 𝑪𝒐𝒎𝒑𝒓𝒆𝒔𝒔𝒆𝒅 𝑺𝒊𝒛𝒆
𝑼𝒏𝒄𝒐𝒎𝒑𝒓𝒆𝒔𝒔𝒆𝒅 𝑺𝒊𝒛𝒆
Text (97 MB)
GPU: 7.37 bzip2: 10.26
GPU: 9.84 bzip2: 14.2
GPU: 29.4 bzip2: 33.1
GPU: 0.33 bzip2: 0.29
Source Code (203 MB)
GPU: 4.25 bzip2: 9.8
GPU: 4.71 bzip2: 12.2
GPU: 44.3 bzip2: 48.8
GPU: 0.24 bzip2: 0.18
XML (151 MB)
GPU: 1.42 bzip2: 5.3
GPU: 1.49 bzip2: 5.4
GPU: 32.6 bzip2: 69.2
GPU: 0.19 bzip2: 0.10
Benchmark results
Worst performance during string sort and overall
Best amount of compression
![Page 63: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/63.jpg)
Results • BWT Performance
o GPU – 91% of runtime
o Bzip2 – 81% of runtime
o Tie Analysis:
• Amount of compression is
data dependent
• Better compression leads to
longer ties and poor
performance
63
![Page 64: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/64.jpg)
Results • MTF Performance
o Majority of runtime during
character replacement
o Moving element to front, shifting
all other elements
o Thread divergence
64
![Page 65: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/65.jpg)
Results • Huffman Performance
o Majority of runtime during
Huffman tree building
o 256-way parallelism is
inadequate for the GPU
65
![Page 66: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/66.jpg)
Decompression • Reverse Burrows-Wheeler transform
o Much faster than forward BWT
o Requires only 1 character-sort
o Similar to linked-list traversal
o Preliminary implementation is an extension of Hillis and Steele’s parallel linked-list traversal algorithm
66
![Page 67: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/67.jpg)
Decompression • Reverse Burrows-Wheeler transform
o Much faster than forward BWT
o Requires only 1 character-sort
o Similar to linked-list traversal
o Preliminary implementation is an extension of Hillis and Steele’s parallel linked-list traversal algorithm
• Reverse Move-to-Front transform o Parallel approach uses similar algorithm as forward-MTF
• Generate partial MTF lists
• Combine adjacent lists
67
![Page 68: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/68.jpg)
Decompression • Reverse Burrows-Wheeler transform
o Much faster than forward BWT
o Requires only 1 character-sort
o Similar to linked-list traversal
o Preliminary implementation is an extension of Hillis and Steele’s parallel linked-list traversal algorithm
• Reverse Move-to-Front transform o Parallel approach uses similar algorithm as forward-MTF
• Generate partial MTF lists
• Combine adjacent lists
• Huffman decoding o Decode each encoded block in parallel
68
![Page 69: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/69.jpg)
Future Work • Explore other GPU-based string sorts that can better
handle long strings with many ties
• Develop a Huffman tree building algorithm that has
more than 256-way parallelism
• Overlap GPU compression and PCI-Express data
transfer
69
![Page 70: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/70.jpg)
Conclusion • Implemented parallel lossless data compression on
the GPU
• Parallelized BWT, MTF, and Huffman Coding o Developed a novel algorithm for MTF
o BWT string sort contributes to the majority of runtime
• Implementation is slow but may be better suited for
future GPU architectures and other parallel
environments
70
![Page 71: Lossless Data Compression on GPUs - Nvidiadeveloper.download.nvidia.com/.../PresentationPDF/... · BWT with Merge Sort • Sorting n strings each n characters long o n = 1 million](https://reader033.vdocument.in/reader033/viewer/2022060208/5f0421a47e708231d40c7802/html5/thumbnails/71.jpg)
Acknowledgements • NSF grants OCI-1032859 and CCF-1017399
• HP Labs Innovation Research Program
• Discussion and feedback o Shubho Sengupta (Intel)
o Adam Herr (Intel)
o Anjul Patney (UC Davis)
o Stanley Tzeng (UC Davis)
71