TRANSCRIPT
Synthesizing Effective Data Compression Algorithms for GPUs
Annie Yang and Martin Burtscher*, Department of Computer Science
Highlights
MPC compression algorithm
  Brand-new lossless compression algorithm for single- and double-precision floating-point data
  Systematically derived to work well on GPUs
MPC features
  Compression ratio is similar to the best CPU algorithms
  Throughput is much higher
  Requires little internal state (no tables or dictionaries)
Introduction
High-performance computing systems
  Depend increasingly on accelerators
  Process large amounts of floating-point (FP) data
  Moving this data is often the performance bottleneck
Data compression
  Can increase transfer throughput
  Can reduce the storage requirement
  But only if it is effective, fast (real-time), and lossless
[Figure: layout of an IEEE floating-point value: sign, exponent, mantissa]
Problem Statement
Existing FP compression algorithms for GPUs
  Fast but compress poorly
Existing FP compression algorithms for CPUs
  Compress much better but are slow
  Parallel codes run serial algorithms on multiple chunks
  Too much state per thread for a GPU implementation
  The best serial algorithms may not be scalably parallelizable
Do effective FP compression algorithms for GPUs exist?
And if so, how can we create such an algorithm?
Our Approach
Need a brand-new massively parallel algorithm
Study existing FP compression algorithms
  Break them down into their constituent parts
  Keep only the GPU-friendly parts
  Generalize them as much as possible
This resulted in a set of algorithmic components
  CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
  Components operate on the integer representation of the data
Our Approach (cont.)
Automatically synthesize compression algorithms by chaining components
  Use exhaustive search to find the best four-component chains
Synthesize the decompressor
  Employ inverse components
  Perform the opposite transformation on the data
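The chain-and-inverse idea can be sketched as follows. This is an illustrative Python sketch, not the actual CUDA implementation; the function names and the single toy component are stand-ins for the real components:

```python
# Each component is a pair (forward, inverse) of functions on a list of
# integer values. A compressor is a chain of forward transforms; the
# matching decompressor applies the inverses in reverse order.

def lnv1s_fwd(seq):
    # subtract the previous value from each value (first value unchanged)
    return [v - (seq[i - 1] if i > 0 else 0) for i, v in enumerate(seq)]

def lnv1s_inv(seq):
    # rebuild the original values by running-sum reconstruction
    out = []
    for v in seq:
        out.append(v + (out[-1] if out else 0))
    return out

def compress(seq, chain):
    for fwd, _ in chain:
        seq = fwd(seq)
    return seq

def decompress(seq, chain):
    for _, inv in reversed(chain):
        seq = inv(seq)
    return seq
```

For example, `decompress(compress(data, chain), chain)` returns `data` for any chain built from such forward/inverse pairs.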
Mutator Components
Mutators computationally transform each value
  Do not use information about any other value
NUL outputs the input block unchanged (identity)
INV flips all the bits
│, called cut, is a singleton pseudo-component that converts a block of words into a block of bytes
  Merely a type cast, i.e., no computation or data copying
  Byte granularity can be better for compression
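The mutator semantics can be sketched in Python (illustrative only; in the CUDA version, cut is a pure type cast with no data movement, and the little-endian byte order below is an assumption for the sketch):

```python
def nul(words):
    # NUL: identity, output equals input
    return list(words)

def inv(words):
    # INV: flip every bit of each 32-bit word
    return [w ^ 0xFFFFFFFF for w in words]

def cut(words):
    # cut: reinterpret each 32-bit word as 4 bytes (little-endian here);
    # the real implementation is just a type cast, no copying
    out = []
    for w in words:
        out.extend((w >> (8 * i)) & 0xFF for i in range(4))
    return out
```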
Shuffler Components
Shufflers reorder whole values or the bits of values
  Do not perform any computation
  Each thread block operates on a chunk of values
BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
DIMn groups values by dimension n
  Tested n = 2, 3, 4, 5, 8, 16, and 32
  For example, DIM2 has the following effect: the sequence A, B, C, D, E, F becomes A, C, E, B, D, F
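The DIMn reordering above can be sketched in a few lines of Python (an illustrative sketch of the component semantics, not the CUDA kernel; it assumes the chunk length is a multiple of n):

```python
def dim_shuffle(seq, n):
    # DIMn: group values by their position modulo n, i.e. first all
    # values of dimension 0, then dimension 1, etc.
    return [seq[i] for d in range(n) for i in range(d, len(seq), n)]

def dim_unshuffle(seq, n):
    # inverse: scatter the grouped values back to interleaved positions
    out = [None] * len(seq)
    k = len(seq) // n
    for d in range(n):
        for j in range(k):
            out[d + j * n] = seq[d * k + j]
    return out
```

With n = 2, the sequence A, B, C, D, E, F becomes A, C, E, B, D, F, matching the example above.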
Predictor Components
Predictors guess values based on previous values and compute residuals (true value minus guessed value)
  Residuals tend to cluster around zero, making them easier to compress than the original sequence
  Each thread block operates on a chunk of values
LNVns subtracts the nth prior value from the current value
  Tested n = 1, 2, 3, 5, 6, and 8
LNVnx XORs the current value with the nth prior value
  Tested n = 1, 2, 3, 5, 6, and 8
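The two predictor families can be sketched as follows (illustrative Python, treating missing prior values as 0; not the actual CUDA code):

```python
def lnv_sub(seq, n):
    # LNVns: residual = current value minus the n-th prior value
    return [v - (seq[i - n] if i >= n else 0) for i, v in enumerate(seq)]

def lnv_xor(seq, n):
    # LNVnx: residual = current value XOR the n-th prior value
    return [v ^ (seq[i - n] if i >= n else 0) for i, v in enumerate(seq)]

def lnv_sub_inv(res, n):
    # inverse of LNVns: add back the reconstructed n-th prior value
    out = []
    for i, r in enumerate(res):
        out.append(r + (out[i - n] if i >= n else 0))
    return out
```

On a slowly growing sequence like 10, 11, 12, 13, LNV1s produces 10, 1, 1, 1: many small residuals that cluster near zero, as described above.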
Reducer Components
Reducers eliminate redundancies in the value sequence
  All other components cannot change the length of the sequence, i.e., only reducers can compress it
  Each thread block operates on a chunk of values
ZE emits a bitmap of the zeros followed by the non-zero values
  Effective if the input sequence contains many zeros
RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
  Effective if the input contains many repeating values
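The ZE reducer can be sketched like this (illustrative Python; the 32-word chunk size and the exact bitmap layout are assumptions for the sketch, not the actual on-device format):

```python
def ze_encode(words, chunk=32):
    # ZE: for each chunk, emit a bitmap marking the non-zero words,
    # followed by the non-zero words themselves
    bitmaps, values = [], []
    for base in range(0, len(words), chunk):
        bm = 0
        for i, w in enumerate(words[base:base + chunk]):
            if w != 0:
                bm |= 1 << i
                values.append(w)
        bitmaps.append(bm)
    return bitmaps, values

def ze_decode(bitmaps, values, n, chunk=32):
    # inverse: re-expand the zeros according to the bitmap
    out, vi = [], 0
    for bm in bitmaps:
        for i in range(min(chunk, n - len(out))):
            if bm & (1 << i):
                out.append(values[vi])
                vi += 1
            else:
                out.append(0)
    return out
```

A mostly-zero input such as 0, 5, 0, 0, 7, 0 shrinks to one bitmap word plus the two non-zero values, which is why ZE works well after the predictor stages.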
Algorithm Synthesis
Determine the best four-stage algorithms with a cut
  Exhaustive search of all 138,240 possible combinations
  13 double-precision data sets (19 – 277 MB): observational data, simulation results, MPI messages
  Single-precision data derived from the double-precision data
Create a general GPU-friendly compression algorithm
  Analyze the best algorithm for each data set and precision
  Find commonalities and generalize them into one algorithm
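The exhaustive search can be sketched as follows. This is a hypothetical illustration of the search loop, not the real synthesis tool: the component list is abbreviated and `compressed_size` is a placeholder supplied by the caller:

```python
from itertools import product

def best_chain(data, components, compressed_size, stages=4):
    """Try every ordered chain of `stages` component names and keep the
    chain that yields the best compression ratio on `data`.
    `compressed_size(data, chain)` must return the output size."""
    best, best_ratio = None, 0.0
    for chain in product(components, repeat=stages):
        size = compressed_size(data, chain)
        ratio = len(data) / size if size else 0.0
        if ratio > best_ratio:
            best, best_ratio = chain, ratio
    return best, best_ratio
```

With the full component set, `product(components, repeat=4)` (plus the cut placement) is what yields the 138,240 candidate algorithms enumerated in the search.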
Best of 138,240 Algorithms
data set       double precision          single precision
msg_bt         LNV1s BIT LNV1s ZE |      DIM5 ZE LNV6x | ZE
msg_lu         LNV5s | DIM8 BIT RLE      LNV5s LNV5s LNV5x | ZE
msg_sp         DIM3 LNV5x BIT ZE |       DIM3 LNV5x BIT ZE |
msg_sppm       DIM5 LNV6x ZE | ZE        RLE DIM5 LNV6s ZE |
msg_sweep3d    LNV1s DIM32 | DIM8 RLE    LNV1s DIM32 | DIM4 RLE
num_brain      LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
num_comet      LNV1s BIT LNV1s ZE |      LNV1s | DIM4 BIT RLE
num_control    LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
num_plasma     LNV2s LNV2s LNV2x | ZE    LNV2s LNV2s LNV2x | ZE
obs_error      LNV1x ZE LNV1s ZE |       LNV6s BIT LNV1s ZE |
obs_info       LNV2s | DIM8 BIT RLE      LNV8s DIM2 | DIM4 RLE
obs_spitzer    ZE BIT LNV1s ZE |         ZE BIT LNV1s ZE |
obs_temp       LNV8s BIT LNV1s ZE |      BIT LNV1x DIM32 | RLE
overall best   LNV6s BIT LNV1s ZE |      LNV6s BIT LNV1s ZE |
Analysis of Reducers
Double-precision results only (single-precision results are similar)
ZE or RLE (the encoder) is required at the end, not counting the cut
ZE dominates
  Many zeros, but not in a row
The first three stages contain almost no reducers
  The transformations are key to making the reducer effective
  Chaining whole compression algorithms may be futile
Analysis of Mutators
NUL and INV are never used
  No need to invert bits, and chains with effectively fewer stages perform worse
The cut is often at the end (i.e., not used)
  Word granularity suffices and is easier/faster to implement
DIM8 appears right after the cut (DIM4 with single precision)
  Used to separate the byte positions of each word
  Synthesis yielded this unforeseen use of the DIM component
Analysis of Shufflers
Shufflers are important: almost always included
BIT is used very frequently
  FP bit positions correlate more strongly than the values themselves
DIM has two uses
  Separating bytes (see before), right after the cut
  Separating the values of multi-dimensional data sets (the intended use), in early stages
Analysis of Predictors
Predictors are very important (they are the data model)
  Used in every case; often two predictors are used
LNVns dominates LNVnx
  The arithmetic (subtraction) difference is superior to the bit-wise (XOR) difference in the residual
Dimension n separates the values of multi-dimensional data sets (in the first stage)
Analysis of Overall Best Algorithm
Same algorithm for SP and DP; few components mismatch
  But the LNV6s dimension is off
Most frequent pattern: LNV*s BIT LNV1s ZE, where the star denotes the dimensionality
Why 6 in the starred position?
  6 is not used in the individual algorithms
  6 is the least common multiple of 1, 2, and 3
  Did not test n > 8
MPC: Generalization of Overall Best
MPC algorithm: Massively Parallel Compression
Uses the generalized pattern “LNVds BIT LNV1s ZE”, where d is the data set dimensionality
Matches the best algorithm on several DP and SP data sets
Performs even better when the true dimensionality is used
Evaluation Methodology
System
  Dual 10-core Xeon E5-2680 v2 CPUs
  K40 GPU with 15 SMs (2880 cores)
Data
  13 DP and 13 SP real-world data sets (same as before)
Compression algorithms
  CPU: bzip2, gzip, lzop, and pFPC
  GPU: GFC and MPC (our algorithm)
Compression Ratio (Double Precision)
MPC delivers record compression on 5 data sets
  In spite of the “GPU-friendly components” constraint
MPC is outperformed by bzip2 and pFPC on average
  Due to msg_sppm and num_plasma
MPC is superior to GFC (the only other GPU compressor)
              HarMean msg_bt msg_lu msg_sp msg_sppm msg_sweep3d num_brain num_comet num_control num_plasma obs_error obs_info obs_spitzer obs_temp
bzip2 --best  1.321   1.088  1.018  1.055  6.933    1.294       1.043     1.173     1.029       5.789      1.339     1.217    1.752       1.024
gzip --best   1.239   1.130  1.055  1.107  7.431    1.092       1.064     1.162     1.058       1.608      1.448     1.154    1.231       1.036
lzop -9       1.158   1.052  1.000  1.003  6.780    1.017       1.000     1.082     1.017       1.503      1.273     1.096    1.142       1.000
pFPC -1M      1.365   1.250  1.137  1.238  4.710    1.888       1.148     1.151     1.038       7.042      1.542     1.215    1.022       0.997
GFC           1.179   1.122  1.148  1.202  3.506    1.217       1.090     1.110     1.013       1.125      1.233     1.141    1.022       1.037
MPC           1.248   1.207  1.212  1.208  2.999    1.287       1.182     1.267     1.106       1.164      1.180     1.214    1.184       1.101
Compression Ratio (Single Precision)
MPC delivers record compression on 8 data sets
  In spite of the “GPU-friendly components” constraint
MPC is outperformed by bzip2 on average
  Due to num_plasma
MPC is “superior” to GFC and pFPC
  They do not support single-precision data; MPC does
              HarMean msg_bt msg_lu msg_sp msg_sppm msg_sweep3d num_brain num_comet num_control num_plasma obs_error obs_info obs_spitzer obs_temp
bzip2 --best  1.398   1.129  1.041  1.141  8.741    2.355       1.113     1.117     1.043       8.652      1.338     1.327    1.394       1.049
gzip --best   1.267   1.179  1.086  1.200  9.605    1.151       1.128     1.151     1.080       1.383      1.466     1.200    1.188       1.079
lzop -9       1.153   1.075  1.000  1.083  8.634    1.033       1.003     1.086     1.016       1.223      1.246     1.129    1.077       1.000
MPC           1.350   1.336  1.440  1.385  3.813    1.534       1.344     1.178     1.122       1.345      1.298     1.436    1.047       1.114
Throughput (Gigabytes per Second)
MPC outperforms all CPU compressors
  Including pFPC running on two 10-core CPUs, by 7.5x
MPC is slower than GFC but mostly faster than PCIe
  MPC uses a slow O(n log n) prefix-scan implementation
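The prefix scan is what lets the reducers place their surviving values: an exclusive prefix sum over the non-zero flags gives each kept word its output slot. A serial Python sketch of that compaction step (the GPU version parallelizes the scan across threads):

```python
def exclusive_scan(flags):
    # exclusive prefix sum: out[i] = number of set flags before position i
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def compact(words):
    # keep the non-zero words, each written to the slot given by the scan
    flags = [1 if w != 0 else 0 for w in words]
    offsets, count = exclusive_scan(flags)
    out = [0] * count
    for i, w in enumerate(words):
        if flags[i]:
            out[offsets[i]] = w
    return out
```

Because every output position is known in advance, all the scatter writes are independent, which is what makes this step massively parallel on a GPU.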
Throughput in GB/s:
              double precision   single precision
              compr.   decom.    compr.   decom.
bzip2 --best  0.01     0.02      0.01     0.02
gzip --best   0.02     0.15      0.03     0.15
lzop -9       0.01     1.87      0.01     1.44
pFPC -1M      1.41     1.04      n/a      n/a
GFC           32.28    31.47     n/a      n/a
MPC           10.78    7.91      5.81     4.23

Throughput relative to MPC:
              double precision   single precision
              compr.   decom.    compr.   decom.
bzip2 --best  0.1%     0.3%      0.1%     0.6%
gzip --best   0.2%     1.9%      0.4%     3.5%
lzop -9       0.1%     23.6%     0.2%     33.9%
pFPC -1M      13.0%    13.2%     n/a      n/a
GFC           299.4%   398.0%    n/a      n/a
MPC           100.0%   100.0%    100.0%   100.0%
Summary
Goal of this research
  Create an effective algorithm for FP data compression that is suitable for massively parallel GPUs
Approach
  Extracted 24 GPU-friendly components and evaluated 138,240 combinations to find the best 4-stage algorithms
  Generalized the findings to derive the MPC algorithm
Result
  A brand-new compression algorithm for SP and DP data
  Compresses about as well as CPU algorithms but is much faster
Future Work and Acknowledgments
Future work
  Faster implementation, more components, longer chains, and other inputs, data types, and constraints
Acknowledgments
  National Science Foundation, NVIDIA Corporation, Texas Advanced Computing Center
Contact information
  [email protected]
Number of Stages
Three stages reach about 95% of the compression ratio
Single- vs Double-Precision Algorithms
data set       double precision          single precision
msg_bt         LNV1s BIT LNV1s ZE |      DIM5 ZE LNV6x | ZE
msg_lu         LNV5s | DIM8 BIT RLE      LNV5s LNV5s LNV5x | ZE
msg_sp         DIM3 LNV5x BIT ZE |       DIM3 LNV5x BIT ZE |
msg_sppm       DIM5 LNV6x ZE | ZE        RLE DIM5 LNV6s ZE |
msg_sweep3d    LNV1s DIM32 | DIM8 RLE    LNV1s DIM32 | DIM4 RLE
num_brain      LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
num_comet      LNV1s BIT LNV1s ZE |      LNV1s | DIM4 BIT RLE
num_control    LNV1s BIT LNV1s ZE |      LNV1s BIT LNV1s ZE |
num_plasma     LNV2s LNV2s LNV2x | ZE    LNV2s LNV2s LNV2x | ZE
obs_error      LNV1x ZE LNV1s ZE |       LNV6s BIT LNV1s ZE |
obs_info       LNV2s | DIM8 BIT RLE      LNV8s DIM2 | DIM4 RLE
obs_spitzer    ZE BIT LNV1s ZE |         ZE BIT LNV1s ZE |
obs_temp       LNV8s BIT LNV1s ZE |      BIT LNV1x DIM32 | RLE
overall best   LNV6s BIT LNV1s ZE |      LNV6s BIT LNV1s ZE |
MPC Operation
What does “LNVds BIT LNV1s ZE” do?
LNVds predicts each value using a similar value to obtain a residual sequence with many small values
  Similar value = the most recent prior value from the same dimension
BIT groups the residuals by bit position
  All LSBs, then all second LSBs, etc.
LNV1s turns identical consecutive words into zeros
ZE eliminates these zero words
GPU friendly
  All four components are massively parallel
  Can be implemented with prefix scans or simpler operations