massively-parallel lossless data decompressioneva/sitaridi-icpp-2016.pdf•big data workloads •...

32
Massively-Parallel Lossless Data Decompression Eva Sitaridi *, Rene Mueller, Tim Kaldewey, Guy Lohman , Ken Ross * Columbia University *, IBM Almaden, IBM WatsonICPP 2016, Philadelphia

Upload: others

Post on 14-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Massively-Parallel Lossless Data Decompression

Eva Sitaridi *, Rene Mueller✤, Tim Kaldewey☨, Guy Lohman ✤, Ken Ross *

Columbia University *, IBM Almaden✤, IBM Watson☨

ICPP 2016, Philadelphia

Page 2: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Why Decompression?• Big Data workloads • Data compressed once during ingestion• Data decompressed repeatedly for analytics & machine learning jobs

2Big data pipeline

Compression Decompression

Ingestion Analytics StorageData Sources…

Page 3: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Why GPUs?•Super-computer with accelerators (Source: Top500)June 2014: 62 à June 2016: 93

•Top super-computer with accelerators (Source: Green500)• June 2015: 32 top positions•November 2015: 40 top positions

25% increase

3

Better “fuel efficiency”: If the problem is parallelizable!

Page 4: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

ContributionChallenge: Data dependencies limit parallelism•Designed alternative methods tackling dependencies for LZ77

a) MRR (Multi-Round Resolution) • Resolves dependencies by using hardware primitives

b) DE (Dependency Elimination)• Eliminates dependencies proactively during compression• Achieves 2X faster decompression sacrificing <10% compression ratio

• Implemented Gompresso• Combines LZ-like compression with Huffman Coding

• Benchmark Gompresso against multi-core CPU libraries• 2X faster than multi-core gzip• 17% reduced energy consumption

4

Page 5: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

How to Parallelize Decompression?

Thread 1

Thread 2

Data block 1

Data block 2

Naïve approach performance 200 MB/s << 250 GB/s (K20x)

Input file

Split input file in independent blocksInter-block parallelism

>1000 threads available!

5

Data block n

Page 6: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Challenges• Intra-block parallelism harder• Group of threads processing the same data block• Data dependencies between threads decompressing a data block

• Balance compression ratio & decompression speed• How to trade-off compression ratio for higher parallelism degree?

• Reduce per character cost• Writing a single or few characters wastes memory bandwidth

6

Page 7: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

(0,4)M(5,4)COMM…

OutputdatablockUncompressedinput

W I K I P E D I A . C O

Windowbuffercontents

Background: LZ77 Decompression

InputdatablockCompressedinput

Tokens

7

LZ77 Compression: String matching intensiveØ Search for the longest match in the recent input

PositionLength

Page 8: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

(0,4)M(5,4)COMM…

Outputdatablock

W I K I P E D I A . C O

Windowbuffercontents

Background: LZ77 Decompression

Inputdatablock

Tokens

7

WIKI

Page 9: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

(0,4)M(5,4)COMM…

Outputdatablock

W I K I P E D I A . C O

Windowbuffercontents

Background: LZ77 Decompression

Inputdatablock

Tokens

7

WIKIM

Page 10: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

(0,4)M(5,4)COMM…

Outputdatablock

W I K I P E D I A . C O

Windowbuffercontents

Background: LZ77 Decompression

Inputdatablock

Tokens

7

WIKIMEDIA

Page 11: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

(0,4)M(5,4)COMM…

Outputdatablock

W I K I P E D I A . C O

Windowbuffercontents

Background: LZ77 Decompression

Inputdatablock

Tokens

7

WIKIMEDIACOMM

Page 12: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

(0,4)M(5,4)COMM…

Outputdatablock

W I K I P E D I A . C O

Windowbuffercontents

Background: LZ77 Decompression

Inputdatablock

Tokens

7

WIKIMEDIACOMM

How should threads handle possible dependencies?

Page 13: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

MRR Decompression

8

CCGA(0,2)CGG(4,3)AGTT(12,4)

Compressed input Uncompressed output

Tokens

Improve utilization: Group strings of literals with the following back-reference

Larger tokens

Page 14: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

MRR Decompression – Step 1

CCGA(0,2)CGG(4,3)AGTT(12,4)

Compressed input Uncompressed output

1) Read tokens (parallel)

Tokens

9

Page 15: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

MRR Decompression – Step 2

CCGA(0,2)CGG(4,3)AGTT(12,4) CCGA CGG AGTT

Compressed input Uncompressed output

2) Write literals (parallel prefix sum)

Tokens

10

Page 16: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

MRR Decompression – Step 3

Problem: Back-references processed in parallel might be dependent! àUse voting function __ballot to detect dependencies2 rounds of serialization in this example

3.1 ) Compute uncompressed output

11

CCGA(0,2)CGG(4,3)AGTT(12,4)

3.2) Write uncompressed output:

CCGACCCGGCCCAGTTCCGA

Compressedinput Uncompressedoutput

Tokens

Thread1 Thread2 Thread3

Page 17: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

DE: Dependency Elimination during Compression

12

CCGA(0,3)T(4,4)

…CCGACCGTCCGT… Uncompressed input

MRR Longest match

Page 18: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

DE: Dependency Elimination during Compression

12

DE CCGA(0,3)T(0,3)T

A) Compression Only search for matches w/o dependencies

B) DecompressionCopy back-references (fully parallel)

CCGA(0,3)T(4,4)

…CCGACCGTCCGT… Uncompressed input

MRR Longest match

Longest match not causing dependencies

Page 19: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

DE Overhead

13

• <19% in compression ratio• <13% in compression speed

DatasetsWikipedia: 1 GB XML dump of English WikipediaMatrix: 0.77GB Sparse Matrix in Market File format

Page 20: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Decompression Speed vs. Compression Ratio

14

Byte-level encoding

Bit-level encoding

CPU: Dual-socket E5-2620 – Bandwidth 102.4 GB/sGPU: Tesla K40 – Bandwidth 288 GB/s

Page 21: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Consumed Energy vs. Compression Ratio

Byte-level encoding

Bit-level encoding

15

Page 22: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Summary & Future Work

•Designed & integrated parallel methods in Gompresso• Parallel LZ77 decompression• MRR: Iteratively resolve back-reference dependencies• DE: Proactively eliminate dependencies during compression

• Parallel Huffman decoding•Gompresso vs. multi-core CPU libraries• 2X faster decompression• <10% penalty in compression ratio• 17% reduced energy

• Future work: Investigate alternative coding algorithms 16

Page 23: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Summary & Future Work

•Designed & integrated parallel methods in Gompresso• Parallel LZ77 decompression• MRR: Iteratively resolve back-reference dependencies• DE: Proactively eliminate dependencies during compression

• Parallel Huffman decoding•Gompresso vs. multi-core CPU libraries• 2X faster decompression• <10% penalty in compression ratio• 17% reduced energy

• Future work: Investigate alternative coding algorithms

Thank you!

16

Page 24: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Back-up Slides

Page 25: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Background: LZ77 Compression

ATCATCATTACTACAATGTTGATAATCAGCTATGAGTGATCGGGCCGGGCCT

Slidingwindowbuffer

Unencodedlookaheadcharacters

ATCATCATTACTACAATGTTGATAATCAGCTATGAGTGATCGGGCCGGGCCT

Slidingwindowbuffer Unencodedlookaheadcharacters

Inputcharacters

0 1 2 3 …ATTACTAGAATGTTACTAATCTGATCGGGCCGGGCCTG

Slidingwindowbuffer Unencodedlookaheadcharacters

Findlongestmatch Output

ATTACTAGAATGT(2,5)…

Backreferences(Position,Length)

LiteralsUnmatchedcharacters

17

Page 26: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Thread Utilization

18

T1T2T3 T4T5T6

SIMT Architecture:Groupexecution

Datablock1

ü ü ü ü ü ü 6activethreadsIter.1

i=threadidj=0…while(window[i]==lookahead[j]){

j++;….

}

Datablock2 Datablock3Datablock4Datablock5Datablock6 Different#iterationsforeachthread

Page 27: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Thread Utilization

19

T1T2T3 T4T5T6

SIMT Architecture:Groupexecution

Datablock1

i=threadidj=0…while(window[i]==lookahead[j]){

j++;….

}

Datablock2 Datablock3Datablock4Datablock5Datablock6

û ü ü û ü ü 4activethreadsIter.2

Different#iterationsforeachthread

Page 28: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Thread Utilization

20

T1T2T3 T4T5T6

SIMT Architecture:Groupexecution

Datablock1

i=threadidj=0…while(window[i]==lookahead[j]){

j++;….

}

Datablock2 Datablock3Datablock4Datablock5Datablock6

û û û û û ü 1activethreadIter.3

Different#iterationsforeachthread

(6+4+1)/(3*6)=11/18=61%threadutilization

Page 29: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Memory Access Pattern

21

Data Block 1 Data Block 2 Data Block 3

T1 T2 T3

Thread memory accesses in the same cache line

Optimal GPU memory access pattern

Data block 1 Data Block 2 Data Block 3

T1 T2 T3

Data block size>32K àMany cache lines loaded•Low memory bandwidth

Actual memory access pattern

Page 30: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Why Deflate?•Deflate: Combination of LZ77 & Huffman coding• LZ77 component of many Big Data compression solutions• Snappy• LZO• gzip• LZ4

22

Page 31: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Bytes Per Round

0.01

0.1

1

10

100

1000

1 2 3 4 5 6 7 8 9 10 11 12

#By

tes

Round

WikipediaMatrix

0.01

0.1

1

10

100

1000

1 2 3 4 5 6 7 8 9 10 11 12

#By

tes

Round

WikipediaMatrix

23

Page 32: Massively-Parallel Lossless Data Decompressioneva/sitaridi-icpp-2016.pdf•Big Data workloads • Data compressed once during ingestion • Data decompressed repeatedly for analytics

Level of Nestedness

0 50

100 150 200 250 300 350

0 5 10 15 20 25 30 35

Deco

mpr

essi

on ti

me

(ms)

Nesting depth24