Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted
![Page 1: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/1.jpg)
Blosc: Sending data from memory to CPU (and back)
faster than memcpy()
Francesc Alted, Software Architect
PyData London 2014, February 22, 2014
![Page 2: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/2.jpg)
About Me
• I am the creator of tools like PyTables, Blosc, BLZ and maintainer of Numexpr.
• I learnt the hard way that ‘premature optimization is the root of all evil’.
• Now I only humbly try to optimize if I really need to and I just hope that Blosc is not an example of ‘premature optimization’.
![Page 3: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/3.jpg)
About Continuum Analytics
• Develop new ways to store, compute, and visualize data.
• Provide open technologies for data integration on a massive scale.
• Provide software tools, training, and integration/consulting services to corporate, government, and educational clients worldwide.
![Page 4: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/4.jpg)
Overview
• Compressing faster than memcpy(). Really?
• How can that be? (The ‘Starving CPU’ problem)
• How Blosc works.
• Does being faster than memcpy() mean that my programs will actually run faster?
![Page 5: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/5.jpg)
Compressing Faster than memcpy()
![Page 6: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/6.jpg)
Interactive Session Starts
• If you want to experiment with Blosc on your own machine: http://www.blosc.org/materials/PyData-London-2014.tar.gz
• blosc is required (and blz too, for later on); both are included in the conda repository.
![Page 7: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/7.jpg)
Open Questions
We have seen that, sometimes, Blosc can actually be faster than memcpy(). Now:
1. If compression takes way more CPU than memcpy(), why can Blosc beat it?
2. Does this mean that Blosc can actually accelerate computations in real scenarios?
![Page 8: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/8.jpg)
The Starving CPU Problem
“Across the industry, today’s chips are largely able to execute code faster than we can feed them with instructions and data.”
– Richard Sites, after his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996
![Page 9: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/9.jpg)
Memory Access Time vs CPU Cycle Time
![Page 10: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/10.jpg)
Book in 2009
![Page 11: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/11.jpg)
The Status of CPU Starvation in 2014
• Memory latency (~10 ns) is much slower (between 100x and 250x) than processors.
• Memory bandwidth (~15 GB/s) is improving at a better rate than memory latency, but it is also slower than processors (between 30x and 100x).
![Page 12: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/12.jpg)
Blosc Goals and Implementation
![Page 13: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/13.jpg)
Blosc: (de)compressing faster than memcpy()
Transmission + decompression faster than direct transfer?
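The question above can be explored with a deliberately simplified back-of-the-envelope model. It assumes that the compressed bytes travel over the memory bus first and are then decompressed inside the CPU caches, with no overlap between the two steps. The 15 GB/s bandwidth is the figure quoted in this talk; the compression ratio of 4 and the 30 GB/s decompression speed are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def transfer_times(size_bytes, mem_bandwidth, ratio, decompress_speed):
    """Return (direct, compressed) transfer times in seconds.

    direct:     move the uncompressed buffer over the memory bus.
    compressed: move size/ratio bytes, then decompress to size bytes
                (decompress_speed is in output bytes per second).
    """
    direct = size_bytes / mem_bandwidth
    compressed = (size_bytes / ratio) / mem_bandwidth + size_bytes / decompress_speed
    return direct, compressed


def break_even_speed(mem_bandwidth, ratio):
    """Decompression speed above which the compressed path wins.

    From size/(ratio*B) + size/D < size/B  it follows that  D > B * ratio / (ratio - 1).
    """
    return mem_bandwidth * ratio / (ratio - 1)


GB = 1e9
direct, compressed = transfer_times(
    size_bytes=1 * GB,
    mem_bandwidth=15 * GB,      # figure from the talk
    ratio=4,                    # assumed compression ratio
    decompress_speed=30 * GB,   # assumed in-cache decompression speed
)
print(f"direct: {direct * 1e3:.1f} ms, compressed: {compressed * 1e3:.1f} ms")
print(f"break-even: {break_even_speed(15 * GB, 4) / GB:.0f} GB/s")
```

Under this model, the higher the compression ratio, the closer the break-even decompression speed gets to the plain memory bandwidth, which is why a compressor tuned for speed rather than ratio can still come out ahead.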
![Page 14: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/14.jpg)
Taking Advantage of Memory-CPU Gap
• Blosc is meant to discover redundancy in data as fast as possible.
• It comes with a series of fast compressors: BloscLZ, LZ4, Snappy, LZ4HC and Zlib
• Blosc is meant for speed, not for high compression ratios.
![Page 15: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/15.jpg)
Blosc Is All About Efficiency
• Uses data blocks that fit in L1 or L2 caches (better speed, lower compression ratios).
• Uses multithreading by default.
• The shuffle filter uses SSE2 instructions on modern Intel and AMD processors.
![Page 16: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/16.jpg)
Blocking: Divide and Conquer
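The blocking idea can be sketched with the standard library alone. This is not Blosc's implementation: it uses zlib as a stand-in codec, Python threads instead of Blosc's own thread pool, and a 32 KiB block size that is only assumed to fit in a typical L1 data cache. The names `compress_blocks` and `decompress_blocks` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 32 * 1024  # assumption: small enough to fit in an L1 data cache


def compress_blocks(data, block_size=BLOCK_SIZE, workers=4):
    """Split data into fixed-size blocks and compress each one independently."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # zlib releases the GIL while compressing, so threads can work in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(compressed_blocks, workers=4):
    """Decompress every block and join the results in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, compressed_blocks))


data = bytes(range(256)) * 1024          # 256 KiB of sample data
blocks = compress_blocks(data)
restored = decompress_blocks(blocks)
print(len(blocks), "blocks,", sum(map(len, blocks)), "compressed bytes")
```

Because each block is compressed on its own, redundancy that spans block boundaries is not exploited, which is the price paid in compression ratio for cache-friendly and parallel operation.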
![Page 17: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/17.jpg)
Shuffling: Improving the Compression Ratio
The shuffling algorithm does not actually compress the data; it rather changes the byte order in the data stream:
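A minimal pure-Python sketch of such a byte shuffle follows. It is not Blosc's SSE2 implementation, and it uses zlib from the standard library as a stand-in codec; the function names and the sample data (10,000 consecutive 32-bit integers) are assumptions chosen for illustration. The buffer length must be a multiple of the element size.

```python
import struct
import zlib


def shuffle(data, typesize):
    """Group the i-th byte of every element together (a byte transposition)."""
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data, typesize):
    """Invert shuffle(): scatter each byte plane back to its original stride."""
    n = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * n:(i + 1) * n]
    return bytes(out)


# Consecutive little-endian 32-bit integers: the high bytes barely change.
values = range(10_000)
raw = struct.pack("<%di" % len(values), *values)
shuffled = shuffle(raw, 4)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print("without shuffle:", plain_size, "bytes; with shuffle:", shuffled_size, "bytes")
```

After shuffling, the slowly changing high-order bytes of neighbouring values sit next to each other, so the codec sees long runs of identical or periodic bytes instead of a pattern interrupted every few bytes.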
![Page 18: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/18.jpg)
Shuffling Caveat
• Shuffling usually produces better compression ratios with numerical data, except when it does not.
• If you care about the compression ratio, it is worth deactivating it and checking (it is active by default).
• We will see an example on real data later on.
![Page 19: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/19.jpg)
Blosc Performance: Laptop back in 2005
![Page 20: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/20.jpg)
Blosc Performance: Desktop Computer in 2012
![Page 21: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/21.jpg)
First Answer for Open Questions
• Blosc data blocking optimizes the cache behavior during memory access.
• Additionally, it uses multithreading and SIMD instructions.
• Add these to the Starving CPU problem and you have a good hint as to why Blosc can beat memcpy().
![Page 22: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/22.jpg)
How Does Compression Work With Real Data?
![Page 23: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/23.jpg)
The Need for Compression
• Compression allows storing more data using the same storage capacity.
• Sure, it uses more CPU time to compress/decompress data.
• But does that actually mean using more wall-clock time?
![Page 24: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/24.jpg)
The Need for a Compressed Container
• A compressed container is meant to store data in compressed state and transparently deliver it uncompressed.
• That means that the user only perceives that her dataset takes less memory.
• Only less space? What about data access speed?
![Page 25: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/25.jpg)
Source: Howison, M. High-throughput compression of FASTQ data with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.
Example of How Blosc Accelerates Genomics I/O: SeqDB (backed by Blosc)
![Page 26: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/26.jpg)
Bloscpack (I)
• Command line interface and serialization format for Blosc:
$ blpk c data.dat # compress
$ blpk d data.dat.blp # decompress
![Page 27: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/27.jpg)
Bloscpack (II)
• Very convenient for easily serializing your in-memory NumPy datasets:
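>>> import numpy as np
>>> import bloscpack as bp
>>> a = np.linspace(0, 1, int(3e8))  # the number of samples must be an integer
>>> print(a.size, a.dtype)
300000000 float64
>>> bp.pack_ndarray_file(a, 'a.blp')  # compress and write to disk
>>> b = bp.unpack_ndarray_file('a.blp')  # read back and decompress
>>> (a == b).all()
True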
![Page 28: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/28.jpg)
Yet Another Example: BLZ
• BLZ is both a format and a library that has been designed as an efficient data container for Big Data.
• Blosc and Bloscpack are at the heart of it in order to achieve high-speed compression/decompression.
• BLZ is one of the backends supported by our nascent Blaze library.
![Page 29: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/29.jpg)
Appending Data in Large NumPy Objects
[Diagram: the array to be enlarged and the new data to append are both copied into a new memory allocation, which becomes the final array object.]
• Normally, a realloc() call will not succeed in enlarging the array in place.
• Both memory areas have to exist simultaneously.
![Page 30: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/30.jpg)
Contiguous vs Chunked
[Diagram: a NumPy container occupies contiguous memory; a BLZ container is made of chunk 1, chunk 2, ..., chunk N in discontiguous memory.]
![Page 31: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/31.jpg)
Appending Data in BLZ
[Diagram: the existing chunks (chunk 1, chunk 2) of the array to be enlarged are left untouched; the new data to append is compressed into a new chunk of the final array object.]
Only a small amount of data has to be compressed.
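The append scheme can be sketched as a small append-only container. This is a hypothetical illustration, not BLZ's API: the class name `ChunkedArray`, the chunk length, the use of zlib as the codec, and the uncompressed tail buffer for leftover items are all assumptions made for the example.

```python
import zlib
from array import array


class ChunkedArray:
    """Hypothetical append-only container of float64 values in compressed chunks."""

    def __init__(self, chunk_len=4096):
        self.chunk_len = chunk_len
        self.chunks = []          # compressed chunks, never rewritten once stored
        self.tail = array("d")    # uncompressed leftovers (fewer than chunk_len items)

    def append(self, values):
        """Append values; only newly completed chunks get compressed."""
        self.tail.extend(values)
        while len(self.tail) >= self.chunk_len:
            full = self.tail[:self.chunk_len]
            self.chunks.append(zlib.compress(full.tobytes()))
            self.tail = self.tail[self.chunk_len:]

    def __len__(self):
        return len(self.chunks) * self.chunk_len + len(self.tail)

    def to_array(self):
        """Decompress every chunk and return the whole content as one array."""
        out = array("d")
        for chunk in self.chunks:
            out.frombytes(zlib.decompress(chunk))
        out.extend(self.tail)
        return out


ca = ChunkedArray(chunk_len=1000)
ca.append(range(2500))
first_chunk = ca.chunks[0]
ca.append(range(2500, 4000))      # existing chunks are not copied or recompressed
print(len(ca), "items in", len(ca.chunks), "chunks")
```

Unlike enlarging a contiguous array, an append here never needs a second memory area the size of the whole dataset; the cost is proportional to the new data only.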
![Page 32: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/32.jpg)
The btable object in BLZ
[Diagram: a new row to append, distributed over the chunks of each column.]
• Columns are contiguous in memory.
• Chunks follow column order.
• Very efficient for querying (especially with a large number of columns).
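The column-wise layout can be illustrated with a toy table that keeps one typed array per column. This is only a sketch of the idea, not the btable API: the class `ColumnTable` and its methods are hypothetical, and the chunking and compression that BLZ applies to each column are left out to keep the example short.

```python
from array import array


class ColumnTable:
    """Hypothetical column-wise table: one contiguous typed array per column."""

    def __init__(self, **typecodes):
        # Example: ColumnTable(id="q", price="d") creates an int64 and a float64 column.
        self.columns = {name: array(code) for name, code in typecodes.items()}

    def append(self, **row):
        """Append one row by pushing each field onto its own column."""
        if row.keys() != self.columns.keys():
            raise ValueError("row must provide exactly one value per column")
        for name, value in row.items():
            self.columns[name].append(value)

    def where(self, column, predicate):
        """Return the row indices whose value in `column` satisfies `predicate`."""
        return [i for i, v in enumerate(self.columns[column]) if predicate(v)]

    def row(self, index):
        """Gather one row back from all columns."""
        return {name: col[index] for name, col in self.columns.items()}


t = ColumnTable(id="q", price="d", qty="l")
for i in range(5):
    t.append(id=i, price=i * 1.5, qty=10 - i)

hits = t.where("price", lambda p: p > 3.0)   # scans the price column only
print(hits, t.row(hits[0]))
```

A query on one column reads just that column's memory, so its cost does not grow with the number of other columns in the table.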
![Page 33: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/33.jpg)
Second Interactive Session: BLZ and Blosc
on a Real Dataset
![Page 34: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/34.jpg)
Second Hint for Open Questions
Using Blosc in BLZ means not only less storage usage (~15x-40x reduction for the real-life data shown), but also data access times that stay within the same order of magnitude (~2x-10x slowdown).
(Still need to address implementation details for getting better performance)
![Page 35: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/35.jpg)
Summary
• Blosc, being able to transfer data faster than memcpy(), has enormous implications for data management.
• It is well suited not only for saving memory, but also for delivering performance close to that of typical uncompressed data containers.
• It works well not only for synthetic data, but also for real-life datasets.
![Page 36: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/36.jpg)
References
• Blosc: http://www.blosc.org
• Bloscpack: https://github.com/Blosc/bloscpack
• BLZ: http://blz.pydata.org
![Page 37: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/37.jpg)
“Across the industry, today’s chips are largely able to execute code faster than we can feed them with instructions and data. There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit. The real design action is in memory subsystems: caches, buses, bandwidth, and latency.”
“Over the coming decade, memory subsystem design will be the only important design issue for microprocessors.”
– Richard Sites, after his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996
“Over this decade (2010-2020), memory subsystem optimization will be (almost) the only important design issue for improving performance.”
– Me :)
![Page 38: Blosc: Sending Data from Memory to CPU (and back) Faster than Memcpy by Francesc Alted](https://reader038.vdocument.in/reader038/viewer/2022110310/559af5611a28aba8708b48ad/html5/thumbnails/38.jpg)
Thank you!