GZIP Hardware Acceleration
• CAPI GZIP Acceleration Adapter (EJ1A/EJ1B)
  – Characteristics
  – Usage Example
  – Experiences
• Other CAPI Solutions
• Future Directions
• Summary
CAPI GZIP Acceleration Adapter
Evolution
• GenWQE/PCIe GZIP Accelerator (#EJ12, #EJ13)
• CAPI GZIP Accelerator (#EJ1A, #EJ1B)
• On-CPU accelerator for POWER9

CAPI GZIP Accelerator #EJ1A & EJ1B
• CAPI GZIP (FC #EJ1A & EJ1B)
• POWER8 and OpenPOWER enabled: Tuleta-L, Habanero-LC, Firestone-LC (bare-metal only)
• Ubuntu 14.04.5 and later, RHEL 7.2 LE
• Hardware platform: Altera Stratix V FPGA
GZIP Compression Solution Architecture & 3 Questions
Software stack (Linux on Power):
[Diagram: native applications and Java applications call libz; the accelerated libz dispatches to libzHW or falls back to software libz. For the CAPI GZIP accelerator card (CAPI FPGA bitstream), access goes through libcxl and the CAPI kernel extension; for the GenWQE/PCIe accelerator card (PCIe FPGA bitstream), access goes through the GenWQE Linux driver and its work-queue.]
[Diagram: AFU in directed mode with work-queue support and a hardware scheduler for multiple AFU contexts; several multi-threaded processes each use an AFU slave context with its own queue, while one process holds the AFU master context.]
• Predecessor, the PCIe GZIP Accelerator:
  • Uses the genwqe-card.ko Linux driver to access the work-queue
  • Bandwidth determined by the compression/decompression engines
• CAPI GZIP Accelerator:
  • CAPI AFU directed mode: one work-queue per AFU context
  • Re-uses the generic libcxl and cxl.ko to access the CAPI device
  • A user-space library implements work-queue access
• 3 questions:
  1. Bandwidth
  2. Efficiency for small requests
  3. CPU load
[Diagram labels: the work-queue spans the user-space, kernel, and FPGA hardware layers (PSL); components are attributed to a solution provider (IBM, the card vendor, or the application developer) for both the GenWQE/PCIe and CAPI variants.]
Simple steps to try out the CAPI GZIP card
1. Install required OS and genwqe-zlib, genwqe-tools packages
2. Figure out how zlib is used by your application (assume that it is linked dynamically for the example)
3. Execute your program without hardware accelerated zlib:
$ time scp -C linux.tar tul3:
linux.tar 100% 550MB 26.2MB/s 00:21
real 0m21.276s
user 0m20.720s
sys 0m0.508s
4. Execute your program with hardware accelerated zlib:
$ time ZLIB_ACCELERATOR=CAPI LD_PRELOAD=/usr/lib/genwqe/libz.so.1 scp -C linux.tar tul3:
linux.tar 100% 550MB 110.0MB/s 00:05
real 0m4.997s
user 0m2.028s
sys 0m1.048s
$ time gzip -c linux.tar > linux.sw.tar.gz
real 0m16.730s
user 0m16.580s
sys  0m0.124s
$ time genwqe_gzip -ACAPI -B0 -c linux.tar > linux.hw.tar.gz
real 0m1.221s
user 0m0.024s
sys  0m0.168s
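Because the accelerated libz is a drop-in replacement for standard zlib, any dynamically linked zlib user picks up the card via the LD_PRELOAD shown above. As a minimal sketch (program and file names are illustrative, not part of the GenWQE tooling), a plain C program using zlib's compress() works the same way:

/* zsketch.c - minimal zlib user; build: cc zsketch.c -lz
 * Run as-is for software zlib; run with
 *   ZLIB_ACCELERATOR=CAPI LD_PRELOAD=/usr/lib/genwqe/libz.so.1 ./a.out
 * to route the same calls through the CAPI GZIP card. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *msg = "hello hello hello hello";  /* repetitive input compresses well */
    uLong src_len = (uLong)strlen(msg) + 1;
    uLong dst_len = compressBound(src_len);       /* worst-case deflate output size */
    Bytef *dst = malloc(dst_len);

    if (!dst || compress(dst, &dst_len, (const Bytef *)msg, src_len) != Z_OK) {
        fprintf(stderr, "compress failed\n");
        return 1;
    }
    printf("compressed %lu -> %lu bytes\n",
           (unsigned long)src_len, (unsigned long)dst_len);
    free(dst);
    return 0;
}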
Example – Multithreading/Throughput
Compressing and decompressing in parallel by multiple CPU threads/streams
Maximum hardware throughput reached already with 2 threads
32 software deflate threads cannot beat the CAPI accelerator
https://github.com/ibm-genwqe/genwqe-user/blob/master/tools/genwqe_mt_perf
Test setup: IBM Tuleta 8247-22L, 20 CPU cores @ 8 threads each, avg 3.694 GHz; bitstream 0x06030703475a4950
[Charts: DEFLATE and INFLATE throughput for CAPI, GenWQE, and SW zlib; MiB/sec (up to ~2500 for deflate, up to ~5000 for inflate) over 1 to 64 threads.]
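The numbers above were produced with genwqe_mt_perf (linked above). For intuition only, the toy loop below shows the measurement pattern: N threads each deflate a buffer repeatedly, and aggregate MiB/s is derived from wall-clock time. Buffer size, iteration count, and the file name are invented for illustration:

/* mt_deflate.c - toy multi-threaded deflate throughput loop (a sketch,
 * not the genwqe_mt_perf tool). Build: cc mt_deflate.c -lz -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <zlib.h>

#define BUF_MIB 1     /* per-iteration input size (illustrative) */
#define ITERS   64

static unsigned char src[BUF_MIB << 20];

static void *worker(void *arg)
{
    (void)arg;
    uLong bound = compressBound(sizeof(src));
    unsigned char *dst = malloc(bound);
    for (int i = 0; i < ITERS; i++) {
        uLong dlen = bound;
        compress(dst, &dlen, src, sizeof(src)); /* routed to HW when libz is preloaded */
    }
    free(dst);
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 2;
    if (nthreads < 1) nthreads = 1;
    if (nthreads > 64) nthreads = 64;
    pthread_t tid[64];
    struct timespec t0, t1;

    memset(src, 'x', sizeof(src));              /* trivially compressible data */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads: %.1f MiB/s\n", nthreads,
           nthreads * (double)ITERS * BUF_MIB / sec);
    return 0;
}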
CPU Load
Compressing and decompressing in parallel by multiple CPU threads/streams
[Chart: CPU load comparison of software zlib, GenWQE/PCIe GZIP acceleration, and CAPI GZIP acceleration. *) Software fallback due to small buffers.]
Drastic speed increase at almost no CPU load with the CAPI accelerator!
Data: http://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz
Internal User Feedback
User feedback (Frank Liu, IBM Research):
"I measured the runtime on a larger file (5.4G), the performance difference is even more drastic: 11m41sec vs 4.7 sec. (...) But over 100x speedup is very impressive."
The input data is SRR034947, half of about 1/40 of the human genome sequence (the whole human genome sequence is split into about 40 parts; each part has two paired files, and this is one of the pair). The raw file size (uncompressed) is 5.4G; compression reduced it to 1.6G. The machine is a P8 Tuleta, S824, running RH 7.2.
CAPI Experiences & 3 Answers
Question                               | Explanation                                                                  | Result
1. Bandwidth (sufficiently large data) | Limited by the comp/decomp engines; same card and FPGA as the predecessor    | Small improvements ☺
2. Efficiency for small requests       | PCIe GZIP: requests smaller than 16 KiB drop to software; CAPI GZIP: requests smaller than 8 KiB drop to software | Great improvements, but not enough to get rid of the software fallback
3. CPU load                            | Smarter CAPI DMA memory management                                           | Significant improvements ☺ ☺ ☺
Conclusions:
• CAPI cannot do magic: work must still be adjusted for hardware-accelerator usage
• CAPI helps to implement hardware acceleration intuitively (no worries about DMA memory handling)
• Getting CPU load down to almost zero worked right on the first try; that was really impressive
Other CAPI Solutions
Sample List of Accelerators
Financial
L3 Order Book (Algo Logic)
Monte Carlo Risk Analysis (IBM)
Database
IBM Data Engine for No-SQL (“CAPI-Flash”)
IBM Data Engine for No-SQL NVME Edition 2016
Erasure code for Hadoop, CEPH (IBM, Xilinx)
Key Value Store (KVS) (Xilinx)
Dynamic Time Warp Pattern Match (IBM)
Bitwise Encryption (DRC / SecurityFirst)
General Purpose
GZIP Compression (IBM)
Fast Fourier Transform (IBM)
Linear Algebra (Auviz)
Security
Bank Fraud Detection
In-Betweenness, Dijkstra (DRC)
Video Surveillance (SiliconScapes)
Digital DNA for forensics (DRC, i-Abra)
Retail & Analytics
RegEx Text Analytics (IBM)
Mood Detection (SiliconScapes)
Real-Time Ad Auctions (Algo-Logic)
Visually impaired assistance (SiliconScapes)
Computer Vision & Machine Learning
CV Library (Auviz)
People Identification
DNN Library (Auviz)
Activity Recognition (SiliconScapes)
AccDNN: Bridge to Deep Learning
Health Care
Light Activated Cancer Therapy (U of Toronto)
Genomics Processing (Edico)
PairHMM Accelerator (IBM)
http://www-304.ibm.com/webapp/set2/sas/f/capi/capiexamples.html
CAPI Attached Flash Optimization
– Attach IBM FlashSystem to POWER8 via CAPI
– Frees up processor cores for running applications instead of driving IOPS; up to 75% reduced overhead
– Reduces the number of cores needed per 1M IOPS
[Diagram: the conventional I/O path runs from the application through a read/write syscall, the file system, LVM, and the disk and adapter device driver (strategy()/iodone()), with buffer pinning, translation, DMA mapping, I/O start, and interrupt/unmap/unpin/iodone scheduling. The CAPI path runs from the application through a user library exposing a POSIX async-I/O-style API (aio_read()/aio_write()) to a shared-memory work queue. Roughly 29K instructions per I/O are reduced to ~5K.]
Note: different tools are used for the measurements.
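The user library replaces the syscall-per-I/O path with a POSIX-async-I/O-style API, as the diagram notes. To show the flavor of that interface, here is a sketch using standard POSIX AIO against an ordinary file (this is plain libc AIO for illustration, not the CAPI flash block API itself; the file name is invented):

/* aio_demo.c - POSIX AIO flavor of the async path. Build: cc aio_demo.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/tmp/testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* The calling thread stays free while the I/O is in flight;
     * here we simply poll for completion. */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(100);

    ssize_t n = aio_return(&cb);   /* the iodone() equivalent */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}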
CAPI Unlocks the Next Level of Performance for Flash
Identical hardware with 3 different paths to data: conventional I/O (FC), CAPI-I, and CAPI-E, on an IBM POWER S822L attached to a FlashSystem.
[Charts: IOPS per hardware thread (up to ~450,000) and latency in microseconds (up to ~200) for Conventional, CAPI-I, and CAPI-E.]
>3x better IOPS per HW thread, and lower latency
CAPI-I: Integrated CAPI Flash Card
CAPI-E: CAPI-attached External Flash
Future CAPI Directions
New Directions:
In-Line Programmable Framework
Three configurations for storing compressed data on a 2 TB flash device (the chip diagram shown for each case: cores with L2 caches, L3 banks, chip interconnect, memory bus, SMP interconnect, CAPI, and PCIe):

A. SW compression and 2 TB Flash
1. App calls SW compression
2. App stores compressed data through the ARC/Block API

B. HW compression and 2 TB Flash
1. App calls CAPI HW compression
2. App stores compressed data through the ARC/Block API
Benefits: high-performance compression; processor off-load

C. Compression on 2 TB Flash (In-Line Framework)
1. App calls the In-Line Framework, which does in-line compression
Benefits: high-performance compression; processor off-load; a single call for the application; one PCIe crossing
Summary
• Compression/decompression hardware accelerator
  – High throughput (almost as fast as storing uncompressed data)
  – Saves storage space
  – Impressive CPU load reduction (positive influence on power consumption)
• Various other CAPI examples are available
• CAPI technology offers a very efficient approach for designing hardware acceleration solutions
Backup
References/Links
GenWQE Hardware Accelerator Software
• Sources: https://github.com/ibm-genwqe/genwqe-user
CAPI accelerated GZIP Compression Adapter User's Guide
• https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/Introducting_the_CAPI_accelerated_GZIP_card?lang=en
OpenPOWER™
• General information: http://openpowerfoundation.org
• OpenPOWER™ Accelerator Work Group: http://openpowerfoundation.org/technical/working-groups/#wg_accelerator
CAPI Examples
• http://www-304.ibm.com/webapp/set2/sas/f/capi/capiexamples.html
Additional CAPI Solutions
• IBM POWER8 CAPI Order Book
– http://algo-logic.com/CAPIorderbook
– https://www.youtube.com/watch?v=Kca8TuqLjnI&feature=youtu.be
• CAPIflash
– http://openpowerfoundation.org/blogs/capi-and-flash-for-larger-faster-nosql-and-analytics/
– https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/power8_capi_flash_in_memory_expansion_to_speed_data_access?lang=en
– https://github.com/open-power/capiflash
• CAPIflash using NVME
– http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=ca&infotype=an&appname=iSource&supplier=897&letternum=ENUS116-062
• Accelerating Key-value Stores (KVS) with FPGAs and OpenPOWER
– http://openpowerfoundation.org/blogs/accelerating-key-value-stores-kvs-with-fpgas-and-openpower/
CAPI Directions
Genomics Example
Processing SAM/BAM Formats
SAM: Sequence Alignment/Map Format
SAM/BAM file-format specification can be found here: https://samtools.github.io/hts-specs/SAMv1.pdf
BAM is the binary representation of the SAM format. BAM is compressed in the BGZF format, which is a series of gzip-compatible blocks.
BGZF allows seeking efficiently within the data.
The block size of 64 KiB limits the effectiveness of the CAPI GZIP accelerator.
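Since BGZF stores each block's compressed size in the gzip extra field ("BC" subfield, per the SAMv1 spec linked above), a reader can hop from block to block without inflating anything. A minimal sketch, assuming well-formed standard BGZF blocks (fixed 18-byte header with the BC subfield first; the default file name is invented):

/* bgzf_walk.c - walk BGZF blocks via BSIZE in the gzip extra field. */
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    FILE *f = fopen(argc > 1 ? argv[1] : "test.bam", "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char hdr[18];   /* standard BGZF header is a fixed 18 bytes */
    long off = 0;
    while (fread(hdr, 1, sizeof(hdr), f) == sizeof(hdr)) {
        if (hdr[0] != 0x1f || hdr[1] != 0x8b) break;  /* not gzip magic */
        /* Bytes 12..15 hold SI1='B', SI2='C', SLEN=2; bytes 16..17 hold
         * BSIZE, where BSIZE = total block size - 1 (little-endian). */
        uint16_t bsize = (uint16_t)(hdr[16] | (hdr[17] << 8));
        long blocklen = (long)bsize + 1;
        printf("block at offset %ld, %ld bytes\n", off, blocklen);
        off += blocklen;
        fseek(f, off, SEEK_SET);  /* jump to the next block header */
    }
    fclose(f);
    return 0;
}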
Example – Genomics: samtools
Samtools example
Testcase: https://github.com/ibm-genwqe/genwqe-user/blob/master/misc/samtools_test.sh
[Chart: samtools runtimes in seconds with software zlib (SW/sec) vs. CAPI GZIP acceleration (CAPI/sec) for the SRT, B2F, IDX, and VIEW operations at thread counts from 1 to 128; runtimes range up to ~80 sec. Smaller is better.]
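Runs like those charted can be reproduced by preloading the accelerated libz in front of samtools. The command lines below are illustrative (file names invented; modern samtools option syntax assumed); the exact invocation is in the test script linked above:

$ export ZLIB_ACCELERATOR=CAPI
$ export LD_PRELOAD=/usr/lib/genwqe/libz.so.1
$ samtools sort -@ 8 input.bam -o sorted.bam    # SRT
$ samtools bam2fq sorted.bam > reads.fq         # B2F
$ samtools index sorted.bam                     # IDX
$ samtools view -b sorted.bam > copy.bam        # VIEW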
CAPI-Flash Details
IBM Data Engine for NoSQL is an integrated platform for large and fast-growing NoSQL data stores. It builds on the CAPI capability of POWER8 systems and provides super-fast access to large flash storage capacity. It delivers high-speed access to both RAM and flash storage, which can result in significantly lower cost and higher workload density for NoSQL deployments than a standard RAM-based system. The solution offers superior performance and price-performance compared to scale-out x86 server deployments that are either limited in available memory per server or have flash storage with limited data-access latency.
IBM Data Engine for NoSQL: Cost Savings for In-Memory NoSQL Data Stores
External Flash Configuration: up to 56 TB of extended memory with one POWER8 server plus CAPI-attached flash (Power S822L / S812L with FlashSystem 900)
Integrated Flash Configuration (NEW): up to 8 TB of super-fast storage tier on one POWER8 server (Power S822L / S812L / S822 LC)
NoSQL Databases enabled for Data Engine for NoSQL
Name       | Classification            | Optimized for                                                                               | Lead use cases / data types
Redis Labs | In-memory Key Value Store | Data queues, strings, lists, counts, caching, statistics, text, session IDs, pictures, videos | Live in-memory cache, data queues, user session data, shopping cart data
Cassandra  | Wide Column Store         | NoSQL environments that need very high performance and scalability, very high data volumes  | Messaging, fraud detection, Internet of Things data: sensor data, log data, telco call detail records
Neo4j      | Graph Store               | Data stored as edges, nodes, or attributes (graphs)                                         | Fraud detection, social network analysis, location-aware apps, master data mgmt., machine learning
IBM Data Engine for NoSQL API Choices
• Key-Value Layer APIs (KV) form the highest level of abstraction for accessing the Data Engine for NoSQL. These APIs are typically used by applications that inherently understand the semantics of a key/value store, hash table, or collection API. Example use cases include simple object storage/retrieval (e.g. ark_get or ark_set).
• Block APIs provide a fast-path between blocks on flash and a user-space application. Semantics are similar to a low-level block device (e.g. open / close / read / write). Data may be transient or persistent, depending on which APIs are used.
• Refer to the White Paper for a full API Guide, including forward-looking guidance for future releases.
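As a shape-of-the-API illustration only: the text above names ark_get and ark_set, but the prototypes, the lifecycle calls (ark_create/ark_delete), the header name, and the device path in this sketch are all assumptions; consult the capiflash sources and the White Paper for the real API.

/* kv_sketch.c - HYPOTHETICAL use of the Data Engine KV layer.
 * Every signature below is ASSUMED for illustration; only the names
 * ark_get/ark_set appear in this material. See the White Paper and
 * https://github.com/open-power/capiflash for the real API. */
#include <stdio.h>
#include <string.h>
#include "arkdb.h"                    /* assumed header name */

int main(void)
{
    ARK *ark = NULL;                  /* assumed handle type */
    char key[] = "user:42", val[] = "hello", out[64];

    if (ark_create("/dev/cxl/afu0.0s", &ark, 0) != 0)   /* assumed signature */
        return 1;

    ark_set(ark, sizeof(key), key, sizeof(val), val);   /* store key/value (assumed) */
    ark_get(ark, sizeof(key), key, sizeof(out), out);   /* fetch it back (assumed) */
    printf("value: %s\n", out);

    ark_delete(ark);                  /* assumed teardown call */
    return 0;
}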
Example Component Overview – Redis Labs Enterprise Cluster (External Flash and Integrated Flash configurations)
[Diagram: Redis Labs Enterprise Cluster clients on Ubuntu talk to a master context providing KV and block functions; the POWER8 server attaches either a FlashSystem FLASH array (up to 56 TB) through a CAPI adapter (PSL/AFU), or an integrated CAPI-FLASH card (PSL/AFU) with NVMe flash (2 x 960 GB) as the external flash subsystem.]
[Stack: Redis users connect through standard Redis clients to the Redis Clustered Server (the NoSQL DB), which uses the KVS Library and the Block Library (block store) on the Linux OS, reaching flash through the PSL/AFU via FPGA.]
IBM Data Engine for NoSQL – Integrated Flash configuration
• The CAPI NVMe adapter brings up to 2 TB of flash in-system
• Depending on system configuration, up to 4 CAPI NVMe cards may be present, offering up to 8 TB of in-system accelerated flash (per-card capacity may grow in the future up to 8 TB via additional higher-density flash sticks)
• User APIs are 100% compatible with the current Fibre Channel-attached flash adapter, providing solution architects added choice in storage density
• Standard single-wide 2U low-profile PCIe card
• Projected ~4.3 GB/sec sequential read and 3 GB/sec sequential write per CAPI NVMe card
• Projected 600K random read IOPS and 200K random write IOPS per card
[Diagram: your application(s) on Linux use KV and block functions; clients feed the cxlflash kernel module, and requests flow through the CAPI adapter(s) (PSL/AFU) on the POWER8 server to NVMe flash (up to 960 GB per stick). Card layout: Xilinx Kintex KU060 FPGA in an A1156 35 mm x 35 mm package, 4 GB DDR4-2400, PCIe x8 Gen 3, and two M.2 connectors carrying M.2 2280 NVMe flash (SM951, 512 GB) today, with M.2 22110 1 TB sticks as a growth option.]
CAPI GZIP Details
GZIP - Overview
• Business value
– High throughput compression; Saves storage and I/O bandwidth with little or no overhead
– CPU offload, CAPI interface with negligible software load; Frees up CPU cores for higher value computation or licensed software
– Lower power consumption by offloading the CPU intensive compression to an FPGA
• Features
– DEFLATE standard format, widely used for data interchange (RFC1950 zlib, RFC1951 deflate and RFC1952 gzip)
– ~1.8 GiB/sec compression throughput, ~2 GiB/sec decompression throughput
– 4x to 30x speed-up achievable
– Compression ratio near software zlib/gzip
• Example use cases
– Genomics, Datacenter/Cloud, Backup solutions
GZIP: DEFLATE Algorithm
Compression with Lempel-Ziv (LZ) methods aims at finding repeated byte sequences in the data. It then replaces the repeated byte sequence by a backwards reference (sequence length, distance) to the original occurrence. Unmatched bytes are coded as literals. A subsequent Huffman coder then usually codes the symbols (literals and references) with bit sequences which are short for frequent codes, and longer for less frequent ones.
The CAPI GZIP compression implementation consists of three main stages:
1. The first stage uses two sets of hashes to find previous occurrences of byte sequences in the input data stream. The best match, i.e. longest byte sequence with shortest distance, gets selected and overlapping matches are removed. Due to the nature of hashes, there may be hash collisions. The current implementation, for example, can find at most the previous two of such colliding sequences. Every clock cycle this stage outputs a sequence of 2-12 literals (single bytes) or backwards references (3-11 byte length matches with a distance to the previous occurrence).
2. The second stage picks one of the matches per cycle and compares it with the original data to search for longer matches. If this extender stage finds a longer match, that match will replace the original match from the first stage.
3. The third stage encodes the literals and matches into Huffman codes. The encoder has multiple sets of Huffman codes to choose from. Based on the initial few kilobytes of data it selects which of the sets is most suitable.
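To make the first stage's job concrete, here is a toy greedy LZ77 matcher in plain C. It is a didactic sketch of the literal/back-reference idea (naive window scan, no hashing, no Huffman stage), not the pipelined two-hash FPGA design described above:

/* lz77_toy.c - naive LZ77: emit literals and (length, distance) references. */
#include <stdio.h>
#include <string.h>

#define MIN_MATCH 3
#define MAX_MATCH 258    /* DEFLATE limits */
#define WINDOW 32768

static void lz77(const unsigned char *in, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t best_len = 0, best_dist = 0;
        size_t start = i > WINDOW ? i - WINDOW : 0;
        for (size_t j = start; j < i; j++) {      /* scan the whole window */
            size_t len = 0;
            while (len < MAX_MATCH && i + len < n && in[j + len] == in[i + len])
                len++;                            /* overlapping matches are legal */
            if (len > best_len) { best_len = len; best_dist = i - j; }
        }
        if (best_len >= MIN_MATCH) {
            printf("(len=%zu, dist=%zu) ", best_len, best_dist);
            i += best_len;
        } else {
            printf("'%c' ", in[i]);               /* unmatched byte: a literal */
            i++;
        }
    }
    putchar('\n');
}

int main(void)
{
    const char *s = "abcabcabcabd";
    lz77((const unsigned char *)s, strlen(s));    /* 'a' 'b' 'c' (len=8, dist=3) 'd' */
    return 0;
}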
Coherent Accelerator Interface (CAPI)
• What is the Coherent Accelerator Interface (CAPI)?
• Accelerator Functional Unit (AFU) Modes
• Accessing AFU via Software
What is CAPI?
CAPI (Coherent Accelerator Processor Interface) is a set of IBM innovations in hardware and software that allow an application and its accelerated component to share the same virtual address space.
[Diagram: a POWER8 processor (POWER8 core, CAPP, coherent memory) connected over PCIe to an FPGA holding the IBM-supplied PSL and the AFU; OS, application, and AFU share virtual memory.]
The solution combines:
• Customer application and accelerator
• Operating system enablement (Ubuntu LE / RHEL LE, libcxl function calls)
• Proprietary hardware to enable coherent acceleration
For customers, CAPI enables performance and ease of programming:
• The application sets up data coherently in shared virtual memory and calls the accelerator functional unit (AFU)
• The AFU reads and writes data from and to shared virtual memory coherently across PCIe while communicating with the application
• The AFU coherently caches data in the PSL cache for quick AFU access
CAPI Layering
[Diagram: on real hardware, user applications call libcxl.so, which talks to the CAPI kernel extension (cxl.ko) in the OS; requests travel through the CAPP and PCIe (as transport layer) to the PSL and AFU on the accelerator card. For software simulation, user applications instead link a PSLSE variant of libcxl.so that talks to the PSL Simulation Engine (PSLSE), which drives the AFU model (top.v, via VPI).]
Accelerator Functional Unit (AFU) Modes
Mode 1:
• Application: single process
• AFU dedicated mode: single AFU context
• Hardware provides: MMIO registers, Work Element Descriptor

Mode 2:
• Application: single process, multi-threaded
• AFU dedicated mode: single AFU context
• Hardware provides: MMIO registers, work-queue, interrupt

Mode 3:
• Application: multiple processes
• AFU directed mode: multiple AFU slave contexts, 1 AFU master context
• Hardware provides: MMIO registers, work-queues, interrupt(s) per AFU context
[Diagram: (1) a single process driving an AFU in dedicated mode; (2) a multi-threaded process with a queue feeding an AFU in dedicated mode with work-queue support; (3) multiple multi-threaded processes, each with its own queue and AFU slave context, plus one process holding the AFU master context, on an AFU in directed mode with a hardware scheduler for multiple AFU contexts.]
• Summary: CAPI GZIP supports up to 512 processes in parallel, each of which can have an arbitrary number of threads
Accessing AFU with Software
Open:
    afu_h = cxl_afu_open_dev("/dev/cxl/afu0.0s");
    rc = cxl_get_api_version_compatible(afu_h, &api_version_compatible);
    rc = cxl_get_cr_vendor(afu_h, 0, &cr_vendor);
    rc = cxl_get_cr_device(afu_h, 0, &cr_device);
    afu_fd = cxl_afu_fd(afu_h);
    rc = cxl_afu_attach(afu_h, (__u64)(unsigned long)(void *)w);
    rc = cxl_mmio_map(afu_h, CXL_MMIO_BIG_ENDIAN);

MMIO:
    rc = cxl_mmio_write64(afu_h, MMIO_DDCBQ_START_REG, &work_element_descriptor);
    rc = cxl_mmio_write64(afu_h, MMIO_DDCBQ_COMMAND_REG, reg);
    rc = cxl_mmio_read64(afu_h, MMIO_DDCBQ_STATUS_REG, &reg);

Coherent memory access:
    Use malloc'ed memory to communicate with the accelerator.

Interrupts/events:
    rc = select(ctx->afu_fd + 1, &set, NULL, NULL, &timeout);
    ...
    if (cxl_read_event(ctx->afu_h, &ctx->event) != 0)
        continue;
    switch (ctx->event.header.type) {
    case CXL_EVENT_AFU_INTERRUPT:
        while (__ddcb_done_post(ctx, DDCB_OK));
        break;
    case CXL_EVENT_DATA_STORAGE:
        /* do something */
        break;
    case CXL_EVENT_AFU_ERROR:
        /* do something */
        break;
    default:
        /* do something */
        break;
    }

Close:
    cxl_mmio_unmap(afu_h);
    cxl_afu_free(afu_h);
Special Environment Variables
ZLIB_ACCELERATOR: Use CAPI or GENWQE (PCIe) cards. The default is GENWQE. To use a CAPI GZIP card with the accelerated zlib, set ZLIB_ACCELERATOR to CAPI. Note: most of the tools, e.g. genwqe_gzip/genwqe_gunzip, also allow switching between GenWQE and CAPI with the -A parameter instead of the environment variable.

ZLIB_CARD: Card ID to be used. The library supports intelligent selection of cards when ZLIB_CARD is set to -1, which is the default.

ZLIB_TRACE: Turn on tracing bits: 0x1 general, 0x2 hardware, 0x4 software, 0x8 generate summary.

ZLIB_DEFLATE_IMPL: Selects the deflate implementation; the default is hardware. The last 4 bits switch between the hardware and software implementations: 0x00 SW, 0x01 HW. 0x40 is useful for cases like genomics or file systems where the dictionary is not used after the operation completes (e.g. Z_FULL_FLUSH); this is the default setting.

ZLIB_INFLATE_IMPL: Selects the inflate implementation (0x00 SW, 0x01 HW; control bits as above). The default is hardware, so you normally do not need to set it.

ZLIB_LOGFILE: If printing zlib traces on stderr is not appropriate, e.g. when using zlib within a daemon that closes stdin, stdout, and stderr, use this to redirect the trace data.
The current list of possible environment variables is available here:
• https://github.com/ibm-genwqe/genwqe-user/wiki/Environment%20Variables
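A combined invocation redirecting the trace to a file might look like this (the application name is illustrative):

$ ZLIB_ACCELERATOR=CAPI ZLIB_CARD=-1 ZLIB_TRACE=0x8 ZLIB_LOGFILE=/tmp/zlib.log \
  LD_PRELOAD=/usr/lib/genwqe/libz.so.1 ./my_zlib_app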
Simple steps to try out the PCIe/GenWQE card
1. Install required OS and genwqe-zlib, genwqe-tools packages
2. Figure out how zlib is used by your application (assume that it is linked dynamically for the example)
3. Execute your program without hardware accelerated zlib:
$ time scp -C linux.tar tul3:
linux.tar 100% 550MB 27.5MB/s 00:20
real 0m21.268s
user 0m20.780s
sys 0m0.432s
4. Execute your program with hardware accelerated zlib:
$ time ZLIB_ACCELERATOR=GENWQE LD_PRELOAD=/usr/lib/genwqe/libz.so.1 scp -C linux.tar tul3:
linux.tar 100% 550MB 91.7MB/s 00:06
real 0m6.248s
user 0m1.300s
sys 0m2.444s
$ time gzip -c linux.tar > linux.sw.tar.gz
real 0m17.248s
user 0m16.588s
sys  0m0.128s
$ time genwqe_gzip -AGENWQE -B0 -c linux.tar > linux.hw.tar.gz
real 0m1.217s
user 0m0.040s
sys  0m0.168s
Tracing zlib
The simplest way to use the trace is to let the library generate a summary of what it did:
$ ZLIB_TRACE=0x8 genwqe_gzip -ACAPI -C-1 -c test_data.bin > /dev/null
Info: deflateInit: 1
Info: deflate: 2242 sw: 0 hw: 2242
Info: deflate_avail_in 4 KiB: 1120
Info: deflate_avail_in 44 KiB: 1
Info: deflate_avail_in 132 KiB: 1121
Info: deflate_avail_out 132 KiB: 2242
Info: deflate_total_in 1024 KiB: 1
Info: deflate_total_out 1024 KiB: 1
Info: deflateSetHeader: 1
Info: deflateEnd: 1
Info: inflateInit: 0
Info: inflate: 0 sw: 0 hw: 0
Info: inflateEnd: 0
If the input and output buffer sizes are too small, consider enlarging them!