capi gzip and other capi solutions - ibm - united …...the input data is srr034947, half of about...

43
CAPI GZIP and other CAPI Solutions Frank Haverkamp [email protected] Sept 2016 – Version 0.3

Upload: others

Post on 30-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

CAPI GZIP and other CAPI Solutions

Frank Haverkamp [email protected]

Sept 2016 – Version 0.3

Page 2: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

GZIP Hardware Acceleration

• CAPI GZIP Acceleration Adapter (EJ1A/EJ1B)– Characteristics– Usage Example– Experiences

• Other CAPI Solutions

• Future Directions

• Summary

2© Copyright IBM 2016

Page 3: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 3

CAPI GZIP Acceleration Adapter

Page 4: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

GenWQE/PCIe GZIP Accelerator #EJ12, #EJ13

On CPU Accelerator for P9

4

Evolution

© Copyright IBM 2016

• CAPI GZIP (FC #EJ1A & EJ1B)• Power8 and OpenPower enabled: Tuleta-L, Habanero-

LC, Firestone-LC (bare-metal only)• Ubuntu 14.04.5 and later, RHEL 7.2 LE

• Hardware platform: Altera Stratix V FPGA

CAPI GZIP Accelerator #EJ1A & EJ1B

Page 5: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

5

GZIP Compression Solution Architecture & 3 Questions

Software stack (Linux on Power)

CAPI KernelExtension

libz

libzHW

libcxl

libz(SW)

NativeApplication

Java

JavaApplications

CAPI GZIPAccelerator Card

CAPI FPGA bitstream

GenWQELinux Driver

Work-queue

GenWQE/PCIeAccelerator Card

PCIe FPGA bitstream

© Copyright IBM 2016

AFU(directed mode)

Work queue support, hardware scheduler for multiple AFU contexts

process(using AFU slave context)

thre

ad

thre

a

d thre

ad…

queue

process(using AFU slave context)

thre

ad

thre

ad

thre

ad…

queue

process(using AFU master context)

• Predecessor: PCIe GZIP Accelerator:

• Using genwqe-card.ko Linux driver to access work-queue

• Bandwidth determined by comp/decomp

• CAPI GZIP Accelerator:

• CAPI AFU directed mode: one work-queue per AFU-context

• Re-use generic libcxl & cxl.ko to access CAPI device

• User-space library implements work-queue access

• 3 Questions:

1. Bandwidth

2. Efficiency for small requests

3. CPU load

Work-queue

use

r-spa

ceke

rne

lFP

GA

ha

rdw

are

PSL

Solution Provider

IBM

Card Vendor

Application Developer

GenWQE/PCIeCAPI

Page 6: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

6

Simple steps to try out the CAPI GZIP card

1. Install required OS and genwqe-zlib, genwqe-tools packages

2. Figure out how zlib is used by your application (assume that it is linked dynamically for the example)

3. Execute your program without hardware accelerated zlib:

$ time scp -C linux.tar tul3:

linux.tar 100% 550MB 26.2MB/s 00:21

real 0m21.276s

user 0m20.720s

sys 0m0.508s

4. Execute your program with hardware accelerated zlib:

$ time ZLIB_ACCELERATOR=CAPI LD_PRELOAD=/usr/lib/genwqe/libz.so.1 scp -C linux.tar tul3:

linux.tar 100% 550MB 110.0MB/s 00:05

real 0m4.997s

user 0m2.028s

sys 0m1.048s

$ time gzip -c linux.tar > linux.sw.tar.gz

real 0m16.730suser 0m16.580ssys 0m0.124s

$ time genwqe_gzip -ACAPI -B0 -c linux.tar > linux.hw.tar.gz

real 0m1.221suser 0m0.024ssys 0m0.168s

© Copyright IBM 2016

Page 7: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

7

Example – Multithreading/Throughput

Compressing and decompressing in parallel by multiple CPU threads/streams

Maximum hardware throughput reached already with 2 threads

32 software deflate threads cannot beat CAPI acceleratorhttps://github.com/ibm-genwqe/genwqe-user/blob/master/tools/genwqe_mt_perf

Test-setup: IBM Tuleta 8247-22L20 CPU cores @ 8 threads eachAvg: 3.694 GHzBitstream: 0x06030703475a4950

0

500

1000

1500

2000

2500

1 2 3 4 8 16 32 64

MiB

/sec

threads

DEFLATE CAPI, GENWQE and SW

CAPI MiB/sec GENWQE MiB/sec SW MiB/sec

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

1 2 3 4 8 16 32 64M

iB/s

ec

threads

INFLATE CAPI, GENWQE and SW

CAPI MiB/sec GENWQE MiB/sec SW MiB/sec

© Copyright IBM 2016

Page 8: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

8

CPU Load

Compressing and decompressing in parallel by multiple CPU threads/streams

Software CAPI GZIP Acceleration

*) Software fallback due to small buffers

Drastic speed increase at almost no CPU load with CAPI accelerator!

Data: http://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz

GenWQE/PCIe GZIP Acceleration

© Copyright IBM 2016

Page 9: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

9

Internal User Feedback

User feed-back (Frank Liu, IBM Research):

„I measured the runtime on a larger file (5.4G), the performance difference iseven more drastic: 11m41sec vs 4.7 sec. (...) But over 100x speedup is veryimpressive.“

The input data is SRR034947, half of about 1/40 of human genome sequence(the whole human genome sequence is split into about 40 parts, each part hastwo files, paired. This is one of the pair). The raw file size (uncompressed) is5.4G. Compression reduced the file to 1.6G. The machine is a P8 Tuleta, S824, running RH 7.2.

© Copyright IBM 2016

Page 10: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 10

CAPI Experiences & 3 Answers

Questions Explanation Results

1. Bandwidth (sufficiently large data)

Limited by comp/decompUsing the same card and FPGA

Small improvements ☺

2. Efficiency for small requests

PCIe GZIP: Requests smaller then 16KiB; drop to softwareCAPI GZIP: Requests smaller then 8KiB, drop to software

Great improvements, but not enough to get rid of fallback to software �

3. CPU load Smarter CAPI DMA memory management Significant improvements ☺ ☺ ☺

Conclusions:

• CAPI cannot do magic: Work must still be adjusted for hardware accelerator usage

• CAPI helps to implement hardware acceleration intuitively (no worries about DMA memory handling)

• Getting CPU load down to almost zero was right at the first try – that was really impressive

Page 11: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 11

Other CAPI Solutions

Page 12: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

12

Sample List of Accelerators

Financial

L3 Order Book (Algo Logic)

Monte Carlo Risk Analysis (IBM)

Database

IBM Data Engine for No-SQL (“CAPI-Flash”)

IBM Data Engine for No-SQL NVME Edition 2016

Erasure code for Hadoop, CEPH (IBM, Xilinx)

Key Value Store (KVS) (Xilinx)

Dynamic Time Warp Pattern Match (IBM)

Bitwise Encryption (DRC / SecurityFirst)

General Purpose

GZIP Compression (IBM)

Fast-Fourier Transfer (IBM)

Linear Algebra (Auviz)

Security

Bank Fraud Detection

In-Betweenness Djikstra(DRC)

Video Surveillance (SiliconScapes)

Digital DNA for forensics (DRC, i-Abra)

Retail & Analytics

RegEx Text Analytics (IBM)

Mood Detection (SiliconScapes)

Real-Time Ad Auctions (Algo-Logic)

Visually impaired assistance (SiliconScapes)

Computer Vision & Machine Learning

CV Library (Auviz)

People Identification

DNN Library (Auviz)

Activity Recognition (SiliconScapes)

AccDNN: Bridge to Deep Learning

Health Care

Light Activated Cancer Therapy (U of Toronto)

Genomics Processing (Edico)

PairHMM Accelerator (IBM)

© Copyright IBM 2016

http://www-304.ibm.com/webapp/set2/sas/f/capi/capiexamples.html

Page 13: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

strategy ( )

CAPI Attached Flash Optimization

– Attach IBM FlashSystem to POWER8 via CAPI

– Frees up processor cores to running applications vs driving IOPS – Up to 75% reduced overhead

– Reduced number of cores needed to per 1M IOPS

Pin buffers, Translate, Map DMA, Start I/O

Application

Read/Write Syscall

Interrupt, unmap, unpin,Iodone scheduling

29K instructions reduced to

~5K

Disk and Adapter DD

strategy ( ) iodone ( )

FileSystem

Application

User Library

Posix AsyncI/O Style API

Shared Memory Work Queue

aio_read()aio_write()

iodone ( )

LVM

13 Different tools are used for measurements© Copyright IBM 2016

Page 14: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

CAPI Unlocks the Next Level of Performance for Flash

Identical hardware with 3 different paths to data

FlashSystem

ConventionalI/O (FC) CAPI - E

0

50.000

100.000

150.000

200.000

250.000

300.000

350.000

400.000

450.000

Conventional CAPI - I CAPI - E

IOPS per Hardware Thread

0

20

40

60

80

100

120

140

160

180

200

Conventional CAPI - I CAPI - E

Latency (microseconds)

IBM POWER S822L

>3x better IOPS per HW thread

Lower latency

14

CAPI – I : Integrated CAPI Flash Card

CAPI – E: CAPI attached External Flash

CAPI - I

© Copyright IBM 2016

Page 15: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 15

Future CAPI Directions

Page 16: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

16

New Directions:

In-Line Programmable Framework

Chip

Interconnect

Core Core Core

L2 L2 L2

L2 L2 L2L3

Bank

L3

Bank

L3

Bank

L3

Bank

L3

Bank

L3

Bank

Core Core Core

Me

mo

ry B

us

SM

P In

terco

nn

ect

SM

PC

AP

IP

CIe

SM

PA C

1. App calls SW Compression

2. App stores Compressed Data

through ARC/Block API

B

A C

A

B

Chip

Interconnect

Core Core Core

L2 L2 L2

L2 L2 L2L3

Bank

L3

Bank

L3

Bank

L3

Bank

L3

Bank

L3

Bank

Core Core Core

Me

mo

ry B

us

SM

P In

terco

nn

ect

SM

PC

AP

IP

CIe

SM

PA

B

C

1. App calls CAPI HW Compression

2. App stores Compressed Data

through ARC/Block API

Benefits:

High performance compression

Processor off-load

A C

A

B

Chip

Interconnect

Core Core Core

L2 L2 L2

L2 L2 L2L3

Bank

L3

Bank

L3

Bank

L3

Bank

L3

Bank

L3

Bank

Core Core Core

Me

mo

ry B

us

SM

P In

terco

nn

ect

SM

PC

AP

IP

CIe

SM

PA

B C

1. App calls In-Line Framework, which

does in-line compression

Benefits:

High performance compression

Processor off-load

Single call for Application

One PCI-E crossing

A

B C

A. SW compression and 2TB Flash B. HW compression and 2TB Flash C. Compression on 2TB Flash

Page 17: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

17

Summary

• Compression/decompression hardware accelerator– High throughput

(almost as fast as storing uncompressed data)

– Saving storage space– Impressive CPU load reduction

(positive influence on power consumption)

• Different other CAPI examples available

• CAPI technology offers very efficient approach for designing hardware acceleration solutions

© Copyright IBM 2016

Page 18: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 18

Backup

Page 19: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

19

References/Links

GenWQE Hardware Accelerator Software

• Sources: https://github.com/ibm-genwqe/genwqe-user

CAPI accelerated GZIP Compression Adapter User’s guide

• https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-

44a4f27eba32/entry/Introducting_the_CAPI_accelerated_GZIP_card?lang=en

OpenPOWERTM

• General Information: http://openpowerfoundation.org

• OpenPOWERTM Accelerator Work Group: http://openpowerfoundation.org/technical/working-

groups/#wg_accelerator

CAPI Examples

• http://www-304.ibm.com/webapp/set2/sas/f/capi/capiexamples.html

© Copyright IBM 2016

Page 20: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

Additional CAPI Solutions

• IBM POWER8 CAPI Order Book

– http://algo-logic.com/CAPIorderbook

– https://www.youtube.com/watch?v=Kca8TuqLjnI&feature=youtu.be

• CAPIflash

– http://openpowerfoundation.org/blogs/capi-and-flash-for-larger-faster-nosql-and-analytics/

– https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/power8_capi_flash_in_memory_expansion_to_speed_data_access?lang=en

– https://github.com/open-power/capiflash

• CAPIflash using NVME

– http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=ca&infotype=an&appname=iSource&supplier=897&letternum=ENUS116-062

• Accelerating Key-value Stores (KVS) with FPGAs and OpenPOWER

– http://openpowerfoundation.org/blogs/accelerating-key-value-stores-kvs-with-fpgas-and-openpower/

© Copyright IBM 2016 20

Page 21: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 21

CAPI Directions

Page 22: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 22

Page 23: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 23

Genomics Example

Page 24: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

24

Processing SAM/Bam Formats

SAM: Sequence Alignment/Map Format

SAM/BAM file-format specification can be found here: https://samtools.github.io/hts-specs/SAMv1.pdf

BAM is the binary representation of the SAM

format. BAM is compressed in the BGZF format. The

BGZF format is a series of blocks as shown in the

table.

BGZF allows seeking efficiently within the data.

Blocksize of 64KiB limits the effectiveness of the

CAPI GZIP accelerator.

© Copyright IBM 2016

Page 25: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

25

Example – Genomics: samtools

Samtools example

Testcase: https://github.com/ibm-genwqe/genwqe-user/blob/master/misc/samtools_test.sh

1 2 4 8 16 32 64 128 1 1 1 2 4 8 16 32 64 128

SRT B2F IDX VIEW

0

10

20

30

40

50

60

70

80

sec

samtools with software zlib and CAPI GZIP acceleration

SW/sec

CAPI/sec

© Copyright IBM 2016

Smaller is better

Page 26: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 26

CAPI-Flash Details

Page 27: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

IBM Data Engine for NoSQL is an integrated platform for large and fast growing NoSQL data stores. It

builds on the CAPI capability of POWER8 systems and provides super-fast access to large flash storage capacity. It delivers high speed access to both RAM and flash storage which can result in

significantly lower cost, and higher workload density for NoSQL deployments than a standard RAM-based system. The solution offers superior performance and price-performance to scale out x86

server deployments that are either limited in available memory per server or have flash memory with limited data access latency.

IBM Data Engine for NoSQLCost Savings for In-Memory NoSQL Data Stores

Up to 56TB of extended memory with one POWER8 server + CAPI attach FLASH

Power S822L / S812L Flash System 900Power S822L / S812L / S822 LC

NEW

External Flash Configuration Integrated Flash Configuration

Up to 8TB of super-fast storage tier on one POWER8 server

27 © Copyright IBM 2016

Page 28: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

NoSQL Databases enabled for Data Engine for NoSQL

28

Name Classification Optimized for Lead use cases / data types

Redis Labs In-memory Key Value Store

Data queues, Strings, Lists, Counts, caching, Statistics, Text, session IDs, pictures, videos

Live in memory cache, data queues, User session data, shopping cart data,

Cassandra Wide Column Store

NoSQL environments that need Very High Performance and Scalability, Very Highdata volumes

Messaging, Fraud detection, Internet of Things data – sensor data, log data, telco call detail records

Neo4j Graph Store Data stored as edges, nodes, or attributes (Graphs).

Fraud detection, Social Network Analysis, Location aware apps, Master data mgmt., Machine Learning

© Copyright IBM 2016

Page 29: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

IBM Data Engine for NoSQL API Choices

• Key-Value Layer APIs (KV) form the highest level of abstraction for accessing the Data Engine for NoSQL. Usage of these APIs is typically by applications the inherently understand semantics of a key/value store, hash table, or collection API. Example use cases include simple object storage / retrieval (e.g. ark_get or ark_set).

• Block APIs provide a fast-path between blocks on flash and a user-space application. Semantics are similar to a low-level block device (e.g. open / close / read / write). Data may be transient or persistent, depending on which APIs are used.

• Refer to the White Paper for a full API Guide, including forward-looking guidance for future releases.

29

Ubuntu

Redis Labs Enterprise Cluster

Block Fcn

Client Client Client Client…

Master

Context

AFU

PSL

PO

WE

R8 S

erv

er

Fla

sh

Sys

tem

FLASH Array (up to 56TB)C

AP

IA

dapte

r

Integrated CAPI-FLASH Card

KV Fcn

CA

PI

Adapte

r PSL

AFU

NVMe

960 GB

NVMe

960 GB

External Flash subsystem

© Copyright IBM 2016

Page 30: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

Example Component Overview – Redis Labs Enterprise Cluster (External Flash and Integrated Flash configurations)

Redis Clustered Server

KVS Library

Block Library

Standard Redis Clients

Linux OS

PSL-AFU via FPGA

RedisUsers

NoSQLDBs

BlockStore

30

Ubuntu

Redis Labs Enterprise Cluster

Block Fcn

Client Client Client Client…

Master

Context

AFU

PSL

PO

WE

R8 S

erv

er

Fla

sh

Sys

tem

FLASH Array (up to 56TB)

CA

PI

Adapte

r

Integrated CAPI-FLASH Card

KV Fcn

CA

PI

Adapte

r PSL

AFU

NVMe

960 GB

NVMe

960 GB

External Flash subsystem

© Copyright IBM 2016

Page 31: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

IBM Data Engine for NoSQL – Integrated Flash configuration

• CAPI NVMe Adapter brings up to 2 TB of flashin-system

• Depending on system configuration, up to 4 CAPI NVMecards may be present, offering up to 8 TB of in-system accelerated flash

(Per-card capacity may grow in the future up to 8 TB via additional

higher-density flash sticks)

• User APIs are 100% compatible with current FibreChannel-attached flash adapter, providing solution architects added choice in storage density

• Standard single-wide 2U low-profile PCIe card

• Projected ~4.3GB/sec sequential read and 3GB/sec sequential write per CAPI NVMe card

• Projected 600k random read IOPs and 200k random write IOPs per card

Linux

Your Application(s)

Block Fcn

Client Client Client Client…

cxlflashkernel

module

AFU

PSL

Pow

er

8 S

erv

er

NVMe

(up to 960 GB)C

AP

IA

dapte

r(s)

KV Fcn

NVMe

(up to960 GB)Xilinx Kintex

KU060

in A1156 35mm x 35mm

Package

4 GBytesDDR4-2400

x40

NVMe

NVMe

PCIex8 Gen 3

M.2

Con

necto

r

M.2

Con

necto

r

M.2 2280 NVMeFlash SM951 –512GByte

M.2 2280 NVMe Flash SM951 –512GByte

PSUs

M.2 22110 1TByte

M.2 22110 1TByte

31 © Copyright IBM 2016

Page 32: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 32

CAPI GZIP Details

Page 33: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

GZIP - Overview

• Business value

– High throughput compression; Saves storage and I/O bandwidth with little or no overhead

– CPU offload, CAPI interface with negligible software load; Frees up CPU cores for higher value computation or licensed software

– Lower power consumption by offloading the CPU intensive compression to an FPGA

• Features

– DEFLATE standard format, widely used for data interchange (RFC1950 zlib, RFC1951 deflate and RFC1952 gzip)

– ~1.8 GiB/sec compression throughput, ~2GiB/sec decompression throughput

– 4 – 30 x speed-up achievable

– Compression ratio near software zlib/gzip

• Example use cases

– Genomics, Datacenter/Cloud, Backup solutions

© Copyright IBM 2016 33

Page 34: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

34

GZIP: DEFLATE Algorithm

Compression with Lempel-Ziv (LZ) methods aims at finding repeated byte sequences in the data. It then replaces the repeated byte sequence by a backwards reference (sequence length, distance) to the original occurrence. Unmatched bytes are coded as literals. A subsequent Huffman coder then usually codes the symbols (literals and references) with bit sequences which are short for frequent codes, and longer for less frequent ones.

The CAPI GZIP compression implementation consists of three main stages:

1. The first stage uses two sets of hashes to find previous occurrences of byte sequences in the input data stream. The best match, i.e. longest byte sequence with shortest distance, gets selected and overlapping matches are removed. Due to the nature of hashes, there may be hash collisions. The current implementation, for example, can find at most the previous two of such colliding sequences. Every clock cycle this stage outputs a sequence of 2-12 literals (single bytes) or backwards references (3-11 byte length matches with a distance to the previous occurrence).

2. The second stage picks one of the matches per cycle and compares it with the original data to search for longer matches. If this extender stage finds a longer match, that match will replace the original match from the first stage.

3. The third stage encodes the literals and matches into Huffman codes. The encoder has multiple sets of Huffman codes to choose from. Based on the initial few kilobytes of data it selects which of the sets is most suitable.

© Copyright IBM 2016

Page 35: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

35

Coherent Accelerator Interface (CAPI)

• What is the Coherent Accelerator Interface (CAPI)?

• Accelerator Functional Unit (AFU) Modes

• Accessing AFU via Software

© Copyright IBM 2016

Page 36: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

36

What is CAPI?

CAPI (Coherent Accelerator Processor Interface) is a set of IBM innovations in hardware and software that allow an application and it’s accelerated

component to share the same virtual address space. FPGA

POWER8Core

PC

Ie

POWER8 Processor

OS

App

Memory (Coherent)

AFU

IBM

Supplied PSL

Virtual Memory

CA

PP

Customer application and accelerator

Operating system enablement

•Ubuntu LE/RHEL LE

•Libcxl function calls

Proprietary hardware to enable

coherent acceleration

For customers, CAPI enables performance and ease of programming:

• Application to set up data coherently in shared virtual memory and call the accelerator functional unit (AFU)

• AFU to reads and writes data from and to shared virtual memory coherently across PCIe and communicating with the application

• AFU to coherently cache data in the PSL cache for quick AFU access

© Copyright IBM 2016

Page 37: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

CAPI Kernel Extension(cxl.ko)

libcxl.so

37

CAPI Layering

User-Application(s)

CAPP

PCIe as transport-layer

PSL

AFU

Acc

ele

rato

r C

ard

Ha

rdw

are

Sim

ula

tor

Use

rsp

ace

Ha

rdw

are

libcxl.so (PSLSE)

PSLSE

AFU

(top.v, VPI)

So

ftw

are

Sim

ula

tio

n

OS

User-Application(s)

© Copyright IBM 2016

Page 38: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

38

Accelerator Functional Unit (AFU) Modes

• Application: Single process

• AFU dedicated mode: Single AFU context

• Hardware provides: MMIO registers, Work Element Descriptor

• Application: Single process/multi-threaded

• AFU dedicated mode: Single AFU context

• Hardware provides: MMIO registers, Work-queue, interrupt

• Application: Multi process

• AFU directed mode: Multiple AFU slave contexts, 1 AFU master context

• Hardware provides: MMIO registers, Work-queues, interrupt(s) per AFU context

process

AFU(dedicated mode)

AFU(dedicated mode)

Work queue support

process

thre

ad

thre

ad

thre

ad…

queue

AFU(directed mode)

Work queue support, hardware scheduler for multiple AFU contexts

process(using AFU slave context)

thre

ad

thre

ad

thre

ad…

queue

process(using AFU slave context)

thre

ad

thre

ad

thre

ad…

queue

process(using AFU master context)

• Summary: CAPI GZIP supports up to 512 processes in parallel, which can have arbitrary number of threads

© Copyright IBM 2016

Page 39: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

39

Accessing AFU with Software

� Openafu_h = cxl_afu_open_dev("/dev/cxl/afu0.0s”);

rc = cxl_get_api_version_compatible(afu_h, &api_version_compatible);

rc = cxl_get_cr_vendor(afu_h, 0, &cr_vendor);

rc = cxl_get_cr_device(afu_h, 0, &cr_device);

afu_fd = cxl_afu_fd(afu_h);

rc = cxl_afu_attach(afu_h, (__u64)(unsigned long)(void *)w);

rc = cxl_mmio_map(afu_h, CXL_MMIO_BIG_ENDIAN);

� MMIOrc = cxl_mmio_write64(afu_h, MMIO_DDCBQ_START_REG, &work_element_descriptor);

rc = cxl_mmio_write64(afu_h, MMIO_DDCBQ_COMMAND_REG, reg);

rc = cxl_mmio_read64(afu_h, MMIO_DDCBQ_STATUS_REG, &reg);

� Coherent memory access� Use “malloced” memory to communicate with accelerator

� Interrupts/eventsrc = select(ctx->afu_fd + 1, &set, NULL, NULL, &timeout);

...

if (cxl_read_event(ctx->afu_h, &ctx->event) != 0)

continue;

switch (ctx->event.header.type) {

case CXL_EVENT_AFU_INTERRUPT:

while (__ddcb_done_post(ctx, DDCB_OK)); break;

case CXL_EVENT_DATA_STORAGE:

/* do something */ break;

case CXL_EVENT_AFU_ERROR:

/* do something */ break;

default:

/* do something */ break;}

� Closecxl_mmio_unmap(afu_h);

cxl_afu_free(afu_h);

© Copyright IBM 2016

Page 40: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

© Copyright IBM 2016 40

Page 41: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

41

Special Environment Variables

ZLIB_ACCELERATOR: Use CAPI or GENWQE (PCIe) cards. The default is GENWQE. In order to use a CAPI GZIP card

with the accelerated zlib, set ZLIB_ACCELERATOR to CAPI.NOTE: Most of our tools e.g. genwqe_gzip/genwqe_gunzip allow to switch between GenWQE and CAPI using the -A parameter instead of using

the environment variable.

ZLIB_CARD: Card ID to be used. The library supports intelligent selection of cards when ZLIB_CARD is set to -1.

Default setting is -1.

ZLIB_TRACE: Turn on tracing bits: 0x1: General, 0x2: Hardware, 0x4: Software, 0x8: Generate summary

ZLIB_DEFLATE_IMPL: Default setting is to use hardware.

0x00: SW, 0x01: HW The last 4 bits are used to switch between the hard- and software-implementation

0x40: Useful for cases like Genomics, or filesystems where the dictionary is not used after the operation

completed, e.g. Z_FULL_FLUSH, etc. This is the default setting.

ZLIB_INFLATE_IMPL: Default setting is to use hardware, so you normally do not need to set it.

0x00: SW, 0x01: HW Use software or hardware for inflate

See above for control bits

ZLIB_LOGFILE: Printing out zlib traces on stderr is not appropriate e.g. if using zlib within a daemon, which closes

stdin, stdout, and stderr use this to get the trace data.

The current list of possible environment variables is available here:

• https://github.com/ibm-genwqe/genwqe-user/wiki/Environment%20Variables

© Copyright IBM 2016

Page 42: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

42

Simple steps to try out the PCIe/GenWQE card

1. Install required OS and genwqe-zlib, genwqe-tools packages

2. Figure out how zlib is used by your application (assume that it is linked dynamically for the example)

3. Execute your program without hardware accelerated zlib:

$ time scp -C linux.tar tul3:

linux.tar 100% 550MB 27.5MB/s 00:20

real 0m21.268s

user 0m20.780s

sys 0m0.432s

4. Execute your program with hardware accelerated zlib:

$ time ZLIB_ACCELERATOR=GENWQE LD_PRELOAD=/usr/lib/genwqe/libz.so.1 scp -C linux.tar tul3:

linux.tar 100% 550MB 91.7MB/s 00:06

real 0m6.248s

user 0m1.300s

sys 0m2.444s

$ time gzip -c linux.tar > linux.sw.tar.gz

real 0m17.248suser 0m16.588ssys 0m0.128s$ time genwqe_gzip -AGENWQE -B0 -c linux.tar > linux.hw.tar.gz

real 0m1.217suser 0m0.040ssys 0m0.168s© Copyright IBM 2016

Page 43: CAPI GZIP and other CAPI Solutions - IBM - United …...The input data is SRR034947, half of about 1/40 of human genome sequence (the whole human genome sequence is split into about

43

Tracing zlib

The simplest way to use the trace is to let the library generate a summary of what it did:

$ ZLIB_TRACE=0x8 genwqe_gzip -ACAPI -C-1 -c test_data.bin > /dev/null

Info: deflateInit: 1

Info: deflate: 2242 sw: 0 hw: 2242

Info: deflate_avail_in 4 KiB: 1120

Info: deflate_avail_in 44 KiB: 1

Info: deflate_avail_in 132 KiB: 1121

Info: deflate_avail_out 132 KiB: 2242

Info: deflate_total_in 1024 KiB: 1

Info: deflate_total_out 1024 KiB: 1

Info: deflateSetHeader: 1

Info: deflateEnd: 1

Info: inflateInit: 0

Info: inflate: 0 sw: 0 hw: 0

Info: inflateEnd: 0 If the in and output buffer sizesare too small, consider enlargingit!

© Copyright IBM 2016