
Rhythm: Harnessing Data Parallel Hardware for Server Workloads

Sandeep R. Agrawal$, Valentin Pistol$, Jun Pang$, John Tran#, David Tarjan#, Alvin R. Lebeck$

$Duke CS  #NVIDIA

Duke Computer Architecture

Explosive Internet Growth
• Increasing web traffic and cloud service demands
• Example: Facebook
  • 900 million users
  • ~1 trillion page views a month, or ~350,000 per second
  • 1.2 million photos per second
  • >100,000 servers
• How to best satisfy this demand?
  • Add more machines: more space and cooling costs
  • Improve existing machines to achieve higher throughput/Watt

SIMD Accelerator Efficiency
• Instruction Fetch + Decode as high as 30%-40% of core power!
• SIMT/SIMD amortizes fetch and decode costs
• NVIDIA Kepler and Intel Xeon Phi achieve > 5 GFlops/Watt
• Can we harness accelerator efficiency to increase throughput/Watt for server workloads?

[Figure: two pie charts of core power breakdown, for Tensilica Xtensa$ and OpenSPARC T1*. One chart uses the categories Fetch/Decode, Register File, Functional Units, Other, with slices listed as 33%, 22%, 19%, 26%. The other uses Fetch/Decode, Execute, Writeback, Other, with slices listed as 37%, 27%, 15%, 21%.]

*Sartori, et al. HPCA 2012
$Hameed, et al. ISCA 2010

Insight
• Many requests perform the same task(s), such as login or search query
• Delay some requests in order to “align” the execution of similar requests (a cohort)
• Execute cohort on a SIMD accelerator, trading response time for throughput and efficiency

• Motivated by Cohort Scheduling [Larus & Parkes ’02]
• PacketShader [Han et al. ’10]
• Memcached on GPUs [Hetherington et al. ’12]

Enabling Trends
• Memory bandwidth to accelerator
  • PCIe 4.0
  • SoC designs like Tegra K1 and AMD Fusion
• Network bandwidth
  • 100 Gbps Ethernet (IEEE 802.3bj)
  • 400 Gbps Ethernet Study Group (http://www.ieee802.org/3/400GSG/)
• High Throughput OS/DB Services
  • GPU-FS [Silberstein, et al. ASPLOS ’13]
  • Vector Interfaces [Vasudevan, et al. SOCC ’12]
  • Memcached

Server Design Space

[Figure: design-space plot. x-axis: Normalized ARM Efficiency; y-axis: Normalized x86 Throughput; the value 1 is marked on each axis. The x86 core and the ARM core are plotted, and a Desired Operating Region is highlighted.]

• Ideal design
  • Throughput (Requests/second) >= an x86 core
  • Energy efficiency (Requests/Joule) >= an ARM core
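The two ideal-design criteria amount to a simple test on normalized coordinates. The following is a minimal sketch, assuming throughput is normalized to one x86 core and efficiency to one ARM core, as on the plot axes. The function names and the sample numbers are illustrative and are not measurements from the talk.

```python
def requests_per_joule(requests_per_second: float, watts: float) -> float:
    """Energy efficiency: requests served per joule (W = J/s, so Req/s / W = Req/J)."""
    return requests_per_second / watts


def in_desired_region(norm_throughput: float, norm_efficiency: float) -> bool:
    """True when a design is at least as fast as the x86 baseline (1.0)
    and at least as efficient as the ARM baseline (1.0)."""
    return norm_throughput >= 1.0 and norm_efficiency >= 1.0


# Hypothetical designs: (normalized throughput, normalized efficiency)
fast_but_hungry = (2.0, 0.5)
fast_and_frugal = (2.0, 1.5)
print(in_desired_region(*fast_but_hungry), in_desired_region(*fast_and_frugal))
```

A design has to clear both thresholds; being faster than x86 alone, or more efficient than ARM alone, leaves it outside the region.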

Outline
• Motivation
• Software Architecture
• Implementation
• Evaluation
• Conclusion

Conventional Server Pipeline
• Requests processed individually
  • Thread per request (Apache)
  • Event driven execution (Nginx)

[Figure: pipeline Clients → Reader → Parser → Process → Response → Clients, handling one HTTP request at a time; the Process stage exchanges data with the DB.]
  Reader: read request
  Parser: parse using HTTP spec
  Process: N backend, N+1 process stages
  Response: response generation, send to client

Rhythm
• Pipelined architecture for processing “cohorts” of requests on data parallel hardware
  • Extends cohort scheduling and event-based staged servers
• Cohort: group of “similar” requests
  • Control flow similarity for SIMD accelerators
• Maximize throughput by stalling only on resource shortage
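Cohort formation can be pictured as per-type queues that release a batch once enough similar requests have accumulated. This is a minimal sketch and not the Rhythm implementation: the class name `CohortBuilder`, the fixed `cohort_size`, and the `max_wait` timeout for partial cohorts are all assumptions made for illustration.

```python
from collections import defaultdict


class CohortBuilder:
    """Groups requests of the same type so they can run in lockstep (hypothetical helper)."""

    def __init__(self, cohort_size, max_wait):
        self.cohort_size = cohort_size      # requests per full cohort (assumed fixed)
        self.max_wait = max_wait            # assumed bound on how long a request is delayed
        self.pending = defaultdict(list)    # request type -> [(arrival time, request)]

    def add(self, req_type, request, now):
        """Queue a request; return (type, requests) if a full cohort just formed, else None."""
        queue = self.pending[req_type]
        queue.append((now, request))
        if len(queue) >= self.cohort_size:
            return self._release(req_type)
        return None

    def flush_expired(self, now):
        """Release partial cohorts whose oldest request has waited at least max_wait."""
        expired = [t for t, q in self.pending.items()
                   if q and now - q[0][0] >= self.max_wait]
        return [self._release(t) for t in expired]

    def _release(self, req_type):
        queue = self.pending.pop(req_type)
        return req_type, [request for _, request in queue]
```

Grouping by request type is a stand-in for control flow similarity: requests of one type are expected to follow largely the same code path. The timeout illustrates the response-time side of the trade-off.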

The Rhythm Pipeline

[Figure, built up stage by stage over several slides: pipeline Clients → Reader → Parser → Dispatch → Process → Response → Clients, carrying inflight cohorts of HTTP requests; different colors denote different request types; the Process stage exchanges data with the DB.]
  Reader: request accumulation
  Parser: parse using HTTP spec
  Dispatch: execute on host/device?
  Process: N backend, N+1 process stages
  Response: response generation, send to clients

Design goals
• Applicable to any SIMD hardware
• Utilize the most efficient computational resource
  • Dispatch cohort on host or accelerator
• Support deep pipelines
  • Arbitrary number of process and backend stages
• Support wide pipelines
  • Multiple instances of slowest stage to maximize throughput

• A pipeline stage implementation can be on host or accelerator

Optimizations
• Thread per request in cohort
• Transpose request, response buffers for coalescing
• Whitespace Padding in HTML Content and Headers

[Figure: original request array and transposed request array, each drawn as a 4 × 4 grid of memory addresses 0 to 15, with rows labeled Request 0 through Request 3.]
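With one thread per request, all threads in a cohort read byte i of their own request at the same step. Transposing the buffers places those bytes at adjacent addresses, which is what lets a GPU coalesce the accesses. The sketch below shows the layout change on the host in Python; the function names and the assumption of equal-length buffers are illustrative.

```python
def transpose_requests(buffers):
    """Interleave equal-length request buffers so that byte i of every
    request is contiguous: [r0[0], r1[0], ..., r0[1], r1[1], ...]."""
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "pad buffers to a common length first"
    return bytes(buffers[r][i] for i in range(length) for r in range(len(buffers)))


def untranspose(flat, num_requests):
    """Inverse layout change: recover each request's bytes with a strided slice."""
    return [flat[r::num_requests] for r in range(num_requests)]


requests = [b"GET ", b"POST", b"PUT ", b"HEAD"]
flat = transpose_requests(requests)
print(flat[:4])   # first byte of each of the four requests, now adjacent
```

The inverse step corresponds to deinterleaving a response before it is sent back to a client.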

Prototype Implementation (GTX Titan)

[Figure: prototype pipeline split between Host and Accelerator, with numbered steps 1 to 13. Main path: Read, Dispatch, Write, Transpose, Parse, Process, Transpose. Storage/Backend path: Write, Transpose, Process, Read, Transpose. HTTP requests and responses enter and leave the pipeline.]
• Read requests from clients, launch warps
• Dispatch requests based on type, and schedule on host or device
• A request can access backend multiple times
• Deinterleave client response


Methodology
• SPECWeb Banking (14 of 16 request types)
  • C version on x86 and ARM platforms
  • C+CUDA version on Titan platform (Rhythm)
• Metrics (weighted arithmetic mean)
  • Throughput
  • Power (Kill-A-Watt)
  • Latency

Platform  GHz  Description
Core i5   3.4  Core i5 3570, 22 nm, 4 cores (4 threads), 8GB DDR3 RAM, 1Gbps NIC
Core i7   3.4  Core i7 3770, 22 nm, 4 cores (8 threads), 16GB DDR3 RAM, 1Gbps NIC
ARM A9    1.2  OMAP 4460, 45 nm, Pandaboard, 2 cores, 1GB LPDDR2 RAM
Titan     0.8  GTX Titan, 28 nm, 14 Streaming Multiprocessors, 6GB GDDR5 Memory

Future Server Platforms
• 1 Gbps NIC cannot sustain required throughputs
  • Max throughput 517Gbps raw, <100Gbps compressed
  • Requests generated locally on host, no responses sent
• x86 and ARM platforms
  • Backend emulated as function call
• Titan A
  • Backend emulated on host as separate thread
• Titan B
  • Backend emulated on device as function call
• Titan C
  • Titan B + no transpose for final response

Evaluation

[Figure: scatter plot of Normalized i7 throughput (Req/Sec), log scale from 0.01 to 10.00, against Normalized ARM efficiency (Req/Joule), from 0.00 to 4.00. Points: core i5, core i7, arm a9, titan A, titan B, titan C. Annotations on the plot: 3.4x efficiency, 8.2x throughput; 1.2x efficiency, 4.1x throughput; 0.6x efficiency, 1.1x throughput; and a "why?" callout.]

Titan A Limitations

[Figure: bar chart of Throughput (K reqs/sec), axis from 0 to 700, comparing throughput achieved with throughput capped by PCIE.]
• PCIE 3.0 bandwidth < 12 Gbps

Latency
• Titan A: 86 ms (weighted average)
  • PCIE Stalls
• Titan B: 24 ms (weighted average)
  • 99th percentile latency for different request types is 18%-40% longer than the average
• Titan C: 10 ms (weighted average)
  • 99th percentile latency for different request types is 7%-34% longer than the average
  • Latency omits time for transpose of response

Scaling Many Core Processors
• Assume dynamic power of 1W per ARM core and 10W per x86 core
• Find #cores to match accelerator throughput
• To match Titan B’s throughput (232W dynamic power)

  Platform  #Cores  Power (W)  Available Uncore Power (W (%))
  ARM       192     192        40 (21%)
  x86       21      210        22 (10%)

• To match Titan C’s throughput (211W dynamic power)

  Platform  #Cores  Power (W)  Available Uncore Power (W (%))
  ARM       385     385        -174 (-45%)
  x86       41      410        -199 (-48%)

• Titan C has >170W to implement the transpose operation and still outperform the scaled systems.
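The table entries follow from simple arithmetic: core power is the core count times the assumed per-core power, and the available uncore power is whatever remains of the accelerator's dynamic power budget. The sketch below reproduces that calculation. It takes the core counts from the slide as inputs, and it assumes the percentage is the available power divided by the core power, which matches the slide's figures to within a percentage point.

```python
ARM_WATTS_PER_CORE = 1    # assumption stated on the slide
X86_WATTS_PER_CORE = 10   # assumption stated on the slide


def uncore_budget(accel_watts, cores, watts_per_core):
    """Return (core power, available uncore power, available as % of core power)."""
    core_power = cores * watts_per_core
    available = accel_watts - core_power
    percent = 100.0 * available / core_power
    return core_power, available, percent


scenarios = [
    ("Titan B", 232, "ARM", 192, ARM_WATTS_PER_CORE),
    ("Titan B", 232, "x86", 21, X86_WATTS_PER_CORE),
    ("Titan C", 211, "ARM", 385, ARM_WATTS_PER_CORE),
    ("Titan C", 211, "x86", 41, X86_WATTS_PER_CORE),
]
for target, accel_watts, platform, cores, per_core in scenarios:
    power, available, percent = uncore_budget(accel_watts, cores, per_core)
    print(f"{target} vs {platform}: {power} W cores, {available} W left ({percent:.1f}%)")
```

A negative budget means the cores alone already draw more than the accelerator, before any uncore power is counted.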

Conclusion
• Web Server workloads amenable to SIMD accelerators
• Cohort scheduling using Rhythm to improve Requests/Joule
• Programming language/Runtime
  • reduce programmer effort to create/edit web pages
• Other workloads
• More SIMD platforms
  • Xeon Phi, ARM NEON, Tegra K1, AMD Fusion

Questions?
• Thanks!

Backup

Dynamic Power
• Dynamic Power = Power_under_load - Power_idle

Evaluation: Dynamic power

[Figure: scatter plot of Normalized i7 throughput (Req/Sec), log scale from 0.01 to 10.00, against Normalized ARM efficiency (Req/Joule), from 0.00 to 4.00, computed with dynamic power. Points: core i5, core i7, arm a9, titan A, titan B, titan C. Annotations on the plot: 2.5x efficiency, 8.2x throughput; 0.9x efficiency, 4.1x throughput; 0.5x efficiency, 1.1x throughput; and a "why?" callout.]

Titan B: Dynamic Power

[Figure: scatter plot of Normalized i7 throughput (Req/Sec), from 2.00 to 5.50, against Normalized ARM efficiency (Req/Joule), from 0.60 to 1.30, with one point per request type: login, account_summary, add_payee, bill_pay, bill_pay_status_output, change_profile, check_detail_html, order_check, place_check_order, post_payee, post_transfer, profile, transfer, logout. The spread across request types is marked "Why?".]

Titan B: Response Buffer Sizes

                         Response size (KB)
Request type             SPECWeb   Rhythm    Difference (KB)
login                    4         8         4
account_summary          17        32        15
add_payee                18        32        14
bill_pay                 15        32        17
bill_pay_status_output   24        32        8
change_profile           29        32        3
check_detail_html        11        16        5
order_check              21        32        11
place_check_order        25        32        7
post_payee               34        64        30
post_transfer            16        32        16
profile                  32        64        32
transfer                 13        16        3
logout                   46        64        18

• Well matched buffers give higher efficiency

Request similarity (SPECWeb Banking)

[Figure: bar chart of Estimated Speedup / Ideal Speedup, axis from 0.00 to 1.20.]

Rhythm Pipeline Control
• Event loop
  • Single-threaded, epoll based
  • Handle new connections
  • Backend responses
  • File system responses
• Callbacks
  • Linked list traversed on each iteration
  • Track stage completion via polling (no device interrupts)
  • Track stage transitions
• Data Structures
  • Cohort pool: static array
  • Session state: concurrent hash table
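The callback mechanism can be sketched as a list of pending stages that the event loop walks once per iteration, polling each for completion and firing its continuation when done. This is a simplified illustration with hypothetical names (`PendingStage`, `run_iteration`, `FakeKernel`). It uses a Python list where the slide mentions a linked list, and it leaves out the epoll handling of sockets. On CUDA, a non-blocking query such as `cudaEventQuery` could play the role of `is_done`.

```python
class PendingStage:
    """A stage that has been launched and is waiting to complete (hypothetical)."""

    def __init__(self, name, is_done, on_done):
        self.name = name
        self.is_done = is_done    # non-blocking poll: returns True once the stage finished
        self.on_done = on_done    # callback that performs the stage transition


def run_iteration(pending):
    """One pass of the event loop over the callback list; returns stages still pending."""
    still_pending = []
    for stage in pending:
        if stage.is_done():
            stage.on_done()
        else:
            still_pending.append(stage)
    return still_pending


class FakeKernel:
    """Stand-in for device work that reports completion after a number of polls."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed

    def query(self):
        self.polls_left -= 1
        return self.polls_left <= 0
```

Polling inside the loop keeps everything on one thread, so no locking is needed around the callback list.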

CUDA Specific Optimizations
• Transposes using shared memory
• Max butterfly reduction in shared memory to calculate HTML padding
• Constant memory to store static HTML content
• Store frequently used pointers in constant memory to reduce register pressure
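A butterfly reduction has each lane combine its value with a partner at XOR distance stride, halving the stride every round, so that after log2(n) rounds every lane holds the result. The sketch below simulates a max butterfly in Python and uses it to pad responses, assuming the padding brings every response in a cohort up to the longest one. The function names are illustrative, and a real kernel would run the rounds across threads in shared memory.

```python
def butterfly_max(values):
    """All-reduce max over a power-of-two number of lanes using XOR partners."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    lanes = list(values)
    stride = n // 2
    while stride >= 1:
        # Every lane reads its partner's value from the previous round.
        lanes = [max(lanes[i], lanes[i ^ stride]) for i in range(n)]
        stride //= 2
    return lanes


def pad_to_cohort_max(responses):
    """Pad each response with spaces up to the longest response in the cohort."""
    target = butterfly_max([len(r) for r in responses])[0]
    return [r + b" " * (target - len(r)) for r in responses]


cohort = [b"<p>a</p>", b"<p>abc</p>", b"<p></p>", b"<p>ab</p>"]
padded = pad_to_cohort_max(cohort)
print([len(p) for p in padded])
```

Because every lane ends up with the maximum, no separate broadcast step is needed before the threads compute their own padding.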

Delay Requests?
• Can you really delay requests?
  • EcoDB [Lang & Patel CIDR ’09]
  • DreamWeaver [Meisner & Wenisch ASPLOS ’12]

Contributions
• Rhythm, a software architecture for high throughput SIMT-based servers
• Evaluation of future server platform architectures
• Prototype implementation of Rhythm on NVIDIA GPUs
• Standalone C and C+CUDA implementations of SPECWeb2009 Banking

The Rhythm Pipeline

[Figure: pipeline Clients → Readers → Parsers → Dispatch → Process → Responses → Clients, with multiple instances per stage.]
  Readers: request accumulation
  Parsers: parse using HTTP spec
  Dispatch: execute on host/device?
  Process: n backend, n+1 process stages
  Responses: response generation, send to clients

The Rhythm Pipeline (SPECweb Banking)
• Maximize #inflight cohorts to improve throughput & efficiency
• A pipeline stage implementation can be on host or accelerator

[Figure: the same pipeline populated with inflight cohorts of requests. Incoming requests are grouped by type (login, bill_pay, image, ...) through the Readers, Parsers, and Dispatch stages. The Process stage alternates process steps (login1, bill_pay1, image1, ...; login2, bill_pay2, image2; login3) with DB accesses.]