sequences to systems: smart healthcare and smart systems ......acmc2: an hls engine for mcmc the...

Abstract

The TCGA Accelerator

The Symphony Runtime

AcMC2: An HLS Engine for MCMC

Sequences to Systems: Smart Healthcare and Smart Systems

S. S. Banerjee, C. Y. Tan, M. el-Hadedy, D. Chen, W-m Hwu, S. Lumetta, Z. T. Kalbarczyk, R. K. Iyer

CS/ECE/Coordinated Science Lab, University of Illinois at Urbana-Champaign, [email protected]

• Identify common algorithmic patterns across the range of analytics in personalized medicine

• Design and implement accelerators that target these algorithmic patterns

• Build runtime systems that execute these workloads in heterogeneous (accelerator-rich) cloud environments

• Compiles high-level specifications of probabilistic models (probabilistic programs) into optimized HDL

• Symphony: A machine-learning-based runtime system for executing data flow graphs on heterogeneous computing environments (CPUs, GPUs, FPGAs)

Acknowledgements

This material is based upon work supported in part by Xilinx, IBM, Intel, and the National Science Foundation under Grant Nos. CNS 13-37732 and CNS 16-24790.

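[Figure: graphical model in plate notation with variables x_n and z_n (plate N), π, and µ_k (plate K), shown next to the generated hardware module. Hardware labels: SE units (SE ... SE_k), streaming I/O, controller, DRAM.]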

type Pi;
#Pi ~ Poisson(2);
random Pi z(Integer i) ~ UniformChoice({p for Pi p});
random Real mean(Pi c) ~ UniformReal(-1, 1);
random Real x(Integer i) ~
    if z(i) != null then Gaussian(mean(z(i)), 1.0);
query size({p for Pi p}); // Number of clusters
query z(0); // Assignment of data point 0
obs x(0) = 0.2; obs x(1) = 1.0; // N in number
obs x(2) = 0.5; obs x(3) = 0.7;
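To make the sampling that such a program implies more concrete, here is a minimal Python sketch of a Gibbs sampler for the mixture model above. It is not the AcMC2-generated design. It makes one loud simplification: the number of clusters is fixed at `k`, whereas the program draws it from Poisson(2). The function name `gibbs_mixture` and its parameters are hypothetical.

```python
import math
import random

def gibbs_mixture(xs, k=2, sweeps=200, seed=0):
    """Gibbs sampling for a unit-variance Gaussian mixture.

    Assumption (not in the poster): the cluster count k is fixed.
    Priors follow the program: mean ~ Uniform(-1, 1), z uniform over clusters.
    """
    rng = random.Random(seed)
    means = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    z = [rng.randrange(k) for _ in xs]
    for _ in range(sweeps):
        # Resample each assignment z_i given the means; the weight of
        # cluster c is the unit-variance Gaussian likelihood of x_i.
        for i, x in enumerate(xs):
            weights = [math.exp(-0.5 * (x - m) ** 2) for m in means]
            z[i] = rng.choices(range(k), weights=weights)[0]
        # Resample each mean given its members: a Gaussian centred on the
        # member average, truncated to [-1, 1] by rejection.
        for c in range(k):
            members = [x for x, zi in zip(xs, z) if zi == c]
            if not members:
                means[c] = rng.uniform(-1.0, 1.0)  # empty cluster: prior draw
                continue
            centre = sum(members) / len(members)
            spread = 1.0 / math.sqrt(len(members))
            while True:
                candidate = rng.gauss(centre, spread)
                if -1.0 <= candidate <= 1.0:
                    means[c] = candidate
                    break
    return z, means

# The four observations from the program above.
assignments, cluster_means = gibbs_mixture([0.2, 1.0, 0.5, 0.7])
```

Within one sweep, the assignments z_i are conditionally independent given the means, so their updates could in principle run side by side.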

[Figure panels: Probabilistic Program (regions annotated Model, Query, Data), Graphical Representation, Hardware Module. Legend: plate notation (repeats of variables), random variables, factor functions. Additional diagram label: Intermediate Results.]

Inter-task parallelism: Multiple SEs explore independent random walks

Intra-task parallelism: Conditional-independence-based pipelined sampling for every random variable

[Figure label: NoC Router]
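Inter-task parallelism can be pictured in software as several chains that share no state. The sketch below is an illustration only, not the hardware design: it runs random-walk Metropolis-Hastings chains on an assumed standard normal target, each with its own seed. The names `log_target`, `random_walk_chain`, and `run_chains` are hypothetical.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of a standard normal (assumed target)."""
    return -0.5 * x * x

def random_walk_chain(seed, steps=2000, step_size=1.0):
    """One Metropolis-Hastings random walk with its own random stream."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_accept = log_target(proposal) - log_target(x)
        # Accept with probability min(1, exp(log_accept)).
        if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
            x = proposal
        samples.append(x)
    return samples

def run_chains(n_chains=4, steps=2000):
    """Chains share no state, so each could run on a separate unit."""
    return [random_walk_chain(seed, steps) for seed in range(n_chains)]

chains = run_chains()
```

The loop in `run_chains` is sequential here for simplicity. Since the chains never communicate, it could be replaced by any parallel map.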

[Chart data labels: 49.8×, 68.3×, 44.8×, 32.2×, 39.2×]

Results:
• 47–100× improvement in runtime performance over a 6-core IBM Power8 CPU
• 753–1600× improvement in performance per watt

[Figure labels: Accelerator, IBM PSL]

• Uses conditional independences to determine parallelism

• Allows execution of compositional MCMC: Gibbs, Hamiltonian, and Metropolis-Hastings

[Figure: a variable x_i and its Markov blanket B_{x_i}, drawn on a colored graph.]

All variables of the same color can be executed in parallel.
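One simple way to obtain such a coloring is greedy graph coloring over the Markov blankets. The poster does not say which coloring algorithm AcMC2 uses, so this is only a sketch. The input format (a dictionary mapping each variable to the variables in its Markov blanket) and the function names are assumptions made for illustration.

```python
def color_variables(blanket):
    """Greedy coloring: no variable shares a color with its Markov blanket.

    `blanket` maps each variable to the variables in its Markov blanket
    (assumed symmetric). Returns a dict from variable to color index.
    """
    colors = {}
    for var in blanket:
        taken = {colors[n] for n in blanket[var] if n in colors}
        color = 0
        while color in taken:
            color += 1
        colors[var] = color
    return colors

def parallel_groups(colors):
    """Group variables by color; each group can be sampled together."""
    groups = {}
    for var, color in colors.items():
        groups.setdefault(color, []).append(var)
    return groups

# Hypothetical four-variable ring: a-b-c-d-a.
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["a", "c"]}
groups = parallel_groups(color_variables(ring))
# groups == {0: ["a", "c"], 1: ["b", "d"]}
```

Variables in the same group are conditionally independent given all other groups, which is what allows them to be updated in the same step.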

[Figure: TCGA system architecture. Host-side labels: Host CPU, Symphony runtime environment, FPGA driver and library, PCIe, storage device, NIC, GPU, P2P device access. FPGA-side labels: PCIe endpoint, PCIe DMA controller, CAPI controller (can use either PCIe or CAPI), control unit, pipelined interconnect, memory controller, on-board DRAM, on-chip memory, and a grid of systolic elements PE(0,0) through PE(n,3) (application accelerators) joined by pipelined systolic connections. Legend: runtime system controlled, dynamically reconfigured, OS controlled, off-the-shelf IPs.]

• TCGA: The Computational Genomics Accelerator
• Reconfigurable architecture to multiplex accelerators
• Example accelerators:

[Figure: Symphony workflow. Labels: DFG specification, accelerated kernels (FPGA, CPU, GPU), system architecture specification, application workload statistics, static dependencies, dynamic dependencies, state estimation, DRL agent, Symphony runtime.]

Results:
• 109× improvement (73 hours to under 40 minutes)
• 210× improvement in performance per watt
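As a quick arithmetic check on these figures, the short sketch below recomputes the speedup from the two rounded runtimes. It also derives the power ratio they imply, on the assumption that the performance-per-watt gain equals the speedup multiplied by the baseline-to-accelerator power ratio. The helper name `speedup` is hypothetical, and the poster reports no power measurements directly.

```python
def speedup(baseline_hours, accelerated_minutes):
    """Ratio of baseline runtime to accelerated runtime."""
    return baseline_hours * 60.0 / accelerated_minutes

# 73 hours down to 40 minutes: 4380 / 40 = 109.5, in line with the
# reported 109x given that both endpoints are rounded.
runtime_gain = speedup(73, 40)

# Assumption: perf-per-watt gain = speedup * (baseline power / accelerator power).
# The implied power ratio is then 210 / 109, a little under 2.
implied_power_ratio = 210 / 109
```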

ASAP: LD Accelerator

[Figure: system topology. Labels: CPU 0 and CPU 1 with attached memory and a memory bus; PCIe bus and PCIe backplane with four PCIe switches; attached GPUs, a CAPI FPGA, NICs (to network), and other PCIe devices. An annotation marks a source of bandwidth contention.]

Pharaoh: PairHMM FA Accelerator

[Figure: Resource Activity Vector. Labeled entries: system load, memory load, external memory load, cache load, bandwidth, mem load, divider ports (0 port, 1 port, 2+ port), core load, FP-arith, scalar, vector, TLB load, interconnect load.]

[Figure: RL agent. An actor and a critic exchange state, reward, action, and value. Inputs: quantized RAV distribution and DFG input map (s2v). Actor network: 1D CNN, RNN, FC, softmax, producing the policy. Critic network: 1D CNN, RNN, FC, producing the value.]

Estimate system resource contention of architectural resources

Deep Reinforcement Learning model for making scheduling decisions
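The actor-critic split can be sketched in a few lines. This is a toy stand-in, not Symphony's model: the poster's networks use 1D CNN, RNN, and FC layers, whereas the sketch replaces each network with a single linear layer. The device list, state layout, weights, and function names are all hypothetical.

```python
import math
import random

DEVICES = ["CPU", "GPU", "FPGA"]  # hypothetical action set

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor(state, weights):
    """Policy: one linear layer (assumed) followed by softmax."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)

def critic(state, value_weights):
    """Value estimate: one linear layer (assumed)."""
    return sum(w * s for w, s in zip(value_weights, state))

def choose_device(state, weights, rng):
    """Sample a placement decision from the actor's policy."""
    probs = actor(state, weights)
    return rng.choices(DEVICES, weights=probs)[0], probs

# Hypothetical 3-entry state, e.g. quantized load levels per resource.
state = [0.2, 0.7, 0.1]
weights = [[0.0, 0.0, 0.0] for _ in DEVICES]  # untrained: uniform policy
device, probs = choose_device(state, weights, random.Random(0))
```

With all-zero weights the policy is uniform over the three devices. Training would adjust the actor's weights using the critic's value estimates, which the sketch leaves out.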

Results:
• Minimizes interference between co-located workloads
• Performance stays within 90–95% of isolated performance
