

Sequences to Systems: Smart Healthcare and Smart Systems

S. S. Banerjee, C. Y. Tan, M. el-Hadedy, D. Chen, W-m Hwu, S. Lumetta, Z. T. Kalbarczyk, R. K. Iyer

CS/ECE/Coordinated Science Lab, University of Illinois at Urbana-Champaign, [email protected]

Abstract

• Identify common algorithmic patterns across the range of analytics used in personalized medicine

• Design and implement accelerators that target these algorithmic patterns

• Build runtime systems that allow these workloads to execute in heterogeneous (accelerator-rich) cloud environments

• AcMC2: Compiles high-level specifications of probabilistic models (probabilistic programs) into optimized HDL

• Symphony: A machine-learning-based runtime system for executing data flow graphs in heterogeneous computing environments (CPUs, GPUs, FPGAs)

Acknowledgements

This material is based upon work supported in part by Xilinx, IBM, Intel and the National Science Foundation under Grant Nos. CNS 13-37732 and CNS 16-24790.

AcMC2: An HLS Engine for MCMC

[Figure: graphical representation of the model in plate notation (variables x_n, z_n, µ_k; plates N and K) next to the generated hardware module (labels: SE, SE_k, streaming I/O, controller, DRAM)]

type Pi;
#Pi ~ Poisson(2);
random Pi z(Integer i) ~ UniformChoice({p for Pi p});
random Real mean(Pi c) ~ UniformReal(-1, 1);
random Real x(Integer i) ~
    if z(i) != null then Gaussian(mean(z(i)), 1.0);
query size({p for Pi p}); // Number of clusters
query z(0); // Assignment of data point 0
obs x(0) = 0.2; obs x(1) = 1.0; // N in number
obs x(2) = 0.5; obs x(3) = 0.7;
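
The program above describes a Gaussian mixture over four observations. As a rough software illustration of sampling from such a model, the sketch below alternates a Gibbs update for the assignments z(i) with a Metropolis-Hastings update for each cluster mean. It is not the sampler that AcMC2 generates: the number of clusters is fixed at `k` instead of being drawn from the Poisson prior, and the names `sample_mixture`, `log_gauss`, `sweeps`, and `step` are hypothetical.

```python
import math
import random


def log_gauss(x, mean, sd=1.0):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def sample_mixture(xs, k=2, sweeps=2000, step=0.3, seed=0):
    """Sample assignments and means for a k-cluster mixture (k fixed: an assumption)."""
    rng = random.Random(seed)
    means = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    z = [rng.randrange(k) for _ in xs]
    trace = []
    for _ in range(sweeps):
        # Gibbs step: resample each assignment from its conditional given the means.
        for i, x in enumerate(xs):
            weights = [math.exp(log_gauss(x, m)) for m in means]
            z[i] = rng.choices(range(k), weights=weights)[0]
        # Metropolis-Hastings step: random-walk proposal for each mean.
        for c in range(k):
            proposal = means[c] + rng.gauss(0.0, step)
            if not -1.0 < proposal < 1.0:
                continue  # outside the UniformReal(-1, 1) prior support: reject
            members = [x for x, zi in zip(xs, z) if zi == c]
            log_ratio = sum(log_gauss(x, proposal) - log_gauss(x, means[c])
                            for x in members)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                means[c] = proposal
        trace.append((list(z), list(means)))
    return trace


observations = [0.2, 1.0, 0.5, 0.7]  # the obs statements in the program
trace = sample_mixture(observations)
share_in_cluster_0 = sum(zs[0] == 0 for zs, _ in trace) / len(trace)
print("fraction of sweeps with z(0) in cluster 0:", share_in_cluster_0)
```

Because the proposal is symmetric and the prior on each mean is uniform, the acceptance ratio reduces to the likelihood ratio of the points currently assigned to that cluster.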

[Figure annotations: callouts 1-4 relate the probabilistic program (parts labeled Model, Query, Data) to the graphical representation and the hardware module. Legend: plate notation (repeats of variables), random variables, factor functions, intermediate results.]

Inter-task parallelism: Multiple SEs explore independent random walks

Intra-task parallelism: Conditional-independence-based pipelined sampling for every random variable

[Figure label: NoC Router]
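
The inter-task parallelism described above can be mimicked in software by running several independent chains, each with its own random stream. The sketch below is only an analogy for the hardware SEs: it uses a thread pool, a random-walk Metropolis sampler, and a standard normal target, all of which are assumptions chosen for brevity. The names `run_chain` and `run_chains` are hypothetical.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def run_chain(seed, steps=5000, step_size=1.0):
    """One independent random walk: Metropolis sampling of a standard normal."""
    rng = random.Random(seed)  # private generator, so chains share no state
    x = 0.0
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Log density of N(0, 1) up to a constant is -x^2 / 2.
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples


def run_chains(num_chains=4, steps=5000):
    """Run the chains concurrently; the chain index doubles as its seed."""
    with ThreadPoolExecutor(max_workers=num_chains) as pool:
        return list(pool.map(lambda seed: run_chain(seed, steps), range(num_chains)))


chains = run_chains()
pooled = [x for chain in chains for x in chain]
print("pooled sample mean:", sum(pooled) / len(pooled))
```

Since no chain reads another chain's state, the work scales with the number of workers; in CPython a process pool or native code would be needed to see a real speedup.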

[Figure: chart with data labels 49.8×, 68.3×, 44.8×, 32.2×, 39.2×]

Results:

• 47-100× improvement in runtime performance over a 6-core IBM Power8 CPU

• 753-1600× improvement in performance-per-watt terms
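
The two result ranges can be related with a line of arithmetic. If performance per watt is taken as throughput divided by power, then the performance-per-watt gain equals the speedup times the ratio of CPU power to accelerator power. The sketch below assumes that definition, and assumes the lower and upper ends of the two ranges correspond to each other; neither is stated on the poster. The function name is hypothetical.

```python
def implied_power_ratio(speedup, perf_per_watt_gain):
    """CPU power divided by accelerator power, under the assumptions above."""
    return perf_per_watt_gain / speedup


low_end = implied_power_ratio(47, 753)     # about 16.02
high_end = implied_power_ratio(100, 1600)  # exactly 16.0
print(round(low_end, 2), high_end)
```

Under these assumptions both ends of the range imply that the accelerator draws roughly one sixteenth of the CPU's power.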

[Figure labels: Accelerator, IBM PSL]

• Uses conditional independences to determine parallelism

• Allows execution of compositional MCMC: Gibbs, Hamiltonian, and Metropolis-Hastings

[Figure: a variable x_i and its Markov blanket B_{x_i}. All variables of the same color can be executed in parallel.]
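
One common way to turn conditional independences into a parallel schedule is to color the dependency graph so that no two neighboring variables share a color; each color class can then be sampled in parallel. The sketch below uses a greedy coloring as a stand-in, since the poster does not say how AcMC2 assigns colors. The graph is a hand-built, conservative approximation of the mixture model with observations held fixed, and every name in it (`greedy_coloring`, `color_classes`, `z0`, `mean0`, and so on) is hypothetical.

```python
def greedy_coloring(neighbors):
    """Give each node the smallest color not used by an already-colored neighbor."""
    colors = {}
    for node in sorted(neighbors):
        used = {colors[n] for n in neighbors[node] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def color_classes(colors):
    """Group nodes by color; each group can be updated in one parallel step."""
    groups = {}
    for node, color in colors.items():
        groups.setdefault(color, []).append(node)
    return [sorted(groups[c]) for c in sorted(groups)]


# Latent variables of a 2-cluster, 4-point mixture (structure is an assumption).
assignments = [f"z{i}" for i in range(4)]
means = ["mean0", "mean1"]
neighbors = {v: set() for v in assignments + means}


def connect(a, b):
    neighbors[a].add(b)
    neighbors[b].add(a)


for z in assignments:
    for m in means:
        connect(z, m)        # each assignment depends on every cluster mean
connect("mean0", "mean1")    # co-parents of the observations, linked conservatively

schedule = color_classes(greedy_coloring(neighbors))
print(schedule)
```

In this example all four assignment variables land in one color class, which matches the intuition that they are independent of each other once the means are fixed.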

The TCGA Accelerator

• TCGA: The Computational Genomics Accelerator

• Reconfigurable architecture to multiplex accelerators

• Example accelerators: ASAP (LD accelerator) and Pharaoh (PairHMM FA accelerator)

[Figure: TCGA system architecture. Labels: host CPU (Symphony runtime environment, FPGA driver and library); PCIe; PCIe endpoint; storage device; NIC; GPU; control unit; pipelined interconnect; memory controller; on-board DRAM; on-chip memory; PCIe DMA controller; CAPI controller; P2P device access; processing elements PE(0,0) to PE(n,3) (application accelerators); systolic element; pipelined systolic connections; "Can use either PCIe or CAPI". Legend: runtime-system controlled, dynamically reconfigured, OS controlled, off-the-shelf IPs. Callouts 1-4.]

[Figure: Symphony runtime workflow. Labels: DFG specification; accelerated kernels; system architecture specification (FPGA, CPU, GPU); application workload statistics; static dependencies; dynamic dependencies; state estimation; Symphony runtime; DRL agent. Callouts 1-5.]

Results:

• 109× improvement (73 hours to under 40 minutes)

• 210× improvement in performance-per-watt terms

ASAP: LD Accelerator

[Figure: server interconnect topology. Labels: CPU 0, CPU 1, memory bus, memory, PCIe bus, PCIe switches, PCIe backplane, PCIe devices, GPUs, CAPI FPGA, NIC, "To Network". Annotation: source of bandwidth contention.]

Pharaoh: PairHMM FA Accelerator
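
To illustrate why a pair-HMM computation suits an array of processing elements, the sketch below evaluates a textbook three-state pair HMM (match, gap in y, gap in x) with the forward recurrence, sweeping the table one anti-diagonal at a time. It assumes "FA" stands for forward algorithm, and it says nothing about how Pharaoh is actually built. The transition and emission parameters (`delta`, `eps`, `mismatch`) are illustrative values, the end state is omitted, and the function name is hypothetical.

```python
def pair_hmm_forward(x, y, delta=0.1, eps=0.3, mismatch=0.1):
    """Forward probability of sequences x and y under a three-state pair HMM."""
    n, m = len(x), len(y)
    q = 0.25  # emission probability of a single base against a gap

    def p(a, b):
        # Joint emission of an aligned pair over a 4-letter alphabet.
        return (1.0 - mismatch) / 4.0 if a == b else mismatch / 12.0

    fm = [[0.0] * (m + 1) for _ in range(n + 1)]  # match state
    fx = [[0.0] * (m + 1) for _ in range(n + 1)]  # x emitted against a gap
    fy = [[0.0] * (m + 1) for _ in range(n + 1)]  # y emitted against a gap
    fm[0][0] = 1.0

    # Cell (i, j) reads only diagonals d-1 and d-2, so every cell on
    # anti-diagonal d = i + j is independent of the others on that diagonal.
    for d in range(1, n + m + 1):
        for i in range(max(0, d - m), min(n, d) + 1):
            j = d - i
            if i > 0 and j > 0:
                fm[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fm[i - 1][j - 1]
                    + (1 - eps) * (fx[i - 1][j - 1] + fy[i - 1][j - 1]))
            if i > 0:
                fx[i][j] = q * (delta * fm[i - 1][j] + eps * fx[i - 1][j])
            if j > 0:
                fy[i][j] = q * (delta * fm[i][j - 1] + eps * fy[i][j - 1])
    return fm[n][m] + fx[n][m] + fy[n][m]


print(pair_hmm_forward("ACGT", "ACGT"), pair_hmm_forward("ACGT", "TGCA"))
```

The inner loop over `i` is the part a systolic design could spread across processing elements, since each iteration touches only values produced on earlier diagonals.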

The Symphony Runtime

[Figure: resource activity vector (RAV) and RL agent. RAV labels: system load; memory load; external memory load; cache load; bandwidth; mem load (0 port, 1 port, 2+ port); core load; FP-arith; scalar; vector; divider ports; TLB load; interconnect load. RL agent labels: state, reward, action, value; actor; critic; inputs (quantized RAV distribution, DFG input map); s2v; actor network (1D CNN, RNN, FC, softmax, policy); critic network (1D CNN, RNN, FC, value).]

• Estimates contention for the system's architectural resources

• Uses a deep reinforcement learning model to make scheduling decisions
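
The two ideas above, a contention estimate and a policy that picks a placement, can be sketched without any learning. The code below is a hand-written stand-in for the learned actor: it scores each device by the overlap between a task's resource demand and the device's current load, then turns the scores into a softmax distribution over devices. The dot-product contention score, the resource names (borrowed from the figure labels), the load values, and all function names are assumptions made for illustration.

```python
import math

RESOURCES = ("memory", "cache", "interconnect", "core")


def contention(demand, load):
    """Estimated contention: overlap of task demand with existing device load."""
    return sum(demand[r] * load[r] for r in RESOURCES)


def placement_policy(demand, device_loads, temperature=0.1):
    """Softmax over negative contention: lower contention, higher probability."""
    scores = {dev: -contention(demand, load) / temperature
              for dev, load in device_loads.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {dev: math.exp(s - top) for dev, s in scores.items()}
    total = sum(weights.values())
    return {dev: w / total for dev, w in weights.items()}


def schedule(demand, device_loads):
    """Pick the most probable device under the policy."""
    policy = placement_policy(demand, device_loads)
    return max(policy, key=policy.get)


task = {"memory": 0.9, "cache": 0.2, "interconnect": 0.1, "core": 0.3}
loads = {
    "cpu": {"memory": 0.8, "cache": 0.5, "interconnect": 0.2, "core": 0.6},
    "gpu": {"memory": 0.1, "cache": 0.1, "interconnect": 0.7, "core": 0.2},
    "fpga": {"memory": 0.4, "cache": 0.0, "interconnect": 0.1, "core": 0.1},
}
print(schedule(task, loads), placement_policy(task, loads))
```

A learned agent would replace the fixed scoring rule with a network trained on observed rewards; the softmax output and the argmax placement are the parts this sketch shares with an actor network.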

Results:

• Minimizes interference between co-located workloads

• Performance stays within 90-95% of isolated performance