Abstract
The TCGA Accelerator
The Symphony Runtime
AcMC2: An HLS Engine for MCMC
Sequences to Systems: Smart Healthcare and Smart Systems
S. S. Banerjee, C. Y. Tan, M. el-Hadedy, D. Chen, W-m Hwu, S. Lumetta, Z. T. Kalbarczyk, R. K. Iyer
CS/ECE/Coordinated Science Lab, University of Illinois at Urbana-Champaign, [email protected]
• Identify common algorithmic patterns across the range of analytics in personalized medicine
• Design and implement accelerators that target these algorithmic patterns
• Build runtime systems that allow these workloads to execute in heterogeneous (accelerator-rich) cloud environments
• AcMC2: Compiles high-level specifications of probabilistic models (probabilistic programs) into optimized HDL
• Symphony: A machine-learning-based runtime system for executing data flow graphs on heterogeneous computing environments (CPUs, GPUs, FPGAs)
Acknowledgements
This material is based upon work supported in part by Xilinx, IBM, Intel and the National Science Foundation under Grant Nos. CNS 13-37732 and CNS 16-24790.
[Figure: plate-notation graphical model with variables x_n, z_n, π, and µ_k over plates N and K, shown alongside a hardware module with SE units (SE … SE_k), streaming I/O, a controller, and DRAM.]
type Pi;
#Pi ~ Poisson(2);
random Pi z(Integer i) ~ UniformChoice({p for Pi p});
random Real mean(Pi c) ~ UniformReal(-1, 1);
random Real x(Integer i) ~
    if z(i) != null then Gaussian(mean(z(i)), 1.0);
query size({p for Pi p}); // Number of clusters
query z(0); // Assignment of data point 0
obs x(0) = 0.2; obs x(1) = 1.0; // N in number
obs x(2) = 0.5; obs x(3) = 0.7;
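To make the generative reading of the program above concrete, the following is a minimal Python sketch that draws one possible world from the same model: a Poisson-distributed number of clusters, a uniform mean per cluster, a uniform cluster choice per data point, and a Gaussian observation around the chosen mean. It only performs forward sampling; it does not condition on the `obs` statements or answer the `query` statements, which is the job of the MCMC engine. The function names `sample_poisson` and `sample_world` are illustrative assumptions, not part of AcMC2 or BLOG.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's method: multiply uniform draws until the product drops below exp(-lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_world(rng, n_points=4):
    """Forward-sample one world of the mixture model (hypothetical helper)."""
    n_clusters = sample_poisson(rng, 2.0)                       # #Pi ~ Poisson(2)
    means = [rng.uniform(-1.0, 1.0) for _ in range(n_clusters)]  # mean(c)
    # z(i) is a uniformly chosen cluster, or None (null) when no cluster exists.
    z = [rng.randrange(n_clusters) if n_clusters else None for _ in range(n_points)]
    # x(i) is only defined when z(i) is not null.
    x = [rng.gauss(means[c], 1.0) if c is not None else None for c in z]
    return {"n_clusters": n_clusters, "means": means, "z": z, "x": x}


if __name__ == "__main__":
    print(sample_world(random.Random(0)))
```

Inference then amounts to finding the distribution over `n_clusters` and `z[0]` among worlds whose `x` values match the four observations.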
[Figure: AcMC2 stages, numbered 1 to 4. Panel titles: Probabilistic Program (model, query, data); Graphical Representation (random variables, factor functions; plate notation denotes repeats of variables); Intermediate Results; Hardware Module.]
Inter-task parallelism: Multiple SEs explore independent random walks
Intra-task parallelism: Conditional-independence-based pipelined sampling for every random variable
[Figure label: NoC Router]
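Inter-task parallelism can be illustrated in software as several Markov chains that share a target distribution but use independent random streams. The sketch below is an assumption-laden stand-in: it uses random-walk Metropolis-Hastings on a standard normal target and runs the chains one after another, whereas the hardware described here would run them concurrently. The names `log_target`, `run_chain`, and `run_independent_chains` are hypothetical.

```python
import math
import random


def log_target(x):
    """Unnormalized log-density of a standard normal (a stand-in target)."""
    return -0.5 * x * x


def run_chain(seed, n_steps=2000, step=1.0):
    """One random-walk Metropolis-Hastings chain with its own random stream."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        log_accept = log_target(proposal) - log_target(x)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_accept:
            x = proposal
        samples.append(x)
    return samples


def run_independent_chains(n_chains=4, **kwargs):
    """Each chain gets a distinct seed, so the walks do not interact."""
    return [run_chain(seed, **kwargs) for seed in range(n_chains)]


if __name__ == "__main__":
    chains = run_independent_chains(4)
    for i, chain in enumerate(chains):
        print(i, sum(chain) / len(chain))
```

Because the chains never exchange state, their samples can be pooled afterwards without any synchronization during sampling.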
[Figure: speedup labels 49.8×, 68.3×, 44.8×, 32.2×, 39.2×]
Results:
• 47–100× improvement in runtime performance over a 6-core IBM Power8 CPU
• 753–1600× improvement in performance-per-watt terms
[Figure labels: Accelerator; IBM PSL]
• Uses conditional independences to determine parallelism
• Allows execution of compositional MCMC: Gibbs, Hamiltonian & Metropolis-Hastings
[Figure: Markov blanket B_{x_i} of variable x_i.]
All variables of the same color can be executed in parallel.
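The coloring idea can be sketched with a greedy graph coloring. The assumption here is an undirected model (or a moralized graph), where a variable's Markov blanket is its set of neighbors: two variables with the same color share no edge, so each can be resampled given the rest without reading the other's new value. The helper names `greedy_coloring` and `parallel_batches` are hypothetical, and greedy coloring is a swapped-in heuristic rather than the method used by AcMC2.

```python
def greedy_coloring(neighbors):
    """Assign each node the smallest color not used by its already-colored neighbors."""
    colors = {}
    # Visit high-degree nodes first; ties keep the dictionary's insertion order.
    for node in sorted(neighbors, key=lambda n: len(neighbors[n]), reverse=True):
        used = {colors[m] for m in neighbors[node] if m in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def parallel_batches(colors):
    """Group nodes by color; each group can be sampled in parallel."""
    batches = {}
    for node, color in colors.items():
        batches.setdefault(color, []).append(node)
    return [sorted(batches[c]) for c in sorted(batches)]


if __name__ == "__main__":
    # A 2x2 grid: a-b, a-c, b-d, c-d.
    grid = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
    print(parallel_batches(greedy_coloring(grid)))  # [['a', 'd'], ['b', 'c']]
```

A full sweep then visits the batches in sequence and updates every variable inside a batch concurrently.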
[Figure: TCGA system architecture, with numbered callouts 1 to 4. Host-side labels: Host CPU, Symphony Runtime Env, FPGA Driver and Library, PCIe, Storage Device, NIC, GPU. FPGA-side labels: PCIe Endpoint, PCIe DMA Controller, CAPI Controller, P2P Device Access, Control Unit, Pipelined Interconnect, Memory Controller, On-board DRAM, On-chip Memory, and a grid of systolic elements PE(0,0) … PE(n,3) (Application Accelerators) with pipelined systolic connections. Annotation: can use either PCIe or CAPI. Legend: Runtime System Controlled; Dynamically Reconfigured; OS Controlled; Off-the-shelf IPs.]
• TCGA: The Computational Genomics Accelerator
• Reconfigurable architecture to multiplex accelerators
• Example accelerators:
[Figure: Symphony overview, with numbered callouts 1 to 5. Labels: DFG Specification; Accelerated Kernels (FPGA, CPU, GPU); System Arch Specification; Application Workload Statistics; Static Dependencies; Dynamic Dependencies; State Estimation; DRL Agent; Symphony Runtime.]
Results:
• 109× improvement (73 hours to under 40 minutes)
• 210× improvement in performance-per-watt terms
ASAP: LD Accelerator
[Figure: example system topology. Labels: CPU 0 and CPU 1 with a memory bus and memory; four PCIe switches; GPUs; CAPI FPGA; NIC (to network); PCIe devices; PCIe bus; PCIe backplane. Annotation: source of bandwidth contention.]
Pharaoh: PairHMM FA Accelerator
[Figure: Symphony scheduling model. Resource Activity Vector (RAV) labels: System Load; Memory Load; Divider; Ports; External Memory Load; Cache Load; Bandwidth; Mem Load (0 port, 1 port, 2+ port); Core Load; FP-Arith; Scalar; Vector; TLB Load; Interconnect Load. RL Agent labels: Actor; Critic; State; Reward; Action; Value. Input labels: Quantized RAV Distribution; DFG; Input Map; s2v. Actor Network labels: 1D CNN; RNN; Softmax; Policy. Critic Network labels: 1D CNN; RNN; FC; FC; Value.]
Estimate system resource contention of architectural resources
Deep Reinforcement Learning model for making scheduling decisions
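As a rough illustration of how an actor-critic agent can turn scores into a scheduling decision, the sketch below samples a device from a softmax policy and computes a one-step advantage from critic values. Everything in it is an assumption made for illustration: the scores stand in for actor-network outputs, the values stand in for critic-network outputs, and the names `softmax`, `choose_device`, and `advantage` are hypothetical rather than part of Symphony.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_device(scores, devices, rng):
    """Sample one device according to the softmax policy (hypothetical actor output)."""
    probs = softmax(scores)
    return rng.choices(devices, weights=probs, k=1)[0], probs


def advantage(reward, value_now, value_next, gamma=0.99):
    """One-step advantage: how much better the outcome was than the critic expected."""
    return reward + gamma * value_next - value_now


if __name__ == "__main__":
    rng = random.Random(0)
    devices = ["CPU", "GPU", "FPGA"]
    device, probs = choose_device([0.2, 1.0, 1.5], devices, rng)
    print(device, probs, advantage(1.0, 0.4, 0.6))
```

In a training loop, a positive advantage would raise the probability of the sampled device for that state and a negative one would lower it.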
Results:
• Minimizes interference between co-located workloads
• Performance deviations within 90-95% of isolated performance