novel architectures for applications in data science and ... · data science and beyond jason...
TRANSCRIPT
Novel Architectures for Applications inData Science and BeyondJason Riedy, Jeffrey Young, Tom ConteCenter for Research into Novel Computing Hierarchies at Georgia Tech
1 March 2019
Apps: Massive+-scale data analysis
Cyber-security Identify anomalies, malicious actors
Health care Find outbreaks, population epidemiology, similarpatient association
Social networks Advertising, searching, grouping
Intelligence Decisions at scale, regulating markets, smart &sustainable cities
Systems biology Understanding interactions, drug design
Power grid / Smart cities Disruptions, conservation, prediction
Irregular data access. Changing data.
Rogues Gallery — 1 Mar 2019 2/23
High-Performance Data Analysis (HPDA)
Novel applications:
• Data at scale and speed needs new ideas forcomputing analysis.
• “Big data” platforms fare poorly v. a single threadplus large SSD even for static data sets. (McSherry,Isard, Murray. “Scalability! But at what COST?” HotOSXV, 2015.)
• Many high-level codes are written and re-written toanswer one question: need flexibility.
• Some primitives may be tuned and re-used.
Rogues Gallery — 1 Mar 2019 3/23
HPDA and Architectures
• Current architectures are hitting limits onmanufacturing, heat dissipation, memory latency...
• New architecture proposals are difficult to evaluatevia simulation and modeling alone.
• What happens when novel prototypes hit reality?• Architects/designers need rapid feedback on newideas.
• New ideas only become successful with acommunity: software ecosystem, trained students.
Need bridges between apps and architects.
Rogues Gallery — 1 Mar 2019 4/23
Introducing the CRNCH Rogues Gallery
A physical & virtual space for hosting novel computingarchitectures, systems, and accelerators.
Emu Chick FPGAs & HMC/HBMFPAA
Amortize effort and cost of trying novel architectures.Break the “but it’s too much work” barrier.
http://crnch.gatech.edu/rogues-gallery
Rogues Gallery — 1 Mar 2019 5/23
Emu Technology’s PGAS Architecture
1 nodelet
Gossamer
Core 1
Memory-Side Processor
Gossamer
Core 4...
Migration Engine
RapidIODisk I/O
8 nodeletsper node
64 nodelets
per Chick
RapidIO
StationaryCore
• Multithreaded multicore
• Memory-side “processor” foroperations innarrow-channel DRAM
• Stationary core for OS
• Threads migrate inhardware on reads!
• Optimize for weak locality
Rogues Gallery — 1 Mar 2019 6/23
3D Stacked Memory and FPGAs
FPGAs enable flexible“near-memory” research for bothhardware and programmingmodels:• Micron/Pico EX700 & HMC• Nallatech 385s (Arria 10)• Intel Arria10 DevKit• Nallatech 520N (Stratix 10)• Xilinx MpSOC
Rogues Gallery — 1 Mar 2019 7/23
Neuromorphic Systems
• Field-Programmable Analog Array (FPAA) System-OnChip, designed in the lab of Dr. Jennifer Hasler.
• Analog arrays can achieve unprecedented power andsize reductions.
FPAA “driven” by a RPi Neuromorphic Workshop,27 Apr 2018
Rogues Gallery — 1 Mar 2019 8/23
Flexible Infrastructure
login /notebook
rg-admSlurm Ctl
toolbox (NFS)
Scheduling,Tools, and Admin
Key:
Schedulable Resource
Physical Resource
VM
USB device
User Resources
fpaa-host
power-hostnvidia-tegra-N
nvidia-tegra-1
fpaa-dev
rg-dbSlurm DBD
emu-dev emu-chick
..Nfpga-dev-1
fpga-hmcfpga-intel
Powell, Riedy, Young, and Conte. “Wrangling Rogues: ManagingExperimental Post-Moore Architectures.”
https://arxiv.org/abs/1808.06334
• Available. Plans tointegrate with NSFXSEDE.• Scheduler beingdeployed.• IncorporatesSingularity and virtualmachines forOS/library versioning.• Fun tools: usbip forFPAA... Analog inputs?
Rogues Gallery — 1 Mar 2019 9/23
Rogues Gallery Press
On The Next Platform, a partner of The Register:
https://www.nextplatform.com/2018/08/27/a-rogues-gallery-of-post-moores-law-options/
Rogues Gallery — 1 Mar 2019 10/23
Neuromorphic Workshop - April 2018
FPAA-focused workshop led by Dr.Hasler introduced the widercommunity to FPAA programmingfor neuromorphic systems.• Tutorial-style class• Lightning talks from GT andDoE researchers
• http://crnch.gatech.edu/neuro-workshop18
Rogues Gallery — 1 Mar 2019 11/23
Rogues Gallery ASPLOS Tutorial
Programming Novel Architectures in the Post-Moore Erawith The Rogues Gallery
https://crnch-rg.gitlab.io/rg/asplos-2019/
• To be held at ASPLOS 2019 in Providence, RI.• 14 April 2019, 8.30am – 12.00pm• Architecture detailed descriptions• Hands-on with the Emu Chick and (hopefully) theFPAA
Rogues Gallery — 1 Mar 2019 12/23
Rogues Gallery VIP Team
Rogues Gallery VIP team has started in Spring 2019. Thiscourse allows undergraduates to get research experience
working with software for novel architectures.http://www.vip.gatech.edu/teams/
new-team-rogues-galleryRogues Gallery — 1 Mar 2019 13/23
Selected Results: Emu Pointer-Chasing BenchmarkData-dependent loads, fine-grained access1
Ordered
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Intra-block shuffle: weak locality
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Full block shuffle: weak locality
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Vuduc, Riedy. “An Initial Characterization of the EmuChick,” Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018.
Rogues Gallery — 1 Mar 2019 14/23
Selected Results: x86 Pointer-Chasing Benchmark
1 4 16 64 256 1K 4K 16K
64K
256K 1M 4M
Block size (number of 16B elements)
0
20
40
60
80
100M
emor
y ba
ndwi
dth
(GB\
s)peak STREAM bandwidth
56 threads
1 4 16 64 256 1K 4K 16K
64K
256K 1M 4M
Block size (number of 16B elements)
peak STREAM bandwidth112 threads
block_shuffle intra_block_shuffle full_block_shuffle
Haswell results, every pattern is different.2
2 Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Riedy, Vuduc, Conte. “A Microbenchmark Characterizationof the Emu Chick.” https://arxiv.org/abs/1809.07696
Rogues Gallery — 1 Mar 2019 15/23
Selected Results: Emu Pointer-Chasing Benchmark
1 4 16 64 256 1K 4K 16K
64K
256K 1M 4M
Block size (number of 16B elements)
0
2
4
6
8
10
12
Mem
ory
band
widt
h (G
B\s) peak STREAM bandwidth
2048 threads
1 4 16 64 256 1K 4K 16K
64K
256K 1M 4M
Block size (number of 16B elements)
peak STREAM bandwidth4096 threads
block_shuffle intra_block_shuffle full_block_shuffle
Mostly flat performance, high utilization.2
Rogues Gallery — 1 Mar 2019 16/23
Selected Results: BFS on a Dynamic Data Structure
15 16 17 18 19 20 21scale
0
20
40
60
80
100
MTE
PS
Emu single node - CilkEmu multi-node - Cilk
x86 Haswell - STINGERx86 Haswell - Cilk
0
500
1000
1500
Edge
Ban
dwid
th (M
B/s)
Note: Streaming data structure, not statically optimized. 3
3 Hein, Eswar, Abdurrahman Yasar, Prasanth Chatarasi, Li, Young, Conte, Ümit Çatalyürek, Vuduc, Riedy, Bora Uçar.“Programming Strategies for Irregular Algorithms on the Emu Chick.” https://arxiv.org/abs/1901.02775
Rogues Gallery — 1 Mar 2019 17/23
Selected Results: Labeled Subgraph Alignment
1 2 4 8 16 32 64 128Number of Threads
0
10
20
30
40
50
Sp
eed
up
Multi-BLK
Multi-HCB
Single-BLK
Single-HCB
gsaNA, the first parallel algorithm, strong scaling on DBLPgraph (2048 vertices)3
Rogues Gallery — 1 Mar 2019 18/23
Selected Results: FPGAs
Hadidi, Asgari, Young, Mudassar, Garg, Krishna, Kim. “PerformanceImplications of NoCs on 3D-Stacked Memories: Insights from the
Hybrid Memory Cube (HMC),” ISPASS 2018
• Characterizationswith FPGA and HybridMemory Cube showlatency/bandwidthtradeoff.• Other FPGA work isfocused on compilers,HPC prototyping, andsparse algorithms forIntel and Xilinx FPGAS.
Rogues Gallery — 1 Mar 2019 19/23
Lessons Learned, Emu Chick i
• Finding appropriate metrics is difficult, e.g. for theEmu:
• Comparing ASICs (e.g. x86) to FPGA-based prototypescan be unfair either way.
• Fraction of peak bandwidth for the idealizedproblem?
• SpMV: FLOP/s ∝ BW, level 2 sparse BLAS op.• Graph500 BFS: TEPS ∝ BW
Rogues Gallery — 1 Mar 2019 20/23
Lessons Learned, Emu Chick ii
• Distilling observations on architecture↔
programming model:• Program data location for load (BW) balance.• Remote memory operations v. migration exposes thearchitecture.
• Migrations cost more than it appears. Computation?• Stack spills/access can cause ping-ponging.• How does HW support for top-down (Cilk-ish) affectbottom-up (UPC/SHMEM) PGAS programming?
• Memory allocation similar to UPC, SHMEM• UPC++ rpc_ff v. Emu thread migration?
Rogues Gallery — 1 Mar 2019 21/23
AcknowledgmentsFantastic students and colleagues:
• Srinivas Eswar (GT CSE)• Dr. Eric Hein (GT ECE⇒ Emu)• Patrick Lavin (GT CSE)• Dr. Jiajia Li (GT CSE⇒ PNNL)• Abdurrahman Yaşar (GT CSE)
• Dr. Ümit Çatalürek (GT CSE)• Dr. Tom Conte (GT CS/ECE)• Dr. Bora Uçar (ENS Lyon CNRS)• Dr. Rich Vuduc (GT CSE)• Dr. Jeffrey S. Young (GT CS)
Code (ideally will have links from crnch.gatech.edu):
• https://gitlab.com/crnch-rg (soon)• https://github.com/ehein6/emu-microbench
Other testbeds:
• ORNL: ExCL• PNNL: CENATE
• Argonne• Sandia
• Berkeley: AQCT• (others?)
Rogues Gallery — 1 Mar 2019 22/23
Rogues Gallery: Active and Growing• Added FPGA resources, integrating FPAAs• Tight development loop with Emu• Active research projects and publications• Community outreach and education underway• One GT student has made the leap to industryalready...
CRNCH Rogues Gallery connects researchers andstudents with novel architectures and architects with
upcoming applications.
Let us host / manage your neat stuff!http://crnch.gatech.edu/rogues-gallery
Rogues Gallery — 1 Mar 2019 23/23