2015 Intel Big Data Software Summit
http://goto/bigdatasoftware
Enabling Near-Data Accelerators in Datacenters
Dave Ojika, Jayson Strayer, Gaurav Kaul, Prashanth Thinakaran, Darin Acosta
• Motivation
• Bring unconventional compute cores (especially
FPGAs) into mainstream big data use
• Abstract software complexity by introducing an
efficient accelerator programming model
• Enable a data-oriented framework for
near-memory, distributed processing
• Approach
• Accelerate computation on FPGA; transfer data
over low-latency DDR bus
• Provide in-memory storage using open-source
Tachyon framework
• Offload Spark workload to accelerator
• Method
• Use Compute-Near Memory (CNM) architecture
for design-space exploration
• Map data to cores with affinity to specific memory
regions
• Integrate a Java-OpenCL middleware to support
scheduling of tasks on accelerator
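The affinity idea in the Method bullets can be illustrated with a small sketch. It assumes each data block and each core is tagged with a memory region; the region names (`dimm0`, `dimm1`), the round-robin choice among local cores, and the function name are all hypothetical and not taken from the poster.

```python
from collections import defaultdict


def map_blocks_to_cores(block_regions, core_regions):
    """Assign each data block to a core that sits near the block's memory region.

    block_regions: dict of block id -> memory region id (hypothetical labels)
    core_regions:  dict of core id  -> memory region id the core has affinity to
    """
    cores_by_region = defaultdict(list)
    for core, region in sorted(core_regions.items()):
        cores_by_region[region].append(core)

    assignment = {}
    next_slot = defaultdict(int)  # round-robin cursor, one per region
    for block, region in sorted(block_regions.items()):
        local_cores = cores_by_region.get(region)
        if not local_cores:
            raise ValueError(f"no core has affinity to region {region!r}")
        assignment[block] = local_cores[next_slot[region] % len(local_cores)]
        next_slot[region] += 1
    return assignment


blocks = {"b0": "dimm0", "b1": "dimm0", "b2": "dimm1", "b3": "dimm0"}
cores = {"c0": "dimm0", "c1": "dimm0", "c2": "dimm1"}
placement = map_blocks_to_cores(blocks, cores)
```

With these inputs no block is ever placed on a core attached to a different region, which is the property the affinity bullet asks for.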
Highlights
Accelerator Overview
• Memory-speed data access
• Memory-centric buffer
synchronizes with
underlying file system
[Figure: accelerator register interface. Labels: Write Method, Read Method, Copy Method, Connect Object, Interface, Data Register, Cmd Register, 4 TB Image, write_bit, read_bit, copy_bit, R/W access markers.]
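One possible reading of the register-interface labels is sketched below. It assumes that write_bit, read_bit and copy_bit are flags in the command register and that the write, read and copy methods set one flag each; the bit positions, class names and the byte-array stand-in for the memory image are illustrative assumptions, not details given on the poster.

```python
from enum import IntFlag


class Cmd(IntFlag):
    # Bit positions are assumed for illustration only.
    WRITE = 0x1
    READ = 0x2
    COPY = 0x4


class RegisterInterface:
    """Toy model: a command register, a data register and a memory image."""

    def __init__(self, image_size):
        self.image = bytearray(image_size)  # small stand-in for the memory image
        self.cmd = Cmd(0)
        self.data = 0

    def _execute(self, addr, dst=None):
        # Act on whichever command bit is set, then clear the register.
        if self.cmd & Cmd.WRITE:
            self.image[addr] = self.data
        elif self.cmd & Cmd.READ:
            self.data = self.image[addr]
        elif self.cmd & Cmd.COPY:
            self.image[dst] = self.image[addr]
        self.cmd = Cmd(0)

    def write(self, addr, value):
        self.data = value
        self.cmd = Cmd.WRITE
        self._execute(addr)

    def read(self, addr):
        self.cmd = Cmd.READ
        self._execute(addr)
        return self.data

    def copy(self, src, dst):
        self.cmd = Cmd.COPY
        self._execute(src, dst)


dev = RegisterInterface(16)
dev.write(3, 0xAB)
dev.copy(3, 7)
value = dev.read(7)
```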
Workload Analysis
In-Memory Framework
Data and Compute Layer Current Developments
• Boosted Decision Tree (BDT)
• Latency-sensitive
• Poor data locality
• Fits in 4TB memory
• 7-fold cross-validation
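For readers unfamiliar with the 7-fold cross-validation bullet, here is a minimal sketch of how the folds can be formed. It assumes contiguous, unshuffled folds; the poster does not say how its folds were drawn, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=7):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(23, k=7))
```

Each sample lands in exactly one validation fold, so a model is trained seven times and every event is scored once by a model that did not see it.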
Hit     1 GB     10 GB
L1      94.06%   89.75%
DRAM    0.74%    5.74%
(quad-core i7 CPU, 8 GB RAM)
• Fraction of store-bound stalls increases with size of
dataset; memory bandwidth requirement too high for CPU
Workload can be trivially parallelized across DIMMs
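A quick calculation makes the shift in the hit table concrete. It only restates the four percentages reported above for the 1 GB and 10 GB datasets; the variable names are illustrative.

```python
# Percentages copied from the workload-analysis table.
hit = {
    "L1": {"1GB": 94.06, "10GB": 89.75},
    "DRAM": {"1GB": 0.74, "10GB": 5.74},
}

# Drop in the L1 figure, in percentage points.
l1_drop_points = hit["L1"]["1GB"] - hit["L1"]["10GB"]

# Growth factor of the DRAM figure when the dataset grows tenfold.
dram_growth = hit["DRAM"]["10GB"] / hit["DRAM"]["1GB"]

print(f"L1 drop: {l1_drop_points:.2f} points, DRAM growth: {dram_growth:.2f}x")
```

The L1 figure falls by about 4.3 points while the DRAM figure grows by a factor of roughly 7.8, which is consistent with the claim that memory pressure rises with dataset size.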
• 1st Place ATLAS ‘14 Higgs ML Challenge:
• Deep Learning from 0xdata’s H2O
• Where do FPGA accelerators stand?
• Explore BDT on CNM accelerator
High-energy physics experiment at CERN’s LHC
(collaboration with UF Physics)
• Simics Simulator
• Functional model
• Software stack
• Apps & workload exploration
[Figure: host software stack. Labels: Task, Task, Queue, Scheduler, Host Middleware, Driver, FPGA, Tachyon, File System (Local or HDFS).]
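The queue and scheduler labels suggest that host middleware queues tasks and hands them to the accelerator driver. The sketch below assumes a simple FIFO policy in which a task goes to the FPGA when its kernel has an accelerator implementation and falls back to the CPU otherwise; the class, the kernel names and the dispatch rule are hypothetical, and plain Python functions stand in for the real driver.

```python
from collections import deque


class Scheduler:
    """Toy host-side scheduler: FIFO queue, per-kernel choice of target."""

    def __init__(self, fpga_kernels, cpu_kernels):
        self.queue = deque()
        self.fpga_kernels = fpga_kernels  # kernel name -> callable (driver stand-in)
        self.cpu_kernels = cpu_kernels    # kernel name -> callable (CPU fallback)

    def submit(self, task_id, kernel, payload):
        self.queue.append((task_id, kernel, payload))

    def run(self):
        results = {}
        while self.queue:
            task_id, kernel, payload = self.queue.popleft()
            if kernel in self.fpga_kernels:
                target, fn = "fpga", self.fpga_kernels[kernel]
            else:
                target, fn = "cpu", self.cpu_kernels[kernel]
            results[task_id] = (target, fn(payload))
        return results


scheduler = Scheduler(
    fpga_kernels={"sum": sum},
    cpu_kernels={"sum": sum, "max": max},
)
scheduler.submit("t1", "sum", [1, 2, 3])
scheduler.submit("t2", "max", [4, 9, 2])
results = scheduler.run()
```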
• In-memory data exchange
• Reliable file sharing at
memory-speed
• Caching of working set files
in memory
• Fault-tolerant and distributed
API
Tachyon utilizes memory aggressively, leveraging data lineage
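The memory-centric behavior described for Tachyon can be mimicked in a few lines. This is a toy sketch and not the Tachyon API: the `MemoryCentricStore` class and its methods are hypothetical, a temporary directory plays the underlying file system, and lineage-based recovery is not modeled.

```python
import tempfile
from pathlib import Path


class MemoryCentricStore:
    """Toy store: serve reads from memory, synchronize to an underlying FS."""

    def __init__(self, under_fs_dir):
        self.under_fs = Path(under_fs_dir)
        self.memory = {}    # working set cached in memory
        self.dirty = set()  # names written but not yet synchronized

    def write(self, name, data):
        self.memory[name] = data
        self.dirty.add(name)

    def read(self, name):
        if name not in self.memory:
            # Cache miss: load from the underlying file system and cache it.
            self.memory[name] = (self.under_fs / name).read_bytes()
        return self.memory[name]

    def sync(self):
        for name in sorted(self.dirty):
            (self.under_fs / name).write_bytes(self.memory[name])
        self.dirty.clear()

    def evict(self, name):
        if name in self.dirty:
            raise RuntimeError("sync before evicting unsynchronized data")
        self.memory.pop(name, None)


with tempfile.TemporaryDirectory() as tmp:
    store = MemoryCentricStore(tmp)
    store.write("events.bin", b"\x01\x02\x03")
    store.sync()
    store.evict("events.bin")
    reloaded = store.read("events.bin")  # falls back to the underlying FS
```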
• OpenCL driver integration
• Container enablement
• Cloud orchestration
• NVM support and NFV
[Figure: data and compute layer. Labels: Application, API, Big Data Framework, API, Compute Near Memory (CNM).]
Prototype with PCI, DDR and
Direct I/O interfaces
JOC: Java-to-OpenCL Component
No-Higgs or Higgs
• BDT on CPU (2nd Place
ATLAS ‘14 Higgs ML Challenge)
Application to architecture transformation
• Utilize parallelism on FPGA
• Leverage low-latency DDR
and 100 GB optical links
[Figure: FPGA data path. Labels: 100 GB Transceiver IP, FIFO, Decoder (data filter), Data Reassembly.]
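The FIFO, decoder and reassembly stages can be pictured as a small streaming pipeline. The sketch assumes packets are `(event_id, sequence_number, payload)` tuples and that the data filter drops empty payloads; the packet layout and the filter rule are illustrative assumptions, since the poster names the stages without specifying them.

```python
from collections import deque


def decode_and_filter(fifo, keep):
    """Drain the FIFO, passing on only the packets accepted by `keep`."""
    while fifo:
        packet = fifo.popleft()
        if keep(packet):
            yield packet


def reassemble(packets):
    """Group fragments by event and join them in sequence-number order."""
    fragments = {}
    for event_id, seq, payload in packets:
        fragments.setdefault(event_id, []).append((seq, payload))
    return {
        event_id: b"".join(payload for _, payload in sorted(parts))
        for event_id, parts in fragments.items()
    }


fifo = deque([
    (1, 1, b"lo"),   # arrives before its predecessor
    (2, 0, b"xy"),
    (1, 0, b"hel"),
    (3, 0, b""),     # empty payload, removed by the filter
])
events = reassemble(decode_and_filter(fifo, keep=lambda p: len(p[2]) > 0))
```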
Level 2: FPGA as high-performance accelerator
Level 1: FPGA receives and pre-processes data in real-time
QSFP
• Direct I/O
• Real-time
• Low latency
• Low power
• Compute Engine
• Up to 3 TFLOPS
• OpenCL kernel
• BW of host memory
Altera Arria 10 FPGA
Generic implementation
[Figure: datastore synchronization. Labels: To Datastore, DRAM, DRAM, DRAM, Pre-processed dataset, Datastore, Synchronize.]
• In-memory data store
• Memory-centric distributed storage
• Reliable data sharing at memory speed
Development kit
Cloud Orchestration
Training time for 11 million events: 5 hours!
Xeon E5-2680 @ 2.8 GHz
BDT on MATLAB
• Prediction time: 370 ms
• Okay for online,
real-time prediction
• Training time: 5 hours
• Grew with increasing
data size
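The reported training time translates into a throughput figure with simple arithmetic. The calculation uses only the two numbers stated above (11 million events, 5 hours) and assumes the time is spread evenly across events.

```python
TRAINING_HOURS = 5
EVENTS = 11_000_000

training_seconds = TRAINING_HOURS * 3600           # 18,000 s
events_per_second = EVENTS / training_seconds      # average training throughput
ms_per_event = 1000 * training_seconds / EVENTS    # average cost per event

print(f"{events_per_second:.0f} events/s, {ms_per_event:.2f} ms per event")
```

That is roughly 611 events per second, or about 1.64 ms per event on average.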
• Data affinity
• Cores cooperate with each other for shared data accesses
• Shared Virtual Memory (SVM)
Accelerator model
• CPU and device both access shared data using the same virtual addresses
• No explicit data marshaling
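The difference between shared access and explicit marshaling can be shown by analogy in plain Python. This is not OpenCL SVM: a `memoryview` over a `bytearray` inside one process merely stands in for a buffer that host and device address identically, and `device_scale` is a hypothetical stand-in for a kernel.

```python
def device_scale(view, factor):
    """Stand-in for a device kernel: updates the buffer it is given in place."""
    for i in range(len(view)):
        view[i] = (view[i] * factor) % 256


host_data = bytearray([1, 2, 3, 4])

# Shared model: the "device" works on the same memory, so nothing is copied.
shared = memoryview(host_data)
device_scale(shared, 3)
after_shared = list(host_data)

# Marshaled model: the "device" works on an explicit copy, so the host
# buffer is unchanged until the result is copied back.
staged = bytearray(host_data)
device_scale(memoryview(staged), 2)
after_marshaled = list(host_data)
```

In the shared case the host sees the result immediately; in the marshaled case an extra copy-back step would be needed.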
Dave Ojika: Cloud Infrastructure
Jayson Strayer: Platform Silicon
Gaurav Kaul: Health and Life Sciences
Prashanth Thinakaran: Big Data
Darin Acosta: Physics Professor, UF