Accelerated ML on cloud FPGAs

Christoforos Kachris

[email protected]


Page 1: Accelerated ML on cloud FPGAs

Christoforos Kachris

[email protected]

Accelerated ML on cloud FPGAs

Page 2:

What software developers/users want

Source: Databricks, Apache Spark Survey 2016, Report

2

Page 3:

What software developers/users want

Source: Databricks, Apache Spark Survey 2016, Report

3

Page 4:

A short history of computing performance

4

Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018

A domain-specific architecture for deep neural networks

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson

Multi core

Single core

High frequency

Page 5:

What’s left for faster computing?

5

David Patterson, 2019

?

Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018

A domain-specific architecture for deep neural networks

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson

Multi core

Single core

High frequency

Page 6:

Computing power to train a model

6

In 2018, OpenAI found that the amount of computational power used to train the largest AI models had doubled every 3.4 months since 2012.

https://www.technologyreview.com/s/614700/the-computing-power-needed-to-train-ai-is-now-rising-seven-times-faster-than-ever-before/

OpenAI: https://openai.com/blog/ai-and-compute/#addendum
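That doubling rate compounds quickly. A back-of-the-envelope check of what a 3.4-month doubling time implies (illustrative arithmetic only):

```python
def growth_factor(months, doubling_months=3.4):
    """Multiplicative increase in training compute over `months` months,
    given a fixed doubling time (3.4 months per the OpenAI analysis)."""
    return 2 ** (months / doubling_months)

per_year = growth_factor(12)       # roughly 11.6x more compute per year
one_doubling = growth_factor(3.4)  # exactly 2x, by construction
```

At that pace, compute grows by about an order of magnitude every year, which is far faster than the roughly 2x every two years of Moore's Law.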

Page 7:

Processing requirements in DNN

7

Page 8:

Data Center traffic

Christoforos Kachris, Microlab@NTUA 8

Page 9:

Data Center Requirements

˃ Traffic requirements in data centers increase significantly, but the power budget remains the same (Source: ITRS, HiPEAC, Cisco)

FPL 2016, Christoforos Kachris, ICCS/NTUA, September 2016 9

[Chart: Traffic growth in data centers versus power constraints, 2012-2019; traffic growth outpaces transistor count, power per chip, and heat load per rack]

Page 10:

https://www.iea.org/reports/data-centres-and-data-transmission-networks

10

Page 11:

Data Science: need for high computing power

Christoforos Kachris, Microlab@NTUA 11

Page 12:

Christoforos Kachris, Microlab@NTUA 12

Page 13:

How Big are Data Centers

Data Center Site Sq ft

Facebook (Santa Clara) 86,000

Google (South Carolina) 200,000

HP (Atlanta) 200,000

IBM (Colorado) 300,000

Microsoft (Chicago) 700,000

Christoforos Kachris, Microlab@NTUA

[Source: “How Clean is Your Cloud?”, Greenpeace 2011]

Wembley Stadium: 172,000 square ft

13

Page 14:

Google data center

Christoforos Kachris, Microlab@NTUA 14

Page 15:

Data Centers Power Consumption

• Data centers consumed 330 billion kWh in 2007, and consumption is expected to reach 1,012 billion kWh by 2020

2007 (Billion KWh) 2020 (Billion KWh)

Data Centers 330 1012

Telecoms 293 951

Total Cloud 623 1963

15Christoforos Kachris, Microlab@NTUA

[Source: How Clean is Your Data Center?, Greenpeace, 2012]

Page 16:

Data Center power consumption

Christoforos Kachris, Microlab@NTUA 16

Page 17:

Power consumption

17

Page 18:

Hardware acceleration

Hardware acceleration is the use of specialized hardware components to perform some functions faster (10x-100x) than is possible in software running on a more general-purpose CPU.

˃ Hardware acceleration can be performed either by specialized chips (ASICs) or

˃ by programmable specialized chips (FPGAs) that can be configured for specific applications

Christoforos Kachris, Microlab@NTUA 18
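Where the 10x-100x comes from: a specialized execution path avoids the overhead of a general-purpose one. As a software analogy (not FPGA code), compare an interpreted element-by-element loop with a single call into an optimized, compiled kernel:

```python
import time
import numpy as np

def dot_interpreted(a, b):
    # General-purpose path: the interpreter dispatches one element at a time
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

def dot_specialized(a, b):
    # Specialized path: one call into an optimized, compiled kernel
    return float(np.dot(a, b))

n = 200_000
a, b = np.random.rand(n), np.random.rand(n)

t0 = time.perf_counter()
r_slow = dot_interpreted(a.tolist(), b.tolist())
t1 = time.perf_counter()
r_fast = dot_specialized(a, b)
t2 = time.perf_counter()

speedup = (t1 - t0) / (t2 - t1)  # typically well over 10x on this workload
```

The same principle, applied at the circuit level rather than the library level, is what an ASIC or FPGA accelerator exploits.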

Page 19:

Data Center applications

Christoforos Kachris, Microlab@NTUA 19

Page 20:

Hardware Accelerators – Why is it faster?

Switch from sequential processing to parallel processing

20

Page 21:

Hardware accelerators

FPL 2016, Christoforos Kachris, ICCS/NTUA, September 2016

• HW acceleration can significantly reduce the execution time and energy consumption of several applications (10x-100x)

21

[Source: Xilinx, 2016]

Page 22:

FPGAs in the data centers

Christoforos Kachris, Microlab@NTUA 22

Page 23:

CPU vs GPU vs FPGA

23

A GPU is effective at processing the same set of operations in parallel: single instruction, multiple data (SIMD).

An FPGA is effective at processing the same or different operations in parallel: multiple instructions, multiple data (MIMD), with specialized circuits for each function.

[Diagram: a CPU core with control logic, a few ALUs, cache, and DRAM; a GPU with about 2,880 cores and DRAM; an FPGA with more than 2M logic cells, Block RAM, and DRAM]
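The SIMD/MIMD distinction can be sketched in software terms (an analogy, not device code): a GPU-style step applies one instruction across all data at once, while an FPGA-style fabric can run different operations side by side on the same data:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

data = np.arange(8, dtype=np.float64)

# GPU-style SIMD: one instruction applied to every element at once
simd_result = data * 2.0

# FPGA-style MIMD: independent "circuits", each performing a different
# operation in parallel (here simulated with a thread pool)
ops = [np.square, np.sqrt, np.exp, np.log1p]
with ThreadPoolExecutor(max_workers=len(ops)) as pool:
    mimd_results = list(pool.map(lambda f: f(data), ops))
```

On a real FPGA the "different operations" are distinct hardware pipelines, so they run concurrently without time-slicing.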

Page 24:

Specialization

One of the most sophisticated systems in the universe is based on specialization

24

Page 25:

Processing Platforms

˃ HW acceleration can significantly reduce the execution time and energy consumption of several applications (10x-100x)

Christoforos Kachris, Microlab@NTUA 25

Page 26:

Intel Xeon + FPGAs

Christoforos Kachris, Microlab@NTUA 26

Page 27:

Xeon and FPGA in the Cloud

Christoforos Kachris, Microlab@NTUA 27

Page 28:

FPGAs for DNN

˃ The xDNN processing engine has dedicated execution paths for each type of command (download, conv, pooling, element-wise, and upload). This allows convolution commands to be run in parallel with other commands if the network graph allows it

28
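A toy illustration of that scheduling idea: run each command as soon as all of its dependencies have finished. The command names mirror the xDNN command types, but the graph and the list scheduler below are made up for illustration, not xDNN internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical network graph: each command lists the commands it depends on.
graph = {
    "download": [],
    "conv1": ["download"],
    "conv2": ["download"],      # independent branch: can overlap with conv1
    "pool1": ["conv1"],
    "eltwise": ["pool1", "conv2"],
    "upload": ["eltwise"],
}

def run(cmd):
    # Stand-in for dispatching a command to its dedicated execution path
    return f"done:{cmd}"

def execute(graph):
    """Wave-based list scheduling: issue every command whose deps are done."""
    done, order = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(graph):
            ready = [c for c, deps in graph.items()
                     if c not in done and all(d in done for d in deps)]
            for c, _ in zip(ready, pool.map(run, ready)):
                done.add(c)
                order.append(c)
    return order

order = execute(graph)  # conv1 and conv2 are issued in the same wave
```

With dedicated paths per command type, the two convolution branches occupy different hardware and genuinely overlap.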

Page 29:

FPGAs for DNN – Throughput & Latency

29

https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

Page 30:

FPGAs for DNN – Throughput & Latency

30

https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

Page 31:

FPGAs for DNN – Energy efficiency

31

Page 32:

Intel FPGAs for DNN

32

https://software.intel.com/content/www/us/en/develop/blogs/accelerate-computer-vision-from-edge-to-cloud-with-openvino-toolkit.html

Page 33:

FPGAs vs GPUs in DNN

33

Page 34:

GPU vs FPGA for DNN

34

Page 35:

HW Accelerators for Cloud Computing

FPL 2016, Christoforos Kachris, ICCS/NTUA, September 2016 35

A Survey on Reconfigurable Accelerators for Cloud Computing, FPL 2016 Kachris

Page 36:

Speedup vs Energy efficiency

FPL 2016, Christoforos Kachris, ICCS/NTUA, September 2016 36

Copyright: Christoforos Kachris, ICCS/NTUA

A Survey on Reconfigurable Accelerators for Cloud Computing, FPL 2016 Kachris

Page 37:

Speedup per category

˃ PageRank applications achieve the highest speedup

˃ Memcached applications achieve the highest energy efficiency

FPL 2016, Christoforos Kachris, ICCS/NTUA, September 2016 37

[Bar chart: Speedup and energy efficiency per category (PageRank, ML, Memcached, Databases); reported values range from 1.5x to 18x]

Page 38:

Catapult FPGA Acceleration Card

Christoforos Kachris, Microlab@NTUA 38

Page 39:

www.vineyard-h2020.eu

FPGA as a Service

• Amazon EC2 F1's Xilinx FPGA

39

Page 40:

Is there a market?

40

• The global Data Center Accelerator market size is expected to reach $35 billion by the end of 2025 [1].

• The market for FPGAs is expected to grow at the highest rate, owing to the increasing adoption of FPGAs for acceleration of enterprise workloads [1].

[1] https://www.marketwatch.com/press-release/at-387-cagr-data-center-accelerator-market-size-is-expected-to-exhibit-35020-million-usd-by-2025-2019-10-15

[Figure: available FPGAs (Intel)]

Page 41:

Heterogeneous DCs for energy efficiency

Christoforos Kachris, Microlab@NTUA

“The only way to differentiate server offerings is through accelerators, like we saw with cell phones” (Leendert van Doorn, AMD; OpenServer Summit 2014)

TODAY’s DCs vs. future heterogeneous DCs with VINEYARD infrastructure

[Diagram: today’s DCs are racks of servers with processors only; future DCs add a run-time manager and orchestrator, a run-time scheduler, 3rd-party HW accelerators, and VINEYARD servers with dataflow-based accelerators (DFEs), all running Big Data applications]

Requirements today: low performance, high power consumption, best effort
With VINEYARD: higher performance, lower power consumption, predictable performance

41

Page 42:

VINEYARD Heterogeneous Accelerators-based Data centre

Christoforos Kachris, Microlab@NTUA

[Diagram: Big Data applications (bioinformatics, finance, analytics) enter through the VINEYARD programming framework and APIs; synthesis (OpenSPL, OpenCL) turns commonly used functions/tasks (pattern matching, analytics engines, string matching, compression, encryption, other processing) into a library of hardware functions as IP blocks held in a repository; a HW manager, scheduler, and cluster resource manager map work, according to throughput, latency, and power requirements, onto server racks with commodity processors, racks with programmable dataflow engine (DFE) accelerators, and racks with MPSoC FPGAs with programmable logic]

42

Page 43:

www.vineyard-h2020.eu

VINEYARD Framework

43

• Accelerators stored in an AppStore

• Cloud users request accelerators based on application requirements

• Decouples hardware and software designers

[Diagram: cloud computing applications call the accelerator API of the VINEYARD cloud resource manager (accelerator controller, accelerator virtualization, scheduler, driven by performance/energy); 3rd-party IP developers publish a library of hardware accelerators as IP blocks to an IP accelerator App store; cloud tenants run on a heterogeneous data center of processors, dataflow engines (DFEs), and processor+FPGA nodes]

Page 44:

AWS options

44

Page 45:

Performance evaluation on Machine Learning

˃ Up to 15x speedup for

Logistic regression

classification

˃ Up to 14x speedup for

K-means clustering

˃ Spark- GPU* (3.8x – 5.7x)

45

[Bar charts: Logistic regression execution time, MNIST 24GB, 100 iter. (secs): r5d.4x vs f1.4x (InAccel), 15x speedup; K-means clustering execution time, MNIST 24GB, 100 iter. (secs): r5d.4x vs f1.4x (InAccel), 14x speedup; each bar is split into data preprocessing, data transformation, and ML training]

*[Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters]

1st to offer ML acceleration on the cloud using FPGAs

Page 46:

ML training

46

https://inaccel.com/cpu-gpu-or-fpga-performance-evaluation-of-cloud-computing-platforms-for-machine-learning-training/

Page 47:

Unique FPGA orchestrator by InAccel

47

Seamless integration with C/C++, Python, Java and Scala

Automatic virtualization and scheduling of the applications to the FPGA cluster

Fully scalable: scale-up (multiple FPGAs per node) and scale-out (multiple FPGA-based servers over Spark)

[Diagram: applications run on the InAccel Coral resource manager; the InAccel runtime provides resource isolation on top of the FPGA drivers, the server, and the FPGA kernels]

Automating deployment, scaling, and management of FPGA clusters

Page 48:

Current limitations for FPGA deployment

˃ Currently only one application can talk to a single FPGA accelerator through OpenCL

˃ An application can talk to only a single FPGA

˃ Complex device sharing
• from multiple threads/processes
• even from the same thread

˃ Explicit allocation of the resources (memory/compute units)

˃ Users need to specify which FPGA to use (device ID, etc.)

[Diagram: App1 talks through vendor drivers to a single FPGA]

48

Page 49:

From single instance to data centers

˃ Easy deployment

˃ Instant scaling

˃ Seamless sharing

˃ Multiple-users

˃ Multiple applications

˃ Isolation

˃ Privacy

49

InAccel FPGA Orchestrator

Kubernetes cluster

Page 50:

Universities

˃ How do you allow multiple students to share the available FPGAs?

˃ Many universities have a limited number of FPGA cards that they want to share among multiple students.

˃ The InAccel FPGA orchestrator allows multiple students to share one or more FPGAs seamlessly.

˃ Students simply invoke the function they want to accelerate, and the InAccel FPGA manager performs the serialization and scheduling of the functions on the available FPGA resources.

50

InAccel FPGA Orchestrator

Lab1 Lab2 Lab3

Page 51:

Universities

˃ But researchers want exclusive access

˃ The InAccel orchestrator lets administrators select which FPGA cards are available to multiple students and which FPGAs are allocated exclusively to researchers and Ph.D. students (so they can get accurate measurements for their papers).

˃ The FPGAs shared among multiple students operate on a best-effort basis (the InAccel manager serializes the requested accesses), while researchers have exclusive access to their FPGAs with zero overhead.

51

InAccel FPGA Orchestrator

Lab1 Lab2 Researcher

Shared Shared Exclusive access

Page 52:

52

Instant Scalability

Distribution of multi-threaded applications to multiple clusters with a single command

Page 53:

53

[Diagram: InAccel deployment options: on oneAPI; on an OS with a hypervisor; on an OS with a container runtime]

Page 54:

From IaaS to PaaS and SaaS for FPGAs

54

[Diagram: the stack (servers with FPGAs, virtualization/sharing, operating system, middleware, runtime, applications) shown for Infrastructure as a Service, Platform as a Service, and Software as a Service; the FPGA orchestrator and an FPGA repository with accelerators supply the platform and software layers]

Page 55:

Seamless Integration with any framework

55

KUBESPHERE

Page 56:

Lab Exercise

˃ In this lab you are going to create your first accelerated application

˃ Use scikit-learn to find out the speedup you get when running the Naive Bayes algorithm using the original (CPU) and the FPGA implementation.

56
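A minimal CPU baseline for that comparison, using scikit-learn's Gaussian Naive Bayes on synthetic data. The dataset and sizes here are placeholders; the FPGA-accelerated variant is supplied by the lab environment and is not shown:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the lab dataset
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

# CPU reference implementation: this is the number the FPGA run is
# compared against when computing the speedup
model = GaussianNB()
t0 = time.perf_counter()
model.fit(X, y)
cpu_train_s = time.perf_counter() - t0
accuracy = model.score(X, y)
```

The speedup is then simply `cpu_train_s` divided by the training time measured with the FPGA implementation.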

Page 57:

Conclusions

˃ Future data centers will have to sustain a huge amount of network traffic

˃ However, the power consumption will have to remain almost the same

˃ FPGA acceleration is a promising solution for machine learning, providing high-throughput, low-latency, and energy-efficient processing

Christoforos Kachris, Microlab@NTUA 57

Page 58:

Domain Specific Accelerators

The amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period)

58

Page 59:

Distributed ML

˃ CSCS: Europe’s top supercomputer (3rd in the world) • 4,500+ GPU nodes, state-of-the-art interconnect

˃ Task: Image classification (ResNet-152 on ImageNet)

Single-node time (TensorFlow): 19 days

1,024 nodes: 25 minutes (in theory)

59
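The 25-minute figure is roughly what ideal linear scaling predicts; a quick sanity check:

```python
def ideal_time_minutes(single_node_days, nodes):
    # Perfect linear scaling: the same work divided evenly across nodes,
    # ignoring communication and synchronization overheads
    return single_node_days * 24 * 60 / nodes

t = ideal_time_minutes(19, 1024)  # about 26.7 minutes, close to the quoted 25
```

Real distributed training falls short of this because gradient exchange and synchronization add overhead that grows with node count.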

Page 60:

Distributed ML

˃ Parallelism in distributed machine learning

˃ Data parallelism trains multiple instances of the same model on different subsets of the training dataset

˃ Model parallelism distributes parallel paths of a single model to multiple nodes

60

A Survey on Distributed Machine Learning: https://arxiv.org/ftp/arxiv/papers/1912/1912.09789.pdf
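The data-parallel scheme can be sketched with a toy least-squares model: each simulated node computes a gradient on its own data shard, and the gradients are averaged (the all-reduce step) before the shared model is updated. Model, sizes, and learning rate are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(5.0)
y = X @ true_w                      # noiseless targets for the toy problem

def gradient(w, Xs, ys):
    # Least-squares gradient computed on one node's shard only
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

w = np.zeros(5)
shards = np.array_split(np.arange(len(y)), 4)   # 4 simulated nodes
for _ in range(200):
    grads = [gradient(w, X[idx], y[idx]) for idx in shards]
    w -= 0.1 * np.mean(grads, axis=0)           # aggregate, then update
```

Every node holds a full copy of `w`; only gradients cross the (simulated) network, which is why data parallelism scales well when the model fits on one node.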

Page 61:

˃ Centralized systems (Figure 3a) employ a strictly hierarchical approach to aggregation, which happens in a single central location.

˃ Decentralized systems allow for intermediate aggregation, either with a replicated model that is consistently updated when the aggregate is broadcast to all nodes, as in tree topologies (Figure 3b), or with a partitioned model that is shared over multiple parameter servers (Figure 3c).

˃ Fully distributed systems (Figure 3d) consist of a network of independent nodes that ensemble the solution together and where no specific roles are assigned to certain nodes

61

Page 62:

Distributed ML ecosystem

62

Page 63:

Data Science and ML platforms

63

Page 64:

FPGA for ML

˃ In many applications, the neural network is trained on back-end CPU or GPU clusters

˃ FPGAs are very suitable for latency-sensitive real-time inference jobs:

Unmanned vehicles
Speech recognition
Audio surveillance
Multimedia

64

Page 65:

CPU vs FPGAs

65

http://cadlab.cs.ucla.edu/~cong/slides/HALO15_keynote.pdf

Page 66:

Machine Learning on FPGAs

˃ Classification

Naïve Bayes

˃ Training

Logistic regression

˃ DNN

ResNet50

66

Page 67:

Jupyter - JupyterHub

˃ Deploy and run your FPGA-accelerated applications using Jupyter Notebooks

˃ The InAccel manager allows the instant deployment of FPGAs through JupyterHub

67

Page 68:

JupyterHub on FPGAs

˃ Instant acceleration of Jupyter Notebooks with zero code changes

˃ Offload the most computationally intensive tasks to FPGA-based servers

68

Authentication

Spawner

Kubernetes cluster

Page 69:

FPGA flow

69

FPGA logic design using Xilinx Vivado on a C4 or M4 instance → FPGA place-and-route using Xilinx Vivado on a C4 or M4 instance → Generate a bitstream → Program the FPGA

Page 70:

Bitstream repository

˃ The FPGA Resource Manager is integrated with a bitstream repository that is used to store FPGA bitstreams

70

[Diagram: application → FPGA bitstream repository → FPGA cluster]

https://store.inaccel.com

Page 71:

Lab Exercise

˃ Introduction

˃ Creating a Bitstream Artifact

˃ Running the first FPGA accelerated application

˃ Scikit-Learn on FPGAs

˃ Naive Bayes Example

˃ Logistic Regression Example

https://edu.inaccel.com/

71

Page 72:

Useful links

˃ MIT: Tutorial on Hardware Accelerators for Deep Neural Networks

http://eyeriss.mit.edu/tutorial.html

˃ Intel

https://software.intel.com/content/www/us/en/develop/training/course-deep-learning-inference-fpga.html

˃ UCLA: Machine Learning on FPGAs

http://cadlab.cs.ucla.edu/~cong/slides/HALO15_keynote.pdf

˃ Distributed ML

https://www.podc.org/data/podc2018/podc2018-tutorial-alistarh.pdf

72

Page 73:

AI chip Landscape

73

https://basicmi.github.io/AI-Chip/

Page 74:

Spectrum of new architectures for DNN

74

Page 75:

DNN requirements

˃ Throughput

˃ Latency

˃ Energy

˃ Power

˃ Cost

75

Page 76:

˃ Optimized hardware acceleration of both AI inference and other performance-critical functions by tightly coupling custom accelerators into a dynamic-architecture silicon device.

˃ This delivers end-to-end application performance that is significantly greater than a fixed-architecture AI accelerator like a GPU.

76

Page 77:

Roofline

77
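The roofline model behind such plots is simple enough to compute directly: attainable performance is the minimum of the compute roof and the memory roof. The device numbers below are illustrative, not a specific CPU, GPU, or FPGA:

```python
def roofline_gflops(peak_gflops, mem_bw_gb_s, arith_intensity):
    """Attainable GFLOP/s = min(compute roof, bandwidth * FLOP/byte)."""
    return min(peak_gflops, mem_bw_gb_s * arith_intensity)

# Illustrative device: 1 TFLOP/s peak, 100 GB/s DRAM bandwidth
peak, bw = 1000.0, 100.0
low_ai  = roofline_gflops(peak, bw, 2.0)    # memory-bound kernel
high_ai = roofline_gflops(peak, bw, 50.0)   # compute-bound kernel
ridge = peak / bw                           # ridge point, in FLOP/byte
```

Kernels below the ridge point (here 10 FLOP/byte) are limited by memory traffic, which is one reason reduced-precision datatypes and on-chip Block RAM matter so much for DNN accelerators.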

Page 78:

Adaptive to new models

78

Page 79:

FPGAs for DNN

˃ The xDNN processing engine has dedicated execution paths for each type of command (download, conv, pooling, element-wise, and upload). This allows convolution commands to be run in parallel with other commands if the network graph allows it

79

Page 80:

DNN layers

80

https://www.xilinx.com/publications/events/machine-learning-live/colorado/HotChipsOverview.pdf

Page 81:

CPU-FPGA

˃ Even though the xDNN processing engine supports a wide range of CNN operations, new custom networks are constantly being developed, and sometimes select layers/instructions might not be supported by the engine in the FPGA. Layers of networks that are not supported in the xDNN processing engine are identified by the xfDNN compiler and can be executed on the CPU. These unsupported layers can be in any part of the network: beginning, middle, end, or in a branch.

81

Page 82:

CPU-FPGA

˃ Networks and models are prepared for deployment on xDNN through Caffe, TensorFlow, or MxNet.

˃ The FPGA runs the layers supported by xDNN, while unsupported layers run on the CPU.

82

Page 83:

Optimized architecture

˃ Network optimization by fusing layers, optimizing memory dependencies in the network, and pre-scheduling the entire network. This removes CPU host control bottlenecks.

83

Page 84:

DNN tradeoffs

84

https://www.xilinx.com/support/documentation/white_papers/wp514-emerging-dnn.pdf

Page 85:

Precision vs Performance vs power

85

Page 86:

Design Space trade offs

86

Page 87:

87

J. Cong et al., Understanding Performance Differences of FPGAs and GPUs

Page 88:

Winners

88

https://www.semanticscholar.org/paper/Unified-Deep-Learning-with-CPU%2C-GPU%2C-and-FPGA-Rush-Sirasao/64c8428e93546479d44a5a3e44cb3d2553eab284#extracted

Page 89:

Links, more info

89