software-like compilation for datacenter fpga accelerators€¦ · rise of fpgas in the datacenter...

20
© Copyright 2021 Xilinx James Thomas, Chris Lavin, Alireza Kaviani Stanford University and Xilinx Research Labs Software-like Compilation for Datacenter FPGA Accelerators

Upload: others

Post on 12-Aug-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

James Thomas, Chris Lavin, Alireza KavianiStanford University and Xilinx Research Labs

Software-like Compilation for Datacenter FPGA Accelerators

Page 2: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

1980 1985 1990 1995 2000 2005 2010 2015

2x/ 20 years

1

10

100

1000

10000

100000

CISC2x/3.5yrs

RISC2x/1.5yrs

DennardScaling

Multicore2x/3.5yrs

Amdahl’sLaw

2x/6yrs

“A New Golden Age of Computer Architecture”, Hennessy & Patterson, Turing Lecture

Perfo

rman

ce v

s Vax

11-7

80A New Age of Domain-specific Computing

Hennessy & Patterson

©

Page 3: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Rise of FPGAs in the Datacenter

˃ Large amount of interest from Amazon, Microsoft, and others

˃ Many massively parallel datacenter workloads that should work well on FPGAs

˃ But difficult for software programmers to harness them

>> 3

Page 4: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Common Pattern in Datacenter FPGA Designs

˃ Many identical processing units (PUs)

˃ Some sort of “memory controller” to facilitate communication among the processing units and to DRAM (data analytics, machine learning, etc.)

MemoryController

Host Interface

PU

Inter-connect

PU PU PU PU

PU PU PU PU PU

PU PU PU PU PU

PU PU PU PU PU

Domain Connectivity

>> 4

Page 5: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Fleet Computing Domain

˃ As an example, I previously worked on a system called Fleet in the data analytics / stream processing domain

˃ Included a DSL for PUs and targeted Amazon F1

PU 1Output

InputStream

Ctrl

PU 2Output

PU …Output

PU 1Input

PU 2Input

PU …Input DDR

Programmable Logic

PU 1

PU 2

PU ...

OutputStream

Ctrl

>> 5

Page 6: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Fleet: Good Performance, Slow CompilationApplication # of PUs vs. CPU Perf/W vs. GPU Perf/WJSON Parsing 512 26x 5.4xInteger Coding 192 45x 2.7xDecision Tree 384 14x 0.4xSmith-Waterman 384 275x 5.8xRegex 704 60x 2.6xBloom Filter 320 15x 6.7x

App Code

Fleet & Chisel Compiler

30-60 seconds 8-12 hours *per try* 30-60 minutes

.bit

StreamRTL / Chisel

(Synthesis, P&R)Ingestion

Flow

Verilog(.v)

P&RDesign (.dcp)

Fleet: A Framework for Massively Parallel Streaming on FPGAs, ASPLOS 2020>> 6

Page 7: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Datacenter FPGA App Compilation

˃ Memory controller often doesn’t change much for a class of applications – wasted work in redoing its place and route for each app

˃ Processing units identical – significant wasted work in redoing place and route for each one

˃ How can we avoid redoing work?

PU

PU

PU

PU

PU

PU

PU

Mem. Control.

PU

PU

>> 7

Page 8: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Domain Specific Era Needs Domain Specific Backends

CIRCUITS IN SECONDS

Application Domains

A B

Vivado must generalize solutions RapidWright can specialize

˃ Exploit domain-specific traits for:Higher performanceFaster compile timeTiming closure predictability

>> 8

Page 9: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

FPGA specialist community

Traditional flows: Applications in all

Domains

Fleet Domain-Specific Compiler

Developers

Application architects

Back-end compiler

Front-end Compiler

High abstraction domain-specific Fleet language

Relocate pre-implemented operators and

functions

Customized open-source frameworks

>> 9

Page 10: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Goal

˃ Fast compilation by:Reusing compilation for replicated PUsTake memory controller (shell) from a pre-implemented library

>> 10

Page 11: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Simple Solution: Vivado Out-of-context Flow

Pre-implemented memory controller (shell)

Replicated PUs

>> 11

Page 12: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Issue: Shell-to-PU Routing

˃ Problem: Vivado routing from replicated PUs to shell takes time (1-2 hours or more)

˃ Solution: Shell is pre-implemented, so route it to a register block next to each PU location (“slot”) ahead of time

Use two connected columns of registers & pblocks to ensure shell routes don’t cross into PU slots

>> 12

Page 13: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Example Shell

˃ Slot resource layout can be different per slot column (but resource counts are same) –requires separate implementations

>> 13

Page 14: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Online PU Flow

˃ Generate PU implementation for each slot column & replicate implementations

>> 14

Page 15: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Results

>> 15

Page 16: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

180-Slot Shell

>> 16

Page 17: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Area Tradeoff

˃ Previously able to get 500+ PU’s with standard flow

˃ Still have room to add more slots

˃ Can still beat GPU with 180 PU’s in some cases, may be enough for some users

˃ For others, this can be a fast flow for prototyping, can use standard flow once design is finalized

>> 17

Page 18: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Retrospective on Vivado

˃ Vivado needs to start up/run more quickly for faster online flow

˃ Needs more low-level APIs to precisely control behaviorCould do something simpler/more direct than using register block for shell/PU isolation

˃ RapidWright achieves both goals but needs its own placer/router/etc. to be on par with Vivado

>> 18

Page 19: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Additional Speedup with Open-Source Tools

>> 19

Page 20: Software-like Compilation for Datacenter FPGA Accelerators€¦ · Rise of FPGAs in the Datacenter ˃Large amount of interest from Amazon, Microsoft, and others ˃Many massively parallel

© Copyright 2021 Xilinx

Conclusion

˃ Fast compilation system for modular datacenter designs (~10x speedup)

˃ Open source at https://github.com/jjthomas/Fleet-Floorplanning

>> 20