Spatial: A Language and Compiler for Application Accelerators Raghu Prabhakar Stanford University / SambaNova Systems TVM Conference Dec 13, 2018


TRANSCRIPT

Page 1:

Spatial: A Language and Compiler for Application Accelerators

Raghu Prabhakar
Stanford University / SambaNova Systems

TVM Conference, Dec 13, 2018


Page 4:

The Future Is (Probably) Reconfigurable

[Figure: energy efficiency (MOPS/mW, log axis with ticks from 0.1 to 10,000) versus programmability (not programmable, less programmable, more programmable). Architecture classes: Dedicated (ASIC), Reconfigurable (CGRA, FPGA), Instruction-Based (GPU, CPU).]

Annotated data points:
■ XPU (HotChips ’17): 25x perf/W vs. CPU
■ Brainwave (ISCA ’18): 287 MOps/mW
■ Plasticine (ISCA ’17): 77x perf/W vs. FPGA


Page 6:

Key Question

How can we more productively target reconfigurable architectures like FPGAs?

■ Performance: Fast and efficient designs
■ Productivity: Fast and efficient programmers
■ Portability: Target-generic solutions


Page 10:

HDLs

Hardware Description Languages (HDLs), e.g. Verilog, VHDL, Chisel, Bluespec

Performance
✓ Arbitrary RTL

Productivity
✘ No high-level abstractions

Portability
✘ Significant target-specific code


Page 14:

C + Pragmas

Existing High Level Synthesis (C + Pragmas), e.g. Vivado HLS, SDAccel, Altera OpenCL

Performance
✘ No memory hierarchy
✘ No arbitrary pipelining

Productivity
✓ Nested loops
✘ Difficult to optimize
✘ Ad-hoc mix of software/hardware

Portability
✓ Portable for single vendor

Spectrum labels: HDLs, C + Pragmas


Page 18:

Rethinking HLS

Performance
✓ Memory hierarchy
✓ Arbitrary pipelining

Productivity
✓ Nested loops
✓ Automatic memory banking/buffering
✓ Implicit design parameters (unrolling, banking, etc.)
✓ Automated design tuning

Portability
✓ Target-generic source across reconfigurable architectures

Spectrum labels: HDLs, Improved HLS, C + Pragmas

Page 19:

Introducing Spatial

■ Programming language to simplify configurable accelerator design
■ Constructs to express:
    ■ Hierarchical parallel and pipelined data paths
    ■ Explicit memory hierarchies
■ Simple APIs to manage CPU-accelerator communication
■ Open source: https://spatial-lang.org/

■ Allows programmers to focus on “interesting stuff”
■ Designed for performance-oriented programmers
■ More intuitive than CUDA: dataflow instead of threads

David Koeplinger et al., “Spatial: A Language and Compiler for Application Accelerators”, PLDI 2018


Page 24:

Spatial: Memory Hierarchy

DDR DRAM (GB)
val image = DRAM[UInt8](H,W)

On-Chip SRAM (MB)
val buffer = SRAM[UInt8](C)
val fifo = FIFO[Float](D)
val lbuf = LineBuffer[Int](R,C)

Local Regs (KB)
val accum = Reg[Double]
val pixels = RegFile[UInt8](R,C)

buffer load image(i, j::j+C) // dense
buffer gather image(a) // sparse
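The two transfer forms on this slide differ only in how DRAM addresses are generated. As a rough software analogy, the sketch below models them in plain Python rather than Spatial. The helper names `dense_load` and `gather` and the flat-address layout are assumptions made for illustration, not Spatial APIs.

```python
# Plain-Python model of the two DRAM -> SRAM transfer styles.
# dense_load and gather are illustrative names, not Spatial APIs.

def dense_load(image, i, j, c):
    """Copy the contiguous run image[i][j : j + c] into a new on-chip buffer."""
    return list(image[i][j:j + c])

def gather(flat_image, addresses):
    """Copy elements at arbitrary (possibly non-contiguous) flat addresses."""
    return [flat_image[a] for a in addresses]

H, W, C = 4, 8, 4
# Each pixel stores its own flat address so the results are easy to read.
image = [[r * W + col for col in range(W)] for r in range(H)]
flat = [v for row in image for v in row]

dense = dense_load(image, i=1, j=2, c=C)   # row 1, columns 2..5
sparse = gather(flat, [0, 9, 18, 27])      # the main diagonal
print(dense)   # [10, 11, 12, 13]
print(sparse)  # [0, 9, 18, 27]
```

A dense load maps to one long burst, while a gather needs one address per element, which is why the two are distinct operations in the language.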


Page 29:

Spatial: Control And Design Parameters

Implicit/Explicit parallelization factors (optional, but can be explicitly declared)
val P = 16 (1 -> 32)
Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}

Implicit/Explicit control schemes (also optional, but can be used to override compiler)
Stream.Foreach(0 until N){i => …}

Explicit size parameters for loop step size and buffer sizes (informs compiler it can tune this value)
val B = 64 (64 -> 1024)
val buffer = SRAM[Float](B)
Foreach(N by B){i => …}

Implicit memory banking and buffering schemes for parallelized access
Foreach(64 par 16){i =>
  buffer(i) // Parallel read
}
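To make the effect of a parallelization factor concrete, here is a minimal plain-Python model of what `Reduce(0)(N by 1 par P)` implies: P elements are consumed per step and combined with a reduction tree. It assumes the combine function is associative. The function names are hypothetical, and the hardware the Spatial compiler actually generates may be organized differently.

```python
def tree_reduce(values, combine):
    """Combine values pairwise, level by level, as a balanced reduction tree would."""
    level = list(values)
    while len(level) > 1:
        nxt = [combine(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

def parallel_reduce(data, p, combine, zero):
    """Model of Reduce(zero)(N by 1 par p): take p elements per step,
    combine them with a tree, then fold the partial result into the accumulator."""
    acc = zero
    steps = 0
    for start in range(0, len(data), p):
        chunk = data[start:start + p]
        acc = combine(acc, tree_reduce(chunk, combine))
        steps += 1
    return acc, steps

data = list(range(64))
total, steps = parallel_reduce(data, p=16, combine=lambda a, b: a + b, zero=0)
print(total, steps)  # 2016 4
```

Raising `p` cuts the number of steps but widens the tree, which is the runtime versus area trade-off the tuner explores.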

Page 38:

Dot Product in Spatial

val output = ArgOut[Float]
val vectorA = DRAM[Float](N)        // Off-chip memory declarations
val vectorB = DRAM[Float](N)

Accel {                             // Explicit work division in IR
  Reduce(output)(N by B){ i =>      // Tiled reduction (outer)
    val tileA = SRAM[Float](B)      // On-chip memory declarations
    val tileB = SRAM[Float](B)
    val acc = Reg[Float]

    tileA load vectorA(i :: i+B)    // DRAM -> SRAM transfers (also have store, scatter, and gather)
    tileB load vectorB(i :: i+B)

    Reduce(acc)(B by 1){ j =>       // Tiled reduction (pipelined)
      tileA(j) * tileB(j)
    }{(a, b) => a + b}
  }{(a, b) => a + b}                // Outer reduce function
}

[Figure: DRAM holds vectorA and vectorB. On the FPGA, the outer Reduce has three stages: Stage 1 fills tileA and tileB (two copies each, labeled (0) and (1)), Stage 2 multiplies and adds into acc, Stage 3 adds acc into output.]

Design parameters highlighted on the slide: tile size (B), banking strategy, metapipelining toggle, parallelism factors #1, #2, and #3.
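For readers who want to check the arithmetic of the two-level reduction, the sketch below mirrors the structure of the Spatial program in sequential Python. It is a functional model only: it says nothing about pipelining or timing, the name `tiled_dot` is made up for this example, and it tolerates a ragged last tile, which the slide code does not address.

```python
def tiled_dot(vector_a, vector_b, tile):
    """Two-level reduction: outer loop over tiles, inner loop within a tile."""
    assert len(vector_a) == len(vector_b)
    output = 0.0
    for i in range(0, len(vector_a), tile):   # Reduce(output)(N by B)
        tile_a = vector_a[i:i + tile]         # tileA load vectorA(i :: i+B)
        tile_b = vector_b[i:i + tile]         # tileB load vectorB(i :: i+B)
        acc = 0.0
        for j in range(len(tile_a)):          # Reduce(acc)(B by 1)
            acc += tile_a[j] * tile_b[j]
        output += acc                         # outer reduce function
    return output

a = [float(k) for k in range(8)]
b = [2.0] * 8
print(tiled_dot(a, b, tile=4))  # 56.0
```

The result is independent of the tile size, which is what lets the compiler treat B as a tunable design parameter.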

Page 42:

The Spatial Compiler

Inputs: Spatial Program, Design Parameters

[Figure: compiler flow diagram starting from the Spatial IR and the design parameters. Boxes, in extraction order: Control Scheduling, Mem. Banking/Buffering, Access Pattern Analysis, Control Inference, Pipeline Unrolling, Pipeline Retiming, [Optional] Design Tuning, Host Resource Allocation, Control Signal Inference, Chisel Code Generation, Area/Runtime Analysis.]

Legend: Intermediate Representation, Design Parameters, IR Transformation, IR Analysis, Code Generation

Page 43:

Control Scheduling

■ Creates loop pipeline schedules
■ Detects data dependencies across loop intervals
■ Calculates initiation interval of pipelines
■ Sets maximum depth of buffers

■ Supports arbitrarily nested pipelines (commercial HLS tools don’t support this)
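The role of the initiation interval can be illustrated with the standard textbook pipelining bound. The sketch below is not the Spatial compiler's scheduling algorithm. The function names, and the simplification to a single loop-carried dependence, are assumptions made for illustration.

```python
import math

def initiation_interval(dep_latency, dep_distance, resource_ii=1):
    """An iteration that needs a value produced dep_distance iterations earlier
    cannot start until that value is ready, so II >= ceil(latency / distance)."""
    recurrence_ii = math.ceil(dep_latency / dep_distance)
    return max(1, resource_ii, recurrence_ii)

def pipelined_cycles(iterations, depth, ii):
    """The first iteration takes `depth` cycles; each later one starts `ii` cycles after the previous."""
    if iterations == 0:
        return 0
    return depth + (iterations - 1) * ii

# A 1-cycle accumulator carried to the next iteration: II = 1.
print(pipelined_cycles(1000, depth=10, ii=initiation_interval(1, 1)))  # 1009
# A 4-cycle adder carried to the next iteration: II = 4.
print(pipelined_cycles(1000, depth=10, ii=initiation_interval(4, 1)))  # 4006
```

For long loops the initiation interval dominates the runtime and pipeline depth barely matters, which is why detecting loop-carried dependencies is part of scheduling.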

Page 44:

Design Tuning

[Figure: compiler flow diagram (same passes as on Page 42) with the [Optional] Design Tuning pass highlighted and an added "Modified Parameters" label.]

Page 45:

Design Space Parameters Example

vectorA ∙ vectorB

[Figure: DRAM holds vectorA and vectorB; the FPGA contains SRAMs tileA and tileB, a counter (ctr), a multiply-add unit, and the register acc. Legend: Control, Compute, Regs, SRAM.]

Small and simple, but slow!

Page 46:

Important Parameters: Buffer Sizes

■ Increases length of DRAM accesses (Runtime)
■ Increases exploited locality (Runtime)
■ Increases local memory sizes (Area)

[Figure: same vectorA ∙ vectorB design as on Page 45.]

Page 47:

Important Parameters: Pipelining

■ Overlaps memory and compute (Runtime)
■ Increases local memory sizes (Area)
■ Adds synchronization logic (Area)

[Figure: vectorA ∙ vectorB design split into two stages. Stage 1 fills tileA and tileB from DRAM; Stage 2 multiplies and adds into acc. Each tile is a double buffer with copies (0) and (1). Legend: Control, Compute, Regs, SRAM.]

Metapipelining requires buffering
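A back-of-the-envelope model shows why overlapping the load and compute stages helps, and what it costs in memory. This is an idealized sketch under stated assumptions: fixed per-tile stage latencies, two stages, and exactly two copies of each tile buffer. All names and numbers are illustrative and are not taken from the Spatial compiler.

```python
def sequential_cycles(num_tiles, load_cycles, compute_cycles):
    """Without metapipelining, each tile is loaded and then computed, back to back."""
    return num_tiles * (load_cycles + compute_cycles)

def metapipelined_cycles(num_tiles, load_cycles, compute_cycles):
    """With double buffering, loading tile k+1 overlaps computing tile k.
    The slower stage sets the steady-state rate; the faster stage is paid once."""
    if num_tiles == 0:
        return 0
    slow = max(load_cycles, compute_cycles)
    fast = min(load_cycles, compute_cycles)
    return fast + num_tiles * slow

def tile_sram_words(tile_size, num_arrays, metapipelined):
    """Double buffering keeps two copies of every tile on chip."""
    copies = 2 if metapipelined else 1
    return tile_size * num_arrays * copies

num_tiles, load, compute = 16, 100, 80
print(sequential_cycles(num_tiles, load, compute))      # 2880
print(metapipelined_cycles(num_tiles, load, compute))   # 1680
print(tile_sram_words(64, 2, False), tile_sram_words(64, 2, True))  # 128 256
```

The runtime drops toward the cost of the slower stage while the tile storage doubles, matching the Runtime and Area tags on the slide.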

Page 48:

Important Parameters: Parallelization

■ Improves element throughput (Runtime)
■ Duplicates compute resources (Area)

[Figure: vectorA ∙ vectorB design with the counter and multiplier replicated and an adder tree feeding acc. Legend: Control, Compute, Regs, SRAM.]

Page 49:

Important Parameters: Memory Banking

■ Improves memory bandwidth (Runtime)
■ May duplicate memory resources (Area)

[Figure: same parallelized design, with tileA and tileB drawn as banked SRAM.]

Parallelization requires banking
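The link between parallelization and banking can be checked with a small model. The sketch assumes cyclic (interleaved) banking, which is one common scheme; the slides do not say which scheme the compiler picks for a given access pattern. The function names are hypothetical.

```python
def bank_of(address, num_banks):
    """Cyclic banking: consecutive addresses land in consecutive banks."""
    return address % num_banks

def is_conflict_free(addresses, num_banks):
    """True if every address issued in the same cycle hits a different bank."""
    banks = [bank_of(a, num_banks) for a in addresses]
    return len(set(banks)) == len(banks)

par = 16
# One cycle of a loop parallelized by 16: the lanes read 16 consecutive addresses.
lane_addresses = [32 + lane for lane in range(par)]
print(is_conflict_free(lane_addresses, num_banks=16))  # True
print(is_conflict_free(lane_addresses, num_banks=8))   # False

# A stride-2 access pattern collides even with 16 banks.
strided = [2 * lane for lane in range(par)]
print(is_conflict_free(strided, num_banks=16))         # False
```

With too few banks, or with a stride that shares a factor with the bank count, several lanes hit the same bank in one cycle, which is why the banking scheme has to follow from access pattern analysis.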


Page 55:

Design Tuning

Original tuning methods:
■ Pre-prune space using simple heuristics
■ Randomly sample ~100,000 design points
■ Model area/runtime of each point

Proposed tuning method:
■ Active learning: HyperMapper (more details in paper)
■ Fast: No slow transformers in loop
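The original tuning flow (prune, randomly sample, model) can be sketched in a few lines. Everything numeric below is invented for illustration: the cost model, the pruning rule, and the problem size are assumptions, and only the parameter ranges (B from 64 to 1024, P from 1 to 32) come from the earlier slide. This is neither Spatial's area/runtime model nor HyperMapper.

```python
import random

TILE_SIZES = [64, 128, 256, 512, 1024]   # B ranges over 64 -> 1024 on the slide
PAR_FACTORS = list(range(1, 33))         # P ranges over 1 -> 32 on the slide
N = 1 << 20                              # assumed problem size

def model(tile, par, metapipe):
    """Toy analytical model, invented for illustration: returns (area, runtime)."""
    load = tile / 8                      # assume 8 words arrive from DRAM per cycle
    compute = tile / par                 # par lanes consume the tile
    per_tile = max(load, compute) if metapipe else load + compute
    runtime = (N // tile) * per_tile
    area = 2 * tile * (2 if metapipe else 1) + 10 * par
    return area, runtime

def keep(tile, par):
    """Heuristic pre-pruning: keep only factors that evenly divide the tile."""
    return tile % par == 0

def dominates(q, p):
    """q dominates p if it is no worse in both metrics and better in at least one."""
    return q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])

def tune(samples, seed=0):
    """Randomly sample the pruned space and return the Pareto-optimal points."""
    rng = random.Random(seed)
    evaluated = {}
    for _ in range(samples):
        cfg = (rng.choice(TILE_SIZES), rng.choice(PAR_FACTORS), rng.choice([False, True]))
        if keep(cfg[0], cfg[1]):
            evaluated[cfg] = model(*cfg)
    points = list(evaluated.items())
    return [(cfg, m) for cfg, m in points
            if not any(dominates(other, m) for _, other in points)]

front = tune(samples=1000)
for cfg, (area, runtime) in sorted(front, key=lambda item: item[1][0]):
    print(cfg, area, runtime)
```

Because each point is scored by a model rather than by synthesis, evaluating many points is cheap. An active-learning tuner aims to reach a similar Pareto front with far fewer evaluations.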

Page 56:

The Spatial Compiler: The Rest

Code generation
■ Synthesizable Chisel
■ C++ code for host CPU

Page 57:

Evaluation: Performance

■ FPGA:
    ■ Amazon EC2 F1 Instance: Xilinx VU9P FPGA
    ■ Fixed clock rate of 150 MHz
■ Applications
    ■ SDAccel: Hand optimized, tuned implementations
    ■ Spatial: Hand written, automatically tuned implementations
■ Execution time = FPGA execution time

Page 58:

Performance (Spatial vs. SDAccel)

Average 2.9x faster hardware than SDAccel

[Figure: bar chart of speedup over SDAccel (y-axis 0 to 15) per benchmark. Labeled x-axis ticks: BlackScholes, GEMM, PageRank, TPC-H Q6. Bar labels as extracted: 8.5x, 1.4x, 1.6x, 1.4x, 3.5x, 14.1x, 1.3x.]
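The slide does not state which kind of average the 2.9x figure is. As a quick check, and assuming the seven bar labels in the transcript are the per-benchmark speedups, the geometric mean reproduces the headline number while the arithmetic mean does not:

```python
import math

# Bar labels as they appear in the transcript (benchmark order not recoverable).
speedups = [8.5, 1.4, 1.6, 1.4, 3.5, 14.1, 1.3]

arithmetic = sum(speedups) / len(speedups)
geometric = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

print(round(arithmetic, 1))  # 4.5
print(round(geometric, 1))   # 2.9
```

The geometric mean is the usual choice for averaging ratios, since it keeps a single large speedup from dominating the summary.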

Page 59:

Productivity: Lines of Code

Average 42% shorter programs versus SDAccel

[Figure: bar chart of lines of code (y-axis 0 to 250) for SDAccel and Spatial per benchmark. Labeled x-axis ticks: BlackScholes, GEMM, PageRank, TPC-H Q6. Bar labels as extracted: 12%, 60%, 47%, 44%, 31%, 66%, 35%.]

Page 60:

Evaluation: Portability

■ FPGA 1
    ■ Amazon EC2 F1 Instance: Xilinx VU9P FPGA
    ■ 19.2 GB/s DRAM bandwidth (single channel)
■ FPGA 2
    ■ Xilinx Zynq ZC706
    ■ 4.3 GB/s DRAM bandwidth
■ Applications
    ■ Spatial: Hand written, automatically tuned implementations
    ■ Fixed clock rate of 150 MHz

Page 63:

Portability: VU9P vs. Zynq ZC706

Identical Spatial source, multiple targets

Porting: Speedup (VU9P / Zynq) only from moving to larger FPGA
Tuning: Speedup only from tuning parameters for larger FPGA
Product = Porting × Tuning

[Figure: bar chart of speedup (y-axis 0 to 30) per benchmark. Labeled x-axis ticks: BlackScholes, GEMM, PageRank, TPC-H Q6. Bar labels as extracted, in slide order:]

Bar       1      2      3      4      5      6      7
Porting   2.5x   1.2x   2.5x   2.5x   1.3x   2.5x   4.6x
Tuning    2.6x   2.1x   9.4x   2.7x   1.7x   1.0x   1.1x
Product   6.5x   2.5x   23.4x  6.8x   2.2x   2.5x   5.0x

Resource ratios, VU9P / ZC706:
DRAM Bandwidth: 4.5x
LUTs (GP compute): 47.3x
DSPs (integer FMA): 7.6x
On-chip memory*: 4.0x
* No URAM used on VU9P
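The relation Product = Porting × Tuning can be checked directly against the bar labels. The lists below copy the three series in the order they appear in the transcript. Which benchmark each position corresponds to is not recoverable from the text, so the check is positional only.

```python
porting = [2.5, 1.2, 2.5, 2.5, 1.3, 2.5, 4.6]
tuning = [2.6, 2.1, 9.4, 2.7, 1.7, 1.0, 1.1]
product = [6.5, 2.5, 23.4, 6.8, 2.2, 2.5, 5.0]

for p, t, reported in zip(porting, tuning, product):
    computed = p * t
    # The slide values are rounded to one decimal, so allow a small tolerance.
    assert abs(computed - reported) <= 0.15, (p, t, reported)
    print(f"{p} x {t} = {computed:.2f} (slide: {reported})")
```

Every position agrees to within rounding, so the largest combined speedup (23.4x) comes mostly from retuning parameters (9.4x) rather than from the larger device alone (2.5x).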

Page 64:

Portability: Plasticine CGRA

Identical Spatial source, multiple targets
Even reconfigurable hardware that isn’t an FPGA!

               DRAM Bandwidth (%)   Resource Utilization (%)   Speedup
Benchmark      Load     Store       PCU     PMU     AG         vs. VU9P
BlackScholes   77.4     12.9        73.4    10.9    20.6       1.6
GDA            24.0     0.2         95.3    73.4    38.2       9.8
GEMM           20.5     2.1         96.8    64.1    11.7       55.0
K-Means        8.0      0.4         89.1    57.8    17.6       6.3
TPC-H Q6       97.2     0.0         29.7    37.5    70.6       1.6

Prabhakar et al., “Plasticine: A Reconfigurable Architecture For Parallel Patterns” (ISCA ’17)

Page 65

Halide to Spatial

Pages 66-67

What is Halide?

● DSL for computational photography
● Separation between algorithm (what to compute) and schedule (how to compute)
● Straightforward to express and iterate over various schedules

Algorithm:

    Var x, y;
    Func f;
    f(x, y) = x + y;

Schedule #1:

    f.tile(x, y, xi, yi, 8, 8);

Schedule #2:

    f.parallel(y);
    f.vectorize(x, 8);

Implementations: each schedule produces a different implementation of the same algorithm.
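To make the algorithm/schedule split concrete, here is a plain-Python model (not Halide itself) of the loop nest that a tiling schedule such as Schedule #1 implies. The function names and the requirement that the output size is a multiple of the tile size are assumptions made for this sketch; Halide handles ragged edges with tail strategies that are not modeled here.

```python
def algorithm(x, y):
    """The Halide algorithm from the slide: f(x, y) = x + y."""
    return x + y


def realize_default(width, height):
    """Default schedule: row-major loop nest, y outermost and x innermost."""
    out = {}
    for y in range(height):
        for x in range(width):
            out[(x, y)] = algorithm(x, y)
    return out


def realize_tiled(width, height, tx=8, ty=8):
    """Loop nest implied by f.tile(x, y, xi, yi, 8, 8).

    Assumption: width and height are multiples of the tile size.
    """
    assert width % tx == 0 and height % ty == 0
    out = {}
    for yo in range(height // ty):          # outer loops walk over tiles
        for xo in range(width // tx):
            for yi in range(ty):            # inner loops walk inside one tile
                for xi in range(tx):
                    x, y = xo * tx + xi, yo * ty + yi
                    out[(x, y)] = algorithm(x, y)
    return out


default = realize_default(16, 16)
tiled = realize_tiled(16, 16)
print("same values:", default == tiled)
print("9th point visited:", list(default)[8], "vs", list(tiled)[8])
```

Both schedules compute identical values; only the visiting order changes, which is what lets a schedule be swapped without touching the algorithm.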

Page 68

Why use Halide as Front-End to Spatial?

● Separation of concerns
  ○ High-level transformations (tiling, vectorization, etc.) can happen in Halide
  ○ Offload the hard work of transforming loop nests to Halide
  ○ Optimized code can be lowered into Spatial
● Loop-based IR
  ○ Easy mapping to Spatial front-end

Page 69

Halide IR

    // Algorithm
    Var x, y;
    Func f;
    f(x, y) = x + y;
    // Schedule
    f.parallel(y);
    f.vectorize(x, 8);
    f.realize(32, 32);

Lowered IR:

    produce f {
      let t6 = (f.extent.0 + f.min.0)
      let t7 = (f.min.1*f.stride.1)
      let t8 = max((f.extent.0/8), 0)
      let t3 = (t8 < ((f.extent.0 + 7)/8))
      let t2 = (0 - t7)
      let t5 = (((t6 - t7) - f.min.0) + -8)
      let t4 = (t6 + -8)
      parallel (f.s0.y, f.min.1, f.extent.1) {
        let t10 = ((f.s0.y*f.stride.1) + t2)
        let t9 = (f.min.0 + f.s0.y)
        for (f.s0.x.x, 0, t8) {
          f[ramp(((f.s0.x.x*8) + t10), 1, 8)] = ramp(((f.s0.x.x*8) + t9), 1, 8)
        }
        if (t3) {
          f[ramp(((f.s0.y*f.stride.1) + t5), 1, 8)] = ramp((f.s0.y + t4), 1, 8)
        }
      }
    }
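The lowered IR is easier to follow when interpreted step by step. The Python sketch below mirrors the `let` bindings one for one and checks that every element ends up as x + y. It is a reading aid, not Halide output: `realize_f` is a hypothetical helper, the buffer is assumed dense (`f.stride.1 == f.extent.0`), and the extent in x is assumed to be at least one vector wide.

```python
def ramp(base, stride, lanes):
    """Halide's ramp(base, stride, lanes): the vector [base, base + stride, ...]."""
    return [base + stride * i for i in range(lanes)]


def realize_f(min0, extent0, min1, extent1, lanes=8):
    """Interpret the lowered loop nest for f(x, y) = x + y, vectorized in x."""
    assert extent0 >= lanes, "the shifted tail vector needs one full vector of room"
    stride1 = extent0                           # dense row-major buffer (assumption)
    f = [None] * (extent0 * extent1)

    t6 = extent0 + min0
    t7 = min1 * stride1
    t8 = max(extent0 // lanes, 0)               # number of full vectors per row
    t3 = t8 < (extent0 + lanes - 1) // lanes    # True when a partial vector remains
    t2 = 0 - t7
    t5 = ((t6 - t7) - min0) - lanes
    t4 = t6 - lanes

    for y in range(min1, min1 + extent1):       # `parallel` in the IR; serial here
        t10 = y * stride1 + t2
        t9 = min0 + y
        for xx in range(t8):
            indices = ramp(xx * lanes + t10, 1, lanes)
            values = ramp(xx * lanes + t9, 1, lanes)
            for idx, val in zip(indices, values):
                f[idx] = val
        if t3:
            # Tail: one extra vector, shifted back so it ends on the row's last element.
            indices = ramp(y * stride1 + t5, 1, lanes)
            values = ramp(y + t4, 1, lanes)
            for idx, val in zip(indices, values):
                f[idx] = val
    return f, stride1


f, stride1 = realize_f(min0=0, extent0=20, min1=0, extent1=3)
print(all(f[y * stride1 + x] == x + y for y in range(3) for x in range(20)))
```

With `f.realize(32, 32)` as on the slide, `t3` is false and the tail branch never runs; an extent of 20 exercises it, with the last vector overlapping the previous one.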

Pages 70-77

Example: Halide to Spatial

Halide:

    // Algorithm
    f(x, y) = x + y;
    g(x, y) = (f(x, y) + f(x, y+1))/2;
    // Schedule
    g.in().spatial();
    g.store_in(MemoryType::SRAM)
     .compute_at(g.in(), Var::outermost());
    g.tile(x, y, xo, yo, xi, yi, 4, 4);
    f.compute_root();
    f.in()
     .copy_to_device()
     .store_in(MemoryType::SRAM)
     .compute_at(g, xo);
    g.in().copy_to_host();
    wrapper.compile_to_spatial(...);

Spatial:

    val g_wrapper = DRAM[Int](16, 16);
    Accel {
      val g = SRAM[Int](16, 16);
      Foreach(0 until 4 by 1) { yo =>
        Foreach(0 until 4 by 1) { xo =>
          val f_wrapper = SRAM[Int](4, 5);
          f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5);
          Foreach(0 until 4 by 1) { yi =>
            Foreach(0 until 4 by 1) { xi =>
              g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2;
            }
          }
        }
      }
      g_wrapper store g;
    }

Callouts, one per build of the slide, in order:

1. Compute at Accelerator
2. Allocate SRAM to store ‘g’
3. Tile g
4. Load ‘f’ into the accelerator’s memory
5. Do the load at loop level ‘xo’ and store in SRAM
6. Store ‘g’ back into host’s DRAM
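The generated Spatial loop nest can be traced in plain Python to confirm that the tiled version still computes g(x, y) = (f(x, y) + f(x, y+1))/2. This is a model of the data movement only, not Spatial code. The slide never declares `f`, so it is assumed here to be a 16 x 17 host array (16 rows of output plus one halo row), and the names `N` and `T` are introduced just for this sketch.

```python
N, T = 16, 4  # output size and tile size from the slide

# f.compute_root(): f is computed in full up front. It needs one extra
# row in y because g reads f(x, y + 1). Indexed as f[x][y].
f = [[x + y for y in range(N + 1)] for x in range(N)]

# Stands in for `val g = SRAM[Int](16, 16)` inside the Accel block.
g = [[None] * N for _ in range(N)]

for yo in range(N // T):
    for xo in range(N // T):
        # `f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5)`:
        # a 4 x 5 tile, i.e. one tile of f plus a one-row halo in y.
        f_wrapper = [
            [f[xo * T + xi][yo * T + yi] for yi in range(T + 1)]
            for xi in range(T)
        ]
        for yi in range(T):
            for xi in range(T):
                g[xo * T + xi][yo * T + yi] = (
                    f_wrapper[xi][yi] + f_wrapper[xi][yi + 1]
                ) // 2

# `g_wrapper store g`: copy the finished result back to DRAM.
g_wrapper = [row[:] for row in g]

print(all(
    g_wrapper[x][y] == (f[x][y] + f[x][y + 1]) // 2
    for x in range(N) for y in range(N)
))
```

The 4 x 5 shape of `f_wrapper` comes directly from the y+1 access: each 4 x 4 output tile needs five rows of `f`.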

Pages 78-83

Conclusion

■ Reconfigurable architectures are becoming key for performance / energy efficiency
■ Current programming solutions for reconfigurables are still inadequate
■ Need to rethink outside of the C box for high level synthesis:
  ■ Memory hierarchy for optimization
  ■ Design parameters for tuning
  ■ Arbitrarily nestable pipelines
■ Spatial prototypes these language and compiler criteria:
  ■ Performance: Average speedup of 2.9x versus SDAccel on VU9P
  ■ Productivity: Average 42% less code than SDAccel
  ■ Portability: Achieves transparent portability through internal support for automated design tuning (HyperMapper)

Spatial is open source: https://spatial-lang.org/

Page 84

Backup Slides

Page 85

The Team

Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun, Stefan Hadjis, Ruben Fiszel, Luigi Nardi

Pages 86-89

Custom ASICs

Good for widely used, fixed specifications (like compression)
Expensive with long design turnaround for developing fields like ML

[Chart over time, 2009 to 2017: "Relative # of Papers / Year Since 2009" and "ML Arxiv Papers"; axes run from 0 to 20 and from 0 to 20,000. Sources: Jeff Dean, Scaled ML 2018; Kunle Olukotun, ISCA 2018.]

Pages 90-91

C + Pragmas Example

Add 512 integers originating from accelerator DRAM

    void sum(int* mem) {
      mem[512] = 0;
      for (int i = 0; i < 512; i++) {
        mem[512] += mem[i];
      }
    }

Commercial HLS Tool
Runtime: 27,236 clock cycles (100x too long!)

Pages 92-93

C + Pragmas Example

Add 512 integers originating from external DRAM

    #define CHUNKSIZE (sizeof(MPort)/sizeof(int))   // Width of DRAM controller interface
    #define LOOPCOUNT (512/CHUNKSIZE)

    void sum(MPort* mem) {
      MPort buff[LOOPCOUNT];
      memcpy(buff, mem, sizeof(buff));               // Burst access
      int sum = 0;                                   // Use local variable
      for (int i = 0; i < LOOPCOUNT; i++) {          // Loop restructuring
        #pragma PIPELINE                             // Special compiler directive
        for (int j = 0; j < CHUNKSIZE; j++) {
          #pragma UNROLL                             // Special compiler directive
          sum += (int)(buff[i] >> j*sizeof(int)*8);  // Bit shifting to extract individual elements
        }
      }
      mem[512] = sum;
    }

Runtime: 302 clock cycles
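The restructuring in the optimized C version (fetch wide words in a burst, then shift out the individual integers) can be modeled in software. The Python sketch below is a functional model only. The 512-bit port width is a hypothetical choice since the slide does not define `MPort`, `pack` and `to_int32` are helpers invented for the sketch, and Python integers do not overflow the way a C `int` accumulator would.

```python
INT_BITS = 32
PORT_BITS = 512                      # hypothetical DRAM interface width
CHUNKSIZE = PORT_BITS // INT_BITS    # ints per wide word (16 here)
N = 512
LOOPCOUNT = N // CHUNKSIZE           # wide words to fetch (32 here)


def pack(values):
    """Pack 32-bit ints into wide words, first element in the lowest bits."""
    words = []
    for i in range(0, len(values), CHUNKSIZE):
        word = 0
        for j, v in enumerate(values[i:i + CHUNKSIZE]):
            word |= (v & 0xFFFFFFFF) << (j * INT_BITS)
        words.append(word)
    return words


def to_int32(bits):
    """Reinterpret the low 32 bits as a signed int, like the C cast `(int)`."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits


def wide_sum(mem):
    """Sum N ints stored as LOOPCOUNT wide words."""
    buff = list(mem[:LOOPCOUNT])          # burst read into a local buffer
    total = 0                             # local accumulator, written back once
    for i in range(LOOPCOUNT):            # pipelined loop over wide words
        for j in range(CHUNKSIZE):        # fully unrolled in hardware
            total += to_int32(buff[i] >> (j * INT_BITS))
    return total


values = list(range(-256, 256))
print(wide_sum(pack(values)) == sum(values))
```

Every detail modeled here (port width, burst copy, local accumulator, shift-and-cast) is something the C programmer had to spell out by hand, which is the slide's point.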

Pages 94-99

Hardware Design Considerations

1. Finite physical compute and memory resources
2. Requires aggressive pipelining for performance
   ■ Maximize useful execution time of compute resources
3. Disjoint memory space
   ■ No hardware managed memory hierarchy
4. Huge design parameter spaces
   ■ Parameters are interdependent, change runtime by orders of magnitude
5. Others… pipeline timing, clocking, etc.

Pages 100-106

Local Memory Analysis Example

    Foreach(N by 1){ r =>
      val a = SRAM[Float](D)
      val b = SRAM[Float](D)
      val c = SRAM[Float](D)
      Foreach(D par 2){i => a(i) = … }
      Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
      Foreach(D par 2){k => c(k) = a(k) * sum }
    }

[Diagram: 1 “instance” of a with its access ports]
  Write port: Foreach{i => ...} writes addresses 2i and 2i+1
  Read port:  Reduce{j => ...} reads addresses b(2j) and b(2j+1)
  Read port:  Foreach{k => ...} reads addresses 2k and 2k+1

Step 1: For each read, find the banking and buffering for that read and all writes that may be visible to that read

Pages 107-118

Local Memory Analysis Example (Cont.)

Step 1: For each read, find the banking and buffering for that read and all writes that may be visible to that read

[Diagram: instance of a for the Reduce read]
  Writer: Foreach{i => ...} at addresses 2i, 2i+1
  Reader: Reduce{j => ...} at addresses b(2j), b(2j+1)
  Metapipeline Distance = 1
  Four copies of a drawn (~4-8x memory)

[Diagram: instance of a for the second Foreach read]
  Writer: Foreach{i => ...} at addresses 2i, 2i+1
  Reader: Foreach{k => ...} at addresses 2k, 2k+1
  Metapipeline Distance = 2
  Three copies of a drawn (~3-6x memory)

Both instances together: seven copies of a (~7-14x memory)

Step 2: Greedily combine (merge) instances
  - Don’t combine if there are port conflicts
  - Don’t combine if the cost of merging is greater than sum of unmerged*

* Recompute banking for merged instances!
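The lower bounds quoted for the unmerged instances (4x, 3x, 7x in total) can be reproduced with a small back-of-the-envelope model. Everything in it is an assumption made for illustration, not the compiler's confirmed cost model: buffer depth is taken as metapipeline distance plus one, the data-dependent read `a(b(j))` with `par 2` is assumed to need two duplicate copies, and the affine read `a(k)` is assumed to be served by banking with no duplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One memory instance created for a single reader of `a` (hypothetical model)."""
    reader: str
    metapipe_distance: int   # pipeline stages between the write and this read
    duplicates: int          # physical copies needed to serve all parallel lanes

    @property
    def buffer_depth(self):
        # Assumption: one buffer per stage in flight between writer and reader.
        return self.metapipe_distance + 1

    @property
    def copies(self):
        return self.buffer_depth * self.duplicates


# Random-access read with par 2: assumed to need one copy per lane.
reduce_read = Instance("Reduce{j => a(b(j))}", metapipe_distance=1, duplicates=2)
# Affine read with par 2: assumed to be bankable, so no duplication.
foreach_read = Instance("Foreach{k => a(k)}", metapipe_distance=2, duplicates=1)

total = reduce_read.copies + foreach_read.copies
print(reduce_read.copies, foreach_read.copies, total)
```

The upper ends of the quoted ranges (8x, 6x, 14x) are not explained by this model and are left as stated on the slides.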

Page 119: Spatial: A Language and Compiler for Application Accelerators€¦ · Spatial IR Control Scheduling Mem. Banking/ Buffering Access Pattern Analysis Control Inference Pipeline Unrolling

Local Memory Analysis

Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }

2i 2i+1

Foreach{i =>

2k 2k+1

Foreach{k =>

b(2j) b(2j+1)

Reduce{j =>

aa

Step 2: Greedily combine (merge) instances - Don’t combine if there are bank conflicts - Don’t combine if the cost of merging is greater than sum of unmerged**Recompute banking for merged instances!

aa

a

Page 120: Spatial: A Language and Compiler for Application Accelerators€¦ · Spatial IR Control Scheduling Mem. Banking/ Buffering Access Pattern Analysis Control Inference Pipeline Unrolling

Local Memory Analysis

Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i => a(i) = … }
  Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b}
  Foreach(D par 2){k => c(k) = a(k) * sum }
}

[Figure: instances of a after merging, with the accesses they serve: 2i and 2i+1 under Foreach{i =>, b(2j) and b(2j+1) under Reduce{j =>, 2k and 2k+1 under Foreach{k =>]

Step 2: Greedily combine (merge) instances
- Don’t combine if there are bank conflicts
- Don’t combine if the cost of merging is greater than sum of unmerged*
*Recompute banking for merged instances!

(~5-10x memory) (40% less)
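Step 2 can be sketched as a small greedy loop. This is a toy model, not the Spatial compiler's implementation: the `Instance` fields, the conflict rule, the cost function, and the constants `DEPTH` and `BANK_OVERHEAD` are all hypothetical stand-ins chosen to make the two "don't combine" rules concrete.

```python
from dataclasses import dataclass

DEPTH = 64          # words per instance (hypothetical)
BANK_OVERHEAD = 4   # extra cost per bank (hypothetical)


@dataclass(frozen=True)
class Instance:
    readers: frozenset    # accesses served by this copy of the memory
    random_in: frozenset  # controllers issuing data-dependent reads to it
    banks: int


def conflicts(a, b):
    # Toy rule: two data-dependent reads from the same controller may hit
    # the same bank in the same cycle, so they cannot share one copy.
    return bool(a.random_in & b.random_in)


def merge(a, b):
    # "Recompute banking" is modeled as taking the larger bank count.
    return Instance(a.readers | b.readers,
                    a.random_in | b.random_in,
                    max(a.banks, b.banks))


def cost(inst):
    return DEPTH + BANK_OVERHEAD * inst.banks


def greedy_merge(instances):
    """Merge pairs until no merge is both conflict-free and no more costly."""
    instances = list(instances)
    merged_any = True
    while merged_any:
        merged_any = False
        for x in range(len(instances)):
            for y in range(x + 1, len(instances)):
                a, b = instances[x], instances[y]
                if conflicts(a, b):
                    continue
                m = merge(a, b)
                if cost(m) > cost(a) + cost(b):
                    continue
                instances[x] = m
                del instances[y]
                merged_any = True
                break
            if merged_any:
                break
    return instances


start = [
    Instance(frozenset({"b(2j)"}), frozenset({"Reduce"}), 1),
    Instance(frozenset({"b(2j+1)"}), frozenset({"Reduce"}), 1),
    Instance(frozenset({"2k", "2k+1"}), frozenset(), 2),
]
result = greedy_merge(start)
cost_before = sum(cost(i) for i in start)
cost_after = sum(cost(i) for i in result)
```

With these made-up numbers the Foreach{k => reads fold into one of the Reduce copies, leaving two instances instead of three.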


Kernel-Based Approach

Manually implement each DSL operation; use a simple compiler to stitch them together

Performance
- Misses cross-kernel optimizations
- Excessive memory transfers
- Excessive buffering

Productivity
- High level specification
- No hardware design knowledge required

Portability
- Reasonably target-generic if done right
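The buffering cost of stitching kernels together can be shown with a small software analogy. This is only a sketch in plain Python, not accelerator code; the function names are hypothetical, and the second return value simply counts the intermediate words held between the two steps.

```python
def kernel_scale(xs, k):
    # Kernel 1: produces a full intermediate array.
    return [k * x for x in xs]


def kernel_sum(xs):
    # Kernel 2: reads the intermediate array back in.
    total = 0
    for x in xs:
        total += x
    return total


def stitched(xs, k):
    # Kernel-based: the whole intermediate result is buffered between kernels.
    tmp = kernel_scale(xs, k)
    return kernel_sum(tmp), len(tmp)


def fused(xs, k):
    # Cross-kernel optimization: scale and accumulate in one pass,
    # so no intermediate array is needed.
    total = 0
    for x in xs:
        total += k * x
    return total, 0


stitched_result = stitched([1, 2, 3, 4], 3)
fused_result = fused([1, 2, 3, 4], 3)
```

Both versions compute the same sum; only the stitched one pays for the intermediate buffer.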


Stochastic Gradient Descent in Spatial

// Arbitrary precision custom types
type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]

// Off-chip memory allocations
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)

// Accelerator scope
Accel {
  // On-chip memory allocations
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)

  // Explicit memory transfer
  wK load weights(0::D)

  // Declaration of a sequential loop
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()  // Debugging breakpoint
  }

  // Explicit memory transfer
  weights(0 :: D) store wK
}
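To give a feel for what the custom types TM and TX imply numerically, here is a small Python model of fixed-point quantization. It assumes that FixPt[TRUE,_9,_23] means a signed value with 9 integer bits (sign included) and 23 fractional bits, and it assumes truncation toward negative infinity with wrap-around on overflow. Neither rounding nor overflow behavior is stated in the slides, and `to_fixpt` is a hypothetical helper, not a Spatial API.

```python
import math


def to_fixpt(value, int_bits, frac_bits, signed=True):
    """Return the real value represented after quantizing to fixed point.

    Assumptions: truncation toward negative infinity, wrap on overflow.
    """
    total = int_bits + frac_bits
    raw = math.floor(value * (1 << frac_bits))
    raw &= (1 << total) - 1                  # keep the low `total` bits
    if signed and raw >= 1 << (total - 1):   # reinterpret as two's complement
        raw -= 1 << total
    return raw / (1 << frac_bits)


# TX-like format: 9 integer bits, 7 fractional bits (step of 1/128).
tx_small = to_fixpt(0.3, 9, 7)
tx_wrapped = to_fixpt(256.0, 9, 7)

# TM-like format: 9 integer bits, 23 fractional bits.
tm_exact = to_fixpt(1.5, 9, 23)
```

Under these assumptions the narrower TX format snaps 0.3 to the nearest lower multiple of 1/128, while the same value range [-256, 256) applies to both formats.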


SGD in Spatial

def epoch(i: Int, ...): Unit = {
  // Custom caching for random access on y
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  }
  else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }

  // Explicit memory transfer
  val x = SRAM[TX](D)
  x load data(i, 0::D)

  // Gradient computation
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt

  // Weight update
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}
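As a cross-check of what epoch computes, the following is a plain-Python reference model of the same steps: the block cache on y, the dot product for yHat, and the weight update. It uses floats instead of fixed point and ordinary lists instead of DRAM/SRAM, so it is a behavioral sketch only; `SgdModel`, `read_y`, and `y_block_loads` are names invented for this example, and `lr` stands in for A.

```python
class SgdModel:
    def __init__(self, data, y, weights, csize, lr):
        self.data, self.y = data, y
        self.w = list(weights)   # plays the role of wK
        self.csize, self.lr = csize, lr
        self.y_addr = -1         # base index of the cached block; -1 means empty
        self.y_cache = []
        self.y_block_loads = 0   # number of block fetches of y

    def read_y(self, i):
        # Hit if i falls inside the currently cached block of y.
        hit = self.y_addr != -1 and self.y_addr <= i < self.y_addr + self.csize
        if not hit:
            self.y_addr = i - (i % self.csize)
            self.y_cache = self.y[self.y_addr:self.y_addr + self.csize]
            self.y_block_loads += 1
        return self.y_cache[i - self.y_addr]

    def epoch(self, i):
        y_pt = self.read_y(i)
        x = self.data[i]
        # Gradient computation: prediction error for sample i.
        y_hat = sum(wj * xj for wj, xj in zip(self.w, x))
        y_err = y_hat - y_pt
        # Weight update.
        self.w = [wj - self.lr * y_err * xj for wj, xj in zip(self.w, x)]


model = SgdModel(
    data=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    y=[1.0, 2.0, 3.0, 2.0],
    weights=[0.0, 0.0],
    csize=2,
    lr=0.5,
)
for sample in (0, 1, 3):
    model.epoch(sample)
```

Samples 0 and 1 share one cached block of y, so only the jump to sample 3 triggers a second block load.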


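SGD in Spatial: Hardware

[Figure: block diagram of the SGD hardware. DRAM holds weights, y, and data. The FPGA holds wK, x, yAddr, yCache, yPt, yHat, and yErr, with blocks labeled 13.load, 15.Sequential.Foreach, 22.store, 27. if … else, 37.load, 41. Reduce (× and +), and 45. Foreach (- and ×).]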