
Optimus: Efficient Realization of Streaming Applications on FPGAs

University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke

IBM Research: David Bacon, Rodric Rabbah

Introduction

End of free ride from clock scaling

Applications more demanding

More applications on embedded platforms

Evolution of new architectures

Crypto

XML parser

Physics

GPU

Why FPGAs?

• Customizable and reconfigurable
  – On-the-fly and in-the-field
  – Customizability yields performance and low power
• Many orders of magnitude more parallelism than existing multicores
  – Task-level parallelism
  – Pipeline parallelism
  – Bit-level parallelism

Liquid Metal Vision

• One unified language (Lime) for programming hardware (e.g., FPGAs) and heterogeneous architectures

• Liquid Metal VM: JIT the hardware!

[Figure: programs written in Lime run on the Liquid Metal VM, which targets CPU, GPU, Cell (multicore), FPGA, and other (???) architectures.]

Liquid Metal Tool Chain

[Figure: Liquid Metal tool chain. Streaming languages pass through the front-end compiler to the Spatial IR. The Crucible back-end compiler emits C, which the Cell SDK builds into a Cell binary for the Cell BE. The Optimus back-end compiler, using an FPGA model, emits HDL, which the Xilinx VHDL compiler builds into a Xilinx bitfile for the Virtex5 FPGA. Streaming VM boxes appear alongside the targets.]

Overview

• Spatial IR (SIR)

• Compilation Flow

• Scheduling

• Optimizations

• Results

Spatial Intermediate Representation

• Main constructs:
  – Filter: encapsulates computation.
  – Pipeline: expresses pipeline parallelism.
  – Splitjoin: expresses task-level parallelism.
  – Other constructs not relevant here
• Exposes different types of parallelism
  – Composable, hierarchical
• Some streaming languages can be easily lowered to SIR:
  – Lime, StreamIt

[Figure: examples of a pipeline, a filter, and a splitjoin.]
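The three constructs compose hierarchically, which can be pictured as a small tree data structure. The following is only a minimal sketch: the class names, fields, and the pop/push rates are assumptions made for illustration and are not the actual SIR definitions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Filter:
    """Leaf node: encapsulates computation (rates are illustrative)."""
    name: str
    pop: int   # items consumed per firing
    push: int  # items produced per firing


@dataclass
class Pipeline:
    """Children run one after another: pipeline parallelism."""
    stages: List["Node"]


@dataclass
class SplitJoin:
    """Children run side by side: task-level parallelism."""
    split_weights: List[int]
    branches: List["Node"]
    join_weights: List[int]


Node = Union[Filter, Pipeline, SplitJoin]


def count_filters(node: Node) -> int:
    """Walk the hierarchy and count the leaf filters."""
    if isinstance(node, Filter):
        return 1
    children = node.stages if isinstance(node, Pipeline) else node.branches
    return sum(count_filters(child) for child in children)


# Hypothetical graph: a source, four parallel filters, and a sink.
graph = Pipeline([
    Filter("Source", pop=0, push=1),
    SplitJoin(
        split_weights=[8, 8, 8, 8],
        branches=[Filter(f"Filter{i}", pop=8, push=1) for i in range(4)],
        join_weights=[1, 1, 1, 1],
    ),
    Filter("Sink", pop=1, push=0),
])
total_filters = count_filters(graph)
```

Because every construct is itself a node, a splitjoin can sit inside a pipeline (as here) or the other way around.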

Top Level Compilation

[Figure: each filter is compiled to a hardware block containing a Controller, an Init module, and Work modules M0 ... Mn, with inputs i0 ... ix and outputs O0 ... Om. Example stream graph: Source, Round-Robin Splitter (8,8,8,8), four Filters, Round-Robin Joiner (1,1,1,1), Sink. The graph nodes are labeled A through J; each has its own Controller and Work block, and an Init block is shown with a[ ] and i.]

Filter Compilation

Example filter, split into four basic blocks:

  bb1:  sum = 0
        i = 0
  bb2:  temp = pop()
  bb3:  sum = sum + temp
        i = i + 1
        branch bb2 if i < 8
  bb4:  push(sum)

[Figure: each basic block becomes a hardware module with a control-in port, control-out ports, an ack signal, live data ins and outs, and memory/queue ports. In the example, bb1 through bb4 are connected through registers and muxes that carry live-out data; the diagram also shows a FIFO read, a FIFO write, and control tokens and acks passing between the blocks.]
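The control-token idea can be mimicked in software: exactly one basic block holds the token at a time, and live values sit in registers between blocks. This is a minimal sketch of that behaviour for the four-block example, not the generated hardware; the `run_filter` function and the register dictionary are assumptions for illustration.

```python
from collections import deque


def run_filter(in_fifo, out_fifo):
    """Run one firing of the example filter, one basic block at a time."""
    regs = {}  # live data held in registers between blocks

    def bb1():
        regs["sum"] = 0
        regs["i"] = 0
        return "bb2"  # pass the control token to bb2

    def bb2():
        regs["temp"] = in_fifo.popleft()  # FIFO read (pop)
        return "bb3"

    def bb3():
        regs["sum"] += regs["temp"]
        regs["i"] += 1
        # Two control outs: loop back to bb2 or fall through to bb4.
        return "bb2" if regs["i"] < 8 else "bb4"

    def bb4():
        out_fifo.append(regs["sum"])  # FIFO write (push)
        return None  # firing complete, no further token

    blocks = {"bb1": bb1, "bb2": bb2, "bb3": bb3, "bb4": bb4}
    token = "bb1"
    trace = []
    while token is not None:
        trace.append(token)
        token = blocks[token]()
    return trace


in_fifo = deque(range(1, 9))  # eight input items: 1..8
out_fifo = deque()
trace = run_filter(in_fifo, out_fifo)
```

The trace visits bb1 once, the bb2/bb3 pair eight times, and bb4 once, leaving a single sum in the output FIFO.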

Operation Compilation

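[Figure: each operation maps to a functional unit (FU) with inputs i0 ... im, outputs o0 ... on, and a predicate. Example for the loop body: two ADD units and a CMP unit operate on i, 1, temp, sum, and the constant 8, with results held in a register; the block has one control in and two control outs (control out 3 and control out 4).]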

Stream Scheduling

• Filters fire eagerly.
  – Blocking channel access.
  – Allows for potentially smaller channels.
• Results produced with lower latency.

[Figure: Filter 1 pushes 2 items into a channel from which Filter 2 pops 3 items.]
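For the rates shown for Filter 1 and Filter 2 (push 2, pop 3), a quick calculation gives the shortest repeating pattern in which pushes and pops balance. This is a sketch that assumes the rates are fixed per firing; the function name is hypothetical.

```python
from math import gcd


def steady_state_firings(push_rate, pop_rate):
    """Return (producer firings, consumer firings, tokens moved) for the
    smallest schedule in which pushes equal pops on one channel."""
    tokens = push_rate * pop_rate // gcd(push_rate, pop_rate)  # least common multiple
    return tokens // push_rate, tokens // pop_rate, tokens


producer_firings, consumer_firings, tokens = steady_state_firings(2, 3)
```

With these rates the producer fires three times for every two consumer firings, moving six items through the channel per repetition.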

Optimizations

• Streaming optimizations (macro functional)
  – Channel allocation, channel access fusion, filter fission and fusion, etc.
  – These optimizations need global information about the stream graph
  – Typically performed manually using existing tools
• Classic optimizations (micro functional)
  – Common subexpression elimination, constant folding, loop unrolling, etc.
  – Typically included in existing compilers and tools

Channel Allocation

• Larger channels:
  – More SRAM
  – More control logic
  – Fewer stalls

• Interlocking ensures that each filter either gets the right data or blocks.

• What is the right channel size?

Channel Allocation Algorithm

• Set the size of the channels to infinity.

• Warm up the queues.

• Record the steady-state instruction schedules for each producer/consumer pair.

• Unroll the schedules to have the same number of pushes and pops.

• Find the maximum number of overlapping lifetimes.
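The last step can be sketched as an interval-overlap count. The code below is an illustration only: it assumes FIFO order (the k-th push is matched with the k-th pop) and that an item occupies its slot from the push cycle through the pop cycle inclusive. The function name and the cycle numbers are hypothetical.

```python
def max_overlapping_lifetimes(push_cycles, pop_cycles):
    """Peak number of items alive in the channel at the same time."""
    if len(push_cycles) != len(pop_cycles):
        raise ValueError("unroll the schedules until pushes equal pops")
    events = []
    for pushed, popped in zip(sorted(push_cycles), sorted(pop_cycles)):
        if popped < pushed:
            raise ValueError("an item cannot be popped before it is pushed")
        events.append((pushed, +1))      # slot becomes occupied
        events.append((popped + 1, -1))  # slot is free the cycle after the pop
    events.sort()  # at the same cycle, releases (-1) sort before acquires (+1)
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Illustrative steady-state schedule over 12 cycles.
pushes = [3, 5, 7, 8, 9, 12]
pops = [3, 7, 9, 10, 11, 12]
channel_size = max_overlapping_lifetimes(pushes, pops)
```

Under these assumptions the peak is reached in cycle 9, when the items pushed in cycles 7, 8, and 9 are all still in the channel.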


Channel Allocation Example

Producer schedule: ----, ----, push, ----, push, ----, push, push, push, ----, ----, push

Consumer schedule: ----, ----, pop, ----, ----, ----, pop, ----, pop, pop, pop, pop

Max overlap = 3

[Figure: stream graph with Source, Filter 1, Filter 2, and Sink.]

Channel Allocation

Channel Access Fusion

• Each channel access (push or pop) takes one cycle.

• High communication-to-computation ratio

• Longer critical path latency

• Limits task-level parallelism

Channel Access Fusion Algorithm

• Clustering channel access operations
  – Loop unrolling
  – Code motion
  – Balancing the groups
• Similar to vectorization
• Wide channels

[Figure: a stream graph whose channels are annotated with read (r) and write (w) accesses and their multiplicities, listed in order: Write Mult. = 1, Read Mult. = 8; Write Mult. = 8, Read Mult. = 8; Write Mult. = 4, Read Mult. = 1.]

Access Fusion Example

• Some caveats

Original filter:

  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  push(sum);

After access fusion (pop4 reads four items in one access):

  int sum = 0;
  int t1, t2, t3, t4;
  for (int i = 0; i < 8; i++) {
    (t1, t2, t3, t4) = pop4();
    sum += t1 + t2 + t3 + t4;
  }
  push(sum);

A filter with two extra pops after the loop:

  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  pop();
  pop();
  push(sum);

The same filter with the loop unrolled by four:

  int sum = 0;
  for (int i = 0; i < 8; i++) {
    sum += pop();
    sum += pop();
    sum += pop();
    sum += pop();
  }
  pop();
  pop();
  push(sum);
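The effect of fusing accesses can be checked with a small software model that counts channel accesses. This is a sketch only: the `Channel` class and its `pop_n` method are hypothetical stand-ins for a wide channel, and one access is assumed to cost one cycle regardless of width.

```python
from collections import deque


class Channel:
    """Toy FIFO that counts how many accesses are made to it."""

    def __init__(self, items):
        self.items = deque(items)
        self.accesses = 0

    def pop(self):
        self.accesses += 1
        return self.items.popleft()

    def pop_n(self, n):
        # One wide access returns n items at once.
        self.accesses += 1
        return [self.items.popleft() for _ in range(n)]


def sum_unfused(channel):
    total = 0
    for _ in range(32):
        total += channel.pop()
    return total


def sum_fused(channel):
    total = 0
    for _ in range(8):
        t1, t2, t3, t4 = channel.pop_n(4)
        total += t1 + t2 + t3 + t4
    return total


data = list(range(32))
narrow, wide = Channel(data), Channel(data)
unfused_result = sum_unfused(narrow)
fused_result = sum_fused(wide)
```

Both versions compute the same sum, but the fused version touches the channel 8 times instead of 32.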

Access Fusion

Speedup (baseline = PowerPC)

Energy Consumption

Handel-C Comparison

• Compared DES and DCT with hand-optimized Handel-C implementation

• Performance
  – 5% faster before optimizations
  – 12x faster after optimizations
• Area
  – 66% larger before optimizations
  – 90% larger after optimizations


Conclusion

• Streaming language to program heterogeneous systems

• Hierarchical synthesis using Spatial IR

• Macro and micro functional optimizations
  – Channel Access Fusion: 2.4x speedup
  – Channel Allocation: 50% area saving

Thank you!

• Questions?


Static Stream Scheduling

• Resources have to be ready before a filter starts (pushes and pops are non-blocking).

• Double buffering for parallelism.

• Deadlock can be detected at compile time.

• Could be inefficient in case of data-dependent behavior.

System Setup

[Figure: system setup. Streaming languages, front-end compiler, SIR, Streaming VM, HDL, Xilinx VHDL compiler, Xilinx bitfile, Virtex5 FPGA, C, Cell SDK, Cell binary, Cell BE, and FPGA model.]

Stream Scheduling

• Activate all the filters at time 0.

• Blocking channel access.

• No restriction on the channel size.

• Results in the lowest latency.

[Figure: example stream graph (Source, Adder 1 to Adder 4, Printer) and its hardware realization as nodes A to J, each with a Controller and a Work block, plus an Init block shown with a[ ] and i.]

StreamIt Example

[Figure: stream graph for the example: Source, Adder 1 to Adder 4, and Printer, with nodes labeled A to J.]

  void->void pipeline Minimal {
      add Source();
      add AddSplitter(8, 4);
      add Printer();
  }

  int->int splitjoin AddSplitter(int addSize, int pFactor) {
      split roundrobin(pFactor);
      for (int i = 0; i < pFactor; i++)
          add AdderFilter(addSize);
      join roundrobin(1);
  }

  int->void filter Printer() {
      work pop 1 {
          println(pop());
      }
  }
