Optimus: Efficient Realization of Streaming Applications on FPGAs

University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke

IBM Research: David Bacon, Rodric Rabbah

Introduction

End of free ride from clock scaling

Applications more demanding

More applications on embedded platforms

Evolution of new architectures

Why FPGAs?

• Customizable and reconfigurable
  – On-the-fly and in-the-field
  – Customizability brings performance and low power
• Many orders of magnitude more parallelism than existing multicores
  – Task-level parallelism
  – Pipeline parallelism
  – Bit-level parallelism

[Figure labels: Crypto, XML parser, Physics, GPU]

Liquid Metal Vision

• One unified language (Lime) for programming hardware (e.g., FPGAs) and heterogeneous architectures

• Liquid Metal VM: JIT the hardware!

[Figure: CPU, GPU, Cell (multicore), FPGA, and an unnamed "???" target sit on top of the Liquid Metal VM; all are programmed with Lime.]

Liquid Metal Tool Chain

[Figure: tool-chain diagram. Labeled boxes: Streaming Languages; Front-End Compiler; Spatial IR; Optimus Back-End Compiler with an FPGA Model; HDL; Xilinx VHDL Compiler; Xilinx bitfile; Streaming VM on a Virtex5 FPGA; Crucible Back-End Compiler; C; Cell SDK; Cell binary; Streaming VM on a Cell BE.]

Overview

• Spatial IR (SIR)

• Compilation Flow

• Scheduling

• Optimizations

• Results

Spatial Intermediate Representation

• Main constructs:
  – Filter: encapsulates computation.
  – Pipeline: expresses pipeline parallelism.
  – Splitjoin: expresses task-level parallelism.
  – Other constructs not relevant here
• Exposes different types of parallelism
  – Composable, hierarchical
• Some streaming languages can be easily lowered to SIR:
  – Lime, StreamIt

[Figure: example stream graph with a labeled pipeline, filter, and splitjoin]
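The composable, hierarchical nature of these constructs can be sketched as a small tree of Python dataclasses. This is only an illustration: the class names, the pop/push rate fields, and the `filters` helper are assumptions made for the sketch and are not taken from the Optimus or Liquid Metal code.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Filter:
    """Leaf node: encapsulates computation with fixed pop/push rates."""
    name: str
    pop: int = 1
    push: int = 1


@dataclass
class Pipeline:
    """Children are connected one after another (pipeline parallelism)."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class SplitJoin:
    """Branches sit side by side between a splitter and a joiner
    (task-level parallelism)."""
    branches: List["Node"] = field(default_factory=list)


Node = Union[Filter, Pipeline, SplitJoin]


def filters(node: Node) -> List[str]:
    """Walk the hierarchy and list the leaf filter names in order."""
    if isinstance(node, Filter):
        return [node.name]
    kids = node.children if isinstance(node, Pipeline) else node.branches
    return [name for kid in kids for name in filters(kid)]


# A pipeline that contains a splitjoin, which in turn contains filters.
graph = Pipeline([
    Filter("Source", pop=0, push=1),
    SplitJoin([Filter(f"Adder{i}", pop=8, push=1) for i in range(1, 5)]),
    Filter("Sink", pop=1, push=0),
])
print(filters(graph))
```

Because a Pipeline or SplitJoin can hold any other node, arbitrarily nested graphs are built from the same three pieces.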

Top Level Compilation

[Figure: a stream graph (Source, Round-Robin Splitter(8,8,8,8), four Filters, Round-Robin Joiner(1,1,1,1), Sink) and its hardware realization. The filter template has a Controller, Init and Work blocks, memories M0 ... Mn, inputs i0 ... ix, and outputs O0 ... Om. The realized graph shows eight Controller/Work pairs, an Init block holding a[ ] and i, and labels A through J.]

Filter Compilation

Example filter, split into basic blocks:

    bb1:  sum = 0
          i = 0
    bb2:  temp = pop( )
    bb3:  sum = sum + temp
          i = i + 1
          Branch bb2 if i < 8
    bb4:  push(sum)

[Figure: a basic block template with Control in, Control outs, Live data ins, Live data outs, Memory/Queue ports, and Ack. Blocks bb1 through bb4 are connected through registers and muxes that carry live-out data; Control Tokens and Acks pass between blocks, and the blocks that pop and push use FIFO Read and FIFO Write ports.]

Operation Compilation

    sum = sum + temp
    i = i + 1
    Branch bb2 if i < 8

[Figure: a functional unit (FU) template with inputs i0 ... im, outputs o0 ... on, and a predicate. For the block above: two ADD units and one CMP unit, fed by registers holding i, temp, and sum and by the constants 1 and 8, with Control in, Control out 3, and Control out 4.]
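The way control moves between the four basic blocks of the example filter can be mimicked in software. The sketch below is an assumption-laden model, not the generated hardware: a single variable stands in for the control token, a dictionary stands in for the live-value registers, and the function name `run_filter` is made up for the illustration.

```python
from collections import deque


def run_filter(inputs):
    """Execute one firing of the example filter, one basic block at a time."""
    in_q, out_q = deque(inputs), deque()
    regs = {"sum": 0, "i": 0, "temp": 0}   # live values held in registers
    block = "bb1"                           # block that currently holds the control token
    trace = []
    while block is not None:
        trace.append(block)
        if block == "bb1":
            regs["sum"], regs["i"] = 0, 0
            block = "bb2"
        elif block == "bb2":
            regs["temp"] = in_q.popleft()   # FIFO read
            block = "bb3"
        elif block == "bb3":
            regs["sum"] += regs["temp"]     # first ADD
            regs["i"] += 1                  # second ADD
            # CMP picks which control output receives the token.
            block = "bb2" if regs["i"] < 8 else "bb4"
        elif block == "bb4":
            out_q.append(regs["sum"])       # FIFO write
            block = None                    # this firing is complete
    return list(out_q), trace


outputs, trace = run_filter(range(1, 9))
print(outputs)      # [36]
print(len(trace))   # 18 block activations: bb1, 8 x (bb2, bb3), bb4
```

In the hardware described on the slides the blocks exist side by side and only the token moves; the loop here just serializes that behavior.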

Stream Scheduling

• Filters fire eagerly.
  – Blocking channel access.
  – Allows for potentially smaller channels.
• Results produced with lower latency.

[Figure: Filter 1 (Push 2) feeding Filter 2 (Pop 3)]
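A toy model of eager firing with blocking channel access, using the figure's rates (Filter 1 pushes 2 per firing, Filter 2 pops 3). The timing model is an assumption made for the sketch: each firing takes one step, and both filters decide whether to fire from the channel occupancy at the start of the step. The function name and parameters are hypothetical.

```python
def simulate(capacity, steps, push=2, pop=3):
    """Count how often the consumer fires within `steps` time steps."""
    occupancy = 0
    fired = 0
    for _ in range(steps):
        # Both decisions use the occupancy at the start of the step.
        can_push = occupancy + push <= capacity   # otherwise the producer blocks
        can_pop = occupancy >= pop                # otherwise the consumer blocks
        if can_push:
            occupancy += push
        if can_pop:
            occupancy -= pop
            fired += 1
    return fired


for capacity in (3, 4, 6):
    print(capacity, simulate(capacity, steps=10))
```

Under these assumptions a channel of 3 entries deadlocks (0 consumer firings), 4 entries give 4 firings in 10 steps, and 6 entries give 6, which is the stall-versus-size trade-off that channel allocation has to settle.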

Optimizations

• Streaming optimizations (macro functional)
  – Channel allocation, channel access fusion, filter fission and fusion, etc.
  – These optimizations need global information about the stream graph
  – Typically performed manually using existing tools
• Classic optimizations (micro functional)
  – Common subexpression elimination, constant folding, loop unrolling, etc.
  – Typically included in existing compilers and tools

Channel Allocation

• Larger channels:
  – More SRAM
  – More control logic
  – Fewer stalls
• Interlocking makes sure that each filter gets the right data or blocks.
• What is the right channel size?

Channel Allocation Algorithm

• Set the size of the channels to infinity.
• Warm up the queues.
• Record the steady-state instruction schedules for each producer/consumer pair.
• Unroll the schedules to have the same number of pushes and pops.
• Find the maximum number of overlapping lifetimes.
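The last two steps of the algorithm can be sketched as follows. The encoding is an assumption made for the illustration: a schedule is a string with one character per cycle, "P" for a push, "C" for a pop, "-" for any other cycle, and both strings are taken to be recorded on a common timeline after warm-up so that no pop precedes its push. A value is counted as occupying a slot from its push cycle through its pop cycle, which is a conservative choice rather than something the slides specify. The example schedules are illustrative, not a transcription of any figure.

```python
from math import gcd


def unroll(producer, consumer):
    """Repeat each steady-state schedule until pushes and pops match in number."""
    pushes, pops = producer.count("P"), consumer.count("C")
    total = pushes * pops // gcd(pushes, pops)    # least common multiple
    return producer * (total // pushes), consumer * (total // pops)


def channel_size(producer, consumer):
    """Maximum number of values that are live in the channel at the same time."""
    producer, consumer = unroll(producer, consumer)
    push_cycles = [t for t, op in enumerate(producer) if op == "P"]
    pop_cycles = [t for t, op in enumerate(consumer) if op == "C"]
    # FIFO order: the k-th value pushed is the k-th value popped.
    lifetimes = list(zip(push_cycles, pop_cycles))
    last = max(pop_cycles)
    return max(
        sum(start <= t <= end for start, end in lifetimes)
        for t in range(last + 1)
    )


print(channel_size("--P-P-PPP--P", "--C---C-CCCC"))   # 3
print(channel_size("P-", "-C-C"))                     # 1
```

In the second call the producer schedule is unrolled twice so that two pushes line up with the consumer's two pops before lifetimes are compared.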

Channel Allocation Example

[Figure: cycle-by-cycle steady-state schedules for a Producer and a Consumer, taken from the pipeline Source, Filter 1, Filter 2, Sink. The listing shows six pushes and six pops separated by idle cycles (----). Max overlap = 3.]

Channel Allocation

Channel Access Fusion

• Each channel access (push or pop) takes one cycle.

• Communication to computation ratio

• Longer critical path latency

• Limit task-level parallelism

Channel Access Fusion Algorithm

• Clustering channel access operations
  – Loop unrolling
  – Code motion
  – Balancing the groups
• Similar to vectorization
• Wide channels

[Figure: sequences of read (r) and write (w) accesses grouped into wide accesses, annotated with Write Mult. = 1 / Read Mult. = 8, Write Mult. = 8 / Read Mult. = 8, and Write Mult. = 4 / Read Mult. = 1.]

Access Fusion Example

Original filter:

    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    push(sum);

After fusion (four values per channel access):

    int sum = 0;
    int t1, t2, t3, t4;
    for (int i = 0; i < 8; i++) {
        (t1, t2, t3, t4) = pop4();
        sum += t1 + t2 + t3 + t4;
    }
    push(sum);

• Some caveats

Filter with two extra pops:

    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    pop();
    pop();
    push(sum);

After unrolling by four:

    int sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += pop();
        sum += pop();
        sum += pop();
        sum += pop();
    }
    pop();
    pop();
    push(sum);
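The effect of fusing accesses can be checked with a small model in which every channel access, narrow or wide, costs one cycle, as the Channel Access Fusion slide states. The `Channel` class and its `width` parameter are hypothetical stand-ins for a wide channel; they are not part of StreamIt or Optimus.

```python
from collections import deque


class Channel:
    """FIFO that counts accesses; each access stands for one cycle."""

    def __init__(self, values):
        self.queue = deque(values)
        self.accesses = 0

    def pop(self, width=1):
        """Remove `width` values in a single access (width > 1 models a wide channel)."""
        self.accesses += 1
        return [self.queue.popleft() for _ in range(width)]


def sum_unfused(ch):
    """32 single-value pops, as in the original filter."""
    total = 0
    for _ in range(32):
        total += ch.pop()[0]
    return total


def sum_fused(ch):
    """8 four-value pops, as in the fused filter."""
    total = 0
    for _ in range(8):
        total += sum(ch.pop(width=4))
    return total


a, b = Channel(range(32)), Channel(range(32))
assert sum_unfused(a) == sum_fused(b)   # same result either way
print(a.accesses, b.accesses)           # 32 8
```

Both versions compute the same sum, but the fused one touches the channel a quarter as often. A filter that pops 34 values per firing, like the second example on the slide, does not divide evenly into groups of four, which is presumably one of the caveats the slide refers to.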

Access Fusion

Speedup (baseline = PowerPC)

Energy Consumption

Handel-C Comparison

• Compared DES and DCT with hand-optimized Handel-C implementations
• Performance
  – 5% faster before optimizations
  – 12x faster after optimizations
• Area
  – 66% larger before optimizations
  – 90% larger after optimizations

Conclusion

• Streaming language to program heterogeneous systems
• Hierarchical synthesis using Spatial IR
• Macro and micro functional optimizations
  – Channel Access Fusion: 2.4x speedup
  – Channel Allocation: 50% area saving

Thank you!

• Questions?

Static Stream Scheduling

• Resources have to be ready before a filter starts (pushes and pops are non-blocking).
• Double buffering for parallelism.
• Deadlock can be detected at compile time.
• Could be inefficient in case of data-dependent behavior.

System Setup

[Figure: tool-chain diagram. Labeled boxes: Streaming Languages; Front-End Compiler; SIR; Optimus Back-End Compiler with an FPGA Model; HDL; Xilinx VHDL Compiler; Xilinx bitfile; Streaming VM on a Virtex5 FPGA; Crucible Back-End Compiler; C; Cell SDK; Cell binary; Streaming VM on a Cell BE.]

Stream Scheduling

• Activate all the filters at time 0.
• Blocking channel access.
• No restriction on the channel size.
• Results in the lowest latency.

[Figure: stream graph Source, Round-Robin Splitter(8,8,8,8), Adder 1 through Adder 4, Round-Robin Joiner(1,1,1,1), Printer, realized with eight Controller/Work pairs, an Init block holding a[ ] and i, and labels A through J.]

StreamIt Example

[Figure: stream graph Source, Round-Robin Splitter(8,8,8,8), Adder 1 through Adder 4, Round-Robin Joiner(1,1,1,1), Printer, with labels A through J.]

    void->void pipeline Minimal {
        add Source();
        add AddSplitter(8, 4);
        add Printer();
    }

    int->int splitjoin AddSplitter(int addSize, int pFactor) {
        split roundrobin(pFactor);
        for (int i = 0; i < pFactor; i++)
            add AdderFilter(addSize);
        join roundrobin(1);
    }

    int->void filter Printer() {
        work pop 1 {
            println(pop());
        }
    }
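The data flow through the split, the four adders, and the join can be traced with a short Python model. Several things here are assumptions: Source and AdderFilter are not defined on the slide, so the model uses a made-up source of 64 consecutive integers and assumes each adder pops addSize values and pushes their sum; the split and join weights follow the figure's labels, (8,8,8,8) and (1,1,1,1). All function names are hypothetical.

```python
def round_robin_split(items, weights):
    """Deal items to the branches in turn, weights[k] at a time to branch k."""
    branches = [[] for _ in weights]
    pos = 0
    while pos + sum(weights) <= len(items):     # only whole rounds are distributed
        for branch, w in zip(branches, weights):
            branch.extend(items[pos:pos + w])
            pos += w
    return branches


def adder(stream, add_size):
    """Assumed AdderFilter behavior: pop add_size values, push their sum."""
    return [sum(stream[k:k + add_size])
            for k in range(0, len(stream) - add_size + 1, add_size)]


def round_robin_join(branches, weights):
    """Collect weights[k] items from branch k in turn, for as many whole rounds as exist."""
    rounds = min(len(b) // w for b, w in zip(branches, weights))
    out = []
    for r in range(rounds):
        for b, w in zip(branches, weights):
            out.extend(b[r * w:(r + 1) * w])
    return out


source = list(range(64))                        # hypothetical Source output
split = round_robin_split(source, [8, 8, 8, 8])
sums = [adder(branch, 8) for branch in split]
printed = round_robin_join(sums, [1, 1, 1, 1])
print(printed)   # [28, 92, 156, 220, 284, 348, 412, 476]
```

With these weights the joined stream equals the sums of consecutive groups of eight inputs, computed by four adders working in parallel.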