
GraphGen for CoRAM: Graph Computation on FPGAs

Gabriel Weisz (CMU), Eriko Nurvitadhi (Intel), James C. Hoe (CMU)

December 7, 2013

Computer Architecture Lab at


This work is supported, in part, by the National Science Foundation CCF‐1320725 and by the Intel Science and Technology Center in Embedded Computing. Thank you Altera, Xilinx, and Bluespec for your generous donation of hardware and tools.

Why graph algorithms?

Many machine learning algorithms and data mining algorithms are based on graphs

Stereo Matching

Image Segmentation

Handwriting Recognition

Speech Recognition

Graphs encode data and relationships

CARL 2013 / © Gabriel Weisz

Stereo Matching

Parallax ‐ Closer objects are in the same place

Input: Stereo images

[Figure: graph labels: Pixel Pairs, Adjacent pixels]

Output: Depth map

Outline

• Introduction

• GraphGen Compiler

• GraphGen FPGA Target

• Optimizations

• Experimental Results

• Conclusion


GraphGen compiler workflow

[Figure: workflow diagram with labels: Vertex‐Centric Graph Specification, Update‐Function(), GraphGen Compiler, GraphGen Template, Graph Execution Model (GEM) Program, FPGA Bitstream, Memory Image, Header]






Vertex‐centric graph specification

Graph Structure

[Figure: example graph with numbered vertices]

Vertex Data:

    struct vdata
        uint(32) L0;
        uint(32) L1;
        uint(32) L2;

Edge Data:

    struct edata
        uint(32) L0;
        uint(32) L1;
        uint(32) L2;

Update‐Function(Vertex v)

    tmp = f1(v.data);
    for each Edge e in scope
        tmp = f2(tmp, e.data);
    . . .

Processing Element

[Figure: PE datapath built from comparison, add, subtract, and multiply operators]

Similar software frameworks: GraphLab, Pregel
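The update function above can be read as a fold over the edges in a vertex's scope. A minimal Python sketch of that reading follows. The slide does not define f1 or f2 or say what happens to tmp afterwards, so the bodies of `f1` and `f2`, the `Vertex` and `Edge` classes, and the returned value are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    data: int


@dataclass
class Vertex:
    data: int
    edges: list = field(default_factory=list)  # edges in this vertex's scope


def f1(vdata):
    # Hypothetical stand-in: seed the accumulator from the vertex data.
    return vdata


def f2(tmp, edata):
    # Hypothetical stand-in: fold one edge's data into the accumulator.
    return tmp + edata


def update_function(v):
    # Mirrors the slide: tmp = f1(v.data); then f2 once per edge in scope.
    tmp = f1(v.data)
    for e in v.edges:
        tmp = f2(tmp, e.data)
    return tmp  # assumption: the slide elides what is done with tmp


v = Vertex(data=1, edges=[Edge(2), Edge(3)])
result = update_function(v)
```

The user supplies only this per-vertex function and the data layouts; the graph traversal is left to the framework.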





Graph Execution Model (GEM)

Subgraph List: G1, G2, G3

[Figure: example graph with numbered vertices partitioned into subgraphs G1, G2, G3]

Graph Execution Model (GEM)

Update‐Function(Vertex v)

    tmp = f1(v.data);
    for each Edge e in scope
        tmp = f2(tmp, e.data);

PE Program

    f1(v5)
    f2(e2)
    f2(e3)
    f2(e4)
    f1(v6)
    f2(e3)
    f2(e7)
    f2(e8)

[Figure: subgraphs G1, G2, G3 with numbered vertices and edges]
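One way to picture how the loop in the update function becomes a PE program is to unroll it over a known subgraph. The sketch below is an illustration, not the GraphGen compiler's algorithm. The `scope` table (which edges belong to v5 and v6) is an assumption chosen to reproduce the call sequence shown on the slide, and `unroll` is a hypothetical helper.

```python
def unroll(scope):
    """Flatten the update function into a straight-line list of calls.

    `scope` maps each vertex name to the edges in its scope.
    """
    program = []
    for vertex, edges in scope.items():
        program.append(f"f1({vertex})")      # tmp = f1(v.data)
        for edge in edges:
            program.append(f"f2({edge})")    # tmp = f2(tmp, e.data)
    return program


# Assumed scopes; dicts keep insertion order, so v5 is emitted before v6.
scope = {"v5": ["e2", "e3", "e4"], "v6": ["e3", "e7", "e8"]}
pe_program = unroll(scope)
```

Because the subgraph is fixed when the program is generated, no loop or branch remains in the result.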

GEM target architecture

[Figure: Control, PE, and Storage blocks; a data partition holding subgraphs G1, G2, G3 and a compute partition; G1 shown at the PE]

GEM target architecture

[Figure: same architecture; G3 shown at the PE]






A GEM hardware template

[Figure: compute partition containing blocks labeled V, E, and I and the PE; Control Logic; DRAM]

A GEM hardware template

[Figure: same template, with CoRAM Control Threads (C‐like language)* in place of Control Logic]

*[Chung, et al., 2011]





Straight line PE program

    f1(v0)    Read Vertex 0
    f1(v1)    Read Vertex 1
    f2(e0)    Read Edge 0
    f2(e1)    Read Edge 1
    f3()      Compute
    f4(v0)    Write Vertex 0

Optimizations only care about item index and which function is called
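The pairing of calls and operations on this slide can be written as a small lookup keyed on the function name, which is all an optimizer would need besides the item index. This is a sketch under the assumption that the slide's six-line example is the whole mapping. `LOWERING` and `lower` are hypothetical names, not part of GraphGen.

```python
import re

# Mapping read off the slide's example; a real program may use other functions.
LOWERING = {
    "f1": "Read Vertex",
    "f2": "Read Edge",
    "f3": "Compute",
    "f4": "Write Vertex",
}


def lower(call):
    """Turn a call such as 'f2(e1)' into an operation such as 'Read Edge 1'."""
    match = re.fullmatch(r"(f\d)\((?:[ve](\d+))?\)", call)
    if match is None:
        raise ValueError(f"unrecognized call: {call}")
    name, index = match.groups()
    op = LOWERING[name]
    # Calls without an argument (f3) carry no item index.
    return op if index is None else f"{op} {index}"


calls = ["f1(v0)", "f1(v1)", "f2(e0)", "f2(e1)", "f3()", "f4(v0)"]
operations = [lower(c) for c in calls]
```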

Straight line PE program

    Read Vertex 0
    Read Vertex 1
    Read Edge 0
    Read Edge 1
    Compute
    Write Vertex 0

Indices are relative to local buffer, not global vertex/edge id

Static data flow helps with double buffering and other optimizations
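Local‐buffer indexing can be sketched as renaming each global id to the slot it occupies once the subgraph is loaded. The sketch assumes slots are handed out in first‐use order, which the slides do not state. The ids reuse the v5/v6 example from the GEM slide, and `localize` is a hypothetical helper.

```python
def localize(global_ids):
    """Map each distinct global id to a local buffer slot, in first-use order."""
    slots = {}
    for gid in global_ids:
        # An id that was already seen keeps the slot it was given first.
        slots.setdefault(gid, len(slots))
    return slots


# Vertices v5, v6 and edges e2, e3, e4, e3, e7, e8 as used by the PE program.
vertex_slots = localize([5, 6])
edge_slots = localize([2, 3, 4, 3, 7, 8])

# "Read Edge <local index>" for each edge access in program order.
edge_reads = [edge_slots[e] for e in [2, 3, 4, 3, 7, 8]]
```

With this renaming, the same program text works wherever the subgraph sits in DRAM, since only the control logic needs the global addresses.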


Coalescing

[Figure: items A, B, C, D in DRAM transferred to the Buffer in the same order]

Reduces overhead


Coalescing

[Figure: items A, B, C, D in DRAM; the Buffer needs them in a different order]

Out of order


Coalescing

Can reorder and rewrite the PE Program

[Figure: DRAM and Buffer holding items A, B, C, D]

PE Program

    Read A
    Read C
    Read B

PE Program (Reordered)

    Read A
    Read B
    Read C


Coalescing

[Figure: two DRAM/Buffer pairs holding items A through E, before and after gap filling]

Can also fill in gaps and rewrite

Coalescing makes transfers more efficient
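The three coalescing steps (merge contiguous items, reorder, fill gaps) can be sketched as grouping the DRAM addresses a program touches into bursts. This is an illustrative model, not the GraphGen implementation. It assumes each item occupies one address unit, and `coalesce` and `max_gap` are hypothetical names.

```python
def coalesce(addresses, max_gap=0):
    """Group DRAM item addresses into (start, length) bursts.

    Sorting is the reordering step. Runs separated by at most `max_gap`
    unused items are merged, which is the gap-filling step.
    """
    bursts = []
    for addr in sorted(set(addresses)):
        if bursts and addr - (bursts[-1][0] + bursts[-1][1]) <= max_gap:
            start, _ = bursts[-1]
            bursts[-1] = (start, addr - start + 1)  # extend the open burst
        else:
            bursts.append((addr, 1))                # start a new burst
    return bursts


# Items A..E at assumed addresses 0..4.
A, B, C, D, E = range(5)
reordered = coalesce([A, C, B])               # out-of-order reads become one burst
with_gap = coalesce([A, C, D, E])             # B is not needed: two bursts
gap_filled = coalesce([A, C, D, E], max_gap=1)  # fetch B anyway: one burst
```

Filling a gap transfers an item nobody asked for, so a real implementation would presumably bound `max_gap` by the point where the extra data costs more than the extra request.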

Pipeline parallelism

• Pipeline registers improve clock speed

• Interleave independent computations

• Compiler handles data hazards

[Figure: pipelined PE with input mux, fed by independent computations]

Increases performance by 10x
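Why interleaving helps can be seen in a toy timing model: every operation takes `depth` cycles, one operation can issue per cycle, and an operation must wait for the previous one in its own chain. The model and all its parameters are assumptions for illustration. It does not reproduce the 10x figure above, which comes from the authors' hardware.

```python
def total_cycles(chains, ops_per_chain, depth, interleave):
    """Cycle at which the last result leaves a `depth`-stage pipeline."""
    if interleave:
        # Round-robin: one op from each chain in turn.
        order = [c for _ in range(ops_per_chain) for c in range(chains)]
    else:
        # Finish one chain completely before starting the next.
        order = [c for c in range(chains) for _ in range(ops_per_chain)]
    ready = [0] * chains  # earliest cycle each chain may issue its next op
    next_slot = 0         # next free issue slot
    finish = 0
    for c in order:
        issue = max(next_slot, ready[c])  # stall on a data hazard if needed
        ready[c] = issue + depth
        finish = max(finish, issue + depth)
        next_slot = issue + 1
    return finish


sequential = total_cycles(chains=4, ops_per_chain=3, depth=4, interleave=False)
interleaved = total_cycles(chains=4, ops_per_chain=3, depth=4, interleave=True)
```

In this model the stalls disappear once there are at least as many independent chains as pipeline stages.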

Multiple read ports

[Figure: blocks labeled V, E, and I feeding the PE]

    Read Vertex 0
    Read Vertex 1
    Read Edge 0
    Read Edge 1
    Compute
    Write Vertex 0

    Read Vertices 0,1
    Read Edges 0,1
    Compute
    Write Vertex 0

Reduces the number of instructions
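The rewrite from the first program to the second can be sketched as merging runs of consecutive reads of the same kind, up to the number of read ports. The tuple encoding of instructions and the `pack_reads` helper are assumptions made for this example.

```python
def pack_reads(program, ports):
    """Merge consecutive reads of the same kind into multi-port reads.

    `program` is a list of (operation, index) pairs; index is None for
    operations that take no item. Returns (operation, [indices]) pairs.
    """
    packed = []
    for op, index in program:
        last = packed[-1] if packed else None
        if (op.startswith("Read") and last is not None
                and last[0] == op and len(last[1]) < ports):
            last[1].append(index)  # a free port remains: join the last read
        else:
            packed.append((op, [] if index is None else [index]))
    return packed


program = [
    ("Read Vertex", 0), ("Read Vertex", 1),
    ("Read Edge", 0), ("Read Edge", 1),
    ("Compute", None), ("Write Vertex", 0),
]
two_ports = pack_reads(program, ports=2)
one_port = pack_reads(program, ports=1)
```

With two ports the six instructions shrink to four, matching the slide's example.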

Outline

• Introduction

• GraphGen Compiler

• GraphGen FPGA Target

• Optimizations

• Experimental Results

• Conclusion


Experimental configuration

[Figure: Images, Graph, Depth Map]

Workload ‐ [Middlebury Benchmark], [Tree Re‐weighted Message Passing]

    FPGA Board        Xilinx ML605        Altera DE4
    FPGA Chip         Virtex‐6 LX240T     Stratix‐IV EP4SGX530
    Logic Cells       241,152             531,200
    Block Memory      14,976 Kb           27,376 Kb
    DRAM Bandwidth    6.4 GB/s            2x6.4 GB/s
    DRAM Capacity     512 MB              2 GB

Experimental results

[Figure: chart of Compute Time (ms), axis from 0 to 20, by Processing Engine Configuration (1 Read Port, 2 Read Port, 4 Read Port), with a "Better" direction marker. Series: Optimal (100 MHz); ML605/DE4 (1MC, 100 MHz); DE4 (2MC, 100 MHz); Optimal (150 MHz); DE4 (2MC, 150 MHz). Annotations: 85% of Peak DRAM Bandwidth; 1.85X performance of 1 MC]

CPU=120 ms, GPU=85 ms [Our best effort]

Convey HC‐1 (80 GB/s): 3.2 ms [Choi and Rutenbar FPL 2012]


Future work – multiple PEs

[Figure: two PEs, each with its own blocks labeled V, E, and I]

Need to handle data hazards between PEs


Conclusion

• GraphGen for CoRAM is an optimizing FPGA back‐end for the GraphGen compiler

• A statically scheduled pipelined PE enables simple but effective optimizations

• DE4 results: Generated system on $8k board only 2.6x slower than custom system on $50,000 Convey HC‐1

• ML605 results: $2k board only 4.8x slower than HC‐1
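For a rough sense of scale, the slowdown factors can be combined with the 3.2 ms Convey HC‐1 time quoted on the results slide. The sketch below only multiplies and divides numbers stated in the slides. The resulting times are implied values, not measurements reported by the authors, and the variable names are illustrative.

```python
HC1_MS = 3.2          # Convey HC-1 time from the results slide
HC1_GBPS = 80.0       # Convey HC-1 DRAM bandwidth from the results slide
HC1_COST = 50_000     # dollars, from the conclusion

# Implied compute times, reading "Nx slower" as N times the HC-1 time.
de4_ms = 2.6 * HC1_MS
ml605_ms = 4.8 * HC1_MS

# How much more DRAM bandwidth and money the HC-1 has than each board.
de4_bandwidth_ratio = HC1_GBPS / (2 * 6.4)   # DE4: two 6.4 GB/s controllers
ml605_bandwidth_ratio = HC1_GBPS / 6.4       # ML605: one 6.4 GB/s controller
de4_cost_ratio = HC1_COST / 8_000
ml605_cost_ratio = HC1_COST / 2_000

print(f"DE4:   ~{de4_ms:.1f} ms, HC-1 has {de4_bandwidth_ratio:.2f}x the bandwidth")
print(f"ML605: ~{ml605_ms:.1f} ms, HC-1 has {ml605_bandwidth_ratio:.2f}x the bandwidth")
```

On these numbers each board's slowdown is smaller than its bandwidth and cost disadvantage relative to the HC‐1.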

