Page 1: GraphGen for CoRAM: Graph Computation on FPGAs

Gabriel Weisz (CMU), Eriko Nurvitadhi (Intel), James C. Hoe (CMU)

December 7, 2013

Computer Architecture Lab at

This work is supported, in part, by the National Science Foundation (CCF-1320725) and by the Intel Science and Technology Center in Embedded Computing. Thank you Altera, Xilinx, and Bluespec for your generous donation of hardware and tools.

Page 2: Why graph algorithms?

Many machine learning algorithms and data mining algorithms are based on graphs:

• Stereo Matching
• Image Segmentation
• Handwriting Recognition
• Speech Recognition

Graphs encode data and relationships.

CARL 2013 / © Gabriel Weisz

Page 3: Stereo Matching

Parallax - Closer objects are in the same place

Input: Stereo images

Output: Depth map

[Figure labels: Pixel Pairs, Adjacent pixels]

Page 4: Outline

• Introduction
• GraphGen Compiler
• GraphGen FPGA Target
• Optimizations
• Experimental Results
• Conclusion

Page 5: GraphGen compiler workflow

[Diagram boxes: Vertex-Centric Graph Specification (Graph Specification, Update-Function()); GraphGen Compiler; Graph Execution Model (GEM) Program; GraphGen Template; FPGA Bitstream; Memory Image (Header + binary data)]

Page 7: Vertex-centric graph specification

Graph Structure: [Diagram: example graph with vertices 1 to 6]

Vertex Data:

    struct vdata
        uint(32) L0;
        uint(32) L1;
        uint(32) L2;

Edge Data:

    struct edata
        uint(32) L0;
        uint(32) L1;
        uint(32) L2;

Update-Function(Vertex v)
    tmp = f1(v.data);
    for each Edge e in scope
        tmp = f2(tmp, e.data);
    . . .

[Diagram: Processing Element built from operators such as >, +, X, -]

Similar software frameworks: GraphLab, Pregel
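
The vertex-centric model on this page can be sketched in Python. This is only an illustration: the slide leaves f1 and f2 unspecified, so the bodies below are hypothetical placeholders, and the scope of a vertex is assumed to be its list of incident edges. Only the field names L0, L1, L2 come from the slide.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # the slide declares every field as uint(32)

@dataclass
class VData:
    # Mirrors struct vdata on the slide.
    L0: int = 0
    L1: int = 0
    L2: int = 0

@dataclass
class EData:
    # Mirrors struct edata on the slide.
    L0: int = 0
    L1: int = 0
    L2: int = 0

def f1(v):
    # Hypothetical f1: seed the accumulator with the vertex's own fields.
    return [v.L0, v.L1, v.L2]

def f2(tmp, e):
    # Hypothetical f2: fold one edge into the accumulator, field by field,
    # wrapping at 32 bits.
    return [(tmp[0] + e.L0) & MASK32,
            (tmp[1] + e.L1) & MASK32,
            (tmp[2] + e.L2) & MASK32]

def update_function(v, edges_in_scope):
    # Same shape as Update-Function(Vertex v) on the slide.
    tmp = f1(v)
    for e in edges_in_scope:
        tmp = f2(tmp, e)
    return tmp

result = update_function(VData(1, 2, 3), [EData(10, 20, 30), EData(100, 200, 300)])
print(result)
```

The point of the shape is that the per-vertex work is a fixed f1 followed by one f2 per edge, which is what lets a compiler map it onto a fixed processing element.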

Page 10: Graph Execution Model (GEM)

Subgraph List: G1, G2, G3

[Diagram: the example graph (vertices 1 to 6) partitioned into subgraphs G1, G2, and G3]

Page 11: Graph Execution Model (GEM)

Update-Function(Vertex v)
    tmp = f1(v.data);
    for each Edge e in scope
        tmp = f2(tmp, e.data);

PE Program:

    f1(v5)
    f2(e2)
    f2(e3)
    f2(e4)
    f1(v6)
    f2(e3)
    f2(e7)
    f2(e8)

[Diagram: subgraphs G1, G2, G3; the subgraph shown in detail contains vertices 5 and 6 with edges 2, 3, 4, 7, and 8]
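
A rough sketch of how the Update-Function unrolls into the PE program listed on this page. The function name and the dictionary layout are assumptions made for illustration; only the vertex and edge numbers come from the slide.

```python
def expand_subgraph(vertices, scope):
    """Unroll Update-Function over a subgraph into a straight-line call list.

    vertices: vertex ids in processing order.
    scope: dict mapping each vertex id to the edge ids in its scope.
    """
    program = []
    for v in vertices:
        program.append(f"f1(v{v})")        # tmp = f1(v.data)
        for e in scope[v]:
            program.append(f"f2(e{e})")    # tmp = f2(tmp, e.data)
    return program

# Vertex 5 sees edges 2, 3, 4; vertex 6 sees edges 3, 7, 8 (from the slide).
pe_program = expand_subgraph([5, 6], {5: [2, 3, 4], 6: [3, 7, 8]})
for call in pe_program:
    print(call)
```

Because the graph structure is known when the program is generated, the loop over edges disappears and the PE only ever sees a fixed sequence of calls.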

Page 12: GEM target architecture

[Diagram: Control, PE, and Storage blocks. Data partition: G1, G2, G3. Compute partition: G1.]

Page 13: GEM target architecture

[Diagram: Control and PE blocks. Data partition: G1, G2, G3. Compute partition: G3.]

Page 16: A GEM hardware template

[Diagram: compute partition with blocks V, E, I, and PE, plus Control Logic, connected to DRAM]

Page 17: A GEM hardware template

[Diagram: compute partition with blocks V, E, I, and PE, managed by CoRAM Control Threads (C-like language)*, connected to DRAM]

*[Chung, et al., 2011]

Page 19: Straight line PE program

    f1(v0)    Read Vertex 0
    f1(v1)    Read Vertex 1
    f2(e0)    Read Edge 0
    f2(e1)    Read Edge 1
    f3()      Compute
    f4(v0)    Write Vertex 0

Optimizations only care about item index and which function is called.

Page 20: Straight line PE program

    Read Vertex 0
    Read Vertex 1
    Read Edge 0
    Read Edge 1
    Compute
    Write Vertex 0

Indices are relative to local buffer, not global vertex/edge id.

Static data flow helps with double buffering and other optimizations.
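
One way to picture the "local buffer" indexing is the sketch below. It assumes a hypothetical program representation of (op, kind, global id) tuples and assigns buffer slots in first-use order; the slides do not say how GraphGen actually assigns slots.

```python
def localize(program):
    """Rewrite (op, kind, global_id) steps to use local buffer indices.

    Returns the rewritten program and, per buffer kind, the list of global
    ids to load: position i holds the global id that goes into local slot i.
    """
    slots = {"Vertex": {}, "Edge": {}}
    local_program = []
    for op, kind, gid in program:
        if kind is None:                 # e.g. Compute touches no buffer
            local_program.append((op, None, None))
            continue
        table = slots[kind]
        if gid not in table:
            table[gid] = len(table)      # next free slot in that buffer
        local_program.append((op, kind, table[gid]))
    loads = {kind: list(table) for kind, table in slots.items()}
    return local_program, loads

# Global ids here are illustrative.
global_program = [
    ("Read", "Vertex", 5), ("Read", "Vertex", 6),
    ("Read", "Edge", 3), ("Read", "Edge", 7),
    ("Compute", None, None),
    ("Write", "Vertex", 5),
]
local_program, loads = localize(global_program)
for op, kind, idx in local_program:
    print(op if kind is None else f"{op} {kind} {idx}")
```

With these inputs the rewritten program has the same form as the listing on this page, and `loads` records which global items must be brought into each buffer beforehand. Since all of this is known statically, the next subgraph's data can be fetched while the current one is being processed.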

Page 21: Coalescing

[Diagram: DRAM and Buffer, each holding items A, B, C, D]

Reduces overhead

Page 22: Coalescing

[Diagram: DRAM and Buffer holding items A, B, C, D in different orders]

Out of order

Page 23: Coalescing

Can reorder and rewrite the PE Program

    PE Program        PE Program
    Read A            Read A
    Read C            Read B
    Read B            Read C

[Diagram: DRAM and Buffer holding items A, B, C, D; labeled "Reordered"]

Page 24: Coalescing

[Diagram: two DRAM/Buffer pairs holding items A through E, one with a gap and one with the gap filled in]

Can also fill in gaps and rewrite

Coalescing makes transfers more efficient
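
The coalescing idea on these pages can be sketched as follows. This is an illustrative model, not the GraphGen implementation: items are assumed to sit at integer DRAM addresses, and `max_gap` is a hypothetical knob for how many unneeded items may be fetched to keep a transfer contiguous.

```python
def coalesce(addresses, max_gap=0):
    """Group requested DRAM item addresses into contiguous transfers.

    addresses: item addresses in the order the PE program reads them.
    max_gap: largest run of unneeded items fetched anyway to merge two runs.
    Returns (transfers, slot_of): transfers is a list of (start, length);
    slot_of maps each requested address to its slot in the local buffer.
    """
    wanted = sorted(set(addresses))
    transfers = []                          # list of [start, length]
    for a in wanted:
        if transfers:
            start, length = transfers[-1]
            gap = a - (start + length)      # unneeded items before a
            if gap <= max_gap:
                transfers[-1] = [start, a - start + 1]
                continue
        transfers.append([a, 1])
    needed = set(wanted)
    slot_of, base = {}, 0
    for start, length in transfers:
        for a in range(start, start + length):
            if a in needed:
                slot_of[a] = base + (a - start)
        base += length
    return [tuple(t) for t in transfers], slot_of

# Out-of-order reads of A, C, B, D (addresses 0, 2, 1, 3) become one transfer.
print(coalesce([0, 2, 1, 3]))
# A, C, D, E with a hole at B: two transfers, or one if the gap is filled.
print(coalesce([0, 2, 3, 4]))
print(coalesce([0, 2, 3, 4], max_gap=1))
```

After coalescing, the PE program is rewritten to use `slot_of[address]` as its local index, which is possible because the program only depends on item indices.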

Page 25: Pipeline parallelism

• Pipeline registers improve clock speed
• Interleave independent computations
• Compiler handles data hazards

[Diagram: pipelined PE with a Mux and operators >, +, -; labeled "Independent Computations"]

Increases performance by 10x
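
A toy model of why interleaving helps. It assumes a single-issue PE whose result becomes usable `depth` cycles after issue, and treats each vertex update as a chain of dependent operations; the depth and chain sizes are made up, and the sketch does not attempt to reproduce the 10x figure.

```python
def schedule(chains, depth):
    """Issue ops from independent dependency chains onto one pipelined PE.

    chains: list of op lists; each op depends on the previous op in its chain.
    depth: cycles before a result can be consumed by the next op in the chain.
    Returns one entry per cycle: (chain_index, op), or None for a bubble.
    """
    next_op = [0] * len(chains)      # index of the next op to issue per chain
    ready_at = [0] * len(chains)     # earliest cycle each chain may issue
    remaining = sum(len(c) for c in chains)
    timeline, cycle = [], 0
    while remaining:
        issued = None
        for i, chain in enumerate(chains):
            if next_op[i] < len(chain) and ready_at[i] <= cycle:
                issued = (i, chain[next_op[i]])
                next_op[i] += 1
                ready_at[i] = cycle + depth   # hazard: wait for this result
                remaining -= 1
                break
        timeline.append(issued)
        cycle += 1
    return timeline

DEPTH = 4
chains = [[f"f2(v{v}, e{e})" for e in range(3)] for v in range(4)]

interleaved = schedule(chains, DEPTH)
one_at_a_time = sum(len(schedule([c], DEPTH)) for c in chains)
print(len(interleaved), "cycles interleaved vs", one_at_a_time, "one chain at a time")
```

In this model the scheduler never issues two operations of the same chain fewer than `depth` cycles apart, which is the data-hazard rule; with enough independent chains the bubbles disappear.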

Page 26: Multiple read ports

    One read port         Multiple read ports
    Read Vertex 0         Read Vertices 0,1
    Read Vertex 1         Read Edges 0,1
    Read Edge 0           Compute
    Read Edge 1           Write Vertex 0
    Compute
    Write Vertex 0

[Diagram: blocks V, E, and I feeding the PE]

Reduces the number of instructions
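
The rewrite shown on this page can be approximated by merging consecutive reads of the same kind. The tuple format and the `ports` parameter are assumptions for illustration only.

```python
def pack_reads(program, ports):
    """Merge consecutive reads of the same kind into multi-port reads.

    program: list of (op, kind, index) tuples, e.g. ("Read", "Vertex", 0).
    ports: number of read ports available per buffer.
    Returns a list of (op, kind, [indices]) instructions.
    """
    packed = []
    for op, kind, idx in program:
        last = packed[-1] if packed else None
        if (op == "Read" and last is not None and last[0] == "Read"
                and last[1] == kind and len(last[2]) < ports):
            last[2].append(idx)          # share the previous read instruction
        else:
            packed.append((op, kind, [] if idx is None else [idx]))
    return packed

program = [
    ("Read", "Vertex", 0), ("Read", "Vertex", 1),
    ("Read", "Edge", 0), ("Read", "Edge", 1),
    ("Compute", None, None),
    ("Write", "Vertex", 0),
]
two_ports = pack_reads(program, ports=2)
print(len(program), "instructions become", len(two_ports))
```

With two ports, the six-instruction program on the left of this page shrinks to the four-instruction program on the right.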

Page 28: Experimental configuration

Workload - [Middlebury Benchmark], [Tree Re-weighted Message Passing]

[Figure labels: Images, Graph, Depth Map]

    FPGA Board        Xilinx ML605       Altera DE4
    FPGA Chip         Virtex-6 LX240T    Stratix-IV EP4SGX530
    Logic Cells       241,152            531,200
    Block Memory      14,976 Kb          27,376 Kb
    DRAM Bandwidth    6.4 GB/s           2x6.4 GB/s
    DRAM Capacity     512 MB             2 GB

Page 29: Experimental results

[Chart: Compute Time (ms), axis from 0 to 20, versus Processing Engine Configuration (1 Read Port, 2 Read Port, 4 Read Port). Series: Optimal (100 MHz); ML605/DE4 (1MC, 100 MHz); DE4 (2MC, 100 MHz); Optimal (150 MHz); DE4 (2MC, 150 MHz). Annotations: Better; 85% of Peak DRAM Bandwidth; 1.85X performance of 1 MC.]

CPU = 120 ms, GPU = 85 ms [Our best effort]

Convey HC-1 (80 GB/s): 3.2 ms [Choi and Rutenbar, FPL 2012]

Page 30: Future work - multiple PEs

[Diagram: two PEs, each with its own V, E, and I blocks]

Need to handle data hazards between PEs

Page 31: Conclusion

• GraphGen for CoRAM is an optimizing FPGA back-end for the GraphGen compiler
• A statically scheduled pipelined PE enables simple but effective optimizations
• DE4 results: Generated system on $8k board only 2.6x slower than custom system on $50,000 Convey HC-1
• ML605 results: $2k board only 4.8x slower than HC-1
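
As a back-of-the-envelope check, the slowdown factors above can be combined with the 3.2 ms Convey HC-1 time quoted in the experimental results. The sketch assumes the 2.6x and 4.8x figures are relative to that 3.2 ms measurement; the resulting times are implied estimates, not numbers reported on the slides.

```python
hc1_ms = 3.2                        # Convey HC-1 compute time from the results page
cpu_ms, gpu_ms = 120.0, 85.0        # CPU and GPU times from the results page
slowdown = {"DE4": 2.6, "ML605": 4.8}
board_cost = {"DE4": 8_000, "ML605": 2_000, "HC-1": 50_000}

# Implied compute time per board, assuming slowdown is relative to hc1_ms.
implied_ms = {board: hc1_ms * s for board, s in slowdown.items()}

for board, ms in implied_ms.items():
    print(f"{board}: ~{ms:.2f} ms, "
          f"{cpu_ms / ms:.1f}x faster than CPU, "
          f"{gpu_ms / ms:.1f}x faster than GPU, "
          f"{board_cost['HC-1'] / board_cost[board]:.2f}x cheaper than HC-1")
```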