GraphGen for CoRAM: Graph Computation on FPGAs
Gabriel Weisz (CMU), Eriko Nurvitadhi (Intel), James C. Hoe (CMU)
December 7, 2013
Computer Architecture Lab at
This work is supported, in part, by the National Science Foundation CCF‐1320725 and by the Intel Science and Technology Center in Embedded Computing. Thank you Altera, Xilinx, and Bluespec for your generous donation of hardware and tools.
Why graph algorithms?
Many machine learning and data mining algorithms are based on graphs
• Stereo Matching
• Image Segmentation
• Handwriting Recognition
• Speech Recognition
Graphs encode data and relationships
CARL 2013 / © Gabriel Weisz
Stereo Matching
Parallax ‐ Closer objects are in the same place
Input: Stereo images
Output: Depth map
[Figure labels: Pixel Pairs; Adjacent pixels]
Outline
• Introduction
• GraphGen Compiler
• GraphGen FPGA Target
• Optimizations
• Experimental Results
• Conclusion
GraphGen compiler workflow
[Figure: workflow diagram. Labels: Vertex‐Centric Graph Specification (Graph Specification, Update‐Function()); GraphGen Compiler; Graph Execution Model (GEM) Program; GraphGen Template; FPGA Bitstream; Memory Image (Header + binary data)]
Vertex‐centric graph specification
Graph Structure
[Figure: example graph with vertices 1 to 6]
Vertex Data:
  struct vdata
    uint(32) L0;
    uint(32) L1;
    uint(32) L2;
Edge Data:
  struct edata
    uint(32) L0;
    uint(32) L1;
    uint(32) L2;
Update‐Function(Vertex v)
  tmp = f1(v.data);
  for each Edge e in scope
    tmp = f2(tmp, e.data);
  . . .
Processing Element
[Figure: PE datapath built from compare, add, subtract, and multiply operators]
Similar software frameworks: GraphLab, Pregel
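A rough software analogue of the specification on this slide, as a minimal Python sketch. The field names L0, L1, L2 and the shape of the update function come from the slide. The bodies of f1 and f2 are placeholders invented for illustration, since the slides do not define them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VData:
    # Mirrors "struct vdata" on the slide: three 32-bit fields.
    L0: int = 0
    L1: int = 0
    L2: int = 0


@dataclass
class EData:
    # Mirrors "struct edata" on the slide: three 32-bit fields.
    L0: int = 0
    L1: int = 0
    L2: int = 0


def f1(v: VData) -> int:
    # Placeholder (assumption): start from the sum of the vertex fields.
    return v.L0 + v.L1 + v.L2


def f2(tmp: int, e: EData) -> int:
    # Placeholder (assumption): add the smallest field of the edge.
    return tmp + min(e.L0, e.L1, e.L2)


def update_function(v: VData, scope: List[EData]) -> int:
    # Same shape as Update-Function(Vertex v): one f1 call on the vertex,
    # then one f2 call per edge in the vertex's scope.
    tmp = f1(v)
    for e in scope:
        tmp = f2(tmp, e)
    return tmp
```

For example, `update_function(VData(1, 2, 3), [EData(4, 5, 6), EData(7, 1, 9)])` evaluates f1 once and f2 twice, which is the per-vertex call pattern the compiler later unrolls into a PE program.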
Graph Execution Model (GEM)
Subgraph List: G1, G2, G3
[Figure: the example graph with vertices 1 to 6, partitioned into subgraphs G1, G2, and G3]
Graph Execution Model (GEM)
Update‐Function(Vertex v)
  tmp = f1(v.data);
  for each Edge e in scope
    tmp = f2(tmp, e.data);
PE Program (G1):
  f1(v5)
  f2(e2)
  f2(e3)
  f2(e4)
  f1(v6)
  f2(e3)
  f2(e7)
  f2(e8)
[Figure: subgraph G1 with vertices 5 and 6 and edges 2, 3, 4, 7, and 8, shown alongside G2 and G3]
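The expansion from update function to PE program can be sketched in a few lines of Python. This is an illustration, not the GraphGen compiler's code: the function name `expand_pe_program` and the `(vertex, edges)` encoding are hypothetical, and the scopes for G1 are read off the PE program listed on the slide.

```python
def expand_pe_program(subgraph):
    """Unroll the update function over a subgraph.

    subgraph: list of (vertex_id, [edge_ids in scope]) in processing order.
    Returns a flat list of calls, with no loops or branches left.
    """
    program = []
    for vertex, edges in subgraph:
        program.append(f"f1(v{vertex})")      # tmp = f1(v.data)
        for edge in edges:
            program.append(f"f2(e{edge})")    # tmp = f2(tmp, e.data)
    return program


# Subgraph G1 as it appears on the slide: vertices 5 and 6 with their scopes.
g1 = [(5, [2, 3, 4]), (6, [3, 7, 8])]
print(expand_pe_program(g1))
```

Because the graph structure is known at compile time, the loop over edges disappears and what remains is a fixed sequence of calls.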
GEM target architecture
[Figure, step 1: data partition (Storage holding G1, G2, G3) and compute partition (Control, PE), with G1 in the PE]
[Figure, step 2: the same partitions, with G3 in the PE]
A GEM hardware template
[Figure: compute partition with buffers labeled V, E, and I feeding a PE, plus Control Logic and DRAM]
A GEM hardware template
CoRAM Control Threads (C‐like language)*
[Figure: compute partition with buffers labeled V, E, and I feeding a PE, plus DRAM]
*[Chung, et al., 2011]
Straight line PE program
f1(v0)    Read Vertex 0
f1(v1)    Read Vertex 1
f2(e0)    Read Edge 0
f2(e1)    Read Edge 1
f3()      Compute
f4(v0)    Write Vertex 0
Optimizations only care about item index and which function is called
Straight line PE program
Read Vertex 0
Read Vertex 1
Read Edge 0
Read Edge 1
Compute
Write Vertex 0
Indices are relative to local buffer, not global vertex/edge id
Static data flow helps with double buffering and other optimizations
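A minimal Python sketch of the local-index rule, assuming a hypothetical instruction encoding `(op, kind, global_id)` with kind "V" for vertices and "E" for edges. The slides do not show how GraphGen assigns buffer slots; first-use order is used here purely for illustration.

```python
def localize(program):
    """Rewrite global vertex/edge ids as buffer-relative indices.

    program: list of (op, kind, global_id); kind is "V", "E", or None
    for instructions that touch no item (e.g. compute).
    Returns (rewritten program, buffer layout per kind).
    """
    layout = {"V": [], "E": []}   # buffer slot -> global id
    out = []
    for op, kind, gid in program:
        if kind is None:
            out.append((op, None, None))
            continue
        slots = layout[kind]
        if gid not in slots:      # first use: take the next free slot
            slots.append(gid)
        out.append((op, kind, slots.index(gid)))
    return out, layout


program = [
    ("read", "V", 5), ("read", "V", 6),
    ("read", "E", 2), ("read", "E", 3),
    ("compute", None, None),
    ("write", "V", 5),
]
local_program, layout = localize(program)
```

Here `local_program` reads vertices 0 and 1 and edges 0 and 1, then writes vertex 0, which is the same shape as the program on the slide. The `layout` tables are what a memory image would need in order to place the right global items in each buffer slot.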
Coalescing
[Figure: DRAM and Buffer both holding items A, B, C, D in the same order]
Reduces overhead
Coalescing
[Figure: DRAM holding items A, B, C, D; Buffer holding the same items in a different order]
Out of order
Coalescing
Can reorder and rewrite the PE Program
PE Program: Read A, Read C, Read B
Reordered PE Program: Read A, Read B, Read C
[Figure: DRAM and Buffer holding items A, B, C, D]
Coalescing
[Figure: two DRAM/Buffer pairs holding items A to E, before and after gaps are filled]
Can also fill in gaps and rewrite
Coalescing makes transfers more efficient
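The coalescing idea can be sketched as follows, under stated assumptions: items are identified by integer DRAM positions, `max_gap` is a hypothetical tuning knob for how many unused items a transfer may fetch to stay contiguous, and the function name is invented. The slides do not describe GraphGen's actual policy.

```python
def coalesce(addresses, max_gap=1):
    """Group item addresses into contiguous bursts.

    Sorting handles out-of-order accesses; gaps of up to max_gap unused
    items are filled so neighbouring runs merge into one transfer.
    Returns (bursts, slot): bursts is a list of (start, length) and slot
    maps each fetched address to its buffer index, which is what a
    rewritten PE program would reference.
    """
    bursts = []
    for addr in sorted(set(addresses)):
        if bursts and addr - (bursts[-1][0] + bursts[-1][1]) <= max_gap:
            start, _ = bursts[-1]
            bursts[-1] = (start, addr - start + 1)   # extend, filling any gap
        else:
            bursts.append((addr, 1))                 # start a new burst
    slot, base = {}, 0
    for start, length in bursts:
        for offset in range(length):
            slot[start + offset] = base + offset
        base += length
    return bursts, slot
```

With items A to E at positions 0 to 4, `coalesce([0, 2, 3, 4], max_gap=1)` yields the single burst `(0, 5)`: the unused item at position 1 is fetched anyway so that one transfer replaces two. With `max_gap=0` the same input yields two bursts, `(0, 1)` and `(2, 3)`.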
Pipeline parallelism
• Pipeline registers improve clock speed
• Interleave independent computations
• Compiler handles data hazards
[Figure: pipelined PE datapath with a mux; independent computations interleaved in the pipeline]
Increases performance by 10x
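A toy model of interleaving, written as a Python sketch. It assumes each computation is a chain of operations in which every operation needs the previous result, and that a result becomes available `depth` cycles after issue. The scheduler below is a simple first-ready policy invented for illustration, not GraphGen's compiler pass.

```python
def interleave(chains, depth):
    """Statically schedule dependent chains onto a pipelined PE.

    chains: list of lists of op names; ops within a chain depend on
    each other, ops in different chains are independent.
    depth: cycles between issuing an op and its result being usable.
    Returns one entry per cycle; None marks a pipeline bubble.
    """
    next_op = [0] * len(chains)    # index of the next op in each chain
    ready_at = [0] * len(chains)   # earliest cycle each chain may issue
    remaining = sum(len(c) for c in chains)
    schedule, cycle = [], 0
    while remaining:
        issued = None
        for i, chain in enumerate(chains):
            if next_op[i] < len(chain) and ready_at[i] <= cycle:
                issued = chain[next_op[i]]
                next_op[i] += 1
                ready_at[i] = cycle + depth   # avoid a read-after-write hazard
                remaining -= 1
                break
        schedule.append(issued)
        cycle += 1
    return schedule
```

In this model a single three-op chain on a depth-3 pipeline takes 7 cycles, 4 of them bubbles, while three independent three-op chains finish all 9 ops in 9 cycles with no bubbles. The hazard check is resolved entirely at schedule time, which is possible because the PE program is static.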
Multiple read ports
One read port:
  Read Vertex 0
  Read Vertex 1
  Read Edge 0
  Read Edge 1
  Compute
  Write Vertex 0
Two read ports:
  Read Vertices 0,1
  Read Edges 0,1
  Compute
  Write Vertex 0
[Figure: buffers V, E, and I feeding the PE]
Reduces the number of instructions
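The instruction-count reduction can be reproduced with a short Python sketch. The `(op, kind, index)` encoding and the `pack_reads` name are hypothetical, and the sketch only merges adjacent reads from the same buffer, which is the case shown on the slide.

```python
def pack_reads(program, ports):
    """Merge up to `ports` consecutive reads of the same kind.

    program: list of (op, kind, index); index is None for instructions
    that touch no item (e.g. compute).
    Returns a list of (op, kind, [indices]).
    """
    packed = []
    for op, kind, index in program:
        last = packed[-1] if packed else None
        if (op == "read" and last is not None and last[0] == "read"
                and last[1] == kind and len(last[2]) < ports):
            last[2].append(index)             # share the previous read
        else:
            packed.append((op, kind, [] if index is None else [index]))
    return packed


program = [
    ("read", "V", 0), ("read", "V", 1),
    ("read", "E", 0), ("read", "E", 1),
    ("compute", None, None),
    ("write", "V", 0),
]
two_port = pack_reads(program, ports=2)
```

With two ports, the six-instruction program shrinks to four: Read Vertices 0,1; Read Edges 0,1; Compute; Write Vertex 0. With `ports=1` the program is unchanged in length.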
Experimental configuration
Workload ‐ [Middlebury Benchmark], [Tree Re‐weighted Message Passing]
[Figure labels: Images, Graph, Depth Map]

FPGA Board        Xilinx ML605       Altera DE4
FPGA Chip         Virtex‐6 LX240T    Stratix‐IV EP4SGX530
Logic Cells       241,152            531,200
Block Memory      14,976 Kb          27,376 Kb
DRAM Bandwidth    6.4 GB/s           2x6.4 GB/s
DRAM Capacity     512 MB             2 GB
Experimental results
[Figure: chart of Compute Time (ms), axis from 0 to 20 with a "Better" direction marker, against Processing Engine Configuration (1 Read Port, 2 Read Port, 4 Read Port). Series: Optimal (100 MHz); ML605/DE4 (1MC, 100 MHz); DE4 (2MC, 100 MHz); Optimal (150 MHz); DE4 (2MC, 150 MHz). Annotations: 85% of Peak DRAM Bandwidth; 1.85X performance of 1 MC]
CPU=120 ms, GPU=85 ms [Our best effort]
Convey HC‐1 (80 GB/s): 3.2 ms [Choi and Rutenbar FPL 2012]
Future work – multiple PEs
[Figure: two PEs, each with its own V, E, and I buffers]
Need to handle data hazards between PEs
Conclusion
• GraphGen for CoRAM is an optimizing FPGA back‐end for the GraphGen compiler
• A statically scheduled pipelined PE enables simple but effective optimizations
• DE4 results: Generated system on $8k board only 2.6x slower than custom system on $50,000 Convey HC‐1
• ML605 results: $2k board only 4.8x slower than HC‐1
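The slowdown factors can be turned into rough absolute numbers with a little arithmetic. The Python sketch below assumes the 2.6x and 4.8x factors are relative to the 3.2 ms Convey HC‐1 time quoted on the results slide; the derived times and ratios are back-of-the-envelope values, not measurements reported in the talk.

```python
HC1_MS = 3.2                   # Convey HC-1 compute time (results slide)
CPU_MS, GPU_MS = 120.0, 85.0   # best-effort CPU and GPU times (results slide)
HC1_PRICE = 50_000             # USD (conclusion slide)

boards = {
    # name: (slowdown relative to HC-1, board price in USD)
    "DE4": (2.6, 8_000),
    "ML605": (4.8, 2_000),
}


def implied(name):
    """Derive compute time and ratios implied by the quoted slowdown."""
    slowdown, price = boards[name]
    time_ms = HC1_MS * slowdown
    return {
        "time_ms": round(time_ms, 2),
        "vs_cpu": round(CPU_MS / time_ms, 1),
        "vs_gpu": round(GPU_MS / time_ms, 1),
        "price_ratio": HC1_PRICE / price,
    }


for name in boards:
    print(name, implied(name))
```

Under that assumption the DE4 comes out at about 8.32 ms on a board 6.25x cheaper than the HC‐1, and the ML605 at about 15.36 ms on a board 25x cheaper.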