TRANSCRIPT
CyGraph: A Reconfigurable Architecture for Parallel Breadth-first Search
Osama G. Attia, Tyler Johnson, Kevin Townsend, Philip Jones, Joseph Zambreno
Reconfigurable Computing Laboratory Department of Electrical and Computer Engineering
Iowa State University
21st Reconfigurable Architectures Workshop, May 20, 2014, Phoenix, USA
RAW – May 20, 2014 – [2/20] Reconfigurable Computing Laboratory – Iowa State University
Graphs are Everywhere
• Social Networks
  – Facebook (1.3 billion users)
  – Twitter (6.5 million users)
• 2.1 Billion search queries/day
• Genomics – 1 gram of soil = 1 Gb data that is represented as graphs
• Human brain (100 Billion neurons)
• Road Networks
Breadth First Search (BFS)
• Definition:
  – Systematically traverse the connected nodes in a graph, starting from a given root node
  – Nodes are visited in order of hop distance from the root
  – Many applications add extra computation during each BFS iteration and/or post-process the result
• BFS serves as a fundamental building block for many graph processing algorithms
Graph Representation
• Most graph processing algorithms use the well-known compressed sparse row (CSR) format
• The format contains two vectors:
  – Column-indices array (C)
  – Row-offsets array (R)
[Figure: an example graph with its row-offsets array (R) and column-indices array (C)]
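As a minimal sketch of the CSR layout described on this slide (not the authors' code), the following Python builds the two arrays from a directed edge list. The function names `build_csr` and `neighbors` and the four-node example graph are assumptions made for illustration only.

```python
from typing import List, Tuple

def build_csr(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build the row-offsets array R and column-indices array C of a directed graph."""
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    # R[v] is the position in C where v's neighbor list starts;
    # R[num_nodes] equals the total number of edges.
    R = [0] * (num_nodes + 1)
    for v in range(num_nodes):
        R[v + 1] = R[v] + degree[v]
    C = [0] * len(edges)
    cursor = R[:-1].copy()  # next free slot in C for each source node
    for src, dst in edges:
        C[cursor[src]] = dst
        cursor[src] += 1
    return R, C

def neighbors(R: List[int], C: List[int], v: int) -> List[int]:
    """Return the neighbors of node v: the slice of C delimited by R[v] and R[v + 1]."""
    return C[R[v]:R[v + 1]]

# Hypothetical example graph (not the one drawn on the slide).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 0)]
R, C = build_csr(4, edges)
print(R)  # [0, 2, 3, 4, 5]
print(C)  # [1, 2, 3, 3, 0]
print(neighbors(R, C, 0))  # [1, 2]
```

Looking up a node's neighbors costs one read of R followed by a contiguous read of C, which is why the format suits streaming hardware.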
Level-Synchronous BFS
- Execution time dominated by memory latency
- Large memory footprint
- Poor locality
- Random access
- Figure legend: memory read requests in blue, memory write requests in red
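For reference, a plain software version of level-synchronous BFS over CSR arrays might look like the sketch below. It is a generic textbook formulation, not the CyGraph hardware algorithm; the name `bfs_levels` and the example arrays are assumptions. The comments mark where the reads and writes occur that make the traversal memory-bound.

```python
from typing import List

def bfs_levels(R: List[int], C: List[int], root: int) -> List[int]:
    """Return the hop distance of every node from root (-1 if unreachable)."""
    num_nodes = len(R) - 1
    level = [-1] * num_nodes
    level[root] = 0
    current = [root]
    depth = 0
    while current:
        depth += 1
        next_queue = []
        for v in current:                 # read: current queue
            for u in C[R[v]:R[v + 1]]:    # read: R entry, then neighbor IDs from C
                if level[u] == -1:        # read: visited state (random access)
                    level[u] = depth      # write: mark visited
                    next_queue.append(u)  # write: next queue
        current = next_queue              # level barrier: swap the queues
    return level

# Hypothetical four-node graph in CSR form.
R = [0, 2, 3, 4, 5]
C = [1, 2, 3, 3, 0]
levels = bfs_levels(R, C, 0)
print(levels)  # [0, 1, 1, 2]
```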
Methodology
• Custom CSR
• Optimized BFS algorithm
• Parallel hardware implementation
• Multiplexed memory requests
• Kernel-to-kernel communication
Convey HC-2 Computer
• Four programmable FPGAs, called Application Engines (AEs)
• Eight FPGAs that are used as memory controllers
• AEs have access to memory at a peak bandwidth of 80 GB/s
• Each AE is connected to all memory controllers
• Two FPGAs bridge the host motherboard and the coprocessor board
• Operates at a 150 MHz clock
• AEs are interconnected with a full-duplex 660 MBps AE-to-AE interface
[Figure: page excerpt from a separate paper on floating-point accumulators, shown for its diagram of the Convey coprocessor board. Caption: "Figure 1. The HC-1 coprocessor board. Four application engines connect to eight memory controllers through a full crossbar."]
Convey HC-2 Coprocessor
CyGraph Top Level
• Each AE has access to 16 memory ports through a full crossbar
• 4 AEs × 16 kernels per AE = 64 kernels total
• CyGraph uses the AE-to-AE interface to scale the implementation across different AEs (through kernel-to-kernel communication)
[Figure: Convey HC-2 coprocessor. Each of AE 0 to AE 3 contains a master and kernels K1 to K16, attached to memory controller ports MC1 to MC16. All AEs access the shared memory (64 GB) and are linked by the AE-to-AE interface (668 MB/s).]
Kernel-to-kernel Communication
• Behaves as a token ring network
• Allows multiple kernels to write to the next queue
• Makes CyGraph scalable across multiple FPGAs
  – The master kernel initiates the interface by sending an empty token to the first kernel in the ring
  – The token starts circulating; each kernel holds the token for one cycle
  – The kernel holding the token reserves as many items as it wants to write to memory and passes the token to the next kernel
  – Token = received token + space to be reserved
  – When a level is finished, the token equals the number of nodes written to memory
Time Ti Ti+1 Ti+2 Ti+3 Ti+4 Ti+5 Ti+6
K0 Send 0 * * Send 5
K1 * Send 1 Send 5
K2 * * Send 3 * Send 6
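The token rule above (token = received token + space to be reserved) can be illustrated with a small software simulation. This is a sketch under assumptions, not the hardware design: the function name `circulate_token` and the per-lap request counts are hypothetical, chosen so that the forwarded token values follow the 0, 1, 3, 5, 5, 6 sequence that the timing table appears to show for three kernels.

```python
from typing import List, Tuple

def circulate_token(laps: List[List[int]]) -> Tuple[int, List[int], List[Tuple[int, int, int]]]:
    """Simulate the token ring.

    laps[i][k] is the number of next-queue slots kernel k reserves on lap i.
    Returns the final token, the token value each kernel forwarded, and the
    granted write ranges as (kernel, start_offset, count).
    """
    token = 0  # the master kernel starts the ring with an empty token
    sent: List[int] = []
    grants: List[Tuple[int, int, int]] = []
    for lap in laps:
        for kernel, count in enumerate(lap):
            if count > 0:
                # The received token is this kernel's private write offset.
                grants.append((kernel, token, count))
            token += count      # token = received token + space reserved
            sent.append(token)  # value passed to the next kernel
    return token, sent, grants

# Assumed request counts for kernels K0, K1, K2 over two laps.
total, sent, grants = circulate_token([[0, 1, 2], [2, 0, 1]])
print(sent)    # [0, 1, 3, 5, 5, 6]
print(grants)  # [(1, 0, 1), (2, 1, 2), (0, 3, 2), (2, 5, 1)]
print(total)   # 6
```

Because every kernel receives a disjoint offset range, the kernels can write to the same next queue without locking, and the final token gives the queue length for the next level.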
Custom CSR
• With massive, large-scale graphs, it is hard to implement visited bitmaps in the FPGA's BlockRAM
• We use off-chip memories instead
• To optimize performance, we modified the R array in the CSR format so that each entry holds:
  – Start address of the node's neighbors (as in the original implementation)
  – Count of neighbors if the visited flag is '0', or the node's level if the visited flag is '1'
  – Visited flag
V = 0: Neighbors' start address (32-bit) | Neighbors' count (31-bit) | V (1-bit)
V = 1: Node level (63-bit) | V (1-bit)
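The 64-bit entry layout above can be sketched with bit operations. The field widths come from the slide; the bit positions are an assumption (fields taken left to right from the most significant bit, with the visited flag in bit 0), and the helper names `pack_unvisited`, `pack_visited`, and `unpack` are hypothetical.

```python
from typing import Dict, Union

COUNT_MASK = (1 << 31) - 1  # 31-bit neighbor count

def pack_unvisited(start: int, count: int) -> int:
    """V = 0: 32-bit start address | 31-bit neighbor count | flag 0 (assumed bit order)."""
    if not (0 <= start < (1 << 32) and 0 <= count <= COUNT_MASK):
        raise ValueError("field out of range")
    return (start << 32) | (count << 1)

def pack_visited(level: int) -> int:
    """V = 1: 63-bit node level | flag 1 (assumed bit order)."""
    if not 0 <= level < (1 << 63):
        raise ValueError("level out of range")
    return (level << 1) | 1

def unpack(entry: int) -> Dict[str, Union[bool, int]]:
    """Decode one R entry according to its visited flag."""
    if entry & 1:
        return {"visited": True, "level": entry >> 1}
    return {"visited": False, "start": entry >> 32, "count": (entry >> 1) & COUNT_MASK}

e = pack_unvisited(10, 3)
print(unpack(e))                # {'visited': False, 'start': 10, 'count': 3}
print(unpack(pack_visited(4)))  # {'visited': True, 'level': 4}
```

With this packing a single 64-bit read returns both the visited status and the neighbor list location, so no separate visited bitmap is needed.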
CyGraph Kernel Algorithm
Pipelined Design
• The kernel can be implemented as a pipeline of four stages (processes)
• The CyGraph kernel serves as the processing element (PE) in the parallel CyGraph implementation
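[Figure: four-stage kernel pipeline. Process 1: request from current queue. Process 2: request neighbors. Process 3: request neighbors' information. Process 4: if not visited, visit and push to next queue. Processes 1 to 3 issue reads (rd) to a memory controller (MC) and pass on the responses (rsp): current-queue entries, neighbors, and neighbor CSR entries. Process 4 issues writes (wr) to the MC.]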
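A sequential software reading of the four processes, combined with the custom CSR entries, might look like the sketch below. Several points are assumptions rather than facts from the slides: the queues are assumed to hold (start, count) pairs, since a visited R entry no longer stores its start address; the visited flag is assumed to sit in bit 0; and the name `bfs_custom_csr` is hypothetical. In hardware the four processes run concurrently as a pipeline, whereas here they are nested loops.

```python
from typing import List, Tuple

COUNT_MASK = (1 << 31) - 1

def bfs_custom_csr(R: List[int], C: List[int], root: int) -> List[int]:
    """BFS over packed 64-bit R entries; R is overwritten with node levels."""
    root_entry = R[root]
    current: List[Tuple[int, int]] = [(root_entry >> 32, (root_entry >> 1) & COUNT_MASK)]
    R[root] = (0 << 1) | 1  # mark the root visited at level 0
    level = 0
    while current:
        level += 1
        next_queue: List[Tuple[int, int]] = []
        for start, count in current:            # Process 1: read the current queue
            for u in C[start:start + count]:    # Process 2: request neighbor IDs
                entry = R[u]                    # Process 3: request the neighbor's R entry
                if entry & 1 == 0:              # Process 4: not visited yet
                    R[u] = (level << 1) | 1     #   visit: overwrite the entry with the level
                    next_queue.append((entry >> 32, (entry >> 1) & COUNT_MASK))
        current = next_queue
    # Decode levels; entries still unvisited are reported as -1.
    return [e >> 1 if e & 1 else -1 for e in R]

# Hypothetical four-node graph: offsets [0, 2, 3, 4, 5], C = [1, 2, 3, 3, 0].
offsets = [0, 2, 3, 4, 5]
C = [1, 2, 3, 3, 0]
R = [(offsets[v] << 32) | ((offsets[v + 1] - offsets[v]) << 1) for v in range(4)]
result = bfs_custom_csr(R, C, 0)
print(result)  # [0, 1, 1, 2]
```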
CyGraph Kernel
[Figure: CyGraph kernel. Process 1: request from current queue. Process 2: request neighbors. Process 3: request neighbors' information. Process 4: if not visited, push to next queue. Process 5 and queues q1, q2, q3 sit between a requests multiplexer (req) and a responses decoder (rsp), both connected to the memory controller. Links run from the previous kernel and to the next kernel.]
Experimental Results (R-MAT Graphs)
[Figure: billion edges per second (GTEPS) versus number of vertices per graph (2^20 to 2^23) for R-MAT graphs. Panels: (a) average degree = 8, (b) average degree = 16, (c) average degree = 32, (d) average degree = 64. Series: Betkaoui et al. [20], CyGraph (1 AE), CyGraph (4 AEs).]
Experimental Results (Random Graphs)
[Figure: billion edges per second (GTEPS) versus number of vertices per graph (2^20 to 2^23) for random graphs. Panels: (a) average degree = 8, (b) average degree = 16, (c) average degree = 32, (d) average degree = 64. Series: Betkaoui et al. [20], CyGraph (1 AE), CyGraph (4 AEs).]
Analysis
[Figure: billion edges per second (GTEPS) versus average vertex degree (8, 16, 32, 64, and the average) for Betkaoui et al., CyGraph 1 AE, and CyGraph 4 AEs. Two panels: R-MAT graphs and random graphs.]
Resource Utilization
• CyGraph uses fewer resources
• Remaining resources could be used for adding extra computations during traversal
[Fig. 7: BFS execution time for CyGraph against [20] using random graphs. Axes: billion edges per second (GTEPS) versus number of vertices per graph; panels (a) to (d) for average degree = 8, 16, 32, 64.]
[Fig. 8: BFS execution time for CyGraph against [20] using RMAT graphs. Axes: billion edges per second (GTEPS) versus number of vertices per graph; panels (a) to (d) for average degree = 8, 16, 32, 64.]
interfaces (e.g. dispatch interface, and MC interface). For our CyGraph kernel, the depth of the FIFOs used is set to 512.
                      Slice LUTs   BRAM   Slice LUT-FF
CyGraph 1 AE          53%          55%    74%
CyGraph 4 AEs         55%          55%    74%
Betkaoui et al. [20]  80%          64%    n/a
TABLE I: Resource Utilization
VII. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a novel FPGA-based graph traversal implementation, namely CyGraph. Our implementation outperformed a state-of-the-art FPGA implementation by up to a factor of 5. CyGraph performed better for graphs of medium average vertex degrees. We have shown that application-specific data structures significantly improve performance for the target platform. We have introduced a kernel-to-kernel interface that enables multiple processing elements to collaborate in writing to the same memory data structure. Finally, we have followed a design approach that could be carried over to design other reconfigurable hardware architectures for graph processing algorithms.
The suggested future work is twofold. The first direction is to enhance our implementation by load balancing among kernels. This will keep all CyGraph kernels busy for equal processing times, resulting in better utilization of memory bandwidth. One idea is to split the large CSRs into multiple CSRs before pushing them to the next queue. In the next level, because kernels interleave reading, kernels will have to process a relatively equal number of neighbors. The second direction is to develop a full framework for graph algorithms using the Convey HC-2. Other graph problems (e.g. shortest path, Eulerian path, maximum flow, maximum clique) are known to be difficult problems. Such a framework can serve as a base and a useful tool for researchers to build upon.
ACKNOWLEDGEMENTS
This work was supported in part by the National Science Foundation (NSF) under awards CNS-1116810 and CCF-1149539.
• Conclusion
  – Reconfigurable hardware for accelerating the BFS algorithm
    • Scalable solution for multiple FPGAs
    • Outperformed the previous state-of-the-art implementation
• Future Work
  – Design approach could be leveraged for other graph processing algorithms
  – Customizing and building upon the CyGraph kernel for real-world applications
Questions
www.rcl.ece.iastate.edu Reconfigurable Computing Laboratory