![Page 1: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/1.jpg)
An Effective GPU Implementation of Breadth-First Search
Lijuan Luo, Martin Wong and Wen-mei Hwu
Department of Electrical and Computer Engineering, UIUC
From DAC 2010
![Page 2: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/2.jpg)
Outline
Introduction: why BFS and why GPU?
Previous works: IIIT-BFS and a matrix-vector based BFS
The proposed GPU solution: architecture of GPU and CUDA, hierarchical queue and kernel, synchronization, examples
Experimental results and conclusions
![Page 3: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/3.jpg)
Introduction
The graphics processing unit (GPU) has become popular for parallel processing because of its cost-effectiveness.
Because of the GPU's architecture, directly porting the fastest CPU algorithm to it may incur a huge overhead.
Hence, the speed-up is not always meaningful.
![Page 4: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/4.jpg)
Introduction (cont.)
![Page 5: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/5.jpg)
Introduction (cont.)
Breadth-first search (BFS) is widely used in EDA: maze routing, circuit simulation, static timing analysis (STA), etc.
Previous GPU implementations are slower than the fastest CPU program on certain types of graphs.
![Page 6: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/6.jpg)
Breadth-First Search (BFS)
Given a graph G=(V,E) and a distinguished source vertex s:
BFS explores the edges of G to discover every vertex that is reachable from s.
It produces a breadth-first tree with root s that contains all reachable vertices.
![Page 7: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/7.jpg)
Example
![Page 8: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/8.jpg)
BFS (cont.)
Traditional BFS algorithms use a queue to store the frontier.
The complexity is O(V+E).
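The queue-based algorithm can be sketched as follows. This is a minimal CPU-side illustration in Python, not code from the paper; the function name `bfs_levels` and the adjacency-list input format are assumptions made for the example.

```python
from collections import deque


def bfs_levels(adj, source):
    """Return {vertex: BFS level} for every vertex reachable from source.

    adj maps each vertex to the list of its neighbors (adjacency list).
    """
    level = {source: 0}
    frontier = deque([source])          # FIFO queue holding the frontier
    while frontier:
        u = frontier.popleft()          # every vertex is dequeued at most once: O(V)
        for v in adj[u]:                # every adjacency entry is scanned once: O(E)
            if v not in level:
                level[v] = level[u] + 1
                frontier.append(v)
    return level
```

Each vertex enters the queue at most once and each adjacency entry is scanned once, which is where the O(V+E) bound comes from.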
![Page 9: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/9.jpg)
Previous works
IIIT-BFS [2] is the first work to implement the BFS algorithm on a GPU.
[3] represents and performs BFS using matrix-vector multiplication.
![Page 10: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/10.jpg)
IIIT-BFS
The authors point out that maintaining the frontier queue can cause a huge overhead on the GPU.
For each level, IIIT-BFS exhaustively checks every vertex to see whether it belongs to the current frontier.
The complexity is O(VL+E), where L is the total number of levels.
In a sparse graph, E = O(V), and in the worst case L = O(V), hence O(VL+E) = O(V^2).
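To see where the VL term comes from, here is a sequential Python model of a queue-free, level-by-level scan. It is only a sketch in the spirit of the description above, not the IIIT-BFS code; the name `bfs_vertex_scan` and the `vertex_checks` counter are introduced for illustration.

```python
def bfs_vertex_scan(adj, source):
    """Level-synchronous BFS without a frontier queue.

    adj is a list of neighbor lists indexed by vertex id 0..V-1.
    Returns (level, vertex_checks): level[v] is -1 for unreachable vertices,
    and vertex_checks counts the per-level membership tests (the V*L term).
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    current = 0
    vertex_checks = 0
    changed = True
    while changed:
        changed = False
        for u in range(n):                  # on a GPU: one thread per vertex
            vertex_checks += 1
            if level[u] == current:         # does u belong to the current frontier?
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = current + 1
                        changed = True
        current += 1
    return level, vertex_checks
```

On a 5-vertex path graph searched from one end, there are 5 levels, so the scan performs 5 x 5 = 25 membership tests even though only one vertex is in the frontier at each level.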
![Page 11: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/11.jpg)
BFS as matrix-vector multiplication
This work accelerates a matrix-based BFS algorithm for sparse graphs.
Each frontier propagation can be transformed into a matrix-vector multiplication.
The complexity is O(V+EL), where L is the number of levels.
In a sparse graph, E = O(V), and in the worst case L = O(V), hence O(V+EL) = O(V^2).
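The idea of one frontier propagation as one matrix-vector product can be sketched in plain Python. This is a simplified model over a Boolean semiring, not the formulation used in [3]; the name `bfs_matvec` and the `nnz_touched` counter are assumptions made for the example.

```python
def bfs_matvec(adj, source):
    """BFS where each level is one sparse matrix-vector product.

    adj[u] lists the columns v with a nonzero entry A[u][v].
    Returns (level, nnz_touched), where nnz_touched counts the matrix
    entries visited over all levels (the E*L term).
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    x = [0] * n                     # frontier indicator vector
    x[source] = 1
    depth = 0
    nnz_touched = 0
    while any(x):
        # y = A^T x over the Boolean semiring (OR as +, AND as *)
        y = [0] * n
        for u in range(n):
            for v in adj[u]:
                nnz_touched += 1    # every nonzero is visited at every level
                if x[u]:
                    y[v] = 1
        depth += 1
        # Keep only vertices that have not been visited yet.
        x = [0] * n
        for v in range(n):
            if y[v] and level[v] == -1:
                level[v] = depth
                x[v] = 1
    return level, nnz_touched
```

Because the full product is computed at every level, all nonzeros are touched L times. On the 5-vertex path graph (8 directed entries, 5 levels) that is 40 entry visits.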
![Page 12: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/12.jpg)
The proposed GPU solution
Propagate from all the frontier vertices in parallel.
Since many EDA problems are formulated as sparse graphs, each frontier vertex has only a few neighbors.
Hierarchical queue and hierarchical kernel.
Same complexity as the traditional CPU implementation.
![Page 13: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/13.jpg)
Architecture of NVIDIA GTX280
A collection of 30 multiprocessors, with 8 streaming processors each.
The 30 multiprocessors share one off-chip global memory. Access time: about 300 clock cycles.
Each multiprocessor has an on-chip memory shared by its 8 streaming processors. Access time: 2 clock cycles.
![Page 14: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/14.jpg)
Architecture diagram
![Page 15: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/15.jpg)
Memory coalescing
Several memory transactions can be coalesced into one transaction when consecutive threads access consecutive memory locations.
Because the access time of global memory is relatively long, it is important to achieve coalescing.
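A rough way to picture coalescing is to count how many aligned memory segments a group of threads touches. The Python sketch below does only that. The 64-byte segment size and the 16-thread group are illustrative assumptions, not the exact rules of any particular GPU, and `count_transactions` is a hypothetical helper.

```python
def count_transactions(addresses, segment_bytes=64):
    """Count the distinct aligned segments touched by one group of threads.

    addresses[i] is the byte address accessed by thread i. In this simplified
    model, all accesses that fall in the same segment share one transaction.
    """
    return len({addr // segment_bytes for addr in addresses})


# 16 threads reading consecutive 4-byte words: one segment, one transaction.
consecutive = [4 * i for i in range(16)]
# 16 threads reading words 64 bytes apart: every access needs its own segment.
strided = [64 * i for i in range(16)]
```

In this model `count_transactions(consecutive)` is 1 while `count_transactions(strided)` is 16, which is why the layout of the frontier and graph arrays matters when each global memory access costs hundreds of cycles.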
![Page 16: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/16.jpg)
CUDA programming
CUDA: Compute Unified Device Architecture.
The CPU code does the sequential part.
The highly parallelized part is usually implemented in the GPU code, called a kernel.
Calling a GPU function from CPU code is called a kernel launch.
![Page 17: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/17.jpg)
Hierarchical Queue Management
Hierarchical frontier structure
![Page 18: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/18.jpg)
Hierarchical Queue Management (cont.)
G-Frontier: the frontier vertices shared by all the threads of a grid.
B-Frontier: the frontier vertices common to a whole block.
W-Frontier: the frontier vertices only accessed by certain threads from a warp.
![Page 19: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/19.jpg)
Collision
A collision means that more than one thread is accessing the same queue at the same time.
Suppose there is only one queue and each SP has a thread that is returning new frontier vertices.
Then 8 threads are accessing the same queue, and a collision happens.
![Page 20: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/20.jpg)
Hierarchical Queue Management (cont.)
Each W-Frontier maintains 8 queues so that no collision will happen in a W-Frontier.
The scheduling unit is the warp. Each warp contains 32 threads, organized as four 8-thread groups.
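The effect of keeping 8 queues per W-Frontier can be modeled in a few lines of Python. The sketch assumes that the four 8-thread groups of a warp run one after another on the 8 streaming processors, and that thread t writes to queue t mod 8. That mapping, and the names `queue_of`, `simulate_warp` and `flatten_to_b_frontier`, are assumptions for illustration rather than details taken from the slides.

```python
NUM_QUEUES = 8      # queues per W-Frontier (from the slides)
WARP_SIZE = 32      # threads per warp (from the slides)


def queue_of(thread_id):
    """Hypothetical mapping: thread t appends to queue t mod 8."""
    return thread_id % NUM_QUEUES


def simulate_warp(new_vertices_per_thread):
    """new_vertices_per_thread[t] lists the vertices discovered by thread t.

    The 8-thread groups are processed one after another. A collision is
    counted whenever two threads of the same group target the same queue.
    Returns (queues, collisions).
    """
    queues = [[] for _ in range(NUM_QUEUES)]
    collisions = 0
    for start in range(0, WARP_SIZE, NUM_QUEUES):
        group = range(start, start + NUM_QUEUES)
        targets = [queue_of(t) for t in group]
        collisions += len(targets) - len(set(targets))
        for t in group:
            queues[queue_of(t)].extend(new_vertices_per_thread[t])
    return queues, collisions


def flatten_to_b_frontier(queues):
    """Concatenate the W-Frontier queues, as when copying into a B-Frontier."""
    return [v for q in queues for v in q]
```

With one new vertex per thread, `simulate_warp([[t] for t in range(32)])` reports zero collisions, because the 8 threads that run together always map to 8 different queues.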
![Page 21: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/21.jpg)
Synchronization issues
Correct results require thread synchronization at the end of each level of queue.
General solution: launch one kernel for each level of queue and implement a global barrier between two launched kernels.
If we do that, the kernel-launch overhead will be huge.
CUDA only provides a barrier function to synchronize the threads within a block.
![Page 22: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/22.jpg)
Hierarchical Kernel Arrangement
Only the highest layer uses this expensive synchronization method (global barrier) and the others use more efficient GPU synchronization.
Intra-block synchronization (provided by CUDA) is used to synchronize the threads in a block.
Inter-block synchronization [10] is used to synchronize threads in different blocks.
These two are GPU synchronization techniques that do not require the kernel to terminate.
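Synchronizing between BFS levels without tearing the workers down can be imitated on the CPU with a reusable barrier. The Python sketch below is only an analogy: the worker threads stand in for the GPU threads of a block, `threading.Barrier` stands in for the block-level barrier (`__syncthreads()` in CUDA), and a lock replaces atomic updates. The function name and the choice of 4 workers are assumptions.

```python
import threading


def bfs_persistent_workers(adj, source, num_workers=4):
    """Level-synchronous BFS whose workers stay alive across all levels."""
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    state = {"frontier": [source], "next": [], "depth": 0}
    lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker(wid):
        while True:
            frontier = state["frontier"]
            depth = state["depth"]
            if not frontier:                # every worker sees the same empty frontier
                return
            local = []
            for u in frontier[wid::num_workers]:    # this worker's share of the frontier
                for v in adj[u]:
                    with lock:              # stand-in for an atomic test-and-set
                        if level[v] == -1:
                            level[v] = depth + 1
                            local.append(v)
            with lock:
                state["next"].extend(local)
            barrier.wait()                  # all workers have finished this level
            if wid == 0:                    # one worker publishes the new frontier
                state["frontier"], state["next"] = state["next"], []
                state["depth"] = depth + 1
            barrier.wait()                  # the new frontier is visible to everyone

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return level
```

The workers are created once and meet at the barrier twice per level, which mirrors the point above: the level boundary is enforced by a barrier inside the running kernel rather than by ending it and launching a new one.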
![Page 23: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/23.jpg)
Example
Intra-block sync.
Inter-block sync.
Global barrier (kernel sync.)
![Page 24: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/24.jpg)
Another example
Assume that there are 100 vertices in the queue.
First, launch a kernel. It creates a grid with one block of 512 threads (only 100 of them are non-empty).
Figure labels: vertices 1 to 100 (threads 1 to 100), empty threads, block, grid.
![Page 25: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/25.jpg)
Another example (cont.)
The threads in that block synchronize using intra-block synchronization.
Figure labels: threads, 8 W-Frontiers and one B-Frontier, intra-block sync.
![Page 26: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/26.jpg)
Another example (cont.)
Assume that after the computation we get 1000 new frontier vertices.
Since 1000 > 512, the G-Frontier queue is used to hold all the vertices, and the search continues.
Threads in different blocks synchronize using inter-block synchronization.
Figure labels: global memory, G-Frontier, threads, inter-block sync.
![Page 27: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/27.jpg)
Another example (cont.)
Once the number of new frontier vertices exceeds 15360, the kernel terminates and a kernel with 15360 threads is re-launched until this BFS level is finished.
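The three cases of this example can be summarized as a small decision function. The thresholds 512 and 15360 come from the slides. The function name, the returned labels, and the handling of the exact boundary values are assumptions made for this Python sketch.

```python
BLOCK_THREADS = 512        # threads in one block (from the slides)
GRID_THREADS = 15360       # threads in the largest grid kept alive (from the slides)


def sync_strategy(frontier_size):
    """Pick the synchronization level for a frontier of the given size."""
    if frontier_size <= BLOCK_THREADS:
        return "intra-block"        # one block, W-Frontiers and B-Frontier
    if frontier_size <= GRID_THREADS:
        return "inter-block"        # several blocks, G-Frontier in global memory
    return "kernel relaunch"        # global barrier between kernel launches


# 15360 / 512 = 30, the number of multiprocessors on the GTX280.
BLOCKS_IN_GRID = GRID_THREADS // BLOCK_THREADS
```

Note that 15360 = 30 x 512, which matches one 512-thread block per multiprocessor on the GTX280. Reading the limit this way is an inference from the numbers, not something the slides state.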
![Page 28: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/28.jpg)
Experimental results
Environment
A dual-socket, dual-core 2.4 GHz Opteron processor
8 GB of memory
A single NVIDIA GeForce GTX280 GPU
![Page 29: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/29.jpg)
Experimental results (cont.)
The results on degree-6 regular graphs (similar to grid-based graphs).
![Page 30: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/30.jpg)
Experimental results (cont.)
The results on real-world graphs.
Average deg(V)=2, maximum deg(V)=8 or 9
![Page 31: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/31.jpg)
Experimental results (cont.)
The results on scale-free graphs.
0.1% of the vertices have degree 1000.
The other vertices have an average degree of 6 and a maximum degree of 7.
![Page 32: An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC](https://reader035.vdocument.in/reader035/viewer/2022062518/56649ccf5503460f9499b50f/html5/thumbnails/32.jpg)
Conclusions
The proposed ideas have never been used on other architectures.
Most suitable for sparse and near-regular graphs, which are often used in EDA.
Hierarchical queue management and hierarchical kernel arrangement.