High-Performance Linear Algebra-based Graph Framework on GPU

PhD Exit Seminar

Carl Yang

University of California, Davis

May 1, 2019

CY (UCD) High-Performance Graph Framework on GPU May 1, 2019 1 / 43

Overview

1 Introduction

2 Goals

3 Challenges

4 Conclusion


Introduction

GraphBLAS Open Standard


Introduction

Graph traversal is sparse matrix multiplication
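
To make the slide's claim concrete, here is a minimal sketch of breadth-first search in which every traversal step is a matrix-vector product over the Boolean (OR, AND) semiring. It uses dense Python lists for readability; the function name `bfs_levels` and the example graph are illustrative assumptions and are not taken from GraphBLAST.

```python
def bfs_levels(adj, source):
    """BFS where each step is a matrix-vector product over (OR, AND).

    adj[i][j] == 1 means there is an edge i -> j.
    Returns the depth of every vertex from `source`, or -1 if unreachable.
    """
    n = len(adj)
    depth = [-1] * n
    frontier = [False] * n
    frontier[source] = True
    level = 0
    while any(frontier):
        # Record the depth of every vertex in the current frontier.
        for v in range(n):
            if frontier[v]:
                depth[v] = level
        # Matrix-vector product: nxt[j] = OR over i of (adj[i][j] AND frontier[i]).
        nxt = [any(adj[i][j] and frontier[i] for i in range(n)) for j in range(n)]
        # Mask out vertices that were already visited.
        frontier = [bool(nxt[j]) and depth[j] == -1 for j in range(n)]
        level += 1
    return depth


# Illustrative graph: 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
adj = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
depths = bfs_levels(adj, 0)
```

A real implementation would store the adjacency matrix and the frontier in sparse formats so that each step touches only the edges leaving the frontier.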

1 Dénes Kőnig, 1931

Introduction

Ingredient 1: Operations

“Our mission is to build up a linear algebra sense to the extent that vector-level thinking becomes as natural as scalar-level thinking.”
- Charles Van Loan

Figure: (a) eWiseAdd, (b) eWiseMult, (c) mxv, (d) mxm


Introduction

Ingredient 2: Operators

Semiring notation: (Add, Multiply, Domain)

Name        Semiring             Application

Real field  {+, ×, R}            Classical numerical linear algebra
Boolean     {|, &, {0, 1}}       Graph connectivity
Tropical    {min, +, R ∪ {∞}}    Shortest path
Max-plus    {max, +, R}          Graph matching
Min-times   {min, ×, R}          Maximal independent set
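
As a rough illustration of how one operation can serve several applications, the sketch below parameterizes a dense matrix-vector product by a semiring. The `Semiring` tuple and the `mxv` helper are hypothetical names for this example only; with the tropical semiring, repeated products relax shortest-path distances.

```python
import math
from collections import namedtuple

# Hypothetical container: (add operator, multiply operator, additive identity).
Semiring = namedtuple("Semiring", ["add", "mul", "zero"])

REAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf)


def mxv(matrix, vector, semiring):
    """Dense matrix-vector product with a user-supplied semiring."""
    result = []
    for row in matrix:
        acc = semiring.zero
        for a, x in zip(row, vector):
            acc = semiring.add(acc, semiring.mul(a, x))
        result.append(acc)
    return result


# Classical linear algebra with the real field.
real_result = mxv([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], REAL)

# Shortest paths with the tropical semiring.
# weights_t[j][i] is the weight of edge i -> j; inf means "no edge" and the
# zero diagonal lets a vertex keep its current distance.
inf = math.inf
weights_t = [
    [0.0, inf, inf],
    [4.0, 0.0, 2.0],   # edges 0 -> 1 (weight 4) and 2 -> 1 (weight 2)
    [1.0, inf, 0.0],   # edge 0 -> 2 (weight 1)
]
dist = [0.0, inf, inf]  # distances from source vertex 0
for _ in range(len(weights_t) - 1):
    dist = mxv(weights_t, dist, TROPICAL)
```

Swapping the semiring changes the algorithm while `mxv` itself stays untouched, which is the point of separating operations from operators.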


Introduction

Operations + Operators = GraphBLAS

1 http://graphblas.org

Goals

Goals of GraphBLAS

1 Provide portable performance

2 Allow concise expression

3 Match state-of-the-art in performance

4 Be effective at small scale and exascale


Goals

Goal 1: Provide portable performance

Open standard with well-defined API specification ensures portability.

Q: But how to future-proof the spec against future innovations in data structures and algorithms?


Goals

Opaque Vector and Matrix

Open standard with well-defined API specification ensures portability.

Q: But how to future-proof the spec against future innovations in data structures and algorithms?
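
One way to read "opaque" is that callers only ever go through accessor methods, so the backend stays free to change the storage format underneath them. The sketch below is an assumption-laden illustration: the `Vector` class, its method names (loosely modeled on GraphBLAS-style accessors) and the 50% fill heuristic are all hypothetical.

```python
class Vector:
    """Opaque vector: callers cannot tell whether storage is sparse or dense."""

    def __init__(self, size):
        self._size = size
        self._sparse = {}    # index -> value while in sparse form
        self._dense = None   # list with None for missing entries once dense

    def set_element(self, index, value):
        if not 0 <= index < self._size:
            raise IndexError(index)
        if self._dense is not None:
            self._dense[index] = value
        else:
            self._sparse[index] = value
            self._maybe_convert()

    def extract_tuples(self):
        """Return (index, value) pairs in index order, whatever the storage."""
        if self._dense is not None:
            return [(i, v) for i, v in enumerate(self._dense) if v is not None]
        return sorted(self._sparse.items())

    def nvals(self):
        return len(self.extract_tuples())

    def _maybe_convert(self):
        # Hypothetical heuristic: switch to dense storage above 50% fill.
        if len(self._sparse) * 2 > self._size:
            self._dense = [self._sparse.get(i) for i in range(self._size)]
            self._sparse = {}


v = Vector(4)
v.set_element(2, 7)
v.set_element(0, 1)
before = v.extract_tuples()   # still stored sparsely
v.set_element(3, 5)           # crosses the threshold, storage becomes dense
after = v.extract_tuples()
```

Because user code never touches `_sparse` or `_dense`, a new format or conversion heuristic can be introduced later without changing the interface.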


Goals

Goal 2: Allow concise expression

Algorithm                    Ligra  GraphIt  Gunrock  GraphBLAST

Breadth-first-search         30     22       2732     25
Graph coloring               N/A    N/A      893      22
Local graph clustering       84     N/A      875      45
Maximal independent set      51     N/A      N/A      26
Single-source shortest-path  55     25       760      25

Table: Lines of C++ application code counted by ‘cloc’.

1 Julian Shun and Guy E. Blelloch. “Ligra: A Lightweight Graph Processing Framework for Shared Memory.” PPoPP ’13

2 Yunming Zhang et al. “GraphIt: A High-performance Graph DSL.” OOPSLA ’18

Goals

Goal 3: Match state-of-the-art in performance

1 Zhang, Peter, et al. “GBTL-CUDA: Graph algorithms and primitives for GPUs.” IPDPSW ’16.

2 Wang, Yangzihao, et al. “Gunrock: GPU graph analytics.” ACM TOPC ’17.

Goals

Sparse-matrix sparse-vector multiplication necessary

1 Carl Yang, Yangzihao Wang and John D. Owens. “Fast sparse matrix and sparse vector multiplication algorithm on the GPU.” IPDPSW ’16.


Goals

Two load-balancing algorithms for SpMM

2 Carl Yang, Aydın Buluç and John D. Owens. “Design principles for sparse matrix multiplication on the GPU.” EuroPar ’18. Distinguished paper award.


Goals

Push-pull maps nicely to linear algebra

3 Carl Yang, Aydın Buluç and John D. Owens. “Implementing push-pull efficiently in GraphBLAS.” ICPP ’18.
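
A small sketch of the push-pull idea follows, under the assumption that "push" corresponds to a column-wise product driven by a sparse frontier and "pull" to a row-wise product masked by the unvisited vertices. The function names and the `pull_threshold` heuristic are hypothetical; both directions are expected to produce the same next frontier.

```python
def push_step(out_edges, frontier, visited):
    """Push: scatter from each frontier vertex along its out-edges."""
    nxt = set()
    for u in frontier:
        for v in out_edges[u]:
            if v not in visited:
                nxt.add(v)
    return nxt


def pull_step(in_edges, frontier, visited):
    """Pull: each unvisited vertex gathers from its in-edges."""
    nxt = set()
    for v in range(len(in_edges)):
        if v in visited:
            continue  # the mask skips visited vertices entirely
        for u in in_edges[v]:
            if u in frontier:
                nxt.add(v)
                break  # one parent in the frontier is enough
    return nxt


def bfs(out_edges, source, pull_threshold=0.5):
    """BFS that pulls when the frontier is large and pushes otherwise."""
    n = len(out_edges)
    in_edges = [[] for _ in range(n)]
    for u, neighbors in enumerate(out_edges):
        for v in neighbors:
            in_edges[v].append(u)
    depth = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        visited = depth.keys()
        if len(frontier) > pull_threshold * n:
            frontier = pull_step(in_edges, frontier, visited)
        else:
            frontier = push_step(out_edges, frontier, visited)
        for v in frontier:
            depth[v] = level
    return depth


# Illustrative graph: 0->1, 0->2, 1->3, 2->3, 3->4.
out_edges = [[1, 2], [3], [3], [4], []]
always_push = bfs(out_edges, 0, pull_threshold=1.0)
always_pull = bfs(out_edges, 0, pull_threshold=0.0)
```

Since the two steps compute the same product in different traversal orders, the choice between them affects only the amount of work, not the result.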


Goals

Comparable performance to state-of-the-art

Figure: Slowdown compared to Gunrock (log scale, 1 to 1000) by dataset (soc-ork, soc-lj, h09, i04, kron, rmat22, rmat23, rmat24, rgg, roadnet, road_usa) for SuiteSparse, CuSha, Baseline, Ligra, and This Work.


Goals

GraphBLAST has closed the 10× performance gap

Figure: Performance in MTEPS (log scale, 0.1 to 1000) by dataset (Journals, G43, ship_003, belgium_osm, roadNet-CA, delaunay_n24) for GBTL, Gunrock, and GraphBLAST.


Goals

Goal 4: Be effective at small scale and exascale


Challenges

Challenges of GraphBLAS

1 Hard for users to learn

2 Bulk-synchronous

3 GraphBLAS C API limited to C

4 Dynamic graphs

5 Kernel fusion


Challenges

Challenge 1: Hard for users to learn

1 Where is the graph?

2 Where are the edges?

3 Where are the vertices?


Challenges

Challenge 2: Bulk-synchronous

Some say the world is moving towards asynchronous backends: AM++, HavoqGT, Galois, STAPL, HPX, Charm++

Q: How to reconcile GraphBLAS’s inherently bulk-synchronous model with these asynchronous runtime systems?


Challenges

Example: HavoqGT’s delegates partitioning

1 Roger Pearce, Maya Gokhale and Nancy M. Amato. “Faster Parallel Traversal of Scale Free Graphs at Extreme Scale with Vertex Delegates.” SC ’14

2 Yuechao Pan, Roger Pearce and John D. Owens. “Scalable Breadth-First Search on a GPU Cluster.” IPDPS ’18


Challenges

Delegates partitioning in linear algebra


Challenges

Standard 1D partition communication for E_low


Challenges

Reduced communication for E_high


Challenges

Open questions

Q: How to build a distributed framework that is decoupled from GraphBLAST, so that we can get code reuse?

Q: Can we build a distributed framework that is decoupled from the implementation, so that every GraphBLAS backend (e.g. single-threaded CPU, multi-threaded CPU, GPU) gets distributed computing for free?

This is what Horovod does for deep learning, but it needs to make assumptions on data structures (i.e. dense tensor) and supports only 3 MPI communication ops.

Can we do the same by limiting to CSR and a minimal set of ops?


Challenges

Challenge 3: GraphBLAS C API limited to C


Challenges

Use modern practices such as hourglass model

1 Stefanus Du Toit. “Hourglass Interfaces for C++ APIs.” CppCon ’14

Challenges

Use modern practices such as hourglass model

Enables goodies like:

Operator overloading in Python and C++ frontends

Use + and × with previously used Semiring and Descriptor

Templates can be used in C++ backend for codegen

RAII for memory safety

Can build C++ spec atop existing C spec
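
The operator-overloading point can be sketched for a Python frontend as follows. Everything here is hypothetical (the `semiring` context manager, the `Vector` class and the module-level `_current` state); it only illustrates how `+` and `*` could pick up a previously set semiring, with the context manager playing the scoped-cleanup role that RAII plays in C++.

```python
from contextlib import contextmanager

# Hypothetical module-level state holding the semiring currently in effect.
_current = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


@contextmanager
def semiring(add, mul):
    """Temporarily replace the semiring used by + and *."""
    saved = dict(_current)
    _current.update(add=add, mul=mul)
    try:
        yield
    finally:
        _current.update(saved)  # restored even if the body raises


class Vector:
    def __init__(self, entries):
        self.entries = dict(entries)  # index -> value

    def __add__(self, other):
        # eWiseAdd: union of the two index sets.
        out = dict(self.entries)
        for i, v in other.entries.items():
            out[i] = _current["add"](out[i], v) if i in out else v
        return Vector(out)

    def __mul__(self, other):
        # eWiseMult: intersection of the two index sets.
        return Vector({
            i: _current["mul"](v, other.entries[i])
            for i, v in self.entries.items()
            if i in other.entries
        })


u = Vector({0: 1.0, 2: 5.0})
v = Vector({2: 3.0, 3: 4.0})

plain_sum = (u + v).entries
plain_product = (u * v).entries

with semiring(add=min, mul=lambda a, b: a + b):   # tropical (min, +)
    tropical_sum = (u + v).entries
    tropical_product = (u * v).entries

restored_sum = (u + v).entries
```

The expressions `u + v` and `u * v` read the same in both cases; only the surrounding scope decides which semiring they use.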


Challenges

Challenge 4: Dynamic graphs

Q: What are the motivating applications for dynamic graphs?

prototype → full implementation → intuit patterns → extract frameworks

Q: How should the system be designed so as to maximize decoupling between code and code specific to each graph representation?

CSR, SlabHash, cuSTINGER, etc.

1 Max Dama. “Max Dama on Automated Trading.” 2008

Challenges

Challenge 5: Kernel fusion


Challenges

Thoughts on implementing kernel fusion

1 Treat each operation as node in computation graph

2 Traverse computation graph looking for nodes that should be fused together

Elementwise ops and simple math ops such as apply

3 Generate CUDA code

4 Use runtime compilation to generate compiled CUDA kernel
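
The four steps above can be sketched in miniature. In this illustration, generated Python source and the built-in `compile`/`exec` stand in for CUDA code generation and runtime compilation; the `Node` class, the `{x}` expression templates and the helper names are assumptions made for the example.

```python
class Node:
    """One operation in the computation graph."""

    def __init__(self, kind, expr=None, source=None):
        self.kind = kind      # "input" or "apply" (elementwise op)
        self.expr = expr      # expression template using the placeholder {x}
        self.source = source  # upstream node


def fuse_apply_chain(node):
    """Steps 1-2: walk upstream, collecting consecutive elementwise nodes."""
    exprs = []
    while node.kind == "apply":
        exprs.append(node.expr)
        node = node.source
    return list(reversed(exprs)), node


def generate_kernel(exprs):
    """Steps 3-4: emit source for one fused kernel and compile it at run time."""
    body = "x"
    for template in exprs:
        body = template.format(x="(" + body + ")")
    src = "def fused(xs):\n    return [" + body + " for x in xs]\n"
    namespace = {}
    exec(compile(src, "<fused>", "exec"), namespace)
    return namespace["fused"], src


# Computation graph: input -> apply(x * 2) -> apply(x + 1)
inp = Node("input")
doubled = Node("apply", "{x} * 2", inp)
shifted = Node("apply", "{x} + 1", doubled)

exprs, root = fuse_apply_chain(shifted)
fused, src = generate_kernel(exprs)
result = fused([1, 2, 3])
```

The two elementwise operations end up in a single loop over the data, which is the memory-traffic saving that fusion is after; a GPU version would emit one CUDA kernel instead of one Python function.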


Conclusion

Conclusion

Our work is the first GraphBLAS implementation on the GPU that shows comparable performance to state-of-the-art

10× to 1000× speed-up compared to other GraphBLAS implementations

Linear algebra-based graph frameworks are the way to go:

Just as expressive and performant as their vertex-centric counterparts

More concise and portable

Parallel traversals from different source vertices require minimal code change (change Vector to Matrix)

Challenge for current students: Scale up to exascale problems
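
The "change Vector to Matrix" remark can be illustrated with a dense, Boolean-semiring sketch: the frontier becomes an n × s matrix with one column per source, and the per-step matrix-vector product becomes a matrix-matrix product. The helper names and the example graph are assumptions for illustration, not GraphBLAST code.

```python
def bool_mxm(a, b):
    """C = A x B over the Boolean (OR, AND) semiring, on lists of lists."""
    inner = len(b)
    return [
        [any(a[i][k] and b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def multi_source_bfs(adj_t, sources):
    """Run BFS from every source at once.

    adj_t[j][i] == 1 means there is an edge i -> j (transposed adjacency).
    Returns depth[v][c], the depth of vertex v from sources[c], or -1.
    """
    n, s = len(adj_t), len(sources)
    # Column c of the frontier matrix is the frontier of source c.
    frontier = [[v == src for src in sources] for v in range(n)]
    depth = [[-1] * s for _ in range(n)]
    level = 0
    while any(any(row) for row in frontier):
        for v in range(n):
            for c in range(s):
                if frontier[v][c]:
                    depth[v][c] = level
        nxt = bool_mxm(adj_t, frontier)
        # Mask out (vertex, source) pairs that were already visited.
        frontier = [
            [bool(nxt[v][c]) and depth[v][c] == -1 for c in range(s)]
            for v in range(n)
        ]
        level += 1
    return depth


# Illustrative graph: 0->1, 0->2, 1->3, 2->3, 3->4.
adj_t = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
]
depth = multi_source_bfs(adj_t, [0, 3])
```

With a single source the frontier matrix has one column and the loop reduces to ordinary BFS, so the traversal logic itself does not change.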


Conclusion

Roadmap assuming linear weak scaling


Conclusion

Acknowledgments

My advisors, John D. Owens and Aydın Buluç

Owens group and the Gunrock team

GraphBLAS C API subcommittee

NVIDIA, for their generous hardware support

Funding support

NSF Award # CCF-1629657

DARPA XDATA program under US Army award W911QX-12-C-0059

DARPA HIVE program under agreement number FA8650-18-2-7836

LBNL’s DOE contract DE-AC02-05CH11231

DOE’s OASCR contract DE-AC02-05CH11231

NNSA and DOE’s Office of Science under the Exascale Computing Project 17-SC-20-SC


Conclusion

Questions?

Open-source code with Apache License: https://github.com/gunrock/graphblast


Conclusion

Code example: BFS

