
High-Performance Linear Algebra-based Graph Framework on GPU

PhD Exit Seminar

Carl Yang

University of California, Davis

May 1, 2019

CY (UCD) High-Performance Graph Framework on GPU May 1, 2019 1 / 43


Overview

1 Introduction

2 Goals

3 Challenges

4 Conclusion


Introduction


GraphBLAS Open Standard


Graph traversal is sparse matrix multiplication

1 Dénes Kőnig, 1931
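A minimal Python sketch of the traversal-as-multiplication idea, assuming a small hypothetical 4-vertex directed graph stored as a dense adjacency matrix. One traversal step is a vector-matrix product over the Boolean semiring (OR as add, AND as multiply). The names `A`, `frontier`, and `traverse_step` are illustrative and do not come from the talk.

```python
# Adjacency matrix of a hypothetical directed graph: A[i][j] = 1 means edge i -> j.
# Edges: 0->1, 0->2, 1->3, 2->3
A = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

def traverse_step(frontier, A):
    """One traversal step: next[j] = OR over i of (frontier[i] AND A[i][j])."""
    n = len(A)
    return [int(any(frontier[i] and A[i][j] for i in range(n))) for j in range(n)]

frontier = [1, 0, 0, 0]               # start at vertex 0
step1 = traverse_step(frontier, A)    # vertices one hop from vertex 0
step2 = traverse_step(step1, A)       # vertices one hop from those
print(step1, step2)
```

A real implementation would store `A` and the frontier in sparse form so that the work scales with the number of edges touched rather than with n squared.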


Ingredient 1: Operations

“Our mission is to build up a linear algebra sense to the extent that vector-level thinking becomes as natural as scalar-level thinking.” - Charles Van Loan

[Figure: four GraphBLAS operations: (a) eWiseAdd, (b) eWiseMult, (c) mxv, (d) mxm]
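The four operations can be sketched in plain Python on dictionary-based sparse containers. This is an illustrative sketch, not the GraphBLAS API: the function names only mirror the operation names on the slide, and the storage layout (`{index: value}` for vectors, `{row: {col: value}}` for matrices) is an assumption. It shows the key difference between the two element-wise operations: eWiseAdd works on the union of stored indices, eWiseMult on the intersection.

```python
import operator

def ewise_add(u, v, op=operator.add):
    """Union of index sets; op is applied only where both entries exist."""
    out = dict(u)
    for i, x in v.items():
        out[i] = op(out[i], x) if i in out else x
    return out

def ewise_mult(u, v, op=operator.mul):
    """Intersection of index sets."""
    return {i: op(u[i], v[i]) for i in u if i in v}

def mxv(A, v):
    """Matrix-vector product; a row appears in the output only if it produced a term."""
    out = {}
    for i, row in A.items():
        terms = [a * v[j] for j, a in row.items() if j in v]
        if terms:
            out[i] = sum(terms)
    return out

def mxm(A, B):
    """Row-by-row product of two sparse matrices."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a in row.items():
            for j, b in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a * b
        if acc:
            C[i] = acc
    return C

u = {0: 1, 2: 5}
v = {2: 3, 3: 4}
A = {0: {1: 2}, 1: {2: 1, 3: 1}}

print(ewise_add(u, v))    # entries at indices 0, 2, 3
print(ewise_mult(u, v))   # entry at index 2 only
print(mxv(A, v))
print(mxm(A, A))
```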


Ingredient 2: Operators

Semiring notation: (Add, Multiply, Domain)

Name        Semiring             Application

Real field  {+, ×, R}            Classical numerical linear algebra
Boolean     {|, &, {0, 1}}       Graph connectivity
Tropical    {min, +, R ∪ {∞}}    Shortest path
Max-plus    {max, +, R}          Graph matching
Min-times   {min, ×, R}          Maximal independent set
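To make the table concrete, here is a hedged Python sketch of a matrix-vector product that takes the semiring as a parameter. The tuple layout `(add, multiply, additive identity)` and the example graph are assumptions made for illustration. With the tropical semiring, repeating the product n - 1 times yields single-source shortest-path distances (a Bellman-Ford-style relaxation); with the Boolean semiring the same function computes one step of reachability.

```python
import math
from functools import reduce

# Each semiring is (add, multiply, additive identity).
BOOLEAN = (lambda a, b: a or b, lambda a, b: a and b, False)
TROPICAL = (min, lambda a, b: a + b, math.inf)

def mxv(A, v, semiring):
    """Dense matrix-vector product over the given semiring."""
    add, mul, zero = semiring
    return [reduce(add, (mul(a, x) for a, x in zip(row, v)), zero) for row in A]

INF = math.inf
# Hypothetical weighted graph: 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (5).
# W[i][j] is the weight of edge j -> i; the zero diagonal keeps the current distance.
W = [
    [0,   INF, INF, INF],
    [4,   0,   2,   INF],
    [1,   INF, 0,   INF],
    [INF, 5,   INF, 0],
]

dist = [0, INF, INF, INF]            # source is vertex 0
for _ in range(len(W) - 1):          # n - 1 relaxation rounds
    dist = mxv(W, dist, TROPICAL)
print(dist)

# Same pattern, Boolean semiring: which vertices are one hop from vertex 0?
B = [[w != INF and i != j for j, w in enumerate(row)] for i, row in enumerate(W)]
reach = mxv(B, [True, False, False, False], BOOLEAN)
print(reach)
```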


Operations + Operators = GraphBLAS

1 http://graphblas.org


Goals


Goals of GraphBLAS

1 Provide portable performance

2 Allow concise expression

3 Match state-of-the-art in performance

4 Be effective at small scale and exascale


Goal 1: Provide portable performance

Open standard with well-defined API specification ensures portability.

Q: But how to future-proof the spec against future innovations in data structures and algorithms?


Opaque Vector and Matrix


Goal 2: Allow concise expression

Algorithm                    Ligra  GraphIt  Gunrock  GraphBLAST

Breadth-first-search         30     22       2732     25
Graph coloring               N/A    N/A      893      22
Local graph clustering       84     N/A      875      45
Maximal independent set      51     N/A      N/A      26
Single-source shortest-path  55     25       760      25

Table: Lines of C++ application code counted by ‘cloc’.

1 Julian Shun and Guy E. Blelloch. “Ligra: A Lightweight Graph Processing Framework for Shared Memory.” PPoPP ’13

2 Yunming Zhang et al. “GraphIt: A High-performance Graph DSL.” OOPSLA ’18


Goal 3: Match state-of-the-art in performance

1 Zhang, Peter, et al. “GBTL-CUDA: Graph algorithms and primitives for GPUs.” IPDPSW ’16.

2 Wang, Yangzihao, et al. “Gunrock: GPU graph analytics.” ACM TOPC ’17.


Sparse-matrix sparse-vector multiplication necessary

1 Carl Yang, Yangzihao Wang and John D. Owens. “Fast sparse-matrix and sparse vector multiplication algorithm on the GPU.” IPDPSW ’16.


Two load-balancing algorithms for SpMM

2 Carl Yang, Aydın Buluç and John D. Owens. “Design principles for sparse matrix multiplication on the GPU.” EuroPar ’18. Distinguished paper award.


Push-pull maps nicely to linear algebra

3 Carl Yang, Aydın Buluç and John D. Owens. “Implementing push-pull efficiently in GraphBLAS.” ICPP ’18.
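One way to read the push-pull mapping, sketched in Python under stated assumptions: push is treated as a column-wise product with a sparse frontier (scatter along the out-edges of frontier vertices), and pull as a row-wise product restricted to unvisited vertices (each one scans its in-edges and can stop at the first hit). The function names, the adjacency-list layout, and the example graph are hypothetical. Both directions compute the same next frontier; they differ only in how much work they do.

```python
def push_step(out_edges, frontier, visited):
    """Push: iterate over frontier vertices and scatter along their out-edges.
    Work is proportional to the edges leaving the frontier."""
    nxt = set()
    for u in frontier:
        for v in out_edges.get(u, ()):
            if v not in visited:
                nxt.add(v)
    return nxt

def pull_step(in_edges, frontier, visited, n):
    """Pull: every unvisited vertex scans its in-edges for a frontier parent.
    any() stops at the first parent found (early exit)."""
    nxt = set()
    for v in range(n):
        if v not in visited and any(u in frontier for u in in_edges.get(v, ())):
            nxt.add(v)
    return nxt

def transpose(out_edges):
    """Build in-edge lists from out-edge lists."""
    in_edges = {}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges.setdefault(v, []).append(u)
    return in_edges

# Hypothetical graph: 0->1, 0->2, 1->3, 2->3, 3->4
out_edges = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
in_edges = transpose(out_edges)

frontier, visited = {1, 2}, {0, 1, 2}
pushed = push_step(out_edges, frontier, visited)
pulled = pull_step(in_edges, frontier, visited, n=5)
print(pushed, pulled)
```

A direction-optimizing traversal would choose between the two per iteration, typically pushing while the frontier is small and pulling once it covers a large share of the graph; the switching rule is left out of this sketch.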


Comparable performance to state-of-the-art

[Figure: bar chart of slowdown compared to Gunrock (log scale, 1 to 1000) per dataset: soc-ork, soc-lj, h09, i04, kron, rmat22, rmat23, rmat24, rgg, roadnet, road_usa. Series: SuiteSparse, CuSha, Baseline, Ligra, This Work]


GraphBLAST has closed the 10× performance gap

[Figure: bar chart of performance in MTEPS (log scale, 0.1 to 1000) per dataset: Journals, G43, ship_003, belgium_osm, roadNet-CA, delaunay_n24. Series: GBTL, Gunrock, GraphBLAST]


Goal 4: Be effective at small scale and exascale


Challenges


Challenges of GraphBLAS

1 Hard for users to learn

2 Bulk-synchronous

3 GraphBLAS C API limited to C

4 Dynamic graphs

5 Kernel fusion


Challenge 1: Hard for users to learn

1 Where is the graph?

2 Where are the edges?

3 Where are the vertices?


Challenge 2: Bulk-synchronous

Some say the world is moving towards asynchronous backends: AM++, HavoqGT, Galois, STAPL, HPX, Charm++

Q: How to reconcile GraphBLAS’s inherently bulk-synchronous model with these asynchronous runtime systems?


Example: HavoqGT’s delegates partitioning

1 Roger Pearce, Maya Gokhale and Nancy M. Amato. “Faster Parallel Traversal of Scale Free Graphs at Extreme Scale with Vertex Delegates.” SC ’14

2 Yuechao Pan, Roger Pearce and John D. Owens. “Scalable Breadth-First Search on a GPU Cluster.” IPDPS ’18


Delegates partitioning in linear algebra


Standard 1D partition communication for E_low


Reduced communication for E_high


Open questions

Q: How to build a distributed framework that is decoupled from GraphBLAST, so that we can get code reuse?

Q: Can we build a distributed framework that is decoupled from the implementation, so that every GraphBLAS backend (e.g. single-threaded CPU, multi-threaded CPU, GPU) gets distributed computing for free?

This is what Horovod does for deep learning, but it needs to make assumptions on data structures (i.e. dense tensor) and supports only 3 MPI communication ops.

Can we do the same by limiting to CSR and a minimal set of ops?


Challenge 3: GraphBLAS C API limited to C


Use modern practices such as hourglass model


Enables goodies like:

Operator overloading in Python and C++ frontends

Use + and × with previously used Semiring and Descriptor

Templates can be used in C++ backend for codegen

RAII for memory safety

Can build C++ spec atop existing C spec

1 Stefanus Du Toit. “Hourglass Interfaces for C++ APIs.” CppCon ’14


Challenge 4: Dynamic graphs

Q: What are the motivating applications for dynamic graphs?

prototype → full implementation → intuit patterns → extract frameworks

Q: How should the system be designed so as to maximize decoupling between code and code specific to each graph representation?

CSR, SlabHash, cuSTINGER, etc.

1 Max Dama. “Max Dama on Automated Trading.” 2008


Challenge 5: Kernel fusion


Thoughts on implementing kernel fusion

1 Treat each operation as a node in a computation graph

2 Traverse the computation graph looking for nodes that should be fused together

Elementwise ops and simple math ops such as apply

3 Generate CUDA code

4 Use runtime compilation to generate a compiled CUDA kernel
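Steps 1 to 3 above can be sketched in Python as follows. This is a toy illustration under simplifying assumptions: the computation graph is a linear chain, only consecutive elementwise nodes are fused, and the node format, the `fuse` and `codegen` helpers, and the op names are all hypothetical. Step 4 is not shown; the generated string would be handed to a runtime compiler such as NVRTC.

```python
# Each node is (kind, name, expression template); {x} stands for the incoming value.
ops = [
    ("elementwise", "scale",  "({x} * 2.0f)"),
    ("elementwise", "shift",  "({x} + 1.0f)"),
    ("reduce",      "sum",    None),             # treated as not fusable here
    ("elementwise", "square", "({x} * {x})"),
]

def fuse(ops):
    """Step 2: group maximal runs of consecutive elementwise nodes."""
    groups, run = [], []
    for op in ops:
        if op[0] == "elementwise":
            run.append(op)
        else:
            if run:
                groups.append(run)
                run = []
            groups.append([op])
    if run:
        groups.append(run)
    return groups

def codegen(group):
    """Step 3: emit CUDA source for one group of elementwise nodes.
    Only valid for groups produced from elementwise ops."""
    expr = "in[i]"
    for _, _, template in group:
        expr = template.format(x=expr)           # nest each op around the previous one
    name = "_".join(n for _, n, _ in group)
    return (
        f'extern "C" __global__ void {name}(const float* in, float* out, int n) {{\n'
        f"  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        f"  if (i < n) out[i] = {expr};\n"
        f"}}\n"
    )

groups = fuse(ops)
src = codegen(groups[0])      # scale and shift become a single kernel
print(src)
```

Fusing `scale` and `shift` means the intermediate vector is never written to or read from global memory, which is the main saving kernel fusion targets.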


Conclusion


Our work is the first GraphBLAS implementation on the GPU that shows comparable performance to state-of-the-art

10× to 1000× speed-up compared to other GraphBLAS implementations

Linear algebra-based graph frameworks are the way to go:

Just as expressible and performant as their vertex-centric counterparts

More concise and portable

Parallel traversals from different source vertices require minimal code change (change Vector to Matrix)

Challenge for current students: Scale up to exascale problems
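The "change Vector to Matrix" point can be illustrated with a small Python sketch, assuming dense 0/1 lists and a hypothetical `step` function. The frontier is stored with one row per source vertex, so a single-source traversal is the one-row case and a multi-source traversal simply adds rows; the product itself does not change.

```python
def step(F, A):
    """Boolean product F x A. F has one row per source vertex, so a single
    call advances every traversal by one hop."""
    n = len(A)
    return [
        [int(any(f[i] and A[i][j] for i in range(n))) for j in range(n)]
        for f in F
    ]

# Hypothetical path graph: 0 -> 1 -> 2
A = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

single = step([[1, 0, 0]], A)                 # one source: a 1 x n frontier
multi = step([[1, 0, 0], [0, 1, 0]], A)       # two sources: a 2 x n frontier
print(single, multi)
```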


Roadmap assuming linear weak scaling


Acknowledgments

My advisors, John D. Owens and Aydın Buluç

Owens group and the Gunrock team

GraphBLAS C API subcommittee

NVIDIA, for their generous hardware support

Funding support

NSF Award # CCF-1629657

DARPA XDATA program under US Army award W911QX-12-C-0059

DARPA HIVE program under agreement number FA8650-18-2-7836

LBNL’s DOE contract DE-AC02-05CH11231

DOE’s OASCR contract DE-AC02-05CH11231

NNSA and DOE’s Office of Science under the Exascale Computing Project 17-SC-20-SC


Questions?

Open-source code with Apache License: https://github.com/gunrock/graphblast


Code example: BFS
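The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of BFS written in the linear algebra style the talk describes; it is not the author's GraphBLAST code. It assumes a dense 0/1 adjacency matrix, and the function name `bfs` and the example graph are illustrative. Each iteration records the current level for frontier vertices, then forms the next frontier as a Boolean vector-matrix product masked by the complement of the visited set.

```python
def bfs(A, source):
    """Level-synchronous BFS on a dense 0/1 adjacency matrix (A[i][j]: edge i -> j).
    Returns the depth of each vertex, with -1 for unreachable vertices."""
    n = len(A)
    depth = [-1] * n
    frontier = [False] * n
    frontier[source] = True
    level = 0
    while any(frontier):
        # Assign: record the level of every vertex currently in the frontier.
        for v in range(n):
            if frontier[v]:
                depth[v] = level
        # Boolean vector-matrix product, masked so that only unvisited
        # vertices (depth still -1) can enter the next frontier.
        frontier = [
            depth[j] == -1 and any(frontier[i] and A[i][j] for i in range(n))
            for j in range(n)
        ]
        level += 1
    return depth

# Hypothetical graph: 0->1, 0->2, 1->3, 2->3, 3->0; vertex 4 is isolated.
A = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
depths = bfs(A, 0)
print(depths)
```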
