Parallel Graph Algorithms

Kamesh Madduri ([email protected]), Lawrence Berkeley National Laboratory
CS267, Spring 2011, April 7, 2011


Page 1: Parallel Graph Algorithms

CS267, Spring 2011, April 7, 2011

Parallel Graph Algorithms

Kamesh Madduri [email protected]

Lawrence Berkeley National Laboratory

Page 2: Parallel Graph Algorithms

• Applications
• Designing parallel graph algorithms, performance on current systems
• Case studies: graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest Paths
  – Betweenness Centrality

Lecture Outline

Page 3: Parallel Graph Algorithms

Road networks, point-to-point shortest paths: 15 seconds (naïve) → 10 microseconds

Routing in transportation networks

H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.

Page 4: Parallel Graph Algorithms

• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering

• Internet topologies (router networks) are naturally modeled as graphs

Internet and the WWW

Page 5: Parallel Graph Algorithms

• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, eigenvectors
  – Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees & graph embedding techniques

Scientific Computing

B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”, http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

Image source: Yifan Hu, “A gallery of large graphs”

Image source: Tim Davis, UF Sparse Matrix Collection.

Page 6: Parallel Graph Algorithms

• Graph abstractions are very useful to analyze complex data sets.

• Sources of data: petascale simulations, experimental devices, the Internet, sensor networks

• Challenges: data size, heterogeneity, uncertainty, data quality

Large-scale data analysis

Astrophysics: massive datasets, temporal variations

Bioinformatics: data quality, heterogeneity

Social Informatics: new analytics challenges, data uncertainty

Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com

Page 7: Parallel Graph Algorithms

• Study of the interactions between various components in a biological system

• Graph-theoretic formulations are pervasive:
  – Predicting new interactions: modeling
  – Functional annotation of novel proteins: matching, clustering
  – Identifying metabolic pathways: paths, clustering
  – Identifying new protein complexes: clustering, centrality

Data Analysis and Graph Algorithms in Systems Biology

Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.

Page 8: Parallel Graph Algorithms

Image Source: Nexus (Facebook application)

Graph-theoretic problems in social networks

– Targeted advertising: clustering and centrality

– Studying the spread of information

Page 9: Parallel Graph Algorithms

• [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information

• Plot masterminds correctly identified from interaction patterns: centrality

• A global view of entities is often more insightful

• Detect anomalous activities by exact/approximate subgraph isomorphism.

Image Source: http://www.orgnet.com/hijackers.html

Network Analysis for Intelligence and Surveillance

Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47

Page 10: Parallel Graph Algorithms

Research in Parallel Graph Algorithms

[Diagram relating application areas, methods/problems, graph algorithms, and architectures.]

• Application areas: social network analysis (marketing, social search), WWW, computational biology (gene regulation, metabolic pathways, genomics), scientific computing, engineering (VLSI CAD, route planning)
• Methods/problems: find central entities, community detection, network dynamics, graph partitioning, matching, coloring
• Graph algorithms: traversal, shortest paths, connectivity, max flow
• Architectures: GPUs, FPGAs, x86 multicore servers, massively multithreaded architectures, multicore clusters, clouds
• Challenges: data size, problem complexity

Page 11: Parallel Graph Algorithms

Characterizing graph-theoretic computations

• Input data: the graph
• Problem: find *** (paths, clusters, partitions, matchings, patterns, orderings)
• Graph kernel: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, ...
• Factors that influence choice of algorithm:
  – graph sparsity (m/n ratio)
  – static/dynamic nature
  – weighted/unweighted, weight distribution
  – vertex degree distribution
  – directed/undirected
  – simple/multi/hyper graph
  – problem size
  – granularity of computation at nodes/edges
  – domain-specific characteristics

Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations

Page 12: Parallel Graph Algorithms

• Applications
• Designing parallel graph algorithms, performance on current systems
• Case studies: graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest Paths
  – Betweenness Centrality

Lecture Outline

Page 13: Parallel Graph Algorithms

• 1735: “Seven Bridges of Königsberg” problem, resolved by Euler, one of the first graph theory results.

• …
• 1966: Flynn's Taxonomy
• 1968: Batcher's "sorting networks"
• 1969: Harary's "Graph Theory"
• …
• 1972: Tarjan's "Depth-first search and linear graph algorithms"
• 1975: Reghbati and Corneil, parallel connected components
• 1982: Misra and Chandy, distributed graph algorithms
• 1984: Quinn and Deo's survey paper on "parallel graph algorithms"
• …

History

Page 14: Parallel Graph Algorithms

• Idealized parallel shared memory system model

• Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead

• EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)

• Measuring performance: space and time complexity; total number of operations (work)

The PRAM model

Page 15: Parallel Graph Algorithms

• Pros
  – Simple and clean semantics.
  – The majority of theoretical parallel algorithms are designed using the PRAM model.
  – Independent of the communication network topology.
• Cons
  – Not realistic: too powerful a communication model.
  – Algorithm designer is misled to use IPC without hesitation.
  – Synchronized processors.
  – No local memory.
  – Big-O notation is often misleading.

PRAM Pros and Cons

Page 16: Parallel Graph Algorithms

• Reghbati and Corneil [1975], O(log² n) time and O(n³) processors

• Wyllie [1979], O(log² n) time and O(m) processors

• Shiloach and Vishkin [1982], O(log n) time and O(m) processors

• Reif and Spirakis [1982], O(log log n) time and O(n) processors (expected)

PRAM Algorithms for Connected Components

Page 17: Parallel Graph Algorithms

• Prefix sums
• Symmetry breaking
• Pointer jumping
• List ranking
• Euler tours
• Vertex collapse
• Tree contraction

Building blocks of classical PRAM graph algorithms
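
As a concrete (and simplified) example of the first building block, here is a two-pass parallel prefix sum with OpenMP. This is our own sketch, not PRAM pseudocode from the slides; function and variable names are illustrative.

```cpp
// Two-pass parallel inclusive prefix sum (scan) sketch with OpenMP.
// Pass 1: each thread scans its own block; Pass 2: each thread adds the
// total of all preceding blocks to its entries.
#include <omp.h>
#include <vector>

void prefix_sum(std::vector<long>& a) {
    int nt = omp_get_max_threads();
    std::vector<long> block_sum(nt + 1, 0);
    size_t n = a.size();
    #pragma omp parallel num_threads(nt)
    {
        int t = omp_get_thread_num();
        size_t lo = n * t / nt, hi = n * (t + 1) / nt;
        long s = 0;
        for (size_t i = lo; i < hi; ++i) { s += a[i]; a[i] = s; }        // local scan
        block_sum[t + 1] = s;
        #pragma omp barrier
        #pragma omp single
        for (int k = 1; k <= nt; ++k) block_sum[k] += block_sum[k - 1];  // scan of block sums
        for (size_t i = lo; i < hi; ++i) a[i] += block_sum[t];           // add block offset
    }
}
```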

Page 18: Parallel Graph Algorithms

Static case
• Dense graphs (m = O(n²)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists.

Dynamic case
• Representation depends on the common-case query.
• Edge insertions or deletions? Vertex insertions or deletions? Edge weight updates?
• Graph update rate.
• Queries: connectivity, paths, flow, etc.

• Optimizing for locality a key design consideration.

Data structures: graph representation

Page 19: Parallel Graph Algorithms

• Compressed Sparse Row-like

Graph representation

[Figure: example undirected multigraph on vertices 0-9, with its vertex/degree/adjacency table:]

Vertex  Degree  Adjacencies
0       4       2 5 7 7
1       1       6
2       2       0 3
3       3       2 4 7
4       3       3 6 8
5       2       0 8
6       4       1 4 8 9
7       4       0 0 3 8
8       4       4 5 6 7
9       1       6

Flatten the adjacency arrays:
• Adjacencies: 2 5 7 7 6 0 3 2 4 …. 6 7 6 (size: 2*m)
• Index into adjacency array: 0 4 5 7 … 28 (size: n+1)
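
For concreteness, the following sketch (our own; the array names index/adj are illustrative) builds such a flattened, CSR-like representation from an undirected edge list.

```cpp
// Minimal CSR-like ("flattened adjacency array") construction sketch.
// Each undirected edge (u,v) is stored in both directions, so adj has 2*m entries.
#include <vector>
#include <utility>

struct CSRGraph {
    std::vector<int> index;  // size n+1: index[v]..index[v+1] delimits v's adjacencies
    std::vector<int> adj;    // size 2*m: concatenated adjacency lists
};

CSRGraph build_csr(int n, const std::vector<std::pair<int,int>>& edges) {
    CSRGraph g;
    g.index.assign(n + 1, 0);
    for (auto [u, v] : edges) { ++g.index[u + 1]; ++g.index[v + 1]; }  // count degrees
    for (int v = 0; v < n; ++v) g.index[v + 1] += g.index[v];          // prefix sum -> offsets
    g.adj.resize(2 * edges.size());
    std::vector<int> next(g.index.begin(), g.index.end() - 1);         // insertion cursors
    for (auto [u, v] : edges) { g.adj[next[u]++] = v; g.adj[next[v]++] = u; }
    return g;
}
```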

Page 20: Parallel Graph Algorithms

• Each processor stores the entire graph (“full replication”)

• Each processor stores n/p vertices and all adjacencies out of these vertices (“1D partitioning”)

• How to create these "p" vertex partitions?
  – Graph partitioning algorithms: recursively optimize for conductance (edge cut / size of smaller partition).
  – Randomly shuffling the vertex identifiers ensures that edge counts per processor are roughly the same.

Distributed Graph representation

Page 21: Parallel Graph Algorithms

• Consider a logical 2D processor grid (pr * pc = p) and the matrix representation of the graph

• Assign each processor a sub-matrix (i.e., the edges within the sub-matrix)

2D graph partitioning

[Figure: the example graph and its sparse adjacency matrix, divided into 3x3 sub-matrices, one per processor.]

9 vertices, 9 processors, 3x3 processor grid

Flatten Sparse matrices

Per-processor local graph representation
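
A minimal sketch of how an edge (u, v) might be mapped to a processor under such a 2D layout, assuming a block distribution of vertices over a pr x pc grid; the row-from-source, column-from-destination convention is our assumption, and real implementations differ in details.

```cpp
// Hypothetical 2D edge-to-processor mapping sketch (block vertex distribution).
// The processor row is chosen by the source vertex, the column by the destination.
struct Grid2D {
    int pr, pc;   // processor grid dimensions (pr * pc = p)
    int n;        // number of vertices
};

int owner_of_edge(const Grid2D& g, int u, int v) {
    int rows_per_proc = (g.n + g.pr - 1) / g.pr;   // vertices per processor row
    int cols_per_proc = (g.n + g.pc - 1) / g.pc;   // vertices per processor column
    int proc_row = u / rows_per_proc;
    int proc_col = v / cols_per_proc;
    return proc_row * g.pc + proc_col;             // row-major rank in the grid
}
```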

Page 22: Parallel Graph Algorithms

• A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, tree

• Implementations are typically array-based for performance considerations.

• Key data structure considerations in parallel graph algorithm design
  – Practical parallel priority queues
  – Space efficiency
  – Parallel set/multiset operations, e.g., union, intersection, etc.

Data structures in (parallel) graph algorithms

Page 23: Parallel Graph Algorithms

• Concurrency
  – Simulating PRAM algorithms: hardware limits of memory bandwidth, number of outstanding memory references; synchronization
• Locality
  – Try to improve cache locality, but avoid "too much" superfluous computation
• Work-efficiency
  – Is (Parallel time) * (# of processors) = (Serial work)?

Graph Algorithms on current systems

Page 24: Parallel Graph Algorithms

[Figure: performance rate (Millions of Traversed Edges/s, 0-60) vs. problem size (log2 # of vertices, 10-26).]

Serial Performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache)

Input: Synthetic R-MAT graphs (# of edges m = 8n)

The locality challenge: "Large memory footprint, low spatial and temporal locality impede performance"

~ 5X drop in performance

No Last-level Cache (LLC) misses

O(m) LLC misses

Page 25: Parallel Graph Algorithms

• Graph topology assumptions in classical algorithms do not match real-world datasets

• Parallelization strategies at loggerheads with techniques for enhancing memory locality

• Classical "work-efficient" graph algorithms may not fully exploit new architectural features
  – Increasing complexity of memory hierarchy, processor heterogeneity, wide SIMD.
• Tuning implementation to minimize parallel overhead is non-trivial
  – Shared memory: minimizing overhead of locks, barriers.
  – Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication with computation.

The parallel scaling challenge: "Classical parallel graph algorithms perform poorly on current parallel systems"

Page 26: Parallel Graph Algorithms

• Applications
• Designing parallel graph algorithms, performance on current systems
• Case studies: graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest Paths
  – Betweenness Centrality

Lecture Outline

Page 27: Parallel Graph Algorithms

Graph traversal (BFS) problem definition

[Figure: the example graph with a designated source vertex. Input: the graph and the source vertex. Output: the distance of each vertex from the source.]

Memory requirements (# of machine words):
• Sparse graph representation: m+n
• Stack of visited vertices: n
• Distance array: n
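
A minimal serial sketch of this traversal over the CSR-like representation from earlier (array names are ours), using exactly the distance array and frontier storage listed above:

```cpp
// Serial level-by-level BFS sketch over a CSR-like graph.
// Distance array d: -1 means "not yet visited".
#include <vector>

std::vector<int> bfs(const std::vector<int>& index,   // size n+1
                     const std::vector<int>& adj,     // size 2*m
                     int source) {
    int n = (int)index.size() - 1;
    std::vector<int> d(n, -1), frontier, next;
    d[source] = 0;
    frontier.push_back(source);
    while (!frontier.empty()) {
        for (int u : frontier)                         // expand current frontier
            for (int k = index[u]; k < index[u + 1]; ++k) {
                int v = adj[k];
                if (d[v] < 0) { d[v] = d[u] + 1; next.push_back(v); }
            }
        frontier.swap(next);
        next.clear();
    }
    return d;
}
```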

Page 28: Parallel Graph Algorithms

Optimizing BFS on cache-based multicore platforms, for networks with “power-law” degree distributions

Problem spec. assumptions:
• No. of vertices/edges: 10^6 to 10^9
• Edge/vertex ratio: 1 to 100
• Static/dynamic? Static
• Diameter: O(1) to O(log n)
• Weighted/unweighted: Unweighted
• Vertex degree distribution: unbalanced ("power law")
• Directed/undirected? Both
• Simple/multi/hypergraph? Multigraph
• Granularity of computation at vertices/edges? Minimal
• Exploiting domain-specific characteristics? Partially

Test data: synthetic R-MAT networks.

(Data: Mislove et al., IMC 2007.)

Page 29: Parallel Graph Algorithms

Parallel BFS Strategies

1. Expand current frontier (level-synchronous approach, suited for low-diameter graphs)
   • O(D) parallel steps
   • Adjacencies of all vertices in the current frontier are visited in parallel

2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)
   • Path-limited searches from "super vertices"
   • APSP between "super vertices"

[Figure: the example graph with a source vertex, illustrating both strategies.]
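
A sketch of strategy 1 (level-synchronous expansion) using OpenMP and a GCC-style atomic compare-and-swap to claim unvisited vertices; this is our illustration of the idea, not the tuned implementation discussed on later slides.

```cpp
// Level-synchronous parallel BFS sketch (OpenMP).
// Threads expand the current frontier in parallel; the CAS ensures each
// vertex is claimed and placed in the next frontier exactly once.
#include <omp.h>
#include <vector>

std::vector<int> parallel_bfs(const std::vector<int>& index,
                              const std::vector<int>& adj,
                              int source) {
    int n = (int)index.size() - 1;
    std::vector<int> d(n, -1);
    std::vector<int> frontier{source}, next;
    d[source] = 0;
    int level = 0;
    while (!frontier.empty()) {
        next.clear();
        #pragma omp parallel
        {
            std::vector<int> local_next;                 // per-thread buffer
            #pragma omp for schedule(dynamic, 64)
            for (int i = 0; i < (int)frontier.size(); ++i) {
                int u = frontier[i];
                for (int k = index[u]; k < index[u + 1]; ++k) {
                    int v = adj[k];
                    // cheap unsynchronized read, then CAS to claim the vertex
                    if (d[v] < 0 &&
                        __sync_bool_compare_and_swap(&d[v], -1, level + 1))
                        local_next.push_back(v);
                }
            }
            #pragma omp critical
            next.insert(next.end(), local_next.begin(), local_next.end());
        }
        frontier.swap(next);
        ++level;
    }
    return d;
}
```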

Page 30: Parallel Graph Algorithms

Locality (where are the random accesses originating from?)

A deeper dive into the “level synchronous” strategy

[Figure: the example graph annotated with per-vertex values of the adjacency indexing array and the d array.]

1. Ordering of vertices in the "current frontier" array, i.e., accesses to the adjacency indexing array; cumulative accesses O(n).
2. Ordering of the adjacency list of each vertex; cumulative O(m).
3. Sifting through adjacencies to check whether visited or not; cumulative accesses O(m).

1. Access pattern, idx array: 53, 31, 74, 26
2, 3. Access pattern, d array: 0, 84, 0, 84, 93, 44, 63, 0, 0, 11

Page 31: Parallel Graph Algorithms

Performance Observations

[Figures: for the Youtube and Flickr social networks, the percentage of total (# of vertices in the frontier array, execution time) per BFS phase (1-15), split into graph expansion and edge filtering.]

Page 32: Parallel Graph Algorithms

• Well-studied problem, slight differences in problem formulations
  – Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks.
  – Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots, online social networks for compression.
• NP-hard problem, several known heuristics
  – We require fast, linear-work approaches.
  – Existing ones: BFS- or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploiting overlap in adjacency lists, dimensionality reduction.

Improving locality: Vertex relabeling

[Figure: sparsity patterns of the adjacency matrix before and after vertex relabeling.]

Page 33: Parallel Graph Algorithms

• Recall: potential O(m) non-contiguous memory references in edge traversal (to check if a vertex is visited).
  – e.g., access order: 53, 31, 31, 26, 74, 84, 0, …

• Objective: Reduce TLB misses, private cache misses, exploit shared cache.

• Optimizations (a sketch of #1 follows below):
  1. Sort the adjacency lists of each vertex: helps order memory accesses, reduce TLB misses.
  2. Permute vertex labels: enhance spatial locality.
  3. Cache-blocked edge visits: exploit temporal locality.

Improving locality: Optimizations

[Figure: the example graph annotated with d-array values, as on the previous slide.]
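
A minimal sketch of optimization 1 over the CSR-like arrays introduced earlier (illustrative only; a tuned implementation would fold this into the other preprocessing steps):

```cpp
// Sort each vertex's adjacency list in place (CSR layout).
// Sorted adjacencies make edge-traversal accesses closer to sequential.
#include <algorithm>
#include <vector>

void sort_adjacencies(const std::vector<int>& index, std::vector<int>& adj) {
    int n = (int)index.size() - 1;
    #pragma omp parallel for schedule(dynamic, 256)
    for (int u = 0; u < n; ++u)
        std::sort(adj.begin() + index[u], adj.begin() + index[u + 1]);
}
```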

Page 34: Parallel Graph Algorithms

• Instead of processing adjacencies of each vertex serially, exploit sorted adjacency list structure w/ blocked accesses

• Requires multiple passes through the frontier array, tuning for optimal block size.
• Note: frontier array size may be O(n).

Improving locality: Cache blocking

[Figure: frontier array vs. adjacencies (d): linear processing compared with the new cache-blocked approach, where metadata denotes the blocking pattern. Blocks are tuned to the L2 cache size, and high-degree vertices are processed separately.]

Page 35: Parallel Graph Algorithms

Similar to older heuristics, but tuned for small-world networks:
1. High percentage of vertices with (out-)degrees 0, 1, and 2 in social and information networks => store their adjacencies explicitly (in the indexing data structure). Augment the adjacency indexing data structure (with two additional words) and the frontier array (with one bit).
2. Process "high-degree vertices'" adjacencies in linear order, but other vertices with d-array cache blocking.
3. Form dense blocks around high-degree vertices: Reverse Cuthill-McKee, removing degree-1 and degree-2 vertices.

Vertex relabeling heuristic

Page 36: Parallel Graph Algorithms

1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores in flight) (sketched below)
   – Speculative loads of the index array and adjacencies of frontier vertices will reduce compulsory cache misses.
2. Aligning adjacency lists to optimize memory accesses
   – 16-byte aligned loads and stores are faster.
   – Alignment helps reduce cache misses due to fragmentation.
   – 16-byte aligned non-temporal stores (during creation of the new frontier) are fast.
3. SIMD SSE integer intrinsics to process "high-degree vertex" adjacencies.
4. Fast atomics (BFS is lock-free with low contention, and CAS-based intrinsics have very low overhead).
5. Hugepage support (significant TLB miss reduction).
6. NUMA-aware memory allocation exploiting the first-touch policy.

Architecture-specific Optimizations
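
As an illustration of optimization 1, a sketch with the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance is a tunable guess, the loop body is simplified (it omits frontier construction and atomics), and the names are ours.

```cpp
// Software-prefetching sketch for the frontier-expansion loop.
// While processing frontier[i], prefetch the index entry and the start of the
// adjacency list of a vertex a few iterations ahead to hide memory latency.
#include <vector>

void expand_with_prefetch(const std::vector<int>& index,
                          const std::vector<int>& adj,
                          const std::vector<int>& frontier,
                          std::vector<int>& d, int level) {
    const int dist = 8;  // prefetch distance (tunable)
    for (size_t i = 0; i < frontier.size(); ++i) {
        if (i + dist < frontier.size()) {
            int w = frontier[i + dist];
            __builtin_prefetch(&index[w]);               // index array entry
            __builtin_prefetch(adj.data() + index[w]);   // start of w's adjacency list
        }
        int u = frontier[i];
        for (int k = index[u]; k < index[u + 1]; ++k)
            if (d[adj[k]] < 0) d[adj[k]] = level + 1;    // simplified visit check
    }
}
```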

Page 37: Parallel Graph Algorithms

Experimental Setup

Network       n        m       Max. out-degree   % of vertices w/ out-degree 0,1,2
Orkut         3.07M    223M    32K               5
LiveJournal   5.28M    77.4M   9K                40
Flickr        1.86M    22.6M   26K               73
Youtube       1.15M    4.94M   28K               76
R-MAT         8M-64M   8n      n^0.6             -

Intel Xeon 5560 (Core i7, "Nehalem")
• 2 sockets x 4 cores x 2-way SMT
• 12 GB DRAM, 8 MB shared L3
• 51.2 GBytes/sec peak bandwidth
• 2.66 GHz processors

Performance averaged over 10 different source vertices, 3 runs each.

Page 38: Parallel Graph Algorithms

Optimization                                          Generality   Impact*   Tuning required?
(Preproc.) Sort adjacency lists                       High         --        No
(Preproc.) Permute vertex labels                      Medium       --        Yes
Preproc. + binning frontier vertices + cache blocking Medium       2.5x      Yes
Lock-free parallelization                             Medium       2.0x      No
Low-degree vertex filtering                           Low          1.3x      No
Software prefetching                                  Medium       1.10x     Yes
Aligning adjacencies, streaming stores                Medium       1.15x     No
Fast atomic intrinsics                                High         2.2x      No

Impact of optimization strategies

* Optimization speedup (performance on 4 cores) w.r.t. the baseline parallel approach, on a synthetic R-MAT graph (n = 2^23, m = 2^26)

Page 39: Parallel Graph Algorithms

Cache locality improvement

Theoretical count of the number of non-contiguous memory accesses: m+3n

Performance count: # of non-contiguous memory accesses (assuming cache line size of 16 words)

[Figure: fraction of the theoretical peak (0.0-1.0) for the Orkut, LiveJournal, Flickr, and Youtube data sets, comparing the Baseline, Adj. sorted, and Permute + cache blocked variants.]

Page 40: Parallel Graph Algorithms

Parallel performance (Orkut graph)

[Figure: performance rate (Millions of Traversed Edges/s, 0-1000) vs. number of threads (1, 2, 4, 8) for cache-blocked BFS and the baseline.]

Parallel speedup: 4.9

Speedup over baseline: 2.9

Execution time: 0.28 seconds (8 threads)

Graph: 3.07 million vertices, 220 million edges. Single socket of Intel Xeon 5560 (Core i7).

Page 41: Parallel Graph Algorithms

• How well does my implementation match theoretical bounds?

• When can I stop optimizing my code?
  – Begin with asymptotic analysis
  – Express work performed in terms of machine-independent performance counts
  – Add input graph characteristics in the analysis
  – Use simple kernels or micro-benchmarks to provide an estimate of achievable peak performance
  – Relate observed performance to machine characteristics

Performance Analysis

Page 42: Parallel Graph Algorithms

• BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16.

• Evaluation criteria: largest problem size that can be solved on a system, minimum execution time.

• Reference MPI, shared memory implementations provided.

• NERSC Franklin system is ranked #2 on the current list (Nov 2010).
  – BFS using 500 nodes of Franklin

Graph 500 “Search” Benchmark (graph500.org)

Page 43: Parallel Graph Algorithms

• Applications
• Designing parallel graph algorithms, performance on current systems
• Case studies: graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest Paths
  – Betweenness Centrality

Lecture Outline

Page 44: Parallel Graph Algorithms

Parallel Single-source Shortest Paths (SSSP) algorithms
• Edge weights: concurrency is the primary challenge!
• No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work
• Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
• Ullman-Yannakakis randomized approach [UY90]
• Meyer and Sanders, ∆-stepping algorithm [MS03]
• Distributed memory implementations based on graph partitioning
• Heuristics for load balancing and termination detection

K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

Page 45: Parallel Graph Algorithms

∆-stepping algorithm [MS03]

• Label-correcting algorithm: Can relax edges from unsettled vertices also

• ∆-stepping: "approximate bucket implementation of Dijkstra's algorithm"
• ∆: bucket width
• Vertices are ordered using buckets representing priority ranges of size ∆
• Each bucket may be processed in parallel
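
A compact sequential sketch of the ∆-stepping idea (a bucket array of width ∆, light edges relaxed in phases, heavy edges once per bucket). This is our illustration with simplified stale-entry handling; the parallel version processes each bucket's requests concurrently.

```cpp
// Sequential Delta-stepping sketch: buckets of width Delta, light/heavy edges.
#include <vector>
#include <limits>

struct Edge { int to; double w; };
const double INF = std::numeric_limits<double>::infinity();

std::vector<double> delta_stepping(const std::vector<std::vector<Edge>>& g,
                                   int s, double Delta) {
    int n = (int)g.size();
    std::vector<double> d(n, INF);
    std::vector<std::vector<int>> B;                       // buckets
    auto relax = [&](int v, double dist) {
        if (dist < d[v]) {
            d[v] = dist;
            size_t b = (size_t)(dist / Delta);
            if (b >= B.size()) B.resize(b + 1);
            B[b].push_back(v);                             // stale entries removed lazily
        }
    };
    relax(s, 0.0);
    for (size_t i = 0; i < B.size(); ++i) {
        std::vector<int> settled;                          // "S": deleted vertices
        while (!B[i].empty()) {                            // phases: relax light edges
            std::vector<int> cur;
            cur.swap(B[i]);
            for (int u : cur) {
                if (d[u] >= Delta * (i + 1)) continue;     // stale entry, belongs later
                settled.push_back(u);
                for (const Edge& e : g[u])
                    if (e.w <= Delta) relax(e.to, d[u] + e.w);
            }
        }
        for (int u : settled)                              // relax heavy edges once
            for (const Edge& e : g[u])
                if (e.w > Delta) relax(e.to, d[u] + e.w);
    }
    return d;
}
```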

Page 46: Parallel Graph Algorithms

∆-stepping algorithm: illustration (∆ = 0.1, say)

[Figure: example directed graph on vertices 0-6 with edge weights 0.01, 0.02, 0.05, 0.07, 0.13, 0.15, 0.18, 0.23, 0.56; the d array (initially all ∞) and the buckets.]

One parallel phase:
while (bucket is non-empty)
  i) Inspect light edges
  ii) Construct a set of "requests" (R)
  iii) Clear the current bucket
  iv) Remember deleted vertices (S)
  v) Relax request pairs in R
Relax heavy request pairs (from S)
Go on to the next bucket

Page 47: Parallel Graph Algorithms

[Same illustration.] Initialization: insert s into bucket 0, d(s) = 0. d array: 0 ∞ ∞ ∞ ∞ ∞ ∞

Page 48: Parallel Graph Algorithms

[Same illustration.] Light edges out of vertex 0 are inspected; request (2, .01) is added to R.

Page 49: Parallel Graph Algorithms

[Same illustration.] The current bucket is cleared; deleted vertex 0 is remembered in S. R = {(2, .01)}

Page 50: Parallel Graph Algorithms

[Same illustration.] The request in R is relaxed: d(2) = .01, and vertex 2 enters bucket 0. d array: 0 ∞ .01 ∞ ∞ ∞ ∞

Page 51: Parallel Graph Algorithms

[Same illustration.] Next phase: light edges out of vertex 2 generate requests (1, .03) and (3, .06).

Page 52: Parallel Graph Algorithms

[Same illustration.] The bucket is cleared; vertex 2 is added to S = {0, 2}.

Page 53: Parallel Graph Algorithms

[Same illustration.] The requests are relaxed: d(1) = .03, d(3) = .06; vertices 1 and 3 enter bucket 0. d array: 0 .03 .01 .06 ∞ ∞ ∞

Page 54: Parallel Graph Algorithms

[Same illustration.] Heavy edges from S are relaxed and the remaining buckets are processed. Final d array: 0 .03 .01 .06 .16 .29 .62
Page 55: Parallel Graph Algorithms
Page 56: Parallel Graph Algorithms

Classify edges as “heavy” and “light”

Page 57: Parallel Graph Algorithms

Relax light edges (phase). Repeat until B[i] is empty.

Page 58: Parallel Graph Algorithms

Relax heavy edges. No reinsertions in this step.

Page 59: Parallel Graph Algorithms

No. of phases (machine-independent performance count)

[Figure: number of phases (log scale, 10 to 10^6) for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE; low-diameter families need few phases, high-diameter families many.]

Page 60: Parallel Graph Algorithms

Average shortest path weight for various graph families: ~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0, 1]

[Figure: average shortest path weight (log scale, 0.01 to 10^5) for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE.]

Page 61: Parallel Graph Algorithms

Last non-empty bucket (machine-independent performance count)

[Figure: index of the last non-empty bucket (log scale, 1 to 10^6) for the same graph families.]

Fewer buckets, more parallelism

Page 62: Parallel Graph Algorithms

[Figure: number of bucket insertions (0 to 12,000,000) for the same graph families.]

Number of bucket insertions (machine-independent performance count)

Page 63: Parallel Graph Algorithms

• Applications
• Designing parallel graph algorithms, performance on current systems
• Case studies: graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest Paths
  – Betweenness Centrality

Lecture Outline

Page 64: Parallel Graph Algorithms

Betweenness Centrality

• Centrality: quantitative measure to capture the importance of a vertex/edge in a graph
  – degree, closeness, eigenvalue, betweenness
• Betweenness centrality:

  BC(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st

  (σ_st: no. of shortest paths between s and t; σ_st(v): the number of those that pass through v)
• Applied to several real-world networks
  – Social interactions
  – WWW
  – Epidemiology
  – Systems biology
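
For intuition, a quick worked example: on the 3-vertex path graph with vertices a, b, c and edges a-b, b-c, the only shortest path between a and c passes through b, so σ_ac(b)/σ_ac = 1 and BC(b) = 1 (counting each unordered pair once), while BC(a) = BC(c) = 0.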

Page 65: Parallel Graph Algorithms

Algorithms for Computing Betweenness

• All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n³) time), sum up the fractional dependency values (O(n²) space).

• Brandes’ algorithm (2003): Augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
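
For reference, a compact serial sketch of Brandes' approach for unweighted graphs (BFS-based path counting followed by dependency accumulation). This is our illustrative rendering, not the parallel algorithm presented on the following slides.

```cpp
// Serial Brandes betweenness centrality sketch (unweighted graph, CSR-like input).
#include <vector>
#include <stack>
#include <queue>

std::vector<double> brandes_bc(const std::vector<int>& index,
                               const std::vector<int>& adj) {
    int n = (int)index.size() - 1;
    std::vector<double> bc(n, 0.0);
    for (int s = 0; s < n; ++s) {
        std::vector<std::vector<int>> pred(n);     // predecessors on shortest paths
        std::vector<long> sigma(n, 0);             // number of shortest paths from s
        std::vector<int> dist(n, -1);
        std::vector<double> delta(n, 0.0);         // dependency scores
        std::stack<int> order;                     // vertices in non-decreasing distance
        std::queue<int> q;
        sigma[s] = 1; dist[s] = 0; q.push(s);
        while (!q.empty()) {                       // BFS: count shortest paths
            int v = q.front(); q.pop();
            order.push(v);
            for (int k = index[v]; k < index[v + 1]; ++k) {
                int w = adj[k];
                if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                if (dist[w] == dist[v] + 1) { sigma[w] += sigma[v]; pred[w].push_back(v); }
            }
        }
        while (!order.empty()) {                   // accumulate dependencies
            int w = order.top(); order.pop();
            for (int v : pred[w])
                delta[v] += (double)sigma[v] / sigma[w] * (1.0 + delta[w]);
            if (w != s) bc[w] += delta[w];         // for undirected graphs, halve at the end
        }
    }
    return bc;
}
```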

Page 66: Parallel Graph Algorithms

• Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality
  – Low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
  – Exact algorithm: O(mn) work, O(m+n) space, O(nD + nm/p) time
• Madduri et al. (2009): new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
  – In practice, 2-3x faster than the previous algorithm
  – Lock-free => better scalability on large parallel systems

Our New Parallel Algorithms

Page 67: Parallel Graph Algorithms

Parallel BC Algorithm

• Consider an undirected, unweighted graph
• High-level idea: level-synchronous parallel Breadth-First Search augmented to compute centrality scores
• Two steps
  – traversal and path counting
  – dependency accumulation

  δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))

Page 68: Parallel Graph Algorithms

Parallel BC Algorithm Illustration

[Figure: the example graph on vertices 0-9.]

Page 69: Parallel Graph Algorithms

Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance and path counts.

[Figure: the example graph with the source vertex marked.]

Page 70: Parallel Graph Algorithms

Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance and path counts.

[Figure: the example graph; the D (distance), S (stack), and P (predecessor) arrays after visiting the source's neighbors.]

Page 71: Parallel Graph Algorithms

Parallel BC Algorithm Illustration

1. Traversal step: visit adjacent vertices, update distance and path counts.

[Figure: the example graph; the D, S, and P arrays after expanding the next frontier.]

Level-synchronous approach: The adjacencies of all vertices in the current frontier can be visited in parallel

Page 72: Parallel Graph Algorithms

Parallel BC Algorithm Illustration

1. Traversal step: at the end, we have all reachable vertices, their corresponding predecessor multisets, and D values.

[Figure: the example graph with the final D, S, and P arrays.]

Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.

Page 73: Parallel Graph Algorithms

• Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)

• Upper bound on the size of each predecessor multiset: in-degree
• Potential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path counts
• New algorithm: based on the observation that we don't need to store "predecessor" vertices. Instead, we store successor edges along shortest paths.
  – simplifies the accumulation step
  – reduces an atomic operation in the traversal step
  – cache-friendly!

Graph traversal step analysis

Page 74: Parallel Graph Algorithms

Graph Traversal Step locality analysis

for all vertices u at level d in parallel do
  for all adjacencies v of u in parallel do
    dv = D[v];
    if (dv < 0)
      vis = fetch_and_add(&Visited[v], 1);
      if (vis == 0)
        D[v] = d+1;
        pS[count++] = v;
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);
    if (dv == d+1)
      fetch_and_add(&sigma[v], sigma[u]);
      fetch_and_add(&Scount[u], 1);

All the vertices are in a contiguous block (stack)

All the adjacencies of a vertex are stored compactly (graph rep.)

Store to S[u]

Non-contiguous memory access

Non-contiguous memory access

Non-contiguous memory access

Better cache utilization likely if D[v], Visited[v], sigma[v] are stored contiguously

Page 75: Parallel Graph Algorithms

Parallel BC Algorithm Illustration

2. Accumulation step: Pop vertices from stack, update dependence scores.

[Figure: the example graph with the S and P arrays and the Delta (dependency) array.]

δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))

Page 76: Parallel Graph Algorithms

Parallel BC Algorithm Illustration

2. Accumulation step: Can also be done in a level-synchronous manner.

[Figure: the example graph with the S and P arrays and the Delta (dependency) array.]

δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))

Page 77: Parallel Graph Algorithms

Accumulation step locality analysis

for level d = GraphDiameter-2 to 1 do
  for all vertices v at level d in parallel do
    for all w in S[v] in parallel do reduction(delta)
      delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v] / sigma[w];
    BC[v] = delta[v] = delta_sum_v;

All the vertices are in a contiguous block (stack)

Each S[v] is a contiguous block

Only floating point operation in code

Page 78: Parallel Graph Algorithms

Centrality Analysis applied to Protein Interaction Networks

Human genome core protein interactions: degree vs. betweenness centrality.

[Figure: log-log scatter plot of betweenness centrality (1e-7 to 1e+0) vs. degree (1 to 100); a highlighted protein (Ensembl ID ENSG00000145332.2, Kelch-like protein 8) has 43 interactions.]

Page 79: Parallel Graph Algorithms

Designing parallel algorithms for large sparse graph analysis

[Figure: work performed (O(n), O(n log n), O(n²)) and locality vs. problem size (n = 10^4 to 10^12 (Peta+) vertices/edges), contrasting "RandomAccess"-like and "Stream"-like regimes; strategies: improve locality, data reduction/compression, faster methods.]

System requirements: high (on-chip memory, DRAM, network, IO) bandwidth.
Solution: efficiently utilize available memory bandwidth.
Algorithmic innovation to avoid corner cases.

Page 80: Parallel Graph Algorithms

• Applications: Internet and WWW, Scientific computing, Data analysis, Surveillance

• Overview of parallel graph algorithms
  – PRAM algorithms
  – Graph representation
• Parallel algorithm case studies
  – BFS: locality, level-synchronous approach, multicore tuning
  – Shortest paths: exploiting concurrency, parallel priority queue
  – Betweenness centrality: importance of locality, atomics

Review of lecture

Page 81: Parallel Graph Algorithms

• Questions?

Thank you!