Task-Based Parallel Programming in Legion
Alex Aiken, Stanford University & SLAC
ECP Legion Tutorial, February 2018
Acknowledgments
The Legion project is joint work between Stanford, Los Alamos National Lab, NVIDIA, and SLAC.
Funding has come from many sources, but particularly the DOE and the leadership class facilities.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative.
Tutorial Materials
The slides, example program, and performance profiles are at:
http://theory.stanford.edu/~aiken/ecp
OVERVIEW
Legion & Regent
Legion is a C++ runtime and a programming model
Regent is a programming language
  • For the Legion programming model
  • Current implementation is embedded in Lua
  • Has an optimizing compiler
This tutorial focuses on Regent
Why Use Legion?
Easy access to GPUs
  • Simplifies programming complex hardware
Easy control over data
  • Partitioning, placement and layout in memory
Automated scheduling and latency hiding
  • Asynchronous tasking
  • Throughput oriented
Performance portability
Regent Stack
Regent: language and compiler
Legion: high-level runtime
Realm: low-level runtime
Lua: host language
Regent in Lua
Embedded in Lua
  • Popular scripting language in the graphics community
Excellent interoperation with C
  • And with other languages
Python-ish syntax
  • For both Lua and Regent
PAGERANK
PageRank
Today’s example
Input: A directed graph
Output: The rank of each node
  • A measure of a node’s importance
  • E.g., used for ranking web search results
Web pages are nodes
Hyperlinks are edges
PageRank Equation
$$\mathrm{rank}(n) = \frac{1 - \alpha}{N} + \alpha \sum_{p \,\in\, \mathrm{pred}(n)} \frac{\mathrm{rank}(p)}{|\mathrm{succs}(p)|}$$
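To make the equation concrete, here is a minimal sketch in plain Python (not Regent) that applies it repeatedly to a tiny graph. It assumes α is the damping factor, N is the number of nodes, and every node has at least one successor. The three-node graph and the name `pagerank_step` are hypothetical and are not part of the tutorial code.

```python
def pagerank_step(rank, preds, out_degree, alpha):
    """Apply the PageRank equation once to every node."""
    n_nodes = len(rank)
    new_rank = {}
    for n in rank:
        # Sum rank(p) / |succs(p)| over the predecessors p of n.
        incoming = sum(rank[p] / out_degree[p] for p in preds[n])
        new_rank[n] = (1 - alpha) / n_nodes + alpha * incoming
    return new_rank


# Hypothetical graph with edges a -> b, a -> c, b -> c, c -> a.
preds = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
out_degree = {"a": 2, "b": 1, "c": 1}

rank = {n: 1.0 / 3 for n in preds}
for _ in range(50):
    rank = pagerank_step(rank, preds, out_degree, alpha=0.85)
```

Because no node in this example lacks successors, the ranks keep summing to 1 from one iteration to the next.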
TASKS
The PageRank Task
task pagerank(nodes: region(…), edges: region(…),
              pr_old: region(…), pr_new: region(…),
              alpha: float)
{ }
Tasks are the unit of parallel execution.
Logical regions are (typed) collections
  • Logical: no implied layout, no implied location
The PageRank Task
task pagerank(nodes: region(…), edges: region(…),
              pr_old: region(…), pr_new: region(…),
              alpha: float)
where reads(nodes, edges, pr_old), writes(pr_new)
{ }
Privileges declare how a task will use its region arguments.
The PageRank Task
task pagerank(nodes: region(…), edges: region(…),
              pr_old: region(…), pr_new: region(…),
              alpha: float)
where reads(nodes, edges, pr_old), writes(pr_new)
{
  …
  for n in nodes do
    …
    score = 0
    for e in left, right do -- indices of predecessor edges of n
      …
      score = score + pr_old[edges[e].src]
    end
    …
    score = (1 - alpha) / num_nodes + alpha * score
    pr_new[n] = score / out_degree
  end
}
REGIONS
Regions
A region is a (typed) collection
Regions are the cross product of
  • An index space
  • A field space
EDGES
index  src  dst
0      a    x
1      b    x
2      c    x
3      a    y
The region’s index space: 0, 1, 2, 3
The region’s field space: src, dst
The edges region is constructed so that edges are grouped by dst node
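As a rough illustration of "index space × field space", the EDGES table can be modeled in plain Python. This is only a sketch: the one-array-per-field storage is chosen for convenience here, whereas a logical region implies no particular layout.

```python
# Index space: the row identifiers. Field space: the column names.
index_space = range(4)
field_space = ("src", "dst")

# One array per field, holding the values from the EDGES table.
edges = {
    "src": ["a", "b", "c", "a"],
    "dst": ["x", "x", "x", "y"],
}

# Every (index, field) pair names exactly one value of the region.
cells = {(i, f): edges[f][i] for i in index_space for f in field_space}
```

The `cells` dictionary has one entry per element of the cross product, eight in total for four indices and two fields.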
Discussion
Regions are the way to organize large data collections in Regent
Regions can be
  • Structured (e.g., like arrays)
  • Unstructured (e.g., pointer data structures)
Any number of fields
Built-in support for multidimensional index spaces
Nodes & Edges
[Figure: the EDGES and NODES regions]
Nodes have two fields: out_degree and index
A node’s index field points just beyond its last predecessor edge.
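The layout can be modeled in plain Python to show how the index field yields each node's predecessor edges, and how the inner loop of the pagerank task uses them. This is a sketch under assumptions: the graph is hypothetical, the edge range is treated as half-open, and pr_old/pr_new are assumed to hold each rank already divided by the node's out-degree. That last assumption is one reading that reconciles the task's final division by out_degree with the PageRank equation.

```python
# Hypothetical graph on nodes 0..2 with edges 2 -> 0, 0 -> 1, 0 -> 2, 1 -> 2,
# stored grouped by dst node.
edge_src = [2, 0, 0, 1]
edge_dst = [0, 1, 2, 2]
out_degree = [2, 1, 1]
# index[n] points just beyond the last predecessor edge of node n.
index = [1, 2, 4]


def predecessor_edges(n):
    """Half-open range [left, right) of the edges whose dst is n."""
    left = index[n - 1] if n > 0 else 0
    return range(left, index[n])


def pagerank_task(pr_old, alpha):
    """One iteration; pr_old[n] is assumed to hold rank(n) / out_degree[n]."""
    num_nodes = len(out_degree)
    pr_new = [0.0] * num_nodes
    for n in range(num_nodes):
        score = sum(pr_old[edge_src[e]] for e in predecessor_edges(n))
        score = (1 - alpha) / num_nodes + alpha * score
        pr_new[n] = score / out_degree[n]
    return pr_new


pr = [(1.0 / 3) / d for d in out_degree]
for _ in range(50):
    pr = pagerank_task(pr, alpha=0.85)
# Undo the pre-division to recover the ranks themselves.
ranks = [p * d for p, d in zip(pr, out_degree)]
```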
PAGERANK TASK
PARTITIONING
Partitioning
To enable parallelism on a region, partition it into smaller pieces
And then run a task on each piece
Partitioning is built in to Legion/Regent
  • A rich set of partitioning primitives
Use the primitives to build partitioning algorithms
Equal Partitions
One commonly used primitive splits a region into a number of (nearly) equal-size subregions
num_pieces = ispace(int1d, 4)
r = region(ispace(int1d, 12), int64)
p = partition(equal, r, num_pieces)
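What an equal partition of 12 elements into 4 pieces produces can be sketched in plain Python. The function `equal_partition` is a hypothetical stand-in, and the rule for spreading a remainder (earlier pieces get one extra element) is an assumption, not a statement about how Regent distributes it.

```python
def equal_partition(num_elements, num_pieces):
    """Split range(num_elements) into contiguous, nearly equal ranges."""
    base, extra = divmod(num_elements, num_pieces)
    pieces = []
    start = 0
    for i in range(num_pieces):
        # Assumption: the first `extra` pieces receive one more element.
        size = base + (1 if i < extra else 0)
        pieces.append(range(start, start + size))
        start += size
    return pieces


p = equal_partition(12, 4)
```

Here `p[0]` covers indices 0 to 2, `p[1]` covers 3 to 5, and so on, mirroring the subregions p[0] to p[3] of the Regent example.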
Region Trees
[Figure: region tree with root r and subregions 0, 1, 2, 3]
The Legion runtime knows and uses the region tree to manage mapping and automate parallelism and data movement.
Partitions
• Partitions are first class objects
• An array of the subregions formed by a partition
p = partition(equal, r, num_pieces)
p[0] p[1] p[2] p[3]
Discussion
• Partitioning and region creation are dynamic
  • Can be done at any time
  • Regions and partitions are first class values
• Region trees can be any depth
  • Subregions can be partitioned, too
• Regions can be partitioned in multiple ways
  • A program can define multiple views of its data
• Defining regions/partitions does not materialize them
  • Gives names to subsets of the data
  • Actual computations access physical instances of regions
Partitioning Strategy for PageRank
• Use edge partitioning
  • Approximately equal number of edges per subregion
  • Better than node partitioning if nodes can have very different out degrees
• But
  • Keep all predecessor edges of a given node in the same subregion
• So
  • Calculate the range of edges for each subregion
  • Partition the edges by range
  • Partition the nodes compatibly with the edge partition
First Step: Calculate Edge Ranges
task init_partition(
    edge_range : region(ispace(int1d), rect1d),
    edges : region(ispace(int1d), EdgeStruct),
    avg_num_edges : E_ID,
    num_parts : int)
where writes(node_range, edge_range), reads(nodes)
do
  …
  for p = 0, num_parts do
    right_bound = min(avg_num_edges * (p + 1), total_num_edges)
    var my_dst : V_ID = edges[right_bound].dst
    -- extend the right bound to the last edge of the current node
    while (right_bound < total_num_edges) do
      var next_dst : V_ID = edges[right_bound+1].dst
      if (my_dst < next_dst) then break end
      right_bound = right_bound + 1
    end
    edge_range[p] = {left_bound, right_bound}
  end
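The idea behind init_partition can be sketched in plain Python: aim for an average number of edges per part, then push each right bound forward until it reaches the last predecessor edge of the node it falls in. This is not a transcription of the Regent task. The function name `edge_ranges`, the inclusive (left, right) ranges, the clamping, and the rule that the last part takes the remaining edges are all assumptions made so the sketch is complete and safe at the array bounds.

```python
def edge_ranges(edge_dst, num_parts):
    """Return inclusive (left, right) edge ranges, one per part.

    edge_dst must list edges grouped by dst node. No range ends in the
    middle of one node's predecessor edges.
    """
    total = len(edge_dst)
    avg = max(total // num_parts, 1)
    ranges = []
    left = 0
    for p in range(num_parts):
        if p == num_parts - 1:
            right = total - 1  # the last part takes whatever remains
        else:
            right = min(avg * (p + 1), total) - 1
            right = max(right, left - 1)  # never move backwards
            # Extend to the last edge that shares the current dst node.
            while 0 <= right < total - 1 and edge_dst[right + 1] == edge_dst[right]:
                right += 1
        ranges.append((left, right))  # the range is empty when right < left
        left = right + 1
    return ranges


# Hypothetical dst field for ten edges, grouped by dst node.
dst = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
ranges = edge_ranges(dst, 4)
```

When one node owns most of the edges, some parts can come out empty, which is the price of keeping each node's predecessor edges together.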
DEPENDENT PARTITIONING
Partitioning, Revisited
Why do we want to partition data?
  • For parallelism
  • We will launch many tasks over many subregions
A problem
  • We often need to partition multiple data structures in a consistent way
  • E.g., given that we have partitioned the nodes a particular way, that will dictate the desired partitioning of the edges
Dependent Partitioning
Distinguish two kinds of partitions
Independent partitions
  • Computed from the parent region, using, e.g., partition(equal, … )
Dependent partitions
  • Computed using another partition
Dependent Partitioning Operations
Image
  • Use the image of a partition to define a new partition
  • E.g., the image of a field
  • E.g., or a range of values
Preimage
  • Use the pre-image of a field in a partition
  • …
Set operations
  • Form new partitions using the intersection, union, and set difference of other partitions
  • Not illustrated today
Image
Computes elements reachable via a field lookup
Can be applied to index space or another partition
Computation is distributed based on location of data
[Figure: image. Labels: index spaces IS1 and IS2, subspaces s1, s2, s3, source partition, pointer field, destination index space]
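The image operation can be modeled in plain Python as "follow the pointer field from every element of each source subspace". This is a sketch with hypothetical data. The function `image` is an illustration only and does not reflect how the runtime distributes the computation.

```python
def image(source_partition, pointer_field):
    """For each source subspace, collect the indices its elements point to."""
    return [
        sorted({pointer_field[i] for i in subspace})
        for subspace in source_partition
    ]


# Hypothetical data: six elements of IS1, each pointing into IS2 = {0, 1, 2, 3}.
pointer_field = [0, 1, 1, 2, 3, 3]
source_partition = [[0, 1], [2, 3], [4, 5]]  # subspaces s1, s2, s3 of IS1
dest_partition = image(source_partition, pointer_field)
```

In this example index 1 of IS2 is reached from both s1 and s2, so the resulting subspaces overlap even though the source partition is disjoint.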
Preimage
Inverse of image
  • Computes elements that reach a given subspace
  • Preserves disjointness
Multiple dependent partitioning operations can be combined
  • Capture complex task access patterns
[Figure: preimage. Labels: index spaces IS1 and IS2, subspaces s1, s2, s3, source partition, pointer field, destination index space]
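The preimage can be modeled in plain Python in the same spirit: for each subspace of the pointed-to index space, gather the elements whose pointer lands in it. The data and the function `preimage` are hypothetical.

```python
def preimage(pointer_field, target_partition):
    """For each target subspace, collect the indices whose pointer lands in it."""
    return [
        [i for i, target in enumerate(pointer_field) if target in subspace]
        for subspace in target_partition
    ]


# Element i of IS1 points to pointer_field[i] in IS2 = {0, 1, 2, 3}.
pointer_field = [0, 1, 1, 2, 3, 3]
target_partition = [{0, 1}, {2}, {3}]  # a disjoint partition of IS2
source_partition = preimage(pointer_field, target_partition)
```

Each element has exactly one pointer value, so it can land in at most one subspace of a disjoint target partition. That is why the preimage preserves disjointness.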
Dependent Partitioning in PageRank
The use of dependent partitioning in PageRank is simple
Define a partition of the edges
  • Using the computed edge ranges
Then define a partition of the nodes using the destination node of each edge
Picture
[Figure: the EDGES and NODES regions; arrows show the dst field of each edge. EDGE_RANGE holds the ranges 0:1, 2:4, 5:7, 8:9]
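The two steps (partition the edges by range, then partition the nodes through the dst field) can be sketched in plain Python using the EDGE_RANGE values 0:1, 2:4, 5:7, 8:9. The dst values below are hypothetical, since the picture's actual values are not given in the text.

```python
# Inclusive edge ranges, as listed in EDGE_RANGE.
edge_range = [(0, 1), (2, 4), (5, 7), (8, 9)]
# Hypothetical dst field for the ten edges, grouped by dst node.
edge_dst = [0, 1, 2, 2, 2, 3, 4, 4, 5, 5]

# Step 1: partition the edges by range.
part_edges = [list(range(lo, hi + 1)) for lo, hi in edge_range]

# Step 2: partition the nodes as the image of each edge piece through dst.
part_nodes = [sorted({edge_dst[e] for e in piece}) for piece in part_edges]
```

Because no range cuts through one node's predecessor edges, each node lands in exactly one piece here. A node with no incoming edges would not appear in any piece under this sketch.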
PAGERANK MAIN
PARALLELISM
Program Semantics
A Legion program is a sequence of task launches
The runtime analyzes the tasks for interference
  • Tasks with conflicting accesses to the same data
  • Non-interfering tasks can execute in parallel
  • Interfering tasks are serialized
Guarantees sequential semantics
  • Program result is as if it had executed sequentially
  • Very useful for debugging at scale
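The interference rule can be illustrated with a small plain-Python check: two tasks conflict when they touch overlapping data and at least one of them writes it. This is a simplified model with hypothetical tasks. It compares explicit sets of element ids and is not a description of the runtime's actual analysis.

```python
def interferes(task_a, task_b):
    """True if the tasks share data and at least one of them writes it."""
    for region_a, priv_a in task_a:
        for region_b, priv_b in task_b:
            overlap = region_a & region_b
            if overlap and "writes" in (priv_a, priv_b):
                return True
    return False


# Hypothetical tasks, each a list of (set of element ids, privilege) pairs.
t1 = [(frozenset({0, 1, 2, 3}), "reads"), (frozenset({10, 11}), "writes")]
t2 = [(frozenset({0, 1, 2, 3}), "reads"), (frozenset({12, 13}), "writes")]
t3 = [(frozenset({10, 11}), "reads")]

can_run_in_parallel = not interferes(t1, t2)  # shared data is only read
must_serialize = interferes(t1, t3)           # t3 reads what t1 writes
```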
Task Graphs
When Legion discovers interfering tasks, an edge is added to the task graph recording the dependency
Three wavefronts:
  • The runtime building the task graph
  • The application executing the graph
  • The runtime collecting resources from finished tasks
Parallel Loop from PageRank
for p in part do
  pagerank(part_nodes[p], part_edges[p], pr_score0, part_score1[p], … )
end
The different calls to pagerank don’t interfere. Why? Only part_score1[] is written and it is a disjoint partition.
Note the use of different views onto the data. We use both the entire pr_score0[] region and subregions of part_score1[].
MAPPING
Mapping Interface
Application selects:
  • Where tasks run
  • Where regions are placed
Mapping computed dynamically
Decouple correctness from performance
[Figure: tasks t1 to t5 and regions rc, rw (rw1, rw2), rn (rn1, rn2) alongside a machine diagram with x86 and CUDA processors, caches ($), NUMA memories, DRAM, and a GPU framebuffer (FB)]
Mapping
Mapping is the process of assigning resources to Regent/Legion programs
Conceptually
  • Assign a processor to each task
    • The task will execute in its entirety on that processor
  • Assign a memory to each region argument
  • And many other things!
There is a default mapper with reasonable heuristics
  • Just another mapper, but a generic one
Mapping Interface
At the Legion level, mapping is an API
  • A set of callbacks
  • Each is called at a particular point in a task’s lifetime
  • To write mappers, need to know this sequence of stages
Regent has a mapping DSL
  • Concise, easy to use
  • Compiles to the Legion mapping API
  • Currently supports only static mappings
High-Level Overview of Mapping
An instance of the Legion runtime runs on each node
When a task is launched on the local runtime:
  • The mapper picks a processor for the task
  • The mapper picks memories for the region arguments
  • … and other things as well …
New Concepts
There are a number of concepts at the mapping level that don’t exist in Regent
  • Machine models
  • Variants
  • Physical Instances
Machine Model
To pick concrete processors & memories, the runtime must know:
How many processors/memories there are
  • And of what kinds
And where the processors/memories are
  • At least relative to each other
A machine model is written once for each machine
Components of a Machine Model
Processors: LOC, TOC, PROC_SET, UTILITY, IO
Memories: GLOBAL, SYSTEM, RDMA, FRAME_BUFFER, ZERO_COPY, DISK, HDF5
Affinities
Processor -> Memory
  • Which memories are attached to a processor
Memory -> Memory
  • Which memories have channels between them
Memory -> Processor
  • All processors attached to a memory
Affinities are provided as a list of (proc,mem) and (mem,mem) pairs
  • Also include bandwidth and latency information
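The affinity queries can be modeled in plain Python as lookups over a list of pairs. This sketch does not use the Legion machine-model API. The processor and memory names, the bandwidth and latency numbers, and the helper functions are all hypothetical.

```python
from collections import namedtuple

ProcMemAffinity = namedtuple("ProcMemAffinity", "proc mem bandwidth latency")

# Hypothetical machine with one CPU core and one GPU; the numbers are made up.
proc_mem = [
    ProcMemAffinity("cpu0", "sysmem", bandwidth=100, latency=5),
    ProcMemAffinity("cpu0", "zero_copy", bandwidth=40, latency=10),
    ProcMemAffinity("gpu0", "framebuffer", bandwidth=500, latency=2),
    ProcMemAffinity("gpu0", "zero_copy", bandwidth=20, latency=50),
]


def memories_of(proc):
    """Processor -> Memory: the memories attached to a processor."""
    return [a.mem for a in proc_mem if a.proc == proc]


def processors_of(mem):
    """Memory -> Processor: all processors attached to a memory."""
    return [a.proc for a in proc_mem if a.mem == mem]


def best_memory(proc):
    """One possible heuristic: the attached memory with the highest bandwidth."""
    candidates = [a for a in proc_mem if a.proc == proc]
    return max(candidates, key=lambda a: a.bandwidth).mem
```

A mapper could use queries of this kind to decide where to place a task's region arguments once the processor has been chosen.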
Task Variants
A task can have multiple variants
  • Different implementations of the same task
  • Multiple variants can be registered with the runtime
  • Variants can have associated constraints
Examples
  • A variant for LOC
  • Another variant for TOC
  • Variants for different data layouts
Variants in Regent
Place __demand(__cuda) immediately before a task declaration
Causes both CPU and GPU task variants to be produced
And the default mapper always prefers to pick a GPU variant if possible
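The "prefer a GPU variant if possible" rule can be modeled in plain Python. This is a sketch only: the variant table and `pick_variant` are hypothetical and do not reproduce the default mapper's actual logic.

```python
# Hypothetical registry: which processor kinds each task has a variant for.
variants = {
    "pagerank": {"LOC", "TOC"},   # as if compiled with both CPU and GPU variants
    "init_partition": {"LOC"},    # CPU only
}


def pick_variant(task, available_kinds):
    """Prefer the TOC (GPU) variant when the task has one and a GPU exists."""
    usable = variants[task] & set(available_kinds)
    if not usable:
        raise LookupError("no variant of %s fits this machine" % task)
    return "TOC" if "TOC" in usable else sorted(usable)[0]
```

On a machine with both processor kinds, `pick_variant("pagerank", ["LOC", "TOC"])` selects the GPU variant, while a CPU-only machine falls back to LOC.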
Physical Instances
A region is a logical name for data
A physical instance is a copy of that data
  • For some set of fields
There can be 0, 1 or many physical instances of a specific field of a region at any time
Physical Instances
Can be valid or invalid: is the data current or not?
Live in a specific memory
Have a specific layout: column major, row major, blocked, struct-of-arrays, array-of-structs, …
Are allocated explicitly by the mapper
Are deallocated by the runtime (garbage collected)
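To make the valid/invalid distinction concrete, here is a deliberately simplified model in plain Python. It assumes that a write to one instance invalidates every other instance of the field, and that reading an invalid instance first copies from a valid one. The `Field` class and memory names are hypothetical, and the real runtime's bookkeeping is considerably more involved.

```python
class Field:
    """Tracks which physical instances of one field hold current data."""

    def __init__(self):
        self.instances = {}  # memory name -> True if the instance is valid

    def allocate(self, memory):
        # A fresh instance holds no current data yet.
        self.instances.setdefault(memory, False)

    def write(self, memory):
        """Writing makes this instance the only valid one."""
        if memory not in self.instances:
            raise KeyError(f"no instance allocated in {memory}")
        for m in self.instances:
            self.instances[m] = False
        self.instances[memory] = True

    def read(self, memory):
        """Return the memory a copy must come from, or None if the
        instance is already valid."""
        if self.instances[memory]:
            return None
        source = next((m for m, ok in self.instances.items() if ok), None)
        if source is None:
            raise RuntimeError("no valid instance to copy from")
        self.instances[memory] = True  # valid once the copy completes
        return source


rank = Field()
rank.allocate("sysmem")
rank.allocate("framebuffer")
rank.write("sysmem")                      # only the sysmem instance is valid
copy_source = rank.read("framebuffer")    # needs a copy from sysmem
second_read = rank.read("framebuffer")    # already valid, no copy needed
```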
A Word About Physical Instances
Many physical instances of a region can exist simultaneously, including different versions of the same data:
A task writing version 0 to disk
A task reading version 5
A task writing version 6 (the current version!)
A task scheduled to read version 6
A task scheduled to write version 7
A (meta)task scheduled to deallocate version 6
…
Layout Constraints
Tasks can have layout constraints on physical instances
“This task requires data in row major order”
Constraints are just that: they don’t specify an exact layout, and multiple instances may satisfy the constraints
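The point that a constraint is a partial specification, so several instances can satisfy it, can be sketched in a few lines of plain Python. The dictionary keys, instance names, and layout labels are hypothetical and are not Legion's constraint types.

```python
def satisfies(instance_layout, constraints):
    """An instance satisfies a constraint set if every constrained
    attribute matches; unconstrained attributes may be anything."""
    return all(instance_layout.get(key) == value
               for key, value in constraints.items())


# Three existing instances with fully specified (made-up) layouts.
instances = [
    {"name": "inst_a", "order": "row_major", "fields": "SOA", "memory": "sysmem"},
    {"name": "inst_b", "order": "row_major", "fields": "AOS", "memory": "framebuffer"},
    {"name": "inst_c", "order": "column_major", "fields": "SOA", "memory": "sysmem"},
]

# "This task requires data in row major order": nothing else is pinned down.
task_constraints = {"order": "row_major"}

matching = [i["name"] for i in instances if satisfies(i, task_constraints)]
```

Here both `inst_a` and `inst_b` qualify, which leaves the choice between them to the mapper.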
Summary
Mapping: selects processors for tasks and selects memories for physical instances, satisfying the region requirements of tasks
Many options: the default mapper does reasonable things, but any sufficiently complex program will need some customization
PAGERANK MAPPER
PROFILING
Legion Prof
A tool for showing a performance timeline
Each processor is a timeline
Each operation is a time interval
Different kinds of operations have different colors
White space = idle time; the goal is to understand why there is white space
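The "white space" on a timeline is just the part of the run not covered by any operation interval. A minimal sketch of that calculation, in plain Python with made-up interval data (it does not read Legion Prof's log format):

```python
def idle_time(intervals, start, end):
    """Total time in [start, end] not covered by any (begin, finish)
    interval. Overlapping intervals are counted once."""
    busy = 0
    cursor = start  # everything before `cursor` is already accounted for
    for begin, finish in sorted(intervals):
        begin = max(begin, cursor)
        finish = min(finish, end)
        if finish > begin:
            busy += finish - begin
            cursor = finish
    return (end - start) - busy


# Hypothetical per-processor timelines, in arbitrary time units.
timelines = {
    "cpu0": [(0, 4), (6, 10)],
    "gpu0": [(1, 3)],
}

idle = {proc: idle_time(ops, 0, 10) for proc, ops in timelines.items()}
```

In this toy data the GPU is idle for most of the run, which is the kind of gap a profile makes visible and a mapping change might close.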
Example Profiles from PageRank
1 node, 8 CPUs: pagerank/run_pr.sh --program baseline --cpus 8 --nodes 1 --gpus 0
1 node, 1 GPU: pagerank/run_pr.sh --program baseline --cpus 4 --nodes 1 --gpus 1
1 node, 2 GPUs: pagerank/run_pr.sh --program baseline --cpus 4 --nodes 1 --gpus 2
1 node, 4 GPUs: pagerank/run_pr.sh --program baseline --cpus 4 --nodes 1 --gpus 4
Performance Results
[Figure: PageRank Performance with Different Mapping Configurations. Y-axis: runtime per iteration (seconds), 0 to 0.4. X-axis: 1 node, 2 nodes, 4 nodes. Series: 1, 2, and 4 GPUs/node with GPU memory; 1, 2, and 4 GPUs/node with zero-copy memory.]
OTHER APPLICATIONS
S3D: Combustion Simulation
Simulates chemical reactions: DME (30 species), Heptane (52 species), PRF (116 species)
Two parts:
Physics: nearest neighbor communication, data parallel
Chemistry: local, complex task parallelism
Large working sets/task
Planning the science simulation
• Recent 3D simulation on Jaguar was used to extrapolate and plan a target Titan simulation
• Planned simulation will have more grid points and/or larger chemistry
• Will need a month on 12,000 hybrid nodes of Titan
Figure 5: Computational domain and grid to be used for simulations of the CRF HCCI engine.
Figure 6: Reaction and diffusion structures for OH radical during the third stage thermal explosion of a high-pressure DME fueled autoignition process.
Recent 3D DNS of auto-ignition with 30-species DME chemistry (Bansal et al. 2011)
Mapping for Heptane 48³
4 AMD Interlagos integer cores for Legion runtime
8 AMD Interlagos FP cores for application
NVIDIA Kepler K20
Dynamic analysis for (rhsf+2)
Clean-up/meta tasks
Mapping for Heptane 96³
Handle larger problem sizes per node: higher computation-to-communication ratios, more power efficient
Different mapping: limited by size of GPU framebuffer
Legion analysis is independent of problem size: larger tasks -> fewer runtime cores
Weak Scaling: PRF on Titan
[Figure: weak scaling plot, annotated with 3X and 7X]
Fast Graph Analytics
Conventional wisdom: Graph processing has trouble taking advantage of distributed memory
High performance graph processing systems are dominated by shared-memory CPU-based systems
Observation: GPUs provide higher memory bandwidth than CPUs, and communication can be avoided by careful placement of data in the memory hierarchy
Fast Graph Processing [VLDB’18]
Performance comparison on a single GPU (lower is better).
Performance comparison among different graph processing frameworks (lower is better).
Competitive with state-of-the-art single-GPU graph processing engines.
Orders of magnitude speedup compared to state-of-the-art distributed/shared memory CPU systems.
Convolutional Neural Networks [ICML18]
In CNNs, data is commonly organized as 4D tensors: tensor = [image, height, width, channel]
Existing tools parallelize the image dimension.
Motivation: explore other parallelizable dimensions, and allow each layer to be parallelized differently
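To see what parallelizing a dimension other than the image dimension means, the sketch below splits a 4D tensor shape along a chosen dimension. It is plain Python over shapes only; the `partition` helper, the batch shape, and the number of parts are illustrative assumptions, not taken from the cited work.

```python
def partition(shape, dim, parts):
    """Split `shape` into `parts` sub-shapes along dimension index `dim`,
    spreading any remainder over the first pieces."""
    base, extra = divmod(shape[dim], parts)
    pieces = []
    for p in range(parts):
        piece = list(shape)
        piece[dim] = base + (1 if p < extra else 0)
        pieces.append(tuple(piece))
    return pieces


# tensor = [image, height, width, channel]
DIMS = {"image": 0, "height": 1, "width": 2, "channel": 3}
batch = (64, 224, 224, 3)  # hypothetical minibatch shape

# Conventional data parallelism: each of 4 workers gets a slice of the images.
by_image = partition(batch, DIMS["image"], 4)

# An alternative: every worker sees all images but only a band of rows.
by_height = partition(batch, DIMS["height"], 4)
```

Because the dimension is a parameter, each layer could in principle call `partition` with a different `dim`, which is the flexibility the motivation asks for.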
Results
Figure 1: Training throughput (images/second) on 16 GPUs.
Figure 2: Data transfers in each step on 16 GPUs with a minibatch size of 512.
Separating Concerns
Current practice entangles functionality, scheduling, and mapping
Consider a code written in MPI + OpenMP + CUDA
Alternative: specify functionality and dependencies first, then focus on mapping and scheduling for a machine
A lot of the benefits of Legion flow from this design
Programmer Productivity
In the end, it’s all about productivity
How much work is needed to achieve a desired level of performance?
Legion philosophy: a more expressive data model requires more initial work from the programmer, but makes later stages easier and more flexible
Easy to try different partitioning strategies
Easy to explore alternative mappings
Legion
Legion website: http://legion.stanford.edu