legion: runtime system · operated by los alamos national security, llc for the u.s. department of...
TRANSCRIPT
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 1
Legion: Runtime System
Pat McCormickMichael Bauer, Sean Treichler, Elliott Slaughter,
Zhihao Jia, Alex Aiken, Sam Gutierrez, Galen Shipman
2015 ECI RTS WorkshopRockville, MD – March 11-13, 2015
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 2
Legion Infrastructure
Legion High-Level Runtime
Mapping Interface
Mapping InterfaceMapping Interface
Machine ModelMachine Model
Low-Level Runtime (Realm)
GASNet MPI OpenMP CUDA Pthread ?
Legion High-Level Runtime
Mapping Interface
Mapping Interface
Machine Model
Low-Level Runtime (Realm)
GASNet MPI OpenMP CUDA Pthread ?
http://legion.stanford.edu
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 3
Data Model – Logical Regions
Unbounded set of rows (index space)
– unstructured, 1D, 2D, 3D, …
Bounded set of columns (fields)
Can be partitioned (along index space)
Tasks operate on regions:
– Must specify which fields and privileges. Allows fields to be “sliced”
Tasks “launched” in program order (execution order relaxed based on dependencies) – out-of-order processor
Field 1 Field 2 Field 3 Field 4
Index 0
Index 1
Index 2
Index 3
Index 4
Index 5
Index 6
Index 7
Index 8
…
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 4
Logical Regions (continued)
Multiple index partitions may be defined– Index space can be partitioned “by kind” (not
just a regular partition)
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 5
Legion Tasks (continued)
Example dynamic execution task graph – task graph grows wider with
complexity of chemical species.
MPI
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 6
Mapping Interface Programmer selects:
– Where tasks run– Where regions (data) are placed– How regions are stored in memory
(AoS, SoA, etc.)– View dependent upon machine model
Mapping computed dynamically Decouple correctness from
performance
6
t1
t2
t3
t4t5
rc
rw
rw1 rw2
rn
rn1 rn2
$
$
$
$
NUMA
NUMA
FB
DRAM
x86
CUDA
x86
x86
x86
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 7
Realm: Deferred Execution
All Realm operations are event-based & asynchronous:– Return an event that triggers when operation is complete– Events may be passed to other tasks and stored in data
structures
All Realm operations are deferrable:– Accept an event (or set of events) as a precondition– Operation does not start until preconditions have triggered
Realm: An Event-Based Low-Level Runtime for Distributed Memory Architectures, Sean Treichler, Michael Bauer, Alex Aiken, PACT 2014.
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 8
Realm Event based, dynamic, explicit representation of inter-
task dependencies to bridge the latency gap– Events are control dependences only (carry no data) – order
operations, don’t imply data dependencies – Tens of thousands of unique events per second per node
run A copy a b run B
run C run D
e1 e2
e4
Event e1 = p1.spawn(A,…, NOEVENT); Event e2 = a.copy_to(b, e1);Event e3 = p2.spawn(B,…, e2);Event e4 = p1.spawn(C,…, NOEVENT); Event e5 = p1.spawn(D,…, e4);
Machine model– Processors (CPU, GPU)– Memory (distributed,
system, GPU memory, zero-copy memory)
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 9
Events
Created on demand, owned by the creating node Can wait, test on events – prefer use as preconditions Can also add custom user-triggered events Encodes owner’s ID (upper bits) Creation and triggering of an event can only happen once. Any number of operations may dependent on an event.
• When do we clean up? Programmer controlled?• Reference counting overhead too high (esp. on distributed systems)
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 10
Problem: Event Lifetimes
Creating a large number of events
Need to reclaim storage
Manual management
Reference counting:
– Adds overhead– Complicated when
event handles storedin heap & distributed
~1600 events/node/s~130 events/node/s
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 11
Generational Events
Observation: a fixed-sizedstructure can describe:– One untriggered event– A large number (e.g. 232-1)
of triggered events
Event creation can reuse anexisting generational event– As long as it’s in the triggered
state– Becomes the next generation
of that event
Generation number included in event handles and subscription/trigger messages
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 12
Benefits of Generational Events ~20% of storage of
reference counting approach
Without the overhead
No sensitivity to length of run
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 13
Some Headaches... Mechanisms for pinning memory that can be used by multiple
hardware devices need to be improved. Some of this is on the GPU/NIC/etc. drivers but anything the OS can do to help would be nice.
Error handling and fault detection in operating systems today are fundamentally broken for distributed systems:– The way signal handlers run today aren't really designed for
multi-threaded systems and so they employ a stop-the-world approach that halts all threads within a process.
All operations should come in non-blocking form (i.e. in the traditional "non-blocking" definition used by OSes - return E_AGAIN rather than blocking).
Dynamic resizing of the initial pinned memory segment would be nice.
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA 14
Thank you
Questions?