TRANSCRIPT
A Portable Runtime Interface For Multi-Level Memory Hierarchies
Mike Houston, Ji-Young Park, Manman Ren, Timothy Knight, Kayvon Fatahalian,
Alex Aiken, William Dally, Pat Hanrahan
Stanford University
The Problem
Lots of different architectures
– Shared memory
– Distributed memory
– Exposed communication
– Disk systems
Each architecture has its own programming system
Composed machines are difficult to program and manage
– Different mechanisms for each architecture
Previous runtimes and languages are limited
– Designed for a single architecture
– Struggle with memory hierarchies
Shared Memory Machines
[Images: AMD Barcelona, SGI Altix]
Distributed Memory Machines
MareNostrum: 2,282 nodes
ASC Blue Gene/L: 65,536 nodes
Exposed Communication Architectures
STI Cell processor: 90 nm | ~220 mm² | ~100 W
~200 GFLOPS
Complex machines
Cluster of SMPs?
– MPI?
Cluster of Cell processors?
– MPI + ALF/Cell SDK
Cluster of clusters of SMPs with Cell accelerators?
What about disk systems?
– Disk I/O is often second class
Previous Work
APIs
– MPI/MPI-2
– Pthreads
– OpenMP (Dagum et al. 1998)
– GASNet (Bonachea et al. 2002)
– Charm (Kale et al. 1993)
– SVM (Labonte et al. 2004)
– Cell SDK (IBM 2006)
– CellSs (Bellens et al. 2006)
– HTA (Bikshandi et al. 2006)
– …
Languages
– Co-Array Fortran (Numrich et al. 1994)
– Titanium (Yelick et al. 1998)
– UPC (Carlson et al. 1999)
– ZPL (Deitz et al. 2004)
– Chapel (Callahan et al. 2004)
– X10 (Charles et al. 2005)
– Brook (Buck et al. 2004)
– Cilk (Blumofe et al. 1995)
– …
Contributions
Uniform scheme for explicitly describing memory hierarchies
– Captures common traits important for performance
– Allows composition of memory hierarchies
Simple, portable API for many parallel machines
– Mechanism independence for communication and management of parallel resources
– Few entry points
– Efficient execution
Efficient compiler target
– Compiler can concentrate on global optimizations
– Runtime manages mechanisms
The Sequoia System
Programming language for memory hierarchies
– Fatahalian et al. (Supercomputing 2006)
– Adapts the Parallel Memory Hierarchies programming abstraction
Compiler for exposed-communication architectures
– Knight et al. (PPoPP 2007)
– Takes in a Sequoia program, machine file, and mapping file
– Generates task calls and data transfers
– Performs large-granularity optimizations
Portable runtime system
– Houston et al. (PPoPP 2008)
– Target for the Sequoia compiler
– Serves as an abstract machine
http://sequoia.stanford.edu
Abstract Machine Model
Parallel Memory Hierarchy Model (PMH)
Abstract machines as trees of memories, where each memory is an address space (Alpern et al. 1995)
[Diagram: a tree of memories, with a CPU attached to each memory node]
Parallel Memory Hierarchy Model
[Diagram: a three-level tree of memories with CPUs at the leaves]
Example Mappings
[Diagrams: example mappings onto the PMH model —
Cell processor: main memory above eight SPE local stores (LS);
Cluster: aggregate node memory (virtual level) above per-node memories with CPUs;
Disk system: disk above node memory, with the host CPU;
Shared-memory multiprocessor: main memory above per-CPU L2 caches]
Multi-Level Configurations
[Diagrams: multi-level configurations —
Disk + Playstation 3: disk above main memory above SPE local stores;
Cluster of Playstation 3s: aggregate cluster memory (virtual level) above per-PS3 main memories, each above SPE local stores;
Cluster of SMPs: aggregate cluster memory above node memories above per-CPU L2 caches]
Abstraction Rules
Tree of nodes: one control thread per node, one memory per node
Threads can:
– Transfer in bulk from/to parent memory, asynchronously
– Wait for transfers from/to parent to complete
– Allocate data
– Only access their own memory directly
– Transfer control to child node(s)
– Synchronize with siblings
Non-leaf threads only operate to move data and control
Similar to Space-Limited Procedures (Alpern et al. 1995)
Portable Runtime Interface
Requirements
Resource allocation
– Data allocation and naming
– Setup of parallel resources
Explicit bulk asynchronous communication
– Transfer lists
– Transfer commands
Parallel execution
– Launch tasks on children
– Asynchronous
Synchronization
– Make sure tasks/transfers complete before continuing
Runtime isolation
– No direct knowledge of other runtimes
Graphical Runtime Representation
[Diagram: a runtime sits between two adjacent levels, connecting the memory and CPU at level i+1 (the parent) to the memories and CPUs of children 1..N at level i]
Top Interface
// create and free runtime
Runtime(TaskTable table, int numChildren);
virtual ~Runtime();

// allocate and deallocate arrays
virtual Array_t* AllocArray(Size_t elmtSize, int dimensions, Size_t* dim_sizes,
                            ArrayDesc_t descriptor, int alignment) = 0;
virtual void FreeArray(Array_t* array) = 0;

// array naming
virtual void AddArray(Array_t array);
virtual Array_t GetArray(ArrayDesc_t descriptor);
virtual void RemoveArray(ArrayDesc_t descriptor);

// launch and synchronize on tasks
virtual TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) = 0;
virtual void WaitTask(TaskHandle_t handle) = 0;
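To make the calling pattern concrete, here is a hedged sketch of how a parent task might drive the top interface; DESC_A, TASK_LEAF, numChildren, and the dimension values are hypothetical stand-ins, not names from the Sequoia sources.

// Sketch: allocate and name an array at this level, fan a task out to all
// children, and wait for completion. DESC_A and TASK_LEAF are hypothetical ids.
Size_t dims[1] = { 1024 };
Array_t* A = runtime->AllocArray(sizeof(float), 1, dims, DESC_A, 128);
runtime->AddArray(*A);                  // make the array visible to children by name
TaskHandle_t h = runtime->CallChildTask(TASK_LEAF, 0, numChildren - 1);
runtime->WaitTask(h);                   // block until every child has finished
runtime->RemoveArray(DESC_A);
runtime->FreeArray(A);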
Bottom Interface
// array naming
virtual Array_t* GetArray(ArrayDesc_t descriptor);

// create, free, invoke, and synchronize on transfer lists
virtual XferList* CreateXferList(Array_t* dst, Array_t* src,
                                 Size_t* dst_idx, Size_t* src_idx,
                                 Size_t* lengths, int count) = 0;
virtual void FreeXferList(XferList* list) = 0;
virtual XferHandle_t Xfer(XferList* list) = 0;
virtual void WaitXfer(XferHandle_t handle) = 0;

// get number of children in bottom level, get local processor id, and barrier
int GetSiblingCount();
int GetID();
virtual void Barrier(ChildID_t start, ChildID_t stop) = 0;
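A minimal sketch of a leaf task pulling its strip of a parent array through the bottom interface; localA, DESC_A, and totalElems are assumptions (the local destination array, the descriptor the parent registered, and the array length).

// Sketch: each sibling fetches a contiguous strip of the parent array.
Array_t* parentA  = runtime->GetArray(DESC_A);        // look up the parent's array by name
Size_t strip      = totalElems / runtime->GetSiblingCount();
Size_t src_idx[1] = { strip * runtime->GetID() };     // this sibling's offset in the parent
Size_t dst_idx[1] = { 0 };
Size_t lengths[1] = { strip };
XferList* list = runtime->CreateXferList(localA, parentA, dst_idx, src_idx, lengths, 1);
XferHandle_t h = runtime->Xfer(list);                 // asynchronous bulk transfer
runtime->WaitXfer(h);                                 // wait for the data to land
runtime->FreeXferList(list);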
Compiler/Runtime Interaction
The compiler initializes a runtime for each parent-child pair of memories in the hierarchy
Initialize the runtime for the root memory
– The machine description specifies which runtime to initialize (SMP, cluster, disk, Cell, etc.)
If there are more levels in the hierarchy
– Initialize runtimes for the child levels
Runtime cleanup is the inverse
– Call exit on children, wait, clean up local resources, return to parent
Control of the hierarchy via task calls
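A minimal sketch of that recursive setup, assuming a hypothetical MachineDesc tree and CreateRuntimeFor factory; the real compiler-generated initialization differs.

// Sketch: walk the machine description top-down, creating one runtime per
// parent-child memory pair. MachineDesc and CreateRuntimeFor are hypothetical.
Runtime* InitHierarchy(const MachineDesc& level, TaskTable table) {
    // Pick the SMP/cluster/disk/Cell runtime named by the machine description.
    Runtime* rt = CreateRuntimeFor(level.kind, table, level.numChildren());
    for (const MachineDesc& child : level.children())
        InitHierarchy(child, table);   // deeper levels attach below this runtime
    return rt;
}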
Runtime Implementations
[Diagrams: the four single-level runtimes —
SMP runtime: main memory above CPUs;
Disk runtime: disk above node memory, managed by the host CPU;
Cell runtime: main memory above SPE local stores, managed by the PowerPC;
Cluster runtime: aggregate cluster memory (virtual level) above per-node memories]
SMP Implementation
Pthreads-based
– Launch a thread per child
– CallChildTask enqueues work on a per-child task queue (see the sketch below)
Data transfers
– Memory copy from source to destination
– Optimizations
• Pass a reference to the parent array
• Feedback to the compiler to remove transfers
– Driven by machine file information
No processor at the parent level
– Processor 0 represents both the parent node and a child node
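A hedged sketch of the per-child queue that CallChildTask could enqueue into; the types and worker loop are illustrative, not the Sequoia implementation.

#include <pthread.h>
#include <queue>

// Sketch: one queue per child thread; the parent enqueues task ids and the
// child's worker loop pops and runs them.
struct TaskQueue {
    pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    std::queue<TaskID_t> tasks;
};

void Enqueue(TaskQueue* q, TaskID_t id) {        // parent side (CallChildTask)
    pthread_mutex_lock(&q->lock);
    q->tasks.push(id);
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
}

TaskID_t Dequeue(TaskQueue* q) {                 // child worker side
    pthread_mutex_lock(&q->lock);
    while (q->tasks.empty())
        pthread_cond_wait(&q->ready, &q->lock);  // sleep until work arrives
    TaskID_t id = q->tasks.front();
    q->tasks.pop();
    pthread_mutex_unlock(&q->lock);
    return id;
}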
Disk Implementation
No processor at the top level
– The host CPU represents both the parent node and a child node
Implementation
– Allocation
• Open a file on disk
– Data transfers
• Use an async I/O API to read/write data to disk
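A minimal sketch of one disk-to-memory transfer using POSIX AIO; the slide does not name the API, so treat aio_read here as one plausible choice, with the file name, buffer, and sizes as placeholder assumptions.

#include <aio.h>
#include <fcntl.h>
#include <cerrno>
#include <cstring>

// Sketch: asynchronously read one strip of an array from its backing file.
struct aiocb cb;
std::memset(&cb, 0, sizeof cb);
cb.aio_fildes = open("array_backing.dat", O_RDONLY);  // file opened at AllocArray time
cb.aio_buf    = dst_buffer;                           // destination in node memory
cb.aio_nbytes = strip_bytes;
cb.aio_offset = strip_offset;                         // where this strip lives on disk
aio_read(&cb);                                        // returns immediately (this is Xfer)
while (aio_error(&cb) == EINPROGRESS) { /* overlap other work */ }  // WaitXfer
ssize_t done = aio_return(&cb);                       // bytes actually read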
Cell Implementation
Overlay handling
– At runtime creation, load the overlay loader onto the SPEs
– On a task call:
• The PowerPC notifies the SPE of the function to load
• The SPE loads the overlay and executes it
Data alignment
– All data allocated on 128-byte boundaries
– Multi-dimensional arrays padded to maintain alignment across dimensions
Heavy use of mailboxes
– SPE synchronization
– PPE<->SPE communication
Data transfers
– DMA lists
– DMA commands
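A hedged sketch of the PPE-side notification using libspe2 mailboxes; the encoding of the task id as the mailbox message and the per-SPE context variable are assumptions.

#include <libspe2.h>

// Sketch: the PPE writes a task id into an SPE's inbound mailbox; the
// overlay loader on the SPE reads it, loads the matching overlay, and runs it.
void NotifySPE(spe_context_ptr_t spe, unsigned int taskid) {
    spe_in_mbox_write(spe, &taskid, 1, SPE_MBOX_ALL_BLOCKING);
}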
Cluster Implementation
No processor at the top level
– Node 0 represents both the parent node and a child node
Virtual level
– Distributed Shared Memory (DSM)
• Many software and hardware solutions
• IVY (Li et al. 1988), Mether (Minnich et al. 1991), …
• Alewife (Agarwal et al. 1991), FLASH (Kuskin et al. 1994), …
• Complex coherence and consistency mechanisms
– Overkill for our needs
• No sharing between sibling processors = no coherence requirements
• Sequoia enforces consistency by disallowing aliasing on writes to parent memory
Cluster Implementation, cont.
Virtual level implementation
– Interval trees
• Represent an array as covering a range of 0..N elements
• Each node can hold any sub-portion (interval) of the range
• Tree structure allows fast lookup of all nodes that cover the interval of interest
• Allows complex data distributions
• See CLRS Section 14.3 for more detail
– Array allocation
• Define the distribution as an interval tree and broadcast it
• Allocate data for the intervals and register the data pointers with MPI-2 (MPI_Win_create)
• Align data to page boundaries for fast network transfers
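A minimal sketch of the interval query; a production version would use the augmented red-black tree from CLRS 14.3, but a linear scan over the broadcast distribution shows what the lookup computes.

#include <vector>

// Sketch: find every node whose interval overlaps the requested element range.
struct Interval { size_t lo, hi; int node; };     // [lo, hi) owned by 'node'

std::vector<int> NodesCovering(const std::vector<Interval>& dist,
                               size_t lo, size_t hi) {
    std::vector<int> hits;
    for (const Interval& iv : dist)
        if (iv.lo < hi && lo < iv.hi)             // half-open ranges overlap
            hits.push_back(iv.node);
    return hits;
}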
Cluster Implementation, cont.
Virtual level implementation
– Data transfer
• Compare the requested data range against the interval tree
• Read from parent: issue MPI_Get to any node with a matching interval (MPI_LOCK_SHARED)
• Write to parent: issue MPI_Put to all nodes with matching intervals (MPI_LOCK_EXCLUSIVE)
– Optimizations
• If the requested range is local, return a reference to parent memory
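A minimal sketch of the read path with MPI-2 one-sided operations; displacements and byte counts are simplified, and the window is assumed to have been created with MPI_Win_create at allocation time.

#include <mpi.h>

// Sketch: pull one matching interval from a remote node's exposed memory.
void ReadInterval(void* dst, int target_rank, MPI_Aint target_disp,
                  int nbytes, MPI_Win win) {
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);  // shared: concurrent reads OK
    MPI_Get(dst, nbytes, MPI_BYTE,
            target_rank, target_disp, nbytes, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);                    // transfer completes here
}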
Runtime Composition
[Diagrams: composed runtimes —
Disk + Playstation 3: a disk runtime stacked on a Cell runtime;
Cluster of Playstation 3s: a cluster runtime stacked on Cell runtimes;
Cluster of SMPs: a cluster runtime stacked on SMP runtimes]
Evaluation
Evaluation Metrics
Can we run unmodified Sequoia code on all of these systems?
How much overhead does our abstraction add?
How well can we utilize machine resources?
– Computation
– Bandwidth
Is our application performance competitive with the best available implementations?
Can we effectively compose runtimes for more complex systems?
Sequoia Benchmarks
Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM benchmarks
Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)
FFT3D: complex single-precision FFT
Gravity: 100 time steps of an N-body (N²) stellar dynamics simulation, single precision
HMMER: fuzzy protein string matching using HMM evaluation (Horn et al. SC2005)
Best available implementations used as leaf tasks
Single Runtime System Configurations
Scalar
– 2.4 GHz Intel Pentium 4 Xeon, 1 GB
8-way SMP
– Four dual-core 2.66 GHz Intel P4 Xeons, 8 GB
Disk
– 2.4 GHz Intel P4, 160 GB disk, ~50 MB/s from disk
Cluster
– 16 Intel 2.4 GHz P4 Xeons, 1 GB/node, InfiniBand interconnect (780 MB/s)
Cell
– 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1 GB
PS3
– 3.2 GHz Cell in Sony Playstation 3 (6 SPEs), 256 MB (160 MB usable)
System Utilization
[Chart: percentage of runtime, per benchmark (SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, HMMER), on SMP, Disk, Cluster, Cell, and PS3; y-axis 0-100%]
Resource Utilization – IBM Cell
[Chart: compute utilization and bandwidth utilization per benchmark on the IBM Cell; y-axis: resource utilization, 0-100%]
Single Runtime Configurations – GFlop/s
          Scalar   SMP    Disk    Cluster   Cell   PS3
SAXPY      0.3     0.7    0.007     4.9      3.5    3.1
SGEMV      1.1     1.7    0.04     12       12     10
SGEMM      6.9    45      5.5      91      119     94
CONV2D     1.9     7.8    0.6      24       85     62
FFT3D      0.7     3.9    0.05      5.5     54     31
GRAVITY    4.8    40      3.7      68       97     71
HMMER      0.9    11      0.9      12       12      7.1
SGEMM Performance
Cluster
– Intel Cluster MKL: 101 GFlop/s
– Sequoia: 91 GFlop/s
SMP
– Intel MKL: 44 GFlop/s
– Sequoia: 45 GFlop/s
FFT3D Performance
Cell
– Mercury: 58 GFlop/s
– FFTW 3.2 alpha 2: 35 GFlop/s
– Sequoia: 54 GFlop/s
Cluster
– FFTW 3.2 alpha 2: 5.3 GFlop/s
– Sequoia: 5.5 GFlop/s
SMP
– FFTW 3.2 alpha 2: 4.2 GFlop/s
– Sequoia: 3.9 GFlop/s
Best Known Implementations
HMMER
– ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
– Sequoia Cell: 12 GFlop/s
– Sequoia SMP: 11 GFlop/s
Gravity
– GRAPE-6A: 2 billion interactions/s (Fukushige et al. 2005)
– Sequoia Cell: 4 billion interactions/s
– Sequoia PS3: 3 billion interactions/s
Multi-Runtime System Configurations
Cluster of SMPs
– Four 2-way 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80 MB/s peak)
Disk + PS3
– Sony Playstation 3 bringing data from disk (~30 MB/s)
Cluster of PS3s
– Two Sony Playstation 3s connected via GigE (60 MB/s peak)
Multi-Runtime Utilization
[Chart: percentage of runtime per benchmark (SAXPY through HMMER) on Cluster of SMPs, Disk + PS3, and Cluster of PS3s; y-axis 0-100%]
Cluster of PS3 Issues
[Chart: the same per-benchmark utilization data, highlighting the Cluster of PS3s results; y-axis: percentage of runtime, 0-100%]
Cluster of PS3 Issues
[Chart: percentage of runtime for SAXPY and SGEMV on the Cluster of PS3s versus a single PS3; y-axis 0-100%]
Multi-Runtime Configurations - GFlop/s
          Cluster of SMPs   Disk + PS3   Cluster of PS3s
SAXPY           1.9            0.004           5.3
SGEMV           4.4            0.014          15
SGEMM          48              3.7            30
CONV2D          4.8            0.48           19
FFT3D           1.1            0.05            0.36
GRAVITY        50             66             119
HMMER          14              8.3            13
Conclusion
Summary
Uniform runtime interface for multi-level memory hierarchies
– Horizontal portability
• SMP, cluster, disk, Cell, PS3
– Complex machines supported through composition
• Cluster of SMPs, disk + PS3, cluster of PS3s
– Provides mechanism independence for communication and thread management
Efficient abstraction across multiple machines
– Maximizes machine resource utilization
– Low overhead
– Competitive performance
Simple interface
– Fewer than 20 entry points
Code portability
– Unmodified code running on 9 system configurations
Demonstrates the viability of the Parallel Memory Hierarchies model
Future Work
Higher level functions in the runtime?
Load balancing?
Running on more complex machines?
Combine with transactional memory?
What machines can be mapped as a tree?
Questions?
Acknowledgements: Intel Graduate Fellowship Program, DOE ASC, IBM, LANL