TRANSCRIPT
A Portable Runtime Interface For Multi-Level Memory Hierarchies
Mike Houston, Ji-Young Park, Manman Ren, Timothy Knight, Kayvon Fatahalian,
Alex Aiken, William Dally, Pat Hanrahan
Stanford University
The Problem
Lots of different architectures
– Shared memory
– Distributed memory
– Exposed communication
– Disk systems
Each architecture has its own programming system
Composed machines are difficult to program and manage
– Different mechanisms for each architecture
Previous runtimes and languages are limited
– Designed for a single architecture
– Struggle with memory hierarchies
Shared Memory Machines
[Images: AMD Barcelona, SGI Altix]
Distributed Memory Machines
MareNostrum: 2,282 nodes
ASC Blue Gene/L: 65,536 nodes
Exposed Communication Architectures
STI Cell processor: 90 nm | ~220 mm² | ~100 W
~200 GFLOPS
Complex machines
Cluster of SMPs?
– MPI?
Cluster of Cell processors?
– MPI + ALF/Cell SDK
Cluster of clusters of SMPs with Cell accelerators?
What about disk systems?
– Disk I/O is often second class
Previous Work
APIs
– MPI/MPI-2
– Pthreads
– OpenMP (Dagum et al. 1998)
– GASNet (Bonachea et al. 2002)
– Charm (Kale et al. 1993)
– SVM (Labonte et al. 2004)
– Cell SDK (IBM 2006)
– CellSs (Bellens et al. 2006)
– HTA (Bikshandi et al. 2006)
– …
Languages
– Co-Array Fortran (Numrich et al. 1994)
– Titanium (Yelick et al. 1998)
– UPC (Carlson et al. 1999)
– ZPL (Deitz et al. 2004)
– Chapel (Callahan et al. 2004)
– X10 (Charles et al. 2005)
– Brook (Buck et al. 2004)
– Cilk (Blumofe et al. 1995)
– …
Contributions
Uniform scheme for explicitly describing memory hierarchies
– Captures common traits important for performance
– Allows composition of memory hierarchies
Simple, portable API for many parallel machines
– Mechanism independence for communication and management of parallel resources
– Few entry points
– Efficient execution
Efficient compiler target
– Compiler can concentrate on global optimizations
– Runtime manages mechanisms
The Sequoia System
Programming language for memory hierarchies
– Fatahalian et al. (Supercomputing 2006)
– Adapts the Parallel Memory Hierarchies programming abstraction
Compiler for exposed-communication architectures
– Knight et al. (PPoPP 2007)
– Takes in a Sequoia program, machine file, and mapping file
– Generates task calls and data transfers
– Performs large-granularity optimizations
Portable runtime system
– Houston et al. (PPoPP 2008)
– Target for the Sequoia compiler
– Serves as an abstract machine
http://sequoia.stanford.edu
Abstract Machine Model
Parallel Memory Hierarchy Model (PMH)
Abstract machines as trees of memories, where each memory is an address space (Alpern et al. 1995)
[Diagram: a tree of memories, with a CPU attached to each memory node]
Parallel Memory Hierarchy Model
[Diagram: a three-level tree of memories with CPUs at the leaves]
Example Mappings
[Diagrams: example mappings onto the PMH model —
Cell processor: main memory above eight SPE local stores (LS);
Cluster: aggregate node memory (virtual level) above per-node memories with CPUs;
Disk system: disk above node memory, with the host CPU;
Shared-memory multiprocessor: main memory above per-CPU L2 caches]
Multi-Level Configurations
[Diagrams: multi-level configurations —
Disk + Playstation 3: disk above main memory above SPE local stores;
Cluster of Playstation 3s: aggregate cluster memory (virtual level) above per-PS3 main memories, each above SPE local stores;
Cluster of SMPs: aggregate cluster memory above node memories above per-CPU L2 caches]
Abstraction Rules
Tree of nodes: one control thread per node, one memory per node
Threads can:
– Transfer in bulk from/to parent memory, asynchronously
– Wait for transfers from/to parent to complete
– Allocate data
– Only access their own memory directly
– Transfer control to child node(s)
– Synchronize with siblings
Non-leaf threads only operate to move data and control
Similar to Space-Limited Procedures (Alpern et al. 1995)
Portable Runtime Interface
Requirements
Resource allocation
– Data allocation and naming
– Setup of parallel resources
Explicit bulk asynchronous communication
– Transfer lists
– Transfer commands
Parallel execution
– Launch tasks on children
– Asynchronous
Synchronization
– Make sure tasks/transfers complete before continuing
Runtime isolation
– No direct knowledge of other runtimes
Graphical Runtime Representation
[Diagram: a runtime sits between two adjacent levels, connecting the memory and CPU at level i+1 (the parent) to the memories and CPUs of children 1..N at level i]
Top Interface
// create and free runtime
Runtime(TaskTable table, int numChildren);
virtual ~Runtime();

// allocate and deallocate arrays
virtual Array_t* AllocArray(Size_t elmtSize, int dimensions, Size_t* dim_sizes,
                            ArrayDesc_t descriptor, int alignment) = 0;
virtual void FreeArray(Array_t* array) = 0;

// array naming
virtual void AddArray(Array_t array);
virtual Array_t GetArray(ArrayDesc_t descriptor);
virtual void RemoveArray(ArrayDesc_t descriptor);

// launch and synchronize on tasks
virtual TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) = 0;
virtual void WaitTask(TaskHandle_t handle) = 0;
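To make the calling pattern concrete, here is a hedged sketch of how a parent task might drive the top interface; DESC_A, TASK_LEAF, numChildren, and the dimension values are hypothetical stand-ins, not names from the Sequoia sources.

// Sketch: allocate and name an array at this level, fan a task out to all
// children, and wait for completion. DESC_A and TASK_LEAF are hypothetical ids.
Size_t dims[1] = { 1024 };
Array_t* A = runtime->AllocArray(sizeof(float), 1, dims, DESC_A, 128);
runtime->AddArray(*A);                  // make the array visible to children by name
TaskHandle_t h = runtime->CallChildTask(TASK_LEAF, 0, numChildren - 1);
runtime->WaitTask(h);                   // block until every child has finished
runtime->RemoveArray(DESC_A);
runtime->FreeArray(A);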
Bottom Interface
// array naming
virtual Array_t* GetArray(ArrayDesc_t descriptor);

// create, free, invoke, and synchronize on transfer lists
virtual XferList* CreateXferList(Array_t* dst, Array_t* src,
                                 Size_t* dst_idx, Size_t* src_idx,
                                 Size_t* lengths, int count) = 0;
virtual void FreeXferList(XferList* list) = 0;
virtual XferHandle_t Xfer(XferList* list) = 0;
virtual void WaitXfer(XferHandle_t handle) = 0;

// get number of children in bottom level, get local processor id, and barrier
int GetSiblingCount();
int GetID();
virtual void Barrier(ChildID_t start, ChildID_t stop) = 0;
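A minimal sketch of a leaf task pulling its strip of a parent array through the bottom interface; localA, DESC_A, and totalElems are assumptions (the local destination array, the descriptor the parent registered, and the array length).

// Sketch: each sibling fetches a contiguous strip of the parent array.
Array_t* parentA  = runtime->GetArray(DESC_A);        // look up the parent's array by name
Size_t strip      = totalElems / runtime->GetSiblingCount();
Size_t src_idx[1] = { strip * runtime->GetID() };     // this sibling's offset in the parent
Size_t dst_idx[1] = { 0 };
Size_t lengths[1] = { strip };
XferList* list = runtime->CreateXferList(localA, parentA, dst_idx, src_idx, lengths, 1);
XferHandle_t h = runtime->Xfer(list);                 // asynchronous bulk transfer
runtime->WaitXfer(h);                                 // wait for the data to land
runtime->FreeXferList(list);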
Compiler/Runtime Interaction
The compiler initializes a runtime for each parent-child pair of memories in the hierarchy
Initialize the runtime for the root memory
– The machine description specifies which runtime to initialize (SMP, cluster, disk, Cell, etc.)
If there are more levels in the hierarchy
– Initialize runtimes for the child levels
Runtime cleanup is the inverse
– Call exit on children, wait, clean up local resources, return to parent
Control of the hierarchy via task calls
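A minimal sketch of that recursive setup, assuming a hypothetical MachineDesc tree and CreateRuntimeFor factory; the real compiler-generated initialization differs.

// Sketch: walk the machine description top-down, creating one runtime per
// parent-child memory pair. MachineDesc and CreateRuntimeFor are hypothetical.
Runtime* InitHierarchy(const MachineDesc& level, TaskTable table) {
    // Pick the SMP/cluster/disk/Cell runtime named by the machine description.
    Runtime* rt = CreateRuntimeFor(level.kind, table, level.numChildren());
    for (const MachineDesc& child : level.children())
        InitHierarchy(child, table);   // deeper levels attach below this runtime
    return rt;
}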
Runtime Implementations
[Diagrams: the four single-level runtimes —
SMP runtime: main memory above CPUs;
Disk runtime: disk above node memory, managed by the host CPU;
Cell runtime: main memory above SPE local stores, managed by the PowerPC;
Cluster runtime: aggregate cluster memory (virtual level) above per-node memories]
SMP Implementation
Pthreads-based
– Launch a thread per child
– CallChildTask enqueues work on a per-child task queue (see the sketch below)
Data transfers
– Memory copy from source to destination
– Optimizations
• Pass a reference to the parent array
• Feedback to the compiler to remove transfers
– Driven by machine file information
No processor at the parent level
– Processor 0 represents both the parent node and a child node
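A hedged sketch of the per-child queue that CallChildTask could enqueue into; the types and worker loop are illustrative, not the Sequoia implementation.

#include <pthread.h>
#include <queue>

// Sketch: one queue per child thread; the parent enqueues task ids and the
// child's worker loop pops and runs them.
struct TaskQueue {
    pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    std::queue<TaskID_t> tasks;
};

void Enqueue(TaskQueue* q, TaskID_t id) {        // parent side (CallChildTask)
    pthread_mutex_lock(&q->lock);
    q->tasks.push(id);
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
}

TaskID_t Dequeue(TaskQueue* q) {                 // child worker side
    pthread_mutex_lock(&q->lock);
    while (q->tasks.empty())
        pthread_cond_wait(&q->ready, &q->lock);  // sleep until work arrives
    TaskID_t id = q->tasks.front();
    q->tasks.pop();
    pthread_mutex_unlock(&q->lock);
    return id;
}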
Disk Implementation
No processor at the top level
– The host CPU represents both the parent node and a child node
Implementation
– Allocation
• Open a file on disk
– Data transfers
• Use an async I/O API to read/write data to disk
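A minimal sketch of one disk-to-memory transfer using POSIX AIO; the slide does not name the API, so treat aio_read here as one plausible choice, with the file name, buffer, and sizes as placeholder assumptions.

#include <aio.h>
#include <fcntl.h>
#include <cerrno>
#include <cstring>

// Sketch: asynchronously read one strip of an array from its backing file.
struct aiocb cb;
std::memset(&cb, 0, sizeof cb);
cb.aio_fildes = open("array_backing.dat", O_RDONLY);  // file opened at AllocArray time
cb.aio_buf    = dst_buffer;                           // destination in node memory
cb.aio_nbytes = strip_bytes;
cb.aio_offset = strip_offset;                         // where this strip lives on disk
aio_read(&cb);                                        // returns immediately (this is Xfer)
while (aio_error(&cb) == EINPROGRESS) { /* overlap other work */ }  // WaitXfer
ssize_t done = aio_return(&cb);                       // bytes actually read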
Cell Implementation
Overlay handling
– At runtime creation, load the overlay loader onto the SPEs
– On a task call:
• The PowerPC notifies the SPE of the function to load
• The SPE loads the overlay and executes it
Data alignment
– All data allocated on 128-byte boundaries
– Multi-dimensional arrays padded to maintain alignment across dimensions
Heavy use of mailboxes
– SPE synchronization
– PPE<->SPE communication
Data transfers
– DMA lists
– DMA commands
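A hedged sketch of the PPE-side notification using libspe2 mailboxes; the encoding of the task id as the mailbox message and the per-SPE context variable are assumptions.

#include <libspe2.h>

// Sketch: the PPE writes a task id into an SPE's inbound mailbox; the
// overlay loader on the SPE reads it, loads the matching overlay, and runs it.
void NotifySPE(spe_context_ptr_t spe, unsigned int taskid) {
    spe_in_mbox_write(spe, &taskid, 1, SPE_MBOX_ALL_BLOCKING);
}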
Cluster Implementation
No processor at the top level
– Node 0 represents both the parent node and a child node
Virtual level
– Distributed Shared Memory (DSM)
• Many software and hardware solutions
• IVY (Li et al. 1988), Mether (Minnich et al. 1991), …
• Alewife (Agarwal et al. 1991), FLASH (Kuskin et al. 1994), …
• Complex coherence and consistency mechanisms
– Overkill for our needs
• No sharing between sibling processors = no coherence requirements
• Sequoia enforces consistency by disallowing aliasing on writes to parent memory
Cluster Implementation, cont.
Virtual level implementation
– Interval trees
• Represent an array as covering a range of 0..N elements
• Each node can hold any sub-portion (interval) of the range
• Tree structure allows fast lookup of all nodes that cover the interval of interest
• Allows complex data distributions
• See CLRS Section 14.3 for more detail
– Array allocation
• Define the distribution as an interval tree and broadcast it
• Allocate data for the intervals and register the data pointers with MPI-2 (MPI_Win_create)
• Align data to page boundaries for fast network transfers
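A minimal sketch of the interval query; a production version would use the augmented red-black tree from CLRS 14.3, but a linear scan over the broadcast distribution shows what the lookup computes.

#include <vector>

// Sketch: find every node whose interval overlaps the requested element range.
struct Interval { size_t lo, hi; int node; };     // [lo, hi) owned by 'node'

std::vector<int> NodesCovering(const std::vector<Interval>& dist,
                               size_t lo, size_t hi) {
    std::vector<int> hits;
    for (const Interval& iv : dist)
        if (iv.lo < hi && lo < iv.hi)             // half-open ranges overlap
            hits.push_back(iv.node);
    return hits;
}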
Cluster Implementation, cont.
Virtual level implementation
– Data transfer
• Compare the requested data range against the interval tree
• Read from parent: issue MPI_Get to any node with a matching interval (MPI_LOCK_SHARED)
• Write to parent: issue MPI_Put to all nodes with matching intervals (MPI_LOCK_EXCLUSIVE)
– Optimizations
• If the requested range is local, return a reference to parent memory
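A minimal sketch of the read path with MPI-2 one-sided operations; displacements and byte counts are simplified, and the window is assumed to have been created with MPI_Win_create at allocation time.

#include <mpi.h>

// Sketch: pull one matching interval from a remote node's exposed memory.
void ReadInterval(void* dst, int target_rank, MPI_Aint target_disp,
                  int nbytes, MPI_Win win) {
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);  // shared: concurrent reads OK
    MPI_Get(dst, nbytes, MPI_BYTE,
            target_rank, target_disp, nbytes, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);                    // transfer completes here
}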
Runtime Composition
[Diagrams: composed runtimes —
Disk + Playstation 3: a disk runtime stacked on a Cell runtime;
Cluster of Playstation 3s: a cluster runtime stacked on Cell runtimes;
Cluster of SMPs: a cluster runtime stacked on SMP runtimes]
Evaluation
Evaluation Metrics
Can we run unmodified Sequoia code on all of these systems?
How much overhead does our abstraction add?
How well can we utilize machine resources?
– Computation
– Bandwidth
Is our application performance competitive with the best available implementations?
Can we effectively compose runtimes for more complex systems?
Sequoia Benchmarks
Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM benchmarks
Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)
FFT3D: complex single-precision FFT
Gravity: 100 time steps of an N-body (N²) stellar dynamics simulation, single precision
HMMER: fuzzy protein string matching using HMM evaluation (Horn et al. SC2005)
Best available implementations used as leaf tasks
Single Runtime System Configurations
Scalar
– 2.4 GHz Intel Pentium 4 Xeon, 1 GB
8-way SMP
– Four dual-core 2.66 GHz Intel P4 Xeons, 8 GB
Disk
– 2.4 GHz Intel P4, 160 GB disk, ~50 MB/s from disk
Cluster
– 16 Intel 2.4 GHz P4 Xeons, 1 GB/node, InfiniBand interconnect (780 MB/s)
Cell
– 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1 GB
PS3
– 3.2 GHz Cell in Sony Playstation 3 (6 SPEs), 256 MB (160 MB usable)
System Utilization
[Chart: percentage of runtime, per benchmark (SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, HMMER), on SMP, Disk, Cluster, Cell, and PS3; y-axis 0-100%]
Resource Utilization – IBM Cell
[Chart: compute utilization and bandwidth utilization per benchmark on the IBM Cell; y-axis: resource utilization, 0-100%]
Single Runtime Configurations – GFlop/s
          Scalar   SMP    Disk    Cluster   Cell   PS3
SAXPY      0.3     0.7    0.007     4.9      3.5    3.1
SGEMV      1.1     1.7    0.04     12       12     10
SGEMM      6.9    45      5.5      91      119     94
CONV2D     1.9     7.8    0.6      24       85     62
FFT3D      0.7     3.9    0.05      5.5     54     31
GRAVITY    4.8    40      3.7      68       97     71
HMMER      0.9    11      0.9      12       12      7.1
SGEMM Performance
Cluster
– Intel Cluster MKL: 101 GFlop/s
– Sequoia: 91 GFlop/s
SMP
– Intel MKL: 44 GFlop/s
– Sequoia: 45 GFlop/s
FFT3D Performance
Cell
– Mercury: 58 GFlop/s
– FFTW 3.2 alpha 2: 35 GFlop/s
– Sequoia: 54 GFlop/s
Cluster
– FFTW 3.2 alpha 2: 5.3 GFlop/s
– Sequoia: 5.5 GFlop/s
SMP
– FFTW 3.2 alpha 2: 4.2 GFlop/s
– Sequoia: 3.9 GFlop/s
Best Known Implementations
HMMER
– ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
– Sequoia Cell: 12 GFlop/s
– Sequoia SMP: 11 GFlop/s
Gravity
– GRAPE-6A: 2 billion interactions/s (Fukushige et al. 2005)
– Sequoia Cell: 4 billion interactions/s
– Sequoia PS3: 3 billion interactions/s
Multi-Runtime System Configurations
Cluster of SMPs
– Four 2-way 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80 MB/s peak)
Disk + PS3
– Sony Playstation 3 bringing data from disk (~30 MB/s)
Cluster of PS3s
– Two Sony Playstation 3s connected via GigE (60 MB/s peak)
Multi-Runtime Utilization
[Chart: percentage of runtime per benchmark (SAXPY through HMMER) on Cluster of SMPs, Disk + PS3, and Cluster of PS3s; y-axis 0-100%]
Cluster of PS3 Issues
[Chart: the same per-benchmark utilization data, highlighting the Cluster of PS3s results; y-axis: percentage of runtime, 0-100%]
Cluster of PS3 Issues
[Chart: percentage of runtime for SAXPY and SGEMV on the Cluster of PS3s versus a single PS3; y-axis 0-100%]
Multi-Runtime Configurations - GFlop/s
          Cluster of SMPs   Disk + PS3   Cluster of PS3s
SAXPY           1.9            0.004           5.3
SGEMV           4.4            0.014          15
SGEMM          48              3.7            30
CONV2D          4.8            0.48           19
FFT3D           1.1            0.05            0.36
GRAVITY        50             66             119
HMMER          14              8.3            13
Conclusion
Summary
Uniform runtime interface for multi-level memory hierarchies
– Horizontal portability
• SMP, cluster, disk, Cell, PS3
– Complex machines supported through composition
• Cluster of SMPs, disk + PS3, cluster of PS3s
– Provides mechanism independence for communication and thread management
Efficient abstraction across multiple machines
– Maximizes machine resource utilization
– Low overhead
– Competitive performance
Simple interface
– Fewer than 20 entry points
Code portability
– Unmodified code running on 9 system configurations
Demonstrates the viability of the Parallel Memory Hierarchies model
Future Work
Higher level functions in the runtime?
Load balancing?
Running on more complex machines?
Combine with transactional memory?
What machines can be mapped as a tree?
Questions?
Acknowledgements: Intel Graduate Fellowship Program, DOE ASC, IBM, LANL