
Operated by Los Alamos National Security, LLC for DOE/NNSA

LA-UR 09-02032

Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing

Ben Bergen

Computational Physics (CCS-2)

Los Alamos National Laboratory

Brian Albright (X-1), Kevin Bowers (D.E. Shaw), Lin Yin (X-1), William Daughton (X-1)

ScicomP 15



Overview

  Roadrunner System Overview

  Basic Considerations and Programming Models

  Adapting VPIC Kinetic Plasma Code to Roadrunner

  Optimizing Total Workflow

  Open Science on Roadrunner

Slide 2




Roadrunner is a Cluster

Slide 3




Roadrunner is a Cluster of Clusters

Slide 4




Roadrunner is a Cluster of Clusters with Accelerators

Slide 5




Triblade Compute Node

Slide 6




Original Blade Topology

Slide 7

  One-to-one affinity between Opteron core and Cell processor

  Newer versions of DaCS support two-to-one affinity

  Support for four-to-one affinity is unconfirmed




Roadrunner: Basic Considerations for Adaptation

  First hybrid supercomputer of the current generation incorporating x86_64, PowerPC, and SPU ISAs.

  Codes require three executables
    x86_64 executable runs on the Opteron host processor
    PowerPC executable runs on the Power Processing Element (PPE) accelerator processor
    SPU threads run on the eight Synergistic Processing Element (SPE) special-purpose vector unit processors
  Three compilers: gcc, ppu-gcc, spu-gcc (also XL C/C++)

  Design considerations: Process launch and synchronization

Slide 8

Roadrunner has three different architectures

Opteron

PowerPC

SPE





  Incorporates main memory on the Opteron and Cell eDP blades plus the user-controlled local store SRAM on the SPEs

  Codes that run on Roadrunner must handle communication between these memory spaces
    Distributed memory communication between Opteron hosts
    Point-to-point communication between Opteron host and Cell accelerator
    Direct Memory Access (DMA) communication between Cell main memory and SPE local store memory

  Opteron and Cell have different endianness
    Some byte-swapping is necessary

  Cell blades are diskless

  Design considerations: Communication and I/O

Slide 9

Roadrunner has three different address spaces
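The byte-swapping mentioned on this slide can be sketched as follows. This is a minimal illustration, not VPIC's actual conversion code: the helper names swap32 and swap_float_buffer are hypothetical, and it assumes the data being exchanged are 32-bit single-precision values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of one 32-bit word (hypothetical helper). */
static uint32_t swap32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Swap every float in a buffer in place, e.g. after receiving data
 * from a processor of the opposite endianness. memcpy is used to move
 * the bits so that no floating-point operation touches the swapped
 * (possibly invalid) bit pattern. */
static void swap_float_buffer(float *buf, size_t n) {
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        uint32_t bits;
        memcpy(&bits, &buf[i], sizeof bits);
        bits = swap32(bits);
        memcpy(&buf[i], &bits, sizeof bits);
    }
}
```

Swapping is its own inverse, so the same routine serves both directions of a host/accelerator exchange.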





  Process launch and synchronization
    MPI, DaCS/ALF, libSPE2

  Communication
    MPI, DaCS/ALF, libSPE2

Slide 10

Multiple tools and programming models

[Figure: MPI, DaCS, and libSPE2]

Hierarchical/heterogeneous advantages

  Fault tolerance
    Faults can be caught at multiple levels

  Scalability
    Strong scalability is possible on SPEs
    Weak scalability through distributed memory




Programming Models: Host-Centric (Function Offload)

Slide 11

Opteron

Cell

Pros

  Allows staged development
    Existing MPI codes will run on Opterons

  Synchronous or asynchronous function offload to accelerator

  Minimizes reliance on PPE (poor performer!)

Cons

  Potential data-movement bottleneck

  Offload cost must be amortized by work done on accelerator
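The amortization point can be made concrete with a small break-even estimate. The cost model here is an assumption made for illustration, not something from the slides: a fixed offload cost c (launch plus data movement), a host rate r_h and an accelerator rate r_a in work units per second. Offload wins when c + W/r_a < W/r_h, that is, when W > c / (1/r_h - 1/r_a). The function name is hypothetical.

```c
#include <assert.h>

/* Smallest amount of work W (in arbitrary work units) above which
 * offloading is faster than staying on the host, under the assumed
 * model: host time = W / host_rate,
 *        offload time = offload_cost + W / accel_rate.
 * Returns -1.0 when the accelerator is not faster than the host,
 * in which case offload never pays off. */
static double break_even_work(double offload_cost,
                              double host_rate,
                              double accel_rate) {
    assert(offload_cost >= 0.0 && host_rate > 0.0 && accel_rate > 0.0);
    if (accel_rate <= host_rate)
        return -1.0;
    return offload_cost / (1.0 / host_rate - 1.0 / accel_rate);
}
```

For example, with a 1 s offload cost and an accelerator twice as fast as the host, the offloaded function must carry more than 2 host-seconds of work before the offload is worthwhile.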




Programming Models: Accelerator-Centric

Slide 12

Opteron

Cell

Pros

  Also allows staged development
    Existing MPI codes will run on PowerPC (PPE)

  Hides complexity of hybrid architecture

  Avoids data-movement bottleneck

Cons

  Heavier reliance on PPE

  Computationally intensive portions of code must run on SPEs

  Requires “relay” to forward message traffic




Message Passing Relay

Slide 13

[Figure: two Opteron hosts, each with an attached Cell]

Direct point-to-point communication is not possible between Cells




Message Passing Relay

Slide 14

[Figure: two Cell/Opteron pairs, with data passing through the Opteron hosts]

Relay forwards messages through hosts to peer
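The hop sequence a relayed message follows can be sketched as below. This is only a model of the routing idea, not the MP Relay implementation: the types and the function relay_route are hypothetical, and the sketch assumes the one-to-one affinity described earlier, so that Cell rank r is attached to Opteron rank r.

```c
#include <assert.h>

typedef enum { NODE_CELL, NODE_OPTERON } node_kind_t;

typedef struct {
    node_kind_t kind;  /* which kind of processor handles this hop */
    int rank;          /* rank of that processor */
} hop_t;

/* Fill path[] with the hops a message takes from one Cell to another
 * and return the hop count. Cells cannot talk to each other directly,
 * so the message goes Cell -> its host -> peer host -> peer Cell. */
static int relay_route(int src_cell, int dst_cell, hop_t path[4]) {
    assert(src_cell >= 0 && dst_cell >= 0 && src_cell != dst_cell);
    int n = 0;
    path[n++] = (hop_t){ NODE_CELL,    src_cell };  /* sender */
    path[n++] = (hop_t){ NODE_OPTERON, src_cell };  /* host-accelerator link */
    path[n++] = (hop_t){ NODE_OPTERON, dst_cell };  /* host-to-host link */
    path[n++] = (hop_t){ NODE_CELL,    dst_cell };  /* host-accelerator link */
    return n;
}
```

From the application's point of view only the first and last entries matter, which is what lets the relay present logical Cell-to-Cell messaging.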




Programming Models: All Roads Lead Everywhere

Slide 15

[Figure: Opteron and Cell; a scheduler with Opteron core and Cell/GPU workers]

There is a natural evolution of both of these approaches into a fully hybrid computing model

  Initial difference is in program locus or control-process

  In the “evolved” model the host process runs a task-queue

  Tasks may be offloaded to other host-type cores or to accelerators

  Task data may live in worker’s memory to avoid data-movement bottlenecks

  More on how we can use this to follow!
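A host-side task queue of the kind described here might look like the following sketch. All names (task_t, task_queue_push, scheduler_drain, the worker kinds) are hypothetical, and the "offload" is just a direct function call; on a real hybrid node the dispatch step would hand the task to a host core or an accelerator.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { WORKER_HOST_CORE, WORKER_ACCELERATOR, WORKER_KINDS } worker_kind_t;

typedef struct {
    void (*run)(void *data);  /* task body */
    void *data;               /* may refer to data resident in worker memory */
    worker_kind_t target;     /* where the scheduler should send the task */
} task_t;

#define QUEUE_CAPACITY 16

typedef struct {
    task_t slot[QUEUE_CAPACITY];  /* ring buffer of pending tasks */
    size_t head, count;
} task_queue_t;

/* Append a task; returns 0 on success, -1 if the queue is full. */
static int task_queue_push(task_queue_t *q, task_t t) {
    if (q->count == QUEUE_CAPACITY)
        return -1;
    q->slot[(q->head + q->count) % QUEUE_CAPACITY] = t;
    q->count++;
    return 0;
}

/* Host scheduler loop: pop tasks in FIFO order and dispatch each one.
 * dispatched[] counts how many tasks went to each worker kind. */
static void scheduler_drain(task_queue_t *q, size_t dispatched[WORKER_KINDS]) {
    while (q->count > 0) {
        task_t t = q->slot[q->head];
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        assert(t.run != NULL && t.target < WORKER_KINDS);
        t.run(t.data);            /* stand-in for the real offload */
        dispatched[t.target]++;
    }
}

/* Sample task body used for illustration. */
static void add_one(void *data) { ++*(int *)data; }
```

Keeping only a pointer in each task reflects the slide's point that task data can stay in the worker's memory while the host manages control flow.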




Slide 16

Particle-In-Cell (PIC) Methods Simulate Plasma Physics

  One application of VPIC is to simulate Laser Plasma Interactions (LPI) critical to understanding Inertial Confinement Fusion (ICF) at the National Ignition Facility (NIF)

  Several difficulties arise during the compression of hohlraum capsules
    Laser scattering – not enough energy to compress capsule
    Laser scattering – laser does not target desired areas (asymmetric compression)
    Pre-heating – electrons heat plasma, making compression more difficult

LLNL pF3D modeling of a laser beam

VPIC modeling of a single laser speckle

Integrated LLNL Hydra modeling of ICF experiment




Slide 17

Particle-In-Cell Method

[Figure: time-iteration loop (Advance Particles, Accumulate Currents, Update Fields, Interpolate Field Effects) over a spatial domain containing grids and particles]
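The four stages of the cycle can be illustrated with a deliberately simplified one-dimensional sketch. This is not VPIC's algorithm: it assumes a periodic 1D grid, normalized units in which the field update reduces to dE/dt = -J, linear weighting between grid points, and it makes no attempt at charge conservation. All names are hypothetical. The cycle is entered at the interpolation stage.

```c
#include <assert.h>
#include <stddef.h>

#define NX 8  /* number of periodic grid cells (illustrative) */

typedef struct { float x, v, q, m; } pic_particle_t;  /* x in cell units, 0 <= x < NX */

/* Split a position into left/right grid indices and a linear weight
 * toward the right neighbor. */
static void locate(float x, int *left, int *right, float *w) {
    assert(x >= 0.0f && x <= (float)NX);
    int i = (int)x;            /* truncation equals floor for x >= 0 */
    *w = x - (float)i;
    *left = i % NX;
    *right = (i + 1) % NX;
}

/* One time step over all particles and the grid field E. */
static void pic_step(pic_particle_t *p, size_t np, float E[NX], float dt) {
    float J[NX] = {0.0f};
    for (size_t n = 0; n < np; ++n) {
        int l, r;
        float w;
        /* Interpolate field effects: field at the particle position. */
        locate(p[n].x, &l, &r, &w);
        float Ep = (1.0f - w) * E[l] + w * E[r];
        /* Advance particles: update velocity, then position (periodic). */
        p[n].v += (p[n].q / p[n].m) * Ep * dt;
        p[n].x += p[n].v * dt;
        while (p[n].x >= (float)NX) p[n].x -= (float)NX;
        while (p[n].x < 0.0f)       p[n].x += (float)NX;
        /* Accumulate currents: deposit q*v on the two nearest grid points. */
        locate(p[n].x, &l, &r, &w);
        J[l] += (1.0f - w) * p[n].q * p[n].v;
        J[r] += w * p[n].q * p[n].v;
    }
    /* Update fields: dE/dt = -J in the assumed normalized units. */
    for (int i = 0; i < NX; ++i)
        E[i] -= J[i] * dt;
}
```

Note that interpolation, advance and accumulation all happen in a single pass over each particle, which is the access pattern a PIC code wants when particle data dominate memory traffic.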




Slide 18

VPIC – Vector Particle-In-Cell

  3D, fully relativistic, electromagnetic Particle-In-Cell (PIC) code
    Self-consistent evolution of a kinetic plasma
    Charge conserving (no implicit solve)

  Optimized for data motion
    Single precision – half the memory bandwidth demand, double the theoretical peak
    Single-pass particle processing
    Field interpolation coefficients are pre-computed

  Optimized for modern architectures
    Uses short-vector SIMD intrinsics (SSE, Altivec, SPU)

  Assumes that particles do not leave the voxel in which they started
    Exceptions are handled separately

  O(N) particle sorting
    Improves spatial locality of particle data access
    Improves temporal locality of field data access
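One standard way to sort particles by voxel in O(N) time is a counting sort keyed on the voxel index. The slides do not say which algorithm VPIC uses, so the sketch below is an assumed stand-in; the function name is hypothetical, and the particle layout is the 32-byte struct shown later in the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct particle {
    float dx, dy, dz;  /* position (relative to voxel) */
    int32_t i;         /* index of voxel containing particle */
    float ux, uy, uz;  /* particle normalized momentum */
    float q;           /* particle charge */
} particle_t;

/* Out-of-place counting sort of np particles by voxel index.
 * Runs in O(np + nvoxel) time and is stable.
 * Returns 0 on success, -1 if the scratch array cannot be allocated. */
static int sort_particles_by_voxel(const particle_t *in, particle_t *out,
                                   size_t np, int32_t nvoxel) {
    assert(nvoxel > 0);
    size_t *start = calloc((size_t)nvoxel + 1, sizeof *start);
    if (start == NULL)
        return -1;
    /* Count particles per voxel, shifted by one slot. */
    for (size_t n = 0; n < np; ++n) {
        assert(in[n].i >= 0 && in[n].i < nvoxel);
        start[in[n].i + 1]++;
    }
    /* Prefix sum: start[v] becomes the first output slot of voxel v. */
    for (int32_t v = 0; v < nvoxel; ++v)
        start[v + 1] += start[v];
    /* Scatter each particle to the next free slot of its voxel. */
    for (size_t n = 0; n < np; ++n)
        out[start[in[n].i]++] = in[n];
    free(start);
    return 0;
}
```

After sorting, particles in the same voxel are contiguous, so consecutive particles touch the same field data, which is the locality benefit the slide describes.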




Slide 19

Porting to Roadrunner (things that we did)

  Message Passing Relay (MP Relay)
    Flattens communication topology
    Allows logical point-to-point communication between Cell processors
    Abstracts remote I/O layer for restart and visualization dumps

  Pipelined execution
    Code restructured for data-parallel thread execution
    Current support for serial, pthreads, and SPE threads
    Simple, common interface: init(), finalize(), execute(function_t), sync()

  Particle data structures
    Optimized for efficient communication via DMA requests
    Can be tuned to cache size on traditional cached-memory architectures (padding)

  Voxel cache (access to field data)
    Fully associative least recently used (LRU) policy
    Simple interface: voxel_cache_fetch() and voxel_cache_wait()

  Text overlay support
    Allows acceleration of field advance, particle sorting and accumulators
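A fully associative LRU cache of the kind listed under "Voxel cache" can be sketched as follows. Only the names voxel_cache_fetch() and voxel_cache_wait() come from the slide; their signatures, the voxel_field_t payload, the four-line capacity and the use of a plain struct copy in place of a DMA transfer are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINES 4  /* hypothetical capacity; the real one depends on the LS budget */

typedef struct { float ex, ey, ez, pad; } voxel_field_t;  /* hypothetical field payload */

typedef struct {
    int32_t tag[CACHE_LINES];        /* voxel held by each line, -1 if empty */
    uint64_t last_use[CACHE_LINES];  /* logical time of most recent access */
    voxel_field_t line[CACHE_LINES];
    uint64_t clock;
    unsigned misses;
} voxel_cache_t;

static void voxel_cache_init(voxel_cache_t *c) {
    memset(c, 0, sizeof *c);
    for (int i = 0; i < CACHE_LINES; ++i)
        c->tag[i] = -1;
}

/* Return the cached copy of a voxel's field data, loading it from
 * main_mem on a miss. Fully associative: any line may hold any voxel.
 * The least recently used line is evicted on a miss. */
static const voxel_field_t *voxel_cache_fetch(voxel_cache_t *c,
                                              const voxel_field_t *main_mem,
                                              int32_t voxel) {
    assert(voxel >= 0);
    int victim = 0;
    c->clock++;
    for (int i = 0; i < CACHE_LINES; ++i) {
        if (c->tag[i] == voxel) {               /* hit */
            c->last_use[i] = c->clock;
            return &c->line[i];
        }
        if (c->last_use[i] < c->last_use[victim])
            victim = i;                         /* oldest line seen so far */
    }
    c->line[victim] = main_mem[voxel];          /* stand-in for a DMA get */
    c->tag[victim] = voxel;
    c->last_use[victim] = c->clock;
    c->misses++;
    return &c->line[victim];
}

/* On an SPE this would block until the outstanding DMA completes;
 * the copy above is synchronous, so there is nothing to wait for. */
static void voxel_cache_wait(voxel_cache_t *c) { (void)c; }
```

Splitting fetch from wait is what would let a real implementation issue the transfer early and overlap it with computation on data already in the local store.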




Pipeline Abstraction

Slide 20

[Figure: master thread timeline: init, execute, sync, finalize]

  Worker threads block for execute message to reduce thread creation overhead
    pthreads implementation uses condition variables
    SPE implementation uses mailboxes

  SPE symbols are exposed to the PPE through _SPUEAR_ linker magic
    Function call is implemented through mailbox message
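The pthreads flavor of this abstraction could be sketched as below. The four entry points mirror the init(), execute(function_t), sync() and finalize() interface named in the talk, but the pipeline_ prefix, the function_t signature, the fixed worker count and every internal detail are assumptions, not VPIC's code. Workers are created once and then sleep on a condition variable until an execute message arrives.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_WORKERS 4  /* illustrative worker count */

typedef void (*function_t)(int worker_id, void *args);

static struct {
    pthread_t thread[N_WORKERS];
    pthread_mutex_t lock;
    pthread_cond_t wake;   /* new work or shutdown */
    pthread_cond_t done;   /* last worker finished the current job */
    function_t func;
    void *args;
    unsigned generation;   /* bumped once per execute() */
    int pending;           /* workers still busy with the current job */
    int shutdown;
} pool = { .lock = PTHREAD_MUTEX_INITIALIZER,
           .wake = PTHREAD_COND_INITIALIZER,
           .done = PTHREAD_COND_INITIALIZER };

static int worker_ids[N_WORKERS];

static void *worker_main(void *arg) {
    int id = *(int *)arg;
    unsigned seen = 0;     /* last generation this worker ran */
    pthread_mutex_lock(&pool.lock);
    for (;;) {
        while (!pool.shutdown && pool.generation == seen)
            pthread_cond_wait(&pool.wake, &pool.lock);   /* block for execute */
        if (pool.shutdown)
            break;
        seen = pool.generation;
        function_t f = pool.func;
        void *a = pool.args;
        pthread_mutex_unlock(&pool.lock);
        f(id, a);                                        /* run outside the lock */
        pthread_mutex_lock(&pool.lock);
        if (--pool.pending == 0)
            pthread_cond_signal(&pool.done);
    }
    pthread_mutex_unlock(&pool.lock);
    return NULL;
}

static void pipeline_init(void) {
    pool.shutdown = 0;
    pool.pending = 0;
    pool.generation = 0;
    for (int i = 0; i < N_WORKERS; ++i) {
        worker_ids[i] = i;
        int rc = pthread_create(&pool.thread[i], NULL, worker_main, &worker_ids[i]);
        assert(rc == 0);
        (void)rc;
    }
}

static void pipeline_execute(function_t f, void *args) {
    pthread_mutex_lock(&pool.lock);
    assert(pool.pending == 0);       /* caller must sync() between jobs */
    pool.func = f;
    pool.args = args;
    pool.pending = N_WORKERS;
    pool.generation++;
    pthread_cond_broadcast(&pool.wake);
    pthread_mutex_unlock(&pool.lock);
}

static void pipeline_sync(void) {
    pthread_mutex_lock(&pool.lock);
    while (pool.pending > 0)
        pthread_cond_wait(&pool.done, &pool.lock);
    pthread_mutex_unlock(&pool.lock);
}

static void pipeline_finalize(void) {
    pipeline_sync();
    pthread_mutex_lock(&pool.lock);
    pool.shutdown = 1;
    pthread_cond_broadcast(&pool.wake);
    pthread_mutex_unlock(&pool.lock);
    for (int i = 0; i < N_WORKERS; ++i)
        pthread_join(pool.thread[i], NULL);
}

/* Sample job: each worker writes the square of its id into its own slot. */
static void square_job(int worker_id, void *args) {
    ((int *)args)[worker_id] = worker_id * worker_id;
}
```

A typical call sequence is pipeline_init(), then repeated pipeline_execute(job, data) and pipeline_sync() pairs, then pipeline_finalize(). An SPE variant would keep the same interface and replace the condition variables with mailbox reads and writes.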




Slide 21

  Data are processed in segments of even multiples of 16 particles
    Segments are accessed in blocks of up to 512 particles (16 KB largest possible single DMA request)
    Triple-buffered: streaming data paradigm (read, update, write)

  Block processing groups particles in sets of 4
    Optimal for single-precision SIMD operations
    Inner loop is 4x hand unrolled

VPIC applies best strategy to particle advance

#include <stdint.h>

typedef struct particle {
  float dx, dy, dz; // position (relative to voxel)
  int32_t i;        // index of voxel containing particle
  float ux, uy, uz; // particle normalized momentum
  float q;          // particle charge
} particle_t;       // 32 bytes
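The block size and the triple-buffered read/update/write pattern can be sketched on an ordinary host. At 32 bytes per particle, a 16 KB DMA request holds 16384 / 32 = 512 particles. In the sketch below memcpy stands in for DMA get/put, the three stages run one after another rather than overlapping as they would on an SPE, and update_block is a placeholder for the real particle advance. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct particle {
    float dx, dy, dz;  /* position (relative to voxel) */
    int32_t i;         /* index of voxel containing particle */
    float ux, uy, uz;  /* particle normalized momentum */
    float q;           /* particle charge */
} particle_t;

#define MAX_DMA_BYTES   (16 * 1024)
#define BLOCK_PARTICLES (MAX_DMA_BYTES / sizeof(particle_t))  /* 512 */

_Static_assert(sizeof(particle_t) == 32, "particle_t must stay 32 bytes");

/* Number of particles in block b of an array of np particles. */
static size_t block_len(size_t np, size_t b) {
    size_t left = np - b * BLOCK_PARTICLES;
    return left < BLOCK_PARTICLES ? left : BLOCK_PARTICLES;
}

/* Placeholder for the real particle advance. */
static void update_block(particle_t *p, size_t n, float dt) {
    for (size_t k = 0; k < n; ++k)
        p[k].dx += p[k].ux * dt;
}

/* Stream all particles through three local buffers. At step s the
 * buffers hold block s (being read), block s-1 (being updated) and
 * block s-2 (being written back), so no buffer is used twice. */
static void stream_particles(particle_t *mem, size_t np, float dt) {
    static particle_t buf[3][BLOCK_PARTICLES];
    size_t nblocks = (np + BLOCK_PARTICLES - 1) / BLOCK_PARTICLES;
    assert(mem != NULL || np == 0);
    if (nblocks == 0)
        return;
    for (size_t s = 0; s < nblocks + 2; ++s) {
        if (s < nblocks)                      /* read */
            memcpy(buf[s % 3], mem + s * BLOCK_PARTICLES,
                   block_len(np, s) * sizeof(particle_t));
        if (s >= 1 && s - 1 < nblocks)        /* update */
            update_block(buf[(s - 1) % 3], block_len(np, s - 1), dt);
        if (s >= 2)                           /* write */
            memcpy(mem + (s - 2) * BLOCK_PARTICLES, buf[(s - 2) % 3],
                   block_len(np, s - 2) * sizeof(particle_t));
    }
}
```

Three 16 KB buffers cost 48 KB of local store, which is part of why the data footprint of the particle advance is so large relative to the space left for text.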




Slide 22

Overlays

  VPIC’s particle advance logic maxes out the Local Store (LS)
    Particle advance data uses 206 KB
    This leaves ~50 KB for text (machine instructions)

  Overlays are segments of text that can be loaded/unloaded from LS
    Expand the effective maximum size of an SPE program
    Avoid overhead of starting new SPE threads (prohibitive)
    Limited by management table size

  IBM has implemented overlay support as a software cache
    Overlay manager fetches text that is not in LS (DMA call)
    No prefetch capability




Slide 23

Overlay Properties

[Figure: SPE Local Store (data, root segment, regions 1 and 2) alongside main memory holding overlay segments SA through SF]



Slide 24

Overlay Properties

[Figure: SPE Local Store (data, root segment, regions 1 and 2) alongside main memory holding overlay segments SA through SF; the text portion of the local store is highlighted]

Text is partitioned into regions with a static root segment




Slide 25

Overlay Properties

[Figure: SPE Local Store (data, root segment, regions 1 and 2) alongside main memory holding overlay segments SA through SF; the segments for region 1 are highlighted]

Each region can be filled by specific segments of text




Slide 26

Overlay Properties

[Figure: SPE Local Store (data, root segment, regions 1 and 2) alongside main memory holding overlay segments SA through SF; segment sizes shown: 32 KB, 28 KB, 32 KB, 20 KB]

The size of a region is determined by its largest segment
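The sizing rule can be written down directly: a region must be as large as its largest segment, and data, root segment and all regions together must fit in the SPE's 256 KB local store. The helper names and the example numbers in the usage note are hypothetical and are not taken from VPIC's link map.

```c
#include <assert.h>
#include <stddef.h>

#define LS_BYTES (256u * 1024u)  /* SPE local store size */

/* A region is sized by the largest segment that can be loaded into it. */
static size_t region_size(const size_t *segment_bytes, size_t nseg) {
    assert(segment_bytes != NULL || nseg == 0);
    size_t largest = 0;
    for (size_t s = 0; s < nseg; ++s)
        if (segment_bytes[s] > largest)
            largest = segment_bytes[s];
    return largest;
}

/* Nonzero when data, root segment and all overlay regions fit in LS. */
static int fits_in_local_store(size_t data_bytes, size_t root_bytes,
                               const size_t *region_bytes, size_t nregions) {
    assert(region_bytes != NULL || nregions == 0);
    size_t total = data_bytes + root_bytes;
    for (size_t r = 0; r < nregions; ++r)
        total += region_bytes[r];
    return total <= LS_BYTES;
}
```

For instance, a region that can hold either a 32 KB or a 28 KB segment occupies 32 KB; with 206 KB of data and an assumed 10 KB root segment, one such region fits (248 KB) but adding a second 20 KB region does not (268 KB).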




Slide 27

Overlay Properties

[Figure: SPE Local Store (data, root segment, regions 1 and 2) alongside main memory holding overlay segments SA through SF; segment SB is shown loaded into the local store]

Loading a new segment overwrites its respective region




Slide 28

Overlays

  VPIC has been extended to support overlays
    Particle advance, accumulators and particle sorting have been accelerated
    Current code decomposition uses one region with two segments

  Even this fairly trivial approach has difficulties
    VPIC overlay strategy uses stack for data buffers
    Stack placement changes with linkage (even with only trivial code changes)
    Silent overflows into memory handled by overlay manager
    Data corruption, hangs, segmentation faults…

  May need to implement light-weight heap
    void * spu_malloc(), spu_free_all()
    Would reserve 224 KB of LS space for heap
    Actual implementation will use static byte array in root segment
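A light-weight heap along these lines could be a simple bump allocator over a static byte array. The names spu_malloc and spu_free_all and the 224 KB figure come from the slide; the size_t parameter, the 16-byte alignment and the bump-pointer strategy are assumptions. The sketch compiles on an ordinary host, where the array merely stands in for local store.

```c
#include <assert.h>
#include <stddef.h>

#define SPU_HEAP_BYTES (224u * 1024u)  /* reserved heap size from the slide */
#define SPU_ALIGN      16u             /* assumed alignment for DMA buffers */

/* Static byte array that would live in the root segment. */
static _Alignas(16) unsigned char spu_heap[SPU_HEAP_BYTES];
static size_t spu_heap_top = 0;        /* offset of the first free byte */

/* Hand out the next aligned chunk, or NULL when the heap is exhausted.
 * There is no per-allocation free: everything is released at once. */
void *spu_malloc(size_t bytes) {
    assert(spu_heap_top % SPU_ALIGN == 0);
    size_t rounded = (bytes + SPU_ALIGN - 1) & ~(size_t)(SPU_ALIGN - 1);
    if (rounded < bytes || rounded > SPU_HEAP_BYTES - spu_heap_top)
        return NULL;                   /* overflow or out of space */
    void *p = &spu_heap[spu_heap_top];
    spu_heap_top += rounded;
    return p;
}

/* Release every allocation by resetting the bump pointer. */
void spu_free_all(void) {
    spu_heap_top = 0;
}
```

Because the buffers come from a fixed array rather than the stack, their placement no longer shifts with linkage, and an over-sized request fails visibly with NULL instead of silently overwriting memory managed by the overlay manager.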




Slide 29

VPIC Highlights

  1.00 Trillion Particle Run (Poughkeepsie just after stand-up)
    Aggressive test of full system
    Achieved sustained performance of >374.25 TF
    11% of theoretical max performance (single precision 3.0 PF)
    Gordon Bell Prize Finalist SC2008
    Cell processes used 42.8 TB RAM (93.8% of available Cell memory)
    Opteron processes used 7.3 TB RAM

  Science Runs: Back-scatter in laser plasma interactions
    Current bread-and-butter runs on 6 CUs (4,096 ranks : 32,768 threads)
    Next set of runs will be at 16 CUs (11,520 ranks : 92,160 threads)
    11x speedup over Opteron-only
    Excellent machine stability (main difficulty is I/O subsystem)




Efficiency is a Poor Metric for Performance

  Real computational workflows expose many more bottlenecks
    Data I/O for visualization and restarts
    Data rendering for visualization
    Diagnostics and statistical analysis

  Many of these steps can be handled concurrently
    Once the pipeline is full, we can fully subscribe a hybrid compute node
    Reduces vulnerability to machine instabilities by reducing total time to solution
    Special purpose accelerators can be targeted to specific tasks
    Host process only manages tasks

  VPIC will be enhanced to address these issues
    Initial enhancements will use pipeline abstraction
    OpenCL implementation planned

Slide 30

Hybrid Computing Architectures Can Help Us!




Exploiting Hybrid Architectures

Slide 31

[Figure: a host process/task queue (scheduler) with Cell, GPU, and Opteron core workers; task labels shown are Computation, Rendering, and Disk I/O, plus one label truncated in the source as "Computation/"]




Commodity Nodes Like This Already Exist!

  Scalable Informatics – Pegasus GPU+Cell Node
    4-16 AMD or Intel cores
    8-128 GB RAM
    One or more Tesla GPU cards
    One or more GA-180 (PXCAB) Cell cards

Slide 32

How can we develop for a cluster of such nodes?




OpenCL – One Possibility

  OpenCL is a programming framework for accelerated compute nodes
    Runtime – handles work distribution and JIT compilation
      Fully static embedded kernels are supported
      Topology interrogation
    API – process launch, communication and synchronization
    OpenCL C – kernel programming language
      Abstraction layer for SIMD vector types and intrinsics
      Explicit dependency specification of kernel parameters

  Host process controls one or more attached devices

  Still missing
    Heterogeneous device support
    Support for clusters (host-to-host communication)
    Build/configuration system

Slide 33




Hybrid Compute Node

Slide 34

[Figure: hybrid compute node software stack: a node control process over an abstraction layer, with OpenCL and OpenMPI (h) driving C/C++/OpenCL C kernel logic, and OpenMPI over the IB interconnect]

  Hybrid OpenMPI
    Working version in use on Roadrunner architecture
    Extends MPI interface
      Process launch on multiple architectures
      Introduces hierarchical communicators
    Available in next release

  OpenCL
    No current support for peer-to-peer communication between hosts

  Others
    CellSs, OpenMP




Open Science on Roadrunner

  Internal peer-reviewed process identified 9 projects
    Bio-Fuels, Astrophysics, Plasma Physics, Phylogenetics, Atmospheric Science, Molecular Dynamics

  Projects have been awarded allocations on full machine
    Many are currently underway
    Window of 3-4 months before the machine goes behind-the-fence

  Cerrillos
    162 TF Roadrunner architecture (2 CUs)
    Call has been issued and proposals are being evaluated
    Allocations will begin soon!

  Education
    Hands-On Cell programming class (second offering currently underway)
    Student program
    Development allocations for collaboration

Slide 35




Slide 36

Supernovae

  The final event in the evolution of a sufficiently massive star is a supernova

  SNSPH, developed at LANL by Chris Fryer and Mike Warren, is a parallel three-dimensional smoothed particle hydrodynamics code

  Simulations conducted on Roadrunner will allow comparison with data from actual light curves and spectra from supernova observations

  Large Synoptic Survey Telescope (LSST) and the Joint Dark Energy Mission (JDEM)

Work on Roadrunner will extend these simulations by calculating light curves and spectra from full radiation-hydrodynamic models of these explosions.




Shock Compression of Metals

  Dislocation interactions such as line defects determine the strength of metals

  Roadrunner will finally allow us to realize the promise of computational science by bridging the gap between simulation and experiment

Slide 37

This animation shows a shock front traveling through polycrystalline Fe causing a phase transformation from the bcc (gray) to hcp (red) and fcc (green) structure




Turbulent Mixing in Buoyancy Driven Flows

  Material mixing down to the molecular scale in the presence of turbulence-induced stirring is an important process in many areas

  Most studies to date address the Boussinesq case

  Significant and unexpected differences in the mixing process occur as the material density parameters diverge

  These animations highlight the complexity of the mixing process, illustrating the new physics associated with mixing at large density differences

Slide 38




Laser Plasma Interactions (LPI)

Slide 39




Thanks!

  Our HPC Division staff is committed to making Roadrunner succeed
    Meghan Wingate
    Mark Vernon
    Phil Church
    Randall Rheinheimer

  Applications’ developers have done amazing work
    Sriram Swaminarayan
    Tim Kelley
    Paul Henning
    Jamal Mohd-Yusof

  Special thanks to Larry Cox for funding Khronos membership!

Slide 40
