Operated by Los Alamos National Security, LLC for DOE/NNSA

LA-UR 09-02032

Total Work-Flow: Exploiting Hybrid Computing Architectures for

Scientific Computing

Ben Bergen

Computational Physics (CCS-2)

Los Alamos National Laboratory

Brian Albright (X-1), Kevin Bowers (D.E. Shaw), Lin Yin (X-1), William Daughton (X-1)

ScicomP 15


Overview

  Roadrunner System Overview

  Basic Considerations and Programming Models

  Adapting VPIC Kinetic Plasma Code to Roadrunner

  Optimizing Total Workflow

  Open Science on Roadrunner

Slide 2


Roadrunner is a Cluster

Slide 3


Roadrunner is a Cluster of Clusters

Slide 4


Roadrunner is a Cluster of Clusters with Accelerators

Slide 5


Triblade Compute Node

Slide 6


Original Blade Topology

Slide 7

  One-to-one affinity between Opteron core and Cell processor

  Newer versions of DaCS support two-to-one affinity

  Whether four-to-one affinity is supported is unconfirmed


Roadrunner: Basic Considerations for Adaptation

  First hybrid supercomputer of the current generation incorporating x86_64, PowerPC, and SPU ISAs.

  Codes require three executables
    x86_64 executable runs on the Opteron host processor
    PowerPC executable runs on the Power Processing Element (PPE) accelerator processor
    SPU threads run on the eight Synergistic Processing Element (SPE) special-purpose vector unit processors
    Three compilers: gcc, ppu-gcc, spu-gcc (also XL C/C++)

  Design considerations: Process launch and synchronization

Slide 8

Roadrunner has three different architectures

Opteron

PowerPC

SPE


Roadrunner: Basic Considerations for Adaptation

  Incorporates main memory on the Opteron and Cell eDP blades plus the local store user-controlled SRAM on the SPEs

  Codes that run on Roadrunner must handle communication between these memory spaces
    Distributed memory communication between Opteron hosts
    Point-to-point communication between Opteron host and Cell accelerator
    Direct Memory Access (DMA) communication between Cell main memory and SPE local store memory

  Opteron and Cell have different endianness
    Some byte-swapping is necessary

  Cell blades are diskless

  Design considerations: Communication and I/O

Slide 9

Roadrunner has three different address spaces
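The byte-swapping requirement above can be illustrated with a minimal sketch. It assumes the payload is a buffer of 32-bit words (for example, single-precision floats or 32-bit indices). The function name swap_words_inplace is hypothetical and is not taken from VPIC or DaCS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of each 32-bit word in a buffer, in place.
 * Working on raw bytes avoids reinterpreting float data as integers.
 * Hypothetical helper for illustration only. */
static void swap_words_inplace(void *buf, size_t nwords)
{
    assert(buf != NULL || nwords == 0);
    unsigned char *b = buf;
    for (size_t k = 0; k < nwords; k++, b += 4) {
        unsigned char t0 = b[0], t1 = b[1];
        b[0] = b[3];
        b[1] = b[2];
        b[2] = t1;
        b[3] = t0;
    }
}
```

A buffer exchanged between the little-endian Opteron and the big-endian Cell would pass through a routine like this once, on one side of the link. Applying it twice restores the original data.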


Roadrunner: Basic Considerations for Adaptation

  Process launch and synchronization
    MPI, DaCS/ALF, libSPE2

  Communication
    MPI, DaCS/ALF, libSPE2

Slide 10

Multiple tools and programming models

MPI

DaCS

libSPE2

Hierarchical/heterogeneous advantages

  Fault tolerance
    Faults can be caught at multiple levels

  Scalability
    Strong scalability is possible on SPEs
    Weak scalability through distributed memory


Programming Models: Host-Centric (Function Offload)

Slide 11

Opteron

Cell

Pros

  Allows staged development
    Existing MPI codes will run on Opterons

  Synchronous or asynchronous function offload to accelerator

  Minimizes reliance on PPE (poor performer!)

Cons

  Potential data-movement bottleneck

  Offload cost must be amortized by work done on accelerator


Programming Models: Accelerator-Centric

Slide 12

Opteron

Cell

Pros

  Also allows staged development
    Existing MPI codes will run on PowerPC (PPE)

  Hides complexity of hybrid architecture

  Avoids data-movement bottleneck

Cons

  Heavier reliance on PPE

  Computationally intensive portions of code must run on SPEs

  Requires “relay” to forward message traffic


Message Passing Relay

Slide 13

Cell

Opteron

Cell

Opteron

Direct point-to-point communication is not possible between Cells


Message Passing Relay

Slide 14

Cell

Opteron

Cell

Opteron

Data Data

Relay forwards messages through hosts to peer


Programming Models: All Roads Lead Everywhere

Slide 15

Opteron

Cell

There is a natural evolution of both of these approaches into a fully hybrid computing model

  Initial difference is in the program locus, or control process

  In the “evolved” model the host process runs a task queue

  Tasks may be offloaded to other host-type cores or to accelerators

  Task data may live in worker’s memory to avoid data-movement bottlenecks

  More on how we can use this to follow!
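The host-side task queue described above might look like the following minimal sketch. It is a single-threaded FIFO in which each task is tagged with the kind of worker it is meant for. All type and function names here are hypothetical, and a real scheduler would dispatch by target instead of running every task on the host.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { WORKER_HOST_CORE, WORKER_ACCELERATOR } worker_kind_t;

typedef struct {
    void (*run)(void *data);   /* task body                          */
    void *data;                /* task data; may live with the worker */
    worker_kind_t target;      /* where the task is meant to execute  */
} task_t;

#define QUEUE_CAPACITY 64

typedef struct {
    task_t slots[QUEUE_CAPACITY];
    size_t head, count;
} task_queue_t;

static void queue_init(task_queue_t *q) { q->head = 0; q->count = 0; }

/* Append a task; returns -1 if the queue is full. */
static int queue_push(task_queue_t *q, task_t t)
{
    assert(t.run != NULL);
    if (q->count == QUEUE_CAPACITY) return -1;
    q->slots[(q->head + q->count) % QUEUE_CAPACITY] = t;
    q->count++;
    return 0;
}

/* Remove the oldest task; returns -1 if the queue is empty. */
static int queue_pop(task_queue_t *q, task_t *out)
{
    if (q->count == 0) return -1;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 0;
}

/* Run every queued task in FIFO order and return how many ran.
 * This sketch ignores t.target and executes on the calling thread. */
static size_t queue_drain(task_queue_t *q)
{
    size_t ran = 0;
    task_t t;
    while (queue_pop(q, &t) == 0) { t.run(t.data); ran++; }
    return ran;
}

/* Example task body: increment an int counter. */
static void increment_task(void *data) { ++*(int *)data; }
```

Keeping only a pointer to the task data in the queue entry matches the idea that the data can stay in the worker's memory while the host manages control flow.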

Scheduler

Cell/GPU

Opteron Core


Slide 16

Particle-In-Cell (PIC) Methods Simulate Plasma Physics

  One application of VPIC is to simulate Laser Plasma Interactions (LPI) critical to understanding Inertial Confinement Fusion (ICF) at the National Ignition Facility (NIF)

  Several difficulties arise during the compression of hohlraum capsules
    Laser scattering – not enough energy to compress capsule
    Laser scattering – laser does not target desired areas (unsymmetric compression)
    Pre-heating – electrons heat plasma making compression more difficult

LLNL pF3D modeling of a laser beam

VPIC modeling of a single laser speckle

Integrated LLNL Hydra modeling of ICF experiment


Slide 17

Particle-In-Cell Method

Time iteration: Advance Particles -> Accumulate Currents -> Update Fields -> Interpolate Field Effects -> (repeat)

[Figure: spatial domain showing grids and particles]


Slide 18

VPIC – Vector Particle-In-Cell

  3D, fully relativistic, electromagnetic Particle-In-Cell (PIC) code
    Self-consistent evolution of a kinetic plasma
    Charge conserving (no implicit solve)

  Optimized for data motion
    Single precision – half the memory traffic and double the theoretical peak, relative to double precision
    Single-pass particle processing
    Field interpolation coefficients are pre-computed

  Optimized for modern architectures
    Uses short-vector SIMD intrinsics (SSE, AltiVec, SPU)

  Assumes that particles do not leave the voxel in which they started
    Exceptions are handled separately

  O(N) particle sorting
    Improves spatial locality of particle data access
    Improves temporal locality of field data access
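One common way to sort particles by voxel in O(N) time is a counting sort keyed on the voxel index. The slides do not say which algorithm VPIC uses, so the sketch below is an illustration of the idea only. The function name and the out-of-place interface are assumptions. The struct layout follows the particle_t shown in the deck.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct particle {
    float dx, dy, dz;  /* position (relative to voxel)        */
    int32_t i;         /* index of voxel containing particle  */
    float ux, uy, uz;  /* particle normalized momentum        */
    float q;           /* particle charge                     */
} particle_t;

/* Stable counting sort of n particles by voxel index, from in[] to out[].
 * Runs in O(n + num_voxels). Returns 0 on success, -1 if the temporary
 * histogram cannot be allocated. Hypothetical helper, not VPIC's API. */
static int sort_particles_by_voxel(const particle_t *in, particle_t *out,
                                   size_t n, int32_t num_voxels)
{
    assert(in != NULL && out != NULL && in != out && num_voxels > 0);
    size_t *start = calloc((size_t)num_voxels + 1, sizeof *start);
    if (start == NULL) return -1;

    /* Pass 1: histogram, stored one slot to the right. */
    for (size_t p = 0; p < n; p++) {
        assert(in[p].i >= 0 && in[p].i < num_voxels);
        start[in[p].i + 1]++;
    }
    /* Prefix sum: start[v] becomes the first output slot for voxel v. */
    for (int32_t v = 0; v < num_voxels; v++)
        start[v + 1] += start[v];
    /* Pass 2: scatter each particle into its voxel's next free slot. */
    for (size_t p = 0; p < n; p++)
        out[start[in[p].i]++] = in[p];

    free(start);
    return 0;
}
```

After sorting, particles that share a voxel are contiguous, so the field data for that voxel is fetched once and reused, which is the locality benefit the slide describes.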


Slide 19

Porting to Roadrunner (things that we did)

  Message Passing Relay (MP Relay)
    Flattens communication topology
    Allows logical point-to-point communication between Cell processors
    Abstracts remote I/O layer for restart and visualization dumps

  Pipelined execution
    Code restructured for data-parallel thread execution
    Current support for serial, pthreads, and SPE threads
    Simple, common interface: init(), finalize(), execute(function_t), sync()

  Particle data structures
    Optimized for efficient communication via DMA requests
    Can be tuned to cache size on traditional cached-memory architectures (padding)

  Voxel cache (access to Field data)
    Fully associative least recently used (LRU) policy
    Simple interface: voxel_cache_fetch() and voxel_cache_wait()

  Text overlay support
    Allows acceleration of field advance, particle sorting and accumulators
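The replacement policy of a fully associative LRU cache can be sketched in a few lines. This is not the VPIC voxel cache: the slides give only the names voxel_cache_fetch() and voxel_cache_wait(), without signatures, so the sketch uses its own hypothetical names (voxel_lru_t, vc_init, vc_touch) and tracks only which voxel occupies which line. A real implementation would also start a DMA fetch of the field data on a miss.

```c
#include <assert.h>
#include <stdint.h>

#define VC_LINES 4      /* number of cache lines (illustrative size) */
#define VC_EMPTY (-1)   /* marker for an unused line                 */

typedef struct {
    int32_t  voxel[VC_LINES];     /* voxel index held by each line   */
    uint64_t last_use[VC_LINES];  /* logical time of last access     */
    uint64_t clock;               /* incremented on every access     */
    unsigned hits, misses;
} voxel_lru_t;

static void vc_init(voxel_lru_t *c)
{
    for (int k = 0; k < VC_LINES; k++) {
        c->voxel[k] = VC_EMPTY;
        c->last_use[k] = 0;
    }
    c->clock = 0;
    c->hits = 0;
    c->misses = 0;
}

/* Look up a voxel in any line (fully associative). On a hit, refresh its
 * timestamp. On a miss, evict the least recently used line and claim it.
 * Returns the line index that now holds the voxel. */
static int vc_touch(voxel_lru_t *c, int32_t voxel)
{
    assert(voxel >= 0);
    int victim = 0;
    c->clock++;
    for (int k = 0; k < VC_LINES; k++) {
        if (c->voxel[k] == voxel) {
            c->last_use[k] = c->clock;
            c->hits++;
            return k;
        }
        if (c->last_use[k] < c->last_use[victim]) victim = k;
    }
    /* Miss: this is where a DMA fetch of the field data would be issued. */
    c->voxel[victim] = voxel;
    c->last_use[victim] = c->clock;
    c->misses++;
    return victim;
}
```

Because particles are sorted by voxel, consecutive lookups tend to hit the same few lines, so even a small cache in the local store can absorb most field accesses.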


Pipeline Abstraction

Slide 20

[Figure: master thread timeline with stages init, execute, sync, finalize]

  Worker threads block for execute message to reduce thread creation overhead
    pthreads implementation uses condition variables
    SPE implementation uses mailboxes

  SPE symbols are exposed to the PPE through _SPUEAR_ linker magic
    Function call is implemented through mailbox message
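The pthreads flavor of this abstraction can be sketched as a fixed pool of workers that sleep on a condition variable until the master posts work. This is an illustration under assumptions, not VPIC's implementation. The pipeline_ prefix, the function_t signature, the worker count, and the rule that sync must precede the next execute are all choices made for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4

/* Assumed signature; the slides only name the type function_t. */
typedef void (*function_t)(int worker_id, void *args);

static pthread_t       workers[NUM_WORKERS];
static int             worker_ids[NUM_WORKERS];
static pthread_mutex_t lock       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  work_done  = PTHREAD_COND_INITIALIZER;
static function_t      current_fn;
static void           *current_args;
static unsigned long   generation;     /* bumped once per execute call   */
static int             pending;        /* workers still running the job  */
static int             shutting_down;

static void *worker_main(void *arg)
{
    int id = *(int *)arg;
    unsigned long seen = 0;            /* last generation this worker ran */
    pthread_mutex_lock(&lock);
    for (;;) {
        /* Block until new work is posted or shutdown is requested. */
        while (generation == seen && !shutting_down)
            pthread_cond_wait(&work_ready, &lock);
        if (generation == seen) break; /* shutdown with no new work */
        seen = generation;
        function_t fn = current_fn;
        void *args = current_args;
        pthread_mutex_unlock(&lock);
        fn(id, args);                  /* run the job outside the lock */
        pthread_mutex_lock(&lock);
        if (--pending == 0) pthread_cond_signal(&work_done);
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void pipeline_init(void)
{
    generation = 0; pending = 0; shutting_down = 0;
    for (int k = 0; k < NUM_WORKERS; k++) {
        worker_ids[k] = k;
        int rc = pthread_create(&workers[k], NULL, worker_main, &worker_ids[k]);
        assert(rc == 0);
        (void)rc;
    }
}

/* Post one job to all workers; pipeline_sync must complete before the next call. */
static void pipeline_execute(function_t fn, void *args)
{
    pthread_mutex_lock(&lock);
    assert(pending == 0);
    current_fn = fn; current_args = args;
    pending = NUM_WORKERS;
    generation++;
    pthread_cond_broadcast(&work_ready);
    pthread_mutex_unlock(&lock);
}

/* Block the master until every worker has finished the current job. */
static void pipeline_sync(void)
{
    pthread_mutex_lock(&lock);
    while (pending > 0) pthread_cond_wait(&work_done, &lock);
    pthread_mutex_unlock(&lock);
}

static void pipeline_finalize(void)
{
    pipeline_sync();
    pthread_mutex_lock(&lock);
    shutting_down = 1;
    pthread_cond_broadcast(&work_ready);
    pthread_mutex_unlock(&lock);
    for (int k = 0; k < NUM_WORKERS; k++) pthread_join(workers[k], NULL);
}

/* Example job: each worker writes its id + 1 into its own slot. */
static void fill_slot(int worker_id, void *args)
{
    ((int *)args)[worker_id] = worker_id + 1;
}
```

The threads are created once in pipeline_init and reused for every execute call, which is how blocking on an execute message avoids repeated thread creation. An SPE version would keep the same four entry points and replace the condition variables with mailbox reads and writes.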


Slide 21

  Data are processed in segments of even multiples of 16 particles
    Segments are accessed in blocks of up to 512 particles (16 KB largest possible single DMA request)
    Triple-buffered: streaming data paradigm (read, update, write)

  Block processing groups particles in sets of 4
    Optimal for single-precision SIMD operations
    Inner loop is 4x hand unrolled

VPIC applies best strategy to particle advance

#include <stdint.h>  // int32_t

typedef struct particle {
  float dx, dy, dz;  // position (relative to voxel)
  int32_t i;         // index of voxel containing particle
  float ux, uy, uz;  // particle normalized momentum
  float q;           // particle charge
} particle_t;

32 bytes
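The block arithmetic on this slide follows from the particle size: 16 KB per DMA request divided by 32 bytes per particle gives 512 particles per block. The sketch below works that out and shows one way to rotate three buffers through the read, update, and write roles. The helper names are hypothetical. Rounding a segment down to a multiple of 16 is an assumption about how leftover particles are treated, and the rotation scheme is one possible arrangement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct particle {
    float dx, dy, dz;  /* position (relative to voxel)        */
    int32_t i;         /* index of voxel containing particle  */
    float ux, uy, uz;  /* particle normalized momentum        */
    float q;           /* particle charge                     */
} particle_t;

enum {
    PARTICLE_BYTES   = 32,                             /* stated on the slide */
    MAX_DMA_BYTES    = 16 * 1024,                      /* largest single DMA  */
    BLOCK_PARTICLES  = MAX_DMA_BYTES / PARTICLE_BYTES, /* = 512              */
    SEGMENT_MULTIPLE = 16,
    NUM_BUFFERS      = 3
};

/* Largest multiple of 16 not exceeding n (assumption: the remainder is
 * handled by a separate path). */
static size_t segment_size(size_t n) { return n - n % SEGMENT_MULTIPLE; }

/* Number of DMA blocks needed to stream a segment. */
static size_t num_blocks(size_t segment_particles)
{
    return (segment_particles + BLOCK_PARTICLES - 1) / BLOCK_PARTICLES;
}

/* Particle count of a given block; only the last block can be short. */
static size_t block_length(size_t segment_particles, size_t block)
{
    size_t first = block * BLOCK_PARTICLES;
    assert(first < segment_particles);
    size_t left = segment_particles - first;
    return left < BLOCK_PARTICLES ? left : BLOCK_PARTICLES;
}

/* Triple buffering: at pipeline step s, one buffer receives block s,
 * one holds block s-1 being updated, one holds block s-2 being written. */
static int read_buffer(size_t step)   { return (int)(step % NUM_BUFFERS); }
static int update_buffer(size_t step) { return (int)((step + 2) % NUM_BUFFERS); }
static int write_buffer(size_t step)  { return (int)((step + 1) % NUM_BUFFERS); }
```

With this rotation the buffer filled at step s is the one updated at step s+1 and written back at step s+2, so the three roles never touch the same buffer in the same step.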


Slide 22

Overlays

  VPIC’s particle advance logic maxes out the Local Store (LS)
    Particle advance data uses 206 KB
    This leaves ~50 KB for text (machine instructions)

  Overlays are segments of text that can be loaded/unloaded from LS
    Expand the effective maximum size of an SPE program
    Avoid overhead of starting new SPE threads (prohibitive)
    Limited by management table size

  IBM has implemented overlay support as a software cache
    Overlay manager fetches text that is not in LS (DMA call)
    No prefetch capability


Slide 23

Overlay Properties

[Figure: SPE Local Store (Data, Root Segment, Region 1, Region 2) and Main Memory (overlay segments SA, SB, SC, SD, SE, SF)]


Slide 24

Overlay Properties

Text is partitioned into regions with a static root segment

[Figure: same layout of SPE Local Store and Main Memory, with the text portion of the Local Store highlighted]


Slide 25

Overlay Properties

Each region can be filled by specific segments of text

[Figure: same layout, with the segments for region 1 highlighted in Main Memory]


Slide 26

Overlay Properties

The size of a region is determined by its largest segment

[Figure: same layout, with segment sizes labeled 32 KB, 28 KB, 32 KB, 20 KB]
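The sizing rule can be written down directly: a region must be as large as the largest segment that may be loaded into it, and the text footprint in the local store is the root segment plus all regions. The function names below are hypothetical. The slide does not say which segment belongs to which region, so any grouping passed to these helpers is the caller's assumption.

```c
#include <assert.h>
#include <stddef.h>

/* A region must hold any one of its candidate segments, so its size is
 * the maximum of their sizes. */
static size_t region_size(const size_t *segment_bytes, size_t count)
{
    assert(segment_bytes != NULL && count > 0);
    size_t largest = segment_bytes[0];
    for (size_t k = 1; k < count; k++)
        if (segment_bytes[k] > largest) largest = segment_bytes[k];
    return largest;
}

/* Local-store space taken by text: the always-resident root segment
 * plus one slot per region. */
static size_t overlay_text_footprint(size_t root_bytes,
                                     const size_t *region_bytes,
                                     size_t num_regions)
{
    assert(region_bytes != NULL || num_regions == 0);
    size_t total = root_bytes;
    for (size_t k = 0; k < num_regions; k++) total += region_bytes[k];
    return total;
}
```

For instance, a region shared by a 32 KB and a 28 KB segment occupies 32 KB of local store no matter which of the two is currently loaded.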


Slide 27

Overlay Properties

Loading a new segment overwrites its respective region

[Figure: same layout, with segment SB loaded into the Local Store]


Slide 28

Overlays

  VPIC has been extended to support overlays
    Particle advance, accumulators and particle sorting have been accelerated
    Current code decomposition uses one region with two segments

  Even this fairly trivial approach has difficulties
    VPIC overlay strategy uses stack for data buffers
    Stack placement changes with linkage (even with only trivial code changes)
    Silent overflows into memory handled by overlay manager
    Data corruption, hangs, segmentation faults…

  May need to implement light-weight heap
    void * spu_malloc(), spu_free_all()
    Would reserve 224 KB of LS space for heap
    Actual implementation will use static byte array in root segment
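A light-weight heap of this kind could be a bump allocator over a static byte array. The sketch below is a guess at the shape of such a design, not the actual VPIC code. The slide gives only the names spu_malloc() and spu_free_all(), so the size_t parameter, the 16-byte alignment, and the NULL return on exhaustion are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_BYTES (224u * 1024u)  /* reservation quoted on the slide */
#define HEAP_ALIGN 16u             /* assumed alignment of each block */

/* Static backing store; on the SPE this would live in the root segment. */
static _Alignas(16) unsigned char heap[HEAP_BYTES];
static size_t heap_used;

/* Hand out the next aligned chunk, or NULL if the heap is exhausted.
 * Unlike stack buffers, an overflow here is detected, not silent. */
static void *spu_malloc(size_t bytes)
{
    size_t rounded = (bytes + HEAP_ALIGN - 1) & ~(size_t)(HEAP_ALIGN - 1);
    if (rounded < bytes || rounded > HEAP_BYTES - heap_used)
        return NULL;
    void *p = heap + heap_used;
    heap_used += rounded;
    return p;
}

/* Release everything at once; there is no per-block free. */
static void spu_free_all(void)
{
    assert(heap_used <= HEAP_BYTES);
    heap_used = 0;
}
```

Because the array has a fixed address and size at link time, buffer placement no longer shifts with the stack when code changes, which addresses the linkage sensitivity listed above.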


Slide 29

VPIC Highlights

  1.00 Trillion Particle Run (Poughkeepsie just after stand-up)
    Aggressive test of full system
    Achieved sustained performance of >374.25 TF
      11% of theoretical max performance (single precision 3.0 PF)
    Gordon Bell Prize Finalist SC2008
    Cell processes used 42.8 TB RAM (93.8% of available Cell memory)
    Opteron processes used 7.3 TB RAM

  Science Runs: Back-scatter in laser plasma interactions
    Current bread-and-butter runs on 6 CUs (4,096 ranks : 32,768 threads)
    Next set of runs will be at 16 CUs (11,520 ranks : 92,160 threads)
    11x speedup over Opteron-only
    Excellent machine stability (main difficulty is I/O subsystem)


Efficiency is a Poor Metric for Performance

  Real computational workflows expose many more bottlenecks
    Data I/O for visualization and restarts
    Data rendering for visualization
    Diagnostics and statistical analysis

  Many of these steps can be handled concurrently
    Once the pipeline is full, we can fully subscribe a hybrid compute node
    Reduces vulnerability to machine instabilities by reducing total time to solution
    Special purpose accelerators can be targeted to specific tasks
    Host process only manages tasks

  VPIC will be enhanced to address these issues
    Initial enhancements will use pipeline abstraction
    OpenCL implementation planned

Slide 30

Hybrid Computing Architectures Can Help Us!


Exploiting Hybrid Architectures

Slide 31

[Figure: hybrid node diagram with labels Scheduler, Host Process/Task Queue, Opteron, Opteron Core, Cell, GPU, Computation, Rendering, Disk I/O]


Commodity Nodes Like This Already Exist!

  Scalable Informatics – Pegasus GPU+Cell Node
    4-16 AMD or Intel cores
    8-128 GB RAM
    One or more Tesla GPU cards
    One or more GA-180 (PXCAB) Cell cards

Slide 32

How can we develop for a cluster of such nodes?


OpenCL – One Possibility

  OpenCL is a programming framework for accelerated compute nodes
    Runtime – handles work distribution and JIT compilation
      Fully static embedded kernels are supported
      Topology interrogation
    API – process launch, communication and synchronization
    OpenCL C – kernel programming language
      Abstraction layer for SIMD vector types and intrinsics
      Explicit dependency specification of kernel parameters

  Host process controls one or more attached devices

  Still missing
    Heterogeneous device support
    Support for clusters (host-to-host communication)
    Build/configuration system

Slide 33


Hybrid Compute Node

Slide 34

[Figure: hybrid compute node software stack with labels Node Control Process, Abstraction Layer, OpenCL, OpenMPI (h), OpenMPI, C/C++/OpenCL C Kernel Logic, IB Interconnect]

  Hybrid OpenMPI
    Working version in use on Roadrunner architecture
    Extends MPI Interface
      Process launch on multiple architectures
      Introduces hierarchical communicators
    Available in next release

  OpenCL
    No current support for peer-to-peer communication between hosts

  Others
    CellSs, OpenMP


Open Science on Roadrunner

  Internal peer-reviewed process identified 9 projects
    Bio-Fuels, Astrophysics, Plasma Physics, Phylogenetics, Atmospheric Science, Molecular Dynamics

  Projects have been awarded allocations on full machine
    Many are currently underway
    Window of 3-4 months before the machine goes behind-the-fence

  Cerrillos
    162 TF Roadrunner architecture (2 CUs)
    Call has been issued and proposals are being evaluated
    Allocations will begin soon!

  Education
    Hands-On Cell programming class (second offering currently underway)
    Student program
    Development allocations for collaboration

Slide 35


Slide 36

Supernovae

  The final event in the evolution of a sufficiently massive star is a supernova

  SNSPH, developed at LANL by Chris Fryer and Mike Warren, is a parallel three-dimensional smoothed particle hydrodynamics code

  Simulations conducted on Roadrunner will allow comparison with data from actual light curves and spectra from supernova observations

  Large Synoptic Survey Telescope (LSST) and the Joint Dark Energy Mission (JDEM)

Work on Roadrunner will extend these simulations by calculating light curves and spectra from full radiation-hydrodynamic models of these explosions.


Shock Compression of Metals

  Dislocation interactions such as line defects determine the strength of metals

  Roadrunner will finally allow us to realize the promise of computational science by bridging the gap between simulation and experiment

Slide 37

This animation shows a shock front traveling through polycrystalline Fe causing a phase transformation from the bcc (gray) to hcp (red) and fcc (green) structure


Turbulent Mixing in Buoyancy Driven Flows

  Material mixing down to the molecular scale in the presence of turbulence-induced stirring is an important process in many areas

  Most studies to date address the Boussinesq case

  Significant and unexpected differences in the mixing process occur as the material density parameters diverge

  These animations highlight the complexity of the mixing process, illustrating the new physics associated with mixing at large density differences

Slide 38


Laser Plasma Interactions (LPI)

Slide 39


Thanks!

  Our HPC Division staff is committed to making Roadrunner succeed
    Meghan Wingate
    Mark Vernon
    Phil Church
    Randall Rheinheimer

  Applications’ developers have done amazing work
    Sriram Swaminarayan
    Tim Kelley
    Paul Henning
    Jamal Mohd-Yusof

  Special thanks to Larry Cox for funding Khronos membership!

Slide 40