Blue Waters and Resource Management - Now and in the Future
DESCRIPTION
In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future. Watch the video of this presentation: http://insidehpc.com/?p=36343
TRANSCRIPT
Blue Waters and Resource Management – Now and in the Future
Dr. William Kramer, National Center for Supercomputing Applications, University of Illinois
Science & Engineering on Blue Waters
Blue Waters will enable advances in a broad range of science and engineering disciplines. Examples include:
• Molecular Science
• Weather & Climate Forecasting
• Earth Science
• Astro*
• Health
What Have I Been Doing?
Science Area | Teams | Codes
Climate and Weather | 3 | CESM, GCRM, CM1/WRF, HOMME
Plasmas/Magnetosphere | 2 | H3D(M), VPIC, OSIRIS, Magtail/UPIC
Stellar Atmospheres and Supernovae | 5 | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS
Cosmology | 2 | Enzo, pGADGET
Combustion/Turbulence | 2 | PSDNS, DISTUF
General Relativity | 2 | Cactus, Harm3D, LazEV
Molecular Dynamics | 4 | AMBER, Gromacs, NAMD, LAMMPS
Quantum Chemistry | 2 | SIAL, GAMESS, NWChem
Material Science | 3 | NEMOS, OMEN, GW, QMCPACK
Earthquakes/Seismology | 2 | AWP-ODC, HERCULES, PLSQR, SPECFEM3D
Quantum Chromodynamics | 1 | Chroma, MILC, USQCD
Social Networks | 1 | EPISIMDEMICS
Evolution | 1 | Eve
Engineering/System of Systems | 1 | GRIPS, Revisit
Computer Science | 1 |
(For each science area the original table also marked which algorithmic motifs its codes use: structured grids, unstructured grids, dense matrix, sparse matrix, N-body, Monte Carlo, FFT, PIC, and significant I/O.)
Blue Waters Computing System
• Cray system with Gemini high-speed network (HSN); aggregate memory 1.5 PB
• Online storage (Sonexion): 26 usable PB (>25 PB online for /home, /project, /scratch), >1 TB/s
• Near-line storage (Spectra Logic): 300 usable PB (380 PB raw tape)
• External servers: 28 Dell 720 IE servers, 50 Dell 720 near-line servers, 4 Dell esLogin servers, 1.2 PB usable disk
• Networking: LNET routers, rSIP and network gateways, core FDR/QDR IB Extreme switches, 10/40/100 Gb Ethernet switch, 440 Gb/s Ethernet from the site network, 100-300 Gbps WAN
• Protocols: LNET, TCP/IP (10 GbE), SCSI (FCP), FC8, GridFTP (TCP/IP)
All storage sizes are given as usable capacity. Rates are always usable/measured sustained rates.
Cray XE6/XK7 - 276 Cabinets
• XE6 compute nodes: 5,688 blades, 22,640 nodes, 362,240 FP (Bulldozer) cores, 724,480 integer cores; 4 GB per FP core
• XK7 GPU nodes: 768 blades, 3,072 (4,224) nodes, 24,576 (33,792) FP cores, 4,224 GPUs; 4 GB per FP core
• Service nodes: BOOT 2, SDB 2, DSL 48, resource manager (MOM) 64, H2O login 4, network gateway 8, RSIP 12, LNET routers 582, reserved 74, plus import/export nodes, HPSS data mover nodes, and a management node
• Gemini fabric (HSN), InfiniBand fabric, 10/40/100 Gb Ethernet switch, boot RAID, boot cabinet, SMW
• Sonexion online storage: 25+ usable PB in 36 racks; near-line storage: 300+ usable PB
• esServers cabinets, NCSAnet, NPCF, SCUBA
• Supporting systems: LDAP, RSA, Portal, JIRA, Globus CA, Bro, test systems, Accounts/Allocations, CVS, Wiki
• Cyber protection IDPS
BW Focus on Sustained Performance
• Blue Waters and NSF are focusing on sustained performance in a way few have before.
• Sustained performance is the computer's useful, consistent performance on the broad range of applications that scientists and engineers use every day.
  • Time to solution for a given amount of work is the important metric, not hardware ops/s.
  • Sustained performance (and therefore the tests) includes the time to read data and write the results.
• NSF's Track-1 call emphasized sustained performance, demonstrated on a collection of application benchmarks (application + problem set).
  • Not just simplistic metrics (e.g., HP Linpack).
  • Applications include both petascale applications (which effectively use the full machine, solving scalability problems for both compute and I/O) and applications that use a large fraction of the system.
• The Blue Waters project focus is on delivering sustained petascale performance to computational and data-focused applications.
  • Develop tools, techniques, and samples that exploit all parts of the system.
  • Explore new tools, programming models, and libraries to help applications get the most from the system.
• By the Sustained Petascale Performance (SPP) metric, Blue Waters sustained >1.3 petaflops across 12 different time-to-solution application tests (a rough sketch of such a composite metric follows).
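The >1.3 figure is a composite over the twelve SPP application tests. As a hedged illustration only (the exact SPP weighting is not stated here; the function name and numbers below are made up), a sustained-performance composite of this kind is commonly computed as a geometric mean of per-application sustained rates, each derived from a reference operation count and the measured time to solution, including I/O:

```python
from math import prod

def sustained_performance(apps):
    """Composite sustained performance as the geometric mean of
    per-application sustained rates (hypothetical SPP-style metric).

    apps: list of (reference_op_count, measured_time_to_solution_seconds)
    Returns a rate in reference_op_count units per second.
    """
    rates = [ops / seconds for ops, seconds in apps]
    return prod(rates) ** (1.0 / len(rates))

# Illustrative (made-up) numbers: three applications with known operation
# counts and measured wall-clock times, including time to read and write data.
example = [
    (2.5e18, 1800.0),   # ~1.39 PF/s sustained
    (4.0e18, 3100.0),   # ~1.29 PF/s sustained
    (1.8e18, 1400.0),   # ~1.29 PF/s sustained
]
print(f"Sustained performance: {sustained_performance(example) / 1e15:.2f} PF/s")
```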
View from the Blue Waters Portal
As of April 2, 2013, Blue Waters has delivered over 1.3 Billion core-hours to S&E Teams
Usage Breakdown – Jan 1 to Mar 26, 2013
• Torque log accounting (NCSA, Mike Showerman)
[Chart: Accumulated XE node-hours, January 1 to March 26, 2013. X-axis: XE job size in nodes, powers of 2 from 1 to 32,768 (1 node = 32 cores). Y-axis: usage in millions of node-hours (0 to 35). Annotated job-size regions: >65,536 cores and >262,144 cores.]
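The breakdown above comes from Torque log accounting. As a hedged sketch of how such a per-job-size histogram can be derived (field names follow the common Torque accounting record layout; the exact records and filenames on Blue Waters may differ), node-hours can be binned by power-of-two job size roughly like this:

```python
import math
import re
from collections import defaultdict

def parse_walltime(hms):
    """Convert 'HH:MM:SS' to hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60.0 + s / 3600.0

def bin_node_hours(log_lines):
    """Accumulate node-hours by power-of-two job size from Torque accounting
    'E' (job end) records, assuming the usual 'date;type;jobid;key=value ...'
    layout with Resource_List.nodect and resources_used.walltime fields."""
    bins = defaultdict(float)
    for line in log_lines:
        parts = line.strip().split(";")
        if len(parts) < 4 or parts[1] != "E":
            continue
        fields = dict(re.findall(r"(\S+)=(\S+)", parts[3]))
        nodes = int(fields.get("Resource_List.nodect", 0))
        wall = fields.get("resources_used.walltime")
        if nodes == 0 or wall is None:
            continue
        size_bin = 2 ** math.ceil(math.log2(nodes)) if nodes > 1 else 1
        bins[size_bin] += nodes * parse_walltime(wall)
    return dict(sorted(bins.items()))

# Hypothetical example record: a 512-node job that ran for 2.5 hours.
sample = ["04/10/2013 12:00:00;E;1234.bw;user=alice Resource_List.nodect=512 "
          "resources_used.walltime=02:30:00"]
print(bin_node_hours(sample))   # {512: 1280.0} node-hours
```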
OBSERVATIONS AND THOUGHTS:
FLEXIBILITY IS THE WORD OF THE NEXT DECADE
What is Blue Waters already telling us about future @Scale systems?
Observation 1: Topology Matters
• Much of the work to improve the performance of early applications was understanding and tuning for layout/topology, even on dedicated systems.
  • Factors of almost 10 were seen for some applications.
• Nvidia's Linpack results are mostly due to topology-aware work layout.
  • Done with hand tuning, special node selection, etc.
  • This needs to become commonplace to really benefit users.
Topology Matters
• Even very small changes can have dramatic and unexpected consequences.
• Example: having just 1 down Gemini out of 6,114 can slow an application by >20%.
  • 0.0156% of components being unavailable can extend an application's run time by >20% if those components just happen to be in the wrong place.
  • P3DNS, 6,114 nodes.
Topology
• 1 poorly placed node out of 4,116 (0.02%) can slow an application by >30%.
  • On a dedicated system!
• It is hard to get an optimal topology assignment, especially in non-dedicated use, but it should be easy to avoid really detrimental topology assignments (a sketch of one way to score a placement follows).
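As a hedged illustration of avoiding "really detrimental" assignments (the torus dimensions and the scoring rule here are assumptions for illustration, not the Blue Waters scheduler's actual policy), a resource manager or application could score a candidate node set by its average pairwise hop distance on a 3D torus and flag clearly bad outliers:

```python
from itertools import combinations

TORUS_DIMS = (24, 24, 24)   # assumed 3D torus dimensions, illustration only

def torus_hops(a, b, dims=TORUS_DIMS):
    """Minimum hop count between two nodes at 3D torus coordinates a and b."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def placement_score(node_coords):
    """Average pairwise hop distance of an allocation; lower is better."""
    pairs = list(combinations(node_coords, 2))
    return sum(torus_hops(a, b) for a, b in pairs) / len(pairs)

# Two hypothetical 8-node allocations: one compact block, and one with a
# single straggler far away on the torus (the kind of placement that can
# dominate an application's run time).
compact  = [(x, y, 0) for x in range(2) for y in range(4)]
straggly = compact[:-1] + [(12, 12, 12)]
print(f"compact:  {placement_score(compact):.2f} avg hops")
print(f"straggly: {placement_score(straggly):.2f} avg hops")
```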
Topology Awareness Needed for All Types of Interconnects
• Tori
• Trees
• Hypercubes
• Direct connect & dragonflies
Performance and Scalability through Flexibility
• It is harder for applications to scale in the face of limited bandwidths.
• BW works with science teams and technology providers to:
  • Understand and develop better process-to-node mapping analysis to determine behavior and usage patterns (see the sketch after this list).
  • Build better instrumentation of what the network is really doing.
  • Provide topology-aware resource and systems management that enables and rewards topology-aware applications.
  • Support malleability, for both applications and systems.
  • Understand the topology given and maximize effectiveness.
  • Let applications express a desired topology based on their algorithms.
  • Provide middleware support.
• Even if applications scale, consistency becomes an increasing issue for systems and applications.
  • This will only get worse in future systems.
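One way an application can express a desired topology is to ask MPI for a Cartesian communicator and let the library reorder ranks to match. This is a minimal sketch, assuming mpi4py is available; how much a real reordering helps depends on the MPI implementation and the job's actual node placement:

```python
# Minimal sketch of application-side topology expression with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Ask MPI to factor the job into a 3D process grid and to reorder ranks so
# that Cartesian neighbors land close together on the network.
dims = MPI.Compute_dims(comm.Get_size(), [0, 0, 0])
cart = comm.Create_cart(dims, periods=[True, True, True], reorder=True)

# Each rank can now discover its grid coordinates and nearest neighbors,
# which is what a halo-exchange code would communicate with.
coords = cart.Get_coords(cart.Get_rank())
left, right = cart.Shift(0, 1)
print(f"rank {comm.Get_rank()} -> cart rank {cart.Get_rank()}, "
      f"coords {coords}, x-neighbors ({left}, {right})")
```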
Flexible Resiliency Modes
• Run again.
• Defensive I/O (traditional checkpoint/restart); see the sketch after this list.
  • Expensive: extra overhead for application and system.
  • Intrusive.
  • The I/O infrastructure is shared across all jobs.
• Newer C/R approaches (node memory copy, SSD, journaling, ...).
• Spare nodes in job requests to rebalance work after a single point of failure.
  • Wastes resources.
  • Runtimes do not support this well yet (but can do it).
• Redistribute work within the remaining nodes.
  • Charm++ and some MPI implementations.
  • Takes longer.
• Add spare nodes from the system pool to the job.
  • The job scheduler, resource manager, and runtime all have to be made more flexible.
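As a hedged sketch of the "defensive I/O" mode (file names and checkpoint cadence below are made up for illustration; real application checkpoints write far larger, structured state, usually in parallel), an application-level checkpoint/restart loop looks roughly like this:

```python
import os
import pickle

CHECKPOINT = "state.chk"          # hypothetical checkpoint file name
CHECKPOINT_EVERY = 100            # hypothetical cadence, in timesteps

def load_or_init():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "field": [0.0] * 1024}

def checkpoint(state):
    """Write atomically: write to a temp file, then rename over the old one,
    so a crash mid-write never corrupts the last good checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_or_init()
while state["step"] < 1000:
    # ... advance the simulation one timestep (omitted) ...
    state["step"] += 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        checkpoint(state)         # defensive I/O: cost paid even if never used
```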
Observation 2: Resiliency Flexibility Critical
• Migrate from checkpointing to application resiliency.
  • Traditional system-based checkpoint/restart is no longer viable.
  • Defensive I/O per application is inefficient, but it is the current state of the art.
• Better application resiliency requires improvements in both systems and applications.
  • Several teams are moving to new frameworks (e.g., Charm++) to improve resiliency.
  • MPI is trying to add better features for resiliency.
Resiliency Flexibility
• Application-based resiliency:
  • Multiple layers of software and hardware have to coordinate information and reaction.
  • Analysis and understanding are needed before action.
  • Correct and actionable messages need to flow up and down the stack to the applications so they can take the proper action with correct information.
• Application situational awareness: applications need to understand their circumstances and take action.
• Flexible resource provisioning is needed in real time (a toy sketch follows).
  • Replacing a failed node dynamically from a system pool of nodes.
  • Interaction with other constraints so that sub-optimization does not adversely impact overall system optimization.
The Chicken or the Egg
• Applications cannot take advantage of features the system does not provide.
  • So they do the best they can with guesses.
• Technology providers do not provide features because they say applications do not use them.
My message is: we cannot brute-force our way to future @scale systems and applications any longer.
Many Other Observations – Other Presentations
• Storage and I/O significant challenges
• System software quality and resiliency
• Testing for function, feature, and performance at scale
• Information gathering for the system
• Application methods
• Measuring real time-to-solution performance
• System SW performance at scale
• Heterogeneous components
• Application consistency
• Efficiency: energy, TCO, utilization
• S&E team productivity
• ...
Summary
• Blue Waters is delivering on its commitment to sustained performance to the Nation for computational and data-focused @Scale problems.
• We appreciate the tremendous efforts and support of all our technology providers and science team partners.
• I am pleased to see Adaptive and Cray seriously addressing topology awareness issues, to meet BW-specific needs and hopefully beyond.
• I am pleased Cray made initial improvements to enable application resiliency, but Adaptive, Cray, MPI, and other technology providers need to do much more to solve this.
• I am very encouraged that application teams are willing (and desire) to implement flexibility in their codes if they have options.
• We need more commonality across technology providers and implementations.
• BW is an excellent platform for studying these issues as well as providing an unprecedented S&E resource.
• Stay tuned for amazing results from BW.
Acknowledgements
This work is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications, Cray, and the Great Lakes Consortium for Petascale Computation.
The work described is achievable through the efforts of many others on different teams.