Blue Waters and Resource Management - Now and in the Future
DESCRIPTION
In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future. Watch the video of this presentation: http://insidehpc.com/?p=36343
TRANSCRIPT
Blue Waters and Resource Management – Now and in the Future
Dr. William Kramer, National Center for Supercomputing Applications, University of Illinois
Science & Engineering on Blue Waters
Blue Waters will enable advances in a broad range of science and engineering disciplines. Examples include:
• Molecular Science
• Weather & Climate Forecasting
• Earth Science
• Astro*
• Health
What Have I Been Doing?
Science Area | Teams | Codes
Climate and Weather | 3 | CESM, GCRM, CM1/WRF, HOMME
Plasmas/Magnetosphere | 2 | H3D(M), VPIC, OSIRIS, Magtail/UPIC
Stellar Atmospheres and Supernovae | 5 | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS
Cosmology | 2 | Enzo, pGADGET
Combustion/Turbulence | 2 | PSDNS, DISTUF
General Relativity | 2 | Cactus, Harm3D, LazEV
Molecular Dynamics | 4 | AMBER, Gromacs, NAMD, LAMMPS
Quantum Chemistry | 2 | SIAL, GAMESS, NWChem
Material Science | 3 | NEMOS, OMEN, GW, QMCPACK
Earthquakes/Seismology | 2 | AWP-ODC, HERCULES, PLSQR, SPECFEM3D
Quantum Chromodynamics | 1 | Chroma, MILC, USQCD
Social Networks | 1 | EPISIMDEMICS
Evolution | 1 | Eve
Engineering/System of Systems | 1 | GRIPS, Revisit
Computer Science | 1 |
(For each science area the original table also marked which algorithmic motifs its codes use: structured grids, unstructured grids, dense matrix, sparse matrix, N-body, Monte Carlo, FFT, PIC, and significant I/O.)
Blue Waters Computing System
• Cray system with Gemini high-speed network (HSN); aggregate memory 1.5 PB
• Online storage (Sonexion): 26 usable PB (>25 PB online for /home, /project, /scratch), >1 TB/s
• Near-line storage (Spectra Logic): 300 usable PB (380 PB raw tape)
• External servers: 28 Dell 720 IE servers, 50 Dell 720 near-line servers, 4 Dell esLogin servers, 1.2 PB usable disk
• Networking: LNET routers, rSIP and network gateways, core FDR/QDR IB Extreme switches, 10/40/100 Gb Ethernet switch, 440 Gb/s Ethernet from the site network, 100-300 Gbps WAN
• Protocols: LNET, TCP/IP (10 GbE), SCSI (FCP), FC8, GridFTP (TCP/IP)
All storage sizes are given as usable capacity. Rates are always usable/measured sustained rates.
Cray XE6/XK7 - 276 Cabinets
• XE6 compute nodes: 5,688 blades, 22,640 nodes, 362,240 FP (Bulldozer) cores, 724,480 integer cores; 4 GB per FP core
• XK7 GPU nodes: 768 blades, 3,072 (4,224) nodes, 24,576 (33,792) FP cores, 4,224 GPUs; 4 GB per FP core
• Service nodes: BOOT 2, SDB 2, DSL 48, resource manager (MOM) 64, H2O login 4, network gateway 8, RSIP 12, LNET routers 582, reserved 74, plus import/export nodes, HPSS data mover nodes, and a management node
• Gemini fabric (HSN), InfiniBand fabric, 10/40/100 Gb Ethernet switch, boot RAID, boot cabinet, SMW
• Sonexion online storage: 25+ usable PB in 36 racks; near-line storage: 300+ usable PB
• esServers cabinets, NCSAnet, NPCF, SCUBA
• Supporting systems: LDAP, RSA, Portal, JIRA, Globus CA, Bro, test systems, Accounts/Allocations, CVS, Wiki
• Cyber protection IDPS
BW Focus on Sustained Performance
• Blue Waters and NSF are focusing on sustained performance in a way few have before.
• Sustained performance is the computer's useful, consistent performance on the broad range of applications that scientists and engineers use every day.
  • Time to solution for a given amount of work is the important metric, not hardware ops/s.
  • Sustained performance (and therefore the tests) includes the time to read data and write the results.
• NSF's Track-1 call emphasized sustained performance, demonstrated on a collection of application benchmarks (application + problem set).
  • Not just simplistic metrics (e.g., HP Linpack).
  • Applications include both petascale applications (which effectively use the full machine, solving scalability problems for both compute and I/O) and applications that use a large fraction of the system.
• The Blue Waters project focus is on delivering sustained petascale performance to computational and data-focused applications.
  • Develop tools, techniques, and samples that exploit all parts of the system.
  • Explore new tools, programming models, and libraries to help applications get the most from the system.
• By the Sustained Petascale Performance (SPP) metric, Blue Waters sustained >1.3 petaflops across 12 different time-to-solution application tests (a rough sketch of such a composite metric follows).
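The >1.3 figure is a composite over the twelve SPP application tests. As a hedged illustration only (the exact SPP weighting is not stated here; the function name and numbers below are made up), a sustained-performance composite of this kind is commonly computed as a geometric mean of per-application sustained rates, each derived from a reference operation count and the measured time to solution, including I/O:

```python
from math import prod

def sustained_performance(apps):
    """Composite sustained performance as the geometric mean of
    per-application sustained rates (hypothetical SPP-style metric).

    apps: list of (reference_op_count, measured_time_to_solution_seconds)
    Returns a rate in reference_op_count units per second.
    """
    rates = [ops / seconds for ops, seconds in apps]
    return prod(rates) ** (1.0 / len(rates))

# Illustrative (made-up) numbers: three applications with known operation
# counts and measured wall-clock times, including time to read and write data.
example = [
    (2.5e18, 1800.0),   # ~1.39 PF/s sustained
    (4.0e18, 3100.0),   # ~1.29 PF/s sustained
    (1.8e18, 1400.0),   # ~1.29 PF/s sustained
]
print(f"Sustained performance: {sustained_performance(example) / 1e15:.2f} PF/s")
```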
View from the Blue Waters Portal
As of April 2, 2013, Blue Waters has delivered over 1.3 Billion core-hours to S&E Teams
Usage Breakdown – Jan 1 to Mar 26, 2013
• Torque log accounting (NCSA, Mike Showerman)
[Chart: Accumulated XE node-hours, January 1 to March 26, 2013. X-axis: XE job size in nodes, powers of 2 from 1 to 32,768 (1 node = 32 cores). Y-axis: usage in millions of node-hours (0 to 35). Annotated job-size regions: >65,536 cores and >262,144 cores.]
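The breakdown above comes from Torque log accounting. As a hedged sketch of how such a per-job-size histogram can be derived (field names follow the common Torque accounting record layout; the exact records and filenames on Blue Waters may differ), node-hours can be binned by power-of-two job size roughly like this:

```python
import math
import re
from collections import defaultdict

def parse_walltime(hms):
    """Convert 'HH:MM:SS' to hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60.0 + s / 3600.0

def bin_node_hours(log_lines):
    """Accumulate node-hours by power-of-two job size from Torque accounting
    'E' (job end) records, assuming the usual 'date;type;jobid;key=value ...'
    layout with Resource_List.nodect and resources_used.walltime fields."""
    bins = defaultdict(float)
    for line in log_lines:
        parts = line.strip().split(";")
        if len(parts) < 4 or parts[1] != "E":
            continue
        fields = dict(re.findall(r"(\S+)=(\S+)", parts[3]))
        nodes = int(fields.get("Resource_List.nodect", 0))
        wall = fields.get("resources_used.walltime")
        if nodes == 0 or wall is None:
            continue
        size_bin = 2 ** math.ceil(math.log2(nodes)) if nodes > 1 else 1
        bins[size_bin] += nodes * parse_walltime(wall)
    return dict(sorted(bins.items()))

# Hypothetical example record: a 512-node job that ran for 2.5 hours.
sample = ["04/10/2013 12:00:00;E;1234.bw;user=alice Resource_List.nodect=512 "
          "resources_used.walltime=02:30:00"]
print(bin_node_hours(sample))   # {512: 1280.0} node-hours
```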
OBSERVATIONS AND THOUGHTS:
FLEXIBILITY IS THE WORD OF THE NEXT DECADE
What is Blue Waters already telling us about future @Scale systems?
Observation 1: Topology Matters
• Much of the work to improve the performance of early applications was understanding and tuning for layout/topology, even on dedicated systems.
  • Factors of almost 10 were seen for some applications.
• Nvidia's Linpack results are mostly due to topology-aware work layout.
  • Done with hand tuning, special node selection, etc.
  • This needs to become commonplace to really benefit users.
Topology Matters
• Even very small changes can have dramatic and unexpected consequences.
• Example: having just 1 down Gemini out of 6,114 can slow an application by >20%.
  • 0.0156% of components being unavailable can extend an application's run time by >20% if those components just happen to be in the wrong place.
  • P3DNS, 6,114 nodes.
Topology
• 1 poorly placed node out of 4,116 (0.02%) can slow an application by >30%.
  • On a dedicated system!
• It is hard to get an optimal topology assignment, especially in non-dedicated use, but it should be easy to avoid really detrimental topology assignments (a sketch of one way to score a placement follows).
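As a hedged illustration of avoiding "really detrimental" assignments (the torus dimensions and the scoring rule here are assumptions for illustration, not the Blue Waters scheduler's actual policy), a resource manager or application could score a candidate node set by its average pairwise hop distance on a 3D torus and flag clearly bad outliers:

```python
from itertools import combinations

TORUS_DIMS = (24, 24, 24)   # assumed 3D torus dimensions, illustration only

def torus_hops(a, b, dims=TORUS_DIMS):
    """Minimum hop count between two nodes at 3D torus coordinates a and b."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def placement_score(node_coords):
    """Average pairwise hop distance of an allocation; lower is better."""
    pairs = list(combinations(node_coords, 2))
    return sum(torus_hops(a, b) for a, b in pairs) / len(pairs)

# Two hypothetical 8-node allocations: one compact block, and one with a
# single straggler far away on the torus (the kind of placement that can
# dominate an application's run time).
compact  = [(x, y, 0) for x in range(2) for y in range(4)]
straggly = compact[:-1] + [(12, 12, 12)]
print(f"compact:  {placement_score(compact):.2f} avg hops")
print(f"straggly: {placement_score(straggly):.2f} avg hops")
```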
Topology Awareness Needed for All Types of Interconnects
• Tori
• Trees
• Hypercubes
• Direct connect & dragonflies
Performance and Scalability through Flexibility
• It is harder for applications to scale in the face of limited bandwidths.
• BW works with science teams and technology providers to:
  • Understand and develop better process-to-node mapping analysis to determine behavior and usage patterns (see the sketch after this list).
  • Build better instrumentation of what the network is really doing.
  • Provide topology-aware resource and systems management that enables and rewards topology-aware applications.
  • Support malleability, for both applications and systems.
  • Understand the topology given and maximize effectiveness.
  • Let applications express a desired topology based on their algorithms.
  • Provide middleware support.
• Even if applications scale, consistency becomes an increasing issue for systems and applications.
  • This will only get worse in future systems.
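One way an application can express a desired topology is to ask MPI for a Cartesian communicator and let the library reorder ranks to match. This is a minimal sketch, assuming mpi4py is available; how much a real reordering helps depends on the MPI implementation and the job's actual node placement:

```python
# Minimal sketch of application-side topology expression with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Ask MPI to factor the job into a 3D process grid and to reorder ranks so
# that Cartesian neighbors land close together on the network.
dims = MPI.Compute_dims(comm.Get_size(), [0, 0, 0])
cart = comm.Create_cart(dims, periods=[True, True, True], reorder=True)

# Each rank can now discover its grid coordinates and nearest neighbors,
# which is what a halo-exchange code would communicate with.
coords = cart.Get_coords(cart.Get_rank())
left, right = cart.Shift(0, 1)
print(f"rank {comm.Get_rank()} -> cart rank {cart.Get_rank()}, "
      f"coords {coords}, x-neighbors ({left}, {right})")
```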
Flexible Resiliency Modes
• Run again.
• Defensive I/O (traditional checkpoint/restart); see the sketch after this list.
  • Expensive: extra overhead for application and system.
  • Intrusive.
  • The I/O infrastructure is shared across all jobs.
• Newer C/R approaches (node memory copy, SSD, journaling, ...).
• Spare nodes in job requests to rebalance work after a single point of failure.
  • Wastes resources.
  • Runtimes do not support this well yet (but can do it).
• Redistribute work within the remaining nodes.
  • Charm++ and some MPI implementations.
  • Takes longer.
• Add spare nodes from the system pool to the job.
  • The job scheduler, resource manager, and runtime all have to be made more flexible.
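As a hedged sketch of the "defensive I/O" mode (file names and checkpoint cadence below are made up for illustration; real application checkpoints write far larger, structured state, usually in parallel), an application-level checkpoint/restart loop looks roughly like this:

```python
import os
import pickle

CHECKPOINT = "state.chk"          # hypothetical checkpoint file name
CHECKPOINT_EVERY = 100            # hypothetical cadence, in timesteps

def load_or_init():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "field": [0.0] * 1024}

def checkpoint(state):
    """Write atomically: write to a temp file, then rename over the old one,
    so a crash mid-write never corrupts the last good checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_or_init()
while state["step"] < 1000:
    # ... advance the simulation one timestep (omitted) ...
    state["step"] += 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        checkpoint(state)         # defensive I/O: cost paid even if never used
```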
Observation 2: Resiliency Flexibility Critical
• Migrate from checkpointing to application resiliency.
  • Traditional system-based checkpoint/restart is no longer viable.
  • Defensive I/O per application is inefficient, but it is the current state of the art.
• Better application resiliency requires improvements in both systems and applications.
  • Several teams are moving to new frameworks (e.g., Charm++) to improve resiliency.
  • MPI is trying to add better features for resiliency.
Resiliency Flexibility
• Application-based resiliency:
  • Multiple layers of software and hardware have to coordinate information and reaction.
  • Analysis and understanding are needed before action.
  • Correct and actionable messages need to flow up and down the stack to the applications so they can take the proper action with correct information.
• Application situational awareness: applications need to understand their circumstances and take action.
• Flexible resource provisioning is needed in real time (a toy sketch follows).
  • Replacing a failed node dynamically from a system pool of nodes.
  • Interaction with other constraints so that sub-optimization does not adversely impact overall system optimization.
The Chicken or the Egg
• Applications cannot take advantage of features the system does not provide.
  • So they do the best they can with guesses.
• Technology providers do not provide features because they say applications do not use them.
My message is: we cannot brute-force our way to future @scale systems and applications any longer.
Many Other Observations – Other Presentations
• Storage and I/O significant challenges
• System software quality and resiliency
• Testing for function, feature, and performance at scale
• Information gathering for the system
• Application methods
• Measuring real time-to-solution performance
• System SW performance at scale
• Heterogeneous components
• Application consistency
• Efficiency: energy, TCO, utilization
• S&E team productivity
• ...
Summary
• Blue Waters is delivering on its commitment to sustained performance to the Nation for computational and data-focused @Scale problems.
• We appreciate the tremendous efforts and support of all our technology providers and science team partners.
• I am pleased to see Adaptive and Cray seriously addressing topology awareness issues, to meet BW-specific needs and hopefully beyond.
• I am pleased Cray made initial improvements to enable application resiliency, but Adaptive, Cray, MPI, and other technology providers need to do much more to solve this.
• I am very encouraged that application teams are willing (and desire) to implement flexibility in their codes if they have options.
• We need more commonality across technology providers and implementations.
• BW is an excellent platform for studying these issues as well as providing an unprecedented S&E resource.
• Stay tuned for amazing results from BW.
Acknowledgements
This work is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications, Cray, and the Great Lakes Consortium for Petascale Computation.
The work described is achievable through the efforts of many others on different teams.