
The Largest Linux Clusters

Neil Pundit

Scalable Computing Systems

Sandia National Laboratories

ndpundi@sandia.gov

http://www.cs.sandia.gov/cplant/


Outline

• Cplant™ hardware, software, and performance
• Major difficulties and lessons learned
• Research and development activities
• Celera Genomics CRADA
• Red Storm
• Applications
• Contributors
• Additional info

What is Cplant™?

• Cplant™ is a concept

– Provide computational capacity at low cost
– MPPs from commodity components

• Cplant™ is an overall effort:

– Multiple computing systems
  • Alaska, Barrow, Siberia, Antarctica/Ross, Antarctica/West, Hawaii, Carmel, Asilomar, Delmar, Zenia
– Multiple projects
  • Portals 3.0 message passing, runtime, management tools, system integration & test, operations & management

• Cplant™ is a software package

– Released under commercial license to Unlimited Scale, Inc.

– Released as open source under the GNU General Public License

Cplant™ Architecture

[Architecture diagram: the Cplant™ system comprises compute nodes, service nodes, and I/O nodes (net I/O and file I/O, including /home); users, system administration, system support, and ASCI Red connect to it over Ethernet, ATM, and HiPPI.]

Extends ASCI Red advantages

MPP “look and feel”

• Distributed systems and services architecture

• Scalable to 10,000 nodes

• Embedded RAS features

• Preserve application code base

Current Deployment

• NM clusters
  – Alaska, yellow, 272 nodes (FY98)
  – Barrow, red, 96 nodes (FY98)
  – Siberia, yellow, 592 nodes (FY99)
  – Ross/Antarctica, yellow, 1024 nodes (FY00)
  – West/Antarctica, green, 80 nodes (FY00)

• CA clusters
  – Asilomar-SON, green, 64 nodes (FY97)
  – Asilomar-SRN, yellow, 64 nodes (FY97)
  – Carmel, yellow, 128 nodes (FY99)
  – Delmar, yellow, 256 nodes (FY00)
  – Zenia, red, 32 nodes (FY00)

Antarctica - Current

• Single Plane connects up to 256 Nodes via LAN
• Center Planes Swing to 1 of 3 “Heads”
• Each “Head” connects up to 256 CPU Nodes via LAN
• IO & Service Nodes connected via SAN (Z Direction)

• 8x8x6+ Aspect Ratio

• Supports periods processing on 3 networks

[Diagram: current Antarctica configuration of 256-node compute planes (some not yet operational) and an 80-node section, with 128-path and 32-path links and groups of 16 and 24 service & I/O nodes.]

Antarctica – August ‘01

[Diagram: planned August ’01 configuration of 256-node and 128-node compute planes, with 128-path and 32-path links and groups of 16 and 24 service & I/O nodes.]

System Software

[Software stack diagram: applications and the Portable Batch System sit on the MPI library, parallel I/O library, and distributed services library; the runtime environment (yod, PCT, bebopd, pingd) and cluster services run over Portals and IP on the Linux operating system and the underlying hardware.]

Management Software

[Diagram: PERL-based management tools — a device database (add, delete, find, power), a role database, a discover utility, and hardware configuration software covering power control, boot node, boot scalable unit, boot virtual machine, remote distribution, update SSS0, and update virtual machine.]

• Portals for fast message passing (see the ping-pong sketch below)

• Linux OS

• Configuration & management tools enable managing large clusters
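The MPI library on Cplant™ sits on top of Portals (see the system software stack above). As a hedged illustration — generic MPI code, not anything from the Cplant™ distribution — the kind of ping-pong microbenchmark commonly used to gauge message-passing latency looks like this:

/* Minimal MPI ping-pong latency sketch (illustrative only; not Cplant-specific).
 * Ranks 0 and 1 bounce a small message and report the average one-way latency. */
#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    char buf[8] = {0};
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average one-way latency: %.2f microseconds\n",
               (t1 - t0) * 1.0e6 / (2.0 * REPS));

    MPI_Finalize();
    return 0;
}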

Application Launch Performance

[Plot: application launch time in seconds (0–15) versus number of nodes (1 to 500) for a 2 MB and a 10 MB executable.]

ENFS - A Parallel File Server Capability

• Employs standard NFS
• Direct data deposit onto the visualization machine
• Parallel (100 MB/s)
  – Multiple paths to the server(s)
• Scalable
  – Pushes scaling issues to the server side
• Global
  – Available to all compute nodes

[Diagram: four ENFS I/O nodes connected to VizNFS through a gigE switch.]

ENFS

• Removes locking semantics from NFS protocol
  – Parallel independent I/O to multiple files
  – Non-overlapping access to a single file (see the MPI-IO sketch below)
• Uses I/O nodes as proxies
• Allows for investigation of third-party solutions
  – Currently SGI’s XFS – 117 MB/s
  – Compaq’s Petal/Frangipani
  – Clemson’s PVFS
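Because ENFS drops NFS locking and guarantees only non-overlapping access to a single file, applications partition the file among ranks. The following is a minimal sketch under that assumption, using generic MPI-IO calls rather than anything ENFS-specific; the file name and block size are arbitrary:

/* Illustrative MPI-IO sketch: each rank writes a disjoint block of a shared
 * file, so no file locking is required. Not ENFS-specific code. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK (1 << 20)   /* 1 MB per rank; example value only */

int main(int argc, char **argv)
{
    int rank;
    char *buf;
    MPI_File fh;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BLOCK);
    memset(buf, 'A' + (rank % 26), BLOCK);   /* rank-identifiable payload */

    /* All ranks open the same file collectively. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r writes bytes [r*BLOCK, (r+1)*BLOCK): regions never overlap. */
    MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                      MPI_CHAR, &st);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}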

Supporting Software Efforts

• Etnus, Inc. – TotalView debugger (vers. 4.1.0-1)
  – Cplant™ runtime environment extended to support bulk debug server launch
  – Only works on GNU and Compaq Alpha/Linux binaries
  – Can launch yod or attach to a running job
  – TotalView communications port to Portals 3.0 in progress
• MPI Software Technology, Inc. – MPI/Pro
  – MPI/Pro ported to Portals 3.0
• Kuck and Associates, Inc./Pallas, Inc. – Vampir
  – Vampirtrace for MPI/Pro and ENFS
• Mission Critical Linux – Linux enhancements
  – Kernel modifications to increase performance on Alpha processor systems

Large Clusters Require an Extensive Integration-Test Process

[Chart: Integration Hardware Error Reports for 1024 Nodes of Antarctica — failure counts by category: power supplies, ECC errors, mother boards, Ethernet cable, serial cable, bad Myrinet cable, loose Myrinet cable, misconfigured Myrinet cable, Myrinet card, PCI riser card, RPC unit, terminal server, misc. hardware, misc. software, and no diagnosis.]

MPLinpack Performance

• 552 Siberia nodes
  – 309.2 GFLOPS
  – Would place 61st on the November 2000 Top 500 list

• 1000 Antarctica nodes
  – 512.4 GFLOPS
  – Would place 31st on the November 2000 Top 500 list
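For scale, the figures above work out to roughly 309.2 / 552 ≈ 0.56 GFLOPS per node on Siberia and 512.4 / 1000 ≈ 0.51 GFLOPS per node on Antarctica.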

Usage Data

[Plot: monthly utilization (%) from Aug-00 through Apr-01 for Janus, Janus-s, Alaska, and Siberia.]

Outline of Major Difficulties in the Last Two Years

• Interconnect
• Communication middleware
• Runtime environment
• Batch scheduler
• Parallel I/O
• System management
• Testing and release process

Major Difficulties

• Interconnect (Myrinet) problems (2 PY)
  – GM mapper limitations (2 PM)
    • Each new cluster exceeded the number of nodes the mapper could handle
  – Non-deadlock-free routes (4 PM)
    • Code for routing algorithm gave only shortest-path routes
  – Reliability
    • Error detection/correction (6 PM)
    • Switch diagnostics capture and display (1 PY)

Myrinet Reliability

• Alaska Myrinet is very reliable
• Siberia Myrinet is very unreliable
  – Daily bit error rate can be from 10^-7 to 10^-14
  – Storms of multi-bit errors
• Added error detection/correction to the Myrinet driver (see the CRC sketch below)
• Implemented Myrinet switch monitoring software
• Implemented switch error visualization tool
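The slide above notes that error detection/correction was added to the Myrinet driver; the actual driver code is not shown here. One common form of software error detection is a per-message CRC, sketched generically below (illustrative only, not the Cplant™ implementation):

/* CRC-32 (reflected form, polynomial 0xEDB88320) computed bit by bit.
 * Generic sketch of software error detection; not the Cplant driver code. */
#include <stddef.h>
#include <stdint.h>

uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    size_t i;
    int k;

    crc = ~crc;
    for (i = 0; i < len; i++) {
        crc ^= buf[i];
        for (k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* The sender appends crc32_update(0, msg, len) to each message; the receiver
 * recomputes it and requests a retransmit on mismatch. */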

Switch Error Visualization Tool

Major Difficulties (cont’d)

• Communication middleware (3.5 PY)
  – Portals 2.0 in Linux (6 PM)
    • No API
      – Data structures in user space
      – Protection boundaries have to be crossed to access data structures
      – Data structures have to be copied, manipulated, and copied back
      – Requires interrupts
    • Address validation/translation on the fly
      – Incoming messages trigger address validation
      – Doesn’t fit the Linux model of validating addresses on a system call for the currently running process
  – Developed Portals 3.0 API (1 PY)
  – Implemented Portals 3.0 (1 PY)
  – Transition from P2 to P3 (1 PY)

Major Difficulties (cont’d)

• Runtime environment (2 PY)
  – Most problems related to message passing
    • Runtime utilities must recover from network errors
  – Linux copy-on-write caused “lost” messages
  – Problems show up as
    • Failure to start a job
    • Utilities become uncommunicative – compute nodes become stale, the allocator is unresponsive
  – Interaction of Linux, Portals, and the utilities (60% rewrite, 30% debugging, 10% enhancement)

Major Difficulties (cont’d)

• Batch scheduling (1 PY)
  – Enhanced OpenPBS
    • Added non-blocking I/O for enhanced reliability (patches available under GPL); a sketch of the technique follows below
    • Integrated PBS into the runtime environment
  – Uses FIFO scheduler
    • Reflects “good citizen” rules established by users
  – Few problems with PBS
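The OpenPBS reliability work above centered on non-blocking I/O; the real patches are distributed separately under the GPL. As a generic sketch of the underlying technique only, a daemon can mark a descriptor non-blocking so a stalled peer cannot hang it:

/* Generic POSIX sketch of non-blocking I/O; not the actual OpenPBS patch. */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Attempt a write without ever blocking: returns bytes written, 0 if the
 * peer is not ready (caller retries later), or -1 on a real error. */
ssize_t try_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}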

Major Difficulties (cont’d)

• Parallel I/O (6 PY)
  – Fyod – parallel independent files
    • Partial success (6 PM)
  – Striping fyod
    • Abandoned for lack of robustness (2 PY)
  – ENFS (3.5 PY)
    • Have MPI-IO for ENFS, working on HDF-5
    • 119 MB/s from 8 I/O nodes to an SGI O2K with XFS

Major Difficulties (cont’d)

• System management tools (6 PY)
  – All tools are homegrown
  – Commercial tools do not address scalability and the Cplant™ architecture
  – First implementation was too hardware-specific and tightly integrated with the runtime environment
  – Latest implementation is flexible and separate from the runtime environment
  – Recent focus is on automation and robustness

Major Difficulties (cont’d)

• Testing and release process (5 PY)
  – Slow awakening that system tests were incomplete
  – Testing needs to include a few representative applications
  – Beyond infant mortality, we need to do stress testing
  – Five-phase testing procedure in place

Five-Phase Testing Procedure

• Phase 0: Repository regression tests

– Runs nightly on a 32-node system to ensure the functionality of the repository

• Phase 1: Runtime environment and basic message passing tests

– Simple MPI tests and basic file I/O functions (an example of such a test appears after this list)

• Phase 2: Small applications and benchmarks

– NAS Benchmarks, MPLinpack, CTH and MPSalsa with small problems

• Phase 3: Message passing stress tests

– Based on the Intel acceptance tests for ASCI/Red

• Phase 4: Friendly user applications

– Friendly users running real applications
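As a hedged example of what a Phase 1 “simple MPI test” can look like — illustrative only, not one of the actual Cplant™ test codes — a ring test passes a token through every rank and verifies the result:

/* Illustrative smoke test: pass a token around a ring of ranks and check
 * that it comes back incremented by the number of hops. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) printf("ring test skipped (needs >= 2 ranks)\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 0;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &st);
        /* Each of the other size-1 ranks incremented the token once. */
        printf("ring test %s\n", token == size - 1 ? "PASSED" : "FAILED");
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &st);
        token++;
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}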

Lessons Learned

• Bug fixing (50%)
• Enhancements (30%)
• Release testing (20%)
  – Currently barely adequate
  – Need greater attention to robustness

Current Research and Development

• OS bypass performance enhancement
• Dynamic compute node allocation
• Intelligent compute node allocation
• Portals 3.0 on Quadrics network
• Support for multi-threaded apps (see the thread-initialization sketch below)
• Support for SMP compute nodes
• Enhance cluster management tools to support switching between heads
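Support for multi-threaded applications typically begins with how MPI is initialized. The sketch below is generic MPI-2 usage, not a statement of what Cplant™ currently provides: the application requests a thread-support level and checks what the library grants.

/* Generic sketch: request full thread support from MPI and verify the level
 * actually granted. Illustrative only; not Cplant-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI granted thread level %d (wanted %d)\n",
                provided, MPI_THREAD_MULTIPLE);

    /* ... application work; with MPI_THREAD_MULTIPLE any thread may call MPI ... */

    MPI_Finalize();
    return 0;
}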

Collaborative Research Efforts

– Study of optimal error correction protocols (Ohio State)
– Heterogeneous cluster study (Syracuse/U. of Virginia)
– Study of performance with topology, communication, and applications (Ohio State)
– OS bypass (U. New Mexico)
– Fault tolerance in applications (U. of Texas)
– Portals 3.0 implementations (VIA, LAPI) and extensions (gather/scatter) (Mississippi State U.)
– Scalable I/O (lock manager, coherence) (Northwestern)
– New MPP architectures (CalTech/JPL)
– SciDAC – Scalable Systems Software Enabling Technology Center (DOE)

Celera Genomics

• Multi-year Cooperative Research and Development Agreement
• Develop advanced parallel bioinformatics algorithms
• Develop massively parallel computer hardware designs
• Incorporate these into a single, integrated, high-performance data analysis capability
• Integrate technology advances into both companies’ mainstream business activities
• Enhance Celera’s technical depth in high-performance parallel computing
• Enhance Sandia’s technical depth in genomics and proteomics

ASCI Red Storm

• Tightly Coupled MPP
• 20+ TF
• Distributed Memory MIMD
• 3-D Mesh Interconnect
• Red/Black Switching
• Partitioned Hardware - System and I/O, Compute, RAS
• Partitioned System Software - System and I/O, Compute, RAS
• Integrated System Management and Full System RAS
• No Local Disk or User Writable Non-volatile Memory

ASCI Red Storm – 20 Tflops

Applications Work In Progress

• CTH – 3D Eulerian shock physics
• ALEGRA – 3D arbitrary Lagrangian-Eulerian solid dynamics
• GILA – Unstructured low-speed flow solver
• MPQuest – Quantum electronic structures
• SALVO – 3D seismic imaging
• LADERA – Dual control volume grand canonical MD simulation
• Parallel MESA – Parallel OpenGL
• Xpatch – Electromagnetism
• RSM/TEMPRA – Weapon safety assessment
• ITS – Coupled electron/photon Monte Carlo transport
• TRAMONTO – 3D density functional theory for inhomogeneous fluids
• CEDAR – Genetic algorithms

Applications Work In Progress

• AZTEC – Iterative sparse linear solver
• DAVINCI – 3D charge transport simulation
• SALINAS – Finite element modal analysis for linear structural dynamics
• TORTILLA – Mathematical and computational methods for protein folding
• EIGER
• DAKOTA – Analysis kit for optimization
• PRONTO – Numerical methods for transient solid dynamics
• SnRAD – Radiation transport solver
• ZOLTAN – Dynamic load balancing
• MPSALSA – Numerical methods for simulation of chemically reacting flows

http://www.cs.sandia.gov/cplant/apps

CTH Grind Time

[Plot: CTH grind time in microseconds versus number of nodes (1 to 1000) for Alaska, Siberia, Tflops, DEC, and Blue-Pacific.]

Cplant™ Contributors

System Software Development and Testing:
• Ron Brightwell
• Lee Ann Fisk
• Nathan Dauchy (HPTi)
• Sue Goudy
• Rena Haynes
• Jeanette Johnston
• Lisa Kennicott
• Ruth Klundt (Compaq)
• Jim Laros
• Barney Maccabe (UNM)
• Jim Otto
• Rolf Riesen
• Eric Russell
• Lee Ward
• David Evensky

Production Support:
• Sophia Corwell
• Bob Davis
• Eric Enquvist
• Cathy Houf
• Donna Johnson
• Mike McConkey
• Geoff McGirt
• Mike Kurtzer
• Doug Clay

Management Team:
• Doug Doerfler
• John Noe
• Neil Pundit
• Art Hale, Deputy Director
• Bill Camp, Director

More Info

• Web site

– http://www.cs.sandia.gov/cplant/

• Recent papers

  – http://www.cs.sandia.gov/cplant/papers/
  – Including:

• “Scalable Parallel Application Launch on Cplant™”, extended abstract submitted to SC’01

• “Dynamic Allocation of Nodes on a Large Space-shared Cluster”, submitted to IEEE Cluster Computing 2001

• “Scalability and Performance of Two Large Linux Clusters”, Journal of Parallel and Distributed Computing, to appear 2001

• “Scalability and Performance of CTH on the Computational Plant”, Proceedings of 2nd International Conference on Cluster Computing

• Sandia’s Computer Science Research Institute (CSRI)

– http://www.cs.sandia.gov/CSRI/
