Cluster Consolidation at NERSC
Larry Pezzaglia, NERSC Computational Systems Group
lmpezzaglia@lbl.gov
HEPiX Spring 2014
- Located at Lawrence Berkeley National Laboratory, NERSC is the production computing facility for the US DOE Office of Science
- NERSC serves ~5000 users, ~400 projects, and ~500 codes
- Focus is on "unique" resources:
  - Expert computing and other services
  - 24x7 monitoring
  - High-end computing and storage systems
- Known for:
  - Excellent services and user support
  - Diverse workload
- NERSC provides Hopper (a Cray XE6), Edison (a Cray XC30), and three data-intensive systems: Carver, PDSF, and Genepool
Snapshot of NERSC
- 2 -
The NERSC Cluster Model
- 3 -
- In 2012, NERSC purchased a new system, "Mendel", to systematically expand its cluster resources
  - 500+ Sandy Bridge nodes, 8000+ cores
  - FDR InfiniBand interconnect
- Mendel transparently expands production clusters and services
- Carver, PDSF, and Genepool (the "parent systems") schedule jobs on portions of Mendel
- Mendel provides multiple software environments to match those on each parent system
- This model was presented at the 2013 Cray User Group meeting:
  - http://cug.org/proceedings/cug2013_proceedings/includes/files/pap184-file1.pdf
  - http://cug.org/proceedings/cug2013_proceedings/includes/files/pap184-file2.pdf
Cluster Expansion
- 4 -
[Diagram: ESnet and the NERSC network connect the Bioinformatics cluster (Genepool), the General Purpose cluster (Carver), and the High Energy Physics/Nuclear Physics cluster (PDSF), all attached to 12+ PB of NERSC global filesystems]
Data-Intensive Systems
- 5 -
[Diagram: the same clusters (Genepool, Carver, and PDSF) with the Expansion Cluster (Mendel) added, all sharing ESnet, the NERSC network, and the 12+ PB NERSC global filesystems]
Data-Intensive Systems
- 6 -
- We use tools to construct convenient management abstractions and tuned user environments on top of this platform:
  - Familiar open-source software:
    - xCAT to provision and manage nodes
    - Cfengine3 to provide configuration management (versioned with SVN)
  - NERSC-developed BSD-licensed software:
    - avs_image_mgr to handle xCAT image management and versioning: http://github.com/lpezzaglia/avs_image_mgr
    - CHOS to provide multiple compute environments concurrently and seamlessly: http://github.com/scanon/chos
    - minimond to collect trending data for troubleshooting and analysis: http://github.com/lpezzaglia/minimond
The Mendel Approach
- 7 -
[Diagram: the layered cluster model, bottom to top: Hardware/Network, Base OS, Boot-time Differentiation, CHOS, User Applications. A unified Mendel hardware platform and unified Mendel base OS sit at the bottom; boot-time differentiation applies per-system add-ons (xCAT policy, Cfengine policy, and UGE for PDSF and Genepool; xCAT policy, Cfengine policy, and TORQUE for Carver); CHOS then provides the per-system environments (PDSF sl53 and sl64, Genepool Debian 6 compute and login, Carver SL 5.5) in which user applications run]
NERSC Cluster Model
- 8 -
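The layering above can be read as configuration composition: every node boots from the same unified base OS, then overlays the add-ons for its parent system. A toy Python sketch of that idea (all dictionary keys and values here are illustrative, drawn from the diagram, not from NERSC's actual configuration data):

```python
# Toy sketch of boot-time differentiation: every node starts from the
# same unified base OS configuration, then overlays the add-ons for
# its parent system (PDSF, Genepool, or Carver). Names are illustrative.

BASE_OS = {"os": "SL 6.3", "provisioning": "xCAT", "config_mgmt": "Cfengine3"}

ADD_ONS = {
    "pdsf":     {"scheduler": "UGE",    "chos_envs": ["sl53", "sl64"]},
    "genepool": {"scheduler": "UGE",    "chos_envs": ["debian6"]},
    "carver":   {"scheduler": "TORQUE", "chos_envs": ["sl55"]},
}

def effective_config(parent_system):
    """Unified base OS settings overlaid with the parent system's add-ons."""
    config = dict(BASE_OS)                 # start from the unified base
    config.update(ADD_ONS[parent_system])  # boot-time differentiation
    return config

print(effective_config("carver")["scheduler"])  # TORQUE
```

The point of the sketch is that per-system differences live only in the small add-on layer; everything below it is shared.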
Extending the Model
- 9 -
- Easy Mendel administration highlighted the operational burden of managing legacy clusters
  - Changing configurations with pdsh scales poorly
  - Mendel demonstrated the value of leveraging automation to manage complex systems
- We pursued further cluster consolidation efforts
  - Staff efficiency is highly valued
  - Legacy hardware also gains benefits through Mendel membership
- Cost: upfront effort to consolidate clusters and risk of user disruption
- Reward: reduced long-term sysadmin burden and increased system consistency
Motivation for Consolidation
- 10 -
- In spring 2014, we merged Genepool, a legacy parent cluster, into Mendel's management system
- The combined cluster is now managed as a single integrated system with:
  - ~1000 nodes
  - Multi-generational, multi-vendor hardware
  - Multiple separate interconnects
  - A unified xCAT+Cfengine management interface
- Constrained by a 24x7, disruption-sensitive environment
- The change was activated in a single all-day maintenance
Extending the Model
- 11 -
[Diagram: the consolidated cluster model. The stack is the same as before, but the bottom layer is now multi-vendor, multi-generational hardware unified under xCAT management abstractions; the unified Mendel base OS, per-system add-ons (xCAT policy, Cfengine policy, UGE or TORQUE), CHOS environments (PDSF sl53 and sl64, Genepool Debian 6, Carver SL 5.5), and user applications are unchanged]
Consolidated Cluster Model
- 12 -
Genepool and Mendel differ in several respects:

- Production interconnect: Mendel (new) uses FDR InfiniBand; Genepool (legacy) uses Gigabit Ethernet
- Provisioning/IPMI network: Mendel has a dedicated GbE network; Genepool has a dedicated IPMI network and provisions over the production network
- OS: Mendel runs an SL 6.3 base OS with CHOS; Genepool runs Debian 6 without CHOS
- Hardware: Mendel is a homogeneous platform; Genepool has many hardware configurations
Specific challenges
- 13 -
- xCAT's hierarchical management features are suited to managing dissimilar hardware
- Expansion required changes to our software stack:
  1. Expand hardware support in the base OS image
  2. Configure an xCAT service node
  3. Expand Cfengine rules and boot scripts
  4. Perform thorough testing
  5. Reboot the Genepool nodes through Mendel's management system
Approach
- 14 -
- Support for all Genepool hardware was added to the base OS image:
  - Kernel modules for disk and network controllers
  - Initrd code to handle Genepool network characteristics
- An xCAT add-on was created for the xCAT service node
- Changes were made with avs_image_mgr:
  - Provides a full revision history of every file
  - Provides the ability to roll back to any previous image
Base OS modifications
- 15 -
- An xCAT Service Node (SN) handles Genepool provisioning/management under the direction of the Management Node (MN)
- Only the SN requires connectivity to the Genepool networks
  - The MN only requires connectivity to the SN
- The SN provides DHCP/TFTP/HTTP/xCATd services
- xCAT commands, such as power/console operations, are transparently routed through the SN
- The SN is provisioned through the Mendel cluster model:
  1. The Mendel base OS image is booted
  2. The xCAT add-on is activated
  3. Cfengine rules apply SN-specific configurations
xCAT modifications
- 16 -
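The routing described above (the MN never needs to reach the legacy networks directly) can be modeled with a few lines of Python. This is a toy illustration of the dispatch logic only, not xCAT's actual implementation; the hostnames (`sn01`, `mn01`) and the prefix-based cluster test are hypothetical:

```python
# Toy illustration of hierarchical xCAT management: the management node
# (MN) forwards commands for legacy Genepool nodes to the service node
# (SN), which is the only host with connectivity to the Genepool
# networks. Hostnames and the naming convention are hypothetical.

SERVICE_NODE = {"genepool": "sn01"}  # cluster -> service node

def route_command(node, command):
    """Return the host that should execute `command` for `node`."""
    cluster = "genepool" if node.startswith("gp") else "mendel"
    via = SERVICE_NODE.get(cluster, "mn01")  # MN handles Mendel nodes directly
    return f"{via}: {command} {node}"

print(route_command("gpcompute07", "rpower on"))  # sn01: rpower on gpcompute07
print(route_command("mc0301", "rpower on"))       # mn01: rpower on mc0301
```

From the administrator's point of view the indirection is invisible: the same command works for any node, and the hierarchy decides where it runs.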
- Postscripts were extended to support Genepool characteristics:
  - GPFS cluster configuration
  - Multipath access to local disk arrays
  - Local filesystem configurations
  - Multiple hardware configurations
- Cfengine rules were augmented to support the additional node classes
Postscripts and Cfengine
- 17 -
[Diagram: the combined cluster. The xCAT management node issues commands over the Mendel GbE management network to compute nodes, service nodes, image builders, login nodes, and the xCAT service node; the Mendel IB production network carries production traffic. The xCAT service node bridges to the Genepool GbE IPMI network and the Genepool GbE boot/production network, which serve the legacy compute and interactive nodes]
The Combined Cluster
- 18 -
- The combined Mendel+Genepool system is complex:
  - Many different node classes
  - Each node class represents a unique software/hardware combination
  - Configuration complexity grows with the number of node classes
Cluster Automation
- 19 -
- The number of node classes exceeds what a human administrator can hold in working memory
- We must build abstractions to retain system manageability as complexity increases
- Configuration management has become a necessity:
  - We manage a single integrated system, not a collection of nodes
  - Every change must be considered in a system-wide context
  - Cfengine must broker change rollout
Cluster Automation
- 20 -
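One way to make "every change must be considered in a system-wide context" concrete is to ask, for any setting about to change, which node classes include it. A toy Python sketch (the class names and settings below are hypothetical, loosely modeled on the node roles in the earlier diagram):

```python
# Toy model of system-wide change review: before rolling out a change,
# enumerate every node class whose configuration includes the affected
# setting. Class names and settings are hypothetical.

NODE_CLASSES = {
    "mendel-compute":   {"gpfs", "uge", "chos"},
    "genepool-login":   {"gpfs", "uge", "chos", "multipath"},
    "genepool-compute": {"gpfs", "uge", "chos", "multipath"},
    "xcat-sn":          {"gpfs", "xcat-services"},
}

def affected_classes(setting):
    """Node classes that must be re-verified when `setting` changes."""
    return sorted(c for c, s in NODE_CLASSES.items() if setting in s)

print(affected_classes("multipath"))  # ['genepool-compute', 'genepool-login']
print(affected_classes("gpfs"))       # every class: a truly system-wide change
```

A configuration management system like Cfengine effectively performs this bookkeeping for you, which is why it becomes a necessity once the class count outgrows human memory.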
Development Updates
- 21 -
- CHOS enables concurrent support of multiple Linux environments on a single server
  - A core component of the Mendel cluster model
  - Under active development
- Recent changes include:
  - Ability to exit CHOS from within a CHOS environment
  - Build system improvements
  - pam_chos configurability enhancements
  - EL7 kernel support in a testing branch
- Planned features include:
  - Scripts to transform an installed EL system into a CHOS environment
  - A framework for user-supplied CHOS environments
  - Reduced kernel module scope
CHOS development
- 22 -
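At login, CHOS places each session into the Linux environment the user requested, falling back to a default otherwise. The real implementation is a kernel module plus the pam_chos PAM module; the Python sketch below models only the environment-selection step, with hypothetical environment names:

```python
# Toy sketch of CHOS environment selection: a user names a desired
# environment and the login session is placed into it, falling back to
# a default when the request is unknown or unset. CHOS itself does this
# with a kernel module and pam_chos; this models only the lookup.

AVAILABLE = {"sl53", "sl64", "debian6"}  # hypothetical environment names
DEFAULT = "sl64"                          # hypothetical cluster default

def select_environment(requested):
    """Return the CHOS environment a login session should enter."""
    if requested in AVAILABLE:
        return requested
    return DEFAULT  # unknown or unset request: use the default

print(select_environment("sl53"))     # sl53
print(select_environment("centos7"))  # sl64 (unknown request)
```

This is what lets one physical Mendel node serve users who expect three different operating systems concurrently and seamlessly.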
- Collecting trending data for historical analysis is growing increasingly important
- NERSC developed minimond to systematize this process:
  - Collects ~1000 statistics per node
  - Modular framework for sending metrics to multiple data aggregation services
  - Supported output methods: plain text and Ganglia (via gmetric or EmbeddedGmetric)
  - AMQP support is planned
- Only absolute counter values are recorded; calculation of derived statistics is performed on a remote analysis server
Data Collection
- 23 -
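Because minimond records only absolute counter values, the analysis server must derive rates from successive samples itself. A minimal sketch of that derivation, including the counter-wrap case that raw-counter collection forces you to handle (the helper and its wrap behavior are illustrative, not minimond code):

```python
# Sketch of the remote-analysis side: rates are derived from pairs of
# absolute counter samples. If a counter wrapped around its maximum
# between samples (e.g. a 32-bit byte counter), the raw delta goes
# negative and must be corrected. Illustrative, not minimond itself.

def rate(prev, curr, interval_s, counter_max=2**32):
    """Per-second rate between two absolute counter samples."""
    delta = curr - prev
    if delta < 0:              # counter wrapped since the last sample
        delta += counter_max
    return delta / interval_s

print(rate(1_000, 61_000, 60))     # 1000.0 (e.g. bytes/s)
print(rate(2**32 - 500, 500, 10))  # 100.0 after correcting the wrap
```

Keeping raw counters on the node and deriving rates remotely keeps the collector simple and lets the analysis be recomputed later with different intervals or methods.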
[Graph: GPFS mmpmon operations]
Metrics Graphs
- 24 -
[Graph: GPFS mmpmon throughput]
Metrics Graphs
- 25 -
[Graph: IB gateway throughput]
Metrics Graphs
- 26 -
- The extended NERSC cluster model enables systematic management of several multi-vendor, multi-interconnect, and multi-generational clusters as a single integrated system
- A unified management interface abstracts away complex details
- Implementation involved minimal user disruption
- Extending the model was far easier than separately managing both clusters
- The new model dramatically simplifies operations
Conclusions
- 27 -
- This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Acknowledgements
- 28 -
National Energy Research Scientific Computing Center