LA-UR-18-25993 Crossroads 2021 Technical Requirements Document
Dated 07-19-18
RFP No. 511017 Page 1 of 56
Crossroads 2021
Technical Requirements Document
LA-UR-18-25993 SAND2018-7366O
Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy under contract 89233218CNA000001. LA-UR-18-25993. Approved for public release; distribution is unlimited. Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND2018-7366O.
Crossroads 2021: Technical Requirements
1 INTRODUCTION
1.1 SCHEDULE
2 SYSTEM DESCRIPTION
2.1 ARCHITECTURAL DESCRIPTION
2.2 SOFTWARE DESCRIPTION
2.3 PRODUCT ROADMAP DESCRIPTION
2.4 RISK MITIGATION STRATEGY
3 TARGETS FOR SYSTEM DESIGN, FEATURES, AND PERFORMANCE METRICS
3.1 SCALABILITY
3.2 SYSTEM SOFTWARE AND RUNTIME
3.3 SOFTWARE TOOLS AND PROGRAMMING ENVIRONMENT
3.4 PLATFORM STORAGE
3.5 APPLICATION PERFORMANCE
3.6 RESILIENCE, RELIABILITY, AND AVAILABILITY
3.7 APPLICATION TRANSITION SUPPORT AND EARLY ACCESS TO ACES TECHNOLOGIES
3.8 TARGET SYSTEM CONFIGURATION
3.9 SYSTEM OPERATIONS
3.10 POWER AND ENERGY
3.11 FACILITIES AND SITE INTEGRATION
4 OPTIONS
4.1 UPGRADES, EXPANSIONS AND ADDITIONS
4.2 EARLY ACCESS DEVELOPMENT SYSTEM
4.3 TEST SYSTEMS
4.4 ON SITE SYSTEM AND APPLICATION SOFTWARE ANALYSTS
4.5 DEINSTALLATION
4.6 MAINTENANCE AND SUPPORT
5 DELIVERY AND ACCEPTANCE
5.1 PRE-DELIVERY TESTING
5.2 SITE INTEGRATION AND POST-DELIVERY TESTING
5.3 ACCEPTANCE TESTING
6 RISK AND PROJECT MANAGEMENT
7 DOCUMENTATION AND TRAINING
7.1 DOCUMENTATION
7.2 TRAINING
8 REFERENCES
APPENDIX A: SAMPLE ACCEPTANCE PLAN
APPENDIX B: TRIAD SPECIFIC PROJECT MANAGEMENT REQUIREMENTS
DEFINITIONS AND GLOSSARY
1 Introduction
The Department of Energy (DOE) National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) Program requires a computing system be deployed in 2021 to support the Stockpile Stewardship Program. In response to this requirement, Triad National Security, LLC (TNS), in furtherance of its participation in the Alliance for Computing at Extreme Scale (ACES), a collaboration between Los Alamos National Laboratory and Sandia National Laboratories, is releasing a Request for Proposal (RFP) for a next generation system, Crossroads.
In the 2021 timeframe, Trinity, the first ASC Advanced Technology System (ATS-1), will be nearing the end of its useful lifetime. Crossroads, the proposed ATS-3 system, provides a replacement, tri-lab computing resource for existing simulation codes and provides a resource for ever-increasing computing requirements to support the weapons program. The Crossroads system, to be sited at Los Alamos, NM, is projected to provide a large portion of the ATS resources for the NNSA ASC tri-lab simulation community: Los Alamos National Laboratory (LANL), Sandia National Laboratories (SNL), and Lawrence Livermore National Laboratory (LLNL), during the 2021-2026 timeframe.
Crossroads is required to support stockpile stewardship certification and assessments to ensure that the nation’s nuclear stockpile is safe, reliable and secure.
The ASC Program is faced with significant challenges resulting from the ongoing technology revolution. The program must continue to meet mission needs while adapting to sometimes radical changes in technology. Codes running on NNSA Advanced Technology Systems (Trinity and Sierra) in the 2019 timeframe are expected to run efficiently on Crossroads.
The goal of the Crossroads platform procurement is Efficiency. Efficiency will be evaluated in the areas of:
• Porting efficiency
• Performance efficiency
• Workflow efficiency
Throughout this document, the term efficiency will refer to efficiency in these three areas unless otherwise specified.
Trinity (ATS-1) will be used as the baseline for evaluating these goals. Porting efficiency is defined as the ease with which NNSA mission codes can be ported to execute on the proposed architecture. Minimal change to the existing code base is of high value. Performance efficiency is defined as the achieved performance of the application once ported to the proposed platform. Workflow efficiency is defined as the efficiency with which a complete NNSA workflow executes on the proposed platform. When evaluating proposals, efficiency in all three stated areas will be considered together.
For example, a poor result would be a scenario where an application requires little porting effort to execute on the proposed platform but the resulting performance of the application is poor compared to the baseline. Ideally, individual applications that comprise a workflow can be easily ported to the proposed platform and perform well when compared with the baseline system. If, however, a necessary service like IO or efficient scheduling of a required resource for the workflow is inferior and hampers overall workflow efficiency, this would still be a poor result.
To help inform the Offeror of the characteristics of NNSA workflows, an accompanying whitepaper, “Crossroads Workflows,” is provided that describes how application teams use High Performance Computing (HPC) resources today to advance scientific goals. The whitepaper is designed to provide a framework for reasoning about the optimal solution to these challenges. (The workflows document can be found on the Crossroads website http://crossroads.lanl.gov/.)
An Offeror’s Technical Proposal shall include narrative and graphics, as appropriate, providing its responses/proposed solutions to each of the numbered sections of this Technical Requirements Document. An Offeror shall incorporate its responses/proposed solutions directly into each of the numbered sections of the Technical Requirements Document. The Technical Requirements Document is provided in MS Word format to facilitate this proposal requirement.
The evaluation committee will make no presumption of technical capability when evaluating an Offeror’s responses/proposed solutions to this Technical Requirements Document and may downgrade a proposal if the Offeror’s responses/proposed solutions are not materially responsive.
Where the word “should” appears throughout this document, it is used to convey a target that an Offeror ought to meet or exceed. If an Offeror exceeds a target, its proposal will be upgraded. If an Offeror fails to meet a target, its proposal will be downgraded.
Where the word “shall” appears throughout this document, it is used to impose a requirement that an Offeror must meet or exceed. If an Offeror fails to meet a requirement, its proposal will be downgraded or deemed non-responsive.
Each response/proposed solution shall clearly describe the role of any lower-tier subcontractor(s) and the technology or technologies, both hardware and software, and value added that the lower-tier subcontractor(s) provide, where appropriate.
The scope of work and technical specifications for any subcontracts resulting from this RFP will be negotiated based on this Technical Requirements Document and the successful Offeror’s responses/proposed solutions.
Crossroads has a maximum funding limit over the system lifetime, to include all design and development, site preparation, maintenance, support and analysts. Total Cost of Ownership (TCO) will be considered in system selection. The Offeror must respond with configuration and pricing for both the primary and alternate point designs.
1.1 Schedule
The following is the tentative schedule for the Crossroads system.
Table 1 Crossroads Schedule
RFP Released                        Q1 CY19
On-site System Delivery Begins      Q2 CY21
On-site System Delivery Complete    Q3 CY21
Acceptance Complete                 Q1 CY22
2 System Description
2.1 Architectural Description
The Offeror shall provide a detailed full system architectural description of the Crossroads systems, including diagrams and text describing the following details as they pertain to the Offeror’s system architectures (primary and alternate):
• Component architecture – details of all processor(s), memory technologies, storage technologies, network interconnect(s) and any other applicable components.
• Node architecture(s) – details of how components are combined into the node architecture(s). Details shall include bandwidth and latency specifications (or projections) between components.
• Board and/or blade architecture(s) – details of how the node architecture(s) is integrated at the board and/or blade level. Details should include all inter-node and inter-board/blade communication paths and any additional board/blade level components.
• Rack and/or cabinet architecture(s) – details of how board and/or blades are organized and integrated into racks and/or cabinets. Details should include all inter-rack/cabinet communication paths and any additional rack/cabinet level components.
• Platform storage – details of how storage is integrated with the system, including a platform storage architectural diagram.
• System architecture – details of how rack or cabinets are combined to produce system architecture, including the high-speed interconnects and network topologies (if multiple) and platform storage.
• Proposed floor plan – including details of the physical footprint of the system and all of the supporting components, including details of site and facility integration requirements (e.g. power, cooling, and network).
2.2 Software Description
The Offeror shall provide a detailed description of the proposed software eco-system, including a high-level software architectural diagram. Specify the provenance of the software component, for example open source or proprietary, and support mechanism for each (for the lifetime of the system including updates).
2.3 Product Roadmap Description
The Offeror shall describe how the system does or does not fit into the Offeror’s long-term product roadmap and a potential follow-on system acquisition in the 2025/26 and beyond timeframe.
2.4 Risk Mitigation Strategy
The Offeror shall provide a summary of an alternate risk mitigation point design. The alternate point design shall be based on an architecture that reduces the risk to successful on-time deployment, for example, poses less schedule risk for delivery. It is of great importance that a viable platform (primary or alternate) is delivered in the Crossroads timeframe capable of supporting mission needs regardless of unforeseen technology disruptions. The Offeror shall not submit a full alternative point design proposal. Instead, its summary of the alternate risk mitigation point design shall clearly describe any differences from the primary design point and how each of the noted differences satisfies the technical requirements contained in this document and reduces schedule risk for delivery of Crossroads.
3 Targets for System Design, Features, and Performance Metrics
This section contains targets for detailed system design, features and performance metrics. It is desirable that the Offeror’s proposal meet or exceed the targets outlined in this section. If a target cannot be met, the Offeror shall provide a development and deployment plan, including a schedule, to satisfy the target.
The Offeror may also propose any hardware and/or software architectural features that will provide improvements for any aspect of the system.
3.1 Scalability
The scale of the system necessary to meet the needs of the application performance, porting and workflow requirements of the NNSA laboratories adds significant challenges. The Offeror should propose a system that enables efficiency at up to the full scale of the system. Additionally, the system proposed should provide functionality that assists users in enhancing efficiency at up to full scale. Scalability features, both hardware and software, that benefit both current and future programming models are essential. Memory bandwidth and latency are often limiting factors in the performance of NNSA mission applications; therefore, high value will be put on features that increase memory bandwidth or lower memory latency.
3.1.1 The system should support running jobs up to and including the full scale of the system.
3.1.2 The system should support launching an application at full system scale in less than 30 seconds. The Offeror shall describe factors (such as executable size) that could potentially affect application launch time.
3.1.3 The Offeror shall describe how application launch scales with the number of concurrent launch requests (per second) and the scale of each launch request (resources requested, such as the number of schedulable units etc.), including information such as:
• All system-level and node-level overhead in the process startup including how overhead scales with node count for parallel applications, or how overhead scales with the application count for large numbers of serial applications.
• Any limitations for processes on compute nodes from interfacing with an external work-flow manager, external database or message queue system.
3.1.4 The system should support at least 1000 concurrent users and more than 20,000 concurrent batch jobs. The system should allow a single user to execute multiple independent applications on a subset or all of the pool of nodes allocated to them. The Offeror shall describe details, including limitations of their proposed support for this requirement.
3.1.5 The Offeror shall describe all areas of the system in which node-level resource usage (hardware and software) increases as a job scales up (node, core or thread count).
3.1.6 The system should utilize an optimized job placement algorithm to reduce job runtime, lower variability, minimize latency, etc. The Offeror shall describe in detail how the algorithm is optimized to the system architecture.
3.1.7 The system should include an application programming interface to allow applications access to the physical-to-logical mapping information of the job’s node allocation – including a mapping between MPI ranks and network topology coordinates, and core, node and rack identifiers.
3.1.8 The system software solution should provide a low jitter environment for applications and should provide an estimate of a compute node operating system’s noise profile, both while idle and while running a non-trivial MPI application. If core specialization is used, the Offeror shall describe the system software activity that remains on the application cores.
3.1.9 The system should provide correct numerical results and consistent runtimes (i.e. wall clock time) that do not vary more than 3% from run to run in dedicated mode and 5% in production mode. The Offeror shall describe strategies for minimizing runtime variability.
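One way to check a set of measured wall clock times against the 3% and 5% targets in 3.1.9 is sketched below. The requirement does not define how run-to-run variation is computed; this sketch assumes it is the spread between the slowest and fastest run relative to the mean. The function names and the sample timings are hypothetical.

```python
from statistics import mean

# Targets taken from 3.1.9: 3% in dedicated mode, 5% in production mode.
VARIATION_LIMITS = {"dedicated": 0.03, "production": 0.05}


def runtime_variation(runtimes):
    """Return (max - min) / mean for a list of wall clock times in seconds.

    Assumption: the document does not define the variation metric; this
    spread-over-mean definition is only one plausible reading.
    """
    if not runtimes:
        raise ValueError("at least one runtime is required")
    return (max(runtimes) - min(runtimes)) / mean(runtimes)


def within_target(runtimes, mode="dedicated"):
    """Return True if the measured variation does not exceed the mode's limit."""
    return runtime_variation(runtimes) <= VARIATION_LIMITS[mode]


if __name__ == "__main__":
    sample = [100.0, 102.0, 104.0]  # hypothetical timings in seconds
    print(runtime_variation(sample))
    print(within_target(sample, "dedicated"), within_target(sample, "production"))
```

Other definitions (for example, standard deviation over mean) would give different pass/fail outcomes for the same data, so the metric should be agreed before comparing results.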
3.1.10 The system’s high speed interconnect should support a high messaging bandwidth, high injection rate, low latency, high throughput, and independent progress. The Offeror shall describe:
• The system interconnect in detail, including any mechanisms for adapting to heavy loads or inoperable links, as well as a description of how different types of failures will be addressed.
• How the interface will allow all cores in the system to simultaneously communicate synchronously or asynchronously with the high speed interconnect.
• How the interconnect will enable low-latency communication for one- and two-sided paradigms.
3.1.11 The Offeror shall describe how both hardware and software components of the interconnect support effective computation and communication overlap for both point-to-point operations and collective operations (i.e., the ability of the interconnect subsystem to progress outstanding communication requests in the background of the main computation thread).
3.1.12 The Offeror shall report or project the proposed system’s node injection/ejection bandwidth.
3.1.13 The Offeror shall report or project the proposed system’s bit error rate of the interconnect in terms of time period between errors that interrupt a job running at the full scale of the system.
3.1.14 The Offeror shall describe how the interconnect of the system will provide Quality of Service (QoS) capabilities (e.g., in the form of virtual channels or other sub-system QoS capabilities), including but not limited to:
• An explanation of how these capabilities can be used to prevent core communication traffic from interfering with other classes of communication, such as debugging and performance tools or with I/O traffic.
• An explanation of how these capabilities allow efficient adaptive routing as well as a capability to prevent traffic from different applications interfering with each other (either through QoS capabilities or appropriate job partitioning).
• An explanation of any sub-system QoS capabilities (e.g. platform storage QoS features).
3.1.15 The Offeror shall describe specialized hardware or software features of the system that enhance workflows or components of workflow efficiency, and describe any limits to their scalability on the system. The hardware should be on the same high speed network as the main compute resources and should have equal access to other compute resources (e.g. file systems and platform storage). It is desirable that the hardware have the same node level architecture as the main compute resources, but could, for example, have more memory per node.
3.2 System Software and Runtime
The system should include a well-integrated and supported system software environment. The overall imperative is to provide users with a productive, high-performing, reliable, and scalable system software environment that enables efficient use of the full capability of the system.
3.2.1 The system should include a full-featured Linux operating system environment on all user visible service partitions (e.g., front-end nodes, service nodes, I/O nodes). The Offeror shall describe the proposed full-featured Linux operating system environment.
3.2.2 The system should include an optimized compute partition operating system that provides an efficient execution environment for applications running up to full-system scale. The Offeror shall describe any HPC relevant optimizations made to the compute partition operating system.
3.2.3 The Offeror shall describe the security capabilities of all operating systems proposed, e.g. compute, service.
3.2.4 The system should support a cohesive and integrated solution for launching user applications both at scale and high request frequency that require data at runtime such as: shared objects, containerized objects, data files, and dependent software.
3.2.5 The system should include resource management functionality, including job migration, backfill, targeting of specified resources (e.g., platform storage, CPU, memory), advance and persistent reservations, job preemption, job accounting, architecture-aware job placement, power management, job dependencies (e.g., workload management), resilience management, and high throughput workload execution (e.g., 100,000 job submissions per night). The Offeror may propose multiple solutions for a vendor-supported resource manager and should describe the benefits of each.
3.2.6 The system should support jobs consisting of multiple individual applications running simultaneously (inter-node or intra-node) and cooperating as part of an overall multi-component application (e.g., a job that couples a simulation application to an analysis application). The Offeror shall describe in detail how this will be supported by the system software infrastructure (e.g., user interfaces, security model, and inter-application communication).
3.2.7 The system should include a mechanism that will allow users to provide containerized software images without requiring privileged access to the system or allowing a user to escalate privilege. The startup time for launching a parallel application in a containerized software image at full system scale should not greatly exceed the startup time for launching a parallel application in the vendor-provided image.
3.2.8 The system should include a mechanism for dynamically configuring external IPv4/IPv6 connectivity to and from compute nodes, enabling special connectivity paths for subsets of nodes on a per-batch-job basis, and allowing fully routable interactions with external services.
3.2.9 The Successful Offeror should provide access to source code, and necessary build environment, for all software except for firmware, compilers, and third party products. The Successful Offeror should provide updates of source code, and any necessary build environment, for all software over the life of the subcontract.
3.2.10 The scheduler should support job workflows with data stage-in and stage-out from local file systems and storage systems accessible only from a remote data transfer system.
3.3 Software Tools and Programming Environment
The primary programming models used in production applications in this timeframe are the Message Passing Interface (MPI), for inter-node communication, and OpenMP, for fine-grained on-node parallelism. While MPI+OpenMP will be the majority of the workload, the ACES laboratories expect some new applications to exercise emerging asynchronous programming models. System support that would accelerate these programming models/runtimes and benefit MPI+OpenMP is desirable.
3.3.1 The system should include an implementation of the most current version of the MPI standard specification. The Offeror shall provide a detailed description of the MPI implementation (including specification version) and support for features such as hardware accelerated collective operations. The Offeror shall describe any limitations relative to the MPI standard.
3.3.2 The Offeror shall describe at what parallel granularity the system can be utilized by MPI-only applications.
3.3.3 The system should include optimized implementations of collective operations utilizing both inter-node and intra-node features where appropriate, including MPI_Barrier, MPI_Allreduce, MPI_Reduce, MPI_Allgather, and MPI_Gather.
3.3.4 The Offeror shall describe the network transport layer of the system including any support for OpenUCX, Portals, libfabric, libverbs, and any other transport layer, including any optimizations of their implementation that will benefit application performance or workflow efficiency.
3.3.5 The system should include a complete implementation of the most current version of the OpenMP standard including, if applicable, accelerator directives, as well as a supporting programming environment. The Offeror shall provide a detailed feature description of the OpenMP implementation(s) and describe any expected deviations from the OpenMP standard.
3.3.6 The Offeror shall provide a description of how applications written to utilize OpenMP will be compiled and executed on the system.
3.3.7 The Offeror shall provide a description of any proposed hardware or software features that enable OpenMP performance optimizations.
3.3.8 The Offeror shall list any PGAS languages and/or libraries that are supported (e.g. UPC, SHMEM/OpenSHMEM, CAF, Global Arrays) and describe any hardware and/or programming environment software that optimizes any of the listed PGAS languages supported on the system. The Offeror shall describe interoperability with MPI+OpenMP.
3.3.9 The Offeror shall describe and list support for any emerging programming models such as asynchronous task/data models (e.g., Legion, STAPL, HPX, or OCR) and describe any system hardware and/or programming environment software it will provide that optimizes any of the supported models. The Offeror shall describe interoperability with MPI+OpenMP.
3.3.10 The Offeror shall describe the proposed hardware and software environment support for:
• Fast thread synchronization of subsets of execution threads;
• Atomic add, fetch-and-add, multiply, bitwise operations, and compare-and-swap operations over 32-bit and 64-bit integers, single-precision, and double-precision operands;
• Atomic compare-and-swap operations over 16-byte wide operands that comprise two double precision values or two 64-bit memory pointer operands;
• Fast context switching or task-switching;
• Fast task spawning for unique and identical tasks with data dependencies;
• Support for active messages.
3.3.11 The Offeror shall describe in detail all programming APIs, languages, compilers and compiler extensions, etc. other than MPI and OpenMP (e.g. OpenACC, CUDA, OpenCL, etc.) that will be supported by the system. It is desirable that instances of all programming models provided be interoperable and efficient when used within a single process or single job running on the same compute node.
3.3.12 The system should include support for the C, C++ (including complete support for C++11/14/17), Fortran 77, Fortran 90, and Fortran 2008 programming languages. Providing multiple compilation environments is highly desirable. The Offeror shall describe any limitations that can be expected in meeting full C++17 support based on current expectations. Key ASC applications push the limits of current Fortran compilers. The Offeror shall describe their support for Fortran, including standards levels and/or coverage of Fortran test suites, such as the FLANG Fortran Test Suite.
3.3.13 The system should include a Python implementation that will run on the compute partition with optimized MPI4Py, NumPy, and SciPy libraries.
3.3.14 The system should include a programming toolchain(s) that enables runtime coexistence of threading in C, C++, and Fortran, from within applications and any supporting libraries using the same compiler toolchain. The Offeror shall describe the interaction between OpenMP and native parallelism expressed in language standards.
3.3.15 The system should include C++ compiler(s) that can successfully build the latest Boost C++ library. The Offeror shall support the most recent stable version of Boost.
3.3.16 The system should include optimized versions of libm, libgsl, BLAS levels 1, 2 and 3, LAPACK, ScaLAPACK, HDF5, NetCDF, and FFTW. It is desirable for these to efficiently interoperate with applications that utilize OpenMP. The Offeror shall describe all other optimized libraries that will be supported, including a description of the interoperability of these libraries with the programming environments proposed.
3.3.17 The system should include a mechanism that enables control of task and memory placement within a node for efficient performance. The Offeror shall provide a detailed description of controls provided and any limitations that may exist.
3.3.18 The system should include a comprehensive software development environment with configuration and source code management tools. On heterogeneous systems, a mechanism (e.g., an upgraded autoconf) should be provided to create configure scripts to build cross-compiled applications on login nodes.
3.3.19 The system should include an interactive parallel debugger with an X11-based graphical user interface. The debugger should provide a single point of control that can debug applications in all supported languages using all granularities of parallelism (e.g. MPI+X) and programming environments provided and scale up to 25% of the system.
3.3.20 The system should include a suite of tools for detailed performance analysis and profiling of user applications. At least one tool should support all granularities of parallelism in mixed MPI+OpenMP programs and any additional programming models supported on the system. The tool suite must provide the ability to support multi-node integrated profiling of on-node parallelism and communication performance analysis. The Offeror shall describe all proposed tools and the scalability limitations of each. The Offeror shall describe tools for measuring I/O behavior of user applications.
3.3.21 The system should include event-tracing tools. Event tracing of interest includes: message-passing event tracing, I/O event tracing, floating point exception tracing, and message-passing profiling. The event-tracing tool API should provide functions to activate and deactivate event monitoring multiple times during execution from within a process.
3.3.22 The system should include single- and multi-node stack-tracing tools. The tool set should include a source-level stack traceback, including an API that allows a running process or thread to query its current stack trace.
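As an illustration of what "an API that allows a running process or thread to query its current stack trace" can look like, the sketch below uses the Python standard library's `traceback.extract_stack()`. It is only an analogy in Python, which 3.3.13 lists as a supported language; it does not represent any vendor tool, and the helper names `current_stack_functions`, `inner`, and `outer` are hypothetical.

```python
import traceback


def current_stack_functions():
    """Return the function names on the calling thread's stack, outermost first."""
    stack = traceback.extract_stack()  # includes this function as the last frame
    return [frame.name for frame in stack]


def inner():
    # A running function queries its own call chain without stopping.
    return current_stack_functions()


def outer():
    return inner()


if __name__ == "__main__":
    # Show only the innermost three frames of the self-reported stack.
    print(outer()[-3:])
```

Each entry returned by `extract_stack()` also carries the file name and line number, which is the kind of source-level detail the requirement asks a traceback to include.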
3.3.23 The system should include tools to assist the programmer in introducing limited levels of parallelism and data structure refactoring to codes using any proposed programming models and languages. Tool(s) should additionally be provided to assist application developers in the design and placement of the data structures with the goal of optimizing data movement/placement for the classes of memory proposed in the system.
3.3.24 The system shall support licensing for the programming environment (compilers, debuggers, optimization tools, optimized math libraries, etc.) for up to twenty (20) concurrent users at job sizes that span from 100’s of small-scale jobs for a single user all the way up to a single job occupying the full scale (100% of the compute partition) of the platform.
3.4 Platform Storage
Platform storage is certain to be one of the advanced technology areas included in any system delivered in this timeframe. The ACES laboratories anticipate these emerging technologies will enable new usage models. With this in mind, an accompanying whitepaper, “APEX Workflows,” is provided that describes how application teams use HPC resources today to advance scientific goals. The whitepaper is designed to provide a framework for reasoning about the optimal solution to these challenges. The whitepaper is intended to help an Offeror develop a platform storage architecture response that accelerates the science workflows while minimizing the total number of platform storage tiers. The workflows document can be found on the Crossroads website.
3.4.1 The system should include platform storage capable of retaining all application input, output, and working data for 12 weeks (84 days), estimated at a minimum of 12% of baseline system memory per day.
3.4.2 The system should include platform storage with a warranted durability or a maintenance plan such that the platform storage is capable of absorbing approximately two times the system’s baseline memory per day for a nominal 5 years.
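The capacity and endurance targets in 3.4.1 and 3.4.2 reduce to simple multiples of baseline system memory, sketched below. The baseline memory size is not given in this section, so the value used in the example is purely hypothetical, as are the function names; the sketch also assumes a 365-day year for the "nominal 5 years".

```python
def retention_capacity_bytes(baseline_memory_bytes, days=84, daily_fraction=0.12):
    """Capacity needed to retain `days` of data at `daily_fraction` of memory per day (3.4.1)."""
    return baseline_memory_bytes * days * daily_fraction


def lifetime_writes_bytes(baseline_memory_bytes, writes_per_day=2.0,
                          years=5, days_per_year=365):
    """Total data absorbed over the storage lifetime (3.4.2).

    Assumption: a year is counted as 365 days.
    """
    return baseline_memory_bytes * writes_per_day * years * days_per_year


if __name__ == "__main__":
    memory = 1.0e15  # hypothetical baseline memory of 1 PB, not a value from the document
    print(retention_capacity_bytes(memory) / memory)  # capacity as a multiple of memory
    print(lifetime_writes_bytes(memory) / memory)     # lifetime writes as a multiple of memory
```

Under these assumptions the retention target is roughly 10 times baseline memory (84 × 0.12 = 10.08) and the endurance target is 3,650 times baseline memory written over five years.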
3.4.3 The Offeror shall describe how the system provides sufficient bandwidth to support a JMTTI/Delta-Ckpt ratio of greater than 200. See Table 2 Target System Configuration.
3.4.4 The Offeror shall describe how the storage system provides sufficient performance to asynchronously migrate 80% of memory (i.e. a checkpoint from 3.4.3) from the fastest tier to the capacity tier in 75% of JMTTI.
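The two targets in 3.4.3 and 3.4.4 can be turned into bandwidth bounds once a JMTTI and a memory size are chosen. The sketch below takes the checkpoint to be 80% of memory, as 3.4.4 states; the JMTTI and memory values in the example are hypothetical placeholders (Table 2 is not reproduced here), and the function names are illustrative only.

```python
def max_checkpoint_seconds(jmtti_seconds, ratio=200.0):
    """Upper bound on Delta-Ckpt implied by JMTTI / Delta-Ckpt > ratio (3.4.3).

    The target is a strict inequality, so Delta-Ckpt must be below this value.
    """
    return jmtti_seconds / ratio


def min_checkpoint_bandwidth(memory_bytes, jmtti_seconds, fraction=0.80, ratio=200.0):
    """Bandwidth (bytes/s) needed to write `fraction` of memory within the bound above."""
    return fraction * memory_bytes / max_checkpoint_seconds(jmtti_seconds, ratio)


def min_migration_bandwidth(memory_bytes, jmtti_seconds, fraction=0.80, window=0.75):
    """Bandwidth (bytes/s) to migrate `fraction` of memory in `window` of JMTTI (3.4.4)."""
    return fraction * memory_bytes / (window * jmtti_seconds)


if __name__ == "__main__":
    memory = 1.0e15   # hypothetical: 1 PB of system memory
    jmtti = 86400.0   # hypothetical: 24-hour job mean time to interrupt
    print(max_checkpoint_seconds(jmtti))
    print(min_checkpoint_bandwidth(memory, jmtti))
    print(min_migration_bandwidth(memory, jmtti))
```

With these placeholder inputs the checkpoint must finish in under 432 seconds, which shows why the fastest tier needs far more bandwidth than the asynchronous migration path to the capacity tier.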
3.4.5 The Offeror shall describe how the system satisfies a minimum storage bandwidth requirement capable of writing 25% of baseline system memory in less than 300 seconds.
3.4.6 The Offeror shall describe how a job running across the entire system with an MPI rank per core can create a file from every MPI rank in fewer than 10 seconds between the first and last create. If the response requires more than a single pre-existing directory, the Offeror shall also describe the required directory layout and the time required to create those directories.
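The targets in 3.4.5 and 3.4.6 translate into a minimum write bandwidth and a minimum aggregate file-create rate. The sketch below shows the arithmetic; the memory size and core count in the example are hypothetical, since neither value appears in this section, and the function names are illustrative.

```python
def min_write_bandwidth(baseline_memory_bytes, fraction=0.25, seconds=300.0):
    """Bandwidth (bytes/s) needed to write `fraction` of memory in `seconds` (3.4.5)."""
    return fraction * baseline_memory_bytes / seconds


def min_create_rate(total_ranks, seconds=10.0):
    """Aggregate creates per second so every rank creates one file within `seconds` (3.4.6)."""
    return total_ranks / seconds


if __name__ == "__main__":
    memory = 1.2e15      # hypothetical baseline memory in bytes
    ranks = 1_000_000    # hypothetical: one MPI rank per core on a million-core system
    print(min_write_bandwidth(memory))  # bytes per second
    print(min_create_rate(ranks))       # file creates per second
```

Both targets are strict ("less than" and "fewer than"), so the delivered rates need to exceed these values rather than merely equal them.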
3.4.7 The Offeror shall describe all available interfaces to platform storage for the system, including but not limited to:
• POSIX
• APIs
• Exceptions to POSIX compliance.
• Time to consistency and any potential delays for reliable data consumption.
• Any special requirements for users to achieve performance and/or consistent data.
3.4.8 The Offeror shall describe the reliability characteristics of platform storage, including but not limited to:
• Any single point of failure for all proposed platform storage tiers (note any component failure that will lead to temporary or permanent loss of data availability).
• Enumerate platform storage tiers that are designed to be less reliable or do not use data protection techniques (e.g., replication, erasure coding).
• Describe the impacts to a running compute job due to storage-related failures and during the recovery from said failure for each reliable platform tier. Specifically describe the job impact during failure, and separately describe the job impact during recovery.
• Vendor supplied mechanisms to ensure data integrity for each platform storage tier (e.g., data scrubbing processes, background checksum verification, etc.).
• Login or interactive nodes access to platform storage when the compute nodes are unavailable.
3.4.9 The Offeror shall describe system features for platform storage tier management designed to accelerate workflows, including but not limited to:
• Mechanisms for migrating data between platform storage tiers, including manual, scheduled, and/or automatic data migration to include rebalancing, draining, or rewriting data across devices within a tier.
• How platform storage will be instantiated with each job if it needs to be, and how platform storage may be persisted across jobs.
• The capabilities provided to define per-user policies and data movement between different tiers of platform storage or external storage resources (e.g., archives).
• Describe any data-related consistency, whether optional or inherent, between storage tiers (e.g. write-back caching).
• The ability to integrate with a scheduling resource.
• Mechanism to incrementally add capacity and bandwidth to a particular tier of platform storage. Please also describe functional and performance impacts to running jobs while the system integrates new resources.
• Capabilities to manage or interface platform storage with external storage resources or archives (e.g., fast storage layers or HPSS).
3.4.10 The Offeror shall describe software features that allow users to optimize I/O for the workflows of the system, including but not limited to:
• Batch data movement capabilities, especially when data resides on multiple tiers of platform storage.
• Methods for users to create and manage platform storage allocations.
• Any ability to directly target a tier for writing or reading data.
• Locality-aware job/data scheduling.
• Methods for users to exploit any enhanced performance of relaxed consistency.
• Methods for enabling user-defined metadata with the platform storage solution.
3.4.11 The Offeror shall describe the method and rate for enumerating the entire platform storage metadata. Describe any special capabilities that would mitigate user performance issues and/or allow the enumeration to complete in fewer than 4 hours; expect at least 1 billion objects.
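The enumeration target in 3.4.11 implies a minimum sustained metadata scan rate, which the sketch below computes. The 1 billion objects and 4 hours come from the requirement; the function names and the 100,000 objects/s rate in the example are hypothetical.

```python
def min_enumeration_rate(object_count=1_000_000_000, hours=4.0):
    """Objects per second needed to enumerate `object_count` objects in `hours`."""
    return object_count / (hours * 3600.0)


def enumeration_hours(object_count, objects_per_second):
    """Hours needed to enumerate `object_count` objects at a given sustained rate."""
    return object_count / objects_per_second / 3600.0


if __name__ == "__main__":
    print(min_enumeration_rate())                       # objects per second
    print(enumeration_hours(1_000_000_000, 100_000.0))  # hours at a hypothetical rate
```

Because the requirement says "at least 1 billion objects", the rate from `min_enumeration_rate()` is a floor; a larger namespace raises it proportionally.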
3.4.12 The Offeror shall describe capabilities to comprehensively collect platform storage usage data and note those that can be collected out-of-band. Storage metrics for the system may include, but are not limited to:
§ Per client metrics and frequency of collection, including but not limited to: the number of bytes read or written, number of read or write invocations, client cache statistics, and metadata statistics such as number of opens, closes, creates, and other system calls of relevance to the performance of platform storage.
§ Job level metrics, such as the number of sessions each job initiates with each platform storage tier, session duration, total data transmitted (separated as reads and writes) during the session, and the number of total platform storage invocations made during the session.
§ Platform storage tier metrics and frequency of collection, such as the number of bytes read, number of bytes written, number of read invocations, number of write invocations, bytes deleted/purged, number of I/O sessions established, and periods of outage/unavailability.
§ Job level metrics describing usage of a tiered platform storage hierarchy, such as how long files are resident in each tier, hit rate of file pages in each tier (i.e., whether pages are actually read and how many times data is re-read), fraction of data moved between tiers because of a) explicit programmer control and b) transparent caching, and time interval between accesses to the same file (e.g., how long until an analysis program reads a simulation generated output file).
3.4.13 The Offeror shall propose a method for providing access to platform storage from other systems at the facility. In the case of tiered platform storage, at least one tier must satisfy this requirement.
3.4.14 The Offeror shall describe the capability for platform storage tiers to be repaired, serviced, and incrementally patched/upgraded while running different versions of software or firmware without requiring a storage tier-wide outage. The Offeror shall describe the level of performance degradation, if any, anticipated during the repair or service interval.
3.4.15 The Offeror shall specify the time required and the optimal number of compute nodes required to achieve peak read and write performance to the fastest platform storage tier using the following data sets:
§ A 1 TB data set of 20 GB files.
§ A 5 TB data set of any chosen file size. Offeror shall report the file size chosen.
§ Usable capacity of the fastest tier using 32 MB files.
3.5 Application Performance
Assuring that real applications perform efficiently on Crossroads is key for their success. Because the full applications are large, often with millions of lines of code, and in some cases are export controlled, a suite of benchmarks has been developed for RFP response evaluation and system acceptance. The benchmark codes are representative of the workloads of the NNSA laboratories but often smaller than the full applications. The performance of the benchmarks will be evaluated as part of both the RFP response and system acceptance. Final benchmark acceptance performance targets will be negotiated after a final system configuration is defined. All performance tests must continue to meet negotiated acceptance criteria throughout the lifetime of the system. System acceptance for Crossroads will include export controlled ASC codes (to include code at 0D999 and ITAR control level) from each of the three NNSA laboratories. The benchmark information and licensing requirements regarding the Crossroads acceptance codes, and supplemental materials, can be found on the Crossroads website.
3.5.1 The Offeror shall provide responses for the benchmarks (SNAP, HPCG, PENNANT, MiniPIC, UMT, VPIC, Branson) provided on the Crossroads benchmarks link on the Crossroads website. All modifications or new variants of the benchmarks (including makefiles, build scripts, and environment variables) are to be supplied in the Offeror’s response.
§ The results of all problem sizes (baseline and optimized) should be provided in the Offeror’s Scalable System Improvement (SSI) spreadsheets. SSI is the calculation used for measuring improvement and is documented on the Crossroads website, along with the SSI spreadsheets. If predicted or extrapolated results are provided, the methodology used to derive them should be clearly documented.
§ The Offeror shall provide licenses for the system for all compilers, libraries, and runtimes used to achieve benchmark performance.
3.5.2 The Offeror shall provide performance results for the system that may be benchmarked, predicted, and/or extrapolated for the baseline MPI+OpenMP variants of the benchmarks. The Offeror may modify the benchmarks to include extra OpenMP pragmas as required, but the benchmark must remain a standard-compliant program that maintains existing output subject to the validation criteria described in the benchmark run rules.
3.5.3 The Offeror shall optionally provide performance results from an Offeror optimized variant of the benchmarks. The Offeror may modify the benchmarks, including the algorithm and/or programming model used to demonstrate high system performance. If algorithmic changes are made, the Offeror shall provide an explanation of why the results may deviate from validation criteria described in the benchmark run rules.
3.5.4 In addition to the Crossroads benchmarks, an ASC Simulation Code Suite representing the three NNSA laboratories will be used to judge performance at time of acceptance (Mercury from Lawrence Livermore, PartiSN from Los Alamos, and SPARC from Sandia). NNSA mission requirements forecast the need for a 6X or greater improvement over the ASC Trinity system (Haswell partition) for the code suite, measured using SSI. Final acceptance performance targets will be established during negotiations after a final system configuration is defined. Information regarding the ASC Simulation Code Suite can be found on the Crossroads website. Source code will be provided to the Offeror, but it will require compliance with export control laws and no cost licensing agreements.
3.5.5 The Offeror shall report or project the number of cores necessary to saturate the available node baseline memory bandwidth as measured by the Crossroads memory bandwidth benchmark found on the Crossroads website.
§ If the node contains heterogeneous cores, the Offeror shall report the number of cores of each architecture necessary to saturate the available baseline memory bandwidth.
§ If multiple tiers of memory are available, the Offeror shall report the above for every functional combination of core architecture and baseline or extended memory tier.
3.5.6 The Offeror shall report or project the sustained dense matrix multiplication performance on each type of processor core (individually and/or in parallel) of the system node architecture(s) as measured by the Crossroads multithreaded DGEMM benchmark found on the Crossroads website.
§ The Offeror shall describe the percentage of theoretical double-precision (64-bit) computational peak, which the benchmark GFLOP/s rate achieves for each type of compute core/unit in the response, and describe how this is calculated.
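One common way to compute the percentage of theoretical peak requested in 3.5.6 is sketched below. The peak formula (cores × clock × double-precision FLOPs per cycle) is a conventional assumption rather than a method prescribed by this document, and all numbers are hypothetical; the sustained clock under a DGEMM load may differ from the nominal clock.

```python
def theoretical_peak_gflops(cores: int, clock_ghz: float,
                            dp_flops_per_cycle: int) -> float:
    """Conventional peak estimate: cores x clock (GHz) x DP FLOPs per cycle per core."""
    return cores * clock_ghz * dp_flops_per_cycle


def percent_of_peak(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of theoretical peak achieved by the benchmark, in percent."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return 100.0 * measured_gflops / peak_gflops


# Hypothetical node: 32 cores at 2.0 GHz, 16 DP FLOPs per cycle per core.
peak = theoretical_peak_gflops(32, 2.0, 16)
# Hypothetical measured DGEMM rate in GFLOP/s.
print(f"peak = {peak} GFLOP/s, achieved = {percent_of_peak(870.4, peak):.1f}%")
```

Stating the clock frequency and per-cycle FLOP count used for the peak makes the reported percentage reproducible.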
3.5.7 The Offeror shall report, or project, the MPI two-sided message rate of the nodes in the system under the following conditions measured by the communication benchmark specified on the Crossroads website:
§ Using one, two, four, eight, and half the number of cores of MPI ranks per node with MPI_THREAD_SINGLE.
§ Using one, two, four, eight, and half the number of cores of MPI ranks per node and multiple threads per rank with MPI_THREAD_MULTIPLE.
§ The Offeror may additionally choose to report on other configurations, including MPI_THREAD_SERIALIZED and MPI_THREAD_FUNNELED.
3.5.8 The Offeror shall report, or project, the MPI one-sided message rate of the nodes in the system for all passive synchronization RMA methods with both pre-allocated and dynamic memory windows under the following conditions measured by the communication benchmark specified on the Crossroads website using:
§ One, two, four, eight, and half the number of cores of MPI ranks per node with MPI_THREAD_SINGLE.
§ One, two, four, eight, and half the number of cores of MPI ranks per node and multiple threads per rank with MPI_THREAD_MULTIPLE.
§ The Offeror may additionally choose to report on other configurations, including MPI_THREAD_SERIALIZED and MPI_THREAD_FUNNELED.
3.5.9 The Offeror shall report, or project, the time to perform the following collective operations for 25%, 50%, and 100% of the compute partition nodes in the system and report on core occupancy during the operations measured by the communication benchmark specified on the Crossroads website for:
§ An 8 byte MPI_Allreduce operation.
§ An 8 byte per rank MPI_Allgather operation.
3.5.10 The Offeror shall report, or project, the minimum and maximum off-node latency of the system for MPI two-sided messages using the following threading modes measured by the communication benchmark specified on the Crossroads website:
§ MPI_THREAD_SINGLE with a single thread per rank.
§ MPI_THREAD_MULTIPLE with two or more threads per rank.
3.5.11 The Offeror shall report, or project, the minimum and maximum off-node latency for MPI one-sided messages of the system for all passive synchronization RMA methods with both pre-allocated and dynamic memory windows using the following threading modes measured by the communication benchmark specified on the Crossroads website:
§ MPI_THREAD_SINGLE with a single thread per rank.
§ MPI_THREAD_MULTIPLE with two or more threads per rank.
3.5.12 The Offeror shall provide an efficient implementation of MPI_THREAD_MULTIPLE. Bandwidth, latency, and message throughput measurements using the MPI_THREAD_MULTIPLE thread support level should have no more than a 10% performance degradation when compared to using the MPI_THREAD_SINGLE support level as measured by the communication benchmark specified on the Crossroads website.
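The 10% limit in 3.5.12 can be checked mechanically once both thread support levels have been measured. The sketch below assumes one plausible definition of degradation (relative change against the MPI_THREAD_SINGLE result, with the sign chosen so that positive means worse); the benchmark run rules may define it differently, and the metric names are hypothetical.

```python
# Direction of "better" for each metric family (an assumption of this sketch).
HIGHER_IS_BETTER = {"bandwidth": True, "message_rate": True, "latency": False}


def degradation(metric: str, single: float, multiple: float) -> float:
    """Fractional degradation of the MPI_THREAD_MULTIPLE result relative to
    the MPI_THREAD_SINGLE result; positive values mean MULTIPLE is worse."""
    if single <= 0:
        raise ValueError("single-thread-level measurement must be positive")
    if HIGHER_IS_BETTER[metric]:
        return (single - multiple) / single
    return (multiple - single) / single


def meets_target(metric: str, single: float, multiple: float,
                 limit: float = 0.10) -> bool:
    """True when the degradation does not exceed the limit (10% by default)."""
    return degradation(metric, single, multiple) <= limit


# Hypothetical measurements: bandwidth in GB/s, latency in microseconds.
print(meets_target("bandwidth", 10.0, 9.5))   # 5% lower bandwidth
print(meets_target("latency", 1.0, 1.2))      # 20% higher latency
```

Treating latency separately matters because a larger latency is a regression while a larger bandwidth or message rate is an improvement.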
3.5.13 The Offeror shall report, or project, the maximum I/O bandwidths of the system as measured by the IOR benchmark specified on the Crossroads website.
3.5.14 The Offeror shall report, or project, the metadata rates of the system as measured by the MDTEST benchmark specified on the Crossroads website.
3.5.15 The Successful Offeror shall be required at time of acceptance to meet specified targets for acceptance benchmarks, and mission codes for Crossroads, listed on the Crossroads website.
3.5.16 The Offeror shall describe how the system may be configured to support a high rate and bandwidth of TCP/IP connections to external services both from compute nodes and directly to and from the platform storage, including:
§ Compute node external access should allow all nodes to each initiate 1 connection concurrently within a 1 second window.
§ Transfer of data over the external network to and from the compute nodes and platform storage at 100 GB/s per direction of a 1 TB dataset comprised of 20 GB files in 10 seconds.
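The figures in the transfer bullet of 3.5.16 are mutually consistent, as the short calculation below shows. It assumes decimal units (1 TB = 1000 GB) and ignores protocol overhead and connection setup; the function names are hypothetical.

```python
import math


def transfer_seconds(dataset_gb: float, rate_gb_per_s: float) -> float:
    """Ideal transfer time at a sustained rate, ignoring protocol overhead."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate_gb_per_s must be positive")
    return dataset_gb / rate_gb_per_s


def file_count(dataset_gb: float, file_gb: float) -> int:
    """Number of files needed to hold the dataset at the given file size."""
    return math.ceil(dataset_gb / file_gb)


# 1 TB dataset of 20 GB files moved at 100 GB/s in one direction.
print(file_count(1000, 20), "files,", transfer_seconds(1000, 100), "seconds")
```

Because the ideal time already equals the 10 second target, any per-file or per-connection overhead has to be hidden by moving the files concurrently.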
3.6 Resilience, Reliability, and Availability
The ability to achieve the NNSA mission goals hinges on the productivity of system users. System availability is therefore essential and requires system-wide focus to achieve a resilient, reliable, and available system. For each metric specified below, the Offeror must describe how they arrived at their estimates (e.g., failure rates of individual components including hardware and software that make up major aspects of Offeror’s estimate).
3.6.1 Failure of the system management and/or RAS system(s) should not cause a system or job interrupt. This requirement does not apply to a RAS system feature, which automatically shuts down the system for safety reasons, such as an overheating condition.
3.6.2 The minimum System Mean Time Between Interrupt (SMTBI) should be greater than 720 hours.
3.6.3 The minimum Job Mean Time To Interrupt (JMTTI) should be greater than 24 hours. Automatic restarts do not mitigate a job interrupt for this metric.
3.6.4 The ratio of JMTTI/Delta-Ckpt should be greater than 200. This metric is a measure of the system’s ability to make progress over a long period of time and corresponds to an efficiency of approximately 90%. If, for example, the JMTTI requirement is not met, the target JMTTI/Delta-Ckpt ratio ensures this minimum level of efficiency.
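The link between a ratio of 200 and an efficiency of roughly 90% can be reproduced with Young's first-order checkpoint model. This is an assumption about the intended model, since the document does not name one: Delta-Ckpt is taken to be the time to commit one checkpoint, restart time is ignored, and the function names are hypothetical.

```python
import math


def optimal_interval(delta_ckpt: float, jmtti: float) -> float:
    """Young's approximation of the compute time between checkpoints."""
    return math.sqrt(2.0 * delta_ckpt * jmtti)


def efficiency(delta_ckpt: float, jmtti: float) -> float:
    """First-order efficiency at the optimal interval:
    1 - (checkpoint overhead + expected rework after an interrupt)."""
    tau = optimal_interval(delta_ckpt, jmtti)
    checkpoint_overhead = delta_ckpt / tau   # fraction of time spent writing checkpoints
    expected_rework = tau / (2.0 * jmtti)    # on average half an interval is lost per interrupt
    return 1.0 - (checkpoint_overhead + expected_rework)


# JMTTI = 24 hours and a ratio of 200 give Delta-Ckpt = 0.12 hours (7.2 minutes).
jmtti_hours = 24.0
delta_hours = jmtti_hours / 200.0
print(f"interval = {optimal_interval(delta_hours, jmtti_hours):.2f} h, "
      f"efficiency = {efficiency(delta_hours, jmtti_hours):.3f}")
```

Under this model the wasted fraction is sqrt(2 / ratio), so the efficiency depends only on the ratio; that is why the ratio target still bounds efficiency when the absolute JMTTI target is missed.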
3.6.5 An immediate re-launch of an interrupted job should not require a complete resource reallocation. If a job is interrupted, there should be a mechanism that allows re-launch of the application using the same allocation of resource (e.g., compute nodes) that it had before the interrupt or an augmented allocation when part of the original allocation experiences a hard failure.
3.6.6 A complete system initialization should take no more than 30 minutes. The Offeror shall describe the full system initialization sequence and timings.
3.6.7 The system should achieve 99% scheduled system availability. System availability is defined in the glossary.
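To give a feel for what 99% means in hours, the sketch below converts an availability target into a downtime budget. It assumes the simplest reading, in which availability is the fraction of scheduled hours the system is usable; the glossary definition governs, and the scheduled-hours figure is hypothetical.

```python
def downtime_budget_hours(scheduled_hours: float, availability: float) -> float:
    """Hours of unavailability permitted within the scheduled period
    for a given availability target (expressed as a fraction)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return scheduled_hours * (1.0 - availability)


# Hypothetical: the system is scheduled to be available all 8760 hours of a year.
print(f"{downtime_budget_hours(8760, 0.99):.1f} hours per year")
```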
3.6.8 The Offeror shall describe the resilience, reliability, and availability mechanisms and capabilities of the system including, but not limited to:
§ Any condition or event that can potentially cause a job interrupt.
§ Resiliency features to achieve the availability targets.
§ Single points of failure (hardware or software), and the potential effect on running applications and system availability.
§ How a job maintains its resource allocation and is able to relaunch an application after an interrupt.
§ A system-level mechanism to collect failure data for each kind of component.
3.7 Application Transition Support and Early Access to ACES Technologies
The Crossroads system may include numerous advanced technologies. The Offeror shall include in their proposal a plan to effectively utilize these technologies and assist in transitioning the mission workflows to the system. The Successful Offeror shall support efforts to transition the Advanced Technology Development Mitigation (ATDM) codes to the systems. ATDM codes are currently being developed by the three NNSA weapons laboratories, Lawrence Livermore, Los Alamos, and Sandia. These codes may require compliance with export control laws and no cost licensing agreements. Information about the ATDM program can be found on the NNSA website.
3.7.1 The Successful Offeror should provide a vehicle for supporting the successful demonstration of the application performance requirements and the transition of key applications to the Crossroads system (e.g., a Center of Excellence). Support should be provided by the Offeror and all of its key advanced technology providers (e.g., processor vendors, integrators, etc.). The Successful Offeror should provide experts in the areas of application porting and performance optimization in the form of staff training, general user training, and deep-dive interactions with a set of application code teams. Support should be provided from the date of subcontract execution through two (2) years after final acceptance of the systems.
3.7.2 The Successful Offeror shall describe their support structure for the proposed programming environment. This includes mechanisms for reporting issues and requesting new functionality, in addition to escalation paths/priorities available to Crossroads’ applications. Support should be provided up to two (2) years after final acceptance of the systems.
3.7.3 The Offeror shall describe which of the proposed hardware and software technologies (physical hardware, emulators, and/or simulators) will be available for access before system delivery and in what timeframe. The proposed technologies should provide value in advanced preparation for the delivery of the final Crossroads system for pre-system-delivery application porting and performance assessment activities.
3.8 Target System Configuration
ACES determined the following targets for Crossroads System Configurations. Offerors shall state projections for their proposed system configurations relative to these targets.
Table 2 Target System Configuration
Target | Crossroads
Baseline Memory Capacity (excludes all levels of on-die CPU cache) | > 0.5 PiB
Benchmark SSI increase over Trinity system (Haswell partition) | > 6X
Platform Storage | > 10X Baseline Memory
Nameplate Power | < 20 MW
Peak Power | < 18 MW
Nominal Power | < 15 MW
Idle Power | < 10% Nameplate Power
Job Mean Time to Interrupt (JMTTI), calculated for a single job running on the entire system | > 24 Hours
System Mean Time to Interrupt (SMTTI) | > 720 Hours
JMTTI/Delta-Ckpt | > 200
System Availability | > 99%
3.9 System Operations
System management should be an integral feature of the overall system and should provide the ability to effectively manage system resources with high utilization and throughput under a workload with a wide range of concurrencies. The Successful Offeror should provide system administrators, security officers, and user-support personnel with productive and efficient system configuration management capabilities and an enhanced diagnostic environment.
3.9.1 The system should include scalable integrated system management capabilities that provide human interfaces and APIs for system configuration and its ability to be automated through configuration management software, software management, change management through a version control system, local site integration, and system configuration backup and recovery.
3.9.2 The system should include a means for tracking and analyzing all software updates, software and hardware failures, and hardware replacements over the lifetime of the system. All patches and releases should include change logs with detailed descriptions of bug fixes and features and also what services are affected by these changes.
3.9.3 The system should include the ability to perform rolling upgrades and rollbacks on a subset of the system while the balance of the system remains in production operation. The Offeror shall describe the mechanisms, capabilities, workload management support, and limitations of rolling upgrades and rollbacks. No more than half the system partition should be required to be down for rolling upgrades and rollbacks.
3.9.4 The system should include an efficient mechanism for reconfiguring and rebooting compute nodes. The Offeror shall describe in detail the compute node reboot mechanism, differentiating types of boots (warm boot vs. cold boot) required for different node features, as well as how the time required to reboot scales with the number of nodes being rebooted. Warm boot timings for both file system and cluster boot shall be independently provided and also a combined warm boot file system and cluster boot timing if it differs.
3.9.5 The system should include a mechanism whereby all monitoring data and logs captured are available to the system owner, and will support an open monitoring API to facilitate lossless, scalable sampling and data collection for monitored data. Any filtering that may need to occur will be at the option of the system manager. The system will include a sampling and connection framework that allows the system manager to configure independent alternative parallel data streams to be directed off the system to site-configurable consumers.
3.9.6 The system should include a mechanism to collect and provide metrics and logs which monitor the status, health, utilization, and performance of the system, subsystems, and all major components, including, but not limited to:
§ Environmental measurement capabilities for all systems and peripherals and their sub-systems and supporting infrastructure, including power and energy consumption and control.
§ Internal high speed network performance counters, including measures of network congestion and network resource consumption.
§ Information enabling traffic and congestion attribution, with explanation of the attribution logic.
§ Hardware performance counters enabling application performance assessment with the ability to integrate these with system metric (e.g., network performance counters) data.
§ All levels of integrated and attached platform storage.
§ The system as a whole, including hardware performance counters for metrics for all levels of integrated and attached platform storage.
3.9.7 The Offeror shall describe what tools and APIs it will provide for the collection, analysis, integration, and visualization of metrics and logs produced by the system (e.g., peripherals, integrated and attached platform storage, and environmental data, including power and energy consumption). The description should include any capabilities to configure collection rates for the available metrics.
3.9.8 The Offeror shall describe all mechanisms for obtaining application performance and progress information, such as enabling application specific logs and metrics to be collected and transported off the system, or enabling hardware and system performance counters to be collected and transported off the system and associated with particular jobs/applications.
3.9.9 The Offeror shall describe the system configuration management and diagnostic capabilities of the system that address the following topics:
§ Detailed description of the system management support.
§ Any effect or overhead of software management tool components on the CPU or memory available on compute nodes.
§ Release plan, with regression testing and validation for all system related software and security updates.
§ Support for multiple simultaneous or alternative system software configurations, including estimated time and effort required to install both a major and a minor system software update.
§ User activity tracking, such as audit logging and process accounting.
§ Unrestricted privileged access to all software and hardware components and all related performance metrics delivered with the system.
3.9.10 The system should provide a mechanism for reporting all basic component and essential services state (e.g., up/down/running) and changes of state. The system should also provide documented APIs for querying the state.
3.9.11 The Offeror should provide a description of all fundamental data and associated metrics and computations used to assess status, health, utilization, and performance in 3.9.6.
3.9.12 The Offeror shall describe all measurement capabilities (system, rack/cabinet, board, node, component, and sub-component level) for the system, including control and response times, sampling frequency, accuracy of the data, and timestamps of the data for individual points of measurement and control.
3.10 Power and Energy
Power, energy, and temperature will be critical factors in how the ACES laboratories manage systems in this time frame and must be an integral part of overall Systems Operations. The solution must be well integrated into other intersecting areas (e.g., facilities, resource management, runtime systems, and applications). The ACES laboratories expect a growing number of use cases in this area that will require a vertically integrated solution.
3.10.1 The Offeror shall describe all power, energy, and temperature operational measurement capabilities (system, rack/cabinet, board, node, component, and sub-component level) for the system, including control and response times, sampling frequency, accuracy of the data, and timestamps of the data for individual points of measurement and control.
3.10.2 The Offeror shall describe all operational control capabilities it will provide to affect power or energy use (system, rack/cabinet, board, node, component, and sub-component level).
3.10.3 The system should include system-level interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system, including but not limited to:
§ AC measurement capabilities at the system or rack level.
§ System-level minimum and maximum power settings (e.g., power caps).
§ System-level power ramp up and down rate.
§ Scalable collection and retention of all measurement data such as:
    § point-in-time power data.
    § energy usage information.
    § minimum and maximum power data.
3.10.4 The system should include resource manager interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system, including but not limited to:
§ Job and node level minimum and maximum power settings.
§ Job and node level power ramp up and down rate.
§ Job and node level processor and/or core frequency control.
§ System and job level profiling and forecasting.
    § e.g., prediction of hourly power averages > 24 hours in advance with a 1 MW tolerance.
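The forecasting example in 3.10.4 can be evaluated after the fact by comparing predicted and measured hourly averages. The sketch below assumes the 1 MW tolerance is an absolute error bound applied to each hourly average; that interpretation, the function name, and the sample values are assumptions, not part of the requirement.

```python
from typing import Sequence


def forecast_within_tolerance(forecast_mw: Sequence[float],
                              measured_mw: Sequence[float],
                              tolerance_mw: float = 1.0) -> bool:
    """True when every hourly forecast is within tolerance_mw of the
    measured hourly average for the same hour."""
    if len(forecast_mw) != len(measured_mw):
        raise ValueError("forecast and measurement must cover the same hours")
    return all(abs(f - m) <= tolerance_mw
               for f, m in zip(forecast_mw, measured_mw))


# Hypothetical hourly averages in MW, forecast made more than 24 hours ahead.
forecast = [12.0, 13.5, 14.0]
measured = [12.4, 12.9, 14.8]
print(forecast_within_tolerance(forecast, measured))
```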
3.10.5 The system should include application and runtime system interfaces that enable measurement and dynamic control of power and energy relevant characteristics of the system including but not limited to:
§ Node level minimum and maximum power settings.
§ Node level processor and/or core frequency control.
§ Node level application hints, such as:
    § application entering serial, parallel, computationally intense, I/O intense or communication intense phase.
3.10.6 The system should include an integrated API for all levels of measurement and control of power relevant characteristics of the system. It is preferable that the provided API complies with the High Performance Computing Power Application Programming Interface Specification (http://powerapi.sandia.gov).
3.10.7 The Offeror shall project (and report) the Nameplate, Peak, Nominal, and Idle Power of the system.
3.10.8 The Offeror shall describe any controls available to enforce or limit power usage below Nameplate power and the reaction time of this mechanism (e.g., what duration and magnitude can power usage exceed the imposed limits).
3.10.9 The Offeror shall describe the status of the system when in an Idle State (describe all Idle States if multiple are available) and the time to transition from the Idle State (or each Idle State if there are multiple) to the start of job execution.
3.11 Facilities and Site Integration
3.11.1 The system should use 3-phase Delta 480 VAC (four-wire system, three phases and one ground). Other system infrastructure components (e.g., disks, switches, login nodes, and mechanical subsystems such as CDUs) must use either 3-phase 480 VAC (strongly preferred), 3-phase 208 VAC (second choice), or single-phase 120/208 VAC (third choice). The total number of individual branch circuits and phase load imbalance should be minimized.
3.11.2 All equipment and power control hardware of the system should be Nationally Recognized Testing Laboratories (NRTL) certified and bear appropriate NRTL labels.
3.11.3 Every rack, network switch, interconnect switch, node, and disk enclosure should be clearly labeled with a unique identifier visible from the front of the rack and the rear of the rack, as appropriate, when the rack door is open. These labels will be high quality so that they do not fall off, fade, disintegrate, or otherwise become unusable or unreadable during the lifetime of the system. Nodes will be labeled from the rear with a unique serial number for inventory tracking. It is desirable that motherboards also have a unique serial number for inventory tracking. Serial numbers shall be visible without having to disassemble the node, or they must be able to be queried from the system management console.
3.11.4 All components in a rack intended to be serviced while the rack has power shall be fully serviceable without danger of touching an exposed conducting surface. Consider power switches for individual components that may need to be powered-off/-on individually. Consider minimizing the number of connecting cables that need to be removed to power-off/-on a component. Consider the placement of connectors, handles, etc., that must be used to remove and replace a component, with respect to conducting surfaces.
3.11.5 Table 3 below shows target facility requirements identified by ACES for the Crossroads system. The Offeror shall describe the features of its proposed systems relative to site integration at the respective facilities, including:
§ Description of the physical packaging of the system, including dimensioned drawings of individual cabinet types and the floor layout of the entire system.
§ Remote environmental monitoring capabilities of the system and how it would integrate into facility monitoring.
§ Emergency shutdown capabilities.
§ Detailed descriptions of power and cooling distributions throughout the system, including power consumption for all subsystems.
§ Description of parasitic power losses within Offeror’s equipment, such as fans, power supply conversion losses, power-factor effects, etc. For the computational and platform storage subsystems separately, give an estimate of the total power and parasitic power losses (whose difference should be power used by computational or platform storage components) at the minimum and maximum ITUE, which is defined as the ratio of total equipment power over power used by computational or platform storage components. Describe the conditions (e.g., “idle”) at which the extrema occur.
§ OS distributions or other client requirements to support off-system access to the platform storage (e.g., LANL File Transfer Agents).
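The ITUE definition given in the parasitic power bullet above translates directly into a small calculation. The sketch below follows that definition (total equipment power divided by the power used by the computational or platform storage components, where the latter is total power minus parasitic losses); the function name and wattages are hypothetical.

```python
def itue(total_power_kw: float, parasitic_power_kw: float) -> float:
    """ITUE = total equipment power / power used by computational or
    platform storage components (total minus parasitic losses)."""
    component_power_kw = total_power_kw - parasitic_power_kw
    if component_power_kw <= 0:
        raise ValueError("parasitic losses must be smaller than total power")
    return total_power_kw / component_power_kw


# Hypothetical compute subsystem under two operating conditions.
print(f"loaded: {itue(1000.0, 200.0):.3f}")
print(f"idle:   {itue(300.0, 90.0):.3f}")
```

Reporting the value separately per subsystem and per operating condition, as the bullet requests, shows where the extrema occur.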
Table 3 Crossroads Facility Requirements
Location | Los Alamos National Laboratory, Los Alamos, NM. The system will be housed in the Strategic Computing Complex (SCC), Building 2327
Altitude | 7,500 feet
Seismic | N/A
Water Cooling | The system should operate in conformance with ASHRAE Class W2 guidelines (dated 2011). The facility will provide operating water temperature of 75°F, at up to 35 PSI differential pressure at the system cabinets. However, Offeror should note if the system is capable of operating at higher temperatures. Note: LANL facility will provide inlet water at a nominal 75°F, per system design. Total flow requirements may not exceed 9600 GPM.
Water Chemistry | The system must operate with facility water meeting basic ASHRAE water chemistry. Special chemistry water is not available in the main building loop and would require a separate tertiary loop provided with the system. If tertiary loops are included in the system, the Offeror shall describe their operation and maintenance, including coolant chemistry, pressures, and flow controls. All coolant loops within the system should have reliable leak detection, temperature, and flow alarms, with automatic protection and notification mechanisms.
Air Cooling | The system must operate with supply air at 75°F-60°F, with a relative humidity from 30%-70%. The rate of airflow is between 800-1500 CFM/floor tile. No more than 3 MW of heat should be removed by air cooling.
Maximum Power Rate of Change | The hourly average in system power should not exceed the 2 MW wide power band negotiated at least 2 hours in advance.
Power Quality | The system should be resilient to incoming power fluctuations at least to the level guaranteed by the ITIC power quality curve.
Floor | 42” raised floor
Ceiling | 16-foot ceiling and 16-foot ceiling plenum
Maximum Footprint | 8000 square feet; 80 feet long and 100 feet deep.
Shipment Dimensions and Weight | No restrictions.
Floor Loading | The average floor loading over the effective area should be no more than 300 pounds per square foot. The effective area is the actual loading area plus at most a foot of surrounding fully unloaded area. A maximum limit of 300 pounds per square foot also applies to all loads during installation. The Offeror shall describe how the weight will be distributed over the footprint of the rack (point loads, line loads, or evenly distributed over the entire footprint). A point load applied on a one square inch area should not exceed 1500 pounds. A dynamic load using a CISCA Wheel 1 size should not exceed 1250 pounds (CISCA Wheel 2: 1000 pounds).
Cabling | All power cabling and water connections should be below the access floor. It is preferable that all other cabling (e.g., system interconnect) is above floor and integrated into the system cabinetry. Under floor cables (if unavoidable) should be plenum rated and comply with NEC 300.22 and NEC 645.5. All communications cables, wherever installed, should be source/destination labeled at both ends. All communications cables and fibers over 10 meters in length and installed under the floor should also have a unique serial number and dB loss data document (or equivalent) delivered at time of installation for each cable, if a method of measurement exists for cable type.
External network interfaces supported by the site for connectivity requirements specified below | 1Gb, 10Gb, 40Gb, 100Gb, IB EDR, IB HDR. The network infrastructure is continuously upgraded moving to the latest Ethernet and IB capabilities.
External bandwidth on/off the system for general TCP/IP connectivity | Minimum of 100 GB/s per direction with a preference for 300 GB/s per direction. Describe how 100 GB/s per direction could be expanded to 300 GB/s per direction.
External bandwidth on/off the system for accessing the system’s PFS | Minimum of 100 GB/s with a preference for 300 GB/s. Describe how 100 GB/s could be expanded to 300 GB/s.
External bandwidth on/off the system for accessing external, site supplied file systems (e.g., GPFS, NFS) | Minimum of 100 GB/s with a preference for 300 GB/s. Describe how 100 GB/s could be expanded to 300 GB/s.
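The Floor Loading row of Table 3 can be checked for a single rack with the calculation sketched below. It assumes a rectangular footprint and that the full one-foot margin on every side is unloaded, which is the most favorable case the row allows; the rack dimensions and weight are hypothetical.

```python
def average_floor_loading(weight_lb: float, width_ft: float, depth_ft: float,
                          margin_ft: float = 1.0) -> float:
    """Average loading (pounds per square foot) over the effective area:
    the rack footprint plus an unloaded margin of at most one foot per side."""
    if not 0.0 <= margin_ft <= 1.0:
        raise ValueError("margin_ft must be between 0 and 1 foot")
    effective_area = (width_ft + 2 * margin_ft) * (depth_ft + 2 * margin_ft)
    return weight_lb / effective_area


def within_floor_limit(weight_lb: float, width_ft: float, depth_ft: float,
                       margin_ft: float = 1.0, limit_psf: float = 300.0) -> bool:
    """True when the average loading does not exceed the limit."""
    return average_floor_loading(weight_lb, width_ft, depth_ft, margin_ft) <= limit_psf


# Hypothetical rack: 4000 lb on a 2 ft x 4 ft footprint.
print(f"{average_floor_loading(4000, 2, 4):.1f} psf with full margin")
print(f"{average_floor_loading(4000, 2, 4, margin_ft=0.0):.1f} psf with no margin")
```

Racks placed side by side share or lose the surrounding margin, so the no-margin figure is the relevant bound for tightly packed rows.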
4 Options
The ACES team expects to have future requirements for system upgrades and/or additional quantities of components based on the configurations proposed in response to this solicitation. The Offeror should address any technical challenges foreseen with respect to scaling and any other production issues. Proposals should be as detailed as possible.
4.1 Upgrades, Expansions and Additions
The Offeror shall propose the following separately priced options using whatever is the natural unit for the proposed architecture design as determined by the Offeror. For example, for system size, the unit may be number of racks, number of blades, number of nodes or some other unit appropriate for the system architecture. If the proposed design has no option to scale one or more of these features, the Offeror should simply state this in the proposal response.
4.1.1 The Offeror shall describe and separately price options for scaling up the overall Crossroads system. These options may be larger than the smallest compute partition. Any of these options may be exercised multiple times.
4.1.1.1 The Offeror shall propose a configuration or configurations which increase the baseline memory capacity in steps, e.g., by adding 25%, 50%, 100%, and 200%.
4.1.1.2 The Offeror shall propose and separately price upgrades or expansions for scaling the capacity and performance of the Crossroads I/O subsystem such that it can retain all application input, output, and working data for 24 and 36 weeks. Refer to the section that indicates that the minimum amount of such data is 12% of baseline system memory per day. If the Offeror’s I/O subsystem consists of multiple storage tiers, the Offeror shall describe and separately price options for scaling each storage tier separately.
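The capacity implied by the 24-week and 36-week retention periods follows from the 12% per day figure. The sketch below assumes data accumulates at exactly the minimum rate every day with nothing deleted, and uses the 0.5 PiB baseline memory target from Table 2 as an illustrative input; a proposed system's actual memory size would replace it.

```python
def retention_capacity_pib(baseline_memory_pib: float, weeks: int,
                           daily_fraction: float = 0.12) -> float:
    """Storage needed to retain all data for the given number of weeks
    when daily_fraction of baseline memory is produced each day."""
    days = weeks * 7
    return baseline_memory_pib * daily_fraction * days


# Illustrative baseline memory of 0.5 PiB (the Table 2 lower bound).
for weeks in (24, 36):
    print(f"{weeks} weeks: {retention_capacity_pib(0.5, weeks):.2f} PiB")
```

Expressed as a multiple of baseline memory, 24 weeks is about 20 times and 36 weeks about 30 times, compared with the platform storage target of more than 10 times baseline memory.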
4.1.1.3 The Offeror shall propose and separately price any options for upgrading the Offeror’s proposed technology of the Crossroads system over its five-year lifetime.
4.1.2 The Offeror shall also provide separately priced options for systems which the CONTRACTOR may procure in addition to the Crossroads system that provide approximately 10%, 25%, 50% and 200% of the Offeror’s proposed capability of the Crossroads system. The options proposed for the expansion of the Crossroads system under 4.1.1, above, shall also apply to any additional system purchased at a cost proportional to the system capability compared to Crossroads.
4.2 Early Access Development System
To allow for early and/or accelerated development of applications or development of functionality required as a part of the statement of work, the Offeror shall propose options for early access development systems. These systems can be in support of the baseline requirements or any proposed options.
4.2.1 The Offeror shall propose an Early Access Development System. The primary purpose is to expose the application to the same programming environment as will be found on the final system. It is acceptable for the early access system not to use the final processor, node, or high-speed interconnect architectures. However, the programming and runtime environment must be sufficiently similar that a port to the final system is trivial. The early access system shall contain similar functionality of the final system, including file systems, but scaled down to the appropriate configuration. The Offeror shall propose an option for the following configurations based on the size of the final Crossroads system.
§ 2% of the compute partition.
§ 5% of the compute partition.
§ 10% of the compute partition.
4.2.2 The Offeror shall propose development testbed systems that will reduce risk and aid the development of any advanced functionality that is exercised as a part of the statement of work.
4.3 Test Systems
The Offeror shall propose the following test systems. The systems shall contain all the functionality of the main system, including file systems, but scaled down to the appropriate configuration. Multiple test systems may be awarded.
4.3.1 The Offeror shall propose an Application Regression test system, which should contain at least 200 compute nodes.
4.3.2 The Offeror shall propose a System Development test system, which should contain at least 50 compute nodes.
4.4 On Site System and Application Software Analysts
4.4.1 The Offeror shall propose and separately price two (2) System Software Analysts and two (2) Applications Software Analysts for each site. Offerors shall presume each analyst will be utilized for four (4) years. For Crossroads, these positions require a DOE Q-clearance for access.
4.5 Deinstallation
The Offeror shall propose to deinstall, remove and/or recycle the system and supporting infrastructure at end of life. Storage media shall be destroyed to the satisfaction of ACES, and/or returned to ACES at its request.
4.6 Maintenance and Support
The Offeror shall propose and separately price maintenance and support with the following features:
4.6.1 Maintenance and Support Period
The Offeror shall propose all maintenance and support for a period of four (4) years from the date of acceptance of the system. Warranty shall be included in the 4 years. For example, if the system is accepted on April 1, 2021 and the Warranty is for one year, then the Warranty ends on March 30, 2022, and the maintenance period begins April 1, 2022 and ends on March 30, 2025. Offeror shall also propose additional maintenance and support extension for years 5-7.
4.6.2 Maintenance and Support Solutions
The Offeror shall propose the following maintenance and support solutions and propose pricing separately for each solution. ACES may purchase either one of the solutions or neither of the solutions, at its discretion. Different maintenance solutions may be selected for the various test systems and final system.
4.6.2.1 Solution 1 – 7x24
The Offeror shall price Solution 1 as full hardware and software support for all Offeror provided hardware components and software. The principal period of maintenance (PPM) shall be for 24 hours by 7 days a week with a four-hour response to any request for service. Hardware service requests require the Offeror to be on-site within four (4) hours of the request.
4.6.2.2 Solution 2 – 5x9
The Offeror shall price Solution 2 as full hardware and software support for all Offeror provided hardware components and software. The principal period of maintenance (PPM) shall be 9 hours by 5 days a week (exclusive of holidays observed by ACES). The Successful Offeror shall provide hardware maintenance training for ACES staff so that staff are able to provide hardware support for all other times the Offeror is unable to provide hardware repair in a timely manner outside of the PPM. The Successful Offeror shall supply hardware maintenance procedural documentation, training, and manuals necessary to support this effort.
All proposed maintenance and support solutions shall include the following features and meet all requirements of this section.
4.6.3 General Service Provisions
The Successful Offeror shall be responsible, at its own expense, for the repair or replacement of any failing hardware component that it supplies and correction of defects in software that it provides as part of the system. At its sole discretion, ACES may request advance replacement of components which show a pattern of failures which reasonably indicates that future failures may occur in excess of reliability targets, or for which there is a systemic problem that prevents effective use of the system.
Hardware failures due to environmental changes in facility power and cooling systems which can be reasonably anticipated (such as brown-outs, voltage-spikes or cooling system failures) are the responsibility of the Offeror.
4.6.4 Software and Firmware Update Service
The Successful Offeror shall provide an update service for all software and firmware provided for the duration of the Warranty plus Maintenance period. This shall include new releases of software/firmware and software/firmware patches as required for normal use. The Successful Offeror shall integrate software fixes, revisions or upgraded versions in
supplied software, including community software (e.g. Linux or Lustre), and make them available to ACES within twelve (12) months of their general availability. The Successful Offeror shall provide prompt availability of patches for cybersecurity defects.
4.6.5 Call Service
The Successful Offeror shall provide contact information for technical personnel with knowledge of the proposed equipment and software. These personnel shall be available for consultation by telephone and electronic mail with ACES personnel. In the case of degraded performance, the Successful Offeror’s services shall be made readily available to develop strategies for improving performance, i.e. patches, workarounds.
4.6.6 On-site Parts Cache
The Successful Offeror shall maintain a parts cache on-site at the ACES facilities. The parts cache shall be sized and provisioned sufficiently to support all normal repair actions for two weeks without the need for parts refresh. The initial sizing and provisioning of the cache shall be based on Offeror’s Mean Time Between Failure (MTBF) estimates for each FRU and each rack, scaled based on the number of FRUs and racks delivered. The parts cache configuration will be periodically reviewed for quantities needed to satisfy this requirement, and adjusted if necessary, based on observed FRU or node failure rates. The parts cache will be resized, at the Offeror’s expense, should the on-site parts cache prove to be insufficient to sustain the actually observed FRU or node failure rates.
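One way to read the MTBF-based sizing rule in 4.6.6 is sketched below. It assumes a constant failure rate per unit, so that the expected number of failures in the two-week window is the unit count multiplied by the window length in hours and divided by the MTBF. The function names, the safety `margin` parameter, and the inventory figures are all hypothetical. The document does not prescribe a formula.

```python
import math

HOURS_PER_WEEK = 7 * 24  # 168 hours


def expected_failures(unit_count: int, mtbf_hours: float, weeks: float = 2.0) -> float:
    """Expected number of failures across unit_count units over the window.

    Assumes a constant failure rate of 1 / mtbf_hours per unit.
    """
    return unit_count * weeks * HOURS_PER_WEEK / mtbf_hours


def cache_quantity(unit_count: int, mtbf_hours: float, weeks: float = 2.0,
                   margin: float = 1.0) -> int:
    """Spares to stock: expected failures times a safety margin, rounded up."""
    return math.ceil(expected_failures(unit_count, mtbf_hours, weeks) * margin)


if __name__ == "__main__":
    # Hypothetical FRU inventory: name -> (units delivered, MTBF estimate in hours)
    inventory = {
        "memory module": (10_000, 2_000_000),
        "power supply": (500, 400_000),
    }
    for fru, (count, mtbf) in inventory.items():
        print(f"{fru}: stock {cache_quantity(count, mtbf)}")
```

The periodic review described in 4.6.6 could reuse the same calculation with an MTBF derived from observed failures in place of the initial estimate.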
4.6.7 On-Site Node Cache
The Successful Offeror shall also maintain an on-site spare node inventory of at least 1% of the total nodes in all of the system. These nodes shall be maintained and tested for hardware integrity and functionality utilizing the Hardware Support Cluster defined below if provided.
4.6.8 Hardware Support Cluster
The Successful Offeror shall provide a Hardware Support Cluster (HSC). The HSC shall support the hot spare nodes and provide functions such as hardware burn-in, problem diagnosis, etc. The Successful Offeror shall supply sufficient racks, interconnect, networking, storage equipment and any associated hardware/software necessary to make the HSC a stand-alone system capable of running diagnostics on individual or clusters of HSC nodes. ACES will store and inventory the HSC and other on-site parts cache components.
4.6.9 DOE Q-Cleared Technical Service Personnel
The Crossroads system will be installed in security areas that require a DOE Q-clearance for access. It will be possible to install the system with the assistance of uncleared US citizens or L-cleared personnel, but the Successful Offeror shall arrange and pay for appropriate 3rd party security escorts. The Successful Offeror shall obtain necessary clearances for on-site support staff to perform their duties.
5 Delivery and Acceptance
Testing of the system shall proceed in three steps: pre-delivery, post-delivery, and acceptance. Each step is intended to validate the system and feeds into subsequent activities. Sample Acceptance Test plans (Appendix A) are provided as part of the Request for Proposal.
5.1 Pre-delivery Testing
The ACES team and the Successful Offeror shall perform pre-delivery testing at the factory on the hardware to be delivered. Any limitations for performing the pre-delivery testing should be identified in the Offeror’s proposal, including scale and licensing limitations (if any). During pre-delivery testing, the Successful Offeror shall:
§ Demonstrate RAS capabilities and robustness using simple fault injection techniques, such as disconnecting cables, powering down subsystems, or installing known bad parts.
§ Demonstrate functional capabilities on each segment of the system built, including the capacity to build applications, schedule jobs, and run them using a customer-provided testing framework. The root cause of application failure must be identified prior to system shipping.
§ Provide a file system sufficiently provisioned to support the suite of tests.
§ Provide onsite and remote access to the ACES team to monitor testing and analyze results.
§ Instill confidence in the ability to conform to the statement of work.
5.2 Site Integration and Post-delivery Testing
The ACES team and the Successful Offeror staff shall perform site integration and post-delivery testing on the fully delivered system. Limitations and/or special requirements may exist for access to the onsite system by the Offeror.
§ During post-delivery testing, the pre-delivery tests shall be run on the full system installation.
§ Where applicable, tests shall be run at full scale.
5.3 Acceptance Testing
The ACES team and the Successful Offeror staff shall perform onsite acceptance testing on the fully installed system. Limitations and/or special requirements may exist for access to the onsite system by the Offeror.
5.3.1 The Successful Offeror shall demonstrate that the delivered system conforms to the subcontract’s Statement of Work.
6 Risk and Project Management
The Offeror shall propose a risk management strategy and project management plan for the Crossroads system that is closely coordinated between the subcontracts for Triad National Security, LLC.
6.1.1 The Offeror shall propose a risk management strategy for the system in the event of technology problems or scheduling delays that affect delivery of the system or achievement of performance targets in the proposed timeframe. Offeror shall describe the impact of substitute technologies (if any) on the overall architecture and performance of the system, in particular addressing the four technology areas listed below:
§ Processor
§ Memory
§ High-speed interconnect
§ Platform storage
6.1.2 The Offeror shall identify any other high-risk areas and accompanying mitigation strategies for the system.
6.1.3 The Offeror shall provide a clear plan for effectively responding to software and hardware defects and system outages at each severity level and document how problems or defects will be escalated.
6.1.4 The Offeror shall propose a roadmap showing how their response to this Request for Proposal aligns with their plans for Exascale computing.
6.1.5 The Offeror shall identify additional capabilities, including:
§ Its ability to produce and maintain the system for the life of the system
§ Its ability to achieve specific quality assurance, reliability, availability and serviceability goals
§ Its in-house testing and problem diagnosis capability, including hardware resources at appropriate scale
6.1.6 The Offeror shall provide project management specifics for the ACES team detailed as part of the Request for Proposal document. Please see Appendix B for further information.
7 Documentation and Training
The Successful Offeror shall provide documentation and training to effectively operate, configure, maintain, and use the systems to the ACES team and users of the Crossroads system. The ACES team may, at their option, make audio and video recordings of presentations from the Successful Offeror’s speakers at public events targeted at the NNSA user communities (e.g., user training events, collaborative application events, best practices discussions, etc.). The Successful Offeror shall grant the ACES team user and distribution rights of documentation provided by the Offeror, session materials, and recorded media to be shared with other DOE Labs’ staff and all authorized users and support staff for Crossroads.
7.1 Documentation
7.1.1 The Successful Offeror shall provide documentation for each delivered system describing the configuration, interconnect topology, labeling schema, hardware layout, etc. of the system as deployed before the commencement of system acceptance testing.
7.1.2 The Successful Offeror shall supply and support system and user-level documentation for all components before the delivery of the system. Upon request by the laboratories, the Successful Offeror shall supply additional documentation necessary for operation and maintenance of the system. All user-level documentation shall be publicly available.
7.1.3 The Successful Offeror shall distribute and update all documentation electronically and in a timely manner. For example, changes to the system shall be accompanied by relevant documentation. Documentation of changes and fixes may be distributed electronically in the form of release notes. Reference manuals may be updated later, but effort should be made to keep all documentation current.
7.2 Training
7.2.1 The Successful Offeror shall provide the following types of training at facilities specified by ACES:
Class Type                                       Number of Classes
System Operations and Advanced Administration    2
User Programming                                 3
7.2.2 The Offeror shall describe all proposed training and documentation relevant to the proposed solutions utilizing the following methods:
§ Classroom training
§ Onsite training
§ Online documentation
§ Online training
8 References
ACES schedule and high-level information can be found at the primary Crossroads website (http://crossroads.lanl.gov/).
Crossroads benchmarks and workflows whitepaper can be found at the Crossroads Benchmark and Workflows website.
High Performance Computing Power Application Programming Interface Specification website (http://powerapi.sandia.gov/).
Appendix A: Sample Acceptance Plan
Appendix A-1: ACES Sample Acceptance Plan
Testing of the system shall proceed in three steps: pre-delivery, post-delivery and acceptance. Each step is intended to validate the system and feeds into subsequent activities.
Pre-delivery (Factory) Test
The Subcontractor shall demonstrate all hardware is fully functional prior to shipping. If the system is to be delivered in separate shipments, each shipment shall undergo pre-delivery testing. If the Subcontractor proposes a development system subcomponent, Triad recognizes that the development system is not part of the pre-delivery acceptance criteria.
ACES and Subcontractor staff shall perform pre-delivery testing at the factory on the hardware to be delivered. Any limitations for performing the pre-delivery testing need to be identified including scale and licensing limitations.
• Demonstrate RAS capabilities and robustness, using simple fault injection techniques such as disconnecting cables, powering down subsystems, or installing known bad parts.
• Demonstrate functional capabilities on each segment of the system built, including the capability to build applications, schedule jobs, and run them using the customer-provided testing framework. The root cause of any application failure must be identified.
• The Offeror shall provide a file system sufficiently provisioned to support the suite of tests.
• Provide onsite and remote access for ACES staff to monitor testing and analyze results.
• Instill confidence in the ability to conform to the statement of work.
Pre-Delivery Assembly
• The Subcontractor shall perform the pre-delivery test of Crossroads or agreed-upon sub-configurations of Crossroads at the Subcontractor’s location prior to shipment. At its option, ACES may send a representative(s) to observe testing at the Subcontractor’s facility. Work to be performed by the Subcontractor includes:
o All hardware installation and assembly
o Burn in of all components
o Installation of software
o Implementation of the ACES-specific production system-configuration and programming environment
o Perform tests and benchmarks to validate functionality, performance, reliability, and quality
• Run benchmarks and demonstrate that benchmarks meet performance commitments.
Pre-Delivery Configuration
• TBD
Pre-Delivery Test
Subcontractor shall provide ACES on-site access to the system in order to verify that the system demonstrates the ability to pass acceptance criteria.
The pre-delivery test shall consist of (but is not limited to) the following tests:
Name of Test             Pass Criteria
System power up          All nodes boot successfully
System power down        All nodes shut down
Unix commands            All UNIX/Linux and vendor specific commands function correctly
Monitoring               Monitoring software shows status for all nodes
Reset                    “Reset” functions on all nodes
Power On/Off             Power cycle all components of the entire system from the console
Fail Over/Resilience     Demonstrate proper operation of all fail-over or resilience mechanisms
Full Configuration Test  Pre-delivery system can efficiently run applications that use the entire compute resource of the pre-delivery system. The applications to be run will be drawn from the 72-hour test runs, scaled to the pre-delivery configuration
Benchmarks               Benchmarks shall achieve performance within the limits of pre-delivery configuration
72 Hour test             100% availability of the pre-delivery system for a 72-hour test period while running an agreed-upon workload that exercises at least 99% of the compute resources
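The pass criteria for the 72-hour test in the table above can be expressed as a simple check. The sketch below is illustrative only. It assumes that "compute resources" are counted in compute nodes and that any recorded downtime breaks the 100% availability requirement. The `RunWindow` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunWindow:
    """Hypothetical summary of one pre-delivery test run."""
    duration_hours: float   # length of the test period
    downtime_hours: float   # total time the system was unavailable
    nodes_total: int        # compute nodes in the pre-delivery system
    nodes_exercised: int    # compute nodes used by the agreed-upon workload


def passes_72_hour_test(run: RunWindow, required_hours: float = 72.0,
                        min_coverage: float = 0.99) -> bool:
    """True if the run meets the 72-hour availability and coverage criteria."""
    long_enough = run.duration_hours >= required_hours
    fully_available = run.downtime_hours == 0  # 100% availability
    coverage = run.nodes_exercised / run.nodes_total
    return long_enough and fully_available and coverage >= min_coverage


if __name__ == "__main__":
    run = RunWindow(duration_hours=72, downtime_hours=0,
                    nodes_total=1000, nodes_exercised=995)
    print("pass" if passes_72_hour_test(run) else "fail")
```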
Post-delivery Integration and Test
Post-delivery Integration
During Post-Delivery Integration, the Subcontractor’s system(s) shall be delivered, installed, fully integrated, and shall undergo Subcontractor stabilization processes. Post-delivery testing shall include replication of all of the pre-delivery testing steps, along with appropriate tests at scale, on the fully integrated platform. Where applicable, tests shall be run at full scale.
Site Integration
When the Subcontractor has declared the system to be stable, the Subcontractor shall make the system available to ACES personnel for site-specific integration and customization. Once the Subcontractor’s system has undergone site-specific integration and customization, the acceptance test shall commence.
Acceptance Test
The Acceptance Test Period shall commence when the system has been delivered and physically installed, has undergone stabilization, and has completed site-specific integration and customization. The duration of the Acceptance Test period is defined in the Statement of Work. All tests shall be performed on the initial production configuration as defined by ACES. The Subcontractor shall supply source code used, compile scripts, output, and verification files for all tests run by the Subcontractor. All such provided materials become the property of Triad. All tests shall be performed on the initial production configuration of the Crossroads system as it will be deployed to the ACES user community. ACES may run all or any portion of these tests at any time on the system to ensure the Subcontractor’s compliance with the requirements set forth in this document. The acceptance test shall consist of a Functionality Demonstration, a System Boot Test, a System Resilience Test, a Performance Test, and an Availability Test, performed in that order.
Functionality Demonstration
Subcontractor and ACES will perform the Functionality Demonstration on a dedicated system. The Functionality Demonstration shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:
• Remote monitoring, power control and boot capability
• Network connectivity
• File system functionality
• Batch system
• System management software
• Program building and debugging (e.g. compilers, linkers, libraries, etc.)
• Unix functions
System Boot Test
Subcontractor and ACES will perform the System Boot Test on a dedicated system. The System Boot Test shall show that the system is configured and functions in accordance with the statement of work. Demonstrations shall include, but are not limited to, the following:
• Two successful system cold boots to production state, with no intervention to bring the system up. Production state is defined as running all system services required for production use and being able to compile and run parallel jobs on the full system. In a cold boot, all elements of the system (compute, login, I/O) are completely powered off before the boot sequence is initiated. All components are then powered on.
• Single node power-fail/reset test: Failure or reset of a single compute node shall not cause system-wide failure.
System Resilience Test
Subcontractor and ACES will perform the System Resilience Test on a dedicated system. The System Resilience Test shall show that the system is configured and functions in accordance with the statement of work. All system resilience features of Crossroads shall be demonstrated via fault-injection tests when running test applications at scale. Fault injection operations should include both graceful and hard shutdowns of components. The metrics for resilience operations include correct operation, any loss of access or data, and time to complete the initial
recovery plus any time required to restore (fail-back) a normal operating mode for the failed components.
Performance Test
Crossroads system performance and benchmark tests are fully documented in the Statement of Work along with guidance and test information found at the Crossroads website. The Subcontractor shall run the Crossroads tests and application benchmarks, full configuration test, external network test and file system metadata test as described in the Application and Benchmark Run Rules document. Benchmark answers must be correct, and each benchmark result must meet or exceed performance commitments in the performance requirements section. Benchmarks must be run using the supplied resource management and scheduling software. Except as required by the run rules, benchmarks need not be run concurrently. If requested by Triad, Subcontractor shall reconfigure the resource management software to utilize only a subset of compute nodes, specified by Triad.
JMTTI and System Availability Testing
The JMTTI and System Availability Test will commence after successful completion of the Functionality Demonstration, System Test and Performance Test. ACES will perform the JMTTI and Availability Test. The Crossroads system must demonstrate the JMTTI and availability metrics defined in the Statement of Work, within an agreed-upon period of time. An automated job launch and outcome analysis tool, such as the Pavilion HPC Testing Framework, shall be used to manage an agreed-upon workload that will be used to measure the reliability of individual jobs. These jobs shall be a mixture of benchmarks from the Performance Test and other applications. Every test in the JMTTI and System Availability Test workload shall obtain a correct result in both dedicated and non-dedicated modes:
• In dedicated mode, each benchmark in the Performance Test shall meet the performance commitment specified in the Statement of Work. In non-dedicated mode, the mean performance of each performance test shall meet or exceed the performance commitment specified in the Statement of Work.
• During the JMTTI and System Availability Test, ACES shall have full access to the system and shall monitor the system. Triad and users designated by Triad shall submit jobs through the Crossroads resource management system.
• During the JMTTI and System Availability Test, the Subcontractor shall adhere to the following requirements:
o All hardware and software shall be fully functional at the end of the JMTTI and Availability Test. Any down time required to repair failed hardware or software shall be considered an outage unless it can be repaired without impacting system availability.
o Hardware and software upgrades shall not be permitted during the last 7 days of the JMTTI and Availability Test. The system shall be considered down for the time required to perform any upgrades, including rolling upgrades.
o No significant (i.e. levels 1, 2 or 3) problems shall be open during the last 7 days.
• During the JMTTI and Availability Testing period, if any system software upgrade or significant hardware repairs are applied, the Subcontractor shall be required to run the Performance Tests and demonstrate that the changes incur no loss of performance. At its option, Triad may also run any test deemed necessary. Time taken to run the Performance and other tests shall not count as downtime, provided that all tests perform to specifications.
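The difference between the dedicated and non-dedicated performance criteria stated in the list above can be sketched as follows. The sketch assumes each run of a benchmark yields a single figure of merit where higher is better, and that the commitment is a single threshold in the same unit. The function names and sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def meets_dedicated(results: Sequence[float], commitment: float) -> bool:
    """Dedicated mode: every individual run must meet the commitment."""
    return all(r >= commitment for r in results)


def meets_non_dedicated(results: Sequence[float], commitment: float) -> bool:
    """Non-dedicated mode: only the mean over the runs must meet the commitment."""
    return mean(results) >= commitment


if __name__ == "__main__":
    commitment = 1.0          # hypothetical committed figure of merit
    runs = [0.9, 1.2]         # hypothetical results for one benchmark
    print("dedicated:", meets_dedicated(runs, commitment))
    print("non-dedicated:", meets_non_dedicated(runs, commitment))
```

In this reading, a single slow run fails the dedicated criterion but can be offset by faster runs under the non-dedicated criterion, which tolerates interference from other jobs.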
Definitions for Node and System Failures
The baseline of interrupts, as used in the JMTTI and SMTBI calculations, shall include, but may not be limited to, the following circumstances:
• A node shall be defined as down if a hardware problem causes Subcontractor supplied software to crash or the node is unavailable. Failures that are transparent to Subcontractor-supplied software because of redundant hardware shall not be classified as a node being down as long as the failure does not impact node or system performance. Low severity software bugs and suggestions (e.g. wrong error message) associated with Subcontractor supplied software will not be classified as a node being down.
• A node shall be classified as down if a defect in the Subcontractor supplied software causes a node to be unavailable. Communication network failures external to the system, and user application program bugs that do not impact other users shall not constitute a node being down.
• Repeat failures within eight hours of the previous failure shall be counted as one continuous failure.
• The Subcontractor's system shall be classified as down (and all nodes shall be considered down) if any of the following requirements cannot be met (“system-wide failures”):
o Complete a POSIX `stat' operation on any file within all Subcontractor-provided file systems and access all data blocks associated with these files.
o Complete a successful interactive login to the Subcontractor's system. Failures in the ACES network do not constitute a system-wide failure.
o Successfully run any part of the performance test. The Performance Test consists of the Crossroads Benchmarks, the Full Configuration Test and the External Network Test.
o Full switch bandwidth is available. Failure of a switch adapter in a node does not constitute a system-wide failure. However, failure of a switch would constitute failure, even if alternate switch paths were available, because full bandwidth would not be available for multiple nodes.
o User applications can be launched and/or completed via the scheduler.
• Other failures in Subcontractor supplied products and services that disrupt work on a significant portion of the nodes shall constitute a system-wide outage.
• If there is a system-wide outage, Triad shall turn over the system to the Subcontractor for service when the Subcontractor indicates they are ready to begin work on the system. All nodes are considered down during a system-wide outage.
• Downtime for any outage shall begin when Triad notifies the Subcontractor of a problem (e.g. an official problem report is opened) and, for system outages, when the system is made available to the Subcontractor. Downtime shall end when:
o For problems that can be addressed by bringing up a spare node or by rebooting the down node, the downtime shall end when a spare node or the down node is available for production use.
o For problems requiring the Subcontractor to repair a failed hardware component, the downtime shall end when the failed component is returned to Triad and available for production use.
For software downtime, the downtime shall end when the Subcontractor supplies a fix that rectifies the problem or when Triad reverts to a prior copy of the failing software that does not exhibit the same problem. A failure due
to ACES or to other causes out of the Subcontractor's control shall not be counted against the Subcontractor unless the failure demonstrates a defect in the system. If there are any disagreements as to whether a failure is the fault of the Subcontractor or ACES, they shall be resolved prior to the end of the acceptance period.
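The rule above that repeat failures within eight hours count as one continuous failure can be sketched as an interval merge. This is one possible reading, not a definition from this document. It assumes the eight hours are measured from the end of the previous failure to the start of the next, and that a merged failure is charged downtime from the first start to the last end. The function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (downtime start, downtime end)

REPEAT_WINDOW = timedelta(hours=8)


def merge_repeat_failures(failures: List[Interval],
                          window: timedelta = REPEAT_WINDOW) -> List[Interval]:
    """Merge failures that begin within `window` of the previous failure's end."""
    merged: List[Interval] = []
    for start, end in sorted(failures):
        if merged and start - merged[-1][1] <= window:
            prev_start, prev_end = merged[-1]
            # Treat both outages as one continuous failure.
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def total_downtime(failures: List[Interval],
                   window: timedelta = REPEAT_WINDOW) -> timedelta:
    """Sum the durations of the merged failures."""
    return sum((end - start for start, end in merge_repeat_failures(failures, window)),
               timedelta())


if __name__ == "__main__":
    day = datetime(2021, 4, 1)
    log = [
        (day, day + timedelta(hours=2)),                        # first failure
        (day + timedelta(hours=6), day + timedelta(hours=7)),   # repeat, 4 h later
        (day + timedelta(hours=36), day + timedelta(hours=37)), # unrelated failure
    ]
    print(len(merge_repeat_failures(log)), total_downtime(log))
```

Under the stated assumptions the first two entries collapse into a single seven-hour failure, so the gap between them is also counted as downtime. A different reading of "continuous" would change that total.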
Appendix B: Triad Specific Project Management Requirements
Appendix B-1: Triad Project Management Requirements
NOTE: The following requirements apply to the project management of the delivery of the system proposed by the subcontractor.
Project Management
The development, pre-shipment testing, installation and acceptance testing of the Crossroads system is a complex endeavor and will require close cooperation between the Subcontractor, Triad National Security, LLC (Triad, TNS), and ACES. There shall be quarterly executive reviews by corporate officers of the Subcontractor, ACES, and representatives of DOE NNSA, to assess the progress of the project.
Project Planning Workshop
• Triad and Subcontractor shall schedule and complete a workshop to mutually understand and agree upon project management goals, techniques, and processes.
• The workshop shall take place no later than award + 45 days
Project Plan
• Delivery Milestone: no later than award + 60 days
Subcontractor shall provide Triad with a detailed Project Plan, which includes a detailed Work Breakdown Structure (WBS). The Project Plan shall contain all aspects of the proposed Subcontractor’s solution and associated engineering (hardware and software) and support activities.
The Project Plan shall address or include:
• Program Management
• High Assurance Delivery Process
WBS:
o Facilities Planning (e.g., floor, power & cooling, cabling);
o Computer Hardware Planning;
o Installation & Test Planning;
o Deployment and Integration Milestones
o System Stability Planning;
o System Scalability Planning;
o Software Plan
o Testing
o Development
o Testing
o Deployment
o Risk Assessment & Risk Mitigation
o Staffing;
o On-site Warranty and Maintenance and Support Planning;
o Training & Education;
Project Plan – Program Management
At a minimum, the Project Plan – Program Management Section shall:
o Identify, by name, the Program Management Team members;
o Identify, by name, the lead Crossroads System Architect
o Identify, by name, the Crossroads System RAS Point of Contact
o Describe the roles and responsibilities of the Team members;
o List Subcontractor’s Management Contacts;
o Define and institutionalize the Periodic Progress Review process with regard to frequency (daily, weekly, monthly, quarterly, and annually), level (support, technical, and executive), and escalation procedures.
• Additionally, the Project Plan – Program Management Section shall detail the joint activities of the Subcontractor and Triad to monitor and assess the overall Program Performance.
• Triad will furnish the Subcontractor with a top-10 list of problems and issues. The Subcontractor is responsible for appointing a point of contact for each of the items on the list. This list shall be reviewed weekly.
• All Subcontractor Program Management shall interface with the designated Triad Crossroads project manager.
• The WBS will be updated by the Subcontractor monthly and reviewed for approval by Triad
• The Subcontractor Project Plan shall be updated by the Subcontractor quarterly and reviewed for approval by Triad
Project Plan - High Assurance Hardware Delivery Process
Subcontractor shall provide Triad with a high assurance delivery process and certification program for hardware deliverables of all stages of the deployment and operational use by the ASC Applications Community of the systems. All assets delivered shall be, at a minimum, factory-tested and field-certified. A “pre-delivery test” shall take place at the factory prior to each shipment. Functional diagnostics and agreed upon ACES applications shall be executed to verify the proper functioning of each system prior to shipment. Problems identified as a result of these tests shall be corrected prior to shipment. Assets that have successfully completed this pre-delivery test are “pre-verified.”
Project Plan - High Assurance Software Delivery Process
Subcontractor shall provide Triad with a high assurance delivery process and certification program for software deliverables of all stages of the deployment and operational use by the NNSA ASC tri-lab simulation community of the Crossroads systems. In addition, Subcontractor shall provide Triad with documentation of Subcontractor’s anticipated software release schedules during the lifetime of the subcontract. This includes major and minor releases, updates, and fixes as well as expected beta-level availability.
• While Beta software and/or pre-GA software is anticipated to be installed and run on these systems, all such installations are subject to Triad approval;
• Subcontractor shall provide Triad with a list of interdependencies between hardware and software as they pertain to the delivered systems;
Project Plan – WBS, Milestones
Subcontractor shall define appropriate high-level Milestones for the execution of the delivery and acceptance of the Crossroads system.
Project Plan – WBS, Facilities Planning
Compliant with the requirements of the Facilities described in the Technical Requirements.
Project Plan – WBS, System Stability Planning
Scalable systems of the size being delivered can at times prove difficult to predict in terms of stability. The number of components can have a significant effect on the stability and may provide some scalability problems in terms of stability of the system. Triad requires a plan to progressively qualify a series of configurations of increasing complexity, in terms of both processor counts and interconnect topology.
Subcontractor shall be responsible for delivering a Stabilization Plan that includes the following:
• Plan objectives
• Target Goals for Stability, as agreed to jointly with Triad and ACES
• Technical Strategy
• Roles and responsibilities
• Testing Plan
• Progress Evaluation Checkpoints
• Contingencies
Project Plan – Staffing:
• Staff Support shall be for the life of the subcontract.
• Subcontractor shall identify its members of the Project Team.
Project Plan – On-site Warranty and Maintenance and Support Planning
• On-site Warranty and Maintenance and Support shall be for the life of the subcontract.
• On-site Warranty and Maintenance and Support shall include Subcontractor’s preventive maintenance schedule.
• On-site Warranty and Maintenance and Support shall include logging and weekly reporting of all interruptions to service. At a minimum, the Subcontractor shall enter all interrupt logging into the Triad tracking system.
Project Plan – Training and Education
• In addition to Subcontractor’s usual and customary customer Training and Education program, Subcontractor shall allow ACES staff access to Subcontractor’s internal Training & Education program;
• Training and Education Support shall be for the life of the subcontract.
Project Plan – Risk Assessment and Risk Mitigation
• Subcontractor shall provide Triad with a Risk Management Plan that identifies and addresses all identified risks.
• Subcontractor shall provide a risk management strategy for the proposed system in case of technology problems or scheduling delays that affect availability or achievement of performance targets in the proposed timeframe. Subcontractor shall describe the impact of substitute technologies on the overall architecture and performance of the system. In particular, the Subcontractor shall address the technology areas listed below:
o Processor
o Memory
o High-Speed Interconnect
o Platform Storage and all other I/O subsystems
• Subcontractor shall continuously monitor and assess the risks involved for those major technology components that Subcontractor identifies to be on the Critical Path (i.e., Risk Assessment);
• Subcontractor shall provide Triad with timely and regular updates regarding Subcontractor’s Risk Assessment;
• Subcontractor shall provide Triad with a Risk Mitigation Plan. Each risk mitigation strategy shall be subject to Triad approval. Such Risk Mitigation Plan shall include:
o Risks Categorization – Risks shall be categorized according to:
§ Probability of occurrence (low, medium, or high)
§ Impact to the program if they occur (low, medium, or high)
o Dates for Risk Mitigation Decision Points Identified
o Execution of mitigation plans is subject to Triad approval and may include:
§ Technology Substitution – subject to the condition that substituted technologies shall not have aggregate performance, capability, or capacity less than originally proposed;
§ 3rd Party Assistance – especially in areas of critical software development;
§ Source Code Availability – especially in the areas of Operating Systems, Communication Libraries;
§ Performance Compensation – possibility of compensating for performance shortfalls via additional deliveries.
o Subcontractor’s Risk Mitigation Plan will be reviewed quarterly by Triad.
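One possible way to hold the Risk Mitigation Plan items above in a machine-readable register is sketched below in Python. This is only an illustration: the class and field names, the example entries, and the probability-times-impact ordering are hypothetical and are not prescribed by this document, which only requires low/medium/high categorization, decision-point dates, and Triad approval.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum, IntEnum


class Level(IntEnum):
    """Low/medium/high scale used for both probability and impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Mitigation(Enum):
    """Mitigation options named in the Risk Mitigation Plan list."""
    TECHNOLOGY_SUBSTITUTION = "Technology Substitution"
    THIRD_PARTY_ASSISTANCE = "3rd Party Assistance"
    SOURCE_CODE_AVAILABILITY = "Source Code Availability"
    PERFORMANCE_COMPENSATION = "Performance Compensation"


@dataclass
class Risk:
    area: str                  # e.g. Processor, Memory, High-Speed Interconnect
    description: str
    probability: Level
    impact: Level
    decision_date: date        # Risk Mitigation Decision Point
    mitigation: Mitigation
    triad_approved: bool = False  # execution is subject to Triad approval

    def priority(self) -> int:
        # Hypothetical ranking rule (not from the document): probability x impact.
        return int(self.probability) * int(self.impact)


def review_order(risks):
    """Order risks for a quarterly review: highest priority first, then earliest decision date."""
    return sorted(risks, key=lambda r: (-r.priority(), r.decision_date))


# Hypothetical example entries, for illustration only.
risks = [
    Risk("Memory", "Memory technology slips schedule", Level.MEDIUM, Level.HIGH,
         date(2020, 6, 1), Mitigation.TECHNOLOGY_SUBSTITUTION),
    Risk("High-Speed Interconnect", "Communication library immature", Level.LOW, Level.MEDIUM,
         date(2020, 3, 1), Mitigation.SOURCE_CODE_AVAILABILITY),
    Risk("Processor", "Performance target shortfall", Level.HIGH, Level.HIGH,
         date(2020, 9, 1), Mitigation.PERFORMANCE_COMPENSATION),
]

if __name__ == "__main__":
    for r in review_order(risks):
        print(r.area, r.probability.name, r.impact.name, r.decision_date, r.mitigation.value)
```

Keeping probability and impact as separate fields preserves the two-axis categorization the plan asks for; the combined priority is only a convenience for sorting.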
Definitions and Glossary
Baseline Memory: High performance memory technologies such as DDR-DRAM, HBM, and HMC, for example, that may be included in the system’s memory capacity requirement. It does not include memory associated with caches.
Coefficient of Variation: The ratio of the standard deviation to the mean.
Cold boot: Full power-on of a system from a non-energized state, such as a post power outage situation. It can be assumed that facilities power, water, network, and site infrastructure services have been returned to service and timings for all offered file systems and cluster can begin from this point.
Delta-Ckpt: The time to checkpoint 80% of aggregate memory of the system to persistent storage. For example, if the aggregate memory of the compute partition is 3 PiB, Delta-Ckpt is the time to checkpoint 2.4 PiB. Rationale: This will provide a checkpoint efficiency of about 90% for full system jobs.
Ejection Bandwidth: Bandwidth leaving the node (i.e., NIC to router).
Full Scale: All of the compute nodes in the system. This may or may not include all available compute resources on a node, depending on the use case.
Idle Power: The projected power consumed on the system when the system is in an Idle State.
Idle State: A state in which the system is prepared to execute jobs but is not currently executing any. There may be multiple idle states.
Injection Bandwidth: Bandwidth entering the node (i.e., router to NIC).
Job Interrupt: Any system event that causes a job to unintentionally terminate.
Job Mean Time to Interrupt (JMTTI): Average time between job interrupts over a given time interval on the full scale of the system. Automatic restarts do not mitigate a job interrupt for this metric.
JMTTI/Delta-Ckpt: Ratio of the JMTTI to Delta-Ckpt, which provides a measure of how much useful work can be achieved on the system.
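As a rough illustration of how the Delta-Ckpt and JMTTI definitions combine, the following Python sketch reproduces the 3 PiB example and estimates checkpoint efficiency. It assumes Young's first-order approximation for the optimal checkpoint interval, which this glossary does not name. The function names, the storage bandwidth parameter, and the JMTTI/Delta-Ckpt ratio of 200 used in the example are illustrative assumptions, not values taken from these definitions.

```python
import math

PIB = 2**50  # bytes in one pebibyte


def delta_ckpt_bytes(aggregate_memory_bytes, fraction=0.80):
    """Data volume written by one Delta-Ckpt checkpoint (80% of aggregate memory)."""
    return aggregate_memory_bytes * fraction


def delta_ckpt_seconds(aggregate_memory_bytes, storage_bytes_per_s):
    """Delta-Ckpt time, assuming a sustained storage bandwidth (hypothetical parameter)."""
    return delta_ckpt_bytes(aggregate_memory_bytes) / storage_bytes_per_s


def jmtti(interval_hours, job_interrupts):
    """Simplified JMTTI estimate: observed interval divided by the number of job interrupts."""
    if job_interrupts == 0:
        return math.inf  # no interrupts observed in the interval
    return interval_hours / job_interrupts


def checkpoint_efficiency(jmtti_value, delta_ckpt_value):
    """Young's approximation: at the optimal interval, the fraction of time lost is
    about sqrt(2 * Delta-Ckpt / JMTTI). Both arguments must use the same time unit."""
    return 1.0 - math.sqrt(2.0 * delta_ckpt_value / jmtti_value)


if __name__ == "__main__":
    # Glossary example: 80% of a 3 PiB compute partition.
    print(delta_ckpt_bytes(3 * PIB) / PIB, "PiB per checkpoint")
    # Assumed ratio JMTTI/Delta-Ckpt = 200 (illustrative only).
    print(checkpoint_efficiency(200.0, 1.0))
```

Under this model, an assumed JMTTI/Delta-Ckpt ratio of 200 gives an efficiency of about 0.9, which is consistent with the roughly 90% figure in the Delta-Ckpt rationale.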
Nameplate Power: The maximum theoretical power the system could consume. This is a design limit, likely not achievable in operation, commonly specified on electrical equipment labels and used for power provisioning design per National Electrical Code (NEC, NFPA 70).
Nominal Power: The projected power consumed on the system by the ACES workflows (e.g., a combination of the ACES benchmark codes running large problems on the entire system).
Operational Capability: Real, usable capabilities in production operation, not theoretical capabilities.
Peak Power: The projected power consumed by an application that utilizes the maximum achievable power consumption such as DGEMM.
Platform Storage: Any nonvolatile storage that is directly usable by the system, its system software, and applications. Examples would include disk drives, RAID devices, and solid-state drives, no matter the method of attachment.
Rolling Upgrades/Rolling Rollbacks: A rolling upgrade or a rollback is defined as changing the operating software or firmware of a system component in such a way that the change does not require synchronization across the entire system. Rolling upgrades and rollbacks are designed to be performed with those parts of the system that are not being worked on remaining in full operational capacity.
System Interrupt: Any system event, or accumulation of system events over time, resulting in more than 1% of the compute resource being unavailable at any given time. Loss of access to any dependent subsystem (e.g., platform storage or service partition resource) will also incur a system interrupt.
System Mean Time Between Interrupt (SMTBI): Average time between system interrupts over a given time interval.
System Availability: ((time in period – time unavailable due to outages in period) / (time in period – time unavailable due to scheduled outages in period)) * 100
System Initialization: The time to bring 99% of the compute resource and 100% of any service resource to the point where a job can be successfully launched.
Warm boot: The cluster/file system management servers being booted and configured.
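The System Interrupt, SMTBI, and System Availability definitions above can be expressed as a short Python sketch. The function and parameter names are hypothetical, and the sketch assumes that "outages" in the numerator of the availability formula counts both scheduled and unscheduled outages, which is one reading of the definition rather than something it states explicitly.

```python
import math


def is_system_interrupt(unavailable_nodes, total_nodes, dependent_subsystem_lost=False):
    """True if more than 1% of the compute resource is unavailable,
    or if access to a dependent subsystem (e.g. platform storage) is lost."""
    return dependent_subsystem_lost or (unavailable_nodes / total_nodes) > 0.01


def smtbi(interval_hours, system_interrupts):
    """Simplified SMTBI estimate: observed interval divided by the number of system interrupts."""
    if system_interrupts == 0:
        return math.inf  # no interrupts observed in the interval
    return interval_hours / system_interrupts


def system_availability(period_hours, outage_hours, scheduled_outage_hours):
    """System Availability in percent, following the glossary formula.

    Assumption: outage_hours is the total of scheduled and unscheduled outages,
    so it cannot be smaller than scheduled_outage_hours.
    """
    if scheduled_outage_hours > outage_hours:
        raise ValueError("scheduled outages cannot exceed total outages")
    denominator = period_hours - scheduled_outage_hours
    if denominator <= 0:
        raise ValueError("period must be longer than the scheduled outages")
    return (period_hours - outage_hours) / denominator * 100.0


if __name__ == "__main__":
    # Hypothetical 30-day period: 24 h scheduled maintenance plus 12 h unscheduled outage.
    print(system_availability(720.0, 36.0, 24.0))
    print(smtbi(720.0, 3))
    print(is_system_interrupt(unavailable_nodes=150, total_nodes=10000))
```

Because scheduled outages are removed from the denominator, a period with only scheduled downtime yields 100% availability under this formula.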