Optimizing Network Usage on Sequoia and Sierra (web.cse.ohio-state.edu/~subramoni.1/ExaComm16/...)
TRANSCRIPT
LLNL-PRES-695482 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Optimizing Network Usage on Sequoia and Sierra
Bronis R. de Supinski, Chief Technology Officer
Livermore Computing
June 23, 2016
(Roadmap chart, Fiscal Years '13 through '23. Advanced Technology Systems (ATS): Sequoia (LLNL), ATS1 Trinity (LANL/SNL), ATS2 Sierra (LLNL), ATS3 Crossroads (LANL/SNL), ATS4 (LLNL), ATS5 (LANL/SNL). Commodity Technology Systems (CTS): Tri-lab Linux Capacity Cluster II (TLCC II), CTS1, CTS2. Each system moves through system delivery, procure & deploy, use, and retire phases.)
My focus is NNSA ASC ATS platforms at LLNL
Sequoia and Sierra are the current and next-generation Advanced Technology Systems at LLNL
Sequoia provides previously unprecedented levels of capability and concurrency
§ Sequoia statistics
— 20 petaFLOP/s peak
— 17 petaFLOP/s LINPACK
— Memory 1.5 PB, 4 PB/s bandwidth
— 98,304 nodes
— 1,572,864 cores
— 3 PB/s link bandwidth
— 60 TB/s bi-section bandwidth
— 0.5–1.0 TB/s Lustre bandwidth
— 50 PB disk
§ 9.6 MW power, 4,000 ft²
§ Third generation IBM Blue Gene
The BG/Q compute chip integrates processors, memory and networking logic into one chip
§ 16 user + 1 OS + 1 redundant cores
— 4-way multi-threaded, 1.6 GHz 64-bit
— 16 kB/16 kB L1 I/D caches
— Quad FPUs (4-wide DP SIMD)
— Peak: 204.8 GFLOPS @ 55 W
§ Shared 32 MB eDRAM L2 cache
— Multiversioned cache
§ Dual memory controller
— 16 GB DDR3 memory (1.33 Gb/s)
— 2×16 byte-wide interface (+ECC)
§ Chip-to-chip networking
— 5D torus topology + external link
— Each link 2 GB/s send + 2 GB/s receive
— DMA, put/get, collective operations
Traditional Blue Gene overall system integration results in small footprint
1. Chip: 16 cores
2. Module: single chip
3. Compute card: one single chip module, 16 GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2, or 4 I/O drawers
7. System: 20 PF/s
§ Communication locality through optimized MPI process placement is critical on 3D torus networks
— Use of 5D torus reduces network diameter and reduces the importance of MPI process placement
§ Support for hardware-optimized collectives should apply to subcommunicators as well as global operations
— Increased network communication contexts allow more applications to exploit hardware support for collective operations
§ Hardware support for network partitioning minimizes jitter
§ Multiple networks provide many benefits but also increase costs
Sequoia and BlueGene/Q reflect lessons from previous Blue Gene generations
These examples are network-centric; others reflect lessons throughout the system hardware and software architecture
§ Mechanisms that lead to arrhythmia are not well understood; contraction of the heart is controlled by the electrical behavior of heart cells
§ Mathematical models reproduce component ionic currents of the action potential
— System of non-linear ODEs
— TT06 includes 6 species and 14 gates
— Negative currents depolarize (activate)
— Positive currents repolarize (return to rest)
§ Ability to run, at high resolution, thousands instead of tens of heartbeats enables detailed study of drug effects
LLNL, with IBM, has developed Cardioid, a state-of-the-art cardiac electrophysiology simulation
2012 Gordon Bell finalist typifies the effort required to exploit supercomputer capability fully
(Chart annotations: 60 beats in 67.2 seconds; 60 beats in 197.4 seconds.)
§ Measured peak performance: 11.84 PFlop/s (58.8% of peak)
— 0.05 mm resolution heart (3B tissue cells)
— Ten million iterations, dt = 4 μs
— Performance of full simulation loop, including I/O, measured with HPM
Cardioid achieves outstanding performance that enables nearly real-time heartbeat simulation
Optimized Cardioid is 50x faster than "naive" code
§ Extreme strong scaling limit:
— 0.10 mm: 236 tissue cells/core
— 0.13 mm: 114 tissue cells/core
Cardioid represents a major advance in the state of the art of human heart simulation
(Figure: 0.1 mm heart (370M tissue cells), one minute of wall time; previous state of the art versus Cardioid; 18.2 seconds of simulation time.)
Cardioid achieves outstanding performance through detailed tuning to Sequoia's architecture
§ Partitioned cells over processes with an upper bound on time (not on equal time)
§ Assigned diffusion work and reaction work to different cores
§ Transformed the potassium equation to remove serialization
§ Expensive 1D functions in reaction model expressed with rational approximates
§ Single precision weights to reduce diffusion stencil use of L2 bandwidth
§ Hand unrolled to SIMDize loops over cells
§ Sorted by cell type to improve SIMDization
§ Sub-sorting of cells to increase sequential/vector load and storing of data
§ log function from libm replaced with custom inlined functions
§ On the fly assembly of code to optimize data movement at runtime
§ Memory layout tuned to improve cache performance
§ Use of vector intrinsics and custom divides
§ Moved integer operations to floating point units to exploit SIMD units
§ No explicit network barrier
§ L2 on-node thread barriers
§ Use low-level SPI for halo data exchange between tasks (DMA)
§ Application managed threads
§ SIMDized diffusion stencil implementation
§ Zero flux boundary conditions approximated by method with no global solve
§ High performance I/O is aware of BG/Q network topology
§ Low overhead in-situ performance monitors
§ Assignment of threads to diffusion/reaction dependent on domain characteristics
§ Co-scheduled threads for improved dual issue
§ Multiple diffusion implementations to obtain optimal performance for various domains
§ Remote & local copies separated to improve bandwidth utilization
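One item above, expressing expensive 1D functions with rational approximates, is easy to illustrate: a [2/2] Padé approximant of exp(x) costs two short polynomial evaluations and one divide instead of a libm call. A minimal sketch (the approximant choice and accuracy target are assumptions for illustration, not Cardioid's actual fitted tables):

```python
import math

def exp_pade22(x):
    """[2/2] Pade approximant of exp(x): two short polynomial
    evaluations and one divide, instead of a libm call."""
    num = 1.0 + x * (0.5 + x / 12.0)   # 1 + x/2 + x^2/12
    den = 1.0 - x * (0.5 - x / 12.0)   # 1 - x/2 + x^2/12
    return num / den

# A membrane model only needs accuracy over the narrow range its
# variables actually take; here, roughly 5e-4 worst case on [-1, 0].
worst = max(abs(exp_pade22(x / 100.0 - 1.0) - math.exp(x / 100.0 - 1.0))
            for x in range(101))
print(f"max abs error on [-1, 0]: {worst:.1e}")
```

In production code the divide itself would also use the custom divides mentioned above, and the fits would be tuned per-function to the required tolerance.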
At largest scales, small software overheads can significantly impact performance
Direct use of message units and L2 atomic operations minimizes overhead
(Chart: per-operation time in μs, 0 to 300, for L2 Atomic Barrier, OMP Fork/Join, SPI Halo Exchange, and MPI Halo Exchange, against the 60 μs per-iteration target; 370 million cells, 1.6 million cores, 1600 Flops/cell.)
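The gap between the L2 atomic barrier and OMP fork/join comes from replacing a full runtime rendezvous with a bare shared counter. A sense-reversing counter barrier, the shape of scheme the L2 atomic operations implement in hardware, can be sketched with plain Python threads (illustrative only; the lock stands in for an L2 atomic decrement, and none of this is the Cardioid code):

```python
import threading

class SenseBarrier:
    """Counter-based sense-reversing barrier sketch."""
    def __init__(self, n):
        self.n = n
        self.count = n
        self.sense = False
        self.lock = threading.Lock()  # stands in for an atomic decrement

    def wait(self):
        local_sense = not self.sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
        if last:
            self.count = self.n        # reset for reuse...
            self.sense = local_sense   # ...then release everyone
        else:
            while self.sense != local_sense:
                pass                   # spin; cheap when atomics are in cache

# 4 workers pass the barrier twice; order within a phase is free, but no
# thread may enter phase 2 before all threads finish phase 1.
log = []
bar = SenseBarrier(4)

def worker(i):
    log.append(("phase1", i)); bar.wait()
    log.append(("phase2", i)); bar.wait()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(all(p == "phase1" for p, _ in log[:4]))
```

The hardware version replaces the lock and spin with fetch-and-decrement and a load on an L2 line, which is why it lands well under the 60 μs budget.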
The Sierra system that will replace Sequoia features a GPU-accelerated architecture
Components:
— Mellanox® Interconnect: dual-rail EDR InfiniBand®
— IBM POWER: NVLink™
— NVIDIA Volta: HBM, NVLink
Compute node: POWER® Architecture processor, NVIDIA® Volta™, NVMe-compatible PCIe 800 GB SSD, > 512 GB DDR4 + HBM, coherent shared memory
Compute rack: standard 19", warm water cooling
Compute system: 2.1 to 2.7 PB memory, 120 to 150 PFLOPS, 10 MW
GPFS™ file system: 120 PB usable storage, 1.2/1.0 TB/s R/W bandwidth
Outstanding benchmark analysis by IBM and NVIDIA demonstrates the system's usability
Projections included code changes that showed a tractable annotation-based approach (i.e., OpenMP) will be competitive
Figure 5: CORAL benchmark projections show GPU-accelerated system is expected to deliver substantially higher performance at the system level compared to CPU-only configuration.
The demonstration of compelling, scalable performance at the system level across a wide range of applications proved to be one of the key factors in the U.S. DoE’s decision to build Summit and Sierra on the GPU-accelerated OpenPOWER platform.
Conclusion Summit and Sierra are historic milestones in HPC’s efforts to reach exascale computing. With these new pre-exascale systems, the U.S. DoE maintains its leadership position, trailblazing the next generation of supercomputers while allowing the nation to stay ahead in scientific discoveries and economic competitiveness.
The future of large-scale systems will inevitably be accelerated with throughput-oriented processors. Latency-optimized CPU-based systems have long hit a power wall that no longer delivers year-on-year performance increase. So while the question of “accelerator or not” is no longer in debate, other questions remain, such as CPU architecture, accelerator architecture, inter-node interconnect, intra-node interconnect, and heterogeneous versus self-hosted computing models.
With those questions in mind, the technological building blocks of these systems were carefully chosen with the focused goal of eventually deploying exascale supercomputers. The key building blocks that allow Summit and Sierra to meet this goal are:
• The heterogeneous computing model
• NVIDIA NVLink high-speed interconnect
• NVIDIA GPU accelerator platform
• IBM OpenPOWER platform
With the unveiling of the Summit and Sierra supercomputers, Oak Ridge National Laboratory and Lawrence Livermore National Laboratory have spoken loud and clear about the technologies that they believe will best carry the industry to exascale.
(Chart: CORAL application performance projections, relative performance from 0x to 14x for CPU-only versus CPU + GPU. Scalable Science Benchmarks: QBOX, LSMS, HACC, NEKbone. Throughput Benchmarks: CAM-SE, UMT2013, AMG2013, MCB.)
Sierra NRE will provide significant benefit to the final system
§ Center of Excellence
§ Motherboard design
§ Water cooled compute nodes
§ HW resilience studies/investigation (NVIDIA)
§ Switch based collectives
§ Hardware tag matching
§ GPUDirect and NVMe
§ Open source compiler infrastructure
§ System diagnostics
§ System scheduling
§ Burst buffer
§ GPFS performance and scalability
§ Cluster management
§ Open source tools
§ Reliable, scalable, general purpose primitive, applicable to multiple use cases
— In-network tree-based aggregation mechanism
— Large number of groups
— Multiple simultaneous outstanding operations
§ High performance collective offload
— Barrier, Reduce, All-Reduce
— Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
— Can overlap communication and computation
§ Flexible mechanism reflects lessons learned from Blue Gene systems
Switch-based support for collectives further improves critical functionality
(Diagram: a SHArP tree with a root, aggregation nodes, and endnodes; each node is a process running on an HCA.)
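The tree aggregation idea is simple to sketch: each aggregation node reduces its children's contributions and forwards a single value toward the root, so the root sees logarithmically many hops rather than one message per endnode. A minimal model (the fanout and operators here are illustrative assumptions, not SHArP's wire protocol):

```python
from functools import reduce

def tree_reduce(values, fanout=4, op=lambda a, b: a + b):
    """Reduce endnode contributions level by level up an aggregation
    tree: each node combines up to `fanout` children and forwards one
    value, until a single result reaches the root."""
    level = list(values)                      # endnode contributions
    while len(level) > 1:
        level = [reduce(op, level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]                           # value at the tree root

ranks = list(range(128))                      # e.g. 1 PPN on 128 nodes
print(tree_reduce(ranks))                     # same result as a flat sum
print(tree_reduce(ranks, op=max))             # other offloaded ops work too
```

Because each level's reductions happen in the switches, the hosts only inject one message each and can overlap the wait with computation, which is the offload benefit claimed above.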
Initial results demonstrate that SHArP collectives improve performance significantly
§ OSU Allreduce, 1 PPN, 128 nodes
§ Realistic applications, particularly in C++, often use small messages
— Realized message rate is often the key performance indicator
— MPI provides little ability to coalesce these messages
§ MPI matching rules heavily impact realized message rate
— Message envelopes must match
• Wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG) increase envelope matching complexity and, thus, cost
— Posted receives must be matched in order against the in-order posted sends
As seen with Cardioid, MPI software overhead critically limits realized network performance
Hardware message matching support can alleviate software overhead
(Diagram: matching receiver and sender envelopes over time: Tag=A, Communicator=B, Source=C at Time=X and Time=X+D on the receiver; Tag=A, Communicator=B, Destination=C at Time=Y and Time=Y+D' on the sender.)
MPI tag matching operations must appear to be performed atomically
§ Complexity/serialization of message matching limits the processing that can be performed on GPUs
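The in-order requirement is what serializes the receive path: each incoming envelope must be tested against the posted-receive queue in posting order, and wildcards force the full scan that hardware tag matching tries to offload. A toy model of the rule (plain Python, not any MPI implementation's code):

```python
ANY_SOURCE = ANY_TAG = -1  # stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG

def match(posted, envelope):
    """Match an incoming (source, tag, communicator) envelope against
    the posted-receive queue IN POSTING ORDER, as MPI requires.
    Returns the matched receive's index and removes it, else None."""
    src, tag, comm = envelope
    for i, (rsrc, rtag, rcomm) in enumerate(posted):
        if rcomm == comm \
           and rsrc in (src, ANY_SOURCE) \
           and rtag in (tag, ANY_TAG):
            del posted[i]   # must happen atomically w.r.t. other matches
            return i
    return None  # unexpected message: would go to the unexpected queue

posted = [(0, 7, 0), (ANY_SOURCE, ANY_TAG, 0), (1, 7, 0)]
# A (source=1, tag=7) envelope must match the wildcard receive posted
# earlier, NOT the exact (1, 7) receive posted after it.
print(match(posted, (1, 7, 0)))   # the wildcard, at index 1
print(match(posted, (1, 7, 0)))   # now the exact receive
```

The find-and-delete step must look atomic to concurrent matchers, which is exactly why the slide notes that this processing is hard to spread across GPU threads.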
§ Offloaded to the ConnectX-5 HCA
— Enables more of hardware bandwidth to translate to realized message rate
— Full MPI tag matching as compute progresses
— Rendezvous offload: large data delivery as compute progresses
§ Control can be passed between hardware and software
§ Verbs tag matching support is being upstreamed
Mellanox hardware will support efficient MPI tag matching
§ Sierra uses commodity cluster network solution
— Separate management (Ethernet) and user (IB) networks
— Single network for user traffic (point-to-point, collectives & file system traffic)
§ A single network for user traffic saves money but has other costs
— Jitter impact of other jobs' file system traffic can be severe
— Burst buffer strategy smooths file system bandwidth demand
• File system traffic of a job now competes with its MPI traffic
§ Different types of network traffic are not equally critical
— File system traffic "only" needs a guarantee of eventual completion
— Collectives often critically limit overall performance
— Other traffic classes also exist
Deploying multiple networks exacerbates network hardware costs, which are already too high in large-scale systems
Quality-of-Service (QoS) mechanisms are necessary to achieve a network solution that reduces network hardware costs while providing acceptable, consistent performance for all traffic classes
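One way to see why QoS can stand in for a second network: give latency-critical classes strict priority while reserving a small bandwidth floor so file system traffic keeps its guarantee of eventual completion. A toy arbiter along those lines (illustrative only; this is not InfiniBand's virtual-lane arbitration algorithm):

```python
from collections import deque

# Traffic classes, most latency-critical first; file system traffic only
# needs eventual completion, so it gets a guaranteed share, not priority.
CLASSES = ["collective", "point_to_point", "filesystem"]

def arbitrate(queues, fs_share=0.1, slots=100):
    """Serve higher classes first, but reserve every k-th slot for
    file system traffic so it always makes forward progress."""
    served = []
    every_k = max(1, int(1 / fs_share))
    for slot in range(slots):
        if slot % every_k == every_k - 1 and queues["filesystem"]:
            served.append(queues["filesystem"].popleft())
            continue
        for cls in CLASSES:
            if queues[cls]:
                served.append(queues[cls].popleft())
                break
    return served

queues = {"collective": deque(f"c{i}" for i in range(5)),
          "point_to_point": deque(f"p{i}" for i in range(200)),
          "filesystem": deque(f"f{i}" for i in range(50))}
out = arbitrate(queues)
# Collectives drain immediately; file system traffic still progresses
# even though point-to-point demand alone could saturate the link.
print(out[:5], sum(1 for x in out if x.startswith("f")))
```

Real IB QoS maps classes to service levels and virtual lanes with weighted arbitration tables, but the principle is the same: differentiated service on one fabric instead of a second fabric.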
Preliminary results indicate that IB priority levels compensate for checkpoint traffic
Figure 21: Application (pF3D nearest neighbor exchange in Z, 16 processes per node) completion time for the various benchmarked configurations, in presence of checkpointing and in isolation.
Figure 22: Checkpointing rate in absence and in presence of application (pF3D nearest neighbor exchange in Z, 16 processes per node) for the various benchmarked configurations.
3.3 All-to-All Simulations on Fat Tree
The all-to-all communication phases of pF3D, due to the mapping we have chosen, interact with checkpointing traffic in a smaller portion of the network. They do, however, exhibit similar behavior to that presented in the previous section for the nearest neighbor exchange, and the conclusions are the same. In the interest of brevity, we will only include the summary of the results. The figures present the following.
For the all-to-all exchange in the X direction:
• Figure 23: pF3D completion time in the single process per node case.
• Figure 24: checkpointing rate in the pF3D single process per node case.
§ Flexibility of SHArP switch-based collectives will accelerate subcommunicator collectives and will allow jobs to share the network
§ HCA MPI tag matching will reduce software cost on the critical path
— Future systems should further accelerate message passing software
§ QoS mechanisms are essential with burst buffers or systems that share network resources across jobs
— Multiple networks might still provide the best solution in some cases
— Network partitioning could still be valuable on future systems
§ GPUDirect and NVMe reduce within-node messaging impact
§ High capability nodes lead to a smaller network
— Reduces importance of network partitioning
— LLNL CTS-1 with 2-to-1 tapered fat tree still requires careful task mapping
Sierra network hardware addresses lessons learned from previous LLNL systems
Substantial research questions still remain for systems after Sierra