p802.1qcz interworking with other data center technologies

38
P802.1Qcz interworking with other data center technologies Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles, Jose Duato IEEE 802.1 Plenary Meeting, San Diego, CA, USA July 8, 2018

Upload: others

Post on 21-Apr-2022

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: P802.1Qcz interworking with other data center technologies

P802.1Qczinterworkingwithotherdatacentertechnologies

JesusEscudero-Sahuquillo,PedroJavierGarcia,FranciscoJ.Quiles,JoseDuato

IEEE802.1PlenaryMeeting,SanDiego,CA,USAJuly8,2018

Page 2: P802.1Qcz interworking with other data center technologies

Executivesummary

Existingcongestioncontrolmechanismsworkend-to-end.Weneedcomplementarymechanismsthatreactquicklywhentransientcongestionappears,alsopreventingHoL blockingfromdegradingperformance.

Page 3: P802.1Qcz interworking with other data center technologies

Outline

• Whycongestionisolationisneeded?• Analysisofcongestionscenarios• Limitationsofcurrenttechnologies• CongestionIsolationinDCNs• Conclusions

Page 4: P802.1Qcz interworking with other data center technologies

Outline

• Whycongestionisolationisneeded?• Analysisofcongestionscenarios• Limitationsofcurrenttechnologies• CongestionIsolationinDCNs• Conclusions

Page 5: P802.1Qcz interworking with other data center technologies

• TodaysDatacenters(DCNs)requireaflexiblefabricforcarryinginaconvergentwaytrafficfromdifferenttypesofapplications,storageofcontrol.• FabricdesignforDCNsmustminimizeoreliminatepacketloss,providehighthroughputandmaintainlowlatency.• Thesegoals arecrucialforapplicationsofOLDI,DeepLearning,NVMe overFabricsandtheCloudified CentralOffices.• However,congestion threatensthesegoals.

Whycongestionisolationisneeded?

Paul Congdon et al: The Lossless Network for Data Centers. NENDICA “Network Enhancements for the Next Decade” Industry Connections Activity, IEEE Standards Association, 2018.

Page 6: P802.1Qcz interworking with other data center technologies

• HoL-blockingdramaticallydegradesthenetworkperformance(e.g.PFChasnotenoughgranularityandthereisnocongestedflowidentification).• Classicale2econgestioncontrolforlosslessnetworksisdifficulttotune,reactsslowly,andmayintroduceoscillationsandinstability[1].

HSstarts

HSends

HS = traffic injected to Hot Spot destination

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0 1e+06 2e+06 3e+06 4e+06 5e+06Network Throughput (normalized)

Time (nanoseconds)

1QITh

VOQnet

64-node CLOS network, 4 hot-spots

Whycongestionisolationisneeded?

[1] Jesús Escudero-Sahuquillo, Ernst Gunnar Gran, Pedro Javier García, Jose Flich, Tor Skeie, OlavLysne, Francisco J. Quiles, José Duato: Combining Congested-Flow Isolation and Injection Throttlingin HPC Interconnection Networks. ICPP 2011: 662-672

Page 7: P802.1Qcz interworking with other data center technologies

• Weneedacongestionisolation(CI)mechanismthatreactsquicklywhentransientcongestionsituationsappear,preventingnetworkperformancedegradationcausedbytheHoL blocking.• WewantaCImechanismthatcomplementsothertechnologies availableintheDCNs,sothatCIimprovestheirperformance,whiletheothersreducetheCIcomplexity.

Whycongestionisolationisneeded?

Page 8: P802.1Qcz interworking with other data center technologies

Outline

• Whycongestionisolationisneeded?• Analysisofcongestionscenarios• Limitationsofcurrenttechnologies• CongestionIsolationinDCNs• Conclusions

Page 9: P802.1Qcz interworking with other data center technologies

AnalysisofcongestionTheOriginofCongestion

• Somepacketssimultaneouslyrequestthesameoutput portwithinaswitch.• Apacketcanbeforwardedwhiletheother(s)wait(s),sincetransferencespeedisdeterminedbytheoutputlink.

Contention

Page 10: P802.1Qcz interworking with other data center technologies

• Persistentcontention duringtime.• Bufferscontainingblockedpacketsfillup atingressandegressport,andcongestionappears.

Congestion appears

AnalysisofcongestionTheOriginofCongestion

Page 11: P802.1Qcz interworking with other data center technologies

CongestionRoot

PFC

AnalysisofcongestionTheOriginofCongestion

• Inlosslessnetworks,congestionpropagatesquickly duetobuffersbackpressure.Congestiontrees growupinthisway.

Page 12: P802.1Qcz interworking with other data center technologies

• Differentcongestiontreesdynamicsmakesmorecomplexthecongestionmanagement[Garciaetal.05].

Root of Congestion

Branch

Branch

Leaf

Leaf

Congestion can reach the sources

AnalysisofcongestionTheOriginofCongestion

Pedro J. García, J. Flich, J. Duato, I. Johnson, F. J. Quiles, F. Naven: Dynamic Evolution of Congestion Trees: Analysis and Impact on Switch Architecture. HiPEAC 2005: 266-285

PFC

Page 13: P802.1Qcz interworking with other data center technologies

AnalysisofcongestionCongestiontreesdynamics

• Ingeneral,theswitchwherecongestionoriginatescouldbelocatedatsomeinitialorintermediatestageorbedirectlyconnectedtoendnodes.

Root of Congestion

Incast congestion

Root of Congestion

In-networkcongestion

Page 14: P802.1Qcz interworking with other data center technologies

• Itusuallyoccurswhencongestionislight(i.e.itexceedsavailablelinkbandwidthbyasmallintegerfactoratmost).• Therearetwobasicscenarios:

1. Afewnodesinjectingtrafficatfullratetowardsthesamedestination.

2. Manynodesinjectingtrafficatlowratestowardsthesamedestination.

• Egressportsofin-networkcongestedswitchesworkatfullcapacityandmaycontendwithotherflowsforupstreamswitches,movingtherootofcongestionupwards.

AnalysisofcongestionIn-NetworkCongestion

Root of Congestion

In-networkcongestion

Page 15: P802.1Qcz interworking with other data center technologies

• Traditionalapproaches:spreadtrafficflowsacrossthemultiplepathsinordertobalancetheloadandhopefullyavoidcongestion(loadbalancing).• Problems:

1. Spreadingtrafficdonottakeintoaccountwhethertheselectedpathiscongested,generatingcollisionsoftrafficflowsinpathsalready congested.

2. Thenatureofflowsmatters:elephantflowsincreasethechanceofcreatingin-networkcongestion.

3. Traditionalloadbalancing(e.g.ECMP)donotworkwhereincast congestionappear.

AnalysisofcongestionIn-NetworkCongestion

Page 16: P802.1Qcz interworking with other data center technologies

• Manynodesstarttosendpacketsatfullratetowardsthesamedestination,almostatthesametime(e.g.OLDIservices)• Incast congestionoccursattheToR switchwherethenodethatmultiplepartiesaresynchronizingwithisconnected,andgrowsfromToRswitchestodownstreamswitches.• InCLOSnetworksmanysmallcongestiontreesconcurrentlyappearatfirst-stageswitches,latermergingatsecond-stageswitchesandformingseverallargercongestiontrees.

AnalysisofcongestionIncast Congestion

Root of Congestion

Incast congestion

Page 17: P802.1Qcz interworking with other data center technologies

• Traditionalapproach:TheDCNnetworkequipmentsimplyreactstoincast usingECN+PFCandsmartbuffermanagementinanattempttominimizepacketloss.

• Problems:1. LargeDCNnetworkshave

morehops,increasingtheclosed-loopreactiontimeofECN.

2. MoretrafficinflightmakesitdifficultforECNtoreact untilsuddentrafficbursts.

3. PFCgeneratesHoL blocking inupstreamswitches

4. ECNmaybetriggeredatsourcesnotcontributingtocongestion

AnalysisofcongestionIncast Congestion

Non-congested fllows advance at

the same speed as congested ones

Congestion affectssources that do notcause congestion

PFC

Page 18: P802.1Qcz interworking with other data center technologies

AnalysisofcongestionProposedTechnologies

• Asmallnumberoflongdurationelephantflowscanaligninsuchawaytocreatequeuingdelaysforthelargernumberofshortbutcriticalmiceflows.• Traditionalloadbalancing(i.e.ECMP)andECN+PFCarenotenoughwhenin-networkandincast congestionappearinDCNnetworks.• In-networkcongestioncanbereducedbysuitably:

• Dimensioningnetworkbisectionbandwidth.• Applyingcleverbuffersorganization.• Usingsomeformofload-awaretrafficbalancingatthesources.

• Incast congestioncanbealleviatedbyusingdestinationscheduling.• Proposedtechnologieshavealsolimitationswhencongestionappears.

Paul Congdon et al: The Lossless Network for Data Centers. NENDICA “Network Enhancements for the Next Decade” Industry Connections Activity, IEEE Standards Association, 2018.

Page 19: P802.1Qcz interworking with other data center technologies

Outline

• Whycongestionisolationisneeded?• Analysisofcongestionscenarios• Limitationsofcurrenttechnologies• CongestionIsolationinDCNs• Conclusions

Page 20: P802.1Qcz interworking with other data center technologies

LimitationsofcurrenttechnologiesLoadbalancing

• Techniquetoavoidin-networkcongestion.• Ineffectiveapproachescanactuallydotheopposite.• Loadbalancingselectsapathbyhashingtheflowidentityfieldsintheroutedpacketsuchthatallpacketsfromaparticularflowtraversethesamepath.• EqualCostMulti-Path (ECMP)routing:Flowgranularityisaproblemthatmaygenerateelephantflowstotraverseandoccupyarouteinthenetworkforalongperiodoftime.

• ECNmechanismmayreduceinjectionrateofelephantflows,butduringtheclosed-looptransientperiodtheymayinterferewithmiceflows,slowingdowntheiradvance.

Page 21: P802.1Qcz interworking with other data center technologies

• Toovercometheseissues,severalideasfocusonreducinggranularityofflowstomakebetterloadbalancingdecisions,basedonmeasuringthecongestedpaths.• Thegranularityofloadbalancinghastrade-offsbetweentheuniformityofthedistributionandcomplexityassociatedwithassuringdataisdeliveredinitsoriginalorder.• Theyrequiresomeformofsignallingcongestiontothesources.• Balancingcongestedpacketsthroughalternativeroutesmayendupmovingcongestionrootsneartoendnodes,transformingin-networkcongestioninincast congestion

LimitationsofcurrenttechnologiesLoad-awarepacket-levelbalancing

Rocher-Gonzalez, J., Escudero-Sahuquillo, J., Garcia, P.J., Quiles, F. On the Impact of Routing Algorithms in the Effectiveness of Queuing Schemes in High-Performance Interconnection Networks. In Proc. of IEEE HoTI 2017.

Page 22: P802.1Qcz interworking with other data center technologies

In-networkcongestionina3-tierCLOS

Root of Congestion

Load-aware packet-balancingdecision based on

congestion indicators

Alternative path to

destination

X

Packets Reordering

LimitationsofcurrenttechnologiesLoad-awarepacket-levelbalancing

Page 23: P802.1Qcz interworking with other data center technologies

LimitationsofcurrenttechnologiesDestinationscheduling

• Thenetworkcanassistineliminatingpacketlossatthedestinationbyschedulingtrafficdeliverywhenitwouldotherwisebelost.• TraditionalTCPassessthebandwidthandresourceavailabilitybymeasuringfeedbackthroughacks (itworkswithlightload)

• Onceincast congestionappearsatthedestination,delaysincreaseandbuffersoverflow,throughputislostandlatencyrises.• TraditionalTCPcannotreactquickenoughtohandleincast

• Solution:Requestingdatafromthesourceataratethatitcanbeconsumedwithoutloss.

Page 24: P802.1Qcz interworking with other data center technologies

• Sourcesrequest (send)directlyasmallamountofunscheduleddatatotheirdestinations.• Destinationsscheduleagrant response,bymeansofACKs,whenresourcesareavailabletoreceivetheentiretransfer.• ThereisaRTTrequest-grantdelaythatmayincreaseduringincast situations.• Solution:Sourcesmonitorthelevelofcongestioninthenetwork(light,moderateandhigh)andscheduledatainjectionaccordingtothelevelofcongestion.

LimitationsofcurrenttechnologiesLoad-awaredestinationscheduling

Page 25: P802.1Qcz interworking with other data center technologies

• Itispossibletocombinedestinationschedulingandloadbalancing,dependingonwhetherincast orin-networkcongestionismonitored.• Sourcesmeasureifthecongestionislight,moderateorhigh,applyingdifferentinjectionrates.• Theideaistoworkinload-awarebalancingmodeuntilincast congestionappear.Whenthishappens,thenetworkswitchestodestination-schedulingmode.• ThefrequencyuseofPFCandECNisreduced

LimitationsofcurrenttechnologiesCombinedload-awaredestinationschedulingandbalancing

Page 26: P802.1Qcz interworking with other data center technologies

Incast congestionina3-tierCLOS

LimitationsofcurrenttechnologiesDestinationscheduling

Root of Congestion

ECN-marked packets received

a certain rateSources inject a

moderate amount of data and apply

load balancing

Moderate load

Light load

Sources schedule big amount of

data

High load

Sources inject a small amount of data (granted by

destination)

X

Page 27: P802.1Qcz interworking with other data center technologies

• Thesetechnologiesmayworktogethertoeliminatelossintheclouddatacenternetwork[1].• Load-balancinganddestinationschedulingareend-to-endsolutionsincurringintheRTTdelayswhencongestionappear• However,thereisnotimeforlossinthenetworkduetocongestionandcongestiontreesgrowveryquickly.• TransientcongestionmaystillproduceHoL blockingthatleadstoincreaselatency,lowerthroughputandbuffersoverflow,significantlydegradingperformance.• Evenusingthesemechanisms,westillneedsomethingtodealwithHOLBlockinglocallyandfast.

LimitationsofcurrenttechnologiesConsequences

Page 28: P802.1Qcz interworking with other data center technologies

Outline

• Whycongestionisolationisneeded?• Analysisofcongestionscenarios• Limitationsofcurrenttechnologies• CongestionIsolationinDCNs• Conclusions

Page 29: P802.1Qcz interworking with other data center technologies

CongestionIsolationinDCNsMotivation

• CIisneededtoreactlocallyandveryfasttoimmediatelyeliminateHoL blocking.• PrevioustechnologiesreducetheuseofPFCandECN,buttheirclosed- andopen-loopapproachcausedelaysstillhappening.• Congestiontreesappearsuddenly,aredifficulttopredict (evenworsewhenloadbalancingisapplied)andgrowquickly.• CIcanbeappliedincombinationtotheprevioustechnologies,improvingtheirbehavior.

Page 30: P802.1Qcz interworking with other data center technologies

• CIworkswellwhencombinedwithotherCCmechanisms(e.g.e2econgestioncontrol)[1].• Loadbalancingmakesitdifficulttopredictwhenandwherecongestionpointsarise.• DestinationSchedulinghasRTT-delaysthatmaymakefeedbackinformationobsoletebythetimeitreachesthesources.• CIwillcomplementthesetechnologiesworkingtogether,makingthembehavebetter.

CongestionIsolationinDCNsImprovementsoncurrenttechnologies

[1] Jesús Escudero-Sahuquillo, Ernst Gunnar Gran, Pedro Javier García, Jose Flich, Tor Skeie, OlavLysne, Francisco J. Quiles, José Duato: Combining Congested-Flow Isolation and Injection Throttlingin HPC Interconnection Networks. ICPP 2011: 662-672

Page 31: P802.1Qcz interworking with other data center technologies

• Loadbalancing:• CIcomplementsloadbalancingaslocalandfastcongestedflowsisolationreducestheHoL blockingprobabilitieswhenloadbalancingisappliedthroughouttheentirenetwork.• Betterdecisionsforloadbalancingcanbemadeoncethecongestedflowsareisolated.

• Destinationscheduling:• TransientperiodswheregrantsaresentfromdestinationstosourcescanbecomplementedwithCI.• FastandlocalisolationofcongestedflowsreduceRTT-delaysofgrants.

CongestionIsolationinDCNsImprovementsoncurrenttechnologies

Page 32: P802.1Qcz interworking with other data center technologies

• DotheotherscomplementCI?Yes,theymakepossibletokeeptheCIrequiredresourceslow.• CIrequireadditionalresourcestokeeptrackofcongestiontrees atswitches.• Ifthenumberofcongestionspotsgrows,switchesmayenduprunningoutofresourcestokeeptrackofthem.• LoadbalancinganddestinationschedulingstrategieswilldraincongestiontreesfasterthanusingPFC+ECN.• Theywillcomplement(andimprove)theCIbehavior.

CongestionIsolationinDCNsCurrenttechnologiesalsoimproveCI

Jesús Escudero-Sahuquillo, Ernst Gunnar Gran, Pedro Javier García, Jose Flich, Tor Skeie, Olav Lysne, Francisco J. Quiles, José Duato: Efficient and Cost-Effective Hybrid Congestion Control for HPC Interconnection Networks. IEEE Trans. Parallel Distrib. Syst. 26(1): 107-119(2015)

Page 33: P802.1Qcz interworking with other data center technologies

Outline

• Whycongestionisolationisneeded?• Analysisofcongestionscenarios• Limitationsofcurrenttechnologies• CongestionIsolationinDCNs• Conclusions

Page 34: P802.1Qcz interworking with other data center technologies

Conclusions

• ThereisalotofworkdoneinDCNstodealwithcongestionandHoL blocking.• Existingsolutionsworkend-to-end,sothattransientcongestionmaystillspoilnetworkperformance.• CIprovidesafastreactiontocongestionandHoLblocking.• Infact,CIcanworkincooperationwithotherapproachesproposedtodealwithcongestion,improvingtheirbehavior.• Inaddition,theproposedapproachescanalsoworkincooperationwithCI,increasingitsbenefits.• Itisveryinterestingtoexplorethesynergyofallthesetechniquesworkingtogether.

Page 35: P802.1Qcz interworking with other data center technologies

JesusEscudero-Sahuquillo,PedroJavierGarcia,FranciscoJ.Quiles,JoseDuato

IEEE802.1PlenaryMeeting,SanDiego,CA,USAJuly8,2018

P802.1Qczinterworkingwithotherdatacentertechnologies

Page 36: P802.1Qcz interworking with other data center technologies

Backupslides

Page 37: P802.1Qcz interworking with other data center technologies

ContentionswitchA

AnalysisofcongestionIn-NetworkCongestiontransitiontoIncast

• Consideranalreadyformedcongestiontreewhoserootislocatedatsomeintermediateswitchegressport,whichisconnectedtoadownstreamswitch

• Assumeanotherflowreachingthatdownstreamswitchthroughadifferentingressportanddestinedtothesamenodeastheflowsintheexistingtree

Page 38: P802.1Qcz interworking with other data center technologies

• Iftheaggregatedbandwidthrequiredbytherootofthecongestiontreeandtheadditionalflowexceedlinkbandwidth,congestionwillbedetectedatsomeegressportofdownstreamswitch

• Theexistingcongestiontreeismergedwithanewbranch,movingtherootofthecongestiontreetoanegressportofdownstreamswitchA

Congestion trees merge in downstream

direction

switchA

AnalysisofcongestionIn-NetworkCongestiontransitiontoIncast