Jason Zurawski, Internet2 Research Liaison
TRANSCRIPT
Addressing the “things that go bump in the net” – perfSONAR/DYNES/LHCONE
March 20th 2012, OSG/ATLAS/CMS Jason Zurawski, [email protected]
• Current Networking
  – perfSONAR Status (ATLAS, CMS, LHCOPN, LHCONE)
  – Reaching for the Brass Ring (why we monitor)
• Future Networking
  – DYNES
  – LHCONE
2 – 3/19/12, © 2012 Internet2
Agenda
• “In any large system, there's always something broken.”
• Networks are large and complex. There are multiple “layers”, and we employ experts with knowledge of specific parts just to keep things running
  – Anything that can give an expert (or a layman) more insight into what is really going on is very valuable
A Jon Postel quote
• Everyone should be familiar with what perfSONAR is about; this talk is not about that
  – Mentioned in the “2013 NITRD Program Supplement to the President's Budget” (page 50) ‐ http://www.nitrd.gov/PUBS%5C2013supplement%5CFY13NITRDSupplement.pdf
“Why?”
• If you are not running it, this is not a sales pitch
• Things I will highlight:
  – It is being used widely
  – It is finding problems
• USATLAS
  – All Tier2s and Tier1 upgrading to new Dell R310/R610 (available as ‘perfsonar node’ in the portal)
  – Dashboard: https://perfsonar.usatlas.bnl.gov:8443/exda/?page=25&cloudName=USATLAS
  – Other non‐US clouds (Canada, Japan, Italy) coming up as well
• CMS
  – All Tier2s (and Tier1) have monitoring in place. Should be testing to each other
perfSONAR‐PS Status
• LHCOPN
  – All Tier1s and Tier0 have machines in place with tests in place
  – Dashboard: https://perfsonar.usatlas.bnl.gov:8443/exda/?page=25&cloudName=LHCOPN
• LHCONE
  – 16 sites are being monitored as a part of the LHCONE Arch prototype phase. Some are fully configured, others are not (work in progress – Shawn is leading this).
  – Dashboard: https://perfsonar.usatlas.bnl.gov:8443/exda/?page=25&cloudName=LHCONE
perfSONAR‐PS Status – cont.
• Current release – 3.2.1.1
  – Expect a 3.2.2 in mid 2012
  – Bug fixes for the most part, no real ‘new’ features
  – http://psps.perfsonar.net/toolkit
• Items on the longer list:
  – Controlling an entire deployment instead of an individual island (N.B. some are exploring CFEngine and the like in this space)
  – Integrating the tools into a more portable dashboard (basing this heavily on the work by BNL)
  – Bottom line – lots to do, little time and resources to do it (but this isn’t news)
perfSONAR‐PS Software
• Network monitoring is:
  – A way to pick out problems (packet loss, congestion, routing changes, low throughput)
  – Used by operators to find problems before the users (you) find them
  – Used by users (you) to keep the operators honest
• Network monitoring isn’t:
  – An instant way to solve said problems. It will tell you ‘what’; it won’t tell you ‘how’ or ‘why’ without spending some time on the problem
  – Automatic. There is some work that needs to be put in by all levels (operators, VOs, etc.)
“Why Care/Devote Resources?”
• “The Network is Slow”
  – Yes, it’s OK to say this. Don’t overdo it though (e.g. complaining at getting 8.5 Gbps when you got 9.3 Gbps yesterday), and try to provide evidence when you do say it (e.g. your graphs)
• Looking at the regular data (and alarming on it)
  – ATLAS, LHCOPN, etc. have the regular tests for this exact reason
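A minimal sketch of what "alarming on the regular data" could look like, assuming the regular test results have already been exported as a plain list of throughput values in Gbps. The function name, the median baseline, and the 20% threshold are illustrative assumptions, not part of perfSONAR‐PS or any dashboard.

```python
from statistics import median

def throughput_alarm(history_gbps, latest_gbps, max_drop=0.20):
    """Return True when the latest result falls well below the recent baseline.

    history_gbps: earlier regular-test results in Gbps (hypothetical input format).
    max_drop: fraction below the median that counts as a problem (assumed 20%).
    """
    if not history_gbps:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history_gbps)
    return latest_gbps < baseline * (1.0 - max_drop)

# 8.5 Gbps after 9.3 Gbps yesterday is noise, not worth a ticket.
print(throughput_alarm([9.3, 9.2, 9.4], 8.5))   # False
# 2.1 Gbps against the same baseline is worth reporting, with the graphs.
print(throughput_alarm([9.3, 9.2, 9.4], 2.1))   # True
```

A relative threshold like this keeps small day-to-day variation from raising alarms while still flagging a sustained drop.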
Anatomy of a Problem
• One‐off tests
  – Log on to the boxes (it’s easy, just like any other Linux machine) and run some tests. Don’t know how? Ask!
• Escalation
  – You can escalate when you are in over your head. ESnet/Internet2 are here to help.
  – Also – talk to your local IT people so they are aware. They don’t bite.
• Waiting (is the hardest part)
  – Debugging sucks.
  – It takes a long time.
  – It involves multiple parties (this is what makes it take longer)
Anatomy of a Problem – cont.
• 1 of the Transatlantic Link Pairs (New York to Amsterdam)
• Performance bad in one direction (from the EU to the US).
  – No problems seen in the other (US to EU) direction.
  – Common issue – downloaders (e.g. people not in your network) see a problem vs uploaders (people in your network).
• Depending on who/where you are, this may not be an issue for you:
  – US sites ‘downloading’ from EU may see this
  – EU sites that use the NLR routes to reach locations in the US will be affected (NLR uses AMS‐>NEWY route exclusively)
  – EU sites that use the Internet2/ESnet routes through Amsterdam to NY to reach US sites will be affected. If the EU site uses FRANK‐>WASH to reach US, there will be no problem.
Current Problem (some of you know this)
• It was actually – GEANT, Internet2, and ESnet commissioned regular inter‐domain testing between the networks in late 2011
• Reports came in in late 2011
• The hard part(s):
  – Debugging
  – Passive vs Active
  – LHCONE
Why wasn’t this caught?
A Basic Topology
• All of the major networks show up at MANLAN XP
• Recent upgrade to switching fabric
• Major R&E path to Europe is ACE (America Connects to Europe) IRNC Link
  – 2x10G LAGed circuit
• GEANT Amsterdam Exchange feeds into other networks (GEANT, SURFnet, etc.)
An even better topology
• TA circuits are SONET. Ciena CD and Alcatel terminate these on either end
• Switching/routing fabric is connected to these two devices to support more connections (10G Ethernet for the most part)
Where we are spending time right now
• Narrowed the problem as much as possible. Testers on Internet2/GEANT are 1 hop off of the switching fabric on either end (and we still see loss)
• Is this a buffering issue? Is this a protocol issue? Is this an equipment fabric issue?
1st test – interface swapping @ MANLAN
• Configuration change on Ciena (MANLAN) side to verify this device
• Blast through a set number of packets, make sure in and out packet counters agree
  – They did…
2nd – interface swapping @ MANLAN
• Configuration change on Brocade (MANLAN) side to verify this device
• Blast through a set number of packets, make sure in and out packet counters agree
  – They did…
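The counter check used in both device tests can be sketched as follows. This assumes the interface carries only the test burst and that the counters are read by hand before and after the run; the function and argument names are hypothetical.

```python
def counters_agree(sent, in_before, in_after, out_before, out_after):
    """Compare a device's in/out packet counter deltas against a known burst.

    All arguments are plain integers read from the device before and after
    the test (how they are read, e.g. CLI or SNMP, is outside this sketch).
    Returns (ok, in_delta, out_delta).
    """
    in_delta = in_after - in_before
    out_delta = out_after - out_before
    ok = in_delta == sent and out_delta == sent
    return ok, in_delta, out_delta

# The device accounts for every packet of a one-million-packet burst.
print(counters_agree(1_000_000, 500, 1_000_500, 42, 1_000_042))
# (True, 1000000, 1000000)
```

Agreement only clears the device's own accounting. In this case the counters agreed on both devices while the active tests still showed loss end to end.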
3rd – It’s the buffering, stupid
• All of these devices are functioning at 10 Gbps line rates
• Ethernet, SONET, and WAN‐PHY do have minor speed differences
  – A burst of packets on an input could overdrive an output.
  – There needs to be enough buffering to cover these cases
  – Input vs output have different queues
• Buffering was increased to the max – around 32K (yes, this doesn’t sound like a lot, and it’s not. Enough to handle a couple of frames only…)
  – It did reduce the loss percentage
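A rough back-of-the-envelope sketch of why a buffer of this size is small. It assumes "32K" means 32 KiB, and the 10 Gbps and 9.5 Gbps rates are illustrative numbers chosen only to show a small speed mismatch, not measured values for these devices.

```python
def frames_that_fit(buffer_bytes, frame_bytes):
    """Whole frames a buffer can hold."""
    return buffer_bytes // frame_bytes

def max_burst_bytes(buffer_bytes, in_bps, out_bps):
    """Largest back-to-back burst absorbed without loss when input outruns output.

    While the burst lasts the queue grows at (in_bps - out_bps), so the buffer
    fills after buffer_bits / (in_bps - out_bps) seconds; the burst size is the
    number of bytes that arrive in that time.
    """
    if in_bps <= out_bps:
        return float("inf")  # output keeps up; the queue never grows
    return buffer_bytes * in_bps / (in_bps - out_bps)

BUFFER = 32 * 1024  # "32K", assumed here to mean bytes
print(frames_that_fit(BUFFER, 9000))  # 3 jumbo frames
print(frames_that_fit(BUFFER, 1500))  # 21 standard frames
# Illustrative rates only: a 10 Gbps input feeding a 9.5 Gbps output.
print(max_burst_bytes(BUFFER, 10e9, 9.5e9))  # 655360.0
```

With jumbo frames the buffer holds only three frames, which matches the "couple of frames" remark, and any burst longer than the absorbed size starts to drop packets.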
New testing (~1 week from now)
• Protocol encapsulation is tricky
  – Ethernet frame is shoved into a SONET frame for transit
  – WAN‐PHY (a form of Ethernet w/ extra encapsulation) would be in the same boat
  – Is the translation getting garbled? Note that some devices will happily pass a bad packet on a given layer, and as it gets handed back up, error correction will reject it.
• Testing these theories is a bit invasive, so it’s taking a little time to schedule
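A toy illustration of the layering point above: a lower layer can carry a damaged frame without complaint, and the damage only shows up when the frame is handed back to the layer that checks it. Here `zlib.crc32` stands in for the Ethernet frame check sequence, and `carry` is a hypothetical stand-in for the SONET transport; neither models the real framing.

```python
import zlib

def add_fcs(payload: bytes) -> bytes:
    """Append a CRC-32 trailer, standing in for the Ethernet frame check sequence."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def fcs_ok(frame: bytes) -> bool:
    """Verify the trailer the way the receiving Ethernet layer would."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer

def carry(frame: bytes, flip_bit_at=None) -> bytes:
    """Toy transport layer: moves opaque bytes and never inspects the frame."""
    if flip_bit_at is None:
        return frame
    damaged = bytearray(frame)
    damaged[flip_bit_at] ^= 0x01  # simulate garbling during translation
    return bytes(damaged)

frame = add_fcs(b"LHC data")
print(fcs_ok(carry(frame)))                 # True
print(fcs_ok(carry(frame, flip_bit_at=2)))  # False: rejected only after hand-back
```

The transport layer in this sketch has nothing to count as an error, which is one way loss can be invisible to the counters on the device that caused it.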
• Test Coverage ‐ B+
  – Internet2, ESnet, GEANT, and the experiments all have testers available
  – Some of the GEANT testers are limited in functionality
• “Reportability” ‐ D
  – I took the role of ‘user’ this time. My ticket was closed 3 (!) times:
    • The day after I opened it, because there were no counters reporting loss. It was re‐opened after I complained they had to “try harder”
    • 1 week later, after testing in MANLAN revealed no issues (I was told to “go ask someone else”). It was re‐opened after I noted the problem is not solved from a “user” perspective
    • 1 week after that, when I was told “open tickets count against the engineer assigned” [maybe they are not fed that day?]. I let it be closed this time, and dealt with my tickets in other systems
  – This is something that needs to be fixed
Where it worked, where it isn’t working
• NOC to Customer Interactions ‐ C‐
  – NOC treated report with skepticism, calling it ‘my’ packet loss (e.g. they don’t trust the measurement tools, and look to the passive counters as the law of the land)
  – I had to escalate this into management to keep things ‘open’. Strong desire to close tickets that are viewed as ‘not my problem’. There is no home for the homeless…
• NOC to NOC Interactions ‐ B‐
  – NOCs coordinate resources well, but timelines to find a fix are slow. A downtime of 5 minutes is scheduled a full 2 weeks out, and only after approval at high levels
• Getting a resolution ‐ Incomplete
  – More testing is needed/is expected.
  – This is a very challenging problem, and the time it has taken to solve reflects this (e.g. no clear sign of packet loss on devices, but applications react poorly).
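One reason applications can react poorly even when loss is too small to stand out on device counters is that TCP throughput on a long path is very sensitive to loss. The sketch below uses the widely cited Mathis et al. approximation, which is not taken from the talk; the 1460-byte MSS and the 90 ms round-trip time are assumed values for a transatlantic path.

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on steady-state TCP throughput (Mathis et al.).

    rate <= (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2)
    """
    if loss_rate <= 0:
        raise ValueError("the approximation needs a non-zero loss rate")
    return (mss_bytes * 8 / rtt_s) * sqrt(1.5) / sqrt(loss_rate)

# Assumed values: 1460-byte MSS and a 90 ms transatlantic round trip.
for loss in (1e-6, 1e-4, 1e-2):
    mbps = mathis_throughput_bps(1460, 0.090, loss) / 1e6
    print(f"loss {loss:.0e}: about {mbps:.1f} Mbps per TCP stream")
```

Under these assumptions even a loss rate of one packet in ten thousand caps a single stream far below the 10 Gbps line rate, which is why active tests can show a problem that passive counters barely register.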
Where it worked, where it isn’t working
• Jason
  – Still trying to update ATLAS/CMS when I hear news
  – Staying on top of them to get this fixed (there are still some that deny this exists)
• USATLAS Throughput Group
  – Thinking about the process to recommend for the end scientist/site to report issues in a trackable manner
  – “Customers” to the networks, use that relationship when possible
• Networks
  – Do a better job of coordinating resources and responding to problems
Actions
• What is it – read the content here if you need to: http://www.internet2.edu/dynes
• Basic idea:
  – Provide hardware and Open Source software to address data intensive science on campuses
    • Switch, data movement server, controller PC for hardware
    • FDT, OSCARS, and perfSONAR for software
    • Goal is to encourage campuses to create a research grade network (e.g. the ‘science dmz’ ‐ http://fasterdata.es.net/fasterdata/science-dmz/)
  – Can’t provide raw capacity, but is a tool to manage existing capacity
    • Layer 2 networking (e.g. dynamic capacity – possibility of bandwidth guarantees)
    • End to end ‘circuit’ capabilities (e.g. protected VLANs)
Updates on DYNES
Campus Nets – Clogging Ur Tubes
[Diagram label: “Internets”]
Campus Nets – What about Science?
[Diagram label: “Internets”]
Campus Nets w/ DYNES Vision
[Diagram labels: “Internets”; Encapsulated Layer 2 (MPLS)]
Updates on DYNES – cont.
• Status (see web for more details):
  – Group A (~9 sites), deployed and working
  – Group B (~11 sites), deployed, and starting to come online
  – Group C (~14 sites), ordered and being configured, deployment in the next month
  – We have funding left if you are not connected, and are still interested
• Related Work:
  – Working w/ AMPATH and RNP in Brazil to connect OSCARS circuits to research facilities (e.g. SPRACE). Demos were done last year and were successful.
  – Early talks with LSST (telescope in Chile) to support management of data flows approaching 80 Gbps in 2020
  – Early talks with Globus Online to integrate support into this tool to reach DYNES sites using OSCARS and traditional IP networking
Updates on DYNES – cont.
• Where do we go from here?
• Applications
  – FDT is integrated and can use the APIs to use Layer 2 technologies (OSCARS/ION + maybe someday soon ‘OpenFlow’)
  – What about PhEDEx/DQ2 directly?
  – FTS (since this is the scheduling bit under the data movers)
  – What about the underlying OSG tools?
    • Which ones make sense, SRM? Others?
  – Why integrate an application?
    • Layer 2 technologies are ‘HOT/FAST Lane’ compared to campus IP. Can give you a direct path to the Campus WAN and through the regional network (congestion free)
    • IP connectivity may ‘work’, but it’s hard to manage end to end (especially for TCP)
    • Data movers that can take advantage of this are more likely to get resources in constrained environments
DYNES Open Questions/Next Steps
• Network
  – LHCONE (see next) will have support for Layer 2 services
  – Regionals/Campuses in the US are being invited to participate in Layer 2 networks
    • DYNES via Internet2 ION/ESnet SDN, etc.
    • OpenFlow is gaining a lot of traction
• Vision (being implemented by some already): intelligent applications that make the choice for the user.
  – Don’t have to care about the network on the bottom, things just ‘work’
  – Let the scientists be scientists, not engineers
DYNES Open Questions/Next Steps
• In case you missed it…
  – No more global VLAN (not scalable, too much of a pain)
  – Direct L2 circuits (e.g. through OSCARS or similar technologies) still being explored
  – Current work is on Islands of L3VPNs
    • VRF – Virtual [VPN] Routing and Forwarding is being used
• Purpose?
  – Allows participants to move traffic between one another as needed.
  – Built using available components of the R&E networking infrastructure (e.g. ESnet, GEANT, Internet2, USLHCnet, ACE, CERNLIGHT, Starlight, MANLAN, etc.)
LHCONE
LHCONE – The Idea
• How is this done?
  – It is possible to implement a shared broadcast domain using a specific IP prefix, or it can be implemented via a VRF
    • Virtual routers (Internet2) vs dedicated resources (e.g. Starlight Cisco)
  – Difference between this and shared VLAN:
    • There are routed boundaries between portions of the shared structures
    • There is a requirement for the exchange of routing information across those boundaries.
      – This information will be exchanged using BGP.
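A toy model of the VRF idea, assuming nothing about how any LHCONE participant actually configures routers: each VRF keeps its own routing table, and a lookup in one VRF never sees routes learned in another. The class, the prefixes (documentation ranges), and the next-hop names are all made up for illustration.

```python
import ipaddress

class Vrf:
    """A toy VRF: one private routing table, separate from every other VRF."""

    def __init__(self, name):
        self.name = name
        self.routes = {}  # ip_network -> next hop label

    def learn(self, prefix, next_hop):
        """Install a route, e.g. one received from a neighbor over BGP."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, address):
        """Longest-prefix match within this VRF only; None if no route."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]

# Documentation prefixes and site names are made up for illustration.
lhcone = Vrf("LHCONE")
general = Vrf("general-ip")
lhcone.learn("192.0.2.0/24", "tier2-site-a")
general.learn("0.0.0.0/0", "commodity-upstream")

print(lhcone.lookup("192.0.2.10"))     # tier2-site-a
print(lhcone.lookup("198.51.100.1"))   # None: not a participant prefix
print(general.lookup("192.0.2.10"))    # commodity-upstream
```

The separation is the point: only prefixes exchanged across the routed boundaries end up in the LHCONE table, so traffic to anything else cannot use that path.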
LHCONE Guts
LHCONE – More Exact
• Hard for me to answer this – I am not the user
• As “users”, you all have some important things to do:
  – Do your science as before
  – Can you reach the places you need to reach? Are things any better or worse than before?
  – Is your life measurably better with LHCONE vs without (don’t answer this now, have a cookie or something first)
    • Since this is ‘just the network’ you may not even notice (unless it’s not working)
• All kidding aside – the next steps for this lie with the stakeholders, and it is anticipated that you will ‘vote’ with your opinions as well as funding dollars.
LHCONE – What’s Next?
• Monitoring
  – Monitoring is not a sexy topic, it’s a means to an end
  – We (networks, as well as VOs) need it to make sure that things are working so that users (all of you) aren’t sad
• L2 & Advanced Networking
  – Lots of opportunity to use new technologies
  – Hard sell to add features into applications
  – We (network providers) can ‘help’ with adaptations, but we don’t have the manpower/funding to lead in this area.
Closing Thoughts
Addressing the “things that go bump in the net” – perfSONAR/DYNES/LHCONE
March 20th 2012, OSG/ATLAS/CMS Jason Zurawski, [email protected]
For more information, visit http://www.internet2.edu/research/