ARCHER Service 2015 Annual Report
Document Information and Version History

Version:      1.0
Status:       Final
Author(s):    Alan Simpson, Anne Whiting, Stephen Booth, Andy Turner, Felipe Popovics, Steve Jordan, Harvey Richardson, Mike Brown, Lorna Smith
Reviewer(s):  Alan Simpson, Lorna Smith, Steve Jordan

Version  Date        Comments, Changes, Status                      Authors, contributors, reviewers
0.1      2016-01-13  Inputting initial information                  Anne Whiting, Jo Beech-Brandt, Andy Turner, Mike Brown, Lorna Smith
0.2      2016-01-20  Incorporating individual team input            Anne Whiting, Jo Beech-Brandt, Andy Turner, Mike Brown, Lorna Smith
0.3      2016-01-22  Incorporating Cray CoE and service reports     Steve Jordan, Felipe Popovics, Harvey Richardson
0.4      2016-01-22  Updated highlights, reviewed and commented     Andy Turner
0.5      2016-01-22  Added utilisation graphs                       Jo Beech-Brandt
0.6      2016-01-26  Internal review                                Alan Simpson
0.7      2016-01-27  Updates post review                            Anne Whiting, Harvey Richardson, Andy Turner, Mike Brown
1.0      2016-01-28  Version sent to EPSRC                          Alan Simpson, Anne Whiting
1.1      2016-05-05  Minor change for website                       Jo Beech-Brandt, Alan Simpson
Table of Contents

Document Information and Version History
1. Introduction
2. Executive Summary
3. Service Utilisation
    3.1 Overall Utilisation
    3.2 Utilisation by Funding Body
    3.3 Additional Usage Graph
4. User Support and Liaison (USL)
    4.1 Helpdesk Metrics
    4.2 USL Service Highlights
5. Operations and Systems Group (OSG)
    5.1 Service failures
    5.2 OSG Service activities
6. Computational Science and Engineering (CSE)
    6.1 Best Practice for Data Management on ARCHER
    6.2 The ARCHER Driving Test: encouraging new users onto the ARCHER service
    6.3 Wee ARCHIE: a Raspberry Pi cluster to educate the next generation of HPC users
    6.4 Women in HPC
    6.5 Competitive eCSE Programme
7. Cray Service Group
    7.1 Summary of Performance and Service Enhancements
    7.2 Reliability and Performance
    7.3 Service Failures
8. Cray Centre of Excellence (CoE)
    8.1 CoE Project Highlights
    8.2 Filesystem and I/O
    8.3 Training and Workshops
    8.4 ARCHER Queries and Software
    8.5 eCSE Meetings
1. Introduction

This annual report covers the period from 1 Jan 2015 to 31 Dec 2015. The report has contributions from all of the teams responsible for the operation of ARCHER:
• Service Provider (SP), containing both the User Support and Liaison (USL) Team and the Operations and Systems Group (OSG);
• Computational Science and Engineering Team (CSE);
• Cray, including contributions from the Cray Service Group and the Cray Centre of Excellence.

The next section of this report contains an Executive Summary for the year. Section 3 provides a summary of the service utilisation. Section 4 provides a summary of the year for the USL team, detailing the Helpdesk Metrics and outlining some of the highlights for the year. The OSG report in Section 5 describes their four main areas of responsibility: maintaining day-to-day operational support; planning service enhancements in a near to medium timeframe; planning major service enhancements; and supporting and developing associated services that underpin the main external operational service. In Section 6 the CSE team highlight some of their key projects from the year. They describe the work with the Consortia Contacts, the eCSE Programme, Women in HPC and the distributed training activity. The ARCHER Image Competition is also described. In Sections 7 and 8, the Cray Service team and Cray Centre of Excellence, respectively, give a summary of their year’s activities.

This report and the additional SAFE reports are available to view online at http://www.archer.ac.uk/about-us/reports/annual/2015.php
2. Executive Summary

The sections from the various teams describe highlights of their activities. This section gives a brief summary of highlights from the first year of the overall ARCHER service. More details are provided in the appropriate section of the document.
• Broadening Access to HPC: Increasing the applicability of, and broadening access to, advanced computing for UK research is of key strategic importance. The ARCHER service has piloted a number of initiatives aimed particularly at increasing the diversity of research on the system. The ARCHER Driving Test provides a simple way for researchers new to HPC to gain initial training in the field and a modest amount of HPC resource to explore the possibilities of HPC in their research. The eCSE programme has included software development effort prioritising research communities that are new to HPC to boost their software to a level where it can exploit facilities such as ARCHER.
• Business Case for Future Systems: The service has worked closely with the research councils to gather information to support the business case for future HPC investment. The SAFE system has provided an invaluable resource allowing us to analyse how ARCHER (and previous systems) is used and by whom.
• ARCHER Outreach Project: This project was funded by EPSRC in 2015 to promote engagement and diversity in UK HPC, demonstrate impact from ARCHER, and enhance outreach activities. In engagement and diversity, 2015 has seen the initiation of the ARCHER Champions HPC peer support initiative, expansion of the Women in HPC network, and development of a Faces of HPC diversity website. In impact, a number of case studies have been developed and published. In outreach, Wee ARCHIE has been developed: a portable HPC cluster designed to promote the national HPC service to young people and the general public at a series of UK-wide outreach activities.
• Major Incident Management: Technical and management staff from all service partners collaborated effectively to resolve the problems arising from the Lustre filesystem issues, and provided successful and innovative solutions to minimise the impact on users and their work. Many of these solutions have been incorporated as ongoing service delivery improvements, providing a more robust service for the future. All service partners are participating in further initiatives to improve coordination and information sharing, as well as the joint Major Incident Plan.
• Utilisation of the service has remained very high and has grown steadily through 2015. Although this has been a challenging year for the service due to issues with the Lustre filesystems, positive collaboration between all service partners has minimised the impact on the users, maintaining a high utilisation level. The majority of the compute cycles have been expended on jobs exploiting hundreds or thousands of cores, which are difficult to run on smaller HPC systems.
• In total, the Service dealt with around 8,100 queries during 2015, meeting all query targets. Resolving user queries promptly so that the resolution allows users to maximise their research on the service is only possible due to close and effective collaboration between all service partners.
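3. Service Utilisation

3.1 Overall Utilisation

Utilisation over the year was 87%, which is similar to the percentage utilisation for 2014. However, following the Phase 2 upgrade, which took place in late 2014, the capacity of ARCHER was increased by 60%, so a similar percentage utilisation represents considerably more work delivered in absolute terms.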
3.2 Utilisation by Funding Body

The utilisation by funding body relative to their allocation can be seen below.

This bar chart shows the usage of ARCHER by the two Research Councils presented as a percentage of the total Research Council allocation on ARCHER. The uncharged proportion for EPSRC includes the temporary project v01 that was put in place during the filesystem issues.
3.3 Additional Usage Graph

The following graph provides a view of the distribution of job sizes on ARCHER.

The graph shows that most of the kAUs are spent on jobs between 257 cores and 8192 cores. The number of kAUs used is closely related to money and shows how the investment in the system is utilised.
4. User Support and Liaison (USL)

4.1 Helpdesk Metrics

Query Closure

It was a busy year on the helpdesk but all Service Level Agreements were met. A total of 7874 queries were answered by the Service Provider, and over 98.5% were resolved within 2 days. In addition to this, the Service Provider passed on 296 in-depth queries to CSE and Cray.

                     15Q1   15Q2   15Q3   15Q4   TOTAL
Self-Service Admin   1722   1172    775   1564    5233
Admin                 654    616    408    601    2278
Technical             118     91     67     87     363
Total Queries        2494   1879   1250   2252    7874
Other Queries

In addition to the Admin and Technical Queries detailed above, the Helpdesk also dealt with Phone queries, Change Requests, internal requests and User Registration.

                             15Q1       15Q2       15Q3       15Q4      TOTAL
Phone Calls Received         135 (41)   100 (22)   104 (20)   92 (14)   431 (97)
Change Requests                8          7          5          5         25
User Registration Requests   313        214        220        302       1049

The numbers shown in brackets for the phone calls received are the calls resulting in new or updated queries. It is worth noting that the volume of telephone calls was low throughout the year. Of the 431 calls received in total, only 97 (22.5%) were actual ARCHER user calls that resulted in queries. The trend through the year has been a falling number of actual ARCHER calls resulting in a query. All phone calls were answered within 2 minutes, as required.
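The 22.5% figure and the falling trend can be checked directly from the phone-call table. The following is a minimal Python sketch for that purpose; the variable names are illustrative only and the numbers are copied from the table in this section.

```python
# Phone calls received per quarter in 2015 and, of those, the calls that
# resulted in new or updated queries (figures copied from the table).
calls_received = {"15Q1": 135, "15Q2": 100, "15Q3": 104, "15Q4": 92}
calls_with_query = {"15Q1": 41, "15Q2": 22, "15Q3": 20, "15Q4": 14}

total_received = sum(calls_received.values())
total_with_query = sum(calls_with_query.values())

# Share of all calls that turned into a query, as a percentage.
share = 100.0 * total_with_query / total_received

# The same share per quarter, to make the trend through the year visible.
quarterly_share = {
    quarter: 100.0 * calls_with_query[quarter] / calls_received[quarter]
    for quarter in calls_received
}

print(f"{total_with_query} of {total_received} calls ({share:.1f}%) resulted in queries")
for quarter, value in quarterly_share.items():
    print(f"{quarter}: {value:.1f}%")
```

The per-quarter shares decrease from the first quarter to the last, which is consistent with the falling trend described in the text.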
4.2 USL Service Highlights

Filesystem issues and improvements arising

Major service disruption was experienced in May and June due to Sonexion filesystem issues. In conjunction with work to resolve the issues, successful measures were put in place to minimise the user impact. Collaborative working between all service partners and carefully constructed and targeted user communication were key to this. The success of the measures could be seen in the 83% utilisation maintained during May and June, with the temporary filespace utilisation accounting for 46% of the utilisation for the period. There were minimal user complaints received during the period of disruption, and the user community expressed appreciation of the effort taken to keep the service running. Many of the measures devised and implemented to minimise user impact and downtime are now included as standard processes and functionality. Recommendations from the lessons learned reports are also being implemented. The measures implemented included:
• SAFE functionality to be able to lock job submission on a per-filesystem basis (this prevents users from wasting resources when their filespace is not available for running jobs).
• Provision of temporary project space when a particular filesystem is unavailable, to allow users who are affected by filesystem issues to keep running calculations if possible.
• Move to resilient package installation across all filesystems, to enable users to access packages independently of any particular filesystem being unavailable.
• An improved coordinated Major Incident Procedure.
Period allocations for consortia and large research groups were successfully implemented and then staggered

In Q1 of 2015, under the direction of EPSRC, 6-monthly period allocations were introduced. This was done to help ensure that projects used their allocations more evenly over the lifetime of the project. This change has had a positive effect, though the simultaneous ending of a large number of both EPSRC and NERC allocations in March 2015 caused the machine to be very busy. Since then the EPSRC project allocations have been staggered throughout the year to avoid a recurrence of this issue. The impact of these changes has been measured using the Scheduling Coefficient report. These reports show no recurrence of the problems from March 2015.
SAFE changes

Changes have been made to SAFE this year to support service improvements. These include:
• The implementation of sub-project management, allowing the PIs to devolve management of parts of a project to project managers;
• The move to an improved reporting engine to speed up the creation of user reports;
• The addition of career stage monitoring, in particular to allow EPSRC to track the number of early career stage researchers;
• And, the implementation of automatic tweeting of user mailings to increase the mailing delivery options.
UK-Federation authentication to SAFE implemented

UK-Federation authentication to the SAFE was implemented, allowing users to authenticate with the same credentials as for their home institution. The impact of this was to reduce the number of credentials that the user needs to remember and troubleshoot. 215 users have signed up to use this functionality to date.
5. Operations and Systems Group (OSG)

5.1 Service failures

There were no service failures in the period as defined in the metric.

5.2 OSG Service activities

Principal activities undertaken (in addition to day-to-day operational cover) included:

(1) Operating system and applications software support:
    a. planning and implementing CLE 5.2 upgrade on the XC30;
    b. installing regular compiler and programming development upgrades;
    c. supporting OS enhancements to external login nodes.
(2) Resource management:
    a. PBS queue enhancements such as the SHORT development queue and further support for creation of advanced reservations;
    b. assessing and monitoring problems with the job scheduling cycle.
(3) Storage:
    a. significant involvement in the handling of major storage problems encountered during the year;
    b. upgrade of Sonexion (Lustre) filesystem software;
    c. further integration of the RDF into the operational environment.
(4) System monitoring:
    a. further enhancement of use of external monitoring tools such as Nagios and OMD;
    b. expansion of internal system health checks.
(5) System administration:
    a. development and expansion of automated ticket handling;
    b. refinement of locally-developed systems administration tools;
    c. integration of the RDF data-analysis cluster into the wider operational configuration.
(6) Communications:
    a. installation and configuration of multiple 40G connections to JANET core network;
    b. further hardening of internal ACF networks that underpin both external operational and internal secure management services.
(7) Service support systems:
    a. further development of automated failover of hypervisor-based virtual servers that provide resilient services such as SAFE, website and wiki.
(8) Supporting Cray hardware operations:
    a. providing additional on-site support for Cray personnel during major hardware upgrade operations (such as the optical cable re-work).
(9) Security:
    a. implementing enhancements to security monitoring;
    b. installing Cray-supplied security field notices;
    c. providing additional hardening of security measures; specific details are not available for obvious reasons.
6. Computational Science and Engineering (CSE)

These are selected highlights from the CSE Service during 2015. Full operational details on the CSE service (including metrics) can be found in the quarterly reports on the web.

6.1 Best Practice for Data Management on ARCHER

The amount of data required and produced by modelling and simulation is increasing year on year. This is reflected in the fact that data management and file system (IO) performance are now major concerns for many ARCHER users. Until relatively recently, neither of these were issues that concerned the majority of users. There is a lack of generally-available material on these topics for HPC users, and it also tends to be an area where many HPC users have little expertise or experience.

In the second half of 2015, the CSE service focused on providing a set of practical resources for ARCHER users with the aim of improving their data management and/or IO performance on ARCHER and the RDF. We provided both general advice, and advice targeted at specific application users where we are aware of particular issues with data management. In particular, we have produced:
• Data Management Guide on the ARCHER website, covering:
    o Archiving data to the RDF
    o Data transfer between ARCHER and the RDF
    o Data transfer to/from external sites to ARCHER and the RDF
    o Different ARCHER and RDF file systems and their use
• White Paper on Performance of Parallel IO on ARCHER:
    o Initially covering MPI-IO performance on ARCHER Lustre file systems
    o Currently expanding work to look at NetCDF and HDF5 performance
    o Working with the DiRAC facility to compare performance across different file system and vendor architectures
• Webinars:
    o Data Management: best practice in using tools to manage data on ARCHER and the RDF, including how to efficiently move data between the different file systems.
    o Using OpenFOAM on ARCHER: the popular Open Source CFD software OpenFOAM has particular issues with the numbers of files it can produce when run in parallel. This webinar raised awareness of these issues in the user community and provided advice for how to deal with the problems.
    o Lustre and IO Tuning: provided a description of the ARCHER Lustre file systems, where users may see issues with performance, and tips for getting best performance out of the file systems depending on your usage pattern.
• Training:
    o Data management and IO performance best practice has been built into our Introductory face-to-face courses and the online ARCHER Driving Test.
    o Advanced material on parallel IO performance has been used as the basis of the Efficient Parallel IO on ARCHER course run in December 2015 in Oxford.
6.2 The ARCHER Driving Test: encouraging new users onto the ARCHER service

The ARCHER Driving Test was launched at the start of the year to give a mechanism for new users easily to gain access to the service whilst also ensuring that they had enough knowledge of HPC to make use of their ARCHER account. The test at https://www.archer.ac.uk/training/course-material/online/driving_test.php comprises 20 questions chosen randomly from a bank of 60, distributed to ensure coverage of all aspects of the system:
Category          # questions
Hardware           2
I/O                3
Programming        4
Compiling          3
PBS                1
Running jobs       3
Random category    4
Total             20
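One way such a test paper could be assembled is sketched below in Python. This is an illustration only: the report does not describe how the selection is implemented, nor how the 60-question bank is divided between categories. The `build_bank` helper, the assumed ten questions per category, and the placeholder question identifiers are all hypothetical. The sketch draws the fixed quota from each category and then fills the four "random category" slots from the questions that remain.

```python
import random

# Fixed per-category quotas from the table (16 questions in total),
# plus 4 further questions drawn from any category.
QUOTAS = {
    "Hardware": 2,
    "I/O": 3,
    "Programming": 4,
    "Compiling": 3,
    "PBS": 1,
    "Running jobs": 3,
}
RANDOM_SLOTS = 4


def build_bank(per_category=10):
    """Hypothetical question bank: 6 categories x 10 placeholders = 60."""
    return {
        category: [f"{category} Q{i + 1}" for i in range(per_category)]
        for category in QUOTAS
    }


def pick_questions(bank, rng):
    """Draw the quota from each category, then fill the random slots."""
    chosen = []
    for category, quota in QUOTAS.items():
        chosen.extend(rng.sample(bank[category], quota))
    already_used = set(chosen)
    remaining = [
        question
        for questions in bank.values()
        for question in questions
        if question not in already_used
    ]
    chosen.extend(rng.sample(remaining, RANDOM_SLOTS))
    return chosen


bank = build_bank()
# A seeded generator keeps the example reproducible; a real test would not fix the seed.
test_paper = pick_questions(bank, random.Random(2015))
print(len(test_paper), "questions selected")
```

Sampling without replacement within each category, and again from the leftover pool, guarantees the minimum coverage in the table while never repeating a question.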
The test is also supported by online training material, including slides and video lectures, addressing all the areas covered by the test. The test is promoted at ARCHER training courses and is also mentioned in the email sent to all attendees after the course is finished, where we encourage them to fill in the feedback form. In the first year, the test was successfully completed by 122 people, 82 of whom have gone on to obtain accounts on ARCHER; those passing the test are sent a certificate of completion. After an initial burst of interest, take-up has remained very consistent throughout the year:
It is interesting to note that, from Q2 onwards, almost all new users have become active users (i.e. have submitted compute jobs). In total, some 34,600 kAUs have been spent by these 62 users, an average usage of around 560 kAUs; a typical active user is therefore spending almost half of their total allocation of 1,200 kAUs. The driving test has been a great success and shows every sign of continuing to attract new users for the remainder of the ARCHER service.
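The average quoted above follows from simple division. A short Python check, using only the three figures given in the paragraph (the variable names are illustrative):

```python
# Figures quoted in the text for Driving Test users who became active.
total_kau_spent = 34_600      # kAUs spent in total
active_users = 62             # users who have submitted compute jobs
allocation_kau = 1_200        # kAUs allocated to each Driving Test account

average_kau = total_kau_spent / active_users
fraction_of_allocation = average_kau / allocation_kau

print(f"average usage per active user: {average_kau:.0f} kAUs")
print(f"fraction of allocation used:   {fraction_of_allocation:.1%}")
```

The result is slightly under 560 kAUs per user and a little below half of the allocation, in line with the rounded figures in the text.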
6.3 Wee ARCHIE: a Raspberry Pi cluster to educate the next generation of HPC users

The ARCHER Outreach project aims to engage new communities and the next generation to take advantage of HPC technologies. However, one common problem in reaching out to these communities is helping them to understand the relationship between everyday computing, be it through a tablet, laptop, or smartphone, and supercomputing or ‘high performance computing’. The ARCHER team appreciates the importance of helping everyone to understand how HPC can improve their science and allow the community to complete more science, and of ensuring that the next generation understands HPC is a tool for all, not just the few lucky enough to work at an institution with an HPC resource. So we developed Wee ARCHIE.

Wee ARCHIE has been designed and built to help explain what HPC is, the difficulties in using such technologies, and also the possibilities available when using HPC platforms. While Wee ARCHIE is only a model of a real HPC cluster, it has all the key components. The cluster consists of 18 Raspberry Pi 2s, each of which has four cores and so plays the role of a four-core node. The cluster has been designed to enable explanation of the hardware and how the components are connected and interact. The processors, switches, power supply units and networking cables are all visible through a Perspex case, which is designed to be highly portable, enabling us to take it to outreach events around the UK. Each Raspberry Pi has also been fitted with an LED array to allow us to show when the Pi is active, and future code development will allow us to show the load on each ‘node’, enabling us to teach people about the importance of load balancing.

The design plans for the Wee ARCHIE cluster will be made available online in 2016, enabling anyone to purchase and build their own cluster. We will also be developing a range of software to highlight the advantages and also the difficulties of using HPC, and the importance of using the right tool for your problem. Wee ARCHIE will be taken to a series of outreach events in 2016, including the Big Bang Fair at the NEC, Birmingham, in March 2016.
6.4 Women in HPC

Women in HPC started in 2013, with the official launch in April 2014, and has become an internationally recognised name in the last year. In 2015, the Women in HPC initiative went from the two events held in 2014 to seven different events, the launch of a new website, the signing of our first international Women in HPC partner organisation and winning the HPCWire Readers' Choice Award for Diversity.

During the last year, Women in HPC has participated in three international conferences: PRACEdays15, Dublin, Ireland; ISC 2015, Frankfurt, Germany; and Supercomputing 2015, Austin, USA. At each conference we have had an array of events, including ‘Birds of a Feather’ discussions, workshops, training sessions and networking receptions.

In September 2015, Women in HPC ran the first ever Women in HPC careers event in collaboration with BCSWomen, in London, bringing together leading women working with HPC in the UK to discuss career opportunities with early career women interested in a career in HPC or looking for a new direction to follow within the HPC community. The day culminated with a speed networking session which, despite many participants being apprehensive about it beforehand, was the best-received activity of 2015.

At ISC 2015, Women in HPC signed an agreement with Compute Canada as the first international partner to work with Women in HPC. The partnership enables the establishment of a Canadian Women in HPC chapter organisation, which will run training events and networking sessions aimed at the Canadian HPC community, and share information and ideas with Women in HPC. This is a model that Women in HPC is in the process of developing, with the plan to establish chapters and partnerships around the globe, providing the opportunity for women in the HPC community to network internationally as well as the opportunity to encourage other women to move into a career within the HPC community.

In 2016, Women in HPC is set to expand, signing up additional international and regional partners to establish best practice in broadening participation in the HPC community around the world. We will also be working with a variety of conferences as well as expanding our dissemination activities.
6.5 Competitive eCSE Programme

The embedded CSE (eCSE) programme provides funding for 14 FTEs embedded directly into the scientific community through a series of competitive, peer-reviewed calls. 2015 saw a high demand from the community for funding, resulting in a very high quality threshold. Over the course of the six calls, 54 projects have been funded. These projects have made a significant impact on the quality and performance of the software suite on ARCHER: more than ten of the most heavily used codes on the system have benefitted from eCSE investment effort. This in turn has facilitated greater scientific output and impact, allowing previously untenable science.

The programme has a focus on early career researchers and on developing the UK software skills base. The distributed and embedded nature of the programme allows for this skills development to be spread across the whole of the UK, and a key highlight of the programme has to be the fact that staff from ~30 institutions from a wide geographical distribution have benefitted from eCSE investment. Coupled with this, we will have early career researchers observing at future panel meetings. The aim is to give them a better insight into the mechanism of selection to assist in their future preparation of funding proposals.

A final highlight is the successful new communities programme that encourages proposals from new communities, looking to enhance the diversity of science being carried out on ARCHER. Over the three eCSE calls that have included this initiative we have received thirteen new community applications.
7. Cray Service Group

7.1 Summary of Performance and Service Enhancements

2015 has been another strong year for the ARCHER service. Overall system reliability and utilisation of resources have continued to be at a high level. Where technology areas have performed below the high standards expected, corrective action has been taken to resolve issues with the minimum amount of disruption and after careful consultation with service partners. More details of specific technology failures can be found in the table and associated descriptions below.

7.2 Reliability and Performance

The performance and reliability of the hardware and software technologies underpinning the ARCHER service continue to be of a high standard. New versions of software that provide feature enhancements and bug fixes to the user community are continually under development and are implemented on the ARCHER service following periods of evaluation on appropriate test platforms.

Large, complex HPC systems such as ARCHER are not immune from technology failures, but under most circumstances those failures can be managed by utilising well-designed resiliency features and robust configurations. On occasions, technology failures do result in impact upon the user community. The most significant technology area of the ARCHER service where issues were encountered in 2015 was the parallel Lustre filesystem and associated storage components. Acknowledging that improvements could be made in both hardware and software areas of the storage subsystem, these improvements were forthcoming and integrated with a minimum of disruption to the user community.
7.3 Service Failures

Seven unscheduled incidents classified as full service failures were encountered during 2015. As can be seen in the table below, six of these failures occurred in the first half of the year, with a much-improved performance and only a single service failure in the second half of the year.

Incident  Date       Description
1         08-Jan-15  System reboot required following storage controller failure
2         06-May-15  Storage failure on Lustre filesystem /fs3
3         07-May-15  Storage failure on Lustre filesystem /fs2
4         13-May-15  System reboot following PBS Pro batch system server failure
5         10-Jun-15  System reboot following a failure in the system boot RAID device
6         30-Jun-15  Running user work lost following PBS Pro batch system becoming unresponsive
7         06-Oct-15  System reboot following unintended initialisation of system components
The details of these seven technology service failures were:
• One service failure due to a Lustre storage controller fault requiring a system reboot to clear.
• Two service failures due to multiple storage component failures affecting two different Lustre parallel filesystems.
• Two service failures due to problems related to the PBS Pro batch subsystem, which caused the loss of running user work.
• One service outage due to a controller failure in a boot RAID device, which houses the operating system filesystems for the ARCHER service.
• One service failure caused by the accidental use of an initialisation command on system components.
8. Cray Centre of Excellence (CoE)

Michael Neff joined the CoE and brings specific expertise in computational chemistry to the CoE.

8.1 CoE Project Highlights
HiPSTAR

A case study was produced on some previous work done by the CoE on the HiPSTAR code from the University of Southampton. In a previous CoE project, OpenMP was added to HiPSTAR, improving the code scalability considerably. This OpenMP work was then used as a basis for an OpenACC port of the application (done in conjunction with the users by the ARCHER and ORNL CoEs). The OpenACC port of the application allowed the users to run HiPSTAR at very large scale on the Titan system at ORNL, and to form a collaboration with GE in this work. This work has been documented in a Cray case study: http://www.cray.com/sites/default/files/XC30-ARCHER-HiPSTAR-0315.pdf
ONETEP

We expended a small amount of effort during the year (via other UK Applications Staff) supporting a Poisson-Boltzmann Equation solver for ONETEP (in collaboration with the University of Southampton). An eCSE proposal for further work with ONETEP and CASTEP is about to be submitted, and it is our intention that continuing support from the CoE would be provided if the project were to be funded.
Hadoop/Spark

The CoE was involved in a project with users from the University of Nottingham to look into the potential of analysing data generated by molecular dynamics applications with Hadoop. This was a proof-of-concept study to understand what can be done with Hadoop technologies in the processing of HPC data. For this project, the CoE brought in experts in MapReduce and Spark from Cray’s Data Analytics division. In April, work started on a basic Hadoop application with subsequent initial testing. This work was highlighted in a Cray Case Study.
PDNS3D

The CoE investigated a performance problem with the PDNS3D code. The coarray implementation was not performing well relative to the MPI implementation. The CoE used an expert from the US Cray Performance team to continue investigating this. He found a performance issue with the Cray compiler that disadvantaged the coarray version of the code at all scales. This issue was resolved, and subsequent analysis showed that the halo-swap communication patterns are implemented differently in the MPI and coarray versions. The results were communicated at the end of the year and we hope to discuss them further in the near future.
8.2 Filesystem and I/O

The CoE was engaged to understand the adverse effects reported by users as a result of filesystem rebuilds. Working with the Cray team on site, we performed a detailed investigation of individual storage unit (OST) performance and were able to determine the cause of the adverse performance. As part of this work, we were also able to show that the tuning of both raid-check and the rebuild process was working, to the extent that applications would be less impacted when these operations were throttled. Filesystem tasks are now much less intrusive as a result of software improvements and configuration of raid-check and rebuild operations.

As part of this effort the CoE also started to engage with the NCAS community, were given access to the NCAS Puma service, and have been able to run a representative UM job on ARCHER via that service. Initial investigations concentrated on the I/O server configuration, and this is a topic that we hope to revisit in the future. The direct investigation of NCAS UM jobs became less relevant once the filesystem performance issues were understood and the new raid-check regime was put in place.

The question of how to optimise I/O comes up often, so we are considering how we can do more (beyond existing material we have presented in optimisation workshops and in the ARCHER tuning guide) to get appropriate information to users.
8.3 Training and Workshops

The CoE assisted with various workshops during the year. Particular examples were the Porting and Optimisation workshop run at the time of EASC 2015, the 1st Euro OpenACC Hackathon, and the ARCHER Serial Optimisation course run at Cray’s EMEA HQ in Bristol in December. CoE staff presented a seminar on modern Fortran, as well as a talk on coarrays and ARCHER projects at a joint meeting of the British Computer Society and Institute of Physics.

The CoE was able to engage with ARCHER users at various events including the Insight UK meeting in Coventry, the UK Turbulence Consortium Annual review meeting, and the 24th Discrete Simulation of Fluid Dynamics (DSFD) conference in Edinburgh.

The ARCHER CoE organised a mini-symposium at the ParCo 2015 conference with a focus on programming for manycore nodes (including GPUs, multicore CPUs, and Intel Xeon Phi). This (along with other ParCo events) was a useful way to interact with users and the wider community on concerns and requirements for programming models as we look at current and future architectures.
8.4 ARCHER Queries and Software

Of particular note this year was an issue with suboptimal performance of NWChem. NWChem was not performing optimally on ARCHER and, for some cases, was slower than on HECToR. The problem was difficult to diagnose due to large run times, but was found to be due to a new GA implementation. Cray CoE, Cray USA developers, EPCC and the user were all involved in working on this. An early fix did not work due to a race condition but, as of the end of the year, a new ARMCI communication model has resolved the performance problem.

Updates of CLE and the Programming Environment towards the end of the year on ARCHER mean that new features are available, and we will produce a seminar in 2016 to outline these.
8.5 eCSE Meetings

The CoE completed technical assessments for the two eCSE calls during the year, and staff attended the project planning meetings.