archer service 2015 annual report · the next section of this report contains an executive summary...


ARCHER Service 2015 Annual Report


Document Information and Version History

Version:      1.0
Status:       Final
Author(s):    Alan Simpson, Anne Whiting, Stephen Booth, Andy Turner, Felipe Popovics, Steve Jordan, Harvey Richardson, Mike Brown, Lorna Smith
Reviewer(s):  Alan Simpson, Lorna Smith, Steve Jordan

Version  Date        Comments, Changes, Status                      Authors, contributors, reviewers
0.1      2016-01-13  Inputting initial information                  Anne Whiting, Jo Beech-Brandt, Andy Turner, Mike Brown, Lorna Smith
0.2      2016-01-20  Incorporating individual team input            Anne Whiting, Jo Beech-Brandt, Andy Turner, Mike Brown, Lorna Smith
0.3      2016-01-22  Incorporating Cray CofE and service reports    Steve Jordan, Felipe Popovics, Harvey Richardson
0.4      2016-01-22  Updated highlights, reviewed and commented     Andy Turner
0.5      2016-01-22  Added utilization graphs                       Jo Beech-Brandt
0.6      2016-01-26  Internal review                                Alan Simpson
0.7      2016-01-27  Updates post review                            Anne Whiting, Harvey Richardson, Andy Turner, Mike Brown
1.0      2016-01-28  Version sent to EPSRC                          Alan Simpson, Anne Whiting
1.1      2016-05-05  Minor change for website                       Jo Beech-Brandt, Alan Simpson


Table of Contents

Document Information and Version History
1. Introduction
2. Executive Summary
3. Service Utilisation
   3.1 Overall Utilisation
   3.2 Utilisation by Funding Body
   3.3 Additional Usage Graph
4. User Support and Liaison (USL)
   4.1 Helpdesk Metrics
   4.2 USL Service Highlights
5. Operations and Systems Group (OSG)
   5.1 Service failures
   5.2 OSG Service activities
6. Computational Science and Engineering (CSE)
   6.1 Best Practice for Data Management on ARCHER
   6.2 The ARCHER Driving Test: encouraging new users onto the ARCHER service
   6.3 Wee ARCHIE: a Raspberry Pi cluster to educate the next generation of HPC users
   6.4 Women in HPC
   6.5 Competitive eCSE Programme
7. Cray Service Group
   7.1 Summary of Performance and Service Enhancements
   7.2 Reliability and Performance
   7.3 Service Failures
8. Cray Centre of Excellence (CoE)
   8.1 CoE Project Highlights
   8.2 Filesystem and I/O
   8.3 Training and Workshops
   8.4 ARCHER Queries and Software
   8.5 eCSE Meetings


1. Introduction

This annual report covers the period from 1 Jan 2015 to 31 Dec 2015. The report has contributions from all of the teams responsible for the operation of ARCHER:

• Service Provider (SP), containing both the User Support and Liaison (USL) Team and the Operations and Systems Group (OSG);

• Computational Science and Engineering Team (CSE);

• Cray, including contributions from the Cray Service Group and the Cray Centre of Excellence.

The next section of this report contains an Executive Summary for the year. Section 3 provides a summary of the service utilisation. Section 4 provides a summary of the year for the USL team, detailing the Helpdesk Metrics and outlining some of the highlights for the year. The OSG report in Section 5 describes their four main areas of responsibility: maintaining day-to-day operational support; planning service enhancements in a near to medium timeframe; planning major service enhancements; and supporting and developing associated services that underpin the main external operational service. In Section 6 the CSE team highlight some of their key projects from the year. They describe the work with the Consortia Contacts, the eCSE Programme, Women in HPC and the distributed training activity. The ARCHER Image Competition is also described. In Sections 7 and 8, the Cray Service team and Cray Centre of Excellence, respectively, give a summary of their year's activities.

This report and the additional SAFE reports are available to view online at http://www.archer.ac.uk/about-us/reports/annual/2015.php


2. Executive Summary

The sections from the various teams describe highlights of their activities. This section gives a brief summary of highlights from the first year of the overall ARCHER service. More details are provided in the appropriate section of the document.

• Broadening Access to HPC: Increasing the applicability of, and broadening access to, advanced computing for UK research is of key strategic importance. The ARCHER service has piloted a number of initiatives aimed particularly at increasing the diversity of research on the system. The ARCHER Driving Test provides a simple way for researchers new to HPC to gain initial training in the field and a modest amount of HPC resource to explore the possibilities of HPC in their research. The eCSE programme has included software development effort prioritising research communities that are new to HPC to boost their software to a level where it can exploit facilities such as ARCHER.

• Business Case for Future Systems: The service has worked closely with the research councils to gather information to support the business case for future HPC investment. The SAFE system has provided an invaluable resource allowing us to analyse how ARCHER (and previous systems) is used and by whom.

• ARCHER Outreach Project: This project was funded by EPSRC in 2015 to promote engagement and diversity in UK HPC, demonstrate impact from ARCHER, and enhance outreach activities. In engagement and diversity, 2015 has seen the initiation of the ARCHER Champions HPC peer support initiative, expansion of the Women in HPC network, and development of a Faces of HPC diversity website. In impact, a number of case studies have been developed and published. In outreach, Wee ARCHIE has been developed: a portable HPC cluster designed to promote the national HPC service to young people and the general public at a series of UK-wide outreach activities.

• Major Incident Management: Technical and management staff from all service partners collaborated effectively to resolve the Lustre filesystem issues, and provided successful and innovative solutions to minimise the impact on users and their work. Many of these solutions have been incorporated as ongoing service delivery improvements, providing a more robust service for the future. All service partners are participating in further initiatives to improve coordination and information sharing, as well as improving the joint Major Incident Plan.

• Utilisation of the service has remained very high and has grown steadily through 2015. Although this has been a challenging year for the service due to issues with the Lustre filesystems, positive collaboration between all service partners has minimised the impact on the users, maintaining a high utilisation level. The majority of the compute cycles have been expended on jobs exploiting hundreds or thousands of cores, which are difficult to run on smaller HPC systems.

• In total, the Service dealt with around 8,100 queries during 2015, meeting all query targets. Resolving user queries promptly so that the resolution allows users to maximise their research on the service is only possible due to close and effective collaboration between all service partners.


3. Service Utilisation

3.1 Overall Utilisation

Utilisation over the year was 87%, which is similar to the percentage utilisation for 2014. However, following the Phase 2 upgrade, which took place in late 2014, the capacity of ARCHER was increased by 60%.

3.2 Utilisation by Funding Body

The utilisation by funding body relative to their allocation can be seen below.

This bar chart shows the usage of ARCHER by the two Research Councils presented as a percentage of the total Research Council allocation on ARCHER. The uncharged proportion for EPSRC includes the temporary project v01 that was put in place during the filesystem issues.


3.3 Additional Usage Graph

The following graph provides a view of the distribution of job sizes on ARCHER.

The graph shows that most of the kAUs are spent on jobs between 257 cores and 8192 cores. The number of kAUs used is closely related to money and shows how the investment in the system is utilised.


4. User Support and Liaison (USL)

4.1 Helpdesk Metrics

Query Closure

It was a busy year on the helpdesk but all Service Level Agreements were met. A total of 7874 queries were answered by the Service Provider, and over 98.5% were resolved within 2 days. In addition to this, the Service Provider passed on 296 in-depth queries to CSE and Cray.

                      15Q1   15Q2   15Q3   15Q4   TOTAL
Self-Service Admin    1722   1172    775   1564    5233
Admin                  654    616    408    601    2278
Technical              118     91     67     87     363
Total Queries         2494   1879   1250   2252    7874

Other Queries

In addition to the Admin and Technical Queries detailed above, the Helpdesk also dealt with Phone queries, Change Requests, internal requests and User Registration.

                             15Q1      15Q2      15Q3      15Q4      TOTAL
Phone Calls Received         135 (41)  100 (22)  104 (20)  92 (14)   431 (97)
Change Requests              8         7         5         5         25
User Registration Requests   313       214       220       302       1049

The numbers shown in brackets for the phone calls received are the calls resulting in new or updated queries. It is worth noting that the volume of telephone calls was low throughout the year. Of the 431 calls received in total, only 97 (22.5%) were actual ARCHER user calls that resulted in queries. The trend through the year has been a falling number of actual ARCHER calls resulting in a query. All phone calls were answered within 2 minutes, as required.
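The phone-call figures can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the numbers are copied directly from the Phone Calls Received row of the table.

```python
# Quarterly phone-call figures from the table:
# (calls received, calls resulting in new or updated queries).
phone_calls = {
    "15Q1": (135, 41),
    "15Q2": (100, 22),
    "15Q3": (104, 20),
    "15Q4": (92, 14),
}

total_received = sum(received for received, _ in phone_calls.values())
total_queries = sum(queries for _, queries in phone_calls.values())
query_share = 100 * total_queries / total_received

print(f"Calls received: {total_received}")
print(f"Calls resulting in queries: {total_queries} ({query_share:.1f}%)")

# Percentage of calls in each quarter that led to a new or updated query.
quarterly_share = {
    quarter: round(100 * queries / received, 1)
    for quarter, (received, queries) in phone_calls.items()
}
print(quarterly_share)
```

The per-quarter percentages make the falling trend in query-generating calls easier to see than the raw counts alone.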

4.2 USL Service Highlights

File system issues and improvements arising

Major service disruption was experienced in May and June due to Sonexion file system issues. In conjunction with work to resolve the issues, successful measures were put in place to minimise the user impact. Collaborative working between all service partners and carefully constructed and targeted user communication were key to this. The success of the measures could be seen in the 83% utilisation maintained during May and June, with the temporary filespace utilisation accounting for 46% of the utilisation for the period. Minimal user complaints were received during the period of disruption, and the user community expressed appreciation of the effort taken to keep the service running. Many of the measures devised and implemented to minimise user impact and downtime are now included as standard processes and functionality. Recommendations from the lessons learned reports are also being implemented. The measures implemented included:

• SAFE functionality to be able to lock job submission on a per-filesystem basis (this prevents users from wasting resources when their filespace is not available for running jobs).

• Provision of temporary project space when a particular file system is unavailable, to allow users who are affected by file system issues to keep running calculations if possible.

• Move to resilient package installation across all file systems to enable users to access packages independently of any particular file system being unavailable.

• An improved coordinated Major Incident Procedure.


Period allocations for consortia and large research groups were successfully implemented and then staggered

In Q1 of 2015, under the direction of EPSRC, 6-monthly period allocations were introduced. This was done to help ensure that projects used their allocations more evenly over the lifetime of the project. This change has had a positive effect, though the simultaneous ending of a large number of both EPSRC and NERC allocations in March 2015 caused the machine to be very busy. Since then the EPSRC project allocations have been staggered throughout the year to avoid a recurrence of this issue. The impact of these changes has been measured using the Scheduling Coefficient report. These reports show no recurrence of the problems from March 2015.

SAFE changes

Changes have been made to SAFE this year to support service improvements. These include:

• The implementation of sub-project management, allowing the PIs to devolve management of parts of a project to project managers;

• The move to an improved reporting engine to speed up the creation of user reports;

• The addition of career stage monitoring, in particular to allow EPSRC to track the number of early career stage researchers;

• And, the implementation of automatic tweeting of user mailings to increase the mailing delivery options.

UK-Federation authentication to SAFE implemented

UK-Federation authentication to the SAFE was implemented, allowing users to authenticate with the same credentials as for their home institution. This reduces the number of credentials that the user needs to remember and troubleshoot. 215 users have signed up to use this functionality to date.


5. Operations and Systems Group (OSG)

5.1 Service failures

There were no service failures in the period as defined in the metric.

5.2 OSG Service activities

Principal activities undertaken (in addition to day-to-day operational cover) included:

(1) Operating system and applications software support:
    a. planning and implementing CLE 5.2 upgrade on the XC30;
    b. installing regular compiler and programming development upgrades;
    c. supporting OS enhancements to external login nodes.

(2) Resource management:
    a. PBS queue enhancements such as the SHORT development queue and further support for creation of advanced reservations;
    b. assessing and monitoring problems with the job scheduling cycle.

(3) Storage:
    a. significant involvement in the handling of major storage problems encountered during the year;
    b. upgrade of Sonexion (Lustre) filesystem software;
    c. further integration of the RDF into the operational environment.

(4) System monitoring:
    a. further enhancement of use of external monitoring tools such as Nagios and OMD;
    b. expansion of internal system health checks.

(5) System administration:
    a. development and expansion of automated ticket handling;
    b. refinement of locally-developed systems administration tools;
    c. integration of the RDF data-analysis cluster into the wider operational configuration.

(6) Communications:
    a. installation and configuration of multiple 40G connections to JANET core network;
    b. further hardening of internal ACF networks that underpin both external operational and internal secure management services.

(7) Service support systems:
    a. further development of automated failover of hypervisor-based virtual servers that provide resilient services such as SAFE, website and wiki.

(8) Supporting Cray hardware operations:
    a. providing additional on-site support for Cray personnel during major hardware upgrade operations (such as the optical cable re-work).

(9) Security:
    a. implementing enhancements to security monitoring;
    b. installing Cray-supplied security field notices;
    c. providing additional hardening of security measures; specific details are not available for obvious reasons.


6. Computational Science and Engineering (CSE)

These are selected highlights from the CSE Service during 2015. Full operational details on the CSE service (including metrics) can be found in the quarterly reports on the web.

6.1 Best Practice for Data Management on ARCHER

The amount of data required and produced by modelling and simulation is increasing year on year. This is reflected in the fact that data management and file system (IO) performance are now major concerns for many ARCHER users. Until relatively recently, neither of these was an issue that concerned the majority of users. There is a lack of generally-available material on these topics for HPC users, and it also tends to be an area where many HPC users have little expertise or experience.

In the second half of 2015, the CSE service focused on providing a set of practical resources for ARCHER users with the aim of improving their data management and/or IO performance on ARCHER and the RDF. We provided both general advice and advice targeted at specific application users where we are aware of particular issues with data management. In particular, we have produced:

• Data Management Guide on the ARCHER website, covering:
    o Archiving data to the RDF
    o Data transfer between ARCHER and the RDF
    o Data transfer to/from external sites to ARCHER and the RDF
    o Different ARCHER and RDF file systems and their use

• White Paper on Performance of Parallel IO on ARCHER:
    o Initially covering MPI-IO performance on ARCHER Lustre file systems
    o Currently expanding work to look at NetCDF and HDF5 performance
    o Working with the DiRAC facility to compare performance across different file system and vendor architectures

• Webinars:
    o Data Management: best practice in using tools to manage data on ARCHER and the RDF, including how to efficiently move data between the different file systems.
    o Using OpenFOAM on ARCHER: the popular Open Source CFD software OpenFOAM has particular issues with the numbers of files it can produce when run in parallel. This webinar raised awareness of these issues in the user community and provided advice for how to deal with the problems.
    o Lustre and IO Tuning: provided a description of the ARCHER Lustre file systems, where users may see issues with performance, and tips for getting best performance out of the file systems depending on your usage pattern.

• Training:
    o Data management and IO performance best practice has been built into our Introductory face-to-face courses and the online ARCHER Driving Test.
    o Advanced material on parallel IO performance has been used as the basis of the Efficient Parallel IO on ARCHER course run in December 2015 in Oxford.

6.2 The ARCHER Driving Test: encouraging new users onto the ARCHER service

The ARCHER Driving Test was launched at the start of the year to give a mechanism for new users to gain access to the service easily whilst also ensuring that they had enough knowledge of HPC to make use of their ARCHER account. The test at https://www.archer.ac.uk/training/course-material/online/driving_test.php comprises 20 questions chosen randomly from a bank of 60, distributed to ensure coverage of all aspects of the system:


Category          # questions
Hardware          2
I/O               3
Programming       4
Compiling         3
PBS               1
Running jobs      3
Random category   4
Total             20
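As a rough illustration of how such a test could be assembled, the sketch below draws a fixed number of questions per category and then tops up with four questions from anywhere in the bank. It is not the actual Driving Test implementation: the even split of the 60-question bank across six categories, the reading of "Random category" as "any category", and all function and variable names are assumptions made for this example.

```python
import random

# Questions drawn per category, as listed in the table.
QUESTIONS_PER_CATEGORY = {
    "Hardware": 2,
    "I/O": 3,
    "Programming": 4,
    "Compiling": 3,
    "PBS": 1,
    "Running jobs": 3,
}
# The "Random category" row: assumed here to mean "drawn from any category".
RANDOM_CATEGORY_QUESTIONS = 4


def build_question_bank(per_category=10):
    """Hypothetical bank of 6 x 10 = 60 questions (the even split is an assumption)."""
    return {
        category: [f"{category} Q{i + 1}" for i in range(per_category)]
        for category in QUESTIONS_PER_CATEGORY
    }


def draw_test(bank, rng):
    """Pick the per-category quota, then top up from the unused questions."""
    selected = []
    for category, count in QUESTIONS_PER_CATEGORY.items():
        selected.extend(rng.sample(bank[category], count))
    remaining = [q for questions in bank.values() for q in questions if q not in selected]
    selected.extend(rng.sample(remaining, RANDOM_CATEGORY_QUESTIONS))
    return selected


bank = build_question_bank()
test_questions = draw_test(bank, random.Random(2015))
print(len(test_questions), "questions drawn from a bank of",
      sum(len(questions) for questions in bank.values()))
```

Sampling per category first guarantees the coverage described above, while the final top-up keeps individual tests from being identical in structure.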

The test is also supported by online training material, including slides and video lectures, addressing all the areas covered by the test. The test is promoted at ARCHER training courses and is also mentioned in the email sent to all attendees after the course has finished, in which we encourage them to fill in the feedback form. In the first year, the test was successfully completed by 122 people, 82 of whom have gone on to obtain accounts on ARCHER; those passing the test are sent a certificate of completion. After an initial burst of interest, take-up has remained very consistent throughout the year.

It is interesting to note that, from Q2 onwards, almost all new users have become active users (i.e. have submitted compute jobs). In total, some 34,600 kAUs have been spent by these 62 users, an average usage of around 560 kAUs; a typical active user is therefore spending almost half of their total allocation of 1,200 kAUs. The driving test has been a great success and shows every sign of continuing to attract new users for the remainder of the ARCHER service.
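The average-usage figure follows directly from the totals quoted above. The short calculation below is only a worked check using those three numbers; the variable names are ours.

```python
# Figures quoted in the text for Driving Test users.
total_kaus_spent = 34_600   # kAUs spent in total
active_users = 62           # users who have submitted compute jobs
allocation_kaus = 1_200     # kAU allocation per Driving Test account

average_usage = total_kaus_spent / active_users
fraction_of_allocation = average_usage / allocation_kaus

print(f"Average usage: {average_usage:.0f} kAUs per active user")
print(f"Share of allocation used: {fraction_of_allocation:.1%}")
```

The exact average is slightly below the rounded "around 560 kAUs", which is consistent with the statement that a typical active user spends almost half of the allocation.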

6.3 Wee ARCHIE: a Raspberry Pi cluster to educate the next generation of HPC users

The ARCHER Outreach project aims to engage new communities and the next generation to take advantage of HPC technologies. However, one common problem in reaching out to these communities is helping them to understand the relationship between everyday computing, be it through a tablet, laptop, or smartphone, and supercomputing or 'high performance computing'. The ARCHER team appreciates the importance of helping everyone to understand how HPC can improve their science, allowing the community to complete more science, and also of ensuring that the next generation understands HPC is a tool for all, not just the few lucky enough to work at an institution with an HPC resource. So we developed Wee ARCHIE.

Wee ARCHIE has been designed and built to help explain what HPC is, the difficulties in using such technologies, but also the possibilities available when using HPC platforms. While Wee ARCHIE is only a model of a real HPC cluster, it has all the key components. The cluster consists of 18 Raspberry Pi 2s, each of which has four cores and plays the role of a four-core node. The cluster has been designed to enable explanation of the hardware and how the components are connected and interact. The processors, switches, power supply units and networking cables are all visible through a Perspex case, which is designed to be highly portable, enabling us to take it to outreach events around the UK. Each Raspberry Pi has also been fitted with an LED array to allow us to show when the Pi is active, and future code development will allow us to show the load on each 'node', enabling us to teach people about the importance of load balancing.

The design plans for the Wee ARCHIE cluster will be made available online in 2016, enabling anyone to purchase and build their own cluster. We will also be developing a range of software to highlight the advantages and also the difficulties of using HPC and the importance of using the right tool for your problem. Wee ARCHIE will be taken to a series of outreach events in 2016, including the Big Bang Fair at the NEC, Birmingham, in March 2016.
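To illustrate the load-balancing idea mentioned above, the sketch below compares two ways of spreading work over 18 nodes. It is not Wee ARCHIE software: the workload (36 tasks costing 1 to 36 units) and both scheduling functions are hypothetical, chosen only so that the difference is easy to see.

```python
import heapq

NODES = 18           # Raspberry Pi 2 boards in Wee ARCHIE
CORES_PER_NODE = 4   # cores on each board
print("Total cores:", NODES * CORES_PER_NODE)

# Hypothetical workload: 36 tasks with costs of 1..36 units.
task_costs = list(range(1, 37))


def round_robin(costs, nodes):
    """Hand tasks out in arrival order, ignoring how expensive each one is."""
    loads = [0] * nodes
    for index, cost in enumerate(costs):
        loads[index % nodes] += cost
    return loads


def greedy_balanced(costs, nodes):
    """Give the largest remaining task to whichever node is least loaded."""
    heap = [(0, node) for node in range(nodes)]
    heapq.heapify(heap)
    for cost in sorted(costs, reverse=True):
        load, node = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, node))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


naive = round_robin(task_costs, NODES)
balanced = greedy_balanced(task_costs, NODES)
print("Busiest node, round robin:", max(naive))
print("Busiest node, greedy:     ", max(balanced))
```

For this particular workload the busiest node carries 54 units under round robin but only 37 under the greedy scheme, so the whole job finishes sooner even though the total amount of work is unchanged.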

6.4 Women in HPC

Women in HPC started in 2013, with the official launch in April 2014, and has become an internationally recognised name in the last year. In 2015, the Women in HPC initiative went from the two events held in 2014 to seven different events, the launch of a new website, the signing of our first international Women in HPC partner organisation, and winning the HPCWire Readers Choice Award for Diversity.

During the last year, Women in HPC has participated in three international conferences: PRACEdays15, Dublin, Ireland; ISC 2015, Frankfurt, Germany; and Supercomputing 2015, Austin, USA. At each conference we have had an array of events, including 'Birds of a Feather' discussions, workshops, training sessions and networking receptions.

In September 2015, Women in HPC ran the first ever Women in HPC careers event in collaboration with BCSWomen, in London, bringing together leading women working with HPC in the UK to discuss career opportunities with early career women interested in a career in HPC or looking for a new direction to follow within the HPC community. The day culminated with a speed networking session which, despite many being apprehensive about it, was the best-received activity of 2015.

At ISC 2015, Women in HPC signed an agreement with Compute Canada as the first international partner to work with Women in HPC. The partnership enables the establishment of a Canadian Women in HPC chapter organisation, which will run training events and networking sessions aimed at the Canadian HPC community, and share information and ideas with Women in HPC. This is a model that Women in HPC is in the process of developing, with the plan to establish chapters and partnerships around the globe, providing the opportunity for women in the HPC community to network internationally as well as the opportunity to encourage other women to move into a career within the HPC community.

In 2016, Women in HPC is set to expand, signing up additional international and regional partners to establish best practice in broadening participation in the HPC community around the world. We will also be working with a variety of conferences as well as expanding our dissemination activities.

6.5 Competitive eCSE Programme

The embedded CSE (eCSE) programme provides funding for 14 FTEs embedded directly into the scientific community through a series of competitive, peer-reviewed calls. 2015 saw a high demand from the community for funding, resulting in a very high quality threshold. Over the course of the six calls, 54 projects have been funded. These projects have made a significant impact on the quality and performance of the software suite on ARCHER: more than ten of the most heavily used codes on the system have benefitted from eCSE investment effort. This in turn has facilitated greater scientific output and impact, allowing previously untenable science.

The programme has a focus on early career researchers and on developing the UK software skills base. The distributed and embedded nature of the programme allows for this skills development to be spread across the whole of the UK, and a key highlight of the programme has to be the fact that staff from ~30 institutions from a wide geographical distribution have benefitted from eCSE investment. Coupled with this, we will have early career researchers observing at future panel meetings. The aim is to give them a better insight into the mechanism of selection to assist in their future preparation of funding proposals.

A final highlight is the successful new communities programme that encourages proposals from new communities, looking to enhance the diversity of science being carried out on ARCHER. Over the three eCSE calls that have included this initiative we have received thirteen new community applications.


7. Cray Service Group

7.1 Summary of Performance and Service Enhancements

2015 has been another strong year for the ARCHER service. Overall system reliability and utilisation of resources have continued to be at a high level. Where technology areas have performed below the high standards expected, corrective action has been taken to resolve issues with the minimum amount of disruption and after careful consultation with service partners. More details of specific technology failures can be found in the table and associated descriptions below.

7.2 Reliability and Performance

The performance and reliability of the hardware and software technologies underpinning the ARCHER service continue to be of a high standard. New versions of software that provide feature enhancements and bug fixes to the user community are continually under development and are then implemented on the ARCHER service following periods of evaluation on appropriate test platforms.

Large, complex HPC systems such as ARCHER are not immune from technology failures, but under most circumstances those failures can be managed by utilising well-designed resiliency features and robust configurations. On occasions, technology failures do result in impact upon the user community. The most significant technology area of the ARCHER service where issues were encountered in 2015 was the parallel Lustre filesystem and associated storage components. Acknowledging that improvements could be made in both hardware and software areas of the storage subsystem, these improvements were forthcoming and integrated with a minimum of disruption to the user community.

7.3 Service Failures

Seven unscheduled incidents classified as full service failures were encountered during 2015. As can be seen, six of these failures occurred in the first half of the year, with a much-improved performance and only a single service failure in the second half of the year.

Incident  Date       Description
1         08-Jan-15  System reboot required following storage controller failure
2         06-May-15  Storage failure on Lustre filesystem /fs3
3         07-May-15  Storage failure on Lustre filesystem /fs2
4         13-May-15  System reboot following PBS Pro batch system server failure
5         10-Jun-15  System reboot following a failure in the system boot RAID device
6         30-Jun-15  Running user work lost following PBS Pro batch system becoming unresponsive
7         06-Oct-15  System reboot following unintended initialisation of system components

The details of these seven technology service failures were:

• One service failure due to a Lustre storage controller fault requiring a system reboot to clear.


• Two service failures occurred due to multiple storage component failures affecting two different Lustre parallel filesystems.

• Two service failures due to problems related to the PBS Pro batch subsystem which caused the loss of running user work.

• One service outage due to a controller failure in a boot RAID device, which houses the operating system filesystems for the ARCHER service.

• One service failure was caused by the accidental use of an initialisation command on system components.
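The split of incidents between the two halves of the year, and the grouping by cause, can be reproduced from the incident dates. The sketch below is illustrative only: the dates come from the incident table, while the short cause labels are our own shorthand for the descriptions in the list above. Parsing the month abbreviations with `%b` assumes an English or default C locale.

```python
from collections import Counter
from datetime import datetime

# (date, cause) pairs; dates from the incident table, cause labels are our shorthand.
incidents = [
    ("08-Jan-15", "storage controller"),
    ("06-May-15", "Lustre storage"),
    ("07-May-15", "Lustre storage"),
    ("13-May-15", "PBS Pro"),
    ("10-Jun-15", "boot RAID"),
    ("30-Jun-15", "PBS Pro"),
    ("06-Oct-15", "accidental initialisation"),
]


def half_of_year(date_text):
    """Return 'H1' for January to June and 'H2' for July to December."""
    month = datetime.strptime(date_text, "%d-%b-%y").month
    return "H1" if month <= 6 else "H2"


by_half = Counter(half_of_year(date_text) for date_text, _ in incidents)
by_cause = Counter(cause for _, cause in incidents)

print("Failures per half-year:", dict(by_half))
print("Failures per cause:", dict(by_cause))
```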


8. CrayCentreofExcellence(CoE)MichaelNeffjoinedtheCoEandbringsspecificexpertiseincomputationalchemistrytotheCoE.

8.1 CoEProjectHighlights

HIPSTARAcasestudywasproducedonsomepreviousworkdonebytheCoEontheHiPSTARcodefromtheUniversityofSouthampton.InapreviousCoEproject,OpenMPwasaddedtoHiPSTARimprovingthecodescalabilityconsiderably.ThisOpenMPworkwasthenusedasabasisforanOpenACCportoftheapplication(doneinconjunctionwiththeusersbytheARCHERandORNLCoEs).TheOpenACCportoftheapplicationallowedtheuserstorunHiPSTARtoverylargescaleontheTitansystematORNL,andformacollaborationwithGEinthiswork.ThisworkhasbeendocumentedinaCraycasestudy-http://www.cray.com/sites/default/files/XC30-ARCHER-HiPSTAR-0315.pdf.

ONETEPWeexpendedasmallamountofeffortduringtheyear(viaotherUKApplicationsStaff)supportingaPoisson-BoltzmannEquationsolverforONETEP(incollaborationwithTheUniversityofSouthampton).AneCSEproposalforfurtherworkwithONETEPandCASTEPisabouttobesubmittedanditisourintentionthatcontinuingsupportfromtheCoEwouldbeprovidediftheprojectwastobefunded.

HADOOP/Spark

The CoE was involved with a project with users from the University of Nottingham to look into the potential of analyzing data generated by molecular dynamics applications with Hadoop. The general aim here was really a proof-of-concept study to understand what can be done with Hadoop technologies in processing of HPC data. For this project, the CoE brought in experts in MapReduce and Spark from Cray’s Data Analytics division. In April, work started on a basic Hadoop application with subsequent initial testing. This work was highlighted in a Cray Case Study.
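As an illustration of the MapReduce style of processing mentioned above, the following minimal Python sketch computes a per-frame mean over molecular-dynamics-like records. It is not the Hadoop application developed in the project: the record layout, the values, and the names `records`, `map_phase` and `reduce_phase` are all hypothetical, and the whole computation runs in a single process rather than on a Hadoop cluster.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical trajectory records: (frame index, atom id, displacement).
records = [(0, "a1", 0.10), (0, "a2", 0.30), (1, "a1", 0.20), (1, "a2", 0.40)]


def map_phase(record):
    """Emit a (key, value) pair: the frame index and a (sum, count) contribution."""
    frame, _atom, displacement = record
    return (frame, (displacement, 1))


def reduce_phase(acc, pair):
    """Combine one emitted pair into the running (sum, count) for its key."""
    key, (value, count) = pair
    total, n = acc[key]
    acc[key] = (total + value, n + count)
    return acc


# Map every record, then fold the emitted pairs into per-frame totals.
totals = reduce(reduce_phase, map(map_phase, records),
                defaultdict(lambda: (0.0, 0)))
mean_per_frame = {frame: total / n for frame, (total, n) in totals.items()}
```

In a real Hadoop or Spark job the map and reduce steps would be distributed across nodes and grouped by key by the framework; the sketch only shows the shape of the two phases.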

PDNS3D

The CoE investigated a performance problem with the PDNS3D code. The coarray implementation was not performing well relative to the MPI implementation. The CoE used an expert from the US Cray Performance team to continue investigating this. He found a performance issue with the Cray compiler that disadvantages the coarray version of the code at all scales. This issue was resolved and subsequent analysis showed that the halo-swap communication patterns are implemented differently in the MPI and coarray versions. The results were communicated at the end of the year and we hope to discuss them further in the near future.
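For readers unfamiliar with the term, a halo swap copies the edge cells of each subdomain into the boundary ("halo") cells of its neighbours so that each process can update its own cells locally. The sketch below simulates a one-dimensional periodic halo swap in a single Python process. It is not taken from PDNS3D; the function names and the one-cell halo width are assumptions made for illustration. In an MPI code the two copies would typically be paired send and receive calls, whereas a coarray code would read or write the neighbouring image's data directly.

```python
def make_subdomains(values, n_ranks):
    """Split values evenly into n_ranks chunks, each padded with two halo slots."""
    chunk = len(values) // n_ranks
    return [[None] + values[r * chunk:(r + 1) * chunk] + [None]
            for r in range(n_ranks)]


def halo_swap(subdomains):
    """Fill each halo cell with the neighbouring rank's edge value (periodic)."""
    n = len(subdomains)
    for r, local in enumerate(subdomains):
        left = subdomains[(r - 1) % n]
        right = subdomains[(r + 1) % n]
        local[0] = left[-2]    # last interior cell of the left neighbour
        local[-1] = right[1]   # first interior cell of the right neighbour
    return subdomains


# Eight cells shared between four simulated ranks, two interior cells each.
domains = halo_swap(make_subdomains(list(range(8)), 4))
```

Only interior cells are read during the swap, so updating the halos in place does not affect the values copied to later ranks.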

8.2 Filesystem and I/O

The CoE was engaged to understand the adverse effects reported by users as a result of filesystem rebuilds. Working with the Cray team on site, we performed a detailed investigation of individual storage unit (OST) performance and were able to determine the cause of the adverse performance. As part of this work, we were also able to show that the tuning of both raid-check and the rebuild process was working to the extent that applications would be less impacted when these operations were throttled. Filesystem tasks are now much less intrusive as a result of software improvements and configuration of raid-check and rebuild operations.

As part of this effort the CoE also started to engage with the NCAS community, were given access to the NCAS Puma service, and have been able to run a representative UM job on ARCHER via that service. Initial investigations concentrated on the I/O server configuration and this is a topic that we hope to revisit in the future. The direct investigation of NCAS UM jobs became less relevant once the filesystem performance issues were understood, and the new raid-check regime was put in place.

The question of how to optimize I/O comes up often so we are considering how we can do more (beyond existing material we have presented in optimisation workshops and in the ARCHER tuning guide) to get appropriate information to users.

8.3 Training and Workshops

The CoE assisted with various workshops during the year. Particular examples were the Porting and Optimisation workshop run at the time of EASC2015, the 1st Euro OpenACC Hackathon, and the ARCHER serial Optimisation course run at Cray’s EMEA HQ in Bristol in December. CoE staff presented a seminar on modern Fortran, as well as a talk on coarrays and ARCHER projects at a joint meeting of the British Computer Society and Institute of Physics. The CoE was able to engage with ARCHER users at various events including the Insight UK meeting in Coventry, the UK Turbulence Consortium Annual review meeting, and the 24th Discrete Simulation of Fluid Dynamics (DSFD) conference in Edinburgh.

The ARCHER CoE organised a mini-symposium at the PARCO 2015 conference with a focus on programming for manycore nodes (including GPUs, multicore CPUs, and Intel Xeon Phi). This (along with other ParCo events) was a useful way to interact with users and the wider community on concerns and requirements for programming models as we look at current and future architectures.

8.4 ARCHER Queries and Software

Of particular note this year was an issue with suboptimal performance of NWChem. NWChem was not performing optimally on ARCHER and, for some cases, was slower than HECToR. The problem was difficult to diagnose due to large run times, but was found to be due to a new GA implementation. Cray CoE, Cray USA developers, EPCC and the user were all involved in working on this. An early fix did not work due to a race condition but, as of the end of the year, a new ARMCI communication model has resolved the performance problem.

Updates of CLE and the Programming Environment towards the end of the year on ARCHER mean that new features are available and we will produce a seminar in 2016 to outline these.

8.5 eCSE Meetings

The CoE completed technical assessments for the two eCSE calls during the year, and staff attended the project planning meetings.