Delaying the ‘Peak Data Crisis’ in the Era of Data-intensive Science · 2016-07-28
TRANSCRIPT
eResearch Australasia Conference | Melbourne, Australia | 10–14 October 2016
Delaying the ‘Peak Data Crisis’ in the Era of Data-intensive Science
Presenter: Lesley Wyborn¹

Lesley Wyborn¹, Ben Evans²

¹ Lesley Wyborn, National Computational Infrastructure, Canberra, Australia, [email protected]
² Ben Evans, National Computational Infrastructure, Canberra, Australia, [email protected]
SUMMARY

The ‘Peak Oil Crisis’ refers to the point in time when the maximum rate of extraction of petroleum is reached, after which it is expected to enter terminal decline. Although originally predicted to happen between 1985 and 2000, more efficient use of existing resources, combined with new discoveries, has extended the current estimate to 2020. In parallel, the ‘Peak Data Crisis’ refers to the point in time at which there is insufficient affordable persistent storage available for all the copies of the scientific data sets, and their derivative products, that have been deemed of importance to researchers in the community. It is well documented that data volumes are growing at a rate that is faster than exponential, and it is also common for raw data to go through a series of processing levels as the data are converted into more useful parameters and products. The growing data volumes have driven a move towards larger, more centralized processing through well-managed facilities with data repositories that are co-located with computational systems.

Analysis suggests that an increasing demand for storage in these centralized facilities is coming from individual researchers and research groups wanting to reformat or process data into formats and specifications that are specific to their particular use case and/or their chosen application. On these centralized facilities, maintaining multiple copies of petabyte-scale datasets in different formats is becoming untenable, and there needs to be a shift towards internationally agreed community High Performance Data sets that permit users to interactively invoke different forms of computation. To achieve this, individual researchers, or individual research teams, need to join a growing number of global scientific communities to determine agreed formats and standards that make more effective use of existing storage and help delay the ‘Peak Data Crisis’ in the era of Data-intensive Science.
EXTENDED ABSTRACT

Until recently, the lack of consistency in data formats, standards and specifications across the Earth and environmental science communities has been more an inconvenience, as relevant data could be downloaded from repositories or physically shipped for analysis on local computational infrastructures. The inconsistencies in file formats and standards have created specialist roles (often called data wranglers) for people who can use and translate the differing formats to suit the bewildering array of packages and end applications. However, as data volumes have grown exponentially [e.g., 1], the resolution and size of some digital scientific datasets have now reached the point where a new approach is needed to store and dynamically access the datasets. These growing data volumes have also driven a move towards larger, more centralized processing through well-managed facilities with data repositories that are co-located with computational systems (High Performance Computing (HPC) or cloud) that better enable Data-intensive Science to be undertaken.

Large volume data collections typically comprise individual datasets from copious numbers of data collection campaigns, instruments and models collected over many years. To utilize the full capability of these computational systems, the datasets need to be pre-processed (e.g., calibrated, levelled, geo-located) and it is common for raw data collected from instruments to go through a series of levels of processing in the data life cycle. For example, NASA data products are processed at various levels ranging from Level 0 to Level 4, with Level 0 products being raw data at full instrument resolution, whilst at higher levels the data are converted into more useful parameters and formats [2].
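As a purely illustrative sketch, a processing chain of this kind can be thought of as a sequence of transformations applied to raw instrument counts. The calibration constants and linear models below are hypothetical, not NASA's actual algorithms:

```python
# Illustrative sketch of data processing levels (the calibration constants
# and linear models here are hypothetical, not NASA's actual algorithms).

def level1_calibrate(raw_counts, gain=0.25, offset=-10.0):
    """Level 0 -> Level 1: convert raw instrument counts to calibrated values."""
    return [gain * c + offset for c in raw_counts]

def level2_derive(calibrated, scale=0.5):
    """Level 1 -> Level 2: derive a geophysical parameter (toy linear model)."""
    return [scale * v for v in calibrated]

raw = [100, 150, 200]            # Level 0: raw counts at full resolution
l1 = level1_calibrate(raw)       # Level 1: [15.0, 27.5, 40.0]
l2 = level2_derive(l1)           # Level 2: [7.5, 13.75, 20.0]
```

Each level discards instrument-specific detail in exchange for parameters that are directly useful to researchers, which is why higher-level products are the natural candidates for community-agreed formats.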
For usage in High Performance Computing (HPC) environments, each of the data levels then needs to be aggregated into High Performance Data (HPD) [3] assets that are designed to enable users to interactively invoke different forms of computation over these large-scale datasets [4].

In a national and global context, some communities are realizing that they need to come to terms with the need to regularise their datasets, including calibrations, projections and coordinate systems, data encoding standards and formats. To achieve this better uniformity, the current paradigm of translating data formats for the benefit of the application will need to be reconsidered. Further, the cost of storage on these major platforms is such that maintaining multiple copies of petabyte-scale datasets in different formats is becoming untenable, and before long the looming ‘Peak Data Crisis’ will mean that those not preparing for the change will run out of ‘oil’.

The Earth and environmental science communities have typically been working within their domains to address some of these problems. However, the richness of the end-user analysis of data means that the issue has to be addressed across the spectrum of Earth and environmental science data assets. This means that the global scientific community needs to further determine the agreed formats and standards that will enable different disciplines and research groups to invoke different forms of computation over agreed datasets.

The astronomy and climate science communities have been doing this within their disciplines for some years. In astronomy, common data standards and common data services are now part of day-to-day operations (the Flexible Image Transport System (FITS) World Coordinate System [5] and the IVOA data service standards [6]). The Coupled Model Intercomparison Project (CMIP) [7] community and the Earth System Grid Federation (ESGF) [8] established protocols for gridded data from century-scale models. The ESGF manages the first-ever decentralized database for handling climate model data, with multiple petabytes of data at dozens of federated sites worldwide.
It is recognized as the leading infrastructure for the management and access of large distributed data volumes for climate change research. The international Ocean Data Interoperability Platform (ODIP) [9] includes all the major organisations engaged in ocean data management in the EU, US, and Australia. It is a global collaborative effort that focuses on enabling the effective sharing of data across scientific domains and international boundaries through sharing best practice and transferring technology across marine and oceanographic data and data acquisition technologies.

More international community efforts are now required to determine the optimal High Performance Data constructs of our large volume datasets at each of the various levels of processing. The development of specific large volume datasets by individuals is no longer tenable. As with the Peak Oil Crisis, a new technology may come along, equivalent to unconventional hydrocarbon recovery, that will further delay the Peak Data Crisis. However, it is hoped that while the current pressure on data storage exists, it will foster a culture of cooperation on developing internationally compliant community data platforms on which research groups can then undertake multiple competitive research programs and thus further the Australian Innovation Agenda [10]. What is really required is the development of Research Community Platforms that draw on the best of breed of researchers in data, compute and tools, to create agreed collaborative domain-themed platforms that link between the major infrastructures (compute, data and tools) and drive standards within, and enable compatibility across, the disciplines, while at the same time still enabling researchers to undertake new and innovative research methods.
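What ‘agreed formats and standards’ means in practice can be sketched as a simple compliance check run against a dataset's metadata before it is accepted onto a shared platform. The required attribute names below are hypothetical stand-ins for a real community convention (such as the CF metadata conventions used by the CMIP/ESGF community):

```python
# Minimal sketch of a community-convention compliance check. The required
# attribute names are hypothetical stand-ins for a real convention
# (e.g. the CF metadata conventions used by the CMIP/ESGF community).

REQUIRED_ATTRS = {"title", "units", "coordinate_system", "processing_level"}

def missing_attributes(metadata):
    """Return the required attributes absent from a dataset's metadata."""
    return REQUIRED_ATTRS - metadata.keys()

dataset = {
    "title": "Sea surface temperature, 2016",
    "units": "kelvin",
    "processing_level": "L2",
}
print(sorted(missing_attributes(dataset)))  # ['coordinate_system']
```

Automating checks of this kind is one way a shared repository can enforce the single, community-agreed copy of a dataset rather than accumulating per-group variants.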
REFERENCES

1. G. Peng, N. A. Ritchey, K. S. Casey, E. J. Kearns, J. L. Privette, D. Saunders, P. Jones, T. Maycock, and S. Ansari, "Scientific Stewardship in the Open Data and Big Data Era—Roles and Responsibilities of Stewards and Other Major Product Stakeholders", D-Lib Magazine, 2016, 22 (5/6), http://www.dlib.org/dlib/may16/peng/05peng.print.html, accessed 11 June 2016.
2. NASA Science Data Processing Levels, http://science.nasa.gov/earth-science/earth-science-data/data-processing-levels-for-eosdis-data-products/, accessed 11 June 2016.
3. B. Evans, L. Wyborn, T. Pugh, C. Allen, J. Antony, K. Gohar, D. Porter, J. Smillie, C. Trenham, J. Wang, A. Ip, and G. Bell, "The NCI High Performance Computing and High Performance Data Platform to Support the Analysis of Petascale Environmental Data Collections", in Environmental Software Systems. Infrastructures, Services and Applications, R. Denzer, R. M. Argent, G. Schimak, J. Hrebicek, Eds., IFIP AICT 448, pp. 569–577, 2015.
4. R. E. Bryant, "Data-Intensive Supercomputing: The Case for DISC", Computer Science Department Technical Report CMU-CS-07-128, 2007, http://reports-archive.adm.cs.cmu.edu/anon/2007/CMU-CS-07-128.pdf, accessed 11 June 2016.
5. FITS World Coordinate System, http://www.atnf.csiro.au/people/mcalabre/WCS/, accessed 11 June 2016.
6. International Virtual Observatory Alliance (IVOA), http://www.ivoa.net/, accessed 11 June 2016.
7. Coupled Model Intercomparison Project (CMIP), http://cmip-pcmdi.llnl.gov/, accessed 11 June 2016.
8. Earth System Grid Federation, http://esgf.llnl.gov/, accessed 11 June 2016.
9. Ocean Data Interoperability Platform, http://www.odip.org/, accessed 11 June 2016.
10. The National Innovation and Science Agenda, http://www.innovation.gov.au/page/agenda, accessed 11 June 2016.
ABOUT THE AUTHORS

Lesley Wyborn is a geochemist by training. She joined the then BMR in 1972 and for the next 42 years held a variety of geoscience and geoinformatics positions as BMR changed to AGSO and then Geoscience Australia. In 2014 she joined the ANU and currently has a joint adjunct fellowship with the National Computational Infrastructure and the Research School of Earth Sciences. She has been involved in many Australian eResearch projects, including the NeCTAR-funded VGL, the Virtual Hazards, Impacts and Risk Laboratory, and the Provenance Connectivity Projects. She is Deputy Chair of the Australian Academy of Science ‘Data for Science Committee’. She was awarded the Australian Public Service Medal for her contributions to geoscience and geoinformatics in 2014 and the Geological Society of America Geoinformatics Division Career Achievement Award for 2015, and in 2016 she was made a Fellow of the Geological Society of America.

Ben Evans is the Associate Director of Research, Engagement and Initiatives at the National Computational Infrastructure. He oversees NCI's programs in highly scalable computing, Data-intensive computing, data management and services, virtual laboratory innovation, and visualization. He has played leading roles in national virtual laboratories such as the Climate and Weather Science Laboratory (CWSLab) and VGL, as well as in major international collaborations, such as the Unified Model infrastructure underpinning the ACCESS system for climate and weather, the Earth System Grid Federation (ESGF), EarthCube, the Coupled Model Intercomparison Project (CMIP), and its support for the Intergovernmental Panel on Climate Change (IPCC).