
eResearch Australasia Conference | Melbourne, Australia | 10–14 October 2016

Delaying the ‘Peak Data Crisis’ in the Era of Data-intensive Science

Presenter: Lesley Wyborn1

Lesley Wyborn1, Ben Evans2
1 Lesley Wyborn, National Computational Infrastructure, Canberra, Australia, [email protected]
2 Ben Evans, National Computational Infrastructure, Canberra, Australia, [email protected]

SUMMARY

The ‘Peak Oil Crisis’ refers to the point in time when the maximum rate of extraction of petroleum is reached, after which production is expected to enter terminal decline. Although originally predicted to happen between 1985 and 2000, more efficient use of existing resources, combined with new discoveries, has pushed the current estimate out to 2020. In parallel, the ‘Peak Data Crisis’ refers to the point in time at which there is insufficient affordable persistent storage for all the copies of the scientific datasets, and their derivative products, that have been deemed important to researchers in the community. It is well documented that data volumes are growing at a faster-than-exponential rate, and it is common for raw data to pass through a series of processing levels as the data are converted into more useful parameters and products. These growing data volumes have driven a move towards larger, more centralized processing in well-managed facilities where data repositories are co-located with computational systems. Analysis suggests that an increasing share of the demand for storage at these centralized facilities comes from individual researchers and research groups wanting to reformat or process data into formats and specifications that are specific to their particular use case and/or their chosen application. On these centralized facilities, holding multiple copies of petabyte-scale datasets in different formats is becoming untenable, and there needs to be a shift towards internationally agreed community High Performance Data sets that permit users to interactively invoke different forms of computation. To achieve this, individual researchers and research teams need to join a growing number of global scientific communities in determining agreed formats and standards that make more effective use of existing storage and so help delay the ‘Peak Data Crisis’ in the era of Data-intensive Science.

EXTENDED ABSTRACT

Until recently, the lack of consistency in data formats, standards and specifications across the Earth and environmental science communities has been more of an inconvenience than a barrier, as relevant data could be downloaded from repositories or physically shipped for analysis on local computational infrastructures. The inconsistencies in file formats and standards have created specialist roles (often called data wranglers) for people able to translate between the differing formats to suit the bewildering array of packages and end applications. However, as data volumes have grown exponentially [e.g., 1], the resolution and size of some digital scientific datasets have now reached the point where a new approach is needed to store and dynamically access them. These growing data volumes have also driven a move towards larger, more centralized processing through well-managed facilities with data repositories that are co-located with computational systems (High Performance Computing (HPC) or cloud) that better enable Data-intensive Science to be undertaken.
Large-volume data collections typically comprise individual datasets from copious numbers of data collection campaigns, instruments and models gathered over many years. To make use of the available computational capability, these datasets need to be pre-processed (e.g., calibrated, levelled, geo-located), and it is common for raw data collected from instruments to go through a series of processing levels in the data life cycle. For example, NASA data products are processed at levels ranging from Level 0 to Level 4, with Level 0 products being raw data at full instrument resolution, whilst at higher levels the data are converted into more useful parameters and formats [2]. For use in High Performance Computing (HPC) environments, each of the data levels then needs to be aggregated into High Performance Data (HPD) [3] assets that are designed to enable users to interactively invoke different forms of computation over these large-scale datasets [4].
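To make the idea concrete, the following is a minimal sketch of what interactively invoking computation over an aggregated HPD-style asset can look like, assuming a hypothetical collection of CF-compliant netCDF files and the open-source xarray/dask stack; the path, the variable name (sst) and the coordinate names are illustrative only and are not taken from the abstract.

```python
# A minimal sketch: open many per-campaign netCDF files as one logical,
# lazily evaluated dataset and invoke a computation over it on demand.
import xarray as xr

ds = xr.open_mfdataset(
    "/g/data/example/sst/*.nc",   # hypothetical collection path
    combine="by_coords",          # stitch files together along their coordinates
    chunks={"time": 120},         # dask-backed chunks: nothing is read eagerly
)

monthly = ds["sst"].resample(time="1MS").mean()  # builds a lazy task graph only
regional = monthly.sel(lat=slice(-45, -10))      # lazy subsetting, still no I/O
result = regional.compute()                      # computation actually runs here
print(result)
```

The design point this illustrates is that the expensive petabyte-scale copy is stored once, in an agreed format, and each research group expresses its own computation against that single copy rather than materializing a reformatted duplicate.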

In a national and global context, some communities are realizing that they need to regularise their datasets, including calibrations, projections and coordinate systems, data encoding standards and formats. To achieve this better uniformity, the current paradigm of translating data formats for the benefit of each application will need to be reconsidered. Further, the cost of storage on these major platforms is such that holding multiple copies of petabyte-scale datasets in different formats is becoming untenable, and before long the looming ‘Peak Data Crisis’ will mean that those not preparing for the change will run out of ‘oil’.

The Earth and environmental science communities have typically been working within their own domains to address some of these problems. However, the richness of end-user analysis of data means that the issue has to be addressed across the full spectrum of Earth and environmental science data assets. The global scientific community therefore needs to determine the agreed formats and standards that will enable different disciplines and research groups to invoke different forms of computation over agreed datasets.

The astronomy and climate science communities have been doing this within their disciplines for some years. In astronomy, common data standards and common data services are now part of day-to-day operations (the Flexible Image Transport System (FITS) World Coordinate System [5], the IVOA data service standards [6]). The Coupled Model Intercomparison Project (CMIP) [7] community and the Earth System Grid Federation (ESGF) [8] have established protocols for gridded data from century-scale models. The ESGF manages the first decentralized database for handling climate model data, with multiple petabytes of data at dozens of federated sites worldwide, and is recognized as the leading infrastructure for the management of, and access to, large distributed data volumes for climate change research. The international Ocean Data Interoperability Platform (ODIP) [9] includes the major organisations engaged in ocean data management in the EU, the US and Australia; it is a global collaborative effort that enables the effective sharing of data across scientific domains and international boundaries through shared best practice and the transfer of technology across marine and oceanographic data and data acquisition systems.

More international community effort is now required to determine the optimal High Performance Data constructs for our large-volume datasets at each of the various levels of processing; the development of bespoke large-volume datasets by individuals is no longer tenable. As with the Peak Oil Crisis, a new technology may yet come along, equivalent to unconventional hydrocarbon recovery, that further delays the Peak Data Crisis. It is hoped, however, that while the current pressure on data storage exists, it will foster a culture of cooperation in developing internationally compliant community data platforms on which research groups can then undertake multiple competitive research programs and thus further the Australian Innovation Agenda [10]. What is really required is the development of Research Community Platforms that draw on the best-of-breed researchers in data, compute and tools to create agreed, collaborative, domain-themed platforms: platforms that link the major infrastructures (compute, data and tools), drive standards within the disciplines and enable compatibility across them, while still enabling researchers to undertake new and innovative research methods.
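As a small illustration of why agreed, self-describing formats underpin such platforms, the sketch below, assuming a hypothetical CF-compliant netCDF file and the netCDF4 Python library, shows how any CF-aware application can discover what a file contains from its metadata alone, with no bespoke translation layer; the file name is hypothetical.

```python
# A minimal sketch: interrogate a community-standard (CF-convention) file
# generically, using only the metadata the standard guarantees.
from netCDF4 import Dataset

with Dataset("example_cf_file.nc") as nc:  # hypothetical file
    print("Conventions:", getattr(nc, "Conventions", "not declared"))
    for name, var in nc.variables.items():
        # standard_name and units are the CF handles that generic tools key on
        std = getattr(var, "standard_name", "?")
        units = getattr(var, "units", "?")
        print(f"{name}: standard_name={std}, units={units}")
```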

REFERENCES

1. G. Peng, N. A. Ritchey, K. S. Casey, E. J. Kearns, J. L. Privette, D. Saunders, P. Jones, T. Maycock and S. Ansari, "Scientific Stewardship in the Open Data and Big Data Era—Roles and Responsibilities of Stewards and Other Major Product Stakeholders", D-Lib Magazine, 22(5/6), 2016, http://www.dlib.org/dlib/may16/peng/05peng.print.html, accessed 11 June 2016.

2. NASA Science Data Processing Levels, http://science.nasa.gov/earth-science/earth-science-data/data-processing-levels-for-eosdis-data-products/, accessed 11 June 2016.

3. B. Evans, L. Wyborn, T. Pugh, C. Allen, J. Antony, K. Gohar, D. Porter, J. Smillie, C. Trenham, J. Wang, A. Ip and G. Bell, "The NCI High Performance Computing and High Performance Data Platform to Support the Analysis of Petascale Environmental Data Collections", in Environmental Software Systems: Infrastructures, Services, Applications, R. Denzer, R. M. Argent, G. Schimak and J. Hrebicek, Eds., IFIP AICT 448, pp. 569-577, 2015.

4. R. E. Bryant, Data-Intensive Supercomputing: The Case for DISC, Computer Science Department Technical Report CMU-CS-07-128, 2007, http://reports-archive.adm.cs.cmu.edu/anon/2007/CMU-CS-07-128.pdf, accessed 11 June 2016.

5. FITS World Coordinate System, http://www.atnf.csiro.au/people/mcalabre/WCS/, accessed 11 June 2016.

6. International Virtual Observatory Alliance (IVOA), http://www.ivoa.net/, accessed 11 June 2016.

7. Coupled Model Intercomparison Project (CMIP), http://cmip-pcmdi.llnl.gov/, accessed 11 June 2016.

8. Earth System Grid Federation, http://esgf.llnl.gov/, accessed 11 June 2016.

9. Ocean Data Interoperability Platform, http://www.odip.org/, accessed 11 June 2016.

10. The National Innovation and Science Agenda, http://www.innovation.gov.au/page/agenda, accessed 11 June 2016.

ABOUT THE AUTHORS

Lesley Wyborn is a geochemist by training. She joined the then BMR in 1972 and for the next 42 years held a variety of geoscience and geoinformatics positions as BMR changed to AGSO and then Geoscience Australia. In 2014 she joined the ANU and currently holds a joint adjunct fellowship with the National Computational Infrastructure and the Research School of Earth Sciences. She has been involved in many Australian eResearch projects, including the NeCTAR-funded VGL, the Virtual Hazards, Impacts and Risk Laboratory, and the Provenance Connectivity Projects. She is Deputy Chair of the Australian Academy of Science ‘Data for Science Committee’. She was awarded the Australian Public Service Medal for her contributions to geoscience and geoinformatics in 2014 and the Geological Society of America Geoinformatics Division Career Achievement Award for 2015, and in 2016 she was made a Fellow of the Geological Society of America.

Ben Evans is the Associate Director of Research, Engagement and Initiatives at the National Computational Infrastructure. He oversees NCI’s programs in highly scalable computing, data-intensive computing, data management and services, virtual laboratory innovation, and visualization. He has played leading roles in national virtual laboratories such as the Climate and Weather Science Laboratory (CWSLab) and VGL, as well as in major international collaborations such as the Unified Model infrastructure underpinning the ACCESS system for climate and weather, the Earth System Grid Federation (ESGF), EarthCube, the Coupled Model Intercomparison Project (CMIP), and its support for the Intergovernmental Panel on Climate Change (IPCC).