

eResearch Australasia Conference | Melbourne – Australia | 10 - 14 October - 2016

Exploiting the Long Tail of Scientific Data: Making Small Data BIG

Presenter: Lesley Wyborn1

Lesley Wyborn1, Kerstin Lehnert2
1 National Computational Infrastructure, Canberra, Australia, lesley.wyborn@anu.edu.au
2 Lamont-Doherty Earth Observatory of Columbia University, New York, USA, [email protected]

SUMMARY
Big Data is no longer on the Gartner hype curve, and small data is increasingly recognized as a highly valuable asset in its own right, whose collective sum has the potential to be of far greater importance than its individual parts. However, for the Earth and environmental sciences, funding for data support is still primarily focused on those areas that generate massive volumes of observational or computed data using large-scale, shared instrumentation such as global sensor networks, satellites, or high-performance computing facilities. Small data sets concatenated into standardized BIG data sets have the potential to make a valuable contribution to research and can be a breeding ground for new and innovative research ideas. Small data can also be used to calibrate large-volume remotely sensed data collections and can provide clues that uncover unforeseen trends in big data sets. In many Earth and environmental areas of research, especially those where data are primarily acquired by individual investigators or small teams (known as ‘Long-tail science communities’), data are poorly shared and integrated, and lack a community-based data infrastructure that ensures persistent access, quality control, and standardization. Because of their heterogeneity and lack of standardization, long tail collections are not attractive to funders, as Returns On Investment (ROIs) are perceived to be low. Different strategies are required that apply to multiple collections of the same data type. Options include (1) a more modular approach to developing the required standards, (2) developing domain specialized repositories and (3) working with the manufacturers of the instruments that generate a substantial proportion of the long tail data to develop agreements for instrument outputs to be compatible with internationally agreed standards.

EXTENDED ABSTRACT
The BIG data world in the Earth and environmental sciences comprises those disciplines that generate massive volumes of observational or computed data using large-scale, shared instrumentation such as global sensor networks, satellites, or high-performance computing facilities. These data are typically standardized, and relatively well managed and curated by funded community data facilities. But many Earth and environmental research data sets are very small in size, especially those where data are primarily acquired by individual investigators or small teams (known as ‘Long-tail science communities’): these small data sets are usually poorly shared and integrated. Data consistency is hard to achieve because they lack a broader community-based data infrastructure that ensures persistent access, quality control, standardization, and integration of data, as well as appropriate tools to fully explore and mine the data within the context of the broader science community.

Whilst the data volumes for individual collections in the Long Tail are small, in total they represent a very significant portion of scientific research papers and outputs [1] and have a huge potential to contribute to science. In a way, they are all pieces of a puzzle which, if they could be put together correctly, have the potential to generate new knowledge. In other words, small data, when properly curated, can be compared and integrated to reveal large-scale temporal and spatial patterns that could lead to new scientific discoveries and insights.

But growing small data to become BIG requires considerable effort and investment. Because small data is heterogeneous and distributed across many disparate institutions, and across national and international borders, different strategies are required to gain acceptance of the need for standardization, which is more widely accepted in the Big Data world. The difficulty with long tail collections is that there may only be a few specialist researchers or research groups developing a particular data set, but there are many thousands of these ‘specialised’ data sets. It is just not cost effective to develop a unique standardised solution for each specialised data set the way it is for, say, a particular satellite or a specific airborne geophysical data set: the economies of scale are just not there.

Three potential ways to make it easier to get more value out of the long tail data collections are:

1) More domain focused repositories
General-purpose and institutional repositories just do not have the required expertise and infrastructure to support the multiple domain specific requirements that ensure integration and reusability. Increasing evidence shows that when long tail collections are properly curated in domain specific repositories, it is easier to aggregate them and enable them to make a valuable contribution to research. Domain-focused repositories are better positioned to implement and enforce community endorsed best practices and guidelines that ensure reusability and harmonization of data, and to provide flexibility and performance of database schemas and search applications. More importantly, as the volumes of data increase, they can provide tools for investigators to contribute data and to support and semi-automate QC workflows that improve the quality of submitted data.

Examples of such domain repositories are the US National Science Foundation (NSF) funded Integrated Earth Data Applications (IEDA) [2] and the Incorporated Research Institutions for Seismology (IRIS) [3] data facilities. IEDA is a data facility funded to develop and operate data services that support data stewardship throughout the full life cycle of observational data in the solid earth sciences, including marine geology and geophysics, geochemistry, and geochronology. Through partnerships, IEDA aims to enhance user experiences by jointly developing streamlined data services, including user-friendly, single-point-of-entry interfaces for data submission, discovery, and access. IRIS is a consortium of over 120 US universities, and its Data Management Centre hosts an extremely large archive of seismic data from hundreds of experiments around the globe. IRIS specializes in the operation of science data facilities for the acquisition, management, and distribution of seismological data.

2) A more modular approach to developing and leveraging international Standards
Although collections of long tail data seem heterogeneous and disconnected at first glance, at the abstract level the majority of scientific observations can be based around the ISO Observations and Measurements model [4], which provides general models and schema for supporting the packaging of observations from laboratory instruments, sensor systems and sensor-related processing. Other relevant abstract ISO/OGC Standards cover GML (Geography Markup Language), Spatial Coordinate Systems, Metadata Standards etc. The development of domain-specific standards for much of the data content collected as part of scientific research is best done under the auspices of the individual ICSU scientific unions or other equivalent learned societies. These could take a leading role in developing ‘authoritative’ domain specific vocabularies and ontologies that would facilitate the harmonization of the same content stored in multiple long tail data systems.
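
As a rough illustration of the modular approach described above, the following Python sketch stores each measurement using the core properties of the Observations and Measurements model (feature of interest, observed property, procedure, phenomenon time, result) and maps locally used property labels onto a shared vocabulary term. The vocabulary entries, sample identifiers, field names of the input rows and helper names are all hypothetical and serve only this example; they are not taken from any published vocabulary or schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical community vocabulary: local label -> agreed term.
# A real vocabulary would be published by a scientific union or similar body.
VOCABULARY = {
    "sio2": "silicon_dioxide_mass_fraction",
    "silica": "silicon_dioxide_mass_fraction",
    "sio2 (wt%)": "silicon_dioxide_mass_fraction",
}


@dataclass(frozen=True)
class Observation:
    """Minimal record named after the core O&M properties."""
    feature_of_interest: str   # e.g. a sample identifier
    observed_property: str     # term from the shared vocabulary
    procedure: str             # instrument or method used
    phenomenon_time: datetime  # time the observed value applies to
    result: float
    unit: str


def harmonise_property(local_name: str) -> str:
    """Map a local label to the shared term; fail loudly if it is unknown."""
    key = local_name.strip().lower()
    if key not in VOCABULARY:
        raise KeyError(f"no vocabulary term for {local_name!r}")
    return VOCABULARY[key]


def to_observation(row: dict) -> Observation:
    """Convert one row of a small, locally formatted data set."""
    return Observation(
        feature_of_interest=row["sample"],
        observed_property=harmonise_property(row["property"]),
        procedure=row["method"],
        phenomenon_time=datetime.fromisoformat(row["time"]),
        result=float(row["value"]),
        unit=row["unit"],
    )


# Two invented small data sets that label the same property differently.
lab_a = [{"sample": "S-001", "property": "SiO2", "method": "XRF",
          "time": "2016-03-01T10:00:00", "value": "52.3", "unit": "wt%"}]
lab_b = [{"sample": "S-002", "property": "Silica", "method": "ICP-OES",
          "time": "2016-04-12T09:30:00", "value": "48.9", "unit": "wt%"}]

# After harmonisation both collections can be concatenated and queried together.
combined = [to_observation(row) for row in lab_a + lab_b]
```

Raising an error on an unknown label, rather than passing it through, is a deliberate choice in this sketch: it forces new terms to be added to the shared vocabulary instead of silently fragmenting the combined data set.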

3) Standardize output from the instrument makers
Another option is to work with the instrument makers, particularly those of small sensor instruments. The data from many instruments are exported in inaccessible and proprietary formats, which makes conversion to open transfer formats difficult or impossible. Individual researchers and research groups just do not have the resources to deal with this, which limits their ability to standardise their data and, in turn, to easily share their data with others and contribute to building internationally compatible standardised data sets.
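
The conversion burden described above can be sketched in a few lines of Python. The vendor export format, the channel codes and the column names below are invented for illustration; real proprietary formats are often binary or undocumented, which is exactly what makes this step costly for individual research groups.

```python
import csv
import io

# Hypothetical vendor export: one reading per line, 'KEY=value' fields
# separated by ';'. This text form is assumed only to keep the sketch short.
VENDOR_EXPORT = """\
TS=2016-06-12T08:00:00;CH1=21.4;CH2=1013.2
TS=2016-06-12T08:10:00;CH1=21.9;CH2=1012.8
"""

# Assumed mapping from vendor channel codes to open column names with units.
CHANNELS = {"CH1": "air_temperature_degC", "CH2": "air_pressure_hPa"}


def parse_line(line: str) -> dict:
    """Split 'KEY=value;KEY=value' into a dict keyed by open column names."""
    fields = dict(part.split("=", 1) for part in line.strip().split(";"))
    record = {"time": fields["TS"]}
    for code, column in CHANNELS.items():
        record[column] = float(fields[code])
    return record


def vendor_to_csv(text: str) -> str:
    """Convert the whole export to CSV text with a self-describing header."""
    records = [parse_line(line) for line in text.splitlines() if line.strip()]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["time", *CHANNELS.values()],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


open_csv = vendor_to_csv(VENDOR_EXPORT)
print(open_csv)
```

If instrument makers exported an agreed open format directly, each research group would no longer need to write and maintain a converter of this kind for every instrument.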

CONCLUSIONS
It is possible to make small data collections into BIG data sets whose total value has the potential to be greater than the sum of the parts. As small data collections continue to increase, we need to seriously consider extending the design of current data facilities and operating them as part of an alliance of related research communities, preferably with international connections, to facilitate sharing of data services and infrastructures. We need to ensure that relevant individual research data collections can be sustainably curated and harmonised so that, however small, they can be reused to answer the research questions of today and those of the future.

REFERENCES
1. Heidorn, P. Bryan 2008 Shedding Light on the Dark Data in the Long Tail of Science. Library Trends, 57(2), In: ‘Institutional Repositories: Current State and Future’ edited by Sarah L. Shreeves and Melissa H. Cragin, pp. 280-299, http://muse.jhu.edu/journals/library_trends/v057/57.2.heidorn.pdf, accessed 12 June 2016.
2. Integrated Earth Data Applications (IEDA), www.iedadata.org, accessed 12 June 2016.
3. Incorporated Research Institutions for Seismology (IRIS), http://ds.iris.edu/ds/nodes/dmc/, accessed 12 June 2016.
4. Cox, S.J.D. (editor) 2015 Observations and Measurements, http://www.ogcnetwork.net/OM, accessed 12 June 2016.

ABOUT THE AUTHORS
Lesley Wyborn is a geochemist by training. She joined the then BMR in 1972 and for the next 42 years held a variety of geoscience and geoinformatics positions as BMR changed to AGSO and then Geoscience Australia. In 2014 she joined the ANU and currently has a joint adjunct fellowship with National Computational Infrastructure and the Research School of Earth Sciences. She has been involved in many Australian eResearch projects, including the NeCTAR funded VGL, the Virtual Hazards, Impacts and Risk Laboratory, and the Provenance Connectivity Projects. She is Deputy Chair of the Australian Academy of Science ‘Data for Science Committee’. She was awarded the Australian Public Service Medal for her contributions to Geoscience and Geoinformatics in 2014 and the Geological Society of America Geoinformatics Division Career Achievement Award for 2015, and in 2016 she was made a Fellow of the Geological Society of America.

Kerstin Lehnert is Senior Research Scientist at the Lamont-Doherty Earth Observatory of Columbia University, where she directs the NSF-funded data facility IEDA (Interdisciplinary Earth Data Alliance). Her background is in petrology and geochemistry, and she holds a PhD in petrology from the University of Freiburg in Germany. Over the past 15 years, her research interest has centered on Geoinformatics, with particular emphasis on the development of data infrastructures for the solid Earth sciences and Earth science samples. Kerstin is currently a member of the NSF Advisory Committee for Cyberinfrastructure, President of the Earth and Space Science Informatics Focus Group of the American Geophysical Union, President of the IGSN e.V., and an elected member of the EarthCube Leadership Council.