eResearch Australasia Conference | Melbourne – Australia | 10-14 October 2016

Exploiting the Long Tail of Scientific Data: Making Small Data BIG
Presenter: Lesley Wyborn1

Lesley Wyborn1, Kerstin Lehnert2
1 Lesley Wyborn, National Computational Infrastructure, Canberra, Australia, lesley.wyborn@anu.edu.au
2 Lamont-Doherty Earth Observatory of Columbia University, New York, USA, lehnert@ldeo.columbia.edu

SUMMARY
Big Data is no longer on the Gartner hype cycle: increasingly, small data is gaining recognition as a highly valuable asset in its own right, whose collective sum has the potential to be of far greater importance than all of its parts. However, for the Earth and environmental sciences, funding for data support is still primarily focused on those areas that generate massive volumes of observational or computed data using large-scale, shared instrumentation such as global sensor networks, satellites, or high-performance computing facilities. In their own right, small data sets concatenated into standardized BIG data sets have the potential to make a valuable contribution to research and can be a breeding ground for new and innovative research ideas. Small data can also be used to calibrate large volume remotely sensed data collections and can provide clues that uncover unforeseen trends in big data sets. In many Earth and environmental areas of research, especially those where data are primarily acquired by individual investigators or small teams (known as ‘Long-tail science communities’), data are poorly shared and integrated, and lack a community-based data infrastructure that ensures persistent access, quality control, and standardization. Because of their heterogeneity and lack of standardization, long tail collections are not attractive to funders, as Returns On Investment (ROIs) are perceived to be low. Different strategies are required that apply to multiple collections of the same data type. Options include (1) a more modular approach to developing the required standards, (2) developing domain specialized repositories and (3) working with instrument manufacturers that generate a substantial proportion of the long tail data to develop agreements for instrument outputs to be compatible with internationally agreed standards.

EXTENDED ABSTRACT
The BIG data world in the Earth and environmental sciences comprises those disciplines that generate massive volumes of observational or computed data using large-scale, shared instrumentation such as global sensor networks, satellites, or high-performance computing facilities. These data are typically standardized, and relatively well managed and curated by funded community data facilities. But many Earth and environmental research data sets are very small in size, especially those where data are primarily acquired by individual investigators or small teams (known as ‘Long-tail science communities’): these small data sets are usually poorly shared and integrated. Data consistency is hard to achieve because they lack a broader community-based data infrastructure that ensures persistent access, quality control, standardization, and integration of data, as well as appropriate tools to fully explore and mine the data within the context of the broader science community.

Whilst the data volumes for individual collections in the Long Tail are small, in total they represent a very significant portion of scientific research papers and outputs [1] and have a huge potential to contribute to science. In a way, they are all pieces of a puzzle which, if they could be put together correctly, have the potential to generate new knowledge. In other words, small data, when properly curated, can be compared and integrated to reveal large-scale temporal and spatial patterns that could lead to new scientific discoveries and insights.

But growing small data to become BIG requires considerable effort and investment. Because small data is heterogeneous, and distributed across many disparate institutions and across national and international borders, different strategies are required to gain acceptance of the need for standardization that is more accepted in the Big Data world. The difficulty with long tail collections is that there may only be a few specialist researchers or research groups developing a particular data set, but there are many thousands of these ‘specialised’ data sets. It is just not cost effective to develop a unique standardised solution for each specialised data set the way it is for, say, a particular satellite or a specific airborne geophysical data set: the economies of scale are just not there.

Three potential ways to make it easier to get more value out of the long tail data collections are:

1) More domain focused repositories
General-purpose and institutional repositories just do not have the required expertise and infrastructure to support the multiple domain specific requirements that ensure integration and reusability. Increasing evidence shows that when properly curated in domain specific repositories, it is easier to aggregate long tail collections and enable them to make a valuable contribution to research. Domain-focused repositories are better positioned to implement and enforce community endorsed best practices and guidelines that ensure reusability and harmonization of data, and to provide flexibility and performance of database schemas and search applications. More importantly, as the volumes of data increase, they can provide tools for investigators to contribute data and to support and semi-automate QC workflows that improve quality of submitted data.

Examples of such domain repositories are the US National Science Foundation (NSF) funded Integrated Earth Data Applications (IEDA) [2] and the Incorporated Research Institutions for Seismology (IRIS) [3] data facilities. IEDA is a data facility funded for the solid earth sciences to develop and operate data services that support data stewardship throughout the full life cycle of observational data in the solid earth sciences, including marine geology and geophysics, geochemistry, and geochronology. Through partnerships, IEDA aims to enhance user experiences by jointly developing streamlined data services, including user-friendly, single-point of entry interfaces for data submission, discovery, and access. IRIS is a consortium of over 120 US universities, and its Data Management Centre hosts an extremely large archive of seismic data from hundreds of experiments around the globe. IRIS specializes in the operation of science data facilities for the acquisition, management, and distribution of seismological data.

2) A more modular approach to developing and leveraging international standards
Although collections of long tail data seem heterogeneous and disconnected at first glance, at the abstract level the majority of scientific observations can be based around the concept of the ISO Observation and Measurement model [4], which provides general models and schema for supporting the packaging of observations from laboratory instruments, sensor systems and sensor-related processing. Other relevant abstract ISO/OGC Standards cover GML (Geography Markup Language), Spatial Coordinate Systems, Metadata Standards etc. Domain-specific standards for much of the data content that is collected as part of scientific research are best developed under the auspices of the ICSU individual scientific unions or other equivalent learned societies. These could take a leading role in the development of ‘authoritative’ domain specific vocabularies and ontologies that would facilitate the harmonization of the same content stored in multiple long tail data systems.
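
To make the idea of a shared abstract observation model concrete, the following is a minimal sketch, not the ISO/OGC schema itself. It assumes two invented long tail collections with different native layouts and units, and maps both onto one common record type. The class name `Observation`, its field names, the sample identifiers, the values and the `TO_PPM` conversion table are all hypothetical and chosen only for illustration; the fields are loosely inspired by the kinds of concepts such a model covers (what was observed, on what, by which procedure, with what result).

```python
from dataclasses import dataclass


# Hypothetical common record; field names are illustrative only and are
# NOT taken from the ISO/OGC schema.
@dataclass(frozen=True)
class Observation:
    feature_of_interest: str  # e.g. a sample identifier
    observed_property: str    # e.g. an element name
    procedure: str            # e.g. the analytical method
    result: float             # value expressed in the common unit
    unit: str                 # the common unit, here always "ppm"


# Assumed conversion factors to the common unit (parts per million by mass).
TO_PPM = {"ppm": 1.0, "wt%": 10_000.0, "ppb": 0.001}


def to_observation(sample, prop, method, value, unit):
    """Map one native measurement onto the shared record, converting units."""
    if unit not in TO_PPM:
        raise ValueError(f"unsupported unit: {unit}")
    return Observation(sample, prop, method, value * TO_PPM[unit], "ppm")


# Two invented small collections with different native layouts.
lab_a = [{"sample": "A-001", "element": "Cu", "ppm": 120.0, "method": "XRF"}]
lab_b = [("B-017", "Cu", 0.02, "wt%", "ICP-MS")]

# Harmonise both collections into one list of common records.
harmonised = (
    [to_observation(r["sample"], r["element"], r["method"], r["ppm"], "ppm")
     for r in lab_a]
    + [to_observation(s, p, m, v, u) for (s, p, v, u, m) in lab_b]
)

for obs in harmonised:
    print(obs)
```

The per-collection mapping code stays small, while everything downstream (aggregation, search, quality control) only needs to understand the one shared record. This is the economy of scale that a modular, standards-based approach is meant to recover.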

3) Standardize output from the instrument makers
Another option is to work with the instrument makers, particularly those of small sensor instruments. The data from many instruments is exported in inaccessible and proprietary formats, which makes conversion to open transfer formats difficult or impossible. Individual researchers and research groups just do not have the resources to deal with this, which limits their ability to standardise their data and, in turn, to easily share their data with others and contribute to building internationally compatible standardised data sets.

CONCLUSIONS
It is possible to make small data collections into BIG data sets whose total value has the potential to be greater than the sum of the parts. As small data collections continue to increase, we need to seriously consider extending the design of current data facilities and operating them as part of an alliance of related research communities, preferably with international connections, to facilitate sharing of data services and infrastructures. We need to ensure that relevant individual research data collections can be sustainably curated and harmonised so that, however small, they can be reused to answer the research questions of today and those of the future.

REFERENCES
1. Heidorn, P. Bryan 2008. Shedding Light on the Dark Data in the Long Tail of Science. Library Trends, 57 (2), In: ‘Institutional Repositories: Current State and Future’, edited by Sarah L. Shreeves and Melissa H. Cragin, pp. 280-299, http://muse.jhu.edu/journals/library_trends/v057/57.2.heidorn.pdf, accessed 12 June 2016.
2. Integrated Earth Data Applications (IEDA), www.iedadata.org, accessed 12 June 2016.
3. Incorporated Research Institutions for Seismology (IRIS), http://ds.iris.edu/ds/nodes/dmc/, accessed 12 June 2016.
4. Cox, S.J.D. (editor) 2015. Observation and Measurement, http://www.ogcnetwork.net/OM, accessed 12 June 2016.

ABOUT THE AUTHORS
Lesley Wyborn is a geochemist by training. She joined the then BMR in 1972 and for the next 42 years held a variety of geoscience and geoinformatics positions as BMR changed to AGSO, then Geoscience Australia. In 2014 she joined the ANU and currently has a joint adjunct fellowship with National Computational Infrastructure and the Research School of Earth Sciences. She has been involved in many Australian eResearch projects, including the NeCTAR funded VGL, the Virtual Hazards, Impacts and Risk Laboratory, and the Provenance Connectivity Projects. She is Deputy Chair of the Australian Academy of Science ‘Data for Science Committee’. She was awarded the Australian Public Service Medal for her contributions to Geoscience and Geoinformatics in 2014, the Geological Society of America Geoinformatics Division Career Achievement Award for 2015, and in 2016 she was made a Fellow of the Geological Society of America.

Kerstin Lehnert is Senior Research Scientist at the Lamont-Doherty Earth Observatory of Columbia University, where she directs the NSF-funded data facility IEDA (Interdisciplinary Earth Data Alliance). Her background is in petrology and geochemistry, and she holds a PhD in petrology from the University of Freiburg in Germany. Over the past 15 years, her research interest has centered on Geoinformatics with particular emphasis on the development of data infrastructures for the solid Earth sciences and Earth science samples. Kerstin is currently a member of the NSF Advisory Committee for Cyberinfrastructure, President of the Earth and Space Science Informatics Focus Group of the American Geophysical Union, President of the IGSN e.V., and an elected member of the EarthCube Leadership Council.
