i’m grace peng, one of the data specialists at the ncar ... · standardizaon of data is not a new...
TRANSCRIPT
I’mGracePeng,oneofthedataspecialistsattheNCARResearchDataArchiveakaastheRDA.ManywithinNCARknowusasthepeoplewhostagedataon/gladeforyou.Somepeopleinthisroommayhaveneverheardofus.Formorethan40years,theRDAhasbeencollecInganddisseminaIngweatherandclimatedatatotheresearchcommunity.WehostagrowingcollecIonofover600datasets.Dataspecialistsarepartdatacurators,partdataengineers,partsoLwaredevelopers,andpartsubjectmaNerexperts.Amongourdataspecialists,wehaveexpertsinmeteorology,oceanography,mathemaIcs,physics,chemistryandengineering.We’rewitnessingalargeincreaseintheuseofwxandclimatedatainthecommercial,governmentandeducaIonalsector.Ourgrowinguserbaseincludesmorenon-expertssowefindourselvesbecomingeducatorsaboutwxandclimatedataaswell.Dataremainsaliveaslongaspeopleareusingit.Workingwithstudentshelpsuslearnwhatweneedtodototoday’sdatatomakeitusefulforfutureusers.BythepeoplecometomewiththeirquesIons,theyhaveusuallybeguntheir
1
I’lltrytogetthroughthesefivefundamentalquesIonsin25minutes.IfIdon’t,pleasecometotheBeNerthanFreeworkshoptomorrow.ThefirstQ,Canthis…soundsobvious,butit’sactuallydifficult.ThisisthepartwhereyoumakesureyouhaveascienceprojectandnotanOp-Edpiece.Thesecond(andfirst)Qtogetherhelpsyouformulateyourresearchplan.Datadiscovery,findingthebestdataforyourproject,isadifficulttechnicalproblem.Datasearchenginesfoundedonmetadatacanhelpyou.Ifthedatadoesn’tmatchwhatyouexpect(fromthedocumentaIon),youcouldbeexperiencingtechnicaldifficulIesreadingthedataincorrectly,oryoumayhavediscoveredanerrorinthedata.Scienceshouldbereproducible.You’veheardothertalksthismorningaboutdatacuraIonanddigitalobjectidenIfiersfordatasets.I’llonlyneedtocoverthemechanicsofhowtocitedatasothatothersunderstandwhatyoudid.
2
Canthisproblembeansweredwithdata?isametaquesIonthatyouneedtothinkthroughbeforestarIngonanyendeavor.Isitmeasurable?
Ifnotdirectlymeasureable,isitrelatedtosomethingmeasurable?Hasthedatabeencollectedalready?MayhavetowaitunIldataexists.
Shouldthisproblembetackled?Don’tcollectsensiIveinformaIonifyoucan’tsafeguardit.
Wethinkofweatherasbenigndata.But,asweatherappsturnpeopleintomobileweatherstaIons,whataretheethicalandsafetyconsideraIons?Ifweknowwheresomeonelives(theirnight-Imeclosestcelltower),andalsowheretheyworkorgotoschoolduringtheday,wecannarrowitdowntoafewpeople.IfthatinformaIonendsupinthehandsofadverIsers,itisannoyingandintrusive.Ifitendsupinthehandsofaviolentexorarepressivegovernment,itcouldbedeadly.Inmyownwork,IhelpresearchersobtaindataforpolluIonstudiesincountrieswherequesIoningtheenvironmentalcostsofdevelopmenthaslandedpeopleinjail.WhatiftheirgovernmentdemandsIhandovertheirdatarequestrecords?
3
Dataavailabilityinformsourapproachestoresearchproblems.WhatkindofspaIo-temporalcoverageandresoluIondoyouneed?Precisionandaccuracy(P&A)ofmeasurements?Weallwantthe“best”data,withthemostP&A.But,workthroughyourerrorbudgettofindtheminimumacceptableP&A.ThiswillopenupmoredatapossibiliIesforyourresearch.Doesitrequiredatafusion;doyouneedtocombinedatafrommulIplesources?Thisisusuallythecaseorelseitwouldn’tberesearch.Newdataenablesnewinquiries.But,donotoverlookthevalueoflookingatolddatainnewways.MyfavoriteexampleisthepaperthatexaminedtheprevalenceofcirruscloudsintheSLCareabeforeandaLeritbecameaDeltaAirlineshubairport.TheycorrelatedjetfueltaxreceiptsandcirrusprevalenceandmadeaveryconvincingcasethatjetcontrailsarethecauseoftheincreasingprevalenceofcirruscloudsnearSLC.
4
ThepracIceofDatadiscoveryissIllinthejuvenilephase.Thereisn’tonekillerappordominantsearchenginethatcansearchalldatarepositoriesyet.Thisisamixedbagwithupsides.ConsidertheImespentindatadiscoveryalearningopportunity.Readresearchrelatedtoyourproject.Whatkindsofdatadotheyuseandwheredidtheyobtainit?Datacanbeciteddirectly,asanenItyinitsownright,oLenwithaDOI.Thisisthemodern,preferredmethod.Aside:[DOIstandsforDigitalObjectIdenIfiersareonetoonemappingsofdatasets(DS)toDOInumber(unlikeDSnames).SomeplacesgivenewDOIstonewversionsofaDS,whileothersgivenewDOIsformajornewversionsofDSs.However,thepracIceofciIngdatapapers,papersdescribingthepreparaIonofadataset,isalsoinwidespreaduseandhasbeenusedformuchlonger.Beforetheadventofdatapapers,peoplewouldcitethefirstsciencepaperthatused
5
Asanexample,takealookattheRDAfrontdoor,rda.ucar.eduYoucandoafreetextsearchfordataatthetop.Or,youcanperformafacetedsearch,successivelynarrowingselecIonsbaseduponmetadatacriteria.ThesearejustafewofthemanyopIonsenabledattheRDA.Checkoutournewdatasetsandotherannouncementsonourblog.Takeavideotourofourhomepagetolearnhowtousethemyriadfeaturesofoursite.
6
TheRDA’ssearchfeaturesarepoweredbystandardizedmetadata,mostcommonlyGlobalChangeMasterDirectorykeywords.NetCDFiscloselyassociatedwithCFstandardsandNCAR.Itisself-describingandself-contained.Thereareseveralotherstandardsinwideuse.SomeolderstandardssuchasFGDCmaybesupercededbynewerones.However,someoldstandardssuchasWMOandGRIBremainindailyuseworldwide.GRIBandBUFRareself-describing,butnotself-contained.Theyrequireexternaltabletounpackthebinarydatacorrectly.Ifyouapplythewrongtable,youreadthevaluesincorrectly.GRIBandBUFRarewidelyusedinoperaIonaltransmission,butmustbecarefulifusedlater.IfyouareunsureaboutthecorrectGRIBtabletoapply,ASK!Adatafilecanbemappedtomorethanonedatastandardforinteroperability.Infact,olderbutsIllusefuldatasetsshouldbemappedtonewmetadatastandardstomaintaintheirdiscoverabilitybybothhumansandmachines.
7
GCMDkeywordsmapoutexplicitlythecontentsofdatafiles.Thisshouldeliminateguesswork,whichcanleadtomisinterpreIngdata.Thislooksexcessivelydetailed,butthisishowmachine-to-machinecommunicaIonworks.Thisalsoenablesdistributedsearch,whereoneportalcancrawlotherdatasitesforyou.Dataproducerscanobtainhelpfromdatacurators/metadataspecialiststomapouttheirdata.
8
StandardizaIonofdataisnotanewidea.Once,whalingshiplogbookslookedliketheoneatleL.In1853,MauryandJansencreatedandtestedanuniversallogbookformatthatissimilartowhat’susedtoday.ThentheyorganizedaninternaIonalconferencetopromotetheuseoftheiruniversallogbookformat.SeafarerscouldsIllmaintainidiosyncraIclogbookslikethoseatleL,buttheywererequiredtoalsologintheirinformaIonintheuniversallogbookwithstandardizedunits.BaumanRareBookshNp://sappingaNenIon.blogspot.com/2012/11/reading-digital-sources-case-study-in.htmlTheBrusselsConferenceanditslegacyLp://Lp.wmo.int/Documents/PublicWeb/amp/mmop/documents/JCOMM-TR/J-TR-27-BRU150-Proceedings/DOCUMENTS_JCOMM_27/Session_4/4_2_Konnen.pdf
9
ThisdatastandardizaIonenablesustocreatemapslikethis.WeknowglobalshippingroutesandtrafficoverImefromtheselogbooks.Weagreedtocommonunitsofmeasurement(laItudeandlongitude,GreenwichIme,Celsiusscale,frequencyofrecordedmeasurements)hNp://sappingaNenIon.blogspot.com/2012/11/reading-digital-sources-case-study-in.htmlThedataofthepastisavailabletoustodaybecauseofpreservaIoneffortsofdatacompilersandcuratorsofyesterday.It’sunderstandabletousonlybecausetheyhandeddownnotjustthedata,buttheknowledgeaboutthedataintheformofmetadataandadherencetowell-describeddataconvenIons.Theseareknownasdataandmetadatastandards.ThecorollaryisthatouradherencetodatastandardstodaywillenableyourdataoftodaytoImetravelandaidfuturescienIsts.
10
ThisiscurrentlytheRDA’smostpopulardataset;itisusedtoiniIalizeNumericalWeatherPredicIon(NWP)modelssuchasWRF.Ifwearchiveadataset,wealsoarchivedocumentaIonabouthowitwaspreparedandsoLwaretoreadit.Ifthedatasetisarchivedinoneofthemajorstandards,suchasNetCDForGRIB,wedon’tneedtosaymuchmoreaboutit.But,ifitisinamoreidiosyncraIcformat,wewillprovidespecializedreadersandotherhelp.NotethateachdatasetisassignedtoadataspecialistwhosecontactinformaIonisatthetopright.Dataspecialistsserveasaconduitbetweendatausersanddataproducers.
11
Usersaccessdatathroughtools.One-off(ad-hoc)datarequiresone-offtools.DataandmetadatastandardizaIonallowsustocreatereusabletools.Thisecosystemofdatastandards,tools,searchanddeliverytoolsmakesupourshareddatainfrastructure.Wheninfrastructureworkssmoothly,themagicbecomesmundaneandinvisible.Usersarenotchargedthefullcostofdevelopingormaintaininginfrastructure.Butpleasekeepinmindthatinfrastructureisnotcheapanddelayingmaintenancecancausecatastrophicfailuresinthefuture.
12
Speakingofmagic.Whenyouopenupthosedatafiles,takesomeImetointerrogateitscontents.IfyouhaveIme,explorebeyondthevaluesthatyouoriginallywantedtousetoseeifsomethingelsemightopenupnewpossibiliIesforyourresearch.Ifyoudon’tgetwhatyouexpecttosee,eitherinthefilecontentsorvaluesthatdon’tmakephysicalsense,stop.Consultwiththedataprovider.Ifyougotthedatafromus,sendamessagetordahelp@ucar.eduandccittothedataspecialistforthatdataset.
13
Forthesakeofreproduciblescience,usesounddatacitaIonpracIces.DifferentjournalshavedifferentdatacitaIonstyles.Theysharecommonrequiredelements.Whocreatedthedataset?Howisitcalled?.GivetheDOIifithasone.Wheredidyougetit?DatacanbereplicatedinmulIpleplacesthatservedifferentsubsetsofthedata.CiIngthedatasourceisimportant.DataissomeImesreprocessedorcorrected.Putdowntheversionnumberanddataaccessdate.IfyoumakeadatarequestthroughtheRDAwebinterface,therequestisautomaIcallyloggedinourdatabase.Shouldanythinghappen,wecanregeneratethedatarequestforyouoranyoneelsewhowantstorepeatyourwork.Thiscansaveyoustoragespaceasyouneedonlysavelocallywhatyouarecurrentlyusing.OurdatabaserecordsalsoallowustomakecustomdatacitaIonsforyouwiththiswidgetfoundonallofourdatasethomepages.ThewidgetgeneratescitaIonsinAMS,AGUandseveralotherpopularstyles.YoucanalsocreateRISandBibTeX
14
Ifyouareinterestedinthinkingmetaaboutdata,Iputupalistofmyfavoritedatabooks.Wealsomaintainablogwherewemakeannouncements,giveIpsandposttutorials.Thank-you.
15