Data Entry and Manipulation
DataONE Community Engagement & Outreach Working Group
Best Practices for Creating Data Files
Data Entry Options
Data Integration Best Practices
Data Manipulation Options
Lesson Topics
Recognize and plan for inconsistencies that can make a dataset difficult to understand and/or manipulate
Describe characteristics of stable data formats and list reasons for using these formats
Identify data entry tools
Identify validation measures that can be performed as data is entered
Review best practices for data integration
Describe the basic components of a relational database
Learning Objectives
Create quality datasets that are:
  Valid
  Organized to support ease of use and reuse
Goals of Data Entry
Example: Poor Data Entry
Example: Poor Data Entry, continued
Columns of data are consistent: only numbers, dates, or text
Consistent names, codes, and formats (date) are used in each column
Data are all in one table, which is much easier for a statistical program to work with than multiple small tables that each require human intervention
Recommended Practices
Create descriptive column names without spaces or special characters
  SoilT30 to Soil_Temp_30cm
  Species-Code to Species_Code (avoid using -, +, *, ^ in column names; some software may interpret these symbols as operators)
Use a descriptive file name. For instance, a file named SEV_SmallMammalData_v.5.25.2010.csv indicates the project the data is associated with (SEV), the theme of the data (SmallMammalData), and when this version of the data was created (v.5.25.2010). This name is much more helpful than a file named mydata.xls.
Recommended Practices, continued
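The column-naming advice above can be applied automatically when headers arrive from another source. The following is a minimal sketch, not a prescribed method: the function name `clean_column_name` and the second example header are assumptions made for illustration.

```python
import re


def clean_column_name(name: str) -> str:
    """Return a column name made only of letters, digits, and underscores."""
    # Replace every run of other characters (spaces, -, +, *, ^, parentheses)
    # with a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip())
    # Drop underscores left dangling at either end.
    return cleaned.strip("_")


print(clean_column_name("Species-Code"))      # Species_Code
print(clean_column_name("Soil Temp (30cm)"))  # Soil_Temp_30cm
```

A function like this only removes problem characters; making a terse name such as SoilT30 descriptive still requires a person who knows what the column holds.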
Missing data
  Preferably leave the field empty (NULL = no value)
  In numeric fields, use a distinct value such as 9999 to indicate a missing value
  In text fields, use NA (“Not Applicable” or “Not Available”)
  Use data flags in a separate column to qualify missing values
Recommended Practices, continued
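Whichever missing-value convention is chosen, software reading the file has to be told about it, or a code such as 9999 will be treated as a real measurement. The sketch below shows one way to do that with Python's standard `csv` module. The set of codes, the column names, and the species codes in the sample are assumptions for illustration only.

```python
import csv
import io

# Assumed missing-value codes: empty field, NA, and the numeric sentinel 9999.
MISSING_CODES = {"", "NA", "9999"}


def read_rows(text):
    """Parse CSV text and turn every missing-value code into None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        cleaned = {}
        for column, value in row.items():
            value = (value or "").strip()
            cleaned[column] = None if value in MISSING_CODES else value
        rows.append(cleaned)
    return rows


# Hypothetical sample data.
sample = (
    "Site,Soil_Temp_30cm,Species_Code\n"
    "A,12.5,PEFL\n"
    "B,9999,NA\n"
    "C,,DIME\n"
)
rows = read_rows(sample)
print(rows[1])  # both measurements for site B are reported as None
```

Treating 9999 as missing in every column is a simplification; in a real dataset the sentinel should be checked per column so that a legitimate value of 9999 is not discarded.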
Enter complete lines of data
Recommended Practices, continued
For the long term, store data in a consistent format that can be read well into the future and that can be used by any application now or in the future
Appropriate file types include:
  Non-proprietary: use an open, documented standard
  Common usage by research community: standard representation (ASCII, Unicode)
  Unencrypted
  Uncompressed
ASCII formatted files are likely to be readable into the future
Use ASCII (comma-separated) for tabular data
Best Practices
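As an illustration of the plain-text recommendation above, the following sketch writes a small table as an unencrypted, uncompressed, comma-separated text file and reads it back. The file name, column names, and values are hypothetical, and a temporary directory is used only so the example can run anywhere.

```python
import csv
import os
import tempfile

# Hypothetical table: a header row followed by complete lines of data.
rows = [
    ["Site", "Date", "Soil_Temp_30cm"],
    ["A", "2010-05-25", "12.5"],
    ["B", "2010-05-26", ""],  # empty field marks a missing value
]

path = os.path.join(tempfile.mkdtemp(), "example_soil_temp.csv")

# newline="" lets the csv module control line endings; UTF-8 is a
# standard text encoding that is identical to ASCII for these characters.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm nothing was lost in the round trip.
with open(path, newline="", encoding="utf-8") as f:
    round_trip = list(csv.reader(f))

print(round_trip == rows)  # True
```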
Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Les A. Hook, Suresh K. Santhana Vannan, Tammy W. Beaty, Robert B. Cook, and Bruce E. Wilson. https://daac.ornl.gov/PI/BestPractices-2010.pdf
Preparing Data for Sharing. 2015. Libbie Stephenson. https://doi:10.7910/DVN/BJNXVQ
Resources
Two common tools: Google Docs, Excel
Data Entry Tools
Google Docs Forms
Google Docs Spreadsheet
Excel
Excel: Data Validation
Spreadsheet:
  Great for charts, graphs, calculations
  Flexible about cell content type: cells in the same column can contain numbers or text
  Easy to use, but harder to maintain as the complexity and size of the data grow
Relational database:
  Easy to query to select portions of data
  Data fields are typed. For example, only integers are allowed in integer fields
  Columns cannot be sorted independently of each other
  Steeper learning curve than a spreadsheet
Spreadsheet versus Relational Database
What is a relational database?
Database Features: Explicit control over data types
Relationships are defined between tables
Using Structured Query Language (SQL)
Data entry using a database
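The database features named above (typed fields, relationships between tables, and SQL queries) can be tried without installing a server by using the SQLite engine bundled with Python. This is a minimal sketch under stated assumptions: the table names, column names, and species codes are invented for illustration, and SQLite is used only because it is convenient. Note that SQLite is lenient about declared column types by default, so the CHECK and foreign-key constraints carry most of the validation here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces relationships only on request

# Two related tables: each capture must refer to a known site.
conn.executescript("""
CREATE TABLE site (
    site_id     TEXT PRIMARY KEY,
    elevation_m REAL NOT NULL
);
CREATE TABLE capture (
    capture_id   INTEGER PRIMARY KEY,
    site_id      TEXT NOT NULL REFERENCES site(site_id),
    species_code TEXT NOT NULL,
    weight_g     REAL CHECK (weight_g > 0)
);
""")

conn.executemany("INSERT INTO site VALUES (?, ?)",
                 [("A", 1500.0), ("B", 1620.0)])
conn.executemany(
    "INSERT INTO capture (site_id, species_code, weight_g) VALUES (?, ?, ?)",
    [("A", "PEFL", 8.5), ("A", "DIME", 40.0), ("B", "PEFL", 9.0)],
)

# A capture at an unknown site is rejected at entry time.
try:
    conn.execute(
        "INSERT INTO capture (site_id, species_code, weight_g) "
        "VALUES ('Z', 'PEFL', 8.0)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# SQL query joining the two tables: capture count and mean weight per site.
result = conn.execute("""
    SELECT s.site_id, COUNT(*), AVG(c.weight_g)
    FROM capture AS c
    JOIN site AS s ON s.site_id = c.site_id
    GROUP BY s.site_id
    ORDER BY s.site_id
""").fetchall()

print(rejected)  # True
print(result)    # [('A', 2, 24.25), ('B', 1, 9.0)]
```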
Be aware of best practices in your domain when designing data file structures
Choose a data entry method that allows some validation of data as it is entered
Consider investing time in learning how to use a database if datasets are large or complex
Review: Planning for Data Entry
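Validation as data is entered can be as simple as checking each record against a few rules before it is saved. The sketch below is one possible shape for such a check; the field names, the ISO date format, and the list of valid species codes are assumptions, not rules taken from any particular project.

```python
from datetime import datetime

# Hypothetical controlled vocabulary for the Species_Code column.
VALID_SPECIES = {"PEFL", "DIME"}


def validate_record(record):
    """Return a list of problems found in one record (empty list = valid)."""
    errors = []

    # Dates must follow one consistent format.
    try:
        datetime.strptime(record.get("Date"), "%Y-%m-%d")
    except (TypeError, ValueError):
        errors.append("Date must be YYYY-MM-DD")

    # Codes must come from the agreed list.
    if record.get("Species_Code") not in VALID_SPECIES:
        errors.append("Unknown Species_Code")

    # Numeric columns must hold numbers in a plausible range.
    try:
        if float(record.get("Weight_g")) <= 0:
            errors.append("Weight_g must be positive")
    except (TypeError, ValueError):
        errors.append("Weight_g must be a number")

    return errors


good = {"Date": "2010-05-25", "Species_Code": "PEFL", "Weight_g": "8.5"}
bad = {"Date": "5/25/2010", "Species_Code": "pefl", "Weight_g": "heavy"}
print(validate_record(good))  # []
print(validate_record(bad))   # three messages, one per failed rule
```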
Consider trying one of these:
Personal, single-user databases can be developed in MS Access, which is stored as a file on the user’s computer. MS Access comes with easy GUI tools to create databases, run queries, and write reports. A more robust database that is free, accommodates multiple users, and will run on Windows or Linux is MySQL. GUI interfaces for MySQL include phpMyAdmin (free) and Navicat (inexpensive).
If you want to try a database:
Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design (2nd Edition). Michael J. Hernandez. Addison-Wesley, 2003.
Fundamentals of Relational Database Design. Paul Litwin. http://r937.com/relational.html. (Accessed May 12, 2016).
To learn more about designing a relational database:
Maintain dataset provenance
  Document transformations
  Beware of accidental duplication
Review metadata for compatibility of context, methods, and meaning
  For what purpose was the data collected?
  How was the data collected?
  Is it sensible to combine these datasets?
Data Integration Best Practices
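Two of the points above, keeping provenance and guarding against accidental duplication, can be built into the step that combines datasets. The following sketch tags every record with the file it came from and sets aside records whose key has already been seen. The function name, the choice of key columns, and the sample file names and records are all hypothetical.

```python
def combine_with_provenance(datasets, key_columns=("Site", "Date", "Species_Code")):
    """Merge records from several sources.

    datasets maps a source name to a list of record dicts.
    Returns (combined, duplicates): combined records each gain a "Source"
    field; duplicates lists (key, first_source, repeated_source) tuples.
    """
    first_seen = {}
    combined, duplicates = [], []
    for source, records in datasets.items():
        for record in records:
            key = tuple(record[column] for column in key_columns)
            if key in first_seen:
                # Report the repeat instead of silently keeping or dropping it.
                duplicates.append((key, first_seen[key], source))
                continue
            first_seen[key] = source
            combined.append({**record, "Source": source})
    return combined, duplicates


# Hypothetical input: the second file repeats one record from the first.
datasets = {
    "survey_a.csv": [
        {"Site": "A", "Date": "2010-05-25", "Species_Code": "PEFL"},
        {"Site": "A", "Date": "2010-05-26", "Species_Code": "DIME"},
    ],
    "survey_b.csv": [
        {"Site": "A", "Date": "2010-05-25", "Species_Code": "PEFL"},
        {"Site": "B", "Date": "2010-05-25", "Species_Code": "PEFL"},
    ],
}
combined, duplicates = combine_with_provenance(datasets)
print(len(combined))  # 3
print(duplicates)
```

Whether two records with the same key are true duplicates or separate observations is a judgment that depends on how the data were collected, which is why the sketch reports them rather than deciding.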
Ensure compatibility
  Convert to common units
  Choose appropriate numeric precision
  Evaluate and standardize missing value codes
Document all assumptions
  What assumptions underlie the original datasets?
  What assumptions did you make in combining the datasets?
Data Integration Best Practices
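The compatibility steps above (common units, a chosen precision, and one missing-value convention) can be collected in a single function so the same rules are applied to every source. This is a sketch only: the target unit (degrees Celsius), the precision (one decimal place), and the list of missing codes are assumptions that a real project would need to decide and document.

```python
# Assumed missing-value codes found across the source datasets.
MISSING_CODES = ("", "NA", "9999", "-999")


def fahrenheit_to_celsius(value_f):
    return (value_f - 32.0) * 5.0 / 9.0


def standardize_temperature(raw, unit):
    """Convert one raw temperature string to Celsius, rounded to 0.1 degrees.

    Returns None for any recognized missing-value code.
    """
    text = raw.strip()
    if text in MISSING_CODES:
        return None
    value = float(text)
    if unit == "F":
        value = fahrenheit_to_celsius(value)
    elif unit != "C":
        raise ValueError(f"Unsupported unit: {unit!r}")
    return round(value, 1)


print(standardize_temperature("68", "F"))     # 20.0
print(standardize_temperature("12.54", "C"))  # 12.5
print(standardize_temperature("9999", "C"))   # None
```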
Recognize that you are creating a new dataset
  Revisit the data life cycle to ensure the new dataset is properly documented, validated, and preserved
Use reproducible workflows
  Enable transparency and reproducibility in the integration process
  Ensure others understand and can evaluate your decision-making process
  Automate the integration as much as possible, especially when integrating many or large datasets
Data Integration Best Practices
Ensure attribution of original dataset owners and respect data usage agreements
Example resource:
Example citation to the related data set from the Dryad repository:
Data Integration Best Practices
Useful for analyzing, subsetting, and transforming data
Can be used to check and assure data quality
Options include SAS, SPSS, and Matlab (not free), and R (free)
  SAS: Has comprehensive support
  SPSS: Has a user-friendly GUI
  Matlab: Analysis and visualization platform that has “toolboxes” available for different disciplines, such as modeling or genomic analyses
Data Manipulation
Free (http://www.r-project.org/index.html)
Produces publication-quality graphics
Lots of forums from which to get help
Software (such as Kepler for developing workflows) will integrate analytical components written in R
Using R
Tools including (but not limited to) spreadsheets such as MS Excel and relational databases (MS Access, MySQL, and more) can provide structure, flexibility, and potential for working more easily with datasets, but they also require planning
Selection of a database or spreadsheet tool depends on the relationships between the data and how the data will be used, as well as other considerations such as time, resources, and output
Review: Selecting tools for data storage and use
Maintaining provenance (a trail of custody and decisions) is important when integrating more than one dataset
Documenting and understanding context and relationships, as well as changes, is crucial when creating a new dataset (any time you combine two or more disparate datasets)
Create a transparent, reproducible workflow
Make sure to provide proper attribution and citation to all resources, including the original dataset
Tools such as R, Matlab, and others can be useful in establishing workflows and accessing datasets
Review: Data Integration & Manipulation
About
Participate in our GitHub repo: https://dataoneorg.github.io/dataone_lessons/
Suggested citation: DataONE Education Module: Data Management. DataONE. Retrieved November 12, 2016. From https://dataoneorg.github.io/dataone_lessons/
Copyright license information: No rights reserved; you may enhance and reuse for your own purposes. We do ask that you provide appropriate citation and attribution to DataONE.