data entry, and manipulation - github pages · 2020-01-06 · data manipulation options lesson...

Post on 18-Mar-2020

Data Entry and Manipulation
DataONE Community Engagement & Outreach Working Group

Best Practices for Creating Data Files
Data Entry Options
Data Integration Best Practices
Data Manipulation Options

Lesson Topics

Recognize and plan for inconsistencies that can make a dataset difficult to understand and/or manipulate
Describe characteristics of stable data formats and list reasons for using these formats
Identify data entry tools
Identify validation measures that can be performed as data is entered
Review best practices for data integration
Describe the basic components of a relational database

Learning Objectives

Create quality datasets that are:
  Valid
  Organized to support ease of use and reuse

Goals of Data Entry

Example: Poor Data Entry

Example: Poor Data Entry, continued

Columns of data are consistent: only numbers, dates, or text
Consistent names, codes, and formats (date) are used in each column
Data are all in one table, which is much easier for a statistical program to work with than multiple small tables that each require human intervention

Recommended Practices
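
The first practice above, one kind of value per column, can be checked mechanically. The following is a minimal sketch using only the Python standard library; the sample data, the column names, and the assumption that dates are written as YYYY-MM-DD are all illustrative and not part of the lesson.

```python
import csv
import io
from datetime import datetime


def classify(value):
    """Return 'number', 'date', or 'text' for one non-empty cell."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        # Assumes ISO dates (YYYY-MM-DD); adjust to the format your data uses.
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def column_types(csv_text):
    """Map each column name to the set of value types found in it."""
    found = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if value:  # empty cells are treated as missing, not as a type
                found.setdefault(name, set()).add(classify(value))
    return found


# Hypothetical sample: the Weight_g column mixes numbers and text.
sample = """Date,Species_Code,Weight_g
2010-05-25,PEMA,21.5
2010-05-26,PEMA,heavy
2010-05-27,DIOR,44
"""

types = column_types(sample)
mixed = sorted(name for name, kinds in types.items() if len(kinds) > 1)
print(mixed)
```

Any column reported in `mixed` holds more than one kind of value and is worth inspecting by hand before analysis.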

Create descriptive column names without spaces or special characters
  SoilT30 to Soil_Temp_30cm
  Species-Code to Species_Code
  (Avoid using -, +, *, ^ in column names. Some software may interpret these symbols as operators.)

Use a descriptive file name. For instance, a file named SEV_SmallMammalData_v.5.25.2010.csv indicates the project the data is associated with (SEV), the theme of the data (SmallMammalData), and when this version of the data was created (v.5.25.2010). This name is much more helpful than a file named mydata.xls.

Recommended Practices, continued
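
Both naming practices can be automated in part. The sketch below is one possible approach, not a DataONE tool: `clean_column_name` and `versioned_filename` are hypothetical helper names, and the file name pattern simply mirrors the SEV example above. Making a name more descriptive (SoilT30 to Soil_Temp_30cm) still needs a person.

```python
import re
from datetime import date


def clean_column_name(name):
    """Replace runs of spaces and special characters with one underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip())
    return cleaned.strip("_")


def versioned_filename(project, theme, created, extension="csv"):
    """Build a name like SEV_SmallMammalData_v.5.25.2010.csv."""
    version = f"v.{created.month}.{created.day}.{created.year}"
    return f"{project}_{theme}_{version}.{extension}"


print(clean_column_name("Species-Code"))
print(clean_column_name("Soil Temp (30cm)"))
print(versioned_filename("SEV", "SmallMammalData", date(2010, 5, 25)))
```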

Missing data
  Preferably leave the field empty (NULL = no value)
  In numeric fields, use a distinct value such as 9999 to indicate a missing value
  In text fields, use NA (“Not Applicable” or “Not Available”)
  Use data flags in a separate column to qualify missing values

Recommended Practices, continued
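
Whichever missing-value code is chosen, software reading the file has to be told about it, otherwise a value such as 9999 is averaged in as if it were a measurement. A minimal sketch, assuming the codes listed above and a hypothetical flag column whose value "M" means missing:

```python
import csv
import io

# Assumed codes: 9999 for numeric fields and NA for text fields, as in the
# list above; an empty field also counts as missing.
MISSING_CODES = {"", "NA", "9999"}


def read_with_flags(csv_text, column):
    """Return rows with `column` parsed to float or None, plus a flag column."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row[column].strip()
        missing = raw in MISSING_CODES
        row[column] = None if missing else float(raw)
        # Hypothetical flag values: "M" = missing, "" = value present.
        row[column + "_Flag"] = "M" if missing else ""
        rows.append(row)
    return rows


# Hypothetical sample with one coded and one empty missing value.
sample = """Plot,Soil_Temp_30cm
A,12.4
B,9999
C,
"""

rows = read_with_flags(sample, "Soil_Temp_30cm")
present = [r["Soil_Temp_30cm"] for r in rows if r["Soil_Temp_30cm"] is not None]
print(sum(present) / len(present))
```

The mean is computed from the one real measurement only; the two missing rows are kept in `rows` with their flag set, so nothing is silently dropped.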

Enter complete lines of data

Recommended Practices, continued

For the long term, store data in a consistent format that can be read well into the future and that can be used by any application now or in the future
Appropriate file types include:

  Non-proprietary: use an open, documented standard
  Common usage by research community: standard representation (ASCII, Unicode)
  Unencrypted
  Uncompressed

ASCII formatted files are likely to be readable into the future
Use ASCII (comma-separated) for tabular data

Best Practices
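
As an illustration of the last point, the sketch below writes a small table as an ASCII comma-separated file with Python's standard `csv` module and reads it back. The records and the temporary location are hypothetical; only the file name is taken from the earlier example.

```python
import csv
import os
import tempfile

# Hypothetical records; the column names follow the naming advice above.
records = [
    {"Date": "2010-05-25", "Species_Code": "PEMA", "Weight_g": 21.5},
    {"Date": "2010-05-26", "Species_Code": "DIOR", "Weight_g": 44.0},
]

path = os.path.join(tempfile.mkdtemp(), "SEV_SmallMammalData_v.5.25.2010.csv")

# encoding="ascii" makes the write fail loudly if a non-ASCII character
# slips in; newline="" is what the csv module expects for file objects.
with open(path, "w", encoding="ascii", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Date", "Species_Code", "Weight_g"])
    writer.writeheader()
    writer.writerows(records)

# Read the file back to confirm that nothing was lost in the round trip.
with open(path, encoding="ascii", newline="") as handle:
    round_trip = list(csv.DictReader(handle))

print(round_trip[0]["Species_Code"])
```

Note that a CSV file stores every value as text, so the reader gets `"44.0"` back rather than the number 44.0; column types have to be documented separately.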

Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Les A. Hook, Suresh K. Santhana Vannan, Tammy W. Beaty, Robert B. Cook, and Bruce E. Wilson. https://daac.ornl.gov/PI/BestPractices-2010.pdf
Preparing Data for Sharing. 2015. Libbie Stephenson. https://doi:10.7910/DVN/BJNXVQ

Resources

Two common tools: Google Docs, Excel

Data Entry Tools

Google Docs Forms

Google Docs Spreadsheet

Excel

Excel: Data Validation
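
The idea behind validation at entry time is the same in any tool: each new record is checked against a few rules before it is accepted. The sketch below expresses that idea in Python rather than in Excel or Google Docs; the species list, the date format, and the weight range are hypothetical rules chosen for illustration.

```python
from datetime import datetime

# Hypothetical rules for a small-mammal capture record.
VALID_SPECIES = {"PEMA", "DIOR", "ONLE"}


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry is accepted."""
    problems = []
    try:
        datetime.strptime(entry.get("Date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("Date must look like YYYY-MM-DD")
    if entry.get("Species_Code") not in VALID_SPECIES:
        problems.append("Species_Code is not in the allowed list")
    try:
        weight = float(entry.get("Weight_g", ""))
        if not 0 < weight < 500:
            problems.append("Weight_g is outside the range 0 to 500")
    except ValueError:
        problems.append("Weight_g must be a number")
    return problems


good = {"Date": "2010-05-25", "Species_Code": "PEMA", "Weight_g": "21.5"}
bad = {"Date": "25/05/2010", "Species_Code": "pema", "Weight_g": "heavy"}
print(validate_entry(good))
print(len(validate_entry(bad)))
```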

Spreadsheet:
  Great for charts, graphs, calculations
  Flexible about cell content type: cells in the same column can contain numbers or text
  Easy to use, but harder to maintain as the complexity and size of the data grow

Relational database:
  Easy to query to select portions of data
  Data fields are typed. For example, only integers are allowed in integer fields
  Columns cannot be sorted independently of each other
  Steeper learning curve than a spreadsheet

Spreadsheet versus Relational Database

What is a relational database?

Database Features: Explicit control over data types

Relationships are defined between tables

Using Structured Query Language (SQL)

Data entry using a database
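
The database features named on these slides (typed fields, relationships between tables, SQL queries, checked data entry) can be tried without installing a server. The sketch below uses SQLite through Python's built-in `sqlite3` module, which is a substitute for the database products the lesson names elsewhere. The table design and the sample rows are hypothetical. Because SQLite is lenient about column types by default, the sketch adds a CHECK constraint to enforce the numeric field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

# Hypothetical two-table design: each capture refers to one species.
conn.executescript("""
CREATE TABLE species (
    species_code TEXT PRIMARY KEY,
    common_name  TEXT NOT NULL
);
CREATE TABLE capture (
    capture_id   INTEGER PRIMARY KEY,
    capture_date TEXT NOT NULL,
    species_code TEXT NOT NULL REFERENCES species(species_code),
    weight_g     REAL CHECK (typeof(weight_g) IN ('real', 'integer', 'null'))
);
""")

conn.execute("INSERT INTO species VALUES ('PEMA', 'Deer mouse')")
conn.execute("INSERT INTO capture VALUES (1, '2010-05-25', 'PEMA', 21.5)")

rejected = []
for bad_row in [(2, "2010-05-26", "XXXX", 30.0),      # unknown species code
                (3, "2010-05-27", "PEMA", "heavy")]:  # text in a numeric field
    try:
        conn.execute("INSERT INTO capture VALUES (?, ?, ?, ?)", bad_row)
    except sqlite3.IntegrityError:
        rejected.append(bad_row[0])

# Query: join the two tables through the shared species_code key.
result = conn.execute("""
    SELECT s.common_name, c.weight_g
    FROM capture AS c JOIN species AS s ON s.species_code = c.species_code
""").fetchall()
print(rejected, result)
```

Both bad rows are refused by the database itself, which is the kind of entry-time validation a spreadsheet does not give by default.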

Be aware of best practices in your domain when designing data file structures
Choose a data entry method that allows some validation of data as it is entered
Consider investing time in learning how to use a database if datasets are large or complex

Review: Planning for Data Entry

Consider trying one of these:

Personal, single-user databases can be developed in MS Access, which stores the database as a file on the user’s computer. MS Access comes with easy GUI tools to create databases, run queries, and write reports.
A more robust database that is free, accommodates multiple users, and runs on Windows or Linux is MySQL. GUIs for MySQL include phpMyAdmin (free) and Navicat (inexpensive).

If you want to try a database:

Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design (2nd Edition). Michael J. Hernandez. Addison-Wesley, 2003.
Fundamentals of Relational Database Design. Paul Litwin. http://r937.com/relational.html. (Accessed May 12, 2016).

To learn more about designing a relational database:

Maintain dataset provenance
  Document transformations
  Beware of accidental duplication

Review metadata for compatibility of context, methods, and meaning
  For what purpose was the data collected?
  How was the data collected?
  Is it sensible to combine these datasets?

Data Integration Best Practices
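
Provenance and duplication can both be handled at the moment datasets are combined. In the minimal sketch below, every row is tagged with the file it came from, and a row whose values have already been seen is reported instead of being added twice. The file names and records are hypothetical, and treating identical values as a duplicate is an assumption that needs checking against the metadata, since two real observations can look the same.

```python
def combine(datasets):
    """Merge rows from several sources, tagging each with where it came from.

    `datasets` maps a source label to a list of row dictionaries. A row whose
    values already appeared (from any source) is reported, not silently kept.
    """
    combined, duplicates, seen = [], [], {}
    for source, rows in datasets.items():
        for row in rows:
            key = tuple(sorted(row.items()))
            if key in seen:
                duplicates.append((row, seen[key], source))
            else:
                seen[key] = source
                combined.append({**row, "Source": source})
    return combined, duplicates


# Hypothetical inputs: the second file repeats one record from the first.
datasets = {
    "site_a_2010.csv": [{"Date": "2010-05-25", "Species_Code": "PEMA"},
                        {"Date": "2010-05-26", "Species_Code": "DIOR"}],
    "site_a_archive.csv": [{"Date": "2010-05-26", "Species_Code": "DIOR"},
                           {"Date": "2010-05-27", "Species_Code": "PEMA"}],
}

combined, duplicates = combine(datasets)
print(len(combined), len(duplicates))
```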

Ensure compatibility
  Convert to common units
  Choose appropriate numeric precision
  Evaluate and standardize missing value codes

Document all assumptions
  What assumptions underlie the original datasets?
  What assumptions did you make in combining the datasets?

Data Integration Best Practices
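
The three compatibility steps can be sketched together. The example below assumes two hypothetical temperature sources, one in Celsius that codes missing values as -9999 and one in Fahrenheit that uses NA; it converts both to Celsius, rounds to one decimal place, and maps every missing-value code to `None`. The units, codes, and precision are illustrative choices that would normally come from each dataset's metadata.

```python
def fahrenheit_to_celsius(value):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def standardize(rows, unit, missing_codes, decimals=1):
    """Return temperatures in Celsius, rounded, with None for missing values."""
    cleaned = []
    for raw in rows:
        if raw in missing_codes:
            cleaned.append(None)
            continue
        value = float(raw)
        if unit == "F":
            value = fahrenheit_to_celsius(value)
        cleaned.append(round(value, decimals))
    return cleaned


# Hypothetical sources: one records Celsius with -9999, one Fahrenheit with NA.
site_1 = standardize(["12.4", "-9999", "13.1"], unit="C", missing_codes={"-9999"})
site_2 = standardize(["50", "NA", "68"], unit="F", missing_codes={"NA", ""})
print(site_1 + site_2)
```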

Recognize that you are creating a new dataset
  Revisit the data life cycle to ensure the new dataset is properly documented, validated, and preserved

Use reproducible workflows
  Enable transparency and reproducibility in the integration process
  Ensure others understand and can evaluate your decision-making process
  Automate the integration as much as possible, especially when integrating many or large datasets

Data Integration Best Practices
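
One simple way to make an integration reproducible is to express it as an ordered list of named steps and to record what each step did. The following is a minimal sketch of that pattern, not a replacement for a workflow system; the step descriptions and sample values are hypothetical.

```python
import json


def run_workflow(rows, steps):
    """Apply each (description, function) step in order and log what happened."""
    log = []
    for description, step in steps:
        before = len(rows)
        rows = step(rows)
        log.append({"step": description, "rows_in": before, "rows_out": len(rows)})
    return rows, log


# Hypothetical steps for a list of weight readings in grams.
steps = [
    ("drop missing values coded as None", lambda rows: [r for r in rows if r is not None]),
    ("convert grams to kilograms", lambda rows: [r / 1000 for r in rows]),
]

result, log = run_workflow([21.5, None, 44.0], steps)
print(result)
print(json.dumps(log, indent=2))
```

Saving the log alongside the output gives others a record of every transformation and lets them rerun the same steps on new data.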

Ensure attribution of original dataset owners and respect data usage agreements
Example resource:

Example citation to the related dataset from the Dryad repository:

Data Integration Best Practices

Useful for analyzing, subsetting, and transforming data
Can be used to check and assure data quality
Options include SAS, SPSS, and Matlab (not free), and R (free)

SAS: Has comprehensive support
SPSS: Has a user-friendly GUI
Matlab: Analysis and visualization platform that has “toolboxes” available for different disciplines, such as modeling or genomic analyses

Data Manipulation

Free (http://www.r-project.org/index.html)
Produces publication-quality graphics
Lots of forums from which to get help
Software (such as Kepler for developing workflows) will integrate analytical components written in R

Using R

Tools such as spreadsheets (MS Excel) and relational databases (MS Access, MySQL, and more) can provide structure, flexibility, and potential for working more easily with datasets, but they also require planning
Selection of a database or spreadsheet tool depends on the relationships between the data and how the data will be used, as well as on time, resources, and the output needed

Review: Selecting tools for data storage and use

Maintaining provenance (a trail of custody and decisions) is important when integrating more than one dataset
Documenting and understanding context and relationships, as well as changes, is crucial when creating a new dataset (any time you combine two or more disparate datasets)
Create a transparent, reproducible workflow
Make sure to provide proper attribution and citation to all resources, including the original dataset
Tools such as R, Matlab, and others can be useful in establishing workflows and accessing datasets

Review: Data Integration & Manipulation

About
Participate in our GitHub repo: https://dataoneorg.github.io/dataone_lessons/

Suggested citation: DataONE Education Module: Data Management. DataONE. Retrieved November 12, 2016. From https://dataoneorg.github.io/dataone_lessons/

Copyright license information: No rights reserved; you may enhance and reuse for your own purposes. We do ask that you provide appropriate citation and attribution to DataONE.
