TRANSCRIPT
Data Analysis and Workflows
DataONE Community Engagement & Outreach Working Group
Lesson Topics
  Review of typical data analyses
  Reproducibility & provenance
  Workflows in general
  Informal workflows
  Formal workflows
Learning Objectives
After completing this lesson, the participant will be able to:
  Understand a subset of typical analyses used
  Define a workflow
  Understand the concepts of informal and formal workflows
  Discuss the benefits of workflows
The Data Life Cycle
Data Analyses
Processes:
  Conducted via personal computer, grid, cloud computing
  Statistics, model runs, parameter estimations, graphs/plots, etc.
Types of Analyses
Processing: subsetting, merging, manipulating
  Reduction: important for high-resolution datasets
  Transformation: unit conversions, linear and nonlinear algorithms
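As a rough illustration of these operations, the sketch below subsets, transforms, and reduces a handful of temperature readings. The readings and the function names are invented for the example and are not part of the lesson.

```python
from statistics import mean

# Hypothetical high-frequency readings: (hour of day, temperature in degrees Fahrenheit)
readings = [(0, 50.0), (1, 51.8), (2, 53.6), (3, 55.4), (4, 57.2), (5, 59.0)]


def subset(rows, first_hour, last_hour):
    """Processing: keep only rows whose hour falls in [first_hour, last_hour]."""
    return [row for row in rows if first_hour <= row[0] <= last_hour]


def fahrenheit_to_celsius(temp_f):
    """Transformation: a simple linear unit conversion."""
    return (temp_f - 32.0) * 5.0 / 9.0


def reduce_by_block(values, block_size):
    """Reduction: replace each block of consecutive values with its mean."""
    return [
        mean(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


selected = subset(readings, 0, 3)
celsius = [fahrenheit_to_celsius(temp_f) for _, temp_f in selected]
reduced = reduce_by_block(celsius, 2)
print(reduced)
```

Averaging over fixed blocks is only one possible reduction; the right choice depends on the resolution the analysis actually needs.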
Types of Analyses
Graphical analyses
  Visual exploration of data: search for patterns
  Quality assurance: outlier detection
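Outlier detection can also be done numerically before or alongside plotting. The sketch below uses the common 1.5 x IQR rule, which is an assumption of this example rather than a method named in the lesson; the sample values and the function name are hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical temperature readings with one suspicious value
temperatures = [12.1, 12.4, 11.9, 12.0, 12.3, 35.0, 12.2, 11.8]
print(find_outliers(temperatures))
```

A flagged value is a prompt to inspect the record, not proof that it is wrong.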
Conventional Statistics
  Experimental data
  Examples: ANOVA, MANOVA, linear and nonlinear regression
  Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance
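To make one of these examples concrete, here is a minimal sketch of simple linear regression by ordinary least squares. The data and function names are made up for illustration. The residuals it returns are what one would typically inspect when checking the error assumptions listed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed minus fitted values, used to examine the error terms."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical measurements
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, residuals(xs, ys, slope, intercept))
```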
Descriptive Statistics
  Observational or descriptive data
  Examples: diversity indices, cluster analysis, quadrat variance, distance methods, principal component analysis, correspondence analysis
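As one example of a diversity index, the sketch below computes the Shannon index, H = -sum(p_i * ln p_i), from species counts. The choice of index, the counts, and the function name are assumptions made for illustration.

```python
from math import log


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over species with nonzero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * log(p) for p in proportions)


# Hypothetical counts of individuals per species at one site
site_counts = [25, 25, 25, 25]
print(shannon_index(site_counts))
```

With equal counts the index reaches its maximum for that number of species, ln(4) in this case; a site dominated by one species scores close to zero.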
Statistical Analyses
From Oksanen (2011) Multivariate Analysis of Ecological Communities in R: vegan tutorial
Types of Analyses
Statistical analyses (continued)
  Temporal analyses: time series
  Spatial analyses: for spatial autocorrelation
  Nonparametric approaches useful when conventional assumptions violated or underlying distribution unknown
  Other misc. analyses: risk assessment, generalized linear models, mixed models, etc.
Analyses of very large datasets
  Data mining and discovery
  Online data processing
After Data Analysis
  Re-analysis of outputs
  Final visualizations: charts, graphs, simulations, etc.
Science is iterative: The process that results in the final product can be complex
Reproducibility
  Reproducibility at core of scientific method
  Complex process = more difficult to reproduce
  Good documentation required for reproducibility
    Metadata: data about data
    Process metadata: data about process used to create, manipulate, and analyze data
Ensuring Reproducibility: Documenting the Process
  Process metadata: Information about process (analysis, data organization, graphing) used to get to data outputs
  Related concept: data provenance
    Origins of data
    Good provenance = able to follow data throughout entire life cycle
    Allows for
      Replication & reproducibility
      Analysis for potential defects, errors in logic, statistical errors
      Evaluation of hypotheses
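One lightweight way to capture process metadata is to write a small record each time a step runs. The sketch below is an assumption about what such a record might hold (step name, time, parameters, and checksums of the input and output); the field names and sample data are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_step(name, input_bytes, parameters, output_bytes):
    """Build a process-metadata record for one processing step."""
    return {
        "step": name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "parameters": parameters,
        # Checksums let a reader confirm which data went in and came out.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    }


# Hypothetical raw data in which 999.0 marks a missing value
raw = b"temp_c\n10.0\n999.0\n11.0\n"
cleaned = b"\n".join(line for line in raw.splitlines() if line != b"999.0") + b"\n"

record = describe_step("remove_sentinel_values", raw, {"sentinel": 999.0}, cleaned)
print(json.dumps(record, indent=2))
```

Chaining such records, where one step's output checksum matches the next step's input checksum, gives a simple provenance trail through the analysis.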
Workflows: The Basics
  Formalization of process metadata
  Precise description of scientific procedure
  Conceptualized series of data ingestion, transformation, and analytical steps
  Three components
    Inputs: information or material required
    Outputs: information or material produced & potentially used as input in other steps
    Transformation rules/algorithms (e.g. analyses)
  Two types:
    Informal
    Formal/Executable
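The three components can be sketched as a small data structure. The class and function names below are hypothetical and show only one of many ways to model a workflow step; each step names its inputs and outputs and carries a transformation rule, and the outputs of one step are available as inputs to later steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    inputs: List[str]            # names of the values the step requires
    outputs: List[str]           # names of the values the step produces
    rule: Callable[..., tuple]   # transformation rule; returns one value per output


def run(steps: List[Step], data: Dict[str, object]) -> Dict[str, object]:
    """Apply each step in order, feeding earlier outputs to later steps."""
    data = dict(data)  # work on a copy so the caller's inputs are untouched
    for step in steps:
        args = [data[key] for key in step.inputs]
        results = step.rule(*args)
        data.update(zip(step.outputs, results))
    return data


steps = [
    Step("to_celsius", ["temps_f"], ["temps_c"],
         lambda temps: ([(t - 32.0) * 5.0 / 9.0 for t in temps],)),
    Step("average", ["temps_c"], ["mean_c"],
         lambda temps: (sum(temps) / len(temps),)),
]
result = run(steps, {"temps_f": [32.0, 50.0, 68.0]})
print(result["mean_c"])
```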
Informal Workflows
  Inputs or outputs include data, metadata, or visualizations
  Analytical processes include operations that change or manipulate data in some way
  Decisions specify conditions that determine the next step in the process
  Predefined processes or subroutines specify a fixed multi-step process
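These flow chart elements map fairly directly onto code. In the hypothetical sketch below, the list of values is the input, `fill_gaps` plays the role of a predefined process, the `if` test is a decision, and the returned mean is the output. The gap-filling rule (substituting the mean of the observed values) is chosen only for illustration.

```python
def fill_gaps(values):
    """Predefined process: a fixed sequence of steps reused wherever needed."""
    observed = [v for v in values if v is not None]              # 1. collect observed values
    fill = sum(observed) / len(observed)                         # 2. compute their mean
    return [fill if v is None else v for v in values]            # 3. substitute for gaps


def analyze(values):
    """Analytical process with one decision point."""
    if None in values:              # decision: are any values missing?
        values = fill_gaps(values)  # yes: run the predefined process first
    return sum(values) / len(values)


print(analyze([1.0, None, 3.0]))
```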
Informal Workflows
Workflow diagrams: Simple linear flow chart
  Conceptualizing analysis as a sequence of steps
  Arrows indicate flow
Informal Workflows
Flow charts: simplest form of workflow
Informal Workflows
Flow charts: simplest form of workflow
Transformation Rules
These steps are known in workflows as “transformation rules”. Transformation rules describe what is done to/with the data to obtain the relevant outputs for publication.
Informal Workflows
Flow charts: simplest form of workflow
Inputs and Outputs
Now we focus on the actual data. The inputs & outputs of this workflow are shown here in red. The first inputs are the raw temperature & salinity data. These are imported into R. The output of this process is the data in R format. Those data in R format then become the input for the quality control and data cleaning step. The output of this step is “clean” temperature and salinity data, which is then the input for the analysis step. The output of the analysis step is the summary statistics, such as mean and standard deviation by month. These are subsequently the inputs for the visualization step.
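The chain of inputs and outputs described in these notes can be sketched as a short script. The notes use R; the version below is a Python stand-in written for illustration, and the sample records, column names, the -999 missing-value code, and the function names are all assumptions. A text table stands in for the visualization step.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical raw data; a real workflow would read a file instead.
RAW_CSV = """date,temperature_c,salinity_psu
2016-01-05,8.1,31.9
2016-01-19,7.9,32.1
2016-01-26,-999,32.0
2016-02-02,8.4,31.7
2016-02-16,8.8,31.5
"""


def import_data(text):
    """Step 1. Input: raw text. Output: a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows, missing_code=-999.0):
    """Step 2. Input: imported rows. Output: numeric rows without missing values."""
    cleaned = []
    for row in rows:
        temp = float(row["temperature_c"])
        sal = float(row["salinity_psu"])
        if missing_code in (temp, sal):
            continue  # drop records that hold the missing-value code
        cleaned.append({"month": row["date"][:7],
                        "temperature_c": temp,
                        "salinity_psu": sal})
    return cleaned


def summarize(rows, variable):
    """Step 3. Input: clean rows. Output: (mean, standard deviation) by month."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row["month"]].append(row[variable])
    return {month: (mean(vals), stdev(vals))
            for month, vals in sorted(by_month.items())}


def render(summary):
    """Step 4. Input: summary statistics. Output: lines of a text table."""
    return [f"{month}: mean={m:.2f} sd={s:.2f}"
            for month, (m, s) in summary.items()]


rows = import_data(RAW_CSV)
clean_rows = clean(rows)
temperature_summary = summarize(clean_rows, "temperature_c")
for line in render(temperature_summary):
    print(line)
```

Each function takes the previous step's output as its input, which mirrors the arrows in the flow chart.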
Informal Workflows
Workflow diagrams: adding decision points
Informal Workflows
Workflow diagrams: a simple example
Informal Workflows
Workflow diagrams: a complex example
Informal Workflows
Commented scripts: best practices
  Well-documented code is easier to review and share, and enables repeated analysis
  Add high-level information at the top
    Project description, author, date
    Script dependencies, inputs, and outputs
    Describe parameters and their origins
  Notice and organize sections
    What happens in the section and why
    Describe dependencies, inputs, and outputs
  Construct “end-to-end” script if possible
    A complete narrative
    Runs without intervention from start to finish
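A commented script that follows these practices might look like the sketch below. Everything in it is a placeholder invented for illustration: the project, the data, and the sensor limit. The point is the layout: a header with high-level information, labeled sections that say what happens and why, and a script that runs from start to finish without intervention.

```python
"""
Project:      Monthly temperature summary (hypothetical example)
Author:       <your name>
Date:         <date written>
Dependencies: Python 3 standard library only
Inputs:       TEMPERATURES_C, a list of (month, temperature) pairs defined below
Outputs:      one summary line per month, printed to the screen
Parameters:   MAX_VALID_C = 40.0, an assumed sensor limit chosen for illustration
"""
from statistics import mean

# --- Section 1: parameters and inputs ---------------------------------
# Why: keeping them together at the top makes their origins easy to review.
MAX_VALID_C = 40.0
TEMPERATURES_C = [("Jan", 8.1), ("Jan", 7.9), ("Feb", 8.4), ("Feb", 99.9)]

# --- Section 2: quality control ---------------------------------------
# What: drop readings above the sensor limit.
# Why: such readings are treated here as instrument errors.
# Input: TEMPERATURES_C. Output: valid.
valid = [(month, t) for month, t in TEMPERATURES_C if t <= MAX_VALID_C]

# --- Section 3: analysis ----------------------------------------------
# What: mean temperature per month, keeping months in their original order.
# Input: valid. Output: monthly_mean.
months = []
for month, _ in valid:
    if month not in months:
        months.append(month)
monthly_mean = {m: mean([t for mm, t in valid if mm == m]) for m in months}

# --- Section 4: report ------------------------------------------------
# Input: monthly_mean. Output: printed summary lines.
for month, value in monthly_mean.items():
    print(f"{month}: {value:.1f} C")
```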
Formal/Executable Workflows
  Analytical pipeline
  Each step can be implemented in different software systems
  Each step & its parameters/requirements formally recorded
  Allows reuse of both individual steps and overall workflow
Formal/Executable Workflows
Benefits
  Single access point for multiple analyses across software packages
  Keeps track of analysis and provenance: enables reproducibility
    Each step & its parameters/requirements formally recorded
  Workflow can be stored
  Allows sharing and reuse of individual steps or overall workflow
    Automate repetitive tasks
    Use across different disciplines and groups
    Can run analyses more quickly since not starting from scratch
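To show what "formally recorded" and "stored" can mean in practice, the toy sketch below describes a workflow as plain data (step names plus parameters), saves it as JSON, reloads it, and executes it while logging each step. It is not how Kepler or any other workflow system works internally; the operation names and the specification layout are invented for this example.

```python
import json

# Registry of available operations; the names are invented for this sketch.
OPERATIONS = {
    "offset": lambda values, amount: [v + amount for v in values],
    "scale": lambda values, factor: [v * factor for v in values],
}


def run_workflow(spec, values):
    """Execute each recorded step in order and keep a provenance log."""
    log = []
    for step in spec["steps"]:
        operation = OPERATIONS[step["operation"]]
        values = operation(values, **step["parameters"])
        log.append({"operation": step["operation"],
                    "parameters": step["parameters"],
                    "output": list(values)})
    return values, log


# The workflow itself is data, so it can be stored, shared, and reused.
workflow = {
    "name": "fahrenheit_to_celsius",
    "steps": [
        {"operation": "offset", "parameters": {"amount": -32.0}},
        {"operation": "scale", "parameters": {"factor": 5.0 / 9.0}},
    ],
}
stored = json.dumps(workflow)
restored = json.loads(stored)

result, provenance = run_workflow(restored, [32.0, 212.0])
print(result)
```

Because every step and parameter lives in the stored description, rerunning the same analysis on new data only requires a different input list.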
Formal/Executable Workflows
Example: Kepler Software
  Open-source, free, cross-platform
  Drag-and-drop interface for workflow construction
  Steps (analyses, manipulations, etc.) in workflow represented by “actor”
  Actors connect to form a workflow
  Possible applications
    Theoretical models or observational analyses
    Hierarchical modeling
    Can have nested workflows
    Can access data from web-based sources (e.g. databases)
  Downloads and more information at kepler-project.org
Formal/Executable Workflows
Example: VisTrails
  Open source
  Workflow and provenance management support
  Geared toward exploratory computational tasks
    Can manage evolving scientific workflows (SWF)
    Maintains detailed history about steps and data
  www.vistrails.org
Workflows in General
  Science is becoming more computationally intensive
  Sharing workflows benefits science
  Scientific workflow systems make documenting workflows easier
  Minimally: document your analysis via informal workflows
  Emerging workflow applications (formal/executable workflows) will
    Link software for executable end-to-end analysis
    Provide detailed info about data & analysis
    Facilitate re-use & refinement of complex, multi-step analyses
    Enable efficient swapping of alternative models & algorithms
    Help automate tedious tasks
Best Practices for Data Analysis
  Scientists should document workflows used to create results
    Data provenance
    Analyses and parameters used
    Connections between analyses via inputs and outputs
  Documentation can be informal (e.g. flow charts, commented scripts) or formal (e.g. Kepler, VisTrails)
Summary
  Modern science is computer-intensive
    Heterogeneous data, analyses, software
  Reproducibility is important
  Workflows = process metadata
  Use of informal or formal workflows for documenting process metadata ensures reproducibility, repeatability, validation
Resources for Data Analysis & Workflows
  1. W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).
The full slide deck may be downloaded from: http://www.dataone.org/education-modules
Suggested citation: DataONE Education Module: Analysis and Workflows. DataONE. Retrieved October 26, 2016. From http://www.dataone.org/sites/all/documents/L9_AnalysisWorkflows.pptx
Copyright license information: No rights reserved; you may enhance and reuse for your own purposes. We do ask that you provide appropriate citation and attribution to DataONE.