analysis and workflows lesson 10: analysis and workflows cc image by wlef70 on flickr
TRANSCRIPT
Analysis and Workflows
Tutorials on Data ManagementLesson 10: Analysis and Workflows
CC
imag
e b
y w
lef7
0 o
n F
lickr
Analysis and Workflows
• Review of typical data analyses• Reproducibility & provenance• Workflows in general• Informal workflows• Formal workflows
Lesson Topics
CC
imag
e b
y jw
als
h o
n F
lickr
Analysis and Workflows
• After completing this lesson, the participant will be able to: o Understand a subset of typical analyses usedo Define a workflowo Understand the concepts informal and formal workflowso Discuss the benefits of workflows
Learning Objectives
CC
imag
e b
y cy
bra
rian
77
on
Flic
kr
Analysis and Workflows
The Data Life CyclePlan
Collect
Assure
Describe
Preserve
Discover
Integrate
Analyze
Analysis and Workflows
• Conducted via personal computer, grid, cloud computing• Statistics, model runs, parameter estimations, graphs/plots
etc.
Data Analyses
CC
imag
e b
y h
eg
em
on
x o
n F
lickr
CC
imag
e b
y ta
i viin
ikka
on
Flic
kr
Analysis and Workflows
• Processing: subsetting, merging, manipulatingo Reduction: important for high-resolution datasetso Transformation: unit conversions, linear and nonlinear algorithms
Types of Analyses
0711070500276000 0711070600276000 0711070700277003 0711070800282017 0711070900285000 0711071000293000 0711071100301000 0711071200304000
Date timeair temp precip
C mm11-Jul-07 5:0027.6 00011-Jul-07 6:0027.6 00011-Jul-07 7:0027.7 00311-Jul-07 8:0028.2 01711-Jul-07 9:0028.5 00011-Jul-07 10:0029.3 00011-Jul-07 11:0030.1 00011-Jul-07 12:0030.4 000
Recreated from Michener & Brunt (2000)
Analysis and Workflows
• Graphical analyseso Visual exploration of data: search for patternso Quality assurance: outlier detection
Types of Analyses
Box and whisker plot of temperature by monthScatter plot of August Temperatures
Strasser, unpub. dataStrasser, unpub. data
Analysis and Workflows
• Statistical analyses Conventional statistics
o Experimental datao Examples: ANOVA, MANOVA, linear and
nonlinear regressiono Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance
Descriptive statisticso Observational or descriptive datao Examples: diversity indices, cluster
analysis, quadrant variance, distance methods, principal component analysis, correspondence analysis
Types of Analyses
From Oksanen (2011) Multivariate Analysis of Ecological Communities in R: vegan tutorial
Example of Principle Component Analysis
Analysis and Workflows
• Statistical analyses (continued)o Temporal analyses: time serieso Spatial analyses: for spatial autocorrelationo Nonparametric approaches useful when conventional assumptions
violated or underlying distribution unknowno Other misc. analyses: risk assessment, generalized linear models,
mixed models, etc.
• Analyses of very large datasetso Data mining & discoveryo Online data processing
Types of Analyses
Analysis and Workflows
• Re-analysis of outputs• Final visualizations: charts, graphs, simulations etc.
After Data Analysis
Science is iterative: The process that results in the final product
can be complex
Analysis and Workflows
• Reproducibility at core of scientific method• Complex process = more difficult to reproduce• Good documentation required for reproducibility
o Metadata: data about datao Process metadata: data about process used to create, manipulate,
and analyze data
Reproducibility
CC
imag
e b
y R
ich
ard
Car
ter
on F
lickr
Analysis and Workflows
• Process metadata: Information about process (analysis, data organization, graphing) used to get to data outputs
• Related concept: data provenanceo Origins of datao Good provenance = able to follow data throughout entire life cycleo Allows for
• Replication & reproducibility• Analysis for potential defects, errors in logic, statistical errors• Evaluation of hypotheses
Ensuring Reproducibility: Documenting the Process
Analysis and Workflows
• Formalization of process metadata• Precise description of scientific procedure• Conceptualized series of data ingestion, transformation, and
analytical steps• Three components
o Inputs: information or material requiredo Outputs: information or material produced & potentially used as
input in other stepso Transformation rules/algorithms (e.g. analyses)
• Two types:o Informal o Formal/Executable
Workflows: The Basics
Analysis and Workflows
Workflow diagrams: Some basic building blocks
Informal Workflows
Analytical process
Data (input or output)
Decision
Predefined process
(subroutine)
o Inputs or outputs include data, metadata, or visualizations
o Analytical processes include operations that change or manipulate data in some way
o Decisions specify conditions that determine the next step in the process
o Predefined processes or subroutines specify a fixed multi-step process
Analysis and Workflows
Informal Workflows
Output data, visualizations,
associated metadata
Data ingestion
Raw data and
associated metadata
Data cleaning
Analysis Step 1
Analysis Step 2
Output generation
Workflow diagrams: Simple linear flow chart• Conceptualizing analysis as a sequence of stepso arrows indicate flow
Analysis and Workflows
Flow charts: simplest form of workflow
Informal Workflows
Graph production
Analysis: mean, SD
Quality control & data cleaning
Data import into R
Analysis and Workflows
Flow charts: simplest form of workflow
Informal Workflows
Transformation Rules
Graph production
Analysis: mean, SD
Quality control & data cleaning
Data import into R
Analysis and Workflows
Flow charts: simplest form of workflow
Informal Workflows
Inputs & Outputs
Graph production
Analysis: mean, SD
Quality control & data cleaning
Data import into R
Salinity data
Temperature data
“Clean” T & S data
Data in R format
Summary statistics
Analysis and Workflows
Informal Workflows
Analysis 2
Analysis 1CONDITION
false
true
CONDITIONAL LOOP
IF
Workflow diagrams: Adding decision points
Analysis and Workflows
Informal Workflows
species name
lat/long
Data ingestion
Data and metadata
QA/QC (e.g., check for appropriate data
types, outliers)
Determine range (calculate convex hull
of points)
Compile list of unique taxa
Generate output data and visualizations for taxa frequencies in
stated range
Data, visualizations, and metadata
FORcounts
Calculate taxa frequencies
Workflow diagrams: a simple example
Analysis and Workflows
Informal Workflows
false
true
A
B
Data ingestion
Data and metadata Data
cleaning
Analysis 1B
Analysis 1A
Output generation
Data, visualizations, and metadata
Data and metadata
FOR
Predefined process
(subroutine)
Data ingestion
Data cleaning
Data integration
Data and metadata
Data ingestion
Data cleaning IF
Analysis 2
Data integration
Workflow diagrams: a complex example
Analysis and Workflows
#%$&
Informal WorkflowsCommented scripts: best practices• Well-documented code is easier to review, share, enables
repeated analysis• Add high-level information at the top
o Project description, author, dateo Script dependencies, inputs, and outputso Describes parameters and their origins
• Notice and organize sectionso What happens in the section and whyo Describe dependencies, inputs, and outputs
• Construct “end-to-end” script if possibleo A complete narrativeo Runs without intervention from start to finish
Analysis and Workflows
• Analytical pipeline• Each step can be implemented in different software systems• Each step & its parameters/requirements formally recorded• Allows reuse of both individual steps and overall workflow
Formal/Executable Workflows
CC
imag
e b
y A
J C
ann
on
Flic
kr
Analysis and Workflows
Benefits:• Single access point for multiple analyses across software
packages• Keeps track of analysis and provenance: enables
reproducibilityo Each step & its parameters/requirements formally recorded
• Workflow can be stored• Allows sharing and reuse of individual steps or overall
workflowo Automate repetitive taskso Use across different disciplines and groupso Can run analyses more quickly since not starting from scratch
Formal/Executable Workflows
Analysis and Workflows
Example: Kepler Software• Open-source, free, cross-platform• Drag-and-drop interface for workflow construction• Steps (analyses, manipulations etc) in workflow represented
by “actor”• Actors connect from a workflow• Possible applicationso Theoretical models or observational analyseso Hierarchical modelingo Can have nested workflowso Can access data from web-based sources (e.g. databases)
• Downloads and more information at kepler-project.org
Formal/Executable Workflows
Analysis and Workflows
Formal/Executable Workflows
Drag & drop components from this list
Actors in workflow
Example: Kepler Software
Analysis and Workflows
Formal/Executable Workflows
This model shows the solution to the classic Lotka-Volterra predator prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated showing both population change and a phase diagram of the dynamics.
Example: Kepler Software
Analysis and Workflows
Formal/Executable Workflows
OutputExample: Kepler Software
Analysis and Workflows
Example: VisTrails• Open-source• Workflow & provenance management support• Geared toward exploratory computational taskso Can manage evolving SWFo Maintains detailed history about steps & data
• www.vistrails.org
Formal/Executable Workflows
Screenshot example
Analysis and Workflows
• Science is becoming more computationally intensive• Sharing workflows benefits scienceo Scientific workflow systems make documenting workflows easier
• Minimally: document your analysis via informal workflows• Emerging workflow applications (formal/executable
workflows) willo Link software for executable end-to-end analysiso Provide detailed info about data & analysiso Facilitate re-use & refinement of complex, multi-step analyseso Enable efficient swapping of alternative models & algorithmso Help automate tedious tasks
Workflows in General
Analysis and Workflows
• Scientists should document workflows used to create resultso Data provenanceo Analyses and parameters usedo Connections between analyses via inputs and outputs
• Documentation can be informal (e.g. flowcharts, commented scripts) or formal (e.g. Kepler, VisTrails)
Best Practices for Data Analysis
CC
imag
e g
eek
cale
nd
ar
on F
lickr
Analysis and Workflows
• Modern science is computer-intensiveo Heterogeneous data, analyses, software
• Reproducibility is important• Workflows = process metadata• Use of informal or formal workflows for documenting
process metadata ensures reproducibility, repeatability, validation
Summary
Analysis and Workflows
1. W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).
Resources for Data Analysis & Workflows
Analysis and Workflows
The full slide deck may be downloaded from:http://www.dataone.org/education-modules
Suggested citation:DataONE Education Module: Analysis and Workflows. DataONE. Retrieved Nov12, 2012. From http://www.dataone.org/sites/all/documents/L10_Analysis Workflows.pptx
Copyright license information:No rights reserved; you may enhance and reuse for your own purposes. We do ask that you provide appropriate citation and attribution to DataONE.