analysis and workflows lesson 10: analysis and workflows cc image by wlef70 on flickr

34
Analysis and Workflows Tutorials on Data Management Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Upload: martha-sayres

Post on 31-Mar-2015

221 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Tutorials on Data ManagementLesson 10: Analysis and Workflows

CC

imag

e b

y w

lef7

0 o

n F

lickr

Page 2: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Review of typical data analyses• Reproducibility & provenance• Workflows in general• Informal workflows• Formal workflows

Lesson Topics

CC

imag

e b

y jw

als

h o

n F

lickr

Page 3: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• After completing this lesson, the participant will be able to: o Understand a subset of typical analyses usedo Define a workflowo Understand the concepts informal and formal workflowso Discuss the benefits of workflows

Learning Objectives

CC

imag

e b

y cy

bra

rian

77

on

Flic

kr

Page 4: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

The Data Life CyclePlan

Collect

Assure

Describe

Preserve

Discover

Integrate

Analyze

Page 5: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Conducted via personal computer, grid, cloud computing• Statistics, model runs, parameter estimations, graphs/plots

etc.

Data Analyses

CC

imag

e b

y h

eg

em

on

x o

n F

lickr

CC

imag

e b

y ta

i viin

ikka

on

Flic

kr

Page 6: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Processing: subsetting, merging, manipulatingo Reduction: important for high-resolution datasetso Transformation: unit conversions, linear and nonlinear algorithms

Types of Analyses

0711070500276000 0711070600276000 0711070700277003 0711070800282017 0711070900285000 0711071000293000 0711071100301000 0711071200304000

Date timeair temp precip

C mm11-Jul-07 5:0027.6 00011-Jul-07 6:0027.6 00011-Jul-07 7:0027.7 00311-Jul-07 8:0028.2 01711-Jul-07 9:0028.5 00011-Jul-07 10:0029.3 00011-Jul-07 11:0030.1 00011-Jul-07 12:0030.4 000

Recreated from Michener & Brunt (2000)

Page 7: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Graphical analyseso Visual exploration of data: search for patternso Quality assurance: outlier detection

Types of Analyses

Box and whisker plot of temperature by monthScatter plot of August Temperatures

Strasser, unpub. dataStrasser, unpub. data

Page 8: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Statistical analyses Conventional statistics

o Experimental datao Examples: ANOVA, MANOVA, linear and

nonlinear regressiono Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance

Descriptive statisticso Observational or descriptive datao Examples: diversity indices, cluster

analysis, quadrant variance, distance methods, principal component analysis, correspondence analysis

Types of Analyses

From Oksanen (2011) Multivariate Analysis of Ecological Communities in R: vegan tutorial

Example of Principle Component Analysis

Page 9: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Statistical analyses (continued)o Temporal analyses: time serieso Spatial analyses: for spatial autocorrelationo Nonparametric approaches useful when conventional assumptions

violated or underlying distribution unknowno Other misc. analyses: risk assessment, generalized linear models,

mixed models, etc.

• Analyses of very large datasetso Data mining & discoveryo Online data processing

Types of Analyses

Page 10: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Re-analysis of outputs• Final visualizations: charts, graphs, simulations etc.

After Data Analysis

Science is iterative: The process that results in the final product

can be complex

Page 11: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Reproducibility at core of scientific method• Complex process = more difficult to reproduce• Good documentation required for reproducibility

o Metadata: data about datao Process metadata: data about process used to create, manipulate,

and analyze data

Reproducibility

CC

imag

e b

y R

ich

ard

Car

ter

on F

lickr

Page 12: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Process metadata: Information about process (analysis, data organization, graphing) used to get to data outputs

• Related concept: data provenanceo Origins of datao Good provenance = able to follow data throughout entire life cycleo Allows for

• Replication & reproducibility• Analysis for potential defects, errors in logic, statistical errors• Evaluation of hypotheses

Ensuring Reproducibility: Documenting the Process

Page 13: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Formalization of process metadata• Precise description of scientific procedure• Conceptualized series of data ingestion, transformation, and

analytical steps• Three components

o Inputs: information or material requiredo Outputs: information or material produced & potentially used as

input in other stepso Transformation rules/algorithms (e.g. analyses)

• Two types:o Informal o Formal/Executable

Workflows: The Basics

Page 14: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Workflow diagrams: Some basic building blocks

Informal Workflows

Analytical process

Data (input or output)

Decision

Predefined process

(subroutine)

o Inputs or outputs include data, metadata, or visualizations

o Analytical processes include operations that change or manipulate data in some way

o Decisions specify conditions that determine the next step in the process

o Predefined processes or subroutines specify a fixed multi-step process

Page 15: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Informal Workflows

Output data, visualizations,

associated metadata

Data ingestion

Raw data and

associated metadata

Data cleaning

Analysis Step 1

Analysis Step 2

Output generation

Workflow diagrams: Simple linear flow chart• Conceptualizing analysis as a sequence of stepso arrows indicate flow

Page 16: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Flow charts: simplest form of workflow

Informal Workflows

Graph production

Analysis: mean, SD

Quality control & data cleaning

Data import into R

Page 17: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Flow charts: simplest form of workflow

Informal Workflows

Transformation Rules

Graph production

Analysis: mean, SD

Quality control & data cleaning

Data import into R

Page 18: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Flow charts: simplest form of workflow

Informal Workflows

Inputs & Outputs

Graph production

Analysis: mean, SD

Quality control & data cleaning

Data import into R

Salinity data

Temperature data

“Clean” T & S data

Data in R format

Summary statistics

Page 19: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Informal Workflows

Analysis 2

Analysis 1CONDITION

false

true

CONDITIONAL LOOP

IF

Workflow diagrams: Adding decision points

Page 20: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Informal Workflows

species name

lat/long

Data ingestion

Data and metadata

QA/QC (e.g., check for appropriate data

types, outliers)

Determine range (calculate convex hull

of points)

Compile list of unique taxa

Generate output data and visualizations for taxa frequencies in

stated range

Data, visualizations, and metadata

FORcounts

Calculate taxa frequencies

Workflow diagrams: a simple example

Page 21: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Informal Workflows

false

true

A

B

Data ingestion

Data and metadata Data

cleaning

Analysis 1B

Analysis 1A

Output generation

Data, visualizations, and metadata

Data and metadata

FOR

Predefined process

(subroutine)

Data ingestion

Data cleaning

Data integration

Data and metadata

Data ingestion

Data cleaning IF

Analysis 2

Data integration

Workflow diagrams: a complex example

Page 22: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

#%$&

Informal WorkflowsCommented scripts: best practices• Well-documented code is easier to review, share, enables

repeated analysis• Add high-level information at the top

o Project description, author, dateo Script dependencies, inputs, and outputso Describes parameters and their origins

• Notice and organize sectionso What happens in the section and whyo Describe dependencies, inputs, and outputs

• Construct “end-to-end” script if possibleo A complete narrativeo Runs without intervention from start to finish

Page 23: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Analytical pipeline• Each step can be implemented in different software systems• Each step & its parameters/requirements formally recorded• Allows reuse of both individual steps and overall workflow

Formal/Executable Workflows

CC

imag

e b

y A

J C

ann

on

Flic

kr

Page 24: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Benefits:• Single access point for multiple analyses across software

packages• Keeps track of analysis and provenance: enables

reproducibilityo Each step & its parameters/requirements formally recorded

• Workflow can be stored• Allows sharing and reuse of individual steps or overall

workflowo Automate repetitive taskso Use across different disciplines and groupso Can run analyses more quickly since not starting from scratch

Formal/Executable Workflows

Page 25: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Example: Kepler Software• Open-source, free, cross-platform• Drag-and-drop interface for workflow construction• Steps (analyses, manipulations etc) in workflow represented

by “actor”• Actors connect from a workflow• Possible applicationso Theoretical models or observational analyseso Hierarchical modelingo Can have nested workflowso Can access data from web-based sources (e.g. databases)

• Downloads and more information at kepler-project.org

Formal/Executable Workflows

Page 26: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Formal/Executable Workflows

Drag & drop components from this list

Actors in workflow

Example: Kepler Software

Page 27: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Formal/Executable Workflows

This model shows the solution to the classic Lotka-Volterra predator prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated showing both population change and a phase diagram of the dynamics.

Example: Kepler Software

Page 28: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Formal/Executable Workflows

OutputExample: Kepler Software

Page 29: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

Example: VisTrails• Open-source• Workflow & provenance management support• Geared toward exploratory computational taskso Can manage evolving SWFo Maintains detailed history about steps & data

• www.vistrails.org

Formal/Executable Workflows

Screenshot example

Page 30: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Science is becoming more computationally intensive• Sharing workflows benefits scienceo Scientific workflow systems make documenting workflows easier

• Minimally: document your analysis via informal workflows• Emerging workflow applications (formal/executable

workflows) willo Link software for executable end-to-end analysiso Provide detailed info about data & analysiso Facilitate re-use & refinement of complex, multi-step analyseso Enable efficient swapping of alternative models & algorithmso Help automate tedious tasks

Workflows in General

Page 31: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Scientists should document workflows used to create resultso Data provenanceo Analyses and parameters usedo Connections between analyses via inputs and outputs

• Documentation can be informal (e.g. flowcharts, commented scripts) or formal (e.g. Kepler, VisTrails)

Best Practices for Data Analysis

CC

imag

e g

eek

cale

nd

ar

on F

lickr

Page 32: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

• Modern science is computer-intensiveo Heterogeneous data, analyses, software

• Reproducibility is important• Workflows = process metadata• Use of informal or formal workflows for documenting

process metadata ensures reproducibility, repeatability, validation

Summary

Page 33: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

1. W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).

Resources for Data Analysis & Workflows

Page 34: Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

Analysis and Workflows

The full slide deck may be downloaded from:http://www.dataone.org/education-modules

Suggested citation:DataONE Education Module: Analysis and Workflows. DataONE. Retrieved Nov12, 2012. From http://www.dataone.org/sites/all/documents/L10_Analysis Workflows.pptx

Copyright license information:No rights reserved; you may enhance and reuse for your own purposes. We do ask that you provide appropriate citation and attribution to DataONE.