an introduction to taverna dr. georgina moulton and stian soiland the university of manchester...

An Introduction to Taverna

Dr. Georgina Moulton and Stian Soiland, The University of Manchester

([email protected]; [email protected] )

(on behalf of the myGRID team)


http://www.omii.ac.uk/

Outline of the day

• Introduction to workflows
• Introduction to Taverna
– Case-studies
• Hands-on Taverna workshop
– Build your own workflows
– Explore features of Taverna
• Taverna in a little more detail


What you will learn

• No prior knowledge of workflow technology is required
• By the end of the tutorial participants will know how to
– install the workbench software, import and run existing workflows, and build their own from components available on the public internet
– use the semantic search technologies in myGrid to assist this process by enabling service discovery
– do basic troubleshooting of workflows using Taverna's fault tolerance and debug mechanisms
– manage the import and export of data to and from the workflow system


What is Taverna?

Taverna enables the interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments

• Access to local and remote resources and analysis tools

• Automation of data flow
• Iteration over large data sets


• Workflow language specifies how processes (web services) fit together

• Describes what you want to do, not how you want to do it

• High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows

• Workflow is a kind of script or protocol that you configure when you run it.

- Easier to explain, share, relocate, reuse and repurpose.
- Workflow <=> Model
- Workflow is the integrator of knowledge

Workflows

[Diagram: Sequence -> RepeatMasker web service -> GenScan web service -> Blast web service -> Predicted genes out]
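The chain in the diagram can be read as plain function composition: each step consumes the previous step's output. A minimal Python sketch of that idea follows. The three functions are hypothetical local stand-ins for the remote RepeatMasker, GenScan and BLAST services and do no real sequence analysis; only the chaining is the point.

```python
from functools import reduce


def repeat_masker(sequence):
    """Stand-in: mask lower-case (assumed repetitive) bases with 'N'."""
    return "".join("N" if base.islower() else base for base in sequence)


def genscan(masked):
    """Stand-in: treat every unmasked stretch as one 'predicted gene'."""
    return [piece for piece in masked.split("N") if piece]


def blast(genes):
    """Stand-in: attach a placeholder hit label to each predicted gene."""
    return {gene: "hit_for_" + gene for gene in genes}


def run_pipeline(data, steps):
    """Pass the output of each step to the next one, in order."""
    return reduce(lambda value, step: step(value), steps, data)


result = run_pipeline("ACGTacgtGGCC", [repeat_masker, genscan, blast])
```

Swapping a step means changing one entry in the list, which is the "modification and substitution" property that workflows are meant to give without touching the other steps.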


Two types of workflows

• Data workflows
– A task is invoked once its expected data has been received, and when complete passes any resulting data downstream
• Control workflows
– A task is invoked once its dependent tasks have completed

[Diagram: example workflow graph with tasks A, B, C, D, E and F]
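The data-driven rule ("invoke a task once its expected data has been received") can be sketched as a very small scheduler. This is an illustration in Python, not Taverna's enactor: the function `run_dataflow`, the task names and the toy arithmetic are all assumptions made for the example.

```python
def run_dataflow(tasks, inputs):
    """Run tasks in data-driven order.

    tasks:  name -> (function, list of names whose values it needs)
    inputs: name -> value supplied by the user
    Returns a dict holding the inputs plus every task's result.
    """
    values = dict(inputs)
    pending = dict(tasks)
    while pending:
        # A task is ready as soon as all of its expected data is available.
        ready = [name for name, (_, needs) in pending.items()
                 if all(need in values for need in needs)]
        if not ready:
            raise ValueError("unsatisfiable inputs: " + ", ".join(sorted(pending)))
        for name in ready:
            func, needs = pending.pop(name)
            values[name] = func(*(values[need] for need in needs))
    return values


# Toy graph: A feeds B and C, which both feed D (declared out of order on purpose).
tasks = {
    "D": (lambda b, c: b + c, ["B", "C"]),
    "A": (lambda x: x + 1, ["x"]),
    "B": (lambda a: a * 2, ["A"]),
    "C": (lambda a: a * 3, ["A"]),
}
values = run_dataflow(tasks, {"x": 1})
```

A control link would be modelled the same way, except that the dependency only orders the two tasks and no value is passed between them.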


Williams-Beuren Syndrome (WBS)

• Contiguous sporadic gene deletion disorder

• 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis

• Haploinsufficiency of the region results in the phenotype

• Multisystem phenotype – muscular, nervous, circulatory systems

• Characteristic facial features
• Unique cognitive profile
• Mental retardation (IQ 40-100, mean ~60; ‘normal’ mean ~100)
• Outgoing personality, friendly nature, ‘charming’


Williams-Beuren Syndrome Microdeletion

[Figure: physical map of the Williams-Beuren Syndrome region. Chr 7 (~155 Mb), band 7q11.23 (~1.5 Mb). Gene labels recovered from the figure text: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14, STAG3, PMS2L, FKBP6T, GTF2IP, NCF1P, GTF2IRD2P. Repeat blocks A, B and C, each with cen, mid and tel copies. Patient deletions: WBS and SVAS. Clones CTA-315H11 and CTB-51J22 flank the ‘Gap’ in the physical map.]

Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164


Filling a genomic gap in silico

• Two steps to filling the genomic gap:
1. Identify new, overlapping sequence of interest
2. Characterise the new sequence at nucleotide and amino acid level

• Number of issues if we are to do it the traditional way:
1. Frequently repeated – info rapidly added to public databases
2. Time consuming and mundane
3. Don’t always get results
4. Huge amount of interrelated data is produced


Traditional Bioinformatics

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa


Requirements

• Automation
• Reliability
• Repeatability
• Few programming skills required
• Works on distributed resources



The Williams Workflows

A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence


The Biological Results

[Figure: the gap between clones CTA-315H11 and CTB-51J22 closed by RP11-622P13, RP11-148M21 and RP11-731K22. Gene labels recovered from the figure text: ELN, WBSCR14, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28.]

• 314,004bp extension
• All nine known genes identified (40/45 exons identified)
• Four workflow cycles totalling ~10 hours
• The gap was correctly closed and all known features identified


Workflow Advantages

• Automation
– Capturing processes in an explicit manner
– Tedium! Computers don’t get bored/distracted/hungry/impatient!
– Saves repeated time and effort
• Modification, maintenance, substitution and personalisation
• Easy to share, explain, relocate, reuse and build
• Releases Scientists/Bioinformaticians to do other work
• Record
– Provenance: what the data is like, where it came from, its quality
– Management of data (LSID - Life Science Identifiers)


Benefit to the Scientist?

• Automated plumbing
– Systematic. Making boring stuff easier so you can do more funky stuff. Data chaining replaces manual hand-offs. Accelerated creation of results. Repetitive and unbiased analysis. Potentially reproducible but not always.
• Easier to use (but maybe not design)
– Gives non-developers access to sophisticated codes and applications. Avoids need to download-install-learn how to use someone else's code.
• A framework to leverage a community’s applications, services, datasets and codes
– Honours original codes and applications. Heterogeneous coding styles and tool sets. The best applications.
– Promoting community metadata and common formats & standards
• A framework for extensibility, adaptability & innovation.
– Add my code, reuse and repurpose


It’s more than plumbing….

• Workflows are protocols and records.
– Explicit and precise descriptions of a scientific protocol
– Scientific transparency. Easier to explain, share, relocate, reuse and repurpose and remember.
– Provenance of results for credibility.
• Workflows are know-how.
– Specialists create applications; experts design and set parameters; inexperienced punch above their weight with sophisticated protocols
• Workflows are collaborations.
– Multi-disciplinary workflows promote even broader collaborations.


In silico experiment lifecycle


[Architecture diagram. Labels: Finding and Sharing Tools; Taverna Workbench; 3rd Party Applications and Portals; Clients; Workflow Enactor; Service Management; Results Management; Log; Metadata; Default Data Store; Custom Store; LSIDs; DAS; KAVE; BAKLAVA; Feta; myExperiment; Utopia.]

Part of a bigger picture (which we will talk about in more detail later)


Taverna Workflow Workbench


Taverna

• Taverna is:
– A workflow language based on a dataflow model.
– A graphical editing environment for that language.
– An invocation system to run instances of that language on data supplied by a user of the system.
• When you download it you get all this rolled into a single piece of desktop software
• The enactor can be run independently of the GUI
• Java based, runs on Windows, Mac OS, Linux, Solaris ….
• It doesn't necessarily run "on a grid".
• Can be used to access resources, either on a grid, or anywhere else.


OMII-UK

• Funded through the Open Middleware Infrastructure Institute (OMII-UK) as part of the myGrid project run by Carole Goble

• Four years old, funding secured through 2008 and beyond.

• Development team at Manchester & Hinxton, UK

• Wide group of ‘friends and allies’ across the world particularly within UK eScience

• Implemented in Java, released under LGPL licence.


[Screenshot: the Taverna workbench, with callouts for the available services, the tree view of workflow structure, and the workflow diagram containing a Biomart query and a Soaplab operation wrapping an EMBOSS tool.]

Version 1.5.1. Shown running on a Mac but written in Java. Runs and is developed on Windows, OS X and Linux.


An Open World

• Open domain services and resources.
• Taverna accesses 3500+ operations.
• Third party.
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable


Services

• Taverna can interoperate with the following by default:
– SOAP based web services
– Biomart data warehouses
– Soaplab wrapped command line tools
– BioMoby services and object constructors (talk tomorrow)
– Inline interpreted scripting (Java based)
• Other service classes can be added through an extension point (but you probably don’t need to)


Multi-disciplinary

• ~37000 downloads
• Ranked 210 on sourceforge
• Users in US, Singapore, UK, Europe, Australia

• Systems biology
• Proteomics
• Gene/protein annotation
• Microarray data analysis
• Medical image analysis
• Heart simulations
• High throughput screening
• Phenotypical studies
• Plants, Mouse, Human
• Astronomy
• Aerospace
• Dilbert Cartoons


What do Scientists use Taverna for?

• Data gathering and annotating
– Distributed data and knowledge
• Data analysis
– Distributed analysis tools and
• Data mining and knowledge management
– Hypothesis generation and modelling


Case Study – Graves Disease

• Autoimmune disease that causes hyperthyroidism

• Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone

• Original myGrid Case Study

Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves' disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004


Graves Disease

The experiment:
• Analysing microarray data to determine genes differentially expressed in Graves Disease patients and healthy controls
• Characterising these genes (and any proteins encoded by them) in an annotation pipeline
• From an Affymetrix probeset identifier, extract information about genes encoded in this region.

• For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement


Annotation Pipeline

Evidence includes:
• SNPs in coding and non-coding regions
• Protein products
• Protein structure and functional features
• Metabolic Pathways
• Gene Ontology terms


Data Analysis

• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge


Case study: Investigating Genotype-Phenotype Correlations in Trypanotolerance

Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33


[Figure: research questions placed against the data sources Genome, Transcriptome and Pathway. Other labels recovered from the figure: Simple, Composition, Biological Lessons, General, Complex; Top-down: Immunology-driven; Bottom-up: Data driven; Susceptibility; Infected host; Datasources.]

• Which genes are between two genes?
• Which genes are up-regulated in a data set?
• In which pathways is a set of genes involved?
• Why is one mouse resistant and another one susceptible?
• What did the immune system of the susceptible mouse do inappropriately?
• Which of the strain differences between resistant/susceptible mice are significant?
• Which of the differently activated pathways in resistant/susceptible mice are significant?
• Which genes in a region are differently expressed in resistant/susceptible mice and have SNPs?
• What are the expression levels of all genes involved in a particular pathway in resistant/susceptible mice?
• Which strain differences can be found in resistant/susceptible mice?
• Which pathways are differently activated in resistant/susceptible mice?


Bioinformatics Challenges

• Linking from genotype to phenotype
– Integrated ‘omics (GIMS)
– Microarray analysis
– Working with the literature
– Presentation of results to non-bioinformaticians
– Separating cause and effect


Genotype to Phenotype


[Figure, two panels. "Current Methods": Genotype -> ? -> Phenotype, labelled "200" and "What processes to investigate?". "Microarray + QTL": Genotype -> Metabolic pathways -> Phenotype, labelled "200".]

• Genes captured in microarray experiment and present in QTL (Quantitative Trait Loci) region
• Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping


Trypanosomiasis (“Sleeping sickness”)

• Trypanosoma species parasite

• Human sleeping sickness => T.brucei

• Cattle => T.congolense and T.vivax

• Major problem for cattle production in sub-Saharan Africa

• Symptoms include:
– severe anaemia
– weight loss
– foetal abortion
– cachexia and associated intermittent fever
– oedema
– general loss of condition

• Some breeds of cattle tolerate mild and moderate infections


Trypanosomiasis

• Quantitative Trait Loci data available for cattle and mouse

• Issues – to identify the genetic difference responsible for resistance and breed them into productive cattle.

• Only need to be right (not for the right reasons)

• Model system in mice


Mouse Model

• A/J or Balb/C strains are susceptible
• C57BL/6 (B6) are resistant
• QTL regions defined (Iraqi, Kemp, Gibson)
– Tir1 (17.4-18.3cM Chr 17), Tir2 and Tir3
• Which genes are responsible for resistance?


Tir1 region

• Contains > 130 genes including TNF and MHC region

• Markers not mapped
• Can microarray help?

• Issues: What tissue, what time?


Data used

• Samples were taken from Liver, Spleen and Kidney at time points 0, 3, 7, 9 and 17 days post infection for all three strains of mouse. In total 225 oligonucleotide arrays were used to capture cellular responses to infection, with 5 biological replicates per condition, and samples from 5 mice were used to create each biological replicate

LOTS OF DATA
RIGOROUS STATISTICAL ANALYSIS
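The figure of 225 arrays follows directly from the design described above: three tissues, five time points, three mouse strains and five biological replicates per condition. A short Python check of that arithmetic (the variable names are only for illustration):

```python
tissues = ["Liver", "Spleen", "Kidney"]
time_points_days = [0, 3, 7, 9, 17]
strains = 3                     # "all three strains of mouse"
replicates_per_condition = 5    # biological replicates per condition

# One condition = one tissue at one time point in one strain.
conditions = len(tissues) * len(time_points_days) * strains
arrays = conditions * replicates_per_condition
```

That is 45 conditions and 225 arrays, matching the total quoted on the slide.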


[Figure: prioritising candidate pathways between Genotype and Phenotype. Labels: CHR, QTL, Gene A, Gene B, Gene C, Pathway A, Pathway B, Pathway C, literature.]

• Pathway linked to phenotype – high priority
• Pathway not linked to phenotype – medium priority
• Pathway not linked to QTL – low priority


Key:

A – Retrieve genes in QTL region

B – Annotate genes with external database Ids

C – Cross-reference Ids with KEGG gene ids

D – Retrieve microarray data from MaxD database

E – For each KEGG gene get the pathways it’s involved in

F – For each pathway get a description of what it does

G – For each KEGG gene get a description of what it does


Workflow Breakdown

• Stage 1: Microarray Analyses using MADAT and R
• Stage 2: Finding genes in the QTL
• Stage 3: Finding pathways
• Stage 4: Ranking gene lists by SNPs in susceptible vs resistant strains


Finding Genes in the QTL

• Find where probesets are in genomic sequence

• Get list of genes in that region – by searching mmusculus_Ensembl and then both Uniprot and Entrez gene
• Get the corresponding ids from KEGG
• Concatenate gene lists and remove duplicates


Finding Pathways

• For each gene, get a list of known pathways

• For each pathway get associated descriptions

• Merge pathways and remove duplicates
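The three steps above amount to a lookup per gene followed by a merge that drops duplicates. A minimal Python sketch, assuming a small in-memory table in place of the KEGG service; the gene and pathway names are hypothetical placeholders, not real identifiers.

```python
# Hypothetical gene -> pathway table standing in for a KEGG lookup service.
PATHWAYS_BY_GENE = {
    "gene_a": ["pathway_1", "pathway_2"],
    "gene_b": ["pathway_2", "pathway_3"],
    "gene_c": [],
}


def merge_unique(lists):
    """Concatenate lists, dropping duplicates but keeping first-seen order."""
    return list(dict.fromkeys(item for sub in lists for item in sub))


def pathways_for(genes):
    """Look up each gene's pathways, then merge them into one list."""
    return merge_unique(PATHWAYS_BY_GENE.get(gene, []) for gene in genes)


merged = pathways_for(["gene_a", "gene_b", "gene_c"])
```

The same `merge_unique` helper would serve for the earlier "concatenate gene lists and remove duplicates" step.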


Finding SNPs and Ranking Genes

• Each QTL gene is analysed to determine any SNPs in the region

• Any SNPs are scored for informativeness
• Each gene is ranked in terms of its SNP scores

In the mouse model, strain AJ is more susceptible to trypanosome infection, so informative SNPs are those that are unique to this strain and have the same allele in all the other strains
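One way to read that definition of an informative SNP is sketched below in Python. This is not the RankGene service: the scoring rule (one point per informative SNP), the strain labels other than AJ and all of the allele data are assumptions made for the example.

```python
def is_informative(alleles, susceptible="AJ"):
    """True if all other strains share one allele and the susceptible strain differs."""
    others = {allele for strain, allele in alleles.items() if strain != susceptible}
    return len(others) == 1 and alleles[susceptible] not in others


def rank_genes(snps_by_gene):
    """Score each gene by its number of informative SNPs, highest first."""
    scores = {gene: sum(is_informative(snp) for snp in snps)
              for gene, snps in snps_by_gene.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented alleles for three strains at a few SNP positions per gene.
snps_by_gene = {
    "gene_x": [{"AJ": "A", "B6": "G", "other": "G"},
               {"AJ": "C", "B6": "C", "other": "C"}],
    "gene_y": [{"AJ": "T", "B6": "C", "other": "C"},
               {"AJ": "G", "B6": "A", "other": "A"}],
    "gene_z": [{"AJ": "A", "B6": "G", "other": "T"}],
}
ranking = rank_genes(snps_by_gene)
```

Here `gene_z` scores zero because the other strains disagree with each other, so the SNP is not unique to AJ in the sense used on the slide.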


Parallelising Tasks

SNP analysis and pathway analysis can happen concurrently. They are using the same initial data, but there is no dependency between them

Workflows allow this parallelisation of tasks and the behaviour is often implicit – the user does not have to specify that this should happen
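For contrast, here is what the same independence looks like when it has to be spelled out by hand. The sketch uses Python's standard `concurrent.futures` module; the two analysis functions are hypothetical placeholders that stand in for the SNP and pathway branches of the workflow.

```python
from concurrent.futures import ThreadPoolExecutor


def snp_analysis(genes):
    """Placeholder for the SNP branch: returns a dummy score per gene."""
    return {gene: len(gene) for gene in genes}


def pathway_analysis(genes):
    """Placeholder for the pathway branch: returns a dummy sorted listing."""
    return sorted(gene.upper() for gene in genes)


genes = ["daxx", "tnf"]

# Both branches read the same input and neither needs the other's output,
# so they can be submitted together and collected afterwards.
with ThreadPoolExecutor(max_workers=2) as pool:
    snp_future = pool.submit(snp_analysis, genes)
    pathway_future = pool.submit(pathway_analysis, genes)
    snp_result = snp_future.result()
    pathway_result = pathway_future.result()
```

In a dataflow workflow the executor and the two `submit` calls disappear: the absence of a link between the branches is enough for the engine to run them side by side.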


Locations of Services

• MADAT microarray software analysis package, including R statistical package – University of Manchester
• Affymetrix data for mouse 430_2
• BioMart mouse Ensembl database – searching Uniprot and Entrez gene – EBI, UK
• KEGG Gene IDs - Kanehisa Laboratory, Kyoto, Japan
• KEGG pathways and descriptions - Kanehisa Laboratory, Kyoto, Japan
• RankGene – SNP analysis service – University of Manchester


Workflow Features

• Shims
• Nested Workflows
• Iterations
• Output Collections


Shims

• 12 beanshell scripts in the workflow
• Beanshells allow users to build small, bespoke scripts for connecting incompatible services
• Example – ‘CreateReport’ takes the results from the BioMart mouse Ensembl query and collects them into a report


Nested workflows

• A processor can be a workflow itself.

• Encourages the reuse of workflows within a more complex scenario.

• Greater abstraction of an overall process making it more manageable.


Beanshell

• Mmusculus_gene_ensembl has 1 input and 7 outputs

• Each output has the results of iterating over each gene in the QTL

• The ‘CreateReport’ beanshell allows iteration 1 from each result to be collected together, followed by iteration 2 etc, so that they can be examined at a later date
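The regrouping described here is essentially a transpose: several per-output lists become one record per iteration. The sketch below shows the idea in Python rather than Beanshell, and it is not the actual ‘CreateReport’ script; the port names and values are hypothetical.

```python
def create_report(outputs):
    """Regroup per-port result lists into one record per iteration.

    outputs: port name -> list with one entry per iteration (one per gene).
    Returns a list of dicts, where record i holds entry i from every port.
    """
    ports = list(outputs)
    rows = zip(*(outputs[port] for port in ports))
    return [dict(zip(ports, row)) for row in rows]


# Two of the seven outputs are enough to show the shape of the result.
outputs = {
    "gene_id": ["g1", "g2"],
    "uniprot_id": ["u1", "u2"],
}
report = create_report(outputs)
```

Note that `zip` stops at the shortest list, so this sketch assumes every output port produced one value per iteration.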


Local Java Processor

Split_by_regex
• Splits a list of gene identifiers into single gene identifiers for pathway analyses
• The list of genes produced from the QTL analysis is presented as one per line
• This Shim splits this file at every new line so that each gene identifier has its own file for further analysis
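The behaviour of this shim is easy to mimic outside Taverna. A minimal Python sketch follows; the function name mirrors the processor but the implementation is an assumption (in particular, dropping empty entries is a choice made here, not documented Taverna behaviour), and the identifiers are placeholders.

```python
import re


def split_by_regex(text, pattern=r"\r?\n"):
    """Split text on a regular expression, discarding empty pieces."""
    return [part for part in re.split(pattern, text) if part]


gene_list = "gene_1\ngene_2\r\ngene_3\n"
genes = split_by_regex(gene_list)
```

Each element of `genes` can then be handed to the pathway lookup on its own, which is what lets the downstream service be called once per gene.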


Before Workflows

• Genotype and phenotype correlations are difficult - fragmented data held in numerous data resources
• Error prone mappings between resources
• Lots of repetitive operations
• Scientists would choose most likely candidates for further investigation based on prior knowledge or experience – potentially missing important correlations


Workflow Results

Trypanosomiasis resistance

A strong candidate gene was found – the Daxx gene, not found using manual investigation methods
– The gene was identified from analysis of biological pathway information
– Possible candidate identified by Yan et al (2004): Daxx SNP info
– Sequencing of the Daxx gene in the wet lab showed mutations that are thought to change the structure of the protein
– Mutation was published in scientific literature, noting its effect on the binding of Daxx protein to p53 protein – p53 plays a direct role in cell death and apoptosis, one of the Trypanosomiasis phenotypes


Conclusions

• Automation has allowed a systematic analysis of a large data space

• The expression of genes and their pathways can be investigated with no prior knowledge – bias from prior knowledge is not introduced into the experiment

• The workflow is a permanent record of the experimental methods used – increasing the reproducibility and providing a starting-point for future modifications and additions to the protocols


An Automatic Annotation Pipeline

• Genome annotation pipelines – workflow assembles evidence for predicted genes / potential functions

• Human expert can ‘review’ this evidence before submission to the genome database

Collaboration with the Bergen Center for Computational Science (computational biology unit) – Gene Prediction in Algal Viruses, a case study. Presented at NETTAB 2005
http://www.nettab.org/2005/docs/NETTAB2005_LanzenPoster.pdf


User Interaction Handling

• Interaction Service and corresponding Taverna processor allows a workflow to call out to an expert human user

• Used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline

• To set this up refer to Taverna project site

Collaboration with the University of Bergen
Ref: Poster, Nettab 2005


Iteration

• Repeated application of a process to multiple data items

• When a processor that expects a single item is given a list as input, the enactor engine invokes the processor once per item and collates the results into a new list.
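The effect of implicit iteration can be sketched in a few lines of Python. This is an illustration of the idea only, not the enactor's code; the helper name `invoke` is hypothetical.

```python
def invoke(processor, value):
    """Call processor once for a single item, or once per item for a list.

    Nested lists are handled recursively, so the result keeps the
    same list structure as the input.
    """
    if isinstance(value, list):
        return [invoke(processor, item) for item in value]
    return processor(value)


single = invoke(len, "ACGT")
per_item = invoke(len, ["ACGT", "AC"])
nested = invoke(str.upper, [["a"], ["b", "c"]])
```

The processor itself (`len` or `str.upper` here) is never told that it is being iterated, which is why the workflow designer does not have to write a loop.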


Conditional Branching


Current Workflow Issues

• Web Service Stability
• Workflow discovery and reuse
• Experimental design
• Workflow implications


Web Service Stability

• Distributed computing
– Users do not normally own the services they use
– Workflow system providers do not usually own services

• What can we do when a service fails?
– Find an alternative, add an alternative to invoke automatically
– Allow users to rank services by their reliability


What about when a service fails?

• Most services are owned by other people
• No control over service failure
• Some are research level

Workflows are only as good as the services they connect!

• To help - Taverna can:
– Notify failures
– Instigate retries
– Set criticality
– Substitute alternative services
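Retries and substitution of alternates can be combined in one small wrapper. The Python sketch below illustrates the pattern only; it is not how Taverna implements it, criticality is not modelled, and the function and service names are hypothetical.

```python
import time


def call_with_failover(services, argument, retries=2, delay_seconds=0.0):
    """Try each service in order, retrying each one before moving on.

    services: callables that are interchangeable alternates.
    Raises RuntimeError listing every failure if none of them succeeds.
    """
    errors = []
    for service in services:
        for attempt in range(1 + retries):
            try:
                return service(argument)
            except Exception as error:  # broad on purpose: any failure counts
                errors.append(f"{service.__name__} attempt {attempt + 1}: {error}")
                time.sleep(delay_seconds)
    raise RuntimeError("all services failed: " + "; ".join(errors))


calls = []


def primary(sequence):
    """Stand-in for a remote service that is currently down."""
    calls.append(sequence)
    raise ConnectionError("service unavailable")


def backup(sequence):
    """Stand-in for an alternate service offering the same operation."""
    return sequence.upper()


result = call_with_failover([primary, backup], "acgt")
```

The primary is tried three times (one call plus two retries) before the alternate is used; the collected error messages are what a "notify failures" step would report.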


Technology Stability

• Web services are gaining in popularity, most major bioinformatics service providers supply a web service interface to their resources

• The Web Services Description Language (WSDL) is currently a recommendation candidate with the W3C (World Wide Web Consortium).


Workflow Discovery and Reuse

• Workflows are useful experimental artefacts

• Reusing or repurposing existing workflows can save time and effort and should result in the perpetuation of ‘good’ workflow design

• Workflow fragments are also useful for starting new experiments


Workflows are hard work

• Often complex.
– Need intelligent steering and analysis.
– Need explanations to ensure used properly and safely.
• Challenging and expensive to develop.
– Development assistance. Don’t start from scratch.
– Take a long time to build good ones & a lot of know-how.
• You can still build crap workflows
– Enable scientists to be scientists, not programmers.
– Enable scientists to be creative yet sound.


Workflows are commodities

• Valuable first class assets in their own right.
– To be pooled and shared and traded and reused
– Within communities and across communities
– Of pieces, of wholes, of when and how to
– Pattern books. Validated community workflow packs
– Publish and review
• Enable Mediocre Scientists to do the mundane just as much as the Great and Good.
– Conservation of Work principle
• But…. Reusability is often confined to the project in which it was conceived. Social and technical challenges for sharing and reuse.


Taverna: Record, Reuse, Recycle, Repurpose

• Trichuris muris - the mouse whipworm

• Trypanosomiasis cattle workflow reused without change over a new dataset

• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.

• A manual two year study of candidate genes had failed to do this.

Paul Fisher et al A Systematic Strategy for Large-Scale Unbiased Analysis of Genotype-Phenotype Correlations Bioinformatics in review


Workflow Reuse – Workflows are Scientific Protocols – Share them!

Addisons Disease

SNP design

Protein annotation

Microarray analysis

myGrid Workflow Repository

http://workflows.mygrid.org.uk/repository


A workflow marketplace


Started February 2007

A community social network for sharing workflows

A gateway to other publishing environments

A platform for launching workflows

Soon you will be able to see a workflow and just launch it – not just Taverna, but others like Kepler, Triana


Experimental Design

• Workflows can generate large amounts of data

• Gathering data is no longer a problem – but how do you analyse such large numbers of files and high volumes?
– Visualisation
– Building data models
– Populating data schemas


Experimental Design - Visualisation

Other tools can use workflows as background processes

• UTOPIA is a visualisation tool for DNA and proteins

• UTOPIA displays workflow results in an interactive way, allowing scientists to explore their data and initiate further experiments


UTOPIA


[Architecture diagram. Labels: Design GUI; Portal; Application (x3); WF Execution Engine; Middleware; Resources; Datasets; Services; Results; Provenance; Provenance Warehouse; Workflow Warehouse; Service / Component Catalogue; Resource Information; Utopia.]

http://www.kooprime.com


Experimental Design – Building Data Models

• Generate EMBL records as a step in the workflow

• Generate GFF models as a step in the workflow

• Results are now just one file


Experimental Design – Populating Data Schemas

• SBML (Systems Biology Mark-up Language) is the standard format for generating and sharing systems biology data

• A Taverna plug-in (using libSBML) allows SBML models to be consumed or produced from workflow experiments


MCISB Case Study

• Manchester Centre for Integrative Systems Biology
– Case study for their informatics infrastructure

• Superimposing array data onto pathway maps using Taverna workflows

[Peter Li, Doug Kell, 2006]


MCISB

Study Involves
• Combining public and local data
• Gathering, creating and storing data in SBML models
• Visualising results using pre-existing SBML-based tools

Why?
• To view transcriptome data from the context of pathways
• To see the effects of up/down regulation on pathways


Pathway maps are first saved as SBML models using bespoke SBML software - CellDesigner

Glycolysis pathway


MaxD

Read gene names of enzymes from SBML Cell Designer model

Query Big Expt data in MaxD using gene names

Calculate colour of enzyme nodes based on mRNA expression levels

Create new SBML model

Transcriptome pathway workflow


libSBML using the API Consumer

• Building the SBML model was possible because of Taverna’s extensibility

• Did not have to write new services – just used the API consumer to add SBML library

• API consumer can be downloaded from the myGrid website


JC_C-0.07-1_Measurement JC_N-0.07-1_Measurement

Decreased levels of GPM3

New SBML models viewed using Cell Designer


Results

• Visualisation of one data source over another by the building of a common model during the workflow run

• Promotes re-use by providing a simple interface for interpretation by the scientist

• Offers a proof of principle for the overlay of other omics data:
– Proteomics data from PRIDE
– Metabolomic data from MEMO


Transparency


Provenance

• Who, What, Where, When, Why?, How?

• Context
• Interpretation
• Logging & Debugging
• Reproducibility and repeatability
• Evidence & Audit
• Non-repudiation
• Credit and Attribution
• Credibility
• Accurate reuse and interpretation
• Smart re-running
• Cross experiment mining
• Just good scientific practice

Smart Tea

BioMOBY


Tracking

From which Ensembl gene does pathway mmu004620 come?


Provenance

Workflow experiments can span several months, re-running the same workflow or comparing the results of several different workflows

Scientists need to record:
• Who performed the experiment and when
• What the experiment was
• What services were invoked
• What the final and intermediate results were

Some workflow systems enable this process provenance to be collected automatically – for Taverna, we have the myGrid LogBook
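The items listed above map naturally onto a small record structure that is filled in as each service is invoked. The Python sketch below is an illustration only and has nothing to do with the LogBook's actual data model; the class names, field names and example values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProcessRun:
    """One service invocation: what was called, with what, and what came back."""
    service: str
    inputs: dict
    output: object


@dataclass
class WorkflowRun:
    """Who ran which workflow and when, plus every intermediate result."""
    experimenter: str
    workflow: str
    started: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    process_runs: list = field(default_factory=list)

    def record(self, service, func, **inputs):
        """Invoke func and log the call as a ProcessRun before returning."""
        output = func(**inputs)
        self.process_runs.append(ProcessRun(service, inputs, output))
        return output


run = WorkflowRun(experimenter="experimenter_1", workflow="annotation_pipeline")
length = run.record("sequence_length", lambda sequence: len(sequence), sequence="ACGT")
```

Because recording happens inside `record`, the scientist gets the log as a side effect of running the workflow rather than as an extra task.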


[Diagram: provenance model in three parts: Organisation provenance, Workflow provenance and Data/knowledge provenance.
Entities: Organisation; Experimenter; Workflow; Workflow run; Process (e.g. BLAST @ NCBI); ProcessRun (e.g. web service invocation of BLAST @ NCBI); ProcessIteration; Data; Atomic Data; Data Collection; Knowledge statements (e.g. similar_sequence_to).
Relations: belongsTo; launchedBy; runsWorkflow; hasProcesses; runsProcess; executesProcessRun; iteration; hasInput; workflowOutput; createdBy; derivedFrom; containsData; isA.]


Workflow Implications

• Workflows generate lots of data in a model that is easy to understand

• More and better resources are available to more scientists

• Changes in communication between laboratory scientists and in silico scientists

• Changes in work practices towards hypothesis generation

• Workflows can be shared, published, verified, repurposed and reused and scientists should always receive the credit for their creation.


Taverna Summary

• Automation
• Implicit iteration
• Implicit parallelisation
• Support for nested workflow construction
• Error handling
– Retry, failover and automatic substitution of alternates


Extensibility

• Accepts many types of services: web services, beanshell scripts, local java scripts, JDBC connections… etc
• Easy to add your own services
• Plug-in architecture
– Easy to build new processor types
– Easy to extend to include alternative results viewers


More information…

Taverna http://taverna.sourceforge.net
myGrid http://www.mygrid.org.uk
OMII-UK http://www.omii.ac.uk



Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

OMII-UK
• Tom Oinn, Daniele Turi, Katy Wolstencroft, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble

Research
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan

Current contributors
• Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people

User Advocates and their bosses
• Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell

Past Contributors
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson and Chris Wroe.

Industrial
• Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica

Funders
• EPSRC, Wellcome Trust

