An Introduction to Taverna
Dr. Georgina Moulton and Stian Soiland
The University of Manchester
([email protected]; [email protected] )
(on behalf of the myGRID team)
Outline of the day
• Introduction to workflows
• Introduction to Taverna
– Case-studies
• Hands-on Taverna workshop
– Build your own workflows
– Explore features of Taverna
• Taverna in a little more detail
What you will learn
• No prior knowledge of workflow technology is assumed
• By the end of the tutorial participants will know how to
– install the workbench software, import and run existing workflows, and build their own from components available on the public internet
– use the semantic search technologies in myGrid to assist this process by enabling service discovery
– do basic troubleshooting of workflows using Taverna's fault tolerance and debug mechanisms
– manage the import and export of data to and from the workflow system
What is Taverna?
Taverna enables the interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments
• Access to local and remote resources and analysis tools
• Automation of data flow
• Iteration over large data sets
• Workflow language specifies how processes (web services) fit together
• Describes what you want to do, not how you want to do it
• High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows
• Workflow is a kind of script or protocol that you configure when you run it.
- Easier to explain, share, relocate, reuse and repurpose.
- Workflow <=> Model
- Workflow is the integrator of knowledge
Workflows
[Figure: example workflow. A sequence goes in, passes through the RepeatMasker, GenScan and Blast web services, and predicted genes come out.]
Two types of workflows
• Data workflows
– A task is invoked once its expected data has been received, and when complete passes any resulting data downstream
• Control workflows
– A task is invoked once the tasks it depends on have completed
[Figure: example workflow graph with tasks A, B, C, D, E and F]
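The data-workflow rule above can be sketched in a few lines of Python. This is only an illustration, not Taverna's enactor: the function name `run_data_workflow`, the task names and the toy arithmetic tasks are all assumptions made for the example. Each task fires as soon as every upstream value it expects is available.

```python
def run_data_workflow(tasks, upstream, initial):
    """Run each task once all the data it expects has arrived.

    tasks:    name -> function taking a list of input values
    upstream: name -> names whose outputs feed this task
    initial:  name -> value supplied by the user
    """
    results = dict(initial)
    order = []
    pending = set(tasks)
    while pending:
        # A task is ready when every one of its inputs has been produced.
        ready = sorted(t for t in pending
                       if all(u in results for u in upstream[t]))
        if not ready:
            raise ValueError("cycle or missing input in workflow")
        for name in ready:
            results[name] = tasks[name]([results[u] for u in upstream[name]])
            order.append(name)
            pending.discard(name)
    return results, order


# Toy tasks (hypothetical): A feeds B, and both feed C.
tasks = {
    "A": lambda xs: xs[0] + 1,
    "B": lambda xs: xs[0] * 2,
    "C": lambda xs: xs[0] + xs[1],
}
upstream = {"A": ["input"], "B": ["A"], "C": ["A", "B"]}
results, order = run_data_workflow(tasks, upstream, {"input": 1})
print(order, results["C"])
```

A control workflow would use the same loop but wait only for the upstream tasks to finish, without passing their outputs along.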
Williams-Beuren Syndrome (WBS)
• Contiguous sporadic gene deletion disorder
• 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
• Haploinsufficiency of the region results in the phenotype
• Multisystem phenotype – muscular, nervous, circulatory systems
• Characteristic facial features
• Unique cognitive profile
• Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)
• Outgoing personality, friendly nature, ‘charming’
Williams-Beuren Syndrome Microdeletion
[Figure: physical map of the WBS microdeletion. Chromosome 7 (~155 Mb), band 7q11.23, ~1.5 Mb region. The map shows the repeat blocks A, B and C in their centromeric, middle and telomeric copies, the genes across the region (including GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, LIMK1, ELN, CLDN4, CLDN3, STX1A, TBL2, BCL7B, BAZ1B, FZD9, FKBP6, POM121, GTF2IRD2 and several WBSCR genes), the extent of the WBS and SVAS patient deletions, the clones CTA-315H11 and CTB-51J22, and a ‘Gap’ in the sequence.]
Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164
Filling a genomic gap in silico
• Two steps to filling the genomic gap:
1. Identify new, overlapping sequence of interest
2. Characterise the new sequence at nucleotide and amino acid level
• Number of issues if we are to do it the traditional way:
1. Frequently repeated – info rapidly added to public databases
2. Time consuming and mundane
3. Don’t always get results
4. Huge amount of interrelated data is produced
Traditional Bioinformatics
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Requirements
• Automation
• Reliability
• Repeatability
• Few programming skills required
• Works on distributed resources
The Williams Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
The Biological Results
[Figure: the region between clones CTA-315H11 and CTB-51J22 after the workflow runs, showing a 314,004 bp extension with clones RP11-622P13, RP11-148M21 and RP11-731K22. Genes shown: ELN, WBSCR14, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28.]
• All nine known genes identified (40/45 exons identified)
• Four workflow cycles totalling ~10 hours
• The gap was correctly closed and all known features identified
Workflow Advantages
• Automation
– Capturing processes in an explicit manner
– Tedium! Computers don’t get bored/distracted/hungry/impatient!
– Saves repeated time and effort
• Modification, maintenance, substitution and personalisation
• Easy to share, explain, relocate, reuse and build
• Releases Scientists/Bioinformaticians to do other work
• Record
– Provenance: what the data is like, where it came from, its quality
– Management of data (LSID - Life Science Identifiers)
Benefit to the Scientist?
• Automated plumbing
– Systematic. Making boring stuff easier so you can do more funky stuff. Data chaining replaces manual hand-offs. Accelerated creation of results. Repetitive and unbiased analysis. Potentially reproducible but not always.
• Easier to use (but maybe not design)
– Gives non-developers access to sophisticated codes and applications. Avoids need to download-install-learn how to use someone else's code.
• A framework to leverage a community’s applications, services, datasets and codes
– Honours original codes and applications. Heterogeneous coding styles and tools sets. The best applications.
– Promoting community metadata and common formats & standards
• A framework for extensibility, adaptability & innovation.
– Add my code, reuse and repurpose
It’s more than plumbing….
• Workflows are protocols and records.
– Explicit and precise descriptions of a scientific protocol
– Scientific transparency. Easier to explain, share, relocate, reuse and repurpose and remember.
– Provenance of results for credibility.
• Workflows are know-how.
– Specialists create applications; experts design and set parameters; inexperienced punch above their weight with sophisticated protocols
• Workflows are collaborations.
– Multi-disciplinary workflows promote even broader collaborations.
In silico experiment lifecycle
[Figure: the in silico experiment lifecycle. Labels: Finding and Sharing Tools; Taverna Workbench; 3rd Party Applications and Portals; Clients; Workflow Enactor; Service Management; Results Management; Log; Metadata; Default Data Store; Custom Store; LSIDs; DAS; KAVE; BAKLAVA; Feta; myExperiment; Utopia.]
Part of a bigger picture (which we will talk in more detail later)
Taverna Workflow Workbench
Taverna
• Taverna is:
– A workflow language based on a dataflow model.
– A graphical editing environment for that language.
– An invocation system to run instances of that language on data supplied by a user of the system.
• When you download it you get all this rolled into a single piece of desktop software
• The enactor can be run independently of the GUI
• Java based, runs on Windows, Mac OS, Linux, Solaris ….
• It doesn't necessarily run "on a grid".
• Can be used to access resources, either on a grid, or anywhere else.
OMII-UK
• Funded through the Open Middleware Infrastructure Institute (OMII-UK) as part of the myGrid project run by Carole Goble
• Four years old, funding secured through 2008 and beyond.
• Development team at Manchester & Hinxton, UK
• Wide group of ‘friends and allies’ across the world particularly within UK eScience
• Implemented in Java, released under LGPL licence.
[Figure: screenshot of the Taverna workbench, annotated with: Biomart query; Soaplab operation wrapping an EMBOSS tool; workflow diagram; tree view of workflow structure; available services.]
Version 1.5.1, shown running on a Mac. Written in Java; runs and is developed on Windows, OS X and Linux.
An Open World
• Open domain services and resources.
• Taverna accesses 3500+ operations.
• Third party.
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable.
Services
• Taverna can interoperate the following by default:
– SOAP based web services
– Biomart data warehouses
– Soaplab wrapped command line tools
– BioMoby services and object constructors (talk tomorrow)
– Inline interpreted scripting (Java based)
• Other service classes can be added through an extension point (but you probably don’t need to)
Multi-disciplinary
• ~37000 downloads
• Ranked 210 on sourceforge
• Users in US, Singapore, UK, Europe, Australia
• Systems biology
• Proteomics
• Gene/protein annotation
• Microarray data analysis
• Medical image analysis
• Heart simulations
• High throughput screening
• Phenotypical studies
• Plants, Mouse, Human
• Astronomy
• Aerospace
• Dilbert Cartoons
What do Scientists use Taverna for?
• Data gathering and annotating
– Distributed data and knowledge
• Data analysis
– Distributed analysis tools and
• Data mining and knowledge management
– Hypothesis generation and modelling
Case Study – Graves Disease
• Autoimmune disease that causes hyperthyroidism
• Antibodies to the thyrotropin receptor result in constitutive activation of the receptor and increased levels of thyroid hormone
• Original myGrid Case Study
Ref: Li P, Hayward K, Jennings C, Owen K, Oinn T, Stevens R, Pearce S and Wipat A (2004) Association of variations in NFKBIE with Graves’ disease using classical and myGrid methodologies. UK e-Science All Hands Meeting 2004
Graves Disease
The experiment:
• Analysing microarray data to determine genes differentially expressed in Graves Disease patients and healthy controls
• Characterising these genes (and any proteins encoded by them) in an annotation pipeline
• From the Affymetrix probeset identifier, extract information about genes encoded in this region.
• For each gene, evidence is extracted from other data sources to potentially support it as a candidate for disease involvement
Annotation Pipeline
Evidence includes:
• SNPs in coding and non-coding regions
• Protein products
• Protein structure and functional features
• Metabolic Pathways
• Gene Ontology terms
Data Analysis
• Access to local and remote analysis tools
• You start with your own data / public data of interest
• You need to analyse it to extract biological knowledge
Case study: Investigating Genotype-Phenotype Correlations in Trypanotolerance
Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. (2007) A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33
[Figure: questions asked of the data, arranged between a top-down, immunology-driven approach and a bottom-up, data-driven approach. Other labels: Susceptibility; Infected host; Simple; Composition; Complex; General; Biological; Lessons; Datasources (Genome, Transcriptome, Pathway).]
• Which genes are between two genes?
• Which genes are up-regulated in a data set?
• In which pathways is a set of genes involved?
• Why is one mouse resistant and another one susceptible?
• What did the immune system of the susceptible mouse do inappropriately?
• Which of the strain differences between resistant/susceptible mice are significant?
• Which of the differently activated pathways in resistant/susceptible mice are significant?
• Which genes in a region are differently expressed in resistant/susceptible mice and have SNPs?
• What are the expression levels of all genes involved in a particular pathway in resistant/susceptible mice?
• Which strain differences can be found in resistant/susceptible mice?
• Which pathways are differently activated in resistant/susceptible mice?
Bioinformatics Challenges
• Linking from genotype to phenotype
– Integrated ‘omics (GIMS)
– Microarray analysis
– Working with the literature
– Presentation of results to non-bioinformaticians
– Separating cause and effect
Genotype to Phenotype
[Figure: linking genotype to phenotype through metabolic pathways. Labels: Genotype; Phenotype; ?; 200; Metabolic pathways.]
• Microarray + QTL: genes captured in the microarray experiment and present in the QTL (Quantitative Trait Loci) region
• Phenotypic response investigated using microarray, in the form of expressed genes, or evidence provided through QTL mapping
Trypanosomiasis (“Sleeping sickness”)
• Trypanosoma species parasite
• Human sleeping sickness => T. brucei
• Cattle => T. congolense and T. vivax
• Major problem for cattle production in sub-Saharan Africa
• Symptoms include:
– severe anaemia
– weight loss
– foetal abortion
– cachexia and associated intermittent fever
– oedema
– general loss of condition
• Some breeds of cattle tolerate mild and moderate infections
Trypanosomiasis
• Quantitative Trait Loci data available for cattle and mouse
• Issue – identify the genetic differences responsible for resistance and breed them into productive cattle.
• Only need to be right (not for the right reasons)
• Model system in mice
Mouse Model
• A/J or Balb/C strains are susceptible
• C57BL/6 (B6) are resistant
• QTL regions defined (Iraqi, Kemp, Gibson)
– Tir1 (17.4-18.3cM Chr 17), Tir2 and Tir3
• Which genes are responsible for resistance?
Tir1 region
• Contains > 130 genes including TNF and MHC region
• Markers not mapped
• Can microarray help?
• Issues: What tissue, what time?
Data used
• Samples were taken from Liver, Spleen and Kidney at time points 0, 3, 7, 9 and 17 days post infection for all three strains of mouse. In total 225 oligonucleotide arrays were used to capture cellular responses to infection, with 5 biological replicates per condition, and samples from 5 mice were used to create each biological replicate
LOTS OF DATA
RIGOROUS STATISTICAL ANALYSIS
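The figure of 225 arrays follows directly from the experimental design on this slide. A short check, assuming the three strains are the A/J, Balb/C and C57BL/6 mice named on the Mouse Model slide (the variable names are ours):

```python
tissues = ["liver", "spleen", "kidney"]
time_points_days = [0, 3, 7, 9, 17]
strains = ["A/J", "Balb/C", "C57BL/6"]  # assumed from the Mouse Model slide
replicates_per_condition = 5

# One condition = one tissue at one time point in one strain.
conditions = len(tissues) * len(time_points_days) * len(strains)
arrays = conditions * replicates_per_condition

print(conditions, arrays)  # 45 conditions, 225 arrays
```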
[Figure: from genotype to phenotype. Genes (Gene A, Gene B, Gene C) on the chromosome (CHR), some within the QTL, map to pathways (Pathway A, Pathway B, Pathway C), which the literature links to the phenotype. Pathway linked to phenotype – high priority; pathway not linked to phenotype – medium priority; pathway not linked to QTL – low priority.]
Key:
A – Retrieve genes in QTL region
B – Annotate genes with external database Ids
C – Cross-reference Ids with KEGG gene ids
D – Retrieve microarray data from MaxD database
E – For each KEGG gene get the pathways it’s involved in
F – For each pathway get a description of what it does
G – For each KEGG gene get a description of what it does
Workflow Breakdown
• Stage 1: Microarray Analyses using MADAT and R
• Stage 2: Finding genes in the QTL
• Stage 3: Finding pathways
• Stage 4: Ranking gene lists by SNPs in susceptible vs resistant strains
Finding Genes in the QTL
• Find where probesets are in genomic sequence
• Get list of genes in that region – by searching mmusculus_Ensembl and then both Uniprot and Entrez gene
• Get the corresponding ids from KEGG
• Concatenate gene lists and remove duplicates
Finding Pathways
• For each gene, get a list of known pathways
• For each pathway get associated descriptions
• Merge pathways and remove duplicates
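The three steps above amount to a lookup per gene followed by a merge that drops duplicates. The sketch below uses small in-memory dictionaries as stand-ins for the KEGG services; the identifiers, descriptions and the function name `find_pathways` are invented for illustration and are not real KEGG entries or calls.

```python
# Hypothetical stand-ins for the KEGG lookups used in the workflow.
PATHWAYS_BY_GENE = {
    "gene:1": ["path:A", "path:B"],
    "gene:2": ["path:B", "path:C"],
}
DESCRIPTIONS = {
    "path:A": "pathway A",
    "path:B": "pathway B",
    "path:C": "pathway C",
}


def find_pathways(genes):
    """Return pathway -> description for all genes, without duplicates."""
    merged = {}
    for gene in genes:
        # Step 1: for each gene, get a list of known pathways.
        for pathway in PATHWAYS_BY_GENE.get(gene, []):
            # Step 3: a pathway shared by several genes is kept only once.
            if pathway not in merged:
                # Step 2: for each pathway, get its description.
                merged[pathway] = DESCRIPTIONS.get(pathway, "")
    return merged


pathways = find_pathways(["gene:1", "gene:2"])
print(list(pathways))
```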
Finding SNPs and Ranking Genes
• Each QTL gene is analysed to determine any SNPs in the region
• Any SNPs are scored for informativeness• Each gene is ranked in terms of its SNPs scores
In the mouse model, strain AJ is more susceptible to trypanosome infection, so informative SNPs are those that are unique to this strain and have the same allele in all the other strains
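The informativeness rule above can be written down directly. The slide does not say how SNP scores are combined per gene, so this sketch assumes the simplest scheme, counting informative SNPs; the gene names, alleles and function names are made up for the example and this is not the RankGene service.

```python
def is_informative(alleles, susceptible="A/J"):
    """alleles: strain -> allele at one SNP position.

    Informative means the susceptible strain's allele is unique and
    all other strains share one and the same allele.
    """
    others = {allele for strain, allele in alleles.items()
              if strain != susceptible}
    return len(others) == 1 and alleles[susceptible] not in others


def rank_genes(snps_by_gene, susceptible="A/J"):
    """Rank genes by their number of informative SNPs (assumed scoring)."""
    scores = {gene: sum(is_informative(snp, susceptible) for snp in snps)
              for gene, snps in snps_by_gene.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented example data: two genes, one SNP each.
snps_by_gene = {
    "geneX": [{"A/J": "T", "C57BL/6": "C", "Balb/C": "C"}],
    "geneY": [{"A/J": "T", "C57BL/6": "T", "Balb/C": "C"}],
}
ranking = rank_genes(snps_by_gene)
print(ranking)
```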
Parallelising Tasks
SNP analysis and pathway analysis can happen concurrently. They are using the same initial data, but there is no dependency between them
Workflows allow this parallelisation of tasks and the behaviour is often implicit – the user does not have to specify that this should happen
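In a workflow system this concurrency comes for free; in a script it has to be requested. The sketch below shows the same idea in Python with two placeholder functions standing in for the SNP and pathway analyses (their bodies are arbitrary and only there to make the example run).

```python
from concurrent.futures import ThreadPoolExecutor


def snp_analysis(genes):
    # Placeholder: a real step would call the SNP ranking service.
    return {gene: len(gene) for gene in genes}


def pathway_analysis(genes):
    # Placeholder: a real step would query the pathway database.
    return sorted(genes)


def analyse(genes):
    """Run both analyses concurrently; neither depends on the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        snp_future = pool.submit(snp_analysis, genes)
        pathway_future = pool.submit(pathway_analysis, genes)
        return snp_future.result(), pathway_future.result()


snp_result, pathway_result = analyse(["b", "a"])
print(snp_result, pathway_result)
```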
Locations of Services
• Madat microarray software analysis package, including R statistical package – University of Manchester
• Affymetrix data for mouse 430_2
• BioMart mouse Ensembl database – searching Uniprot and Entrez gene – EBI, UK
• KEGG Gene IDs - Kanehisa Laboratory, Kyoto, Japan
• KEGG pathways and descriptions - Kanehisa Laboratory, Kyoto, Japan
• RankGene – SNP analysis service – University of Manchester
Shims
• 12 beanshell scripts in the workflow
• Beanshells allow users to build small, bespoke scripts for connecting incompatible services
• Example – ‘CreateReport’ takes the results from the BioMart mouse Ensembl query and collects them into a report
Nested workflows
• A processor can be a workflow itself.
• Encourages the reuse of workflows within a more complex scenario.
• Greater abstraction of an overall process making it more manageable.
Beanshell
• Mmusculus_gene_ensembl has 1 input and 7 outputs
• Each output has the results of iterating over each gene in the QTL
• The ‘CreateReport’ beanshell allows iteration 1 from each result to be collected together, followed by iteration 2 etc, so that they can be examined at a later date
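What the shim does is essentially a transposition: seven lists indexed by gene become one record per gene. The real shim is a Beanshell (Java-like) script; the Python below is only a sketch of the idea, and the port names and values in the example are invented.

```python
def create_report(outputs):
    """Regroup per-port result lists into one record per iteration.

    outputs: port name -> list with one entry per gene, in the same order.
    """
    names = list(outputs)
    # zip(*lists) pairs up entry 1 of every port, then entry 2, and so on.
    rows = zip(*(outputs[name] for name in names))
    return [dict(zip(names, row)) for row in rows]


report = create_report({
    "gene_id": ["g1", "g2"],       # hypothetical port names and values
    "chromosome": ["17", "17"],
})
print(report)
```

Note that `zip` stops at the shortest list, so a port with a missing entry would silently shorten the report; a production shim would want to check that all lists have the same length.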
Local Java Processor
Split_by_regex
Splits a list of gene identifiers into single gene identifiers for pathway analyses
The list of genes produced from the QTL analysis is presented as one per line
This shim splits this file at every new line so that each gene identifier has its own file for further analysis
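The behaviour described here is a regular-expression split on line breaks. A rough Python equivalent is shown below; it is not the Taverna local worker itself, and dropping empty entries (for example after a trailing newline) is our assumption about what is wanted downstream.

```python
import re


def split_by_regex(text, pattern=r"\r?\n"):
    """Split text at every match of pattern, dropping empty pieces."""
    return [part for part in re.split(pattern, text) if part]


genes = split_by_regex("g1\ng2\r\ng3\n")
print(genes)
```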
Before Workflows
• Genotype and phenotype correlations are difficult - fragmented data held in numerous data resources
• Error prone mappings between resources
• Lots of repetitive operations
• Scientists would choose most likely candidates for further investigation based on prior knowledge or experience – potentially missing important correlations
Workflow Results
Trypanosomiasis resistance
A strong candidate gene was found – the Daxx gene, not found using manual investigation methods
– The gene was identified from analysis of biological pathway information
– Possible candidate identified by Yan et al (2004): Daxx SNP info
– Sequencing of the Daxx gene in the wet lab showed mutations that are thought to change the structure of the protein
– The mutation was published in the scientific literature, noting its effect on the binding of Daxx protein to p53 protein – p53 plays a direct role in cell death and apoptosis, one of the Trypanosomiasis phenotypes
Conclusions
• Automation has allowed a systematic analysis of a large data space
• The expression of genes and their pathways can be investigated with no prior knowledge – bias from prior knowledge is not introduced into the experiment
• The workflow is a permanent record of the experimental methods used – increasing the reproducibility and providing a starting-point for future modifications and additions to the protocols
An Automatic Annotation Pipeline
• Genome annotation pipelines – workflow assembles evidence for predicted genes / potential functions
• Human expert can ‘review’ this evidence before submission to the genome database
Collaboration with the Bergen Center for Computational Science (computational biology unit) – Gene Prediction in Algal Viruses, a case study. Presented at NETTAB 2005: http://www.nettab.org/2005/docs/NETTAB2005_LanzenPoster.pdf
User Interaction Handling
• Interaction Service and corresponding Taverna processor allows a workflow to call out to an expert human user
• Used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline
• To set this up refer to Taverna project site
Collaboration with the University of Bergen
Ref: Poster, Nettab 2005
Iteration
• Repeated application of a process to multiple data items
• When a processor is given a list as input, the enactor engine invokes the processor once per item and collates the results into a new list.
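A toy model of this behaviour, written in Python rather than taken from the Taverna enactor: if the value handed to a processor is a list, apply the processor to each item and collect the results; otherwise apply it once. The function name `invoke` is ours.

```python
def invoke(processor, value):
    """Apply processor to a single item, or to every item of a list."""
    if isinstance(value, list):
        # Recursing handles nested lists, giving a result of the same shape.
        return [invoke(processor, item) for item in value]
    return processor(value)


single = invoke(str.upper, "acgt")
many = invoke(str.upper, ["acgt", "ttga"])
nested = invoke(len, [["a", "bb"], ["ccc"]])
print(single, many, nested)
```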
Conditional Branching
Current Workflow Issues
• Web Service Stability
• Workflow discovery and reuse
• Experimental design
• Workflow implications
Web Service Stability
• Distributed computing
– Users do not normally own the services they use
– Workflow system providers do not usually own services
• What can we do when a service fails?
– Find an alternative, add an alternative to invoke automatically
– Allow users to rank services by their reliability
What about when a service fails?
• Most services are owned by other people
• No control over service failure
• Some are research level
Workflows are only as good as the services they connect!
• To help, Taverna can:
– Notify failures
– Instigate retries
– Set criticality
– Substitute alternative services
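Retries and substitution of alternates follow a simple pattern, sketched here in Python. This is not how Taverna implements it; the function `call_with_fallback`, its parameters and the two stand-in services are assumptions made for illustration.

```python
import time


def call_with_fallback(services, data, retries=2, delay=0.0):
    """Try each service in order, retrying each before moving to the next."""
    errors = []
    for service in services:
        for _attempt in range(retries + 1):
            try:
                return service(data)
            except Exception as exc:  # notify: keep a record of each failure
                errors.append(exc)
                time.sleep(delay)
    raise RuntimeError(f"all services failed after {len(errors)} attempts")


def flaky_primary(sequence):
    # Stand-in for a remote service that is currently down.
    raise IOError("service unavailable")


def backup(sequence):
    # Stand-in for an alternative service offering the same operation.
    return sequence.upper()


result = call_with_fallback([flaky_primary, backup], "acgt", retries=1)
print(result)
```

Criticality could be modelled by letting the caller decide whether the final `RuntimeError` stops the whole workflow or only the affected branch.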
Technology Stability
• Web services are gaining in popularity, most major bioinformatics service providers supply a web service interface to their resources
• The Web Service Description Language (WSDL) is currently a recommendation candidate with the W3C (World Wide Web Consortium).
Workflow Discovery and Reuse
• Workflows are useful experimental artefacts
• Reusing or repurposing existing workflows can save time and effort and should result in the perpetuation of ‘good’ workflow design
• Workflow fragments are also useful for starting new experiments
Workflows are hard work
• Often complex.
– Need intelligent steering and analysis.
– Need explanations to ensure used properly and safely.
• Challenging and expensive to develop.
– Development assistance. Don’t start from scratch.
– Take a long time to build good ones & a lot of know-how.
• You can still build crap workflows
– Enable scientists to be scientists, not programmers.
– Enable scientists to be creative yet sound.
Workflows are commodities
• Valuable first class assets in their own right.
– To be pooled and shared and traded and reused
– Within communities and across communities
– Of pieces, of wholes, of when and how to
– Pattern books. Validated community workflow packs
– Publish and review
• Enable Mediocre Scientists to do the mundane just as much as the Great and Good.
– Conservation of Work principle
• But…. Reusability is often confined to the project in which the workflow was conceived. Social and technical challenges for sharing and reuse.
Taverna:Record, Reuse, Recycle, Repurpose
• Trichuris muris - the mouse whipworm
• Trypanosomiasis cattle workflow reused without change over a new dataset
• Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.
• A manual two year study of candidate genes had failed to do this.
Paul Fisher et al A Systematic Strategy for Large-Scale Unbiased Analysis of Genotype-Phenotype Correlations Bioinformatics in review
Workflow Reuse – Workflows are Scientific Protocols – Share them!
Addisons Disease
SNP design
Protein annotation
Microarray analysis
myGrid Workflow Repository
http://workflows.mygrid.org.uk/repository
A workflow marketplace
Started February 2007
A community social network for sharing workflows
A gateway to other publishing environments
A platform for launching workflows
Soon you will be able to see a workflow and just launch it – not just Taverna, but others like Kepler, Triana
Experimental Design
• Workflows can generate large amounts of data
• Gathering data is no longer a problem – but how do you analyse such large numbers of files and high volumes?
– Visualisation
– Building data models
– Populating data schemas
Experimental Design - Visualisation
Other tools can use workflows as background processes
• UTOPIA is a visualisation tool for DNA and proteins
• UTOPIA displays workflow results in an interactive way, allowing scientists to explore their data and initiate further additional experiments
UTOPIA
[Figure: workflow system architecture. Labels: Design GUI; Portal; Applications; Middleware; WF Execution Engine; Results; Provenance Warehouse; Workflow Warehouse; Service / Component Catalogue; Resource Information; Resources; Datasets; Services; Provenance.]
http://www.kooprime.com
Utopia
Experimental Design – Building Data Models
• Generate EMBL records as a step in the workflow
• Generate GFF models as a step in the workflow
• Results are now just one file
Experimental Design – Populating Data Schemas
• SBML (Systems Biology Mark-up Language) is the standard format for generating and sharing systems biology data
• A Taverna plug-in (using libSBML) allows SBML models to be consumed or produced from workflow experiments
MCISB Case Study
• Manchester Centre for Integrative Systems Biology– Case study for their informatics infrastructure
• Superimposing array data onto pathway maps using Taverna workflows
[Peter Li, Doug Kell, 2006]
MCISB
Study Involves
• Combining public and local data
• Gathering, creating and storing data in SBML models
• Visualising results using pre-existing SBML-based tools
Why?
• To view transcriptome data from the context of pathways
• To see the effects of up/down regulation on pathways
Pathway maps are first saved as SBML models using bespoke SBML software - CellDesigner
Glycolysis pathway
MaxD
Read gene names of enzymes from SBML Cell Designer model
Query Big Expt data in MaxD using gene names
Calculate colour of enzyme nodes based on mRNA expression levels
Create new SBML model
Transcriptome pathway workflow
libSBML using the API Consumer
• Building the SBML model was possible because of Taverna’s extensibility
• Did not have to write new services – just used the API consumer to add SBML library
• API consumer can be downloaded from the myGrid website
JC_C-0.07-1_Measurement JC_N-0.07-1_Measurement
Decreased levels of GPM3
New SBML models viewed using Cell Designer
Results
• Visualisation of one data source over another by the building of a common model during the workflow run
• Promotes re-use by providing a simple interface for interpretation by the scientist
• Offers a proof of principle for the overlay of other omics data:
– Proteomics data from PRIDE
– Metabolomic data from MEMO
Transparency
Provenance
• Who, What, Where, When, Why?, How?
• Context
• Interpretation
• Logging & Debugging
• Reproducibility and repeatability
• Evidence & Audit
• Non-repudiation
• Credit and Attribution
• Credibility
• Accurate reuse and interpretation
• Smart re-running
• Cross experiment mining
• Just good scientific practice
Smart Tea
BioMOBY
Provenance
Workflow experiments can span several months, re-running the same workflow or comparing the results of several different workflows
Scientists need to record:
• Who performed the experiment and when
• What the experiment was
• What services were invoked
• What the final and intermediate results were
Some workflow systems enable this process provenance to be collected automatically – for Taverna, we have the myGrid LogBook
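The four items scientists need to record map naturally onto a small data structure. The sketch below is an illustration in Python, not the LogBook schema; the class names, field names and example values are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProcessRun:
    service: str   # what service was invoked
    inputs: dict   # intermediate data going in
    outputs: dict  # intermediate or final results coming out


@dataclass
class WorkflowRun:
    workflow: str      # what the experiment was
    experimenter: str  # who performed it
    started: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))  # and when
    process_runs: list = field(default_factory=list)

    def record(self, service, inputs, outputs):
        """Log one service invocation as it happens."""
        self.process_runs.append(ProcessRun(service, inputs, outputs))

    def services_invoked(self):
        return [run.service for run in self.process_runs]


run = WorkflowRun(workflow="example_workflow", experimenter="A. Scientist")
run.record("BLAST @ NCBI", {"query": "acgt"}, {"hits": ["hit1"]})
print(run.services_invoked())
```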
[Figure: the provenance model. Organisation provenance: Experimenter, Organisation (relations: belongsTo, launchedBy). Workflow provenance: Workflow, Workflow run, Process, ProcessRun, ProcessIteration (relations: runsWorkflow, runsProcess, hasProcesses, executesProcessRun, iteration, hasInput, workflowOutput), e.g. a web service invocation of BLAST @ NCBI. Data/knowledge provenance: Data, Atomic Data, Data Collection (relations: isA, containsData, createdBy, derivedFrom) and knowledge statements, e.g. similar_sequence_to.]
Workflow Implications
• Workflows generate lots of data in a model that is easy to understand
• More and better resources are available to more scientists
• Changes in communication between laboratory scientists and in silico scientists
• Changes in work practices towards hypothesis generation
• Workflows can be shared, published, verified, repurposed and reused and scientists should always receive the credit for their creation.
Taverna Summary
• Automation
• Implicit iteration
• Implicit parallelisation
• Support for nested workflow construction
• Error handling
– Retry, failover and automatic substitution of alternates
Extensibility
• Accepts many types of services: web services, beanshell scripts, local java scripts, JDBC connections…etc
• Easy to add your own services
• Plug-in architecture
– Easy to build new processor types
– Easy to extend to include alternative results viewers
More information…
Taverna http://taverna.sourceforge.net
myGrid http://www.mygrid.org.uk
OMII-UK http://www.omii.ac.uk
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
OMII-UK
• Tom Oinn, Daniele Turi, Katy Wolstencroft, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble
Research
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan
Current contributors
• Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people
User Advocates and their bosses
• Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell
Past Contributors
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson and Chris Wroe.
Industrial
• Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica
Funders
• EPSRC, Wellcome Trust