Taverna and myGrid
A solution for confusion intensive computing?
Tom Oinn – EMBL-EBI, tmo@ebi.ac.uk
http://mygrid.org.uk http://taverna.sf.net

Upload: phillip-bryan

Post on 26-Dec-2015


Page 2: Taverna and myGrid: A solution for confusion intensive computing? Tom Oinn – EMBL-EBI, tmo@ebi.ac.uk http://mygrid.org.uk http://taverna.sf.net

Who are we?

myGrid
  An EPSRC funded ‘eScience Pilot Project’
  Based across multiple sites in the UK

Taverna
  A tethered spin-off of the myGrid project
  Aimed at producing powerful tools to complement the basic research work

EBI Hinxton Campus

Page 3

What is Taverna?

Allows scientists to graphically construct complex processes in the form of workflows

What is a workflow?
  Set of activities that make up a process
  Definitions about how data moves between these activities
  The user specifies what to do but not how to do it
  Insulates users from the complexity of distributed computing
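The definition above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Taverna represents or enacts workflows: the three activities, the `links` table, the `run_workflow` helper and the accession string are all hypothetical, and plain local functions stand in for remote services.

```python
from graphlib import TopologicalSorter

# Hypothetical activities: local functions standing in for remote services.
def fetch_sequence(accession):
    return "acgt" * 3  # placeholder data, not a real database lookup

def to_upper(sequence):
    return sequence.upper()

def count_bases(sequence):
    return len(sequence)

# The workflow states WHAT to run and how data moves between activities.
# It does not state the execution order; that is derived from the links.
activities = {"fetch": fetch_sequence, "upper": to_upper, "count": count_bases}
links = {"upper": "fetch", "count": "upper"}  # consumer -> producer

def run_workflow(activities, links, workflow_input):
    # Map each activity to the set of activities it depends on.
    graph = {name: {links[name]} if name in links else set() for name in activities}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        # An activity with no producer reads the workflow input directly.
        argument = results[links[name]] if name in links else workflow_input
        results[name] = activities[name](argument)
    return results

results = run_workflow(activities, links, "X00000")  # made-up accession
```

The user-facing part is just the two tables; the ordering logic lives in the enactor, which is the sense in which the user says what to do but not how.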

Page 4

Looks a bit like this…

Page 5

myGrid, Taverna and WBS

One of several early adopters of Taverna
Manchester based group working on Williams-Beuren Syndrome in the medical genetics department
Workflows written by life scientists not computer scientists
Following slides stolen at the last minute from Hannah Tipney at Manchester!

Page 6

Williams-Beuren Syndrome (WBS)

Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
Haploinsufficiency of the region results in the phenotype
Multisystem phenotype – muscular, nervous, circulatory systems
Characteristic facial features
Unique cognitive profile
Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)
Outgoing personality, friendly nature, ‘charming’

Page 7

Williams-Beuren Syndrome Microdeletion

[Figure: physical map of chromosome 7 (~155 Mb) with the ~1.5 Mb region at 7q11.23 expanded.]

Block labels: Block A, Block B, Block C, with copies marked C-cen, A-cen, B-cen, C-mid, B-mid, A-mid, B-tel, A-tel, C-tel
Gene labels: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14, STAG3, PMS2L, FKBP6T, GTF2IP, NCF1P, GTF2IRD2P
Clone labels: CTA-315H11, CTB-51J22
Other labels: Gap, Physical Map

Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164

Page 8

[Figure: flow diagram of the sequence analysis steps. The arrows were lost in extraction; labels are grouped by stage, with each tool next to the description that matches what it does.]

Input
  GenBank Accession No, GenBank Entry, Seqret, Nucleotide seq (Fasta), URL inc GB identifier

Nucleotide sequence
  GenScan: Coding sequence, ORFs
  sixpack: 6 ORFs
  restrict: Restriction enzyme map
  cpgreport: CpG Island locations and %
  RepeatMasker: Repetitive elements
  prettyseq: Translation/sequence file. Good for records and publications
  ncbiBlastWrapper: Blastn Vs nr, est databases
  transeq: Amino Acid translation

Protein sequence
  epestfind: Identifies PEST seq
  pscan: Identifies FingerPRINTS
  pepstats: MW, length, charge, pI, etc
  pepcoil: Predicts Coiled-coil regions
  SignalP, TargetP, PSORTII: Predicts cellular location
  InterPro: Identifies functional and structural domains/motifs
  Pepwindow? Octanol?: Hydrophobic regions
  BlastWrapper: tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr

Overlapping sequence
  Query nucleotide sequence, RepeatMasker, BLASTwrapper, Sort for appropriate Sequences only

Regulatory elements
  RepeatMasker, TF binding Prediction, Promoter Prediction, Regulation Element Prediction: Identify regulatory elements in genomic sequence

Experiment

Page 9

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Analysis via ‘Cut and Paste’

Page 10

Workflows

[Figure: three workflow diagrams, panels A, B and C.]

A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence

Page 11

The Biological Results

[Figure: map of the extended region. Labels: CTA-315H11, CTB-51J22, ELN, WBSCR14, RP11-622P13, RP11-148M21, RP11-731K22.]

314,004 bp extension
All nine known genes identified (40/45 exons identified): CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28
Four workflow cycles totalling ~10 hours
The gap was correctly closed and all known features identified

Page 12

And Now… Pretty Pictures

The first thing users see…

Page 13

BioMoby (orange), Soaplab (wheat), Workflow (red), SOAP Service (green), SeqHound (blue), Local Java operation (purple), String constant (pale blue)

Different service types, unified.

Page 14

Launching a workflow…

Page 15

Invocation progress…

Page 16

Browsing the results…

Page 17

Results in context…

Page 18

Integration Epochs

1. Databases / Data warehouses
  Integration of data
2. Distributed Queries, Workflows
  Integration of process
3. Semantic Unification
  Integration of knowledge

Current state of the art somewhere around 2.5, what do we need to do next?

Page 19

Last Year’s Problems

Multiple data sources
  SOA approaches, distributed queries e.g. OGSA-DAI
Heterogeneous computational resources
  SOA combined with workflow methods
  Toolkits widely used and deployed e.g. Soaplab, BioMoby et al.
As a community we can provide data and compute services, and are doing so.

Page 20

Yesterday’s Problems

Usability
  Distributed computing and biologists go together like water and mains electricity
  Graphical workflow environments now exist e.g. Taverna, Triana, Discovery-Net, Ptolemy…
  Can be improved upon but basically usable by the target audience of expert researchers.

Page 21

Concept
  Workflows, SOA and friends are now accepted as a legitimate way of doing things
  Methods have moved from the ‘out there’ research world to just inside the common scientific toolbox

Page 22

Functionality
  Integration of BioMoby, EMBOSS, SOAP services, command line tools, SeqHound, Web CGIs and others on demand
  Fault tolerance and reporting
  Enactment of complex process flows
  Some service discovery (crude but surprisingly effective)
  Available and widely used (>2500 downloads of Taverna from http://taverna.sf.net)

Page 23

Current Work

Service Discovery
  Doing it properly – semantic registry technology
  Ontologies for services, data etc.
  Annotating the corpus of services with metadata
Data management
  Putting data in context within the scientific process
  Managing the new bursts of data from workflow systems

Page 24

So Where’s This Confusion Then?

At the moment, invoking a workflow gives results equivalent to a big set of files
Files are data, what we want is knowledge
Confusion is formed from data and banished by the conversion of that data into knowledge
This is the problem for Today, Tomorrow and beyond!
So, what are we going to do about it next?

Page 25

Some Types of knowledge in myGrid and Taverna

Data to Context Knowledge
  Which operation produced the data?
  Which workflow defined the operation?
  When, Where and Who?
  Workflow design and enactment!
Data to Data Knowledge
  Relate operation inputs and outputs
  Base ‘derived from’ relation in RDF
  Can be specialized through templates
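A rough sketch of the 'derived from' idea, using plain Python tuples as stand-ins for RDF triples rather than a real RDF library. The relation names (`derivedFrom`, `producedBy`, `translationOf`), the `SPECIALISES` table and the data item names are all assumptions made for illustration; the slides do not say how myGrid names or stores these relations.

```python
DERIVED_FROM = "derivedFrom"
# Hypothetical specialised relations and the base relation each one refines.
SPECIALISES = {"translationOf": DERIVED_FROM}

def record_invocation(triples, operation, inputs, outputs, relation=DERIVED_FROM):
    """Relate every output of one operation to the operation and to each input."""
    for output in outputs:
        triples.add((output, "producedBy", operation))
        for item in inputs:
            triples.add((output, relation, item))

def is_derivation(relation):
    """True for the base relation and for anything registered as refining it."""
    return relation == DERIVED_FROM or SPECIALISES.get(relation) == DERIVED_FROM

def ancestors(triples, item):
    """Follow derivation links transitively back towards the workflow inputs."""
    found = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in triples:
            if subject == current and is_derivation(relation) and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found

triples = set()
record_invocation(triples, "seqret", ["accession"], ["fasta"])
record_invocation(triples, "transeq", ["fasta"], ["protein"], relation="translationOf")
lineage = ancestors(triples, "protein")
```

Because the specialised relation is registered against the base one, a query over plain derivation still finds links recorded with the more specific name.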

Page 26

Context to Context Knowledge
  Common information model shared across components
  Encapsulates organizations, people, experiment designs, instances and results.
  Equivalent to an overall eScience file system
In Silico eScience ‘Materials and Methods’
  Expressed in terms of workflow definitions within Taverna

Page 27

The eScience Knowledge Gap (one of them anyway!)

Hypothesis is missing!
  Without some specification of the hypothesis which the experiment is designed to test we cannot do much more than the forms of knowledge stated previously.
Hypothesis as part of the Process Model?
  Can we define the hypothesis as the population of a domain and experiment specific data model in combination with a set of statements about instances of this model?
  How would this fit in with the current workflow centric approach we’re taking?

Page 28

But Domain Modeling is Hard

Do we need to model the entire domain?
  Derive an experiment specific model by either creating from scratch or aggregating fine grained ‘Atomic Domain Models’
  Examples – Sequence + Features, GO Term Graph, Metabolic Pathway, Protein Interaction Set
For example, if the hypothesis is ‘proteins annotated with GO term xxx or children by InterPro scan are implicated in pathway zzz’
  Aggregate target domain model consists of the combination of these Atomic Domain Models.
  Hypothesis statement in the form of this model + query over the model topology which returns the proportion of proteins in the model satisfying the hypothesis constraint.
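One possible reading of the GO term example, as a small Python sketch. Everything here is invented for illustration: the three dictionaries stand in for the GO Term Graph, the InterPro-derived annotations and the Metabolic Pathway models, and the identifiers are placeholders. The sketch also assumes that the "proportion" means the fraction of proteins annotated with the term (or a child term) that also appear in the pathway; the slide does not pin this down.

```python
# Three made-up Atomic Domain Models combined into one target model.
go_children = {"GO:xxx": ["GO:child1", "GO:child2"], "GO:child1": [], "GO:child2": []}
annotations = {
    "P1": {"GO:xxx"},
    "P2": {"GO:child1"},
    "P3": {"GO:other"},
    "P4": {"GO:child2"},
}
pathway_members = {"zzz": {"P1", "P2", "P3"}}

def term_and_descendants(term, children):
    """Collect a GO term together with all of its child terms."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen

def hypothesis_support(term, pathway):
    """Fraction of proteins annotated with the term (or a child) found in the pathway."""
    terms = term_and_descendants(term, go_children)
    candidates = {p for p, annotated in annotations.items() if annotated & terms}
    if not candidates:
        return 0.0
    return len(candidates & pathway_members[pathway]) / len(candidates)

support = hypothesis_support("GO:xxx", "zzz")
```

With this toy data three proteins carry the term or a child, and two of them are pathway members, so the query reports two thirds.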

Page 29

Populating the Target Domain Model

Workflows are based on the composition of distributed services
Can we derive services from the Target Domain Model?
  For example, the Sequence + Features model would manifest a setFeature(start, end, sequence, feature) operation or similar.
  Allow the user to incorporate these operations into the workflow alongside the regular services, effectively annotating the workflow.
  Make use of existing Data to Data Knowledge and Data to Context Knowledge to link entities within the Target Domain Model with derivation information.
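A sketch of what a model-derived operation might look like. The slide names only the `setFeature(start, end, sequence, feature)` signature; the class names, the extra `derived_from` argument and the run identifiers below are assumptions added to show how derivation information could be attached at the same time.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    start: int
    end: int
    kind: str
    derived_from: str  # hypothetical id of the workflow data item behind this feature

@dataclass
class SequenceFeatureModel:
    # sequence id -> list of Feature
    features: dict = field(default_factory=dict)

    def set_feature(self, start, end, sequence, feature, derived_from="unknown"):
        """Model-side operation a workflow step could call like any other service."""
        if start > end:
            raise ValueError("start must not exceed end")
        self.features.setdefault(sequence, []).append(
            Feature(start, end, feature, derived_from)
        )

model = SequenceFeatureModel()
# Made-up coordinates and run ids, standing in for real workflow outputs.
model.set_feature(120, 480, "seq1", "CpG island", derived_from="cpgreport:run1")
model.set_feature(900, 1500, "seq1", "repeat", derived_from="RepeatMasker:run1")
```

Calling the operation from inside a workflow is what turns an anonymous output file into a typed entry in the model, with a pointer back to the step that produced it.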

Page 30

Data Transformed to Knowledge

A workflow invocation would now result in a populated domain model as opposed to (or in addition to) a large set of discrete pieces of data.
Explicit semantics in the Target Domain Model
  Drive hypothesis testing
  Drive visualization in a graphical UI
  Generate textual summary of the knowledge
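As a small illustration of the last point, a populated model can be turned into a textual summary with very little code, because the model already says what each value means. The `populated_model` layout and the `summarise` helper are hypothetical, and the coordinates are invented.

```python
# Hypothetical populated model: sequence id -> list of (start, end, feature kind).
populated_model = {
    "seq1": [(120, 480, "CpG island"), (900, 1500, "repeat")],
    "seq2": [],
}

def summarise(model):
    """Produce one summary line per sequence from the populated model."""
    lines = []
    for sequence in sorted(model):
        features = model[sequence]
        if not features:
            lines.append(f"{sequence}: no features recorded")
            continue
        described = ", ".join(f"{kind} at {start}-{end}" for start, end, kind in features)
        lines.append(f"{sequence}: {len(features)} feature(s): {described}")
    return "\n".join(lines)

summary = summarise(populated_model)
```

The same structure could feed a graphical view or a hypothesis query; a loose set of result files offers no such handle.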

Page 31

myGrid and WBS People!

Core
  Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
  Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK
  Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
  Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
  Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
  Robin McEntire (GSK)
Collaborators
  Keith Decker

Page 32

Acknowledgements

myGrid is an EPSRC funded UK eScience Program Pilot Project

Particular thanks to the other members of the Taverna project, http://taverna.sf.net