isa tools presentation

Eamonn Maguire, Lead Software Engineer

Novartis, 21st October 2011

The ISA software suite

[email protected]

Tuesday, 8 November 2011



Who am I?


it’s rhetorical...

Irish

Formal background is Computer Science (Bachelors) and Bioinformatics (Masters)

Lead software engineer on the ISA project

DPhil Student at Oxford in Visualization in the Dept. of Computer Science

Have my own graphic design company (Antarctic Design)

Part of a small but productive and vibrant team at Oxford headed by Susanna-Assunta Sansone.

Our work includes the ISA tools/infrastructure, MIBBI & BioSharing.


What is ISA all about?


We want to enable better reporting of experiments...

We want to make it easier for submitters...

We want to provide tooling which biologists will want to use...


What’s the problem?


Could be beans. Could be peas. Could be soup.

Analogy time. Each can is an experiment. We have no labels, so no indication about what is in the can.

In biology, things aren’t quite as bad as this. We have some labels, but they aren’t all in the same language. What do I mean by this? Well...

1. there is fragmentation: the formats used to describe experiments are different, e.g. MAGE-Tab, PRIDE-ML, SRA-XML, but in essence they capture much of the same information; and

2. the terminologies used to describe experiments are different, even though many concepts, such as sample description, are shared. This applies to field names as well as values...

Tin can analogy borrowed from Norman Morrison & converted from ontologies to metadata transfer standards.









可能是豌豆 ("Could be peas") - a different representation...non-Latin language












Might be petit pois - a different terminology






Can you imagine having to translate everything you write into a different language in order to submit your data?





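你能想象有翻译成不同的语言编写的一切，以提交您的数据吗？即使转换工具，像谷歌，翻译弄错了。 (Chinese for the question above, followed by: "Even conversion tools, like Google Translate, get it wrong.")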






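An féidir leat a shamhlú go bhfuil gach rud a scríobh tú a aistriú isteach i dteanga eile d'fhonn a chur isteach do chuid sonraí? Fiú uirlisí chomhshó, cosúil le google translate a fháilsé mícheart. (Irish for the same question, again ending: "Even conversion tools, like Google Translate, get it wrong.")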


Take home point...


Repositories are making it difficult for biologists to submit data, and for others to use it. This is particularly true for those performing multi-omic experiments: to submit, say, proteomic and transcriptomic data, one must provide the same general data in two very different formats. Why? Well, people like to have their own formats...plus, ad hoc is easier in general.

Our solution is one general purpose, flexible format, herein referred to as ISA-Tab.

A domain agnostic format to capture experimental metadata in omic experiments (transcriptomic, genomic, proteomic, metabolomic) as well as traditional experiments such as clinical chemistry and histology.

...it works on lots of (I won’t dare say all) types of data...nutrigenomics, toxicogenomics, public health... etc.


Tell me more...


[Figure: the ISA hierarchy. An investigation sits above assay(s), which hold pointers to data file names/locations; the data are external files in native or other formats.]

investigation: high level concept to link related studies

study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.

assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
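To make the three levels concrete, here is a minimal sketch of the hierarchy as plain Python data classes. The class and attribute names are illustrative assumptions, not the ISA-Tab specification or the ISA tools object model; the only thing taken from the definitions above is the nesting and the idea that assays point at data files rather than contain them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A test performed on material from the subject; points at data files."""
    measurement: str
    data_files: List[str] = field(default_factory=list)  # pointers only


@dataclass
class Study:
    """The central unit: the subject, its characteristics and treatments."""
    subject: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level concept linking related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def data_files(self) -> List[str]:
        """Collect every data file pointer reachable from this investigation."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


inv = Investigation(
    title="Example investigation",
    studies=[Study(
        subject="H1",
        assays=[Assay("transcription profiling", ["h1-s1.cel", "h1-s2.cel"])],
    )],
)
print(inv.data_files())
```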

Biologists like tab. They don’t like XML.

Through basic inference...ISA-Tab is good :)


But we don’t want to do this...


http://xkcd.com/927/




A format on its own isn’t very much though...


Too true...the secret to adoption is to provide the tooling to enable biologists to get data into the format, share it, convert and analyse it!

The ISAtools provide this tool support.


The ISA tools


& others being developed by the ISA community...

PERL Parser for ISA by Bob MacCallum and Python Parser for ISA by Brad Chapman

isacreator converter

Developed on top of the ISA-Tab format...modular, configurable, open source, Java based*

*apart from the R, PERL and Python packages of course...


The ISA tools... modular


Create: Experimentalist uses editor to report investigation. (isacreator)

Configure: Curator creates template

Validate: Check adherence to template

Convert to ISA: Convert from MAGE-Tab to ISA-Tab. More formats coming soon... (converter)

Convert from ISA: Convert to MAGE-TAB, PRIDE-ML, SRA-XML for submission to international public repositories (converter)

Load: Curator stores metadata in database using BII data management tool

Browse: Users browse investigations, query and view experimental metadata, and access associated data files

Analyze: Perform analysis of data in context with the metadata using the Galaxy or R analysis engines.

[Legend: Requires Configuration XML]



Are you just using buzz words? Well we like buzz words as much as everyone else, but no.

We need to be configurable to support evolving checklists and requirements. Just check out mibbi.org, lots of checklists! 32 in fact at the last count.

MIBBI is trying to harmonise these checklists to reduce redundancy and make them interoperable.

The ISA tools... configurable


Checklists...what are they?


When we report things, there are some things which are really important.

In a school report, we have the child’s name, their class, teacher, subjects taken and so on.

Well, in a biological experiment, the very same principles apply. We need information about the sample (species, strain, age) and information about the protocols applied during the experiment and subsequent parameters.

We have 32 checklists at present because there are differences in what is deemed important depending on the experiment being performed.

Good reporting means that statistics can be applied better, experiments can be reproduced more easily, and data mashups can occur in the future.

Experiments are expensive, we should make sure that their full value is realised.


On this point...


Helping to demystify the unwieldy world of standards...

Find out what standards are out there...MI Checklists, ontologies and formats, plus what domains they are suited to...

Find out about data sharing policies, from NIH for example.


Configurable...back to that


So, our infrastructure is built upon XML files. These are created by the ISAConfigurator.

A configuration XML file describes the fields (or checklist) required to describe a particular experiment!

We need to support lots of different checklists, and it should be easy for people to change their requirements should they need to....


ISAconfigurator


The brick maker...a kiln The bricks...

Configuration XML



Create configuration xml files


The ISAconfigurator...



The configuration xml...


This is an example of a field definition created by the configurator. In this instance we are describing a label field, in particular, one used to describe the label used in a microarray experiment.

We have defined it to come from an ontology, and we recommend the ChEBI ontology. It is also required.
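As a rough illustration of what such a field definition carries, here is a sketch that parses a made-up configuration snippet. The element and attribute names in `CONFIG_XML` are assumptions chosen for readability; they are not the real ISAconfigurator schema, which is defined by its own XSD.

```python
import xml.etree.ElementTree as ET

# Hypothetical field definition; element and attribute names are assumptions,
# not the actual ISAconfigurator schema.
CONFIG_XML = """
<configuration>
  <field header="Label" data-type="Ontology term" is-required="true">
    <description>Label used in a microarray experiment</description>
    <recommended-ontology abbreviation="CHEBI"/>
  </field>
</configuration>
"""


def load_fields(xml_text):
    """Return one dict per <field> element in the configuration."""
    root = ET.fromstring(xml_text)
    fields = []
    for el in root.findall("field"):
        fields.append({
            "header": el.get("header"),
            "data_type": el.get("data-type"),
            "required": el.get("is-required") == "true",
            "ontologies": [o.get("abbreviation")
                           for o in el.findall("recommended-ontology")],
        })
    return fields


fields = load_fields(CONFIG_XML)
print(fields[0])
```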



The configuration xml is an important part of the infrastructure and is utilised in various components in differing capacities.

isacreator

Used in content validation, but its main purpose here is to build the user interface...more on this later.

Used in content validation. The validation component is also called in the ISAconverter and BII data manager before conversion and loading respectively.

Aside from strong ontology support, the configuration xml also allows for specification of regular expressions which field contents should match, to specify if a field is an integer, double, list value, boolean, string or a field which should accept a file location...
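A sketch of how per-field checks like these could work, assuming each field carries a data type, a required flag and an optional regular expression. The type names and the `validate` function are hypothetical and only mirror the list above; the real validation component is written in Java.

```python
import re

# Hypothetical checks keyed by data type name; the names mirror the types
# listed above but are not the real configuration vocabulary.
def _is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_double(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


TYPE_CHECKS = {
    "integer": _is_int,
    "double": _is_double,
    "boolean": lambda v: v.lower() in ("true", "false"),
    "string": lambda v: True,
}


def validate(value, data_type="string", required=False, pattern=None,
             allowed=None):
    """Return a list of problems found for one cell value (empty if valid)."""
    problems = []
    if value == "":
        return ["value is required"] if required else []
    if data_type == "list":
        if value not in (allowed or []):
            problems.append("value not in allowed list")
    elif not TYPE_CHECKS[data_type](value):
        problems.append("value is not a valid " + data_type)
    if pattern is not None and re.fullmatch(pattern, value) is None:
        problems.append("value does not match " + pattern)
    return problems
```

For example, `validate("35", "integer")` returns an empty list, while `validate("", required=True)` reports that the value is required.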

The configuration xml...



isacreator: Create & Edit ISA-Tab



isacreator

Developed to be a user-friendly way to enter standards-compliant metadata: it has lots of features...

spreadsheet-like interface

automated ontology tagging (powered by NCBO Annotator)

QR code generator

publication searcher

ontology search

visualization

file chooser

But these are just some of them...we also have a data entry wizard and an import utility...

The ISAcreator...


Use of the configuration xml


The configuration xml schema (XSD) is consumed by an XMLBeans goal in Maven. This creates Java stubs, which are then used to load the XML files into memory.

The configuration is also used to define the form view using a similar mechanism....

<xml><field>sample</field><field>protocol ref</field><field>extract name</field><field>label</field>...</xml>

Java Object: TableReferenceObject

XML definition(s): Import into Java Object Model using classes created by XMLBeans

Construct spreadsheet model. Columns, rows, etc.

Assign cell editors. Ontology terms are given the ontology selection tool as a cell editor, file fields are given a file chooser etc.
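The last two steps can be sketched in a few lines. This is an assumption-laden illustration only: `FieldDef`, the data type strings and the editor names are invented here, and ISAcreator's real editors are Java components built from the TableReferenceObject.

```python
from dataclasses import dataclass


@dataclass
class FieldDef:
    header: str
    data_type: str  # e.g. "ontology term", "file", "string"


# Hypothetical editor names standing in for the real Java cell editors.
EDITORS = {
    "ontology term": "OntologySelectionTool",
    "file": "FileChooser",
}


def build_columns(field_defs):
    """Build the spreadsheet column model: (header, editor) pairs in order."""
    return [(f.header, EDITORS.get(f.data_type, "TextEditor"))
            for f in field_defs]


defs = [
    FieldDef("sample", "string"),
    FieldDef("protocol ref", "string"),
    FieldDef("extract name", "string"),
    FieldDef("label", "ontology term"),
]
columns = build_columns(defs)
print(columns)
```

Keeping the editor choice in a lookup table means a new field type only needs one new entry, which is the kind of flexibility the configuration-driven approach is after.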


Sounds good...what does it look like?...



Ontologies


We use the NCBO Bioportal and the EBI’s OLS to do searching and browsing on ontologies.

Ontology Resource Manager: The resource manager provides seamless searching of ontology resources, regardless of their origins, their underlying data schema or the mechanism (REST, SOAP or local file store) through which they are accessed.

NCBO BioPortal

Ontology Lookup Service (OLS)

Plugin

Ontology browsing & searching

Ontology tagging

Search, Hierarchy and Annotator services

Ontology field restriction

ISAcreator manages ontology metadata such as version information as well as individual term accessions, source, uri and so forth.

Ontology search code is usable outside of ISAcreator. In fact, the ISAconfigurator imports ISAcreator as a Maven dependency and reuses its components to do ontology restriction...plugins can also make use of our ontology search and browse functionalities
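The idea behind the resource manager can be sketched as a small facade: every source is wrapped as a search function, and the manager merges the results while recording where each term came from. Everything below is hypothetical, including the class name, the fake sources and the accessions; real lookups would go through BioPortal or OLS web services.

```python
from typing import Callable, Dict, List


class OntologyResourceManager:
    """Fan a query out to every registered source and merge the results."""

    def __init__(self):
        self._sources: Dict[str, Callable[[str], List[dict]]] = {}

    def register(self, name, search_fn):
        self._sources[name] = search_fn

    def search(self, query):
        results = []
        for name, search_fn in self._sources.items():
            for term in search_fn(query):
                # Record where each term came from, whatever the backend.
                results.append({**term, "source": name})
        return results


# Stand-ins for REST, SOAP or local file backed lookups (hypothetical data).
def fake_bioportal(query):
    return [{"label": "liver", "accession": "EX:0001"}] if query == "liver" else []


def fake_ols(query):
    return [{"label": "liver", "accession": "EX:0002"}] if query == "liver" else []


manager = OntologyResourceManager()
manager.register("BioPortal", fake_bioportal)
manager.register("OLS", fake_ols)
hits = manager.search("liver")
print(hits)
```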


Ontologies...some more technical details


How do we browse so quickly without downloading and reasoning over ontologies? (Disclaimer: speed also depends on whether you access OLS/BioPortal from Europe/America.)

Ontologies are all accessed by web services...this part is clear.

But browsing over ontologies, especially those coming from 2 separate resources, in different parts of the world with very different implementations isn’t easy.

To make the browsing experience not so slow and painful, we preload parts of the ontology tree in advance of them being requested by the user.

[Figure: ontology tree preloading in three stages. Ontology loaded, root expanded: root (level 0) plus level(root) + 1. Node a expanded: branch a plus level(a) + 1. Next expansion: branch b plus level(b) + 1.]
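Here is one way the preloading could be sketched: whenever a node is expanded, its children come from the cache and their own children are fetched ahead of time. This is my reading of the stages above rather than the ISAcreator code, and `fetch` with its hard-coded `TREE` is a hypothetical stand-in for a web service call.

```python
class PreloadingTree:
    """Keep the children of every visible node one level ahead of the user."""

    def __init__(self, fetch_children, root):
        self._fetch = fetch_children  # stands in for a web service call
        self.cache = {}
        self.root = root
        self._load(root)              # ontology loaded: fetch level(root) + 1

    def _load(self, node):
        if node not in self.cache:
            self.cache[node] = self._fetch(node)
        return self.cache[node]

    def expand(self, node):
        """Return the node's children and preload their children in turn."""
        children = self._load(node)   # normally already cached
        for child in children:
            self._load(child)         # one level ahead, ready for next click
        return children


# Hypothetical ontology fragment used in place of OLS or BioPortal.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}
calls = []


def fetch(node):
    calls.append(node)
    return TREE[node]


tree = PreloadingTree(fetch, "root")
tree.expand("root")
print(calls)
```

In a real client the preloading would run in the background so the interface never blocks; the sketch does it synchronously to keep the caching logic visible.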


Plugins


Plugins can be developed for 3 different purposes:

Search (adds extra search space for ontology tool)

Custom cell editors (for spreadsheet)

Extra general functionality (which appears in a plugin menu)

In ISAcreator, we use the Apache Felix implementation of the OSGi framework...it’s really good.


Plugins...example


Novartis Metastore Search

Search function on the Novartis Metastore... integrates search results on the metastore in the Ontology search tool.

So, with the Novartis plugin in your Plugin directory, you’ll be able to search the Novartis metastore directly within ISAcreator, and it will handle all the tasks involved with recording term source, etc.



Make sure the ISA-Tab is correct



Checks:

the structure of the ISA-Tab to ensure it’s well formed;

the contents to ensure that they match what is defined in the configuration xml

Then: maps the tab structure into a graph-based object model

Actions such as conversion to other formats and persisting to the DB are performed on this object model (called the BIIObjectStore).

[Figure: an example ISA-Tab table and the same content as a graph. Sources H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years); samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps; labeled extracts H1.sample1.labeled and H2.sample1.labeled; data files h1-s1.cel, h1-s2.cel and h2-s1.cel.]
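The tab-to-graph mapping can be sketched with the names from the example. The rows below are illustrative only: pairing each sample with a data file is an assumption based on the file names, and the real BIIObjectStore is a much richer Java object model.

```python
# Illustrative rows only: (source, sample, data file). The pairing of samples
# to files is an assumption based on the names in the figure.
ROWS = [
    ("H1", "H1.sample1", "h1-s1.cel"),
    ("H1", "H1.sample2", "h1-s2.cel"),
    ("H2", "H2.sample1", "h2-s1.cel"),
]


def rows_to_graph(rows):
    """Turn table rows into a node set and a set of directed edges.

    A value repeated down a column (such as the source H1) becomes one node,
    and each pair of adjacent cells in a row becomes an edge.
    """
    nodes, edges = set(), set()
    for row in rows:
        nodes.update(row)
        edges.update(zip(row, row[1:]))
    return nodes, edges


nodes, edges = rows_to_graph(ROWS)
print(len(nodes), len(edges))
```

The table repeats H1 on two rows, but the graph holds it once with two outgoing edges, which is what makes the graph form convenient for conversion and persistence.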



or...

validate from the command line...

or...within ISAcreator directly...



Convert to or from differing formats



Converts MAGE-Tab to ISA-Tab. This is still in beta, however we are getting close to a fully working version. We’ve successfully created validated ISA-Tab for ~90% of the 21k experiments in ArrayExpress.

Available as a web service and web interface; the source is also available for running conversions locally.

The converters

http://isatab.sourceforge.net/magetoisa/

Fully Endorsed by ArrayExpress, PRIDE and the European Nucleotide Archive (ENA)...





or...

convert from the command line...

or...within ISAcreator directly...



Automagically filters out the formats you can’t export to...e.g., if I have no sequencing experiments, I won’t need to export in SRA
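The filtering can be sketched as a lookup from the assay technologies present in the ISA-Tab to the formats that can carry them. The mapping and technology names below are assumptions for illustration; the rules actually used by the ISAconverter may differ.

```python
# Hypothetical mapping from assay technology to the export format that can
# carry it; the real ISAconverter rules may differ.
FORMAT_FOR_TECHNOLOGY = {
    "DNA microarray": "MAGE-TAB",
    "mass spectrometry": "PRIDE-ML",
    "nucleotide sequencing": "SRA-XML",
}


def available_exports(assay_technologies):
    """Return the export formats relevant to the assays actually present."""
    return sorted({FORMAT_FOR_TECHNOLOGY[t]
                   for t in assay_technologies
                   if t in FORMAT_FOR_TECHNOLOGY})


exports = available_exports(["DNA microarray", "mass spectrometry", "histology"])
print(exports)
```

With no sequencing assay in the list, SRA-XML never shows up as an option.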



Get ISA-Tab into a database

Share it (or don’t) with the world



GUI & command line interface to get ISA-Tab into an instance of the BII (BioInvestigation Index)

Calls the validator first, then persists the BIIObjectStore object to the database via Hibernate



Lots of admin functionalities are available from the GUI; these are also available using the command line or API.

Disclaimer: Over X11, using such an interface is slow...I’d suggest making use of the API or command line tools available...



Database



The BioInvestigation Index term is an overloaded one. It refers to the database & the web application

The database itself is quite complicated to describe in detail in a single presentation, but the key take home message is that it is graph based...remember this?

[Figure: graph of sources H1 and H2 (H. Sapiens, 35 and 33 Years), their samples, Labeling steps, labeled extracts and .cel data files.]

In the BII, we have Materials, Processes, Cross References and Annotations. This makes things pretty generic...and the BII model is even more generic than ISA-Tab

Database



One more word about the database, (and a few sentences) then I’ll show the web application.

Scalable.

ArrayExpress v2 makes use of all of the BII object model. They just add a table for bio entities (or genes) and that’s it!

AE has >21,000 experiments and >500,000 hybridizations loaded into its database.

Database

As far as we know... :)



Web Application



Web application



We created the web application as a lightweight solution enabling users to share their data.

Web application

(But it’s a J2EE solution so I think we’ve got an oxymoron on our hands)

But even though it’s enterprise level, it is at least light on maintenance. You’ll not have to do much with BII once it is running. The EBI version, running across 2 servers (one as backup) has been live for 6 months so far without one restart...and I only restarted to deploy a new instance.



Web application

We use JBoss Seam, mainly because we don’t have to worry about HTTP sessions, scope, etc. It manages everything for us, which is particularly important in highly accessed systems and frees up time to be spent working on more interesting things...

But it’s also a really good “integration framework”, pulling in JSP, JSF, EJB, JPA, Hibernate, etc.



Web application

We use HQL instead of platform-specific SQL. So the database can be Oracle, MySQL, PostgreSQL...a database-independent application

We can deal directly with objects, directly from the database queries

We construct the schema using POJOs and some XML



Web application

Lucene creates a document-based index of the database contents

We use annotations to specify which fields should be indexed

This index can be accessed and queried very quickly, so we use this to build the user interface
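To show why a document-based index is quick to query, here is a toy inverted index. It is a deliberate simplification and not Lucene: `INDEXED_FIELDS` plays the role of the annotations that mark fields for indexing, and the documents and field names are invented for the example.

```python
from collections import defaultdict

# Hypothetical choice of indexed fields, standing in for the annotations
# that mark fields on the real Java classes.
INDEXED_FIELDS = ("organism", "assay")


def build_index(documents):
    """Map each lower-cased token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for field_name in INDEXED_FIELDS:
            for token in doc.get(field_name, "").lower().split():
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    token_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


DOCS = {
    "study-1": {"organism": "Homo sapiens", "assay": "DNA microarray",
                "contact": "not indexed"},
    "study-2": {"organism": "Mus musculus", "assay": "mass spectrometry"},
}
index = build_index(DOCS)
print(search(index, "homo sapiens"))
```

A query only touches the token sets it names, never the underlying records, so lookups stay fast however large the database gets.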



Being deployed on a cloud-enabled instance of the BioLinux VM

Will make it easier to create deployments of the BII database and web application...



Analysis

Last but not least...



Package to read ISA-Tab into R, especially BioConductor, to run analysis scripts on your data...

It can automatically call microarray, mass spec and flow cytometry analysis packages on appropriate datasets...

We still need to upload this to BioConductor...created by Audrey Kauffman

There is also a script to create Galaxy libraries from ISA-Tab

Brad Chapman is working on this at HSPH


Who’s using ISA?


Fortunately, lots of people are now taking ISA on board... people are realising that MAGE-TAB, SOFT, PRIDE-ML and SRA-XML are an overhead which can be avoided, especially in multi-omic experiments.

The National Center for Toxicological Research (NCTR)

& others...see the case study section on the ISA tools web site


Who’s using ISA?


Case study: Metabolomics repository - Metabolights

Built on top of the ISA infrastructure with a custom front-end web interface...

Data entry tooling - ISAcreator, ISAvalidator and ISAconverter

Data management tools - BII data manager, BII database

Also developing their own plugins for ISAcreator (of type: custom cell editor) to help users in reporting metabolite assignments.





Who’s using ISA?


Case study: SCDE

Built on top of the ISA infrastructure in its entirety

Contributing automatic deployment scripts for the BII (linked with the cloud BioLinux initiative)

Created the Python Parser for ISA-Tab

Curated stem cell informatics resource linked with the Galaxy analysis engine





Who’s using ISA?


Case study: GeneData - InnoMed

720 animals

~20,000 assays

16 compounds

3 doses

Biggest public study of its kind

Only available in ISA-Tab



protein expression profiling by mass spectrometry

transcription profiling by DNA microarray

metabolite profiling by mass spectrometry

metabolite profiling by NMR spectroscopy

histology

clinical chemistry

hematology


Who’s using ISA?




Our next steps...as a community


Analysis

Visualization

Further adoption

[Figure: experimental workflows drawn as chains of nodes (SAMP, EX, LABEL, HYB, TRANS, SCAN) for liver, kidney, blood serum and blood plasma samples under low dose aspirin, with groups marked x5.]

Short chain (SAMP, TRANS, SCAN): missing protocols and no information about what was being measured.

Full chain (SAMP, EX, LABEL, HYB, TRANS, SCAN): well described process from sample to data file.

Making visual comparisons is straightforward using this approach. The longest path is constructed based on all other known datasets in the pool of workflows being compared.


We can’t do everything by ourselves...


ISA team

Susanna-Assunta Sansone, Philippe Rocca-Serra, Eamonn Maguire

Contributors

Marco Brandizi, Natalija Sklyar, Brad Chapman, Bob MacCallum, Kenneth Haug, Pablo Conesa, Audrey Kauffman

Funders

Collaborators at

The National Center for Toxicological Research (NCTR)



ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Philippe Rocca-Serra; Marco Brandizi; Eamonn Maguire; Nataliya Sklyar; Chris Taylor; Kimberly Begley; Dawn Field; Stephen Harris; Winston Hide; Oliver Hofmann; Steffen Neumann; Peter Sterk; Weida Tong; Susanna-Assunta Sansone. Bioinformatics 2010 26: 2354-2356


Questions??


You can email [email protected]

View our blog: http://isatools.wordpress.com

Follow us on Twitter: @antarcticdesign

View our website: http://www.isa-tools.org

View our Git repo & contribute: http://github.com/ISA-tools

Thanks for listening...











