isa tools presentation
Eamonn Maguire, Lead Software Engineer
Novartis, 21st October 2011
The ISA software suite
Tuesday, 8 November 2011
Who am I?
it’s rhetorical...
Irish
Formal background is Computer Science (Bachelors) and Bioinformatics (Masters)
Lead software engineer on the ISA project
DPhil student in Visualization at the Dept. of Computer Science, Oxford
Have my own graphic design company (Antarctic Design)
Part of a small but productive and vibrant team at Oxford headed by Susanna-Assunta Sansone.
Our work includes the ISA tools/infrastructure, MIBBI & BioSharing.
What is ISA all about?
We want to enable better reporting of experiments...
We want to make it easier for submitters...
We want to provide tooling which biologists will want to use...
What’s the problem?
Could be beans. Could be peas. Could be soup.
Analogy time. Each can is an experiment. We have no labels, so no indication of what is in the can.
In biology, things aren’t quite as bad as this: we have some labels, but they aren’t all in the same language. What do I mean by this? Well...
1. there is fragmentation: the formats used to describe experiments are different, e.g. MAGE-TAB, PRIDE-ML, SRA-XML, but in essence they capture much of the same information; and
2. the terminologies used to describe experiments are different, even though many concepts, such as sample description, are shared. This applies to field names as well as values...
Tin can analogy borrowed from Norman Morrison & converted from ontologies to metadata transfer standards.
可能是豌豆 ("could be peas" in Chinese) - a different representation...non-Latin script
Might be petit pois - a different terminology
Can you imagine having to translate everything you write into a different language in order to submit your data?
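你能想象有翻译成不同的语言编写的一切,以提交您的数据吗?即使转换工具,像谷歌,翻译弄错了。
[Machine-translated Chinese for the question above, followed by: "Even conversion tools, like Google Translate, get it wrong."]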
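An féidir leat a shamhlú go bhfuil gach rud a scríobh tú a aistriú isteach i dteanga eile d'fhonn a chur isteach do chuid sonraí? Fiú uirlisí chomhshó, cosúil le google translate a fháilsé mícheart.
[Machine-translated Irish for the same question, followed by: "Even conversion tools, like Google Translate, get it wrong."]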
Take home point...
Repositories are making it difficult for biologists to submit data, and for others to use it. This is particularly true for those performing multi-omic experiments: to submit, say, proteomic and transcriptomic data, one must provide the same general information in two very different formats. Why? Well, people like to have their own formats...plus, ad hoc is generally easier.
Our solution is one general purpose, flexible format, herein referred to as ISA-Tab.
A domain agnostic format to capture experimental metadata in omic experiments (transcriptomic, genomic, proteomic, metabolomic) as well as traditional experiments such as clinical chemistry and histology.
...it works on lots of (I won’t dare say all) types of data...nutrigenomics, toxicogenomics, public health... etc.
Tell me more...
[Diagram: investigation, assay(s) and data; assays hold pointers to data file names/locations, and the data are external files in native or other formats]
investigation: high level concept to link related studies
study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.
assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
Biologists like tab. They don’t like XML.
Through basic inference...ISA-Tab is good :)
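The investigation / study / assay hierarchy described on this slide can be pictured as three nested record types. This is only a sketch under assumptions: the class and attribute names, and the example file names, are invented for illustration and are not the ISA-Tab field names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    # A test performed on the subject or on material taken from it.
    measurement: str
    # Pointers to data file names/locations; the files themselves stay external.
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    # The central unit: the subject under study and any treatments applied.
    subject: str
    treatments: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    # High level concept that links related studies.
    title: str
    studies: List[Study] = field(default_factory=list)

    def data_files(self) -> List[str]:
        """Collect every data file pointer reachable from this investigation."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Hypothetical example content.
investigation = Investigation(
    title="Example investigation",
    studies=[
        Study(
            subject="H. sapiens",
            assays=[
                Assay("transcription profiling", ["h1-s1.cel", "h1-s2.cel"]),
                Assay("metabolite profiling", ["h1-s1.raw"]),
            ],
        )
    ],
)
print(investigation.data_files())
```

The point of the nesting is that one investigation can carry several assay types side by side, which is what makes a single format workable for multi-omic experiments.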
But we don’t want to do this...
http://xkcd.com/927/
A format on its own isn’t very much though...
Too true...the secret to adoption is to provide the tooling to enable biologists to get data into the format, share it, convert and analyse it!
The ISA tools provide this tool support.
The ISA tools
[Tool logos, including ISAcreator and ISAconverter]
& others being developed by the ISA community...
Perl parser for ISA by Bob MacCallum and Python parser for ISA by Brad Chapman
Developed on top of the ISA-Tab format...modular, configurable, open source, Java based*
*apart from the R, Perl and Python packages of course...
The ISA tools... modular
Configure: Curator creates template
Create (ISAcreator): Experimentalist uses editor to report investigation.
Validate: Check adherence to template
Convert to ISA (ISAconverter): Convert from MAGE-TAB to ISA-Tab. More formats coming soon...
Convert from ISA (ISAconverter): Convert to MAGE-TAB, PRIDE-ML, SRA-XML for submission to international public repositories
Load: Curator stores metadata in database using BII data management tool
Browse: Users browse investigations, query and view experimental metadata, and access associated data files
Analyze: Perform analysis of data in context with the metadata using the Galaxy or R analysis engines.
(Diagram note: Requires Configuration XML)
The ISA tools... configurable
Are you just using buzz words? Well, we like buzz words as much as everyone else, but no.
We need to be configurable to support evolving checklists and requirements. Just check out mibbi.org: lots of checklists! 32, in fact, at the last count.
MIBBI is trying to harmonise these checklists to reduce redundancy and make them interoperable.
Checklists...what are they?
When we report things, there are some things which are really important.
In a school report, we have the child’s name, their class, teacher, subjects taken and so on.
Well, in a biological experiment, the very same principles apply. We need information about the sample (species, strain, age) and information about the protocols applied during the experiment and subsequent parameters.
We have 32 checklists at present because there are differences in what is deemed important depending on the experiment being performed.
Good reporting means that statistics can be applied better, experiments can be reproduced more easily, and data mashups can occur in the future.
Experiments are expensive; we should make sure that their full value is realised.
On this point...
Helping to demystify the unwieldy world of standards...
Find out what standards are out there...MI checklists, ontologies and formats, plus what domains they are suited to...
Find out about data sharing policies, from NIH for example.
Configurable...back to that
So, our infrastructure is built upon XML files. These are created by the ISAconfigurator.
A configuration XML file describes the fields (or checklist) required to describe a particular experiment!
We need to support lots of different checklists, and it should be easy for people to change their requirements should they need to....
ISAconfigurator
The brick maker...a kiln
The bricks...Configuration XML
Create configuration XML files
The ISAconfigurator...
The configuration xml...
This is an example of a field definition created by the configurator. In this instance we are describing a label field, in particular, one used to describe the label used in a microarray experiment.
We have defined it to come from an ontology, and we recommend the ChEBI ontology. It is also required.
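To make the field definition concrete, here is a small sketch of what reading one might look like. The XML below is hypothetical: the element and attribute names are assumptions made for illustration, not the actual schema written by the ISAconfigurator.

```python
import xml.etree.ElementTree as ET

# Hypothetical field definition; element and attribute names are
# illustrative assumptions, not the real configuration schema.
FIELD_XML = """
<field header="Label" data-type="Ontology term" is-required="true">
  <description>The label used in a microarray experiment</description>
  <recommended-ontologies>
    <ontology abbreviation="CHEBI"/>
  </recommended-ontologies>
</field>
"""


def read_field(xml_text):
    """Turn one <field> element into a plain dictionary."""
    element = ET.fromstring(xml_text)
    return {
        "header": element.get("header"),
        "data_type": element.get("data-type"),
        "required": element.get("is-required") == "true",
        "ontologies": [
            o.get("abbreviation")
            for o in element.findall("./recommended-ontologies/ontology")
        ],
    }


label_field = read_field(FIELD_XML)
print(label_field)
```

Whatever the real layout, the three facts from the slide are what matter: the field is an ontology term, ChEBI is the recommended source, and a value is required.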
The configuration xml...
The configuration xml is an important part of the infrastructure and is used by various components in differing capacities.
ISAcreator: used in content validation, but its main purpose here is to build the user interface...more on this later.
Validator: used in content validation. The validation component is also called in the ISAconverter and BII data manager before conversion and loading respectively.
Aside from strong ontology support, the configuration xml also allows you to specify regular expressions which field contents should match, and to specify whether a field is an integer, double, list value, boolean, string or a field which should accept a file location...
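The kind of content check the configuration makes possible can be sketched in a few lines. This is an assumption-laden illustration: the type names and the `validate` function are invented here, and the real validator is a Java component.

```python
import re


# Hypothetical type names, loosely following the kinds of field the slide
# lists: integer, double, list value, boolean, string, file location.
def validate(value, data_type, pattern=None, allowed=None):
    """Return True when `value` (a cell's text) satisfies the field definition."""
    # An optional regular expression must match the whole cell.
    if pattern is not None and re.fullmatch(pattern, value) is None:
        return False
    if data_type == "integer":
        return re.fullmatch(r"[+-]?\d+", value) is not None
    if data_type == "double":
        try:
            float(value)
            return True
        except ValueError:
            return False
    if data_type == "boolean":
        return value.lower() in ("true", "false")
    if data_type == "list":
        return value in (allowed or [])
    # Strings and file locations are accepted as-is in this sketch.
    return True


print(validate("35", "integer"))
print(validate("h1-s1.cel", "string", pattern=r".+\.cel"))
```

Because the rules live in the configuration rather than in code, the same check can run in the editor, before conversion and before loading.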
ISAcreator: Create & Edit ISA-Tab
The ISAcreator...
Developed to be a user-friendly way to enter standards-compliant metadata: it has lots of features...
spreadsheet-like interface
automated ontology tagging (powered by NCBO Annotator)
QR code generator
publication searcher
ontology search
visualization
file chooser
But these are just some of them...we also have a data entry wizard and an import utility...
Use of the configuration xml
The configuration xml schema (XSD) is consumed by an XMLBeans goal in Maven, and Java stubs are created which are then used to load the XML files into memory.
The configuration is also used to define the form view using a similar mechanism....
1. XML definition(s): <xml><field>sample</field><field>protocol ref</field><field>extract name</field><field>label</field>...</xml>
2. Import into Java object model (Java object: TableReferenceObject) using classes created by XMLBeans
3. Construct spreadsheet model. Columns, rows, etc.
4. Assign cell editors. Ontology terms are given the ontology selection tool as a cell editor, file fields are given a file chooser, etc.
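The last two steps, building the spreadsheet model and assigning cell editors by field type, could look roughly like this. The editor names, the type strings and the extra "raw data file" column are hypothetical; ISAcreator's own editors are Java components.

```python
# Hypothetical editor names, for illustration only.
EDITORS = {
    "ontology term": "OntologySelectionTool",
    "file": "FileChooser",
}
DEFAULT_EDITOR = "PlainTextEditor"


def build_spreadsheet(fields):
    """Build an ordered column model from (header, data type) pairs."""
    return [
        {"header": header, "editor": EDITORS.get(data_type, DEFAULT_EDITOR)}
        for header, data_type in fields
    ]


columns = build_spreadsheet([
    ("sample", "string"),
    ("protocol ref", "string"),
    ("extract name", "string"),
    ("label", "ontology term"),
    ("raw data file", "file"),  # assumed extra column to show the file chooser
])
for column in columns:
    print(column["header"], "->", column["editor"])
```

The useful property is that adding a field to the configuration adds a column with the right editor, without touching the user interface code.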
Sounds good...what does it look like?...
Ontologies
We use the NCBO BioPortal and the EBI’s OLS to do searching and browsing on ontologies.
Ontology Resource Manager: the resource manager provides seamless searching of ontology resources, regardless of their origins, their underlying data schema or the mechanism (REST, SOAP or local file store) through which they are accessed.
[Diagram labels: NCBO BioPortal; Ontology Lookup Service (OLS); Plugin; Search, Hierarchy and Annotator services; Ontology browsing & searching; Ontology tagging; Ontology field restriction]
ISAcreator manages ontology metadata such as version information as well as individual term accessions, source, URI and so forth.
Ontology search code is usable outside of ISAcreator. In fact, the ISAconfigurator imports ISAcreator as a Maven dependency and reuses its components to do ontology restriction...plugins can also make use of our ontology search and browse functionalities.
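One way to picture the resource manager is as a thin layer over interchangeable sources that all answer the same search call. The sketch below is an assumption, not the ISAcreator design: the class names are invented, the sources are in-memory stand-ins rather than REST or SOAP clients, and the accessions are made up.

```python
from abc import ABC, abstractmethod


class OntologySource(ABC):
    """One place terms can come from (web service, local file, plugin)."""

    @abstractmethod
    def search(self, query):
        """Return a list of (accession, label) pairs matching `query`."""


class InMemorySource(OntologySource):
    # Stand-in for a web service client; no network access in this sketch.
    def __init__(self, name, terms):
        self.name = name
        self.terms = terms

    def search(self, query):
        return [(acc, label) for acc, label in self.terms.items()
                if query.lower() in label.lower()]


class ResourceManager:
    def __init__(self, sources):
        self.sources = sources

    def search(self, query):
        # Record where each hit came from so the term source can be stored.
        return [
            {"source": s.name, "accession": acc, "label": label}
            for s in self.sources
            for acc, label in s.search(query)
        ]


# Hypothetical sources and accessions.
manager = ResourceManager([
    InMemorySource("SourceA", {"A:1": "liver"}),
    InMemorySource("SourceB", {"B:7": "Liver tissue", "B:8": "kidney"}),
])
hits = manager.search("liver")
print(hits)
```

A plugin that adds a new search space would then only need to provide another source with the same `search` method.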
Ontologies...some more technical details
How do we browse so quickly without downloading and reasoning over ontologies? (Disclaimer: speed also depends on whether you access OLS/BioPortal from Europe or America.)
Ontologies are all accessed by web services...this part is clear.
But browsing over ontologies, especially those coming from 2 separate resources, in different parts of the world with very different implementations isn’t easy.
To make the browsing experience not so slow and painful, we preload parts of the ontology tree in advance of them being requested by the user.
[Diagram, three panels: "ontology loaded" (root, level 0, with level(root) + 1 preloaded); "root expanded" (branch a, with level(a) + 1 preloaded); "node a expanded" (branch a and branch b, with level(b) + 1 preloaded)]
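The preloading idea can be sketched as a cache that always stays one level ahead of what the user has expanded. This is a minimal illustration under assumptions: `fetch_children` stands in for a web service call, the tree is made up, and a real tool would likely prefetch in the background rather than inline.

```python
class PrefetchingBrowser:
    """Keep the cache one level ahead of whatever the user has expanded."""

    def __init__(self, fetch_children):
        # fetch_children(node) stands in for a (slow) web service call.
        self.fetch_children = fetch_children
        self.cache = {}

    def _load(self, node):
        if node not in self.cache:
            self.cache[node] = self.fetch_children(node)
        return self.cache[node]

    def expand(self, node):
        children = self._load(node)
        # Preload the next level so expanding any child is a cache hit.
        for child in children:
            self._load(child)
        return children


# Hypothetical ontology fragment and a fetcher that records each remote call.
TREE = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}
calls = []


def fetch(node):
    calls.append(node)
    return TREE.get(node, [])


browser = PrefetchingBrowser(fetch)
first = browser.expand("root")  # fetches root, then a and b ahead of time
second = browser.expand("a")    # a is already cached; only a1 is fetched
print(calls)
```

The trade-off is a few extra requests up front in exchange for not making the user wait on each click.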
Plugins
Plugins can be developed for 3 different purposes:
Search (adds extra search space for ontology tool)
Custom cell editors (for spreadsheet)
Extra general functionality (which appears in a plugin menu)
In ISAcreator, we use the Apache Felix implementation of the OSGi framework...it’s really good.
Plugins...example
Novartis Metastore Search
Search function on the Novartis Metastore... integrates search results from the Metastore into the ontology search tool.
So, with the Novartis plugin in your plugin directory, you’ll be able to search the Novartis Metastore directly within ISAcreator, and it will handle all the tasks involved with recording term source, etc.
Make sure the ISA-Tab is correct
Checks:
the structure of the ISA-Tab, to ensure it’s well formed;
the contents, to ensure that they match what is defined in the configuration xml.
Then: maps the tab structure into a graph-based object model.
Actions such as conversion to other formats and persisting to the DB are performed on this object model (called the BIIObjectStore).
[Figure: the same example shown as ISA-Tab rows and as a graph. Sources H1 (H. Sapiens, 35 Years) and H2 (H. Sapiens, 33 Years); samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps; labeled extracts H1.sample1.labeled and H2.sample1.labeled; data files h1-s1.cel, h1-s2.cel and h2-s1.cel]
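Mapping tab rows into a graph amounts to merging repeated cell values into one node and linking each cell to its right-hand neighbour. The sketch below shows only that idea; it is not the BIIObjectStore, and the pairing of samples to files is assumed from the file names.

```python
# Rows as they might appear in a flattened sheet. The sample-to-file
# pairing is an assumption made for illustration.
ROWS = [
    ("H1", "H1.sample1", "h1-s1.cel"),
    ("H1", "H1.sample2", "h1-s2.cel"),
    ("H2", "H2.sample1", "h2-s1.cel"),
]


def rows_to_graph(rows):
    """Merge repeated cell values into single nodes and link neighbours."""
    edges = {}
    for row in rows:
        for parent, child in zip(row, row[1:]):
            edges.setdefault(parent, [])
            if child not in edges[parent]:
                edges[parent].append(child)
        # The last cell in a row is a leaf with no outgoing edges.
        edges.setdefault(row[-1], [])
    return edges


graph = rows_to_graph(ROWS)
print(graph["H1"])
```

In the table, H1 is written twice; in the graph it is one node with two samples hanging off it, which is why later steps work on the graph rather than the rows. Merging by value assumes names are unique across columns.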
or...
validate from the command line...
or...within ISAcreator directly...
Convert to or from differing formats
The converters
Converts MAGE-TAB to ISA-Tab. This is still in beta; however, we are getting close to a fully working version. We’ve successfully created validated ISA-Tab for ~90% of the 21k experiments in ArrayExpress.
Available as a web service and a web interface; the source is also available for running conversions locally.
http://isatab.sourceforge.net/magetoisa/
Fully endorsed by ArrayExpress, PRIDE and the European Nucleotide Archive (ENA)...
or...
convert from the command line...
or...within ISAcreator directly...
Automagically filters out the formats you can’t export to...e.g., if I have no sequencing experiments, I won’t need to export in SRA
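The filtering can be thought of as a lookup from the assay types present to the formats that make sense for them. The mapping below is a hypothetical illustration, not the converter's actual rule table.

```python
# Hypothetical mapping from assay technology to a submission format.
FORMAT_FOR_TECHNOLOGY = {
    "DNA microarray": "MAGE-TAB",
    "mass spectrometry": "PRIDE-ML",
    "nucleotide sequencing": "SRA-XML",
}


def available_exports(assay_technologies):
    """Return the formats that at least one assay can be exported to."""
    formats = {
        FORMAT_FOR_TECHNOLOGY[t]
        for t in assay_technologies
        if t in FORMAT_FOR_TECHNOLOGY
    }
    return sorted(formats)


# No sequencing assays here, so SRA-XML is not offered.
exports = available_exports(["DNA microarray", "mass spectrometry"])
print(exports)
```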
Get ISA-Tab into a database
Share it (or don’t) with the world
GUI & command line interface to get ISA-Tab into an instance of the BII (BioInvestigation Index)
Calls the validator first, then persists the BIIObjectStore object to the database via Hibernate
Lots of admin functionality is available from the GUI; it is also available using the command line or API.
Disclaimer: over X11, using such an interface is slow...I’d suggest making use of the API or command line tools available...
Database
Database
The BioInvestigation Index term is an overloaded one. It refers to both the database and the web application.
The database itself is too complicated to describe in detail in a single presentation, but the key take home message is that it is graph based...remember this?
[Figure: the example graph shown earlier, with sources H1 and H2 (H. Sapiens, 35 and 33 Years), their samples, Labeling steps, labeled extracts and .cel data files]
In the BII, we have Materials, Processes, Cross References and Annotations. This makes things pretty generic...and the BII model is even more generic than ISA-Tab.
Database
One more word about the database (and a few sentences), then I’ll show the web application.
Scalable.
ArrayExpress v2 makes use of all of the BII object model. They just add a table for bio entities (or genes) and that’s it!
AE have >21,000 experiments and >500,000 hybridizations loaded into its database.
As far as we know... :)
Web Application
Web application
We created the web application as a lightweight solution enabling users to share their data.
(But it’s a J2EE solution, so I think we’ve got an oxymoron on our hands)
But even though it’s enterprise level, it is at least light on maintenance. You won’t have to do much with BII once it is running. The EBI version, running across 2 servers (one as backup), has been live for 6 months so far without one restart...and I only restarted to deploy a new instance.
Web application
We use JBoss Seam, mainly because we don’t have to worry about HTTP sessions, scope, etc. It manages everything for us, which is particularly important in highly accessed systems and frees up time to be spent working on more interesting things...
But it’s also a really good “integration framework”, pulling in JSP, JSF, EJB, JPA, Hibernate, etc.
Web application
We use HQL instead of platform-specific SQL. So the database can be Oracle, MySQL, PostgreSQL...a database-independent application
We can deal directly with objects, straight from the database queries
We construct the schema using POJOs and some XML
Web application
Lucene creates a document-based index of the database contents
We use annotations to specify which fields should be indexed
This index can be accessed and queried very quickly, so we use this to build the user interface
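To show the idea of marking fields for indexing and then querying a document-based index, here is a toy inverted index. It is not Lucene and does not use its API; the record type, the `indexed` metadata flag and the example studies are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field, fields


@dataclass
class StudyRecord:
    # metadata={"indexed": True} plays the role of an indexing annotation.
    accession: str
    title: str = field(metadata={"indexed": True})
    organism: str = field(metadata={"indexed": True})
    internal_note: str = ""  # not marked, so never searchable


def build_index(records):
    """Map each lower-cased word to the accessions of records containing it."""
    index = defaultdict(set)
    for record in records:
        for f in fields(record):
            if f.metadata.get("indexed"):
                for word in getattr(record, f.name).lower().split():
                    index[word].add(record.accession)
    return index


# Hypothetical records.
records = [
    StudyRecord("S1", "Aspirin dose response in liver", "Rattus norvegicus"),
    StudyRecord("S2", "Liver transcription profiling", "Homo sapiens", "draft"),
]
index = build_index(records)
print(sorted(index["liver"]))
```

Lookups hit the prebuilt index rather than the database, which is why the index is quick enough to drive the user interface.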
Being deployed on a cloud-enabled instance of the BioLinux VM
Will make it easier to create deployments of the BII database and web application...
Analysis
Last but not least...
Package to read ISA-Tab into R (especially BioConductor) to run analysis scripts on your data...
It can automatically call microarray, mass spec and flow cytometry analysis packages on appropriate datasets...
We still need to upload this to BioConductor...created by Audrey Kauffman
There is also a script to create Galaxy libraries from ISA-Tab
Brad Chapman is working on this at HSPH
Who’s using ISA?
Fortunately, lots of people are now taking ISA on board... people are realising that MAGE-TAB, SOFT, PRIDE-ML and SRA-XML are an overhead which can be avoided, especially in multi-omic experiments.
The National Center for Toxicological Research (NCTR)
& others...see the case study section on the ISA tools web site
Who’s using ISA?
Case study: Metabolomics repository - Metabolights
Built on top of the ISA infrastructure with a custom front-end web interface...
Data entry tooling - ISAcreator, ISAvalidator and ISAconverter
Data management tools - BII data manager, BII database
Also developing their own plugins for ISAcreator (of type: custom cell editor) to help users in reporting metabolite assignments.
Who’s using ISA?
Case study: SCDE
Built on top of the ISA infrastructure in its entirety
Contributing automatic deployment scripts for the BII (linked with the cloud BioLinux initiative)
Created the Python Parser for ISA-Tab
Curated stem cell informatics resource linked with the Galaxy analysis engine
Who’s using ISA?
Case study: GeneData - InnoMed
720 animals
~20,000 assays
16 compounds
3 doses
Biggest public study of its kind
Only available in ISA-Tab
protein expression profiling by mass spectrometry
transcription profiling by DNA microarray
metabolite profiling by mass spectrometry
metabolite profiling by NMR spectroscopy
histology
clinical chemistry
hematology
Our next steps...as a community
Analysis
Visualization
Further adoption
[Figure: experimental workflows drawn as chains of steps labelled SAMP, EX, LABEL, TRANS, HYB and SCAN, for liver, kidney, blood serum and blood plasma samples (x5 each), with a low dose aspirin treatment. A full chain is annotated "well described process from sample to data file."; a short chain (SAMP, TRANS, SCAN) is annotated "missing protocols and no information about what was being measured."]
Making visual comparisons is straightforward using this approach. The longest path is constructed based on all other known datasets in the pool of workflows being compared.
We can’t do everything by ourselves...
ISA team
Susanna-Assunta Sansone, Philippe Rocca-Serra, Eamonn Maguire
Contributors
Marco Brandizi, Natalija Sklyar, Brad Chapman, Bob MacCallum, Kenneth Haug, Pablo Conesa, Audrey Kauffman
Funders
Collaborators at
The National Center for Toxicological Research (NCTR)
ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level
Philippe Rocca-Serra; Marco Brandizi; Eamonn Maguire; Nataliya Sklyar; Chris Taylor; Kimberly Begley; Dawn Field; Stephen Harris; Winston Hide; Oliver Hofmann; Steffen Neumann; Peter Sterk; Weida Tong; Susanna-Assunta Sansone
Bioinformatics 2010 26: 2354-2356
Questions??
You can email [email protected]
View our blog: http://isatools.wordpress.com
Follow us on Twitter: @antarcticdesign
View our website: http://www.isa-tools.org
View our Git repo & contribute: http://github.com/ISA-tools
Thanks for listening...