eamonn maguire: the open source isa metadata tracking framework: from data curation and management...
DESCRIPTION
Eamonn Maguire's talk on "The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe" at ISCB-Asia, December 17th 2012
TRANSCRIPT
Eamonn Maguire, Lead Software Engineer, Oxford University
ISCB-Asia, 17th December 2012
The Open Source ISA Metadata Tracking Framework: From Data Curation and Management at the Source, to the Linked Data Universe
What is ISA all about?
We want to enable better reporting of experiments...
We want to make it easier for submitters...
We want to provide tooling which biologists will want to use...
What’s the problem?
Could be beans. Could be peas. Could be soup.
Analogy time. Each can is an experiment. We have no labels, so no indication of what is in the can.
In biology, things aren’t quite as bad as this: we have some labels, but they aren’t all in the same language.
1. There is fragmentation in formats: the formats used to describe experiments are different, e.g. MAGE-Tab, PRIDE-ML, SRA-XML.
2. Different formats often capture different information, often not enough to actually repeat an experiment correctly.
3. The terminologies used to describe an experiment are different, e.g. humans vs Homo sapiens or rat vs Rattus norvegicus, making search more difficult.
Tin can analogy borrowed from Norman Morrison & converted from ontologies to metadata transfer standards.
可能是豌豆 ("might be peas"): a different representation, in a non-Latin script.
Might be petit pois: a different terminology.
1. There is fragmentation in formats
Can you imagine having to translate everything you write into a different language in order to submit your data?
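你能想象有翻译成不同的语言编写的一切,以提交您的数据吗?即使转换工具,像谷歌,翻译弄错了。
(Machine translation into Chinese of the question above, plus: "Even conversion tools, like Google Translate, get it wrong.")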
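An féidir leat a shamhlú go bhfuil gach rud a scríobh tú a aistriú isteach i dteanga eile d'fhonn a chur isteach do chuid sonraí? Fiú uirlisí chomhshó, cosúil le google translate a fháilsé mícheart.
(Machine translation into Irish of the question above, plus: "Even conversion tools, like Google Translate, get it wrong.")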
Repositories are making it difficult for biologists to submit data, and for others to use it. This is particularly true for those performing multi-omic experiments: to submit, say, proteomic and transcriptomic data, one must provide slightly different information in two very different formats... why?
Our solution is one general-purpose, flexible format, herein referred to as ISA-Tab.
A domain-agnostic format to capture experimental metadata in omic experiments (transcriptomic, genomic, proteomic, metabolomic) as well as traditional experiments such as clinical chemistry and histology.
...it already works in lots of domains: nutrigenomics, toxicogenomics, public health, etc.
1. There is fragmentation in formats: our solution
(Diagram: investigation → study → assay(s) → data. Data are external files in native or other formats, referenced by pointers to data file names/locations.)
investigation: high-level concept to link related studies.
study: the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.
assay: test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data).
Biologists like tab. They don’t like XML.
Through basic inference...ISA-Tab is good :)
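The investigation / study / assay hierarchy described above can be sketched as a small data model. This is only an illustration of the nesting, not the ISA tools' own classes: every class name, field name and value below is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    """A test on the subject, or on material taken from it, that produces data."""
    measurement: str                                       # illustrative label
    data_files: List[str] = field(default_factory=list)    # pointers to files only


@dataclass
class Study:
    """The central unit: the subject, its characteristics and treatments."""
    subject: str
    characteristics: Dict[str, str] = field(default_factory=dict)
    treatments: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High-level concept that links related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        # Walk investigation -> study -> assay and collect the file pointers.
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Made-up example values, purely for illustration.
example = Investigation(
    title="Hypothetical multi-omic investigation",
    studies=[
        Study(
            subject="mouse liver",
            characteristics={"organism": "Mus musculus"},
            treatments=["control"],
            assays=[
                Assay("transcription profiling", ["data/arrays/sample1.txt"]),
                Assay("protein expression profiling", ["data/ms/sample1.raw"]),
            ],
        )
    ],
)
```

The data themselves stay in external files; the model only holds their names or locations, which mirrors the "pointers to data file names/location" idea in the diagram.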
2. Different formats often capture different information... but there are lots of similarities
MIBBI: Minimum Information for Biological and Biomedical Investigations.
The information captured by a format is defined by a ‘checklist’, ideally a list of fields that together provide the minimal amount of information required to be able to reproduce an experiment.
We have 32 checklists at present because there are differences in what is deemed important depending on the experiment being performed.
MIBBI is trying to harmonise these checklists to reduce redundancy and make them interoperable.
Now integrated in
Helping to demystify the unwieldy world of standards...
Find out what standards are out there...MI Checklists, ontologies and formats plus what domains they are suited to...
Find out about data sharing policies from NIH for example.
Databases, which standards they use etc.
In biology, things aren’t quite as bad as the unlabelled cans: we have some labels, but they aren’t all in the same language. What do I mean by this? Well...
1. There is fragmentation in formats.
2. Different formats often capture different information.
3. The terminologies used to describe an experiment are different: we promote the use of ontologies to harmonize the recording of experiments.
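To make the terminology point concrete, here is a minimal sketch of why ontology terms help search: different free-text labels are mapped to one term identifier before comparison. The synonym table is a hand-written assumption for this example (a real tool would look terms up in an ontology service), and mapping the bare word "rat" to Rattus norvegicus is a simplification.

```python
# Illustrative synonym table, keyed by lower-cased free text.
# NCBITaxon:9606 is Homo sapiens and NCBITaxon:10116 is Rattus norvegicus.
SYNONYMS = {
    "human": "NCBITaxon:9606",
    "humans": "NCBITaxon:9606",
    "homo sapiens": "NCBITaxon:9606",
    "rat": "NCBITaxon:10116",
    "rattus norvegicus": "NCBITaxon:10116",
}


def normalise_organism(label):
    """Return the term ID for a free-text organism label, or None if unknown."""
    return SYNONYMS.get(label.strip().lower())


def same_organism(a, b):
    """True only when both labels resolve to the same known term."""
    term_a, term_b = normalise_organism(a), normalise_organism(b)
    return term_a is not None and term_a == term_b
```

With this in place, a search for "Homo sapiens" would also find experiments annotated as "humans", because both resolve to the same identifier.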
The ISA tools...
The ISA tools bring together a common representation, MI checklists and ontologies.
(Diagram: common representation, MI checklists, ontologies.)
The ISA tools
Developed on top of the ISA-Tab format...modular, configurable, open source, Java based*
See them all at isa-tools.org
The ISA tools... a tool for all your needs
Configurable...
So, our infrastructure is built upon XML files. These are created by the ISAConfigurator.
A configuration XML file describes the fields (or checklist) required to describe a particular experiment and any ontologies to be used.
We need to support lots of different checklists, and it should be easy for people to change their requirements should they need to...
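As a rough idea of what "a configuration XML file describes the fields" could mean in practice, the sketch below reads a small configuration and lists the required fields. The XML layout here is a hypothetical simplification invented for this example; it is not the schema that ISAConfigurator actually writes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified configuration. Element and attribute names are
# assumptions for illustration only.
CONFIG_XML = """
<configuration name="example_assay">
  <field header="Sample Name" required="true"/>
  <field header="Characteristics[organism]" required="true" ontology="NCBITaxon"/>
  <field header="Comment[notes]" required="false"/>
</configuration>
"""


def load_fields(xml_text):
    """Parse the configuration into a list of field descriptions."""
    root = ET.fromstring(xml_text)
    return [
        {
            "header": f.get("header"),
            "required": f.get("required") == "true",
            "ontology": f.get("ontology"),  # None when no ontology is set
        }
        for f in root.findall("field")
    ]


fields = load_fields(CONFIG_XML)
required_headers = [f["header"] for f in fields if f["required"]]
```

Changing a requirement then means editing one attribute in the configuration rather than changing any tool code, which is the flexibility the slide asks for.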
Create configuration XML files
ISAcreator: create & edit ISA-Tab
ISAcreator was developed to be a user-friendly way to enter standards-compliant metadata. It has lots of features:
• spreadsheet-like interface
• automated ontology tagging (powered by the NCBO Annotator)
• QR code generator
• publication searcher
• ontology search
• visualization
• file chooser
But these are just some of them...we also have a data entry wizard and an import utility...
The ISAcreator...
Ontology search and automated annotation in Google Docs
Make sure the ISA-Tab is correct
Validate from the dedicated tool...
or within ISAcreator directly...
or validate from the command line...
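What "correct" might mean for a tab-delimited file can be sketched as a check against a list of required columns. This is a toy stand-in, not the ISA validator: the required columns and the sample rows are assumptions made for the example.

```python
import csv
import io

# Assumed checklist for this sketch; a real one would come from a configuration.
REQUIRED = ["Sample Name", "Characteristics[organism]"]


def validate(tsv_text, required=REQUIRED):
    """Return a list of problems found; an empty list means the file passed."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    headers = reader.fieldnames or []
    problems = [f"missing column: {h}" for h in required if h not in headers]
    # Data rows start on line 2, after the header line.
    for line_no, row in enumerate(reader, start=2):
        for h in required:
            if h in headers and not (row.get(h) or "").strip():
                problems.append(f"line {line_no}: empty value for {h}")
    return problems


# Made-up inputs for illustration.
GOOD = "Sample Name\tCharacteristics[organism]\nKO1\tHomo sapiens\n"
BAD = "Sample Name\tCharacteristics[organism]\nKO1\t\n"
```

Returning a list of problems rather than raising on the first one lets a submitter fix everything in one pass.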
Convert to or from differing formats
Converts MAGE-Tab to ISA-Tab. This is still in beta; however, we are getting close to a fully working version. We’ve successfully created validated ISA-Tab for ~90% of the 21k experiments in ArrayExpress.
Available as a web service and a web interface; the source is also available for running conversions locally.
The converters
http://isatab.sourceforge.net/magetoisa/
Fully Endorsed by ArrayExpress, PRIDE and the European Nucleotide Archive (ENA)...
(Figure: an ISA-Tab example drawn as a graph. Node labels: Saghantelian_1; KO1; KO1_extract; ./cdf/KO/ko15.CDF; sample collection; extraction; material processing; mass spectrometry; processed material; information content entity; material entity. Edge labels: has specified input; has specified output; derives from; type.)
The converters...semantic web
• Make the semantics of ISA-Tab explicit, including materials & data entities & processes
• Exploit the semantic annotations available in ISA-Tab datasets
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate querying, data integration & knowledge discovery/reasoning
Notes in lab books (information for humans)
Spreadsheets & tables (ISA-Tab metadata)
Facts as RDF statements (information for machines)
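The step from a spreadsheet row to "facts as RDF statements" can be sketched as follows. This is a hedged illustration, not the ISA converter: the `http://example.org/isa/` namespace and the predicate names are placeholders (a real conversion would use ontology URIs), and the row reuses the KO1 / extraction / KO1_extract labels from the figure.

```python
# Hypothetical namespace used only for this sketch.
EX = "http://example.org/isa/"


def triples_for_step(input_name, process_name, output_name):
    """Yield (subject, predicate, object) statements for one processing step."""
    s_in, s_proc, s_out = (EX + n for n in (input_name, process_name, output_name))
    yield (s_proc, EX + "has_specified_input", s_in)
    yield (s_proc, EX + "has_specified_output", s_out)
    yield (s_out, EX + "derives_from", s_in)


def to_ntriples(triples):
    """Serialise the statements in N-Triples style, one per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# One row of a table, written as a dict keyed by column header.
row = {"Sample Name": "KO1", "Protocol REF": "extraction", "Extract Name": "KO1_extract"}
statements = list(
    triples_for_step(row["Sample Name"], row["Protocol REF"], row["Extract Name"])
)
```

Each row yields three explicit statements that a machine can query, whereas in the table the same relationships are only implied by column order.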
Get ISA-Tab into a database
Share it (or don’t) with the world
Database & Web Application
Web application
Analysis
Last but not least...
A package to read ISA-Tab into R, and especially Bioconductor, so you can run analysis scripts on your data...
It can automatically call microarray, mass spec and flow cytometry analysis packages on appropriate datasets...
Available from Bioconductor...
There is also a script to create Galaxy libraries from ISA-Tab. Brad Chapman is working on this at HSPH.
Dedicated ISAcreator mode. Allows for persistence and perusal of ISA experiments in GenomeSpace.
isacommons
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
Stem Cell Commons
Nanotechnology Informatics Working Group
ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Philippe Rocca-Serra; Marco Brandizi; Eamonn Maguire; Nataliya Sklyar; Chris Taylor; Kimberly Begley; Dawn Field; Stephen Harris; Winston Hide; Oliver Hofmann; Steffen Neumann; Peter Sterk; Weida Tong; Susanna-Assunta Sansone. Bioinformatics 2010 26: 2354-2356
Towards Interoperable Bioscience Data. Sansone SA, Rocca-Serra P, Field D, Maguire E et al. Nature Genetics 2012
Questions?
You can email [email protected]
View our blog: http://isatools.wordpress.com
Follow us on Twitter: @isatools
View our website: http://www.isa-tools.org
Thanks for listening...
View our Git repo & contribute: http://github.com/ISA-tools