Matúš Kalaš, University of Bergen, Norway
BioHackathon, Kyōto, August 21, 2011
(Extended version for discussions)
EDAM ontology of bioinformatics data and methods
and
BioXSD
EMBRACE Data And Methods ontology
Jon Ison, Peter Rice (PI)
Hamish McWilliam, James Malone
EBI, EMBL, Hinxton
Matúš Kalaš, Inge Jonassen (PI)
CBU, Uni Bergen
Steve Pettifer
University of Manchester
An ontology for annotation of bioinformatics tools, resources, and data
EDAM ontology
Design principles of EDAM:
• Bioinformatics specific: with as few exceptions as necessary
• Well-defined scope: operations, types of data (including identifiers), topics, formats
• Relevant and usable: for users and annotators
• Ontologically sane: well-defined concepts and relations, reflecting reality
• Maintainable
Scope of EDAM, and example concepts:
Format: “FASTQ”, “SBML”
Data: “Sequence trace”, “Position frequency matrix”
Topic: “Phylogenetics”, “Protein classification”
Operation: “Multiple sequence alignment”, “Molecular dynamics simulation”
EDAM sub-ontologies and types of relations:
Example annotation with EDAM:
Examples: SAWSDL, EMBOSS (similarly also within data-resource annotation in DRCAT)
2nd example annotation with EDAM:
Example: DRCAT
Matúš Kalaš, Jan Christian Bryne *
Armin Töpfer **
Pål Puntervoll (PI)
Inge Jonassen (PI)
CBU, BCCS, Bergen
* now Oslo University Hospital
** now Uni Bielefeld, moving to Basel
Edita Bartaševičiūtė, Kristoffer Rapacki (PI)
CBS, DTU, Greater Copenhagen
Jon Ison
EBI, EMBL, Hinxton
Alexandre Joseph, Christophe Blanchet (PI)
IBCP, CNRS, Lyon
Steve Pettifer
University of Manchester
An XML exchange format for basic bioinformatics data
BioXSD
Goals of BioXSD:
• Being an XSD-based XML format to complement RDF and plain-text formats
• Filling the gap between specialised XSD-based exchange formats (such as SBML, MAGE-ML, PDBML, phyloXML, PSI-MI MIF, GCDML, GLYDE-II, … )
• Compatible with XML libraries for all main programming languages
• As lightweight as possible, but fitting everyone
• Developed and maintained in an open but organised collaboration, welcoming requests from the community
• Detailed structure: in-depth validation, semantic annotation (EDAM), efficient compression (EXI)
BioXSD is an exchange format for basic bioinformatics data:
• references to data, accessions, …
• sequence and genome features (annotation)
• sequence alignments
• biomolecular sequences
http://EDAMontology.sourceforge.net
EDAM at NCBO BioPortal: http://bioportal.bioontology.org/ontologies/1498
http://BioXSD.org
http://drcat.sourceforge.net (a catalogue of databases annotated with EDAM)
Additional stuff
EDAM- & BioXSD-related topics for BH11:
• EDAM for new applications & for semantic data; plus, what is the best RDF representation for annotation of tools & resources?
• BioXSD to RDF, RDF to BioXSD, SPARQLing of BioXSD data; firstly, what is the best RDF representation for sequence/alignment/feature data?
• BioXSD support in Open Bio*: import & export of BioXSD into/from BioPython, BioRuby, BioPerl, BioJava
• Compatibility of BioXSD with other bioinformatics XSDs: on the conceptual & design level, and on the level of data integration
Acknowledgement
Big thanks to the BioHackathon organisers,
and to the sources of funding for BH!
Projects contributing to EDAM & BioXSD, and their sources of funding:
EXAMPLE OF AN EDAM CONCEPT:
id: EDAM:0001099
name: UniProt accession
subset: identifier
subset: data
namespace: identifier
def: "Accession number of a UniProt database entry."
regex: "[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]" "[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"
example: P43353 Q9C199 A5A6J6
synonym: "UniProtKB accession number" EXACT []
synonym: "UniProtKB entry accession" EXACT []
synonym: "UniProt accession number" EXACT []
synonym: "Swiss-Prot accession" EXACT []
is_a: EDAM:0002091 ! Accession
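The regex field of such a concept can be used directly to check identifiers. The following is a minimal sketch in Python, assuming that the two quoted patterns are alternatives and that each must match the whole identifier; the function name is made up for illustration and is not part of EDAM.

```python
import re

# The two patterns from the "regex" field of the concept above.
# Assumption: they are alternatives, each matching the whole identifier.
UNIPROT_ACCESSION_PATTERNS = [
    re.compile(r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]"),
    re.compile(r"[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]"),
]


def is_uniprot_accession(text: str) -> bool:
    """Return True if the whole string matches either pattern (hypothetical helper)."""
    return any(p.fullmatch(text) for p in UNIPROT_ACCESSION_PATTERNS)


if __name__ == "__main__":
    # The first three values are the examples listed in the concept;
    # the last one is an NCBI Taxonomy accession used as a negative case.
    for candidate in ["P43353", "Q9C199", "A5A6J6", "9606"]:
        print(candidate, is_uniprot_accession(candidate))
```

The three example accessions listed in the concept pass this check, while a value such as 9606 does not.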
EXAMPLE OF AN EDAM CONCEPT:
Use cases driving EDAM are:
• Searching for tools and resources, and categorising them
• Tool & data integration: automation of data handling or even workflow composition; vocabulary for semantically rich data (incl. RDF)
• Data provenance: how data was created and processed
How to annotate the tools and resources?
• Annotate a formal description of the tool
- WSDL (SOAP Web services) (SAWSDL standard)
- WADL (Web applications, URL & REST services)
• Annotate in a dedicated catalogue
- for example DrCAT (online bio databases)
http://drcat.sourceforge.net
- RDF (needs additional vocabulary/ies)
Who should annotate the tools and resources with EDAM?
Providers and users, preferably not catalogue curators
How to annotate SOAP Web services?
wsdl:definitions wsdl:service wsdl:port wsdl:binding wsdl:portType * wsdl:operation wsdl:input wsdl:fault wsdl:output wsdl:message wsdl:part xs:element xs:complexType xs:sequence * xs:element more types, elements, attributes, enumerations ..
SAWSDL
Using URIs of concepts
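As an illustration of how such annotations can be read back, here is a small Python sketch that collects sawsdl:modelReference attributes from a schema. The embedded schema fragment and the function name are made up for this example; only the SAWSDL namespace and the attribute name come from the SAWSDL standard, and the concept URI follows the EDAM URI style.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SAWSDL recommendation.
SAWSDL_NS = "http://www.w3.org/ns/sawsdl"
MODEL_REFERENCE = f"{{{SAWSDL_NS}}}modelReference"

# Hypothetical schema fragment, written only for this illustration.
EXAMPLE_SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sawsdl="http://www.w3.org/ns/sawsdl">
  <xs:simpleType name="NucleotideSequence"
                 sawsdl:modelReference="http://purl.org/edam/data/0001211">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:element name="comment" type="xs:string"/>
</xs:schema>
"""


def model_references(xml_text):
    """Map the name of each annotated component to its list of concept URIs."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        value = element.get(MODEL_REFERENCE)
        if value is not None:
            # A modelReference value is a whitespace-separated list of URIs.
            found[element.get("name", element.tag)] = value.split()
    return found


if __name__ == "__main__":
    print(model_references(EXAMPLE_SCHEMA))
```

Components without an annotation, such as the comment element in the fragment, are simply skipped.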
Advantages of XML Schema
Advantages for users of tools:
• Automatic input validation * (usability)
• Easy conversion of formats (usability)
• Auto-generation of objects and GUIs (*) (usability)
• Efficient compression (with EXI) (*) (scalability)
• Annotation of details possible (semantics)
• Workflow programming easier & faster (resources)
Advantages for providers of tools:
• Automatic input validation * (security)
• Parsing “for free” (maintainability)
• Auto-generation of objects and GUIs (*) (maintainability)
• Efficient compression (with EXI) (*) (scalability)
• Annotation of details possible (semantics)
• Ready-made I/O building blocks: development easier & faster (*) (resources)
BioXSD 1.1 beta1 types:
SimpleTypes:
NucleotideSequence, AminoacidSequence, GeneralNucleotideSequence, GeneralAminoacidSequence, Biosequence
Accession(s)
helper types: Name, Text, Uri, Integer(s), Decimal(s), … and a few more
ComplexTypes:
NucleotideSequenceRecord, AminoacidSequenceRecord, GeneralNucleotideSequenceRecord, GeneralAminoacidSequenceRecord, BiosequenceRecord
..SequenceAlignment(s)
AnnotatedSequence
DatabaseReference, EntryReference, OntologyReference, OntologyConcept, Species, SequenceReference, Method
helper types: Score, SequencePosition(s), … and a few more
BioXSD : BiosequenceRecord
BioXSD : BiosequenceAlignment
BioXSD : AnnotatedSequence
BioXSD : AnnotatedSequence
BioXSD can be used:
• Directly as an input/output format of tools
• BioXSD can be extended, restricted, or included within other formats
• BioXSD can serve as the intermediate canonical format
>sp|P43353|AL3B1_HUMAN Aldehyde dehydrogenase family 3 member B1 OS=Homo sapiens GN=ALDH3B1 PE=1 SV=1
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL
>AL3B1_HUMAN P43353 ALDEHYDE DEHYDROGENASE 3B1 (EC 1.2.1.5). - Homo sapiens (Human).
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL
>gi|4502043|ref|NP_000685.1| aldehyde dehydrogenase family 3 member B1 isoform a [Homo sapiens]
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL
>sp_ac|P43353 \ID= AL3B1_HUMAN \DE="Aldehyde dehydrogenase family 3 member B1 (Aldehyde dehydrogenase 7)" \NCBITAXID=9606
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEI
Sequence record in BioXSD 1.0:
<mySequence xsi:type="AminoacidSequenceRecord">
  <sequence>MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL</sequence>
  <species>
    <databaseName>NCBI Taxonomy</databaseName>
    <accession>9606</accession>
    <entryUri>http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606</entryUri>
    <name>Human</name>
  </species>
  <customName>Aldehyde dehydrogenase family 3 member B1 (ALDH3B1)</customName>
  <formalReference>
    <databaseName>UniProt</databaseName>
    <accession xsi:type="UniprotAccession">P43353</accession>
    <entryUri>http://www.uniprot.org/uniprot/P43353</entryUri>
    <sequenceVersion>1</sequenceVersion>
    <isoformAccession xsi:type="ExtendedUniprotAccession">P43353-1</isoformAccession>
  </formalReference>
</mySequence>
Example data in BioXSD 1.1 beta1 format: a sequence record
<exampleSequenceRecord xsi:type="bx:NucleotideSequenceRecord">
  <bx:sequence>gtgcgagaggcccgtgccgccgtgcgcgctgcctacgaggctttctgccgctggagggaggtc</bx:sequence>
  <bx:species
    dbName="NCBI Taxonomy"
    dbUri="http://www.ncbi.nlm.nih.gov/taxonomy"
    accession="9598"
    entryUri="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9598"
    speciesName="Chimp"
  />
  <bx:reference
    dbName="GenBank/Nucleotide"
    dbUri="http://www.ncbi.nlm.nih.gov/nuccore"
    accession="NM_001008991"
    entryUri="http://www.ncbi.nlm.nih.gov/nuccore/NM_001008991"
    sequenceVersion="1"
  >
    <bx:subsequencePosition>
      <bx:segment min="282" max="345"/>
    </bx:subsequencePosition>
  </bx:reference>
  <bx:name>snippet of aldehyde dehydrogenase 5 family, member A1 (ALDH5A1)</bx:name>
  <bx:note>nuclear gene encoding mitochondrial protein, mRNA (GI:57113868)</bx:note>
</exampleSequenceRecord>
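Because the record is plain XML, it can be read with any standard XML library. The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the record above. The namespace bound to the bx prefix is a placeholder invented for this example, since the slide does not show the namespace declarations, and the summarise function is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace (assumption): the slide does not show what "bx" is bound to.
BX = "urn:example:bioxsd-placeholder"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened copy of the record above, with namespace declarations added.
RECORD = f"""\
<exampleSequenceRecord xmlns:bx="{BX}" xmlns:xsi="{XSI}"
                       xsi:type="bx:NucleotideSequenceRecord">
  <bx:sequence>gtgcgagaggcccgtgccgccgtgcgcgctgcctacgaggctttctgccgctggagggaggtc</bx:sequence>
  <bx:species dbName="NCBI Taxonomy" accession="9598" speciesName="Chimp"/>
  <bx:reference dbName="GenBank/Nucleotide" accession="NM_001008991" sequenceVersion="1">
    <bx:subsequencePosition>
      <bx:segment min="282" max="345"/>
    </bx:subsequencePosition>
  </bx:reference>
</exampleSequenceRecord>
"""


def summarise(xml_text):
    """Pull a few fields out of a sequence record (hypothetical helper)."""
    ns = {"bx": BX}
    root = ET.fromstring(xml_text)
    sequence = root.findtext("bx:sequence", namespaces=ns).strip()
    species = root.find("bx:species", ns)
    reference = root.find("bx:reference", ns)
    segment = reference.find("bx:subsequencePosition/bx:segment", ns)
    return {
        "sequence": sequence,
        "species": species.get("speciesName"),
        "accession": reference.get("accession"),
        "segment": (int(segment.get("min")), int(segment.get("max"))),
    }


if __name__ == "__main__":
    print(summarise(RECORD))
```

In practice, code generated from the XML Schema by a data-binding tool would replace such hand-written lookups.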
Sequence record in BioXSD 1.1 beta1:
Sequence-string restriction in BioXSD:
<xs:simpleType name="NucleotideSequence" sawsdl:modelReference="http://purl.org/edam/data/0001211">
  <xs:annotation>
    <xs:documentation>Nucleotide sequence without ambiguous ("degenerate") bases</xs:documentation>
  </xs:annotation>
  <xs:restriction base="GenericNucleotideSequence">
    <xs:pattern value="[acgt]+"/>
    <xs:pattern value="[acgu]+"/>
  </xs:restriction>
</xs:simpleType>
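The effect of the two pattern facets can be mimicked outside a schema validator. In XML Schema, a pattern has to match the whole value, and several pattern facets in the same restriction are alternatives. The Python sketch below reproduces only these two patterns; any constraints inherited from the base type are not shown on the slide and are therefore not covered, and the function name is made up for illustration.

```python
import re

# Patterns copied from the xs:pattern facets of the NucleotideSequence type.
# XSD patterns match the whole value, and pattern facets in one restriction
# are alternatives, hence fullmatch() combined with any().
NUCLEOTIDE_PATTERNS = [re.compile(r"[acgt]+"), re.compile(r"[acgu]+")]


def is_nucleotide_sequence(value: str) -> bool:
    """Check a string against the two patterns only (hypothetical helper)."""
    return any(p.fullmatch(value) for p in NUCLEOTIDE_PATTERNS)


if __name__ == "__main__":
    for value in ["gtgcgagagg", "acgu", "acgtu", "acgn", "ACGT"]:
        print(value, is_nucleotide_sequence(value))
```

Note that a value mixing t and u, such as acgtu, fails both patterns: a sequence is accepted as either DNA or RNA, but not as a mixture of the two.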
Strategy
recommended by the EMBRACE network of excellence (2005 - 2010)
The EMBRACE project partners:
EMBL-EBI, Hinxton, UK; EMBL, Heidelberg, Germany; ITB, CNR, Bari, Italy; University of Manchester, UK; SIB, Geneva, Switzerland; SLU, Sweden; CNRS, Clermont-Ferrand and Lyon, France; CBS, DTU, Lyngby, Denmark; CSIC, Madrid, Spain; University of Stockholm, Sweden; INRIA-UCBL, Lyon, France; MPIMG, Berlin, Germany; CSC, Espoo, Finland; UCL, London, UK; The Weizmann Institute, Rehovot, Israel; University of Nijmegen, Netherlands; INTA, Madrid, Spain; CBU, BCCS, Bergen, Norway