

1: Mathématique et Informatique Appliquées du Génome à l’Environnement (MaIAGE) - Jouy-en-Josas - INRA; 2: Plant Resistance to Pests and Diseases (RPB) - Montpellier - IRD; 3: Génétique Physiologie et Systèmes d’Elevage (GenPhySE) - Castanet Tolosan - INRA; 4: Unité de Nutrition Humaine (UNH) - Clermont-Ferrand - INRA et Université d’Auvergne; 5: Unité de recherche Virologie et Immunologie Moléculaires (VIM) - Jouy-en-Josas - INRA; 6: Plateforme MetaToul-AXIOM - Toulouse - INRA

WHAT’S NEXT

We are currently working on a new use case, “miRDeep2”: a pipeline built from the miRDeep2 tool that was never finished. We are installing it on a test Galaxy instance and will then repeat the work done for the other projects, namely: ensure the wrappers follow the best practices; make sure they can be shared with Conda package dependencies; publish them on the Toolshed; install them on a Galaxy instance; and possibly create a container and organize a training session. For SNiPlay, we are also handling other pipelines, such as the Haplophyles analysis one.

6. Current deliverables

4. Progress

Over the past 21 months we have set up methods for the overall project and applied them mainly to 4 use cases (subprojects).

For each use case we have 4 main deliverables:

- the packaged tools (wrapper for Galaxy and packages to install dependencies);
- an instance easily accessible to users;
- a use case worked through with someone who knows the tool well;
- documentation and/or training material ready to use.

DELIVERABLES

5. Project deliverables

The first use case is the “REPET” suite, which detects, annotates and analyses repeats in genomic sequences (it is specifically designed for transposable elements). For this tool we finalized wrappers that allow the two “REPET” pipelines to run on Galaxy (in a lightweight form). Difficulties came from the large number of dependencies and the need for an HPC cluster to run these pipelines. We succeeded in installing it on Galaxy instances on the IFB cloud and on a URGI virtual machine. We then ran a training session on the URGI Galaxy machine, which allowed us to validate the tool and draw up a list of new features to improve the current version. A second development cycle delivered a new version of REPET, which was used successfully in a second training session.

PROGRESS RESULTS

3. Tool integration process

BEST PRACTICES

The Galaxy best practices have been established by the community to improve the quality of wrapper development. Following them harmonizes tool architecture, eases tool maintenance and assures users of tool quality. They cover how to build a wrapper with the essential markup, how to distribute tools through the Toolshed, and how to install tools with dependency management.

Based on these practices we designed this tool integration process, with two technical deliverable levels.

TECHNOLOGIES

Alongside this process we apply some technologies commonly used by the community:

- Planemo, a helper program for building Galaxy tools. It eases development by providing wrapper skeletons (templates) and checking the syntax of the XML files. It can also automate tool testing and publication on the Toolshed.

- Conda, a dependency and environment management technology used by Galaxy developers to manage tool packaging. It is now the preferred way to manage dependencies in recent Galaxy versions. There is also a repository of packages for biological tools, named “Bioconda”.

- Docker, a container manager that provides lightweight containers, akin to light virtual machines. Its value is in making tools easy to share in a closed, ready-to-use environment.
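
To make the "essential markup" and the Conda-style dependency declaration more concrete, here is a minimal Python sketch. It is not Planemo's own checker and not the project's code: the tool name "mytool", its command line, the file names, the list of elements treated as essential and the function check_wrapper are all assumptions made for illustration. It parses a small, hypothetical Galaxy wrapper and reports elements that are missing, as well as package requirements without a pinned version.

```python
import xml.etree.ElementTree as ET

# Elements this sketch treats as essential in a wrapper (an assumption for
# illustration, not an official list).
ESSENTIAL_ELEMENTS = ("requirements", "command", "inputs", "outputs", "tests", "help")

# Hypothetical wrapper for an imaginary command-line tool called "mytool".
EXAMPLE_WRAPPER = """
<tool id="mytool" name="My tool" version="0.1.0">
    <requirements>
        <requirement type="package" version="1.0">mytool</requirement>
    </requirements>
    <command><![CDATA[mytool --input '$input' --output '$output']]></command>
    <inputs>
        <param name="input" type="data" format="fasta" label="Input sequences"/>
    </inputs>
    <outputs>
        <data name="output" format="tabular"/>
    </outputs>
    <tests>
        <test>
            <param name="input" value="input.fasta"/>
            <output name="output" file="expected.tsv"/>
        </test>
    </tests>
    <help>Runs the hypothetical mytool on a FASTA file.</help>
</tool>
"""


def check_wrapper(xml_text):
    """Return a list of problems found in a wrapper (empty list if none)."""
    root = ET.fromstring(xml_text)
    if root.tag != "tool":
        return ["root element is <%s>, expected <tool>" % root.tag]

    problems = []
    # The tool element itself should carry an identifier, a name and a version.
    for attribute in ("id", "name", "version"):
        if not root.get(attribute):
            problems.append("missing '%s' attribute on <tool>" % attribute)
    # Every element treated as essential must be a direct child of <tool>.
    for element in ESSENTIAL_ELEMENTS:
        if root.find(element) is None:
            problems.append("missing <%s> element" % element)
    # Package requirements should pin a version so installs are reproducible.
    for requirement in root.findall("requirements/requirement"):
        if requirement.get("type") == "package" and not requirement.get("version"):
            name = (requirement.text or "").strip()
            problems.append("requirement '%s' has no pinned version" % name)
    return problems


if __name__ == "__main__":
    for line in check_wrapper(EXAMPLE_WRAPPER) or ["no problems found"]:
        print(line)
```

In practice Planemo performs this kind of check, and much more, on real wrappers; the sketch only shows the sort of structure such a check looks for.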

1. Project origin

Accessibility, Reproducibility and Transparency are the guiding principles of the web-based platform Galaxy. These three principles have led the Galaxy community to create good practices for integrating tools into the platform. Today, a developer has clear guidelines for wrapping a tool for Galaxy (XML file, packaging).

2. Partner communities

A.R.T.

FRENCH COMMUNITY

GFLS PROJECT

The French community was among the early users of Galaxy. Several tool wrappers have been developed since 2010. These wrappers have a heterogeneous, non-standardised architecture because the methods took considerable time to stabilize. Once they had stabilized, the French community adopted the Galaxy good practices.

Consequently, the French Galaxy working group of the French Bioinformatics Institute (IFB) established the project “Galaxy For Life Science” (GFLS). This project aims to upgrade and showcase wrappers for several French scientific communities (Plant, Livestock, Microbe). These enhancements are made according to these good practices, tools and methods.

Valentin MARCON [email protected]

Institut National de la Recherche Agronomique (INRA), Unité Mathématiques & Informatique Appliquées du Génome à l'Environnement, Plateforme bioinformatique Migale, Domaine de Vilvert, 78352 Jouy-en-Josas

Project timeline: 21 months elapsed, 8 months remaining. Partner communities: Plant, Livestock, Microbe.

Taking tools out of their laboratory: Methods and practices to make Bioinformatics tools accessible through Galaxy

Valentin Marcon* (1), Alexis Dereeper (2), Sarah Maman-Haddad (3), Melanie Petera (4), Luc Jouneau (5), Marie Tremblay-Franco (6), and Olivier Inizan (1)

The second use case, “BIOS4BIOL”, was a set of wrappers built by members of a statistical working group at INRA. We standardized the wrappers and created a new one that performs data normalization, a feature common to the old wrappers. We worked with the original developers of the tools and, to ease our collaboration, ran a training session on the version control tool “git”. We also managed dependencies with Conda packages. The tools are now available in the “Genotool” Galaxy instance and in the Galaxy Toolshed.

The third use case is “SNiPlay”, a tool for SNP detection, management and analysis. It is a pipeline of wrappers that had been published on the Toolshed but needed to be completed. In this project we managed dependencies with Conda, and for that we created two new Conda packages: the Readseq and sNMF packages are now available through the Bioconda repository. We then updated the tools on the Toolshed. Finally, we created a Docker image that includes the SNiPlay pipeline tools, other complementary tools, a visualization plugin and the ready-to-use workflow.

Olivier INIZAN [email protected]

@valentin_marcon

@OlivierInizan