chilton, j.1* 2 1 2 3galaxyp.org/wp-content/uploads/2017/08/building...building proteomic...

1
Building Proteomic Application Platforms for Cloud Computing Environments with CloudBioLinux Chilton, J. 1* , Zenka, R. 2 , Jagtap, P. 1 , Lynch, B. 1 , Bergen III, H.R. 2 , Griffin, T.J. 3 1 University of Minnesota Supcomputing Institute; 2 Mayo Clinic; 3 University of Minnesota CloudBioLinux for Data Analysis I CloudBioLinux is an open-source framework for creating fully automated installation mechanisms for bioinformatics software and data. I We have contributed numerous extensions and enhancements to CloudBioLinux to make a great environment for building mass spec data analysis platforms. I With emerging proteomics applications such as proteogenomics, it is becoming essential to build applications platforms that tie together traditional proteomic analyses with other bioinformatic analyses (such as sequence similarity analysis or genomic mapping). CloudBioLinux is an ideal platform for creating such platforms. I The core CloudBioLinux framework is developed and maintained by Brad Chapman, more information can be found at http://cloudbiolinux.org/. https://biocloudcentral.msi.umn.edu/ - Create Your Own Cluster I Spin up your own cluster created with CloudBioLinux and our customizations for mass spec data analysis today at https://biocloudcentral.msi.umn.edu. I All you need is your Amazon Web Services (AWS) credentials and BioCloudCentral will orchestrate the creation of a cluster on Amazon EC2 for your data analysis. I Cluster is preconfigured with CloudMan an easy-to-use interface for managing your new Amazon cluster. I This cloud image comes prebundled with applications, programming libraries, and GUI tools described below. Applications Large Proteomic Tool Suites Trans proteomic-pipeline, OpenMS, crux Database Search Tools Myrimatch, X! Tandem, OMSSA, ... Specialized Identification Tools TagRecon, PepNovo, ... Validation Tools Percolator, Fido, Mayu, ... Bioinformatics NCBI Blast+, EMBOSS, Augustus, ... Tools Developed at the University of Minnesota iQuant (isobaric quantification), psm-eval (flexible re-evaluation of identified peptide-spectrum matches), peptide-to-gff (map peptides for genomic visualization) Programming Libraries pyteomics ”Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis...” https://pypi.python.org/pypi/pyteomics mspire ”Mspire is a full featured library for working with mass spectrometry data, particularly proteomic, metabolomic and lipidomic data sets.” https://github.com/princelab/mspire R Our preconfigured CloudBioLinux image comes preinstalled with various useful R libraries proteomic data analysis (e.g. xcms, mzR, FactoMineR, caret, ggplot2, VennDiagram, ...) Graphical Applications Figure: PeptideShaker running in a CloudBioLinux environment. More information on PeptideShaker can be found at http://code.google.com/p/peptide-shaker/ The example CloudBioLinux environment comes bundled with user-friendly graphical Desktop applications as well, including: I MZmine I PeptideShaker and SearchGUI I TOPPAS I PRIDE Converter I PRIDEInspector Figure: OpenMS TOPPAS workflow editor running in a CloudBioLinux environment. More information on OpenMS can be found at http://open-ms.sourceforge.net/ CloudBioLinux for Platform Development Check out you own copy of CloudBioLinux on Github at https://github.com/chapmanb/cloudbiolinux and build a customized flavor for your mass spec data analysis platform. I Simple YAML data files describe what software, libraries, and data to install. I Integrate your own applications (big or small) using native OS packages, fabric functions, Puppet modules, or Chef cookbooks. CloudBioLinux as a Platform - wine I Wine is a compatibility layer allowing many Windows applications to run other other operating systems (such as Linux). I We have built a high level framework for packaging and redistributing Wine enviornments. https://github.com/jmchilton/proteomics-wine-env I Includes documentation for creating such a Wine environment and configuring msconvert to work with vendor libraries. I CloudBioLinux includes support for distributing such environments and creating friendly wrapper scripts along with install procedures for proteomics programs such as msconvert and multiplierz. Figure: Screenshot of the Windows-only software Morpheus running under Linux using Wine. More information on the Morpheus can be found here: http://www.chem.wisc.edu/ coon/software.php CloudBioLinux as a Platform - Galaxy-P I Galaxy-P (http://getgalaxyp.org) is an extension to the popular Galaxy (http://galaxyproject.org) framework to enable proteomics workflows. I The example image that can be launched with CloudBioLinux comes preconfigured with an a subset of Galaxy-P. I The multi-omics systems biology nature of Galaxy-P demonstrates the utility of building on CloudBioLinux and the complexity of Galaxy-P demonstrates the necessity for automating such deployments. I Our internal and publicly accessible Galaxy-P servers (https://usegalaxyp.org) are both hosted on a private OpenStack cloud infrastructure and running images built with CloudBioLinux. CloudBioLinux as a Platform - Swift I Swift (https://github.com/romanzenka/swift) is a web interface and workflow engine capable of running Mascot, Sequest, X!Tandem, Myrimatch, OMSSA, Scaffold and IDPicker + extra utilities like msmseval. It provides an unified interface for submitting searches through all the engines simultaneously. I The CloudBioLinux version of Swift is fully open-source, utilizing only the free tools like msconvert, X!Tandem, Myrimatch and IDPicker. This enables Swift to run on as many machines as possible, using Sun Grid Engine for job submission. * [email protected] 61st ASMS

Upload: others

Post on 26-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chilton, J.1* 2 1 2 3galaxyp.org/wp-content/uploads/2017/08/BUilding...Building Proteomic Application Platforms for Cloud Computing Environments with CloudBioLinux Chilton, J.1*, Zenka,

Building Proteomic Application Platforms for Cloud Computing Environments with CloudBioLinuxChilton, J.1*, Zenka, R.2, Jagtap, P.1, Lynch, B.1, Bergen III, H.R.2, Griffin, T.J.3

1University of Minnesota Supcomputing Institute; 2Mayo Clinic; 3University of Minnesota

CloudBioLinux for Data Analysis

I CloudBioLinux is an open-source framework for creating fully automated installation mechanisms for bioinformatics software and data.I We have contributed numerous extensions and enhancements to CloudBioLinux to make a great environment for building mass spec data

analysis platforms.I With emerging proteomics applications such as proteogenomics, it is becoming essential to build applications platforms that tie together

traditional proteomic analyses with other bioinformatic analyses (such as sequence similarity analysis or genomic mapping). CloudBioLinux isan ideal platform for creating such platforms.

I The core CloudBioLinux framework is developed and maintained by Brad Chapman, more information can be found athttp://cloudbiolinux.org/.

https://biocloudcentral.msi.umn.edu/ - Create Your Own Cluster

I Spin up your own cluster created with CloudBioLinux and our customizations for mass spec data analysistoday at https://biocloudcentral.msi.umn.edu.

I All you need is your Amazon Web Services (AWS) credentials and BioCloudCentral will orchestrate thecreation of a cluster on Amazon EC2 for your data analysis.

I Cluster is preconfigured with CloudMan an easy-to-use interface for managing your new Amazon cluster.I This cloud image comes prebundled with applications, programming libraries, and GUI tools described below.

Applications

Large Proteomic Tool SuitesTrans proteomic-pipeline, OpenMS, crux

Database Search ToolsMyrimatch, X! Tandem, OMSSA, ...

Specialized Identification ToolsTagRecon, PepNovo, ...

Validation ToolsPercolator, Fido, Mayu, ...

BioinformaticsNCBI Blast+, EMBOSS, Augustus, ...

Tools Developed at the University of MinnesotaiQuant (isobaric quantification), psm-eval (flexible re-evaluation ofidentified peptide-spectrum matches), peptide-to-gff (map peptides forgenomic visualization)

Programming Libraries

pyteomics”Pyteomics provides a growing set of modules to facilitatethe most common tasks in proteomics data analysis...”https://pypi.python.org/pypi/pyteomics

mspire”Mspire is a full featured library for working with massspectrometry data, particularly proteomic, metabolomicand lipidomic data sets.”https://github.com/princelab/mspire

ROur preconfigured CloudBioLinux image comespreinstalled with various useful R libraries proteomic dataanalysis (e.g. xcms, mzR, FactoMineR, caret, ggplot2,VennDiagram, ...)

Graphical Applications

Figure: PeptideShaker running in a CloudBioLinuxenvironment. More information on PeptideShaker can befound at http://code.google.com/p/peptide-shaker/

The example CloudBioLinux environmentcomes bundled with user-friendly graphicalDesktop applications as well, including:

I MZmineI PeptideShaker and SearchGUII TOPPASI PRIDE ConverterI PRIDEInspector

Figure: OpenMS TOPPAS workflow editor running in aCloudBioLinux environment. More information on OpenMScan be found at http://open-ms.sourceforge.net/

CloudBioLinux for Platform Development

Check out you own copy of CloudBioLinux on Github at https://github.com/chapmanb/cloudbiolinux and build a customized flavor for your massspec data analysis platform.

I Simple YAML data files describe what software, libraries, and data to install.I Integrate your own applications (big or small) using native OS packages, fabric functions, Puppet modules, or Chef cookbooks.

CloudBioLinux as a Platform - wine

I Wine is a compatibility layer allowing many Windows applications to run otherother operating systems (such as Linux).

I We have built a high level framework for packaging and redistributing Wineenviornments. https://github.com/jmchilton/proteomics-wine-envI Includes documentation for creating such a Wine environment and configuring msconvert to

work with vendor libraries.

I CloudBioLinux includes support for distributing such environments and creatingfriendly wrapper scripts along with install procedures for proteomics programs suchas msconvert and multiplierz. Figure: Screenshot of the Windows-only software Morpheus running

under Linux using Wine. More information on the Morpheus can befound here: http://www.chem.wisc.edu/ coon/software.php

CloudBioLinux as a Platform - Galaxy-P

I Galaxy-P (http://getgalaxyp.org) is an extension to the popular Galaxy(http://galaxyproject.org) framework to enable proteomics workflows.

I The example image that can be launched with CloudBioLinux comespreconfigured with an a subset of Galaxy-P.

I The multi-omics systems biology nature of Galaxy-P demonstrates the utilityof building on CloudBioLinux and the complexity of Galaxy-P demonstratesthe necessity for automating such deployments.

I Our internal and publicly accessible Galaxy-P servers(https://usegalaxyp.org) are both hosted on a private OpenStack cloudinfrastructure and running images built with CloudBioLinux.

CloudBioLinux as a Platform - Swift

I Swift (https://github.com/romanzenka/swift) is a webinterface and workflow engine capable of running Mascot,Sequest, X!Tandem, Myrimatch, OMSSA, Scaffold andIDPicker + extra utilities like msmseval. It provides an unifiedinterface for submitting searches through all the enginessimultaneously.

I The CloudBioLinux version of Swift is fully open-source,utilizing only the free tools like msconvert, X!Tandem,Myrimatch and IDPicker. This enables Swift to run on asmany machines as possible, using Sun Grid Engine for jobsubmission.

*[email protected]

61st ASMS