chi next gen-ntino-krampis

Cloud BioLinux: Pre-Configured and On-Demand High Performance Computing for the Genomics Community. Ntino Krampis, PhD. Next-Gen Sequence Data Management '10, Providence, RI

Upload: ntino-krampis

Post on 11-May-2015


Page 1: Chi next gen-ntino-krampis

Cloud BioLinux: Pre-Configured and On-Demand

High Performance Computing for the Genomics Community

Ntino Krampis, PhD

Next-Gen Sequence Data Management '10, Providence, RI

Page 2: Chi next gen-ntino-krampis

Expensive sequencing, computing and large organizations

● multi-million, broad-impact sequencing projects

● large sequencing center, with a dedicated bioinformatics department

● large-scale computations on SGE cluster, algorithm acceleration hardware

Page 3: Chi next gen-ntino-krampis

Bench-top, commodity sequencing and small labs

● small form-factor sequencer available: GS Junior by 454

● sequencing as a standard technique in basic biology and genetics research

● remember microarrays and lengthy assays for protein interactions?

● RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome

Page 4: Chi next gen-ntino-krampis

Will small labs become the long tail of sequencing?

● downstream bioinformatic analysis required for biological discovery

● basic analysis example: large-scale BLAST to public DBs (try 0.5GB at NCBI)

● small labs do not have the hardware, expertise, or time to install and run software locally

[Figure: long-tail curve. Axes: amount of sequencing, number of labs. Credit: Wikimedia Commons]

Page 5: Chi next gen-ntino-krampis

Cloud Biolinux: pre-configured and on-demand bioinformatics on the cloud

● a public virtual machine (VM) on EC2 with 100+ bioinformatics tools

● how it came to be, what it offers for sequence analysis

● where and how do I run it, especially if I am not a computer expert

● modifying and sharing VM configurations and data with your peers

● openness and community around Cloud Biolinux

Page 6: Chi next gen-ntino-krampis

Cloud Biolinux

The Biolinux part

[Diagram: tinyurl.com/BioLinux-NEBC + tinyurl.com/CloudBioLinux-JCVI = Cloud Biolinux]

● an Ubuntu Linux desktop for bioinformatics

● NEBC packages the software and maintains the repository

● Ubuntu AMI on EC2, pull packages from repository

● additional software of interest to JCVI

Page 7: Chi next gen-ntino-krampis

Cloud Biolinux: what comes in the box

● glimmer, hmmer, phylip, rasmol, genespring, clustalw, EMBOSS

● mpiBLAST clusters using EC2 virtual machine instances

● Celera whole genome shotgun assembler

● NX remote desktop, easy to use for benchtop scientists

Page 8: Chi next gen-ntino-krampis

Cloud Biolinux

The Cloud part

● find our VM on Amazon EC2:

Biolinux 5.0 packages (32-bit): ami-6953b200

Biolinux 6.0 packages (64-bit): ami-6011e409, EBS based

● 17 GB RAM / 6 core instances cost 0.5$ / hour, see aws.amazon.com/ec2/pricing

● a small bacterial genome assembly costs a little over 2$

● up to 68 GB RAM / 26 cores, EBS volumes up to 1000 GB in size (0.10$ / GB / month)

● make a copy of our public Biolinux AMI, add your data, make it private
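The hourly pricing above can be turned into a quick estimate. The following is a minimal sketch using only the 0.5$ / hour rate quoted on the slide; the function names and the 5-hour run length are illustrative assumptions, and current prices should be checked at aws.amazon.com/ec2/pricing.

```python
from decimal import Decimal

# Rate quoted on the slide for the 17 GB / 6 core instance (2010 pricing).
INSTANCE_RATE_PER_HOUR = Decimal("0.50")


def instance_cost(hours, rate=INSTANCE_RATE_PER_HOUR):
    """Dollar cost of running one instance for a number of billed hours."""
    return rate * hours


def hours_affordable(budget, rate=INSTANCE_RATE_PER_HOUR):
    """Whole instance-hours that fit within a dollar budget."""
    return int(Decimal(budget) // rate)


if __name__ == "__main__":
    # Hypothetical 5-hour assembly run, i.e. "a little over 2$".
    print(instance_cost(5))
    print(hours_affordable("2.00"))
```

Decimal is used instead of floats so that small currency amounts add up exactly.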

Page 9: Chi next gen-ntino-krampis

Cloud Biolinux http://tinyurl.com/cloud-biolinux-tutorial (credit to the NEBC team)

simply sign up at aws.amazon.com, then go to aws.amazon.com/console

Page 10: Chi next gen-ntino-krampis

Cloud Biolinux http://tinyurl.com/cloud-biolinux-tutorial (credit to the NEBC team)

● find Cloud Biolinux AMI using ID

● enter desired password for remote desktop login

● leave all other settings at their defaults
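The console steps above can also be scripted. This is a rough sketch, not the tutorial's method: it assumes the present-day boto3 library, assumes the remote desktop password is passed to the VM as instance user data, and uses "m2.xlarge" and "us-east-1" as illustrative values. The AMI ID is the one listed on the earlier slide and may no longer be available.

```python
def build_launch_request(ami_id, instance_type, desktop_password):
    """Build keyword arguments for the boto3 EC2 client's run_instances call."""
    if not ami_id.startswith("ami-"):
        raise ValueError("expected an AMI ID such as 'ami-6011e409'")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Assumption: the VM reads the desktop password from instance user data.
        "UserData": desktop_password,
    }


def launch(request, region="us-east-1"):
    """Start the instance and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    request = build_launch_request("ami-6011e409", "m2.xlarge", "choose-a-password")
    print(request)
```

Calling launch(request) would start a billed instance, so the sketch only prints the request by default.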

Page 11: Chi next gen-ntino-krampis

● get remote desktop client: nomachine.com/download.php

● simply enter VM's IP address and your password

Page 12: Chi next gen-ntino-krampis
Page 13: Chi next gen-ntino-krampis

What if I want to share my alignments with a collaborator?

save your data as a new AMI

EBS costs 0.10$ / GB / month

at 15GB, it costs 1.5$ / month
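The storage arithmetic can be checked with a short sketch. It uses only the 0.10$ / GB / month rate from the slide; the function name is an illustrative assumption.

```python
from decimal import Decimal

# EBS storage rate quoted on the slide (2010 pricing).
EBS_RATE_PER_GB_MONTH = Decimal("0.10")


def ebs_monthly_cost(size_gb, rate=EBS_RATE_PER_GB_MONTH):
    """Monthly dollar cost of keeping size_gb of data on EBS."""
    return rate * size_gb


if __name__ == "__main__":
    print(ebs_monthly_cost(15))    # the 15 GB example from the slide
    print(ebs_monthly_cost(1000))  # the largest volume size mentioned earlier
```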

Page 14: Chi next gen-ntino-krampis

share your data: public or with another AWS user

users with access can boot the AMI with all the software + data
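Sharing can be scripted as well. The sketch below assumes the present-day boto3 library and its modify_image_attribute call; the helper names, the region, and the account ID in the usage line are placeholders, not values from the talk.

```python
def launch_permission(share_with=None):
    """LaunchPermission payload: public when share_with is None,
    otherwise limited to the given AWS account IDs."""
    if share_with is None:
        return {"Add": [{"Group": "all"}]}
    return {"Add": [{"UserId": account_id} for account_id in share_with]}


def share_ami(ami_id, share_with=None, region="us-east-1"):
    """Grant launch permission on an AMI (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the payload helper works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.modify_image_attribute(
        ImageId=ami_id,
        LaunchPermission=launch_permission(share_with),
    )


if __name__ == "__main__":
    # Placeholder account ID for a collaborator.
    print(launch_permission(["123456789012"]))
```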

Page 15: Chi next gen-ntino-krampis

Cloud Biolinux

The Cloud part

● run Cloud Biolinux on your private cloud?

● Eucalyptus open source cloud platform

● API identical to EC2, without the usage charges

● easy to set up on your lab's cluster, comes with Ubuntu server (UEC)

● download VMs from Sourceforge ( tinyurl.com/CloudBiolinux-SF )

open.eucalyptus.com

Page 16: Chi next gen-ntino-krampis

Cloud Biolinux

● porting VMs across cloud platforms is not trivial

● moving Cloud Biolinux VMs from EC2 to Eucalyptus involves the Xen kernel and boot sector

● framework to share VM configurations ( tinyurl.com/bootstrap-cloudbiolinux )

● based on the python-fabric automated deployment tool

● simply edit the software list files and share with collaborators

● they start with a fresh VM, and python-fabric replicates the VM setup on their cloud

tinyurl.com/python-fabric
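The idea of replicating a setup from software list files can be sketched in a few lines. This is not the project's actual code: the file format (one package per line, "#" for comments) and both function names are assumptions, and the resulting command would be handed to a deployment tool such as python-fabric to run on the fresh VM.

```python
import shlex


def parse_software_list(text):
    """Return package names from a list file: one per line, '#' starts a comment."""
    packages = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            packages.append(name)
    return packages


def install_command(packages):
    """Build the shell command a deployment tool would run on the fresh VM."""
    return "apt-get install -y " + " ".join(shlex.quote(p) for p in packages)


if __name__ == "__main__":
    example = "# alignment tools\nclustalw\nhmmer  # profile HMMs\n"
    print(install_command(parse_software_list(example)))
```

Because the list file is plain text, collaborators can edit and share it without touching the deployment code.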

Page 17: Chi next gen-ntino-krampis

Cloud Biolinux

Collaboration and open source

high-level configuration describing software groups

for each group, a list of individual software packages

simply edit the files to change the VM configuration

tinyurl.com/CloudBioLinux-github
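The two-level layout (software groups, then packages per group) might look like the sketch below. The group names, the dictionary structure, and the resolve_packages helper are hypothetical; the real file layout is in the GitHub repository.

```python
# Hypothetical high-level configuration: which software groups to install.
MAIN_CONFIG = {"groups": ["alignment", "phylogeny"]}

# Hypothetical per-group package lists, using tools named in the talk.
GROUP_PACKAGES = {
    "alignment": ["clustalw", "hmmer"],
    "phylogeny": ["phylip"],
    "visualization": ["rasmol"],
}


def resolve_packages(main_config, group_packages):
    """Expand the selected groups into an ordered, duplicate-free package list."""
    selected = []
    for group in main_config["groups"]:
        if group not in group_packages:
            raise KeyError(f"unknown software group: {group}")
        for package in group_packages[group]:
            if package not in selected:
                selected.append(package)
    return selected


if __name__ == "__main__":
    print(resolve_packages(MAIN_CONFIG, GROUP_PACKAGES))
```

Adding "visualization" to the groups list is then the only change needed to include rasmol in the next VM build.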


Page 18: Chi next gen-ntino-krampis

Cloud Biolinux

The community

● from JCVI and NEBC to an open-source, community-based project

● community initiated during a tele-conference meeting at SC '09, Portland, OR

● first meeting last July in Boston, tinyurl.com/openbio-codefest-2010

● work done: 64-bit AMIs, NX remote desktop, setting up the fabric framework

● next year's meeting at ISMB/BOSC in Vienna, Austria ( http://metalab.at/ )

● cloudbiolinux.com and, most importantly, tinyurl.com/cloudbiolinux-lists

Page 19: Chi next gen-ntino-krampis

Cloud Biolinux

The future

● expand community, receive feedback, add more software to the VM

● genome assemblers, high-memory EC2 instances up to 68GB RAM

● Hadoop / MapReduce (for those running the VM in private clouds)

● analysis pipelines that are used by large sequencing centers

● actively seeking funding to put major effort into development

● tinyurl.com/cloudbiolinux-lists or [email protected]

Page 20: Chi next gen-ntino-krampis

Acknowledgments & Credits

Brad Chapman - development of the fabric scripts and community organizer

Tim Booth, Bela Tiwari – BioLinux 6.0 development and EC2 documentation

Deepak Singh and AWS - education grant supporting codefest workshop

Justin Johnson – community and sponsorship of cloudbiolinux.com

J. Craig Venter Inst. - time allowed to work on an open-source project

D. Gomez, E. Navarro, J. Shao, I. Singh – JCVI technology innovation

Members of the Cloud Biolinux community:

Enis Afgan, Michael Heuer, Richard Holland, Mark Jensen, Dave Messina, Steffen Möller, Roman Valls

Thank you!