Globus Genomics: How Science-as-a-Service Is Accelerating Discovery (BDT310) | AWS re:Invent 2013
DESCRIPTION
"In this talk, hear about two high-performance research services developed and operated by the Computation Institute at the University of Chicago running on AWS. Globus.org, a high-performance, reliable, robust file transfer service, has over 10,000 registered users who have moved over 25 petabytes of data using the service. The Globus service is operated entirely on AWS, leveraging Amazon EC2, Amazon EBS, Amazon S3, Amazon SES, Amazon SNS, etc. Globus Genomics is an end-to-end next-gen sequencing analysis service with state-of-the-art research data management capabilities. Globus Genomics uses Amazon EC2 for scaling out analysis, Amazon EBS for persistent storage, and Amazon S3 for archival storage. Attend this session to learn how to move data quickly at any scale as well as how to use genomic analysis tools and pipelines for next generation sequencers using Globus on AWS."
TRANSCRIPT
Science as a Service
Ian Foster, The University of Chicago and Argonne National Laboratory
November 14, 2013
A time of disruptive change
Most labs have limited resources
Heidorn: NSF grants in 2007. Awards < $350,000: 80% of awards, 50% of grant $$
[Chart: vertical axis from $1,000 to $1,000,000; horizontal axis from 2000 to 8000]
Automation is required to apply more sophisticated methods to far more data
Outsourcing is needed to achieve economies of scale in the use of automated methods
Building a discovery cloud
• Identify time-consuming activities amenable to automation and outsourcing
• Implement as high-quality, low-touch SaaS
• Leverage IaaS for reliability, economies of scale
• Extract common elements as research automation platform
Bonus question: Sustainability
Software as a service
Platform as a service
Infrastructure as a service
We aspire (initially) to create a great user experience for research data management
What would a “dropbox for science” look like?
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
BIG DATA
[Diagram: data moving among Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, and Mirror, with failure alerts: "Quota exceeded", "Expired credentials", "Network failed. Retry.", "Permission denied"]
It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA … but in reality it’s often very challenging
• Move • Sync • Share
Capabilities delivered using Software-as-a-Service (SaaS) model
[Diagram: Data Source, Data Destination]
1. User initiates transfer request
2. Globus Online moves/syncs files
3. Globus Online notifies user
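The three transfer steps can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the Globus Online implementation or API: `TransferTask`, `copy_file`, `run_transfer`, and `notify` are hypothetical names, and the random failure model only stands in for an unreliable network.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TransferTask:
    """Hypothetical record of one transfer request (not a Globus API type)."""
    source: str
    destination: str
    files: list
    attempts: dict = field(default_factory=dict)
    status: str = "PENDING"


def copy_file(name, rng, failure_rate):
    """Stand-in for moving one file; fails at random to mimic a flaky network."""
    return rng.random() >= failure_rate


def notify(task):
    """Step 3: build the message the user would receive."""
    return f"Transfer {task.source} -> {task.destination}: {task.status}"


def run_transfer(task, max_retries=5, failure_rate=0.3, seed=0):
    """Step 2: move every file, retrying failed copies, then notify the user."""
    rng = random.Random(seed)
    for name in task.files:
        for attempt in range(1, max_retries + 1):
            task.attempts[name] = attempt
            if copy_file(name, rng, failure_rate):
                break
        else:
            # Every retry for this file failed, so the whole task fails.
            task.status = "FAILED"
            return notify(task)
    task.status = "SUCCEEDED"
    return notify(task)


# Step 1: the user submits a request and walks away.
task = TransferTask("lab-endpoint", "cluster-endpoint", ["a.fastq", "b.fastq"])
print(run_transfer(task))
```

The point of the design is that retry logic lives in the service, so the user only sees the request and the final notification.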
[Diagram: Data Source]
1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
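One way to picture sharing in place is as a set of access rules kept by the service while the files stay on the owner's storage. The sketch below assumes a simple per-path permission table; `SharedEndpoint` and its methods are hypothetical and do not reflect how Globus Online stores sharing state.

```python
from dataclasses import dataclass, field


@dataclass
class SharedEndpoint:
    """Hypothetical model: files stay where they are; only access rules are tracked."""
    owner: str
    files: set
    acl: dict = field(default_factory=dict)  # path -> {user or group: permission}

    def share(self, path, principal, permission="r"):
        """Step 1: the owner grants a user or group access to an existing file."""
        if path not in self.files:
            raise FileNotFoundError(path)
        if permission not in ("r", "rw"):
            raise ValueError(f"unknown permission: {permission}")
        self.acl.setdefault(path, {})[principal] = permission

    def can_read(self, path, user, groups=()):
        """Step 3: check whether a logged-in user may access the file."""
        if path not in self.files:
            return False
        if user == self.owner:
            return True
        rules = self.acl.get(path, {})
        return any(principal in rules for principal in (user, *groups))


endpoint = SharedEndpoint(owner="user_a", files={"/data/run1.bam"})
endpoint.share("/data/run1.bam", "genomics-lab")  # share with a group
print(endpoint.can_read("/data/run1.bam", "user_b", groups=("genomics-lab",)))
```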
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
Early adoption is encouraging
• >12,000 registered users; >150 daily
• >27 PB moved; >1B files
• 10x (or better) performance vs. scp
• 99.9% availability
Entirely hosted on Amazon
Amazon Web Services used
• Amazon EC2 for hosting Globus services
• Elastic Load Balancing to use multiple Availability Zones for reliability and uptime
• Amazon S3 to store historical state
• Amazon RDS PostgreSQL for active state
K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC
Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
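For a sense of scale, the 22 TB at 5 Gb/s figure can be turned into a rough duration. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes, 1 Gb = 10^9 bits) and a rate sustained for the whole transfer; `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(terabytes, gigabits_per_second):
    """Hours needed to move `terabytes` at a sustained rate, using decimal units."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


hours = transfer_hours(22, 5)
print(f"{hours:.1f} hours")  # prints "9.8 hours"
```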
Credit: Kerstin Kleese-van Dam
Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL
Globus Online already does a lot
[Diagram: Globus Toolkit; Sharing Service; Transfer Service; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]
The identity challenge in science
• Research communities often need to
– Assign identities to their users
– Manage user profiles
– Organize users into groups for authorization
• Obstacles to high-quality implementations
– Complexity of associated security protocols
– Creation of identity silos
– Multiple credentials for users
– Reliability, availability, scalability, security
Nexus provides four key capabilities
• Identity provisioning
– Create, manage Globus identities
• Identity hub
– Link with other identities; use to authenticate to services
• Group hub
– User-managed groups; groups can be used for authorization
• Profile management
– User-managed attributes; can use in group admission
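The four capabilities above can be sketched as a small data model. This is an illustration only: `Identity`, `Group`, and `IdentityHub` are hypothetical classes, not the Globus Nexus API, and the admission rule (a group requires certain profile attributes to be filled in) is an assumed example of "attributes used in group admission".

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    username: str
    linked: set = field(default_factory=set)      # external identities (campus login, etc.)
    profile: dict = field(default_factory=dict)   # user-managed attributes


@dataclass
class Group:
    name: str
    required_attributes: tuple = ()
    members: set = field(default_factory=set)

    def admit(self, identity):
        """Profile management: admit only if the required attributes are present."""
        if any(not identity.profile.get(a) for a in self.required_attributes):
            return False
        self.members.add(identity.username)
        return True


class IdentityHub:
    def __init__(self):
        self.identities = {}
        self.external_index = {}

    def provision(self, username, **profile):
        """Identity provisioning: create a new identity."""
        identity = Identity(username, profile=dict(profile))
        self.identities[username] = identity
        return identity

    def link(self, username, external_id):
        """Identity hub: link an external identity to an existing one."""
        self.identities[username].linked.add(external_id)
        self.external_index[external_id] = username

    def resolve(self, external_id):
        """Map an external login back to the identity it is linked to, if any."""
        return self.identities.get(self.external_index.get(external_id))

    def is_authorized(self, external_id, group):
        """Group hub: authorize by group membership of the resolved identity."""
        identity = self.resolve(external_id)
        return identity is not None and identity.username in group.members
```

A service that integrates with such a hub only needs to ask "is this login in that group?" and never handles credentials or membership lists itself.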
Key points:
1) Outsource identity, group, profile management
2) REST API for flexible integration
3) Intuitive, customizable Web interfaces
Branded sites
• Open Science Grid
• University of Chicago
• XSEDE
• DOE kBase
• Indiana University
• University of Exeter
• Globus Online
• NERSC
• NIH BIRN
A platform for integration
Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for all biologists
globus genomics
We are adding capabilities
[Diagram: Globus Toolkit; Sharing Service; Transfer Service; Dataset Services; Globus Nexus (Identity, Group, Profile); Globus Online APIs; Globus Connect]
We are adding capabilities
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
– Associate computational procedures, orchestrate application, catalog results, record provenance
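The ingest and cataloging ideas can be illustrated with a toy catalog that extracts a few attributes automatically, merges user-defined ones, and answers queries as "virtual views". This is a sketch under assumptions: `Catalog`, `extract_metadata`, and the attribute names are hypothetical and are not the planned Globus dataset services.

```python
import os


def extract_metadata(path, size_bytes):
    """Automatically derive a few attributes from the file name and size."""
    _, ext = os.path.splitext(path)
    return {"path": path, "format": ext.lstrip(".").lower(), "size_bytes": size_bytes}


class Catalog:
    """Hypothetical catalog: stores metadata only, never the data itself."""

    def __init__(self):
        self.entries = []

    def ingest(self, path, size_bytes, **user_metadata):
        """Combine extracted metadata with user-defined attributes."""
        entry = extract_metadata(path, size_bytes)
        entry.update(user_metadata)
        self.entries.append(entry)
        return entry

    def view(self, **criteria):
        """Virtual view: paths whose metadata matches every criterion."""
        return [
            e["path"]
            for e in self.entries
            if all(e.get(key) == value for key, value in criteria.items())
        ]


catalog = Catalog()
catalog.ingest("/lab/s1.FASTQ", 2_000_000, sample="S1", project="exome")
catalog.ingest("/lab/s1.bam", 900_000, sample="S1", project="exome")
catalog.ingest("/lab/s2.fastq", 2_500_000, sample="S2", project="rnaseq")
print(catalog.view(format="fastq"))
```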
Next Gen Sequencing Analysis for Everyone – No IT Required
Ravi K Madduri, The University of Chicago and Argonne National Laboratory
November 14, 2013
One slide to get your attention
Outline
• Globus Vision
• Challenges in Sequencing Analysis
– Big Data Management
– Analysis at Scale
– Reproducibility
• Proposed Approach Using Globus Genomics
• Example Collaborations
• Q&A
Globus Vision
Goal: Accelerate discovery and innovation worldwide by providing research IT as a service
Leverage software-as-a-service to:
– provide millions of researchers with unprecedented access to powerful tools for managing Big Data
– reduce research IT costs dramatically via economies of scale
“Civilization advances by extending the number of important operations which we can perform without thinking of them” (Alfred North Whitehead, 1911)
Challenges in Sequencing Analysis
[Diagram: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, and Research Lab, annotated "Data Movement and Access Challenges" and "Manual Data Analysis"]
How do we analyze this sequence data?
[Diagram: Fastq and Ref Genome inputs; Alignment, Picard, GATK, and Variant Calling steps; a manual cycle of Install, Modify, (Re)Run Script]
• Manually move the data to the compute node
• Install all the tools required for the analysis
• BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
• Error prone, difficult to keep track, messy
• Difficult to maintain and transfer the knowledge
• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
• Local clusters, cloud, grid
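The fragility of hand-maintained sequential scripts can be made concrete with a small model of the pipeline as ordered steps, each consuming files that an earlier step or the user must have provided. This is a sketch with assumed names: the step names, file names, and `check_order` helper are hypothetical, and no real tool is invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    inputs: tuple
    outputs: tuple


# Hypothetical three-step pipeline in the spirit of alignment -> cleanup -> variant calling.
PIPELINE = [
    Step("align", ("reads.fastq", "ref_genome.fa"), ("aligned.sam",)),
    Step("sort_and_mark", ("aligned.sam",), ("clean.bam",)),
    Step("call_variants", ("clean.bam", "ref_genome.fa"), ("variants.vcf",)),
]


def check_order(steps, available):
    """Walk the steps in order and fail as soon as one lacks an input file."""
    have = set(available)
    order = []
    for step in steps:
        missing = [f for f in step.inputs if f not in have]
        if missing:
            raise ValueError(f"{step.name} is missing inputs: {missing}")
        have.update(step.outputs)
        order.append(step.name)
    return order


print(check_order(PIPELINE, ["reads.fastq", "ref_genome.fa"]))
```

Reordering the steps or forgetting to stage the reference genome raises an error here; in a plain shell script the same mistake typically surfaces only after hours of compute.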
Globus Genomics
[Diagram: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab]
Globus provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints
Data Management Data Analysis
Galaxy Data Libraries
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools
Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
Galaxy Based Workflow Management System
Globus Genomics
Globus Genomics Architecture
Figure 2: Globus Genomics Architecture
Globus Genomics Usage
Globus Genomics
• Computational profiles for various analysis tools
• Resources can be provisioned on-demand with Amazon Web Services cloud-based infrastructure
• GlusterFS as a shared file system between head nodes and compute nodes
• Provisioned I/O on Amazon EBS
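One plausible reading of "computational profiles" is a per-tool statement of resource needs that drives which node type is provisioned. The sketch below assumes that reading; the node types, their sizes and prices, and the tool profiles are invented placeholders, not Amazon EC2 instance types or Globus Genomics settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    cores: int
    memory_gb: int
    hourly_cost: float


@dataclass(frozen=True)
class ToolProfile:
    tool: str
    cores: int
    memory_gb: int


# Hypothetical node types; real instance names, sizes, and prices differ.
NODE_TYPES = [
    NodeType("small", 2, 8, 0.10),
    NodeType("medium", 8, 32, 0.40),
    NodeType("large", 16, 64, 0.80),
]


def pick_node(profile, node_types=NODE_TYPES):
    """Choose the cheapest node type that satisfies the tool's profile."""
    fits = [
        n for n in node_types
        if n.cores >= profile.cores and n.memory_gb >= profile.memory_gb
    ]
    if not fits:
        raise ValueError(f"no node type can run {profile.tool}")
    return min(fits, key=lambda n: n.hourly_cost)


aligner = ToolProfile("aligner", cores=8, memory_gb=16)
print(pick_node(aligner).name)  # prints "medium"
```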
Coming soon!
• Integration with Globus Catalog
– Better data discovery and metadata management
• Integration with Globus Sharing
– Easy and secure method to share large datasets with collaborators
• Integration with Amazon Glacier for data archiving
• Support for high throughput computational modalities through Apache Mesos
– MapReduce and MPI clusters
• Dynamic storage strategies using Amazon S3 or LVM-based shared file system
Provide more capability for more people at lower cost by building a “Discovery Cloud”
Delivering “Science as a service”
Our vision for a 21st century discovery infrastructure
Thank you to our sponsors
For more information
• More information on Globus Genomics and to sign up: www.globus.org/genomics
• More information on Globus: www.globusonline.org
• Follow us on Twitter: @ianfoster, @madduri, @globusgenomics, @globusonline
Thank you!