Big Process for Big Data @ PNNL, May 2013
TRANSCRIPT
computationinstitute.org
Thanks to great colleagues and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others at Argonne
• Kerstin Kleese-Van Dam, Carina Lansing, and others at PNNL
The Computation Institute
= UChicago + Argonne
= Cross-disciplinary nexus
= Home of the Discovery Cloud
High energy physics
Molecular biology
Cosmology
Genetics
Metagenomics
Linguistics
Economics
Climate change
Visual arts
x10 in 6 years
x10^5 in 6 years
Will data kill genomics?
Kahn, Science, 331 (6018): 728-729
18 orders of magnitude in 5 decades!
12 orders of magnitude in 6 decades!
Moore’s Law for X-Ray Sources
Large Hadron Collider
Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG
1.2 PB of climate data
Delivered to 23,000 users
We have exceptional infrastructure for the 1%
What about the 99%?
Big science. Small labs.
Need: A new way to deliver research cyberinfrastructure
Frictionless
Affordable
Sustainable
We asked ourselves:
What if the research workflow could be managed as easily as…
…our pictures
…home entertainment
…our e-mail
What makes these services great?
Great user experience +
High performance (but invisible) infrastructure
We aspire (initially) to create a great user experience for research data management
What would a “dropbox for science” look like?
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
…for BIG DATA
[Diagram: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, Mirror]
[Error messages shown: “Quota exceeded!”, “Expired credentials!”, “Network failed. Retry.”, “Permission denied!”]
It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, &
Archive BIG DATA… but in reality it’s often very challenging
Automation is required to apply more sophisticated methods to far more data
Outsourcing is needed to achieve economies of scale in the use of automated methods
Automation and outsourcing are key
Building a discovery cloud: Research strategy
• Identify time-consuming activity that appears amenable to automation and outsourcing
• Implement activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability, economies of scale
• Evaluate
• Extract common elements as a research automation platform
• Repeat
Bonus question: Identify methods for delivering SaaS solutions sustainably
Software as a service
Platform as a service
Infrastructure as a service
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
• Collect • Move • Sync • Share
Capabilities delivered using the Software-as-a-Service (SaaS) model
Data Source
Data Destination
1. User initiates transfer request
2. Globus Online moves/syncs files
3. Globus Online notifies user
Data Source
1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
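The "reliability via transfer retries" bullet can be illustrated with a minimal sketch. This is not Globus Online's implementation: the exception type, function names, attempt limit, and exponential backoff schedule below are all assumptions made for illustration.

```python
import time


class TransientTransferError(Exception):
    """Hypothetical stand-in for a recoverable fault, e.g. a dropped connection."""


def transfer_with_retries(transfer_once, max_attempts=5, base_delay=1.0,
                          sleep=time.sleep):
    """Call transfer_once() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts
    (an assumed backoff policy, not one stated in the slides).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transfer_once()
        except TransientTransferError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_transfer(failures_before_success):
    """Build a fake transfer that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def transfer_once():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientTransferError("network failed")
        return "SUCCEEDED after %d attempts" % state["calls"]

    return transfer_once


# Record the delays instead of really sleeping, so the example runs instantly.
delays = []
result = transfer_with_retries(make_flaky_transfer(2), sleep=delays.append)
print(result, delays)
```

The point of the pattern is that the user never sees the two intermediate failures, only the final outcome.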
Early adoption is encouraging
8,000 registered users; >100 daily
~16 PB moved; ~1B files
10x (or better) performance vs. scp
99.9% availability
Entirely hosted on Amazon
We benefit greatly from ESnet’s “Science DMZ”
Three key components, all required:
• “Friction free” network path
– Highly capable network devices (wire-speed, deep queues)
– Virtual circuit connectivity option
– Security policy and enforcement specific to science workflows
– Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
– Hardware, operating system, libraries optimized for transfer
– Optimized data transfer tools: Globus Online, GridFTP
• Performance measurement/test node
– perfSONAR
Details at http://fasterdata.es.net/science-dmz/
K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
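For a sense of scale, the figures on this slide imply a transfer time of several hours. The short calculation below assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits/s) and a rate sustained for the whole transfer; neither assumption is stated on the slide.

```python
def transfer_time_hours(size_terabytes, rate_gigabits_per_s):
    """Idealized transfer time, assuming decimal units and a constant rate."""
    bits = size_terabytes * 1e12 * 8              # bytes -> bits
    seconds = bits / (rate_gigabits_per_s * 1e9)
    return seconds / 3600


hours = transfer_time_hours(22, 5)
print(f"{hours:.1f} hours")  # 35,200 seconds, i.e. roughly 9.8 hours
```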
B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC
Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
Credit: Kerstin Kleese-van Dam
Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL
Globus Online already does a lot
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
Data management SaaS (Globus) +
Next-gen sequence analysis pipelines (Galaxy) +
Cloud IaaS (Amazon) =
Flexible, scalable, easy-to-use genomics analysis for all biologists
Globus Genomics
A platform for integration
We are also adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
More capabilities underway …
Globus Toolkit
Sharing Service
Transfer Service
Dataset Services
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
Expanding Globus Online services
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
– Associate computational procedures, orchestrate application, catalog results, record provenance
Looking deeply at how researchers use data
• A single research question often requires the integration of many data elements that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways
• Best grouping can vary during investigation
– Longitudinal, vertical, cross-cutting
• But always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …
How do we manage data today?
• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory naming conventions
– Capture status in README files, spreadsheets, notebooks
• Time-consuming, complex, error prone
Why can’t we manage our data like we manage our pictures and music?
Introducing the dataset
• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and describe usage
• Tag with characteristics that reflect content …
– Capture as much existing information as we can
• …or to reflect current status in investigation
– Stage of processing, provenance, validation, …
• Share datasets for collaboration
– Control access to data and metadata
• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
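One way to picture the dataset idea is as a small data structure: a named logical group of file references, plus tags and a sharing list, with operations applied to the group as a whole. The sketch below is purely illustrative. The class, its methods, the endpoint names, the file paths, and the "stage" tag are all hypothetical and are not the Globus Online dataset service.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical logical grouping of files, independent of where they live."""
    name: str
    members: set = field(default_factory=set)   # file URLs at any endpoint
    tags: dict = field(default_factory=dict)    # content or status metadata
    readers: set = field(default_factory=set)   # users allowed to see the dataset

    def add(self, url):
        self.members.add(url)

    def tag(self, name, value):
        self.tags[name] = value

    def share_with(self, user):
        self.readers.add(user)

    def apply(self, operation):
        """Operate on the dataset as a unit: run operation on every member."""
        return {url: operation(url) for url in sorted(self.members)}


ds = Dataset("mydata42")
ds.add("aps-endpoint:/raw/scan_001.h5")             # made-up paths on two endpoints
ds.add("pnnl-endpoint:/derived/scan_001_recon.h5")
ds.tag("format", "HDF5")                            # reflects content
ds.tag("stage", "reconstructed")                    # reflects status in investigation
ds.share_with("userB")

# A single call covers every member, wherever it is stored.
basenames = ds.apply(lambda url: url.rsplit("/", 1)[-1])
print(basenames)
```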
Builds on catalog as a service
Approach
• Hosted user-defined catalogs
• Based on tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs
/query/
• Retrieve subjects
/tags/
• Create, delete, retrieve tags
/tagdef/
• Create, delete, retrieve tag definitions
Builds on USC Tagfiler project (C. Kesselman et al.)
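The tag model can be sketched as an in-memory store of <subject, name, value> triples. The three groups of methods below loosely mirror the roles of /tagdef/, /tags/, and /query/, but the slides do not show request or response formats, so every class and method name here is an assumption, and using Python types as the "optional schema constraint" is an illustrative choice only.

```python
class TagCatalog:
    """Hypothetical in-memory catalog of (subject, name, value) triples."""

    def __init__(self):
        self.tagdefs = {}     # tag name -> required value type (schema constraint)
        self.triples = set()  # {(subject, name, value), ...}

    # Role of /tagdef/: create tag definitions
    def create_tagdef(self, name, value_type=str):
        self.tagdefs[name] = value_type

    # Role of /tags/: create, delete, retrieve tags
    def create_tag(self, subject, name, value):
        if name not in self.tagdefs:
            raise KeyError(f"undefined tag: {name}")
        if not isinstance(value, self.tagdefs[name]):
            raise TypeError(f"{name} expects {self.tagdefs[name].__name__}")
        self.triples.add((subject, name, value))

    def delete_tag(self, subject, name, value):
        self.triples.discard((subject, name, value))

    def retrieve_tags(self, subject):
        return sorted((n, v) for s, n, v in self.triples if s == subject)

    # Role of /query/: retrieve subjects matching every name=value criterion
    def query(self, **criteria):
        subjects = {s for s, _, _ in self.triples}
        return sorted(
            s for s in subjects
            if all((s, n, v) in self.triples for n, v in criteria.items())
        )


catalog = TagCatalog()
for tag_name in ("owner", "type", "format", "beamline"):
    catalog.create_tagdef(tag_name)

catalog.create_tag("mydata42", "owner", "Francesco")
catalog.create_tag("mydata42", "format", "HDF5")
catalog.create_tag("mydata42", "beamline", "2BM")
catalog.create_tag("mydata43", "format", "HDF5")     # second subject is made up
catalog.create_tag("mydata43", "beamline", "32ID")

hdf5_subjects = catalog.query(format="HDF5")
at_2bm = catalog.query(format="HDF5", beamline="2BM")
print(hdf5_subjects, at_2bm)
```

Because subjects are just names, the same catalog can describe files, datasets, or derived products without changing the model.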
Multi-scale imaging at APS
Storage
Beamline 2-BM-B: ~1.5 µm resolution
Beamline 32-ID-C: 20-50 nm resolution
Per-beamline pipeline: image processing (noise removal, etc.), tomographic reconstruction, visual inspection, selection
Combined: selection, multi-scale image fusion, visual inspection
Detector figures shown: up to 100 fps, 2K x 2K, 16 bits, 11 GB raw data; 1,500 fps, 2K x 2K, 16 bits, 1 min readout, 11 GB raw data
Tomography
mydata42
owner: Francesco
type: 3dtomo
format: HDF5
beamline: 2BM
Organization
Orchestration
Define dataset
Infer type
Extract metadata
Populate catalog(s)
Locate datasets
Access files
Transfer/schedule
Analyze
Catalog derived products
Record provenance
Annotate, share, browse, search
Building a discovery cloud: Research strategy
• Identify time-consuming activity that appears amenable to automation and outsourcing
• Implement activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability, economies of scale
• Evaluate
• Extract common elements as a research automation platform
• Repeat
Bonus question: Identify methods for delivering SaaS solutions sustainably
Software as a service
Platform as a service
Infrastructure as a service
Our challenge:
Sustainability
We are a non-profit service provider to the non-profit research community
Globus Online Provider Plans
Support ongoing operations
Offer value-added capabilities
Engage more closely with users
Provider Plans offer…
• Provider endpoints with sharing
• Multiple GridFTP servers per endpoint
• Branded web sites
• Alternate identity provider
• Usage reporting
• MSS optimizations
• Operations monitoring and management
• Input into and access to product roadmap
Starting at $20k per year
To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources
“Science as a service”
Our vision for a 21st century discovery infrastructure
It’s a time of great opportunity … to develop and apply Science aaS
Globus Nexus (Identity, Group, Profile)
…
Sharing Service
Transfer Service
Dataset Services
Globus Toolkit
Globus Online APIs
Globus Connect
Thanks to great colleagues and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others at Argonne
• Kerstin Kleese-Van Dam, Carina Lansing, and others at PNNL
Thank you to our sponsors!