www.ci.anl.gov www.ci.uchicago.edu
Big process for big data
Process automation for data‐driven science
Ian Foster Computation Institute
Argonne National Laboratory & The University of Chicago
Talk at HPC 2012 Conference, Cetraro, Italy, June 25, 2012
Big science is making it work
All build on NSF- & DOE-supported Globus Toolkit software
• LIGO: 1 PB data in last science run, distributed worldwide
• ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs
• OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010
What this takes:
• Robust production solutions
• Substantial teams and expense
• Sustained, multi-year effort
• Application-specific solutions, built on common technology
But small/medium science is struggling
• More data, more complex data
• Ad-hoc solutions
• Inadequate software, hardware
• Data plan mandates
Complexity is large and growing
Steps along the time axis:
• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data
Tripit exemplifies process automation
• Me: book flights; book hotel
• Tripit: record flights; suggest hotel; record hotel; get weather; prepare maps; share info; monitor prices; monitor flight
• Other services
(activities arranged along a time axis)
Can we extract this complexity?
Process automation for science
Steps along the time axis:
• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data
Research IT as a service?
A first take on “big process for science”
• Dark Energy Survey
• Metagenomics
• Climate science
• Genomics
• Land use change
• X-ray source data
• Biomedical imaging
• High energy physics
• Nielsen data
Software as a Service (Gartner)
1. The application is owned, delivered, and managed remotely by one or more providers
2. The application is based on a single code base that is consumed in a one-to-many model by all contracted customers at any time
3. The application is licensed on pay-per-use or subscription basis
4. The application behind the service is properly web architected, not an existing application web enabled [D. Terrar]
Globus Online: Data transfer as SaaS
• Reliable file transfer.
– Fire-and-forget
– Automatic fault recovery
– High performance
– Across multiple security domains
• No IT required.
– No client software install
– New features automatically available
– Consolidated support and troubleshooting
Works with existing GridFTP servers; also Globus Connect
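To make "fire-and-forget" with "automatic fault recovery" concrete, here is a minimal Python sketch of the general idea: the caller submits a list of files once, and a loop retries transient faults with exponential backoff. It does not show Globus Online's implementation or API; `TransientTransferError`, `transfer_with_recovery`, and `make_flaky_copier` are hypothetical names invented for illustration.

```python
import time
from typing import Callable, List, Tuple


class TransientTransferError(Exception):
    """Hypothetical error standing in for a recoverable network fault."""


def transfer_with_recovery(
    files: List[str],
    copy_one: Callable[[str], None],
    max_attempts: int = 5,
    base_delay: float = 0.0,
) -> List[str]:
    """Copy every file, retrying transient faults; return files that still failed."""
    failed = []
    for name in files:
        for attempt in range(1, max_attempts + 1):
            try:
                copy_one(name)
                break  # this file is done
            except TransientTransferError:
                if attempt == max_attempts:
                    failed.append(name)  # give up; report for manual follow-up
                else:
                    # Exponential backoff before the next attempt.
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return failed


def make_flaky_copier(failures_per_file: int) -> Tuple[Callable[[str], None], List[str]]:
    """Build a stand-in copier that fails a fixed number of times per file."""
    seen = {}
    copied: List[str] = []

    def copy_one(name: str) -> None:
        seen[name] = seen.get(name, 0) + 1
        if seen[name] <= failures_per_file:
            raise TransientTransferError(name)
        copied.append(name)

    return copy_one, copied


if __name__ == "__main__":
    copier, copied = make_flaky_copier(failures_per_file=2)
    leftover = transfer_with_recovery(["a.fits", "b.fits"], copier)
    print("copied:", copied, "failed:", leftover)
```

The user-facing call returns only the files that could not be recovered, which is the part a person would still need to look at.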
Globus Transfer to date
• In 18 months
– 5,000 users
– 5 PB moved
– 500M files
– 99.9% uptime
• Broad adoption
– Experimental facilities
– Supercomputers
– Campuses
– Individuals
– Projects
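A rough back-of-the-envelope reading of these totals, as a short Python calculation. The inputs are the rounded figures from the slide; the use of decimal units (1 PB = 10^15 bytes) and 30-day months is an assumption made here, so the results are order-of-magnitude averages only.

```python
# Rounded totals from the slide.
PB = 10**15                      # assumption: decimal petabyte
total_bytes = 5 * PB
total_files = 500 * 10**6
users = 5_000
seconds = 18 * 30 * 24 * 3600    # assumption: 30-day months

avg_file_mb = total_bytes / total_files / 10**6    # mean file size, MB
avg_user_tb = total_bytes / users / 10**12         # mean volume per user, TB
avg_rate_mb_s = total_bytes / seconds / 10**6      # mean sustained rate, MB/s

print(f"mean file size:     {avg_file_mb:.0f} MB")
print(f"mean data per user: {avg_user_tb:.0f} TB")
print(f"mean transfer rate: {avg_rate_mb_s:.0f} MB/s")
```

Under these assumptions the averages come out to about 10 MB per file, 1 TB per user, and a little over 100 MB/s sustained across the whole service.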
Dark Energy Survey use of Globus Online
• Dark Energy Survey receives 100,000 files each night in Illinois
• They transmit files to Texas for analysis … then move results back to Illinois
• Process must be reliable, routine, and efficient
• They outsource this task to Globus Online
[Image: Blanco 4m on Cerro Tololo. Image credit: Roger Smith/NOAO/AURA/NSF]
Genome sequence analysis pipelines
[Figure labels: commercial sequencing center, Amazon S3 storage, Amazon EC2 computing]
Globus Online under the covers
Globus Nexus is used to manage
– user identities
– user profiles
– groups and policies
– resource definitions
Monitoring and control
– Auto-tuning of transfer parameters
– Detection & attempted correction of errors
– Manual intervention when required
Reliable cloud-based infrastructure
– EC2 for transfer management
– S3 for system state
– SimpleDB for lock management
– Replication across availability zones
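One way to picture "auto-tuning of transfer parameters" is a simple search over the number of parallel streams, keeping the setting that measured best. The Python sketch below illustrates that general idea only; the slides do not describe the algorithm Globus Online uses, and `tune_parallelism` and `toy_throughput` are hypothetical names with a made-up throughput model.

```python
from typing import Callable


def tune_parallelism(
    measure: Callable[[int], float],
    start: int = 1,
    max_streams: int = 64,
) -> int:
    """Double the stream count while measured throughput keeps improving."""
    best = start
    best_rate = measure(best)
    candidate = best * 2
    while candidate <= max_streams:
        rate = measure(candidate)
        if rate <= best_rate:
            break  # no further gain; keep the previous setting
        best, best_rate = candidate, rate
        candidate *= 2
    return best


def toy_throughput(streams: int) -> float:
    """Made-up model: gains flatten at 8 streams, then overhead costs a little."""
    return min(streams, 8) * 100.0 - streams * 1.0


if __name__ == "__main__":
    print("chosen streams:", tune_parallelism(toy_throughput))
```

In a real service the `measure` callback would be replaced by observed throughput on a live transfer, and the search would need to cope with noisy measurements.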
A first take on “big process for science”
Research Data Management-as-a-Service
• SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, …
• PaaS: Globus Integrate, …
Globus Storage: For when you want to …
• Place your data where you want
• Access it from anywhere via different protocols
• Update it, version it, and take snapshots
• Share versions with whom you want
• Synchronize among locations
[Figure labels: Globus Storage volume; commercial storage service provider; national research center; campus computing center; access via Globus Transfer, HTTP/REST, Desktop sync]
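The "update it, version it, and take snapshots" bullet can be illustrated with a toy in-memory model: each write appends a new version, and a snapshot records which version of each path was current at that moment. This is a sketch of the concept only, not Globus Storage's design or API; the `Volume` class and its methods are hypothetical.

```python
from typing import Dict, List


class Volume:
    """Toy in-memory volume: every write adds a version; snapshots freeze a view."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[bytes]] = {}
        self._snapshots: Dict[str, Dict[str, int]] = {}

    def write(self, path: str, data: bytes) -> int:
        """Store a new version of path and return its 1-based version number."""
        history = self._versions.setdefault(path, [])
        history.append(data)
        return len(history)

    def read(self, path: str, version: int = 0) -> bytes:
        """Read a given version of path; version 0 means the latest."""
        history = self._versions[path]
        return history[version - 1] if version else history[-1]

    def snapshot(self, name: str) -> None:
        """Record the current version number of every path under a name."""
        self._snapshots[name] = {p: len(h) for p, h in self._versions.items()}

    def read_snapshot(self, name: str, path: str) -> bytes:
        """Read path as it was when the named snapshot was taken."""
        return self.read(path, self._snapshots[name][path])


if __name__ == "__main__":
    vol = Volume()
    vol.write("scan.dat", b"raw")
    vol.snapshot("before-cleanup")
    vol.write("scan.dat", b"cleaned")
    print(vol.read("scan.dat"), vol.read_snapshot("before-cleanup", "scan.dat"))
```

Sharing a snapshot rather than the live volume gives collaborators a stable view while the owner keeps updating files.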
Globus Storage under the covers
• Data: conventional or cloud storage system
• File system metadata: Cassandra database hosted on Amazon
• Access: GridFTP servers, HTTP servers
Globus Collaborate: For when you want to
Join with a few or many people to:
• Share docs
• Track tasks
• Send email
• Share data
• Do whatever
With:
• Common groups
• Delegated management
Globus Storage & Collaborate in action
TBI = Traumatic Brain Injury; DTI = Diffusion Tensor Imaging; MRI = Magnetic Resonance Imaging
[Figure labels: Kyle; Bryce; DTI Group (Kyle, Bryce); “TBI” volume; PADS Compute Cluster; UChicago Object Store; Cornell Red Cloud; SDSC Cloud; Amazon S3]
Actions shown in the figure:
• Globus Storage: Create volume and share with TBI group
• Globus Transfer: Copy TBI data to compute cluster
• Globus Transfer: Move DTI results to shared volume
• Globus Nexus: Add Bryce to TBI collaboration
• Globus Collaborate: Publish DTI data to TBI web site
• Globus Connect: Move MRI files to TBI shared volume
• Globus Connect: Move DTI results to Bryce’s laptop
• Globus Storage: Create snapshot to share with group
Data acquisition, management, analysis
Big Data (volume, velocity, variety, variability)…
demands Big Process in order for discovery to scale
Sources: experiments, computations, and (don’t forget!) literature
Let’s rethink how we provide research IT
Accelerate discovery and innovation worldwide by providing research IT as a service
Leverage the cloud to
• provide millions of researchers with unprecedented access to powerful tools;
• enable a massive shortening of cycle times in time-consuming research processes; and
• reduce research IT costs dramatically via economies of scale
Acknowledgements
• Thanks for vital and much appreciated support:
– DOE Office of Advanced Scientific Computing Research (ASCR)
– NSF Office of Cyberinfrastructure (OCI)
– National Institutes of Health
– The University of Chicago
• And thanks to the amazing Globus Online team. See www.globusonline.org/about/goteam/
Thank you!
globusonline.org
@globusonline