Globus as a platform for research data management
Vas Vasiliadis
University of Chicago
Best Practices in Data Infrastructure
May 17, 2016
Globus delivers…
Big data transfer, sharing, publication, and discovery…
…directly from your own storage systems…
…via software-as-a-service
Globus as SaaS
1. Researcher initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
Diagram labels: Instrument, Compute Facility, Publication Repository, Personal Computer
Services: Transfer, Share, Publish, Discover
• SaaS: web access; low operational costs
• Use storage system of your choice
• Access using your existing credentials
Globus as bridging technology to…
• Supercomputing resources: NCSA, NERSC, XSEDE
• Campus HPC facilities
• Clouds: Jetstream, AWS, Google
• Instruments
• Lab clusters, servers, laptops, etc.
Scaling up analysis
Move datasets to campus HPC, supercomputer, national facility
Move results to (…)
Bridging to instruments: APS
Courtesy of Francesco De Carlo, Argonne National Laboratory (2016)
Dynamic imaging: >200 TB per dataset
APS DMagic
• Simple commands to automate the majority of beamline data management tasks
• Toolbox supports APS Imaging Group; can be easily adapted to any APS beamline
• Given an experiment date, retrieves users from APS scheduling system and automatically sends e-mail with link to the data
• Monitors a directory and copies any new files to a personal or remote server endpoint
• Data can be shared directly from the beamline machine or from a Globus server endpoint
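The directory-monitoring bullet above can be sketched roughly as follows. This is not DMagic's actual code: the function name copy_new_files, the "seen" set, and the file names are hypothetical, and a plain local copy stands in for the transfer to a personal or remote server endpoint.

```python
import shutil
import tempfile
from pathlib import Path


def copy_new_files(watch_dir: Path, dest_dir: Path, seen: set) -> list:
    """Copy files not yet seen from watch_dir to dest_dir; return their names.

    Hypothetical helper: a real deployment would submit a transfer to an
    endpoint instead of copying locally.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(watch_dir.iterdir()):
        if path.is_file() and path.name not in seen:
            shutil.copy2(path, dest_dir / path.name)
            seen.add(path.name)  # remember the file so it is copied only once
            copied.append(path.name)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        watch, dest = Path(tmp) / "beamline", Path(tmp) / "remote"
        watch.mkdir()
        seen = set()
        (watch / "scan_001.h5").write_text("frame data")
        print(copy_new_files(watch, dest, seen))  # first poll picks up scan_001.h5
        (watch / "scan_002.h5").write_text("more frame data")
        print(copy_new_files(watch, dest, seen))  # second poll picks up only the new file
```

Calling the function on a timer gives simple polling; tracking names in a set keeps each poll idempotent, at the cost of missing files that are rewritten under the same name.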
Data Distribution: NGS
EC2
Ad Hoc Sharing: NIH
helix.nih.gov
Repositories: Compute Canada
National Research Data Repository (Phase 1)
Diagram labels: CC Storage, Globus Connect, Globus Publication, Archivematica, Compute Canada Cloud, Regional Repository, Institutional Repository, Metadata, Index
Courtesy of Todd Trann, Compute Canada, 2016
NRDP Features
• Federated Storage Model: storage and repositories are distributed, and owned and operated by organizations / institutions
• National Data Discovery: Single search to discover data, regardless of location
• Suitable for broad range of data types
• Archivematica: preservation packages
• Automatic geographic data replication
Adapted from Todd Trann, Compute Canada, 2016
Globus serves as…
A platform for building science gateways, portals and other web applications in support of research and education
Globus as PaaS
Data Publication & Discovery
File Sharing
File Transfer & Replication
Identity/Authentication, Group Management
…Globus Toolkit
Globus APIs
Globus Connect
Enable existing institutional ID systems to be used in external web applications
Integrate file transfer and sharing capabilities into scientific web apps, portals, gateways, etc.
Data Archive: NCAR
Serving a global community
• 17+ PB virtual processing
• 45,000+ custom orders, 4,000 users, 380 TB served in 2014
Courtesy of Thomas Cram, NCAR (2014)
Fully automated delivery via portal using Globus PaaS
PaaS-enabled automated workflow
• User logs in with NCAR or other campus identity
• Selected dataset copied to staging area (shared endpoint)
• Read permission granted to user to access shared endpoint
• User receives email with link to access files
• ACLs deleted after five days
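The grant-then-expire part of the workflow above can be sketched as follows. This is an in-memory illustration only, not NCAR's portal code and not a Globus API call: the names grant_read, expire_acls, and ACL_LIFETIME, the example identity, and the staging path are all hypothetical.

```python
from datetime import datetime, timedelta

ACL_LIFETIME = timedelta(days=5)  # from the slide: "ACLs deleted after five days"


def grant_read(acls: dict, user: str, path: str, now: datetime) -> None:
    """Record a read permission for user on path, stamped with its grant time."""
    acls[(user, path)] = now


def expire_acls(acls: dict, now: datetime) -> list:
    """Delete grants at least ACL_LIFETIME old; return the removed (user, path) keys."""
    expired = [key for key, granted in acls.items() if now - granted >= ACL_LIFETIME]
    for key in expired:
        del acls[key]
    return expired


if __name__ == "__main__":
    acls = {}
    start = datetime(2016, 5, 17, 9, 0)
    grant_read(acls, "user@example.edu", "/staging/order-0001/", start)
    print(expire_acls(acls, start + timedelta(days=4)))  # still within the window
    print(expire_acls(acls, start + timedelta(days=5)))  # grant is removed here
```

A periodic job running expire_acls would keep the staging area from accumulating stale permissions; in a real portal the dictionary would be replaced by calls to the sharing service.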
Analysis portal: Sanger
Compute Access: OSG
Data “dropbox”: BBFC
Studios upload movies for rating
• Authenticate to BBFC IdP; issued unique ID
• Automatically provision “dropbox”, set ACLs
• Auto activate shared endpoint using SSO
• Initiate transfer
/distributor/paramount/32534
/distributor/wb/65346
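The provisioning step that yields per-upload paths like the two above can be sketched as follows. This is a guess at the layout based only on those example paths, not BBFC's implementation: the functions dropbox_path and dropbox_acl, the rule fields, the "rw" permission, and the example identity are hypothetical.

```python
from pathlib import PurePosixPath


def dropbox_path(distributor: str, upload_id: int) -> str:
    """Build a per-upload folder path in the /distributor/<name>/<id> layout."""
    if not distributor.isalnum():
        # Reject separators and dots so a name cannot escape its own folder.
        raise ValueError("distributor name must be alphanumeric")
    return str(PurePosixPath("/distributor") / distributor.lower() / str(upload_id))


def dropbox_acl(distributor: str, upload_id: int, identity: str) -> dict:
    """Describe the access rule to set on a newly provisioned dropbox (assumed fields)."""
    return {
        "principal": identity,
        "path": dropbox_path(distributor, upload_id),
        "permissions": "rw",  # assumption: the studio needs write access to upload
    }


if __name__ == "__main__":
    print(dropbox_path("paramount", 32534))
    print(dropbox_acl("wb", 65346, "studio-user@example.org"))
```

Scoping each rule to a single upload folder means one distributor cannot see another's submissions, which is the point of issuing a unique ID per upload.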
Globus today…
5 major services
13 national labs use Globus
160 PB transferred
10,000+ active endpoints
27 billion files processed
~450 active daily users
40,000 registered users
99.9% uptime
50+ institutional subscribers
1 PB largest single transfer to date
3 months longest continuously managed transfer
130+ federated campus identities
Thank you to our sponsors!
U.S. Department of Energy
Users, usage continue steady growth…
[Chart: Active Users; y-axis: Number of Users, 0 to 3000]
…but freemium gap is widening
[Chart: Active Endpoints, Free vs. Subscribed; y-axis: Number of Endpoints, 0 to 3000]
Globus Subscriptions
• Globus Provider Plan
– Shared endpoints
– Data publication
– Peer-to-peer transfer/sharing
– Management console
– Usage reports
– Priority support
– Application integration
• Branded Web Site
• Alternate Identity Provider (InCommon is standard)
• Premium Storage Connectors (S3, HPSS, Spectra, Google Drive coming soon)
globus.org/provider-plans
We hope you will join us…
• Sign up and transfer files: globus.org/login
• Create endpoints: globus.org/globus-connect-server
• Documentation: docs.globus.org
• Need help? support.globus.org
• Subscribe to help us make Globus self-sustaining: globus.org/provider-plans
• Follow us: @globusonline