big process for big data @ pnnl, may 2013

computationinstitute.org

Big process for big data

Ian [email protected]


Thanks to great colleagues and collaborators

• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago

• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI

• Francesco de Carlo, Chris Jacobsen, and others at Argonne

• Kerstin Kleese-Van Dam, Carina Lansing, and others at PNNL


The Computation Institute

= UChicago + Argonne

= Cross-disciplinary nexus

= Home of the Discovery Cloud


High energy physics

Molecular biology

Cosmology

Genetics

Metagenomics

Linguistics

Economics

Climate change

Visual arts


x10 in 6 years

x10^5 in 6 years

Will data kill genomics?

Kahn, Science, 331 (6018): 728-729


18 orders of magnitude in 5 decades!

12 orders of magnitude in 6 decades!

Moore’s Law for X-Ray Sources


Large Hadron Collider

Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG


1.2 PB of climate data delivered to 23,000 users


We have exceptional infrastructure for the 1%


What about the 99%?





Big science. Small labs.


Need: A new way to deliver research cyberinfrastructure

Frictionless

Affordable

Sustainable


We asked ourselves:

What if the research work flow could be managed as easily as…

…our pictures

…home entertainment

…our e-mail


What makes these services great?

Great user experience +

High performance (but invisible) infrastructure


We aspire (initially) to create a great user experience for research data management

What would a “dropbox for science” look like?


• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

…for BIG DATA


[Figure: data moving among Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, and Mirror, interrupted by errors: “Quota exceeded!”, “Expired credentials!”, “Network failed. Retry.”, “Permission denied!”]

It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA… but in reality it’s often very challenging



Automation is required to apply more sophisticated methods to far more data

Outsourcing is needed to achieve economies of scale in the use of automated methods

Automation and outsourcing are key


Building a discovery cloud: Research strategy

• Identify time-consuming activity that appears amenable to automation and outsourcing

• Implement activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability, economies of scale

• Evaluate

• Extract common elements as a research automation platform

• Repeat

Bonus question: Identify methods for delivering SaaS solutions sustainably

Software as a service

Platform as a service

Infrastructure as a service




• Collect • Move • Sync • Share

…for BIG DATA

Capabilities delivered using a Software-as-a-Service (SaaS) model


Data Source → Data Destination

1. User initiates transfer request

2. Globus Online moves/syncs files

3. Globus Online notifies user
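The three transfer steps can be sketched in a few lines of Python. This is a local stand-in, not the Globus Online API: `TransferRequest`, `run_transfer`, and the `notify` callback are hypothetical names, and plain directories play the role of the source and destination endpoints. The sketch copies only files that are missing or different at the destination (a simple sync) and then reports back to the user.

```python
import filecmp
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class TransferRequest:
    """Step 1: what the user submits (hypothetical structure)."""
    source: Path
    destination: Path
    copied: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)


def run_transfer(req: TransferRequest, notify: Callable[[str], None]) -> TransferRequest:
    """Step 2: move/sync the files; step 3: notify the user."""
    for src in sorted(p for p in req.source.rglob("*") if p.is_file()):
        rel = src.relative_to(req.source)
        dst = req.destination / rel
        # Sync semantics: skip files already present with identical content.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            req.skipped.append(rel.as_posix())
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        req.copied.append(rel.as_posix())
    notify(f"Transfer done: {len(req.copied)} copied, {len(req.skipped)} skipped")
    return req


def demo() -> List[str]:
    """Run two transfers between temporary directories and collect the notifications."""
    messages: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp, "source"), Path(tmp, "destination")
        (src / "run1").mkdir(parents=True)
        (src / "run1" / "a.dat").write_text("alpha")
        (src / "b.dat").write_text("beta")
        run_transfer(TransferRequest(src, dst), messages.append)  # first pass copies both files
        run_transfer(TransferRequest(src, dst), messages.append)  # second pass finds nothing to do
    return messages
```

Calling `demo()` returns one notification per request; the second request reports that every file was skipped because the destination is already in sync.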


Data Source

1. User A selects file(s) to share; selects user/group, sets share permissions

2. Globus Online tracks shared files; no need to move files to cloud storage!

3. User B logs in to Globus Online and accesses shared file
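The key idea in the sharing flow is that the service records who may read a file where it already lives, rather than copying the file somewhere else. A minimal sketch of that bookkeeping follows; `ShareRegistry`, `share`, and `can_read` are hypothetical names chosen for illustration, and the endpoint and path strings are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ShareRegistry:
    """Hypothetical in-memory record of who may read which path on which endpoint."""
    groups: Dict[str, Set[str]] = field(default_factory=dict)
    # (endpoint, path) -> user or group names granted read access
    grants: Dict[Tuple[str, str], Set[str]] = field(default_factory=dict)

    def share(self, endpoint: str, path: str, principal: str) -> None:
        """Steps 1 and 2: record the permission; the file itself stays in place."""
        self.grants.setdefault((endpoint, path), set()).add(principal)

    def can_read(self, user: str, endpoint: str, path: str) -> bool:
        """Step 3: check access, either granted directly or through a group."""
        allowed = self.grants.get((endpoint, path), set())
        if user in allowed:
            return True
        return any(user in self.groups.get(g, set()) for g in allowed)


# Illustrative usage with made-up names.
registry = ShareRegistry(groups={"tomo-team": {"user_b"}})
registry.share("lab-endpoint", "/data/scan42.h5", "tomo-team")
```

Only the grant is stored centrally, so sharing a multi-terabyte file costs the same as sharing a small one.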


Extreme ease of use

• InCommon, OAuth, OpenID, X.509, …

• Credential management

• Group definition and management

• Transfer management and optimization

• Reliability via transfer retries

• Web interface, REST API, command line

• One-click “Globus Connect” install

• 5-minute Globus Connect Multi User install
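"Reliability via transfer retries" can be illustrated with a small retry wrapper. The slides do not say how Globus Online schedules its retries; exponential backoff is assumed here because it is a common choice, and `TransientError` and `with_retries` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a recoverable failure such as a dropped network connection."""


def with_retries(op: Callable[[], T], attempts: int = 5, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op(); on TransientError wait base_delay * 2**n seconds and try again."""
    for n in range(attempts):
        try:
            return op()
        except TransientError:
            if n == attempts - 1:
                raise  # out of attempts: surface the error to the user
            sleep(base_delay * 2 ** n)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the wrapper can be exercised without real waiting. Errors that are not transient (for example, a denied permission) are deliberately not caught, since retrying them would not help.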



Early adoption is encouraging

8,000 registered users; >100 daily

~16 PB moved; ~1B files

10x (or better) performance vs. scp

99.9% availability

Entirely hosted on Amazon


We benefit greatly from ESnet’s “Science DMZ”

Three key components, all required:

• “Friction free” network path
– Highly capable network devices (wire-speed, deep queues)
– Virtual circuit connectivity option
– Security policy and enforcement specific to science workflows
– Located at or near site perimeter if possible

• Dedicated, high-performance Data Transfer Nodes (DTNs)
– Hardware, operating system, libraries optimized for transfer
– Optimized data transfer tools: Globus Online, GridFTP

• Performance measurement/test node
– perfSONAR

Details at http://fasterdata.es.net/science-dmz/


K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
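To put that figure in perspective, the transfer time follows directly from size and rate. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and a rate sustained for the whole transfer; `transfer_hours` is a helper name introduced here.

```python
def transfer_hours(size_tb: float, rate_gbps: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate_gbps gigabits per second."""
    bits = size_tb * 1e12 * 8           # decimal terabytes -> bits
    seconds = bits / (rate_gbps * 1e9)  # bits / (bits per second)
    return seconds / 3600


hours_at_5 = transfer_hours(22, 5)       # a little under 10 hours
hours_at_half = transfer_hours(22, 0.5)  # roughly four days at one tenth the rate
```

Under these assumptions 22 TB at 5 Gb/s takes a little under ten hours, while one tenth of that rate would stretch the same transfer to about four days.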


B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC


Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience


Credit: Kerstin Kleese-van Dam

Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL




…for BIG DATA


Globus Online already does a lot

Globus Toolkit

Sharing Service

Transfer Service

Globus Nexus (Identity, Group, Profile)

Globus Online APIs

Globus Connect


Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) =

Flexible, scalable, easy-to-use genomics analysis for all biologists

globus genomics


A platform for integration


We are also adding capabilities

Globus Toolkit

Sharing Service

Transfer Service

Globus Online APIs

Globus Connect


More capabilities underway …

Globus Toolkit

Sharing Service

Transfer Service

Dataset Services

Globus Online APIs

Globus Connect


Expanding Globus Online services

• Ingest and publication
– Imagine a DropBox that not only replicates, but also extracts metadata, catalogs, converts

• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata

• Computation
– Associate computational procedures, orchestrate application, catalog results, record provenance


Looking deeply at how researchers use data

• A single research question often requires the integration of many data elements, that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways

• Best grouping can vary during investigation
– Longitudinal, vertical, cross-cutting

• But always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …


How do we manage data today?

• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory naming conventions
– Capture status in README files, spreadsheets, notebooks

• Time-consuming, complex, error prone

Why can’t we manage our data like we manage our pictures and music?


Introducing the dataset

• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and describe usage

• Tag with characteristics that reflect content …
– Capture as much existing information as we can

• …or to reflect current status in investigation
– Stage of processing, provenance, validation, …

• Share data sets for collaboration
– Control access to data and metadata

• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
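A dataset in this sense is a named set of references plus tags, independent of where the files live. The sketch below is one possible in-memory shape, not the Globus dataset service; the `Dataset` class, its methods, and the "endpoint:/path" reference strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Dataset:
    """Hypothetical logical grouping: members are references to files, not copies."""
    name: str
    members: Set[str] = field(default_factory=set)      # e.g. "endpoint:/path" strings
    tags: Dict[str, str] = field(default_factory=dict)  # content or status annotations

    def add(self, *refs: str) -> None:
        """Group data by use: any reference can join, whatever its location."""
        self.members.update(refs)

    def tag(self, name: str, value: str) -> None:
        """Attach or update a characteristic such as processing stage."""
        self.tags[name] = value

    def apply(self, operation: Callable[[str], str]) -> List[str]:
        """Operate on the dataset as a unit: run one operation over every member."""
        return [operation(ref) for ref in sorted(self.members)]


# Illustrative usage; the paths are made up.
ds = Dataset("mydata42")
ds.add("aps:/raw/scan_001.h5", "pnnl:/derived/recon_001.h5")
ds.tag("status", "reconstructed")
archived = ds.apply(lambda ref: "archived " + ref)
```

Because membership is by reference, the same file can belong to several datasets, which is what lets the grouping change during an investigation without moving data.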


Builds on catalog as a service

Approach

• Hosted user-defined catalogs

• Based on tag model <subject, name, value>

• Optional schema constraints

• Integrated with other Globus services

Three REST APIs

/query/

• Retrieve subjects

/tags/

• Create, delete, retrieve tags

/tagdef/

• Create, delete, retrieve tag definitions
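The <subject, name, value> tag model and the three API roles can be mimicked with a small in-memory class. This is only a stand-in to show the data model: `TagCatalog` and its methods are hypothetical, the request and response formats of the real /query/, /tags/, and /tagdef/ endpoints are not given in the slides, and the `mydata43` and `32ID` values are made up.

```python
from typing import Dict, List, Set, Tuple


class TagCatalog:
    """In-memory stand-in for a catalog of <subject, name, value> tags (not the real service)."""

    def __init__(self) -> None:
        self.tagdefs: Dict[str, type] = {}                # role of /tagdef/: name -> allowed value type
        self.tags: Set[Tuple[str, str, object]] = set()   # role of /tags/

    def define_tag(self, name: str, value_type: type) -> None:
        self.tagdefs[name] = value_type

    def add_tag(self, subject: str, name: str, value: object) -> None:
        # Optional schema constraint: enforce the type only if the tag was defined.
        expected = self.tagdefs.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"tag {name!r} expects {expected.__name__}")
        self.tags.add((subject, name, value))

    def query(self, **criteria: object) -> List[str]:
        """Role of /query/: subjects that carry every requested name=value tag."""
        subjects = {s for s, _, _ in self.tags}
        return sorted(s for s in subjects
                      if all((s, n, v) in self.tags for n, v in criteria.items()))


catalog = TagCatalog()
catalog.define_tag("beamline", str)
catalog.add_tag("mydata42", "owner", "Francesco")
catalog.add_tag("mydata42", "beamline", "2BM")
catalog.add_tag("mydata43", "beamline", "32ID")
```

A query such as `catalog.query(beamline="2BM")` then returns the matching subjects, giving a "virtual view" of the data that does not depend on directory layout.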

Builds on USC Tagfiler project (C. Kesselman et al.)


Multi-scale imaging at APS

Beamline 2-BM-B: ~1.5 µm resolution

Beamline 32-ID-C: 20-50 nm resolution

Detector figures shown: up to 100 fps, 2K x 2K, 16 bits, 11 GB raw data; 1,500 fps, 2K x 2K, 16 bits, 1 min readout, 11 GB raw data

Storage

At each beamline: Image processing (noise removal, etc.) → Tomographic reconstruction → Visual inspection → Selection

Across beamlines: Selection → Multi-scale image fusion → Visual inspection
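The detector figures quoted for the APS beamlines (2K x 2K pixels at 16 bits, 100 to 1,500 fps, 11 GB of raw data) imply data rates that are easy to work out. The sketch assumes "2K" means 2048 pixels, uncompressed frames, and decimal gigabytes; the function names are introduced here for illustration.

```python
def frame_bytes(width: int = 2048, height: int = 2048, bits: int = 16) -> int:
    """Size of one uncompressed detector frame in bytes (assumes 2K means 2048)."""
    return width * height * bits // 8


def data_rate_gb_per_s(fps: float) -> float:
    """Raw data rate in decimal gigabytes per second at a given frame rate."""
    return frame_bytes() * fps / 1e9


def frames_in(raw_gb: float) -> int:
    """Whole frames that fit in raw_gb decimal gigabytes."""
    return int(raw_gb * 1e9 // frame_bytes())
```

Under these assumptions one frame is about 8.4 MB, 100 fps is about 0.84 GB/s, 1,500 fps is about 12.6 GB/s, and 11 GB holds roughly 1,300 frames.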


Tomography

Example dataset: mydata42 (owner: Francesco, type: 3dtomo, format: HDF5, beamline: 2BM)

Organization and orchestration:

• Define dataset, infer type, extract metadata

• Populate catalog(s)

• Locate datasets, access files

• Transfer/schedule

• Analyze

• Catalog derived products

• Record provenance

• Annotate, share, browse, search


Building a discovery cloud: Research strategy

• Identify time-consuming activity that appears amenable to automation and outsourcing

• Implement activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability, economies of scale

• Evaluate

• Extract common elements as a research automation platform

• Repeat

Bonus question: Identify methods for delivering SaaS solutions sustainably

Software as a service

Platform as a service

Infrastructure as a service


Our challenge:

Sustainability

We are a non-profit service provider to the non-profit research community


Globus Online Provider Plans

Support ongoing operations

Offer value-added capabilities

Engage more closely with users

Provider Plans offer…

• Provider endpoints with sharing

• Multiple GridFTP servers per endpoint

• Branded web sites

• Alternate identity provider

• Usage reporting

• MSS optimizations

• Operations monitoring and management

• Input into and access to product roadmap

Starting at $20k per year


To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources

“Science as a service”

Our vision for a 21st century discovery infrastructure


It’s a time of great opportunity … to develop and apply Science aaS

Globus Nexus (Identity, Group, Profile)

…

Sharing Service

Transfer Service

Dataset Services

Globus Toolkit

Globus Online APIs

Globus Connect



Thank you to our sponsors!
