Big process for big data @ PNNL, May 2013

Post on 10-May-2015


computationinstitute.org

Big process for big data

Ian Foster

foster@anl.gov


Thanks to great colleagues and collaborators

• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago

• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI

• Francesco de Carlo, Chris Jacobsen, and others at Argonne

• Kerstin Kleese-Van Dam, Carina Lansing, and others at PNNL


The Computation Institute

= UChicago + Argonne

= Cross-disciplinary nexus

= Home of the Discovery Cloud


High energy physics

Molecular biology

Cosmology

Genetics

Metagenomics

Linguistics

Economics

Climate change

Visual arts


x10 in 6 years

x10^5 in 6 years

Will data kill genomics?

Kahn, Science, 331 (6018): 728-729


18 orders of magnitude in 5 decades!

12 orders of magnitude in 6 decades!

Moore’s Law for X-Ray Sources


Large Hadron Collider

Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG


1.2 PB of climate data delivered to 23,000 users


We have exceptional infrastructure for the 1%

What about the 99%?

Big science. Small labs.


Need: A new way to deliver research cyberinfrastructure

Frictionless

Affordable

Sustainable


We asked ourselves:

What if the research workflow could be managed as easily as…

…our pictures

…home entertainment

…our e-mail


What makes these services great?

Great User Experience

+

High performance (but invisible) infrastructure


We aspire (initially) to create a great user experience for research data management

What would a “Dropbox for science” look like?


• Collect
• Move
• Sync
• Share
• Analyze
• Annotate
• Publish
• Search
• Backup
• Archive

…for BIG DATA


[Figure: data moving among a Registry, Staging Store, Ingest Store, Analysis Store, Community Store, and Archive Mirror, with error callouts along the way: “Quota exceeded!”, “Expired credentials!”, “Network failed. Retry.!”, “Permission denied!”]

It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA… but in reality it’s often very challenging


Automation is required to apply more sophisticated methods to far more data

Outsourcing is needed to achieve economies of scale in the use of automated methods

Automation and outsourcing are key


Building a discovery cloud: Research strategy

• Identify time-consuming activity that appears amenable to automation and outsourcing

• Implement activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability, economies of scale

• Evaluate

• Extract common elements as a research automation platform

• Repeat

Bonus question: Identify methods for delivering SaaS solutions sustainably

Software as a service

Platform as a service

Infrastructure as a service


• Collect
• Move
• Sync
• Share

Capabilities delivered using a Software-as-a-Service (SaaS) model


[Figure: transfer between a Data Source and a Data Destination]

1. User initiates transfer request
2. Globus Online moves/syncs files
3. Globus Online notifies user
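
The three steps above follow a request, move, notify pattern: the user hands off the job and hears back when it is done. The sketch below illustrates that pattern on local directories only. It does not use the Globus Online API; `sync_tree`, `copy_with_retry`, and the `notify` callback are hypothetical names, and the checksum comparison and retry count are assumptions made for the illustration.

```python
import hashlib
import shutil
import time
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_with_retry(src: Path, dst: Path, retries: int = 3, delay: float = 0.0) -> None:
    """Copy one file, retrying on I/O errors and verifying the checksum."""
    for attempt in range(1, retries + 1):
        try:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            if checksum(src) == checksum(dst):
                return
            raise OSError("checksum mismatch")
        except OSError:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)


def sync_tree(source: Path, destination: Path, notify=print) -> list[str]:
    """Copy files that are missing or different at the destination, then notify."""
    moved = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = destination / rel
        if dst.exists() and checksum(dst) == checksum(src):
            continue  # already in sync, nothing to move
        copy_with_retry(src, dst)
        moved.append(rel.as_posix())
    notify(f"transfer complete: {len(moved)} file(s) moved")
    return moved
```

Calling `sync_tree` a second time on unchanged directories moves nothing, which is the "sync" half of "moves/syncs".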


[Figure: sharing from a Data Source]

1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
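
The key point of the sharing steps above is that the service records who may access what, while the files stay where they are. A minimal sketch of that idea follows, assuming a simple path-prefix permission model; `ShareRegistry` and its methods are hypothetical and are not the Globus Online sharing interface.

```python
from dataclasses import dataclass, field


@dataclass
class ShareRegistry:
    """Tracks which users or groups may access which paths; stores no file data."""
    # Maps a group name to its member user names.
    groups: dict[str, set[str]] = field(default_factory=dict)
    # Maps a shared path to {principal: permissions}, e.g. {"bob": "r"}.
    shares: dict[str, dict[str, str]] = field(default_factory=dict)

    def share(self, path: str, principal: str, permissions: str = "r") -> None:
        """Step 1: the owner grants a user or group access to a path."""
        self.shares.setdefault(path.rstrip("/"), {})[principal] = permissions

    def can_access(self, user: str, path: str, mode: str = "r") -> bool:
        """Step 3: check whether a user may access a path under any share."""
        principals = {user} | {g for g, members in self.groups.items() if user in members}
        for shared_path, grants in self.shares.items():
            if path == shared_path or path.startswith(shared_path + "/"):
                if any(mode in grants.get(p, "") for p in principals):
                    return True
        return False
```

For example, after `registry.share("/data/run42", "beamline-team", "r")`, any member of the hypothetical group `beamline-team` can read files under `/data/run42`, and nothing has been copied anywhere.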


Extreme ease of use

• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install


Early adoption is encouraging

8,000 registered users; >100 daily

~16 PB moved; ~1B files

10x (or better) performance vs. scp

99.9% availability

Entirely hosted on Amazon


We benefit greatly from ESnet’s “Science DMZ”

Three key components, all required:

• “Friction free” network path
  – Highly capable network devices (wire-speed, deep queues)
  – Virtual circuit connectivity option
  – Security policy and enforcement specific to science workflows
  – Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
  – Hardware, operating system, libraries optimized for transfer
  – Optimized data transfer tools: Globus Online, GridFTP
• Performance measurement/test node
  – perfSONAR

Details at http://fasterdata.es.net/science-dmz/


K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
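
To put the figures on this slide in perspective, a quick calculation gives the elapsed time they imply. This is a back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits/s) and a rate sustained for the whole transfer; `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(terabytes: float, gigabits_per_second: float) -> float:
    """Hours needed to move a data volume at a sustained rate (decimal units)."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


print(round(transfer_hours(22, 5), 1))  # prints 9.8
```

Under those assumptions, 22 TB at 5 Gb/s takes a little under ten hours.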


B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC


Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience


Credit: Kerstin Kleese-van Dam

Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL


Globus Online already does a lot

[Figure: Globus Online service stack]

Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect


Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for all biologists

Globus Genomics


A platform for integration


We are also adding capabilities


More capabilities underway …

[Figure: Globus Online service stack, with Dataset Services added]

Globus Toolkit
Sharing Service
Transfer Service
Dataset Services
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect


Expanding Globus Online services

• Ingest and publication
  – Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts
• Cataloging
  – Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
  – Associate computational procedures, orchestrate application, catalog results, record provenance


Looking deeply at how researchers use data

• A single research question often requires the integration of many data elements that are:
  – In different locations
  – In different formats (Excel, text, CDF, HDF, …)
  – Described in different ways
• Best grouping can vary during investigation
  – Longitudinal, vertical, cross-cutting
• But always needs to be operated on as a unit
  – Share, annotate, process, copy, archive, …


How do we manage data today?

• Often, a curious mix of ad hoc methods
  – Organize in directories using file and directory naming conventions
  – Capture status in README files, spreadsheets, notebooks
• Time-consuming, complex, error prone

Why can’t we manage our data like we manage our pictures and music?


Introducing the dataset

• Group data based on use, not location
  – Logical grouping to organize, reorganize, search, and describe usage
• Tag with characteristics that reflect content …
  – Capture as much existing information as we can
• …or to reflect current status in investigation
  – Stage of processing, provenance, validation, …
• Share datasets for collaboration
  – Control access to data and metadata
• Operate on datasets as units
  – Copy, export, analyze, tag, archive, …
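
One way to picture the dataset idea is as a small record that groups file references from any location, carries tags, and lets one operation run over all members. The sketch below is illustrative only: the `Dataset` class, its fields, and `find` are hypothetical names, not part of any Globus service, and access control is reduced to a set of reader names.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A logical group of files, identified by use rather than by location."""
    name: str
    members: set[str] = field(default_factory=set)      # file URLs, any location
    tags: dict[str, str] = field(default_factory=dict)  # content or status tags
    readers: set[str] = field(default_factory=set)      # users allowed to read

    def add(self, *urls: str) -> None:
        """Add file references without moving the files themselves."""
        self.members.update(urls)

    def tag(self, **tags: str) -> None:
        """Attach or update tags such as stage of processing."""
        self.tags.update(tags)

    def apply(self, operation):
        """Operate on the dataset as a unit: run one operation over every member."""
        return {url: operation(url) for url in sorted(self.members)}


def find(datasets, **wanted):
    """Return names of datasets whose tags match all the requested values."""
    return [d.name for d in datasets
            if all(d.tags.get(k) == v for k, v in wanted.items())]
```

Regrouping during an investigation then means creating another `Dataset` that references some of the same members, rather than reorganizing directories.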


Builds on catalog as a service

Approach

• Hosted user-defined catalogs

• Based on tag model <subject, name, value>

• Optional schema constraints

• Integrated with other Globus services

Three REST APIs

/query/

• Retrieve subjects

/tags/

• Create, delete, retrieve tags

/tagdef/

• Create, delete, retrieve tag definitions

Builds on USC Tagfiler project (C. Kesselman et al.)
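
The <subject, name, value> tag model and the three API families can be mimicked with a few lines of in-memory code. This is a sketch of the data model only: the slides do not show request or response formats, so the `Catalog` class and its method names are hypothetical and do not reproduce the actual REST interface or Tagfiler. Type checking stands in for the optional schema constraints.

```python
class Catalog:
    """In-memory stand-in for a hosted catalog built on <subject, name, value> tags."""

    def __init__(self):
        self.tagdefs = {}  # tag name -> expected Python type (optional schema)
        self.tags = set()  # set of (subject, name, value) triples

    def create_tagdef(self, name, value_type):
        """Analogue of /tagdef/: declare a tag and constrain its value type."""
        self.tagdefs[name] = value_type

    def create_tag(self, subject, name, value):
        """Analogue of /tags/: attach a tag, enforcing the schema if one exists."""
        expected = self.tagdefs.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"tag {name!r} expects {expected.__name__}")
        self.tags.add((subject, name, value))

    def delete_tag(self, subject, name):
        """Analogue of /tags/: remove every value of one tag from one subject."""
        self.tags = {t for t in self.tags if t[:2] != (subject, name)}

    def query(self, **criteria):
        """Analogue of /query/: subjects carrying every requested name/value pair."""
        subjects = {s for s, _, _ in self.tags}
        return sorted(s for s in subjects
                      if all((s, n, v) in self.tags for n, v in criteria.items()))
```

With tags such as `("mydata42", "format", "HDF5")` in place, `catalog.query(format="HDF5")` returns the matching subjects regardless of where their files live.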


Multi-scale imaging at APS

[Figure: multi-scale imaging workflow across two APS beamlines]

Beamlines:
• Beamline 2-BM-B: ~1.5 µm resolution
• Beamline 32-ID-C: 20-50 nm resolution

Acquisition figures shown:
• Up to 100 fps; 2K x 2K, 16 bits; 11 GB raw data
• 1,500 fps; 2K x 2K, 16 bits; 1 min readout; 11 GB raw data

Steps shown for each beamline: Storage; Image processing (noise removal, etc.); Tomographic reconstruction; Visual inspection; Selection

Combined steps: Selection; Multi-scale image fusion; Visual inspection

[Figure: lifecycle of a tomography dataset, with activities labeled Organization and Orchestration]

Dataset “mydata42”: owner: Francesco; type: 3dtomo; format: HDF5; beamline: 2BM

Activities shown:
• Define dataset, infer type, extract metadata
• Populate catalog(s)
• Locate datasets, access files
• Analyze
• Catalog derived products
• Transfer/schedule
• Record provenance
• Annotate, share, browse, search
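
The activities in this figure can be read as a small pipeline: define a dataset, infer its type, populate a catalog, then catalog derived products and record where they came from. The sketch below walks through that sequence with plain dictionaries and lists. It is an illustration under assumptions, not the APS or Globus implementation: the function names, the extension-to-format table, and the provenance tuple layout are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to format name.
FORMATS = {".h5": "HDF5", ".hdf5": "HDF5", ".csv": "text", ".txt": "text"}


def infer_format(filename):
    """Infer a format from the extension; 'unknown' when unrecognized."""
    return FORMATS.get(PurePosixPath(filename).suffix.lower(), "unknown")


def ingest(name, files, owner, catalog, provenance):
    """Define a dataset, extract simple metadata, and populate the catalog."""
    formats = sorted({infer_format(f) for f in files})
    catalog[name] = {
        "owner": owner,
        "files": list(files),
        "format": formats[0] if len(formats) == 1 else "mixed",
    }
    provenance.append(("ingest", list(files), name))
    return catalog[name]


def derive(name, source, step, catalog, provenance):
    """Catalog a derived product and record which dataset and step produced it."""
    catalog[name] = {
        "owner": catalog[source]["owner"],
        "derived_from": source,
        "step": step,
    }
    provenance.append((step, [source], name))
    return catalog[name]
```

Because every step appends to `provenance`, the chain from raw frames to a reconstructed volume can be replayed or audited later.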


Our challenge:

Sustainability

We are a non-profit service provider to the non-profit research community


Globus Online Provider Plans

Support ongoing operations

Offer value-added capabilities

Engage more closely with users

Provider Plans offer…

• Provider endpoints with sharing
• Multiple GridFTP servers per endpoint
• Branded web sites
• Alternate identity provider
• Usage reporting
• MSS optimizations
• Operations monitoring and management
• Input into and access to product roadmap

Starting at $20k per year


To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources

“Science as a service”

Our vision for a 21st century discovery infrastructure


It’s a time of great opportunity … to develop and apply Science aaS

[Figure: Globus Online service stack]

Globus Nexus (Identity, Group, Profile)
Sharing Service
Transfer Service
Dataset Services
Globus Toolkit
Globus Online APIs
Globus Connect


Thank you to our sponsors!
