Big Process for Big Data @ PNNL, May 2013

Ian Foster, computationinstitute.org

Upload: ian-foster

Post on 10-May-2015

TRANSCRIPT

Page 1: Big Process for Big Data @ PNNL, May 2013

computationinstitute.org

Big process for big data

Ian Foster

Page 2: Big Process for Big Data @ PNNL, May 2013

Thanks to great colleagues and collaborators

• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago

• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI

• Francesco de Carlo, Chris Jacobsen, and others at Argonne

• Kerstin Kleese-van Dam, Carina Lansing, and others at PNNL

Page 3: Big Process for Big Data @ PNNL, May 2013

The Computation Institute

= UChicago + Argonne

= Cross-disciplinary nexus

= Home of the Discovery Cloud

Page 4: Big Process for Big Data @ PNNL, May 2013

High energy physics

Molecular biology

Cosmology

Genetics

Metagenomics

Linguistics

Economics

Climate change

Visual arts

Page 5: Big Process for Big Data @ PNNL, May 2013

x10 in 6 years

x10^5 in 6 years

Will data kill genomics?

Kahn, Science, 331 (6018): 728-729

Page 6: Big Process for Big Data @ PNNL, May 2013

18 orders of magnitude in 5 decades!

12 orders of magnitude in 6 decades!

Moore’s Law for X-Ray Sources

Page 7: Big Process for Big Data @ PNNL, May 2013

Large Hadron Collider

Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG

Page 8: Big Process for Big Data @ PNNL, May 2013

Page 9: Big Process for Big Data @ PNNL, May 2013

1.2 PB of climate data

Delivered to 23,000 users

Page 10: Big Process for Big Data @ PNNL, May 2013

We have exceptional infrastructure for the 1%

Page 11: Big Process for Big Data @ PNNL, May 2013

What about the 99%?

We have exceptional infrastructure for the 1%

Page 12: Big Process for Big Data @ PNNL, May 2013

What about the 99%?

We have exceptional infrastructure for the 1%

Big science. Small labs.

Page 13: Big Process for Big Data @ PNNL, May 2013

Need: A new way to deliver research cyberinfrastructure

Frictionless

Affordable

Sustainable

Page 14: Big Process for Big Data @ PNNL, May 2013

We asked ourselves:

What if the research workflow could be managed as easily as…

…our pictures

…home entertainment

…our e-mail

Page 15: Big Process for Big Data @ PNNL, May 2013

What makes these services great?

Great User Experience

+

High performance (but invisible) infrastructure

Page 16: Big Process for Big Data @ PNNL, May 2013

We aspire (initially) to create a great user experience for research data management

What would a “dropbox for science” look like?

Page 17: Big Process for Big Data @ PNNL, May 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

…for BIG DATA

Page 18: Big Process for Big Data @ PNNL, May 2013

[Diagram: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive Mirror; error callouts: “Quota exceeded!”, “Expired credentials!”, “Network failed. Retry.”, “Permission denied!”]

It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA… but in reality it’s often very challenging

Page 19: Big Process for Big Data @ PNNL, May 2013

Automation is required to apply more sophisticated methods to far more data

Automation and outsourcing are key

Page 20: Big Process for Big Data @ PNNL, May 2013

Automation is required to apply more sophisticated methods to far more data

Outsourcing is needed to achieve economies of scale in the use of automated methods

Automation and outsourcing are key

Page 21: Big Process for Big Data @ PNNL, May 2013

Building a discovery cloud: Research strategy

• Identify time-consuming activity that appears amenable to automation and outsourcing

• Implement activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability, economies of scale

• Evaluate

• Extract common elements as a research automation platform

• Repeat

Bonus question: Identify methods for delivering SaaS solutions sustainably

Software as a service

Platform as a service

Infrastructure as a service

Page 22: Big Process for Big Data @ PNNL, May 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

…for BIG DATA

Page 23: Big Process for Big Data @ PNNL, May 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

• Collect • Move • Sync • Share

Capabilities delivered using Software-as-a-Service (SaaS) model

Page 24: Big Process for Big Data @ PNNL, May 2013

Data Source

Data Destination

1. User initiates transfer request

2. Globus Online moves/syncs files

3. Globus Online notifies user
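The three steps on this slide can be sketched as a toy, in-memory model. This is not the Globus Online API: `Endpoint`, `sync`, and `run_transfer` are hypothetical names invented for illustration, and files are plain byte strings held in dictionaries.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical stand-in for a storage endpoint: file name -> contents."""
    name: str
    files: dict = field(default_factory=dict)


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sync(source: Endpoint, destination: Endpoint) -> list:
    """Copy only the files that are missing or different at the destination."""
    moved = []
    for path, data in sorted(source.files.items()):
        existing = destination.files.get(path)
        if existing is None or checksum(existing) != checksum(data):
            destination.files[path] = data
            moved.append(path)
    return moved


def run_transfer(source, destination, notify):
    # Step 1: the user's request arrives (here, simply this function call).
    # Step 2: the service moves/syncs the files on the user's behalf.
    moved = sync(source, destination)
    # Step 3: the service notifies the user of the outcome.
    notify(f"{len(moved)} file(s) copied from {source.name} to {destination.name}")
    return moved


src = Endpoint("lab-storage", {"a.dat": b"alpha", "b.dat": b"beta"})
dst = Endpoint("hpc-scratch", {"a.dat": b"alpha"})
messages = []
moved = run_transfer(src, dst, messages.append)
```

The point of the sketch is the division of labor: the user only issues the request and reads the notification, while the comparison and copying happen without further involvement.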

Page 25: Big Process for Big Data @ PNNL, May 2013

Data Source

1. User A selects file(s) to share; selects user/group, sets share permissions

2. Globus Online tracks shared files; no need to move files to cloud storage!

3. User B logs in to Globus Online and accesses shared file
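A minimal sketch of the "share in place" idea, assuming a simple grant table: the service records who may read which path, and the data never leaves the owner's storage. `ShareRegistry`, `fetch`, and the paths and user names are hypothetical and do not reflect how Globus Online is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ShareRegistry:
    """Hypothetical registry: records who may access which path, nothing more."""
    grants: dict = field(default_factory=dict)  # path -> {user: permission}

    def share(self, path, user, permission="read"):
        self.grants.setdefault(path, {})[user] = permission

    def can_read(self, path, user):
        return self.grants.get(path, {}).get(user) in ("read", "write")


# The data stays on the owner's storage; only the grant is recorded.
owner_storage = {"/data/run7.h5": b"\x00" * 16}
registry = ShareRegistry()


def fetch(path, user):
    """Serve the file straight from the owner's storage if a grant exists."""
    if not registry.can_read(path, user):
        raise PermissionError(f"{user} may not read {path}")
    return owner_storage[path]


registry.share("/data/run7.h5", "user_b")   # step 1: User A shares
payload = fetch("/data/run7.h5", "user_b")  # step 3: User B accesses
```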

Page 26: Big Process for Big Data @ PNNL, May 2013

Extreme ease of use

• InCommon, OAuth, OpenID, X.509, …

• Credential management

• Group definition and management

• Transfer management and optimization

• Reliability via transfer retries

• Web interface, REST API, command line

• One-click “Globus Connect” install

• 5-minute Globus Connect Multi User install
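"Reliability via transfer retries" can be illustrated with a generic retry loop using exponential backoff. This is a sketch of the general technique only; the slide does not say which retry policy Globus Online uses, and `with_retries` and `flaky_transfer` are hypothetical names.

```python
import time


def with_retries(operation, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_transfer():
    """Hypothetical transfer that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network failed")
    return "done"


delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_transfer, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff schedule easy to inspect.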

Page 27: Big Process for Big Data @ PNNL, May 2013

Early adoption is encouraging

Page 28: Big Process for Big Data @ PNNL, May 2013

Early adoption is encouraging

8,000 registered users; >100 daily

~16 PB moved; ~1B files

10x (or better) performance vs. scp

99.9% availability

Entirely hosted on Amazon

Page 29: Big Process for Big Data @ PNNL, May 2013
Page 30: Big Process for Big Data @ PNNL, May 2013

We benefit greatly from ESnet’s “Science DMZ”

Three key components, all required:

• “Friction free” network path
– Highly capable network devices (wire-speed, deep queues)
– Virtual circuit connectivity option
– Security policy and enforcement specific to science workflows
– Located at or near site perimeter if possible

• Dedicated, high-performance Data Transfer Nodes (DTNs)
– Hardware, operating system, libraries optimized for transfer
– Optimized data transfer tools: Globus Online, GridFTP

• Performance measurement/test node
– perfSONAR

Details at http://fasterdata.es.net/science-dmz/

Page 31: Big Process for Big Data @ PNNL, May 2013

K. Heitmann (Argonne) moves 22 TB of cosmology data LANL → ANL at 5 Gb/s
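For a sense of scale, the duration of such a transfer follows directly from the two figures on the slide. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits/s) and a sustained rate for the whole transfer; the slide states neither.

```python
def transfer_hours(size_bytes, rate_bits_per_s):
    """Time to move size_bytes at a sustained rate, in hours."""
    return size_bytes * 8 / rate_bits_per_s / 3600


TB = 10**12    # decimal terabyte (assumption)
GBPS = 10**9   # decimal gigabit per second (assumption)

hours_at_5g = transfer_hours(22 * TB, 5 * GBPS)
print(f"{hours_at_5g:.1f} hours")
```

Under those assumptions the transfer takes a little under ten hours.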

Page 32: Big Process for Big Data @ PNNL, May 2013

B. Winjum (UCLA) moves 900K-file plasma physics datasets UCLA → NERSC

Page 33: Big Process for Big Data @ PNNL, May 2013

Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience

Page 34: Big Process for Big Data @ PNNL, May 2013

Credit: Kerstin Kleese-van Dam

Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL

Page 35: Big Process for Big Data @ PNNL, May 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

…for BIG DATA

Page 36: Big Process for Big Data @ PNNL, May 2013

• Collect • Move • Sync • Share • Analyze

• Annotate • Publish • Search • Backup • Archive

…for BIG DATA

Page 37: Big Process for Big Data @ PNNL, May 2013

Globus Online already does a lot

Globus Toolkit

Sharing Service

Transfer Service

Globus Nexus (Identity, Group, Profile)

Globus Online APIs

Globus Connect

Page 38: Big Process for Big Data @ PNNL, May 2013

Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for all biologists

Globus Genomics

Page 39: Big Process for Big Data @ PNNL, May 2013

A platform for integration

Page 40: Big Process for Big Data @ PNNL, May 2013

A platform for integration

Page 41: Big Process for Big Data @ PNNL, May 2013

A platform for integration

Page 42: Big Process for Big Data @ PNNL, May 2013

We are also adding capabilities

Globus Toolkit

Sharing Service

Transfer Service

Globus Nexus (Identity, Group, Profile)

Globus Online APIs

Globus Connect

Page 43: Big Process for Big Data @ PNNL, May 2013

More capabilities underway …

Globus Toolkit

Sharing Service

Transfer Service

Dataset Services

Globus Nexus (Identity, Group, Profile)

Globus Online APIs

Globus Connect

Page 44: Big Process for Big Data @ PNNL, May 2013

Expanding Globus Online services

• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts

• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata

• Computation
– Associate computational procedures, orchestrate application, catalog results, record provenance

Page 45: Big Process for Big Data @ PNNL, May 2013

Looking deeply at how researchers use data

• A single research question often requires the integration of many data elements that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways

• Best grouping can vary during investigation
– Longitudinal, vertical, cross-cutting

• But always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …

Page 46: Big Process for Big Data @ PNNL, May 2013

How do we manage data today?

• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory naming conventions
– Capture status in README files, spreadsheets, notebooks

• Time-consuming, complex, error prone

Why can’t we manage our data like we manage our pictures and music?

Page 47: Big Process for Big Data @ PNNL, May 2013
Page 48: Big Process for Big Data @ PNNL, May 2013

Introducing the dataset

• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and describe usage

• Tag with characteristics that reflect content …
– Capture as much existing information as we can

• …or to reflect current status in investigation
– Stage of processing, provenance, validation, …

• Share datasets for collaboration
– Control access to data and metadata

• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
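One way to picture a dataset as described here is a named group of file references plus tags, with operations applied to the whole group. The sketch below is an illustration only: the `Dataset` class, its methods, and the example URIs are hypothetical and are not the Globus dataset service.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical dataset: a named group of file references plus tags."""
    name: str
    members: list = field(default_factory=list)  # URIs; files stay where they are
    tags: dict = field(default_factory=dict)

    def add(self, uri):
        self.members.append(uri)

    def tag(self, key, value):
        self.tags[key] = value

    def apply(self, operation):
        """Operate on the dataset as a unit: run operation on every member."""
        return [operation(uri) for uri in self.members]


ds = Dataset("example-dataset")
ds.add("siteA:/raw/scan_001.h5")          # grouped by use, not location
ds.add("siteB:/notes/calibration.xlsx")   # different site, different format
ds.tag("stage", "raw")                    # current status in the investigation

# A single "archive" operation covers every member, wherever it lives.
archived = ds.apply(lambda uri: "archive:" + uri.split(":", 1)[1])
```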

Page 49: Big Process for Big Data @ PNNL, May 2013

Builds on catalog as a service

Approach

• Hosted user-defined catalogs

• Based on tag model <subject, name, value>

• Optional schema constraints

• Integrated with other Globus services

Three REST APIs

/query/

• Retrieve subjects

/tags/

• Create, delete, retrieve tags

/tagdef/

• Create, delete, retrieve tag definitions

Builds on USC Tagfiler project (C. Kesselman et al.)
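The <subject, name, value> tag model can be sketched as an in-memory toy whose three operations loosely mirror the /tagdef/, /tags/, and /query/ resources. The class and method names, and the use of Python types as the "optional schema constraint", are assumptions made for illustration; the slide does not give the real request formats, and this is not the Tagfiler or Globus catalog API.

```python
class TagCatalog:
    """In-memory toy catalog of <subject, name, value> tags."""

    def __init__(self):
        self.tagdefs = {}   # tag name -> required Python type (optional schema)
        self.tags = set()   # {(subject, name, value)}

    def define_tag(self, name, value_type):
        """Analogous to creating a tag definition."""
        self.tagdefs[name] = value_type

    def add_tag(self, subject, name, value):
        """Analogous to creating a tag; enforces the schema if one is defined."""
        expected = self.tagdefs.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"tag {name!r} expects {expected.__name__}")
        self.tags.add((subject, name, value))

    def query(self, **criteria):
        """Return subjects that carry every requested name=value tag."""
        subjects = {s for s, _, _ in self.tags}
        return sorted(
            s for s in subjects
            if all((s, n, v) in self.tags for n, v in criteria.items())
        )


catalog = TagCatalog()
catalog.define_tag("beamline", str)
catalog.add_tag("scan-001", "beamline", "2BM")
catalog.add_tag("scan-001", "format", "HDF5")
catalog.add_tag("scan-002", "beamline", "32ID")
matches = catalog.query(beamline="2BM", format="HDF5")
```

Tags without a definition ("format" above) are accepted freely, which is one reading of "optional schema constraints".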

Page 50: Big Process for Big Data @ PNNL, May 2013

Multi-scale imaging at APS

Beamline 2-BM-B: ~1.5 µm resolution

Beamline 32-ID-C: 20-50 nm resolution

Detector annotations: up to 100 fps, 2K x 2K, 16 bits, 11 GB raw data; 1,500 fps, 2K x 2K, 16 bits, 1 min readout, 11 GB raw data

[Workflow diagram. Each beamline: Image processing (noise removal, etc.), Tomographic reconstruction, Visual inspection, Selection. Combined: Selection, Multi-scale image fusion, Visual inspection. Storage.]
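The detector figures on this slide imply some quick arithmetic. The sketch assumes "2K" means 2048 pixels per side and that "GB" is decimal; neither is stated on the slide.

```python
PIXELS_PER_SIDE = 2048   # assumption: "2K" read as 2048
BYTES_PER_PIXEL = 2      # 16 bits

frame_bytes = PIXELS_PER_SIDE ** 2 * BYTES_PER_PIXEL
frames_in_11_gb = 11e9 / frame_bytes        # assumption: decimal gigabytes
rate_at_100_fps = 100 * frame_bytes / 1e9   # GB/s while the detector streams

print(f"{frame_bytes / 1e6:.1f} MB per frame")
print(f"about {frames_in_11_gb:.0f} frames in 11 GB")
print(f"{rate_at_100_fps:.2f} GB/s at 100 fps")
```

Under these assumptions one frame is about 8.4 MB, so an 11 GB scan holds roughly 1,300 frames, and 100 fps corresponds to about 0.84 GB/s.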

Page 51: Big Process for Big Data @ PNNL, May 2013

[Diagram: Tomography dataset lifecycle, with steps grouped under Organization and Orchestration]

Dataset mydata42: owner: Francesco; type: 3dtomo; format: HDF5; beamline: 2BM

Steps shown: Define dataset; Infer type; Extract metadata; Populate catalog(s); Locate datasets; Access files; transfer/schedule; analyze; Catalog derived products; Record provenance; Annotate, share, browse, search

Page 52: Big Process for Big Data @ PNNL, May 2013
Page 53: Big Process for Big Data @ PNNL, May 2013
Page 54: Big Process for Big Data @ PNNL, May 2013
Page 55: Big Process for Big Data @ PNNL, May 2013

Page 56: Big Process for Big Data @ PNNL, May 2013

Page 57: Big Process for Big Data @ PNNL, May 2013

Page 58: Big Process for Big Data @ PNNL, May 2013

Building a discovery cloud: Research strategy

• Identify time-consuming activity that appears amenable to automation and outsourcing

• Implement activity as a high-quality, low-touch SaaS solution, leveraging commercial IaaS for high reliability, economies of scale

• Evaluate

• Extract common elements as a research automation platform

• Repeat

Bonus question: Identify methods for delivering SaaS solutions sustainably

Software as a service

Platform as a service

Infrastructure as a service

Page 59: Big Process for Big Data @ PNNL, May 2013

Our challenge:

Sustainability

We are a non-profit service provider to the non-profit research community

Page 60: Big Process for Big Data @ PNNL, May 2013

Globus Online Provider Plans

Support ongoing operations

Offer value-added capabilities

Engage more closely with users

Page 61: Big Process for Big Data @ PNNL, May 2013

Provider Plans offer…

• Provider endpoints with sharing

• Multiple GridFTP servers per endpoint

• Branded web sites

• Alternate identity provider

• Usage reporting

• MSS optimizations

• Operations monitoring and management

• Input into and access to product roadmap

Starting at $20k per year

Page 62: Big Process for Big Data @ PNNL, May 2013

To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources

“Science as a service”

Our vision for a 21st century discovery infrastructure

Page 63: Big Process for Big Data @ PNNL, May 2013

It’s a time of great opportunity … to develop and apply Science aaS

Globus Nexus (Identity, Group, Profile)

Sharing Service

Transfer Service

Dataset Services

Globus Toolkit

Globus Online APIs

Globus Connect

Page 64: Big Process for Big Data @ PNNL, May 2013

Thanks to great colleagues and collaborators

• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago

• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI

• Francesco de Carlo, Chris Jacobsen, and others at Argonne

• Kerstin Kleese-van Dam, Carina Lansing, and others at PNNL

Page 65: Big Process for Big Data @ PNNL, May 2013

Thank you to our sponsors!