
Page 1:

Data Management on the Fusion Computational Pipeline, “End-to-end solutions”

Presentation to the SciDAC 2005 meeting, 6/29/05

Scott A. Klasky (PPPL)

M. Beck (UTK), V. Bhat (PPPL), E. Feibush (PPPL), B. Ludäscher (UCD), M. Parashar (Rutgers), A. Shoshani (LBL), D. Silver (Rutgers), M. Vouk (NCSU)

GPS SciDAC, CEMM SciDAC

Batchelor


No singing in this presentation

Page 2:

Outline of Talk

• The Fusion Simulation Project (FSP).

• Computer Science enabling technologies.

• The Scientific Investigation Process.

• Technologies necessary for leadership-class computing, such as the FSP:
– Adaptive Workflow Technology
– Data Streaming
– Collaborative Code Monitoring
– Integrated Data Analysis and Visualization Environment.

– Ubiquitous and Transparent Data Sharing.


Page 3:

Fusion Simulation Project (FSP): a 15-year project

A complete simulation of all interacting phenomena.

[Chart: Data Generation (TBs) vs. Time]

• Strong need for Scientific Data Management for the FSP! (Dahlburg report)

Page 4:

It’s about the enabling technologies

[Diagram (Keyes): applications drive; enabling technologies (CS, Math) respond.]


Page 5:

FSP has computer science/DM requirements

• Coupling multiple codes/data
– In-core and network-based
• Analysis and visualization
– Feature extraction, data juxtaposition for V&V
• Dynamic monitoring and control
– Parameter modification, snapshot generation, …
• Data sharing among collaborators
– Transparent and efficient data access
• These requirements are shared with many other simulations across the DOE community.


Page 6:

Six data technologies fundamental to supporting the data management requirements for scientific applications

• From the report of the DOE Office of Science Data-Management Workshops (March – May 2004) (R. Mount)
– http://www-user.slac.stanford.edu/rmount/dm-workshop-04/Final-report.pdf
– Workflow, data flow, data transformation
– Metadata, data description, logical organization
– Efficient access and queries, data integration
– Distributed data management, data movement, networks
– Storage and caching
– Data analysis, visualization, and integrated environments

• A path-finding FSP should develop and demonstrate these components.

Page 7:

Overall priorities for each of the six areas of data management

• Each branch (simulation-driven, experiment/observation-driven, information-intensive) of each application science ranked the six areas from 1 (lowest) to 6 (highest).
• The (Fusion, Astrophysics, Combustion, and Climate) simulations have similar needs.
• An end-to-end solution links these areas together for one consistent view of the data.

Page 8:

OFES has a clear need for advanced SDM

• Current OFES data management technologies work well for current experiments, but do not scale well for large data.

• The time is ripe for OFES to join in collaborative efforts with other DOE data management researchers and design a system that will be scalable to an FSP and ultimately to the needs of ITER.

Page 9:

The Scientific Investigation Process

• A simplified version of the scientific investigation process is shown below, with seven stages.
• At every stage, data management is essential.
• Idea stage: scientists pose a question about a phenomenon and a hypothesis for its explanation.
• Implementation stage: implement a test-bed, possibly revising the hypothesis along the way.
• V&V stage: interpret results via data analysis/visualization tools.
• Pre-production stage: run parameter surveys/sensitivity analyses.
• Production stage: perform large experiments.
• Interpretation stage: interpret the results from the production/pre-production stages.
• Assimilation stage: assimilate the results from the previous steps.
• The GOAL of end-to-end solutions is to reduce the time from idea to discovery!

Page 10:

Workflows (an edge FSP project, NYU, Chang et al.): monitoring, analysis, storing

[Workflow diagram: Start (L-H) → XGC-ET → mesh/interpolation → M3D-L (linear stability check); if unstable, mesh/interpolation → M3D, iterating until stable and the B field is healed, then back to XGC-ET. Analysis actors include noise detection, blob detection, puncture-plot computation, island detection, out-of-core isosurface methods, feature detection, a "need more flights?" check, and a portal (ElVis). Data passes through distributed stores at TB, GB, and MB scales into IDAVE.]
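The control flow implied by the diagram above can be sketched as a simple loop. In this minimal sketch the step functions are hypothetical stubs standing in for the XGC-ET, M3D-L, M3D, and mesh/interpolation actors, not real FSP codes.

    # Minimal sketch of the control loop implied by the workflow above.
    # The step functions are hypothetical stubs, not real FSP codes.

    def run_xgc_et(state):  return state             # kinetic edge run (stub)
    def run_m3d_l(state):   return {"stable": True}  # M3D-L linear stability check (stub)
    def run_m3d(state):     return {"healed": True}  # nonlinear M3D run (stub)
    def interpolate(state): return state             # mesh/interpolation step (stub)

    def edge_workflow(state, max_cycles=10):
        for _ in range(max_cycles):
            state = run_xgc_et(state)
            coarse = interpolate(state)
            if run_m3d_l(coarse)["stable"]:
                continue                    # stable: keep running the kinetic code
            mhd = run_m3d(coarse)
            while not mhd["healed"]:        # rerun M3D until the field is healed
                mhd = run_m3d(mhd)
            state = interpolate(mhd)        # map back and resume XGC-ET
        return state

    final_state = edge_workflow({"time": 0.0})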

Page 11:

Scientific Workflows, Pre-KEPLER/SPA

• Distributed Data & Job Management
– Authenticate, access, move, replicate, query … data ("Data-Grid")
– Schedule, launch, monitor jobs ("Compute-Grid")
• Data Integration
– Conceptual querying & integration, structure & semantics, e.g. mediation w/ SQL, XQuery + OWL (semantics-enabled mediator)
• Data Analysis, Mining, Knowledge Discovery
• Scientific Visualization
– 3-D (volume), 4-D (spatio-temporal), n-D (conceptual views) …

Lack of integration:
– One-of-a-kind custom apps, detached (island) solutions
– Such workflows are hard to understand, maintain, reproduce
– No/little workflow design, automation, reuse, documentation
– Need for an integrated scientific workflow environment

Page 12:

What is a Scientific Workflow (SWF)?

• Model the way scientists work with their data and tools
– Mentally coordinate data export, import, and analysis via software systems

• Scientific workflows emphasize data flow (≠ business workflows)

• Metadata (incl. provenance info, semantic types etc.) is crucial for automated data ingestion, data analysis, …

• Goals: – SWF automation,

– SWF & component reuse,

– SWF design & documentation

– making scientists’ data analysis and management easier!

Page 13:

Interactive and Autonomic Control of Workflows/ Simulations

• The scale, complexity, and dynamism of the FSP require simulations to be accessed, monitored, and controlled during execution.

• Development and deployment of applications that can be externally monitored and interactively or autonomically controlled.
– Enable interactive and autonomic (policy-driven) control of simulation elements, interactions, and workflows.
– A control network to enable elements to be accessed and managed externally.

• Support runtime monitoring, dynamic data injection, and simulation workflow control.

• Support efficient and scalable implementations of monitoring, interactive and autonomic control, and rule execution.
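As an illustration of the "autonomic (policy-driven) control" idea above, here is a minimal sketch that evaluates simple rules against monitored simulation values and adjusts parameters. The rule format, variable names, and thresholds are assumptions for illustration, not the actual control-network interface.

    # Hypothetical policy-driven monitor: each rule maps a predicate on
    # monitored values to an action that adjusts the running simulation.

    def check_rules(monitored, rules, params):
        for predicate, action in rules:
            if predicate(monitored):
                action(params)      # e.g. halve a time step, trigger a snapshot
        return params

    rules = [
        (lambda m: m["growth_rate"] > 0.1, lambda p: p.update(dt=p["dt"] / 2)),
        (lambda m: m["step"] % 100 == 0,   lambda p: p.update(snapshot=True)),
    ]

    params = {"dt": 1e-3, "snapshot": False}
    params = check_rules({"growth_rate": 0.2, "step": 400}, rules, params)
    print(params)   # {'dt': 0.0005, 'snapshot': True}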

Page 14:

PPPL/LN/Rutgers Data Streaming Technology: Adaptive Threaded Buffer Management

[Diagram: data-input and data-transfer stages exchanging blocks of data and metadata through the buffer, with a feedback path between them.]

• Thread + buffer the I/O layer to overlap communication/computation with I/O.
• The idea is to stream as much data over the WAN as possible during the simulation, with less overhead than writing to local disk.
• Data is accessed in an identical fashion for local and remote depots.
• Local depots at 500 Mbps: low latency. PPPL depots at 100 Mbps: high latency.

[Diagram: simulation cluster (16 procs/node) with an I/O processor and a failsafe processor per TWRITE group, an I/O buffer, local depots close to the simulation cluster, and remote depots at PPPL.]

1) Create groups of 16 processors per TWRITE group: 1 I/O processor and 1 failsafe processor.
2) Simulation processors transfer data to the I/O processor, which enqueues the data in the I/O buffer.
3) The I/O buffer transfers data to PPPL.
4) If the I/O buffer fills up, the simulation processor transfers data to the failsafe processor's buffer.
5) The failsafe buffer transfers data to local depots on the simulation machine.

Logistical Networking is essential.
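A minimal sketch of the threaded buffering idea described above, assuming a single process stands in for the I/O and failsafe processors: the simulation enqueues output blocks, a background thread drains the queue over the WAN, and blocks that cannot be buffered fall back to a local "failsafe" write. The queue size and the transfer/write functions are illustrative stubs, not the actual PPPL/LN/Rutgers implementation.

    import queue, threading

    io_buffer = queue.Queue(maxsize=8)        # bounded in-memory I/O buffer

    def send_to_remote_depot(block):  pass    # stub for the remote (LN) transfer
    def write_to_failsafe(block):     pass    # stub for the local failsafe write

    def sender():
        """Background thread: overlap WAN transfer with computation."""
        while True:
            block = io_buffer.get()
            if block is None:
                break
            send_to_remote_depot(block)

    def write_block(block):
        """Called by the simulation at each output step."""
        try:
            io_buffer.put_nowait(block)       # normal path: enqueue for streaming
        except queue.Full:
            write_to_failsafe(block)          # buffer full: fall back to local depot

    t = threading.Thread(target=sender, daemon=True)
    t.start()
    for step in range(100):
        write_block(b"simulation output for step %d" % step)
    io_buffer.put(None)                       # signal the sender to finish
    t.join()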

Page 15:

Network Adaptability

Latency-aware, network-aware, self-adjusting buffer.
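One plausible reading of "latency-aware, network-aware, self-adjusting buffer" is sketched below: measure the achieved throughput of each transfer and resize the send block accordingly. The thresholds and the doubling/halving rule are my assumptions, not the algorithm actually used.

    import time

    block_size = 2 * 1024 * 1024              # start with 2 MB blocks

    def send_adaptive(block, transfer):
        """Send one block, then resize future blocks from measured throughput."""
        global block_size
        start = time.time()
        transfer(block)                        # hypothetical network send
        elapsed = max(time.time() - start, 1e-6)
        mbps = len(block) * 8 / elapsed / 1e6
        if mbps > 80:                          # link keeps up: larger blocks amortize latency
            block_size = min(block_size * 2, 64 * 1024 * 1024)
        elif mbps < 20:                        # slow or high-latency link: back off
            block_size = max(block_size // 2, 256 * 1024)
        return block_size

    send_adaptive(b"\0" * block_size, lambda b: time.sleep(0.01))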

Page 16:

High Throughput for “live” simulations

• The buffering scheme can keep up with data generation rates of 85 Mbps from NERSC to PPPL and 99 Mbps from ORNL to PPPL.

• The data was generated across 32 nodes (SP), 64 processors (SGI).

NERSC to PPPL: GTC simulation on 512 processors: 97 Mbps out of 100 Mbps.

ESnet router statistics show peak transfer rates of 99.2 Mbps out of 100 Mbps from ORNL to PPPL for 8 hours (5-minute average)!

• The simulation dictates the data generation rate.
• Example: a 3D simulation on a 2048³ grid, writing 5 variables every hour, generates about 364 Mbps.
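The 364 Mbps figure is consistent with single-precision values and binary megabits; a quick check follows (the 4-byte element size and the 2^20 bits-per-megabit convention are my assumptions, not stated on the slide).

    points   = 2048 ** 3                # 3D grid
    vars_out = 5                        # variables written each hour
    bytes_pt = 4                        # assume single-precision floats
    bits     = points * vars_out * bytes_pt * 8
    rate     = bits / 3600 / 2**20      # per second, in binary megabits
    print(round(rate))                  # ~364 Mbps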

Page 17:

Low Overhead

• Overhead is defined as the difference between the time taken with the I/O scheme and the time taken with no I/O of the resulting simulation data, over the simulation's lifetime.
• Data generation:
– 1.5 Mbps/node × 64 nodes = 96 Mbps (now)
– 8 Mbps/node × 64 nodes = 520 Mbps (GTC future)
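Written as a formula (my restatement of the definition above, with T the wall-clock time of the full run):

\[
\text{overhead}(\%) \;=\; 100 \times \frac{T_{\text{with I/O}} - T_{\text{no I/O}}}{T_{\text{no I/O}}}
\]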

[Chart: "Overhead of the Buffering Scheme compared to GPFS": % overhead vs. data generation rate (1–15 Mbps/node) for the buffering scheme, writing 2 MB blocks per timestep to GPFS, writing 10 MB blocks per timestep to GPFS, and HDF5 + GPFS, with a marker at the predicted GTC data generation rate in 5 years.]

Page 18:

ElVis: Collaborative Code Monitoring

• Part of the Fusion Collaboratory SciDAC.
• Develop a "hardened" Java-based collaborative visualization system based on SciVis [Klasky, Ki, Fox].
• http://w3.pppl.gov/transp/transpgrid_monitor
• Used for monitoring fusion (TRANSP) runs.
• Web-based and Java application.
• Used by dozens of fusion scientists.
• Being extended to provide actors in the Kepler system.

Page 19:

Requirements for data analysis and visualization

• Feature extraction routines
– Puncture-plot classification (quasiperiodic, islands, separatrix)
– Feature/blob detection: SDM Center, Kamath (LLNL)
• Data juxtaposition requires
– Normalizing simulation and experimental data into a common space (units, meshes, interpolation)
– Quantifying the similarity (surface area, volume, rate of change over time, where the features are over time, …)

[Embedded video: experimental data, Zweben (PPPL)]
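A minimal sketch of the data-juxtaposition step described above: resample simulation and experimental profiles onto a common grid, then compute a simple similarity score. The 1-D interpolation and the threshold-overlap metric are illustrative choices, not the project's actual analysis tools.

    import numpy as np

    def to_common_grid(values, src_x, common_x):
        """Interpolate a 1-D profile onto a shared coordinate axis."""
        return np.interp(common_x, src_x, values)

    def feature_overlap(a, b, threshold):
        """Similarity: fraction of grid points where both fields exceed a threshold."""
        fa, fb = a > threshold, b > threshold
        union = np.count_nonzero(fa | fb)
        return np.count_nonzero(fa & fb) / union if union else 1.0

    common_x = np.linspace(0.0, 1.0, 200)                       # shared radial grid
    sim  = to_common_grid(np.random.rand(128), np.linspace(0, 1, 128), common_x)
    expt = to_common_grid(np.random.rand(64),  np.linspace(0, 1, 64),  common_x)
    print(feature_overlap(sim, expt, threshold=0.8))            # e.g. a blob-overlap score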

Page 20:

IDAVE: Integrated Data Analysis and Visualization Environment

• Approach:
– Enhance the existing IDAVEs in the fusion community to support robust and accessible visualization.
– Incorporate and tightly integrate visualization into the scientific workflow.
– Support advanced visualization/data mining capabilities on the simulation and experimental data produced.
– Support visualization on workstations and display walls.


Page 21:

Ubiquitous and Transparent Data Sharing

• Problem:
– Simulations and collaborators in any FSP will be distributed across national and international networks.
– FSP simulations will produce massive amounts of data that will be permanently stored at national facilities and temporarily stored on collaborators' disk storage systems.
– Need to share large volumes of data among collaborators and the wider community.
– Current fusion solutions are inadequate to handle FSP data management challenges.

[Diagram: IDAVEs at collaborating sites, each backed by terabytes of disk, connected to petabytes of tape at a national facility (e.g. HPSS).]

Page 22:

Ubiquitous and Transparent Data Sharing

• What technology is required
– Metadata system
• To map user concepts to datasets and files
• e.g. find {ITER, shot_1174, Var=P(2D), Time=0-10}
• e.g. yields: /iter/shot1174/mhd
– Logical-to-physical data (file) mapping
• e.g. lors://www.pppl.gov/fsp/shot1174.xml
• Support for multiple replicas based on access patterns
– Technology to manage temporary space
• Lifetime, garbage collection
– Technology for fast access
• Parallel streams, large transfer windows, data streaming
– Robustness
• If the mass store is unavailable, replicas can be used
• Technology to recover from transient failures
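To make the mapping concrete, a toy catalog might resolve the slide's example query to a logical dataset name and then to physical replicas, skipping unreachable stores. Only the {ITER, shot_1174, P(2D)} query, the /iter/shot1174/mhd name, and the lors:// URL come from the slide; everything else (the second replica, the availability check) is hypothetical.

    # Toy two-level catalog: concepts -> logical dataset name -> physical replicas.
    metadata = {
        ("ITER", "shot_1174", "P(2D)"): "/iter/shot1174/mhd",
    }
    replicas = {
        "/iter/shot1174/mhd": [
            "lors://www.pppl.gov/fsp/shot1174.xml",      # logistical-network copy
            "hpss://archive.example.gov/iter/shot1174",  # hypothetical mass-store copy
        ],
    }

    def available(url):                      # stub availability check
        return url.startswith("lors://")

    def find(experiment, shot, var):
        logical = metadata[(experiment, shot, var)]
        for url in replicas[logical]:        # try replicas in preference order
            if available(url):               # robustness: skip unreachable stores
                return logical, url
        raise IOError("no replica reachable for " + logical)

    print(find("ITER", "shot_1174", "P(2D)"))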


Page 23:

Ubiquitous and Transparent Data Sharing

• Approach
– Need logistical versions of standard libraries and tools (NetCDF, HDF5) for moving and accessing data across the network.
– Speed of transfer and control of placement are vital to performance and fault tolerance.
– Data staging, scheduling, and tracking based on common SDM tools and policies.
– Global namespace and placement policies to enable community collaboration around distributed postprocessing and visualization tasks.

• Use:
– Logistical Networking: distributed depot system; maps logical to physical; parallel access; file staging.
– Storage Resource Management (SRM): disk & tape management systems; manage space, lifetime, garbage collection.
– No dependence on a single system: SRM is a middleware standard for multiple storage systems.
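The "manage space, lifetime, garbage collection" role can be pictured as a staging cache whose entries expire. This is a schematic of the concept only, not the SRM interface; all names and sizes are invented.

    import time

    class ScratchSpace:
        """Toy staging cache: entries get a lifetime and expire automatically."""
        def __init__(self, quota_bytes):
            self.quota, self.used, self.entries = quota_bytes, 0, {}

        def stage(self, name, size, lifetime_s):
            self.collect()
            if self.used + size > self.quota:
                raise IOError("no space for " + name)
            self.entries[name] = (size, time.time() + lifetime_s)
            self.used += size

        def collect(self):
            """Garbage-collect entries whose lifetime has expired."""
            now = time.time()
            for name, (size, expires) in list(self.entries.items()):
                if expires < now:
                    del self.entries[name]
                    self.used -= size

    cache = ScratchSpace(quota_bytes=10 * 2**30)            # 10 GB scratch area
    cache.stage("shot1174_slice.h5", 2 * 2**30, lifetime_s=3600)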

Page 24:

Summary

• The scientific investigation process in the FSP will be limited without the strong data management and visualization approach highlighted in the 2004 DOE Data Management report.

• Many DOE projects would benefit from End-to-End solutions.

• Need to couple DOE/NSF computer science research with hardened solutions for applications.

• 2004 Data Management Workshop: need $32M/year of new funding!
