A. Shoshani Feb. 2007
SCIENTIFIC DATA MANAGEMENT
Arie Shoshani
Computational Research Division, Lawrence Berkeley National Laboratory
February 2007
Outline
• Problem areas in managing scientific data
• Motivating examples
• Requirements
• The DOE Scientific Data Management Center
  • A three-layer architectural approach
  • Some results of technologies (details in mini-symposium)
• Specific technologies from LBNL
  • FastBit: innovative bitmap indexing for very large datasets
  • Storage Resource Managers: providing uniform access to storage systems
Motivating Example - 1
Optimizing Storage Management and Data Access for High Energy and Nuclear Physics Applications
• STAR: Solenoidal Tracker At RHIC
• RHIC: Relativistic Heavy Ion Collider
• LHC: Large Hadron Collider; includes ATLAS, STAR, …

Experiment  # members/institutions  Date of first data  # events/year  Volume/year (TB)
STAR        350/35                  2001                10^8-10^9      500
PHENIX      350/35                  2001                10^9           600
BABAR       300/30                  1999                10^9           80
CLAS        200/40                  1997                10^10          300
ATLAS       1200/140                2007                10^10          5000

[Figure: a mockup of an "event"]
Typical Scientific Exploration Process
• Generate large amounts of raw data
  • large simulations
  • collect from experiments
• Post-processing of data
  • analyze data (find particles produced, tracks)
  • generate summary data, e.g. momentum, no. of pions, transverse energy
  • number of properties is large (50-100)
• Analyze data
  • use summary data as guide
  • extract subsets from the large dataset
  • need to access events based on partial property specification (range queries)
    • e.g. ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000)
  • apply analysis code
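The range query above can be sketched in plain Python over a table of event summaries. This is only an illustration: the property names (AVpT, Np, N) come from the slide's example, but the table and event ids are made up.

```python
# Hypothetical event summary table: one dict of properties per event.
events = [
    {"id": 0, "AVpT": 0.15, "Np": 12, "N": 4200},
    {"id": 1, "AVpT": 0.30, "Np": 15, "N": 6500},
    {"id": 2, "AVpT": 0.12, "Np": 25, "N": 1000},
]

def matches(e):
    # ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000)
    return (0.1 < e["AVpT"] < 0.2 and 10 < e["Np"] < 20) or e["N"] > 6000

hits = [e["id"] for e in events if matches(e)]
print(hits)  # -> [0, 1]
```

A scan like this reads every record; the point of the indexing technology described later is to answer such predicates without touching most of the data.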
Motivating Example - 2
• Combustion simulation: 1000x1000x1000 mesh with 100s of chemical species over 1000s of time steps - 10^14 data values
• Astrophysics simulation: 1000x1000x1000 mesh with 10s of variables per cell over 1000s of time steps - 10^13 data values
• This is an image of a single variable
• What's needed is search over multiple variables, such as:
  Temperature > 1000 AND pressure > 10^6 AND HO2 > 10^-7 AND HO2 > 10^-6
• Combining multiple single-variable indexes efficiently is a challenge
• Solution: specialized bitmap indexes
Motivating Example - 3
• Earth System Grid: accessing large distributed stores by 100s of scientists
• Problems
  • different storage systems
  • security procedures
  • file streaming
  • lifetime of request
  • garbage collection
• Solution: Storage Resource Managers (SRMs)
Motivating Example - 4: Fusion Simulation
Coordination between running codes
[Figure: a workflow coupling XGC-ET with M3D-L (linear stability) and M3D through mesh/interpolation steps, with decision points such as "Stable?", "B healed?", and "Need more flights?". Analysis components include noise detection, blob detection, island detection, feature detection, out-of-core isosurface methods, and puncture plot computation, feeding a web portal (ElVis). Distributed stores holding MBs, GBs, and TBs of data are connected over a wide-area network]
Motivating Example - 5
• Data Entry and Browsing tool for entering and linking metadata from multiple data sources
• Metadata problem for microarray analysis
  • Microarray schemas are quite complex
    • many objects: experiments, samples, arrays, hybridization, measurements, …
    • many associations between them
  • Data is generated and processed in multiple locations which participate in the data pipeline
  • In this project: Synechococcus sp. WH8102 whole genome
    • microbes are cultured at Scripps Institution of Oceanography (SIO)
    • then the sample pool is sent to The Institute for Genomic Research (TIGR)
    • then images are sent to Sandia Lab for hyperspectral imaging and analysis
• Metadata needs to be captured and LINKED
• Generating specialized user interfaces is expensive and time-consuming to build and change
  • data is collected on various systems, spreadsheets, notebooks, etc.
The kind of technology needed
DEB: Data Entry and Browsing Tool
1) The microbes are cultured at SIO
2) Microarray hybridization at TIGR
3) Hyperspectral imaging at Sandia
• Features
  - interface based on lab notebook look and feel
  - tools are built on top of a commercial DBMS
  - schema-driven automatic screen generation
[Figure: linked schema objects across sites - Study, Slide, Probe (probe1, probe2), ProbeSource, NucleotidePool, Hybridization, MA-Scan, HS_Experiment, HS_Slide, HS-Scan, and analyses (MCR-Analysis, LCS-Analysis); links connect records from TIGR to SIO and to MA-Scan]
Storage Growth is Exponential
The challenges are in managing the data
• Unlike compute and network resources, storage resources are not reusable, unless data is explicitly removed
• Need to use storage wisely
  • checkpointing, removing replicated data
  • time-consuming, tedious tasks
• Data growth scales with compute scaling
  • storage will grow even with good practices (such as eliminating unnecessary replicas)
  • not necessarily on supercomputers, but on user/group machines and archival storage
• Storage cost is a consideration
  • has to be part of science growth cost
  • but storage costs are going down at a rate similar to data growth
  • need continued investment in new storage technologies
[Charts: storage growth 1998-2006 at ORNL (rate: 2X/year) and at NERSC-LBNL (rate: 1.7X/year)]
Data and Storage Challenges
End-to-End: 3 Phases of Scientific Investigation
• Data production phase
  • Data movement
    • I/O to parallel file system
    • moving data out of supercomputer storage
    • sustain data rates of GB/sec
  • Observe data during production
  • Automatic generation of metadata
• Post-processing phase
  • Large-scale (entire datasets) data processing
    • summarization / statistical properties
    • reorganization / transposition
    • generate data at different granularity
  • On-the-fly data processing
    • computations for visualization / monitoring
• Data extraction / analysis phase
  • Automate data distribution / replication
    • synchronize replicated data
    • data lifetime management to unclog storage
  • Extract subsets efficiently
    • avoid reading unnecessary data
    • efficient indexes for "fixed content" data
    • automated use of metadata
  • Parallel analysis tools
    • statistical analysis tools
    • data mining tools
The Scientific Data Management Center
(Center for Enabling Technologies - CET)
• PI: Arie Shoshani, LBNL
• Annual budget: $3.3 million
• Established 5 years ago (SciDAC-1)
• Successfully re-competed for the next 5 years (SciDAC-2)
• Featured in the second issue of SciDAC Review magazine
• Laboratories: ANL, ORNL, LBNL, LLNL, PNNL
• Universities: NCSU, NWU, SDSC, UCD, U of Utah
http://www.scidacreview.org/0602/pdf/data.pdf
Scientific Data Management Center
[Figure: scientific simulations & experiments (petabytes on tapes, terabytes on disks) feed data manipulation, which feeds scientific analysis & discovery. Data manipulation includes: getting files from tape archive, extracting subsets of data from files, reformatting data, getting data from heterogeneous distributed systems, and moving data over the network. Currently ~80% of time goes to data manipulation and ~20% to analysis; the goal, using SDM-Center technology, is to reverse this to ~20% manipulation and ~80% analysis]
• Application areas: climate modeling, astrophysics, genomics and proteomics, high energy physics, fusion, …
• SDM-ISIC technology: optimizing shared access from mass storage systems, parallel I/O for various file formats, feature extraction techniques, high-dimensional cluster analysis, high-dimensional indexing, parallel statistics, …
A Typical SDM Scenario
[Figure: four layers - Control Flow Layer; Applications & Software Tools Layer; I/O System Layer; Storage & Network Resources Layer. A workflow (flow tier / work tier) moves through tasks: Task A: generate time-steps (simulation program), Task B: move TS (DataMover), Task C: analyze TS (post-processing, Parallel R), Task D: visualize TS (terascale browser). Supporting components include Parallel NetCDF, PVFS, HDF5 libraries, SRM, subset extraction, and the file system]
Approach
• Use an integrated framework that:
  • provides a scientific workflow capability
  • supports data mining and analysis tools
  • accelerates storage and access to data
• Simplify data management tasks for the scientist
  • hide details of underlying parallel and indexing technology
  • permit assembly of modules using a simple graphical workflow description tool
[Figure: SDM framework - Scientific Process Automation Layer, Data Mining & Analysis Layer, and Storage Efficient Access Layer, connecting the scientific application to scientific understanding]
Technology Details by Layer
• Scientific Process Automation (SPA) Layer: workflow management engine, scientific workflow components (originally: workflow management tools, web-wrapping tools)
• Data Mining & Analysis (DMA) Layer: efficient parallel visualization (pVTK), efficient indexing (Bitmap Index), data analysis and feature identification (PCA, ICA), Parallel R statistical analysis (originally: ASPECT integration framework)
• Storage Efficient Access (SEA) Layer: Parallel NetCDF, Parallel Virtual File System, Storage Resource Manager (SRM, to HPSS), parallel I/O (ROMIO/MPI-IO)
• All layers sit on top of hardware, OS, and MSS (HPSS)
Technology Details by Layer
Data Generation
[Figure: layered view - Scientific Process Automation Layer (workflow design and execution), Data Mining and Analysis Layer, and Storage Efficient Access Layer (Parallel netCDF, MPI-IO, PVFS2) on top of OS and hardware (disks, mass store); a simulation run writes its output through these layers]
Parallel NetCDF vs. HDF5 (ANL+NWU)
• Developed Parallel netCDF
  • enables high-performance parallel I/O to netCDF datasets
  • achieves up to 10-fold performance improvement over HDF5
• Enhanced ROMIO
  • provides MPI access to PVFS2
  • advanced parallel file system interfaces for more efficient access
• Developed PVFS2
  • production use at ANL, Ohio SC, Univ. of Utah HPC center
  • offered on Dell clusters
  • being ported to IBM BG/L system
[Figures: before - each process (P0-P3) funnels through serial netCDF to the parallel file system; after - processes write concurrently through Parallel netCDF. Also: PVFS enhancements and deployment, interprocess communication, and FLASH I/O benchmark performance (8x8x8 block sizes)]
Technology Details by Layer
Statistical Computing with R
About R (http://www.r-project.org/):
• R is an open-source (GPL) and one of the most widely used programming environments for statistical analysis and graphics; similar to S.
• Provides good support for both users and developers.
• Highly extensible via dynamically loadable add-on packages.
• Originally developed by Robert Gentleman and Ross Ihaka.
Examples:
> dyn.load("foo.so")
> .C("foobar")
> dyn.unload("foo.so")

> library(mva)
> pca <- prcomp(data)
> summary(pca)

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()
Providing Task and Data Parallelism in pR
Task-parallel analyses:
• Likelihood maximization
• Re-sampling schemes: bootstrap, jackknife, etc.
• Animations
• Markov Chain Monte Carlo (MCMC)
  - multiple chains
  - simulated tempering: running parallel chains at different "temperatures" to improve mixing
Data-parallel analyses:
• k-means clustering
• Principal Component Analysis (PCA)
• Hierarchical (model-based) clustering
• Distance matrix, histogram, etc. computations
Parallel R (pR) Distribution
http://www.ASPECT-SDM.org/Parallel-R
Release history:
• pR enables both data and task parallelism; includes task-pR and RScaLAPACK (version 1.8.1)
• RScaLAPACK provides an R interface to ScaLAPACK, with its scalability in terms of problem size and number of processors, using data parallelism (release 0.5.1)
• task-pR achieves parallelism by performing out-of-order execution of tasks; with its intelligent scheduling mechanism it attains significant gains in execution times (release 0.2.7)
• pMatrix provides a parallel platform to perform major matrix operations using ScaLAPACK and PBLAS Level II & III routines
Also available for download from R's CRAN web site (www.R-Project.org), with 37 mirror sites in 20 countries
Technology Details by Layer
Piecewise Polynomial Models for Classification of Puncture (Poincaré) Plots
National Compact Stellarator Experiment
• Classify each of the nodes: quasiperiodic, islands, separatrix
• Connections between the nodes
• Want accurate and robust classification, valid when there are few points in each node
[Figure: example puncture plots - quasiperiodic, islands, separatrix]
Polar Coordinates
• Transform the (x, y) data to polar coordinates (r, θ).
• Advantages of polar coordinates:
  • Radial exaggeration reveals some features that are hard to see otherwise.
  • Automatically restricts analysis to the radial band containing data, ignoring inside and outside.
  • Easy to handle rotational invariance.
Piecewise Polynomial Fitting: Computing Polynomials
• In each interval, compute the polynomial coefficients to fit one polynomial to the data.
• If the error is high, split the data into an upper and a lower group, and fit two polynomials, one to each group.
[Figure: blue - data; red - polynomials; black - interval boundaries]
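The split-and-fit step can be sketched with NumPy. This is a minimal illustration of the idea, not the project's code: the polynomial degree, error threshold, and sample data are all made up.

```python
import numpy as np

def fit_interval(x, y, deg=2, max_err=0.05):
    """Fit one polynomial to the points in an interval; if the RMS
    residual is too high, split the points into upper and lower groups
    (relative to the single fit) and fit one polynomial to each."""
    coeffs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coeffs, x)
    if np.sqrt(np.mean(resid**2)) <= max_err:
        return [coeffs]                       # one polynomial suffices
    upper, lower = resid >= 0, resid < 0
    return [np.polyfit(x[g], y[g], deg) for g in (upper, lower)]

# Island-like data: two separated bands over the same angular interval
theta = np.linspace(0, 1, 50)
r = np.concatenate([1.0 + 0.1 * theta**2, 2.0 - 0.1 * theta])
x = np.concatenate([theta, theta])
polys = fit_interval(x, r)
print(len(polys))  # two bands -> 2 polynomials
```

A single band (e.g. a quasiperiodic node) would come back as one polynomial, which is exactly the signal the classification step uses.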
Classification
• The number of polynomials needed to fit the data and the number of gaps gives the information needed to classify the node:

  Gaps      One polynomial   Two polynomials
  Zero      Quasiperiodic    Separatrix
  > Zero    -                Islands

[Examples: 2 polynomials, 2 gaps - islands; 2 polynomials, 0 gaps - separatrix]
Technology Details by Layer
Example Data Flow in Terascale Supernova Initiative
[Figure: input data feeds a highly parallel compute stage producing ~500x500 output files, aggregated to ~500 files (< 2 to 10+ GB each) and archived; a data depot on the Logistical Network (L-Bone) and local mass storage (14+ TB) hold aggregates of one file (1+ TB each); a local 44-processor data cluster (data sits on local nodes for weeks) drives viz software, a viz client, and a viz wall]
Courtesy: John Blondin
Original TSI Workflow Example (with John Blondin, NCSU)
Automate data generation, transfer and visualization of a large-scale simulation at ORNL
Top-Level TSI Workflow
Automate data generation, transfer and visualization of a large-scale simulation at ORNL
[Figure: at ORNL - submit job to Cray; check whether a time slice is finished; aggregate all into one large file and save to HPSS; split it into 22 files and store them in XRaid; notify the head node at NC State. At NCSU - the head node submits scheduling to SGE; SGE schedules the transfer for 22 nodes; each node (Node-0 … Node-21) retrieves its file from XRaid; Ensight is started on the head node to generate video files]
Using the Scientific Workflow Tool (Kepler)
Emphasizing Dataflow (SDSC, NCSU, LLNL)
Automate data generation, transfer and visualization of a large-scale simulation at ORNL
New Actors in Fusion Workflow to Support Automated Data Movement
[Figure: a Kepler workflow engine starts two independent processes around a simulation program (MPI) running on Seaborg at NERSC: (1) an OTP login actor logs in at ORNL; (2) a file-watcher actor detects when files are generated in the disk cache, an scp file-copier actor moves them to the disk cache on Ewok at ORNL, a tar'ing actor tars the files, and a local archiving actor archives them to HPSS at ORNL. Layers shown: Kepler workflow engine, software components, hardware + OS]
Re-applying Technology
SDM technology, developed for one application, can be effectively targeted at many other applications …

Technology                  Initial Application   New Applications
Parallel NetCDF             Astrophysics          Climate
Parallel VTK                Astrophysics          Climate
Compressed bitmaps          HENP                  Combustion, Astrophysics
Storage Resource Managers   HENP                  Astrophysics
Feature selection           Climate               Fusion (exp. & simulation)
Scientific workflow         Biology               Astrophysics
Broad Impact of the SDM Center …
Astrophysics: high-speed storage technology, Parallel NetCDF, and integration software used for Terascale Supernova Initiative (TSI) and FLASH simulations
  Tony Mezzacappa - ORNL, Mike Zingale - U of Chicago, Mike Papka - ANL
Scientific workflow: John Blondin - NCSU; Doug Swesty, Eric Myra - Stony Brook
Climate: high-speed storage technology, Parallel NetCDF, and ICA technology used for climate modeling projects
  Ben Santer - LLNL, John Drake - ORNL, John Michalakes - NCAR
Combustion: compressed bitmap indexing used for fast generation of flame regions and tracking their progress over time
  Wendy Koegler, Jacqueline Chen - Sandia Lab
[Figures: ASCI FLASH - Parallel NetCDF; dimensionality reduction; region growing]
Broad Impact (cont.)
Biology: Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data
  Matt Coleman - LLNL
High Energy Physics: compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving data from HPSS
  Doug Olson - LBNL, Eric Hjort - LBNL, Jerome Lauret - BNL
Fusion: a combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak
  Keith Burrell - General Atomics, Scott Klasky - PPPL
[Figures: building a scientific workflow; dynamic monitoring of HPSS file transfers; identifying key parameters for the DIII-D Tokamak]
Technology Details by Layer
FastBit: An Efficient Indexing Technology for Accelerating Data-Intensive Science
Outline: Overview, Searching Technology, Applications
http://sdm.lbl.gov/fastbit
Searching Problems in Data-Intensive Sciences
• Find the collision events with the most distinct signature of Quark Gluon Plasma
• Find the ignition kernels in a combustion simulation
• Track a layer of an exploding supernova
These are not typical database searches:
• Large high-dimensional data sets (1000 time steps x 1000 x 1000 x 1000 cells x 100 variables)
• No modification of individual records during queries, i.e., append-only data
• Complex questions: 500 < Temp < 1000 && CH3 > 10^-4 && …
• Large answers (hit thousands or millions of records)
• Seek collective features such as regions of interest, beyond the typical average or sum
Common Indexing Strategies Not Efficient
Task: searching high-dimensional append-only data with ad hoc range queries
• Most tree-based indices are designed to be updated quickly
  • e.g. the B-Tree family
  • they sacrifice search efficiency to permit dynamic updates
• Hash-based indices are
  • efficient for finding a small number of records
  • but not efficient for ad hoc multi-dimensional queries
• Most multi-dimensional indices suffer the curse of dimensionality
  • e.g. R-trees, quad-trees, KD-trees, …
  • don't scale to high dimensions (< 20)
  • are inefficient if some dimensions are not queried
Our Approach: An Efficient Bitmap Index
• Bitmap indices
  • sacrifice update efficiency to gain more search efficiency
  • are efficient for multi-dimensional queries
  • scale linearly in the number of dimensions actually used in a query
• Bitmap indices may demand too much space
• We solve the space problem with an efficient compression method that
  • reduces the index size, typically to 30% of the raw data, vs. 300% for some B+-tree indices
  • improves operational efficiency (10X speedup)
• We have applied FastBit to speed up a number of DOE-funded applications
FastBit in a Nutshell
• FastBit is designed to search multi-dimensional append-only data
  • conceptually in table format: rows = objects, columns = attributes
• FastBit uses a vertical (column-oriented) organization for the data
  • efficient for searching
• FastBit uses bitmap indices with a specialized compression method
  • proven in analysis to be optimal for single-attribute queries
  • superior to others because they are also efficient for multi-dimensional queries
Bit-Sliced Index
• Take advantage of the fact that the indexed data is append-only
  • partition each property into bins (e.g. for 0 < Np < 300, have 300 equal-size bins)
  • for each bin, generate a bit vector
  • compress each bit vector (some version of run-length encoding)
[Figure: per-column sets of compressed bit vectors, one per bin, for column 1 through column n]
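The binning step above can be sketched in a few lines of Python. This is a toy illustration only: the values, bin width, and query range are made up, and real FastBit compresses each bit vector rather than storing raw bit lists.

```python
# Sketch of building binned bit vectors for one property.
values = [12, 45, 7, 33, 45, 12, 99]    # e.g. Np for each event (row)
bin_width = 10                          # bins: [0,10), [10,20), ...

bitmaps = {}                            # bin index -> bit list, one bit per row
for row, v in enumerate(values):
    b = v // bin_width
    bitmaps.setdefault(b, [0] * len(values))
    bitmaps[b][row] = 1

# Range query 10 <= Np < 50: OR the bit vectors of bins 1..4
hit_bits = [0] * len(values)
for b in range(1, 5):
    for row, bit in enumerate(bitmaps.get(b, [])):
        hit_bits[row] |= bit
print([r for r, bit in enumerate(hit_bits) if bit])  # -> [0, 1, 3, 4, 5]
```

A range condition thus reduces to ORing a contiguous run of bin bitmaps, which is why the cost tracks the number of hits rather than the data size.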
Basic Bitmap Index
• First commercial version: Model 204, P. O'Neil, 1987
• Easy to build: faster than building B-trees
• Efficient for querying: only bitwise logical operations
  • A < 2 → b0 OR b1
  • A > 2 → b3 OR b4 OR b5
• Efficient for multi-dimensional queries
  • use bitwise operations to combine the partial results
• Size: one bit per distinct value per object
  • definition: cardinality = number of distinct values
  • compact for low-cardinality attributes only (say, < 100)
  • need to control size for high-cardinality attributes
[Figure: data values 0, 1, 5, 3, 1, 2, 0, 4, 1 encoded as bit vectors b0-b5, one per distinct value (=0 … =5); the query A < 2 ORs b0 and b1]
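The slide's equality-encoded example can be reproduced directly in Python, using integers as bitmasks (a convenience of this sketch, not how FastBit stores its vectors):

```python
# Equality-encoded bitmap index for the example column
# A = [0, 1, 5, 3, 1, 2, 0, 4, 1]: one bit vector per distinct value.
A = [0, 1, 5, 3, 1, 2, 0, 4, 1]

# Each bit vector is a Python int (bit i set <=> row i has that value).
index = {}
for row, v in enumerate(A):
    index[v] = index.get(v, 0) | (1 << row)

# A < 2  ->  b0 OR b1
mask = index[0] | index[1]
hits = [row for row in range(len(A)) if mask >> row & 1]
print(hits)  # rows where A < 2 -> [0, 1, 4, 6, 8]
```

Answering the query is a single OR over two precomputed vectors; no row of A is re-examined.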
Run Length Encoding
Uncompressed: 0000000000001111000000000…0000001000000001111100000000…000000
Compressed: 12, 4, 1000, 1, 8, 5, 492
Practical considerations:
- Store very short sequences as-is (literal words)
- Count bytes/words rather than bits (for long sequences)
- Use the first bit for the type of word: literal or count
- Use the second bit of a count word to indicate a 0 or 1 sequence
[Example encoding: [literal] [31 0-words] [literal] [31 0-words] = [00 0F 00 00] [80 00 00 1F] [02 01 F0 00] [80 00 00 0F]]
Other ideas:
- Repeated byte patterns, with counts
- Well-known method used in Oracle: Byte-aligned Bitmap Code (BBC)
Advantage: can perform logical operations (AND, OR, NOT, XOR, …) and COUNT operations directly on compressed data
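The basic run-length step can be sketched as follows. This is the naive bit-level encoding from the slide's first example only; WAH and BBC store word- or byte-aligned runs with literal/fill flags rather than plain counts.

```python
# Minimal run-length encoder for a bit string (illustrative only).
def rle(bits):
    runs, count, cur = [], 0, bits[0]
    for b in bits:
        if b == cur:
            count += 1
        else:
            runs.append(count)
            cur, count = b, 1
    runs.append(count)
    return runs  # lengths of alternating runs, starting with bits[0]

bits = "0" * 12 + "1" * 4 + "0" * 1000 + "1" + "0" * 8 + "1" * 5 + "0" * 492
print(rle(bits))  # -> [12, 4, 1000, 1, 8, 5, 492]
```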
FastBit Compression Method is Compute-Efficient
Main idea: use run-length encoding, but partition the bits into 31-bit groups (on 32-bit machines), encode each group using one word, and merge neighboring groups with identical bits.
Example: 2015 bits = 65 groups of 31 bits
  [31 bits, literal] [63 all-zero groups merged into one count word (count = 63)] [31 bits, literal]
• Name: Word-Aligned Hybrid (WAH) code (US patent 6,831,575)
• Key features: WAH is compute-efficient because it
  • uses run-length encoding (simple)
  • allows operations directly on compressed bitmaps
  • never breaks any words into smaller pieces during operations
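The grouping-and-merging idea can be sketched as below. This is a simplification for illustration: real WAH packs the literal/fill distinction and the run count into the bits of one machine word, while this sketch uses tuples.

```python
# Sketch of WAH-style word building over 31-bit groups.
def wah_words(bits):
    words = []
    for i in range(0, len(bits), 31):
        g = bits[i:i + 31]
        if g == "0" * 31 or g == "1" * 31:                 # fill group
            if words and words[-1][0] == "fill" and words[-1][1] == g[0]:
                words[-1] = ("fill", g[0], words[-1][2] + 1)  # merge runs
                continue
            words.append(("fill", g[0], 1))
        else:                                              # mixed group
            words.append(("literal", g))
    return words

# The slide's 2015-bit example: literal group, 63 zero groups, literal group
bits = ("1" + "0" * 20 + "111" + "0" * 7    # first 31-bit group (mixed)
        + "0" * (63 * 31)                   # 63 all-zero groups
        + "0" * 6 + "1" * 25)               # last 31-bit group (mixed)
w = wah_words(bits)
print(len(w))  # -> 3 words: literal, fill(0 x 63 groups), literal
```

Because groups never straddle word boundaries, logical operations can walk two encoded sequences word by word without ever splitting a word, which is the source of WAH's speed.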
Compute-Efficient Compression Method: 10 times faster than the best-known method
[Chart: operation time vs. selectivity; WAH is ~10X faster than BBC]
Time to Evaluate a Single-Attribute Range Condition in FastBit is Optimal
• Evaluating a single-attribute range condition may require OR'ing multiple bitmaps
• Both analysis and timing measurements confirm that the query processing time is at worst proportional to the number of hits
[Charts: worst case - uniform random data; realistic case - Zipf data; a range condition selects a subset of the bitmaps in an index. BBC: Byte-aligned Bitmap Code, the best-known bitmap compression]
Processing Multi-Dimensional Queries
• Merging results from tree-based indices is slow
  • because sorting and merging are slow
• Merging results from bitmap indices is fast
  • because bitwise operations on bitmaps are efficient
[Figure: for each attribute, OR the relevant bitmaps - e.g. rows {1, 2, 5, 7, 9} give 110010101 and rows {2, 4, 5, 8} give 010110010 - then AND the two results: 010010000, i.e. rows {2, 5}]
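The OR-then-AND combination from the figure can be reproduced with integer bitmasks (rows are 1-indexed here to match the slide's example):

```python
# Combining per-attribute bitmap results with bitwise operations.
def to_mask(rows):
    m = 0
    for r in rows:
        m |= 1 << r
    return m

def to_rows(mask):
    return [r for r in range(mask.bit_length()) if mask >> r & 1]

attr1 = to_mask({1, 2, 5, 7, 9})   # OR of attribute 1's bitmaps
attr2 = to_mask({2, 4, 5, 8})      # OR of attribute 2's bitmaps
print(to_rows(attr1 & attr2))      # rows satisfying both -> [2, 5]
```

Each additional query dimension adds only one more AND over already-computed bitmaps, which is why bitmap indices scale with the number of dimensions actually queried.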
Multi-Attribute Range Queries
• Results are based on the 12 most-queried attributes (2.2 million records) from the STAR high-energy physics experiment, with average attribute cardinality of 222,000
• WAH-compressed indices are 10X faster than bitmap indices from a DBMS, and 5X faster than our own implementation of BBC
• The size of WAH-compressed indices is only 30% of the raw data size (a popular DBMS uses 3-4X for B+-tree indices)
[Charts: 2-D queries and 5-D queries]
The Case for Query-Driven Visualization
• Support compound range queries: e.g. get all cells where Temperature > 300K AND Pressure < 200 millibars
• Subsetting: only load data that corresponds to the query
  • get rid of visual "clutter"
  • reduce load on the data analysis pipeline
• Quickly find and label connected regions
• Do it really fast!
Architecture Overview: Query-Driven Vis Pipeline
[Figure: query → FastBit index → data → vis/analysis → display]
[Stockinger, Shalf, Bethel, Wu 2005]
DEX Visualization Pipeline
[Figure: data query feeding the Visualization Toolkit (VTK); 3D visualization of a supernova explosion]
[Stockinger, Shalf, Bethel, Wu 2005]
Extending FastBit to Find Regions of Interest
• Comparison to what VTK is good at: single-attribute iso-contouring
• But FastBit also does well on:
  • multi-attribute search
  • region finding: produces the whole volume rather than a contour
  • region tracking
• Proved to have the same theoretical efficiency as the best iso-contouring algorithms
• Measured to be 3X faster than the best iso-contouring algorithms
• Implemented in Dexterous Data Explorer (DEX), jointly with the Vis group
[Chart: time (sec) for vtkKitwareContourFilter vs. DEX; DEX is ~3X faster]
[Stockinger, Shalf, Bethel, Wu 2005]
Combustion: Flame Front Tracking
Need to perform:
• Cell identification: identify all cells that satisfy user-specified conditions, such as "600 < Temperature < 700 AND HO2 concentration > 10^-7"
• Region growing: connect neighboring cells into regions
• Region tracking: track the evolution of the regions (i.e., features) through time
All steps are performed with bitmap structures
[Figure: tracked flame regions at time steps T1-T4]
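The region-growing step above can be sketched as a breadth-first search over the cells that satisfied the query. This is a generic connected-components sketch, not FastBit's bitmap-based implementation; the grid and its "hit" pattern are made up.

```python
# Region growing: connect neighboring "hit" cells into labeled regions.
from collections import deque

grid = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]  # 1 = cell satisfies the query condition

def grow_regions(grid):
    rows, cols = len(grid), len(grid[0])
    label = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not label[r][c]:
                next_label += 1
                label[r][c] = next_label
                q = deque([(r, c)])
                while q:                       # BFS over 4-neighbors
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not label[ny][nx]):
                            label[ny][nx] = next_label
                            q.append((ny, nx))
    return next_label, label

n, labels = grow_regions(grid)
print(n)  # two connected regions -> 2
```

Tracking then matches each labeled region at time T against overlapping regions at time T+1.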
[Chart: region-growing time (sec) vs. number of line segments, for identifying regions in a 3D supernova simulation (LBNL). On 3D data with over 110 million records, region finding takes less than 2 seconds; the cost is linear in the number of segments]
[Wu, Koegler, Chen, Shoshani 2003]
Extending FastBit to Compute Conditional Histograms
• Conditional histograms are common in data analysis
  • e.g., finding the number of malicious network connections in a particular time window
• Top left: a histogram of the number of connections to port 5554 of machines in the LBNL IP address space (two horizontal axes); the vertical axis is time
  • two sets of scans are visible as two sheets
• Bottom left: FastBit computes conditional histograms much faster than common data analysis tools
  • 10X faster than ROOT
  • 2X faster than ROOT with FastBit indices
[Chart: time (sec) vs. number of processors for FastBit, ROOT, and ROOT-FastBit]
[Stockinger, Bethel, Campbell, Dart, Wu 2006]
A Nuclear Physics Example: STAR
• STAR: Solenoidal Tracker At RHIC; RHIC: Relativistic Heavy Ion Collider
• 600 participants / 50 institutions / 12 countries / in production since 2000
• ~100 million collision events a year, ~5 MB raw data per event, several levels of summary data
• Generated 3 petabytes and 5 million files
• Append-only data, aka write-once read-many (WORM) data
Grid Collector
Benefits of the Grid Collector:
• Transparent object access
• Selection of objects based on their attribute values
• Improvement of the analysis system's throughput
• Interactive analysis of data distributed on the Grid
Finding Needles in STAR Data
• One of the primary goals of STAR is to search for Quark Gluon Plasma (QGP)
• A small number (~hundreds) of collision events may contain the clearest evidence of QGP
• Using high-level summary data, researchers found 80 special events
• These have track distributions that are indicative of QGP
• Further analysis needs to access more detailed data
• Detailed data are large (terabytes) and reside on HPSS
• May take many weeks to manually migrate to disk
• We located and retrieved the 80 events in 15 minutes
Grid Collector Speeds up Analyses
[Chart: speedup vs. selectivity (linear scale) for three samples]
• Test machine: 2.8 GHz Xeon, 27 MB/s read speed
• When searching for rare events, say selecting one event out of 1000, using the Grid Collector is 20 to 50 times faster
• Using the Grid Collector to read 1/2 of the events, speedup > 1.5; for 1/10 of the events, speedup > 2
[Chart: speedup vs. selectivity (log scale) for three samples]
Summary – Applications Involving FastBit
• STAR: search for rare events with special significance (BNL, STAR collaboration)
• Combustion data analysis: finding and tracking ignition kernels (Sandia, Combustion Research Facility)
• Dexterous Data Explorer (DEX): interactive exploration of large scientific data, visualizing regions of interest (LBNL Vis group)
• Network traffic analysis: enabling interactive analysis of network traffic data for forensic and live stream data (LBNL Vis group; NERSC/ESnet security)
• DNA sequencing anomaly detection: finding anomalies in raw DNA sequencing data to diagnose sequencing machine operations and DNA sample preparations (JGI)
FastBit implements an efficient patented compression technique to speed up the searches in data intensive scientific applications
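FastBit's compression scheme is the Word-Aligned Hybrid (WAH) code; the sketch below is a much-simplified run-length encoding in the same spirit (string-based, fixed word size, invented tuple format), not the actual WAH algorithm:

```python
def wah_like_compress(bits, word=8):
    """Very simplified WAH-style encoding: split a bit string into
    fixed-size words; runs of all-0 or all-1 words collapse into
    ('fill', bit, count), mixed words stay as ('literal', word_bits)."""
    words = [bits[i:i + word] for i in range(0, len(bits), word)]
    out = []
    for w in words:
        if w == "0" * word or w == "1" * word:
            kind = ("fill", w[0])
            if out and out[-1][:2] == kind:
                # Extend the current run of identical fill words.
                out[-1] = ("fill", w[0], out[-1][2] + 1)
            else:
                out.append(("fill", w[0], 1))
        else:
            out.append(("literal", w))
    return out

# A sparse bitmap: a long run of 0s, one mixed word, then a run of 1s.
bitmap = "0" * 24 + "01100000" + "1" * 16
encoded = wah_like_compress(bitmap)
# [('fill', '0', 3), ('literal', '01100000'), ('fill', '1', 2)]
```

The key property, which real WAH shares, is that logical AND/OR can be evaluated directly on the compressed form, skipping over fills without decompressing.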
Technology Details by Layer
• Scientific Process Automation (SPA) layer: workflow management engine, scientific workflow components, web wrapping tools
• Data Mining & Analysis (DMA) layer: efficient parallel visualization (pVTK), efficient indexing (Bitmap Index), data analysis and feature identification (PCA, ICA), parallel statistical analysis (Parallel R), ASPECT integration framework
• Storage Efficient Access (SEA) layer: Parallel NetCDF, Parallel Virtual File System, Storage Resource Manager (SRM), parallel I/O (ROMIO MPI-IO)
All three layers run over the hardware, OS, and mass storage system (HPSS)
What is SRM?
• Storage Resource Managers (SRMs) are middleware components whose function is to provide:
• Dynamic space allocation
• Dynamic file management in space
• For shared storage components on the WAN
Motivation
• Suppose you want to run a job on your local machine
• Need to allocate space
• Need to bring in all input files
• Need to ensure correctness of files transferred
• Need to monitor and recover from errors
• What if files don't fit in the space? Need to manage file streaming
• Need to remove files to make space for more files
• Now, suppose that the machine and storage space are a shared resource
• Need to do the above for many users
• Need to enforce quotas
• Need to ensure fairness in scheduling users
Motivation
• Now, suppose you want to do that on a WAN
• Need to access a variety of storage systems
• Mostly remote systems; need to have access permission
• Need special software to access mass storage systems
• Now, suppose you want to run distributed jobs on the WAN
• Need to allocate remote spaces
• Need to move (stream) files to remote sites
• Need to manage file outputs and their movement to destination site(s)
Ubiquitous and Transparent Data Access and Sharing
[Diagram: multiple data-analysis clients transparently access and share data on tapes (petabytes, e.g. HPSS) and disks (terabytes)]
Interoperability of SRMs
[Diagram: users/applications and Grid middleware clients access, through a uniform SRM interface, heterogeneous storage systems: Enstore, JASMine, dCache, Castor, Unix-based disks, and the CCLRC RAL SE]
SDSC Storage Resource Broker (SRB) - Grid Middleware
[Diagram: applications use a client library to reach SRB servers, which access UniTree, HPSS, DB2, Oracle, and Unix storage; a metadata catalog (MCAT) holds Dublin Core, resource, user, and application metadata]
This figure was taken from one of the talks by Reagan Moore
SRM vs. SRB
• Storage Resource Broker (SRB)
• Very successful product from SDSC with a long history
• A centralized solution where all requests go to a central server that includes a metadata catalog (MCAT)
• Developed by a single institution
• Storage Resource Management (SRM)
• Based on an open standard
• Developed by multiple institutions for their storage systems
• Designed for interoperation of heterogeneous storage systems
• Features of SRM that SRB does not deal with
• Managing storage space dynamically based on the client's request
• Managing content of space based on a lifetime controlled by the client
• Support for file streaming by pinning and releasing files
• Several institutions now ask for an SRM interface to SRB
• In GGF: activity to bridge these technologies
GGF GIN-Data SRM inter-op testing (GGF: Global Grid Forum, GIN: Grid Interoperability Now)
[Diagram: an SRM-TESTER client (1) initiates the tester, (2) tests storage sites against the SRM v1.1 and v2.2 specifications using GridFTP/HTTP(s)/FTP services, and (3) publishes test results to the web; SRM endpoints tested: CERN (LCG), IC.UK (EGEE), UIO (ARC), SDSC (OSG), LBNL (STAR), FNAL (CMS), APAC, Grid.IT, and VU]
Testing Operations Results

Site             ping  put   get   Advisory delete  Copy (SRMs)  Copy (gsiftp)
ARC (UIO.NO)     pass  fail  pass  fail             pass         fail
EGEE (IC.UK)     pass  pass  pass  pass             pass         pass
CMS (FNAL.GOV)   pass  pass  pass  pass             pass         pass
LCG/EGEE (CERN)  pass  pass  pass  pass             N.A.         N.A.
OSG (SDSC)       pass  pass  pass  pass             pass         fail
STAR (LBNL)      pass  pass  pass  pass             pass         pass
Peer-to-Peer Uniform Interface
[Diagram: command-line clients and client programs at many sites reach disk caches and a mass storage system through Storage Resource Managers over the network, all through a uniform SRM interface]
Earth Science Grid Analysis Environment
[Diagram: components spanning LBNL, LLNL, ISI, NCAR, ORNL, and ANL: a Tomcat servlet engine with MCS (Metadata Cataloguing Services), RLS (Replica Location Services), MyProxy, and CAS (Community Authorization Services) clients communicating via SOAP and RMI; a GRAM gatekeeper; DRM and HRM Storage Resource Management over disks, HPSS (High Performance Storage System), and a mass storage system; gridFTP servers (including a striped server); and an openDAPg server]
History and partners in SRM Collaboration
• Five years of Storage Resource Management (SRM) activity
• Experience with SRM system implementations
• Mass storage systems: HRM-HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (Jlab), Castor (CERN), MSS (NCAR), Castor (RAL), ...
• Disk systems: DRM (LBNL), jSRM (Jlab), DPM (CERN), universities, ...
• Combination systems:
• dCache (Fermi): a sophisticated multi-storage system
• L-Store (U Vanderbilt): based on Logistical Networking
• StoRM: to parallel file systems (ICTP, Trieste, Italy)
Standards for Storage Resource Management
•Main concepts• Allocate spaces
• Get/put files from/into spaces
• Pin files for a lifetime
• Release files and spaces
• Get files into spaces from remote sites
• Manage directory structures in spaces
• SRMs communicate as peer-to-peer
• Negotiate transfer protocols
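The space, pin, and release concepts above can be illustrated with a toy in-memory model (the class and method names below are invented for illustration and are not the real SRM interface):

```python
import time

class TinySRM:
    """Toy model of SRM semantics: a fixed-capacity space, putting files
    into it, pinning files for a lifetime, and releasing them so they
    become candidates for eviction (all names invented)."""
    def __init__(self, capacity):
        self.capacity, self.files, self.pins = capacity, {}, {}

    def put(self, name, size):
        used = sum(self.files.values())
        if used + size > self.capacity:
            # Evict files whose pin has expired or been released.
            for f in list(self.files):
                if self.pins.get(f, 0) < time.time():
                    used -= self.files.pop(f)
                    self.pins.pop(f, None)
                    if used + size <= self.capacity:
                        break
        if used + size > self.capacity:
            raise RuntimeError("no space: all remaining files are pinned")
        self.files[name] = size

    def pin(self, name, lifetime):
        self.pins[name] = time.time() + lifetime  # pin expiry timestamp

    def release(self, name):
        self.pins.pop(name, None)

srm = TinySRM(capacity=100)
srm.put("a.root", 60)
srm.pin("a.root", lifetime=3600)
srm.put("b.root", 30)
srm.release("a.root")
srm.put("c.root", 60)   # evicts the now-unpinned a.root to make room
```

This pin/release cycle is what enables file streaming: a client pins only the files it is currently analyzing, and the SRM reclaims the rest.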
DataMover
Perform "rcp -r directory" on the WAN
SRMs support data movement between storage systems
[Diagram, based on the Grid architecture paper by the Globus team, showing where SRMs fit in the Grid service layers:
• Fabric: compute systems, networks, mass storage system (HPSS), other storage systems, Storage Resource Manager
• Connectivity: communication protocols (e.g., TCP/IP stack), authentication and authorization protocols (e.g., GSI)
• Resource (sharing single resources): Compute Resource Management, Storage Resource Manager, resource monitoring/auditing
• Collective 1 (general services for coordinating multiple resources): Community Authorization Services, data/compute scheduling (brokering), database management services, file transfer service (GridFTP), data transport services, monitoring/auditing services, general data discovery services, data filtering or transformation services
• Collective 2 (services specific to an application domain or virtual organization): application-specific data discovery services, request interpretation and planning services, workflow or request management services, consistency services (e.g., update subscription, versioning, master copies), data federation services]
Massive Robust File Replication
• Multi-file replication: why is it a problem?
• Tedious task: many files, repetitious
• Lengthy task: can take hours, even days
• Error prone: need to monitor transfers
• Error recovery: need to restart file transfers
• Stage and archive from MSS: limited concurrency, down time, transient failures
• Commercial MSS: HPSS at NERSC, ORNL, ...
• Legacy MSS: MSS at NCAR
• Independent MSS: Castor (CERN), Enstore (Fermilab), JasMINE (Jlab)
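The error-recovery requirements above amount to a retry loop around each transfer; a minimal sketch (the `transfer` callback and file names are hypothetical, standing in for an SRM/GridFTP transfer that raises on failure):

```python
def replicate(files, transfer, max_retries=3):
    """Copy each file via `transfer`, retrying transient failures;
    return the files that still failed after max_retries attempts."""
    failed = []
    for f in files:
        for _ in range(max_retries):
            try:
                transfer(f)
                break                 # transferred successfully
            except IOError:
                continue              # transient failure: retry
        else:
            failed.append(f)          # exhausted retries
    return failed

# Demo with a flaky transfer that fails once per file, then succeeds.
attempts = {}
def flaky(path):
    attempts[path] = attempts.get(path, 0) + 1
    if attempts[path] == 1:
        raise IOError("transient network error")

failed = replicate(["x.nc", "y.nc"], flaky)  # both succeed on the retry
```

A production mover would also checkpoint completed files so a restart does not re-transfer them, which is the monitoring role described below.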
DataMover: SRM use in ESG for robust multi-file replication
[Diagram: a DataMover, running anywhere, creates an identical directory and issues SRM-COPY (thousands of files) to an SRM at LBNL/ORNL that performs writes; that SRM issues SRM-GET (one file at a time) to an SRM at NCAR that performs reads, gets the list of files from the directory, and stages files from NCAR-MSS; files cross the network via GridFTP GET (pull mode) into a disk cache and are archived; the system recovers from file transfer, staging, and archiving failures, with a web-based file monitoring tool]
Web-Based File Monitoring Tool
Shows:
• Files already transferred
• Files during transfer
• Files to be transferred
Also shows for each file:
• Source URL
• Target URL
• Transfer rate
File tracking helps to identify bottlenecks
Shows that archiving is the bottleneck
File tracking shows recovery from transient failures
Total: 45 GB
Robust Multi-File Replication
• Main results
• DataMover has been in production use for over three years
• Currently moves about 10 TB a month
• Averages 8 MB/s (64 Mb/s) over the WAN
• Eliminated person-time to monitor transfers and recover from failures
• Reduced error rates from about 1% to 0.02% (a 50-fold reduction)*
* http://www.ppdg.net/docs/oct04/ppdg-star-oct04.doc
Summary: lessons learned
• Scientific workflow is an important paradigm
• Coordination of tasks AND management of data flow
• Managing repetitive steps
• Tracking, estimation
• Efficient I/O is often the bottleneck
• Technology essential for efficient computation
• Mass storage needs to be seamlessly managed
• Opportunities to interact with math packages
• General analysis tools are useful
• Parallelization is key to scaling
• Visualization is an integral part of analysis
• Data movement is complex
• Network infrastructure is not enough: it can be unreliable
• Need robust software to manage failures
• Need to manage space allocation
• Managing format mismatch is part of data flow
• Metadata is emerging as an important need
• Description of experiments/simulations
• Provenance
• Use of hints / access patterns
Data and Storage Challenges: Still to be Overcome
General open problems:
• Multiple parallel file systems
• A common data model
• Coordinated scheduling of resources
• Reservations and workflow management
• Multiple data formats
• Running coupled codes
• Coordinated data movement (not just files)
• Data reliability / monitoring / recovery
• Tracking data for long-running jobs
• Security: authentication + authorization
Fundamental technology areas, from the report of the DOE Office of Science Data-Management Workshops (March-May 2004):
• Efficient access and queries, data integration
• Distributed data management, data movement, networks
• Storage and caching
• Data analysis, visualization, and integrated environments
• Metadata, data description, logical organization
• Workflow, data flow, data transformation
URL: http://www-user.slac.stanford.edu/rmount/dm-workshop-04/Final-report.pdf
SDM Mini-Symposium
• High-Performance Parallel Data and Storage Management
• Alok Choudhary (NWU), Rob Ross (ANL), Robert Latham (ANL)
• Mining Science Data
• Chandrika Kamath (LLNL)
• Accelerating Scientific Exploration with Workflow Automation Systems
• Ilkay Altintas (SDSC), Terence Critchlow (LLNL), Scott Klasky (ORNL), Bertram Ludaescher (UC Davis), Steve Parker (U of Utah), Mladen Vouk (NCSU)
• High Performance Statistical Computing with Parallel R and Star-P
• Nagiza Samatova (ORNL) and Alan Edelman (MIT)