TRANSCRIPT
Prabhat, XLDB, May 24, 2016
Realtime Data Analytics at NERSC
Lawrence Berkeley National Laboratory
National Energy Research Scientific Computing Center

NERSC is the production HPC & data facility for the DOE Office of Science, the largest funder of physical science research in the U.S. Science areas served include:
• Biological and Environmental Systems
• Applied Math, Exascale, Materials, Chemistry, Geophysics
• Particle Physics, Astrophysics
• Nuclear Physics
• Fusion Energy, Plasma Physics
Focus on Science
• NERSC supports the broad mission needs of the six DOE Office of Science program offices
• 6,000 users and 750 projects
• Extensive science engagement and user training programs
• 2,078 refereed publications in 2015
NERSC - 2016

[System diagram. WAN: 2 x 10 Gb + 1 x 100 Gb links with Software Defined Networking, feeding data-intensive systems (PDSF, JGI, KBase, HEP over 14x QDR), vis & analytics, data transfer nodes, advanced architecture testbeds, and science gateways across an Ethernet & IB fabric. Themes: science-friendly security, production monitoring, power efficiency.]

Cori: Cray XC-40
• Phase 1: 1,630 nodes; 2.3 GHz Intel "Haswell" cores; 203 TB RAM
• Phase 2: >9,300 nodes; >60 cores, 16 GB HBM, and 96 GB DDR per node
• 28 PB local scratch at >700 GB/s (32x FDR IB); 1.5 PB "DataWarp" burst buffer at >1.5 TB/s

Edison: Cray XC-30
• 5,576 nodes; 133K 2.4 GHz Intel "IvyBridge" cores; 357 TB RAM
• 7.6 PB local scratch at 163 GB/s (16x FDR IB)

Storage (with diagram bandwidths of 80, 50, 12, and 5 GB/s):
• Global scratch: 3.6 PB (5 x SFA12KE)
• /project: 5 PB (DDN9900 & NexSAN)
• /home: 250 TB (NetApp 5460)
• HPSS archive: 50 PB stored, 240 PB capacity
The Cori System
• Cori will transition HPC and data-centric workloads to energy-efficient architectures
• The system is named after Gerty Cori, biochemist and the first American woman to receive a Nobel Prize in science.
DOE facilities are facing a data deluge
[Figure: data-producing facilities across astronomy, physics, light sources, genomics, and climate.]
4 V's of Scientific Big Data

Science Domain      | Variety                                 | Volume               | Velocity                   | Veracity
--------------------|-----------------------------------------|----------------------|----------------------------|-------------------------------------------------
Astronomy           | Multiple telescopes, multi-band/spectra | O(100) TB            | 100 GB/night – 10 TB/night | Noisy, acquisition artefacts
Light Sources       | Multiple imaging modalities             | O(100) GB            | 1 Gb/s – 1 Tb/s            | Noisy, sample preparation/acquisition artefacts
Genomics            | Sequencers, mass-spec, proteomics       | O(1-10) TB           | TB/week                    | Missing data, errors
High Energy Physics | Multiple detectors                      | O(100) TB – O(10) PB | 1-10 PB/s reduced to GB/s  | Noisy, artefacts, spatio-temporal
Climate Simulations | Multi-variate, spatio-temporal          | O(10) TB             | 100 GB/s                   | 'Clean', but multiple sources of uncertainty to account for
Why Real-time Analytics? Why Now?
• Large instruments are producing massive data streams
  – Fast, predictable turnaround is integral to the processing pipeline
  – Traditional HPC systems use batch queues with long or unpredictable wait times
• Computational Steering <-> Experimental Steering
  – Change the experimental configuration during your precious beam-time!
• Follow-on analysis might be time-critical
  – Supernova candidates, asteroid detection
Real-time Use Cases
• Realtime interaction with experimental facilities – Light Sources: ALS, LCLS
• Realtime jobs driven by web portals – OpenMSI, MetAtlas
• Computational Steering – DIII-D reactor
• Experimental Steering – iPTF follow-on
Real-time Queue at NERSC
• NERSC has made a small pool of nodes available for immediate-turnaround "realtime" computing
  – Up to 32 nodes (1,024 cores) in the realtime queue
  – Realtime nodes have higher priority than other queues
  – The pool can shrink or grow as needed based on demand
• Approved projects have a small number of nodes available on demand without queue wait times, as sketched below
  – Configured on a per-repo basis for:
    • maximum number of jobs
    • maximum number of cores
    • wallclock limit
    • …
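To make the mechanics concrete, here is a minimal sketch of what submitting to such a queue could look like, assuming a Slurm-managed system with a "realtime" QOS; the script name, account, executable, and limits are illustrative stand-ins, not actual per-repo settings.

```python
# Hypothetical sketch: submitting an on-demand job to a Slurm "realtime" QOS.
import subprocess
import textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --qos=realtime        # higher-priority QOS, no long queue wait
    #SBATCH --nodes=2             # must fit within the repo's node cap
    #SBATCH --time=00:30:00       # short wallclock limit, per realtime policy
    #SBATCH --account=m0000       # hypothetical repo name
    srun ./analyze_stream         # hypothetical analysis executable
""")

with open("realtime_job.sh", "w") as f:
    f.write(batch_script)

# sbatch prints "Submitted batch job <id>" on success
result = subprocess.run(["sbatch", "realtime_job.sh"],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```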
Usage (12/2015-04/2016)
[Charts: realtime queue usage over time and its distribution across projects.]
TOTALS: 332,625 hours used; 23,244 jobs
Science Use Case: iPTF
PIs: Kasliwal, Nugent, Cao
• Nightly images transferred
• Subtractions performed
• Candidates inserted into a database
• Typical turnaround time < 5 minutes
DISCOVERIES: Yi Cao, et al. (2015) Nature, "A strong ultraviolet pulse from a newborn Type Ia supernova"
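The subtraction step is simple to illustrate: difference the new exposure against a reference and flag significant residuals as candidates. The sketch below is a toy version; real pipelines also align, PSF-match, and calibrate, and the function names and 5-sigma threshold here are assumptions.

```python
import numpy as np

def find_candidates(new_image: np.ndarray, reference: np.ndarray, nsigma: float = 5.0):
    """Return pixel coordinates of significant residuals in the difference image."""
    diff = new_image - reference
    # robust noise estimate from the median absolute deviation
    sigma = 1.4826 * np.median(np.abs(diff - np.median(diff)))
    ys, xs = np.where(diff > nsigma * sigma)
    return list(zip(ys.tolist(), xs.tolist()))

# toy usage: one injected "transient" on a noisy background
rng = np.random.default_rng(0)
ref = rng.normal(100.0, 5.0, size=(64, 64))
new = ref + rng.normal(0.0, 1.0, size=ref.shape)
new[32, 32] += 100.0  # injected point source
print(find_candidates(new, ref))  # ~[(32, 32)]
```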
Science Use Case: Advanced Light Source
• Image reconstruction algorithms run on Cori
• 3D volumes rendered on the SPOT web portal
• ALS beamline users receive instant feedback

Production running at ALS beamlines:
• 24x7 operation
• 176,293 datasets
• 155 beamline users
• 1,050 TB of data stored
• 2,379,754 jobs at NERSC
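For a feel of the reconstruction step, here is a minimal filtered-back-projection sketch of the kind of tomographic reconstruction such a pipeline performs; scikit-image and the Shepp-Logan phantom are used purely for illustration, not the actual beamline codes or SPOT integration.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

# stand-in for one slice of a measured sinogram
phantom = rescale(shepp_logan_phantom(), 0.25)
theta = np.linspace(0.0, 180.0, max(phantom.shape), endpoint=False)
sinogram = radon(phantom, theta=theta)

# filtered back-projection reconstructs the slice from its projections
reconstruction = iradon(sinogram, theta=theta)
error = np.sqrt(np.mean((reconstruction - phantom) ** 2))
print(f"RMS reconstruction error: {error:.3f}")
```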
Science Use Case: Metabolite Atlas
Ben Bowen, LBL
• Pre-computed fragmentation trees for 10,000+ compounds
• Real-time queue used to compare raw spectra against the trees to find possible matches
• Results obtained in minutes
• IPython interface to NERSC
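A hedged sketch of the matching idea: score a raw spectrum against library fragment m/z lists by cosine similarity on a binned axis. The bin width, library layout, and all names are assumptions, not the MetAtlas API.

```python
import numpy as np

def bin_spectrum(mz, intensity, mz_max=1000.0, bin_width=1.0):
    """Histogram a peak list onto a fixed m/z grid, normalized to unit length."""
    bins = np.zeros(int(mz_max / bin_width))
    idx = np.clip((np.asarray(mz) / bin_width).astype(int), 0, len(bins) - 1)
    np.add.at(bins, idx, intensity)
    norm = np.linalg.norm(bins)
    return bins / norm if norm > 0 else bins

def best_matches(raw, library, top_k=3):
    """Rank library entries by cosine similarity to the raw spectrum."""
    scores = {name: float(raw @ ref) for name, ref in library.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# toy library of two hypothetical compounds
library = {
    "compound_A": bin_spectrum([87, 145, 203], [1.0, 0.6, 0.3]),
    "compound_B": bin_spectrum([91, 119, 165], [1.0, 0.8, 0.5]),
}
raw = bin_spectrum([87, 145, 204], [0.9, 0.5, 0.2])
print(best_matches(raw, library))  # compound_A should rank first
```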
Science Use Case: Cryo-Electron Microscopy
Nogales Lab; Louder et al. (2016), Nature 531 (7596): 604-619
• Structure determination of TFIID
• 10-100 GB image stacks
• Image classification
• Real-time queue used for:
  – Assessment of data quality during electron microscopy data collection
  – Rapid optimization of data-processing strategies
[Figure: 3D structure of a TFIID-containing complex]
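One illustrative form of on-the-fly quality assessment is flagging micrographs whose contrast falls below a threshold so the operator can adjust collection immediately. Real pipelines rely on CTF estimation and drift correction; the metric and threshold below are assumptions.

```python
import numpy as np

def flag_low_contrast(stack: np.ndarray, min_contrast: float = 0.05):
    """Return indices of micrographs whose std/mean falls below min_contrast."""
    means = stack.mean(axis=(1, 2))
    stds = stack.std(axis=(1, 2))
    contrast = stds / np.maximum(means, 1e-12)
    return np.where(contrast < min_contrast)[0]

# toy stack: 4 micrographs, one nearly featureless
rng = np.random.default_rng(1)
stack = rng.normal(1.0, 0.2, size=(4, 128, 128))
stack[2] = 1.0 + rng.normal(0.0, 0.001, size=(128, 128))  # flat frame
print(flag_low_contrast(stack))  # [2]
```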
LCLS Workflow Today: 150 TB Analysis in 5 days
[Pipeline diagram: the injector and the Cornell-SLAC Pixel Array diffraction detector feed a multi-level data acquisition and control system (DAQ); data streams in XTC format across the Science DMZ at ~2 GB/s into NERSC storage (HPSS, global scratch, /project on NGF); psana then fans the stream out to parallel hitfinder -> spotfinder -> index -> integrate chains on the Cray XC-30 compute engine.]
Prompt analysis requires fast networks & real-time HPC queues.
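The shape of that analysis chain is easy to sketch: each worker pulls events and runs hitfinder -> spotfinder -> index -> integrate in sequence. The stage functions below are hypothetical stand-ins; the production code is psana plus crystallography libraries.

```python
from multiprocessing import Pool

def hitfinder(event):   return event if event.get("hit") else None
def spotfinder(event):  return {**event, "spots": event.get("spots", [])}
def index(event):       return {**event, "lattice": "indexed"}
def integrate(event):   return {**event, "intensities": len(event["spots"])}

def process(event):
    """One event through the full chain; None means vetoed by the hitfinder."""
    event = hitfinder(event)
    if event is None:
        return None
    return integrate(index(spotfinder(event)))

if __name__ == "__main__":
    # toy event stream; in production psana feeds XTC events to many ranks
    events = [{"hit": i % 3 == 0, "spots": list(range(i))} for i in range(9)]
    with Pool(4) as pool:
        results = [r for r in pool.map(process, events) if r is not None]
    print(f"integrated {len(results)} of {len(events)} events")
```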
LCLS-II 2019: Nanocrystallography Pipeline
Streaming data from the detector to HPC:
• 100-1000x data rates
• Indexing, classification, and reconstruction via an on-the-fly veto system
• Quasi real-time response (<10 min)
• Terabit/s throughput from front-end electronics
• Petaflop-scale analysis on demand
[Diagram: indexed diffraction images are reconstructed into structures, yielding actionable knowledge for the next beamtime.]
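A hedged sketch of what an on-the-fly veto means in practice: consume a stream of detector frames and keep only those whose Bragg-peak count clears a threshold, so downstream indexing sees far less data. The event format, peak proxy, and threshold are illustrative assumptions, not the LCLS-II design.

```python
import numpy as np
from typing import Iterator

def count_peaks(frame: np.ndarray, nsigma: float = 6.0) -> int:
    """Crude Bragg-peak proxy: pixels far above the frame's noise floor."""
    threshold = frame.mean() + nsigma * frame.std()
    return int((frame > threshold).sum())

def veto_stream(frames: Iterator[np.ndarray], min_peaks: int = 10):
    """Yield only frames that look like crystal hits."""
    for frame in frames:
        if count_peaks(frame) >= min_peaks:
            yield frame

# toy stream: mostly blank frames, a few with injected "peaks"
rng = np.random.default_rng(2)
def synthetic_frames(n=100):
    for i in range(n):
        frame = rng.normal(0.0, 1.0, size=(64, 64))
        if i % 20 == 0:  # occasional crystal hit
            ys = rng.integers(0, 64, size=15)
            xs = rng.integers(0, 64, size=15)
            frame[ys, xs] += 50.0
        yield frame

kept = sum(1 for _ in veto_stream(synthetic_frames()))
print(f"kept {kept} of 100 frames")
```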
Key Takeaways
• Data streaming and real-time analytics are emerging requirements at NERSC
• Experimental facilities are the heaviest users – light sources, telescopes
• SDN capabilities are needed to enable data flows directly between compute nodes and workflow databases
• Users would like to use realtime nodes for more long-running interactive work and debugging
• Provisioning resources for the real-time queue is an ongoing exercise
Acknowledgments
• Shreyas Cholia, Doug Jacobsen (NERSC)
• NERSC real-time queue users!

Thanks!