reaching harmony in tomorrow’s world of supercomputing · large hadron collider needs • 3d...
TRANSCRIPT
© 2019 Cray Inc.
R e a c h i n g H a r m o n y i n To m o r r o w ’ s W o r l d o f S u p e r c o m p u t i n g
Per NybergVP, Artificial Intelligence
© 2019 Cray Inc.
• How is AI being adopted in supercomputing environments today ?• What does our future look like ? • Can we reach a state of harmony between the worlds of simulation and AI ?
Topics
© 2019 Cray Inc.
The Competitive Landscape Is Changing
Advent of Supercomputing
1970s
The Haves and the Have Mores
2000s
The Haves and Have Nots
1980s and 90s
Adapted from: “DIGITAL AMERICA: A TALE OF THE HAVES AND HAVE-MORES” McKinsey Global Institute, 2015
The Information Elite
Today
© 2019 Cray Inc.
How Have Supercomputers Enabled Advanced Simulation ?
• Delivering socio-economic benefits through advanced weather and climate forecasts
• Met Office supercomputers process more than 215 billion weather observations each day
• High resolution climate modeling can better assess impact of climate change at a regional scale
• Determined the precise chemical structure of the HIV capsid
• Protein shell that protects the virus’s genetic material and is a key to its virulence.
• Key to the development of new antiretroviral drugs
• Requires the assembly of more than 1,300 identical proteins – in atomic-level detail.
• Crop devastation by whiteflies is a major cause of hunger in East Africa
• Understanding the DNA of the species by generating phylogenetic trees
• With only 500 whiteflies in a genetic dataset, the possible relationships between these flies run into the octillions (10^25)
© 2019 Cray Inc.
...and what about Machine Learning ?• Crop data is key to decision
makers.• Applying Deep Learning on
satellite data, the two major crops can be distinguished with 95% accuracy just a few months after planting and well before harvest.
• More timely estimates could be used for a variety applications, including supply-chain logistics, commodity market future projections, and more.
• Cryo-electron microscopy (cryo-EM) provides 3D structural information of biological molecules and assemblies
• Cryo-EM has improved structure resolution to near atomic in just the past few years
• Critical to advancing basic biology and to characterizing drugs and drug targets for improved drug discovery.
• Development of systems for connected cars and autonomous technologies are key to enabling autonomous vehicles.
• These advances could not have been realized without the application of deep learning to object detection in image and full motion video
© 2019 Cray Inc.
SIMULATION ARTIFICIAL INTELLIGENCE
COGNITIVE SIMULATION
How are Individual Domains Evolving ?
Deep CFD“Learning on the outside”
Object DetectionCombustion Modeling
Researching Precursors of Tropical Cyclones Using
Deep Learning
Numerical Weather Prediction Machine Learning Based Physics Emulators in Numerical Weather
Prediction Models“Learning on the inside”
Sim
ML
Sim ML
© 2019 Cray Inc.
What Are Some Other Benefits of Machine Learning ?
Saving Compute Cycles Through Improved Simulation Guidance
• Application of machine learning optimization techniques such as regularization and steering to Full Waveform Inversion (FWI) seismic imaging.
• Compared with manual velocity model determination and tuning, FWI with ML converges more quickly and efficiently.
When Simulation Is too Expensive
• Detailed simulation of subatomic particles interactions is essential to High Energy Physics at CERN
• Monte Carlo approach is not fast enough for the High-Luminosity Large Hadron Collider needs
• 3D convolutional GAN can generate realistic detector output >2000x faster
Ref: Dr. Federico Carminati et al, CERN
Maximizing Data Utilization
• Satellites create more data than can be assimilated
• Only a small % of available data is used today.
• “Deep learning object detection can be used to identify areas of atmospheric instability from satellite observation data, focus extraction of observations on these regions of interest.”
Ref: Jebb Stewart, NOAA, 2018 ECMWF workshop on HPC in Meteorology
© 2019 Cray Inc. 8
Computational Demands of Machine Learning are Growing (fast)
Ref: Open AI
Ref: Open AI
© 2019 Cray Inc.
A More Relevant Example: Climate ScienceCredit: Prabhat, NERSC, Exascale Deep Learning for Climate Analytics
Liu, et alABDA’16
Racah, et al, NIPS’17Kurth, et al, SC’17
Kurth, et alSC’18
Racah, et alNIPS’17
© 2019 Cray Inc. 10
Climate Instance SegmentationCredit: Prabhat, NERSC, Exascale Deep Learning for Climate Analytics
Climate Data Set• CAM5 0.25-degree output
• 1152x768 pixels, 3-hr• 16 variables • 20 TB
• Ground Truth Labels for TC and AR • Heuristics for detection followed by mask
creation• Formulate segmentation problem
• 3 classes - background, TC and AR• High imbalance - most pixels are background
high variance - shape of events change over time and in-between themselves
Segmentation Architecture• DeepLabv3+, 66 layers, • 43.7M parameters, 14.4 TF/sample• 100,000 samples
© 2019 Cray Inc. 11
What Does Our Future Look Like ?
Think You Know How DisruptiveArtificial Intelligence Is? Think Again –
Forbes
Collision Course: Commercial and HPC Big Data - Scientific
Computing
Opportunities at the Intersection of high-end computing and data science –SC17 panel
Blurring the Lines: High-End Computing and Data Science – SC17 panel
Cray CTO On The Cambrian Compute Explosion – The Next Platform
The Convergence of HPC and AI
...Not to mention that ML/DL continues to rapidly evolve
© 2019 Cray Inc.
Rapidly Evolving AI Workflows with Increasing Technology Diversity
DataAcquisition
DataProcessing
Data PreparationLabor & CPU Intensive
ModelTraining
ModelTesting
Model DevelopmentCompute (GPU & CPU) Intensive
ModelDeployment
ModelInference
Model ImplementationBusiness Process Intensive
Compute Requirements (By Stage and Task): Data preparation Machine
Learning Training Deep Learning
Training Inference
Data Requirements (Size, Type, Performance): Large volume data
stores High-bandwidth Fast IOPs Direct connectivity
to external storage HDF5 ~ Key-Value
Stores
Performance Requirements (By Stage and Task): Scaling Throughput
User-productivity Containers Collaborative
notebooks Open-source
frameworks High productivity
interpreted languages
Training and Inference Processor Landscape Intel, AMD, ARM,
NVIDIA Google TPU v2 Graphcore Habana 30+ startups….
Storage Technology Landscape Tape HDD SSD-SATA/SAS SSD-NVMe Flash On-node vs. off-
node – Datawarp vs. NVMe-over-Fabrics
© 2019 Cray Inc.
Time to Insight: Performance Emphasis Will Move From Optimized “Codes” to “Workflows”
Time to solution
Limiting resource( # memory channels )
limiting resource#IO channels
I/Oparallel
Time to insight
Data from Database 2
Data from Database 1
P1Job2
(Simulation,
Graph Analytics)
Job1
(Clustering)
Job1’
(Deep Learning)
P2
© 2019 Cray Inc.
Development of Domain and Use-Case Specific Multi-Models
Scientific Data
Streaming Sensor Data
Public Data
3rd Party Data
Operational Data
Streaming Sensor Data
Enterprise Data
Social Media
Research & Demographics
Historical Data
AI Today Multi-Model Future
Model CNN, RNN, LSTM, GAN etc. Domain-specific
Baseline Humans, Other ML algorithms Theory, Science
Use Case Speech, Image interpretation
Computational SteeringProxy models
Figure of Merit
Time-to-accuracy, Model-size + Interpretability, Feasibility
Model Design Transfer/Incremental Learning Hyper-parameter optimization
Model Testing A/B Testing on a cadence Statistical rigor
Points
Lines
Grid
Surface
Mesh
© 2019 Cray Inc. 15
AI for Science: Increasing Mingling of Simulation and AI Methodologies
Domain-Specific Models• Fluid Dynamics• Schrodinger Equation• Full-Wave Inversion• Integration with user-facilities
Combinatorial Patterns• Search for meta-stable states• Search for particles• Search for N-way correlations• Search in a high-dimensional feature space
Model Repurposing• Deep Learning for X• Feature engineering• Predictive modeling• Hypothesis creation with interpretation
Workflows and Automation• Smarter initialization for simulations• Computational steering• Hybrid models of theory and AI
EmergingUse-cases
Mod
el C
ompl
exity
Computational Complexity
© 2019 Cray Inc.
Convergence Does Not Mean Sameness.How do Simulation and AI Workflows Compare?
SIMULATION ARTIFICIAL INTELLIGENCE
Credit: “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” Eric Breck et al, Google, Inc.
• Workflows are bespoke: Use case and data-specific
• Code can be disposable• ML-based system behaviour is not easily specified
in advance - depends on dynamic qualities of the data and on various model configuration choices
• Maximize data movement-- scan/sort/stream all the data all the time
• Simulation codes are universal, long lasting and evolve more predictably
• Focused on making sure the code mirrors the physics
• Architecture specific optimizations are long lasting• Tightly integrated processor-memory-interconnect
& network storage
© 2019 Cray Inc.
• Rapidly evolving AI Workflows with greater technology diversity• Performance emphasis will move from optimized “Codes” to “Workflows”• Development of domain and use-case specific multi-models• The mingling of simulation and AI methodologies will:
• Increase computational requirements and complexity• Create entirely new use-cases
• Convergence does not mean sameness:• The unique characteristics of simulation and AI need to be addressed• Harmony: Need for both simulation and AI to co-exist within the same
environment
Recap – What Does Our Future Look Like ?
17
© 2019 Cray Inc.
Cray Shasta – Data-Centric Design
Designed to run multiple workloads simultaneously (simulation, AI, analytics)
Reflective of changing application/workflow landscape
Designed to run different processors and accelerators
Designed for a variety of datacenter types
Cray system software to enable modularity, extensibility, and delivers a single interface for application development
Running any workflow at scale –simulation/modeling, AI, analytics
Portable and extensible to support an evolving software ecosystem
Cray Slingshot interconnect to handle the complex processing and communication of HPC and AI applications
Slingshot is a complete rethinking of the interconnect for HPC and has added novel adaptive routing, quality-of-service, and congestion management features
Slingshot also adds standard ethernet compatibility to integrate seamlessly into your existing datacenter operations
FLEXIBILITY EXTENSIBILITY SCALABILITY
© 2019 Cray Inc.
AURORA – The 1st US Exascale System
“This system will be an excellent platform for both traditional high performance
computing applications but also is being designed to be excellent for data analytics, particularly the kinds of
streaming data problems we have in the DOE, where we have data coming off
accelerators, detectors, telescopes and so forth,”
“This platform is designed to tackle the largest AI training and inference problems
that we know about,”
“There are over a hundred AI applications being developed by the various national
laboratories. As part of the ExascaleComputing Project, there’s a new effort
around exascale machine learning, and that activity is feeding into the requirements for
Aurora, particularly the software environment; the hardware will be very excellent for training and will achieve state of the art performance
on training.”
Rick Stevens, Argonne Associate laboratory director for computing, environment and life sciences
© 2019 Cray Inc.
T h a n k Yo u .
Per NybergVP, Artificial Intelligence