TRANSCRIPT
© 2018 Cray Inc.
Architectural Challenges and Solutions for the Convergence of Big Data, HPC and AI
HPC-AI Advisory Council, Perth 2019
@Rangan_Sukumar
https://www.linkedin.com/in/rangan/
SAFE HARBOR STATEMENT
This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets, and other statements that are not historical facts.
These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.
CONVERGENCE OF BIG DATA, HPC AND AI
• Scientific models
• Simulation
• Analytics models
• Big data analytics
• Data-intensive processing
• Artificial intelligence & machine learning
WORKFLOWS OF THE EXASCALE ERA
ARCHITECTING A CONVERGENT SYSTEM
Levels of Architectural Maturity
Level 1: Can a system run HPC, Big Data and AI applications/workloads?
Level 2: Can a system productively execute a “convergent workflow” consisting of HPC, Big Data and AI tools, codes and frameworks?
Level 3: Can a system accelerate/scale the workflow when required?
Level 4: Given a workflow, is this the top “performant” system one can build?
THE CHALLENGE OF ARCHITECTING HPC SYSTEMS
• Design specifications
  • maximize(performance-per-$)
  • minimize($-to-insight)
  • maximize(architected performance * community productivity), subject to budget
  • minimum(benchmark-performance) >= scaling factor
  • maximum(app-to-app performance variation) <= epsilon
  • minimize(operating costs ~ power, downtime, human resources)
• Figures-of-merit
  • FLOPS, Programmability, Utilization, Benchmark, Scientific Innovation, …
THE CHALLENGE OF ARCHITECTING BIG-DATA SYSTEMS
• Design specifications
  • maximize(ROI-per-byte)
  • maximize(capacity * consistency * availability * fault-tolerance)
  • maximize(open-source tool support)
  • minimize(time-to-prototype + time-to-production)
  • minimize(security risk)
  • minimize(operating costs ~ power, downtime, human resources)
• Figures-of-merit
  • ROI, Elasticity, Multi-tenancy, Ease-of-use, Time-to-accuracy, …
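Taken together, the HPC and big-data design specifications above amount to a constrained optimization over candidate system configurations. The Python sketch below is purely illustrative (the scoring function, budget and all numbers are assumptions for demonstration, not a Cray sizing methodology):

```python
# Illustrative only: encoding the design specifications above as a
# constrained scoring problem over hypothetical candidate systems.

def score_candidate(perf, productivity, cost, budget,
                    app_variation, epsilon=0.15):
    """Score = architected performance * community productivity,
    subject to the budget and app-to-app variation constraints."""
    if cost > budget:            # stay within budget
        return None
    if app_variation > epsilon:  # max app-to-app variation <= epsilon
        return None
    return perf * productivity

# (performance, productivity, cost, app-to-app variation): made-up values
candidates = [
    (100.0, 0.8, 9.0, 0.10),
    (120.0, 0.5, 9.5, 0.05),
    (150.0, 0.9, 11.0, 0.02),   # best raw performance, but over budget
]

budget = 10.0
scored = [(score_candidate(p, u, c, budget, v), (p, u, c, v))
          for p, u, c, v in candidates]
best = max((s for s in scored if s[0] is not None), key=lambda s: s[0])
# best[1] is the balanced 100.0-perf system, not the fastest one
```

The point of the sketch: once productivity and operating constraints enter the objective, the "winning" configuration is rarely the one with the highest raw performance.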
TODAY: A TALE OF TWO ECOSYSTEMS
Architectural differences:
• Access policy
• Programming models
• Languages and tools
• Performance
• Resource management and job scheduling
J. Dongarra et al., “Exascale computing and Big Data: the next frontier,” Communications of the ACM, 2015
HOW HAS CRAY APPROACHED CONVERGENCE?
Level 1: Can a system run HPC, Big Data and AI applications/workloads?
Scalable simulation & modelling, analytics, and AI (500NX, 500GT)
• Urika Manager: manage users’ analytics environments, workflows & data sources
• Urika Analytics Services: enable, support and accelerate parallel-performance analytics libraries & frameworks
Level 2: Can a system execute a “convergent workflow” consisting of HPC, Big Data and AI tools, codes and frameworks productively?
Application Interfaces
• Frameworks, e.g. TensorFlow, Cray Graph Engine, RAJA, Kokkos
• Programming models, e.g. MapReduce, gRPC, SHMEM, OpenMP, BSP
• Libraries, e.g. MKL, CUDA, libSci
• Collectives, e.g. NCCL, MPI
System Interfaces
• Micro-services, e.g. parallel I/O, Lustre integration
• Container/batch/interactive jobs, e.g. Docker, Singularity
• Provisioning / job orchestration, e.g. Kubernetes, SLURM, Moab
Community Productivity
• Domain-specific creativity: is there an ecosystem of sustainable community (open-source) engagement that enables vertical segments?
• Code portability: does a user have to rewrite code? Does the vendor support code porting for novel architectures?
• Programmability: does an end-user have to learn a new language, or can they launch jobs with modern tools (e.g. notebooks)?
• Data pre-processing: does the system offer tools to optimize ETL wall-time?
• Data movement: does the system provide the ability to run multiple frameworks/applications on the same data?
System
• Utilization: peak vs. sustained, performance per $
• Reliability: faults, MTTF, uptime
• Scalability: weak and strong
System Architecture
• Interconnect: Ethernet, InfiniBand, Aries
• Provisioning: Mesos, Moab, SLURM
• Node architecture: # of xPUs + cache + memory + network + I/O
• Disk: latency
• Memory: capacity, latency
• xPU: speed
Performance must be delivered at the component, node, multi-node, system and facility levels, across hardware, software and the ecosystem.
Level 2: Can a system execute a “convergent workflow” consisting of HPC, Big Data and AI tools, codes and frameworks in reasonable time?
Source: Aaron Vose and Pete Mendygral, Cray Performance Team
Accelerating Time-to-solution
95%+ scalability efficiency reduces training time from days to hours
Get to the highest possible accuracy in one-third the time.
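For reference, a "scalability efficiency" figure like the one quoted here is typically computed as achieved aggregate throughput divided by perfect linear scaling of the single-node rate. A minimal sketch, using hypothetical throughput numbers rather than the measurements behind the slide's claim:

```python
# Achieved aggregate throughput on n nodes, relative to perfect
# linear scaling of the 1-node rate. Numbers below are hypothetical.

def scaling_efficiency(throughput_1, throughput_n, n):
    return throughput_n / (n * throughput_1)

# e.g. 200 samples/s on 1 node, 48,640 samples/s aggregate on 256 nodes
eff = scaling_efficiency(200.0, 48_640.0, 256)   # 0.95, i.e. 95%
```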
Level 3: Can a system seamlessly accelerate/scale the workflow when required?
• Data preparation (labor- & CPU-intensive): data acquisition, data processing
• Model development (compute-intensive, GPU & CPU): model training, model testing
• Model implementation (business-process-intensive): model deployment, model inference
▪ Data Requirements - By Size, Type, Performance:
▪ large volume data stores
▪ high-bandwidth
▪ fast IOPS
▪ Direct connectivity to external storage
▪ HDF5/NetCDF ~ Relational, Document, Key-Value
▪ Performance Requirements - By Stage and Task:
▪ Scaling
▪ Throughput
▪ User-productivity
▪ Performant Python
▪ Collaborative notebooks
▪ Open-source framework support
▪ Storage Technology Landscape
▪ Tape
▪ HDD
▪ SSD-SATA/SAS
▪ SSD-NVMe
▪ Flash
▪ On-node vs. off-node (DataWarp vs. NVMe-over-Fabrics)
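A back-of-the-envelope sketch of why the storage tier dominates data-preparation wall time. The sustained-read bandwidths below are rough, assumed figures chosen for illustration, not measured numbers for any particular product:

```python
# Rough, assumed per-device sustained read bandwidths (GB/s).
TIER_BW_GBS = {
    "tape": 0.3,
    "hdd": 0.2,
    "ssd_sata": 0.5,
    "ssd_nvme": 3.0,
}

def read_time_seconds(dataset_gb, tier):
    """Time to stream a dataset once off a single device of this tier."""
    return dataset_gb / TIER_BW_GBS[tier]

# Streaming a 6 TB training set once from each tier:
times = {tier: read_time_seconds(6000, tier) for tier in TIER_BW_GBS}
# Under these assumptions NVMe turns a multi-hour HDD scan into ~33 minutes
```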
Level 3: Can a system seamlessly accelerate/scale the workflow when required?
The Urika software stack (spanning Distributed AI and Data Preparation & Analytics):
• Urika Libraries & Frameworks: TensorFlow, Keras, PyTorch, Cray HPO, Spark, Dask, Jupyter Notebook, Alchemist, TensorBoard, pbdR, CGE (Cray Graph Engine), Cray Feature Selection
• Urika Parallel Communication and IO Layer: Cray MPI, OpenMPI, Cray PE DL Plugin, Horovod, HDF5, NetCDF
• Urika Parallel Performance Layer: architecture-specific libraries (Cray Science & Math Libraries, Intel MKL; NVIDIA CUDA, cuDNN, MIOpen)
Level 4: Given a workflow, is this the top “performant” system one can build?
Company + product, secret sauce, and differentiation w.r.t. NVIDIA V100:
• Habana Labs (Goya for inferencing, Gaudi for training): mixed-precision matrix multiplication, RoCE support on chip, 8x 100G PCIe Gen4/3 lanes per chip. Goya adds model streaming and model multi-tenancy; Gaudi scales out with Ethernet (GA in Q3 2019). >10x expected benefits over T4 and V100 respectively.
• Graphcore (IPUs, both training and inferencing): massive parallelism by design, leverages sparsity of matrix multiplies in DL, SRAM on chip, data-flow and computational graphs; custom kernels implementable using Poplar. On CNNs: 25x faster, 8x lower latency, with throughput of 12,000 images/second in inferencing mode. On RNNs (a C++ application): 500x faster, 28x lower latency; LSTM networks 10x faster on throughput (samples/second vs. batch size). Could benefit HPC workloads.
• Cerebras (CS-1): a large wafer packs more silicon for compute, avoids zero-multiplies, dynamic scheduling on chip; custom kernels using a proprietary toolkit.
• Xilinx (Alveo): mixed precision within kernels; custom kernels using VHDL, Verilog, etc. Claims 90x on database acceleration, 20x on ML, 12x on video processing and 10x on HPC codes over GPUs, beating ASICs on performance-per-$.
Also in the race: NVIDIA, AMD, Intel, Groq, Wave, …
▪ IC vendors (12), IP vendors (11), cloud vendors, startups (9+34 and increasing)
▪ Among them, 48 different chips for inferencing, 21 for training; 12 claim they can do both.
Level 4: Given a workflow, is this the top “performant” system one can build?
Habana vs. NVIDIA on ResNet-50
[Figure: four bar charts comparing Habana Goya, NVIDIA V100 and NVIDIA T4 on max throughput (images/sec), power efficiency (images/sec/watt), batch-size-1 latency (ms) and relative cost, with ~4x and ~3x advantages annotated for Goya]
50+ startups are designing AI-specific ASICs; Habana outperforms NVIDIA on several key metrics.
THE CRAY DIFFERENTIATION
#1: A SMARTER INTERCONNECT
Handle the data movement, management and I/O patterns… with a smarter interconnect.
J. Dongarra et al., “Exascale computing and Big Data: the next frontier,” Communications of the ACM, 2015
• Hypothetical “best” node for Big Data and AI:
  • Mixed precision
  • Fat nodes (lots of RAM)
  • ASICs (training + inference)
  • Persistent node-local storage
  • Run-time code optimization
• Hypothetical “best” node for HPC:
  • High precision
  • Vector extensions
  • CPUs/GPUs with high-bandwidth memory
  • I/O over the memory hierarchy
  • Static code optimization
…joined by a high-performance interconnect.
#2: PROGRAMMING NEW MODES OF PARALLELISM
• Data-parallelism: workers operate on data partitions against shared states
• Model-parallelism (training, inferencing): workers hold a partitioned model over shared data
• Ensemble-parallelism
• Intra-node vs. inter-node bandwidth trade-offs
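Data-parallelism, the most common of the three modes, can be illustrated with a toy model: each "worker" computes a gradient on its own shard of the batch, and the shard gradients are averaged, standing in for an allreduce across nodes. This is a self-contained NumPy sketch, not Horovod or Cray PE DL Plugin code:

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean-squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, X, y, n_workers, lr=0.1):
    shards_X = np.array_split(X, n_workers)   # partition the batch
    shards_y = np.array_split(y, n_workers)
    grads = [local_gradient(w, Xs, ys)
             for Xs, ys in zip(shards_X, shards_y)]
    g = np.mean(grads, axis=0)                # the "allreduce" average
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, X, y, n_workers=4)
# With equal shard sizes the averaged gradient equals the full-batch
# gradient, so the parallel update matches the serial one and w
# converges toward w_true.
```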
#3: MORE BANDWIDTH FROM/TO STORAGE
Tiered flash and HDD servers.
• Traditional model: compute nodes reach OSS (HDD) over the high-speed network, LNET routers and a storage area network.
• New model: compute nodes connect over the high-speed network directly to OSS & MDS on SSD (~80 GB/s), tiered with OSS on HDD (~30 GB/s), with NVRAM caching on both tiers.
Benefits:
• Lower cost
• Lower complexity
• Lower latency
• Improved small-I/O performance
#4: PRODUCTION-QUALITY, UP-TO-DATE SOFTWARE
Methods for scaling training, and who introduced them:
• LARS (MBS ~32K): NVIDIA
• Learning-rate schedule (~64K): Facebook
• Gradient clipping: Microsoft
• Mixed-precision training: Baidu
• Optimizer tuning (~32K), K-FAC and Neumann: research (now part of TensorFlow)
• Batch normalization: Google
● Hardware: 10-1000x in 2 years*
  ● Training: Intel, AMD, ARM, NVIDIA; Google TPU v2; Cerebras; Graphcore; Habana; 30+ startups…
  ● Inferencing: Intel Nervana, Wave Computing, Groq
● Software: 7-10x improvement in time-to-accuracy in 1 year on CNNs
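The learning-rate schedules above share a common core: the linear scaling rule plus a warmup phase. A minimal sketch of that idea, with illustrative constants rather than the published hyperparameters of any of the cited recipes:

```python
# Linear learning-rate scaling with warmup, as used (in spirit) by the
# large-batch recipes above. Base values here are illustrative only.

def learning_rate(step, base_lr, batch_size, base_batch=256,
                  warmup_steps=500):
    scaled = base_lr * batch_size / base_batch    # linear scaling rule
    if step < warmup_steps:                       # ramp up from ~0 to
        return scaled * (step + 1) / warmup_steps #  avoid early divergence
    return scaled

lr_start = learning_rate(0, 0.1, 8192)     # tiny at the first step
lr_full = learning_rate(1000, 0.1, 8192)   # full scaled rate: 3.2
```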
#5: MULTI-TOOL WORKFLOWS
[Figure: a multi-tool workflow in which data from Database 1 and Database 2 feed Job1 (clustering), Job1’ (deep learning) and Job2 (graph analytics) through parallel I/O stages P1 and P2; each stage is bound by a different limiting resource (# of memory channels, # of I/O channels) on the path from time-to-solution to time-to-insight]
Optimized for components and the end-to-end workflow.
HOW ARE CUSTOMERS LEVERAGING CRAY TOOLS?
THROUGHPUT AND CAPABILITY AT SCALE
ALEXNET TO ALPHAGO ZERO: A 300,000x INCREASE IN COMPUTE
Ref: OpenAI
Source: NVIDIA CES 2016 press conference
Source: https://www.slideshare.net/GroupeT2i/deep-learning-the-future-of-ai
NEW METHODS FOR DIFFERENT SHAPES OF SCIENTIFIC DATA
• Geospatial data: points, lines, grids, surfaces, meshes
• Molecular data: molecular homology
• Omics data: sequences
• Streaming data: time series
• Graph data: associations
2D, 3D and 4D volumes; higher precision (32, 64 bit); higher # of channels (3, 16, 1024); sparse + dense; resolution.
SEARCHING DIFFERENT FEATURE SPACES
Feature-based (f(x, y, …, z) = 0), function-based and pattern-based search:
• Structured data: the feature space is ill-posed; search is well-defined
• Unstructured data: the feature space is theoretical; search is empirical
• Semi-structured data: the feature space is yet to be discovered; search is P- or NP-hard
An opportunity to create the “models of tomorrow”.
BENEFITS FROM HPC ADOPTION FOR AI
Best practices:
• Application fine-tuning / performance optimization
• High-performance interconnect
• Algorithmic cleverness to trade compute and I/O
• Overlap of compute and I/O within the programming model
Results:
• Graph analytics: handle 1000x bigger datasets with 100x better speed-up on queries
• Matrix methods: 2-26x over Big Data frameworks like Hadoop and Spark (for the same cluster size)
• Deep learning: 95%+ scalability efficiency that can reduce training time from days to hours
[Figure: a large matrix approximated by the product of two low-rank factors]
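The matrix sketch on this slide depicts the core idea behind the scalable matrix methods: approximating a large matrix by a low-rank product. The sketch below uses a plain NumPy truncated SVD for illustration, not the Cray or Alchemist implementations:

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A (Eckart-Young, via truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))  # true rank 5
A5 = low_rank_approx(A, 5)
rel_err = np.linalg.norm(A - A5) / np.linalg.norm(A)
# A has exact rank 5, so the rank-5 approximation recovers it to
# machine precision; storage drops from 100*80 to 5*(100+80) numbers.
```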
SUCCESS STORIES OF CONVERGENT WORKFLOWS
Big Data and AI @ HPC Centers
#1: COMPUTING AT SCIENTIFIC FACILITIES
State-of-the-art: took 3,000 collaborators nearly 10 years to build.
Using the labels from previous data-analysis efforts saves compute cycles.
Convergence saves “flops”.
#2: DISTRIBUTED TRAINING OF AI/ML CODES
[Chart: Inception v3 performance on XC50; aggregate samples/sec (log scale) vs. number of nodes (GPUs), for micro-batch sizes MBS=4 through MBS=64, tracking the ideal 200 x N scaling line]
S.R. Sukumar et al., in the Proceedings of the NIPS Workshop on Deep Learning at Supercomputing Scale, 2017
State-of-the-art: takes 6+ days to train a model.
Convergence “trains” in minutes.
#3: INTERACTIVE ANALYSIS OF BIG DATA
● Cori @ NERSC
  ● 1,630 compute nodes
  ● Memory: 128 GB/node
  ● 32 2.3 GHz Haswell cores/node
Gittens, Alex, et al., “Matrix factorizations at scale: a comparison of scientific data analytics in Spark and C+MPI using three case studies,” IEEE International Conference on Big Data, 2016.
State-of-the-art: exploratory data analysis is not interactive.
Convergence enables “iterative discovery”.
#4: HYBRID THEORY+AI MODELS
State-of-the-art: works when manually tuned.
Improved velocity models come from combining:
• “Large enough” data
• Compute power
• Advanced algorithms and software frameworks
• Data science expertise
• “Ground truth” benchmark: a synthetic benchmark with known subsurface geometry and seismic reflection data.
• Conventional FWI: start with an initial energy velocity model to try to explain reflected waveforms; conventional FWI attempts to derive a more accurate velocity model.
• FWI with machine learning: use machine learning (regularization and steering) to guide the convergence process.
Convergence enables “higher fidelity” to reality.
SUMMARY: SYSTEMS FOR THE FUTURE
• General-purpose flexibility: commodity-like configurations with custom processors and chips
• Seamless heterogeneity: CPUs, GPUs, FPGAs, ASICs
• High-performance interconnects for data centers: MPI and TCP/IP collectives, compute on the network
• Unified software stack with micro-services: programming environment for performance and productivity
• Workflow optimization: match growth in compute, model size and data with I/O
THANK YOU
QUESTIONS?