TRANSCRIPT
© 2018 Cray Inc.
Architectural Challenges and Solutions for the Convergence of Big Data, HPC and AI
HPC-AI Advisory Council, Perth 2019
@Rangan_Sukumar
https://www.linkedin.com/in/rangan/
SAFE HARBOR STATEMENT
This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets, and other statements that are not historical facts.
These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.
CONVERGENCE OF BIG DATA, HPC AND AI
• Scientific models
• Simulation
• Analytics models
• Big data analytics
• Data-intensive processing
• Artificial intelligence & machine learning
WORKFLOWS OF THE EXASCALE ERA
ARCHITECTING A CONVERGENT SYSTEM
Levels of Architectural Maturity
Level 1: Can a system run HPC, Big Data and AI applications/workloads?
Level 2: Can a system productively execute a “convergent workflow” consisting of HPC, Big Data and AI tools, codes and frameworks?
Level 3: Can a system accelerate/scale the workflow when required?
Level 4: Given a workflow, is this the top “performant” system one can build?
THE CHALLENGE OF ARCHITECTING HPC SYSTEMS
• Design specifications
  • maximize(performance-per-$)
  • minimize($-to-insight)
  • maximize(architected performance * community productivity), subject to budget
  • minimum(benchmark-performance) >= scaling factor
  • maximum(app-to-app performance variation) <= epsilon
  • minimize(operating costs ~ power, downtime, human resources)
• Figures-of-merit
  • FLOPS, Programmability, Utilization, Benchmark, Scientific Innovation, …
THE CHALLENGE OF ARCHITECTING BIG-DATA SYSTEMS
• Design specifications
  • maximize(ROI-per-byte)
  • maximize(capacity * consistency * availability * fault-tolerance)
  • maximize(open-source tool support)
  • minimize(time-to-prototype + time-to-production)
  • minimize(security risk)
  • minimize(operating costs ~ power, downtime, human resources)
• Figures-of-merit
  • ROI, Elasticity, Multi-tenancy, Ease-of-use, Time-to-accuracy, …
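Taken together, the HPC and big-data design specifications above amount to a constrained optimization over candidate system configurations. The Python sketch below is purely illustrative (the scoring function, budget and all numbers are assumptions for demonstration, not a Cray sizing methodology):

```python
# Illustrative only: encoding the design specifications above as a
# constrained scoring problem over hypothetical candidate systems.

def score_candidate(perf, productivity, cost, budget,
                    app_variation, epsilon=0.15):
    """Score = architected performance * community productivity,
    subject to the budget and app-to-app variation constraints."""
    if cost > budget:            # stay within budget
        return None
    if app_variation > epsilon:  # max app-to-app variation <= epsilon
        return None
    return perf * productivity

# (performance, productivity, cost, app-to-app variation): made-up values
candidates = [
    (100.0, 0.8, 9.0, 0.10),
    (120.0, 0.5, 9.5, 0.05),
    (150.0, 0.9, 11.0, 0.02),   # best raw performance, but over budget
]

budget = 10.0
scored = [(score_candidate(p, u, c, budget, v), (p, u, c, v))
          for p, u, c, v in candidates]
best = max((s for s in scored if s[0] is not None), key=lambda s: s[0])
# best[1] is the balanced 100.0-perf system, not the fastest one
```

The point of the sketch: once productivity and operating constraints enter the objective, the "winning" configuration is rarely the one with the highest raw performance.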
TODAY: A TALE OF TWO ECOSYSTEMS
Architectural differences:
• Access policy
• Programming models
• Languages and tools
• Performance
• Resource management and job scheduling
J. Dongarra et al., “Exascale computing and Big Data: the next frontier,” Communications of the ACM, 2015
HOW HAS CRAY APPROACHED CONVERGENCE?
Level 1: Can a system run HPC, Big Data and AI applications/workloads?
Scalable simulation & modelling, analytics, and AI (500NX, 500GT)
• Urika Manager: manage users’ analytics environments, workflows & data sources
• Urika Analytics Services: enable, support and accelerate parallel-performance analytics libraries & frameworks
Level 2: Can a system execute a “convergent workflow” consisting of HPC, Big Data and AI tools, codes and frameworks productively?
Application Interfaces
• Frameworks, e.g. TensorFlow, Cray Graph Engine, RAJA, Kokkos
• Programming models, e.g. MapReduce, gRPC, SHMEM, OpenMP, BSP
• Libraries, e.g. MKL, CUDA, libSci
• Collectives, e.g. NCCL, MPI
System Interfaces
• Micro-services, e.g. parallel I/O, Lustre integration
• Container/batch/interactive jobs, e.g. Docker, Singularity
• Provisioning / job orchestration, e.g. Kubernetes, SLURM, Moab
Community Productivity
• Domain-specific creativity: is there an ecosystem of sustainable community (open-source) engagement that enables vertical segments?
• Code portability: does a user have to rewrite code? Does the vendor support code porting for novel architectures?
• Programmability: does an end-user have to learn a new language, or can they launch jobs with modern tools (e.g. notebooks)?
• Data pre-processing: does the system offer tools to optimize ETL wall-time?
• Data movement: does the system provide the ability to run multiple frameworks/applications on the same data?
System
• Utilization: peak vs. sustained, performance per $
• Reliability: faults, MTTF, uptime
• Scalability: weak and strong
System Architecture
• Interconnect: Ethernet, InfiniBand, Aries
• Provisioning: Mesos, Moab, SLURM
• Node architecture: # of xPUs + cache + memory + network + I/O
• Disk: latency
• Memory: capacity, latency
• xPU: speed
Performance must be delivered at the component, node, multi-node, system and facility levels, across hardware, software and the ecosystem.
Level 2: Can a system execute a “convergent workflow” consisting of HPC, Big Data and AI tools, codes and frameworks in reasonable time?
Source: Aaron Vose and Pete Mendygral, Cray Performance Team
Accelerating Time-to-solution
95%+ scalability efficiency reduces training time from days to hours
Get to the highest possible accuracy in one-third the time.
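For reference, a "scalability efficiency" figure like the one quoted here is typically computed as achieved aggregate throughput divided by perfect linear scaling of the single-node rate. A minimal sketch, using hypothetical throughput numbers rather than the measurements behind the slide's claim:

```python
# Achieved aggregate throughput on n nodes, relative to perfect
# linear scaling of the 1-node rate. Numbers below are hypothetical.

def scaling_efficiency(throughput_1, throughput_n, n):
    return throughput_n / (n * throughput_1)

# e.g. 200 samples/s on 1 node, 48,640 samples/s aggregate on 256 nodes
eff = scaling_efficiency(200.0, 48_640.0, 256)   # 0.95, i.e. 95%
```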
Level 3: Can a system seamlessly accelerate/scale the workflow when required?
• Data preparation (labor- & CPU-intensive): data acquisition, data processing
• Model development (compute-intensive, GPU & CPU): model training, model testing
• Model implementation (business-process-intensive): model deployment, model inference
▪ Data Requirements - By Size, Type, Performance:
▪ large volume data stores
▪ high-bandwidth
▪ fast IOPS
▪ Direct connectivity to external storage
▪ HDF5/NetCDF ~ Relational, Document, Key-Value
▪ Performance Requirements - By Stage and Task:
▪ Scaling
▪ Throughput
▪ User-productivity
▪ Performant Python
▪ Collaborative notebooks
▪ Open-source framework support
▪ Storage Technology Landscape
▪ Tape
▪ HDD
▪ SSD-SATA/SAS
▪ SSD-NVMe
▪ Flash
▪ On-node vs. off-node (DataWarp vs. NVMe-over-Fabrics)
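A back-of-the-envelope sketch of why the storage tier dominates data-preparation wall time. The sustained-read bandwidths below are rough, assumed figures chosen for illustration, not measured numbers for any particular product:

```python
# Rough, assumed per-device sustained read bandwidths (GB/s).
TIER_BW_GBS = {
    "tape": 0.3,
    "hdd": 0.2,
    "ssd_sata": 0.5,
    "ssd_nvme": 3.0,
}

def read_time_seconds(dataset_gb, tier):
    """Time to stream a dataset once off a single device of this tier."""
    return dataset_gb / TIER_BW_GBS[tier]

# Streaming a 6 TB training set once from each tier:
times = {tier: read_time_seconds(6000, tier) for tier in TIER_BW_GBS}
# Under these assumptions NVMe turns a multi-hour HDD scan into ~33 minutes
```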
Level 3: Can a system seamlessly accelerate/scale the workflow when required?
The Urika software stack (spanning Distributed AI and Data Preparation & Analytics):
• Urika Libraries & Frameworks: TensorFlow, Keras, PyTorch, Cray HPO, Spark, Dask, Jupyter Notebook, Alchemist, TensorBoard, pbdR, CGE (Cray Graph Engine), Cray Feature Selection
• Urika Parallel Communication and IO Layer: Cray MPI, OpenMPI, Cray PE DL Plugin, Horovod, HDF5, NetCDF
• Urika Parallel Performance Layer: architecture-specific libraries (Cray Science & Math Libraries, Intel MKL; NVIDIA CUDA, cuDNN, MIOpen)
Level 4: Given a workflow, is this the top “performant” system one can build?
Company + product, secret sauce, and differentiation w.r.t. NVIDIA V100:
• Habana Labs (Goya for inferencing, Gaudi for training): mixed-precision matrix multiplication, RoCE support on chip, 8x 100G PCIe Gen4/3 lanes per chip. Goya adds model streaming and model multi-tenancy; Gaudi scales out with Ethernet (GA in Q3 2019). >10x expected benefits over T4 and V100 respectively.
• Graphcore (IPUs, both training and inferencing): massive parallelism by design, leverages sparsity of matrix multiplies in DL, SRAM on chip, data-flow and computational graphs; custom kernels implementable using Poplar. On CNNs: 25x faster, 8x lower latency, with throughput of 12,000 images/second in inferencing mode. On RNNs (a C++ application): 500x faster, 28x lower latency; LSTM networks 10x faster on throughput (samples/second vs. batch size). Could benefit HPC workloads.
• Cerebras (CS-1): a large wafer packs more silicon for compute, avoids zero-multiplies, dynamic scheduling on chip; custom kernels using a proprietary toolkit.
• Xilinx (Alveo): mixed precision within kernels; custom kernels using VHDL, Verilog, etc. Claims 90x on database acceleration, 20x on ML, 12x on video processing and 10x on HPC codes over GPUs, beating ASICs on performance-per-$.
Also in the race: NVIDIA, AMD, Intel, Groq, Wave, …
▪ IC vendors (12), IP vendors (11), cloud vendors, startups (9+34 and increasing)
▪ Among them, 48 different chips for inferencing, 21 for training; 12 claim they can do both.
Level 4: Given a workflow, is this the top “performant” system one can build?
Habana vs. NVIDIA on ResNet-50
[Figure: four bar charts comparing Habana Goya, NVIDIA V100 and NVIDIA T4 on max throughput (images/sec), power efficiency (images/sec/watt), batch-size-1 latency (ms) and relative cost, with ~4x and ~3x advantages annotated for Goya]
50+ startups are designing AI-specific ASICs; Habana outperforms NVIDIA on several key metrics.
THE CRAY DIFFERENTIATION
#1: A SMARTER INTERCONNECT
Handle the data movement, management and I/O patterns… with a smarter interconnect.
J. Dongarra et al., “Exascale computing and Big Data: the next frontier,” Communications of the ACM, 2015
• Hypothetical “best” node for Big Data and AI:
  • Mixed precision
  • Fat nodes (lots of RAM)
  • ASICs (training + inference)
  • Persistent node-local storage
  • Run-time code optimization
• Hypothetical “best” node for HPC:
  • High precision
  • Vector extensions
  • CPUs/GPUs with high-bandwidth memory
  • I/O over the memory hierarchy
  • Static code optimization
…joined by a high-performance interconnect.
#2: PROGRAMMING NEW MODES OF PARALLELISM
• Data-parallelism: workers operate on data partitions against shared states
• Model-parallelism (training, inferencing): workers hold a partitioned model over shared data
• Ensemble-parallelism
• Intra-node vs. inter-node bandwidth trade-offs
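Data-parallelism, the most common of the three modes, can be illustrated with a toy model: each "worker" computes a gradient on its own shard of the batch, and the shard gradients are averaged, standing in for an allreduce across nodes. This is a self-contained NumPy sketch, not Horovod or Cray PE DL Plugin code:

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean-squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, X, y, n_workers, lr=0.1):
    shards_X = np.array_split(X, n_workers)   # partition the batch
    shards_y = np.array_split(y, n_workers)
    grads = [local_gradient(w, Xs, ys)
             for Xs, ys in zip(shards_X, shards_y)]
    g = np.mean(grads, axis=0)                # the "allreduce" average
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, X, y, n_workers=4)
# With equal shard sizes the averaged gradient equals the full-batch
# gradient, so the parallel update matches the serial one and w
# converges toward w_true.
```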
#3: MORE BANDWIDTH FROM/TO STORAGE
Tiered flash and HDD servers.
• Traditional model: compute nodes reach OSS (HDD) over the high-speed network, LNET routers and a storage area network.
• New model: compute nodes connect over the high-speed network directly to OSS & MDS on SSD (~80 GB/s), tiered with OSS on HDD (~30 GB/s), with NVRAM caching on both tiers.
Benefits:
• Lower cost
• Lower complexity
• Lower latency
• Improved small-I/O performance
#4: PRODUCTION-QUALITY, UP-TO-DATE SOFTWARE
Methods for scaling training, and who introduced them:
• LARS (MBS ~32K): NVIDIA
• Learning-rate schedule (~64K): Facebook
• Gradient clipping: Microsoft
• Mixed-precision training: Baidu
• Optimizer tuning (~32K), K-FAC and Neumann: research (now part of TensorFlow)
• Batch normalization: Google
● Hardware: 10-1000x in 2 years*
  ● Training: Intel, AMD, ARM, NVIDIA; Google TPU v2; Cerebras; Graphcore; Habana; 30+ startups…
  ● Inferencing: Intel Nervana, Wave Computing, Groq
● Software: 7-10x improvement in time-to-accuracy in 1 year on CNNs
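The learning-rate schedules above share a common core: the linear scaling rule plus a warmup phase. A minimal sketch of that idea, with illustrative constants rather than the published hyperparameters of any of the cited recipes:

```python
# Linear learning-rate scaling with warmup, as used (in spirit) by the
# large-batch recipes above. Base values here are illustrative only.

def learning_rate(step, base_lr, batch_size, base_batch=256,
                  warmup_steps=500):
    scaled = base_lr * batch_size / base_batch    # linear scaling rule
    if step < warmup_steps:                       # ramp up from ~0 to
        return scaled * (step + 1) / warmup_steps #  avoid early divergence
    return scaled

lr_start = learning_rate(0, 0.1, 8192)     # tiny at the first step
lr_full = learning_rate(1000, 0.1, 8192)   # full scaled rate: 3.2
```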
#5: MULTI-TOOL WORKFLOWS
[Figure: a multi-tool workflow in which data from Database 1 and Database 2 feed Job1 (clustering), Job1’ (deep learning) and Job2 (graph analytics) through parallel I/O stages P1 and P2; each stage is bound by a different limiting resource (# of memory channels, # of I/O channels) on the path from time-to-solution to time-to-insight]
Optimized for components and the end-to-end workflow.
HOW ARE CUSTOMERS LEVERAGING CRAY TOOLS?
THROUGHPUT AND CAPABILITY AT SCALE
ALEXNET TO ALPHAGO ZERO: A 300,000x INCREASE IN COMPUTE
Ref: OpenAI
Source: NVIDIA CES 2016 press conference
Source: https://www.slideshare.net/GroupeT2i/deep-learning-the-future-of-ai
NEW METHODS FOR DIFFERENT SHAPES OF SCIENTIFIC DATA
• Geospatial data: points, lines, grids, surfaces, meshes
• Molecular data: molecular homology
• Omics data: sequences
• Streaming data: time series
• Graph data: associations
2D, 3D and 4D volumes; higher precision (32, 64 bit); higher # of channels (3, 16, 1024); sparse + dense; resolution.
SEARCHING DIFFERENT FEATURE SPACES
Feature-based (f(x, y, …, z) = 0), function-based and pattern-based search:
• Structured data: the feature space is ill-posed; search is well-defined
• Unstructured data: the feature space is theoretical; search is empirical
• Semi-structured data: the feature space is yet to be discovered; search is P- or NP-hard
An opportunity to create the “models of tomorrow”.
BENEFITS FROM HPC ADOPTION FOR AI
Best practices:
• Application fine-tuning / performance optimization
• High-performance interconnect
• Algorithmic cleverness to trade compute and I/O
• Overlap of compute and I/O within the programming model
Results:
• Graph analytics: handle 1000x bigger datasets with 100x better speed-up on queries
• Matrix methods: 2-26x over Big Data frameworks like Hadoop and Spark (for the same cluster size)
• Deep learning: 95%+ scalability efficiency that can reduce training time from days to hours
[Figure: a large matrix approximated by the product of two low-rank factors]
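The matrix sketch on this slide depicts the core idea behind the scalable matrix methods: approximating a large matrix by a low-rank product. The sketch below uses a plain NumPy truncated SVD for illustration, not the Cray or Alchemist implementations:

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A (Eckart-Young, via truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))  # true rank 5
A5 = low_rank_approx(A, 5)
rel_err = np.linalg.norm(A - A5) / np.linalg.norm(A)
# A has exact rank 5, so the rank-5 approximation recovers it to
# machine precision; storage drops from 100*80 to 5*(100+80) numbers.
```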
SUCCESS STORIES OF CONVERGENT WORKFLOWS
Big Data and AI @ HPC Centers
#1: COMPUTING AT SCIENTIFIC FACILITIES
State-of-the-art: took 3,000 collaborators nearly 10 years to build.
Using the labels from previous data-analysis efforts saves compute cycles.
Convergence saves “flops”.
#2: DISTRIBUTED TRAINING OF AI/ML CODES
[Chart: Inception v3 performance on XC50; aggregate samples/sec (log scale) vs. number of nodes (GPUs), for micro-batch sizes MBS=4 through MBS=64, tracking the ideal 200 x N scaling line]
S.R. Sukumar et al., in the Proceedings of the NIPS Workshop on Deep Learning at Supercomputing Scale, 2017
State-of-the-art: takes 6+ days to train a model.
Convergence “trains” in minutes.
#3: INTERACTIVE ANALYSIS OF BIG DATA
● Cori @ NERSC
  ● 1,630 compute nodes
  ● Memory: 128 GB/node
  ● 32 2.3 GHz Haswell cores/node
Gittens, Alex, et al., “Matrix factorizations at scale: a comparison of scientific data analytics in Spark and C+MPI using three case studies,” IEEE International Conference on Big Data, 2016.
State-of-the-art: exploratory data analysis is not interactive.
Convergence enables “iterative discovery”.
#4: HYBRID THEORY+AI MODELS
State-of-the-art: works when manually tuned.
Improved velocity models come from combining:
• “Large enough” data
• Compute power
• Advanced algorithms and software frameworks
• Data science expertise
• “Ground truth” benchmark: a synthetic benchmark with known subsurface geometry and seismic reflection data.
• Conventional FWI: start with an initial energy velocity model to try to explain reflected waveforms; conventional FWI attempts to derive a more accurate velocity model.
• FWI with machine learning: use machine learning (regularization and steering) to guide the convergence process.
Convergence enables “higher fidelity” to reality.
SUMMARY: SYSTEMS FOR THE FUTURE
• General-purpose flexibility: commodity-like configurations with custom processors and chips
• Seamless heterogeneity: CPUs, GPUs, FPGAs, ASICs
• High-performance interconnects for data centers: MPI and TCP/IP collectives, compute on the network
• Unified software stack with micro-services: programming environment for performance and productivity
• Workflow optimization: match growth in compute, model size and data with I/O
THANK YOU
QUESTIONS?