TRANSCRIPT
-
Junxue ZHANG
EVP, CLUSTAR; PhD, SING Lab, HKUST
AUGUST 1, 2018
CLUSTAR: AI Training Platform Powered by High Performance Networking
-
Deep Learning Is Becoming Increasingly Important
Computer Vision, Natural Language Processing, Self-driving Cars
-
How does Deep Learning Work?

A minimal example: fit y = a * x + b to a mini batch, starting from a = 1, b = 1.

[Figure: a single neuron with inputs x and a constant 1, whose weighted sum produces y_pred]

Mini batch:
x    y    y_pred
1    5    2
2    7    3

Step 1 - Forward Pass: compute y_pred = a * x + b for every sample in the mini batch.

Step 2 - Calculating Loss:
L = (1/2) * sum( (y - y_pred)^2 )

Step 3 - Backpropagation: apply the chain rule and update the parameters with learning rate r = 0.1.
dL/da = dL/dy_pred * dy_pred/da = sum( (y_pred - y) * x ) = -11
dL/db = dL/dy_pred * dy_pred/db = sum( (y_pred - y) ) = -7
a = a - r * dL/da = 1 - 0.1 * (-11) = 2.1
b = b - r * dL/db = 1 - 0.1 * (-7) = 1.7
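For readers who want to check the arithmetic, here is a minimal NumPy sketch of this single training iteration; the variable names are illustrative and not taken from any CLUSTAR code.

```python
import numpy as np

# Mini batch from the slide: x -> y
x = np.array([1.0, 2.0])
y = np.array([5.0, 7.0])

a, b = 1.0, 1.0   # initial parameters
r = 0.1           # learning rate

# Step 1: forward pass
y_pred = a * x + b                  # [2.0, 3.0]

# Step 2: loss
L = 0.5 * np.sum((y - y_pred) ** 2)

# Step 3: backpropagation (gradients of L w.r.t. a and b)
dL_da = np.sum((y_pred - y) * x)    # -11.0
dL_db = np.sum(y_pred - y)          # -7.0

# Parameter update
a -= r * dL_da    # 1 - 0.1 * (-11) = 2.1
b -= r * dL_db    # 1 - 0.1 * (-7)  = 1.7
print(a, b)       # 2.1 1.7
```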
-
How does Deep Learning Work?

Next iteration: repeat the forward pass, loss calculation, and backpropagation on the next mini batch, using the updated parameters a = 2.1, b = 1.7.

Next mini batch:
x    y    y_pred
3    9
5    13
-
How does Deep Learning Work?

Real networks insert one or more Hidden Layers between the Input Layer and the Output Layer, each with its own weights w. Training is still the same loop: a forward pass through every layer, calculating the loss at the output, and backpropagation layer by layer to update each layer's weights.
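To illustrate the same loop with a hidden layer, here is a small NumPy sketch of a toy two-layer network; the layer sizes, tanh activation, and random data are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # mini batch: 8 samples, 3 features
Y = rng.normal(size=(8, 1))          # targets

W1 = rng.normal(size=(3, 4)) * 0.1   # input -> hidden weights
W2 = rng.normal(size=(4, 1)) * 0.1   # hidden -> output weights
r = 0.1                              # learning rate

# Forward pass, layer by layer
H = np.tanh(X @ W1)                  # hidden layer activations
Y_pred = H @ W2                      # output layer

# Calculating loss
L = 0.5 * np.sum((Y - Y_pred) ** 2)

# Backpropagation, layer by layer (chain rule)
dY_pred = Y_pred - Y                 # dL/dY_pred
dW2 = H.T @ dY_pred                  # gradient for output layer weights
dH = (dY_pred @ W2.T) * (1 - H ** 2) # propagate through tanh
dW1 = X.T @ dH                       # gradient for hidden layer weights

# Update every layer's weights
W2 -= r * dW2
W1 -= r * dW1
```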
-
Big Data Drives a New Paradigm for Training

1. The data is too large to fit in a single machine.
2. The training time is too long. Uber reports that training usually takes weeks or longer to complete [1].
-
Networking Plays an Important Role

Data-parallel training with a Parameter Server: the training data is split into Data Partition 1 and Data Partition 2, processed by Worker 1 and Worker 2 respectively, while the model parameters w1, w2, ... live on the Parameter Server. Every iteration crosses the network:

1. Pull parameters from the servers: each worker fetches the current w1, w2, ...
2. Forward pass: each worker runs the forward pass on the input from its own data partition.
3. Calculating loss: each worker computes the loss on its mini batch.
4. Backpropagation: each worker computes its local updates (w1', w2' on Worker 1; w1'', w2'' on Worker 2).
5. Push parameters to the servers: the local updates are sent back over the network and aggregated.

Networking is critical to performance!
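A minimal sketch of this worker loop in plain Python follows; the `ps` object and its pull()/push() methods are placeholders standing in for the networked parameter server, not a real API.

```python
import itertools
import numpy as np

def worker_loop(ps, data_partition, steps):
    """One data-parallel worker talking to a parameter server.

    `ps` is a placeholder object assumed to expose pull() -> {"a": .., "b": ..}
    and push(grads); in a real deployment both calls cross the network,
    and push() aggregates gradients from all workers before updating.
    """
    for x, y in itertools.islice(data_partition, steps):
        params = ps.pull()                          # 1. pull parameters from servers
        y_pred = params["a"] * x + params["b"]      # 2. forward pass on the local partition
        loss = 0.5 * np.sum((y - y_pred) ** 2)      # 3. calculating loss
        grads = {"a": np.sum((y_pred - y) * x),     # 4. backpropagation
                 "b": np.sum(y_pred - y)}
        ps.push(grads)                              # 5. push updates back to the servers
        print("loss:", loss)                        # monitoring only
```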
-
Networking Plays an Important Role
Model      Logistic Regression   Multi-layer Perceptron   AlexNet   VGG-16   ResNet-50
Speedup    2.59x                 3.45x                    1.6x      1.33x    1.03x

The speedup achieved after utilizing 40Gbps networking bandwidth with CLUSTAR.
-
CLUSTAR: AI Training Platform Powered by High Performance Networking

Key Technology (World-leading Research Achievements)

Smart Networking Scheduling
• Co-flow scheduling
• Elephant & mice flow scheduling

GDR
• Towards zero-copy data flow
• Utilizes RDMA and GPUDirect
• Integrated with TensorFlow

ParaExpress
• Resilient and adaptive parameter aggregation
• Tackles the disadvantages of Parameter Server & Ring AllReduce

MLT
• Exploits the SGD nature of AI training
• Semi loss-tolerance
• Model quality awareness

Analogy: between 2 machines = wider roads; across multiple machines = traffic scheduling; the AI protocol = a new traffic rule for AI. The importance of networking to an AI system is like that of the traffic system to a city.
-
CLUSTAR Platform
Applications: Computer Vision · Speech Recognition · Natural Language Processing · Autonomous Driving · Intelligent Anti-fraud · Intelligent Drones
Industry applications: Finance · Security · Internet · Manufacturing · Healthcare · Government

Nebula Platform: Data Preprocessing · Offline Training · Online Training · Multi-tenant Management · Task Scheduling · Operations Monitoring
Spark Optimization · TensorFlow Optimization · Container Orchestration Engine · Interactive Programming Interface

Clustar AI Fabrics: RDMA Network · RoCE · Smart NICs · Programmable Network
Infrastructure (general-purpose hardware): CPU · GPU · FPGA · ASIC · All-flash Storage
Vendors: Intel · Nvidia · AMD · Cambricon · Mellanox · Broadcom · P4E8
-
GDR: Towards Zero Copy Data Flow
[Figure: Server 1 and Server 2 connected through the data center networking. Each server has two CPU sockets with their own memory and GPUs; the RDMA NIC is attached to Socket 1.]

The unnecessary copies between the RDMA NIC and host memory, and between GPU RAM and host memory, enlarge latency, degrade throughput, and burn CPU cycles.
-
GDR: Towards Zero Copy Data Flow
[Figure: the same two-server topology. With GDR, data flows directly between GPU memory and the RDMA NIC, bypassing host memory.]

GDR removes the unnecessary copies to boost performance.
-
Memory Management

Without GDR: objects are allocated in OS-managed application memory, and each object must be copied into the pinned RDMA memory sending buffer before it can be transmitted. This unnecessary data copy between the pinned buffer and application memory degrades performance.

With GDR: malloc() and free() are managed manually over pre-pinned memory, so objects are allocated directly inside the pinned RDMA memory sending buffer. GDR thus further reduces data copies by managing objects directly over pinned memory.
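A minimal sketch of the idea in plain Python: a toy allocator that hands out objects from one pre-allocated region, with a bytearray standing in for pinned, RDMA-registered memory. The class and its behavior are illustrative assumptions, not CLUSTAR's allocator.

```python
class PinnedPool:
    """Toy allocator over one pre-registered (pinned) region.

    In a real GDR-style system the backing buffer would be pinned and
    registered with the RDMA NIC once, so objects allocated from it can
    be transmitted without an extra staging copy.
    """

    def __init__(self, size):
        self.buf = bytearray(size)     # stands in for the pinned region
        self.offset = 0                # bump pointer
        self.free_blocks = []          # recycled (offset, length) blocks
        self.live = {}                 # id(view) -> (offset, length)

    def malloc(self, length):
        # First try to reuse a freed block that is large enough.
        for i, (off, ln) in enumerate(self.free_blocks):
            if ln >= length:
                self.free_blocks.pop(i)
                break
        else:
            if self.offset + length > len(self.buf):
                raise MemoryError("pinned pool exhausted")
            off, ln = self.offset, length
            self.offset += length
        view = memoryview(self.buf)[off:off + length]
        self.live[id(view)] = (off, ln)
        return view

    def free(self, view):
        # Return the block to the pool; the memory stays pinned.
        self.free_blocks.append(self.live.pop(id(view)))


# Usage: allocate a tensor-sized object directly in the pinned region.
pool = PinnedPool(1 << 20)
obj = pool.malloc(4096)
obj[:5] = b"hello"        # fill the object in place, no staging copy
pool.free(obj)
```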
-
TensorFlow GDR
GDR has been contributed to the TensorFlow community (a commercial version is also available): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gdr
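For reference, the contrib GDR transport is selected through the distributed server's protocol string. A hedged TensorFlow 1.x sketch is below; the host names are placeholders, and it assumes a build that includes tensorflow/contrib/gdr and RDMA-capable NICs on every host.

```python
import tensorflow as tf  # TensorFlow 1.x

# Placeholder cluster: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Select the contrib GDR transport via the protocol string.
server = tf.train.Server(cluster,
                         job_name="worker",
                         task_index=0,
                         protocol="grpc+gdr")
server.join()
```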
-
Benchmark
[Figure: benchmark results for AlexNet, VGG16, and BERT]
-
The Evil of Parameter Server & Ring AllReduce

[Figure: Worker A, Worker B, and Worker C all push to a single Parameter Server across a bottleneck link]
Parameter Server: performance largely degrades on congested links, because the over-subscribed networking into the parameter server becomes the bottleneck.

[Figure: Worker A, Worker B, Worker C, and Worker D connected in a ring]
Ring AllReduce: the long dependency chain means that once one hop is delayed due to congestion, the downstream workers cannot start transferring and the whole job may have to wait.
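To make the dependency chain concrete, here is a minimal single-process simulation of ring all-reduce (reduce-scatter followed by all-gather); the structure and names are illustrative, not CLUSTAR's or any library's implementation.

```python
import numpy as np

def ring_allreduce(tensors):
    """Single-process simulation of ring all-reduce across N workers.

    `tensors` is a list of equal-length 1-D arrays, one per worker.
    Returns one array per worker, each holding the element-wise sum.
    The hop-by-hop dependency (a worker forwards a chunk only after
    receiving it from its left neighbor) is why one congested hop can
    stall the whole ring.
    """
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: reduce-scatter. In step t, worker i receives chunk
    # (i - t - 1) mod n from worker (i - 1) and adds it to its own copy.
    for t in range(n - 1):
        for i in range(n):
            idx = (i - t - 1) % n
            chunks[i][idx] = chunks[i][idx] + chunks[(i - 1) % n][idx]

    # Phase 2: all-gather. In step t, worker i receives the fully
    # reduced chunk (i - t) mod n from worker (i - 1) and keeps it.
    for t in range(n - 1):
        for i in range(n):
            idx = (i - t) % n
            chunks[i][idx] = chunks[(i - 1) % n][idx]

    return [np.concatenate(c) for c in chunks]


# Example: four workers, each holding a gradient of length 8.
grads = [np.full(8, w, dtype=float) for w in range(4)]
print(ring_allreduce(grads)[0])   # every worker ends with the sum 0+1+2+3 = 6
```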
-
ParaExpress: Networking-aware Parameter Aggregation
[Figure: ParaExpress monitors real-time networking conditions for workers 1-8, placed across Rack 1 and Rack 2 under their ToR switches, and generates an optimal parameter aggregation topology in which leaf workers feed intermediate aggregators up to a root aggregator.]

The generated parameter aggregation topology has the advantages of both the tree structure (Parameter Server) and the ring structure (Ring AllReduce).
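The slide does not describe the algorithm itself, so as a purely illustrative sketch of networking-aware aggregation, the toy code below aggregates gradients inside each rack first and then across racks; the two-level heuristic and all names are assumptions for illustration, not ParaExpress.

```python
import numpy as np

def rack_aware_aggregate(grads, racks):
    """Toy two-level aggregation: reduce inside each rack, then across racks.

    grads: dict worker_id -> gradient array
    racks: dict rack_name -> list of worker_ids in that rack
    Intra-rack traffic stays under the ToR switch; only one partial sum
    per rack crosses the (often over-subscribed) inter-rack links.
    """
    rack_partials = []
    for rack, workers in racks.items():
        partial = sum(grads[w] for w in workers)   # local aggregator per rack
        rack_partials.append(partial)
    return sum(rack_partials)                       # root aggregator


# Example with the 8 workers and 2 racks from the figure.
grads = {w: np.ones(4) * w for w in range(1, 9)}
racks = {"rack1": [1, 2, 3, 4], "rack2": [5, 6, 7, 8]}
print(rack_aware_aggregate(grads, racks))   # sum of 1..8 = 36 in every slot
```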
-
ParaExpress Architecture
[Architecture diagram]
• ParaExpress Master: embedding plan, prioritization.
• ParaExpress Agent: execution graph of per-tensor operation chains (R1 A1 S1, R2 A2 S2, ..., Rn An Sn).
• Components: Task Queue, Completion Queue, Execution Graph, Resolver, Operation Pool, Request Manager, Traffic Prioritization Module.
• Data path: MPI requests over the high-speed network interface; changes the DSCP-priority mapping.
-
Highlighted Results
• Compared with TensorFlow PS, Baidu Ring AllReduce, and Horovod, the software optimizations of ParaExpress achieve 1.5x to 4.3x better performance.
• In a real environment, ParaExpress achieves 2.6x better results than Parameter Server and 3x better results than Ring AllReduce.
-
About CLUSTAR
-
World-leading Research Achievements
9 papers published at top-tier networking conferences (SIGCOMM/NSDI) in the past 5 years; first in Asia.
• "AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization", ACM SIGCOMM 2018
• "PowerMan: An Out-of-Band Management Network for Datacenters using Power Line Communication", USENIX NSDI 2018
• "Resilient Datacenter Load Balancing in the Wild", ACM SIGCOMM 2017
• "Enabling Wide-spread Communications on Optical Fabric with MegaSwitch", USENIX NSDI 2017
• "Scheduling Mix-flows in Commodity Datacenters with Karuna", ACM SIGCOMM 2016
• "CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark", ACM SIGCOMM 2016
• "Enabling ECN in Multi-Service Multi-Queue Data Centers", USENIX NSDI 2016
• "Information-Agnostic Flow Scheduling for Commodity Data Centers", USENIX NSDI 2015
• "Explicit Path Control in Commodity Data Centers: Design and Applications", USENIX NSDI 2015
Statistics for universities in Greater China:
University Number of accepted papers
HKUST 9 (all from CLUSTAR's teams)
Tsinghua University 5 (from different professors and labs)
Chinese Academy of Sciences 3 (from different professors and labs)
Peking University 1
Fudan University 1
National Supercomputing Center in Wuxi 1
-
Selected Clients
GDR
• GDR is used to boost the performance of moments classification for WeChat and of computer vision for SAIC.
  Achievements: ~3x for WeChat, ~1.6x for SAIC.

ParaExpress
• ParaExpress is used to improve the performance of AI training in a sophisticated cloud environment. Progress: POC.

Other engagements
• GDR is used to boost the performance of Federated Learning, and MLT to boost long-distance communication. Progress: developing.
• High-speed networking virtualization for an AI unicorn (name withheld due to NDA). Progress: developing.

AI Consulting
• Smart customer support system; the CLUSTAR platform is used to speed up AI training. Progress: developing the next-generation AI platform together.
-
CLUSTAR Team
Kai CHEN, Founder
• PhD, Northwestern University
• Associate Professor, HKUST
• 50+ top-tier networking conference papers (SIGCOMM/NSDI)
• 10+ years of research experience on DCN
• Director of SING Lab, HKUST

Qiang YANG, Co-founder
• PhD, University of Maryland
• Chair Professor and Department Head of CSE, HKUST
• President of IJCAI
• Director of WHAT Lab, HKUST
• Pioneer of transfer learning
• IEEE/ACM/AAAI Fellow
• Founding Director of the Huawei Noah's Ark Research Lab

Shuihai HU, VP of Technology
• PhD, HKUST
• Expertise in RDMA

Pin LYU, Director of Algorithms
• 7 years of IBM software development

Junxue ZHANG, EVP
• PhD, HKUST
• Architect of the CLUSTAR platform

Yajing LYU, VP of Business
• MBA, ESSEC
• 6+ years of business experience

Junhuan SUN, VP of Engineering
• 10+ years of engineering experience

Weiyan WANG, AI Scientist
• PhD, HKUST
• AutoML systems
-
Milestones

[Timeline from 2017.08 to 2019.01; dates shown on the original slide: 2017.08, 2018.01, 2018.03, 2018.05, 2018.09, 2018.11, 2019.01]
• CLUSTAR is founded
• Angel funding
• Joined the Nvidia Inception Program
• Cooperation with Sunshine Insurance
• 2018.11: CLUSTAR v1.0 launched; cooperation with SAIC
• 2019.01: CLUSTAR v1.1 launched; cooperation with WeChat
-
THANK YOU!
[email protected]
https://www.clustarai.com