TRANSCRIPT
-
Junxue ZHANG
EVP, CLUSTAR; PhD, SING Lab, HKUST
AUGUST 1, 2018
CLUSTAR: AI Training Platform Powered by High Performance Networking
-
Deep Learning Is Becoming Increasingly Important
Computer Vision, Natural Language Processing, Self-driving Cars
-
How does Deep Learning Work?

A minimal example: fit y = a * x + b to a mini batch, starting from a = 1, b = 1.

[Figure: a single neuron with inputs x and a constant 1, whose weighted sum produces y_pred]

Mini batch:
x    y    y_pred
1    5    2
2    7    3

Step 1 - Forward Pass: compute y_pred = a * x + b for every sample in the mini batch.

Step 2 - Calculating Loss:
L = (1/2) * sum( (y - y_pred)^2 )

Step 3 - Backpropagation: apply the chain rule and update the parameters with learning rate r = 0.1.
dL/da = dL/dy_pred * dy_pred/da = sum( (y_pred - y) * x ) = -11
dL/db = dL/dy_pred * dy_pred/db = sum( (y_pred - y) ) = -7
a = a - r * dL/da = 1 - 0.1 * (-11) = 2.1
b = b - r * dL/db = 1 - 0.1 * (-7) = 1.7
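For readers who want to check the arithmetic, here is a minimal NumPy sketch of this single training iteration; the variable names are illustrative and not taken from any CLUSTAR code.

```python
import numpy as np

# Mini batch from the slide: x -> y
x = np.array([1.0, 2.0])
y = np.array([5.0, 7.0])

a, b = 1.0, 1.0   # initial parameters
r = 0.1           # learning rate

# Step 1: forward pass
y_pred = a * x + b                  # [2.0, 3.0]

# Step 2: loss
L = 0.5 * np.sum((y - y_pred) ** 2)

# Step 3: backpropagation (gradients of L w.r.t. a and b)
dL_da = np.sum((y_pred - y) * x)    # -11.0
dL_db = np.sum(y_pred - y)          # -7.0

# Parameter update
a -= r * dL_da    # 1 - 0.1 * (-11) = 2.1
b -= r * dL_db    # 1 - 0.1 * (-7)  = 1.7
print(a, b)       # 2.1 1.7
```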
-
How does Deep Learning Work?

Next iteration: repeat the forward pass, loss calculation, and backpropagation on the next mini batch, using the updated parameters a = 2.1, b = 1.7.

Next mini batch:
x    y    y_pred
3    9
5    13
-
How does Deep Learning Work?

Real networks insert one or more Hidden Layers between the Input Layer and the Output Layer, each with its own weights w. Training is still the same loop: a forward pass through every layer, calculating the loss at the output, and backpropagation layer by layer to update each layer's weights.
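To illustrate the same loop with a hidden layer, here is a small NumPy sketch of a toy two-layer network; the layer sizes, tanh activation, and random data are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # mini batch: 8 samples, 3 features
Y = rng.normal(size=(8, 1))          # targets

W1 = rng.normal(size=(3, 4)) * 0.1   # input -> hidden weights
W2 = rng.normal(size=(4, 1)) * 0.1   # hidden -> output weights
r = 0.1                              # learning rate

# Forward pass, layer by layer
H = np.tanh(X @ W1)                  # hidden layer activations
Y_pred = H @ W2                      # output layer

# Calculating loss
L = 0.5 * np.sum((Y - Y_pred) ** 2)

# Backpropagation, layer by layer (chain rule)
dY_pred = Y_pred - Y                 # dL/dY_pred
dW2 = H.T @ dY_pred                  # gradient for output layer weights
dH = (dY_pred @ W2.T) * (1 - H ** 2) # propagate through tanh
dW1 = X.T @ dH                       # gradient for hidden layer weights

# Update every layer's weights
W2 -= r * dW2
W1 -= r * dW1
```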
-
Big Data Drives a New Paradigm for Training

1. The data is too large to fit in a single machine.
2. The training time is too long. Uber reports that training usually takes weeks or longer to complete [1].
-
Networking Plays an Important Role

Data-parallel training with a Parameter Server: the training data is split into Data Partition 1 and Data Partition 2, processed by Worker 1 and Worker 2 respectively, while the model parameters w1, w2, ... live on the Parameter Server. Every iteration crosses the network:

1. Pull parameters from the servers: each worker fetches the current w1, w2, ...
2. Forward pass: each worker runs the forward pass on the input from its own data partition.
3. Calculating loss: each worker computes the loss on its mini batch.
4. Backpropagation: each worker computes its local updates (w1', w2' on Worker 1; w1'', w2'' on Worker 2).
5. Push parameters to the servers: the local updates are sent back over the network and aggregated.

Networking is critical to performance!
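A minimal sketch of this worker loop in plain Python follows; the `ps` object and its pull()/push() methods are placeholders standing in for the networked parameter server, not a real API.

```python
import itertools
import numpy as np

def worker_loop(ps, data_partition, steps):
    """One data-parallel worker talking to a parameter server.

    `ps` is a placeholder object assumed to expose pull() -> {"a": .., "b": ..}
    and push(grads); in a real deployment both calls cross the network,
    and push() aggregates gradients from all workers before updating.
    """
    for x, y in itertools.islice(data_partition, steps):
        params = ps.pull()                          # 1. pull parameters from servers
        y_pred = params["a"] * x + params["b"]      # 2. forward pass on the local partition
        loss = 0.5 * np.sum((y - y_pred) ** 2)      # 3. calculating loss
        grads = {"a": np.sum((y_pred - y) * x),     # 4. backpropagation
                 "b": np.sum(y_pred - y)}
        ps.push(grads)                              # 5. push updates back to the servers
        print("loss:", loss)                        # monitoring only
```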
-
Networking Plays an Important Role
Model      Logistic Regression   Multi-layer Perceptron   AlexNet   VGG-16   ResNet-50
Speedup    2.59x                 3.45x                    1.6x      1.33x    1.03x

The speedup achieved after utilizing 40Gbps networking bandwidth with CLUSTAR.
-
CLUSTAR: AI Training Platform Powered by High Performance Networking

Key Technology (World-leading Research Achievements)

Smart Networking Scheduling
• Co-flow scheduling
• Elephant & mice flow scheduling

GDR
• Towards zero-copy data flow
• Utilizes RDMA and GPUDirect
• Integrated with TensorFlow

ParaExpress
• Resilient and adaptive parameter aggregation
• Tackles the disadvantages of Parameter Server & Ring AllReduce

MLT
• Exploits the SGD nature of AI training
• Semi loss-tolerance
• Model quality awareness

Analogy: between 2 machines = wider roads; across multiple machines = traffic scheduling; the AI protocol = a new traffic rule for AI. The importance of networking to an AI system is like that of the traffic system to a city.
-
CLUSTAR Platform
Applications: Computer Vision · Speech Recognition · Natural Language Processing · Autonomous Driving · Intelligent Anti-fraud · Intelligent Drones
Industry applications: Finance · Security · Internet · Manufacturing · Healthcare · Government

Nebula Platform: Data Preprocessing · Offline Training · Online Training · Multi-tenant Management · Task Scheduling · Operations Monitoring
Spark Optimization · TensorFlow Optimization · Container Orchestration Engine · Interactive Programming Interface

Clustar AI Fabrics: RDMA Network · RoCE · Smart NICs · Programmable Network
Infrastructure (general-purpose hardware): CPU · GPU · FPGA · ASIC · All-flash Storage
Vendors: Intel · Nvidia · AMD · Cambricon · Mellanox · Broadcom · P4E8
-
GDR: Towards Zero Copy Data Flow
[Figure: Server 1 and Server 2 connected through the data center networking. Each server has two CPU sockets with their own memory and GPUs; the RDMA NIC is attached to Socket 1.]

The unnecessary copies between the RDMA NIC and host memory, and between GPU RAM and host memory, enlarge latency, degrade throughput, and burn CPU cycles.
-
GDR: Towards Zero Copy Data Flow
[Figure: the same two-server topology. With GDR, data flows directly between GPU memory and the RDMA NIC, bypassing host memory.]

GDR removes the unnecessary copies to boost performance.
-
Memory Management

Without GDR: objects are allocated in OS-managed application memory, and each object must be copied into the pinned RDMA memory sending buffer before it can be transmitted. This unnecessary data copy between the pinned buffer and application memory degrades performance.

With GDR: malloc() and free() are managed manually over pre-pinned memory, so objects are allocated directly inside the pinned RDMA memory sending buffer. GDR thus further reduces data copies by managing objects directly over pinned memory.
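A minimal sketch of the idea in plain Python: a toy allocator that hands out objects from one pre-allocated region, with a bytearray standing in for pinned, RDMA-registered memory. The class and its behavior are illustrative assumptions, not CLUSTAR's allocator.

```python
class PinnedPool:
    """Toy allocator over one pre-registered (pinned) region.

    In a real GDR-style system the backing buffer would be pinned and
    registered with the RDMA NIC once, so objects allocated from it can
    be transmitted without an extra staging copy.
    """

    def __init__(self, size):
        self.buf = bytearray(size)     # stands in for the pinned region
        self.offset = 0                # bump pointer
        self.free_blocks = []          # recycled (offset, length) blocks
        self.live = {}                 # id(view) -> (offset, length)

    def malloc(self, length):
        # First try to reuse a freed block that is large enough.
        for i, (off, ln) in enumerate(self.free_blocks):
            if ln >= length:
                self.free_blocks.pop(i)
                break
        else:
            if self.offset + length > len(self.buf):
                raise MemoryError("pinned pool exhausted")
            off, ln = self.offset, length
            self.offset += length
        view = memoryview(self.buf)[off:off + length]
        self.live[id(view)] = (off, ln)
        return view

    def free(self, view):
        # Return the block to the pool; the memory stays pinned.
        self.free_blocks.append(self.live.pop(id(view)))


# Usage: allocate a tensor-sized object directly in the pinned region.
pool = PinnedPool(1 << 20)
obj = pool.malloc(4096)
obj[:5] = b"hello"        # fill the object in place, no staging copy
pool.free(obj)
```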
-
TensorFlow GDR
GDR has been contributed to the TensorFlow community (a commercial version is also available): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gdr
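For reference, the contrib GDR transport is selected through the distributed server's protocol string. A hedged TensorFlow 1.x sketch is below; the host names are placeholders, and it assumes a build that includes tensorflow/contrib/gdr and RDMA-capable NICs on every host.

```python
import tensorflow as tf  # TensorFlow 1.x

# Placeholder cluster: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Select the contrib GDR transport via the protocol string.
server = tf.train.Server(cluster,
                         job_name="worker",
                         task_index=0,
                         protocol="grpc+gdr")
server.join()
```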
-
Benchmark
[Figure: benchmark results for AlexNet, VGG16, and BERT]
-
The Evil of Parameter Server & Ring AllReduce

[Figure: Worker A, Worker B, and Worker C all push to a single Parameter Server across a bottleneck link]
Parameter Server: performance largely degrades on congested links, because the over-subscribed networking into the parameter server becomes the bottleneck.

[Figure: Worker A, Worker B, Worker C, and Worker D connected in a ring]
Ring AllReduce: the long dependency chain means that once one hop is delayed due to congestion, the downstream workers cannot start transferring and the whole job may have to wait.
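To make the dependency chain concrete, here is a minimal single-process simulation of ring all-reduce (reduce-scatter followed by all-gather); the structure and names are illustrative, not CLUSTAR's or any library's implementation.

```python
import numpy as np

def ring_allreduce(tensors):
    """Single-process simulation of ring all-reduce across N workers.

    `tensors` is a list of equal-length 1-D arrays, one per worker.
    Returns one array per worker, each holding the element-wise sum.
    The hop-by-hop dependency (a worker forwards a chunk only after
    receiving it from its left neighbor) is why one congested hop can
    stall the whole ring.
    """
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: reduce-scatter. In step t, worker i receives chunk
    # (i - t - 1) mod n from worker (i - 1) and adds it to its own copy.
    for t in range(n - 1):
        for i in range(n):
            idx = (i - t - 1) % n
            chunks[i][idx] = chunks[i][idx] + chunks[(i - 1) % n][idx]

    # Phase 2: all-gather. In step t, worker i receives the fully
    # reduced chunk (i - t) mod n from worker (i - 1) and keeps it.
    for t in range(n - 1):
        for i in range(n):
            idx = (i - t) % n
            chunks[i][idx] = chunks[(i - 1) % n][idx]

    return [np.concatenate(c) for c in chunks]


# Example: four workers, each holding a gradient of length 8.
grads = [np.full(8, w, dtype=float) for w in range(4)]
print(ring_allreduce(grads)[0])   # every worker ends with the sum 0+1+2+3 = 6
```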
-
ParaExpress: Networking-aware Parameter Aggregation
[Figure: ParaExpress monitors real-time networking conditions for workers 1-8, placed across Rack 1 and Rack 2 under their ToR switches, and generates an optimal parameter aggregation topology in which leaf workers feed intermediate aggregators up to a root aggregator.]

The generated parameter aggregation topology has the advantages of both the tree structure (Parameter Server) and the ring structure (Ring AllReduce).
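The slide does not describe the algorithm itself, so as a purely illustrative sketch of networking-aware aggregation, the toy code below aggregates gradients inside each rack first and then across racks; the two-level heuristic and all names are assumptions for illustration, not ParaExpress.

```python
import numpy as np

def rack_aware_aggregate(grads, racks):
    """Toy two-level aggregation: reduce inside each rack, then across racks.

    grads: dict worker_id -> gradient array
    racks: dict rack_name -> list of worker_ids in that rack
    Intra-rack traffic stays under the ToR switch; only one partial sum
    per rack crosses the (often over-subscribed) inter-rack links.
    """
    rack_partials = []
    for rack, workers in racks.items():
        partial = sum(grads[w] for w in workers)   # local aggregator per rack
        rack_partials.append(partial)
    return sum(rack_partials)                       # root aggregator


# Example with the 8 workers and 2 racks from the figure.
grads = {w: np.ones(4) * w for w in range(1, 9)}
racks = {"rack1": [1, 2, 3, 4], "rack2": [5, 6, 7, 8]}
print(rack_aware_aggregate(grads, racks))   # sum of 1..8 = 36 in every slot
```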
-
ParaExpress Architecture
[Architecture diagram]
• ParaExpress Master: embedding plan, prioritization.
• ParaExpress Agent: execution graph of per-tensor operation chains (R1 A1 S1, R2 A2 S2, ..., Rn An Sn).
• Components: Task Queue, Completion Queue, Execution Graph, Resolver, Operation Pool, Request Manager, Traffic Prioritization Module.
• Data path: MPI requests over the high-speed network interface; changes the DSCP-priority mapping.
-
Highlighted Results
• Compared with TensorFlow PS, Baidu Ring AllReduce, and Horovod, the software optimizations of ParaExpress achieve 1.5x to 4.3x better performance.
• In a real environment, ParaExpress achieves 2.6x better results than Parameter Server and 3x better results than Ring AllReduce.
-
About CLUSTAR
-
World-leading Research Achievements
9 papers published at top-tier networking conferences (SIGCOMM/NSDI) in the past 5 years; first in Asia.
• "AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization", ACM SIGCOMM 2018
• "PowerMan: An Out-of-Band Management Network for Datacenters using Power Line Communication", USENIX NSDI 2018
• "Resilient Datacenter Load Balancing in the Wild", ACM SIGCOMM 2017
• "Enabling Wide-spread Communications on Optical Fabric with MegaSwitch", USENIX NSDI 2017
• "Scheduling Mix-flows in Commodity Datacenters with Karuna", ACM SIGCOMM 2016
• "CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark", ACM SIGCOMM 2016
• "Enabling ECN in Multi-Service Multi-Queue Data Centers", USENIX NSDI 2016
• "Information-Agnostic Flow Scheduling for Commodity Data Centers", USENIX NSDI 2015
• "Explicit Path Control in Commodity Data Centers: Design and Applications", USENIX NSDI 2015
Statistics for universities in Greater China:
University Number of accepted papers
HKUST 9 (all from CLUSTAR's teams)
Tsinghua University 5 (from different professors and labs)
Chinese Academy of Sciences 3 (from different professors and labs)
Peking University 1
Fudan University 1
National Supercomputing Center in Wuxi 1
-
Selected Clients
GDR
• GDR is used to boost the performance of moments classification for WeChat and of computer vision for SAIC.
  Achievements: ~3x for WeChat, ~1.6x for SAIC.

ParaExpress
• ParaExpress is used to improve the performance of AI training in a sophisticated cloud environment. Progress: POC.

Other engagements
• GDR is used to boost the performance of Federated Learning, and MLT to boost long-distance communication. Progress: developing.
• High-speed networking virtualization for an AI unicorn (name withheld due to NDA). Progress: developing.

AI Consulting
• Smart customer support system; the CLUSTAR platform is used to speed up AI training. Progress: developing the next-generation AI platform together.
-
CLUSTAR Team
Kai CHEN, Founder
• PhD, Northwestern University
• Associate Professor, HKUST
• 50+ top-tier networking conference papers (SIGCOMM/NSDI)
• 10+ years of research experience on DCN
• Director of SING Lab, HKUST

Qiang YANG, Co-founder
• PhD, University of Maryland
• Chair Professor and Department Head of CSE, HKUST
• President of IJCAI
• Director of WHAT Lab, HKUST
• Pioneer of transfer learning
• IEEE/ACM/AAAI Fellow
• Founding Director of the Huawei Noah's Ark Research Lab

Shuihai HU, VP of Technology
• PhD, HKUST
• Expertise in RDMA

Pin LYU, Director of Algorithms
• 7 years of IBM software development

Junxue ZHANG, EVP
• PhD, HKUST
• Architect of the CLUSTAR platform

Yajing LYU, VP of Business
• MBA, ESSEC
• 6+ years of business experience

Junhuan SUN, VP of Engineering
• 10+ years of engineering experience

Weiyan WANG, AI Scientist
• PhD, HKUST
• AutoML systems
-
Milestones

[Timeline from 2017.08 to 2019.01; dates shown on the original slide: 2017.08, 2018.01, 2018.03, 2018.05, 2018.09, 2018.11, 2019.01]
• CLUSTAR is founded
• Angel funding
• Joined the Nvidia Inception Program
• Cooperation with Sunshine Insurance
• 2018.11: CLUSTAR v1.0 launched; cooperation with SAIC
• 2019.01: CLUSTAR v1.1 launched; cooperation with WeChat
-
THANK YOU!
[email protected]
https://www.clustarai.com