Tachyon Meetup: First-Ever Scalable, Distributed Deep Learning Architecture Using Tachyon &...
TRANSCRIPT
Distributed Deep Learning on Tachyon & Spark
Christopher Nguyen, PhD
Vu Pham
Michael Bui, PhD
The Journey
1. What We Do at Adatao
2. Challenges We Ran Into
3. How We Addressed Them
4. Lessons to Share

Along the Way, You'll Hear
1. How some interesting things came about
2. Where some interesting things are going
3. How some good engineering/architectural decisions are made
Acknowledgements/Discussions with
§Nam Ma, Adatao
§Haoyuan Li, Tachyon Nexus
§Shaoshan Liu, Baidu
§Reza Zadeh, Stanford/Databricks
Adatao 3 Pillars
App Development
BIG APPS
PREDICTIVE ANALYTICS
NATURAL INTERFACES
COLLABORATION
Big Data + Big Compute
Deep Learning Use Cases
§ IoT
§ Customer Segmentation
§ Fraud Detection
§ etc…

Challenge 1: Deep Learning Platform Options
Which approach?
It Depends!
MapReduce vs. Pregel at Google: An Analogy
"If you squint at a problem just a certain way, it becomes a MapReduce problem."
- Sanjay Ghemawat, Google
[Architecture diagram: APIs exposed to Data Engineers, Data Scientists, and Business Analysts, plus Custom Apps]
Moral: "And No Religion, Too."
1. Architectural choices are locally optimal.
2. What's best for someone else isn't necessarily best for you, and vice versa.
Challenge 2: Who-Does-What Architecture
DistBelief
"Large Scale Distributed Deep Networks," Jeff Dean et al., NIPS 2012
1. Compute Gradients
2. Update Params (Descent)
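This two-phase split maps naturally onto Spark. Below is a minimal sketch, assuming a dense model held as an Array[Double] and a toy linear-model squared-loss gradient; both are illustrative stand-ins, not Adatao's actual code.

```scala
import org.apache.spark.rdd.RDD

object SgdStep {
  // Toy stand-in: squared-loss gradient of a linear model over one partition.
  def localGradient(part: Iterator[(Array[Double], Double)],
                    w: Array[Double]): Array[Double] = {
    val g = new Array[Double](w.length)
    for ((x, y) <- part) {
      val err = x.zip(w).map { case (a, b) => a * b }.sum - y
      for (i <- g.indices) g(i) += err * x(i)
    }
    g
  }

  // One synchronous step of the DistBelief-style loop.
  def step(data: RDD[(Array[Double], Double)],
           w: Array[Double], lr: Double): Array[Double] = {
    val bcW = data.sparkContext.broadcast(w)  // ship current params to the workers
    val grad = data
      .mapPartitions(p => Iterator(localGradient(p, bcW.value))) // 1. Compute Gradients
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })   // sum the partial gradients
    w.zip(grad).map { case (wi, gi) => wi - lr * gi }            // 2. Update Params (Descent)
  }
}
```

In this shape, both phases live inside Spark; the rest of the talk is about where phase 2 should actually run.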
The View From Berkeley (Dave Patterson, UC Berkeley)
The 7 Dwarfs of Parallel Computing (Phillip Colella, LBNL)
Unleashing the Potential of Tachyon
Today: a memory-based filesystem, a datacenter-scale distributed filesystem.
Tomorrow: filesystem-backed shared memory, a datacenter-scale distributed memory.
Aha!
Spark & Tachyon Architectural Options
§ Spark-Only: model as a broadcast variable
§ Tachyon-Storage: model stored as a Tachyon file
§ Param-Server: model hosted in an HTTP server
§ Tachyon-CoProc: model stored and updated by Tachyon
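To make the options concrete, here is a hedged sketch of the Param-Server variant, where workers pull current weights from an HTTP server and push gradients back. The endpoints (/weights, /gradients) and the raw-doubles wire format are invented for illustration; they are not a documented API.

```scala
import java.io.{DataInputStream, DataOutputStream}
import java.net.{HttpURLConnection, URL}

object ParamServerClient {
  // Pull the current model (n doubles) from the hypothetical parameter server.
  def pull(base: String, n: Int): Array[Double] = {
    val conn = new URL(base + "/weights").openConnection().asInstanceOf[HttpURLConnection]
    val in = new DataInputStream(conn.getInputStream)
    try Array.fill(n)(in.readDouble()) finally in.close()
  }

  // Push one partition's gradient; the server is assumed to apply the descent update.
  def push(base: String, grad: Array[Double]): Unit = {
    val conn = new URL(base + "/gradients").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    val out = new DataOutputStream(conn.getOutputStream)
    try grad.foreach(out.writeDouble) finally out.close()
    conn.getResponseCode // force the request; a real client would check this
  }
}
```

The Spark-Only option replaces both calls with a broadcast variable; Tachyon-Storage replaces them with reads and writes of a Tachyon file.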
Tachyon CoProcessor Concept
Tachyon CoProcessor
Tachyon-Storage in Detail
A. Compute Gradients
B. Update Params (Descent)
Tachyon-CoProcessors in Detail
A. Compute Gradients
B. Update Params (Descent)
Tachyon-CoProcessors
§ Spark workers do Gradients
- Handle data-parallel partitions
- Only compute gradients, so they are freed up quickly
- New workers can continue gradient computation where previous workers left off (mini-batch behavior)
§ Tachyon does Descent
- Model host
- Parameter server
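In code, this division of labor might look like the sketch below, reusing localGradient from the earlier sketch. The CoProcClient trait is purely hypothetical; the slides describe the design (Tachyon hosts the model and applies Descent), not this interface.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical client for the Tachyon co-processor; the names are invented.
trait CoProcClient extends Serializable {
  def readModel(path: String): Array[Double]               // current params, read from Tachyon
  def submitGradient(path: String, g: Array[Double]): Unit // Descent applied inside Tachyon
}

def trainEpoch(data: RDD[(Array[Double], Double)],
               coproc: CoProcClient, modelPath: String): Unit =
  data.foreachPartition { part =>
    val w = coproc.readModel(modelPath)    // each worker reads the latest model...
    val g = SgdStep.localGradient(part, w) // ...computes its partition's gradient...
    coproc.submitGradient(modelPath, g)    // ...and hands Descent off to Tachyon
  }
```

Because every partition re-reads the model before computing, later partitions see earlier updates, which is the natural mini-batch behavior mentioned above.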
Demo
ImageNet model [Model Zoo, Network-in-Network]
[Stack diagram: Custom Apps, Adatao AppBuilder, and Adatao PredictiveEngine, serving Data Engineers, Data Scientists, and Business Analysts]
"Predictive Intelligence for All"
Result
Data Set  | Size        | Model Size
MNIST     | 50K x 784   | 1.8M
Higgs     | 10M x 28    | 8.4M
Molecular | 150K x 2871 | 14.3M
Constant-Load Scaling
[Chart: relative speed vs. number of Spark executors (8, 16, 24), comparing Spark-Only, Tachyon-Storage, Param-Server, and Tachyon-CoProc]
Training-Time Speed-Up
[Chart: speed-up (-20% to 100%) on MNIST, Molecular, and Higgs, comparing Spark-Only, Tachyon-Storage, Param-Server, and Tachyon-CoProc]
Training Convergence
[Chart: error rate vs. iterations (400 to 20,000), comparing Spark-Only, Tachyon-Storage, Param-Server, and Tachyon-CoProc]
Lessons Learned: Tachyon CoProcessors
§ Spark (Gradient) and Tachyon (Descent) can be scaled independently
§ The combination gives natural mini-batch behavior
§ Up to 60% speed gain; scales almost linearly and converges faster
Lessons Learned: Data Partitioning
§Tunable Number of Data Partitions
- Big partitions: slow convergence, shorter time per epoch
- Small partitions: faster convergence, longer time per epoch (for network communication)
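The partition count is the tuning knob here. A minimal illustration with standard Spark API (the path and partition counts are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-tuning"))
val train = sc.textFile("hdfs:///data/train.csv") // placeholder path

// Few big partitions: fewer gradient syncs per epoch -> shorter epochs, slower convergence.
val coarse = train.repartition(8)

// Many small partitions: more syncs and more network traffic -> longer epochs, faster convergence.
val fine = train.repartition(128)
```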
Lessons Learned: Memory Tuning
§ Typically, each machine needs:
- (model_size + batch_size * unit_count) * 3 * 4 * 1.5 * executors
§ batch_size matters
§ If RAM is limited, reduce the number of executors per machine
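One hedged reading of the rule of thumb: 4 is bytes per float, 3 covers the copies each executor keeps (e.g., weights, gradients, activations), and 1.5 is JVM overhead. That interpretation, and the example numbers below, are assumptions rather than anything stated on the slide.

```scala
// Rule-of-thumb estimate of bytes needed per machine.
def bytesNeeded(modelSize: Long, batchSize: Long, unitCount: Long,
                executorsPerMachine: Int): Long =
  ((modelSize + batchSize * unitCount) * 3 * 4 * 1.5 * executorsPerMachine).toLong

// Example with assumed numbers: the 8.4M-parameter Higgs model, batch size 1000,
// 300 hidden units, 4 executors per machine:
//   (8.4e6 + 1000 * 300) * 3 * 4 * 1.5 * 4 ≈ 626 MB per machine.
val est = bytesNeeded(8400000L, 1000L, 300L, 4)
```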
Lessons Learned: GPU vs. CPU
§ GPUs are ~10x faster locally, but only 2-4x faster on Spark
§ GPU memory is limited: AWS GPU instances commonly have 4-6 GB
§ Better to have multiple GPUs per worker
§ On the JVM, with multi-process access, GPUs can fail unpredictably
Summary
§ Tachyon is much more than a memory-based filesystem
- Tachyon can become filesystem-backed shared memory
§Combination of Spark & Tachyon CoProcessing yields superior Deep Learning performance in multiple dimensions
§Adatao is open-sourcing both:
- Tachyon CoProcessor design & code
- Spark & Tachyon-CoProcessor Deep Learning implementation
Appendices
Design Choices
§ How to combine the results of Spark workers:
- Parameter averaging
- Gradient averaging ✓
- Best model
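A minimal sketch of the checked option using Spark's standard treeAggregate, again with the toy linear-model gradient as a stand-in for the real network's gradient:

```scala
import org.apache.spark.rdd.RDD

// Gradient averaging: sum per-example gradients across partitions, divide by count.
def averagedGradient(data: RDD[(Array[Double], Double)],
                     w: Array[Double]): Array[Double] = {
  val (sum, count) = data.treeAggregate((new Array[Double](w.length), 0L))(
    seqOp = { case ((g, n), (x, y)) =>
      val err = x.zip(w).map { case (a, b) => a * b }.sum - y // toy linear model
      for (i <- g.indices) g(i) += err * x(i)
      (g, n + 1)
    },
    combOp = { case ((g1, n1), (g2, n2)) =>
      for (i <- g1.indices) g1(i) += g2(i)
      (g1, n1 + n2)
    })
  sum.map(_ / math.max(count, 1L))
}
```

Compared with parameter averaging, averaging gradients keeps a single authoritative model and only exchanges updates, which is what the co-processor design above relies on.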