
Page 1: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Scaling TensorFlow to 100s of GPUs with Spark and Hops Hadoop

Global AI Conference, Santa Clara, January 18th 2018

Hops

Jim Dowling

Associate Prof @ KTH

Senior Researcher @ RISE SICS

CEO @ Logical Clocks AB

Page 2: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AI Hierarchy of Needs

[Pyramid, top to bottom:]

• DDL (Distributed Deep Learning)
• Deep Learning, RL, Automated ML
• A/B Testing, Experimentation, ML
• B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
• Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion

[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]

Page 3: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AI Hierarchy of Needs

[Same pyramid as Page 2, annotated: the lower (data) layers serve Analytics; the upper (ML) layers serve Prediction.]

Page 4: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AI Hierarchy of Needs

[Same pyramid again, annotated: Hops supports the full stack.]

Page 5: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

More Data means Better Predictions

[Chart: prediction performance vs. amount of labelled data, 1980s to 2020s. Hand-crafted, traditional AI levels off; with enough labelled data, deep neural nets can outperform it.]

Page 6: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

What about More Compute?

“Methods that scale with computation are the future of AI”*

- Rich Sutton (A Founding Father of Reinforcement Learning)


* https://www.youtube.com/watch?v=EeMCEQa85tw

Page 7: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

More Compute should mean Faster Training

[Chart: training performance vs. available compute, 2015 to 2018. Single-host training flattens out; distributed training keeps scaling with more compute.]

Page 8: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Reduce DNN Training Time from 2 weeks to 1 hour


In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.

https://arxiv.org/abs/1706.02677
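The paper's recipe pairs Ring-AllReduce with a "linear scaling rule": when the global minibatch grows by a factor k, grow the learning rate by k as well, warming it up over the first few epochs. A minimal sketch of that rule (the function and argument names here are illustrative, not taken from the paper's code):

    def scaled_lr(base_lr, base_batch, global_batch, epoch, warmup_epochs=5):
        k = global_batch / base_batch
        target = base_lr * k
        if epoch < warmup_epochs:
            # linear warmup from base_lr up to the scaled rate
            return base_lr + (target - base_lr) * epoch / warmup_epochs
        return target

    # 256 GPUs x 32 images/GPU = batch 8192; base lr 0.1 at batch 256 -> 3.2
    print(scaled_lr(0.1, 256, 8192, epoch=10))  # 3.2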

Page 9: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

DNN Training Time and Researcher Productivity

• Distributed Deep Learning: interactive analysis! Instant gratification!
• Single-Host Deep Learning: "My Model's Training." (Suffer from Google-Envy while you wait.)

Page 10: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Distributed Training: Theory and Practice

Image from @hardmaru on Twitter.


Page 11: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Distributed Algorithms are not all Created Equal

[Chart: training performance vs. available compute. AllReduce keeps scaling; Parameter Servers fall behind.]

Page 12: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Ring-AllReduce vs Parameter Server(s)

[Diagram: Ring-AllReduce, with GPUs 0-3 in a ring, each simultaneously sending to one neighbour and receiving from the other, vs. a Parameter Server setup, with GPUs 1-4 all exchanging gradients with central Param Server(s).]

Network Bandwidth is the Bottleneck for Distributed Training
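A back-of-the-envelope sketch of why the ring wins (the ResNet-50-sized 100 MB gradient payload is an assumption, not a figure from the slides):

    # With N workers and M bytes of gradients per step:
    #  - a parameter server sends/receives ~2*N*M bytes per step, so its
    #    network link saturates as N grows;
    #  - in ring-allreduce every worker sends/receives 2*(N-1)/N*M bytes
    #    per step, roughly constant in N.
    def per_step_traffic(n_workers, grad_bytes):
        ps_server = 2 * n_workers * grad_bytes
        ring_per_link = 2 * (n_workers - 1) / n_workers * grad_bytes
        return ps_server, ring_per_link

    # ResNet-50: ~25M float32 parameters ~= 100 MB of gradients
    ps, ring = per_step_traffic(64, 100e6)
    print(f"PS server: {ps / 1e9:.1f} GB/step")        # 12.8 GB/step
    print(f"Ring per link: {ring / 1e6:.0f} MB/step")  # ~197 MB/step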

Page 13: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AllReduce outperforms Parameter Servers

[Benchmark chart*: 16 servers, each with 4 P100 GPUs (64 GPUs total), connected by a RoCE-capable 25 Gbit/s network, on synthetic data; throughput measured in images processed per second.]

*https://github.com/uber/horovod

For Bigger Models, Parameter Servers don’t scale

Page 14: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Multiple GPUs on a Single Server


Page 15: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

NVLink vs PCI-E Single Root Complex

On a single host, the bus can be the bottleneck for (distributed) training: NVLink offers 80 GB/s vs. 16 GB/s for PCI-E. [Images from: https://www.microway.com/product/octoputer-4u-10-gpu-server-single-root-complex/ ]
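The arithmetic behind that gap, as a sketch (the ~100 MB gradient size is an assumed ResNet-50-scale payload; the bandwidths are the slide's figures):

    GRAD_BYTES = 100e6   # ~25M float32 params ~= 100 MB of gradients
    NVLINK_BW = 80e9     # 80 GB/s
    PCIE_BW = 16e9       # 16 GB/s

    print(f"NVLink: {GRAD_BYTES / NVLINK_BW * 1e3:.2f} ms per transfer")  # 1.25 ms
    print(f"PCI-E:  {GRAD_BYTES / PCIE_BW * 1e3:.2f} ms per transfer")    # 6.25 ms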

Page 16: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Scale: Remove Bus and Net B/W Bottlenecks

A single slow worker, bus, or network link is enough to bottleneck DNN training time.

Ring-AllReduce

Page 17: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

The Cloud is full of Bottlenecks….

[Chart: training performance vs. available compute. On-premise Infiniband clusters keep scaling; the public cloud (10 GbE) tops out earlier.]

Page 18: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Deep Learning Hierarchy of Scale

Training Time for ImageNet, by level of scale (top of the pyramid first):

• DDL (AllReduce on GPU Servers): Minutes
• DDL with GPU Servers and Parameter Servers: Hours
• Parallel Experiments on GPU Servers: Days/Hours
• Many GPUs on a Single GPU Server: Days
• Single GPU: Weeks

Page 19: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Lots of good GPUs > A few great GPUs

100 x Nvidia 1080Ti (DeepLearning11) vs. 8 x Nvidia P100/V100 (DGX-1): both configurations cost about the same, ~$150K (2017).
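Per GPU, that budget splits very differently (just the slide's numbers divided out):

    budget = 150_000     # USD, 2017
    print(budget / 100)  # $1,500 per 1080Ti (DeepLearning11 option)
    print(budget / 8)    # $18,750 per P100/V100 (DGX-1 option)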

Page 20: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Consumer GPU Server $15K (10 x 1080Ti)

[Image; see https://www.oreilly.com/ideas/distributed-tensorflow ]

Page 21: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Cluster of Commodity GPU Servers

[Diagram: racks of commodity GPU servers connected by InfiniBand.] Max 1-2 GPU servers per rack (2-4 kW per server).

Page 22: Scaling TensorFlow with Hops, Global AI Conference Santa Clara


TensorFlow Spark Platforms

• TensorFlowOnSpark
• Deep Learning Pipelines
• Horovod
• Hops

Page 23: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops – Running Parallel Experiments

def model_fn(learning_rate, dropout):
    import tensorflow as tf
    from hops import tensorboard, hdfs, devices
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001], 'dropout': [0.5]}
tflauncher.launch(spark, model_fn, args_dict)

Launches TF jobs as mappers in Spark: the "pure" TensorFlow code in model_fn runs in the executors.

Page 24: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops – Parallel Experiments

def model_fn(learning_rate, dropout):
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
tflauncher.launch(spark, model_fn, args_dict)

Launches 6 executors, each with a different hyperparameter combination (3 learning rates x 2 dropout values); each executor can have 1-N GPUs.
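Conceptually, the launcher grid-expands args_dict into the cartesian product of its value lists, one run per combination (a sketch of the semantics, not the launcher's actual implementation):

    from itertools import product

    args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
    combos = [dict(zip(args_dict, vals)) for vals in product(*args_dict.values())]
    print(len(combos))  # 6 -> one Spark executor per combination
    # [{'learning_rate': 0.001, 'dropout': 0.5}, {'learning_rate': 0.001, 'dropout': 0.6}, ...]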

Page 25: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops AllReduce/Horovod/TensorFlow

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')

The model code stays "pure" TensorFlow; allreduce.launch runs the notebook on the Spark executors.

Page 26: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

TensorFlow and Hops Hadoop


Page 27: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Don’t do this: Different Clusters for Big Data and ML


Page 28: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops: Single ML and Big Data Cluster

[Diagram: a single shared cluster (data lake, GPUs, compute, plus Kafka and Elasticsearch) serving both Data Engineering and Data Science, organized into projects (Project1 ... ProjectN) and managed by IT.]

Page 29: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

HopsFS: Next Generation HDFS*

• 16x throughput (faster)
• 37x number of files (bigger)
• Scale Challenge Winner (2017)

*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

Page 30: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

HopsFS now stores Small Files in the DB

Size Matters: Improving the Performance of Small Files in HDFS. Salman Niazi, Seif Haridi, Jim Dowling. Poster, EuroSys 2017.

Page 31: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

GPUs supported as a Resource in Hops 2.8.2*

Hops is the only Hadoop distribution to support GPUs-as-a-Resource.

*Robin Andersson, GPU Integration for Deep Learning on YARN, MSc Thesis, 2017

Page 32: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

GPU Resource Requests in Hops

Example GPU requests to HopsYARN:

• 4 GPUs on any host
• 10 GPUs on 1 host
• 100 GPUs on 10 hosts with 'Infiniband'
• 20 GPUs on 2 hosts with 'Infiniband_P100'

A mix of commodity GPUs and more powerful GPUs is good for (1) parallel experiments and (2) distributed training. (A toy sketch of this constraint matching follows.)
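A toy model of such constraint-based requests (not the HopsYARN API; the data layout and matcher are invented purely for illustration):

    def satisfiable(hosts, gpus, n_hosts, label=None):
        # hosts: e.g. [{'gpus': 10, 'labels': {'Infiniband'}}, ...]
        eligible = [h for h in hosts if label is None or label in h['labels']]
        # greedily pick the hosts with the most free GPUs
        best = sorted((h['gpus'] for h in eligible), reverse=True)[:n_hosts]
        return len(best) == n_hosts and sum(best) >= gpus

    cluster = [{'gpus': 10, 'labels': {'Infiniband'}} for _ in range(12)]
    print(satisfiable(cluster, 100, 10, 'Infiniband'))     # True
    print(satisfiable(cluster, 20, 2, 'Infiniband_P100'))  # False: no such hosts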

Page 33: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hopsworks Data Platform

[Architecture diagram: Hopsworks exposes a REST API and tools spanning Develop, Train, Test, Serve (Jupyter, Zeppelin, Jobs, Kibana, Grafana) on top of Spark, Flink, and TensorFlow, which run on HopsFS/YARN; supporting services include MySQL Cluster (projects, datasets, users), Hive, InfluxDB, Elasticsearch, and Kafka.]

Page 34: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Python is a First-Class Citizen in Hopsworks

www.hops.io

Page 35: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Custom Python Environments with Conda

Python libraries installed with Conda are usable from Spark/TensorFlow.

Page 36: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

What is Hopsworks used for?


Page 37: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

ETL Workloads

[Diagram: Hopsworks-triggered Jobs run ETL on YARN, reading and writing Parquet and Hive data on HopsFS; public cloud or on-premise.]

Page 38: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Business Intelligence Workloads

[Diagram: Jupyter/Zeppelin notebooks or Jobs query Hive/Parquet data on HopsFS via YARN, with reports in Kibana; public cloud or on-premise.]

Page 39: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Streaming Analytics in Hopsworks

[Diagram: data sources (MySQL, ...) feed Kafka; streaming jobs on YARN write Parquet/Hive data to HopsFS for batch analytics, with dashboards in Grafana/InfluxDB and Elastic/Kibana; public cloud or on-premise.]

Page 40: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

TensorFlow in Hopsworks

[Diagram: a FeatureStore (Hive, Kafka) on HopsFS/YARN feeds Experiments monitored with TensorBoard; trained models go to TensorFlow Serving; public cloud or on-premise.]

Page 41: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

One Click Deployment of TensorFlow Models

Page 42: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops API

• Java/Scala library
  - Secure streaming analytics with Kafka/Spark/Flink
  - SSL/TLS certs, Avro schemas, endpoints for Kafka/Hopsworks/etc.
• Python library (sketched below)
  - Manage TensorBoard; load/save models in HopsFS
  - Distributed TensorFlow in Python
  - Parameter sweeps for parallel experiments
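A minimal sketch of the Python library in use inside a model function; the modules match those imported earlier in the deck (hops.tensorboard, hops.hdfs), but the exact call names are assumptions:

    def model_fn(learning_rate, dropout):
        import tensorflow as tf
        from hops import tensorboard, hdfs
        logdir = tensorboard.logdir()    # per-experiment TensorBoard log dir (assumed helper)
        project = hdfs.project_path()    # this project's root path in HopsFS (assumed helper)
        # ... build the model, write summaries to logdir,
        # and save checkpoints under the project path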

Page 43: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

TensorFlow-as-a-Service in RISE SICS ICE

• Hops: Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service (www.hops.site)
• RISE SICS ICE: 250 kW datacenter, ~400 servers; research and test environment (https://www.sics.se/projects/sics-ice-data-center-in-lulea)

Page 44: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Summary

• Distribution can make Deep Learning practitioners more productive.
  https://www.oreilly.com/ideas/distributed-tensorflow
• Hopsworks is a new data platform built on HopsFS, with first-class support for Python and Deep Learning / ML (TensorFlow / Spark).

Page 45: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

The Team

Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.

Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka, Tobias Johansson, Roberto Bampi.

www.hops.io • @hopshadoop

Page 46: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Thank You.

Follow us: @hopshadoop

Star us: http://github.com/hopshadoop/hopsworks

Join us: http://www.hops.io
