
Page 1: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Scaling TensorFlow to 100s of GPUs with Spark and Hops Hadoop

Global AI Conference, Santa Clara, January 18th 2018

Hops

Jim Dowling

Associate Prof @ KTH

Senior Researcher @ RISE SICS

CEO @ Logical Clocks AB

Page 2: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AI Hierarchy of Needs

[Pyramid, top to bottom:]

• DDL (Distributed Deep Learning)
• Deep Learning, RL, Automated ML
• A/B Testing, Experimentation, ML
• B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
• Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion

[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]

Page 3: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AI Hierarchy of Needs

[Same pyramid as Page 2, annotated: the lower (data) layers serve Analytics; the upper (ML) layers serve Prediction.]

Page 4: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AI Hierarchy of Needs

[Same pyramid again, annotated: Hops supports the full stack.]

Page 5: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

More Data means Better Predictions

[Chart: prediction performance vs. amount of labelled data, 1980s to 2020s. Hand-crafted, traditional AI levels off; with enough labelled data, deep neural nets can outperform it.]

Page 6: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

What about More Compute?

“Methods that scale with computation are the future of AI”*

- Rich Sutton (A Founding Father of Reinforcement Learning)


* https://www.youtube.com/watch?v=EeMCEQa85tw

Page 7: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

More Compute should mean Faster Training

[Chart: training performance vs. available compute, 2015 to 2018. Single-host training flattens out; distributed training keeps scaling with more compute.]

Page 8: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Reduce DNN Training Time from 2 weeks to 1 hour


In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.

https://arxiv.org/abs/1706.02677
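The paper's recipe pairs Ring-AllReduce with a "linear scaling rule": when the global minibatch grows by a factor k, grow the learning rate by k as well, warming it up over the first few epochs. A minimal sketch of that rule (the function and argument names here are illustrative, not taken from the paper's code):

    def scaled_lr(base_lr, base_batch, global_batch, epoch, warmup_epochs=5):
        k = global_batch / base_batch
        target = base_lr * k
        if epoch < warmup_epochs:
            # linear warmup from base_lr up to the scaled rate
            return base_lr + (target - base_lr) * epoch / warmup_epochs
        return target

    # 256 GPUs x 32 images/GPU = batch 8192; base lr 0.1 at batch 256 -> 3.2
    print(scaled_lr(0.1, 256, 8192, epoch=10))  # 3.2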

Page 9: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

DNN Training Time and Researcher Productivity

• Distributed Deep Learning: interactive analysis! Instant gratification!
• Single-Host Deep Learning: "My Model's Training." (Suffer from Google-Envy while you wait.)

Page 10: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Distributed Training: Theory and Practice

Image from @hardmaru on Twitter.


Page 11: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Distributed Algorithms are not all Created Equal

[Chart: training performance vs. available compute. AllReduce keeps scaling; Parameter Servers fall behind.]

Page 12: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Ring-AllReduce vs Parameter Server(s)

[Diagram: Ring-AllReduce, with GPUs 0-3 in a ring, each simultaneously sending to one neighbour and receiving from the other, vs. a Parameter Server setup, with GPUs 1-4 all exchanging gradients with central Param Server(s).]

Network Bandwidth is the Bottleneck for Distributed Training
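A back-of-the-envelope sketch of why the ring wins (the ResNet-50-sized 100 MB gradient payload is an assumption, not a figure from the slides):

    # With N workers and M bytes of gradients per step:
    #  - a parameter server sends/receives ~2*N*M bytes per step, so its
    #    network link saturates as N grows;
    #  - in ring-allreduce every worker sends/receives 2*(N-1)/N*M bytes
    #    per step, roughly constant in N.
    def per_step_traffic(n_workers, grad_bytes):
        ps_server = 2 * n_workers * grad_bytes
        ring_per_link = 2 * (n_workers - 1) / n_workers * grad_bytes
        return ps_server, ring_per_link

    # ResNet-50: ~25M float32 parameters ~= 100 MB of gradients
    ps, ring = per_step_traffic(64, 100e6)
    print(f"PS server: {ps / 1e9:.1f} GB/step")        # 12.8 GB/step
    print(f"Ring per link: {ring / 1e6:.0f} MB/step")  # ~197 MB/step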

Page 13: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

AllReduce outperforms Parameter Servers

[Benchmark chart*: 16 servers, each with 4 P100 GPUs (64 GPUs total), connected by a RoCE-capable 25 Gbit/s network, on synthetic data; throughput measured in images processed per second.]

*https://github.com/uber/horovod

For Bigger Models, Parameter Servers don’t scale

Page 14: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Multiple GPUs on a Single Server


Page 15: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

NVLink vs PCI-E Single Root Complex

On a single host, the bus can be the bottleneck for (distributed) training: NVLink offers 80 GB/s vs. 16 GB/s for PCI-E. [Images from: https://www.microway.com/product/octoputer-4u-10-gpu-server-single-root-complex/ ]
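The arithmetic behind that gap, as a sketch (the ~100 MB gradient size is an assumed ResNet-50-scale payload; the bandwidths are the slide's figures):

    GRAD_BYTES = 100e6   # ~25M float32 params ~= 100 MB of gradients
    NVLINK_BW = 80e9     # 80 GB/s
    PCIE_BW = 16e9       # 16 GB/s

    print(f"NVLink: {GRAD_BYTES / NVLINK_BW * 1e3:.2f} ms per transfer")  # 1.25 ms
    print(f"PCI-E:  {GRAD_BYTES / PCIE_BW * 1e3:.2f} ms per transfer")    # 6.25 ms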

Page 16: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Scale: Remove Bus and Net B/W Bottlenecks

A single slow worker, bus, or network link is enough to bottleneck DNN training time.

Ring-AllReduce

Page 17: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

The Cloud is full of Bottlenecks….

[Chart: training performance vs. available compute. On-premise Infiniband clusters keep scaling; the public cloud (10 GbE) tops out earlier.]

Page 18: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Deep Learning Hierarchy of Scale

Training Time for ImageNet, by level of scale (top of the pyramid first):

• DDL (AllReduce on GPU Servers): Minutes
• DDL with GPU Servers and Parameter Servers: Hours
• Parallel Experiments on GPU Servers: Days/Hours
• Many GPUs on a Single GPU Server: Days
• Single GPU: Weeks

Page 19: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Lots of good GPUs > A few great GPUs

100 x Nvidia 1080Ti (DeepLearning11) vs. 8 x Nvidia P100/V100 (DGX-1): both configurations cost about the same, ~$150K (2017).
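Per GPU, that budget splits very differently (just the slide's numbers divided out):

    budget = 150_000     # USD, 2017
    print(budget / 100)  # $1,500 per 1080Ti (DeepLearning11 option)
    print(budget / 8)    # $18,750 per P100/V100 (DGX-1 option)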

Page 20: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Consumer GPU Server $15K (10 x 1080Ti)

[Image; see https://www.oreilly.com/ideas/distributed-tensorflow ]

Page 21: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Cluster of Commodity GPU Servers

[Diagram: racks of commodity GPU servers connected by InfiniBand.] Max 1-2 GPU servers per rack (2-4 kW per server).

Page 22: Scaling TensorFlow with Hops, Global AI Conference Santa Clara


TensorFlow Spark Platforms

• TensorFlowOnSpark
• Deep Learning Pipelines
• Horovod
• Hops

Page 23: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops – Running Parallel Experiments

def model_fn(learning_rate, dropout):
    import tensorflow as tf
    from hops import tensorboard, hdfs, devices
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001], 'dropout': [0.5]}
tflauncher.launch(spark, model_fn, args_dict)

Launches TF jobs as mappers in Spark: the "pure" TensorFlow code in model_fn runs in the executors.

Page 24: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops – Parallel Experiments

def model_fn(learning_rate, dropout):
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
tflauncher.launch(spark, model_fn, args_dict)

Launches 6 executors, each with a different hyperparameter combination (3 learning rates x 2 dropout values); each executor can have 1-N GPUs.
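Conceptually, the launcher grid-expands args_dict into the cartesian product of its value lists, one run per combination (a sketch of the semantics, not the launcher's actual implementation):

    from itertools import product

    args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
    combos = [dict(zip(args_dict, vals)) for vals in product(*args_dict.values())]
    print(len(combos))  # 6 -> one Spark executor per combination
    # [{'learning_rate': 0.001, 'dropout': 0.5}, {'learning_rate': 0.001, 'dropout': 0.6}, ...]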

Page 25: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops AllReduce/Horovod/TensorFlow

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')

The model code stays "pure" TensorFlow; allreduce.launch runs the notebook on the Spark executors.

Page 26: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

TensorFlow and Hops Hadoop


Page 27: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Don’t do this: Different Clusters for Big Data and ML


Page 28: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops: Single ML and Big Data Cluster

[Diagram: a single shared cluster (data lake, GPUs, compute, plus Kafka and Elasticsearch) serving both Data Engineering and Data Science, organized into projects (Project1 ... ProjectN) and managed by IT.]

Page 29: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

HopsFS: Next Generation HDFS*

• 16x throughput (faster)
• 37x number of files (bigger)
• Scale Challenge Winner (2017)

*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

Page 30: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

HopsFS now stores Small Files in the DB

Size Matters: Improving the Performance of Small Files in HDFS. Salman Niazi, Seif Haridi, Jim Dowling. Poster, EuroSys 2017.

Page 31: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

GPUs supported as a Resource in Hops 2.8.2*

Hops is the only Hadoop distribution to support GPUs-as-a-Resource.

*Robin Andersson, GPU Integration for Deep Learning on YARN, MSc Thesis, 2017

Page 32: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

GPU Resource Requests in Hops

Example GPU requests to HopsYARN:

• 4 GPUs on any host
• 10 GPUs on 1 host
• 100 GPUs on 10 hosts with 'Infiniband'
• 20 GPUs on 2 hosts with 'Infiniband_P100'

A mix of commodity GPUs and more powerful GPUs is good for (1) parallel experiments and (2) distributed training. (A toy sketch of this constraint matching follows.)
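A toy model of such constraint-based requests (not the HopsYARN API; the data layout and matcher are invented purely for illustration):

    def satisfiable(hosts, gpus, n_hosts, label=None):
        # hosts: e.g. [{'gpus': 10, 'labels': {'Infiniband'}}, ...]
        eligible = [h for h in hosts if label is None or label in h['labels']]
        # greedily pick the hosts with the most free GPUs
        best = sorted((h['gpus'] for h in eligible), reverse=True)[:n_hosts]
        return len(best) == n_hosts and sum(best) >= gpus

    cluster = [{'gpus': 10, 'labels': {'Infiniband'}} for _ in range(12)]
    print(satisfiable(cluster, 100, 10, 'Infiniband'))     # True
    print(satisfiable(cluster, 20, 2, 'Infiniband_P100'))  # False: no such hosts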

Page 33: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hopsworks Data Platform

[Architecture diagram: Hopsworks exposes a REST API and tools spanning Develop, Train, Test, Serve (Jupyter, Zeppelin, Jobs, Kibana, Grafana) on top of Spark, Flink, and TensorFlow, which run on HopsFS/YARN; supporting services include MySQL Cluster (projects, datasets, users), Hive, InfluxDB, Elasticsearch, and Kafka.]

Page 34: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Python is a First-Class Citizen in Hopsworks

www.hops.io

Page 35: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Custom Python Environments with Conda

Python libraries installed with Conda are usable from Spark/TensorFlow.

Page 36: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

What is Hopsworks used for?


Page 37: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

ETL Workloads

[Diagram: Hopsworks-triggered Jobs run ETL on YARN, reading and writing Parquet and Hive data on HopsFS; public cloud or on-premise.]

Page 38: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Business Intelligence Workloads

[Diagram: Jupyter/Zeppelin notebooks or Jobs query Hive/Parquet data on HopsFS via YARN, with reports in Kibana; public cloud or on-premise.]

Page 39: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Streaming Analytics in Hopsworks

[Diagram: data sources (MySQL, ...) feed Kafka; streaming jobs on YARN write Parquet/Hive data to HopsFS for batch analytics, with dashboards in Grafana/InfluxDB and Elastic/Kibana; public cloud or on-premise.]

Page 40: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

TensorFlow in Hopsworks

[Diagram: a FeatureStore (Hive, Kafka) on HopsFS/YARN feeds Experiments monitored with TensorBoard; trained models go to TensorFlow Serving; public cloud or on-premise.]

Page 41: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

One Click Deployment of TensorFlow Models

Page 42: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Hops API

• Java/Scala library
  - Secure streaming analytics with Kafka/Spark/Flink
  - SSL/TLS certs, Avro schemas, endpoints for Kafka/Hopsworks/etc.
• Python library (sketched below)
  - Manage TensorBoard; load/save models in HopsFS
  - Distributed TensorFlow in Python
  - Parameter sweeps for parallel experiments
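A minimal sketch of the Python library in use inside a model function; the modules match those imported earlier in the deck (hops.tensorboard, hops.hdfs), but the exact call names are assumptions:

    def model_fn(learning_rate, dropout):
        import tensorflow as tf
        from hops import tensorboard, hdfs
        logdir = tensorboard.logdir()    # per-experiment TensorBoard log dir (assumed helper)
        project = hdfs.project_path()    # this project's root path in HopsFS (assumed helper)
        # ... build the model, write summaries to logdir,
        # and save checkpoints under the project path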

Page 43: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

TensorFlow-as-a-Service in RISE SICS ICE

• Hops: Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service (www.hops.site)
• RISE SICS ICE: 250 kW datacenter, ~400 servers; research and test environment (https://www.sics.se/projects/sics-ice-data-center-in-lulea)

Page 44: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Summary

• Distribution can make Deep Learning practitioners more productive.
  https://www.oreilly.com/ideas/distributed-tensorflow
• Hopsworks is a new data platform built on HopsFS, with first-class support for Python and Deep Learning / ML (TensorFlow / Spark).

Page 45: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

The Team

Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.

Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka, Tobias Johansson, Roberto Bampi.

www.hops.io • @hopshadoop

Page 46: Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Thank You.

Follow us: @hopshadoop

Star us: http://github.com/hopshadoop/hopsworks

Join us: http://www.hops.io
