Deep Learning Models at Scale with Apache SparkTM
Joseph K. Bradley (speaker), Xiangrui Meng
May 31, 2019, ASA SDSS
About me
Joseph Bradley
• Software Engineer at Databricks
• Apache Spark committer & PMC member
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Try for free today: databricks.com
Apache Spark + AI
Apache Spark: The First Unified Analytics Engine
Runtime: Delta, Spark Core Engine
Big Data Processing: ETL + SQL + Streaming
Machine Learning: MLlib + SparkR
AI is re-shaping the world
Disruptive innovations are affecting enterprises across the planet:
Internet of Things · Digital Personalization · Healthcare and Genomics · Fraud Prevention · and many more...
Better AI needs more data
When AI goes distributed ...
Larger datasets → more need for distributed training → more open-source offerings: distributed TensorFlow, Horovod, distributed MXNet
This is where Spark and AI meet.
Challenges
Two user stories
As a data scientist,
● I can build a pipeline that fetches training data from a production data warehouse and fits a DL model at scale.
● I can apply my DL model to a distributed data stream and augment the data stream with predicted labels.
Distributed training
data warehouse → load → fit → model
Required: Read from Databricks Delta, Parquet, MySQL, Hive, etc.
Answer: Apache Spark
Required: distributed GPU cluster for fast training
Answer: Horovod, Distributed TensorFlow, etc.
Two separate data and AI clusters?
load using a Spark cluster → save data → fit on a GPU cluster → model
required: glue code
Streaming model inference
Kafka → load → predict → model
required:
● save to stream sink
● GPU for fast inference
A hybrid Spark and AI cluster?
Training: load using a Spark cluster w/ GPUs → fit a model distributedly on the same cluster → model
Inference: load using a Spark cluster w/ GPUs → predict w/ GPUs as a Spark task → model
Unfortunately, it doesn’t work out of the box.
Project Hydrogen
Big data for AI
There are many efforts from the Spark community to integrate Spark with AI/ML frameworks:
● (Yahoo) CaffeOnSpark, TensorFlowOnSpark
● (Intel) BigDL
● (John Snow Labs) Spark-NLP
● (Databricks) TensorFrames, Deep Learning Pipelines, spark-sklearn
● … 80+ ML/AI packages on spark-packages.org
Project Hydrogen to fill the major gaps
Barrier Execution Mode
Accelerator Aware Scheduling
Optimized Data Exchange
In this talk
Within Apache Spark:
● Current status of Project Hydrogen features
● Development of new features

From Databricks:
● Our use of Project Hydrogen features
● Lessons learned and best practices
Story #1: Distributed training
load using a Spark cluster w/ GPUs → fit a model on the same cluster → model
Project Hydrogen
Barrier Execution Mode
Accelerator Aware Scheduling
Optimized Data Exchange
Different execution models
Spark (MapReduce): tasks are independent of each other; embarrassingly parallel & massively scalable.
Distributed training: complete coordination among tasks; optimized for communication.
Barrier execution mode
Gang scheduling on top of the MapReduce execution model → a distributed DL job can run as a Spark job.
● Coordination: start all tasks together
● Context: provides info to coordinate across tasks
● Failure handling: cancel and restart all tasks in case of failure
JIRA: SPARK-24374 (Spark 2.4)
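The coordination pattern can be sketched with Python's standard library: `threading.Barrier` stands in for the cross-task synchronization that barrier execution mode provides cluster-wide. This is a conceptual analogy only (in Spark the tasks are separate processes on separate machines, and `run_gang` is a hypothetical helper, not a Spark API):

```python
import threading

def run_gang(num_tasks, work):
    """Start all tasks together and fail the whole gang if any task fails,
    mimicking Spark's barrier execution mode within one process."""
    barrier = threading.Barrier(num_tasks)
    results = [None] * num_tasks
    errors = []

    def task(i):
        try:
            barrier.wait()       # coordination: no task proceeds until all have started
            results[i] = work(i)
        except Exception as e:   # failure handling: record, then the caller restarts all
            errors.append(e)
            barrier.abort()

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise RuntimeError("gang failed; cancel and restart all tasks")
    return results
```

For example, `run_gang(4, lambda i: i * i)` runs four "tasks" in lockstep; if any one raises, the whole gang is reported as failed rather than leaving a partial result, which is exactly the semantics distributed training needs.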
Barrier mode integration
Horovod (an LF AI hosted project)
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet.
● Little modification to single-node code
● High-performance I/O via MPI and NCCL
Developed at Uber, now an LF AI hosted project at the Linux Foundation.
Hydrogen integration with Horovod
HorovodRunner: in Databricks Runtime 5.0 ML
● Runs Horovod under barrier execution mode.
● Hides cluster setup, scripts, and the MPI command line from users.

def train_hvd():
    hvd.init()
    ...  # train using Horovod

HorovodRunner(np=2).run(train_hvd)
Implementation of HorovodRunner
● Pickle and broadcast the train() function.
● Launch a Spark job in barrier execution mode.
● In the first executor, use worker addresses to launch the Horovod MPI job.
● Terminate Horovod if the Spark job gets cancelled.
Collaboration on Horovod + Spark
Engineers at Uber and Databricks are improving this integration:
● Merge design and code development into horovod.spark.
● HorovodRunner uses the horovod.spark implementation with extra Databricks-specific features.
● Support barrier execution mode and GPU-aware scheduling.
Stay tuned for future announcements!
Project Hydrogen
Barrier Execution Mode
Optimized Data Exchange
Accelerator Aware Scheduling
Accelerator-aware scheduling
Accelerators (GPUs, FPGAs) are widely used for accelerating specialized workloads like deep learning and signal processing.
Spark is currently unaware of GPUs & other accelerators.
JIRA: SPARK-24615 (ETA: Spark 3.0)
Why does Spark need GPU awareness?
Consider a simple case where one task needs one GPU:
● Executor 0: Task 0 on GPU:0, Task 1 on GPU:1
● Executor 1: Task 2 on GPU:0, Task 3 on GPU:1
● Task 4: ?
Workarounds (a.k.a. hacks)
Limit Spark task slots per node to 1.
○ The running task can safely claim all GPUs on the node.
○ It might lead to resource waste if the workload doesn't need all GPUs.
○ The user also needs to write multithreading code to maximize data I/O.
Let running tasks themselves collaboratively decide which GPUs to use, e.g., via shared locks.
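One way the shared-lock workaround can be sketched (a hypothetical helper, not part of Spark): each task tries to atomically create a lock file per GPU index on the node's local disk and uses the first one it wins.

```python
import os
import tempfile

def claim_gpu(num_gpus, lock_dir):
    """Try to claim a free GPU by atomically creating a lock file.
    Returns the claimed GPU index, or None if all GPUs are taken."""
    for gpu in range(num_gpus):
        path = os.path.join(lock_dir, "gpu{}.lock".format(gpu))
        try:
            # O_CREAT | O_EXCL makes creation atomic: only one task wins each file
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return gpu
        except FileExistsError:
            continue  # another task already holds this GPU
    return None

def release_gpu(gpu, lock_dir):
    """Free the GPU so another task can claim it."""
    os.remove(os.path.join(lock_dir, "gpu{}.lock".format(gpu)))
```

The fragility is visible in the sketch: a task that crashes without releasing its lock leaks the GPU, which is one reason scheduler-level awareness (next slides) is the better fix.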
User-facing API
Users can retrieve assigned GPUs from the task context (#24374):

context = TaskContext.get()
assigned_gpu = context.getResources()["gpu"][0]

with tf.device(assigned_gpu):
    # training code ...
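For reference, the cluster-side configuration for GPU-aware scheduling (SPARK-24615) takes roughly this shape; property names are as they shipped in Spark 3.0, and the discovery-script path is an example:

```properties
spark.executor.resource.gpu.amount           2
spark.executor.resource.gpu.discoveryScript  /path/to/getGpusResources.sh
spark.task.resource.gpu.amount               1
```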
Cluster manager support
YARN
SPARK-27361
Kubernetes
SPARK-27362
Mesos
SPARK-27363
Standalone
SPARK-27361
Future features in discussion
● FPGAs and other accelerators
● Resource requests at the task level
● Fine-grained scheduling within one GPU
● Affinity and anti-affinity
● ...
Story #2: Streaming model inference
load using a Spark cluster w/ GPUs → predict w/ GPUs as a Spark task → model
Project Hydrogen
Barrier Execution Mode
Optimized Data Exchange
Accelerator Aware Scheduling
Optimized data exchange
None of the integrations are possible without exchanging data between Spark and AI frameworks. And performance matters.
JIRA: SPARK-24579
Pandas User-Defined Function (UDF)
Pandas UDF was introduced in Spark 2.3:
• Pandas for vectorized computation
• Apache Arrow for data exchange
Pandas UDF for distributed inference
Pandas UDF makes it simple to apply a model to a data stream.
@pandas_udf(...)
def predict(features):
    ...

spark.readStream(...) \
    .withColumn('prediction', predict(col('features')))
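A self-contained sketch of the same pattern, runnable without a cluster; the "model" here is a stand-in linear scorer, not a real DL model, and the commented lines show how it would be wired into Spark as on the slide:

```python
import pandas as pd

# Stand-in "model": any function that scores a whole pandas Series at once.
def score(features: pd.Series) -> pd.Series:
    # vectorized: one call per Arrow record batch, not one call per row
    return 2.0 * features + 1.0

# On a Spark cluster this would be wrapped as a Pandas UDF:
#   from pyspark.sql.functions import pandas_udf, col
#   predict = pandas_udf(score, returnType="double")
#   df.withColumn("prediction", predict(col("features")))
```

The key point is that `score` sees a batch as a pandas Series, so the per-row Python overhead of an ordinary UDF disappears.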
Support for complex return types
We improved the scalar Pandas UDF to return StructTypes, e.g., predicted labels and raw scores together.
JIRA: SPARK-23836 (Spark 3.0)
@pandas_udf(...)
def predict(features):
    # ...
    return pd.DataFrame({'labels': labels, 'scores': scores})
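Stripped of Spark, the returned value is just a pandas DataFrame whose columns match the declared StructType fields; the thresholding "model" below is a hypothetical stand-in:

```python
import pandas as pd

def predict(features: pd.Series) -> pd.DataFrame:
    scores = features.astype(float)       # hypothetical raw scores
    labels = (scores > 0.5).astype(int)   # hypothetical thresholded labels
    # Column names must match the StructType declared in @pandas_udf(...)
    return pd.DataFrame({"labels": labels, "scores": scores})
```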
Data pipelining

Without pipelining:
     CPU             GPU
t1   fetch batch #1
t2                   process batch #1
t3   fetch batch #2
t4                   process batch #2
t5   fetch batch #3
t6                   process batch #3

With pipelining:
     CPU             GPU
t1   fetch batch #1
t2   fetch batch #2  process batch #1
t3   fetch batch #3  process batch #2
t4                   process batch #3
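The pipelined schedule above can be sketched with a background thread that fetches the next batch while the consumer processes the current one. This is a stdlib sketch of the idea; the real implementation (SPARK-27569, next slide) prefetches Arrow record batches inside the Python worker:

```python
import threading
import queue

def prefetch(batches, buffer_size=1):
    """Yield items from `batches`, fetching ahead in a background thread so
    the fetch (I/O) of batch N+1 overlaps the processing of batch N."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for b in batches:
            q.put(b)      # blocks when the buffer is full (bounded prefetch)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        b = q.get()
        if b is done:
            return
        yield b
```

`buffer_size=1` matches the one-batch-ahead schedule in the table: at most one fetched-but-unprocessed batch exists at a time, so memory stays bounded while I/O and compute overlap.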
Pandas UDF prefetch
Goal: improve throughput.
Prefetch the next Arrow record batch while executing the Pandas UDF on the current batch.
● Up to 2x for I/O- and compute-balanced workloads
● Observed 1.5x in real workloads
Enabled by default in Databricks Runtime 5.2.
JIRA: SPARK-27569 (ETA: Spark 3.0)
Per-batch initialization overhead
Loading a model per batch introduces overhead.
New Pandas UDF interface: load the model once, then iterate over batches.
JIRA: SPARK-26412 (WIP)

@pandas_udf(...)
def predict(batches):
    model = ...  # load model once
    for batch in batches:
        yield model.predict(batch)
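The gain comes purely from the iterator contract. A Spark-free sketch of the same shape, where `load_model` and the per-element model are hypothetical stand-ins:

```python
def predict_batches(batches, load_model):
    """Shape of the iterator interface: initialize once, then stream batches."""
    model = load_model()   # runs once per task, not once per batch
    for batch in batches:
        yield [model(x) for x in batch]
```

Because the body is a generator over all batches, any expensive setup above the loop (model weights, sessions, GPU context) is amortized across every batch the task receives.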
Standardize on the Arrow format
Many accelerated computing libraries now support Arrow.
Proposal: Expose the Arrow format in a public interface.
● Simplify data exchange.
● Reduce data copy/conversion overhead.
● Allow pluggable vectorization code.
JIRA: SPARK-27396 (pending vote)
Project Hydrogen
Barrier Execution Mode
Accelerator Aware Scheduling
Optimized Data Exchange
Getting started
Deep Learning on Apache Spark
• Doc sources: docs.databricks.com, github.com/horovod/horovod
• Topics: HorovodRunner, horovod.spark, Pandas UDFs

Project Hydrogen
• Apache Spark JIRA & dev mailing list
• spark.apache.org
Acknowledgements
● Many ideas in Project Hydrogen are based on previous community work: TensorFrames, BigDL, Apache Arrow, Pandas UDF, Spark GPU support, MPI, etc.
● We would like to thank many Spark committers and contributors who helped the project proposal, design, and implementation.
● Xingbo Jiang, Thomas Graves, Andy Feng, Alex Sergeev, Shane Knapp, Xiao Li, Li Jin, Bryan Cutler, Takuya Ueshin, Wenchen Fan, Jason Lowe, Hyukjin Kwon, Madhukar Korupolu, Robert Evans, Yinan Li, Felix Cheung, Imran Rashid, Saisai Shao, Mark Hamstra, Sean Owen, Yu Jiang, … and many more!
Thank You! Questions?