TensorFlowOnSpark - Matroid
TRANSCRIPT
![Page 1: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/1.jpg)
TensorFlowOnSpark
Andy Feng, Yahoo
![Page 2: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/2.jpg)
* Yoshua Bengio @ ICDM 2016
![Page 3: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/3.jpg)
What is TensorFlowOnSpark?
![Page 4: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/4.jpg)
What’s TensorFlowOnSpark?
▪ Scale up TensorFlow apps with minimal changes
▪ Support all TensorFlow functionalities
  • Model/data parallelism, synchronous/asynchronous training, TensorBoard
▪ Integrate with existing data & pipelines
  • e.g. HDFS, SQL, MLlib
▪ Deployable on cloud or on-premise
![Page 5: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/5.jpg)
Why TensorFlowOnSpark at Yahoo?
▪ Major player in the open-source ecosystem
  • Birthplace of Apache Hadoop
  • Adopter/contributor of Spark since 2013
▪ Large clusters in house
  • Tens of clusters
  • Thousands of nodes per cluster
▪ Massive amount of data
  • Petabytes of data
![Page 6: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/6.jpg)
Why TensorFlowOnSpark?
![Page 7: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/7.jpg)
TensorFlowOnSpark
![Page 8: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/8.jpg)
Open Source: github.com/yahoo/TensorFlowOnSpark
![Page 9: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/9.jpg)
TensorFlowOnSpark
▪ Launches TF clusters using Spark executors
▪ Supports TF data ingestion modes
  • Spark: RDD.mapPartitions()
  • TensorFlow: directly access HDFS
▪ Supports TensorBoard during/after training
▪ Generally agnostic to Spark/TF versions
![Page 10: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/10.jpg)
TFoS Basics
1. Launch TensorFlow cluster
2. Feed data to TensorFlow app
3. Shut down TensorFlow cluster
![Page 11: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/11.jpg)
TFoS Python API
cluster = TFCluster.run(sc, map_fn, args, num_executors, num_ps, tensorboard, input_mode)
cluster.train(dataRDD, num_epochs=0)
cluster.inference(dataRDD)
cluster.shutdown()
![Page 12: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/12.jpg)
TFoS: Minimum Code Changes
# diff -w eval_image_classifier.py
20a21,27
> from pyspark.context import SparkContext
> from pyspark.conf import SparkConf
> from com.yahoo.ml.tf import TFCluster, TFNode
> import sys
>
> def main_fun(argv, ctx):
27a35,36
> sys.argv = argv
84,85d92
< def main(_):
88a96,97
> cluster_spec, server = TFNode.start_cluster_server(ctx)
191c200,204
< tf.app.run()
---
> sc = SparkContext(conf=SparkConf().setAppName("eval_image_classifier"))
> num_executors = int(sc._conf.get("spark.executor.instances"))
> cluster = TFCluster.run(sc, main_fun, sys.argv, num_executors, 0, False, TFCluster.InputMode.TENSORFLOW)
> cluster.shutdown()
![Page 13: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/13.jpg)
TFoS Input Modes
▪ InputMode.SPARK
  • feed_dict
  • Small-medium scale data
  • Fed via RDD.mapPartitions()
▪ InputMode.TENSORFLOW
  • Reader + QueueRunner
  • Large scale data
  • Reads directly from HDFS
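The InputMode.SPARK path can be pictured as a producer/consumer hand-off: a Spark task pushes the records of each partition into a queue, and the TF worker pulls batches out of it to build a feed_dict. The following is a minimal sketch of that idea using only the Python standard library. It does not use the TFoS API; `feed_partition`, `next_batch`, and `run_demo` are hypothetical names chosen for illustration, and plain ranges stand in for RDD partitions.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the data feed (illustrative)

def feed_partition(partition, q):
    """Spark side (hypothetical): push every record of one partition into the queue."""
    for record in partition:
        q.put(record)

def next_batch(q, batch_size):
    """TF side (hypothetical): pull up to batch_size records for one feed_dict."""
    batch = []
    while len(batch) < batch_size:
        item = q.get()
        if item is _END:
            q.put(_END)  # leave the sentinel in place for later calls
            break
        batch.append(item)
    return batch

def run_demo():
    # Bounded queue: the feeder blocks when the consumer falls behind.
    q = queue.Queue(maxsize=8)
    partitions = [range(0, 5), range(5, 10)]  # stand-ins for RDD partitions

    def feeder():
        for p in partitions:
            feed_partition(p, q)
        q.put(_END)

    t = threading.Thread(target=feeder)
    t.start()
    batches = []
    while True:
        batch = next_batch(q, batch_size=4)
        if not batch:
            break
        batches.append(batch)  # a real worker would run a training step here
    t.join()
    return batches
```

Because the queue is bounded, a slow consumer applies back-pressure to the feeder, which is one reason this style suits small-to-medium data rather than very large inputs.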
![Page 14: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/14.jpg)
TFoS Architecture
![Page 15: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/15.jpg)
TFoS: InputMode.SPARK
[Diagram: Spark executors running Python workers that host ps:0, worker:0, and worker:1; RDD data reaches each TF worker through a queue.]
![Page 16: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/16.jpg)
TFoS: InputMode.TENSORFLOW
[Diagram: Spark executors running Python workers that host ps:0, worker:0, and worker:1; the TF workers read from HDFS.]
![Page 17: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/17.jpg)
TFoS: Failure Recovery
▪ TF checkpoints written to HDFS
▪ InputMode.SPARK
  • TF worker runs in background
  • RDD data feeding tasks can be retried
  • However, TF worker failures will be “hidden” from Spark
▪ InputMode.TENSORFLOW
  • TF worker runs in foreground
  • TF worker failures will be retried as Spark tasks
  • TF worker restores from checkpoint
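The retry-and-restore behaviour can be illustrated with a small simulation. This is a sketch under stated assumptions, not TFoS or TensorFlow code: a JSON file in a temporary directory stands in for a TF checkpoint on HDFS, a counter stands in for training, and every function name here is hypothetical.

```python
import json
import os
import tempfile

def load_step(ckpt_path):
    """Return the last saved step, or 0 when no checkpoint exists yet."""
    if not os.path.exists(ckpt_path):
        return 0
    with open(ckpt_path) as f:
        return json.load(f)["step"]

def save_step(ckpt_path, step):
    """Write the checkpoint via a temp file so a crash never leaves a partial file."""
    tmp = ckpt_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, ckpt_path)

def train(ckpt_path, total_steps, ckpt_every, fail_at=None):
    """Run from the last checkpoint to total_steps; optionally crash at fail_at."""
    step = load_step(ckpt_path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated worker failure at step %d" % step)
        step += 1  # stand-in for one training step
        if step % ckpt_every == 0:
            save_step(ckpt_path, step)
    return step

def run_with_retry(ckpt_path, total_steps, ckpt_every, fail_at):
    """Mimic a task retry: the first attempt crashes, the second resumes."""
    resumed_from = None
    try:
        train(ckpt_path, total_steps, ckpt_every, fail_at=fail_at)
    except RuntimeError:
        resumed_from = load_step(ckpt_path)
        train(ckpt_path, total_steps, ckpt_every)
    return resumed_from, load_step(ckpt_path)

def demo():
    """Crash at step 7 with checkpoints every 3 steps, then retry."""
    with tempfile.TemporaryDirectory() as d:
        return run_with_retry(os.path.join(d, "ckpt.json"), 12, 3, 7)
```

In `demo()`, the work done in step 7 is lost because the last checkpoint was written at step 6, so the retried attempt resumes from there. The checkpoint interval therefore bounds how much work a failure can cost.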
![Page 18: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/18.jpg)
TFoS: Failure Recovery
▪ Executor failures are problematic
  • TF cluster_spec is statically defined
  • YARN doesn’t re-allocate on same node
  • Port may no longer be available
▪ Need dynamic cluster membership
  • Explore options w/ TensorFlow team
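To see why a static cluster_spec makes executor failures awkward, consider the job-name-to-address mapping that is fixed when the cluster starts. The sketch below builds such a mapping from node reservations and then checks it against the nodes that are alive after a failure. The helper names, host names, and ports are hypothetical, and this is not how TFoS itself assembles or validates the spec.

```python
def build_cluster_spec(reservations):
    """Group (job_name, task_index, host, port) tuples into a cluster_spec dict."""
    spec = {}
    for job, index, host, port in sorted(reservations, key=lambda r: (r[0], r[1])):
        spec.setdefault(job, []).append("%s:%d" % (host, port))
    return spec

def stale_entries(spec, live_reservations):
    """List addresses in the frozen spec that no longer match a live node."""
    live = build_cluster_spec(live_reservations)
    return [addr
            for job, addrs in spec.items()
            for addr in addrs
            if addr not in live.get(job, [])]

# Spec frozen at startup (illustrative hosts and ports).
initial = [("worker", 1, "node3", 2222),
           ("ps", 0, "node1", 2222),
           ("worker", 0, "node2", 2222)]
spec = build_cluster_spec(initial)

# After an executor failure, worker:1 comes back on another node and port.
after_failure = [("ps", 0, "node1", 2222),
                 ("worker", 0, "node2", 2222),
                 ("worker", 1, "node4", 2301)]
stale = stale_entries(spec, after_failure)
```

Here `stale` contains the old address of worker:1. The surviving nodes still hold the frozen spec, so they would keep trying that address, which is the gap dynamic cluster membership would close.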
![Page 19: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/19.jpg)
TensorBoard on TFoS
![Page 20: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/20.jpg)
TFoS Scaling
Near linear scalability
![Page 21: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/21.jpg)
RDMA Speedup over gRPC
1.4X speedup
![Page 22: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/22.jpg)
Related Work
| | SparkNet | TensorFrames | TFoS |
|---|---|---|---|
| Programming Language | Scala | Python, Scala | Python |
| Migration | Major | Medium | Minor |
| Parallelism | Data Parallelism | Data Parallelism | Data + Model Parallelism |
| Distributed Training | Synchronous | Synchronous | Synchronous + Asynchronous |
| TensorBoard | ✗ | ✗ | ✓ |
| Scalability | Driver bottleneck | Driver bottleneck | ✓ |
![Page 23: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/23.jpg)
Summary
▪ TFoS brings deep learning to big-data clusters
  • TensorFlow: 0.12-1.0
  • Spark: 1.6-2.x
  • Cluster manager: YARN, Standalone, Mesos
  • EC2 image provided
▪ RDMA enhancement for faster training
  • PR for github/tensorflow repo
![Page 24: TensorFlowOnSpark - MatroidDeployable on cloud or on-premise Why TensorFlowOnSpark at Yahoo? Major player of open-source ecosystem • Birth place of Apache Hadoop • Adopter/contributor](https://reader035.vdocument.in/reader035/viewer/2022071402/60ee14584496e85cce2b9a36/html5/thumbnails/24.jpg)
Questions? https://github.com/yahoo/TensorFlowOnSpark