TensorFrames: Google TensorFlow on Apache Spark
TRANSCRIPT
![Page 1: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/1.jpg)
TensorFrames: Google TensorFlow on Apache Spark
Tim Hunter
Meetup 08/2016 - Salesforce
![Page 2: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/2.jpg)
How familiar are you with Spark?
1. What is Apache Spark?
2. I have used Spark
3. I am using Spark in production or I contribute to its development
2
![Page 3: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/3.jpg)
How familiar are you with TensorFlow?
1. What is TensorFlow?
2. I have heard about it
3. I am training my own neural networks
3
![Page 4: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/4.jpg)
About Databricks
Founded by the team who created Apache Spark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Cluster management
- Production environment
4
![Page 5: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/5.jpg)
About me
Software engineer at Databricks
Apache Spark contributor
Ph.D. UC Berkeley in Machine Learning
(and Spark user since Spark 0.5)
5
![Page 6: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/6.jpg)
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
6
![Page 7: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/7.jpg)
Numerical computing for Data Science
• Queries are data-heavy
• However, algorithms are computation-heavy
• They operate on simple data types: integers, floats, doubles, vectors, matrices
7
![Page 8: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/8.jpg)
The case for speed
• Numerical bottlenecks are good targets for optimization
• Let data scientists get faster results
• Faster turnaround for experimentation
• How can we run these numerical algorithms faster?
8
![Page 9: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/9.jpg)
Evolution of computing power
9
Failure is not an option: it is a fact
When you can afford your dedicated chip
GPGPU
Scale out
Scale up
![Page 10: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/10.jpg)
Evolution of computing power
10
NLTK
Theano
Today’s talk: Spark + TensorFlow
![Page 11: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/11.jpg)
Evolution of computing power
• Processor speed cannot keep up with memory and network improvements
• Access to the processor is the new bottleneck
• Project Tungsten in Spark: leverage the processor’s heuristics for executing code and fetching memory
• Does not account for the fact that the problem is numerical
11
![Page 12: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/12.jpg)
Asynchronous vs. synchronous
• Asynchronous algorithms perform updates concurrently
• Spark is a synchronous model; deep learning frameworks are usually asynchronous
• A large number of ML computations are synchronous
• Even deep learning may benefit from synchronous updates
12
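To make the synchronous model concrete, here is a minimal pure-Python sketch, assuming a one-parameter least-squares problem split across hypothetical partitions. Each step computes every partition's gradient against the same weight and then applies a single averaged update, which is the barrier-style pattern a Spark job follows. The data, learning rate, and function names are illustrative and do not come from the talk.

```python
def partition_gradient(w, partition):
    """Gradient of 0.5 * (w * x - y)^2, averaged over one partition."""
    return sum((w * x - y) * x for x, y in partition) / len(partition)


def synchronous_step(w, partitions, lr):
    """All partitions read the same w; one combined update is applied."""
    grads = [partition_gradient(w, p) for p in partitions]  # "map" phase
    return w - lr * sum(grads) / len(grads)                 # "reduce" phase


# Hypothetical data generated from y = 2 * x, split into two partitions.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, partitions, lr=0.05)
```

An asynchronous variant would instead let each partition apply its own update as soon as it finishes, possibly against a weight that other partitions have already changed.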
![Page 13: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/13.jpg)
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
13
![Page 14: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/14.jpg)
GPGPUs
14
• Graphics Processing Units for General Purpose computations
[Charts: theoretical peak throughput, GPU vs. CPU; theoretical peak bandwidth, GPU vs. CPU]
![Page 15: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/15.jpg)
Google TensorFlow
• Library for writing “machine intelligence” algorithms
• Very popular for deep learning and neural networks
• Can also be used for general purpose numerical computations
• Interface in C++ and Python
15
![Page 16: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/16.jpg)
Numerical dataflow with TensorFlow
16
import tensorflow as tf

x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})
[Diagram: dataflow graph with inputs x:int32 and y:int32; y is multiplied by 3 and added to x to produce z]
![Page 17: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/17.jpg)
Numerical dataflow with Spark
import tensorflow as tf
import tensorframes as tfs

df = sqlContext.createDataFrame(…)

x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

output_df = tfs.map_rows(output, df)
output_df.collect()
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]
[Diagram: dataflow graph with inputs x:int32 and y:int32; y is multiplied by 3 and added to x to produce z]
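For readers without a cluster at hand, the row-wise semantics can be emulated in plain Python. This is only a sketch of what the slide's graph computes for each row (z = x + 3 * y, appended as a new column); `map_rows_emulated`, `z_graph`, and the sample rows are hypothetical and are not part of the TensorFrames API.

```python
def z_graph(x, y):
    """Same arithmetic as the slide's graph: z = x + 3 * y."""
    return x + 3 * y


def map_rows_emulated(rows):
    """Append column z to each row, leaving x and y untouched."""
    return [dict(row, z=z_graph(row["x"], row["y"])) for row in rows]


rows = [{"x": 3, "y": 5}, {"x": 0, "y": 1}]   # hypothetical stand-in for df
output_rows = map_rows_emulated(rows)
print(output_rows)
```

The existing columns are preserved and the new column takes the name of the graph's output node, matching the `DataFrame[x: int, y: int, z: int]` schema shown on the slide.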
![Page 18: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/18.jpg)
Demo
18
![Page 19: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/19.jpg)
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
19
![Page 20: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/20.jpg)
20
It is a communication problem
[Diagram: data moving between the Spark worker process and the worker Python process. Stages shown: Tungsten binary format, Java object, Python pickle (on both sides), C++ buffer]
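A small standard-library sketch of why the serialization path matters: the same column of doubles is sent once through `pickle` and once as a packed contiguous buffer. It only illustrates the idea on this slide; it does not reproduce the actual PySpark or TensorFrames data path, and the values are made up.

```python
import pickle
from array import array

values = [float(i) for i in range(1000)]   # hypothetical column of doubles

# Path 1: generic object serialization, as when rows cross into Python.
pickled = pickle.dumps(values)
restored_from_pickle = pickle.loads(pickled)

# Path 2: one contiguous buffer of 8-byte doubles, a layout that native
# numerical code can read without decoding each element.
packed = array("d", values).tobytes()
restored_from_buffer = array("d")
restored_from_buffer.frombytes(packed)

print(len(pickled), len(packed))
```

Both paths recover the same values; the difference is that the pickle path builds and decodes Python objects element by element, while the buffer path is a plain memory copy.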
![Page 21: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/21.jpg)
21
TensorFrames: native embedding of TensorFlow
[Diagram: everything stays inside the Spark worker process. Stages shown: Tungsten binary format, Java object, C++ buffer]
![Page 22: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/22.jpg)
An example: kernel density scoring
• Estimation of distribution from samples
• Non-parametric
• Unknown bandwidth parameter
• Can be evaluated with goodness of fit
22
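A minimal pure-Python sketch of the idea, assuming a Gaussian kernel and the scoring expression used by the Scala code later in this deck: log(sum_k exp(-(x - z_k)^2 / (2 b^2))) - log(b N). The points, held-out samples, and candidate bandwidths are made up; a higher total score means a better fit on the held-out samples.

```python
import math


def score(x, points, b):
    """Log-density score of x under a Gaussian kernel with bandwidth b
    (up to an additive constant), computed with the log-sum-exp trick."""
    dis = [-(x - z) ** 2 / (2 * b * b) for z in points]
    m = max(dis)  # subtract the max so that every exponent is <= 0
    return m + math.log(sum(math.exp(d - m) for d in dis)) - math.log(b * len(points))


def total_score(samples, points, b):
    """Goodness of fit: sum of scores over held-out samples."""
    return sum(score(x, points, b) for x in samples)


points = [-1.0, 0.0, 0.5, 1.0, 2.0]        # hypothetical training points
held_out = [-0.5, 0.2, 1.5]                # hypothetical held-out samples
candidates = [0.05, 0.5, 5.0]              # hypothetical bandwidths to compare
best_b = max(candidates, key=lambda b: total_score(held_out, points, b))
```

A bandwidth that is too small scores badly because held-out samples fall between the narrow kernels; one that is too large is penalized by the log(b N) term.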
![Page 23: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/23.jpg)
An example: kernel density scoring
• In practice, compute: [formula shown on slide]
with: [formula shown on slide]
• In a nutshell: a complex numerical function
23
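Reading the computation back from the Scala and TensorFlow code on the following slides (an inference, not the slide's own notation), the score of a sample x against points z_1, ..., z_N with bandwidth b can be written as:

```latex
\mathrm{score}(x) = \log\!\left( \frac{1}{b N} \sum_{k=1}^{N} \exp\!\left( -\frac{(x - z_k)^2}{2 b^2} \right) \right)

% Equivalent shifted form, valid for any constant m:
d_k = -\frac{(x - z_k)^2}{2 b^2}, \qquad
\mathrm{score}(x) = m - \log(b N) + \log \sum_{k=1}^{N} e^{\,d_k - m}
```

The Scala versions take m as the minimum of the d_k, while the TensorFlow version takes the maximum; both give the same value in exact arithmetic, and the maximum keeps every exponent at or below zero.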
![Page 24: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/24.jpg)
24
Speedup
[Bar chart: run time (sec), axis from 0 to 180, for Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU]
// Score one sample x against the reference points (bandwidth b, N points)
def score(x: Double): Double = {
  val dis = points.map { z_k => -(x - z_k) * (x - z_k) / (2 * b * b) }
  val minDis = dis.min
  // Shift by minDis before exponentiating; the shift is added back below
  val exps = dis.map(d => math.exp(d - minDis))
  minDis - math.log(b * N) + math.log(exps.sum)
}

val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()
![Page 25: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/25.jpg)
25
Speedup
[Bar chart: run time (sec) for Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU]
// Same computation with preallocated arrays and while loops
def score(x: Double): Double = {
  val dis = new Array[Double](N)
  var idx = 0
  while (idx < N) {
    val z_k = points(idx)
    dis(idx) = -(x - z_k) * (x - z_k) / (2 * b * b)
    idx += 1
  }
  val minDis = dis.min
  var expSum = 0.0
  idx = 0
  while (idx < N) {
    expSum += math.exp(dis(idx) - minDis)
    idx += 1
  }
  minDis - math.log(b * N) + math.log(expSum)
}

val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()
![Page 26: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/26.jpg)
26
Speedup
[Bar chart: run time (sec) for Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU]
# X (the reference points) and N (their count) are defined elsewhere;
# the TensorFlow functions are assumed to be imported into scope.
def cost_fun(block, bandwidth):
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")

sample = tfs.block(df, "sample")
score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()
![Page 27: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/27.jpg)
27
Speedup
[Bar chart: run time (sec) for Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU]
# X (the reference points) and N (their count) are defined elsewhere;
# the TensorFlow functions are assumed to be imported into scope.
def cost_fun(block, bandwidth):
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")

# Build the same graph, placed on the GPU
with device("/gpu"):
    sample = tfs.block(df, "sample")
    score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()
![Page 28: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/28.jpg)
Demo: Deep dreams
28
![Page 29: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/29.jpg)
Demo: Deep dreams
29
![Page 30: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/30.jpg)
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
30
![Page 31: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/31.jpg)
31
Improving communication
[Diagram: inside the Spark worker process. Stages shown: Tungsten binary format, Java object, C++ buffer; proposed improvements: direct memory copy, columnar storage]
![Page 32: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/32.jpg)
The future
• Integration with Tungsten:
  - Direct memory copy
  - Columnar storage
• Better integration with MLlib data types
• GPU instances in Databricks: Official support coming this fall
32
![Page 33: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/33.jpg)
Recap
• Spark: an efficient framework for running computations on thousands of computers
• TensorFlow: high-performance numerical framework
• Get the best of both with TensorFrames:
  - Simple API for distributed numerical computing
  - Can leverage the hardware of the cluster
33
![Page 34: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/34.jpg)
Try these demos yourself
• TensorFrames source code and documentation:
  github.com/databricks/tensorframes
  spark-packages.org/package/databricks/tensorframes
• Demo notebooks available on Databricks
• The official TensorFlow website: www.tensorflow.org
34
![Page 35: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/35.jpg)
Spark Summit EU 2016 15% Discount Code: DatabricksEU16
35
![Page 36: TensorFrames: Google Tensorflow on Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062400/586f79941a28ab10258b6ff3/html5/thumbnails/36.jpg)
Thank you.