spark meetup tensorframes
TRANSCRIPT
![Page 1: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/1.jpg)
TensorFrames: Google Tensorflow on Apache Spark
Tim HunterSpark Summit 2016 - Meetup
![Page 2: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/2.jpg)
How familiar are you with Spark?
1. What is Apache Spark?
2. I have used Spark
3. I am using Spark in production or I contribute to its development
2
![Page 3: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/3.jpg)
How familiar are you with TensorFlow?
1. What is TensorFlow?
2. I have heard about it
3. I am training my own neural networks
3
![Page 4: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/4.jpg)
Founded by the team who created Apache Spark
Offers a hosted service:- Apache Spark in the cloud- Notebooks- Cluster management- Production environment
About Databricks
4
![Page 5: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/5.jpg)
Software engineer at Databricks
Apache Spark contributor
Ph.D. UC Berkeley in Machine Learning
(and Spark user since Spark 0.2)
About me
5
![Page 6: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/6.jpg)
Outline
•Numerical computing with Apache Spark
•Using GPUs with Spark and TensorFlow
•Performance details
•The future
6
![Page 7: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/7.jpg)
Numerical computing for Data Science
•Queries are data-heavy
•However algorithms are computation-heavy
•They operate on simple data types: integers, floats, doubles, vectors, matrices
7
![Page 8: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/8.jpg)
The case for speed
•Numerical bottlenecks are good targets for optimization
• Let data scientists get faster results
• Faster turnaround for experimentations
•How can we run these numerical algorithms faster?
8
![Page 9: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/9.jpg)
Evolution of computing power
9
Failureisnotanoption:itisafact
Whenyoucanaffordyourdedicatedchip
GPGPU
Scaleout
Scaleup
![Page 10: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/10.jpg)
Evolution of computing power
10
NLTKTheano
Today’stalk:Spark+TensorFlow
![Page 11: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/11.jpg)
Evolution of computing power
• Processor speed cannot keep up with memory and network improvements
• Access to the processor is the new bottleneck
• Project Tungsten in Spark: leverage the processor’s heuristics for executing code and fetching memory
• Does not account for the fact that the problem is numerical
11
![Page 12: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/12.jpg)
Asynchronous vs. synchronous
• Asynchronous algorithms perform updates concurrently• Spark is synchronous model, deep learning frameworks
usually asynchronous• A large number of ML computations are synchronous• Even deep learning may benefit from synchronous updates
12
![Page 13: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/13.jpg)
Outline
•Numerical computing with Apache Spark
•Using GPUs with Spark and TensorFlow
•Performance details
•The future
13
![Page 14: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/14.jpg)
GPGPUs
14
•Graphics Processing Units for General Purpose computations
6000
Theoreticalpeakthroughput
GPU CPU
Theoreticalpeakbandwidth
GPU CPU
![Page 15: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/15.jpg)
• Library for writing “machine intelligence” algorithms
• Very popular for deep learning and neural networks
•Can also be used for general purpose numerical computations
• Interface in C++ and Python
15
Google TensorFlow
![Page 16: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/16.jpg)
Numerical dataflow with Tensorflow
16
x = tf.placeholder(tf.int32, name=“x”)y = tf.placeholder(tf.int32, name=“y”)output = tf.add(x, 3 * y, name=“z”)
session = tf.Session()output_value = session.run(output,
{x: 3, y: 5})
x:int32
y:int32
mul 3
z
![Page 17: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/17.jpg)
Numerical dataflow with Spark
df = sqlContext.createDataFrame(…)
x = tf.placeholder(tf.int32, name=“x”)y = tf.placeholder(tf.int32, name=“y”)output = tf.add(x, 3 * y, name=“z”)
output_df = tfs.map_rows(output, df)
output_df.collect()
df:DataFrame[x:int,y:int]
output_df:DataFrame[x:int,y:int,z:int]
x:int32
y:int32
mul 3
z
![Page 18: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/18.jpg)
Outline
•Numerical computing with Apache Spark
•Using GPUs with Spark and TensorFlow
•Performance details
•The future
18
![Page 19: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/19.jpg)
19
It is a communication problem
Sparkworkerprocess Workerpythonprocess
C++buffer
Pythonpickle
Tungstenbinaryformat
Pythonpickle
Javaobject
![Page 20: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/20.jpg)
20
TensorFrames: native embedding of TensorFlow
Sparkworkerprocess
C++buffer
Tungstenbinaryformat
Javaobject
![Page 21: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/21.jpg)
•Estimation of distribution from samples•Non-parametric•Unknown bandwidth
parameter•Can be evaluated with
goodness of fit
An example: kernel density scoring
21
![Page 22: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/22.jpg)
• In practice, compute:
with:
• In a nutshell: a complex numerical function
An example: kernel density scoring
22
![Page 23: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/23.jpg)
23
Speedup
0
60
120
180
ScalaUDF ScalaUDF(optimized) TensorFrames TensorFrames+GPU
Runtim
e(sec)
def score(x:Double):Double ={val dis=points.map {z_k =>- (x- z_k)*(x- z_k)/(2*b*b)}val minDis =dis.minval exps =dis.map(d=>math.exp(d- minDis))minDis - math.log(b*N)+math.log(exps.sum)}
val scoreUDF =sqlContext.udf.register("scoreUDF", score_)sql("selectsum(scoreUDF(sample)) fromsamples").collect()
![Page 24: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/24.jpg)
24
Speedup
0
60
120
180
ScalaUDF ScalaUDF(optimized) TensorFrames TensorFrames+GPU
Runtim
e(sec)
def score(x:Double):Double ={val dis=new Array[Double](N)var idx =0while(idx <N){val z_k =points(idx)dis(idx)=- (x- z_k)*(x- z_k)/(2*b*b)idx +=1}val minDis =dis.minvar expSum =0.0idx =0while(idx <N){expSum +=math.exp(dis(idx) - minDis)idx +=1}minDis - math.log(b*N)+math.log(expSum)}
val scoreUDF =sqlContext.udf.register("scoreUDF",score_)sql("selectsum(scoreUDF(sample))fromsamples").collect()
![Page 25: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/25.jpg)
25
Speedup
0
60
120
180
ScalaUDF ScalaUDF(optimized) TensorFrames TensorFrames+GPU
Runtim
e(sec)def cost_fun(block,bandwidth):distances=- square(constant(X)- sample)/(2*b*b)m=reduce_max(distances,0)x=log(reduce_sum(exp(distances- m),0))return identity(x+m- log(b*N),name="score”)
sample=tfs.block(df,"sample")score=cost_fun(sample,bandwidth=0.5)df.agg(sum(tfs.map_blocks(score,df))).collect()
![Page 26: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/26.jpg)
26
Speedup
0
60
120
180
ScalaUDF ScalaUDF(optimized) TensorFrames TensorFrames+GPU
Runtim
e(sec)def cost_fun(block,bandwidth):distances=- square(constant(X)- sample)/(2*b*b)m=reduce_max(distances,0)x=log(reduce_sum(exp(distances- m),0))return identity(x+m- log(b*N),name="score”)
with device("/gpu"):sample=tfs.block(df,"sample")score=cost_fun(sample,bandwidth=0.5)df.agg(sum(tfs.map_blocks(score,df))).collect()
![Page 27: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/27.jpg)
Demo: Deep dreams
27
![Page 28: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/28.jpg)
Demo: Deep dreams
28
![Page 29: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/29.jpg)
Outline
•Numerical computing with Apache Spark
•Using GPUs with Spark and TensorFlow
•Performance details
•The future
29
![Page 30: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/30.jpg)
30
Improving communication
Sparkworkerprocess
C++buffer
Tungstenbinaryformat
Javaobject
Directmemorycopy
Columnarstorage
![Page 31: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/31.jpg)
The future
• Integration with Tungsten:• Direct memory copy• Columnar storage
•Better integration with MLlib data types
•GPU instances in Databricks: Official support coming this summer
31
![Page 32: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/32.jpg)
Recap•Spark: an efficient framework for running
computations on thousands of computers
•TensorFlow: high-performance numerical framework
•Get the best of both with TensorFrames:• Simple API for distributed numerical computing• Can leverage the hardware of the cluster
32
![Page 33: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/33.jpg)
Try these demos yourself
•TensorFrames source code and documentation:github.com/tjhunter/tensorframesspark-packages.org/package/tjhunter/tensorframes
•Demo available later on Databricks
•The official TensorFlow website:www.tensorflow.org
•More questions and attending the Spark summit?We will hold office hours at the Databricks booth.
33
![Page 34: Spark Meetup TensorFrames](https://reader031.vdocument.in/reader031/viewer/2022022415/586f79ad1a28ab10258b7047/html5/thumbnails/34.jpg)
Thank you.