
Page 1: Distributed implementation of a LSTM on Spark and Tensorflow

Emanuel Di Nardo
Source code: https://github.com/EmanuelOverflow/LSTM-TensorSpark

Page 2: Overview

● Introduction
● Apache Spark
● Tensorflow
● RNN-LSTM
● Implementation
● Results
● Conclusions

Page 3: Introduction

Distributed environment:

● Many computation units;
● Each unit is called a 'node';
● Node collaboration/competition;
● Message passing;
● Synchronization and global state management;

Page 4: Apache Spark

● Large-scale data processing framework;
● In-memory processing;
● General purpose:
  ○ MapReduce;
  ○ Batch and streaming processing;
  ○ Machine learning;
  ○ Graph theory;
  ○ Etc…
● Scalable;
● Open source;

Page 5: Apache Spark

● Resilient Distributed Dataset (RDD):
  ○ Fault-tolerant collection of elements;
  ○ Transformations and actions;
  ○ Lazy computation;
● Spark core:
  ○ Task dispatching;
  ○ Scheduling;
  ○ I/O;
● Essentially:
  ○ A master driver organizes the nodes and assigns tasks to the workers, passing an RDD;
  ○ Worker executors run the tasks and return the results in a new RDD (see the sketch below);
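A minimal PySpark sketch of these ideas (the local master and variable names are illustrative, not taken from the project): transformations are lazy, and only the action triggers the computation on the workers, returning the results to the driver.

    from pyspark import SparkContext

    # Hypothetical local context; on a cluster the master URL would point to the cluster manager.
    sc = SparkContext("local[*]", "rdd-demo")

    # Transformations are lazy: nothing runs yet.
    numbers = sc.parallelize(range(10), numSlices=3)   # fault-tolerant RDD split into 3 partitions
    squares = numbers.map(lambda x: x * x)             # transformation, recorded but not executed

    # The action triggers the distributed computation and returns the results to the driver.
    print(squares.collect())

    sc.stop()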

Page 6: Apache Spark Streaming

● Streaming computation;
● Mini-batch strategy;
● Latency depends on the mini-batch processing time/size;
● Easy to combine with the batch strategy;
● Fault tolerance;
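A minimal Spark Streaming sketch (the socket source on localhost:9999 is a hypothetical example) of the mini-batch model: the batch interval fixes how often a new mini-batch RDD is produced, and hence the latency.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "streaming-demo")
    ssc = StreamingContext(sc, batchDuration=5)   # one mini-batch every 5 seconds

    # Each mini-batch becomes an RDD of the lines received during the interval.
    lines = ssc.socketTextStream("localhost", 9999)
    lines.count().pprint()                        # per-batch element count

    ssc.start()
    ssc.awaitTermination()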

Page 7: Apache Spark

● API for many languages:
  ○ Java;
  ○ Python;
  ○ Scala;
  ○ R;
● Runs on:
  ○ Hadoop;
  ○ Mesos;
  ○ Standalone;
  ○ Cloud.
● It can access diverse data sources, including:
  ○ HDFS;
  ○ Cassandra;
  ○ HBase;

Page 8: Tensorflow

● Numerical computation library;
● Computation is graph-based:
  ○ Nodes are mathematical operations;
  ○ Edges are I/O multidimensional arrays (tensors);
● Distributed over multiple CPUs/GPUs;
● API:
  ○ Python;
  ○ C++;
● Open source;
● A Google product;
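A minimal sketch using the TensorFlow 1.x-era API of the period: nodes are operations, edges carry tensors, and nothing is computed until the graph is run in a session.

    import tensorflow as tf

    # Build the graph: nodes are ops, edges are tensors.
    a = tf.placeholder(tf.float32, shape=[2, 2], name="a")
    b = tf.constant([[1.0, 2.0], [3.0, 4.0]], name="b")
    c = tf.matmul(a, b)   # no multiplication happens here

    # Computation only happens when the graph is executed in a session.
    with tf.Session() as sess:
        print(sess.run(c, feed_dict={a: [[1.0, 0.0], [0.0, 1.0]]}))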

Page 9: Tensorflow

● Data Flow Graph:
  ○ Directed graph;
  ○ Nodes are mathematical operations or data I/O;
  ○ Edges are I/O tensors;
  ○ Operations are asynchronous and parallel:
    ■ Performed once all their input tensors are available;
● Flexible and easily extensible;
● Auto-differentiation;
● Lazy computation;
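A minimal sketch of auto-differentiation (again 1.x-era API, purely illustrative): the gradient is itself added to the graph as a new node and evaluated lazily.

    import tensorflow as tf

    x = tf.Variable(3.0)
    y = x * x + 2.0 * x             # y = x^2 + 2x

    grad = tf.gradients(y, [x])[0]  # dy/dx = 2x + 2, built as a graph node

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(grad))       # 8.0 when x = 3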

Page 10: RNN-LSTM

● Recurrent Neural Network;
● Cyclic networks:
  ○ At each training step the output of the previous step is used to feed the same layer together with a new input;
● The input Xt is transformed by the hidden layer A; the output is also used to feed the layer itself;

*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 11: RNN-LSTM

● Recurrent Neural Network;
● Cyclic networks:
  ○ At each training step the output of the previous step is used to feed the same layer together with a new input;
● Unrolled network:
  ○ Each input feeds the network;
  ○ The output is passed to the next step as a supplementary input (see the sketch below);

*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
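A minimal NumPy sketch of the unrolled recurrence (toy sizes and random weights, purely illustrative): each step consumes one input x_t together with the previous output h.

    import numpy as np

    input_dim, hidden_dim, steps = 4, 8, 5
    W_xh = np.random.randn(hidden_dim, input_dim) * 0.1
    W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1
    b_h = np.zeros(hidden_dim)

    xs = [np.random.randn(input_dim) for _ in range(steps)]  # input sequence x_1..x_T
    h = np.zeros(hidden_dim)                                 # initial hidden state

    for x_t in xs:
        # The previous output h feeds the same layer together with the new input x_t.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    print(h)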

Page 12: RNN-LSTM

● This kind of network has a major problem...:
  ○ It is unable to learn long data sequences;
  ○ It only works in the short term;
● A 'long memory' model is needed:
  ○ Long Short-Term Memory (LSTM);
● The hidden layer is able to memorize long data sequences using:
  ○ The current input;
  ○ The previous output;
  ○ The network memory state;

*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 13: RNN-LSTM

● The hidden layer is able to memorize long data sequences using:
  ○ The current input;
  ○ The previous output;
  ○ The network memory state;
● Four 'gate layers' preserve the information (see the equations below):
  ○ Forget gate layer;
  ○ Input gate layer;
  ○ 'Candidate' gate layer;
  ○ Output gate layer;
● Multiple activation functions:
  ○ Sigmoid for the first three layers;
  ○ Tanh for the output layer;

*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
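For reference, the LSTM step as formulated in the cited post, where \sigma is the sigmoid and [h_{t-1}, x_t] is the concatenation of the previous output and the current input:

    f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)            (forget gate)
    i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)            (input gate)
    \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)     (candidate)
    o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)            (output gate)
    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t         (new memory state)
    h_t = o_t \odot \tanh(C_t)                              (new output)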

Page 14: Implementation

● RNN-LSTM:
  ○ Distributed on Spark;
  ○ Mathematical operations with Tensorflow;
● Distribution of the mini-batch computation:
  ○ Each partition takes care of a subset of the whole dataset;
  ○ Each subset has the same size; this is not strictly required by the mini-batch strategy (proper techniques can handle unbalanced partitions), but it lets us measure performance across all partitions under a balanced load (see the sketch below);
● Tensorflow provides several LSTM implementations, but it was decided to implement the network from scratch for learning purposes;
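A minimal PySpark sketch (hypothetical sizes) of the balanced split described above: a fixed number of equally sized partitions is obtained directly from parallelize.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "partition-demo")
    dataset = list(range(150))                      # e.g. 150 Iris samples

    num_partitions = 3
    rdd = sc.parallelize(dataset, numSlices=num_partitions)

    # glom() turns each partition into a list, so the balanced split can be inspected.
    print([len(p) for p in rdd.glom().collect()])   # [50, 50, 50]
    sc.stop()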

Page 15: Implementation

● A master driver splits the input data into partitions organized by key:
  ○ The input data is shuffled and normalized;
  ○ Each partition will have its own RDD;
● Each Spark worker runs an entire LSTM training cycle:
  ○ There is one LSTM per partition;
  ○ The number of epochs, hidden layers and partitions can be chosen, as well as the memory assigned to each worker and many other parameters;
● At the end of the training step the returned RDD is mapped into a key-value data structure holding the weight and bias values;
● Finally, all elements in the RDDs are averaged to obtain the final result (see the sketch below);
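A minimal sketch of this driver-side flow; train_partition is a hypothetical stand-in for the repository's real training cycle and simply returns dummy parameters of fixed shape.

    import numpy as np
    from pyspark import SparkContext

    def train_partition(samples):
        # Placeholder for a full LSTM training cycle on one partition;
        # the real code would iterate over epochs and minimize a loss.
        samples = list(samples)
        yield {"weights": np.random.randn(4, 8), "biases": np.zeros(8)}

    sc = SparkContext("local[*]", "lstm-driver-demo")
    data = sc.parallelize(range(150), numSlices=3)   # shuffled/normalized dataset goes here

    # One LSTM per partition; each result is a key-value dict of parameters.
    trained = data.mapPartitions(train_partition).collect()

    # Average the per-partition parameters to obtain the final model.
    final = {k: np.mean([p[k] for p in trained], axis=0) for k in trained[0]}
    print({k: v.shape for k, v in final.items()})
    sc.stop()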

Page 16: Implementation

● A new LSTM is created with Tensorflow mathematical operations:
  ○ Operations are executed lazily;
  ○ Initialization builds and organizes the data graph;
● Weights and biases are initialized randomly;
● An optimizer is chosen and an OutputLayer is instantiated;
● Because of the lazy strategy, all operations must be placed inside a 'session':
  ○ The session handles the initialization ops and the graph execution;
  ○ All variables must be initialized before any run;
● Taking advantage of Python's function passing, all computation layers are performed by a single method:
  ○ Each call uses a different activation function and the right variables (see the sketch below);
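A minimal 1.x-era sketch of the function-passing idea (gate_layer and the variable names are illustrative, not the repository's): one method builds every gate, receiving the activation function and the right variables, and the session initializes all variables before any run.

    import tensorflow as tf

    hidden_dim, input_dim = 8, 4

    def gate_layer(activation, x, h_prev, W, b):
        # Single method reused for every gate: only the activation and the variables change.
        return activation(tf.matmul(tf.concat([h_prev, x], axis=1), W) + b)

    x = tf.placeholder(tf.float32, [None, input_dim])
    h_prev = tf.placeholder(tf.float32, [None, hidden_dim])

    # Randomly initialized weights and biases (one pair per gate in the real network).
    W_f = tf.Variable(tf.random_normal([hidden_dim + input_dim, hidden_dim]))
    b_f = tf.Variable(tf.zeros([hidden_dim]))

    forget_gate = gate_layer(tf.sigmoid, x, h_prev, W_f, b_f)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())   # all variables initialized before any run
        out = sess.run(forget_gate, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]],
                                               h_prev: [[0.0] * hidden_dim]})
        print(out.shape)   # (1, 8)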

Page 17: Implementation

● At the end, minimization is performed:
  ○ The loss function is computed in the output layer;
  ○ Minimization uses Tensorflow auto-differentiation;
● Finally, the data are organized in a key-value structure holding weights and biases (see the sketch below);
● Data evaluation is also possible, but since it is not a time-consuming task it is not reported.
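A minimal sketch of this last step (the placeholder model below is illustrative, not the repository's output layer): a loss is defined, minimized through auto-differentiation, and the learned parameters are packed into a key-value structure.

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 4])
    y = tf.placeholder(tf.float32, [None, 3])

    W = tf.Variable(tf.random_normal([4, 3]), name="weights")
    b = tf.Variable(tf.zeros([3]), name="biases")
    logits = tf.matmul(x, W) + b

    # Loss computed in the output layer; minimize() builds the gradient ops automatically.
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op, feed_dict={x: [[5.1, 3.5, 1.4, 0.2]], y: [[1.0, 0.0, 0.0]]})
        # Key-value structure returned to the driver at the end of training.
        params = {"weights": sess.run(W), "biases": sess.run(b)}
        print({k: v.shape for k, v in params.items()})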

Page 18: Results

● Tested locally in a multicore environment:
  ○ A distributed environment was not available;
  ○ Each partition is assigned to a core;
● No GPU usage;
● Iris dataset*;
● Overloaded CPUs vs idle CPUs;
● 12 cores, 64 GB RAM;

* http://archive.ics.uci.edu/ml/datasets/Iris

Page 19: Results

● 3 partitions:

  Partition            | T. exec (s) | T. exec (min)
  1                    | 1385.62     | ~23
  2                    | 1675.76     | ~28
  3                    | 1692.48     | ~28
  Tot + weight average | 1704.81     | ~28
  Tot + repartition    | 1704.81     | ~28

Page 20: Results

● 5 partitions:

  Partition            | T. exec (s) | T. exec (min)
  1                    | 867.18      | ~14
  2                    | 834.31      | ~14
  3                    | 995.37      | ~16
  4                    | 970.46      | ~16
  5                    | 1015.47     | ~17
  Tot + weight average | 1023.43     | ~17
  Tot + repartition    | 1023.43     | ~17

Page 21: Results

● 15 partitions:

  Partition            | T. exec (s) | T. exec (min)
  1                    | 476.76      | ~8
  2                    | 448.91      | ~7
  3                    | 472.05      | ~8
  4                    | 493.39      | ~8
  5                    | 485.66      | ~8
  6                    | 482.82      | ~8
  7                    | 499.66      | ~8
  8                    | 454.78      | ~8
  9                    | 479.61      | ~8
  10                   | 493.21      | ~8
  11                   | 458.05      | ~8
  12                   | 504.85      | ~8
  13                   | 470.93      | ~8
  14                   | 450.84      | ~8
  15                   | 454.29      | ~8
  Tot + weight average | 510.89      | ~9
  Tot + repartition    | 510.89      | ~9

Page 22: Results

● Comparison with the non-distributed implementations:

  System      | T. exec (s) | T. exec (min) | Speed-up vs local-mb-10 | Speed-up vs local
  dist-3      | 1704.81     | ~28           | 96%                     | 61%
  dist-5      | 1023.91     | ~17           | 97%                     | 76%
  dist-15     | 510.89      | ~9            | 98%                     | 88%
  local-opt   | 4080.94     | ~68           | 89%                     | 6%
  local       | 4335.66     | ~72           | 88%                     | -
  local-mb-10 | 34699.58    | ~578          | -                       | -

  local: non-distributed implementation
  local-opt: non-distributed, optimized implementation
  local-mb-10: non-distributed implementation with mini-batches of 10 elements (matching the dist-15 organization)

Page 23: Results

● 3 partitions [overloaded vs idle CPUs]:

  Partition | T. exec busy (s) | T. exec busy (min) | T. exec idle (s) | T. exec idle (min)
  1         | 2679.76          | ~44                | 1385.62          | ~23
  2         | 2910.69          | ~48                | 1675.76          | ~28
  3         | 3063.88          | ~51                | 1692.48          | ~28
  Tot       | 3078.15          | ~51                | 1704.81          | ~28

Page 24: Results

● 5 partitions [overloaded vs idle CPUs]:

  Partition | T. exec busy (s) | T. exec busy (min) | T. exec idle (s) | T. exec idle (min)
  1         | 1356.44          | ~22                | 867.18           | ~14
  2         | 1358.28          | ~22                | 834.31           | ~14
  3         | 1373.25          | ~22                | 995.37           | ~16
  4         | 1370.11          | ~23                | 970.46           | ~16
  5         | 1372.25          | ~23                | 1015.47          | ~17
  Tot       | 1393.91          | ~23                | 1023.43          | ~17