Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
TRANSCRIPT
Deep recurrent neural networks for sequence learning in Spark
Yves MABIALA, THALES
Outline
• Thales & Big Data
• On the difficulty of Sequence Learning
• Deep Learning for Sequence Learning
• Spark implementation of Deep Learning
• Use cases
  – Predictive maintenance
  – NLP
Thales & Big Data
Thales systems produce a huge quantity of data
– Transportation systems (ticketing, supervision, …)
– Security (radar traces, network logs, …)
– Satellites (photos, videos, …)

which is often
– Massive
– Heterogeneous
– Extremely dynamic

and where understanding the dynamics of the monitored phenomena is mandatory ⇒ Sequence Learning
What is sequence learning?
Sequence learning refers to a set of ML tasks where a model has to deal with sequences as input, produce sequences as output, or both.

Goal: understand the dynamics of a sequence in order to
– Classify
– Predict
– Model
Typical applications
– Text
  • Classify texts (sentiment analysis)
  • Generate textual descriptions of images (image captioning)
– Video
  • Video classification
– Speech
  • Speech to text
How is it typically handled?
Taking the dynamics into account is difficult
– Often people do not bother
  • E.g. text analysis using bag of words (one-hot encoding, as in the table and sketch below)
  • This is a problem for certain tasks such as sentiment classification (the order of the words is important)
– Or they use popular statistical approaches
  • (Hidden) Markov models for prediction (and classification)
    – Short-term dependency (order 1): $P(X_t = x_t \mid X_{t-1} = x_{t-1}, \ldots, X_{t-n} = x_{t-n}) = P(X_t = x_t \mid X_{t-1} = x_{t-1})$
  • Autoregressive approaches for time-series forecasting
Example: bag-of-words encoding (word order is lost)

Sentence                 The  is  chair  red  young  cat  on  a
The chair is red          1    1    1     1     0     0    0   0
The cat is young          1    1    0     0     1     1    0   0
The cat is on a chair     1    1    1     0     0     1    1   1
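To make the encoding above concrete, here is a minimal sketch of one-hot bag-of-words encoding (the vocabulary, object and method names are ours, not from the talk); note how the word order disappears, which is exactly the limitation raised above:

```scala
// Minimal bag-of-words (one-hot presence) encoding matching the table above.
// The fixed vocabulary and the example sentences are illustrative.
object BagOfWords {
  val vocabulary = Seq("the", "is", "chair", "red", "young", "cat", "on", "a")

  def encode(sentence: String): Seq[Int] = {
    val words = sentence.toLowerCase.split("\\s+").toSet
    vocabulary.map(w => if (words.contains(w)) 1 else 0)
  }

  def main(args: Array[String]): Unit = {
    Seq("The chair is red", "The cat is young", "The cat is on a chair")
      .foreach(s => println(s"$s -> ${encode(s).mkString(" ")}"))
  }
}
```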
Link with artificial neural networks?
Artificial neural networks are a family of statistical models inspired by the brain
– They transform the input by applying (non-linear) functions at each layer
– More layers means more capabilities (≥ 2 hidden layers: Deep Learning)
  • From manual feature building to feature learning

Set of transformation and activation operations
– Affine: $Y = W^{t}X + b$; sigmoid activation: $Y = \frac{1}{1 + \exp(-X)}$; tanh activation: $Y = \tanh(X)$
  • Only affine + activation layers = multi-layer perceptron (available in Spark ML since 1.5.0; see the sketch after this slide)
– Convolutional: applies a spatial convolution to the 1D/2D input (signal, image, …): $Y = \mathrm{conv}(X, W) + b$
  • Learns spatial features used for classification (images) and prediction
– Recurrent: introduces a recurrent part to learn dependencies between observations (features related to the dynamics)
Objective
– Find the best weights W that minimize the difference between the predicted output and the desired one (using the back-propagation algorithm)
[Figure: a feed-forward network with an input layer, hidden layers, and an output layer]
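Since the slide notes that the multi-layer perceptron has been available in Spark ML since 1.5.0, a minimal usage sketch may help. This is written against the Spark 2.x DataFrame API; the dataset path and the layer sizes are illustrative assumptions, not from the talk:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.SparkSession

// Minimal use of the MLP classifier shipped in Spark ML.
val spark = SparkSession.builder.appName("mlp-example").getOrCreate()

// Assumed: a LIBSVM-formatted file with 4 features and 3 classes.
val data = spark.read.format("libsvm").load("sample_multiclass_data.txt")

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 8, 8, 3)) // input size, two hidden layers, output classes
  .setMaxIter(100)

val model = mlp.fit(data)
model.transform(data).show(5) // predictions on the training data, for illustration
```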
Recurrent Neural Network basics
Artificial neural networks with one or more recurrent layers

Able to cope with varying-size sequences at the input and/or the output:
– One to many (fixed-size input, sequence output), e.g. image captioning
– Many to many (sequence input, sequence output), e.g. speech to text
– Many to one (sequence input, fixed-size output), e.g. text classification
Classical neural network: $Y_k = f(W^{t}X_k)$
Recurrent neural network: $Y_k = f(W^{t}X_k + HY_{k-1})$

[Figure: both networks side by side, and the recurrent network unrolled through time, with inputs $X_{k-3}, \ldots, X_k$ producing outputs $Y_{k-3}, \ldots, Y_k$, each step feeding its output to the next]
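A minimal sketch of the recurrence above, $Y_k = f(W^{t}X_k + HY_{k-1})$, unrolled over an input sequence (plain Scala with $f = \tanh$; the object and parameter names are ours, and dimensions are illustrative):

```scala
import scala.math.tanh

// Simple-RNN forward pass, unrolled through time: y_k = tanh(W x_k + H y_{k-1}).
object SimpleRnn {
  def matVec(m: Array[Array[Double]], v: Array[Double]): Array[Double] =
    m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  /** Runs the recurrence over a sequence and returns every output y_k. */
  def forward(xs: Seq[Array[Double]],
              w: Array[Array[Double]],                  // input-to-hidden weights
              h: Array[Array[Double]]): Seq[Array[Double]] = { // hidden-to-hidden
    val y0 = Array.fill(w.length)(0.0)                  // initial state y_{-1} = 0
    xs.scanLeft(y0) { (yPrev, x) =>
      add(matVec(w, x), matVec(h, yPrev)).map(tanh)
    }.tail                                              // drop the initial state
  }
}
```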
On the difficulty of training recurrent networks
RNNs are (were) known to be difficult to train
– More weights and more computational steps
  • More computationally expensive (accelerator needed for matrix ops: BLAS or GPU)
  • More data needed to converge (scalability over Big Data architectures: Spark)
    – Theano, TensorFlow and Caffe do not have distributed versions
– Unable to learn long-range dependencies (Graves et al., 2014)
  • At a given time t, an RNN does not remember observations before $X_{t-n}$
⇒ New RNN architectures with memory preservation (more context): LSTM and GRU, the latter defined by
$Z_t = f(W_z^{t}X_t + H_z Y_{t-1})$
$R_t = f(W_r^{t}X_t + H_r Y_{t-1})$
$\tilde{H}_t = \tanh(W_h^{t}X_t + U(Y_{t-1} \circ R_t))$
$Y_t = (1 - Z_t) \circ Y_{t-1} + Z_t \circ \tilde{H}_t$
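The GRU equations above translate almost line for line into code. A minimal single-step sketch, with $f$ taken as the sigmoid and elementwise operations written out; all names are illustrative:

```scala
import scala.math.{exp, tanh}

// One GRU step, following the four equations above.
object GruStep {
  type Vec = Array[Double]; type Mat = Array[Array[Double]]

  def sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))
  def matVec(m: Mat, v: Vec): Vec =
    m.map(_.zip(v).map { case (a, b) => a * b }.sum)
  def zipWith(a: Vec, b: Vec)(f: (Double, Double) => Double): Vec =
    a.zip(b).map { case (x, y) => f(x, y) }

  def step(x: Vec, yPrev: Vec,
           wz: Mat, hz: Mat, wr: Mat, hr: Mat, wh: Mat, u: Mat): Vec = {
    val z = zipWith(matVec(wz, x), matVec(hz, yPrev))(_ + _).map(sigmoid) // update gate Z_t
    val r = zipWith(matVec(wr, x), matVec(hr, yPrev))(_ + _).map(sigmoid) // reset gate R_t
    val hTilde = zipWith(matVec(wh, x),
      matVec(u, zipWith(yPrev, r)(_ * _)))(_ + _).map(tanh)               // candidate state
    zipWith(zipWith(z.map(1.0 - _), yPrev)(_ * _),
            zipWith(z, hTilde)(_ * _))(_ + _)                             // new output Y_t
  }
}
```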
Recurrent neural networks in Spark
Spark implementation of DL algorithms (data parallel)
– All the needed blocks
  • Affine, convolutional and recurrent layers (simple and GRU)
  • Sigmoid, tanh and ReLU activations
  • SGD, RMSProp and AdaDelta optimizers
– CPU (and GPU) backend
– Fully compatible with the existing DL library in Spark ML

Performance
– On a 6-node cluster (CPU)
  • 5.46× average speedup (some communication overhead)
  • About the same speedup as the MLP in Spark ML
[Figure: data-parallel training with a driver and three workers; (1) the driver broadcasts the model to the workers, (2) the workers send the resulting gradients back]
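A minimal sketch of the data-parallel loop in the figure, with a linear least-squares gradient standing in for back-propagation (the talk does not show its implementation; the random data, dimensions and learning rate here are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

// (1) broadcast the current weights, (2) each worker computes gradients on its
// partition, the driver sums them and applies a gradient-descent update.
object DataParallelSgd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("data-parallel-sgd").getOrCreate()
    val sc = spark.sparkContext

    // Each example: (features, label); random data for illustration only.
    val data = sc.parallelize(Seq.fill(10000)(
      (Array.fill(20)(Random.nextDouble()), Random.nextDouble()))).cache()

    var w = Array.fill(20)(0.0)
    val lr = 0.1
    for (_ <- 1 to 10) {
      val bw = sc.broadcast(w)                                // (1) model broadcast
      val (gradSum, n) = data.map { case (x, y) =>
        val err = x.zip(bw.value).map { case (a, b) => a * b }.sum - y
        (x.map(_ * err), 1L)                                  // per-example gradient
      }.treeReduce { case ((g1, n1), (g2, n2)) =>             // (2) gradients back
        (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)
      }
      w = w.zip(gradSum).map { case (wi, gi) => wi - lr * gi / n }
      bw.destroy()
    }
    println(w.mkString(", "))
  }
}
```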
Use case 1: predictive maintenance (1)
Context
– Thales and its clients build systems in different domains
  • Transportation (ticketing, controlling)
  • Defense (radar)
  • Satellites
– Need for better and more accurate maintenance services
  • From planned maintenance (every x days) to alert-based maintenance
  • From expert detection to automatic failure prediction
  • From whole-subsystem replacement to more localized repairs

Goal
– Detect early signs of a (sub)system failure using data coming from sensors monitoring the health of the system (HUMS)
Use case 1: predictive maintenance (2)
Example on a real system
– 20 sensors (20 values every 5 minutes), plus a label (failure or not)
– Take 3 hours of data and predict the probability of failure in the next hour (fully customizable; see the windowing sketch below)
Learning using MLlib
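A sketch of how the windowing described above could be set up: at one sample every 5 minutes, 3 hours of history is 36 timesteps and a 1-hour prediction horizon is 12. The Reading type and the labeling rule (any failure within the horizon) are our assumptions:

```scala
// One Reading = one 5-minute sample of the 20 sensor values, plus the label.
case class Reading(sensors: Array[Double], failure: Boolean)

// Slide a (window + horizon)-sized frame over the series; the first `window`
// samples form the input, and the example is positive if any failure occurs
// in the following `horizon` samples.
def makeExamples(series: Seq[Reading],
                 window: Int = 36,   // 3 hours of history
                 horizon: Int = 12   // predict failures in the next hour
                ): Seq[(Seq[Array[Double]], Boolean)] =
  series.sliding(window + horizon).map { chunk =>
    val (past, future) = chunk.splitAt(window)
    (past.map(_.sensors), future.exists(_.failure))
  }.toSeq
```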
Use case 1: predictive maintenance (3)
Recurrent net learning

Impact of recurrent nets
– Logistic regression
  • 70% detection with 70% accuracy
– Recurrent neural network
  • 85% detection with 75% accuracy
Use case 2: sentiment analysis (1)
Context
– Social network analysis application developed at Thales (Twitter, Facebook, blogs, forums)
  • Analyzes both the content of the texts and the relations (texts, actors)
– Multiple (big data) analyses
  • Actor community detection
  • Text clustering (themes)
  • …

Focus on
– Sentiment analysis of the collected texts
  • Classify texts based on their sentiment
Use case 2: sentiment analysis (2)
Learning dataset
– Sentiment140 + Kaggle challenge (1.5M labeled tweets)
– 50% positive, 50% negative

Compare bag-of-words + classifier approaches (Naïve Bayes, SVM, logistic regression) versus an RNN (a baseline pipeline sketch follows)
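One of the bag-of-words baselines can be expressed directly as a Spark ML pipeline. A minimal sketch with logistic regression; the input DataFrame layout (columns "text" and "label") and the file path are assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Bag-of-words sentiment baseline: tokenize, hash to term-frequency vectors,
// then fit a logistic regression on the labeled tweets.
val spark = SparkSession.builder.appName("sentiment-bow").getOrCreate()
val tweets = spark.read.parquet("tweets.parquet") // assumed columns: text, label

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(20)

val model = new Pipeline().setStages(Array(tokenizer, tf, lr)).fit(tweets)
```

Swapping the last stage for NaiveBayes or LinearSVC gives the other classical baselines; the RNN replaces the bag-of-words stage entirely, since it consumes the word sequence directly.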
Use case 2: sentiment analysis (3)
Accuracy (%) by number of training examples:

Training examples   NB    SVM   LogReg   Neural Net (perceptron)   RNN (GRU)
100                 61.4  58.4  58.4     55.6                      NA
1 000               70.6  70.6  70.6     70.8                      68.1
10 000              75.4  75.1  75.4     76.1                      72.3
100 000             78.1  76.6  76.9     78.5                      79.2
700 000             80.0  78.3  78.3     80.0                      84.1
Results
[Chart: the accuracies above plotted against training-set size for NB, SVM, LogReg, Neural Net and RNN (GRU)]
The end…
THANK YOU !