using deep learning to do real-time scoring in practical applications

© 2016 LigaData, Inc. All Rights Reserved.

Using Deep Learning to do Real-Time Scoring in Practical Applications

SFbayACM Data Science SIG, Monday, 1/25/2016

By Greg Makowskiwww.Linkedin.com/in/[email protected]

Community @ http://Kamanja.org

Try out

•  Big Picture of 2016 Technology•  Neural Net Basics•  Deep Network Configurations for Practical Applications

–  Auto-Encoder (i.e. data compression or Principal Components Analysis)

–  Convolutional (shift invariance in time or space for voice, image or IoT)–  Real Time Scoring and Lambda Architecture–  Deep Net libraries and tools (Theano, Tourch, TensorFlow, ...

Kamanja) –  Reinforcement Learning, Q-Learning (i.e. beat people at Atari games,

IoT)–  Continuous Space Word Models (i.e. word2vec)

Deep Learning - Outline

David Clearley

Is Deep Learning Hype?

Is this just a “buzzword of the day or year?” Is this improvement at the normal pace?

Is Deep Learning Hype?

Is this just a “buzzword of the day or year?”Is this improvement at the normal pace?

NO !

Not only a buzzwordThis is a leap in the rate of improvement!

So What? Show me…

http://whatsnext.nuance.com/in-the-labs/what-is-deep-machine-learning/

Deep Learning Caused about an 18% / Year Reduction in Error in Speech Recognition (Nuance)

notonlydidDNNsdriveerrorratesdownatonce,…theypromisealotofpoten8alfortheyearstocome.

It is no overstatement to say that DNNs were the single

largest contributor to innovation across many of

our products in recent years.

http://whatsnext.nuance.com/in-the-labs/what-is-deep-machine-learning/

2008, 09, 10, 11, 12, 13, 14, 15

Deep Learning Caused about an 18% / Year Reduction in Error in Speech Recognition (Nuance)

WhatifMoore’sLawchangedfrom2Xto4xoverthelast7years,becauseofanewtechnologyadvance!

notonlydidDNNsdriveerrorratesdownatonce,…theypromisealotofpoten8alfortheyearstocome.

It is no overstatement to say that DNNs were the single

largest contributor to innovation across many of

our products in recent years.

Neural Net training is 10+ times faster on GPU’sThe gaming market is pushing for faster GPU speeds

https://jonpeddie.com/publications/whitepapers/an-analysis-of-the-gpu-market

https://developer.nvidia.com/cudnn



–  Convolutional (shift invariance in time or space for voice, image or IoT)–  Real Time Scoring and Lambda Architecture–  Deep Net libraries and tools (Theano, Tourch, TensorFlow, ...




AdvantagesofaNetoverRegression

10

field1

field2

$

c

$

$

$$

$

$ $

$$$ $$

$

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c c

c

c

c ccc

c

A Regression Solution

“Linear”

Fit one Line

$ c

Targetvaluesforadatapointwithsourcefieldvaluesgraphedby“field1”and“field2”

Showing ONE target field, with values of $ or c https://en.wikipedia.org/wiki/Regression_analysis

Advantages of a Net !over Regression!

11

field1

field2

$

c

$

$

$$

$

$ $

$$$ $$

$

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c c

c

c

c ccc

c

A Neural Net Solution

“Non-Linear”

Severalregions

which arenot adjacent

Hidden nodes can be line or

circle

https://en.wikipedia.org/wiki/Artificial_neural_network

!A Comparison of a Neural Net!

and Regression!

A Logistic regression formula: Y = f( a0 + a1*X1 + a2*X2 + a3*X3) a* are coefficients

Backpropagation, cast in a similar form: H1 = f(w0 + w1*I1 + w2*I2 + w3*I3) H2 = f(w4 + w5*I1 + w6*I2 + w7*I3) : Hn = f(w8 + w9*I1 + w10*I2 + w11*I3)

O1 = f(w12 + w13*H1 + .... + w15*Hn) On = ....

w* are weights, AKA coefficients I1..In are input nodes or input variables. H1..Hn are hidden nodes, which extract features of the data. O1..On are the outputs, which group disjoint categories. Look at ratio of training records v.s. free parameters (complexity, regularization)

a0

a1 a2 a3

X1 X2 X3

Y

Input1 I2 I3

Bias

H1 Hidden2

Output

w1w2

w3

Dot product isCosine similarity,

used broadly

Tensorsare matrices

of N dimensions

Think of Separating Land vs. Water

13

1 line, Regression

(more errors)

5 Hidden Nodes ina Neural Network

Different algorithms use different Basis Functions:•  One line•  Many horizontal & vertical lines•  Many diagonal lines•  Circles

Decision Tree12 splits

(more elements,Less computation)

Q) What is too detailed? “Memorizing high tide boundary” and applying it at all times



–  Convolutional (shift invariance in time or space for voice, image or IoT)–  Real Time Scoring and Lambda Architecture –  Deep Net libraries and tools (Theano, Tourch, TensorFlow, ...




http://deeplearning.net/ http://www.kdnuggets.com/ http://www.analyticbridge.com/

Leading up to an Auto Encoder

•  Supervised Learning–  Regression (one layer, one line, one dot-product)

•  50 inputs à 1 output–  Possible nets:

•  256 à 120 à 1•  256 à 120 à 5 (trees, regression, SVM & most algs are limited to 1

output)•  256 ! 120 ! 60 ! 1 (can try 2 hidden layers, 3 sets of weights)•  256 à 180 à 120 à 60 à 1 (start getting into training stability problems, with 1990’s training processes)

•  Unsupervised Learning–  Clustering (traditional unsupervised):

•  60 inputs (no output target); produce 1-2 new (cluster ID & distance)

Auto Encoder (like data compression)Relate input to output, through compressed middle

At each step of training Only train the black connections

256

256

180

Output (same as input values)

Input

… 256

120

180 …

256

180

… … …

Step 1, Train 1st Hidden Layer (Tensor)

Step 2, Train 2nd Hidden Layer (Tensor)

Called “Auto Encoder” because input values = target values Unsupervised, there are no additional target values

“Data Compression” because Compress 256 numbers into 180 numbers

Auto Encoder (like data compression)Relate input to output, through compressed middle

•  Supervised Learning–  Regression, Tree or Net: 50 inputs à 1 output–  Possible nets:

•  256 à 120 à 1•  256 à 120 à 5 (trees, regressions, SVD and most are limited to 1 output)•  256 à 120 à 60 à 1•  256 à 180 à 120 à 60 à 1

•  Unsupervised Learning–  Clustering (traditional unsupervised):

•  60 inputs (no target); produce 1-2 new (cluster ID & distance)–  Unsupervised training of a net, assign (target record == input record) AUTO-

ENCODING–  Train net in stages,

•  256 à 180 à 256à 120 à

à 120 à à 120 à

•  Add supervised layer to forecast 10 target categoriesà 10

Because of symmetry, Only need to update mirrored weights once

(start getting long training times to stabilize, or may not finish, The BREAKTHROUGH provided by DEEP LEARNING)

4 hidden layers w/ unsupervised training 1 layer at end w/ supervised training https://en.wikipedia.org/wiki/Deep_learning

Auto Encoder (like data compression)With Supervised Layers on Top

Unsupervised output Like cluster output,

Only large values are a match (not distance)

Train Supervised Layers on Top Regular Back Propagation

Using unsupervised nodes as input

256

180

120

120

:

Target specific to the problem Fraud risk * $ Cat, dog, human, other

256

180

120

120 :

50

1, 2, 10 or…

Auto EncoderHow it can be generally used to solve problems

•  Add supervised layer to forecast 10 target categories–  4 hidden layers trained with unuspervised training,

–  1 new layer, trained with supervised learning à 10

•  Outlier detection

•  The “activation” at each of the 120 output nodes indicates the “match” to that cluster or compressed feature

•  When scoring new records, can detect outliers with a process likeIf ( max_output_match < 0.333) then suspected outlier

•  How is it like PCA?–  Individual hidden nodes in the same layer are “different” or “orthogonal”

Fraud Detection Example using Deep Learning – auto encoders

•  Unsupervised Learning of Normal Behavior(Outlier Detection)–  May want to preprocess transaction data - in the context of the

person’s past normal behavior•  0..1, where 1 is the most SURPRISING for that person to act•  0..1, where 1 is the most RISKY of fraud•  General, descriptive attributes that can be used for interactions•  Filter out from the training data – the most surprising & risky•  Want to the net to learn “normal” records

–  Train 5-10 layers deep, end up with 50 to 100+ nodes at end–  Score records on membership in final nodes

•  Transactions that are far from all final nodes are candidates for outliers•  Validate with existing surprising & risky. Add application post-processing

•  Supervised Learning–  Add two layers on top, train to predict normal vs. surprising/risky

labeled data (if it is available)

Internet of Things (IoT) is heavily signal data

http://www.datasciencecentral.com/profiles/blogs/the-internet-of-things-data-science-and-big-data

Convolutional Neural Net (CNN)Enables detecting shift invariant patterns

In Speech and Image applications, patterns vary by size, can be shifted right or leftChallenge: finding a bounding box for a pattern is almost as hard as detecting the pat.

Neural Nets can be explicitly trained to provide a FFT (Fast Fourier Transform)to convert data from time domain to the frequency domain – but typically an explicit FFT is used

Internet ofThings Signal

Data

Convolutional Neural Net (CNN)Enables detecting shift invariant patterns

In Speech and Image applications, patterns vary by size, can be shifted right or leftChallenge: finding a bounding box for a pattern is almost as hard as detecting the pat.Solution: use a siding convolution to detect the pattern

CNN can use very long observational windows, up to 400 ms, long context

Convolution – Shift Horizontal

•  SAME25WEIGHTSFEEDINTOEACHOUTPUT

•  Backpropaga8onweightupdateisaveraged

•  OtherwiseNOconvolu8onandHUGEcomplexity!

Max pooling Layer output = 1.2

Convolution

https://en.wikipedia.org/wiki/Convolution

Convolution – Shift Horizontal & Vertical

Max pooling Layer output

= 0.8

Convolution – 3 Weight Patterns, Shifted 2D

Hidden Layer 1 Output Sections Per Convolution 3(10x3) – detection layer

Hidden Layer 1 WeightsPer Conv.

Pattern

Input pixels, audio, video or IoT signal

(14 x 7).

Convolutions can be over 3+

dimensions

(video frames, time invariance)

Max pooling layer output = 0.8 Max pooling layer output = 1.0 Max pooling layer output = 0.9

Convolution Neural Net (CNN)Same Low Level Features can support different output

http://stats.stackexchange.com/questions/146413/why-convolutional-neural-networks-belong-to-deep-learning

Previous Slides Showed Training this Hidden 1 Layer

Same Training processfor later hidden layers,

one at a time

Think of fraud detectionhigher level

node patterns

Convolution Neural Net: from LeNet-5

Gradient-Based Learning Applied to Document RecognitionProceedings of the IEEE, Nov 1998Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner

DirectorFacebook, AI Research

http://yann.lecun.com/ Can do some size invariance, but it adds to the layers

Convolution Neural Net (CNN)•  How is a CNN trained differently than a typical back

propagation (BP) network?

–  Parts of the training which is the same:•  Present input record•  Forward pass through the network •  Back propagate error (i.e. per epoch)

–  Different parts of training:•  Some connections are CONSTRAINED to the same value

–  The connections for the same pattern, sliding over all input space•  Error updates are averaged and applied equally to the one set of weight

values•  End up with the same pattern detector feeding many nodes at the next

level

http://www.cs.toronto.edu/~rgrosse/icml09-cdbn.pdf Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, 2009

The Mammalian Visual Cortex is Hierarchical (The Brain is a Deep Neural Net - Yann LeCun)

http://www.pamitc.org/cvpr15/files/lecun-20150610-cvpr-keynote.pdf

0

1

2

3 4

5 6

7 8

9

10 11

Convolution Neural Net (CNN)Facebook example

https://gigaom.com/2014/03/18/facebook-shows-off-its-deep-learning-skills-with-deepface/

Convolution Neural Net (CNN)Yahoo + Stanford example – find a face in a pic, even upside down

http://www.dailymail.co.uk/sciencetech/article-2958597/Facial-recognition-breakthrough-Deep-Dense-software-spots-faces-images-partially-hidden-UPSIDE-DOWN.html

Convolutional Neural Nets (CNN)Robotic Grasp Detection (IoT)

http://pjreddie.com/media/files/papers/grasp_detection_1.pdf

Real Time Scoring Neural Net Optimizations•  Auto-Encoding nets

–  Can grow to millions of connections, and start to get computational–  Can reduce connections by 5% to 25+% with pruning & retraining

•  Train with increased regularization settings•  Drop connections with near zero weights, then retrain•  Drop nodes with fan in connections which don’t get used much later,

such as in your predictive problem•  Perform sensitivity analysis – delete possible input fields

•  Convolutional Neural Nets–  With large enough data, can even skip the FFT preprocessing step –  Can use wider than 10ms audio sampling rates for speed up

•  Implement other preprocessing as lookup tables (i.e. Bayesian Priors)

•  Use cloud computing, do not limit to device computing•  Large models don’t fit à use model or data parallelism to

train

© 2016 LigaData, Inc. All Rights Reserved. 39

Real Time Scoring – Enterprise App Architecture usesLambda Architecture – for both Batch and Real Time

•  First architecture to really define how batch and stream processing can work together•  Founded on the concepts of immutability and re-computation, with human fault tolerance•  Pre-computes the results of batch & real-time processes as a set of views, & query layer

merges the views

https://en.wikipedia.org/wiki/Lambda_architecture

Speed Layer

Batch Layer

Query

Master Data

Batch Processing

SpeedProcessing Real-time

AnalyticsSpeedViews

BatchViews

MergedViewsNew

Data B

New Data A


HDFS SparkMap Reduce

SparkStreaming

Storm

Real Time Scoring Lambda Architecture With Kamanja

Decisioning Layer

Batch Layer

Query

Master Data

Batch Processing Real-time Analytics

ActionQueue

Serving Layer

SpeedViews

BatchViews

MergedViews

KafkaMQFiles

AVRO/Flume

Continuous Decisioning

CassandraHBaseDruid

PMML, Java, Scala, Python,Deep Learning

KafkaMQ

AllNew Data

Speed LayerSpeed

Processing

DecisioningData

Continuous Feedback

Model

Data

Elephant DBImpala


Kamanja Technology Stack

Compute Fabric

Cloud, EC2Internal Cloud

Security

KerberosReal Time Streaming

Kafka, MQ

Spark*

LigaData

Data Store

HBase,Cassandra,

InfluxDBHDFS

(Create adaptors tointegrateothers)

Resource Management

Zookeeper,Yarn*

High Level Languages / Abstractions

PMML Producers, MLlib

Real Time ComputingKamanja

PMML, Java, Scala

Deep Learning Tools

Deep Learning Tools

By Google, 600 DL proj Speech Google Photos Translation Gmail Search

Rajat Monga, Tech Lead & Manager for TensorFlow

Deep Learning Tools

https://www.tensorflow.org/versions/0.6.0/get_started/index.html

Python code to make up data in two dimensions and then fit it

Deep Learning Tools

www.Kamanja.org

Reinforcement Learning (RL)•  Different than supervised and unsupervised learning

•  Q) Can the network figure out hot to take one or more actions NOW, to achieve a reward or payout (potentially far-off, i.e. T steps in the FUTURE?

•  Need to solve the credit assignment problem–  There is no teacher and very little labeled data–  Need to learn the best POLICY that will achieve the best outcome–  Assume no knowledge of the process model or reward function

•  Next guess =–  Linear combination of ((current guess) and

(the new reward info just collected)), weighted by the learning ratehttp://www.humphreysheil.com/blog/gorila-google-reinforcement-learning-architecture http://robotics.ai.uiuc.edu/~scandido/?Developing_Reinforcement_Learning_from_the_Bellman_Equation

Deep Reinforcement Learning (RL), Q-Learning

http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf David Silver, Google DeepMind https://en.wikipedia.org/wiki/Reinforcement_learning https://en.wikipedia.org/wiki/Q-learning

Think in terms of IoT….Device agent measures, infers user’s actionMaximizes future reward, recommends to user or system

Deep Reinforcement Learning, Q-Learning (Think about IoT possibilities)

http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf David Silver, Google DeepMind

Use last 4

screen shots

Deep Reinforcement Learning, Q-Learning

http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf David Silver, Google DeepMind

Use 4 screen shots

Use 4 screen shots

IoT challenge: How to replace game score with IoT score?

Shift right fast shift right stay shift left shift left fast

Deep Reinforcement Learning, Q-Learninghttp://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:silver-iclr2015.pdf David Silver, Google DeepMind

Games w/ best Q-learningVideo PinballBreakoutStar GunnerCrazy ClimberGopher



–  Convolutional (shift invariance in time or space for voice, image or IoT)–  Real Time Scoring–  Deep Net libraries and tools (Theano, Tourch, TensorFlow, ...




Continuous Space Word Models (word2vec)

•  Before (a predictive “Bag of Words” model):–  One row per document, paragraph or web page–  Binary word space: 10k to 200k columns, one per word or phrase

0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 …. “This word space model is ….”

–  The “Bag of words model” relates input record to a target category

Continuous Space Word Models (word2vec)

•  Before (a predictive “Bag of Words” model):–  One row per document, paragraph or web page–  Binary word space: 10k to 200k columns, one per word or phrase

0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 …. “This word space model is ….”–  The “Bag of words model” relates input record to a target category

•  New:–  One row per word (word2vec), possibly per sentence (sent2vec)–  Continuous word space: 100 to 300 columns, continuous values

.01 .05 .02 .00 .00 .68 .01 .01 .35 ... .00 à “King”

.00 .00 .05 .01 .49 .52 .00 .11 .84 ... .01 à “Queen”–  The deep net training resulted in an Emergent Property:

•  Numeric geometry location relates to concept space•  “King” – “man” + “woman” = “Queen” (math to change gender relation)•  “USA” – “Washington DC” + “England” = “London” (math for capital

relation)

Continuous Space Word Models (word2vec)How to SCALE to larger vocabularies?

http://www.slideshare.net/hustwj/cikm-keynotenov2014?qid=f92c9e86-feea-41ac-a099-d086efa6fac1&v=default&b=&from_search=2

Training Continuous Space Word Models

•  How to Train These Models?–  Raw data: “This example sentence shows the word2vec model

training.”


•  How to Train These Models?–  Raw data: “This example sentence shows the word2vec model

training.”–  Training data (with target values underscored, and other words as

input)“This example sentence shows word2vec” (prune “the”)“example sentence shows word2vec model”“sentence shows word2vec model training”

–  The context of the 2 to 5 prior and following words predict the middle word

–  Deep Net model architecture, data compression to 300 continuous nodes

•  50k binary word input vector à ... à 300 à ... à 50k word target vector


•  Use Pre-Trained Models https://code.google.com/p/word2vec/ –  Trained on 100 billion words from Google News–  300 dim vectors for 3 million words and phrases–  https://code.google.com/p/word2vec/

•  Questions on re-use: –  What if I want to train to add client terms or docs?–  What about stability (keeping past training) vs. placticity (learning

new content)


http://www.slideshare.net/hustwj/cikm-keynotenov2014?qid=f92c9e86-feea-41ac-a099-d086efa6fac1&v=default&b=&from_search=2

Applying Continuous Space Word Models

http://static.googleusercontent.com/media/research.google.com/en//people/jeff/BayLearn2015.pdf State of the art in machine translationSequence to Sequence Learning with neural Networks, NIPS 2014

Language translationDocument summaryGenerate text captions for pictures

.01

.05

.89

.00

.05

.62

.00

.34

“Greg’s Guts” on Deep Learning•  Some claim the need for preprocessing and knowledge

representation has ended–  For most of the signal processing applications à yes, simplify–  I am VERY READY TO COMPETE in other applications, continuing

•  expressing explicit domain knowledge – using lookup data for context•  optimizing business value calculations

•  Deep Learning gets big advantages from big data–  Why? Better populating high dimensional space combination

subsets–  Unsupervised feature extraction reduces need for large labeled data

•  However, “regular sized data” gets a big boost as well–  The “ratio of free parameters” (i.e. neurons) to training set records–  For regressions or regular nets, want 5-10 times as many records –  Regularization and weight drop out reduces this pressure–  Especially when only training “the next auto encoding layer”

Deep Learning Summary – ITS EXCITING!

•  Discussed Deep Learning architectures–  Auto Encoder, convolutional, reinforcement learning, continuous word

•  Real Time speed up–  Train model, reduce complexity, retrain–  Simplify preprocessing with lookup tables–  Use cloud computing, do not be limited to device computing–  Lambda architecture like Kamanja, to combine real time and batch

•  Applications–  Fraud detection–  Signal Data: IoT, Speech, Images–  Control System models (like Atari game playing, IoT)–  Language Models

https://www.quora.com/Why-is-deep-learning-in-such-demand-now

© 2016 LigaData, Inc. All Rights Reserved.

Using Deep Learning to do Real-Time Scoring in Practical Applications

SFbayACM Data Science Meetup Monday 1/25/2016

By Greg Makowskiwww.Linkedin.com/in/[email protected]

Community @ http://Kamanja.org

Try out

using deep learning to do real-time scoring in practical applications

Software