![Page 1: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/1.jpg)
Continuous Machine and Deep Learning at Scale With Apache Ignite
Ken Cottrell
Solution Architect
![Page 2: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/2.jpg)
2019 © GridGain Systems
Agenda
▪ Continuous Machine Learning / Deep Learning Introduction
▪ Overview of Apache Ignite Continuous ML/DL Capabilities
▪ Demo & API discussion
▪ Q & A
![Page 3: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/3.jpg)
Why Machine Learning at Scale?
Scalability
• Data exceed capacity of single server
• Burden for development and business
Models trained & then deployed in different systems
• Move data out for training
• Wait for training to complete
• Redeploy models in production
![Page 4: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/4.jpg)
Machine Learning Pipelines: where is the time spent?
![Page 5: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/5.jpg)
Continuous Machine Learning at Scale
Before
• App: storing and processing working set
• Periodic ETL of terabytes of data
• Loading data for training
• Model training & testing
• Periodic update of models
After (With CL)
• App + ML/DL Engine: storing and processing working set
• Model training & testing
• Instant updates of models
• No ETL
![Page 6: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/6.jpg)
Apache Ignite Is a Top 5 Apache Project
A Top 5 Apache Software Foundation Project
• Top 5 Dev Mailing Lists
• Top 5 User Mailing Lists
From January 1, 2019 Apache Software Foundation Blog Post: “Apache in 2018 – By The Digits”
[Chart: Monthly Ignite/GridGain Downloads, Apr-14 through Dec-18; y-axis from 0 to 200,000. Est. 15M today, Apache site and Docker site]
![Page 7: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/7.jpg)
Apache Ignite Users
Reliance
Financial Services
FinTech
Software/Cloud
Telecom & Mobile
IoT
AdTech / Media / Entertainment
Logistics & Transportation
eCommerce & Retail
Pharma & Healthcare
![Page 8: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/8.jpg)
Apache Ignite In-Memory Computing Platform
Application Layer: Web, SaaS, Social, Mobile, IoT
Machine and Deep Learning
Key-Value, SQL, Transactions, Messaging, Streaming, Events
Compute Grid, Service Grid
In-Memory Data Store
Persistent Layer: RDBMS, NoSQL, Hadoop, Mainframe, Ignite Persistence
Operational capabilities shown around the stack:
• Rolling Upgrades
• Security & Auditing
• Monitoring & Management
• Segmentation Protection
• Data Center Replication
• Network Backups
• Full, Incremental, Continuous Backups
• Point-in-Time Recovery
• Heterogeneous Recovery
Legend: Apache Ignite Features, GridGain Enterprise Features
![Page 9: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/9.jpg)
Overview of Apache Ignite Continuous ML/DL Capabilities
![Page 10: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/10.jpg)
Apache Ignite Continuous Learning framework
Transactional Persistence
Distributed Machine Learning Datasets
TensorFlow, Regressions, K-Means, Decision Trees
In-Memory Data Store
Distributed In-Memory Machine and Deep Learning
Compute and Service Grid
C++, .NET, Java, Python, Binary Protocol (Thin client)
• Distributed Algorithms
• Large Scale Parallelization
• Multi-language Support
• No ETL
• Distributed Dataset based on partitioned caches
![Page 11: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/11.jpg)
Partitions Distribution and Replication
[Diagram: partitions 0 to 3 shown as Primary and Backup copies spread across Node 1, Node 2, Node 3, and Node 4]
Co-Located by Partition:
• Transactional Data
• Vectorized Data
• Training context data
• Other Computation functions
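The idea of keys mapping to partitions, and partitions mapping to a primary and a backup node, can be sketched in plain Python. This is only an illustration under stated assumptions: `partition_for` and `owners_for` are hypothetical helpers, the CRC32 hash and the round-robin placement are stand-ins chosen for determinism, and none of this is Ignite's actual affinity function.

```python
import zlib
from typing import List, Tuple

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def owners_for(partition: int, nodes: List[str]) -> Tuple[str, str]:
    """Return (primary, backup) for a partition using simple round-robin placement."""
    primary = nodes[partition % len(nodes)]
    backup = nodes[(partition + 1) % len(nodes)]
    return primary, backup

nodes = ["node1", "node2", "node3", "node4"]

# Placement table for partitions 0..3, mirroring the four-node diagram.
placement = {p: owners_for(p, nodes) for p in range(4)}

# Records that share an affinity key land in the same partition, so a
# transaction row and its feature vector can be processed on the same node.
affinity_key = "customer-42"
transaction_partition = partition_for(affinity_key, 4)
vector_partition = partition_for(affinity_key, 4)
```

Because both lookups use the same key, `transaction_partition` and `vector_partition` are always equal, which is the co-location property the slide lists.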
![Page 12: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/12.jpg)
Continuous Learning enabled with Partitioned Datasets
[Diagram: an Application connected to two Ignite Nodes, one holding P1 with C and D, the other holding P2 with C and D]
P = Partition
C = Partition Context
D = Partition Data
D* = Local ETL
Redundant (replicated), parallel jobs on each node:
▪ Pre-Process
▪ Vectorize
▪ Train
Map Training
Reduce Training Results
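The map and reduce steps can be sketched in plain Python for a simple least-squares line fit. This is a minimal illustration, not the Ignite API: `map_partition`, `reduce_stats`, and `solve` are hypothetical names, and the two in-memory lists stand in for partitions P1 and P2.

```python
from functools import reduce

def map_partition(rows):
    """Local step: sufficient statistics for y = slope * x + intercept on one partition."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)

def reduce_stats(a, b):
    """Reduce step: statistics from two partitions simply add up."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Closed-form least squares from the combined statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two partitions holding (x, y) rows that lie on the line y = 2x + 1.
partitions = {
    "P1": [(0.0, 1.0), (1.0, 3.0)],
    "P2": [(2.0, 5.0), (3.0, 7.0)],
}

partials = [map_partition(rows) for rows in partitions.values()]  # map
slope, intercept = solve(reduce(reduce_stats, partials))          # reduce
```

Only the five numbers per partition cross the network in this sketch; the rows themselves stay where they are stored.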
![Page 13: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/13.jpg)
Apache Ignite Distributed Training: Clustering
• K-means (Centroid Mean)
• GMM (Centroid Mean + Variance)
• Use Cases - OLTP and other tabular data that need to be Labeled
– Customer Segmentation
– Anomaly Detection
– Network throughput characterization
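For readers new to K-means, the centroid-mean idea can be sketched in a few lines of plain Python. This is a single-process illustration with hypothetical helper names (`nearest`, `kmeans`) and made-up points; it does not use or reproduce Ignite's distributed trainer.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
labels = [nearest(p, centroids) for p in points]  # cluster label per row
```

The `labels` list is the "labeling" step the use cases refer to: each unlabeled row receives the index of its cluster.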
![Page 14: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/14.jpg)
Apache Ignite Distributed Training: Classification
• Logistic Regression & Naive Bayes
• SVM, KNN, ANN
• Decision trees & Random Forest
• Use cases - Operational (OLTP) data prediction:
– Fraud detection
– Credit Card Scoring
– Clinical Trials
– Customer Segmentation
![Page 15: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/15.jpg)
Apache Ignite Distributed Training: Regression
• KNN & Linear Regressions
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression
• Use cases - Operational data (OLTP) predictions:
– Trend analysis
– Financial forecasting
– Time series prediction
– Response modeling (pharma etc)
![Page 16: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/16.jpg)
Apache Ignite: TensorFlow Integration
>>> # Requires TensorFlow 1.x: tf.contrib and tf.Session are not available in TensorFlow 2
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> # Read the Ignite cache as a TensorFlow dataset
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
...     for _ in range(3):
...         print(sess.run(next_obj))
...
{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
Use Cases - Operational “High dimension” data (Images, Text, Audio, Speech):
• Image data classification
• Natural Language Processing: Clinical notes
• Document Classification, Free Form text
![Page 17: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/17.jpg)
Apache Ignite Distributed Preprocessing: Normalization
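One common meaning of normalization is rescaling each feature vector to unit p-norm. The sketch below assumes that meaning; the slide text does not say which definition is intended, and `normalize` is a hypothetical helper in plain Python rather than an Ignite call.

```python
def normalize(vector, p=2):
    """Divide a vector by its p-norm so the result has p-norm 1 (zero vectors pass through)."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)
    return [v / norm for v in vector]

# Each row is handled independently, so this step needs no data from other
# rows or other partitions.
rows = [[3.0, 4.0], [1.0, 0.0]]
normalized_rows = [normalize(row) for row in rows]
```

Because the operation is per row, it parallelizes trivially across partitions.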
![Page 18: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/18.jpg)
Apache Ignite Distributed Preprocessing: Scaling
https://medium.com/@nsethi610/data-cleaning-scale-and-normalize-data-4a7c781dd628
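Min-max scaling is one common form of scaling, and it shows why a distributed version needs a reduce step: the global minimum and maximum of each column must be known before any row can be scaled. The sketch below is plain Python under that assumption; `local_min_max`, `merge_min_max`, and `scale` are hypothetical names and the partition contents are made up.

```python
from functools import reduce

def local_min_max(rows):
    """Per-partition step: column-wise minimum and maximum of the local rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def merge_min_max(a, b):
    """Reduce step: combine the statistics of two partitions."""
    mins = [min(x, y) for x, y in zip(a[0], b[0])]
    maxs = [max(x, y) for x, y in zip(a[1], b[1])]
    return mins, maxs

def scale(row, mins, maxs):
    """Map each feature into [0, 1]; constant columns become 0.0."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]

partitions = [
    [[1.0, 100.0], [3.0, 300.0]],  # rows stored in one partition
    [[5.0, 200.0]],                # rows stored in another
]

mins, maxs = reduce(merge_min_max, [local_min_max(rows) for rows in partitions])
scaled = [[scale(row, mins, maxs) for row in rows] for rows in partitions]
```

After the reduce, each partition can scale its own rows locally with the shared `mins` and `maxs`.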
![Page 19: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/19.jpg)
Apache Ignite Distributed Preprocessing: One-Hot Encoder
* Also included: String Encoding
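String encoding and one-hot encoding can be illustrated together in plain Python. The helpers `fit_string_encoder` and `one_hot` are hypothetical, and the category values are illustrative port codes, not output from Ignite's encoders.

```python
def fit_string_encoder(values):
    """String encoding: assign each distinct category a stable integer index."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

def one_hot(value, encoder):
    """One-hot encoding: a vector with 1.0 at the category's index and 0.0 elsewhere."""
    vector = [0.0] * len(encoder)
    vector[encoder[value]] = 1.0
    return vector

# Illustrative categorical column (assumed values).
embarked = ["S", "C", "S", "Q"]

encoder = fit_string_encoder(embarked)
encoded = [one_hot(value, encoder) for value in embarked]
```

String encoding alone yields one integer per row, which some algorithms misread as an ordering; the one-hot form avoids that at the cost of one column per category.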
![Page 20: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/20.jpg)
Achieving Continuous ML/DL at Scale: Architectural Considerations / Trade-Offs
• Operational Data Models: from centralized to parallelized
– De-normalization, Data Affinity for parallel Loads, Queries, Updates, Joins
– Horizontal scale-out
• Done locally in node: data partition + preprocessing + training + inferencing
– Reduces data shuffling over the network between the cluster and application
• ML pipeline enhancements
– Co-Located & Distributed processing of all ML steps: ingest to inferencing
– ML model performance measured, and updatable, with nearby transaction data
![Page 21: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/21.jpg)
Demo & API discussion
![Page 22: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/22.jpg)
package org.apache.ignite.examples.ml
Adding your own Preprocessor and Algorithm to a Dataset
• dataset/AlgorithmSpecificDatasetExample.java
Passing custom preprocessor classes to the cluster
• environment/TrainingWithCustomPreprocessorsExample.java
TensorFlow data set, inferencing at the cluster nodes
• inference/TensorFlowDistributedInferenceExample.java
Decision tree
• tree/FraudDetectionExample.java
End-to-End Model Prep & Training Tutorial (shows feature preprocessing, transformation, different algorithm comparisons, accuracy metrics, pipelines)
• tutorial/*.java // pipeline to preprocess, train, & evaluate Titanic passenger data
![Page 23: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/23.jpg)
Ignite ML API to Update the model
// Train an initial model on the first cache
SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
// Update the existing model with the data in a second cache
SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
DatasetTrainer interface:
(Some constraints according to the algorithm)
Online / Online Batch with new data
• Centroid updates – KMeans, ANN
• Add new dataset – KNN
• Update with new gradient – NN, Logistic Regression, Linear Regression
• Increment to current state – SVM, GDB
• Decision Tree – retrain
• Random Forest – adds new DT, may discard other DTs for size management
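The "centroid updates" style of model update can be sketched as a sequential (online) K-means step in plain Python. This only mirrors the fit-then-update pattern; `fit`, `update`, the dictionary-based model, and the sample batches are all hypothetical and are not how Ignite implements `DatasetTrainer.update`.

```python
def nearest(point, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def update(model, batch):
    """Return a new model in which each new point nudges its nearest centroid.

    With a per-centroid count n, the step c += (x - c) / n keeps every
    centroid equal to the running mean of the points assigned to it.
    """
    centroids = [list(c) for c in model["centroids"]]
    counts = list(model["counts"])
    for point in batch:
        i = nearest(point, centroids)
        counts[i] += 1
        rate = 1.0 / counts[i]
        centroids[i] = [c + rate * (p - c) for c, p in zip(centroids[i], point)]
    return {"centroids": centroids, "counts": counts}

def fit(batch, initial_centroids):
    """Initial training is an update from zero counts (the first point replaces the seed)."""
    seed = {"centroids": initial_centroids, "counts": [0] * len(initial_centroids)}
    return update(seed, batch)

mdl1 = fit([(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)], [(0.0, 0.0), (10.0, 10.0)])
mdl2 = update(mdl1, [(4.0, 0.0), (12.0, 10.0)])  # new data only, no retraining
```

Note that `update` never revisits the first batch: the stored counts carry enough state to fold in new data, which is what makes this family of algorithms suitable for online updates.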
![Page 24: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/24.jpg)
Demo tutorial sample
To run this example:
• Import this directory with pom.xml into your favorite IDE as a Maven project
– <path>\apache-ignite-2.7.6-bin\examples\pom.xml
• I’ll run this job on a single node on my laptop in Eclipse (normally you would run jobs on a cluster of nodes)
– Each of these steps can be run independently or all together with TutorialStepByStepExample.java
– Widely used Titanic data set (we include it here)
• Discussion of how the Apache Ignite API can be invoked by 3rd-party Auto ML and other application wrappers
• Compare the accuracy obtained with different ML steps
– Accuracy defined as % of correct predictions versus ground truth
– Different algorithms and different preprocessing
– Effects of Test / Train split on Overfitting
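The accuracy definition and the train/test split can be written out in plain Python. This is a minimal sketch with hypothetical helpers (`train_test_split`, `accuracy`) and made-up predictions; it is not the evaluator used in the tutorial code.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then hold out the last test_fraction of rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, truth):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(1 for p, t in zip(predictions, truth) if p == t)
    return correct / len(truth)

train_rows, test_rows = train_test_split(range(8))

# Three of the four predictions match the ground truth.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Measuring `accuracy` on `test_rows` rather than `train_rows` is what exposes overfitting: a model that memorized the training rows scores well there but worse on rows it has not seen.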
![Page 25: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/25.jpg)
Apache Ignite Spark integration
Write to Ignite DataFrame from within Spark session
Read from same Ignite DataFrame from another Spark session
• DF (and RDD) shared across sessions
• SQL with Indexing for faster queries
• Ignite DFs are mutable
![Page 26: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/26.jpg)
To Summarize: Apache Ignite for Continuous Learning at Scale
Massive Scale for Memory, Storage, Computation
• Massive Throughput with minimal ETL
• Massive operational data sizes + in-place parallel processing
• Faster cycle times from transactions, ML/DL dataset extraction, predictions
Integrates with Existing ML / DL operations
• Low-level Distributed APIs to integrate with Auto ML and other Data Science workflows
• For End-Users: Python API to manage Cache, Datasets, SQL, ML
• Apache Ignite integrations to accelerate Spark, TensorFlow pipelines, including Model imports from other tool sets
![Page 27: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/27.jpg)
Resources
• Documentation:
– https://apacheignite.readme.io/docs
• Python support
– https://github.com/gridgain/ml-python-api
• Examples and Tutorials:
– https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml
• Details on TensorFlow
– https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb
![Page 28: Continuous Machine and Deep Learning at Scale With Apache Ignite · 2020-03-11 · Mainframe Ignite Persistence NoSQL Hadoop Persistent Layer RDBMS Machine and Deep Learning Messaging](https://reader035.vdocument.in/reader035/viewer/2022062602/5f01e1547e708231d4017caa/html5/thumbnails/28.jpg)
Q & A