© 2017 MapR Technologies 1
Machine Learning Success:
The Key to Easier Model Management
Contact Information
Ellen Friedman, PhD
Principal Technologist, MapR Technologies
Committer Apache Drill & Apache Mahout projects
O’Reilly author
Twitter @Ellen_Friedman
Machine Learning Everywhere
Image courtesy of Mtell, used with permission. Images © Ellen Friedman.
Traditional View
Traditional View: This isn’t the whole story
90% of the effort in successful
machine learning isn’t the
algorithm or the model…
It’s the logistics
Why?
• Just getting the training data is hard
– Which data? How to make it accessible? Multiple sources!
– New kinds of observations force restarts
– Requires a ton of domain knowledge
• The myth of the unitary model
– You can’t train just one
– You will have dozens of models, likely hundreds or more
– Handoff to new versions is tricky
What Machine Learning Tool is Best?
• Most successful groups keep several “favorite” machine
learning tools at hand
– No single tool is best in every situation
• The most important tool is a platform that supports logistics well
– Don’t have to do everything at the application level
– Lots of what matters can be handled at the platform level
• A good design can make a big difference
Rendezvous Architecture
[Figure: Rendezvous architecture. Labels: request, Input, Model 1, Model 2, Model 3, Scores, Rendezvous, Results, response]
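A minimal sketch of the rendezvous idea, assuming in-memory lists stand in for the Input and Scores streams. The names `Score`, `run_models`, `rendezvous` and the preference-order policy are hypothetical illustrations, not taken from the talk. Every model scores every request, and a separate rendezvous step decides which score becomes the response.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Score:
    request_id: int
    model: str
    value: float


def run_models(
    request_id: int,
    features: List[float],
    models: Dict[str, Callable[[List[float]], float]],
) -> List[Score]:
    """Every model sees the same request and contributes one score."""
    return [Score(request_id, name, fn(features)) for name, fn in models.items()]


def rendezvous(scores: List[Score], preference: List[str]) -> Optional[Score]:
    """Return the score of the most preferred model that answered, else None.

    The preference list is an assumed selection policy for illustration.
    """
    by_model = {s.model: s for s in scores}
    for name in preference:
        if name in by_model:
            return by_model[name]
    return None


# Toy stand-ins for real models (assumption: each maps features to a float).
models = {
    "model_1": lambda x: 0.1 * sum(x),
    "model_2": lambda x: 0.2 * sum(x),
    "model_3": lambda x: 0.3 * sum(x),
}

scores = run_models(request_id=7, features=[1.0, 2.0], models=models)
chosen = rendezvous(scores, preference=["model_2", "model_1"])
```

In this sketch all three models keep producing scores even though only one is returned, so switching to another model only means changing `preference`.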
Rendezvous to the Rescue: Better ML Logistics
• Stream-1st architecture is a powerful approach with surprisingly
widespread advantages
– Innovative technologies are emerging for streaming data
• Microservices approach provides flexibility
– Streaming supports microservices (if done right)
• Containers remove surprises
– Predictable environment for running models
Rendezvous: Mainly for Decisioning Type Systems
• Decisioning style machine learning
– Looking for a “right answer”
– Simpler than interactive machine learning (such as in a self-driving car)
• Examples include:
– Fraud detection
– Predictive analytics / market prediction
– Churn prediction (as in telecommunications)
– Yield optimization
– Deep learning in the form of speech or image recognition, in some cases
Why Stream?
[Image: Munich surfing wave] Image © 2017 Ellen Friedman
Streaming data has value beyond
real-time insights
Heart of Stream-1st Architecture: Message Transport
[Figure labels: Medical tests, Medical test results, Real-time analytics, EMR, Patient, Facilities management, Insurance audit; use-case classes A, B, C]
The right messaging tool supports multiple classes of use cases (A, B, C in figure)
Image © 2016 Ted Dunning & Ellen Friedman, from Chap 1 of the O’Reilly book Streaming Architecture, used with permission
Stream Transport that Decouples Producers & Consumers
[Figure: producers (P) write to a stream transport (Kafka / MapR Streams); consumers (C) read from it. Labels: Transport, Processing]
MapR Streams in the MapR Converged Data Platform
[Figure: MapR Converged Data Platform. Components: MapR-FS (Enterprise Storage), MapR-DB (Database), MapR Streams (Event Streaming). Platform services: Global Namespace, High Availability, Data Protection, Self-healing, Unified Security, Real-time, Multi-tenancy]
• Helps build a global data fabric
• Multiple types of storage engineered into one technology
• Under the same security & administration
With MapR, Geo-Distributed Data Appears Local
[Figure, built up over three slides. Labels: Data source, stream, stream, Consumer, Regional Data Center, Global Data Center]
Stream transport supports microservices
Stream-1st Architecture: Basis for Microservices
Stream instead of database as the shared “truth”
[Figure labels: POS 1..n, Fraud detector, Last card use, Updater, Card analytics, Other, card activity]
Image © 2016 Ted Dunning & Ellen Friedman, from Chap 6 of the O’Reilly book Streaming Architecture, used with permission
Features of Good Streaming
• It is Persistent
– Messages stick around for other consumers
– Consumers don’t affect producers
– Consumer doesn’t have to be online when message arrives
• It is Performant
– You don’t have to worry if a stream can keep up
• It is Pervasive
– It is there whenever you need it, no need to deploy anything
– How much work is it to create a new file? Why harder for a stream?
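The persistence property above can be illustrated with a toy append-only log in which each consumer tracks its own read position. This is a sketch under stated assumptions: the `Stream` class and its `publish` and `poll` methods are hypothetical and are not the Kafka or MapR Streams API.

```python
from typing import Dict, List


class Stream:
    """Toy append-only log: reading a message never removes it."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._offsets: Dict[str, int] = {}  # per-consumer read position

    def publish(self, message: str) -> None:
        # Producers only append; they know nothing about consumers.
        self._log.append(message)

    def poll(self, consumer: str) -> List[str]:
        """Return messages this consumer has not seen yet and advance its offset."""
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._log)
        return self._log[start:]


stream = Stream()
stream.publish("a")
stream.publish("b")
first = stream.poll("analytics")   # online from the start
stream.publish("c")
late = stream.poll("audit")        # was offline; still sees every message
again = stream.poll("analytics")   # only what arrived since its last poll
```

Because each consumer keeps its own offset, adding the `audit` consumer later changes nothing for the producer or for `analytics`.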
Raw data is gold!
Raw Data & Training Data Are Key to Success
[Figure labels: The world, request, Raw, Add external data, Database, Input, Model 1, Model 2, Model 3]
Raw data may contain features you’ll want in future
Quality & Reproducibility of Input Data is Important!
• Recording raw-ish data is really a big deal
– Data as seen by a model is worth gold
– Data reconstructed later often has time-machine leaks
– Databases were made for updates; streams are safer
• Raw data is useful for non-ML cases as well (think flexibility)
• A decoy model records training data as seen by models under
development & evaluation
Decoy Model in the Rendezvous Architecture
[Figure labels: Input, Decoy, Model 2, Model 3, Scores, Archive]
• Looks like a server, but it just archives inputs
• Safe in a good streaming environment, less safe without good isolation
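A minimal sketch of a decoy, assuming models expose a `score(request_id, features)` method. That interface, the `DecoyModel` class and the JSON record layout are hypothetical. The decoy accepts the same input as a real model but only archives it, serialized at the moment it is seen.

```python
import json
import time
from typing import Any, Dict, List


class DecoyModel:
    """Looks like a model to its callers, but only archives what it is shown."""

    def __init__(self) -> None:
        self.archive: List[str] = []  # stand-in for an archive stream or file

    def score(self, request_id: int, features: Dict[str, Any]) -> None:
        record = {
            "request_id": request_id,
            "features": features,
            "seen_at": time.time(),
        }
        # Serialize immediately so later changes to `features` cannot
        # alter what was recorded.
        self.archive.append(json.dumps(record, sort_keys=True))
        return None  # a decoy never produces a score


decoy = DecoyModel()
features = {"amount": 12.5, "country": "DE"}
decoy.score(1, features)
features["amount"] = 99.0  # later update; the archived copy is unaffected
archived = json.loads(decoy.archive[0])
```

Archiving the input as the models saw it, instead of rebuilding it later from a database that has since been updated, is what avoids the time-machine leaks mentioned earlier.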
[Figure, built up over three slides. Labels: Raw, Input, Features / profiles, Decoy, Archive, m1, m2, m3, Scores, Rendezvous, Results, Metrics]
Models in production live in the
real world:
Conditions may (will) change
How to Do Better – Deployment in Production
• Keep models running “in the wings”
– Don’t wait until conditions change to start building the next model
– Keep new models ready
• Hot hand-off
– With rendezvous: just stop ignoring the model of interest
• Deploy a canary server
– Keep an old model active as a reference
– If it was 90% correct, the difference from any better model should be small
– Score distribution should be roughly constant
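One way to picture the canary check is to compare the canary's scores with the current model's scores on the same requests and flag large differences. This is only a sketch: the mean absolute difference, the `needs_attention` helper and the 0.1 threshold are assumptions chosen for illustration, not a method given in the talk.

```python
from statistics import mean
from typing import List


def mean_abs_difference(canary_scores: List[float], model_scores: List[float]) -> float:
    """Average per-request gap between canary and model scores."""
    return mean(abs(c - m) for c, m in zip(canary_scores, model_scores))


def needs_attention(
    canary_scores: List[float],
    model_scores: List[float],
    threshold: float = 0.1,  # arbitrary illustrative value
) -> bool:
    """True when the model has drifted further from the canary than expected."""
    return mean_abs_difference(canary_scores, model_scores) > threshold


canary = [0.10, 0.80, 0.40]
close_model = [0.12, 0.79, 0.43]    # small differences from the reference
drifted_model = [0.50, 0.20, 0.90]  # large differences from the reference

ok = needs_attention(canary, close_model)
alarm = needs_attention(canary, drifted_model)
```

A sudden jump in this gap does not say which model is wrong, only that something changed and is worth looking at.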
Advantages of Rendezvous Architecture
[Figure labels: Input, Real model, Canary, ∆, Result, Decoy, Archive]
DataOps: Brings Flexibility & Focus
• You don’t have to be a data scientist to contribute to machine learning
• Software engineers and developers play a role, but you need good data skills
Example: Tensor Chicken
[Figure labels: Gather training data, Label training data, Labeled image files, Train model, Deploy model, Run the model, Update model]
Deep learning project by
software engineer Ian Downard
(see blog + @tensorchicken)
Rendezvous Architecture
[Figure: Rendezvous architecture. Labels: request, Input, Model 1, Model 2, Model 3, Scores, Rendezvous, Results, response]
How to Do Better
• Data + the right question + domain knowledge matter!
• Prioritize – put serious effort into infrastructure
– DataOps requires more than just data science
• Persist – use streams to keep data around
• Measure – everything, and record it
• Meta-analyze – understand and see what is happening
• Containerize – make deployment repeatable, easy
• Oh… don’t forget to do some machine learning, too
Sign Up for ML Logistics Workshop Series
Three deep-dive machine learning workshops
by Ted Dunning, Chief Applications Architect at MapR:
1. A New Architecture for Machine Learning Logistics: How to use streaming, containers & a microservices design
2. Machine Learning Evaluation: How to do model-to-model comparisons
3. Machine Learning in the Enterprise: How to do model management in production
http://bit.ly/mapr-machine-learning-logistics-series
Additional Resources
O’Reilly report by Ted Dunning & Ellen Friedman © March 2017
Read free courtesy of MapR:
https://mapr.com/geo-distribution-big-data-and-analytics/
O’Reilly book by Ted Dunning & Ellen Friedman © March 2016
Read free courtesy of MapR:
https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/
Additional Resources
O’Reilly book by Ted Dunning & Ellen Friedman
© June 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning-new-look-anomaly-detection/
O’Reilly book by Ellen Friedman & Ted Dunning
© February 2014
Read free courtesy of MapR:
https://mapr.com/practical-machine-learning/
Additional Resources
by Ellen Friedman 8 Aug 2017 on MapR blog:
https://mapr.com/blog/tensorflow-mxnet-caffe-h2o-which-ml-best/
by Ted Dunning 13 Sept 2017 in InfoWorld:
https://www.infoworld.com/article/3223688/machine-learning/machine-learning-skills-for-software-engineers.html
New book:
O’Reilly book by Ellen Friedman & Ted Dunning © Sept 2017
Pre-register for a free PDF copy of the book when it becomes
available on 25th September, courtesy of MapR:
http://info.mapr.com/2017_Content_Machine-Learning-Logistics_eBook_Prereg_RegistrationPage.html
Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015
#womenintech #datawomen
Thank you!
Q&A
@mapr
Maprtechnologies
ENGAGE WITH US
@Ellen_Friedman