TRANSCRIPT
Agile Data Science with Scala
by @DataFellas
Xavier Tordoir
[email protected]
@xtordoir
Andy Petrella
[email protected]
@noootsab
Data Fellas
Andy Petrella
Maths, Geospatial, Distributed Computing
Spark Notebook, Trainer Spark/Scala, Machine Learning
Xavier Tordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Trainer Spark, Machine Learning
© Data Fellas SPRL 2016
Lineup
So if you’re not sure you want to stay...
● Pipeline: productizing Data Science
● Demo of a Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
○ Data Science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit
Pipeline
Productizing Data Science
Modelling:
● Finding Data
● Parsing structures
● Cleaning
● (Reducing)
● Learning
● Predicting
Coding:
● Connect PROD data
● Tuning training parameters
● Create Prediction Service
● Generate Deployable
Deploying:
● Connect to PROD infrastructure
● Integration with existing env
● Allocate (schedule) resources
● Ensure availability
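The Modelling steps (from cleaning through learning and predicting) can be pictured as a Spark ML pipeline; a minimal sketch, where the input path, column names, and hyper-parameters are illustrative, not from the talk:

```scala
// Hypothetical sketch of the Modelling stage with Spark ML
// (assumes a SparkSession `spark` and a labelled dataset with a
// `label` column; paths and parameters are made up for illustration).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

val raw   = spark.read.parquet("/data/training")   // Finding / Parsing
val clean = raw.na.drop()                          // Cleaning

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(100)                                 // Tuning training parameters
  .setRegParam(0.01)

val model = new Pipeline()
  .setStages(Array(assembler, lr))
  .fit(clean)                                      // Learning

val predictions = model.transform(clean)           // Predicting
```

The fitted `model` is what the Coding stage wraps into a prediction service and the Deploying stage ships to the PROD infrastructure.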
Distributed Data Science
Demo
All-In Spark Notebooks:
● Get data: Source → Kafka
● Prepare View: Kafka → Cassandra
● Train Model: Cassandra → ML...
● Create Server: Cassandra/ML/... → Akka Http
● Create Client: Json → Html Form, Chart, Table, ...
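The "Prepare View: Kafka → Cassandra" notebook might look roughly like this, using Spark Streaming with the spark-cassandra-connector; the topic, keyspace, and table names are invented for illustration:

```scala
// Hypothetical sketch of the Kafka → Cassandra notebook cell
// (assumes a SparkContext `sc`; "events", "demo_ks" and
// "events_view" are placeholder names, not from the demo).
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(10))

val stream = KafkaUtils.createStream(
  ssc, "zookeeper:2181", "demo-group", Map("events" -> 1))

stream
  .map { case (_, line) =>
    val Array(id, value) = line.split(",")
    (id, value.toDouble)
  }
  .saveToCassandra("demo_ks", "events_view")  // the prepared view

ssc.start()
```

Each notebook in the demo is one such self-contained stage, with Kafka and Cassandra as the hand-off points between them.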
Bad Pipeline
Targeting a Dashboard
Modelling » Coding » Deploying » Dashboard
● Data Scientist focusing on the dashboard/report instead of the content
● Breaks reusability of data
● Time wasted on learning viz instead of increasing accuracy (or velocity)
● Monolithic instead of service oriented
Extended Pipeline
Micro Services
Modelling » Coding » Creating Services » Deploying » Integrating » Application
Creating Services:
● Abstracts access to prepared views
● Exposes Prediction capabilities
● Highly horizontally scalable
● Scaling a micro services cluster → cheaper than scaling a computing cluster
Customer integration:
● Can be any technology
● Can even be another pipeline!
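A service that exposes prediction capabilities can be very small; a minimal sketch with Akka HTTP, where the route, port, and the stand-in `predict` function are illustrative (the real service would wrap the model trained in the notebook):

```scala
// Hypothetical prediction micro service with Akka HTTP.
// Stateless, so it scales horizontally behind a load balancer,
// independently of the computing cluster.
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object PredictionService extends App {
  implicit val system = ActorSystem("prediction")
  implicit val mat    = ActorMaterializer()

  // stand-in for the trained model loaded from the pipeline
  def predict(features: Seq[Double]): Double = features.sum

  val route =
    path("predict") {
      get {
        parameter("features") { csv =>
          val fs = csv.split(",").map(_.toDouble).toSeq
          complete(s"""{"prediction": ${predict(fs)}}""")
        }
      }
    }

  Http().bindAndHandle(route, "0.0.0.0", 8080)
}
```

Because the service only abstracts access to prepared views and a model, it stays decoupled from how the customer integrates it.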
Painful points
Data Science is Discontiguous
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Collecting (Data Eng.) » Modelling (Scientist) » Coding (Data Eng.) » Creating Services » Deploying (Ops. Eng.) » Integrating (Web Eng.) » Application (Customers)
➔ No integration
➔ Error prone
➔ Schedule delays
Frictions at every hand-off between roles
Result: Lack of Agility
Painful points
Context Lost in Translation
Input Data → Data Lake → Processing → Machine Learning → Model → Output Data → Application
● No contextual discovery
● No quality info
● No lineage (origin of the data)
● Link to process and input discarded
● Huge gap in architecture: binary and schema-aware serving layer
● Accuracy depends on concealed quality of inputs
● No schema! Hard and long integration, poor satisfaction
Moreover:
No backward links → no agility and no context awareness
Result: Lack of Reproducibility
Our Approach
Agile Data Science Toolkit
Automatic Semantics Engine + Autogenerated Microservices = Integrated End-to-End Environment
Huge gain in Time and Reliability
Components: Notebook, Computing Cluster, Access Layer, Knowledge Base, Consumers/Customers
The Access Layer exposes databases, learning models, stream sources, notebooks, ...
The Knowledge Base records: data type, process, lineage, usage
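One way to picture what the Knowledge Base records for each exposed artifact; a sketch in plain Scala where the field and example names are invented, not the toolkit's actual schema:

```scala
// Illustrative knowledge-base entry: every exposed artifact
// (database, learning model, stream source, notebook, ...)
// carries its data type, producing process, lineage and usage.
case class CatalogEntry(
  name: String,
  dataType: String,      // schema / data type of the output
  process: String,       // notebook or job that produced it
  lineage: Seq[String],  // upstream inputs (origin of the data)
  usage: Seq[String]     // downstream consumers
)

val entry = CatalogEntry(
  name     = "events_view",
  dataType = "(id: String, value: Double)",
  process  = "prepare-view-notebook",
  lineage  = Seq("kafka://events"),
  usage    = Seq("train-model-notebook")
)
```

Keeping these backward links is what restores context awareness and reproducibility across the pipeline.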
Easy to Release
Easy to (Re)Use
Notebook
Version Control (Git)
Spark Job Project (SBT)
Service Projects (SBT)
Metadata (Doc, Logic, Schema, ...)
Catalog (ElasticSearch)
Deployable (Jar, Docker)
Repository (Nexus, Docker Repo, Pypi, Gem Server)
Client Projects (Node.js, Java, Scala, Python, Ruby)
Publishable (NPM, Jar, Pip/EasyInstall, Gem)
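A notebook turned into a Spark Job Project could be backed by an sbt build along these lines; the organization, versions, and repository URL are placeholders, not the toolkit's actual output:

```scala
// Hypothetical build.sbt for a generated Spark job project.
organization := "guru.datafellas"
name         := "generated-spark-job"
version      := "0.1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided"
)

// publish the deployable jar to the in-house repository (Nexus)
publishTo := Some("nexus" at "https://nexus.example.com/releases")
```

From there, `sbt publish` produces the Deployable and pushes it to the Repository, where clients in any of the listed languages can pick it up.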
Data Scientist
Data Engineer
Ops Engineer
Growing
We’re Hiring! http://www.data-fellas.guru/#skillsjobs
Q/A
References
http://www.data-fellas.guru/
http://spark-notebook.io/
https://github.com/andypetrella/spark-notebook/
https://gitter.im/andypetrella/spark-notebook
Come see us at Strata (London at least), we have two talks :-)