© 2017 Continuum Analytics - Confidential & Proprietary
Three Ways to Move Your Data Science Projects to Production
Secure and Scalable Data Science Deployment with Anaconda
Christine Doig and Kris Overholt
May 24, 2017
• Worked on MEMEX, a DARPA-funded project helping stop human trafficking
• Co-author of the recently published O’Reilly book Breaking Data Science Open
• 5+ years of experience in analytics, operations research, and machine learning
• MS in Industrial Engineering, Polytechnic University of Catalonia, Barcelona.
Christine Doig, Senior Data Scientist
• Developing the cluster management features of Anaconda
• 10+ years of experience in scientific computing, systems administration, computational modeling, and more
• Ph.D. in Civil Engineering, University of Texas
• Master’s degree, Worcester Polytechnic Institute, focus on computational fluid dynamics
Kris Overholt, Product Manager
• Overview of Anaconda
• End-to-End Collaborative Data Science Workflows
• Data Science Development and Deployment
  • Anaconda + Docker
  • Anaconda Project
  • Anaconda Enterprise
• Examples of Data Science Deployment
• Getting Started with Anaconda Enterprise Deployment
Agenda
Overview of Anaconda
Anaconda, the leading Data Science ecosystem with over 4M users
Numba
dask
xlwings
Airflow
Blaze
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
Machine Learning / Statistics
ANACONDA DISTRIBUTION
Python & R distribution with 1000+ curated packages that makes it easy to get started with Data Science
https://www.continuum.io/downloads
What’s in ANACONDA DISTRIBUTION?
• Install data science libraries
$ conda install pandas
• Manage package versions
$ conda install pandas=0.14
• Create isolated environments
$ conda create -n myenv python=3.5 pandas=0.18
• Update package version
$ conda update pandas
…
anaconda-project.yml
• Define and manage:
  • project package dependencies
  • deployment commands
  • data
  • …
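As a rough illustration of how one file can carry dependencies, data, and deployment commands, here is a minimal Python sketch that writes out an example anaconda-project.yml. The project name, package list, data URL, and command names are hypothetical and are not taken from the slides; only the top-level keys (name, packages, downloads, commands) follow the Anaconda Project file layout.

```python
import tempfile
from pathlib import Path

# Illustrative project file. The project name, packages, data URL, and
# command names below are hypothetical placeholders, not from the slides.
PROJECT_YML = """\
name: example-project

packages:
  - python=3.5
  - pandas=0.18
  - bokeh

downloads:
  DATA_FILE: https://example.com/data.csv

commands:
  dashboard:
    bokeh_app: dashboard
  report:
    notebook: report.ipynb
"""


def write_project_file(project_dir):
    """Write anaconda-project.yml into project_dir and return its path."""
    path = Path(project_dir) / "anaconda-project.yml"
    path.write_text(PROJECT_YML)
    return path


if __name__ == "__main__":
    # Write the file into a throwaway directory and show its contents.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_project_file(tmp).read_text())
```

With a file like this in place, the packages section plays the role of the conda environment, downloads describes the data, and each entry under commands is something that can be run or deployed.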
• Launch applications
• Manage package versions and environments
• Create and upload projects
End-to-end Collaborative Data Science Workflows
• Explore, Analyze & Collaborate
• Scale, Deploy & Operate
End-to-end Collaborative Data Science Workflows
Biz Analyst
Data Scientists
Explore, Analyze & Collaborate
DevOps
Scale, Deploy & Operate
Developer
Data Engineers
Data Science Development and Deployment
How do you…
• Download and install data science libraries?
• Manage versions and dependencies?
• Upgrade libraries?
• Isolate dependencies between projects?
Challenges in the data science ecosystem
What do data scientists develop?
Workflows
Data
Query Visualize
Clean & Tidy
Predict, Simulate, & Optimize
Reports
Presentations
Interactive Notebooks
Interactive Apps
Predictive Models
Interactive data visualizations and dashboards
Jupyter notebooks Scripts
Predictive models
Processed Data
Laptop
Data Science Development
scikit-learn
Bokeh Tensorflow
Jupyter pandas
matplotlib
seaborn
dask
numba
script 1 script 2 notebook A dataset Z script 3
Python, R
How do you…
• Share your data science project with others?
• Ensure that you can reproduce your analysis?
• Deploy your project?
Challenges in data science development and deployment
The Path to Simple Data Science Deployment!
Anaconda Enterprise
DIY
Anaconda Project
Anaconda
Docker containers
conda env 1 conda env 2 conda env 3
Anaconda and Docker - Better Together
Laptop
conda env 1
Analysis 1
conda env 2 conda env 3
Analysis 2
Analysis 3
Server
conda env 1
Analysis 1
conda env 2 conda env 3
Analysis 2
Analysis 3
Docker container
Data Science Development Data Science Deployment
https://hub.docker.com/r/continuumio/anaconda/
• Dependencies
Anaconda and Docker
• Data
• Deployment commands
• Security
• Scalability
• Availability
Portable Data Science with Anaconda Project - More than just Dockerfiles
Laptop Server
Project 1 Project 2 Project 3 Project 1 Project 2 Project 3
Data Science Development Data Science Deployment
Laptop Server
Project 1 Project 2 Project 3 Project 1 Project 2 Project 3
Data Science Development Data Science Deployment
Docker container
• Dependencies
• Data
• Deployment commands
Anaconda Project
• Security
• Scalability
• Availability
One-click Data Science Deployments with Anaconda Enterprise
Laptop
Project 1 Project 2 Project 3
Project 1 Project 2 Project 3
Data Science Development Data Science Development and Deployment
Anaconda Enterprise
Container 1
Container 2
Container 3 Container 4
• Dependencies
• Data
• Deployment commands
• Security
• Scalability
• Availability
Anaconda Enterprise
• One-click deployment of:
  • Self-Service Data Science Notebooks (Python and R)
  • Interactive visualizations and dashboards (Bokeh, Shiny, etc.)
  • Machine learning models with REST APIs
• Secure deployments to a cluster with end-to-end SSL
• API wrapper for easily exposing inputs/outputs for models
• Ability to securely share apps with other users, groups, and roles (LDAP, AD, SAML, Kerberos)
Anaconda Enterprise Features - Data Science Deployment
• Ability to deploy apps and APIs that can be used/consumed via a token
• Ability to configure CPU/memory limits for deployed apps in system-wide configuration
• Ability to fetch logs for each app with error handling, health checks, and automatic app restarts
• Deployments can be backed by remote storage, databases, or Hadoop/Spark
• Cluster can be configured for high availability
Anaconda Enterprise Features - Data Science Deployment
Example Data Science Deployments
• 1) Self-service notebooks
• 2) Interactive visualizations and dashboards
• 3) Machine learning models with REST APIs
• 4) Composable data science projects
• 5) Machine learning models with visualization
Examples Overview
• Self-service data science notebooks, including:
  • Python
  • R
• Notebooks with live, attached kernels
• Can be used to share runnable versions of analyses
• Share running notebooks with users, groups, and roles
• Handle portability and manage dependencies with Anaconda Project
Example 1 - Notebooks (Python/R)
• Deploy apps using any visualization package in Anaconda, including:
  • Bokeh
  • Shiny apps
  • Datashader
  • deck.gl
• Develop and share visualizations and dashboards
• Include data in project or reference remote data and databases
• Deploy visualization apps powered by Hadoop and Spark
Example 2 - Interactive Visualizations
• Machine learning models and applications with REST APIs
  • scikit-learn, Theano, Lasagne, Neon
  • Tensorflow (w/ GPU), Caffe, H2O
  • and many more!
• Support for model scoring and prediction APIs from trained models
• Compatible with web frameworks in Anaconda, including:
  • Flask, Django, Tornado, and more
• Models can be shared or consumed via API tokens
Example 3 - Machine Learning w/ APIs
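To make the idea of a prediction API consumed via a token concrete, here is a minimal sketch using only the Python standard library. It is not how Anaconda Enterprise implements its API wrapper; the token value, the weights, and the predict function are hypothetical stand-ins for a real secret and a real trained model, and a production deployment would more likely use one of the web frameworks listed above.

```python
import hmac
import io
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical values for illustration only.
API_TOKEN = "example-token"
WEIGHTS = [0.5, -1.0, 2.0]


def predict(features):
    """Stand-in for a trained model: a fixed linear score."""
    return sum(w * x for w, x in zip(WEIGHTS, features))


def app(environ, start_response):
    """WSGI app: POST a JSON body {"features": [...]} with a bearer token."""
    def respond(status, payload):
        body = json.dumps(payload).encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(body)))])
        return [body]

    # Constant-time comparison of the supplied token with the expected one.
    supplied = environ.get("HTTP_AUTHORIZATION", "")
    expected = "Bearer " + API_TOKEN
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return respond("401 Unauthorized", {"error": "invalid token"})
    if environ.get("REQUEST_METHOD") != "POST":
        return respond("405 Method Not Allowed", {"error": "use POST"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(length))["features"]
        score = predict([float(x) for x in features])
    except (ValueError, KeyError, TypeError):
        return respond("400 Bad Request", {"error": "expected a features list"})
    return respond("200 OK", {"prediction": score})


def call(token, payload):
    """Invoke the app in-process, without starting a server."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update({"REQUEST_METHOD": "POST",
                    "CONTENT_LENGTH": str(len(raw)),
                    "wsgi.input": io.BytesIO(raw),
                    "HTTP_AUTHORIZATION": "Bearer " + token})
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

Calling call("example-token", {"features": [1, 2, 3]}) returns ("200 OK", {"prediction": 4.5}), while any other token gets a 401 response. The same app object could be served with wsgiref.simple_server.make_server for local experimentation.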
• Deploy composable applications across your data science team
• Example end-to-end workflow with custom endpoints and API tokens:
  • Stage 1 - Data cleansing
  • Stage 2 - Anomaly detection
  • Stage 3 - Model scoring
  • Stage 4 - Interactive applications and dashboards
  • Stage 5 - Reports and file exports
Example 4 - Composable Deployments
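The first three stages of a workflow like this can be sketched as plain Python functions chained together. This is an illustration of composability only, assuming each stage is a function whose output feeds the next; in a real deployment each stage would sit behind its own endpoint. The input readings, the z-score threshold, and the scoring step are all made up for the example.

```python
from statistics import mean, pstdev


def cleanse(readings):
    """Stage 1: drop missing values and coerce the rest to float."""
    return [float(r) for r in readings if r is not None]


def flag_anomalies(values, threshold=2.0):
    """Stage 2: pair each value with a flag that is set when the value lies
    more than `threshold` population standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [(v, sigma > 0 and abs(v - mu) / sigma > threshold) for v in values]


def score(flagged):
    """Stage 3: stand-in scoring step that reports the flagged values."""
    anomalies = [v for v, is_anomaly in flagged if is_anomaly]
    return {"n": len(flagged), "anomalies": anomalies}


def run_pipeline(data, stages):
    """Feed each stage's output into the next stage."""
    for stage in stages:
        data = stage(data)
    return data


# Made-up input: one missing reading and one obvious outlier.
result = run_pipeline([1, 2, None, 1, 2, 1, 2, 1, 2, 50],
                      [cleanse, flag_anomalies, score])
print(result)  # {'n': 9, 'anomalies': [50.0]}
```

Because every stage takes the previous stage's output, stages can be swapped or extended (for example with a dashboard or report step) without changing the others.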
• Can be built on top of machine learning libraries in Anaconda, including:
  • Tensorflow, H2O, and many more
• Easily develop interactive applications and dashboards with existing frameworks
• Handle inputs and outputs to machine learning models
  • Including complex visualization toolkits such as Tensorboard
Example 5 - ML Models with Visualization
Getting Started with Anaconda Enterprise for Data Science Deployments and More
Anaconda Platform
Anaconda Distribution
• The most trusted Python distribution for data science
Anaconda Support
• Deploy Anaconda with Confidence. World class support for open source production environments.
Anaconda Enterprise
• Enterprise-ready data science platform for end-to-end workflows, including governance, collaboration, and deployment.
• Empower data scientists to easily deploy secure and scalable data science projects to production
• World class support for open-source production environments
• Securely govern and version control data science artifacts (projects, packages, installers) from development to production
• Secure and scalable data science project collaboration
• Manage Anaconda across a cluster and run data science projects backed by enterprise scalable compute and data sources
• Bring the power of data science to Business Analysts
Anaconda Enterprise Features
• Get started with the Anaconda Enterprise Innovator program
  • https://go.continuum.io/anaconda-enterprise-innovator/
• Contact us at:
  • [email protected]
  • https://www.continuum.io/contact-us
Next Steps
Questions?
Christine Doig @ch_doig
Kristopher Overholt @koverholt
@ContinuumIO