scalable & deployable data science with the anaconda platform | anacondacon 2017

SCALABLE & DEPLOYABLE DATA SCIENCE WITH THE ANACONDA PLATFORM

Kristopher Overholt, Product Manager

Continuum Analytics

#OpenDataScienceMeans #AnacondaCON

OVERVIEW

• Collaborative Data Science Workflows

• Scaling Out with Anaconda
  • Spectrum of parallelization
  • Spark, Hadoop, Dask, and other parallel frameworks
  • Example distributed/parallel use cases

• Productionizing Data Science Projects

• Enterprise deployment considerations

• Deploying Data Science Projects
  • Notebooks, dashboards, interactive applications, and models with APIs


COLLABORATIVE DATA SCIENCE WORKFLOWS

Data science teams often use intermediate deployments and modular, layered development approaches for data ingest, data cleaning, computation, machine learning, visualization, etc.


ANACONDA - SCALED OUT OPEN DATA SCIENCE

Application and Visualization: Jupyter Notebook, Matplotlib, seaborn, Bokeh, etc.

Analytics: pandas, NumPy, SciPy, Numba, scikit-learn, NLTK, scikit-image, PIL, and more

Computation: PySpark, SparkR, Dask, Distributed

Data and Resource Management: HDFS, NFS, YARN, SGE, SLURM

Servers: Bare-metal or cloud-based cluster

(Diagram side labels: Anaconda, Cluster)


SPECTRUM OF PARALLELIZATION

From explicit control (fast but low-level) to implicit control (restrictive but easy):

• Threads, Processes

• MPI, ZeroMQ

• Dask

• Hadoop, Spark

• SQL: Hive, Pig, Impala
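To make the explicit end of this spectrum concrete, here is a minimal sketch using only the Python standard library. The function name `chunked_sum` and its parameters are hypothetical, chosen for illustration; the point is that the caller decides the chunking and the worker pool by hand, which is the low-level control the slide refers to.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_sum(values, n_chunks, executor_cls=ThreadPoolExecutor):
    """Sum `values` by summing up to `n_chunks` slices in parallel workers."""
    values = list(values)
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Explicit control: the caller picks the worker type and the chunking.
    with executor_cls(max_workers=n_chunks) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Combine the per-chunk results on the calling thread.
    return sum(partial_sums)


total = chunked_sum(range(1_000_000), n_chunks=4)
```

Passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` would swap threads for processes. Frameworks further along the spectrum take over this chunking and scheduling on the user's behalf.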


SCALING OUT WITH ANACONDA AND SPARK

Using Anaconda with Spark is:

• Extensible: Use libraries from Anaconda with PySpark and SparkR jobs

• Integrated: Use interactive notebooks with data in HDFS and on YARN clusters

• Secure: Works with Kerberized Hadoop clusters

• Scalable: Map pandas, NumPy, SciPy jobs on large clusters and data sets

• Seamless: Works with Cloudera CDH, Hortonworks HDP, and other enterprise Hadoop distributions

Anaconda dramatically simplifies the installation and management of popular Python and R packages and their dependencies.


SCALING OUT WITH ANACONDA AND DASK

Dask is a Python parallel computing library that is:

• Familiar: Implements parallel NumPy and pandas objects

• Fast: Optimized for demanding numerical applications

• Flexible: for sophisticated and messy algorithms

• Scales up: Runs resiliently on clusters of 100s of machines

• Scales down: Pragmatic in a single process on a laptop

• Interactive: Responsive and fast for interactive data science
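As a rough illustration of the "familiar" and "scales down" points, the sketch below uses `dask.array` on a single machine. The array sizes and chunk shapes are arbitrary assumptions picked to keep the example small; it is not taken from the talk.

```python
import dask.array as da

# A 2000 x 2000 array of ones, split into 16 blocks of 500 x 500.
x = da.ones((2000, 2000), chunks=(500, 500))

# NumPy-style expressions only build a lazy task graph at this point.
y = (x + x.T).mean(axis=0)

# compute() executes the graph and returns an ordinary NumPy array.
result = y.compute()
```

The same expressions can run against a cluster by connecting a distributed scheduler first, without changing the array code itself.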


OTHER WAYS TO SCALE OUT WITH ANACONDA

Anaconda integrates with:

• Spark (PySpark, SparkR) and other Hadoop components, including YARN, HDFS, Hive, Impala, and more

• Dask, Distributed, knit, dask-ec2, hdfs3, fastparquet

• CSV, SQL, JSON, HDF5, Parquet, etc.

• Amazon Web Services, Microsoft Azure, Google Cloud Platform

• Streaming analytics: Streamparse for Apache Storm, Spark Streaming, Kafka, Python integration with ELK

Anaconda Technology Partners:

• Cloudera

• Hortonworks

• IBM

• H2O

• Docker

• … and more


SCALING OUT WITH ANACONDA

Anaconda platform

Users: Biz Analysts, Data Scientists, Developers, Data Engineers, DevOps

Cluster


SCALING OUT WITH ANACONDA

Without Anaconda Scale

Head Node

1. Manually install Python, packages & dependencies

2. Manually install R, packages & dependencies

Compute Nodes

1. Manually install Python, packages & dependencies

2. Manually install R, packages & dependencies

With Anaconda Scale

Head Node and Compute Nodes

Easily install Anaconda with performance-optimized Python and R packages and manage environments across all nodes in a cluster


SCALING OUT WITH ANACONDA - EXAMPLE USE CASES

Analyzing text, tabular, or array data using Dask

• Use pandas DataFrames or NumPy arrays at scale

• Work with data in different formats and data stores

Distributed natural language processing with text data using PySpark

• Explore data using a distributed memory cluster

• Interactively query and analyze data using libraries from Anaconda

Distributed machine learning workflows with Dask, Spark, H2O, TensorFlow, and more

• Work interactively and collaboratively in notebooks

• Simplify installation and management of ML libraries and dependencies

Handle custom code and workflows using Dask

• Work with custom data formats

• Construct complex pipelines including ETL and flexible computations
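The custom-pipeline use case above can be sketched with `dask.delayed`. The three stage functions (`extract`, `transform`, `load`) and the toy data they produce are hypothetical stand-ins, not code from the talk; they only show how ordinary Python functions become nodes in a lazily evaluated task graph.

```python
from dask import delayed


@delayed
def extract(source_id):
    # Stand-in for reading one file or partition (made-up data).
    return [source_id * 10 + i for i in range(5)]


@delayed
def transform(records):
    # Example cleaning step: keep even values only.
    return [r for r in records if r % 2 == 0]


@delayed
def load(cleaned_parts):
    # Combine the per-partition results into a single total.
    return sum(sum(part) for part in cleaned_parts)


# Building the pipeline is lazy; no stage has run yet.
parts = [transform(extract(i)) for i in range(3)]
total = load(parts)

# compute() walks the graph and runs independent branches in parallel.
result = total.compute()
```

Because each partition's extract and transform steps are independent, the scheduler is free to run them concurrently before the final combining step.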


PRODUCTIONIZING DATA SCIENCE PROJECTS



• Provisioning compute resources

• Managing dependencies and environments

• Ensuring availability, uptime, and monitoring status

• Engineering for scalability

• Sharing compute resources

• Securing data and network connectivity and credentials

• Securing network communications and SSL

• Managing authentication and access control


DEPLOYING WITH COLLABORATIVE DATA SCIENCE WORKFLOWS

Review, Design, Build, Validate, Deploy

• Review: Assess and review requirements and data sources

• Design: Conceptual design of interactive application or dashboard

• Build: Build the dashboard or application with Anaconda

• Validate: Test and validate dashboard or application

• Deploy: Deploy dashboard or application at scale using best practices


DEPLOYING DATA SCIENCE PROJECTS - NOTEBOOKS


DEPLOYING DATA SCIENCE PROJECTS - DASHBOARDS


DEPLOYING DATA SCIENCE PROJECTS - INTERACTIVE APPLICATIONS


DEPLOYING DATA SCIENCE PROJECTS - MODELS WITH REST APIS

Machine Learning Pipeline: Load Data, Clean Data, Anomaly Detection, Regression, Clustering

Deployed Applications: Models with REST APIs, Dashboards, Reports, Interactive Applications

Developers and data scientists can build additional layers of visualizations, dashboards, or interactive applications that consume data from API endpoints.
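To show what "a model behind an API endpoint" can look like, here is a minimal sketch using only the Python standard library. The `/predict` route, the JSON payload shape, and the two-coefficient linear "model" are all assumptions made for illustration; a real deployment would load a trained model and sit behind the authentication and SSL layers discussed earlier.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained" linear model: y = SLOPE * x + INTERCEPT.
SLOPE, INTERCEPT = 2.0, 1.0


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        prediction = SLOPE * payload["x"] + INTERCEPT
        body = json.dumps({"prediction": prediction}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def query(port, x):
    """Client side: what a dashboard or application would do."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("POST", "/predict", body=json.dumps({"x": x}),
                 headers={"Content-Type": "application/json"})
    data = json.loads(conn.getresponse().read())
    conn.close()
    return data


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
answer = query(server.server_port, 3)
server.shutdown()
server.server_close()
```

Here `query` plays the role of the consuming layer: any dashboard or interactive application that can issue an HTTP request can reuse the model without importing its code.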


SCALABLE AND DEPLOYABLE DATA SCIENCE

… with Anaconda and Anaconda Enterprise, including:

• Scaled-up Analytics: Develop and deploy the same code/environments on your local machine and a cluster

• Environment management: Dynamically manage Python, R, dependencies and other conda packages and environments across a cluster

• Collaboration: Easily share versioned notebooks and projects across users and replicate analysts’ environments for different jobs/users/groups

• Hadoop integration: Support for Hadoop, Spark and other distributed workflows; compatible with enterprise Hadoop distributions


ADDITIONAL RESOURCES FOR SCALABLE AND DEPLOYABLE DATA SCIENCE

• Anaconda Enterprise subscriptions: https://www.continuum.io/anaconda-subscriptions

• Anaconda Scale: https://docs.continuum.io/anaconda-scale

• Webinars on scaling out with Anaconda: https://www.continuum.io/webinars

• Blog posts on scaling out with Anaconda: https://www.continuum.io/blog/developer-blog

Productionizing and Deploying Data Science Projects

Thank You!

@ContinuumIO @koverholt
