Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALABLE & DEPLOYABLE DATA SCIENCE WITH THE ANACONDA PLATFORM. Kristopher Overholt, Product Manager, Continuum Analytics

Upload: continuum-analytics

Post on 12-Apr-2017

Category: Data & Analytics


TRANSCRIPT

Page 1: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALABLE & DEPLOYABLE DATA SCIENCE WITH THE ANACONDA PLATFORM

Kristopher Overholt, Product Manager

Continuum Analytics

Page 2: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

#OpenDataScienceMeans #AnacondaCON

OVERVIEW

• Collaborative Data Science Workflows

• Scaling Out with Anaconda
  • Spectrum of parallelization
  • Spark, Hadoop, Dask, and other parallel frameworks
  • Example distributed/parallel use cases

• Productionizing Data Science Projects

• Enterprise deployment considerations

• Deploying Data Science Projects
  • Notebooks, dashboards, interactive applications, and models with APIs

Page 3: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

COLLABORATIVE DATA SCIENCE WORKFLOWS

Page 4: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

COLLABORATIVE DATA SCIENCE WORKFLOWS

Data science teams often use intermediate deployments and modular, layered development approaches for data ingest, data cleaning, computation, machine learning, visualization, etc.

Page 5: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

ANACONDA - SCALED OUT OPEN DATA SCIENCE

Application and Visualization: Jupyter Notebook, Matplotlib, seaborn, Bokeh, etc.

Analytics: pandas, NumPy, SciPy, Numba, scikit-learn, NLTK, scikit-image, PIL, and more

Computation: PySpark, SparkR, Dask, Distributed

Data and Resource Management: HDFS, NFS, YARN, SGE, SLURM

Servers: Bare-metal or cloud-based cluster

(Diagram side labels: Anaconda, Cluster)

Page 6: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SPECTRUM OF PARALLELIZATION

Explicit control: Fast but low-level

• Threads
• Processes
• MPI
• ZeroMQ
• Dask
• Hadoop
• Spark
• SQL: Hive, Pig, Impala

Implicit control: Restrictive but easy
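As a rough illustration of the explicit-control end of this spectrum, the sketch below uses threads from Python's standard library. The sample documents and the `word_count` helper are made up for illustration and are not from the talk. The caller decides how many workers to use and how the work is split, which is what makes this style low-level.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    """Count whitespace-separated words in one document."""
    return len(text.split())


# Hypothetical input: three tiny "documents".
documents = [
    "open data science",
    "scaling out with anaconda",
    "spark hadoop dask",
]

# Explicit control: we pick the worker count and map the work ourselves.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(word_count, documents))  # results keep input order

total_words = sum(counts)
print(counts, total_words)
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` gives the same interface backed by processes, which is usually the better fit for CPU-bound Python code.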

Page 7: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALING OUT WITH ANACONDA AND SPARK

Using Anaconda with Spark is:

• Extensible: Use libraries from Anaconda with PySpark and SparkR jobs

• Integrated: Use interactive notebooks with data in HDFS and on YARN clusters

• Secure: Works with Kerberized Hadoop clusters

• Scalable: Map pandas, NumPy, SciPy jobs on large clusters and data sets

• Seamless: Works with Cloudera CDH, Hortonworks HDP, and other enterprise Hadoop distributions

Anaconda dramatically simplifies the installation and management of popular Python and R packages and their dependencies.
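One common way to make PySpark jobs use libraries from an Anaconda installation is to point Spark at Anaconda's Python interpreter. The sketch below only builds the command and environment and does not run Spark. The interpreter path `/opt/anaconda/bin/python`, the script name `job.py`, and the helper `build_spark_submit` are assumptions for illustration. `PYSPARK_PYTHON` and `spark.yarn.appMasterEnv.*` are standard Spark settings.

```python
import os
import shlex


def build_spark_submit(script, anaconda_python="/opt/anaconda/bin/python", master="yarn"):
    """Return (command, environment) for a spark-submit call that uses Anaconda's Python."""
    # PYSPARK_PYTHON tells PySpark which interpreter to launch for Python workers.
    env = dict(os.environ, PYSPARK_PYTHON=anaconda_python)
    command = [
        "spark-submit",
        "--master", master,
        # Pass the same interpreter to the YARN application master.
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={anaconda_python}",
        script,
    ]
    return command, env


# Hypothetical job script; nothing is executed here.
command, env = build_spark_submit("job.py")
print(shlex.join(command))
```

On a machine with Spark installed, the pair could be handed to `subprocess.run(command, env=env)`. The Anaconda path has to exist on every node that runs Python workers.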

Page 8: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALING OUT WITH ANACONDA AND DASK

Dask is a Python parallel computing library that is:

• Familiar: Implements parallel NumPy and pandas objects

• Fast: Optimized for demanding numerical applications

• Flexible: for sophisticated and messy algorithms

• Scales up: Runs resiliently on clusters of 100s of machines

• Scales down: Pragmatic in a single process on a laptop

• Interactive: Responsive and fast for interactive data science
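A minimal sketch of the "flexible" and "scales down" points, using `dask.delayed` to build a small task graph on a single machine. The function names and the sum-of-squares workload are made up for illustration. The `ImportError` fallback is an addition so the sketch still runs where Dask is not installed, and it is not part of Dask itself.

```python
try:
    from dask import delayed
    HAVE_DASK = True
except ImportError:  # Dask not installed: fall back to eager evaluation
    HAVE_DASK = False

    def delayed(func):
        """Stand-in for dask.delayed that simply returns the function."""
        return func


def chunk_sum_of_squares(start, stop):
    """Sum n*n for start <= n < stop."""
    return sum(n * n for n in range(start, stop))


def combine(partials):
    """Add up the per-chunk results."""
    return sum(partials)


def sum_of_squares(n, chunk_size=250):
    # One task per chunk; with Dask these are lazy and can run in parallel.
    partials = [
        delayed(chunk_sum_of_squares)(start, min(start + chunk_size, n))
        for start in range(0, n, chunk_size)
    ]
    total = delayed(combine)(partials)
    # Delayed objects need .compute(); the fallback already holds the value.
    return total.compute() if HAVE_DASK else total


result = sum_of_squares(1000)
print(result)
```

With Dask installed, nothing runs until `.compute()` is called, so the same graph could also be handed to a distributed scheduler on a cluster.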

Page 9: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

OTHER WAYS TO SCALE OUT WITH ANACONDA

Anaconda integrates with:

• Spark (PySpark, SparkR) and other Hadoop components, including YARN, HDFS, Hive, Impala, and more

• Dask, Distributed, knit, dask-ec2, hdfs3, fastparquet

• CSV, SQL, JSON, HDF5, Parquet, etc.

• Amazon Web Services, Microsoft Azure, Google Cloud Platform

• Streaming analytics: Streamparse for Apache Storm, Spark Streaming, Kafka, Python integration with ELK

Anaconda Technology Partners:

• Cloudera

• Hortonworks

• IBM

• H2O

• Docker

• … and more

Page 10: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALING OUT WITH ANACONDA

Anaconda platform

Cluster

Biz Analysts, Data Scientists, Developers, Data Engineers, DevOps

Page 11: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALING OUT WITH ANACONDA

Without Anaconda Scale

Head Node
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies

Compute Nodes
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies

With Anaconda Scale

Head Node and Compute Nodes
Easily install Anaconda with performance-optimized Python and R packages and manage environments across all nodes in a cluster

Page 12: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALING OUT WITH ANACONDA – EXAMPLE USE CASES

Analyzing text, tabular, or array data using Dask

• Use pandas dataframes or NumPy arrays at scale

• Work with data in different formats and data stores

Distributed natural language processing with text data using PySpark

• Explore data using a distributed memory cluster

• Interactively query and analyze data using libraries from Anaconda

Distributed machine learning workflows with Dask, Spark, H2O, TensorFlow, and more

• Work interactively and collaboratively in notebooks

• Simplify installation and management of ML libraries and dependencies

Handle custom code and workflows using Dask

• Work with custom data formats

• Construct complex pipelines including ETL and flexible computations

Page 13: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

Page 14: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

Page 15: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

Page 16: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

PRODUCTIONIZING DATA SCIENCE PROJECTS

Page 17: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

PRODUCTIONIZING DATA SCIENCE PROJECTS

• Provisioning compute resources

• Managing dependencies and environments

• Ensuring availability, uptime, and monitoring status

• Engineering for scalability

• Sharing compute resources

• Securing data and network connectivity and credentials

• Securing network communications and SSL

• Managing authentication and access control

Page 18: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

DEPLOYING WITH COLLABORATIVE DATA SCIENCE WORKFLOWS

• Review: Assess and review requirements and data sources

• Design: Conceptual design of interactive application or dashboard

• Build: Build the dashboard or application with Anaconda

• Validate: Test and validate dashboard or application

• Deploy: Deploy dashboard or application at scale using best practices

Page 19: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

DEPLOYING DATA SCIENCE PROJECTS - NOTEBOOKS

Page 20: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

DEPLOYING DATA SCIENCE PROJECTS - DASHBOARDS

Page 21: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

DEPLOYING DATA SCIENCE PROJECTS – INTERACTIVE APPLICATIONS

Page 22: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

DEPLOYING DATA SCIENCE PROJECTS – MODELS WITH REST APIS

Machine Learning Pipeline: Load Data, Clean Data, Anomaly Detection, Regression, Clustering

Deployed Applications: Models with REST APIs, Dashboards, Reports, Interactive Applications

Developers and data scientists can build additional layers of visualizations, dashboards, or interactive applications that consume data from API endpoints.
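To make the idea of a model behind an API endpoint concrete, here is a minimal sketch using only Python's standard library. The linear coefficients stand in for a trained regression model, and the `/predict` path, feature names `x1` and `x2`, and helper names are all assumptions for illustration. The talk does not specify how Anaconda Enterprise serves models.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model: fixed coefficients standing in for a trained regression.
COEFFICIENTS = {"x1": 2.0, "x2": -1.0}
INTERCEPT = 0.5


def predict(features):
    """Return the linear-model prediction for a dict of feature values."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * float(features[name]) for name in COEFFICIENTS
    )


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length))
            body = json.dumps({"prediction": predict(features)}).encode("utf-8")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON object with x1 and x2")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(host="127.0.0.1", port=0):
    """Create (but do not start) the HTTP server; port 0 picks a free port."""
    return HTTPServer((host, port), PredictHandler)
```

Calling `make_server(port=8000).serve_forever()` would start the endpoint locally, and a dashboard or application could then POST JSON such as `{"x1": 1, "x2": 3}` to `/predict`. A production deployment would add the authentication, SSL, and monitoring concerns listed earlier in the talk.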

Page 23: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

SCALABLE AND DEPLOYABLE DATA SCIENCE

… with Anaconda and Anaconda Enterprise, including:

• Scaled-up Analytics: Develop and deploy the same code/environments on your local machine and a cluster

• Environment management: Dynamically manage Python, R, dependencies and other conda packages and environments across a cluster

• Collaboration: Easily share versioned notebooks and projects across users and replicate analysts’ environments for different jobs/users/groups

• Hadoop integration: Support for Hadoop, Spark and other distributed workflows; compatible with enterprise Hadoop distributions

Page 24: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

ADDITIONAL RESOURCES FOR SCALABLE AND DEPLOYABLE DATA SCIENCE

• Anaconda Enterprise subscriptions: https://www.continuum.io/anaconda-subscriptions

• Anaconda Scale: https://docs.continuum.io/anaconda-scale

• Webinars on scaling out with Anaconda: https://www.continuum.io/webinars

• Blog posts on scaling out with Anaconda: https://www.continuum.io/blog/developer-blog

Productionizing and Deploying Data Science Projects

Page 25: Scalable & Deployable Data Science with the Anaconda Platform | AnacondaCON 2017

Thank You!

@ContinuumIO @koverholt