Harnessing the Power of Anaconda for Scalable Data Science · 2017. 5. 16.

© 2017 Continuum Analytics - Confidential & Proprietary

Harnessing the Power of Anaconda for Scalable Data Science

Peter Wang, CTO, Co-founder, Continuum Analytics

Stan Seibert, Director of Community Innovation

Anaconda: Connecting data scientists to the best tools

221 W. 6th Street, Suite #1550, Austin, TX 78701, +1 512.222.5440

[email protected]

@ContinuumIO

Anaconda Distribution

Open Data Science
Most popular open source Python, R, and C/C++ libraries for data science

Cross Platform
• Windows, macOS & Linux
• x86, x86_64, ARM, PowerPC, z Systems
• Python 2 and 3

Free
Public repository of >700 packages, updated frequently and regularly

Python for Data Science

Anaconda delivers core libraries of Python data science to millions of users every month:

• NumPy
• SciPy
• Pandas
• Jupyter
• scikit-learn
• statsmodels, bokeh, matplotlib, hdf5, sqlalchemy, PyMC, ...

The Machine Learning Paradigm Shift

• (Much) Bigger Training Sets
• Faster & Specialized Hardware
• Open Source Tools
• Improved Algorithms

The Deep Learning Software Stack

HARDWARE: Multi-core CPU, GPU, Many-core CPU (Xeon Phi)
PRIMITIVES: MKL 2017, cuDNN, MIOpen
TENSOR MATH: TensorFlow, Theano, PyTorch
NEURAL NETWORKS: Keras, TFLearn, Caffe
...and many others

GPU Accelerated Anaconda Packages

Available Now
• Caffe
• TensorFlow
• Theano
• Keras (w/ GPU-accel TensorFlow)
• Numba

Coming in Q3 2017
• PyTorch
• MXNet
• CNTK
• H2O Deep Water

TensorFlow

TensorFlow combines:
• Multidimensional arrays (tensors)
• Computational graphs
• Gradient calculation
• Optimizers

Create custom machine learning algorithms.
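To make the ingredients listed on this slide concrete, here is a minimal sketch in plain Python of a computational graph, gradient calculation, and a gradient-descent optimizer. It is not TensorFlow's API: the `Node` class and the `backward` and `fit_slope` functions are hypothetical names invented for this illustration, and the graph holds scalars rather than tensors.

```python
class Node:
    """A scalar value in a toy computational graph (hypothetical class)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        # Post-order walk: parents are appended before the nodes that use them.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse order guarantees a node's gradient is complete before it is pushed on.
    for node in reversed(order):
        for parent, local_gradient in node.parents:
            parent.grad += node.grad * local_gradient


def fit_slope(xs, ys, steps=200, learning_rate=0.01):
    """Fit y = w * x by gradient descent on the summed squared error."""
    w = 0.0
    for _ in range(steps):
        w_node = Node(w)
        loss = Node(0.0)
        for x, y in zip(xs, ys):
            error = w_node * Node(x) - Node(y)
            loss = loss + error * error
        backward(loss)
        w -= learning_rate * w_node.grad  # the "optimizer" step
    return w
```

Calling `fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` moves `w` toward 2.0. A real framework applies the same idea to multidimensional arrays and can run the graph on a GPU.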

Anaconda Enterprise: Connecting data scientists to their businesses

Anaconda Enterprise
Open Source Without Anxiety: Support, Governance, and Scalability

Open-source support
– Conda support; break-fix support on analytics packages
– IP assurance on packages

Reproducible data analytics, on-premises
– Governance for your analytics environment
– Empowerment of data scientists within scope of IT controls

Scale analytics deployments
– Scale up: leverage specialized hardware for performance
– Scale out: deploy analyses to Hadoop & Spark clusters

Anaconda Enterprise: Repository

On-site Package Repo and Sharing platform
• Mirror public repository of packages
• Analysts consume packages from local repo
• Analysts upload and share notebooks & pre-configured computing environments
• Developers create, deploy & share custom packages

Diagram labels: Public Anaconda Repository (Cloud), Mirror, Internal Anaconda Repository, Authentication (Active Directory/LDAP, optional)

Biz Analyst (consume)
    conda install numpy ipython
    conda update ipython
    conda create -n env1 ipython pandas

Data Scientist (share)
    conda env upload environment.yml project1
    anaconda notebook upload project1.ipynb

Developer (deploy)
    conda build project2
    anaconda upload project2.bz2

Anaconda Enterprise: Notebooks

Diagram labels: Web Interface, Anaconda Enterprise Notebook Server, Interactive Compute Session, GPU Compute Nodes, Package Control, Internal Anaconda Package Repository, Authentication (Active Directory/LDAP, optional)

Workflow:
• Analysts log into the Enterprise notebook server, authenticating against LDAP/AD
• Based on the project they select, they are redirected to the appropriate GPU compute node
• All notebooks and Python code run on compute nodes
• Any required packages are pulled down from the local (on-prem) repository

Avoids the need to install GPUs on desktops!


Deep Learning In Anaconda Enterprise


Anaconda Enterprise: Deploy

Diagram labels: Laptop/Server, Anaconda Enterprise

One-click, self-service deployment of data science dashboards, apps, and web services

Encapsulates all Docker/Kubernetes operations to deploy your projects, integrated with your Identity Management System.

• Anaconda Project: Complete project specification
• Docker: Containers that run your deployed applications
• Kubernetes: Orchestration of containers

Anaconda Fusion: Bridging the worlds of Excel and Data Science

• Data Scientist
• Advanced Analyst
• Analyst/Manager

Anaconda Community Innovation: Connecting data scientists to the future

New Tools for GPU-Powered Data Science

• Numba: Compiler for Python Functions
  – Can target CUDA GPUs
  – Lowers the barrier to GPU computing in Python
• Dask: Distributed Computing Made Easy
  – Python native
  – Can be combined with XGBoost and TensorFlow
  – Many distributed GPU workflows possible
• And one very new project...

Numba: Python Compilation

• Compile numerical Python functions for execution on CPU or GPU
• Based on the NVVM compiler library

See tutorial L7108: CUDA Programming in Python with Numba on Wed @ 2pm!
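As a rough illustration of compiling a numerical Python function, the sketch below applies Numba's `njit` decorator to a simple loop. It targets the CPU only, and the GPU path mentioned on the slide is not shown. The function name `pi_leibniz` is made up for this example, and the `ImportError` fallback to plain Python is an assumption added so the snippet also runs where Numba is not installed.

```python
import math

try:
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    # Assumption for this sketch: without Numba, run the plain Python version.
    def njit(func):
        return func


@njit
def pi_leibniz(n_terms):
    """Approximate pi with the first n_terms of the Leibniz series."""
    total = 0.0
    sign = 1.0
    for k in range(n_terms):
        total += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * total


# Compare the approximation against math.pi.
approx_error = abs(pi_leibniz(100_000) - math.pi)
```

The decorated function is called like any other Python function. With Numba installed, the first call triggers compilation for the argument types it sees, and later calls reuse the compiled code.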

Dask: Distributed Computing

• Scalable execution of task graphs, from single computers to 1000+ node clusters
• Scheduler is "resource aware" and can direct GPU tasks to nodes with appropriate hardware. Great for heterogeneous clusters!
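To show what "executing a task graph" means, here is a toy single-process executor written in plain Python. It is not Dask's scheduler. The graph layout, a dict whose values are either literals or tuples of a callable and its arguments, follows the convention Dask documents for its graphs, but the `run_graph` function and the key names are hypothetical and exist only for this sketch.

```python
from operator import add, mul


def run_graph(graph, key, cache=None):
    """Compute `key` by recursively resolving its dependencies in `graph`."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]  # each task runs at most once

    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # String arguments that name other keys are dependencies; resolve them first.
        resolved = [
            run_graph(graph, arg, cache)
            if isinstance(arg, str) and arg in graph
            else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = task  # a literal value, not a task

    cache[key] = result
    return result


# A small graph: total = (x + y) + (x * y)
graph = {
    "x": 3,
    "y": 4,
    "sum": (add, "x", "y"),
    "product": (mul, "x", "y"),
    "total": (add, "sum", "product"),
}

total = run_graph(graph, "total")
```

Because "sum" and "product" do not depend on each other, a real scheduler is free to run them in parallel or on different machines. The toy executor above simply runs them one after the other.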


What's next for GPU-powered data science?

Problem: An Ecosystem of Silos?

Diagram: four GPU applications (ETL/Data Prep, Database, Machine Learning, Visualization), each holding its own data, linked by CPU transfers.

Why do GPU applications share data through slow CPU memory?

Solution: Keep the Data on the GPU

Diagram: ETL/Data Prep, Database, Machine Learning, and Visualization applications with their data kept on the GPU.

GPU Open Analytics Initiative

Goal: Standardize data exchange between GPU analytics applications

Founding members: MapD, Continuum Analytics, H2O.ai

http://gpuopenanalytics.com/


Feeding the Beast

GPU Dataframe (GDF)

• A format for tabular data in GPU memory
• Exchange GDF objects between different libraries and different processes using CUDA IPC
• Currently testing with a format based on Apache Arrow*
• Continuum Analytics is creating the Python library for GDF
• Planned feature-complete release for Strata NYC (fall 2017)

* format changes likely during alpha period

PyGDF: Python GPU Dataframes

Planned Feature Set:
• Create GDFs from NumPy arrays and Pandas dataframes
• Exchange GDFs with other processes
• Execute user defined functions on GDFs (compiled with Numba)
• Basic GPU accelerated dataframe functions: sort, filter, join, group-by
• Dask Dataframe support for distributed GPU data frames

Example Data Science Pipeline

Stages: GPU Database, Python Data Transformation, Generalized Linear Model

Data passed between stages: GDF, PackedArray

All data stays on the GPU

Conclusion

Anaconda powers your scalable data science workflow:

• Anaconda Distribution: GPU accelerated packages at your fingertips
• Anaconda Enterprise: Connects your data science teams with GPU servers
• Anaconda Community Innovation: Open source projects to connect Python users to the GPU (Numba, Dask, GPU Dataframe)

https://www.continuum.io/