harnessing the power of anaconda for scalable data science · 2017. 5. 16. · harnessing the power...
TRANSCRIPT
© 2017 Continuum Analytics - Confidential & Proprietary
Harnessing the Power of Anaconda for Scalable Data Science
Peter Wang, CTO and Co-founder, Continuum Analytics
Stan Seibert, Director of Community Innovation
221 W. 6th Street, Suite #1550, Austin, TX 78701, +1 512.222.5440
@ContinuumIO
Anaconda: Connecting data scientists to the best tools
Anaconda Distribution
Open Data Science
Cross Platform
Free
Most popular open source Python, R, and C/C++ libraries for data science
• Windows, macOS & Linux
• x86, x86_64, ARM, PowerPC, z System
• Python 2 and 3
Public repository of >700 packages, updated frequently and regularly
Anaconda delivers core libraries of Python data science to millions of users every month:
• NumPy
• SciPy
• Pandas
• Jupyter
• scikit-learn
• statsmodels, bokeh, matplotlib, hdf5, sqlalchemy, PyMC, ...
Python for Data Science
The Machine Learning Paradigm Shift
(Much) Bigger Training Sets
Faster & Specialized Hardware
Open Source Tools
Improved Algorithms
The Deep Learning Software Stack
HARDWARE: Multi-core CPU, GPU, Many-core CPU (Xeon Phi)
PRIMITIVES: MKL 2017, cuDNN, MIOpen
TENSOR MATH: TensorFlow, Theano, PyTorch
NEURAL NETWORKS: Keras, TFLearn, Caffe
...and many others
GPU Accelerated Anaconda Packages
Available Now:
• Caffe
• TensorFlow
• Theano
• Keras (w/ GPU-accel TensorFlow)
• Numba
Coming in Q3 2017:
• PyTorch
• MXNet
• CNTK
• H2O Deep Water
TensorFlow combines:
• Multidimensional arrays (tensors)
• Computational graphs
• Gradient calculation
• Optimizers
Create custom machine learning algorithms.
TensorFlow
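How the four ingredients above fit together can be illustrated without TensorFlow itself. What follows is a minimal pure-Python sketch, not TensorFlow's API: the names `Node`, `backward`, `sgd_step`, and `fit_line` are hypothetical and invented for this illustration, and scalars stand in for multidimensional arrays to keep it short.

```python
class Node:
    """One value in a computational graph, with links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, local_derivative_of_self_wrt_parent).
        self.parents = parents

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Gradient calculation: reverse-mode differentiation over the graph."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # inputs are appended before the nodes that use them

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


def sgd_step(params, learning_rate):
    """Optimizer: plain gradient descent, then reset the gradients."""
    for p in params:
        p.value -= learning_rate * p.grad
        p.grad = 0.0


def fit_line(xs, ys, steps=500, learning_rate=0.02):
    """Fit y = w * x + b by minimizing the summed squared error."""
    w, b = Node(0.0), Node(0.0)
    for _ in range(steps):
        loss = Node(0.0)
        for x, y in zip(xs, ys):
            err = w * Node(x) + b - Node(y)
            loss = loss + err * err
        backward(loss)
        sgd_step([w, b], learning_rate)
    return w.value, b.value
```

Calling `fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])` should recover a slope near 2 and an intercept near 1. A real framework performs the same steps on tensors and can place the graph on a GPU.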
Anaconda Enterprise: Connecting data scientists to their businesses
Open-source support
– Conda support; break-fix support on analytics packages
– IP assurance on packages
Reproducible data analytics, on-premises
– Governance for your analytics environment
– Empowerment of data scientists within scope of IT controls
Scale analytics deployments
– Scale up: leverage specialized hardware for performance
– Scale out: deploy analyses to Hadoop & Spark clusters
Anaconda Enterprise
Open Source Without Anxiety: Support, Governance, and Scalability
Public Anaconda Repository
Cloud
conda install numpy ipython
conda update ipython
conda create -n env1 ipython pandas
conda env upload environment.yml project1
anaconda notebook upload project1.ipynb
conda build project2
anaconda upload project2.bz2
Active Directory/LDAP (optional)
Authentication
Mirror
On-site Package Repo and Sharing platform
• Mirror public repository of packages
• Analysts consume packages from local repo
• Analysts upload and share notebooks & pre-configured computing environments
• Developers create, deploy & share custom packages
Internal Anaconda Repository
Anaconda Enterprise: Repository
Biz Analyst (consume)
Data Scientist (share)
Developer (deploy)
GPU Compute Nodes
Package Control
Internal Anaconda Package Repository
Authentication
Anaconda Enterprise Notebook Server
Interactive Compute Session
Web Interface
Active Directory/LDAP (optional)
Workflow:
• Analysts log into the Enterprise notebook server, authenticating against LDAP/AD
• Based on the project they select, they are redirected to the appropriate GPU compute node
• All notebooks and Python code run on compute nodes
• Any required packages are pulled down from the local (on-prem) repository
Anaconda Enterprise: Notebooks
Avoids the need to install GPUs on desktops!
Deep Learning In Anaconda Enterprise
Anaconda Enterprise: Deploy
Laptop/Server
Anaconda Enterprise
One-click, self-service deployment of data science dashboards, apps, web services
Encapsulates all Docker/Kubernetes operations to deploy your projects, integrated with your Identity Management System.
Anaconda Project: Complete project specification
Docker: Containers that run your deployed applications
Kubernetes: Orchestration of containers
Data Scientist
Analyst/Manager
Advanced Analyst
Anaconda Fusion: Bridging the worlds of Excel and Data Science
Anaconda Community Innovation: Connecting data scientists to the future
• Numba: Compiler for Python Functions
  – Can target CUDA GPUs
  – Lowers the barrier to GPU computing in Python
• Dask: Distributed Computing Made Easy
  – Python native
  – Can be combined with XGBoost and TensorFlow
  – Many distributed GPU workflows possible
• And one very new project...
New Tools for GPU-Powered Data Science
• Compile numerical Python functions for execution on CPU or GPU
• Based on the LLVM compiler library (NVVM, NVIDIA's LLVM-based compiler library, for CUDA GPU targets)
Numba: Python Compilation
See tutorial L7108: CUDA Programming in Python with Numba on Wed @ 2pm!
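As a small illustration of compiling a numerical function, here is a sketch that uses Numba's `njit` decorator on a plain loop. The function `sum_of_squares` is a made-up example, and the `ImportError` fallback is an addition so the sketch still runs (uncompiled) where Numba is not installed. Only the CPU target is shown; the GPU path goes through `numba.cuda.jit` and needs a CUDA device, so it is left out here.

```python
try:
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    # Assumption for portability: without Numba, run the plain Python version.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in [0, n) with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total
```

The first call, for example `sum_of_squares(10)`, triggers compilation for integer input; later calls with the same argument type reuse the compiled code.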
• Scalable execution of task graphs, from single computers to 1000+ node clusters
• Scheduler is "resource aware" and can direct GPU tasks to nodes with appropriate hardware. Great for heterogeneous clusters!
Dask: Distributed Computing
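To make "task graph" concrete, the sketch below evaluates a small graph written in the dict-of-tuples style that Dask documents for its graphs: each key maps either to a literal value or to a tuple of a callable followed by its arguments, where an argument naming another key is a dependency. The `evaluate` function is hypothetical and runs serially in one process; it is not Dask's scheduler, which can run independent tasks in parallel across threads, processes, or cluster nodes.

```python
from operator import add, mul


def evaluate(graph, key):
    """Compute `key` in a dict-based task graph, resolving dependencies first."""
    cache = {}  # each task is computed at most once

    def resolve(arg):
        # A string that names a graph key is a dependency; anything else is a literal.
        if isinstance(arg, str) and arg in graph:
            return compute(arg)
        return arg

    def compute(k):
        if k not in cache:
            task = graph[k]
            if isinstance(task, tuple) and task and callable(task[0]):
                func, *args = task
                cache[k] = func(*(resolve(a) for a in args))
            else:
                cache[k] = resolve(task)
        return cache[k]

    return compute(key)


# A four-node graph: two inputs, a sum, and a product that depends on the sum.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "product": (mul, "sum", 10),
}
```

Here `evaluate(graph, "product")` computes `"sum"` first and then multiplies it by 10. Because the graph is plain data, a scheduler is free to decide where each task runs, which is what makes resource-aware placement possible.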
What's next for GPU-powered data science?
Problem: An Ecosystem of Silos?
[Diagram: four GPU applications (ETL/Data Prep, Database, Machine Learning, Visualization), each with its own data]
[Diagram: the same four GPU applications, with data passed between them by CPU transfer]
Why do GPU applications share data through slow CPU memory?
Solution: Keep the Data on the GPU
[Diagram: ETL/Data Prep, Database, Machine Learning, and Visualization, with the data kept on the GPU]
GPU Open Analytics Initiative
Goal:
Standardize data exchange between GPU analytics applications
Founding members:
MapD, Continuum Analytics, H2O.ai
http://gpuopenanalytics.com/
Feeding the Beast
• A format for tabular data in GPU memory
• Exchange GDF objects between different libraries and different processes using CUDA IPC
• Currently testing with a format based on Apache Arrow*
• Continuum Analytics is creating the Python library for GDF
• Planned feature-complete release for Strata NYC (fall 2017)
GPU Dataframe (GDF)
* format changes likely during alpha period
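Since the GDF format is described as Arrow-based and still changing, the sketch below only illustrates the general Arrow-style columnar idea in ordinary host memory: each column is a contiguous values buffer plus a validity bitmap with one bit per value (least-significant bit first, 1 meaning "valid"). The `Column` class and the sample `table` are hypothetical, and nothing here touches the GPU, CUDA IPC, or the actual GDF layout.

```python
from array import array


class Column:
    """A nullable float64 column: a values buffer plus a validity bitmap."""

    def __init__(self, items):
        self.length = len(items)
        # Contiguous buffer of 8-byte floats; null slots hold a placeholder 0.0.
        self.values = array("d", (0.0 if v is None else v for v in items))
        # One bit per value, packed eight to a byte.
        self.validity = bytearray((self.length + 7) // 8)
        for i, v in enumerate(items):
            if v is not None:
                self.validity[i // 8] |= 1 << (i % 8)

    def is_valid(self, i):
        return bool((self.validity[i // 8] >> (i % 8)) & 1)

    def to_list(self):
        return [self.values[i] if self.is_valid(i) else None
                for i in range(self.length)]

    def sum(self):
        """Sum the non-null values."""
        return sum(self.values[i] for i in range(self.length) if self.is_valid(i))


# A table is a mapping from column names to equally long columns.
table = {
    "price": Column([1.5, None, 3.0]),
    "qty": Column([2.0, 4.0, None]),
}
```

Because every column is a flat buffer with a known layout, another library or process can read it without conversion once it has a handle to the same memory, which is the property a shared dataframe format relies on.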
Planned Feature Set:
• Create GDFs from NumPy arrays and Pandas dataframes
• Exchange GDFs with other processes
• Execute user defined functions on GDFs (compiled with Numba)
• Basic GPU accelerated dataframe functions: sort, filter, join, group-by
• Dask Dataframe support for distributed GPU data frames
PyGDF: Python GPU Dataframes
Example Data Science Pipeline
[Diagram: GPU Database, Python Data Transformation, Generalized Linear Model; data exchanged as GDF and PackedArray]
All data stays on the GPU
Anaconda powers your scalable data science workflow:
• Anaconda Distribution: GPU accelerated packages at your fingertips
• Anaconda Enterprise: Connects your data science teams with GPU servers
• Anaconda Community Innovation: Open source projects to connect Python users to the GPU (Numba, Dask, GPU Dataframe)
https://www.continuum.io/
Conclusion