open data science with r and anaconda

OPEN DATA SCIENCE WITH R Make Life Easier & More Powerful with Anaconda Christine Doig, Senior Data Scientist

Upload: continuum-analytics

Post on 21-Apr-2017


OPEN DATA SCIENCE WITH R: Make Life Easier & More Powerful with Anaconda

Christine Doig, Senior Data Scientist

About me

Christine Doig, Senior Data Scientist, Continuum Analytics

Christine Doig is a Senior Data Scientist at Continuum Analytics, where she worked on MEMEX, a DARPA-funded project helping stop human trafficking. She has 5+ years of experience in analytics, operations research, and machine learning in a variety of industries, including energy, manufacturing, and banking. Christine holds an M.S. in Industrial Engineering from the Polytechnic University of Catalonia in Barcelona. She is an open source advocate and has spoken at many conferences, including PyData, EuroPython, SciPy and PyCon.

Agenda - Open Data Science with R

• Introduction to Open Data Science
• Introduction to Anaconda, the leading Open Data Science platform
• Package and environment management for R
  – conda, R-Essentials and MRO
• Data Science Collaboration in R
  – Jupyter notebooks for R and Anaconda Enterprise Notebooks
• Scaling R
  – Anaconda for cluster management and SparkR

INTRODUCTION TO OPEN DATA SCIENCE

Data Science is …

“An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms” (Wikipedia)

© 2015 Continuum Analytics - Confidential & Proprietary

Open Data Science is …

an inclusive movement that makes open source tools of data science - data, analytics, & computation - easily work together as a connected ecosystem

Open Source ecosystems for Data Science

NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython

dplyr, shiny, tidyr, ggplot

Spark

INTRODUCTION TO ANACONDA

Anaconda is the leading Open Data Science platform, powered by Python, the fastest growing Open Data Science language

• Accelerate Time-to-Value
• Connect Data, Analytics, & Compute
• Empower Data Science Teams

Why Anaconda?

• Easy to install on all platforms
• Trusted by industry leaders: e.g. Microsoft Azure ML
• Large user base: 3M+ downloads
• BSD license
• Extensible - easily build, share and install proprietary libraries with Anaconda Cloud
• Language agnostic - Python, R, Scala…
• Allows isolated custom sandboxes with different versions of packages

Anaconda Glossary

[Diagram: Python; NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages; conda]

• Anaconda distribution: Python distribution that includes 150+ packages for data science

• conda: Cross-platform and language agnostic package and environment manager

• Miniconda: Lightweight version of Anaconda, with just Python and conda.

• Anaconda Cloud: Cloud service to host and share public and private packages, environments and notebooks

• conda environments: custom isolated sandboxes to easily reproduce and share data science projects

PACKAGE AND ENVIRONMENT MANAGEMENT FOR R

An R Reproducibility Problem

From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft

Reproducibility

• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or Access to data
• Configuration of Services: DBs, keys…
• Your Analysis - Script, Notebook

Reproducibility solutions

• Bare metal
• Virtual Machines
• Docker containers
• Conda environments

[Diagram labels: Your Analysis or Application; Your laptop, server, EC2 instance; Env 1, Env 2, Env 3; Analysis 1, Analysis 2, Analysis 3]

Conda Environments

• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or Access to data
• Configuration of Services: DBs, keys…
• Your Analysis - Script, Notebook

Conda Environments

A lightweight isolated sandbox to manage your dependencies and allow reproducibility of your project

environment.yml

$ conda env create
$ source activate ENV_NAME
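
As a minimal sketch of this workflow, the shell snippet below writes an `environment.yml` for an R project and lists the commands that would create and activate it. The environment name `r-demo` and the choice of packages are illustrative assumptions, not taken from the talk; the `r` channel and `r-essentials` are the ones named in these slides. The conda commands are left commented out so the file can be reviewed first.

```shell
# Write a small environment.yml describing an R environment.
# "r-demo" is a hypothetical name chosen for this sketch.
cat > environment.yml <<'EOF'
name: r-demo
channels:
  - r
dependencies:
  - r-essentials
  - jupyter
EOF

# Show the file so it can be reviewed before creating the environment.
cat environment.yml

# With conda installed, the environment would then be created and activated:
# conda env create -f environment.yml
# source activate r-demo
```

Keeping the file next to the analysis lets a collaborator recreate the same sandbox from a single command.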

Anaconda Cloud

Where packages, notebooks, and environments are shared. Powerful collaboration and package management for open source and private projects.

Public projects and notebooks are always free. REGISTER TODAY! ANACONDA.ORG

Anaconda for R

https://www.continuum.io/blog/developer/jupyter-and-conda-r

• R-Essentials: A conda metapackage with 80+ R packages for data science

$ conda config --add channels r
$ conda install r-essentials

• MRO: Microsoft R Open distribution with MKL

$ conda config --add channels mro
$ conda install r
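
One way to confirm that an R-Essentials install works is to load a couple of its packages from a script. The sketch below writes such a script; the file name `check_r.R` and the use of the built-in `mtcars` dataset are assumptions made for illustration, and the script is only written, not run.

```shell
# Write a short R script that loads two packages bundled with r-essentials.
cat > check_r.R <<'EOF'
library(dplyr)
library(ggplot2)

# Summarise a built-in dataset with dplyr to confirm the packages work.
print(mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg)))
EOF

# Inside an environment with r-essentials installed, run it with:
# Rscript check_r.R
```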

Conda

• Package and environment manager
• Language agnostic (Python, R, Java…)
• Cross-platform (Windows, OS X, Linux)

$ conda install python=2.7
$ conda install pandas
$ conda install -c r r
$ conda install mongodb

Conda environments flow example

environment.yml

name: myenv
channels:
  - chdoig
  - r
  - foo
dependencies:
  - python=2.7
  - r
  - r-ldavis
  - pandas
  - mongodb
  - spark=1.5
  - pip
  - pip:
    - flask-migrate
    - bar==1.4

Create and activate

$ conda env create
$ source activate myenv

Freeze versions

$ conda env export -n myenv > freeze.yml

Upload to anaconda.org

$ conda server upload my_foo_env.yml
$ conda env create chdoig/my_foo_env.yml

FAQ

• R-Essentials has too many / too few / not the packages I want, how can I create my own “R-Essentials”?

$ conda metapackage custom-r-bundle 0.1.0 --dependencies r-irkernel jupyter r-ggplot2 r-dplyr --summary "My custom R bundle"

• I need an R package that is not on R-Essentials or the R channel, but is available through CRAN, how do I get it?

$ conda skeleton cran ldavis
$ conda build r-ldavis/
$ conda server upload r-ldavis
$ conda install -c chdoig r-ldavis

Anaconda: Navigator

A desktop graphical user interface included in Anaconda

• Launch applications and easily manage conda packages, environments and channels.
• No need to use the command line.
• Available for Windows, OS X and Linux.
• Anaconda Navigator has replaced Launcher.
• Integration with Anaconda Cloud.

Anaconda Repository

• Centralized internal repository to share packages, environments and notebooks
• Control user or team access to packages, environments and notebooks
• Blacklist packages in your organization (e.g. GPL licenses)
• Internal mirror of Anaconda
• Build and easily share internally developed software

DATA SCIENCE COLLABORATION WITH R

Data Science Development Environments

PyCharm Spyder

Text Editors: Sublime, vim, emacs…

RStudio Eclipse

Jupyter

http://jupyter.org/
https://try.jupyter.org/

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Jupyter

[Diagram labels: IPython, IPython notebook, Jupyter, nbviewer, tmpnb, binder]

https://try.jupyter.org/
http://mybinder.org/

Jupyter: IRkernel

https://www.continuum.io/blog/developer/jupyter-and-conda-r

$ conda config --add channels r
$ conda install r-essentials
$ jupyter notebook

Trivial to get started writing R notebooks the same way you write Python ones.
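
To make concrete what an R notebook looks like on disk, here is a minimal sketch that writes a one-cell notebook targeting the R kernel. The file name `my_r_notebook.ipynb` and the cell contents are assumptions for illustration; `ir` is the kernel name that IRkernel registers with Jupyter. Apart from the `kernelspec` metadata, the file has the same structure as a Python notebook.

```shell
# Write a minimal nbformat-4 notebook that targets the R kernel.
# The kernel name "ir" is the one IRkernel registers with Jupyter.
cat > my_r_notebook.ipynb <<'EOF'
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": ["summary(cars)"]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "R",
      "language": "R",
      "name": "ir"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
EOF

# Confirm which kernel the notebook asks for.
grep '"name"' my_r_notebook.ipynb

# With Jupyter and r-essentials installed, open it with:
# jupyter notebook my_r_notebook.ipynb
```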

Jupyter

To start Jupyter notebooks, simply run the following command:

$ jupyter notebook

http://nbviewer.ipython.org/github/chdoig/conda-jupyter-irkernel/blob/master/Jupyter%20and%20conda%20for%20R.ipynb

Jupyter

$ jupyter nbconvert my_r_notebook.ipynb --to slides --post serve

DEMO 1: ENVIRONMENTS & REPOSITORY

Moving your team to collaborate with each other with Anaconda Enterprise Notebooks

[Diagram labels: Data Scientist (x3); Interactive notebooks; Models; Data apps & visualizations]

Anaconda Enterprise Notebooks

• Collaborate with your team on the same project

• Notebooks enterprise extensions: diff, collaborative locking

• Manage collaborators and access to projects

• Search and tag notebooks

DEMO 2: NOTEBOOKS AND AEN

SCALING R

Scalability

Data Scientists want:

• Easy cluster setup and provisioning -> Anaconda for cluster management
• Distributed framework to scale analysis -> SparkR

Anaconda for cluster management

• Dynamically manage conda environments across a cluster

• Works with enterprise Hadoop distributions and HPC clusters

• Integrates with on-premises Anaconda repository

• Cluster management features are available with Anaconda subscriptions

[Diagram labels: Client Machine; Head Node; Compute Node (x3)]

Anaconda for cluster management

Before Anaconda for cluster management

Head Node
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies

Compute Nodes
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies

After Anaconda for cluster management

Head Node and Compute Nodes
Easily install conda environments and packages (including Python and R) across cluster nodes

• Empower IT with scalable and supported Anaconda deployments
• Fast, secure and scalable Python & R package management on tens or thousands of nodes
• Backed by an enterprise configuration management system
• Scalable Anaconda deployments tested in enterprise Hadoop and HPC environments

SparkR

• Distributed framework for large scale processing

• Provides an R interface through SparkR
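
As a rough illustration of the R interface, the sketch below writes a small SparkR script. It assumes the Spark 1.x API (`sparkR.init` and `sparkRSQLInit`), in line with the `spark=1.5` pin used earlier in the deck; newer Spark releases start a session with `sparkR.session()` instead. The file name `sparkr_demo.R`, the local master, and the `faithful` dataset are illustrative choices, and the script is only written, not run.

```shell
# Write a small SparkR script using the Spark 1.x API
# (sparkR.init / sparkRSQLInit); newer Spark releases use sparkR.session().
cat > sparkr_demo.R <<'EOF'
library(SparkR)

# Start a local Spark context and a SQL context on top of it.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQLInit(sc)

# Turn R's built-in 'faithful' data frame into a distributed DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Filter with Spark, then bring the first rows back to R.
head(filter(df, df$waiting < 50))

sparkR.stop()
EOF

# With Spark installed, the script would be submitted with:
# spark-submit sparkr_demo.R
```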

DEMO 3: ANACONDA FOR CLUSTER MANAGEMENT AND SPARKR

https://www.continuum.io/anaconda-subscriptions

Enterprise Product Solutions

• Need a centralized repository to publish and share notebooks, environments and packages (OSS and private)? Get Anaconda Repository! (Available in Anaconda Workgroups and Enterprise)

• Need a centralized server to help your data science team interactively collaborate on projects? Get Anaconda Enterprise Notebooks! (Available in Anaconda Enterprise)

• Need a “data scientist friendly” cluster manager? Get Anaconda for cluster management! (Available in Anaconda Workgroups and Enterprise)

Contact Information and Additional Details

• Download Anaconda: https://www.continuum.io/downloads

• Sign up for Anaconda Cloud: https://anaconda.org

• Contact [email protected] for more information about Anaconda subscriptions, consulting, or training

Thank you!

Email: [email protected]
Twitter: @ContinuumIO

Christine Doig
Twitter: @ch_doig