Open Data Science with R and Anaconda

OPEN DATA SCIENCE WITH R: Make Life Easier & More Powerful with Anaconda. Christine Doig, Senior Data Scientist

Upload: continuum-analytics

Post on 16-Apr-2017


Category: Data & Analytics

TRANSCRIPT

Page 1: Open Data Science with R and Anaconda

OPEN DATA SCIENCE WITH R

Make Life Easier & More Powerful with Anaconda

Christine Doig, Senior Data Scientist

Page 2: Open Data Science with R and Anaconda

About me

Christine Doig, Senior Data Scientist, Continuum Analytics

Christine Doig is a Senior Data Scientist at Continuum Analytics, where she worked on MEMEX, a DARPA-funded project helping stop human trafficking. She has 5+ years of experience in analytics, operations research, and machine learning in a variety of industries, including energy, manufacturing, and banking. Christine holds an M.S. in Industrial Engineering from the Polytechnic University of Catalonia in Barcelona. She is an open source advocate and has spoken at many conferences, including PyData, EuroPython, SciPy and PyCon.

Page 3: Open Data Science with R and Anaconda

Agenda - Open Data Science with R

• Introduction to Open Data Science
• Introduction to Anaconda, the leading Open Data Science platform
• Package and environment management for R
  – conda, R-Essentials and MRO
• Data Science Collaboration in R
  – Jupyter notebooks for R and Anaconda Enterprise Notebooks
• Scaling R
  – Anaconda for cluster management and SparkR

Page 4: Open Data Science with R and Anaconda

Introduction to OPEN DATA SCIENCE

Page 5: Open Data Science with R and Anaconda

Data Science is …

“An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms”

Wikipedia

© 2015 Continuum Analytics - Confidential & Proprietary

Page 6: Open Data Science with R and Anaconda

Open Data Science is …

an inclusive movement that makes open source tools of data science - data, analytics, & computation - easily work together as a connected ecosystem

Page 7: Open Data Science with R and Anaconda

Open Source ecosystems for Data Science

• NumPy
• SciPy
• Pandas
• Scikit-learn
• Jupyter/IPython
• dplyr
• shiny
• tidyr
• ggplot
• Spark

Page 8: Open Data Science with R and Anaconda

Introduction to ANACONDA

Page 9: Open Data Science with R and Anaconda

Anaconda is the leading Open Data Science platform, powered by Python, the fastest growing Open Data Science language

• Accelerate Time-to-Value
• Connect Data, Analytics, & Compute
• Empower Data Science Teams

Page 10: Open Data Science with R and Anaconda

Why Anaconda?

• Easy to install on all platforms
• Trusted by industry leaders, e.g. Microsoft Azure ML
• Large user base: 3M+ downloads
• BSD license
• Extensible: easily build, share and install proprietary libraries with Anaconda Cloud
• Language agnostic: Python, R, Scala…
• Allows isolated custom sandboxes with different versions of packages

Page 11: Open Data Science with R and Anaconda

Anaconda Glossary

[Diagram: Anaconda bundles Python, conda, and packages including NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages]

• Anaconda distribution: Python distribution that includes 150+ packages for data science

• conda: Cross-platform and language agnostic package and environment manager

• Miniconda: Lightweight version of Anaconda, with just Python and conda.

• Anaconda Cloud: Cloud service to host and share public and private packages, environments and notebooks

• conda environments: custom isolated sandboxes to easily reproduce and share data science projects

Page 12: Open Data Science with R and Anaconda

PACKAGE AND ENVIRONMENT MANAGEMENT FOR R

Page 13: Open Data Science with R and Anaconda

An R Reproducibility Problem

From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft

Page 14: Open Data Science with R and Anaconda

Reproducibility

• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or access to data
• Configuration of services: DBs, keys…
• Your analysis: script, notebook

Page 15: Open Data Science with R and Anaconda

Reproducibility solutions

• Bare metal
• Virtual Machines
• Docker containers
• Conda environments

[Diagram: on your laptop, server, or EC2 instance, conda environments Env 1, Env 2 and Env 3 each hold their own analysis or application (Analysis 1, Analysis 2, Analysis 3)]

Page 16: Open Data Science with R and Anaconda

Conda Environments

• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or access to data
• Configuration of services: DBs, keys…
• Your analysis: script, notebook

Page 17: Open Data Science with R and Anaconda

Conda Environments

A lightweight isolated sandbox to manage your dependencies and allow reproducibility of your project

environment.yml

$ conda env create

$ source activate ENV_NAME
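As a rough illustration of what such an environment.yml might hold for an R project, the Python sketch below writes one to a temporary directory. The environment name `r-analysis` and the helper `write_environment_file` are hypothetical, and the package list is only an example built from the `r` channel and the `r-essentials` metapackage covered in this talk.

```python
from pathlib import Path
import tempfile

# Hypothetical environment definition: the name and the package list are
# illustrative choices, not taken from the slides.
ENVIRONMENT_YML = """\
name: r-analysis
channels:
  - r
dependencies:
  - r-essentials
"""


def write_environment_file(directory):
    """Write environment.yml into `directory` and return its path."""
    path = Path(directory) / "environment.yml"
    path.write_text(ENVIRONMENT_YML)
    return path


# Use a throwaway directory so the sketch does not touch a real project.
project_dir = tempfile.mkdtemp()
env_file = write_environment_file(project_dir)
print(env_file.read_text())
```

Running `conda env create` in that directory and then `source activate r-analysis` would follow the two commands shown on the slide.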

Page 18: Open Data Science with R and Anaconda

Anaconda Cloud

Where packages, notebooks, and environments are shared. Powerful collaboration and package management for open source and private projects.

Public projects and notebooks are always free. Register today at anaconda.org

Page 19: Open Data Science with R and Anaconda

Anaconda for R

https://www.continuum.io/blog/developer/jupyter-and-conda-r

• R-Essentials: A conda metapackage with 80+ R packages for data science

$ conda config --add channels r
$ conda install r-essentials

• MRO: Microsoft R Open distribution with MKL

$ conda config --add channels mro
$ conda install r

Page 20: Open Data Science with R and Anaconda

Conda

• Package and environment manager
• Language agnostic (Python, R, Java…)
• Cross-platform (Windows, OS X, Linux)

$ conda install python=2.7
$ conda install pandas
$ conda install -c r r
$ conda install mongodb

Page 21: Open Data Science with R and Anaconda

Conda environments flow example

environment.yml

name: myenv
channels:
  - chdoig
  - r
  - foo
dependencies:
  - python=2.7
  - r
  - r-ldavis
  - pandas
  - mongodb
  - spark=1.5
  - pip
  - pip:
    - flask-migrate
    - bar==1.4

Create and activate

$ conda env create
$ source activate myenv

Freeze versions

$ conda env export -n myenv > freeze.yml

Upload to anaconda.org

$ conda server upload my_foo_env.yml
$ conda env create chdoig/my_foo_env.yml
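To see why the freeze step matters, it can help to list which dependencies in the example file are pinned to a version. This is a minimal sketch, assuming only the simple `name=version` form used on this slide (real conda specs can be richer); `split_pin` is a hypothetical helper, not part of conda.

```python
def split_pin(spec):
    """Split a 'name=version' spec into (name, version).

    The version is None when the spec carries no pin. Only the simple
    form used on the slide is handled here.
    """
    name, sep, version = spec.partition("=")
    return name.strip(), (version.strip() if sep else None)


# Conda dependencies from the example environment.yml (pip entries omitted).
dependencies = ["python=2.7", "r", "r-ldavis", "pandas", "mongodb", "spark=1.5"]

pinned = {}
unpinned = []
for spec in dependencies:
    name, version = split_pin(spec)
    if version is None:
        unpinned.append(name)
    else:
        pinned[name] = version

print("pinned:", pinned)
print("unpinned:", unpinned)
```

The unpinned names are the ones whose versions may differ from one machine to another until an exported, frozen file records them.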

Page 22: Open Data Science with R and Anaconda

FAQ

• R-Essentials has too many / too few / not the packages I want, how can I create my own “R-Essentials”?

$ conda metapackage custom-r-bundle 0.1.0 --dependencies r-irkernel jupyter r-ggplot2 r-dplyr --summary "My custom R bundle"

• I need an R package that is not on R-Essentials or the R channel, but is available through CRAN, how do I get it?

$ conda skeleton cran ldavis
$ conda build r-ldavis/
$ conda server upload r-ldavis
$ conda install -c chdoig r-ldavis
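The recipe directory in the commands above is `r-ldavis` even though the skeleton was requested for `ldavis`: R packages in conda channels conventionally carry an `r-` prefix and a lowercase name. Here is a small sketch of that naming convention; `cran_to_conda_name` is a hypothetical helper for illustration, not a conda function.

```python
def cran_to_conda_name(cran_name):
    """Return the conventional conda package name for a CRAN package.

    Assumes the common convention of an 'r-' prefix plus the lowercase
    CRAN name; individual channels may deviate from it.
    """
    return "r-" + cran_name.strip().lower()


for cran_name in ["LDAvis", "ggplot2", "dplyr"]:
    print(cran_name, "->", cran_to_conda_name(cran_name))
```

Knowing the convention makes it easier to guess the name to pass to `conda install` or to list under `--dependencies` in a custom metapackage.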

Page 23: Open Data Science with R and Anaconda

Anaconda: Navigator

A desktop graphical user interface included in Anaconda

• Launch applications and easily manage conda packages, environments and channels.
• No need to use the command line.
• Available for Windows, OS X and Linux.
• Anaconda Navigator has replaced Launcher.
• Integration with Anaconda Cloud.

Page 24: Open Data Science with R and Anaconda

Anaconda Repository

• Centralized internal repository to share packages, environments and notebooks
• Control user or team access to packages, environments and notebooks
• Blacklist packages in your organization (e.g. GPL licenses)
• Internal mirror of Anaconda
• Build and easily share internally developed software

Page 25: Open Data Science with R and Anaconda

DATA SCIENCE COLLABORATION WITH R

Page 26: Open Data Science with R and Anaconda

Data Science Development Environments

• PyCharm
• Spyder
• RStudio
• Eclipse
• Text Editors: Sublime, vim, emacs…

Page 27: Open Data Science with R and Anaconda

Jupyter

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

http://jupyter.org/
https://try.jupyter.org/

Page 28: Open Data Science with R and Anaconda

Jupyter

[Diagram: the Jupyter ecosystem: IPython, IPython notebook, nbviewer, tmpnb, binder]

https://try.jupyter.org/
http://mybinder.org/

Page 29: Open Data Science with R and Anaconda

Jupyter: IRkernel

https://www.continuum.io/blog/developer/jupyter-and-conda-r

$ conda config --add channels r
$ conda install r-essentials
$ jupyter notebook

Trivial to get started writing R notebooks the same way you write Python ones.
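Under the hood, an R notebook is the same kind of JSON document as a Python one; what differs is the kernel named in its metadata. The sketch below builds a minimal notebook with Python's standard library. It assumes the nbformat 4 layout and the kernel name `ir` that IRkernel registers by default; the file name and cell contents are made up for illustration.

```python
import json
import os
import tempfile


def make_r_notebook(source_lines):
    """Return a minimal nbformat 4 notebook dict with a single R code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": source_lines,
            }
        ],
        "metadata": {
            # Assumption: the default kernelspec installed by IRkernel.
            "kernelspec": {"name": "ir", "display_name": "R", "language": "R"},
            "language_info": {"name": "R"},
        },
        "nbformat": 4,
        "nbformat_minor": 0,
    }


notebook = make_r_notebook(["library(dplyr)\n", "summary(iris)"])

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "my_r_notebook.ipynb")
with open(path, "w") as f:
    json.dump(notebook, f, indent=1)

print(path)
```

Opening the resulting file from `jupyter notebook` should start it with the R kernel, provided that kernel is installed in the active environment.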

Page 30: Open Data Science with R and Anaconda

Jupyter

To start Jupyter notebooks, simply run the following command:

$ jupyter notebook

http://nbviewer.ipython.org/github/chdoig/conda-jupyter-irkernel/blob/master/Jupyter%20and%20conda%20for%20R.ipynb

Page 31: Open Data Science with R and Anaconda


Jupyter

Page 32: Open Data Science with R and Anaconda


Jupyter

Page 33: Open Data Science with R and Anaconda

Jupyter

$ jupyter nbconvert my_r_notebook.ipynb --to slides --post serve
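When the conversion needs to run from a script rather than a terminal, the same command can be assembled as an argument list. This is a hedged sketch: `slides_command` and `convert_to_slides` are hypothetical helper names, and the flags are exactly those shown on the slide.

```python
import shutil
import subprocess


def slides_command(notebook, serve=False):
    """Build the nbconvert argument list for turning a notebook into slides."""
    cmd = ["jupyter", "nbconvert", notebook, "--to", "slides"]
    if serve:
        # Same post-processing option as on the slide: serve the result locally.
        cmd += ["--post", "serve"]
    return cmd


def convert_to_slides(notebook, serve=False):
    """Run the conversion; raises if jupyter is missing or the command fails."""
    if shutil.which("jupyter") is None:
        raise RuntimeError("jupyter was not found on PATH")
    return subprocess.run(slides_command(notebook, serve), check=True)


# Only print the command here; call convert_to_slides() to actually run it.
print(" ".join(slides_command("my_r_notebook.ipynb", serve=True)))
```

Keeping the command construction separate from its execution makes it easy to check the arguments before anything is launched.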

Page 34: Open Data Science with R and Anaconda

DEMO 1: ENVIRONMENTS & REPOSITORY

Page 35: Open Data Science with R and Anaconda

Moving your team to collaborate with each other with Anaconda Enterprise Notebooks

[Diagram: data scientists sharing interactive notebooks, models, and data apps & visualizations]

Page 36: Open Data Science with R and Anaconda


Anaconda Enterprise Notebooks

• Collaborate with your team on the same project

• Notebooks enterprise extensions: diff, collaborative locking

• Manage collaborators and access to projects

• Search and tag notebooks

Page 37: Open Data Science with R and Anaconda

DEMO 2: NOTEBOOKS AND AEN

Page 38: Open Data Science with R and Anaconda

SCALING R

Page 39: Open Data Science with R and Anaconda

Scalability

Data Scientists want:

• Easy cluster setup and provisioning -> Anaconda for cluster management
• Distributed framework to scale analysis -> SparkR

Page 40: Open Data Science with R and Anaconda

Anaconda for cluster management

• Dynamically manage conda environments across a cluster
• Works with enterprise Hadoop distributions and HPC clusters
• Integrates with on-premises Anaconda repository
• Cluster management features are available with Anaconda subscriptions

[Diagram: client machine, head node, and three compute nodes]

Page 41: Open Data Science with R and Anaconda

Anaconda for cluster management

Before Anaconda for cluster management

Head Node
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies

Compute Nodes
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies

After Anaconda for cluster management

Head Node and Compute Nodes
Easily install conda environments and packages (including Python and R) across cluster nodes

• Empower IT with scalable and supported Anaconda deployments
• Fast, secure and scalable Python & R package management on tens or thousands of nodes
• Backed by an enterprise configuration management system
• Scalable Anaconda deployments tested in enterprise Hadoop and HPC environments

Page 42: Open Data Science with R and Anaconda

SparkR

• Spark is a distributed framework for large-scale processing
• It provides an R interface through SparkR

Page 43: Open Data Science with R and Anaconda

DEMO 3: ANACONDA FOR CLUSTER MANAGEMENT AND SPARKR

Page 44: Open Data Science with R and Anaconda


https://www.continuum.io/anaconda-subscriptions

Page 45: Open Data Science with R and Anaconda

https://www.continuum.io/anaconda-subscriptions

Page 46: Open Data Science with R and Anaconda

Enterprise Product Solutions

• Need a centralized repository to publish and share notebooks, environments and packages (OSS and private)? Get Anaconda Repository! (Available in Anaconda Workgroups and Enterprise)
• Need a centralized server to help your data science team interactively collaborate on projects? Get Anaconda Enterprise Notebooks! (Available in Anaconda Enterprise)
• Need a “data scientist friendly” cluster manager? Get Anaconda for cluster management! (Available in Anaconda Workgroups and Enterprise)

Page 47: Open Data Science with R and Anaconda

Contact Information and Additional Details

• Download Anaconda: https://www.continuum.io/downloads
• Sign up for Anaconda Cloud: https://anaconda.org
• Contact [email protected] for more information about Anaconda subscriptions, consulting, or training

Page 48: Open Data Science with R and Anaconda

Thank you!

Christine Doig
Twitter: @ch_doig

Email: [email protected]
Twitter: @ContinuumIO