Using Anaconda to Build a Custom Data Science Distribution at Bloomberg | AnacondaCON 2017

Leveraging the Anaconda Platform to Build a Custom Data Science Distribution at Bloomberg
AnacondaCON 2017
Armin Burgmeier <[email protected]>, Senior Software Engineer
February 9, 2017
© 2017 Bloomberg Finance L.P. All rights reserved.

Upload: continuum-analytics

Post on 12-Apr-2017




Outline
• Motivation
• Overview of the Anaconda Platform
• The Bloomberg Distribution
• Building and Deployment
• Lessons Learned
• Wishlist


Motivation
• Bloomberg is first and foremost a data company
• Many teams augment, ingest and analyze financial data
• Provide a modern data science platform for such teams
  • “Replace Excel with Jupyter notebooks”
  • Combining the Python scientific stack with Bloomberg data and services


Requirements
• Deployment to the desktop
• Windows only
• Support only a limited combination of packages
• Reproducible runtime environments
• Facilitate sharing of projects

Differences from Anaconda:
• Different set of standard packages (financial domain)
• Automatic management of packages and environments
• Allow updates of stable environments for security fixes and adaptations to backend changes


The Anaconda Platform
• Conda
  • A cross-platform binary package and environment manager
• Anaconda
  • Conda + a set of commonly used Python packages
• Anaconda Cloud
  • Share notebooks, environments, packages, …

A conda package in a nutshell:

matplotlib-1.5.3-np111py35_1.tar.bz2
(name: matplotlib, version: 1.5.3, build string: np111py35_1, tarball: .tar.bz2)

Set of files to install + metadata:
• Name, version, build string
• Dependencies
• Platform
• License
• …
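The file name layout above can be taken apart mechanically. The following is a minimal sketch, assuming the `<name>-<version>-<build>.tar.bz2` layout shown on the slide; the helper name `parse_conda_filename` and the `CondaPackageName` type are illustrative and not part of conda.

```python
from typing import NamedTuple


class CondaPackageName(NamedTuple):
    """The three components encoded in a conda package file name."""
    name: str
    version: str
    build: str


def parse_conda_filename(filename: str) -> CondaPackageName:
    """Split '<name>-<version>-<build>.tar.bz2' into its parts (illustrative helper)."""
    suffix = ".tar.bz2"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .tar.bz2 conda package: {filename!r}")
    stem = filename[: -len(suffix)]
    # Package names may contain dashes themselves, so split from the right.
    parts = stem.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    return CondaPackageName(*parts)


print(parse_conda_filename("matplotlib-1.5.3-np111py35_1.tar.bz2"))
```

Splitting from the right keeps names such as `scikit-learn` intact, since only the last two dashes separate the version and the build string.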


Building a conda package
• Packages are built from “recipes” with conda-build
• Recipes are meant to describe a reproducible build environment
• A recipe consists of
  • Name, version, build string
  • Dependencies needed to build (such as Python and setuptools for Python packages)
  • Build scripts (might invoke a C/C++ compiler)
  • Tests for the package
• conda-build ensures binary compatibility


conda-forge
• Community-driven effort to provide conda recipes and build infrastructure
• One git repository (“feedstock”) per recipe
  • Each recipe gets built by the infrastructure
• https://conda-forge.github.io/
• Packages available through a separate “channel”
• Some packages are available through both anaconda and conda-forge
  • Others are exclusive to either anaconda or conda-forge
• Mix and match


The Package Manifest
• Pin versions of all packages in the manifest
• Keep a separate manifest of “intended” packages
• Both manifests are conda packages

“Locked” manifest:
python 3.5.2 1
numpy 1.11.1 py35_2
pandas 0.19.1 np111_py35_0
…

“Intended” manifest:
python 3.5*
pandas
…

• Similar in idea to the difference between Cargo.lock and Cargo.toml in Rust
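The relationship between the two manifests can be sketched as a consistency check: every “intended” entry should be covered by a pinned entry in the “locked” manifest. This is only an illustration. The dictionaries below are a hypothetical in-memory form of the manifests, and simple glob matching stands in for conda's real version matching.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of the two manifests shown on the slide.
LOCKED = {
    "python": ("3.5.2", "1"),
    "numpy": ("1.11.1", "py35_2"),
    "pandas": ("0.19.1", "np111_py35_0"),
}
INTENDED = {"python": "3.5*", "pandas": "*"}  # "*" means "any version"


def unsatisfied(intended, locked):
    """Return the intended entries that the locked manifest fails to cover."""
    problems = []
    for name, pattern in intended.items():
        if name not in locked:
            problems.append(f"{name}: missing from locked manifest")
            continue
        version, _build = locked[name]
        # Glob matching is an assumption made for brevity, not conda's matcher.
        if not fnmatchcase(version, pattern):
            problems.append(f"{name}: {version} does not match {pattern}")
    return problems


print(unsatisfied(INTENDED, LOCKED))  # → []
```

Packages such as numpy appear only in the locked manifest because they are pulled in as dependencies rather than requested directly.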


Platform Versioning
• Assign a semantic version number to each manifest

[Diagram: Distribution 1 (Deprecated), Distribution 2 (Supported), Distribution 3 (Supported), Distribution 4 (Preview)]


Picking a Platform Version
• Every version is installed into its own environment
• Creating a new notebook:
  • Choose the current default major version
• Opening an existing notebook:
  • Same major version as the one the notebook was created with
  • Latest minor version
• Deprecating a major version:
  • Refuse to open notebooks with that major version
• Upgrading an existing notebook:
  • Always a conscious action
  • Run the notebook in the new environment
  • Testing and verification by the developer

[Diagram: User action → Determine major version → Find latest minor version → (maybe) Create environment → (maybe) Launch NB server → Open file in NB server]
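The selection rules above can be sketched in a few lines. This is an illustration under stated assumptions, not Bloomberg's implementation: versions are reduced to `(major, minor)` tuples, and the names `pick_version` and `DeprecatedVersionError` are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Version = Tuple[int, int]  # (major, minor); a simplification of a semantic version


class DeprecatedVersionError(Exception):
    """Raised when a notebook targets a major version that is no longer supported."""


def pick_version(
    available: Iterable[Version],
    supported_majors: Set[int],
    default_major: int,
    notebook_major: Optional[int] = None,
) -> Version:
    """Choose the platform version for a new or an existing notebook."""
    # New notebooks get the default major version; existing ones keep their own.
    major = default_major if notebook_major is None else notebook_major
    if major not in supported_majors:
        raise DeprecatedVersionError(f"major version {major} is not supported")
    candidates = [v for v in available if v[0] == major]
    if not candidates:
        raise LookupError(f"no release found for major version {major}")
    # Within a major version, always take the latest minor release.
    return max(candidates)


available = [(1, 0), (1, 1), (2, 0), (2, 3), (3, 0)]
print(pick_version(available, {2, 3}, default_major=3))                    # → (3, 0)
print(pick_version(available, {2, 3}, default_major=3, notebook_major=2))  # → (2, 3)
```

Upgrading a notebook to a new major version is deliberately not modeled here, since the slide describes it as a conscious, manual action.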


Garbage Collection
• Need a way to remove unused packages and environments
• Observation: we never run a version that has a more recent version in the same stable series (same major version)
• Remove all deprecated versions
• Remove all versions whose major version is no longer supported
• Remove all packages no longer installed in any environment
• Beware of concurrent operations!
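The removal rules can be expressed as a small planning function. The sketch below assumes a hypothetical data model in which each installed platform version, keyed by `(major, minor)`, maps to the set of package files it uses. It only computes what could be removed and ignores the locking that concurrent operations would require.

```python
def plan_garbage_collection(environments, supported_majors):
    """Return (versions to remove, package files no longer needed).

    environments: dict mapping (major, minor) -> set of package file names.
    supported_majors: set of major versions that are still supported.
    """
    latest_minor = {}
    for major, minor in environments:
        if major in supported_majors:
            latest_minor[major] = max(minor, latest_minor.get(major, minor))
    # Keep only the newest minor release of each supported major version.
    keep = set(latest_minor.items())
    remove = set(environments) - keep
    needed = set().union(*(environments[v] for v in keep))
    installed = set().union(*environments.values())
    return remove, installed - needed


# Illustrative data; the package names are placeholders.
envs = {
    (1, 0): {"a-1", "b-1"},
    (2, 0): {"a-1", "b-2"},
    (2, 1): {"a-2", "b-2"},
    (3, 0): {"a-2", "b-3"},
}
removed, unused = plan_garbage_collection(envs, supported_majors={2, 3})
print(sorted(removed), sorted(unused))
```

A real implementation would have to make sure that no environment is being created or launched while the plan is executed.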


Build System
• Inspired by conda-forge
  • Feedstock repositories separate from upstream code
• Continuous Integration
  • Buildbot builds the recipe on every PR and every push to master
  • Upload to internal Bloomberg channel
• Works great for C# codebases as well

[Diagram: PR on feedstock repo → Automatic build → Upload to separate channel → Manual testing if needed → Merge PR → Upload to main channel]


Customization
Need for customization of an upstream package:
• The matplotlib conda package depends on Qt
• Not needed in a Jupyter notebook-based environment
• No notion of “optional” dependencies in conda

• Fork the conda-forge matplotlib-feedstock repo
• Make the customization and add a “noqt” feature to the build
• The created package is matplotlib-1.5.3-np111py35_noqt_0
  • Avoids collision with packages from other channels
  • Tracking the “noqt” feature in our environment makes conda prefer our customization over the default package


Deployment
• All builds end up in an internal (“dev”) channel
• When a new platform version is ready for a wider audience, propagate the platform package and all packages it contains into a production channel

[Diagram: the “dev” channel holds 1.0, 1.1, 1.2 and 2.0; the “prod” channel holds 1.0, 1.1 and 1.2]
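Promotion from “dev” to “prod” amounts to copying the platform package plus everything it pins that the production channel does not hold yet. A minimal sketch of that set computation follows; the function name and the platform file name are hypothetical, and the actual copy step is left out.

```python
def packages_to_promote(platform_file, contained_files, prod_channel_files):
    """Return the package files that still have to be copied into the prod channel."""
    wanted = {platform_file, *contained_files}
    return sorted(wanted - set(prod_channel_files))


# "platform-1.2-0.tar.bz2" is a made-up name for the platform (manifest) package.
to_copy = packages_to_promote(
    "platform-1.2-0.tar.bz2",
    ["python-3.5.2-1.tar.bz2", "numpy-1.11.1-py35_2.tar.bz2"],
    prod_channel_files=["python-3.5.2-1.tar.bz2"],
)
print(to_copy)
```

Packages already promoted with an earlier platform version are skipped, so a minor release only adds the files that changed.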


Lessons Learned: Install Order
• We are using packages from both conda-forge and anaconda
  • Sometimes they don’t play well together
• bqplot needs ipywidgets installed at install time for its post-install script
• Circular dependencies are handled fine by the conda solver
  • But no guarantee about installation order!
• Workaround: prefer conda-forge over anaconda
• https://github.com/conda-forge/bqplot-feedstock/issues/11

[Diagram: dependencies among bqplot, ipywidgets and _nb_ext_conf, with packages coming from the conda-forge and anaconda channels]


Lessons Learned: Channel Pinning
• One package that we pinned is mpmath-0.19-py35_1
• Originally it was available in the anaconda channel
• “Suddenly” it became available in conda-forge
  • With different dependencies
  • Build fails because the new dependencies are not pinned
• Ideally we could pin the channel as well
  • In addition to version and build string
• Workaround: upload mpmath from anaconda to the Bloomberg channel
• Ultimate channel priority: Bloomberg -> conda-forge -> anaconda

[Diagram: mpmath-0.19-py35_1 as published in the anaconda and conda-forge channels, with dependencies python, mpir, mpfr, gmpy]
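Why the workaround helps can be illustrated with a toy model of strict channel priority: the first channel in the priority list that carries a given file wins. The channel contents below are illustrative, and the lookup is a simplification rather than conda's actual resolution logic.

```python
CHANNEL_PRIORITY = ["bloomberg", "conda-forge", "anaconda"]


def resolve_channel(filename, channel_contents, priority=CHANNEL_PRIORITY):
    """Return the highest-priority channel that carries the given package file."""
    for channel in priority:
        if filename in channel_contents.get(channel, ()):
            return channel
    raise LookupError(f"{filename} not found in any channel")


pinned = "mpmath-0.19-py35_1"

# Before the workaround: the same name/version/build exists in two channels.
contents = {"conda-forge": {pinned}, "anaconda": {pinned}}
print(resolve_channel(pinned, contents))  # → conda-forge

# After uploading the anaconda build to the internal channel, that copy wins.
contents["bloomberg"] = {pinned}
print(resolve_channel(pinned, contents))  # → bloomberg
```

Because name, version and build string are identical in both upstream channels, only the channel order decides which package, and therefore which set of dependencies, gets installed.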


Lessons Learned: Reproducible Builds
• Build-time dependencies are not pinned
  • Hard to enforce with current conda tools
• A build that works today might no longer work tomorrow
  • e.g. pandas 0.17.1 changed merge behavior, which broke one of our packages
• Possible solution:
  • After a successful build, “freeze” the dependency resolution and add it to the recipe
  • On subsequent builds, use the “frozen” dependency resolution
  • Make it an explicit action to re-resolve dependencies
  • Would need separate resolutions for different features, platforms, py/np versions
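The proposed freeze workflow might look roughly like the sketch below. Everything here is an assumption made for illustration: the file name `frozen_deps.json`, the `resolve` and `build` callables standing in for the solver and for conda-build, and the build strings in the example.

```python
import json
import tempfile
from pathlib import Path

FROZEN_NAME = "frozen_deps.json"  # hypothetical file stored next to the recipe


def build_with_frozen_deps(recipe_dir, resolve, build, re_resolve=False):
    """Build with frozen pins if present; freeze the pins after a successful build."""
    frozen_file = Path(recipe_dir) / FROZEN_NAME
    if frozen_file.exists() and not re_resolve:
        pins = json.loads(frozen_file.read_text())
    else:
        pins = resolve()  # explicit (re-)resolution of the build dependencies
    build(pins)  # expected to raise an exception if the build fails
    frozen_file.write_text(json.dumps(pins, indent=2, sort_keys=True))
    return pins


with tempfile.TemporaryDirectory() as recipe_dir:
    # Stand-in resolver: imagine upstream has moved on since the first build.
    first = build_with_frozen_deps(
        recipe_dir, lambda: {"pandas": "0.17.0-np110py35_0"}, lambda pins: None
    )
    second = build_with_frozen_deps(
        recipe_dir, lambda: {"pandas": "0.17.1-np110py35_0"}, lambda pins: None
    )
    print(first == second)  # the second build reuses the frozen pins
```

Separate frozen files per feature, platform and py/np combination would be needed in practice, as the last bullet notes.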


Wishlist: conda download
• A command that downloads packages but does not install them
• Allows working around the lack of build dependency pinning:
  • Download the build dependencies for a package
  • Add them to a local channel
  • Build the package with dependencies only from that channel
  • Conda is forced to resolve dependencies with the previously downloaded packages
• Allows shipping packages so they can be installed later without connectivity to the original channels
• http://github.com/conda/conda/issues/1150


Wishlist: Parallelize Install Steps
• Optimizing the install time of the first environment is crucial in our scenario
• Creating a conda environment takes three steps
  • Download package tarballs (network I/O-bound)
  • Extract package tarballs (CPU-bound)
  • Install packages into the environment (disk I/O and/or CPU-bound)
• Download size is O(200 MiB)
• First two steps could be (easily?) parallelized

[Diagram: timeline in which the downloads of packages A to D overlap with the extraction of packages A to D]
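The overlap of downloading and extracting could be sketched with a thread pool, as below. This is not how conda is implemented: `download` and `extract` are caller-supplied stand-ins, and the fake versions in the example only simulate work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def install_pipeline(packages, download, extract, max_workers=4):
    """Download all packages concurrently and extract each one as soon as it arrives."""
    extracted = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        downloads = {pool.submit(download, name): name for name in packages}
        extractions = {}
        for future in as_completed(downloads):
            tarball = future.result()
            # Extraction of this package starts while other downloads are running.
            extractions[pool.submit(extract, tarball)] = downloads[future]
        for future in as_completed(extractions):
            extracted[extractions[future]] = future.result()
    return extracted


def fake_download(name):
    time.sleep(0.01)  # stands in for network I/O
    return f"{name}.tar.bz2"


def fake_extract(tarball):
    return tarball[: -len(".tar.bz2")] + "/"  # stands in for unpacking


print(sorted(install_pipeline(["A", "B", "C", "D"], fake_download, fake_extract).items()))
```

Threads suit the network-bound download step; for the CPU-bound extraction step a process pool may be the better fit, depending on whether the decompression code releases the interpreter lock.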


Wishlist: .xz conda packages
• LZMA has a better compression ratio and better decompression speed
• Would significantly improve the time to download packages and create an environment

Test object: win-64/mkl-11.3.3-1.tar.bz2

Method   Size (MiB)   Decompression time   Decompression memory
bz2      110.0        18.6 s               4M
xz       73.3         9.6 s                8M
xz -9    59.0         8.3 s                64M

• Drawbacks:
  • Higher memory requirements at decompression
  • Extra dependency in Python 2.7 (backports.lzma)
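A comparison of this kind can be reproduced in spirit with Python's standard `bz2` and `lzma` modules. The sketch below uses a synthetic payload, so its numbers will not match the mkl measurements above; the function name `compare_codecs` is illustrative.

```python
import bz2
import lzma
import time


def compare_codecs(payload: bytes):
    """Return {label: (compressed size in bytes, decompression time in seconds)}."""
    codecs = {
        "bz2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),  # default preset
        "xz -9": (lambda data: lzma.compress(data, preset=9), lzma.decompress),
    }
    results = {}
    for label, (compress, decompress) in codecs.items():
        blob = compress(payload)
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        if restored != payload:
            raise RuntimeError(f"{label}: round trip failed")
        results[label] = (len(blob), elapsed)
    return results


# Synthetic, highly repetitive data; real package contents behave differently.
sample = b"conda package payload " * 50000
for label, (size, seconds) in compare_codecs(sample).items():
    print(f"{label:6s} {size:8d} bytes  {seconds:.4f} s")
```

On Python 3 the `lzma` module ships with the standard library, which is why the extra dependency listed under drawbacks only applies to Python 2.7.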


Conclusions
• The Anaconda Platform together with conda-forge is a great ecosystem for creating Python distributions
• Bloomberg builds its own provisioning of environments around it
  • Automatic management of environments
  • Long-term support for existing notebooks
  • Allow minor updates to stable environments for continued maintenance
• Mixing anaconda and conda-forge has some quirks
• Pinning of packages
  • “Intended” set of packages vs. “frozen” set of packages


Thank you!


Lessons Learned: Conflict Hints
• A conflict occurs when dependency constraints cannot be satisfied
  • e.g. ipywidgets=5.2.2 widgetsnbextension=1.2.3
  • ipywidgets depends on widgetsnbextension >= 1.2.6
• conda creates “hints” on how to resolve a conflict:

The following specifications were found to be in conflict:
  - alabaster 0.7.8 py35_0
  - widgetsnbextension 1.2.3 py35_1

• alabaster just happens to be the first entry in the list of dependencies
• http://github.com/conda/conda/issues/1859
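A more helpful hint would name the pair of packages that actually disagree. The sketch below shows the idea for the example above. It is not conda's solver: it only understands dotted numeric versions and `>=` constraints, and the data structures are hypothetical.

```python
def parse_version(text):
    """Turn '1.2.6' into (1, 2, 6); only dotted numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


def find_conflicts(pins, requirements):
    """Return (package, dependency, pinned version, required minimum) tuples.

    pins: package name -> pinned version.
    requirements: package name -> list of (dependency name, minimum version).
    """
    conflicts = []
    for package, deps in requirements.items():
        if package not in pins:
            continue
        for dep, minimum in deps:
            pinned = pins.get(dep)
            if pinned is not None and parse_version(pinned) < parse_version(minimum):
                conflicts.append((package, dep, pinned, minimum))
    return conflicts


pins = {"alabaster": "0.7.8", "ipywidgets": "5.2.2", "widgetsnbextension": "1.2.3"}
requirements = {"ipywidgets": [("widgetsnbextension", "1.2.6")]}
print(find_conflicts(pins, requirements))
# → [('ipywidgets', 'widgetsnbextension', '1.2.3', '1.2.6')]
```

In this toy model alabaster never shows up in the report, because none of its constraints are violated.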


Environment extensions
• What if you need a package not included in the platform?
• Add an IPython extension “%install”
• Runs the equivalent of
  • pip install -t some-directory <name>==<version> --no-deps --only-binary=:all:
  • Adds it to sys.path
• Pros:
  • Reproducible
  • Sources the Python package archive
• Cons:
  • No dependency resolution
  • No installation of data files (such as JavaScript for IPython widgets)
  • Only works for wheels (no custom code execution at install time)
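What such an extension does under the hood might be sketched as follows. The function names are hypothetical and the IPython magic registration is omitted; only the pip options quoted above are used, with `--target` being the long form of `-t`.

```python
import sys


def pip_install_command(name, version, target_dir):
    """Build the pip invocation described above as an argument list."""
    return [
        sys.executable, "-m", "pip", "install",
        "--target", target_dir,   # long form of -t
        f"{name}=={version}",     # exact pin keeps the install reproducible
        "--no-deps",              # no dependency resolution
        "--only-binary=:all:",    # wheels only, no code execution at install time
    ]


def activate(target_dir):
    """Make packages installed into target_dir importable in this session."""
    if target_dir not in sys.path:
        sys.path.append(target_dir)


# "example-pkg" and the directory are placeholders; nothing is installed here.
command = pip_install_command("example-pkg", "1.0", "some-directory")
print(" ".join(command[1:]))
# To actually install, one could pass `command` to subprocess.check_call.
activate("some-directory")
```

Appending to `sys.path` rather than prepending is a design choice in this sketch: platform packages keep precedence over extension packages with the same name.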


Environment extensions (cont.)
• Alternative: use conda
• User specifies extra requirements, for example pandas >=0.19.1
• When creating the environment
  • Install packages from the platform
  • Then install the extra requirements
  • “Freeze” the list of additional packages installed (possibly replacing platform packages)
• When re-creating the environment
  • Install the packages from the recorded (“frozen”) package list
  • Might create a conflict when the minor platform version has changed:
    • Conda should be able to solve it by downgrading some packages in the platform
  • Provide an option to re-resolve requirements at a later point
