DATA SCIENCE TEAM COLLABORATION: FORGET ABOUT MEETING ME HALFWAY, TAKE ME THE LAST MILE

Upload: ian-stokes-rees

Post on 16-Apr-2017

Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile

I'm going to start today by telling you about my background as a computational scientist, an area where I spent a decade partnering with scientists in fields from particle physics to molecular biology. I worked with those scientists to develop the computational models, systems, and simulations that allowed them to advance the boundaries of human knowledge.

#OpenDataScienceMeans #AnacondaCON Ian.Stokes-Rees @ijstokes

So this is a personal story.

OGT molecular dynamics simulation: protein mouth opening, 1 µs

About insights and discovery.

CERN computing facility, Geneva, Switzerland

About numbers, computers, math, and science.

CERN LHCb Control Room: first physics events, Dec 2010

About the people who work together to achieve great things.

Success comes from team work

There is only one takeaway from this talk: success comes from team work.

While that may seem like a truism, the reality is that for a long time analytics of various stripes has consisted of individuals working away in assembly-line fashion, taking inputs from the person before them and outputting results to the next person.

In my career I have used software such as Excel, Perl, and Matlab, outputting spreadsheets, PDFs, and PowerPoint. I imagine many of you have been the recipient of the kind of work I've produced in the past: appreciative of its completeness and insights but unsure how to engage in a conversation to improve or adapt the results.

Or worse, unable to recreate and extend the results quickly and easily the next time a similar situation arises.

Success comes from team work

This is my electrical engineering class mudbowl team from 1996. See if you can spot me.

I played football for 7 years and it shaped me as a person and my ideas about hard work, teams, leadership, and understanding how each person has an important role to play for success to be possible.

I have spent the last 20 years of my life working on large scale data analysis and computational science problems and there has never been a time when there has been more opportunity for teams of people, each bringing their own skills and insights to the game, to be able to do amazing things together.

So if there is a footnote to "Success comes from team work" it is this: team work in data science means bringing together individuals with different backgrounds and abilities, who are able to collaborate in real time, rapidly iterate on their analysis, easily reproduce results, and scale their work from laptops to servers to clusters. I believe open data science is the only way to do that today.

Ian: Engineer, physicist, biologist?

Ian Stokes-Rees, @ijstokes
Product Marketing Manager
Computational Scientist
Passionate advocate of Open Data Science
Educator and evangelist for use of Python and Anaconda

[Start with today and then move through a story to establish credibility, entertain, and build a case for collaborative data science with Anaconda.]

First taste of big data computing

100,000 acoustic tri-phone models
100 parameters per model
10 million parameters to estimate
adaptation = real-time adjustment
computation = tricky!

1997 to 1999, Master's degree in large-vocabulary, speaker-independent continuous speech recognition

PhD on CERN LHCb Computing Team
Distributed computing infrastructure
1000s of concurrent users
100s of federated computing centers
no centralized control
1M+ servers with software installed
20+ year life span
20 GB of data per second
14 hours per day
7 days a week
7 months of the year

March 26, 2010: LHCb first physics at 3.5 TeV

HOW DO CERN PHYSICISTS DO THIS?
Some smart people over there
  Who brought us the Web, HTTP, and HTML?
Big Data
  Multi-PB per year
Large collaborating teams
  1000s of people accessing systems
Computation critical
  Or there is no way to make sense of the data
  And discover new physics

December 2, 2016: LHCb proton-lead collisions

CERN ATLAS detector
Calorimeter end cap wiring harness
Millions of data feeds @ 40 MHz signal rate

HOW WOULD YOU DO IT?

Custom hardware: CMS L0 muon trigger ASIC

Giant compute and storage clusters

Wicked fast algorithms written in Fortran and C

Python: the Swiss army knife for computational physics

Do you think it makes sense to build a long-running, mission-critical, high-performance distributed computing system in an interpreted and dynamically typed language? I sure didn't. I thought these physicists had spent too much time playing with anti-matter and they'd annihilated the common sense part of their brains.

Python: lingua franca for data science
Human readable
Easy to learn
Object oriented
Cleanly wraps C and Fortran
Amazing foundation of high quality data science libraries
Suitable for scripting, algorithms, data processing and applications

What do you have without a lingua franca? [Tower of Babel]

It is necessary to have common idioms, tools, and systems to facilitate communication and collaboration.
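
To make the "cleanly wraps C and Fortran" point from the slide a little more concrete, here is a minimal sketch, assuming a POSIX system (Linux or macOS), that calls the C library's strlen from Python through the standard ctypes module. The helper name c_strlen is made up for illustration, and this is not code from the LHCb system; ctypes is only one of several wrapping routes (Cython, or NumPy's f2py for Fortran, are others).

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into this process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Return the byte length of `text` as computed by C's strlen."""
    return libc.strlen(text.encode("utf-8"))


print(c_strlen("LHCb"))  # 4
```

The Python side stays readable while the actual work happens in compiled code, which is the division of labor the slide is describing.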

The calculus of Newton and Leibniz

Newton and Leibniz were 17th-century thinkers who concurrently established the foundations of calculus to describe and analyze dynamic systems. History suggests that Newton used his influence to be credited as the creator of calculus at the time; ultimately, however, it is Leibniz we have to thank for the foundations of calculus as we know it today. It was only with Leibniz's clear notation and presentation of calculus that the world was able to benefit. In contrast, Newton's calculus was esoteric and inaccessible.

Sometimes esoteric is OK

Hermits and high priests

NPS, Richard Proenneke 1985

Data hermits work independently and have no accountability to anyone else. They can happily seclude themselves in a cottage off the grid and do their own thing in their own way. I will not deny it: sometimes this can be a path to innovation and enlightenment. But it can also be a path to isolation.

Data high priests have established universal rule over data modeling and analysis. Their power comes from their control, and they exercise it behind closed doors. Few are admitted to this priesthood, as they guard their skills and responsibilities jealously, but in return deliver quantitative insights as the moons and seasons change.

Of course these are both caricatures, but I am sure we've all seen aspects of the data hermit or the data high priest in people or organizations we've worked with.

Molecular Biology: from protons to proteins

It takes 3-9 months in the wet lab to prepare protein samples
Once prepared it is only a few days to image those samples and produce digitized representations
However the images aren't yet 3D atomic models
That takes from weeks to months to complete, sitting behind a computer
You may know it as protein folding
Nature, 2011 PMID: 21240259
Lazarus, Nam, Jiang, Sliz, Walker

After I completed my PhD I spent a year at a French research institute working on models for parallel distributed option pricing before moving to Harvard Medical School and joining a structural biology lab that wanted to improve their computational techniques for protein structure determination.

Here we're looking at a molecular dynamics simulation of the OGT enzyme, which is common in mammals. It acts as a nutrient sensor and is involved in signaling metabolic behavior.

OGT's role in metabolic regulation means that it is linked to diabetes, neurodegenerative diseases, and cancers in cases where it misbehaves.

I was not directly involved in this work, but my colleagues who were spent, collectively, many years working to determine the 3D structure of OGT in order to better understand its behavior. My contribution, in this particular case, was only to construct the MD simulation and produce this animation.

How do we accelerate the time to insight?

In other words, how can we process data faster, reduce the computational time, and improve the quality of the results?

Success comes from team work

Again, the answer comes from the key takeaway of this talk: success comes from team work.

By bringing together biochemists, data scientists, software engineers, and IT systems administrators, it is possible to tackle these challenges.

What does half way look like?
Today's good data science environment:
  Provide high performance computing resources
    For example, Hadoop infrastructure
  Deploy a wide selection of the most popular analysis software
  Training and documentation
  Technical support

The title of this talk is:

"Data Science Team Collaboration: Forget about meeting me half way, Take me the last mile"

What does half way look like?

First, half way is a great start, so don't feel bad if the following represents your reality.

[GO THROUGH SLIDE]

But where does that leave our biochemist trying to go from purified protein samples to a 3D molecular model and stuck on the computing part?

Fish out of water
Why would we take an expert biochemist and force them to be
  A software engineer?
  An IT system administrator?
  A statistician?

What can we do to let them focus on being a great biochemist?

Fish out of water
Why would we take an expert business analyst and force them to be
  A software engineer?
  An IT system administrator?
  A statistician?

What can we do to let them focus on being a great business analyst?

And of course you can swap biochemist for business analyst or any other person or role you can think of.

[DON'T READ SLIDE AGAIN]

Success comes from team work

Teams do not equal team work

Success doesn't come from just a team of people with different skills; it comes from that team being able to work together collaboratively, in real time, to iterate, with each person applying their expertise.

Take me the last mile
DevOps engineer pre-configures scalable computation
  Laptop to server to cluster
  DevOps team is a partner, not a service provider
Software engineer creates and customizes software for the task, project or individual
  Avoiding generic, static software setups
Data scientist composes workflow
Analyst is provided simple high level interface
  With option to drill down

Then this is what it means to go the last mile.

What about those proteins?
Normally it takes 10-200 hours of computing time to match a template protein fragment to the imaging data
There are 100k templates (known protein folds) to choose from
Be stupid and just try them all; sometimes you'll be surprised!
I spent 18 months working with biochemists and IT sys admins across the country to create a sensible parallel & distributed workflow
  4-40 hours wall clock time to run 2k-20k hour parallel computation
  Real-time updates of results
  Web based interface to access summary and detailed data viz
  Analysis performed in Jupyter Notebook, allowing customization
  File-system based to enable drill down and direct access
  6M hours per year (~700 years), peak parallelism 20k cores
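
The "just try them all" approach on the slide above can be sketched roughly as follows. This is a toy illustration, not the actual workflow: score_template and search_all_templates are hypothetical names, the character-matching score is a stand-in for the real 10-200 hour template-matching job, and a local thread pool stands in for the distributed cluster.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def score_template(template, image_data):
    """Hypothetical stand-in for the expensive matching job.

    Counts the positions where the template agrees with the data.
    """
    return sum(1 for a, b in zip(template, image_data) if a == b)


def search_all_templates(templates, image_data, max_workers=4):
    """Score every template in parallel, yielding the running best match
    each time a job finishes."""
    best_name, best_score = None, float("-inf")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted job back to the template name it belongs to.
        futures = {
            pool.submit(score_template, template, image_data): name
            for name, template in templates.items()
        }
        for future in as_completed(futures):
            score = future.result()
            if score > best_score:
                best_name, best_score = futures[future], score
            yield best_name, best_score


# Made-up toy data: three "folds" and one "image".
templates = {"fold_a": "AAAA", "fold_b": "ABCD", "fold_c": "ABCA"}
image_data = "ABCD"

# Keep only the last update, i.e. the best match once all jobs are done.
*_, final = search_all_templates(templates, image_data)
print(final)  # ('fold_b', 4)
```

Because an update is yielded as each job completes, a front end could show the running best match while the remaining templates are still being scored, which is the idea behind the slide's "real-time updates of results". As a sanity check on the slide's numbers, 6M hours divided by the 8,760 hours in a year is about 685 years of single-core time, in line with the "~700 years" quoted.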

Data science pattern
How is it done today?
What is the opportunity for improvement?
Prototype and evaluate: is it better? Rinse and repeat
Standardize and automate the workflow/model
Scale the workflow/model
Preprocess and distribute the data
Instrument execution and set quality metrics
Establish easy access interface
Create programmatic APIs
A whole talk in one slide
FIN

Success comes from team work
Remember the footnote?
Collaborative cross-functional teams

Breaking data science open
A whole talk in one slide book

Anaconda & collaboration

I heard Continuum's founder, Travis Oliphant, give a talk at Supercomputing in 2012 where he described the vision for Continuum. It was a vision of collaborative, web-based, open data science. It was the embodiment of what I had spent the past decade doing on a one-off basis in computational physics, computational finance, and computational biology. I was hooked, so I left Harvard a few months later and joined Continuum to help make that vision a reality.

You've heard a lot about Anaconda this week, and I hope you've taken time to speak with my colleagues who are providing demos of the many aspects of the product, platform, and larger ecosystem in the exhibit area.

I'm going to finish my talk by providing you with a three-step program to enable you to do collaborative data science with Anaconda.

Step 1: Anaconda

http://continuum.io/downloads

With millions of users, it's the established way to put everyone on the same page.

Available for Windows, Mac, and Linux, with quarterly releases and rolling updates of the 200 amazing tools and libraries that are included in Anaconda.

Without Anaconda it would take you days to weeks to re-create the same set of capabilities.

It is the gateway to Open Data Science.

It is designed for a single user on a single system.

Notebooks support over 40 different language kernels, with the strongest support for Python and R.

Notebooks for data science collaboration
Do you understand why notebooks are so popular?
There are many angles to this, but my take:
  Visual record of the data science process
  They tell a story, and support rich hyperlinked prose
  Data can be embedded
  Algorithms or analysis techniques are captured
  Interactive visualizations are inline
  Sharable
  Reproducible*

Step 2: Anaconda Cloud

http://anaconda.org

Step 2: (MY) Anaconda Cloud

http://anaconda.org/ijstokes

Step 3: Anaconda enterprise (TODAY)

Step 3: Anaconda enterprise (coming soon)

Anaconda: Giving Superpowers to the people who change the world
teams

THANK YOU! QUESTIONS?

Ian Stokes-Rees @ijstokes