anaconda data science collaboration
Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile
I'm going to start today by telling you about my background as a computational scientist, an area where I spent a decade partnering with scientists in areas from particle physics to molecular biology. I worked with those scientists to develop the computational models, systems, and simulations that allowed them to advance the boundaries of human knowledge.
#OpenDataScienceMeans#AnacondaCON Ian.Stokes-Rees @ijstokes
So this is a personal story.
OGT molecular dynamics simulation: protein mouth opening, 1 µs
About insights and discovery
CERN computing facility, Geneva, Switzerland
About numbers, computers, math, and science
CERN LHCb Control Room: first physics events, Dec 2010
About the people who work together to achieve great things
Success comes from team work
There is only one takeaway from this talk: success comes from team work.
While that may seem like a truism, the reality is that for a long time analytics of various stripes has consisted of individuals working away in an assembly line fashion, taking inputs from the person before them, and outputting results to the next person.
In my career I have used software such as Excel, Perl, and Matlab, outputting spreadsheets, PDFs, and PowerPoint. I imagine many of you have been the recipient of the kind of work I've produced in the past: appreciative of its completeness and insights but unsure how to engage in a conversation to improve or adapt the results.
Or worse, unable to recreate and extend the results quickly and easily the next time a similar situation arises.
Success comes from team work
This is my electrical engineering class mudbowl team from 1996. See if you can spot me.
I played football for 7 years and it shaped me as a person and my ideas about hard work, teams, leadership, and understanding how each person has an important role to play for success to be possible.
I have spent the last 20 years of my life working on large-scale data analysis and computational science problems, and there has never been a time when there has been more opportunity for teams of people, each bringing their own skills and insights to the game, to be able to do amazing things together.
So if there is a footnote to "Success comes from team work" it is this: team work in data science means bringing together individuals with different backgrounds and abilities, who are able to collaborate in real time, rapidly iterate their analysis, easily reproduce results, and scale their work from laptops to servers to clusters. I believe open data science is the only way to do that today.
Ian: Engineer, physicist, biologist?
Ian Stokes-Rees, @ijstokes
Product Marketing Manager
Computational Scientist
Passionate advocate of Open Data Science
Educator and evangelist for use of Python and Anaconda
[Start with today and then move through a story to establish credibility, entertain, and build a case for collaborative data science with Anaconda.]
First taste of big data computing
100,000 acoustic tri-phone models
100 parameters per model
10 million parameters to estimate
adaptation = real-time adjustment
computation = tricky!
1997 to 1999: Master's degree in large-vocabulary, speaker-independent continuous speech recognition
PhD on CERN LHCb COMPUTING TEAM
Distributed computing infrastructure
1000s of concurrent users
100s of federated computing centers
no centralized control
1M+ servers with software installed
20+ year life span
20 GB of data per second
14 hours per day
7 days a week
7 months of the year
March 26, 2010 LHCb first physics at 3.5 TeV
HOW DO CERN PHYSICISTS DO THIS?
Some smart people over there
  Who brought us the Web, HTTP, and HTML?
Big Data
  Multi-PB per year
Large collaborating teams
  1000s of people accessing systems
Computation critical
  Or there is no way to make sense of the data
  And discover new physics
December 2, 2016: LHCb proton-lead collisions
CERN ATLAS detector
Calorimeter end cap wiring harness
Millions of data feeds @ 40 MHz signal rate
HOW WOULD YOU DO IT?
Custom hardware: CMS L0 muon trigger ASIC
Giant compute and storage clusters
Wicked fast algorithms written in Fortran and C
Python: the Swiss army knife for computational physics
Do you think it makes sense to build a long-running, mission-critical, high-performance, distributed computing system in an interpreted and dynamically typed language? I sure didn't. I thought these physicists had spent too much time playing with anti-matter and they'd annihilated the common sense part of their brains.
Python: lingua franca for data science
Human readable
Easy to learn
Object oriented
Cleanly wraps C and Fortran
Amazing foundation of high quality data science libraries
Suitable for scripting, algorithms, data processing and applications
What do you have without a lingua franca? [tower of babel]
It is necessary to have common idioms, tools, and systems to facilitate communication and collaboration.
The calculus of Newton and Leibniz
Newton and Leibniz were 17th century renaissance thinkers who concurrently established the foundations of calculus to describe and analyze dynamic systems. History suggests that Newton used his influence to be credited as the creator of calculus at the time; ultimately, however, it is Leibniz we have to thank for the foundations of calculus as we know it today. It was only with Leibniz's clear notation and presentation of calculus that the world was able to benefit. In contrast, Newton's calculus was esoteric and inaccessible.
Sometimes esoteric is OK
Hermits and high priests
NPS, Richard Proenneke 1985
Data hermits work independently and have no accountability to anyone else. They can happily seclude themselves in a cottage off the grid and do their own thing in their own way. I will not deny it: sometimes this can be a path to innovation and enlightenment. But it can also be a path to isolation.
Data high priests have established universal rule over data modeling and analysis. Their power comes from their control, and they exercise it behind closed doors. Few are admitted to this priesthood, as they guard their skills and responsibilities jealously, but in return deliver quantitative insights as the moons and seasons change.
Of course these are both caricatures, but I am sure we've all seen aspects of the data hermit or the data high priest in people or organizations we've worked with.
Molecular Biology: from protons to proteins
It takes 3-9 months in the wet lab to prepare protein samples
Once prepared it is only a few days to image those samples and produce digitized representations
However the images aren't yet 3D atomic models
That takes from weeks to months to complete, sitting behind a computer
You may know it as protein folding
Nature, 2011 PMID: 21240259
Lazarus, Nam, Jiang, Sliz, Walker
After I completed my PhD I spent a year at a French research institute working on models for parallel distributed option pricing before moving to Harvard Medical School and joining a structural biology lab that wanted to improve their computational techniques for protein structure determination.
Here we're looking at a molecular dynamics simulation of the OGT enzyme, which is common in mammals. It acts as a nutrient sensor and is involved in signaling metabolic behavior.
OGT's role in metabolic regulation means that, in cases where it misbehaves, it is linked to diabetes, neurodegenerative diseases, and cancers.
I was not directly involved in this work, but my colleagues who were spent, collectively, many years working to determine the 3D structure of OGT in order to better understand its behavior. My contribution, in this particular case, was only to construct the MD simulation and produce this animation.
How do we accelerate the time to insight?
In other words, how can we process data faster, reduce the computational time, and improve the quality of the results?
Success comes from team work
Again, the answer comes from the key takeaway of this talk: success comes from team work.
By bringing together biochemists, data scientists, software engineers, and IT systems administrators, it is possible to tackle these challenges.
What does half way look like?
Today's good data science environment:
  Provide high performance computing resources
    For example, Hadoop infrastructure
  Deploy a wide selection of the most popular analysis software
  Training and documentation
  Technical support
The title of this talk is:
"Data Science Team Collaboration: Forget about meeting me half way, Take me the last mile"
What does half way look like?
First, half way is a great start, so don't feel bad if the following represents your reality.
[GO THROUGH SLIDE]
But where does that leave our biochemist trying to go from purified protein samples to a 3D molecular model and stuck on the computing part?
Fish out of water
Why would we take an expert biochemist and force them to be
  A software engineer?
  An IT system administrator?
  A statistician?
What can we do to let them focus on being a great biochemist?
Fish out of water
Why would we take an expert business analyst and force them to be
  A software engineer?
  An IT system administrator?
  A statistician?
What can we do to let them focus on being a great business analyst?
And of course you can swap biochemist for business analyst or any other person or role you can think of.
[DON'T READ SLIDE AGAIN]
Success comes from team work
Teams do not equal team work
Success doesn't come from just a team of people with different skills; it comes from that team being able to work together collaboratively, in real time, to iterate, each person applying their expertise.
Take me the last mile
DevOps engineer pre-configures scalable computation
  Laptop to server to cluster
  DevOps team is a partner, not a service provider
Software engineer creates and customizes software for the task, project or individual
  Avoiding generic, static software setups
Data scientist composes workflow
Analyst is provided simple high level interface
  With option to drill down
Then this is what it means to go the last mile.
What about those proteins?
Normally it takes 10-200 hours of computing time to match a template protein fragment to the imaging data
There are 100k templates (known protein folds) to choose from
Be stupid and just try them all: sometimes you'll be surprised!
I spent 18 months working with biochemists and IT sys admins across the country to create a sensible parallel & distributed workflow
  4-40 hours wall clock time to run 2k-20k hour parallel computation
  Real-time updates of results
  Web based interface to access summary and detailed data viz
  Analysis performed in Jupyter Notebook, allowing customization
  File-system based to enable drill down and direct access
6M hours per year (~700 years), peak parallelism 20k cores
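The numbers on this slide can be checked with a little arithmetic, and "just try them all" is the kind of search usually called embarrassingly parallel, because each template can be scored independently. What follows is a minimal Python sketch under stated assumptions, not the workflow actually built for the project. The function `score_template` is a hypothetical toy that counts matching characters in place of real template matching. A local thread pool stands in for the distributed cluster. The helper names `cpu_years` and `average_cores` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a non-leap year


def cpu_years(core_hours):
    """Convert core-hours into years of computing on a single core."""
    return core_hours / HOURS_PER_YEAR


def average_cores(core_hours, wall_clock_hours):
    """Average number of cores kept busy to finish in the given wall clock time."""
    return core_hours / wall_clock_hours


def score_template(observed, template):
    """Hypothetical toy score: count positions where template and data agree."""
    return sum(a == b for a, b in zip(template, observed))


def best_template(templates, observed, max_workers=4):
    """Score every template independently, then keep the highest-scoring one."""
    scorer = partial(score_template, observed)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call is independent, so the pool can run them in any order.
        scores = list(pool.map(scorer, templates))
    return max(zip(templates, scores), key=lambda pair: pair[1])


if __name__ == "__main__":
    # 6M core-hours is roughly 685 years, consistent with the "~700 years" above.
    print(round(cpu_years(6_000_000)))
    # If the two ranges pair up end to end, both imply about 500 cores on average.
    print(average_cores(2_000, 4), average_cores(20_000, 40))
    print(best_template(["AAAA", "ABCA", "ABCD"], "ABCD"))
```

A thread pool keeps the sketch easy to run on a laptop. For real CPU-bound scoring you would more likely hand the same independent tasks to a process pool or a cluster scheduler, which is the laptop to server to cluster progression described earlier in the talk.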
Data science pattern
How is it done today?
What is the opportunity for improvement?
Prototype and evaluate: is it better? Rinse and repeat
Standardize and automate the workflow/model
Scale the workflow/model
Preprocess and distribute the data
Instrument execution and set quality metrics
Establish easy access interface
Create programmatic APIs
A whole talk in one slide
FIN
Success comes from team work
Remember the footnote?
Collaborative cross-functional teams
Breaking data science open
A whole talk in one book
Anaconda & collaboration
I heard Continuum's founder, Travis Oliphant, give a talk at Supercomputing in 2012 where he described the vision for Continuum. It was a vision of collaborative, web-based, open data science. It was the embodiment of what I had spent the past decade doing on a one-off basis in computational physics, computational finance, and computational biology. I was hooked, so I left Harvard a few months later and joined Continuum to help make that vision a reality.
You've heard a lot about Anaconda this week, and I hope you've taken time to speak with my colleagues who are providing demos of the many aspects of the product, platform, and larger ecosystem in the exhibit area.
I'm going to finish my talk by giving you a three-step program for doing collaborative data science with Anaconda.
Step 1: Anaconda
http://continuum.io/downloads
With millions of users, it's the established way to put everyone on the same page.
Available for Windows, Mac, and Linux, with quarterly releases and rolling updates of the 200 amazing tools and libraries that are included in Anaconda.
Without Anaconda it would take you days to weeks to re-create the same set of capabilities.
It is the gateway to Open Data Science.
It is designed for a single user on a single system.
Notebooks support over 40 different language kernels, with the strongest support for Python and R.
Notebooks for data science collaboration
Do you understand why notebooks are so popular?
There are many angles to this, but my take:
  Visual record of the data science process
  They tell a story, and support rich hyperlinked prose
  Data can be embedded
  Algorithms or analysis techniques are captured
  Interactive visualizations are inline
  Sharable
  Reproducible*
Step 2: Anaconda Cloud
http://anaconda.org
Step 2: (MY) Anaconda Cloud
http://anaconda.org/ijstokes
Step 3: Anaconda enterprise (TODAY)
Step 3: Anaconda enterprise (coming soon)
Anaconda: Giving Superpowers to the teams who change the world
THANK YOU! QUESTIONS?
Ian Stokes-Rees, @ijstokes