Back to square one: building a data science team from scratch
DESCRIPTION
Generally speaking, big data and data science originated in the west and are coming to Europe with a bit of a delay. There is at least one exception though: the London-based music discovery website Last.fm is a data company at heart and has been doing large-scale data processing and analysis for years. It started using Hadoop in early 2006, for instance, making it one of the earliest adopters worldwide. When I left Last.fm to join Massive Media, the social media company behind Netlog.com and Twoo.com, I basically moved from a data science forerunner to a newcomer. Massive Media had at least as much data to play with and tremendous potential, but they were not doing much with it yet. The data science team had to be built from the ground up, and every step had to be argued for and justified along the way. Having re-evaluated everything I learned at Last.fm and started over with a clean slate at Massive Media, I developed a pretty clear perspective on how to find good data scientists, what they should be doing, what tools they should be using, and how to organize them to work together efficiently as a team. That is precisely what I would like to share in this talk.
TRANSCRIPT
BUILDING DATA SCIENCE TEAMS FROM SCRATCH
Klaas Bosteels @klbostee
MY CAREER PATH SO FAR
2007: Began working with big data as PhD student
2009: Embarked on a data science career at Last.fm
2011: Joined Massive Media as Lead Data Scientist
Last.fm: data company at heart; one of the earliest Hadoop adopters worldwide; inventors of Ketama; organised first “NoSQL” meetup in SF.
Massive Media: huge audience and tremendous potential, but data science newcomer at the time.
MY TEAM AT MASSIVE MEDIA
Currently 4 permanent people (+ interns!), so not huge just yet
Relatively big and growing faster than anticipated though
OUR MISSION IS HELPING THE COMPANY...
MEASURE metrics dashboards
EVALUATE data-driven testing
DECIDE ad hoc data insights
IMPROVE e.g. abuse detection
EXTEND new product features
PROMOTE PR via data porn
higher risk but bigger returns
very wide range of tasks
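As one possible illustration of the "EVALUATE: data-driven testing" item on the mission slide, here is a minimal sketch of a two-proportion z-test for comparing the conversion rates of two variants. The function name and the example numbers are hypothetical and not taken from the talk; the slides do not say which statistical test the team used.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors for variant A (hypothetical inputs)
    conv_b / n_b: conversions and visitors for variant B
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    z, p = two_proportion_z_test(200, 10000, 260, 10000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the observed difference is unlikely to be noise. In practice the sample size and significance threshold should be fixed before the test starts.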
STEP 1
FOLLOW THE MONEY
photo by Chris Isherwood
BOOTSTRAP BY SAVING OR GAINING MONEY
You need to get some capital to get started
Saving money tends to be easier in practice
Real-world example:
• Analyzing CDN logs unveiled abuse
• Stopping the abuse greatly reduced the bills
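The kind of log analysis behind this example can be sketched as follows. This is only an assumption-laden illustration: the talk does not describe the log format or the nature of the abuse, so the four-column line layout, the hotlinking scenario, and the names `bytes_per_referrer` and `suspicious_referrers` are all hypothetical.

```python
from collections import Counter


def bytes_per_referrer(log_lines):
    """Sum bytes served per referrer domain.

    Assumes a hypothetical whitespace-separated format:
    <timestamp> <client_ip> <referrer_domain> <bytes_sent>
    """
    totals = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        _timestamp, _client_ip, referrer, size = parts
        try:
            totals[referrer] += int(size)
        except ValueError:
            continue  # skip lines with a non-numeric byte count
    return totals


def suspicious_referrers(totals, own_domains, min_share=0.2):
    """Return external referrers responsible for at least min_share of traffic."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(
        referrer
        for referrer, size in totals.items()
        if referrer not in own_domains and size / grand_total >= min_share
    )
```

At real CDN log volumes the same aggregation would more likely run as a Hadoop job, but the logic stays the same: group by a suspect dimension, sum the cost, and look at the top offenders.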
STEP 2
EMBRACE HADOOP
photo by Doug Kukurudza
HADOOP
Not the holy grail, but deserves a central role
It has a vibrant community and is proven to be:
ECONOMICAL runs on commodity hardware
SCALABLE smart distributed processing
MAINTAINABLE very robust and fault-tolerant
FLEXIBLE predefined schemas not required
STEP 3
BUILD DASHBOARDS
photo by Dawn Hopkins
STATS PIPELINE BASED ON HADOOP
MapReduce
HBase
HDFS
Log collector
Dashboards
in batches
continuous
Realtime processing
Cf. “lambda architecture” coined by @nathanmarz
Ad-hoc results
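The idea behind the "lambda architecture" reference can be sketched in a few lines: a batch view is recomputed periodically over all data up to some cutoff, a realtime view covers only the events since that cutoff, and a query combines the two. This is a toy in-memory sketch under stated assumptions, not the pipeline from the slides: events are hypothetical `(timestamp, metric)` tuples, and plain counters stand in for the MapReduce and HBase layers.

```python
from collections import Counter


def count_events(events):
    """Count occurrences of each metric in an iterable of (timestamp, metric)."""
    return Counter(metric for _timestamp, metric in events)


def build_views(events, batch_cutoff):
    """Split events into a batch view (before cutoff) and a realtime view."""
    batch = count_events(e for e in events if e[0] < batch_cutoff)
    realtime = count_events(e for e in events if e[0] >= batch_cutoff)
    return batch, realtime


def query(metric, batch, realtime):
    """Answer a query by merging the batch and realtime views."""
    # Counter returns 0 for metrics it has not seen.
    return batch[metric] + realtime[metric]
```

When the next batch run completes, the cutoff moves forward and the realtime view shrinks accordingly, so errors in the realtime layer are eventually overwritten by the batch results.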
PYTHON IS AN AWESOME JACK OF ALL TRADES
It is great for building dashboards:
• Hadoop support: Dumbo, Python UDFs for Pig, ...
• Several amazing web frameworks, e.g. Flask
• Likewise for drawing graphs, e.g. PyCairo
And it covers many other data science needs as well:
• Scripting, prototyping and full-blown programming
• NumPy, SciPy, PyLab, Scikit-learn, Pandas, ...
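To give a feel for the Hadoop side of this, here is a minimal sketch of a daily event count written in the mapper/reducer style that Dumbo uses, where both are generator functions yielding key/value pairs. The log line format and the `run_locally` helper are hypothetical; the helper only simulates the shuffle step in memory so the sketch runs without a cluster. With Dumbo installed, the same two functions would, to my understanding, be handed to `dumbo.run(mapper, reducer)` instead.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit ((day, event), 1) for each well-formed log line.

    Assumes a hypothetical line format: <YYYY-MM-DD> <user_id> <event>
    """
    fields = value.split()
    if len(fields) == 3:
        day, _user_id, event = fields
        yield (day, event), 1


def reducer(key, values):
    """Sum the counts emitted for one (day, event) key."""
    yield key, sum(values)


def run_locally(lines):
    """Simulate map, shuffle (sort and group by key), and reduce in memory."""
    mapped = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (count for _key, count in group)))
    return results
```

The output rows, one per day and event type, are the kind of aggregate a dashboard can read directly.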
STEP 4
ASSEMBLE A TEAM
photo by Jean-François Schmitz
THE SECRET IS IN THE MIX
Hadoop’s tricks also apply to data science teams
• Avoid specialisation to allow easy distribution and scaling
• Exploit data locality by hiring people with wide skill set
Great Data Scientists have the right mix of skills
• Hackers with solid technical background
• Analytical mind that knows statistics and machine learning
• Clever and creative in everything they do
STEP 5
EXPLORE & INNOVATE
photo by NASAr
SOME TIPS AND TRICKS
Dare to fail and/or start from estimates
Introduce data exploration/innovation days
• Basically 20% time devoted to playing with data
• Incorporate brainstorming
• Encourage collaboration
Communicate findings to the rest of the company
• Fun and silliness are allowed
• Prototype early and often
FIVE SIMPLE STEPS IS ALL IT TAKES
1 FOLLOW THE MONEY
2 EMBRACE HADOOP
3 BUILD DASHBOARDS
4 ASSEMBLE A TEAM
5 EXPLORE & INNOVATE
Thanks! Questions?