back to square one: building a data science team from scratch

BUILDING DATA SCIENCE TEAMS FROM SCRATCH

Klaas Bosteels @klbostee

MY CAREER PATH SO FAR

2007: Began working with big data as PhD student

2009: Embarked on a data science career at Last.fm

2011: Joined Massive Media as Lead Data Scientist

Last.fm: data company at heart; one of the earliest Hadoop adopters worldwide; inventors of Ketama; organised the first “NoSQL” meetup in SF.

Massive Media: huge audience and tremendous potential, but a data science newcomer at the time.

MY TEAM AT MASSIVE MEDIA

Currently 4 permanent people (+ interns!), so not huge just yet

Relatively big and growing faster than anticipated though

OUR MISSION IS HELPING THE COMPANY...

MEASURE: metrics dashboards

EVALUATE: data-driven testing

DECIDE: ad hoc data insights

IMPROVE: e.g. abuse detection

EXTEND: new product features

PROMOTE: PR via data porn








higher risk but bigger returns

very wide range of tasks

STEP 1

FOLLOW THE MONEY

photo by Chris Isherwood

http://www.flickr.com/people/isherwoodchris/


BOOTSTRAP BY SAVING OR GAINING MONEY

You need to get some capital to get started

Saving money tends to be easier in practice

Real-world example:

• Analyzing CDN logs unveiled abuse

• Stopping the abuse greatly reduced the bills
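A minimal sketch of what such a log analysis could look like: sum the bytes served per client and flag anyone responsible for a disproportionate share of the traffic. The three-field log format, the sample data, the 50% threshold and the function names are all assumptions made for illustration; real CDN logs carry more fields and need their own parsing.

```python
from collections import Counter

# Hypothetical, simplified log format: "<client_ip> <url> <bytes_sent>".
# Real CDN logs have more fields; adapt the parsing accordingly.
SAMPLE_LOG = """\
10.0.0.1 /img/a.jpg 2000
10.0.0.2 /img/b.jpg 1500
10.0.0.9 /video/x.mp4 900000
10.0.0.9 /video/y.mp4 850000
10.0.0.1 /img/c.jpg 2500
"""


def bytes_per_client(lines):
    """Sum the bytes served to each client IP."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        ip, _url, size = parts
        totals[ip] += int(size)
    return totals


def heavy_clients(totals, share=0.5):
    """Return the clients that account for more than `share` of all traffic."""
    grand_total = sum(totals.values())
    return [ip for ip, n in totals.most_common() if n > share * grand_total]


totals = bytes_per_client(SAMPLE_LOG.splitlines())
suspects = heavy_clients(totals)
print(suspects)
```

On real data the same aggregation would typically run as a Hadoop job rather than in memory, but the question being asked of the logs stays the same.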

STEP 2

EMBRACE HADOOP

photo by Doug Kukurudza

http://www.flickr.com/photos/46009763@N07/


HADOOP

Not the holy grail, but deserves a central role

It has a vibrant community and is proven to be:

ECONOMICAL runs on commodity hardware

SCALABLE smart distributed processing

MAINTAINABLE very robust and fault-tolerant

FLEXIBLE predefined schemas not required

STEP 3

BUILD DASHBOARDS

photo by Dawn Hopkins

http://www.flickr.com/people/seenoevil/


STATS PIPELINE BASED ON HADOOP

[Pipeline diagram: Log collector, HDFS, MapReduce, HBase, Realtime processing, Dashboards, Ad-hoc results; flows labelled “in batches” and “continuous”]

Cfr. “lambda architecture”, coined by @nathanmarz
PYTHON IS AN AWESOME JACK OF ALL TRADES

It is great for building dashboards:

• Hadoop support: Dumbo, Python UDFs for Pig, ...

• Several amazing web frameworks, e.g. Flask

• Likewise for drawing graphs, e.g. PyCairo

And it covers many other data science needs as well:

• Scripting, prototyping and full-blown programming

• NumPy, SciPy, PyLab, Scikit-learn, Pandas, ...
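To give a feel for Python on Hadoop, here is a page-view count written as a mapper and a reducer in the generator style that Dumbo programs use. The `run_local` helper is hypothetical and not part of Dumbo: it only simulates the map, shuffle/sort and reduce phases in memory so the sketch runs without a cluster. The "user page" record format is likewise an assumption.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit (page, 1) for each log record; value is assumed to be 'user page'."""
    _user, page = value.split()
    yield page, 1


def reducer(key, values):
    """Sum the counts emitted for one page."""
    yield key, sum(values)


def run_local(mapper, reducer, records):
    """Hypothetical in-memory stand-in for map, shuffle/sort and reduce."""
    mapped = [kv for k, v in records for kv in mapper(k, v)]
    mapped.sort(key=itemgetter(0))  # the shuffle phase groups equal keys
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (v for _, v in group)))
    return output


# Keys are record offsets here; the mapper ignores them.
records = list(enumerate(["alice /home", "bob /home", "alice /profile"]))
page_views = dict(run_local(mapper, reducer, records))
print(page_views)
```

The same two functions can feed a dashboard: the reducer output is exactly the kind of per-key counter a stats page plots.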

STEP 4

ASSEMBLE A TEAM

photo by Jean-François Schmitz

http://www.flickr.com/people/jiheffe/


THE SECRET IS IN THE MIX

Hadoop’s tricks also apply to data science teams

• Avoid specialisation to allow easy distribution and scaling

• Exploit data locality by hiring people with a wide skill set

Great Data Scientists have the right mix of skills

• Hackers with solid technical background

• Analytical mind that knows statistics and machine learning

• Clever and creative in everything they do

STEP 5

EXPLORE & INNOVATE

photo by NASA

http://www.flickr.com/people/gsfc/


SOME TIPS AND TRICKS

Dare to fail and/or start from estimates

Introduce data exploration/innovation days

• Basically 20% time devoted to playing with data

• Incorporate brainstorming

• Encourage collaboration

Communicate findings to the rest of the company

• Fun and silliness are allowed

• Prototype early and often

FIVE SIMPLE STEPS IS ALL IT TAKES

STEP 1: FOLLOW THE MONEY

STEP 2: EMBRACE HADOOP

STEP 3: BUILD DASHBOARDS

STEP 4: ASSEMBLE A TEAM

STEP 5: EXPLORE & INNOVATE


Thanks! Questions?
