Creating Added Value with Big Data
DESCRIPTION
This talk essentially tells the story of the data science team at Massive Media, the company behind Netlog.com and Twoo.com. After obtaining invaluable first-hand experience in working with big data as a member of the information retrieval team at the music discovery website Last.fm, I joined Massive Media to conceive, build and lead a brand new team around big data and data science for them. In doing so, I developed a pretty clear perspective on how to introduce big data within a company and create added value from it, which is precisely what I would like to share in this talk.
TRANSCRIPT
CREATING ADDED VALUE WITH BIG DATA
by KLAAS BOSTEELS @klbostee
MY CAREER PATH SO FAR
2007: Began working with big data as PhD student
2009: Embarked on a data science career at Last.fm
2011: Joined Massive Media as Lead Data Scientist
Last.fm: Data company at heart; one of the earliest Hadoop adopters world-wide; inventors of Ketama; organised first “NoSQL” meetup in SF.
Massive Media: Huge audience and tremendous potential, but data science newcomer at the time.
Twoo.com: Second big product of Massive Media, after Netlog
2011: Initial launch of Twoo.com
2012: Biggest dating site world-wide on comScore
2013: Massive Media acquired by InterActiveCorp
IT’S A BIG FAMILY
IAC’s main personals brands:
Some other well-known IAC brands:
STEP 1
FOLLOW THE MONEY
photo by Chris Isherwood
BOOTSTRAP BY SAVING OR GAINING MONEY
You need to get some capital to get started
Saving money tends to be easier in practice
Real-world example:
• Analyzing CDN logs unveiled abuse
• Stopping the abuse greatly reduced the bills
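The slides do not say which CDN was involved or what form the abuse took. As a rough sketch of this kind of analysis, assuming a hypothetical log format of one tab-separated `referrer<TAB>bytes` pair per line, the following totals the bytes served per referrer and flags external referrers that account for a large share of the traffic. The function names, the 20% threshold and the sample domains are all illustrative assumptions.

```python
from collections import Counter


def bytes_by_referrer(lines):
    """Sum bytes served per referrer from 'referrer<TAB>bytes' lines (assumed format)."""
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip malformed lines instead of failing the whole run
        referrer, size = parts
        totals[referrer] += int(size)
    return totals


def heavy_referrers(totals, own_domains, min_share=0.2):
    """Return external referrers responsible for at least min_share of all bytes."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    suspects = [
        ref for ref, size in totals.items()
        if ref not in own_domains and size / grand_total >= min_share
    ]
    # Largest consumers first
    return sorted(suspects, key=lambda ref: -totals[ref])


# Made-up sample data, for illustration only
sample = [
    "example.com\t1000",
    "hotlinker.example.net\t5000",
    "example.com\t2000",
    "hotlinker.example.net\t4000",
    "garbage line",
]
totals = bytes_by_referrer(sample)
suspects = heavy_referrers(totals, own_domains={"example.com"})
```

On real CDN logs the same aggregation would typically run as a distributed job, but the logic stays this simple: group, sum, and look at who tops the list.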
STEP 2
EMBRACE HADOOP
photo by Doug Kukurudza
HADOOP
Not the holy grail, but deserves a central role
It has a vibrant community and is proven to be:
ECONOMICAL runs on commodity hardware
SCALABLE smart distributed processing
MAINTAINABLE very robust and fault-tolerant
FLEXIBLE predefined schemas not required
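To make the "predefined schemas not required" point concrete, here is a minimal sketch of the MapReduce pattern over loosely structured JSON log lines. It is a local, pure-Python simulation and not the Hadoop API; the `event` field name and the helper `run_local` are assumptions made for illustration. Fields are interpreted only when the data is read, so records with missing or extra fields do not break the job.

```python
import json
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (event_type, 1) for a JSON log line; fields are interpreted at read time."""
    try:
        record = json.loads(line)
    except ValueError:
        return  # tolerate lines that are not valid JSON
    if isinstance(record, dict):
        yield record.get("event", "unknown"), 1


def reducer(key, values):
    """Sum all counts emitted for one key."""
    yield key, sum(values)


def run_local(lines):
    """Simulate map, shuffle/sort and reduce in a single process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort phase
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_value in reducer(key, (value for _, value in group)):
            results[out_key] = out_value
    return results


# Made-up sample: records do not share a fixed set of fields
sample_lines = [
    '{"event": "login", "user": 1}',
    '{"event": "message", "user": 2, "length": 42}',
    '{"event": "login"}',
    'not json',
    '{"user": 3}',
]
counts = run_local(sample_lines)
```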
STEP 3
BUILD DASHBOARDS
photo by Dawn Hopkins
STATS PIPELINE BASED ON HADOOP
[Diagram, built up over three slides. Components: Log collector, HDFS, MapReduce, HBase, Dashboards; flows labelled "continuous" and "in batches". The second build adds Realtime processing (cfr. “lambda architecture”, coined by @nathanmarz). The third build adds Ad-hoc results.]
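The "lambda architecture" mentioned on this slide combines a batch view, periodically recomputed from the full data set, with a realtime view that covers only the events the last batch run has not yet seen; queries merge the two. The slides do not show how the pipeline implemented this, so the following is only a toy in-memory sketch of the idea, with the class and method names chosen for illustration.

```python
from collections import Counter


class StatsViews:
    """Toy lambda-architecture counter: a batch view plus a realtime view."""

    def __init__(self):
        self.master_log = []            # stand-in for append-only storage such as HDFS
        self.batch_view = Counter()     # recomputed from scratch by each batch run
        self.realtime_view = Counter()  # incremental counts since the last batch run

    def record(self, metric):
        """Append the event to the log and update the realtime view immediately."""
        self.master_log.append(metric)
        self.realtime_view[metric] += 1

    def run_batch(self):
        """Recompute the batch view from the full log, then reset the realtime view."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view = Counter()

    def query(self, metric):
        """Merge both views so results are complete and up to date."""
        return self.batch_view[metric] + self.realtime_view[metric]


views = StatsViews()
views.record("signup")
views.record("signup")
views.run_batch()        # the batch view now covers both events
views.record("signup")   # only the realtime view has seen this one so far
total_signups = views.query("signup")
```

Because the batch view is always rebuilt from the raw log, a bug in the realtime path only affects results until the next batch run, which is the main appeal of the design.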
CUSTOM-TAILORED WEB INTERFACE
Annotation & exporting functionality
Supports A/B testing and cohort analysis
Various other nifty extras
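The slides do not describe how A/B test results were evaluated in the web interface. One common approach, shown here purely as an illustrative sketch, is a two-sided two-proportion z-test on conversion counts. The function name and the sample numbers are assumptions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate, B minus A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 100 of 1000 users convert on variant A, 150 of 1000 on variant B
z_score, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone; the normal approximation used here assumes reasonably large samples.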
STEP 4
ASSEMBLE A TEAM
photo by Jean-François Schmitz
THE SECRET IS IN THE MIX
Hadoop’s tricks also apply to data science teams
• Avoid specialisation to allow easy distribution and scaling
• Exploit data locality by hiring people with a wide skill set
Great Data Scientists have the right mix of skills
• Hackers with solid technical background
• Analytical mind that knows statistics and machine learning
• Clever and creative in everything they do
CHEAPER TECH MAKES PEOPLE MORE EXPENSIVE
Graph by Trifacta. Source: John C. McCallum, Wikipedia and Federal Reserve Bank of St Louis. Inflation adjusted to 2011 dollars.
STEP 5
EXPLORE & INNOVATE
photo by NASAr
SOME TIPS AND TRICKS
Dare to fail and/or start from estimates
Introduce data exploration/innovation days
• Basically 20% time devoted to playing with data
• Incorporate collaborative brainstorming
• Goal is to find promising new projects to work on
Communicate findings to the rest of the company
• Fun and silliness are allowed
• Prototype early and often
PRODUCT INSIGHTS & EXTENSIONS
E.g. recommendations and activity patterns analysis
CUTE OBSERVATIONS FOR PR
http://www.twoo.com/blog/2012/04/twoos-great-global-vocabulary-experiment
FIVE SIMPLE STEPS IS ALL IT TAKES
1. FOLLOW THE MONEY
2. EMBRACE HADOOP
3. BUILD DASHBOARDS
4. ASSEMBLE A TEAM
5. EXPLORE & INNOVATE
Thanks! Questions?