Airbnb Tech Talk: Josh Wills on the Life of a Data Scientist
DESCRIPTION
This is the accompanying presentation for a tech talk given at Airbnb. Video of the talk here: http://www.youtube.com/watch?v=h9vQIPfe2uU Other tech talks: https://www.airbnb.com/tech_talks
TRANSCRIPT
The Life of a Data Scientist
May 23, 2012
About Me
• [email protected] and @josh_wills
• Formerly of Google (2008 – 2011)
• Worked on the ad auction
• Led the team that built the data infrastructure for Google+
• Before that: a bunch of startups
• Sometimes as a software engineer, sometimes as a statistician
• Math degree from Duke and a half-finished PhD from The University of Texas at Austin
• Now: Director of Data Science at Cloudera
Copyright 2012 Cloudera Inc. All rights reserved
@josh_wills, #hacker
vs.
@josh_wills, #ThoughtLeader
What is a Data Scientist?
One Definition…
… versus Another
Why Is Everyone Talking About Them?
Because They Make Things Fun.
Data Scientists Power The Products You Love
The Job Isn’t New. The Impact Is.
How Do I Become One?
The Standard Reply
Personality Trait #1: Relentless, but in a Lazy Way
Personality Trait #2: (Acquired) Humility
Step 1: Study Math
But…I didn’t study math.
Alternate Step 1: Study (Computer) Science
Things People Don’t Know About Computer Science
Things Scientists Don’t Know About Statistics
Problem Solving In Context
Phase 2: Stuff You Still Don’t Know
Statisticians: How to Work on an Engineering Team
• Modular software design
• Unit tests
• Code reviews
• Automated build and test infrastructure
• Source code management
Software Engineers: How to Carry Out an Analysis
Industrial Machine Learning
Data Scientists and Hadoop
Data Analyst
“If my tools and data can’t answer a question, then the question doesn’t get answered.”
Data Scientist
“If my tools and data can’t answer a question, then I go get better tools and data.”
Incredibly Common Question
“When should I use Hadoop instead of a relational database?”
The Unit of Analysis Problem: Three Symptoms
First Symptom: COUNT DISTINCT
Second Symptom: Cursors
Third Symptom: ALTER TABLE OF_DOOM
The Unit of Analysis Problem
• Data warehouses are optimized to analyze transactions
• Awesome for finance and ERP
• Not ideal for product and marketing
• A function of what databases are good at
What Are You Trying to Analyze?
Simple Entities
• Static attributes
• Flat data structure
• Transient
• Examples: SKUs, line items from an invoice, log messages
Complex Entities
• Evolving attributes
• Hierarchical data structure
• Persistent
• Examples: customers, suppliers, website visitors
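The contrast between the two kinds of entity can be pictured with a minimal Python sketch. It is an illustration only, not code from the talk: the names `LineItem`, `Customer`, and `roll_up`, and all sample values, are hypothetical. A flat line item is a simple entity; a customer that accumulates line items over time is a complex one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LineItem:
    """Simple entity (hypothetical): flat, static attributes, one row per transaction."""
    customer_id: str
    sku: str
    amount_cents: int


@dataclass
class Customer:
    """Complex entity (hypothetical): persistent and hierarchical, grows as events arrive."""
    customer_id: str
    line_items: list = field(default_factory=list)

    def total_cents(self) -> int:
        # Sum over the nested line items held by this customer.
        return sum(item.amount_cents for item in self.line_items)

    def distinct_skus(self) -> int:
        # Count unique SKUs within this one customer's history.
        return len({item.sku for item in self.line_items})


def roll_up(items):
    """Group flat line items under the customer they belong to."""
    customers = {}
    for item in items:
        if item.customer_id not in customers:
            customers[item.customer_id] = Customer(item.customer_id)
        customers[item.customer_id].line_items.append(item)
    return customers


# Made-up sample data for illustration.
items = [
    LineItem("c1", "sku-a", 500),
    LineItem("c2", "sku-b", 250),
    LineItem("c1", "sku-a", 500),
    LineItem("c1", "sku-c", 125),
]
customers = roll_up(items)
```

Once the rows are grouped this way, per-customer questions such as total spend or number of distinct SKUs become plain method calls on one object rather than queries across many transaction rows.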
Choosing Our Own Data Format
• We get to structure our data in the way that works best for the problem we are solving
• Flexible
• Evolvable
• Compact
• Fast serialization/deserialization
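The slide does not name a specific format, so the following is only a rough sketch of what "evolvable" could mean in practice. It assumes JSON from the Python standard library purely for illustration; the field names, the `READER_DEFAULTS` table, and the `serialize`/`deserialize` helpers are all hypothetical. The idea shown is that a reader supplies defaults for fields that older records were written without.

```python
import json

# Hypothetical reader schema: field name -> default used when an older
# record was written before that field existed.
READER_DEFAULTS = {
    "visitor_id": None,
    "page_views": [],
    "referrer": "unknown",  # imagined as a field added in a later version
}


def serialize(record: dict) -> bytes:
    """Encode a record as JSON without extra whitespace."""
    return json.dumps(record, separators=(",", ":"), sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Decode a record, filling defaults for fields it does not carry."""
    stored = json.loads(payload.decode("utf-8"))
    record = {}
    for name, default in READER_DEFAULTS.items():
        value = stored.get(name, default)
        # Copy lists so decoded records never share the default list object.
        record[name] = list(value) if isinstance(value, list) else value
    return record


# An "old" record written before `referrer` existed, and a "new" one with it.
old_payload = serialize({"visitor_id": "v42", "page_views": ["/home", "/search"]})
new_payload = serialize({"visitor_id": "v43", "page_views": [], "referrer": "email"})

old_record = deserialize(old_payload)
new_record = deserialize(new_payload)
```

Here `old_record` comes back with `referrer` set to the default, so code written against the newer shape can read both payloads. A binary format with an explicit schema would likely do better on the "compact" and "fast" bullets; JSON is used here only because it needs no extra dependencies.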
Spell Correction: The Drosophila of Data Science
Simple Counts on Complex Objects
The Uncanny Valley for Statisticians on Hadoop
The Business of Data Science
Where You Should Work: The Two Options
A Startup
Close to the Money
Dealing for Data
Education and Growth
Questions?
@josh_wills