Airbnb Tech Talk: Josh Wills on the Life of a Data Scientist

Post on 26-Jan-2015

DESCRIPTION

This is the accompanying presentation for a tech talk given at Airbnb. Video of the talk here: http://www.youtube.com/watch?v=h9vQIPfe2uU Other tech talks: https://www.airbnb.com/tech_talks

TRANSCRIPT

The Life of a Data Scientist

May 23, 2012

About Me

• jwills@cloudera.com and @josh_wills
• Formerly of Google (2008 – 2011)
  • Worked on the ad auction
  • Led the team that built the data infrastructure for Google+
• Before that: a bunch of startups
  • Sometimes as a software engineer, sometimes as a statistician
• Math degree from Duke and a half-finished PhD from The University of Texas at Austin
• Now: Director of Data Science at Cloudera

Copyright 2012 Cloudera Inc. All rights reserved

@josh_wills, #hacker

vs.

@josh_wills, #ThoughtLeader

What is a Data Scientist?

One Definition…

… versus Another

Why Is Everyone Talking About Them?

Because They Make Things Fun.

Data Scientists Power The Products You Love

The Job Isn’t New. The Impact Is.

How Do I Become One?

The Standard Reply

Personality Trait #1: Relentless, but in a Lazy Way

Personality Trait #2: (Acquired) Humility

Step 1: Study Math

But…I didn’t study math.

Alternate Step 1: Study (Computer) Science

Things People Don’t Know About Computer Science

Things Scientists Don’t Know About Statistics

Problem Solving In Context

Phase 2: Stuff You Still Don’t Know

Statisticians: How to Work on an Engineering Team

• Modular software design
• Unit tests
• Code reviews
• Automated build and test infrastructure
• Source code management

Software Engineers: How to Carry Out an Analysis

Industrial Machine Learning

Data Scientists and Hadoop

Data Analyst

“If my tools and data can’t answer a question, then the question doesn’t get answered.”

Data Scientist

“If my tools and data can’t answer a question, then I go get better tools and data.”

Incredibly Common Question

“When should I use Hadoop instead of a relational database?”

The Unit of Analysis Problem: Three Symptoms

First Symptom: COUNT DISTINCT

Second Symptom: Cursors

Third Symptom: ALTER TABLE OF_DOOM

The Unit of Analysis Problem

• Data warehouses are optimized to analyze transactions
  • Awesome for finance and ERP
  • Not ideal for product and marketing
• A function of what databases are good at

What Are You Trying to Analyze?

Simple Entities
• Static attributes
• Flat data structure
• Transient
• Examples
  • SKUs
  • Line items from an invoice
  • Log messages

Complex Entities
• Evolving attributes
• Hierarchical data structure
• Persistent
• Examples
  • Customers
  • Suppliers
  • Website visitors
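
The contrast between simple and complex entities can be sketched in code. This is only an illustration and not taken from the slides: the class names, fields, and the `customers_who_viewed` helper are all hypothetical. It shows a flat line item next to a customer record that carries nested visit history and attributes that change over time, and a count whose unit of analysis is the customer rather than the individual page view.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class LineItem:
    """Simple entity (hypothetical): flat, static, one record per event."""
    invoice_id: str
    sku: str
    quantity: int


@dataclass
class Visit:
    """One visit inside a customer's history (hypothetical shape)."""
    day: str
    pages: List[str]


@dataclass
class Customer:
    """Complex entity (hypothetical): persistent, hierarchical, evolving."""
    customer_id: str
    attributes: Dict[str, str] = field(default_factory=dict)
    visits: List[Visit] = field(default_factory=list)


def customers_who_viewed(customers: List[Customer], page: str) -> int:
    """Count customers (not page views) who ever viewed `page`."""
    return sum(
        1 for c in customers
        if any(page in v.pages for v in c.visits)
    )


alice = Customer(
    "c1",
    {"plan": "free"},
    [Visit("2012-05-01", ["/home", "/pricing"]),
     Visit("2012-05-02", ["/pricing"])],
)
bob = Customer("c2", {"plan": "paid"}, [Visit("2012-05-01", ["/home"])])

# Attributes of a complex entity can change while the entity persists.
alice.attributes["plan"] = "paid"

count = customers_who_viewed([alice, bob], "/pricing")
```

Because each customer's history lives inside a single record, the count needs only one pass over the customers; Alice's two visits to the same page are counted once.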

Choosing Our Own Data Format

• We get to structure our data in the way that works best for the problem we are solving
  • Flexible
  • Evolvable
  • Compact
  • Fast serialization/deserialization
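
As a rough sketch of what "evolvable" can mean in practice, the example below reads records written before a field existed and ignores fields it does not recognize. The slides do not name a serialization format; JSON and the `VisitorRecord` type are assumptions chosen only to keep the example self-contained, and JSON is neither the most compact nor the fastest option.

```python
import json
from dataclasses import asdict, dataclass, field, fields
from typing import List


@dataclass
class VisitorRecord:
    """Hypothetical record type; not from the original talk."""
    visitor_id: str
    pages: List[str] = field(default_factory=list)
    # Imagined as added in a later version: the default keeps older
    # payloads, written without this field, readable.
    referrer: str = "unknown"


def serialize(record: VisitorRecord) -> bytes:
    """Encode a record as compact JSON bytes."""
    return json.dumps(asdict(record), separators=(",", ":")).encode("utf-8")


def deserialize(payload: bytes) -> VisitorRecord:
    """Decode a record, dropping any fields this version does not know."""
    raw = json.loads(payload.decode("utf-8"))
    known = {f.name for f in fields(VisitorRecord)}
    return VisitorRecord(**{k: v for k, v in raw.items() if k in known})


# A payload written before `referrer` existed still loads.
old = deserialize(b'{"visitor_id":"v1","pages":["/home"]}')
round_trip = deserialize(serialize(old))
```

Defaults cover fields missing from old data, and filtering on known field names lets an older reader tolerate data written by a newer writer.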

Spell Correction: The Drosophila of Data Science

Simple Counts on Complex Objects

The Uncanny Valley for Statisticians on Hadoop

The Business of Data Science

Where You Should Work: The Two Options

A Startup

Close to the Money

Dealing for Data

Education and Growth

Questions?

@josh_wills
