The Wild West of Data Wrangling

Sarah Guido, PyCon 2017

@sarah_guido

This talk:

•  A day in the life

•  Three examples of dealing with uncooperative data

•  Not ground truth!

Who am I?

•  Senior data scientist at Mashable

•  Mashable == internet culture media!

•  Data sciencing in Python

•  Twitter: @sarah_guido

Iris Dataset

Example 1: Predicting building sales

•  The problem: can we predict if a building will sell the following year?

•  The data: floors, location, square footage, price per sqft, etc.

•  The goal: provide valuable insight to platform users

Example 1: Predicting building sales

•  First thought: logistic regression using scikit-learn

•  Binary classification: sale/no sale

Problem…

Data: 95% no sale, 5% sale

Logistic regression: 95% accurate

Problem: Class imbalance

Class imbalance

When the classes you are trying to predict are not equally represented, classification models can become biased toward the majority class.
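To see why the 95% accuracy figure is misleading, here is a minimal sketch in plain Python. The labels are synthetic (95 "no sale", 5 "sale", matching the split above), and `always_no_sale` is a hypothetical baseline, not the model from the talk.

```python
# Synthetic labels matching the 95% / 5% split (1 = sale, 0 = no sale).
labels = [0] * 95 + [1] * 5


def always_no_sale(n):
    """Hypothetical baseline: predict 'no sale' for every building."""
    return [0] * n


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model found."""
    actual = sum(t == positive for t in y_true)
    found = sum(t == positive and p == positive
                for t, p in zip(y_true, y_pred))
    return found / actual if actual else 0.0


predictions = always_no_sale(len(labels))
baseline_accuracy = accuracy(labels, predictions)
baseline_recall = recall(labels, predictions)
print(baseline_accuracy, baseline_recall)
```

A model that never predicts a sale scores high on accuracy while finding none of the sales, so a metric such as recall on the minority class is a more honest check here.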

Solution: Gradient boosting

Gradient boosting

Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
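As a rough illustration of the "ensemble of weak models" idea, the sketch below boosts one-split decision stumps in plain Python. It is a simplification under stated assumptions: it uses squared-error regression on a toy one-dimensional dataset rather than the sale/no-sale classification from the talk, and every name (`fit_stump`, `fit_boosting`, `predict`) and data value is made up for the example. In practice a library implementation such as scikit-learn's `GradientBoostingClassifier` would be the usual choice.

```python
def fit_stump(xs, residuals):
    """Find the single-split stump that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave data on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals (the negative
    gradient of squared error) and adds a damped copy to the ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and every stump's damped contribution."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total


# Toy data (an assumption for illustration): a step between x=4 and x=5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = fit_boosting(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(baseline_mse, boosted_mse)
```

No single stump fits the data well, but the sum of many small corrections does, which is the core of the technique.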

Example 2: Clustering user interactions

The problem: how can we identify similar patterns based on click data?

The data: time, geolocation, cookie, browser useragent string, referrer

The goal: understand how people interact with content over time

Why Scala?

Problem: Clustering user interactions

K-means clustering

An unsupervised learning method of grouping data together based on a distance metric.
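The definition above can be sketched in a few lines of plain Python. This is a minimal version under assumptions: Euclidean distance as the metric, fixed starting centroids instead of random initialization, and made-up two-dimensional points. The function name `kmeans` is chosen for the example, and `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, n_iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(n_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Move each centroid to the mean of its cluster; an empty
        # cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    labels = [nearest_centroid(point, centroids) for point in points]
    return centroids, labels


# Made-up points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The algorithm expects every data point to have the same number of dimensions, which matters when each user has a different number of interactions.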

Problem: Clustering the data

•  Only look at users with 5 or more interactions

•  Each user has a different number of interactions

•  Each data point ends up in a different cluster

Solution: Transform the data

date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12

Number of interactions: 5

Average time between interactions: ~8 days
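One way to compute these summary features from the dates above, sketched in plain Python. The function name `interaction_features` and the dictionary keys are hypothetical; the dates are the ones listed on the slide.

```python
from datetime import date

# The interaction dates listed above.
interaction_dates = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]


def interaction_features(dates):
    """Collapse a variable-length list of dates into fixed-size features."""
    ordered = sorted(dates)
    # Days between each consecutive pair of interactions.
    gaps = [(later - earlier).days
            for earlier, later in zip(ordered, ordered[1:])]
    average_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {"n_interactions": len(ordered), "avg_days_between": average_gap}


features = interaction_features(interaction_dates)
print(features)
```

Every user now maps to the same two numbers regardless of how many interactions they have, so each user becomes a single row for clustering.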

referrer: facebook, twitter

One-hot encode and transform to matrix

•  Facebook: [1, 0]

•  Twitter: [0, 1]
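The encoding above can be sketched in plain Python as follows. The helper name `one_hot_encode` and the sample referrer list are assumptions for illustration; libraries such as pandas and scikit-learn provide equivalent functionality.

```python
def one_hot_encode(values, categories):
    """Turn each value into a row with a 1 in its category's column."""
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return rows


# Column order follows the slide: facebook first, twitter second.
categories = ["facebook", "twitter"]
referrers = ["facebook", "twitter", "facebook"]  # sample input (assumed)
matrix = one_hot_encode(referrers, categories)
print(matrix)
```

Each row has the same width, so the result can be stacked into a matrix alongside the numeric features.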

Example 3: Understand audience composition

The problem: how can we effectively describe our audience?

The data: anonymized demographic and psychographic data

The goal: audience segmentation and channel analysis

Problem: insufficient data

•  Google Analytics data: only 1/3 of URLs

•  Finicky API

•  Semi-useless psychographic data

Solution: accept defeat

Solution: accept defeat (crossed out) make it work!

Solution: make it work!

•  Theory of highly-performant links

•  Segmentation through archetypal analysis

•  Go get more data!

General strategy

•  What problem are you trying to solve?

•  What’s wrong with your data?

•  What do you need that you don’t have?

Keep in mind…

•  Data your company collects is complicated

•  What you do to your data will affect the model

•  Creativity is your friend

•  Lots of ways to solve the problem

Thank you!

@sarah_guido
