Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016

Discerning Human Behavior from Mobility Data

Jonathan Lenaghan

VP of Science and Technology

2016

Agenda

• PlaceIQ and Mobile Overview

• Data Generation Process

• Bestiary of Location Fraud

• Discerning Human Behavior

Overview

Company Overview

PlaceIQ is building an advanced understanding of consumer behavior to revolutionize the customer experience through location analytics.

We create customer-driven audiences & understanding for activation in digital media and deployment across enterprise applications.

• Founded 5½ years ago; employs 140 people.

• Headquartered in NYC, with offices in Palo Alto, Chicago, Detroit, LA, Boulder, and London UK

Mobile is the Key to Understanding the Consumer Journey

New Model of Consumer Behavior

• Movement data

• Third-party data: Age, Income, Auto Purchase, TV

• First-party data: DMP, CRM

Target, Analyze, Measure, Manage

Significant Statistics

Location-Based POIs

• Location points-of-interest: 475 Million

• Location Commercial polygons: 1.4+ Million

• Location Residential Parcels: 137 Million

• Predicted Home Dwells: 90 Million

• Current Behavioral Profiles: 4+ Thousand

Unique Devices

• Device IDs: 4+ Billion

• PIQ IDs/unique users: 130 Million

Infrastructure

• Data Storage: ~10 Petabytes

• Ad Requests per Second: 250 Thousand

• Production Cluster: 8K Nodes

Driving Components of the Platform

Work

Home

Movement Data

Base Map

PIQL

ruleb for Retail { use time_periods Monday--Sunday 09:00--20:00; Walmart and K-Mart where count >= 20 in 10 months;}

Mobility Data

• Scale of data is vast (~10 PB over three years)

• Most of the data is very noisy

• Much of the data is fraudulent

• Location analytics from high-scale mobile ad request data is full of challenging and interesting problems!

How is movement data generated?

How Direct Location Data is Obtained

1. App asks OS for location

2. OS gets location from device

3. OS passes best available location to app

4. App passes location to ad exchange

5. Location analytics platform matches to place or audience

On iOS, the app uses the Core Location API.

On Android, the app uses the android.location API.

Location sources available to the operating system:

• Cellular antenna: avg. accuracy 2,000 m

• WiFi antenna: avg. accuracy 424 m

• GPS antenna: avg. accuracy 23 m

Location Response: (Lat, Long), Accuracy

Ad Request (sent through the Ad Exchange to the Location Analytics Platform): (Lat, Long), UDID

Key Processes

• Identify and filter spam

• Verify places and map in high detail

• Understand the surrounding context

• Unify a single device’s many hashed IDs

Bestiary of Location Fraud

Quality of Movement Data Varies Greatly by Partner

Programmatically-Generated Movement is Common

Misrepresentation May be Nefarious or Not

• Spoofing High-Value Locations

• Centroid Geocoding

Misrepresentation May be Nefarious or Not

Location Spoofing: A single device is observed in tens of metros across the United States over the course of a few minutes.

Short-Distance Jitter: Jitter is typically caused by switching between GPS and cell-tower triangulation.
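The "tens of metros in a few minutes" pattern can be illustrated with a simple implied-speed check. This is a minimal sketch, not Darwin's actual filter: the function names, the `(unix_seconds, lat, lon)` record layout, and the 1,000 km/h threshold are all assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implausible_hops(locates, max_speed_kmh=1000.0):
    """Return indices i where the hop from locate i-1 to i implies a speed above max_speed_kmh.

    locates: list of (unix_seconds, lat, lon) tuples for ONE device, sorted by time.
    The threshold is a hypothetical value, roughly the speed of a commercial jet.
    """
    flagged = []
    for i in range(1, len(locates)):
        t0, lat0, lon0 = locates[i - 1]
        t1, lat1, lon1 = locates[i]
        hours = max(t1 - t0, 1) / 3600.0  # clamp to 1 s to avoid division by zero
        if haversine_km(lat0, lon0, lat1, lon1) / hours > max_speed_kmh:
            flagged.append(i)
    return flagged
```

A device reported in New York and then in Los Angeles two minutes later would be flagged at the second locate. Short-distance jitter of a few hundred metres per minute implies only walking-to-driving speeds, so it passes this check and needs a different treatment.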

HyQuP and Darwin

Darwin is a Location Fraud Detection and Prevention Product

• “I know that half of my advertising budget is wasted, I just don’t know which half.”

• Darwin removes, on average, 40% of all ad requests for misrepresenting location

• Without filtering, nearly half of all location-based ads are wasted

• Inferring human behavior from ad request data is impossible without such an enabling technology

Measure quality of data and then filter bad data

HyQuP (Hyperlocality Quality Pipeline)

Produces metrics to judge how closely a corpus of movement data reflects human movement and behavior.

Computes two metrics: Hyperlocality and Clusterability

Darwin Fraud Filters

Detects and filters, on a locate-by-locate basis, those device IDs and locations that are plagued with misrepresented geocodes.

Hyperlocality

Hyperlocality

Location data should reflect human movement in the real world at high resolution

• Information-theoretic techniques can be very powerful and, because they typically rely on simple counting, are easy to compute.

• Determines the efficiency of location data as it moves from low to high resolutions

• How good is our inference of out-of-home behavior?

• Is the data human generated or computer-generated?

Distribution of Digits

What distribution should we expect for the individual digits of coordinate pairs that represent human movement?

Consider both the distribution of the individual digits after the decimal point and their joint distribution, e.g. for the coordinate pair

(90.123456, 88.981239)

Generate the empirical distributions of the digits and compute the Kullback-Leibler divergence (KLD) between each of these distributions and the uniform distribution.
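The digit test can be sketched in a few lines of counting code. This is an illustrative sketch rather than the HyQuP implementation: the function names, the choice of six decimal places, pooling latitude and longitude digits per position, and measuring KLD in bits are all assumptions.

```python
import math
from collections import Counter


def decimal_digits(value, places=6):
    """Return the first `places` digits after the decimal point as a string."""
    return f"{abs(value):.{places}f}".split(".")[1]


def kld_from_uniform(counts, n_outcomes=10):
    """KL divergence D(P || U) in bits between an empirical distribution and the uniform one."""
    total = sum(counts.values())
    kld = 0.0
    for c in counts.values():
        p = c / total
        kld += p * math.log2(p * n_outcomes)  # p * log2(p / (1 / n_outcomes))
    return kld


def digit_kld_by_position(coords, places=6):
    """For each decimal position, KLD from uniform of that digit (lat and lon pooled)."""
    per_position = [Counter() for _ in range(places)]
    for lat, lon in coords:
        for value in (lat, lon):
            for pos, digit in enumerate(decimal_digits(value, places)):
                per_position[pos][digit] += 1
    return [kld_from_uniform(c) for c in per_position]
```

A KLD near zero at a given position means the digits there look uniform. A value near log2(10), about 3.32 bits, means a single digit dominates, as happens when coordinates are truncated or padded with zeros. The joint distribution can be handled the same way by counting (latitude digit, longitude digit) pairs and passing `n_outcomes=100`.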

Zoom-Stack Efficiency

We apply the notion of information efficiency, and track changes in this quantity, as we move down a zoom stack from 1 km to 100 m to 10 m.

The metric measures how much information is gained as we add additional digits to the coordinates.

This is a way to measure the amount of randomness gained with the addition of each digit.

Zoom stack levels: 10 km x 10 km, 1 km x 1 km, 100 m x 100 m, 10 m x 10 m

Zoom-Stack Efficiency

Given our knowledge of the Nth digits in a coordinate pair, how much more information do we gain by knowing the next digit?

In other words, how much randomness is induced at the next level of the zoom stack?
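One way to make this question concrete is the conditional entropy of the next digit pair given the cell defined by the digits already known. The sketch below is an interpretation, not the formula used in the talk: the function name, the normalization by log2(100) (the maximum entropy of a pair of decimal digits), and the use of decimal places as zoom levels are assumptions.

```python
import math
from collections import Counter, defaultdict


def zoom_efficiency(coords, level, places=6):
    """Normalized conditional entropy of the next digit pair given the cell at `level` decimals.

    coords: iterable of (lat, lon) floats.
    level: number of decimal digits already known (must be < places).
    Returns a value in [0, 1]; 1 means the next digits are uniformly random within cells.
    """
    children = defaultdict(Counter)
    for lat, lon in coords:
        lat_int, lat_frac = f"{lat:.{places}f}".split(".")
        lon_int, lon_frac = f"{lon:.{places}f}".split(".")
        cell = (lat_int, lat_frac[:level], lon_int, lon_frac[:level])
        children[cell][(lat_frac[level], lon_frac[level])] += 1

    total = sum(sum(counter.values()) for counter in children.values())
    entropy = 0.0
    for counter in children.values():
        n_cell = sum(counter.values())
        for c in counter.values():
            # -sum p(cell, child) * log2 p(child | cell)
            entropy -= (c / total) * math.log2(c / n_cell)
    return entropy / math.log2(100)
```

Two decimal places correspond to roughly 1 km cells in latitude, three to roughly 100 m, and four to roughly 10 m, so calling the function with `level` set to 2, 3, and 4 walks down the zoom stack. Data that collapses onto a few repeated coordinates scores near 0 at the finer levels.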

Clusterability

Clusterability

Location data shouldn’t be evenly distributed, because humans aren’t

• The clustering of coordinate points captures real-life human behaviors and habits

• Most devices have a few tight clusters that represent where they live and work

• They have less dense clusters around usual social venues

• Does the data give clean clusters around homes and businesses?

Clusterability: Does Location Data Look like Humans?

• Do the locates tend to cluster over residential lots and workplaces in a manner consistent with human behavior?

• Locates on a device-by-device basis should be scattered into clusters with a predictable pattern.

• The silhouette of the clusters should also be well defined. It should be neither point-like nor diffuse.

• Clusters computed using DBSCAN
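The slides name DBSCAN as the clustering algorithm, so a compact version of it helps show what "tight clusters plus noise" means for one device. This is a minimal pure-Python sketch for illustration only: it assumes the locates have already been projected to planar (x, y) coordinates in metres, and the `eps` and `min_pts` values in the usage note are hypothetical, not PlaceIQ's settings.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over planar points.

    points: list of (x, y) tuples in metres.
    eps: neighbourhood radius in metres.
    min_pts: minimum neighbours (including the point itself) for a core point.
    Returns one label per point; NOISE (-1) marks points in no cluster.
    """
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep growing
    return labels
```

With, say, `eps=50` and `min_pts=3`, a device with a handful of locates around a home and another handful around a workplace yields two clusters, while an isolated locate far from both is labelled noise. The brute-force neighbour search is quadratic, so a production pipeline would use a spatial index instead.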

Quality Scores

D measures whether clusters are formed and numerically represents the density of the clustering

R measures the robustness of the clustering of the data set

S measures the tightness of the clustering

Clusterability = D * R * (1+S) / [R + (1+S)/2]

That is, the product of the density of the clustering, D, and the harmonic mean of the robustness, R, and the normalized silhouette score, (1+S)/2.
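The equivalence between the formula and its verbal description can be checked numerically. In this sketch the function names are hypothetical, and it assumes D and R lie in [0, 1] and S is a silhouette score in [-1, 1]; the slides do not state these ranges.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def clusterability(d, r, s):
    """Clusterability = D * R * (1 + S) / [R + (1 + S) / 2]."""
    denom = r + (1 + s) / 2
    if denom == 0:
        return 0.0  # robustness and normalized silhouette are both zero
    return d * r * (1 + s) / denom


# Example: D = 0.8, R = 0.6, S = 0.5 gives a normalized silhouette of 0.75.
score = clusterability(0.8, 0.6, 0.5)
same_score = 0.8 * harmonic_mean(0.6, (1 + 0.5) / 2)
```

Both expressions evaluate to the same number because 2 * R * N / (R + N) with N = (1 + S)/2 simplifies to R * (1 + S) / [R + (1 + S)/2]. Using a harmonic mean means a data set scores well only when both robustness and silhouette are high.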

Conclusions

• Misrepresentation of location data is widespread in the mobile ad ecosystem

• Rely on hyperlocality (information theoretic approach) and clusterability (unsupervised learning)

• Essential to measure and filter devices, applications and locations

• High scale and challenging problems to be solved