Jonathan Lenaghan, VP of Science and Technology, PlaceIQ, at MLconf ATL 2016
Discerning Human Behavior from Mobility Data
Jonathan Lenaghan
VP of Science and Technology
2016
Agenda
• PlaceIQ and Mobile Overview
• Data Generation Process
• Bestiary of Location Fraud
• Discerning Human Behavior
Overview
Company Overview
PlaceIQ is building an advanced understanding of consumer behavior to revolutionize the customer experience through location analytics.
We create customer-driven audiences & understanding for activation in digital media and deployment across enterprise applications.
• Founded 5½ years ago; employs 140 people.
• Headquartered in NYC, with offices in Palo Alto, Chicago, Detroit, LA, Boulder, and London, UK.
Mobile is the Key to Understanding the Consumer Journey
New Model of Consumer Behavior
• Movement data
• Third-party data: Age, Income, Auto Purchase, TV
• First-party data: DMP, CRM
• Target, Analyze, Measure, Manage
Significant Statistics
Location-Based POIs
• Location points of interest: 475 million
• Location commercial polygons: 1.4+ million
• Location residential parcels: 137 million
• Predicted home dwells: 90 million
• Current behavioral profiles: 4+ thousand
Unique Devices
• Device IDs: 4+ billion
• PIQ IDs/unique users: 130 million
Infrastructure
• Data storage: ~10 petabytes
• Ad requests per second: 250 thousand
• Production cluster: 8K nodes
Driving Components of the Platform
• Movement data
• Base map
• PIQL
[Figure labels: Home, Work]
PIQL example:
ruleb for Retail { use time_periods Monday--Sunday 09:00--20:00; Walmart and K-Mart where count >= 20 in 10 months; }
Mobility Data
• Scale of data is vast (~10 PB over three years)
• Most of the data is very noisy
• Much of the data is fraudulent
• Location analytics from high-scale mobile ad request data is full of challenging and interesting problems!
How is movement data generated?
How Direct Location Data is Obtained
1. App asks OS for location
   On iOS, the app uses the Core Location API.
   On Android, the app uses the android.location API.
2. OS gets location from device
   • Cellular antenna: avg. accuracy 2,000 m
   • WiFi antenna: avg. accuracy 424 m
   • GPS antenna: avg. accuracy 23 m
3. OS passes best available location to app
   Location response: (Lat, Long), Accuracy
4. App passes location to ad exchange
   Ad request: (Lat, Long), UDID
5. Location analytics platform matches to place or audience
Key Processes
• Identify and filter spam
• Verify places and map in high detail
• Understand the surrounding context
• Unify a single device’s many hashed IDs
Bestiary of Location Fraud
Quality of Movement Data Varies Greatly by Partner
Programmatically-Generated Movement is Common
Misrepresentation May be Nefarious or Not
• Spoofing high-value locations
• Centroid geocoding
Misrepresentation May be Nefarious or Not
• Location spoofing: a single device is observed in tens of metros across the United States over the course of a few minutes.
• Short-distance jitter: typically caused by switching between GPS and cell-tower triangulation.
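One way to picture the spoofing pattern described here is an "impossible travel" check on a single device's locates. The sketch below is only an illustration, not the Darwin filter itself: the function names, the 1,000 km/h speed threshold, and the 50 km minimum distance (used so that short-distance jitter is not mistaken for spoofing) are all assumptions.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implausible_hops(locates, max_speed_kmh=1000.0, min_distance_km=50.0):
    """Flag consecutive locates of ONE device that imply impossible travel.

    locates: list of (timestamp, lat, lon) tuples, in any order.
    max_speed_kmh, min_distance_km: hypothetical thresholds (assumptions).
    Returns a list of (t_start, t_end, distance_km, speed_kmh) tuples.
    """
    ordered = sorted(locates, key=lambda rec: rec[0])
    flagged = []
    for (t1, la1, lo1), (t2, la2, lo2) in zip(ordered, ordered[1:]):
        dist = haversine_km(la1, lo1, la2, lo2)
        if dist < min_distance_km:
            continue  # short hops (including GPS/cell jitter) are ignored here
        hours = (t2 - t1).total_seconds() / 3600.0
        speed = dist / hours if hours > 0 else math.inf
        if speed > max_speed_kmh:
            flagged.append((t1, t2, dist, speed))
    return flagged


# Example: a small jitter near New York, then a jump to Los Angeles 5 minutes later.
t0 = datetime(2016, 1, 1, 12, 0, 0)
example = [
    (t0, 40.7128, -74.0060),
    (t0 + timedelta(seconds=30), 40.7130, -74.0062),
    (t0 + timedelta(minutes=5), 34.0522, -118.2437),
]
hops = implausible_hops(example)
```

Only the cross-country jump is flagged in this example; the 30-second hop of a few tens of metres falls under the distance floor.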
HyQuP and Darwin
Darwin is a Location Fraud Detection and Prevention Product
• “I know that half of my advertising budget is wasted, I just don’t know which half.”
• Darwin removes on average 40% of all ad requests as misrepresenting location
• Not filtering ensures that nearly half of all location-based ads are wasted
• Inferring human behavior from ad request data is impossible without such an enabling technology
Measure quality of data and then filter bad data
HyQuP (Hyperlocality Quality Pipeline)
Produces metrics to judge how closely a corpus of movement data reflects human movement and behavior.
Computes two metrics: Hyperlocality and Clusterability
Darwin Fraud Filters
Detects and filters, on a locate-by-locate basis, those device IDs and locations that are plagued with misrepresented geocodes.
Hyperlocality
Hyperlocality
Location data should reflect human movement in the real world at high resolution
• Information-theoretic techniques can be very powerful and, because they typically rely on simple counting, are easy to compute.
• Determines the efficiency of location data as it moves from low to high resolutions
• How good is our inference of out-of-home behavior?
• Is the data human generated or computer-generated?
Distribution of Digits
What would the expected distribution of the individual digits of the coordinate pairs representing the movement of humans be?
Consider both the distribution of the individual digits after the decimal point and their joint distribution, e.g. for the coordinate pair
(90.123456, 88.981239)
Generate the empirical distribution of the digits and compute the Kullback-Leibler divergence (KLD) between it and the uniform distribution.
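The digit test can be sketched in a few lines of counting. The code below is a minimal illustration under stated assumptions, not PlaceIQ's implementation: coordinates are formatted to six decimal places, latitude and longitude digits are pooled per position, and the joint distribution is taken over (latitude digit, longitude digit) pairs at the same position. All function names are hypothetical.

```python
import math
from collections import Counter


def decimal_digits(value, places=6):
    """Digits after the decimal point, as a string of length `places`."""
    return f"{abs(value):.{places}f}".split(".")[1]


def kld_from_uniform(counts, support):
    """KL divergence in bits between the empirical distribution and a uniform one.

    counts: Counter of observed symbols; support: number of possible symbols.
    """
    total = sum(counts.values())
    return sum((c / total) * math.log2((c / total) * support) for c in counts.values() if c)


def digit_kld_by_position(coords, places=6):
    """KLD from uniform for each decimal position, pooling latitude and longitude."""
    per_pos = [Counter() for _ in range(places)]
    for lat, lon in coords:
        for value in (lat, lon):
            for k, digit in enumerate(decimal_digits(value, places)):
                per_pos[k][digit] += 1
    return [kld_from_uniform(c, support=10) for c in per_pos]


def joint_digit_kld_by_position(coords, places=6):
    """Same idea for the joint (lat digit, lon digit) pair at each position."""
    per_pos = [Counter() for _ in range(places)]
    for lat, lon in coords:
        pairs = zip(decimal_digits(lat, places), decimal_digits(lon, places))
        for k, pair in enumerate(pairs):
            per_pos[k][pair] += 1
    return [kld_from_uniform(c, support=100) for c in per_pos]


# Example: coordinates truncated to one decimal place.
# Position 1 is uniform (KLD 0); positions 2 to 6 are all zeros (KLD = log2(10) bits).
truncated = [(d / 10, d / 10) for d in range(10)]
klds = digit_kld_by_position(truncated)
```

A KLD near zero at a position means its digits look uniformly random; a large KLD at the fine-grained positions suggests truncated, rounded, or otherwise synthetic coordinates.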
Zoom-Stack Efficiency
We apply the notion of information efficiency and track how it changes as we move down a zoom stack from 1 km to 100 m to 10 m.
The metric measures how much information is gained as we add additional digits to the coordinates.
This is a way to measure the amount of randomness gained with the addition of each digit.
[Figure: zoom stack at 10 km x 10 km, 1 km x 1 km, 100 m x 100 m, and 10 m x 10 m]
Zoom-Stack Efficiency
Given our knowledge of the first N digits in a coordinate pair, how much more information do we gain by knowing the next digit?
In other words, how much randomness is induced at the next level of the zoom stack?
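One possible reading of this question is as a conditional entropy: the entropy of the grid cells at N+1 decimal digits minus the entropy at N digits. The sketch below assumes that reading. It also assumes an "efficiency" defined as that gain divided by its maximum of log2(100) bits (one extra digit each for latitude and longitude); that normalization and the function names are illustrative choices, not taken from the talk. As a rule of thumb, 2, 3, and 4 decimal places of a degree correspond to roughly 1 km, 100 m, and 10 m.

```python
import math
from collections import Counter


def truncate(value, decimals, places=6):
    """Keep `decimals` digits after the decimal point, working on the text form."""
    text = f"{value:.{places}f}"
    return text[: text.index(".") + 1 + decimals]


def entropy_bits(counts):
    """Shannon entropy in bits of an empirical distribution given as a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zoom_stack_efficiency(coords, levels=(1, 2, 3, 4)):
    """Information gained at each step down the zoom stack, as a fraction of the maximum.

    Returns one value per consecutive pair of levels. Values near 1 mean the
    extra digit adds close to log2(100) bits; values near 0 mean it adds nothing.
    """
    coords = list(coords)
    entropies = []
    for n in levels:
        cells = Counter((truncate(lat, n), truncate(lon, n)) for lat, lon in coords)
        entropies.append(entropy_bits(cells))
    max_gain = math.log2(100)
    return [(nxt - prev) / max_gain for prev, nxt in zip(entropies, entropies[1:])]


# Example: 100 points snapped to a 0.01-degree grid. The second decimal digit
# carries all the information; the third and fourth add none.
snapped = [(40.0 + i / 100, -74.0 - j / 100) for i in range(10) for j in range(10)]
efficiency = zoom_stack_efficiency(snapped)
```

With a finite sample the entropy can never exceed log2 of the number of locates, so the gain necessarily flattens at fine resolutions; comparisons are most meaningful between corpora of similar size.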
Clusterability
Clusterability
Location data shouldn’t be evenly distributed, because humans aren’t
• The clustering of coordinate points captures real-life human behaviors and habits
• Most devices have a few tight clusters that represent where they live and work
• They have less dense clusters around usual social venues
• Does the data give clean clusters around homes and businesses?
Clusterability: Does Location Data Look like Humans?
[Figure: two example panels, marked ✗ and ✓]
• Do the locates tend to cluster over residential lots and workplaces in a manner consistent with human behavior?
• Locates on a device-by-device basis should be scattered into clusters with a predictable pattern.
• The silhouette of the clusters should also be well defined. It should be neither point-like nor diffuse.
• Clusters are computed using DBSCAN.
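As a rough illustration of the DBSCAN step, the sketch below clusters one device's locates with a minimal pure-Python DBSCAN. It assumes the points have already been projected to planar (x, y) coordinates in metres, and the `eps` and `min_pts` values are hypothetical; in practice a library implementation such as scikit-learn's `DBSCAN` would be the more likely choice. The talk does not say which implementation or parameters were used.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal O(n^2) DBSCAN.

    points: list of (x, y) tuples in metres (assumed already projected).
    eps: neighbourhood radius; min_pts: neighbours (including the point itself)
    needed for a point to count as a core point.
    Returns one label per point: 0, 1, 2, ... for clusters and -1 for noise.
    """
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


# Example: a tight "home" cluster, a tight "work" cluster, and one stray locate.
home = [(dx, dy) for dx in (0, 5, 10) for dy in (0, 5, 10)]
work = [(5000 + dx, 3000 + dy) for dx in (0, 5, 10) for dy in (0, 5, 10)]
stray = [(20000, 20000)]
labels = dbscan(home + work + stray, eps=10, min_pts=4)
```

A human-like device would be expected to yield a few dense clusters plus some noise; a corpus where nearly every point is noise, or where all points collapse onto one location, would look suspicious.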
Clusterability: Quality Scores
• D measures whether clusters are formed and numerically represents the density of the clustering.
• R measures the robustness of the clustering of the data set.
• S measures the tightness of the clustering (the silhouette score).
Clusterability = D * R * (1+S) / [R + (1+S)/2]
i.e. the product of the density of the clustering and the harmonic mean of the robustness and the normalized silhouette score, (1+S)/2.
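The clusterability formula can be checked numerically by writing it as D times the harmonic mean of R and (1+S)/2, which is algebraically the same expression. The sketch below does only that. How D, R, and S themselves are computed is not specified in the slides, so they are plain inputs here, and the assumption that D and R lie in [0, 1] (with S in [-1, 1], as for a silhouette coefficient) is an assumption of this example.

```python
def clusterability(d, r, s):
    """Clusterability = D * harmonic_mean(R, (1 + S) / 2).

    Equivalent to D * R * (1 + S) / (R + (1 + S) / 2).
    d, r: assumed to lie in [0, 1]; s: silhouette score in [-1, 1].
    """
    s_norm = (1.0 + s) / 2.0          # map silhouette from [-1, 1] to [0, 1]
    denominator = r + s_norm
    if denominator == 0:
        return 0.0                    # both terms are zero: no usable clustering
    harmonic_mean = 2.0 * r * s_norm / denominator
    return d * harmonic_mean


# Example values (hypothetical): perfect scores, a middling corpus, and a
# corpus whose silhouette is as bad as possible.
perfect = clusterability(1.0, 1.0, 1.0)
middling = clusterability(0.8, 0.5, 0.0)
worst_silhouette = clusterability(1.0, 1.0, -1.0)
```

Because a harmonic mean is dominated by its smaller argument, a corpus scores well only when both the robustness and the normalized silhouette are high.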
Conclusions
• Misrepresentation of location data is widespread in the mobile ad ecosystem
• Rely on hyperlocality (an information-theoretic approach) and clusterability (unsupervised learning)
• Essential to measure and filter devices, applications and locations
• High-scale and challenging problems to be solved