Using geolocated Twitter traces to infer residence and mobility
Nigel Swier, Bence Kormaniczky, and Ben Clapperton
Background
• ONS Big Data Project: This is one of four pilots exploring the use of big data for official statistics
• Users tweeting from a smartphone have an option to provide a GPS location
• 300,000-plus such tweets sent daily within GB
• Data is relatively accessible
• Can these data be used to infer residence and mobility patterns?
Age Distribution of UK Twitter Users
Data Acquisition
• Target data: All geolocated tweets sent within Great Britain between 1 April 2014 and 31 October 2014
• Combination of Twitter API and procured data (GNIP)
• 81.4 million tweets
• Stored as JSON files in MongoDB
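As an illustration of working with the stored JSON, the sketch below pulls a GPS point out of a single tweet record. It assumes the records follow the Twitter API v1.1 tweet layout (a GeoJSON `coordinates` field and a `created_at` string). The function name `extract_point` and the rough bounding box `GB_BBOX` are hypothetical and not taken from the project.

```python
import json
from datetime import datetime

# Rough bounding box around Great Britain: an assumption for illustration only.
# It is deliberately loose and also covers some neighbouring areas.
GB_BBOX = (-8.7, 49.8, 1.8, 60.9)  # (min_lon, min_lat, max_lon, max_lat)


def extract_point(raw_tweet):
    """Return (user_id, timestamp, lon, lat) for a GPS-tagged tweet, else None."""
    tweet = json.loads(raw_tweet)
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return None  # tweet carries no exact GPS point
    lon, lat = geo["coordinates"]  # GeoJSON order: longitude first
    min_lon, min_lat, max_lon, max_lat = GB_BBOX
    if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
        return None  # outside the assumed bounding box
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return tweet["user"]["id"], created, lon, lat


sample = json.dumps({
    "created_at": "Fri Aug 15 09:30:00 +0000 2014",
    "user": {"id": 1},
    "coordinates": {"type": "Point", "coordinates": [-0.14, 50.83]},
})
point = extract_point(sample)
```

A real pipeline would read the records from MongoDB and apply a proper GB boundary rather than a box; both are left out here to keep the sketch self-contained.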
Distribution of user activity
Distribution of persistence levels
Figure axes: user frequency, count
Users with geolocated tweets on just one day are not shown
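The slides do not define "persistence". A minimal sketch, assuming it means the number of distinct days on which a user sent a geolocated tweet, could compute the distribution as follows. The function names, the `min_days` parameter, and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def persistence_levels(records):
    """Map each user to the number of distinct days with a geolocated tweet."""
    days_by_user = defaultdict(set)
    for user_id, day in records:
        days_by_user[user_id].add(day)
    return {user: len(days) for user, days in days_by_user.items()}


def persistence_distribution(levels, min_days=2):
    """Count users at each persistence level, dropping those below min_days."""
    return Counter(n for n in levels.values() if n >= min_days)


# Made-up (user_id, day) pairs, one per geolocated tweet.
records = [
    ("a", date(2014, 4, 1)), ("a", date(2014, 4, 1)), ("a", date(2014, 4, 3)),
    ("b", date(2014, 5, 9)),
    ("c", date(2014, 6, 1)), ("c", date(2014, 6, 2)), ("c", date(2014, 6, 5)),
]
levels = persistence_levels(records)
distribution = persistence_distribution(levels)
```

With `min_days=2`, user "b" is dropped, mirroring the note that single-day users are not shown.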
Geolocated Twitter volumes by device type, Great Britain, 15 August to 31 October 2014
Lots of activity in different places but where does this person* live?
* This example is based on real data but has been altered to prevent identification
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• ε = distance (radius)
• minpts = minimum points to define a cluster
Developed by Ester et al. (1996)
Figure legend: raw data, cluster centroid, noise
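To make the roles of the two parameters concrete, here is a minimal pure-Python sketch of DBSCAN with a distance radius (called `eps` here) and `minpts`. It is a textbook version and not necessarily the implementation used in the project. It assumes coordinates are projected easting/northing values in metres, and the sample points are made up.

```python
from math import hypot

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], including itself."""
    x0, y0 = points[idx]
    return [j for j, (x, y) in enumerate(points) if hypot(x - x0, y - y0) <= eps]


def dbscan(points, eps, minpts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < minpts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= minpts:  # core point: keep growing
                queue.extend(j_neighbours)
    return labels


def centroids(points, labels):
    """Mean easting/northing and point count for each cluster, skipping noise."""
    sums = {}
    for (x, y), label in zip(points, labels):
        if label == NOISE:
            continue
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n, n) for label, (sx, sy, n) in sums.items()}


# Made-up points: two tight groups and one isolated tweet.
points = [(0, 0), (10, 0), (0, 10), (10, 10),
          (1000, 1000), (1010, 1000), (1000, 1010),
          (5000, 5000)]
labels = dbscan(points, eps=20, minpts=3)
centres = centroids(points, labels)
```

Here the first four points form one cluster, the next three form a second, and the isolated point is labelled as noise. This brute-force version compares every pair of points, so a dataset of this size would need a spatial index or a library implementation.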
Cluster_id  Northing  Easting  Count  Type
60033_1     105?31    530?02   28     Residential
60022_2     104?41    530?94   4      Residential
60033_6     182?46    532?10   13     Commercial
60033_13    104?56    531?17   3      Commercial
60033_15    179?30    533?95   3      Commercial
60033_21    165?47    532?51   3      Commercial
Most likely lives here: “Dominant Residential Cluster”
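One way to pick the dominant residential cluster from the table above is sketched here. It assumes that "dominant" means the residential cluster with the highest tweet count, which the slides suggest but do not state. The function name is hypothetical, and the coordinates are omitted because they are partly masked in the source.

```python
# Cluster ids, counts and types copied from the example table.
clusters = [
    {"cluster_id": "60033_1", "count": 28, "type": "Residential"},
    {"cluster_id": "60022_2", "count": 4, "type": "Residential"},
    {"cluster_id": "60033_6", "count": 13, "type": "Commercial"},
    {"cluster_id": "60033_13", "count": 3, "type": "Commercial"},
    {"cluster_id": "60033_15", "count": 3, "type": "Commercial"},
    {"cluster_id": "60033_21", "count": 3, "type": "Commercial"},
]


def dominant_residential_cluster(clusters):
    """Return the residential cluster with the most tweets, or None."""
    residential = [c for c in clusters if c["type"] == "Residential"]
    if not residential:
        return None  # user has no residential anchor point
    return max(residential, key=lambda c: c["count"])


home = dominant_residential_cluster(clusters)
```

For the example user this selects cluster 60033_1, the residential cluster with 28 tweets.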
Time of day profile by address type
Geolocated penetration rates* by local authority
* Dominant residential cluster with date range of at least one month
Student mobility
Conclusions
• Twitter may be useful for identifying short-term mobility patterns
• DBSCAN can identify anchor points and AddressBase can classify them
• Results are indicators, NOT estimates, but it may be possible to produce new population statistics on a de facto basis
• Twitter could help inform public policy but we need to be extremely alert to source changes.
Next Steps
• Technical Report to be published shortly
• Developing methods for inferring socio-demographic characteristics
• Development of an estimation framework (including a benchmarking survey)