The Geography of Topics from
Geo-referenced Social Media Data
in London
Guy Lansley
Department of Geography,
University College London
@GuyLansley
AAG Annual Meeting 2015
Chicago, USA
Context
• Twitter could serve as a useful source of temporal population data
at a very small area geography
• These data can be used to predict how the population negotiates
travel around cities at a small area aggregate level
• The content of the Tweets poses an interesting area of research
into how people's activity and their behaviour on social media
may link to time, place and space
• Such insight could be very useful to marketing firms and retailers
• This research aims to implement an unsupervised text modelling
approach to cluster the Tweets into distinctive topics and analyse
how they vary by time and space in Central London
Previous Research
• Geo-located Twitter data and High Street Insight
• Lansley (2014) Evaluating the utility of geo-referenced Twitter data as a
source of reliable footfall insight
• Classifying Tweets using an unsupervised learning algorithm
• Lai, Cheng and Lansley (2015) Spatio-Temporal Patterns of Passengers’
Interests at London Tube Stations
Representativeness of Twitter
[Figure: Twitter (day and night) compared with the Census]
• Previous research has found geo-located Twitter data sourced from the
UK to be over-representative of young, White British adults, and there is
also a higher penetration amongst males
• Twitter has been shown to be a useful indicator of footfall, although the
proportional spatial distribution of Tweets has been found to differ from
Census statistics (see figure)
Data available through the Twitter API
• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
Twitter Data
• As text modelling is computationally
intensive and the density of Tweets
can be very low in some places it
was decided to restrict the sample
to Inner London.
• Tweets from 1st January 2013 until
31st December 2013 were
downloaded
• To understand the typical weekday
patterns only Tweets from Tuesday,
Wednesday and Thursday were
used
Filtering
Aim: to reduce the amount of noise in the dataset
• Tweets with fewer than 3 words were removed
• Words with fewer than 3 characters or more than 15 characters
were removed
• URLs were removed
• Tweets from users with over 2000 Tweets were removed from
the sample
• Tweets from false users who had requoted texts repeatedly were
removed from the sample
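The filtering rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the `Tweet` record, `clean_text` and `filter_tweets` are hypothetical names, the order of the steps is assumed, the 2000-Tweet limit is applied to counts within the sample, and a "false user" is approximated as one who posts the same cleaned text more than `max_repeats` times.

```python
import re
from collections import Counter, namedtuple

# Hypothetical minimal record; the real data carry many more fields.
Tweet = namedtuple("Tweet", ["user_id", "text"])

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text, min_chars=3, max_chars=15):
    """Strip URLs, then drop words shorter than min_chars or longer than max_chars."""
    words = URL_PATTERN.sub(" ", text).split()
    return " ".join(w for w in words if min_chars <= len(w) <= max_chars)

def filter_tweets(tweets, min_words=3, max_user_tweets=2000, max_repeats=3):
    """Apply the noise filters; max_repeats is an assumed threshold."""
    per_user = Counter(t.user_id for t in tweets)
    cleaned = [(t, clean_text(t.text)) for t in tweets]
    repeats = Counter((t.user_id, text) for t, text in cleaned)
    # Assumed rule: a "false user" repeats the same cleaned text too often.
    false_users = {user for (user, _), n in repeats.items() if n > max_repeats}
    kept = []
    for tweet, text in cleaned:
        if per_user[tweet.user_id] > max_user_tweets:
            continue  # very prolific account
        if tweet.user_id in false_users:
            continue  # repeated requoted text
        if len(text.split()) < min_words:
            continue  # too short after cleaning
        kept.append(Tweet(tweet.user_id, text))
    return kept
```

Counting words after cleaning rather than before is a design choice made here; the slides do not say which was used.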
Number of Tweets
Stage                                               Tweets
Total weekday Tweets from Greater London in 2013    3,341,959
Total weekday Tweets from Inner London              1,679,571
Coordinates cleaned & false users removed           1,545,899
Strings cleaned                                     1,301,004
Methods
• All of the Twitter text strings were converted into a corpus
• Converted to lower case
• Numbers were removed
• Punctuation was removed
• Stop words were removed
• The corpus was lexicalized in R
• The corpus was then run through a Latent Dirichlet Allocation
(LDA) model
• LDA is an unsupervised approach to document modelling that
discovers latent semantic topics in large collections of texts
• The number of topic groups (k) is predefined by the user
• 20 groups were made
• 100 subgroups were made from running additional LDA models
on the Tweets from each of the 20 groups individually
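The corpus preparation steps listed above (lower case, numbers, punctuation, stop words) might look like the sketch below. The study prepared its corpus in R; this Python version is only an illustration, and `STOP_WORDS`, `preprocess` and `build_corpus` are hypothetical names with a deliberately tiny stop word list.

```python
import re
import string

# Tiny illustrative stop word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "in", "on", "at", "to", "of", "is", "for", "my"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, strip numbers and punctuation, drop stop words, return tokens."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    return [w for w in text.split() if w not in stop_words]

def build_corpus(texts):
    """Treat each Tweet as one document; return token lists and the vocabulary."""
    docs = [preprocess(t) for t in texts]
    vocab = sorted({w for doc in docs for w in doc})
    return docs, vocab
```

The token lists returned by `build_corpus` are the form of input that a topic model such as LDA expects, with one document per Tweet.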
Latent Dirichlet Allocation
• Blei et al. (2003) Latent Dirichlet Allocation
[Table: one row per Tweet (Tweet 1, Tweet 2, Tweet 3, …) with columns Text, Time, Date, x and y]
• Each Tweet (as an individual text document) is assigned to one
topic group based on the generated probabilities from the LDA
model
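To make the assignment step concrete, here is a small pure-Python LDA sketch that ends by giving each Tweet its single most probable topic. The slides do not say which LDA implementation or inference method was used; collapsed Gibbs sampling is swapped in here for illustration, and `lda_gibbs`, `assign_topics`, `alpha`, `beta` and `iterations` are all assumed names and values.

```python
import random

def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic probabilities."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    doc_topic = [[0] * k for _ in docs]       # words in doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                     # words assigned to topic t overall
    assignments = []
    for d, doc in enumerate(docs):            # random initial assignment
        z_doc = []
        for w in doc:
            t = rng.randrange(k)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, t = word_id[w], assignments[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][t] -= 1
                topic_word[t][wid] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][wid] + beta) / (topic_total[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wid] += 1
                topic_total[t] += 1
    # Smoothed estimate of each document's topic mixture (rows sum to 1).
    return [
        [(counts[j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
        for counts, doc in zip(doc_topic, docs)
    ]

def assign_topics(theta):
    """Assign each document to its single most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]
```

Under this sketch, the two-stage design would correspond to calling `lda_gibbs` once with `k=20` on all Tweets, then again on the Tweets assigned to each group to obtain its subgroups.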
20 Twitter Groups
1 Photography and Sights
2 Optimism, Kindness and Positivity
3 Leisure and Attractions
4 TV and Film
5 Humour and Informal Conversations
6 Transport and Travel
7 Politics, Beliefs and Current Affairs
8 Sport and Games
9 Anticipation and Socialising
10 Business, Information and Networking
11 Pessimism and Negativity
12 Music and Musicians
13 Routine Activities
14 Food and Drink
15 Body, Appearances and Clothes
16 Social Media and Apps
17 Slang and Profanities
18 Place and Check-Ins
19 Wishes and Gratitude
20 Foreign and Other
Time Distribution
[Figure: Tweets by hour of day (0 to 23) for each of the 20 Twitter groups and for All Tweets]
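An hourly profile of this kind could be computed as in the sketch below. It assumes each Tweet is reduced to a (timestamp, group) pair and that the chart shows each group's share of Tweets per hour; the slides do not state the exact measure, and `hourly_profile` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

def hourly_profile(records):
    """Return, for each group, the share of its Tweets posted in each hour (0 to 23)."""
    counts = defaultdict(Counter)
    for timestamp, group in records:
        counts[group][timestamp.hour] += 1
        counts["All Tweets"][timestamp.hour] += 1  # pooled baseline
    profiles = {}
    for group, by_hour in counts.items():
        total = sum(by_hour.values())
        profiles[group] = [by_hour[h] / total for h in range(24)]
    return profiles

# Made-up records for illustration only.
records = [
    (datetime(2013, 1, 1, 8, 30), "Transport and Travel"),
    (datetime(2013, 1, 1, 8, 45), "Transport and Travel"),
    (datetime(2013, 1, 2, 18, 0), "Transport and Travel"),
    (datetime(2013, 1, 2, 13, 0), "Food and Drink"),
]
profiles = hourly_profile(records)
```

Comparing a group's profile with the "All Tweets" profile shows the hours in which that group is relatively more or less common.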
Spatial Distributions
[Figure: standardised residuals on a 200 x 200 m grid for Leisure and Attractions, Transport and Travel, and Music and Musicians]
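One way to produce gridded standardised residuals is sketched below. The slides do not define the residual, so the Pearson form, (observed - expected) / sqrt(expected), is an assumption, as is the use of projected coordinates in metres (for example British National Grid eastings and northings). `grid_cell` and `standardised_residuals` are hypothetical names.

```python
import math
from collections import Counter

def grid_cell(easting, northing, size=200):
    """Map projected coordinates in metres to the index of a size x size metre cell."""
    return (int(easting // size), int(northing // size))

def standardised_residuals(points, group, size=200):
    """Pearson residual (observed - expected) / sqrt(expected) per cell for one group."""
    cell_totals = Counter()
    cell_group = Counter()
    for easting, northing, label in points:
        cell = grid_cell(easting, northing, size)
        cell_totals[cell] += 1
        if label == group:
            cell_group[cell] += 1
    if not cell_totals:
        return {}
    # Expected count assumes the group has the same share of Tweets in every cell.
    share = sum(cell_group.values()) / sum(cell_totals.values())
    residuals = {}
    for cell, total in cell_totals.items():
        expected = total * share
        if expected > 0:
            residuals[cell] = (cell_group[cell] - expected) / math.sqrt(expected)
    return residuals
```

Positive values mark cells where the group is more common than its overall share would predict, and negative values mark cells where it is less common.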
Land Use
• Aim: to understand how Tweets may correspond with land use
• Tweets were intersected with the Generalised Land Use Database
(GLUD) to assign each one a land use category
• The GLUD categorises all of England into polygons of 9
categories
• It was created by recoding OS MasterMap (2005)
• However, it is now 10 years out of date and there are
some notable errors
[Figure: map legend: Domestic Buildings and Gardens, Non-Domestic Buildings, Public Green Space]
Land Use and Tweets
[Figure: proportional difference of Twitter Groups (%), on an axis from -6 to 6, for each of the 20 groups, with one series per land use: Domestic Buildings and Gardens, Non-Domestic Buildings, Public Green Space, Rail]
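The land use comparison can be sketched in two steps: intersect each Tweet with land use polygons, then compare group shares. The slides do not define "proportional difference"; here it is assumed to be the percentage-point gap between a group's share within a land use category and its share across all Tweets. The function names and the polygon format (a list of (x, y) vertices) are hypothetical, and a real analysis would use a GIS library for the intersection.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of (x, y) vertices)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def land_use_of(x, y, land_parcels):
    """Return the category of the first parcel containing the point, or None."""
    for category, polygon in land_parcels:
        if point_in_polygon(x, y, polygon):
            return category
    return None

def proportional_difference(tweets, land_parcels):
    """Percentage-point gap between a group's share in a land use and its overall share."""
    overall = Counter()
    by_land_use = {}
    for x, y, group in tweets:
        overall[group] += 1
        category = land_use_of(x, y, land_parcels)
        if category is not None:
            by_land_use.setdefault(category, Counter())[group] += 1
    n = sum(overall.values())
    result = {}
    for category, counts in by_land_use.items():
        n_cat = sum(counts.values())
        result[category] = {
            group: 100 * (counts[group] / n_cat - overall[group] / n)
            for group in overall
        }
    return result
```

A value of +2 for a group in a category would mean the group accounts for two percentage points more of the Tweets there than it does overall.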
Key Places
• It is also possible to collect Tweets from particular
locations to observe how they compare
• We have selected all of the Tweets from the
immediate vicinity of 6 unique locations in London
• Both the frequency and the content of Tweets
were found to be influenced by the local activity
Twitter Groups and Key Places
[Figure: ratio of each of the 20 Twitter groups (legend: x 0.5, x 1.0, x 2.0) at Residential, The Emirates Stadium, The O2 Arena, Waterloo Station, Westfield Stratford, Soho, and Canary Wharf]
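The ratios for key places could be computed along the lines below. The slides define neither "immediate vicinity" nor the ratio, so this sketch assumes a fixed radius around a point (the 250 m default is hypothetical) and takes the ratio to be a group's share of local Tweets divided by its share of all Tweets. `haversine_m` and `group_ratios` are illustrative names.

```python
import math
from collections import Counter

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def group_ratios(tweets, place_lat, place_lon, radius_m=250):
    """Ratio of each group's share near a place to its share across all Tweets."""
    overall = Counter(group for _, _, group in tweets)
    local = Counter(
        group
        for lat, lon, group in tweets
        if haversine_m(lat, lon, place_lat, place_lon) <= radius_m
    )
    n, n_local = sum(overall.values()), sum(local.values())
    if n_local == 0:
        return {}
    return {g: (local[g] / n_local) / (overall[g] / n) for g in overall}
```

A ratio of 2.0 would mean the group is twice as common near the place as in the sample overall, and 0.5 half as common.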
100 Subgroups
Labels were inferred from the most overrepresented words
1 Photography and Sights: a Landmarks, b Outdoors, c Urban, d Instagram, e Architecture
2 Optimism, Kindness and Positivity: a Anticipation, b Mood, c Achievements, d Conversations, e Reflections
3 Leisure and Attractions: a Fashion and Shopping, b Museums and Galleries, c Nightlife, d Shows and Entertainment, e Events and Socialising
4 TV and Film: a Television, b Celebrities, c Reality, d Cinema and Film, e Reactions
5 Humour and Informal Conversations: a Opinions, b Laughter, c Chat, d Affection, e Mates
6 Transport and Travel: a Journeys, b Trains and Delays, c Public Transport, d Roads and Cycling, e Travel Incidents
7 Politics, Beliefs and Current Affairs: a Politics, b Religion, c Newspapers, d Political Awareness, e Current Affairs
8 Sport and Games: a Other Sports, b Footballers, c London Teams, d International Football, e Football Managers
9 Anticipation and Socialising: a Wishes, b The Day before, c Events, d Weekend, e Holidays
10 Business, Information and Networking: a Training, b Conference, c Brands, d Jobs and Careers, e Data and Technology
11 Pessimism and Negativity: a Problems, b Hate and Anger, c Sadness and Awkwardness, d Life and Changes, e Worry and Confusion
12 Music and Musicians: a Pop Stars and Music Videos, b Radio and Downloads, c Concerts, d Albums
13 Routine Activities: a Exercise, b Work, c Feelings, d Education, e Sleep
14 Food and Drink: a Food, b Drink, c Meals, d Coffee and Cake, e Hunger
15 Body, Appearances and Clothes: a Cosmetics, b Body and Health, c Clothes, d Cute, e Weather
16 Social Media and Apps: a Social Media Activity, b Services, c Technology and Brands, d Communications, e Trending
17 Slang and Profanities: a Street Slang, b Abuse, c People, d Jokes, e Misuse
18 Place and Check-Ins: a Events, b Routine Places, c Attractions, d Markets, e Stations
19 Wishes and Gratitude: a Friends, b Via Social Media, c People, d Celebrations, e Thanks and Affection
20 Foreign and Other: a Portuguese, b French, c Spanish, d Turkish, e Italian, f Other
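A simple way to find overrepresented words is to compare each word's share of a group's tokens with its share of all tokens, as sketched below. The slides do not state the measure used for labelling, so this ratio, the `min_count` cut-off and the name `overrepresented_words` are assumptions. The sketch expects the group's Tweets to be a subset of `all_docs`.

```python
from collections import Counter

def overrepresented_words(group_docs, all_docs, top_n=5, min_count=2):
    """Rank words by (share of the group's tokens) / (share of all tokens)."""
    group_counts = Counter(w for doc in group_docs for w in doc)
    all_counts = Counter(w for doc in all_docs for w in doc)
    n_group, n_all = sum(group_counts.values()), sum(all_counts.values())
    # group_docs is assumed to be a subset of all_docs, so all_counts[w] > 0.
    scores = {
        w: (c / n_group) / (all_counts[w] / n_all)
        for w, c in group_counts.items()
        if c >= min_count  # ignore words too rare to be reliable
    }
    # Highest ratio first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

The top-ranked words for a group or subgroup can then be read by a person, who chooses a label such as "Trains and Delays".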
Subgroups
• Topic 3 – Leisure and Attractions
[Figure: the five Topic 3 subgroups: Museums and Galleries, Fashion and Shopping, Events and Socialising, Shows and Entertainments, Nightlife]
Topic 3 Subgroups across Central London
[Figure: panels for Fashion and Shopping, Museums and Galleries, Nightlife, and Shows and Entertainments]
Topic 13 Subgroup D – Education
[Figure: map shaded from Underrepresented to Overrepresented, labelled with UCL, University of Westminster, Imperial College London, London South Bank University, King's College London, Queen Mary, London Metropolitan University, University of Greenwich, City University, Goldsmiths, Birkbeck, SOAS, LSE, and Various]
Topic 6 Subgroup B – Trains and Delays
[Figure: map shaded from Underrepresented to Overrepresented, labelled with Clapham Junction, Victoria, Waterloo, London Bridge, Liverpool Street, Fenchurch Street, St Pancras, King's Cross, Euston, Paddington, Marylebone, and Lewisham]
Conclusions
• There is a distinctive geography of Tweets in London
which can be represented by a discrete Tweet content
classification produced from a generative probabilistic
model
• The composition of Tweets varies by time and space within
Central London
• Land use and its associated activity correspond with the
content of geo-located Tweets transmitted locally
• It may be possible to link the topics to socio-economic
status via focusing on Tweets recorded from residential
locations
References
• Blei, D., Ng, A. and Jordan, M. (2003) Latent Dirichlet allocation.
Journal of Machine Learning Research, 3, 993–1022
• Cheng, T. and Wicks, T. (2014) Event Detection using Twitter: A
Spatio-Temporal Approach. PLoS ONE, 9(6)
• Lai, J., Cheng, T. and Lansley, G. (2015) Spatio-Temporal Patterns of
Passengers’ Interests at London Tube Stations. In the Proceedings of
the 23rd Conference on GIS Research UK. 15th – 17th April, 2015.
University of Leeds, UK
• Lansley, G. (2014) Evaluating the utility of geo-referenced Twitter data
as a source of reliable footfall insight. In the Proceedings of the
Association of American Geographers AGM 2014. 8th – 12th April,
2014. Tampa, USA
• Longley, P., Adnan, M. and Lansley, G. (2015) The geo-temporal
demographics of Twitter usage. Environment and Planning A, 47(2),
465–484