An Investigation in Defining Neighbourhood Boundaries

Using Location Based Social Media

Tai Tong KAM

28th August 2015

For BENVGSC6: Dissertation

Supervised by: Steven Gray, Dr Elsa Arcaute

Word Count: 10,169 words

This dissertation is submitted in partial fulfilment for the requirements for the MSc in

Smart Cities and Urban Analytics in the Centre for Advanced Spatial Analysis, Bartlett

Faculty of the Built Environment, University College London.


ABSTRACT

The widespread use of smartphones and social media has opened opportunities for

researchers to define one of the most elusive concepts in cities: neighbourhoods. While

the number of neighbourhood detection methods using location based social media has

increased in recent years, there is much that we do not know about the process. For

example, researchers have rarely integrated the neighbourhoods detected with

administrative data to add meaning beyond what can be inferred from social media.

This work takes a step towards better understanding neighbourhood detection methods,

and also attempts to add meaning to the clusters / neighbourhoods generated by

incorporating administrative data into these clusters / neighbourhoods.

I break down the neighbourhood detection process into three common elements: (a) the

unit used for aggregation, (b) the type of clustering method used, and (c) the similarity

measure.

I then illustrate one way of better understanding the neighbourhood detection process by

applying multiple variations of the Livehoods method (Cranshaw et al., 2012) on data

from Greater London, and find that in addition to neighbourhood clusters, the

Livehoods method may also be able to generate clusters that depict the city's boundaries

from the residents' perspective.

I also make a preliminary attempt in this work to combine the clusters / neighbourhoods

formed using the Livehoods method with data from London's Lower Super Output

Areas to investigate ethnic diversity in neighbourhoods. I found that using location

based social media may generate neighbourhood boundaries that are more appropriate

than or can complement traditional administrative boundaries for studies where

definitions of neighbourhood go beyond arbitrary administrative boundaries and a

multifaceted view of neighbourhoods is needed.


DECLARATION

I, Tai Tong Kam, hereby declare that this dissertation is all my original work and that all

sources have been acknowledged. It is 10,169 words in length.

Signature

====================

Date: 28th August 2015


TABLE OF CONTENTS

1. RESEARCH GOAL AND OVERVIEW
1.1. Research goal, motivations, and limitations
1.2. Overview
2. INTRODUCTION
2.1. Neighbourhoods
2.2. Location Based Social Media and Detecting Neighbourhood Boundaries
2.3. Review of Methods for Neighbourhood Detection
3. METHODOLOGY
3.1. Data sources
3.2. Data sorting, import, storage and analysis
3.3. The Livehoods method
4. ANALYZING THE LIVEHOODS METHOD
4.1. Tuning the number of smallest eigenvalues (k)
4.2. Tuning the alpha constant (α)
4.3. Tuning the nearest neighbours parameter (m)
4.4. Using cosine similarity
4.5. Nearest neighbours versus full similarity graph
4.6. Summary
5. DESCRIPTION OF LIVEHOOD CLUSTERS / NEIGHBOURHOODS
5.1. Overview of neighbourhoods
5.2. Breakdown of individual neighbourhoods
6. COMPARING LIVEHOODS CLUSTERS TO LOWER SUPER OUTPUT AREAS
7. CONCLUSION
7.1. Concluding Remarks
7.2. Limitations and Future Research
8. BIBLIOGRAPHY
9. APPENDIX
9.1. Scripts for collecting and formatting data for analysis
9.1.1. IPython notebook: twitter_streaming.ipynb
9.1.2. IPython notebook: extract_twitter_data.ipynb
9.1.3. IPython notebook: foursquare_search_place.ipynb
9.1.4. IPython notebook: format_data_for_analysis.ipynb
9.2. Scripts for Livehoods clustering method
9.2.1. Bash script: install.sh
9.2.2. Bash script: runLDN.sh
9.2.3. Python script: clustering.py
9.2.4. Python script: clusteringalgo.py
9.2.5. Python script: getdata.py
9.2.6. Python script: utils.py
9.3. Scripts for visualizing cluster results
9.3.1. Python script: formatresults.py
9.3.2. Python script: visualize_cluster_results.py
9.4. Scripts for comparing Lower Super Output Areas with Livehoods clusters in terms of ethnic diversity
9.4.1. Python script: extract_ldn_lsoa.ipynb
9.4.2. Python script: add_ethnic_diversity_to_geojson.ipynb
9.4.3. Python script: stats_for_eth_diversity.ipynb
9.4.4. R script: ethnic_diversity_chart.R
9.5. Livehood clusters for nearest neighbours parameter m=5 to m=20
9.6. Largest cluster generated from Livehoods method


LIST OF FIGURES

Figure 1: Relationship between number of smallest eigenvalues (k) found and number of clusters formed
Figure 2: Boundaries formed for different number of clusters
Figure 3: Boundaries formed for different alpha constants
Figure 4: Boundaries formed for different nearest neighbours parameter (m)
Figure 5: Clustering results for London
Figure 6: Properties of Livehood clusters
Figure 7: Overall distribution of venues and checkins across clusters
Figure 8: Hirschman concentration index (HI) for clusters

LIST OF TABLES

Table 1: Summary statistics for cluster results for London
Table 2: Percentage difference between proportion of venues within cluster to proportion of venues within city in terms of Foursquare's main categories
Table 3: Percentage difference between proportion of users within cluster checking-in to proportion of users within city checking-in in terms of Foursquare's main categories


ACKNOWLEDGMENTS

I would like to thank my supervisors, Steven Gray and Elsa Arcaute, who have been

extremely supportive and helpful throughout the dissertation process. Steven was also

instrumental in helping me process the data by guiding me on the process for setting up

the cloud computing infrastructure required to run the time-consuming scripts in parallel.

Elsa also introduced me to Anastasios Noulas from the University of

Cambridge, who kindly provided the Foursquare data used in this work.

I would also like to thank all the teachers, staff and fellow course mates at CASA, who

have given me a great year of friendship, learning and joy in my time at CASA and

inspired me to do better.

Finally, I would like to thank my partner Cherlyn Ng, whose love, patience and support

made it possible for me to focus on my work while we were 6,740 miles apart.


1. RESEARCH GOAL AND OVERVIEW

1.1. Research goal, motivations, and limitations

The widespread use of smartphones and social media has generated an immense

amount of data which has been used to study topics such as mobility and event

detection in the city (Silva et al., 2013). Some researchers have been attempting to

use the data to define one of the most elusive concepts in cities: neighbourhoods

(Cranshaw et al., 2012; Falher et al., 2015; Zhang et al., 2013). While the research is

promising, there is much that we do not understand about the process of detecting

neighbourhoods using location based social media. For example, we do not know

how the neighbourhoods detected compare with traditional administrative

boundaries, and how we can combine the neighbourhoods detected with data from

these administrative boundaries to help us better understand cities dynamically. We

also do not know how the neighbourhoods detected may change when data over

different time periods or different time intervals are used and what these changes

may mean.

This work takes a step towards better understanding neighbourhood detection

methods. I break down the neighbourhood detection process into three common

elements: (a) the unit used for aggregation, (b) the type of clustering method used,

and (c) the similarity measure used, so that they can be studied in depth.

Better understanding can come in the form of research on particular elements in the

neighbourhood detection process across a variety of methods and comparing the

differences when different elements are used. It can also come in the form of better

understanding a particular method in depth and exploring how the neighbourhoods

formed are different depending on the parameters used.


In this dissertation, I illustrate one way of doing this by applying multiple variations

of the Livehoods method (Cranshaw et al., 2012) on data from Greater London. The

Livehoods method was chosen as it is a venues-based approach which has not been

used as much in the literature. In addition, it has not yet been applied to the Greater

London area.

As mentioned above, we do not understand how we can combine the clusters /

neighbourhoods detected via neighbourhood detection methods with data from

administrative boundaries to help us better understand cities. Integrating clusters /

neighbourhoods detected using neighbourhood detection methods with data from

administrative boundaries is rare in the neighbourhood detection literature as most

researchers using neighbourhood detection methods have used them for developing

recommendation engines that find similar places based on social media activity. As

such, I make a preliminary attempt in this work to combine the clusters /

neighbourhoods formed using the Livehoods method with data from more

traditional administrative boundaries (the Lower Super Output Areas in this case) to

extend the meaningfulness of the clusters / neighbourhoods formed. In particular, I

have tried to integrate ethnic diversity data with the clusters / neighbourhoods

formed using the Livehoods method.

As neighbourhood detection using location based social media is relatively new and

there are few comparisons between existing neighbourhood detection methods, this

work is not aimed at evaluating whether one method or even whether particular

elements of a method are better than another. Neighbourhood detection is a form of

clustering, and determining the best clustering method has a certain degree of

subjectivity.


1.2. Overview

The dissertation is divided into seven sections.

Section Two discusses the concept of neighbourhoods, its importance for

understanding cities and why social media is a useful source of data for defining

neighbourhoods. I will review the methods that have so far been used for defining

neighbourhoods and three common elements used by the methods: (a) the unit used

for aggregation, (b) the type of clustering method used, and (c) the similarity

measure used. I will then describe what we have learnt so far about neighbourhood

detection using location based social media, and outline some ideas for better

understanding these methods.

Sections Three to Six illustrate one way we can better understand neighbourhood

detection methods by taking a closer look at the Livehoods method (Cranshaw et al.,

2012). Section Three begins by describing the data and methodology used.

Section Four then considers different variations of Cranshaw et al.'s (2012)

Livehoods method for neighbourhood detection and tests three different parameters

to find out if changing them affects the clustering results.

Section Five describes the clusters / neighbourhoods that are formed using the

Livehoods method and explores some types of information that can be derived from

these clusters, by combining the clusters with Foursquare's venues database.

Section Six describes the clusters / neighbourhoods that are formed using the

Livehoods method by combining them with data from Lower Super Output Areas

(LSOAs) in Greater London. It discusses the issue of the modifiable areal unit

problem (Openshaw, 1984) and how characteristics of the clusters / neighbourhoods

formed using the Livehoods method may be more appropriate than traditional

administrative boundaries such as the LSOAs.


Section Seven consists of concluding remarks and outlines some ideas for further

research that can help us better understand neighbourhood detection methods using

location based social media.


2. INTRODUCTION

2.1. Neighbourhoods

Neighbourhoods are a ubiquitous feature of urban living: everyone lives in a

neighbourhood. Many groups have an interest in understanding neighbourhoods.

Cranshaw and Yano (2010) note that analysing neighbourhoods is of interest to

businesses such as realtors and developers as the quality of a neighbourhood

affects the value of their assets, and to researchers in the social sciences as they seek

to understand neighbourhood and community level factors that influence

phenomena such as obesity rates and perceived happiness through neighbourhood

effects (Sampson et al., 2002). A third group with an interest in neighbourhoods

is city governments, which implement neighbourhood interventions and wish to

identify where the interventions would make sense and be most effective. Being

able to identify neighbourhoods in our cities would be valuable to all three groups.

While there is a general consensus that a neighbourhood is a contiguous

geographic area within a larger city, limited in size, and somewhat homogeneous in

its characteristics (Weiss et al., 2007), it is hard to pin down a more exact definition

(Chaskin, 1998; Weiss et al., 2007). Researchers have defined neighbourhoods in

terms of three dimensions with varying emphasis: social ties, physical demarcations

and residents' experiences (Chaskin, 1997). These are influenced by many factors

such as administrative boundaries, manmade features such as roads, natural features

such as rivers, demographics, social networks of the people that live in or frequent

the area, and the availability of services and facilities (Cranshaw and Yano, 2010).

Each person's perception of their neighbourhood boundaries may differ, even from

their neighbours, and these perceptions may also differ from the official boundaries

used by city governments for urban planning or neighbourhood initiatives


(Campbell et al., 2009). However, researchers have also found evidence that

residents often identify a common core within their neighbourhood, and the

differences are about the boundaries where neighbourhoods begin and end

(Campbell et al., 2009).

Neighbourhoods differ from communities, in the sense that neighbourhoods are tied

to a spatial unit with boundaries, while communities are not limited to spatial units.

This difference is reflected in how the role of neighbourhoods in cities has shifted

over time. To summarize Chaskin (1997), neighbourhoods in the past were tied

closely to the idea of community. There were close ties between those living within

a neighbourhood and a strong sense of identity, akin to an urban village. However,

as transportation systems improved and communication over long distances became

available, ties within a neighbourhood have become less close and more functional,

providing a space where neighbours share information, aid and services. When

studying social ties within neighbourhoods, it may be useful to look at common

social and functional activities between those living in a neighbourhood and where

these activities take place. These may give an indication of places that are

considered part of the neighbourhood for those involved in the activities.

Traditionally, studies on neighbourhoods and the neighbourhood effect have used

boundaries where data was easily available, such as administrative and political

boundaries. The data is often reliable as it is typically collected by government

agencies, and the boundaries used usually do not change greatly. Such data is useful

for understanding long term trends and behaviours such as demographics and

urbanisation. However, these traditional data sources are usually collected at certain

periods with long intervals between each period. The data collected represents

snapshots at particular points in time, and does not capture the multiple changes that


may occur in between data collection periods. For example, full censuses in the

United Kingdom take place once every ten years. This means that data from traditional

sources is less useful for researchers interested in trends and behaviours that are

short term or temporary in nature, such as commuting behaviour during transport

strikes or riots. In addition, data from traditional sources is often expensive and

time consuming to collect. For studying short term and dynamic trends and

behaviours that change frequently, location based social media is likely to be a

more suitable data source.

2.2. Location Based Social Media and Detecting Neighbourhood Boundaries

Location based social media is a relatively new source of data for researchers. Users

of these platforms post their thoughts or activities with location data attached. Many

of the characteristics of data from these posts or check-ins make it suitable for

studying short term phenomena and behaviours. It is easily available, it is cheap and

quick to collect, and it provides multiple points of data within a short period. Its

biggest advantage over other data sources is the amount of context that it provides.

A typical data point from location based social media contains information on who

the user is, where the user was, and when the data was created. It also provides

additional information depending on the social media platform used. For example,

Twitter (https://twitter.com/) users post tweets indicating what they were doing or

thinking, Instagram (https://instagram.com/) users post photos, and Foursquare

(https://foursquare.com/) users provide more detailed information about


their location. Social media platforms may provide additional contextual

information. The aforementioned Foursquare, for example, maintains a database of

venues that their users post from. This database contains rich contextual information

such as the type of venue (e.g. restaurant, school) and its popularity, which can be

linked to the posts from its users. Furthermore, it is possible to look at the

relationships between different users on social media platforms through the users'

interactions with each other.

Silva et al. (2012) observe that the widespread adoption of smartphones and social

media websites has created a valuable opportunity to study city dynamics. Data

from location based social media provides rich contextual information on user

activity at different times of day. These characteristics make location-based social

media useful in detecting the invisible image of cities (Silva et al., 2012), such as

patterns of transition between locations that serve different functions in the city.

Given that city neighbourhoods do not follow strict boundaries and can shift over

time (Chaskin, 1997), location-based social media, which provides a large amount

of data in real time, is a useful source of information for neighbourhood detection in

cities and identifying changes over time. As such, researchers have also started to

use social media to detect neighbourhood boundaries.

Using data from location based social media has its limitations. While data from

location based social media has rich context and can be collected easily, such

platforms are typically used by young males who are interested in technology

(Cranshaw et al., 2012), thus the data represents a skewed demographic. Using such

data may generate clusters / neighbourhoods that reflect the views of a certain

demographic, which may not be in agreement with the general population. In

addition, data on these platforms are usually private unless the user agrees to share


the data publicly, which further limits the amount of data available for analysis.

Another factor to consider is that users may curate the types of places that they

check-in at using location based social media. Places that are considered more

socially desirable to be at may be over represented when using data from location

based social media. For example, people may be more likely to check in when eating

at a new fancy restaurant or shopping in a branded goods store rather than when

they are eating at a fast food restaurant or shopping in a discount store. This means

that conclusions based on data from location based social media will likely be

biased towards such socially desirable venues. In the case of neighbourhood

detection, the clusters / neighbourhoods formed may be similarly biased. Previous

research has shown that users have been more likely to check-in at venues

concerning travel and transport, office buildings, and residences (Preotiuc-Pietro

and Cohn, 2013). Despite these limitations, researchers believe that data from

location based social media can still be valuable for its rich contextual information

and sheer volume available (Silva et al., 2013).

2.3. Review of Methods for Neighbourhood Detection

What follows is a review of neighbourhood detection methods using location-based

social media. Neighbourhood detection using location based social media is

typically treated as a clustering problem, and the methods used so far reflect this

paradigm. Essentially, researchers wish to cluster users' social media activities into

contiguous geographic areas based on certain measures of similarity.

Neighbourhood detection methods usually contain three elements:

a. The unit used for aggregation (e.g. grid-based, venue-based)

b. The type of clustering method (e.g. K-Means clustering, spectral clustering)

c. The similarity measures used

Unit used for aggregation

While the data from location based social media comes in the form of individual

posts or check-ins, these are usually aggregated in some spatial form before being

clustered. A common method used in neighbourhood detection is to take the

grid-based approach for aggregating the posts. This means dividing the city into multiple

grid squares of equal size and aggregating the properties of the posts within the grid

square. The properties of the grid squares are later used to calculate similarity

measures between grid squares during clustering. Noulas et al. (2011), for example,

used a grid-square approach where each grid contained the distribution of

Foursquare venue categories nearby and the number of check-ins at these venues.

Grid squares that are contiguous and are similar to each other based on the

clustering algorithm are then grouped together to form neighbourhoods. In grid-based

approaches, the neighbourhoods formed depend on the number, size and

shape of the grid cells used, which is an important consideration when adopting this

approach. For example, large grid cells mean fewer cells overall and

faster processing, but are less precise in delineating

neighbourhood boundaries. In certain cases, the grid square itself may be treated as

a neighbourhood. The size of the grid is often a key decision that has to be made in

grid-based approaches.
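
The grid-based aggregation described above can be illustrated with a minimal sketch. It is not taken from any of the cited methods: the function names, the sample check-ins, the grid origin and the cell size in degrees are all assumptions made for illustration, and cells defined in degrees are only approximately square on the ground.

```python
import math
from collections import Counter, defaultdict


def grid_cell(lat, lon, origin_lat, origin_lon, cell_deg):
    """Return the (row, col) index of the grid square containing a point."""
    row = math.floor((lat - origin_lat) / cell_deg)
    col = math.floor((lon - origin_lon) / cell_deg)
    return row, col


def aggregate_checkins(checkins, origin_lat, origin_lon, cell_deg):
    """Count check-ins per venue category within each grid square."""
    cells = defaultdict(Counter)
    for lat, lon, category in checkins:
        cell = grid_cell(lat, lon, origin_lat, origin_lon, cell_deg)
        cells[cell][category] += 1
    return cells


# Hypothetical check-ins: (latitude, longitude, venue category).
checkins = [
    (51.5074, -0.1278, "Food"),
    (51.5079, -0.1270, "Food"),
    (51.5076, -0.1265, "Travel & Transport"),
    (51.5312, -0.1040, "Nightlife Spot"),
]

# Assumed grid: origin south-west of the sample points, cells of 0.01 degrees.
cells = aggregate_checkins(checkins, origin_lat=51.28, origin_lon=-0.51, cell_deg=0.01)

for cell, counts in sorted(cells.items()):
    print(cell, dict(counts))
```

Each grid square ends up with a vector of category counts, which is the kind of property a similarity measure can then compare. Halving `cell_deg` roughly quadruples the number of cells, which is the trade-off between precision and processing speed noted above.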

A second method is the venues-based approach. Venues are locations specifically

identified by location-based social media platforms, which usually have a database

of venues that users can check-in from. Researchers can make use of the data

contained in these venue databases in addition to the posts made by the users to


develop methods for neighbourhood detection. Venues that are considered similar to

each other and fulfil a proximity criterion such as being within a certain distance

from each other are then grouped together and the area bounded by these venues

forms a neighbourhood. The proximity criterion is important as it defines the

geographic aspect of the venues. It is similar to how defining the size and shape of

the grids in the grid-based approach determines how the grids are geographically

related to each other. One of the earliest attempts at neighbourhood detection using

location based social media is called Livehoods (Cranshaw et al., 2012) and this

took the venues-based approach. Zhang et al. (2013) pointed out that one of the

weaknesses of the venues-based approach is that the neighbourhoods formed have to

be geographically tied to the network of venues used, whereas the grid-based

approach does not.
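
As a rough illustration of a proximity criterion in the venues-based approach, the sketch below links venues that lie within a fixed distance of each other. The distance threshold, the venue names and coordinates, and the function names are assumptions for illustration only; published methods may use a different criterion, such as a fixed number of nearest neighbours.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def proximity_edges(venues, max_dist_m):
    """Return pairs of venue ids that lie within max_dist_m of each other."""
    ids = sorted(venues)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if haversine_m(*venues[a], *venues[b]) <= max_dist_m:
                edges.append((a, b))
    return edges


# Hypothetical venues: id -> (latitude, longitude).
venues = {
    "cafe": (51.5074, -0.1278),
    "pub": (51.5080, -0.1270),
    "museum": (51.5194, -0.1270),
}

# Assumed threshold of 200 metres.
edges = proximity_edges(venues, max_dist_m=200)
print(edges)
```

Only venue pairs that pass the proximity test are candidates for being grouped together, so the threshold plays a role comparable to the grid cell size in the grid-based approach.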

Clustering methods

Clustering methods used in neighbourhood detection are a reflection of the breadth

and variety of clustering methods used in other fields. This dissertation does not

seek to determine which clustering methods are the best methods for

neighbourhood detection using location based social media, since there is a certain

degree of subjectivity. So far, neighbourhood detection methods have included

clustering methods such as K-Means clustering (Del Bimbo et al., 2014), spectral

clustering (Cranshaw et al., 2012; Noulas et al., 2011), and topic-based modelling

(Cranshaw and Yano, 2010). Each clustering method requires the researcher

to choose its parameters. Examples are the number of topics to use for topic-based

modelling and the number of clusters in K-Means clustering.


Similarity measures

A variety of similarity measures have been used in neighbourhood detection. In

terms of properties to include in the similarity measure, researchers have used

properties related to users, such as the users' check-in patterns and interests (Del

Bimbo et al., 2014). Researchers have also used properties related to venues in the

databases of location based social media platforms, such as the distribution of

Foursquare venue categories nearby and the number of check-ins at these venues

(Noulas et al., 2011). Other researchers have combined the above mentioned

properties with temporal properties to provide a contextually richer set of properties

to calculate similarity (Falher et al., 2015; Zhang et al., 2013). Different properties

characterise neighbourhoods in different ways, making them useful for different

purposes. Amongst the three dimensions of neighbourhoods mentioned earlier

(social ties, physical demarcations and residents' experiences), methods in

neighbourhood detection using location based social media have typically used

properties related to residents' experiences, for example the number of check-ins,

the temporal pattern of check-ins, and the type and number of venues in the area.

Cosine similarity measures similarity as the cosine of the angle between two vectors (Xia et al.,

2015). In neighbourhood detection methods, these vectors represent the properties of

the grid and of the venues in the grid-based method and the venues-based method

respectively. Cosine similarity is often used for clustering in neighbourhood

detection with location based social media, and often preferred over other similarity

measures because cosine similarity does not take the magnitude of the vectors into

account. This is useful in cases where the magnitudes of the vectors differ greatly

but are less important for determining similarity. For example, cosine

similarity is often used in information retrieval to determine document similarity as


the relative frequencies of words in each document and across documents are more

important than the total number of words in a document (Huang, 2008). Similarly,

the magnitudes of the vectors used in neighbourhood detection differ greatly. The most

popular venues often garner many more check-ins than those less popular and the

most active users check-in much more frequently than those who are less active

(Scellato and Mascolo, 2011). As such, researchers have found that relative

frequencies between venues/grid squares are more useful for neighbourhood

detection than absolute numbers, and prefer cosine similarity measures over

Euclidean distance measures when measuring similarity for neighbourhood

detection (Cranshaw et al., 2012; Preoţiuc-Pietro et al., 2013).
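
The contrast between the two measures can be illustrated with a minimal sketch. The check-in counts below are hypothetical and the function names are my own; the sketch only shows that cosine similarity ignores overall volume whereas Euclidean distance does not.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical check-in counts by three users at three venues.
quiet_cafe = [1, 2, 0]
busy_station = [10, 20, 0]  # same visitor mix, ten times the volume
other_venue = [0, 1, 2]     # different visitor mix, volume similar to the cafe

print(cosine_similarity(quiet_cafe, busy_station))
print(cosine_similarity(quiet_cafe, other_venue))
print(euclidean_distance(quiet_cafe, busy_station))
print(euclidean_distance(quiet_cafe, other_venue))
```

In this toy case cosine similarity treats the cafe and the station as near-identical because the same users visit both in the same proportions, while Euclidean distance places the cafe closer to the unrelated venue simply because their check-in volumes are similar.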

Researchers use different combinations of the three elements (unit used for

aggregation, clustering method, similarity measure) of neighbourhood detection to

create neighbourhoods, depending on their research purpose. Within each element,

researchers have also had to make decisions that influence the eventual

neighbourhoods formed. Most of the research so far seeks to compare urban

neighbourhoods within and across cities so that recommendation engines can make

better recommendations based on criteria such as the user's check-in patterns, the

user's preferred venue categories and the user's interests. The goal is to suggest

new places that the user may wish to visit, which are similar to places the user has

visited in the past.

A typical example of a neighbourhood detection method for recommendation

engines comes from Noulas et al (2011). They take a grid-based approach and use a

spectral clustering algorithm to cluster grid squares based on the distribution of

Foursquare venue categories nearby and the number of check-ins at these venues.

The method creates neighbourhoods that give us an idea of what type of places are



in an area, and a measure of their importance based on users' check-in activity.

Another example is Del Bimbo et al (2014)'s LiveCities method, which performed

K-means clustering using data on Facebook check-ins and user interests and

Foursquare venue categories.

An early attempt at neighbourhood detection was the Livehoods algorithm

(Cranshaw et al., 2012), which took the venues-based approach and used spectral

clustering to cluster Foursquare venues in Pittsburgh in the United States based on

spatial and social proximity. Through interviews with local residents, Cranshaw et al

(2012) found that neighbourhood detection methods could generate clusters /

neighbourhoods that reflect the character of life in cities. More recent attempts

have combined more information and experimented with different elements. For

example, Zhang et al (2013)'s Hoodsquare method takes a grid-based approach and

assesses the similarity of a grid cell with its neighbouring grid cells based on (a) the

distribution of Foursquare venue categories in the vicinity; (b) whether these venues

were frequented by tourists or locals; and (c) the busiest time of the day in terms of

check-ins at these venues. Neighbourhoods were then formed by finding groups of

grid cells that had high relative homogeneity. Zhang et al (2013) point out that using

multiple types of information may better represent the multifaceted nature of

neighbourhoods, and that grid-based methods may be more suitable for identifying

neighbourhoods as the boundaries formed using grid-based methods are not bound

to a particular set of venues.

The most recent attempt at neighbourhood detection using location based social

media describes neighbourhoods in terms of the activity they host (Falher et al.,

2015). Falher et al consider two neighbourhoods to be similar if they contain the same

kind of Foursquare venues in the same proportion. In addition to basing the



similarity of these venues on the number of check-ins and unique users as well as

the temporal distribution of the check-ins, they also take into account the

distribution of Foursquare venues in the surrounding area.

Cranshaw and Yano (2010) provided a different perspective by treating the question

as an issue of latent topic discovery. They divided the city into grids and applied

topic based modeling to the grids, using each grid as a document and each

Foursquare category tag as a word. With this method, they were able to identify

clusters of places and activities that often appeared together (e.g. beach and seafood).

While research on neighbourhood detection using location based social media has

flourished, there is less research available on understanding whether these methods

accurately reflect neighbourhoods in reality, and how they can contribute to

purposes other than recommending new places that users may wish to visit.

Researchers using the Livehoods algorithm attempted to validate the

neighbourhoods generated through their algorithm (Cranshaw et al., 2012). The

neighbourhoods identified by Cranshaw et al's algorithm included neighbourhoods

that corresponded with municipal boundaries, those that were subsets of municipal

boundaries and those that spilled over to more than one municipal boundary.

Cranshaw et al interviewed 27 residents who lived in the city and found that the

neighbourhoods generated by their Livehoods method closely matched the residents'

perspectives of neighbourhoods in the city. Cranshaw et al's research provides

evidence that the boundaries generated by neighbourhood detection algorithms can

capture local dynamics that include factors such as municipal boundaries,

demographics, traffic flow and economic development.



Some researchers have argued that including more properties in the similarity

measures would better characterise the units being aggregated and produce clusters

that more closely match actual neighbourhoods. For example, Del Bimbo et al (2014)

use both static features (e.g. categories assigned by location based social networks)

and dynamic features (e.g. distribution of the interests of the people who check in at

venues) in their LiveCities method to create neighbourhoods for Florence, which

they then validated qualitatively through online questionnaires with 28 residents.

They found that including both types of features produces neighbourhoods that better

reflect the residents' perceptions.

There is much that we do not know about the neighbourhood

detection process with location based social media. For example, we do not know

how the neighbourhoods detected compare with traditional administrative

boundaries, and how we can combine the neighbourhoods detected with data from

these administrative boundaries to help us better understand cities dynamically. We

also do not know how the neighbourhoods detected may change when data over

different time periods or different time intervals are used and what these changes

may mean.

Better understanding can come in the form of research on particular elements in the

neighbourhood detection process across a variety of methods and comparing the

differences when different elements are used. It can also come in the form of better

understanding a particular method in depth and exploring how the neighbourhoods

formed are different depending on the parameters used. In this dissertation, I look at

the Livehoods method in depth by applying variations of the method on data

collected on Greater London. The Livehoods method was chosen as it is a venues-

based approach which has not been used as much in the literature. It is also one of



the rare methods in the neighbourhood detection literature that has validated the

clusters / neighbourhoods generated with the city's residents and found strong

support that the residents' perceptions agreed with the clusters formed. This gives it

legitimacy in being able to detect actual neighbourhoods compared to other

neighbourhood detection methods. In addition, it has not yet been applied to the

Greater London area.



3. METHODOLOGY

Python was used for most of the analysis and visualization in this work. IPython

notebooks were used for early exploration and experimentation with the data and

Python scripts were written in the later stages to run the neighbourhood detection

method. All scripts used for this work can be found in the appendix section.

3.1. Data sources

The data used for analysis consists of 42,581 Foursquare check-ins at 8,845 venues

by 12,397 unique users in the Greater London area from 6th April 2011 to 31st May

2011. This data was kindly provided by Anastasios Noulas from the University of

Cambridge. For each check-in, the data consists of the user ID, the time, the latitude

and longitude, and the venue ID. Further information on the venues was collected

using the Python package foursquare. This included information on each venue's

name, category and subcategory (as categorized by the social media network

Foursquare).

Data was also collected from 6th April 2015 to 31st May 2015 for three cities:

London, Singapore and New York City. The Python package tweepy was used to

collect data from Twitter's streaming API, which offers samples of the data being

posted on Twitter in real time. A subset of this data consists of Foursquare checkins

from users who have linked their Foursquare accounts to their Twitter accounts such

that their Foursquare checkins also appear as tweets on Twitter. The scripts for

collecting this data and formatting them for analysis are also included in the

appendix. While this data was eventually not used in the analysis for this work,

future work could compare the results generated across the three different cities, or

the results generated from 2 different time periods in London.



3.2. Data sorting, import, storage and analysis

The data was formatted using the Python package pandas, which was developed to

mimic the R software's capabilities in managing large tables of data quickly and

easily. To improve the speed of the analysis, much of the intermediate data required

was pre-generated and stored in various file formats such as JSON files, NumPy files

for matrices and pickle files created using the Python pickle package.

As each run of the method took one to two hours, an

Amazon cloud server was set up to run the multiple variations of the neighbourhood

detection method. This greatly sped up the process.

The results of the neighbourhood detection method were stored in pickle files. They

were subsequently converted to GeoJSON format and also stored in a MySQL

database using Python's sqlalchemy package for further analysis and visualization.

In parts of the process where GeoJSON files had to be manipulated, the Python

packages fiona and shapely were used to manage GeoJSON files and check for

relationships between geographic features, for example whether a particular venue

was within a particular boundary.
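
The venue-in-boundary check can be sketched without any geospatial dependency. In this work the check was done with shapely; the pure-Python ray-casting routine below is a stand-in written for illustration only, and the function name, boundary and coordinates are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring.

    `ring` is a list of (lon, lat) vertices. Points lying exactly on an
    edge are not handled specially in this sketch.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular boundary and two venue locations.
boundary = [(-0.15, 51.50), (-0.10, 51.50), (-0.10, 51.53), (-0.15, 51.53)]
print(point_in_polygon(-0.12, 51.51, boundary))  # True
print(point_in_polygon(-0.05, 51.51, boundary))  # False
```

The sketch treats longitude and latitude as planar coordinates, which is a reasonable approximation only for boundaries as small as a neighbourhood.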

Many of the visualizations in this work were created using Python's matplotlib and

seaborn packages. Figure 8 was created using the software R and its ggplot library.

3.3. The Livehoods method

The Livehoods method is Cranshaw et al's (2012) method for neighbourhood

detection using location based social media. It is a venues-based approach that

performs spectral clustering on an affinity matrix that takes both spatial affinity and

social affinity into consideration. This method sought to fit the intuitive notion that



neighbourhoods are areas that a similar set of people frequent: the more often the

same people go to the same venues, the more likely these venues are in the same

neighbourhood. To validate this method, Cranshaw et al (2012) conducted

qualitative interviews with residents in their study area and verified that the

neighbourhoods generated by their method closely matched the residents'

perspectives of neighbourhoods in the city.

Specifically, I applied the following steps from Cranshaw et al (2012) to generate

the affinity matrices used in the spectral clustering algorithm:

1. Given the following sets:

a. Set $V$, a set of $n_v$ Foursquare venues, for which we can compute a

geographic distance $d(v, v')$ between the venues given their latitude and longitude coordinates.

b. Set $U$, a set of $n_u$ Foursquare users.

c. Set $C$, a set of check-ins of users in $U$ to the venues in $V$.

Each venue $v$ in $V$ is then represented by an $n_u$-dimensional vector $\mathbf{c}_v$,

where the $u$th component of $\mathbf{c}_v$ is the number of times user $u$ checked in

to $v$.

2. Compute the social similarity $s(i, j)$ between each pair of venues $i, j \in V$ by

comparing the vectors $\mathbf{c}_i$ and $\mathbf{c}_j$. Cosine similarity was used for this measure, where

$$s(i, j) = \frac{\mathbf{c}_i \cdot \mathbf{c}_j}{\lVert \mathbf{c}_i \rVert \, \lVert \mathbf{c}_j \rVert}$$

3. Compute an $n_v$ by $n_v$ affinity matrix $A = (a_{i,j})$ on the venues. For a given venue $v$, let

$N_m(v)$ be the $m$ closest venues to $v$ according to $d(v, \cdot)$, for some parameter $m$. Then we let

$$a_{i,j} = \begin{cases} s(i, j) + \alpha & \text{if } j \in N_m(i) \text{ or } i \in N_m(j) \\ 0 & \text{otherwise} \end{cases}$$

where $\alpha$ is a small constant that prevents any degenerate matrices from

forming. In Cranshaw et al (2012)'s work, a value of $1 \times 10^{-2}$ was used for $\alpha$.
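
Steps 1 to 3 above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions and not the scripts used in this work: the function names and toy data are hypothetical, the haversine formula is assumed for the geographic distance, a venue is excluded from its own neighbour set, and the diagonal of the matrix is left at zero.

```python
import math
from collections import Counter

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points (assumed metric)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def social_similarity(c_i, c_j):
    """Cosine similarity between two sparse user-count vectors (Counters)."""
    dot = sum(n * c_j[u] for u, n in c_i.items())
    norm = (math.sqrt(sum(n * n for n in c_i.values()))
            * math.sqrt(sum(n * n for n in c_j.values())))
    return dot / norm if norm else 0.0

def affinity_matrix(coords, checkins, m=10, alpha=0.01):
    """coords: {venue: (lat, lon)}; checkins: iterable of (user, venue) pairs."""
    venues = sorted(coords)
    vectors = {v: Counter() for v in venues}
    for user, venue in checkins:
        vectors[venue][user] += 1  # step 1: per-venue check-in vector

    # N_m(v): the m geographically closest venues to v, excluding v itself.
    nearest = {
        v: set(sorted((w for w in venues if w != v),
                      key=lambda w: haversine_km(coords[v], coords[w]))[:m])
        for v in venues
    }

    n = len(venues)
    A = [[0.0] * n for _ in range(n)]
    for i, vi in enumerate(venues):
        for j, vj in enumerate(venues):
            if i != j and (vj in nearest[vi] or vi in nearest[vj]):
                # steps 2 and 3: cosine similarity plus the constant alpha
                A[i][j] = social_similarity(vectors[vi], vectors[vj]) + alpha
    return venues, A

# Toy data: three venues close together and one far away.
coords = {"a": (51.500, -0.120), "b": (51.501, -0.121),
          "c": (51.502, -0.119), "d": (51.600, -0.300)}
checkins = [("u1", "a"), ("u1", "b"), ("u2", "b"), ("u2", "c"), ("u3", "d")]
venues, A = affinity_matrix(coords, checkins, m=1, alpha=0.01)
for name, row in zip(venues, A):
    print(name, [round(x, 3) for x in row])
```

Because an entry is non-zero when either venue is among the other's nearest neighbours, the resulting matrix is symmetric. Pairs that are linked geographically but share no users still receive the small weight alpha.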

The affinity matrices were generated using the Python packages NumPy (Van Der

Walt et al., 2011) and SciPy (Jones et al., 2001), and spectral clustering was

performed on the affinity matrices using the Python package scikit-learn (Pedregosa

et al., 2011). To determine the number of clusters that the algorithm should create, I

used the common eigengap heuristic (Noulas et al., 2011; Planck and

Luxburg, 2006). This involved calculating the k smallest eigenvalues of the

normalized Laplacian of the affinity matrix, and setting the number of clusters as the

number where the largest difference in eigenvalues occurred.
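
The selection step of the eigengap heuristic can be sketched as follows. The sketch assumes the eigenvalues of the normalized Laplacian have already been computed elsewhere; the function name and the eigenvalues below are made up for illustration.

```python
def eigengap_cluster_count(eigenvalues):
    """Return the number of clusters suggested by the eigengap heuristic.

    `eigenvalues` holds the k smallest eigenvalues of the normalized
    Laplacian. After sorting, if the largest gap lies between the i-th
    and (i+1)-th values (counting from 1), i clusters are suggested.
    """
    values = sorted(eigenvalues)
    if len(values) < 2:
        raise ValueError("need at least two eigenvalues to find a gap")
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return largest + 1

# Made-up eigenvalues: three values near zero, then a jump.
example = [0.00, 0.01, 0.03, 0.42, 0.47, 0.55]
print(eigengap_cluster_count(example))      # 3
print(eigengap_cluster_count(example[:3]))  # 2
```

The second call shows that the suggested number of clusters depends on how many eigenvalues are supplied: the largest gap can only be found among the values that were actually calculated.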

The question of determining parameters such as the number of clusters to form is an

important issue for clustering algorithms (Lancichinetti and Fortunato, 2009; Planck

and Luxburg, 2006; Zelnik-Manor and Perona, 2004). For some clustering

algorithms, researchers have found that maximizing modularity is a useful technique

to guide which values to use for various parameters (Lancichinetti and Fortunato,

2009), though they also recognize that this technique has its own limitations

(Fortunato and Barthélemy, 2007; Good et al., 2010; Lancichinetti and Fortunato,

2011). For spectral clustering algorithms such as the one used in the Livehoods



method, the eigengap heuristic was developed in particular to maximize modularity

for the clusters generated (Donetti and Munoz, 2004).

Cranshaw et al (2012) included a post processing step after spectral clustering to

break up any cluster that spanned too large a geographic area (more than 40% of the

geographic area in their work on Pittsburgh), and redistributed the venues in those

clusters to the nearest cluster instead. In my work, the spectral clustering algorithm

typically produced one cluster that spans a large part of the city. This seems to be a

qualitatively different type of cluster where its boundaries are a reflection of what

the users of the social media platform regard as the boundaries of their city, rather

than any particular neighbourhood. As there was no theoretical reason to redistribute

the venues in this large cluster and as a result expand the boundaries of the other

clusters, I chose not to break up the large cluster.



4. ANALYZING THE LIVEHOODS METHOD

As described above, there are a number of parameters in the Livehoods method

(Cranshaw et al., 2012) that can be tuned to generate the neighbourhood boundaries:

the number of smallest eigenvalues to calculate (k), the number of nearest

neighbours (m), and the alpha constant (α). Cranshaw et al's (2012) values for these

parameters for the Pittsburgh metropolitan area were 45, 10 and 0.01 respectively.

Cranshaw et al (2012) acknowledged that tuning the clusters is non-trivial and may

lead to experimenter bias. As such, it is worth exploring how tuning the parameters

affects the resulting neighbourhoods formed to better understand the Livehoods

method.

4.1. Tuning the number of smallest eigenvalues (k)

In general, as the value for k increased, the total number of clusters formed

increased as well. Figure 1 illustrates the relationship between k and the total

number of clusters formed using the eigengap heuristic, for values of k from 0 to

200 and Cranshaw et al's (2012) values of 0.01 for the alpha constant and 10 for the

number of nearest neighbours. The number of clusters formed increases at certain

threshold values of k, and remains constant until the next threshold is reached. The

threshold values for k in this case are 7, 9, 13, 25, 43, 74 and 101 with the

corresponding values for the number of clusters formed being 5, 7, 11, 23, 41, 72

and 99.

Figure 2 shows the boundaries of the clusters that are formed when the 7 different

values are used in the Livehoods method, with m = 10 and α = 0.01. As the number

of clusters created increases, the larger clusters tend to break up into smaller and

smaller clusters. The areas near the centre of the city tend to be broken up first, and

continue to be broken up into smaller clusters as the number of clusters increases.



The clusters nearer to the edges of the city tend to remain large and unbroken.

Generally, the clusters formed nearer the edge of the city are larger than the clusters

formed nearer the centre of the city. This phenomenon is likely because the density

of venues further from the centre of the city is much lower than the density of

venues nearer the centre of the city. Since the Livehoods method uses a nearest

neighbours criterion for identifying adjacent venues, the search for adjacent venues

covers a larger area where venues are less dense, which results in the

method creating boundaries with larger areas. Many of the clusters formed when

there are a higher number of clusters are either subsets of the clusters formed using a

lower number of clusters, or very similar to the clusters formed using a lower

number of clusters. The clear exception occurs where k = 74 and 72 clusters are

formed: a previously undetected large cluster appears. This is the qualitatively

different cluster mentioned earlier.

Donetti and Munoz (2004) have pointed out that the weakest part of the eigengap

heuristic is that we do not know how many eigenvalues (k in the Livehoods method)

should be calculated a priori. While Cranshaw et al (2012) also have not provided any

guidelines on how to choose the right value of k for cities of different sizes, cities

occupying a larger area could potentially contain more neighbourhoods,

so larger values of k should be used. As the Greater London area is much larger

than Pittsburgh, k should be larger than 45. A k value of 100 was arbitrarily chosen

in this work to test the effects of tuning the nearest neighbour parameter and the

alpha constant, to reflect the possibility of a higher number of neighbourhoods in

London. An even higher value may be more suitable as London is many times larger

than Pittsburgh, but this value was used to keep computation requirements

manageable.


Figure 1: Relationship between number of smallest eigenvalues (k) found and number of clusters formed

Figure 2: Boundaries formed for different numbers of clusters. Panels: 5 clusters (k = 7); 7 clusters (k = 9); 11 clusters (k = 13); 23 clusters (k = 25)

4.2. Tuning the alpha constant (α)

To see if the alpha constant influenced the clusters formed using the Livehoods

method, clusters were formed with k = 100, m = 10 and α varying from 0.00 to 0.05.

In general, there was little difference in the clusters formed. Figure 3 depicts the

boundaries formed using the various alpha constants. Almost all clusters formed are

consistent or highly similar at the different alpha values. In certain rare instances,

some clusters are merged or subdivided into 2 clusters. This shows that varying the

alpha constant between 0.00 and 0.05 does not greatly influence the boundaries

formed. A clear exception occurs with the largest cluster in the shift from α = 0.00

to α = 0.01: it expands greatly to include many other parts of the Greater London

area. This boundary remains consistent as α increases. This behaviour again

highlights the qualitatively different nature of this cluster.

Figure 3: Boundaries formed for different alpha constants. Panels: α = 0.00; α = 0.01; α = 0.02; α = 0.03; α = 0.04; α = 0.05

4.3. Tuning the nearest neighbours parameter (m)

To see if the nearest neighbours parameter influenced the clusters formed using the

Livehoods method, clusters were formed with k = 100, α = 0.01, and m varying

from 5 to 20. Figure 4 depicts the boundaries formed for some of the values used.

When m = 5, the boundaries formed overlap many of the other boundaries. As m

increases, the number of overlaps decreases and more stable clusters are formed. For

m = 8 to m = 20, the clusters formed are largely consistent with each other. Smaller

clusters with a high density of venues are more consistent than larger clusters with

a low density of venues. The largest cluster changes in shape and size at different

values of m. It is hard to determine the optimal number to use for m, but values of 8

and higher seem to generate reasonably consistent clusters.

Figure 4: Boundaries formed for different values of the nearest neighbours parameter (m). Panels: m = 5; m = 8; m = 10; m = 15; m = 18; m = 20

4.4. Using cosine similarity

It has been mentioned earlier that cosine similarity was preferred over other

similarity measures because cosine similarity does not take the magnitude of vectors

into account. In the case of forming neighbourhoods and determining venue

similarity, the relative frequency of the user checkins at each venue and across

venues matter more than the total number of user checkins at each venue. Similarity

measures that include magnitude such as Euclidean distance are thus less suitable

than the cosine similarity measure. Using Jaccard similarity, a variant of the cosine

similarity measure, produced results similar to the cosine similarity measure.



4.5. Nearest neighbours versus full similarity graph

The k-nearest neighbours similarity graph was chosen for constructing the affinity

matrix instead of the full similarity graph as the k-nearest neighbours graph better

captured check-in behaviour in neighbourhoods. While individuals have regular

mobility patterns and often return to a few highly frequented locations such as home,

school or work (González et al., 2008), this differs from their check-in behaviour on

location based social media networks: 60% to 80% of check-ins occur at places

that were not visited before by individual users (Noulas et al., 2012). Using the full

similarity graph meant that most of the similarity captured would relate to new

places that the users visited over the time period. This would create clusters of

venues that related to types of places that groups of users preferred to visit such as

museums, nightspots and stadiums, and generate boundaries that span most of the

city. These boundaries cannot be classified as neighbourhoods, given that they

overlap each other greatly and cover areas that are similar to each other.

The nearest neighbours graph, on the other hand, captures similarity relating to users

who visited sets of venues close to one another. The boundaries formed often have

clear separation from each other and there is very little overlap in terms of area

covered by the boundaries. These boundaries better fit the intuitive notion of

neighbourhoods in a city.

4.6. Summary

Through an investigation of the Livehoods method, I have found that using different

alpha values from 0.01 to 0.05 and nearest neighbours parameters of 8 and above generally

do not affect the results of the clusters formed. I have also found that using different

values for the number of smallest eigenvalues changes the resulting number of



clusters formed, with more clusters being formed when the number of eigenvalues

increases. The investigation also revealed that two types of clusters may be formed

by the method. One type of cluster is the contiguous geographic space that can be

associated with neighbourhoods, and another type of cluster seems to be large and

spans the entire city.

In the next two sections, I will use one of the sets of clusters / neighbourhoods

generated by the Livehoods method to illustrate the types of information that can be

derived from clusters formed using the Livehoods method, and neighbourhood

detection methods in general. In section 5, I combine the clusters formed with data

from Foursquare's venues database and use it to describe the types of venues and

activities that take place within each cluster. Incorporating information from location

based social media to better understand the clusters / neighbourhoods formed is

common for researchers using neighbourhood detection methods.

In section 6, I attempt to combine the clusters / neighbourhoods formed using the

Livehoods method with data from administrative boundaries (the Greater London

Lower Super Output Areas in this case) and determine the ethnic diversity of the

clusters / neighbourhoods formed. Integrating clusters / neighbourhoods detected

using neighbourhood detection with data from administrative boundaries is rare in

the neighbourhood detection literature as most researchers using neighbourhood

detection methods have used them for developing recommendation engines that find

similar places based on social media activity. My attempt tries to add more meaning

to the clusters formed so that they can be used for other purposes, such as

investigating ethnic diversity issues within neighbourhoods.



5. DESCRIPTION OF LIVEHOOD CLUSTERS / NEIGHBOURHOODS

5.1. Overview of neighbourhoods

For comparison, the Livehoods method was applied to the Foursquare data with k =

100, α = 0.01, and m = 10. For the Greater London area, 72 clusters were generated.

Their boundaries are depicted in Figure 5. The numbers on the clusters will be used

as a reference for labelling and describing the results below. As mentioned earlier,

the largest cluster formed (cluster 66 in this case) is not depicted in the figures as it

is a qualitatively different type of cluster, and not included when describing the

clustering results. The boundaries for this cluster can be found in the appendix.

Table 1 contains summary statistics related to each cluster. The area for each cluster

ranged from 0.11 square kilometers (cluster 48) to 203 square kilometers (cluster

18) with a median of 1.86 square kilometers per cluster. While tests (using Python's

powerlaw package) show no support for a power law distribution, the distribution is

highly skewed with many small clusters and a few huge clusters. The huge clusters

also tend to have low density in terms of checkins and venues, and as such they

could be an artefact of the nearest neighbours proximity criterion. In sparse areas,

the nearest neighbours tend to be further apart from each other than in dense areas,

thus venues far apart from each other are more likely to be linked and clustered

together.

Figures 6a to 6c depict properties of the clusters in terms of absolute numbers - the

number of venues in each cluster ranged from 16 (cluster 45) to 279 (cluster 38)

with a median of 129.0; the number of check-ins in each cluster ranged from 43

(cluster 45) to 5147 (cluster 2) with a median of 412; and the number of unique

users checking-in in each cluster ranged from 10 (cluster 45) to 2585 (cluster 2) with

a median of 230. Figures 6d to 6f depict properties of the clusters relative to the area


of the cluster and the number of venues in the cluster: the number of venues per

square kilometer ranged from 0.77 (cluster 18) to 1,304.61 (cluster 7) with a median

of 43.95; the number of checkins per venue ranged from 1.26 (cluster 65) to 40.09

(cluster 26) with a median of 3.23; and the number of unique users per venue ranged

from 0.55 (cluster 67) to 19.52 (cluster 16) with a median of 1.89.
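
The relative measures reported here are simple ratios of the absolute counts. The sketch below recomputes them for cluster 0 using the values listed in Table 1; the function name is hypothetical.

```python
def cluster_ratios(area_sq_km, checkins, users, venues):
    """Derive the density and per-venue measures for one cluster."""
    return {
        "checkins_per_sq_km": checkins / area_sq_km,
        "users_per_sq_km": users / area_sq_km,
        "venues_per_sq_km": venues / area_sq_km,
        "checkins_per_venue": checkins / venues,
        "checkins_per_user": checkins / users,
        "users_per_venue": users / venues,
    }

# Cluster 0 in Table 1: area 0.69 sq km, 1002 check-ins, 641 users, 238 venues.
ratios = cluster_ratios(0.69, 1002, 641, 238)
print(round(ratios["checkins_per_venue"], 2))  # 4.21
print(round(ratios["checkins_per_user"], 2))   # 1.56
print(round(ratios["users_per_venue"], 2))     # 2.69
```

The per square kilometer values recomputed this way differ slightly from those in Table 1, presumably because the table was calculated from unrounded areas.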

Many of the distributions of cluster properties are highly skewed. Clusters 2, 13, 16

and 26 are particularly active clusters and are in the top 5 in terms of users and

checkins across all clusters, whether in absolute terms or on a per venue basis.

Collectively, the four clusters account for 29.5% of all checkins from 60% of unique

users despite containing only 5.7% of all venues across the city. This is

understandable for clusters 2 and 13 as they are in the city centre, and cluster 26 as it

is at Heathrow airport. Cluster 16 consists of Wembley stadium, and it is likely that

it had such high values for users and checkins during that period as it was the host

for the 2011 UEFA Champions League Final on 28th May 2011, which is within the

period of analysis. People attending this event are highly likely to check in on social

media as it is a rare and meaningful event for them. Under more normal

circumstances, cluster 16 likely would have values closer to the median.

Across all clusters, cluster 18 stands out with the largest area and relatively low

frequencies of users and venues over such a large area. It could be classified as an

outlier, but results for the cluster have been included for completeness. In addition,

all variations of the Livehoods method detect this cluster or a cluster similar to this

cluster. This is more likely an artefact of using the nearest neighbours proximity

criterion as discussed above.


Figure 5: Clustering results for London. Panels: Greater London area; City area

Table 1: Summary statistics for cluster results for London

Columns, in order: Cluster; Area (sq km); Number of check-ins; Number of users; Number of venues; Number of check-ins per sq km; Number of users per sq km; Number of venues per sq km; Number of check-ins per venue; Number of check-ins per user; Number of users per venue

0 0.69 1002 641 238 1447.35 925.9 343.78 4.21 1.56 2.69

1 0.89 469 321 165 527.2 360.84 185.48 2.84 1.46 1.95

2 1.25 5147 2585 161 4121.23 2069.82 128.91 31.97 1.99 16.06

3 26.83 356 178 180 13.27 6.63 6.71 1.98 2 0.99

4 2.95 851 450 163 288.6 152.61 55.28 5.22 1.89 2.76

5 0.75 462 230 102 616.58 306.95 136.13 4.53 2.01 2.25

6 2.19 1055 556 239 481.71 253.87 109.13 4.41 1.9 2.33

7 0.16 695 447 215 4217.23 2712.38 1304.61 3.23 1.55 2.08

8 0.82 754 493 195 924.93 604.76 239.21 3.87 1.53 2.53

9 1.77 610 325 241 344.83 183.72 136.24 2.53 1.88 1.35

10 1.5 806 409 253 536.37 272.18 168.36 3.19 1.97 1.62

11 0.6 967 622 231 1602.32 1030.65 382.77 4.19 1.55 2.69

12 1.09 294 163 120 270.77 150.12 110.52 2.45 1.8 1.36

13 2.73 2888 2032 202 1056.98 743.7 73.93 14.3 1.42 10.06

14 4.62 540 213 155 116.81 46.07 33.53 3.48 2.54 1.37

15 0.62 1357 578 108 2184.13 930.31 173.83 12.56 2.35 5.35

16 22.55 3508 1737 89 155.54 77.01 3.95 39.42 2.02 19.52

17 1.74 691 322 165 396.12 184.59 94.59 4.19 2.15 1.95

18 203.11 257 110 157 1.27 0.54 0.77 1.64 2.34 0.7

19 0.88 248 154 101 280.51 174.19 114.24 2.46 1.61 1.52

20 2.08 556 296 154 267.1 142.2 73.98 3.61 1.88 1.92

21 23.94 831 398 257 34.71 16.63 10.74 3.23 2.09 1.55

22 12.1 453 304 157 37.43 25.12 12.97 2.89 1.49 1.94



23 4.7 378 168 139 80.49 35.78 29.6 2.72 2.25 1.21

24 1.56 464 296 123 296.6 189.21 78.62 3.77 1.57 2.41

25 42.64 285 121 135 6.68 2.84 3.17 2.11 2.36 0.9

26 0.35 2165 975 54 6131.41 2761.26 152.93 40.09 2.22 18.06

27 0.41 348 235 163 844.05 569.97 395.34 2.13 1.48 1.44

28 0.31 167 117 48 543.27 380.61 156.15 3.48 1.43 2.44

29 1.24 827 384 54 668.99 310.63 43.68 15.31 2.15 7.11

30 1.71 1921 547 148 1126.03 320.63 86.75 12.98 3.51 3.7

31 0.75 160 124 31 214.22 166.02 41.5 5.16 1.29 4

32 136.96 432 340 131 3.15 2.48 0.96 3.3 1.27 2.6

33 25.62 405 224 141 15.81 8.74 5.5 2.87 1.81 1.59

34 0.21 637 394 188 3098.25 1916.34 914.4 3.39 1.62 2.1

35 0.15 181 94 38 1197.88 622.1 251.49 4.76 1.93 2.47

36 22.11 321 140 93 14.52 6.33 4.21 3.45 2.29 1.51

37 0.6 358 183 73 600.17 306.79 122.38 4.9 1.96 2.51

38 0.32 1169 740 279 3624.81 2294.57 865.12 4.19 1.58 2.65

39 1.4 1366 622 161 974.53 443.75 114.86 8.48 2.2 3.86

40 8.27 179 69 81 21.65 8.34 9.8 2.21 2.59 0.85

41 5.94 144 82 87 24.23 13.79 14.64 1.66 1.76 0.94

42 0.28 481 311 75 1702.65 1100.88 265.49 6.41 1.55 4.15

43 1.86 172 134 29 92.24 71.87 15.55 5.93 1.28 4.62

44 75.25 167 69 99 2.22 0.92 1.32 1.69 2.42 0.7

45 1.13 43 10 16 38.16 8.88 14.2 2.69 4.3 0.62

46 6.48 65 30 40 10.03 4.63 6.17 1.62 2.17 0.75

47 11.88 315 149 144 26.51 12.54 12.12 2.19 2.11 1.03



48 0.11 199 155 36 1761.06 1371.68 318.58 5.53 1.28 4.31

49 31.95 173 86 89 5.42 2.69 2.79 1.94 2.01 0.97

50 0.66 255 117 99 387.71 177.89 150.52 2.58 2.18 1.18

51 0.55 385 248 131 705.65 454.55 240.1 2.94 1.55 1.89

52 39.21 775 287 129 19.77 7.32 3.29 6.01 2.7 2.22

53 1.12 751 413 209 670.36 368.65 186.56 3.59 1.82 1.98

54 87.89 202 93 107 2.3 1.06 1.22 1.89 2.17 0.87

55 5.6 316 98 123 56.39 17.49 21.95 2.57 3.22 0.8

56 18.86 551 287 200 29.21 15.21 10.6 2.76 1.92 1.44

57 1.12 189 105 79 168.69 93.72 70.51 2.39 1.8 1.33

58 0.33 766 444 132 2296.85 1331.33 395.8 5.8 1.73 3.36

59 21.86 412 195 193 18.85 8.92 8.83 2.13 2.11 1.01

60 47.01 228 88 107 4.85 1.87 2.28 2.13 2.59 0.82

61 1.27 115 60 56 90.25 47.08 43.95 2.05 1.92 1.07

62 1.99 181 56 66 90.82 28.1 33.12 2.74 3.23 0.85

63 9.31 47 20 28 5.05 2.15 3.01 1.68 2.35 0.71

64 8.39 1325 681 261 157.85 81.13 31.09 5.08 1.95 2.61

65 10.86 54 31 43 4.97 2.86 3.96 1.26 1.74 0.72

67 33.75 99 28 51 2.93 0.83 1.51 1.94 3.54 0.55

68 14.95 103 44 38 6.89 2.94 2.54 2.71 2.34 1.16

69 4.78 113 76 73 23.62 15.89 15.26 1.55 1.49 1.04

70 0.5 699 367 115 1388.01 728.75 228.36 6.08 1.9 3.19

71 34.32 532 323 221 15.5 9.41 6.44 2.41 1.65 1.46



Figure 6: Properties of Livehood clusters
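The derived columns in Figure 6 appear to be simple ratios of the four base counts (area, checkins, users, venues). The following is a minimal Python sketch under that assumption; the function name `cluster_properties` and the dictionary keys are illustrative and are not taken from the original analysis.

```python
def cluster_properties(area_sqkm, checkins, users, venues):
    """Derive the density and ratio columns from a cluster's base counts."""
    return {
        "checkins_per_sqkm": round(checkins / area_sqkm, 2),
        "users_per_sqkm": round(users / area_sqkm, 2),
        "venues_per_sqkm": round(venues / area_sqkm, 2),
        "checkins_per_venue": round(checkins / venues, 2),
        "checkins_per_user": round(checkins / users, 2),
        "users_per_venue": round(users / venues, 2),
    }


# Cluster 3 from Figure 6: area 26.83 sq km, 356 checkins, 178 users, 180 venues
props = cluster_properties(26.83, 356, 178, 180)
print(props)
```

For cluster 3 this reproduces the tabulated values. For some small clusters the recomputed densities differ slightly from the tabulated ones, presumably because the areas shown in the table are rounded to two decimal places.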



5.2. Breakdown of individual neighbourhoods

The venues within each cluster are venues listed on the location based social network Foursquare. Foursquare categorizes its venues in a category hierarchy with three levels. The 10 main categories at the top of the hierarchy are: Arts & Entertainment, College & University, Event, Food, Nightlife Spot, Outdoors & Recreation, Professional & Other Places, Residence, Shop & Service, and Travel & Transport. Each of these 10 main categories has its own subcategories, which can themselves be further subcategorized. There are more than 200 subcategories and sub-subcategories altogether. As places may be referred to at different levels of granularity, some venues may not have a sub-subcategory. For example, London Heathrow's Terminal 5 falls in the Travel & Transport main category, the airport subcategory, and the airport terminal sub-subcategory. London Heathrow Airport, on the other hand, falls in the same main category and subcategory, but does not have a sub-subcategory.
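The three-level hierarchy, with its optional lower levels, can be represented as a small record type. This is only an illustrative sketch: the class `VenueCategory` and its field names are assumptions made for this example and do not reflect Foursquare's actual API or data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VenueCategory:
    """A venue's position in a three-level category hierarchy."""
    main: str
    sub: Optional[str] = None       # may be absent
    sub_sub: Optional[str] = None   # may be absent even when sub is present

    def depth(self) -> int:
        """Number of hierarchy levels that are filled in (1 to 3)."""
        return 1 + (self.sub is not None) + (self.sub_sub is not None)


# The two examples from the text
terminal_5 = VenueCategory("Travel & Transport", "Airport", "Airport Terminal")
heathrow = VenueCategory("Travel & Transport", "Airport")

print(terminal_5.depth(), heathrow.depth())
```

Because both venues share the same main category, they contribute to the same bucket when profiles are built at the level of main categories.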

We can gain insight into the makeup of the city by creating city profiles using information on the category of each venue and the behaviour of the users of location based social media networks. To calculate the distribution of venues / checkins by category for the city, the value for each category (A) was calculated as:

$$A = \frac{\text{No. of venues (or checkins) in the category across the city}}{\text{Total no. of venues (or checkins) in the city}} \times 100$$
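As a rough illustration of the city profile calculation, the sketch below computes the percentage share of each category from a list of category labels, one label per venue (or per checkin). The function name `category_distribution` and the toy data are assumptions for this example, not part of the original analysis.

```python
from collections import Counter


def category_distribution(categories):
    """Return {category: percentage share} for a list of category labels."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}  # no venues or checkins, so no profile can be formed
    return {cat: 100 * n / total for cat, n in counts.items()}


# Toy data only: one main-category label per venue
city_venues = ["Food"] * 3 + ["Nightlife Spot"] * 2 + ["Travel & Transport"] * 5
city_profile = category_distribution(city_venues)
print(city_profile)
```

The same function can be applied to the venues or checkins of a single cluster to obtain that cluster's profile.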

Figure 7 shows the overall distribution of venues and checkins across all clusters according to Foursquare's main categories, in percentage values. 29.23% of venues in the data are in the food category, followed by 17.05% of venues in the nightlife spot category. Users, however, check in mostly at venues related to travel & transport (23.04%), professional & other places (18.86%), and arts & entertainment (15.68%). From here, we can observe that venues in the travel & transport, professional & other places, nightlife spot and arts & entertainment categories receive a disproportionate number of checkins. This means that clusters formed based on Foursquare checkins are likely to be biased towards venues in these categories, and may be more suitable for research questions related to such categories (e.g. transport, culture).

Figure 7: Overall distribution of venues and checkins across clusters (series shown: % of venues, % of checkins)

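Similar profiles can be created for each cluster to form neighbourhood profiles. To calculate the distribution of venues / checkins by category within a neighbourhood, the value for each category (B) was calculated as:

$$B = \frac{\text{No. of venues (or checkins) in the category within the neighbourhood}}{\text{Total no. of venues (or checkins) in the neighbourhood}} \times 100$$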

This gives a sense of the type of venues in the clusters and the type of activities that occur within them. These neighbourhood profiles were compared with the city profile to understand which categories within the neighbourhood were overrepresented / underrepresented. For each category, the percentage difference was calculated as:

$$\text{Percentage difference} = \frac{B - A}{A} \times 100$$
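The comparison between a neighbourhood profile and the city profile can be sketched as follows, assuming the percentage difference is measured relative to the city-wide share, i.e. (B - A) / A × 100. The function name `percentage_difference` and all numbers are illustrative. Categories absent from the neighbourhood are returned as `None`, mirroring the empty cells in Tables 2 and 3.

```python
def percentage_difference(neighbourhood_profile, city_profile):
    """Compare category shares (in %) of a neighbourhood against the city.

    Returns {category: percentage difference}, or None where the
    neighbourhood has no venues / checkins in that category.
    """
    result = {}
    for cat, city_share in city_profile.items():
        local_share = neighbourhood_profile.get(cat)
        if local_share is None:
            result[cat] = None  # corresponds to an empty table cell
        else:
            result[cat] = 100 * (local_share - city_share) / city_share
    return result


# Toy shares only (percent of venues), not values from the dissertation
city = {"Food": 30.0, "Travel & Transport": 10.0, "Residence": 5.0}
cluster = {"Food": 15.0, "Travel & Transport": 50.0}
diff = percentage_difference(cluster, city)
print(diff)
```

Under this definition a value cannot fall below -100, while strongly overrepresented categories can exceed +100, which matches the range of figures seen in the tables.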

Tables 2 and 3 contain the percentage difference figures for all clusters for venues and checkins respectively, with the highest positive difference for each cluster highlighted. The percentage differences for each category were used to determine which types of venues occurred more frequently, and which types of venues users checked in at more frequently, within the cluster. For example, clusters 28 and 29 have more venues and checkins in the travel & transport category, as these clusters are essentially the London Heathrow airport terminals, which we expect to have a higher concentration of venues and checkins related to travel and transport. Another example is clusters with a high concentration of venues and checkins in the college & university category. Clusters 27, 46 and 47 have percentage difference figures of over 1000% for users checking in, and they contain University College London, Brunel University London, and Queen Mary University of London respectively.

From Tables 2 and 3, we again observe differences between checkin behaviour and types of venues. For many clusters, the most overrepresented category in terms of venues is different from the most overrepresented category in terms of checkins. Cluster 3, for example, would be characterised as a cluster in the outdoors & recreation category in terms of venues, and as a cluster in the residence category in terms of checkins.
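Characterising a cluster by its most overrepresented category amounts to taking the maximum over the percentage differences, once for venues and once for checkins. A minimal sketch follows; the helper `dominant_category` is hypothetical and the values are illustrative, not taken from Tables 2 and 3.

```python
def dominant_category(differences):
    """Return the category with the highest percentage difference.

    Categories whose value is None (empty table cells) are ignored.
    """
    present = {cat: d for cat, d in differences.items() if d is not None}
    if not present:
        return None
    return max(present, key=present.get)


# Illustrative percentage differences for a single cluster
venue_diff = {"Food": -5.0, "Outdoors & Recreation": 130.0,
              "Residence": 60.0, "Event": None}
checkin_diff = {"Food": 55.0, "Outdoors & Recreation": 45.0,
                "Residence": 500.0, "Event": None}

by_venues = dominant_category(venue_diff)
by_checkins = dominant_category(checkin_diff)
print(by_venues, by_checkins, by_venues != by_checkins)
```

In this toy case the cluster would be labelled differently depending on whether venues or checkins are used, which is the kind of mismatch described above.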



Table 2: Percentage difference between proportion of venues within cluster to proportion of venues within city in terms of Foursquare's main categories

Note: Empty cells indicate that the cluster did not contain venues in that category

Columns: Cluster; Arts & Entertainment; College & University; Food; Nightlife Spot; Outdoors & Recreation; Professional & Other Places; Residence; Shop & Service; Travel & Transport

0 -36.9 22.98 -12.42 -2.02 -69.76 99.3 -74.25 73.22 -75.35

1 56.72 -22.09 94.24 -56.19 56.2 -62.7 -25.86 -73.22

2 -30.23 -31.03 -20.18 102.84 193.33 -20.29 -75.62 -38.01

3 -38.26 11.54 -4.65 134.71 -28.4 58.7 -29.9 5.49

4 -58.42 -36.97 18.11 -40.54 -22.52 10.51 18.75 -57.63 75.24

5 67.78 9 34.81 -45.16 -30.51 -65.77 -37.2 63.82

6 -9.85 -79.5 9.09 13.45 31.04 47.02 -17.33 -35.31

7 -31.44 -77.73 60.26 56.88 -78.1 2.95 -52.95 -79.92

8 -54.73 -22.79 23.46 -9.36 -49.38 2.55 -83.84 48.28 -18.77

9 -55.9 180.75 2.22 -4.14 -90.14 133.34 -62.21 -42.23 -57.8

10 -48.85 55.06 -18.64 -4.91 -42.81 122.44 -75.65 -21.83 -15.52

11 49.53 154.99 27.43 9.59 -68.65 -8.55 10.19 -61.68

12 26.02 -4.49 25.29 92.24 -76.52 -54.33 -70.01 -8.28 -71.29

13 30.04 10.82 -22.51 9.05 13.9 -53.57 -43.21 40.73

14 -0.96 -13.19 21.4 -36.72 38.45 41.43 -13.5 -8.14

15 -20.1 5.53 -25.73 -13.13 -73.83 166.2

16 129.2 -39.23 -34.44 -14.57 1.53 45.47 33.45 56.65

17 21.43 24.18 54.36 -24.57 -41.32 -80.73 6.06 -44.67

18 -52.25 117.14 -29.47 -13.5 60.17 -25.01 263.68 18.16 3.35

19 -14.22 -51.24 31.58 -38.66 -4.08 -53.37 22.5 -25.08 61.23

20 -15.11 286.03 -8.37 -23.11 58.2 7.68 -19.18 -1.14 -12.97

21 13.19 -1.02 -10.97 31.98 143.38 -49.51 74.07 14.06 -46.44

22 60.08 -33.82 -18.15 70.65 95.26 -36.71 24.69 -36.45 -35.35

23 -34.8 -62.94 -2.77 -39.4 64.02 24.05 -6.9 -0.35 33.67

24 18.68 -10.05 31.48 -60.4 -55.77 -21.14 -71.75 141.86 -52.68

25 -49.49 -28.26 -22.95 88.27 9.84 92.36 61.76 9.32

26 -84.9 -74.66 -80.66 520.55

27 5.15 557.5 -5.91 -9.77 52.44 -43.69 -2.42 -68.56

28 -51.49 -42.13 -86.13 -57.63 413.88

29 -57.48 -64.32 -69.86 -27.37 383.09

30 -84.8 280.27 24.38 -69.56 -32 -6.33 -34.86 65.99 -6.48

31 291.8 78.17 6.84 34.47 -12.38 -43.2 11.9 -46.44

32 62.27 -59.01 -4.74 -48.43 20.96 -67.33 80.23 33.85 66.35

33 113.59 49.43 -27.19 -20.11 28.6 -52.36 205.03 -13.9 34.75

34 -52.34 21.84 5.64 -60.03 12.27 118.52 -79.64

35 85.21 40.38 15.75 23.61 -30.97 -55.25 164.5 -19.12 -57.8

36 -16.27 -36.54 -61.95 35.7 212.06 21.38 99.28 -14.69 -4.63

37 -35.32 -17.32 1.74 5.48 38.55 -85.88 143.13

38 171.65 -14.21 54.33 33.81 -83.13 -64.45 -89.22 -24.21 -79.37

39 -13.91 226.23 -21.75 10.8 -51.87 4 -38.53 -68.67 76.51

40 -44.94 12.62 -37 23.14 -80.04 253.85 20.23 41.13

41 3.16 17.28 -7.7 69.65 15.34 -81.31 120.97 57.67 -73.56

42 -39.18 38.28 -6.71 -13.03 -32 -44.9 -86.72 180.57

43 -21.64 -73.29 -77.59 -12.38 13.6 -65.78 328.44

44 -73.54 20.32 -36.87 -1.62 166.26 -23.29 -62.21 15.55 71.79

45 -56.6 -27.16 184.75 269.19 263.68

46 461.5 -36.87 5.95 107.09 11.88 76.33 7.84 -36.71

47 -6.69 112.17 -33.73 -2.14 143.45 7.09 99.89 8.67 -25.59

48 473 73.62 -45.37 42.38 -76.93 -72.2



49 4.48 -15.42 -17.82 -12.38 -14.8 198.41 59.69 -10.74

50 19.84 9 34.81 -38.3 -46.4 -13.13 -31.54 25.6 -18.09

51 36.96 -61.07 48.82 37.11 -61.71 -13.13 -25.24 -70.75

52 -25.23 27.5 -55.4 -62.58 -58.2 8.39 140.23 22.44 149.11

53 -58.63 -76.49 30.43 -17.18 -19.06 -47.53 94.2 -8.11

54 -75.45 11.62 -41.43 -1.71 37.23 -19.93 5.16 125.1 17.43

55 7.23 -2.48 -12.28 -7.99 43.88 -45.59 83.76 124.77 -56.03

56 19.84 -75.23 16.98 -22.1 -39.09 -48.67 117.82 66.52 -25.54

57 -68.66 -28.73 49.58 25.51 -64.95 -43.2 -10.48 36.88 -67.87

58 92.51 -63.52 -15.24 0.94 25.56 51.16 -31.27 -29.94 -17.77

59 -88.42 -47.36 -9.25 65.54 42.38 -66.44 98.37 1.1 2.85

60 -76.85 5.28 -21.08 -60.27 81.21 -24.48 429 81.98 -28.8

61 -53.7 -21.08 -7.3 -48.23 -16.09 32.25 21.32 89.88

62 -18.51 -7.35 4.17 74.81 173.36 -26.16 -46.62 -58.23

63 -7.39 -36.87 58.92 3.55 -32.87 142.65 -36.71

64 23.23 -62.64 -6.19 -10.71 65.34 10.16 -6.15 -46.19 34.75

65 -48.94 2.83 -33 73.74 156.72 -73.83 104.77

67 120.25 -81.23 25.99 146.27 -0.22 -21.37 20.23 50.53

68 764.33 40.38 -47.39 -11.71 38.06 -77.62 -11.83 -15.61

69 -68.66 81.63 -10.35 75.23 -77.28 79.04 -86.31 -14.31

70 -52.24 -17.67 -15.9 -6.06 -31.49 -81.66 215.81

71 18.68 -10.05 -17.4 13.15 54.82 -3.22 154.23 -17.94 -22.27



Table 3: Percentage difference between proportion of users within cluster checking in to proportion of users within city checking in, in terms of Foursquare's main categories

Note: Empty cells indicate that the cluster did not contain checkins at venues in that category

Columns: Cluster; Arts & Entertainment; College & University; Food; Nightlife Spot; Outdoors & Recreation; Professional & Other Places; Residence; Shop & Service; Travel & Transport

0 -79.6 -0.55 40.74 27.39 -68.98 -18.97 -85.59 301.5 -56.14

1 -76.69 102.86 443.13 -84.49 -9.12 -53.68 -62.41 -75.13

2 -96.62 -89.9 -89.18 -79.65 397.75 -94.49 -98.82 -93.59

3 -81.43 104.46 56.17 47.46 -35.65 512.27 -40.5 -20.42

4 -72.77 9.73 -4.76 -70.33 -73.65 -20.14 4.96 -88.73 154.67

5 126.08 -58.82 45.69 -29.67 -90.09 -83.59 -64.74 54.72

6 116.72 -82.9 21.01 42.39 -48.17 -15.14 50.33 -77.63

7 -81.33 -86.93 259.98 268.23 -43.21 -52.84 -28.4 -87.57

8 -78.01 -39.08 159.78 23.12 -55.41 -63.36 -90.29 176.79 -32.2

9 -85.31 425.78 109.76 105.26 -98.28 67.16 -40.13 -46.84 -47.13

10 -77.83 92.76 14.47 19.42 -85.73 52.78 -81.92 8.74 26.19

11 -13.69 445.6 72.18 60.66 -68.8 -57.56 207.14 -75.39

12 -48.3 71.85 178.87 242.42 -60.69 -44.18 201.37 -45.07 -77.12

13 -82.6 -61.13 -82.41 612.28 -83.99 -89.89 -94.21 -28.25

14 -38.17 13.87 51.86 -92 40.98 290.4 -50.08 13.97

15 -72.15 -35.39 -67.18 -73.94 -96.01 236.68

16 524.9 -91.76 -95.61 -99.12 -92.89 -81.57 -86.8 -93.28

17 107.66 10.02 96.5 -12.62 -78.23 -78.63 95.86 -68.76

18 -80.65 243.03 21.36 83.09 -41.15 -43.26 822.86 37.06 -12.31

19 -34.14 -61.93 244.84 8.38 -56.45 -81.68 142.79 -17.42 -14.89

20 -32.68 524.9 32.26 -3.85 83.51 -38.03 75.02 -42.16 -10.84

21 195.81 -44.45 12.16 78.69 -30.11 -76.61 192.27 -35.31 -77.81

22 200.33 117.03 8.52 100.78 -40.43 -73.89 90.3 -72.75 -70.07

23 -84.82 -24.29 61.91 -3.02 41.43 -2.85 20.7 -7.8 11.63

24 -28.72 41.17 65.53 -59.82 -88.47 -39.35 -67.85 397.29 -90.41

25 -89.82 2.18 15.6 -7.1 -20.61 331.64 62.27 28.77

26 -97.28 -94.73 -99.06 316.58

27 -84.32 1098.59 126.17 174.43 -12.23 107.7 -4.81 -73.98

28 -91.32 3.49 -91.78 -73.64 248.71

29 -75.58 -87.72 -96.76 -86.46 295.48

30 -82.04 31.14 -37.7 -92.67 108.91 -84.51 -77.6 505.79 -84.18

31 352.94 186.79 -24.24 14.29 -93.44 -93.1 220.05 -83.63

32 287.79 -56.49 -11.74 -65.94 -77.61 -84.3 125.45 -35.42 -45.16

33 95.82 70.33 -26.53 3.9 -16.51 -61.95 559.49 -22.22 -20.15

34 -92.35 174.14 129.52 -40.18 -36.21 167.03 -86.18

35 -79.11 -44.45 104.41 58.14 204.98 -83.29 342.83 -74.63 -28.66

36 71.27 -30.99 -67.45 32
