ANALYZING THE SPATIAL PROPAGATION OF INFORMATION IN TWITTER
By
SRETEN CVETOJEVIĆ
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Sreten Cvetojević
To my family, friends and colleagues
ACKNOWLEDGMENTS
I would like to express gratitude to my advisor Dr. Hochmair. His guidance
motivated me to overcome countless obstacles along my scientific journey.
Words can hardly express how grateful I am to my parents. Their sacrifices
and hard work helped me come this far and will forever inspire me to go beyond the
limits. Special thanks to my brother who always inspired me to work harder towards the
future and not to dwell on my previous accomplishments.
Thanks to my former and present lab mates Denis Zielstra, Francesco Tonini,
Majid Alivand, Levente Juhász, Adam Benjamin, Ahmed Ahmouda and my friends at
FLREC for their help, friendship and encouragement.
I would like to thank my committee members for their guidance, understanding
and help during the course of my Ph.D.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
CHAPTER
1 INTRODUCTION
    Objectives
    Dissertation Outline
2 POSITIONAL ACCURACY OF TWITTER AND INSTAGRAM IMAGES IN URBAN ENVIRONMENTS
    Study Background
    Study Setup
        Data Collection
            Geo-tagging in Twitter and Instagram
            Obtaining the photographer’s position
        Data Analysis
    Analysis Results
        R1: Twitter Image Positional Accuracy
        R2: Distance Between Photographer And Object
        R3: Distance Between Instagram Locations And Object Position
    Discussion And Future Work
3 ANALYZING THE SPREAD OF TWEETS IN RESPONSE TO PARIS ATTACKS
    Study Background
    Related Work
    Study Setup
        Twitter Information Sharing Methods Analyzed In The Study
        Data Access
    Analysis Of Tweet Popularity
        The Role Of Tweet Type And Content On Tweet Popularity
        The Effect Of The Profession On Tweet Popularity
    Analysis Of Information Spread
        Exploring Information Spread On World Maps
            Retweets
            Hashtags
            Kernel-density maps
        Spatiotemporal Regression For Global Spread Analysis
            Model formulation
            Data preparation
            Model estimation
    Discussion
4 MODELING INTERURBAN MENTIONING RELATIONSHIPS IN THE U.S. TWITTER NETWORK USING GEO-HASHTAGS
    Study Background
    Related Work
    Study Setup
    Analyzing the Network Structure of Mentions
        Graph Generation
        The Distance Between Mentioning Cities
        Node Degree
        Network Centrality Measures
        Reciprocity And Connectance
        Sentiment Analysis
    Homophily and Heterophily
        Data Preparation
            City characteristics (nodal covariates)
            Dissimilarity and similarity matrices
        Network Regression
    Discussion And Conclusions
5 CONCLUSIONS
LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
Table
2-1 Number of identified photographer positions and object locations (in parentheses)
2-2 Descriptive statistics of distances between photo upload and photo position in different geographic regions
3-1 Breakdown of geometry types in the analyzed dataset of tweets (wide Paris area, 13 Nov-27 Nov)
3-2 Confusion matrix for tweet content classification
3-3 Popularity of tweets for different tweet formats and content categories
3-4 Analysis of deviance for retweets
3-5 The interaction between tweet format and content category on retweets (P-value adjustment method: Holm)
3-6 Retweet statistics for tweets posted by journalists and non-journalists
3-7 Negative binomial regression for panel data (Europe is the default continent)
4-1 Cities with highest weighted indegree and outdegree (strength)
4-2 Pearson correlation between weighted centrality measures
4-3 City ranking based on closeness centrality, together with Kleinberg hub and authority scores
4-4 City mentions state subgraph indicators
4-5 Mean number of employees in given occupation per 1000 employees in any occupation across all analyzed cities, and the standard deviation of the mean. Categories in boldface highlight specific occupational categories whereas those in regular font show broad occupation categories
4-6 Arithmetic signs of estimated coefficients from Multivariate QAP regression on four models
LIST OF FIGURES
Figure
2-1 Analyzed areas
2-2 Twitter and Instagram photo positions in Vienna
2-3 Twitter and Instagram photo positions in Belgrade
2-4 Boxplots of distances in different geographic regions
2-5 Offset between the photographer and identified object
2-6 Spatial distribution of Instagram locations
3-1 Bounding box (this map extent) around Paris, which was used to select original tweets with images, hashtags, and keywords whose spread was analyzed
3-2 Tweet with photos
3-3 Power law fitting the distribution of retweets, separated by tweet format and content category
3-4 Interaction between tweet type and content category on the number of retweets
3-5 Retweets of tweets with pictures related to the Paris attacks
3-6 Geographic distribution of hashtags
3-7 Temporal distribution of hashtags
3-8 Kernel density maps for the first 9 hours of #prayforparis hashtag usage (tweet density is shown in thousand tweets per square km)
3-9 Distance-based clustering of Twitter places around Barcelona
4-1 Setup of world regions used for Twitter data download
4-2 Country place tag in geo-tagged tweets JSON file
4-3 Locations of originating cities of tweets (green polygons) and density of mentioned cities (bluish kernel density map)
4-4 Force directed layout for a sub-graph of cities that have more than 30 incoming mentions
4-5 Distribution of weighted and unweighted distances (in km) between U.S. cities
4-6 Power law fitting the distribution of the weighted indegree and weighted outdegree of the city mentions graph
4-7 A network of mentions between cities in Colorado (link width is proportionate to edge weights)
4-8 Word clouds of the words most used with some of the analyzed geo-hashtags
4-9 Mean sentiment value of tweets between pairs of cities plotted against distance (in 1000s of km) between pairs of cities
LIST OF ABBREVIATIONS
API Application Programming Interface
EXIF Exchangeable Image File Format
GIS Geographic Information System
HTML Hypertext Markup Language
JSON JavaScript Object Notation
KML Keyhole Markup Language
LDA Latent Dirichlet Allocation
NLTK Natural Language Toolkit
OSM OpenStreetMap
POI Point of Interest
QAP Quadratic Assignment Procedure
SMS Short Messaging Service
SQL Structured Query Language
VGI Volunteered Geographic Information
URL Uniform Resource Locator
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
ANALYZING THE SPATIAL PROPAGATION OF INFORMATION THROUGH TWITTER
By
Sreten Cvetojević
May 2018
Chair: Hartwig H. Hochmair
Co-Chair: Bon A. Dewitt
Major: Forest Resources and Conservation
This study explores and models spatiotemporal information propagation through
Twitter. It analyzes in detail the role of different content types of a tweet, such as
images, hashtags, or keywords, on information propagation, determines the effect of
sociodemographic characteristics of individuals on tweet popularity, and explores the
role of city attributes on the mentioning frequency between cities in the Twitter network.
Since this research is primarily concerned with the spatial aspect of information
propagation, an understanding of the data quality of spatial information associated with
a tweet is of high relevance for any subsequent analysis. For such an assessment,
several aspects related to spatial data quality in tweets are explored, such as available
geo-tagging options, including their associated positional errors and spatial resolution,
the positional accuracy of Twitter photos, social networking (e.g. retweeting) behavior,
technical limitations for Twitter data download, and data noise and spam affecting the
accurate modeling of spatial information spread. For part of this data quality analysis,
Twitter data is compared to other crowd-sourced data, such as Instagram photos,
to highlight the specifics of Twitter data and Twitter user behavior.
To demonstrate various approaches to observing, mapping, and modeling the
dynamic information spread through Twitter, a terrorist attack in Paris was chosen as a
showcase. Various exploratory methods and spatiotemporal regression models were
used to describe and formalize how the news of this event spread around the world,
analyzing the influence of tweet content, tweet format and type, user profession, and
geographic characteristics of places on the effectiveness and speed of information
spread. The identified factors allow adding spatial and spatiotemporal
components to current approaches of information propagation modeling. The analysis of
Twitter communication patterns was furthermore expanded to interurban mentioning
relationships through the exploration of tweet patterns between U.S. cities based on
geo-hashtags. This provides insight into the inherent structure of the Twitter social
network space, its hierarchies, and the spatial and non-spatial processes and factors
governing the mentioning relationships between cities.
CHAPTER 1
INTRODUCTION
This study uses spatiotemporal analysis to advance the understanding of user
behavior and spatial information propagation in Twitter. Twitter is a microblogging
service that allows users to post short text messages, called tweets, of up to 140
characters (280 since November 2017). Tweets can have images attached or contain
links to videos or other external sources. Twitter was founded in 2006 and initially
designed for tweets to be sent as SMS (Short Messaging Service) messages, which
explains the length limit of 140 characters. Twitter has 330 million monthly active
users, who send 500 million tweets every day. Eighty percent of active users are on
mobile phones or tablets, and over 67 million users live in the U.S. (Aslam, 2018).
Twitter provides a large volume of data to analyze human social behavior and
movement patterns. However, several limitations make the quality and usability of
Twitter data questionable. For example, Twitter users are not representative of the
whole population, since the service is used primarily by the younger generation.
Further, its use is concentrated in industrialized nations, leaving several blank spots on
the globe. Also, only 1-2% of tweets are geo-coded, rendering only a small portion of
tweets usable for geographical research (Mitchell et al., 2013). Twitter data is one
prominent example of crowd-sourced data that comes with a spatial component, which
is often referred to as Volunteered Geographic Information (VGI) (Goodchild, 2007).
Other examples of VGI are data from photo-sharing applications, such as Flickr, or from
crowd-sourced maps, such as OpenStreetMap. While VGI is free, it does not have
official quality standards, making its fitness for use for certain applications often
questionable (MacEachren et al., 2011).
Objectives
The overall goal of this dissertation is to explore and develop models of spatial
information propagation in Twitter through geotagged tweets with their different facets
relating to content and format, and to analyze and determine factors affecting such
information propagation in the Twitter network space. This will enhance our current
understanding of the spatial propagation patterns in Twitter and how Twitter users react
to real world events. This overall goal is accomplished through the following objectives:
Description of the geotagging accuracy of tweets;
Analysis of the positional accuracy of Twitter images and its comparison to the accuracy of images from other social networks;
Identification of factors influencing the popularity of tweets, including tweet format, user profession and thematic categories;
Exploration of the geographic spread of event-related tweets over time and the role of the language used;
Identification of factors contributing to the information spread around the world within a spatiotemporal regression model;
Exploration of underlying geographic and socio-demographic factors influencing the formation of the network of mutual city mentions using the Quadratic Assignment Procedure.
Dissertation Outline
In the first case study, certain aspects of Twitter images are explored and
compared to Instagram images. Both Twitter and Instagram provide means for the user
to annotate images with geographic location information to some extent. Using a
selection of images that are shared through these two platforms from various urban
areas around the world, this study compares the photographer’s position, which is
manually estimated from the scene shown in the image, with the annotated location
information of the image and the position of the object being photographed. This
approach provides the first insight into the Twitter user’s spatial movement between the
locations where the picture is taken and uploaded to Twitter. Furthermore, the distance
between the photographer position and the photographed object location in Twitter and
Instagram can be used as a proxy for the visual prominence of photographed urban
objects. Lastly, the collected dataset allows us to assess the positional accuracy of
location labels in Instagram through comparison of the label position to the true position
of the referenced object. For each of the different analyses the study discusses potential
sources leading to positional errors of images in Twitter and Instagram and provides a
comprehensive set of illustrative examples from different cities.
In the second case study, different tweet formats, including Twitter images, are
explored with regard to their effect on worldwide information propagation through
Twitter after the attacks that occurred in Paris in November 2015. Exploration of the
images posted by the Twitter users showed that two themes were predominantly used,
namely, events or their aftermath, and artistic support to the victims. This study also
found that journalists extensively used Twitter to share images of the events and that
their tweets received more attention than those of non-journalists. Endogenous
information spread is explored by mapping retweets, which represent sharing of
information from within the Twitter network only. Exogenous information spread (which
includes event information that may have been obtained from sources outside Twitter) is
modeled by observing the time and location of tweets with event-related hashtags.
Geographic and temporal aspects and a hierarchical structure of the spread pattern are
modeled using spatiotemporal regression analysis.
In the third case study, counts of geo-tagged tweets that mention selected U.S.
cities in their hashtags, combined with various measures of network connectivity, node
centrality, and city characteristics are used to examine the prominence of individual
cities in the Twitter landscape, and to identify factors that explain strong mutual
communication ties between cities. In addition, the joint use of the city’s name in a
hashtag along with other thematic hashtags posted in tweets allows extracting user
sentiments about a city, and the effect of geographic distance on mutual sentiments
between cities. This analysis contributes to the modeling of the relationships and ties
between cities in the social network space. It also offers a detailed interpretation of the
Quadratic Assignment Procedure that was used for modeling these relationships.
CHAPTER 2
POSITIONAL ACCURACY OF TWITTER AND INSTAGRAM IMAGES IN URBAN ENVIRONMENTS
Study Background
Driven by the rapid development in computer, sensor, and communication
technology, the past decade experienced a surge in new Web 2.0 and social media
applications that allow users to share spatial information over the World Wide Web and
mobile communication platforms. Two prominent examples of social networking/photo
sharing platforms are Twitter and Instagram. Twitter is an online microblogging service
that allows users to send and read short 140-character messages called tweets. The
nature of Twitter data has been analyzed in numerous respects, ranging from the
extraction of travel patterns (Hawelka et al., 2014), through estimates of the influence of
socio-economic factors on Twitter activity (L. Li, Goodchild, & Xu, 2013), to the
localness of tweets and other geotagged social media (Johnson, Sengupta, Schöning, &
Hecht, 2016). Twitter is also a rich source of images since users can share links to
media from other websites (e.g. YouTube, Instagram) or attach pictures to their posts
which are hosted on Twitter. The spatial aspect of Twitter image sharing has, however,
not been discussed in the research literature so far. Some studies did take on various
other topics of Twitter image analysis, though. For example, Thelwall et al. (2015)
conducted a content analysis of 800 images tweeted from the UK and the USA, finding
that most of the images were photographs, that about 9% of the images mainly
displayed text, and that about 15% of images were screen grabs of phones. The same
study estimated that about two thirds of the images were taken immediately before
being tweeted. Yanai and Kawano (2014) developed a classifier for grouping streamed
Twitter photo data into 100 kinds of food. Classification results are visualized in a
prevailing food map showing popular foods in different parts of Japan. The paper
also analyzed how the popularity of different dishes, such as “ramen noodle”, “curry”
and “okonomiyaki”, varies by season and region. The study presented in this paper
complements earlier research efforts by assessing the positional accuracy of Twitter
images at the urban level. For this purpose, the photographer’s position is estimated
from the scenery shown in the image through manual identification of the location by
human analysts. This is then compared to the coordinates of the associated geo-tagged
tweet and the photographed object itself. The method of manually estimating the
photographer’s position from image scenes for accuracy assessment of crowd-sourced
data has already been applied to data from other photo-sharing services, such as Flickr
and Panoramio (Zielstra & Hochmair, 2013). Automated methods to extract the
photographer’s position from image content have already been developed for regions
with high photo density where images sufficiently overlap, and for which a set of control
points with known coordinates is provided (Y. Li, Snavely, & Huttenlocher, 2010).
Instagram is a photo- and video-sharing platform which allows users to take
pictures and videos and to share them with their followers on the Instagram website, as
well as through a variety of social networking platforms such as Facebook, Twitter, and
Flickr. Users can also geo-tag their shared content. The content and spatial distribution
of Instagram images have been analyzed in several recent studies. For example,
Bakhshi, Shamma, and Gilbert (2014) found that Instagram photos with faces are 38%
more likely to receive likes and 32% more likely to receive comments than those
without. Hochman and Manovich (2013) compared the visual signatures of 13 different
global cities using 2.3 million Instagram photos from these cities and used spatio-temporal
visualizations of over 200,000 Instagram photos uploaded in Tel Aviv, Israel,
to demonstrate how they can offer social, cultural and political insights about people’s
activities in particular locations and time periods.
Although social media images provide valuable information about a place, the
research literature has so far barely touched upon the spatial accuracy aspect of
images shared through Twitter and Instagram. Therefore, this paper addresses the
following three related research objectives:
R1: Determine for Twitter images the distance between a photographer’s position
(derived from the image content) and the geo-tagged position from which the tweet has
been sent. This analysis provides information about a photographer’s movement that
occurs between taking a picture and sending the tweet with the picture.
R2: Determine for Twitter and Instagram images the distance between the
photographer’s position and the photographed object. The range of distances
associated with a photographed object gives insight into the visual prominence of the
object.
R3: Determine for Instagram images the distance between the photographed
object and the Instagram location associated with that photograph. This provides
information about the positional accuracy of location tags available in Instagram for
annotating images with positional information.
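Each of R1 to R3 reduces to a distance between two coordinate pairs. The following is a minimal sketch of such a computation, not the procedure used in this study: it assumes that positions are latitude/longitude pairs in degrees and uses the haversine formula on a spherical Earth. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def research_distances(photographer, tweet=None, obj=None, instagram_location=None):
    """Return those of the R1-R3 distances that the given positions allow."""
    result = {}
    if tweet is not None:
        result["R1"] = haversine_m(photographer, tweet)      # photographer to tweet position
    if obj is not None:
        result["R2"] = haversine_m(photographer, obj)        # photographer to object
    if obj is not None and instagram_location is not None:
        result["R3"] = haversine_m(obj, instagram_location)  # object to Instagram location
    return result


# Hypothetical coordinates, for illustration only.
distances = research_distances(
    photographer=(48.2085, 16.3730),
    tweet=(48.2100, 16.3700),
    obj=(48.2082, 16.3738),
)
print({name: round(d) for name, d in distances.items()})
```

For the short urban distances considered here, a projected coordinate system in a GIS would serve equally well; the spherical formula is chosen only because it needs no external library.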
Study Setup
Data Collection
This study is based on local knowledge of human analysts so that the
photographer’s position can be estimated from the content that is shown on Twitter and
Instagram images. The study was therefore conducted for geographic areas that
students participating in this study (as well as the authors) were familiar with. Since
urban environments with their multitude of unique objects, e.g. monuments, stadiums,
plazas, or churches, provide more visual clues to estimate a photographer’s position
than a rural landscape with fewer discernable objects, the study was primarily
conducted in urban areas. In addition to the photographer’s estimated position, research
objective R1 requires the geographic coordinates of the location from which the tweet
with an image was sent, and R3 requires the coordinates of the location tag which has
been associated with the image by an Instagram user.
Geo-tagging in Twitter and Instagram
The Twitter mobile application interface allows the user to opt for attaching exact
geographic coordinates as metadata along with the tweet. The geographic coordinates
are in this case obtained through the smartphone geolocation method, which can be
based on the built-in GPS receiver, nearby Wi-Fi networks or from the mobile network
through base station information. The accuracy of the latter method depends on the
mobile network infrastructure. As an alternative for geotagging tweets, the user can also
pick a place from a collection of nearby locations in the mobile application, where more
general geographic entities, such as country, province, or city appear on top of the list.
How general Twitter’s suggestions are depends on the geographic region. For example, for
photos from Belgrade, Serbia, the top-most suggested place tag was “Republic of
Serbia”, whereas for photos from Vienna, Austria, the suggested place tag was “Vienna,
Austria”. Since the spatial granularity of these places is too coarse for the research
tasks proposed in this study, only photos from tweets with geographic coordinates
(derived from the cell phone) were used.
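The restriction to tweets with exact coordinates can be sketched as a filter over tweet JSON objects. This is an illustration under stated assumptions rather than the code used in the study: it assumes the tweet JSON layout of Twitter's standard API at the time, in which an exact position appears as a GeoJSON point under `coordinates` (longitude first) and a user-selected place appears under `place`. The function name and the sample tweets are hypothetical.

```python
def exact_position(tweet):
    """Return (lat, lon) for a tweet with exact coordinates, otherwise None."""
    point = tweet.get("coordinates")
    if not point or point.get("type") != "Point":
        return None  # place-tagged only, or not geo-tagged at all
    lon, lat = point["coordinates"]  # GeoJSON order is [longitude, latitude]
    return (lat, lon)


# Hypothetical tweets, for illustration only.
tweets = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [16.3738, 48.2082]}, "place": None},
    {"id": 2, "coordinates": None, "place": {"full_name": "Vienna, Austria", "place_type": "city"}},
    {"id": 3, "coordinates": None, "place": None},
]
usable = {t["id"]: exact_position(t) for t in tweets if exact_position(t) is not None}
print(usable)
```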
A geo-tagged image on Instagram does not provide exact geographic
coordinates of the location from where the picture was taken, or from where it was sent
or uploaded, respectively. Instead, it provides the name of the location that has been
selected by the user from a pre-defined list of locations when uploading the image to
Instagram. If the photo to be uploaded to Instagram has geographic coordinates in its
Exif (Exchangeable image file format) image file metadata tags, the Instagram
application lists locations that are near the coordinates in the Exif metadata. Exif
tags contain coordinates if the smartphone geolocation was activated while the image
was taken. If the Exif tags do not contain geographic coordinates, the Instagram
application lists locations near the current upload location identified by the smartphone.
The link to an Instagram image can be tweeted from within the Instagram application as
well. If the image file that is to be shared via Instagram does not contain geographic
coordinates in its Exif metadata and the smartphone geolocation function is turned off,
the image cannot be geo-tagged. Until a recent change in the Instagram application
users were allowed to add custom places based on the Exif metadata coordinates or
the smartphone position to the list of already available location names nearby.
Therefore a single real world place, such as a city, state, or mountain, can have
different Instagram place labels assigned to it, with the same or different coordinates. It
is also possible that the same real-world feature is associated with several same
Instagram place labels, where these place labels vary in position. Adding custom place
labels in Instagram has been deactivated as of August 2015.
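The order in which the application chooses a reference position for its location suggestions, as described above, can be summarized in a short sketch. The function name and the coordinate representation are hypothetical and serve only as an illustration; they do not reflect Instagram's actual implementation.

```python
from typing import Optional, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def suggestion_anchor(exif_coords: Optional[Coordinate],
                      phone_coords: Optional[Coordinate]) -> Optional[Coordinate]:
    """Return the position around which nearby place labels would be listed.

    Hypothetical helper that mirrors the order described in the text:
    Exif coordinates first, then the current smartphone position, and
    None when neither is available (the image cannot be geo-tagged).
    """
    if exif_coords is not None:
        return exif_coords
    if phone_coords is not None:
        return phone_coords
    return None
```

The second branch is the case in which the suggested labels reflect the upload location rather than the place where the photo was taken.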
Obtaining the photographer’s position
To obtain the position of the photographer at the time when the picture was taken,
we relied on the local knowledge of 47 graduate students who took on this task as part
of a GIS graduate course at the University of Florida for partial course credit. For data
preparation each student was asked to provide us with the bounding box of two urban
areas they were familiar with, anywhere in the world. For these areas, three types of
photos were collected:
(1) Photos attached to tweets (hosted by Twitter): Links to jpg files are provided in tweet JSON files that can be harvested from the Twitter streaming API.
(2) Photos from Instagram shared in tweets (as a link to Instagram photos): A tweet contains the link to the Instagram Web site for that photo. The HTML code of that Web site was then parsed for the URL to the corresponding jpg file.
(3) Instagram photos: Original photos posted on Instagram containing metadata such as user and location information, links to photos, or captions.
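For photos shared as Instagram links, the parsing step could look roughly like the following sketch. It assumes that the photo page exposes the image URL in an `og:image` meta tag, which is a common Web convention but is not stated in the text; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class OgImageParser(HTMLParser):
    """Collects the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.image_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching meta tag is kept.
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def extract_image_url(html: str) -> Optional[str]:
    """Return the image URL found in the page, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.image_url
```

Pages without such a tag yield `None`, so photos whose image file cannot be located are easy to skip.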
Each photo used in the analysis contained at least one type of location
information in its metadata. Photos obtained through tweets had geographic coordinates
of the place the tweet was uploaded from. Instagram photos contained a user assigned
location tag. Instagram photos shared in tweets contained the location of the Instagram
location that users had chosen to annotate it with. For the conducted data analysis,
Instagram images that were either obtained from the Instagram API or sent as a link in a
tweet were analyzed as one dataset, since for both methods the only geo-tagged
information available for the image is the Instagram location assigned by the user.
For the data collection process, in order to obtain a sufficient number of suitable
photographs that students could analyze in their selected region, the specified polygon
area was increased if necessary. This was often necessary for photos attached to
tweets (source 1), which occur in about 7.5% of geotagged tweets with exact
geographic coordinates. A smaller percentage of geotagged tweets (2.4%) was found to
contain links to Instagram images (source 2). The highest photo density in a region was
generally obtained from the Instagram API with original Instagram photos (source 3).
Prior to handing out photos to students to identify the photographer’s position we
manually removed photos that contained profanities and vulgar content. In a Web
application that was set up for this study students could then browse through the
collected photos for their selected urban areas. The task of the assignment for students
was to indicate for each image (whenever this was possible) the estimated position of
the photographer based on the image content, through adding markers to a “Google My
Maps®” map, together with the photo ID. Students were asked to complete this step for
20 images from each data source. If this was not possible, they were asked to analyze
more images from any data source (whichever one worked) to reach a total of 60
images. The marker locations indicated by students were then extracted from the
shared “Google My Maps®” maps through a script and inserted into a PostgreSQL
database. The authors of this paper went through the same steps for selected areas in
Vienna, Salzburg, Budapest, Szeged, Ispra and Belgrade. For the next steps the photos
from only 23 students (out of the original 47 students) were further processed and
analyzed to reduce the time consuming process of data cleaning to a feasible amount.
That is, for quality assurance all of the photographer positions indicated by the 23
students were manually checked by the authors in a customized Web application that
showed the original photo content, the specified position in a map as a marker, and the
“Google Street View®” image for that position next to the map where available. The Web
application enabled us to either accept the photographer’s position indicated by the
student as is, to move the marker position, or to exclude a photo if it was obviously
placed at the wrong location and if we could not identify the correct photographer’s
position based on the satellite image view or “Google Street View®”. Based on these
data it was possible to measure the distance between the photographer’s position and
a) the geo-tagged position of the tweet containing the picture and b) the location
position associated with an Instagram photo. In addition to these efforts, the authors
placed markers at the location of photographed objects that could be well approximated
through a point location, such as a clearly discernible building. Objects that could not be
well approximated with a point on the map and where it was unclear which point the
photographer was focusing on (such as with bridges) were not considered for this task.
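The offsets between two positions given as latitude and longitude can be computed with a great-circle formula. The text does not state which distance computation was used, so the haversine function below is only an assumed, minimal sketch on a spherical Earth model; the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

At the scale of the offsets reported in this chapter (metres to tens of kilometres), the spherical approximation differs from an ellipsoidal computation by well under one percent.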
Table 2-1 summarizes the number of photographer positions obtained per
country and source that were retained for further analysis. Values in parentheses
indicate the number of object locations that were identified by the authors. Depending
on the research objective under consideration, different data columns are used from
Table 2-1, as will be described in the section about data analysis. Figure 2-1 plots the
photo locations from Table 2-1, and Figure 2-2 and Figure 2-3 provide a zoomed view of
available data sources for parts of Vienna and Belgrade.
Data Analysis
The analysis consists of three parts according to the three research objectives.
To quantify the movement of Twitter users between taking a photograph and uploading
it to the Twitter site (R1), the distance between these two positions is measured. To
assess regional differences, each data point was assigned to a geographic area, i.e.
North America (including the Caribbean), Europe and other. The dataset consists of 273
individual features from Twitter images.
To answer R2 which assesses the visual prominence of objects, a dataset
containing 325 Twitter and Instagram photos was used, for which both the position of
the photographer and the photographed objects could be identified. We hypothesize
that the type of object and the object's surroundings affect the visual prominence of the
object. Therefore, each photograph was assigned to one of the following categories: a)
photos of prominent buildings spatially separated from other buildings; b) photos taken
from a location that is separated from the photographed object by water, e.g. by a
fountain or river; and c) all other photos. The last group contained for example pictures
of local businesses in downtown areas or other points of interest, such as small
monuments or fountains.
For R3, which analyzes the Instagram location accuracy by measuring the
distance between the photographed object and the annotated Instagram location, the
dataset used contains 251 photos. This dataset is a subset of the dataset used to
answer R2, containing only photos originating from the Instagram platform.
Analysis Results
R1: Twitter Image Positional Accuracy
The Twitter dataset can be used to study the movement of a photographer
between taking a picture and uploading it to Twitter. The log-log plot reveals that more
than 60% of photos were uploaded within a 1 km radius of the original photo location.
On the other end of the range, 2% of total photos were uploaded more than 100 km
away from the place where they were taken.
Different user patterns could be observed for posting photos on Twitter.
Approximately 30% of the photos were posted within 50 m of the actual location. This
distance closely resembles the maximum error of smartphone positioning in urban
environments; therefore, these photos can be considered instant uploads. As
opposed to this, 10% of photos were posted from more than 10 km away from the
original location. This category contains for example vacation images or photos from
sports events held in different cities. Users in this category decided not to upload their
photos instantly. The spatial distribution of the intermediate distance category provides
some information about the locations from where social media users post their photos.
In some cases, when the offset is large, the upload position corresponds to possible
open Wi-Fi hotspots and hotels. This might be indicative of tourist Twitter activities, for
example, when tourists do not have a cell phone data plan abroad, and are therefore
unable to upload their photos instantly. Images are often uploaded from areas that
appear to be residential, but taken somewhere else, e.g. downtown areas.
Since distances in the three compared global regions are not normally
distributed, even after using a log transformation, a non-parametric test was applied to
test the effect of geographic region on median distance offsets. Data points were
categorized into North America/Caribbean (AME), Europe (EUR) and other (OTH,
consisting of locations from Arabic countries, India and Kenya). Descriptive statistics of
distances for these categories can be found in Table 2-2, revealing that median
distances, which are not as much affected by outliers caused by tweets from other cities
as the mean distance, are highest for regions outside North-America/Caribbean and
Europe. Results of the Mood’s median test show that the geographic region has a
significant effect on the distance between the photo and upload location (p = 0.02). This
can potentially be explained by differences in Wi-Fi and mobile data infrastructure,
which has generally better coverage in regions of stronger economic development,
requiring users in less developed countries to move further for internet connection and
sending a tweet.
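Mood's median test compares, for each group, the number of observations above the pooled median with the number at or below it, using a chi-square statistic. The sketch below is an illustrative standard-library implementation and not the software used in the study. Counting ties with the median as "not above" is one of several conventions, and the closed-form p-value is only valid for three groups (two degrees of freedom), as in the regional comparison above.

```python
import math
from statistics import median


def moods_median_test(groups):
    """Return (grand_median, chi2, dof, p_value) for a list of samples.

    p_value is computed only for dof == 2, where the chi-square survival
    function reduces to exp(-chi2 / 2); otherwise it is None.
    """
    pooled = [x for g in groups for x in g]
    grand = median(pooled)
    # Observed counts above and not above the pooled median, per group.
    above = [sum(1 for x in g if x > grand) for g in groups]
    not_above = [len(g) - a for g, a in zip(groups, above)]
    n = len(pooled)
    total_above = sum(above)
    total_not_above = n - total_above
    if total_above == 0 or total_not_above == 0:
        raise ValueError("all observations fall on one side of the median")
    chi2 = 0.0
    for g, a, b in zip(groups, above, not_above):
        expected_a = len(g) * total_above / n
        expected_b = len(g) * total_not_above / n
        chi2 += (a - expected_a) ** 2 / expected_a
        chi2 += (b - expected_b) ** 2 / expected_b
    dof = len(groups) - 1
    p_value = math.exp(-chi2 / 2) if dof == 2 else None
    return grand, chi2, dof, p_value
```

Because the test only uses the position of each value relative to the median, single very large offsets (for example, tweets sent from another city) do not dominate the result.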
Figure 2-4 shows boxplots of the log transformed data grouped by geographic
regions, supporting the pattern from Table 2-2.
R2: Distance Between Photographer And Object
The distance between the photographed object and the photographer can be
interpreted as the visual prominence of an object, with larger values indicating that the
object can be seen (and is interesting enough to be photographed) from further away.
Only photos that have a clear focus on an object were used. Therefore landscapes, city
panoramas, portraits and other photos with scenery were excluded from the analysis.
Visual inspection of the distance data revealed that most photos were taken in close
proximity to the photographed object, which is because urban environments usually
prevent distant views due to the high building density. Figure 2-5 A shows a typical
image setup in a city, with many objects being photographed from short distances, such
as stairways (lower left inset). The figure also shows that photos of landmark buildings
tend to be taken from larger distances, which is because of their visual prominence and
the setup of their surroundings, which often includes large plazas and parks. A similar
case occurs if a water body is located between the object and the photographer (Figure
2-5 B), preventing the user from moving closer, and often providing a scenic foreground
for the photograph. Boxplots of distances for these categories are shown in Figure 2-5
C. A one-way ANOVA test on the log transformed distances indicates a significant effect
of the object category on the photographer’s distance (F(2,322) = 87.47, p < 0.001).
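The F statistic of a one-way ANOVA on log-transformed distances is the ratio of the between-group to the within-group mean square. The following sketch is illustrative only; the base-10 logarithm and the function name are assumptions, since the text does not specify them. With 325 photos in three categories, the degrees of freedom are 2 and 322, matching the reported F(2,322).

```python
import math


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for positive distances per group."""
    logs = [[math.log10(d) for d in g] for g in groups]
    n = sum(len(g) for g in logs)
    k = len(logs)
    grand_mean = sum(sum(g) for g in logs) / n
    means = [sum(g) / len(g) for g in logs]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(logs, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(logs, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

Obtaining the p-value additionally requires the F distribution, which is not part of the Python standard library and is therefore omitted here.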
The overall distribution of distances between the object and photographer also
follows a power law function with an exponent value of 1.31 and R-Squared of 0.89
(Figure 2-5 D). Out of the total 325 photos analyzed, 47% of social media photos with
identified objects were taken within 25 m of the object. On the other end, only 16 % of
photos were taken more than 100 m away from the object.
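A power law of the form f(d) = c · d^(−α) becomes a straight line in log-log space, so its exponent can be estimated by ordinary least squares on the logarithms of distance and frequency. The text does not describe the fitting procedure or the binning of distances, so the sketch below is an assumed approach that takes already binned (distance, frequency) pairs as input; the function name is hypothetical.

```python
import math


def fit_power_law(distances, frequencies):
    """Fit frequency = c * distance ** (-alpha); return (alpha, c, r_squared)."""
    xs = [math.log10(d) for d in distances]
    ys = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination of the linear fit in log-log space.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return -slope, 10 ** intercept, r_squared
```

Note that the R-squared value obtained this way describes the fit in log-log space and depends on the chosen bin widths.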
R3: Distance Between Instagram Locations And Object Position
Instagram location labels are diverse in nature and can denote, among others,
physical objects, such as a building, street, or monument, or administrative units, such
as a city. Users previously had the ability to create custom locations which resulted in a
high density of Instagram locations in urban environments, as shown in an example for
Salzburg (Figure 2-6 A). After an update in August 2015, attaching photos to existing
locations is the only way to geocode Instagram photos. This update also prevents users
from creating new locations inside the Instagram apps. The offset between the identified
objects and the Instagram locations ranges in the analyzed dataset between 2 m and 24
km (median: 85 m, mean: 635 m). 52 % of the locations were closer than 100 m to the
object and 14 % of them were further away than 1 km.
Several reasons can explain a location offset error. Among the locations more
than 1 km away from the identified object, several locations were tagged with general
names, such as a town (e.g. Ispra - Lago Maggiore) or a geographic area (e.g. Dutch
Harbor). This is not necessarily a positional error of the Instagram location, but rather
the user’s inclination towards increased privacy (i.e., obscuring his or her exact
location), lack of local knowledge, the thinking that a general location name is the best
fit for describing the photo content, or the absence of an appropriate Instagram location
nearby. Another reason to explain large offsets that are not related to Instagram location
position errors is when a user mistakenly picks the wrong location label for the photo. If
the photo is not tagged with geographic coordinates in its Exif tags, users rely on the
Instagram location suggestions that are based on their current position. In such cases,
when a user moves away from the photo location, an image can be associated with a
place in the proximity of the upload location. Figure 2-6 B shows an extreme example of
a photo (distance between location and object: 3.3 km) that is neither associated with
the place where it was taken, nor the true location of the object that is shown on the
photo, but a third location, which is most probably close to the place of upload (the
northernmost point).
Furthermore, a number of large-distance location-object pairs in Instagram
revealed misplaced Instagram labels, where locations do not align with their true
positions. An example is provided in Figure 2-6 C, where Instagram locations are
marked as red dots. In these cases, it is possible that the first user who created the
location traveled far towards the southeast before creating a custom location. This
phenomenon implies that custom locations were geotagged based on the smartphone's
geolocation, i.e. the current position of the user. This is illustrated in an example for St.
George Island, Florida (Figure 2-6 D). The spread of Instagram locations around the
true object position, a lighthouse, implies that locations were most likely added by
Instagram users, with coordinates corresponding to their smartphone locations. The
example also shows that the same object can have multiple Instagram locations. One
problem with misplaced locations is that users can add photos to them without being
aware of the position error, since only the location names are shown in the apps, but not
their map location.
Discussion And Future Work
This study analyzed the positional accuracy of geotagged images shared over
Twitter and Instagram, using the estimated photographer positions from the image
content, as well as published coordinates and/or locations of tweets with images and
Instagram images. For Twitter, the analysis provided some explanations for observed
patterns of distance offsets between photo capture location and tweet position, including
Wi-Fi availability. The study considered primarily images taken within the urban areas
since otherwise the scene could not be recognized by the analyst. Offset distances
between photo capture location and tweet position can be expected to be much larger if
distances to scenes outside the city limits, e.g., in other countries, were taken into
account as well. Extending this kind of analysis to the worldwide scale is part of the
plans for future work. The study showed that Twitter and Instagram images help to
identify the visual prominence of selected objects, which is affected by the type and
layout of the object. The analysis therefore relates to the visual aspect of landmark
attractiveness, which could be expanded to determining the semantic and structural
attraction of landmarks (Raubal & Winter, 2002) for these two data sources. The study
also provided various explanations for observed inaccuracies in Instagram location
labels, such as travel between the location where a picture was taken and the location
where it was uploaded. For future work we plan to explore the density and accuracy of
place labels in more depth for cities around the world, and to relate their spatial
characteristics to those of other place label collections, for example in
Foursquare/Swarm.
Table 2-1. Number of identified photographer positions and object locations (in parentheses)
Country Twitter Twitter/Instagram Instagram Total
Austria 45 (24) 50 (27) 54 (16) 149 (67)
Canada 26 (6) 28 (1) 26 80 (7)
Germany 1 2 18 21
Haiti 1 0 16 (4) 17 (4)
Hungary 11 (5) 40 (14) 68 (25) 119 (44)
India 7 (2) 12 (3) 14 (2) 33 (7)
Italy 0 2 41 (6) 43 (6)
Kenya 3 3 3 (1) 9 (1)
Libya 3 (1) 3 (1) 52 (11) 58 (13)
Puerto Rico 1 13 (3) 4 (1) 18 (4)
Serbia 24 (12) 19 (8) 26 (13) 69 (33)
Slovakia 5 (1) 16 18 39 (1)
Turkey 10 12 (1) 31 (10) 53 (11)
United Arab Emirates 4 (1) 8 (2) 0 12 (3)
United Kingdom 6 (1) 18 (5) 16 (6) 40 (12)
United States 126 (21) 203 (32) 546 (59) 875 (112)
Total 273 (74) 429 (97) 933 (154) 1635 (325)
Table 2-2. Descriptive statistics of distances between photo upload and photo position in different geographic regions
Region Mean [m] Median [m] SD [m] N
North America and the Caribbean 7389.0 198.7 20606.4 154
Europe 2837.0 627.7 13077.5 92
Other 3668.0 1559.0 6983.1 27
Figure 2-1. Analyzed areas.
Figure 2-2. Twitter and Instagram photo positions in Vienna.
Figure 2-3. Twitter and Instagram photo positions in Belgrade.
Figure 2-4. Boxplots of distances in different geographic regions.
Figure 2-5. Offset between the photographer and identified object. A) in Vienna, B) in Budapest, C) boxplot of distances for different object categories, D) fitted power law function to the frequency distribution of distances for Twitter and Instagram photos.
Figure 2-6. Spatial distribution of Instagram locations. A) in Salzburg, B) incorrect
selection of an Instagram location in Budapest, C) misplaced Instagram locations in Florida, D) multiple locations for the same object with similar labels in St. George Island, Florida.
CHAPTER 3 ANALYZING THE SPREAD OF TWEETS IN RESPONSE TO PARIS ATTACKS
Study Background
Over the past decade, the number of social media and crowd-sourced data-
sharing platforms has grown substantially and opened a new era of information
collection and analysis. Understanding the dynamics of social networks is crucial for
tracking of opinions (e.g. political trends), management of crises (e.g. environmental
natural hazards or diseases), optimization of business performance (e.g. marketing
campaigns), or the detection of popular topics (Guille et al., 2013). Twitter provides a
prominent platform to study communication patterns among people and the information
flow between them, although, unlike many other social media platforms, Twitter does
not enforce reciprocal sharing (Lotan et al., 2011). The (non-spatial) spread of
information through the Twitter network has been analyzed in numerous studies
(Ferguson et al., 2014; Lerman & Ghosh, 2010; Pei et al., 2014; Romero et al., 2011),
which complements another major thread of Twitter-related analysis, namely that of
human mobility patterns (Hawelka et al., 2014; Hochmair & Cvetojevic, 2014; Hübl et
al., 2017; Jurdak et al., 2015; Lenormand et al., 2014, 2015; Y. Li et al., 2017; Steiger et
al., 2011; Valle et al., 2017). Although several studies addressed the connection
between geographic and social space when analyzing community interaction in social
media platforms (Gründemann & Burghardt, 2016; Takhteyev et al., 2012) most
information diffusion models operate exclusively within the social space, focusing, for
instance, on information promotion (Achananuparp et al., 2012), or the effects of
repeated exposure to hashtags on hashtag adoption (Romero et al., 2011). To better
understand the information spread across the physical world, there is a need to
integrate spatial components into diffusion models.
As a step in this direction, we selected a series of six attacks (including suicide
bombings and mass shootings) that occurred in Paris on the night of November 13th,
2015, and analyzed the diffusion of tweets that contain information pertaining to this
event around the globe. Related tweets were divided based on format and content.
Included formats are tweets with images, tweets with hashtags and tweets with
keywords. Related images posted through tweets were visually inspected to identify
dominant content categories. This led to two distinct content categories, namely, tweets
related to the attacks and those expressing sympathy or support. Diffusion
characteristics were then analyzed for each of these two classes separately. This two-
class content distinction is in line with an earlier study (Seo, 2014) which analyzed
images posted during the November 2012 Gaza conflict. It found that Israeli images
primarily featured the analytical propaganda theme, which included images relating to
attacks and destruction, whereas the emotional propaganda theme, e.g., raising
sympathy towards their own people, was dominant in Hamas images. Our paper
identifies several factors that influence tweet popularity (measured by the number of
retweets), including content category (attacks vs. support related), tweet format
(keywords vs. hashtags vs. images), and Twitter user profession (journalist vs.
non-journalist). Using these categories, various exploratory spatial methods, such as kernel
density maps, are applied to assess the global spread of event-related information
through tweets. This is followed by a spatiotemporal negative binomial regression
model, which uses tweets with event-related hashtags to identify significant predictors of
information spread around the world. In summary, this study addresses the following
research objectives:
determine the effect of tweet content category, tweet format, and user profession on the popularity of tweets that are posted in connection with the Paris attacks;
explore the geographic spread of event-related tweets over time;
use tweets with hashtags that relate to the Paris attacks to identify factors contributing to the information spread around the world within a spatiotemporal regression model.
The remainder of the paper is structured as follows. Section 2 reviews previous
work on information diffusion through Twitter. This is followed by a description of the
study setup in section 3. Section 4 provides results of the tweet popularity analysis,
followed by results of exploratory analysis methods and a spatiotemporal regression
model for Twitter-related information diffusion. Section 5 discusses findings and the
utilized analysis methods, which is followed by conclusions and directions for future
work.
Related Work
The geospatial data component that comes from social media content and from
crowd-sourcing applications used for communication, navigation, or sharing travel
experiences, is primarily generated through passive, often unaware, means of contribution,
and is therefore sometimes referred to as Involuntary Geographic Information (iVGI)
(Fischer, 2012). Although georeferenced tweets fall into the same category, and the
sharing of one’s location is not the main purpose of tweets, Twitter position information
has been frequently used to assess the spatio-temporal dimension of emergency
situations, such as earthquakes, floods, forest fires, or terrorist attacks (De Longueville
& Smith, 2009; Hung et al., 2016; L. Li & Goodchild, 2010; MacEachren et al., 2011), to
predict the spread of diseases (Brennan et al., 2013; Signorini et al., 2011), and to
model human mobility patterns in the case of unexpected events (Shelton et al., 2014).
Twitter is used by over 300 million users every month and therefore provides a
significant data source for studying communication patterns and information flows
among people (Lotan et al., 2011; Pei et al., 2014). However, it suffers from user
sampling bias (Duggan et al., 2015), and geographical bias through its concentration on
certain countries (Hawelka et al., 2014). Furthermore, only about 1% of all tweets are
geo-tagged (Graham et al., 2014). This means that results of Twitter studies are not
necessarily representative of the general population or even of all Twitter users. To
compensate for the scarcity of geo-tagged tweets, various studies have explored
methods to geo-locate tweets (Cheng et al., 2010; Zahra et al., 2017) or Twitter users
(Jurgens, 2013; Kotzias et al., 2014) through other sources of information in the tweet
post or in the user profile, such as geographic references in the tweet text and the social
network structure. Though these geo-positioning methods are consistently improving,
they add a level of positional uncertainty to any subsequent analysis, and often require
manual checks for reliable results. Therefore, for the presented study only geo-tagged
tweets were used.
Modeling of information diffusion in the Twitter network was often approached
through the analysis of retweet patterns (Guille et al., 2013), where a retweet is an
action taken by a Twitter user to share someone else’s tweets without alteration
(Compston, 2014). For example, Cha, Haddadi, Benevenuto, & Gummadi (2010)
compared three measures of user influence on others, namely the number of followers,
the number of retweets, and the number of user mentions. Results showed that popular
users with a high number of followers do not necessarily have more retweets and
mentions, but that it is more influential to have an active audience that retweets or
mentions the user. Another study showed that tweets that contain interesting URLs (as
rated by others), and are posted by users with many followers were likely to be more
widely spread (Bakshy et al., 2011). Similarly, Pei et al. (2014) used several network
topology measures, including degree, PageRank, and k-core, to detect influential
spreaders of information in the online social media platforms Twitter, Facebook, and
LiveJournal. Based on a diffusion network model, Yang & Counts (2010) predicted the
speed, scale, and range of information diffusion on Twitter using a variety of user and
tweet related predictors, including a user’s activity level, the presence of a URL in a tweet,
or the stage of topic lifespan when a tweet was posted. Achananuparp et al. (2012)
introduced the notion of weak retweets in their information propagation model. This
concept describes a user posting a tweet that mentioned a relevant item, such as a URL
or hashtag, from an earlier tweet posted by another user.
Besides retweet patterns, hashtags have often been used to observe content
trends and to track topical information propagation. A Twitter hashtag is a string of
characters preceded by the hash (#) character, and is generated by users as a method
to categorize content and to highlight topics. A recent study extracted sentiments and
topics from tweets that contained the #prayforparis hashtag and that were sent four
days after the Paris attacks (Chong, 2016). The topics were extracted using latent
semantic analysis (LSA) (Deerwester et al., 1990; Evangelopoulos et al., 2015;
Landauer & Dumais, 1997) and included among others a tribute to the victims of the
Paris attack during the soccer game between England and France. Lotan et al. (2011)
analyzed Twitter information flows during the 2011 revolutions in Egypt and Tunisia for
mainstream media organizations, journalists, and bloggers using tweets with hashtags,
such as #sidibouzid or #jan25. The study concluded that Twitter accounts of
organizations have substantially higher retweet rates than accounts of individuals, but
that news on Twitter is being co-constructed by bloggers and activists alongside
journalists. Tsur & Rappoport (2012) showed that a post’s content (e.g. length of a
hashtag) and context (e.g. cognitive categories), as well as the topology of the social
graph (e.g. number of followers) and global temporal features (e.g. peak hours) are
important predictors of the popularity of hashtags over time. Another study found that
the spread of hashtags varies by topic and that, especially for political hashtags,
repeated exposure leads to frequent hashtag adoption by followers (Romero et al.,
2011). Chang (2010) proposed a Diffusion of Innovation Theory that examines a trend
of hashtag adoption during certain time periods after the user has been exposed to
hashtag information.
Regarding news topicality, Kwak et al. (2010) compared the occurrence of
headlines between Twitter and CNN and found that some events, such as accidents
and sporting events, broke out on Twitter first. A comparative analysis of the relative
importance of social media for news in six European countries, Japan, and the U.S.
revealed that television is still the most widely used and most important source of news
(Nielsen & Schrøder, 2014).
Several studies examined the ties between spatial and social network structure
on Twitter. For example, it was found that smaller Twitter networks are more socially
clustered and extend over a smaller physical distance than larger ones, suggesting that
network and physical distances are related (Stephens & Poorthuis, 2014). Similarly,
Takhteyev et al. (2012) showed that a substantial share of Twitter ties lies within the
same metropolitan region and that distance-related variables, such as language,
country, and the number of flights, affect Twitter ties between regional clusters.
Overall, the literature review reveals that the spatial and geographic aspects of
current diffusion network models of social media platforms are largely neglected. To
narrow this research gap, the role of distance and spatial hierarchy will be explored in
the context of information propagation. For this purpose geo-tweets with images,
hashtags, and keywords related to the Paris attacks will be used as data source.
Study Setup
Twitter Information Sharing Methods Analyzed In The Study
Twitter is a microblogging service that allows its users to send posts called
tweets. The length of a tweet was limited to 140 characters until November 2017, when
the maximum length was doubled to 280 characters. Our study uses tweets from 2015,
and therefore analyzes posts that are up to 140 characters long. Tweets can be
enriched with different content including images, videos, and links to external web
pages. The geo-positioning capabilities of mobile devices through GPS, Wi-Fi, or cell
phone towers give Twitter users the opportunity to add location information to their
posts. Users post tweets on their timeline, and a follower is a user who can see another
user’s posts on their own timeline. Followers can either like or retweet another user’s
tweet. In the case of a retweet, a user forwards a tweet and shares it on his or her own
timeline. The retweet mechanism, therefore, allows users to extend the information
beyond the reach of the original tweet’s followers (Kwak et al., 2010). Retweets can be
seen on a user’s timeline together with their own tweets and the list of liked tweets can
be seen in a separate tab. Liking a tweet is a sign of appreciation for a tweet.
Hashtags provide a platform for the discussion of a specific topic and are therefore used
to classify information and highlight topics, promoting folksonomy (Chong, 2016).
Hashtag strings can be clicked to trigger a global search of tweets related to a topic of
interest. Retweeting and assigning tweets to topics through certain hashtags are
common methods of spreading information through Twitter.
Data Access
Twitter provides free access to the public portion of their data through the Twitter
Streaming Application Programming Interface (API) and REST APIs. The dataset used
for this study covers 1,094,009 worldwide geotagged tweets that were posted within two
weeks from the day of the attacks (November 13, 2015). Hashtags related to these
events were used primarily within a span of a few days. Therefore, the two-week range
appeared to be adequate for the proposed analysis. Data was downloaded using the
Tweepy Python library from the Twitter Streaming API and stored in a PostgreSQL
database. Since the Streaming API returns tweets in the JavaScript Object Notation
(JSON) file format immediately after they were posted, the number of retweets equals
zero upon download. Therefore, in order to obtain the current number of retweets of a
tweet, the HTML code of tweets was accessed through a URL in the format:
http://twitter.com/statuses/tweet_id, and then parsed using the BeautifulSoup Python
library.
The JSON object for each tweet with an image contains a URL to an actual
image file on the Twitter server. Using a customized Web application, we manually
selected a subset of images that were posted from tweets within a predefined polygon
around Paris (Figure 3-1) and that were related to the attacks or showed support. The
JSON object also contains a list of all hashtags that are used in a tweet. Furthermore,
tweets that contained attack or sympathy related keywords were extracted using a
full-text search in PostgreSQL, as described in more detail in the section on the role
of tweet type and content on tweet popularity.
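The extraction of hashtags and image URLs from a tweet's JSON object can be sketched as follows. This is a minimal illustration, not the code used in the study: the field names (`entities`, `hashtags`, `media`, `media_url`) are assumed to follow the Twitter API v1.1 tweet layout, and the sample tweet and its URL are invented.

```python
def extract_hashtags_and_images(tweet):
    """Return (hashtags, image_urls) from a tweet dict.

    Assumes the v1.1-style layout in which hashtags and media are
    listed under the "entities" key.
    """
    entities = tweet.get("entities", {})
    # Hashtags are lowercased so that #PrayForParis and #prayforparis match.
    hashtags = [h["text"].lower() for h in entities.get("hashtags", [])]
    # Keep only photo attachments; other media types are ignored.
    image_urls = [
        m["media_url"]
        for m in entities.get("media", [])
        if m.get("type") == "photo"
    ]
    return hashtags, image_urls


# Invented sample tweet for illustration only.
sample = {
    "id_str": "1",
    "text": "Thoughts with everyone tonight #PrayForParis",
    "entities": {
        "hashtags": [{"text": "PrayForParis"}],
        "media": [
            {"type": "photo", "media_url": "http://pbs.twimg.com/media/example.jpg"}
        ],
    },
}
print(extract_hashtags_and_images(sample))
```

Lowercasing hashtags is a design choice made here so that differently capitalized variants of the same hashtag are counted together.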
According to the official documentation (Moffitt, 2014), tweets can contain three
types of location information, which are (1) geotags (exact location or Twitter place), (2)
geographic location mentioned in the tweet, or (3) location in the user profile. For this
study, only geotagged tweets were used. The breakdown of types of location
geometries found in the used worldwide dataset of geotagged tweets is shown in Table
3-1. Given the small percentage of tweets with exact coordinates (9.58%) among geo-
tagged tweets, the spatial analysis of information spread based solely on tweets with
exact coordinates would have been seriously limited.
To identify tweets that are posted from Paris and hence serve as a seed source
for information diffusion, various spatial search methods were applied:
Tweets with exact coordinates within the Paris bounding box (Figure 3-1),
Tweets geocoded with a place type “admin” whose centroid falls within the Paris bounding box,
Tweets geocoded with a place type “city” and the value “Paris”.
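The three spatial search rules listed above can be sketched as a single predicate. The bounding box coordinates below are hypothetical placeholders (the actual extent is the one shown in Figure 3-1), and the tweet fields (`coordinates`, `place`, `place_type`, `bounding_box`) are assumed to follow the Twitter API v1.1 layout. Letting exact coordinates take precedence over the place tag is likewise an assumption of this sketch.

```python
# Hypothetical bounding box as (west, south, east, north) in degrees.
PARIS_BBOX = (2.0, 48.6, 2.7, 49.1)


def in_bbox(lon, lat, bbox=PARIS_BBOX):
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def bbox_centroid(ring):
    """Centroid of a rectangular place polygon given as [[lon, lat], ...]."""
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    return (min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2


def is_paris_seed(tweet):
    coords = tweet.get("coordinates")
    if coords:
        # Rule 1: exact point location inside the bounding box.
        lon, lat = coords["coordinates"]
        return in_bbox(lon, lat)
    place = tweet.get("place")
    if not place:
        return False
    if place.get("place_type") == "admin":
        # Rule 2: centroid of an "admin" place falls inside the bounding box.
        ring = place["bounding_box"]["coordinates"][0]
        return in_bbox(*bbox_centroid(ring))
    if place.get("place_type") == "city":
        # Rule 3: a "city" place named "Paris".
        return place.get("name") == "Paris"
    return False
```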
The following tweet formats were analyzed:
Tweets with attack related photos or support pictures (Figure 3-2),
Tweets with hashtags related to attacks or support (in English and French),
Tweets with keywords related to attacks or support (in English and French).
Tweets of all three tweet types were subdivided into two content categories as follows:
Event-related:
a) photos from streets immediately after the attacks (Figure 3-2 A),
b) hashtags such as: #ParisAttacks, #Bataclan (Bataclan is a theater where one of the attacks took place),
c) keywords such as: “attack”, “scared”, “terror”, “armes”, “policiers”.
Support related:
a) tweets containing artistic support images (Figure 3-2 B),
b) tweets containing hashtags expressing support and sympathy, such as #PrayForParis,
c) keywords and bigrams such as: “pray”, “stay strong”, “contre terrorisme”.
Analysis Of Tweet Popularity
In the presented study, the average number of retweets was used to measure
the popularity of tweets. The role of tweet format (image, hashtag, keyword), content
category (event, support), and profession of the contributor (journalist, non-journalist) on
popularity is assessed, using different sets of tweets. First, tweets with images from
Paris were selected manually using a customized Web application that visualized the
approximately 9000 tweets with images from the Paris area that were posted between
9:00 p.m. (local time) on November 13 and 7 a.m. the day after. Second, tweets posted
from the broader Paris area with matching keywords and hashtags posted within two
weeks from the attacks were extracted after manual selection of keywords and
hashtags.
The Role Of Tweet Type And Content On Tweet Popularity
After the first author selected and classified the images based on content, two
more graduate students verified the content classification of the images. The students
were asked to classify images as attack related or as support related, or to suggest an
alternative category. For the verification procedure the same Web application as for the
initial classification was used. All three individuals (first author and two verification
subjects) identified the exact same set of images as support images, as that theme was
very distinctive. The two verification subjects identified two and three additional attack
related images respectively, compared to the first author. After a thorough examination
of tweet text and comments associated with the newly identified photos, it was,
however, found that the photos were screenshots of news and not genuine images.
Therefore, these tweets were not used for further analysis. No other categories were
suggested by the verification subjects.
Using the Python NLTK (Natural Language Toolkit), keywords in English and
French were extracted from tweets posted within the Paris area. Frequent occurrences
of single words (e.g., terror, police and attack) and bigrams (a combination of two
words, such as in “stay safe”) were identified by the first author. A total of 101 single
keywords and 21 bigrams related to the attacks as well as 18 keywords and 10 bigrams
expressing support were identified. In addition, the four most frequent event and two
most frequent support related hashtags, with three in English and three in French, were
identified. Alternative methods of computer-assisted keyword and hashtag extraction
from the unstructured text are presented in the literature (King et al., 2017).
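The counting of frequent single words and bigrams can be illustrated with a short sketch. It uses only the Python standard library as a stand-in for NLTK, applies no stop-word filtering, and the tokenization rule (dropping URLs, mentions, and hashtags) is an assumption made for this example rather than the procedure used in the study. The final selection of attack and support related terms remained a manual step.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and return its word tokens.

    URLs, @mentions, and #hashtags are removed first, since hashtags
    were handled separately from keywords.
    """
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    # [^\W\d_]+ matches runs of letters, including accented French letters.
    return re.findall(r"[^\W\d_]+", text)


def frequent_terms(tweets, top_n=10):
    """Return the top_n most frequent unigrams and bigrams."""
    unigrams, bigrams = Counter(), Counter()
    for text in tweets:
        tokens = tokenize(text)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams.most_common(top_n), bigrams.most_common(top_n)


# Invented example tweets.
examples = ["Stay safe Paris", "stay safe everyone #Paris", "Police everywhere"]
top_unigrams, top_bigrams = frequent_terms(examples)
```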
To check the correctness of the classification of keywords, bigrams, and
hashtags into the two content categories, tweets in English were manually classified by
three individuals (two Ph.D. students and one postdoctoral researcher) and tweets in
French by three volunteers (the first author’s relatives who live in Paris and speak
French fluently). Each reviewer from the English group was given a random sample of
100 tweets with English keywords and 100 tweets with English hashtags. Similarly,
each reviewer from the French group was given 100 tweets with French keywords and
100 tweets with French hashtags. The six reviewers were given options to classify
tweets as attack related, support related, as “other” (i.e., unrelated to either of the two
categories), or to suggest a new category. Table 3-2 shows the confusion matrix of the
manual classification conducted by reviewers (event, support, other) and the automated
classification (event, support) for hashtags and keywords.
The table shows that 70.8% of the tweets that were automatically (i.e., based on
hashtags) classified as event-related, were confirmed in the process of manual
classification. For support related tweets, the match was even higher with 96.0%. For
keyword-based tweet extraction, the matching rates were somewhat lower, i.e. 72.9%
(events) and 68.5% (support), respectively. Most discrepancies came from one French-
speaking reviewer who identified politics as a subcategory in certain tweets. However,
upon further review we could not identify distinctive keywords or hashtags in these
tweets that would imply a political theme. Other discrepancies came from a few
automatically extracted tweets that used both support and attack related keywords and
hashtags together, such as: “#ParisAttacks I hope @username is safe”. In this example,
the hashtag was related to the event but the text was expressing support, and a
reviewer classified it as support related.
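The percentages in Table 3-2 are column-normalized: for each automated class, they give the share of reviewed tweets that reviewers assigned to each manual class. A minimal sketch of that computation is shown below. The label names and input layout are assumptions made for this example, and manual labels outside the listed set are not handled.

```python
from collections import Counter


def confusion_percentages(pairs, manual_labels=("event", "support", "other")):
    """Column-normalized confusion matrix.

    pairs: iterable of (automated_label, manual_label) tuples.
    Returns {automated_label: {manual_label: percent}}, where the
    percentages for each automated label sum to 100.
    """
    counts = Counter(pairs)
    automated_labels = sorted({a for a, _ in counts})
    table = {}
    for a in automated_labels:
        total = sum(counts[(a, m)] for m in manual_labels)
        table[a] = {m: 100.0 * counts[(a, m)] / total for m in manual_labels}
    return table


# Invented review outcomes for illustration.
reviews = [("event", "event")] * 3 + [("event", "support")] + [("support", "support")] * 2
matrix = confusion_percentages(reviews)
```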
The number of retweets in each of the six combined classes of tweet content
categories and tweet formats closely follows a power law distribution with an r-squared
of 0.83 or higher (Figure 3-3), supporting earlier findings about the distribution of
retweets (Can et al., 2013). This means that only a small number of tweets received a
high number of retweets. For the estimation of the power-law exponent (α), a linear
regression with simple logarithmic binning was used (White et al., 2008). For the
frequency distributions in Figure 3-3, only tweets that fall into exactly one format and
content category were used to better understand the effect of content and tweet format.
This means that, for example, tweets containing an image and a hashtag (i.e., two tweet
formats) were excluded.
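A regression-based estimate of the power-law exponent with logarithmic binning can be sketched as follows. This sketch uses base-2 bins and divides each bin count by the bin width before regressing log density on log bin center. Whether this matches the exact binning variant of White et al. (2008) used in the study is not confirmed here, so the function name, the bin base, and the `min_count` threshold should be read as assumptions. Tweets with zero retweets are skipped because their logarithm is undefined, and at least two populated bins are needed.

```python
import math


def fit_power_law_log_binning(values, min_count=5):
    """Estimate alpha in p(x) ~ x**(-alpha); returns (alpha, r_squared).

    Values >= 1 are grouped into bins [2**k, 2**(k+1)). Each bin count is
    divided by the bin width, and log(density) is regressed on
    log(bin center) with ordinary least squares.
    """
    counts = {}
    for v in values:
        if v >= 1:
            k = int(math.floor(math.log2(v)))
            counts[k] = counts.get(k, 0) + 1

    xs, ys = [], []
    for k, c in sorted(counts.items()):
        if c < min_count:
            continue  # sparse tail bins are noisy and are left out
        width = 2.0 ** k
        center = math.sqrt(2.0 ** k * 2.0 ** (k + 1))  # geometric midpoint
        xs.append(math.log(center))
        ys.append(math.log(c / width))

    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return -slope, 1.0 - ss_res / ss_tot
```

The function returns the exponent as a positive number together with the r-squared of the log-log regression, which corresponds to the goodness-of-fit measure reported for Figure 3-3.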
Table 3-3 shows the mean numbers of retweets and their standard deviations for
the different tweet formats and content categories, using the same dataset. It can be
seen that mean retweet numbers increase from bottom to top (keyword – hashtag –
image) and are larger for the event than for support related tweets across all tweet
formats.
Since observations are count data, the effect of tweet format and content
category on the popularity of a tweet was assessed using a two-way analysis of
deviance from the phia R package (De Rosario Martínez, 2015), assuming a negative
binomial distribution of observations. Hence, the observed counts were fit to a negative
binomial model with factors content category and tweet format and their interaction, and
then an ANOVA was run. Results for retweet numbers reveal a significant interaction
between tweet content and format and demonstrate significant main effects for tweet
format and content category (Table 3-4).
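For reference, a negative binomial model with two factors and their interaction can be written in generic form as below. The notation is assumed for this illustration and is not taken from the study: $Y_{ijk}$ is the retweet count of tweet $k$ with format $i$ and content category $j$, $\mu_{ij}$ is the expected count, and $\theta$ is the dispersion parameter.

```latex
\begin{aligned}
Y_{ijk} &\sim \mathrm{NegBin}(\mu_{ij}, \theta), \qquad
\mathrm{Var}(Y_{ijk}) = \mu_{ij} + \frac{\mu_{ij}^{2}}{\theta}, \\
\ln \mu_{ij} &= \beta_0 + \alpha_i + \gamma_j + (\alpha\gamma)_{ij}
\end{aligned}
```

Here $\alpha_i$ and $\gamma_j$ are the main effects of tweet format and content category, and $(\alpha\gamma)_{ij}$ is their interaction. The analysis of deviance tests each of these terms with a likelihood-ratio chi-square statistic.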
Since there are only two content categories (event and support), the main effect
of the content variable indicates that event tweets trigger significantly more retweets
than support tweets. Since there are more than two formats, the effect of format on
retweet numbers will be more closely analyzed using interaction contrasts (Table 3-5).
Results in Table 3-5 show that both for event and support content, tweets with
pictures receive more retweets than those with hashtags and keywords (rows 2, 6, 14,
15). In addition, tweets with hashtags receive more retweets than tweets with keywords
(rows 1, 13). Event-related tweets receive more retweets than support related tweets,
which is true for tweets with pictures (row 12), hashtags (row 3), and keywords (row 8).
All these results match the visual pattern observable in Figure 3-4. The figure shows
that differences in mean retweet numbers between event and support related tweets
vary between hashtags, keywords, and pictures, suggesting that the effect of the
content category on the number of retweets depends on the tweet format.
The Effect Of The Profession On Tweet Popularity
Among the 169 users who shared attack related photos (based on the earlier
manual selection), 48 users were identified as journalists. Most of these 48 user
accounts belonged to individuals, and not to organizations. To classify a user, the profile
description, username, and links were parsed by the first author for information that
expressed an affiliation with any kind of news channel, such as television or online
newspaper.
To verify the user classification, three graduate students who were not involved in
this study were asked to conduct a manual identification of the same 169 users who
posted photos of the attacks. Out of the 48 journalists initially identified by the first
author, the graduate students confirmed 46 users to be journalists, and two additional
Twitter users were identified as journalists. One of the additional journalist users had a
link to an external website that, among others, contained the user’s profession. The
other one had a description in Arabic that included special characters, which were only
identified by one reviewer who speaks Arabic. Using this new set of 48 journalist users,
journalists were found to have a significantly higher median number of followers
(median = 1554) than other users (median = 377) using a Wilcoxon Signed Rank test (Z
= 4095, p < 0.0001). To detect whether a tweet of a journalist was more influential than
a tweet of a non-journalist, after removing the effect of the number of followers, the
number of retweets per follower of a user was determined (Table 3-6).
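The normalization by follower count can be sketched as follows. The input layout (one record per user with `retweets`, `followers`, and `is_journalist` fields) is an assumption made for this example, and the sketch only computes group medians; it does not reproduce the significance test.

```python
from statistics import median


def median_retweets_per_follower(users):
    """Median retweets-per-follower ratio for journalists and non-journalists.

    users: iterable of dicts with keys 'retweets', 'followers', and
    'is_journalist'. Users without followers are skipped because the
    ratio is undefined for them.
    """
    groups = {True: [], False: []}
    for u in users:
        if u["followers"] > 0:
            groups[u["is_journalist"]].append(u["retweets"] / u["followers"])
    return {flag: median(vals) for flag, vals in groups.items() if vals}


# Invented records for illustration.
example_users = [
    {"retweets": 10, "followers": 1000, "is_journalist": True},
    {"retweets": 30, "followers": 1000, "is_journalist": True},
    {"retweets": 1, "followers": 1000, "is_journalist": False},
    {"retweets": 2, "followers": 1000, "is_journalist": False},
    {"retweets": 3, "followers": 1000, "is_journalist": False},
    {"retweets": 5, "followers": 0, "is_journalist": False},
]
medians = median_retweets_per_follower(example_users)
```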
A Wilcoxon Signed Rank test (Z = 3677, p = 0.006) showed that the median
number of retweets per follower for journalists (median = 0.003) was significantly higher
than for non-journalists (median = 0.001). This supports earlier findings from the
literature which states that journalists are able to generate Twitter response levels that
are comparable to those of media organizations, bloggers, bots, activists, and
politicians, and hence engage their audiences more than other types of Twitter users
(Lotan et al., 2011). A possible explanation is that journalists have faster access to
news information, which leads to faster subsequent information dissemination. Another
reason for higher retweet rates could be a higher trustworthiness of individual journalists
who built their credibility over time, especially those who are highly engaged in social
media activities (Jahng & Littau, 2016).
Analysis Of Information Spread
Information spread was first analyzed through exploratory data analysis, using
worldwide maps of retweets of tweets with images, and using worldwide maps of tweets
with event-related hashtags. A spatiotemporal regression analysis provides an
analytical framework for dispersion modeling of tweets with attack related hashtags.
Exploring Information Spread On World Maps
Retweets
For the first analysis, tweets with event and support related pictures posted
between 9:00 p.m. (local time) on November 13 and 7 a.m. the next morning were used
as seed tweets. The worldwide locations of retweets were identified by finding original
geotagged tweets (i.e., not retweets) of retweeting users within a six-hour window
around the retweet time. This approach was necessary since retweets, which were
obtained from the Twitter Search API, did not contain the geographic position of the
retweet, but only that of the original tweet. To obtain a location of the retweet, all
location types (exact coordinates, neighborhood, city, province, country) of tweets
around the retweet time (between three hours before and after) were used. Factors
limiting the success in identifying the location of retweets were the sparsity of
geotagged tweets (1-2% of all tweets) and the limitation that only the first 20 retweets
of a tweet can be obtained from the Twitter API. Hence, only tweets with up to 20
retweets were used for this analysis. This approach reduced the sample to 259 tweets
with images of events or support. The method located 68 retweets out of 1451 total
retweets. Figure 3-5 visualizes the location of retweets that were located using the
method described above, separated by event and support related images.
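The retweet location procedure can be sketched as a lookup among the retweeting user's own geotagged tweets. The function and parameter names are hypothetical. Choosing the tweet closest in time when several fall inside the window is an assumption of this sketch, since the text only states that tweets within three hours before and after the retweet were used.

```python
from datetime import datetime, timedelta


def locate_retweet(retweet_time, user_geotweets, window=timedelta(hours=3)):
    """Estimate where a retweet was posted.

    user_geotweets: list of (created_at, location) pairs for the retweeting
    user's original geotagged tweets. Returns the location of the tweet
    closest in time to the retweet, or None if no tweet lies within
    +/- window of the retweet time.
    """
    candidates = [
        (abs(created_at - retweet_time), location)
        for created_at, location in user_geotweets
        if abs(created_at - retweet_time) <= window
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[0])[1]


# Invented timeline for illustration.
retweet_at = datetime(2015, 11, 13, 22, 0)
timeline = [
    (datetime(2015, 11, 13, 20, 30), "Lyon"),
    (datetime(2015, 11, 13, 23, 0), "Madrid"),
    (datetime(2015, 11, 14, 6, 0), "Berlin"),
]
print(locate_retweet(retweet_at, timeline))
```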
Retweets were primarily found in Europe and the United States, which have a
higher Twitter penetration rate than countries on other continents. The higher density in
some European regions could be explained by their proximity to France, and therefore
higher safety concerns. While the map does not display all retweets of identified seed
tweets due to the technical limitations described above, it provides a general overview of
the regions to which information about the attacks primarily spread. Most retweets are
located in France (44), followed by the United States (13), Spain (5) and Germany (3).
Hashtags
Figure 3-6 visualizes the location of tweets with selected French (A, B) and
English (C, D) hashtags posted within the first two weeks of the attacks. These four
hashtags are a subset of the six hashtags used for tweet extraction described earlier.
The maps show that the language of the hashtags has a clear effect on the
geographic spread of tweets. Tweets with French hashtags spread mostly within France
and to some extent to the only francophone Canadian province (Quebec) and
predominantly French-speaking Caribbean islands. As opposed to this, tweets with
English hashtags, whether they relate to attacks (Figure 3-6 C) or support (Figure 3-6
D), spread into many more countries around the world. The fact that English is more
widely spoken around the world than French (see footnote 1) may explain why English
hashtags are more widely used than French ones, leading to these distinct information
diffusion patterns.
Figure 3-7 plots the worldwide proportion of tweets containing particular hashtags
about attacks (solid lines) and support (dashed lines) among all tweets containing any
hashtag for the first two weeks after the attacks. The shape of the line graphs suggests
that the interest in the topic dropped quickly after two days. The daily counts are
measured for Paris local time. Since the attacks happened in the late evening hours
only a small proportion of tweets occurs on November 13th.
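The daily shares plotted in Figure 3-7 can be sketched as follows: each tweet is assigned to a calendar day in Paris local time, and the share is computed among tweets that contain at least one hashtag. The input layout is an assumption made for this example, and a fixed UTC+1 offset is used as a simplification, which holds for Paris in November.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed offset for Central European Time (UTC+1); valid for November dates.
PARIS_NOV = timezone(timedelta(hours=1))


def daily_hashtag_share(tweets, hashtag):
    """Share of hashtag-bearing tweets per Paris-local day that use `hashtag`.

    tweets: iterable of (created_at, hashtags), where created_at is a
    timezone-aware datetime and hashtags is a set of lowercase strings.
    Tweets without any hashtag are ignored.
    """
    tagged, matching = Counter(), Counter()
    for created_at, tags in tweets:
        if not tags:
            continue
        day = created_at.astimezone(PARIS_NOV).date()
        tagged[day] += 1
        if hashtag.lower() in tags:
            matching[day] += 1
    return {day: matching[day] / tagged[day] for day in tagged}


# Invented tweets for illustration (timestamps in UTC).
utc = timezone.utc
example_tweets = [
    (datetime(2015, 11, 13, 21, 0, tzinfo=utc), {"prayforparis"}),
    (datetime(2015, 11, 13, 23, 30, tzinfo=utc), {"prayforparis"}),
    (datetime(2015, 11, 14, 10, 0, tzinfo=utc), {"other"}),
    (datetime(2015, 11, 14, 11, 0, tzinfo=utc), set()),
]
shares = daily_hashtag_share(example_tweets, "PrayForParis")
```

In the example, the tweet posted at 23:30 UTC on November 13 is counted for November 14, because it is already past midnight in Paris.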
Kernel-density maps
Kernel-density maps were used to visualize the spatial distribution of tweets with
selected hashtags over time. To illustrate the spread of an English hashtag, Figure 3-8
visualizes Kernel density maps on top of individual tweet locations with the
#prayforparis hashtag within the first 9 hours of the attacks, grouped by 3-hour
aggregations. Visual inspection suggests that during the first three hours tweets occur
primarily in and near France and in parts of the US East coast.
1 http://www.diplomatie.gouv.fr/en/french-foreign-policy/francophony/the-status-of-french-in-the-world/
In Figure 3-8 A it is between 9 p.m. and midnight local time in Europe, and the
highest concentration of related tweets is, as expected, in Western Europe due to the
proximity of Twitter users to the event. In that figure, it is late afternoon/early evening on
the US East coast, which is an active time for tweeting compared to morning or late
night hours (Andrienko et al., 2013). This can explain this early concentration of related
tweets in that region. As opposed to this, for selected regions in the Middle East or Asia,
the local time associated with the first map is closer to late night or early morning hours,
e.g. between 1 a.m. and 4 a.m. in Dubai, and between 5 and 8 a.m. in the Philippines.
This may explain the lower initial level of tweet responses to attacks in these areas.
Three hours later (Figure 3-8 B), the news spread further to populated areas around the
world with high Twitter penetration rates, such as Brazil, the western United States,
Central America, Indonesia, and the Philippines, but still only little to the Middle East
(with a local time between 4 a.m. and 7 a.m.). Another three hours later (Figure 3-8 C)
tweets spread further into adjacent regions of those highlighted in Figure 3-8 B, also
showing some response in the Middle East.
Spatiotemporal Regression For Global Spread Analysis
The purpose of the regression model was to find spatial and temporal regression
coefficients that reflect the spread of tweets containing any of the six hashtags in Figure
3-7 around the world. It was expected to reveal patterns similar to those observed in the
kernel density maps.
The data was constructed as a panel, with tweet counts for clusters of Twitter
places in three-hour time intervals prepared over a time period of two weeks. The use of
panel data allowed modeling a time-lagged neighbor effect, where the tweet count in
one area (e.g., Paris) affects the tweet count in “neighboring” areas, e.g., other
populous metropolitan areas, such as New York, Rio De Janeiro, or London. The
analysis was run as a negative binomial regression for the count outcomes, since the
count data was over-dispersed. Stata software was used for this purpose.
In the given context of information dispersion through Twitter, the city in each
country that had most tweets with any of the selected hashtags was designated as a
local connector (neighbor) to Paris, which was considered the source of
information. These major cities, which did not necessarily match political capitals of the
countries but were derived from a clustering process during a preparatory step, were
called tweeting capitals. Predictor variables were set up in a way that estimated
regression coefficients would model the spread of information from Paris to other
tweeting capitals, and the subsequent information spread to other smaller cities around
each tweeting capital. In recent studies, a similar framework with lagged variables was
used to disentangle the cause and effect of land use and transportation network growth
(Levinson, 2008), and to model the interaction in data growth between different crowd-
sourced datasets (Alivand & Hochmair, 2017).
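The designation of tweeting capitals can be sketched as a per-country maximum over cluster counts. The input layout and the example numbers are invented for this illustration, and ties are resolved in favor of the cluster listed first.

```python
def tweeting_capitals(cluster_counts):
    """Pick the cluster with the most hashtag tweets in each country.

    cluster_counts: iterable of (country, cluster, hashtag_tweet_count).
    Returns {country: cluster}.
    """
    best = {}
    for country, cluster, count in cluster_counts:
        if country not in best or count > best[country][1]:
            best[country] = (cluster, count)
    return {country: cluster for country, (cluster, _) in best.items()}


# Invented counts for illustration.
capitals = tweeting_capitals(
    [
        ("US", "New York", 900),
        ("US", "Washington", 400),
        ("BR", "Rio de Janeiro", 300),
        ("BR", "Brasilia", 120),
    ]
)
```

In this example the tweeting capital of each country differs from its political capital, which mirrors the distinction made in the text.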
Model formulation
A general model for panel data analysis, using a first order lag and a negative-
binomial distribution of the dependent variable, can be formulated in the presented
context as follows, similarly to (Levinson, 2008):
$$\ln(D_{i,t}) = D_{i,t-1}\varphi + W D_{i,t-1}\rho + X_{i,t-1}\beta + W X_{i,t-1}\chi + Z_i\zeta + T_{t-1}\psi$$
where
$D_{i,t}$ is the number of tweets with hashtags in cluster i at time t,
W is a matrix of spatial interaction weights (the neighborhood matrix),
$D_{i,t-1}$ is the number of tweets with hashtags in cluster i at time t-1 (the lagged
value of the dependent variable),
X is a vector of variables that change with both cluster and time,
Z is a vector of cluster-specific variables that do not change with time,
T is a vector of time-specific variables that do not change with the cluster, and
φ, ρ, β, χ, ζ and ψ are coefficients to be estimated through regression.
The weight matrix defines spatial relationships between clusters. It consists of
binary values that indicate likely directionality in the change of the dependent variable
over time. With the chosen matrix setup, all the cities in a country are modeled to be
neighbors to their tweeting capital, and all the tweeting capitals are neighbors to Paris.
Vector $X_{i,t-1}$ contains the number of all tweets posted in each cluster i at time period t-1.
Vector $Z_i$ consists of i) a variable indicating the geodesic distance between the tweeting
capital of cluster i and Paris and ii) a variable describing the continent cluster i is located
in. The latter captures differences in time zones between clusters and thus the different
local times at which the attacks occurred. For modeling purposes, several time zones
were grouped together by continent, giving the following four continent groups: 1)
Europe, 2) the Americas, 3) the Middle East and 4) Asia and Australia. Vector $T_{t-1}$
represents the count of three-hour periods that have passed since the attacks, which is
the same for each cluster.
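The hierarchical neighborhood structure and the resulting spatially lagged term can be sketched as follows. The sketch reads the weight matrix as directed: a tweeting capital receives input from Paris, and every other cluster receives input from the tweeting capital of its country. This reading, the function names, and the example counts are assumptions made for this illustration. Every country is expected to have exactly one cluster flagged as tweeting capital.

```python
def build_neighbors(clusters, source="Paris"):
    """Directed neighbor lists equivalent to a binary weight matrix W.

    clusters: {cluster_id: (country, is_tweeting_capital)}.
    Returns {cluster_id: [ids whose lagged counts feed into it]}.
    """
    capital_of = {
        country: cid for cid, (country, is_cap) in clusters.items() if is_cap
    }
    neighbors = {}
    for cid, (country, is_cap) in clusters.items():
        if cid == source:
            neighbors[cid] = []  # the source has no upstream neighbor
        elif is_cap:
            neighbors[cid] = [source]  # tweeting capitals connect to Paris
        else:
            neighbors[cid] = [capital_of[country]]  # cities connect to their capital
    return neighbors


def spatial_lag(neighbors, lagged_counts):
    """Row i of W times the lagged count vector: sum over i's neighbors."""
    return {
        cid: sum(lagged_counts.get(n, 0) for n in nbrs)
        for cid, nbrs in neighbors.items()
    }


# Invented clusters and lagged hashtag counts for illustration.
example_clusters = {
    "Paris": ("FR", True),
    "Lyon": ("FR", False),
    "London": ("GB", True),
    "Leeds": ("GB", False),
}
nbrs = build_neighbors(example_clusters)
lag = spatial_lag(nbrs, {"Paris": 120, "London": 40})
```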
Data preparation
For the analysis, only those tweets were used that were geocoded either with
exact coordinates, or with a place tag at the neighborhood or city level, and that had any
of the six included hashtags within two weeks from the attacks. Neighborhood and city
places are represented as rectangular bounding box polygons in Twitter. In a first step,
in order to avoid excessive zero counts at different time steps in analyzed places, the
number of places used in the regression analysis was reduced. This was achieved by
clustering all Twitter places at the neighborhood and city level into major cities through
distance-based clustering of place polygon centroids, using
the PostGIS function ST_ClusterWithin(). This function returns an array of geometry
collections. Each collection contains a set of geometries whose centroids are separated
by no more than a specified distance. In our setup, if the distance between places was
shorter than 0.1 arc degrees, places were aggregated to a cluster. Figure 3-9
demonstrates the clustering process of 26 Twitter places (rectangles) in South Florida
into 4 major clusters (ellipses) and six smaller standalone clusters (rectangles of
different colors some distance away from ellipses). These smaller clusters were
retained for tracking the local spread of tweet information out of the tweeting capitals.
Clusters with fewer than one thousand tweets over the course of two weeks were
excluded from the analysis as well as clusters that had fewer than 40 tweets with
hashtags related to the Paris attacks.
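The distance-based grouping of place centroids can be illustrated with a pure-Python sketch. It is a stand-in for the PostGIS function, not its implementation: points are merged into the same cluster whenever they lie within the distance threshold of at least one other member, which yields connected components. Distances are planar and in the same units as the coordinates (arc degrees here), and the pairwise comparison is only suitable for small inputs.

```python
import math


def cluster_within(points, max_dist=0.1):
    """Group (x, y) points into distance-connected clusters.

    Two points end up in the same cluster if a chain of points links them
    in which consecutive points are at most max_dist apart.
    Returns a list of clusters, largest first.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values(), key=len, reverse=True)


# Invented centroids for illustration.
centroid_clusters = cluster_within([(0.0, 0.0), (0.05, 0.0), (0.12, 0.0), (1.0, 1.0)])
```

In the example, the first and third points are more than 0.1 apart but are still merged, because the second point links them. This chaining behavior is what allows neighboring place polygons to grow into one metropolitan cluster.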
Model estimation
Table 3-7 shows the results of the model estimation, which predicts the count of
tweets with selected hashtags in place clusters.
Results indicate that cities which are designated as a country’s tweeting capital
are associated with a higher number of tweets than other cities in that country. An
increasing distance between the tweeting capital and Paris, as well as the number of
three-hour periods since the hashtag inception, are negatively associated with the
number of tweets with hashtags. The latter indicates that the growth in the numbers of
the tweets with hashtags declines over time. The number of tweets in a given time
period was positively correlated with the number of tweets with hashtags observed in a
place cluster during that time period. As expected, the lagged count of tweets with
hashtags at t-1 (shown as L1) is a strong predictor of the count of tweets with hashtags
at t. The number of tweets with hashtags in clusters is also affected by the number of
tweets sent in the “tweeting capital” of the country in the previous time period, as
indicated by the Δhashtags variable. This shows that the local spread of information
within a country from the tweeting capital to other cities in the country explains part of
the tweeting activities in those cities, suggesting a hierarchical structure of information
diffusion. This matches the visual perception of diffusion patterns in Figure 3-8.
Locations in Asia and Australia received an increased number of tweets compared to
other continents, after controlling for distance from Paris, possibly due to high
population densities in certain Asian regions.
Discussion
This research presented a multi-faceted analysis of information spread through
tweets under consideration of tweet format and content category. Event-related tweets
triggered more retweets than those expressing support, possibly due to the higher
information content found in the first group of tweets. The rich visual information content
of images might also explain why tweets with images received higher retweet numbers
than tweets with event-related keywords or hashtags. The 140-character limit in tweets
at that time allowed only so much content to be posted, and a picture seemed to be
worth more than 140 characters. Tweets with hashtags were more popular than those
with keywords related to the attacks, which could be expected because hashtags make
tweets searchable both by followers and non-followers and are links to other tweets that
contain them.
The study showed that in emergency situations like the Paris attacks Twitter is
widely used both by journalists and non-journalists. However, tweets with images
posted by journalists received significantly more attention per follower than tweets with
images sent by other users, suggesting that journalists, through their continued work
and frequent association with larger media companies, already built their follower
network and trustworthiness.
Different exploration methods for the geographic diffusion of tweets were chosen
for tweets with images and tweets with hashtags. For tweets with images, global
retweeting patterns were analyzed. This task necessitated, however, a complex
approach to estimate the geographic position of retweets, and was constrained by API
limitations. These technical obstacles may explain why only a few earlier studies tackled
the question of spatial information diffusion on Twitter. If retweets were to add the
user’s current geolocation (instead of the position of the original tweet), this would
render the retweet map (Figure 3-5) more complete. Since such tweets would provide
additional user information, e.g. position information, and hence modify the original
tweets, by definition they would resemble quoted tweets instead of retweets. For the
diffusion of tweets with hashtags, all hashtag occurrences could be mapped,
independent of how a tweeting user learned about that hashtag. This allows for a
complete estimation of spread patterns, although it conceals details about the path that
the information traveled along from a set of seed tweets. Mapping hashtag locations
showed that hashtags in French were predominantly used in francophone territories,
e.g., France and Quebec, whereas English hashtags had a more global coverage.
Kernel density maps of English hashtag occurrences showed radial spread patterns
across the world, namely travel from Paris primarily to other metropolitan areas around
the world, and from there to smaller surrounding places.
Twitter users do not represent the general population and it is important to
emphasize that all conclusions about social behavior found in related studies apply
primarily to Twitter users and not necessarily the general population (Lansley &
Longley, 2016; Mislove et al., 2011). In this study, conclusions were driven by an even
smaller group of Twitter users, namely those who post geo-tagged tweets, adding more
to population bias (Malik et al., 2015). For example, the information level about the Paris
attacks might be underestimated in regions with weak phone data coverage
(Cvetojevic et al., 2016) and low Twitter penetration rates (Hawelka et al., 2014)
if alternative news channels (e.g. TV, radio) exist that offset the lack of Twitter
data access (Nielsen & Schrøder, 2014). These potential limitations apply at least to the
geographic analysis of information spread (e.g., retweet maps, hashtag distribution
maps, Kernel density maps), and the regression model for spread analysis. As opposed
to this, the comparison of retweet numbers as well as the temporal distribution of
hashtags are expected to more closely represent the communication structure among
all Twitter users, because no explicit spatial component was involved in the
corresponding analysis procedures. In addition, the fact that a large portion of the geo-
tagged tweets used in this study had place locations instead of exact coordinates limited
the spatial resolution of the conducted spatial analysis. This posed, however, no serious
problem to a global spread analysis, as it was conducted in this paper.
Keyword-based filtering of tweets was limited to English and French languages.
With other languages, it would be difficult to identify content relating to the attacks, and
to find volunteers who help to check the correctness of the automatic classification of
tweets into the different content categories. Besides this, the scarcity of geotagged
tweets, combined with the small percentage of tweets posted in other languages (Hong
et al., 2011) limits the spatial spread analysis to only a few languages, such as English,
Japanese, Portuguese, Indonesian, Spanish, or French. Given that pictures relating to
the attacks were selected manually and that this is a time-consuming process, only
tweets with pictures posted between the attacks and the next morning were examined,
which were still around 9000 tweets for the wider Paris area. A longer time frame would
also include the pictures of the aftermath of the attacks, such as crowds and lines at the
airport due to the elevated security measures. However, tweets with such pictures might
tend to have a local rather than a global coverage since only a limited group of the
affected users would be interested in that kind of information (e.g. travel agencies, local
residents).
Generally, hashtags have been shown to be a viable approach to tracking
geographic information flows in Twitter. However, focusing only on the occurrence of
hashtags eliminates the aspect of information flow, since hashtag analysis does not
account for follower tracking.
Figure 3-1. Bounding box (this map extent) around Paris, which was used to select
original tweets with images, hashtags, and keywords whose spread was analyzed.
Figure 3-2. Tweets with photos: A) photos of the attacks, B) artistic images expressing support shared with tweets.
Figure 3-3. Power law fitting the distribution of retweets, separated by tweet format and
content category.
Figure 3-4. Interaction between tweet type and content category on the number of
retweets.
Figure 3-5. Retweets of tweets with pictures related to the Paris attacks.
Figure 3-6. Geographic distribution of hashtags: A) #AttentatsParis, B) #fusillade (en: shooting, gunshots), C) #ParisAttacks, D) #PrayForParis.
Figure 3-7. Temporal distribution of hashtags.
Figure 3-8. Kernel density maps for the first 9 hours of #prayforparis hashtag usage
(tweet density is shown in thousand tweets per square km). A) 0-3 hours, B) 3-6 hours, C) 6-9 hours.
Figure 3-9. Distance-based clustering of twitter places around Barcelona.
Table 3-1. Breakdown of geometry types in the analyzed dataset of tweets (wide Paris area, 13 Nov-27 Nov)
Geometry type Tweets
Place type: city 85.30%
Exact coordinates 9.58%
Place type: admin 5.12%
Table 3-2. Confusion matrix for tweet content classification
Hashtags Keywords
Events Support Events Support
Events 70.8% 2.0% 72.9% 13.7%
Support 16.9% 96.0% 6.3% 68.5%
Other 12.3% 2.0% 20.8% 17.8%
Total 100.0% 100.0% 100.0% 100.0%
Table 3-3. Popularity of tweets for different tweet formats and content categories
Events Support
Tweet format Tweet count Retweets mean (SD of the mean) Tweet count Retweets mean (SD of the mean)
Image 183 96.3 (31.3) 188 41.2 (26.0)
Hashtag 10098 9.6 (1.2) 12014 2.8 (0.5)
Keyword 15164 4.9 (0.5) 3181 2.0 (0.3)
Table 3-4. Analysis of deviance for retweets
Retweets Degrees of freedom LR Chisq P(>Chisq) Significance
Content category 1 4295.4 < 0.001 ***
Tweet format 2 4891.0 < 0.001 ***
Content type: Tweet format 2 93.3 < 0.001 ***
Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05
Table 3-5. The interaction between tweet format and content category on retweets (P-value adjustment method: Holm)
Row Content type: Tweet format Difference Chisq P(>Chisq) Significance
1 Events:hashtag-Events:keyword 1.970 1291.563 < 0.001 ***
2 Events:hashtag-Events:picture 0.100 471.954 < 0.001 ***
3 Events:hashtag-Support:hashtag 3.401 3684.638 < 0.001 ***
4 Events:hashtag-Support:keyword 4.752 2433.635 < 0.001 ***
5 Events:hashtag-Support:picture 0.233 192.060 < 0.001 ***
6 Events:keyword-Events:picture 0.051 794.777 < 0.001 ***
7 Events:keyword-Support:hashtag 1.727 868.737 < 0.001 ***
8 Events:keyword-Support:keyword 2.412 829.422 < 0.001 ***
9 Events:keyword-Support:picture 0.118 414.868 < 0.001 ***
10 Events:picture-Support:hashtag 34.126 1107.310 < 0.001 ***
11 Events:picture-Support:keyword 47.676 1260.757 < 0.001 ***
12 Events:picture-Support:picture 2.338 32.951 < 0.001 ***
13 Support:hashtag-Support:keyword 1.397 113.445 < 0.001 ***
14 Support:hashtag-Support:picture 0.069 651.304 < 0.001 ***
15 Support:keyword-Support:picture 0.049 781.996 < 0.001 ***
Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05
Table 3-6. Retweet statistics for tweets posted by journalists and non-journalists
Journalist Number of users Followers per user (average/median) Retweets per follower (average/median)
False 121 3480.8/377 0.16/0.001
True 48 7559.9/1554 0.26/0.003
Table 3-7. Negative binomial regression for panel data (Europe is the default continent)
Variable Coefficient Std. Err. Z value P>|z| Significance
(Intercept) 0.508 0.037 13.81 <0.001
Tweeting capital 0.585 0.050 11.67 <0.001 ***
Three hour time periods -0.095 0.001 -76.10 <0.001 ***
Number of all tweets 0.001 0.000 8.23 <0.001 ***
Number of tweets with hashtags at t-1 (L1) >0.000 0.000 6.64 <0.001 ***
Distance from capital to Paris <0.000 0.000 -4.74 <0.001 ***
Δhashtags (for capital) 0.001 0.000 55.37 <0.001 ***
Continent (the Americas) 0.092 0.061 1.51 0.132
Continent (Asia) 0.600 0.095 6.31 <0.001 ***
Continent (the Middle East) -0.112 0.143 -0.78 0.433
Number of observations 20,800
Number of groups (3h time steps) 40
Observations per group 520
Adjusted McFadden pseudo ρ2 0.144
Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05
CHAPTER 4 MODELING INTERURBAN MENTIONING RELATIONSHIPS IN THE U.S. TWITTER
NETWORK USING GEO-HASHTAGS
Study Background
The study of competition and interactions between cities has a long history
(Kresl, 1995) and has aimed at determining the hierarchical importance of cities in
various domains, such as finance or trade. The eminent role of a city can be derived
from the concentration of facilities, such as hospitals, schools, or universities or,
alternatively, be determined within a network of cities. In the latter approach, two cities
can be considered linked if they share the headquarters of large multinational
companies, trade goods, exchange services, such as finance, accounting, law,
advertising or management, or exchange people, which, at the global scale, has led to
so-called world city networks (Derudder et al., 2013; Taylor, 2001; Zook & Brunn, 2005).
More recently, information flows and exchange in telecommunication and social
networks were used to describe the role of cities in travel and communication patterns
at different scales. Especially during the last decade or so, social networks grew
substantially, some with the number of monthly active users exceeding hundreds of
millions (Twitter) or even billions (Facebook). Twitter offers free access to the public
portion of their data, which was hence analyzed to better understand user interaction
and community building within the network (Goolsby, 2010; Myers et al., 2014; Weng et
al., 2013). About 1-2% of public tweets are geo-tagged (Graham et al., 2014) and have
therefore explicit geographic information attached. This information was used to study
the role of geographic distance and national boundaries in the formation of social ties
and communities, which showed that online social networks and the underlying
real-world geography are closely related (Stephens & Poorthuis, 2014; Takhteyev et al.,
2012). Using tweets from 58 cities around the world, (Lenormand et al., 2015) found
that, based on node degree and betweenness network measures, New York and
London play a central role on the global travel scale. (Hawelka et al., 2014) identified
mobility regions by partitioning a country-to-country network of Twitter user flows at
different hierarchical levels, and (Sobolevsky et al., 2013) partitioned human population
based on the network of communication activities using country-wide data sets of
telephone calls. The analysis of inter-urban movements in China from check-in data
(i.e., a piece of geo-tagged content posted by a user) showed that communities
approximately follow province boundaries (Liu et al., 2014).
Little is known about what causes strong social network ties at a larger,
aggregate level and across city boundaries. Such analysis could lead to a better
understanding of mutual cultural, sociodemographic, or economic commonalities
between distant regions and their effect on communication. Explanations of strong ties
between regions would need to reach beyond factors that are commonly used to explain
the strength of a tie between two people, such as the frequency of their interaction or
the intensity of their emotional attachment (Koput, 2010). With the need to find
approaches to strengthen ties within a network (e.g. effectiveness of teamwork in a
company) and to better utilize intra-organizational and extra-organizational capital,
social scientists have explored if and how overall properties of the social network
structure affect the strength of social ties within the network (Fernandez et al., 2000).
For example, a stronger tie between two people is hypothesized to lead to a higher
proportion of other people tied to both of them due to factors such as time capacity
(limited time we can devote to social interaction, leading to larger group events and
hence closer ties between all involved individuals) and homophily (Granovetter, 1973).
The latter concept means that we interact socially primarily with others who share
similar interests, for example, based on demographics or location, as opposed to
heterophily, which describes the increased social interaction between individuals of
dissimilar characters. The research presented in this paper will extend communication
analysis from the level of individual users to the city level and hence explain the role of
cities in the Twitter network with regard to city interactions. The goal of this study is to
explore the interurban network structure of hashtag-based mentions in the Twitter
network using network structure metrics, to model the strength of mutual city mentions
based on city covariates, and to explain some of the underlying processes leading to
this inter-urban interaction.
Related Work
Social network theory provides explanations to many questions about social
phenomena, and the analysis of community network structure remains a prime area of
network research (Borgatti et al., 2009). Social science distinguishes
between different types of dyadic relations, including similarities (e.g. sharing a
location), social relations (e.g. kinship), interactions (e.g. who talked to whom), and
flows (e.g. that of resources). The strength of a tie between people can be modeled
along various dimensions, including the amount of time shared, emotional intensity,
intimacy, or social distance, such as education level (Gilbert & Karahalios, 2009), but it
is also influenced by network topology and informal social circles (Burt, 1995).
Social network graphs often comprise communities or cliques, which are natural
divisions of network groups into densely connected subgroups (Koput, 2010). Previous
research efforts have developed algorithmic approaches to optimize the detection of
communities in networks, where the quality of partitions is often measured by the
modularity (Newman & Girvan, 2003). For example, (Blondel et al., 2008) developed a
heuristic method to optimize modularity, which was tested on social networks, citation
networks, and web networks of different scale, with up to 1 billion links. Other studies
used Latent Dirichlet allocation (LDA) for detecting communities from individual
movement data, such as GPS tracking trajectories for automobiles or geo-tagged
tweets from visitors in Florida (Kempinska et al., 2017; Valle et al., 2017). Despite the
massive amount of crowd-sourced data from social media, it is important to note that,
due to the demographic and geographic sampling bias (Duggan et al., 2015; Hawelka et
al., 2014; Longley & Adnan, 2016) as well as the small percentage of geo-tagged tweets,
the results of Twitter behavioral studies are not necessarily representative of the
general population or even of all Twitter users.
Increasingly complex frameworks of human connectivity define interactions
between places (Thiemann et al., 2010), and the development of new communication
systems, such as the Internet or social media, has generated new forms of social
contacts. (Kato et al., 2012) analyzed in detail favorites, follows, and mentions on
Twitter from a network structural point of view and found that their indegrees and
outdegrees exhibit a scale-free property, which means that their degree distribution
follows approximately a power law. (Weng et al., 2010) analyzed follower behavior in
Twitter and found that the presence of reciprocity can be explained by homophily. This
means that a twitterer follows a friend because of being interested in some of the topics
posted by the friend, and that vice versa the friend follows back because he or she finds
that they share similar topics of interest. The authors therefore propose “TwitterRank”,
an extension of the PageRank algorithm, to measure the influence of users in Twitter
under the consideration of the similarity of topics that users are interested in. (Kwak et
al., 2010) found that reciprocal relationships on Twitter are driven by geographic and
popularity homophily, where users with fewer than 1000 followers tend to be co-located
with their reciprocal followers of similar popularity. (Snijders, 2011) provides a detailed
overview of statistical methods for social network analysis and lists transitivity,
reciprocation and homophily as main network dependencies. The paper also mentions
the Multiple Regression Quadratic Assignment Procedure (QAP) defined by
(Krackhardt, 1988) which can be used for the exploration of nodal covariates for
modeling the strength of social ties. (McPherson et al., 2001) analyze the influence of
homophily on the formation of ties in social networks and conclude that
sociodemographic, behavioral and intrapersonal similarities divide social space and
heavily influence the formation of connections. The similarity between social network
users was found to explain more than half of the behavioral contagion (Aral et al.,
2009). Previous studies also examined the structure, topology, and strength of ties in
other types of communication networks. For example, (Onnela et al., 2007) examined
the resilience of mobile phone networks to edge removal by analyzing communication
patterns of millions of mobile phone users. The study showed that the removal of weak
links would affect the network's overall integrity more than that of strong links since
weak links connect different communities, as opposed to strong ties.
Study Setup
The study area comprised the 50 U.S. states (i.e., the contiguous U.S., Alaska, and
Hawaii) as well as Puerto Rico. Public tweets were downloaded through the Twitter
Streaming Application Programming Interface (API) and REST API, where the Python
library Tweepy was used as a client. Tweets were downloaded in JavaScript Object
Notation (JSON) format and stored in a PostgreSQL database. In order to download
all geotagged tweets and to not exceed the maximum available download bandwidth,
the world was divided into seven download regions, for which tweets were collected
between September 20 and October 20, 2016. The total number of geotagged tweets
downloaded per region for that time period together with their download share is shown
in Figure 4-1.
(Moffitt, 2014) lists three types of location information contained in tweets:
1) the geotag (exact location or Twitter place),
2) the geographic location mentioned in the tweet post (including hashtags),
3) the location in the user profile.
For this study, only tweets that contain both the first and second type of location
were included so that the directionality of city mentioning could be derived. More
specifically, the first type of location was used to identify the city from which a tweet
mentioning another city was posted, and the second type of location was used to identify
which city was mentioned in that tweet. A Twitter hashtag is a string of characters
preceded by the hash (#) character, and is generated by users to categorize content
and to highlight topics. Therefore, for the second type of location information,
geographic locations mentioned in tweets were included only if they were part of a
hashtag. Such mentioning would clearly indicate an intended topical connection to that
city, as opposed to a more casual mentioning of the city name in a tweet. Hashtags
have been used before to observe content trends and to track topical information
propagation (Chong, 2016; Lotan et al., 2011), but not to analyze mentioning patterns
between cities.
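The pairing of the two location types into a directed mention can be sketched as follows. This is a minimal illustration rather than the code used in the study. It assumes that a tweet is available as a dictionary following the Twitter API v1.1 JSON layout (`place.place_type`, `place.full_name`, `entities.hashtags[].text`); the lookup table `GEO_HASHTAGS` and the function name `extract_mentions` are hypothetical.

```python
# Hypothetical lookup from lower-cased hashtag text to a city label.
# In the study, the list of city geo-hashtags was compiled manually.
GEO_HASHTAGS = {
    "nyc": "New York, NY",
    "newyork": "New York, NY",
    "boston": "Boston, MA",
}


def extract_mentions(tweet):
    """Return (origin_city, mentioned_city) pairs for one tweet dictionary."""
    place = tweet.get("place") or {}
    if place.get("place_type") != "city":
        return []  # the origin cannot be resolved at the city level
    origin = place.get("full_name")
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    pairs = []
    for tag in hashtags:
        city = GEO_HASHTAGS.get(tag.get("text", "").lower())
        if city is not None:
            pairs.append((origin, city))
    return pairs


example = {
    "place": {"place_type": "city", "full_name": "Los Angeles, CA"},
    "entities": {"hashtags": [{"text": "NYC"}, {"text": "travel"}]},
}
print(extract_mentions(example))
```

Only the hashtag that matches an entry of the lookup table produces a pair, so a casual hashtag such as #travel is ignored.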
The JSON structure of every geotagged tweet contains a place information tag that
shows the country from which the tweet was posted (see lines highlighted in boldface in
Figure 4-2).
In order to limit tweets to the study region only, i.e., the United States, regions
north-west 1, north-west 2 and north-west 3 in Figure 4-1 were queried. These three
regions contained a total of 98,508,449 geo-tagged tweets, of which 68,218,710
were from the United States. The final selection yielded 10,493,455 tweets with
hashtags. In a next step, hashtags were ordered by frequency and the 1500 most
frequent hashtags were manually analyzed for city names. An earlier automated attempt
to geocode tweets through comparison between hashtags and Twitter place names,
using the Levenshtein distance, led to unsatisfactory results (e.g., due to duplicate
names or a different spatial resolution of place regions between both compared
sources) and was therefore not pursued any further. During the manual matching
process, each hashtag was verified on Google Maps and Wikipedia to ensure that it
indeed represented a city name. City names in hashtags that occurred more than once
at different locations were excluded to avoid ambiguity. This process resulted in a total
of 309 city geo-hashtags.
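The ambiguity that made the automated matching unsatisfactory can be illustrated with a short sketch of the Levenshtein distance. How exactly the distance was applied in the study is not described, so the matching rule below (comparing the hashtag against the city part of a place name) and the helper name `closest_places` are assumptions made for illustration only.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_places(hashtag, place_names):
    """Return all place names at minimal edit distance from the hashtag."""
    def city_part(name):
        return name.split(",")[0].replace(" ", "").lower()

    scored = [(levenshtein(hashtag.lower(), city_part(name)), name)
              for name in place_names]
    best = min(score for score, _ in scored)
    return [name for score, name in scored if score == best]


places = ["Portland, OR", "Portland, ME", "Boston, MA"]
print(closest_places("portland", places))
```

Both Portland entries tie at distance zero, which shows why duplicate city names cannot be resolved by string similarity alone.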
The geography of mentioning cities was obtained through Twitter places from
tweets that used a place type “city”, as shown in the example in Figure 4-2, or exact
coordinates combined with place type “city”. Cities are a Twitter place type that falls
between the Twitter "admin" place type and the Twitter "neighborhood" type in terms of
spatial resolution, and can only be found in selected regions around the world
(Hochmair et al., 2018), including parts of the U.S., Europe, Canada, Brazil, India, or
Japan. Figure 4-3 compares the spatial layout of originating cities (Twitter places
visualized as green polygons) and that of mentioned cities (Kernel density heat map). It
shows that heavily populated metropolitan areas, such as New York, the San Francisco
Bay, Philadelphia, Washington D.C., or Dallas have the highest density of places
mentioned in hashtags. The Kernel density visualization was used to show the spatial
distribution of mentioned cities (per km2). It should be noted that actual locations of
mentioned cities are typically more dense in the center of the Kernel density peaks, but
they do exist on the fringes as well.
Next, since the same city could both be mentioned in a hashtag and be the
location of a geo-tagged tweet (i.e., the mentioning city), the final stage of the data
preparation included the assignment of cities from both data sources (hashtags, place
type) into a common geographic scheme, namely the U.S. Census Metropolitan and
Micropolitan Areas. To assign a city to a Census Metropolitan or Micropolitan Area, the
centroid of a city bounding box of Twitter places (compare Figure 4-3) was used. This
was automated for the city place type in tweets, whereas the cities for the 313
geo-hashtags were first manually geocoded and then fit inside the nearest Census
Metropolitan or Micropolitan area. The union of mentioned and mentioning cities
resulted in a total of 432 cities across the U.S. A few more conditions were used to
ensure that bot or spam tweets were excluded from the data set. First, only tweets
from mobile devices were used, i.e., tweets were filtered based on their source.
Then, to filter out bots, the “botometer” API was applied (Varol et al., 2017).
Additionally, several other users who used more than three hashtags per tweet
were removed. The last step was necessary since some users with politically motivated tweets
had a high number of geo-hashtags in every tweet and were thus biasing the outdegree
of cities.
Analyzing the Network Structure of Mentions
Social networks are often modeled as graphs. Therefore, measures of graph
structure are important to understand the role of different network components (e.g.
nodes, links) and actors in the network. A comprehensive review of measures relating to
the organization of a social network and the interaction between actors can be found in
the literature (Barthélemy, 2011; Boccaletti et al., 2006; Koput, 2010; Snijders, 2011).
This section reviews concepts of social network analysis which are used in the modeling
of inter-city hashtag mentions, including centrality, node degree, or reciprocity.
Graph Generation
As a basis for subsequent social network analysis a directed, weighted graph
was created. Cities were abstracted as nodes, and mentions of cities in tweet hashtags
as edges. The edge weight was the number of times a tweet in city A mentioned city B
in a hashtag. For graph analysis and visualization the R package igraph was used
(Csardi & Nepusz, 2006). As an example, Figure 4-4 shows a sub-graph comprised of
33 cities that have an indegree higher than 30 in a layout proposed by (Adai et al.,
2004). Line width corresponds to the number of directed mentions. The closeness of the
nodes is proportional to the weight of the links between them. Hence, the layout does
not reflect geographic proximities, but rather proximities in the social network space.
The entire resulting network had the following dyad census:
Mutual links (the number of pairs of cities with mutual mentions): 307
Asymmetric links (the number of pairs of cities with one-way mentions): 1,527
Null links (the number of pairs of cities with no mentions between them): 91,262
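The graph construction and the dyad census can be sketched in a few lines. The study used the R package igraph; the pure-Python version below is only an illustrative equivalent, and the function names as well as the decision to drop self-mentions are assumptions of this sketch.

```python
from collections import Counter


def build_edge_weights(mentions):
    """Count directed mentions; self-mentions are dropped (an assumption of this sketch)."""
    return Counter((a, b) for a, b in mentions if a != b)


def dyad_census(nodes, weights):
    """Return (mutual, asymmetric, null) counts over all unordered node pairs."""
    mutual = asymmetric = 0
    seen = set()
    for a, b in weights:
        pair = frozenset((a, b))
        if pair in seen:
            continue  # this unordered pair was already classified
        seen.add(pair)
        if (b, a) in weights:
            mutual += 1
        else:
            asymmetric += 1
    n = len(nodes)
    null = n * (n - 1) // 2 - mutual - asymmetric
    return mutual, asymmetric, null


mentions = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
weights = build_edge_weights(mentions)
print(weights[("A", "B")])                          # 2
print(dyad_census(["A", "B", "C", "D"], weights))   # (1, 1, 4)
```

As a consistency check on the reported census, 307 + 1,527 + 91,262 = 93,096, which equals 432 × 431 / 2, the number of unordered pairs among 432 cities.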
The Distance Between Mentioning Cities
To calculate mention distances between two cities (where city A mentions city B
in a tweet with a geo-hashtag), the distance between each pair of cities was counted as many
times as city A mentioned city B. The mean distance of a mention was 1293 km and the median was
834 km (the latter corresponding to the approximate distance between San Diego and
San Francisco). These mention distances reveal significantly smaller values than
unweighted distances between all possible city pairs in the city mention graph (with
mean = 1423 km, median = 1070 km). This means that mentions take place in localized
and regional clusters.
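The weighting of distances by mention counts can be sketched as follows. The distance metric used in the study is not stated, so the great-circle (haversine) distance on a sphere of radius 6371 km is an assumption here, as are the function names and the toy coordinates.

```python
import math
from statistics import mean, median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def mention_distances(coords, weights):
    """Repeat each city-pair distance once per mention."""
    distances = []
    for (a, b), w in weights.items():
        d = haversine_km(*coords[a], *coords[b])
        distances.extend([d] * w)
    return distances


# Toy input: three points on the equator and two directed links.
coords = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 10.0)}
weights = {("A", "B"): 3, ("A", "C"): 1}
weighted = mention_distances(coords, weights)
print(round(mean(weighted), 1), round(median(weighted), 1))
```

Because the short link A to B carries three of the four mentions, the weighted median equals the short distance, which mirrors the shift towards smaller values described above.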
Figure 4-5 shows the distribution of distances between all pairs of cities (blue
histogram) and the distribution of distances of mentions between cities (yellow histogram). The pronounced
peak of the weighted distance distribution in the two smallest distance bins,
compared to the shape of the blue histogram, suggests that mentions between cities are
more common at shorter distances than the corresponding geographic layout of cities
would suggest.
Node Degree
In an undirected graph, the degree of a node is the number of links incident to
that node. In a directed graph, the indegree id(n) and outdegree od(n) of a node are the
numbers of incoming and outgoing edges, respectively, and the degree of a node deg(n) is
the sum of its id(n) and od(n) (Sporns, 2002). The concept of node degree has been
extended to weighted networks, where the weighted in- and outdegrees consider the
sum of weights of incoming or outgoing edges and hence measure the strength of
nodes in terms of the total weight of their connections (Barrat et al., 2004). A weighted
node degree is also referred to as node strength. Node strength is a commonly used
measure for the analysis of weighted networks (Barrat et al., 2004; Opsahl et al.,
2010). Therefore, weighted degree or strength of nodes will be used in the subsequent
analyses of this study.
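Node strength follows directly from the edge weights, as the following minimal sketch shows. The input format (a dictionary keyed by source and target) and the function name `node_strength` are assumptions chosen for illustration.

```python
from collections import defaultdict


def node_strength(weights):
    """Weighted in- and outdegree per node from {(source, target): weight}."""
    in_strength = defaultdict(int)
    out_strength = defaultdict(int)
    for (source, target), w in weights.items():
        out_strength[source] += w  # mentions made by the source city
        in_strength[target] += w   # mentions received by the target city
    return dict(in_strength), dict(out_strength)


weights = {("A", "B"): 2, ("B", "A"): 1, ("C", "B"): 4}
in_s, out_s = node_strength(weights)
print(in_s)   # {'B': 6, 'A': 1}
print(out_s)  # {'A': 2, 'B': 1, 'C': 4}
```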
Table 4-1 shows the weighted indegree and outdegree of cities in the U.S.-wide
mention graph. The outdegree denotes the number of times other cities are mentioned
in hashtags of tweets posted in that city whereas the indegree of a city denotes the
number of times tweets posted from other cities mention that city in a hashtag.
Table 4-1 A) shows that New York gets most mentions from other cities (660),
making it the most prominent city in this regard, followed by Atlanta (352), Los Angeles
(349) and Boston (303). Table 4-1 B) shows that New York and Los Angeles mention
other cities the highest number of times, which could be attributed to the fact that they are
the largest and second largest cities in the U.S. by population. The steep decline in
weighted indegree and outdegree suggests a right-skewed distribution for both
variables.
The frequency of indegree and outdegree was fitted to a power law distribution
(Figure 4-6), where a linear regression with a simple logarithmic binning was used
(White et al., 2008). The R-squared was found to be 0.80 and 0.92 for incoming and
outgoing mentions, respectively. This case is typical for scale-free networks. It has been
shown that node strength follows a power-law distribution in scale-free networks (Tan &
Lei, 2013; Watts & Strogatz, 1998), which is also demonstrated for the network of city
mentions in this study. Also, numerous real-world networks have this topology (Wang &
Chen, 2003).
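A linear regression on logarithmically binned data can be sketched as below. The bin base of 2, the normalization of counts by bin width, and the function name `log_binned_fit` are assumptions of this sketch; the exact binning settings of the study are not reported.

```python
import math


def log_binned_fit(values, base=2.0):
    """Least-squares line through log10(density) versus log10(bin centre).

    Bin k covers [base**k, base**(k + 1)); density is count / bin width.
    Returns (slope, intercept, r_squared). Values below 1 are ignored.
    """
    counts = {}
    for v in values:
        if v < 1:
            continue
        k = 0
        while base ** (k + 1) <= v:  # find the bin without floating-point logs
            k += 1
        counts[k] = counts.get(k, 0) + 1
    if len(counts) < 2:
        raise ValueError("need at least two non-empty bins")
    xs, ys = [], []
    for k, c in sorted(counts.items()):
        low, high = base ** k, base ** (k + 1)
        xs.append(math.log10(math.sqrt(low * high)))  # geometric bin centre
        ys.append(math.log10(c / (high - low)))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Synthetic strengths whose binned density falls off with exponent -2.
strengths = [1] * 16 + [2] * 8 + [4] * 4 + [8] * 2 + [16]
slope, intercept, r2 = log_binned_fit(strengths)
print(round(slope, 3), round(r2, 3))
```

The slope of the fitted line estimates the power-law exponent, and the R-squared of the same regression corresponds to the goodness-of-fit values reported above.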
Network Centrality Measures
Network centrality measures are commonly used to identify influential nodes in a
network. This is because a central actor has the potential power to influence
information flows in such a way as to serve the actor’s interests (Freeman, 1977).
Different types of network centrality have been proposed using measures, such as
topology (neighborhood relationships), flows, or network distances (Barthélemy, 2011).
Some prominent examples include degree centrality (or strength centrality in the case of
weighted networks), Eigenvector centrality and its variant Page Rank centrality,
Kleinberg hub and authority centrality scores, or betweenness and closeness centrality.
Since betweenness centrality is not a suitable approach for weighted networks
(Dekker, 2008), we will compare some other centrality measures for the analyzed
mention network. Node strength, measured as the sum of mentions for the in- and
outdegree for a city, denoting the weighted in- and outdegree, is the first presented
centrality measure (Table 4-1).
Other computed weighted centrality measures were degree centrality, closeness
centrality, Eigenvector centrality and PageRank centrality (McCulloh, 2010), using the
igraph R package. The measures are standardized by dividing them by the highest
possible value, that is 1/(N-1) where N is the number of vertices in the graph. In the
1990s, the concepts of hubs and authorities were used to analyze the information
organization in hyperlinked networks (Kleinberg, 1999). Authoritative Web pages are
those that contain relevant information for questions posted on a specific search topic.
In the context of spatial social media networks, authorities can be thought of as
geographic locations that are frequently mentioned in tweets. A hub in hyperlinked
networks is a page that points to many good authorities. Again, in the context of social
media networks a hub could denote a city that posts frequently about other important
cities. Hub and authority scores on the graph of mentions were computed using the
igraph R package. Table 4-2 shows the Pearson correlations between some of these
measures, which are all significant at the 5% level. The bivariate correlations between
weighted degree centrality and Eigenvector centrality are close to one, meaning that the
latter (and more complex) centrality measure gives similar score rankings for cities as
the weighted degree centrality, which is simpler to understand.
Since closeness centrality shows how close a node is to other nodes in a
network, information from a node with high closeness would diffuse through the network
the fastest (McCulloh, 2010). For the analyzed mention network, closeness centrality
gives a similar score for most cities. Nashville, TN has the highest closeness centrality
(0.449) and Oakland, CA has the lowest closeness centrality (0.390). Table 4-3 shows
the cities ranked by their Kleinberg hub and authority scores. The highest ranked
“authority” is New York City with an authority score of 1. Las Vegas, Atlanta, Los
Angeles and Washington, D.C. follow with authority scores 0.490, 0.448, 0.436 and
0.387, respectively, showing that there is a wide range of authority values among
analyzed cities.
Since New York users mention Los Angeles and Washington, D.C. only 41 times
each, but Twitter users from Los Angeles mention New York City 63 times, Las Vegas
55 times and Atlanta and Chicago 25 times each, Los Angeles has a high hub score.
We therefore conclude that Twitter users from Los Angeles tend to mention popular
cities more than users in other cities of the United States and that New York is the most
popular city in the country.
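The mutual reinforcement between hubs and authorities can be sketched with a weighted power iteration. The study computed these scores with igraph; the function below is an independent, simplified illustration in which scores are scaled so that the maximum equals 1, and the toy edge weights are hypothetical.

```python
def hits(weights, iterations=100):
    """Kleinberg hub and authority scores by power iteration, scaled to a maximum of 1."""
    nodes = sorted({n for edge in weights for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of nodes pointing at it.
        new_auth = {n: 0.0 for n in nodes}
        for (source, target), w in weights.items():
            new_auth[target] += w * hub[source]
        # Hub: weighted sum of the authority scores of nodes it points to.
        new_hub = {n: 0.0 for n in nodes}
        for (source, target), w in weights.items():
            new_hub[source] += w * new_auth[target]
        max_a = max(new_auth.values()) or 1.0
        max_h = max(new_hub.values()) or 1.0
        auth = {n: v / max_a for n, v in new_auth.items()}
        hub = {n: v / max_h for n, v in new_hub.items()}
    return hub, auth


# Hypothetical toy network: A mentions B and C, B mentions A, D mentions B.
weights = {("A", "B"): 3, ("A", "C"): 2, ("B", "A"): 1, ("D", "B"): 1}
hub, auth = hits(weights)
print({n: round(v, 3) for n, v in auth.items()})
print({n: round(v, 3) for n, v in hub.items()})
```

In this toy network, node B is the top authority because it is mentioned most heavily, and node A is the top hub because it mentions the strongest authorities, which is the same logic as in the Los Angeles and New York example above.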
Reciprocity And Connectance
Reciprocity is the proportion of reciprocated links. For the entire graph of 432
cities, 28.3% of links are mutual (Snijders, 2011). Connectance is a global topological
measure that is computed as the number of existing links divided by the squared
number of nodes in the observed network (Dunne et al., 2002); hence, it is the fraction of
all possible links that are realized in a network. Table 4-4 shows reciprocity and
connectance values for the U.S. states with more than five cities used in the analyzed
network graph.
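Both measures can be written down compactly. The sketch below assumes that reciprocity is taken as the share of directed links whose reverse link also exists (other definitions based on dyads exist) and that the graph has no self-loops; the function names and the toy edge list are illustrative.

```python
def reciprocity(edges):
    """Share of directed links whose reverse link also exists."""
    edges = set(edges)
    if not edges:
        return 0.0
    mutual = sum(1 for (a, b) in edges if (b, a) in edges)
    return mutual / len(edges)


def connectance(edges, n_nodes):
    """Number of directed links divided by the squared number of nodes."""
    return len(set(edges)) / n_nodes ** 2


edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
print(reciprocity(edges))     # 0.5
print(connectance(edges, 4))  # 0.25
```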
Colorado (numbers in bold) has the highest reciprocity in mentions between
cities and the highest connectance. A schematic figure of mention patterns for Colorado
is shown in Figure 4-7. When considering the complete graph with all analyzed U.S.
cities, the connectance is much lower with a value of 0.011. Hence, as expected, cities
located within a state are better connected than cities across the entire country.
(Kwak et al., 2010) found that only 22.1% of Twitter users have a reciprocal
relationship in terms of follower behavior. As opposed to this, for the entire network of
U.S. cities, the percentage of cities that reciprocate mentions by at least one tweet is
higher (28.3%). The correlation between reciprocity and connectance at the state level
is 0.61 (p = 0.012).
Sentiment Analysis
Tweets convey textual information that can be quantified by its sentiment. In the
context of this work, it is of interest to see if the average sentiment score of tweets
associated with a city is related to communication tie variables. Text processing of
entire tweet posts (text and hashtags) was run for tweets that use city hashtags using
the “text2vec” R package (Selivanov, 2016), which implements the method in (Bryl,
2017). This approach uses a machine learning classifier that is trained on the
Sentiment140 corpus of 1.6 million tweets that was labeled using emoticons (Go et al., 2009). This
labeled dataset was divided into training and testing subsets in an 80:20 ratio. The
following text processing procedure was applied to the training subset. The vocabulary,
which is a list of all words found in the analyzed text documents (tweets
in this case), was cleared of stop words. Furthermore, a Document-Term Matrix (DTM)
was created and a term frequency – inverse document frequency (TF-IDF) model was
applied to the DTM. Next, a generalized linear model (GLM) classifier was trained using the “glmnet”
R package (Friedman et al., 2010), with the TF-IDF transformed DTM as
the independent and the existing sentiment label as the dependent variable. Then, the classifier
trained on the training subset was tested against the testing subset of tweets and showed an
area under the curve (AUC) of 0.875, which is generally considered good
(Vidya et al., 2015). Finally, the trained GLM classifier was used to classify the
sentiment of the tweets used in this study.
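The vocabulary cleaning and TF-IDF weighting steps can be sketched in pure Python. This uses the textbook definition (term frequency normalized by document length, inverse document frequency as log(N / df)), which may differ in detail from the weighting implemented in the text2vec package; the stop word list, the toy documents, and the function names are illustrative assumptions. The classifier training step is not reproduced here.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "to"}  # illustrative list only


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf(documents):
    """Return the sorted vocabulary and one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values())
        rows.append({
            term: (c / total) * math.log(n_docs / doc_freq[term])
            for term, c in counts.items()
        })
    return sorted(doc_freq), rows


docs = ["the game is great", "great win in boston", "traffic in boston is awful"]
vocab, rows = tfidf(docs)
print(vocab)
print({term: round(w, 3) for term, w in rows[0].items()})
```

Terms that occur in fewer documents, such as "game", receive a higher weight than terms shared across documents, such as "great". The resulting weights form the independent variables of the classifier.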
For the analysis, only tweets in English were used, based on the language
metadata setting of every tweet. Each tweet receives a value between zero and one that
expresses the probability of having a positive sentiment. Based on this, the weighted average
sentiment score was calculated for all of the city’s incoming tweets, where only cities
with more than 30 incoming tweets were used for the analysis to reduce data noise. A
total of 30 cities remained after this step. For the interpretation of mean values, cities
with notably high or low values were reviewed in more detail by looking at the context of
tweets associated with a city.
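The aggregation of tweet-level probabilities to city-level scores can be sketched as follows. The weighting scheme behind the weighted average is not specified in the text, so this sketch gives every incoming tweet equal weight, which is an assumption; the function name and the input format are likewise hypothetical.

```python
from collections import defaultdict


def city_sentiment(scored_mentions, min_tweets=30):
    """Average positive-sentiment probability per mentioned city.

    scored_mentions: iterable of (mentioned_city, probability) pairs.
    Cities with min_tweets or fewer incoming tweets are dropped.
    """
    by_city = defaultdict(list)
    for city, prob in scored_mentions:
        by_city[city].append(prob)
    return {
        city: sum(probs) / len(probs)
        for city, probs in by_city.items()
        if len(probs) > min_tweets
    }


# Toy input: city X has 31 incoming tweets, city Y only 30.
data = [("X", 0.5)] * 31 + [("Y", 0.9)] * 30
print(city_sentiment(data))
```

City Y is excluded because it does not exceed the threshold of 30 incoming tweets.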
The average incoming tweet sentiment ranged from 0.486 for Tulsa, OK, to
0.644 for Portland, OR. Many Twitter users that were tweeting about Tulsa did so in the
context of the Black Lives Matter movement. This topic was also frequently mentioned
in tweets about Charlotte, SC, which had a mean sentiment of 0.513. While geo-
hashtags and sentiment analysis detected actual events in these cities, for Roanoke,
VA, many tweets were about a fictive event in the television series American Horror
Story: Roanoke, which received a low average sentiment score of 0.560.
As opposed to this, tweets about Cleveland, OH, received tweets with a high
average sentiment value of 0.641. Hashtags often used together with #Cleveland were
#Windians, #RallyTogether, #Indians, which are related to the baseball team Cleveland
Indians. Therefore, high sentiment values can be indicative of sporting events. The
same was observed for Boston, MA, which earned a high average sentiment of 0.643
where #redsox, #RedSox (a baseball team from Boston) and #travel, #fall, #igboston
(all travel related) often occurred. Figure 4-8 shows the most frequently used words in
tweets with geo-hashtags of these cities, reflecting some of these topics. Furthermore,
Los Angeles (0.596), New York City (0.606), Chicago (0.605) and Atlanta (0.618)
received tweets with sentiment of similar magnitude, although tweets about sports events were not
predominant for these cities. Initial word clouds about New York showed that a Comic
Con was a commonly mentioned topic because the frequently used hashtag #nycc is
associated with that event. This means that #nycc was a dominant hashtag used with
New York geo-hashtags such as #nyc, #NYC, #NewYork, etc. To avoid other
words being masked by that event, this topic was removed from the word cloud in
Figure 4-8 F). Furthermore, in all word clouds spatial locations were removed as well,
e.g. words like Manhattan and Brooklyn, to obtain thematic topics instead. We can
conclude that sentiment analysis is able to detect events in smaller cities.
Next, the average sentiment score of a mentioned city was related to the
distance between the mentioned city and the mentioning city, where the distance variable
was transformed with the natural logarithm. Figure 4-9 A) and B) show the mean
sentiment of tweets for inter-city mentions for a total of 10 and 52 pairs of cities,
respectively, where each involved city pair had more than 30 and 10 tweets,
respectively. The thresholds of 30 and 10 were selected as minimum sample sizes for
the calculation of the mean sentiment. The plotted data points include New York, Los
Angeles, San Francisco, Detroit, Dallas, Washington D.C. and some of their
surrounding smaller cities for Figure 4-9 A). In Figure 4-9 B) only pairs of cities with
more than 25 mentions are annotated to avoid clutter.
The negative slope of the regression line for both subgraphs in Figure 4-9
indicates a general decrease of the mean sentiment score in tweets with increasing distance
between the mentioned and the mentioning city. Based on results from these cities, we
conclude that Twitter users “love thy neighbor”, i.e. they tweet more favorably about nearby cities.
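The relationship between mean sentiment and log-transformed distance can be sketched with a simple least-squares fit. The distances and sentiment values below are hypothetical and serve only to show the calculation; they are not data from this study.

```python
import math

def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical city pairs: distance in km and mean sentiment of mentions.
distances_km = [50, 400, 1500, 4000]
mean_sentiment = [0.66, 0.62, 0.60, 0.57]

# Distance is transformed with the natural logarithm before fitting.
log_distance = [math.log(d) for d in distances_km]
slope, intercept = ols_slope_intercept(log_distance, mean_sentiment)
```

A negative slope in such a fit corresponds to sentiment that declines as the distance between the two cities grows.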
Homophily and Heterophily
Further analysis was conducted to explore the processes underlying the inter-city
communication ties. More specifically, this section is concerned with individual
characteristics of cities (and city pairs) that drive homophily or heterophily. For this
purpose, one needs to examine the similarity or dissimilarity of individual characteristics
in city pairs that have higher mutual mentions compared to city pairs with fewer mutual
mentions. This will be achieved through regressing relational data on observed mention
data, using the Quadratic Assignment Procedure (QAP) regression (Krackhardt, 1987).
Data Preparation
City characteristics (nodal covariates)
Various city characteristics are expected to influence how frequently a city is
mentioned in hashtags and how strong mutual communication ties between city pairs
will be. The following attributes at the city level were used for the QAP regression model
to predict the tie strength (measured by mutual mentions) between cities:
Demographics. City population and number of housing units were aggregated from 2010 Census block data obtained from (Census, 2014).
Airports. The total number of passengers boarding in a city was derived from the commercial airports within the city area. Boarding numbers were obtained from (FAA, 2010).
Schools. The number of students enrolled per city was compiled for post-secondary education facilities from the Homeland Infrastructure Foundation-Level Data for the 2014-2015 school year. Types of schools include, among others, Doctoral/Research Universities, Masters Colleges and Universities, Baccalaureate Colleges, Associates Colleges, Theological Seminaries, and Medical Schools.
Occupation employment data. Occupational Employment Statistics for 2016 were obtained from the Bureau of Labor Statistics (BLS) at the city level. More specifically, BLS uses revised metropolitan area divisions (see https://www.bls.gov/oes/current/msa_def.htm). For each city the corresponding division could be matched with a U.S. Census Metropolitan or Micropolitan Area, except for Boston, where three Metropolitan Divisions had to be joined into a Metropolitan area. BLS employment data is subdivided into 22 broad and 1371 specific occupational categories. This study uses all broad and a few specific occupation categories as they relate to tourism and real estate development. Table 4-5 lists the occupations that were used as city covariates, where rows 1-22 are broad BLS occupational categories, rows 23-25 in boldface are specifically chosen BLS occupational categories, and occupations marked with a * were not used since they were not reported for each city. For each category the number of employees per city was divided by the total number of employees across all categories in that city and then multiplied by 1000. For further processing cities from the occupation data table had to be manually matched to the U.S. Census Metropolitan and Micropolitan Areas as the naming conventions were different. Aside from the hypothesized occupation predictors, a number of other occupations included in the model were of exploratory nature.
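The normalization of occupation counts to employees per 1000 can be written as a short sketch. The function name and the employment figures are hypothetical; the sketch assumes that the supplied categories together make up total employment in the city.

```python
def per_thousand(employment):
    """employment: {occupation: number of employees} for one city.
    Returns the number of employees per 1000 employees in any occupation."""
    total = sum(employment.values())
    return {occ: 1000 * count / total for occ, count in employment.items()}

# Hypothetical city with 1000 employees in total.
rates = per_thousand({"Education, training, and library": 50, "All other": 950})
```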
Exploration of occupation variables provided some expected patterns. For
example, Las Vegas has a high number of employees per 1000 who work in hotels (4.1
compared to a mean employment of 2.1 across all cities). Ithaca, NY (Cornell
University) with 164.5, Gainesville, FL (University of Florida) with 129.2, Merced, CA
(University of California) with 128.2, and Champaign-Urbana, IL (University of Illinois)
with 125.2 all have a high number of employees per 1000 employees in the education
sector compared to the mean of 65.2 across all analyzed cities. Out of 432 cities
mentioned or mentioning in hashtags, 316 could be matched to regions in the BLS
tables. The unmatched cities were excluded from the QAP regression since they were
small and had only a few incoming or outgoing mentions.
Dissimilarity and similarity matrices
In the QAP regression, an adjacency matrix for a social relation (in our case, the
number of mentions in tweet hashtags) is the dependent variable, whereas a set of
dyadic attribute similarity or dissimilarity matrices represents the independent variables.
Therefore, all individual-level data of cities need to be transformed to dyadic measures
of similarity or dissimilarity on which the social relation can be regressed. The individual
characteristics can be subdivided into attributes (numerical or categorical) and
affiliations. A set of rules for transforming individual level data to dyadic measures is
provided in (Koput, 2010). Individual level data in our dataset consist of single item
attributes. The match rule for categorical variables states that if two agents match in
terms of the category (e.g. being located in the same state) a 1 needs to be placed in a
cell for conversion to dyadic, otherwise 0. This approach results in a similarity matrix.
The state variable was the only measure with a similarity matrix in our dataset. The
absolute difference rule converts individual numerical data to dyadic by computing the
absolute difference between both agents in the corresponding cell, giving a dissimilarity
matrix.
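The two conversion rules can be sketched in a few lines of Python. The function names and sample values are hypothetical, and setting the diagonal of the match matrix to 0 (a city is not paired with itself) is an assumption of this sketch.

```python
def match_matrix(categories):
    """Match rule: cell (i, j) is 1 if agents i and j share a category,
    otherwise 0. The result is a similarity matrix."""
    n = len(categories)
    return [
        [1 if i != j and categories[i] == categories[j] else 0 for j in range(n)]
        for i in range(n)
    ]

def abs_diff_matrix(values):
    """Absolute difference rule: cell (i, j) is |value_i - value_j|.
    The result is a dissimilarity matrix."""
    return [[abs(a - b) for b in values] for a in values]

# Hypothetical example with three cities.
states = ["FL", "FL", "GA"]
population = [100, 250, 40]
same_state = match_matrix(states)
pop_difference = abs_diff_matrix(population)
```

Both matrices are symmetric, so each city pair carries the same value in both directions.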
Demographic and occupational variables were numeric. Hence, the dissimilarity
matrices were computed as an absolute difference between values for a pair of cities for
most variables. One specific case is the distance dissimilarity matrix which was
populated with the geodesic distances between pairs of cities in kilometers. To calculate
this distance the PostGIS function ST_Distance was used. All dyadic independent
variables (except for state) and the dependent variable were log transformed, similarly
to (Zahn, 1991).
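For illustration, a distance matrix of this kind can be approximated outside of PostGIS with the haversine formula. This is a swapped-in technique, not the ST_Distance computation used in this study: it assumes a spherical Earth with a mean radius of 6371.0088 km, and the coordinates below are approximate, hypothetical inputs.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_matrix(coords):
    """coords: list of (lat, lon) tuples. Returns the pairwise distance matrix."""
    return [[haversine_km(*a, *b) for b in coords] for a in coords]

# Approximate coordinates of two Florida cities (hypothetical inputs).
cities = [(29.6516, -82.3248), (25.7617, -80.1918)]
dist = distance_matrix(cities)
```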
Network Regression
QAP correlations were calculated using the UCINET 6 software package
(Borgatti et al., 2002). The highest correlations between the matrix of city mentions and
matrices of absolute differences in city attributes were found for airports (0.1, p = 0.001),
jobs (0.128, p = 0.001), population (0.131, p = 0.001) and schools (0.120, p = 0.001).
A multiple regression QAP (MRQAP for short) model (Dekker et al.,
2003) was used to model how the number of mentions between cities depends on
demographic and occupation variables, as well as airport passenger numbers in cities,
the distance between cities and state boundaries. MRQAP is unbiased under
multicollinearity conditions. Therefore, potential QAP correlations between independent
variable matrices were not examined. The regression itself was done using the R
package sna (Butts, 2016), closely following the method in (McFarland et al., 2010).
Standard OLS regression could not be used to predict network ties since observations are
not independent: dyads that involve mentions from the same city or of the same city are correlated. The MRQAP
regression does not calculate the standard error to determine statistical significance, but
instead randomly permutes the rows and columns of the matrix representing the dependent
variable in the same order. For each model in this study, 2000 such permutations were used, as
suggested by (Cook, 2012). OLS regression coefficients are then calculated from the
permuted matrices, and the observed coefficients are compared against this reference distribution to assess significance.
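The permutation logic can be sketched for the simplest case of one independent matrix. This is a basic QAP test that permutes the dependent matrix and refits a bivariate OLS slope; it is not the MRQAP implementation of the sna package, which handles several predictors at once. All names and the example matrices are hypothetical.

```python
import random

def off_diagonal(m):
    """Flatten a square matrix, skipping the diagonal (self-ties)."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(n) if i != j]

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def permute(m, order):
    """Relabel nodes: rows and columns are reordered with the SAME order."""
    n = len(m)
    return [[m[order[i]][order[j]] for j in range(n)] for i in range(n)]

def qap_slope_test(y, x, n_perm=2000, seed=0):
    """Return (observed slope, two-sided permutation p-value)."""
    rng = random.Random(seed)
    x_flat = off_diagonal(x)
    observed = slope(x_flat, off_diagonal(y))
    nodes = list(range(len(y)))
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(nodes)
        permuted_slope = slope(x_flat, off_diagonal(permute(y, nodes)))
        if abs(permuted_slope) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical 4-city example in which y is exactly twice x.
x = [[0, 1, 4, 2], [3, 0, 5, 7], [6, 8, 0, 9], [2, 5, 1, 0]]
y = [[2 * v for v in row] for row in x]
observed, p_value = qap_slope_test(y, x, n_perm=200, seed=1)
```

Permuting rows and columns together keeps the structure of each city's ties intact while breaking the link between the dependent and independent matrices, which is why the permuted slopes can serve as a reference distribution.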
Results of MRQAP regression provide coefficients (with their levels of
significance) describing the slope of a linear relationship between independent variables
and the dependent social relation (Koput, 2010). Four combinations of matrix type and
coefficient sign can occur in the regression results. If the independent variable is a
similarity matrix a positive coefficient indicates that greater similarity contributes to a
stronger tie. A negative coefficient for an independent variable that is coded as a
dissimilarity matrix would indicate that greater dissimilarity makes the tie less likely, or,
expressed differently, that greater similarity makes the tie more likely. Therefore, both
cases provide evidence for homophily. Heterophily is present for the remaining two
cases, i.e. where the independent variable is a similarity matrix and the coefficient is
negative, or where the independent variable is a dissimilarity matrix and the coefficient
is positive. Regression results will be interpreted with respect to these four cases.
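The four cases can be summarized in a small helper function. The function name and the string labels are hypothetical; the logic follows the rule that a tie process is homophily whenever greater similarity strengthens the tie.

```python
def tie_process(matrix_type, coefficient):
    """Classify a significant coefficient as homophily or heterophily.

    matrix_type: "similarity" or "dissimilarity"
    coefficient: estimated slope (only its sign is used)
    """
    if matrix_type not in ("similarity", "dissimilarity"):
        raise ValueError("matrix_type must be 'similarity' or 'dissimilarity'")
    if coefficient == 0:
        return "none"
    # Similarity strengthens ties if a similarity matrix has a positive sign
    # or a dissimilarity matrix has a negative sign.
    similarity_favored = (matrix_type == "similarity") == (coefficient > 0)
    return "homophily" if similarity_favored else "heterophily"

cases = {
    ("similarity", 1): tie_process("similarity", 1),
    ("dissimilarity", -1): tie_process("dissimilarity", -1),
    ("similarity", -1): tie_process("similarity", -1),
    ("dissimilarity", 1): tie_process("dissimilarity", 1),
}
```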
Estimated results for three different regression models are reported in Table 4-6
where only the arithmetic sign of coefficients (but not their magnitude) and their level of
significance are shown. The three models include subgraphs of cities that have a
minimum indegree of 10, 30 and 50 respectively and therefore take a more prominent
role in the network compared to the remaining (excluded) cities. Subgraphs were
analyzed since the model for the entire graph explained only a very small percentage of
the variation in the dependent variable, namely 4%. Only variables that are significant in
at least one of the three models are reported in the table. For variables with a dissimilarity matrix, a *, **, or *** in
the left (negative) column of a model denotes homophily, whereas
asterisks in the right (positive) column reveal heterophily. For the only variable
that uses a similarity matrix (state variable), asterisks in the right column
indicate homophily.
Results in Table 4-6 show that the R-squared increases with more stringent city
filters and is highest for model 3. This can be explained by a larger number of
mentions across the participating cities and hence less noise that comes from random
communication ties. Three variables (highlighted in boldface) are significant in all three
models. A few regression outcomes will be looked at in more detail. The positive and
significant population variable means that city pairs that differ strongly in population
are likely to form connections. This could come from the fact that big cities get many
mentions from smaller neighbors.
The negative sign for distance means that cities that are further apart will be less
likely to form connections. Hence, as expected, a closer proximity between cities has a
positive effect on the formation of the mention ties. For the first three models, cities
located within a state have stronger ties than across state boundaries, showing
evidence of homophily for this variable and confirming findings from section 4.4 that
cities within states are better connected. It also demonstrates a close relation between
online social networks and the underlying ‘real world’ geography (Stephens & Poorthuis,
2014).
In addition, the school enrollment variable indicates heterophily, which could
suggest that, similar to the population variable, cities with smaller student population are
those predominantly tweeting about cities with larger educational institutions.
The number of workers in building and grounds cleaning related occupations
expresses heterophily in all three models. The number of employees in this field is very
high for touristic cities, such as Las Vegas (63.7) or Orlando (49.2), but lower for others,
such as Charlotte, Atlanta and Los Angeles (value of around 25), although those cities
receive many mentions. This variable is also relatively high for Ithaca, NY (~44), which is
home to Cornell University. It seems that this variable indicates touristic cities, although
it is only moderately correlated with hotels (Pearson R2 = 0.538). This occupation
comprises mostly janitors (~50%) and landscaping workers (~20%). This could indicate
that leading cities in some other aspects (e.g. tourism or education), which also require
a large workforce in building maintenance, tend to receive many mentions from cities that are less
prominent in these aspects.
Discussion And Conclusions
This study analyzes the prominence of cities as well as the interaction between
cities using hashtag mentions in tweets. It therefore expands on more traditional measures
of city prominence (e.g. presence of prominent corporations) or city ties (e.g. commodity
flows). In addition, it explains processes leading to stronger or weaker ties between
cities using the Quadratic Assignment Procedure.
New York City is the most popular city when considering the number of
mentions, where many tweets that use prominent hashtags, such as #newyorkcity, #nyc
and #newyork, are related to travel, Central Park, photography and fashion. City
popularity can have a dynamic component, that is, one based on events of limited duration.
Examples are hashtags #KeithLamontScott, #BlackLivesMatter and #CharlotteProtest
that are often used together with the city hashtag #Charlotte. The countrywide network
of city mentions exhibits a scale-free topology, which means that only a few cities are
connected to many others, whereas the majority of cities are connected to only a few
others. Similarly, incoming and outgoing mentions closely follow a power-law
distribution, giving the network a scale-free property in the aspect of city prominence. At
the state level, cities are best connected within Colorado, which also shows the
highest reciprocity of mention relationships among all U.S. states.
Further network centrality measures were calculated and compared to identify
the most influential cities. Since weighted degree centrality is highly correlated with
Eigenvector and PageRank centralities, the former property by itself accurately identifies the
most popular cities. New York has the highest node strength and thus receives the most
mentions from other cities. Kleinberg authority scores confirmed New York as the most
popular city in the country and revealed that Los Angeles tweets mostly about other
popular cities. In terms of information diffusion, several cities have similarly high
closeness values. Among them, Nashville, TN, has the highest value and would in this
respect be the best entry point for a fast spread of news across the network of city
mentions.
Connectance and reciprocity measures suggest that state-level subgraphs of
mentions are, as can be expected, better connected than the graph for the entire
country, confirming earlier studies which suggest a close relationship between
geographic and network space in terms of communication clustering.
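Connectance and reciprocity can be computed from a directed edge list as sketched below. The definitions used here are assumptions of this sketch: connectance is the number of directed edges divided by n(n-1), and reciprocity is the share of directed edges that have a counterpart in the opposite direction. They are consistent with the Colorado row of Table 4-4 if that subgraph of 7 nodes has 9 directed edges of which 6 are reciprocated. The edge list itself is hypothetical.

```python
def connectance(n_nodes, edges):
    """Share of all possible directed edges (no self-loops) that are present."""
    return len(set(edges)) / (n_nodes * (n_nodes - 1))

def reciprocity(edges):
    """Share of directed edges whose reverse edge is also present."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

# Hypothetical mention graph with 7 cities: three mutual pairs (6 edges)
# plus three one-way mentions, i.e. 9 directed edges in total.
edges = [
    ("Denver", "Boulder"), ("Boulder", "Denver"),
    ("Denver", "Aurora"), ("Aurora", "Denver"),
    ("Denver", "Fort Collins"), ("Fort Collins", "Denver"),
    ("Pueblo", "Denver"), ("Lakewood", "Denver"), ("Arvada", "Boulder"),
]
graph_connectance = connectance(7, edges)
graph_reciprocity = reciprocity(edges)
```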
Sentiment analysis identified events in smaller cities. Geo-hashtags also allowed
the analysis of the connection between sentiment and the distance between the involved
cities. The moderate negative correlation (R2=0.35) can be interpreted as an attenuation
in the sentiment of tweets with an increase in distance. Therefore, we conclude that
Twitter users tend to tweet more favorably about their neighboring cities. A possible
extension of this analysis would be to include tweets from a longer time period.
QAP regression shed some light on the factors that play a significant role in
communication ties between cities. For example, closer geographic proximity as well as
the location in the same state led to stronger communication ties, displaying examples
of homophily. For other variables, such as population, a larger difference in attribute
levels leads to stronger ties, showcasing heterophily, where larger cities may attract a
disproportionately higher number of mentions from small cities than is the case for
cities of approximately equal size.
This regression methodology can be extended to account for additional city
covariates. It is important to emphasize that this analysis does not necessarily show the
true (i.e. long-term) popularity of analyzed cities or city ties, but may be biased by short
term events or name ambiguity. For example, some of the tweets that were manually
examined and contained the #Atlanta hashtag were posting about a television show
called Atlanta (hence tweeting about that show and not necessarily about the city). Also,
some tweets containing #LA used it for the state of Louisiana and not for the city of Los
Angeles, for which it is typically used. To get more accurate results, future work
therefore calls for advanced filtering techniques for hashtags, e.g. using advanced text
processing, geo-ontologies, similarity measures, and thesauri. Another potential step
further in this analysis would be to extend the study to worldwide cities, more closely
resembling the idea of a world city (Freeman, 1977). The obvious challenge would be to
obtain attribute data for the production of dyadic relations, such as occupational
employment statistics, for countries across the world, which will hopefully be facilitated
through an increasing number of open data initiatives around the world.
Figure 4-1. Setup of world regions used for Twitter data download.
{
  "id": "42e46bc3663a4b5f",
  "url": "https://api.twitter.com/1.1/geo/id/42e46bc3663a4b5f.json",
  "name": "Fort Worth",
  "country": "United States",
  "full_name": "Fort Worth, TX",
  "attributes": {},
  "place_type": "city",
  "bounding_box": {
    "type": "Polygon",
    "coordinates": [
      [
        [-97.538285, 32.569477],
        [-97.538285, 32.990456],
        [-97.033542, 32.990456],
        [-97.033542, 32.569477]
      ]
    ]
  },
  "country_code": "US"
}
Figure 4-2. Country place tag in geo-tagged tweets JSON file.
Figure 4-3. Locations of originating cities of tweets (green polygons) and density of mentioned cities (blueish Kernel density map).
Figure 4-4. Force directed layout for a sub-graph of cities that have more than 30
incoming mentions.
Figure 4-5. Distribution of weighted and unweighted distances (in km) between U.S. cities.
Figure 4-6. Power law fitting the distribution of the weighted indegree and weighted
outdegree of the city mentions graph.
Figure 4-7. A network of mentions between cities in Colorado (link width is proportionate
to edge weights).
Figure 4-8. Word clouds of the words most used with some of the analyzed geo-hashtags: A) Cleveland, OH, B) Roanoke, VA, C) Boston, MA, D) Tulsa, OK, E) Charlotte, NC, F) New York, NY.
Figure 4-9. Mean sentiment value of tweets between pairs of cities plotted against distance (in 1000s of km) between pairs of cities. A) links with more than 30 mentions, B) links with more than 10 mentions.
Table 4-1. Cities with highest weighted indegree and outdegree (strength): A) Indegree, B) Outdegree
A)
City                   Indegree
New York, NY           660
Atlanta, GA            352
Los Angeles, CA        349
Boston, MA             303
Chicago, IL            279
Las Vegas, NV          270
Charlotte, NC          252
Washington, DC         251
San Francisco, CA      212
Detroit, MI            203
Miami, FL              192
Philadelphia, PA       150
Dallas, TX             133
Seattle, WA            124
Cleveland, OH          118
Nashville, TN          112
Houston, TX            87
Denver, CO             85
San Diego, CA          81
Portland, OR           74
B)
City                   Outdegree
Los Angeles, CA        385
New York, NY           385
Cambridge, MA          206
Chicago, IL            174
Washington, DC         160
Oakland, CA            117
Warren, MI             116
San Francisco, CA      109
Atlanta, GA            106
Long Island, NY        102
Anaheim, CA            97
Houston, TX            97
Fort Worth, TX         86
San Diego, CA          82
Seattle, WA            81
Miami, FL              78
Fort Lauderdale, FL    76
Newark, NJ             73
Dallas, TX             72
Phoenix, AZ            70
Table 4-2. Pearson correlation between weighted centrality measures
Pearson correlation          Degree centrality   Eigenvector centrality   Kleinberg Authority Score
Degree centrality            1.000               0.950                    0.732
Eigenvector centrality       0.950               1.000                    0.716
Kleinberg Authority Score    0.732               0.716                    1.000
Table 4-3. City ranking based on closeness centrality, together with Kleinberg hub and authority scores: A) Authority scores, B) Hub scores.
A)
City                   Authority score
New York, NY           1.000
Las Vegas, NV          0.490
Atlanta, GA            0.448
Los Angeles, CA        0.436
Washington, DC         0.387
Chicago, IL            0.360
San Francisco, CA      0.307
Miami, FL              0.248
Detroit, MI            0.234
Charlotte, NC          0.203
Philadelphia, PA       0.179
San Diego, CA          0.166
Dallas, TX             0.154
Houston, TX            0.124
Seattle, WA            0.122
Nashville, TN          0.120
Cleveland, OH          0.101
Denver, CO             0.097
Anaheim, CA            0.064
Tulsa, OK              0.063
B)
City                   Hub score
Los Angeles, CA        1.000
New York, NY           0.668
Long Island, NY        0.561
Chicago, IL            0.401
Newark, NJ             0.395
Washington, DC         0.355
Miami, FL              0.264
Anaheim, CA            0.260
Oakland, CA            0.237
Atlanta, GA            0.230
Houston, TX            0.218
Philadelphia, PA       0.207
Warren, MI             0.195
San Francisco, CA      0.194
Seattle, WA            0.164
San Diego, CA          0.157
Fort Lauderdale, FL    0.150
Dallas, TX             0.147
Fort Worth, TX         0.147
Riverside, CA          0.127
Table 4-4. City mentions state subgraph indicators
State Reciprocity Connectance Number of nodes
AL 0.000 0.044 10
CA 0.500 0.108 27
CO 0.667 0.214 7
FL 0.372 0.102 21
GA 0.182 0.122 10
IL 0.167 0.109 11
MD 0.500 0.190 7
MI 0.167 0.066 14
NC 0.211 0.122 13
NJ 0.333 0.083 9
NY 0.421 0.144 12
OR 0.000 0.133 6
PA 0.000 0.044 17
SC 0.250 0.190 7
TX 0.333 0.065 22
WA 0.545 0.122 10
Table 4-5. Mean number of employees in given occupation per 1000 employees in any occupation across all analyzed cities, and its standard error (standard deviation of the mean). Categories in boldface (rows 23-25) highlight specific occupational categories whereas those in regular font show broad occupation categories
Row   Occupations                                       Mean number of employees in given      Standard Error (Standard
                                                        occupation per 1000 employees          Deviation of the Mean)
1     Architecture and engineering                      16.5                                   0.51
2     Arts, design, entertainment, sports, and media    10.5                                   0.21
3     Building and grounds cleaning and maintenance     32.5                                   0.39
4     Business and financial operations                 41.9                                   0.81
5     Community and social service                      15.5                                   0.26
6     Computer and mathematical                         21.1                                   0.75
7     Construction and extraction                       40.4                                   0.69
8     Education, training, and library                  65.2                                   0.86
9     Farming, fishing, and forestry                    5.4                                    0.94
10    Food preparation and serving related              98.6                                   0.88
11    Healthcare practitioners and technical            64.8                                   0.76
12    Healthcare support                                30.3                                   0.43
*13   Installation, maintenance, and repair             41.5                                   0.40
14    Legal                                             5.7                                    0.15
15    Life, physical, and social science                7.7                                    0.29
16    Management                                        44.9                                   0.62
17    Office and administrative support                 153.7                                  0.78
18    Personal care and service                         32.8                                   0.54
19    Production                                        71.1                                   1.95
*20   Protective service                                23.3                                   0.48
21    Sales and related                                 106.3                                  0.76
22    Transportation and material moving                66.1                                   0.96
23    Hotels                                            2.1                                    0.07
24    Retail                                            49.6                                   0.47
25    Real estate                                       2.2                                    0.07
Table 4-6. Arithmetic signs of estimated coefficients from Multivariate QAP regression on three models
Model #                                 1              2              3
Subgraph selection criterion            Indegree>10    Indegree>30    Indegree>50
# of nodes                              57             33             21
Reciprocity                             .556           .710           .847
Arithmetic sign of slope coefficient    -  +           -  +           -  +
Numerical variables (dissimilarity matrix)
Airports *
Art and design *
Building grounds cleaning *** *** **
Distance ** * **
Farming *
Healthcare practitioners *
Healthcare support *
Hotels ** *** **
Personal services *
Population ** **
Production *** *
Retail **
Schools *
Categorical variables (similarity matrix)
State *** **
Adjusted R squared 0.192 0.290 0.426
Signif. codes: *** p < 0.001; ** p < 0.01; * p < 0.05
CHAPTER 5 CONCLUSIONS
The conducted case studies enhance the understanding of geospatial patterns of
information exchange and propagation through Twitter. Some of the factors that
influence the spread of information are also identified. The achieved results can be used
to improve and optimize a variety of real-world applications in the social media and
information science domain.
The exploration of Twitter and Instagram photos helped to better understand the
two VGI sources. Analyzed spatial offsets between object location and photo upload
location varied significantly across the continents, where the offset distance was
smallest in the United States, followed by Europe and other analyzed continents,
potentially indicating different availability levels of mobile Internet. Twitter places and
Instagram location tags were found to be available at a different spatial granularity.
Twitter places were generally found to be available at the neighborhood and city level.
The wide availability of user-contributed points of interest in Instagram facilitated the
analysis of their accuracy, revealing, for example, multiple labels and locations to
indicate the same point of interest.
A comprehensive analysis of Twitter images and other tweet content formats in
response to the Paris attacks answered several questions related to information
propagation. Tweets with images, when compared to tweets with hashtags or text
related to the attacks, received the highest attention from the Twitter community.
Journalists, whose tweets earned a higher popularity than those of non-journalists,
posted a large portion of the images. This indicates that journalists perhaps use Twitter
more actively than users for whom Twitter is not a professional work tool. An
expansion of this work could be the analysis of Twitter images during other types of
events. To circumvent the unavailability of location information of retweets due to
technical limitations of the Twitter API, a new method of locating retweets was applied.
That is, retweets were located based on geolocated tweets by the same users posted
within a short period of time before and after the retweet. This method resulted in a
denser map of retweets in Europe indicating higher safety concerns and closer
proximity to the terrorist attack locations.
The geographic distribution of hashtags showed that hashtags in French are not
immune to language barriers, since tweets with French hashtags were used mainly in
francophone territories, whereas hashtags in English were posted all over the world.
Analysis of the temporal distribution of the hashtags showed that the public interest in
the attacks peaked on the day after the events and diminished two to three days later.
The spatiotemporal regression model of the hashtag spread showed that the
number of hashtags in places around the country depends on the number of tweets in
the main city in the country (the “tweeting capital” of the country). This suggests a two-
level hierarchical structure of the spread within a country from the tweeting capital to
surrounding cities.
The third case study introduced new measures of popularity of U.S. cities and
modeled their interaction through geo-hashtags. The network of inter-city mentions was
found to be a scale-free network. Network centrality measures identified New York as
the most popular city and showed that Twitter users from Los Angeles mention popular cities
more than users from any other place in the U.S. The sentiment analysis identified
events in some smaller cities and showed a weak decline of the average
sentiment of tweets with the distance between cities. A further network regression model
identified significant factors in communication between cities. Namely, cities receive
most mentions from smaller neighboring cities and from cities in the same state. In the
case of state and distance, the underlying process is homophily since connections are
more likely to occur between similar cities. For population, the underlying process is
heterophily since connections are more likely to occur between cities that are different.
The network regression can be extended by including additional relevant city
descriptors and by analyzing tweets collected over a longer period of time.
In conclusion, Twitter is a valuable and rich source of geographic information.
Beneficiaries of a better understanding of information flow through Twitter can be
governments, marketing research companies and emergency management
organizations, to name a few. The format of tweets, their theme (or content category)
and user profession were found to be important in relaying the information to the world
and the news landscape. All these factors affect the popularity of tweets, which is found
to play a significant role in raising geographic situational awareness in emergency
situations such as terrorist attacks. Inter-city mentions can be viewed as information
flows from the mentioned city to the mentioning city. Some of the analyzed cases
showed that events in the mentioned cities were sources of information. The model
proposed in the second study can be used by emergency management authorities for
geographic and temporal prediction of the intensity of Twitter users’ reaction. A
potential application of the model of inter-city mentions is the
dimensioning of airline traffic between cities.
LIST OF REFERENCES
Achananuparp, P., Lim, E.-P., Jiang, J., & Hoang, T.-A. (2012). Who is Retweeting the Tweeters? Modeling, Originating, and Promoting Behaviors in the Twitter Network. ACM Transactions on Management Information Systems, 3(3), 13:1–13:30. https://doi.org/10.1145/2361256.2361258
Adai, A. T., Date, S. V., Wieland, S., & Marcotte, E. M. (2004). LGL: Creating a map of protein function with an algorithm for visualizing very large biological networks. Journal of Molecular Biology, 340(1), 179–190. https://doi.org/10.1016/j.jmb.2004.04.047
Alivand, M., & Hochmair, H. H. (2017). Spatiotemporal analysis of photo contribution patterns to Panoramio and Flickr. Cartography and Geographic Information Science, 44(2), 170–184. https://doi.org/10.1080/15230406.2016.1211489
Andrienko, G., Andrienko, N., Bosch, H., Ertl, T., Fuchs, G., & Jankowski, P. (2013). Thematic Patterns in Georeferenced Tweets through Space-Time Visual Analytics. Computing in Science & Engineering, 15(13), 72–82. https://doi.org/10.1109/MCSE.2013.70
Aslam, S. (2018). Twitter by the Numbers: Stats, Demographics & Fun Facts. Retrieved from https://www.omnicoreagency.com/twitter-statistics/
Bakshy, E., Hofman, J., Mason, W., & Watts, D. (2011). Everyone’s an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM ’11) (pp. 65–74). ACM. https://doi.org/10.1145/1935826.1935845
Barrat, A., Barthélemy, M., Pastor-Satorras, R., & Vespignani, A. (2004). The architecture of complex weighted networks. In Proceedings of the National Academy of Sciences of the United States of America (Vol. 101, pp. 3747–3752). https://doi.org/10.1073/pnas.0400087101
Barthélemy, M. (2011). Spatial networks. Physics Reports, 499(1–3), 1–101. https://doi.org/10.1016/j.physrep.2010.11.002
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), 10008–10020. https://doi.org/10.1088/1742-5468/2008/10/P10008
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., & Hwang, D.-U. (2006). Complex networks: Structure and dynamics. Physics Reports, 424(4–5), 175–308.
Borgatti, S. P., Everett, M. G., & Freeman, L. C. (2002). Ucinet 6 for Windows: Software for Social Network Analysis. Harvard, MA: Analytic Technologies.
Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (2009). Network Analysis in the Social Sciences. Science, 323(5916), 892–895. https://doi.org/10.1126/science.1165821
Brennan, S., Sadilek, A., & Kautz, H. (2013). Towards understanding global spread of disease from everyday interpersonal interactions. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence (pp. 2783–2789). Beijing, China: AAAI Press.
Bryl, S. (2017). Machine Learning in R using doc2vec approach. Retrieved from https://analyzecore.com/2017/02/08/twitter-sentiment-analysis-doc2vec/
Burt, R. S. (1995). Structural Holes: The Social Structure of Competition. Harvard University Press.
Can, E. F., Oktay, H., & Manmatha, R. (2013). Predicting retweet count using visual cues. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 1481–1484. https://doi.org/10.1145/2505515.2507824
Census. (2014). 2014 TIGER/Line® Shapefiles: Blocks (2010). Retrieved from https://www.census.gov/cgi-bin/geo/shapefiles2014/layers.cgi
Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, K. P. (2010). Measuring User Influence in Twitter: The Million Follower Fallacy. International AAAI Conference on Weblogs and Social Media, 10–17. https://doi.org/10.1.1.167.192
Chang, H. C. (2010). A new perspective on Twitter hashtag use: Diffusion of innovation theory. Proceedings of the ASIST Annual Meeting, 47. https://doi.org/10.1002/meet.14504701295
Cheng, Z., Caverlee, J., & Lee, K. (2010). You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 759–768). New York, NY, USA: ACM. https://doi.org/10.1145/1871437.1871535
Chong, M. (2016). Sentiment analysis and topic extraction of the twitter network of #prayforparis. Proceedings of the Association for Information Science and Technology, 53(1), 1–4. https://doi.org/10.1002/pra2.2016.14505301133
Compston, S. (2014). Identifying and Understanding Retweets & Quote Tweets. Retrieved January 10, 2017, from http://support.gnip.com/articles/identifying-and-understanding-retweets.html
Cook, J. M. (2012). Gender, voting and cosponsorship in the Maine State legislature. New England Journal of Political Science, IV(1), 1–30.
Csardi, G., & Tamas, N. (2006). The igraph software package for complex network research. InterJournal.
Cvetojevic, S., Juhász, L., & Hochmair, H. H. (2016). Positional Accuracy of Twitter and Instagram Images in Urban Environments. GI_Forum 2016, 1, 191–203. https://doi.org/10.1553/giscience2016_01_s191
De Longueville, B., & Smith, R. S. (2009). “OMG, from here, I can see the flames!”: a use case of mining Location Based Social Networks to acquire spatio-temporal data on forest fires. In Proceedings of the 2009 International Workshop on Location Based Social Networks (LBSN ’09) (pp. 73–80). Seattle, Washington, USA. https://doi.org/10.1145/1629890.1629907
De Rosario Martínez, H. (2015). Analysing interactions of fitted models. https://doi.org/10.1007/s13398-014-0173-7.2
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
Dekker, D., Krackhardt, D., & Snijders, T. (2003). Multicollinearity robust QAP for multiple regression. NAACSOS Conference, Omni William Penn., 1–5.
Derudder, B., Taylor, P. J., Hoyler, M., Ni, P., Liu, X., Zhao, M., … Witlox, F. (2013). Measurement and Interpretation of Connectivity of Chinese Cities in World City Network, 2010. Chinese Geographical Science, 23(3), 261–273.
Duggan, M., Ellison, N. B., Lampe, C., Lenhart, A., & Madden, M. (2015). Demographics of Key Social Networking Platforms. Retrieved January 10, 2017, from http://www.pewinternet.org/2015/01/09/demographics-of-key-social-networking-platforms-2/
Dunne, J. A., Williams, R. J., & Martinez, N. D. (2002). Food-web structure and network theory: The role of connectance and size. PNAS, 99(20), 12917–12922. https://doi.org/10.1073/pnas.192407699
Evangelopoulos, N., Ashton, T., Winson-Geideman, K., & Roulac, S. (2015). Latent Semantic Analysis and Real Estate Research: Methods and Applications. Journal of Real Estate Literature, 23(2), 353–380. https://doi.org/10.5555/0927-7544.23.2.353
FAA. (2010). Passenger Boarding (Enplanement) and All-Cargo Data for U.S. Airports - Previous Years. Retrieved from https://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/previous_years/#2000
Ferguson, C., Inglis, S. C., Newton, P. J., Cripps, P. J. S., Macdonald, P. S., & Davidson, P. M. (2014). Social media: A tool to spread information: A case study analysis of Twitter conversation at the Cardiac Society of Australia & New Zealand 61st Annual Scientific Meeting 2013. Collegian, 21(2), 89–93. https://doi.org/10.1016/j.colegn.2014.03.002
Fernandez, R. M., Castilla, E. J., & Moore, P. (2000). Social capital at work: networks and employment at a phone center. American Journal of Sociology, 105(5), 1288–1356.
Fischer, F. (2012). VGI as big data. A new but delicate geographic data-source. GeoInformatics, 15(3), 46–47.
Freeman, L. C. (1977). A Set of Measures of Centrality Based on Betweenness. Sociometry, 40(1), 35–41.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Gilbert, E., & Karahalios, K. (2009). Predicting Tie Strength With Social Media. In CHI 2009. Boston, Massachusetts, USA: ACM.
Go, A., Bhayani, R., & Huang, L. (n.d.). Twitter Sentiment Classification using Distant Supervision.
Goodchild, M. F. (2007). Citizens as Voluntary Sensors: Spatial Data Infrastructure in the World of Web 2.0. International Journal of Spatial Data Infrastructures Research, 2, 24–32.
Goolsby, R. (2010). Social Media as Crisis Platform: The Future of Community Maps/Crisis Maps. ACM Transactions on Intelligent Systems and Technology, 1(1), Article 7.
Graham, M., Hale, S. A., & Gaffney, D. (2014). Where in the World Are You? Geolocation and Language Identification in Twitter. Professional Geographer, 66(4), 568–578. https://doi.org/10.1080/00330124.2014.907699
Granovetter, M. S. (1973). The strength of weak ties. American Journal of Sociology, 78(6), 1360–1380.
Gründemann, T., & Burghardt, D. (2016). Visual Analysis of Thematic, Social and Geospatial Patterns of Microblogging Content Using D3. In LinkVGI workshop in association with the 19th AGILE Conference on Geographic Information Science.
Guille, A., Hacid, H., Favre, C., & Zighed, D. A. (2013). Information Diffusion in Online Social Networks: A Survey. SIGMOD Record, 42(2), 17–28. https://doi.org/10.1145/2503792.2503797
Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2014). Geo-located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science, 41(3), 260–271. https://doi.org/10.1080/15230406.2014.890072
Hochmair, H. H., & Cvetojevic, S. (2014). Assessing the Usability of Georeferenced Tweets for the Extraction of Travel Patterns: A Case Study for Austria and Florida. GI_Forum 2014, 30–39. https://doi.org/10.1553/giscience2014s30
Hong, L., Convertino, G., & Chi, E. H. (2011). Language Matters in Twitter: A Large Scale Study. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (pp. 518–521).
Hübl, F., Cvetojevic, S., Hochmair, H. H., & Paulus, G. (2017). Analyzing Refugee Migration Patterns using Geo-tagged Tweets. ISPRS International Journal of Geo-Information, 6(10), 302. https://doi.org/10.3390/ijgi6100302
Hung, K.-C., Kalantari, M., & Rajabifard, A. (2016). Methods for assessing the credibility of volunteered geographic information in flood response: A case study in Brisbane, Australia. Applied Geography, 68, 37–47. https://doi.org/10.1016/j.apgeog.2016.01.005
Jahng, M. R., & Littau, J. (2016). Interacting is believing: Interactivity, social cue, and perceptions of journalistic credibility on twitter. Journalism & Mass Communication Quarterly, 93(1), 38–58. https://doi.org/10.1177/1077699015606680
Jurdak, R., Zhao, K., Liu, J., AbouJaoude, M., Cameron, M., & Newth, D. (2015). Understanding human mobility from Twitter. PLoS ONE, 10(7). https://doi.org/10.1371/journal.pone.0131469
Jurgens, D. (2013). That’s What Friends Are For: Inferring Location in Online Social Media Platforms Based on Social Relationships. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media (pp. 273–282).
King, G., Lam, P., & Roberts, M. E. (2017, March). Computer-Assisted Keyword and Document Set Discovery from Unstructured Text. American Journal of Political Science. https://doi.org/10.1111/ajps.12291
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632. https://doi.org/10.1145/324133.324140
Koput, K. W. (2010). Social Capital: An Introduction to Managing Networks. Cheltenham, UK: Edward Elgar.
Kotzias, D., Lappas, T., & Gunopulos, D. (2014). Addressing the sparsity of location information on twitter. In CEUR Workshop Proceedings (Vol. 1133, pp. 339–346). https://doi.org/10.1.1.429.2390
Krackhardt, D. (1987). QAP Partialling as a Test of Spuriousness. Social Networks, 9(2), 171–186.
Krackhardt, D. (1988). Predicting With Networks: Nonparametric Multiple Regression Analysis of Dyadic Data. Social Networks. https://doi.org/10.1016/0378-8733(88)90004-4
Kresl, P. K. (1995). The Determinants of Urban Competitiveness: A Survey. In P. K. Kresl & G. Gappert (Eds.), North American Cities and the Global Economy (pp. 45–68). London: Sage.
Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? Proceedings of the 19th International Conference on World Wide Web. Raleigh, North Carolina, USA: ACM. https://doi.org/10.1145/1772690.1772751
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240. https://doi.org/10.1037/0033-295X.104.2.211
Lansley, G., & Longley, P. A. (2016). The geography of Twitter topics in London. Computers, Environment and Urban Systems, 58, 85–96. https://doi.org/10.1016/j.compenvurbsys.2016.04.002
Lenormand, M., Gonçalves, B., Tugores, A., & Ramasco, J. J. (2015). Human diffusion and city influence. Journal of The Royal Society Interface, 12(109), 20150473. https://doi.org/10.1098/rsif.2015.0473
Lenormand, M., Tugores, A., Colet, P., & Ramasco, J. J. (2014). Tweets on the road. PLoS ONE, 9(8). https://doi.org/10.1371/journal.pone.0105407
Lerman, K., & Ghosh, R. (2010). Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (Vol. V, pp. 90–97). Washington, D.C. https://doi.org/10.1146/annurev.an.03.100174.001431
Levinson, D. (2008). Density and dispersion: The co-development of land use and rail in London. Journal of Economic Geography, 8(1), 55–77. https://doi.org/10.1093/jeg/lbm038
Li, L., & Goodchild, M. F. (2010). The Role of Social Networks in Emergency Management. International Journal of Information Systems for Crisis Response and Management, 2(4), 48–58. https://doi.org/10.4018/jiscrm.2010100104
Li, Y., Li, Q., & Shan, J. (2017). Discover Patterns and Mobility of Twitter Users—A Study of Four US College Cities. ISPRS International Journal of Geo-Information, 6(2), 42. https://doi.org/10.3390/ijgi6020042
Liu, Y., Sui, Z., Kang, C., & Gao, Y. (2014). Uncovering Patterns of Inter-Urban Trip and Spatial Interaction from Social Media Check-In Data. PLoS ONE, 9(1), e86026.
Longley, P. A., & Adnan, M. (2016). Geo-temporal Twitter demographics. International Journal of Geographical Information Science, 30(2), 369–389. https://doi.org/10.1080/13658816.2015.1089441
Lotan, G., Graeff, E., Ananny, M., Gaffney, D., Pearce, I., & Boyd, D. (2011). The Revolutions Were Tweeted: Information Flows during the 2011 Tunisian and Egyptian Revolutions. International Journal of Communication, 5, 1375–1405. https://doi.org/1932–8036/2011FEA1375
MacEachren, A. M., Robinson, A. C., Jaiswal, A., Pezanowski, S., Savelyev, A., Blanford, J., & Mitra, P. (2011). Geo-twitter analytics: Applications in crisis management. In 25th International Cartographic Conference (pp. 3–8).
Malik, M. M., Lamba, H., Nakos, C., & Pfeffer, J. (2015). Population Bias in Geotagged Tweets. 9th International AAAI Conference on Weblogs and Social Media, 18–27.
McCulloh, I. (2010). Network Topology Effects on Correlation between Centrality Measures. Connections, 30(1), 21–28.
McFarland, D., Messing, S., Nowak, M., & Westwood, S. J. (2010). Social Network Analysis Labs in R.
Mislove, A., Lehmann, S., Ahn, Y., Onnela, J., & Rosenquist, J. N. (2011). Understanding the Demographics of Twitter Users. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (pp. 554–557).
Moffitt, J. (2014). Twitter Geographical Metadata. Retrieved from http://support.gnip.com/articles/geo-intro.html
Myers, S. A., Sharma, A., Gupta, P., & Lin, J. (2014). Information Network or Social Network? The Structure of the Twitter Follow Graph. In Proceedings of the 23rd International Conference on World Wide Web (pp. 493–498). Seoul, Korea: ACM.
Nielsen, R. K., & Schrøder, K. C. (2014). The Relative Importance of Social Media for Accessing, Finding, and Engaging with News: An eight-country cross-media comparison. Digital Journalism, 2(4), 472–489. https://doi.org/10.1080/21670811.2013.872420
Onnela, J. P., Saramaki, J., Hyvonen, J., Szabo, G., Lazer, D., Kaski, K., … Barabasi, A. L. (2007). Structure and tie strengths in mobile communication networks. In Proceedings of the National Academy of Sciences (Vol. 104, pp. 7332–7336). https://doi.org/10.1073/pnas.0610245104
Opsahl, T., Agneessens, F., & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3), 245–251. https://doi.org/10.1016/j.socnet.2010.03.006
Pei, S., Muchnik, L., Andrade, J. S., Jr., Zheng, Z., & Makse, H. A. (2014). Searching for superspreaders of information in real-world social media. Scientific Reports, 4, 5547.
Romero, D. M., Meeder, B., & Kleinberg, J. (2011). Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. WWW’11 Proceedings of the 20th International Conference on World Wide Web, 695–704. https://doi.org/10.1145/1963405.1963503
Selivanov, D. (2016). text2vec: Modern Text Mining Framework for R.
Seo, H. (2014). Visual Propaganda in the Age of Social Media: An Empirical Analysis of Twitter Images During the 2012 Israeli–Hamas Conflict. Visual Communication Quarterly, 21(3), 150–161. https://doi.org/10.1080/15551393.2014.955501
Shelton, T., Poorthuis, A., Graham, M., & Zook, M. (2014). Mapping the data shadows of Hurricane Sandy: Uncovering the sociospatial dimensions of “big data.” Geoforum, 52, 167–179. https://doi.org/10.1016/j.geoforum.2014.01.006
Signorini, A., Segre, A. M., & Polgreen, P. M. (2011). The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic. PLoS ONE, 6(5), e19467. https://doi.org/10.1371/journal.pone.0019467
Snijders, T. A. B. (2011). Statistical Models for Social Networks. Annual Review of Sociology, 37(1), 131–153. https://doi.org/10.1146/annurev.soc.012809.102709
Sobolevsky, S., Szell, M., Campari, R., Couronné, T., Smoreda, Z., & Ratti, C. (2013). Delineating Geographical Regions with Networks of Human Interactions in an Extensive Set of Countries. PLoS ONE, 8(12), e81707.
Sporns, O. (2002). Graph Theory Methods for the Analysis of Neural Connectivity Patterns. In R. Kötter (Ed.), Neuroscience databases. A practical guide. (pp. 171–185). Boston, MA: Kluwer Academic Press. https://doi.org/10.1007/978-1-4615-1079-6_12
Steiger, E., Ellersiek, T., Resch, B., & Zipf, A. (2011). Uncovering latent mobility patterns from Twitter during mass events. Journal for Geographic Information Science, 1, 525–534. https://doi.org/10.1553/giscience2015s525
Stephens, M., & Poorthuis, A. (2014). Follow thy neighbor: Connecting the social and the spatial networks on Twitter. Computers, Environment and Urban Systems, 53, 87–95. https://doi.org/10.1016/j.compenvurbsys.2014.07.002
Takhteyev, Y., Gruzd, A., & Wellman, B. (2012). Geography of Twitter networks. Social Networks, 34(1), 73–81. https://doi.org/10.1016/j.socnet.2011.05.006
Tan, L., & Lei, D. (2013). Exact Solutions of a Generalized Weighted Scale Free Network. Journal of Applied Mathematics, 2013, 1–6. https://doi.org/10.1155/2013/902519
Taylor, P. J. (2001). Specification of the World City Network. Geographical Analysis, 33(2), 181–194.
Tsur, O., & Rappoport, A. (2012). What’s in a Hashtag? Content based Prediction of the Spread of Ideas in Microblogging Communities. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining - WSDM ’12, 643. https://doi.org/10.1145/2124295.2124320
Valle, D., Cvetojevic, S., Robertson, E. P., Reichert, B. E., Hochmair, H. H., & Fletcher, R. J. (2017). Individual Movement Strategies Revealed through Novel Clustering of Emergent Movement Patterns. Scientific Reports, 7, 44052. https://doi.org/10.1038/srep44052
Varol, O., Ferrara, E., Davis, C. A., Menczer, F., & Flammini, A. (2017). Online Human-Bot Interactions: Detection, Estimation, and Characterization.
Vidya, N. A., Fanany, M. I., & Budi, I. (2015). Twitter Sentiment to Analyze Net Brand Reputation of Mobile Phone Providers. Procedia Computer Science, 72, 519–526. https://doi.org/10.1016/j.procs.2015.12.159
Wang, X. F., & Chen, G. (2003). Complex networks: Small-world, scale-free and beyond. IEEE Circuits and Systems Magazine, 3(1), 6–20. https://doi.org/10.1109/MCAS.2003.1228503
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of “small-world” networks. Nature, 393(6684), 440–442. https://doi.org/10.1038/30918
Weng, L., Menczer, F., & Ahn, Y.-Y. (2013). Virality Prediction and Community Structure in Social Networks. Scientific Reports, 3(2522). https://doi.org/10.1038/srep02522
White, E. P., Enquist, B. J., & Green, J. L. (2008). On estimating the exponent of power law frequency distributions. Ecology, 89(4), 905–912.
Yang, J., & Counts, S. (2010). Predicting the Speed, Scale, and Range of Information Diffusion in Twitter. Fourth International AAAI Conference on Weblogs and Social Media, 355–358. https://doi.org/10.1016/j.adhoc.2011.06.003
Zahra, K., Ostermann, F. O., & Purves, R. S. (2017). Geographic variability of Twitter usage characteristics during disaster events. Geo-Spatial Information Science, 20(3), 231–240. https://doi.org/10.1080/10095020.2017.1371903
Zook, M. A., & Brunn, S. D. (2005). Hierarchies, Regions and Legacies: European Cities and Global Commercial Passenger Air Travel. Journal of Contemporary European Studies, 13(2), 203–220.
BIOGRAPHICAL SKETCH
Sreten Cvetojević was born in Ljubovija, Serbia. In 2011, he graduated with an
Engineer’s Diploma (Dipl. Ing. – equivalent to U.S. Bachelor of Science and Master of
Science degrees) in Telecommunications Networks and Traffic Engineering from the
University of Belgrade, Serbia. He was accepted as a Ph.D. student and research
assistant at the University of Florida in 2013 and graduated with a Ph.D. in Forest
Resources and Conservation with a concentration in Geomatics in 2018.