PROXIMITY, INTERACTIONS, AND COMMUNITIES IN SOCIAL NETWORKS: PROPERTIES AND
APPLICATIONS.
By
Tommy Nguyen
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: COMPUTER SCIENCE
Examining Committee:
Boleslaw K. Szymanski, Thesis Adviser
Sibel Adalı, Member
James A. Hendler, Member
Gyorgy Korniss, Member
Mohammed J. Zaki, Member
Rensselaer Polytechnic Institute
Troy, New York
October 2014
(For Graduation December 2014)
© Copyright 2014
by
Tommy Nguyen
All Rights Reserved
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
ACKNOWLEDGMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Ranking Information in Social Networks . . . . . . . . . . . . . . . . 2
1.2 Small Worlds and Social Stratification . . . . . . . . . . . . . . . . . 4
1.3 Summary of Contributions & Organization . . . . . . . . . . . . . . . 6
1.3.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2. LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Ranking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Web Conceptualization . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 User Data & Trust Models . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Learning to Rank . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Small-world Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Six Degrees of Separation . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Social Stratification . . . . . . . . . . . . . . . . . . . . . . . . 16
3. SOCIAL NETWORK ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Geography, Co-Appearance, & Interactions . . . . . . . . . . . . . . . 19
3.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 Notations & Definitions . . . . . . . . . . . . . . . . . . . . . 20
3.1.3 Data Analysis & Results . . . . . . . . . . . . . . . . . . . . . 21
3.1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Incorporating Geography into Community Detection . . . . . . . . . 24
3.2.1 Clique Percolation Method . . . . . . . . . . . . . . . . . . . . 25
3.2.2 Modularity Maximization . . . . . . . . . . . . . . . . . . . . 26
3.2.3 Speaker-Label Propagation (GANXiS) . . . . . . . . . . . . . 27
3.3 Contrasting Communities to Null Models . . . . . . . . . . . . . . . . 28
3.3.1 Techniques for Generating Covers . . . . . . . . . . . . . . . . 29
3.3.2 Measuring Covers & Communities . . . . . . . . . . . . . . . . 29
3.3.3 Examining Covers in Gowalla . . . . . . . . . . . . . . . . . . 31
3.4 Examining Detected Communities . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Network Community Profile (NCP) . . . . . . . . . . . . . . . 34
3.4.2 Link Connectivity Measurements . . . . . . . . . . . . . . . . 35
3.4.3 Face-to-Face Interactions Measurements . . . . . . . . . . . . 35
3.5 Application: Social Relationships & Human Mobility . . . . . . . . . 39
3.5.1 Network Congestion in MANETs . . . . . . . . . . . . . . . . 41
3.5.2 Mobility Generation . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.3 Experimental Congestion Design . . . . . . . . . . . . . . . . 42
3.5.4 Congestion Simulation Results . . . . . . . . . . . . . . . . . . 43
3.6 Application: Long Ties & Economic Development . . . . . . . . . . . 44
3.6.1 A Stochastic Model of Economic Development . . . . . . . . . 47
3.6.2 Experimental Results & Discussion . . . . . . . . . . . . . . . 48
3.7 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4. SOCIAL RANKING TECHNIQUES . . . . . . . . . . . . . . . . . . . . . 57
4.1 Google Buzz & Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.1 Categories of URLs. . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.2 Spreaders & Affected Sets . . . . . . . . . . . . . . . . . . . . 60
4.1.3 Information Distances . . . . . . . . . . . . . . . . . . . . . . 61
4.1.4 Geographical Distances . . . . . . . . . . . . . . . . . . . . . . 62
4.1.5 Densities of Social Relationships . . . . . . . . . . . . . . . . . 64
4.1.6 Keyword Similarity . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Social Ranking Techniques . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1 PageRank on Social Network . . . . . . . . . . . . . . . . . . 66
4.2.2 HITS on Social Network . . . . . . . . . . . . . . . . . . . . . 67
4.2.3 Ranking with Maximum Flow . . . . . . . . . . . . . . . . . . 68
4.2.4 Variants of Maximum Flow . . . . . . . . . . . . . . . . . . . 70
4.3 Social Ranking Experiments . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.1 Comparing PageRank & HITS . . . . . . . . . . . . . . . . . . 70
4.3.2 Flow Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.3 Rank Differences . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.4 Rank Distributions . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.5 Rank Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5. SOCIAL SEARCHING EXPERIMENTS . . . . . . . . . . . . . . . . . . . 81
5.1 Attrition, Geography, & Communities . . . . . . . . . . . . . . . . . . 82
5.1.1 Modeling Attrition . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.2 Geographical Analysis . . . . . . . . . . . . . . . . . . . . . . 84
5.1.3 Detecting Communities . . . . . . . . . . . . . . . . . . . . . . 86
5.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Routing Strategies . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.2 Starter & Target Selections . . . . . . . . . . . . . . . . . . . 88
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.3.1 Selection & Routing Combinations . . . . . . . . . . . . . . . 89
5.3.2 Friends-of-Friends Knowledge Densities . . . . . . . . . . . . . 90
5.3.3 Distributions of Successful Chains . . . . . . . . . . . . . . . . 91
5.3.4 Effects of Hubs and Connectors . . . . . . . . . . . . . . . . . 92
5.3.5 Individual and Community Prominence . . . . . . . . . . . . . 93
5.4 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6. CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . 97
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
LIST OF TABLES
1.1 Aspects of SNA & applications. . . . . . . . . . . . . . . . . . . . . . . 7
3.1 Data summary of Gowalla network. . . . . . . . . . . . . . . . . . . . . 20
3.2 Six techniques for generating covers. . . . . . . . . . . . . . . . . . . . . 29
3.3 Measurements for cover C of the size k. . . . . . . . . . . . . . . . . . . 31
3.4 Detected communities and their sizes. . . . . . . . . . . . . . . . . . . . 34
3.5 Measuring spatial conductance. . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Measuring face-to-face interactions. . . . . . . . . . . . . . . . . . . . . 36
3.7 Network simulator ns-2 parameters. . . . . . . . . . . . . . . . . . . . . 43
3.8 Measuring economic development (Gowalla). . . . . . . . . . . . . . . . 52
3.9 Measuring economic development (FourSquare). . . . . . . . . . . . . . 53
4.1 Data summary of Google Buzz. . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Data summary of Twitter. . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Google Buzz (left) & Twitter (right) with geography. . . . . . . . . . . 59
4.4 Social relationships densities in Google Buzz. . . . . . . . . . . . . . . . 64
4.5 Social relationships densities in Twitter. . . . . . . . . . . . . . . . . . . 65
4.6 Ranking results of 30 popular URLs in Google Buzz. . . . . . . . . . . . 74
4.7 Ranking results of 30 random URLs in Google Buzz. . . . . . . . . . . . 75
4.8 Avg. ranking differences in Google Buzz. . . . . . . . . . . . . . . . . . 76
4.9 Avg. ranking differences in Twitter. . . . . . . . . . . . . . . . . . . . . 76
5.1 Summaries of online social networks datasets. . . . . . . . . . . . . . . . 81
5.2 Communities detected by GANXiS. . . . . . . . . . . . . . . . . . . . . 86
5.3 Prominence of individuals and communities. . . . . . . . . . . . . . . . 88
5.4 Experimental results for Gowalla. . . . . . . . . . . . . . . . . . . . . . 88
5.5 Experimental results for FourSquare. . . . . . . . . . . . . . . . . . . . 89
6.1 Aspects of SNA & applications. . . . . . . . . . . . . . . . . . . . . . . 97
LIST OF FIGURES
3.1 Geographical spread of 100K checkins in Gowalla. . . . . . . . . . . . . 19
3.2 Friendship is bounded by geographical distance. . . . . . . . . . . . . . 21
3.3 Densities of pairs as a function of geographical distance. . . . . . . . . . 22
3.4 Measuring face-to-face interactions (tε=30mins, dε=1km). . . . . . . . . 23
3.5 Generating CTA & FTA covers. . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Intra-edge count, boundary-edge count, and geographic diameter of covers. 32
3.7 Contraction, expansion, conductance, and geographic distance of covers. 33
3.8 Communities detected by Clique Percolation Method. . . . . . . . . . . 36
3.9 Communities detected by Inference Algorithm. . . . . . . . . . . . . . . 37
3.10 Communities detected by GANXiS. . . . . . . . . . . . . . . . . . . . . 38
3.11 Measuring face-to-face interactions among members. . . . . . . . . . . . 39
3.12 Generating a Markov Model using checkins. . . . . . . . . . . . . . . . . 41
3.13 Design of simulation overview. . . . . . . . . . . . . . . . . . . . . . . . 43
3.14 Traffic congestion in FMM and RWP. . . . . . . . . . . . . . . . . . . . 44
3.15 Frequency of pauses using the RWP. . . . . . . . . . . . . . . . . . . . . 45
3.16 Scaling laws of short and long ties. . . . . . . . . . . . . . . . . . . . . . 49
3.17 Face-to-face interactions of short ties and long ties. . . . . . . . . . . . . 49
3.18 The collective strength of long ties in a simple contagion model. . . . . 50
3.19 Distribution of long ties for adopters and non-adopters. . . . . . . . . . 51
3.20 Economic development as a function of idea flow (Gowalla). . . . . . . . 52
3.21 Economic development as a function of idea flow (FourSquare). . . . . . 53
3.22 Speedy idea flow as a function of social diversity. . . . . . . . . . . . . . 53
4.1 Conceptualization of social ranking. . . . . . . . . . . . . . . . . . . . . 57
4.2 Categories of popular (a,c) and random (b,d) URLs. . . . . . . . . . . . 60
4.3 Shortest paths to URLs in Google Buzz (a) and Twitter (b). . . . . . . 61
4.4 Ultra small-world property from starters to information. . . . . . . . . . 62
4.5 Densities of shortest path lengths from starters to URLs. . . . . . . . . 62
4.6 Two degrees of spatial concentration. . . . . . . . . . . . . . . . . . . . 63
4.7 Four dimensions of social relationships. . . . . . . . . . . . . . . . . . . 64
4.8 CKS for friendship, following, peers, and random pairs. . . . . . . . . . 65
4.9 Graph G′p for ranking URLs {u1, u2} with respect to node p. . . . . . . 69
4.10 Ranking URLs on Google Buzz. . . . . . . . . . . . . . . . . . . . . . . 71
4.11 Ranking URLs on Twitter. . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.12 Social ranking with popular URLs on Google Buzz. . . . . . . . . . . . 72
4.13 Social ranking with random URLs on Google Buzz. . . . . . . . . . . . 73
4.14 Social ranking with popular URLs on Twitter. . . . . . . . . . . . . . . 73
4.15 Social ranking with random URLs on Twitter. . . . . . . . . . . . . . . 73
4.16 Densities of rank correlation coefficient. . . . . . . . . . . . . . . . . . . 77
4.17 Ranking quality results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1 Stratification graph of communities in Gowalla. . . . . . . . . . . . . . . 83
5.2 Distributions of shortest path lengths & average path lengths. . . . . . 84
5.3 Densities of geographical distances. . . . . . . . . . . . . . . . . . . . . 85
5.4 Friends-of-friends knowledge densities. . . . . . . . . . . . . . . . . . . . 90
5.5 Path length of successful chains & drop rates. . . . . . . . . . . . . . . . 92
5.6 Effects of routing to connectors & hubs. . . . . . . . . . . . . . . . . . . 93
5.7 Prominence of individuals & communities on reachability. . . . . . . . . 94
5.8 Prominence of individuals & communities correlations. . . . . . . . . . . 95
ACKNOWLEDGMENT
I would like to thank everyone who mentored me during my undergraduate and graduate
studies. This dissertation would not have been possible without their guidance.
First, I would like to thank my dissertation chair for his guidance, ideas, and intellectual
contributions to this dissertation. From seeking research problems to career planning,
he was always encouraging and supportive throughout my graduate studies. To quote
a previous graduate student, “his pleasant and friendly personality made this graduate
study more enjoyable.” I would also like to thank the committee members for providing their
feedback and helping me organize the structure of this thesis.
Second, I would like to thank the entire staff of the CS department. Ms. Coonrad
and Ms. Hayden are always responsive to my questions regarding classes, graduation
requirements, etc., even when there are hundreds of questions from other students.
Mr. Lindsay is always around and ready to help whenever a server crashes. It was
always a pleasure to interact with them throughout my graduate studies.
Last but not least, I would like to acknowledge the graduate students and postdocs in
our center and computer science department. Some of them are talented scientists and
experts in their areas of research; others are going to become experts one day. They
make me feel proud of being a member of our center and an alumnus of the university.
ABSTRACT
Social network analysis, in the form of network theory, where nodes represent humans
and edges represent social relationships between humans, has a wide range of applications
in information science, political science, social science, economics, etc. The
availability of data from location-based social media such as Gowalla and FourSquare
has helped scientists model and analyze human relationships and their interactions.
In this thesis, we use such data to analyze multiple dimensions of social relationships
in terms of three specific aspects: geographical proximity of nodes, their face-to-face
interactions, and the structure of their communities. Then we incorporate these three
aspects of social relationships into the following applications.
First, we propose techniques for analyzing human relationships in terms of geographical
proximity, face-to-face interactions, and communities. We show how geographical
proximity shapes the structure of the social network by limiting face-to-face
interactions among distant users. We also incorporate geographical locations that
users visited into a few community detection algorithms for the purpose of detecting
communities where members are on average separated by a few friendship links, are
close to each other geographically, and are likely to interact with each other
face-to-face. These aspects of social network analysis allowed the study of the first two
applications: human mobility patterns and the spread of ideas.
Second, we use URLs that people share with their followers on social media to
personalize the ranking of information by looking at who follows whom, geographical
location of the users, and the structure of their detected communities. This allows us
to analyze how social media tunnels the flow of information in the network. More
importantly, personalized ranking based on these aspects allows users to see information
through the eyes of other users whom they consider important (neighbors, friends,
peers, etc.) and provides an opportunity for them to interact with information that
was used by the people they care about. This is the third application studied in
this thesis.
Finally, we replicate the small world experiment by emulating the process of
searching for targets by routing a folder among their acquaintances. Geographical
information and community structure allow us to selectively choose starters and tar-
gets based on the knowledge of where users are located and to which community they
belong. In addition, we examine various routing strategies based on geographical
proximity and community structure that were likely used by participants in
the small-world experiment to reach a target. In doing so, we discover which
combinations of routing strategies and selection techniques are likely to make the small-world
experiment successful in terms of the small number of hops required to reach the
target and the percentage of such successful chains. This is the last application
studied in this thesis.
CHAPTER 1
INTRODUCTION
Social network analysis examines human relationships in terms of graph theory where
nodes represent humans and edges represent their social relationships. In addition,
social network analysis can also examine the geographical proximity of the nodes,
their face-to-face interactions, and the structure of their detected communities. This
thesis examines these three aspects of social network analysis in detail.
Within the last five years, the proliferation of smartphones has provided a new
type of social networking where people can share their current location with their
friends and tag the activities that they are doing. This new type of social networking
has provided a much richer dataset of human behavior because geographical locations
and face-to-face interactions were not previously available. More importantly, this
new type of social networking provides a bridge that connects the digital world with
the physical world where physical activities of human behavior such as proximity and
face-to-face interactions are recorded and shared instantly.
Before location-based social media, scientists used CDRs (call detail records) of
telephone companies to study spatial properties, infer friendship topology, and guess
face-to-face interactions. However, a problem with CDRs is that call volume is not a
good proxy for friendship because people can make phone calls to order food, request
technical support, seek medical help, and so on. More importantly, using calling
patterns to infer friendship is biased towards those that are more likely to be strong
ties since weak ties are by definition those that are contacted infrequently; hence using
CDRs to infer friendship leaves out an important dimension of social relationships in
the study of social network analysis.
Therefore, location-based social media is valuable for the study of social network
analysis because it provides a network that is embedded into physical space (the
surface of the Earth) and whose nodes (humans) are constantly moving. In addition, the
links have different characteristics depending on the frequency of interactions. The
questions that immediately arise are what the ramifications are of this type of social graph
embedding in physical space, and what roles ties (weak/strong, long/short) play
in human behavior. The collection of data from Gowalla and FourSquare allows the
investigation of these issues, which are studied in detail in this thesis.
Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Social Ranking Techniques for the Web,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp. 49-55.
Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE, (under review).
Chapter 3 addresses the issue of face-to-face interactions and finds that friend-
ship still requires both face-to-face interactions and geographical proximity. Moreover,
the desire to interact face-to-face motivates strong ties to travel together, impacting
human mobility patterns with ramifications for transportation traffic and wireless
bandwidth infrastructure management (one of the applications studied in Chapter
3). However, this does not mean that weak ties are unimportant. The last section
of Chapter 3 shows that weak ties that are geographically distant tunnel the flow of
ideas and are a strong predictor of economic development in the US in terms of GDP,
patents, and startups.
Chapter 4 returns to strong ties and examines social influence that people have
on each other in terms of interests, geographical distance, and communities. Chapter 4
explores this influence to improve the relevancy of responses to queries by individualizing
them for the users based on the ranking of web pages shared on social networks.
The increased relevancy reported in this thesis suggests the level of influence that
friends exert on the interests of others.
Chapter 5 expands the last section of Chapter 3 by examining how the spatial embedding
of social networks, long distance ties, and communities underlie strategies
of social search. These aspects of social network analysis examine whether social
networks are small-world, stratified, or both simultaneously. Results show that while
social networks have small topological path lengths, there is no evidence that people
with limited knowledge can find a designated target within a small number of hops
when attrition is completely eliminated.
1.1 Ranking Information in Social Networks
Over the last decade, scientists examined the structure of the web [1]-[4] and proposed
algorithms to rank web pages based on significance and relevance to a given
query [5]-[9]. A conceptualization of the web is to look at patterns in the topology
of hyperlinks connecting web pages to separate prominent websites that serve as
authorities for trusted information from malicious pages created by spammers [1].
This conceptualization of the web eliminates the complexity of textual analysis
and creates a pot-pourri of information that gets incorporated into search engines or
other information retrieval systems for the purpose of finding information on personal
computers, mobile devices, and other computing platforms [10]. In the case of
a search engine, billions of web pages containing rich context of information are
organized so that end users can find their target quickly. Thus, this need for speed
makes ranking crucial in information retrieval systems. Also, ranking has many other
applications in social sciences such as the citation analysis of legal and scientific
documents [11].
Advances in social network analysis and the proliferation of online social media
have provided a different perspective for examining ranking [12]-[18]. Algorithms
used for ranking and organizing information in hybrid networks such as
social search engines show promising improvements when social network
analysis is incorporated into them; for example, incorporating personal information containing social
relationships on G+ for personalizing search results on Google. As the proliferation
of social media continues to expand, we want to be able to use techniques from social
network analysis to personalize the ranking of information for a given user. This is
important because social relevance allows users to see information through the eyes
of other users whom they consider important and provides an opportunity for them to
interact with the information accessed by the people about whom they care.
Social media such as Twitter and Google Buzz can be characterized as a web
service that allows users to share information with their followers. While a lot of
research has been devoted to examining text in hashtags and messages [19]-[21], we
focus on URLs because information contained in URLs is not restricted by length
limitations, is less likely to be informally written, and contains less slang and fewer
abbreviations. Analyzing URLs provides a unique opportunity to infer the interests
of users based on their reading habits. We assume that the URLs people share
concentrate on selected topics of interest to them. It is important to notice that our
purpose here is not to rank a set of URLs based on a given query but instead to rank a
set of URLs based on whether we think a user is likely to engage with the information
contained within the URLs. Such engagement could be clicking, commenting, re-
sharing, and spending time reading them.
The problem we want to solve is to provide a framework for ranking URLs
shared on social media based on social relationships, where some of the URLs are
ranked higher if they are shared via certain types of social relationships. The social
relationships we examine for ranking URLs include but are not limited to neighbors
(nodes that are within geographical proximity [22]) and peers (nodes that are within a
detected community [23]). The literature review on this subject is provided in Chapter
2 (Section 1) and the contribution is discussed in Chapter 6.
Some data-driven questions that we examine are whether pairs of users that
are geographically close are more likely to have similar interests than pairs that are
distant, and whether reciprocal relationships have higher keyword similarity in web
pages than non-reciprocal relationships. Other related questions that we explore are
examining the densities of friends, peers, neighbors, and people with similar interests,
since these social relationships are the building blocks for understanding social
relevance.
1.2 Small Worlds and Social Stratification
Data scientists have recently calculated the distribution of the shortest path
lengths between randomly selected pairs of users in online social networking sites and
confirmed that the majority of people are on average within six degrees of separation
(e.g., 4.7 in Facebook [24], 2.7 in MySpace [25], 4.2 in Twitter [26], and so on [27]).
However, empirical research in social stratification such as racial segregation and
income inequality undermines the premise that we live in a small world where there are
short paths connecting people with culturally and economically diverse backgrounds
together. In [28], Kleinfeld mentioned that Beck and Cadamagnani were unsuccessful
in replicating the small-world experiment with high success rates when they attempted
to reach a high-income target starting from a low-income person, suggesting that the
world we live in is divided by wealth caused by income inequality.
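The shortest-path statistics quoted above can be estimated on any friendship graph by sampling random pairs of users and running a breadth-first search between them. The following is a minimal sketch, assuming an undirected graph stored as an adjacency dictionary; the function names and the toy graph are hypothetical and are not taken from the cited studies.

```python
import random
from collections import deque


def bfs_distance(adj, source, target):
    """Return the number of hops on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


def sampled_average_path_length(adj, num_pairs, seed=0):
    """Average shortest-path length over randomly sampled, connected pairs."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    lengths = []
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)  # two distinct users
        d = bfs_distance(adj, s, t)
        if d is not None:  # skip pairs in different components
            lengths.append(d)
    return sum(lengths) / len(lengths) if lengths else None


# Hypothetical toy graph: a simple path a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

Note that this measures topological distance only; it says nothing about whether a person with limited local knowledge could actually find such a path.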
Before the availability of data from online social networking sites, Milgram and
his colleagues performed an experiment to demonstrate the small-world phenomenon
by recruiting randomly selected starters from Nebraska and Oklahoma to reach a
broker in Boston [29]. In their experiment, starters were asked to mail a folder to
an acquaintance who was known to them on a first-name basis and who would be likely to reach
the target using the least number of hops. The process repeats until the chain stops,
either when the folder eventually reaches the target or when its current holder drops out of the
experiment for the lack of qualified acquaintances or unwillingness to participate
in the experiment. Hence, the expected number of hops required for a starter to
successfully reach a target is an upper bound and also a loose estimate for the length
of the shortest path connecting them. Travers and Milgram reported that 64 of the
chains successfully reached the designated target, with a mean of 5.2 intermediaries [29], suggesting
that the diameter of the network of social connections is small.
The problem we want to solve is finding out whether the network of our so-
cial connections is small, stratified, or both simultaneously. We want to investigate
this problem by replicating the process of routing a folder from selected starters
to randomly chosen targets by using data containing geographical locations and so-
cial relationships of hundreds of thousands of users from location-based social media.
The advantage of incorporating large-scale and multi-dimensional data into the small-
world experiment is that many aspects of the experiment can be controlled such as
determining how to strategically route a folder between acquaintances and having real
data on who is actually connected to whom for hundreds of thousands of users. Unlike
other social experiments requiring incentives for human subjects to participate,
we can control the effect of participation by supposing that everyone who receives a
chain letter participates in the experiment once, since long chains are not likely to
exist when the average participation rate is 37%, as reported by Dodds et al. [30]
(e.g., 0.37^5 < 0.01). These advantages from the data help us focus on how two factors of the
experiment, geographical locations and community structure of users' connections,
make it possible for social networks to be either small-world, stratified, or both simultaneously.
These aspects of geographical proximity and community structure allow
us to strategically route a folder between acquaintances and also select starters
and targets based on geographical distance or by a fixed number of community hops
connecting them.
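The attrition argument above is a short piece of arithmetic: if each holder forwards the folder with probability 0.37, a chain survives five consecutive hops with probability 0.37^5, which is below 1%. The sketch below works this out under the simplifying assumption that every hop is an independent decision with the same participation rate; the function names are hypothetical.

```python
def chain_survival_probability(participation_rate, hops):
    """Probability that `hops` consecutive holders all forward the folder,
    assuming independent decisions with a constant participation rate."""
    return participation_rate ** hops


def hops_until_below(participation_rate, threshold):
    """Smallest number of hops at which the survival probability drops
    below `threshold`. Requires 0 < participation_rate < 1."""
    hops = 0
    prob = 1.0
    while prob >= threshold:
        hops += 1
        prob *= participation_rate
    return hops


# With the 37% participation rate reported by Dodds et al.:
p5 = chain_survival_probability(0.37, 5)  # about 0.0069
```

Under this assumption, fewer than one chain in a hundred survives five hops, which is why long chains are rarely observed when attrition is present.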
We used community detection algorithms to partition a social network so that
starters and targets can be selected in the following ways. We define the network
distance from the community of the starter Cs to the community of the target Ct as the
length of the shortest path connecting nodes from Cs to Ct. The question we ask is
how many hops does it take to reach a target t originating from a starter s if the
length of the shortest path connecting their communities is fixed at k? When k ≈ 0,
we expect to capture the small-world phenomenon where it is easy to find short paths
connecting people together. On the other hand, when k >> 0, we expect that while
there might exist short paths connecting people together, it is much harder to find
them with limited information available to the participants due to the stratified nature
of society where some people have little social capital compared to others, making it
difficult for people to reach targets outside of their communities and social class.
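One reading of the community distance defined above is the minimum number of hops between any member of Cs and any member of Ct. The sketch below computes it with a multi-source breadth-first search; the adjacency-dictionary representation, the function name, and the choice to return 0 for overlapping communities are assumptions made for illustration, not the thesis's implementation.

```python
from collections import deque


def community_distance(adj, community_s, community_t):
    """Length k of the shortest path from any node in community_s to any
    node in community_t; 0 if they share a member, None if disconnected."""
    targets = set(community_t)
    seen = set(community_s)
    if seen & targets:
        return 0
    # Start the search from every member of the starter's community at once.
    queue = deque((node, 0) for node in seen)
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr in seen:
                continue
            if nbr in targets:
                return dist + 1
            seen.add(nbr)
            queue.append((nbr, dist + 1))
    return None


# Hypothetical toy graph: a path 1 - 2 - 3 - 4 - 5 - 6.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
k = community_distance(path, {1, 2}, {5, 6})  # path 2 -> 3 -> 4 -> 5
```

Starter and target pairs could then be grouped by their value of k to compare routing success at a fixed community distance.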
Besides the debate over whether we live in a small world or a stratified one, the
techniques that were used by the participants in the experiment to select an acquaintance
have practical applications in search and rescue operations [31] and job searching
via personal contacts [32]. Dodds et al. reported that the successful techniques used
by the participants include forwarding the folder to a selected acquaintance such as
a friend (67%), relative (10%), co-worker (9%), sibling (5%), significant other (3%),
and others (6%) based on geographical proximity and occupation “for at least half
of the decisions” [30]. In addition, the results from the small-world experiment led
to an avalanche of network models that have certain properties resembling real social
networks such as the short diameter and high clustering coefficient [33].
The literature review on this subject is included in Chapter 2 (Section 2) and
the contribution is discussed in Chapter 6.
1.3 Summary of Contributions & Organization
First, this thesis collects terabytes of data that users shared on social media
and analyzes their relationship dynamics in terms of three specific aspects: geography,
face-to-face interactions, and communities. Such data allow us to analyze
human behavior through social network analysis, for example the interplay between
interactions, geographical proximity, and community structure. One interesting
behavior we observe is that friendship between two people is more
likely to form when they are geographically close, and friends-of-friends are also more
likely than not to be within proximity of each other. Geography also limits
users' face-to-face interactions and shapes their interests in terms of what they
read on social media. For more details on the data analysis of human behavior and
social relationships, see Chapter 3.
Second, this thesis proposes techniques for incorporating social relevance into
the process of ranking URLs. Personalized ranking results using variants of network
flow are largely independent of PageRank. The four dimensions of social
relationships that we use for ranking URLs are friends, neighbors, peers, and users
with similar interests. Results from the experiments show that social relevance can
improve ranking quality by up to 19% compared to the baseline and 5% compared to
PageRank. For more details on the personalization of information, see Chapter 4.
Third, this thesis examines the effects of social stratification in the small-world
problem. Results show that while using geographical and community information
in modeling social routing for the small-world problem is more realistic than using
either one alone, average path lengths are 3 times longer than in the Travers-Milgram
experiments when attrition is eliminated. Community distance is more effective and
robust at predicting the probability of reaching targets than geographical distance in
terms of average path lengths and percentage of successful chains. Finally, results
show that prominent targets and targets in prominent communities can be reached
much more quickly than average. Our results can be summarized as follows: the
small-world property holds for the prominent, but everyone else is lost in the crowd except
when being reached by members of their own community. For more details on
the effects of stratification in searching for people, see Chapter 5.
1.3.1 Organization
Table 1.1: Aspects of SNA & applications.

                       Geography        Interactions       Communities
Human Mobility         Congestion       Communication      Group
Spreading Ideas        Long Ties        Weak Ties          Bridge Ties
Personalized Ranking   Geo. Influence   Peer Influ.        Collective Influ.
Small-world            Selection        Cognitive Biases   Routing

The organization of this thesis can be summarized by using Table 1.1. The
three aspects of social network analysis are geographical proximity of nodes (Chapter
3 Section 1), their face-to-face interactions (Chapter 3 Section 1), and the structure
of their communities (Chapter 3 Section 2). The four applications studied in this
thesis are human mobility & congestion modeling (Chapter 3 Section 5), spreading
ideas & economic development (Chapter 3 Section 6), personalized ranking (Chapter
4), and the small-world experiment (Chapter 5). Each element in Table 1.1 describes
how the corresponding aspect of social network analysis can be used to analyze the
corresponding application.
For the first application (human mobility), geography, in terms of the geographical
proximity of friends, shows that human mobility traces can be used to study
wireless bandwidth infrastructure management. As we see later, network congestion
is concentrated in a few geographical locations, which affects throughput
in mobile ad-hoc networks. In Chapter 3 Section 5,
face-to-face interactions are treated as analogous to establishing wireless connections, since the
purpose of establishing connections in wireless networks is to communicate, and
establishing a connection is only possible when nodes are within geographical proximity,
just like face-to-face interactions. Finally, this analogy can be extended to incorporate
communities, where mobility traces are simulated based on a group of nodes
belonging to the same community and moving together.
For the second application (spreading ideas), geography plays a role in dis-
tinguishing between short and long ties where the effects of long ties are examined
in simple contagion models for the purpose of measuring economic development of
large geographical areas. The analysis of face-to-face interactions shows that long
ties are especially weak. In addition to long ties, ties that connect different
communities are also examined in Chapter 3 Section 6.
For the third application (personalized ranking), three elements are incorporated
into the process of ranking URLs. Geography allows selecting users based on
geographical distance (neighbors). Reciprocal interactions in terms of social relationships
(friends instead of followers) allow us to select nodes based on their interactions.
Finally, community structures allow us to select nodes that belong to the
same community.
For the last application (small-world), geography allows selecting a starter and
a target in the simulations based on their geographical distance. Face-to-face inter-
actions could affect the statistics of average path lengths because the folder holder is
likely to pass the folder to the next holder based on the number of their interactions
and independently of the target. Finally, community structures allow the nodes in
the simulations to pass the folder based on community awareness.
CHAPTER 2
LITERATURE REVIEW
This chapter provides a literature review on ranking techniques and the small-world
problem.
2.1 Ranking Techniques
The literature review on ranking techniques is broken down into three parts.
The first part looks at the conceptualization of the web (Sec. 2.1.1), the second part
looks at incorporating more sources of data and modeling trust (Sec. 2.1.2), and the
third part looks at data mining techniques for learning how to rank (Sec. 2.1.3).
2.1.1 Web Conceptualization
In the early days, search engines rated information on the web by using the text
embedded in a page rather than the hypertext, which carries information invisible
to end users. Later work on ranking web pages combined text and
hypertext to determine the rank of a page, since hypertext by itself does not contain
information related to the query, and a large amount of text on a page does not mean
that it is authoritative [34]. In a sense, ranking pages by counting the number of inlinks
is like voting, where the number of inlinks is the number of votes for a page, and
additional textual analysis can be applied to a query for retrieving a subset of related
pages ranked by the number of votes.
Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE, (under review).
Advances came from Page and Brin when they devised an algorithm, now known
as PageRank, that captures not only the number of inlinks, as in voting, but
also the quality of those links [5]. The initial score of a web page is equal to $1/n'$, where
$n'$ is the number of pages containing a link to that page. At the first iteration, each
page sends its score divided by the number of its links pointing to other pages. Then
each page replaces its current score with the sum of the scores that were sent to it along
its incoming links. The process of sending and updating scores repeats until convergence
or until a pre-defined number of iterations is reached. The final scores determined by
PageRank are used to rank pages across the web graph.
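The score-passing iteration described above can be sketched in a few lines of Python. This is a minimal sketch rather than the full PageRank of [5]: it assumes a uniform starting score, no damping factor, and that every linked page also appears as a key of the `links` dictionary. The function name and the toy graph are hypothetical and only serve as an illustration.

```python
def simplified_pagerank(links, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    # Assumption: every page starts with the same score.
    scores = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        received = {page: 0.0 for page in pages}
        for page, targets in links.items():
            if not targets:
                # Assumption: a page without outlinks keeps its own score.
                received[page] += scores[page]
                continue
            # Each page sends its score divided by its number of outlinks.
            share = scores[page] / len(targets)
            for target in targets:
                received[target] += share
        # Each page replaces its score with the sum of what it received.
        scores = received
    return scores


# Hypothetical three-page web graph.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = simplified_pagerank(web)
ranking = sorted(scores, key=scores.get, reverse=True)
```

On this toy graph the scores settle near 0.4 for pages a and c and 0.2 for page b, because page c collects the votes of both a and b and passes all of them back to a.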
Kleinberg proposed a ranking algorithm known as HITS (Hypertext-Induced
Topic Search) based on the idea that good hubs point to good authoritative pages
and vice versa [35]. This query-dependent algorithm first retrieves a subset of pages
that are related to a query. Then it applies an update technique to recalculate the scores
of hubs and authorities, and the algorithm uses the scores of the authorities to rank
the pages. Initially, the score of an authority is the number of backlinks coming from
hubs, and the score of a hub is the sum of the scores of the authorities that it points to. At
the second iteration, the algorithm updates the score of an authority by taking the
sum of the scores of the hubs pointing to it. The score-updating process is then
repeated, and the algorithm stops after reaching some number of iterations.
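The hub and authority updates described above can be illustrated with the following sketch. It assumes that the query-related subset of pages has already been retrieved and is given as a hypothetical `links` dictionary. The normalization step after each round is an assumption added here to keep the scores bounded; the text above does not describe it.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the pages it points to (query subset only)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        new_auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages it points to.
        new_hub = {page: sum(new_auth[t] for t in links.get(page, []))
                   for page in pages}
        # Assumption: rescale both score vectors to unit length.
        auth_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in new_auth.items()}
        hub = {page: v / hub_norm for page, v in new_hub.items()}
    return hub, auth


# Hypothetical subset: three hubs, two candidate authorities.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1"]})
```

With all hub scores starting at 1, the first authority update yields the number of backlinks of each page, which matches the initial step described in the text.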
The Stochastic Approach for Link-Structure Analysis (SALSA) was
proposed by Lempel and Moran; it applies two independent random walks
to a bipartite graph consisting of hubs and authorities [2]. Instead of repeatedly
calculating and updating scores for hubs and authorities as is done in HITS, the number
of times a page is visited by the surfer in the random walk is used to estimate
the quality of the pages. The TKC (tightly knit community) effect is also shown, in which
communities of web pages are scored relatively high even though some pages are
not authoritative or relevant to the topic; this happens when every hub points to every
authority, forming a tightly knit community of hubs and authorities.
2.1.2 User Data & Trust Models
While link analysis of the web structure is a powerful tool for ranking pages,
new algorithms and ideas have emerged from different sources
of data in which additional information about end users is taken into consideration.
For instance, how long on average do users stay on a page, and how often are two
pages consecutively visited? BrowseRank is proposed to capture the number of page
visits and the amount of time a user stays on a page modeled as a continuous time
Markov process [8]. Another technique is taken from the principle of isolation or
the disconnectivity of trustworthy pages from spam pages where trust is propagated
from trustworthy pages to other trustworthy pages [6]. EdgeRank is proposed by
researchers from Facebook to consider interactions of two people or social associates
during the process of ranking updated messages, photos, URLs, etc. on the news feed
[36]. Last but not least, the annotation of web pages created by users on Delicious is
used to rank pages in SocialSimRank by considering the structure of annotators and
annotated pages [12].
Liu et al. proposed a technique called BrowseRank that uses personal data to rank
pages. It relies on a browsing graph in which vertices represent
visited pages and edges between vertices represent transitions from one page to
another [8]. The novelty of BrowseRank is that it incorporates the
amount of time an average user stays on a page, which is an indicator of the page’s
quality and cannot be captured by discrete-time link analysis techniques such
as PageRank, HITS, and SALSA. Also, as mentioned by the authors, the web graph
is not the most reliable source of data because of its large size and decentralized
architecture: spammers create link farms to increase
the visibility of their pages, and webmasters constantly change the content of
their pages. Empirical results suggest that BrowseRank outperforms PageRank when
independently hired researchers evaluate the ranked pages according to a linear
combination of relevance and importance.
The TrustRank algorithm proposed by Gyongyi et al. relies on the principle of
isolation, under the assumption that it is unlikely for trustworthy pages to link to
spam pages [6]. Seed detection is a process that determines a small set of pages
to be evaluated, where these pages are likely to point to other trustworthy pages.
First, a small set of seed pages is evaluated by using an oracle function to determine
whether a page is trustworthy or not. In practice, the oracle function represents
human judgment and would be too costly to use on a large set of pages. Second, each
trustworthy page propagates its trust to the pages that it points to, and the value of the
trust is divided equally among all pointed pages. The propagation process repeats
until convergence or until some predefined number of iterations is reached.
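The trust propagation step can be sketched as follows. This is only an illustration under stated assumptions: the oracle is replaced by a hand-picked set `trusted_seeds`, and a damping factor that keeps returning part of the trust to the seeds is assumed here so that the scores stay anchored to the evaluated pages. The function name, parameter names, and toy graph are hypothetical.

```python
def propagate_trust(links, trusted_seeds, damping=0.85, iterations=50):
    """links maps each page to the pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Trust starts at the seeds judged trustworthy (stand-in for the oracle).
    seed_score = {page: (1.0 / len(trusted_seeds) if page in trusted_seeds else 0.0)
                  for page in pages}
    trust = dict(seed_score)
    for _ in range(iterations):
        # Assumption: a (1 - damping) share of trust always stays with the seeds.
        updated = {page: (1.0 - damping) * seed_score[page] for page in pages}
        for source, targets in links.items():
            if not targets:
                continue
            # Trust is divided equally among all pages the source points to.
            share = damping * trust[source] / len(targets)
            for target in targets:
                updated[target] += share
        trust = updated
    return trust


# Hypothetical graph: spam pages link to good pages, but not the reverse.
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = propagate_trust(graph, {"good1"})
```

Because no trustworthy page links to the spam pages in this toy graph, they never receive any trust, which is the isolation principle at work.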
Additional advances came from the interests of Facebook in ranking items such
as photos, messages, URLs, etc. on each individual news feed. In EdgeRank, the
affinity score of two users, the weight of the posted item, and time decay are taken
into consideration for the ranking of items on personalized news feeds [36]. The
affinity score of the viewing user and the item creator is calculated by looking at their
online interactions; the more they have interacted, the more likely the item is shown
or ranked higher. Time decay decreases the relevance of a posted item as time goes
on, and the edge weight increases the score of items that have a high level of potential
interaction such as photo albums, messages embedded with URLs, etc. In addition
to EdgeRank, Bao et al. proposed SocialSimRank that uses social annotations on
Delicious to rank pages according to the observation that popular pages are annotated
by up-to-date users and up-to-date users annotate popular pages [12]. The novelty
of SocialSimRank comes from using the annotations of users to match search queries
to the corresponding annotated pages and applying the PageRank algorithm to the
annotated pages as a means to rank pages according to the view of the annotator.
2.1.3 Learning to Rank
Learning to rank lies at the intersection of information retrieval and machine
learning, where machine learning techniques are used to model the process
of ranking documents. Techniques are based on the idea of computing a function
that maximizes quality measures in ranking or minimizes the sum of differences between
the computed function and human-defined ratings. The advantage of using machine
learning techniques is that parameters in the proposed learning models are tuned
automatically. In pointwise comparison, the objective is to minimize the difference
between the calculated score of a document and its human-defined rating. In
pairwise comparison, the objective is to determine whether the first document in a
pair of documents is ranked higher than the second document or vice versa. One
of the challenges in learning to rank is to go from pointwise to pairwise comparison,
where the goal is to predict the relative ranking positions of two given documents. Another
challenge is to optimize non-continuous and non-differentiable objective functions.
Fortunately, previous work in the machine learning literature shows that techniques have been
developed to handle such cases. RankNet learns how to rank pages by using a neural
network with pairwise comparison [37], SoftRank approximates the non-continuous
and non-differentiable objective function [9], and SVMRank uses support vector
machines to minimize pairwise inconsistency [38].
In RankNet, Burges et al. proposed using a two-layer neural network to learn
the process of ranking pages [37]. Given a pair of pages represented as vectors,
the ranking problem that the authors posed is to compute the probability that
the first page is ranked higher than or equal to the second page. One advantage
in the learning stage is that the pairs of ranks need not be complete or even consistent,
which accommodates missing information and noise in the data.
First, they proposed using the cross-entropy cost function, where ranking probabilities
are modeled by the logistic function. Second, they proposed using the
backpropagation algorithm to calculate the weights and offsets in a
two-layer neural network such that the difference between the computed function and
the human-defined ratings is minimized. They conducted their training, testing, and
validation experiments by using data from a proprietary search engine consisting of
17,000 search queries, where each query contains the top 1,000 ranked pages. A page
is represented as a vector consisting of 569 features. Query-dependent features are
extracted from the anchor text, URL representations, title, and content. The remaining
features are taken from log files in the proprietary search engine [37]. Empirical
results suggested that RankNet outperformed the other learning models (RankProp
[39], PRank [40]) in the validation stage.
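The pairwise probability and cross-entropy cost mentioned above can be written down directly. The sketch below assumes that the network has already produced one real-valued score per page; the function names are hypothetical, and the neural network and its training loop are left out.

```python
import math


def pairwise_probability(score_i, score_j):
    """Logistic model of the probability that page i is ranked above page j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def cross_entropy_cost(score_i, score_j, target_probability):
    """Cross-entropy between the target and the modeled pairwise probability."""
    p = pairwise_probability(score_i, score_j)
    return -(target_probability * math.log(p)
             + (1.0 - target_probability) * math.log(1.0 - p))


# The target says page i should be ranked above page j (probability 1).
agrees = cross_entropy_cost(2.0, 0.0, 1.0)     # model agrees: small cost
disagrees = cross_entropy_cost(0.0, 2.0, 1.0)  # model disagrees: larger cost
```

Because the cost depends only on the score difference and is smooth in it, it can be minimized with gradient-based training such as backpropagation.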
Taylor et al. proposed SoftRank, where the idea is to consider ranking scores
as random variables, map score distributions to rank distributions, calculate the
expected SoftNDCG (normalized discounted cumulative gain), and use gradient
techniques to optimize parameters in a two-layer neural network with respect to
SoftNDCG as a cost function. While it is possible to use the cost function proposed in
RankNet, there are many other metrics in information retrieval such as MAP (mean
average precision), precision, and NDCG that reflect the experience of end users. As
mentioned, using these metrics as objective functions for training is challenging, since
small parameter changes yield different scores, but ranking positions change only
when one score passes another, making the function non-differentiable. SoftNDCG is
a proposed metric based on the approximation of NDCG by mapping scores to
random variables. As in RankNet, backpropagation uses gradient techniques
to optimize parameters in a two-layer neural network where the cost function is the
approximated NDCG metric.
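The reason NDCG is hard to optimize directly can be seen in a small sketch. It assumes the common gain of 2^rel - 1 with a logarithmic position discount, which is one of several NDCG variants; the function names and the toy scores are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(position + 2)
               for position, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def ndcg_from_scores(scores, relevances):
    """Rank documents by score, then evaluate NDCG of that ranking."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ndcg([relevances[i] for i in order])


labels = [3, 2, 1]
before = ndcg_from_scores([0.9, 0.5, 0.1], labels)
nudged = ndcg_from_scores([0.8, 0.6, 0.1], labels)   # same order, same NDCG
swapped = ndcg_from_scores([0.5, 0.6, 0.1], labels)  # order changes, NDCG jumps
```

Nudging the scores leaves the metric unchanged until two documents swap places, at which point the metric jumps. This flat-then-jump behavior gives no useful gradient, which is what SoftNDCG is designed to smooth out.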
Last but not least, SVMRank is an algorithm proposed by Joachims based on the
idea of using SVM (support vector machines) to construct a function that maximizes
the empirical Kendall’s tau between the target ranking determined from
clickthrough data and the system ranking computed by the SVM [38]. Clickthrough
data provides constructive feedback on the ranking system, where a clicked URL implies
an estimate of relevance relative to the query. While a clicked link does not represent
an absolute judgment, it provides useful insights about the ranking positions of the
unclicked items. For instance, clicking on the link that is ranked 7th implies that the 7th
link is more relevant to the query than the unclicked links ranked first through sixth.
This motivates the use of pairwise comparison, where the objective is to minimize
pairwise inconsistency between a computed function and the target function derived
from clickthrough data.
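The two ingredients of this paragraph, turning a click into pairwise preferences and comparing two rankings with Kendall's tau, can be sketched as follows. The sketch assumes a single click per result list and rankings without ties; the SVM training itself is omitted, and all names are hypothetical.

```python
from itertools import combinations


def preferences_from_click(ranked_links, clicked_link):
    """The clicked link is preferred over every unclicked link ranked above it."""
    position = ranked_links.index(clicked_link)
    return [(clicked_link, skipped) for skipped in ranked_links[:position]]


def kendall_tau(ranking_a, ranking_b):
    """(concordant - discordant) / total pairs, for two orderings of the same items."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


results = ["l1", "l2", "l3", "l4", "l5", "l6", "l7"]
pairs = preferences_from_click(results, "l7")  # l7 preferred over l1..l6
tau = kendall_tau(["a", "b", "c"], ["a", "c", "b"])
```

A tau of 1 means the two rankings agree on every pair and -1 means they disagree on every pair, so reducing the number of inconsistent pairs and increasing tau are two views of the same objective.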
2.2 Small-world Problem
This literature review on the small-world problem is broken down into two parts.
The first part provides an overview of the small-world phenomenon in terms of six
degrees of separation (Sec. 2.2.1). The second part looks at effects of inequality and
stratification that undermine the small-world property (Sec. 2.2.2).
2.2.1 Six Degrees of Separation
Milgram and his colleagues proposed an experiment to demonstrate the small-
world property by recruiting starters from Nebraska and Oklahoma to reach a broker
in Boston [29]. Starters in the experiments were asked to mail a folder to an
acquaintance who would be likely to reach the target quickly. Previous folder holders
were recorded in the folder roster so that they would not be selected twice in a
mail-forwarding chain. The process repeats until the chain stops, either when the folder
reaches the target or when the current holder drops out of the experiment.
The number of hops a starter requires to successfully reach a
target is an upper bound on the length of the shortest path connecting them. Travers and
Milgram reported that 64% of the chains successfully reached the designated target
within 5.2 hops [29] which gave name to the six degrees of separation. The idea of
six degrees of separation is that if we pick any two people on this planet, there are on
average 5 unique individuals who connect them in a chain in which the first person
knows the second person, who knows the third person, who eventually knows the last
person.
Besides the debate over whether we live in a small world or a stratified one,
the techniques that participants in the experiment used to select an
acquaintance have practical applications in search and rescue operations [31] and
job searching via personal contacts [32]. Dodds et al. reported that the successful
techniques used by the participants included forwarding the folder to a selected
acquaintance such as a friend (67%), relative (10%), co-worker (9%), sibling (5%),
significant other (3%), or miscellaneous tie (6%), chosen based on geographical proximity
and occupation “for at least half of the decisions” [30]. In addition, the results
from the small-world experiment led to an avalanche of network models that have
certain properties resembling real social networks, such as a short diameter and
a high clustering coefficient [33].
2.2.2 Social Stratification
Research on stratification, such as racial segregation in neighborhoods and income
inequality, undermines the premise that we live in a small world. For instance, are there
really short paths connecting random people together? What about people who are
isolated from the rest of the world? Clearly, isolated people are much harder to reach
than prominent individuals such as politicians, CEOs, religious leaders, celebrities,
etc. In [28], Kleinfeld mentioned that Beck and Cadamagnani were unsuccessful in
replicating the small-world experiment with high success rates when they attempted
to reach a high-income target starting from a low-income person. This suggests that
one cause of stratification is income inequality, where people are segregated
into economic classes. This leads to further questions: what are the elements that cause
stratification? What attributes do we associate with other people? Since people have
an inclination to associate with people of the same ethnicity, cultural heritage, and
economic class, how do such tendencies affect the small-world property?
The small-world property has been accepted in the research literature because
possible routing strategies have been proposed to show how people strategically make
routing decisions. A routing strategy proposed by Kleinberg relies on participants
passing the folder to the acquaintance who is closest in terms of geography to the
target [41]. This makes sense since people have the cognitive ability to remember where
their acquaintances live. Also, it is common to have a few acquaintances who are
geographically close and a few acquaintances who are distant due to relocation
for a new job, studying at a university, retiring, etc.
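The geographic routing strategy described above can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: positions are points on a plane with Euclidean distance, previous holders are never selected twice (mirroring the folder roster), and a chain fails when the holder has no unused acquaintance. All names and the toy network are hypothetical.

```python
import math


def greedy_route(friends, position, start, target, max_hops=50):
    """Pass the folder to the unused acquaintance closest to the target."""
    def distance(a, b):
        (x1, y1), (x2, y2) = position[a], position[b]
        return math.hypot(x1 - x2, y1 - y2)

    chain = [start]
    current = start
    while current != target and len(chain) <= max_hops:
        # Previous holders are on the roster and cannot be selected again.
        candidates = [f for f in friends[current] if f not in chain]
        if not candidates:
            return None  # the chain is dropped
        current = min(candidates, key=lambda f: distance(f, target))
        chain.append(current)
    return chain if current == target else None


# Hypothetical network: "c" is the one distant acquaintance of the starter.
position = {"s": (0, 0), "a": (1, 0), "b": (0, 1), "c": (5, 0), "t": (6, 0)}
friends = {"s": ["a", "b", "c"], "a": ["s"], "b": ["s"], "c": ["s", "t"], "t": ["c"]}
chain = greedy_route(friends, position, "s", "t")
```

In this example the starter's single distant acquaintance is what makes the short chain s, c, t possible, which is the role of the few long-range contacts mentioned in the text.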
CHAPTER 3
SOCIAL NETWORK ANALYSIS
Typically social network analysis examines relationships among people in terms of
graph theory where nodes represent actors and edges represent their relationships.
In this chapter, we examine three important aspects of social network analysis. The
first is understanding the effect of geography in terms of the location of actors on the
structure of the social network. The second is measuring face-to-face interactions of
the actors and their social relationships. The third is detecting hidden communities
that are well-connected in terms of social relationships and highly-active in terms of
face-to-face interactions. We examine these three aspects of social network analysis
in detail using data collected from a location-based social network called Gowalla.
Besides ranking and searching, these three aspects of social network analysis can also
be used to model human mobility in mobile ad-hoc networks (see Sec. 3.5) and predict
economic development of large geographical areas (see Sec. 3.6).
In section 3.1, we examine geography, co-appearance, and interactions of users
in Gowalla, focusing on the effect of geography on the structure of the network and
on face-to-face interactions. In section 3.2, we incorporate geographical information
of users into three selected community detection algorithms, namely a modified
version of the Clique Percolation Method (CPM), the Inference Algorithm (IA), and GANXiS,
to detect disjoint and overlapping communities that are well-connected in terms of
social relationships and highly active in terms of face-to-face interactions. In section
3.3, we design an experiment in which we generate different types of covers by
using a combination of social and geographic information. In section 3.4, we use
quality measurements based on the link connectivity, geographical proximity, and
physical interactions among members to examine detected communities as a function
of their sizes and use covers as a baseline. We conclude this chapter in section 3.7
with a summary of the results and potential applications that might benefit from the
analysis of geography and spatially-aware community detection.
Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Using Location-Based Social Networks to Validate Human Mobility and Relationships Models,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Istanbul, 2012, pp. 1247-1253.
This chapter previously appeared as: T. Nguyen et al., “Analyzing the Proximity and Interactions of Friends in Communities in Gowalla,” in Proc. IEEE/ACM Int. Conf. Advances on Data Mining Workshops, Dallas, TX, 2013, pp. 1036-1044.
Figure 3.1: Geographical spread of 100K checkins in Gowalla.
3.1 Geography, Co-Appearance, & Interactions
3.1.1 Data Collection
We collected data from a location-based social networking provider called Gowalla
that allowed people to use their internet-enabled and sensing-capable mobile phones
to record and share their current location with their friends. By using Gowalla’s
API, we were able to retrieve 391,223 users with public profiles (friends and checkins)
from mid-September 2011 to late October of that year. Gowalla
has since been purchased by Facebook and no longer operates independently. The data for
FourSquare, Twitter, and Google Buzz were collected in a similar manner by using
breadth-first search.
To collect the data, we start with a user randomly chosen and process all the
public information available about that user. Then we store the IDs of all of the user’s
friends and put them into a processing queue in a FIFO order. After that, we retrieve
the next user from the queue and repeat the process. Therefore, we crawled Gowalla
breadth-first, a standard technique in the social networking literature often referred
to as Breadth First Search (BFS) sampling.
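The crawl described above can be sketched as follows. Since the Gowalla API is no longer available, `fetch_profile` is a hypothetical callable that stands in for the API request and returns a dictionary with a `friends` list. The `seen` set, which prevents a user from being queued twice, and the `max_users` limit are assumptions added for the sketch.

```python
from collections import deque


def bfs_crawl(fetch_profile, seed_user, max_users):
    """Crawl users breadth-first, starting from seed_user."""
    queue = deque([seed_user])   # FIFO processing queue
    seen = {seed_user}           # users already queued
    profiles = {}
    while queue and len(profiles) < max_users:
        user = queue.popleft()
        profile = fetch_profile(user)   # hypothetical stand-in for the API call
        profiles[user] = profile
        # Store the IDs of the user's friends for later processing.
        for friend in profile["friends"]:
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return profiles


# Toy friendship network used in place of the real service.
toy_network = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
profiles = bfs_crawl(lambda user: {"friends": toy_network[user]}, 1, max_users=3)
```

A `deque` is used so that both adding to the back of the queue and removing from the front take constant time.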
As shown in Table 3.1, the users accumulated a total of around 26 million
checkins and 8 million friendship links. The average day of the checkins is 3.14, which
represents Wednesday. The earliest checkin is on Jan. 21, 2009. The average distance
between two consecutive checkins of a user is 128.72 km. The average time interval
between two consecutive checkins of a user is 6.41 days with a standard deviation of
13.29. The geographical spread of the checkins is shown in Fig. 3.1. The checkins
from Gowalla allow us to measure the face-to-face interactions between friends by
inferring how often friends checked into the same location at approximately the
same time.

Table 3.1: Data summary of Gowalla network.

                  Mean      Std. dev.   Sum
Users             -         -           391,223
Checkins          164.64    636.68      26,303,580
Friends           11.13     67.03       2,176,384
Weekday           3.14      2.01        Jan. 21, 2009
Distance (km)     128.72    356.51      20,565,644
Time (days)       6.41      13.29       -

3.1.2 Notations & Definitions
Given a set of users $U$, let $u \in U$ be a particular user, $L_u$ be the set of its shared
locations, known as checkins, and $F_u$ be the set of its friends. A shared location $l \in L_u$
of the user $u$ is a tuple of three elements denoted as $l_1$, $l_2$, and $l_3$, corresponding to
the latitude, longitude, and timestamp of the location $l$, respectively. The friendship
network, denoted as $F = (U, E_U)$, is an undirected and unweighted graph where an
edge represents reciprocal friendship; that is, $e = (u, u') \in E_U$ means $u' \in F_u$ and
$u \in F_{u'}$. The geographic distance $d(u, u')$ between two users $u$ and $u'$ is estimated by
averaging the locations in $L_u$ and $L_{u'}$ and using the haversine formula to calculate
arc distances. The checkin similarity $CS(u, u')$ of users $u$ and $u'$ is defined as:
$$CS(u, u') = \frac{|L_u \cap L_{u'}|}{|L_u \cup L_{u'}|}. \tag{3.1}$$
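The two quantities defined above, the geographic distance between users and the checkin similarity of Eq. (3.1), can be sketched as follows. The sketch assumes a spherical Earth with a mean radius of 6371 km, a plain arithmetic mean of latitudes and longitudes as the average location (which ignores wrap-around at the date line), and checkins identified by place identifiers when computing the set overlap. All function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (arc) distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_location(checkins):
    """Average latitude and longitude of (lat, lon, timestamp) tuples."""
    lats = [c[0] for c in checkins]
    lons = [c[1] for c in checkins]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def user_distance_km(checkins_u, checkins_v):
    """Distance between the average locations of two users."""
    lat_u, lon_u = mean_location(checkins_u)
    lat_v, lon_v = mean_location(checkins_v)
    return haversine_km(lat_u, lon_u, lat_v, lon_v)


def checkin_similarity(places_u, places_v):
    """Eq. (3.1): size of the intersection over size of the union."""
    union = places_u | places_v
    return len(places_u & places_v) / len(union) if union else 0.0


similarity = checkin_similarity({"cafe", "gym"}, {"gym", "park"})  # 1 shared of 3
```

The checkin similarity is the Jaccard index of the two location sets, so it ranges from 0 (no shared places) to 1 (identical sets of places).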
The level of physical interaction between users $u$ and $u'$, denoted as $I(u, u')$, is
calculated from their shared locations as follows. Two locations $l \in L_u$ and $l' \in L_{u'}$
are equivalent if they are within geographic proximity, $d(l, l') < d_\epsilon$, and occurred within
a time interval $|l_3 - l'_3| < t_\epsilon$. Given two such equivalent locations, we
infer that $u$ and $u'$ went to the place $l$ together.
Figure 3.2: Friendship is bounded by geographical distance. [Scatter plot; x-axis: Checkin Similarity (0 to 1); y-axis: Distance Similarity log(km) (0 to 8); legend: Not Friends, Friends.]
The maximum pairwise equivalence between $L_u$ and $L_{u'}$ is defined as the longest
sequence of equivalent location pairs $((l_1, l'_1), \ldots, (l_k, l'_k))$, such that for each $1 \le i \le k$,
$l_i \in L_u$, $l'_i \in L_{u'}$, and $l_i$ is equivalent to $l'_i$. The level of physical interaction $I(u, u')$ is
defined as the length $k$ of the maximum pairwise equivalence divided by the size of
the smaller location set:
$$I(u, u') = \frac{k}{\min(|L_u|, |L_{u'}|)}. \tag{3.2}$$
Finding the maximum pairwise equivalence can be reduced to a network flow
problem where polynomial running time algorithms such as Ford-Fulkerson can be
used to calculate the maximum number of matches.
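The reduction above can be illustrated with a small augmenting-path matcher, which is what Ford-Fulkerson amounts to on a unit-capacity bipartite network. The sketch assumes that each location may appear in at most one pair, and it takes the equivalence test as a hypothetical `are_equivalent` callable. The toy checkins use planar coordinates in km and timestamps in minutes rather than latitude and longitude.

```python
import math


def max_pairwise_equivalence(locs_u, locs_v, are_equivalent):
    """Size k of a maximum matching between equivalent locations."""
    match_of_v = [None] * len(locs_v)  # index into locs_u matched to each v location

    def try_assign(i, visited):
        # Look for an augmenting path starting at location i of user u.
        for j, loc_v in enumerate(locs_v):
            if j in visited or not are_equivalent(locs_u[i], loc_v):
                continue
            visited.add(j)
            if match_of_v[j] is None or try_assign(match_of_v[j], visited):
                match_of_v[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(locs_u)))


def interaction_level(locs_u, locs_v, are_equivalent):
    """Eq. (3.2): k divided by the size of the smaller location set."""
    if not locs_u or not locs_v:
        return 0.0
    k = max_pairwise_equivalence(locs_u, locs_v, are_equivalent)
    return k / min(len(locs_u), len(locs_v))


def close_in_space_and_time(a, b, d_eps_km=1.0, t_eps_min=30.0):
    # Toy equivalence on (x_km, y_km, minute) tuples.
    return (math.hypot(a[0] - b[0], a[1] - b[1]) < d_eps_km
            and abs(a[2] - b[2]) < t_eps_min)


checkins_u = [(0.0, 0.0, 0), (0.0, 0.0, 20), (10.0, 10.0, 100)]
checkins_v = [(0.5, 0.0, 10), (10.0, 10.5, 110)]
k = max_pairwise_equivalence(checkins_u, checkins_v, close_in_space_and_time)
level = interaction_level(checkins_u, checkins_v, close_in_space_and_time)
```

Here the first two checkins of user u both match the first checkin of user v, but only one of them can be paired with it, so k is 2 and the interaction level is 2 divided by 2, that is, 1.0.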
3.1.3 Data Analysis & Results
In Fig. 3.2, there are 701 blue points, each representing a pair of randomly selected users
who are friends, and 620 red points, each representing a pair of randomly selected users who
are not friends within the dataset. The shaded region is drawn by using the k-nearest
neighbor algorithm to classify whether two users are friends given their average
distance apart and checkin similarity.
In Fig. 3.2, we notice that co-appearance, represented by checkin similarity, is
a poor indicator of friendship; that is, people who are temporarily at the same
place at the same time are not likely to be friends. Intuitively, co-appearance happens often
at popular spots, like concerts and cafes, that attract people living in a great variety of
locations. Even if a group of a few friends goes to a concert together, they would
not be friends with the thousands of other attendees; hence, the chance that a random
pair of attendees are friends is low. Occasional co-appearances are not sufficient, but
geo-proximity helps in establishing and maintaining friendship, as seen in Fig. 3.2.
Figure 3.3: Densities of pairs as a function of geographical distance. [Two panels, (a) Hop=1-3 and (b) Hop=4-6; x-axis: Avg. Distance of Separation (km), 0 to 4000; y-axis: Fraction.]
In Fig. 3.3, we plotted the density of friends (hop=1), friends-of-friends (hop=2),
and pairs of users up to six degrees of separation as a function of the average
geographic distance between two users in km. For each level 1 ≤ k ≤ 6 of indirection
(measured in the number of hops), we randomly selected 5,000 non-cyclic paths of
length k from the Gowalla dataset and created 5,000 pairs from the ends of these paths,
each pair with k levels of indirection of friendship. We analyzed pairs that were
within 4,000 km distance from each other.
In Fig. 3.3(a), the density of direct friends (4,317 total) reaches its highest
value of 0.35 (in other words, 1,511 pairs) at the lowest geographic separation, in the
range from 0 to 160 km (each point at distance x represents users with distances
from x-160 km to x+160 km), and continues to decrease as the distance between them
increases. At the second level of indirection, the density of friends-of-friends (3,464
total) reaches its highest value of 0.19 in the range from 0 to 160 km and continues to
decrease as the geographic distance between them increases.
Geographic proximity has an effect: friends (hop=1) and friends-of-friends
(hop=2) are more likely, but not necessarily required, to be within proximity of each
other. For instance, 61% of friends are within 480 km and 47% of friends-of-friends are
within 640 km of each other. Another way of looking at the results is that people who
are separated by three or more hops are unlikely to be within geographic proximity
of each other.
Figure 3.4: Measuring face-to-face interactions (tε=30mins, dε=1km). [Two panels, Hop=1 and Hop=2; x-axis: Avg. Distance of Separation (km), 0 to 4000; y-axis: Level of Interaction.]
In Fig. 3.3(b), we plotted pairs of users who are separated by four, five, and
six hops. We noticed that they are not likely to be within geographic proximity of
each other. The density of those pairs reaches the highest value 0.07 at the 160 km
range centered at 1,200 km and continues to decrease regardless of their degrees of
separation.
In Fig. 3.4, we plotted the average level of face-to-face interactions I(u, u′)
of friends (hop=1) and friends-of-friends (hop=2) as a function of their geographic
distance in km. The larger the geographic distance between friends, the less likely they
are to physically interact by going to the same places together. The highest peak (0.027)
is at the lowest geographic separation, from 0 to 266 km, and the level continues to gradually
decrease (with some small fluctuations) as the distance between them increases. For
friends-of-friends, the physical interactions reflect the probability that they happened
to be together.
3.1.4 Limitations
We note that the locations of some users may be irrelevant
to their distant friends. This may be a source of bias: the geographic
proximity of friends may be exaggerated by a friendship selection process in Gowalla
in which users subjectively add friends who are within their geographic proximity.
However, we noticed that 38% of friends are geographically separated by more than
520 km. Also, the Gowalla data and other social media indicate that distant friends
are selected, perhaps for the purpose of keeping in contact [42].
In addition, Mislove et al. mentioned that the population of users who tweet
on Twitter is unbalanced [43]. Similarly, we believe that the users who check in on
Gowalla are not a representative sample of the entire population, as shown by
the concentration of checkins in Fig. 3.1.
3.2 Incorporating Geography into Community Detection
A common approach in community detection is to divide a network into multiple
partitions by maximizing the number of edges within each partition and minimizing
the number of edges between them. An often-used quality measure for the
partitions is modularity, which compares the difference between the fraction of edges
inside a partition and the fraction of edges crossing it with the difference expected if edges
in the network were randomly distributed [44]. Greedy approaches like hierarchical
clustering [45] and spectral approaches such as minimum cuts [46] divide a network
into disjoint partitions by combining or separating clusters of nodes so that modularity
is maximized at every step. As studied in [47], [48], a problem with
this modularity maximization approach is that it tends to merge two separate
communities, which increases the value of modularity but creates a merged community
that does not reflect the ground truth.
Another approach to community detection is to divide a network into multiple
partitions so that the majority of members within each partition shares a common
attribute [49]. A proposed attribute is based on friendship similarity defined as the
density of common friends between pairs of nodes [49]. A problem with this proposed
attribute is that it allows for a community consisting of people who have a lot of friends
in common but are not friends of each other. However, this imperfect definition works
well in practice because people who have a lot of friends in common are likely to be
friends themselves. Since community detection is an active area of research, our
goal is not to provide another technique that detects communities (many have been
proposed) but to incorporate the spatial information of nodes into existing algorithms
for analyzing Gowalla and to propose a null model (generating covers) to benchmark the
detected communities.
We combine these two approaches in community detection by incorporating
the location information of users and geographic distances between them into three
selected algorithms taken from the rich literature. First, we want to minimize the
number of edges between communities and maximize the number of edges within
them. Second, we want members inside a community to be within spatial proximity
by giving geographically correlated friends more weight than distant friends during
the detection process. This combined approach applies a natural interpretation of
a friendship community where members are well connected and also likely to be
geographically close. Also, geographically correlated nodes are more likely to interact
with each other face-to-face as seen previously.
We selected three community detection algorithms based on their popularity
(CPM), promising experimental results (IA), and ability to scale to millions of nodes
and edges (GANXiS) for the purpose of capturing and measuring the interactions of
users inside a community. In the following subsections, we summarize the selected
algorithms and describe how we incorporated geographic information of users into
the process of detecting friendship communities in Gowalla, since the level of interactions
is correlated with distance, as seen previously.
3.2.1 Clique Percolation Method
The CPM algorithm was proposed to detect overlapping communities by combining
cliques, or fully connected subgraphs [50]. Given an undirected graph F =
(U, E_U), let H_m denote the set of all cliques in F of size m. The clique-graph
G = (H_m, E) consists of the cliques in H_m represented as nodes, with an edge between a pair
of cliques if they have m−1 overlapping members. Each connected component of the
graph G is a community consisting of many fully connected subgraphs of F.
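The clique-graph construction can be sketched as follows. This is a brute-force illustration that enumerates every m-subset of nodes, so it is only suitable for very small graphs; the function name `clique_percolation` and the adjacency-set input format are assumptions made for the example, not the implementation used in the thesis.

```python
from itertools import combinations

def clique_percolation(adj, m=3):
    """Communities as connected components of the clique-graph.

    adj: dict mapping each node to the set of its neighbors (assumed format).
    """
    # H_m: all fully connected subsets of exactly m nodes.
    cliques = [frozenset(c) for c in combinations(sorted(adj), m)
               if all(v in adj[u] for u, v in combinations(c, 2))]

    # Union-find over cliques to track connected components of the clique-graph.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two cliques are adjacent if they share m - 1 members.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == m - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)
```

For two triangles that share an edge plus one separate triangle, the sketch returns two communities: the four nodes of the joined triangles and the three nodes of the separate one.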
A problem with the CPM algorithm is its lack of scalability, because the number
of cliques explodes as m increases for large networks. Moreover, the problem of
finding the clique with the largest size in a given graph is NP-hard [51], which prevents
the algorithm from using cliques of nearly the largest size.
We modified CPM to incorporate geographic information of nodes and made the
algorithm scalable as follows. Instead of finding cliques of large sizes, we find triangles
(m = 3) since they can be efficiently identified in parallel using map-reduce. To
limit the number of triangles, we select a subset of disjoint triangles from all possible
triangles by using geographic distances between pairs of nodes. The average
geographic distance of a triangle t is defined as $\frac{1}{3}\sum_{u \neq u' \in t} d(u, u')$. We
take one triangle at a time from the list of triangles sorted by this distance until all possible
disjoint triangles have been taken. If a user is not part of any disjoint triangle, we assign that
user to the triangle that maximizes the number of edges between the user and the triangle,
and use geographic distances to break ties by assigning the user to the geographically
closest triangle.
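The selection of disjoint triangles can be sketched as below. The sketch assumes the triangles are processed in ascending order of average geographic distance (the sort direction is an assumption) and uses a caller-supplied distance function; `select_disjoint_triangles` and `dist` are hypothetical names. The later step that assigns leftover users to triangles is not shown.

```python
from itertools import combinations

def select_disjoint_triangles(adj, dist):
    """Greedily pick node-disjoint triangles, geographically tightest first.

    adj:  dict mapping each node to the set of its neighbors (assumed format).
    dist: function (u, v) -> geographic distance in km (assumed interface).
    """
    triangles = [t for t in combinations(sorted(adj), 3)
                 if all(v in adj[u] for u, v in combinations(t, 2))]

    def avg_distance(t):
        # (1/3) * sum of d(u, u') over the three pairs of the triangle.
        return sum(dist(u, v) for u, v in combinations(t, 2)) / 3.0

    used, chosen = set(), []
    for t in sorted(triangles, key=avg_distance):  # assumption: ascending order
        if used.isdisjoint(t):
            chosen.append(t)
            used.update(t)
    return chosen
```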
The clique-graph G′ is defined as G′ = (T,ET ) where T is the set of modified
triangles and ET is the set of edges between triangles that are assigned as follows. For
each triangle, we create a single clique edge from this triangle to the one that maxi-
mizes the number of friendship edges between them, and use geographic distances to
break ties if necessary. Like in the original CPM algorithm, each connected compo-
nent of G′ is a community consisting of geographically correlated and well connected
subgraphs of F .
3.2.2 Modularity Maximization
Modularity maximization is a popular technique for finding communities, proposed
in [44], [45]. Given a graph F = (U, E_U) and a set P containing disjoint
partitions or subsets of U, the modularity Q of the partitions in P is defined as:
$$Q = \sum_{p_i \in P} \left( e_{ii} - a_i^2 \right) \qquad (3.3)$$
where $e_{ij}$ is the fraction of edges between nodes in the partitions $p_i$ and $p_j$, and
$a_i = \sum_j e_{ij}$ is the fraction of edge ends attached to nodes in the partition $p_i$ [44].
A positive value of Q indicates that the partitions contain a larger fraction of internal
edges than would be expected under a null model with randomly placed edges.
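As a worked illustration of Eq. (3.3), the sketch below computes Q for an unweighted, undirected edge list. It assumes the usual convention in which a_i is the fraction of edge ends attached to partition p_i (each edge contributes two ends); the function name `modularity` and the `community_of` mapping are hypothetical.

```python
def modularity(edges, community_of):
    """Modularity Q = sum_i (e_ii - a_i^2) for an undirected edge list.

    edges:        list of (u, v) pairs, one entry per undirected edge.
    community_of: dict mapping each node to its partition label.
    """
    m = len(edges)
    internal = {}  # number of edges with both ends in partition i
    ends = {}      # number of edge ends attached to partition i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)
```

For two triangles joined by a single edge and split into their natural halves, each half has e_ii = 3/7 and a_i = 1/2, so Q = 2(3/7 − 1/4) ≈ 0.357. Placing all nodes in one partition gives Q = 0.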
To maximize modularity, a greedy approach based on hierarchical clustering was
proposed in [45], [52]. Initially, every node in U belongs to its own community. Then
the pair of communities whose merger yields the highest increase in modularity is merged.
The merging process repeats n − 1 times, where n = |U|. The set of clusters with the
highest value of modularity over all iterations is taken as the set of communities.
For weighted networks, Newman proposed a simple technique to map
integer-valued weights to multigraphs [53]. For every edge of weight w_ij,
w_ij − 1 additional unweighted edges are added between nodes i and j, and the weight w_ij
is set to 1. The definition of modularity remains the same, since the fraction of edges
e_ij between partitions p_i and p_j can simply incorporate multiple edges between nodes.
We incorporated geographic information about users into the Inference Algorithm (IA)
by assigning weights to edges based on spontaneousness and typical means of
travel: walking up to 1.6 km, biking or using public transportation up to 25 km, a short
car or train ride up to 100 km, a long car or train ride up to 500 km, and a plane flight above
500 km. Friends who are within walking distance (1.6 km) get the highest weight of
2^4. Friends who are within biking distance (25 km) get the second highest weight of
2^3. Friends who are within driving distance get a weight of 2^2, and so on.
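The distance-based weighting can be sketched as below. The sketch makes two assumptions that should be checked against the original: that the five travel bands map to the powers of two 2^4, 2^3, 2^2, 2^1, and 2^0, and that distances are great-circle (haversine) distances on a sphere of radius 6,371 km. The names `haversine_km`, `TRAVEL_BANDS`, and `edge_weight` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# (upper distance limit in km, weight); the weights are assumed powers of two.
TRAVEL_BANDS = [(1.6, 16), (25.0, 8), (100.0, 4), (500.0, 2)]

def edge_weight(distance_km):
    """Map the distance between two friends to an integer edge weight."""
    for limit_km, weight in TRAVEL_BANDS:
        if distance_km <= limit_km:
            return weight
    return 1  # plane flight, above 500 km
```

Because the weights are integers, they can be passed directly to the multigraph mapping for weighted modularity.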
3.2.3 Speaker-Label Propagation (GANXiS)
GANXiS was proposed in [54] based on a probabilistic propagation process that
spreads labels between speakers and listeners. Given a graph F = (U, E_U), each node
u_i ∈ U initially carries a unique label i in its pocket p_i = {i}. When a node u is
randomly selected to speak, it requests all members of its neighborhood (the nodes that
are adjacent to u) to each randomly send a label from their pocket to u. The probability of a
label being chosen by u′ from its pocket p_u′ is proportional to the number of times the label
was added; the more times a label was added, the more likely it will be chosen. The
probability of a speaker u_i choosing a label from a listener u_j is based on the weight
w_ij/w_i, where w_i is the sum of the weights of all edges coming out of u_i. For unweighted
networks, w_ij = 1.
The algorithm repeats until the maximum number of iterations is completed,
where in each iteration everyone gets to speak exactly once in a random order. At
the end, labels whose probability of being chosen to send to a speaker is less
than a threshold r are deleted. Finally, the labels that a node carries determine the
communities to which it belongs. For instance, nodes that carry a label i
belong to the community c_i. Time to live (TTL) has been recently proposed to limit
the number of labels that nodes propagate. TTL defines the number of times a label
can be sent (so it reaches a limited number of nodes within TTL hop distance).
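The propagation process described above can be sketched as follows. This is a simplified illustration, not the reference GANXiS implementation: it omits TTL, it draws one neighbor with probability w_ij/w_i and lets that neighbor send one label (rather than collecting a label from every neighbor first), and it keeps a node's most frequent label if no label reaches the threshold r. The function name `label_propagation` and the nested-dict graph format are assumptions.

```python
import random
from collections import Counter

def label_propagation(adj, iterations=20, r=0.1, seed=0):
    """Speaker-listener style label propagation (simplified sketch).

    adj: dict node -> dict of neighbor -> edge weight (use 1 if unweighted).
    Returns a dict node -> set of community labels.
    """
    rng = random.Random(seed)
    # Each pocket counts how many times each label was added.
    pocket = {u: Counter({u: 1}) for u in adj}
    order = list(adj)
    for _ in range(iterations):
        rng.shuffle(order)  # everyone is selected exactly once per iteration
        for u in order:
            if not adj[u]:
                continue
            neighbors = list(adj[u])
            weights = [adj[u][v] for v in neighbors]
            # Pick neighbor v with probability w_uv / w_u.
            v = rng.choices(neighbors, weights=weights, k=1)[0]
            # v sends a label with probability proportional to its count.
            labels = list(pocket[v])
            counts = [pocket[v][label] for label in labels]
            pocket[u][rng.choices(labels, weights=counts, k=1)[0]] += 1
    result = {}
    for u, counts in pocket.items():
        total = sum(counts.values())
        kept = {label for label, c in counts.items() if c / total >= r}
        # Assumption: fall back to the most frequent label if none passes r.
        result[u] = kept or {counts.most_common(1)[0][0]}
    return result
```

Labels can only travel along edges, so nodes in different connected components never end up sharing a label.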
The advantage of GANXiS is that it scales linearly with the number of edges,
but the disadvantage is that the relationship between convergence and the number of
iterations is as yet unknown. GANXiS is capable of discovering overlapping communities,
but we selected its running parameters in such a way that the results included
only disjoint communities to make them compatible with the results of the other algorithms.
We incorporated geographic information of users into GANXiS by assigning
weights based on spontaneousness and typical means of travel, as in weighted IA.
Friends who are within walking distance (1.6 km) get the highest weight of 2^4. Friends
who are within biking distance (25 km) get the second highest weight of 2^3. Friends
who are within driving distance get a weight of 2^2, and so on. This extends
the interpretation of the speaker-listener propagation algorithm: a listener is more
likely to be able to hear a speaker if they are within spatial proximity.
3.3 Contrasting Communities to Null Models
We propose to integrate spatial and friendship information of nodes into a
process of generating covers. The purpose of the covers is to serve as a baseline
for analyzing the performance of various community detection algorithms under a
quality measurement. In section 3.3.1, we describe how we generated six covers by
using a combination of spatial and friendship information in traversing the network.
In section 3.3.2, we select a few quality measurements for examining covers and
detected communities. In section 3.3.3, we examine the covers using the selected
quality measurements.
Table 3.2: Six techniques for generating covers.
Algorithm               Abbreviation   Spatial Info.?   Social Info.?
Completely Random       CR             no               no
Random Walk             RW             no               yes
Closest Friend First    CFF            yes              yes
Farthest Friend First   FFF            yes              yes
Closest to All          CTA            yes              yes
Farthest to All         FTA            yes              yes
3.3.1 Techniques for Generating Covers
Given a graph F = (U,EU), a cover C ⊂ U of size k is a subgraph of F with k
nodes selected in a specific way. A completely random cover CR is one where each
user u ∈ U has the same probability of being added during the selection. In a random
walk cover RW , we first randomly add a seed into the cover, then randomly select a
friend of the most recently added user, and continue selecting friends until the cover
reaches the size k. The closest-friend-first cover CFF is similar to RW but instead of
adding a random friend, we add the spatially closest friend not in the cover of the last
added user. If all of that user’s friends have already been added into the cover, we go
back one step to the previously last added user and branch out from there. We call
this the roll-back mechanism. The farthest-friend-first cover FFF is similar to CFF
except that we take the spatially farthest friend instead of taking the closest one.
The closest-to-all cover CTA is similar to CFF but instead of adding the spatially
closest friend to the last added user, we add the spatially closest friend with respect
to all members already in the cover. Finally, the farthest-to-all cover FTA is one
where we take the spatially farthest friend with respect to all members already in the
cover. The CTA and FTA cover generation algorithms are described in Fig. 3.5,
without the roll-back mechanism for simplicity. The covers and their details are listed
in Table 3.2.
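The closest-friend-first cover with roll-back can be sketched as below. The sketch assumes that rolling back means returning along the chain of previously added users until one of them still has a friend outside the cover, and it takes the seed as an argument instead of drawing it at random so that the result is reproducible. The names `closest_friend_first_cover` and `dist` are hypothetical.

```python
def closest_friend_first_cover(adj, dist, k, seed_node):
    """Grow a CFF cover of up to k users starting from seed_node.

    adj:  dict mapping each user to an iterable of friends (assumed format).
    dist: function (u, v) -> geographic distance in km (assumed interface).
    """
    cover = [seed_node]
    in_cover = {seed_node}
    trail = [seed_node]  # users we can still roll back to
    while len(cover) < k and trail:
        current = trail[-1]
        candidates = [v for v in adj[current] if v not in in_cover]
        if not candidates:
            trail.pop()  # roll back one step and branch out from there
            continue
        nearest = min(candidates, key=lambda v: dist(current, v))
        cover.append(nearest)
        in_cover.add(nearest)
        trail.append(nearest)
    return cover
```

Replacing `min` with `max` gives the farthest-friend-first variant. If the seed's connected component has fewer than k users, the sketch returns a smaller cover.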
3.3.2 Measuring Covers & Communities
We use three types of quality measurements based on the link connectivity and
location of members to measure covers and communities.
The first type of measurements is based on the intra-edge count IEC, defined as
the number of edges with both ends inside the cover. The contraction CONT of
a cover is computed by dividing the intra-edge count by the size of the cover. The
intra-density IND of a cover is calculated by dividing the intra-edge count by the intra-edge
count of a completely connected cover of the same size. For these three measures
(IEC, CONT, IND), the higher the value, the better formed the community.

procedure CoverGeneration(k)
    F = (U, E_U)
    seed = rand(1, |U|), cover = [seed]
    while len(cover) < k do
        distances = [ ], m = len(cover)
        for u in F_seed do
            // Compute haversine distance from u to cover[i].
            d_u = (1/m) Σ_{i=1}^{m} d(cover[i], u)
            distances.append((u, d_u))
        end for
        // sort d_u from least to greatest or vice-versa
        distances = sort(distances, key = x: x[1])
        for u, d_u in distances do
            if u ∉ cover then
                cover.append(u)
                seed = u
                break  // add only the first friend not yet in the cover
            end if
        end for
    end while
    return cover
end procedure
Figure 3.5: Generating CTA & FTA covers.
The second type of measurements is based on the boundary-edge count BEC,
defined as the number of edges with one end inside the cover and the other
outside. This metric is useful for taking into account the effect of adding high-degree
users into covers of large sizes, since such users are likely to increase both the intra-
and boundary-edge counts. The expansion EXP of a cover is computed by dividing
the boundary-edge count by the size of the cover. The conductance COND of a cover
is defined as $\mathrm{COND}(C) = \frac{\mathrm{BEC}(C)}{2\,\mathrm{IEC}(C) + \mathrm{BEC}(C)}$.
For these three measures (BEC, EXP, COND), the lower the value, the better formed the
community.
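The six link-connectivity measures can be computed together from an edge list. The following is a minimal sketch; the function name `link_measures` and the convention of returning 0 when a denominator is zero are assumptions made for the example.

```python
def link_measures(edges, cover):
    """Compute IEC, BEC, CONT, EXP, IND, and COND for a cover.

    edges: list of (u, v) pairs, one entry per undirected edge.
    cover: iterable of nodes forming the cover (at least one node).
    """
    cover = set(cover)
    k = len(cover)
    iec = sum(1 for u, v in edges if u in cover and v in cover)
    bec = sum(1 for u, v in edges if (u in cover) != (v in cover))
    possible = 0.5 * k * (k - 1)  # edges in a completely connected cover
    return {
        "IEC": iec,
        "BEC": bec,
        "CONT": iec / k,
        "EXP": bec / k,
        "IND": iec / possible if possible else 0.0,
        "COND": bec / (2 * iec + bec) if (iec or bec) else 0.0,
    }
```

For a triangle attached to the rest of the graph by a single edge, the sketch gives IEC = 3, BEC = 1, IND = 1, and COND = 1/7.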
The third type of measurements is based on pair-similarity, which measures a
given metric such as friendship similarity among pairs of nodes. This is applicable to
the definition of a community whose members have a lot in common [49]. We
replace the friendship similarity ratio with three additional measurements based on the
geographic proximity and location of nodes. The first one is the geographic diameter
of a cover GDI, defined as the geographic distance between the two farthest nodes.
The second one is the average geographic distance AGD among pairs of nodes. For these
two measures (GDI and AGD), the lower the value, the better formed the community. The
third one is the sum of the levels of physical interactions SLI among pairs of nodes, for
which the higher the value, the better formed the community.

Table 3.3: Measurements for cover C of the size k.
Measurement      Definition
IEC [55]         |{(v_i, v_j) ∈ E | v_i ∈ C ∧ v_j ∈ C}|
BEC [56]         |{(v_i, v_j) ∈ E | v_i ∈ C ∨ v_j ∈ C}| − IEC
CONT             IEC/k
EXP [57]         BEC/k
IND [55]         IEC/(0.5k(k − 1))
COND [56, 57]    BEC/(2IEC + BEC)
GDI              max d(u, u′) ∀u, u′ ∈ C
AGD              Σ_{u≠u′∈C} d(u, u′)/(0.5k(k − 1))
SLI              Σ_{u≠u′∈C} I(u, u′)
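The three pair-based measures can be sketched as follows, summing over unordered pairs of members. The function name `pair_measures` and the callable interfaces `dist` and `interaction` are hypothetical; in practice `dist` would be the geographic distance d(u, u′) and `interaction` the face-to-face interaction level I(u, u′).

```python
from itertools import combinations

def pair_measures(cover, dist, interaction):
    """Compute GDI, AGD, and SLI for a cover with at least two members.

    dist:        function (u, v) -> geographic distance in km.
    interaction: function (u, v) -> level of physical interaction I(u, v).
    """
    pairs = list(combinations(cover, 2))  # 0.5 * k * (k - 1) unordered pairs
    distances = [dist(u, v) for u, v in pairs]
    return {
        "GDI": max(distances),
        "AGD": sum(distances) / len(pairs),
        "SLI": sum(interaction(u, v) for u, v in pairs),
    }
```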
3.3.3 Examining Covers in Gowalla
For each technique, we generated covers of fixed sizes from 5 to 100 with an
increment of 1. For each cover size, we generated 100 covers and calculated the
average intra-edge count, boundary-edge count, geographic distance, and geographic
diameter. We then derived the remaining measurements.
In Fig. 3.6(a), we noticed that FFF outgrows the other techniques in terms
of intra-edge count as the cover size increases. In Fig. 3.6(b), we noticed that FFF
and FTA outgrow the other techniques in terms of boundary-edge count by a great
margin, suggesting that they strategically add users with very large degrees. While
RW is decent at generating covers with high intra-edge counts, as seen in Fig. 3.6(a),
it is also biased since users with high degrees are more likely to be added, which
increases the intra-edge count as the cover continues to grow. However, FFF and
FTA are even more biased than RW, and FFF outgrows the other five techniques
because the radius of the farthest friend would cover everyone, including common
friends in between. On the other hand, we noticed that CFF and CTA are the most
effective of the six techniques at increasing the intra-edge count while minimizing
the boundary-edge count at the same time.
Figure 3.6: Intra-edge count, boundary-edge count, and geographic diameter of covers.
Panels (each plotted against cover size for CR, RW, CFF, FFF, CTA, and FTA): (a) intra-edge
count, (b) boundary-edge count, (c) geographic diameter (km).
In Fig. 3.6(c), we measure the geographic diameter of a cover as a function of its
size. As expected from how covers are generated, FFF and FTA are most effective
at maximizing the geographic diameter while CFF and CTA are most effective at
minimizing this measurement. The geographic diameter of FFF and FTA reaches
the limit within 20 iterations, while the diameter for CTA and CFF slowly continues
to grow. A similar trend is seen in Fig. 3.7(c) which shows the average geographic
distance in contrast to the growth rate of intra- and boundary-edge counts seen in
Fig. 3.7(a).
Last but not least, conductance is a measurement used to determine the quality
of a community by considering both the intra- and boundary-edge counts. As seen in
Fig. 3.7(b), CFF is the most effective of the six covers at minimizing conductance
since it preserves some geographic structure of the social network by traversing the
edges based on who is the geographically closest friend, and adding friends who are
likely to be friends with the members already in the cover. CTA is not as effective
as CFF because geographic distances get diluted as the size of the cover increases.
FFF and FTA are worse than RW at minimizing conductance. We later use the
physical interactions of users to compare and contrast the results generated by the
CFF cover with the results of the community detection algorithms.
Figure 3.7: Contraction, expansion, conductance, and geographic distance of covers.
Panels (each plotted against cover size for CR, RW, CFF, FFF, CTA, and FTA): (a) contraction
and expansion, (b) conductance, (c) average geographic distance (km).
3.4 Examining Detected Communities
We first examined the results by looking at the total number of communities
detected and the number of members in each one. The modified CPM algorithm with
geographic information detected 2.6K communities whose average size was 60, with
the size of the largest one being 69K. We did not run the original CPM algorithm
because of the long execution time required to generate the clique graph. IA without
geographic information detected 1.2K communities with an average size of 134 and
the size of the largest one being 52K. IA with geographic information detected 349
communities with an average size of 442 and the size of the largest one being 45K.
GANXiS without geographic information detected 7.2K communities with an average
size of 21 and the size of the largest one being 3K. Finally, GANXiS with geographic
information detected 4.6K communities with an average size of 33 and the size of
the largest one being 48K. Additional information relating to community sizes is
listed in Table 3.4.

Table 3.4: Detected communities and their sizes.
                             Community Size
Algorithm                Avg.    Std.     Smallest   Largest   Total
CPM                      60      1,356    6          68,671    2,572
IA                       134     1,935    2          52,315    1,151
IA w (w for weighted)    442     2,954    2          45,242    349
GANXiS TTL               21      87       3          3,139     7,236
GANXiS TTL w             33      767      3          48,290    4,636
3.4.1 Network Community Profile (NCP)
We used the network community profile (NCP) proposed in [56] to examine
detected communities as a function of their size. The authors proposed to take, for each
community size, the best partition as defined by a quality measure, because it represents
the potential of a partition in a community detection algorithm. By inspecting all
communities in the set of communities with the same size, we find for this set the
lowest conductance or the highest intra-density among its members, one quality metric
at a time.
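The profile can be sketched as a group-by over community sizes. In the sketch below, `network_community_profile` is a hypothetical name, `quality` is any function that scores a community, and `best` is `min` for conductance or `max` for intra-density.

```python
def network_community_profile(communities, quality, best=min):
    """Map each community size to the best quality value observed at that size.

    communities: iterable of node collections.
    quality:     function community -> float (e.g., conductance or intra-density).
    best:        min for measures where lower is better, max otherwise.
    """
    by_size = {}
    for community in communities:
        by_size.setdefault(len(community), []).append(quality(community))
    return {size: best(values) for size, values in sorted(by_size.items())}
```

Sizes for which no community was detected are simply absent from the returned mapping.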
For intra-density and conductance without geographic information, we use the
classical definitions from Table 3.3 and include all existing intra- and boundary-edges
in the counts.
For intra-density and conductance with geographic information, we only include
edges that are within a geographic proximity of 160 km, or roughly 2 hours of driving. A
low value of conductance is good because it means that the fraction of edges leading
outside the community is low, but a value of 0 is rare since it would indicate that
the community is isolated. However, for conductance with geographic information,
a value of 0 means there are no edges that connect to other communities that are
geographically close, so all bridge edges are long. This also means that, upon seeing a
short bridge edge, the community detection algorithm tends to merge the communities
connected by that edge, following the insight that neighbors tend to be
friends.
The potential issues resulting from using this approach are discussed below.
First, in many situations, taking the average value of a community quality gives a
more representative picture and is probably less sensitive to outliers.
Second, the number of communities for a given size might vary from a large number
for small communities to very few for large communities. Last but not least, there
might be no communities of a particular size, and taking the average quality might
give a smooth function that is easier to extrapolate at the missing points, as seen with
the covers. Figs. 3.8-3.10 present the results for communities detected by CPM, IA,
and GANXiS, respectively.
3.4.2 Link Connectivity Measurements
First, intra-density rapidly decreases as the size of the community increases because
adding another member into a large community requires everyone already in it to be
connected with this new member, as seen in Fig. 3.8-3.10(a). Second, unlike intra-density,
conductance is not correlated with the community size because there are both small
and large communities of varying values, as seen in Fig. 3.8-3.10(b). Third, GANXiS
and IA are a little better than CPM at maximizing intra-edges that are within geographic
proximity, as seen in Fig. 3.8-3.10(c). Fourth, IA is the best at minimizing boundary-edges
that are within geographic proximity, as seen in Fig. 3.9(d). Last but not least,
GANXiS and IA benefited from incorporating the geographic information of users, as
seen in Fig. 3.9-3.10(d), where geographically correlated friends are captured in the
community detection process.
3.4.3 Face-to-Face Interactions Measurements
Comparing Fig. 3.8-3.10(d) to Fig. 3.8-3.10(b), we noticed that some detected
communities had a spatial conductance value of 0. This means that every potential node
within geographic proximity of a community has already been included in it. For the
IA without geographic information, out of the 78 community sizes, 20 of them have a
geographic conductance of 0, yielding a ratio of 20/78 ≈ 0.26. For the IA with geographic
information, out of the 84 communities, 19 of them have a geographic conductance of
0, yielding a ratio of 19/84 ≈ 0.23. The remaining values are listed in Table 3.5. Results in
Table 3.5 show that GANXiS has the highest ratio of the number of communities with
a spatial conductance of 0 to the number of communities detected. From this
perspective, a good community detection algorithm detects many communities
with a spatial conductance of 0 as the result of merging connected and
geographically close communities together.

Figure 3.8: Communities detected by Clique Percolation Method. Panels (each plotted
against community size, log-scale): (a) intra-density, (b) conductance, (c) intra-density
with spatial info., (d) conductance with spatial info.

Table 3.5: Measuring spatial conductance.
Algorithm                # Spatial Cond. of 0   Total   Ratio
CPM                      21                     175     0.12
IA                       20                     78      0.26
IA w (w for weighted)    19                     84      0.23
GANXiS TTL               48                     126     0.38
GANXiS TTL w             47                     155     0.30

Table 3.6: Measuring face-to-face interactions.
Algorithm                Count   Total   Ratio
CPM                      84      95      0.88
IA                       38      41      0.93
IA w (w for weighted)    28      30      0.93
GANXiS TTL               60      87      0.69
GANXiS TTL w             77      85      0.91

Figure 3.9: Communities detected by Inference Algorithm. Panels (each plotted against
community size, log-scale, for the weighted and unweighted networks): (a) intra-density,
(b) conductance, (c) intra-density with spatial info., (d) conductance with spatial info.
We examined small-size communities because humans have limited resources
and cognitive abilities to keep and maintain social relationships, resulting in a limited
number of friendships known as Dunbar's number [58]. We measured and then plotted
in Fig. 3.11 the NCP level of physical interactions in communities and covers by
summing the level of physical interactions among pairs. From the plots, we observed
that CPM has small communities where members are statistically more likely than
members in covers to physically interact with each other by going to the same places
together. In Fig. 3.11(a), out of 95 communities detected by CPM of size up to
100, 84 have a higher amount of physical interaction among members than the
null model, CFF. In Fig. 3.11(b), out of 41 communities detected by IA under the
size of 100, 38 have a higher amount of physical interaction among members
than CFF. The remaining values are listed in Table 3.6.
Figure 3.10: Communities detected by GANXiS. Panels (each plotted against community
size for the weighted and unweighted networks): (a) intra-density, (b) conductance,
(c) intra-density with spatial info., (d) conductance with spatial info.
While CPM is the most effective at detecting communities that are intrinsically
small (95 total) and where the physical interaction among members is likely to be
higher than CFF (88%), IA is the most effective at detecting communities where
93% of them have a higher amount of physical interaction than the null model, as
seen in Table 3.6. Incorporating geographical information into GANXiS improves its
overall performance (91% with geography vs. 69% without).
Figure 3.11: Measuring face-to-face interactions among members. Panels (amount of
physical interaction versus community size, compared with the CFF cover): (a) CPM,
(b) IA, weighted and unweighted networks, (c) GANXiS TTL, weighted and unweighted
networks.
3.5 Application: Social Relationships & Human Mobility
Random mobility models have been popular among applied researchers for gen-
erating synthetic movements. Random walk is commonly used for graph traversals,
clustering analysis, and many other applications to model unpredictable behavior.
Random waypoint is a mobility model on Cartesian coordinate systems where two
dimensions are commonly used in simulations and higher dimensions are used for
theoretical analysis and generalization. Not only are these random models useful for
application purposes, but they are also powerful tools for analytical understanding
of many networking applications, like routing in decentralized architectures where
mobility plays a large role.
A typical ad-hoc network is a decentralized network formed by mobile agents
in a dynamic process without any fixed infrastructure. It is dynamic because the
topology of who is connected to whom is constantly changing due to the mobility and
connection preferences of the agents and the physical limitation of communication
devices. If two mobile agents are outside of transmission range, then the connection
is dropped. If they are within the transmission range, then the connection could
be established. Hence, the topology of the ad-hoc networks depends on a complex
combination of agent mobility, connection preferences, and environmental factors that
could disrupt services or enhance communication.
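The dependence of the topology on agent positions can be illustrated with a minimal sketch. It assumes agents move in a two-dimensional plane, share a single transmission range, and connect whenever they are within that range; connection preferences and environmental factors are ignored. The function name `links_in_range` is hypothetical.

```python
import math
from itertools import combinations

def links_in_range(positions, tx_range):
    """Return the set of agent pairs that are within transmission range.

    positions: dict mapping agent id -> (x, y) coordinates (assumed format).
    tx_range:  common transmission range, in the same units as the coordinates.
    """
    return {(a, b) for a, b in combinations(sorted(positions), 2)
            if math.dist(positions[a], positions[b]) <= tx_range}
```

Recomputing the link set after each movement step shows how connections appear and disappear as agents move.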
Some of these networks could be uncoordinated, where each agent acts selfishly
on its own behalf, while other networks could be coordinated, where all agents
collaborate to accomplish a particular goal, task, or mission. For instance, peer-to-peer
networks are uncoordinated networks where the architecture is designed for robustness
to reduce the damage of selfish activities, in which users engage but are reluctant
to contribute, and anti-choking algorithms are designed to effectively distribute pieces
of a file to maximize throughput and efficiency. On the other hand, military ad-hoc
networks are coordinated networks where soldiers communicate through a network
channel to rescue innocent civilians or capture fugitives in a mission.
Outside of computer networks, human mobility is important for studying the
spread of contagious diseases, traffic engineering, methods of large scale emergency
evacuations, and so on [59]. While individual mobility is important at a micro-level, it
serves as a building block for population mobility that has many potential applications
in studying the population at large scale. Using data to observe statistical patterns
that capture, characterize, and predict trajectories of human movements during their
daily activities is important for health organizations, civil engineers, and national
interests.
For instance, health organizations may want to study the spread of transmitted
diseases, while traffic and civil engineers may want to incorporate human mobility
analysis into their transportation models, where travellers can use a transportation
system consisting of bikes, buses, and subways to get from one place to another.
Understanding population mobility allows the design of effective transportation systems
in which traffic congestion is controlled and reduced. Last but not least, national
security agencies might be interested in knowing how social relationships impact
population mobility, so that guidelines can be provided during emergency evacuations in
natural disasters like Hurricane Irene and the Japanese nuclear meltdown of 2011, where
evacuating 45,000 people within a six-mile radius of two malfunctioning nuclear power
plants required optimal efficiency, since every second could count toward saving a life.
Figure 3.12: Generating a Markov Model using checkins. [Diagram: states Home, School,
Work, Lunch, and Mall linked by transition probabilities Pij.]
3.5.1 Network Congestion in MANETs
The backoff timer in the MAC 802.11 protocol is an algorithm designed to prevent
collisions between wireless transmissions. If two or more concurrent wireless
transmissions are within radio range, one will back off at random to let the other one
talk. Suppose we are interested in measuring the throughput of a wireless network where
people are working on their laptops and moving from location to location according to
some hidden attributes. Since human beings do not move randomly, we expect more
congestion at popular locations. If we use the RWP, most of the congestion occurs in
the middle of the area due to its stationary distribution, as shown in Fig. 3.15.
3.5.2 Mobility Generation
We propose the following algorithm for generating mobility traces using social
networking data from Gowalla. For our Friendship Mobility Model (FMM), which uses a
Markov Model as its underpinning, we first randomly select a user from the dataset
and include his or her friends in the selected group of users. For each selected user,
we calculate the patterns of checkin activities from the dataset. To define the set of
locations, we collect the unique places at which the user has checked in. For each
pair of subsequent locations, we calculate the shortest route using the haversine
distance. For the probability in the Markov Model of moving from location a to
location b, we count how many times the user checks in at location b immediately after
checking in at location a and divide by the number of times the user checks in at
location a. Finally, we calculate the time it takes for a given user to go from one
checkin to another. The entire process is depicted in Fig. 3.12.
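The model-building step described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names, the `(timestamp, location_id)` input format, and the Earth radius of 6371 km are assumptions made for the example. Each row is normalized by the number of check-ins at a that have a successor, so that the row sums to 1.

```python
import math
from collections import Counter, defaultdict

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def build_markov_model(checkins):
    """Build per-user transition probabilities and mean travel times.

    checkins: time-ordered list of (timestamp_seconds, location_id) for one user
    (hypothetical input format).
    """
    pair_counts = defaultdict(Counter)   # pair_counts[a][b] = times b directly follows a
    travel_times = defaultdict(list)     # observed durations for each (a, b) pair
    for (t0, a), (t1, b) in zip(checkins, checkins[1:]):
        pair_counts[a][b] += 1
        travel_times[(a, b)].append(t1 - t0)

    transitions = {}
    for a, successors in pair_counts.items():
        total = sum(successors.values())  # check-ins at a that have a successor
        transitions[a] = {b: c / total for b, c in successors.items()}

    mean_time = {pair: sum(v) / len(v) for pair, v in travel_times.items()}
    return transitions, mean_time
```

For example, a user who checks in at home, work, lunch, work, home, work gets a transition row for work that splits evenly between lunch and home.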
After we have built the empirical Markov Model for each user, we use Miller's
coordinate projection to convert geographic space into a Cartesian coordinate system
that preserves the triangle law of distances. For the mobility simulation, each node
is first assigned at random to one of its checkins. Each node then picks the location
of its next checkin at random, according to the assigned probabilities, and moves
directly to it along a straight-line trajectory. Once the node reaches the new checkin,
it repeats the process until the end of the simulation.
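The simulation loop described above can be sketched as follows. This is a hedged illustration under stated assumptions: `transitions` maps each location to a dictionary of next-location probabilities, `coords` holds already-projected (x, y) positions in metres, and the fixed speed, time step, and restart rule for locations without outgoing transitions are choices made for the example, not details taken from the thesis.

```python
import math
import random

def next_location(current, transitions, rng):
    """Sample the next check-in location from the node's transition row."""
    row = transitions.get(current)
    if not row:
        # Assumption: a location with no recorded successor restarts at random.
        return rng.choice(sorted(transitions))
    locations = list(row)
    weights = [row[b] for b in locations]
    return rng.choices(locations, weights=weights, k=1)[0]

def simulate_node(transitions, coords, speed, duration, dt=1.0, seed=0):
    """Return a list of (time, x, y) samples for one node.

    The node starts at a random check-in, then repeatedly picks a next
    check-in and moves toward it in a straight line at constant speed.
    """
    rng = random.Random(seed)
    current = rng.choice(sorted(transitions))
    x, y = coords[current]
    target = next_location(current, transitions, rng)
    t, trace = 0.0, [(0.0, x, y)]
    while t < duration:
        tx, ty = coords[target]
        dist = math.hypot(tx - x, ty - y)
        step = speed * dt
        if dist <= step:
            # Arrived: snap to the check-in and choose the next destination.
            x, y = tx, ty
            current = target
            target = next_location(current, transitions, rng)
        else:
            x += step * (tx - x) / dist
            y += step * (ty - y) / dist
        t += dt
        trace.append((t, x, y))
    return trace
```

Because each node only ever travels between its own check-ins, the trace stays inside the area spanned by those locations, which is the property that distinguishes this model from the RWP.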
Hence, the difference between the RWP mobility model and our FMM is that in
the latter the space of travel is limited to the area of the checkins for each individual
node. Moreover, each node moves differently based on its training set of checkins. For
instance, an adult might be inclined to check in at work more often than a student.
3.5.3 Experimental Congestion Design
We designed a controlled MANET experiment using ns-2 to compare the traffic
congestion between the RWP and the FMM. In the experiment, there are 15 mobile
nodes constantly sending out packets to their neighbors within the transmission
range. Other simulation parameters are listed in Table 3.7. When two or more
nodes are within radio range of each other, at most one can make a successful
transmission and the rest have to pause. We measure the overall congestion of the
network by counting how many times a node needed to pause, recording its geographic
location at each pause during the simulation.
Fig. 3.13 provides the outline of a simulated node moving and how it causes
congestion. Suppose a node starts at p1 and travels to p2 with some speed dictated
by the mobility model. A mobile node cannot transmit if there is already a concurrent
transmission within some nearby range. Therefore, it pauses until it detects no
concurrent transmissions. The pause time duration in a subarea is the total amount
of time that all nodes spend pausing or suspending their transmissions due to the
backoff timer of the MAC 802.11 protocol. During the trip from p1 to p2, the node
pauses in 3 subareas (1,2), (2,2), (3,3) represented by the dashed line, meaning that
the transmission was suspended for some time. The length of the dashed line in a
subarea represents the duration of pause time for that particular trip.
Figure 3.13: Design of simulation overview.
Footnote: We use “user” when referring to the dataset and “node” when referring to the
simulation. A node is built from the social network data provided by the users.
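The per-subarea bookkeeping described above can be sketched as a simple grid accumulation. This is an illustrative sketch only: the event format `(x, y, pause_seconds)`, the square cell size, and the 1-based `(column, row)` labelling are assumptions for the example, and the events would in practice come from the simulator's trace output.

```python
from collections import defaultdict

def subarea(x, y, cell_size):
    """Map a position in metres to a 1-based (column, row) grid cell."""
    return (int(x // cell_size) + 1, int(y // cell_size) + 1)

def pause_time_by_subarea(pause_events, cell_size):
    """Sum pause durations per grid cell.

    pause_events: iterable of (x, y, pause_seconds), one entry per backoff,
    recorded at the position where the node had to pause (hypothetical format).
    """
    totals = defaultdict(float)
    for x, y, seconds in pause_events:
        totals[subarea(x, y, cell_size)] += seconds
    return dict(totals)
```

With 500 m cells, two pauses at (100, 600) and (120, 650) both fall in subarea (1,2) and their durations are added together.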
3.5.4 Congestion Simulation Results
Table 3.7: Network simulator ns-2 parameters.
Parameters            RWP          FMM
Simulation Time (t)   10,000s      10,000s
MAC Layer             802.11Ext    802.11Ext
Width (x)             2000m        2000m
Length (l)            2000m        2000m
Nodes (n)             15           15
Pause Time            0            0
Min Speed             0            5
Max Speed             5            5
Total Backoffs        598,316      1,654,967
With the FMM (see [22]), we were surprised to find 2.77 times as much congestion
as with the RWP (1,654,967 versus 598,316 total backoffs; see Table 3.7). However,
this agrees with our intuition that in the FMM, friends like to maintain their
relationships by being closer to each other. Economic factors like the cost of
transportation and mobility have a great impact on how we choose with whom to be
friends.
Figure 3.14: Traffic congestion in FMM and RWP. [Scatter plot of congestion locations
for FMM and RWP; axes: X (m) and Length (m), each from 0 to 2000.]
Fig. 3.14 displays the simulation results of network congestion in a controlled
MANET. We took a sample of locations with traffic congestion. The points represent
places where at least one node had to back off during the simulation. Notice how
traffic congestion is dispersed for the RWP and clustered for the FMM. Please note
that this graph shows only the places of congestion, not the density or total volume
of communications. Fig. 3.15 displays the frequency of pauses caused by the backoff
timer in the MAC 802.11 protocol using the RWP. Congestion is concentrated in the
middle of the area, which reflects the stationary distribution of the RWP.
3.6 Application: Long Ties & Economic Development
A number of results in economic sociology suggest that human relationships
affect economic opportunities because information often spreads between people
[60]-[65]. In addition, information coming from interpersonal relationships is often
richer than that from traditional broadcast media such as television, newspapers, and
radio, because acquaintances can interact face-to-face and influence one another in
adopting new behavior and ideas [66]. Therefore, social networks can be portrayed as
a transportation system where individuals are drivers for generating ideas and the
links between people are vehicles for transporting ideas from one person to another.
Metaphorically, some links are faster at transporting ideas to a larger number of
people than others because not all vehicles are created equal.
Figure 3.15: Frequency of pauses using the RWP.
It has been argued that information coming from weak ties is often richer than
information arriving via strong ties because “those to whom we are weakly tied are
more likely to move in circles different from our own ... and have access to infor-
mation different from what we [usually] receive [65].” Weak ties have been shown
to be valuable sources of information because individuals can use them to find jobs
[32], [60], solicit feedback on starting new ventures [63], and search for people like
in the small-world experiment [31], [41], [67], [68]. In other settings such as examin-
ing workplaces, structural holes can affect productivity and innovation of employees
and could lead to higher compensation, more promotion opportunities, and better
performance evaluations [61]-[64]. Structural holes are those social relationships that
connect non-redundant contacts together [61]. An example of a structural hole is a
bridge that connects non-redundant contacts from two communities together. The
effect of weak ties on economic opportunities [69] suggests that perhaps information
coming from weak ties can also be used for measuring economic development on a
larger scale.
Contemporary developments in the science of urbanization have provided scaling
laws for innovation and wealth creation as a power function of the population size:
$y(t) = c\,x(t)^{m}$, where x(t) is the population size, y(t) is the metric of
innovation at time t, c is a constant, and m is the scaling exponent [70]. These
results show that as the population size increases, GDP, wages, patents, and private
research and development employment increase at superlinear rates, where
1.03 ≤ m ≤ 1.46 [70]. A plausible explanation for the superlinear scaling of wealth
creation is that as the population size increases, the number of social relationships
between people increases because there are more choices for establishing
relationships. This increases the connectivity between people and decreases the time
for ideas to spread, as long as the rate of establishing connections is faster than
the rate of population growth.
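To make the scaling exponent concrete, the sketch below estimates c and m by ordinary least squares on log-transformed data, which is a common way to fit a power law of this form. It is an illustration only: the function name is hypothetical, and the thesis does not state which fitting procedure was used for its reported exponents.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of log y = log c + m log x; returns (c, m, r).

    x, y: sequences of positive numbers (e.g. population sizes and a metric).
    r is the Pearson correlation between log x and log y.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    m = sxy / sxx                      # slope on the log-log plot
    c = math.exp(mean_y - m * mean_x)  # intercept, back-transformed
    r = sxy / math.sqrt(sxx * syy)
    return c, m, r
```

An exponent m above 1 indicates superlinear scaling, m near 1 indicates linear scaling, and m below 1 indicates sublinear scaling.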
Following this line of thinking, recent results in [71] suggest that a generative
model for tie formation as a function of population density yields results very similar
to the model based on population size [70]. Results show that algorithmically gen-
erated social ties based on population density, assuming that nodes are distributed
uniformly on a Euclidean space and they establish connections similar to the rank
friendship model [67], can be used to model urban characteristics of cities such as
GDP, HIV transmissions, and communication volume. Here we extend this line of
thinking by focusing on characteristics of economic development as a function of
speedy idea flow emulated on real social relationships - using long ties as the main
component enabling such flow. This was accomplished by using data containing ge-
ographical locations and friendship information of hundreds of thousands of people
from location-based social media such as Gowalla and FourSquare [22]. More impor-
tantly, these datasets allow us to infer face-to-face interactions [23] and measure the
strength of ties in terms of not only interactions but also geographical distance (i.e.,
short or long ties [72], [73]).
Other approaches for measuring economic development of large geographical
areas include examining the diversity of social contacts (i.e., call records as a proxy for
social relationships) since more contacts imply more channels for receiving information
[74], but using calling patterns to infer social contacts is biased towards those that are
more likely to be strong ties since weak ties are by definition those that are contacted
infrequently. While these approaches [71], [74] can vary in their complexity, ranging
from mathematically oriented to data-driven, what they share in common is using
social network analysis to predict innovation, wealth creation, and even patterns of
complex human behavior. The novelty of our approach lies at the intersection of
economic sociology (i.e., the interplay of weak ties and economic opportunities) and
simple contagion models (i.e., the spread of good ideas from one place to another).
Results show that the speed of access to ideas is a near-perfect measure of social
diversity and also a signature of economic development in the US without needing
to tune parameters or incorporate secondary factors such as the level of educational
attainment and internal transportation infrastructure.
3.6.1 A Stochastic Model of Economic Development
We propose a simple stochastic model that uses long ties as the main component
for measuring economic development of large geographical areas. Let G = (V, E, L)
be a social network where V is the set of nodes, E is the set of their undirected
relationships, and L is the mapping of users to locations of their residences. Let Ai
denote the set of nodes that reside in area i; i.e., Ai = {v ∈ V | L(v) = i}. The flow
of ideas matrix is denoted as F = (fij), where fij is the probability of an idea going
from Ai to Aj in one step, defined as the number of long ties connecting nodes from Ai
to Aj divided by the number of long ties originating from Ai; i.e.,
\[ f_{ij} = \frac{LT(A_i, A_j)}{\sum_{k=1}^{m} LT(A_i, A_k)} \tag{3.4} \]
where m is the total number of areas and LT(Ai, Aj) (1 ≤ i ≠ j ≤ m) denotes the
number of long ties connecting nodes from Ai to Aj; i.e.,
\[ LT(A_i, A_j) = \left|\{(s, t) \in E \mid (s \in A_i \wedge t \in A_j) \vee (t \in A_i \wedge s \in A_j)\}\right| \tag{3.5} \]
If we assume that innovative ideas travel randomly between areas, and the probability
of an idea spreading from Ai to Aj depends only on the present area and not the
previous areas, then {Xt, t ≥ 0} is a discrete-time Markov chain where Xt denotes
where the idea is located at time t.
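The construction of F in Equations 3.4 and 3.5 can be sketched directly from an edge list. This is a minimal illustration under assumptions: the input names `edges` and `location` are hypothetical, and the matrix is stored as a nested dictionary rather than a dense array.

```python
from collections import defaultdict

def flow_matrix(edges, location):
    """Row-normalized long-tie counts between areas (Eq. 3.4 and 3.5).

    edges: iterable of undirected (s, t) friendship pairs.
    location: dict mapping each node to the area where it resides.
    Returns flow[i][j], the one-step probability of an idea moving from area i to j.
    """
    long_ties = defaultdict(lambda: defaultdict(int))
    for s, t in edges:
        i, j = location[s], location[t]
        if i != j:                # long tie: endpoints live in different areas
            long_ties[i][j] += 1  # ties are undirected, so count both directions
            long_ties[j][i] += 1
    flow = {}
    for i, row in long_ties.items():
        total = sum(row.values())  # all long ties originating from area i
        flow[i] = {j: count / total for j, count in row.items()}
    return flow
```

Short ties (both endpoints in the same area) are skipped, so every row of the result describes only movement between distinct areas and sums to 1.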
Let Hij denote the expected time it takes for the idea originating at Ai to
arrive at Aj. Then the average expected time for the idea originating from anywhere
to arrive at Ai, denoted as φi, is defined as:
\[ \phi_i = \frac{1}{m-1} \sum_{k=1}^{m} H_{ki} \tag{3.6} \]
where Hii is 0. Hence, we expect φi to be inversely correlated with economic
development since areas that receive information quicker can act faster.
Suppose an innovative idea travels indefinitely; then the fraction of time the
idea stays in Ai is denoted as:
\[ \lambda_i = P(X_t = A_i) \tag{3.7} \]
λ = (λ1, λ2, ..., λm) is known as the stationary distribution, and there exists a unique
stationary distribution of Xt since it is irreducible [24]. Since λi denotes the
fraction of time the idea spends in area i, 1/λi denotes the expected time needed for
the idea to come back to i; therefore, φi ≈ 1/λi.
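The stationary distribution and the resulting return times can be approximated numerically. The sketch below is one possible approach, not the method used in the thesis: it assumes the flow matrix is stored as a nested dictionary whose rows sum to 1 and whose chain is irreducible, and it uses a "lazy" power iteration (averaging the current and stepped distributions), a swapped-in technique that has the same fixed point but also converges when the chain is periodic.

```python
def stationary_distribution(flow, iterations=10_000, tol=1e-12):
    """Approximate the stationary distribution of an irreducible chain.

    flow: dict area -> dict area -> one-step probability (hypothetical format).
    """
    areas = sorted(flow)
    pi = {a: 1.0 / len(areas) for a in areas}
    for _ in range(iterations):
        stepped = {a: 0.0 for a in areas}
        for i in areas:
            for j, p in flow[i].items():
                stepped[j] += pi[i] * p
        # Lazy update: same fixed point as pi = pi F, converges for periodic chains too.
        new_pi = {a: 0.5 * (pi[a] + stepped[a]) for a in areas}
        if max(abs(new_pi[a] - pi[a]) for a in areas) < tol:
            return new_pi
        pi = new_pi
    return pi

def mean_return_times(pi):
    """Expected number of steps for the idea to come back to each area: 1 / lambda_i."""
    return {a: 1.0 / p for a, p in pi.items()}
```

Areas that attract a larger share of long ties get a larger stationary probability and therefore a shorter return time.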
3.6.2 Experimental Results & Discussion
We extracted users and their social relationships in Gowalla and FourSquare
and kept those that are confined to the US. We partitioned the US into 51 areas
where each area corresponds to a federal state.
Figure 3.16 shows scaling laws of the number of short and long ties as a function
of the population size. Short ties are defined as those relationships where both users
live in the same state, while long ties are defined as those who live in separate states.
The total number of ties (i.e., all ties) is the sum of the number of short and long ties.
A point is a state where the x-axis corresponds to the number of users that live there,
and the y-axis corresponds to the number of their ties. Results show that as the
population size increases, the number of short ties increases at superlinear rates where
m ≈ 1.34 for Gowalla (a) and m ≈ 1.43 for FourSquare (b). This result supports
the claim that increasing the population size increases the number of relationships
between people and decreases their path lengths, so ideas can spread quicker. However,
long ties do not increase at superlinear rates but instead at approximately linear
rates where m ≈ 0.95 for Gowalla (a) and m ≈ 1.00 for FourSquare (b). Therefore,
long ties do not explain superlinear scaling of innovation and wealth creation as a
function of population size.
Figure 3.16: Scaling laws of short and long ties. [Axes: Population Size (log) vs.
Number of Ties (log). (a) Short ties m=1.34, r=0.97; long ties m=0.95, r=0.98; all
ties m=1.02, r=0.99. (b) Short ties m=1.43, r=0.95; long ties m=1.00, r=0.94; all
ties m=1.07, r=0.96.]
Figure 3.17: Face-to-face interactions of short ties and long ties. [(a) Number of
face-to-face interactions for short ties and long ties. (b) P(k > K) vs. K, both on a
log scale, for short ties and long ties.]
Figure 3.17 shows that most long ties are weak because face-to-face interactions
occur more often when people are geographically close. In this experiment, we
selected all pairs of long and short ties and calculated the number of their face-to-face
interactions by matching their checkins. The average number of interactions for short
ties is 3.95 (std=43.20), while this number for long ties is 0.73 (std=9.19). While not
all short ties are strong, most long ties are weak since 90% of them have no more
than two interactions. The x-axis in (b) represents the number of interactions K, and
the y-axis represents the probability that a tie has more than K interactions. We did
not repeat the same experiment for FourSquare because its API did not provide
access to users' checkins.
Figure 3.18: The collective strength of long ties in a simple contagion model.
[Panels (a) and (b): Number of Ties for adopters and non-adopters.]
We emulated a simple contagion process using social relationships of users to
examine the effects of short and long ties on adopting versus not adopting a
contagion (similar to the process of spreading ideas in [71]). Using Rogers' work on
the diffusion of innovations [75], we assume that 2.5% of the population, randomly
selected, is responsible for generating innovative ideas (i.e., the seed set). In each
step, they randomly select one of their acquaintances to propagate the contagion and
that acquaintance decides whether to adopt it with some fixed probability pc. If the
acquaintance decides to adopt the contagion, then it later becomes an initiator for
spreading it. The process stops when 13.5% of the population has adopted the
contagion. Those 13.5% of the population would be considered early adopters in the
diffusion of innovations [75].
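The contagion process described above can be sketched as follows. This is a hedged illustration rather than the thesis code: the adjacency-list input, the function name, the `max_steps` safety limit, and the choice to count the seed set toward the 13.5% stopping threshold are all assumptions made for the example.

```python
import random

def simulate_contagion(neighbors, p_adopt, seed_fraction=0.025,
                       stop_fraction=0.135, max_steps=100_000, seed=0):
    """Simple contagion on a social graph; returns the set of adopters.

    neighbors: dict node -> list of acquaintances (hypothetical input format).
    p_adopt: fixed probability that a contacted acquaintance adopts.
    """
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    n_seeds = max(1, round(seed_fraction * len(nodes)))
    adopters = set(rng.sample(nodes, n_seeds))  # randomly selected innovators
    target = stop_fraction * len(nodes)
    for _ in range(max_steps):
        if len(adopters) >= target:
            break
        # Iterate over a snapshot so new adopters start spreading in the next step.
        for spreader in sorted(adopters):
            if not neighbors[spreader]:
                continue
            contact = rng.choice(neighbors[spreader])
            if contact not in adopters and rng.random() < p_adopt:
                adopters.add(contact)
    return adopters
```

After a run, the numbers of short and long ties of adopters and non-adopters can be compared by splitting each node's neighbor list according to whether the neighbor lives in the same area.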
Figure 3.18 shows that early adopters have on average more long than short
ties. For Gowalla (a), the average adopter has 17.77 (std=38.67) short ties and 23.90
(std=111.57) long ties compared to 3.81 (std=6.58) short ties and 2.99 (std=6.20)
long ties for non-adopters. For FourSquare (b), the average adopter has 16.36
(std=27.67) short ties and 25.14 (std=54.11) long ties compared to 1.62 (std=3.45)
short ties and 1.63 (std=6.74) long ties for non-adopters. For the distribution of
short and long ties of adopters and non-adopters, see Fig. 3.19 for Gowalla (a,b) and
FourSquare (c,d). Since nodes in the social networks are more likely to adopt if they
have more acquaintances, the point is that a job source, valuable idea, or even a
social contagion is more likely to come from a weak tie because people have a limited
number of strong ties but many more weak ties [61]. This experiment shows the
collective strength of long ties by showing that people have a higher chance of
adopting a new idea if they have more long ties.
Figure 3.19: Distribution of long ties for adopters and non-adopters. [Axes: Long Ties
(log-scale) vs. Fraction. Panels: (a) Adopters, (b) Non-Adopters, (c) Adopters,
(d) Non-Adopters.]
We generate the flow matrix F = (fij) and calculate λi as a proxy for φi. Figures
3.20 and 3.21 show the economic development of US states as a function of the speed
of access to ideas for Gowalla and FourSquare respectively. The metrics we used for
economic development are gross GDP [76], the number of patents issued [77], and
the number of startups defined as non-profit firms with fewer than 20 employees [78].
Overall, results show that φi is highly correlated with the economic development in
the US.
Tables 3.8 and 3.9 show results using other techniques that have been proposed in
the literature for measuring economic development. The population density of a state
is defined as the number of residents [79] divided by the state's land area in sq. mi
(excluding water) [80]. The social diversity of a state i, denoted as Di, is defined as:
\[ D_i = \frac{\sum_{j=1}^{m} p_{ij} \log(p_{ij})}{\log(m-1)} \tag{3.8} \]
where pij is the number of edges connecting Ai and Aj divided by the number of
edges leaving Ai [74].
Figure 3.20: Economic development as a function of idea flow (Gowalla). [x-axis: φi
(log-scale). (a) Gross GDP (log-scale): 2009, m=−0.67, r=−0.92; 2010, m=−0.67,
r=−0.92; 2011, m=−0.66, r=−0.92; 2012, m=−0.67, r=−0.91. (b) Patents (log-scale):
2009, m=−0.81, r=−0.76; 2010, m=−0.82, r=−0.77; 2011, m=−0.83, r=−0.77; 2012,
m=−0.83, r=−0.79. (c) Startups (log-scale): 2009, 2010, and 2011, m=−0.59, r=−0.86.]
Table 3.8: Measuring economic development (Gowalla).
                      GDP        Patents    Startups
Population Density    r = 0.50   r = 0.45   r = 0.38
Social Diversity      r = 0.88   r = 0.74   r = 0.83
Ideas Flow            r = 0.92   r = 0.77   r = 0.86
In Table 3.8, results show that speed of access to ideas φi in Gowalla is more
correlated with economic development than population density and social diversity.
Figure 3.21: Economic development as a function of idea flow (FourSquare). [x-axis: φi
(log-scale). (a) Gross GDP (log-scale): 2009 to 2012, m=−0.59, r=−0.88. (b) Patents
(log-scale): 2009, m=−0.70, r=−0.71; 2010, m=−0.71, r=−0.72; 2011, m=−0.73, r=−0.74;
2012, m=−0.72, r=−0.74. (c) Startups (log-scale): 2009, m=−0.51, r=−0.80; 2010,
m=−0.51, r=−0.81; 2011, m=−0.51, r=−0.81.]
Table 3.9: Measuring economic development (FourSquare).
                      GDP        Patents    Startups
Population Density    r = 0.50   r = 0.45   r = 0.38
Social Diversity      r = 0.88   r = 0.74   r = 0.83
Ideas Flow            r = 0.92   r = 0.77   r = 0.86
Figure 3.22: Speedy idea flow as a function of social diversity. [Axes: Social
Diversity Di vs. Speedy Idea Flow φi. (a) Linear fit y = −0.9x + 13, r=−0.98.
(b) Linear fit y = −0.9x + 13, r=−0.99.]
In Table 3.9, there are two instances where social diversity is more correlated with
economic development in FourSquare but still less correlated than the results in Table
3.8.
Results show that the speed of access to ideas is correlated with economic
development in the US from 2009 to 2012 because it is a near-perfect measure of social
diversity, as shown in Fig. 3.22 for Gowalla (a) and FourSquare (b). However, the
causal relationship between the two is still unknown; the results suggest that
combining long ties and the spread of ideas might be an important indicator of
economic development in addition to population size, density, and social diversity.
Aggregating and normalizing hundreds of thousands of long ties across the US
removes the potential effect of ideas not traveling randomly. Unlike social diversity,
population density did not perform as well as the other measures because it was
designed to measure characteristics of cities and not geographical areas with diverse
ranges of population densities (e.g., New York consists of dense NYC and sparse NYS),
which limits its predictive power.
Finally, we focus only on a very specific dimension of social relationships (i.e.,
long ties) and ignore other ties that could lead to better correlations of economic de-
velopment. While there are many more dimensions of human relationships (e.g., short
ties, strong ties, friends from different communities, etc.), one particular dimension
that could lead to better results within a geographical area is friends with different
interests or skills since they would complement each other in terms of collaboration
like solving a difficult problem. Perhaps understanding the interplay of human rela-
tionships and economic development can suggest radical socially-driven alternatives
in addition to the traditional stimulus packages for growing the economy [74] and a
direction for studying urban growth [71].
3.7 Summary of Results
Contrary to the belief in the death of distance barrier to forming social ties [81],
we find that the creation of friendship between two people in Gowalla is more likely
to occur when they are geographically closer, and the likelihood of users being friends
rapidly decreases as the geographic distance between them increases. Such geographic
effects may help in designing spatially-aware community detection algorithms where
on average every two people in a community are separated by a few hops and also
likely to be within spatial proximity.
First, our data analysis of the Gowalla friendship network reveals two degrees of
geographical concentration where friends and friends-of-friends are more likely to be
within geographic proximity. Conversely, pairs of users who are separated by three or
more hops of friendship relation are unlikely to be within geographic proximity. Also,
friends who are within geographic proximity are more likely to physically interact by
going to the same places together than distant friends. Yet, the likelihood of physical
interactions among friends-of-friends is minuscule even though they are geographically
concentrated.
Second, we showed that covers can serve as a null model for examining com-
munity structures. For most quality metrics, small communities are more likely to
outperform large ones because it is much easier to find a small group to maximize a
particular metric. Therefore, comparing detected communities to covers tells us how
much better the algorithm is performing than a proposed null model for a given size
of the community.
Finally, we used the results from the covers and compared them to the com-
munities detected by modified CPM, unweighted and weighted IA, and GANXiS.
By incorporating spatial information into CPM to make the algorithm scalable, it
detected meaningful communities of a large online social network where members
are more likely to physically interact than members of a cover used as a null model.
From the NCP plots, we noticed the importance of small-size communities in large
social networks in which it is much harder to find a large community because humans
have limited resources to create and maintain relationships. We used the level of
physical interactions among members in a community as the final quality measure to
compare and validate the performance of the community detection algorithms to the
closest-friend-first cover.
Other applications that we foresee might benefit from such spatial effects in-
clude recommendation systems and link prediction by designing systems based on the
knowledge of users’ geographical locations, their social connections, and the structure
of their friendship communities. For instance, recommendation systems could be en-
riched by incorporating geographical information of users, their friends and location-
based ratings to increase the quality of the recommended items [82]. Link prediction
could be enriched by using pairs of users that are geographically close and belong to
the same community to predict how likely they will become friends or connected in
the future [83].
CHAPTER 4
SOCIAL RANKING TECHNIQUES
Figure 4.1: Conceptualization of social ranking. [Diagram: a web graph with nodes CNN,
ABC, MSNBC, Fox, Yahoo, and Digg, and a social graph with nodes P1 to P6.]
Previous work on the ranking of pages conceptualized the web as a network con-
sisting of pages representing nodes, and links representing directed edges illustrated
in Fig. 4.1. Advances in social networks enabled a different perspective of ranking
pages from a relationship point of view. For simplicity, the social network of users
illustrated in the top rectangular box in Fig. 4.1 consists of nodes P1, P2, ..., P6 where
an undirected edge between P1 and P2 represents a social relationship of the two users
and an undirected edge from P1 to CNN represents P1 broadcasting a CNN URL to
its ties P2, P3, and P4. Note that the edge from P1 to CNN is not a part of the social
network, but a connection between the web and social network.
4.1 Google Buzz & Twitter
We collected data from two networks on the web. The first is Google Buzz, a
platform that combines social relationships and mini-blogging for information
dissemination. The second is Twitter, where users choose to follow sources of
information. Both networks have messages containing URLs that provide us clues into
how users would rank the quality of the information coming from those URLs by
using the techniques we describe later.
Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Social
Ranking Techniques for the Web,” in Proc. IEEE/ACM Int. Conf. Advances in Social
Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp. 49-55.
We collected the Google Buzz data from early September of 2011 to the middle of
October of the same year. There were around 2.5M users who shared approximately
100M messages of which about 30M messages had URLs embedded in them. We
collected the Twitter data from early September of 2011 to the late December of that
year. There were around 1M users who shared approximately 300M messages and
50M of them had URLs embedded in them. Additional details of the datasets for
Google Buzz and Twitter are provided in Table 4.1 and Table 4.2. Note that "All URLs"
refers to every representation of a URL embedded in a message; two different
representations can be the same URL when they are masked by redirect services.
"*URLs" refers to the final destinations of URLs that have been shared by at
least two users within the network. In addition, we reduced the size of the datasets by
keeping users whose geographical locations were known. To pinpoint the geographical
location of a user, we extracted locations from their geo-tagged messages and used
the most frequent location as the location of their residence. Reduced networks are
shown in Table 4.3.
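The residence-inference step described above can be sketched with a frequency count. This is a minimal illustration: the input format (a dictionary from user to a list of location labels extracted from geo-tagged messages) and the function names are assumptions, and ties between equally frequent locations are broken by first occurrence.

```python
from collections import Counter

def home_location(geotags):
    """Return the most frequent location among a user's geo-tagged messages, or None."""
    if not geotags:
        return None
    return Counter(geotags).most_common(1)[0][0]

def keep_located_users(user_geotags):
    """Keep only users whose location can be inferred.

    user_geotags: dict user -> list of location labels (hypothetical format).
    Returns a dict user -> inferred residence.
    """
    homes = {user: home_location(tags) for user, tags in user_geotags.items()}
    return {user: home for user, home in homes.items() if home is not None}
```

Users with no geo-tagged messages are dropped, which is what shrinks the datasets to the reduced networks.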
First, parsing URLs from messages is prone to errors because humans have multiple
ways of writing what is supposedly the same link. Examples are URLs containing typos
and spelling mistakes, URLs masked by redirect services, and so on. Second, with
limits on hardware resources, bandwidth sharing, and data access, we attempted to
collect as much as we could for the purpose of ranking URLs on social media. Third,
we were able to collect the entire connected component with BFS sampling for Google
Buzz, which resulted in the sum of indegree being equal to the sum of outdegree.
Twitter is a much larger network that consists of hundreds of millions of accounts
[26]. When calculating the data summary of Twitter, we look only at users who have
been processed in terms of collecting their information and not users who are waiting
to be processed, which resulted in the sum of indegree not being equal to the sum of
outdegree.
Table 4.1: Data summary of Google Buzz.
            x           σX           ∑
Users       −           −            2,522,109
Inlinks     7.36        115.04       18,566,607
Outlinks    7.36        58.39        18,566,607
Messages    42.94       1,067.21     108,439,019
All URLs    11.67       21,706.36    34,472,205
*URLs       3.85        174.80       2,647,561

Table 4.2: Data summary of Twitter.
            x           σX           ∑
Users       −           −            1,057,163
Inlinks     17,675.58   334,127.10   18.69B
Outlinks    520.66      7,676.48     550,421,023
Messages    280.84      1,005.09     277,310,683
All URLs    44.26       45,359.19    46,532,403
*URLs       8.19        57.59        2,294,077
4.1.1 Categories of URLs.
Figure 4.2 shows the categories of the 100 most popular and 100 randomly selected
URLs for Google Buzz (a,b) and Twitter (c,d). Popularity is measured by the
number of spreaders, that is, the users who shared or re-shared a given URL. In Google
Buzz (a), 24% of popular URLs are from social media, 16% are about technological
products such as Apple, 15% are videos from Youtube, and so on. In Twitter (c), 41%
of popular URLs are from social media, 19% are videos from Youtube, 11% are image
related, and so on. Google Buzz has more popular URLs relating to technological
products, while Twitter has more popular URLs relating to social media. For random
URLs, 27% of the URLs in Google Buzz are from social media, while for Twitter this
number is 53%.
Table 4.3: Google Buzz (left) & Twitter (right) with geography.
                 Google Buzz                          Twitter
                 Mean (x̄)  Std. dev. (σX)  Total (∑)  Mean (x̄)  Std. dev. (σX)  Total (∑)
Users            -         -               24,813     -         -               15,036
Inlinks          8.30      39.92           206K       102.60    279.35          1.5M
Outlinks         8.30      33.45           206K       102.60    278.77          1.5M
Extracted URLs   260.66    978.30          6.5M       227.93    305.39          3.4M
[Figure 4.2, four pie charts. Category shares as printed in each panel:
a) Google 7%, Youtube 15%, Twitter 15%, Information 27%, Yfrog 3%, FourSquare 2%,
   Facebook 4%, Games 1%, Images 9%, News 1%, Technology 16%
b) Information 27%, Google 4%, Videos 2%, Twitter 11%, Youtube 6%, Last.fm 2%,
   Tumblr 5%, Foursquare 7%, Facebook 2%, News 21%, Technology 10%, and three
   unlabeled slices of < 1% each
c) Youtube 19%, Twitter 30%, Information 21%, Yfrog 5%, Facebook 6%, Images 11%,
   News 3%, Technology 5%
d) Information 23%, Twitter 28%, Youtube 5%, Tumblr 5%, Yfrog 8%, Foursquare 11%,
   Facebook 1%, News 18%, Technology 1%]
Figure 4.2: Categories of popular (a,c) and random (b,d) URLs.
4.1.2 Spreaders & Affected Sets
From each of the Google Buzz and Twitter datasets, we randomly chose 2,000
URLs with equal probability, denoted as the random set of URLs, and we chose
the 2,000 most shared URLs, denoted as the popular set of URLs. With two sets
of URLs in each network, we have four sets of URLs in total. For each URL, we
calculated the size of the affected set, which consists of the nodes that received
the URL from the spreaders but chose not to spread it further.
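The affected set described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the dictionary `followers` and the function name `affected_set` are hypothetical, and we assume that a node "received" a URL if it follows at least one spreader.

```python
def affected_set(followers, spreaders):
    """Return nodes that received the URL from a spreader but did not share it.

    followers: dict mapping a user to the set of users who follow that user
               (assumed input format).
    spreaders: iterable of users who shared or re-shared the URL.
    """
    spreaders = set(spreaders)
    received = set()
    for s in spreaders:
        # Every follower of a spreader is assumed to have received the URL.
        received.update(followers.get(s, ()))
    # Remove users who spread the URL themselves.
    return received - spreaders


followers = {"a": {"b", "c"}, "b": {"c", "d"}, "c": set(), "d": {"a"}}
print(sorted(affected_set(followers, {"a", "b"})))  # ['c', 'd']
```

The size of the affected set used on the x-axis of Fig. 4.3 would then be `len(affected_set(...))` for each URL.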
We also computed the average length of all shortest paths from 10 randomly
chosen users to members of a random subset of spreaders. The results are shown in
Fig. 4.3(a) for Google Buzz and Fig. 4.3(b) for Twitter. Each point on the plot is a
URL; the x-axis corresponds to the size of the affected set on a logarithmic scale, and
the y-axis corresponds to the average length of the shortest paths from the randomly
chosen users to the spreaders. A red point is a URL from the random set, and a blue
star is a URL from the popular set. The black line is a linear classifier that separates
popular URLs from random URLs, and crosses are points that have been misclassified.
We substituted a randomly selected subset for the entire spreader set simply as a
matter of efficiency, because shortest-path computations are expensive in large
networks, as mentioned by the authors of [84].
[Figure 4.3, two scatter plots: x-axis "Size of Affected Set (log-scale)", y-axis
"Avg. Distance"; legend: Random, Popular.]
Figure 4.3: Shortest paths to URLs in Google Buzz (a) and Twitter (b).
In Fig. 4.3, we noticed that as the size of the affected set increases, the average
distance from randomly selected users to the information on the web page decreases
for both the random and popular sets of URLs in Google Buzz. This is because a very
large affected set increases the likelihood that a randomly chosen user has a path
through an affected user to a spreader. This agrees with our intuition that information
collectively shared by users with high outdegrees has a greater coverage of
dissemination. However, this correlation is weaker in Twitter due to the celebrity
effect: some users have millions of followers and create large affected sets on their
own, for instance when a URL was shared in the network only by a celebrity. More
importantly, affected sets influence our social ranking techniques, where the structure
of the social network instead of the web topology is used to rank pages or URLs.
4.1.3 Information Distances
Figure 4.4 shows the ultra small-world property of the distance from a randomly
selected starter to popular and random URLs in Google Buzz (a) and Twitter (b).
For each URL, we randomly selected 100 starters and calculated the length of the
shortest path from each starter to the closest spreader of the URL. The densities
of the number of hops are shown in Fig. 4.5 and the average shortest path lengths
in Fig. 4.4.
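The distance from a starter to the closest spreader can be found with a single breadth-first search that stops at the first spreader it meets. The sketch below is illustrative only: `out_edges` and `hops_to_closest_spreader` are hypothetical names, and we assume paths are followed along "follows" edges from the starter outward.

```python
from collections import deque


def hops_to_closest_spreader(out_edges, starter, spreaders):
    """Breadth-first search from starter; returns the hop count to the
    nearest spreader, or None if no spreader is reachable.

    out_edges: dict mapping a user to the users they follow (assumed format).
    """
    spreaders = set(spreaders)
    if starter in spreaders:
        return 0
    seen = {starter}
    queue = deque([(starter, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in out_edges.get(node, ()):
            if nxt in seen:
                continue
            if nxt in spreaders:
                # BFS visits nodes in order of distance, so this is the closest.
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


graph = {"s": ["a", "b"], "a": ["c"], "b": [], "c": ["d"]}
print(hops_to_closest_spreader(graph, "s", {"c", "d"}))  # 2
```

Averaging this value over the 100 starters of a URL gives one sample of the average path length discussed in this section.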
[Figure 4.4, two bar charts, a) and b): y-axis "Avg. Path Length"; x-axis
categories YO, TW, YF, FA, IM, NE, TE, RA.]
Figure 4.4: Ultra small-world property from starters to information.
[Figure 4.5, two density plots, a) and b): x-axis "Hop", y-axis "Density";
legend: Images, News, Tech, Youtube, Random.]
Figure 4.5: Densities of shortest path lengths from starters to URLs.
Results show that a randomly selected starter in Google Buzz is about one hop
away from a popular URL, compared to a distance of 2.5 hops from a random URL. For
Twitter, a randomly selected starter is about 2 hops away from a popular URL
and a little further from a random URL. These average shortest path lengths to
popular and random URLs are much shorter than the six degrees of separation in the
Travers-Milgram small-world experiment [29], demonstrating that the distance from
human to information is sometimes shorter than the distance from human to human.
4.1.4 Geographical Distances
Figure 4.6 shows the geographical concentration of pairs of users who are separated
by a fixed number of hops in Google Buzz and Twitter, and in two additional networks:
Gowalla and FourSquare. We noticed that these four social networks have two degrees
of spatial concentration, where users who are separated by one or two hops are more
geographically concentrated than pairs who are separated by 3 hops or more. For
instance, 69% of friendship pairs (hops=1, shown in a) are within 560 km, 47% of
friends-of-friends pairs (hops=2, shown in b) are within 560 km, 25% of pairs with
hops=3 (shown in c) are within 560 km, 20% of pairs with hops=4 (shown in d) are
within 560 km, 17% of pairs with hops=5 (shown in e) are within 560 km, and 17%
of pairs with hops=6 (shown in f) are within 560 km. An explanation for these two
degrees of concentration is the effect of the local clustering coefficient of a user,
defined as the fraction of its friends who are friends with each other. For the
probability that two people who have a friend in common are friends themselves to be
high, they need to be within some geographical proximity, or else the opportunity for
them to interact is small. The average local clustering coefficients of 104 randomly
selected pairs of users in Google Buzz, Twitter, Gowalla, and FourSquare are 0.31,
0.36, 0.30, and 0.34, respectively.
[Figure 4.6, six density plots, a) Hop 1 through f) Hop 6: x-axis "Geographic
Distances (km)", y-axis "Density"; legend: B, T, G, F.]
Figure 4.6: Two degrees of spatial concentration.
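For readers who want to reproduce the local clustering coefficient mentioned above, the following is a minimal sketch. It assumes the standard pair-based form of the definition (the fraction of pairs of a user's friends that are themselves friends) and an undirected friendship graph stored as a dictionary of sets; `friends` and `local_clustering` are hypothetical names.

```python
from itertools import combinations


def local_clustering(friends, v):
    """Fraction of pairs of v's friends that are friends with each other.

    friends: dict mapping a user to the set of their friends; the relation
             is assumed to be symmetric.
    """
    nbrs = friends.get(v, set())
    k = len(nbrs)
    if k < 2:
        return 0.0  # no pairs of friends to compare
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in friends.get(a, set()))
    return linked / (k * (k - 1) / 2)


friends = {
    "v": {"a", "b", "c"},
    "a": {"v", "b"},
    "b": {"v", "a"},
    "c": {"v"},
}
print(local_clustering(friends, "v"))  # one linked pair (a, b) out of three
```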
Figure 4.7: Four dimensions of social relationships.
4.1.5 Densities of Social Relationships
Four dimensions of social relationships are visualized in Fig. 4.7. Friends are
pairs of users with reciprocal following relationships. Neighbors are users that are
geographically close. Peers are users that belong to the same community. Interests are
users that have similar interests, as measured by the keyword similarity of the URLs
they share. The intersections of the circles represent pairs of users with multiple
dimensions of social relationships. "Two" represents pairs of users with two
dimensions of social relationships, such as being friends and neighbors.
Table 4.4: Social relationships densities in Google Buzz.
Buzz              Friends   Peers   Interests   Neighbors
Among Friends     -         0.99    0.09        0.58
Among Peers       0.26      -       0.25        0.41
Among Interests   0.01      0.32    -           0.06
Among Neighbors   0.05      0.50    0.13        -
Among Random      0.01      0.27    0.06        0.03
Tables 4.4-4.5 show the densities of friends, peers, neighbors, and users with
similar interests. The left column gives the relationship that defines the set of pairs,
and the top row gives the relationship whose density is measured within that set. For
example, among friends in Table 4.4 for Google Buzz, 99% of them are also peers, 9% of
them have similar interests, and 58% of them are neighbors. For Twitter, among friends,
85% of them are peers, 11% have similar interests, and 30% are neighbors. The densities
of friends, peers, interests, and neighbors are consistent in Google Buzz and Twitter.
For example, most of the friends are among peers, most of the peers are among friends,
most of the people with similar interests are among peers, and most of the neighbors
are among friends.
Table 4.5: Social relationships densities in Twitter.
Twitter           Friends   Peers   Interests   Neighbors
Among Friends     -         0.85    0.11        0.30
Among Peers       0.32      -       0.12        0.29
Among Interests   < 0.01    0.19    -           0.03
Among Neighbors   0.01      0.36    0.09        -
Among Random      < 0.01    0.13    0.04        0.02
[Figure 4.8, two line plots, a) and b): x-axis "Geographical Distance (km)",
y-axis "Avg. CKS"; legend: Friends, Followings, Peers, Random.]
Figure 4.8: CKS for friendship, following, peers, and random pairs.
4.1.6 Keyword Similarity
Figure 4.8 shows cosine keyword similarity (CKS) of selected friendship, follow-
ing, peers, and random pairs of users in Google Buzz (a) and Twitter (b). The CKS
of two users is the cosine of the angle between the two vectors consisting of keyword
frequencies extracted from webpages shared by these two users.
Let W_v and W_v′ be the lists of words in the web pages that users v and v′ have
shared. Let A_v be a vector of word frequencies where the ith entry of A_v is the
number of times the word w_i appears in W_v. The cosine keyword similarity of v and
v′ is defined as:
    \cos(v, v') = \frac{A_v \cdot A_{v'}}{\|A_v\| \, \|A_{v'}\|}.    (4.1)
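Equation 4.1 can be illustrated with a short sketch. It assumes the word lists have already been extracted from the shared pages; the function name `cks` and the sample word lists are hypothetical, and no stemming or stop-word removal is applied.

```python
import math
from collections import Counter


def cks(words_v, words_w):
    """Cosine of the angle between two word-frequency vectors (Eq. 4.1)."""
    a, b = Counter(words_v), Counter(words_w)
    # Only words that occur in both lists contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no words is treated as dissimilar to everyone
    return dot / (norm_a * norm_b)


print(cks(["apple", "apple", "news"], ["apple", "tech"]))  # 2 / sqrt(10)
```

Because word counts are non-negative, the value lies between 0 (no shared keywords) and 1 (identical keyword distributions).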
A pair of nodes (v, v′) is a friendship pair if the two nodes follow each other, a
following pair if v follows v′ but not vice versa, and a random pair if there is no
following in either direction. We calculated the average CKS of friendship, following,
peer, and random pairs as a function of the geographical distance separating the
members of these pairs. For random pairs, we noticed that CKS decreases as the
geographical distance increases. On the other hand, the effect of geography on CKS is
negligible for friendship, peer, and following pairs, although these pairs have a
higher CKS than random pairs.
4.2 Social Ranking Techniques
Let G_U = (V, E) be a directed multi-labeled graph where V is the set of nodes,
E is the set of edges where e = (v_i, v_j) represents a directed edge from node v_i to
node v_j, and U is the set of URLs, subsets of which label the nodes in V. For a
URL u ∈ U, let S(u) denote the set of all spreaders of u, in other words,
all nodes in V that have posted u.
4.2.1 PageRank on Social Network
We extend the PageRank algorithm to rank URLs on a social network (PRSN)
as follows. Given a multi-labeled graph G_U = (V, E), let F = (f_ij) be an n × n
weighted adjacency matrix where n is the number of nodes (i.e., n = |V|), f_ij = 0 if
there is no directed edge from v_i to v_j, and f_ij = 1/deg(i) otherwise, where deg(i)
is the number of outgoing links of v_i. Let R be a vector of n elements where the ith
element of R, denoted r_i, is the PageRank score of the ith node. Let k be the maximum
number of iterations that the PageRank algorithm runs. At the first iteration, every
node sends its score, divided by the number of links pointing from this node to other
nodes, through each outgoing link. After that, each node updates its score to the sum
of the scores that it has received:
    r_i = f_{1i} r_1 + f_{2i} r_2 + \dots + f_{ni} r_n.    (4.2)
If there is an edge from node j to node i, then f_ji > 0 and node j sends the
fraction f_ji = 1/deg(j) of its score r_j to node i. Equation 4.2 can be compactly
written as R^{<1>} = F^T R^{<0>}, where F^T is the transpose of the matrix F, the
superscript <1> denotes the scores of all nodes after the first iteration, and R^{<0>}
is the initial vector. Let R^{<k>} be the scores of the nodes at the last iteration
k > 0, defined by induction as:
    R^{<k>} = F^T R^{<k-1>}.    (4.3)
If there are sinks in the graph G, that is, nodes without outgoing edges, then
for large enough k they will absorb all scores, since the scores can enter but cannot
leave the sinks. One way to fix this problem is to scale the strength of the links by a
constant factor 0 < σ < 1 and to compensate for this scaling by adding an artificial
flow between any two nodes with the weight (1 − σ)/n. This solution is known as the
scaled version of PageRank [85]. The score of the ith node is then denoted as r′_i and
is defined as:
    r'_i = \sum_{j=1}^{n} \left( \sigma f_{ji} + \frac{1-\sigma}{n} \right) r'_j.    (4.4)
Equation 4.4 can be compactly written using the matrix F′ = σF + (1 − σ)/n, in which
the constant (1 − σ)/n is added to every entry of σF. By the Perron-Frobenius
Theorem [85], the scaled PageRank scores converge to a stable solution:
    R'^{<i>} = F'^{T} R'^{<i-1>} \quad \text{where } 0 < i \le k.    (4.5)
Given a subset of URLs U′ ⊂ U, the PageRank score of a URL u ∈ U′ on a
social network (PRSN) is defined as:
    \mathrm{PRSN}(u) = \frac{\sum_{v_i \in S(u)} r'^{<k>}_i}{\sum_{u' \in U'} \sum_{v_i \in S(u')} r'^{<k>}_i}.    (4.6)
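The scaled update of Eq. 4.4 and the URL score of Eq. 4.6 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the uniform initial vector, the default σ = 0.85, the fixed iteration count, and the names `prsn`, `out_links`, and `spreaders` are all hypothetical choices.

```python
def prsn(out_links, spreaders, sigma=0.85, k=50):
    """Score URLs by the scaled PageRank mass of their spreaders.

    out_links: dict mapping a node to the list of nodes it links to.
    spreaders: dict mapping a URL to the set of nodes that posted it, S(u).
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}  # assumed initial vector
    for _ in range(k):
        total = sum(r.values())
        # Artificial flow: every node receives (1 - sigma)/n of every score.
        nxt = {v: (1.0 - sigma) / n * total for v in nodes}
        for j, targets in out_links.items():
            for i in targets:
                # Scaled social link: sigma * f_ji * r_j with f_ji = 1/deg(j).
                nxt[i] += sigma * r[j] / len(targets)
        r = nxt
    mass = {u: sum(r[v] for v in users if v in r) for u, users in spreaders.items()}
    z = sum(mass.values())
    return {u: (m / z if z else 0.0) for u, m in mass.items()}


links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = prsn(links, {"u1": {"a"}, "u2": {"d"}})
print(scores["u1"] > scores["u2"])  # True: node a is linked to, node d is not
```

Because Eq. 4.6 normalizes over the URLs in U′, the returned scores sum to one whenever at least one spreader is in the graph.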
4.2.2 HITS on Social Network
The HITS algorithm used to rank URLs on a social network (HSN) is defined
as follows [35], [85]. Given G_U = (V, E), let M = (m_ij) be an n × n adjacency matrix
where n is the number of nodes, m_ij = 1 if there is a directed edge from node v_i to
node v_j, and m_ij = 0 otherwise. Let k be the maximum number of iterations. Given
a set of URLs U′ ⊂ U, let H and A be the vectors of scores for hubs and authorities,
respectively. Authorities are the URLs (i.e., u ∈ U′) and hubs are the nodes that share
these URLs. The ith element of the vector H is the score of the ith hub,
and the jth element of the vector A is the score of the jth authority. At the
first iteration, the score h_i of a hub is set to the number of authorities to which it
points, and the score a_j of an authority is set to the sum of the scores of the hubs
pointing to it. More formally, h_i and a_j are defined as:
    h^{<0>}_i = m_{i1} + m_{i2} + \dots + m_{in},    (4.7)
    a^{<0>}_j = m_{1j} h^{<0>}_1 + m_{2j} h^{<0>}_2 + \dots + m_{nj} h^{<0>}_n.    (4.8)
Letting H^{<l>} and A^{<l>} be the scores of hubs and authorities at iteration l, the
HITS algorithm [85] can be written as:
    H^{<l>} = (M M^T)^l H^{<0>} \quad \text{where } 0 < l \le k,    (4.9)
    A^{<l>} = (M^T M)^{l-1} M^T H^{<0>} \quad \text{where } 0 < l \le k.    (4.10)
Finally, the score of a URL among the authorities is the value a^{<k>}_j normalized by
the sum of the scores in the vector A.
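The hub and authority updates above can be sketched on the user-to-URL sharing relation. This is a simplified illustration, not the thesis code: it assumes hubs are users and authorities are URLs, it rescales the authority vector at every iteration (which changes the scores only by a constant factor before the final normalization), and `hsn` and `shares` are hypothetical names.

```python
def hsn(shares, k=20):
    """Return normalized authority scores of URLs after k iterations.

    shares: dict mapping a user (hub) to the set of URLs (authorities)
            that the user shared.
    """
    urls = sorted({u for s in shares.values() for u in s})
    # Eq. 4.7: the initial hub score is the number of authorities pointed to.
    hub = {v: float(len(s)) for v, s in shares.items()}
    auth = {u: 0.0 for u in urls}
    for _ in range(k):
        # Authority score: sum of the scores of the hubs pointing to the URL.
        auth = {u: sum(h for v, h in hub.items() if u in shares[v]) for u in urls}
        norm = sum(auth.values())
        if norm == 0:
            return auth  # nothing was shared
        auth = {u: a / norm for u, a in auth.items()}
        # Hub score: sum of the scores of the authorities the hub points to.
        hub = {v: sum(auth[u] for u in s) for v, s in shares.items()}
    return auth


shares = {"p1": {"u1", "u2"}, "p2": {"u1"}, "p3": {"u1", "u3"}}
scores = hsn(shares)
print(max(scores, key=scores.get))  # u1
```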
4.2.3 Ranking with Maximum Flow
We defined the following maximum flow algorithm to rank URLs on a social
network. Given a graph GU = (V,E) and a subset of URLs U ′ ⊂ U , let p represent
a node. We want to rank the URLs in U ′ with respect to p and G by constructing a
directed flow graph denoted as G′p = (V ′, E ′).
The first part of the construction requires copying the social structure of G to
G′_p. For every node v_i that p follows, we add v_i to V′ and the edge e = (p, v_i) to
E′. At each subsequent iteration, we repeat the same process for every node that was
added to V′ in the previous iteration; that is, if v_i was added to V′ and
there is an edge e = (v_i, v_j), then we add v_j to V′ if v_j has not been added before.
The edge e = (v_i, v_j) is added to E′ even if v_j has been added before. This
process of constructing the graph G′_p continues until all nodes of V that
are reachable from p have been added to V′. For practical reasons, it is wise to stop
while the diameter of G′_p is still small (e.g., three) to reflect the influence of
nodes that are within network proximity. At the end of the process, an edge originating
from node v gets a weight equal to the inverse of the degree of v in G′_p.
[Figure 4.9, flow graph: source p; information and social network nodes P1 to P5;
web pages u1 and u2; super sink t.]
Figure 4.9: Graph G′p for ranking URLs {u1, u2} with respect to node p.
The second part of constructing G′p introduces some additional nodes and edges.
For every URL u′ ∈ U ′, we add u′ into V ′. For every spreader s ∈ S(u′) of the URL
u′, we add an edge e = (s, u′) with a weight of 1 into E ′ if s ∈ V ′. We add a super
sink denoted t into V ′ and add an edge e = (u′, t) with an edge weight of 1 for every
URL u′ in U ′.
The maximum flow of the graph G′_p from the source p to the super sink t is a function
F that assigns a non-negative value to each edge so that it maximizes the total flow
from the source p to the super sink t while satisfying two conditions: first, the flow
on an edge does not exceed the weight of that edge, i.e., F(e) ≤ c_e; and second, it
obeys the conservation of flow law at every node except the source p and the super
sink t, i.e.,
    F_{out}(v) = \overbrace{\sum c_e}^{\text{Flow out to social ties}} + \overbrace{\sum c'_e}^{\text{Flows out to pages}} = F_{in}(v)    (4.11)
where c_e is the assigned flow for the edge e = (v_i, v_j) between two nodes, and c′_e
is the assigned flow for the edge e′ = (v_i, u_j) between the node v_i and the URL u_j.
The construction of the graph G′_p is illustrated in Fig. 4.9. Polynomial running time
algorithms for finding the maximum flow, such as the Edmonds-Karp algorithm with
running time O(|V′| |E′|^2), can be found in [85], [86].
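The two-part construction of G′_p and the maximum flow computation can be sketched together. The code below is a hedged illustration: it assumes that a URL's score is the flow on its edge into the super sink (the text does not state this explicitly), that the "degree" in the edge weight is the number of social out-edges copied into G′_p, and that construction stops after `max_hops` hops. The names `flow_rank`, `max_flow`, `follows`, and `SINK` are hypothetical.

```python
from collections import defaultdict, deque

SINK = "__super_sink__"  # hypothetical label for the super sink t


def max_flow(capacity, source, sink):
    """Edmonds-Karp on capacity = {u: {v: weight}}; returns (value, residual)."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
    value = 0.0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return value, residual
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push  # use up forward capacity
            residual[v][u] += push  # allow the flow to be undone later
        value += push


def flow_rank(follows, spreaders, p, max_hops=3):
    """Build G'_p from p and return the flow reaching each URL."""
    depth, queue, capacity = {p: 0}, deque([p]), {}
    while queue:  # part 1: copy the social structure around p
        v = queue.popleft()
        out = list(follows.get(v, ()))
        if depth[v] == max_hops or not out:
            continue
        capacity[v] = {w: 1.0 / len(out) for w in out}  # inverse out-degree
        for w in out:
            if w not in depth:
                depth[w] = depth[v] + 1
                queue.append(w)
    for url, users in spreaders.items():  # part 2: URLs and the super sink
        for s in users:
            if s in depth:
                capacity.setdefault(s, {})[url] = 1.0
        capacity[url] = {SINK: 1.0}
    _, residual = max_flow(capacity, p, SINK)
    return {url: 1.0 - residual[url][SINK] for url in spreaders}


follows = {"p": ["a", "b"], "a": ["c"]}
scores = flow_rank(follows, {"u1": {"a", "b"}, "u2": {"c"}}, "p")
print(scores["u1"] > scores["u2"])  # True
```

A maximum flow is not always unique, so the split of flow between URLs can depend on the order in which augmenting paths are found; the example is chosen so that the shortest paths all lead to u1.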
4.2.4 Variants of Maximum Flow
The second variant of network flow incorporates social relationships and geography
by assigning weights to edges based on the geographical distance between the
nodes. We assign the edge weight for nodes v_i and v_j as:
    w_{ij} = \frac{gd(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} gd(v_i, v_k)^{-1}}    (4.12)
where gd(v_i, v_j) is the geographical distance from v_i to v_j. The third variant uses
cosine keyword similarity to assign the weights. The edge weight for nodes v_i and v_j
is defined as:
    w_{ij} = \frac{CKS(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} CKS(v_i, v_k)^{-1}}.    (4.13)
The last variant of network flow uses community structure by replacing the
social network with the community group and connecting the source to all members
of the community. The (binary) weights of the edges in the community do not take into
account geography or cosine keyword similarity, so their values are 1.
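As a worked example of the geographic weighting in Eq. 4.12, the sketch below normalizes inverse distances over the out-neighbors of one node. It assumes that the summation set is the set of nodes that v_i links to and that all distances are strictly positive; `inverse_distance_weights` and the sample distances are hypothetical.

```python
def inverse_distance_weights(distances):
    """Normalize inverse distances so that the weights of one node sum to 1.

    distances: dict mapping each out-neighbor v_j of a node v_i to the
               geographical distance gd(v_i, v_j) in km (must be > 0).
    """
    inverse = {v: 1.0 / d for v, d in distances.items()}
    total = sum(inverse.values())
    return {v: x / total for v, x in inverse.items()}


weights = inverse_distance_weights({"a": 10.0, "b": 40.0})
print(weights)  # the neighbor at 10 km gets four times the weight of the one at 40 km
```

With distances of 10 km and 40 km, the inverse distances are 0.1 and 0.025, so the weights are 0.8 and 0.2. The same normalization applies to Eq. 4.13 with CKS values in place of distances.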
4.3 Social Ranking Experiments
4.3.1 Comparing PageRank & HITS
We selected 30 URLs from each of the popular and random URL sets. For each selected
URL, we calculated its score by using PageRank and HITS, and ranked the URLs
(i.e., 1st, 2nd, 3rd, etc.) with respect to the set. We compared the ranking results of
PageRank and HITS for popular and random URLs, shown in Fig. 4.10 for Google
Buzz and Fig. 4.11 for Twitter. The ranking results for Google Buzz are listed in
Table 4.6 and Table 4.7.
The rankings of popular URLs produced by PageRank and HITS are more consistent
with each other than the rankings of random URLs. We measured the ranking consistency
as the average difference between two ranking algorithms on a set of URLs, i.e.,
    \frac{1}{w} \sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|,
and as the sum of differences, i.e.,
    \sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|,
where P_x(u) is the position of the URL u determined by the algorithm x and w is the
number of URLs. The average difference is more appropriate than the sum of differences
for ranking a large number of pages, for example 1000 pages instead of 5 pages. The
average gives the average difference between the two ranking algorithms over the 1000
pages, and the sum gives the total difference in ranks between the two algorithms. For
a smaller number of pages, the sum might be more appropriate for quantifying the
difference between two ranking algorithms.
For the popular URLs in Google Buzz, the average difference was 2.9, meaning
that on average HITS and PageRank were off by about 3 positions, and the sum of
differences between them was 86. For the random URLs in Google Buzz, the average
difference was 9.6 and the sum of differences between them was 288. For the popular
URLs in Twitter, the average difference was 5.9 and the sum of differences between
them was 178. For random URLs in Twitter, the average difference was 7.2 and the sum
of differences between them was 216. In both networks, popular URLs are ranked
more consistently than random URLs, which makes the HITS algorithm more suitable
than PageRank for ranking viral information because it is computationally more
efficient.
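The two consistency measures can be checked against the published positions. The sketch below uses the PRSN and HSN columns of Table 4.6 (popular URLs in Google Buzz), listed in PRSN order; the function name `rank_differences` is hypothetical.

```python
def rank_differences(pos_a, pos_b):
    """Return (average, sum) of absolute position differences per URL."""
    diffs = [abs(pos_a[u] - pos_b[u]) for u in pos_a]
    total = sum(diffs)
    return total / len(diffs), total


# HSN positions from Table 4.6, ordered by PRSN position 1..30.
hsn_column = [1, 2, 10, 14, 9, 7, 4, 3, 8, 5, 6, 11, 15, 12, 19,
              13, 21, 17, 16, 23, 22, 25, 24, 18, 20, 27, 28, 29, 26, 30]
prsn_pos = {i: i + 1 for i in range(30)}
hsn_pos = dict(enumerate(hsn_column))

average, total = rank_differences(prsn_pos, hsn_pos)
print(total, round(average, 1))  # 86 2.9
```

The result reproduces the sum of 86 and the average of 2.9 reported for popular URLs in Google Buzz.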
[Figure 4.10, two scatter plots, (a) Popular URLs and (b) Random URLs: x-axis
"PageRank on Social Network", y-axis "HITS on Social Network", both from 0 to 30;
each point is labeled with the domain of a URL.]
Figure 4.10: Ranking URLs on Google Buzz.
4.3.2 Flow Ranking
We noticed that the ranking results determined for individual users by
maximum flow are less correlated with one another than the results computed by
PageRank and HITS are. First, we compared the ranking results of maximum flow with
those of PageRank and HITS for Google Buzz, shown in Fig. 4.12 for popular URLs and
Fig. 4.13 for random URLs. The first and second plots on the left are the ranking
results of popular URLs and the third and fourth plots on the right are the ranking
results of random URLs, as labelled by their sub-captions. Each point on a graph is a
URL, where the x-axis is the ranking position of the URL determined by maximum flow
and the y-axis is the ranking position determined by either PageRank or HITS, as
labelled on the y-axis. The identical layout for Twitter is shown in Fig. 4.14
for popular URLs and Fig. 4.15 for random URLs.
[Figure 4.11, two scatter plots, (a) Popular URLs and (b) Random URLs: x-axis
"PageRank on Social Network", y-axis "HITS on Social Network", both from 0 to 30;
each point is labeled with the domain of a URL.]
Figure 4.11: Ranking URLs on Twitter.
[Figure 4.12, two scatter plots: x-axis "Personalized Ranking with Maximum Flow";
y-axis "PageRank on Social Graph" in (a) Max. Flow vs. PageRank and "HITS on Social
Graph" in (b) Max. Flow vs. HITS; legend: Person 1, Person 2, Person 3, Person 4, y=x.]
Figure 4.12: Social ranking with popular URLs on Google Buzz.
[Figure 4.13, two scatter plots: x-axis "Personalized Ranking with Maximum Flow";
panel (a), sub-captioned "Max. Flow vs. HITS", has y-axis "PageRank on Social Graph";
panel (b), sub-captioned "Max. Flow vs. PageRank", has y-axis "HITS on Social Graph";
legend: Person 1, Person 2, Person 3, Person 4, y=x.]
Figure 4.13: Social ranking with random URLs on Google Buzz.
[Figure 4.14, two scatter plots: x-axis "Personalized Ranking with Maximum Flow";
y-axis "HITS on Social Graph" in (a) Max. Flow vs. HITS and "PageRank on Social
Graph" in (b) Max. Flow vs. PageRank.]
Figure 4.14: Social ranking with popular URLs on Twitter.
[Figure 4.15, two scatter plots, both sub-captioned "Random URLs": x-axis
"Personalized Ranking with Maximum Flow"; y-axis "PageRank on Social Graph" in (a)
and "HITS on Social Graph" in (b); legend: Person 1, Person 2, Person 3, Person 4, y=x.]
Figure 4.15: Social ranking with random URLs on Twitter.
Table 4.6: Ranking results of 30 popular URLs in Google Buzz.
URLs                     PRSN   HSN   MF
abcnews.go               1      1     9/12/10/15
youtube                  2      2     5/7/5/6
yahoo                    3      10    1/2/2/4
businessweek             4      14    10/14/12/14
bloomberg                5      9     10/14/13/12
wordpress                6      7     5/5/7/9
nytimes                  7      4     10/14/6/10
appleinsider             8      3     10/14/13/16
facebook                 9      8     1/1/1/1
wired                    10     5     9/14/13/15
lockerz                  11     6     4/6/6/6
apple                    12     11    6/8/9/8
pcworld                  13     15    8/13/10/7
guardian                 14     12    10/14/8/10
reuters                  15     19    10/14/10/16
ted                      16     13    9/13/7/10
amazon                   17     21    8/9/8/10
techcrunch               18     17    8/13/9/14
engadget                 19     16    9/13/7/7
reddit                   20     23    10/13/8/11
empireavenue             21     22    9/14/11/15
boston                   22     25    3/3/3/3
xkcd                     23     24    2/4/8/2
whitehouse               24     18    9/14/11/14
gizmodo                  25     20    7/10/12/12
pingchat                 26     27    9/12/12/14
thesocialnetwork-movie   27     28    9/14/13/14
bbc                      28     29    10/11/4/13
photofocus               29     26    8/14/13/16
stackoverflow            30     30    6/11/12/12
4.3.3 Rank Differences
For personalized ranking, we measured the ranking consistency as the average
difference between the rankings of a pair of users with respect to a URL set. For
instance, in Table 4.8, the left column and the top row are the four selected users,
and the element a_ij corresponds to the average difference between users i and j.
Please note that the upper triangle (the elements above the diagonal) refers to the
random URLs and the lower triangle (the elements below the diagonal) refers to the
popular URLs. The right column refers to the outdegree of the users in the random
URLs, and the last row refers to the outdegree of the users in the popular URLs. For
Twitter, the ranking results in the same format are given in Table 4.9.
For random URLs in Google Buzz, we noticed that persons p1 and p3 have
an average difference of 1.7, whereas p2 and p4 have an average difference of 6.7. For
popular URLs, the variability is smaller: p4 and p2 have an average difference of
2.0, and p1 and p2 have an average difference of 3.2. Outdegree measures the number
of people a user follows, which matters because the ranking results are based on them.
Finally, ties are expected when using maximum flow, since the number of URLs shared
among friends is minuscule compared to the number of pages in the deep Web. Therefore,
we simply use PageRank or HITS to break ties among pages when necessary.
Table 4.7: Ranking results of 30 random URLs in Google Buzz.
URLs                     PRSN   HSN   MF
networkedblogs           1      28    6/5/7/2
picasaweb.google         2      29    1/3/1/5
ping.fm                  3      1     5/4/4/4
thenextweb               4      3     8/7/8/3
twitter                  5      18    12/17/13/10
income4free              6      17    2/1/2/1
fastestwaylosebellyfat   7      19    10/9/10/10
digg                     8      25    12/19/12/5
sports.espn.go           9      4     4/6/6/6
wired                    10     5     12/21/9/9
businessinsider          11     13    3/2/3/8
forbes                   12     12    7/12/12/9
foxnews                  13     27    11/13/5/9
behance                  14     11    11/23/13/8
huffingtonpost           15     23    12/20/11/7
entrepreneur             16     2     12/21/13/10
puntogov                 17     15    12/23/13/10
addictivefonts           18     6     10/14/13/9
theprism                 19     30    12/20/13/10
telegraph                20     22    9/10/13/10
npr                      21     7     10/19/13/10
popsci                   22     16    10/11/13/10
economist                23     10    12/16/13/10
marketwatch              24     8     8/8/13/10
opencog                  25     9     12/23/13/8
dslreports               26     26    12/15/13/10
last.fm                  27     24    12/23/13/10
tech.slashdot            28     20    12/22/13/10
wimp                     29     21    12/18/13/10
socialturns              30     14    12/18/13/10
Table 4.8: Avg. ranking differences in Google Buzz.
-          p1    p2    p3    p4    outdegree
p1         -     5.1   1.7   2.4   369
p2         3.2   -     4.8   6.7   4,505
p3         2.5   2.6   -     3.1   1,125
p4         3.2   2.0   2.5   -     102
out deg.   159   355   503   340
Table 4.9: Avg. ranking differences in Twitter.
-          p1    p2    p3    p4      outdegree
p1         -     1.5   2.0   4.0     203
p2         3.7   -     3.0   3.8     122
p3         3.3   3.3   -     4.6     426
p4         3.7   3.8   5.2   -       119
out deg.   324   158   129   1,731
4.3.4 Rank Distributions
We examined the variants of flow ranking as follows. We selected a user in Twitter
and then selected the top 25 URLs, in terms of CKS, shared by the people that this user
is following. These 25 URLs contain keywords similar to those of the URLs that this
user has previously shared. Once we have the candidate URLs, we use network flow to
re-rank them, taking into account social relationships, the effect of geography, and
community structure. Results show a re-ordering in which geography reduces
the number of URLs with positive scores by considering spreaders of URLs who are
geographically close. On the other hand, community distributes
the scores of URLs more evenly, since more spreaders are taken into consideration.
This flexibility allows users to select information that is locally relevant when
appropriate, or to select information of potential interest from their community
members.
Figure 4.16 shows the densities of the rank correlation coefficient of URLs between
variants of network flow and PageRank. For a selected user, we selected 25 URLs from
its neighborhood and ranked these URLs using variants of network flow: without
geography (a), with geography (b), and with community (c). Given a set of URLs U, let
R_u(v) and R_u(v′) be the ranking results for nodes v and v′. The rank correlation
coefficient, denoted as τ, is defined as:
    \tau = \frac{n_c - n_d}{0.5\,k(k-1)}    (4.14)
where k = |U|, n_c is the number of concordant pairs, and n_d is the number of
discordant pairs in R_u(v) and R_u(v′). A value of τ = 1 means the ranking results are
identical, −1 means they are in reverse order, and 0 means they are independent.
Results show that personalized ranking using network flow is highly independent from
PageRank.
[Figure 4.16, three density plots, a), b), and c): x-axis "Tau", y-axis "P(x)";
legend: Twitter, Buzz.]
Figure 4.16: Densities of rank correlation coefficient.
[Figure 4.17, two bar charts, a) and b): y-axis "Avg. NCDG"; x-axis categories
Flow O, Flow G, Flow I, Flow C, PR, BL.]
Figure 4.17: Ranking quality results.
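The rank correlation coefficient of Eq. 4.14 can be computed directly by counting concordant and discordant pairs. The sketch below is a plain quadratic-time illustration; `kendall_tau` is a hypothetical name, and tied pairs are assumed to count as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Eq. 4.14 for two rankings given as dicts mapping URL -> position."""
    urls = list(rank_a)
    k = len(urls)
    concordant = discordant = 0
    for x, y in combinations(urls, 2):
        # The pair is concordant if both rankings order x and y the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (0.5 * k * (k - 1))


flow = {"u1": 1, "u2": 2, "u3": 3}
pagerank = {"u1": 1, "u2": 3, "u3": 2}
print(kendall_tau(flow, pagerank))  # (2 - 1) / 3
```

In this example two of the three pairs agree and one disagrees, giving τ = 1/3.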
4.3.5 Rank Validation
Fig. 4.17 shows ranking quality results for Google Buzz (a) and Twitter (b)
using the four variants of network flow, PageRank applied to the social/information
network, and the baseline. The y-axis is the normalized cumulative discounted gain
(NCDG), which is used to benchmark the quality of ranking results and is defined below.
For this experiment, we selected 50 users and, for each user, 100 URLs from the user's
neighborhood. Then we ranked these URLs using the six ranking techniques.
NCDG is defined as follows. Let p be a source node and R a list of ranked URLs
for p. The discounted cumulative gain (DCG) for R with respect to p is:
    \mathrm{DCG} = \delta(R_1, p) + \sum_{i=2}^{w} \frac{\delta(R_i, p)}{\log(i)}    (4.15)
where δ(R_i, p) is 1 if R_i is relevant to p and 0 otherwise, and w is the number of
pages to be ranked. We assume R_i is relevant to p if p has shared R_i before. The
normalized cumulative discounted gain (NCDG) is the DCG divided by the DCG of
the optimal ordering of R with respect to p. The optimal ordering is defined by using
the pages that the user later shared.
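The gain of Eq. 4.15 and its normalization can be sketched as follows. The logarithm base is not stated in the text, so base 2 is assumed here, as is common for this measure; `dcg`, `ncdg`, and the sample lists are hypothetical.

```python
import math


def dcg(relevance):
    """Eq. 4.15 with an assumed base-2 logarithm; relevance is a list of 0/1."""
    return sum(
        rel if i == 1 else rel / math.log2(i)
        for i, rel in enumerate(relevance, start=1)
    )


def ncdg(ranked_urls, relevant):
    """DCG of the ranking divided by the DCG of the best possible ordering."""
    relevance = [1 if u in relevant else 0 for u in ranked_urls]
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal else 0.0


print(ncdg(["a", "b", "c"], {"a"}))  # 1.0: the relevant URL is ranked first
print(ncdg(["b", "c", "a"], {"a"}))  # 1 / log2(3): the relevant URL is ranked last
```

With base 2, positions 1 and 2 receive the same discount, so swapping the top two results does not change the score.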
To capture any effect of social relevance, we rank these URLs randomly and
use this random ranking as the baseline. The results shown in Fig. 4.17 confirm that
social relevance can improve ranking results by up to 19% in Google Buzz and 17% in
Twitter. The improvement is defined as the difference between the average NCDG of
PageRank and that of flow rank, divided by the average NCDG of PageRank (see
Fig. 4.17). It is interesting that peers in the community have a stronger effect on
ranking quality than friends in Google Buzz. This is consistent with the densities of
social relationships in Table 4.4, where 25% of peers have similar interests compared
to 9% of friends. For Twitter, the densities in Table 4.5 align with the ranking quality
results in Fig. 4.17(b), where the densities of interests among friends and peers are
almost identical. Recall that PageRank is calculated by using the social network
and not by using the web graph.
4.4 Summary of Results
Information shared between users in online social networks, such as URLs, provides
a unique perspective on the ranking of web pages. In our approach, humans
instead of pages are the ones who rank the URLs by sharing them, and the social
network of the users instead of the web graph topology is used to propagate the ranking.
First, we collected two large-scale information networks of online users to study
how users in these networks share URLs, which impacts the distance between a person
and a URL. For instance, researchers in [3] estimated the number of hops between
any two pages to be 19 on average, while Milgram estimated that the number of
hops between any two people is no more than 6 [87]. Since information propagates
differently in social networks, the social structure bounds how far a person is
from a shared URL.
Second, we reinterpreted the ranking techniques of PageRank and HITS and
proposed to use maximum network flow to personalize the ranking of pages for
each individual user. Maximum flow detects the popularity of a shared URL
among friends, but popularity does not necessarily reflect endorsement, which could
impact ranking because one could share something that was not meant to be positive
(e.g., a piece of sad news). We expected that each individual would rank the URLs
differently, since no two people on a social network are the same. Interestingly, the
ranking results of popular URLs using PageRank and HITS are more correlated than
those of random URLs, suggesting that the overall view of users on ubiquitous information is
more consistent, but everyone has their own opinion in the end. Instead of attempting
to socially rank the entire web, we re-ranked a selected set of URLs to make the approach scalable
and efficiently executable for search engines. If the size of the web doubles in the next
few years, it would not affect our approach, since only the subset of URLs that users
shared is actually re-ranked.
Third, experimental results show that personalization can improve ranking quality
by up to 19% compared to the baseline and 5% compared to PageRank in Google
Buzz. For Twitter, personalization improves ranking quality by up to 17% compared
to the baseline, but it is not better than PageRank.
More importantly, we believe that personalizing the ranking is useful for social
searching because it provides a mechanism for interaction between the searcher
and the sharer, where the searcher can discuss with the sharer the item relating
to a query on a search engine, for instance, a new product that the sharer posted
on appleinsider.com or a piece of political news on nytimes.com. This potential
interaction between the searcher and the sharer is valuable because the influence of
the sharer on the searcher is stronger than the influence coming from the authorities
detected by HITS and PageRank in many, though not all, non-technical and social
situations. This feature could be implemented in search engines where pages returned
for a given query are re-ranked via social networks if there are pages shared among
friends or other associates of the searcher that are related to the query.
CHAPTER 5
SOCIAL SEARCHING EXPERIMENTS
We collected friendship, checkin, and location data from two location-based social
media, Gowalla and FourSquare, that allowed people to use their internet-enabled and
sensing-capable smart phones to record and share their current location. Gowalla is no
longer operating by itself since it has been integrated into Facebook. Unlike Gowalla,
FourSquare doesn’t allow an automated mechanism for collecting publicly shared
checkins through their API. We have also collected two additional social networks
containing social relationships, Flickr and Last.fm, but without geographical locations
of their users.
The reason for collecting data from these four diverse networks is that we can
directly calculate the hop length of the shortest path between randomly selected
pairs of users and use these path lengths as an estimate for the ground truth in
the small-world experiment. We use Gowalla and FourSquare for the emulation of
the small-world experiment in which knowing geographical distance between users is
essential. Even though the collected data from online social media is not a represen-
tative sample of the entire population, it still provides “one of the best estimates of
social distance”[88] and one of the best environments for analyzing the small-world
experiment at large scale.
Table 5.1: Summaries of online social networks datasets.
Social Networks   Number of Users   Number of Edges   Period
Gowalla           154,557           1,139,110         Sept. 11 - Oct. 12
FourSquare        251,621           800,201           Jun. 13 - Aug. 13
Flickr            2,435,257         155,110,479       Jun. 13 - Aug. 13
Last.fm           4,355,516         30,325,890        Jun. 13 - Aug. 13
In Table 5.1, we list the number of users and edges collected for each network
over the specified time period. In the case of Gowalla and FourSquare, these numbers
refer to the subset of the collected network that remained after data cleaning. In Gowalla,
we removed users that did not have any publicly shared checkins. In FourSquare,
we kept only users that were successfully geocoded by Google Maps. This subtle
difference between Gowalla and FourSquare is important because checkins in Gowalla
directly pinpoint users' locations, making connections between users in Gowalla
denser than in FourSquare. However, the advantage of FourSquare is that it provides
a different perspective on some of the questions being asked, such as the effect of network
sparsity on the small-world problem.

Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE (under review).
5.1 Attrition, Geography, & Communities
Let G = (V,E) be a social network where V is the set of users and E is the set
of edges representing undirected relationships among users. The great-circle distance
between two users s and t is denoted as gd(s, t) and estimated based on the users’
self-entered location of residence (FourSquare) or the most-frequent checkin that they
have shared (Gowalla). The network distance between s and t is denoted as nd(s, t)
and defined as the smallest number of hops needed to reach t starting from s.
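The two distances defined above can be sketched as follows. The haversine formula and the mean Earth radius of 6371 km are assumptions of this illustration (the text does not say how the great-circle distance was computed), and the adjacency-dictionary representation of G is hypothetical.

```python
import math
from collections import deque

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def gd(loc_s, loc_t):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, loc_s)
    lat2, lon2 = map(math.radians, loc_t)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nd(adj, s, t):
    """Smallest number of hops from s to t, found by breadth-first search.

    adj maps each user to an iterable of neighbours; returns None if t is unreachable.
    """
    if s == t:
        return 0
    seen = {s}
    queue = deque([(s, 0)])
    while queue:
        node, hops = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == t:
                return hops + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, hops + 1))
    return None
```

Breadth-first search is used because every edge counts as one hop, so the first time t is discovered its hop count is minimal.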
Let A be a community detection algorithm that groups the nodes in G into m
overlapping clusters denoted as {C_1, C_2, ..., C_m}. An edge-bridge is an edge e = (u, u′)
such that u ∈ C_i and u′ ∈ C_j for i ≠ j. A node-bridge is a node u such that, for
some i ≠ j, u ∈ C_i and u ∈ C_j. The stratification graph of G, denoted as S = (s_ij),
is defined as:
\[ s_{ij} = \frac{eb(i, j) + nb(i, j)}{\sum_{k=1}^{m} \bigl( eb(i, k) + nb(i, k) \bigr)} \tag{5.1} \]
where eb(i, j) and nb(i, j) are the numbers of edge- and node-bridges connecting
communities i and j, respectively. We extend the definition of network distance from users
to communities, denoted as nd(C_i, C_j) and defined as the smallest number of node-
or edge-bridges needed to reach C_j starting from C_i. We later use s_ij to define the
prominence of community C_i.
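As an illustration of Eq. (5.1), the sketch below counts edge- and node-bridges for a list of possibly overlapping communities and normalizes each row. The function name and data layout are hypothetical, and counting an edge at most once per community pair is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import combinations

def stratification_graph(edges, communities):
    """Return {(i, j): s_ij} for communities given as a list of node sets."""
    member_of = defaultdict(set)
    for idx, members in enumerate(communities):
        for u in members:
            member_of[u].add(idx)

    bridges = defaultdict(int)  # eb(i, j) + nb(i, j), stored in both directions

    # Edge-bridges: an edge (u, v) with u in C_i and v in C_j for i != j.
    for u, v in edges:
        pairs = {(min(i, j), max(i, j))
                 for i in member_of.get(u, ())
                 for j in member_of.get(v, ())
                 if i != j}
        for i, j in pairs:
            bridges[(i, j)] += 1
            bridges[(j, i)] += 1

    # Node-bridges: a node that belongs to both C_i and C_j.
    for comms in member_of.values():
        for i, j in combinations(sorted(comms), 2):
            bridges[(i, j)] += 1
            bridges[(j, i)] += 1

    # Normalize each row i by its total number of bridges, as in Eq. (5.1).
    row_total = defaultdict(int)
    for (i, _), count in bridges.items():
        row_total[i] += count
    return {(i, j): count / row_total[i] for (i, j), count in bridges.items()}
```

Because each row is normalized by its own total, s_ij and s_ji generally differ even though the bridge counts are symmetric.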
Fig. 5.1 shows the stratification graph of communities for Gowalla.

Figure 5.1: Stratification graph of communities in Gowalla.

5.1.1 Modeling Attrition
Let p_k denote the probability of getting from a source to a target in k hops in
chains that are of length at least k, and let p denote the probability of dropping out of
the experiment for nodes that are not adjacent to a target. Let N denote the number of
folders sent, D_k the number of folders delivered to the target at the kth hop, and C_k
the number of chains continuing for at least k hops. If participants do not drop out
of the experiment, then the number of deliveries in k hops is
\[ E_k = p_k \Bigl( N - \sum_{i=1}^{k-1} E_i \Bigr). \]
The expected number of deliveries for one-hop targets is D_1 = N p_1, and the number
of chains continuing for two or more hops is C_2 = N(1 − p_1)p. For k > 1, D_k = p_k C_k
and C_{k+1} = C_k(1 − p_k)p. In the Travers-Milgram experiment, we know N, C_k, and D_k.
Then, the number of deliveries including drops for k > 1 is:
\[ E_k = D_k + \Bigl( N - C_k - \sum_{i=1}^{k-1} E_i \Bigr) p_k. \tag{5.2} \]
Figure 5.2: Distributions of shortest path lengths & average path lengths. [Panel a): density vs. path length; panel b): average path length; series GW, FS, FR, LFM, TMO, TMA.]
With these formulas, we can compute the average number of hops while taking into
account the effect of participants dropping out of the experiment. In the original
Travers-Milgram experiment, the reported path length was 6.2, and dropping accounts
for 2 additional hops. When dropping is taken into consideration, the path length should
therefore be reported as at least 8. An element of novelty here is that we can apply the
effect of attrition to our experimental results from the opposite point of view. Suppose a
participant does not drop out in our emulations unless it has sent the folder, one at a time,
to all of its acquaintances. Since we know E_k, p_k, and C_k from the social routing emulations,
we can calculate D_k and report the average path length in D_k as a function of the
dropping rate.
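A small sketch of the attrition adjustment follows. It applies Eq. (5.2) with p_k estimated as D_k / C_k, which follows from D_k = p_k C_k; the function names and the toy numbers in the usage note are hypothetical and are not data from the Travers-Milgram experiment.

```python
def deliveries_without_attrition(n_sent, chains, delivered):
    """E_k = D_k + (N - C_k - sum_{i<k} E_i) * p_k, with p_k = D_k / C_k.

    chains[k-1] is C_k and delivered[k-1] is D_k for hops k = 1, 2, ...
    For k = 1 we have C_1 = N, so the formula reduces to E_1 = D_1.
    """
    expected = []
    for c_k, d_k in zip(chains, delivered):
        p_k = d_k / c_k if c_k else 0.0
        e_k = d_k + (n_sent - c_k - sum(expected)) * p_k
        expected.append(e_k)
    return expected

def average_path_length(counts):
    """Average hop count when counts[k-1] folders arrive at hop k."""
    total = sum(counts)
    return sum(k * c for k, c in enumerate(counts, start=1)) / total
```

With the toy input N = 100, C = [100, 60, 30], and D = [10, 12, 9], the observed deliveries average about 1.97 hops, while the adjusted deliveries [10, 18, 21.6] average about 2.23 hops, which shows how attrition hides the longer chains. Substituting D_k = p_k C_k into Eq. (5.2) gives E_k = p_k (N − Σ_{i<k} E_i), the no-dropout recurrence.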
5.1.2 Geographical Analysis
In Fig. 5.2(a,b), we compare the distribution of shortest path lengths in four online
social networks (a) and the average shortest path length with one standard deviation
(b) with two results from Travers and Milgram. The average shortest path lengths with one
standard deviation are 4.91 ± 0.78 in Gowalla, 7.74 ± 1.99 in FourSquare,
4.90 ± 0.78 in Flickr, and 5.98 ± 0.99 in Last.fm. The average path length reported by
Travers and Milgram is within one standard deviation of the ground truth in
FourSquare and Last.fm but not in Gowalla and Flickr, whereas the average path length
adjusted for the impact of attrition is out of range for Last.fm.
Figure 5.3: Densities of geographical distances. [Panels a) and b): density f(d) vs. distance d (km), both log-scale; curves: Friends, 19/d, and 9580/d^2 in a); Friends, 5/d, and 4608/d^2 in b).]

In Fig. 5.3(a,b), we plot the probability density (log-scale) as a function of
geographical distance (log-scale) between pairs of friends for Gowalla (a) and FourSquare
(b). The probability density function f(d) is defined as the fraction of friends
whose geographic distance is d ± ε. We fitted the data for each network with two
models, one assuming inverse proportionality to the distance, c/d, and the other to
the square of the distance, c′/d^2. We found the two constants c and c′ by minimizing the
difference between the model and the data:
\[ \min_{c} \Bigl\{ \sum_{d} \frac{(f(d) - c/d)^2}{c/d} \Bigr\}, \qquad \min_{c'} \Bigl\{ \sum_{d} \frac{(f(d) - c'/d^2)^2}{c'/d^2} \Bigr\}. \tag{5.3} \]
For Gowalla, the error is 0.15 for 19.26/d and 1.64 for 9580.42/d^2. For FourSquare,
the error is 0.55 for 4.78/d and 16.76 for 4608.35/d^2. In other words, the error of c/d
is about 11 times lower than that of c′/d^2 in Gowalla and about 30 times lower in
FourSquare, or roughly 20 times lower on average. An
explanation of this difference is that online social media distorts the physical dimension
(approximately a 2-dimensional surface) by allowing people from anywhere to establish
a connection.
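The minimization in Eq. (5.3) can be sketched as follows. Writing x_d = d^(-α), the objective expands to Σ_d [ f(d)^2 / (c x_d) − 2 f(d) + c x_d ], and setting its derivative with respect to c to zero gives c = sqrt( Σ_d f(d)^2 d^α / Σ_d d^(-α) ). This closed form is our own derivation for illustration; the text does not state how the minimization was carried out, and the function names are hypothetical.

```python
import math

def fit_constant(distances, densities, alpha):
    """Constant c > 0 minimizing sum_d (f(d) - c/d^alpha)^2 / (c/d^alpha)."""
    num = sum(f * f * d ** alpha for d, f in zip(distances, densities))
    den = sum(d ** -alpha for d in distances)
    return math.sqrt(num / den)

def fit_error(distances, densities, c, alpha):
    """Value of the objective in Eq. (5.3) for the model c/d^alpha."""
    return sum((f - c / d ** alpha) ** 2 / (c / d ** alpha)
               for d, f in zip(distances, densities))
```

Calling `fit_constant(d, f, 1)` and `fit_constant(d, f, 2)` on the same binned data and comparing the two `fit_error` values mirrors the comparison of the c/d and c′/d^2 models reported above.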
An even better fit is a model c′′/d^δ where 1 < δ < 2, which means that the social
network space is fractal. This observation is in agreement with Liben-Nowell et al.
[67]. Kleinberg's theoretical results in [41], which bound the expected delivery time by
O(log n), assume that the distribution of distances is d^{-2}. Hence, the empirical results do
not satisfy the assumptions made in the mathematical models, and therefore the theoretical
bounds in those models cannot be universally applied.
Table 5.2: Communities detected by GANXiS.
                        Gowalla     Foursquare
Average Size            16.60       15.86
Weighted Average Size   591.81      399.77
Total Communities       10,562      16,495
Avg. Link Density       0.35        0.37
Edge-Bridges            1,004,964   574,748
Node-Bridges            20,137      9,492
5.1.3 Detecting Communities
We selected GANXiS to detect overlapping communities based on its promising
experimental results and the ability to scale to millions of nodes and edges [54]. The
intuition behind GANXiS is that there should be a lot of edges within a community,
and an important feature of GANXiS is that it is able to detect either disjoint or
overlapping communities. This intuition is consistent with the stratified nature of
society in which members within a community such as a family, workplace team,
religious congregation, sport club, etc. are more likely to be connected with each
other than to casual acquaintances. Consequently, once a folder reaches a person
that belongs to the same community as the target, only a few more hops are needed
to reach the target.
In Table 5.2, we listed several measurements of communities detected by GANXiS.
They include the average community size, weighted community size defined as the
average size of a community as observed by each member, the total number of com-
munities detected, the average link density defined as the number of edges inside a
community divided by the maximum number of possible edges, and the number of
edge- and node-bridges.
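The community measurements described above can be sketched as follows. Reading "the average size of a community as observed by each member" as Σ n^2 / Σ n over community sizes n is an assumption of this illustration, and the function names are hypothetical.

```python
def average_size(communities):
    """Plain average of community sizes."""
    return sum(len(c) for c in communities) / len(communities)

def weighted_average_size(communities):
    """Average size seen by a member: each community of size n is counted n times."""
    return sum(len(c) ** 2 for c in communities) / sum(len(c) for c in communities)

def link_density(community, edges):
    """Edges inside the community divided by the maximum possible n(n-1)/2.

    edges is assumed to list each undirected edge once.
    """
    n = len(community)
    if n < 2:
        return 0.0
    inside = sum(1 for u, v in edges if u in community and v in community)
    return inside / (n * (n - 1) / 2)
```

The weighted average is dominated by large communities, which is consistent with Table 5.2 reporting a weighted average size far above the plain average.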
5.2 Experimental Design
There are two strategies that define our emulation of the social routing. The
first one describes the process of routing a folder by defining in each step of routing
which acquaintance of the current folder holder is receiving the folder, and the second
one defines the process of selecting starters and targets.
5.2.1 Routing Strategies
The first routing strategy, denoted as GEOGREEDY in [67], is to pass the folder
to the acquaintance who is geographically closest to the target, that is, picking
an acquaintance u with the smallest gd(u, t). The second routing strategy, denoted
as COMGREEDY, is to pass the folder to the acquaintance who is closest to the
target in terms of community distance, that is, picking an acquaintance u with the
smallest nd(C_u, C_t). For overlapping communities, COMGREEDY selects the
communities of u and t so that nd(C_u, C_t) is minimal. Such
information may not always be available to the current folder holder, so GEOGREEDY
is more realistic than COMGREEDY, but the purpose of introducing COMGREEDY is to
understand which property of the network, geography or community, is more useful
for reaching the target in the majority of cases.
The third routing strategy, denoted as GEOCOM, uses a combination of the knowledge
of geography and community when selecting an acquaintance. In GEOCOM,
a node gives the highest preference to acquaintances who belong to the same
community as the target (i.e., nd(C_u, C_t) = 0) and breaks ties between them by selecting
the acquaintance who is geographically closest to the target (i.e., GEOGREEDY).
If a node has no acquaintances who belong to the same community as the target,
then the node uses GEOGREEDY. An element of novelty in using a combination of
geography and community is that it seems to be more realistic than using either one
alone.
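The three selection rules can be sketched as below. The callables `gd` and `com_dist` are assumed to be supplied by the caller (`com_dist(u, t)` standing for the minimal nd(C_u, C_t) over the communities of u and t); all names are hypothetical, and the sketch covers only the choice among the current holder's remaining acquaintances, not the full routing loop.

```python
def geogreedy(candidates, target, gd):
    """Pick the acquaintance geographically closest to the target."""
    return min(candidates, key=lambda u: gd(u, target))

def comgreedy(candidates, target, com_dist):
    """Pick the acquaintance with the smallest community distance to the target."""
    return min(candidates, key=lambda u: com_dist(u, target))

def geocom(candidates, target, gd, com_dist):
    """Prefer acquaintances in the target's community; break ties geographically.

    Falls back to GEOGREEDY over all candidates when none shares a community
    with the target.
    """
    same_community = [u for u in candidates if com_dist(u, target) == 0]
    return geogreedy(same_community or candidates, target, gd)
```

Passing the distance functions as arguments keeps the selection rules independent of how locations and communities are stored.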
In all strategies, routing stops either when the folder has reached the target or
when the current holder does not have any more acquaintances to whom it can pass the
folder because all of its acquaintances have already been chosen. If the
current holder does not have any acquaintances who belong to the same community
as the target, then s_ij defines the probability of going from C_i to C_j in one step. The
implication is that s_ij influences the routing strategy in GEOGREEDY and GEOCOM
but not in COMGREEDY, and this influence does not depend on the target but on how
communities are interconnected. Therefore, we define the prominence of community
C_i, denoted as λ_i, as P(X_t = C_i), where X_t denotes the community reached in a
random walk at step 0 < t < ∞. The idea is that the more prominent a community, the
more likely it is to be reached in a random walk process.
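One way to approximate the prominence λ_i is to iterate the random walk on the stratification graph until the distribution over communities stops changing. The sketch below is an illustration under assumptions: it uses a "lazy" walk (stay put with probability 1/2), which has the same stationary distribution but avoids oscillation on periodic graphs, and the function name and data layout are hypothetical.

```python
def steady_state(transitions, steps=1000, tol=1e-12):
    """Stationary distribution of a random walk given as {i: {j: s_ij}}.

    Every community must appear as a key, and each row must sum to 1.
    """
    nodes = sorted(transitions)
    dist = {i: 1.0 / len(nodes) for i in nodes}
    for _ in range(steps):
        # Lazy step: keep half of the mass in place, move the other half along s_ij.
        nxt = {i: 0.5 * dist[i] for i in nodes}
        for i, row in transitions.items():
            for j, s_ij in row.items():
                nxt[j] += 0.5 * dist[i] * s_ij
        if max(abs(nxt[i] - dist[i]) for i in nodes) < tol:
            return nxt
        dist = nxt
    return dist
```

For three communities in a chain with s_01 = 1, s_10 = 2/3, s_12 = 1/3, and s_21 = 1, the middle community receives the largest share (1/2), matching the intuition that well-connected communities are more prominent.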
Table 5.3: Prominence of individuals and communities.
              Gowalla                   FourSquare
Percentile    PageRank   Steady State   PageRank   Steady State
Top 1%        0.000121   0.005421       0.000065   0.003652
Top 20%       0.000014   0.000134       0.000010   0.000094
60th-80th%    0.000005   0.000036       0.000003   0.000016
40th-60th%    0.000003   0.000021       0.000002   0.000008
Bottom 40%    0.000002   0.000021       0.000001   0.000003

Table 5.4: Experimental results for Gowalla.
Average Number of Hops in Successful Chains
                   GEOGREEDY   COMGREEDY   GEOCOM
Random             29.43       20.57       19.08
nd(C_s, C_t) = 0   5.61        3.61        3.61
nd(C_s, C_t) = 1   26.13       16.06       16.23
nd(C_s, C_t) = 2   27.78       18.71       21.13
nd(C_s, C_t) = 3   29.06       19.76       24.36
Percentage of Successful Chains
                   GEOGREEDY   COMGREEDY   GEOCOM
Random             0.30        0.44        0.50
nd(C_s, C_t) = 0   0.71        0.87        0.91
nd(C_s, C_t) = 1   0.34        0.57        0.58
nd(C_s, C_t) = 2   0.27        0.46        0.44
nd(C_s, C_t) = 3   0.18        0.38        0.27
5.2.2 Starter & Target Selections
First, we select a starter and a target that are separated by a fixed number
of communities, i.e., nd(C_s, C_t) = k. For example, when k = 0, the starter s and
target t are selected within the same community. Then, we select a target t based
on its prominence measured by its PageRank score, and next a random target from
a prominent community as measured by the steady state of the random walk on
the stratification graph S. The percentiles of individual and community prominence
measured by PageRank and the steady state of λ_i are listed in Table 5.3. Finally, we select
starters and targets randomly to mimic the most unbiased way in which participants
could be selected for a Travers-Milgram-like experiment.
Table 5.5: Experimental results for FourSquare.
Average Number of Hops in Successful Chains
                   GEOGREEDY   COMGREEDY   GEOCOM
Random             18.19       16.01       16.52
nd(C_s, C_t) = 0   1.93        2.06        1.99
nd(C_s, C_t) = 1   7.81        7.37        6.21
nd(C_s, C_t) = 2   15.36       12.96       12.10
nd(C_s, C_t) = 3   18.02       15.14       15.81
Percentage of Successful Chains
                   GEOGREEDY   COMGREEDY   GEOCOM
Random             0.01        0.22        0.04
nd(C_s, C_t) = 0   0.75        0.86        0.88
nd(C_s, C_t) = 1   0.13        0.51        0.38
nd(C_s, C_t) = 2   0.04        0.37        0.12
nd(C_s, C_t) = 3   0.02        0.28        0.04
5.3 Experimental Results
5.3.1 Selection & Routing Combinations
Table 5.4 contains the experimental results for Gowalla. The upper section of
Table 5.4 displays the average number of hops it takes to successfully reach a
target using the five selection techniques listed in the left column and the three routing
strategies listed across the top. The lower section of Table 5.4 gives the
percentage of successful chains, defined as the number of times the target was reached
divided by the number of trials. For each selection process and routing strategy, we
ran N = 10^4 trials. The experimental results for FourSquare are displayed in Table
5.5.
Tables 5.4 and 5.5 show that selecting a starter and a target from the same
community makes it likely for the target to be reached in a few hops, about 4 hops
in Gowalla and 2 hops in FourSquare, with a high success rate of approximately 83%
for both networks. The percentage of successful chains decreases as the community
distance between the starter and target increases. On average, it takes approximately
22 hops to reach a target with a success rate of 39% for Gowalla, and 12 hops to reach
a target with a success rate of 21% for FourSquare, for community distances ranging
from 1 to 3. As the community distance between the starter and target increases,
the percentage of successful chains decreases to about 19% for Gowalla and 24% for
FourSquare.
Figure 5.4: Friends-of-friends knowledge densities. [Panels a)-c): Gowalla; panels d)-f): FourSquare; x-axis: friends-of-friends knowledge density; y-axis: average path length; drop rates of 5%, 15%, and 30%.]
Also, Tables 5.4 and 5.5 show that COMGREEDY is much more effective than
GEOGREEDY in terms of average path length and percentage of successful chains
in both networks. On average, COMGREEDY reaches the target about 8 hops
quicker than GEOGREEDY in Gowalla and 2 hops quicker in FourSquare. Moreover,
COMGREEDY reaches the target 18% more often than GEOGREEDY in Gowalla and
26% more often in FourSquare. Hence, using community distances is more effective
at reaching targets than using geographical distances.
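The averages quoted above can be reproduced from Tables 5.4 and 5.5 by averaging each column over the five selection rows (Random and nd = 0 to 3). The sketch below copies the GEOGREEDY and COMGREEDY columns from those tables; the helper names are hypothetical, and the unweighted five-row average is our assumption about how the figures were obtained.

```python
# Rows: Random, nd=0, nd=1, nd=2, nd=3; values copied from Tables 5.4 and 5.5.
HOPS = {
    "Gowalla": {"GEOGREEDY": [29.43, 5.61, 26.13, 27.78, 29.06],
                "COMGREEDY": [20.57, 3.61, 16.06, 18.71, 19.76]},
    "FourSquare": {"GEOGREEDY": [18.19, 1.93, 7.81, 15.36, 18.02],
                   "COMGREEDY": [16.01, 2.06, 7.37, 12.96, 15.14]},
}
SUCCESS = {
    "Gowalla": {"GEOGREEDY": [0.30, 0.71, 0.34, 0.27, 0.18],
                "COMGREEDY": [0.44, 0.87, 0.57, 0.46, 0.38]},
    "FourSquare": {"GEOGREEDY": [0.01, 0.75, 0.13, 0.04, 0.02],
                   "COMGREEDY": [0.22, 0.86, 0.51, 0.37, 0.28]},
}

def mean(values):
    return sum(values) / len(values)

def hops_saved(network):
    """Average hops of GEOGREEDY minus average hops of COMGREEDY."""
    return mean(HOPS[network]["GEOGREEDY"]) - mean(HOPS[network]["COMGREEDY"])

def success_gain(network):
    """Average success rate of COMGREEDY minus that of GEOGREEDY."""
    return mean(SUCCESS[network]["COMGREEDY"]) - mean(SUCCESS[network]["GEOGREEDY"])
```

Under this reading, `hops_saved` gives about 7.9 hops for Gowalla and 1.6 for FourSquare, and `success_gain` gives about 0.18 and 0.26, in line with the rounded figures in the text.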
5.3.2 Friends-of-Friends Knowledge Densities
To make GEOCOM more realistic, we introduce the probability that the current
holder might have some relevant clues about its acquaintances. A possible clue is
friends-of-friends knowledge, where a holder might know some of its friends' friends,
where they are geographically located, and to which communities they belong. In Fig.
5.4, we plotted the average path length as a function of friends-of-friends knowledge
density for Gowalla (a-c) and FourSquare (d-f). The x-axis represents the probability
that the current holder might know the geographical location and community
information of a friend-of-friend. A value of 0 means the holder only uses its friends to
make a routing decision, and a value of 1 means a holder knows all the friends of its
friends. In addition, we examined three levels of attrition added into this particular
experiment. Subfigures a) and d) refer to a 5% dropping rate, subfigures b) and
e) refer to a 15% dropping rate, and subfigures c) and f) refer to a 30% dropping
rate. Regions within one standard deviation of the ground truth in terms
of average path length are shaded in blue. In Gowalla, results show that with a 5%
dropping rate, the friends-of-friends knowledge level is too low to bring the average
path length within one standard deviation of the ground truth. However,
with a 15% dropping rate, a knowledge level of about 20% is sufficient to come within one
standard deviation of the ground truth, and no friends-of-friends knowledge is needed when
the drop rate is 30% or higher. In FourSquare, results show that with a 5% dropping
rate, no friends-of-friends knowledge is needed to be within one standard deviation
of the ground truth, and average path lengths are very short and not within
one standard deviation when the dropping rate is 15% or more. The reason for the
contrasting behavior is that increasing attrition makes the path length of successful
chains smaller than the ground truth (i.e., 5 in Gowalla vs. 8 in FourSquare).
A difference between Gowalla and FourSquare is that Gowalla is much more
connected in terms of the density of relationships between nodes. The percentage of
targets found successfully is overall higher in Gowalla than in FourSquare. Recall
that nodes drop out in the simulations when they do not have any more acquaintances
to pass the folder to. Since there are more relationships in Gowalla, participants
stay longer in the simulations, which increases the path length of successful chains.
In FourSquare, participants have fewer social relationships, so they drop out quicker;
therefore, successful chains are shorter in FourSquare than in Gowalla.
5.3.3 Distributions of Successful Chains
Figure 5.5: Path length of successful chains & drop rates. [Panels a) and c): percentage vs. path length of successful chains for nd = 0, 1, 2, 3 and the ground truth; panels b) and d): average path length vs. drop rate, with the TM drop rate marked.]

In Fig. 5.5, we plotted the distribution of the lengths of successful chains
in a) and c) and the modified average path length as a function of the dropping
rate in b) and d) for Gowalla and FourSquare, respectively. Results show that it
is difficult to find targets when nd > 0, but still the average path length decreases
when the dropping rate increases. For instance, the average path length of successful
chains with a dropping rate increasing from 0.2 to 0.4 grows on average from 2 to
6 for Gowalla and 2 to 7 for FourSquare. More importantly, the variances of the
distributions for nd > 0 are large compared to the ground truth, as seen in a) and c),
meaning that some targets are easier to reach than others. This leads us to measure
the reachability of a target by examining its individual prominence.
5.3.4 Effects of Hubs and Connectors
Figure 5.6: Effects of routing to connectors & hubs. [Panels a) and b): % of successful chains for GEO, COM, GCOM, and CON.; panels c) and d): density vs. path length of successful chains, with legends R=80km (28, 72%), R=241 (32, 74%), R=400 (38, 76%), R=563 (42, 78%) in c) and R=80km (36, 12%), R=241 (44, 14%), R=400 (65, 16%), R=563 (75, 17%) in d).]

In Fig. 5.6, we examined the effects of routing the folder to connectors and hubs
discussed in the literature [89]. The first experiment is to pass the folder to the
connector, defined as the acquaintance who has the highest number of connections to
other nodes within the community. Results show an improvement in the delivery rates
in Gowalla and FourSquare, as seen in Fig. 5.6(a,b). For this connector experiment, we
did not select starters and targets randomly because connectors would be flooded
with requests, making the routing strategy impractical in reality. Perhaps a setting
where passing the folder to a connector would not be too unrealistic is when the
starter and target are from the same community.
Another setting that would reduce the flooding of requests is selecting a hub
within some geographical radius from the target. For this experiment, we modified
GEOCOM to incorporate indegree into making a routing decision. First, if the holder
has multiple friends who belong to the same community as the target, then it breaks
ties by selecting the connector. If the connector does not exist, then it selects a group
of acquaintances who are within some radius of the target, and selects a hub
from this group, defined as the friend who has the highest degree. If the hub does
not exist, then it uses GEOGREEDY. As the radius increases by 161km, the delivery
rates for Gowalla and FourSquare increase by approximately 2%, and the average
path length of successful chains increases by about 5 hops in Gowalla, as seen in Fig. 5.6
(c), and 10 hops in FourSquare, as seen in Fig. 5.6 (d).
5.3.5 Individual and Community Prominence
Figure 5.7: Prominence of individuals & communities on reachability. [Panels a) and b): average path length vs. PageRank, emulations with linear fit, r = −0.71 in a) and r = −0.44 in b); panels c) and d): average path length vs. λ_i (log-scale), emulations with linear fit, r = −0.54 in c) and r = −0.65 in d).]

In Fig. 5.7, we calculated the average path length of finding a target as a
function of its PageRank for Gowalla a) and FourSquare b). When the PageRank score
increases, the average path length decreases from 16 to 4 in Gowalla and 15 to 5 in
FourSquare. The routing algorithm used in this particular experiment is GEOCOM
with an 8% friends-of-friends knowledge level and with starters and targets randomly
selected. Hence, results from this experiment show that the small-world property holds
for the highly prominent while everyone else is lost in the crowd. In addition, we
calculated the average path length of finding a target as a function of its community
prominence measured by λ_i for Gowalla c) and FourSquare d). Results from this
experiment also show that targets selected from prominent communities are reached
quicker than targets from non-prominent communities. Correlation coefficients of the
linear relationship between prominence and average path lengths are displayed in each
individual subfigure.
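For reference, a correlation coefficient of a linear relationship such as the r values reported in the subfigures can be computed as below. The use of the Pearson coefficient is an assumption (the text says only "correlation coefficients of the linear relationship"), and the function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

A negative value, as in Fig. 5.7, indicates that the average path length tends to fall as prominence rises.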
Figure 5.8: Prominence of individuals & communities correlations. [Panels a)-c): Gowalla; panels d)-f): FourSquare; x-axis: λ_i (log-scale); y-axis (log-scale): sum, average, and maximum PageRank; correlation coefficients printed in the panels: r = 0.79, 0.95, 0.81 and r = 0.82, 0.61, 0.91.]

Finally, we examined the correlation between the individual prominence of
targets measured by PageRank and community prominence measured by a random
walk process in Fig. 5.8. Results show that these two measurements are highly
correlated and consistent in the sense that prominent users are in prominent communities
and prominent communities contain prominent users. For each community, we
calculated the collective prominence of users measured by the total, average, and maximum
PageRank of its users. Subfigures a-c refer to communities in Gowalla and subfigures
d-f refer to communities in FourSquare. Each point in a figure is a community, where
the x-axis for all subfigures refers to the community prominence and the y-axis in a)
and d) refers to the sum PageRank, in b) and e) to the average PageRank, and in c) and
f) to the maximum PageRank of a community. Correlation coefficients of the
linear relationship between community and individual prominence are shown in each
individual subfigure.
5.4 Summary of Results
By analyzing data recently available from location-based social media, we
provided three conclusions from our social routing experiments. First, results show that
while using geographical and community information in modeling social routing for
the small-world problem is more realistic than using either one alone, average path
lengths are 3 times longer when attrition is eliminated and not even within two
standard deviations of the ground truth, defined as the calculated average
shortest path length. Second, COMGREEDY is more effective and robust at reaching
targets than GEOGREEDY in terms of average path lengths and percentage of
successful chains. It is quite plausible that participants could use COMGREEDY cognitively.
For example, a holder can select an acquaintance whose occupation is mortgage
insurance as being ‘closer’ to a commodity broker than a social science teacher. Third,
results from the data show that prominent targets and targets in prominent
communities can be reached much quicker than average. This leads us to ask what
the results would be if Travers and Milgram had not selected a broker but instead a much less
prominent target such as a homeless man. To conclude, our results show that the
small-world property holds for the prominent while everyone else is lost in the crowd
except when being reached by members within their own community.
CHAPTER 6
CONCLUSION AND FUTURE WORK
Table 6.1: Aspects of SNA & applications.
                       Geography        Interactions       Communities
Human Mobility         Distance         Communication      Group
Spreading Ideas        Long Ties        Weak Ties          Bridge Ties
Personalized Ranking   Geo. Influence   Peer Influ.        Collective Influ.
Small-world            Selection        Cognitive Biases   Routing
In Chapter 3, we examined interesting human dynamics in online social networks
in terms of geographical proximity, face-to-face interactions, and communities, and found
some valuable insights. For instance, the creation of friendship between two people is
more likely to occur when they are close, and friends and friends-of-friends are more
likely to be within geographic proximity, but not further. Geography has an effect on
limiting face-to-face interactions as well as keyword similarity in terms of what users
read on social media. One possible direction for future research is to investigate social
influence as a function of geographical distance. For instance, if a person checks in at a
location, how likely is a friend to check in at the same location in the future?
Two applications we studied in Chapter 3 are human mobility & congestion
modeling and ideas spreading & economic development. Geography shows how friends
are likely to be close in terms of moving together (human mobility). Face-to-face
interactions could be used in establishing connections in wireless simulations,
where nodes that are frequently interacting are more likely to establish a connection,
and communities can be used to simulate a group of nodes moving together. For
ideas spreading, geography can be used to measure the length of short and long ties,
face-to-face interactions can be used to measure the strength of ties, and communities
can be used to distinguish between bridge and non-bridge ties.
In Chapter 4, we proposed to personalize the ranking of URLs by using public
information that users shared in social media. We incorporated the following two
important aspects of the social networks into the processing of ranking URLs: geo-
graphical distance and community structure. Personalized ranking results from three
variants of network flow are highly independent of PageRank, meaning that each
individual has their own way of ranking information. Experimental results show
that personalization can improve ranking quality by up to 19% compared to the
baseline and by up to 5% compared to PageRank in Google Buzz. For Twitter,
personalization improves ranking quality by up to 17% compared to the baseline,
but it does not outperform PageRank. Future work could compute the novelty of a
piece of information by examining its keywords [90] and determine the popularity
of information in terms of burstiness [91]. Such filters would allow users to
view information on the web through the eyes of the world.
In Chapter 5, results show that average path lengths in social searching are
three times longer when attrition is eliminated, and are not even within two
standard deviations of the ground truth. COMGREEDY is highly effective at
reaching targets, and it is plausible that participants could apply COMGREEDY
cognitively. Prominent targets, on the other hand, can be reached much more
quickly. The small-world property holds for the prominent, while everyone else
is lost in the crowd unless they are reached by members of their own community.
Future work could incorporate face-to-face interactions to measure potential
cognitive biases in selecting the next acquaintance. In addition, instead of
assuming a fixed probability of attrition, participants could drop out based on
interactions, in the sense that the next folder holder has a higher chance of
participating if that person interacts frequently with the previous holder.
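The interaction-dependent attrition suggested above could be prototyped along the following lines. This Python sketch is only one possible formulation and is not taken from the experiments in Chapter 5: the saturating mapping from interaction counts to participation probability, its parameters `base` and `half_saturation`, and the data layout are hypothetical choices made for illustration.

```python
import random


def continue_probability(interactions, base=0.25, half_saturation=5.0):
    """Hypothetical participation probability for the next holder.

    Equals `base` when the two people never interact and approaches 1.0
    as the number of interactions grows.
    """
    return base + (1.0 - base) * interactions / (interactions + half_saturation)


def simulate_chain(path, interaction_counts, rng, **kwargs):
    """Forward a folder along `path` and return the number of completed hops.

    interaction_counts maps frozenset({u, v}) to the number of observed
    interactions between u and v; missing pairs count as zero.
    """
    hops = 0
    for prev, nxt in zip(path, path[1:]):
        w = interaction_counts.get(frozenset((prev, nxt)), 0)
        if rng.random() >= continue_probability(w, **kwargs):
            break  # the next holder drops out, so the chain ends here
        hops += 1
    return hops


def completion_rate(path, interaction_counts, trials=1000, seed=0, **kwargs):
    """Fraction of simulated chains that reach the end of `path`."""
    rng = random.Random(seed)
    target = len(path) - 1
    done = sum(simulate_chain(path, interaction_counts, rng, **kwargs) == target
               for _ in range(trials))
    return done / trials
```

Setting `half_saturation` very large makes the probability nearly constant at `base`, which approximates the fixed-attrition assumption and allows the two settings to be compared on the same paths.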
To summarize, this thesis collects terabytes of information that users share on
social networks and analyzes their social dynamics in terms of geography, face-to-face
interactions and community structures.
REFERENCES
[1] J. Kleinberg and S. Lawrence, “The structure of the web,” Sci., vol. 294, no.
5548, pp. 1849-1850, Nov. 2001.
[2] R. Lempel and S. Moran, “SALSA: The stochastic approach for link-structure
analysis,” ACM Trans. Inf. Syst., vol. 19, no. 2, pp. 131-160, Apr. 2001.
[3] R. Albert et al., “The diameter of the world wide web,” Nature, vol. 401, no.
6749, pp. 130-131, Sept. 1999.
[4] T. Berners-Lee et al., “The semantic web,” Sci. Amer., vol. 284, no. 5, pp.
34-43, May 2001.
[5] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search
engine,” Comput. Networks and ISDN Syst., vol. 30, no. 1, pp. 107-117, Apr.
1998.
[6] Z. Gyongyi et al., “Combating web spam with trustrank,” in Proc. 30th Int.
Conf. Very Large Data Bases, Toronto, Canada, 2004, pp. 576-587.
[7] J. Xu and H. Li, “Adarank: A boosting algorithm for information retrieval,”
in Proc. 30th Int. ACM SIGIR Conf. Res. and Develop. in Inform. Retrieval,
Amsterdam, Netherlands, 2007, pp. 391-398.
[8] Y. Liu et al., “Browserank: Letting web users vote for page importance,” in
Proc. 31st Int. ACM SIGIR Conf. Res. and Develop. in Inform. Retrieval,
Singapore, Republic of Singapore, 2008, pp. 451-458.
[9] M. Taylor et al., “Softrank: Optimizing non-smooth rank metrics,” in Proc.
1st Int. Conf. Web Search and Data Mining, Palo Alto, CA, 2008, pp. 77-86.
[10] H. Yan et al., “Architectural design and evaluation of an efficient
web-crawling system,” J. Syst. Softw., vol. 60, no. 3, pp. 185-193, Feb. 2002.
[11] E. Leicht et al., “Large-scale structure of time evolving citation networks,”
Eur. Phys. J. B, vol. 59, no. 1, pp. 75-83, Oct. 2007.
[12] S. Bao et al., “Optimizing web search using social annotations,” in Proc. 16th
Int. Conf. World Wide Web, Alberta, Canada, 2007, pp. 501-510.
[13] J. Davitz et al., “ilink: Search and routing in social networks,” in Proc. 13th
ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Jose,
CA, 2007, pp. 931-940.
[14] D. Carmel et al., “Personalized social search based on the user’s social
network,” in Proc. 18th ACM Conf. Inform. and Knowledge Manage., Hong
Kong, China, 2009, pp. 1227-1236.
[15] D. Horowitz and S. Kamvar, “The anatomy of a large-scale social search
engine,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp.
431-440.
[16] A. Dong et al., “Time is of the essence: Improving recency ranking using
Twitter data,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010,
pp. 331-340.
[17] B. Bahmani and A. Goel, “Partitioned multi-indexing: Bringing order to
social search,” in Proc. 21st Int. Conf. World Wide Web, Lyon, France, 2012,
pp. 399-408.
[18] T. Nguyen and B. Szymanski, “Social ranking techniques for the web,” in
Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and
Mining, Niagara Falls, Canada, 2013, pp. 49-55.
[19] D. Romero et al., “Differences in the mechanics of information diffusion across
topics: Idioms, political hashtags, and complex contagion on Twitter,” in
Proc. 20th Int. Conf. World Wide Web, Hyderabad, India, 2011, pp. 695-704.
[20] A. Ritter et al., “Open domain event extraction from Twitter,” in Proc. 18th
ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Beijing,
China, 2012, pp. 1104-1112.
[21] P. Bogdanov et al., “The social media genome: Modeling individual
topic-specific behavior in social media,” in Proc. IEEE/ACM Int. Conf.
Advances in Social Network Analysis and Mining, Niagara Falls, Canada,
2013, pp. 236-242.
[22] T. Nguyen and B. Szymanski, “Using location-based social networks to
validate human mobility and relationships models,” in Proc. IEEE/ACM Int.
Conf. Advances in Social Network Analysis and Mining (SNAA Workshop),
Istanbul, Turkey, 2012, pp. 1247-1253.
[23] T. Nguyen, M. Chen, and B. Szymanski, “Analyzing the proximity and interactions
of friends in communities in Gowalla,” in Proc. IEEE 13th Int. Conf.
Data Mining Workshops, Dallas, TX, 2013, pp. 1036-1044.
[24] L. Backstrom et al., “Four degrees of separation,” in Proc. 4th ACM Int.
Conf. Web Science, Evanston, IL, 2012, pp. 33-42.
[25] Y. Ahn et al., “Analysis of topological characteristics of huge online social
networking services,” in Proc. 16th Int. Conf. World Wide Web, Alberta,
Canada, 2007, pp. 835-844.
[26] H. Kwak et al., “What is Twitter, a social network or a news media?,” in
Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp. 591-600.
[27] A. Mislove et al., “Measurement and analysis of online social networks,” in
Proc. 7th ACM SIGCOMM Conf. Internet Measurement, San Diego, CA,
2007, pp. 29-42.
[28] J. Kleinfeld, “Could it be a big world after all? The ‘six degrees of separation’
myth,” Soc., vol. 39, no. 2, pp. 61-66, Apr. 2002.
[29] J. Travers and S. Milgram, “An experimental study of the small world
problem,” Sociometry, vol. 32, no. 4, pp. 425-443, Dec. 1969.
[30] P. Dodds et al., “An experimental study of search in global social networks,”
Sci., vol. 301, no. 5634, pp. 827-829, Aug. 2003.
[31] D. Watts et al., “Identity and search in social networks,” Sci., vol. 296, no.
5571, pp. 1302-1305, May 2002.
[32] M. Granovetter, Getting a Job: A Study of Contacts and Careers. Chicago, IL:
University of Chicago Press, 1995.
[33] D. Watts, “Networks, dynamics, and the small-world phenomenon,” AJS, vol.
105, no. 2, pp. 493-527, Sept. 1999.
[34] M. Marchiori, “The quest for correct information on the web hyper search
engines,” Comput. Networks and ISDN Syst., vol. 29, no. 8, pp. 1225-1235,
Sept. 1997.
[35] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM,
vol. 46, no. 5, pp. 604-632, Sept. 1999.
[36] C. Warden. (2010, April 22) EdgeRank: The Secret Sauce That Makes
Facebook’s News Feed Tick [Blog]. Available:
http://www.techcrunch.com/2010/04/22/facebook-edgerank/ (Date Last
Accessed, September, 22, 2014).
[37] C. Burges et al., “Learning to rank using gradient descent,” in Proc. 22nd Int.
Conf. Mach. Learning, Bonn, Germany, 2005, pp. 89-96.
[38] T. Joachims, “Optimizing search engines using clickthrough data,” in Proc.
8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining,
Edmonton, Canada, 2002, pp. 133-142.
[39] R. Caruana et al., “Using the future to sort out the present: Rankprop and
multitask learning for medical risk evaluation,” in Proc. Advances in Neural
Inform. Processing Symp., Denver, CO, 1995, pp. 959-965.
[40] K. Crammer and Y. Singer, “Pranking with ranking,” in Proc. Advances in
Neural Inform. Processing Syst., Vancouver, Canada, 2001, pp. 641-647.
[41] J. Kleinberg, “The small-world phenomenon: An algorithmic perspective,” in
Proc. 32nd Ann. ACM Symp. Theory Computing, Portland, OR, 2000, pp.
163-170.
[42] M. Burke et al., “Social capital on Facebook: Differentiating uses and users,”
in Proc. SIGCHI Conf. Human Factors in Computing Syst., Vancouver,
Canada, 2011, pp. 571-580.
[43] A. Mislove et al., “Understanding the demographics of Twitter users,”
presented at 2011 5th Int. AAAI Conf. Weblogs and Social Media, Barcelona,
Spain, 2011.
[44] M. Newman, “Fast algorithm for detecting community structure in networks,”
Phys. Rev. E, vol. 69, no. 6, doi: 10.1103/PhysRevE.69.066133, June 2004.
[45] A. Clauset et al., “Finding community structure in very large networks,”
Phys. Rev. E, vol. 70, no. 6, doi: 10.1103/PhysRevE.70.066111, Dec. 2004.
[46] M. Newman, “Modularity and community structure in networks,” Proc. Nat.
Academy Sci., vol. 103, no. 23, pp. 8577-8582, May 2006.
[47] S. Fortunato and M. Barthelemy, “Resolution limit in community detection,”
Proc. Nat. Academy Sci., vol. 104, no. 1, pp. 36-41, Dec. 2006.
[48] M. Chen et al., “A new metric for quality of network community structure,”
ASE Human J., vol. 1, no. 4, pp. 226-240, 2013.
[49] M. Goldberg et al., “Finding overlapping communities in social networks,” in
Proc. 4th ASE/IEEE Int. Conf. Social Computing, Minneapolis, MN, 2010,
pp. 37-54.
[50] G. Palla et al., “Uncovering the overlapping community structure of complex
networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814-818, Apr.
2005.
[51] M. Sipser, Introduction to the Theory of Computation. Boston, MA: PWS,
1997.
[52] B. Good et al., “The performance of modularity maximization in practical
contexts,” Phys. Rev. E, vol. 81, no. 4, doi: 10.1103/PhysRevE.81.046106,
2010.
[53] M. Newman, “Analysis of weighted networks,” Phys. Rev. E, vol. 70, no. 5,
doi: 10.1103/PhysRevE.70.056131, Apr. 2004.
[54] J. Xie and B. Szymanski, “Towards linear time overlapping community
detection in social networks,” in Proc. 16th Pacific-Asia Conf. Knowledge
Discovery and Data Mining PAKDD, Kuala Lumpur, Malaysia, 2012, pp.
25-36.
[55] S. Fortunato, “Community detection in graphs,” Phys. Rep., vol. 486, no. 3,
pp. 75-174, Feb. 2010.
[56] J. Leskovec et al., “Empirical comparison of algorithms for network
community detection,” in Proc. 19th Int. Conf. World Wide Web, Raleigh,
NC, 2010, pp. 631-640.
[57] R. Kannan et al., “On clusterings: Good, bad and spectral,” J. ACM, vol. 51,
no. 3, pp. 497-515, May 2004.
[58] R. Dunbar, “Neocortex size as a constraint on group size in primates,” J.
Human Evolution, vol. 22, no. 6, pp. 469-493, June 1992.
[59] M. Gonzalez et al., “Understanding individual human mobility patterns,”
Nature, vol. 453, no. 7196, pp. 779-782, June 2008.
[60] E. Boxman et al., “The impact of social and human capital on the income
attainment of Dutch managers,” Social Networks, vol. 13, no. 1, pp. 51-73,
Mar. 1991.
[61] R. Burt, Structural Holes: The Social Structure of Competition. Cambridge,
MA: Harvard University Press, 1992.
[62] R. Reagans and E. Zuckerman, “Networks, diversity, and productivity:
The social capital of corporate R & D teams,” Organ. Sci., vol. 12, no. 4, pp.
502-517, Aug. 2001.
[63] M. Ruef, “Strong ties, weak ties and islands: Structural and cultural
predictors of organizational innovation,” ICC, vol. 11, no. 3, pp. 427-449, June
2002.
[64] R. Burt, “Structural holes and good ideas,” AJS, vol. 110, no. 2, pp. 349-399,
Sept. 2004.
[65] M. Granovetter, “The impact of social structure on economic outcomes,” JEP,
vol. 19, no. 1, pp. 33-50, Dec. 2005.
[66] A. Pentland, Social Physics: How Good Ideas Spread Lessons From a New
Science. London, UK: Penguin Press, 2014.
[67] D. Liben-Nowell et al., “Geographic routing in social networks,” Proc. Nat.
Academy Sci., vol. 102, no. 33, pp. 11623-11628, June 2005.
[68] L. Adamic and E. Adar, “How to search a social network,” Social
Networks, vol. 27, no. 3, pp. 187-203, July 2005.
[69] M. Granovetter, “The strength of weak ties,” AJS, vol. 78, no. 6, pp.
1360-1380, May 1973.
[70] L. Bettencourt et al., “Growth, innovation, scaling, and the pace of life in
cities,” Proc. Nat. Academy Sci., vol. 104, no. 17, pp. 7301-7306, Mar. 2007.
[71] W. Pan et al., “Urban characteristics attributable to density-driven tie
formation,” Nat. Commun., vol. 4, no. 1, doi: 10.1038/ncomms2961.
[72] G. Ghasemiesfeh et al., “Complex contagion and the weakness of long ties in
social networks: Revisited,” in Proc. 14th ACM Conf. Electronic Commerce,
Philadelphia, PA, 2013, pp. 507-524.
[73] D. Centola and M. Macy, “Complex contagions and the weakness of
long ties,” AJS, vol. 113, no. 3, pp. 702-734, Nov. 2007.
[74] N. Eagle et al., “Network diversity and economic development,” Sci., vol. 328,
no. 5981, pp. 1029-1031, May 2010.
[75] E. Rogers, Diffusion of Innovations. New York: Free Press, 2003.
[76] (2014, August 27) Gross Domestic Product by State [Online]. Available:
http://www.bea.gov/regional/gsp/ (Date Last Accessed, September, 22,
2014).
[77] (2014, August 27) Patents By Country, State, and Year - Utility Patents
[Online]. Available:
http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cstutl.htm (Date Last
Accessed, September, 22, 2014).
[78] (2014, August 27) Statistics of U.S. Businesses [Online]. Available:
http://www.census.gov/econ/susb/ (Date Last Accessed, September, 22,
2014).
[79] (2014, August 27) Annual Estimates of the Population for the United States,
Regions, States, and Puerto Rico [Online]. Available:
http://www.census.gov/popest/index.html (Date Last Accessed, September,
22, 2014).
[80] (2014, August 27) Census of Population and Housing 2010 [Online]. Available:
https://www.census.gov/prod/www/decennial.html (Date Last Accessed,
September, 22, 2014).
[81] F. Cairncross, The Death of Distance: How the Communications Revolution is
Changing Our Lives. Cambridge, MA: Harvard Business Review Press, 2001.
[82] J. Levandoski et al., “Lars: A location-aware recommender system,” in Proc.
28th Int. Conf. Data Eng., Washington, DC, 2012, pp. 450-461.
[83] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social
networks,” in Proc. 12th Int. Conf. Inform. and Knowledge Manage., New
Orleans, LA, 2003, pp. 556-559.
[84] A. Sarma et al., “A sketch-based distance oracle for web-scale graphs,” in
Proc. 3rd ACM Int. Conf. Web Search and Data Mining, New York, NY,
2010, pp. 401-410.
[85] D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About
a Highly Connected World. Cambridge, England: Cambridge University Press,
2010.
[86] A. Goldberg et al., “Network flow algorithms,” in Paths, Flows, and
VLSI-Design, Berlin, Heidelberg: Springer, 1990, pp. 101-164.
[87] S. Milgram, “The small world problem,” Psychology Today, vol. 2, no. 1, pp.
60-67, May 1967.
[88] S. Schnettler, “A structured overview of 50 years of small-world research,”
Social Networks, vol. 31, no. 3, pp. 165-178, July 2009.
[89] H. P. Thadakamalla et al., “Search in spatial scale-free networks,” New J.
Phys., vol. 9, no. 6, doi: 10.1088/1367-2630/9/6/190, June 2007.
[90] S. Sreenivasan, “Quantitative analysis of the evolution of novelty in cinema
through crowdsourced keywords,” Scientific Rep., vol. 3, no. 1, doi:
10.1038/srep02758, Apr. 2013.
[91] A. Hoonlor et al., “Trends in computer science research,” Commun. ACM, vol.
56, no. 10, pp. 74-83, Oct. 2013.
[92] V. Lolla et al., “Detecting MAC layer back-off timer violations in mobile ad
hoc networks,” in Proc. 26th IEEE Int. Conf. Distributed Comput. Syst.,
Lisboa, Portugal, 2006, pp. 63-63.
[93] Q. Chen et al., “Overhaul of IEEE 802.11 modeling and simulation in ns-2,”
in Proc. 10th ACM Symp. Modeling, Analysis, and Simulation Wireless and
Mobile Syst., Chania, Greece, 2007, pp. 159-168.
[94] H. Zhang et al., “Bootstrapping deny-by-default access control for mobile
ad-hoc networks,” in IEEE Military Commun. Conf., San Diego, CA, 2008,
pp. 1-7.
[95] J. Broch et al., “A performance comparison of multi-hop wireless ad hoc
network routing protocols,” in Proc. 4th Ann. ACM/IEEE Int. Conf. Mobile
Computing and Networking, Dallas, TX, 1998, pp. 85-97.
[96] P. Erdos and A. Renyi, “On random graphs,” Publ. Math. Debrecen, vol. 6,
no. 1, pp. 290-297, 1959.
[97] F. Simini et al., “A universal model for mobility and migration patterns,”
Nature, vol. 484, no. 7392, pp. 96-100, Apr. 2012.
[98] P. Boldi et al., “Ubicrawler: A scalable fully distributed web crawler,”
Software: Practice and Experience, vol. 34, no. 8, pp. 711-726, July 2004.
[99] T. Camp et al., “A survey of mobility models for ad hoc network research,”
Wireless Commun. and Mobile Computing, vol. 2, no. 5, pp. 483-502, Aug.
2002.
[100] C. Bettstetter et al., “The node distribution of the random waypoint mobility
model for wireless ad hoc networks,” IEEE Trans. Mobile Computing, vol. 2,
no. 3, pp. 257-269, July 2003.
[101] W. Navidi and T. Camp, “Stationary distributions for the random waypoint
mobility model,” IEEE Trans. Mobile Comput., vol. 3, no. 1, pp. 99-108, Jan.
2004.
[102] M. Kurant et al., “Towards unbiased BFS sampling,” Computing Res.
Repository, vol. 29, no. 9, pp. 1799-1809, Oct. 2011.
[103] C. Foh and M. Zukerman, “Performance analysis of the IEEE 802.11 MAC
protocol,” in Proc. Eur. Wireless Conf., Florence, Italy, 2002, pp. 184-190.
[104] S. Geyik et al., “PCFG based synthetic mobility trace generation,” in Proc.
IEEE Global Telecommun. Conf., Miami, FL, 2010, pp. 1-5.
[105] M. Chen et al., “On measuring the quality of a network community
structure,” in Proc. ASE/IEEE Int. Conf. Social Computing, Washington,
DC, 2013, pp. 122-127.
[106] K. Kuzmin et al., “Parallel overlapping community detection with SLPA,” in
Proc. ASE/IEEE Int. Conf. Social Computing, Washington, DC, 2013, pp.
204-212.
[107] D. Wang et al., “Human mobility, social ties, and link prediction,” in Proc.
17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San
Diego, CA, 2011, pp. 1100-1108.
[108] E. Cho et al., “Friendship and mobility: User movement in location-based
social networks,” in Proc. 17th ACM SIGKDD Int. Conf. Knowledge
Discovery and Data Mining, San Diego, CA, 2011, pp. 1082-1090.
[109] T. Razafindralambo and F. Valois, “Performance evaluation of backoff
algorithms in 802.11 ad-hoc networks,” in Proc. 3rd ACM Int. Performance
Evaluation Wireless Ad hoc, Sensor and Ubiquitous Networks, Torremolinos,
Spain, 2006, pp. 82-89.
[110] J. Yoo et al., “Random waypoint considered harmful,” in Proc. 22nd Ann.
Joint Conf. IEEE Comput. and Commun., San Francisco, CA, 2003, pp.
1312-1321.
[111] L. Katzir et al., “Estimating sizes of social networks via biased sampling,” in
Proc. 20th Int. Conf. World Wide Web, Hyderabad, India, 2011, pp. 597-606.
[112] C. Boldrini et al., “Users mobility models for opportunistic networks: The role
of physical locations,” in Proc. Wireless Rural and Emergency Commun.,
Rome, Italy, 2007, pp. 255-267.
[113] X. Hong et al., “A group mobility model for ad hoc wireless networks,” in
Proc. 2nd ACM Int. Workshop Modeling, Analysis and Simulation of Wireless
and Mobile Syst., Seattle, WA, 1999, pp. 53-60.
[114] H. Hsu, Schaum’s Outline of Probability, Random Variables, and Random
Processes. New York: McGraw-Hill, 2010.
[115] A. Langville et al., “Deeper inside pagerank,” Internet Math., vol. 1, no. 3, pp.
335-380, Jan. 2004.
[116] W. Stewart, Introduction to the Numerical Solution of Markov Chains.
Princeton, NJ: Princeton University Press, 1994.
[117] J. A. Rice, Mathematical Statistics and Data Analysis. Stamford, CT:
Cengage Learning, 2006.
[118] A. Banerjee and S. Basu, “A social query model for decentralized search,” in
Proc. 2nd ACM Workshop on Social Network Mining and Analysis, Las Vegas,
NV, 2008.
[119] A. Bozzon et al., “Answering search queries with crowdsearcher,” in Proc. 21st
Int. Conf. World Wide Web, Lyon, France, 2012, pp. 1009-1018.
[120] S. Sahay et al., “Social ranking for spoken web search,” in Proc. 20th ACM
Int. Conf. Inform. and Knowledge Manage., Glasgow, Scotland, 2011, pp.
1835-1840.
[121] A. Agarwal et al., “Learning to rank networked entities,” in Proc. 12th ACM
SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Philadelphia, PA,
2006, pp. 14-23.
[122] S. Chakrabarti et al., “Focused crawling: A new approach to topic-specific web
resource discovery,” Comput. Networks, vol. 31, no. 11, pp. 1623-1640, May
1999.
[123] A. Maiya and T. Wolf, “Expansion and search in networks,” in Proc. 19th
ACM Int. Conf. Inform. and Knowledge Manage., Toronto, Canada, 2010, pp.
239-248.
[124] J. Kleinberg and E. Tardos, Algorithm Design. London, UK: Pearson, 2006.
[125] M. Girvan and M. Newman, “Community structure in social and biological
networks,” Proc. Nat. Academy Sci., vol. 99, no. 12, pp. 7821-7826, Apr. 2002.
[126] J. Xie et al., “Overlapping community detection in networks: The state of the
art and comparative study,” ACM Comput. Surveys, vol. 45, no. 4, Aug. 2013.
[127] S. Scellato et al., “Distance matters: Geo-social metrics for online social
networks,” in Proc. 3rd Conf. Online Social Networks, Boston, MA, 2010, p. 8.
[128] M. Newman, “Communities, modules and large-scale structure in networks,”
Nature Physics, vol. 8, no. 1, pp. 25-31, Dec. 2011.
[129] L. Backstrom et al., “Find me if you can: Improving geographical prediction
with social and spatial proximity,” in Proc. 19th Int. Conf. World Wide Web,
Raleigh, NC, 2010, pp. 61-70.
[130] S. Scellato et al., “Socio-spatial properties of online location-based social
networks,” in Proc. 5th Int. AAAI Conf. Weblogs and Social Media,
Barcelona, Spain, 2011, pp. 329-336.
[131] M. Allamanis et al., “Evolution of a location-based online social network:
Analysis and models,” in Proc. 2012 ACM Conf. Internet Measurement,
Boston, MA, 2012, pp. 145-158.
[132] S. Adali et al., “Deconstructing centrality: Thinking locally and ranking
globally in networks,” in Proc. IEEE/ACM Int. Conf. Advances in Social
Network Analysis and Mining, Niagara Falls, Canada, 2013, pp. 418-425.
[133] P. Expert et al., “Uncovering space-independent communities in spatial
networks,” Proc. Nat. Academy Sci., vol. 108, no. 19, pp. 7663-7668, Aug.
2011.
[134] M. McPherson et al., “Birds of a feather: Homophily in social networks,” Ann.
Review Sociol., vol. 27, no. 1, pp. 415-444, Aug. 2001.
[135] J. Yang and J. Leskovec, “Defining and evaluating network communities based
on ground-truth,” in Proc. ACM SIGKDD Workshop Mining Data Semantics,
Beijing, China, 2012, pp. 31-38.
[136] M. Deutsch and H. Gerard, “A study of normative and informational social
influences upon individual judgment,” J. Abnormal & Social Psychology, vol.
51, no. 3, pp. 629-636, Sept. 1955.
[137] E. Bulut and B. Szymanski, “Exploiting friendship relations for efficient
routing in mobile social networks,” IEEE Trans. Parallel Distrib. Syst., vol. 23,
no. 12, pp. 2254-2265, Dec. 2012.
[138] M. Cha et al., “A measurement-driven analysis of information propagation in
the flickr social network,” in Proc. 18th Int. Conf. World Wide Web, Madrid,
Spain, 2009, pp. 721-730.
[139] A. Hannak et al., “Measuring personalization of web search,” in Proc. Int.
Conf. World Wide Web, Rio de Janeiro, Brazil, 2013, pp. 527-538.
[140] J. Leskovec and E. Horvitz, “Planetary-scale views on a large
instant-messaging network,” in Proc. 17th Int. Conf. World Wide Web,
Beijing, China, 2008, pp. 915-924.
[141] D. Watts and S. Strogatz, “Collective dynamics of ‘small-world’ networks,”
Nature, vol. 393, no. 6684, pp. 440-442, June 1998.
[142] J. Onnela et al., “Geographic constraints on social network groups,” PloS
One, vol. 6, no. 4, doi: 10.1371/journal.pone.0016939, Apr. 2011.
[143] E. Garfield, “It is a small world after all,” Essays of an Inform. Scientist, vol.
4, no. 43, pp. 299-304, Oct. 1978.
[144] S. Adali et al., “Attentive betweenness centrality (ABC): Considering options
and bandwidth when measuring criticality,” in Proc. ASE/IEEE Int. Conf.
Social Computing, Amsterdam, Netherlands, 2012, pp. 358-367.
[145] E. Daly and M. Haahr, “Social network analysis for routing in disconnected
delay-tolerant MANETs,” in Proc. 8th ACM Int. Symp. on Mobile Ad Hoc
Networking and Computing, Montreal, Canada, 2007, pp. 32-40.
[146] M. Newman, “Models of the small world,” J. Stat. Phys., vol. 101, no. 4, pp.
819-841, Nov. 2000.
[147] B. Uzzi and J. Spiro, “Collaboration and creativity: The small world
problem,” AJS, vol. 111, no. 2, pp. 447-504, Sept. 2005.
[148] N. Hodas and K. Lerman, “The simple rules of social contagion,” Sci. Rep.,
vol. 4, no. 434, doi: 10.1038/srep04343.
[149] L. Bettencourt and G. West, “A unified theory of urban living,” Nature, vol.
467, no. 7318, pp. 912-913, Oct. 2010.