
PROXIMITY, INTERACTIONS, AND COMMUNITIES IN SOCIAL NETWORKS: PROPERTIES AND

APPLICATIONS.

By

Tommy Nguyen

A Thesis Submitted to the Graduate

Faculty of Rensselaer Polytechnic Institute

in Partial Fulfillment of the

Requirements for the Degree of

DOCTOR OF PHILOSOPHY

Major Subject: COMPUTER SCIENCE

Examining Committee:

Boleslaw K. Szymanski, Thesis Adviser

Sibel Adalı, Member

James A. Hendler, Member

Gyorgy Korniss, Member

Mohammed J. Zaki, Member

Rensselaer Polytechnic Institute
Troy, New York

October 2014
(For Graduation December 2014)

© Copyright 2014

by

Tommy Nguyen

All Rights Reserved


CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT
1. INTRODUCTION
    1.1 Ranking Information in Social Networks
    1.2 Small Worlds and Social Stratification
    1.3 Summary of Contributions & Organization
        1.3.1 Organization
2. LITERATURE REVIEW
    2.1 Ranking Techniques
        2.1.1 Web Conceptualization
        2.1.2 User Data & Trust Models
        2.1.3 Learning to Rank
    2.2 Small-world Problem
        2.2.1 Six Degrees of Separation
        2.2.2 Social Stratification
3. SOCIAL NETWORK ANALYSIS
    3.1 Geography, Co-Appearance, & Interactions
        3.1.1 Data Collection
        3.1.2 Notations & Definitions
        3.1.3 Data Analysis & Results
        3.1.4 Limitations
    3.2 Incorporating Geography into Community Detection
        3.2.1 Clique Percolation Method
        3.2.2 Modularity Maximization
        3.2.3 Speaker-Label Propagation (GANXiS)
    3.3 Contrasting Communities to Null Models
        3.3.1 Techniques for Generating Covers
        3.3.2 Measuring Covers & Communities
        3.3.3 Examining Covers in Gowalla
    3.4 Examining Detected Communities
        3.4.1 Network Community Profile (NCP)
        3.4.2 Link Connectivity Measurements
        3.4.3 Face-to-Face Interactions Measurements
    3.5 Application: Social Relationships & Human Mobility
        3.5.1 Network Congestion in MANETs
        3.5.2 Mobility Generation
        3.5.3 Experimental Congestion Design
        3.5.4 Congestion Simulation Results
    3.6 Application: Long Ties & Economic Development
        3.6.1 A Stochastic Model of Economic Development
        3.6.2 Experimental Results & Discussion
    3.7 Summary of Results
4. SOCIAL RANKING TECHNIQUES
    4.1 Google Buzz & Twitter
        4.1.1 Categories of URLs
        4.1.2 Spreaders & Affected Sets
        4.1.3 Information Distances
        4.1.4 Geographical Distances
        4.1.5 Densities of Social Relationships
        4.1.6 Keyword Similarity
    4.2 Social Ranking Techniques
        4.2.1 PageRank on Social Network
        4.2.2 HITS on Social Network
        4.2.3 Ranking with Maximum Flow
        4.2.4 Variants of Maximum Flow
    4.3 Social Ranking Experiments
        4.3.1 Comparing PageRank & HITS
        4.3.2 Flow Ranking
        4.3.3 Rank Differences
        4.3.4 Rank Distributions
        4.3.5 Rank Validation
5. SOCIAL SEARCHING EXPERIMENTS
    5.1 Attrition, Geography, & Communities
        5.1.1 Modeling Attrition
        5.1.2 Geographical Analysis
        5.1.3 Detecting Communities
    5.2 Experimental Design
        5.2.1 Routing Strategies
        5.2.2 Starter & Target Selections
    5.3 Experimental Results
        5.3.1 Selection & Routing Combinations
        5.3.2 Friends-of-Friends Knowledge Densities
        5.3.3 Distributions of Successful Chains
        5.3.4 Effects of Hubs and Connectors
        5.3.5 Individual and Community Prominence
6. CONCLUSION AND FUTURE WORK
REFERENCES


LIST OF TABLES

1.1 Aspects of SNA & applications.
3.1 Data summary of Gowalla network.
3.2 Six techniques for generating covers.
3.3 Measurements for cover C of the size k.
3.4 Detected communities and their sizes.
3.5 Measuring spatial conductance.
3.6 Measuring face-to-face interactions.
3.7 Network simulator ns-2 parameters.
3.8 Measuring economic development (Gowalla).
3.9 Measuring economic development (FourSquare).
4.1 Data summary of Google Buzz.
4.2 Data summary of Twitter.
4.3 Google Buzz (left) & Twitter (right) with geography.
4.4 Social relationships densities in Google Buzz.
4.5 Social relationships densities in Twitter.
4.6 Ranking results of 30 popular URLs in Google Buzz.
4.7 Ranking results of 30 random URLs in Google Buzz.
4.8 Avg. ranking differences in Google Buzz.
4.9 Avg. ranking differences in Twitter.
5.1 Summaries of online social networks datasets.
5.2 Communities detected by GANXiS.
5.3 Prominence of individuals and communities.
5.4 Experimental results for Gowalla.
5.5 Experimental results for FourSquare.
6.1 Aspects of SNA & applications.


LIST OF FIGURES

3.1 Geographical spread of 100K checkins in Gowalla.
3.2 Friendship is bounded by geographical distance.
3.3 Densities of pairs as a function of geographical distance.
3.4 Measuring face-to-face interactions (t_ε = 30 mins, d_ε = 1 km).
3.5 Generating CTA & FTA covers.
3.6 Intra-edge count, boundary-edge count, and geographic diameter of covers.
3.7 Contraction, expansion, conductance, and geographic distance of covers.
3.8 Communities detected by Clique Percolation Method.
3.9 Communities detected by Inference Algorithm.
3.10 Communities detected by GANXiS.
3.11 Measuring face-to-face interactions among members.
3.12 Generating a Markov Model using checkins.
3.13 Design of simulation overview.
3.14 Traffic congestion in FMM and RWP.
3.15 Frequency of pauses using the RWP.
3.16 Scaling laws of short and long ties.
3.17 Face-to-face interactions of short ties and long ties.
3.18 The collective strength of long ties in a simple contagion model.
3.19 Distribution of long ties for adopters and non-adopters.
3.20 Economic development as a function of idea flow (Gowalla).
3.21 Economic development as a function of idea flow (FourSquare).
3.22 Speedy idea flow as a function of social diversity.
4.1 Conceptualization of social ranking.
4.2 Categories of popular (a,c) and random (b,d) URLs.
4.3 Shortest paths to URLs in Google Buzz (a) and Twitter (b).
4.4 Ultra small-world property from starters to information.
4.5 Densities of shortest path lengths from starters to URLs.
4.6 Two degrees of spatial concentration.
4.7 Four dimensions of social relationships.
4.8 CKS for friendship, following, peers, and random pairs.
4.9 Graph G′_p for ranking URLs {u_1, u_2} with respect to node p.
4.10 Ranking URLs on Google Buzz.
4.11 Ranking URLs on Twitter.
4.12 Social ranking with popular URLs on Google Buzz.
4.13 Social ranking with random URLs on Google Buzz.
4.14 Social ranking with popular URLs on Twitter.
4.15 Social ranking with random URLs on Twitter.
4.16 Densities of rank correlation coefficient.
4.17 Ranking quality results.
5.1 Stratification graph of communities in Gowalla.
5.2 Distributions of shortest path lengths & average path lengths.
5.3 Densities of geographical distances.
5.4 Friends-of-friends knowledge densities.
5.5 Path length of successful chains & drop rates.
5.6 Effects of routing to connectors & hubs.
5.7 Prominence of individuals & communities on reachability.
5.8 Prominence of individuals & communities correlations.


ACKNOWLEDGMENT

I would like to thank everyone who mentored me during my undergraduate and graduate

studies. This dissertation would not have been possible without their guidance.

First, I would like to thank my dissertation chair for his guidance, ideas, and intellectual

contributions in this dissertation. From seeking research problems to career planning,

he was always encouraging and supportive throughout my graduate studies. To quote

a previous graduate student, “his pleasant and friendly personality made this graduate

study more enjoyable.” I would also like to thank the committee members for providing their

feedback and helping me organize the structure of this thesis.

Second, I would like to thank the entire staff in the CS department. Ms. Coonrad

and Ms. Hayden are always responsive to my questions regarding classes, graduation

requirements, etc. even when there are hundreds of questions from other students.

Mr. Lindsay is always around and ready to help whenever a server crashes. It was

always a pleasure to interact with them throughout my graduate studies.

Last but not least, I would like to acknowledge the graduate students and postdocs in

our center and computer science department. Some of them are talented scientists and

experts in their areas of research; others are going to become experts one day. They

make me feel proud of being a member of our center and an alumnus of the university.


ABSTRACT

Social network analysis, in the form of network theory, where nodes represent humans

and edges represent social relationships between humans, has a wide range of applications

in information science, political science, social science, economics, etc. The

availability of data from location-based social media such as Gowalla and FourSquare

has helped scientists model and analyze human relationships and their interactions.

In this thesis, we use such data to analyze multiple dimensions of social relationships

in terms of three specific aspects: geographical proximity of nodes, their face-to-face

interactions, and the structure of their communities. Then we incorporate these three

aspects of social relationships into the following applications.

First, we propose techniques for analyzing human relationships in terms of geographical

proximity, face-to-face interactions, and communities. We show how geographical

proximity shapes the structure of the social network by limiting face-to-face

interactions among distant users. We also incorporate geographical locations that

users visited into a few community detection algorithms for the purpose of detecting

communities where members are on average separated by a few friendship links, are

close to each other geographically, and are likely to interact with each other face-to-face.

These aspects of social network analysis allowed the study of the first two

applications: human mobility patterns and the spread of ideas.

Second, we use URLs that people share with their followers on social media to

personalize the ranking of information by looking at who follows whom, geographical

location of the users, and the structure of their detected communities. This allows us

to analyze how social media tunnels the flow of information in the network. More importantly,

personalized ranking based on these aspects allows users to see information

through the eyes of other users whom they consider important (neighbors, friends,

peers, etc.) and provides an opportunity for them to interact with information that

was used by the people they care about. This is the third application studied in

this thesis.

Finally, we replicate the small world experiment by emulating the process of

searching for targets by routing a folder among their acquaintances. Geographical


information and community structure allow us to selectively choose starters and tar-

gets based on the knowledge of where users are located and to which community they

belong. In addition, we examine various routing strategies based on geographical

proximity and community structure that were likely used by participants in

the small-world experiment to reach a target. In doing so, we discover which combina-

tions of routing strategies and selection techniques are likely to make the small-world

experiment successful in terms of the small number of hops required to reach the

target and the percentage of such successful chains. This is the last application

studied in this thesis.


CHAPTER 1

INTRODUCTION

Social network analysis examines human relationships in terms of graph theory where

nodes represent humans and edges represent their social relationships. In addition,

social network analysis can also examine the geographical proximity of the nodes,

their face-to-face interactions, and the structure of their detected communities. This

thesis examines these three aspects of social network analysis in detail.

Within the last five years, the proliferation of smartphones has provided a new

type of social networking where people can share their current location with their

friends and tag the activities that they are doing. This new type of social networking

has provided a much richer dataset of human behavior because geographical locations

and face-to-face interactions were not previously available. More importantly, this

new type of social networking provides a bridge that connects the digital world with

the physical world where physical activities of human behavior such as proximity and

face-to-face interactions are recorded and shared instantly.

Before location-based social media, scientists used CDRs (call detail records) of

telephone companies to study spatial properties, infer friendship topology, and guess

face-to-face interactions. However, a problem with CDRs is that call volume is not a

good proxy for friendship because people can make phone calls to order food, request

technical support, seek medical help, and so on. More importantly, using calling

patterns to infer friendship is biased towards those that are more likely to be strong

ties since weak ties are by definition those that are contacted infrequently; hence using

CDRs to infer friendship leaves out an important dimension of social relationships in

the study of social network analysis.

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Social Ranking Techniques for the Web,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp. 49-55.

Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE, (under review).

Therefore, location-based social media is valuable for the study of social network

analysis because it provides a network that is embedded in physical space (the

surface of the Earth) whose nodes (humans) are constantly moving. In addition, the

links have different characteristics depending on the frequency of interactions. The

questions that immediately arise are what the ramifications of embedding this type of social graph

in physical space are, and what roles ties (weak/strong, long/short) play

in human behavior. The collection of data from Gowalla and FourSquare allows the

investigation of these issues, which are studied in detail in this thesis.

Chapter 3 addresses the issue of face-to-face interactions and finds that friend-

ship still requires both face-to-face interactions and geographical proximity. Moreover,

the desire to interact face-to-face motivates strong ties to travel together, impacting

human mobility patterns with ramifications for transportation traffic and wireless

bandwidth infrastructure management (one of the applications studied in Chapter

3). However, this does not mean that weak ties are unimportant. The last section

of Chapter 3 shows that weak ties that are geographically distant tunnel the flow of

ideas and are a strong predictor of economic development in the US in terms of GDP,

patents, and startups.

Chapter 4 returns to strong ties and examines social influence that people have

on each other in terms of interests, geographical distance, and communities. Chapter 4

explores this influence to improve relevancy of responses to queries by individualizing

them for the users based on the ranking of web pages shared on social networks.

The evidence of increased relevancy reported in this thesis may

demonstrate the level of influence that friends exert on the interests of others.

Chapter 5 expands the last section of Chapter 3 by examining how the spatial embedding

of social networks, long-distance ties, and communities underlie strategies

of social search. We use these aspects of social network analysis to examine whether social

networks are small-world, stratified, or both simultaneously. Results show that while

social networks have small topological path lengths, there is no evidence that people

with limited knowledge can find a designated target within a small number of hops

when attrition is completely eliminated.

1.1 Ranking Information in Social Networks

Over the last decade, scientists examined the structure of the web [1]-[4] and proposed

algorithms to rank web pages based on significance and relevance to a given


query [5]-[9]. A conceptualization of the web is to look at patterns in the topology

of hyperlinks connecting web pages to separate prominent websites that serve as

authorities for trusted information from malicious pages created by spammers [1].

This conceptualization of the web eliminates the complexity of textual analysis

and creates a potpourri of information that gets incorporated into search engines or

other information retrieval systems for the purpose of finding information on personal

computers, mobile devices, and other computing platforms [10]. In the case of

a search engine, billions of web pages containing a rich context of information are

organized so that end users can find their target quickly. Thus, this need for speed

makes ranking crucial in information retrieval systems. Also, ranking has many other

applications in social sciences such as the citation analysis of legal and scientific

documents [11].

Advances in social network analysis and the proliferation of online social media

have provided a different perspective for examining ranking [12]-[18]. Algorithms

used for ranking and organizing information in hybrid networks such as

social search engines show promising improvements when they incorporate social network

analysis; for example, incorporating personal information containing social

relationships on G+ for personalizing search results on Google. As the proliferation

of social media continues to expand, we want to be able to use techniques from social

network analysis to personalize the ranking of information for a given user. This is

important because social relevance allows users to see information through the eyes

of other users whom they consider important and provides an opportunity for them to

interact with the information accessed by the people about whom they care.

Social media such as Twitter and Google Buzz can be characterized as a web

service that allows users to share information with their followers. While a lot of

research has been devoted to examining text in hashtags and messages [19]-[21], we

focus on URLs because information contained in URLs is not restricted by length

limits, is less likely to be informally written, and contains less slang and fewer

abbreviations. Analyzing URLs provides a unique opportunity to infer the interests

of users based on their reading habits. We assume that the URLs people share

concentrate on selected topics of interest to them. It is important to note that our

purpose here is not to rank a set of URLs based on a given query but instead to rank a


set of URLs based on whether we think a user is likely to engage with the information

contained within the URLs. Such engagement could be clicking, commenting, re-sharing,

or spending time reading them.

The problem we want to solve is to provide a framework for ranking URLs

shared on social media based on social relationships, where some of the URLs are

ranked higher if they are shared via certain types of social relationships. The social

relationships we examine for ranking URLs include but are not limited to neighbors

(nodes that are within geographical proximity [22]) and peers (nodes that are within a

detected community [23]). The literature review on this subject is provided in Chapter

2 (Section 2.1) and the contribution is discussed in Chapter 6.

Some data-driven questions that we examine are whether pairs of users that

are geographically close are more likely to have similar interests than pairs that are

distant, and whether reciprocal relationships have higher keyword similarity in web

pages than non-reciprocal relationships. We also

examine the densities of friends, peers, neighbors, and people with similar interests,

since these social relationships are the building blocks for understanding social

relevance.

1.2 Small Worlds and Social Stratification

Data scientists have recently calculated the distribution of the shortest path

lengths between randomly selected pairs of users in online social networking sites and

confirmed that the majority of people are on average within six degrees of separation

(e.g., 4.7 in Facebook [24], 2.7 in MySpace [25], 4.2 in Twitter [26], and so on [27]).

However, empirical research in social stratification such as racial segregation and

income inequality undermines the premise that we live in a small world where there are

short paths connecting people with culturally and economically diverse backgrounds

together. In [28], Kleinfeld mentioned that Beck and Cadamagnani were unsuccessful

in replicating the small-world experiment with high success rates when they attempted

to reach a high-income target starting from a low-income person, suggesting that the

world we live in is divided by wealth caused by income inequality.

Before the availability of data from online social networking sites, Milgram and

his colleagues performed an experiment to demonstrate the small-world phenomenon


by recruiting randomly selected starters from Nebraska and Oklahoma to reach a

broker in Boston [29]. In their experiment, starters were asked to mail a folder to

an acquaintance whom they knew on a first-name basis and who would be likely to reach

the target in the fewest hops. The process repeats until the chain stops, either

when the folder reaches the target or when its current holder drops out of the

experiment for lack of qualified acquaintances or unwillingness to participate.

Hence, the expected number of hops required for a starter to

successfully reach a target is an upper bound and also a loose estimate for the length

of the shortest path connecting them. Travers and Milgram reported that 64 of the 296

chains successfully reached the designated target, with a mean of 5.2 intermediaries [29], suggesting

that the diameter of the network of social connections is small.
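The forwarding process described above can be sketched in code. This is a minimal illustration and not the procedure used in the thesis: the toy graph `TOY_GRAPH`, the function names `shortest_path_length` and `forward_folder`, and the rule of handing the folder to a random unvisited acquaintance are all assumptions made for the example. It shows why the hop count of a completed chain can never be smaller than the shortest path length between starter and target.

```python
import random
from collections import deque

# Hypothetical toy network: a ring of eight people with one extra tie (b-f).
TOY_GRAPH = {
    "a": ["b", "h"],
    "b": ["a", "c", "f"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f"],
    "f": ["e", "g", "b"],
    "g": ["f", "h"],
    "h": ["g", "a"],
}


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def forward_folder(graph, starter, target, rng, max_hops=50):
    """Pass a folder from acquaintance to acquaintance until it reaches the target.

    Assumed rule (for illustration only): a holder who knows the target hands
    the folder over directly; otherwise the holder picks a random acquaintance
    who has not held the folder yet. Returns the number of hops, or None if
    the chain stops before reaching the target.
    """
    holder, visited, hops = starter, {starter}, 0
    while holder != target:
        if target in graph[holder]:
            candidates = [target]
        else:
            candidates = [n for n in graph[holder] if n not in visited]
        if not candidates or hops >= max_hops:
            return None  # the chain stops: no qualified acquaintance is left
        holder = rng.choice(candidates)
        visited.add(holder)
        hops += 1
    return hops
```

Because every completed chain is itself a path from the starter to the target, its length is bounded below by the shortest path length, so averaging over completed chains gives an upper bound that may be loose.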

The problem we want to solve is finding out whether the network of our so-

cial connections is small, stratified, or both simultaneously. We want to investigate

this problem by replicating the process of routing a folder from selected starters

to randomly chosen targets by using data containing geographical locations and so-

cial relationships of hundreds of thousands of users from location-based social media.

The advantage of incorporating large-scale and multi-dimensional data into the small-

world experiment is that many aspects of the experiment can be controlled such as

determining how to strategically route a folder between acquaintances and having real

data on who is actually connected to whom for hundreds of thousands of users. Un-

like other social experiments requiring incentives for human subjects to participate,

we can control the effect of participation by supposing that everyone who receives a

chain letter participates in the experiment once, since long chains are not likely to

exist when the average participation rate is 37% (e.g., 0.37^5 < 0.01), as reported by

Dodds et al. [30]. These advantages from the data help us focus on how two factors of the

experiment, geographical locations and the community structure of users' connections,

make it possible for social networks to be either small-world, stratified, or both simultaneously.

These aspects of geographical proximity and community structure allow

us to strategically route a folder between acquaintances and also select starters

and targets based on geographical distance or by a fixed number of community hops

connecting them.
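The attrition argument in the paragraph above can be checked with a short calculation. This sketch assumes that each recipient independently continues the chain with the same probability, which is a simplification introduced here; the function name `chain_survival_probability` is hypothetical.

```python
def chain_survival_probability(participation_rate, hops):
    """Probability that a chain survives `hops` consecutive forwarding steps,
    assuming each recipient participates independently at the same rate."""
    return participation_rate ** hops


# With a 37% participation rate, fewer than 1 in 100 chains survive five steps.
for hops in range(1, 7):
    probability = chain_survival_probability(0.37, hops)
    print(f"{hops} hops: {probability:.4f}")
```

Under this assumption, survival drops below 1% at the fifth step, which is why long chains are rarely observed when participation is voluntary.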

We used community detection algorithms to partition a social network so that


starters and targets can be selected in the following ways. We define the network

distance from the community of the starter C_s to the community of the target C_t as the

length of the shortest path connecting nodes from C_s to C_t. The question we ask is

how many hops does it take to reach a target t originating from a starter s if the

length of the shortest path connecting their communities is fixed at k? When k ≈ 0,

we expect to capture the small-world phenomenon where it is easy to find short paths

connecting people together. On the other hand, when k >> 0, we expect that while

there might exist short paths connecting people together, it is much harder to find

them with limited information available to the participants due to the stratified nature

of society where some people have little social capital compared to others, making it

difficult for people to reach targets outside of their communities and social class.
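The community distance k defined above can be computed with a multi-source breadth-first search. The sketch below is one reading of that definition, measured over friendship links between individual nodes; the function name `community_distance` and the adjacency-dictionary representation are assumptions made for the example, and the thesis may instead measure distance on a graph whose nodes are communities.

```python
from collections import deque


def community_distance(graph, starter_community, target_community):
    """Length of the shortest path from any node in the starter's community
    to any node in the target's community (0 if the two communities overlap).

    `graph` maps each node to an iterable of its neighbors.
    Returns None when no path exists.
    """
    targets = set(target_community)
    seen = set(starter_community)
    # Start the search from every member of the starter's community at once.
    queue = deque((node, 0) for node in seen)
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

For a chain of friendships a-b-c-d-e, the communities {a, b} and {d, e} are at distance 2 (the path b-c-d), while any two communities that share a member are at distance 0.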

Beyond the debate over whether we live in a small world or a stratified one, the

techniques that were used by the participants in the experiment to select an acquaintance

have practical applications in search and rescue operations [31] and job searching

via personal contacts [32]. Dodds et al. reported that the successful techniques used

by the participants included forwarding the folder to a selected acquaintance such as

a friend (67%), relative (10%), co-worker (9%), sibling (5%), significant other (3%),

and others (6%) based on geographical proximity and occupation “for at least half

of the decisions” [30]. In addition, the results from the small-world experiment led

to an avalanche of network models that have certain properties resembling real social

networks such as the short diameter and high clustering coefficient [33].

The literature review on this subject is included in Chapter 2 (Section 2.2) and

the contribution is discussed in Chapter 6.

1.3 Summary of Contributions & Organization

First, this thesis collects terabytes of data that users shared on social media

and analyzes their relationship dynamics in terms of three specific aspects: geog-

raphy, face-to-face interactions, and communities. Such data allows us to analyze

human behavior in terms of social network analysis such as the interplay between

interactions, geographical proximity, and community structure. An example of an

interesting behavior we notice is that the creation of friendship between two people is more

likely to occur when they are geographically close, and that friends-of-friends are also more

likely than not to be within proximity of each other. Also, geography has an effect

by limiting face-to-face interactions as well as their interests in terms of what users

read on social media. For more details on data analysis of human behavior and their

social relationships, see Chapter 3.

Second, this thesis proposes techniques for incorporating social relevance into

the process of ranking URLs. Personalized ranking results using variants of net-

work flow are highly independent from PageRank. The four dimensions of social

relationships that we use for ranking URLs are friends, neighbors, peers, and users

with similar interests. Results from the experiments show that social relevance can

improve ranking quality by up to 19% compared to the baseline and 5% compared to

PageRank. For more details on the personalization of information, see Chapter 4.

Third, this thesis examines effects of social stratification in the small-world

problem. Results show that while using geographical and community information

in modeling social routing for the small-world problem is more realistic than using

either one alone, average path lengths are 3 times longer than in the Travers-Milgram

experiments when attrition is eliminated. Community distance is more effective and

robust at predicting probability of reaching targets than geographical distance in

terms of average path lengths and percentage of successful chains. Finally, results

show that prominent targets and targets in prominent communities can be reached

much quicker than on average. Our results can be summarized as follows: the small-

world property holds for the prominent but everyone else is lost in the crowd except

when being reached by members within its own community. For more details on

effects of stratification in searching for people, see Chapter 5.

1.3.1 Organization

Table 1.1: Aspects of SNA & applications.

                      Geography        Interactions       Communities
Human Mobility        Congestion       Communication      Group
Spreading Ideas       Long Ties        Weak Ties          Bridge Ties
Personalized Ranking  Geo. Influence   Peer Influ.        Collective Influ.
Small-world           Selection        Cognitive Biases   Routing

The organization of this thesis can be summarized by using Table 1.1. The

three aspects of social network analysis are geographical proximity of nodes (Chapter

3 Section 1), their face-to-face interactions (Chapter 3 Section 1), and the structure

of their communities (Chapter 3 Section 2). The four applications studied in this

thesis are human mobility & congestion modeling (Chapter 3 Section 5), spreading

ideas & economic development (Chapter 3 Section 6), personalized ranking (Chapter

4), and the small-world experiment (Chapter 5). Each element in Table 1.1 describes

how the corresponding aspect of social network analysis can be used to analyze the

corresponding application.

For the first application (human mobility), geography in terms of the geograph-

ical proximity of friends shows that human mobility traces can be used to study

wireless bandwidth infrastructure management, and as we later see, network conges-

tion is centralized in a few geographical locations impacting the throughput of the

bandwidth when studying mobile ad-hoc networks. Later in Chapter 3 Section 5,

face-to-face interactions are analogous to establishing wireless connections, since the

purpose of establishing connections in wireless networks is to communicate, and es-

tablishing connection is only possible when nodes are within geographical proximity

just like face-to-face interactions. Last but not least, this can be extended to incorpo-

rate the communities where mobility traces are simulated based on a group of nodes

belonging to the same community and moving together.

For the second application (spreading ideas), geography plays a role in dis-

tinguishing between short and long ties where the effects of long ties are examined

in simple contagion models for the purpose of measuring economic development of

large geographical areas. The analysis of face-to-face interactions shows that long

ties are especially weak. In addition to long ties, ties that connect between different

communities are also examined in Chapter 3 Section 6.

For the third application (personalized ranking), three elements are incorpo-

rated into the process of ranking URLs. Geography allows selecting users based on

geographical distance (neighbors). Reciprocal interactions in terms of social

relationships (friends instead of followers) allow us to select nodes based on their interactions.

Last but not least, community structures allow us to select nodes that belong to the

same community.

For the last application (small-world), geography allows selecting a starter and

a target in the simulations based on their geographical distance. Face-to-face inter-

actions could affect the statistics of average path lengths because the folder holder is

likely to pass the folder to the next holder based on the number of their interactions

and independent of the target. And finally, community structures allow the nodes in

the simulations to pass the folder based on community awareness.

CHAPTER 2

LITERATURE REVIEW

This chapter provides a literature review on ranking techniques and the small-world

problem.

2.1 Ranking Techniques

The literature review on ranking techniques is broken down into three parts.

The first part looks at the conceptualization of the web (Sec. 2.1.1), the second part

looks at incorporating more sources of data and modeling trust (Sec. 2.1.2), and the

third part looks at data mining techniques for learning how to rank (Sec. 2.1.3).

2.1.1 Web Conceptualization

In their early days, search engines rated information on the web by using the text

embedded in the page rather than the hypertext, which contains information invisible

to the end users. Previous work in the ranking of web pages incorporated text and

hypertext to determine the rank of a page, since hypertext by itself does not contain

information related to the query and a lot of information in the text does not mean

it is authoritative [34]. In a sense, ranking pages by counting the number of inlinks

is like voting, where the number of inlinks is the number of votes for a page, and

additional textual analysis can be applied to a query for retrieving a subset of related

pages ranked by the number of votes.

Advances came from Page and Brin when they devised an algorithm now known

as PageRank to capture not only the number of inlinks, as in voting, but

also the quality of those links [5]. The initial score of a web page is equal to 1/n′, where

n′ is the number of pages containing a link to that page. At the first iteration, each

page sends its score divided by the number of its links pointing to other pages. Then

each page replaces its current score with the sum of scores that were sent to it by the

pointing links. The process of sending and updating scores repeats until convergence

Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE, (under review).

or a pre-defined number of iterations is reached. The final scores determined by

PageRank are used to rank pages across the web graph.
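The send-and-sum update described above can be sketched in a few lines. This is a minimal illustration rather than the published algorithm: it assumes a uniform starting score instead of the 1/n′ initialization, assumes every page has at least one out-link, and omits the damping factor used in practice. The function name `simplified_pagerank` and the toy graph are hypothetical.

```python
def simplified_pagerank(out_links, max_iterations=100, tol=1e-12):
    """Iterate the send-and-sum update on a dict {page: [pages it links to]}.

    Assumes every page appears as a key and has at least one out-link.
    """
    pages = list(out_links)
    score = {p: 1.0 / len(pages) for p in pages}  # uniform start (assumption)
    for _ in range(max_iterations):
        received = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            share = score[page] / len(targets)    # score split over out-links
            for target in targets:
                received[target] += share
        change = sum(abs(received[p] - score[p]) for p in pages)
        score = received                          # replace old scores
        if change < tol:                          # stop once scores settle
            break
    return score


# Toy graph: a links to b and c, b links to c, c links back to a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(web)
```

On this toy graph the scores settle near 0.4 for a and c and 0.2 for b, since b receives only half of the score of a.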

Kleinberg proposed a ranking algorithm known as HITS (Hypertext-Induced

Topic Search) based on the idea that good hubs point to good authoritative pages

and vice-versa [35]. This query dependent algorithm first retrieves a subset of pages

that are related to a query. Then it applies an update technique to recalculate scores

of hubs and authorities, and the algorithm uses the scores of the authorities to rank

the pages. Initially, the score of an authority is the number of backlinks coming from

hubs, and the score of a hub is the sum of scores of authorities that it points to. At

the second iteration, the algorithm updates the score of an authority by taking the

sum of the scores of the hubs pointing to it. The updating scores process is then

repeated, and the algorithm stops after reaching some number of iterations.
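The hub and authority updates can be sketched as follows. This is an illustrative version under stated assumptions: the query-dependent retrieval step is skipped, the link graph is given directly as a dictionary, and scores are normalized after every round (a common practice that the description above does not spell out). The name `hits` and the example links are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict {page: [pages it points to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub score: sum of authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded (assumed step).
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Two hubs: h1 points to both authorities, h2 points only to a1.
hub_scores, auth_scores = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

In this example a1 ends with a higher authority score than a2 because both hubs point to it, and h1 ends with a higher hub score than h2.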

The Stochastic Approach for Link-Structure Analysis, abbreviated SALSA,

was proposed by Lempel and Moran, where two independent random walks are applied

to a bipartite graph consisting of hubs and authorities [2]. Instead of repeatedly cal-

culating and updating scores for hubs and authorities as is done in HITS, the number

of times a page is visited by the surfer in the random walk is used to extrapolate

the quality of the pages. The TKC (tightly knit community) effect is shown where

communities of web pages are scored relatively high even though some pages are

not authoritative or relevant to the topic when every hub points to every authority

causing a tightly knit community of hubs and authorities.

2.1.2 User Data & Trust Models

While the link analysis of the web structure is a powerful tool used to capture the

ranking of pages, a wave of algorithms and ideas came from different sources

of data where additional information about end users is taken into consideration.

For instance, how long on average do users stay on a page, and how often are two

pages consecutively visited? BrowseRank is proposed to capture the number of page

visits and the amount of time a user stays on a page modeled as a continuous time

Markov process [8]. Another technique is taken from the principle of isolation or

the disconnectivity of trustworthy pages from spam pages where trust is propagated

from trustworthy pages to other trustworthy pages [6]. EdgeRank is proposed by

researchers from Facebook to consider interactions of two people or social associates

during the process of ranking updated messages, photos, URLs, etc. on news feed

[36]. Last but not least, the annotation of web pages created by users on Delicious is

used to rank pages in SocialSimRank by considering the structure of annotators and

annotated pages [12].

A technique of using personal data to rank pages was proposed by Liu et al.

called BrowseRank where they used the browsing graph in which vertices represent

visited pages and edges between vertices represent a transition from one page to

another [8]. The novelty in BrowseRank is that it incorporates data that provides the

amount of time an average user stays on a page which is an indicator of the page’s

quality and that cannot be captured by discrete-time link analysis techniques such

as PageRank, HITS, and SALSA. Also as mentioned by the authors, the web graph

is not the most reliable source of data because of its large size and decentralized

architecture where problems can come from spammers creating link farms to increase

the visibility of their pages and web masters are constantly changing the content of

their pages. Empirical results suggest that BrowseRank outperforms PageRank when

independently hired researchers evaluated the ranked pages according to a linear

combination of relevance and importance.

The TrustRank algorithm proposed by Gyongyi et al. relies on the principle of

isolation, under the assumption that it is unlikely for trustworthy pages to link to

spam pages [6]. Seed detection is a process that determines a small set of pages

to be evaluated where these pages are likely to point to other trustworthy pages.

First, a small set of seed pages is evaluated by using an oracle function to determine

whether a page is trustworthy or not. In practice, the oracle function represents

human judgment and would be too costly to use on a large set of pages. Second, each

trustworthy page propagates its trust to the pages that it points to, and the value of the

trust gets divided equally among all pointed pages. The propagation process repeats

until convergence or some predefined number of iterations is reached.
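The propagation step can be sketched as a seed-biased variant of the score-passing update. This is a hedged illustration: the oracle is replaced by a hand-picked seed set, and the damping parameter `alpha` with its value 0.85 is an assumption, not a value taken from the text. The function name `propagate_trust` and the page names are hypothetical.

```python
def propagate_trust(out_links, trusted_seeds, alpha=0.85, iterations=20):
    """Spread trust from seed pages along out-links, split equally per link."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    # Seed pages share one unit of trust; all other pages start with none.
    seed_score = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
        for p in pages
    }
    trust = dict(seed_score)
    for _ in range(iterations):
        received = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            if targets:
                share = trust[page] / len(targets)  # equal split over links
                for target in targets:
                    received[target] += share
        # Mix propagated trust with the seed distribution (assumed damping).
        trust = {
            p: alpha * received[p] + (1.0 - alpha) * seed_score[p]
            for p in pages
        }
    return trust


graph = {
    "seed": ["good"],
    "good": ["also_good"],
    "spam": ["spam2"],
    "spam2": ["spam"],
}
trust = propagate_trust(graph, trusted_seeds={"seed"})
```

Because no trusted page links into the spam pair, both spam pages keep a trust of zero, which is the isolation principle at work.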

Additional advances came from the interests of Facebook in ranking items such

as photos, messages, URLs, etc. on each individual news feed. In EdgeRank, the

affinity score of two users, the weight of the posted item, and time decay are taken

into consideration for the ranking of items on personalized news feeds [36]. The

affinity score of the viewing user and the item creator is calculated by looking at their

online interactions; the more they have interacted, the more likely the item is shown

or ranked higher. Time decay decreases the relevance of a posted item as time goes

on, and the edge weight increases the score of items that have a high level of potential

interaction such as photo albums, messages embedded with URLs, etc. In addition

to EdgeRank, Bao et al. proposed SocialSimRank that uses social annotations on

Delicious to rank pages according to the observation that popular pages are annotated

by up-to-date users and up-to-date users annotate popular pages [12]. The novelty

of SocialSimRank comes from using the annotations of users to match search queries

to the corresponding annotated pages and applying the PageRank algorithm to the

annotated pages as means to rank pages corresponding to the view of the annotator.

2.1.3 Learning to Rank

Learning to rank is an intersection between information retrieval and machine

learning where techniques in machine learning are used to model the learning process

of ranking documents. Techniques are based on the idea of computing a function

to maximize quality measures in ranking or minimize the sum of differences between

the computed function and human-defined ratings. The advantage of using machine

learning techniques is that parameters in proposed learning models are tuned au-

tomatically. In pointwise comparison, the objective is to minimize the difference

between the calculated score of a document and the human-defined rating of it. In

pairwise comparison, the objective is to determine whether the first document in a

pair of documents is ranked higher than the second document or vice-versa. One

of the challenges in learning to rank is to go from pointwise to pairwise comparison

where the goal is to predict the ranking positions of two given documents. Another

challenge is to optimize non-continuous and non-differentiable objective functions.

Fortunately, previous work in the machine learning literature shows that techniques were

developed to handle such cases. RankNet learns how to rank pages by using a neural

network with pairwise comparison [37], SoftRank approximates the non-continuous

and non-differentiable objective function [9], and SVMRank uses support vector

machines to minimize pairwise inconsistency [38].

In RankNet, Burges et al. proposed to use a two layer neural network for learn-

ing the process of ranking pages [37]. Given a pair of pages represented as vectors,

the ranking problem that the authors proposed is to compute the probability that

the first page is ranked higher than or equal to the second page. One advantage

in the learning stage is that pairs of ranks need not be complete or even consistent, which

accommodates the missing pieces of information in the data or the noise contained in them.

First, they proposed using the cross-entropy cost function where ranking probabil-

ities are modeled by using the logistic function. Second, they proposed using the

backward propagation algorithm to optimally calculate the weights and offsets in a

two layer neural network such that the difference between the computed function and

human-defined ratings is minimized. They conducted their learning, testing, and

validation experiments by using data from a proprietary search engine consisting of

17,000 searched queries where each query contains the top 1,000 ranked pages. A page

is represented as a vector consisting of 569 features. Query-dependent features are

extracted from the anchor text, URL representations, title, and content. The remain-

ing features are taken from log files in the proprietary search engine [37]. Empirical

results suggested that RankNet outperformed the other learning models (RankProp

[39], PRank [40]) in the validation stage.
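The pairwise cost described above, a cross-entropy over probabilities modeled with the logistic function, can be sketched for a single pair of pages. The neural network itself is omitted: `score_i` and `score_j` stand for whatever outputs a model assigns to the two pages, and the function names are hypothetical.

```python
import math


def pair_probability(score_i, score_j):
    """Modeled probability that page i ranks above page j (logistic function)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def pairwise_cross_entropy(score_i, score_j, target_prob):
    """Cross-entropy between the target and the modeled pair probability.

    Algebraically equal to log(1 + exp(diff)) - target_prob * diff, written
    here in a form that avoids overflow for large score differences.
    """
    diff = score_i - score_j
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff))) - target_prob * diff


# A correctly ordered pair costs less than the same pair in reverse order.
cost_good = pairwise_cross_entropy(2.0, 0.0, target_prob=1.0)
cost_bad = pairwise_cross_entropy(0.0, 2.0, target_prob=1.0)
```

Setting `target_prob` to 0.5 expresses that two pages should tie, which is one way incomplete or uncertain pair labels can still be used in training.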

Taylor et al. proposed SoftRank where the idea is to consider ranking scores

as random variables, map score distributions to rank distributions, calculate the ex-

pected SoftNDCG (normalized discounted cumulative gain), and use gradient tech-

niques to optimize parameters in a two layer neural network with respect to Soft-

NDCG as a cost function. While it is possible to use the cost function proposed in

RankNet, there are many other metrics in information retrieval such as MAP (mean

average precision), precision, and NDCG that reflect the experience of end users. As

mentioned, using these metrics as objective functions for training is challenging since

small parameter changes might yield different scores, but ranking positions change only

when a score passes another score, making the function non-differentiable. SoftNDCG is

a proposed metric based on the approximation of NDCG by mapping scores to ran-

dom variables. Also as in RankNet, backward propagation uses gradient techniques

to optimize parameters in a two layer neural network where the cost function is the

approximated NDCG metric.

Last but not least, SVMRank is an algorithm proposed by Joachims based on the

idea of using SVM (support vector machines) to construct a function that maximizes

the empirical Kendall's tau between the targeted function determined from

click through data and the system function computed by SVM [38]. Click through

data provides constructive feedback of the ranking system where a clicked URL implies

an estimate of relevancy relative to the query. While a clicked link does not represent

absolute judgement, it provides useful insights about the ranking positions of the

unclicked items. For instance, clicking on the link that is ranked 7th implies that the 7th

link is more relevant to the query than the unclicked links starting from one to six.

This motivates the usage of pairwise comparison where the objective is to minimize

pairwise inconsistency between a computed function and the targeted function derived

from click through data.
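The way click-through data yields pairwise preferences can be sketched as follows. The rule implemented here is the one stated above: a clicked link is taken to be more relevant than every unclicked link ranked above it. The function name `preference_pairs` and the link labels are hypothetical.

```python
def preference_pairs(ranking, clicked):
    """Return (preferred, less_preferred) pairs implied by the clicks.

    `ranking` lists links from the top position down; `clicked` holds the
    links the user clicked.
    """
    clicked = set(clicked)
    pairs = []
    for position, link in enumerate(ranking):
        if link in clicked:
            # Every unclicked link shown above a clicked link loses to it.
            pairs.extend(
                (link, above) for above in ranking[:position]
                if above not in clicked
            )
    return pairs


ranking = ["l1", "l2", "l3", "l4", "l5", "l6", "l7"]
only_seventh = preference_pairs(ranking, clicked={"l7"})
several = preference_pairs(ranking, clicked={"l1", "l3", "l7"})
```

A click on the 7th link alone produces six pairs, one against each of the links ranked first through sixth. With clicks on the 1st, 3rd, and 7th links, the 3rd is preferred over the 2nd and the 7th over the 2nd, 4th, 5th, and 6th.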

2.2 Small-world Problem

This literature review on the small-world problem is broken down into two parts.

The first part provides an overview of the small-world phenomenon in terms of six

degrees of separation (Sec. 2.2.1). The second part looks at effects of inequality and

stratification that undermine the small-world property (Sec. 2.2.2).

2.2.1 Six Degrees of Separation

Milgram and his colleagues proposed an experiment to demonstrate the small-

world property by recruiting starters from Nebraska and Oklahoma to reach a broker

in Boston [29]. Starters in the experiments were asked to mail a folder to an ac-

quaintance who would be likely to reach the target quickly. Previous folder holders

were recorded into the folder roster so that they would not be selected twice in a

mail-forwarding chain. The process repeats until the chain stops either when folder

reaches the target, or the current holder drops out of the experiment for various rea-

sons. The expected number of hops it requires for a starter to successfully reach a

target is an upper bound of the shortest path length connecting them. Travers and

Milgram reported that 64% of the chains successfully reached the designated target

within 5.2 hops [29], which gave rise to the name six degrees of separation. The idea of

six degrees of separation is that if we pick any two people on this planet, there are on

average 5 unique individuals who are connected in such a way where the first person

knows the second person, who knows the third person, who eventually knows the last

person.

Besides the debate over whether we live in a small world or a stratified one,

the techniques that were used by the participants in the experiment to select an

acquaintance have practical applications in rescue and search operations [31] and

job searching via personal contacts [32]. Dodds et al. reported that such successful

techniques used by the participants included forwarding the folder to a selected

acquaintance such as a friend (67%), relative (10%), co-worker (9%), sibling (5%),

significant other (3%), and miscellaneous ties (6%) based on geographical proximity

and occupation “for at least half of the decisions” [30]. In addition, the results

from the small-world experiment led to an avalanche of network models that have

certain properties resembling real social networks such as the short diameter and

high clustering coefficient [33].

2.2.2 Social Stratification

Research in stratification such as racial segregation in neighborhoods and income

inequality undermines the premise that we live in a small world. For instance, are there

really short paths connecting random people together? What about people who are

isolated from the rest of the world? Clearly, isolated people are much harder to reach

than prominent individuals such as politicians, CEOs, religious leaders, celebrities,

etc. In [28], Kleinfeld mentioned that Beck and Cadamagnani were unsuccessful in

replicating the small-world experiment with high success rates when they attempted

to reach a high-income target starting from a low-income person. This suggests that

one cause of stratification is income inequality, where people are segregated

into economic classes. This leads to the question: what are the elements that cause

stratification? What attributes do we associate with other people? Since people have

an inclination to associate with people of the same ethnicity, cultural heritage, and

economic class, how do such tendencies affect the small-world property?

The small-world property has been accepted in the research literature because

possible routing strategies have been proposed to show how people strategically make

routing decisions. A routing strategy proposed by Kleinberg relies on participants

passing the folder to the acquaintance who is closest in terms of geography to the

target [41]. This makes sense since people have cognitive abilities to remember where

their acquaintances live. Also, it is common to have a few acquaintances who are

geographically close and a few acquaintances who are distant due to the relocation

for a new job, studying at a university, retiring, etc.
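The geographic routing strategy can be sketched as a greedy walk in which each holder passes the folder to the acquaintance closest to the target. This is an illustration under simplifying assumptions: positions are planar coordinates compared with Euclidean distance, previous holders are skipped (mirroring the folder roster), and a chain that runs out of candidates is counted as failed. The name `greedy_route` and the toy network are hypothetical.

```python
import math


def greedy_route(start, target, neighbors, position, max_hops=100):
    """Return the chain of holders from start to target, or None on failure."""
    path = [start]
    current = start
    while current != target and len(path) <= max_hops:
        # Previous holders are on the roster and cannot be chosen again.
        candidates = [v for v in neighbors[current] if v not in path]
        if not candidates:
            return None  # dead end: the chain stops
        # Pass the folder to the acquaintance geographically closest to target.
        current = min(
            candidates,
            key=lambda v: math.dist(position[v], position[target]),
        )
        path.append(current)
    return path if current == target else None


# Four people on a line, with one long-range tie between a and c.
position = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
chain = greedy_route("a", "d", neighbors, position)
```

In the toy network the long-range tie lets the folder skip b, so the chain is a, c, d.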

CHAPTER 3

SOCIAL NETWORK ANALYSIS

Typically social network analysis examines relationships among people in terms of

graph theory where nodes represent actors and edges represent their relationships.

In this chapter, we examine three important aspects of social network analysis. The

first is understanding the effect of geography in terms of the location of actors on the

structure of the social network. The second is measuring face-to-face interactions of

the actors and their social relationships. The third is detecting hidden communities

that are well-connected in terms of social relationships and highly-active in terms of

face-to-face interactions. We examine these three aspects of social network analysis

in detail using data collected from a location-based social network called Gowalla.

Besides ranking and searching, these three aspects of social network analysis can also

be used to model human mobility in mobile ad-hoc networks (see Sec. 3.5) and predict

economic development of large geographical areas (see Sec. 3.6).

In section 3.1, we examined geography, co-appearance, and interactions of users

in Gowalla focusing on the effect of geography on the structure of the network and

face-to-face interactions. In section 3.2, we incorporated geographical information

of users into three selected community detection algorithms consisting of a modified

version of Clique Percolation Method (CPM), Inference Algorithm (IA), and GANXiS

to detect disjoint and overlapping communities that are well-connected in terms of

social relationships and highly-active in terms of face-to-face interactions. In section

3.3, we designed an experiment in which we generated different types of covers by

using a combination of social and geographic information. In section 3.4, we used

quality measurements based on the link connectivity, geographical proximity, and

physical interactions among members to examine detected communities as a function

of their sizes and used covers as a baseline. We conclude this chapter in section 3.7

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Using Location-Based Social Networks to Validate Human Mobility and Relationships Models,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Istanbul, 2012, pp. 1247-1253.

This chapter previously appeared as: T. Nguyen et al., “Analyzing the Proximity and Interactions of Friends in Communities in Gowalla,” in Proc. IEEE/ACM Int. Conf. Advances on Data Mining Workshops, Dallas, TX, 2013, pp. 1036-1044.

Figure 3.1: Geographical spread of 100K checkins in Gowalla.

with a summary of the results and potential applications that might benefit from the

analysis of geography and spatially-aware community detection.

3.1 Geography, Co-Appearance, & Interactions

3.1.1 Data Collection

We collected data from a location-based social networking provider called Gowalla

that allowed people to use their internet-enabled and sensing-capable mobile phones

to record and share their current location with their friends. By using Gowalla’s

API, we were able to retrieve 391,223 users with public profiles (friends and checkins)

from mid September in 2011 to late October of that year. Unfortunately, Gowalla

has been purchased by Facebook and is no longer operating by itself. The data for

FourSquare, Twitter, and Google Buzz were collected in a similar manner by using

breadth-first search.

To collect the data, we start with a user randomly chosen and process all the

public information available about that user. Then we store all id’s of the user’s

friends and put them into a processing queue in a FIFO order. After that, we retrieve

the next user from the queue and repeat the process. Therefore, we crawled Gowalla

breadth-first, a standard technique in the social networking literature often referred

to as Breadth First Search (BFS) sampling.
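The crawl procedure can be sketched as a standard breadth-first traversal. The callable `get_friends` is a hypothetical stand-in for a request to the provider's API and does not reflect Gowalla's actual interface. The cap `max_users` is likewise an assumption added so the sketch terminates on large networks.

```python
from collections import deque


def bfs_crawl(seed, get_friends, max_users):
    """Process users breadth-first from `seed`; return them in crawl order."""
    queue = deque([seed])   # FIFO processing queue
    enqueued = {seed}       # users already placed in the queue
    processed = []
    while queue and len(processed) < max_users:
        user = queue.popleft()
        processed.append(user)           # stands for storing public data
        for friend in get_friends(user):
            if friend not in enqueued:   # never queue a user twice
                enqueued.add(friend)
                queue.append(friend)
    return processed


# Toy friendship lists standing in for API responses.
friends = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
order = bfs_crawl("a", lambda user: friends[user], max_users=10)
```

Users are visited in order of their hop distance from the seed, so a crawl that is stopped early over-represents the neighborhood of the starting user.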

As shown in Table 3.1, the users accumulated a total of around 26 million

checkins and 8 million friendship links. The average day of the checkins is 3.14 which

Table 3.1: Data summary of Gowalla network.

            x̄        σX       ∑
Users       −        −        391,223
Checkins    164.64   636.68   26,303,580
Friends     11.13    67.03    2,176,384
Weekday     3.14     2.01     Jan. 21, 2009
Distance    128.72   356.51   20,565,644
Time        6.41     13.29    -

represents Wednesday. The earliest checkin is on Jan 21, 2009. The average distance

between two consecutive checkins of a user is 128.72 km. The average time interval

between two consecutive checkins of a user is 6.41 days with a standard deviation of

13.29. The geographical spread of the checkins is shown in Fig. 3.1. The checkins

from Gowalla allow us to measure the face-to-face interactions between friends by

inferring how often friends checked into the same location at approximately the

same time.

3.1.2 Notations & Definitions

Given a set of users U , let u ∈ U be a particular user, Lu be a set of its shared

locations known as checkins, and Fu be a set of its friends. A shared location l ∈ Lu of the user u is a tuple of three elements denoted as l1, l2, and l3 corresponding to

the latitude, longitude, and timestamp of the location l, respectively. The friendship

network denoted as F = (U,EU) is an undirected and non-weighted graph where an

edge represents reciprocal friendship; that is, e = (u, u′) ∈ EU means u′ ∈ Fu and

u ∈ Fu′ . The geographic distance d(u, u′) between two users u and u′ is estimated by

averaging the locations in Lu and Lu′ and using the haversine formula to calculate

arc distances. The checkin similarity CS(u, u′) of users u and u′ is defined as:

\[ CS(u, u') = \frac{|L_u \cap L_{u'}|}{|L_u \cup L_{u'}|} \tag{3.1} \]
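Both quantities defined above can be sketched in code. Several details are assumptions made for illustration: the Earth radius constant, the use of a plain arithmetic mean of latitudes and longitudes (which breaks down near the antimeridian), and the representation of locations as place identifiers when computing Eq. 3.1, since raw coordinates with timestamps would rarely coincide exactly. All function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (arc) distance in km between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_location(checkins):
    """Average latitude and longitude of (lat, lon, timestamp) tuples."""
    n = len(checkins)
    return (sum(c[0] for c in checkins) / n, sum(c[1] for c in checkins) / n)


def user_distance_km(checkins_u, checkins_v):
    """d(u, u'): haversine distance between the users' averaged locations."""
    lat_u, lon_u = mean_location(checkins_u)
    lat_v, lon_v = mean_location(checkins_v)
    return haversine_km(lat_u, lon_u, lat_v, lon_v)


def checkin_similarity(places_u, places_v):
    """Eq. 3.1 as a Jaccard index over sets of place identifiers."""
    union = places_u | places_v
    return len(places_u & places_v) / len(union) if union else 0.0


half_way_round = haversine_km(0.0, 0.0, 0.0, 180.0)
similarity = checkin_similarity({"cafe", "gym"}, {"gym", "park"})
```

Two users who share one of three distinct places have a checkin similarity of 1/3.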

The level of physical interaction between user u and u′ denoted as I(u, u′) is

calculated from their shared locations as follows. Two locations l ∈ Lu and l′ ∈ Lu′

are equivalent if they are within geographic proximity d(l, l′) < dε and occurred within

a time interval |l3 − l′3| < tε. Having two such equivalent locations l and l′ means we

infer that u and u′ went to the same place together.

Figure 3.2: Friendship is bounded by geographical distance. (Axes: Checkin Similarity, 0 to 1, versus Distance Similarity, log(km), 0 to 8; legend: Not Friends, Friends.)

The maximum pairwise equivalence between Lu and Lu′ is defined as the longest

sequence of equivalent location pairs ((l1, l′1), . . . , (lk, l′k)), such that for each 1 ≤ i ≤ k,

li ∈ Lu, l′i ∈ Lu′ and li is equivalent to l′i. The level of physical interaction I(u, u′) is

defined as the length k of the maximum pairwise equivalence divided by the size of

the smallest location set:

\[ I(u, u') = \frac{k}{\min(|L_u|, |L_{u'}|)} \tag{3.2} \]

Finding the maximum pairwise equivalence can be reduced to a network flow

problem where polynomial running time algorithms such as Ford-Fulkerson can be

used to calculate the maximum number of matches.
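One way to realize this reduction is bipartite matching with augmenting paths, which corresponds to Ford-Fulkerson on a unit-capacity flow network. The sketch below assumes each location may appear in at most one pair, and it takes the equivalence test as a caller-supplied predicate so that the thresholds dε and tε stay outside the matching code. The function names and the sample checkins are hypothetical, and the recursion is intended for small inputs only.

```python
def max_pairwise_equivalence(locs_u, locs_v, equivalent):
    """Size k of a maximum matching between equivalent locations."""
    match_v = [None] * len(locs_v)  # match_v[j]: index in locs_u paired with j

    def try_augment(i, visited):
        # Look for a free partner for locs_u[i], re-pairing others if needed.
        for j, candidate in enumerate(locs_v):
            if j not in visited and equivalent(locs_u[i], candidate):
                visited.add(j)
                if match_v[j] is None or try_augment(match_v[j], visited):
                    match_v[j] = i
                    return True
        return False

    return sum(try_augment(i, set()) for i in range(len(locs_u)))


def interaction_level(locs_u, locs_v, equivalent):
    """I(u, u') from Eq. 3.2: k over the size of the smaller location set."""
    if not locs_u or not locs_v:
        return 0.0
    k = max_pairwise_equivalence(locs_u, locs_v, equivalent)
    return k / min(len(locs_u), len(locs_v))


def same_place_within_30(l, m):
    """Toy predicate: identical coordinates and timestamps under 30 apart."""
    return l[:2] == m[:2] and abs(l[2] - m[2]) < 30


u_checkins = [(42.7, -73.7, 0), (42.7, -73.7, 5)]
v_checkins = [(42.7, -73.7, 2), (42.7, -73.7, 50), (42.7, -73.7, 200)]
level = interaction_level(u_checkins, v_checkins, same_place_within_30)
```

Here both checkins of u are equivalent only to the first checkin of v, so k = 1 and the interaction level is 1/2.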

3.1.3 Data Analysis & Results

In Fig. 3.2, there are 701 blue points that represent two randomly selected users

who are friends and 620 red points that represent two randomly selected users who

are not friends within the dataset. The shaded region is drawn by using the k-nearest

neighbor algorithm for classifying whether two users are friends given their average

distance apart and checkin similarity.

In Fig. 3.2, we notice that co-appearance represented by checkin similarity is

a poor indicator of friendship; that is, people who are temporarily within the same

place and time are not likely to be friends. Intuitively, co-appearance happens often

at popular spots, like concerts and cafes that attract people living at a great variety of

locations. Even if a group of a few friends goes to a concert together, they would

not be friends with thousands of other attendees; hence, the chance that a random

pair of attendees are friends is low. Occasional co-appearances are not sufficient, but

geo-proximity helps in establishing and maintaining friendship, as seen in Fig. 3.2.

Figure 3.3: Densities of pairs as a function of geographical distance. (Panels: (a) Hop=1-3, (b) Hop=4-6; x-axis: Avg. Distance of Separation (km), 0 to 4000; y-axis: Fraction.)

In Fig. 3.3, we plotted the density of friends (hop=1), friends-of-friends (hop=2),

and pairs of users up to six degrees of separation as a function of the average geo-

graphic distance between two users in km. For each level 1 ≤ k ≤ 6 of indirection

(measured in the number of hops), we randomly selected 5,000 non-cyclic paths of

length k and created from the ends of these paths 5,000 pairs from the Gowalla

dataset, each pair with k indirection of friendship. We analyzed pairs that were

within 4,000 km distance from each other.

In Fig. 3.3(a), the density of direct friends (4,317 total) reaches the highest

value of 0.35 (in other words, 1511 pairs) at the lowest geographic separation in the

range from 0 to 160 km (each point at distance x represents users with distances

from x−160 km to x+160 km) and continues to decrease as the distance between them

increases. At the second level of indirection, the density of friends-of-friends (3,464

total) achieves the highest value 0.19 in the range from 0 to 160 km and continues to

decrease as the geographic distance between them increases.
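As a rough illustration of the binning behind Fig. 3.3, the following sketch computes the fraction of pairs that fall into each distance bin. It is a minimal example under stated assumptions, not the code used for the figure: the helper names `haversine_km` and `distance_density`, the Earth radius of 6,371 km, and the use of non-overlapping 160 km bins are all choices made here for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def distance_density(distances_km, bin_width=160.0, max_km=4000.0):
    """Fraction of pairs in each bin [i*w, (i+1)*w); pairs beyond max_km are dropped."""
    kept = [d for d in distances_km if 0 <= d <= max_km]
    n_bins = math.ceil(max_km / bin_width)
    counts = [0] * n_bins
    for d in kept:
        counts[min(int(d // bin_width), n_bins - 1)] += 1
    total = len(kept)
    return [c / total for c in counts] if total else [0.0] * n_bins
```

With this normalization, a value of 0.35 in the first bin for 4,317 retained friend pairs corresponds to about 1,511 pairs, which is how the counts quoted in the text relate to the plotted fractions.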

Geographic proximity has an effect: friends (hop=1) and friends-of-friends (hop=2) are more likely, but not required, to be within proximity of each other. For instance, 61% of friends are within 480 km and 47% of friends-of-friends are within 640 km of each other. Another way of looking at the results is that people who are separated by three or more hops are unlikely to be within geographic proximity of each other.

[Figure 3.4 plot: two panels, Hop=1 and Hop=2; x-axis: Avg. Distance of Separation (km), 0 to 4,000; y-axis: Level of Interaction.]

Figure 3.4: Measuring face-to-face interactions (tε = 30 mins, dε = 1 km).

In Fig. 3.3(b), we plotted pairs of users who are separated by four, five, and

six hops. We noticed that they are not likely to be within geographic proximity of

each other. The density of those pairs reaches the highest value 0.07 at the 160 km

range centered at 1,200 km and continues to decrease regardless of their degrees of

separation.

In Fig. 3.4, we plotted the average level of face-to-face interactions I(u, u′)

of friends (hop=1) and friends-of-friends (hop=2) as a function of their geographic

distance in km. The larger the geographic distance between friends, the less likely they

physically interact by going to the same places together. The level of interaction peaks (0.027)
at the lowest geographic separation, from 0 to 266 km, and then gradually
decreases (with some small fluctuations) as the distance between them increases. For

friends-of-friends, the physical interactions reflect the probability that they happened

to be together.


3.1.4 Limitations

We note that it is possible that the locations of some users are irrelevant

to their distant friends. This may be a source of potential bias where the geographic

proximity of friends may be enlarged by a friendship selection process in Gowalla

in which users subjectively add friends who are within their geographic proximity.

However, we noticed that 38% of friends are geographically separated by more than

520 km. Also, the Gowalla data and other social media indicate that distant friends

are selected, perhaps for the purpose of keeping in contact [42].

In addition, Mislove et al. mentioned that the population of users who tweet

on Twitter is unbalanced [43]. Similarly, we believe that the users who check in on
Gowalla are not a representative sample of the entire population, as suggested by
the concentration of checkins in Fig. 3.1.

3.2 Incorporating Geography into Community Detection

A common approach in community detection is to divide a network into multiple

partitions by maximizing the number of edges within each partition and minimizing

the number of edges between them. A commonly used quality measure for the partitions is modularity, which compares the difference between the fraction of edges inside a partition and the fraction of edges crossing it with the difference expected if edges in the network were randomly distributed [44]. Greedy approaches like hierarchical clustering [45] and spectral approaches such as minimum cuts [46] divide a network into disjoint partitions by combining or separating clusters of nodes so that modularity is maximized at every step. As studied in [47], [48], a problem with this modularity maximization approach is that it tends to merge two separate communities, which increases the value of modularity but creates a merged community that does not reflect the ground truth.

Another approach to community detection is to divide a network into multiple

partitions so that the majority of members within each partition share a common

attribute [49]. A proposed attribute is based on friendship similarity defined as the

density of common friends between pairs of nodes [49]. A problem with this proposed

attribute is that it allows for a community consisting of people who have a lot of friends

in common but are not friends of each other. However, this imperfect definition works


well in practice because people who have a lot of friends in common are likely to be

friends themselves. Since community detection is an active area of research, our

goal is not to provide another technique that detects communities (many have been

proposed) but to incorporate the spatial information of nodes into existing algorithms

for analyzing Gowalla and propose a null model (generating covers) to benchmark the

detected communities.

We combine these two approaches in community detection by incorporating

the location information of users and geographic distances between them into three

selected algorithms taken from the rich literature. First, we want to minimize the

number of edges between communities and maximize the number of edges within

them. Second, we want members inside a community to be within spatial proximity

by giving geographically correlated friends more weight than distant friends during

the detection process. This combined approach applies a natural interpretation of

a friendship community where members are well connected and also likely to be

geographically close. Also, geographically correlated nodes are more likely to interact

with each other face-to-face as seen previously.

We selected three community detection algorithms based on their popularity

(CPM), promising experimental results (IA), and ability to scale to millions of nodes

and edges (GANXiS) for the purpose of capturing and measuring the interactions of

users inside a community. In the following subsections, we summarize the selected

algorithms and describe how we incorporated geographic information of users into

the process of detecting friendship communities in Gowalla, since the level of interactions
is correlated with distance, as seen previously.

3.2.1 Clique Percolation Method

The CPM algorithm was proposed to detect overlapping communities by combining cliques, or fully connected subgraphs [50]. Given an undirected graph $F = (U, E_U)$, let $H_m$ denote the set of all cliques in $F$ of size $m$. The clique-graph $G = (H_m, E)$ consists of the cliques in $H_m$ represented as nodes, with an edge between a pair of cliques if they have $m-1$ overlapping members. Each connected component of the graph $G$ is a community consisting of many fully connected subgraphs of $F$.
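The construction above can be sketched in a few lines. This is a brute-force illustration for small graphs only, not an efficient CPM implementation: the function name `clique_percolation` and the union-find bookkeeping are assumptions made here, and all m-cliques are enumerated explicitly.

```python
from itertools import combinations

def clique_percolation(nodes, edges, m=3):
    """Communities as connected components of the m-clique graph."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # H_m: every set of m nodes that are pairwise adjacent.
    cliques = [frozenset(c) for c in combinations(sorted(adj), m)
               if all(b in adj[a] for a, b in combinations(c, 2))]

    # Union-find over cliques; two cliques are linked if they share m-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == m - 1:
            parent[find(i)] = find(j)

    # Each connected component of the clique graph yields one community.
    communities = {}
    for i, c in enumerate(cliques):
        communities.setdefault(find(i), set()).update(c)
    return list(communities.values())
```

Because a node can belong to cliques in different components, the resulting communities may overlap, which is the behavior the method is known for.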

A problem of the CPM algorithm is its lack of scalability, because the number of cliques explodes as $m$ increases for large networks. Moreover, the problem of finding the clique with the largest size in a given graph is NP-hard [51], which prevents the algorithm from using cliques of near-largest size.

We modified CPM to incorporate geographic information of nodes and made the

algorithm scalable as follows. Instead of finding cliques of large sizes, we find triangles

(m = 3) since they can be efficiently identified in parallel using map-reduce. To

limit the number of triangles, we select a subset of disjoint triangles from all possible

triangles by using geographic distances between pairs of nodes as follows. The average
geographic distance of a triangle $t$ is defined as $\frac{1}{3}\sum_{u \neq u' \in t} d(u, u')$, where the sum runs over the three pairs of nodes in $t$. We
take one triangle at a time from the list of triangles sorted by this distance until all possible disjoint
triangles have been taken. If a user is not part of any disjoint triangle, we assign it

to a triangle that maximizes the number of edges between this user and the triangle

and use geographic distances to break ties by assigning a user to the geographically

closest triangle.
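The selection of disjoint triangles can be sketched as a greedy pass over the sorted list. The sketch below makes two assumptions that the text does not state explicitly: triangles are sorted in ascending order of average geographic distance, and "disjoint" means that no two selected triangles share a node. The function name and the `dist` callback are hypothetical, and the step that assigns leftover users to triangles is omitted.

```python
from itertools import combinations

def select_disjoint_triangles(adj, dist):
    """Greedily pick node-disjoint triangles, geographically tightest first.

    adj  : dict node -> set of neighbours (undirected friendship graph)
    dist : function (u, v) -> geographic distance in km
    """
    triangles = [t for t in combinations(sorted(adj), 3)
                 if all(b in adj[a] for a, b in combinations(t, 2))]

    def avg_distance(t):
        # (1/3) * sum of the three pairwise distances in the triangle.
        return sum(dist(a, b) for a, b in combinations(t, 2)) / 3.0

    used, chosen = set(), []
    for t in sorted(triangles, key=avg_distance):
        if used.isdisjoint(t):       # keep only triangles sharing no node
            chosen.append(set(t))
            used.update(t)
    return chosen
```

The brute-force enumeration is only meant for small examples; the thesis identifies triangles in parallel with map-reduce.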

The clique-graph $G'$ is defined as $G' = (T, E_T)$, where $T$ is the set of modified
triangles and $E_T$ is the set of edges between triangles that are assigned as follows. For

each triangle, we create a single clique edge from this triangle to the one that maxi-

mizes the number of friendship edges between them, and use geographic distances to

break ties if necessary. Like in the original CPM algorithm, each connected compo-

nent of G′ is a community consisting of geographically correlated and well connected

subgraphs of F .

3.2.2 Modularity Maximization

Modularity maximization is a popular technique for finding communities, proposed in [44], [45]. Given a graph $F = (U, E_U)$ and a set $P$ containing disjoint partitions (subsets) of $U$, the modularity $Q$ of the partitions in $P$ is defined as:

$$Q = \sum_{p_i \in P} \left( e_{ii} - a_i^2 \right) \qquad (3.3)$$

where $e_{ij}$ is the fraction of edges between nodes in the partitions $p_i$ and $p_j$, and $a_i = \sum_j e_{ij}$ is the fraction of edges that connect to nodes in the partition $p_i$ [44]. A positive value of $Q$ indicates that the density of edges inside the partitions, relative to the edges leaving them, is higher than expected under a null model.
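A direct computation of Eq. 3.3 can be sketched as follows. The sketch assumes an undirected, unweighted edge list in which each edge appears once, and it uses the common convention that $a_i$ is the fraction of edge ends attached to nodes of partition $p_i$; the function name and argument layout are illustrative only.

```python
def modularity(edges, partition):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list.

    edges     : iterable of (u, v) pairs, each undirected edge listed once
    partition : dict node -> community id
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    inside = {}   # community -> number of edges with both ends inside it
    ends = {}     # community -> number of edge ends attached to it
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)
```

For two triangles joined by a single edge and split into the two triangles, each partition has $e_{ii} = 3/7$ and $a_i = 1/2$, so $Q = 2(3/7 - 1/4) \approx 0.357$; placing all nodes in one partition gives $Q = 0$.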

To maximize modularity, a greedy approach based on hierarchical clustering was

proposed in [45], [52]. Initially, every node in U belongs to its own community. Then

the pair of communities with the highest increase in modularity is merged together.

The process of merging repeats $n - 1$ times, where $n = |U|$. The clustering with the
highest value of modularity over all iterations is taken as the set of communities.
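The greedy procedure can be illustrated with a deliberately naive sketch that recomputes modularity for every candidate merge. It is far slower than the optimized algorithms in [45], [52] and is meant only to make the merge-and-keep-best logic concrete; the function name and the tie-breaking rule (first best candidate) are assumptions.

```python
def greedy_modularity(nodes, edges):
    """Naive agglomerative modularity maximization; returns (partition, Q)."""
    edges = list(edges)
    m = len(edges)
    part = {u: u for u in nodes}          # every node starts in its own community
    if m == 0:
        return part, 0.0

    def q(p):
        inside, ends = {}, {}
        for u, v in edges:
            cu, cv = p[u], p[v]
            ends[cu] = ends.get(cu, 0) + 1
            ends[cv] = ends.get(cv, 0) + 1
            if cu == cv:
                inside[cu] = inside.get(cu, 0) + 1
        return sum(inside.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)

    best_q, best_part = q(part), dict(part)
    for _ in range(len(part) - 1):        # n - 1 merges in total
        labels = sorted(set(part.values()))
        candidates = []
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                merged = {u: (a if c == b else c) for u, c in part.items()}
                candidates.append((q(merged), merged))
        step_q, part = max(candidates, key=lambda t: t[0])
        if step_q > best_q:               # remember the best clustering seen so far
            best_q, best_part = step_q, dict(part)
    return best_part, best_q
```

On two triangles joined by one edge, the merges first assemble each triangle, and the final merge of the two triangles lowers modularity, so the clustering returned is the pair of triangles.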

For weighted networks, Newman proposed a simple technique to map weights of integer values to multigraphs [53]. For every edge of weight $w_{ij}$, $w_{ij} - 1$ additional unweighted edges are added between nodes $i$ and $j$, and the weight $w_{ij}$ is set to 1. The definition of modularity remains the same, since the fraction of edges $e_{ij}$ between partitions $p_i$ and $p_j$ can simply incorporate multiple edges between nodes.

We incorporated geographic information about users into the Inference Algorithm by assigning weights to edges based on spontaneousness and typical means of travel: walking up to 1.6 km, biking or using public transportation up to 25 km, a short car or train ride up to 100 km, a long car or train ride up to 500 km, and a plane flight above 500 km. Friends who are within walking distance (1.6 km) get the highest weight of $2^4$. Friends who are within biking distance (25 km) get the second highest weight of $2^3$. Friends who are within driving distance get a weight of $2^2$, and so on.
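The weighting scheme can be written as a small lookup. The sketch reads the weights as powers of two ($2^4$ for walking down to $2^0$ for a plane flight), which is an assumption: the text only gives the first three weights followed by "and so on". The function name `travel_weight` is hypothetical.

```python
def travel_weight(distance_km):
    """Edge weight from geographic distance, by assumed travel-mode band."""
    bands = [
        (1.6, 2 ** 4),    # walking
        (25.0, 2 ** 3),   # biking or public transportation
        (100.0, 2 ** 2),  # short car or train ride
        (500.0, 2 ** 1),  # long car or train ride
    ]
    for limit_km, weight in bands:
        if distance_km <= limit_km:
            return weight
    return 2 ** 0         # plane flight (above 500 km)
```

Because the weights are integers, they fit the multigraph mapping for weighted networks described in this section.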

3.2.3 Speaker-Label Propagation (GANXiS)

GANXiS was proposed in [54] based on a probabilistic propagation process that spreads labels between speakers and listeners. Given a graph $F = (U, E_U)$, each node $u_i \in U$ initially carries a unique label $i$ in its pocket $p_i = \{i\}$. When a node $u$ is randomly selected to speak, it requests all members of its neighborhood (the nodes adjacent to $u$) to randomly send a label from their pockets to $u$. The probability of a label being chosen by $u'$ from its pocket $p_{u'}$ is proportional to the number of times the label was added; the more times a label was added, the more likely it is to be chosen. The probability of a speaker $u_i$ choosing a label from a listener $u_j$ is based on the weight $w_{ij}/w_i$, where $w_i$ is the sum of the weights of all edges coming out of $u_i$. For unweighted networks, $w_{ij} = 1$.

The algorithm repeats until the maximum number of iterations is completed, where in each iteration everyone gets to speak exactly once in a random order. At the end, labels whose probability of being chosen to send to a speaker is less than a threshold $r$ are deleted. Finally, the labels that a node carries determine the communities to which it belongs. For instance, nodes that carry a label $i$ belong to the community $c_i$. Time to live (TTL) has recently been proposed to limit the number of labels that nodes propagate. TTL defines the number of times a label can be sent (so it reaches a limited number of nodes within TTL hop distance).
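The propagation process described in this section can be sketched as follows. This is a simplified reading of the description, not the GANXiS implementation: the function name, the default parameter values, and the choice to accept exactly one offered label per selection (drawn with probability proportional to the edge weight) are assumptions, and TTL is not modeled.

```python
import random
from collections import Counter

def speaker_listener_labels(adj, iterations=20, r=0.1, seed=0):
    """Label propagation in the spirit of the description above.

    adj : dict node -> dict neighbour -> edge weight (use 1 if unweighted)
    Returns dict node -> set of labels kept after thresholding at r.
    """
    rng = random.Random(seed)
    pockets = {u: Counter({u: 1}) for u in adj}   # each node starts with its own label
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)                        # every node is selected once per iteration
        for u in nodes:
            neighbours = list(adj[u])
            if not neighbours:
                continue
            # Each neighbour offers one label, drawn in proportion to how
            # often that label was added to its pocket.
            offered = [rng.choices(list(pockets[v]),
                                   weights=list(pockets[v].values()))[0]
                       for v in neighbours]
            # One offered label is accepted with probability w_ij / w_i.
            weights = [adj[u][v] for v in neighbours]
            pockets[u][rng.choices(offered, weights=weights)[0]] += 1
    result = {}
    for u, pocket in pockets.items():
        total = sum(pocket.values())
        # Drop labels whose relative frequency is below the threshold r.
        result[u] = {label for label, count in pocket.items() if count / total >= r}
    return result
```

A node that keeps several labels belongs to several communities; raising `r` toward 0.5 pushes the result toward disjoint communities.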

The advantage of GANXiS is that it scales linearly with the number of edges,

but the disadvantage is that the relationship between convergence and the number of

iterations is yet unknown. GANXiS is capable of discovering overlapping communi-

ties, but we selected its running parameters in such a way that the results included

only disjoint communities to make them compatible with the results of other algorithms. We incorporated geographic information of users into GANXiS by assigning weights based on spontaneousness and typical means of travel, as in weighted IA. Friends who are within walking distance (1.6 km) get the highest weight of $2^4$. Friends who are within biking distance (25 km) get the second highest weight of $2^3$. Friends who are within driving distance get a weight of $2^2$, and so on. This is an extension of the interpretation of the speaker-listener propagation algorithm, where a listener is more likely to be able to hear a speaker if they are within spatial proximity.

3.3 Contrasting Communities to Null Models

We proposed to integrate spatial and friendship information of nodes into a

process of generating covers. The purpose of the covers is to serve as a baseline

for analyzing the performance of various community detection algorithms under a

quality measurement. In section 3.3.1, we described how we generated six covers by

using a combination of spatial and friendship information in traversing the network.

In section 3.3.2, we selected a few quality measurements for examining covers and

detected communities. In section 3.3.3, we examined the covers using the selected

quality measurements.

Table 3.2: Six techniques for generating covers.

Algorithm               Abbreviation   Spatial Info.?   Social Info.?
Completely Random       CR             no               no
Random Walk             RW             no               yes
Closest Friend First    CFF            yes              yes
Farthest Friend First   FFF            yes              yes
Closest to All          CTA            yes              yes
Farthest to All         FTA            yes              yes

3.3.1 Techniques for Generating Covers

Given a graph F = (U,EU), a cover C ⊂ U of size k is a subgraph of F with k

nodes selected in a specific way. A completely random cover CR is one where each

user u ∈ U has the same probability of being added during the selection. In a random

walk cover RW , we first randomly add a seed into the cover, then randomly select a

friend of the most recently added user, and continue selecting friends until the cover

reaches the size k. The closest-friend-first cover CFF is similar to RW but instead of

adding a random friend, we add the spatially closest friend not in the cover of the last

added user. If all of that user’s friends have already been added into the cover, we go

back one step to the previously last added user and branch out from there. We call

this the roll-back mechanism. The farthest-friend-first cover FFF is similar to CFF

except that we take the spatially farthest friend instead of taking the closest one.

The closest-to-all cover CTA is similar to CFF but instead of adding the spatially

closest friend to the last added user, we add the spatially closest friend with respect

to all members already in the cover. Finally, the farthest-to-all cover FTA is one

where we take the spatially farthest friend with respect to all members already in the

cover. Cover generation algorithms such as CTA and FTA are described in Fig. 3.5

without the roll back mechanism for simplicity. We listed the covers and their details

in Table 3.2.
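The closest-friend-first and farthest-friend-first covers, including the roll-back mechanism, can be sketched with an explicit stack. The stack-based interpretation of "going back one step" is an assumption, as are the function name, the `dist` callback, and the `farthest` flag that switches between CFF and FFF.

```python
import random

def closest_friend_first(adj, dist, k, seed_node=None, farthest=False, rng=None):
    """Grow a cover of up to k nodes from the friends of the last added user.

    adj  : dict node -> set of friends
    dist : function (u, v) -> geographic distance in km
    """
    rng = rng or random.Random(0)
    current = seed_node if seed_node is not None else rng.choice(sorted(adj))
    cover, in_cover = [current], {current}
    stack = [current]                      # path used by the roll-back mechanism
    while len(cover) < k and stack:
        current = stack[-1]
        candidates = [v for v in adj[current] if v not in in_cover]
        if not candidates:
            stack.pop()                    # roll back to the previously added user
            continue
        choose = max if farthest else min
        pick = choose(candidates, key=lambda v: dist(current, v))
        cover.append(pick)
        in_cover.add(pick)
        stack.append(pick)
    return cover
```

If the connected component of the seed has fewer than k users, the stack empties and the sketch returns a smaller cover rather than looping forever.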

3.3.2 Measuring Covers & Communities

We use three types of quality measurements based on the link connectivity and

location of members to measure covers and communities.

The first type of measurements is based on the intra-edge count IEC, defined as the number of edges with both ends inside the cover. The contraction CONT of a cover is computed by dividing the intra-edge count by the size of the cover. The intra-density IND of a cover is calculated by dividing the intra-edge count by the intra-edge count of a completely connected cover of the same size. For these three measures (IEC, CONT, IND), the higher the value, the better formed is the community.

procedure CoverGeneration(k)
    F = (U, E_U)
    seed = rand(1, |U|), cover = [seed]
    while len(cover) < k do
        distances = [ ], m = len(cover)
        for u in F_seed do    // F_seed: the friends of seed
            // Compute the average haversine distance from u to the members of the cover.
            d_u = (1/m) Σ_{i=1..m} d(cover[i], u)
            distances.append((u, d_u))
        end for
        // Sort by d_u from least to greatest (CTA) or vice versa (FTA).
        distances = sort(distances, key = x: x[1])
        for u, d_u in distances do
            if u ∉ cover then
                cover.append(u)
                seed = u
                break    // add only one friend per iteration
            end if
        end for
    end while
    return cover
end procedure

Figure 3.5: Generating CTA & FTA covers.

The second type of measurements is based on the boundary-edge count BEC

defined as the number of edges whose one end is inside the cover while the other is

outside. This metric is useful for taking into account the effect of adding high degree

users into covers of large sizes since such users are likely to increase both the intra-

and boundary-edge counts. The expansion EXP of a cover is computed by dividing

the boundary-edge count by the size of the cover. The conductance COND of a cover
is defined as $COND(C) = \frac{BEC(C)}{2\,IEC(C) + BEC(C)}$. For these three measures (BEC, EXP,
COND), the lower the value, the better formed is the community.

The third type of measurements is based on pair-similarity, which measures a given metric such as friendship similarity among pairs of nodes. This is applicable to the definition of a community whose members have a lot in common [49]. We replace the friendship similarity ratio with three additional measurements based on the geographic proximity and location of nodes. The first one is the geographic diameter of a cover GDI, defined as the geographic distance between the two farthest nodes. The second one is the average geographic distance AGD among pairs of nodes. Here, the lower the measure (GDI and AGD), the better formed is the community. The third one is the sum of the levels of physical interactions SLI among pairs of nodes, for which the higher the measure, the better formed is the community.

Table 3.3: Measurements for cover C of the size k.

Measurement     Definition
IEC [55]        $|\{(v_i, v_j) \in E \mid v_i \in C \wedge v_j \in C\}|$
BEC [56]        $|\{(v_i, v_j) \in E \mid v_i \in C \vee v_j \in C\}| - IEC$
CONT            $IEC / k$
EXP [57]        $BEC / k$
IND [55]        $IEC / (0.5\,k(k-1))$
COND [56, 57]   $BEC / (2\,IEC + BEC)$
GDI             $\max d(u, u') \;\; \forall u, u' \in C$
AGD             $\sum_{u \neq u' \in C} d(u, u') / (0.5\,k(k-1))$
SLI             $\sum_{u \neq u' \in C} I(u, u')$
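The link-based and geographic measurements of Table 3.3 can be computed together for a single cover, as in the sketch below. The function name, the dictionary layout of the result, and the `dist` callback are illustrative assumptions; SLI is left out because it depends on the interaction function $I(u, u')$ defined earlier in the chapter.

```python
from itertools import combinations

def cover_measures(cover, edges, dist):
    """Measurements of Table 3.3 (except SLI) for one cover or community.

    cover : collection of nodes
    edges : iterable of undirected (u, v) pairs, each listed once
    dist  : function (u, v) -> geographic distance in km
    """
    members = set(cover)
    edges = list(edges)
    k = len(members)
    iec = sum(1 for u, v in edges if u in members and v in members)
    bec = sum(1 for u, v in edges if (u in members) != (v in members))
    pair_dists = [dist(u, v) for u, v in combinations(members, 2)]
    n_pairs = len(pair_dists)              # equals 0.5 * k * (k - 1)
    return {
        "IEC": iec,
        "BEC": bec,
        "CONT": iec / k if k else 0.0,
        "EXP": bec / k if k else 0.0,
        "IND": iec / n_pairs if n_pairs else 0.0,
        "COND": bec / (2 * iec + bec) if (2 * iec + bec) else 0.0,
        "GDI": max(pair_dists, default=0.0),
        "AGD": sum(pair_dists) / n_pairs if n_pairs else 0.0,
    }
```

Returning 0.0 for empty denominators is a convenience of this sketch; a caller may prefer to treat such covers as undefined.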

3.3.3 Examining Covers in Gowalla

For each technique, we generated covers of fixed sizes from 5 to 100 with an

increment of 1. For each cover size, we generated 100 covers and calculated the

average intra-edge count, boundary-edge count, geographic distance, and geographic

diameter. We then derived the remaining measurements.

In Fig. 3.6(a), we noticed that FFF outgrows the other techniques in terms

of intra-edge count as the cover size increases. In Fig. 3.6(b), we noticed that FFF

and FTA outgrow the other techniques in terms of boundary-edge count by a great

margin suggesting that they strategically add users with very large degrees. While

RW is decent at generating covers with high intra-edge counts as seen in Fig. 3.6(a),

it is also biased since users with high degrees are more likely to be added, which

increases the intra-edge count as the cover continues to grow. However, FFF and

FTA are even more biased than RW and FFF outgrows the other five techniques

because the radius of the farthest friend would cover everyone including common

friends in between. On the other hand, we noticed that CFF and CTA are most effective out of the six techniques at increasing the intra-edge count while minimizing the boundary-edge count at the same time.

[Figure 3.6 plots: (a) Intra-Edge Count, (b) Boundary-Edge Count, and (c) Geographic Diameter (km), each as a function of Cover Size (up to 100) for CR, RW, CFF, FFF, CTA, and FTA.]

Figure 3.6: Intra-edge count, boundary-edge count, and geographic diameter of covers.

In Fig. 3.6(c), we measure the geographic diameter of a cover as a function of its

size. As expected from how covers are generated, FFF and FTA are most effective

at maximizing the geographic diameter while CFF and CTA are most effective at

minimizing this measurement. The geographic diameter of FFF and FTA reaches

the limit within 20 iterations, while the diameter for CTA and CFF slowly continues

to grow. A similar trend is seen in Fig. 3.7(c) which shows the average geographic

distance in contrast to the growth rate of intra- and boundary-edge counts seen in

Fig. 3.7(a).

Last but not least, conductance is a measurement used to determine the quality

of a community by considering both the intra- and boundary-edge counts. As seen in

Fig. 3.7(b), CFF is the most effective out of the six covers at minimizing conductance, since it preserves some geographic structure of the social network by traversing the edges based on who is the geographically closest friend, and adding friends who are likely to be friends with the members already in the cover. CTA is not as effective as CFF because geographic distances get diluted as the size of the cover increases. FFF and FTA are worse than RW at minimizing conductance. We later use the physical interactions of users to compare and contrast the results generated by the CFF cover to results detected by the community detection algorithms.

[Figure 3.7 plots: (a) Contraction and Expansion, (b) Conductance, and (c) Avg. Geo. Distance (km), each as a function of Cover Size (up to 100) for CR, RW, CFF, FFF, CTA, and FTA.]

Figure 3.7: Contraction, expansion, conductance, and geographic distance of covers.

3.4 Examining Detected Communities

We first examined the results by looking at the total number of communities detected and the number of members in each one. The modified CPM algorithm with geographic information detected 2.6K communities whose average size was 60, with the size of the largest one being 69K. We did not run the original CPM algorithm because of the long execution time required to generate the clique graph. IA without geographic information detected 1.2K communities with an average size of 134 and the size of the largest one being 52K. IA with geographic information detected 349 communities with an average size of 442 and the size of the largest one being 45K. GANXiS without geographic information detected 7.2K communities with an average size of 21 and the size of the largest one being 3K. Finally, GANXiS with geographic information detected 4.6K communities with an average size of 33 and the size of the largest one being 48,290. Additional information relating to community sizes is listed in Table 3.4.

Table 3.4: Detected communities and their sizes.

                                  Community Size
Algorithms              Avg.   Std.    Smallest   Largest   Total
CPM                     60     1,356   6          68,671    2,572
IA                      134    1,935   2          52,315    1,151
IA w (w for weighted)   442    2,954   2          45,242    349
GANXiS TTL              21     87      3          3,139     7,236
GANXiS TTL w            33     767     3          48,290    4,636

3.4.1 Network Community Profile (NCP)

We used the network community profile (NCP) proposed in [56] to examine

detected communities as a function of their size. The authors proposed to take the best

partition defined by a quality feature of a given community size because it represents

the potential of a partition in a community detection algorithm. By inspecting all

communities in the set of communities with the same size, we find for this set the

lowest conductance or the highest intra-density among its members, one quality metric

at a time.
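The profile can be computed by grouping communities by size and keeping the best score in each group. In the sketch below, the function name and the `score` and `best` parameters are hypothetical: `score` maps a community to a quality value, and `best` is `min` for conductance or `max` for intra-density.

```python
def network_community_profile(communities, score, best=min):
    """Best quality score among the communities of each size.

    communities : iterable of node collections
    score       : function community -> quality value
    best        : min (e.g. conductance) or max (e.g. intra-density)
    """
    by_size = {}
    for c in communities:
        by_size.setdefault(len(c), []).append(score(c))
    # One value per community size, ordered by increasing size.
    return {size: best(values) for size, values in sorted(by_size.items())}
```

Sizes for which no community exists simply have no entry, which corresponds to the missing points in the profile.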

For intra-density and conductance without geographic information, we use the

classical definitions from Table 3.3 and include all existing intra- and boundary-edges

in the counts.

For intra-density and conductance with geographic information, we only include

edges that are within geographic proximity of 160 km or roughly 2 hours of driving. A

low value of conductance is good because this means that the fraction of edges leading

outside the community is low, but the value of 0 is rare since it would indicate that


the community is isolated. However, for conductance with geographic information,

a value of 0 means there are no edges that connect to other communities that are

geographically close, so all bridge edges are long. This also means that, on seeing a short bridge edge, the community detection algorithm tends to merge the communities connected by such an edge, following the insight that neighbors tend to be friends.
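Conductance with geographic information can be sketched by ignoring every edge longer than the proximity threshold before counting. The function name, the `dist` callback, and the decision to return `None` when a community has no short edges at all are assumptions of this sketch.

```python
def spatial_conductance(community, edges, dist, max_km=160.0):
    """Conductance computed only over edges no longer than max_km.

    Returns None when the community has no short edges at all.
    """
    members = set(community)
    intra = boundary = 0
    for u, v in edges:
        if dist(u, v) > max_km:
            continue                      # long edges are ignored entirely
        if u in members and v in members:
            intra += 1
        elif u in members or v in members:
            boundary += 1
    denom = 2 * intra + boundary
    return boundary / denom if denom else None
```

A result of 0 then means that every edge leaving the community is longer than `max_km`, matching the interpretation that all bridge edges are long.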

The potential issues resulting from using this approach are discussed below.

First, in many situations, taking the average value of a community quality gives a

more representative picture and probably is less sensitive in cases containing outliers.

Second, the number of communities for a given size might vary from a large number

of small communities to very few for large communities. Last but not least, there

might be no communities of a particular size, and taking the average quality might

give a smooth function that is easier to extrapolate at the missing points as seen with

the covers. Fig. 3.8-3.10 present the results for communities detected by CPM, IA,

and GANXiS respectively.

3.4.2 Link Connectivity Measurements

First, intra-density rapidly decreases as the size of the community increases because adding another member into a large community requires everyone already in it to be connected with this new member, as seen in Fig. 3.8-3.10(a). Second, unlike intra-density, conductance is not correlated with the community size because there are some small and large communities of varying values, as seen in Fig. 3.8-3.10(b). Third, GANXiS

and IA are a little better than CPM at maximizing intra-edges that are within geo-

graphic proximity, as seen in Fig. 3.8-3.10(c). IA is the best at minimizing boundary-

edges that are within geographic proximity, as seen in Fig. 3.9(d). Last but not least,

GANXiS and IA benefited from incorporating the geographic information of users, as

seen in Fig. 3.9-3.10(d), where geographically correlated friends are captured in the

community detection process.

3.4.3 Face-to-Face Interactions Measurements

Comparing Fig. 3.8-3.10(d) to Fig. 3.8-3.10(b), we noticed that some detected communities had a conductance value of 0. This means that every potential node within geographic proximity of a community has already been included in it. For the IA without geographic information, out of the 78 community sizes, 20 of them have a geographic conductance of 0, yielding a ratio of 20/78 ≈ 0.26. For the IA with geographic information, out of the 84 communities, 19 of them have a geographic conductance of 0, yielding a ratio of 19/84 ≈ 0.23. The remaining values are listed in Table 3.5. Results in Table 3.5 show that GANXiS has the highest ratio of the number of communities with a spatial conductance of 0 to the number of communities detected. From this perspective, a good community detection algorithm yields many communities with a spatial conductance of 0 as the result of merging connected and geographically close communities together.

[Figure 3.8 plots: (a) Intra-density, (b) Conductance, (c) Intra-density with Spatial info., and (d) Conductance with Spatial info., each as a function of Community Size (log-scale).]

Figure 3.8: Communities detected by Clique Percolation Method.

Table 3.5: Measuring spatial conductance.

Algorithm               # Spatial Cond. of 0   Total   Ratio
CPM                     21                     175     0.12
IA                      20                     78      0.26
IA w (w for weighted)   19                     84      0.23
GANXiS TTL              48                     126     0.38
GANXiS TTL w            47                     155     0.30

Table 3.6: Measuring face-to-face interactions.

Algorithm               Count   Total   Ratio
CPM                     84      95      0.88
IA                      38      41      0.93
IA w (w for weighted)   28      30      0.93
GANXiS TTL              60      87      0.69
GANXiS TTL w            77      85      0.91

[Figure 3.9 plots: (a) Intra-density, (b) Conductance, (c) Intra-density with Spatial info., and (d) Conductance with Spatial info., for the Weighted Network and the Unweighted Network.]

Figure 3.9: Communities detected by Inference Algorithm.

We examined small-size communities because humans have limited resources and cognitive abilities to keep and maintain social relationships, resulting in a limited number of friendships known as Dunbar's number [58]. We measured and then plotted in Fig. 3.11 the NCP level of physical interactions in communities and covers by summing the level of physical interactions among pairs. From the plots, we observed that CPM yields small communities whose members are statistically more likely than members of covers to physically interact with each other by going to the same places together. In Fig. 3.11(a), out of 95 communities of size up to 100 detected by CPM, 84 have a higher amount of physical interaction among members than the null model, CFF. In Fig. 3.11(b), out of 41 communities of size under 100 detected by IA, 38 have a higher amount of physical interaction among members than CFF. The remaining values are listed in Table 3.6.

[Figure 3.10 plots: (a) Intra-density, (b) Conductance, (c) Intra-density with Spatial info., and (d) Conductance with Spatial info., each as a function of Community Size.]

Figure 3.10: Communities detected by GANXiS.

While CPM is the most effective at detecting communities that are intrinsically small (95 total) and where the physical interaction among members is likely to be higher than CFF (88%), IA is the most effective at detecting communities where 93% of them have a higher amount of physical interaction than the null model, as seen in Table 3.6. Incorporating geographical information into GANXiS improves its overall performance (91% vs. 69% without geography).

[Figure 3.11 plots: (a) CPM, (b) IA, and (c) GANXiS TTL; x-axis: Community Size (up to 100); y-axis: Amount (Level) of Physical Interaction; curves for the detected communities (Weighted Network and Unweighted Network in (b) and (c)) and for the CFF cover.]

Figure 3.11: Measuring face-to-face interactions among members.

3.5 Application: Social Relationships & Human Mobility

Random mobility models have been popular among applied researchers for gen-

erating synthetic movements. Random walk is commonly used for graph traversals,

clustering analysis, and many other applications to model unpredictable behavior.

Random waypoint is a mobility model on Cartesian coordinate systems where two

dimensions are commonly used in simulations and higher dimensions are used for

theoretical analysis and generalization. Not only are these random models useful for application purposes, but they are also powerful tools for the analytical understanding of many networking applications, like routing in decentralized architectures where mobility plays a large role.

A typical ad-hoc network is a decentralized network formed by mobile agents

in a dynamic process without any fixed infrastructure. It is dynamic because the

topology of who is connected to whom is constantly changing due to the mobility and

connection preferences of the agents and the physical limitation of communication

devices. If two mobile agents are outside of transmission range, then the connection

is dropped. If they are within the transmission range, then the connection could

be established. Hence, the topology of the ad-hoc networks depends on a complex

combination of agent mobility, connection preferences, and environmental factors that

could disrupt services or enhance communication.

Some of these networks are uncoordinated, with each agent acting selfishly on its own behalf, while others are coordinated, with all agents collaborating to accomplish a particular goal, task, or mission. For instance, peer-to-peer networks are uncoordinated: the architecture is designed for robustness, to reduce the damage caused by selfish users who participate but are reluctant to contribute, and anti-choking algorithms are designed to distribute pieces of a file effectively to maximize throughput and efficiency. On the other hand, military ad-hoc networks are coordinated: soldiers communicate through a network channel to rescue innocent civilians or capture fugitives in a mission.

Outside of computer networks, human mobility is important for studying the

spread of contagious diseases, traffic engineering, methods of large scale emergency

evacuations, and so on [59]. While individual mobility is important at a micro-level, it

serves as a building block for population mobility that has many potential applications

in studying the population at large scale. Using data to observe statistical patterns

that capture, characterize, and predict trajectories of human movements during their

daily activities is important for health organizations, civil engineers, and national

interests.

For instance, health organizations may want to study the spread of transmitted diseases, while traffic and civil engineers may want to incorporate human mobility analysis into their transportation models, where travellers can use a transportation system consisting of bikes, buses, and subways to get from one place to another. Understanding population mobility allows the design of effective transportation systems in which traffic congestion is controlled and reduced. Last but not least, national security agencies might be interested in knowing how social relationships impact population mobility, so that guidelines can be provided during emergency evacuations in disasters such as Hurricane Irene and the Japanese nuclear meltdown of 2011, where evacuating 45,000 people within a six-mile radius of two malfunctioning nuclear power plants required optimal efficiency, since every second could count toward saving a life.

[Figure 3.12 diagram: locations School, Home, Work, Mall, and Lunch connected by transition probabilities Pij.]

Figure 3.12: Generating a Markov Model using checkins.

3.5.1 Network Congestion in MANETs

The backoff timer in the MAC 802.11 protocol is an algorithm designed to prevent collisions between wireless transmissions. If two or more concurrent wireless transmissions are within radio range, one will randomly back off to let the other one talk. Suppose we are interested in measuring the throughput of a wireless network where people are working on their laptops and moving from location to location with some hidden attributes. Since human beings do not move randomly, we know that there will be more congestion at popular locations. If we use the random waypoint (RWP) model instead, most of the congestion occurs in the middle of the simulation area due to the stationary distribution, as shown in Fig. 3.15.

3.5.2 Mobility Generation

We propose the following algorithm for generating mobility traces using social networking data from Gowalla. For our Friendship Mobility Model (FMM), which uses a Markov model as its underpinning, we first randomly select a user from the dataset and add his or her friends to the selected group of users. For each selected user, we calculate the patterns of checkin activity from the dataset. To define the set of locations, we collect the unique places where the user has checked in. For each pair of consecutive locations, we calculate the shortest route using the haversine distance. For the probability in the Markov model of moving from location a to location b, we count how many times the user checks in at location b immediately after checking in at location a and divide by the number of times the user checks in at location a. Finally, we calculate the time it takes for a given user to go from one checkin to the next. The entire process is depicted in Fig. 3.12.
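The estimation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names `haversine_km` and `transition_probabilities`, the Earth radius constant, and the toy checkin trace in the usage note are assumptions made here. One further assumption is that each row is normalized by the number of checkins at a that have a successor, so that the outgoing probabilities of every location sum to 1.

```python
import math
from collections import Counter, defaultdict

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def transition_probabilities(checkins):
    """Estimate P(next = b | current = a) from one user's time-ordered checkins.

    checkins: list of location identifiers, oldest first.
    Returns a nested dict {a: {b: probability}}.
    """
    pair_counts = defaultdict(Counter)
    for a, b in zip(checkins, checkins[1:]):
        pair_counts[a][b] += 1          # b observed immediately after a
    probs = {}
    for a, successors in pair_counts.items():
        total = sum(successors.values())  # checkins at a that have a successor
        probs[a] = {b: c / total for b, c in successors.items()}
    return probs
```

For a hypothetical trace `["home", "work", "lunch", "work", "home", "work"]`, the sketch gives "work" a 0.5 probability of being followed by "lunch" and 0.5 by "home".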

After the empirical Markov model is built for each user, we use the Miller cylindrical projection to convert geographic space into a Cartesian coordinate system that preserves the triangle inequality for distances. For the mobility simulation, each node is first assigned at random to one of its checkins. Each node then picks the location of its next checkin according to the assigned transition probabilities and moves directly to it along a straight-line trajectory. Once the node reaches the new checkin, it repeats the process until the end of the simulation.
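The simulation loop above can be illustrated with the following Python sketch. It is written under several assumptions that are not in the thesis: coordinates are already projected to Cartesian metres, the node moves at a constant speed, and the names `simulate_trajectory`, `probs`, `coords`, and `max_steps` are hypothetical. The step cap is only a guard against zero-length hops (a location followed by itself).

```python
import math
import random


def simulate_trajectory(probs, coords, speed, sim_time, rng=None, max_steps=10_000):
    """Return [(arrival_time_s, location), ...] for one simulated node.

    probs:  {a: {b: probability of moving from a to b}}
    coords: {location: (x, y)} in projected Cartesian metres (assumed given)
    speed:  constant speed in metres per second (an assumption, must be > 0)
    """
    rng = rng or random.Random()
    current = rng.choice(sorted(probs))        # random initial checkin
    t = 0.0
    path = [(t, current)]
    for _ in range(max_steps):
        if current not in probs:               # no observed successor: stop
            break
        targets = list(probs[current])
        weights = [probs[current][b] for b in targets]
        nxt = rng.choices(targets, weights=weights, k=1)[0]
        (x1, y1), (x2, y2) = coords[current], coords[nxt]
        t += math.hypot(x2 - x1, y2 - y1) / speed   # straight-line travel time
        if t > sim_time:
            break
        path.append((t, nxt))
        current = nxt
    return path
```

With two locations 500 m apart and a speed of 5 m/s, each hop takes 100 s, so a 250 s simulation records arrivals at 0 s, 100 s, and 200 s.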

Hence, the difference between the RWP mobility model and our FMM is that in

the latter the space of travel is limited to the area of the checkins for each individual

node. Moreover, each node moves differently based on its training set of checkins. For

instance, an adult might be inclined to check in at work more often than a student.

3.5.3 Experimental Congestion Design

We designed a controlled MANET experiment in ns-2 to compare the traffic congestion under the RWP and the FMM. In the experiment, 15 mobile nodes constantly send packets to their neighbors within transmission range. Other simulation parameters are listed in Table 3.7. When two or more nodes are within radio range of each other, at most one can make a successful transmission and the rest have to pause. We measure the overall congestion of the network by counting how many times a node needs to pause, together with its geographic location at that moment in the simulation.

Fig. 3.13 outlines how a simulated node moves and how it causes congestion. Suppose a node starts at p1 and travels to p2 with some speed dictated by the mobility model. A mobile node cannot transmit if there is already a concurrent transmission within range. Therefore, it pauses until it detects no concurrent transmissions. The pause time in a subarea is the total amount of time that all nodes spend pausing or suspending their transmissions there due to the backoff timer of the MAC 802.11 protocol. During the trip from p1 to p2, the node pauses in three subareas, (1,2), (2,2), and (3,3), represented by the dashed line, meaning that the transmission was suspended for some time. The length of the dashed line in a subarea represents the duration of the pause for that particular trip.

We use “user” when referring to the dataset and “node” when referring to the simulation. A node is built from the social network data provided by the users.

Figure 3.13: Design of simulation overview.
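The per-subarea bookkeeping described for Fig. 3.13 can be sketched as follows. This is only an illustration: the square grid, the 1-based (column, row) indexing, the cell size, and the names `subarea` and `pause_time_by_subarea` are assumptions, and the backoff events are taken as given rather than produced by ns-2.

```python
from collections import defaultdict


def subarea(x, y, cell_size):
    """1-based (column, row) index of the square grid cell containing (x, y)."""
    return (int(x // cell_size) + 1, int(y // cell_size) + 1)


def pause_time_by_subarea(pauses, cell_size):
    """Sum pause durations per subarea.

    pauses: iterable of (x, y, duration_s) backoff events, one per pause.
    Returns {(column, row): total pause time in seconds}.
    """
    totals = defaultdict(float)
    for x, y, duration in pauses:
        totals[subarea(x, y, cell_size)] += duration
    return dict(totals)
```

Counting events per cell instead of summing durations would give the frequency of pauses rather than the pause time.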

3.5.4 Congestion Simulation Results

Table 3.7: Network simulator ns-2 parameters.

Parameters            RWP          FMM
Simulation Time (t)   10,000s      10,000s
MAC Layer             802.11Ext    802.11Ext
Width (x)             2000m        2000m
Length (l)            2000m        2000m
Nodes (n)             15           15
Pause Time            0            0
Min Speed             0            5
Max Speed             5            5
Total Backoffs        598,316      1,654,967
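As a quick check of the totals in Table 3.7, the ratio of backoffs under the two mobility models can be computed directly:

```latex
\[
\frac{\text{Total backoffs (FMM)}}{\text{Total backoffs (RWP)}}
  = \frac{1{,}654{,}967}{598{,}316} \approx 2.77
\]
```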

The FMM (see [22]) produced about 2.77 times more congestion than the RWP, which initially surprised us. However, this agrees with the intuition that, in the FMM, friends like to maintain their relationships by being closer to each other. Economic factors such as the cost of transportation and mobility have a great impact on how we choose with whom to be friends.

[Figure 3.14 plot: locations of congestion for FMM and RWP; x-axis X (m) and y-axis Length (m), each from 0 to 2000.]

Figure 3.14: Traffic congestion in FMM and RWP.

Fig. 3.14 displays the simulation results of network congestion in a controlled MANET. We took a sample of locations with traffic congestion. The points represent places where at least one node had to back off during the simulation. Notice how traffic congestion is dispersed for the RWP and clustered for the FMM. Note that this graph shows only the places of congestion, not the density or total volume of communications. Fig. 3.15 displays the frequency of pauses caused by the backoff timer in the MAC 802.11 protocol under the RWP. Congestion is concentrated in the middle of the area, which is consistent with the stationary distribution of the RWP.

3.6 Application: Long Ties & Economic Development

A number of results in economic sociology suggest that human relationships affect economic opportunities because information often spreads between people [60]-[65]. In addition, information coming from interpersonal relationships is often richer than that from traditional broadcast media such as television, newspapers, and radio, because acquaintances can interact face-to-face and influence one another in adopting new behavior and ideas [66]. Therefore, social networks can be portrayed as a transportation system where individuals are drivers for generating ideas and the links between people are vehicles for transporting ideas from one person to another. Metaphorically, some links are faster at transporting ideas to a larger number of people than others because not all vehicles are created equal.

Figure 3.15: Frequency of pauses using the RWP.

It has been argued that information coming from weak ties is often richer than information arriving via strong ties because “those to whom we are weakly tied are more likely to move in circles different from our own ... and have access to information different from what we [usually] receive” [65]. Weak ties have been shown to be valuable sources of information because individuals can use them to find jobs [32], [60], solicit feedback on starting new ventures [63], and search for people, as in the small-world experiment [31], [41], [67], [68]. In other settings, such as workplaces, structural holes can affect the productivity and innovation of employees

and could lead to higher compensation, more promotion opportunities, and better

performance evaluations [61]-[64]. Structural holes are those social relationships that

connect non-redundant contacts together [61]. An example of a structural hole is a

bridge that connects non-redundant contacts from two communities together. The

effect of weak ties on economic opportunities [69] suggests that perhaps information

coming from weak ties can also be used for measuring economic development on a

larger scale.

Contemporary developments in the science of urbanization have provided scaling laws for innovation and wealth creation as a power function of the population size: $y(t) = c\,x(t)^{m}$, where $x(t)$ is the population size and $y(t)$ is the metric of innovation at time $t$ [70]. These results show that as the population size increases, GDP, wages, patents, and private research and development employment increase at superlinear rates, with $1.03 \le m \le 1.46$ [70]. A plausible explanation for the superlinear scaling of wealth creation is that as the population size increases, the number of social relationships between people increases because there are more choices for establishing relationships. This increases the connectivity between people and decreases the time for ideas to spread, as long as the rate of establishing connections is faster than the rate of population growth.
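A scaling exponent m of this kind is commonly estimated by a straight-line fit in log-log space, since log y = log c + m log x. The sketch below shows one way to do that with ordinary least squares; the function name `fit_scaling_exponent` and the synthetic data in the usage note are illustrative assumptions, and the fitting procedure actually used in [70] or in this thesis may differ.

```python
import math


def fit_scaling_exponent(xs, ys):
    """Fit y = c * x**m by least squares on the logs; returns (m, c, r).

    r is the Pearson correlation between log x and log y.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    m = sxy / sxx                          # slope = scaling exponent
    c = math.exp(mean_y - m * mean_x)      # intercept back-transformed
    r = sxy / math.sqrt(sxx * syy)
    return m, c, r
```

On synthetic data generated as y = 2 x^1.2, the fit recovers m = 1.2 and c = 2; an exponent above 1 indicates superlinear scaling.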

Following this line of thinking, recent results in [71] suggest that a generative

model for tie formation as a function of population density yields results very similar

to the model based on population size [70]. Results show that algorithmically gen-

erated social ties based on population density, assuming that nodes are distributed

uniformly on a Euclidean space and they establish connections similar to the rank

friendship model [67], can be used to model urban characteristics of cities such as

GDP, HIV transmissions, and communication volume. Here we extend this line of

thinking by focusing on characteristics of economic development as a function of speedy idea flow emulated on real social relationships, using long ties as the main component enabling such flow. This was accomplished by using data containing geographical locations and friendship information of hundreds of thousands of people

from location-based social media such as Gowalla and FourSquare [22]. More impor-

tantly, these datasets allow us to infer face-to-face interactions [23] and measure the

strength of ties in terms of not only interactions but also geographical distance (i.e.,

short or long ties [72], [73]).

Other approaches for measuring economic development of large geographical

areas include examining the diversity of social contacts (i.e., call records as a proxy for

social relationships) since more contacts imply more channels for receiving information

[74], but using calling patterns to infer social contacts is biased towards those that are

more likely to be strong ties since weak ties are by definition those that are contacted

infrequently. While these approaches [71], [74] can vary in their complexity, ranging

from mathematically oriented to data-driven, what they share in common is using

social network analysis to predict innovation, wealth creation, and even patterns of

complex human behavior. The novelty of our approach lies at the intersection of

economic sociology (i.e., the interplay of weak ties and economic opportunities) and

simple contagion models (i.e., the spread of good ideas from one place to another).

Results show that the speed of access to ideas is a near perfect measure for social

diversity and also a signature of economic development in the US without needing

to tune parameters or incorporate secondary factors such as the level of educational

attainment and internal transportation infrastructure.

3.6.1 A Stochastic Model of Economic Development

We propose a simple stochastic model that uses long ties as the main component for measuring economic development of large geographical areas. Let $G = (V, E, L)$ be a social network where $V$ is the set of nodes, $E$ is the set of their undirected relationships, and $L$ is the mapping of users to the locations of their residences. Let $A_i$ denote the set of nodes that reside in area $i$; i.e., $A_i = \{v \in V \mid L(v) = i\}$. The flow of ideas matrix is denoted as $F = (f_{ij})$, where $f_{ij}$ is the probability of an idea going from $A_i$ to $A_j$ in one step, defined as the number of long ties connecting nodes from $A_i$ to $A_j$ divided by the number of long ties originating from $A_i$; i.e.,

$$f_{ij} = \frac{LT(A_i, A_j)}{\sum_{k=1}^{m} LT(A_i, A_k)} \tag{3.4}$$

where $m$ is the total number of areas and $LT(A_i, A_j)$ ($1 \le i \ne j \le m$) denotes the number of long ties connecting nodes from $A_i$ to $A_j$; i.e.,

$$LT(A_i, A_j) = |\{(s, t) \in E \mid (s \in A_i \text{ and } t \in A_j) \text{ or } (t \in A_i \text{ and } s \in A_j)\}| \tag{3.5}$$

If we assume that innovative ideas travel randomly between areas, and that the probability of an idea spreading from $A_i$ to $A_j$ depends only on the present area and not on the previous areas, then $\{X_t, t \ge 0\}$ is a discrete-time Markov chain where $X_t$ denotes where the idea is located at time $t$.
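Equations 3.4 and 3.5 can be illustrated with a short Python sketch that builds the flow matrix from an edge list. The function name `flow_matrix`, the input layout, and the toy network in the usage note are assumptions made for illustration. The sketch also assumes that ties within a single area contribute nothing to the sum in Eq. 3.4, since they are not long ties.

```python
def flow_matrix(edges, location, areas):
    """Build F = (f_ij) from undirected ties, following Eqs. 3.4 and 3.5.

    edges:    iterable of undirected ties (s, t)
    location: {node: area of residence}
    areas:    ordered list of area names; fixes the row and column order
    """
    index = {area: i for i, area in enumerate(areas)}
    m = len(areas)
    lt = [[0] * m for _ in range(m)]      # lt[i][j] = LT(A_i, A_j)
    for s, t in edges:
        i, j = index[location[s]], index[location[t]]
        if i != j:                        # only ties across areas are long ties
            lt[i][j] += 1
            lt[j][i] += 1
    flow = []
    for row in lt:
        total = sum(row)                  # long ties originating from this area
        flow.append([count / total if total else 0.0 for count in row])
    return flow
```

In a hypothetical network with two NY users tied to one CA user and one NY user tied to a TX user, an idea leaving NY moves to CA with probability 2/3 and to TX with probability 1/3.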

Let $H_{ij}$ denote the expected time it takes for an idea originating at $A_i$ to arrive at $A_j$. Then the average expected time for an idea originating from anywhere to arrive at $A_i$, denoted as $\phi_i$, is defined as:

$$\phi_i = \frac{1}{m-1} \sum_{k=1}^{m} H_{ki} \tag{3.6}$$

where $H_{ii}$ is 0. Hence, we expect $\phi_i$ to be inversely correlated with economic development, since areas that receive information quicker can act faster.
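For a small chain, the quantity in Eq. 3.6 can be computed directly from a transition matrix using the standard first-step relation H_ij = 1 + Σ_k f_ik H_kj for i ≠ j, with H_jj = 0. The sketch below solves that relation by fixed-point iteration; it is an illustration under the assumption of an irreducible chain, the names `hitting_times_to` and `average_access_time` are hypothetical, and the thesis itself uses the stationary distribution as a proxy instead.

```python
def hitting_times_to(F, j, tol=1e-12, max_iter=100_000):
    """Expected number of steps to first reach area j from every area.

    Iterates h_i = 1 + sum_k F[i][k] * h_k for i != j, with h_j fixed at 0.
    Assumes the chain described by F is irreducible.
    """
    m = len(F)
    h = [0.0] * m
    for _ in range(max_iter):
        new = [0.0 if i == j else 1.0 + sum(F[i][k] * h[k] for k in range(m))
               for i in range(m)]
        if max(abs(a - b) for a, b in zip(new, h)) < tol:
            return new
        h = new
    return h


def average_access_time(F):
    """phi_i = (1 / (m - 1)) * sum_k H_ki, as in Eq. 3.6."""
    m = len(F)
    return [sum(hitting_times_to(F, i)) / (m - 1) for i in range(m)]
```

For a symmetric walk on three areas, where each area sends ideas to the other two with probability 1/2, every hitting time between distinct areas is 2, so each area has an average access time of 2.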

Suppose an innovative idea travels indefinitely; then the fraction of time the idea stays in $A_i$ is denoted as:

$$\lambda_i = P(X_t = A_i) \tag{3.7}$$

$\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)$ is known as the stationary distribution, and there exists a unique stationary distribution of $X_t$ since the chain is irreducible [24]. If $\lambda_i$ denotes the fraction of time the idea spends in area $i$, then $1/\lambda_i$ denotes the expected time needed for the idea to come back to $i$; therefore, $\phi_i \approx 1/\lambda_i$.
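The stationary distribution can be approximated numerically by repeatedly applying the transition matrix to a starting distribution. The sketch below is an illustration only: the name `stationary_distribution` is hypothetical, and the "lazy" update (averaging the old and new distributions) is a choice made here so that the iteration also converges for periodic chains without changing the stationary distribution. As a side note, when F is built from symmetric tie counts, a standard result for random walks on undirected weighted graphs implies that λ_i is proportional to the number of long ties incident to A_i; the sketch does not rely on this.

```python
def stationary_distribution(F, tol=1e-13, max_iter=100_000):
    """Approximate lambda with lambda = lambda F for an irreducible chain.

    F is a row-stochastic matrix given as a list of lists.
    """
    m = len(F)
    lam = [1.0 / m] * m                   # start from the uniform distribution
    for _ in range(max_iter):
        step = [sum(lam[i] * F[i][j] for i in range(m)) for j in range(m)]
        new = [0.5 * a + 0.5 * b for a, b in zip(lam, step)]   # lazy update
        if max(abs(a - b) for a, b in zip(new, lam)) < tol:
            return new
        lam = new
    return lam
```

For the three-area chain with rows (0, 1, 0), (0.5, 0, 0.5), and (1, 0, 0), the result is λ = (0.4, 0.4, 0.2), so the expected return times 1/λ_i are 2.5, 2.5, and 5 steps.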

3.6.2 Experimental Results & Discussion

We extracted users and their social relationships in Gowalla and FourSquare

and kept those that are confined to the US. We partitioned the US into 51 areas

where each area corresponds to a federal state.

Figure 3.16 shows scaling laws of the number of short and long ties as a function of the population size. Short ties are defined as those relationships where both users live in the same state, while long ties are defined as those where the two users live in different states. The total number of ties (i.e., all ties) is the sum of the number of short and long ties. Each point is a state, where the x-axis corresponds to the number of users that live there and the y-axis corresponds to the number of their ties. Results show that as the population size increases, the number of short ties increases at superlinear rates, with m ≈ 1.34 for Gowalla (a) and m ≈ 1.43 for FourSquare (b). This result supports the claim that increasing the population size increases the number of relationships between people and decreases their path lengths so that ideas can spread quicker. However, long ties do not increase at superlinear rates but instead at approximately linear rates, with m ≈ 0.95 for Gowalla (a) and m ≈ 1.00 for FourSquare (b). Therefore, long ties do not explain the superlinear scaling of innovation and wealth creation as a function of population size.

[Figure 3.16 plots: Number of Ties (log) versus Population Size (log). Panel a): Short Ties (m=1.34, r=0.97), Long Ties (m=0.95, r=0.98), All Ties (m=1.02, r=0.99). Panel b): Short Ties (m=1.43, r=0.95), Long Ties (m=1.00, r=0.94), All Ties (m=1.07, r=0.96).]

Figure 3.16: Scaling laws of short and long ties.

[Figure 3.17 plots: panel a) Number of face-to-face interactions for Short Ties and Long Ties; panel b) P(k > K) (log-scale) versus K (log-scale) for short ties and long ties.]

Figure 3.17: Face-to-face interactions of short ties and long ties.

Figure 3.17 shows that most long ties are weak because face-to-face interactions occur more often when people are geographically close. In this experiment, we selected all pairs of long and short ties and calculated the number of their face-to-face interactions by matching their checkins. The average number of interactions for short ties is 3.95 (std=43.20), while this number for long ties is 0.73 (std=9.19). While not all short ties are strong, most long ties are weak, since 90% of them have no more than two interactions. The x-axis in (b) represents the number of interactions K, and the y-axis represents the probability that a tie has more than K interactions. We did not repeat the same experiment for FourSquare because its API did not provide access to users' checkins.

[Figure 3.18 plots: Number of Ties for Adopters and Non-Adopters, panels a) and b); legend: Short Ties, Long Ties in a) and Weak Ties, Long Ties in b).]

Figure 3.18: The collective strength of long ties in a simple contagion model.

We emulated a simple contagion process using the social relationships of users to examine the effects of short and long ties on adopting versus not adopting a contagion (similar to the process of spreading ideas in [71]). Following Rogers' work on the diffusion of innovations [75], we assume that 2.5% of the population, randomly selected, is responsible for generating innovative ideas (i.e., the seed set). In each step, each of them randomly selects one of their acquaintances to propagate the contagion to, and that acquaintance decides whether to adopt it with some fixed probability pc. If the acquaintance decides to adopt the contagion, then it later becomes an initiator for spreading it. The process stops when 13.5% of the population has adopted the contagion. Those 13.5% of the population would be considered early adopters in the diffusion of innovations [75].
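The contagion process above can be sketched in Python as follows. This is a minimal illustration rather than the code used for the experiments: the function name `simulate_contagion`, the adjacency-list input, and the step cap are assumptions. It also assumes that the 13.5% stopping threshold counts the seed set, which the text does not state explicitly.

```python
import random


def simulate_contagion(neighbors, p_c, seed_frac=0.025, stop_frac=0.135,
                       rng=None, max_steps=10_000):
    """Return the set of adopters once stop_frac of the population has adopted.

    neighbors: {node: list of acquaintances}
    p_c:       fixed probability that a contacted acquaintance adopts
    """
    rng = rng or random.Random()
    nodes = sorted(neighbors)
    n_seed = max(1, round(seed_frac * len(nodes)))
    target = max(n_seed, round(stop_frac * len(nodes)))
    adopted = set(rng.sample(nodes, n_seed))       # randomly selected seed set
    for _ in range(max_steps):
        for u in sorted(adopted):                  # snapshot of current spreaders
            if len(adopted) >= target:
                return adopted
            if neighbors[u]:
                v = rng.choice(neighbors[u])       # one acquaintance per step
                if v not in adopted and rng.random() < p_c:
                    adopted.add(v)                 # v spreads in later steps
    return adopted
```

Comparing the numbers of short and long ties of the returned adopters against those of the remaining nodes gives the kind of contrast reported for adopters and non-adopters.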

Figure 3.18 shows that early adopters have on average more long ties than short ties. For Gowalla (a), the average adopter has 17.77 (std=38.67) short ties and 23.90 (std=111.57) long ties, compared to 3.81 (std=6.58) short ties and 2.99 (std=6.20) long ties for non-adopters. For FourSquare, the average adopter has 16.36 (std=27.67) short ties and 25.14 (std=54.11) long ties, compared to 1.62 (std=3.45) short ties and 1.63 (std=6.74) long ties for non-adopters. For the distribution of short and long ties of adopters and non-adopters, see Fig. 3.19 for Gowalla (a,b) and FourSquare (c,d). Since nodes in the social networks are more likely to adopt if they have more acquaintances, the point is that a job source, valuable idea, or even a social contagion is more likely to come from a weak tie because people have a limited number of strong ties but many more weak ties [61]. This experiment shows the collective strength of long ties by showing that people have a higher chance of adopting a new idea if they have more long ties.

[Figure 3.19 plots: Fraction versus Long Ties (log-scale); panels a) Adopters, b) Non-Adopters, c) Adopters, d) Non-Adopters.]

Figure 3.19: Distribution of long ties for adopters and non-adopters.

We generate the flow matrix F = (fij) and calculate λi as a proxy for φi. Figures 3.20 and 3.21 show the economic development of US states as a function of the speed of access to ideas for Gowalla and FourSquare, respectively. The metrics we used for economic development are gross GDP [76], the number of patents issued [77], and the number of startups, defined as non-profit firms with fewer than 20 employees [78]. Overall, results show that φi is highly correlated with economic development in the US.

[Figure 3.20 plots, each versus φi (log-scale): a) Gross GDP (log-scale): 2009, m=−0.67, r=−0.92; 2010, m=−0.67, r=−0.92; 2011, m=−0.66, r=−0.92; 2012, m=−0.67, r=−0.91. b) Patents (log-scale): 2009, m=−0.81, r=−0.76; 2010, m=−0.82, r=−0.77; 2011, m=−0.83, r=−0.77; 2012, m=−0.83, r=−0.79. c) Startups (log-scale): 2009, m=−0.59, r=−0.86; 2010, m=−0.59, r=−0.86; 2011, m=−0.59, r=−0.86.]

Figure 3.20: Economic development as a function of idea flow (Gowalla).

Tables 3.8 and 3.9 show results using other techniques that have been proposed in the literature for measuring economic development. The population density of a state is defined as the number of residents [79] divided by the state’s land area in sq. mi (excluding water) [80]. The social diversity of a state $i$, denoted as $D_i$, is defined as:

$$D_i = \frac{\sum_{j=1}^{m} p_{ij} \log(p_{ij})}{\log(m-1)} \tag{3.8}$$

where $p_{ij}$ is the number of edges connecting $A_i$ and $A_j$ divided by the number of edges leaving $A_i$ [74].
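The diversity measure can be illustrated with a short Python sketch. Two assumptions are made here and should be read as such: the sketch follows the conventional normalized Shannon entropy, which carries a leading minus sign so that the result lies between 0 and 1 (Eq. 3.8 as printed shows no minus sign), and terms with p_ij = 0 are treated as contributing 0. The function name `social_diversity` and the input layout are hypothetical.

```python
import math


def social_diversity(counts):
    """Normalized entropy of one area's ties to the other areas.

    counts: number of edges from area i to each of the other m - 1 areas.
    Returns a value in [0, 1]; 1 means ties are spread evenly.
    """
    total = sum(counts)
    k = len(counts)                      # plays the role of m - 1 in Eq. 3.8
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(k)
```

An area whose ties are spread evenly over four other areas scores 1, while an area whose ties all lead to a single other area scores 0.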

Table 3.8: Measuring economic development (Gowalla).

                     GDP        Patents    Startups
Population Density   r = 0.50   r = 0.45   r = 0.38
Social Diversity     r = 0.88   r = 0.74   r = 0.83
Ideas Flow           r = 0.92   r = 0.77   r = 0.86

In Table 3.8, results show that the speed of access to ideas φi in Gowalla is more correlated with economic development than population density and social diversity. In Table 3.9, there are two instances where social diversity is more correlated with economic development in FourSquare, but still less correlated than the results in Table 3.8.

[Figure 3.21 plots, each versus φi (log-scale): a) Gross GDP (log-scale): 2009, m=−0.59, r=−0.88; 2010, m=−0.59, r=−0.88; 2011, m=−0.59, r=−0.88; 2012, m=−0.59, r=−0.88. b) Patents (log-scale): 2009, m=−0.70, r=−0.71; 2010, m=−0.71, r=−0.72; 2011, m=−0.73, r=−0.74; 2012, m=−0.72, r=−0.74. c) Startups (log-scale): 2009, m=−0.51, r=−0.80; 2010, m=−0.51, r=−0.81; 2011, m=−0.51, r=−0.81.]

Figure 3.21: Economic development as a function of idea flow (FourSquare).

Table 3.9: Measuring economic development (FourSquare).

                     GDP        Patents    Startups
Population Density   r = 0.50   r = 0.45   r = 0.38
Social Diversity     r = 0.88   r = 0.74   r = 0.83
Ideas Flow           r = 0.92   r = 0.77   r = 0.86

[Figure 3.22 plots: Speedy Idea Flow φi versus Social Diversity Di. Panel a): linear fit y = −0.9*x + 13, r=−0.98. Panel b): linear fit y = −0.9*x + 13, r=−0.99.]

Figure 3.22: Speedy idea flow as a function of social diversity.

Results show that the speed of access to ideas is correlated with economic development in the US from 2009 to 2012 because it is a near perfect measure for social diversity, as shown in Fig. 3.22 for Gowalla (a) and FourSquare (b). However, the direction of causality between the two is still unknown. The results suggest that combining long ties and the spread of ideas might be an important indicator of economic development, in addition to population size, density, and social diversity. Aggregating and normalizing hundreds of thousands of long ties across the US removes the potential effect of ideas not traveling randomly. Unlike social diversity, population density did not perform as well as the other measures because it was designed to measure characteristics of cities and not geographical areas with diverse ranges of population densities (e.g., New York consists of dense NYC and sparse NYS), which limits its predictive power.

Finally, we focus only on a very specific dimension of social relationships (i.e., long ties) and ignore other ties that could lead to better correlations with economic development. While there are many more dimensions of human relationships (e.g., short ties, strong ties, friends from different communities, etc.), one particular dimension that could lead to better results within a geographical area is friends with different interests or skills, since they would complement each other in collaborations such as solving a difficult problem. Perhaps understanding the interplay of human relationships and economic development can suggest radical socially-driven alternatives in addition to the traditional stimulus packages for growing the economy [74] and a direction for studying urban growth [71].

3.7 Summary of Results

Contrary to the belief in the death of distance barrier to forming social ties [81],

we find that the creation of friendship between two people in Gowalla is more likely

to occur when they are geographically closer, and the likelihood of users being friends

rapidly decreases as the geographic distance between them increases. Such geographic

effects may help in designing spatially-aware community detection algorithms where

on average every two people in a community are separated by a few hops and also

likely to be within spatial proximity.

First, our data analysis of the Gowalla friendship network reveals two degrees of

geographical concentration where friends and friends-of-friends are more likely to be

within geographic proximity. Conversely, pairs of users who are separated by three or

more hops of friendship relation are unlikely to be within geographic proximity. Also,

friends who are within geographic proximity are more likely to physically interact by

going to the same places together than distant friends. Yet, the likelihood of physical

interactions among friends-of-friends is minuscule even though they are geographically

concentrated.

Second, we showed that covers can serve as a null model for examining com-

munity structures. For most quality metrics, small communities are more likely to

outperform large ones because it is much easier to find a small group to maximize a

particular metric. Therefore, comparing detected communities to covers tells us how

much better the algorithm is performing than a proposed null model for a given size

of the community.

Finally, we used the results from the covers and compared them to the com-

munities detected by modified CPM, unweighted and weighted IA, and GANXiS.

By incorporating spatial information into CPM to make the algorithm scalable, it

detected meaningful communities of a large online social network where members

are more likely to physically interact than members of a cover used as a null model.

From the NCP plots, we noticed the importance of small-size communities in large

social networks in which it is much harder to find a large community because humans

have limited resources to create and maintain relationships. We used the level of

physical interactions among members in a community as the final quality measure to

compare and validate the performance of the community detection algorithms to the

closest-friend-first cover.

Other applications that we foresee might benefit from such spatial effects in-

clude recommendation systems and link prediction by designing systems based on the

knowledge of users’ geographical locations, their social connections, and the structure

of their friendship communities. For instance, recommendation systems could be enriched by incorporating geographical information of users, their friends, and location-based ratings to increase the quality of the recommended items [82]. Link prediction

could be enriched by using pairs of users that are geographically close and belong to

the same community to predict how likely they will become friends or connected in

the future [83].

CHAPTER 4

SOCIAL RANKING TECHNIQUES

[Figure 4.1 diagram: a Social Graph of users P1 to P6 connected to a Web Graph of sites CNN, ABC, MSNBC, Fox, Yahoo, and Digg.]

Figure 4.1: Conceptualization of social ranking.

Previous work on the ranking of pages conceptualized the web as a network consisting of pages representing nodes and links representing directed edges, as illustrated

in Fig. 4.1. Advances in social networks enabled a different perspective of ranking

pages from a relationship point of view. For simplicity, the social network of users

illustrated in the top rectangular box in Fig. 4.1 consists of nodes P1, P2, ..., P6 where

an undirected edge between P1 and P2 represents a social relationship of the two users

and an undirected edge from P1 to CNN represents P1 broadcasting a CNN URL to

its ties P2, P3, and P4. Note that the edge from P1 to CNN is not a part of the social

network, but a connection between the web and social network.

4.1 Google Buzz & Twitter

We collected data from two networks on the web. The first is Google Buzz, a platform that combines social relationships and mini-blogging for information dissemination. The second is Twitter, where users choose to follow sources of information. Messages in these two networks contain URLs that give us clues about how users would rank the quality of the information coming from those URLs, using the techniques we describe later.

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Social Ranking Techniques for the Web,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp. 49-55.

We collected the Google Buzz data from early September of 2011 to the middle of

October of the same year. There were around 2.5M users who shared approximately

100M messages of which about 30M messages had URLs embedded in them. We

collected the Twitter data from early September of 2011 to late December of that year. There were around 1M users who shared approximately 300M messages, of which about 50M had URLs embedded in them. Additional details of the datasets for Google Buzz and Twitter are provided in Table 4.1 and Table 4.2. In these tables, “All URLs” refers to all representations of URLs embedded in messages; two different representations could be the same URL when they are masked by redirect services. “*URLs” refers to the final destinations of URLs that have been shared by at least two users within the network. In addition, we reduced the size of the datasets by

keeping users whose geographical locations were known. To pinpoint the geographical

location of a user, we extracted locations from their geo-tagged messages and used

the most frequent location as the location of their residence. Reduced networks are

shown in Table 4.3.
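The location heuristic described above can be sketched in a few lines. This is only an illustration: the function name `home_location` and the sample geo-tags are hypothetical, and the thesis does not say how ties between equally frequent locations were resolved.

```python
from collections import Counter

def home_location(geo_tags):
    """Return the most frequent location among a user's geo-tagged messages.

    Returns None when the user has no geo-tagged messages; such users
    would be dropped from the reduced network.
    """
    if not geo_tags:
        return None
    return Counter(geo_tags).most_common(1)[0][0]

# Hypothetical geo-tags extracted from one user's messages.
tags = ["Troy, NY", "Albany, NY", "Troy, NY", "Boston, MA", "Troy, NY"]
print(home_location(tags))
```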

Several caveats apply to the data collection. First, parsing URLs from messages is error-prone because humans have multiple ways of writing what is supposedly the same link; examples include URLs containing typos or spelling mistakes and URLs masked by redirect services. Second, given limits on hardware resources, bandwidth sharing, and data access, we attempted to collect as much data as we could for the purpose of ranking URLs on social media. Third, we were able to collect the entire connected component of Google Buzz with BFS sampling, which is why the sum of indegrees equals the sum of outdegrees. Twitter is a much larger network that consists of hundreds of millions of accounts [26]. When calculating the data summary of Twitter, we consider only users whose information has been collected, not users who are still waiting to be processed, which is why the sum of indegrees does not equal the sum of outdegrees.

Table 4.1: Data summary of Google Buzz.

            x̄         σX          ∑
Users       -         -           2,522,109
Inlinks     7.36      115.04      18,566,607
Outlinks    7.36      58.39       18,566,607
Messages    42.94     1,067.21    108,439,019
All URLs    11.67     21,706.36   34,472,205
*URLs       3.85      174.80      2,647,561

Table 4.2: Data summary of Twitter.

            x̄           σX           ∑
Users       -           -            1,057,163
Inlinks     17,675.58   334,127.10   18.69B
Outlinks    520.66      7,676.48     550,421,023
Messages    280.84      1,005.09     277,310,683
All URLs    44.26       45,359.19    46,532,403
*URLs       8.19        57.59        2,294,077

4.1.1 Categories of URLs

Figure 4.2 shows categories of 100 most popular and 100 randomly selected

URLs for Google Buzz (a,b) and Twitter (c,d). Popular URLs are defined by the

number of spreaders, that is the users who shared or re-shared a given URL. In Google

Buzz (a), 24% of popular URLs are from social media, 16% are about technological

products such as Apple, 15% are videos from Youtube, and so on. In Twitter (c), 41%

of popular URLs are from social media, 19% are videos from Youtube, 11% are image

related, and so on. Google Buzz has more URLs relating to technological products,

while Twitter has more popular URLs relating to social media. For random URLs,

Google Buzz has 27% of URLs from social media while for Twitter this number is

53%.

Table 4.3: Google Buzz (left) & Twitter (right) with geography.

                  Google Buzz                 Twitter
                  x̄        σX       ∑         x̄        σX       ∑
Users             -        -        24,813    -        -        15,036
Inlinks           8.30     39.92    206K      102.60   279.35   1.5M
Outlinks          8.30     33.45    206K      102.60   278.77   1.5M
Extracted URLs    260.66   978.30   6.5M      227.93   305.39   3.4M

[Figure 4.2: pie charts of URL categories.
a) Google Buzz, popular: Information 27%, Technology 16%, Youtube 15%, Twitter 15%, Images 9%, Google 7%, Facebook 4%, Yfrog 3%, FourSquare 2%, Games 1%, News 1%.
b) Google Buzz, random: Information 27%, News 21%, Twitter 11%, Technology 10%, Foursquare 7%, Youtube 6%, Tumblr 5%, Google 4%, Videos 2%, Last.fm 2%, Facebook 2%, and three slices below 1%.
c) Twitter, popular: Twitter 30%, Information 21%, Youtube 19%, Images 11%, Facebook 6%, Yfrog 5%, Technology 5%, News 3%.
d) Twitter, random: Twitter 28%, Information 23%, News 18%, Foursquare 11%, Yfrog 8%, Youtube 5%, Tumblr 5%, Facebook 1%, Technology 1%.]

Figure 4.2: Categories of popular (a,c) and random (b,d) URLs.

4.1.2 Spreaders & Affected Sets

From both the Google Buzz and Twitter datasets, we randomly chose 2,000 URLs with equal probability; we call these the random set of URLs. We also chose the 2,000 most-shared URLs, which we call the popular set of URLs. This gives two sets of URLs in each network and four sets in total. For each URL, we calculated the size of its affected set, which consists of the nodes that received the URL from the spreaders but chose not to spread it further.
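A minimal sketch of the affected-set computation follows. It assumes, as an illustration only, that a user "receives" a URL when they follow at least one spreader; the `followers` mapping and the node names are hypothetical.

```python
def affected_set(spreaders, followers):
    """Nodes that received a URL from a spreader but did not spread it.

    spreaders: iterable of nodes that posted the URL.
    followers: dict mapping a node to the set of nodes that follow it
               (assumed here to be the nodes that receive its posts).
    """
    spreaders = set(spreaders)
    reached = set()
    for s in spreaders:
        reached |= followers.get(s, set())
    # Exclude users who went on to spread the URL themselves.
    return reached - spreaders

# Hypothetical toy network: P1 and P2 both posted the URL.
followers = {"P1": {"P2", "P3", "P4"}, "P2": {"P1", "P5"}}
affected = affected_set({"P1", "P2"}, followers)
print(sorted(affected), len(affected))
```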

We also computed the average length of all shortest paths from 10 randomly chosen users to members of a random subset of spreaders. The results are shown in Fig. 4.3(a) for Google Buzz and Fig. 4.3(b) for Twitter. Each point on the plot is a URL; the x-axis corresponds to the size of the affected set on a logarithmic scale, and the y-axis corresponds to the average length of the shortest paths from the randomly chosen users to the spreaders. A red point is a URL from the random set, and a blue star is a URL from the popular set. The black line is a linear classifier that separates popular URLs from random URLs, and crosses are points that have been misclassified. We substitute a randomly selected subset for the entire spreader set simply as a matter of efficiency, because shortest-path computations are expensive in large networks, as mentioned by the authors of [84].

[Figure 4.3, panels (a) and (b): x-axis "Size of Affected Set (log-scale)", y-axis "Avg. Distance"; legend: Random, Popular.]

Figure 4.3: Shortest paths to URLs in Google Buzz (a) and Twitter (b).

In Fig. 4.3, we noticed that as the size of the affected set increases, the average

distance from randomly selected users to the information on the web page decreases

for random and popular sets of URLs in Google Buzz. This is because very large

affected sets increase the likelihood that a randomly chosen user has a path through

an affected user reaching a spreader. This agrees with our intuition that information collectively shared by users with high outdegrees has a greater coverage of dissemination. However, this correlation is weaker in Twitter due to the celebrity effect: some users have millions of followers and therefore create large affected sets on their own, for instance when a URL is shared in the network only by a celebrity. More importantly, affected sets influence our social ranking techniques, where the structure of the network instead of the web topology is used to rank pages or URLs.

4.1.3 Information Distances

Figure 4.4 shows the ultra small-world property of the distance from a randomly selected starter to popular and random URLs in Google Buzz (a) and Twitter (b). For each URL, we randomly selected 100 starters and calculated the length of the shortest path from each starter to the closest spreader of the URL. The densities of the number of hops are shown in Fig. 4.5 and the average shortest path lengths in Fig. 4.4.

[Figure 4.4, panels a) and b): plots of "Avg. Path Length" by URL category (YO, TW, YF, FA, IM, NE, TE, RA).]

Figure 4.4: Ultra small-world property from starters to information.

[Figure 4.5, panels a) and b): x-axis "Hop", y-axis "Density"; legend: Facebook, Images, News, Tech, Twitter, Youtube, Random.]

Figure 4.5: Densities of shortest path lengths from starters to URLs.

Results show that a randomly selected starter in Google Buzz is about one hop away from a popular URL, compared to about 2.5 hops from a random URL. For Twitter, a randomly selected starter is about 2 hops away from a popular URL and slightly further from a random URL. These average shortest path lengths to popular and random URLs are much shorter than the six degrees of separation in the Travers-Milgram small-world experiment [29], demonstrating that the distance from human to information is sometimes shorter than the distance from human to human.
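The distance from a starter to the closest spreader can be computed with a breadth-first search that stops at the first spreader it meets. The sketch below is illustrative only: it assumes edges point from a user to the users they follow, and it counts hops to the spreader without adding a hop for the spreader-to-URL edge; the thesis does not state either convention explicitly.

```python
from collections import deque

def hops_to_closest_spreader(follows, starter, spreaders):
    """Length of the shortest path from starter to any spreader, or None.

    follows: dict mapping a node to an iterable of nodes it follows.
    """
    spreaders = set(spreaders)
    if starter in spreaders:
        return 0
    seen = {starter}
    queue = deque([(starter, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in follows.get(node, ()):
            if nxt in seen:
                continue
            if nxt in spreaders:
                return dist + 1  # BFS guarantees this is the minimum
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # no spreader is reachable from this starter

# Hypothetical toy graph: a follows b and c, b follows d, d follows e.
follows = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
print(hops_to_closest_spreader(follows, "a", {"d", "e"}))
```

Averaging this value over the randomly selected starters of a URL gives the kind of per-URL average path length discussed above.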

4.1.4 Geographical Distances

[Figure 4.6, panels a) Hop 1 through f) Hop 6: x-axis "Geographic Distances (km)", y-axis "Density"; legend: B, T, G, F.]

Figure 4.6: Two degrees of spatial concentration.

Figure 4.6 shows the geographical concentration of pairs of users who are separated by a fixed number of hops in Google Buzz and Twitter, and in two additional networks: Gowalla and FourSquare. We noticed that these four social networks have two degrees of spatial concentration: users who are separated by one or two hops are more geographically concentrated than pairs who are separated by three hops or more. For instance, the fraction of pairs within 560 km is 69% for friendship pairs (hops=1, shown in a), 47% for friends-of-friends pairs (hops=2, shown in b), 25% for hops=3 (shown in c), 20% for hops=4 (shown in d), 17% for hops=5 (shown in e), and 17% for hops=6 (shown in f). An explanation for these two degrees of concentration is the effect of the local clustering coefficient of a user, defined as the fraction of pairs of its friends who are friends with each other. For the probability that two people with a friend in common are friends themselves to be high, they need to be within some geographical proximity; otherwise the opportunity for them to interact is small. The average local clustering coefficients of 104 randomly selected pairs of users in Google Buzz, Twitter, Gowalla, and FourSquare are 0.31, 0.36, 0.30, and 0.34, respectively.
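For reference, the local clustering coefficient mentioned above can be computed as follows. This is a sketch under the assumption that friendship is stored as a symmetric adjacency mapping; the graph and node names are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's friends that are friends with each other.

    adj: dict mapping a node to the set of its friends (symmetric).
    """
    friends = adj.get(v, set())
    k = len(friends)
    if k < 2:
        return 0.0  # no pairs of friends to compare
    linked = sum(1 for a, b in combinations(friends, 2)
                 if b in adj.get(a, set()))
    return linked / (k * (k - 1) / 2)

# Hypothetical toy graph: v has friends a, b, c; only a and b know each other.
adj = {
    "v": {"a", "b", "c"},
    "a": {"v", "b"},
    "b": {"v", "a"},
    "c": {"v"},
}
print(local_clustering(adj, "v"))
```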


Figure 4.7: Four dimensions of social relationships.

4.1.5 Densities of Social Relationships

Four dimensions of social relationships are visualized in Fig. 4.7. Friends are defined as reciprocal following relationships. Neighbors are users that are geographically close. Peers are users that belong to the same community. Interests are users that have similar interests, measured by the keyword similarity of the URLs they share. The intersection of circles represents pairs of users with multiple dimensions of social relationships. Two represents pairs of users with two dimensions of social relationships, such as being friends and neighbors.

Table 4.4: Social relationships densities in Google Buzz.

Buzz              Friends   Peers   Interests   Neighbors
Among Friends     -         0.99    0.09        0.58
Among Peers       0.26      -       0.25        0.41
Among Interests   0.01      0.32    -           0.06
Among Neighbors   0.05      0.50    0.13        -
Among Random      0.01      0.27    0.06        0.03

Tables 4.4-4.5 show the densities of friends, peers, neighbors, and users with similar interests. The left column gives the relationship that defines the pairs, and the top row gives the relationship whose density is measured among those pairs. For example, among friends in Google Buzz (Table 4.4), 99% are also peers, 9% have similar interests, and 58% are neighbors. For Twitter, among friends, 85% are peers, 11% have similar interests, and 30% are neighbors. The densities of friends, peers, interests, and neighbors are consistent in Google Buzz and Twitter. For example, most of the friends are among peers, most of the peers are among friends, most of the people with similar interests are among peers, and most of the neighbors are among friends.

Table 4.5: Social relationships densities in Twitter.

Twitter           Friends   Peers   Interests   Neighbors
Among Friends     -         0.85    0.11        0.30
Among Peers       0.32      -       0.12        0.29
Among Interests   < 0.01    0.19    -           0.03
Among Neighbors   0.01      0.36    0.09        -
Among Random      < 0.01    0.13    0.04        0.02

[Figure 4.8, panels a) and b): x-axis "Geographical Distance (km)", y-axis "Avg. CKS"; legend: Friends, Followings, Peers, Random.]

Figure 4.8: CKS for friendship, following, peers, and random pairs.

4.1.6 Keyword Similarity

Figure 4.8 shows the cosine keyword similarity (CKS) of selected friendship, following, peers, and random pairs of users in Google Buzz (a) and Twitter (b). The CKS of two users is the cosine of the angle between the two vectors consisting of keyword frequencies extracted from the webpages shared by these two users.

Let Wv and Wv′ be lists of words in web pages that users v and v′ have shared. Let Av be a vector of word frequencies where the ith index in Av represents the number of times the word wi appears in Wv. The keyword cosine similarity for v and v′ is defined as:

$$ \cos(v, v') = \frac{A_v \cdot A_{v'}}{\|A_v\| \, \|A_{v'}\|}. \qquad (4.1) $$
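As an illustration of Equation 4.1, the following sketch computes the CKS of two users from their word lists. The tokenization of web pages into words is outside its scope, and the function name `cks` and the sample words are hypothetical.

```python
import math
from collections import Counter

def cks(words_v, words_w):
    """Cosine of the angle between two word-frequency vectors (Eq. 4.1)."""
    a, b = Counter(words_v), Counter(words_w)
    # Only words that occur in both lists contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no words is similar to nobody
    return dot / (norm_a * norm_b)

# Hypothetical keyword lists extracted from pages shared by two users.
w_v = ["apple", "apple", "iphone"]
w_w = ["apple"]
print(cks(w_v, w_w))
```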

A pair of nodes (v, v′) is a friendship pair if the two nodes follow each other, a following pair if v follows v′ but not vice versa, and a random pair if there is no following in either direction. We calculated the average CKS of friendship, following, peers, and random pairs as a function of the geographical distance separating the members of these pairs. For random pairs, we noticed that CKS decreases as the geographical distance increases. On the other hand, the effect of geography on cosine keyword similarity is negligible for friendship, peer, and following pairs. However, these pairs have a higher cosine keyword similarity than random pairs.

4.2 Social Ranking Techniques

Let GU = (V,E) be a directed multi-labeled graph where V is the set of nodes, E is the set of edges where e = (vi, vj) represents a directed edge from node vi to node vj, and U is the set of URLs, with subsets of which the nodes in V are labeled. For a URL u ∈ U, let S(u) denote the set of all spreaders of u; in other words, all nodes in V that have posted u.

4.2.1 PageRank on Social Network

We extend the PageRank algorithm to rank URLs on a social network (PRSN) as follows. Given a multi-labeled graph GU = (V,E), let F = (fij) be an n × n weighted adjacency matrix where n is the number of nodes (i.e., n = |V|), fij = 0 if there is no directed edge from vi to vj, and fij = 1/deg(i) otherwise, where deg(i) is the number of outgoing links of vi. Let R be a vector of n elements where the ith element of R, denoted ri, is the PageRank score of the ith node. Let k be the maximum number of iterations that the PageRank algorithm runs. At the first iteration, every node sends through each outgoing link its score divided by the number of links pointing from this node to other nodes. After that, each node updates its score to the sum of the scores that it has received:

$$ r_i = f_{1i} r_1 + f_{2i} r_2 + \dots + f_{ni} r_n. \qquad (4.2) $$

If there is an edge from node j to node i, then fji > 0 and node j will send the fraction fji = 1/deg(j) of its score rj to node i. Equation 4.2 can be compactly written as R<1> = F^T R<0>, where F^T is the transpose of the matrix F, the superscript <1> denotes the scores of all nodes after the first iteration, and R<0> is the initial vector. Let R<k> be the scores of the nodes at the kth (k > 0), or last, iteration, defined by induction as:

$$ R^{<k>} = F^{T} R^{<k-1>}. \qquad (4.3) $$

If there are sinks in the graph G, that is, nodes without outgoing edges, then for large enough k they will absorb all scores, since scores can enter but cannot leave the sinks. One way to fix this problem is to scale the strength of links by a constant factor 0 < σ < 1 and to compensate for this scaling by adding an artificial flow between any two nodes with weight (1 − σ)/n. This solution is known as the scaled version of PageRank [85]. The score of the ith node is then denoted r′i and is defined as:

$$ r'_i = \sum_{j=1}^{n} \left( \sigma f_{ji} + \frac{1-\sigma}{n} \right) r'_j. \qquad (4.4) $$

Equation 4.4 can be compactly written in the form of Equation 4.3 using the matrix F′ = σF + (1 − σ)/n, where the constant (1 − σ)/n is added to every entry of σF. By the Perron-Frobenius Theorem [85], the scaled PageRank scores converge to a stable solution:

$$ R'^{<i>} = F'^{T} R'^{<i-1>} \quad \text{where } 0 < i \le k. \qquad (4.5) $$

Given a subset of URLs U′ ⊂ U, the PageRank score of a URL u ∈ U′ on a social network (PRSN) is defined as:

$$ \mathrm{PRSN}(u) = \frac{\sum_{v_i \in S(u)} r'^{<k>}_i}{\sum_{u' \in U'} \sum_{v_i \in S(u')} r'^{<k>}_i}. \qquad (4.6) $$
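The scaled PageRank iteration (Equation 4.4) and the PRSN score (Equation 4.6) can be sketched as below. Several choices are assumptions made for illustration and are not taken from the thesis: the uniform initial vector, σ = 0.85, 50 iterations, and the toy graph and spreader sets.

```python
def scaled_pagerank(out_links, sigma=0.85, iterations=50):
    """Iterate r'_i = sum_j (sigma * f_ji + (1 - sigma) / n) * r'_j.

    out_links: dict mapping every node to the list of nodes it points to;
               every target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # assumed uniform R<0>
    for _ in range(iterations):
        total = sum(rank.values())
        # Artificial flow: every node receives (1 - sigma) / n of all scores.
        new_rank = {v: (1.0 - sigma) / n * total for v in nodes}
        for j in nodes:
            targets = out_links[j]
            for i in targets:
                # f_ji = 1 / deg(j), scaled by sigma.
                new_rank[i] += sigma * rank[j] / len(targets)
        rank = new_rank
    return rank

def prsn(url_spreaders, rank):
    """Eq. 4.6: score mass of a URL's spreaders, normalized over all URLs."""
    mass = {u: sum(rank[v] for v in s) for u, s in url_spreaders.items()}
    total = sum(mass.values())
    return {u: m / total for u, m in mass.items()}

# Hypothetical toy network and spreader sets S(u).
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = scaled_pagerank(out_links)
scores = prsn({"u1": {"c"}, "u2": {"b"}}, rank)
print(scores)
```

In this toy graph node c receives links from both a and b, so the URL spread by c obtains the larger PRSN score.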

4.2.2 HITS on Social Network

The HITS algorithm used to rank URLs on a social network (HSN) is defined as follows [35], [85]. Given GU = (V,E), let M = (mij) be an n × n adjacency matrix where n is the number of nodes, mij = 1 if there is a directed edge from node vi to node vj, and mij = 0 otherwise. Let k be the maximum number of iterations. Given a set of URLs U′ ⊂ U, let H and A be the vectors of scores for hubs and authorities, respectively. Authorities are the URLs (i.e., u ∈ U′) and hubs are the nodes that share these URLs. The ith element of the vector H is the score of the ith hub, and the jth element of the vector A is the score of the jth authority. At the first iteration, the score hi of a hub is set to the number of authorities to which it points, and the score aj of an authority is set to the sum of the scores of the hubs pointing to it. More formally, hi and aj are defined as:

$$ h^{<0>}_i = m_{i1} + m_{i2} + \dots + m_{in}, \qquad (4.7) $$

$$ a^{<0>}_j = m_{1j} h^{<0>}_1 + m_{2j} h^{<0>}_2 + \dots + m_{nj} h^{<0>}_n. \qquad (4.8) $$

Let H<l> and A<l> be the scores of hubs and authorities at iteration l. The HITS algorithm [85] can then be written as:

$$ H^{<l>} = (M M^{T})^{l} H^{<0>} \quad \text{where } 0 < l \le k, \qquad (4.9) $$

$$ A^{<l>} = (M^{T} M)^{l-1} M^{T} H^{<0>} \quad \text{where } 0 < l \le k. \qquad (4.10) $$

Finally, the score of a URL among the authorities is the value a<k>j normalized by the sum of the scores in the vector A.
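A compact sketch of this procedure is given below. It is an illustration under stated assumptions: the matrix M is represented implicitly by a hypothetical mapping from each user (hub) to the set of URLs (authorities) they shared, and the authority vector is rescaled at every iteration to keep the numbers small, which does not change the final normalized scores because the updates are linear.

```python
def hits_url_scores(shares, k=20):
    """Authority scores of URLs after k >= 1 HITS iterations.

    shares: dict mapping a user to the set of URLs the user shared.
    """
    users = list(shares)
    urls = sorted({u for s in shares.values() for u in s})
    # Eq. 4.7: a hub starts with the number of authorities it points to.
    hub = {v: float(len(shares[v])) for v in users}
    auth = {}
    for _ in range(k):
        # A<l> = M^T H<l-1>: an authority sums the hubs pointing to it.
        auth = {u: sum(hub[v] for v in users if u in shares[v]) for u in urls}
        total = sum(auth.values())
        auth = {u: a / total for u, a in auth.items()}
        # H<l> = M A<l>: a hub sums the authorities it points to.
        hub = {v: sum(auth[u] for u in shares[v]) for v in users}
    return auth

# Hypothetical sharing data: three users and three URLs.
shares = {"p1": {"u1", "u2"}, "p2": {"u1"}, "p3": {"u1", "u3"}}
hsn = hits_url_scores(shares)
print(sorted(hsn, key=hsn.get, reverse=True)[0])
```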

4.2.3 Ranking with Maximum Flow

We defined the following maximum flow algorithm to rank URLs on a social

network. Given a graph GU = (V,E) and a subset of URLs U ′ ⊂ U , let p represent

a node. We want to rank the URLs in U ′ with respect to p and G by constructing a

directed flow graph denoted as G′p = (V ′, E ′).

The first part of the construction requires copying the social structure of G to G′p. For every node vi that p follows, we add vi to V′ and the edge e = (p, vi) to E′. At each subsequent iteration, we repeat the same process for every node that was added to V′ in the previous iteration; that is, if vi was added to V′ and there is an edge e = (vi, vj), then we add vj to V′ if vj has not been added before. The edge e = (vi, vj) is added to E′ even if vj has been added before. This process of constructing the graph G′p continues until all nodes from V that are reachable from p have been added to V′. For practical reasons, it is wise to stop while the diameter of G′p is still small (e.g., three), to reflect the influence of nodes that are within network proximity. At the end of the process, each edge originating from a node v gets a weight equal to the inverse of the degree of v in G′p.

[Figure 4.9: flow graph with the source p, the information and social network nodes P1-P5, the web pages u1 and u2, and the super sink t.]

Figure 4.9: Graph G′p for ranking URLs {u1, u2} with respect to node p.

The second part of constructing G′p introduces some additional nodes and edges.

For every URL u′ ∈ U ′, we add u′ into V ′. For every spreader s ∈ S(u′) of the URL

u′, we add an edge e = (s, u′) with a weight of 1 into E ′ if s ∈ V ′. We add a super

sink denoted t into V ′ and add an edge e = (u′, t) with an edge weight of 1 for every

URL u′ in U ′.

The maximum flow of the graph G′p from source p to super sink t is a function F that assigns a non-negative value to each edge so as to maximize the total flow from the source p to the super sink t while satisfying two conditions: first, the flow on an edge does not exceed the weight of that edge, i.e., F(e) ≤ ce; and second, it obeys the conservation of flow law at every node except the source p and the super sink t, i.e.,

$$ F_{out}(v) = \overbrace{\sum c_e}^{\text{Flow out to social ties}} + \overbrace{\sum c'_e}^{\text{Flows out to pages}} = F_{in}(v) \qquad (4.11) $$

where ce is the assigned flow for the edge e = (vi, vj) between two nodes, and c′e is the assigned flow for the edge e′ = (vi, uj) between the node vi and the URL uj. The construction of the graph G′p is illustrated in Fig. 4.9. Polynomial running time algorithms for finding the maximum flow, such as the Edmonds-Karp algorithm, which runs in O(|V′||E′|^2) time, can be found in [85], [86].
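The following sketch runs a plain Edmonds-Karp maximum flow on a small flow graph built as described above. Two points are assumptions made for illustration rather than statements from the text: each URL is scored by the flow on its edge to the super sink, and the toy graph (p follows P1, P2, and P3, so each of its edges has weight 1/3) is hypothetical. A maximum flow is not always unique, so on other graphs the per-URL flows may depend on the order in which augmenting paths are found.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths found by BFS."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0.0)  # reverse edge
    value = 0.0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path is left
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= push
            residual[b][a] += push
        value += push
    flow = {u: {v: max(0.0, c - residual[u][v]) for v, c in nbrs.items()}
            for u, nbrs in capacity.items()}
    return value, flow

def rank_urls(capacity, source, sink, urls):
    """Order URLs by the flow they pass on to the super sink (assumed score)."""
    _, flow = max_flow(capacity, source, sink)
    scores = {u: flow[u][sink] for u in urls}
    return sorted(urls, key=lambda u: scores[u], reverse=True), scores

third = 1.0 / 3.0
capacity = {
    "p": {"P1": third, "P2": third, "P3": third},  # inverse of p's degree
    "P1": {"u1": 1.0},   # spreader-to-URL edges have weight 1
    "P2": {"u1": 1.0},
    "P3": {"u2": 1.0},
    "u1": {"t": 1.0},    # URL-to-sink edges have weight 1
    "u2": {"t": 1.0},
}
order, scores = rank_urls(capacity, "p", "t", ["u1", "u2"])
print(order)
```

Here u1 is reachable through two of the three people p follows, so it carries two thirds of the flow and is ranked first.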

4.2.4 Variants of Maximum Flow

The second variant of network flow incorporates social relationships and geography by assigning weights to edges based on the geographical distance between the nodes. We assign the edge weight for nodes vi and vj as:

$$ w_{ij} = \frac{gd(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} gd(v_i, v_k)^{-1}} \qquad (4.12) $$

where gd(vi, vj) is the geographical distance from vi to vj and the sum runs over the out-neighbors vk of vi. The third variant uses cosine keyword similarity to assign the weights. The edge weight for nodes vi and vj is defined as:

$$ w_{ij} = \frac{CKS(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} CKS(v_i, v_k)^{-1}}. \qquad (4.13) $$

The last variant of network flow uses community structure by replacing the social network with the community group and connecting the source to all members of the community. The (binary) weights of the edges in the community do not take into account geography or cosine keyword similarity, so their values are 1.
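Equations 4.12 and 4.13 share the same form: take the inverse of a per-neighbor quantity and normalize over the out-neighbors of vi. A minimal sketch of that normalization follows; the function name and the distances are hypothetical, and all input values are assumed to be strictly positive.

```python
def inverse_normalized_weights(values):
    """Edge weights w_ik from per-neighbor values such as gd(v_i, v_k).

    values: dict mapping each out-neighbor v_k of v_i to a positive number.
    Each weight is the inverse value divided by the sum of inverse values.
    """
    inverse = {k: 1.0 / x for k, x in values.items()}
    total = sum(inverse.values())
    return {k: x / total for k, x in inverse.items()}

# Hypothetical geographical distances (km) from v_i to its two out-neighbors.
distances = {"a": 100.0, "b": 400.0}
weights = inverse_normalized_weights(distances)
print(weights)
```

With these distances the nearer neighbor receives four times the weight of the farther one, and the weights of a node's outgoing edges sum to 1.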

4.3 Social Ranking Experiments

4.3.1 Comparing PageRank & HITS

We selected 30 URLs from each of the popular and random URL sets. For each selected URL, we calculated its score by using PageRank and HITS, and ranked the URLs (i.e., 1st, 2nd, 3rd, etc.) with respect to the set. We compared the ranking results of PageRank and HITS for popular and random URLs, shown in Fig. 4.10 for Google Buzz and Fig. 4.11 for Twitter. The ranking results for Google Buzz are listed in Table 4.6 and Table 4.7.

The rankings of popular URLs produced by PageRank and HITS are more consistent than those of random URLs. We measured the ranking consistency as the average difference of the two ranking algorithms on a set of URLs, i.e.,

$$ \frac{1}{w} \sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|, $$

and as the sum of differences, i.e.,

$$ \sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|, $$

where Px(u) is the position of the URL u determined by the algorithm x and w is the number of URLs. The average difference is more appropriate than the sum of differences for ranking a large number of pages, for example 1000 pages instead of 5 pages: the average gives the typical per-page disagreement of the two ranking algorithms over the 1000 pages, whereas the sum gives their total disagreement in ranks, which grows with the number of pages. For a smaller number of pages, the sum might be more appropriate for quantifying the difference between two ranking algorithms.

For the popular URLs in Google Buzz, the average difference was 2.9 meaning

that on average HITS and PageRank were off by 3 positions and the sum of differences

between them was 86. For the random URLs in Google Buzz, the average difference

was 9.6 and the sum of differences between them was 288. For the popular URLs

in Twitter, the average difference was 5.9 and the sum of differences between them

was 178. For random URLs in Twitter, the average difference was 7.2 and the sum

of differences between them was 216. In both networks, popular URLs are ranked

more consistently than random URLs which makes the HITS algorithm more suitable

than PageRank when ranking viral information because it is computationally more

efficient.
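The two consistency measures are simple to compute. As a check, the sketch below applies them to the PRSN and HSN positions of the 30 popular Google Buzz URLs listed in Table 4.6 (in table order); the function name is hypothetical.

```python
def rank_differences(pos_a, pos_b):
    """Return (average, sum) of absolute position differences."""
    diffs = [abs(a - b) for a, b in zip(pos_a, pos_b)]
    total = sum(diffs)
    return total / len(diffs), total

# PRSN and HSN positions of the 30 popular URLs, copied from Table 4.6.
prsn_pos = list(range(1, 31))
hsn_pos = [1, 2, 10, 14, 9, 7, 4, 3, 8, 5,
           6, 11, 15, 12, 19, 13, 21, 17, 16, 23,
           22, 25, 24, 18, 20, 27, 28, 29, 26, 30]

avg_diff, sum_diff = rank_differences(prsn_pos, hsn_pos)
print(round(avg_diff, 1), sum_diff)  # 2.9 86
```

The output matches the average difference of 2.9 and the sum of differences of 86 reported for popular URLs in Google Buzz.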

[Figure 4.10, panels (a) Popular URLs and (b) Random URLs: scatter plots of rank positions 0-30, x-axis "PageRank on Social Network", y-axis "HITS on Social Network"; points are labelled with the URL domains listed in Tables 4.6 and 4.7.]

Figure 4.10: Ranking URLs on Google Buzz.

4.3.2 Flow Ranking

We noticed that the ranking results determined for individual users with maximum flow are less correlated with one another than the results computed by PageRank and HITS. First, we compared the ranking results of maximum flow with those of PageRank and HITS for Google Buzz, shown in Fig. 4.12 for popular URLs and Fig. 4.13 for random URLs. The first and second plots are the ranking results for popular URLs and the third and fourth plots are the ranking results for random URLs, as labelled by their sub-captions. A point on a graph is a URL, where the x-axis is the ranking position of the URL determined by maximum flow and the y-axis is the ranking position determined by either PageRank or HITS, as labelled on the y-axis. The identical layout for Twitter is shown in Fig. 4.14 for popular URLs and Fig. 4.15 for random URLs.

[Figure 4.11, panels (a) Popular URLs and (b) Random URLs: scatter plots of rank positions 0-30, y-axis "HITS on Social Network"; points are labelled with URL domains.]

Figure 4.11: Ranking URLs on Twitter.

[Figures 4.12-4.15, two panels each: scatter plots of rank positions 0-30; y-axis "PageRank on Social Graph" in one panel and "HITS on Social Graph" in the other; x-axis "Personalized Ranking with Maximum Flow" and legend (Person 1, Person 2, Person 3, Person 4, y=x) as printed in Fig. 4.12(a).]

Figure 4.12: Social ranking with popular URLs on Google Buzz.

Figure 4.13: Social ranking with random URLs on Google Buzz.

Figure 4.14: Social ranking with popular URLs on Twitter.

Figure 4.15: Social ranking with random URLs on Twitter.

Table 4.6: Ranking results of 30 popular URLs in Google Buzz.

URLs                     PRSN   HSN   MF
abcnews.go               1      1     9/12/10/15
youtube                  2      2     5/7/5/6
yahoo                    3      10    1/2/2/4
businessweek             4      14    10/14/12/14
bloomberg                5      9     10/14/13/12
wordpress                6      7     5/5/7/9
nytimes                  7      4     10/14/6/10
appleinsider             8      3     10/14/13/16
facebook                 9      8     1/1/1/1
wired                    10     5     9/14/13/15
lockerz                  11     6     4/6/6/6
apple                    12     11    6/8/9/8
pcworld                  13     15    8/13/10/7
guardian                 14     12    10/14/8/10
reuters                  15     19    10/14/10/16
ted                      16     13    9/13/7/10
amazon                   17     21    8/9/8/10
techcrunch               18     17    8/13/9/14
engadget                 19     16    9/13/7/7
reddit                   20     23    10/13/8/11
empireavenue             21     22    9/14/11/15
boston                   22     25    3/3/3/3
xkcd                     23     24    2/4/8/2
whitehouse               24     18    9/14/11/14
gizmodo                  25     20    7/10/12/12
pingchat                 26     27    9/12/12/14
thesocialnetwork-movie   27     28    9/14/13/14
bbc                      28     29    10/11/4/13
photofocus               29     26    8/14/13/16
stackoverflow            30     30    6/11/12/12

4.3.3 Rank Differences

For personalized ranking, we measured the ranking consistency as the average difference between the rankings of a pair of users with respect to a URL set. For instance, in Table 4.8, the left column and the top row are the four selected users, and the element aij is the average difference between users i and j. Note that the upper triangle (elements above the diagonal) refers to the random URLs and the lower triangle (elements below the diagonal) refers to the popular URLs. The right column gives the outdegree of the users in the random URLs, and the last row gives the outdegree of the users in the popular URLs. For Twitter, the ranking results in the same format are given in Table 4.9.

Table 4.7: Ranking results of 30 random URLs in Google Buzz.

URLs                     PRSN   HSN   MF
networkedblogs           1      28    6/5/7/2
picasaweb.google         2      29    1/3/1/5
ping.fm                  3      1     5/4/4/4
thenextweb               4      3     8/7/8/3
twitter                  5      18    12/17/13/10
income4free              6      17    2/1/2/1
fastestwaylosebellyfat   7      19    10/9/10/10
digg                     8      25    12/19/12/5
sports.espn.go           9      4     4/6/6/6
wired                    10     5     12/21/9/9
businessinsider          11     13    3/2/3/8
forbes                   12     12    7/12/12/9
foxnews                  13     27    11/13/5/9
behance                  14     11    11/23/13/8
huffingtonpost           15     23    12/20/11/7
entrepreneur             16     2     12/21/13/10
puntogov                 17     15    12/23/13/10
addictivefonts           18     6     10/14/13/9
theprism                 19     30    12/20/13/10
telegraph                20     22    9/10/13/10
npr                      21     7     10/19/13/10
popsci                   22     16    10/11/13/10
economist                23     10    12/16/13/10
marketwatch              24     8     8/8/13/10
opencog                  25     9     12/23/13/8
dslreports               26     26    12/15/13/10
last.fm                  27     24    12/23/13/10
tech.slashdot            28     20    12/22/13/10
wimp                     29     21    12/18/13/10
socialturns              30     14    12/18/13/10

For random URLs in Google Buzz, we noticed that persons p1 and p3 have an average difference of 1.7, whereas p2 and p4 have an average difference of 6.7. For popular URLs, the variability is smaller: p4 and p2 have an average difference of 2.0, and p1 and p2 have an average difference of 3.2. Outdegree measures the number of people a user follows, on whom the ranking results are based. Finally, ties are expected when using maximum flow, since the number of URLs shared among friends is minuscule compared to the number of pages in the deep Web. Therefore, we simply use PageRank or HITS to break ties among pages when necessary.

Table 4.8: Avg. ranking differences in Google Buzz.

-          p1    p2    p3    p4    outdegree
p1         -     5.1   1.7   2.4   369
p2         3.2   -     4.8   6.7   4,505
p3         2.5   2.6   -     3.1   1,125
p4         3.2   2.0   2.5   -     102
out deg.   159   355   503   340

Table 4.9: Avg. ranking differences in Twitter.

-          p1    p2    p3    p4      outdegree
p1         -     1.5   2.0   4.0     203
p2         3.7   -     3.0   3.8     122
p3         3.3   3.3   -     4.6     426
p4         3.7   3.8   5.2   -       119
out deg.   324   158   129   1,731

4.3.4 Rank Distributions

We examine the variants of flow ranking as follows. We selected a user in Twitter and then selected the top 25 URLs, in terms of CKS, shared by the people that this user follows. These 25 URLs contain keywords similar to those of the URLs that this user has previously shared. Once we have the candidate URLs, we use network flow to re-rank them, taking into account social relationships, the effect of geography, and community structure. Results show a re-ordering in which geography reduces the number of URLs with positive scores, because only spreaders of URLs who are geographically close are considered. Community structure, on the other hand, distributes the scores of URLs more evenly, since more spreaders are taken into consideration. This flexibility allows users to select information that is locally relevant when appropriate, or to select information of potential interest from their community members.

[Figure 4.16, panels a), b), c): x-axis "Tau", y-axis "P(x)"; legend: Twitter, Buzz.]

Figure 4.16: Densities of rank correlation coefficient.

[Figure 4.17, panels a) and b): y-axis "Avg. NCDG"; x-axis categories: Flow O, Flow G, Flow I, Flow C, PR, BL.]

Figure 4.17: Ranking quality results.

Figure 4.16 shows the rank correlation coefficient of URLs between variants of network flow and PageRank. For a selected user, we selected 25 URLs from its neighborhood and ranked these URLs using variants of network flow: without geography (a), with geography (b), and with community (c). Given a set of URLs U, let Ru(v) and Ru(v′) be the ranking results for nodes v and v′. The rank correlation coefficient, denoted τ, is defined as:

$$ \tau = \frac{n_c - n_d}{0.5\,k(k-1)} \qquad (4.14) $$

where k = |U|, nc is the number of concordant pairs, and nd is the number of discordant pairs in Ru(v) and Ru(v′). A value of τ = 1 means the ranking results are identical, −1 means they are in reverse order, and 0 means they are independent. Results show that personalized ranking using network flow is highly independent of PageRank.
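Equation 4.14 can be evaluated directly by counting concordant and discordant pairs. The sketch below assumes both rankings cover the same URLs and contain at least two of them; pairs that are tied in either ranking count as neither concordant nor discordant. The rankings shown are hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """tau = (n_c - n_d) / (0.5 * k * (k - 1)) for two rankings of k URLs.

    rank_a, rank_b: dicts mapping each URL to its position.
    """
    urls = list(rank_a)
    k = len(urls)
    n_c = n_d = 0
    for x, y in combinations(urls, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            n_c += 1   # both rankings order x and y the same way
        elif sign < 0:
            n_d += 1   # the rankings disagree on x and y
    return (n_c - n_d) / (0.5 * k * (k - 1))

# Hypothetical positions of three URLs under two ranking methods.
flow_rank = {"u1": 1, "u2": 2, "u3": 3}
page_rank = {"u1": 1, "u2": 3, "u3": 2}
tau = kendall_tau(flow_rank, page_rank)
print(tau)
```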

4.3.5 Rank Validation

Fig. 4.17 shows ranking quality results for Google Buzz (a) and Twitter (b) using the four variants of network flow, PageRank applied to the social/information network, and the baseline. The y-axis is the normalized discounted cumulative gain (NDCG; labeled NCDG in the figure), which is used to benchmark the quality of ranking results and is defined below. For this experiment, we selected 50 users and 100 URLs from each user's neighborhood. Then we ranked these URLs using the six ranking techniques.

NDCG is defined as follows. Let p be a source node and R a list of ranked URLs for p. The discounted cumulative gain (DCG) for R with respect to p is:

\[ \mathrm{DCG}(R, p) = \delta(R_1, p) + \sum_{i=2}^{w} \frac{\delta(R_i, p)}{\log(i)} \tag{4.15} \]

where δ(Ri, p) is 1 if Ri is relevant to p and 0 otherwise, and w is the number of pages to be ranked. We assume Ri is relevant to p if p has shared Ri before. The normalized discounted cumulative gain (NDCG) is the DCG divided by the DCG of the optimal ordering of R with respect to p. The optimal ordering is defined using the pages that the user later shared.
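A minimal sketch of Eq. 4.15 and its normalization follows. Two points are assumptions of the sketch rather than statements from the text: the logarithm base (the equation writes log(i) without a base, so base 2 is used as a default parameter), and the construction of the optimal ordering by sorting the binary relevance labels in decreasing order.

```python
import math

def dcg(relevance, log_base=2):
    """DCG as in Eq. 4.15: rel_1 + sum over i = 2..w of rel_i / log(i)."""
    if not relevance:
        return 0.0
    total = float(relevance[0])
    for i, rel in enumerate(relevance[1:], start=2):
        total += rel / math.log(i, log_base)  # log base is an assumption
    return total

def ndcg(relevance, log_base=2):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg(sorted(relevance, reverse=True), log_base)
    return dcg(relevance, log_base) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list: 1 = the user shared the URL, 0 = did not.
print(ndcg([0, 1, 1]))
```

A list that already places all relevant URLs first scores 1, and a list with no relevant URLs is given a score of 0 by convention in this sketch.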

To capture any effect of social relevance, we randomly ranked these URLs and used this random ranking as the baseline. The results in Fig. 4.17 confirm that social relevance can improve ranking results by up to 19% in Google Buzz and 17% in Twitter. The improvement is defined as the difference between the average NDCG of PageRank and that of flow rank, divided by the average NDCG of PageRank (see Fig. 4.17). Interestingly, peers in a community have a stronger effect on ranking quality than friends in Google Buzz. This is consistent with the densities of social relationships in Table 4.4, where 25% of peers have similar interests compared to 9% of friends. For Twitter, the densities in Table 4.5 align with the ranking quality results in Fig. 4.17(b), where the densities of interests among friends and peers are almost identical. Recall that PageRank is calculated using the social network and not the web graph.


Information shared between users in online social networks, such as URLs, provides a unique perspective on the ranking of web pages. In our approach, humans instead of pages are the ones who rank the URLs by sharing them, and the social network of the users instead of the web graph topology is used to propagate the ranking.

First, we collected two large-scale information networks of online users to study how users in these networks share URLs, which impacts the distance between a person and a URL. For instance, researchers in [3] estimated the number of hops between any two pages to be 19 on average, while Milgram estimated that the number of hops between any two people is no more than 6 [87]. Since information propagates differently in social networks, the social structure bounds how far a person is from a shared URL.

Second, we reinterpreted the ranking techniques of PageRank and HITS and proposed to use maximum network flow to personalize the ranking of pages for each individual user. Maximum flow detects the popularity of a shared URL among friends, but popularity does not necessarily reflect endorsement, which could impact ranking because one could share something that was not meant to be positive (e.g., sad news). We expected that each individual would rank the URLs differently, since no two people on a social network are the same. Interestingly, the ranking results of popular URLs using PageRank and HITS are more correlated than those of random URLs, suggesting that the overall view of users on ubiquitous information is more consistent, but everyone has their own opinion in the end. Instead of attempting to socially rank the entire web, we re-ranked a selected set of URLs to make the approach scalable and efficiently executable for search engines. If the size of the web doubles in the next few years, it would not affect our approach, since only the subset of URLs that users shared is actually re-ranked.

Third, experimental results show that personalization can improve ranking quality by up to 19% compared to the baseline and 5% compared to PageRank in Google Buzz. For Twitter, personalization improves ranking quality by up to 17% compared to the baseline, but it is not better than PageRank.

More importantly, we believe that personalizing the ranking is useful for social searching because it provides a mechanism for interaction between the searcher and the sharer, in which the searcher can discuss with the sharer the item relating to a query on a search engine. The item could be, for instance, a new product that the sharer posted on appleinsider.com or a piece of political news on nytimes.com. This potential interaction between the searcher and the sharer is valuable because, in many (though not all) non-technical and social situations, the influence of the sharer on the searcher is stronger than the influence coming from the authorities detected by HITS and PageRank. This feature could be implemented in search engines where pages returned for a given query are re-ranked via social networks if there are pages shared among friends or other associates of the searcher that are related to the query.

CHAPTER 5

SOCIAL SEARCHING EXPERIMENTS

We collected friendship, checkin, and location data from two location-based social media services, Gowalla and FourSquare, that allowed people to use their internet-enabled and sensing-capable smart phones to record and share their current location. Gowalla is no longer operating by itself since it has been integrated into Facebook. Unlike Gowalla, FourSquare does not provide an automated mechanism for collecting publicly shared checkins through its API. We have also collected two additional social networks containing social relationships, Flickr and Last.fm, but without the geographical locations of their users.

The reason for collecting data from these four diverse networks is that we can

directly calculate the hop length of the shortest path between randomly selected

pairs of users and use these path lengths as an estimate for the ground truth in

the small-world experiment. We use Gowalla and FourSquare for the emulation of

the small-world experiment in which knowing geographical distance between users is

essential. Even though the collected data from online social media is not a represen-

tative sample of the entire population, it still provides “one of the best estimates of

social distance”[88] and one of the best environments for analyzing the small-world

experiment at large scale.

Table 5.1: Summaries of online social networks datasets.

Social Network   Number of Users   Number of Edges   Period
Gowalla          154,557           1,139,110         Sept. 11 - Oct. 12
FourSquare       251,621           800,201           Jun. 13 - Aug. 13
Flickr           2,435,257         155,110,479       Jun. 13 - Aug. 13
Last.fm          4,355,516         30,325,890        Jun. 13 - Aug. 13

In Table 5.1, we list the number of users and edges collected for each network over the specified time period. For Gowalla and FourSquare, these numbers refer to the subset of the collected network that remained after data cleaning. In Gowalla, we removed users that did not have any publicly shared checkins. In FourSquare, we kept only users that were successfully geocoded by Google Maps. This subtle difference between Gowalla and FourSquare is important because checkins in Gowalla directly pinpoint users' locations, making connections between users in Gowalla denser than in FourSquare. However, the advantage of FourSquare is that it provides a different perspective on some of the questions being asked, such as the effect of network sparsity on the small-world problem.

Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE (under review).

5.1 Attrition, Geography, & Communities

Let G = (V,E) be a social network where V is the set of users and E is the set

of edges representing undirected relationships among users. The great-circle distance

between two users s and t is denoted as gd(s, t) and estimated based on the users’

self-entered location of residence (FourSquare) or the most-frequent checkin that they

have shared (Gowalla). The network distance between s and t is denoted as nd(s, t)

and defined as the smallest number of hops needed to reach t starting from s.
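The two distances can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the experiments: `gd` uses the haversine formula with a mean Earth radius of 6371 km, locations are assumed to be (latitude, longitude) pairs in degrees, and `nd` assumes the graph is given as an adjacency dictionary.

```python
import math
from collections import deque

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def gd(loc_s, loc_t):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, loc_s)
    lat2, lon2 = map(math.radians, loc_t)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nd(adj, s, t):
    """Smallest number of hops from s to t (breadth-first search); None if unreachable."""
    if s == t:
        return 0
    seen, queue = {s}, deque([(s, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == t:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

# Hypothetical undirected friendship graph: a - b - c, with d isolated.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
print(nd(adj, "a", "c"))
```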

Let A be a community detection algorithm that groups nodes in G into m overlapping clusters denoted as {C1, C2, ..., Cm}. An edge-bridge is an edge e = (u, u′) such that u ∈ Ci and u′ ∈ Cj for i ≠ j. A node-bridge is a node u such that for certain i ≠ j, u ∈ Ci and u ∈ Cj. The stratification graph of G, denoted as S = (sij), is defined as:

\[ s_{ij} = \frac{eb(i, j) + nb(i, j)}{\sum_{k=1}^{m} \bigl( eb(i, k) + nb(i, k) \bigr)} \tag{5.1} \]

where eb(i, j) and nb(i, j) are the numbers of edge-bridges and node-bridges, respectively, connecting communities i and j. We extend the definition of network distance from users to communities, denoted as nd(Ci, Cj), and define it as the smallest number of node- or edge-bridges needed to reach Cj starting from Ci. We later use sij to define the prominence of community Ci.
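One possible way to compute the matrix S of Eq. 5.1 is sketched below. The sketch applies the bridge definitions literally and makes several assumptions that the text does not settle: an edge whose endpoints share a community but also belong to different ones still counts as an edge-bridge for every pair of distinct communities it touches, eb(i, i) and nb(i, i) are taken to be 0, and a community with no bridges gets an all-zero row. The function name and the input layout are hypothetical.

```python
from itertools import combinations

def stratification(edges, communities):
    """Return s[i][j] of Eq. 5.1 for a list of (possibly overlapping) node sets."""
    m = len(communities)
    member_of = {}
    for idx, comm in enumerate(communities):
        for node in comm:
            member_of.setdefault(node, set()).add(idx)

    bridges = [[0] * m for _ in range(m)]  # eb(i, j) + nb(i, j)
    # Node-bridges: a node in both C_i and C_j counts once for that pair.
    for comms in member_of.values():
        for i, j in combinations(sorted(comms), 2):
            bridges[i][j] += 1
            bridges[j][i] += 1
    # Edge-bridges: an edge (u, v) with u in C_i and v in C_j, i != j.
    for u, v in edges:
        pairs = {(min(i, j), max(i, j))
                 for i in member_of.get(u, ())
                 for j in member_of.get(v, ()) if i != j}
        for i, j in pairs:
            bridges[i][j] += 1
            bridges[j][i] += 1

    s = []
    for row in bridges:
        total = sum(row)
        s.append([x / total if total else 0.0 for x in row])
    return s

# Hypothetical toy network: node b is shared by communities 0 and 1.
communities = [{"a", "b"}, {"b", "c"}, {"d"}]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
S = stratification(edges, communities)
print(S[1])
```

Each non-zero row of S sums to 1, so S can be read as the transition matrix of a random walk over communities.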

Fig. 5.1 shows the stratification graph of communities for Gowalla.

5.1.1 Modeling Attrition

Let pk denote the probability of getting from a source to a target in k hops in chains that are of length at least k, and let p denote the probability of dropping out of the experiment for nodes that are not adjacent to a target. Let N denote the number of folders sent, Dk the number of folders delivered to the target at the kth hop, and Ck the number of chains continuing for at least k hops. If participants do not drop out of the experiment, then the number of deliveries in k hops is

\[ E_k = p_k \Bigl( N - \sum_{i=1}^{k-1} E_i \Bigr). \]

The expected number of deliveries for one-hop targets is D1 = N p1, and the number of chains continuing for two or more hops is C2 = N(1 − p1)p. For k > 1, Dk = pk Ck and Ck+1 = Ck(1 − pk)p. In Travers-Milgram's experiment, we know N, Ck, and Dk. Then, the number of deliveries including drops for k > 1 is:

\[ E_k = D_k + \Bigl( N - C_k - \sum_{i=1}^{k-1} E_i \Bigr) p_k. \tag{5.2} \]

Figure 5.1: Stratification graph of communities in Gowalla.

Figure 5.2: Distributions of shortest path lengths & average path lengths. [Panel a): density vs. path length; panel b): average path length; series: GW, FS, FR, LFM, TMO, TMA.]

With these formulas, we can compute the average number of hops taking into account the effect of participants dropping out of the experiment. In the original Travers-Milgram experiment, the reported path length was 6.2; accounting for dropping adds 2 hops, so the path length should be reported as at least 8. An element of novelty here is that we can apply the effect of attrition to our experimental results from the opposite point of view. Suppose a participant does not drop out in our emulations unless it has sent the folder (one at a time) to all of its acquaintances. Since we know Ek, pk, and Ck from the social routing emulations, we can calculate Dk and report the average path length in Dk as a function of the dropping rate.
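The correction for attrition can be illustrated with a short sketch that uses only the relations Dk = pk Ck and Ek = pk(N − Σ Ei) stated in Section 5.1.1. The function names and the toy counts are hypothetical; the numbers are invented for the example and are not the Travers-Milgram data.

```python
def deliveries_without_attrition(N, C, D):
    """C[k-1], D[k-1]: chains still alive / folders delivered at hop k = 1, 2, ..."""
    E = []
    for c_k, d_k in zip(C, D):
        p_k = d_k / c_k if c_k else 0.0  # from D_k = p_k * C_k
        E.append(p_k * (N - sum(E)))     # E_k = p_k * (N - sum_{i<k} E_i)
    return E

def average_path_length(counts):
    """Mean hop count when counts[k-1] folders arrive at hop k."""
    return sum(k * n for k, n in enumerate(counts, start=1)) / sum(counts)

# Invented toy data: 100 folders, chains alive and deliveries at hops 1..3.
N, C, D = 100, [100, 60, 30], [10, 12, 15]
E = deliveries_without_attrition(N, C, D)
print(average_path_length(D), average_path_length(E))
```

For the toy data, E works out to 10, 18, and 36, which agrees with Eq. 5.2 (for example, E2 = 12 + (100 − 60 − 10) × 0.2 = 18). The average path length rises once dropped chains are accounted for, because attrition removes long chains disproportionately.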

5.1.2 Geographical Analysis

In Fig. 5.2(a,b), we compare the shortest path length distributions of the four online social networks (a) and their average shortest path lengths with one standard deviation (b) with two results from Travers and Milgram. The average shortest path lengths with one standard deviation are 4.91 ± 0.78 in Gowalla, 7.74 ± 1.99 in FourSquare, 4.90 ± 0.78 in Flickr, and 5.98 ± 0.99 in Last.fm. The average path length reported by Travers and Milgram is within one standard deviation of the ground truth in FourSquare and Last.fm but not in Gowalla and Flickr, while the average path length adjusted for the impact of attrition is out of range for Last.fm.

Figure 5.3: Densities of geographical distances. [Log-log axes; x-axis: distance d (km); y-axis: density f(d). Panel a) series: Friends, 19/d, 9580/d^2; panel b) series: Friends, 5/d, 4608/d^2.]

In Fig. 5.3(a,b), we plot the probability density (log-scale) as a function of geographical distance (log-scale) between pairs of friends for Gowalla (a) and FourSquare (b). The probability density function f(d) is defined as the fraction of friends whose geographic distance is d ± ε. We fitted the data for each network with two models, one assuming inverse proportionality to the distance, c/d, and the other to the square of the distance, c′/d^2. We found the two constants c and c′ by minimizing the difference between the model and the data:

\[ \min_{c} \Bigl\{ \sum_{d} \frac{(f(d) - c/d)^2}{c/d} \Bigr\}, \qquad \min_{c'} \Bigl\{ \sum_{d} \frac{(f(d) - c'/d^2)^2}{c'/d^2} \Bigr\}. \tag{5.3} \]

For Gowalla, the error is 0.15 for 19.26/d and 1.64 for 9580.42/d^2. For FourSquare, the error is 0.55 for 4.78/d and 16.76 for 4608.35/d^2. The error of c/d is thus about 11 times smaller in Gowalla and about 30 times smaller in FourSquare; in other words, c/d fits the distribution of geographical distances about 20 times better than c′/d^2 on average. An explanation of this difference is that online social media distorts the physical dimension (approximately a 2-dimensional surface) by allowing people from anywhere to establish a connection.
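The minimization in Eq. 5.3 has a closed form, which the following sketch uses. Writing g(d) = 1/d^a, the objective equals Σ f²/(c g) − 2 Σ f + c Σ g, and setting its derivative with respect to c to zero gives c² = Σ (f²/g) / Σ g. The function name and the synthetic data are assumptions made for illustration; the thesis does not state how the minimization was carried out.

```python
import math

def fit_constant(distances, densities, exponent):
    """Minimize sum_d (f(d) - c/d^a)^2 / (c/d^a) over c > 0; return (c, error)."""
    g = [d ** -exponent for d in distances]
    # Closed form from d/dc = 0: c^2 = sum(f^2 / g) / sum(g).
    c = math.sqrt(sum(f * f / gi for f, gi in zip(densities, g)) / sum(g))
    error = sum((f - c * gi) ** 2 / (c * gi) for f, gi in zip(densities, g))
    return c, error

# Synthetic densities that follow 3/d exactly (not the measured data).
ds = [100.0, 200.0, 400.0, 800.0]
fs = [3.0 / d for d in ds]
c1, err1 = fit_constant(ds, fs, exponent=1)
c2, err2 = fit_constant(ds, fs, exponent=2)
print(c1, err1, c2, err2)
```

On data generated from 3/d, the first model recovers c = 3 with zero error while the inverse-square model leaves a positive residual, mirroring the comparison reported for the two networks.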

An even better fit is a model c′′/d^δ with 1 < δ < 2, which means that the social network space is fractal. This observation is in agreement with Liben-Nowell et al. [67]. Kleinberg's theoretical results in [41], which bound the expected delivery time to O(log n), assume that the distribution of distances is d^−2. Hence, the empirical results do not satisfy the assumptions made in the mathematical models, and therefore the theoretical bounds in those models cannot be universally applied.

Table 5.2: Communities detected by GANXiS.

                        Gowalla      FourSquare
Average Size            16.60        15.86
Weighted Average Size   591.81       399.77
Total Communities       10,562       16,495
Avg. Link Density       0.35         0.37
Edge-Bridges            1,004,964    574,748
Node-Bridges            20,137       9,492

5.1.3 Detecting Communities

We selected GANXiS to detect overlapping communities based on its promising

experimental results and the ability to scale to millions of nodes and edges [54]. The

intuition behind GANXiS is that there should be a lot of edges within a community,

and an important feature of GANXiS is that it is able to detect either disjoint or

overlapping communities. This intuition is consistent with the stratified nature of

society in which members within a community such as a family, workplace team,

religious congregation, sport club, etc. are more likely to be connected with each

other than to casual acquaintances. Consequently, once a folder reaches a person

that belongs to the same community as the target, only a few more hops are needed

to reach the target.

In Table 5.2, we listed several measurements of communities detected by GANXiS.

They include the average community size, weighted community size defined as the

average size of a community as observed by each member, the total number of com-

munities detected, the average link density defined as the number of edges inside a

community divided by the maximum number of possible edges, and the number of

edge- and node-bridges.
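The size and density measurements can be sketched as follows. The interpretation of the weighted average size as Σ|C|² / Σ|C| follows from letting every member report the size of its own community; the function name, the input layout (a list of node sets and a list of unique undirected edges), and the choice to skip single-node communities when averaging density are assumptions of this sketch.

```python
def community_stats(communities, edges):
    """Return (average size, member-weighted average size, average link density)."""
    sizes = [len(c) for c in communities]
    avg_size = sum(sizes) / len(sizes)
    # Every member of C observes |C|, so large communities are counted more often.
    weighted_avg = sum(s * s for s in sizes) / sum(sizes)

    densities = []
    for comm in communities:
        n = len(comm)
        if n < 2:
            continue  # density is undefined for a single node (assumption: skip)
        inside = sum(1 for u, v in edges if u in comm and v in comm)
        densities.append(inside / (n * (n - 1) / 2))
    avg_density = sum(densities) / len(densities) if densities else 0.0
    return avg_size, weighted_avg, avg_density

# Hypothetical toy input with two overlapping communities.
communities = [{"a", "b"}, {"b", "c", "d", "e"}]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(community_stats(communities, edges))
```

The gap between the two size averages in Table 5.2 (about 16 versus several hundred) is what this weighting produces when a few very large communities hold most of the members.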

5.2 Experimental Design

There are two strategies that define our emulation of the social routing. The

first one describes the process of routing a folder by defining in each step of routing

which acquaintance of the current folder holder is receiving the folder, and the second

one defines the process of selecting starters and targets.


5.2.1 Routing Strategies

The first routing strategy, denoted as GEOGREEDY in [67], is to pass the folder to the acquaintance who is geographically closest to the target, that is, picking an acquaintance u with the smallest gd(u, t). The second routing strategy, denoted as COMGREEDY, is to pass the folder to the acquaintance who is closest to the target in terms of community distance, that is, picking an acquaintance u with the smallest nd(Cu, Ct). For overlapping communities, COMGREEDY selects the corresponding communities of u and t in such a way that nd(Cu, Ct) is minimized. Such information may not always be available to the current folder holder, so GEOGREEDY is more realistic than COMGREEDY, but the purpose of introducing COMGREEDY is to understand which property of the network, geography or community, is more useful for reaching the target in the majority of cases.

The third routing strategy, denoted as GEOCOM, uses a combination of the knowledge of geography and community when selecting an acquaintance. In GEOCOM, a node gives the highest preference to acquaintances who belong to the same community as the target (i.e., nd(Cu, Ct) = 0), and breaks ties between them by selecting the acquaintance who is geographically closest to the target (i.e., GEOGREEDY). If a node has no acquaintances who belong to the same community as the target, then the node uses GEOGREEDY. An element of novelty in using a combination of geography and community is that it seems to be more realistic than using either one alone.
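The three selection rules can be summarized in a short sketch. The function `next_hop` and its arguments are hypothetical: `gd(u, t)` and `cd(u, t)` stand for the geographic distance and the community distance nd(Cu, Ct), both supplied by the caller, and `candidates` is assumed to hold only acquaintances that have not yet been chosen by the current holder.

```python
def next_hop(strategy, candidates, target, gd, cd):
    """Pick the acquaintance that receives the folder, or None if none is left."""
    if not candidates:
        return None  # routing stops: no acquaintance left to try
    if strategy == "GEOGREEDY":
        return min(candidates, key=lambda u: gd(u, target))
    if strategy == "COMGREEDY":
        return min(candidates, key=lambda u: cd(u, target))
    if strategy == "GEOCOM":
        # Prefer acquaintances in the target's community; break ties by geography.
        same = [u for u in candidates if cd(u, target) == 0]
        return min(same or candidates, key=lambda u: gd(u, target))
    raise ValueError("unknown strategy: " + strategy)

# Hypothetical distances from three acquaintances to target "t".
geo = {"a": 10.0, "b": 50.0, "c": 30.0}  # km
com = {"a": 2, "b": 0, "c": 0}           # community hops
pick = lambda s: next_hop(s, ["a", "b", "c"], "t",
                          lambda u, t: geo[u], lambda u, t: com[u])
print(pick("GEOGREEDY"), pick("GEOCOM"))
```

In the toy data GEOGREEDY picks `a` (closest in km), while GEOCOM restricts the choice to `b` and `c`, which share the target's community, and then picks `c` as the geographically closer of the two.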

In all strategies, routing stops either when the folder has reached the target or when the current holder has no more acquaintances to whom it can pass the folder because all of its acquaintances have already been chosen. If the current holder does not have any acquaintances who belong to the same community as the target, then sij defines the probability of going from Ci to Cj in one step. The implication is that sij influences the routing strategy in GEOGREEDY and GEOCOM but not in COMGREEDY, and this influence does not depend on the target but on how communities are interconnected. Therefore, we define the prominence of community Ci, denoted as λi, as P(Xt = Ci), where Xt denotes the community reached in a random walk at step 0 < t < ∞. The idea is that the more prominent a community, the more likely it is to be reached in a random walk process.
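One way to approximate the steady state of such a random walk is power iteration on the row-stochastic matrix S, sketched below. The function name, the uniform starting distribution, and the stopping rule are assumptions of this sketch; it also assumes the walk is irreducible and aperiodic, since otherwise the iteration may not converge to a unique distribution.

```python
def steady_state(S, iterations=1000, tol=1e-12):
    """Approximate lambda with lambda = lambda * S by repeated multiplication."""
    m = len(S)
    pi = [1.0 / m] * m  # start from the uniform distribution
    for _ in range(iterations):
        nxt = [sum(pi[i] * S[i][j] for i in range(m)) for j in range(m)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical 2-community transition matrix (rows sum to 1).
S = [[0.50, 0.50],
     [0.25, 0.75]]
lam = steady_state(S)
print(lam)
```

For the toy matrix the walk spends one third of its time in the first community and two thirds in the second, so the second community would be the more prominent one.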

Table 5.3: Prominence of individuals and communities.

               Gowalla                    FourSquare
Percentile     PageRank   Steady State    PageRank   Steady State
Top 1%         0.000121   0.005421        0.000065   0.003652
Top 20%        0.000014   0.000134        0.000010   0.000094
60th-80th%     0.000005   0.000036        0.000003   0.000016
40th-60th%     0.000003   0.000021        0.000002   0.000008
Bottom 40%     0.000002   0.000021        0.000001   0.000003

Table 5.4: Experimental results for Gowalla.

Average Number of Hops in Successful Chains
                  GEOGREEDY   COMGREEDY   GEOCOM
Random            29.43       20.57       19.08
nd(Cs, Ct) = 0    5.61        3.61        3.61
nd(Cs, Ct) = 1    26.13       16.06       16.23
nd(Cs, Ct) = 2    27.78       18.71       21.13
nd(Cs, Ct) = 3    29.06       19.76       24.36

Percentage of Successful Chains
                  GEOGREEDY   COMGREEDY   GEOCOM
Random            0.30        0.44        0.50
nd(Cs, Ct) = 0    0.71        0.87        0.91
nd(Cs, Ct) = 1    0.34        0.57        0.58
nd(Cs, Ct) = 2    0.27        0.46        0.44
nd(Cs, Ct) = 3    0.18        0.38        0.27

5.2.2 Starter & Target Selections

First, we select a starter and a target that are separated by a fixed number of communities, i.e., nd(Cs, Ct) = k. For example, when k = 0, the starter s and target t are selected within the same community. Then, we select a target t based on its prominence measured by its PageRank score, and next a random target from a prominent community as measured by the steady state of the random walk on the stratification graph S. The percentiles of individual and community prominence measured by PageRank and by the steady state λi are listed in Table 5.3. Finally, we select starters and targets randomly to mimic the most unbiased way in which participants could be selected for a Travers-Milgram-like experiment.

Table 5.5: Experimental results for FourSquare.

Average Number of Hops in Successful Chains
                  GEOGREEDY   COMGREEDY   GEOCOM
Random            18.19       16.01       16.52
nd(Cs, Ct) = 0    1.93        2.06        1.99
nd(Cs, Ct) = 1    7.81        7.37        6.21
nd(Cs, Ct) = 2    15.36       12.96       12.10
nd(Cs, Ct) = 3    18.02       15.14       15.81

Percentage of Successful Chains
                  GEOGREEDY   COMGREEDY   GEOCOM
Random            0.01        0.22        0.04
nd(Cs, Ct) = 0    0.75        0.86        0.88
nd(Cs, Ct) = 1    0.13        0.51        0.38
nd(Cs, Ct) = 2    0.04        0.37        0.12
nd(Cs, Ct) = 3    0.02        0.28        0.04

5.3 Experimental Results

5.3.1 Selection & Routing Combinations

Table 5.4 contains the experimental results for Gowalla. The upper section of Table 5.4 displays the average number of hops it takes to successfully reach a target using the five selection techniques listed in the left column and the three routing strategies listed in the header row. The lower section of Table 5.4 gives the percentage of successful chains, defined as the number of times the target was reached divided by the number of trials. For each selection process and routing strategy, we ran N = 10^4 trials. The experimental results for FourSquare are displayed in Table 5.5.

Tables 5.4 and 5.5 show that selecting a starter and a target from the same community makes it likely for the target to be reached in a few hops, about 4 hops in Gowalla and 2 hops in FourSquare, with a high success rate of approximately 83% for both networks. The percentage of successful chains decreases as the community distance between the starter and target increases. On average, for community distances ranging from 1 to 3, it takes approximately 22 hops to reach a target with a success rate of 39% for Gowalla, and 12 hops to reach a target with a success rate of 21% for FourSquare. As the community distance between the starter and target increases, the percentage of successful chains decreases to about 19% for Gowalla and 24% for FourSquare.

Figure 5.4: Friends-of-friends knowledge densities. [Panels a) to c): Gowalla; panels d) to f): FourSquare; x-axis: friends-of-friends knowledge density; y-axis: average path length; drop rates: 5%, 15%, 30%.]

Also, Tables 5.4 and 5.5 show that COMGREEDY is much more effective than GEOGREEDY in terms of average path length and percentage of successful chains in both networks. On average, COMGREEDY reaches the target about 8 hops sooner than GEOGREEDY in Gowalla and 2 hops sooner in FourSquare. Moreover, the success rate of COMGREEDY is 18 percentage points higher than that of GEOGREEDY in Gowalla and 26 percentage points higher in FourSquare. Hence, using community distances is more effective at reaching targets than using geographical distances.
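The averages behind this comparison can be reproduced from the table entries. In the sketch below the values are copied from the GEOGREEDY and COMGREEDY columns of Tables 5.4 and 5.5 (rows: Random, then community distance 0 to 3); the dictionary layout and the helper names are assumptions made for the example.

```python
# Rows: Random, nd = 0, nd = 1, nd = 2, nd = 3 (from Tables 5.4 and 5.5).
HOPS = {
    "Gowalla": {"GEOGREEDY": [29.43, 5.61, 26.13, 27.78, 29.06],
                "COMGREEDY": [20.57, 3.61, 16.06, 18.71, 19.76]},
    "FourSquare": {"GEOGREEDY": [18.19, 1.93, 7.81, 15.36, 18.02],
                   "COMGREEDY": [16.01, 2.06, 7.37, 12.96, 15.14]},
}
SUCCESS = {
    "Gowalla": {"GEOGREEDY": [0.30, 0.71, 0.34, 0.27, 0.18],
                "COMGREEDY": [0.44, 0.87, 0.57, 0.46, 0.38]},
    "FourSquare": {"GEOGREEDY": [0.01, 0.75, 0.13, 0.04, 0.02],
                   "COMGREEDY": [0.22, 0.86, 0.51, 0.37, 0.28]},
}

def mean(values):
    return sum(values) / len(values)

def gap(table, network):
    """Mean of the GEOGREEDY column minus mean of the COMGREEDY column."""
    rows = table[network]
    return mean(rows["GEOGREEDY"]) - mean(rows["COMGREEDY"])

for network in ("Gowalla", "FourSquare"):
    print(network, round(gap(HOPS, network), 2), round(-gap(SUCCESS, network), 3))
```

The hop gaps come out at roughly 7.9 and 1.6, and the success-rate gaps at roughly 0.18 and 0.26, which round to the figures quoted in the text.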

5.3.2 Friends-of-Friends Knowledge Densities

To make GEOCOM more realistic, we introduce the probability that the current holder might have some relevant clues about its acquaintances. A possible clue is friends-of-friends knowledge, where a holder might know some of its friends' friends, where they are geographically located, and to which communities they belong. In Fig. 5.4, we plotted the average path length as a function of friends-of-friends knowledge density for Gowalla (a-c) and FourSquare (d-f). The x-axis represents the probability that the current holder knows the geographical location and community information of a friend-of-friend. A value of 0 means the holder only uses its friends to make a routing decision, and a value of 1 means the holder knows all the friends of its friends. In addition, we examined three levels of attrition added into this particular experiment. Subfigures a) and d) refer to a 5% dropping rate, subfigures b) and e) refer to a 15% dropping rate, and subfigures c) and f) refer to a 30% dropping rate. Regions within one standard deviation of the ground truth in terms of average path length are shaded in blue. In Gowalla, results show that with a 5% dropping rate, no level of friends-of-friends knowledge brings the average path length within one standard deviation of the ground truth. However, with a 15% dropping rate, a knowledge level of about 20% is sufficient to come within one standard deviation of the ground truth, and no friends-of-friends knowledge is needed when the drop rate is 30% or higher. In FourSquare, results show that with a 5% dropping rate, no friends-of-friends knowledge is needed to be within one standard deviation of the ground truth, and average path lengths are very short and not within one standard deviation when the dropping rate is 15% or more. The reason for the contrasting behavior is that increasing attrition makes the path length of successful chains smaller than the ground truth (i.e., 5 in Gowalla vs. 8 in FourSquare).

A difference between Gowalla and FourSquare is that Gowalla is much more connected in terms of the density of relationships between nodes. The percentage of successfully found targets is overall higher in Gowalla than in FourSquare. Recall that nodes drop out in the simulations when they do not have any more acquaintances to pass the folder to. Since there are more relationships in Gowalla, participants stay longer in the simulations, which increases the path length of successful chains. In FourSquare, participants have fewer social relationships, so they drop out more quickly; therefore, successful chains are shorter in FourSquare than in Gowalla.

5.3.3 Distributions of Successful Chains

Figure 5.5: Path length of successful chains & drop rates. [Panels a) and c): percentage vs. path length of successful chains for nd = 0, 1, 2, 3 and the ground truth; panels b) and d): average path length vs. drop rate, with the TM drop rate marked.]

In Fig. 5.5, we plotted the distribution of the lengths of successful chains in a) and c) and the modified average path length as a function of the dropping rate in b) and d) for Gowalla and FourSquare, respectively. Results show that it is difficult to find targets when nd > 0, but still the average path length decreases when the dropping rate increases. For instance, the average path length of successful chains with a dropping rate increasing from 0.2 to 0.4 grows on average from 2 to 6 for Gowalla and 2 to 7 for FourSquare. More importantly, the variances of the distributions for nd > 0 are large compared to the ground truth, as seen in a) and c), meaning that some targets are easier to reach than others. This leads us to measure the reachability of a target by examining its individual prominence.

5.3.4 Effects of Hubs and Connectors

Figure 5.6: Effects of routing to connectors & hubs. [Panels a) and b): % of successful chains for GEO, COM, GCOM, CON. Panels c) and d): density vs. path length of successful chains. Panel c) legend: R=80km (28, 72%), R=241 (32, 74%), R=400 (38, 76%), R=563 (42, 78%); panel d) legend: R=80km (36, 12%), R=241 (44, 14%), R=400 (65, 16%), R=563 (75, 17%).]

In Fig. 5.6, we examined the effects of routing the folder to connectors and hubs discussed in the literature [89]. The first experiment is to pass the folder to the connector, defined as the acquaintance who has the highest number of connections to other nodes within the community. Results show an improvement in the delivery rates in Gowalla and FourSquare, as seen in Fig. 5.6(a,b). For this connector experiment, we did not select starters and targets randomly, because connectors would be flooded with requests, making the routing strategy impractical in reality. Perhaps a setting where passing the folder to a connector would not be too unrealistic is when the starter and target are from the same community.

Another setting that would reduce the flooding of requests is selecting a hub within some geographical radius of the target. For this experiment, we modified GEOCOM to incorporate node degree into the routing decision. First, if the holder has multiple friends who belong to the same community as the target, then it breaks ties by selecting the connector. If the connector does not exist, then it selects the group of acquaintances who are within some radius of the target, and selects a hub from this group, defined as the friend who has the highest degree. If the hub does not exist, then it uses GEOGREEDY. As the radius increases by 161 km, the delivery rates for Gowalla and FourSquare increase by approximately 2%, and the average path length of successful chains increases by about 5 hops in Gowalla, as seen in Fig. 5.6(c), and 10 hops in FourSquare, as seen in Fig. 5.6(d).
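The fallback order of this modified routing rule can be sketched as follows. All names are hypothetical: `in_comm_degree(u)` is assumed to return the number of connections u has inside the target's community, `degree(u)` the total number of connections, and `gd` and `cd` the geographic and community distances to the target. How the experiments resolved details such as equal degrees is not stated, so ties here simply fall to the first maximum.

```python
def next_hop_with_hubs(candidates, target, gd, cd, in_comm_degree, degree, radius_km):
    """Connector first, then a hub within radius_km of the target, then GEOGREEDY."""
    if not candidates:
        return None
    # 1. Connector: best-connected acquaintance inside the target's community.
    same = [u for u in candidates if cd(u, target) == 0]
    if same:
        return max(same, key=in_comm_degree)
    # 2. Hub: highest-degree acquaintance within the radius of the target.
    nearby = [u for u in candidates if gd(u, target) <= radius_km]
    if nearby:
        return max(nearby, key=degree)
    # 3. Fallback: geographically closest acquaintance (GEOGREEDY).
    return min(candidates, key=lambda u: gd(u, target))

# Hypothetical acquaintances of the current holder; none shares the target's community.
geo = {"a": 60.0, "b": 75.0, "c": 500.0}
deg = {"a": 12, "b": 40, "c": 300}
choose = lambda radius: next_hop_with_hubs(
    ["a", "b", "c"], "t", lambda u, t: geo[u], lambda u, t: 1,
    lambda u: 0, lambda u: deg[u], radius)
print(choose(80.0), choose(563.0), choose(10.0))
```

With an 80 km radius the hub is `b`; widening the radius to 563 km admits the far-away but better-connected `c`; with a 10 km radius no hub exists and the rule falls back to the closest acquaintance `a`.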

5.3.5 Individual and Community Prominence

Figure 5.7: Prominence of individuals & communities on reachability. [Panels a) and b): average path length vs. PageRank (×10^−5), emulations with linear fit, r = −0.71 in a) and r = −0.44 in b). Panels c) and d): average path length vs. λi (log-scale), emulations with linear fit, r = −0.54 in c) and r = −0.65 in d).]

In Fig. 5.7, we calculated the average path length of finding a target as a function of its PageRank for Gowalla a) and FourSquare b). When the PageRank score increases, the average path length decreases from 16 to 4 in Gowalla and from 15 to 5 in FourSquare. The routing algorithm used in this particular experiment is GEOCOM with an 8% friends-of-friends knowledge level and with starters and targets randomly selected. Hence, results from this experiment show that the small-world property holds for the highly prominent while everyone else is lost in the crowd. In addition, we calculated the average path length of finding a target as a function of its community prominence measured by λi for Gowalla c) and FourSquare d). Results from this experiment also show that targets selected from prominent communities are reached more quickly than targets from non-prominent communities. Correlation coefficients of the linear relationship between prominence and average path lengths are displayed in each subfigure.

Figure 5.8: Prominence of individuals & communities correlations. [Panels a) to c): Gowalla; panels d) to f): FourSquare; x-axis: λi (log-scale); y-axis (log-scale): Sum PageRank, Avg. PageRank, Max. PageRank. Correlation coefficients shown in the panels: r = 0.79, r = 0.95, r = 0.81, r = 0.82, r = 0.61, r = 0.91.]

Finally, we examined the correlation between the individual prominence of targets measured by PageRank and community prominence measured by a random walk process in Fig. 5.8. Results show that these two measurements are highly correlated and consistent in the sense that prominent users are in prominent communities and prominent communities contain prominent users. For each community, we calculated the collective prominence of users measured by the total, average, and maximum PageRank of its users. Subfigures a-c refer to communities in Gowalla and subfigures d-f refer to communities in FourSquare. Each point in a figure is a community, where the x-axis for all subfigures refers to the community prominence and the y-axis refers to the sum PageRank in a) and d), the average PageRank in b) and e), and the maximum PageRank of a community in c) and f). Correlation coefficients of the linear relationship between community and individual prominence are shown in each subfigure.


By analyzing data recently available from location-based social media, we provided three conclusions from our social routing experiments. First, results show that while using geographical and community information in modeling social routing for the small-world problem is more realistic than using either one alone, average path lengths are 3 times longer when attrition is eliminated and not even within two standard deviations of the ground truth, defined as the calculated average shortest path length. Second, COMGREEDY is more effective and robust at reaching targets than GEOGREEDY in terms of average path lengths and percentage of successful chains. It is quite plausible that participants could use COMGREEDY cognitively. For example, a holder can regard an acquaintance whose occupation is mortgage insurance as being ‘closer’ to a commodity broker than a social science teacher is. Third, results from the data show that prominent targets and targets in prominent communities can be reached much more quickly than average. This leads us to ask what the results would be if Travers and Milgram had selected not a broker but a much less prominent target, such as a homeless man. To conclude, our results show that the small-world property holds for the prominent while everyone else is lost in the crowd, except when being reached by members of their own community.

CHAPTER 6

CONCLUSION AND FUTURE WORK

Table 6.1: Aspects of SNA & applications.

                       Geography        Interactions       Communities
Human Mobility         Distance         Communication      Group
Spreading Ideas        Long Ties        Weak Ties          Bridge Ties
Personalized Ranking   Geo. Influence   Peer Influ.        Collective Influ.
Small-world            Selection        Cognitive Biases   Routing

In Chapter 3, we examined human dynamics in online social networks in terms of
geographical proximity, face-to-face interactions, and communities, and found
some valuable insights. For instance, a friendship between two people is more
likely to form when they are close, and friends and friends-of-friends are more
likely to be within geographic proximity, but connections beyond friends-of-friends
are not. Geography limits face-to-face interactions as well as keyword similarity
in terms of what users read on social media. One possible direction for future
research is to investigate social influence as a function of geographical distance.
For instance, if a user checks in at a location, how likely is a friend of that user
to check in at the same location in the future?
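One way the proposed question could be measured is sketched below. It is only an illustration of the future-work idea, under several assumptions that do not come from the thesis: check-ins arrive as `(user, location, time)` tuples, each user has a single home coordinate in `home`, distance between friends is the great-circle distance between homes, and a "follow" means the friend's first check-in at the location happens after the user's first check-in there. All names are hypothetical.

```python
import math
from bisect import bisect_right
from collections import defaultdict


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def follow_probability(checkins, friends, home, bin_edges_km):
    """Fraction of (first visit, friend) pairs where the friend visits later.

    Results are keyed by distance-bin index: bin 0 is below the first edge,
    bin len(bin_edges_km) is at or above the last edge.
    """
    # Earliest check-in time for every (user, location) pair.
    first = {}
    for user, loc, t in checkins:
        key = (user, loc)
        if key not in first or t < first[key]:
            first[key] = t

    trials = defaultdict(int)
    follows = defaultdict(int)
    for (user, loc), t in first.items():
        for friend in friends.get(user, ()):
            friend_t = first.get((friend, loc))
            if friend_t is not None and friend_t <= t:
                continue  # friend already knew the place; not a trial
            b = bisect_right(bin_edges_km, haversine_km(home[user], home[friend]))
            trials[b] += 1
            if friend_t is not None:
                follows[b] += 1
    return {b: follows[b] / trials[b] for b in trials}
```

Plotting the returned fractions against the bin edges would show whether the chance of a follow-up check-in decays with distance between friends.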

Two applications we studied in Chapter 3 are human mobility & congestion
modeling and idea spreading & economic development. Geography shows how friends
are likely to be close in terms of moving together (human mobility). Face-to-face
interactions could be used to establish connections in wireless simulations,
where nodes that interact frequently are more likely to establish a connection,
and communities can be used to simulate a group of nodes moving together. For
idea spreading, geography can be used to measure the length of short and long ties,
face-to-face interactions can be used to measure the strength of ties, and communities
can be used to distinguish between bridge and non-bridge ties.

In Chapter 4, we proposed to personalize the ranking of URLs by using public
information that users share in social media. We incorporated two important
aspects of social networks into the process of ranking URLs: geographical
distance and community structure. Personalized ranking results from three
variants of network flow are highly independent of PageRank, meaning that each
individual has their own unique way of ranking information. Experimental results show
that personalization can improve ranking quality by up to 19% compared to the
baseline and 5% compared to PageRank in Google Buzz. For Twitter, personalization
improves ranking quality by up to 17% compared to the baseline, but it is
not better than PageRank. Future work could calculate the novelty of a
piece of information by examining its keywords [90] and determine the popularity
of information in terms of burstiness [91]. These filters would allow users to see
or filter information on the web through the eyes of the world.

In Chapter 5, results show that average path lengths in social search are 3
times longer when attrition is eliminated and are not even within two standard
deviations of the ground truth. COMGREEDY is more effective at reaching
targets than GEOGREEDY. Also, it is plausible that participants could use COMGREEDY
cognitively. Moreover, prominent targets can be reached much more quickly. The
small-world property holds for the prominent, while everyone else is lost in the
crowd except when being reached by members of their own community. Future work
could incorporate face-to-face interactions for measuring potential cognitive
biases in selecting the next acquaintance. In addition, instead of assuming a fixed
probability for attrition, participants could drop out based on interactions, in the
sense that the next folder holder has a higher chance of participating if he or she
interacts frequently with the previous holder.
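The interaction-dependent attrition suggested above might look like the following sketch. The functional form (a linear increase with a cap) and the parameter values `base`, `gain`, and `cap` are arbitrary placeholders chosen for illustration; they are not taken from the thesis or from any experiment.

```python
import random


def participation_probability(n_interactions, base=0.5, gain=0.1, cap=0.99):
    """Chance that the next holder forwards the folder.

    Grows with the number of interactions between the previous and the
    next holder; base, gain, and cap are hypothetical parameters.
    """
    return min(cap, base + gain * n_interactions)


def chain_survives(path, interactions, rng, **params):
    """Simulate attrition along a routing path (a list of node ids).

    interactions: dict (previous holder, next holder) -> interaction count
    rng: a random.Random instance, passed in so runs are reproducible
    Returns True if every holder on the path participates.
    """
    for prev, nxt in zip(path, path[1:]):
        n = interactions.get((prev, nxt), 0)
        if rng.random() >= participation_probability(n, **params):
            return False  # the next holder drops out; the chain is lost
    return True
```

Setting `gain=0` recovers the fixed attrition probability, so the same simulation can compare both assumptions on identical paths.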

To summarize, this thesis collects terabytes of information that users share on

social networks and analyzes their social dynamics in terms of geography, face-to-face

interactions and community structures.

REFERENCES

[1] J. Kleinberg and S. Lawrence, “The structure of the web,” Sci., vol. 294, no.

5548, pp. 1849-1850, Nov. 2001.

[2] R. Lempel and S. Moran, “SALSA: The stochastic approach for link-structure

analysis,” ACM Trans. Inf. Syst., vol. 19, no. 2, pp. 131-160, Apr. 2001.

[3] R. Albert et al., “The diameter of the world wide web,” Nature, vol. 401, no.

6749, pp. 130-131, Sept. 1999.

[4] T. Berners-Lee et al., “The semantic web,” Sci. Amer., vol. 284, no. 5, pp.

34-43, May 2001.

[5] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search

engine,” Comput. Networks and ISDN Syst., vol. 30, no. 1, pp. 107-117, Apr.

1998.

[6] Z. Gyongyi et al., “Combating web spam with trustrank,” in Proc. 30th Int.

Conf. Very Large Data Bases, Toronto, Canada, 2004, pp. 576-587.

[7] J. Xu and H. Li, “Adarank: A boosting algorithm for information retrieval,”

in Proc. 30th Int. ACM SIGIR Conf. Res. and Develop. in Inform. Retrieval,

Amsterdam, Netherlands, 2007, pp. 391-398.

[8] Y. Liu et al., “Browserank: Letting web users vote for page importance,” in

Proc. 31st Int. ACM SIGIR Conf. Res. and Develop. in Inform. Retrieval,

Singapore, Republic of Singapore, 2008, pp. 451-458.

[9] M. Taylor et al., “Softrank: Optimizing non-smooth rank metrics,” in Proc.

1st Int. Conf. Web Search and Data Mining, Palo Alto, CA, 2008, pp. 77-86.

[10] H. Yan et al., “Architectural design and evaluation of an efficient

web-crawling system,” J. Syst. Softw., vol. 60, no. 3, pp. 185-193, Feb. 2002.

[11] E. Leicht et al., “Large-scale structure of time evolving citation networks,”

Eur. Phys. J. B, vol. 59, no. 1, pp. 75-83, Oct. 2007.

[12] S. Bao et al., “Optimizing web search using social annotations,” in Proc. 16th

Int. Conf. World Wide Web, Alberta, Canada, 2007, pp. 501-510.

[13] J. Davitz et al., “ilink: Search and routing in social networks,” in Proc. 13th

ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Jose,

CA, 2007, pp. 931-940.

[14] D. Carmel et al., “Personalized social search based on the user’s social

network,” in Proc. 18th ACM Conf. Inform. and Knowledge Manage., Hong

Kong, China, 2009, pp. 1227-1236.

[15] D. Horowitz and S. Kamvar, “The anatomy of a large-scale social search

engine,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp.

431-440.

[16] A. Dong et al., “Time is of the essence: Improving recency ranking using

Twitter data,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010,

pp. 331-340.

[17] B. Bahmani and A. Goel, “Partitioned multi-indexing: Bringing order to

social search,” in Proc. 21st Int. Conf. World Wide Web, Lyon, France, 2012,

pp. 399-408.

[18] T. Nguyen and B. Szymanski, “Social ranking techniques for the web,” in

Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and

Mining, Niagara Falls, Canada, 2013, pp. 49-55.

[19] D. Romero et al., “Differences in the mechanics of information diffusion across

topics: Idioms, political hashtags, and complex contagion on Twitter,” in

Proc. 20th Int. Conf. World Wide Web, Hyderabad, India, 2011, pp. 695-704.

[20] A. Ritter et al., “Open domain event extraction from Twitter,” in Proc. 18th

ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Beijing,

China, 2012, pp. 1104-1112.

[21] P. Bogdanov et al., “The social media genome: Modeling individual

topic-specific behavior in social media,” in Proc. IEEE/ACM Int. Conf.

Advances in Social Network Analysis and Mining, Niagara Falls, Canada,

2013, pp. 236-242.

[22] T. Nguyen and B. Szymanski, “Using location-based social networks to

validate human mobility and relationships models,” in Proc. IEEE/ACM Int.

Conf. Advances in Social Network Analysis and Mining (SNAA Workshop),

Istanbul, Turkey, 2012, pp. 1247-1253.

[23] T. Nguyen, M. Chen and B. Szymanski, “Analyzing the proximity and interactions

of friends in communities in Gowalla,” in Proc. IEEE 13th Int. Conf.

Data Mining Workshops, Dallas, TX, 2013, pp. 1036-1044.

[24] L. Backstrom et al., “Four degrees of separation,” in Proc. 4th ACM Int.

Conf. Web Science, Evanston, IL, 2012, pp. 33-42.

[25] Y. Ahn et al., “Analysis of topological characteristics of huge online social

networking services,” in Proc. 16th Int. Conf. World Wide Web, Alberta,

Canada, 2007, pp. 835-844.

[26] H. Kwak et al., “What is Twitter, a social network or a news media?,” in

Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp. 591-600.

[27] A. Mislove et al., “Measurement and analysis of online social networks,” in

Proc. 7th ACM SIGCOMM Conf. Internet Measurement, San Diego, CA,

2007, pp. 29-42.

[28] J. Kleinfeld, “Could it be a big world after all? The ‘six degrees of separation’

myth,” Soc., vol. 39, no. 2, pp. 61-66, Apr. 2002.

[29] J. Travers and S. Milgram, “An experimental study of the small world

problem,” Sociometry, vol. 32, no. 4, pp. 425-443, Dec. 1969.

[30] P. Dodds et al., “An experimental study of search in global social networks,”

Sci., vol. 301, no. 5634, pp. 827-829, Aug. 2003.

[31] D. Watts et al., “Identity and search in social networks,” Sci., vol. 296, no.

5571, pp. 1302-1305, May 2002.

[32] M. Granovetter, Getting a Job: A Study of Contacts and Careers. Chicago, IL:

University of Chicago Press, 1995.

[33] D. Watts, “Networks, dynamics, and the small-world phenomenon,” AJS, vol.

105, no. 2, pp. 493-527, Sept. 1999.

[34] M. Marchiori, “The quest for correct information on the web hyper search

engines,” Comput. Networks and ISDN Syst., vol. 29, no. 8, pp. 1225-1235,

Sept. 1997.

[35] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM,

vol. 46, no. 5, pp. 604-632, Sept. 1999.

[36] C. Warden. (2010, April 22) EdgeRank: The Secret Sauce That Makes

Facebook’s News Feed Tick [Blog]. Available:

http://www.techcrunch.com/2010/04/22/facebook-edgerank/ (Date Last

Accessed, September, 22, 2014).

[37] C. Burges et al., “Learning to rank using gradient descent,” in Proc. 22nd Int.

Conf. Mach. Learning, Bonn, Germany, 2005, pp. 89-96.

[38] T. Joachims, “Optimizing search engines using clickthrough data,” in Proc.

8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining,

Edmonton, Canada, 2002, pp. 133-142.

[39] R. Caruana et al., “Using the future to sort out the present: Rankprop and

multitask learning for medical risk evaluation,” in Proc. Advances in Neural

Inform. Processing Symp., Denver, CO, 1995, pp. 959-965.

[40] K. Crammer and Y. Singer, “Pranking with ranking,” in Proc. Advances in

Neural Inform. Processing Syst., Vancouver, Canada, 2001, pp. 641-647.

[41] J. Kleinberg, “The small-world phenomenon: An algorithmic perspective,” in

Proc. 32nd Ann. ACM Symp. Theory Computing, Portland, OR, 2000, pp. 163-170.

[42] M. Burke et al., “Social capital on Facebook: Differentiating uses and users,”

in Proc. SIGCHI Conf. Human Factors in Computing Syst., Vancouver,

Canada, 2011, pp. 571-580.

[43] A. Mislove et al., “Understanding the demographics of Twitter users,”

presented at 2011 5th Int. AAAI Conf. Weblogs and Social Media, Barcelona,

Spain, 2011.

[44] M. Newman, “Fast algorithm for detecting community structure in networks,”

Phys. Rev. E, vol. 69, no. 6, doi: 10.1103/PhysRevE.69.066133, June 2004.

[45] A. Clauset et al., “Finding community structure in very large networks,”

Phys. Rev. E, vol. 70, no. 6, doi: 10.1103/PhysRevE.70.066111, Dec. 2004.

[46] M. Newman, “Modularity and community structure in networks,” Proc. Nat.

Academy Sci., vol. 103, no. 23, pp. 8577-8582, May 2006.

[47] S. Fortunato and M. Barthelemy, “Resolution limit in community detection,”

Proc. Nat. Academy Sci., vol. 104, no. 1, pp. 36-41, Dec. 2006.

[48] M. Chen et al., “A new metric for quality of network community structure,”

ASE Human J., vol. 1, no. 4, pp. 226-240, 2013.

[49] M. Goldberg et al., “Finding overlapping communities in social networks,” in

Proc. 4th ASE/IEEE Int. Conf. Social Computing, Minneapolis, MN, 2010,

pp. 37-54.

[50] G. Palla et al., “Uncovering the overlapping community structure of complex

networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814-818, Apr.

2005.

[51] M. Sipser, Introduction to the Theory of Computation. Boston, MA: PWS,

1997.

[52] B. Good et al., “The performance of modularity maximization in practical

contexts,” Phys. Rev. E, vol. 81, no. 4, doi: 10.1103/PhysRevE.81.046106,

2010.

[53] M. Newman, “Analysis of weighted networks,” Phys. Rev. E, vol. 70, no. 5,

doi: 10.1103/PhysRevE.70.056131, Apr. 2004.

[54] J. Xie and B. Szymanski, “Towards linear time overlapping community

detection in social networks,” in Proc. 16th Pacific-Asia Conf. Knowledge

Discovery and Data Mining PAKDD, Kuala Lumpur, Malaysia, 2012, pp.

25-36.

[55] S. Fortunato, “Community detection in graphs,” Phys. Rep., vol. 486, no. 3,

pp. 75-174, Feb. 2010.

[56] J. Leskovec et al., “Empirical comparison of algorithms for network

community detection,” in Proc. 19th Int. Conf. World Wide Web, Raleigh,

NC, 2010, pp. 631-640.

[57] R. Kannan et al., “On clusterings: Good, bad and spectral,” J. ACM, vol. 51,

no. 3, pp. 497-515, May 2004.

[58] R. Dunbar, “Neocortex size as a constraint on group size in primates,” J.

Human Evolution, vol. 22, no. 6, pp. 469-493, June 1992.

[59] M. Gonzalez et al., “Understanding individual human mobility patterns,”

Nature, vol. 453, no. 7196, pp. 779-782, June 2008.

[60] E. Boxman et al., “The impact of social and human capital on the income

attainment of Dutch managers,” Social Networks, vol. 13, no. 1, pp. 51-73,

Mar. 1991.

[61] R. Burt, Structural Holes: The Social Structure of Competition. Cambridge,

MA: Harvard University Press, 1992.

[62] R. Reagans and E. Zuckerman, “Networks, diversity, and productivity:

The social capital of corporate R & D teams,” Organ. Sci., vol. 12, no. 4, pp.

502-517, Aug. 2001.

[63] M. Ruef, “Strong ties, weak ties and islands: Structural and cultural

predictors of organizational innovation,” ICC, vol. 11, no. 3, pp. 427-449, Jun.

2002.

[64] R. Burt, “Structural holes and good ideas,” AJS, vol. 110, no. 2, pp. 349-399,

Sept. 2004.

[65] M. Granovetter, “The impact of social structure on economic outcomes,” JEP,

vol. 19, no. 1, pp. 33-50, Dec. 2005.

[66] A. Pentland, Social Physics: How Good Ideas Spread Lessons From a New

Science. London, UK: Penguin Press, 2014.

[67] D. Liben-Nowell et al., “Geographic routing in social networks,” Proc. Nat.

Academy Sci., vol. 102, no. 33, pp. 11623-11628, June 2005.

[68] L. Adamic and E. Adar, “How to search a social network,” Social

Networks, vol. 27, no. 3, pp. 187-203, Jul. 2005.

[69] M. Granovetter, “The strength of weak ties,” AJS, vol. 78, no. 6, pp.

1360-1380, May 1973.

[70] L. Bettencourt et al., “Growth, innovation, scaling, and the pace of life in

cities,” Proc. Nat. Academy Sci., vol. 104, no. 17, pp. 7301-7306, Mar. 2007.

[71] W. Pan et al., “Urban characteristics attributable to density-driven tie

formation,” Nat. Commun., vol. 4, no. 1, doi: 10.1038/ncomms2961.

[72] G. Ghasemiesfeh et al., “Complex contagion and the weakness of long ties in

social networks: revisited,” in Proc. 14th ACM Conf. Electronic Commerce,

Philadelphia, PA, 2013, pp. 507-524.

[73] D. Centola and M. Macy, “Complex contagions and the weakness of

long ties,” AJS, vol. 113, no. 3, pp. 702-734, Nov. 2007.

[74] N. Eagle et al., “Network diversity and economic development,” Sci., vol. 328,

no. 5981, pp. 1029-1031, May 2010.

[75] E. Rogers, Diffusion of Innovations. New York: Free Press, 2003.

[76] (2014, August 27) Gross Domestic Product by State [Online]. Available:

http://www.bea.gov/regional/gsp/ (Date Last Accessed, September, 22,

2014).

[77] (2014, August 27) Patents By Country, State, and Year - Utility Patents

[Online]. Available:

http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cstutl.htm (Date Last

Accessed, September, 22, 2014).

[78] (2014, August 27) Statistics of U.S. Businesses [Online]. Available:

http://www.census.gov/econ/susb/ (Date Last Accessed, September, 22,

2014).

[79] (2014, August 27) Annual Estimates of the Population for the United States,

Regions, States, and Puerto Rico [Online]. Available:

http://www.census.gov/popest/index.html (Date Last Accessed, September,

22, 2014).

[80] (2014, August 27) Census of Population and Housing 2010 [Online]. Available:

https://www.census.gov/prod/www/decennial.html (Date Last Accessed,

September, 22, 2014).

[81] F. Cairncross, The Death of Distance: How the Communications Revolution is

Changing Our Lives. Cambridge, MA: Harvard Business Review Press, 2001.

[82] J. Levandoski et al., “Lars: A location-aware recommender system,” in Proc.

28th Int. Conf. Data Eng., Washington, DC, 2012, pp. 450-461.

[83] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social

networks,” in Proc. 12th Int. Conf. Inform. and Knowledge Manage., New

Orleans, LA, 2003, pp. 556-559.

[84] A. Sarma et al., “A sketch-based distance oracle for web-scale graphs,” in

Proc. 3rd ACM Int. Conf. Web Search and Data Mining, New York, NY,

2010, pp. 401-410.

[85] D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About

a Highly Connected World. Cambridge, England: Cambridge University Press,

2010.

[86] A. Goldberg et al., “Network flow algorithms,” in Paths, Flows, and

VLSI-Design, Berlin, Heidelberg: Springer, 1990, pp. 101-164.

[87] S. Milgram, “The small world problem,” Psychology Today, vol. 2, no. 1, pp.

60-67, May 1967.

[88] S. Schnettler, “A structured overview of 50 years of small-world research,”

Social Networks, vol. 31, no. 3, pp. 165-178, July 2009.

[89] H. P. Thadakamalla et al., “Search in spatial scale-free networks,” New J. Phys.,

vol. 9, no. 6, doi: 10.1088/1367-2630/9/6/190, June 2007.

[90] S. Sreenivasan, “Quantitative analysis of the evolution of novelty in cinema

through crowdsourced keywords,” Scientific Rep., vol. 3, no. 1, doi:

10.1038/srep02758, Apr. 2013.

[91] A. Hoonlor et al., “Trends in computer science research,” Commun. ACM, vol.

56, no. 10, pp. 74-83, Oct. 2013.

[92] V. Lolla et al., “Detecting MAC layer back-off timer violations in mobile ad

hoc networks,” in Proc. 26th IEEE Int. Conf. Distributed Comput. Syst.,

Lisboa, Portugal, 2006, pp. 63-63.

[93] Q. Chen et al., “Overhaul of IEEE 802.11 modeling and simulation in ns-2,”

in Proc. 10th ACM Symp. Modeling, Analysis, and Simulation Wireless and

Mobile Syst., Chania, Greece, 2007, pp. 159-168.

[94] H. Zhang et al., “Bootstrapping deny-by-default access control for mobile

ad-hoc networks,” in IEEE Military Commun. Conf., San Diego, CA, 2008,

pp. 1-7.

[95] J. Broch et al., “A performance comparison of multi-hop wireless ad hoc

network routing protocols,” in Proc. 4th Ann. ACM/IEEE Int. Conf. Mobile

Computing and Networking, Dallas, TX, 1998, pp. 85-97.

[96] P. Erdos and A. Renyi, “On random graphs,” Publ. Math. Debrecen, vol. 6,

no. 1, pp. 290-297, 1959.

[97] F. Simini et al., “A universal model for mobility and migration patterns,”

Nature, vol. 484, no. 7392, pp. 96-100, Apr. 2012.

[98] P. Boldi et al., “Ubicrawler: A scalable fully distributed web crawler,”

Software: Practice and Experience, vol. 34, no. 8, pp. 711-726, July 2004.

[99] T. Camp et al., “A survey of mobility models for ad hoc network research,”

Wireless Commun. and Mobile Computing, vol. 2, no. 5, pp. 483-502, Aug.

2002.

[100] C. Bettstetter et al., “The node distribution of the random waypoint mobility

model for wireless ad hoc networks,” IEEE Trans. Mobile Computing, vol. 2,

no. 3, pp. 257-269, July 2003.

[101] W. Navidi and T. Camp, “Stationary distributions for the random waypoint

mobility model,” IEEE Trans. Mobile Comput., vol. 3, no. 1, pp. 99-108, Jan.

2004.

[102] M. Kurant et al., “Towards unbiased BFS sampling,” Computing Res.

Repository, vol. 29, no. 9, pp. 1799-1809, Oct. 2011.

[103] C. Foh and M. Zukerman, “Performance analysis of the IEEE 802.11 MAC

protocol,” in Proc. Eur. Wireless Conf., Florence, Italy, 2002, pp. 184-190.

[104] S. Geyik et al., “PCFG based synthetic mobility trace generation,” in Proc.

IEEE Global Telecommun. Conf., Miami, FL, 2010, pp. 1-5.

[105] M. Chen et al., “On measuring the quality of a network community

structure,” in Proc. ASE/IEEE Int. Conf. Social Computing, Washington,

DC, 2013, pp. 122-127.

[106] K. Kuzmin et al., “Parallel overlapping community detection with SLPA,” in

Proc. ASE/IEEE Int. Conf. Social Computing, Washington, DC, 2013, pp.

204-212.

[107] D. Wang et al., “Human mobility, social ties, and link prediction,” in Proc.

17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San

Diego, CA, 2011, pp. 1100-1108.

[108] E. Cho et al., “Friendship and mobility: User movement in location-based

social networks,” in Proc. 17th ACM SIGKDD Int. Conf. Knowledge

Discovery and Data Mining, San Diego, CA, 2011, pp. 1082-1090.

[109] T. Razafindralambo and F. Valois, “Performance evaluation of backoff

algorithms in 802.11 ad-hoc networks,” in Proc. 3rd ACM Int. Performance

Evaluation Wireless Ad hoc, Sensor and Ubiquitous Networks, Torremolinos,

Spain, 2006, pp. 82-89.

[110] J. Yoo et al., “Random waypoint considered harmful,” in Proc. 22nd Ann.

Joint Conf. IEEE Comput. and Commun., San Francisco, CA, 2003, pp.

1312-1321.

[111] L. Katzir et al., “Estimating sizes of social networks via biased sampling,” in

Proc. 20th Int. Conf. World Wide Web, Hyderabad, India, 2011, pp. 597-606.

[112] C. Boldrini et al., “Users mobility models for opportunistic networks: The role

of physical locations,” in Proc. Wireless Rural and Emergency Commun.,

Rome, Italy, 2007, pp. 255-267.

[113] X. Hong et al., “A group mobility model for ad hoc wireless networks,” in

Proc. 2nd ACM Int. Workshop Modeling, Analysis and Simulation of Wireless

and Mobile Syst., Seattle, WA, 1999, pp. 53-60.

[114] H. Hsu, Schaum’s Outline of Probability, Random Variables, and Random

Processes. New York: McGraw-Hill, 2010.

[115] A. Langville et al., “Deeper inside pagerank,” Internet Math., vol. 1, no. 3, pp.

335-380, Jan. 2004.

[116] W. Stewart, Introduction to the Numerical Solution of Markov Chains.

Princeton, NJ: Princeton University Press, 1994.

[117] J. A. Rice, Mathematical Statistics and Data Analysis. Stamford, CT:

Cengage Learning, 2006.

[118] A. Banerjee and S. Basu, “A social query model for decentralized search,” in

Proc. 2nd ACM Workshop on Social Network Mining and Analysis, Las Vegas,

NV, 2008.

[119] A. Bozzon et al., “Answering search queries with crowdsearcher,” in Proc. 21st

Int. Conf. World Wide Web, Lyon, France, 2012, pp. 1009-1018.

[120] S. Sahay et al., “Social ranking for spoken web search,” in Proc. 20th ACM

Int. Conf. Inform. and Knowledge Manage., Glasgow, Scotland, 2011, pp.

1835-1840.

[121] A. Agarwal et al., “Learning to rank networked entities,” in Proc. 12th ACM

SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Philadelphia, PA,

2006, pp. 14-23.

[122] S. Chakrabarti et al., “Focused crawling: A new approach to topic-specific web

resource discovery,” Comput. Networks, vol. 31, no. 11, pp. 1623-1640, May

1999.

[123] A. Maiya and T. Wolf, “Expansion and search in networks,” in Proc. 19th

ACM Int. Conf. Inform. and Knowledge Manage., Toronto, Canada, 2010, pp.

239-248.

[124] J. Kleinberg and E. Tardos, Algorithm Design. London, UK: Pearson, 2006.

[125] M. Girvan and M. Newman, “Community structure in social and biological

networks,” Proc. Nat. Academy Sci., vol. 99, no. 12, pp. 7821-7826, Apr. 2002.

[126] J. Xie et al., “Overlapping community detection in networks: The state of the

art and comparative study,” ACM Comput. Surveys, vol. 45, no. 4, Aug. 2013.

[127] S. Scellato et al., “Distance matters: Geo-social metrics for online social

networks,” in Proc. 3rd Conf. Online Social Networks, Boston, MA, 2010, pp.

8.

[128] M. Newman, “Communities, modules and large-scale structure in networks,”

Nature Physics, vol. 8, no. 1, pp. 25-31, Dec. 2011.

[129] L. Backstrom et al., “Find me if you can: Improving geographical prediction

with social and spatial proximity,” in Proc. 19th Int. Conf. World Wide Web,

Raleigh, NC, 2010, pp. 61-70.

[130] S. Scellato et al., “Socio-spatial properties of online location-based social

networks,” in Proc. 5th Int. AAAI Conf. Weblogs and Social Media,

Barcelona, Spain, 2011, pp. 329-336.

[131] M. Allamanis et al., “Evolution of a location-based online social network:

Analysis and models,” in Proc. 2012 ACM Conf. Internet Measurement,

Boston, MA, 2012, pp. 145-158.

[132] S. Adali et al., “Deconstructing centrality: Thinking locally and ranking

globally in networks,” in Proc. IEEE/ACM Int. Conf. Advances in Social

Network Analysis and Mining, Niagara Falls, Canada, 2013, pp. 418-425.

[133] P. Expert et al., “Uncovering space-independent communities in spatial

networks,” Proc. Nat. Academy Sci., vol. 108, no. 19, pp. 7663-7668, Aug.

2011.

[134] M. McPherson et al., “Birds of a feather: Homophily in social networks,” Ann.

Review Sociol., vol. 27, no. 1, pp. 415-444, Aug. 2001.

[135] J. Yang and J. Leskovec, “Defining and evaluating network communities based

on ground-truth,” in Proc. ACM SIGKDD Workshop Mining Data Semantics,

Beijing, China, 2012, pp. 31-38.

[136] M. Deutsch and H. Gerard, “A study of normative and informational social

influences upon individual judgment,” J. Abnormal & Social Psychology, vol.

51, no. 3, pp. 629-36, Sept. 1955.

[137] E. Bulut and B. Szymanski, “Exploiting friendship relations for efficient

routing in mobile social networks,” IEEE Trans. Parallel Distrib. Syst., vol. 23,

no. 12, pp. 2254-2265, Dec. 2012.

[138] M. Cha et al., “A measurement-driven analysis of information propagation in

the flickr social network,” in Proc. 18th Int. Conf. World Wide Web, Madrid,

Spain, 2009, pp. 721-730.

[139] A. Hannak et al., “Measuring personalization of web search,” in Proc. Int.

Conf. World Wide Web, Rio de Janeiro, Brazil, 2013, pp. 527-538.

[140] J. Leskovec and E. Horvitz, “Planetary-scale views on a large

instant-messaging network,” in Proc. 17th Int. Conf. World Wide Web,

Beijing, China, 2008, pp. 915-924.

[141] D. Watts and S. Strogatz, “Collective dynamics of ‘small-world’ networks,”

Nature, vol. 393, no. 6684, pp. 440-442, June 1998.

[142] J. Onnela et al., “Geographic constraints on social network groups,” PloS

One, vol. 6, no. 4, doi: 10.1371/journal.pone.0016939, Apr. 2011.

[143] E. Garfield, “It is a small world after all,” Essays of an Inform. Scientist, vol.

4, no. 43, pp. 299-304, Oct. 1978.

[144] S. Adali et al., “Attentive betweenness centrality (ABC): Considering options

and bandwidth when measuring criticality,” in Proc. ASE/IEEE Int. Conf.

Social Computing, Amsterdam, Netherlands, 2012, pp. 358-367.

[145] E. Daly and M. Haahr, “Social network analysis for routing in disconnected

delay-tolerant MANETs,” in Proc. 8th ACM Int. Symp. on Mobile Ad Hoc

Networking and Computing, Montreal, Canada, 2007, pp. 32-40.

[146] M. Newman, “Models of the small world,” J. Stat. Phys., vol. 101, no. 4, pp.

819-841, Nov. 2000.

[147] B. Uzzi and J. Spiro, “Collaboration and creativity: The small world

problem,” AJS, vol. 111, no. 2, pp. 447-504, Sept. 2005.

[148] N. Hodas and K. Lerman, “The simple rules of social contagion,” Sci. Rep.,

vol. 4, no. 434, doi: 10.1038/srep04343.

[149] L. Bettencourt and G. West, “A unified theory of urban living,” Nature, vol.

467, no. 7318, pp. 912-913, Oct. 2010.
